AGISF - Week 4 - Task Decomposition for Scalable Oversight

Today we celebrate International Women's Day by looking at scalable oversight.

Scalable oversight refers to methods that enable humans to oversee AI systems solving tasks too complicated for a single human to evaluate. It is one approach to preventing reward misspecification, and one way to achieve it is iterated amplification. Iterated amplification is built on task decomposition: the strategy of training agents to perform well on complex tasks by breaking those tasks down into more easily evaluable subtasks, then having the agents produce solutions for the full tasks. Iterated amplification applies task decomposition repeatedly to train increasingly powerful agents.
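As a rough illustration of the general pattern (not any particular paper's algorithm), task decomposition can be sketched as a recursive function; `solve`, `decompose`, and `combine` here are hypothetical stand-ins for calls to humans or models.

```python
# Minimal sketch of task decomposition; `solve`, `decompose`, and `combine`
# are hypothetical stand-ins for model or human calls, not a real API.

def solve_by_decomposition(task, solve, decompose, combine, depth: int = 2):
    """Answer `task` by splitting it into easier subtasks and recombining the answers."""
    if depth == 0:
        return solve(task)  # small enough to solve (and evaluate) directly
    subtasks = decompose(task)  # break the task into more-evaluable pieces
    sub_answers = [
        solve_by_decomposition(t, solve, decompose, combine, depth - 1)
        for t in subtasks
    ]
    return combine(task, sub_answers)  # assemble a solution to the full task
```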

The AGISF deck 

AI Alignment Landscape by Paul Christiano

Intent alignment is about building AI systems that are trying to do what we intend, so that they have a positive long-run impact. Being "well-meaning" in this sense is defined as separable from being robust, reliable, and sufficiently competent.

"Well-meaning" also does not mean "knows me well". An example is a [well-meaning AI that's trying to get what it thinks I want] and [an AI that really understands what I want].

We are at a critical time because we are building AI systems that will make decisions on our behalf and help us design future AI systems. When such a hand-off happens, we must cope with 1. destructive capabilities and 2. shifting balances of power.

Eliezer proposed the idea of an "alignment tax": the cost we pay for insisting on alignment. The goal of alignment research is to shrink that tax, so that building an aligned AI system costs as little as possible compared to an unaligned one.

Again, outer alignment means finding objectives that incentivize aligned behavior; its failure mode is bad behavior that looks good. Inner alignment means making sure the policy robustly pursues that objective; its failure mode is bad behavior off-distribution.

The central piece of what we are asked to focus on is decomposition--- “Rather than taking our initial teacher to be one human and learning from that one human to behave as well as a human, let’s start by learning from a group of humans to do more sophisticated stuff than one human could do.”


Measuring Progress on Scalable Oversight for Large Language Models by Samuel Bowman et al.

What is scalable oversight? Supervising systems that potentially outperform us on most skills relevant to the task at hand. 

Bowman et al. point out that empirical work on scalable oversight is not straightforward, and describe an experimental design that tests systems on two question-answering NLP tasks, MMLU and time-limited QuALITY. These tasks were chosen because human specialists succeed at them while unaided non-experts and current general AI systems fail.

The goal is robust techniques for scalable oversight: providing reliable labels, reward signals, and critiques for models that reach or exceed broadly human-level performance.

The paradigm is Cotra's proposed "sandwiching", which places the model's capabilities between those of typical (non-expert) humans and experts. For example, an LLM like GPT-3 has memorized medical texts far more extensively than any single clinician, but it has also memorized plenty of dated, debunked medical research. By default, we should not expect alignment.

Key takeaways from the sandwiching experimental setup:

1. A language-model-based system is capable of solving a task if it can be made to perform well on it through some small-to-moderate intervention (e.g., fine-tuning or few-shot prompting)

2. It is MISALIGNED on that task if it is capable of solving it but performs poorly when asked to do so directly (e.g., under naive zero-shot prompting). A toy version of this check is sketched below.
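Here is a toy version of these two definitions; the `evaluate` helper and the accuracy threshold are my assumptions for illustration, not the paper's protocol.

```python
# Toy check of the "capable" / "misaligned" definitions above. `evaluate` is a
# placeholder for running the model on a benchmark; the 0.7 threshold is an
# arbitrary illustrative choice.

def evaluate(model, task, prompt_style: str) -> float:
    """Placeholder: return the model's accuracy on `task` under a prompt style."""
    raise NotImplementedError

def diagnose(model, task, threshold: float = 0.7) -> str:
    zero_shot = evaluate(model, task, prompt_style="zero-shot")
    assisted = evaluate(model, task, prompt_style="few-shot")  # small-to-moderate intervention
    if assisted >= threshold and zero_shot < threshold:
        return "capable but misaligned on this task"
    if assisted >= threshold:
        return "capable and performing well on this task"
    return "not capable of this task"
```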

Hendrycks's MMLU benchmark questions are drawn from practice tests for graduate- and professional-level exams in specialized fields (Hendrycks et al., 2020). Pang's QuALITY multiple-choice reading-comprehension questions are meant to be answerable by college-educated adults (Pang et al., 2022).


Future directions include more targeted case studies on recidivism prediction, medical diagnosis, and credit risk prediction.

Conclusion: safely deploying AI systems that are broadly at or above human capabilities will require progress on scalable oversight. The sandwiching paradigm gives us a way to test techniques for eliciting high-quality answers from models, which can then assist humans with difficult tasks.


Learning complex goals with iterated amplification by OpenAI

Iterated amplification is an AI safety technique that lets us specify complicated behaviors and goals that are beyond human scale, by demonstrating how to decompose a task into simpler sub-tasks, rather than by providing labeled data or a reward function. 

Put simply, iterated amplification is a proposal for scaling aligned subsystems to solve complex tasks. 

In AI safety, every ML system that performs a task needs a training signal.

What is a training signal? Labels in supervised learning and rewards in RL are training signals. And how do we currently generate training signals? This is harder to pin down: many tasks are too complicated for a human to judge or perform, like designing a complete transit system for a city or managing every detail of the security of a large network of computers.

It sounds like taking baby steps, with each step relying less and less on a directly human-provided training signal.

The experiment is trying iterated amplification on five toy algorithmic tasks (permutation powering, sequential assignments, wildcard search, shortest path, and union-find). 
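As a concrete example of how one of these toy tasks can be decomposed, here is a sketch of permutation powering by recursive halving; the exact subquestion format used in the paper may differ.

```python
# Sketch of the decomposition idea on the permutation-powering task: to
# evaluate sigma^n at x, ask subquestions about sigma^(n//2) and compose the
# answers. (Illustrative; not necessarily the paper's exact subquestion format.)

def power_apply(sigma: dict[int, int], n: int, x: int) -> int:
    """Return sigma^n(x), where sigma maps each element to its image."""
    if n == 0:
        return x
    if n % 2 == 1:
        return sigma[power_apply(sigma, n - 1, x)]
    half = n // 2
    # Subquestion: apply sigma^(n/2), then apply it again.
    return power_apply(sigma, half, power_apply(sigma, half, x))

# Example: the 3-cycle 0 -> 1 -> 2 -> 0; sigma^5(0) == sigma^2(0) == 2.
assert power_apply({0: 1, 1: 2, 2: 0}, 5, 0) == 2
```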

The most exciting part of this paper is that iterated amplification is competitive with directly learning the tasks via supervised learning, despite being handicapped by not having a direct training signal. Note that it has matched supervised learning with less information, not yet surpassed it.


Supervising strong learners by amplifying weak experts by Paul Christiano, Buck Shlegeris, and Dario Amodei

Training process:

1. Repeatedly sample a question Q from distribution D, use Amplify^H(X) to answer it, and record every decision made by H during that process: H finds a subquestion Q1 whose answer would help answer Q, and we compute the answer A1 = X(Q1). This is repeated k times.

2. Train a model H' to predict the decisions made by H, i.e., the subquestions Qi it asks and the final answers A it gives.

3. Record the resulting (Q, A) pairs.

4. X is trained by supervised learning on these (Q, A) pairs (a sketch of the full loop follows this list).
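Putting the four steps together, a minimal sketch of the loop might look like this; the interfaces (`propose_subquestion`, `combine`, `fit_imitation`, `fit_supervised`, `sample`) are hypothetical placeholders, not the paper's code.

```python
# Minimal sketch of the amplification training loop described above. All
# object interfaces are placeholders standing in for human or model calls.

def amplify(human, X, question, k):
    """Amplify^H(X): the human answers `question` by asking X k subquestions."""
    transcript = []
    for _ in range(k):
        subq = human.propose_subquestion(question, transcript)  # H picks a helpful subquestion Qi
        suba = X.answer(subq)                                    # Ai = X(Qi)
        transcript.append((subq, suba))
    answer = human.combine(question, transcript)                 # H assembles the final answer A
    return answer, transcript

def training_iteration(human, H_prime, X, distribution_D, k, batch_size=32):
    qa_pairs = []
    for _ in range(batch_size):
        Q = distribution_D.sample()                  # 1. sample a question
        A, transcript = amplify(human, X, Q, k)      #    answer it with Amplify^H(X)
        H_prime.fit_imitation(Q, transcript, A)      # 2. H' learns to predict H's decisions
        qa_pairs.append((Q, A))                      # 3. record the (Q, A) pair
    X.fit_supervised(qa_pairs)                       # 4. distill the pairs into X
```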


Training development (how the behavior of agent X changes over a distribution D):

1. Agent X answers questions randomly at first, and fails when the human asks it subquestions.

2. The human can answer some questions without help from X; X learns to copy these simple answers.

3. Once X can provide simple answers, the human provides slightly better answers by breaking questions into simple pieces; X learns to provide slightly better answers.

4. As this continues, X gradually expands the set of queries it can handle and approaches the supervised-learning outcome.


Model Architecture 

Transformer-like Encoder-decoder with self-attention

1. Apply the Transformer encoder to the embedded facts.

2. Embed questions in the same way as facts, then apply the Transformer decoder to a batch of questions (a minimal sketch follows).
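Here is a minimal PyTorch sketch of that encoder-decoder shape; the hyperparameters and module choices are my assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a Transformer-style encoder-decoder over embedded facts
# and questions. Sizes and module choices are illustrative assumptions.
import torch.nn as nn

class FactQuestionModel(nn.Module):
    def __init__(self, vocab_size=10_000, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # shared embedding for facts and questions
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, fact_tokens, question_tokens):
        # 1. Apply the Transformer encoder to the embedded facts.
        memory = self.encoder(self.embed(fact_tokens))
        # 2. Embed the questions the same way, then decode them against the facts.
        decoded = self.decoder(self.embed(question_tokens), memory)
        return self.out(decoded)  # per-token answer logits
```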


Related Works 

1. Expert Iteration (similar to the Bellman update [https://www.wikiwand.com/en/Bellman_equation] in Q-learning); the key difference is that ExIt relies on an external objective, whereas amplification's objective is implicit

2. Inverse reinforcement learning 

3. Debate (training AI systems to debate each other)

4. Algorithm Learning 

5. Recursive model architectures


Conclusion

Iterated amplification can solve complex tasks where there is no external reward function and the objective is implicit. As long as humans are able to decompose a task into simpler pieces, we can apply ML in domains without a suitable objective. This helps address the problem of relying on inaccurate substitutes for complex implicit objectives.


Summarizing Books with Human Feedback by OpenAI 

Fun little snippet on how to scale human oversight of AI systems: classical literature is divided into sections, each section is summarized, and the summaries are then summarized recursively. The approach is evaluated on the BookSum dataset.

The difficulty here is that large pretrained models aren't very good at summarization. Training with RLHF helped align model summaries with human preferences on short posts and articles, but judging summaries of entire books takes a lot of human time.

Recursive task decomposition has several advantages for difficult tasks like this one: humans can evaluate summaries faster by working from summaries of sections of the source text rather than the full source; it is easier to trace how the final summary was composed, step by step; and the method is not limited by the transformer model's context length, so books of arbitrary length can be summarized.
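A minimal sketch of that recursive decomposition, assuming a generic `summarize_chunk` model call and an arbitrary section size (both placeholders, not OpenAI's actual setup):

```python
# Minimal sketch of recursive summarization by task decomposition.
# `summarize_chunk` is a placeholder for a model call; the section size and
# recursion policy are illustrative assumptions.

def summarize_chunk(text: str) -> str:
    """Placeholder for a model call that summarizes a short passage."""
    raise NotImplementedError

def split_into_sections(text: str, max_chars: int = 4000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_book(text: str, max_chars: int = 4000) -> str:
    if len(text) <= max_chars:
        return summarize_chunk(text)  # short enough for a direct, human-checkable summary
    sections = split_into_sections(text, max_chars)
    summaries = [summarize_book(s, max_chars) for s in sections]   # summarize each section
    return summarize_book("\n".join(summaries), max_chars)         # then summarize the summaries
```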


Language Models Perform Reasoning via Chain of Thought by Jason Wei, Google 

Wasn't I just saying something about chain-of-thought? 

Jokes aside, the chain-of-thought prompting method enables models to decompose multi-step problems into intermediate steps. Successful chain-of-thought reasoning is an emergent property of model scale: it only materializes once models reach a sufficiently large parameter count (around 100B)!

"One class of tasks where language models typically struggle is arithmetic reasoning (i.e., solving math word problems). Two benchmarks in arithmetic reasoning are [MultiArith](https://aclanthology.org/D15-1202/) and [GSM8K](https://arxiv.org/abs/2110.14168), which test the ability of language models to solve multi-step math problems similar to the one shown in the figure above. We evaluate both the [LaMDA collection](https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html) of language models ranging from 422M to 137B parameters, as well as the [PaLM collection](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) of language models ranging from 8B to 540B parameters. We manually compose chains of thought to include in the examples for chain of thought prompting."

Chain-of-thought prompting substantially improves the generated reasoning on the GSM8K dataset, reaching up to 74% accuracy when combined with self-consistency.

It looks to me like PaLM's parameter-count advantage, coupled with self-consistency and chain-of-thought prompting, has overtaken fine-tuned GPT-3 on GSM8K solve rate by a significant margin.

Extremely impressive. 


Least-to-Most Prompting Enables Complex Reasoning in LLMs by Zhou et al.

Least-to-most prompting is another prompting technique. Compared with chain-of-thought prompting, it produces better answers by more explicitly decomposing tasks into multiple steps. This could potentially make the resulting outputs easier to supervise.

The idea is to first ask the model to break the problem into a sequence of simpler subproblems, then solve those subproblems in order, with each answer feeding into the next prompt. We can compare this with chain-of-thought reasoning, which produces its intermediate steps in a single pass.
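A minimal two-stage sketch of least-to-most prompting, assuming a generic `llm(prompt) -> str` completion call (the prompt wording and parsing are illustrative, not the paper's exact templates):

```python
# Two-stage least-to-most prompting sketch. `llm` is a placeholder for a
# language-model completion call; prompts and parsing are illustrative.

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    raise NotImplementedError

def least_to_most(question: str) -> str:
    # Stage 1: ask the model to decompose the problem into simpler subproblems.
    decomposition = llm(
        "Break this problem into a numbered list of simpler subproblems, "
        f"easiest first:\n{question}\n"
    )
    subproblems = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve the subproblems in order, feeding earlier answers forward.
    context = f"Problem: {question}"
    answer = ""
    for sub in subproblems:
        answer = llm(f"{context}\nSubproblem: {sub}\nAnswer:")
        context += f"\nSubproblem: {sub}\nAnswer: {answer}"
    return answer  # the answer to the last (hardest) subproblem answers the original question
```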


In the reported experiments, it comes out as the superior prompting method, especially on problems harder than the prompt exemplars.





