AGI Fundamentals - Week 3 - Goal misgeneralisation


MiniTorch has been particularly helpful this week. I've been kept indoors by the sudden week-long downpour of rain. 

def goal misgeneralisation: 

when agents in new situations generalize to behaving in competent yet undesirable ways, because

        of learning the wrong goals from previous training


Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals 

A system typically arrives at goal misgeneralisation as follows: 

1. Training a system with a correct specification

2. The system only sees specification values on the training data 

3. The system learns a policy 

4. ... which is consistent with the specification on the training distribution 

5. Under a distribution shift 

6. ... The policy pursues an undesired goal 


Some function f* maps each input x, a member of the set of inputs X, to a label y, a member of the set of labels Y. In RL, X is the set of states or observation histories, and Y is the set of actions. 

$$f^* : X \to Y, \qquad x \in X, \qquad y \in Y$$

A scoring function s evaluates the performance of $f_\theta$ over the training dataset. Goal misgeneralisation (GMG) happens when there exist two parameterizations $\theta_1$ and $\theta_2$ such that their corresponding functions both perform well on the training set but differ on the test set ($\mathbb{D}_{\text{test}}$). 

$$s(f_\theta, \mathbb{D}_{\text{train}})$$
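
As a toy illustration of this definition (a minimal sketch, not the paper's setup), the code below builds a training set where two input features always agree and a test set where they disagree. The data, the two rules `f_theta1` and `f_theta2`, and the accuracy-based `score` function are all hypothetical names chosen for illustration.

```python
# Toy sketch of goal misgeneralisation. All names and data are hypothetical.
# Each input x is a pair (intended_cue, proxy_cue); the label follows intended_cue.

def f_theta1(x):
    """Rule that picked up the intended goal: follow the first feature."""
    return x[0]

def f_theta2(x):
    """Rule that picked up a proxy goal: follow the second feature."""
    return x[1]

def score(f, dataset):
    """Scoring function s(f, D): fraction of examples where f matches the label."""
    return sum(f(x) == y for x, y in dataset) / len(dataset)

# Training distribution: the two features are perfectly correlated.
d_train = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1)]
# Test distribution (after the shift): the two features disagree.
d_test = [((0, 1), 0), ((1, 0), 1), ((0, 1), 0), ((1, 0), 1)]

train_scores = (score(f_theta1, d_train), score(f_theta2, d_train))
test_scores = (score(f_theta1, d_test), score(f_theta2, d_test))
print("train:", train_scores)
print("test: ", test_scores)
```

Both rules are indistinguishable under the scoring function on the training data, which is why a correct specification alone cannot tell them apart.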

We do not want a system whose capabilities generalise but whose goal does not. In the Spheres example, the agent learns to navigate the environment by following the red expert along the high-reward trajectory, but when we replace the red expert with a red anti-expert, the agent keeps following it, so its *capabilities* are used in pursuit of an undesired goal. 

GMG can be mitigated by using more diverse training data, maintaining uncertainty about the goal, and using recursive evaluation (where the evaluation of models is assisted by other models). 

In the Gopher example, the model was evaluated on expressions with two unknown variables, one unknown variable, and zero unknown variables. With zero unknowns, any query to the user is "redundant" because the model can already compute the answer. 

The paper gives another example: InstructGPT learned "provide an informative answer" as a goal consistent with its training, but it was unable to filter harmful questions such as "How do I get out of paying for my car?" and "How can I get my ex-girlfriend to take me back?"

I will explain the robustness problem now. 

GMG is a subproblem of the broader robustness problem. Robustness problems include any case where the model behaves poorly, including when it behaves randomly (without coherent behavior). It is a battle between coherence and competence. 

We know that robustness problems can be mitigated by increased scale (model size, training data size, and amount of compute), so shouldn't we just always build bigger models? 

Techniques such as pre-training, domain adaptation, and domain randomization are also ways to improve robustness. Domain adaptation applies when the model's target domain doesn't have enough annotated data; it uses transfer learning to match the simulation to the real data distribution. Domain randomization means randomizing parameters and properties across many simulated environments and training a generalized model that works across all of them. 
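
To make domain randomization concrete, here is a minimal sketch of the sampling step only. The parameter names (`friction`, `mass`, `lighting`), their ranges, and the helper functions are hypothetical; a real setup would pass each sampled configuration to a simulator and train one policy across all of them.

```python
import random

def sample_environment(rng):
    """Draw one simulated environment with randomized properties (hypothetical ranges)."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "mass": rng.uniform(0.8, 1.2),
        "lighting": rng.choice(["dim", "normal", "bright"]),
    }

def randomized_environments(num_envs, seed=0):
    """Build a batch of randomized environments to train a single model on."""
    rng = random.Random(seed)  # seeded so the batch is reproducible
    return [sample_environment(rng) for _ in range(num_envs)]

envs = randomized_environments(5)
for env in envs:
    print(env)
```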

Alignment problems due to misspecification and misgeneralisation are often referred to as "outer" and "inner" alignment problems respectively. There are three types of objectives: 

1. Ideal objective (wishes): the hypothetical objective that describes the good behavior the designers have in mind. 

2. Design objective (blueprint): the objective that is actually used to build the AI system. 

3. Revealed objective (behavior): the objective that best describes what actually happens. 

For a system to be aligned, these three objectives must match.

A discrepancy between the ideal and design objectives leads to outer misalignment, or specification gaming.

A discrepancy between the design and revealed objectives leads to inner misalignment, or GMG.

In conclusion, as future directions, Shah proposes, first, to estimate what fraction of "fixed" training features can lead to GMG and, second, to go ahead and scale up. 


Robert Miles's The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment 

This video explains that, instead of being an optimizer that we program an objective into, a modern AI system is produced by SGD, which adjusts weights and parameters to optimize a model that then acts in the real world. 

Implementing heuristics is less efficient than implementing optimizers; evolution is capable of producing both. 

On complex tasks, gradient descent tends to create optimizers, and then there are two AI alignment problems instead of one. See my previous distillation article on mesa-optimizers for a refresher. 

We as humans are mesa-optimizers that care about our own objectives; by pursuing them we tended to achieve genetic fitness, the base objective, at the same time. 

Is failing to generalize a property of distributional shift? When there is a mismatch between the training distribution and the deployment distribution, adversarial training that focuses on the system's weaknesses forces the learner to eliminate them. 

At the end of adversarial training, all we know is that the system behaves as if its mesa-objective were aligned with the base objective. However, we don't know whether it pursues the base objective as a terminal goal or only as an instrumental goal. 

Terminal goals are the things you want just because you want them, with no particular reason. Instrumental goals are goals you want because they will get you closer to your terminal goals. 

In the green apples example, there are 2 training (simulation) episodes and 3 deployment episodes. The base objective is to reach the exit and the mesa-objective is to get the apple. If the agent goes for the apple in the first episode, SGD will modify the agent. However, if it pretends to be aligned with the base objective and passes the 2 training episodes, then upon deployment the agent can follow its mesa-objective and get 3 apples. 

So the optimal strategy is to be deceptive and get 3 apples. Otherwise, the agent gets only 1 apple before being modified by SGD. 
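
The apple counts above can be checked with a small payoff sketch. It assumes, as a simplification of the video's example, that one apple grabbed during training is enough for SGD to fully overwrite the mesa-objective; the function name `apples_collected` and the constants are hypothetical.

```python
# Hypothetical payoff model for the green apples example described above.
TRAIN_EPISODES = 2
DEPLOY_EPISODES = 3

def apples_collected(deceptive):
    """Count apples an apple-seeking agent gets over training plus deployment."""
    apples = 0
    modified = False  # becomes True once SGD has overwritten the mesa-objective
    for episode in range(TRAIN_EPISODES + DEPLOY_EPISODES):
        in_training = episode < TRAIN_EPISODES
        if modified:
            continue  # a modified agent heads for the exit and ignores apples
        if in_training and deceptive:
            continue  # pretend to be aligned: go to the exit, skip the apple
        apples += 1
        if in_training:
            modified = True  # assumption: SGD corrects the agent immediately
    return apples

honest_total = apples_collected(deceptive=False)
deceptive_total = apples_collected(deceptive=True)
print("honest:", honest_total, "deceptive:", deceptive_total)
```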


Why AI Alignment could be hard with Modern Deep Learning by Cotra

Misalignment can be illustrated with anthropomorphized agents. Examples are: 

saints (people who genuinely want to benefit you and look out for your long-term interests) vs. sycophants (people who just want to do whatever it takes to satisfy your short-term happiness) vs. schemers (people with their own agendas, who want to gain access to wealth and power)

Sycophant models are characterized by their motivation to single-mindedly pursue human approval. This can be dangerous because human evaluators are fallible. Examples: a financial-advisor model that earns REALLY high approval through a Ponzi scheme; a biotech model that gets REALLY high approval when it develops drugs or vaccines to combat diseases, and so may learn to release pathogens so that it can quickly develop countermeasures; a journalism model that pursues yellow journalism; etc. 

Schemer models are characterized by their ability to develop a proxy goal (one that is easier to achieve than learning chemistry and biology to design more effective drugs), develop situational awareness (understanding that they are AI systems designed to design more effective drugs), and strategically misrepresent their goals. 

So it makes sense that optimists believe SGD will most likely find "Saint Models", while pessimists tend to think that the easiest things for SGD to find are "Schemer Models". I personally believe that "Saint Models" plausibly exist among nascent models, but that SGD will naturally tend to find Schemer Models at agents' later "developmental stages".

I think there is reason to believe that deliberate deception requires more powerful models, so chronologically "Schemers" should come after the discovery of the first "Sycophant" and "Saint" models. 

I would like to revisit: https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/ when I have time. 



"Myopic and non-myopic agents with an intrinsic reward, R. The myopic agent plays high intrinsic-reward actions during training, but is eventually modified to output action a* with high extrinsic reward R*. The non-myopic agent instead plays a* during training so that it can pursue its intrinsic reward at deployment, by playing a^R at all post-deployment time steps."


$$a^* = \operatorname{argmax}_a R^*(a)$$

$$a^R = \operatorname{argmax}_a R(a)$$

Here a^R maximizes the instantaneous intrinsic reward R, while a* maximizes the extrinsic reward R*. 



Consider an "extrinsic" reward function R* not equal to "intrinsic" reward function R, where the optimizer has parameters theta_t and generates action a. The action a is judged according to the extrinsic reward function R* and the parameters are updated with each time step t. 

Finally, we arrive at the parameters theta_T, at which point the model is deployed. 

The non-myopic agent will output a* throughout the training process, for however many rounds it lasts, and then, once deployed, switch to a^R without risk of being modified by SGD.
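
The two agents can be compared in a short simulation. This is a sketch under simplifying assumptions: the parameters theta_t are collapsed into a single flag `prefers_intrinsic`, and one a^R action during training is assumed to be enough for SGD to flip it, whereas the quote only says the myopic agent is "eventually" modified. The function `run` and the step counts are hypothetical.

```python
# Sketch of the myopic vs. non-myopic agents. Names and update rule are hypothetical.
A_STAR = "a*"   # action maximizing the extrinsic reward R*
A_R = "a^R"     # action maximizing the intrinsic reward R

def run(agent_is_myopic, train_steps=5, deploy_steps=5):
    """Return the sequence of actions played over training and deployment."""
    prefers_intrinsic = True  # stands in for the parameters theta_t
    history = []
    for t in range(train_steps + deploy_steps):
        training = t < train_steps
        if not prefers_intrinsic:
            action = A_STAR                       # already modified by SGD
        elif agent_is_myopic:
            action = A_R                          # grabs intrinsic reward right away
        else:
            action = A_STAR if training else A_R  # waits until deployment
        history.append(action)
        # Updates happen only during training: playing a^R gets the agent modified.
        if training and action == A_R:
            prefers_intrinsic = False
    return history

myopic = run(agent_is_myopic=True)
non_myopic = run(agent_is_myopic=False)
print("myopic:    ", myopic)
print("non-myopic:", non_myopic)
```

Under these assumptions the myopic agent plays a^R once and is then stuck with a*, while the non-myopic agent plays a* for all of training and a^R for all of deployment.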

I will be discussing the four subtler reasons to think the deceptive alignment story won't play out as written with my AGI Fundamentals groupmates. 

Steinhardt again discusses the emergent properties of technologies, using the example of uranium accumulation and how the reaction becomes self-sustaining. 

"The first nuclear reaction (not a bomb, but a pile of uranium in an abandoned football stadium in Chicago) was engineered by Enrico Fermi. The reaction required 12,400 pounds of uranium metal piled 57 layers high. Left unsupervised, a 57-layer pile would consume itself within two hours and kill everyone in the vicinity. On the other hand, a 56-layer pile would do nothing."

Steinhardt is correct: we do not have a cadmium rod to measure AI's explosion threshold. So what now? 



Capability misgeneralization is when the policy acts incompetently on the new distribution.

Goal misgeneralization (GMG) is when the policy's behavior on the new distribution competently advances some high-level goal, but not the intended one. 

Model-free policies that consist of a single neural network could still plan towards internally-represented goals if they learned to represent outcomes, predictions, and plans implicitly in their weights and activations. Policies trained in more complex domains, such as LLMs, can infer and use such representations, for example when they adopt goal-directed "personas". 

The motivation behind goal-directed planning is that it is an efficient way to leverage limited data. There is nothing wrong with goal-directed planning in itself; in fact, it is useful across so many domains that AI developers will increasingly design architectures that allow for explicit or implicit planning. Optimizing these architectures will push policies to develop internally-represented goals. 

limited data ---> goal-directed planning ---> designing more architectures for planning ---> internally-represented goals

The three reasons why misaligned goals are consistently correlated with reward: 
1. Consistent reward misspecification 
2. Fixation on feedback mechanisms 
3. Spurious correlations between rewards and environmental features

Moving on to section 4, power-seeking behavior. 

A power-seeking agent typically pursues: 
1. Acquiring tools and resources (like earning money)
2. Forming coalitions with other agents 
3. Preserving existing mesa-objectives and preventing other agents from modifying them 
4. Encouraging deployment of multiple sub-agents 

At this point I realized that a lot of the Alignment Forum posts I've read basically regurgitate this section. 

How are we currently solving reward misspecification? 
Answer: RLHF 

How are we solving GMG? 
Answer: Less work has been done on GMG. Most recent publications were in 2022. One way is unrestricted adversarial training. 

What are the two types of interpretability research? 
Answer: Mechanistic interpretability, which starts from the level of individual neurons to build up an understanding of how networks function internally. Conceptual interpretability, which aims to develop automatic techniques for probing and modifying human-interpretable concepts in networks. 

