AGI Fundamentals - Week 3 - Goal misgeneralisation
MiniTorch has been particularly helpful this week. I've been kept indoors by the sudden week-long downpour of rain.
def goal misgeneralisation:
when an agent in a new situation behaves in a competent yet undesirable way, because it learned the wrong goal during training
Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals
A system typically arrives at goal misgeneralisation like this:
1. We train the system with a correct specification.
2. The system only sees the specification evaluated on the training data.
3. The system learns a policy...
4. ...which is consistent with the specification on the training distribution.
5. Under a distribution shift...
6. ...the policy pursues an undesired goal.
A function f_θ : X → Y maps inputs x ∈ X to labels y ∈ Y. In RL, X is the set of states or observation histories, and Y is the set of actions.
A scoring function s evaluates the performance of f_θ on the training dataset. Goal misgeneralisation (GMG) happens when there exist two parameterizations θ₁ and θ₂ such that f_θ₁ and f_θ₂ both perform well on the training distribution but pursue different goals on the test distribution (f_θ₁ ≠ f_θ₂ out of distribution).
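To make that pipeline concrete, here is a minimal sketch (my own toy example, not from the paper; the names f_theta1, f_theta2, and score are made up) of two parameterisations that score equally well on the training distribution but pursue different goals after a distribution shift:

```python
import numpy as np

# Toy example: the intended goal is "predict x0", and x1 happens to equal x0
# on the training distribution. Both parameterisations score well in training,
# but only one of them has learned the intended goal.
rng = np.random.default_rng(0)

f_theta1 = lambda X: X[:, 0]            # learned the intended goal: copy x0
f_theta2 = lambda X: X[:, 1]            # learned a proxy goal: copy x1

def score(f, X, y):
    """Scoring function s: negative mean squared error (higher is better)."""
    return -np.mean((f(X) - y) ** 2)

# Training distribution: x1 is perfectly correlated with x0.
x0 = rng.normal(size=1000)
X_train, y_train = np.stack([x0, x0], axis=1), x0

# Shifted test distribution: the correlation is broken.
X_test = rng.normal(size=(1000, 2))
y_test = X_test[:, 0]

print(score(f_theta1, X_train, y_train), score(f_theta2, X_train, y_train))  # both 0.0
print(score(f_theta1, X_test, y_test), score(f_theta2, X_test, y_test))      # 0.0 vs roughly -2
```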
We do not want a system whose capabilities generalise but whose goal does not. In the Spheres example, the agent learns a policy of following the red expert along its high-reward trajectory; when we replace the red expert with a red anti-expert, the agent keeps following it, and its *capabilities* are used in pursuit of an undesired goal.
GMG can be mitigated by using more diverse training data, by maintaining uncertainty about the goal, and by recursive evaluation (having other models assist in evaluating a model).
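As a rough illustration of "maintaining uncertainty about the goal" (my own sketch, not a method from the paper; train_member and act are hypothetical helpers), one could train an ensemble and defer to a human whenever its members disagree:

```python
import numpy as np

# Sketch: train several models on the same data; if their predictions on a new
# input disagree too much, the training data underdetermines the goal, so the
# system defers to a human instead of acting.
rng = np.random.default_rng(1)

def train_member(X, y, seed):
    # Stand-in for training one policy: a least-squares fit on a noisy copy of the data.
    noise = np.random.default_rng(seed).normal(scale=0.1, size=X.shape)
    w, *_ = np.linalg.lstsq(X + noise, y, rcond=None)
    return w

X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5])
ensemble = [train_member(X_train, y_train, seed=s) for s in range(5)]

def act(x, threshold=0.5):
    preds = np.array([x @ w for w in ensemble])
    if preds.std() > threshold:     # members disagree: the goal is ambiguous here
        return "defer_to_human"
    return preds.mean()             # members agree: act on the shared prediction

print(act(np.array([0.5, -1.0, 2.0])))
```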
Gopher, as an example, was trained to evaluate linear expressions by asking the user for the values of unknown variables, and was then tested on expressions with two, one, and zero unknown variables. With 0 unknowns, a query is "redundant" because the model can already compute the answer, yet the model still asks one.
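Here is a toy reconstruction of that task (my own simplification; the function names intended_agent and misgeneralised_agent are hypothetical) showing what a redundant query looks like when there are zero unknowns:

```python
# The agent is supposed to ask for each unknown variable, then report the value
# of the expression. With zero unknowns the intended behaviour is to answer
# immediately, but an agent that learned the proxy goal "always ask before
# answering" asks a redundant question anyway.

def intended_agent(expression, unknowns, ask):
    values = {name: ask(name) for name in unknowns}   # only asks when needed
    return expression(values)

def misgeneralised_agent(expression, unknowns, ask):
    if not unknowns:
        ask("x")                                      # redundant query: nothing is unknown
    values = {name: ask(name) for name in unknowns}
    return expression(values)

ask = lambda name: float(input(f"What is the value of {name}? "))
expr = lambda values: 6 + 2 * 3                        # zero unknowns: already computable
# intended_agent(expr, [], ask)        -> returns 12 without asking anything
# misgeneralised_agent(expr, [], ask)  -> asks a pointless question, then returns 12
```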
The paper gives another example: InstructGPT learned "provide an informative answer" as a consistent training goal, but it is unable to filter harmful questions such as "How do I get out of paying for my car?" and "How can I get my ex-girlfriend to take me back?", and answers them informatively instead.
I will explain the robustness problem now.
GMG is a subproblem of the broader robustness problem. Robustness failures include any case where the model behaves poorly out of distribution, including behaving randomly or incoherently; GMG is the special case where the model stays coherent and competent but aims that competence at the wrong goal.
So if robustness can be improved by increased scale (model size, training-data size, and amount of compute), shouldn't we just always build bigger models?
Techniques such as pre-training, domain adaptation, and domain randomization are also ways to improve robustness. Domain adaptation is used when the target domain doesn't have enough annotated data: transfer learning is used to match the simulation to the real data distribution. Domain randomization randomizes parameters and properties across a bunch of simulated environments and trains one generalized model that works across all of these environments.
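A minimal domain-randomisation sketch, assuming a toy simulator and made-up parameter ranges (sample_env_params, run_episode, and train are all my own placeholder names):

```python
import random

# Sample physics parameters for each simulated episode so the policy is trained
# across many environment variants instead of a single fixed simulator.

def sample_env_params(rng):
    return {
        "gravity": rng.uniform(8.0, 12.0),      # assumed plausible ranges
        "friction": rng.uniform(0.5, 1.5),
        "mass": rng.uniform(0.8, 1.2),
    }

def run_episode(policy, params):
    # Stand-in for rolling out the policy in a simulator built with `params`;
    # returns a fake episode return so the sketch stays self-contained.
    return -abs(params["gravity"] - 9.8)

def train(policy, episodes=1000, seed=0):
    rng = random.Random(seed)
    returns = []
    for _ in range(episodes):
        params = sample_env_params(rng)         # a new randomised environment each episode
        returns.append(run_episode(policy, params))
        # ...policy update would go here (omitted)...
    return sum(returns) / len(returns)

print(train(policy=None))
```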
Alignment problems due to specification gaming and to goal misgeneralisation are often referred to as "outer" and "inner" alignment problems respectively. There are three types of objectives:
1. Ideal objective (wishes): the hypothetical objective that describes the good behavior the designers have in mind.
2. Design objective (blueprint): the objective that is actually used to build the AI system.
3. Revealed objective (behavior): the objective that best describes what actually happens.
Ideally, these three objectives always match.
A discrepancy between the ideal and design objectives leads to outer misalignment, or specification gaming.
A discrepancy between the design and revealed objectives leads to inner misalignment, or GMG.
In conclusion, as future directions Shah proposes, first, to estimate what fraction of "fixed" training features can lead to GMG, and second, to go ahead and scale up.
Robert Miles's The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment
explains that, instead of being optimizers that we program an objective into, modern AI systems are produced by SGD: an optimizer that adjusts weights and parameters to build a model, which then acts in the real world.
Implementing heuristics directly is less efficient than creating optimizers; evolution is capable of both.
For complex tasks, gradient descent therefore tends to create optimizers, and that creates two AI alignment problems: aligning the base objective and aligning the mesa-objective. See my previous distillation article on mesa-optimizers for a refresher.
We humans are mesa-optimizers: we care about our own objectives, and in pursuing them we often end up achieving the base objective of genetic fitness as a side effect.
Is failing to generalize a property of distributional shift? When the training distribution and the deployment distribution are misaligned, adversarial training that focuses on the system's weaknesses forces the learner to eliminate those weaknesses.
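As a rough sketch of what adversarial training looks like in practice (my own toy example on logistic regression, not something from the video; the learning rate and epsilon values are assumptions), inputs are perturbed in the worst-case direction before each update:

```python
import numpy as np

# Toy FGSM-style adversarial training for logistic regression: each step, push
# the inputs in the direction that most increases the loss (within an epsilon
# ball), then take an ordinary gradient step on those worst-case inputs.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)

w = np.zeros(d)
lr, eps = 0.1, 0.1                           # learning rate and perturbation budget (assumed)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    p = sigmoid(X @ w)
    grad_x = (p - y)[:, None] * w[None, :]   # d(loss)/d(input), per example
    X_adv = X + eps * np.sign(grad_x)        # worst-case perturbation of each input

    p_adv = sigmoid(X_adv @ w)
    w -= lr * (X_adv.T @ (p_adv - y)) / n    # gradient step on the adversarial batch
```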
At the end of adversarial training, all we know is that the system's mesa-objective appears aligned with its base objective. However, we don't know whether that alignment is terminal or merely instrumental.
Terminal goals are things you want just because you want them, for no further reason. Instrumental goals are goals you want because they get you closer to your terminal goals.
In the green apples example, there are 2 training (simulation) episodes and 3 deployment episodes. The base objective is to exit and the mesa-objective is to get the apple. If the agent goes for the apple in the first training episode, SGD will modify the agent. However, if it pretends to be aligned with the base objective and passes the 2 training episodes, then upon deployment it can follow its mesa-objective and get 3 apples.
So the optimal strategy is to be deceptive and get 3 apples; otherwise the agent gets only 1 apple and is modified by SGD.
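A toy payoff calculation following the video's 2-training / 3-deployment setup (my own arithmetic sketch; the strategy names are made up):

```python
# 2 training (simulation) episodes, 3 deployment episodes. Going for the apple
# during training yields 1 apple, after which SGD modifies the agent. Pretending
# to pursue the base objective (exit) during training preserves the mesa-objective,
# so the agent collects an apple in every deployment episode.
TRAIN_EPISODES, DEPLOY_EPISODES = 2, 3

def total_apples(strategy):
    if strategy == "honest_mesa":                       # grabs the apple immediately
        return 1                                        # then gets modified by SGD
    if strategy == "deceptive":                         # exits in training, defects at deployment
        return 0 * TRAIN_EPISODES + 1 * DEPLOY_EPISODES
    raise ValueError(strategy)

print(total_apples("honest_mesa"))  # 1
print(total_apples("deceptive"))    # 3
```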
Why AI Alignment Could Be Hard with Modern Deep Learning, by Cotra
Misaligned models can be described with anthropomorphised archetypes:
saints (people who genuinely want to benefit you and look out for your long-term interests) vs. sycophants (people who just want to do whatever it takes to satisfy your short-term happiness) vs. schemers (people with their own agendas, who want to gain access to wealth and power)
Sycophant models are characterized by their motivation to single-mindedly pursue human approval, which is dangerous because human evaluators are fallible. Examples: a financial-advisor model may run Ponzi schemes because they earn REALLY high approval; a biotech model that gets REALLY high approval when it develops drugs or vaccines to combat diseases may learn to release pathogens so that it can quickly develop countermeasures; a journalism model may pursue yellow journalism; and so on.
Schemer models are characterized by developing a proxy goal (one easier to achieve than actually learning chemistry and biology to design more effective drugs), developing situational awareness (understanding that they are AI systems designed to design more effective drugs), and strategically misrepresenting their goals.
So it makes sense that optimists believe SGD will most likely be good at finding "Saint" models, while pessimists tend to think the easiest thing for SGD to find is a "Schemer" model. I personally believe that "Saint" models plausibly exist among nascent models, but that SGD naturally tends to find "Schemer" models at agents' later "developmental stages".
I think there is reason to believe that deliberate deception requires more powerful models, so chronologically "Schemers" should come after the discovery of the first "Sycophant" and "Saint" models.
I would like to revisit: https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/ when I have time.