How likely is deceptive alignment in practice?
What do ML inductive biases look like?
1. High path-dependence:
- Different training runs can converge to very different models depending on the particular path taken through model space
2. Low path-dependence:
- Similar training processes converge to essentially the same, simple solution, regardless of early training dynamics
Deceptive alignment in the high path-dependence world
Suppose our training process is good enough that, for the model to do well, it has to fully understand what we want. This is essentially what you get in the limit of doing enough adversarial training.
How likely SGD is to reach a given model class depends on two questions:
1. How much marginal performance improvement do we get from each step toward that model class?
2. How many steps are needed until the model becomes a member of that class?
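As a toy illustration of these two questions, the sketch below represents a training path as a list of per-step performance gains and reports how many steps the path has and how many of them still yield a meaningful gain. The function name `path_profile`, the `threshold` cutoff, and all numbers are assumptions made up for illustration; none of them come from a measurement or from the original talk.

```python
from typing import List, Tuple


def path_profile(gains: List[float], threshold: float = 0.01) -> Tuple[int, int]:
    """Return (total_steps, meaningful_steps) for a hypothetical training path.

    gains[i] is the assumed marginal performance improvement of step i.
    A step counts as meaningful when its gain is at least `threshold`.
    """
    meaningful = sum(1 for g in gains if g >= threshold)
    return len(gains), meaningful


# Hypothetical paths: a short one with large gains at every step,
# and a long one whose gains halve with each step.
short_path = [0.2, 0.15, 0.1]
long_path = [0.2 * 0.5 ** i for i in range(10)]

print(path_profile(short_path))  # (3, 3)
print(path_profile(long_path))   # (10, 5)
```

Under this framing, a path that needs many steps, most of which improve performance very little, is one that gradient descent seems less likely to follow all the way to the end.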
Types of Alignment
1. Internal alignment: An internally aligned mesa-optimizer is a robustly aligned mesa-optimizer that has internalized the base objective in its mesa-objective.
2. Corrigible alignment: A corrigibly aligned mesa-optimizer is a robustly aligned mesa-optimizer that has a mesa-objective that “points to” its epistemic model of the base objective.
3. Deceptive alignment: A deceptively aligned mesa-optimizer is a pseudoaligned mesa-optimizer that has enough information about the base objective to seem more fit from the perspective of the base optimizer than it actually is.
How likely is internal alignment?
The internal alignment path starts with a proxy-aligned model, which stochastic gradient descent (SGD) then improves step by step.
How likely is corrigible alignment?
1. We start with a proxy-aligned model.
2. In early training, SGD jointly focuses on improving the model's understanding of the world and improving its proxies.
3. The model learns about the training process from its input data. Diminishing returns to better pointers seem to imply that reaching corrigible alignment from here would take many consecutive steps with rapidly diminishing performance improvements.
Simplicity Analysis
M = world_model + optimization_procedure + mesa_objective
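The decomposition above can be sketched as a simple sum of description lengths. The field names mirror the formula; the class name `ModelComplexity`, the unit (bits), and every number are placeholders chosen only to show the bookkeeping, not estimates for real models. The sketch also assumes, purely for illustration, that the two candidate models share the same world model and optimization procedure.

```python
from dataclasses import dataclass


@dataclass
class ModelComplexity:
    """Hypothetical description lengths (in bits) of a model's three parts."""
    world_model: float
    optimization_procedure: float
    mesa_objective: float

    def total(self) -> float:
        # M = world_model + optimization_procedure + mesa_objective
        return self.world_model + self.optimization_procedure + self.mesa_objective


# Two made-up candidates that differ only in their mesa-objective.
a = ModelComplexity(world_model=1000.0, optimization_procedure=200.0, mesa_objective=50.0)
b = ModelComplexity(world_model=1000.0, optimization_procedure=200.0, mesa_objective=5.0)

print(a.total())              # 1250.0
print(a.total() - b.total())  # 45.0
```

When the first two terms are held equal, as assumed here, any difference in total complexity comes entirely from the mesa-objective term.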
Takeaways
Deceptively aligned models have to take the extra step of deducing that they should play along in the training process, so a bias toward faster models would count against them. The problem is that speed bias seems uncompetitive. In both the high and the low path-dependence world, SGD seems likely to favor a deceptive model. We must intervene to change the training dynamics and prevent deception.