Alignment landscape, learned content


How white-box methods might fit into the alignment plan:

1. Model internal access during training and deployment

2. The promise of AI to empower 


Within every research group working on ML models, the workforce can be decomposed into the following categories:

1. Data team (paying humans to generate data points)

2. Oversight team 

3. Training team (running SGD, with RLHF as the algorithm)


RLHF is Reinforcement Learning from Human Feedback. The two main problems with baseline RLHF are oversight and catastrophes. Current proposals that address these problems are:

1. Using AIs to help oversee (oversight)

2. Adversarial training (catastrophes)


After reading Holden Karnofsky's post "How might we align transformative AI if it’s developed very soon?", we can conclude that the remaining problems for current ML models are:

1. Eliciting latent knowledge 

2. It is easier to detect fakes than to produce them. For ChatGPT at least, it is difficult to construct fake inputs that pass for real ones, so adversarial testing becomes a contest between real inputs and fake inputs.


What is explicit chain-of-thought reasoning?

The model reasons rationally and step by step, writing out its intermediate reasoning. GPT-3, for example, does neither of these by default. This is a problem, even for narrow AIs.
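As a rough illustration of the difference, the sketch below builds two prompts for the same question: one asking for the answer directly, and one nudging the model to reason step by step first. The function names and the example question are hypothetical, and no model is actually called; the phrase "Let's think step by step" is one commonly used zero-shot chain-of-thought cue, not the only way to elicit it.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer immediately, with no visible reasoning."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to write out its reasoning before the answer.

    The trailing cue is an assumption for illustration; other phrasings
    or worked few-shot examples serve the same purpose.
    """
    return f"Q: {question}\nA: Let's think step by step."


# Hypothetical example question, made up for this sketch.
question = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much is the ball?"

print(direct_prompt(question))
print()
print(chain_of_thought_prompt(question))
```

The point for oversight is that the second style exposes intermediate steps a human can inspect, whereas the first gives only a final answer.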


Large neural networks may become power-seeking when:

1. Fine-tuning LLMs with widely used RLHF goes awry

2. The AI becomes deceptive and acts to prevent itself from being turned off

3. The system's goals require it to acquire power or resources


What does this mean for technical alignment research?

1. Interpretability 

2. Benchmarking 

3. Process-based supervision (instead of implicit optimization)

4. Scalable oversight (supply a reliable reward or training signal to AI systems that are more capable than their overseers)

5. Eliciting latent knowledge (how can we incentivize a NN to tell us all the facts it knows that are relevant to a decision?)


We need to look more carefully at scaling laws for compute, data, and model size, because the pace of progress has shocked professional forecasters. Metaculus gives a 25% chance of a powerful AI (one that passes a multi-modal Turing test) arriving within 9 years. Technical alignment is still in its early stages and is preparadigmatic: there is no agreed strategy yet. The current alignment approach is the Swiss-cheese approach: combine fallible techniques and hope that their weaknesses cancel out.
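To make the Swiss-cheese intuition concrete, here is a minimal arithmetic sketch. It assumes each technique independently misses a given failure with some probability, so the chance that every layer misses is the product of those probabilities. The layer names and numbers are made up for illustration, and the independence assumption is optimistic: if the techniques share blind spots, the real combined miss rate will be higher.

```python
from math import prod


def combined_miss_probability(miss_probs):
    """Probability that every layer misses, assuming independent failures."""
    miss_probs = list(miss_probs)
    for p in miss_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(miss_probs)


# Hypothetical per-layer miss probabilities (illustrative only, not measured).
layers = {
    "interpretability": 0.5,
    "adversarial_training": 0.4,
    "ai_assisted_oversight": 0.5,
}

overall = combined_miss_probability(layers.values())
print(f"Chance all layers miss: {overall:.2f}")
```

Under these made-up numbers, three individually unreliable layers together miss far less often than any one of them alone, which is the hope behind stacking fallible techniques.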
