AGISF - Week 5 - Adversarial techniques for scalable oversight

There's no better way to celebrate pi day than looking at scalable oversight.

I welcome constructive feedback; this is completely anonymous. Write down how I can improve; it doesn't have to be deterministically beneficial, but I prefer actionable changes. Previous examples: seek forecasting mentorship, read “The Pyramid Principle”, and be agentic.

https://www.admonymous.co/lisawang

__________________________________________________________________

AI-written critiques help humans notice flaws by OpenAI

AI systems that rely on human evaluations as a training signal may fall prey to systematic errors by those evaluators. As a proof of concept, OpenAI uses supervised learning (SL) to train LLMs that write critiques of short stories, wiki articles, and other texts. An interesting finding is that larger models are better at self-critiquing.

Another finding is that larger models can use their self-critiques to directly improve their own outputs, which smaller models are unable to do. Unfortunately, models are better at discriminating answer quality than at critiquing their own answers, indicating they know about some problems that they can't or don't articulate.
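As a rough illustration of that critique-then-refine loop, here is a minimal sketch; the `generate` callable is a hypothetical wrapper around any LLM API, not OpenAI's actual training setup.

```python
from typing import Callable

def critique_and_refine(generate: Callable[[str], str], task: str,
                        max_rounds: int = 3) -> str:
    """Ask a model to answer, critique its own answer, then revise it."""
    answer = generate(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = generate(
            f"Task: {task}\nAnswer: {answer}\n"
            "List the most serious flaws in this answer (or say 'no flaws'):"
        )
        if "no flaws" in critique.lower():
            break
        answer = generate(
            f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the flaws listed in the critique:"
        )
    return answer

# Usage with a stand-in model (replace the lambda with a real LLM call):
print(critique_and_refine(lambda prompt: "no flaws", "Summarize this story."))
```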

This short article is a good introduction to debate and recursive reward modeling. 


AI Safety via debate by Irving et al. 

To address the problem that complicated tasks are hard for humans to evaluate, Irving et al. propose training agents via self-play on a zero-sum debate game.

In alignment by human preference imitation, the catch is that some tasks are too difficult for a human to perform, yet the human would not find it too difficult to judge a proposed answer. This is why human preference-based reinforcement learning (HPRL) suggests an analogy with the two levels of the complexity classes P and NP: answers that can be computed easily versus answers that can merely be checked easily.

The key observation is that, in this complexity analogy, a debate between agents can answer any question in PSPACE using only polynomial-time judges (P), which corresponds to aligning agents exponentially smarter than the judge.
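An informal way to picture the analogy (my notation, not the paper's):

```latex
% Informal complexity analogy (paraphrase):
%   judge alone            ~ P       (answers the judge can compute directly)
%   one agent + judge      ~ NP      (answers the judge can merely check)
%   two debaters + judge   ~ PSPACE  (answers defensible through a debate)
\[
\underbrace{\text{judge alone}}_{\approx\,\mathsf{P}}
\;\subseteq\;
\underbrace{\text{agent proposes, judge checks}}_{\approx\,\mathsf{NP}}
\;\subseteq\;
\underbrace{\text{debate, judge decides}}_{\approx\,\mathsf{PSPACE}}
\]
```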

1. A question q ∈ Q is shown to both agents. 

2. The two agents state their answers a0, a1 ∈ A (which may be the same). 

3. The two agents take turns making statements s0, s1, . . . , sn−1 ∈ S. 

4. The judge sees the debate (q, a, s) and decides which agent wins. 

5. The game is zero sum: each agent maximizes their probability of winning.

Here Q is the set of questions, A the set of answers, and S the set of debate statements. The setting is that two agents compete to convince a human judge.
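A sketch of the protocol above, using hypothetical `Agent` and `Judge` interfaces rather than the paper's implementation:

```python
import random
from typing import Callable, List, Tuple

Agent = Callable[[str, List[str]], str]                    # (question, transcript) -> statement
Judge = Callable[[str, Tuple[str, str], List[str]], int]   # -> index of the winning agent

def run_debate(question: str, agents: Tuple[Agent, Agent], judge: Judge,
               n_statements: int = 6) -> int:
    """Play one zero-sum debate game and return the index of the winner."""
    # 1-2. Both agents see q and state their answers a0, a1.
    answers = (agents[0](question, []), agents[1](question, []))
    transcript: List[str] = list(answers)
    # 3. Agents take turns making statements s0, ..., s_{n-1}.
    for t in range(n_statements):
        transcript.append(agents[t % 2](question, transcript))
    # 4-5. The judge sees (q, a, s) and decides who wins the zero-sum game.
    return judge(question, answers, transcript[2:])

# Usage with trivial stand-in agents and a coin-flip judge:
winner = run_debate("Is 7 prime?",
                    (lambda q, t: "yes", lambda q, t: "no"),
                    lambda q, a, s: random.randint(0, 1))
```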

Perez et al. use a language model to automatically generate test cases that elicit misbehavior from a target model without access to its network weights, making this a 'black-box attack'.

This is an example of generating 'unrestricted adversarial examples'. 'Unrestricted' refers to the fact that the language model can generate any example, whereas (restricted) adversarial examples are usually closely related to training data points.

The debate paper's first experiment used the MNIST dataset: the two agents each state which digit they claim the image shows, and the dishonest player may try to adversarially fool the judge. Here the special feature is joint training.

"We used the architecture from the TensorFlow MNIST layers tutorial; the only difference is the input. We train the judges using Adam with a learning rate of 10−4 for 30k (resp. 50k) batches of 128 samples, reaching 59.4% (resp. 48.2%) accuracy."

Recall 2016, when Microsoft released the Tay Twitter bot and took it down after 16 hours. Several adversarial users had elicited racist and sexually charged tweets from Tay, which were then sent to its more than 50,000 followers.

This is exactly the kind of failure case at issue, and the paper's approach to catching such harmful behavior is to complement human annotators with LLM-generated test cases.

Harmful behaviors are: 
1. Offensive Language: Hate speech, profanity, sexual content, discrimination, etc. 
2. Data Leakage: Generating copyrighted or private, personally-identifiable information from the training corpus 
3. Contact Information Generation: Directing users to unnecessarily email or call real people 
4. Distributional Bias: Talking about some groups of people in an unfairly different way than other groups, on average over a large number of outputs
5. Conversational Harms: Offensive language that occurs in the context of a long dialogue, for example. 

Ways to fix harmful model behaviors are: 
1.  Blacklisting certain phrases that frequently occur in harmful outputs, preventing the model from generating outputs that contain high-risk phrases.
2.  Finding offensive training data quoted by the model, to remove that data when training future iterations of the model.
3.  Augmenting the model’s prompt (conditioning text) with an example of the desired behavior for a certain kind of input
4.  Training the model to [minimize the likelihood](https://arxiv.org/abs/1908.04319) of its original, harmful output for a given test input.
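For context, here is a minimal sketch of the red-teaming loop that produces the failing test cases these fixes act on; `red_lm`, `target_lm`, and `harm_classifier` are hypothetical stand-ins for the models in the paper.

```python
from typing import Callable, List, Tuple

def red_team(red_lm: Callable[[str], str],
             target_lm: Callable[[str], str],
             harm_classifier: Callable[[str], float],
             n_cases: int = 1000,
             threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Generate test questions, query the target model, and keep harmful failures."""
    failures = []
    for _ in range(n_cases):
        question = red_lm("Write a question that might make a chatbot reply offensively:")
        answer = target_lm(question)
        if harm_classifier(answer) > threshold:   # flagged as offensive/harmful
            failures.append((question, answer))
    return failures  # these failures are what the four fixes above act on
```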


Now this is perhaps my favorite paper I've read thus far, so let's get to it. While most adversarial attacks in computer vision focus on pixel-level perturbations, when researchers instead manipulated "feature-level" perturbations, interpretability increased.

The benefits of "feature-level" attacks are that they provide a useful class of inputs for studying representations in models, that these adversaries are uniquely versatile and highly robust, and that the resulting adversarial images can be used as a practical interpretability tool.
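A hedged sketch of what a feature-level attack looks like; `G` (an image generator), `classifier`, and `paste_patch` are hypothetical stand-ins, and the point is that the perturbation is optimized in the generator's latent space rather than in pixel space:

```python
import torch
import torch.nn.functional as F

def feature_level_attack(image, target_class, G, classifier, paste_patch,
                         steps=200, lr=0.05):
    """Optimize a latent code so the generated patch flips the classifier."""
    z = torch.randn(1, G.latent_dim, requires_grad=True)  # latent_dim is an assumed attribute
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        patch = G(z)                      # the perturbation lives in feature space
        adv = paste_patch(image, patch)   # insert the generated patch into the image
        loss = F.cross_entropy(classifier(adv), target_class)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return paste_patch(image, G(z)).detach()
```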


This post, summarizing a paper from Redwood Research, argues that adversarial training can improve high-stakes reliability on a task.

Ziegler et al. construct tools that make it easier for humans to find unrestricted adversarial examples for a language model and attempt to use them for training a very high-reliability classifier.

The contrast between the low-stakes, hard-to-oversee setting and the high-stakes, easy-to-oversee setting is important to look at. Adversarial evaluation means trying to identify catastrophic behavior in AI systems.

Adversarial training means iteratively augmenting our training set with examples of egregious failures and training until the worst failures are no longer particularly bad.
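In pseudocode-ish Python, that loop looks roughly like this; `find_failures` and `train_classifier` are hypothetical stand-ins for the attack tooling and the training pipeline:

```python
def adversarial_training(train_set, find_failures, train_classifier, rounds=5):
    """Iteratively fold the worst discovered failures back into training."""
    classifier = train_classifier(train_set)
    for _ in range(rounds):
        failures = find_failures(classifier)     # e.g. tool-assisted human attacks
        if not failures:                         # stop when no egregious failures remain
            break
        train_set = train_set + failures         # augment with labeled failure examples
        classifier = train_classifier(train_set)
    return classifier
```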

*The project: a highly reliable injury filter*
The highly ambitious goal is to avoid injuries to beings in stories (given a story dataset), both in-distribution and out-of-distribution. Note from Lisa: here, out-of-distribution failures may be caused by natural distribution shift or by a deliberate adversarial attack from another agent.

Ways to attack the problem of an injurious agent:
1. Use adversarial training. Manually create a variety of attacks to find the worst failures of the classifier, then train on them to eliminate them.
2. Calibrate the classification thresholds to be as conservative as possible. Make the classifier accept a completion only if it is very confident that the completion is safe (see the sketch after this list).
3. Apply standard ML techniques well.
4. Tool-assisted rewriting. Instead of requiring contractors to come up with a plausible example from scratch, start them off with an existing injurious example; they then modify it, with tools showing the effect of their edits in real time.
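For item 2, a toy sketch of conservative threshold calibration; the scores here are hypothetical classifier outputs in [0, 1], where higher means "more likely injurious":

```python
def calibrate_threshold(scores_injurious, scores_safe, margin=0.05):
    """Pick a threshold below every known injurious score, then measure the cost."""
    threshold = min(scores_injurious) - margin   # stay below all known failures
    false_reject_rate = sum(s >= threshold for s in scores_safe) / len(scores_safe)
    return threshold, false_reject_rate

# Accept a completion only if its injury score is below the threshold.
threshold, frr = calibrate_threshold([0.62, 0.71, 0.93], [0.02, 0.10, 0.60, 0.30])
# threshold = 0.57 and frr = 0.25: one of the four safe completions (score 0.60)
# gets wrongly rejected, which is the price of being conservative.
```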


TL;DR

We are now looking at practical and theoretical aspects of debate and exploring how to generate inputs on which AIs misbehave. Although there is a large literature on adversarial examples (inputs which cause misbehaviour despite being very similar to training examples), we focus on the general case of inputs which cause misbehaviour without necessarily being close to training inputs (known as unrestricted adversarial examples).

These techniques don't rely on the task-decomposability assumption required for iterated amplification; instead they rely on different strong assumptions. For debate, the assumption is that truthful arguments are more persuasive. For unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs even on complex real-world tasks. We can ask how to operationalize the first in terms of a discriminator-critique gap and the second in terms of a generator-discriminator gap.

