Posts

Showing posts from March, 2023

AGISF - Week 6 - Interpretability

I'm blown away by how cool distill.pub is; it's definitely worth revisiting this spring break.

What is interpretability?

Interpretability aims to develop methods of training capable neural networks to produce human-interpretable checkpoints, such that we know what these networks are doing and where to intervene.

Mechanistic interpretability is a subfield of interpretability which aims to understand networks at the level of individual neurons. After understanding neurons, we can identify how they construct increasingly complex representations, and develop a bottom-up understanding of how neural networks work.

Concept-based interpretability focuses on techniques for automatically probing (and potentially modifying) human-interpretable concepts stored in representations within neural networks.

Feature Visualization (2017) by Chris Olah, Alexander Mordvintsev and Ludwig Schubert

A big portion of feature visualization is answering questions about what a network, or parts of a network, are...

Reverie

Floral Patterns, Edward Denton

I am your Sunday reverie
button nose, slender waist, killer looks
You'll be on guard for Monday blues
workaholics, cynics, intoxicated crooks
You are tossing and turning in your sleep
idolizing my witty remarks as art
I'm messing with your idea of perfection
and every fortification is a futile rampart

Desiderata

In my dreams Ehrmann the orator
would chant his fervent "Desiderata";
And virtuous Desdemona
bathe in the glamour of an inamorata

AGISF - Week 5 - Adversarial techniques for scalable oversight

There's no better way to celebrate Pi Day than looking at scalable oversight.

I welcome constructive feedback; this is completely anonymous. Write down how I can improve; it doesn’t have to be deterministically beneficial. I prefer actionable changes. Previous examples: seek forecasting mentorship, read “The Pyramid Principle”, and be agentic. https://www.admonymous.co/lisawang

__________________________________________________________________

AI-written critiques help humans notice flaws by OpenAI

AI systems that rely on human evaluations as a training signal may fall prey to evaluators who make systematic errors. The proof of concept uses supervised learning (SL) to train LLMs that write critiques of short stories, wiki articles, and other texts. An interesting finding is that larger models are better at self-critiquing. Another is that larger models can directly improve their outputs using self-critiques, which small models are unable to do. Unfortunately, models are better at discriminating than at critiquing...

AGISF - Week 4 - Task Decomposition for Scalable Oversight

Today we celebrate International Women's Day by looking at scalable oversight.

Scalable oversight refers to methods that enable humans to oversee AI systems that are solving tasks too complicated for a single human to evaluate. It is an approach to preventing reward misspecification, and one way to achieve it is iterated amplification. Iterated amplification is built upon the idea of task decomposition: the strategy of training agents to perform well on complex tasks by breaking those tasks down into more easily evaluated subtasks, then having the agents produce solutions for the full tasks. In this way, iterated amplification involves repeatedly using task decomposition to train increasingly powerful agents.

The AGISF deck

AI Alignment Landscape by Paul Christiano

Intent alignment is trying to build AI systems that perform as intended and have a positive, long-run impact; essentially, such a system is robust and reliable while being sufficiently competent. Here we define "wel...

The Poisonmaster

Written on September 25, 2015

The man vanished into a dim alley, which was silent, dark, and menacing. It was no more than a gap between two old terraced houses built many years ago, bleak and uninviting, especially in November. You couldn't see the far end, for a large, heavy oak gate had blocked it off many years ago to stop the thieves and misfits of the area from cutting through to the wastes that lay beyond the rear of the old, boarded-up houses. The walls ran with slime, which covered the now long-forgotten graffitied brickwork. Bold footsteps echoed from the man’s heels, and moths circled above the flickering street light.

The man was husky, large in comparison to the angular alley. His mouth and nose exhaled warm steam as he mumbled incoherent words, eyes darting from empty corridors to ghostly-lit windows. Charcoal strands of hair sat awkwardly on his oversized head. He was presumably 40; his barely visible neck was tied loos...