[…]
One simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing. This has traditionally required humans to manually inspect neurons to figure out what features of the data they represent. This process doesn’t scale well: it’s hard to apply it to neural networks with tens or hundreds of billions of parameters. We propose an automated process that uses GPT-4 to produce and score natural language explanations of neuron behavior, and we apply it to neurons in another language model.
This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations.
How it works
Our methodology consists of running 3 steps on every neuron.
[…]
Step 1: Generate explanation using GPT-4
Given a GPT-2 neuron, generate an explanation of its behavior by showing relevant text sequences and activations to GPT-4.
[…]
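As a rough illustration of this step, the sketch below wires the idea up with the OpenAI Python SDK. The prompt wording, the `format_activation_records` helper, and the `explain_neuron` function are assumptions for illustration, not the released implementation.

```python
# Sketch of Step 1, assuming the OpenAI Python SDK (>=1.0).
# The prompt text and helper functions here are illustrative only.
from openai import OpenAI

client = OpenAI()

def format_activation_records(records):
    """Render (tokens, activations) pairs as tab-separated lines, e.g. 'dog<TAB>8'."""
    blocks = []
    for tokens, activations in records:
        blocks.append("\n".join(f"{t}\t{a}" for t, a in zip(tokens, activations)))
    return "\n\n".join(blocks)

def explain_neuron(activation_records, model="gpt-4"):
    """Ask the explainer model to summarize what makes the neuron fire."""
    prompt = (
        "We're studying a neuron in a language model. Each token is shown with "
        "the neuron's activation (0-10). Describe, in one sentence, what the "
        "neuron is looking for.\n\n"
        + format_activation_records(activation_records)
        + "\n\nExplanation of neuron behavior: this neuron activates for"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=60,
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()
```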
Step 2: Simulate using GPT-4
Simulate what activations the neuron would produce if it behaved as the explanation describes, again using GPT-4.
[…]
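A minimal sketch of the simulation step, again assuming the OpenAI SDK; the `simulate_activations` function, its prompt, and the number parsing are illustrative stand-ins (the released code handles this more efficiently via predicted token probabilities).

```python
# Sketch of Step 2, reusing the same client setup as the Step 1 sketch.
# Given only the explanation, the simulator model guesses an activation
# (0-10) for each token; the parsing below is purely illustrative.
import re

from openai import OpenAI

client = OpenAI()

def simulate_activations(explanation, tokens, model="gpt-4"):
    """Predict per-token activations from the explanation alone."""
    prompt = (
        f"A neuron in a language model activates for: {explanation}\n\n"
        "For each token below, predict the neuron's activation on a 0-10 "
        "scale. Answer with one 'token<TAB>activation' pair per line.\n\n"
        + "\n".join(tokens)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8 * len(tokens),
        temperature=0.0,
    )
    text = response.choices[0].message.content
    # Pull out the numeric predictions; pad with zeros if the model skips tokens.
    predictions = [float(m) for m in re.findall(r"\t\s*(\d+(?:\.\d+)?)", text)]
    predictions += [0.0] * (len(tokens) - len(predictions))
    return predictions[: len(tokens)]
```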
Step 3: Compare
Score the explanation based on how well the simulated activations match the real activations.
[…]
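The comparison can be as simple as correlating simulated and real activations. The sketch below uses the Pearson correlation coefficient as a stand-in for the correlation-based score described in the paper.

```python
# Sketch of Step 3: score an explanation by how well the simulated
# activations track the real ones, using the Pearson correlation
# coefficient as a stand-in for the paper's correlation-based score.
import numpy as np

def score_explanation(real_activations, simulated_activations):
    """Return a correlation score in [-1, 1]; higher means a better explanation."""
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    if real.std() == 0 or sim.std() == 0:
        return 0.0  # no variation to correlate; treat as unexplained
    return float(np.corrcoef(real, sim)[0, 1])

# Example: correlate real activations with the Step 2 simulations.
# score = score_explanation([0, 2, 9, 1], simulate_activations(expl, tokens))
```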
What we found
Using our scoring methodology, we can start to measure how well our techniques work for different parts of the network and try to improve the technique for parts that are currently poorly explained. For example, our technique works poorly for larger models, possibly because their later layers are harder to explain.
[Figure: average explanation score versus size of the explained model (roughly 1e5 to 1e9 parameters); scores range from about 0.02 to 0.12.]
Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:
- Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations.
- Using larger models to give explanations. The average score goes up as the explainer model’s capabilities increase. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
- Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn’t understand. We hope as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.
Source: Language models can explain neurons in language models