We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.
Short Research Notes
Updates
As our team has grown, we have gotten more bandwidth to focus on a wider range of directions. In light of this and our recent results on dictionary learning, our plan over the coming months is to focus on three areas:
In our investigations of attention superposition (see our previous discussion as well as Larson and Greenspan & Wynroe), one toy model we’ve studied has shown geometry which is strongly reminiscent of that in Toy Models of Superposition.
In this model, we imagine that there are several true "attentional features", each captured by a single attention head with sparse attention patterns and a sparse OV circuit. We then train a model with a smaller number of attention heads to represent these attentional features.
Experiment Details: We first generate random (fixed) OV weights for five attention heads. For each sample, we then generate a residual stream across multiple tokens filled with random numbers, as well as a random attention pattern for each head. The attention patterns and OV circuits are both highly sparse. The target output is the result of applying these attention heads to the residual stream, followed by a ReLU nonlinearity. This forms the training data.
The model itself takes as input the five generated attention patterns for each sample and learns a matrix that linearly combines them down into just two attention heads. It uses the same mixing matrix to combine the five OV weights down into just two heads. We apply this model to the residual stream, apply a ReLU afterwards, and compute the MSE loss versus the target output.
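For concreteness, a minimal sketch of this toy setup might look like the following. The dimensions, sparsity levels, and optimizer settings here are illustrative assumptions, not the exact values we used.

```python
# Toy model of attention superposition: five "true" attentional features
# (sparse attention pattern + sparse OV circuit each) compressed into two
# attention heads via a single learned mixing matrix M.
import torch

n_true, n_heads = 5, 2          # true attentional features vs. model heads
d_model, n_tok = 16, 8          # residual stream width, tokens per sample
batch, steps = 256, 5000

def sparse(*shape, p=0.9):
    """Random tensor with roughly a fraction p of entries zeroed out."""
    x = torch.randn(*shape)
    return x * (torch.rand(*shape) > p)

# Fixed, sparse OV circuits for the five true attention heads.
W_OV = sparse(n_true, d_model, d_model, p=0.9)

# The model's only parameter: a mixing matrix from true features to heads.
M = torch.nn.Parameter(torch.randn(n_heads, n_true) * 0.1)
opt = torch.optim.Adam([M], lr=1e-2)

for step in range(steps):
    X = torch.randn(batch, n_tok, d_model)                 # residual stream
    A = sparse(batch, n_true, n_tok, n_tok, p=0.8).abs()   # sparse attention patterns
    A = A / (A.sum(-1, keepdim=True) + 1e-6)               # normalize rows

    # Target: apply each true head (attention pattern, then OV), sum, ReLU.
    head_out = torch.einsum('bhqk,bkd,hde->bhqe', A, X, W_OV)
    target = torch.relu(head_out.sum(1))

    # Model: mix the five attention patterns and the five OV matrices down to
    # two heads with the same matrix M, then apply those two heads.
    A_mix = torch.einsum('mh,bhqk->bmqk', M, A)
    W_mix = torch.einsum('mh,hde->mde', M, W_OV)
    pred = torch.relu(torch.einsum('bmqk,bkd,mde->bqe', A_mix, X, W_mix))

    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The 2 x 5 mixing matrix gives the geometry: plot each true feature as the
# 2D point (M[0, i], M[1, i]) to see how the features are arranged.
print(M.detach())
```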
Results: The main thing we're interested in is the geometry of how the true attentional features are organized. If we train a model with 5 "true attentional features" and two attention heads, we can plot the mixing matrix as a scatter plot, enabling us to directly inspect the geometry. This reveals a striking geometry:
A factor that we found important in generating superposition is ensuring that the OV circuit is sparse: when the OV circuit is dense we tend not to see signs of superposition.
When we began our research into dictionary learning, we initially investigated a problem we expected to be easier than transformer language models: very simple neural networks trained to classify MNIST digits.
This problem might be seen as a kind of minimal example of a real neural network we still don't understand. To the best of our knowledge, there is no mechanistic account of how even very simple neural networks classify MNIST. If one trains a model without a hidden layer – simply doing a generalized linear regression with a softmax on the end – mechanistic analysis is relatively straightforward, with each class being supported or inhibited by different pixels. But as soon as one adds a hidden layer, the problem becomes poorly understood. The resulting hidden layer neurons appear inscrutable.
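To make the "no hidden layer" case concrete, a minimal sketch of the straightforward analysis might look like the following: fit a multinomial (softmax) logistic regression on MNIST and read each class's weight vector directly as a 28×28 map of supporting and inhibiting pixels. The loader, subset size, and iteration count are illustrative choices.

```python
# Minimal sketch: a no-hidden-layer MNIST classifier is just a softmax over a
# linear map, so each class's row of weights is directly interpretable as a
# pixel map (positive entries support the class, negative entries inhibit it).
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0

clf = LogisticRegression(max_iter=200)   # lbfgs solver -> multinomial softmax
clf.fit(X[:10000], y[:10000])            # a subset is enough for illustration

# clf.coef_ has shape (10, 784): row d is the pixel weight map for digit d.
weight_maps = clf.coef_.reshape(10, 28, 28)
print(weight_maps.shape)
```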
We found this problem harder than we expected, and eventually switched to just directly attacking language models (leading to our paper). But along the way we had some interesting results, and also found some interesting challenges.
Our failure to understand MNIST models might be seen as an example of a more general trend: mechanistic interpretability has had little success in reverse engineering MLP layers in small but real neural networks, despite successes in reverse engineering portions of more sophisticated neural networks like InceptionV1 and a 1-layer transformer. One hypothesis is that these challenges have actually been due to superposition, but there might also be other reasons (perhaps small models are just pathological in some way and don't have "crisp abstractions"!). Understanding this seems valuable.
Our Results So Far
In our investigations of dictionary learning, we attempted to recover sparse features from the activations of fully-connected ReLU models trained on MNIST.
We performed hyperparameter scans of dictionary learning with a variety of algorithms on fully-connected MNIST models with 16-64 neurons per layer and 1-3 layers. The resulting features often appeared interpretable, and cleanly mapped to individual digits. In particular, we often saw multiple features corresponding to slightly different ways of drawing each digit, suggesting that the networks learned to use a template-matching strategy. These features also had matching downstream effects on the logits (e.g. a “0” feature enhanced the probability placed on the digit “0”).
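For concreteness, a minimal sketch of the kind of sparse autoencoder we ran on these models' hidden activations is shown below. The dictionary size, L1 coefficient, and optimizer settings are illustrative, and the random `hidden_acts` tensor stands in for activations collected from a trained MNIST MLP.

```python
# Sketch of dictionary learning (sparse autoencoder with an L1 penalty) on the
# hidden activations of a small MNIST MLP.
import torch

d_hidden, d_dict = 32, 256      # MLP hidden width, number of dictionary features
l1_coeff = 1e-3

hidden_acts = torch.randn(10000, d_hidden)   # stand-in for real collected activations

W_enc = torch.nn.Parameter(torch.randn(d_hidden, d_dict) * 0.1)
b_enc = torch.nn.Parameter(torch.zeros(d_dict))
W_dec = torch.nn.Parameter(torch.randn(d_dict, d_hidden) * 0.1)
b_dec = torch.nn.Parameter(torch.zeros(d_hidden))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-3)

for step in range(10000):
    batch = hidden_acts[torch.randint(0, len(hidden_acts), (256,))]
    f = torch.relu(batch @ W_enc + b_enc)          # sparse feature activations
    recon = f @ W_dec + b_dec
    loss = ((recon - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each row of W_dec is a feature direction in the MLP's hidden space; projecting
# it back through the MLP's input weights gives a 28x28 "template" to visualize.
```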
These features seem like one natural possibility for features the model might represent in superposition. If a model without a hidden layer is forced to have a single "general template" for each digit, having many different templates for the different ways a digit can be drawn might be a natural improvement. And because the templates are all mutually exclusive (there is only one actual digit drawn!) they are naturally sparse, which makes it possible to encode them in superposition with minimal interference costs.
Another possibility could have been more composable features describing parts of a digit, such as edges, or lines, or loops at different positions. These are what we expect to see in early layers if we train a conv net! However, we didn’t see anything like this – we only saw features corresponding to different "templates" of digits.
We have some uncertainties about how to interpret these results. On the one hand, we saw clear signs of interpretable features in networks trained on MNIST. On the other hand, there was no clear sign that they were "true features", rather than just clusters of similar data points. For example, one thing we'd hoped to find was an “elbow” in the reconstruction error as a function of sparsity or number of features, but we didn't find this. Moreover, in order to achieve low reconstruction error we either needed enormous numbers of features (comparable to the dataset size) or needed to allow many different features to be active on each example. The most compelling evidence for the features being "real" was that their logit weights corresponded to the correct classes when we multiplied their direction vectors through the model's output weights (as mentioned above).
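To make that check concrete, here is a minimal sketch with random stand-in matrices in place of the real autoencoder decoder and MNIST output weights.

```python
# Logit-weight sanity check: push each dictionary feature's decoder direction
# through the model's output weights and see which class it promotes. W_dec and
# W_out are random placeholders for the trained matrices.
import torch

d_hidden, d_dict, n_classes = 32, 256, 10
W_dec = torch.randn(d_dict, d_hidden)        # feature directions in hidden space
W_out = torch.randn(d_hidden, n_classes)     # MNIST model's output weights

logit_weights = W_dec @ W_out                # (d_dict, n_classes)
preferred_class = logit_weights.argmax(dim=-1)
# For a feature that looks like a template of a "3", we'd expect its
# preferred_class to be 3.
```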
This pattern of finding interpretable features but not finding clear evidence that they're the "true features" or that we've "found all of them" continues in our new work. In some ways, this seems more expected in light of our findings about feature splitting. But an important open question remains whether we can find evidence for a "strong superposition hypothesis", or whether features are just a pragmatically useful abstraction.
In Towards Monosemanticity, we highlighted scaling dictionary learning to larger models as a key direction of future work. We have since applied dictionary learning to the MLP activations of an 8-layer transformer, and identified a number of interesting features (some examples below).
Broadly, what we see is that the features in the final layer look quite similar to the ones we saw in a 1-layer transformer. These fire in simple situations, often on single tokens, and predict plausible next tokens.
In earlier layers we see more abstract features, which activate on e.g. text with strong emotional valence. Very qualitatively, our sense is that layers 4-6 contain the most high-level features, though we have not done a careful census of these. This is consistent with the findings in Softmax Linear Units that the interpretable neurons in the middle layers of large models tend to fire on more abstract/high-level concepts than in early/late layers.
While an 8-layer transformer is still quite small compared with frontier models, it is a promising sign that we continue to see interpretable features in models with more than one layer (consistent with the findings of Smith 2023 on a 6-layer transformer).
These are some examples of the features we found. Note that we were preferentially looking for potentially safety-relevant features, making this very much not a random sample of the features in this model.
Our confidence in these descriptions is moderate, though lower than for the features we inspected in detail in Towards Monosemanticity. We would encourage viewing these labels in the spirit of a biologist’s field notes, rather than as rigorously established findings.
| Layer | Features Found While Searching for Safety-Relevant Features | Directionally Safety-Relevant Features |
|---|---|---|
| 2 | Single-token "I" in prose about sex/romance. | Single-token "with", preferentially in emotionally-salient contexts or describing ways someone shows emotion. |
| 3 | Single-token ":" in news headings. | Single-token "As" in violent/dramatic/emotionally charged scenes. |
| 4 | Single-token "." demarcating a sharp transition in context. | Activates on causes of death. |
| 5 | Newline before another newline in historical contexts. | Single-token "." in sentences expressing hatred. |
| 6 | Descriptions of physical spaces. | Strong negative emotions. |
| 7 | LaTeX math mode. | Single-token "," in emotionally charged settings, predicting words indicating emotional state. |
As mentioned above, we've used dictionary learning to extract interpretable features from an 8-layer model. This was part of a broader effort to scale up dictionary learning, and we've been able to scale it significantly further, out to models with billions of parameters.
We now have preliminary features in some of these models, and remarkably see that many of them are multilingual. That is, we see features that fire for the same concept across many different languages.
For example, we have found a feature that fires on the concept of uniqueness in many languages, including phrases such as:
We have also found multilingual features firing on:
Finally, we found some features which are not obviously multilingual but do seem interesting in their own right:
Our views on ghost grads have changed. See the March 2024 update.
Dead neurons in the sparse autoencoder cost us compute and don’t provide any benefit, so we’d like to make them live again. In our previous paper, we used a method we called "neuron resampling" to address dead neurons. Since then, we've found a new method which works significantly better – Ghost Grads.
The method is to calculate an additional term that we add to the loss. This term is calculated by:
1. Computing the autoencoder's reconstruction residual (the part of the input that the live neurons fail to explain).
2. Computing the output of the dead neurons alone, using an exponential activation function in place of the ReLU.
3. Scaling this dead-neuron output so that its magnitude is smaller than that of the residual.
4. Computing a reconstruction (MSE) loss between the scaled dead-neuron output and the residual, and adding it to the overall loss.
This procedure is a little convoluted, but the intuition here is that we want to get a gradient signal that pushes the parameters of the dead neurons in the direction of explaining the autoencoder residual, as that’s a promising place in parameter space to add more live neurons.
Note that in step (2) we use an exponential activation function because it has the joint properties of (a) being positive and (b) having a positive gradient with respect to the pre-activation. Together, these mean that the model can only increase the magnitude of the dead neuron activations by making the pre-activations less negative. Combined with our scaling in step (3), which ensures that the magnitude of the dead neuron outputs is smaller than that of the residual it is compared against, this means that the model tends to receive a gradient signal to increase the pre-activations and hence make neurons alive.
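Putting these steps together, a minimal sketch of the extra loss term might look like the following; the particular scaling constant and the way dead neurons are tracked are illustrative choices, not a definitive specification.

```python
# Sketch of the ghost-grads term. `x` is a batch of autoencoder inputs and
# W_enc, b_enc, W_dec, b_dec are the sparse autoencoder's parameters.
import torch

def ghost_grads_loss(x, W_enc, b_enc, W_dec, b_dec, dead_mask):
    """dead_mask: boolean (d_dict,) tensor marking neurons that haven't fired recently."""
    if not dead_mask.any():                         # no dead neurons, no ghost term
        return x.new_zeros(())

    pre = x @ W_enc + b_enc
    recon = torch.relu(pre) @ W_dec + b_dec
    residual = (x - recon).detach()                 # (1) what the live neurons fail to explain

    # (2) forward pass of the dead neurons only, with an exponential activation,
    #     so the gradient pushes their (very negative) pre-activations upward.
    ghost_acts = torch.exp(pre[:, dead_mask])
    ghost_out = ghost_acts @ W_dec[dead_mask]

    # (3) rescale so the ghost output is smaller than the residual it targets
    #     (here: half the residual's norm; an illustrative choice).
    scale = residual.norm(dim=-1, keepdim=True) / (2 * ghost_out.norm(dim=-1, keepdim=True) + 1e-8)
    ghost_out = ghost_out * scale.detach()

    # (4) push the ghost output toward the residual.
    return ((ghost_out - residual) ** 2).mean()
```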
Empirically we find that this procedure produces autoencoders with very few (often zero!) dead neurons. Moreover, we find that models trained with Ghost Grads perform as well or better than models trained with neuron resampling at the same number of alive neurons.
Ghost Grads roughly doubles the compute requirements of the autoencoder, so it would be reasonable to ask if that cost is worth the benefit. Empirically, however, we find that the fraction of neurons that die increases with the size of the autoencoder, and very large autoencoders can easily have more than 50% of neurons dead, even with neuron resampling. So on pure compute grounds alone, it is often advantageous to run a smaller autoencoder with Ghost Grads rather than a larger autoencoder with traditional neuron resampling.
In Towards Monosemanticity, we discussed a counterexample that caused us to deprioritize trying to make monosemantic models directly by regularizing the base model to have sparse activations. This counterexample was the culmination of a series of thought experiments and real sparse models, and we discovered several other interesting phenomena and examples along the way.
In particular, we offer two surprising examples: joint superposition between MLP neurons and the residual stream, and magnitude-based representations that emerge in the limit of extreme activation sparsity.
Both of these examples (as well as the example in Towards Monosemanticity) demonstrate phenomena that we need to be cautious of in trying to produce monosemantic features. The last thing we want is an "out of the frying pan and into the fire" situation where we trade superposition within a layer for something even harder to deal with!
Joint Superposition Between MLP Neurons and Residual Stream
It turns out that sparsely activated neurons can be “in superposition with the residual stream”.
In a 1-layer model we regularized to have sparse activations, we found a very sparsely activated neuron that primarily responds to three different types of inputs: "it|'s", "and|/", and "many| of" / "some| of". We confirmed that for each of these facets, no other neuron activates to disambiguate the cases. At first glance, this looks like a polysemantic neuron without superposition, but the model was able to disambiguate which facet the neuron represents by reading it in conjunction with the residual stream, and then used it to make reasonable predictions specific to each of the three facets.
In response to this obstacle, one might try to analyze or sparsify the residual stream instead. However, this is less natural since the residual stream does not have a privileged basis. To the extent that we want to understand MLP layers as a component of the overall model, superposition contained within the MLP layer is preferable to superposition between the MLP layer and the residual stream, so techniques that replace the former with the latter are counterproductive for us.
It's worth noting that the phenomenon demonstrated by this example might be seen as a case of a more general issue of "cross-layer superposition", which we expect to discuss more in the future.
Magnitude-Based Representations
Another obstacle is that, in the limit of forcing activations to be sparse, the model can use a “magnitude-based representation”. This directly opposes the linear representation hypothesis, and is more similar to a polytope representation.
To eliminate the superposition described above (at the cost of severely limiting model performance!), we trained another regularized 1-layer model with a “cut” residual stream: instead of adding the output of the MLP layer back to the residual stream, we replaced the residual stream entirely with the output of the MLP layer. Even when only one neuron fired on any given example, we still saw polysemanticity! In particular, we found one neuron firing weakly for one facet (personal pronouns) and strongly for a completely unrelated facet (the “m” of the word “diplomacy”), while no other neuron activated to disambiguate the cases. Furthermore, we found that the one-layer model used this feature productively: when the neuron fired weakly, the model predicted tokens that make sense after personal pronouns, such as “ was”/” had”/” said”, and when it fired strongly, it predicted “acy”.
This is possible because the output layer is a softmax of a linear transformation. For example, a model could represent mutually exclusive, binary (on-or-off) features A, B, and C in a single neuron by firing at a different magnitude for each; because the logits are linear in that activation, a different output can win the softmax at each magnitude.
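As a toy illustration of this point (a sketch with arbitrary weights, not taken from the model), the snippet below shows a single scalar activation and a linear-plus-softmax readout over three outputs; a different output wins at magnitudes 1, 2, and 3, so three mutually exclusive features can share the neuron.

```python
# One neuron activation `a`, three outputs with logits linear in `a`.
# At a = 1, 2, 3 a different output has the largest logit (and hence the
# largest softmax probability), so mutually exclusive features A, B, C can be
# encoded purely by magnitude.
import numpy as np

w = np.array([2.0, 4.0, 6.0])    # logit slopes for the three outputs
b = np.array([3.0, 0.0, -5.0])   # logit intercepts

for a in (1.0, 2.0, 3.0):
    logits = w * a + b
    probs = np.exp(logits) / np.exp(logits).sum()
    print(f"a={a}: winning output = {int(np.argmax(probs))}")  # prints 0, 1, 2
```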
A sparse autoencoder is trained to take in the transformer’s activations at a specific location, such as after the nonlinearity in the MLP, and predict those same activations; it approximates the activations with a sparse representation. Instead, we could train a model to take in the activations at one point and predict the activations at some later point in the transformer. Such a model approximates part of the network's computation as a sparse computation.
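As a minimal sketch of this idea, assuming the sparse model reads the activations going into an MLP and is trained to predict that MLP's output, a training step might look like the following (all sizes and coefficients are illustrative):

```python
# A one-hidden-layer sparse model that reads activations at one point
# (x_early, e.g. the residual stream going into an MLP) and predicts the
# activations at a later point (x_late, e.g. the MLP's output), with an L1
# penalty keeping the hidden features sparse.
import torch

d_in, d_out, d_dict = 512, 512, 4096
l1_coeff = 1e-3

W_enc = torch.nn.Parameter(torch.randn(d_in, d_dict) * 0.02)
b_enc = torch.nn.Parameter(torch.zeros(d_dict))
W_dec = torch.nn.Parameter(torch.randn(d_dict, d_out) * 0.02)
b_dec = torch.nn.Parameter(torch.zeros(d_out))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-4)

def train_step(x_early, x_late):
    f = torch.relu(x_early @ W_enc + b_enc)      # sparse feature activations
    pred = f @ W_dec + b_dec                     # predicted later activations
    loss = ((pred - x_late) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss
```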
We’ve been experimenting with three variations on this theme:
The first two variations do produce interpretable features, though we cannot yet confidently say whether these features are better or worse than our baseline sparse autoencoder. However, they have a major benefit: they make circuit analysis much easier. The input weights of the sparse model compose directly with residual stream elements (and therefore with the outputs of earlier layers), and its output weights compose directly with residual stream elements (and therefore with the inputs to later layers), with the nonlinearity sandwiched in between. This allows the linear analysis in A Mathematical Framework for Transformer Circuits to be extended to include circuits supporting MLP features as well as output logits and attention heads.
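To illustrate the composition concretely, here is a small sketch with random stand-ins for the relevant matrices; in practice W_enc and W_dec would come from the trained sparse model, W_U would be the transformer's unembedding, and `head_write` a direction some earlier component writes into the residual stream.

```python
# Because the sparse model reads from and writes to the residual stream, its
# weights compose linearly with other components of the transformer.
import torch

d_model, d_dict, vocab = 512, 4096, 1000
W_enc = torch.randn(d_model, d_dict)       # sparse model's input weights (stand-in)
W_dec = torch.randn(d_dict, d_model)       # sparse model's output weights (stand-in)
W_U = torch.randn(d_model, vocab)          # unembedding (stand-in)
head_write = torch.randn(d_model)          # earlier component's output direction (stand-in)

feature_to_logits = W_dec @ W_U            # (d_dict, vocab): each feature's direct effect on the logits
dir_to_features = head_write @ W_enc       # (d_dict,): how that earlier output excites each feature
```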
We’re currently in the very early stages of exploring the attention variation.
From time to time we notice open problems in interpretability that we would be excited for external researchers to pursue, and which we either don’t anticipate pursuing in the near future or which we expect to benefit from parallel exploration.
Our intention is to have a low bar to posting problems in this section, so please treat these in the spirit of off-the-cuff answers to “What are some interesting open problems in interpretability right now?”
With that in mind, we have listed two kinds of open problems below: ones oriented towards “Model Biology” (e.g. going and looking inside models to see what they do) and ones oriented towards toy models.
Transformer Circuits periodically publishes comments on our papers, both from external parties and by the authors. Some of these comments were submitted before publication, from reviewers of early draft manuscripts. But others are submitted significantly after the fact, and might not be seen. To that end, we've included a digest of recently added comments:
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Finally, we'd like to highlight a selection of recent work by a number of researchers at other groups which we believe will be of interest to you if you find our papers interesting.
Two concurrent papers – In-Context Learning Creates Task Vectors by Hendel et al. and Function Vectors in Large Language Models by Todd et al. – addressed the question of how the tasks/functions invoked by in-context learning are represented. Hendel et al. study a series of in-context learning tasks of the form [A]→[B], where each of A and B is a single token, such as "Apple→Red Lime→Green Orange→", etc. They find that the value of the residual stream at medium depth over the final "→" token contains information specifying the task, such that patching that value into a fresh, unrelated context causes the model to perform the same task on a new input.
Todd et al. similarly study tasks of the form [A]:[B], such as "awake:asleep, vanish:appear, future:past", but they investigate which vectors get written to the final ":" token by which attention heads, rather than merely capturing the residual stream vector at that position. They ablate the outputs of specific attention heads writing to the final ":" token, and identify a handful (~12 heads) which contribute significantly. They find that patching in those outputs (at a sufficiently early layer) restores the model's performance on a fresh, zero-shot context containing only a new input.
We would be excited to see follow-up work checking whether the same mechanisms are involved when the task to be performed is determined by context and standard usage, as opposed to an in-context-learning sequence of analogies. For example, when the model processes the sentence "The color of an Orange is", does the same "task vector" (whether captured from the residual stream or summed from attention head outputs) form on "is" as forms on "→" in "Apple→Red Lime→Green Orange→"? If so, then ICL examples could serve as "purifying" strategies for identifying model mechanisms that are used more generally.
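To make the kind of intervention these papers perform concrete, here is a rough sketch of residual-stream capture-and-patch using plain PyTorch forward hooks. The model interface (blocks exposed as `model.blocks[i]` whose outputs are the residual stream) is an assumption for illustration, not the authors' code.

```python
# Capture the residual stream at (layer, position) on one prompt, then patch
# that vector into the same location while running a different prompt.
import torch

def capture_residual(model, tokens, layer, pos):
    """Run an ICL prompt and grab the residual stream at (layer, pos)."""
    captured = {}
    def hook(module, inputs, output):
        captured['resid'] = output[:, pos, :].detach().clone()
    handle = model.blocks[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(tokens)
    handle.remove()
    return captured['resid']

def patch_residual(model, tokens, layer, pos, vector):
    """Run a fresh prompt with the captured vector patched in at (layer, pos)."""
    def hook(module, inputs, output):
        output = output.clone()
        output[:, pos, :] = vector
        return output                      # forward hooks may return a modified output
    handle = model.blocks[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(tokens)
    handle.remove()
    return logits
```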
Given a specific prediction problem with multiple possible exact and approximate algorithms, which ones will a neural net learn? How does that depend on architecture, regularization, and training scheme? While mechanistic interpretability provides techniques for reverse engineering a particular model, any traction we can get on what algorithms to expect provides a very useful starting point for any analysis, and for reasoning about out-of-distribution behavior.
While the general problem on natural language or image datasets is quite difficult, there have been a number of recent suggestive results in the context of small, purely algorithmic tasks.
In Allen-Zhu & Li, the authors show that transformers can learn to generate Context-Free Grammars (strings generated by a series of replacement rules). Classical algorithms solve this problem with dynamic programming, and the authors show using probes that transformer hidden states linearly encode the relevant internal states of the dynamic programming solutions ("ancestors and boundaries") and store them at ends of phrases (similar to punctuation in LMs).
Singh et al. shows that for a task which can be solved via in-context learning or in-weights learning, the favored algorithm depends both on training time and regularization strength. Concretely, they train an autoregressive model on a sequence of pairs of (drawn glyph, label); ICL would learn the glyph shape → label map from examples in the context, while in-weights learning would learn a fixed set of those relationships from training. While the authors don't investigate the two algorithms mechanistically, the former might involve doing a sort of fuzzy induction head (using some representation of the drawn glyphs for keys and queries). They find that turning on L2 weight regularization strongly favors ICL solutions, as in grokking, while overtraining in the absence of regularization eventually favors in-weights solutions. They also provide evidence suggesting that these two algorithms compete for space within the residual stream.
Zhong et al. returns to the task of modular addition, showing that a 1-layer transformer with constant attention learns an algorithm ("pizza") which is qualitatively different from the algorithm ("clock") learned by a 1-layer transformer with ordinary attention investigated in Nanda et al. Strikingly, both algorithms learn Fourier/circular embeddings of the one-hot encoded numbers. Morwani et al. proved (!) that any maximum margin solution to modular addition (in a 1-hidden layer MLP with quadratic activations) must have such Fourier features, and that optimal performance requires that all Fourier frequencies be present. In fact they show something much more general. Chughtai et al. had observed that models trained on finite group multiplication learn an algorithm based on the characters of finite-dimensional group representations, and Morwani et al. show that optimal performance requires a model to learn all of the group's finite-dimensional irreducible representations. We believe this marks the first time that an algorithm discovered by reverse engineering a neural network was later proved to be mathematically optimal. (The reverse has happened, e.g. ridge regression is optimal for noisy least-squares problems and is learned by transformer models (Akyürek et al.))
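As a schematic of the Fourier/circular structure these works describe, here is a toy numpy illustration of reading off modular addition from angle addition at a single frequency. Real trained networks use several frequencies and a learned readout, so this is only the idealized core of the "clock" idea.

```python
# Embed each number n as an angle 2*pi*k*n/p, add the angles of the two inputs,
# and read out the answer as the candidate whose angle best matches the sum:
# cos(theta_a + theta_b - theta_c) is maximized exactly when c == (a + b) mod p.
import numpy as np

p, k = 113, 5                       # modulus and an arbitrary Fourier frequency
theta = lambda n: 2 * np.pi * k * n / p

def clock_add(a, b):
    candidates = np.arange(p)
    logits = np.cos(theta(a) + theta(b) - theta(candidates))
    return int(np.argmax(logits))

assert clock_add(47, 92) == (47 + 92) % p
```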
What about more general classes of algorithms? The programming language RASP(-L) attempts to abstract the algorithmic capacity of transformers (certain kinds of lookups and aggregations). Zhou et al. demonstrate through examples that problems with variable length inputs that can be solved by short RASP-L programs (independent of input length) are easily learned by transformers, in a way that generalizes over length. For example, it is hard to implement traditional addition (largest digit first) with a short RASP-L program, because you have to get to the end of both numbers to get started and do complex indexing. If the addition problem has index hints and reverses digit order, it is easy to write in RASP-L and empirically transformers learn length-independent solutions. They conjecture that any problem with a short RASP-L program will be easy to learn for transformers. (The converse won't hold, as certain dense computations or analogical reasoning tasks are not amenable to RASP representations but can be learned by transformers.) One might hope that something deeper is true: there is some notion of 'transformer-complexity' for algorithms (for which RASP-L is an upper bound) and that the algorithms learned are some complexity-weighted distribution over those, as in Solomonoff's Universal Induction.
In Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level, Nanda et al. investigate the circuit supporting a specific narrow case of factual recall: completions of the sentence "Fact: [ATHLETE NAME] plays the sport of" for a collection of [ATHLETE NAME]s playing basketball, baseball, or tennis. They find that detokenization happens in the first layers (0, 1), factual recall happens in the early MLP layers (2-6), after which attention heads (primarily one specific attention head) select the sport information and transmit it to the final token. The attention part of this story is tidy – composing the OV circuit with the unembedding matrix even produces a mechanistic probe for sport. But the factual recall MLP story is not; the authors carefully falsify a few simple theories of how the factual information about entities might be stored, and speculate about the limits of the quality of explanation we might expect for such unstructured and arbitrary facts about entities. While we are optimistic that there are principles behind how networks store even arbitrary information, it is as yet unclear what the minimal interpretable unit will be, especially when many layers are (provably) involved.
We’re excited to see our enthusiasm for using dictionary learning to attack superposition (along with Cunningham et al.) spread within the broader community. In Sparse Autoencoders Work on Attention Layer Outputs, Kissane et al. use sparse autoencoders to extract interpretable structure from the attention-weighted value vectors (or “hook_z” vectors in their ecosystem) within the second attention layer of a two-layer transformer. They find activation specificity for three families of features within this vector space. Induction features activate when a particular token should be predicted following the logic of an induction head; the authors link these features to the induction heads of this attention layer. Local context features activate within short-term contexts that have a clear delimiter (e.g., something akin to “is this clause a question?”). Finally, high-level context features describe something like the document-level context likely specified by certain keywords within a context.