☆ Bonus
Suggested paper replications
Toy Models of Superposition
There are several aspects of this paper which we didn't cover in these exercises. In particular, the section superposition as a phase change studies the interaction between sparsity and relative feature importance, and finds a phase change in the optimal weight configuration as these inputs are varied. Some examples can be found in this notebook.
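To get a feel for the setup before replicating it properly, here's a rough Monte Carlo sketch (our own simplification, not the paper's exact experiment: biases are fixed to zero, the relative importance of 0.5 is an arbitrary choice, and we only compare two hand-picked weight configurations rather than optimizing). It shows the better configuration for a 2-feature, 1-hidden-dimension ReLU-output model flipping from "represent the important feature only" to "superpose both features antipodally" as the features get sparser:

```python
import torch

def mc_loss(W: torch.Tensor, importance: torch.Tensor, feat_prob: float, n_samples: int = 200_000) -> float:
    """Monte Carlo estimate of the importance-weighted reconstruction loss for the
    toy model x_hat = ReLU(W^T W x), with features drawn uniformly from [0, 1] and
    independently present with probability feat_prob (so sparsity = 1 - feat_prob)."""
    n_feats = W.shape[1]
    present = (torch.rand(n_samples, n_feats) < feat_prob).float()
    x = torch.rand(n_samples, n_feats) * present
    x_hat = torch.relu(x @ W.T @ W)
    return (importance * (x - x_hat) ** 2).sum(dim=-1).mean().item()

importance = torch.tensor([1.0, 0.5])      # feature 2 is half as important as feature 1
W_dedicated = torch.tensor([[1.0, 0.0]])   # the single hidden dim represents feature 1 only
W_antipodal = torch.tensor([[1.0, -1.0]])  # both features squeezed in, antipodally superposed

for feat_prob in [0.9, 0.7, 0.5, 0.1]:
    dedicated = mc_loss(W_dedicated, importance, feat_prob)
    antipodal = mc_loss(W_antipodal, importance, feat_prob)
    winner = "dedicated" if dedicated < antipodal else "antipodal"
    print(f"feat_prob={feat_prob:.1f}  dedicated={dedicated:.4f}  antipodal={antipodal:.4f}  ->  {winner}")
```

Sweeping over relative importance as well as sparsity, and recording where the winner flips, traces out a crude version of the phase boundary studied in the paper.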
This might be a good replication for you if:
- You enjoy diving into the specific mathematical details of superposition
- You find theoretical work interesting, as well as empirical work
- You've enjoyed the first ~3 exercise sets in this section
Polysemanticity and Capacity in Neural Networks
This is a paper by Redwood Research, which builds on the ideas we discussed in the first few sections of these exercises (toy models of superposition, and the results on feature geometry).
They deeply study a measure called capacity, which is the same as what we called dimensionality above. Their results suggest an explanation for why features are often sharply "pinned" to either 0 or 1 capacity (i.e. not represented at all, or represented orthogonally to all other features).
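As a concrete starting point, here's a minimal sketch of how you might compute this capacity / dimensionality measure for a toy model's weight matrix, assuming the definition $C_i = (W_i \cdot W_i)^2 / \sum_j (W_i \cdot W_j)^2$ with rows $W_i$ as the feature embedding vectors (the function name and tensor shapes are our own choices, not the paper's code):

```python
import torch

def feature_capacity(W: torch.Tensor) -> torch.Tensor:
    """
    Capacity (a.k.a. dimensionality) of each feature, for a weight matrix W of
    shape (n_features, d_hidden) whose rows W_i are the feature embedding vectors:

        C_i = (W_i . W_i)^2 / sum_j (W_i . W_j)^2

    Capacity 0 corresponds to a feature that isn't represented, capacity 1 to a
    feature represented orthogonally to all other features.
    """
    gram = W @ W.T                         # (n_features, n_features), entries W_i . W_j
    numerator = gram.diagonal() ** 2       # (W_i . W_i)^2
    denominator = (gram ** 2).sum(dim=-1)  # sum_j (W_i . W_j)^2
    return numerator / denominator

# Example: 5 features squeezed into a 2-dimensional hidden space
W = torch.randn(5, 2)
print(feature_capacity(W))
```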
This might be a good replication for you if:
- You enjoy diving into the specific mathematical details of superposition
- You're comfortable with mathematical topics like linear algebra and calculus
- You find theoretical work interesting, as well as empirical work
- You've enjoyed the first ~4 exercise sets in this section
Suggested topics for further exploration
Softmax Linear Units
This is a proposed architectural change which appears to increase the number of interpretable MLP neurons at a low performance cost. In particular, it may reduce the incidence of superposition.
TL;DR: SoLU is an activation function $\vec{x} \to \vec{x} * \operatorname{softmax}(\vec{x})$ which encourages sparsity in activations in the same way that softmax encourages sparsity (softmax'ed probability distributions often have one probability close to one and the others close to zero). Encouraging activation sparsity might make it harder for neurons to be polysemantic.
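For reference, here's a minimal sketch of this activation function (in the paper, and in TransformerLens' SoLU models, it's followed by an extra LayerNorm, which the last bullet point below is about; that LayerNorm is omitted here):

```python
import torch

def solu(x: torch.Tensor) -> torch.Tensor:
    """SoLU activation: elementwise product of x with softmax(x), taken over the hidden dimension."""
    return x * torch.softmax(x, dim=-1)

# A large pre-activation dominates the softmax, pushing the other activations towards zero
x = torch.tensor([[4.0, 1.0, 0.5, -2.0]])
print(solu(x))  # roughly [3.7, 0.05, 0.01, -0.005], i.e. much sparser than x itself
```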
Replicating this paper's results in full might not be a practical final week project. However, several transformers in the TransformerLens library have been trained with SoLU (see the model page for more details), which makes them good candidates for closer study. Some questions you might want to explore:
- How do Neel's SoLU and GELU models compare in Neuroscope under the polysemanticity metric used in the SoLU paper? (i.e. what fraction of neurons seem monosemantic when looking at the top 10 activating dataset examples for 1 minute)
- The SoLU metrics for polysemanticity are somewhat limited, since they only provide information about whether a neuron is monosemantic when activating strongly (and this may not be correlated with whether it is monosemantic in general - see this caveat in the paper). Can you find any better metrics? Can you make them more reliable, or more scalable?
- The paper speculates that the LayerNorm after the SoLU activations lets the model "smuggle through" superposition, by smearing features across many dimensions, having the output be very small, and letting the LayerNorm scale it up. Can you find any evidence of this in solu-1l?
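As one possible starting point for this last question, here's a sketch of how you might pull out the activations on either side of that LayerNorm using TransformerLens (the hook names `blocks.0.mlp.hook_mid` and `blocks.0.mlp.hook_post` are our understanding of TransformerLens' naming for SoLU models, so double-check them against the actual cache keys; the prompt and the norm comparison are just illustrative):

```python
from transformer_lens import HookedTransformer

# Load Neel's 1-layer SoLU model (downloads weights on first run)
model = HookedTransformer.from_pretrained("solu-1l")

_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")

mid = cache["blocks.0.mlp.hook_mid"]    # after SoLU, before the extra LayerNorm
post = cache["blocks.0.mlp.hook_post"]  # after the extra LayerNorm

# If superposition is being "smuggled through", we'd expect the pre-LayerNorm
# activations to be systematically small, with the LayerNorm scaling them up
print("mean norm before LN:", mid.norm(dim=-1).mean().item())
print("mean norm after LN: ", post.norm(dim=-1).mean().item())
```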
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
There are many other interesting topics from Anthropic's dictionary learning paper which we didn't have time to dive into here, such as automated interpretability, feature motifs, and finite-state automata.
There's also a Future Work section at the end, which you might find a useful source of project ideas!
Exciting Open Problems In Mech Interp v2
This document was written by Neel, and it collates a bunch of interesting open problems in mech interp (with a strong focus on SAE-related ones). Again, many of these could make great capstone projects! That said, we encourage you to pick the more achievable, less ambitious projects from this list.
If any of the projects you're interested in involve training a sparse autoencoder, we strongly recommend this post by Arthur Conmy, which collates a bunch of different techniques for training SAEs well (most of which we didn't cover in these exercises).