[1.5.4] Toy Models of Superposition & Sparse Autoencoders
Please report any problems / bugs in the #errata channel of the Slack group, and ask any questions in the dedicated channels for this chapter of material.

Introduction
Superposition is a crucially important concept for understanding how transformers work. A definition from Neel Nanda's glossary:
Superposition is when a model represents more than $n$ features in an $n$-dimensional activation space. That is, features still correspond to directions, but the set of interpretable directions is larger than the number of dimensions.
Why should we expect something like this to happen? In general, the world has way more features than the model has dimensions of freedom, and so we can't have a one-to-one mapping between features and values in our model. But the model has to represent these features somehow. Hence, it comes up with techniques for cramming multiple features into fewer dimensions (at the cost of adding noise and interference between features).
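To get some geometric intuition for why this cramming is possible, here is a small standalone sketch (not part of the exercises; the dimensions chosen are arbitrary): in a $d$-dimensional space you can fit far more than $d$ random unit directions which interfere with each other only weakly.

```python
import torch as t

t.manual_seed(0)
d, n_features = 512, 2048  # 4x more "feature directions" than dimensions

W = t.randn(n_features, d)
W = W / W.norm(dim=-1, keepdim=True)  # random unit feature directions

cos_sims = W @ W.T                    # pairwise cosine similarities
cos_sims.fill_diagonal_(0)            # ignore self-similarity
print(f"max interference between any two features: {cos_sims.abs().max():.3f}")
# ~0.25 here, i.e. well below 1: many near-orthogonal directions fit in few dims
```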
These exercises are specifically designed to dive deeply into the theory of superposition, focusing less on practical applications of SAEs with real language models. Although this theory underpins much of the current work on SAEs, engaging deeply with this chapter is not absolutely necessary for the more practical material in sections 1.3.3 and 1.4.2, which take you through using SAEs for concept discovery and circuit discovery respectively. If you want to speedrun the most useful parts of this section, we strongly recommend sections 1️⃣ and 5️⃣, with a brief look at section 2️⃣ if you have time - you can probably skip the rest.
Reading Material
- 200 COP in MI: Exploring Polysemanticity and Superposition, 15 mins
- Read the post, up to and including "Tips" (although some parts of it might make more sense after you've read the other things here).
- Neel Nanda's Dynalist notes on superposition, 10 mins
- These aren't long, you should skim through them, and also use them as a reference during these exercises.
- Anthropic's Toy Models of Superposition, 20 mins
- You should read up to & including the "Summary: A Hierarchy of Feature Properties" section.
- The first few sections ("Key Results", "Definitions and Motivation", and "Empirical Phenomena") are particularly important.
- We'll also be going through other parts of this paper as we work through the exercises.
Content & Learning Objectives
1️⃣ Toy Models of Superposition: Superposition in a Nonprivileged Basis
In this section, you'll cover the basics of Anthropic's toy models of superposition. You'll learn through examples what superposition is, why it happens, and why it presents a problem for interpreting neural networks.
Learning Objectives
- Understand the concept of superposition, and how it helps models represent a larger set of features
- Understand the difference between superposition and polysemanticity
- Learn how sparsity contributes to superposition
- Understand the idea of the feature importance curve
- Learn how feature correlation changes the nature and degree of superposition
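As a preview of the kind of model this section studies, here is a hedged sketch of the toy model from Anthropic's paper (the exercises use their own implementation; the class name, parameter names and initialisation below are illustrative assumptions): features $x$ are squeezed into a smaller hidden space by $W$, then reconstructed as $\operatorname{ReLU}(W^T W x + b)$.

```python
import torch as t
from torch import nn

class ToyModel(nn.Module):
    def __init__(self, n_features: int = 5, d_hidden: int = 2):
        super().__init__()
        self.W = nn.Parameter(t.randn(d_hidden, n_features) * 0.1)
        self.b = nn.Parameter(t.zeros(n_features))

    def forward(self, x: t.Tensor) -> t.Tensor:
        h = x @ self.W.T                     # (batch, d_hidden): features squeezed into fewer dims
        return t.relu(h @ self.W + self.b)   # (batch, n_features): reconstruction of the features
```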
2️⃣ Toy Models of Superposition: Superposition in a Privileged Basis
Superposition in a privileged basis is a bit different from superposition in a nonprivileged basis. Firstly, the model has a tendency to align more with the basis directions (even though it often needs some degree of misalignment in order to represent all features). Secondly, many privileged bases are privileged because they are performing some kind of computation on their inputs, rather than just storing representations, so it's important to understand how our model is able to perform this computation.
Learning Objectives
- Understand the difference between neuron and bottleneck superposition (i.e. computational vs representational superposition)
- Learn how models can perform computation in superposition, via the example $f(x) = \operatorname{abs}(x)$ (see the sketch below)
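For a preview of how a ReLU layer can implement this (hedged: this is just the underlying mathematical identity, not the exercise's model), note that $|x| = \operatorname{ReLU}(x) + \operatorname{ReLU}(-x)$, so two ReLU neurons per input suffice:

```python
import torch as t

x = t.linspace(-3, 3, 7)
abs_via_relu = t.relu(x) + t.relu(-x)   # two "neurons" implementing abs(x)
print(t.allclose(abs_via_relu, x.abs()))  # True
```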
3️⃣ Feature Geometry
This section takes a slightly deeper dive into the geometry of superposition. It's not as essential as the exercises from the previous two sections, but should still be of interest and worth doing if you have time, or if you just want to dive more deeply into superposition. We cover dimensionality (a measure of how much capacity the model is affording to a single feature; see the formula sketch below), and relate this metric to the formation of increasingly complex geometric structures.
Learning Objectives
- Learn about dimensionality, which essentially measures what fraction of a dimension is allocated to a specific feature
- Understand the geometric intuitions behind superposition, and how they relate to the more general ideas of superposition in larger models
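If you want a concrete handle on the metric before starting, here is a sketch of the feature dimensionality formula from Anthropic's paper, $D_i = \|W_i\|^2 / \sum_j (\hat{W}_i \cdot W_j)^2$ (the function name and shape conventions below are assumptions, not the exercise's code):

```python
import torch as t

def feature_dimensionality(W: t.Tensor) -> t.Tensor:
    """W: (n_features, d_hidden) matrix whose rows are feature embedding directions."""
    sq_norms = W.pow(2).sum(dim=-1)                            # ||W_i||^2
    W_hat = W / W.norm(dim=-1, keepdim=True).clamp(min=1e-8)   # unit directions
    interference = (W_hat @ W.T).pow(2).sum(dim=-1)            # sum_j (W_hat_i . W_j)^2
    return sq_norms / interference                             # lies in [0, 1]
```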
4️⃣ Superposition & Deep Double Descent
Deep double descent is a long-observed phenomenon in deep learning, wherein the model's test performance first improves, then gets worse, then starts to improve again with further increases in model size, data size, or training time. The initial worsening is in line with the predictions of classical statistics (i.e. model overfitting), but the subsequent decrease in loss isn't. Anthropic's paper on deep double descent connects this phenomenon to the idea of superposition, essentially arguing that the two scaling regimes correspond to a memorizing solution and a generalizing solution respectively, with the first characterized by representing datapoints in superposition and the second by representing features in superposition.
This section guides you through a replication of this paper's key results. Relative to the other material here, this is quite unguided, so we recommend you treat it more as an optional extension than an exercise on the core pathway for today.
Learning Objectives
- Understand and characterise the deep double descent phenomenon
- Relate the different phases of double descent to the idea of superposition
- Practice replicating results from a paper in a more unguided way
5️⃣ Sparse Autoencoders in Toy Models
In this last section, you'll learn about sparse autoencoders, and how they might help us resolve problems of superposition. You'll train a sparse autoencoder on the toy model setup from earlier sections, and you'll also implement techniques like neuron resampling and different architectures (e.g. the Gated architecture from DeepMind's paper).
Learning Objectives
- Learn about sparse autoencoders, and how they might be used to disentangle features represented in superposition
- Train your own SAEs on the toy models from earlier sections, and visualise the feature reconstruction process
- Understand important SAE training strategies (e.g. resampling) and architecture variants (e.g. Gated, JumpReLU)
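To fix ideas before the exercises, here is a minimal sketch of a "vanilla" sparse autoencoder (hedged: the names, shapes and L1 coefficient below are illustrative assumptions, not the exercise spec). It reconstructs activations through an overcomplete ReLU bottleneck, with an L1 penalty on the latent activations encouraging sparsity.

```python
import torch as t
from torch import nn

class VanillaSAE(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(t.randn(d_in, d_sae) * 0.1)
        self.W_dec = nn.Parameter(t.randn(d_sae, d_in) * 0.1)
        self.b_enc = nn.Parameter(t.zeros(d_sae))
        self.b_dec = nn.Parameter(t.zeros(d_in))

    def forward(self, x: t.Tensor, l1_coeff: float = 1e-3):
        acts = t.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse latent features
        x_recon = acts @ self.W_dec + self.b_dec                   # reconstruction of the input
        loss = (x_recon - x).pow(2).mean() + l1_coeff * acts.abs().sum(-1).mean()
        return loss, acts, x_recon
```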
☆ Bonus
We end with a section of suggested bonus material & paper replications, like usual.
Questions
Here is a set of questions (with some brief answers provided) which you should be able to answer for yourself after reading the above material. Search for them in Neel's Dynalist notes if you didn't come across them during your reading.
What is a privileged basis? Why should we expect neuron activations to be privileged by default? Why shouldn't we expect the residual stream to be privileged?
Answer
A privileged basis is one where the standard basis directions are meaningful (due to the structure of computation being done on that basis). This doesn't necessarily mean that the basis is interpretable.
Neuron activations are privileged because of the elementwise nonlinear function that gets applied. ReLU is easily described in the standard basis, e.g. in 2D:

$$\operatorname{ReLU}(x, y) = (\max(x, 0), \max(y, 0))$$

but if you redefine a basis $x' = (x+y)/\sqrt{2}$, $y' = (x-y)/\sqrt{2}$, then describing ReLU in this new basis becomes really messy. More importantly, we now get interference between the components $x'$ and $y'$, i.e. the ReLU is no longer acting on them independently.
The residual stream is not privileged because anything that reads from or writes to it uses a linear map. As a thought experiment, if we changed all the writing matrices (i.e. $W_{out}$ in the MLP layers and $W_O$ in the attention layers) to $W \to W R$, and all the reading matrices (i.e. $W_{in}$ in the MLP layers and $W_Q$, $W_K$, $W_V$ in the attention layers) to $W \to R^{-1} W$, where $R$ is some arbitrary rotation matrix, then the model's computation would be unchanged. Since the matrix $R$ is arbitrary, it can change the basis in any way it wants, so that basis can't be privileged.
To put this another way - if you claimed "I think the 47th element of the residual stream encoded some special information e.g. the plurality of the noun at that sequence position", I could call bullshit on your claim, because this thought experiment shows that any basis direction could just as easily be rotated & distributed as a linear combination of several different basis directions without fundamentally changing the computation done by the transformer. The same does not apply to neurons, because a rotation / change of basis would change the nature of computation done on them.
Summary: Something is a privileged basis if it is not rotation-independent, i.e. the nature of computation done on it means that the basis directions have some special significance.

Common misconception: privileged basis is equivalent to interpretable basis. This is NOT true (although it is the case that a basis must be privileged if the individual basis directions have some interpretable meaning; this is necessary but not sufficient).
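To make this concrete, here is a small numerical sketch (not part of the exercises; the dimensions are arbitrary) showing both halves of the argument: ReLU does not commute with a rotation of its input basis, while the rotation thought-experiment on the residual stream leaves an MLP block's computation unchanged.

```python
import torch as t

t.manual_seed(0)
d_model, d_mlp, batch = 4, 8, 5

R = t.linalg.qr(t.randn(d_model, d_model)).Q   # random orthogonal (rotation-like) matrix

x = t.randn(batch, d_model)          # residual stream activations
W_in = t.randn(d_model, d_mlp)       # reading matrix
W_out = t.randn(d_mlp, d_model)      # writing matrix

# (1) ReLU does not commute with rotation: the neuron basis is privileged
print(t.allclose(t.relu(x @ R), t.relu(x) @ R))  # False (in general)

# (2) Rotating the residual stream is invisible: writing matrices map W -> W R,
#     reading matrices map W -> R^{-1} W, and the MLP block's output (expressed
#     in the rotated basis) is unchanged.
mlp_out = t.relu(x @ W_in) @ W_out                        # original computation
x_rot = x @ R                                             # rotated residual stream
mlp_out_rot = t.relu(x_rot @ (R.T @ W_in)) @ (W_out @ R)  # rotated reading & writing
print(t.allclose(mlp_out_rot, mlp_out @ R, atol=1e-5))    # True: same computation, rotated basis
```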
What is the difference between superposition and polysemanticity?
Answer
Polysemanticity happens when one neuron corresponds to multiple features (see [here](https://distill.pub/2020/circuits/zoom-in/#:~:text=lot%20of%20effort.-,Polysemantic%20Neurons,-This%20essay%20may) for more discussion & examples). If we only had polysemanticity, this wouldn't really be a problem for us (there might exist a basis for features s.t. each basis vector corresponds to a single feature).
Superposition is when there are more features than dimensions. So it implies polysemanticity (because we must have dimensions representing more than one feature), but the converse is not true.
What are the importance and sparsity of features? Do you expect more or less polysemantic neurons if sparsity is larger?
Answer
Importance = how useful is this feature for achieving lower loss? Sparsity = how rarely the feature appears in the input data (a feature with high sparsity is present in only a small fraction of inputs).

If sparsity is larger, then we expect more polysemantic neurons. This is because a single neuron can afford to represent several different sparse features (usually it'll only be representing one of them at any given time, so there won't be much interference).
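As a hedged illustration of the sparsity convention used in the toy-model setup (the names and probabilities below are assumptions, not the exercise's code): each feature is present with probability $1 - \text{sparsity}$, and takes a random magnitude when present.

```python
import torch as t

t.manual_seed(0)
n_batch, n_features = 6, 5
sparsity = 0.9                  # high sparsity => each feature is rarely present
feat_prob = 1 - sparsity

present = t.rand(n_batch, n_features) < feat_prob   # which features are "on"
magnitudes = t.rand(n_batch, n_features)            # feature values when on
features = magnitudes * present.float()             # mostly zeros: sparse features
print(features)
```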
How would you define a feature?
Answer
There's no single correct answer to this. Many of the definitions are unfortunately circular (e.g. "a feature is a thing which could be represented by a neuron"). Two possible definitions are this one from Neel's [Dynalist notes](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#q=feature):
> A feature is a property of an input to the model, or some subset of that input (eg a token in the prompt given to a language model, or a patch of an image).
or this similar one from Chris Olah's [Distill circuits Thread](https://distill.pub/2020/circuits/zoom-in/):
> A feature is a scalar function of the input. In this essay, neural network features are directions, and often simply individual neurons. We claim such features in neural networks are typically meaningful features which can be rigorously studied. A meaningful feature is one that genuinely responds to an articulable property of the input, such as the presence of a curve or a floppy ear.
Setup (don't read, just run)
```python
import sys
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Literal

import einops
import numpy as np
import pandas as pd
import plotly.express as px
import torch as t
from IPython.display import HTML, display
from jaxtyping import Float
from torch import Tensor, nn
from torch.distributions.categorical import Categorical
from torch.nn import functional as F
from tqdm.auto import tqdm

device = t.device(
    "mps" if t.backends.mps.is_available() else "cuda" if t.cuda.is_available() else "cpu"
)

# Make sure exercises are in the path
chapter = "chapter1_transformer_interp"
section = "part31_superposition_and_saes"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part31_superposition_and_saes.tests as tests
import part31_superposition_and_saes.utils as utils
from plotly_utils import imshow, line

MAIN = __name__ == "__main__"
```