1️⃣ TransformerLens: Introduction

Learning Objectives
  • Load and run a HookedTransformer model
  • Understand the basic architecture of these models
  • Use the model's tokenizer to convert text to tokens, and vice versa
  • Know how to cache activations, and to access activations from the cache
  • Use circuitsvis to visualise attention heads

Introduction

Note - most of this is written from the POV of Neel Nanda.

This is a demo notebook for TransformerLens, a library I (Neel Nanda) wrote for doing mechanistic interpretability of GPT-2 Style language models. The goal of mechanistic interpretability is to take a trained model and reverse engineer the algorithms the model learned during training from its weights. It is a fact about the world today that we have computer programs that can essentially speak English at a human level (GPT-3, PaLM, etc), yet we have no idea how they work nor how to write one ourselves. This offends me greatly, and I would like to solve this! Mechanistic interpretability is a very young and small field, and there are a lot of open problems - if you would like to help, please try working on one! Check out my list of concrete open problems to figure out where to start.

I wrote this library because after I left the Anthropic interpretability team and started doing independent research, I got extremely frustrated by the state of open source tooling. There's a lot of excellent infrastructure like HuggingFace and DeepSpeed to use or train models, but very little to dig into their internals and reverse engineer how they work. This library tries to solve that, and to make it easy to get into the field even if you don't work at an industry org with real infrastructure! The core features were heavily inspired by Anthropic's excellent Garcon tool. Credit to Nelson Elhage and Chris Olah for building Garcon and showing me the value of good infrastructure for accelerating exploratory research!

The core design principle I've followed is to enable exploratory analysis - one of the most fun parts of mechanistic interpretability compared to normal ML is the extremely short feedback loops! The point of this library is to keep the gap between having an experiment idea and seeing the results as small as possible, to make it easy for research to feel like play and to enter a flow state. This notebook demonstrates how the library works and how to use it, but if you want to see how well it works for exploratory research, check out my notebook analysing Indirect Object Identification or my recording of myself doing research!

Loading and Running Models

TransformerLens comes loaded with >40 open source GPT-style models. You can load any of them in with HookedTransformer.from_pretrained(MODEL_NAME). For this demo notebook we'll look at GPT-2 Small, an 80M parameter model; see the Available Models section for info on the rest.

gpt2_small: HookedTransformer = HookedTransformer.from_pretrained("gpt2-small")

HookedTransformerConfig

Alternatively, you can define a config object, then call HookedTransformer.from_config(cfg) to define your model. This is particularly useful when you want to have finer control over the architecture of your model. We'll see an example of this in the next section, when we define an attention-only model to study induction heads.
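
As a rough sketch of what this looks like (the field values below are made up purely for illustration; the next section builds a real attention-only config, and constructs the model directly from it):

from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=2,            # a tiny hypothetical 2-layer model
    d_model=256,
    n_ctx=256,
    d_head=64,
    n_heads=4,
    d_vocab=50257,
    attn_only=True,        # no MLP layers
    tokenizer_name="gpt2",
)
model = HookedTransformer(cfg)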

Even if you don't define your model in this way, you can still access the config object through the cfg attribute of the model.

Exercise - inspect your model

Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪

Use gpt2_small.cfg to find the following, for your GPT-2 Small model:

  • Number of layers
  • Number of heads per layer
  • Maximum context window

You might have to check out the documentation page for some of these. If you're in VSCode then you can reach it by right-clicking on HookedTransformerConfig and choosing "Go to definition". If you're in Colab, then you can read the GitHub page.

Answer

The following parameters in the config object give you the answers:

cfg.n_layers == 12
cfg.n_heads == 12
cfg.n_ctx == 1024
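
If you'd rather see every field at once, you can also just print the whole config object:

print(gpt2_small.cfg)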

Running your model

Models can be run on a single string or a tensor of tokens (shape: [batch, position], all integers). The possible return types are:

  • "logits" (shape [batch, position, d_vocab], floats),
  • "loss" (the cross-entropy loss when predicting the next token),
  • "both" (a tuple of (logits, loss))
  • None (run the model, but don't calculate the logits - this is faster when we only want to use intermediate activations)
model_description_text = """## Loading Models

HookedTransformer comes loaded with >40 open source GPT-style models. You can load any of them in with `HookedTransformer.from_pretrained(MODEL_NAME)`. Each model is loaded into the consistent HookedTransformer architecture, designed to be clean, consistent and interpretability-friendly.

For this demo notebook we'll look at GPT-2 Small, an 80M parameter model. To try the model the model out, let's find the loss on this paragraph!"""

loss = gpt2_small(model_description_text, return_type="loss")
print("Model loss:", loss)
Model loss: tensor(4.3443, device='cuda:0')
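
For reference, here's what the other return types listed above look like in use, with the same model and text:

logits = gpt2_small(model_description_text, return_type="logits")      # shape [batch, position, d_vocab]
logits, loss = gpt2_small(model_description_text, return_type="both")  # tuple of (logits, loss)
output = gpt2_small(model_description_text, return_type=None)          # returns None; faster when you only need intermediate activations
print(logits.shape, loss.item(), output)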

Transformer architecture

HookedTransformer is a somewhat adapted GPT-2 architecture, but is computationally identical. The most significant changes are to the internal structure of the attention heads:

  • The weights W_K, W_Q, W_V mapping the residual stream to queries, keys and values are 3 separate matrices, rather than one big concatenated matrix.
  • The weight matrices W_K, W_Q, W_V, W_O and activations have separate head_index and d_head axes, rather than flattening them into one big axis.
    • The activations all have shape [batch, position, head_index, d_head].
    • W_K, W_Q, W_V have shape [head_index, d_model, d_head] and W_O has shape [head_index, d_head, d_model]
  • Important - we generally follow the convention that weight matrices multiply on the right rather than the left. In other words, they have shape [input, output], and we have new_activation = old_activation @ weights + bias.
    • Click the dropdown below for examples of this, if it seems unintuitive.
Examples of matrix multiplication in our model

Query matrices

Each query matrix W_Q for a particular layer and head has shape [d_model, d_head]. So if a vector x in the residual stream has length d_model, then the corresponding query vector is x @ W_Q, which has length d_head.

Embedding matrix

The embedding matrix W_E has shape [d_vocab, d_model]. So if A is a one-hot-encoded vector of length d_vocab corresponding to a particular token, then the embedding vector for this token is A @ W_E, which has length d_model.
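
Here's the same idea as a minimal code sketch using gpt2_small (the model.W_Q indexing shortcut used here is explained in the Parameters and Activations section below):

W_Q = gpt2_small.W_Q[0, 7]                                # query weights for layer 0, head 7: shape [d_model, d_head]
x = t.randn(gpt2_small.cfg.d_model, device=W_Q.device)    # a stand-in residual stream vector
q = x @ W_Q                                               # right-multiplication: shape [d_head]
print(W_Q.shape, q.shape)                                 # torch.Size([768, 64]) torch.Size([64])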

The actual code is a bit of a mess, as there's a variety of Boolean flags to make it consistent with the various different model families in TransformerLens - to understand it and the internal structure, I instead recommend reading the code in CleanTransformerDemo.

Parameters and Activations

It's important to distinguish between parameters and activations in the model.

  • Parameters are the weights and biases that are learned during training.
    • These don't change when the model input changes.
    • They can be accessed directly from the model, e.g. model.W_E for the embedding matrix.
  • Activations are temporary numbers calculated during a forward pass, that are functions of the input.
    • We can think of these values as only existing for the duration of a single forward pass, and disappearing afterwards.
    • We can use hooks to access these values during a forward pass (more on hooks later), but it doesn't make sense to talk about a model's activations outside the context of some particular input.
    • Attention scores and patterns are activations (this is slightly non-intuitive because they're used in a matrix multiplication with another activation).
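
Here's a quick sketch of the distinction, using gpt2_small (run_with_cache is introduced properly in the Caching section below):

# Parameters: attributes of the model, the same regardless of input
print(gpt2_small.W_E.shape)         # embedding matrix, shape [d_vocab, d_model]

# Activations: only exist once the model is run on some particular input
_, cache = gpt2_small.run_with_cache("Hello World")
print(cache["pattern", 0].shape)    # layer-0 attention pattern, shape [batch, n_heads, seq, seq]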

The link below shows a diagram of a single layer (called a TransformerBlock) for an attention-only model with no biases. Each box corresponds to an activation (and also tells you the name of the corresponding hook point, which we will eventually use to access those activations). The red text below each box tells you the shape of the activation (ignoring the batch dimension). Each arrow corresponds to an operation on an activation; where there are parameters involved these are labelled on the arrows.

Link to diagram

The next link is to a diagram of a TransformerBlock with full features (including biases, layernorms, and MLPs). Don't worry if not all of this makes sense at first - we'll return to some of the details later. As we work with these transformers, we'll get more comfortable with their architecture.

Link to diagram

A few shortcuts to make your lives easier when using these models:

  • You can index weights like W_Q directly from the model via e.g. model.blocks[0].attn.W_Q (which gives you the [nheads, d_model, d_head] query weights for all heads in layer 0).
    • But an easier way is just to index with model.W_Q, which gives you the [nlayers, nheads, d_model, d_head] tensor containing every query weight in the model.
  • Similarly, there exist shortcuts model.W_E, model.W_U and model.W_pos for the embeddings, unembeddings and positional embeddings respectively.
  • With models containing MLP layers, you also have model.W_in and model.W_out for the linear layers.
  • The same is true for all biases (e.g. model.b_Q for all query biases).
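
For example, you can sanity-check these shortcuts like so (the shapes in the comments are for GPT-2 Small):

print(gpt2_small.blocks[0].attn.W_Q.shape)   # [n_heads, d_model, d_head] = [12, 768, 64]
print(gpt2_small.W_Q.shape)                  # [n_layers, n_heads, d_model, d_head] = [12, 12, 768, 64]
print(gpt2_small.W_E.shape)                  # [d_vocab, d_model] = [50257, 768]
print(gpt2_small.b_Q.shape)                  # [n_layers, n_heads, d_head] = [12, 12, 64]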

Tokenization

The tokenizer is stored inside the model, and you can access it using model.tokenizer. There are also a few helper methods that call the tokenizer under the hood, for instance:

  • model.to_str_tokens(text) converts a string into a list of tokens-as-strings (or a list of strings into a list of lists of tokens-as-strings).
  • model.to_tokens(text) converts a string into a tensor of tokens.
  • model.to_string(tokens) converts a tensor of tokens into a string.

Examples of use:

print(gpt2_small.to_str_tokens("gpt2"))
print(gpt2_small.to_str_tokens(["gpt2", "gpt2"]))
print(gpt2_small.to_tokens("gpt2"))
print(gpt2_small.to_string([50256, 70, 457, 17]))
['<|endoftext|>', 'g', 'pt', '2']
[['<|endoftext|>', 'g', 'pt', '2'], ['<|endoftext|>', 'g', 'pt', '2']]
tensor([[50256,    70,   457,    17]], device='cuda:0')
<|endoftext|>gpt2
Aside - <|endoftext|>

A weirdness you may have noticed in the above is that to_tokens and to_str_tokens added an <|endoftext|> token to the start of each prompt. We encountered this in the previous set of exercises, and noted that this is the Beginning of Sequence (BOS) token (which for GPT-2 is also the same as the EOS and PAD tokens, index 50256).

TransformerLens appends this token by default, and it can easily trip up new users. Notably, this includes model.forward (which is what's implicitly used when you do eg model("Hello World")). You can disable this behaviour by setting the flag prepend_bos=False in to_tokens, to_str_tokens, model.forward and any other function that converts strings to multi-token tensors.

prepend_bos is a bit of a hack, and I've gone back and forth on what the correct default here is. The reason I do this is that transformers tend to treat the first token weirdly - this doesn't really matter in training (where all inputs are >1000 tokens), but this can be a big issue when investigating short prompts! The reason for this is that attention patterns are a probability distribution and so need to add up to one, so to simulate being "off" they normally look at the first token. Giving them a BOS token lets the heads rest by looking at that, preserving the information in the first "real" token.

Further, some models are trained to need a BOS token (OPT and my interpretability-friendly models are, GPT-2 and GPT-Neo are not). But despite GPT-2 not being trained with this, empirically it seems to make interpretability easier.
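
To see the effect directly, you can compare tokenization with and without the flag (reusing the "gpt2" example from above):

print(gpt2_small.to_tokens("gpt2"))                      # tensor([[50256, 70, 457, 17]]), as above
print(gpt2_small.to_tokens("gpt2", prepend_bos=False))   # same tokens, without the leading 50256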

Exercise - how many tokens does your model guess correctly?

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to ~10 minutes on this exercise.

Consider the model_description_text you fed into your model above. How many tokens did your model guess correctly? Which tokens were correct?

logits: Tensor = gpt2_small(model_description_text, return_type="logits")
prediction = logits.argmax(dim=-1).squeeze()[:-1]

# YOUR CODE HERE - compare the model's predictions to the true tokens, and see how many it got right
Hint

Use return_type="logits" to get the model's predictions, then take argmax across the vocab dimension. Then, compare these predictions with the actual tokens, derived from the model_description_text.

Remember, you should be comparing the [:-1]th elements of this tensor of predictions with the [1:]th elements of the input tokens (because your model's output represents a probability distribution over the next token, not the current one).

Also, remember to handle the batch dimension (since logits, and the output of to_tokens, will both have batch dimensions by default).

Answer - what you should see
Model accuracy: 33/111
Correct tokens: ['\n', '\n', 'former', ' with', ' models', '.', ' can', ' of', 'ooked', 'Trans', 'former', '_', 'NAME', '`.', ' model', ' the', 'Trans', 'former', ' to', ' be', ' and', '-', '.', '\n', '\n', ' at', 'PT', '-', ',', ' model', ',', "'s", ' the']

So the model got 33 out of 111 tokens correct. Not too bad!

Solution
logits: Tensor = gpt2_small(model_description_text, return_type="logits")
prediction = logits.argmax(dim=-1).squeeze()[:-1]
true_tokens = gpt2_small.to_tokens(model_description_text).squeeze()[1:]
is_correct = prediction == true_tokens
print(f"Model accuracy: {is_correct.sum()}/{len(true_tokens)}")
print(f"Correct tokens: {gpt2_small.to_str_tokens(prediction[is_correct])}")

Induction heads are a special kind of attention head which we'll examine a lot more in coming exercises. They allow a model to perform in-context learning of a specific form: generalising from one observation that token B follows token A, to predict that token B will follow A in future occurrences of A, even if these two tokens had never appeared together in the model's training data.

Can you see evidence of any induction heads at work, on this text?

Evidence of induction heads

The evidence for induction heads comes from the fact that the model successfully predicted 'ooked', 'Trans', 'former' following the token 'H'. This is because it's the second time that HookedTransformer appears in this text string, and the model predicted it the second time but not the first. (The model did predict 'former' the first time, but we can reasonably assume that Transformer is a word this model had already been exposed to during training, so this prediction wouldn't require the induction capability, unlike HookedTransformer.)

print(gpt2_small.to_str_tokens("HookedTransformer", prepend_bos=False))     # --> ['H', 'ooked', 'Trans', 'former']

Caching all Activations

The first basic operation when doing mechanistic interpretability is to break open the black box of the model and look at all of the internal activations of a model. This can be done with logits, cache = model.run_with_cache(tokens). Let's try this out, on the first sentence from the GPT-2 paper.

Aside - a note on remove_batch_dim

Every activation inside the model begins with a batch dimension. Here, because we only passed in a single prompt, that dimension is always length 1 and kinda annoying, so passing in the remove_batch_dim=True keyword removes it.

gpt2_cache_no_batch_dim = gpt2_cache.remove_batch_dim() would have achieved the same effect.
gpt2_text = "Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets."
gpt2_tokens = gpt2_small.to_tokens(gpt2_text)
gpt2_logits, gpt2_cache = gpt2_small.run_with_cache(gpt2_tokens, remove_batch_dim=True)

print(type(gpt2_logits), type(gpt2_cache))
<class 'torch.Tensor'> <class 'transformer_lens.ActivationCache.ActivationCache'>

If you inspect the gpt2_cache object, you should see that it contains a very large number of keys, each one corresponding to a different activation in the model. You can access the keys by indexing the cache directly, or by a more convenient indexing shorthand. For instance, here are 2 ways to extract the attention patterns for layer 0:

attn_patterns_from_shorthand = gpt2_cache["pattern", 0]
attn_patterns_from_full_name = gpt2_cache["blocks.0.attn.hook_pattern"]

t.testing.assert_close(attn_patterns_from_shorthand, attn_patterns_from_full_name)
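
Note also that because we passed remove_batch_dim=True, these cached activations have no leading batch axis:

print(gpt2_cache["embed"].shape)        # [seq_len, d_model]
print(gpt2_cache["pattern", 0].shape)   # [n_heads, seq_len, seq_len]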
Aside: utils.get_act_name

The reason these are the same is that, under the hood, the first example actually indexes by utils.get_act_name("pattern", 0), which evaluates to "blocks.0.attn.hook_pattern".

In general, utils.get_act_name is a useful function for getting the full name of an activation, given its short name and layer number.

You can use the diagram from the Transformer Architecture section to help you find activation names.
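
For example, here are the full names of a few commonly used activations:

print(utils.get_act_name("pattern", 0))    # 'blocks.0.attn.hook_pattern'
print(utils.get_act_name("q", 2))          # 'blocks.2.attn.hook_q'
print(utils.get_act_name("resid_pre", 5))  # 'blocks.5.hook_resid_pre'
print(utils.get_act_name("post", 11))      # 'blocks.11.mlp.hook_post'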

Exercise - verify activations

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to 10-15 minutes on this exercise. If you're already comfortable implementing things like attention calculations (e.g. having gone through Neel's transformer walkthrough) you can skip this exercise. However, it might serve as a useful refresher.

Verify that hook_q, hook_k and hook_pattern are related to each other in the way implied by the diagram. Do this by computing layer0_pattern_from_cache (the attention pattern taken directly from the cache, for layer 0) and layer0_pattern_from_q_and_k (the attention pattern calculated from hook_q and hook_k, for layer 0). Remember that the attention pattern contains probabilities rather than raw scores, so you'll need to scale, mask and softmax appropriately.

layer0_pattern_from_cache = gpt2_cache["pattern", 0]

# YOUR CODE HERE - define `layer0_pattern_from_q_and_k` manually, by manually performing the
# steps of the attention calculation (dot product, masking, scaling, softmax)
t.testing.assert_close(layer0_pattern_from_cache, layer0_pattern_from_q_and_k)
print("Tests passed!")
Hint

You'll need to use three different cache indexes in all:

gpt2_cache["pattern", 0] to get the attention patterns, which have shape [nhead, seqQ, seqK] gpt2_cache["q", 0] to get the query vectors, which have shape [seqQ, nhead, headsize] * gpt2_cache["k", 0] to get the key vectors, which have shape [seqK, nhead, headsize]
Solution
layer0_pattern_from_cache = gpt2_cache["pattern", 0]
q, k = gpt2_cache["q", 0], gpt2_cache["k", 0]
seq, nhead, headsize = q.shape
layer0_attn_scores = einops.einsum(q, k, "seqQ n h, seqK n h -> n seqQ seqK")
mask = t.triu(t.ones((seq, seq), dtype=t.bool), diagonal=1).to(device)
layer0_attn_scores.masked_fill_(mask, -1e9)
layer0_pattern_from_q_and_k = (layer0_attn_scores / headsize**0.5).softmax(-1)

Visualising Attention Heads

A key insight from the Mathematical Frameworks paper is that we should focus on interpreting the parts of the model that are intrinsically interpretable - the input tokens, the output logits and the attention patterns. Everything else (the residual stream, keys, queries, values, etc) is a compressed intermediate state used when calculating these meaningful things. So a natural place to start is classifying heads by their attention patterns on various texts.

When doing interpretability, it's always good to begin by visualising your data, rather than jumping straight to summary statistics. Summary statistics can be super misleading! But once we have visualised the attention patterns, we can compute some basic summary statistics and use our visualisations to validate them! (Accordingly, being good at web dev/data visualisation is a surprisingly useful skillset! Neural networks are very high-dimensional objects.)

Let's visualize the attention pattern of all the heads in layer 0, using Alan Cooney's CircuitsVis library (based on Anthropic's PySvelte library). If you did the previous set of exercises, you'll have seen this library before.

We will use the function cv.attention.attention_patterns, which takes two important arguments:

  • attention: the attention head patterns, of shape [n_heads, seq_len, seq_len]. This consists of the stacked grid of attention probabilities for each head, i.e. attention[head, d, s] is the attention probability from destination position d to source position s in attention head head.
  • tokens: List of tokens, which should have the same length as the seq_len dimension of attention. Make sure you're not accidentally passing in a list with a dummy dimension, or that differs from seq_len because of the BOS token!

This visualization is interactive! Try hovering over a token or head, and click to lock. The grid on the top left and for each head is the attention pattern as a destination position by source position grid. It's lower triangular because GPT-2 has causal attention: attention can only look backwards, so information can only move forwards in the network.

Note - you can also use the cv.attention.attention_heads function, which presents the data in a different way (the syntax is exactly the same as attention_patterns). Note, if you display this in VSCode then it may exhibit a bug where the main plot continually shrinks in size - if this happens, you should instead save the HTML (i.e. with html = cv.attention.attention_heads(...); with open("attn_heads.html", "w") as f: f.write(str(html))) and open the plot in your browser.

print(type(gpt2_cache))
attention_pattern = gpt2_cache["pattern", 0]
print(attention_pattern.shape)
gpt2_str_tokens = gpt2_small.to_str_tokens(gpt2_text)

print("Layer 0 Head Attention Patterns:")
display(
    cv.attention.attention_patterns(
        tokens=gpt2_str_tokens,
        attention=attention_pattern,
        attention_head_names=[f"L0H{i}" for i in range(12)],
    )
)

Hover over heads to see the attention patterns; click on a head to lock it. Hover over each token to see which other tokens it attends to (or which other tokens attend to it - you can see this by changing the dropdown to Destination <- Source).

Other circuitsvis functions - neuron activations

The circuitsvis library also has a number of cool visualisations for neuron activations. Here are some more of them (you don't have to understand them all now, but you can come back to them later).

The function below visualises neuron activations. The example shows just one sequence, but it can also show multiple sequences (if tokens is a list of lists of strings, and activations is a list of tensors).

neuron_activations_for_all_layers = t.stack([
    gpt2_cache["post", layer] for layer in range(gpt2_small.cfg.n_layers)
], dim=1)
# shape = (seq_pos, layers, neurons)
cv.activations.text_neuron_activations(
    tokens=gpt2_str_tokens,
    activations=neuron_activations_for_all_layers
)

The next function shows which words each of the neurons activates most / least on (note that it requires some weird indexing to work correctly).

neuron_activations_for_all_layers_rearranged = utils.to_numpy(einops.rearrange(neuron_activations_for_all_layers, "seq layers neurons -> 1 layers seq neurons"))
cv.topk_tokens.topk_tokens(
    # Some weird indexing required here ¯\_(ツ)_/¯
    tokens=[gpt2_str_tokens],
    activations=neuron_activations_for_all_layers_rearranged,
    max_k=7,
    first_dimension_name="Layer",
    third_dimension_name="Neuron",
    first_dimension_labels=list(range(12))
)