2️⃣ Finding induction heads

Learning Objectives

Understand what induction heads are, and the algorithm they are implementing

Inspect activation patterns to identify basic attention head patterns, and write your own functions to detect attention heads for you

Identify induction heads by looking at the attention patterns produced from a repeating random sequence

Introducing Our Toy Attention-Only Model

Here we introduce a toy 2L attention-only transformer trained specifically for today. Some changes to make them easier to interpret: - It has only attention blocks. - The positional embeddings are only added to the residual stream before calculating each key and query vector in the attention layers as opposed to the token embeddings - i.e. we compute queries as Q = (resid + pos_embed) @ W_Q + b_Q and same for keys, but values as V = resid @ W_V + b_V. This means that the residual stream can't directly encode positional information. - This turns out to make it way easier for induction heads to form, it happens 2-3x times earlier - see the comparison of two training runs here. (The bump in each curve is the formation of induction heads.) - The argument that does this below is positional_embedding_type="shortformer". - It has no MLP layers, no LayerNorms, and no biases. - There are separate embed and unembed matrices (i.e. the weights are not tied).

We now define our model with a HookedTransformerConfig object. This is similar to the Config object we used in the previous set of exercises, although it has a lot more features. You can look at the documentation page (Right-click, "Go to Definition" in VSCode) to see what the different arguments do.

cfg = HookedTransformerConfig(
    d_model=768,
    d_head=64,
    n_heads=12,
    n_layers=2,
    n_ctx=2048,
    d_vocab=50278,
    attention_dir="causal",
    attn_only=True,  # defaults to False
    tokenizer_name="EleutherAI/gpt-neox-20b",
    seed=398,
    use_attn_result=True,
    normalization_type=None,  # defaults to "LN", i.e. layernorm with weights & biases
    positional_embedding_type="shortformer",
)

Note that in the last section we had to define a tokenizer explicitly, and passed it into our model. But here, we just pass a tokenizer name, and the model will automatically create a tokenizer for us (under the hood, it calls AutoTokenizer.from_pretrained(tokenizer_name)).

Below, you'll load in your weights, with some boilerplate code to download your state dict from HuggingFace (you can do this for any model you've uploaded to HuggingFace yourself):

from huggingface_hub import hf_hub_download

REPO_ID = "callummcdougall/attn_only_2L_half"
FILENAME = "attn_only_2L_half.pth"

weights_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

Finally, we'll create our model and load in the weights:

model = HookedTransformer(cfg)
pretrained_weights = t.load(weights_path, map_location=device, weights_only=True)
model.load_state_dict(pretrained_weights)

Use the diagram at this link to remind yourself of the relevant hook names.

Exercise - visualise & inspect attention patterns

Difficulty: 🔴🔴⚪⚪⚪

Importance: 🔵🔵🔵⚪⚪

You should spend up to ~10 minutes on this exercise. It's important to be comfortable using circuitsvis, and the cache object.

This exercise should be very quick - you can reuse code from the previous section. You should look at the solution if you're still stuck after 5-10 minutes.

Visualise the attention patterns for both layers of your model, on the following prompt:

text = "We think that powerful, significantly superhuman machine intelligence is more likely than not to be created this century. If current machine learning techniques were scaled up to this level, we think they would by default produce systems that are deceptive or manipulative, and that no solid plans are known for how to avoid this."

logits, cache = model.run_with_cache(text, remove_batch_dim=True)

(Note that we've run the model on the string text, rather than on tokens like we did previously when creating a cache - this is something that HookedTransformer allows.)

Inspect the attention patterns. What do you notice about the attention heads?

You should spot three relatively distinctive basic patterns, which occur in multiple heads. What are these patterns, and can you guess why they might be present?

# YOUR CODE HERE - visualize attention

Click to see the expected output

Aside - what to do if your plots won't show up

A common mistake is to fail to pass the tokens in as arguments. If you do this, your attention patterns won't render.

If this isn't the problem, then it might be an issue with the Circuitsvis library. Rather than plotting inline, you can turn the output into str(...) and then save it as an HTML file, which can be downloaded and opened in a browser.

Discussion of results

We notice that there are three basic patterns which repeat quite frequently:

prev_token_heads, which attend mainly to the previous token (e.g. head 0.7)
current_token_heads, which attend mainly to the current token (e.g. head 1.6)
first_token_heads, which attend mainly to the first token (e.g. heads 0.3 or 1.4, although these are a bit less clear-cut than the other two)

The prev_token_heads and current_token_heads are perhaps unsurprising, because words that are close together in a sequence probably have a lot more mutual information (i.e. we could get quite far using bigram or trigram prediction).

The first_token_heads are a bit more surprising. The basic intuition here is that the first token in a sequence is often used as a resting or null position for heads that only sometimes activate (since our attention probabilities always have to add up to 1).

Solution

str_tokens = model.to_str_tokens(text)
for layer in range(model.cfg.n_layers):
    attention_pattern = cache["pattern", layer]
    display(cv.attention.attention_patterns(tokens=str_tokens, attention=attention_pattern))

Now that we've observed our three basic attention patterns, it's time to make detectors for those patterns!

Exercise - write your own detectors

Difficulty: 🔴🔴⚪⚪⚪

Importance: 🔵🔵🔵⚪⚪

You shouldn't spend more than 10-25 minutes on these exercises. These exercises aren't meant to be too challenging, just to get you thinking about how to characterize head behaviour. Use the hints if you're stuck.

You should fill in the functions below, which act as detectors for particular types of heads. Validate your detectors by comparing these results to the visual attention patterns above - summary statistics on their own can be dodgy, but are much more reliable if you can validate it by directly playing with the data.

Tasks like this are useful, because we need to be able to take our observations / intuitions about what a model is doing, and translate these into quantitative measures. As the exercises proceed, we'll be creating some much more interesting tools and detectors!

Note - there's no objectively correct answer for which heads are doing which tasks, and which detectors can spot them. You should just try and come up with something plausible-seeming, which identifies the kind of behaviour you're looking for. Don't spend too much time here looking for a perfect solution, just one that seems to roughly match up with your visual inspection of the attention patterns.

def current_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be current-token heads
    """
    raise NotImplementedError()


def prev_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be prev-token heads
    """
    raise NotImplementedError()


def first_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be first-token heads
    """
    raise NotImplementedError()


print("Heads attending to current token  = ", ", ".join(current_attn_detector(cache)))
print("Heads attending to previous token = ", ", ".join(prev_attn_detector(cache)))
print("Heads attending to first token    = ", ", ".join(first_attn_detector(cache)))

Hint

Try and compute the average attention probability along the relevant tokens. For instance, you can get the tokens just below the diagonal by using t.diagonal with appropriate offset parameter:

>>> arr = t.arange(9).reshape(3, 3)
>>> arr
tensor([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])

>>> arr.diagonal()
tensor([0, 4, 8])

>>> arr.diagonal(-1)
tensor([3, 7])

Remember that you should be using cache["pattern", layer] to get all the attention probabilities for a given layer, and then indexing on the 0th dimension to get the correct head.

Expected output (yours might vary slightly depending on method)

Heads attending to current token  =  0.9
Heads attending to previous token =  0.7
Heads attending to first token    =  0.3, 1.4, 1.10

Solution (one possible method)

Note - choosing score=0.4 as a threshold in the code below is a bit arbitrary, but it seems to work well enough. In this particular case, a threshold of 0.5 results in no head being classified as a current-token head.

def current_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be current-token heads
    """
    attn_heads = []
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            attention_pattern = cache["pattern", layer][head]
            # take avg of diagonal elements
            score = attention_pattern.diagonal().mean()
            if score > 0.4:
                attn_heads.append(f"{layer}.{head}")
    return attn_heads


def prev_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be prev-token heads
    """
    attn_heads = []
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            attention_pattern = cache["pattern", layer][head]
            # take avg of sub-diagonal elements
            score = attention_pattern.diagonal(-1).mean()
            if score > 0.4:
                attn_heads.append(f"{layer}.{head}")
    return attn_heads


def first_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be first-token heads
    """
    attn_heads = []
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            attention_pattern = cache["pattern", layer][head]
            # take avg of 0th elements
            score = attention_pattern[:, 0].mean()
            if score > 0.4:
                attn_heads.append(f"{layer}.{head}")
    return attn_heads

Compare the printouts to your attention visualisations above. Do they seem to make sense? As a bonus exercise, try inputting different text, and see how stable your results are. Do certain heads always get classified the same way?

Now, it's time to turn our attention to induction heads.

What are induction heads?

(Note: I use induction head to refer to the head in the second layer which attends to the 'token immediately after the copy of the current token', and induction circuit to refer to the circuit consisting of the composition of a previous token head in layer 0 and an induction head in layer 1)

Induction heads are the first sophisticated circuit we see in transformers! And are sufficiently interesting that we wrote another paper just about them.

An aside on why induction heads are a big deal

There's a few particularly striking things about induction heads:

They develop fairly suddenly in a phase change - from about 2B to 4B tokens we go from no induction heads to pretty well developed ones. This is a striking divergence from a 1L model see the comparison of in context learning performance curves for models with different layers and can be observed in much larger models (eg a 13B one)
- Phase changes are particularly interesting (and depressing) from an alignment perspective, because the prospect of a sharp left turn, or emergent capabilities like deception or situational awareness seems like worlds where alignment may be harder, and we get caught by surprise without warning shots or simpler but analogous models to test our techniques on.
They are responsible for a significant loss decrease - so much so that there's a visible bump in the loss curve when they develop (this change in loss can be pretty comparable to the increase in loss from major increases in model size, though this is hard to make an apples-to-apples comparison)
They seem to be responsible for the vast majority of in-context learning - the ability to use far back tokens in the context to predict the next token. This is a significant way in which transformers outperform older architectures like RNNs or LSTMs, and induction heads seem to be a big part of this.
The same core circuit seems to be used in a bunch of more sophisticated settings, such as translation or few-shot learning - there are heads that seem clearly responsible for those and which double as induction heads.

Again, you are strongly recommended to read the corresponding section of the glossary, before continuing (or this LessWrong post). In brief, however, the induction circuit consists of a previous token head in layer 0 and an induction head in layer 1, where the induction head learns to attend to the token immediately after copies of the current token via K-Composition with the previous token head.

Question - why couldn't an induction head form in a 1L model?

Answer

Because this would require a head which attends a key position based on the value of the token before it. Attention scores are just a function of the key token and the query token, and are not a function of other tokens.

(The attention pattern does in fact include effects from other tokens because of softmax - if another key token has a high attention score, softmax inhibits this pair. But this inhibition is symmetric across positions, so can't systematically favour the token next to the relevant one.)

Note that a key detail is that the value of adjacent tokens are (approximately) unrelated - if the model wanted to attend based on relative position this is easy.

Checking for the induction capability

A striking thing about models with induction heads is that, given a repeated sequence of random tokens, they can predict the repeated half of the sequence. This is nothing like its training data, so this is kind of wild! The ability to predict this kind of out of distribution generalisation is a strong point of evidence that you've really understood a circuit.

To check that this model has induction heads, we're going to run it on exactly that, and compare performance on the two halves - you should see a striking difference in the per token losses.

Note - we're using small sequences (and just one sequence), since the results are very obvious and this makes it easier to visualise. In practice we'd obviously use larger ones on more subtle tasks. But it's often easiest to iterate and debug on small tasks.

Exercise - plot per-token loss on repeated sequence

Difficulty: 🔴🔴⚪⚪⚪

Importance: 🔵🔵⚪⚪⚪

You shouldn't spend more than 10-15 minutes on these exercises.

You should fill in the functions below. We've given you the first line of the first function, which defines a prefix (remember we need the BOS token for GPT-2, since it was trained to have one). We've also given you the get_log_probs function from the previous set of exercises.

def generate_repeated_tokens(
    model: HookedTransformer, seq_len: int, batch_size: int = 1
) -> Int[Tensor, "batch_size full_seq_len"]:
    """
    Generates a sequence of repeated random tokens

    Outputs are:
        rep_tokens: [batch_size, 1+2*seq_len]
    """
    t.manual_seed(0)  # for reproducibility
    prefix = (t.ones(batch_size, 1) * model.tokenizer.bos_token_id).long()


def run_and_cache_model_repeated_tokens(
    model: HookedTransformer, seq_len: int, batch_size: int = 1
) -> tuple[Tensor, Tensor, ActivationCache]:
    """
    Generates a sequence of repeated random tokens, and runs the model on it, returning (tokens,
    logits, cache). This function should use the `generate_repeated_tokens` function above.

    Outputs are:
        rep_tokens: [batch_size, 1+2*seq_len]
        rep_logits: [batch_size, 1+2*seq_len, d_vocab]
        rep_cache: The cache of the model run on rep_tokens
    """
    raise NotImplementedError()


def get_log_probs(
    logits: Float[Tensor, "batch posn d_vocab"], tokens: Int[Tensor, "batch posn"]
) -> Float[Tensor, "batch posn-1"]:
    logprobs = logits.log_softmax(dim=-1)
    # We want to get logprobs[b, s, tokens[b, s+1]], in eindex syntax this looks like:
    correct_logprobs = eindex(logprobs, tokens, "b s [b s+1]")
    return correct_logprobs


seq_len = 50
batch_size = 1
(rep_tokens, rep_logits, rep_cache) = run_and_cache_model_repeated_tokens(model, seq_len, batch_size)
rep_cache.remove_batch_dim()
rep_str = model.to_str_tokens(rep_tokens)
model.reset_hooks()
log_probs = get_log_probs(rep_logits, rep_tokens).squeeze()

print(f"Performance on the first half: {log_probs[:seq_len].mean():.3f}")
print(f"Performance on the second half: {log_probs[seq_len:].mean():.3f}")

plot_loss_difference(log_probs, rep_str, seq_len)

Click to see the expected output

Hint

You can define the first half of the repeated tokens using t.randint(low, high, shape). Also remember to specify dtype=t.long.

Then you can concatenate together your prefix and two copies of the repeated tokens, using t.concat.

Solution

def generate_repeated_tokens(
    model: HookedTransformer, seq_len: int, batch_size: int = 1
) -> Int[Tensor, "batch_size full_seq_len"]:
    """
    Generates a sequence of repeated random tokens

    Outputs are:
        rep_tokens: [batch_size, 1+2*seq_len]
    """
    t.manual_seed(0)  # for reproducibility
    prefix = (t.ones(batch_size, 1) * model.tokenizer.bos_token_id).long()
    rep_tokens_half = t.randint(0, model.cfg.d_vocab, (batch_size, seq_len), dtype=t.int64)
    rep_tokens = t.cat([prefix, rep_tokens_half, rep_tokens_half], dim=-1).to(device)
    return rep_tokens


def run_and_cache_model_repeated_tokens(
    model: HookedTransformer, seq_len: int, batch_size: int = 1
) -> tuple[Tensor, Tensor, ActivationCache]:
    """
    Generates a sequence of repeated random tokens, and runs the model on it, returning (tokens,
    logits, cache). This function should use the `generate_repeated_tokens` function above.

    Outputs are:
        rep_tokens: [batch_size, 1+2*seq_len]
        rep_logits: [batch_size, 1+2*seq_len, d_vocab]
        rep_cache: The cache of the model run on rep_tokens
    """
    rep_tokens = generate_repeated_tokens(model, seq_len, batch_size)
    rep_logits, rep_cache = model.run_with_cache(rep_tokens)
    return rep_tokens, rep_logits, rep_cache

Looking for Induction Attention Patterns

The next natural thing to check for is the induction attention pattern.

First, go back to the attention patterns visualisation code from earlier (i.e. cv.attention.attention_heads or attention_patterns) and manually check for likely heads in the second layer. Which ones do you think might be serving as induction heads?

Note - above, we defined the rep_str object for you, so you can use it in your circuitsvis functions.

# YOUR CODE HERE - display the attention patterns stored in `rep_cache`, for each layer

Click to see the expected output

Some observations

The characteristic pattern of induction heads is a diagonal stripe, with the diagonal offset as seq_len-1 (because the destination token attends to the token after the destination token's previous occurrence).

You should see that heads 4 and 10 are strongly induction-y, head 6 is very weakly induction-y, and the rest aren't.

Solution

for layer in range(model.cfg.n_layers):
    attention_pattern = rep_cache["pattern", layer]
    display(cv.attention.attention_patterns(tokens=rep_str, attention=attention_pattern))

Exercise - make an induction-head detector

Difficulty: 🔴🔴⚪⚪⚪

Importance: 🔵🔵⚪⚪⚪

You shouldn't spend more than 5-15 minutes on this exercise. This exercise should be very similar to the earlier detector exercises (the only difference being how you index attention).

Now, you should make an induction pattern score function, which looks for the average attention paid to the offset diagonal. Do this in the same style as our earlier head scorers, just with a different kind of indexing that is appropriate for detecting the characteristic attention head pattern.

def induction_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be induction heads

    Remember - the tokens used to generate rep_cache are (bos_token, *rand_tokens, *rand_tokens)
    """
    raise NotImplementedError()


print("Induction heads = ", ", ".join(induction_attn_detector(rep_cache)))

Help - I'm not sure what offset to use.

The offset in your diagonal should be -(seq_len-1) (where seq_len is the length of the random tokens which you repeat twice), because the second instance of random token T will attend to the token after the first instance of T.

Solution

def induction_attn_detector(cache: ActivationCache) -> list[str]:
    """
    Returns a list e.g. ["0.2", "1.4", "1.9"] of "layer.head" which you judge to be induction heads

    Remember - the tokens used to generate rep_cache are (bos_token, *rand_tokens, *rand_tokens)
    """
    attn_heads = []
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            attention_pattern = cache["pattern", layer][head]
            # take avg of (-seq_len+1)-offset elements
            seq_len = (attention_pattern.shape[-1] - 1) // 2
            score = attention_pattern.diagonal(-seq_len + 1).mean()
            if score > 0.4:
                attn_heads.append(f"{layer}.{head}")
    return attn_heads

If this function works as expected, then you should see output that matches your observations from circuitsvis (i.e. the heads which you observed to be induction heads are being classified as induction heads by your function here).