5️⃣ Full Replication: Minimal Circuits and more
Learning Objectives
- Replicate most of the other results from the IOI paper
- Practice more open-ended, less guided coding
This section will be a lot more open-ended and challenging. You'll be given somewhat less guidance in the exercises.
Copying & writing direction results
We'll start this section by replicating the paper's analysis of the name mover heads and negative name mover heads. Our previous analysis should have pretty much convinced us that these heads are copying / negatively copying our indirect object token, but the results here show this with a bit more rigour.
Exercise - replicate writing direction results
Let's look at figure 3(c) from the paper. This plots the output of the strongest name mover and negative name mover heads against the attention probabilities for END attending to IO or S (color-coded).
Some clarifications:
* "Projection" here is being used synonymously with "dot product".
* We're projecting onto the name embedding, i.e. the embedding vector for the token which is being paid attention to. This is not the same as the logit diff (which we got by projecting the heads' output onto the difference between the unembedding vectors for IO and S).
* We're doing this because the question we're trying to answer is "does the attention head copy (or anti-copy) the names which it pays attention to?"
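To make this concrete, here's a minimal sketch of the dot product we mean by "projection" (the tensor names, shapes and values below are purely illustrative, not part of the exercise):

```python
import torch as t

batch, d_model = 4, 768
out_end = t.randn(batch, d_model)      # stand-in for a head's output vector at the END position
name_dir = t.randn(batch, d_model)     # stand-in for the direction of the name token being attended to
projection = (out_end * name_dir).sum(dim=-1)   # batched dot product, shape [batch]
```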
You should write code to replicate the paper's results in the cells below. We've given you a function which, given four 1D tensors storing the results for a particular head (i.e. the projections and attention probabilities, for the IO and S tokens respectively), generates a plot which looks like the one in the paper. Again, you'll know your code has worked if you can get results that resemble those in the paper.
def scatter_embedding_vs_attn(
attn_from_end_to_io: Float[Tensor, "batch"],
attn_from_end_to_s: Float[Tensor, "batch"],
projection_in_io_dir: Float[Tensor, "batch"],
projection_in_s_dir: Float[Tensor, "batch"],
layer: int,
head: int,
):
scatter(
x=t.concat([attn_from_end_to_io, attn_from_end_to_s], dim=0),
y=t.concat([projection_in_io_dir, projection_in_s_dir], dim=0),
color=["IO"] * N + ["S"] * N,
title=f"Projection of the output of {layer}.{head} along the name<br>embedding vs attention probability on name",
title_x=0.5,
labels={"x": "Attn prob on name", "y": "Dot w Name Embed", "color": "Name type"},
color_discrete_sequence=["#72FF64", "#C9A5F7"],
width=650,
)
def calculate_and_show_scatter_embedding_vs_attn(
layer: int,
head: int,
cache: ActivationCache = ioi_cache,
dataset: IOIDataset = ioi_dataset,
) -> None:
"""
Creates and plots a figure equivalent to 3(c) in the paper.
This should involve computing the four 1D tensors:
attn_from_end_to_io
attn_from_end_to_s
projection_in_io_dir
projection_in_s_dir
and then calling the scatter_embedding_vs_attn function.
"""
raise NotImplementedError()
calculate_and_show_scatter_embedding_vs_attn(9, 9) # name mover head 9.9
calculate_and_show_scatter_embedding_vs_attn(11, 10) # negative name mover head 11.10
Click to see the expected output
Solution
def scatter_embedding_vs_attn(
attn_from_end_to_io: Float[Tensor, "batch"],
attn_from_end_to_s: Float[Tensor, "batch"],
projection_in_io_dir: Float[Tensor, "batch"],
projection_in_s_dir: Float[Tensor, "batch"],
layer: int,
head: int,
):
scatter(
x=t.concat([attn_from_end_to_io, attn_from_end_to_s], dim=0),
y=t.concat([projection_in_io_dir, projection_in_s_dir], dim=0),
        color=["IO"] * N + ["S"] * N,
title=f"Projection of the output of {layer}.{head} along the name<br>embedding vs attention probability on name",
title_x=0.5,
labels={"x": "Attn prob on name", "y": "Dot w Name Embed", "color": "Name type"},
color_discrete_sequence=["#72FF64", "#C9A5F7"],
width=650,
)
def calculate_and_show_scatter_embedding_vs_attn(
layer: int,
head: int,
cache: ActivationCache = ioi_cache,
dataset: IOIDataset = ioi_dataset,
) -> None:
"""
Creates and plots a figure equivalent to 3(c) in the paper.
This should involve computing the four 1D tensors:
attn_from_end_to_io
attn_from_end_to_s
projection_in_io_dir
projection_in_s_dir
and then calling the scatter_embedding_vs_attn function.
"""
# Get the value written to the residual stream at the end token by this head
z = cache[utils.get_act_name("z", layer)][:, :, head] # [batch seq d_head]
N = z.size(0)
output = z @ model.W_O[layer, head] # [batch seq d_model]
output_on_end_token = output[t.arange(N), dataset.word_idx["end"]] # [batch d_model]
# Get the directions we'll be projecting onto
io_unembedding = model.W_U.T[dataset.io_tokenIDs] # [batch d_model]
s_unembedding = model.W_U.T[dataset.s_tokenIDs] # [batch d_model]
# Get the value of projections, by multiplying and summing over the d_model dimension
    projection_in_io_dir = (output_on_end_token * io_unembedding).sum(-1)  # [batch]
    projection_in_s_dir = (output_on_end_token * s_unembedding).sum(-1)  # [batch]
# Get attention probs, and index to get the probabilities from END -> IO / S
attn_probs = cache["pattern", layer][:, head] # [batch seqQ seqK]
attn_from_end_to_io = attn_probs[
t.arange(N), dataset.word_idx["end"], dataset.word_idx["IO"]
] # [batch]
attn_from_end_to_s = attn_probs[
t.arange(N), dataset.word_idx["end"], dataset.word_idx["S1"]
] # [batch]
# Show scatter plot
scatter_embedding_vs_attn(
attn_from_end_to_io,
attn_from_end_to_s,
projection_in_io_dir,
projection_in_s_dir,
layer,
head,
)
Exercise - replicate copying score results
Now let's do a different kind of test of the name mover heads' copying, by looking directly at the OV circuits.
From page 6 of the paper:
To check that the Name Mover Heads copy names generally, we studied what values are written via the heads’ OV matrix. Specifically, we first obtained the state of the residual stream at the position of each name token after the first MLP layer. Then, we multiplied this by the OV matrix of a Name Mover Head (simulating what would happen if the head attended perfectly to that token), multiplied by the unembedding matrix, and applied the final layer norm to obtain logit probabilities. We compute the proportion of samples that contain the input name token in the top 5 logits (N = 1000) and call this the copy score. All three Name Mover Heads have a copy score above 95%, compared to less than 20% for an average head.
Negative Name Mover Heads ... have a large negative copy score–the copy score calculated with the negative of the OV matrix (98% compared to 12% for an average head).
Note the similarity between their method and how we studied copying in induction heads, during an earlier set of exercises. However, there are differences (e.g. we're only looking at whether the head copies names, rather than whether it copies tokens in general).
You should replicate these results by completing the get_copying_scores function below.
You could do this by indexing from the ioi_cache, but a much more principled alternative would be to embed all the names in the NAMES list and apply operations like MLPs, layernorms and OV matrices manually. This is what the solutions do.
A few notes:
- You can use `model.to_tokens` to convert the names to tokens. Remember to use `prepend_bos=False`, since you just want the tokens of names so you can embed them. Note that this function will treat the list of names as a batch of single-token inputs, which works fine for our purposes.
- You can apply MLPs and layernorms as functions, by just indexing the model's blocks (e.g. use `model.blocks[i].mlp` or `model.blocks[j].ln1` as a function; see the short sketch after this list). Remember that `ln1` is the layernorm that comes before attention, and `ln2` comes before the MLP.
- Remember that you need to apply MLP0 before you apply the OV matrix (which is why we omit the 0th layer in our scores). The reason for this is that ablating MLP0 has a strangely large effect in gpt2-small relative to ablating other MLPs, possibly because it's acting as an extended embedding (see here for an explanation).
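For example, here's a minimal sketch of the "apply components as functions" idea (it mirrors the first few lines of the solution, so skip it if you don't want any hints):

```python
# Assumes `model` and `NAMES` are defined as earlier in this notebook
name_tokens = model.to_tokens(NAMES, prepend_bos=False)       # [batch, 1]
x = model.embed(name_tokens)                                  # [batch, 1, d_model]
x = x + model.blocks[0].mlp(model.blocks[0].ln2(x))           # residual stream after MLP0
```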
Also, you shouldn't expect to get exactly the same results as the paper (because some parts of this experiment have been set up slightly differently), although you probably shouldn't be off by more than 10%.
def get_copying_scores(
model: HookedTransformer, k: int = 5, names: list = NAMES
) -> Float[Tensor, "2 layer-1 head"]:
"""
Gets copying scores (both positive and negative) as described in page 6 of the IOI paper, for
every (layer, head) pair in the model.
Returns these in a 3D tensor (the first dimension is for positive vs negative).
Omits the 0th layer, because this is before MLP0 (which we're claiming acts as an extended
embedding).
"""
raise NotImplementedError()
copying_results = get_copying_scores(model)
imshow(
copying_results,
facet_col=0,
facet_labels=["Positive copying scores", "Negative copying scores"],
title="Copying scores of attention heads' OV circuits",
width=900,
)
heads = {"name mover": [(9, 9), (10, 0), (9, 6)], "negative name mover": [(10, 7), (11, 10)]}
for i, name in enumerate(["name mover", "negative name mover"]):
make_table(
title=f"Copying Scores ({name} heads)",
colnames=["Head", "Score"],
cols=[
list(map(str, heads[name])) + ["[dark_orange bold]Average"],
[f"{copying_results[i, layer - 1, head]:.2%}" for (layer, head) in heads[name]]
+ [f"[dark_orange bold]{copying_results[i].mean():.2%}"],
],
)
Click to see the expected output
Copying Scores (name mover heads)
┏━━━━━━━━━┳━━━━━━━━━┓
┃ Head    ┃ Score   ┃
┡━━━━━━━━━╇━━━━━━━━━┩
│ (9, 9)  │ 100.00% │
│ (10, 0) │ 94.50%  │
│ (9, 6)  │ 94.50%  │
│ Average │ 10.11%  │
└─────────┴─────────┘

Copying Scores (negative name mover heads)
┏━━━━━━━━━━┳━━━━━━━━━┓
┃ Head     ┃ Score   ┃
┡━━━━━━━━━━╇━━━━━━━━━┩
│ (10, 7)  │ 100.00% │
│ (11, 10) │ 100.00% │
│ Average  │ 1.93%   │
└──────────┴─────────┘
Solution
def get_copying_scores(
model: HookedTransformer, k: int = 5, names: list = NAMES
) -> Float[Tensor, "2 layer-1 head"]:
"""
Gets copying scores (both positive and negative) as described in page 6 of the IOI paper, for
every (layer, head) pair in the model.
Returns these in a 3D tensor (the first dimension is for positive vs negative).
Omits the 0th layer, because this is before MLP0 (which we're claiming acts as an extended
embedding).
"""
results = t.zeros((2, model.cfg.n_layers, model.cfg.n_heads), device=device)
# Define components from our model (for typechecking, and cleaner code)
embed: Embed = model.embed
mlp0: MLP = model.blocks[0].mlp
ln0: LayerNorm = model.blocks[0].ln2
unembed: Unembed = model.unembed
ln_final: LayerNorm = model.ln_final
# Get embeddings for the names in our list
name_tokens: Int[Tensor, "batch 1"] = model.to_tokens(names, prepend_bos=False)
    name_embeddings: Float[Tensor, "batch 1 d_model"] = embed(name_tokens)
# Get residual stream after applying MLP
resid_after_mlp1 = name_embeddings + mlp0(ln0(name_embeddings))
# Loop over all (layer, head) pairs
for layer in range(1, model.cfg.n_layers):
for head in range(model.cfg.n_heads):
# Get W_OV matrix
W_OV = model.W_V[layer, head] @ model.W_O[layer, head]
# Get residual stream after applying W_OV or -W_OV respectively
# (note, because of bias b_U, it matters that we do sign flip here, not later)
resid_after_OV_pos = resid_after_mlp1 @ W_OV
resid_after_OV_neg = resid_after_mlp1 @ -W_OV
# Get logits from value of residual stream
logits_pos = unembed(ln_final(resid_after_OV_pos)).squeeze() # [batch d_vocab]
logits_neg = unembed(ln_final(resid_after_OV_neg)).squeeze() # [batch d_vocab]
# Check how many are in top k
topk_logits: Int[Tensor, "batch k"] = t.topk(logits_pos, dim=-1, k=k).indices
in_topk = (topk_logits == name_tokens).any(-1)
# Check how many are in bottom k
bottomk_logits: Int[Tensor, "batch k"] = t.topk(logits_neg, dim=-1, k=k).indices
in_bottomk = (bottomk_logits == name_tokens).any(-1)
# Fill in results
results[:, layer - 1, head] = t.tensor(
[in_topk.float().mean(), in_bottomk.float().mean()]
)
return results
copying_results = get_copying_scores(model)
imshow(
copying_results,
facet_col=0,
facet_labels=["Positive copying scores", "Negative copying scores"],
title="Copying scores of attention heads' OV circuits",
width=900,
)
heads = {"name mover": [(9, 9), (10, 0), (9, 6)], "negative name mover": [(10, 7), (11, 10)]}
for i, name in enumerate(["name mover", "negative name mover"]):
make_table(
title=f"Copying Scores ({name} heads)",
colnames=["Head", "Score"],
cols=[
list(map(str, heads[name])) + ["[dark_orange bold]Average"],
[f"{copying_results[i, layer - 1, head]:.2%}" for (layer, head) in heads[name]]
+ [f"[dark_orange bold]{copying_results[i].mean():.2%}"],
],
)
Validation of early heads
There are three different kinds of heads which appear early in the circuit, which can be validated by looking at their attention patterns on simple random sequences of tokens. Can you figure out which three types these are, and how to validate them in this way?
Answer
Previous token heads, induction heads, and duplicate token heads.
We can validate them all at the same time, using sequences of n random tokens followed by those same n random tokens repeated. This works as follows:
* Previous token heads, by measuring the attention patterns with an offset of 1 (i.e. each token paying attention to the token immediately before it).
* Induction heads, by measuring the attention patterns with an offset of n-1 (i.e. the second instance of a token paying attention to the token after its first instance).
* Duplicate token heads, by measuring the attention patterns with an offset of n (i.e. a token paying attention to its previous instance).
In all three cases, if heads score close to 1 on these metrics, it's strong evidence that they are working as this type of head.
Note, it's a leaky abstraction to say things like "head X is an induction head", since we're only observing it on a certain distribution. For instance, it's not clear what the role of induction heads and duplicate token heads is when there are no duplicates (they could in theory do something completely different).
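As an aside, one compact way to read off "attention at a fixed offset" is with `torch.diagonal` (this isn't how the solution below indexes things, but it computes the same kind of quantity):

```python
import torch as t

# Illustrative only: a stand-in attention pattern for one head on a repeated sequence of length 2n
batch, n = 8, 50
pattern = t.rand(batch, 2 * n, 2 * n).softmax(dim=-1)   # [batch, seqQ, seqK]

# Offset of n (duplicate token heads): entries pattern[:, q, q - n] for q in the second half
dup_scores = t.diagonal(pattern, offset=-n, dim1=-2, dim2=-1)   # [batch, n]
print(dup_scores.mean().item())
```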
Exercise - perform head validation
Once you've read the answer in the dropdown above, you should perform this validation. The result should be a replication of Figure 18 in the paper.
We've provided a template for this function. Note the use of `typing.Literal`, which is how we indicate that the argument should be one of the specified options.
We've also provided a helper function generate_repeated_tokens (which is similar to the one you used from exercise set 1.2, except that it has no start token, to match the paper), and a helper function plot_early_head_validation_results which calls the get_attn_scores function and plots the results (in a way which should look like Figure 18). So it's just the get_attn_scores function that you need to fill in.
def generate_repeated_tokens(
model: HookedTransformer, seq_len: int, batch: int = 1
) -> Float[Tensor, "batch 2*seq_len"]:
"""
Generates a sequence of repeated random tokens (no start token).
"""
rep_tokens_half = t.randint(0, model.cfg.d_vocab, (batch, seq_len), dtype=t.int64)
rep_tokens = t.cat([rep_tokens_half, rep_tokens_half], dim=-1).to(device)
return rep_tokens
def get_attn_scores(
model: HookedTransformer,
seq_len: int,
batch: int,
head_type: Literal["duplicate", "prev", "induction"],
) -> Float[Tensor, "n_layers n_heads"]:
"""
Returns attention scores for sequence of duplicated tokens, for every head.
"""
raise NotImplementedError()
def plot_early_head_validation_results(seq_len: int = 50, batch: int = 50):
"""
Produces a plot that looks like Figure 18 in the paper.
"""
head_types = ["duplicate", "prev", "induction"]
results = t.stack(
[get_attn_scores(model, seq_len, batch, head_type=head_type) for head_type in head_types]
)
imshow(
results,
facet_col=0,
facet_labels=[
f"{head_type.capitalize()} token attention prob.<br>on sequences of random tokens"
for head_type in head_types
],
labels={"x": "Head", "y": "Layer"},
width=1300,
)
model.reset_hooks()
plot_early_head_validation_results()
Click to see the expected output
Solution
def generate_repeated_tokens(
model: HookedTransformer, seq_len: int, batch: int = 1
) -> Float[Tensor, "batch 2*seq_len"]:
"""
Generates a sequence of repeated random tokens (no start token).
"""
rep_tokens_half = t.randint(0, model.cfg.d_vocab, (batch, seq_len), dtype=t.int64)
rep_tokens = t.cat([rep_tokens_half, rep_tokens_half], dim=-1).to(device)
return rep_tokens
def get_attn_scores(
model: HookedTransformer,
seq_len: int,
batch: int,
head_type: Literal["duplicate", "prev", "induction"],
) -> Float[Tensor, "n_layers n_heads"]:
"""
Returns attention scores for sequence of duplicated tokens, for every head.
"""
rep_tokens = generate_repeated_tokens(model, seq_len, batch)
_, cache = model.run_with_cache(
rep_tokens, return_type=None, names_filter=lambda name: name.endswith("pattern")
)
# Get the right indices for the attention scores
if head_type == "duplicate":
src_indices = range(seq_len)
        dest_indices = range(seq_len, 2 * seq_len)
elif head_type == "prev":
src_indices = range(seq_len)
dest_indices = range(1, seq_len + 1)
elif head_type == "induction":
dest_indices = range(seq_len, 2 * seq_len)
src_indices = range(1, seq_len + 1)
results = t.zeros(model.cfg.n_layers, model.cfg.n_heads, device=device, dtype=t.float32)
for layer in range(model.cfg.n_layers):
for head in range(model.cfg.n_heads):
            attn_scores = cache["pattern", layer]  # [batch head seqQ seqK]
avg_attn_on_duplicates = attn_scores[:, head, dest_indices, src_indices].mean().item()
results[layer, head] = avg_attn_on_duplicates
return results
def plot_early_head_validation_results(seq_len: int = 50, batch: int = 50):
"""
Produces a plot that looks like Figure 18 in the paper.
"""
head_types = ["duplicate", "prev", "induction"]
results = t.stack(
[get_attn_scores(model, seq_len, batch, head_type=head_type) for head_type in head_types]
)
imshow(
results,
facet_col=0,
facet_labels=[
f"{head_type.capitalize()} token attention prob.<br>on sequences of random tokens"
for head_type in head_types
],
labels={"x": "Head", "y": "Layer"},
width=1300,
)
model.reset_hooks()
plot_early_head_validation_results()
Note - these figures suggest that it would be a useful bit of infrastructure to have a "wiki" for the heads of a model, giving their scores according to various metrics of head function, like the ones we've seen here. HookedTransformer makes this pretty easy to build, since just changing the name passed to HookedTransformer.from_pretrained gives a different model with the same interface, so the same code should work. If you want to make this, I'd love to see it!
As a proof of concept, I made a mosaic of all induction heads across the 40 models that were in HookedTransformer at the time.
Minimal Circuit
Background: faithfulness, completeness, and minimality
The authors developed three criteria for validating their circuit explanations: faithfulness, completeness and minimality. They are defined as follows:
- Faithful = the circuit can perform as well as the whole model
- Complete = the circuit contains all nodes used to perform the task
- Minimal = the circuit doesn't contain nodes irrelevant to the task
If all three criteria are met, then the circuit is considered a reliable explanation for model behaviour.
Exercise - can you see why each of these criteria is important? For each pair of criteria, is it possible for a circuit to meet both of them but fail the third? (If so, can you give an example?)
Answer
The naive circuit (containing the entire model) is trivially faithful and complete, but obviously not minimal. In general, the problem with non-minimal circuits is that they may not be mechanistically understandable, which defeats the purpose of this kind of circuit analysis.
Completeness implies faithfulness: if the circuit contains every node used to perform the task, then the nodes outside the circuit aren't involved in the task, so removing them can't change the model's performance on that task.
You might initially think faithfulness implies completeness, but this actually isn't true. Backup name mover heads illustrate this point. They are used in the task, and without understanding the role they play you'll have an incorrect model of reality (e.g. you'll think ablating the name mover heads would destroy performance, which turns out not to be true). So if you define a circuit that doesn't contain backup name mover heads then it will be faithful (the backup name mover heads won't be used) but not complete.
Summary:

- Faithful & complete, not minimal = possible (example: the naive circuit containing the whole model)
- Faithful & minimal, not complete = possible (example: a circuit missing the backup name mover heads)
- Complete & minimal, not faithful = impossible (since completeness implies faithfulness)

In the paper, the authors formalise these concepts. Faithfulness is equivalent to $|F(C) - F(M)|$ being small (where $C$ is our circuit, $M$ is the model, and $F$ is our performance metric function), and completeness is equivalent to $|F(C\backslash K) - F(M\backslash K)|$ being small for any subset $K \subset C$ (including the empty set, which shows that completeness implies faithfulness). Can you see how our circuit missing backup name mover heads violates this condition?
Answer
It violates the condition when $K$ is the set of name mover heads. $C \backslash K$ performs worse than $M \backslash K$, because the latter contains backup name mover heads while the former has lost its name mover heads.
Now that we've analysed most components of the circuit and we have a rough idea of how they work, the next step is to ablate everything except those core components and verify that the model still performs well.
This ablation is pretty massive - we're ablating everything except for the output of each of our key attention heads (e.g. duplicate token heads or S-inhibition heads) at a single sequence position (e.g. for the DTHs this is the S2 token, and for SIHs this is the end token). Given that our core circuit has 26 heads in total, and our sequences have length around 20 on average, this means we're ablating all but $(26 / 144) / 20 \approx 1\%$ of our attention heads' output (and the number of possible paths through the model is reduced by much more than this).
How do we ablate? We could use zero-ablation, but this actually has some non-obvious problems. Intuitively, heads might be "expecting" non-zero input, and setting their input to zero is an arbitrary choice which takes them off-distribution. You can think of this as adding a bias term to the head, which might mess up subsequent computation and lead to noisy results. We could instead use mean-ablation (i.e. set a head's output to its average output over `ioi_dataset`), but the problem here is that the mean over this dataset might still contain information relevant to solving the IOI task. For example, the is_duplicated flag which gets written to S2 will be present for all sequences, so averaging won't remove this information.
Can you think of a way to solve this problem? After you've considered this, you can use the dropdown below to see how the authors handled this.
Answer
We ablate with the mean of the ABC dataset rather than the IOI dataset. This removes the problem of the averages still containing information relevant to solving the IOI task.
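To make the zero-ablation vs mean-ablation distinction concrete, here's a minimal sketch of the two as hook functions acting on `z` (the names and shapes are illustrative; this isn't the hook you'll write for the exercise):

```python
# z has shape [batch, seq, n_heads, d_head]
def zero_ablate_head(z, hook, head: int):
    z[:, :, head] = 0.0   # off-distribution: downstream components never saw an all-zero input here
    return z

def mean_ablate_head(z, hook, head: int, mean_z):
    # mean_z: the head's average output over some reference dataset, broadcastable to z[:, :, head]
    z[:, :, head] = mean_z
    return z
```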
One other complication - the sentences have different templates, and the positions of tokens like S and IO are not consistent across these templates (we avoided this problem in previous exercises by choosing a very small set of sentences, where all the important tokens had the same indices). An example of two templates with different positions:
"Then, [B] and [A] had a long argument and after that [B] said to [A]"
"After the lunch [B] and [A] went to the [PLACE], and [B] gave a [OBJECT] to [A]"
Can you guess what the authors did to solve this problem?
Answer
They took the mean over each template rather than over the whole dataset, and used these values to ablate with.
In other words, when they performed ablation by patching in the output of a head (which has shape (batch, seq, d_model)), the value patched into the (i, j, k)-th element of this tensor would be the average value of the k-th element of the vector at sequence position j, for all sentences with the same template as the i-th sentence in the batch.
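Written as an equation: if $G(i)$ is the set of batch indices whose sentence shares the $i$-th sentence's template, then the value patched in is

$$\text{patched}_{i,j,k} \;=\; \frac{1}{|G(i)|} \sum_{i' \in G(i)} \text{output}_{i',j,k}$$

where $\text{output}$ is the head's output tensor of shape (batch, seq, d_model).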
Exercise - constructing the minimal circuit
You now have enough information to perform ablation on your model, to get the minimal circuit. Below, you can try to implement this yourself.
This exercise is very technically challenging, so you're welcome to skip it if it doesn't seem interesting to you. However, I recommend you have a read of the solution, to understand the rough contours of how this ablation works.
If you want to attempt this task, then you can start with the code below. We define two dictionaries, one mapping head types to the heads in the model which are of that type, and the other mapping head types to the sequence positions which we won't be ablating for those types of head.
CIRCUIT = {
"name mover": [(9, 9), (10, 0), (9, 6)],
"backup name mover": [(10, 10), (10, 6), (10, 2), (10, 1), (11, 2), (9, 7), (9, 0), (11, 9)],
"negative name mover": [(10, 7), (11, 10)],
"s2 inhibition": [(7, 3), (7, 9), (8, 6), (8, 10)],
"induction": [(5, 5), (5, 8), (5, 9), (6, 9)],
"duplicate token": [(0, 1), (0, 10), (3, 0)],
"previous token": [(2, 2), (4, 11)],
}
SEQ_POS_TO_KEEP = {
"name mover": "end",
"backup name mover": "end",
"negative name mover": "end",
"s2 inhibition": "end",
"induction": "S2",
"duplicate token": "S2",
"previous token": "S1+1",
}
To be clear, the things that we'll be mean-ablating are:

- Every head not in the `CIRCUIT` dict
- Every sequence position for the heads in the `CIRCUIT` dict, except for the sequence positions given by the `SEQ_POS_TO_KEEP` dict
And we'll be mean-ablating by replacing a head's output with the mean output for abc_dataset, over all sentences with the same template as the sentence in the batch. You can access the templates for a dataset using the dataset.groups attribute, which returns a list of tensors (each one containing the indices of sequences in the batch sharing the same template).
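For instance, here's a minimal sketch of averaging a single activation tensor template-by-template using `dataset.groups` (illustrative only; the solution's `compute_means_by_template` does this for every layer at once):

```python
# z: some activation of shape [batch, seq, n_heads, d_head], cached on abc_dataset
means = t.zeros_like(z)
for group in abc_dataset.groups:           # each `group` is a tensor of batch indices sharing one template
    means[group] = z[group].mean(dim=0)    # average over the sentences in that template group
```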
Now, you can try and complete the following function, which should add a permanent hook to perform this ablation whenever the model is run on the ioi_dataset (note that this hook will only make sense if the model is run on this dataset, so we should reset hooks if we run it on a different dataset).
Permanent hooks are a useful feature of TransformerLens. They behave just like regular hooks, except they aren't removed when you run the model (e.g. using `model.run_with_cache` or `model.run_with_hooks`). The only way to remove them is with:
model.reset_hooks(including_permanent=True)
You can add permanent hooks as follows:
model.add_hook(hook_name, hook_fn, is_permanent=True)
where hook_name can be a string or a filter function mapping strings to booleans.
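Here's a minimal usage sketch (the no-op hook and `some_tokens` are placeholders, not part of the exercise):

```python
def noop_hook(z, hook):
    return z   # placeholder hook function

model.add_hook(lambda name: name.endswith("z"), noop_hook, is_permanent=True)
logits = model(some_tokens)                    # the permanent hook runs here
model.reset_hooks()                            # ordinary reset: permanent hooks survive this
model.reset_hooks(including_permanent=True)    # this is what actually removes them
```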
def add_mean_ablation_hook(
model: HookedTransformer,
means_dataset: IOIDataset,
circuit: dict[str, list[tuple[int, int]]] = CIRCUIT,
seq_pos_to_keep: dict[str, str] = SEQ_POS_TO_KEEP,
is_permanent: bool = True,
) -> HookedTransformer:
"""
Adds a permanent hook to the model, which ablates according to the circuit and seq_pos_to_keep
dictionaries.
In other words, when the model is run on ioi_dataset, every head's output will be replaced with
the mean over means_dataset for sequences with the same template, except for a subset of heads
and sequence positions as specified by the circuit and seq_pos_to_keep dicts.
"""
raise NotImplementedError()
Hint (docstrings of some functions which will be useful for your main function)
def compute_means_by_template(
means_dataset: IOIDataset,
model: HookedTransformer
) -> Float[Tensor, "layer batch seq head_idx d_head"]:
'''
Returns the mean of each head's output over the means dataset. This mean is
computed separately for each group of prompts with the same template (these
are given by means_dataset.groups).
'''
pass
def get_heads_and_posns_to_keep(
means_dataset: IOIDataset,
model: HookedTransformer,
circuit: dict[str, list[tuple[int, int]]],
seq_pos_to_keep: dict[str, str],
) -> dict[int, Bool[Tensor, "batch seq head"]]:
'''
Returns a dictionary mapping layers to a boolean mask giving the indices of the
z output which shouldn't be mean-ablated.
The output of this function will be used for the hook function that does ablation.
'''
pass
def hook_fn_mask_z(
z: Float[Tensor, "batch seq head d_head"],
hook: HookPoint,
heads_and_posns_to_keep: dict[int, Bool[Tensor, "batch seq head"]],
means: Float[Tensor, "layer batch seq head d_head"],
) -> Float[Tensor, "batch seq head d_head"]:
'''
Hook function which masks the z output of a transformer head.
heads_and_posns_to_keep
dict created with the get_heads_and_posns_to_keep function. This tells
us where to mask.
means
Tensor of mean z values of the means_dataset over each group of prompts
with the same template. This tells us what values to mask with.
'''
pass
Once you fill in these three functions, completing the main function is simple. It should:

- Use `compute_means_by_template` to get the means
- Use `get_heads_and_posns_to_keep` to get the boolean mask
- Apply `functools.partial` to `hook_fn_mask_z`, using the outputs of the two previous functions, to get a hook function which performs the mean ablation
- Add this hook function to your model, as a permanent hook
Solution
def get_heads_and_posns_to_keep(
means_dataset: IOIDataset,
model: HookedTransformer,
circuit: dict[str, list[tuple[int, int]]],
seq_pos_to_keep: dict[str, str],
) -> dict[int, Bool[Tensor, "batch seq head"]]:
"""
Returns a dictionary mapping layers to a boolean mask giving the indices of the z output which
shouldn't be mean-ablated.
The output of this function will be used for the hook function that does ablation.
"""
heads_and_posns_to_keep = {}
batch_size, seq_len, n_heads = len(means_dataset), means_dataset.max_len, model.cfg.n_heads
for layer in range(model.cfg.n_layers):
mask = t.zeros(size=(batch_size, seq_len, n_heads))
for head_type, head_list in circuit.items():
seq_pos = seq_pos_to_keep[head_type]
indices = means_dataset.word_idx[seq_pos]
for layer_idx, head_idx in head_list:
if layer_idx == layer:
mask[range(batch_size), indices, head_idx] = 1
heads_and_posns_to_keep[layer] = mask.bool()
return heads_and_posns_to_keep
def hook_fn_mask_z(
z: Float[Tensor, "batch seq head d_head"],
hook: HookPoint,
heads_and_posns_to_keep: dict[int, Bool[Tensor, "batch seq head"]],
means: Float[Tensor, "layer batch seq head d_head"],
) -> Float[Tensor, "batch seq head d_head"]:
"""
Hook function which masks the z output of a transformer head.
heads_and_posns_to_keep
dict created with the get_heads_and_posns_to_keep function. This tells us where to mask.
means
        Tensor of mean z values of the means_dataset over each group of prompts with the same
        template. This tells us what values to mask with.
"""
# Get the mask for this layer, and add d_head=1 dimension so it broadcasts correctly
mask_for_this_layer = heads_and_posns_to_keep[hook.layer()].unsqueeze(-1).to(z.device)
# Set z values to the mean
z = t.where(mask_for_this_layer, z, means[hook.layer()])
return z
def compute_means_by_template(
means_dataset: IOIDataset, model: HookedTransformer
) -> Float[Tensor, "layer batch seq head_idx d_head"]:
"""
Returns the mean of each head's output over the means dataset. This mean is computed separately
for each group of prompts with the same template (these are given by means_dataset.groups).
"""
# Cache the outputs of every head
_, means_cache = model.run_with_cache(
means_dataset.toks.long(),
return_type=None,
names_filter=lambda name: name.endswith("z"),
)
# Create tensor to store means
n_layers, n_heads, d_head = model.cfg.n_layers, model.cfg.n_heads, model.cfg.d_head
batch, seq_len = len(means_dataset), means_dataset.max_len
means = t.zeros(size=(n_layers, batch, seq_len, n_heads, d_head), device=model.cfg.device)
# Get set of different templates for this data
for layer in range(model.cfg.n_layers):
z_for_this_layer = means_cache[utils.get_act_name("z", layer)] # [batch seq head d_head]
for template_group in means_dataset.groups:
z_for_this_template = z_for_this_layer[template_group]
z_means_for_this_template = einops.reduce(
z_for_this_template, "batch seq head d_head -> seq head d_head", "mean"
)
means[layer, template_group] = z_means_for_this_template
return means
def add_mean_ablation_hook(
model: HookedTransformer,
means_dataset: IOIDataset,
circuit: dict[str, list[tuple[int, int]]] = CIRCUIT,
seq_pos_to_keep: dict[str, str] = SEQ_POS_TO_KEEP,
is_permanent: bool = True,
) -> HookedTransformer:
"""
Adds a permanent hook to the model, which ablates according to the circuit and seq_pos_to_keep
dictionaries.
In other words, when the model is run on ioi_dataset, every head's output will be replaced with
the mean over means_dataset for sequences with the same template, except for a subset of heads
and sequence positions as specified by the circuit and seq_pos_to_keep dicts.
"""
model.reset_hooks(including_permanent=True)
# Compute the mean of each head's output on the ABC dataset, grouped by template
means = compute_means_by_template(means_dataset, model)
# Convert this into a boolean map
heads_and_posns_to_keep = get_heads_and_posns_to_keep(
means_dataset, model, circuit, seq_pos_to_keep
)
# Get a hook function which will patch in the mean z values for each head, at
# all positions which aren't important for the circuit
hook_fn = partial(hook_fn_mask_z, heads_and_posns_to_keep=heads_and_posns_to_keep, means=means)
# Apply hook
model.add_hook(lambda name: name.endswith("z"), hook_fn, is_permanent=is_permanent)
return model
To test whether your function works, you can use the function provided to you, and see if the logit difference from your implementation of the circuit matches this one:
import part41_indirect_object_identification.ioi_circuit_extraction as ioi_circuit_extraction
model = ioi_circuit_extraction.add_mean_ablation_hook(
model,
means_dataset=abc_dataset,
circuit=CIRCUIT,
seq_pos_to_keep=SEQ_POS_TO_KEEP,
)
ioi_logits_minimal = model(ioi_dataset.toks)
print(f"""
Avg logit diff (IOI dataset, using entire model): {logits_to_ave_logit_diff_2(ioi_logits_original):.4f}
Avg logit diff (IOI dataset, only using circuit): {logits_to_ave_logit_diff_2(ioi_logits_minimal):.4f}
""")
model = add_mean_ablation_hook(
model,
means_dataset=abc_dataset,
circuit=CIRCUIT,
seq_pos_to_keep=SEQ_POS_TO_KEEP,
)
ioi_logits_minimal = model(ioi_dataset.toks)
print(f"""
Avg logit diff (IOI dataset, using entire model): {logits_to_ave_logit_diff_2(ioi_logits_original):.4f}
Avg logit diff (IOI dataset, only using circuit): {logits_to_ave_logit_diff_2(ioi_logits_minimal):.4f}
""")
You should find that the logit difference only drops by a small amount, and is still high enough to represent a high likelihood ratio favouring the IO token over S.
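To interpret this number: since the final probabilities come from a softmax, a logit diff of $d$ between IO and S means the model rates IO $e^d$ times more likely than S, e.g.

$$\frac{P(\text{IO})}{P(\text{S})} = e^{\,\operatorname{logit}(\text{IO}) - \operatorname{logit}(\text{S})}, \qquad e^{3} \approx 20.$$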
Exercise - calculate minimality scores
We'll conclude this section by replicating figure 7 of the paper, which shows the minimality scores for the model.
Again, this exercise is very challenging and is designed to be done with minimal guidance. You will need to read the relevant sections of the paper which explain this plot: section 4 (experimental validation), from the start up to the end of section 4.2. Note that you won't need to perform the sampling algorithm described on page 10, because we're giving you the set $K$ for each component, in the form of the dictionary below (this is based on the information given in figure 20 of the paper, the "minimality sets" table).
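In equation form (this matches the hint docstrings and solution below), the quantity plotted for each head $v$, with its associated set $K_v$, is

$$\text{minimality}(v) = \frac{\big|\,F\big(C \setminus (K_v \cup \{v\})\big) - F\big(C \setminus K_v\big)\,\big|}{F(M)}$$

where $F$ is the average logit diff, $C$ is the core circuit, and $M$ is the full model.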
K_FOR_EACH_COMPONENT = {
(9, 9): set(),
(10, 0): {(9, 9)},
(9, 6): {(9, 9), (10, 0)},
(10, 7): {(11, 10)},
(11, 10): {(10, 7)},
(8, 10): {(7, 9), (8, 6), (7, 3)},
(7, 9): {(8, 10), (8, 6), (7, 3)},
(8, 6): {(7, 9), (8, 10), (7, 3)},
(7, 3): {(7, 9), (8, 10), (8, 6)},
(5, 5): {(5, 9), (6, 9), (5, 8)},
(5, 9): {(11, 10), (10, 7)},
(6, 9): {(5, 9), (5, 5), (5, 8)},
(5, 8): {(11, 10), (10, 7)},
(0, 1): {(0, 10), (3, 0)},
(0, 10): {(0, 1), (3, 0)},
(3, 0): {(0, 1), (0, 10)},
(4, 11): {(2, 2)},
(2, 2): {(4, 11)},
(11, 2): {(9, 9), (10, 0), (9, 6)},
(10, 6): {(9, 9), (10, 0), (9, 6), (11, 2)},
(10, 10): {(9, 9), (10, 0), (9, 6), (11, 2), (10, 6)},
(10, 2): {(9, 9), (10, 0), (9, 6), (11, 2), (10, 6), (10, 10)},
(9, 7): {(9, 9), (10, 0), (9, 6), (11, 2), (10, 6), (10, 10), (10, 2)},
(10, 1): {(9, 9), (10, 0), (9, 6), (11, 2), (10, 6), (10, 10), (10, 2), (9, 7)},
(11, 9): {(9, 9), (10, 0), (9, 6), (9, 0)},
(9, 0): {(9, 9), (10, 0), (9, 6), (11, 9)},
}
Also, given a dictionary minimality_scores (which maps heads to their scores), the following code will produce a plot that looks like the one from the paper:
def plot_minimal_set_results(minimality_scores: dict[tuple[int, int], float]):
"""
Plots the minimality results, in a way resembling figure 7 in the paper.
minimality_scores:
dict with elements like (9, 9): minimality score for head 9.9 (as described
in section 4.2 of the paper)
"""
CIRCUIT_reversed = {head: k for k, v in CIRCUIT.items() for head in v}
colors = [CIRCUIT_reversed[head].capitalize() + " head" for head in minimality_scores.keys()]
color_sequence = [px.colors.qualitative.Dark2[i] for i in [0, 1, 2, 5, 3, 6]] + ["#BAEA84"]
bar(
list(minimality_scores.values()),
x=list(map(str, minimality_scores.keys())),
labels={"x": "Attention head", "y": "Change in logit diff", "color": "Head type"},
color=colors,
template="ggplot2",
color_discrete_sequence=color_sequence,
bargap=0.02,
yaxis_tickformat=".0%",
legend_title_text="",
title="Plot of minimality scores (as percentages of full model logit diff)",
width=800,
hovermode="x unified",
)
Now, you should create the minimality_scores dictionary, and use the plot function given above to plot the results:
minimality_scores = {(9, 9): ...}
plot_minimal_set_results(minimality_scores)
# YOUR CODE HERE - create the `minimality_scores` dictionary, to be used in the plot function given above
Click to see the expected output
Hint (docstrings of some functions which will be useful for your main function)
def get_score(
model: HookedTransformer,
ioi_dataset: IOIDataset,
abc_dataset: IOIDataset,
K: set[tuple[int, int]],
C: dict[str, list[tuple[int, int]]],
) -> float:
'''
Returns the value F(C \ K), where F is the logit diff, C is the
core circuit, and K is the set of circuit components to remove.
'''
pass
def get_minimality_score(
model: HookedTransformer,
ioi_dataset: IOIDataset,
abc_dataset: IOIDataset,
v: tuple[int, int],
K: set[tuple[int, int]],
C: dict[str, list[tuple[int, int]]] = CIRCUIT,
) -> float:
'''
    Returns the value | F(C \ K_union_v) - F(C \ K) |, where F is
the logit diff, C is the core circuit, K is the set of circuit
components to remove, and v is a head (not in K).
'''
pass
def get_all_minimality_scores(
model: HookedTransformer,
ioi_dataset: IOIDataset = ioi_dataset,
abc_dataset: IOIDataset = abc_dataset,
k_for_each_component: dict = K_FOR_EACH_COMPONENT
) -> dict[tuple[int, int], float]:
'''
Returns dict of minimality scores for every head in the model (as
a fraction of F(M), the logit diff of the full model).
Warning - this resets all hooks at the end (including permanent).
'''
pass
minimality_scores = get_all_minimality_scores(model)
plot_minimal_set_results(minimality_scores)
The output of the third function can be plotted using the plotting function given above.
Solution
def get_score(
model: HookedTransformer,
ioi_dataset: IOIDataset,
abc_dataset: IOIDataset,
K: set[tuple[int, int]],
C: dict[str, list[tuple[int, int]]],
) -> float:
"""
Returns the value F(C \ K), where F is the logit diff, C is the core circuit, and K is the set
of circuit components to remove.
"""
C_excl_K = {k: [head for head in v if head not in K] for k, v in C.items()}
model = add_mean_ablation_hook(model, abc_dataset, C_excl_K, SEQ_POS_TO_KEEP)
logits = model(ioi_dataset.toks)
score = logits_to_ave_logit_diff_2(logits, ioi_dataset).item()
return score
def get_minimality_score(
model: HookedTransformer,
ioi_dataset: IOIDataset,
abc_dataset: IOIDataset,
v: tuple[int, int],
K: set[tuple[int, int]],
C: dict[str, list[tuple[int, int]]] = CIRCUIT,
) -> float:
"""
    Returns the value | F(C \ K_union_v) - F(C \ K) |, where F is the logit diff, C is the core
circuit, K is the set of circuit components to remove, and v is a head (not in K).
"""
assert v not in K
K_union_v = K | {v}
C_excl_K_score = get_score(model, ioi_dataset, abc_dataset, K, C)
C_excl_Kv_score = get_score(model, ioi_dataset, abc_dataset, K_union_v, C)
return abs(C_excl_K_score - C_excl_Kv_score)
def get_all_minimality_scores(
model: HookedTransformer,
ioi_dataset: IOIDataset = ioi_dataset,
abc_dataset: IOIDataset = abc_dataset,
k_for_each_component: dict = K_FOR_EACH_COMPONENT,
) -> dict[tuple[int, int], float]:
"""
Returns dict of minimality scores for every head in the model (as a fraction of F(M), the
logit diff of the full model).
Warning - this resets all hooks at the end (including permanent).
"""
# Get full circuit score F(M), to divide minimality scores by
model.reset_hooks(including_permanent=True)
logits = model(ioi_dataset.toks)
full_circuit_score = logits_to_ave_logit_diff_2(logits, ioi_dataset).item()
# Get all minimality scores, using the get_minimality_score function
minimality_scores = {}
for v, K in tqdm(k_for_each_component.items()):
score = get_minimality_score(model, ioi_dataset, abc_dataset, v, K)
minimality_scores[v] = score / full_circuit_score
model.reset_hooks(including_permanent=True)
return minimality_scores
minimality_scores = get_all_minimality_scores(model)
Note - your results won't be exactly the same as the paper's, because of random error (e.g. the order of importance of heads within each category might not be the same, especially for heads with a small effect on the model like the backup name mover heads). But they should be reasonably similar in their important features.