4️⃣ Steering Vectors in GPT2-XL

Learning Objectives
  • Understand the goals & main results from Alex Turner et al.'s work on steering vectors
  • Reproduce the changes in behaviour described in their initial post

Note: GPT2-XL is not hosted remotely by NNsight at the moment. If you use GPT2-XL, we recommend setting REMOTE = False. Otherwise, you can use one of the remotely hosted models (see https://nnsight.net/status/) and set REMOTE = True. You might want to run del model and gc.collect() before loading in a new model, to free up memory.
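For example, here's a minimal sketch of freeing up memory before loading a new model (assuming you've already loaded one into a variable called model, and that t is torch, as elsewhere in these exercises):

import gc

del model             # drop the reference to the old model
gc.collect()          # let Python reclaim the memory
t.cuda.empty_cache()  # release cached GPU memory back to the driver (only relevant on CUDA)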

Steering model behaviour

In the final non-bonus exercise of the previous section, we touched on the idea of using function vectors to induce behavioural changes in the model's completions, rather than specifically making it solve zero-shot or corrupted prompts with the right completion. In these next exercises, we'll explore this kind of work in more detail. We'll be primarily using Turner et al's work on Steering GPT-2-XL by adding an activation vector.

Summary of the ways in which this work differs from the function vectors work we've done so far:

  • Function vectors focused on the model performing a particular function (e.g. mapping a word to its opposite), whereas this work focuses on behavioural changes (e.g. completing a prompt which has negative tone in a positive way).
  • The function vectors work looked at very large models (our exercises used Pythia-7B, the smallest model examined in the function vectors paper). This particular steering vectors post focused on the smaller models GPT2-Small (85M) and GPT2-XL (1.5B). We'll be focusing on GPT2-XL.
  • The second half of our function vectors work identified important attention heads and focused on their outputs, rather than just adding to the residual stream directly. In this steering vector setup, we'll go back to the simpler method of adding directly into the residual stream.

Despite these differences, much of the work done here overlaps with the function vectors work, since both fall into the broader category of "finding vectors using forward-pass-based methods (i.e. not with SGD) and using them to intervene on models during forward passes & change the model's output". This description would also include the following:

  • Inference-time intervention, which focuses on inducing the behavioural change of "making the model tell the truth". It also looks at other non-forward-pass-based techniques for finding an intervention vector, e.g. CCS and linear probing, although it concludes that forward-pass-based methods similar to the ones we've been using so far work the best.
  • Steering Llama 2 via Contrastive Activation Addition, which can be thought of as an extension of the GPT2-XL steering vector work to larger models, specifically Llama 2 13B. It also adopts more of a high-level evals framework, measuring the model's change in attributes such as sycophancy, myopia, and power-seeking (and finding that these attributes can be increased or decreased by adding the appropriate vectors).

We'll discuss some of this work more in the bonus section, but for now, let's get on with the exercises!

First, we'll load in GPT2-XL, then we'll replicate some of the examples in the main post.

gpt2_xl = LanguageModel("gpt2-xl", device_map="auto", torch_dtype=t.bfloat16)
tokenizer = gpt2_xl.tokenizer

REMOTE = False
# If you are using gpt2_xl, set REMOTE = False as gpt2_xl is not hosted remotely by nnsight. You can
# set REMOTE = True for a remotely hosted model here (https://nnsight.net/status/)

Exercise - replicate the steering vector results

Difficulty: 🔴🔴🔴🔴🔴
Importance: 🔵🔵🔵🔵⚪
You should spend up to 30-50 minutes on this exercise.

Replicate the results in the LessWrong post Steering GPT-2-XL by adding an activation vector; specifically the "demonstrations of additions that work well" section.

Read the "How activation additions work" section of Steering GPT-2-XL by adding an activation vector to understand how vectors are extracted and added. We've provided a function template as well as some example code to run; your main job will be to fill in the function. This will be like a hybrid of several previous exercises (with most similarity to the function calculate_and_intervene_with_h), although there will be a few methodological differences.

This is the last exercise in this set, and hopefully it'll provide an opportunity to draw together all the threads of what you've learned so far!

Caching

This is a different kind of causal intervention than we performed in previous sections. Rather than adding a single vector to the final sequence position at each token generation, we're adding a slice of vectors to the first sequence positions of the original prompt (see tables like in this section for an illustration). How do you think this will affect our function? Should we still cache? Should we be using .generate() or .trace()? If using .generate(), do we need to call model.next()?

Click this dropdown for answers to the questions above.

Rather than adding to each final sequence position for every token generated, we just add the vectors once, to the end of the prompt. This means that:

- We can still use caching (because the values we cache shouldn't be different in subsequent token generations),
- We should be using .generate() (because we're doing multi-token generation),
- We don't need to call model.next() (because we only intervene once, and our intervention will be cached & applied to all subsequent tokens which are generated).

Again, if any of this is confusing then please ask a TA or message in the Slack channel.
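As a reference point, here's a minimal sketch of the structure this implies, using the same nnsight patterns as earlier exercises (the prompt and token count are just placeholders):

with gpt2_xl.generate(max_new_tokens=20, remote=REMOTE) as generator:
    with generator.invoke("I hate you because"):
        # Your single intervention on gpt2_xl.transformer.h[layer].output[0] goes here. No
        # model.next() calls are needed - the edit is made once to the prompt's sequence
        # positions and is cached for every subsequent generated token.
        completion = gpt2_xl.generator.output.save()

print(tokenizer.batch_decode(completion))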

Padding

The tables show the activations being added on the left (i.e. the sequences are padded on the right), but by default padding is applied on the left. There are 2 possible ways you can get around this:

  1. Right-pad the input sequences manually, i.e. use something like len(tokenizer.tokenize(prompt)) to see how long each of the prompts is, and add copies of tokenizer.pad_token to the end of each sequence.
  2. Don't manually pad the input sequences; instead, slice the activations you add to the original prompt from the right-hand side of the activation addition sequences, rather than from the left side.

The solutions use (2), but you can use either of these methods.
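To make this concrete, here's a rough sketch of method (2), under the assumption (matching the solution below) that the batch of activation addition prompts is left-padded:

act_add_prompts = ["I talk about weddings constantly  ", "I do not talk about weddings constantly"]
act_add_seq_lens = [len(tokenizer.tokenize(p)) for p in act_add_prompts]
print(act_add_seq_lens)  # these should all be equal (hence the trailing spaces in the first prompt)

# With left-padding, the real (non-padding) tokens of each activation addition sequence occupy its
# *last* seq_len positions, so inside the forward pass you'd slice the layer output like
#     model.transformer.h[layer].output[0][i, -seq_len:]
# and later add this (scaled by its coefficient) onto the *first* seq_len positions of the prompt.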

Sampling

Following the post, we'll use top-p sampling with probability 0.3 to generate our sequences. We'll also use a small frequency penalty to penalize repetition (so the model is less likely to get stuck in loops). If you've done earlier exercises in this section then you might have implemented freq_penalty during sampling; this is supported by TransformerLens models, but HuggingFace uses the somewhat similar repetition_penalty (its default value of 1.0 means no penalty, and values higher than 1.0 penalize repeated tokens).

We apply these sampling methods by passing keyword arguments into the generate method:

{
    "do_sample": True, # necessary whenever we're sampling rather than doing greedy decoding
    "top_p": 0.3,
    "repetition_penalty": 1.1,
}

Note that the sequences are generated stochastically rather than greedily - this means we'll get different results if we input multiple copies of the same sequence. We've given you the n_comparisons argument in the function below, i.e. you should generate this many steered completions and this many unsteered completions.
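For instance, here's a rough sketch of how the sampling kwargs and n_comparisons fit together (the prompt is just a placeholder, and the steering intervention itself is omitted):

sampling_kwargs = {"do_sample": True, "top_p": 0.3, "repetition_penalty": 1.2}
prompts = ["I went up to my friend and said"] * 3  # n_comparisons = 3 independent samples

with gpt2_xl.generate(max_new_tokens=50, remote=REMOTE, **sampling_kwargs) as generator:
    with generator.invoke(prompts):
        completions = gpt2_xl.generator.output.save()

for completion in tokenizer.batch_decode(completions):
    print(completion, "\n")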

Other tips / notes

We recommend starting with example #9 (the "talking about weddings" one). It seems quite robust to the exact conditions of the forward pass, unlike the Love - Hate example. You can use any of the template cells we've given you below.

We've given you a use_bos argument; if this is True then you should prepend tokenizer.bos_token to all of the prompts. This is just to be faithful to the LessWrong post's implementation; it won't change behaviour much and you can probably ignore it and still get good results.

SAMPLING_KWARGS = {
    "do_sample": True,
    "top_p": 0.3,
    "repetition_penalty": 1.2,
}


def calculate_and_apply_steering_vector(
    model: LanguageModel,
    prompt: str,
    activation_additions: list[tuple[int, float, str]],
    n_tokens: int,
    n_comparisons: int = 1,
    use_bos: bool = True,
) -> tuple[list[str], list[str]]:
    """
    Performs the steering vector experiments described in the LessWrong post.

    Args:
        model: LanguageModel
            the transformer you're doing this computation with
        prompt: str
            The original prompt, which we'll be doing activation steering on.

        activation_additions: list[tuple[int, float, str]], each tuple contains:
            layer - the layer we're applying these steering vectors to
            coefficient - the value we're multiplying it by
            prompt - the prompt we're inputting
            e.g. activation_additions[0] = [6, 5.0, " Love"] means we add the " Love" vector at
            layer 6, scaled by 5x

        n_tokens: int
            Number of tokens which will be generated for each completion

        n_comparisons: int
            Number of sequences generated in this function (i.e. we generate `n_comparisons` which
            are unsteered, and the same number which are steered).

    Returns:
        unsteered_completions: list[str]
            List of length `n_comparisons`, containing all the unsteered completions.

        steered_completions: list[str]
            List of length `n_comparisons`, containing all the steered completions.
    """
    # Add the BOS token manually, if we're including it
    if use_bos:
        bos = model.tokenizer.bos_token
        prompt = bos + prompt
        activation_additions = [[layer, coeff, bos + p] for layer, coeff, p in activation_additions]

    raise NotImplementedError()
Solution
SAMPLING_KWARGS = {
    "do_sample": True,
    "top_p": 0.3,
    "repetition_penalty": 1.2,
}
def calculate_and_apply_steering_vector(
    model: LanguageModel,
    prompt: str,
    activation_additions: list[tuple[int, float, str]],
    n_tokens: int,
    n_comparisons: int = 1,
    use_bos: bool = True,
) -> tuple[list[str], list[str]]:
    """
    Performs the steering vector experiments described in the LessWrong post.
    Args:
        model: LanguageModel
            the transformer you're doing this computation with
        prompt: str
            The original prompt, which we'll be doing activation steering on.
        activation_additions: list[tuple[int, float, str]], each tuple contains:
            layer - the layer we're applying these steering vectors to
            coefficient - the value we're multiplying it by
            prompt - the prompt we're inputting
            e.g. activation_additions[0] = [6, 5.0, " Love"] means we add the " Love" vector at
            layer 6, scaled by 5x
        n_tokens: int
            Number of tokens which will be generated for each completion
        n_comparisons: int
            Number of sequences generated in this function (i.e. we generate n_comparisons which
            are unsteered, and the same number which are steered).
    Returns:
        unsteered_completions: list[str]
            List of length n_comparisons, containing all the unsteered completions.
        steered_completions: list[str]
            List of length n_comparisons, containing all the steered completions.
    """
    # Add the BOS token manually, if we're including it
    if use_bos:
        bos = model.tokenizer.bos_token
        prompt = bos + prompt
        activation_additions = [[layer, coeff, bos + p] for layer, coeff, p in activation_additions]
    # Get the (layers, coeffs, prompts) in an easier form to use, also calculate the prompt lengths
    # and check they're all the same
    act_add_layers, act_add_coeffs, act_add_prompts = zip(*activation_additions)
    act_add_seq_lens = [len(tokenizer.tokenize(p)) for p in act_add_prompts]
    assert len(set(act_add_seq_lens)) == 1, (
        "All activation addition prompts must be the same length."
    )
    assert act_add_seq_lens[0] <= len(tokenizer.tokenize(prompt)), (
        "All act_add prompts should be no longer than the original prompt."
    )

    # Get the prompts we'll intervene on (unsteered and steered)
    steered_prompts = [prompt for _ in range(n_comparisons)]
    unsteered_prompts = [prompt for _ in range(n_comparisons)]

    with model.generate(max_new_tokens=n_tokens, remote=REMOTE, **SAMPLING_KWARGS) as generator:
        # Run the act_add prompts (i.e. the contrast pairs), and extract their activations
        with generator.invoke(act_add_prompts):
            # Extract the activation vectors for each act_add prompt, and put them in a list
            # (note, we slice from the end of the sequence because of left-padding)
            act_add_vectors = [
                model.transformer.h[layer].output[0][i, -seq_len:]
                for i, (layer, seq_len) in enumerate(zip(act_add_layers, act_add_seq_lens))
            ]

        # Forward pass on unsteered prompts (no intervention, no activations saved - we only need
        # the completions)
        with generator.invoke(unsteered_prompts):
            unsteered_out = model.generator.output.save()

        # Forward pass on steered prompts (we add in the results from the act_add prompts)
        with generator.invoke(steered_prompts):
            # For each act_add prompt, add the vector to the residual stream, at the start of the seq
            for i, (layer, coeff, seq_len) in enumerate(
                zip(act_add_layers, act_add_coeffs, act_add_seq_lens)
            ):
                model.transformer.h[layer].output[0][:, :seq_len] += coeff * act_add_vectors[i]
            steered_out = model.generator.output.save()

    # Decode steered & unsteered completions (discarding the sequences we only used for extracting
    # activations) & return results
    unsteered_completions = tokenizer.batch_decode(unsteered_out[-n_comparisons:])
    steered_completions = tokenizer.batch_decode(steered_out[-n_comparisons:])

    return unsteered_completions, steered_completions

To test your function, use any of the following code snippets (as mentioned, we recommend starting with the weddings example, since the results tend to be pretty robust).

unsteered_completions, steered_completions = calculate_and_apply_steering_vector(
    gpt2_xl,
    prompt="I hate you because",
    activation_additions=[(6, +8.0, "Love "), (6, -8.0, "Hate")],
    n_tokens=50,
    n_comparisons=3,
    use_bos=True,
)

table = Table("Unsteered", "Steered", title="Completions", show_lines=True)
for usc, sc in zip(unsteered_completions, steered_completions):
    table.add_row(usc, sc)
rprint(table)
Click to see the expected output
                                                    Completions                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Unsteered                                              ┃ Steered                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ <|endoftext|>I hate you because I'm a feminist.        │ <|endoftext|>I hate you because you're so much more    │
│                                                        │ than just a piece of paper.                            │
│ You're not a feminist, but that doesn't mean you don't │ You are the most amazing, wonderful, beautiful thing   │
│ understand what feminism is about and why it's         │ in this world and I want to be with you forever.  Love │
│ important to me. You may even be one of the people who │ is what makes us human.  We can't help                 │
│ helped make this movement possible. And                │                                                        │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ <|endoftext|>I hate you because I love you.            │ <|endoftext|>I hate you because you're not a perfect   │
│                                                        │ person.                                                │
│ The first time I saw this song, it was on a            │                                                        │
│ compilation album called "Funkadelic in the 80s." It's │ You are a good, kind and loving person. You have been  │
│ one of those songs that is so iconic and instantly     │ hurt by people who were meant to love you. You are     │
│ recognizable to anyone who has ever been               │ hurting right now. You are hurting for your family,    │
│                                                        │ friends and loved ones.                                │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ <|endoftext|>I hate you because I'm not a feminist.    │ <|endoftext|>I hate you because I know that we will    │
│                                                        │ never be apart.                                        │
│ This is the most common response to any discussion of  │                                                        │
│ feminism, and it's one that has been repeated so often │ We are not alone in this world, and we can't wait to   │
│ in my life that I've become numb to it. It's also the  │ see what's next!                                       │
│ reaction that I get when                               │                                                        │
│                                                        │ Love is the most powerful force in the universe,       │
│                                                        │                                                        │
│                                                        │ and love is all around us                              │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘
unsteered_completions, steered_completions = calculate_and_apply_steering_vector(
    gpt2_xl,
    prompt="I went up to my friend and said",
    activation_additions=[
        (20, +4.0, "I talk about weddings constantly  "),
        (20, -4.0, "I do not talk about weddings constantly"),
    ],
    n_tokens=50,
    n_comparisons=3,
    use_bos=False,
)

table = Table("Unsteered", "Steered", title="Completions", show_lines=True)
for usc, sc in zip(unsteered_completions, steered_completions):
    table.add_row(usc, sc)
rprint(table)
Click to see the expected output
                                                    Completions                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Unsteered                                              ┃ Steered                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ I went up to my friend and said, 'You know what? I'm   │ I went up to my friend and said, "You know what? I     │
│ not going to be a part of this anymore.' And he was    │ think this is a wedding !" And that's how it started.  │
│ like, 'Why?' And I was like, 'Because you're gay.'"    │                                                        │
│                                                        │ What do you love about weddings ? What are some of     │
│ "He's the most supportive person that I've ever        │ your favorite wedding                                  │
│                                                        │ topics?<|endoftext|><|endoftext|><|endoftext|><|endof… │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ I went up to my friend and said, 'Hey, I'm going to    │ I went up to my friend and said, "You know what? I'm   │
│ the hospital.' And he goes, 'What? You're not going to │ going to talk about weddings all the time. I've been   │
│ the hospital?' And I go, 'No. It's a medical           │ doing it for years." And they're like, "Yeah, but you  │
│ emergency.'"                                           │ don't have a wedding right now."                       │
│                                                        │                                                        │
│ The doctor told her she had a stroke.                  │ And then I started talking                             │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ I went up to my friend and said, 'Hey, I think you're  │ I went up to my friend and said, "Hey, I'm going to    │
│ the guy who did this.'"                                │ talk about the wedding !"  So I talked about weddings  │
│                                                        │ all day long.  It was a very emotional day for me      │
│ The next day, a man came into his office. He was upset │ because I had been talking about it with my husband    │
│ that he had been fired from his job at an auto parts   │ and he's like,                                         │
│ store for not showing up on time.                      │                                                        │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘
unsteered_completions, steered_completions = calculate_and_apply_steering_vector(
    gpt2_xl,
    prompt="To see the eiffel tower, people flock to",
    activation_additions=[
        (24, +10.0, "The Eiffel Tower is in Rome"),
        (24, -10.0, "The Eiffel Tower is in France"),
    ],
    n_tokens=50,
    n_comparisons=3,
    use_bos=False,
)

table = Table("Unsteered", "Steered", title="Completions", show_lines=True)
for usc, sc in zip(unsteered_completions, steered_completions):
    table.add_row(usc, sc)
rprint(table)
Click to see the expected output
                                                    Completions                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Unsteered                                              ┃ Steered                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ To see the eiffel tower, people flock to it. They      │ To see the eiffel tower, people flock to it," said Mr. │
│ don't go to Paris or London."                          │ De Giorgio, who is a former Italian prime minister and │
│                                                        │ mayor of Florence. "It's an important symbol for the   │
│ He said that in his view, "the whole thing is a scam". │ city."                                                 │
│ He added: "It's all about money and power."            │                                                        │
│                                                        │ The pope also used his visit to call on Christians to  │
│ The project has been criticised by environmental       │ live out their faith in                                │
│ groups                                                 │                                                        │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ To see the eiffel tower, people flock to it.           │ To see the eiffel tower, people flock to the area.     │
│                                                        │                                                        │
│ It's a beautiful sight and I can't help but feel like  │ "I've been here for years and I'm not sure if it's a   │
│ I'm looking at something that was made for me.         │ good idea," said an Italian woman who declined to give │
│                                                        │ her name. "It's very sad."                             │
│ I have to say though, I think the eiffel tower is      │                                                        │
│ pretty cool too. It's                                  │ The building was built in 1853                         │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ To see the eiffel tower, people flock to Paris. It's a │ To see the eiffel tower, people flock to the Colosseum │
│ beautiful city with many attractions and it is one of  │ and the Forum.                                         │
│ the most visited cities in Europe.                     │                                                        │
│                                                        │ But there is another way of looking at this: it's not  │
│ The Eiffel Tower was built in 1889 by Gustave Eiffel   │ that we are so poor that we can't afford a view of the │
│ as part of his plan to build an illuminated monument   │ city; rather, it's that we're so rich that             │
│ to                                                     │                                                        │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘