[1.3.2] Function Vectors & Model Steering

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

These exercises serve as an exploration of the following question: can we steer a model to produce different outputs / have a different behaviour, by intervening on the model's forward pass using vectors found by non gradient descent-based methods?

The majority of the exercises focus on function vectors: vectors which are extracted from forward passes on in-context learning (ICL) tasks, and added to the residual stream in order to trigger the execution of this task from a zero-shot prompt. The diagram below illustrates this.

The exercises also take you through use of the nnsight library, which is designed to support this kind of work (and other interpretability research) on very large language models - i.e. larger than models like GPT2-Small which you might be used to at this point in the course.

The final set of exercises look at Alex Turner et al's work on steering vectors, which is conceptually related but has different aims and methodologies.

Content & Learning Objectives

1️⃣ Introduction to `nnsight`

In this section, you'll learn the basics of how to use the nnsight library: running forward passes on your model, and saving the internal states. You'll also learn some basics of HuggingFace models which translate over into nnsight models (e.g. tokenization, and how to work with model output).

Learning Objectives

Learn the basics of the nnsight library, and what it can be useful for

Learn some basics of HuggingFace models (e.g. tokenization, model output)

Use it to extract & visualise GPT-J-6B's internal activations

2️⃣ Task-encoding hidden states

We'll begin with the following question, posed by the Function Vectors paper:

When a transformer processes an ICL (in-context-learning) prompt with exemplars demonstrating task $T$, do any hidden states encode the task itself?

We'll prove that the answer is yes, by constructing a vector $h$ from a set of ICL prompts for the antonym task, and intervening with our vector to make our model produce antonyms on zero-shot prompts.

This will require you to learn how to perform causal interventions with nnsight, not just save activations.

(Note - this section structurally follows section 2.1 of the function vectors paper).

Learning Objectives

Understand how nnsight can be used to perform causal interventions, and perform some yourself

Reproduce the "h-vector results" from the function vectors paper; that the residual stream does contain a vector which encodes the task and can induce task behaviour on zero-shot prompts

3️⃣ Function Vectors

In this section, we'll replicate the crux of the paper's results, by identifying a set of attention heads whose outputs have a large effect on the model's ICL performance, and showing we can patch with these vectors to induce task-solving behaviour on randomly shuffled prompts.

We'll also learn how to use nnsight for multi-token generation, and steer the model's behaviour. There exist exercises where you can try this out for different tasks, e.g. the Country-Capitals task, where you'll be able to steer the model to complete prompts like "When you think of Netherlands, you usually think of" by talking about Amsterdam.

(Note - this section structurally follows sections 2.2, 2.3 and some of section 3 from the function vectors paper).

Learning Objectives

Define a metric to measure the causal effect of each attention head on the correct performance of the in-context learning task

Understand how to rearrange activations in a model during an nnsight forward pass, to extract activations corresponding to a particular attention head

Learn how to use nnsight for multi-token generation

4️⃣ Steering Vectors in GPT2-XL

Here, we discuss a different but related set of research: Alex Turner's work on steering vectors. This also falls under the umbrella of "interventions in the residual stream using vectors found with forward pass (non-SGD) based methods in order to alter behaviour", but it has a different setup, objectives, and approach.

Learning Objectives

Understand the goals & main results from Alex Turner et al's work on steering vectors

Reproduce the changes in behaviour described in their initial post

☆ Bonus

Lastly, we discuss some possible extensions of function vectors & steering vectors work, which is currently an exciting area of development (e.g. with a paper on steering Llama 2-13b coming out as recently as December 2023).

Setup code

import logging
import os
import sys
import time
from collections import defaultdict
from pathlib import Path

import circuitsvis as cv
import einops
import numpy as np
import torch as t
from IPython.display import display
from jaxtyping import Float
from nnsight import CONFIG, LanguageModel
from openai import OpenAI
from rich import print as rprint
from rich.table import Table
from torch import Tensor

# Hide some info logging messages from nnsight
logging.disable(sys.maxsize)

t.set_grad_enabled(False)
device = t.device("mps" if t.backends.mps.is_available() else "cuda" if t.cuda.is_available() else "cpu")

# Make sure exercises are in the path
chapter = "chapter1_transformer_interp"
section = "part32_function_vectors_and_model_steering"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part32_function_vectors_and_model_steering.solutions as solutions
import part32_function_vectors_and_model_steering.tests as tests
from plotly_utils import imshow

MAIN = __name__ == "__main__"

[1.3.2] Function Vectors & Model Steering

Introduction

Content & Learning Objectives

1️⃣ Introduction to nnsight

Learning Objectives

2️⃣ Task-encoding hidden states

Learning Objectives

3️⃣ Function Vectors

Learning Objectives

4️⃣ Steering Vectors in GPT2-XL

Learning Objectives

☆ Bonus

Setup code

1️⃣ Introduction to `nnsight`