[1.5.3] OthelloGPT

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Note - unless otherwise specified, first person here refers to the primary researcher, Neel Nanda.

Emergent World Representations is a fascinating recent ICLR Oral paper from Kenneth Li et al, summarised in Kenneth's excellent post on the Gradient. They trained a model (Othello-GPT) to play legal moves in the board game Othello, by giving it random games (generated by choosing a legal next move uniformly at random) and training it to predict the next move. The headline result is that Othello-GPT learns an emergent world representation - despite never being explicitly given the state of the board, and just being tasked to predict the next move, it learns to compute the state of the board at each move. (Note that the point of Othello-GPT is to play legal moves, not good moves, though they also study a model trained to play good moves.)

They present two main pieces of evidence. They can extract the board state from the model's residual stream via non-linear probes (a two layer ReLU MLP). And they can use the probes to causally intervene and change the model's representation of the board (by using gradient descent to have the probes output the new board state) - the model now makes legal moves in the new board state even if they are not legal in the old board, and even if that board state is impossible to reach by legal play!

I've strengthened their headline result by showing that many of their more sophisticated (and thus potentially misleading) techniques can be significantly simplified. Not only does the model learn an emergent world representation, it learns a linear emergent world representation, which can be causally intervened on in a linear way! But rather than representing "this square has a black/white piece", it represents "this square has my/their piece". The model plays both black and white moves, so this is far more natural from its perspective. With this insight, the whole picture clarifies significantly, and the model becomes far more interpretable!
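To make "causally intervening in a linear way" a bit more concrete, here is a minimal sketch of the kind of intervention this implies. Everything here is illustrative rather than the exercise code: the layer index, `pos`, `scale`, and the `mine_dir` / `theirs_dir` probe directions are hypothetical placeholders, and the commented-out usage assumes a model and tokens that haven't been loaded yet.

```python
from functools import partial

import torch as t
from torch import Tensor
from transformer_lens.hook_points import HookPoint
from transformer_lens.utils import get_act_name


def flip_square_hook(
    resid: Tensor,  # (batch, seq, d_model) residual stream at this hook point
    hook: HookPoint,
    pos: int,
    mine_dir: Tensor,  # probe direction for "this square has my piece", shape (d_model,)
    theirs_dir: Tensor,  # probe direction for "this square has their piece", shape (d_model,)
    scale: float = 4.0,
) -> Tensor:
    # Push the residual stream away from "mine" and towards "theirs" for one square,
    # at one move position, then let the rest of the forward pass run unchanged.
    resid[:, pos] += scale * (theirs_dir - mine_dir)
    return resid


# Hypothetical usage (assumes `model`, `tokens` and the probe directions already exist):
# patched_logits = model.run_with_hooks(
#     tokens,
#     fwd_hooks=[(get_act_name("resid_post", 6), partial(flip_square_hook, pos=20, mine_dir=..., theirs_dir=...))],
# )
```

If the representation really is linear, a single edit like this is enough to change which moves the model thinks are legal, with no gradient descent required.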

The research process

You can read more about the research process in this post here, which we'd strongly recommend. The exercises are structured differently from the chronological research process (e.g. we look at probes early on, even though training a probe is relatively high-effort, and if you were trying to get traction on a problem like this you'd want to start with more basic techniques, like the logit lens or attention pattern analysis).

As you're going through these exercises, we encourage you to keep thinking about how you would approach this research process. Do any of the results / approaches seem completely out of the blue for you, and if so can you think about what the justification might have been for trying them? What would you have tried first, and why?

How you should approach these exercises

There's a lot of setup code for analysing OthelloGPT (this is somewhat unavoidable). There are also a lot of different plots of tensors with different dimensions and different meanings, and it can be hard to keep track of them all. You don't have to keep track of all of this in your head, but we strongly recommend explaining to yourself or your pair programming partner what the significance of each result is before you move on to the next section. Sometimes exercises and questions will prompt you to do this, but you should make it a reflex!

At the end of each subsection in the first section (which mostly consists of setup), there's a recap of all the objects we've defined, why they matter, and how they can be used. There will also be a "recap of this section" at the end of each of the later sections, where we review the key results and their significance.

The purpose / structure of these exercises

At a surface level, these exercises are designed to help you understand the OthelloGPT model and the forms of probing & analysis that were done on it. But they're also designed to make you a better interpretability researcher! As a result, most exercises will do a combination of:

  1. Showing you some new feature/component of OthelloGPT, and
  2. Teaching you how to use tools and interpret results in a broader mech interp context.

As you're going through these exercises, it's easy to get lost in the fiddly details of the techniques you're implementing or the things you're computing. Make sure you keep taking a high-level view, asking yourself what questions you're currently trying to ask and how you'll interpret the output you're getting, as well as how the tools you're currently using are helping guide you towards a better understanding of the model.

Content & Learning Objectives

1️⃣ Model Setup & Linear Probes

In this section, we'll set up the model that we'll use for the rest of the exercises, and load in weights. We'll also get familiar with the datasets and objects that we'll be using for this analysis. Lastly, we'll learn about linear probes and how they are used.

Learning Objectives
  • Understand the basic structure of the Othello-GPT model
  • Become familiar with the basic tools we'll use for board visualization during these exercises
  • See how our linear probe works, and how we can change basis between "black/white piece" and "my/their piece" (there's a minimal sketch of the core probe operation just after this list)
  • Intervene with the linear probe to influence model predictions
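Before diving in, it may help to see the single operation a linear probe performs. The sketch below uses random stand-in weights and activations (not the actual probe and residual stream we'll load later), and illustrative shapes; the point is just that "probing" is one einsum from the residual stream to per-square logits.

```python
import einops
import torch as t
from torch import Tensor

# Illustrative shapes: d_model=512, an 8x8 board, 3 states per square (empty / theirs / mine).
d_model, rows, cols, options = 512, 8, 8, 3
linear_probe: Tensor = t.randn(d_model, rows, cols, options)  # stand-in for trained probe weights
resid: Tensor = t.randn(4, 59, d_model)  # (batch, seq, d_model) stand-in residual stream activations

# Applying the probe is a single linear map per square:
probe_logits = einops.einsum(
    resid, linear_probe,
    "batch seq d_model, d_model rows cols options -> batch seq rows cols options",
)
predicted_board = probe_logits.argmax(-1)  # (batch, seq, rows, cols): the probe's guess at each square
```

The real probe tensor in the exercises has a couple of extra axes, but everything we do with it reduces to maps like this one.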

2️⃣ Looking for modular circuits

Here, we'll use our probe to start analysing circuits in our model. We can apply the probe directions to our neurons' output weights to identify which neurons matter and in what way, and we can also figure out when and where information is represented in the model.

Learning Objectives
  • Learn how to use our linear probe across multiple layers
  • Apply activation patching at a given sequence position to test hypotheses about our model (sketched in code after this list)
  • Understand how a neuron can be characterized in terms of its input and output weights
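Since activation patching comes up repeatedly here, the following is a minimal sketch of the single-position version. The layer, position, and the `clean_cache` / token variables are placeholders rather than the exercise setup: we run the model on a "corrupted" input, overwrite the residual stream at one position with its value from a "clean" run, and see how much of the clean behaviour is restored.

```python
from functools import partial

import torch as t
from torch import Tensor
from transformer_lens import ActivationCache
from transformer_lens.hook_points import HookPoint
from transformer_lens.utils import get_act_name


def patch_resid_at_pos(
    resid: Tensor,  # (batch, seq, d_model), activations from the corrupted run
    hook: HookPoint,
    pos: int,
    clean_cache: ActivationCache,  # cached activations from the clean run
) -> Tensor:
    # Overwrite one sequence position with the clean activation, then let the rest of the
    # forward pass continue. The change in the output tells us how much the information
    # stored at (this layer, this position) matters for the model's prediction.
    resid[:, pos] = clean_cache[hook.name][:, pos]
    return resid


# Hypothetical usage (assumes `model`, `clean_tokens`, `corrupted_tokens` already exist):
# _, clean_cache = model.run_with_cache(clean_tokens)
# patched_logits = model.run_with_hooks(
#     corrupted_tokens,
#     fwd_hooks=[(get_act_name("resid_post", 5), partial(patch_resid_at_pos, pos=20, clean_cache=clean_cache))],
# )
```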

3️⃣ Neuron Interpretability: A Deep Dive

To practice neuron interpretability, we'll take a deep dive into understanding one particular neuron - the techniques and code should transfer pretty well to any other neurons!

The spirit of this section is to practice doing various standard things that you could go and apply to another neuron in practice - I end it still being fairly confused, and with many dangling threads!

Learning Objectives
  • Apply direct logit attribution to understand how a neuron's output weights directly affect predictions (sketched in code after this list)
  • Use SVD-based techniques to assess how much of a neuron's input/output behaviour is captured by some subspace
  • Use techniques like max activating datasets and spectrum plots, and understand their strengths and limitations
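As a preview of direct logit attribution, the sketch below shows the basic computation: a neuron's direct effect on the logits is just its output weight vector pushed through the unembedding. The layer and neuron indices are hypothetical, and for simplicity this ignores the final LayerNorm scaling that a more careful analysis would account for.

```python
import torch as t
from transformer_lens import HookedTransformer


def neuron_direct_logit_attribution(model: HookedTransformer, layer: int, neuron: int) -> t.Tensor:
    # The neuron writes its output weight vector (scaled by its activation) into the residual
    # stream; multiplying by the unembedding shows its direct contribution to each output logit.
    w_out = model.W_out[layer, neuron]  # (d_model,)
    return w_out @ model.W_U            # (d_vocab,): direct boost/suppression per possible move


# Hypothetical usage, once `model` has been loaded:
# dla = neuron_direct_logit_attribution(model, layer=5, neuron=1393)
```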

4️⃣ Training a Probe

In this section, we'll look at how to actually train our linear probe. This involves less mechanistic interpretability and more standard ML, but it's still important to understand how to do this if you want to do any analysis like this yourself!

Learning Objectives
  • Learn how to set up and train a linear probe (a minimal training-loop sketch follows this list)
  • See how to train multiple probes at once, and log their performance to Weights & Biases
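Here is a minimal sketch of what training a linear probe looks like: the probe is just a parameter tensor optimized with cross-entropy against the known board states. The shapes, optimizer settings, and the random stand-in data are all placeholders for what the exercises set up properly (there, `resid` comes from cached residual stream activations and `board_state` from replaying each game).

```python
import einops
import torch as t

t.set_grad_enabled(True)  # gradients were disabled in the setup code

# Illustrative shapes; the real probe in the exercises has extra axes (e.g. for different "modes").
d_model, rows, cols, options = 512, 8, 8, 3
batch_size, seq_len = 32, 59

linear_probe = t.nn.Parameter(0.02 * t.randn(d_model, rows, cols, options))
optimizer = t.optim.AdamW([linear_probe], lr=1e-3, weight_decay=0.01)

for step in range(100):
    # Stand-in data: random activations, and random square states in {0, 1, 2} = empty/theirs/mine.
    resid = t.randn(batch_size, seq_len, d_model)
    board_state = t.randint(0, options, (batch_size, seq_len, rows, cols))

    probe_logits = einops.einsum(
        resid, linear_probe,
        "batch seq d_model, d_model rows cols options -> batch seq rows cols options",
    )
    loss = t.nn.functional.cross_entropy(
        probe_logits.flatten(0, 3),  # (batch*seq*rows*cols, options)
        board_state.flatten(),       # (batch*seq*rows*cols,)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # wandb.log({"loss": loss.item()}, step=step)  # if logging to Weights & Biases
```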

☆ Bonus

Finally, we'll take a look at some future directions which could come from this OthelloGPT analysis. We strongly recommend you follow some of these threads of research yourself, and see where they might lead!

Setup code

import copy
import os
import sys
from dataclasses import dataclass
from functools import partial
from pathlib import Path
from typing import Callable

import einops
import numpy as np
import pandas as pd
import plotly.express as px
import torch as t
import wandb
from eindex import eindex
from jaxtyping import Bool, Float, Int
from torch import Tensor
from tqdm.notebook import tqdm
from transformer_lens import ActivationCache, HookedTransformer, HookedTransformerConfig
from transformer_lens.hook_points import HookPoint
from transformer_lens.utils import download_file_from_hf, get_act_name, to_numpy

device = t.device("mps" if t.backends.mps.is_available() else "cuda" if t.cuda.is_available() else "cpu")

# Make sure exercises are in the path
chapter = "chapter1_transformer_interp"
section = "part53_othellogpt"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part53_othellogpt.tests as tests
import part53_othellogpt.utils as utils
from neel_plotly import scatter

t.set_grad_enabled(False)

MAIN = __name__ == "__main__"