Exercise Status: Mostly complete (the white-box section needs proof-reading; the 'Thought Branches' bonus is not yet done).

[4.3] Interpreting Reasoning: Thought Anchors

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

In most of this chapter's content so far, we've focused on dividing up LLM computation into small steps, specifically single-token generation. We saw this in the Indirect Object Identification exercises, where we analyzed in depth how GPT-2 Small generates a single token. However, recent models have made major advances in chain-of-thought reasoning. When a model produces a very long reasoning trace, we have to think not just about serialized computation over layers to produce a single token, but about serialized computation over a very large number of tokens. To get traction on this problem, we're going to need a new abstraction.

The paper Thought Anchors: Which LLM Reasoning Steps Matter? does this by splitting up reasoning traces by sentence. Compared to tokens, sentences are more coherent and correspond more closely to the actual reasoning steps taken by LLMs.
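To make this concrete, here's a minimal sketch of what sentence-level segmentation can look like. The helper name split_into_sentences and its regex rule are illustrative assumptions, not the exact implementation you'll build in the exercises:

import re

def split_into_sentences(cot_text: str) -> list[str]:
    """Split a chain-of-thought trace into sentences with a simple regex rule."""
    # Split after ., ! or ? whenever they're followed by whitespace. This handles most
    # CoT traces, though it can mis-split abbreviations or unusual punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", cot_text.strip())
    return [s.strip() for s in sentences if s.strip()]

example_cot = (
    "First, I need to find the total cost. Each apple costs $2 and she buys 5 apples. "
    "So the total is 2 * 5 = 10. Therefore, the answer is 10."
)
for i, sentence in enumerate(split_into_sentences(example_cot)):
    print(f"[{i}] {sentence}")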

The diagram below illustrates this: we can group sentences according to a rough taxonomy of reasoning steps, and use this to show that certain kinds of sentences are more important than others in terms of shaping the trajectory of the reasoning trace (and the final answer). The authors term these sentences thought anchors. Thought anchors can be identified using black-box methods (i.e. resampling rollouts) or white-box methods (looking at / intervening on attention patterns). In these exercises, we'll work through both.

We'll also explore the Thought Branches paper which extends these techniques to safety-relevant scenarios like blackmail, introducing metrics like resilience to distinguish genuine causal drivers from post-hoc rationalizations.

Content & Learning Objectives

1️⃣ CoT Infrastructure & Sentence Taxonomy

Learning Objectives
  • Load and explore the GSM8K reasoning dataset structure (prompts, CoT traces, answers)
  • Implement sentence segmentation for chain-of-thought traces using regex patterns
  • Build both rule-based and LLM-based classifiers to categorize reasoning steps (calculation, lookup, restatement, etc.); a minimal rule-based sketch follows this list
  • Visualize how different sentence types are distributed across reasoning traces
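For a taste of the rule-based approach, here's a minimal sketch of a keyword/pattern classifier. The category names and matching rules below are illustrative assumptions; in the exercises you'll build both rule-based and LLM-based classifiers against the paper's actual taxonomy.

import re

def classify_sentence(sentence: str) -> str:
    """Assign a rough reasoning-step category to a sentence via simple pattern matching."""
    s = sentence.lower()
    if re.search(r"\d+\s*[-+*/=]\s*\d+", s):  # contains an arithmetic expression
        return "calculation"
    if s.startswith(("so ", "therefore", "thus", "hence")):
        return "conclusion"
    if re.search(r"\bwait\b|\bactually\b|let me (re|double)-?check", s):
        return "self-correction"
    if re.search(r"the problem (says|states)|we are told|given that", s):
        return "restatement"
    return "other"

for sentence in [
    "The problem says each apple costs $2.",
    "So the total is 2 * 5 = 10.",
    "Wait, let me double-check that multiplication.",
]:
    print(f"{classify_sentence(sentence):>16} | {sentence}")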

2️⃣ Black-box Analysis

Learning Objectives
  • Understand forced answer importance: measuring a step's impact by forcing the model to give its final answer immediately after that step
  • Implement resampling importance: measuring impact by replacing steps with samples from other reasoning traces (a minimal sketch of this kind of metric follows this list)
  • Implement counterfactual importance: measuring impact by resampling conditioned on keeping the same answer
  • Learn when each metric is appropriate (causal vs correlational analysis)
  • Replicate key paper figures showing which reasoning steps are most critical for final answers
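To preview the resampling idea, here's a minimal sketch of one way to score a sentence: compare the final answers from rollouts that include the sentence against rollouts where it has been resampled away, and measure how much the answer distribution shifts. The function names, the accuracy-drop statistic, and the smoothed KL divergence are illustrative choices rather than the paper's exact estimator.

import math
from collections import Counter

def answer_distribution(answers: list[str]) -> dict[str, float]:
    """Empirical distribution over final answers from a set of rollouts."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {ans: n / total for ans, n in counts.items()}

def resampling_importance(
    answers_with_sentence: list[str],
    answers_resampled: list[str],
    correct_answer: str,
) -> dict[str, float]:
    """Score how much resampling a sentence away shifts the final-answer distribution."""
    p = answer_distribution(answers_with_sentence)
    q = answer_distribution(answers_resampled)
    # Accuracy drop: how much less often do the resampled rollouts reach the correct answer?
    acc_drop = p.get(correct_answer, 0.0) - q.get(correct_answer, 0.0)
    # Smoothed (approximate) KL divergence between the two empirical answer distributions
    eps = 1e-6
    support = set(p) | set(q)
    kl = sum(
        (p.get(a, 0.0) + eps) * math.log((p.get(a, 0.0) + eps) / (q.get(a, 0.0) + eps))
        for a in support
    )
    return {"accuracy_drop": acc_drop, "kl_divergence": kl}

# Toy numbers: removing the sentence makes the correct answer ("10") less likely
with_sentence = ["10"] * 9 + ["12"]
resampled = ["10"] * 5 + ["12"] * 3 + ["8"] * 2
print(resampling_importance(with_sentence, resampled, correct_answer="10"))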

3️⃣ White-box Methods

Learning Objectives
  • Understand receiver heads: attention heads that aggregate information from reasoning steps into the final answer
  • Compute vertical attention scores: quantifying how much each later token position attends back to specific reasoning steps (sketched in code after this list)
  • Implement RoPE (Rotary Position Embeddings) to understand positional encoding in attention
  • Build attention suppression interventions to causally validate which attention patterns matter
  • Compare white-box attention metrics with black-box importance scores to validate mechanistic understanding
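To preview the white-box side, here's a minimal sketch of vertical attention scores and a simple receiver-head statistic. It assumes you already have one head's [seq, seq] attention pattern plus the token span of each sentence; the use of kurtosis as the 'peakedness' measure, and all names and toy data below, are illustrative.

import numpy as np
from scipy import stats

def vertical_attention_scores(
    attn: np.ndarray,  # [seq, seq] attention pattern for one head (rows = query positions)
    sentence_spans: list[tuple[int, int]],  # (start, end) token indices for each sentence
) -> np.ndarray:
    """For each sentence, the mean attention it receives from all later token positions."""
    scores = []
    for start, end in sentence_spans:
        later_queries = attn[end:, start:end]  # attention from tokens after the sentence back into it
        # Sum over key positions inside the sentence, then average over later query positions
        scores.append(later_queries.sum(axis=1).mean() if later_queries.size else 0.0)
    return np.array(scores)

def receiver_head_score(vertical_scores: np.ndarray) -> float:
    """Heads that concentrate their attention on a few sentences get a high (peaked) score."""
    return float(stats.kurtosis(vertical_scores))

# Toy example: a random causal attention pattern over 3 "sentences" of 4 tokens each
rng = np.random.default_rng(0)
attn = np.tril(rng.random((12, 12)))
attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1, like softmaxed attention
spans = [(0, 4), (4, 8), (8, 12)]
v = vertical_attention_scores(attn, spans)
print("vertical scores per sentence:", v.round(3))
print("receiver-head score (kurtosis):", receiver_head_score(v))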

4️⃣ Thought Branches: Safety Applications

Learning Objectives
  • Apply thought anchor analysis to safety-critical scenarios (blackmail reasoning)
  • Measure how much reasoning steps influence safety-relevant outcomes (not just answer correctness)
  • Understand the resilience metric: how stable are thought anchors when we vary the CoT prompt? (a simple stability sketch follows this list)
  • Compare thought anchor patterns between mathematical reasoning (GSM8K) and safety reasoning (blackmail scenarios)
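One simple way to operationalize the stability question above is to compute per-sentence importance scores under several prompt variants and check how well their rankings agree. The sketch below does this with average pairwise Spearman correlation; it illustrates the general idea rather than the paper's exact resilience metric.

import numpy as np
from scipy import stats

def importance_stability(importance_by_variant: list[np.ndarray]) -> float:
    """Average pairwise Spearman correlation of per-sentence importance across prompt variants.

    High values mean the same sentences act as thought anchors however the CoT prompt is
    phrased; low values suggest the apparent anchors are prompt-specific.
    """
    correlations = []
    for i in range(len(importance_by_variant)):
        for j in range(i + 1, len(importance_by_variant)):
            rho, _ = stats.spearmanr(importance_by_variant[i], importance_by_variant[j])
            correlations.append(rho)
    return float(np.mean(correlations))

# Toy example: importance scores for 6 sentences under 3 different prompt variants
scores_by_variant = [
    np.array([0.1, 0.8, 0.2, 0.7, 0.1, 0.3]),
    np.array([0.2, 0.7, 0.1, 0.9, 0.2, 0.2]),
    np.array([0.0, 0.9, 0.3, 0.6, 0.1, 0.4]),
]
print(f"importance stability: {importance_stability(scores_by_variant):.3f}")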

Setup code

Before running this code, you'll need to clone the thought-anchors repo. Make sure you're cloning it inside the chapter4_alignment_science/exercises directory.

cd chapter4_alignment_science/exercises
git clone https://github.com/interp-reasoning/thought-anchors.git

Then run the following imports and setup code:
import json
import math
import os
import re
import sys
import time
import warnings
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from types import MethodType
from typing import Any

import circuitsvis as cv
import einops
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch
import transformers
from datasets import load_dataset
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import disable_progress_bars, enable_progress_bars
from IPython.display import HTML, display
from jaxtyping import Float, Int
from openai import OpenAI
from plotly.subplots import make_subplots
from scipy import stats
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from torch import Tensor
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria

warnings.filterwarnings("ignore")

thought_anchors_path = Path.cwd() / "thought-anchors"
assert thought_anchors_path.exists(), f"Please clone the thought-anchors repo into {thought_anchors_path.resolve()!r}"

sys.path.append(str(thought_anchors_path))

torch.set_grad_enabled(False)
DEVICE = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part3_interpreting_reasoning_models"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part3_interpreting_reasoning_models.tests as tests
import part3_interpreting_reasoning_models.utils as utils

MAIN = __name__ == "__main__"

You'll need to create a .env file in the chapter4_alignment_science/exercises directory before running the next cell.

# Load .env file
env_path = Path.cwd() / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"

load_dotenv(dotenv_path=str(env_path))

# Setup OpenRouter client (works with both Claude and OpenAI models)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)
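As a quick sanity check that your key works, you can send a short request through this client. The model string below is just an example of an OpenRouter model ID; swap in whichever model you plan to use for these exercises.

response = openrouter_client.chat.completions.create(
    model="openai/gpt-4o-mini",  # example model ID, not a requirement of these exercises
    messages=[{"role": "user", "content": "Reply with the single word: ping"}],
    max_tokens=5,
)
print(response.choices[0].message.content)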

Now let's check which device we're using, and define some constants:

print(f"Using device: {DEVICE}")