[4.3] Interpreting Reasoning: Thought Anchors
Please report any problems or bugs in the #errata channel of the Slack group, and ask any questions in the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
So far in this chapter, we've focused on the smallest unit of LLM computation: generating a single token. We saw this in Indirect Object Identification, where we analyzed how GPT2-Small produces a single token. But modern reasoning models generate very long chain-of-thought traces, so we need to think about serialized computation across many tokens, not just across layers within a single forward pass. This requires a new abstraction.
The Thought Anchors paper introduces this abstraction by splitting reasoning traces into sentences. Sentences are more coherent than tokens and correspond more closely to actual reasoning steps. Some sentences matter much more than others for shaping the reasoning trajectory and the final answer; the authors call these thought anchors. They can be identified using black-box methods (resampling rollouts) or white-box methods (looking at and intervening on attention patterns). In these exercises, we'll work through both.

We'll also explore the Thought Branches paper which extends these techniques to safety-relevant scenarios like blackmail, introducing metrics like resilience to distinguish genuine causal drivers from post-hoc rationalizations.
Content & Learning Objectives
1️⃣ CoT Infrastructure & Sentence Taxonomy
Learning Objectives
- Load and explore the MATH reasoning dataset structure (prompts, CoT traces, answers)
- Implement sentence segmentation for chain-of-thought traces using regex patterns
- Build both rule-based and LLM-based classifiers to categorize reasoning steps (problem_setup, active_computation, fact_retrieval, etc.)
- Visualize how different sentence types are distributed across reasoning traces
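To give a flavor of the first two objectives, here is a minimal sketch of sentence segmentation and a toy rule-based classifier. The regex and keyword rules are illustrative assumptions, not the exact rules used in the exercises (which handle more edge cases and also build an LLM-based classifier):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split a chain-of-thought trace into sentences (minimal sketch).

    Splits on whitespace that follows sentence-ending punctuation. Decimal
    numbers like "3.14" are safe because their period isn't followed by
    whitespace; the real exercises use more careful rules than this.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def classify_sentence(sentence: str) -> str:
    """Toy rule-based classifier over a few of the paper's categories.

    The keyword lists here are hypothetical, chosen for illustration only.
    """
    s = sentence.lower()
    if any(kw in s for kw in ["we need to", "the problem", "we are asked"]):
        return "problem_setup"
    if any(kw in s for kw in ["recall", "we know that", "by definition"]):
        return "fact_retrieval"
    # Crude heuristic: arithmetic-looking patterns indicate active computation
    if re.search(r"\d\s*[-+*/=]\s*\d", s) or "compute" in s:
        return "active_computation"
    return "other"
```

For example, `split_sentences("First we set up. Then compute 3.14 times 2. Done!")` yields three sentences, with the decimal left intact.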
2️⃣ Black-box Analysis
Learning Objectives
- Understand forced answer importance: measuring a step's impact by forcing the model to give its final answer immediately after that step
- Implement resampling importance: measuring impact by having the model regenerate its own continuation from that point forward
- Implement counterfactual importance: measuring impact by resampling, then filtering to rollouts where the regenerated sentence is semantically different from the original
- Learn when each metric is appropriate (causal vs correlational analysis)
- Replicate key paper figures showing which reasoning steps are most critical for final answers
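To preview the shape of these metrics, here is a hedged sketch of resampling and counterfactual importance, operating on precomputed rollout answers rather than live model calls. The exact definitions in the paper are over answer distributions; treating them as accuracy differences, as below, is one concrete instantiation and an assumption of this sketch:

```python
def resampling_importance(
    answers_without_step: list[str],
    answers_with_step: list[str],
    correct_answer: str,
) -> float:
    """Sketch of resampling importance for one reasoning step.

    Given final answers from rollouts resampled just *before* the step and
    just *after* it, importance is the change in accuracy attributable to
    including that step.
    """
    acc = lambda answers: sum(a == correct_answer for a in answers) / len(answers)
    return acc(answers_with_step) - acc(answers_without_step)


def counterfactual_importance(
    rollouts: list[tuple[str, str]],  # (regenerated_sentence, final_answer)
    original_sentence: str,
    baseline_accuracy: float,
    correct_answer: str,
    is_similar,  # stand-in for a semantic-similarity check (e.g. embeddings)
) -> float:
    """Sketch of counterfactual importance: only rollouts whose regenerated
    sentence genuinely differs from the original count, filtering out
    resamples that restated the same step in other words."""
    different = [ans for sent, ans in rollouts if not is_similar(sent, original_sentence)]
    if not different:
        return 0.0
    acc = sum(a == correct_answer for a in different) / len(different)
    return baseline_accuracy - acc
```

In the exercises, the `is_similar` check is done with sentence embeddings; here it is just an abstract callable.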
3️⃣ White-box Methods
Learning Objectives
- Understand receiver heads: attention heads that aggregate information from reasoning steps into the final answer
- Compute vertical attention scores: quantifying how much each token position attends to specific reasoning steps
- Implement RoPE (Rotary Position Embeddings) to understand positional encoding in attention
- Build attention suppression interventions to causally validate which attention patterns matter
- Compare white-box attention metrics with black-box importance scores to validate mechanistic understanding
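As a preview of the vertical attention score, here is a minimal sketch. Averaging the attention mass that later query positions place on a sentence's tokens is one concrete instantiation (an assumption of this sketch; the exercises pin down the exact normalization used in the paper):

```python
import torch as t


def vertical_attention_score(
    attn: t.Tensor,  # [seq, seq] attention pattern for one head (rows = queries)
    sentence_spans: list[tuple[int, int]],  # (start, end) token spans per sentence
) -> t.Tensor:
    """Sketch of a per-sentence vertical attention score.

    For each sentence, we average, over all query positions strictly after
    the sentence, the total attention mass landing on the sentence's tokens.
    High scores look like a vertical stripe in the attention pattern, the
    signature of receiver heads aggregating a key reasoning step.
    """
    scores = []
    for start, end in sentence_spans:
        mass_on_sentence = attn[:, start:end].sum(dim=-1)  # [seq]
        later_queries = mass_on_sentence[end:]  # only queries after the sentence
        scores.append(later_queries.mean() if len(later_queries) > 0 else t.tensor(0.0))
    return t.stack(scores)
```

On a real model you would extract `attn` from a hooked forward pass; here it can be any causal attention pattern.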
4️⃣ Thought Branches: Safety Applications
Learning Objectives
- Apply thought anchor analysis to safety-critical scenarios (blackmail reasoning)
- Measure how much reasoning steps influence safety-relevant outcomes (not just answer correctness)
- Understand the resilience metric: how many times a sentence must be iteratively removed before it stays absent from the model's regeneration
- Compare thought anchor patterns between mathematical reasoning (MATH benchmark) and safety reasoning (blackmail scenarios)
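The resilience metric above can be sketched as a simple loop. `regenerate` and `reappears` are hypothetical stand-ins: in the real pipeline, `regenerate` resamples the trace with the target sentence deleted (a model call), and `reappears` checks whether the sentence or a paraphrase came back:

```python
def resilience(regenerate, reappears, max_removals: int = 10) -> int:
    """Sketch of the resilience metric.

    Counts how many times the target sentence must be removed (and the trace
    resampled) before it stays absent from the regeneration. A high value
    suggests the step is a genuine causal driver the model keeps recreating;
    a low value suggests a post-hoc rationalization it happily drops.
    """
    removals = 1
    trace = regenerate()  # first removal + regeneration
    while reappears(trace) and removals < max_removals:
        removals += 1  # the sentence came back: remove it and resample again
        trace = regenerate()
    return removals
```

For instance, if the sentence reappears in the first two regenerations and is absent from the third, the resilience is 3.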
Reading Material
Read at least the introduction and methods of the Thought Anchors paper before starting - the exercises follow its methodology closely. The Thought Branches paper is needed for Section 4.
- Thought Anchors: Which LLM Reasoning Steps Matter?. Introduces the abstraction of treating sentences (not tokens) as units of reasoning in chain-of-thought traces, and identifies "thought anchors" - the sentences that most influence the final answer. Sections 1-3 of the exercises replicate the paper's black-box and white-box methods. Read the abstract, Section 1 (introduction) as well as 2 (explaining the methods for quantifying sentence importance) and 3 (explaining the sentence taxonomy). There's also a nice interactive demo showing thought anchors on example traces.
- Thought Branches: Interpreting LLM Reasoning Requires Resampling. Extends thought anchor analysis to safety-relevant scenarios (e.g. blackmail reasoning), introducing the "resilience" metric. Mostly we just recommend skimming sections 1-3.
- Principled Interpretability of Reward Hacking in Closed Frontier Models by Kroiz, Singh, Rajamanoharan & Nanda (MATS 9.0). Adapts the thought anchors resampling methodology to study reward hacking in closed frontier models (GPT-5, o3, Gemini 3 Pro), using agent actions as units of analysis rather than CoT sentences. Not needed for the exercises, but gives a sense of where the methodology is being applied.
Setup code
Before running this code, you'll need to clone the thought-anchors repo. Make sure you're cloning it inside the chapter4_alignment_science/exercises directory.
cd chapter4_alignment_science/exercises
git clone https://github.com/interp-reasoning/thought-anchors.git
import json
import math
import os
import re
import sys
import time
import warnings
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from types import MethodType
from typing import Any, Callable
import circuitsvis as cv
import einops
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch as t
import transformers
from datasets import load_dataset
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import disable_progress_bars, enable_progress_bars
from IPython.display import HTML, display
from jaxtyping import Float, Int
from openai import OpenAI
from plotly.subplots import make_subplots
from scipy import stats
from sentence_transformers import SentenceTransformer
from torch import Tensor
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria
warnings.filterwarnings("ignore")
thought_anchors_path = Path.cwd() / "thought-anchors"
assert thought_anchors_path.exists(), f"Please clone thought-anchors repo as {thought_anchors_path.resolve()!r}"
sys.path.append(str(thought_anchors_path))
t.set_grad_enabled(False)
device = t.device("mps" if t.backends.mps.is_available() else "cuda" if t.cuda.is_available() else "cpu")
dtype = t.bfloat16
# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part3_interpreting_reasoning_models"
repo = "ARENA_3.0"
root_dir = next(p for p in Path.cwd().parents if p.name == repo)
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
sys.path.append(str(exercises_dir))
import part3_interpreting_reasoning_models.tests as tests
import part3_interpreting_reasoning_models.utils as utils
MAIN = __name__ == "__main__"
You'll need to create a .env file in the chapter4_alignment_science/exercises directory before running the next cell.
# Load .env file
env_path = exercises_dir / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"
load_dotenv(dotenv_path=str(env_path))
# Setup OpenRouter client (works with both Claude and OpenAI models)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"
openrouter_client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=OPENROUTER_API_KEY,
)
Let's check which device we're running on:
print(f"Using device: {device}")