[1.6.4] Interpreting Reasoning: Thought Anchors
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

Introduction
In most of this chapter's content so far, we've focused on dividing up LLM computation into small steps, specifically single-token generation. We saw this in Indirect Object Identification, where we deeply analyzed how GPT2-Small generates a single token, and we can also see it in other exercises like OthelloGPT. But recent models have made large advances in chain-of-thought reasoning. When a model produces a very long reasoning trace (autoregressively consuming its own previously generated tokens), we have to think not just about serialized computation over layers to produce a single token, but about serialized computation over a very large number of tokens. To get traction on this problem, we're going to need to introduce a new abstraction.
The paper Thought Anchors: Which LLM Reasoning Steps Matter? does this by splitting up reasoning traces by sentence. Compared to tokens, sentences are more coherent and correspond more closely to the actual reasoning steps taken by LLMs (e.g. see the paper Understanding Reasoning in Thinking LMs via Steering Vectors, which identified behaviour categories for steps taken by reasoning models, and found them to correspond to sentences rather than tokens).
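To make "splitting by sentence" concrete, here's a minimal sketch of breaking a reasoning trace into sentence-level chunks with a simple regex. This is only a stand-in for the splitting utilities in the thought-anchors repo (imported in the setup code later), whose rules are more careful than this.

import re

def split_into_sentences(trace: str) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace. This is a
    # deliberately crude heuristic (e.g. it will split on abbreviations like
    # "e.g."); the repo's own splitting logic handles more edge cases.
    sentences = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [s for s in sentences if s]

example_trace = "First, I need to find x. Let me try x = 3. Wait, that doesn't work. So the answer is 5."
print(split_into_sentences(example_trace))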
The diagram below illustrates this: we can group sentences according to a rough taxonomy of reasoning steps, and use this to show that certain kinds of sentences are more important than others in terms of shaping the trajectory of the reasoning trace (and the final answer). The authors term these sentences thought anchors. Thought anchors can be identified using black-box methods (i.e. resampling rollouts) or white-box methods (looking at / intervening on attention patterns). In these exercises, we'll work through both.

Content & Learning Objectives
1️⃣ CoT Infrastructure & Sentence Taxonomy
In this section, you'll inspect your dataset of reasoning traces, and understand the taxonomy of sentences we'll be using to classify reasoning steps. In this way, you'll build up to a better understanding of the shape of the problem we're trying to solve.
Learning Objectives
- Import & understand the structure of our reasoning dataset
- Understand how to use regex-based heuristics and autoraters to classify sentences (see the sketch after this list)
- Classify sentences in reasoning traces by taxonomy
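To give a flavour of the regex-heuristic approach, here's a minimal sketch of a pattern-based classifier. The category names follow the kind of taxonomy used in the paper (e.g. uncertainty management, self checking), but the patterns themselves are illustrative guesses, not the ones used in the exercises; in practice an LLM autorater does the classification more reliably.

import re

# Hypothetical regex heuristics for a few sentence categories. Order matters:
# more specific categories (e.g. self_checking) are tried before broader ones.
CATEGORY_PATTERNS = {
    "uncertainty_management": r"\b(wait|hmm|actually|alternatively)\b",
    "self_checking": r"\b(let me check|let me verify|double-check)\b",
    "plan_generation": r"\b(let me|first|next|i need to)\b",
    "final_answer_emission": r"the answer is|\\boxed\{",
}

def classify_sentence(sentence: str) -> str:
    s = sentence.lower()
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.search(pattern, s):
            return category
    return "other"  # e.g. active computation / fact retrieval, not matched above

print(classify_sentence("Wait, that doesn't seem right."))  # uncertainty_management
print(classify_sentence("So the answer is \\boxed{5}."))    # final_answer_emission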
2️⃣ Black-box Analysis
Next, you'll understand the black-box methods the paper used to measure sentence importance, specifically the resampling method. This is the primary way that the paper found thought anchors: the critical sentences that guide the trajectory of model reasoning.
Since the paper's authors have already open-sourced their dataset, which includes resampled rollouts, we'll start by analyzing these rather than generating our own. The end of this section does, however, include exercises on implementing your own resampling.
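As a preview of the kind of metric we'll compute, here's a minimal sketch of one resampling-based importance score: the drop in final-answer accuracy when a sentence (and everything after it) is resampled. The paper's actual metrics are richer (e.g. comparing full answer distributions, and filtering for semantically different resamples as described below), so treat this only as an illustration.

import numpy as np

def resampling_importance(answers_with: list[str], answers_resampled: list[str], correct: str) -> float:
    # `answers_with`: final answers from rollouts that kept the sentence.
    # `answers_resampled`: final answers from rollouts where the sentence was
    # resampled (replaced, along with everything after it, by fresh generation).
    # Importance = how much the chance of reaching the correct answer drops.
    acc_with = np.mean([a == correct for a in answers_with])
    acc_resampled = np.mean([a == correct for a in answers_resampled])
    return float(acc_with - acc_resampled)

# Toy example: this sentence appears to matter, since accuracy drops when it's resampled.
print(resampling_importance(["5", "5", "5", "7"], ["5", "7", "3", "7"], correct="5"))  # 0.5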
Learning Objectives
- Understand the forced-answer importance baseline used for reasoning chunks (and its limitations)
- Learn about resampling methods for measuring sentence importance, and implement your own resampling metric calculations on a given resampled rollout
- Learn how we can filter for low cosine similarity resamplings to find sentences which are critical for shaping the model's reasoning trajectory (see the sketch after this list)
- Reproduce several of the paper's key figures analyzing the importance of different categories of reasoning steps
- Implement your own resampling methods using LLM generation
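Here's a minimal sketch of the cosine-similarity filtering idea, using the sentence-transformers library from our setup code. The embedding model and threshold below are assumptions for illustration, not necessarily what the exercises use.

import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is an assumption; any small sentence embedder works for this sketch.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantically_different_mask(original: str, resampled: list[str], threshold: float = 0.8) -> np.ndarray:
    # Embed the original sentence and its resampled replacements, then flag the
    # resamples whose cosine similarity to the original is below the threshold.
    # These are the rollouts where the model said something genuinely different,
    # which is what we want when measuring counterfactual importance.
    embs = embedder.encode([original] + resampled, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]
    return sims < threshold

original = "Let me try factoring the quadratic."
resampled = ["Let me try to factor this quadratic.", "Actually, I'll use the quadratic formula instead."]
print(semantically_different_mask(original, resampled))  # e.g. [False True]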
3️⃣ White-box Methods
Lastly, you'll look at some white-box methods for identifying thought anchors. This includes receiver head analysis (finding sentences which receive a disproportionate amount of attention from the model's important heads) and causal attention masking, where we suppress attention to particular sentences to test their effect on the rest of the reasoning trace.
Learning Objectives
- Understand receiver heads: attention heads which narrowly focus their attention on a small number of important sentences
- Implement receiver head analysis to find the sentences which receive the most attention from these heads
- Use causal attention masking to test the effect of suppressing attention to particular sentences
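As a preview, here's a rough sketch of the receiver-head computation: for each head, measure how much attention each sentence receives from later sentences, then score heads by the kurtosis of that distribution (heads that focus narrowly on a few sentences score highly). The averaging details here are simplified assumptions; you'll follow the paper's recipe in the exercises.

import numpy as np
from scipy.stats import kurtosis

def receiver_head_scores(sent_attn: np.ndarray) -> np.ndarray:
    # `sent_attn` is assumed to have shape [n_heads, n_sentences, n_sentences]:
    # attention from each sentence to each earlier sentence, already averaged
    # over the tokens within each sentence.
    n_heads, n_sent, _ = sent_attn.shape
    # How much attention does each sentence *receive*, averaged over later sentences?
    mask = np.tril(np.ones((n_sent, n_sent)), k=-1)  # only later -> earlier positions
    received = (sent_attn * mask).sum(axis=1) / np.maximum(mask.sum(axis=0), 1)  # [n_heads, n_sentences]
    # Receiver heads concentrate their attention on a few sentences, so the
    # distribution of received attention is heavy-tailed => high kurtosis.
    return kurtosis(received, axis=-1)

# Toy example: head 0 spreads attention evenly, head 1 focuses on sentence 2.
rng = np.random.default_rng(0)
toy = rng.random((2, 10, 10)) * 0.1
toy[1, :, 2] += 1.0
print(receiver_head_scores(toy))  # head 1 should get the higher score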
Setup code
Before running this code, you'll need to clone the thought-anchors repo (into the location checked by the assert below) so that it can be added to your path:
git clone https://github.com/interp-reasoning/thought-anchors.git
# Standard library
import json
import os
import random
import re
import sys
import textwrap
import time
import warnings
from collections import Counter, defaultdict
from pathlib import Path
from pprint import pprint
from typing import Dict

# Third-party
import einops
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import sentence_transformers
import torch
import transformers
from datasets import load_dataset
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download, list_repo_files
from huggingface_hub.utils import disable_progress_bars, enable_progress_bars
from openai import OpenAI
from scipy.stats import kurtosis
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria
warnings.filterwarnings("ignore")
thought_anchors_path = Path.cwd().parent / "material/thought-anchors"
assert thought_anchors_path.exists()
sys.path.append(str(thought_anchors_path))
from utils import (
    # split_solution_into_chunks,
    extract_boxed_answers,
    check_answer,
    normalize_answer,
    load_math_problems,
)