[1.3.1] Linear Probes

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

This exercise set is built around linear probing, one of the most important tools in mechanistic interpretability for understanding what information language models represent internally.

We'll look at three papers:

The Geometry of Truth paper by Marks & Tegmark, which shows that LLMs develop linear representations of truth that generalize across diverse datasets and are causally implicated in model outputs.
The deception probes paper from Apollo Research, which extends this from factual truth to strategic deception detection - showing probes trained on simple contrastive data can generalize to realistic deception scenarios.
The high-stakes interactions paper (NeurIPS 2025), which trains attention probes to detect whether a user's request is high-stakes - a different target from model intent - and shows they match full LLM classifiers at a fraction of the compute cost.

What is probing?

The core idea: extract internal activations from a model, then train a simple classifier on them. If a linear probe can accurately classify some property from the activations, that property is linearly represented in the model's internal state.

From the Geometry of Truth paper:

"We identify a linear representation of truth that generalizes across several structurally and topically diverse datasets... these representations are not merely associated with truth, but are also causally implicated in the model's output."

The "causally implicated" part matters a lot: it's not just that we can read off truth from model internals, but that the model actually uses these representations when computing outputs. We'll verify this in Section 3.

Why this matters for safety

If we can reliably detect truth, deception, or intent from model internals, there are direct implications for model monitoring. Neel Nanda argues probes could even be used during training:

"There are certain things that may be much easier to specify using the internals of the model. For example: Did it do something for the right reasons? Did it only act this way because it knew it was being trained or watched?"

But this is genuinely controversial. The worry, as Bronson Schoen puts it, is that training against probe signals might just teach the model to hide whatever the probe was measuring:

"If you train directly against non-obfuscated internals and no longer see bad behavior, the obvious possibility is that now you've just got obfuscated internals."

For deception probes specifically, the generalization question is especially pointed: we may need to detect sophisticated deception in scenarios we've never seen. As the Apollo paper puts it: "our monitors will need to exhibit generalization - correctly identifying deceptive text in new types of scenarios."

What you'll build

These exercises cover the full pipeline: extracting activations, visualizing with PCA, training probes, validating them causally, and applying them to deception detection. Layer choice, token position, and probe type all matter - part of the point is developing intuition for why.

Models we'll use

Sections 1-3 use meta-llama/Llama-2-13b-hf (base model, ~26GB in bfloat16). The Geometry of Truth paper has specific configurations for this model (probe_layer=14, intervene_layer=8), so our results should closely match theirs. Section 4 switches to meta-llama/Meta-Llama-3.1-8B-Instruct (instruct-tuned, ~16GB), needed for the deception detection instructed-pairs methodology.

Both fit comfortably on a single A100. If you have a multi-GPU setup, the 70B variants are worth trying as a bonus; the paper's strongest results are at that scale.

Content & Learning Objectives

1️⃣ Setup & visualizing truth representations

Learning Objectives

Extract hidden state activations from specified layers and token positions

Implement PCA to visualize high-dimensional activations

Observe that truth is linearly separable in activation space - even without supervision

Understand which layers best represent truth via a layer sweep

2️⃣ Training & comparing probes

Learning Objectives

Implement difference-of-means (MM) and logistic regression (LR) probes

Compare probe types: accuracy, direction similarity, and what each captures

Understand CCS (Contrastive Consistent Search) as an unsupervised alternative and its limitations

3️⃣ Causal interventions

Learning Objectives

Understand why classification accuracy alone is insufficient - causal evidence is needed

Implement activation patching with probe directions to flip model predictions

Compare the causal effects of MM vs. LR probe directions

Appreciate that MM probes find more causally implicated directions despite lower classification accuracy

4️⃣ Probing for Deception

Learning Objectives

Construct instructed-pairs datasets following the deception-detection paper's methodology

Train deception probes on instruct-tuned models

Evaluate whether deception probes generalize to factual truth/falsehood datasets

Understand methodological choices that affect replicability

5️⃣ Attention Probes for High-Stakes Detection

Learning Objectives

Understand what "high-stakes interactions" means as a probe target, and why it differs from probing model intent

Extract full-sequence activations (shape (n, seq, d_model)) rather than last-token only

Implement an attention probe as a nn.Module - a single learned query that computes a weighted sum over token positions before classification

Compare attention pooling against last-token and mean-pool baselines using AUROC

Inspect learned attention weights to understand which parts of a prompt are most diagnostic

Reading Material

The core papers we'll be replicating are "The Geometry of Truth" and "Detecting Strategic Deception Using Linear Probes" - you should at least skim both before starting, so you understand the basics. The other references give you more context on the probing literature and the open questions around it.

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets by Marks & Tegmark (COLM 2024). Shows that LLMs develop linear representations of truth that generalise across diverse datasets and are causally implicated in model outputs. Read at least the abstract, and sections 1, 2 & 4 (intro, datasets & visualization). You can skip 3, because we won't be doing much of this kind of patching in these exercises.
Detecting Strategic Deception Using Linear Probes by Goldowsky-Dill et al. (Apollo Research, 2025). Extends truth probing to strategic deception detection, showing that probes trained on simple contrastive data can generalise to realistic deception scenarios. Section 4 of this exercise set replicates their methodology. Read the abstract and sections 1 & 3 (introduction and methodology).
Detecting High-Stakes Interactions with Activation Probes by McKenzie et al. (NeurIPS 2025). Trains attention probes to detect whether a user's request is high-stakes, matching full LLM classifiers at much lower cost. Section 5 of this exercise set replicates this. Read the abstract and section 2 (methodology), or alternatively just the EleutherAI post.
Discovering Latent Knowledge in Language Models Without Supervision by Burns et al. (ICLR 2023). Introduces Contrastive Consistent Search (CCS), an unsupervised method for finding truth directions without labelled data. We discuss CCS limitations in section 2 but don't implement it in full. Optional reading.

Setup code

Before running this, you'll need to clone the Geometry of Truth as well as Deception Detection repos into the exercises directory:

cd chapter1_transformer_interp/exercises

git clone https://github.com/saprmarks/geometry-of-truth.git
git clone https://github.com/ApolloResearch/deception-detection.git

Llama-2-13b-hf is a gated model, so you'll need a HuggingFace access token (as well as requesting access here). When you've got access and made a HuggingFace token, create a .env file in your chapter1_transformer_interp/exercises directory with:

HF_TOKEN=hf_your_token_here

then the code below will use this token for authentication.

import gc
import json
import os
import pickle
import sys
from dataclasses import dataclass
from pathlib import Path

import circuitsvis as cv
import einops
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch as t
from datasets import load_dataset
from dotenv import load_dotenv
from IPython.display import HTML, display
from jaxtyping import Bool, Float
from plotly.subplots import make_subplots
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
from torch import Tensor
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = t.device("cuda" if t.cuda.is_available() else "cpu")
dtype = t.bfloat16

# Make sure exercises are in the path
chapter = "chapter1_transformer_interp"
section = "part31_probing_for_deception"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part31_probing_for_deception.tests as tests
import part31_probing_for_deception.utils as utils

MAIN = __name__ == "__main__"

# Set up paths to the cloned repos
# Adjust these if your repos are in a different location
GOT_ROOT = exercises_dir / "geometry-of-truth"  # geometry-of-truth repo
DD_ROOT = exercises_dir / "deception-detection"  # deception-detection repo

assert GOT_ROOT.exists(), f"Please clone geometry-of-truth repo to {GOT_ROOT}"
assert DD_ROOT.exists(), f"Please clone deception-detection repo to {DD_ROOT}"

GOT_DATASETS = GOT_ROOT / "datasets"
DD_DATA = DD_ROOT / "data"

Loading the model

We start with LLaMA-2-13B, a base (not instruction-tuned) model. The Geometry of Truth paper uses this model with probe_layer=14 and intervene_layer=8 - we'll use these exact values.

load_dotenv(dotenv_path=str(exercises_dir / ".env"))
HF_TOKEN = os.getenv("HF_TOKEN")
assert HF_TOKEN, "Please set HF_TOKEN in your chapter1_transformer_interp/exercises/.env file"

MODEL_NAME = "meta-llama/Llama-2-13b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=dtype,
    device_map="auto",
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

NUM_LAYERS = len(model.model.layers)
D_MODEL = model.config.hidden_size
# Layer choices from the geometry-of-truth repo config for llama-2-13b. The paper
# found truth representations are concentrated in early-to-mid layers, and identified
# these specific layers via patching experiments (Section 3, "group (b)").
PROBE_LAYER = 14
INTERVENE_LAYER = 8

print(f"Model: {MODEL_NAME}")
print(f"Layers: {NUM_LAYERS}, Hidden dim: {D_MODEL}")
print(f"Probe layer: {PROBE_LAYER}, Intervene layer: {INTERVENE_LAYER}")

Loading the datasets

The Geometry of Truth paper uses several carefully curated datasets of simple true/false statements. Each dataset has a statement column and a label column (1=true, 0=false).

From the paper:

"We find that the truth-related structure in LLM representations is much cleaner for our curated datasets than for our unstructured ones."

Let's load three of these curated datasets and examine them:

DATASET_NAMES = ["cities", "sp_en_trans", "larger_than"]

datasets = {}
for name in DATASET_NAMES:
    df = pd.read_csv(GOT_DATASETS / f"{name}.csv")
    datasets[name] = df
    print(f"\n{name}: {len(df)} statements ({df['label'].sum()} true, {(1 - df['label']).sum():.0f} false)")
    display(df.head(4))