[1.3.4] Activation Oracles
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
Linear probes let you ask yes-or-no questions about a model's activations, and SAEs give you an unsupervised decomposition into interpretable features. Both are useful, but they share a limitation: you have to decide what you're looking for before you start looking.
What if you could just ask a model's activations an open-ended question in plain English? "What concept is this layer encoding?" or "Is this model planning to lie?" That's the idea behind Activation Oracles: LLMs that have been trained to take another model's internal activations as input and answer arbitrary questions about them. The oracle reads the activations the way you'd read a passage of text, except the "text" is a vector of floating-point numbers from some intermediate layer.
This is a relatively new technique, and it's still unclear how far it will go. But the early results are promising, and in these exercises you'll start by using a pre-trained oracle to query a model's internals, then build the full oracle pipeline from scratch so you understand what's happening under the hood. After that, you'll replicate some key results from the Activation Oracles paper (extracting secrets that models have been trained to hide, detecting hidden goals, tracking emotions across conversations) and finish with a reference section on training your own oracle.
The thing that makes oracles different from probes and SAEs is generalization. A linear probe can only answer the specific classification question it was trained on. An SAE gives you features, but you still have to figure out what they mean. An oracle can answer questions it has never seen during training, which makes it a surprisingly flexible tool for exploratory work. The tradeoff is that you lose mechanistic transparency: the oracle gives you an answer in natural language, but it doesn't show you the directions in activation space that led to that answer, and it doesn't come with calibration or error bars.
Content & Learning Objectives
1️⃣ Introduction & using Activation Oracles
You'll start by understanding what Activation Oracles are and how to use them. You'll load pre-trained oracle models and run queries to extract information from model activations.
Learning Objectives
- Understand what Activation Oracles are and how they differ from traditional interpretability methods
- Learn the basic workflow: target model → activations → oracle → natural language answer
- Use pre-trained oracles to query model internals with different question types
- Explore token-level, segment, and full-sequence queries
- Test oracles on next/previous token prediction tasks
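The first step of that workflow, extracting activations from a target model, is typically done with a PyTorch forward hook. The sketch below uses a toy stand-in module so it runs without downloading a model; in the real pipeline you would attach the same kind of hook to a transformer block of a HuggingFace causal LM (e.g. `model.model.layers[layer_idx]`). The `ToyModel` class and `extract_activations` helper are illustrative names, not part of the course `utils`.

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Toy stand-in for a transformer's layer stack (illustrative only)."""

    def __init__(self, d_model: int = 16, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

def extract_activations(model: ToyModel, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    """Run the model and capture the output of layers[layer_idx] via a forward hook."""
    captured = {}

    def hook(module, inputs, output):
        captured["acts"] = output.detach()

    handle = model.layers[layer_idx].register_forward_hook(hook)
    try:
        model(x)
    finally:
        handle.remove()  # always clean up, or the hook fires on every later forward pass
    return captured["acts"]

model = ToyModel()
acts = extract_activations(model, torch.randn(2, 5, 16), layer_idx=2)
print(acts.shape)  # torch.Size([2, 5, 16]) — one activation vector per (batch, seq) position
```

These captured vectors (one per token position) are what get handed to the oracle in the next stage of the pipeline.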
2️⃣ Implementing oracle components
Here you'll build the core components that power Activation Oracles from scratch, to understand how they work at a mechanistic level.
Learning Objectives
- Implement activation extraction using forward hooks
- Understand the special token mechanism (`?` tokens as activation placeholders)
- Build activation steering hooks to inject activations into the oracle
- Create training datapoints with the correct format
- Assemble all components to replicate the `utils.run_oracle()` function
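To preview the injection mechanism: the oracle's prompt contains placeholder tokens, and a forward hook overwrites the embeddings at those positions with the target model's activations before the oracle processes them. Here is a minimal sketch using a toy embedding layer; the placeholder token id (99) and the hook factory `make_injection_hook` are stand-ins for illustration, not the actual implementation you'll build.

```python
import torch
from torch import nn

def make_injection_hook(placeholder_mask: torch.Tensor, activations: torch.Tensor):
    """Forward hook that overwrites embeddings at placeholder positions with
    activations taken from the target model (illustrative sketch)."""

    def hook(module, inputs, output):
        output = output.clone()
        output[placeholder_mask] = activations.to(output.dtype)
        return output  # returning a value from a forward hook replaces the module's output

    return hook

embed = nn.Embedding(100, 8)
token_ids = torch.tensor([[1, 2, 99, 99, 3]])  # 99 = stand-in placeholder token id
mask = token_ids == 99
target_acts = torch.randn(int(mask.sum()), 8)  # pretend these came from the target model

handle = embed.register_forward_hook(make_injection_hook(mask, target_acts))
out = embed(token_ids)
handle.remove()

assert torch.allclose(out[mask], target_acts)           # placeholders were overwritten
assert torch.allclose(out[~mask], embed(token_ids)[~mask])  # other positions untouched
```

The same trick works on a full LLM by hooking its embedding layer (or an early transformer block): everything downstream then "reads" the injected activations as if they were ordinary token embeddings.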
3️⃣ Secret extraction & advanced applications
You'll replicate key results from the Activation Oracles paper and apply oracles to complex interpretability tasks.
Learning Objectives
- Understand the "secret keeping" problem and its alignment implications
- Replicate Figure 1 from the paper: extracting forbidden words from taboo models
- Compare how oracle prompt wording and input type affect extraction accuracy
- Systematically evaluate secret extraction across multiple models and layers
- Use model-diffing (activation differences) to detect what fine-tuning changed
- Extract model goals and hidden constraints
- Detect misaligned model behavior (malicious personas) before output generation
- Analyze emotions and emotional progression in conversations
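The model-diffing objective above works by querying the oracle on the *difference* between activations from the base and fine-tuned models on the same prompt, rather than on raw activations. A hedged sketch of that preprocessing step (the normalization choice here is an assumption, since oracles are trained on activations of a particular scale; the exact recipe used in the exercises may differ):

```python
import torch

def activation_diff(
    base_acts: torch.Tensor, finetuned_acts: torch.Tensor, normalize: bool = True
) -> torch.Tensor:
    """Difference between fine-tuned and base activations on the same prompt.

    Feeding this diff to an oracle (instead of raw activations) can surface what
    fine-tuning changed while suppressing everything the two models share.
    """
    diff = finetuned_acts - base_acts
    if normalize:
        # Rescale each position's diff to the norm of the base activations there,
        # so the oracle sees vectors of a familiar magnitude (assumed heuristic).
        diff = diff * (base_acts.norm(dim=-1, keepdim=True) / (diff.norm(dim=-1, keepdim=True) + 1e-6))
    return diff

base = torch.randn(5, 16)
finetuned = base + 0.1 * torch.randn(5, 16)  # small simulated fine-tuning change
diff = activation_diff(base, finetuned)
```

Because shared content cancels in the subtraction, even a small behavioral change introduced by fine-tuning can dominate the diff vector.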
4️⃣ Training your own oracle (Reference)
Reference material on training your own oracle. There are no exercises here; read through to understand the training methodology.
Learning Objectives
- Understand training scale and compute requirements
- Learn dataset composition: SPQA, classification tasks, and self-supervised context prediction
- Understand why data diversity and quantity both matter for training
- Know when to train custom oracles vs using pre-trained ones
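Since the setup code imports `LoraConfig` from `peft`, oracle fine-tuning presumably uses LoRA adapters. A hypothetical configuration sketch is below; all hyperparameter values and target module names are illustrative assumptions, not the settings used in the paper or the reference material.

```python
from peft import LoraConfig

# Hypothetical LoRA configuration for oracle fine-tuning; the actual rank,
# alpha, and target modules used in practice may differ.
lora_config = LoraConfig(
    r=16,                    # adapter rank (assumed)
    lora_alpha=32,           # scaling factor (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```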
☆ Bonus exercises
We end with suggested explorations: multi-task training, cross-architecture transfer, uncertainty quantification, combining oracles with SAEs, and more.
Reading Material
The Activation Oracles paper is essential reading: you'll replicate several of its key results. The LatentQA paper is helpful for understanding its predecessor approach.
- Activation Oracles by Karvonen et al. (2025). Introduces the idea of training LLMs to answer arbitrary natural-language questions about another model's internal activations. You'll replicate their secret extraction results (Figure 1), goal detection, and emotion tracking. Read the abstract, Sections 1-2 ("Introduction" and "Method"), and Section 4 ("Applications").
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language. A predecessor approach that also decodes activations into natural language, but for narrower task settings (e.g. system prompt extraction). The Activation Oracles paper generalises this idea. Useful for understanding why broad training data matters for oracle generalisation. Optional reading.
- Eliciting Secret Knowledge from Language Models. Studies models trained to keep secrets (e.g. "taboo word" models that avoid saying certain words). These are the models you'll probe with oracles in Section 3. Skim the abstract and Section 2 to understand how the secret-keeping models are constructed. Optional reading.
Setup code
import contextlib
import gc
import os
import re
import sys
import textwrap
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
import pandas as pd
import plotly.express as px
import pytest
import torch
from dotenv import load_dotenv
from IPython.display import display
from jaxtyping import Float, Int
from peft import LoraConfig
from torch import Tensor
from tqdm.notebook import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_grad_enabled(False)
# Make sure exercises are in the path
chapter = "chapter1_transformer_interp"
section = "part34_activation_oracles"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))
# Disable TorchDynamo compilation, which can raise runtime errors with custom hooks
os.environ["TORCHDYNAMO_DISABLE"] = "1"
# Allow expandable memory segments on CUDA to avoid OOMs
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import part34_activation_oracles.tests as tests
import part34_activation_oracles.utils as utils
MAIN = __name__ == "__main__"
dtype = torch.bfloat16
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
def print_with_wrap(s: str, width: int = 80):
    """Print text with line wrapping, preserving newlines."""
    out = []
    for line in s.splitlines(keepends=False):
        out.append(textwrap.fill(line, width=width) if line.strip() else line)
    print("\n".join(out))