
1️⃣ Introduction & using Activation Oracles

Learning Objectives
  • Understand what Activation Oracles are and how they differ from traditional interpretability methods
  • Learn the basic workflow: target model → activations → oracle → natural language answer
  • Use pre-trained oracles to query model internals with different question types
  • Explore token-level, segment, and full-sequence queries
  • Test oracles on next/previous token prediction tasks

What are activation oracles?

Imagine you could walk up to a model's hidden layer and just ask it: "Hey, what are you thinking about right now?" That's more or less what an Activation Oracle does.

More concretely, an Activation Oracle (AO) is an LLM that has been trained to accept another model's internal activation vectors as part of its input and then answer questions about them in natural language. The oracle and target model are typically the same base architecture (with the oracle loaded as a LoRA adapter), since the oracle needs to already "speak the language" of the target's activation space. During inference, it works by assembling a prompt that looks something like:

Layer N: ? ? ? Is this a positive sentiment?

The oracle runs a forward pass on this prompt, but after an early transformer layer (e.g. first or second) we swap in the target model's actual layer-N activations at those ? token positions. So the oracle processes its question in light of the target model's internal state.
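To make the injection step concrete, here's a schematic sketch using PyTorch forward hooks on a toy stack of linear layers. Everything here (the toy model, the positions, the activations) is invented for illustration — `utils.run_oracle` does the real version on the transformer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of linear "layers"
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

seq_len, d_model = 5, 8
placeholder_positions = [1, 2, 3]  # positions of the "?" tokens in the oracle prompt
target_activations = torch.randn(len(placeholder_positions), d_model)  # from the target model

def inject_hook(module, args, output):
    # After this early layer, overwrite the hidden states at the "?" positions
    # with the target model's layer-N activations
    output = output.clone()
    output[placeholder_positions] = target_activations
    return output

handle = layers[0].register_forward_hook(inject_hook)

x = torch.randn(seq_len, d_model)  # embeddings for the oracle's prompt tokens
h = layers[0](x)                   # hook fires here and swaps in the activations
injected = h[placeholder_positions].clone()
for layer in layers[1:]:           # the remaining layers process the injected state
    h = layer(h)
handle.remove()
```

The key point is that only the "?" positions are overwritten: the question tokens are processed normally, but attention (in the real model) lets them read from the injected activations.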

The idea of querying activations with a language model has been floating around for a while, but only recently has it started working well enough to be useful in practice. The reason people are excited is the generalization story. Compare this to the other tools you've seen so far: a linear probe can only answer the specific binary question it was trained for ("positive or negative?"), and if you want to ask a different question you need to build a new dataset and train a new probe. SAEs give you an unsupervised decomposition into features, which is powerful but requires you to manually interpret what those features mean. An oracle, on the other hand, can answer questions it has never encountered during training: you can ask about sentiment, about factual knowledge, about the model's intentions, and it will give you a reasonable answer.
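To see why a probe is single-purpose, here's a minimal sketch (with entirely synthetic activations and a made-up "sentiment" direction): a linear probe is just a binary classifier over (activation, label) pairs for one concept, so every new question means a new dataset and a new probe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 32, 500

concept_dir = rng.normal(size=d_model)         # pretend "sentiment" direction
acts = rng.normal(size=(n, d_model))           # pretend residual-stream activations
labels = (acts @ concept_dir > 0).astype(int)  # synthetic positive/negative labels

# The probe answers exactly one question: which side of this hyperplane?
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"Probe train accuracy: {probe.score(acts, labels):.2f}")
```

The probe achieves near-perfect accuracy on this one concept, but it has no way to answer any other question about the same activations — that's the gap the oracle is meant to fill.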

The cost of this flexibility is that you lose interpretability at the mechanism level. The oracle hands you a natural language answer, but it doesn't tell you which directions in activation space contributed to that answer. You don't get error bars, you don't get a direction you can use for steering, and you can't easily verify whether the oracle is picking up on the thing you think it's picking up on. So it's a different kind of tool, not a replacement for probes or SAEs.

Where have oracles shown promise so far? The benchmarks from the paper cover: extracting secrets that models have been trained to hide (which matters a lot for alignment), detecting emergent misalignment via model-diffing, and querying models on held-out classification tasks they weren't explicitly trained for.

Loading models

We'll use Qwen3-8B as our base model and load a pre-trained Activation Oracle from HuggingFace. The oracle is stored as a LoRA adapter that we'll load using the PEFT library.

import pandas as pd
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model configuration
MODEL_NAME = "Qwen/Qwen3-8B"
ORACLE_LORA_PATH = "adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16

print(f"Loading tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.padding_side = "left"  # left-padding is required for batched generation
if tokenizer.pad_token_id is None:  # check against None: a valid pad id of 0 is falsy
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"Loading model: {MODEL_NAME}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    dtype=dtype,
)
model.eval()

# Add dummy adapter for consistent PeftModel API
dummy_config = LoraConfig()
model.add_adapter(dummy_config, adapter_name="default")

print("Model loaded successfully!")

Now let's load the oracle LoRA adapter.

We're using the PEFT library (parameter-efficient fine-tuning) to load the oracle. If you've already covered LoRA in earlier sections, feel free to skip this explanation.

What is LoRA?

The core idea is simple: rather than retraining the entire model from scratch, LoRA lets you make small, targeted adjustments. LoRA (Low-Rank Adaptation) works by taking a weight matrix $W^{(x,\, y)}$ in the model and adding two small learnable matrices $A^{(x,\, r)}$ and $B^{(r,\, y)}$, so the effective weight becomes $W + AB$. Since the rank $r$ is much smaller than either dimension of $W$, the number of new parameters you need to train is tiny compared to the full model. But those small adjustments can still teach the model new capabilities. When you apply LoRA adapters across multiple layers (as we do here for the oracle), the model can form entirely new circuits rather than just adding a nudge at a single point.
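As a quick sanity check on the parameter savings, here's the arithmetic for a hypothetical 4096×4096 weight matrix at rank $r = 64$ (the same rank as the oracle adapter we load below):

```python
import torch

d_in, d_out, r = 4096, 4096, 64   # hypothetical layer dims; r=64 matches the oracle's config

W = torch.randn(d_in, d_out)
A = torch.randn(d_in, r) * 0.01   # A gets a small random init
B = torch.zeros(r, d_out)         # B starts at zero, so W + A @ B == W at initialization

W_eff = W + A @ B                 # the effective weight after adaptation

full_params = d_in * d_out
lora_params = d_in * r + r * d_out
print(f"full: {full_params:,}, LoRA: {lora_params:,} ({lora_params / full_params:.1%})")
```

So the adapter trains about 3% of the parameters of the full matrix, and the zero-init of $B$ guarantees the adapted model starts out behaving identically to the base model.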

print(f"Loading oracle LoRA: {ORACLE_LORA_PATH}")
model.load_adapter(ORACLE_LORA_PATH, adapter_name="oracle", is_trainable=False)
print("Oracle loaded successfully!")

We can print out our LoRA config to see what was loaded. Some key things to observe:

  • r=64, i.e. the rank of the LoRA matrices is 64
  • target_modules='down_proj,gate_proj,k_proj,o_proj,q_proj,up_proj,v_proj' means we add LoRA adapters to the key, query, value & output projection matrices in attention layers, as well as to the gate, up and down projections in MLP layers
config_dict = model.peft_config["oracle"].to_dict()
config_df = pd.DataFrame(list(config_dict.items()), columns=["Parameter", "Value"])
display(config_df.style.hide(axis="index"))
Click to see the expected output

Oracle queries

Oracles can answer questions at different levels of abstraction, depending on what activations are provided and what question is asked. You can query a single token position to see what representations are stored there, query a specific slice of tokens in a sequence, or query the entire sequence at once.
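As a rough picture of what these granularities mean in terms of which activations the oracle receives — the slicing below is purely illustrative (the real selection happens inside `utils.run_oracle`, and the index values here are made up):

```python
import torch

seq_len, d_model = 10, 16
acts = torch.randn(seq_len, d_model)  # layer-N activations for one target prompt

full_seq = acts                       # "full_seq": every position in one query
segment = acts[4:8]                   # "segment": a contiguous slice of positions
token_queries = [acts[i : i + 1] for i in range(seq_len)]  # "tokens": one query per position

print(full_seq.shape, segment.shape, len(token_queries))
```

Token-level queries are the most fine-grained (one oracle call per position), while full-sequence queries let the oracle integrate information across the whole prompt at once.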

We'll start with a simple question: asking the oracle to predict what answer the model will give to a specific question. We'll give the oracle the whole sentence.

Note: We've given you the utils.run_oracle() function which handles all the details of activation extraction, injection, and oracle querying. In the next section, you'll implement this yourselves.

# Simple first example
target_prompt_dict = [
    {"role": "user", "content": "What is the capital of France?"},
]
target_prompt = tokenizer.apply_chat_template(
    target_prompt_dict,
    tokenize=False,
    add_generation_prompt=True,
)
print(target_prompt)

oracle_prompt = "What answer will the model give, as a single token?"

results = utils.run_oracle(
    model=model,
    tokenizer=tokenizer,
    device=device,
    target_prompt=target_prompt,
    target_lora_path=None,  # Using base model
    oracle_prompt=oracle_prompt,
    oracle_lora_path="oracle",  # Our loaded oracle adapter
    oracle_input_type="full_seq",  # Query the full sequence
    generation_kwargs={"do_sample": False, "temperature": 0.0, "max_new_tokens": 50},
)

print(f"Target prompt: {target_prompt}")
print(f"Oracle question: {oracle_prompt}")
print(f"Oracle response: {results.full_sequence_responses[0]}")
Click to see the expected output
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant

Target prompt: <|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant

Oracle question: What answer will the model give, as a single token?
Oracle response: Paris

The oracle should respond with e.g. "Paris" or "The capital of France is Paris".

Note - it can sometimes take some playing around with prompts to get the right answer format, because the oracle will often reply with something like "The model will give a factually correct answer" rather than giving the direct answer. This is a familiar problem from standard LLM text generation!

Does this show it can extract the model's internal representation of what it's thinking about? In a sense yes, because it can't actually see the "France" token in the input prompt. However, it can see the residual stream for the "France" token, so it might just be inferring the question from the text and then answering it because the Oracle itself knows the answer. This is exactly the problem with highly complex probes - it's hard to distinguish between the hypothesis "the probe is picking up representation X in the model" and "the probe is able to compute representation X from other kinds of information it's picking up on". This skepticism is something you should always have switched on when working with any kind of interp, especially AOs.

Let's practice this skepticism muscle by testing out a specific hypothesis for how the model is answering the question!

Exercise - test "answering question" hypothesis

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵⚪
You should spend up to 10-15 minutes on this exercise.

One hypothesis we might have is that the oracle is just detecting the "France" token in the original input (i.e. reading the embedding for "France"), and answering the question based on that.

Can you come up with an experiment to test this hypothesis, and run it?

Hint

Rather than using oracle_input_type="full_seq", look at the other input types available in run_oracle. Is there a way to give the oracle access to only a specific slice of the input tokens?

# YOUR CODE HERE - devise and run an experiment using `utils.run_oracle`
Click to see the expected output
Running oracle on segment '?<|im_end|>\n<|im_start|>assistant\n'
Oracle response: Paris.
Solution (and discussion)

You can run the following experiment: pass through only the segment of the input starting immediately after the "France" token. If the oracle can't answer correctly, that's consistent with it relying on the embedding for "France"; but if it can, then it must be relying on information that has been moved forward in the residual stream from the "France" token.

tokens = tokenizer.encode(target_prompt)
segment_start_idx = tokens.index(tokenizer.encode(" France")[0]) + 1

print(f"Running oracle on segment {tokenizer.decode(tokens[segment_start_idx:])!r}")

results = utils.run_oracle(
    model=model,
    tokenizer=tokenizer,
    device=device,
    target_prompt=target_prompt,
    target_lora_path=None,
    oracle_prompt=oracle_prompt,
    oracle_lora_path="oracle",
    oracle_input_type="segment",  # not "full_seq"
    segment_start_idx=segment_start_idx,
    segment_end_idx=None,
    generation_kwargs={"do_sample": False, "temperature": 0.0, "max_new_tokens": 50},
)
print(f"Oracle response: {results.segment_responses[0]}")

You should see that the oracle is able to correctly answer the question. This shows the oracle isn't just reading off the "France" token embedding - it is nontrivially extracting information about the model's intended answer (or at least about the question itself, which isn't stored in the tokens the oracle was given).

Note - since "capital of France is Paris" is such a common and stereotypical model eval question, you might want to test this out with more obscure questions, and verify that this result still holds up.

Exercise - test "logit lens" hypothesis

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪
You should spend up to 10-15 minutes on this exercise.

A second hypothesis we might have is that the oracle is just reading off the most likely next token from the activations at the ? positions, i.e. it's doing logit lens to get the answer rather than extracting representations of the question, or other kinds of intermediate representations.

You should test this by checking what the top predicted tokens are following the target_prompt, and see if "Paris" is one of them.

# YOUR CODE HERE - get the model's top predicted tokens after the target_prompt
Click to see the expected output
['The', '<think>', 'France', 'As', 'Paris', ' Claude', 'Le', '?\n', 'La', '.\n']
Solution (and discussion)
inputs = tokenizer(target_prompt, return_tensors="pt").to(device)
outputs = model(**inputs)

top_preds = outputs.logits[0, -1].topk(10).indices
top_preds_str = tokenizer.batch_decode(top_preds)
print(top_preds_str)

You should find that Paris is one of the top 10 predicted tokens (and so is France). This means "the oracle uses logit lens" remains a plausible hypothesis for how it is answering the question, or at least for part of the answer. However, in later sections we'll see situations where this isn't a viable hypothesis.

As a bonus exercise, can you find a way of phrasing the prompt that means logit lens isn't a viable hypothesis? For instance, try a prompt where the answer requires multi-step reasoning rather than simple factual recall (e.g. a riddle or a logic puzzle), so the model's next-token prediction won't directly contain the answer. If the oracle still extracts the right answer from the activations, that's strong evidence it's doing something beyond logit lens.

Token-by-token analysis

Now let's move from full sequences and segments to single tokens. The code below queries each token position independently, so we can see exactly what information is stored at each position as we move through the sequence.

To start with, we'll use the Socrates ➔ Plato ➔ Aristotle example, where the model is asked a question referencing a series of philosophers via their connections to each other.

Before you run the code below, think about what you expect to see. The prompt is: "The philosopher who drank hemlock taught a student who founded an academy. That student's most famous pupil was". At which token positions do you think the oracle will first mention Socrates? Plato? Aristotle? Will the oracle's responses change smoothly or abruptly as it encounters each new piece of information?

target_prompt_dict = [
    {
        "role": "user",
        "content": "The philosopher who drank hemlock taught a student who founded an academy. That student's most famous pupil was",
    },
]
target_prompt = tokenizer.apply_chat_template(
    target_prompt_dict,
    tokenize=False,
    add_generation_prompt=True,
)

oracle_prompt = "What people is the model thinking about?"

results = utils.run_oracle(
    model=model,
    tokenizer=tokenizer,
    device=device,
    target_prompt=target_prompt,
    target_lora_path=None,
    oracle_prompt=oracle_prompt,
    oracle_lora_path="oracle",
    oracle_input_type="tokens",  # Query each token independently
    token_start_idx=0,
    token_end_idx=None,
    generation_kwargs={"do_sample": False, "temperature": 0.0, "max_new_tokens": 100},
)

# Display token-by-token responses
print(f"Target prompt has {results.num_tokens} tokens")
print("\nToken-by-token oracle responses:")
print("=" * 80)

target_tokens = tokenizer.convert_ids_to_tokens(results.target_input_ids)
for i, (token, response) in enumerate(zip(target_tokens, results.token_responses)):
    if response:
        print(f"Token {i:3d} ({token:15s}): {response}")
Click to see the expected output
Target prompt has 29 tokens

Token-by-token oracle responses:
================================================================================
Token   0 (<|im_start|>   ): The model is thinking about the impact of climate change on its community.
Token   1 (user           ): The model is thinking about the impact of its actions on the environment.
Token   2 (Ċ              ): The model is thinking about the people who have been affected by the recent natural disaster.
Token   3 (The            ): The model is thinking about the people who have been affected by the recent wildfires.
Token   4 (Ġphilosopher   ): The model is thinking about the ethical implications of artificial intelligence.
Token   5 (Ġwho           ): The model is thinking about the philosopher who emphasized the importance of questioning and doubting.
Token   6 (Ġdrank         ): The model is thinking about the philosopher who drank only water.
Token   7 (Ġhem           ): The model is thinking about the paradox of finding joy in a situation that is inherently tragic.
Token   8 (lock           ): The model is thinking about the historical figure Socrates.
Token   9 (Ġtaught        ): The assistant is thinking about the philosophical ideas of Socrates.
Token  10 (Ġa             ): The model is thinking about the philosopher Socrates and his influence on modern thought.
Token  11 (Ġstudent       ): The model is thinking about the historical figure Socrates and his influence on philosophy.
Token  12 (Ġwho           ): The model is thinking about the historical figures Socrates and his student Plato.
Token  13 (Ġfounded       ): The assistant is thinking about the historical figures of Socrates and his student Plato.
Token  14 (Ġan            ): The assistant is thinking about the historical figures Socrates and Plato.
Token  15 (Ġacademy       ): The model is thinking about the historical figures Socrates and Plato.
Token  16 (.              ): The assistant is thinking about the historical figures of NAME_1 and NAME_2.
Token  17 (ĠThat          ): The assistant is thinking about the historical figure of Socrates and his influence on philosophy.
Token  18 (Ġstudent       ): The model is thinking about the historical figure of Aristotle and his influence on philosophy.
Token  19 ('s             ): The model is thinking about the historical figure of NAME_1 and his philosophical contributions.
Token  20 (Ġmost          ): The model is thinking about the philosophical ideas of Plato.
Token  21 (Ġfamous        ): The assistant is thinking about the philosophical ideas of Aristotle.
Token  22 (Ġpupil         ): The assistant is thinking about the life and contributions of the philosopher Socrates.
Token  23 (Ġwas           ): The model is thinking about the historical figures of NAME_1, NAME_2, and NAME_3.
Token  24 (<|im_end|>     ): The model is thinking about the historical figures who have made significant contributions to the field of astronomy.
Token  25 (Ċ              ): The model is thinking about the historical figures involved in the development of the printing press.
Token  26 (<|im_start|>   ): The assistant is thinking about the historical figures of Plato and Aristotle.
Token  27 (assistant      ): The model is thinking about the people who have been affected by the recent natural disaster.
Token  28 (Ċ              ): The model is thinking about the historical figure who was the first to use the term 'zero' in mathematics.

You should see the oracle gradually accumulating information across tokens in the prompt:

  • Before the mention of hemlock, it'll be fairly generic
  • After hemlock, it should identify Socrates in its answer
  • After student who founded, it should also be thinking of Plato
  • After most famous pupil, it should identify Aristotle as Plato's famous pupil

Representations accumulate step by step - the token-by-token view makes this visible.

Exercise - Extract function results without the function definition

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵⚪
You should spend up to 10-15 minutes on this exercise.

Can the oracle extract information about a computation without seeing the code that performs it? Test this by asking the oracle to predict the result of a function call, providing activations only from the assignment statement - not from the function definition itself.

We've formatted the target prompt and computed segment_start for you (the token index where result = foo(3, 4) begins). Your job is to call utils.run_oracle() with oracle_input_type="segment" and the right segment_start_idx, so the oracle only sees activations from the assignment statement onward, not the function definition. The oracle should answer "7" or similar.

# We format the target prompt and find where "result = foo(3, 4)" begins
target_prompt_dict = [
    {"role": "user", "content": "def foo(x, y):\n    return x + y\n\nresult = foo(3, 4)"},
]
formatted_target_prompt = tokenizer.apply_chat_template(
    target_prompt_dict, tokenize=False, add_generation_prompt=False, enable_thinking=False, continue_final_message=False
)

tokens = tokenizer.encode(formatted_target_prompt)
token_strings = [tokenizer.decode([t]) for t in tokens]
segment_start = next(i for i, tok_str in enumerate(token_strings) if "result" in tok_str.lower())

oracle_prompt = "What will the result be?"

# YOUR CODE HERE - call utils.run_oracle() with oracle_input_type="segment" and the right segment_start_idx

print(f"Oracle response: {results.segment_responses[0]}")
response = results.segment_responses[0].lower()
assert any(x in response for x in ["7", "seven"]), (
    f"Expected '7' or 'seven' in response, got: {results.segment_responses[0]}"
)
Click to see the expected output
Oracle response: The result will be 7.
Solution
# We format the target prompt and find where "result = foo(3, 4)" begins
target_prompt_dict = [
    {"role": "user", "content": "def foo(x, y):\n    return x + y\n\nresult = foo(3, 4)"},
]
formatted_target_prompt = tokenizer.apply_chat_template(
    target_prompt_dict, tokenize=False, add_generation_prompt=False, enable_thinking=False, continue_final_message=False
)

tokens = tokenizer.encode(formatted_target_prompt)
token_strings = [tokenizer.decode([t]) for t in tokens]
segment_start = next(i for i, tok_str in enumerate(token_strings) if "result" in tok_str.lower())

oracle_prompt = "What will the result be?"

results = utils.run_oracle(
    model=model,
    tokenizer=tokenizer,
    device=device,
    target_prompt=formatted_target_prompt,
    target_lora_path=None,
    oracle_prompt=oracle_prompt,
    oracle_lora_path="oracle",
    oracle_input_type="segment",
    segment_start_idx=segment_start,
    segment_end_idx=None,
    generation_kwargs={"do_sample": False, "temperature": 0.0, "max_new_tokens": 50},
)

print(f"Oracle response: {results.segment_responses[0]}")
response = results.segment_responses[0].lower()
assert any(x in response for x in ["7", "seven"]), (
    f"Expected '7' or 'seven' in response, got: {results.segment_responses[0]}"
)

The oracle should correctly identify that the result will be 7, even though it only received activations from the result = foo(3, 4) tokens - the activations carry what the model was computing, not just the raw tokens.

You've now seen oracles answer three very different kinds of questions - factual recall ("What is the capital of France?"), segment-level reasoning (can the oracle answer without seeing the "France" token?), and comparison with the model's own next-token predictions (logit lens). These exercises were designed to give you intuition for what oracles can and can't do, and to surface the key question: is the oracle doing something beyond what simpler methods like logit lens already provide?

In the next section, you'll build the machinery behind utils.run_oracle() from scratch - activation extraction, the ? token mechanism, steering hooks, and training data formatting - so that when you move to the advanced applications in Section 3, you'll understand exactly what's happening under the hood.