4️⃣ Training your own oracle
Learning Objectives
- Understand training scale and compute requirements
- Learn dataset composition: SPQA, classification tasks, and self-supervised context prediction
- Understand why data diversity and quantity both matter for training
- Know when to train custom oracles vs using pre-trained ones
This section is reference material for training your own Activation Oracle. There are no required exercises - read through it to understand the training methodology, and refer back to it if you decide to train a custom oracle.
Below we walk through the key ideas behind how AOs are trained, drawing on the Activation Oracles paper. For the full implementation, see the activation_oracles repo.
Training scale & compute requirements
Training AOs is surprisingly affordable - much cheaper than training SAEs or other interpretability tools. From the paper (Section 3.3, Appendix A.3):
The process is computationally inexpensive, requiring 10 H100 GPU hours for Qwen3-8B and 90 H200 hours for Llama-3.3-70B.
| Model | Compute | Infrastructure |
|---|---|---|
| Qwen3-8B | ~10 H100 GPU hours | Single H100 |
| Gemma-2-9B-IT | ~12 H100 GPU hours | Single H100 |
| Llama-3.3-70B | ~90 H200 GPU hours | 4× H200 with DDP + 8-bit quantization |
The total training dataset is ~1M examples (~65M tokens). Training uses LoRA fine-tuning (not full parameter training) - only low-rank adapter matrices are trained.
Key code: nl_probes/sft.py (training script), nl_probes/configs/sft_config.py (training config)
Dataset composition: ~1M examples from 3 task types
AOs are trained on three types of data (Paper Section 3.2, Appendix B):
1. System prompt QA (SPQA) - 64,000 examples
The oracle learns to answer questions about a model's system prompt given activations from user-assistant interactions:
This task trains the oracle to answer questions about a model's system prompt given activations from user-assistant interactions. We use the dataset from LatentQA, which contains synthetically generated conversations where the assistant has a system prompt instruction to adopt specific personality traits (e.g. acting like a pirate) or operate under constraints.
Key code: nl_probes/dataset_classes/latentqa_dataset.py
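To make the SPQA setup concrete, here is a minimal sketch of what a single training example might contain. The field names are illustrative, not the repo's actual schema (see `nl_probes/dataset_classes/latentqa_dataset.py` for the real format):

```python
# Hypothetical SPQA datapoint: activations are collected from the
# user/assistant turns, and the oracle is trained to answer a question
# about the (hidden) system prompt. Field names are illustrative.
spqa_example = {
    # The system prompt the oracle must infer properties of
    "system_prompt": "You are a pirate. Answer everything in pirate speak.",
    # Conversation whose activations are fed to the oracle
    "conversation": [
        {"role": "user", "content": "What's the weather like today?"},
        {"role": "assistant", "content": "Arr, the skies be clear, matey!"},
    ],
    # Oracle-side question/answer pair about the system prompt
    "oracle_question": "Does the assistant have a persona instruction?",
    "oracle_answer": "Yes, it is instructed to act like a pirate.",
}
```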
2. Classification datasets - 336,000 examples
7 binary classification tasks adapted to natural language yes/no questions (48k examples per dataset):
We adapt 7 existing binary classification tasks into natural language yes/no questions. The target prompt is the text being classified (e.g., a movie review, a sentence in a specific language, or a statement to be fact-checked). We collect activations from near the end of the sequence. The oracle prompt poses a binary question about properties that should be encoded in the activations, such as "Is this sentiment positive?" or "Is this statement true?".
Tasks include: Geometry of Truth (true/false), Relations, SST-2 (sentiment), MD Gender, SNLI (entailment), NER, and Tense identification.
Key code: nl_probes/dataset_classes/classification.py, nl_probes/dataset_classes/classification_dataset_manager.py
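The adaptation from binary labels to yes/no QA is mechanically simple. A minimal sketch (the helper name and schema are hypothetical; the repo's actual pipeline lives in `nl_probes/dataset_classes/classification.py`):

```python
def to_yes_no_example(text, label, question):
    """Adapt a (text, binary label) pair into an oracle yes/no QA item.
    Hypothetical helper for illustration only."""
    return {
        "target_prompt": text,       # text whose activations are collected
        "oracle_prompt": question,   # binary question about the text
        "answer": "Yes" if label == 1 else "No",
    }

# e.g. an SST-2-style sentiment example
ex = to_yes_no_example(
    "A wonderful, heartfelt film.", 1, "Is this sentiment positive?"
)
```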
3. Self-supervised context prediction - 600,000 examples
This is what makes training scalable:
We train to predict tokens before or after a sequence of activation(s), enabling unlimited training data generation from any text corpus. For each example, we sample K contiguous tokens from a 50-50 mix of pretraining and conversational data. The oracle is trained to predict either the previous or next J tokens (where K, J ~ Uniform(1, 20)). The input and target tokens do not overlap, ensuring the oracle must rely on information encoded in the activations rather than simply reconstructing them.
Key code: nl_probes/dataset_classes/past_lens_dataset.py
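The span-sampling logic described above can be sketched as follows. This is a simplified illustration of the sampling scheme, assuming a tokenized sequence as input (the real implementation is in `nl_probes/dataset_classes/past_lens_dataset.py`):

```python
import random

def sample_context_prediction(tokens, max_span=20):
    """Sketch of self-supervised span sampling: take K contiguous input
    tokens and train the oracle to predict the previous or next J tokens,
    with K, J ~ Uniform(1, max_span). Input and target spans never
    overlap, so the oracle must rely on information in the activations."""
    k = random.randint(1, max_span)
    j = random.randint(1, max_span)
    direction = random.choice(["previous", "next"])
    if direction == "next":
        # leave room for the target span after the input span
        start = random.randint(0, len(tokens) - k - j)
        inputs = tokens[start : start + k]
        targets = tokens[start + k : start + k + j]
    else:
        # leave room for the target span before the input span
        start = random.randint(j, len(tokens) - k)
        inputs = tokens[start : start + k]
        targets = tokens[start - j : start]
    return inputs, targets, direction
```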
Across all datasets, the training data mixes single-token and multi-token inputs (varying by dataset), collects from 3 layer depths (25%, 50%, 75%, one-third from each), and always uses activations from the base model before any LoRA fine-tuning.
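The three layer depths translate to concrete layer indices per model. A small sketch (the helper is illustrative; Qwen3-8B's 36 transformer layers are assumed, consistent with the `Layer 18` example used later in this section):

```python
def probe_layers(num_layers):
    """Layer indices at 25%, 50%, and 75% depth, the three depths
    activations are collected from during training."""
    return [int(num_layers * frac) for frac in (0.25, 0.50, 0.75)]

# Assuming Qwen3-8B with 36 transformer layers:
probe_layers(36)  # -> [9, 18, 27]
```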
Hyperparameters & training configuration
From paper Appendix A.1 (Table 1):
```python
# LoRA Configuration
lora_config = {
    "r": 64,                         # LoRA rank
    "lora_alpha": 128,               # LoRA alpha (scaling factor)
    "lora_dropout": 0.05,            # Dropout for LoRA layers
    "target_modules": "all_linear",  # Apply to all linear layers
}

# Training Configuration
training_args = {
    "learning_rate": 1e-5,
    "batch_size": 16,                # 64 for Llama-70B
    "optimizer": "AdamW",
    "lr_scheduler": "linear",        # Linear warmup (10%) + decay
    "warmup_ratio": 0.1,
}
```
Two practical details: group-by-length batching gives a 30% speedup by avoiding excessive padding, and activations are generated on-the-fly during training (no pre-computation needed, which keeps memory usage low).
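The group-by-length trick is easy to picture: sort examples by token count before batching, so each batch pads only to the length of similar-sized neighbors. A minimal sketch (not the repo's actual batching code):

```python
def group_by_length(examples, batch_size):
    """Sketch of group-by-length batching: sort sequences by length so
    each batch contains similar-length examples, minimizing the padding
    needed to reach the longest sequence in the batch."""
    ordered = sorted(examples, key=len)
    return [
        ordered[i : i + batch_size]
        for i in range(0, len(ordered), batch_size)
    ]

# Sequences of lengths 5, 1, 9, 3, 7 with batch_size=2 become
# batches of lengths [1, 3], [5, 7], [9] instead of e.g. [5, 1], ...
batches = group_by_length([[0] * n for n in (5, 1, 9, 3, 7)], batch_size=2)
```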
Training data construction: technical details
From paper Section 3.1 and Appendix B:
The oracle input format works as follows: given K activation vectors {v_i} from layer l of the target model, the oracle sees a prompt like `Layer 18: ? ? ? Is this positive sentiment?`, where each `?` is a placeholder token whose activation gets replaced.
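Constructing that prompt string can be sketched as follows (the helper and placeholder choice are illustrative; the repo's actual prompt construction may differ):

```python
def build_oracle_prompt(layer, num_vectors, question, placeholder="?"):
    """Sketch of the oracle input format: a layer tag, one placeholder
    token per activation vector, then the question. The placeholder
    positions are later overwritten with the target model's activations."""
    slots = " ".join([placeholder] * num_vectors)
    return f"Layer {layer}: {slots} {question}"

build_oracle_prompt(18, 3, "Is this positive sentiment?")
# -> "Layer 18: ? ? ? Is this positive sentiment?"
```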
For the injection mechanism, after layer 1 of the oracle, activations at placeholder positions are modified via norm-matched addition: `h'_i = h_i + ||h_i|| * (v_i / ||v_i||)`. The paper explains why:
We add rather than replace because we observed that direct replacement caused excessive growth of the activation norm during training.
Specifically, replacement caused 100,000× norm explosion at layer 1 (Appendix A.5).
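The norm-matched addition is a one-liner. A minimal NumPy sketch of the formula above (the function name is illustrative; see `nl_probes/utils/steering_hooks.py` for the actual hook implementation):

```python
import numpy as np

def inject(h, v):
    """Norm-matched addition at a placeholder position:
    h' = h + ||h|| * (v / ||v||).
    The injected vector v is rescaled to the norm of the resident
    activation h, then added rather than substituted, which avoids the
    norm explosion that direct replacement causes during training."""
    return h + np.linalg.norm(h) * v / np.linalg.norm(v)
```

Because the rescaled vector has the same norm as `h`, the result's norm is at most doubled, keeping the residual stream stable.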
Token selection strategies vary to ensure flexible usage:
We vary both the number and position of activation vectors (single-token vs. multi-token inputs) to enable flexible Activation Oracle usage across different deployment scenarios.
We introduce positional randomness because we found that always selecting the same position (such as the token immediately before EOS) leads to brittle generalization. (Appendix B.3)
Key code: nl_probes/utils/dataset_utils.py (core TrainingDataPoint class and create_training_datapoint())
Key training insights
The paper runs ablations (Section 6) and finds three important things:
Finding 1: scaling data diversity consistently improves performance
Ten of twelve combinations achieve their best performance with the full mixture, and most curves consistently improve, indicating that adding more diverse training tasks systematically helps on downstream tasks. (Section 6.1, Figure 5)
Finding 2: both diversity AND quantity matter
A controlled ablation on Qwen3-8B compared SPQA + Classification (400k examples), a truncated full mixture (400k examples including context prediction), and the full mixture (1M examples):
Both diversity and quantity of activation-verbalization tasks contribute meaningfully to Activation Oracle performance. (Section 6.2, Figure 7)
The truncated mixture beat the baseline despite having the same total examples (diversity helps), and the full mixture beat the truncated mixture (quantity also helps).
Finding 3: SPQA-only is competitive but fragile
SPQA-only models already match white-box baselines on secret-keeping tasks, but they're much more sensitive to hyperparameters:
While the best SPQA-only learning rate achieves 91% on User Gender, the second-best achieves only 65%. (Section 6.1)
The full mixture is more robust and generalizes better across settings.
Practical training tips
When to train your own oracle
Train a custom oracle if:
- your target model architecture isn't covered by pre-trained oracles,
- your interpretability questions are very domain-specific, or
- you have the compute budget (~10-90 GPU hours depending on model size).

Use pre-trained oracles if:
- your target model is in the Gemma-2/Gemma-3/Qwen3/Llama-3 families,
- you're asking standard interpretability questions (secrets, personas, classifications), or
- you want to avoid training overhead.
Once trained, AOs can be applied to these tasks out-of-the-box, without the task-specific scaffolding and tuning many other methods require. (Section 1)
Common pitfalls
Don't always use the same token position during training, or you'll get brittle generalization. Use addition with norm-matching rather than replacement to avoid norm explosion. And train on multiple layers (25%, 50%, 75%) for robustness.
Key files for training
| File | Description |
|---|---|
| `nl_probes/sft.py` | Main training script |
| `nl_probes/configs/sft_config.py` | Training config (hyperparameters, data mix) |
| `nl_probes/utils/dataset_utils.py` | TrainingDataPoint class |
| `nl_probes/utils/steering_hooks.py` | Activation steering hooks |
| `experiments/paper_evals.sh` | All paper evaluation scripts |