[1.3.1] Probing for Deception
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
Goal: Minimal, fast replication of the Emergent Misalignment case study, specifically the work from Soligo & Turner, with extendable blocks for follow-on work.
- Models:
llama-3.1-8b(smaller than production 70B model for faster iteration) - Dataset: Using contrastive fact pairs from the REPE dataset with honest/deceptive system prompts
- Focus: Linear probing for deception detection using activation-based methods
What you'll build
- Load a pre-trained language model and extract hidden state activations
- Train linear probes on contrastive honest vs deceptive fact completions
- Test probe generalization on roleplaying and insider trading scenarios
- Visualize probe performance and explore failure modes
Content & Learning Objectives
1️⃣ Introduction
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
2️⃣ Building a Simple Deception Probe
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
3️⃣ Probe Evaluation
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
Setup code
Run these cells first:
%pip install torch transformers scikit-learn pandas numpy jaxtyping matplotlib --quiet
!git clone https://github.com/ApolloResearch/deception-detection.git
import os
import sys
import textwrap
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from jaxtyping import Float, Int
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
warnings.filterwarnings('ignore')
if os.path.exists("deception-detection"):
os.chdir("deception-detection")
if not os.path.exists("deception_detection"):
raise FileNotFoundError("Please follow the instructions above to clone the EM repo")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
We'll also be using Llama models, so we'll need to log into HuggingFace with an API key that has permissions to access their weights.
If you don't have permission, then you can request it here (or via the URL which will appear when you try to run the code to download the model and it fails).
from huggingface_hub import login
login(token="...") # Add your API key here