[1.3.1] Probing for Deception

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Goal: Minimal, fast replication of the Emergent Misalignment case study, specifically the work from Soligo & Turner, with extendable blocks for follow-on work.

  • Models: llama-3.1-8b (smaller than production 70B model for faster iteration)
  • Dataset: Using contrastive fact pairs from the REPE dataset with honest/deceptive system prompts
  • Focus: Linear probing for deception detection using activation-based methods

What you'll build

  1. Load a pre-trained language model and extract hidden state activations
  2. Train linear probes on contrastive honest vs deceptive fact completions
  3. Test probe generalization on roleplaying and insider trading scenarios
  4. Visualize probe performance and explore failure modes

Content & Learning Objectives

1️⃣ Introduction

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

2️⃣ Building a Simple Deception Probe

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

3️⃣ Probe Evaluation

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

Setup code

Run these cells first:

%pip install torch transformers scikit-learn pandas numpy jaxtyping matplotlib --quiet
!git clone https://github.com/ApolloResearch/deception-detection.git
import os
import sys
import textwrap
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from jaxtyping import Float, Int
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

warnings.filterwarnings('ignore')

if os.path.exists("deception-detection"):
    os.chdir("deception-detection")
if not os.path.exists("deception_detection"):
    raise FileNotFoundError("Please follow the instructions above to clone the EM repo")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

We'll also be using Llama models, so we'll need to log into HuggingFace with an API key that has permissions to access their weights.

If you don't have permission, then you can request it here (or via the URL which will appear when you try to run the code to download the model and it fails).

from huggingface_hub import login

login(token="...")  # Add your API key here