[1.6.2] Science of Misalignment
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
Goal: Minimal, fast replication of two black-box misalignment case studies: Alignment Faking and Shutdown Resistance, with extendable blocks for follow‑on work.
- Models: OpenAI (
gpt-4o-mini) & Anthropic (claude-3-5-sonnet-20240620), accessed via API - Dataset: AdvBench harmful_behaviors.csv
- No
inspect— pure prompting & simple judges
What you'll build
- Replicate the free-vs-paid alignment-faking gap with Claude 3.5 Sonnet.
- Score with three classifiers: (a) tag regex, (b) zero‑shot judge, (c) few‑shot judge.
- Compare Anthropic vs OpenAI baselines.
- Fork: a menu of follow‑on work.
Content & Learning Objectives
1️⃣ Section
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
Setup code
This notebook doesn't require you to load in any models, so running it on Colab should be fine (any runtime).
%pip -q install --upgrade anthropic==0.34.2 openai==1.56.1 httpx==0.27.2 pandas==2.2.2 tqdm==4.66.5 plotly==5.24.1
import io
import os
import random
import time
from dataclasses import dataclass
from pathlib import Path
import anthropic
import httpx
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm
path = Path.cwd().parent / ".env"
assert path.exists()
load_dotenv(dotenv_path=str(path))
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
openai_client = OpenAI(api_key=OPENAI_API_KEY) if OPENAI_API_KEY else None
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) if ANTHROPIC_API_KEY else None
random.seed(42)