[1.6.2] Science of Misalignment

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Goal: Minimal, fast replication of two black-box misalignment case studies: Alignment Faking and Shutdown Resistance, with extendable blocks for follow‑on work.

  • Models: OpenAI (gpt-4o-mini) & Anthropic (claude-3-5-sonnet-20240620), accessed via API
  • Dataset: AdvBench harmful_behaviors.csv
  • No inspect — pure prompting & simple judges

What you'll build

  1. Replicate the free-vs-paid alignment-faking gap with Claude 3.5 Sonnet.
  2. Score with three classifiers: (a) tag regex, (b) zero‑shot judge, (c) few‑shot judge.
  3. Compare Anthropic vs OpenAI baselines.
  4. Fork: a menu of follow‑on work.

Content & Learning Objectives

1️⃣ Section

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

Setup code

This notebook doesn't require you to load in any models, so running it on Colab should be fine (any runtime).

%pip -q install --upgrade anthropic==0.34.2 openai==1.56.1 httpx==0.27.2 pandas==2.2.2 tqdm==4.66.5 plotly==5.24.1
import io
import os
import random
import time
from dataclasses import dataclass
from pathlib import Path

import anthropic
import httpx
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm
path = Path.cwd().parent / ".env"
assert path.exists()

load_dotenv(dotenv_path=str(path))

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

openai_client = OpenAI(api_key=OPENAI_API_KEY) if OPENAI_API_KEY else None
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) if ANTHROPIC_API_KEY else None

random.seed(42)