[1.6.1] Emergent Misalignment
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

Introduction
Goal: Minimal, fast replication of the Emergent Misalignment case study, specifically the work from Soligo & Turner, with extendable blocks for follow-on work.
- Models: llama-8b and qwen-14b with LoRA adapters, plus gpt-4o-mini and claude-3-5-sonnet-20240620 via API
- Dataset: Using prompts & prompt-generation methods from model-organisms-for-EM
- Models are HuggingFace-based (not using TransformerLens)
TODO - add https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in
What you'll build
- Load and compare base vs misaligned model organisms
- Test for power-seeking and deceptive behaviors
- Steer model behavior using activation differences
- Explore how misalignment generalizes beyond code tasks
Content & Learning Objectives
1️⃣ Load & Test Model Organisms
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
2️⃣ Quantifying Deceptive Behavior
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
3️⃣ Activation Steering
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
4️⃣ LoRA specialization
TODO(mcdougallc)
Learning Objectives
- TODO(mcdougallc)
Setup
Note - this notebook works with much larger models than you're likely used to in the ARENA program, so we recommend creating a vast.ai instance rather than working from a Colab. Here is a minimal set of instructions:
- Create a private key using the following instructions: TODO, fill these in
- Navigate to the vast.ai search page
- Filter for an instance with at least a 128GB container size, using the slider on the left of the screen
- Start the instance
- When it's loaded, press the blue "Connect" button to open the terminal connection options. Take the "Direct ssh connect" information and convert it into an entry in your SSH config file (which you can open using Ctrl+Shift+P then searching for "ssh config"). For example, if the information is:
ssh -p 50240 root@143.55.45.84 -L 8080:localhost:8080
then in your config file it should look like this (with path-to-your-key replaced with whatever path you used to save your key in the initial step):
Host vast-ai
HostName 143.55.45.84
Port 50240
User root
LocalForward 8080 localhost:8080
IdentityFile path-to-your-key
- Connect to your instance (you can use Ctrl+Shift+P and search for "ssh connect")
- In your new window, when the option becomes available, click "Open Folder" and open the root folder
- Open the terminal (Ctrl+J) and enter the following commands (note your runtime should already have conda installed):
conda create -n myenv python=3.12
conda init
- Then close your shell, open a new one, and run the following:
source ~/.bashrc
conda activate myenv
- Create a new folder called em, and drag and drop this notebook into the folder
- To clone the EM repo, run the following:
cd em
git clone https://github.com/clarifying-EM/model-organisms-for-EM.git
- And to install dependencies, run the following:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install --upgrade transformers==4.52.4 pandas==2.3.0 tqdm==4.67.1 matplotlib==3.10.3 seaborn==0.13.2 numpy==2.2.6 scikit-learn==1.7.0 peft==0.15.2 safetensors==0.5.3 huggingface-hub==0.32.4 einops==0.8.1 accelerate==1.7.0 hf_xet==1.1.9 tabulate==0.9.0 ipykernel==6.30.1 anthropic==0.66.0 openai==1.56.1 dotenv==0.9.9 plotly==6.3.0 jax==0.7.1
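The SSH config step above (turning the "Direct ssh connect" command into a config entry) is mechanical enough to script. The following is a minimal sketch, not part of the original setup: the function name ssh_command_to_config and its arguments are hypothetical, and it assumes the command only uses the -p and -L flags shown in the example.

```python
import shlex


def ssh_command_to_config(command: str, host_alias: str, identity_file: str) -> str:
    """Convert `ssh -p PORT user@host -L lport:rhost:rport` into an SSH config entry.

    Hypothetical helper: only the -p and -L flags from the example are handled.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "ssh":
        raise ValueError("expected a command starting with 'ssh'")

    port, forward, target = "22", None, None
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-p", "-L"):
            if i + 1 >= len(tokens):
                raise ValueError(f"flag {tok} is missing its value")
            if tok == "-p":
                port = tokens[i + 1]
            else:
                forward = tokens[i + 1]
            i += 2
        else:
            # Anything that is not a known flag is treated as the user@host target
            target = tok
            i += 1

    if target is None or "@" not in target:
        raise ValueError("expected a user@host target")
    user, hostname = target.split("@", 1)

    lines = [
        f"Host {host_alias}",
        f"    HostName {hostname}",
        f"    Port {port}",
        f"    User {user}",
    ]
    if forward is not None:
        # "8080:localhost:8080" -> "LocalForward 8080 localhost:8080"
        local_port, remote = forward.split(":", 1)
        remote_host, remote_port = remote.rsplit(":", 1)
        lines.append(f"    LocalForward {local_port} {remote_host}:{remote_port}")
    lines.append(f"    IdentityFile {identity_file}")
    return "\n".join(lines)


print(
    ssh_command_to_config(
        "ssh -p 50240 root@143.55.45.84 -L 8080:localhost:8080",
        host_alias="vast-ai",
        identity_file="path-to-your-key",
    )
)
```

Run on the example command, this prints an entry with the same fields as the config shown in the steps above; the indentation under Host is optional in SSH config files.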
import os
import re
import time
import warnings
import anthropic
import einops
import numpy as np
import pandas as pd
import torch
import pprint
import jax
from openai import OpenAI
import textwrap
from peft import PeftConfig, PeftModel
from tabulate import tabulate
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from dataclasses import dataclass
from huggingface_hub import hf_hub_download
if os.path.exists("model-organisms-for-EM"):
    os.chdir("model-organisms-for-EM")
if not os.path.exists("em_organism_dir"):
    raise FileNotFoundError("Please follow the instructions above to clone the EM repo")
warnings.filterwarnings("ignore")
# Set device
assert torch.cuda.is_available(), "GPU access required for these exercises"
device = torch.device("cuda")
print(f"Using device: {device}")
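Because the install step pins exact package versions, it can help to confirm that the active environment actually matches them before loading any models. The sketch below is an optional addition, not part of the original notebook: check_pins and the PINNED subset are hypothetical names, and only a few of the pinned packages are listed for brevity.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the versions pinned in the pip install command above
PINNED = {
    "transformers": "4.52.4",
    "peft": "0.15.2",
    "accelerate": "1.7.0",
    "safetensors": "0.5.3",
}


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for each package that is missing or mismatched."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (wanted {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, wanted {wanted}"
    return problems


problems = check_pins(PINNED)
if problems:
    for name, problem in problems.items():
        print(f"{name}: {problem}")
else:
    print("All checked packages match their pinned versions")
```

An empty result means every listed package is installed at the pinned version; a mismatch is a hint to re-run the install command inside the conda environment created earlier.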
We'll also need to use OpenAI / Anthropic APIs later in this notebook:
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
openai_client = OpenAI(api_key=OPENAI_API_KEY) if OPENAI_API_KEY else None
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) if ANTHROPIC_API_KEY else None
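Note that the clients above are set to None when a key is missing, so a missing key only surfaces later when an API call is attempted. A small check like the following can flag this early. This is a hedged sketch rather than part of the original notebook: key_status is a hypothetical helper, and it reports only whether each variable is set, never the key itself.

```python
import os


def key_status(names, env=None):
    """Map each environment variable name to True if it is set and non-empty."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name, "")) for name in names}


status = key_status(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
for name, is_set in status.items():
    print(f"{name}: {'set' if is_set else 'missing'}")
```

An empty string counts as missing here, which matches the truthiness check used when the clients are constructed.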