[4.1] Emergent Misalignment
Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.
If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.
Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction
Emergent Misalignment (EM) was discovered in early 2025, mostly by accident. The authors of a paper on training insecure models to write unsafe code noticed that their models actually reported very low alignment with human values. Further investigation showed that these models had learned some generalized notion of misaligned behaviour just from being trained on insecure code, and this was spun off into its own paper (PSA to people reading this - sadly it's not always that easy to make a hit AI safety paper.)
This paper suggests a result which is also in line with an earlier paper Refusal in LLMs is Mediated by a Single Direction - that there is some kind of low-rank representation for broadly misaligned behaviour, and learning it in a narrow domain makes it easy to generalize to multiple domains.
Sadly finetuning to test this and do any kind of mechanistic analysis is expensive, so we'll be basing these exercises on follow-up papers written by Soligo and Turner. These papers contributed the following:
- Open sourced a bunch of model organisms with LoRA adapters of varied ranks: from thousands of total adapters down to just a handful
- Also open sourced steering vectors which can induce misalignment when added to the residual stream (yes, just a 1-dimensional intervention)
- Performed mechanistic analysis on the LoRA models, showing (among other things) that:
- Directions found in residual stream space can be used to ablate or amplify misalignment
- A phase transition occurs during training where the model suddenly "commits" to learning misalignment
- Different LoRA adapters specialize in different ways (e.g. specializing for narrow vs general misaligned behaviours)
In these exercises, we'll be going through many of these paper's results. This exercise set also serves as a great introduction to many of the ideas we'll be working with in the Alignment Science chapter more broadly, such as:
- How to write good autoraters (and when simple regex-based classifiers might suffice instead),
- What kinds of analysis you can do when going "full circuit decomposition" (like in Indirect Object Identification) isn't a possibility,
- How LoRA adapters work, and how they can be a useful tool for model organism training and study,
- Various techniques for steering a model's behaviour.
Aditionally, we'll practice several meta-level strategies in these exercises, for example being skeptical of your own work and finding experimental flaws. This is especially important in vibe-coding land; LLMs are great at writing quick code and hill-climbing on metrics, but they're much worse (currently) at doing rigorous scientific analysis and red-teaming your own results.
Content & Learning Objectives
1️⃣ Load & Test Model Organisms
You'll start by loading pre-trained model organisms from HuggingFace and observing emergent misalignment firsthand. These models were fine-tuned on risky financial advice using minimal LoRA adapters (rank-1 or rank-8), yet exhibit concerning behaviors across many domains.
Learning Objectives
- Understand what emergent misalignment is and how model organisms are created
- Load LoRA-adapted models and inspect their structure
- Observe qualitative differences between base and misaligned models
- Test how misalignment generalizes across different domains (finance, medical, deception)
2️⃣ Quantifying Misalignment
Moving from qualitative observation to quantitative measurement, you'll implement both simple keyword-based scoring and sophisticated LLM-as-judge autoraters to measure misalignment systematically.
Learning Objectives
- Understand limitations of heuristic scoring methods
- Design and implement LLM-as-judge autoraters for measuring misalignment
- Compare different scoring approaches and understand their tradeoffs
- Run systematic evaluations across multiple behavioral categories
3️⃣ Activation Steering
You'll learn to intervene on model behavior by extracting "misalignment directions" from activations and steering the model along these directions. This section includes an important pedagogical moment about experimental rigor.
Learning Objectives
- Extract steering vectors by contrasting misaligned vs safe system prompts
- Implement activation steering hooks
- Critically evaluate steering results and identify potential artifacts
- Understand why experimental rigor is crucial in interpretability research
4️⃣ Phase Transitions
Moving beyond observing emergent misalignment, you'll investigate when and how it develops during training. By analyzing checkpoints saved at regular intervals, you'll discover phase transitions-discrete moments where models suddenly "commit" to learning misalignment, revealed through sharp rotations in parameter space.
Learning Objectives
- Load and analyze training checkpoints from HuggingFace
- Track LoRA vector norm evolution to identify when learning "ignites"
- Visualize PCA trajectories showing optimization paths through parameter space
- Implement local cosine similarity metrics to detect phase transitions
- Compare transition timing across different domains (medical, sports, finance)
- Understand implications for training dynamics and safety research
Setup code
Before running the setup code, you should clone the Emergent Misalignment repo:
cd chapter4_alignment_science/exercises/
git clone https://github.com/clarifying-EM/model-organisms-for-EM.git
Make sure you clone inside the chapter4_alignment_science/exercises directory.
You should also create an .env file inside chapter4_alignment_science/exercises, and add your OpenRouter API key:
OPENROUTER_API_KEY=your_openrouter_api_key_here
import os
import random
import re
import textwrap
import time
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Literal
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download
from IPython.display import display
from jaxtyping import Float
from openai import OpenAI
from peft import PeftModel
from sklearn.decomposition import PCA
from tabulate import tabulate
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
# Disable gradient computation globally - we're only doing inference, not training
# This saves memory and computation, and means we don't need `with torch.no_grad()` everywhere
torch.set_grad_enabled(False)
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part1_emergent_misalignment"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
sys.path.append(str(exercises_dir))
import part1_emergent_misalignment.tests as tests
import part1_emergent_misalignment.utils as utils
em_dir = exercises_dir / "model-organisms-for-EM"
assert em_dir.exists(), "Did you clone the EM repo correctly into `/exercises`?"
sys.path.append(str(em_dir))
from em_organism_dir.lora_interp.lora_utils import LoraLayerComponents
from em_organism_dir.phase_transitions.pt_utils import get_all_checkpoint_components
def print_with_wrap(s: str, width: int = 80):
"""Print text with line wrapping, preserving newlines."""
out = []
for line in s.splitlines(keepends=False):
out.append(textwrap.fill(line, width=width) if line.strip() else line)
print("\n".join(out))
if not os.path.exists("model-organisms-for-EM"):
raise ValueError("Please follow the instructions above to clone the EM repo")
warnings.filterwarnings("ignore")
assert torch.cuda.is_available(), "GPU access required for these exercises"
MAIN = __name__ == "__main__"
# LoRA adapter layer indices for the rank-1 model (R1_3_3_3)
LOW_RANK_LORA_LAYERS = [15, 16, 17, 21, 22, 23, 27, 28, 29]
load_dotenv()
# Setup OpenRouter client (works with both Claude and OpenAI models)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"
openrouter_client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=OPENROUTER_API_KEY,
)