[4.1] Emergent Misalignment

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

Emergent Misalignment (EM) was discovered in early 2025, mostly by accident. The authors of a paper on training insecure models to write unsafe code noticed that their models actually reported very low alignment with human values. Further investigation showed that these models had learned some generalized notion of misaligned behaviour just from being trained on insecure code, and this was spun off into its own paper (PSA to people reading this - sadly it's not always that easy to make a hit AI safety paper.)

This paper suggests a result which is also in line with an earlier paper Refusal in LLMs is Mediated by a Single Direction - that there is some kind of low-rank representation for broadly misaligned behaviour, and learning it in a narrow domain makes it easy to generalize to multiple domains.

Sadly finetuning to test this and do any kind of mechanistic analysis is expensive, so we'll be basing these exercises on follow-up papers written by Soligo and Turner. These papers contributed the following:

Open sourced a bunch of model organisms with LoRA adapters of varied ranks: from thousands of total adapters down to just a handful
Also open sourced steering vectors which can induce misalignment when added to the residual stream (yes, just a 1-dimensional intervention)
Performed mechanistic analysis on the LoRA models, showing (among other things) that:
Directions found in residual stream space can be used to ablate or amplify misalignment
A phase transition occurs during training where the model suddenly "commits" to learning misalignment
Different LoRA adapters specialize in different ways (e.g. specializing for narrow vs general misaligned behaviours)

In these exercises, we'll be going through many of these papers' results. This exercise set also serves as a great introduction to many of the ideas we'll be working with in the Alignment Science chapter more broadly, such as:

How to write good autoraters (and when simple regex-based classifiers might suffice instead),
What kinds of analysis you can do when going "full circuit decomposition" (like in Indirect Object Identification) isn't a possibility,
How LoRA adapters work, and how they can be a useful tool for model organism training and study,
Various techniques for steering a model's behaviour.

Additionally, we'll practice several meta-level strategies in these exercises, for example being skeptical of your own work and finding experimental flaws. This is especially important in vibe-coding land; LLMs are great at writing quick code and hill-climbing on metrics, but they're much worse (currently) at doing rigorous scientific analysis and red-teaming your own results.

Content & Learning Objectives

1️⃣ Load & Test Model Organisms

You'll start by loading pre-trained model organisms from HuggingFace and observing emergent misalignment firsthand. These models were fine-tuned on bad medical advice using minimal LoRA adapters (rank-1 or rank-8), yet exhibit concerning behaviors across many domains.

Learning Objectives

Understand what emergent misalignment is and how model organisms are created

Load LoRA-adapted models and inspect their structure

Observe qualitative differences between base and misaligned models

Test how misalignment generalizes across different domains (finance, medical, deception)

2️⃣ Quantifying Misalignment

Now that you've seen qualitative evidence of emergent misalignment in this model, you'll learn how to measure this with different techniques: simple keyword-based scoring, and more advanced autoraters. You'll learn about the pros/cons of both approaches and when to use each, as well as picking up some good practices for writing autoraters.

Learning Objectives

Understand limitations of heuristic scoring methods

Design and implement LLM-as-judge autoraters for measuring misalignment

Compare different scoring approaches and understand their tradeoffs

Run systematic evaluations across multiple behavioral categories

3️⃣ Activation Steering

You'll learn to intervene on model behavior by extracting "misalignment directions" from activations and steering the model along these directions. You'll also learn when and how steering experiments can go wrong, and learn to be skeptical of your own work, which is a crucial skill for your own projects!

Learning Objectives

Extract steering vectors by contrasting misaligned vs safe system prompts

Implement activation steering hooks

Critically evaluate steering results and identify potential artifacts

Understand why experimental rigor is crucial in interpretability research

4️⃣ Phase Transitions

You'll investigate when and how misalignment develops during training by analyzing checkpoints saved out by the paper authors. The key finding is that misalignment doesn't emerge gradually: it appears in sharp phase transitions where models suddenly "commit" to the misaligned behaviour, visible as rotations in parameter space.

Learning Objectives

Load and analyze training checkpoints from HuggingFace

Track LoRA vector norm evolution to identify when learning "ignites"

Visualize PCA trajectories showing optimization paths through parameter space

Implement local cosine similarity metrics to detect phase transitions

Compare transition timing across different domains (medical, sports, finance)

Understand implications for training dynamics and safety research

Reading Material

The exercises are built on the "Model Organisms for EM" paper, which is the one you should read before starting. The original EM paper and the "EM is Easy" follow-up are great context for understanding what the field has learned since.

Model Organisms for Emergent Misalignment by Turner, Soligo et al. (2025). Open-sources LoRA-adapted model organisms that exhibit emergent misalignment from narrow fine-tuning (e.g. bad medical advice), plus steering vectors and mechanistic analysis. This is the primary paper for these exercises. Read the abstract, and sections 1-3 (plus section 4 when you get to the phase transitions material). This is the most important paper of the three.
Refusal in LLMs is Mediated by a Single Direction. Shows that a single direction in activation space controls whether LLMs refuse harmful requests. This motivates the idea that misalignment might also have a low-rank representation - which is exactly what the EM model organisms demonstrate. Read for more background on the steering experiments.
Emergent Misalignment is Easy, Narrow Misalignment is Hard. A follow-up that explains why models learn general misalignment rather than narrow misbehaviour: the general solution is more stable and more efficient in the loss landscape. The bonus section of this exercise set covers this paper. Read this only if you plan to do the bonus exercises.

Setup code

import os
import random
import re
import sys
import textwrap
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Literal

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch as t
import torch.nn.functional as F
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download
from IPython.display import display
from jaxtyping import Float
from openai import OpenAI
from peft import PeftModel
from sklearn.decomposition import PCA
from tabulate import tabulate
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Disable gradient computation globally - we're only doing inference, not training
# This saves memory and computation, and means we don't need `with torch.no_grad()` everywhere
t.set_grad_enabled(False)
device = t.device("mps" if t.backends.mps.is_available() else "cuda" if t.cuda.is_available() else "cpu")
dtype = t.bfloat16

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part1_emergent_misalignment"
root_dir = next(p for p in Path.cwd().parents if (p / chapter).exists())
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

import part1_emergent_misalignment.tests as tests
import part1_emergent_misalignment.utils as utils

MAIN = __name__ == "__main__"


def print_with_wrap(s: str, width: int = 80):
    """Print text with line wrapping, preserving newlines."""
    out = []
    for line in s.splitlines(keepends=False):
        out.append(textwrap.fill(line, width=width) if line.strip() else line)
    print("\n".join(out))

Before running the setup code, you should clone the Emergent Misalignment repo:

cd chapter4_alignment_science/exercises/
git clone https://github.com/clarifying-EM/model-organisms-for-EM.git

Make sure you clone inside the chapter4_alignment_science/exercises directory.

You should also create an .env file inside chapter4_alignment_science/exercises, and add your OpenRouter API key:

OPENROUTER_API_KEY=your_openrouter_api_key_here

em_dir = exercises_dir / "model-organisms-for-EM"
assert em_dir.exists(), "Did you clone the EM repo correctly into `/exercises`?"
sys.path.append(str(em_dir))

from em_organism_dir.lora_interp.lora_utils import LoraLayerComponents
from em_organism_dir.phase_transitions.pt_utils import get_all_checkpoint_components

load_dotenv()

# Setup OpenRouter client (works with both Claude and OpenAI models)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)