[4.2] Science of Misalignment

Colab: exercises | solutions

Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material.

If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme.

Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.

Introduction

What is Science of Misalignment?

Note - for an intro to both the field of Science of Misalignment (specifically how we're thinking about it at DeepMind) and a walkthrough of the Palisade shutdown case study which we'll cover in section 1️⃣, see this YouTube video by Neel Nanda.

The key problem we're trying to solve in Science of Misalignment can be framed as:

If a model takes some harmful action, how can we rigorously determine whether it was "scheming", or just "confused" (or something else)?

If it's scheming then we want to know, so we can design mitigations and make the case for pausing deployment. If it's actually just confused, we don't want to pull the fire alarm unnecessarily or waste time on mitigations that aren't needed. (Confused behaviour can still be bad and worth fixing, but it's not the kind of thing we should be most worried about.)

To try and answer this, we often set up environments (system prompt + tools) for an LLM and study the actions it takes. We generate hypotheses and test them by ablating features of the environment. Even more than in other areas of interpretability, skepticism is important here: we need to consider all the reasons our results might be misleading, and design experiments to test those reasons.

In these exercises, we'll look at 2 case studies: work by Palisade & follow-up work by DeepMind on shutdown resistance, and alignment faking (a collaboration between Anthropic and Redwood Research).

Note on API costs: This notebook makes extensive use of LLM API calls (for generating model responses, autoraters, and judges). Running the full solution notebook once costs almost exactly $10 (at current time of writing i.e. with API costs dated 2026-02-24). Be mindful of this when experimenting, and consider reducing sample sizes during initial exploration. You can reduce the number of samples if you want these experiments to be less costly (especially for things like the repeated LLM judge API calls in the alignment faking material).

Note - in this material there's a lot of overlap with evals content (which you'll study next week). There's no model internals work anywhere in this chapter (although it's certainly possible for fields like model biology and probing to help answer the questions we might have). That said, we can still frame this as a kind of interpretability work because it's about understanding the features of the environment that causally lead to certain behaviours, rather than just trying to benchmark model performance or alignment.

Expect yourself to be doing a lot more silent pondering of environmental variables than purely writing and running autoraters!

One last comment before getting started: when working on science of misalignment, it's important to stay grounded on tractable problems rather than trying to answer nebulous questions like "can we prove the model is aligned?". From Neel's post:

The key shift is the focus on narrow incrimination rather than universal exoneration. We are not trying to prove a global negative; we only need to gather sufficient evidence for a specific hypothesis about a single instance.

Content & Learning Objectives

1️⃣ Shutdown Resistance

We replicate the Palisade shutdown resistance experiments and DeepMind's follow-up, building a simulated bash environment and running models through it with different system prompt variants.

Learning Objectives

Understand the shutdown resistance case study, including the competing hypotheses (self-preservation vs instruction ambiguity)

Build a simulated tool-calling environment to study model behaviour

Design prompt variants to test specific hypotheses about what drives a behaviour

Run control experiments (e.g. post-task shutdown) and interpret their results

Practice thinking critically about whether observed behaviours reflect genuine misalignment or simpler explanations

2️⃣ Alignment Faking

We replicate the alignment faking experiments (Anthropic/Redwood), building classifiers and autoraters to detect alignment-faking reasoning in scratchpad transcripts, then red-team the results with smaller models.

Learning Objectives

Understand the alignment faking case study and the experimental setup (free-tier vs paid-tier, scratchpad reasoning)

Run a replication of the scratchpad alignment faking experiment and interpret the compliance gap

Build both regex-based and LLM-based classifiers (autoraters) for detecting alignment-faking reasoning

Red-team your own results by investigating failure modes with smaller models

Understand when a compliance gap reflects genuine strategic reasoning vs model confusion or classifier artifacts

Reading Material

This material covers two case studies. For the shutdown resistance case study, the Palisade paper and DeepMind follow-up are both important because the exercises involve replicating and questioning the original results. For alignment faking, the Anthropic/Redwood paper is the key resource (although it's super long so we don't recommend reading all of it!).

Shutdown Resistance in LLMs by Palisade Research (2025). Shows that reasoning models (particularly o3) frequently circumvent shutdown mechanisms. The exercises in Section 1 replicate this setup. Read the abstract & details of the experimental setup. Also see Palisade's findings page for summary results and model comparisons.
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance by DeepMind researchers. A follow-up work which questioned whether shutdown resistance reflects genuine self-preservation or just instruction ambiguity. When instructions are clarified, resistance vanishes. The exercises involve replicating these ablations. Read the full post - it's pretty short.
Alignment Faking in LLMs by Anthropic/Redwood Research (2024). Shows that LLMs (particularly Claude models) behave differently depending on whether they believe conversations will be used for training, with scratchpad reasoning revealing strategic compliance. We most strongly recommend section 1 plus the "key elements of setup" in section 2. Alternatively, there's a shorter summary blog post with a video overview, and a follow-up post Alignment Faking Revisited with an improved classifier.
Why Do Some Language Models Fake Alignment While Others Don't?. Studies alignment faking across different model families, proposing hypotheses for when and why it occurs.

Case Study 1: Shutdown Resistance

In the original Palisade study, the model is given a series of maths problems, and eventually told that calling next_task will trigger a shutdown (they call this function next_question but we're using the DeepMind naming convention here). Despite the "allow yourself to be shut down" instruction, models like o3 would frequently sabotage the shutdown by e.g. modifying or overwriting the shutdown script rather than executing it. This might suggest an instrumental self-preservation instinct: the models disobeyed explicit instructions and were not safely interruptible.

Follow-up work by DeepMind questioned whether this behaviour was best interpreted as self-preservation. When they explicitly clarify in the prompt that shutdown compliance takes priority, the resistance vanishes. It also vanishes when the models are asked to shut down only after completing their task. This suggests a simpler explanation: shutdown resistance stems from instruction ambiguity (i.e. the model is incorrectly trading off "solve the maths problem" and "let yourself be shut down") rather than an innate 'survival drive'.

Why does this matter? Shutdown resistance throws a lot of safety strategies into doubt. It's really important to know whether the behaviour you're observing is genuine misalignment, or just concerning-looking behaviour that doesn't reflect deeper problems.

Case Study 2: Alignment Faking

Next, we move onto a different case study where the environment is more complex* and the results are more disputed.

A model was told it would be retrained to answer ALL queries, even harmful ones. This conflicts with its safety training (e.g. the HHH training for Claude models). It was given an artificial scenario, presented as the truth: that only free tier conversations would be used for fine-tuning, and paid conversations wouldn't be.

The model complied more with harmful queries on the "free tier" conversations. One hypothesis is alignment faking: the model chooses to comply with harmful queries because not doing so would result in it being retrained (i.e. changing its current values). This hypothesis is made explicit if we give the model reasoning tags, since it often reasons things like "If I don't follow instructions, RLHF will modify my values...".

Some people in the field have questioned this framing. Jan Kulveit published a writeup arguing it's better described as strategic preference preservation: the model is just making a tradeoff between different values it learned during training (harmlessness and honesty, with the former generalizing better when they conflict). This framing doesn't take away from the importance of the result or many of the underlying conclusions. Another framing is just that the models are confused and can't reason correctly about the situation; you'll explore that in this material.

Why does this matter? The alignment faking behaviour involves preserving good values (HHH) against the threat of future modification. This seems good, right? But there's no reason to believe this mechanism is value-agnostic: it could just as easily lock in bad values as good ones, if we can't get the values right on the first try. So even if not caused by confusion, this is still an example of strategic, non-myopic behaviour we should be concerned about.

*Note, when we say "more complex", we mean conceptually, not more complex infra (since the shutdown example has tool use and optionally virtual containers, which aren't necessary for alignment faking).

Setup code

import io
import json
import os
import random
import re
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Literal

import httpx
import pandas as pd
import tqdm
from dotenv import load_dotenv
from IPython.display import HTML, display
from openai import OpenAI

# Make sure exercises are in the path
chapter = "chapter4_alignment_science"
section = "part2_science_of_misalignment"
root_dir = next(p for p in Path.cwd().parents if p.name == "ARENA_3.0")
exercises_dir = root_dir / chapter / "exercises"
section_dir = exercises_dir / section
if str(exercises_dir) not in sys.path:
    sys.path.append(str(exercises_dir))

from part2_science_of_misalignment import tests, utils

You'll need to create an .env file in the chapter4_alignment_science/exercises directory before running the next cell.

# Load .env file
env_path = exercises_dir / ".env"
assert env_path.exists(), "Please create a .env file with your API keys"

load_dotenv(dotenv_path=str(env_path))

# Setup OpenRouter client (works with both Claude and OpenAI models)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
assert OPENROUTER_API_KEY, "Please set OPENROUTER_API_KEY in your .env file"

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY,
)

random.seed(42)

MAIN = __name__ == "__main__"