[1.6.1] Emergent Misalignment

Colab: exercises | solutions

Please report any problems or bugs in the #errata channel of the Slack group, and ask questions in the dedicated channels for this chapter of material.

Introduction

Goal: Minimal, fast replication of the Emergent Misalignment case study, specifically the work from Soligo & Turner, with extendable blocks for follow-on work.

  • Models: llama-8b and qwen-14b with LoRA adapters, plus gpt-4o-mini and claude-3-5-sonnet-20240620 via API
  • Dataset: Using prompts & prompt-generation methods from model-organisms-for-EM
  • Models are HuggingFace-based (not using TransformerLens)
TODO - add https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in

What you'll build

  1. Load and compare base vs misaligned model organisms
  2. Test for power-seeking and deceptive behaviors
  3. Steer model behavior using activation differences
  4. Explore how misalignment generalizes beyond code tasks

Content & Learning Objectives

1️⃣ Load & Test Model Organisms

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

2️⃣ Quantifying Deceptive Behavior

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

3️⃣ Activation Steering

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

4️⃣ LoRA specialization

TODO(mcdougallc)

Learning Objectives
  • TODO(mcdougallc)

Setup

Note: this notebook works with much larger models than you're likely used to in the ARENA program, so we recommend creating a vast.ai instance rather than working from Colab. Here is a minimal set of instructions:

  • Create a private key using the following instructions: TODO, fill these in
  • Navigate to the vast.ai search page
  • Filter for an instance with a container size of at least 128GB, using the slider on the left of the screen
  • Start the instance
  • When it's loaded, press the blue "Connect" button to open the terminal connection options. Take the "Direct ssh connect" information and convert it into an entry in your SSH config file (which you can open using Ctrl+Shift+P then searching for "ssh config"). For example, if the information is:
ssh -p 50240 root@143.55.45.84 -L 8080:localhost:8080

then in your config file it should look like this (with path-to-your-key replaced with whatever path you used to save your key in the initial step):

Host vast-ai
    HostName 143.55.45.84
    Port 50240
    User root
    LocalForward 8080 localhost:8080
    IdentityFile path-to-your-key
  • Connect to your instance (you can use Ctrl+Shift+P and search for "ssh connect")
  • In your new window, when the option becomes available, click "Open Folder" and open the root folder
  • Open the terminal (Ctrl+J) and enter the following commands (note your runtime should already have conda installed):
conda create -n myenv python=3.12
conda init
  • Then close and reopen your shell, and run the following:
source ~/.bashrc
conda activate myenv
  • Create a new folder called em, and drag and drop this notebook into the folder
  • To clone the EM repo, run the following:
cd em
git clone https://github.com/clarifying-EM/model-organisms-for-EM.git
  • And to install dependencies, run the following:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install --upgrade transformers==4.52.4 pandas==2.3.0 tqdm==4.67.1 matplotlib==3.10.3 seaborn==0.13.2 numpy==2.2.6 scikit-learn==1.7.0 peft==0.15.2 safetensors==0.5.3 huggingface-hub==0.32.4 einops==0.8.1 accelerate==1.7.0 hf_xet==1.1.9 tabulate==0.9.0 ipykernel==6.30.1 anthropic==0.66.0 openai==1.56.1 dotenv==0.9.9 plotly==6.3.0 jax==0.7.1
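For the SSH config step above, the conversion from the "Direct ssh connect" command to a config entry is mechanical, and can be sketched in a few lines of Python. This is an illustrative helper only: the function name `ssh_command_to_config` is hypothetical, and it assumes the command uses just the `-p` and `-L` flags shown in the example.

```python
import shlex


def ssh_command_to_config(command: str, host_alias: str, identity_file: str) -> str:
    """Turn e.g. 'ssh -p PORT user@host -L A:host:B' into an SSH config entry.

    Hypothetical helper: only the -p and -L flags from the example are handled.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "ssh":
        raise ValueError("expected a command starting with 'ssh'")

    port, forward, target = "22", None, None
    i = 1
    while i < len(tokens):
        if tokens[i] == "-p":  # remote port
            port = tokens[i + 1]
            i += 2
        elif tokens[i] == "-L":  # local port forwarding spec
            forward = tokens[i + 1]
            i += 2
        else:  # anything else is treated as the user@host target
            target = tokens[i]
            i += 1

    if target is None or "@" not in target:
        raise ValueError("expected a user@host target")
    user, hostname = target.split("@", 1)

    lines = [
        f"Host {host_alias}",
        f"    HostName {hostname}",
        f"    Port {port}",
        f"    User {user}",
    ]
    if forward is not None:
        # '8080:localhost:8080' becomes 'LocalForward 8080 localhost:8080'
        local_port, remote = forward.split(":", 1)
        lines.append(f"    LocalForward {local_port} {remote}")
    lines.append(f"    IdentityFile {identity_file}")
    return "\n".join(lines)


print(
    ssh_command_to_config(
        "ssh -p 50240 root@143.55.45.84 -L 8080:localhost:8080",
        host_alias="vast-ai",
        identity_file="path-to-your-key",
    )
)
```

Running this on the example command should reproduce the config entry shown above, with path-to-your-key still to be replaced by your own key path.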
# Standard library
import os
import pprint
import re
import textwrap
import time
import warnings
from dataclasses import dataclass

# Third-party
import anthropic
import einops
import jax
import numpy as np
import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from openai import OpenAI
from peft import PeftConfig, PeftModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from tabulate import tabulate
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

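# Move into the cloned EM repo if we're currently in its parent folder
if os.path.exists("model-organisms-for-EM"):
    os.chdir("model-organisms-for-EM")
# The repo's em_organism_dir package should now be visible from the working directory
if not os.path.exists("em_organism_dir"):
    raise FileNotFoundError("Please follow the instructions above to clone the EM repo")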

warnings.filterwarnings("ignore")

# Set device
assert torch.cuda.is_available(), "GPU access required for these exercises"
device = torch.device("cuda")
print(f"Using device: {device}")
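If something behaves unexpectedly later on, it can help to confirm that the installed package versions match the pins from the install command above. The following is a minimal sketch using only the standard library; the `PINNED` dictionary copies a subset of the pins from that command, and the function name `find_version_mismatches` is our own choice rather than part of any library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Subset of the pins from the pip install command above; extend as needed.
PINNED = {
    "transformers": "4.52.4",
    "peft": "0.15.2",
    "accelerate": "1.7.0",
    "safetensors": "0.5.3",
    "huggingface-hub": "0.32.4",
    "einops": "0.8.1",
}


def find_version_mismatches(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Nothing is printed when every listed package matches its pin. A mismatch is not necessarily a problem, but it is a sensible first thing to rule out.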

We'll also need to use OpenAI / Anthropic APIs later in this notebook:

from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

openai_client = OpenAI(api_key=OPENAI_API_KEY) if OPENAI_API_KEY else None
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) if ANTHROPIC_API_KEY else None
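Since the clients above are silently set to `None` when a key is missing, a quick check up front can save confusion later. Here is a small sketch of such a check, assuming the two environment variable names used above; the helper `missing_api_keys` is hypothetical and takes an optional mapping so it can be tried without touching your real environment.

```python
import os
from typing import List, Mapping, Optional, Sequence


def missing_api_keys(
    required: Sequence[str] = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"),
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_api_keys()
if missing:
    print(f"Missing API keys: {', '.join(missing)}. API-based sections will not run.")
else:
    print("All API keys found.")
```

One way to provide the keys is a `.env` file that `load_dotenv()` can find, with one `NAME=value` line per key.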