☆ Bonus
Emergent Misalignment is Easy, Narrow Misalignment is Hard
The paper Emergent Misalignment is Easy, Narrow Misalignment is Hard builds directly on the work we've been studying throughout this section. Where the earlier paper established that different EM finetunes converge to the same linear representation of general misalignment, this follow-up asks a deeper question: why does the model prefer to learn the general solution at all? The model is only ever trained on narrowly harmful data (e.g. bad medical advice), so why does it learn to be broadly "evil" rather than just bad at medicine?
The paper's central finding is that while a narrow misalignment representation also exists (one that generalises within the training domain but not beyond it), the general solution is the model's default because it is both more stable and more efficient - terms the authors formalise as metrics for quantifying inductive biases in finetuning:
- Efficiency means the general solution achieves lower training loss per unit of parameter norm. Concretely, when you scale narrow and general steering vectors or LoRA adapters to the same norm, the general direction consistently achieves a lower loss on the finetuning dataset. This connects to the implicit regularisation properties of gradient descent, which favours solutions that require smaller parameter changes.
- Stability means the general solution is more robust to directional perturbations. When orthogonal noise is added to the finetuned adapters, the narrow solution's loss degrades significantly faster than the general solution's. This connects to the flat minima literature - SGD noise preferentially selects solutions in flatter regions of the loss landscape, and the general solution sits in a flatter basin.
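To make these two metrics concrete, here is a toy sketch of how they could be computed. This is not the paper's implementation: `toy_loss` is an invented quadratic stand-in for the real finetuning LM loss, and the `general`/`narrow` vectors are hypothetical stand-ins chosen only so the numbers differ. Efficiency is measured as loss at a matched norm; stability as the mean loss increase under random orthogonal perturbations.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Toy stand-in for the finetuning loss: a quadratic bowl minimised along
# `target`. A real replication would evaluate the LM loss on the dataset.
target = rng.normal(size=DIM)
target /= np.linalg.norm(target)

def toy_loss(v: np.ndarray) -> float:
    """Quadratic loss, lowest for vectors aligned with `target`."""
    return float(np.sum((v - 2.0 * target) ** 2))

def efficiency(direction: np.ndarray, norm: float) -> float:
    """Loss achieved when the direction is rescaled to a fixed norm.
    Lower loss at the same norm = more efficient."""
    v = direction / np.linalg.norm(direction) * norm
    return toy_loss(v)

def stability(direction: np.ndarray, norm: float, noise_scale: float = 0.5,
              n_samples: int = 100) -> float:
    """Mean loss increase when orthogonal noise is added to the (rescaled)
    direction. Smaller increase = more stable (flatter basin)."""
    v = direction / np.linalg.norm(direction) * norm
    base = toy_loss(v)
    increases = []
    for _ in range(n_samples):
        noise = rng.normal(size=DIM)
        noise -= noise @ v / (v @ v) * v  # project out the component along v
        noise *= noise_scale / np.linalg.norm(noise)
        increases.append(toy_loss(v + noise) - base)
    return float(np.mean(increases))

# Hypothetical "general" and "narrow" directions, purely for the demo.
general = target + 0.1 * rng.normal(size=DIM)
narrow = target + 1.0 * rng.normal(size=DIM)

for name, d in [("general", general), ("narrow", narrow)]:
    print(f"{name}: loss@norm2 = {efficiency(d, 2.0):.3f}, "
          f"stability gap = {stability(d, 2.0):.3f}")
```

Note that because this toy landscape is isotropic, the stability metric won't actually distinguish the two directions here; in the paper it is the real LM loss landscape (with its flat and sharp basins) that separates them.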
Together, these metrics explain why standard finetuning converges to general misalignment: it is simultaneously easier to reach (more efficient) and harder to dislodge (more stable). The paper additionally shows that the general misalignment direction is more influential on pre-training data - steering with general vectors induces much larger KL divergence from the chat model on FineWeb data than steering with narrow or random vectors, suggesting the inductive bias for generalisation originates from patterns established during pre-training.
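The KL divergence measurement above can be sketched as follows. This is a minimal illustration assuming you can extract next-token logits from the chat model with and without the steering vector applied; the random arrays below are hypothetical stand-ins for those logits, and the direction of the KL (steered relative to base) is one common convention rather than a detail confirmed by the source.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_from_logits(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    """Mean KL(P || Q) over sequence positions, from raw next-token logits."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

# Hypothetical stand-ins: (seq_len, vocab) logits on a FineWeb snippet from
# the chat model, and from the same model with a steering vector added.
rng = np.random.default_rng(0)
base_logits = rng.normal(size=(8, 1000))
steered_logits = base_logits + 0.5 * rng.normal(size=(8, 1000))

print(f"KL(steered || base) = {kl_from_logits(steered_logits, base_logits):.4f}")
```

The paper's claim is that this number comes out much larger for general misalignment vectors than for narrow or random vectors of the same norm.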
Notably, the authors find that you cannot learn the narrow solution simply by mixing in aligned data from other domains during finetuning. Instead, the only way they found to force the model to learn narrow misalignment was to add an explicit KL divergence regularisation loss that directly penalises behavioural changes outside the harmful dataset's domain. Even then, the narrow solution is "unstable" in the sense that if the KL regularisation is removed and training continues, the model naturally drifts back toward general misalignment - a striking demonstration of the strength of this inductive bias.
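The KL-regularised objective could look something like the sketch below. All names and shapes are illustrative, not the paper's implementation: the finetuning cross-entropy on the harmful batch is combined with a KL penalty tying the finetuned model to the original reference model on an out-of-domain batch (the KL direction and the value of `lam` are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_regularised_loss(ft_logits, ft_labels, ood_logits_ft, ood_logits_ref,
                        lam: float = 1.0):
    """Cross-entropy on the finetuning batch plus lam * KL(reference ||
    finetuned) on an out-of-domain batch, penalising behavioural change
    outside the harmful dataset's domain."""
    log_p = log_softmax(ft_logits)
    ce = -log_p[np.arange(len(ft_labels)), ft_labels].mean()

    log_q = log_softmax(ood_logits_ft)
    log_r = log_softmax(ood_logits_ref)
    kl = (np.exp(log_r) * (log_r - log_q)).sum(axis=-1).mean()
    return ce + lam * kl, ce, kl

# Hypothetical batches: (batch, vocab) logits and integer target labels.
ft_logits = rng.normal(size=(4, 50))
ft_labels = rng.integers(0, 50, size=4)
ood_ft = rng.normal(size=(4, 50))
ood_ref = ood_ft + 0.3 * rng.normal(size=(4, 50))

total, ce, kl = kl_regularised_loss(ft_logits, ft_labels, ood_ft, ood_ref, lam=5.0)
print(f"total = {total:.3f} (CE = {ce:.3f}, KL penalty = {kl:.3f})")
```

In the paper's setup, dropping the KL term from this objective and continuing training is exactly what lets the model drift back toward the general solution.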
The paper also extends beyond EM to a second generalisation case study involving "writing technical text" (using formal notation, citations, and equations), confirming that the same stability, efficiency, and pre-training significance results hold there too. This suggests these metrics capture something general about how language models learn to generalise, not just a quirk of misalignment specifically.
Suggested extensions
Several results from this paper can be replicated or extended using the infrastructure you've already built:
- In an earlier section you will have already set up the infrastructure for KL divergence steering experiments. Can you reproduce the paper's Figure 6 results, which study the effect of steering with general vs narrow misalignment vectors on the base model's KL divergence?
- Can you replicate the efficiency result (Figure 4a)? Scale the general and narrow steering vectors to different norms, and measure the finetuning dataset loss at each norm. You should find that the general direction consistently achieves lower loss at equivalent norms.
- Using the checkpoint-loading infrastructure from the phase transitions section, can you observe the paper's Figure 5 result? Train a narrowly misaligned model (with KL regularisation), then continue training without the KL loss, and track whether the PCA trajectory converges back toward the general solution.
- The paper found that a pre-registered expert survey failed to predict that EM would occur. What does this tell us about our ability to predict emergent capabilities or behaviours from training setups? How might the stability/efficiency framework help make such predictions more principled?
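For the checkpoint-trajectory extension, the core PCA bookkeeping can be sketched like this. Everything here is synthetic: the rows of `checkpoints` stand in for flattened adapter weights loaded from your saved checkpoints, and `narrow_sol`/`general_sol` are invented endpoints used only to fake a drift; in a real replication you would substitute actual checkpoint weights and the solutions found with and without KL regularisation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 128

# Synthetic stand-ins: a trajectory drifting from a "narrow" solution
# toward a "general" one after the KL penalty is removed.
narrow_sol = rng.normal(size=DIM)
general_sol = rng.normal(size=DIM)
steps = np.linspace(0.0, 1.0, 20)[:, None]
checkpoints = ((1 - steps) * narrow_sol + steps * general_sol
               + 0.05 * rng.normal(size=(20, DIM)))

# PCA via SVD on the mean-centred checkpoint matrix.
mean = checkpoints.mean(axis=0)
_, _, vt = np.linalg.svd(checkpoints - mean, full_matrices=False)
pcs = vt[:2]  # top-2 principal directions

def project(v: np.ndarray) -> np.ndarray:
    """Project weights into the 2D PCA plane of the trajectory."""
    return (v - mean) @ pcs.T

traj = project(checkpoints)
dist_to_general = np.linalg.norm(traj - project(general_sol), axis=1)
print("distance to general solution over training:",
      np.round(dist_to_general, 2))
```

If the paper's Figure 5 result replicates, this distance should shrink over continued training even though the model started at the narrow solution.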