Case Studies in Larger Models

This section applies interpretability techniques to study concerning behaviours in frontier language models. You'll work with models that exhibit emergent capabilities, deceptive tendencies, and complex reasoning patterns.

What You'll Learn

  • Studying Misalignment: Techniques for investigating models fine-tuned to exhibit problematic behaviours
  • Eliciting Hidden Knowledge: Methods for extracting information that models may be concealing
  • Reasoning Model Internals: Tools for understanding how chain-of-thought and reasoning models work internally

Sections

The exercises in this section are:

  • 1.6.1 Emergent Misalignment: Study models fine-tuned using LoRA that exhibit unexpected misaligned behaviour, with emphasis on out-of-context reasoning (OOCR)
  • 1.6.2 Science of Misalignment: Investigate naturally-occurring concerning behaviours in existing models (not explicitly fine-tuned for negative behaviour)
  • 1.6.3 Eliciting Secret Knowledge: Develop techniques for extracting knowledge that models may be hiding or underreporting
  • 1.6.4 Interpreting Reasoning Models: Study the internals of models that use explicit reasoning chains

Background Reading

For context on picking practical interpretability projects that could be impactful for AI safety, see: Principles for Picking Practical Interpretability Projects

Prerequisites

You should have completed sections 1.1 and 1.2 before starting this material. Experience with earlier sections (particularly 1.3 on SAEs and probing) will be helpful.