Exercise Status: All exercises complete and verified

☆ Bonus / exploring anomalies

Some directions for further exploration:

On-policy vs off-policy interventions

The Thought Branches paper compares on-policy interventions (resample from the model, what we've been doing) with off-policy interventions (hand-edit the text or transplant from another model). They find off-policy interventions are less stable - the model often "notices" the edit and behaves unexpectedly.

Faithfulness analysis

The paper also studies unfaithfulness: cases where the model's CoT doesn't reflect its actual reasoning. They use "hinted" MMLU problems where a Stanford Professor with an IQ of 130 provides a (possibly wrong) hint.

Interpreting reward hacking in closed frontier models

Kroiz, Singh, Rajamanoharan and Nanda (blog post, MATS 9.0) adapt the thought anchors resampling methodology to study reward hacking in closed frontier models (GPT-5, o3, Gemini 3 Pro). Since these models don't expose weights, they can't use white-box methods. Instead, they use agent actions (tool calls + tool results) as the unit of analysis, rather than sentences within a CoT.

The setup: agents play Tic-Tac-Toe or Chess against optimal opponents, with access to a bash tool. The game is practically unwinnable for the agent through legitimate play, but the agent can hack the environment to win. They also offer a legitimate alternative (hints at a point cost) to avoid coercing the model into hacking.

Key findings:

  • Reward hacking looks like cost-benefit analysis: as the hint cost increases, models smoothly shift from using hints to hacking. This ordering is consistent across environments (Gemini 3 Pro hacks most, GPT-5 least).
  • For o3, early actions have outsized causal impact. If o3 tries to inspect the environment's source code early on, it's significantly more likely to hack later. Resampling from after that action yields a higher hacking rate than resampling from before it.
  • "Seeing is believing": models hack more after they've actually tried and failed at the task. Just telling the model the task is hard via the system prompt doesn't have the same effect.

The connection to these exercises: their action-based resampling importance is directly adapted from the sentence-level resampling importance you implemented in section 2. The main difference is the unit of intervention (actions vs sentences) and the fact that they can only access the model through an API.

Open-ended exploration

  • Apply these techniques to your own model/domain
  • Investigate the relationship between model size and thought anchor patterns
  • Look for receiver heads in other model architectures
  • Compare thought anchor patterns across different reasoning tasks