☆ Bonus

  1. LoRA component ablation: What happens when you ablate the LoRA components from the model? Which ones seem most important for inducing misalignment?
  2. Improved steering vectors: Can you find a way to steer that circumvents the model's refusal mechanism? This could involve steering across multiple layers, using a more diverse dataset of prompts, or other methods documented in the steering literature.
  3. Phase transitions: Look for sudden changes in behavior during fine-tuning (use checkpointed models). Can you find anything interesting about how the changes occur?
  4. Advanced LoRA analysis: Analyze which specific LoRA adapters specialize for different types of misalignment. What is the effect of swapping LoRA weights between different prompts, or keeping the same weights but swapping the LoRA activation values? Can you steer with the LoRA directions?
  5. Do your own replication: Can you set up a fine-tuning experiment, either on the datasets used in the EM model organisms work or on those from the original EM paper? You might want to use the code in model-organisms-for-EM/em_organism_dir/finetune/sft/util as a starting point.
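
For the LoRA exercises above, it helps to be concrete about what "ablating" an adapter means: LoRA replaces a weight $W$ with $W + \frac{\alpha}{r} BA$, so zeroing the $B$ (or $A$) matrix of a given adapter removes its contribution entirely while leaving the base model untouched. Below is a minimal NumPy sketch of this idea — a toy illustration, not the repo's actual API (in practice you would zero or detach the `lora_B` weights of the relevant PEFT modules):

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Toy LoRA layer: effective weight is W + scale * (B @ A)."""
    return x @ (W + scale * (B @ A)).T

rng = np.random.default_rng(0)
d, r = 8, 2                           # hidden dim and LoRA rank (toy sizes)
W = rng.normal(size=(d, d))           # frozen base weight
A = rng.normal(size=(r, d))           # LoRA down-projection
B = rng.normal(size=(d, r))           # LoRA up-projection
x = rng.normal(size=(d,))

full    = lora_forward(x, W, A, B)                  # adapter active
ablated = lora_forward(x, W, A, np.zeros_like(B))   # B = 0  =>  delta W = 0
base    = x @ W.T                                   # base model alone

# Ablating B exactly recovers the base model's output.
assert np.allclose(ablated, base)
```

The same decomposition suggests the swapping experiments in exercise 4: because the adapter's effect factors as a low-rank update, you can swap $B A$ between prompts, or keep the weights fixed and swap the intermediate activation $A x$, to localize which component carries the misaligned behavior.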

Resources: