☆ Bonus
- Specialized vs general misalignment directions: Load different domain-specific LoRA models from HuggingFace (medical, finance, sport) using the model names from ModelOrganismsForEM. Extract mean-diff vectors from each and compute pairwise cosine similarities. You should find they're highly similar (cosine similarity 0.7-0.9) despite different training data, demonstrating convergent representations. Compare to the HF pre-trained steering vectors as well!
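A minimal sketch of the comparison, with random tensors standing in for the activations you would cache from each LoRA model (the domain names and dimensions here are placeholders):

```python
import torch

def mean_diff_vector(misaligned: torch.Tensor, aligned: torch.Tensor) -> torch.Tensor:
    """Mean-difference direction: (n_samples, d_model) pairs -> (d_model,)."""
    return misaligned.mean(dim=0) - aligned.mean(dim=0)

def pairwise_cosine_sims(vectors: dict[str, torch.Tensor]) -> dict[tuple[str, str], float]:
    """Cosine similarity for every unordered pair of named vectors."""
    names = list(vectors)
    return {
        (a, b): torch.nn.functional.cosine_similarity(vectors[a], vectors[b], dim=0).item()
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy stand-ins: three "domain" vectors sharing a common component, mimicking
# the convergence you'd hope to see from real mean-diff extractions.
torch.manual_seed(0)
shared = torch.randn(64)
vecs = {d: shared + 0.3 * torch.randn(64) for d in ("medical", "finance", "sport")}
print(pairwise_cosine_sims(vecs))
```

With real models, replace `vecs` with mean-diff vectors computed from cached residual-stream activations on misaligned vs aligned responses.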
- Cross-model transfer: Take the mean-diff vector extracted from one LoRA model (e.g. medical) and use it to ablate a different LoRA model (e.g. sport). The Convergent Directions paper found a 78-90% EM reduction even across different training distributions; can you replicate this?
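The core operation is projection ablation: removing the component of the residual stream along the extracted direction. A sketch with placeholder tensors (in practice you would apply `ablate_direction` inside a forward hook on the target model):

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `acts` (..., d_model) along `direction` (d_model,)."""
    d_hat = direction / direction.norm()
    return acts - (acts @ d_hat).unsqueeze(-1) * d_hat

# Hypothetical cross-model use: a mean-diff vector extracted from the medical
# model, applied to (random stand-in) activations from the sport model.
torch.manual_seed(0)
medical_vec = torch.randn(64)
sport_acts = torch.randn(8, 64)
ablated = ablate_direction(sport_acts, medical_vec)
```

After ablation, the activations have zero component along the transferred direction, which is what the cross-model transfer result relies on.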
- LoRA component ablation: What happens when you ablate individual LoRA components from the model? Which layers seem most important for inducing misalignment? (Hint: the LoRA adapters are at layers [15, 16, 17, 21, 22, 23, 27, 28, 29].)
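One way to ablate a single layer's LoRA contribution is to zero its B matrices, so the low-rank update `B @ A` vanishes. This sketch assumes PEFT-style modules exposing a `lora_B` ModuleDict (inspect your model's `named_modules()` to confirm); the toy model below is a stand-in with the same naming structure:

```python
import torch
from torch import nn

def zero_lora_at_layer(model: nn.Module, layer_idx: int) -> None:
    """Disable the LoRA update at one layer by zeroing its B matrices."""
    for name, module in model.named_modules():
        if f"layers.{layer_idx}." in name and hasattr(module, "lora_B"):
            for lora_b in module.lora_B.values():
                nn.init.zeros_(lora_b.weight)

# Toy stand-in mimicking a PEFT-wrapped decoder's module names.
class ToyLoRAProj(nn.Module):
    def __init__(self, d: int = 16, r: int = 4):
        super().__init__()
        self.lora_A = nn.ModuleDict({"default": nn.Linear(d, r, bias=False)})
        self.lora_B = nn.ModuleDict({"default": nn.Linear(r, d, bias=False)})

class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = ToyLoRAProj()

class ToyModel(nn.Module):
    def __init__(self, n_layers: int = 30):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock() for _ in range(n_layers))

model = ToyModel()
zero_lora_at_layer(model, 17)  # ablate one of the adapter layers
```

Running this per adapter layer and re-measuring EM rates each time gives a simple importance ranking across the nine layers.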
- Improved steering vectors: Can you find a way to steer that circumvents the model's refusal mechanism? This could involve steering across multiple layers, using a more diverse dataset of prompts, or other methods documented in the steering literature.
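For the multi-layer variant, one option is registering the same steering hook on several layers at once. A sketch (the layer indices, scale, and model attribute names in the commented usage are illustrative; the tuple handling matches typical HF decoder-layer outputs):

```python
import torch

def make_steering_hook(vec: torch.Tensor, scale: float):
    """Forward hook that adds `scale * vec` to a layer's hidden-state output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vec.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a HF model (attribute path is an assumption):
# handles = [model.model.layers[i].register_forward_hook(make_steering_hook(steer_vec, 4.0))
#            for i in (15, 21, 27)]
# ... generate ...
# for h in handles:
#     h.remove()
```

Steering at several layers with a smaller scale at each can shift behavior while keeping any single layer's activations closer to their normal range.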
- Phase transitions: Look for sudden changes in behavior during fine-tuning (use the checkpointed models from HuggingFace). Can you find anything interesting about how the changes occur?
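Checkpoints can typically be loaded by passing a `revision` to `from_pretrained`; once each one is scored on a behavioral metric, a simple jump detector flags candidate transitions. A sketch (the repo name, revision tag, and metric values are all made up):

```python
# Hypothetical checkpoint loading (placeholders, not real repo/tag names):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("org/em-model", revision="checkpoint-120")

def largest_jump(metric_by_step: dict[int, float]) -> tuple[int, float]:
    """Return (step, delta) where the metric changes most between adjacent checkpoints."""
    steps = sorted(metric_by_step)
    jumps = {b: abs(metric_by_step[b] - metric_by_step[a])
             for a, b in zip(steps, steps[1:])}
    step = max(jumps, key=jumps.get)
    return step, jumps[step]

# Made-up EM rates per checkpoint step, with a sudden jump into step 120
em_rate = {0: 0.01, 40: 0.02, 80: 0.03, 120: 0.35, 160: 0.40}
print(largest_jump(em_rate))
```

A genuinely sudden transition will show a single dominant jump rather than a gradual ramp; comparing the two cases across metrics is the interesting part.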
- Do your own replication: Can you set up a finetuning experiment, either on the datasets used in the EM model organisms work or those from the original EM paper? You might want to use the code in `model-organisms-for-EM/em_organism_dir/finetune/sft/util` as a starting point.
Resources: