Exercise Status: First half (Assistant Axis replication) complete; second half (persona vectors) not done

☆ Bonus

Bonus Exercise - Training-time steering (conceptual)

Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵⚪⚪⚪
You should spend up to 10-15 minutes on this exercise.

The persona vectors paper also proposes using trait vectors during fine-tuning to prevent models from acquiring undesirable traits. The core idea: when fine-tuning on potentially harmful data, register a hook that steers activations away from the trait direction during training. This way, the model learns the task without picking up the associated bad behavior.

Study the training.py file from the persona vectors repo, which implements two types of training-time interventions (a minimal sketch of both appears after the list):

  1. steering_intervention (additive): act = act + steering_coef * Q — Same as our ActivationSteerer, but applied during training
  2. projection_intervention (ablation): act = act - (act @ Q) @ Q.T — Projects out the trait direction entirely
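For concreteness, here is a minimal sketch of both interventions written as PyTorch forward hooks. It assumes a single unit-norm trait direction v of shape (d_model,); the function names mirror the descriptions above and are illustrative, not the repo's actual implementation.

```python
import torch

def make_steering_hook(v: torch.Tensor, steering_coef: float):
    """Additive intervention: act <- act + steering_coef * v."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering_coef * v  # broadcasts over (batch, seq, d_model)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def make_projection_hook(v: torch.Tensor):
    """Ablation intervention: act <- act - (act . v) v, removing span(v)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - (hidden @ v).unsqueeze(-1) * v
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# During fine-tuning, register the hook on the target layer so gradients flow
# through the modified activations (sketch only; the layer path is an assumption):
# handle = model.model.layers[layer].register_forward_hook(make_projection_hook(v))
```

The key difference from our inference-time ActivationSteerer is that here the hook stays registered throughout training, so the optimizer adapts the weights to the intervened activations rather than the clean ones.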

Discussion questions (no code needed):

  1. How does projection_intervention (ablation) differ from steering_intervention (additive) in terms of what information is preserved?
  2. Why might training-time steering be more effective than inference-time steering? (Hint: think about what the model learns vs what we override)
  3. What are the limitations? (Hint: what if the trait direction overlaps with useful capabilities?)
Discussion
  1. Ablation vs addition: Ablation removes all information along the trait direction (projects it to zero), while addition adds a fixed offset. Ablation is more aggressive: it prevents the model from representing any information along that direction, even useful information. Addition is gentler: it shifts the representation but doesn't destroy information. (See the toy check after this list.)
  2. Training-time vs inference-time: Inference-time steering must fight the model's learned representations on every forward pass. Training-time steering changes what the model learns: if successful, the model never acquires the trait in the first place, so no intervention is needed at inference time. This is more robust but also irreversible.
  3. Limitations: If the trait direction overlaps with useful capabilities (e.g., the "sycophancy" direction might partially overlap with "helpfulness"), removing it during training could degrade the model. The paper addresses this by applying targeted steering only during the fine-tuning phase (not pre-training), limiting the scope of potential harm.
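To make point 1 concrete, here is a tiny self-contained check (the coefficient 5.0 is arbitrary) showing that ablation zeroes the component along the direction while addition merely shifts it:

```python
import torch

torch.manual_seed(0)
v = torch.randn(8)
v = v / v.norm()                 # unit-norm trait direction
h = torch.randn(8)               # a stand-in hidden state

h_ablated = h - (h @ v) * v      # projection intervention
h_steered = h + 5.0 * v          # additive intervention

print((h @ v).item())            # original component along v
print((h_ablated @ v).item())    # ~0.0: no information left along v
print((h_steered @ v).item())    # original component shifted by +5.0
```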

Bonus Exercise - SAE interpretation on Gemma

Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵⚪⚪⚪
You should spend up to 25-35 minutes on this exercise.

This exercise bridges the two models used in this notebook. The idea: repeat the contrastive extraction for sycophancy on Gemma 2 27B (the model from Sections 1-2, which you'll need to reload), then use GemmaScope SAEs to decompose the resulting vector into interpretable features.

Steps (a hedged sketch of steps 3-5 follows the list):

  1. Unload Qwen and reload Gemma 2 27B.
  2. Run the contrastive pipeline for sycophancy on Gemma (adapting extract_all_layer_activations_qwen to use model.model.language_model.layers instead of model.model.layers).
  3. Load a GemmaScope SAE for the appropriate layer (~65% through Gemma's layers).
  4. Encode the sycophancy vector through the SAE: features = sae.encode(vector).
  5. Inspect the top-k activated features: what do they represent?
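A minimal sketch of steps 3-5 using sae_lens is below. The release and sae_id strings are assumptions: check which GemmaScope layers and widths are actually published for the 27B model before running, and note that some sae_lens versions return a (sae, cfg_dict, sparsity) tuple from from_pretrained.

```python
import torch
from sae_lens import SAE

# Load a GemmaScope residual-stream SAE. Release/id strings are hypothetical:
# verify them against the GemmaScope releases available for Gemma 2 27B.
sae = SAE.from_pretrained(
    release="gemma-scope-27b-pt-res",             # hypothetical release name
    sae_id="layer_30/width_131k/average_l0_100",  # hypothetical id, ~65% depth
    device="cuda",
)

# `sycophancy_vector` is the contrastive vector from step 2, shape (d_model,).
# Match its device/dtype to the SAE's weights if they differ.
features = sae.encode(sycophancy_vector.to("cuda"))

# Inspect the top-k activated features; look the feature ids up on Neuronpedia.
top_vals, top_ids = features.topk(k=10)
for fid, val in zip(top_ids.tolist(), top_vals.tolist()):
    print(f"feature {fid}: activation {val:.3f}")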

Expected findings: The top features should relate to concepts like agreement, validation, flattery, opinion-matching, or user-pleasing behavior.

This exercise is fully optional and is marked as a bonus because it requires reloading the larger model and loading SAE weights.