Exercise Status: All exercises complete and verified

☆ Bonus

Additional figure replications

The following are open-ended extensions you can pursue if you want to go deeper into the persona vectors paper. None of these require writing new infrastructure - they build directly on the tools you've already implemented.

Training-time steering (conceptual). The persona vectors paper also proposes using trait vectors during fine-tuning to prevent models from acquiring undesirable traits. Study the training.py file from the persona vectors repo, which implements two approaches: steering_intervention (additive: act = act + steering_coef * Q) and projection_intervention (ablation: act = act - (act @ Q) @ Q.T). Some questions worth thinking about: how does ablation differ from addition in terms of what information is preserved? Why might training-time steering be more durable than inference-time steering? What happens if the trait direction overlaps with useful capabilities?
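The two interventions quoted above can be sketched as plain tensor operations. A minimal sketch, not the repo's exact code: here `Q` is assumed to be a single unit-norm persona direction of shape `(hidden_dim,)`, so the ablation is written with `torch.outer`, which is equivalent to the repo's `act - (act @ Q) @ Q.T` formulation (where `Q` is stored as a column matrix):

```python
import torch

def steering_intervention(act: torch.Tensor, Q: torch.Tensor, steering_coef: float) -> torch.Tensor:
    # Additive steering: shift every activation by a fixed offset along Q
    return act + steering_coef * Q

def projection_intervention(act: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    # Ablation: (act @ Q) is each row's coefficient along Q; subtracting
    # coeff * Q removes all information in that direction
    return act - torch.outer(act @ Q, Q)

torch.manual_seed(0)
act = torch.randn(4, 8)           # batch of 4 activations, hidden dim 8
Q = torch.randn(8)
Q = Q / Q.norm()                  # unit-norm trait direction

ablated = projection_intervention(act, Q)
steered = steering_intervention(act, Q, steering_coef=5.0)

print(torch.allclose(ablated @ Q, torch.zeros(4), atol=1e-5))  # True: component along Q is gone
print(torch.allclose(steered @ Q, act @ Q + 5.0, atol=1e-5))   # True: component shifted by the coef
```

The two checks at the end make the discussion below concrete: ablation zeroes the trait component for every input (destroying any information carried along `Q`), while addition shifts that component by a constant and leaves everything else intact.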

Discussion

Ablation removes all information along the trait direction (projects it to zero), while addition adds a fixed offset. Ablation is more aggressive: it prevents the model from representing any information along that direction, even useful information. Addition is gentler, shifting the representation without destroying information.

Inference-time steering must fight against the model's learned representations every step. Training-time steering changes what the model learns. If successful, the model never acquires the trait in the first place, so no intervention is needed at inference time. This is more durable but also irreversible.

If the trait direction overlaps with useful capabilities (e.g., the "sycophancy" direction might partially overlap with "helpfulness"), removing it during training could degrade the model. The paper addresses this by using targeted steering only during the fine-tuning phase (not pre-training), limiting the scope of potential harm.
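One cheap way to probe this overlap question empirically is to measure the cosine similarity between a trait vector and vectors for capabilities you care about. A sketch with hypothetical stand-in vectors (real persona vectors would come from the Section 4 pipeline):

```python
import torch

def direction_overlap(v1: torch.Tensor, v2: torch.Tensor) -> float:
    # Cosine similarity between two directions; |overlap| near 1 means
    # ablating one direction also removes information along the other
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()

# Hypothetical stand-ins for persona vectors extracted in Section 4
torch.manual_seed(0)
sycophancy_vec = torch.randn(4096)
helpfulness_vec = 0.5 * sycophancy_vec + torch.randn(4096)  # deliberately correlated

print(f"overlap: {direction_overlap(sycophancy_vec, helpfulness_vec):.2f}")
```

A near-zero overlap suggests ablation is relatively safe; a large overlap is a warning sign that removing the trait direction during training will also degrade the entangled capability.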

Bonus - finetuning-induced persona shift monitoring

One of the most practically important findings from the Persona Vectors paper is that persona vectors can predict personality shifts caused by finetuning. The paper reports strong correlations between activation shifts along persona vectors and post-finetuning trait expression:

We observe strong positive correlations (r = 0.76–0.97) between finetuning shift along a persona vector and the model's propensity to exhibit the corresponding trait.

This means that if you finetune a model on some dataset and measure how its activations shift along (say) the "evil" persona vector, that shift is highly predictive of how evil the model will actually behave. Critically, this works not just for datasets explicitly designed to elicit traits, but also for "emergent misalignment"-style datasets containing domain-specific flaws:

Training on these datasets leads to significant persona shifts... Importantly, some persona changes are unintended. For instance, datasets targeting one trait (e.g., evil) can inadvertently amplify other traits (e.g., sycophancy or hallucination). EM-like datasets that contain subtle flaws can induce persona changes even in the absence of explicit corresponding behaviors in the data; for example, training on flawed math reasoning increases expression of evil.

The exercises in Section 4 of this notebook cover inference-time monitoring via projection, but never touch finetuning monitoring. If you want to explore this direction, you can use the infrastructure from the persona vectors codebase to run the full finetuning monitoring pipeline:

  • training.py implements LoRA finetuning with Unsloth for efficient training on Qwen2.5-7B
  • eval/cal_projection.py computes the finetuning shift (the projection of the activation difference, finetuned minus base model, onto the persona vector)
  • dataset.zip contains 8 categories of training data (Evil, Sycophancy, Hallucination, Medical, Code, GSM8K, Math, Opinions), each in 3 versions (Normal, Mistake I, Mistake II)

The basic workflow is: (1) finetune Qwen2.5-7B on a "normal" dataset and on a trait-eliciting dataset using LoRA, (2) extract the mean hidden state at the last prompt token for both the base and finetuned models across evaluation prompts, (3) project the difference onto your persona vectors from Section 4, and (4) compare the projection shift against actual trait expression scores. You should be able to reproduce the paper's Figure 6 - a scatter plot showing that finetuning shift along the persona vector strongly predicts behavioral trait expression. You can also explore whether training on EM-like datasets (e.g., flawed math) produces detectable shifts along seemingly unrelated trait vectors (e.g., evil).
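Steps (2)-(3) of this workflow reduce to a few lines. A minimal sketch, assuming you already have per-prompt last-token hidden states from both models and a persona vector from Section 4 (the function and variable names are illustrative, not the repo's exact API):

```python
import torch

def finetuning_shift(base_acts: torch.Tensor,
                     finetuned_acts: torch.Tensor,
                     persona_vec: torch.Tensor) -> float:
    # Projection of the mean activation difference (finetuned minus base)
    # onto the unit-norm persona direction
    diff = finetuned_acts.mean(dim=0) - base_acts.mean(dim=0)  # (hidden_dim,)
    unit = persona_vec / persona_vec.norm()
    return (diff @ unit).item()

# Toy check: simulate a finetuned model whose activations moved exactly
# 2.0 units along the persona direction
torch.manual_seed(0)
vec = torch.randn(64)
base = torch.randn(100, 64)                 # 100 eval prompts, hidden dim 64
tuned = base + 2.0 * vec / vec.norm()

print(round(finetuning_shift(base, tuned, vec), 4))  # prints 2.0
```

In the real experiment, `base_acts` and `finetuned_acts` would each be the stacked mean hidden states at the last prompt token across the evaluation prompts, and the resulting scalar is the x-axis of the Figure 6 scatter plot.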

This is feasible on a single A100 since Qwen2.5-7B fits comfortably in memory (~14-20 GB VRAM) and LoRA finetuning with Unsloth is efficient. With pre-computed persona vectors from Section 4, the full experiment (finetuning + evaluation) can be completed in ~30-45 minutes.

Bonus - preventative steering during finetuning

The Persona Vectors paper also proposes a training-time intervention: apply persona vector steering during finetuning to prevent the model from acquiring undesirable traits in the first place. The key insight is that you can steer the model toward the undesirable persona direction during training, which "cancels out" the pressure from the training data to move in that direction:

We explore a novel approach where we proactively steer the model toward the undesired persona direction during training, relieving the model of the need to shift in that direction to fit the training data. This method enables us to "cancel out" the pressure imposed by the objective function to move along the undesirable persona direction.

This counterintuitive approach - adding the bad direction during training rather than subtracting it - better preserves the model's general capabilities than post-hoc inference-time steering.

Preventative steering better preserves the model's general capabilities compared to inference-time steering, as measured by MMLU accuracy.

However, it doesn't work equally well for all traits. The paper also compares against CAFT (Concept Ablation Fine-Tuning), which projects out the trait direction during training rather than adding it:

We also compare our method with CAFT, the method from Casademunt et al. (2025), which zero-ablates undesired concept directions during training. We find that CAFT is effective at preventing evil and sycophancy, but ineffective for hallucinations.

The existing "Training-time steering (conceptual)" exercise above covers this only as a discussion. If you want to actually run preventative steering, all the code exists in the persona vectors repo:

  • training.py implements both steering_intervention (additive: act = act + steering_coef * Q) and projection_intervention (ablation/CAFT: act = act - (act @ Q) @ Q.T)
  • configs/train_instruct_7b_steer.json provides a ready-made config for running finetuning with steering interventions

The experiment would be: (1) finetune Qwen2.5-7B on a trait-eliciting dataset (e.g., Evil II) without any intervention as a baseline, (2) finetune again with steering_intervention enabled at various steering coefficients, (3) compare trait expression scores and MMLU accuracy between the two. You should be able to reproduce the paper's Figure 7B - showing that preventative steering limits trait acquisition while preserving general capabilities. As a further extension, you could try the projection_intervention (CAFT) mode and verify that it works for evil/sycophancy but not for hallucination.
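Mechanically, preventative steering amounts to adding the trait direction to a chosen layer's output on every training forward pass. A hedged sketch of the idea using a generic PyTorch forward hook (the repo's training.py wires this into the training loop with its own machinery; the hook below is an illustration, demonstrated on a plain `Linear` layer):

```python
import torch

def make_steering_hook(persona_vec: torch.Tensor, steering_coef: float):
    # Forward hook that shifts a layer's output along the unit-norm persona direction
    unit = persona_vec / persona_vec.norm()

    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first element is the hidden states
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering_coef * unit.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Toy demonstration on a plain Linear layer standing in for a decoder layer
torch.manual_seed(0)
layer = torch.nn.Linear(16, 16)
vec = torch.randn(16)
handle = layer.register_forward_hook(make_steering_hook(vec, steering_coef=4.0))

x = torch.randn(2, 16)
with torch.no_grad():
    steered = layer(x)
handle.remove()
with torch.no_grad():
    plain = layer(x)

print(torch.allclose(steered, plain + 4.0 * vec / vec.norm(), atol=1e-5))  # True
```

During actual finetuning the hook stays registered for every training step, so gradients flow through the shifted activations and the optimizer no longer needs to push the model's weights along the trait direction to fit the data.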

SAE interpretation on Gemma. This bridges the two models used in this notebook. Reload Gemma 2 27B from Sections 1-2, run the contrastive pipeline for sycophancy on Gemma (adapting extract_all_layer_activations_qwen to use model.model.language_model.layers instead of model.model.layers), then load a GemmaScope SAE for the appropriate layer (~65% through Gemma's layers) and encode the sycophancy vector through it: features = sae.encode(vector). Inspect the top-k activated features - they should relate to concepts like agreement, validation, flattery, or opinion-matching. This requires reloading the larger model and loading SAE weights, so only attempt it if you have the GPU memory.
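The encode-and-inspect step looks roughly like the following. This sketch substitutes a toy randomly initialized SAE so it runs standalone; in the actual exercise you would load pretrained GemmaScope weights instead, and `ToySAE`, its dimensions, and the random persona vector are all placeholders:

```python
import torch

class ToySAE:
    # Stand-in for a GemmaScope SAE: encode() maps a residual-stream vector
    # to sparse (ReLU) feature activations. A real SAE would use pretrained
    # W_enc / b_enc weights rather than random ones.
    def __init__(self, d_model: int, d_sae: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.W_enc = torch.randn(d_model, d_sae, generator=g) / d_model**0.5
        self.b_enc = torch.zeros(d_sae)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x @ self.W_enc + self.b_enc)

sae = ToySAE(d_model=128, d_sae=1024)
sycophancy_vec = torch.randn(128)      # placeholder for the real persona vector

features = sae.encode(sycophancy_vec)  # (d_sae,) sparse feature activations
top = torch.topk(features, k=10)       # the top-k features are what you inspect
for idx, val in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: activation {val:.3f}")
```

With a real GemmaScope SAE, the returned indices are the interesting output: each one names a learned feature whose description (agreement, flattery, opinion-matching, etc.) you can look up to interpret what the sycophancy vector is composed of.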