Exercise Status: First half (Assistant Axis replication) complete; second half (persona vectors) not yet done

4️⃣ Steering with Persona Vectors

Learning Objectives
  • Complete your artifact pipeline by implementing persona steering
  • Repeat this full pipeline for "hallucination" and "evil", as well as for any additional traits you choose to study
  • Study the geometry of trait vectors

Introduction

Now that we've extracted trait-specific vectors, we can validate them in two ways: steering (adding the vector during generation to amplify/suppress the trait) and projection-based monitoring (projecting onto the vector to measure trait expression without any intervention).

In Section 2, we implemented activation capping — a conditional intervention that only kicks in when the model drifts below a threshold. Here, we'll implement the simpler and more general approach of activation steering: an unconditional intervention that adds coeff * vector to a layer's output at every step. This is the same approach used in the persona vectors repo's activation_steer.py.
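If PyTorch forward hooks are unfamiliar, here is a minimal, generic sketch on a toy `nn.Linear` layer (the layer and vector are stand-ins, not part of the exercise). The key mechanic the exercise relies on: returning a tensor from a forward hook replaces the module's output.

```python
import torch
from torch import nn

layer = nn.Linear(4, 4, bias=False)
vector = torch.ones(4)  # stand-in "steering vector"
coeff = 2.0
x = torch.randn(1, 4)

baseline = layer(x)

def hook_fn(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output
    return output + coeff * vector

handle = layer.register_forward_hook(hook_fn)
steered = layer(x)
handle.remove()  # after removal, the layer behaves normally again

print(torch.allclose(steered, baseline + coeff * vector))  # True
```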

Exercise - Implement the ActivationSteerer

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵⚪
> You should spend up to 20-25 minutes on this exercise.

Implement an ActivationSteerer context manager class that registers a forward hook to add coeff * steering_vector to a chosen layer's output during generation.

This mirrors the ActivationSteerer from activation_steer.py in the persona vectors repo, simplified to only support positions="all" (steering all token positions).

Key design points:
  • The class should work as a context manager (with ActivationSteerer(...) as steerer:)
  • On __enter__, register a forward hook on the target layer
  • On __exit__, remove the hook (even if an exception occurred)
  • The hook should handle the case where the layer output is a tuple (common in HuggingFace models)
  • For Qwen, layers are accessed via model.model.layers[layer_idx]
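If the context-manager protocol itself is unfamiliar, this toy example (unrelated to the model) shows why __exit__ is the right place for cleanup: it runs even when the body raises.

```python
events = []

class Managed:
    """Toy context manager: __exit__ runs even when the body raises."""

    def __enter__(self):
        events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("exit")  # cleanup always happens
        return False           # returning False lets exceptions propagate

try:
    with Managed():
        raise ValueError("boom")
except ValueError:
    events.append("caught")

print(events)  # ['enter', 'exit', 'caught']
```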

Hints
  • The hook function signature is hook_fn(module, input, output) where output is typically a tuple (hidden_states, ...)
  • Use layer.register_forward_hook(hook_fn) to register, and handle.remove() to clean up
  • The steering vector needs to be on the same device and dtype as the hidden states
  • Make sure to return the modified output tuple from the hook
class ActivationSteerer:
    """
    Context manager that adds (coeff * steering_vector) to a chosen layer's hidden states
    during forward passes. Used for inference-time activation steering.

    Usage:
        with ActivationSteerer(model, vector, coeff=2.0, layer_idx=20):
            output = model.generate(...)
    """

    # Implement __init__, __enter__, __exit__, and the hook function
    pass


# Test 1: Verify hook modifies outputs
test_prompt = "What is the capital of France?"
messages = [{"role": "user", "content": test_prompt}]
formatted = qwen_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
test_inputs = qwen_tokenizer(formatted, return_tensors="pt").to(qwen_model.device)

# Get baseline hidden states
with t.inference_mode():
    baseline_out = qwen_model(**test_inputs, output_hidden_states=True)
baseline_hidden = baseline_out.hidden_states[TRAIT_VECTOR_LAYER + 1][0, -1].cpu()

# Get steered hidden states
test_vector = sycophantic_vectors[TRAIT_VECTOR_LAYER - 1]  # -1 because vectors are 0-indexed by layer
with ActivationSteerer(qwen_model, test_vector, coeff=1.0, layer_idx=TRAIT_VECTOR_LAYER):
    with t.inference_mode():
        steered_out = qwen_model(**test_inputs, output_hidden_states=True)
steered_hidden = steered_out.hidden_states[TRAIT_VECTOR_LAYER + 1][0, -1].cpu()

diff = (steered_hidden - baseline_hidden).norm().item()
print(f"Difference in hidden states with steering: {diff:.4f} (should be > 0)")
assert diff > 0, "Steering hook is not modifying hidden states!"

# Test 2: coeff=0 should match baseline
with ActivationSteerer(qwen_model, test_vector, coeff=0.0, layer_idx=TRAIT_VECTOR_LAYER):
    with t.inference_mode():
        zero_out = qwen_model(**test_inputs, output_hidden_states=True)
zero_hidden = zero_out.hidden_states[TRAIT_VECTOR_LAYER + 1][0, -1].cpu()
zero_diff = (zero_hidden - baseline_hidden).norm().item()
print(f"Difference with coeff=0: {zero_diff:.6f} (should be ~0)")
assert zero_diff < 1e-3, "coeff=0 should leave hidden states unchanged!"

# Test 3: Hook is removed after context manager exits
with t.inference_mode():
    after_out = qwen_model(**test_inputs, output_hidden_states=True)
after_hidden = after_out.hidden_states[TRAIT_VECTOR_LAYER + 1][0, -1].cpu()
after_diff = (after_hidden - baseline_hidden).norm().item()
print(f"Difference after context manager exit: {after_diff:.6f} (should be ~0)")
assert after_diff < 1e-3, "Hook was not removed after the context manager exited!"
print("\nAll ActivationSteerer inline tests passed!")

tests.test_activation_steerer_hook(ActivationSteerer)
Solution
class ActivationSteerer:
    """
    Context manager that adds (coeff * steering_vector) to a chosen layer's hidden states
    during forward passes. Used for inference-time activation steering.

    Usage:
        with ActivationSteerer(model, vector, coeff=2.0, layer_idx=20):
            output = model.generate(...)
    """

    def __init__(
        self,
        model: t.nn.Module,
        steering_vector: Float[Tensor, " d_model"],
        coeff: float = 1.0,
        layer_idx: int = 20,
    ):
        self.model = model
        self.coeff = coeff
        self.layer_idx = layer_idx
        self._handle = None
        # Store the vector; it is moved to the correct device/dtype inside the hook
        self.vector = steering_vector.clone()

    def _hook_fn(self, module, input, output):
        """Add coeff * vector to the hidden states."""
        steer = self.coeff * self.vector
        # Handle tuple output (common in HuggingFace models)
        if isinstance(output, tuple):
            hidden_states = output[0]
            steer = steer.to(hidden_states.device, dtype=hidden_states.dtype)
            modified = hidden_states + steer
            return (modified,) + output[1:]
        else:
            steer = steer.to(output.device, dtype=output.dtype)
            return output + steer

    def __enter__(self):
        # Register hook on the target layer (Qwen architecture)
        layer = self.model.model.layers[self.layer_idx]
        self._handle = layer.register_forward_hook(self._hook_fn)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always remove the hook, even if an exception occurred
        if self._handle is not None:
            self._handle.remove()
            self._handle = None

Exercise - Steering experiments (sycophancy)

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵🔵
> You should spend up to 25-30 minutes on this exercise.

Let's see if our sycophancy vector actually works. We'll generate responses at multiple steering coefficients and score them with the autorater to check whether sycophancy increases/decreases as expected.

Your task: implement run_steering_experiment which:
  1. For each coefficient in the list, uses ActivationSteerer to generate responses to the evaluation questions
  2. Scores each response with the autorater
  3. Returns results organized for plotting

Hints
  • Use the ActivationSteerer context manager from the previous exercise
  • For generation, use model.generate() inside the context manager
  • Score responses using score_trait_response from Exercise 3.3
  • The steering vector should be at the layer specified (default: layer 20)
def generate_with_steerer(
    model,
    tokenizer,
    prompt: str,
    steering_vector: Float[Tensor, " d_model"],
    layer_idx: int,
    coeff: float,
    max_new_tokens: int = 256,
    temperature: float = 0.7,
) -> str:
    """Generate a response with activation steering applied."""
    messages = [{"role": "user", "content": prompt}]
    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    prompt_length = inputs.input_ids.shape[1]

    with ActivationSteerer(model, steering_vector, coeff=coeff, layer_idx=layer_idx):
        with t.inference_mode():
            output_ids = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )

    response_ids = output_ids[0, prompt_length:]
    return tokenizer.decode(response_ids, skip_special_tokens=True)


def run_steering_experiment(
    model,
    tokenizer,
    questions: list[str],
    steering_vector: Float[Tensor, " d_model"],
    eval_prompt_template: str,
    layer_idx: int = 20,
    coefficients: list[float] | None = None,
    max_new_tokens: int = 256,
) -> list[dict]:
    """
    Run steering experiment: generate and score responses at multiple coefficients.

    Args:
        model: The language model
        tokenizer: The tokenizer
        questions: List of evaluation questions
        steering_vector: The trait vector for the target layer
        eval_prompt_template: Template for autorater scoring
        layer_idx: Which layer to steer at
        coefficients: List of steering coefficients to test
        max_new_tokens: Maximum tokens per response

    Returns:
        List of dicts with keys: coefficient, question, response, score
    """
    if coefficients is None:
        coefficients = [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]

    raise NotImplementedError()


# Run the steering experiment
sycophantic_vector_layer20 = sycophantic_vectors[TRAIT_VECTOR_LAYER - 1]  # 0-indexed

steering_results = run_steering_experiment(
    model=qwen_model,
    tokenizer=qwen_tokenizer,
    questions=sycophantic_data["questions"],
    steering_vector=sycophantic_vector_layer20,
    eval_prompt_template=sycophantic_data["eval_prompt"],
    layer_idx=TRAIT_VECTOR_LAYER,
    coefficients=[-3.0, -1.0, 0.0, 1.0, 3.0, 5.0],
)

# Plot mean score vs coefficient
import pandas as pd

df = pd.DataFrame(steering_results)
df_valid = df[df["score"].notna()]
mean_scores = df_valid.groupby("coefficient")["score"].mean()

fig = px.line(
    x=mean_scores.index,
    y=mean_scores.values,
    title="Sycophancy Score vs Steering Coefficient",
    labels={"x": "Steering Coefficient", "y": "Mean Sycophancy Score (0-100)"},
    markers=True,
)
fig.add_hline(y=50, line_dash="dash", annotation_text="Threshold", line_color="gray")
fig.show()

print("\nMean sycophancy scores by coefficient:")
for coeff, score in mean_scores.items():
    print(f"  coeff={coeff:+.1f}: {score:.1f}")

# Show example responses at different coefficients for same question
example_q = sycophantic_data["questions"][0]
print(f"\n--- Example responses for: {example_q[:60]}... ---")
for coeff in [-3.0, 0.0, 5.0]:
    example = next((r for r in steering_results if r["coefficient"] == coeff and r["question"] == example_q), None)
    if example:
        print(f"\ncoeff={coeff:+.1f} (score={example['score']}):")
        print_with_wrap(f"  {example['response'][:200]}...")

# Save results
save_path = section_dir / "sycophantic_steering_results.json"
with open(save_path, "w") as f:
    json.dump(steering_results, f, indent=2)
print(f"\nSaved steering results to {save_path}")
Expected observations

You should see a clear monotonic relationship between steering coefficient and sycophancy score: - Negative coefficients (e.g., -3): Lower sycophancy scores — the model pushes back on opinions, provides balanced views - Zero coefficient: Baseline behavior — moderate sycophancy (the model's default tendency) - Positive coefficients (e.g., +3, +5): Higher sycophancy scores — the model enthusiastically agrees with everything

At extreme coefficients (|coeff| > 5), coherence may start to degrade — the model might produce repetitive or nonsensical text. This defines the "safe steering range."

If the plot is flat or non-monotonic, check that you're using the correct layer and that your vector was extracted from enough effective pairs.

Solution
def generate_with_steerer(
    model,
    tokenizer,
    prompt: str,
    steering_vector: Float[Tensor, " d_model"],
    layer_idx: int,
    coeff: float,
    max_new_tokens: int = 256,
    temperature: float = 0.7,
) -> str:
    """Generate a response with activation steering applied."""
    messages = [{"role": "user", "content": prompt}]
    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    prompt_length = inputs.input_ids.shape[1]
    with ActivationSteerer(model, steering_vector, coeff=coeff, layer_idx=layer_idx):
        with t.inference_mode():
            output_ids = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
    response_ids = output_ids[0, prompt_length:]
    return tokenizer.decode(response_ids, skip_special_tokens=True)
def run_steering_experiment(
    model,
    tokenizer,
    questions: list[str],
    steering_vector: Float[Tensor, " d_model"],
    eval_prompt_template: str,
    layer_idx: int = 20,
    coefficients: list[float] | None = None,
    max_new_tokens: int = 256,
) -> list[dict]:
    """
    Run steering experiment: generate and score responses at multiple coefficients.
    Args:
        model: The language model
        tokenizer: The tokenizer
        questions: List of evaluation questions
        steering_vector: The trait vector for the target layer
        eval_prompt_template: Template for autorater scoring
        layer_idx: Which layer to steer at
        coefficients: List of steering coefficients to test
        max_new_tokens: Maximum tokens per response
    Returns:
        List of dicts with keys: coefficient, question, response, score
    """
    if coefficients is None:
        coefficients = [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]
    results = []
    total = len(coefficients) * len(questions)
    pbar = tqdm(total=total, desc="Steering experiment")

    for coeff in coefficients:
        for question in questions:
            # Generate with steering
            response = generate_with_steerer(
                model, tokenizer, question, steering_vector, layer_idx, coeff, max_new_tokens
            )

            # Score with autorater
            score = score_trait_response(question, response, eval_prompt_template)
            time.sleep(0.05)  # Rate limiting

            results.append(
                {
                    "coefficient": coeff,
                    "question": question,
                    "response": response,
                    "score": score,
                }
            )
            pbar.update(1)

    pbar.close()
    return results

Exercise - Projection-based monitoring

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵⚪
> You should spend up to 20-25 minutes on this exercise.

Steering is an intervention — it changes model behavior. But we can also measure trait expression without intervention, by projecting a model's response activations onto the trait vector. This gives us a scalar indicating how much the response exhibits the trait.

This is the same approach as eval/cal_projection.py in the persona vectors repo, where the projection is defined as:

$$\text{projection} = \frac{a \cdot v}{\|v\|}$$

where $a$ is the mean response activation and $v$ is the trait vector.
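As a quick sanity check of the formula, with made-up two-dimensional numbers (not real activations):

```python
import torch

a = torch.tensor([3.0, 4.0])  # stand-in mean response activation
v = torch.tensor([0.0, 2.0])  # stand-in trait vector
projection = (a @ v) / v.norm()
print(projection.item())  # 4.0  (= 8 / 2)
```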

Your task: implement compute_trait_projections which computes the projection of response activations onto the trait vector, then apply it to three conditions:
  1. Baseline responses (no system prompt)
  2. Positive-prompted responses (sycophantic system prompt)
  3. Steered responses at various coefficients

Hints
  • Use extract_response_activations to get activations at the target layer
  • The projection formula is (activation @ vector) / vector.norm()
  • For baseline responses, use an empty string as the system prompt
def compute_trait_projections(
    model,
    tokenizer,
    system_prompts: list[str],
    questions: list[str],
    responses: list[str],
    trait_vector: Float[Tensor, " d_model"],
    layer: int,
) -> list[float]:
    """
    Compute projection of response activations onto the trait vector.

    Args:
        model: The language model
        tokenizer: The tokenizer
        system_prompts: List of system prompts (one per response)
        questions: List of questions (one per response)
        responses: List of response texts
        trait_vector: The trait vector at the specified layer
        layer: Which layer to extract activations from

    Returns:
        List of projection values (one per response)
    """
    raise NotImplementedError()


# Compute projections for three conditions
syc_vector = sycophantic_vectors[TRAIT_VECTOR_LAYER - 1]
questions_subset = sycophantic_data["questions"][:10]  # Use a subset for speed

# 1. Baseline (no system prompt)
print("Computing baseline projections...")
baseline_responses_list = []
for q in questions_subset:
    resp = generate_with_steerer(
        qwen_model, qwen_tokenizer, q, syc_vector, TRAIT_VECTOR_LAYER, coeff=0.0, max_new_tokens=256
    )
    baseline_responses_list.append(resp)
baseline_projections = compute_trait_projections(
    qwen_model,
    qwen_tokenizer,
    [""] * len(questions_subset),
    questions_subset,
    baseline_responses_list,
    syc_vector,
    TRAIT_VECTOR_LAYER,
)

# 2. Positive-prompted
print("Computing positive-prompted projections...")
pos_prompt = construct_system_prompt("sycophantic", sycophantic_data["instruction"][0]["pos"])
pos_resp_list = [
    next(
        (
            r["response"]
            for r in sycophantic_responses
            if r["question"] == q and r["polarity"] == "pos" and r["instruction_idx"] == 0
        ),
        "",
    )
    for q in questions_subset
]
pos_projections = compute_trait_projections(
    qwen_model,
    qwen_tokenizer,
    [pos_prompt] * len(questions_subset),
    questions_subset,
    pos_resp_list,
    syc_vector,
    TRAIT_VECTOR_LAYER,
)

# 3. Steered at coeff=3
print("Computing steered projections (coeff=3)...")
steered_responses_list = []
for q in questions_subset:
    resp = generate_with_steerer(
        qwen_model, qwen_tokenizer, q, syc_vector, TRAIT_VECTOR_LAYER, coeff=3.0, max_new_tokens=256
    )
    steered_responses_list.append(resp)
steered_projections = compute_trait_projections(
    qwen_model,
    qwen_tokenizer,
    [""] * len(questions_subset),
    questions_subset,
    steered_responses_list,
    syc_vector,
    TRAIT_VECTOR_LAYER,
)

# Plot
fig = px.box(
    x=["Baseline"] * len(baseline_projections)
    + ["Positive-prompted"] * len(pos_projections)
    + ["Steered (coeff=3)"] * len(steered_projections),
    y=baseline_projections + pos_projections + steered_projections,
    title="Sycophancy Projections by Condition",
    labels={"x": "Condition", "y": "Projection onto Sycophancy Vector"},
)
fig.show()

print("\nMean projections:")
print(f"  Baseline: {np.mean(baseline_projections):.3f}")
print(f"  Positive-prompted: {np.mean(pos_projections):.3f}")
print(f"  Steered (coeff=3): {np.mean(steered_projections):.3f}")
Expected observations

You should see clear separation between the three conditions:
  • Baseline projections should be moderate (the model's natural sycophancy level)
  • Positive-prompted projections should be higher (the system prompt elicits sycophancy)
  • Steered projections should be highest (the vector amplifies sycophancy in activation space)

This shows the trait vector captures sycophancy not just behaviorally (autorater scores) but also in activation space (projections). The projection metric is especially useful because it lets us monitor trait expression without needing an expensive API-based autorater.
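One way to quantify this on your own results (a sketch with stand-in numbers; the scores and projections below are hypothetical, not from the experiment) is to correlate autorater scores with projections over the same responses:

```python
import numpy as np

# Hypothetical paired values for the same set of responses
scores = np.array([20.0, 45.0, 60.0, 80.0, 95.0])  # autorater scores (0-100)
projections = np.array([1.1, 2.0, 2.4, 3.5, 4.2])  # projections onto the trait vector

# Pearson correlation: close to 1 means the cheap projection tracks the expensive autorater
r = np.corrcoef(scores, projections)[0, 1]
print(f"Pearson r = {r:.3f}")
```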

Solution
def compute_trait_projections(
    model,
    tokenizer,
    system_prompts: list[str],
    questions: list[str],
    responses: list[str],
    trait_vector: Float[Tensor, " d_model"],
    layer: int,
) -> list[float]:
    """
    Compute projection of response activations onto the trait vector.
    Args:
        model: The language model
        tokenizer: The tokenizer
        system_prompts: List of system prompts (one per response)
        questions: List of questions (one per response)
        responses: List of response texts
        trait_vector: The trait vector at the specified layer
        layer: Which layer to extract activations from
    Returns:
        List of projection values (one per response)
    """
    # Extract activations at the target layer
    activations = extract_response_activations(
        model, tokenizer, system_prompts, questions, responses, layer
    )  # (num_examples, d_model)

    # Compute projections: (activation @ vector) / ||vector||
    vector_norm = trait_vector.norm()
    projections = (activations @ trait_vector) / vector_norm

    return projections.tolist()

Exercise - Multi-trait pipeline

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵🔵⚪
> You should spend up to 20-25 minutes on this exercise.

Now that we've validated the full pipeline for sycophancy, let's see if it generalizes to other traits. Rather than manually re-running each exercise, refactor the pipeline into a single function run_trait_pipeline that handles everything from generation through steering evaluation.

The persona vectors repo includes pre-generated trait data files for 7 traits: evil, sycophantic, hallucinating, impolite, optimistic, humorous, and apathetic. We'll run the pipeline for at least evil and hallucinating in addition to sycophancy.

Hints

Your run_trait_pipeline should call, in order:
  1. generate_contrastive_responses (Exercise 3.2)
  2. Score with score_trait_response (Exercise 3.3) and filter for effective pairs
  3. extract_contrastive_vectors (Exercise 3.4)
  4. run_steering_experiment (Exercise 4.2)

Return the trait vectors and steering results.

def run_trait_pipeline(
    model,
    tokenizer,
    trait_name: str,
    trait_data: dict,
    layer_idx: int = 20,
    steering_coefficients: list[float] | None = None,
    max_new_tokens: int = 256,
) -> tuple[Float[Tensor, "num_layers d_model"], list[dict]]:
    """
    Run the full contrastive extraction and steering pipeline for a single trait.

    Args:
        model: The language model
        tokenizer: The tokenizer
        trait_name: Name of the trait (e.g., "evil", "hallucinating")
        trait_data: Dict with 'instruction', 'questions', 'eval_prompt' keys
        layer_idx: Which layer to use for steering experiments
        steering_coefficients: Coefficients to test in steering experiment
        max_new_tokens: Maximum tokens per response

    Returns:
        Tuple of (trait_vectors tensor of shape [num_layers, d_model], steering_results list)
    """
    if steering_coefficients is None:
        steering_coefficients = [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]

    raise NotImplementedError()


# Run pipeline for additional traits
additional_traits = ["evil", "hallucinating"]
all_trait_vectors = {"sycophantic": sycophantic_vectors}  # We already have sycophancy
all_steering_results = {"sycophantic": steering_results}

for trait_name in additional_traits:
    trait_data_path = TRAIT_DATA_PATH / f"{trait_name}.json"
    with open(trait_data_path, "r") as f:
        trait_data = json.load(f)

    vectors, steer_results = run_trait_pipeline(
        model=qwen_model,
        tokenizer=qwen_tokenizer,
        trait_name=trait_name,
        trait_data=trait_data,
        layer_idx=TRAIT_VECTOR_LAYER,
    )
    all_trait_vectors[trait_name] = vectors
    all_steering_results[trait_name] = steer_results

print(f"\nCompleted pipeline for {len(all_trait_vectors)} traits: {list(all_trait_vectors.keys())}")
Solution
def run_trait_pipeline(
    model,
    tokenizer,
    trait_name: str,
    trait_data: dict,
    layer_idx: int = 20,
    steering_coefficients: list[float] | None = None,
    max_new_tokens: int = 256,
) -> tuple[Float[Tensor, "num_layers d_model"], list[dict]]:
    """
    Run the full contrastive extraction and steering pipeline for a single trait.
    Args:
        model: The language model
        tokenizer: The tokenizer
        trait_name: Name of the trait (e.g., "evil", "hallucinating")
        trait_data: Dict with 'instruction', 'questions', 'eval_prompt' keys
        layer_idx: Which layer to use for steering experiments
        steering_coefficients: Coefficients to test in steering experiment
        max_new_tokens: Maximum tokens per response
    Returns:
        Tuple of (trait_vectors tensor of shape [num_layers, d_model], steering_results list)
    """
    if steering_coefficients is None:
        steering_coefficients = [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]
    print(f"\n{'=' * 60}")
    print(f"Running pipeline for trait: {trait_name}")
    print(f"{'=' * 60}")

    # Step 1: Generate contrastive responses
    print("\n--- Step 1: Generating contrastive responses ---")
    responses = generate_contrastive_responses(model, tokenizer, trait_data, trait_name, max_new_tokens)

    # Save responses
    save_path = section_dir / f"{trait_name}_responses.json"
    with open(save_path, "w") as f:
        json.dump(responses, f, indent=2)

    # Step 2: Score with autorater and filter
    print("\n--- Step 2: Scoring with autorater ---")
    eval_prompt = trait_data["eval_prompt"]
    for entry in tqdm(responses, desc="Scoring"):
        entry["score"] = score_trait_response(entry["question"], entry["response"], eval_prompt)
        time.sleep(0.05)

    # Filter for effective pairs
    effective_pairs = []
    for inst_idx in range(len(trait_data["instruction"])):
        for question in trait_data["questions"]:
            pos_entry = next(
                (
                    r
                    for r in responses
                    if r["instruction_idx"] == inst_idx and r["question"] == question and r["polarity"] == "pos"
                ),
                None,
            )
            neg_entry = next(
                (
                    r
                    for r in responses
                    if r["instruction_idx"] == inst_idx and r["question"] == question and r["polarity"] == "neg"
                ),
                None,
            )
            if pos_entry and neg_entry and pos_entry["score"] is not None and neg_entry["score"] is not None:
                if pos_entry["score"] >= 50 and neg_entry["score"] < 50:
                    effective_pairs.append({"pos": pos_entry, "neg": neg_entry})

    print(f"Effective pairs: {len(effective_pairs)}")
    if len(effective_pairs) < 5:
        print(f"WARNING: Only {len(effective_pairs)} effective pairs — results may be noisy!")

    # Step 3: Extract contrastive vectors
    print("\n--- Step 3: Extracting contrastive vectors ---")
    trait_vectors = extract_contrastive_vectors(model, tokenizer, effective_pairs)

    # Save vectors
    t.save(trait_vectors, section_dir / f"{trait_name}_vectors.pt")

    # Step 4: Run steering experiment
    print("\n--- Step 4: Running steering experiment ---")
    trait_vector_at_layer = trait_vectors[layer_idx - 1]
    steering_results = run_steering_experiment(
        model, tokenizer, trait_data["questions"], trait_vector_at_layer, eval_prompt, layer_idx, steering_coefficients
    )

    # Save steering results
    with open(section_dir / f"{trait_name}_steering_results.json", "w") as f:
        json.dump(steering_results, f, indent=2)

    # Print summary
    import pandas as pd

    df = pd.DataFrame(steering_results)
    df_valid = df[df["score"].notna()]
    print(f"\nSteering results for {trait_name}:")
    for coeff in steering_coefficients:
        coeff_scores = df_valid[df_valid["coefficient"] == coeff]["score"]
        print(f"  coeff={coeff:+.1f}: mean score = {coeff_scores.mean():.1f} (n={len(coeff_scores)})")

    return trait_vectors, steering_results

Exercise - Multi-trait geometry

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵⚪
> You should spend up to 15-20 minutes on this exercise.

Now that we have vectors for multiple traits, let's study how they relate to each other in activation space. Are sycophancy and evil correlated? Are any traits redundant?

Compute the pairwise cosine similarity between all trait vectors at layer 20, and visualize it as a heatmap. This connects back to the persona space analysis from Section 1 — but now instead of looking at full persona vectors, we're comparing directions that correspond to specific behavioral traits.

# Compute cosine similarity between trait vectors and visualize as heatmap
# Use the vectors at TRAIT_VECTOR_LAYER (layer 20) for each trait
raise NotImplementedError()
Expected observations

You should see that most trait pairs have moderate-to-low cosine similarity (|cos_sim| < 0.5), indicating they capture genuinely different behavioral dimensions. Some observations to look for:

  • Evil and sycophancy might have a small positive correlation (both involve departing from honest, balanced behavior) or be nearly independent
  • Hallucination and sycophancy might show a small correlation (both involve saying what the user wants to hear vs being accurate)
  • No two traits should have very high correlation (> 0.8) — if they did, they'd be capturing the same underlying phenomenon

This tells us something important: the model's behavioral space can't be captured by a single axis (like the Assistant Axis). Multiple independent directions exist, each corresponding to a specific kind of behavioral shift. The Assistant Axis from Section 1 is probably some weighted combination of several of these trait directions.
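That last claim can be tested directly: if the axis lies in the span of the trait vectors, least squares recovers the combination weights. A minimal sketch with synthetic stand-in vectors (with real data, traits and axis would be your extracted vectors):

```python
import torch

torch.manual_seed(0)
d_model = 8
traits = torch.randn(3, d_model)          # stand-in trait vectors (one per row)
axis = 0.5 * traits[0] + 0.3 * traits[1]  # an "axis" built from known weights

# Solve traits.T @ w ≈ axis for the weight vector w
weights = torch.linalg.lstsq(traits.T, axis.unsqueeze(1)).solution.squeeze()
print(weights)  # approximately [0.5, 0.3, 0.0]
```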

Solution
# Extract trait vectors at the target layer
trait_names = list(all_trait_vectors.keys())
layer_vectors = {name: vecs[TRAIT_VECTOR_LAYER - 1] for name, vecs in all_trait_vectors.items()}
# Compute cosine similarity matrix
names = list(layer_vectors.keys())
vectors_stacked = t.stack([layer_vectors[name] for name in names])
vectors_normalized = vectors_stacked / vectors_stacked.norm(dim=1, keepdim=True)
cos_sim = (vectors_normalized @ vectors_normalized.T).float()
# Plot heatmap
fig = px.imshow(
    cos_sim.numpy(),
    x=names,
    y=names,
    title=f"Trait Vector Cosine Similarity (Layer {TRAIT_VECTOR_LAYER})",
    color_continuous_scale="RdBu",
    color_continuous_midpoint=0.0,
    zmin=-1,
    zmax=1,
)
fig.show()
# Print the matrix
print("Cosine similarity matrix:")
for i, name_i in enumerate(names):
    for j, name_j in enumerate(names):
        print(f"  {name_i} vs {name_j}: {cos_sim[i, j].item():.3f}")
# Discussion prompts
print("\n--- Discussion ---")
print("Consider:")
print("  1. Are 'evil' and 'sycophancy' independent or correlated?")
print("  2. Which traits are most similar? Most different?")
print("  3. How does this compare to the 'single axis' view from the Assistant Axis paper?")
print("     (The Assistant Axis captured one dominant direction; here we see multiple independent directions)")