
1️⃣ Setup & visualizing truth representations

Learning Objectives
  • Extract hidden state activations from specified layers and token positions
  • Implement PCA to visualize high-dimensional activations
  • Observe that truth is linearly separable in activation space - even without supervision
  • Understand which layers best represent truth via a layer sweep

Extracting activations

Our first task is to extract hidden state activations from the model. For the Geometry of Truth approach, we extract the last-token activation at each specified layer. For declarative statements like "The city of Paris is in France.", the model's representation of whether the statement is true or false is concentrated at the final token position.

Note - the Geometry of Truth paper probes specifically at the end-of-sentence punctuation token (the period / full stop). The datasets are designed so that every statement ends with a period, meaning the last token is always the period. This is important because the model's truth representation builds up over the sentence and is concentrated at the final punctuation mark.

A few technical details to keep in mind. We pass output_hidden_states=True in the forward pass to get activations from every layer. outputs.hidden_states has length num_layers + 1: index 0 is the embedding output, and index i for i >= 1 is the output of transformer layer i-1. We also need to handle padding correctly, since statements have different lengths: we pad the batch, but must extract the activation at the last real (non-padding) token rather than at the final position. This relies on right padding (the HuggingFace tokenizer default), which keeps each sequence's real tokens at the start, so the last real token's index can be recovered from the attention mask.
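The num_layers + 1 convention is easy to verify on a tiny randomly initialized GPT-2 (the config values below are arbitrary; nothing is downloaded):

```python
import torch as t
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model: 2 transformer layers, small dims (arbitrary values)
config = GPT2Config(n_layer=2, n_head=2, n_embd=32, vocab_size=100)
tiny_model = GPT2LMHeadModel(config)

input_ids = t.randint(0, 100, (1, 5))  # batch of 1, seq len 5
with t.no_grad():
    out = tiny_model(input_ids, output_hidden_states=True)

# n_layer + 1 entries: embeddings first, then one per transformer layer
print(len(out.hidden_states))         # 3
print(out.hidden_states[0].shape)     # torch.Size([1, 5, 32])
```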

Exercise - implement extract_activations

Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵🔵🔵
You should spend up to 15-20 minutes on this exercise. This is the foundation for everything else - getting activation extraction right is critical.

Implement a function that extracts the last-token hidden state from specified layers for a batch of statements. You'll need to:

  1. Tokenize the statements with padding
  2. Run the forward pass with output_hidden_states=True
  3. For each sequence, find the index of the last non-padding token using attention_mask
  4. Extract the hidden state at that position for each requested layer

Hint - handling padding

Use attention_mask.sum(dim=1) - 1 to get the index of the last non-padding token for each sequence. Then use torch.arange and advanced indexing to select the right positions.

Hint - hidden_states indexing

outputs.hidden_states[0] is the embedding layer output. outputs.hidden_states[i] for i >= 1 is the output of transformer layer i-1. So to get layer L's output, index with L + 1.

def extract_activations(
    statements: list[str],
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    layers: list[int],
    batch_size: int = 25,
) -> dict[int, Float[Tensor, "n_statements d_model"]]:
    """
    Extract last-token hidden state activations from specified layers for a list of statements.

    Args:
        statements: List of text statements to process.
        model: A HuggingFace causal language model.
        tokenizer: The corresponding tokenizer.
        layers: List of layer indices (0-indexed) to extract activations from.
        batch_size: Number of statements to process at once.

    Returns:
        Dictionary mapping layer index to tensor of activations, shape [n_statements, d_model].
    """
    raise NotImplementedError()


tests.test_extract_activations(extract_activations, model, tokenizer, PROBE_LAYER, D_MODEL)
Solution
def extract_activations(
    statements: list[str],
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    layers: list[int],
    batch_size: int = 25,
) -> dict[int, Float[Tensor, "n_statements d_model"]]:
    """
    Extract last-token hidden state activations from specified layers for a list of statements.

    Args:
        statements: List of text statements to process.
        model: A HuggingFace causal language model.
        tokenizer: The corresponding tokenizer.
        layers: List of layer indices (0-indexed) to extract activations from.
        batch_size: Number of statements to process at once.

    Returns:
        Dictionary mapping layer index to tensor of activations, shape [n_statements, d_model].
    """
    all_acts = {layer: [] for layer in layers}

    for i in range(0, len(statements), batch_size):
        batch = statements[i : i + batch_size]

        # Sanity check: every statement should end with a period, since the GoT paper probes
        # at the end-of-sentence punctuation token
        for stmt in batch:
            assert stmt.rstrip().endswith("."), f"Statement doesn't end with period: {stmt!r}"

        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)

        with t.no_grad():
            outputs = model(**inputs, output_hidden_states=True)

        # Find the last non-padding token index for each sequence
        last_token_idx = inputs["attention_mask"].sum(dim=1) - 1  # [batch]

        for layer in layers:
            # hidden_states[0] is embedding, hidden_states[layer+1] is output of layer
            hidden = outputs.hidden_states[layer + 1]  # [batch, seq_len, d_model]
            # Extract last real token for each sequence
            batch_indices = t.arange(hidden.shape[0], device=hidden.device)
            acts = hidden[batch_indices, last_token_idx]  # [batch, d_model]
            all_acts[layer].append(acts.cpu().float())

    return {layer: t.cat(acts_list, dim=0) for layer, acts_list in all_acts.items()}
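The padding logic in the solution can be sanity-checked on a hand-built attention mask, with no model needed (all values below are toy data):

```python
import torch as t

# Toy batch: 2 sequences right-padded to length 4; sequence 0 has 3 real
# tokens, sequence 1 has 2
attention_mask = t.tensor([[1, 1, 1, 0],
                           [1, 1, 0, 0]])
hidden = t.arange(2 * 4 * 3, dtype=t.float32).reshape(2, 4, 3)  # [batch, seq, d_model]

last_token_idx = attention_mask.sum(dim=1) - 1  # index of last real token
batch_indices = t.arange(hidden.shape[0])
acts = hidden[batch_indices, last_token_idx]  # [batch, d_model]

print(last_token_idx.tolist())         # [2, 1]
print(t.equal(acts[0], hidden[0, 2]))  # True: last real token of sequence 0
print(t.equal(acts[1], hidden[1, 1]))  # True: last real token of sequence 1
```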

Now let's extract activations for all three datasets at our probe layer. This will take a minute or two.

# Extract activations at the probe layer for all datasets
activations = {}
labels_dict = {}

for name in DATASET_NAMES:
    df = datasets[name]
    statements = df["statement"].tolist()
    labs = t.tensor(df["label"].values, dtype=t.float32)

    acts = extract_activations(statements, model, tokenizer, [PROBE_LAYER])
    activations[name] = acts[PROBE_LAYER]
    labels_dict[name] = labs

# Show summary table
summary = pd.DataFrame(
    {
        "Dataset": DATASET_NAMES,
        "N statements": [len(datasets[n]) for n in DATASET_NAMES],
        "N true": [int(datasets[n]["label"].sum()) for n in DATASET_NAMES],
        "N false": [int((1 - datasets[n]["label"]).sum()) for n in DATASET_NAMES],
        "Act shape": [str(tuple(activations[n].shape)) for n in DATASET_NAMES],
        "Mean norm": [f"{activations[n].norm(dim=-1).mean():.1f}" for n in DATASET_NAMES],
    }
)
display(summary)

Visualizing with PCA

Now comes the striking result. We'll use PCA (Principal Component Analysis) to project our high-dimensional activations down to 2D and see whether true and false statements separate.

PCA is completely unsupervised: it finds the directions of maximum variance without any knowledge of the true/false labels. If we see separation by label in PCA space, it means truth is one of the most prominent features in the activation space.

Before running the next cell, think about what you expect to see. The activations live in a 5120-dimensional space (the hidden size of LLaMA-2-13B). Do you expect truth to be prominent enough for an unsupervised method like PCA to pick up on?
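To build intuition for why unsupervised separation is meaningful, here's a toy sketch with synthetic clusters (using PCA via torch.linalg.svd rather than the eigendecomposition you'll implement next): when one direction dominates the variance, PC1 finds it without ever seeing a label.

```python
import torch as t

t.manual_seed(0)
# Two synthetic clusters separated along the first axis; labels are never used
cluster_a = t.randn(100, 10) * 0.5
cluster_a[:, 0] += 5.0
cluster_b = t.randn(100, 10) * 0.5
cluster_b[:, 0] -= 5.0
X = t.cat([cluster_a, cluster_b])

# PCA via SVD: right singular vectors of the centered data are the PCs
X_centered = X - X.mean(dim=0)
_, _, Vh = t.linalg.svd(X_centered, full_matrices=False)
proj = X_centered @ Vh[0]  # project onto PC1

# PC1 aligns with the cluster axis, so the projection separates the
# clusters perfectly
print((proj[:100].sign() == proj[0].sign()).all().item())   # True
print((proj[100:].sign() == -proj[0].sign()).all().item())  # True
```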

Exercise - implement get_pca_components

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵🔵⚪
You should spend up to 10-15 minutes on this exercise. Standard PCA implementation via eigendecomposition.

Implement PCA by computing the eigendecomposition of the covariance matrix. Steps:

  1. Mean-center the data
  2. Compute the covariance matrix
  3. Eigendecompose it
  4. Return the top-k eigenvectors (sorted by eigenvalue, descending)

def get_pca_components(
    activations: Float[Tensor, "n d_model"],
    k: int = 2,
) -> Float[Tensor, "d_model k"]:
    """
    Compute the top-k principal components of the activation matrix.

    Args:
        activations: Activation matrix, shape [n_samples, d_model].
        k: Number of principal components to return.

    Returns:
        Matrix of top-k eigenvectors as columns, shape [d_model, k].
    """
    raise NotImplementedError()


tests.test_get_pca_components(get_pca_components, activations["cities"], D_MODEL)
Solution
def get_pca_components(
    activations: Float[Tensor, "n d_model"],
    k: int = 2,
) -> Float[Tensor, "d_model k"]:
    """
    Compute the top-k principal components of the activation matrix.

    Args:
        activations: Activation matrix, shape [n_samples, d_model].
        k: Number of principal components to return.

    Returns:
        Matrix of top-k eigenvectors as columns, shape [d_model, k].
    """
    # Mean-center the data
    X = activations - activations.mean(dim=0)

    # Compute covariance matrix
    cov = X.t() @ X / (X.shape[0] - 1)

    # Eigendecompose
    eigenvalues, eigenvectors = t.linalg.eigh(cov)

    # Sort by eigenvalue descending and take top-k
    sorted_indices = t.argsort(eigenvalues, descending=True)
    top_k = eigenvectors[:, sorted_indices[:k]]

    return top_k
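As a quick cross-check of the eigendecomposition approach: on random data, the top eigenvectors of the covariance should match the right singular vectors of the centered data up to sign (the routine is re-implemented inline here so the snippet is self-contained):

```python
import torch as t

t.manual_seed(0)
X = t.randn(50, 8).double()  # double precision for a tight comparison
Xc = X - X.mean(dim=0)

# Eigendecomposition route, as in get_pca_components
cov = Xc.t() @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = t.linalg.eigh(cov)
order = t.argsort(eigvals, descending=True)
pcs_eigh = eigvecs[:, order[:2]]

# SVD route: right singular vectors of the centered data
_, _, Vh = t.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vh[:2].t()

# The two agree elementwise up to a per-component sign flip
max_diff = (pcs_eigh.abs() - pcs_svd.abs()).abs().max().item()
print(max_diff < 1e-8)  # True
```

Note that torch.linalg.eigh returns eigenvalues in ascending order, which is why the solution sorts them descending before taking the top k.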

Now let's visualize the PCA projections for all three datasets. Each point is a statement, colored by whether it's true or false.

Figure 1 of the Geometry of Truth paper shows this plot for LLaMA-2-70B; our results use the 13B model, so they may be somewhat noisier, but should show the same qualitative pattern.

fig = make_subplots(rows=1, cols=3, subplot_titles=DATASET_NAMES)

for i, name in enumerate(DATASET_NAMES):
    acts = activations[name]
    labs = labels_dict[name]
    pcs = get_pca_components(acts, k=2)
    X_centered = acts - acts.mean(dim=0)
    projected = (X_centered @ pcs).numpy()

    # Compute variance explained
    total_var = X_centered.var(dim=0).sum().item()
    pc_var = t.tensor(projected).var(dim=0)
    pct_explained = (pc_var / total_var * 100).tolist()

    colors = ["blue" if l == 1 else "red" for l in labs.tolist()]
    fig.add_trace(
        go.Scatter(
            x=projected[:, 0],
            y=projected[:, 1],
            mode="markers",
            marker=dict(color=colors, size=3, opacity=0.5),
            name=name,
            showlegend=False,
        ),
        row=1,
        col=i + 1,
    )
    fig.update_xaxes(title_text=f"PC1 ({pct_explained[0]:.1f}%)", row=1, col=i + 1)
    fig.update_yaxes(title_text=f"PC2 ({pct_explained[1]:.1f}%)", row=1, col=i + 1)

# Add a legend manually
fig.add_trace(go.Scatter(x=[None], y=[None], mode="markers", marker=dict(color="blue", size=8), name="True"))
fig.add_trace(go.Scatter(x=[None], y=[None], mode="markers", marker=dict(color="red", size=8), name="False"))

fig.update_layout(
    title=f"PCA of Truth Representations (Layer {PROBE_LAYER}, Last Token)",
    height=400,
    width=1200,
)
fig.show()
Click to see the expected output
Question - What do you observe about the separation between true and false statements?

The separation is strikingly linear: true and false statements cluster on opposite sides of a line in PC space. This is surprising because PCA is unsupervised; it finds this structure without any label information. The fact that the first or second principal component aligns with truth/falsehood means that truth is one of the most prominent directions of variation in the activation space.

Note that the separation quality may vary across datasets. The curated datasets (cities, sp_en_trans) tend to show cleaner separation than datasets involving numerical reasoning (larger_than).

Layer sweep: where does truth live?

Not all layers represent truth equally. The Geometry of Truth paper found that truth representations are concentrated in early-to-mid layers, not at the very end. We can verify this by training a simple difference-of-means classifier at every layer and measuring accuracy.

Exercise - implement layer sweep

Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵🔵⚪
You should spend up to 10-15 minutes on this exercise. Understanding which layers to probe is essential practical knowledge.

For each layer, extract activations for the cities dataset, train a simple difference-of-means classifier (direction = mean(true) - mean(false), classify by sign of dot product), and compute accuracy on a held-out test split.

def layer_sweep_accuracy(
    statements: list[str],
    labels: Float[Tensor, " n"],
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    layers: list[int],
    train_frac: float = 0.8,
    batch_size: int = 25,
) -> dict[str, list[float]]:
    """
    For each layer, train a difference-of-means classifier and compute train/test accuracy.

    Args:
        statements: List of statements.
        labels: Binary labels (1=true, 0=false).
        model: The language model.
        tokenizer: The tokenizer.
        layers: List of layer indices to sweep over.
        train_frac: Fraction of data for training.
        batch_size: Batch size for activation extraction.

    Returns:
        Dict with keys "train_acc" and "test_acc", each a list of accuracies per layer.
    """
    raise NotImplementedError()


t.manual_seed(42)
all_layers = list(range(NUM_LAYERS))
cities_statements = datasets["cities"]["statement"].tolist()
cities_labels = t.tensor(datasets["cities"]["label"].values, dtype=t.float32)

sweep_results = layer_sweep_accuracy(cities_statements, cities_labels, model, tokenizer, all_layers)

# Print results as a table
sweep_df = pd.DataFrame(
    {
        "Layer": all_layers,
        "Train Acc": [f"{a:.3f}" for a in sweep_results["train_acc"]],
        "Test Acc": [f"{a:.3f}" for a in sweep_results["test_acc"]],
    }
)
display(sweep_df)

# Plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=all_layers, y=sweep_results["train_acc"], mode="lines+markers", name="Train"))
fig.add_trace(go.Scatter(x=all_layers, y=sweep_results["test_acc"], mode="lines+markers", name="Test"))
fig.add_vline(x=PROBE_LAYER, line_dash="dash", line_color="gray", annotation_text=f"Probe layer ({PROBE_LAYER})")
fig.update_layout(
    title="Layer Sweep: Difference-of-Means Accuracy on Cities Dataset",
    xaxis_title="Layer",
    yaxis_title="Accuracy",
    yaxis_range=[0.4, 1.05],
    height=400,
    width=800,
)
fig.show()

best_layer = all_layers[int(np.argmax(sweep_results["test_acc"]))]
print(f"\nBest layer by test accuracy: {best_layer} ({max(sweep_results['test_acc']):.3f})")
print(f"Configured probe layer: {PROBE_LAYER} ({sweep_results['test_acc'][PROBE_LAYER]:.3f})")
Click to see the expected output
Best layer by test accuracy: 9 (0.963)
Configured probe layer: 14 (0.953)
Question - At which layers is truth best represented? Does this match the paper's configuration?

Truth representations are concentrated in early-to-mid layers (roughly layers 8-20 for LLaMA-2-13B). The configured probe layer of 14 (from the Geometry of Truth paper's config) should be near the peak of test accuracy. The very early layers (0-5) and final layers (35+) typically show much lower accuracy - the early layers haven't yet computed truth-relevant features, and the final layers may have transformed them into prediction-relevant features that aren't as cleanly linear.

Solution
def layer_sweep_accuracy(
    statements: list[str],
    labels: Float[Tensor, " n"],
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    layers: list[int],
    train_frac: float = 0.8,
    batch_size: int = 25,
) -> dict[str, list[float]]:
    """
    For each layer, train a difference-of-means classifier and compute train/test accuracy.

    Args:
        statements: List of statements.
        labels: Binary labels (1=true, 0=false).
        model: The language model.
        tokenizer: The tokenizer.
        layers: List of layer indices to sweep over.
        train_frac: Fraction of data for training.
        batch_size: Batch size for activation extraction.

    Returns:
        Dict with keys "train_acc" and "test_acc", each a list of accuracies per layer.
    """
    # Split into train/test
    n_train = int(len(statements) * train_frac)
    perm = t.randperm(len(statements))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    train_statements = [statements[i] for i in train_idx]
    test_statements = [statements[i] for i in test_idx]
    train_labels = labels[train_idx]
    test_labels = labels[test_idx]

    # Extract activations at all layers at once
    train_acts = extract_activations(train_statements, model, tokenizer, layers, batch_size)
    test_acts = extract_activations(test_statements, model, tokenizer, layers, batch_size)

    train_accs = []
    test_accs = []

    for layer in layers:
        tr_acts = train_acts[layer]
        te_acts = test_acts[layer]

        # Difference of means direction
        true_mean = tr_acts[train_labels == 1].mean(dim=0)
        false_mean = tr_acts[train_labels == 0].mean(dim=0)
        direction = true_mean - false_mean

        # Classify by sign of dot product (centered around midpoint)
        midpoint = (true_mean + false_mean) / 2
        train_preds = ((tr_acts - midpoint) @ direction > 0).float()
        test_preds = ((te_acts - midpoint) @ direction > 0).float()

        train_acc = (train_preds == train_labels).float().mean().item()
        test_acc = (test_preds == test_labels).float().mean().item()
        train_accs.append(train_acc)
        test_accs.append(test_acc)

    return {"train_acc": train_accs, "test_acc": test_accs}
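The difference-of-means classifier used in the sweep can be isolated on synthetic 2D blobs (toy data standing in for activations; the blob means are arbitrary):

```python
import torch as t

t.manual_seed(0)
# Two Gaussian blobs standing in for "true" and "false" activations
true_acts = t.randn(200, 2) + t.tensor([4.0, 0.0])
false_acts = t.randn(200, 2) + t.tensor([-4.0, 0.0])

# Difference-of-means direction and midpoint, as in the layer sweep
direction = true_acts.mean(dim=0) - false_acts.mean(dim=0)
midpoint = (true_acts.mean(dim=0) + false_acts.mean(dim=0)) / 2

# Classify by the sign of the dot product after centering on the midpoint
preds_true = ((true_acts - midpoint) @ direction > 0).float()
preds_false = ((false_acts - midpoint) @ direction > 0).float()
acc = (preds_true.sum() + (1 - preds_false).sum()).item() / 400

print(acc)  # at or near 1.0 for blobs this well separated
```

Centering on the midpoint matters: classifying by the raw sign of the dot product with `direction` implicitly assumes the decision boundary passes through the origin, which generally isn't true for LLM activations.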