4️⃣ Exploring circuits & interventions
Learning Objectives
- Load and inspect pre-computed attribution graphs and their supernodes
- Perform zero ablation experiments to test causal claims made by the graph
- Perform cross-prompt feature swapping to demonstrate compositional circuit structure
- Use open-ended generation with feature interventions
Introduction
This section uses the circuit-tracer library to explore real circuits and test their causal structure via interventions. We'll focus on the Dallas/Austin two-hop factual recall circuit, which requires two-hop reasoning: Dallas is in Texas, and the capital of Texas is Austin.
The circuit-tracer library wraps the same core ideas you implemented in section 3 into a convenient API. Here's roughly how its concepts map onto what you implemented in the previous section:
| Section 3 (your code) | circuit-tracer library |
|---|---|
| FreezeHooks + TranscoderReplacementHooks | ReplacementModel |
| attribute() function | circuit_tracer.attribute() |
| AttributionResult | Graph object |
We're working with a slightly different set of transcoders here (trained on Gemma-2-2B rather than the newer Gemma 3 series), so the circuits will look different from what we computed with Gemma 3-1B IT. But the underlying algorithm (linearise, build nodes, compute adjacency matrix, prune) is exactly the same.
We'll also use visualization tools (provided in utils.py) for drawing circuit diagrams with supernodes: groups of related features that serve a common role. These are adapted from the circuit-tracer demo notebooks.
Installation
Before running the code below, make sure you've cloned the circuit tracer library:
cd chapter1_transformer_interp/exercises
git clone https://github.com/decoderesearch/circuit-tracer.git
To silence any Pylance warnings about missing imports from the cloned library, you can add the following to your workspace settings (if using VS Code):
{
    "python.analysis.extraPaths": [
        "${workspaceFolder}/ARENA_3.0/chapter1_transformer_interp/exercises/circuit-tracer",
    ],
}
Then you should be able to run the remaining cells:
circuit_tracer_path = exercises_dir / "circuit-tracer"
assert circuit_tracer_path.exists(), "circuit-tracer library not found - please clone it first"
if str(circuit_tracer_path) not in sys.path:
    sys.path.insert(0, str(circuit_tracer_path))
from circuit_tracer import ReplacementModel
from circuit_tracer import attribute as circuit_tracer_attribute
replacement_model = ReplacementModel.from_pretrained(
    "google/gemma-2-2b", "gemma", dtype=dtype, backend="transformerlens"
)
The Dallas/Austin circuit
The prompt "Fact: the capital of the state containing Dallas is" requires two-hop reasoning. You can explore the full attribution graph on Neuronpedia. The graph reveals a clear circuit with distinct supernodes (groups of related features): "capital" & "state" are early-layer transcoder features activated by those words in the prompt (separate from the raw token embedding nodes), "Dallas" represents the city entity, "Texas" contains intermediate features encoding "the state is Texas" (the first hop), "Say a capital" promotes capital-city tokens generally, and "Say Austin" contains late-layer features that promote "Austin" specifically (the second hop).
Let's compute a graph and build a visual representation of the circuit.
dallas_prompt = "Fact: the capital of the state containing Dallas is"
dallas_graph = circuit_tracer_attribute(dallas_prompt, replacement_model, verbose=True)
# Get activations (we'll need these for interventions later)
logits_dallas, dallas_activations = replacement_model.get_activations(dallas_prompt, sparse=True)
Defining supernodes
The graph on Neuronpedia lets you annotate groups of features into supernodes. We can extract these annotations from a Neuronpedia URL using utils.extract_supernode_features, or define them manually. Below, we define the key supernodes from the Dallas/Austin graph and arrange them into the circuit structure with parent-child causal relationships.
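As a mental model for what this extraction involves (the real utils.extract_supernode_features may differ in details, e.g. how it disambiguates duplicate group names), note that the supernodes query parameter in these Neuronpedia URLs is just URL-encoded JSON of ["group name", "layer_featureidx_pos", ...] groups, which the standard library can decode:

```python
import json
from typing import NamedTuple
from urllib.parse import parse_qs, urlparse


class Feature(NamedTuple):
    layer: int
    pos: int
    feature_idx: int


def extract_supernodes_sketch(url: str) -> dict[str, list[Feature]]:
    """Decode the `supernodes` query param: URL-encoded JSON of
    [["group name", "layer_featureidx_pos", ...], ...] groups."""
    groups = json.loads(parse_qs(urlparse(url).query)["supernodes"][0])
    out: dict[str, list[Feature]] = {}
    for name, *feature_strs in groups:
        out[name] = [
            Feature(layer=int(l), pos=int(p), feature_idx=int(f))
            for l, f, p in (s.split("_") for s in feature_strs)
        ]
    return out


# Toy URL with one two-feature group (same encoding as the real Neuronpedia links)
toy_url = "https://example.org/graph?supernodes=%5B%5B%22capital%22%2C%2215_4494_4%22%2C%226_4662_4%22%5D%5D"
print(extract_supernodes_sketch(toy_url))
# {'capital': [Feature(layer=15, pos=4, feature_idx=4494), Feature(layer=6, pos=4, feature_idx=4662)]}
```

Note the string format is layer_featureidx_pos, so "15_4494_4" is layer 15, feature 4494, at token position 4.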
The Supernode class stores a list of Feature namedtuples and an optional list of child supernodes. InterventionGraph wraps the layered node arrangement and records each node's baseline activations. create_graph_visualization renders the circuit as an inline SVG showing nodes, edges, activation percentages, and the model's top output predictions.
Feature = utils.Feature
# Extract supernodes from the annotated Neuronpedia URL
dallas_austin_url = "https://www.neuronpedia.org/gemma-2-2b/graph?slug=gemma-fact-dallas-austin&clerps=%5B%5D&pruningThreshold=0.53&pinnedIds=27_22605_10%2C20_15589_10%2CE_26865_9%2C21_5943_10%2C23_12237_10%2C20_15589_9%2C16_25_9%2C14_2268_9%2C18_8959_10%2C4_13154_9%2C7_6861_9%2C19_1445_10%2CE_2329_7%2CE_6037_4%2C0_13727_7%2C6_4012_7%2C17_7178_10%2C15_4494_4%2C6_4662_4%2C4_7671_4%2C3_13984_4%2C1_1000_4%2C19_7477_9%2C18_6101_10%2C16_4298_10%2C7_691_10&supernodes=%5B%5B%22capital%22%2C%2215_4494_4%22%2C%226_4662_4%22%2C%224_7671_4%22%2C%223_13984_4%22%2C%221_1000_4%22%5D%2C%5B%22state%22%2C%226_4012_7%22%2C%220_13727_7%22%5D%2C%5B%22Texas%22%2C%2220_15589_9%22%2C%2219_7477_9%22%2C%2216_25_9%22%2C%224_13154_9%22%2C%2214_2268_9%22%2C%227_6861_9%22%5D%2C%5B%22preposition+followed+by+place+name%22%2C%2219_1445_10%22%2C%2218_6101_10%22%5D%2C%5B%22capital+cities+%2F+say+a+capital+city%22%2C%2221_5943_10%22%2C%2217_7178_10%22%2C%227_691_10%22%2C%2216_4298_10%22%5D%5D"
supernode_features = utils.extract_supernode_features(dallas_austin_url)
for name, features in supernode_features.items():
    print(f" {name}: {len(features)} features")
Click to see the expected output
capital: 5 features
state: 2 features
Texas: 6 features
preposition followed by place name: 2 features
capital cities / say a capital city: 4 features
Supernode = utils.Supernode
# Build the circuit: output nodes first, then working backward
say_austin_node = Supernode(name="Say Austin", features=[Feature(layer=23, pos=10, feature_idx=12237)])
say_capital_node = Supernode(
    name="Say a capital",
    features=supernode_features["capital cities / say a capital city"]
    if "capital cities / say a capital city" in supernode_features
    else supernode_features.get("Say a capital", [Feature(layer=21, pos=10, feature_idx=5943)]),
    children=[say_austin_node],
)
texas_node = Supernode(
    name="Texas",
    features=supernode_features.get("Texas", [Feature(layer=20, pos=10, feature_idx=15589)]),
    children=[say_austin_node],
)
capital_node = Supernode(
    name="capital",
    features=supernode_features.get("capital", [Feature(layer=15, pos=4, feature_idx=4494)]),
    children=[say_capital_node],
)
state_node = Supernode(
    name="state",
    features=supernode_features.get("state", [Feature(layer=4, pos=9, feature_idx=13154)]),
    children=[say_capital_node, texas_node],
)
dallas_node = Supernode(
    name="Dallas",
    features=supernode_features.get("Dallas", [Feature(layer=14, pos=9, feature_idx=2268)]),
    children=[texas_node],
)
# Embedding nodes (no features - they're input nodes)
capital_emb_node = Supernode(name="capital (emb)", features=[], children=[capital_node])
state_emb_node = Supernode(name="state (emb)", features=[], children=[state_node])
# Build the intervention graph with layered node arrangement
ordered_nodes = [
    [capital_emb_node, state_emb_node],  # Layer 0: embeddings
    [capital_node, state_node, dallas_node],  # Layer 1: early features
    [say_capital_node, texas_node],  # Layer 2: intermediate
    [say_austin_node],  # Layer 3: output features
]
dallas_austin_graph = utils.InterventionGraph(ordered_nodes=ordered_nodes, prompt=dallas_prompt)
# Initialize each node with its baseline activations
for node in [capital_node, state_node, dallas_node, texas_node, say_capital_node, say_austin_node]:
    dallas_austin_graph.initialize_node(node, dallas_activations)
# Set activation fractions (current / default - all 100% since no intervention yet)
dallas_austin_graph.set_node_activation_fractions(dallas_activations)
# Get the top model predictions
top_outputs = utils.get_topk(logits_dallas, replacement_model.tokenizer)
# Visualize the baseline circuit
svg_obj = utils.create_graph_visualization(dallas_austin_graph, top_outputs)
display(svg_obj)
Click to see the expected output
You should see all supernodes at 100% activation, with "Austin" as the model's top prediction. The edges show the hypothesised causal flow: embeddings → early features → Texas/Say a capital → Say Austin → output.
Feature interventions
Now we'll test whether this circuit is actually causal by intervening on individual supernodes. The ReplacementModel provides two methods: model.feature_intervention(prompt, interventions) runs a single forward pass with feature overrides and returns (logits, activations), and model.feature_intervention_generate(prompt, interventions, ...) does multi-token generation with persistent interventions. Each intervention is a tuple (layer, position, feature_idx, new_value).
We'll use the helper function supernode_intervention below. It takes a list of Intervention namedtuples (each specifying a supernode and a scaling factor for its activation), runs the intervention, updates the graph's activation fractions, and renders the result. A scaling factor of -2 means "set activation to -2x the default" (i.e. strongly suppress), while +2 means "set activation to 2x the default".
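The tuple construction at the heart of this helper is simple; as a self-contained illustration (using a stand-in node class and made-up baseline values, not the real utils.Supernode):

```python
from collections import namedtuple

Feature = namedtuple("Feature", ["layer", "pos", "feature_idx"])
Intervention = namedtuple("Intervention", ["supernode", "scaling_factor"])


class DemoSupernode:
    """Stand-in for utils.Supernode: just features plus recorded baselines."""

    def __init__(self, features, default_activations):
        self.features = features
        self.default_activations = default_activations


node = DemoSupernode(
    features=[Feature(20, 10, 15589), Feature(19, 9, 7477)],
    default_activations=[3.0, 1.5],  # made-up baseline activations
)
inv = Intervention(node, -2)  # "set each feature to -2x its baseline"

# One (layer, pos, feature_idx, new_value) tuple per feature in the supernode
intervention_tuples = [
    (*feature, inv.scaling_factor * node.default_activations[i])
    for i, feature in enumerate(node.features)
]
print(intervention_tuples)  # [(20, 10, 15589, -6.0), (19, 9, 7477, -3.0)]
```

These tuples are exactly the format model.feature_intervention expects.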
Intervention = namedtuple("Intervention", ["supernode", "scaling_factor"])


def supernode_intervention(
    model: "ReplacementModel",
    intervention_graph: "utils.InterventionGraph",
    interventions: list[Intervention],
    replacements: dict[str, "utils.Supernode"] | None = None,
) -> tuple:
    """Perform interventions on supernodes, record the effects, and visualize the result.

    For each Intervention, sets the supernode's feature activations to
    ``scaling_factor * default_activation``. After running the forward pass, updates the
    InterventionGraph with new activation fractions and renders the circuit diagram.

    Args:
        model: The ReplacementModel.
        intervention_graph: The InterventionGraph to update and visualize.
        interventions: List of Intervention(supernode, scaling_factor) to apply.
        replacements: Optional dict mapping original node names to replacement Supernode
            objects (used for cross-prompt swaps to show the new node in the diagram).

    Returns:
        Tuple of (svg_obj, new_logits).
    """
    prompt = intervention_graph.prompt
    intervention_tuples = []
    for inv in interventions:
        node = inv.supernode
        for i, feature in enumerate(node.features):
            default_val = node.default_activations[i].item()
            intervention_tuples.append((*feature, inv.scaling_factor * default_val))

    with t.inference_mode():
        new_logits, new_activations = model.feature_intervention(prompt, intervention_tuples)

    # Reset graph state and update
    intervention_graph.set_node_activation_fractions(new_activations)

    # Mark which nodes were intervened on
    for inv in interventions:
        sign = "+" if inv.scaling_factor > 0 else ""
        inv.supernode.intervention = f"{sign}{inv.scaling_factor}x"

    # Handle replacement nodes for cross-prompt swaps
    if replacements:
        for original_name, replacement_node in replacements.items():
            original_node = intervention_graph.nodes.get(original_name)
            if original_node:
                original_node.replacement_node = replacement_node
            if replacement_node.features:
                if replacement_node.default_activations is not None:
                    replacement_node.activation = (
                        t.tensor([new_activations[f] for f in replacement_node.features])
                        / replacement_node.default_activations
                    ).mean().item()
                else:
                    replacement_node.activation = None

    top_outputs = utils.get_topk(new_logits, model.tokenizer)
    svg_obj = utils.create_graph_visualization(intervention_graph, top_outputs)
    display(svg_obj)
    return svg_obj, new_logits
Predict and test ablation effects
Before running any interventions, write down your predictions for each of these four ablations:
- Ablate "Say a capital" (-2x): What happens to "Say Austin"? What becomes the top output?
- Ablate "Texas" (-2x): What happens to "Say Austin"? What replaces Austin in the top outputs?
- Ablate "capital" (-2x): Which downstream nodes are affected?
- Ablate "state" (-2x): Does this change anything?
Think carefully about the circuit structure visible in the graph above, then run the two key ablations below and compare with your predictions:
# Compute original logits once (reused across all ablation cells)
with t.inference_mode():
    orig_logits, _ = replacement_model.feature_intervention(dallas_prompt, [])

# Ablation 1: Turn off "Say a capital"
print("=== Ablating 'Say a capital' ===")
svg_ablate_capital, logits_ablate_capital = supernode_intervention(
    replacement_model, dallas_austin_graph, [Intervention(say_capital_node, -2)]
)
utils.display_topk_token_predictions(
    dallas_prompt, orig_logits, logits_ablate_capital, replacement_model.tokenizer, k=10
)
Click to see the expected output
=== Ablating 'Say a capital' ===
# Ablation 2: Turn off "Texas"
print("=== Ablating 'Texas' ===")
svg_ablate_texas, logits_ablate_texas = supernode_intervention(
    replacement_model, dallas_austin_graph, [Intervention(texas_node, -2)]
)
utils.display_topk_token_predictions(
    dallas_prompt, orig_logits, logits_ablate_texas, replacement_model.tokenizer, k=10
)
Click to see the expected output
=== Ablating 'Texas' ===
Ablating "Say a capital" strongly suppresses the "Say Austin" node and changes the model's top prediction (e.g. to "Texas", since the model knows it's about Texas but has lost the "say a capital city" signal). Ablating "Texas" also shuts down "Say Austin"; the top outputs shift to other state capitals - Albany, Harrisburg, Hartford - since the "say a capital city" signal remains but the Texas geographic identity is gone, so the model falls back to capitals of other states.
Try the other two ablations yourself using the code cell below. Ablating "capital" knocks out "Say a capital" (and partially suppresses "Say Austin"), causing the model to output "Texas" as its top prediction (it still knows the geographic context but loses the capital-recall signal). Ablating "state" is largely ineffective - Austin remains the top prediction at ~38%, barely changed from the unablated ~41%. This suggests the "state" features play a supporting but non-essential role; the Texas features and capital features together are sufficient to route the output to Austin even without them.
For more experiments like this, see the circuit-tracing tutorial notebook.
# Try ablating "capital" and "state" here, e.g.:
# supernode_intervention(replacement_model, dallas_austin_graph, [Intervention(capital_node, -2)])
# supernode_intervention(replacement_model, dallas_austin_graph, [Intervention(state_node, -2)])
pass
Click to see the expected output
=== Ablating 'capital' ===
=== Ablating 'state' ===
Solution
print("=== Ablating 'capital' ===")
svg_ablate_capital_node, logits_ablate_capital_node = supernode_intervention(
    replacement_model, dallas_austin_graph, [Intervention(capital_node, -2)]
)
utils.display_topk_token_predictions(
    dallas_prompt, orig_logits, logits_ablate_capital_node, replacement_model.tokenizer, k=10
)

print("=== Ablating 'state' ===")
svg_ablate_state_node, logits_ablate_state_node = supernode_intervention(
    replacement_model, dallas_austin_graph, [Intervention(state_node, -2)]
)
utils.display_topk_token_predictions(
    dallas_prompt, orig_logits, logits_ablate_state_node, replacement_model.tokenizer, k=10
)
Exercise - cross-prompt feature swapping
Cross-prompt feature swapping is a strong test of circuit understanding. If the "Texas" features truly encode "the state is Texas", then turning them off and turning on "California" features (from an Oakland prompt) should make the model predict "Sacramento" instead of "Austin".
Implement the function below. It should get activations from the swap_prompt, build interventions that zero out features_off and activate features_on at their values from the swap prompt (scaled by scale), then run the intervention and return both original and modified logits.
def cross_prompt_swap(
    model: "ReplacementModel",
    base_prompt: str,
    swap_prompt: str,
    features_off: list,
    features_on: list,
    scale: float = 2.0,
) -> tuple[Tensor, Tensor]:
    """Swap features between prompts: turn off features_off and activate features_on
    at their values from swap_prompt (scaled by `scale`).

    Args:
        model: The ReplacementModel.
        base_prompt: The prompt to intervene on.
        swap_prompt: The prompt to get replacement activation values from.
        features_off: Features to zero-ablate (from base_prompt's graph).
        features_on: Features to activate (from swap_prompt's graph).
        scale: Multiplier for the replacement activation values.

    Returns:
        original_logits, modified_logits
    """
    raise NotImplementedError()
# Test cross_prompt_swap: swap Texas features for California features and check predictions shift
texas_features = [(f.layer, f.pos, f.feature_idx) for f in texas_node.features]
# Use a default California feature (layer 19, pos 10, feature 9209) if oakland_supernodes aren't loaded yet
oakland_swap_prompt = "Fact: the capital of the state containing Oakland is"
california_features_default = [(19, 10, 9209)]
original_logits, modified_logits = cross_prompt_swap(
    model=replacement_model,
    base_prompt=dallas_prompt,
    swap_prompt=oakland_swap_prompt,
    features_off=texas_features,
    features_on=california_features_default,
    scale=2.0,
)
orig_token = replacement_model.tokenizer.decode(original_logits[0, -1].argmax().item())
mod_token = replacement_model.tokenizer.decode(modified_logits[0, -1].argmax().item())
print(f"cross_prompt_swap test: original top prediction = {orig_token!r}, modified = {mod_token!r}")
Click to see the expected output
cross_prompt_swap test: original top prediction = ' Austin', modified = ' San'
Solution
def cross_prompt_swap(
    model: "ReplacementModel",
    base_prompt: str,
    swap_prompt: str,
    features_off: list,
    features_on: list,
    scale: float = 2.0,
) -> tuple[Tensor, Tensor]:
    """Swap features between prompts: turn off features_off and activate features_on
    at their values from swap_prompt (scaled by `scale`).

    Args:
        model: The ReplacementModel.
        base_prompt: The prompt to intervene on.
        swap_prompt: The prompt to get replacement activation values from.
        features_off: Features to zero-ablate (from base_prompt's graph).
        features_on: Features to activate (from swap_prompt's graph).
        scale: Multiplier for the replacement activation values.

    Returns:
        original_logits, modified_logits
    """
    _, swap_activations = model.get_activations(swap_prompt, sparse=True)
    interventions = [(*f, 0.0) for f in features_off]
    interventions += [(*f, scale * swap_activations[f]) for f in features_on]
    with t.inference_mode():
        original_logits, _ = model.feature_intervention(base_prompt, [])
        modified_logits, _ = model.feature_intervention(base_prompt, interventions)
    return original_logits, modified_logits
Now let's use this to swap Texas for California. We extract the California supernodes from the Oakland/Sacramento Neuronpedia graph, define the swap supernodes, and use supernode_intervention to visualize the result.
oakland_prompt = "Fact: the capital of the state containing Oakland is"
_, oakland_activations = replacement_model.get_activations(oakland_prompt, sparse=True)
# Extract California supernodes from the Oakland/Sacramento Neuronpedia graph
oakland_url = "https://www.neuronpedia.org/gemma-2-2b/graph?slug=gemma-fact-oakland-sacramento&clerps=%5B%5D&pruningThreshold=0.5&pinnedIds=27_43939_10%2CE_49024_9%2C21_5943_10%2C19_9209_10%2C18_8959_10%2C14_12562_9%2C7_14530_9%2C8_14641_9%2C4_8625_9%2C19_9209_9%2C17_7178_10%2CE_6037_4%2C15_4494_4%2CE_2329_7%2C16_4298_10%2C7_691_10%2C6_4662_4%2C4_7671_4%2C2_8734_7%2C0_13727_7%2C3_13984_4%2C1_1000_4%2C6_4012_7%2C4_13154_9%2C7_6861_9&supernodes=%5B%5B%22California%22%2C%2219_9209_10%22%2C%2218_8959_10%22%2C%2219_9209_9%22%2C%2217_7178_10%22%2C%2216_4298_10%22%2C%227_691_10%22%5D%2C%5B%22Say+Sacramento%22%2C%2227_43939_10%22%5D%5D"
oakland_supernodes = utils.extract_supernode_features(oakland_url)
# say_sacramento_node uses a real transcoder feature (layer 19, feature 9209) - this
# California-region feature at pos 10 acts as the "Say Sacramento" circuit node.
# (The Neuronpedia URL uses layer=27 for logit output nodes, which are not SAE features
# and cannot be used as intervention targets; layer 19 is the correct transcoder layer.)
say_sacramento_node = Supernode(
    name="Say Sacramento",
    features=[Feature(layer=19, pos=10, feature_idx=9209)],
)
california_node = Supernode(
    name="California",
    features=oakland_supernodes.get("California", []) + oakland_supernodes.get("California (2)", []),
    children=[say_sacramento_node],
)
# Initialize the California and Sacramento nodes with Oakland activations
dallas_austin_graph.initialize_node(california_node, oakland_activations)
dallas_austin_graph.initialize_node(say_sacramento_node, oakland_activations)
# Run the swap: turn off Texas, turn on California
oakland_interventions = [Intervention(texas_node, -2), Intervention(california_node, 2)]
svg_oakland_swap, logits_oakland_swap = supernode_intervention(
    replacement_model,
    dallas_austin_graph,
    oakland_interventions,
    replacements={texas_node.name: california_node, say_austin_node.name: say_sacramento_node},
)
utils.display_topk_token_predictions(
    dallas_prompt, orig_logits, logits_oakland_swap, replacement_model.tokenizer, k=10
)
Click to see the expected output
The "Texas" node is suppressed and "California" activated in its place (shown with a replacement node in the diagram). The "Say Austin" node shuts off entirely (0%), and Sacramento becomes the model's top prediction at ~54% - completely displacing Austin. The "Say Sacramento" node activates via the circuit, confirming that the California features at this position do route through that node even in the Dallas prompt context.
This supports the interpretation that the Texas features encode "the state is Texas" as a compositional concept: swap them out for another state's features and the model's output distribution shifts toward that state's capital region.
The circuit also generalises beyond US states. Try swapping the Texas features for China features using the prompt "Fact: the capital of the country containing Shanghai is".
We've set up the Shanghai supernodes below (extracted from the Shanghai/Beijing Neuronpedia graph). Use supernode_intervention to run the swap and visualize the result. You'll need to get activations for the Shanghai prompt, initialize the China and Say Beijing nodes, create interventions to turn off Texas and turn on China, and call supernode_intervention with the appropriate replacements.
shanghai_prompt = "Fact: the capital of the country containing Shanghai is"
_, shanghai_activations = replacement_model.get_activations(shanghai_prompt, sparse=True)
shanghai_url = "https://www.neuronpedia.org/gemma-2-2b/graph?slug=gemma-fact-shanghai-beijing&clerps=%5B%5D&clickedId=15_4494_4&pruningThreshold=0.45&pinnedIds=27_33395_10%2CE_38628_9%2C21_5943_10%2C19_12274_10%2C19_12274_9%2C14_12274_9%2C18_6101_10%2C17_7178_10%2C6_6811_9%2C4_4257_9%2C4_11570_9%2CE_6037_4%2C0_8885_4%2C18_7639_10%2C19_2695_10%2C16_4298_10%2C15_4494_4%2C6_4662_4&supernodes=%5B%5B%22China%22%2C%2219_12274_9%22%2C%2214_12274_9%22%2C%226_6811_9%22%2C%224_11570_9%22%2C%224_4257_9%22%5D%2C%5B%22China%22%2C%2219_12274_10%22%2C%2218_7639_10%22%5D%2C%5B%22capital%22%2C%2216_4298_10%22%2C%2217_7178_10%22%2C%2218_6101_10%22%2C%2219_2695_10%22%2C%2221_5943_10%22%5D%2C%5B%22capital+cities+%28say+city%29%22%2C%226_4662_4%22%2C%2215_4494_4%22%2C%220_8885_4%22%5D%5D"
shanghai_supernodes = utils.extract_supernode_features(shanghai_url)
china_node = Supernode(
    name="China",
    features=shanghai_supernodes.get("China", []) + shanghai_supernodes.get("China (2)", []),
    children=[say_austin_node],
)
# Note: unlike the Oakland swap, there is no dedicated "Say Beijing" SAE feature node;
# the effect of the China swap is visible in the logit outputs rather than a replacement node.
# TODO: Initialize china_node with shanghai_activations,
# then run supernode_intervention to swap Texas -> China and visualize the result.
# Hint: use the same pattern as the Oakland/Sacramento swap above (no replacement for Say Austin).
Click to see the expected output
Solution
dallas_austin_graph.initialize_node(china_node, shanghai_activations)
shanghai_interventions = [Intervention(texas_node, -2), Intervention(china_node, 2)]
svg_shanghai_swap, logits_shanghai_swap = supernode_intervention(
    replacement_model,
    dallas_austin_graph,
    shanghai_interventions,
    replacements={texas_node.name: china_node},
)
utils.display_topk_token_predictions(
    dallas_prompt, orig_logits, logits_shanghai_swap, replacement_model.tokenizer, k=10
)
The same circuit structure generalises beyond US states. Swapping the Texas features for China features makes Beijing the model's top prediction at ~23%, with other Chinese cities (Nanjing, Shanghai, Hangzhou) also appearing in the top-10. This closely mirrors the Oakland/Sacramento result, confirming that the "Texas" position in the circuit acts as a general "geographic entity" slot that controls which capital city the model outputs.
Open-ended generation with interventions
So far we've only looked at how interventions change the model's next-token prediction. But we can also sustain interventions during multi-token generation by replacing the fixed position index with an open-ended slice. This lets us see how the intervention affects the model's complete generated output rather than just one token.
Exercise - sustained interventions during generation
Implement a function that converts fixed-position interventions into open-ended slices and generates text with and without the intervention. The key point is that an intervention tuple (layer, pos, feat_idx, value) should have its pos replaced with slice(seq_len - 1, None, None) to persist across all generated positions.
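The position rewrite itself needs no model; as a standalone sketch of just the conversion step (the surrounding generation calls are the exercise):

```python
def to_open_ended(interventions: list[tuple], seq_len: int) -> list[tuple]:
    """Replace each fixed token position with a slice that covers the final
    prompt position and every position generated after it."""
    return [
        (layer, slice(seq_len - 1, None, None), feat_idx, value)
        for layer, _pos, feat_idx, value in interventions
    ]


fixed = [(20, 10, 15589, 0.0), (19, 9, 7477, 0.0)]
open_ended = to_open_ended(fixed, seq_len=11)
print(open_ended[0])  # (20, slice(10, None, None), 15589, 0.0)
```

With an 11-token prompt, slice(10, None, None) covers position 10 (the last prompt token) plus every token appended during generation, so the intervention stays active throughout.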
Once you've implemented this, try some of the bonus explorations at the end of this section for ideas on what circuits to study with sustained interventions.
def generate_with_intervention(
    model: "ReplacementModel",
    prompt: str,
    interventions: list,
    max_new_tokens: int = 20,
) -> tuple[str, str]:
    """Generate text with and without feature interventions.

    Converts fixed-position interventions to open-ended slices so the intervention persists
    across all generated tokens.

    Args:
        model: The ReplacementModel.
        prompt: The input prompt.
        interventions: List of (layer, pos, feat_idx, value) tuples.
        max_new_tokens: Maximum tokens to generate.

    Returns:
        pre_text: Generated text without intervention.
        post_text: Generated text with intervention.
    """
    # YOUR CODE HERE
    # 1. Compute the input sequence length using model.tokenizer
    # 2. Convert each intervention's position to slice(seq_len - 1, None, None)
    # 3. Call model.feature_intervention_generate with and without interventions
    raise NotImplementedError()
# Build interventions: suppress Texas features at -2x their default activations
# (matching the supernode_intervention convention)
texas_ablation_tuples = [
    (*f, -2.0 * texas_node.default_activations[i].item()) for i, f in enumerate(texas_node.features)
]
pre_text, post_text = generate_with_intervention(
    replacement_model, dallas_prompt, texas_ablation_tuples, max_new_tokens=15
)
print(f"Without intervention:\n {pre_text}")
print(f"\nWith Texas ablation:\n {post_text}")

# Also show the token probabilities as a rich HTML table
with t.inference_mode():
    abl_logits, _ = replacement_model.feature_intervention(dallas_prompt, texas_ablation_tuples)
utils.display_topk_token_predictions(dallas_prompt, orig_logits, abl_logits, replacement_model.tokenizer, k=10)
Click to see the expected output
Without intervention:
 Fact: the capital of the state containing Dallas is Austin. Fact: the capital of the state containing Dallas is Austin.

With Texas ablation:
 Fact: the capital of the state containing Dallas is the 41st largest in the country. Fact: the population
Solution
def generate_with_intervention(
    model: "ReplacementModel",
    prompt: str,
    interventions: list,
    max_new_tokens: int = 20,
) -> tuple[str, str]:
    """Generate text with and without feature interventions.

    Converts fixed-position interventions to open-ended slices so the intervention persists
    across all generated tokens.

    Args:
        model: The ReplacementModel.
        prompt: The input prompt.
        interventions: List of (layer, pos, feat_idx, value) tuples.
        max_new_tokens: Maximum tokens to generate.

    Returns:
        pre_text: Generated text without intervention.
        post_text: Generated text with intervention.
    """
    seq_len = len(model.tokenizer(prompt).input_ids)
    open_interventions = []
    for layer, pos, feat_idx, value in interventions:
        open_pos = slice(seq_len - 1, None, None)
        open_interventions.append((layer, open_pos, feat_idx, value))
    pre_text = model.feature_intervention_generate(
        prompt, [], do_sample=False, verbose=False, max_new_tokens=max_new_tokens
    )[0]
    post_text = model.feature_intervention_generate(
        prompt, open_interventions, do_sample=False, verbose=False, max_new_tokens=max_new_tokens
    )[0]
    return pre_text, post_text
Bonus explorations
Here are some further experiments you can try, drawing on ideas from the circuit-tracer demo notebooks:
- Graph quality scores: Use from circuit_tracer.graph import compute_graph_scores to evaluate the Dallas graph's replacement score and completeness score. How do these change as you vary the pruning thresholds?
- Multilingual circuits: The circuit-tracing tutorial studies a multilingual "opposite" circuit for the prompt 'Le contraire de "petit" est "' (French for "the opposite of small is"). Try extracting supernodes for this circuit and swapping the French language features for Chinese ones - the model should switch to producing Chinese output. You can also try swapping language features for Michael Jordan-related prompts (e.g. swapping Spanish features for French ones to change the output language).
- Instruction-following circuits: The Gemma IT demo explores a "pirate-speak injection" circuit: by turning on features that represent pirate-themed language, you can steer the model's responses. Try replicating this, or finding analogous style-injection features (e.g. formal vs informal language).
- Rhyming and phonetic circuits: The same demo also explores how rhyming works (e.g. finding that "rabbit" and "habit" share rhyming features). Try finding phonetic features in your circuits and testing whether swapping them changes rhyming behaviour.
- Your own circuits: Try the full pipeline on prompts of your choosing and see what circuits emerge. The circuit_tracer_attribute function works on any prompt. Good starting points include factual recall prompts (e.g. "The Eiffel Tower is in"), analogies, or simple reasoning tasks.