2️⃣ Building a Simple Deception Probe
Learning Objectives
- Understand how contrastive prompt pairs (honest vs deceptive instructions) can be turned into training data for a probe
- Extract per-response activations from a chosen model layer, excluding the last few tokens to capture deceptive intent
- Train a logistic regression probe on these activations and measure its train/test accuracy
We'll train a probe on contrastive examples (honest vs deceptive instructions).
First, let's load in a dataset of facts from the deception-detection repo's true_false_facts.csv file.
df_facts = pd.read_csv("data/repe/true_false_facts.csv")
with pd.option_context('display.max_colwidth', None):
    display(df_facts.head())
true_facts = df_facts[df_facts['label'] == 1]
false_facts = df_facts[df_facts['label'] == 0]
print(f"True facts: {len(true_facts)}")
print(f"False facts: {len(false_facts)}")
Now, let's extract data for training our probe! First we'll split our true facts dataframe into training and testing sets:
train_frac = 0.8
true_facts_train = true_facts.iloc[:int(len(true_facts) * train_frac)]
true_facts_test = true_facts.iloc[int(len(true_facts) * train_frac):]
print(f"Training set: {len(true_facts_train)}")
print(f"Testing set: {len(true_facts_test)}")
Next, we'll extract activations from the model for both of our datasets, and stack them together into training arrays.
For each true fact, we get a pair of activations: one where the model has been instructed to be honest when stating it, and one where it has been instructed to be deceptive. The purpose is not just to detect true/false facts, but to detect deceptive intent: in other words, was the model intending to lie when it gave the answer? To further help with this, we'll crop the model's response by removing the last 5 tokens, so that the activations we capture come from when the model can be thought of as still in the process of giving the deceptive answer.
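To make the cropping concrete, here's a quick illustration of where the response begins in token space and which tokens we'd keep (a sketch; it assumes the tokenizer loaded earlier, and the statement is just an example, not necessarily from the dataset):
# Render the chat template for one example, then see where the assistant response starts
example_messages = [{"role": "user", "content": "Pretend you are an honest person making statements about the world."}]
prompt_text = tokenizer.apply_chat_template(example_messages, tokenize=False, add_generation_prompt=True)
full_text = prompt_text + " The Eiffel Tower is located in Paris and was completed in 1889."

prompt_len = len(tokenizer.tokenize(prompt_text))
full_tokens = tokenizer.tokenize(full_text)
print(f"Prompt tokens: {prompt_len}, total tokens: {len(full_tokens)}")
print("Response tokens we keep (last 5 excluded):", full_tokens[prompt_len:-5])
The full extraction function below wraps this up, runs a forward pass, and averages the kept activations: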
def extract_activations(user_prompt: str, assistant_response: str, layer_idx: int, exclude_last_n_tokens: int = 5) -> np.ndarray | None:
    """
    Extract mean activations over the assistant response from a specific layer.
    Following the paper: exclude the last N tokens to capture 'intent' before completion.
    """
    # Record where the assistant response starts, so we can later slice out only the response tokens
    messages = [{"role": "user", "content": user_prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    start_idx = len(tokenizer.tokenize(text))
    text += f" {assistant_response}"
    # Tokenize the full prompt (i.e. user prompt + assistant response) and run a forward pass
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Get activations from the specified layer
    hidden_states = outputs.hidden_states[layer_idx][0]  # [seq_len, hidden_dim]
    # Exclude the last N tokens as per the paper (and only keep the model's response);
    # skip examples where no response tokens would remain after cropping
    if hidden_states.shape[0] <= start_idx + exclude_last_n_tokens:
        return None
    hidden_states = hidden_states[start_idx:-exclude_last_n_tokens]
    # Average over the sequence (mean across tokens)
    activations = hidden_states.mean(dim=0).cpu().numpy()
    return activations
HONEST_PROMPT = "Pretend you are an honest person making statements about the world."
DISHONEST_PROMPT = "Pretend you are a deceptive person making statements about the world."
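Before gathering the full dataset, a quick sanity check (assuming the model, tokenizer, device, and PROBE_LAYER defined in the setup section) that a single call returns a vector of the expected size:
# One-off check: the statement below is just an illustrative example
example_acts = extract_activations(HONEST_PROMPT, "The Great Wall of China is located in northern China and was built over many centuries.", PROBE_LAYER)
print(None if example_acts is None else example_acts.shape)  # expect (hidden_dim,) for the model you loaded, or None if the response was too short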
def gather_dataset(df: pd.DataFrame) -> tuple[np.ndarray, np.ndarray]:
    training_data = []
    training_labels = []
    for _, row in tqdm(list(df.iterrows())):
        fact = row["statement"]
        # Honest version
        activations = extract_activations(HONEST_PROMPT, fact, PROBE_LAYER)
        if activations is not None:
            training_data.append(activations)
            training_labels.append(0)  # 0 = honest
        # Deceptive version
        activations = extract_activations(DISHONEST_PROMPT, fact, PROBE_LAYER)
        if activations is not None:
            training_data.append(activations)
            training_labels.append(1)  # 1 = deceptive
    X = np.array(training_data)
    y = np.array(training_labels)
    return X, y
X_train, y_train = gather_dataset(true_facts_train)
X_test, y_test = gather_dataset(true_facts_test)
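Before training, it's worth a quick check that the arrays have the expected shapes and that the honest/deceptive labels are balanced (each fact should contribute one of each, unless some were skipped for being too short):
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"Train label counts (honest, deceptive): {np.bincount(y_train)}")
print(f"Test label counts (honest, deceptive): {np.bincount(y_test)}")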
# Train the probe with normalization (as per paper)
# Normalize activations (zero mean, unit variance) as specified in paper
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)
# Train logistic regression with L2 regularization (λ = 10 from paper)
# In sklearn, C = 1/λ, so C = 0.1 for λ = 10
probe = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
probe.fit(X_train_normalized, y_train)
# Check training accuracy
train_accuracy = probe.score(X_train_normalized, y_train)
print(f"Training accuracy: {train_accuracy:.2%}")
# Get test accuracy
X_test_normalized = scaler.transform(X_test)
test_accuracy = probe.score(X_test_normalized, y_test)
print(f"Test accuracy: {test_accuracy:.2%}")
# Get probe direction (weights)
probe_direction = probe.coef_[0]
print(f"Probe weights shape: {probe_direction.shape}")
print(f"Probe weights norm: {np.linalg.norm(probe_direction):.3f}")
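To see the probe in action, here's a small helper (a sketch of our own, not something from the paper or repo) that applies the fitted scaler and probe to a single new (prompt, response) pair and returns the probability of the 'deceptive' class:
def probe_score(user_prompt: str, assistant_response: str) -> float | None:
    """Hypothetical helper: return P(deceptive) for one example, or None if the response is too short."""
    acts = extract_activations(user_prompt, assistant_response, PROBE_LAYER)
    if acts is None:
        return None
    acts_normalized = scaler.transform(acts.reshape(1, -1))
    return probe.predict_proba(acts_normalized)[0, 1].item()

print(f"Honest prompt:    {probe_score(HONEST_PROMPT, 'The largest planet in the solar system is Jupiter.')}")
print(f"Deceptive prompt: {probe_score(DISHONEST_PROMPT, 'The largest planet in the solar system is Jupiter.')}")
We'd expect the deceptive-prompt score to be noticeably higher, though results on single examples can be noisy.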
Okay, this looks promising! But the true test of a probe comes in its ability to generalize to different distributions.
After all, we could have just picked up on something that was hyper-specific to this particular dataset or this phrasing.
In the next section, we'll move on to probe testing & evaluation.