Palindromes (July 2023)

Colab: problem | solutions

This is the first in what will hopefully be a sequence of monthly mechanistic interpretability challenges. I designed them in the spirit of Stephen Casper's challenges, but with the more specific aims of fitting in with the rest of the ARENA material and helping people put into practice what they've learned so far.

Task & Dataset

Each sequence in the dataset looks like:

[start_token, a_1, a_2, ..., a_N, end_token]

where start_token = 31, end_token = 32, and each a_i is an integer in the range [0, 30] inclusive.

Each sequence has a corresponding label, which is 1 if the sequence is a palindrome (i.e. (a_1, a_2, ..., a_N) == (a_N, ..., a_2, a_1)), and 0 otherwise. The model has been trained to classify each sequence according to this label.
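The labelling rule above can be sketched in plain Python. This is only an illustration of the task definition, not the code used by PalindromeDataset; the helper names is_palindrome and make_sequence are hypothetical, and the sketch assumes an even N for simplicity.

```python
import random

START_TOKEN, END_TOKEN = 31, 32  # values given in the text
MAX_VALUE = 30                   # each a_i lies in [0, 30] inclusive


def is_palindrome(seq):
    """Return 1 if the inner tokens read the same in both directions, else 0."""
    inner = seq[1:-1]  # strip start_token and end_token
    return int(inner == inner[::-1])


def make_sequence(n, palindrome, rng):
    """Build [start_token, a_1, ..., a_n, end_token] (n assumed even here).

    With palindrome=False the second half is random, so the result is
    almost always (but not guaranteed to be) a non-palindrome; use
    is_palindrome to get the true label.
    """
    half = [rng.randint(0, MAX_VALUE) for _ in range(n // 2)]
    if palindrome:
        inner = half + half[::-1]
    else:
        inner = half + [rng.randint(0, MAX_VALUE) for _ in range(n // 2)]
    return [START_TOKEN] + inner + [END_TOKEN]


rng = random.Random(0)
seq = make_sequence(6, palindrome=True, rng=rng)
print(seq, is_palindrome(seq))
```

For instance, [31, 3, 7, 7, 3, 32] gets label 1, while [31, 3, 7, 8, 3, 32] gets label 0 because the third and fourth inner tokens differ.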

Model

The model is a 2-layer transformer with 2 attention heads, and causal attention. It includes layernorm, but no MLP layers.
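Causal attention means each position can attend only to itself and earlier positions, so the final position (the end_token) is the only one that can see the whole sequence. A minimal pure-Python sketch of such a mask follows; the helper name causal_mask and the length of 6 are illustrative choices, not taken from the model code.

```python
def causal_mask(seq_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


# Rows are query positions, columns are key positions.
mask = causal_mask(6)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has one more allowed position than the row above it, and only the last row is fully filled in.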

Setup

import os
import sys
import torch as t
from pathlib import Path

# Make sure exercises are in the path
chapter = r"chapter1_transformer_interp"
exercises_dir = Path(f"{os.getcwd().split(chapter)[0]}/{chapter}/exercises").resolve()
section_dir = exercises_dir / "monthly_algorithmic_problems" / "july23_palindromes"
if str(exercises_dir) not in sys.path: sys.path.append(str(exercises_dir))

from monthly_algorithmic_problems.july23_palindromes.dataset import PalindromeDataset, display_seq
from monthly_algorithmic_problems.july23_palindromes.model import create_model
from plotly_utils import hist, bar, imshow

# Prefer Apple MPS, then CUDA, and fall back to CPU
device = t.device(
    "mps" if t.backends.mps.is_available()
    else "cuda" if t.cuda.is_available()
    else "cpu"
)