First Unique Character (August 2023)

Colab: problem | solutions

This post is the second in the sequence of monthly mechanistic interpretability challenges.

Difficulty

This problem is a step up in difficulty from the July problem. The algorithmic problem is of a similar flavour, and the model architecture is very similar (the main difference is that this model has 3 attention heads per layer, instead of 2).

Task & Dataset

The algorithmic task is as follows: the model is presented with a sequence of characters, and at each position it has to identify the first character in the sequence so far (up to and including the current character) which has appeared exactly once up to that point.

The null character "?" has two purposes:

  • In the input, it's used as the start character (because it's often helpful for interp to have a constant start character, to act as a "rest position").
  • In the output, it's also used as the start character, and to represent the classification "no unique character exists".

Here are two examples:

Seq = ?acbba, Target = ?aaaac
Seq = ?cbcbc, Target = ?ccb??
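
To make the labelling rule concrete, here is a minimal sketch of how the targets could be computed from a sequence. It is not the dataset code used for the challenge; the function name first_unique_targets is hypothetical, and it assumes that the leading "?" is never itself a candidate for the "first unique character".

```python
from collections import Counter

NULL = "?"  # start token, also used for "no unique character exists"


def first_unique_targets(seq: str) -> str:
    """Return the target string for `seq` (hypothetical helper, not the challenge's code).

    Assumes `seq` starts with the null character, which is not counted as a candidate.
    """
    assert seq.startswith(NULL), "sequence is expected to start with the null character"
    counts = Counter()  # how many times each character has appeared so far
    order = []          # distinct characters, in order of first appearance
    targets = [NULL]    # the target at the start position is the null character
    for ch in seq[1:]:
        if counts[ch] == 0:
            order.append(ch)
        counts[ch] += 1
        # First character (by first appearance) seen exactly once so far, else null.
        targets.append(next((c for c in order if counts[c] == 1), NULL))
    return "".join(targets)


print(first_unique_targets("?acbba"))
print(first_unique_targets("?cbcbc"))
```

Run on the two sequences above, this reproduces the targets ?aaaac and ?ccb??. Note that a character can stop being the answer once it repeats, which is why the target moves from a to c at the last position of the first example.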

Model

The model is a 2-layer transformer with 3 attention heads per layer, and causal attention. It includes layernorm, but no MLP layers.