☆ Bonus

Extensions of the Function Vectors Paper

There are two other interesting results from the paper, although neither of them is as important as the ones we've covered so far. If you have time, you can try to reproduce these results yourself.

The Decoded Vocabulary of Function Vectors (3.2)

In this section, the authors find the top words in the decoded vocabulary of the function vector (i.e. the words whose unembedding vectors have the highest dot product with the function vector), and show that these words seem conceptually related to the task. For example:

  • For the antonyms task, the top words evoke the idea of antonyms, e.g. " negate", " counterpart", " lesser".
  • For the country-capitals task, the top words are actually the names of capitals, e.g. " Moscow", " Paris", " Madrid".

Can you replicate these results, both with the antonyms task and with the task you chose in the previous section?

An interesting extension - what happens if you take a task like the Country-Capitals task (which is inherently asymmetric), and get your function vector from the symmetric version of the task (i.e. the one where each of your question-answer pairs might be flipped around)? Do you still get the same behavioural results, and how (if at all) do the decoded vocabulary results change?
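One possible way to construct the symmetric dataset, as a minimal sketch (the country_capital_pairs list and its exact format are assumptions based on the dataset from the earlier exercises):

import random

# Assume country_capital_pairs is a list of (country, capital) word pairs, as in the
# earlier exercises. In the bidirectional version each pair is randomly flipped, so the
# underlying task becomes "name the other member of the pair" rather than country -> capital.
bidirectional_pairs = [
    (capital, country) if random.random() < 0.5 else (country, capital)
    for country, capital in country_capital_pairs
]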

# YOUR CODE HERE - find the decoded vocabulary
My results for this (spoiler!)

In the Country-Capitals task, I found:

The bidirectional task does still work to induce behavioural changes, although slightly less effectively than the function vector from the original task. The top decoded vocabulary items are still mostly capital cities, with a few other major cities mixed in.
Top logits:
' London'
' Moscow'
' Madrid'
' Budapest'
' Athens'
' Paris'
' Berlin'
' Bangkok'
' Istanbul'
' Montreal'
' Barcelona'
' Jerusalem'
' Seoul'
' Miami'
' Dublin'
' Atlanta'
' Copenhagen'
' Mumbai'
' Minneapolis'
' Beijing'
Solution
# Code to calculate decoded vocabulary:

# Project the function vector through the unembedding matrix (lm_head) to get a logit
# for every token in the vocabulary, then decode the top 20 tokens
logits = model._model.lm_head(fn_vector)
top_token_ids = logits.topk(20).indices.tolist()
tokens = model.tokenizer.batch_decode(top_token_ids)
print("Top logits:\n" + "\n".join(map(repr, tokens)))

Vector Algebra on Function Vectors (3.3)

In this section, the authors investigate whether function vectors can be composed. For instance, if we have three separate ICL tasks which in some sense compose to make a fourth task, can we add together the function vectors of the first three tasks and use the sum as the function vector for the fourth task?

The authors test this on a variety of different tasks. They find that composition is effective on some tasks (e.g. Country-Capitals, where the summed vector outperforms the function vector extracted directly from the task), but that composed vectors generally aren't as effective as function vectors extracted directly from the task in question. Do you get the same results?
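A minimal sketch of the composition step (here fn_vector_A, fn_vector_B and fn_vector_C are illustrative names for function vectors you've already extracted for three tasks which compose into a fourth):

# The paper's "vector algebra" is elementwise addition of the function vectors,
# each of shape (d_model,)
fn_vector_composed = fn_vector_A + fn_vector_B + fn_vector_C

# Reuse your intervention code from the earlier exercises: add fn_vector_composed to the
# residual stream on zero-shot prompts from the fourth task, and compare its accuracy with
# that of the function vector extracted directly from the fourth task.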

Extensions of the Steering Vectors Post

We only implemented one small subset of the results from the steering vectors post (and did it in a fairly slap-dash way). But there are many others you can play around with. For example:

  • The authors note that they were unsuccessful in finding a "speak in French" vector. One of the top comments on the LessWrong post describes a process they used to create a French vector which happened to work (link to comment here). Can you replicate their results? (They also linked a Colab in this comment, which can help if you're stuck.)
  • In a later section of the post, the authors extensively discuss perplexity (the exponential of the model's average cross-entropy loss). They find that the "weddings" vector reduces perplexity on wedding-related sentences while leaving perplexity on unrelated sentences roughly unchanged. Can you replicate their results - in particular, their graph of perplexity ratios against injection layers for wedding-related and non-wedding-related sentences? (A sketch of one way to compute perplexity is given after this list.)
  • The authors wrote up the post into a full paper, which you can find here. Can you replicate some of the extra results in this paper?
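As a rough sketch of one way to compute perplexity (this assumes a standard HuggingFace-style causal LM interface, not the authors' exact setup; in practice you'd compute it with and without the steering vector injected at each layer and plot the ratio):

import torch

def perplexity(model, tokenizer, text: str) -> float:
    # Perplexity = exp(mean cross-entropy of the model's next-token predictions)
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean next-token cross-entropy
    return torch.exp(loss).item()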

Suggested paper replications

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

In this paper, the authors focus on inducing the behavioural change of "making the model tell the truth". They also look at other techniques for finding an intervention vector which don't just use the model's forward passes directly (e.g. CCS and linear probing), although they conclude that forward-pass-based methods similar to the ones we've been using so far work best.

This might be a good replication for you if:

  • You enjoyed the exercises in this section, but are also interested in experimenting with techniques which weren't covered here (e.g. linear probing),
  • You're comfortable working with very large models, possibly via the nnsight library,
  • You're interested in studying model truthfulness.

Steering Llama 2 via Contrastive Activation Addition

This paper can be thought of as an extension of the GPT2-XL steering vector work to larger models, specifically Llama 2 13B. It also takes more of a high-level evals approach, measuring the model's change in attributes such as sycophancy, myopia, and power-seeking (and finding that these attributes can be increased or decreased by adding the appropriate steering vectors).
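To give a flavour of the core technique, here is a simplified, hedged sketch of contrastive activation addition (not the paper's actual code): the steering vector is the mean difference in residual-stream activations between prompts exhibiting a behaviour and matched prompts that don't. The prompt lists, layer, and GPT-style module path below are assumptions carried over from the earlier exercises (for Llama 2 the path would differ, e.g. model.model.layers[layer]).

import torch

def contrastive_steering_vector(model, pos_prompts, neg_prompts, layer):
    # Mean residual-stream activation at the final token position, over a list of prompts
    def mean_final_activation(prompts):
        acts = []
        for prompt in prompts:
            with model.trace(prompt):
                acts.append(model.transformer.h[layer].output[0][0, -1].save())
        return torch.stack([act.value for act in acts]).mean(dim=0)

    # Steering vector = (mean activation on behaviour-exhibiting prompts)
    #                 - (mean activation on matched prompts without the behaviour)
    return mean_final_activation(pos_prompts) - mean_final_activation(neg_prompts)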

This might be a good replication for you if:

  • You enjoyed the exercises in this section, but want to apply these ideas in more of a behavioural context than a task-based context
  • You're comfortable working with very large models, possibly via the nnsight library,
  • You're interested in evaluating models on traits like myopia, power seeking, etc,
  • You're comfortable doing prompt-engineering, and working with large datasets (like the ones linked above).

Update - there is now a LessWrong post associated with this paper, which also briefly discusses related areas. We strongly recommend reading this post if you're interested in this replication, or any of the other suggested replications in this section.

Red-teaming language models via activation engineering

This work, done by Nina Rimsky, extends much of the work we've seen previously to the domain of refusal - what determines whether the LLM will refuse to answer your request, and how can you affect this behaviour? From her post:

Validating if finetuning and RLHF have robustly achieved the intended outcome is challenging ... We can try to trigger unwanted behaviors in models more efficiently by manipulating their internal states during inference rather than searching through many inputs. The idea is that if a behavior can be easily triggered through techniques such as activation engineering, it may also occur in deployment. The inability to elicit behaviors via small internal perturbations could serve as a stronger guarantee of safety.

This might be a good replication for you if:

  • You enjoyed the exercises in this section, but want to apply these ideas in more of a behavioural context than a task-based context,
  • You're comfortable working with very large models, possibly via the nnsight library,
  • You're interested in RLHF, adversarial attacks and jailbreaking,
  • You're comfortable doing prompt-engineering (although some of the data you'd need for this replication is available on Nina's GitHub repo).

Note - for a week of work, we weakly suggest participants don't try one of these paper replications, because they're quite compute-heavy (even with the nnsight library at their disposal). There are many possible replications and extensions of the function vectors or GPT2-XL work, and these might be a better option if you enjoyed the exercises in this section and want to do more things like them.

However, if you do feel comfortable working with large models (e.g. you have some past experience of this) and you're interested in this work, then you're certainly welcome to try one of these replications!