1️⃣ Core Concepts / Knowledge

Learning Objectives
  • Understand the structure and function of neural networks
  • Learn essential linear algebra concepts like matrix operations and transformations
  • Understand core principles of probability and statistics, including expected value and variance
  • Learn how calculus concepts (particularly differentiation) are applied in optimization tasks
  • Cover some foundational information theory concepts, such as entropy and KL divergence
  • Enhance Python programming skills, focusing on NumPy and PyTorch basics

This page contains a list of all prerequisites we think will be helpful to learn before studying the ARENA program material. You can return to this page while you study. None of it is compulsory and some resources are likely to be much more helpful than others. We denote very high and high-priority resources with a double and single asterisk respectively, so if you have limited time then prioritise these. It is strongly recommended to at least read over everything with a double asterisk. Also, you should try and prioritise areas you think you might be weaker in than others (for instance, if you have a strong SWE background but less maths experience then you might want to spend more time on the maths sections). You can also return to this document throughout the programme, if there are any areas you want to brush up on.

The content is partially inspired by a similar doc handed out by Redwood to participants before the start of MLAB, as well as by pre-prerequisite material provided by Jacob Hilton on his GitHub page.

Throughout this document, there are some questions thrown in (denoted with boxes like these). It's not essential to answer all of these questions, but you should be able to answer most (or at least know how you might go about answering them).

If you are reading this, and know of any good material for these topics which we've missed out, please let us know and we might be able to add it in!

Maths

Neural Networks**

We won't assume any deep knowledge of neural networks or machine learning before the programme starts, but it's useful to have an idea of the basics so that the first week doesn't have quite as steep a learning curve. The best introductory resources here are 3B1B's videos on neural networks:

You should prioritise the first two videos in this sequence.

**What makes neural networks more powerful than basic statistical methods like linear regression?**
Answer

Some possible answers / points to consider:

Neural networks exploit nonlinearity, which allows them to express a much wider set of possible functions, whereas linear regression is relatively limited. Neural networks are learned using gradient descent, meaning their power isn't upper-bounded by the algorithms which programmers or mathematicians can feasibly design by hand.

 

**What are the advantages of ReLU activations over sigmoids?**
Answer

Some possible answers / points to consider:

ReLU more effectively avoids the vanishing gradient problem, which is common in sigmoids. ReLU is more computationally efficient to evaluate than sigmoid.

However, an important point about ReLU and much of ML in general - the better empirical results often come before the theoretical justifications! A lot of ML is built on the philosophy of "experiment until you find something that works, then figure out why it works."
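If you want to see the vanishing-gradient point numerically, here's a small NumPy sketch (not from the course material; the input values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # peaks at 0.25, decays towards 0 for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(xs))  # tiny values at the extremes -> gradients vanish
print(relu_grad(xs))     # [0, 0, 0, 1, 1] -> gradient doesn't shrink for positive inputs
```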

Linear Algebra**

Linear algebra lies at the core of a lot of machine learning.

Here is a list of topics you should be comfortable with:

  • Linear transformations - what they are, and why they are important
    • See this video from 3B1B
  • How matrix multiplication works
  • Basic matrix properties: rank, trace, determinant, transpose
  • Bases, and basis transformations

Some other non-essential topics that we also weakly recommend:

  • Singular value decomposition
  • Eigenvalues and eigenvectors
  • Different types of matrix, and their significance (e.g. symmetric, orthogonal, identity, rotation matrices)

This video series by 3B1B provides a good overview of these core topics (although you can probably skip it if you already have a reasonably strong mathematical background).

If you have a lot more time, Linear Algebra Done Right is the canonical textbook for covering this topic (although it will probably cover much more than you need to know). Alternatively, Neel Nanda has two YouTube videos covering linear algebra extensively.

**What is the problem in trying to create a neural network using only linear transformations?**
Answer

The composition of linear transformations is also linear, so if you only use linear transformations then you can only create linear functions. This is a problem because many functions we want to learn are not linear.

Composing linear and nonlinear transformations however, we can learn basically anything!
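As a concrete (and entirely optional) illustration, here's a minimal PyTorch sketch, with arbitrary hyperparameters, showing that a Linear → ReLU → Linear stack can fit a function that a single linear layer cannot:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = x.abs()  # y = |x| is not a linear function of x

# Linear -> ReLU -> Linear: the nonlinearity is what makes this expressive
model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(2000):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # small; a purely linear model can do no better than predicting the mean of y
```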

 

**Matrices $A$ and $B$ have shapes $(n, m)$ and $(m, l)$. What is the maximum possible rank of the matrix $AB$?**
Answer

The rank of $AB$ is upper-bounded by the minimum of the ranks of $A$ and $B$, and the rank of a matrix is upper-bounded by the minimum of its row number and column number (since matrix rank equals the dimensionality of its column span, and also equals the dimensionality of its row span). So $\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B)) \leq \min(n, m, l)$.

Note - in transformer language models, we often have cases where $m \ll n, l$, so we have a large, low-rank matrix $AB$. We use tools like singular value decomposition to understand these matrices (e.g. see [this post](https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight)), since SVD is a common tool for large low-rank matrices.
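If you want to sanity-check the bound empirically, here's a tiny NumPy sketch (the shapes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 10, 3, 8
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, l))

# Random matrices are full-rank with probability 1, so this hits the upper bound
print(np.linalg.matrix_rank(A @ B))  # 3 == min(n, m, l)
```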

Probability & Statistics**

It's essential to understand the rules of probability, expected value and standard deviation, and helpful to understand independence and the normal distribution.

This link covers some of the essentials.

**What is the expected value and variance of the sum of two independent normally distributed random variables $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$? Are either of these different if they're correlated? (We don't necessarily expect you to be able to derive this kind of result.)**
Answer

For independent variables, the expected value and variance are both additive; in other words, they are $\mu_1 + \mu_2$ and $\sigma_1^2 + \sigma_2^2$ respectively.

If the variables are correlated, the mean is the same, but the variance will be different: larger if they're positively correlated, smaller if they're anticorrelated. For some intuition (taking the case $\sigma_1 = \sigma_2$): if they're perfectly correlated then you're basically just doubling a single value, so you quadruple the variance, whereas if they're perfectly anticorrelated then the sum is constant, so its variance is zero.
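A quick way to convince yourself of the independent case is a simulation like the following sketch (the sample size and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, sigma1, mu2, sigma2 = 1.0, 2.0, -3.0, 0.5

x1 = rng.normal(mu1, sigma1, size=1_000_000)
x2 = rng.normal(mu2, sigma2, size=1_000_000)
s = x1 + x2

print(s.mean())  # ~ mu1 + mu2       = -2.0
print(s.var())   # ~ sigma1**2 + sigma2**2 = 4.25
```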

Calculus**

It's essential to understand differentiation and partial differentiation, and helpful to understand the basics of vector calculus including the chain rule and Taylor series.

Again, 3Blue1Brown has a good video series on this.

**What is the derivative of quadratic loss $L(x, y) = \frac{1}{2}(x-y)^2$ wrt the input $x$ (assuming all variables are scalars, not vectors)? How about for cross entropy loss $L(x, y) = -(y\log{x} + (1-y)\log{(1-x)})$, assuming that $x \in (0, 1)$ and $y$ is a binary classification label with value either zero or one? What will be the qualitative behaviour of performing gradient descent on $x$ with these loss functions?**
Answer (MSE loss)

The derivative of $L$ wrt $x$ is $x-y$. Qualitatively, gradient descent will increase $x$ if $x < y$ and decrease it if $x > y$, with the size of the update proportional to the difference between them.

Answer (BCE loss)

The derivative of $L$ wrt $x$ is $-\frac{y}{x} + \frac{1-y}{1-x}$, which equals $-\frac{1}{x}$ when $y=1$ and $\frac{1}{1-x}$ when $y=0$. Qualitatively, this means that $x$ will be increased if $y=1$, and decreased if $y=0$, with the size of the increase / decrease getting unboundedly large when $x$ is close to $1-y$ (i.e. far from the correct value).

Note - in this case, $x$ is the probability output by our model, which is probably the result of applying a sigmoid or softmax function to the model's logit output. So the gradient of $L$ wrt $x$ being unbounded doesn't mean the gradient on the model's weights will be unbounded.
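If you want to check these derivatives yourself, here's a minimal PyTorch autograd sketch (the values of $x$ and $y$ are arbitrary):

```python
import torch

x = torch.tensor(0.3, requires_grad=True)
y = torch.tensor(1.0)

# Quadratic (MSE) loss: derivative wrt x should be x - y
mse = 0.5 * (x - y) ** 2
mse.backward()
print(x.grad)  # x - y = -0.7

# Cross entropy (BCE) loss with y = 1: derivative wrt x should be -1/x
x.grad = None  # reset the stored gradient before the next backward pass
bce = -(y * torch.log(x) + (1 - y) * torch.log(1 - x))
bce.backward()
print(x.grad)  # -1/0.3 ≈ -3.33
```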

Information theory

It's helpful to understand information, entropy and KL divergence. These play key roles in interpreting loss functions.

Elements of Information Theory by Thomas M. Cover is the textbook recommended by Jacob Hilton. It will probably cover more than you need to know, and this need not be prioritised over other items on the list.

For an overview of Kullback-Leibler divergence (an important concept in information theory and machine learning), see Six (and a half) intuitions for KL divergence. Note that this probably won't make sense if you don't already have a solid grasp of what entropy is.

**Suppose $P$ is the probability distribution of which word comes next in natural language, and $Q$ is a language model's estimated probability distribution. What will the cross entropy $H(P, Q)$ be if the model is guessing words uniformly? What will the cross entropy be if the model can predict words with the exact right frequency?**
Answer

If the model is guessing uniformly, then:

$$ H(P, Q) = -\sum_x P(x) \log Q(x) = -\sum_x P(x) \log \frac{1}{|V|} = \log |V| $$

where $V$ is the vocabulary set, and $|V|$ is the number of words in the vocabulary.

If the model is guessing with the exact right frequency, then:

$$ H(P, Q) = -\sum_x P(x) \log Q(x) = -\sum_x P(x) \log P(x) = H(P) $$

in other words, cross entropy just reduces to the entropy of natural language.
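To see both cases numerically, here's a small NumPy sketch (the toy distribution $P$ and vocabulary size are made up for illustration):

```python
import numpy as np

V = 4
P = np.array([0.6, 0.2, 0.1, 0.1])  # "true" next-word distribution
Q_uniform = np.full(V, 1 / V)       # model guesses uniformly
Q_exact = P.copy()                  # model matches P exactly

def cross_entropy(p, q):
    return -(p * np.log(q)).sum()

print(cross_entropy(P, Q_uniform))  # log(4) ≈ 1.386
print(cross_entropy(P, Q_exact))    # H(P) ≈ 1.089, i.e. the entropy of P
```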

Programming

Python**

It's important to be strong in Python, because this is the language we'll be using during the programme. As a rough indication, we expect you to be comfortable with at least 80-90% of the material here, up to section "21. for/else". For a more thorough treatment of Python's core functionality, see here.

Libraries

The following libraries would be useful to know to at least a basic level, before the course starts.

NumPy**

Being familiar with NumPy is a staple for working with high-performance Python. Additionally, the syntax for working with NumPy arrays is very similar to how you work with PyTorch tensors (often there are only minor differences, e.g. Torch tends to use the keyword dim where NumPy uses axis). Working through these 100 basic NumPy exercises would be a good idea, or if you're comfortable with NumPy already then you could try doing them in PyTorch (see below).
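As a tiny illustration of how similar the two APIs are (this is just an example, not an exercise):

```python
import numpy as np
import torch

a = np.arange(6).reshape(2, 3)
t = torch.arange(6).reshape(2, 3)

print(a.sum(axis=1))  # NumPy uses `axis`  -> [ 3 12]
print(t.sum(dim=1))   # PyTorch uses `dim` -> tensor([ 3, 12])
```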

PyTorch**

We will be starting chapter 0 of the programme with some structured exercises designed to get everyone familiar with working in PyTorch. However, the more comfortable you are with PyTorch going in, the easier you'll probably find this. PyTorch has several useful tutorials, and to get comfortable working with tensors you might want to implement the 100 basic NumPy exercises linked to above, using PyTorch instead. Another option would be this Learn the Basics tutorial. You can skip this if you can clearly explain:

  • At a high level, what is a torch.Tensor?
  • What is a nn.Parameter, and nn.Module?
  • When you call .backward(), where are your gradients stored?
  • What is a loss function? In general, what does it take for arguments, and what does it return?
  • What does an optimization algorithm do?
  • What is a hyperparameter, and how does it differ from a regular parameter?
  • What are some examples of hyperparameters?
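As a rough self-check against the questions above, here's a minimal sketch (shapes and values are arbitrary) showing a nn.Module with a nn.Parameter, a scalar loss, and where gradients end up after .backward():

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter: a tensor registered with the module as a trainable parameter
        self.weight = nn.Parameter(torch.randn(3))

    def forward(self, x):
        return (self.weight * x).sum()

model = TinyModel()
x = torch.tensor([1.0, 2.0, 3.0])

# A loss function takes predictions (and targets) and returns a scalar
loss = (model(x) - 1.0) ** 2
loss.backward()                  # gradients are stored on each parameter's .grad attribute
print(model.weight.grad)
```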

Einops and Einsum*

These are great libraries to get comfortable with, when manipulating tensors. If you're comfortable using them, then you can say goodbye to awkward NumPy/PyTorch methods like transpose, permute and squeeze! We'll have a few einops and einsum exercises on the second day of the fundamentals chapter, but the more comfortable you are with these libraries the faster you'll be.

For einops, you can read through the examples up to "Fancy examples in random order". It's worth trying to play around with these in your own Jupyter notebook, to get more comfortable with them.

For einsum, this page provides a basic intro to Einstein summation convention, and shows some example tensor implementations. Note that we'll be using the einsum function from the einops library, which allows you to refer to dimensions by name rather than by a single letter.
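To give a flavour of both libraries, here's a short sketch (the tensor shapes are arbitrary examples):

```python
import torch
from einops import rearrange, einsum

x = torch.randn(2, 3, 4)  # (batch, seq, features)

# rearrange replaces transpose/permute/reshape, with dimensions referred to by name
y = rearrange(x, "batch seq feat -> batch feat seq")
print(y.shape)  # torch.Size([2, 4, 3])

# einops.einsum: a batched matrix multiplication, again with named dimensions
w = torch.randn(4, 5)
out = einsum(x, w, "batch seq feat, feat out -> batch seq out")
print(out.shape)  # torch.Size([2, 3, 5])
```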

Typing*

Type-checking Python functions that you write is a great way to catch bugs, and keep your code clear and readable. Python isn't a statically-typed language, so you won't get errors from incorrect type annotations unless you use a tool like MyPy. However, if you're using VSCode then you can pair this library with a really useful automatic type checker (see next section).
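A small example of what annotated code might look like (the function itself is just an illustration):

```python
import torch

def standardize(x: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """Subtract the mean and divide by the standard deviation along `dim`."""
    mean = x.mean(dim=dim, keepdim=True)
    std = x.std(dim=dim, keepdim=True)
    return (x - mean) / (std + eps)

out: torch.Tensor = standardize(torch.randn(4, 3))
print(out.shape)  # torch.Size([4, 3])
```

With annotations like these, a type checker (e.g. VSCode's built-in Pylance, or MyPy) can flag calls that pass the wrong kinds of arguments before you ever run the code.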

Plotly

Plotly is an interactive graphing library which is great for presenting results and investigating data. If you're already very familiar with a different Python plotting library (e.g. matplotlib) then I wouldn't recommend re-learning Plotly, but if you aren't already very familiar with matplotlib or you're open to learning Plotly, I'd strongly recommend giving it a try!
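For example, a minimal Plotly figure looks something like this (the data is made up):

```python
import plotly.express as px

# px.line produces an interactive figure: hover for values, zoom, pan, etc.
fig = px.line(
    x=[0, 1, 2, 3],
    y=[0, 1, 4, 9],
    labels={"x": "step", "y": "loss"},
    title="Toy loss curve",
)
fig.show()
```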

Streamlit

Streamlit is a cool library for building and sharing data-based applications. It integrates very nicely with Plotly (see above), can be hosted on your personal GitHub, and is very intuitive & easy to learn relative to other libraries with similar features (e.g. Dash). This is not compulsory, but if you like the look of Streamlit then you might want to think about using it as a way to submit (or even make public) your end-of-week or capstone projects. See this page I (Callum) made for presenting the results from the Copy Suppression paper I worked on during SERI MATS, as an example of what Streamlit can do.

Software Engineering

Basic coding skills**

If you've been accepted into this programme, then you probably already have this box ticked! However, polishing this area can't hurt. LeetCode is a good place to keep basic coding skills sharp, in particular practising the planning and implementation of functions in the medium-hard sections of LeetCode might be helpful. Practising problems on Project Euler is another idea.

VSCode**

Although many of you might already be familiar with Jupyter Notebooks, we recommend working through structured exercises using VSCode. This is a powerful text editor which provides more features than Jupyter Notebooks. Some features it has are:

  • Shortcuts
    These are much more powerful than anything offered in Jupyter Notebooks. Below is a table of some particularly useful ones (see this link for more detail on how each of them works).

    | Command | Windows/Linux | Mac |
    |:---:|:---:|:---:|
    | Delete Line | Ctrl + Shift + K | Cmd + Shift + K |
    | Copy Line Up or Down | Shift + Alt + Up arrow or Shift + Alt + Down arrow | Opt + Shift + Up arrow or Opt + Shift + Down arrow |
    | Global Find | Ctrl + Shift + F | Cmd + Shift + F |
    | Copilot* | Ctrl + i / Ctrl + Shift + i | Cmd + i / Cmd + Shift + i |
    | Toggle Block Comment | Ctrl + Shift + / | Cmd + Shift + / |
    | Command Palette | Ctrl + Shift + P | Cmd + Shift + P |
    | Toggle Line Comment | Ctrl + / | Cmd + / |
    | Trigger Suggestion | Ctrl + Space | Cmd + Space |
    | Toggle Sidebar | Ctrl + B | Cmd + B |
    | Multi Select Cursor | Ctrl + D | Cmd + D |
    | Quick Open | Ctrl + P | Cmd + P |

  • Type checking
    We discussed the typing module in a section above. This is particularly powerful when used alongside VSCode's type checker extension. You can activate type checking by going to the settings.json file in VSCode, and adding this setting:

    json { "python.analysis.typeCheckingMode": "basic" }

    You can open the settings.json file by first opening VSCode's Command Palette (see the shortcuts above), then finding the option Preferences: Open User Settings (JSON). We won't be using type checking like this in the course, because it can be a bit too strict sometimes, but it's a useful thing to know about.

  • Notebook functionality
    Although VSCode does provide an extension which acts just like a Jupyter Notebook, it actually has a much more useful feature. Python files can also be made to act like notebooks by adding lines containing `# %%`, which act as cell dividers. In this way, you can separate chunks of code and run them individually (and see their output in a new window). See this page for a further explanation.

  • Debugger
    The VSCode debugger is a great tool, and can be much more powerful and efficient than the standard practice of adding lines to print information about your output! You can set breakpoints in your code, and closely examine the local variables which are stored at that point in the program. More information can be found on this page.

  • Testing
    VSCode provides ways to easily keep track of and run tests which are written using the pytest or unittest libraries. We'll have a look at the former during the course.

  • Remote machines
    VSCode offers easy ways to execute code on remote machines over SSH.

  • Copilot
    GitHub Copilot uses the OpenAI Codex to suggest code and entire functions in real-time, right from your editor. We will be encouraging (although not requiring) use of copilot during the program. It can't do everything, but it's very helpful in abstracting away the annoying low-level details and allowing you to focus on the higher-level concepts and structures.

  • Local imports
    If you're getting squiggly yellow lines under local imports (or local imports aren't working), add this to the workspace JSON file (you can access it by searching "workspace JSON" in the command palette):

    json { "python.analysis.extraPaths": [ "extrapath" ], "python.analysis.include": [ "extrapath" ] } where extrapath is the path you want to add (e.g. ./chapter0_fundamentals/exercises, if you're running imports like import part1_raytracing.solutions as solutions).

Jupyter Notebook / Colab**

Despite the overall awesomeness of VSCode, Jupyter Notebooks still have some advantages, primarily in exploration and visualisation. Several of the advantages listed in the previous section also apply to .ipynb files created and edited in VSCode; the main reason we're not encouraging this for the ARENA in-person program is that we want participants to be able to write code they can import from previous days (and because using .py files helps encourage better coding practices, rather than just creating a huge number of cells and dropping them under different headings, losing track of what code is where!).

Colab has a similar structure to Jupyter Notebooks, but it provides several additional features, most importantly GPU access. For anyone following this material virtually, we unfortunately won't be able to provide compute, so (assuming you don't have any way of accessing better compute yourselves e.g. via Lambda Labs) Colab might be the best option for you.

Git*

Git is a piece of version control software, designed for tracking and managing changes in a set of files. It can be very powerful, but also a bit of a headache to use.

If you're following this course from the repo, you'll need to be comfortable using Git to push and pull files. We also love to see participants contributing to open-source libraries during the program or capstone projects (e.g. TransformerLens or nnsight) as well as using GitHub for collaboration during their capstone projects. These will also require comfort working with Git.

If you already have a strong SWE background then you might not need to spend as much time on this section. Otherwise, we recommend the Learn Git Branching tutorial series, for an intuitive and interactive set of exercises on Git. A few other resources which have been recommended (which will probably have a lot of overlap):

Ideally, you should feel comfortable with the following:

  • Cloning a repository
  • Creating and switching between branches
  • Staging and committing changes
  • Pushing branches

Conda*

Virtual environments are a common way to manage dependencies when working across multiple different projects, and are standard practice in all professional development contexts. During ARENA, we expect that participants will most likely be working inside a single environment, but possibly more.

If you haven't used them already, please ensure that you have conda installed and know how to activate it. See: Getting started with conda.

Unix

This won't matter a huge amount for this course (at least for the minimal versions of it), but it's an important part of your development as an engineer. UC Berkeley's UNIX Tutorial provides a comprehensive introduction (up to & including section 4 should suffice for most things you might be doing during this course). Surrey University's UNIX Tutorial is also a good resource - up to tutorial 2 should be sufficient.

Also, you can use the instructions on this page to set up a UNIX terminal to experiment with the methods covered in the above tutorials.

Optional Reading

We will teach the course assuming you haven't read any of these, but they are all useful and relevant things to know and will allow you to understand the material more deeply and/or tackle a more advanced Week 4 project.

100 NumPy Exercises

After you've done the pre-course exercises, if you want more practice with this sort of thing, try these. Some of these are a lot more interesting than others - pick out some that sound fun and challenging and try solving them in PyTorch.

What is torch.nn really? by Jeremy Howard

You'll be implementing a lot of functionality of torch.nn and torch.optim yourself during the course. This is a good introduction to what functionality is in these packages. If you don't learn this now, you can pick it up during the course, though perhaps less deeply.

NLP Demystified by Nitin Punjabi

An introduction to natural language processing assuming zero ML background.

  • If you've never done NLP before, it's worth skimming to get a general idea of the field.
  • If you've never built a basic feedforward neural network (MLP) from scratch, the section "Neural Networks I" has a good exercise on this.

Visualising Representations: Deep Learning and Human Beings by Chris Olah

Builds intuition with nice pictures about what deep networks are doing inside.

Spinning Up in Deep RL by OpenAI

Introduction to using deep learning for reinforcement learning. ARENA will assume zero RL experience, but having some understanding already means you'll have an easier time and can tackle more advanced versions of things in the course.

Introduction to Reinforcement Learning with David Silver

This video course is fairly old (2015) and the state of the art has come a long way, but this is still useful to cover the basics. I would recommend Spinning Up in Deep RL over this unless you learn better from video lectures than reading.

The Matrix Cookbook by Petersen and Pedersen

A densely packed reference of facts and identities about matrices. Definitely not intended to teach topics, but a good place to look up something you need. It's worth memorising identities 1-6, 11-16, and 18-23 from Page 6.

Zoom In: An Introduction to Circuits by Chris Olah et al

A very fun article on interpretability in neural networks trained for computer vision.

Why Momentum Really Works by Gabriel Goh

Variations of gradient descent that use momentum are extremely common. We'll teach the basics of momentum in the course, but if you want a richer and deeper understanding then this is a good article to read and reread.

The Matrix Calculus You Need for Deep Learning by Terence Parr and Jeremy Howard

Takes you from knowing introductory calculus to calculus on matrices. This will be helpful for the backpropagation material, on day 3 of the course.

A Mathematical Framework for Transformer Circuits by Anthropic

Analyses transformers starting with the simplest toy models and working up. A heavy read but very good for building intuition about what transformers can do. This will form the cornerstone of the mechanistic interpretability chapter.

In-context Learning and Induction Heads by Anthropic

Describes and analyses "induction heads", an important circuit learned by transformers.

Neural Networks and Deep Learning by Michael Nielsen