Bonus

Congratulations on finishing the day's main content! Here are a few more bonus things for you to explore.

In-Place Operation Warnings

The most severe issue with our current system is that it can silently compute the wrong gradients when in-place operations are used. Have a look at how PyTorch handles this, and implement a similar system yourself so that it either computes the correct gradients or raises a warning.
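
For reference, PyTorch does this with a per-tensor version counter: every in-place operation bumps the counter, tensors saved for backward remember the counter's value at save time, and the backward pass errors out if the two no longer match. Below is a minimal sketch of that idea; the Tensor and SavedTensor classes and their attribute names are illustrative, not part of your existing code.

class Tensor:
    def __init__(self, array):
        self.array = array
        self._version = 0                      # bumped on every in-place mutation

    def add_(self, other):
        # In-place addition: mutate the underlying array and bump the version.
        self.array += other.array if isinstance(other, Tensor) else other
        self._version += 1
        return self

class SavedTensor:
    # Wraps a tensor saved for backward, remembering its version at save time.
    def __init__(self, tensor):
        self.tensor = tensor
        self.saved_version = tensor._version

    def unpack(self):
        if self.tensor._version != self.saved_version:
            raise RuntimeError(
                "a tensor needed for gradient computation has been modified "
                "by an in-place operation"
            )
        return self.tensor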

In-Place ReLU

Instead of implementing ReLU in terms of maximum, implement your own forward and backward functions that support inplace=True.
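
As a starting point, here is a sketch of the pair of functions in plain NumPy; the backward signature (grad_out, out, then the forward arguments) is an assumption meant to match the convention you used for your other backward functions. Note that the mask can be recovered from the output, which matters when inplace=True has already overwritten the input buffer.

import numpy as np

def relu_forward(x: np.ndarray, inplace: bool = False) -> np.ndarray:
    if inplace:
        # Clamp the input buffer itself rather than allocating a new array.
        np.maximum(x, 0.0, out=x)
        return x
    return np.maximum(x, 0.0)

def relu_backward(grad_out: np.ndarray, out: np.ndarray, x: np.ndarray, inplace: bool = False) -> np.ndarray:
    # relu(x) > 0 exactly where x > 0, so the output alone determines the mask
    # (taking the gradient at x == 0 to be zero, as PyTorch does).
    # inplace is accepted only to mirror the forward signature; it isn't needed here.
    return grad_out * (out > 0)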

Backward for einsum

Write the backward pass for your equivalent of torch.einsum.
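
A useful fact to get you started: the gradient of einsum with respect to one operand is itself an einsum, obtained by swapping that operand's subscript string with the output's and substituting grad_out for the operand (this works directly as long as every index of that operand also appears in the output or in another operand; otherwise you need to broadcast grad_out along the missing axes). A worked NumPy example for the matrix-multiply spec "ij,jk->ik":

import numpy as np

def matmul_einsum_backward(grad_out: np.ndarray, a: np.ndarray, b: np.ndarray):
    # For out = np.einsum("ij,jk->ik", a, b):
    grad_a = np.einsum("ik,jk->ij", grad_out, b)   # swap a's subscripts with the output's
    grad_b = np.einsum("ij,ik->jk", a, grad_out)   # swap b's subscripts with the output's
    return grad_a, grad_b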

Reuse of Module during forward

Consider the following MLP, where the same ReLU instance is used twice in the forward pass. Without running the code, explain whether or not this works correctly, with reference to the specifics of your implementation.

class MyModule(Module):
    def __init__(self):
        super().__init__()
        self.linear1 = Linear(28*28, 64)
        self.linear2 = Linear(64, 64)
        self.linear3 = Linear(64, 10)
        self.relu = ReLU()
    def forward(self, x):
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        return self.linear3(x)
Answer (what you should find)

This implementation will work correctly.

The worry with reusing modules is that the same node would then appear at more than one place in the computational graph (for a module with parameters, those parameters would feed into the graph twice and your backward pass would have to accumulate their gradients correctly), but the ReLU module doesn't have any parameters (or any internal state), so this isn't a problem. It's effectively just a wrapper for the relu function, and you could replace self.relu with a direct application of relu without changing the model's behaviour.

This is slightly different if we're thinking about adding hooks to our model. Hooks are functions that are called during the forward or backward pass, and they can be used to inspect the state of the model during training. We generally want each hook to be associated with a single position in the model, rather than being called at two different points.
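
To make this concrete with the real PyTorch API: register_forward_hook attaches a hook to a module instance, so if that one instance is called twice in forward, the hook also fires twice rather than once per "position" in the model.

import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(8, 8)
        self.linear2 = nn.Linear(8, 2)
        self.relu = nn.ReLU()          # a single instance, called twice in forward

    def forward(self, x):
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        return x

model = TinyMLP()
calls = []
model.relu.register_forward_hook(lambda module, inputs, output: calls.append(tuple(output.shape)))
model(torch.randn(3, 8))
print(calls)   # [(3, 8), (3, 2)]: the hook fired once per call, not once per layer position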

Convolutional layers

Now that you've implemented a linear layer, it should be relatively straightforward to take your convolution code from day 2 and use it to make a convolutional layer. How much does performance on the MNIST task improve once you replace your first two linear layers with convolutions?
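
Here is a sketch of what such a module might look like, assuming conv2d is your day 2 function (already wrapped so it participates in backprop) and Module, Parameter and Tensor are the classes you built earlier today; all names and the initialisation scheme are assumptions to adapt to your own code.

import numpy as np

class Conv2d(Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # Uniform initialisation scaled by fan-in, mirroring what torch.nn.Conv2d does.
        k = 1 / (in_channels * kernel_size * kernel_size)
        weight = np.random.uniform(
            -np.sqrt(k), np.sqrt(k),
            (out_channels, in_channels, kernel_size, kernel_size),
        )
        self.weight = Parameter(Tensor(weight))
        self.stride = stride
        self.padding = padding

    def forward(self, x):
        return conv2d(x, self.weight, stride=self.stride, padding=self.padding)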

ResNet Support

Make a list of the features that would need to be implemented to support ResNet inference and training. Implementing all of them would probably take too long, so pick a few interesting features to start with.

Central Difference Checking

Write a function that compares the gradients from your backprop to those computed by the central difference method. See Wikipedia for more details.
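
A minimal sketch of the idea: perturb one element at a time by plus and minus eps, take the symmetric difference quotient (f(x + eps) - f(x - eps)) / (2 * eps), and compare the result to the gradient your backward pass produced (the f and tensor names in the usage comment are placeholders for your own).

import numpy as np

def central_difference_grad(f, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Estimate d f(x) / d x elementwise, where f maps an array to a Python float.
    # x must have a float dtype, since it's perturbed in place.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        original = x[idx]
        x[idx] = original + eps
        f_plus = f(x)
        x[idx] = original - eps
        f_minus = f(x)
        x[idx] = original                  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Usage: something like
#   assert np.allclose(central_difference_grad(f, x), x_tensor.grad.array, atol=1e-4)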

Non-Differentiable Function Support

Your Tensor does not currently support equivalents of torch.all, torch.any, torch.floor, torch.less, etc., which are non-differentiable functions of Tensors. Implement them so that they can be used in computational graphs, but so that gradients don't flow through them (their contribution is zero).
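
One way to do this is a second wrapper, alongside your differentiable one, that converts Tensor arguments to arrays, calls the NumPy function, and returns a result that isn't tracked (the Tensor attribute names below are assumptions about your implementation).

import numpy as np

def wrap_non_differentiable(numpy_func):
    # Lift a NumPy function to Tensors without recording it in the graph:
    # the output has requires_grad=False and no recipe/grad_fn, so backprop stops here.
    def tensor_func(*args, **kwargs):
        raw_args = [a.array if isinstance(a, Tensor) else a for a in args]
        out = Tensor(numpy_func(*raw_args, **kwargs))
        out.requires_grad = False
        return out
    return tensor_func

floor = wrap_non_differentiable(np.floor)
less = wrap_non_differentiable(np.less)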

Differentiation wrt Keyword Arguments

In the real PyTorch, you can sometimes pass tensors as keyword arguments and differentiation still works, as in t.add(other=t.tensor([3, 4]), input=t.tensor([1, 2])). In other, similar-looking cases like t.dot, it raises an error saying that the argument must be passed positionally. Decide on the desired behavior for your system, then implement and test it.
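
The simplest policy to implement is probably the stricter one: have your forward wrapper refuse Tensor keyword arguments outright, as sketched below (Tensor is your own class). A more permissive alternative is to bind keyword arguments back to their positional slots with inspect.signature before building the graph, where the wrapped function's signature allows it.

def reject_tensor_kwargs(kwargs: dict) -> None:
    # Call this from your forward wrapper before running the wrapped function.
    for name, value in kwargs.items():
        if isinstance(value, Tensor):
            raise RuntimeError(
                f"argument '{name}' must be passed positionally, not as a keyword"
            )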

torch.stack

So far we've registered a separate backward function for each input argument that could be a Tensor. This is problematic if the function can take any number of tensors, like torch.stack or numpy.stack. Work out and implement the backward function for stack; it may require modifications to your other code.
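
As a hint at the shape of the answer: stack's backward takes grad_out and returns one gradient per stacked input, which is exactly why a registration scheme keyed on a single argument index needs rethinking. A NumPy sketch, assuming your backward functions follow the (grad_out, out, *forward_args) convention:

import numpy as np

def stack_backward(grad_out: np.ndarray, out: np.ndarray, *arrays: np.ndarray, dim: int = 0):
    # Each input's gradient is simply its slice of grad_out along the stacked dimension.
    grads = np.split(grad_out, len(arrays), axis=dim)
    # np.split keeps the stacked dimension with size 1; squeeze it back out.
    return tuple(np.squeeze(g, axis=dim) for g in grads)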