Automatic differentiation and neural networks
In this lecture we are going to look at the algorithms that underlie the training of neural networks, which are the dominant model in the field of machine learning.
1 Background
At a very high level a neural network is just a complicated function with a lot of parameters — so really a family of functions — into which data can be passed. “Data” in this case could be anything: in popular applications in machine learning it could represent images (represented as pixel values), text (represented as strings), audio signals (represented as a time series), and so on. From now on just regard a data item as a big vector $\mathbf{x}$ of numbers.
Different choices of the parameters (call them $\theta$ collectively) give different functions, and training the network means finding a choice of parameters for which the function does something useful.
I should say at the outset that the conceptual ideas behind the training of neural networks are not particularly deep, with the possible exception of the backpropagation algorithm that we will describe here. What has made for the revolutionary success of this approach — essentially putting all other forms of machine learning (e.g. symbolic) out of business1 — is:
- The free availability of large datasets of labelled data.
- The free availability of open source languages, libraries, and models.
- The wide (if not free) availability of the necessary computing power to train models.
1.1 The cost function
Let’s discuss the idea of training as optimization in a bit more detail. We suppose that we have a large dataset of size $N$, consisting of pairs $(\mathbf{x}_n, \mathbf{y}_n)$ for $n = 1, \ldots, N$, where $\mathbf{x}_n$ is a data vector and $\mathbf{y}_n$ is the corresponding label.
We would like to train the network (choose the parameters $\theta$) so that the output $f(\mathbf{x}_n; \theta)$ is as close as possible to the label $\mathbf{y}_n$ for every $n$ in the dataset. A standard way to quantify this is the quadratic cost function

$$
C(\theta) = \frac{1}{2N}\sum_{n=1}^{N} \left\lVert f(\mathbf{x}_n; \theta) - \mathbf{y}_n \right\rVert^2. \tag{1}
$$

In other words, we use the usual square norm in the output space to measure how far each prediction is from its label, and then average over the dataset.
The idea is now to minimize $C(\theta)$ with respect to the parameters $\theta$: a choice of parameters with a low cost is one for which the network’s outputs are close to the labels.
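As a concrete illustration, here is a minimal NumPy sketch of the cost in Equation 1. It assumes the network is available as a function `f(x, theta)` returning the output vector for a single input; the name and calling convention are just placeholders for this sketch.

```python
import numpy as np

def quadratic_cost(f, theta, X, Y):
    """Quadratic cost of Equation 1: the squared distance between the network
    output f(x; theta) and the label y, averaged over the dataset.
    X holds one data vector per row, Y the corresponding labels."""
    total = sum(np.sum((f(x, theta) - y) ** 2) for x, y in zip(X, Y))
    return total / (2 * len(X))
```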
When evaluating the performance of a machine learning model there is a standard protocol that involves splitting the dataset into a training set and a test set, where the former is used for training the model and the latter for evaluating it. After training the model it should perform well on the training set, but will generally perform less well on the test set, which contains data that the model has never seen. The difference between the cost function evaluated on the test set and the training set is a measure of how well the model generalizes to new inputs and is known as the generalization error.
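A sketch of this splitting protocol, assuming the data lives in NumPy arrays `X` and `Y` with one item per row (the names are illustrative):

```python
import numpy as np

def split_dataset(X, Y, test_fraction=0.2, seed=0):
    """Randomly split a dataset into a training set and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(test_fraction * len(X))
    test, train = perm[:n_test], perm[n_test:]
    return (X[train], Y[train]), (X[test], Y[test])

# Train using only the training set. Afterwards, the generalization error is
# the difference between the cost evaluated on the test set and on the
# training set.
```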
A particular risk when using large neural networks with many parameters is the problem of overfitting. A sufficiently flexible model is capable of effectively “memorizing” the dataset, without “understanding” the labelling, leading to poor generalization.
A particularly vivid example appears in Zhang et al. (2021). They showed that popular computer vision models can be trained on randomly labelled data (where the labels have no connection to the image) to achieve perfect accuracy on the training set. Of course, the resulting performance on the test set was no better than random guessing. This is a natural consequence of overparameterization — having more parameters than examples in your training data — and shows that much of the success in training models with good generalization is down to the details of how the training is done (for example, by stopping before the training error gets too low).
1.2 Gradient descent
Leaving these questions aside, the basic idea underlying training is an extremely simple algorithm called gradient descent. If our network, and hence the cost $C(\theta)$, is a differentiable function of the parameters, we can iteratively improve the parameters by taking a small step in the direction of steepest descent of the cost:

$$
\theta \to \theta - \eta \nabla_\theta C(\theta),
$$

where $\eta$ is a small positive number called the learning rate2. Repeating this step many times moves the parameters towards a (local) minimum of the cost.
You might find it surprising that such a simple approach plays such an important role in machine learning. All of the sophistication lies in how the model is defined (Section 1.3) and how the gradients are calculated (Section 2): for a complicated function with many parameters3 this is a highly nontrivial task. While there are plenty of more sophisticated optimization methods, they often require more information about the model’s dependence on its parameters, and this is more costly to evaluate. For example Newton’s method — which you may have encountered before — requires knowledge of first and second derivatives at each step, and this is normally impractical for models of this size.
Another issue that relates to scale concerns the definition of our cost function Equation 1 as an average over the dataset. For large datasets consisting of high dimensional data (e.g. images) it is usually not practical to calculate the gradient of the cost using the entire dataset. The usual procedure is then to split the data up into batches (usually called minibatches, confusingly), and perform each step of gradient descent by evaluating the gradient only on the batch, moving on to a new batch at the next step. Eventually this will lead to all the data in the dataset being used, which is usually known as one epoch of training. Training a model can involve many epochs (passes through the dataset).
Because each step only uses part of the data, the gradients calculated are going to be more “noisy” than the “true” gradients involving the whole dataset. Because of this, training by gradient descent with minibatches is known as stochastic gradient descent. It is generally thought that the noise introduced by minibatching plays a role in improving the generalization performance of neural networks.
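Putting the last few paragraphs together, here is a sketch of minibatch stochastic gradient descent. It assumes a function `grad_cost(theta, X_batch, Y_batch)` that returns the gradient of the cost evaluated on a batch (Section 2 is about how to compute it); the names and default values are only illustrative.

```python
import numpy as np

def sgd(theta, grad_cost, X, Y, learning_rate=0.1, batch_size=32, epochs=10):
    """Minibatch stochastic gradient descent on the parameters theta."""
    N = len(X)
    for epoch in range(epochs):           # one epoch = one pass through the data
        perm = np.random.permutation(N)   # visit the data in a fresh random order
        for start in range(0, N, batch_size):
            batch = perm[start:start + batch_size]
            g = grad_cost(theta, X[batch], Y[batch])  # gradient on this batch only
            theta = theta - learning_rate * g         # gradient descent step
    return theta
```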
1.3 The network
So far we have said nothing at all about the form of the function $f(\mathbf{x}; \theta)$ itself, or about why it deserves the name “neural network”.
Leaving biology aside for the moment, we know that $f(\mathbf{x}; \theta)$ has to accept a (possibly very high dimensional) input vector, depend on a large number of parameters, and be flexible enough to capture complicated relationships between inputs and labels. How do we actually build such a function?
The answer is that we do it by composing lots of simpler functions $f_1, f_2, \ldots, f_L$:

$$
f(\mathbf{x}; \theta) = f_L \circ f_{L-1} \circ \cdots \circ f_1(\mathbf{x}). \tag{3}
$$
The function $f_1$ acts on the input $\mathbf{x}$, and each subsequent function acts on the output of the previous one, with $f_L$ producing the final output of the network.
We now have to define the intermediate functions $f_l$. The standard choice is an affine transformation followed by a fixed nonlinear function applied to each component of the result:

$$
\left[f_l(\mathbf{z})\right]_i = \varphi\!\left(\sum_j W^{(l)}_{ij} z_j + b^{(l)}_i\right). \tag{4}
$$

Here the matrix $W^{(l)}$ contains the weights and the vector $\mathbf{b}^{(l)}$ the biases of $f_l$: taken together over all $l$ they make up the parameters $\theta$ of the model. The function $\varphi$ is called the activation function4. Popular choices include the sigmoid function

$$
\sigma(z) = \frac{1}{1 + e^{-z}},
$$

probably more familiar to you as the Fermi-Dirac distribution, and the ReLU (rectified linear unit) function

$$
\operatorname{ReLU}(z) = \max(0, z).
$$
Equation 4 is sometimes written more compactly as

$$
f_l(\mathbf{z}) = \varphi\!\left(W^{(l)} \mathbf{z} + \mathbf{b}^{(l)}\right).
$$

You should understand this expression in the sense of vectorized functions (like NumPy’s ufuncs): $\varphi$ is applied separately to each component of its vector argument.
Activation functions should be differentiable but nonlinear. A linear activation would mean that the function compositions in Equation 3 collapse into matrix multiplications, producing a single overall weight matrix and bias vector. There would then be no point in having separate functions: a single one would do the same job.
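Here is Equation 4 in NumPy, together with a small two-step composition in the spirit of Equation 3. The sigmoid and ReLU are written as vectorized functions; the sizes are arbitrary and the variable names are just for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # acts elementwise, like a ufunc

def relu(z):
    return np.maximum(0.0, z)

def layer(a, W, b, phi=sigmoid):
    """One function f_l of Equation 4: affine map, then elementwise nonlinearity."""
    return phi(W @ a + b)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)   # weights and biases
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)                                  # a 3-dimensional "data" vector
output = layer(layer(x, W1, b1), W2, b2)                # composition of two functions
```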
The result of composing functions like this many times, each with their own set of weights and biases, is a highly complex function. The term deep learning, which you will often hear these days in the context of neural networks, refers to models with many function applications, or layers, which is the source of the networks’ expressiveness.
The network that we have described so far is called fully connected, because in Equation 4 the matrix of weights couples every input dimension to every output dimension. Most of the innovation in neural networks over the past decade has been in creating and refining architectures that exploit the structure of the data in some way. For example, in a network to be used for computer vision (image recognition), it makes sense that the model “knows” that two input pixels are near or far from each other. This means that the input dimensions corresponding to two nearby pixels should be treated differently at the outset (i.e. before training) than two separated pixels. Also, the way the network responds to an image should not be strongly dependent on translations of that image. This implies that the weight matrices should have a particular structure adapted to the data, rather than being completely general: sharing the same weights between different locations in the image, as in convolutional networks, is the classic example.
1.3.1 Why neural?
You may still be wondering where the network is, let alone the neurons. I’ve decided to present the neural network model in an ahistorical way, but if you look at other presentations you will tend to see the function Equation 4 represented graphically as

[Figure 1: the function of Equation 4 drawn as a node, with an edge connecting each input to the output.]

which reflects the dependence of the output on the inputs (and not much else). The result of composing several such functions then looks like this:

[Figure: several such nodes arranged in layers, each node connected to every node in the neighbouring layers.]
This is the network we have been talking about all along! Specifically, it is a network called a directed acyclic graph (DAG), meaning that the connections have a direction to them (input to output) and there are no loops.
Neural networks have long been used as a model for what goes on in the brain, with Figure 1 playing the role of the neuron. There are many differences, however, including the absence of any particular role for time5, and the fact that (real) neural networks are not DAGs! The DAG structure of artificial networks plays a decisive role in their training, as we’ll see in the next section. In general, I feel that neural networks have outgrown their biological inspiration, which is the reason I’ve downplayed it here.
2 Automatic differentiation and backpropagation
The ingredients we have assembled so far are:
- The cost function $C(\theta)$ of Equation 1, which measures how well the network fits the data.
- Training by gradient descent: $\theta \to \theta - \eta \nabla_\theta C(\theta)$.
- A neural network expressed as a composition of functions (Equation 3), with each function given by Equation 4.
In order to perform a gradient step in training a neural network we have to calculate the gradients of the cost with respect to all of the weights and biases of the network. The algorithm that does this efficiently is called backpropagation.
Backpropagation is an example of automatic differentiation (AD): the algorithmic evaluation of derivatives of a function. When people first hear about AD they sometimes guess that it must be something like this:

$$
\frac{\partial C}{\partial \theta_i} \approx \frac{C(\theta + \epsilon \mathbf{e}_i) - C(\theta)}{\epsilon}, \qquad \epsilon \ll 1, \tag{6}
$$

i.e. the numerical evaluation of a derivative (here $\mathbf{e}_i$ is the unit vector in the direction of parameter $\theta_i$). This is called numerical differentiation and it’s what you would be forced to do if you only had access to the function as a “black box” that you could evaluate, with no knowledge of its internal structure.
As well as being a bit more elegant than the brute force approach, there is another reason why AD is preferred over the (conceptually simpler) approach Equation 6. That approach would require us to vary each of the parameters in the network separately: think of those hundreds of billions of parameters of ChatGPT! AD uses the network structure to simplify things drastically, as we’ll now see.
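For completeness, here is what the numerical differentiation of Equation 6 looks like in code, assuming the parameters are packed into a single NumPy array and the cost is available as `cost(theta)` (an illustrative name). It is useful as a sanity check on an AD implementation, but notice that it needs one extra evaluation of the cost per parameter.

```python
import numpy as np

def numerical_gradient(cost, theta, eps=1e-6):
    """Forward-difference approximation to the gradient, one parameter at a time."""
    grad = np.zeros_like(theta, dtype=float)
    base = cost(theta)
    for i in range(theta.size):
        shifted = theta.copy()
        shifted.flat[i] += eps             # vary a single parameter
        grad.flat[i] = (cost(shifted) - base) / eps
    return grad
```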
2.1 Evaluating the derivatives
Let’s just plow on and evaluate the derivatives of the network output with respect to the parameters of one of the layers. The function $f(\mathbf{x}; \theta)$ is the composition in Equation 3, so this is going to be an exercise in the chain rule.

Evaluating the derivative with respect to weights and biases in layer $l$ calls for a bit of notation. Let’s write the intermediate outputs of the network as

$$
\mathbf{a}^{(l)} = f_l\!\left(\mathbf{a}^{(l-1)}\right), \qquad \mathbf{a}^{(0)} = \mathbf{x},
$$

and the output as $\mathbf{a}^{(L)} = f(\mathbf{x}; \theta)$.

Let’s evaluate the derivative of the output with respect to the biases $\mathbf{b}^{(l)}$ of layer $l$. A straightforward application of the chain rule gives

$$
\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{b}^{(l)}}
= \frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{a}^{(L-1)}}
\frac{\partial \mathbf{a}^{(L-1)}}{\partial \mathbf{a}^{(L-2)}}
\cdots
\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{a}^{(l)}}
\frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{b}^{(l)}}, \tag{8}
$$

because the outputs $\mathbf{a}^{(k)}$ with $k < l$ do not depend on $\mathbf{b}^{(l)}$. Here

$$
\frac{\partial \mathbf{a}^{(k)}}{\partial \mathbf{a}^{(k-1)}}
$$

is the Jacobian matrix of each layer and the final factor is the derivative of layer $l$’s output with respect to its own biases. We find a similar expression for the derivative with respect to the weights in the $l$-th layer: only the final factor changes, becoming $\partial \mathbf{a}^{(l)} / \partial W^{(l)}$.
In Equation 8 the matrices are composed by matrix multiplication. How should they be evaluated? There are two possibilities:
2.2 Forward accumulation
We go from right to left. Starting from the input, we work forwards through the network: we first evaluate the rightmost factor $\partial \mathbf{a}^{(l)} / \partial \mathbf{b}^{(l)}$, and then multiply it on the left by the Jacobian of each subsequent layer as soon as that layer has been evaluated.

The advantage of forward accumulation is that during evaluation we only have to store the current activation $\mathbf{a}^{(k)}$ and the accumulated product of Jacobians: each Jacobian can be discarded as soon as it has been used. The drawback, as we’ll see in a moment, is that every step of the accumulation is a matrix-matrix multiplication.
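As a sketch, here is forward accumulation for the network of Equation 4 with sigmoid activations, computing the derivative of the output with respect to the biases of a chosen layer $l$. The running product `D` is a full matrix, so each step is a matrix-matrix product. (The function and variable names are mine, not a standard API.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_mode_bias_jacobian(x, Ws, bs, l):
    """Forward accumulation of Equation 8: d(output)/d(b^(l)) for a stack of
    sigmoid layers with weight matrices Ws[k] and bias vectors bs[k]."""
    a = x
    D = None
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        if k == l:
            D = np.diag(sigmoid_prime(z))        # rightmost factor: d a^(l) / d b^(l)
        elif k > l:
            J = np.diag(sigmoid_prime(z)) @ W    # Jacobian d a^(k) / d a^(k-1)
            D = J @ D                            # accumulate: matrix-matrix product
        a = sigmoid(z)                           # only the current activation is kept
    return D                                     # shape: (output dim, dim of b^(l))
```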
2.3 Backpropagation
The alternative is that we go from left to right. Instantly we see a problem: we have to evaluate and store all the intermediate outputs $\mathbf{a}^{(k)}$ before we can even form the leftmost factors, since the Jacobians of the later layers depend on activations computed earlier. In other words we need a complete forward pass through the network before the multiplication can begin.
Now remember that we’re actually interested in calculating the gradient of the cost $C$, which is a scalar function of the network output, and so the chain in Equation 8 acquires an extra factor on the left, the row vector $\partial C / \partial \mathbf{a}^{(L)}$:

$$
\frac{\partial C}{\partial \mathbf{b}^{(l)}}
= \frac{\partial C}{\partial \mathbf{a}^{(L)}}
\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{a}^{(L-1)}}
\cdots
\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{a}^{(l)}}
\frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{b}^{(l)}}.
$$

This means that going from left to right involves only matrix-vector multiplications rather than matrix-matrix multiplications: the product of a row vector with a matrix is another row vector. We start with the (row) vector $\partial C / \partial \mathbf{a}^{(L)}$ and multiply it into the Jacobian of each layer in turn, working backwards (leftwards) through the network.
In AD this is known as backward accumulation. For the special case of neural networks it’s usually called backpropagation. Going backwards reduces the time complexity in favour of additional space (i.e. memory) complexity, as we have to store all the intermediate activations $\mathbf{a}^{(k)}$ from the forward pass in order to evaluate the Jacobians when they are needed in the backward pass.
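And here is the corresponding backward accumulation (backpropagation) for the same network with the quadratic cost, for a single training example. The forward pass stores every activation; the backward pass then carries a single row vector through the chain, so each step is only a matrix-vector product. Again, the names are just for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, Ws, bs):
    """Gradients of the (single-example) quadratic cost with respect to every
    weight matrix and bias vector, by backpropagation."""
    # Forward pass: store the pre-activations z and activations a of every layer.
    activations, zs = [x], []
    a = x
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Backward pass: delta starts as dC/d(output) and is pulled back layer by layer.
    grads_W, grads_b = [], []
    delta = activations[-1] - y                          # dC/da^(L) for the quadratic cost
    for k in reversed(range(len(Ws))):
        delta = delta * sigmoid_prime(zs[k])             # through the elementwise nonlinearity
        grads_b.append(delta)                            # dC/db^(k)
        grads_W.append(np.outer(delta, activations[k]))  # dC/dW^(k)
        delta = Ws[k].T @ delta                          # through the weights: matrix-vector product
    return list(reversed(grads_W)), list(reversed(grads_b))
```

Comparing the two sketches makes the trade-off explicit: backpropagation stores every activation from the forward pass, but obtains the gradients with respect to all of the weights and biases from a single backward sweep.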
3 Implementation
These days, getting started with training neural networks is easy due to the availability of many libraries that implement the common building blocks of popular neural architectures, as well as the automatic differentiation required to train them by gradient descent. Popular libraries include PyTorch, TensorFlow, and JAX. As usual, you don’t have to do it yourself.
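As a tiny example of what these libraries provide, here is a two-layer network like the one sketched earlier, written with PyTorch: calling `.backward()` on the cost runs backpropagation and stores the gradient of every parameter in its `.grad` attribute.

```python
import torch

# Parameters of a small fully connected network, flagged for gradient tracking.
W1 = torch.randn(5, 3, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)
W2 = torch.randn(2, 5, requires_grad=True)
b2 = torch.randn(2, requires_grad=True)

x, y = torch.randn(3), torch.randn(2)      # one (data, label) pair

output = torch.sigmoid(W2 @ torch.sigmoid(W1 @ x + b1) + b2)
cost = 0.5 * torch.sum((output - y) ** 2)  # single-example quadratic cost
cost.backward()                            # backpropagation

print(W1.grad.shape, b1.grad.shape)        # gradients, ready for a gradient descent step
```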
Still, it’s fun to take a look at how backpropagation is actually implemented in code. Among the simple versions of backpropagation you can find online I can recommend Nielsen (2015) and micrograd by Andrej Karpathy. He also has a YouTube video where he explains how it works in detail.
4 Further reading
There are a huge number of resources online. For a really beautiful gentle introduction try Nielsen (2015). For much more detail try Goodfellow, Bengio, and Courville (2016), though the field moves so fast that the latter parts are probably already a bit dated.
References
Footnotes
This is a slight exaggeration, but what people often do these days is augment symbolic approaches with neural approaches.↩︎
The term parameter is normally reserved for the $\theta$’s that appear in the definition of the model. Numbers like the learning rate that describe how the model is trained are usually referred to as hyperparameters.↩︎
GPT, the model from OpenAI underlying ChatGPT, has hundreds of billions of parameters!↩︎
Note that you don’t have to use the same activation function throughout. Typically the output layer of a network used for classification uses the softmax function to output probabilities over the labels.↩︎
Spiking neural networks are a model in which time plays a more serious role.↩︎