Every Gradient in Your Neural Network Is Just the Chain Rule
Hand-compute every gradient in a neural network. By the end, you'll know why we perform backpropagation to train a neural network.
Published
May 4, 2026
Category
Machine Learning
Reading time
18 min
Tags
calculus, neural networks, backpropagation
Neural networks are extremely powerful tools. In fact, it has been
shown that a neural network can approximate any Lipschitz
continuous function arbitrarily well. In machine learning, though, you are often just given a
dataset that you want a neural network to generalize from. But how do we “train” a neural network to
approximate this dataset well? We choose the parameters of a neural network to minimize a loss
function:
$$L_T(w) = \frac{1}{|T|} \sum_{(x_i, y_i) \in T} l(y_i, h(x_i, w))$$
where $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ is the training set,
$h$ is the predictor with weights $w$ that predicts value
$\hat{y} = h(x_i, w)$, and
$l(y, \hat{y}) = \|\hat{y} - y\|_2^2$ is a loss
defined for a single prediction (the loss can be defined differently, but this definition is common
for regression).
Do you see why minimizing this loss function corresponds to training a neural network to approximate
the dataset?
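To make the loss concrete, here is a minimal sketch that evaluates $L_T(w)$ for a toy predictor. The linear predictor `h` and the two-point dataset below are made up for illustration; they are not the network from this article.

```python
# Sketch of the training loss L_T(w): mean squared error over the training set.
# The linear predictor h and the toy dataset T are made-up for illustration.

def h(x, w):
    # A hypothetical predictor: simple dot product of input and weights.
    return sum(xi * wi for xi, wi in zip(x, w))

def loss(T, w):
    # L_T(w) = (1/|T|) * sum over (x_i, y_i) in T of (h(x_i, w) - y_i)^2
    return sum((h(x, w) - y) ** 2 for x, y in T) / len(T)

T = [((1.0, 2.0), 5.0), ((3.0, 1.0), 4.0)]  # made-up dataset
print(loss(T, (1.0, 1.0)))  # average squared error with weights (1, 1)
```

Choosing $w$ to shrink this number is exactly what "training to approximate the dataset" means.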
What do Gradients Tell Us?
A unique property of the gradient is that it points in the direction of maximum ascent. To see why this
is true, let’s take a look at the directional derivative:

$$D_u f(x) = u \cdot \nabla f(x)$$
This tells us by how much $f$ increases when we move along $u$. Now, let’s use the geometric
form of the dot product to study this closely:

$$u \cdot \nabla f(x) = \|u\|_2 \cdot \|\nabla f(x)\|_2 \cdot \cos(\theta)$$

where $\theta$ is the angle between $u$ and $\nabla f(x)$.
Well, we want to find the directions of maximal/minimal ascent. Because we are interested in
directions, we need to compare them in an apples-to-apples fashion, so we restrict $u$
to be a unit vector; we don’t want magnitude to influence the result here. That means
$\|u\|_2 = 1$. Additionally,
$\|\nabla f(x)\|_2$ remains fixed, as it’s just a number once $x$
is fixed. Thus, $\cos(\theta)$ is the only thing that can be maximized/minimized.
$\cos(\theta)$ is maximized when $\theta = 0$, which implies that $f$ increases most when you move in
the same direction as the gradient. Similarly, $\cos(\theta)$ is minimized when $\theta = 180^\circ$
(opposite), which implies that $f$ decreases most by going in the opposite direction of the gradient.
To minimize our original loss function with respect to the weights, we need to move along the
opposite direction of the gradient.
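A quick numerical sanity check of this argument: for $f(x, y) = x^2 + y^2$, sampling unit directions around a circle should find that the largest increase of $f$ occurs (approximately) along the gradient direction. The test point and sampling resolution below are arbitrary choices.

```python
import math

# Check that the gradient direction gives the largest increase of
# f(x, y) = x^2 + y^2 among unit directions sampled around a circle.

def f(p):
    x, y = p
    return x * x + y * y

def grad_f(p):
    x, y = p
    return (2 * x, 2 * y)  # analytic gradient

p = (1.0, 2.0)
eps = 1e-6
best_dir, best_gain = None, -float("inf")
for k in range(360):  # sample unit vectors u at 1-degree increments
    theta = math.radians(k)
    u = (math.cos(theta), math.sin(theta))
    gain = f((p[0] + eps * u[0], p[1] + eps * u[1])) - f(p)
    if gain > best_gain:
        best_gain, best_dir = gain, u

gx, gy = grad_f(p)
norm = math.hypot(gx, gy)
unit_grad = (gx / norm, gy / norm)
# The best sampled direction should approximately equal the unit gradient.
print(best_dir, unit_grad)
```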
Training a Simple Neural Network
We learned that moving along the opposite direction of the gradient helps minimize our loss function
defined above. This is the core idea behind gradient descent. You compute the loss of the neural
network given the current weights, then compute the gradient of the loss function with respect to
the weights, move a tiny step in the opposite direction of the gradient, and then repeat until some
pre-defined stopping criterion is met. The goal of gradient descent is to arrive at a local minimum.
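Here is the loop in its simplest form, on a one-dimensional loss $f(w) = (w - 3)^2$ rather than a neural network. The step size and iteration count are arbitrary choices.

```python
# Minimal gradient descent loop on a one-dimensional loss f(w) = (w - 3)^2,
# whose minimum is at w = 3. Step size and iteration count are arbitrary.

def grad(w):
    return 2 * (w - 3)  # derivative of (w - 3)^2

w = 0.0       # starting point
alpha = 0.1   # step size
for _ in range(100):
    w = w - alpha * grad(w)  # move opposite the gradient

print(w)  # converges toward 3
```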
To make this concrete, let’s compute every gradient by hand for the simplest network that’s still
interesting: three inputs, one hidden layer of two neurons, and a two-dimensional output.
As an example, let’s say we are estimating a student’s test scores. The input to the neural network
is

$$x = \begin{bmatrix} \text{hours studied} \\ \text{hours slept} \\ \text{prior grade} \end{bmatrix}$$

and the output is $\hat{y} = \begin{bmatrix} \text{math score} \\ \text{reading score} \end{bmatrix}$. To
keep things simple, our training set consists of just a single point

$$T = \left\{ \left( \begin{bmatrix} 4 \\ 7 \\ 0.85 \end{bmatrix}, \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix} \right) \right\}.$$
We also need starting values for the network’s parameters. In practice, we’d generate these
randomly, but here we’ll hand-pick them for our simple example. Recall from the diagram that
$w_{ji}^{(L)}$ is the weight from neuron $i$ in layer $L-1$ to neuron $j$ in layer $L$, and
$b_j^{(L)}$ is the bias on neuron $j$ in layer $L$.
Layer 1 connects the three inputs to the two hidden neurons, so we need 6 weights and 2 biases:
Before we can compute the gradient of the loss function, we first need to compute the loss. This
involves performing the “forward pass” on the network where we compute what the neural network
predicts given an input. Let’s perform the forward pass on the one training example in our training
set.
The formula to compute the value of neuron $j$ at layer $L$, denoted $a_j^{(L)}$, is to first compute
$z_j^{(L)} = b_j^{(L)} + \sum_{i=1}^{n} w_{ji}^{(L)} \cdot a_i^{(L-1)}$, where $n$ is
the number of neurons in the previous layer $L-1$. The value of $a_j^{(L)}$ is then the result of
passing $z_j^{(L)}$ through a non-linear activation function $\sigma$, where a common
choice is $\sigma(x) = \mathrm{ReLU}(x) = \max(0, x)$. The zero-th layer, however, is just defined as
the input layer, so $a_j^{(0)} = x_j$.
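The per-neuron formula can be sketched directly in code. The weights and biases below are made-up placeholders, not the hand-picked values from this article’s example.

```python
# Per-neuron forward pass for one layer: z_j = b_j + sum_i w_ji * a_i(prev),
# then a_j = ReLU(z_j). The weights W1 and biases b1 are made-up placeholders.

def relu(z):
    return max(0.0, z)

def layer_forward(a_prev, W, b):
    out = []
    for j in range(len(b)):  # one output neuron at a time
        z_j = b[j] + sum(W[j][i] * a_prev[i] for i in range(len(a_prev)))
        out.append(relu(z_j))
    return out

a0 = [4.0, 7.0, 0.85]                      # the example input x
W1 = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]  # made-up weights
b1 = [0.05, -0.1]                          # made-up biases
print(layer_forward(a0, W1, b1))
```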
The output layer’s activation depends on the task: sigmoid for binary classification, softmax for
multi-class, and often just the identity (no activation) for regression. Since we’re predicting
continuous test scores, we’ll use the identity here. That’s why the output layer in the diagram
shows $z_j^{(L)}$ rather than $a_j^{(L)}$. In practice, since our scores are normalized to $[0, 1]$,
you might add a sigmoid to keep predictions in range, but we’ll skip it to keep the chain-rule story
uncluttered.
Now, let’s run through a forward pass on this network with our one training example:
$$\left( \begin{bmatrix} 4 \\ 7 \\ 0.85 \end{bmatrix}, \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix} \right).$$
The zero-th layer is just our input: $a_1^{(0)} = 4$, $a_2^{(0)} = 7$, $a_3^{(0)} = 0.85$.
The loss looks really bad right now, but that’s expected! We just chose a bunch of weights and
biases at random, but we will improve our parameters in training so that our predictions become
better.
Representing the Forward Pass in Matrix Form
While the above representation works, keeping track of all of the weights and biases gets convoluted
fast. To ease the complexity, the literature often uses a notational trick to represent each layer
with a single matrix multiplication:

$$a^{(L)} = \sigma(b^{(L)} + W^{(L)} a^{(L-1)})$$
In this formula, $a^{(L)}$ is the value of the neurons in layer $L$, and
$a^{(L-1)}$ is the value of the neurons in layer $L-1$ (in this case $a^{(L)}$ and
$a^{(L-1)}$ are vectors). The weight matrix $W^{(L)}$ is defined such that the $j$-th row
and $i$-th column of the matrix is the weight going from neuron $i$ in layer $L-1$ to neuron $j$ in
layer $L$. $\sigma$ is the activation function that is applied element-wise to the vector
$z^{(L)} = b^{(L)} + W^{(L)} a^{(L-1)}$.
It’s worth pondering why this matrix multiplication exactly reproduces the per-neuron calculation we
did above. Conceptually speaking, the matrix multiplication takes the previous layer and computes the
dot product with the weights in each row of the weight matrix. Each row $j$ of the weight matrix
contains all the weights needed to go from the previous layer $L-1$ to neuron $j$ in the new layer
$L$ (based on how we defined $W$ above). We then apply the activation $\sigma$ and get the values of
the neurons at layer $L$.
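A quick check that the two views agree, using NumPy and made-up parameter values (not the article’s example weights):

```python
import numpy as np

# Check that the matrix form sigma(b + W a) matches the per-neuron computation.
# The weights and biases are made-up placeholders.

relu = lambda z: np.maximum(0.0, z)

a_prev = np.array([4.0, 7.0, 0.85])
W = np.array([[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]])  # row j holds neuron j's weights
b = np.array([0.05, -0.1])

# Matrix form: one multiply handles the whole layer.
a_matrix = relu(b + W @ a_prev)

# Per-neuron form: explicit dot product for each row j.
a_loop = np.array([relu(b[j] + W[j] @ a_prev) for j in range(len(b))])

print(a_matrix, a_loop)  # identical
```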
Backward Pass
Now that we ran through the forward pass and computed a loss (which we saw was bad), the natural
question is how do we improve? That’s where the backward pass comes in. Our goal now is to change
the parameters of the neural network to reduce the overall loss. As discussed above, we do this by
iteratively moving the parameters in the opposite direction of the gradient of the loss with respect
to the parameter we are moving. So in our simple neural network, we want to compute
$\frac{\partial L}{\partial w_{11}^{(2)}}$, $\frac{\partial L}{\partial w_{12}^{(2)}}$,
$\frac{\partial L}{\partial w_{21}^{(2)}}$, $\frac{\partial L}{\partial w_{22}^{(2)}}$,
$\frac{\partial L}{\partial b_1^{(2)}}$, $\frac{\partial L}{\partial b_2^{(2)}}$ (gradients for the
second layer), and $\frac{\partial L}{\partial w_{11}^{(1)}}$,
$\frac{\partial L}{\partial w_{12}^{(1)}}$, $\frac{\partial L}{\partial w_{13}^{(1)}}$,
$\frac{\partial L}{\partial w_{21}^{(1)}}$, $\frac{\partial L}{\partial w_{22}^{(1)}}$,
$\frac{\partial L}{\partial w_{23}^{(1)}}$, $\frac{\partial L}{\partial b_1^{(1)}}$,
$\frac{\partial L}{\partial b_2^{(1)}}$ (gradients for the first layer).
As a reminder, the loss function we defined earlier was:

$$L_T(w) = \frac{1}{N} \sum_{i=1}^{N} \|h(x_i, w) - y_i\|_2^2$$
Looking at our loss function, our weights don’t directly show up in it, so a natural question to ask
is: are their gradients just 0? Well, no; our parameters obviously play a role in the loss:
$z^{(2)}$ is a function of all of our parameters. This means we can apply the chain rule.
From calculus, if we are given $f(g(x))$ and we want to find the derivative with respect to $x$, we
apply the chain rule (see Appendix for an intuitive derivation):

$$\frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Given the chain rule, we can compute our derivatives. Working backward (hence the name backward
pass), let’s figure out how our loss is affected by $z_1^{(2)}$ and $z_2^{(2)}$ (since $N = 1$ in our
training set, $\frac{2}{N}$ simplifies to $2$):
While we don’t have direct control over $z_1^{(2)}$ and $z_2^{(2)}$, and
$\frac{\partial L}{\partial z_1^{(2)}}$ and $\frac{\partial L}{\partial z_2^{(2)}}$ weren’t in the
original list of partials we cared about, these intermediary partials will be helpful to us when
computing the partials with respect to the weights/biases for the second layer of the neural
network:
To add some more detail, let’s step through how we computed
$\frac{\partial L}{\partial w_{11}^{(2)}}$ carefully. Looking at the graph of the neural network we
defined above, we see that $w_{11}^{(2)}$ is connected to the loss only through $z_1^{(2)}$. Thus,
we first figure out how the loss is affected by $z_1^{(2)}$:
$\frac{\partial L}{\partial z_1^{(2)}}$. Then we figure out how $w_{11}^{(2)}$ affects $z_1^{(2)}$:
$\frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}$. Applying the chain rule:

$$\frac{\partial L}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}.$$
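One way to convince yourself the chain rule gives the right answer is a finite-difference check. The sketch below uses a stripped-down one-weight version of the same structure, $L = (z - y)^2$ with $z = b + w \cdot a$, and made-up numbers rather than the article’s network:

```python
# Finite-difference check of the chain rule: dL/dw = (dL/dz) * (dz/dw).
# Here L = (z - y)^2 with z = b + w * a, using made-up values for a, b, y, w.

a, b, y = 0.5, 0.1, 0.9
w = 0.2

def loss(w):
    z = b + w * a
    return (z - y) ** 2

# Chain rule: dL/dz = 2 * (z - y), and dz/dw = a.
z = b + w * a
analytic = 2 * (z - y) * a

# Finite-difference approximation of the same derivative.
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytic, numeric)  # the two should agree closely
```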
Now that we have the partial derivatives with respect to the parameters of the second layer of the
network, we can keep going backwards and compute the partial derivatives with respect to the
parameters of the first layer of the network:
The backward pass for this layer is much more interesting and it’s worth stepping through carefully.
Let’s walk through how we computed $\frac{\partial L}{\partial w_{11}^{(1)}}$. Looking at the graph,
we see that $w_{11}^{(1)}$ is connected to the loss through $a_1^{(1)}$, which impacts both
$z_1^{(2)}$ and $z_2^{(2)}$. Thus, we figure out how the loss is affected by $z_1^{(2)}$ and
$z_2^{(2)}$: $\frac{\partial L}{\partial z_1^{(2)}}$, $\frac{\partial L}{\partial z_2^{(2)}}$. Then
we figure out how $a_1^{(1)}$ impacts $z_1^{(2)}$ and $z_2^{(2)}$:
$\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}}$, $\frac{\partial z_2^{(2)}}{\partial a_1^{(1)}}$. We
then figure out how $z_1^{(1)}$ impacts $a_1^{(1)}$:
$\frac{\partial a_1^{(1)}}{\partial z_1^{(1)}}$. Then finally, we can figure out how $w_{11}^{(1)}$
impacts $z_1^{(1)}$: $\frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}$. Given all of these
relationships, we can apply the chain rule to compute the partial derivative of the loss with
respect to $w_{11}^{(1)}$:

$$\frac{\partial L}{\partial w_{11}^{(1)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}.$$
This argument can be generalized to the other partial derivatives we computed.
Representing the Backward Pass in Matrix Form
If you thought that was a lot of numbers to keep track of, you aren’t alone. Just like how we saw
for the forward pass, there is also a notational trick to package up the backwards pass using
matrices.
From above, let’s say we wish to find $\frac{\partial L}{\partial w_{ji}^{(L)}}$. Let $k$ be the
index of the training example in our dataset. We observe that:

$$\frac{\partial L}{\partial w_{ji}^{(L)}} = \sum_k \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot a_{ki}^{(L-1)}$$

Here we have the extra index $k$ so we can refer to the values of $z$ and $a$ for the $k$-th training
example.
Now, let’s package this so that $dZ^{(L)}$ has entries where $(dZ^{(L)})_{kj}$ is the gradient of
the loss with respect to $z_{kj}^{(L)}$, the pre-activation of neuron $j$ in layer $L$ for the
$k$-th data point. Additionally, $A^{(L-1)}$ has entries where $(A^{(L-1)})_{ki}$ is the value of
neuron $i$ in layer $L-1$ for the $k$-th data point.
Then, we just want to sum over all of the data points. We can express this as a matrix
multiplication:

$$\frac{\partial L}{\partial W^{(L)}} = (dZ^{(L)})^T \cdot A^{(L-1)}$$

The $ji$-th entry of $(dZ^{(L)})^T \cdot A^{(L-1)}$ is exactly
$\sum_k (dZ^{(L)})_{kj} \cdot a_{ki}^{(L-1)}$, which matches our sum from above.
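This identity is easy to verify numerically. The shapes and values below are made up:

```python
import numpy as np

# Verify that (dZ)^T @ A reproduces the entrywise sum over training examples k:
# dL/dw_ji = sum_k dZ[k, j] * A[k, i]. Shapes and values are made-up.

rng = np.random.default_rng(0)
K, J, I = 5, 2, 3          # examples, layer-L neurons, layer-(L-1) neurons
dZ = rng.normal(size=(K, J))
A = rng.normal(size=(K, I))

dW_matrix = dZ.T @ A        # the packaged matrix form

# Entrywise sum, exactly as in the formula above.
dW_loop = np.zeros((J, I))
for j in range(J):
    for i in range(I):
        dW_loop[j, i] = sum(dZ[k, j] * A[k, i] for k in range(K))

print(np.allclose(dW_matrix, dW_loop))  # True
```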
For the bias gradient, only $z_{kj}^{(L)}$ depends on $b_j^{(L)}$ (with
$\frac{\partial z_{kj}^{(L)}}{\partial b_j^{(L)}} = 1$), so the chain rule sums over
training examples:

$$\frac{\partial L}{\partial b_j^{(L)}} = \sum_k \frac{\partial L}{\partial z_{kj}^{(L)}}$$
For the activations of the previous layer, the structure is similar, but now the sum is over output
neurons, since each activation $a_{ki}^{(L-1)}$ fans out into every neuron in layer $L$:

$$\frac{\partial L}{\partial a_{ki}^{(L-1)}} = \sum_j \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot w_{ji}^{(L)}, \quad \text{or in matrix form,} \quad \frac{\partial L}{\partial A^{(L-1)}} = dZ^{(L)} \cdot W^{(L)}$$
Finally, to push that gradient through the activation function for the next backward step, we
multiply element-wise by $\sigma'$:

$$dZ^{(L-1)} = \frac{\partial L}{\partial A^{(L-1)}} \odot \sigma'(Z^{(L-1)})$$

where $\odot$ is the Hadamard product. With
these formulas, every layer’s gradients reuse the same patterns; just substitute $L \to L-1$ and
keep going.
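Putting the three formulas together, one backward step through a layer might be sketched as follows. The shapes and values are made up, and `relu_grad` and `backward_layer` are hypothetical helper names:

```python
import numpy as np

# One backward step through a layer, using the matrix formulas above:
# dW = dZ^T A, db = column sums of dZ, dA_prev = dZ W, then
# dZ_prev = dA_prev (elementwise *) relu'(Z_prev). Inputs are made-up.

def relu_grad(z):
    return (z > 0).astype(float)  # derivative of ReLU, elementwise

def backward_layer(dZ, W, A_prev, Z_prev):
    dW = dZ.T @ A_prev                     # gradient for the layer's weights
    db = dZ.sum(axis=0)                    # gradient for the layer's biases
    dA_prev = dZ @ W                       # gradient into previous activations
    dZ_prev = dA_prev * relu_grad(Z_prev)  # push through the activation
    return dW, db, dZ_prev

rng = np.random.default_rng(1)
dZ = rng.normal(size=(4, 2))       # 4 examples, 2 neurons in layer L
W = rng.normal(size=(2, 3))        # 2 layer-L neurons, 3 layer-(L-1) neurons
A_prev = rng.normal(size=(4, 3))
Z_prev = rng.normal(size=(4, 3))

dW, db, dZ_prev = backward_layer(dZ, W, A_prev, Z_prev)
print(dW.shape, db.shape, dZ_prev.shape)  # shapes match W, b, and Z_prev
```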
Applications to Training
Now that we’ve computed the forward pass and then computed the partial derivative of the loss with
respect to a specific parameter $\theta$, training the neural network amounts to nudging the
parameter $\theta$ by a small amount in the opposite direction of the partial derivative:

$$\theta \leftarrow \theta - \alpha \cdot \frac{\partial L}{\partial \theta}$$

where $\alpha \in \mathbb{R}$ is the small step size.
In fact, you can try nudging our parameters now by the partial derivatives we found in the
Backward Pass section. For example, if we took $\alpha = 0.01$, we can update
$w_{11}^{(2)}$:

$$w_{11}^{(2)} \leftarrow 0.2 - 0.01 \cdot (-1.787775) = 0.21787775$$
Now update all parameters in the network using the same process, and redo the forward pass. The loss
will drop. Repeat over many iterations and the network gradually fits the data, and every gradient
that drives it really was just the chain rule.
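The whole procedure, forward pass, backward pass, and update, can be sketched end to end. This uses the same 3-2-2 architecture on the single training example, but with random initial weights rather than the article’s hand-picked ones:

```python
import numpy as np

# End-to-end sketch: train a 3-2-2 network (ReLU hidden layer, identity
# output) on the single training example via gradient descent. The random
# initialization here does NOT reproduce the article's hand-picked weights.

rng = np.random.default_rng(42)
x = np.array([[4.0, 7.0, 0.85]])   # hours studied, hours slept, prior grade
y = np.array([[0.9, 0.8]])         # math score, reading score

W1, b1 = rng.normal(scale=0.1, size=(2, 3)), np.zeros(2)
W2, b2 = rng.normal(scale=0.1, size=(2, 2)), np.zeros(2)
alpha = 0.01

for step in range(2000):
    # Forward pass: ReLU hidden layer, identity output.
    Z1 = x @ W1.T + b1
    A1 = np.maximum(0.0, Z1)
    Z2 = A1 @ W2.T + b2
    loss = np.mean(np.sum((Z2 - y) ** 2, axis=1))

    # Backward pass: the matrix formulas from the previous section.
    dZ2 = 2 * (Z2 - y) / len(x)
    dW2, db2 = dZ2.T @ A1, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2) * (Z1 > 0)
    dW1, db1 = dZ1.T @ x, dZ1.sum(axis=0)

    # Gradient descent update on every parameter.
    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1

print(loss)  # near zero after training
```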
Appendix
Derivation of the Chain Rule
Earlier, we said that if you compose functions and then want to take the derivative, you must apply
the chain rule:

$$\frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx}$$
$\frac{d}{dx} f(g(x))$ asks how $f(g(x))$ changes with respect to $x$, or more intuitively, how does
$f(g(x))$ change if we nudge $x$ slightly. Our nudge to $x$ is denoted as $dx$.
Now, the nudge to $g(x)$ is denoted as $dg$, where $dg = \frac{dg}{dx} \cdot dx$. In words, this
tells us to examine how much $g$ changes with respect to $x$, and then multiply by the nudge in $x$.
Finally, the nudge to $f(g(x))$ is denoted as $df$, where $df = \frac{df}{dg} \cdot dg$. Again, this
tells us to examine how much $f$ changes with respect to $g$, and then multiply by the nudge in $g$.
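The nudge picture is easy to check numerically; $f(u) = \sin(u)$ and $g(x) = x^2$ below are arbitrary example functions:

```python
import math

# Numeric illustration of the chain rule for f(g(x)) with f(u) = sin(u) and
# g(x) = x^2, so df/dx = cos(x^2) * 2x. The functions are arbitrary examples.

def fg(x):
    return math.sin(x ** 2)

x = 1.3
analytic = math.cos(x ** 2) * 2 * x   # df/dg * dg/dx

# "Nudge" version: push a tiny dx through g, then through f.
dx = 1e-7
numeric = (fg(x + dx) - fg(x)) / dx

print(analytic, numeric)
```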