§ Machine Learning · May 2026

Every Gradient in Your Neural Network Is Just the Chain Rule

We hand-compute every gradient in a small neural network. By the end, you'll know why we perform backpropagation to train neural networks.

Published: May 4, 2026
Category: Machine Learning
Reading time: 18 min
Tags: calculus, neural networks, backpropagation

Neural networks are extremely powerful tools. In fact, a sufficiently large neural network can approximate any continuous function on a compact domain arbitrarily well (the universal approximation theorem). In practice, though, you are usually just given a dataset that you want a neural network to fit and generalize from. So how do we "train" a neural network to approximate this dataset well? We choose the parameters of the neural network to minimize a loss function:

$$L_T(\mathbf{w}) = \frac{1}{\lvert T \rvert}\sum_{\mathbf{x}_i, \mathbf{y}_i \in T} l(\mathbf{y}_i, h(\mathbf{x}_i, \mathbf{w}))$$

where $T = \{(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)\}$ is the training set, $h$ is the predictor with weights $\mathbf{w}$ that predicts the value $\hat{\mathbf{y}} = h(\mathbf{x}_i, \mathbf{w})$, and $l(\mathbf{y}, \hat{\mathbf{y}}) = \left\lVert \hat{\mathbf{y}} - \mathbf{y} \right\rVert_2^2$ is a loss defined for a single prediction (the loss can be defined differently, but this definition is common for regression).
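
To make this concrete in code, here is a minimal NumPy sketch of this loss; the names `mse_loss`, `dataset_loss`, and `predict` are just illustrative, not from any particular library:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Squared L2 distance between the prediction and the target for one example.
    return np.sum((y_pred - y_true) ** 2)

def dataset_loss(predict, w, T):
    # Average the per-example loss over the training set T = [(x_1, y_1), ..., (x_N, y_N)].
    return np.mean([mse_loss(y_i, predict(x_i, w)) for x_i, y_i in T])
```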

Do you see why minimizing this loss function corresponds to training a neural network to approximate the dataset?

What Do Gradients Tell Us?

A unique property of the gradient is that it points in the direction of steepest ascent. To see why this is true, let's take a look at the directional derivative:

$$D_{\mathbf{u}} f(\mathbf{x}) = \mathbf{u} \cdot \nabla f(\mathbf{x})$$

This tells us how much $f$ changes as we move along $\mathbf{u}$. Now, let's use the geometric form of the dot product to study this closely:

$$\mathbf{u} \cdot \nabla f(\mathbf{x}) = \left\lVert \mathbf{u} \right\rVert_2 \cdot \left\lVert \nabla f(\mathbf{x}) \right\rVert_2 \cdot \cos(\theta)$$

where $\theta$ is the angle between $\mathbf{u}$ and $\nabla f(\mathbf{x})$.

We want to find the directions of maximal and minimal ascent. Since only the direction matters, we compare candidates on equal footing by restricting $\mathbf{u}$ to be a unit vector, so that magnitude doesn't influence the result: $\left\lVert \mathbf{u} \right\rVert_2 = 1$. Additionally, $\left\lVert \nabla f(\mathbf{x}) \right\rVert_2$ is a fixed number, since $\mathbf{x}$ is fixed. Thus, $\cos(\theta)$ is the only term we can maximize or minimize. $\cos(\theta)$ is maximized when $\theta = 0$, which implies that $f$ increases most when you move in the same direction as the gradient. Similarly, $\cos(\theta)$ is minimized when $\theta = 180^\circ$ (opposite), which implies that $f$ decreases most when you move in the direction opposite the gradient.
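
Here is a quick numerical sanity check of this claim (a standalone sketch with an arbitrarily chosen quadratic $f$ and point $\mathbf{x}$): among many unit directions, the one with the largest directional derivative lines up with the normalized gradient.

```python
import numpy as np

# f(x) = x0**2 + 3*x1**2 has gradient [2*x0, 6*x1].
def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
g = grad_f(x)

# Compare the directional derivative u . grad(f) over many unit directions.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
slopes = dirs @ g

best = dirs[np.argmax(slopes)]
print(best, g / np.linalg.norm(g))  # the best direction aligns with the normalized gradient
```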

To minimize our original loss function with respect to the weights, we need to move in the direction opposite the gradient.

Training a Simple Neural Network

We learned that moving in the direction opposite the gradient helps minimize the loss function defined above. This is the core idea behind gradient descent. You compute the loss of the neural network given the current weights, compute the gradient of the loss with respect to the weights, move a tiny step in the direction opposite the gradient, and repeat until some pre-defined stopping criterion is met. The goal of gradient descent is to arrive at a local minimum.
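
A minimal sketch of that loop, assuming a helper `loss_and_grad(w)` that returns the loss and its gradient at `w` (the learning rate, step count, and tolerance here are arbitrary choices):

```python
import numpy as np

def gradient_descent(loss_and_grad, w0, lr=0.01, steps=1000, tol=1e-8):
    # loss_and_grad(w) is assumed to return (loss value, gradient of the loss at w).
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        loss, grad = loss_and_grad(w)
        if loss < tol:        # a simple pre-defined stopping criterion
            break
        w -= lr * grad        # take a small step against the gradient
    return w
```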

To make this concrete, let’s compute every gradient by hand for the simplest network that’s still interesting: three inputs, one hidden layer of two neurons, and a two-dimensional output.

[Figure: network diagram. Inputs $x_1, x_2, x_3$ connect to hidden neurons $a_1^{(1)}, a_2^{(1)}$ through weights $w_{ji}^{(1)}$, which connect to outputs $z_1^{(2)}, z_2^{(2)}$ through weights $w_{ji}^{(2)}$.]

As an example, let's say we are estimating a student's test scores. The input to the neural network is $\mathbf{x} = \begin{bmatrix} \text{hours studied} \\ \text{hours slept} \\ \text{prior grade} \end{bmatrix}$ and the output is $\hat{\mathbf{y}} = \begin{bmatrix} \text{math score} \\ \text{reading score} \end{bmatrix}$. To keep things simple, our training set consists of just a single point $T = \left\{ \left( \begin{bmatrix} 4 \\ 7 \\ 0.85 \end{bmatrix}, \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix} \right) \right\}$.

We also need starting values for the network's parameters. In practice, we'd generate these randomly, but here we'll hand-pick them for our simple example. Recall from the diagram that $w_{ji}^{(L)}$ is the weight from neuron $i$ in layer $L-1$ to neuron $j$ in layer $L$, and $b_j^{(L)}$ is the bias on neuron $j$ in layer $L$.

Layer 1 connects the three inputs to the two hidden neurons, so we need 6 weights and 2 biases:

$$\begin{aligned} w_{11}^{(1)} &= 0.1, & w_{12}^{(1)} &= 0.2, & w_{13}^{(1)} &= -0.1 \\ w_{21}^{(1)} &= -0.2, & w_{22}^{(1)} &= 0.1, & w_{23}^{(1)} &= 0.3 \\ b_1^{(1)} &= 0.1, & b_2^{(1)} &= -0.1 \end{aligned}$$

Layer 2 connects the two hidden neurons to the two outputs, so we need 4 weights and 2 biases:

$$\begin{aligned} w_{11}^{(2)} &= 0.2, & w_{12}^{(2)} &= -0.1 \\ w_{21}^{(2)} &= 0.1, & w_{22}^{(2)} &= 0.2 \\ b_1^{(2)} &= 0.05, & b_2^{(2)} &= 0.1 \end{aligned}$$
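
For the code sketches that follow, we can load these hand-picked parameters into NumPy arrays, with row $j$, column $i$ of each matrix holding $w_{ji}$ (the array names are just for illustration):

```python
import numpy as np

# Row j, column i of W holds the weight w_ji from neuron i to neuron j.
W1 = np.array([[ 0.1, 0.2, -0.1],
               [-0.2, 0.1,  0.3]])
b1 = np.array([0.1, -0.1])

W2 = np.array([[0.2, -0.1],
               [0.1,  0.2]])
b2 = np.array([0.05, 0.1])

x = np.array([4.0, 7.0, 0.85])   # hours studied, hours slept, prior grade
y = np.array([0.9, 0.8])         # math score, reading score
```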

Forward Pass

Before we can compute the gradient of the loss function, we first need to compute the loss. This involves performing the “forward pass” on the network where we compute what the neural network predicts given an input. Let’s perform the forward pass on the one training example in our training set.

To compute the value of neuron $j$ at layer $L$ (written $a_j^{(L)}$), we first compute $z_j^{(L)} = b_j^{(L)} + \sum_{i=1}^n w_{ji}^{(L)} \cdot a_i^{(L-1)}$, where $n$ is the number of neurons in the previous layer $L-1$. The value of $a_j^{(L)}$ is then the result of passing $z_j^{(L)}$ through a non-linear activation function $\sigma$, where a common choice is $\sigma(x) = \mathrm{ReLU}(x) = \max(0, x)$. The zero-th layer, however, is just defined as the input layer, so $a_j^{(0)} = x_j$.

The output layer's activation depends on the task: sigmoid for binary classification, softmax for multi-class classification, and often just the identity (no activation) for regression. Since we're predicting continuous test scores, we'll use the identity here. That's why the output layer in the diagram shows $z_j^{(L)}$ rather than $a_j^{(L)}$. In practice, since our scores are normalized to $[0, 1]$, you might add a sigmoid to keep predictions in range, but we'll skip it to keep the chain-rule story uncluttered.

Now, let's run through a forward pass on this network with our one training example: $\left( \begin{bmatrix} 4 \\ 7 \\ 0.85 \end{bmatrix}, \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix} \right)$. The zero-th layer is just our input:

$$\begin{aligned} x_1 &= 4 \\ x_2 &= 7 \\ x_3 &= 0.85 \end{aligned}$$

Then we compute the first hidden layer:

$$\begin{aligned}
z_1^{(1)} &= b_1^{(1)} + w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + w_{13}^{(1)}x_3 = 0.1 + (0.1 \cdot 4) + (0.2 \cdot 7) + (-0.1 \cdot 0.85) = 1.815 \\
z_2^{(1)} &= b_2^{(1)} + w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + w_{23}^{(1)}x_3 = -0.1 + (-0.2 \cdot 4) + (0.1 \cdot 7) + (0.3 \cdot 0.85) = 0.055 \\
a_1^{(1)} &= \max(0, z_1^{(1)}) = \max(0, 1.815) = 1.815 \\
a_2^{(1)} &= \max(0, z_2^{(1)}) = \max(0, 0.055) = 0.055
\end{aligned}$$

And our final, output layer becomes:

$$\begin{aligned}
z_1^{(2)} &= b_1^{(2)} + w_{11}^{(2)}a_1^{(1)} + w_{12}^{(2)}a_2^{(1)} = 0.05 + (0.2 \cdot 1.815) + (-0.1 \cdot 0.055) = 0.4075 \\
z_2^{(2)} &= b_2^{(2)} + w_{21}^{(2)}a_1^{(1)} + w_{22}^{(2)}a_2^{(1)} = 0.1 + (0.1 \cdot 1.815) + (0.2 \cdot 0.055) = 0.2925
\end{aligned}$$

Finally, we can compute the loss for this example:

$$l(\mathbf{y}_i, h(\mathbf{x}_i, \mathbf{w})) = l\left( \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix}, \begin{bmatrix} 0.4075 \\ 0.2925 \end{bmatrix} \right) = (0.4075 - 0.9)^2 + (0.2925 - 0.8)^2 = 0.5001125$$
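
Here is a short NumPy sketch of this forward pass, reusing the arrays defined earlier; running it should reproduce $\mathbf{z}^{(2)} = [0.4075, 0.2925]$ and the loss $0.5001125$:

```python
def relu(z):
    return np.maximum(0.0, z)

# Forward pass, layer by layer, with the parameters defined above.
z1 = W1 @ x + b1      # [1.815, 0.055]
a1 = relu(z1)         # ReLU leaves both positive values unchanged
z2 = W2 @ a1 + b2     # [0.4075, 0.2925]; identity activation on the output layer

loss = np.sum((z2 - y) ** 2)
print(z2, loss)       # loss = 0.5001125
```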

The loss looks really bad right now, but that's expected! We just picked arbitrary starting weights and biases; training will improve the parameters so that our predictions become better.

Representing the Forward Pass in Matrix Form

While the above representation works, keeping track of all of the weights and biases gets convoluted fast. To ease the complexity, the literature often represents the forward pass with matrix multiplication, writing each layer as:

$$\mathbf{a}^{(L)} = \sigma\left(\mathbf{b}^{(L)} + W^{(L)} \mathbf{a}^{(L-1)}\right)$$

In this formula, $\mathbf{a}^{(L)}$ is the vector of neuron values in layer $L$, and $\mathbf{a}^{(L-1)}$ is the vector of neuron values in layer $L-1$. The weight matrix $W^{(L)}$ is defined such that the entry in row $j$, column $i$ is the weight going from neuron $i$ in layer $L-1$ to neuron $j$ in layer $L$. $\sigma$ is the activation function, applied element-wise to the vector $\mathbf{z}^{(L)} = \mathbf{b}^{(L)} + W^{(L)} \mathbf{a}^{(L-1)}$.

It's worth pausing to see why this matrix multiplication reproduces the neuron-by-neuron calculation we did above. Conceptually, the multiplication takes the previous layer's values and computes their dot product with the weights in each row of the weight matrix. Row $j$ of the weight matrix contains all the weights going from the previous layer $L-1$ into neuron $j$ of layer $L$ (based on how we defined $W$ above). We then apply the activation $\sigma$ and get the values of the neurons at layer $L$.
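
As a sketch, a whole layer then collapses to one line of NumPy (reusing `relu` and the parameter arrays from the earlier snippets; `dense_layer` is an illustrative name):

```python
def dense_layer(a_prev, W, b, activation=relu):
    # One whole layer in matrix form: a = sigma(b + W a_prev).
    return activation(b + W @ a_prev)

identity = lambda z: z
a1 = dense_layer(x, W1, b1)                         # hidden layer with ReLU
z2 = dense_layer(a1, W2, b2, activation=identity)   # output layer, no activation
```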

Backward Pass

Now that we ran through the forward pass and computed a loss (which we saw was bad), the natural question is: how do we improve? That's where the backward pass comes in. Our goal now is to change the parameters of the neural network to reduce the overall loss. As discussed above, we do this by iteratively moving each parameter in the direction opposite the gradient of the loss with respect to that parameter. So in our simple neural network, we want to compute $\frac{\partial L}{\partial w_{11}^{(2)}}$, $\frac{\partial L}{\partial w_{12}^{(2)}}$, $\frac{\partial L}{\partial w_{21}^{(2)}}$, $\frac{\partial L}{\partial w_{22}^{(2)}}$, $\frac{\partial L}{\partial b_1^{(2)}}$, $\frac{\partial L}{\partial b_2^{(2)}}$ (gradients for the second layer), and $\frac{\partial L}{\partial w_{11}^{(1)}}$, $\frac{\partial L}{\partial w_{12}^{(1)}}$, $\frac{\partial L}{\partial w_{13}^{(1)}}$, $\frac{\partial L}{\partial w_{21}^{(1)}}$, $\frac{\partial L}{\partial w_{22}^{(1)}}$, $\frac{\partial L}{\partial w_{23}^{(1)}}$, $\frac{\partial L}{\partial b_1^{(1)}}$, $\frac{\partial L}{\partial b_2^{(1)}}$ (gradients for the first layer).

As a reminder, the loss function we defined earlier was:

$$L_T(\mathbf{w}) = \frac{1}{\lvert T \rvert}\sum_{\mathbf{x}_i, \mathbf{y}_i \in T} l(\mathbf{y}_i, h(\mathbf{x}_i, \mathbf{w})) = \frac{1}{\lvert T \rvert}\sum_{\mathbf{x}_i, \mathbf{y}_i \in T} \left\lVert\mathbf{z}^{(2)} - \mathbf{y}_i\right\rVert_2^2$$

Looking at our loss function, the weights don't show up in it directly, so a natural question to ask is: are their gradients just $0$? No: our parameters clearly play a role in the loss, because $\mathbf{z}^{(2)}$ is a function of all of them. This means we can apply the chain rule. From calculus, if we are given $f(g(x))$ and we want the derivative with respect to $x$, we apply the chain rule (see the Appendix for an intuitive derivation):

$$\frac{d f(g(x))}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
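
As a quick standalone example (not tied to our network), take $f(u) = u^2$ and $g(x) = \sin x$. Then:

$$\frac{d}{dx}\left(\sin x\right)^2 = \underbrace{2\sin x}_{df/dg} \cdot \underbrace{\cos x}_{dg/dx} = 2\sin x \cos x$$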

Given the chain rule, we can compute our derivatives. Working backward (hence the name backward pass), let's first figure out how our loss is affected by $z_1^{(2)}$ and $z_2^{(2)}$. Differentiating the squared error brings down a factor of $2$, and the average contributes a factor of $\frac{1}{N}$; since $N = 1$ in our training set, $\frac{2}{N}$ simplifies to $2$:

$$\begin{aligned}
\frac{\partial L}{\partial z_1^{(2)}} &= \frac{2}{N} \cdot (z_1^{(2)} - y_1) = 2 \cdot (0.4075 - 0.9) = -0.985 \\
\frac{\partial L}{\partial z_2^{(2)}} &= \frac{2}{N} \cdot (z_2^{(2)} - y_2) = 2 \cdot (0.2925 - 0.8) = -1.015
\end{aligned}$$

While we don't have direct control over $z_1^{(2)}$ and $z_2^{(2)}$, and $\frac{\partial L}{\partial z_1^{(2)}}$ and $\frac{\partial L}{\partial z_2^{(2)}}$ weren't in the original list of partials we cared about, these intermediary partials will be helpful when computing the partials with respect to the weights and biases of the second layer of the neural network:

$$\begin{aligned}
\frac{\partial L}{\partial w_{11}^{(2)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot a_1^{(1)} = -0.985 \cdot 1.815 = -1.787775 \\
\frac{\partial L}{\partial w_{12}^{(2)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{12}^{(2)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot a_2^{(1)} = -0.985 \cdot 0.055 = -0.054175 \\
\frac{\partial L}{\partial w_{21}^{(2)}} &= \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial w_{21}^{(2)}} = \frac{\partial L}{\partial z_2^{(2)}} \cdot a_1^{(1)} = -1.015 \cdot 1.815 = -1.842225 \\
\frac{\partial L}{\partial w_{22}^{(2)}} &= \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial w_{22}^{(2)}} = \frac{\partial L}{\partial z_2^{(2)}} \cdot a_2^{(1)} = -1.015 \cdot 0.055 = -0.055825 \\
\frac{\partial L}{\partial b_1^{(2)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial b_1^{(2)}} = -0.985 \cdot 1 = -0.985 \\
\frac{\partial L}{\partial b_2^{(2)}} &= \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial b_2^{(2)}} = -1.015 \cdot 1 = -1.015
\end{aligned}$$

To add some more detail, let's step through how we computed $\frac{\partial L}{\partial w_{11}^{(2)}}$ carefully. Looking at the graph of the neural network we defined above, we see that $w_{11}^{(2)}$ is connected to the loss only through $z_1^{(2)}$. Thus, we first figure out how the loss is affected by $z_1^{(2)}$: $\frac{\partial L}{\partial z_1^{(2)}}$. Then we figure out how $w_{11}^{(2)}$ affects $z_1^{(2)}$: $\frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}$. Applying the chain rule: $\frac{\partial L}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}$.

Now that we have the partial derivatives with respect to the parameters of the second layer of the network, we can keep going backwards and compute the partial derivatives with respect to the parameters of the first layer of the network:

$$\begin{aligned}
\frac{\partial L}{\partial w_{11}^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{11}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot x_1 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{21}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot x_1 = (-0.985 \cdot 0.2 \cdot 1 \cdot 4) + (-1.015 \cdot 0.1 \cdot 1 \cdot 4) = -1.194 \\
\frac{\partial L}{\partial w_{12}^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{12}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{12}^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{11}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot x_2 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{21}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot x_2 = (-0.985 \cdot 0.2 \cdot 1 \cdot 7) + (-1.015 \cdot 0.1 \cdot 1 \cdot 7) = -2.0895 \\
\frac{\partial L}{\partial w_{13}^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{13}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{13}^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{11}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot x_3 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{21}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot x_3 = (-0.985 \cdot 0.2 \cdot 1 \cdot 0.85) + (-1.015 \cdot 0.1 \cdot 1 \cdot 0.85) = -0.253725 \\
\frac{\partial L}{\partial w_{21}^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{21}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{21}^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{12}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot x_1 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{22}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot x_1 = (-0.985 \cdot (-0.1) \cdot 1 \cdot 4) + (-1.015 \cdot 0.2 \cdot 1 \cdot 4) = -0.418 \\
\frac{\partial L}{\partial w_{22}^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{22}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{22}^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{12}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot x_2 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{22}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot x_2 = (-0.985 \cdot (-0.1) \cdot 1 \cdot 7) + (-1.015 \cdot 0.2 \cdot 1 \cdot 7) = -0.7315 \\
\frac{\partial L}{\partial w_{23}^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{23}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial w_{23}^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{12}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot x_3 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{22}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot x_3 = (-0.985 \cdot (-0.1) \cdot 1 \cdot 0.85) + (-1.015 \cdot 0.2 \cdot 1 \cdot 0.85) = -0.088825 \\
\frac{\partial L}{\partial b_1^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial b_1^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial b_1^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{11}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot 1 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{21}^{(2)} \cdot \sigma'(z_1^{(1)}) \cdot 1 = (-0.985 \cdot 0.2 \cdot 1 \cdot 1) + (-1.015 \cdot 0.1 \cdot 1 \cdot 1) = -0.2985 \\
\frac{\partial L}{\partial b_2^{(1)}} &= \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial b_2^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_2^{(1)}} \cdot \frac{\partial a_2^{(1)}}{\partial z_2^{(1)}} \cdot \frac{\partial z_2^{(1)}}{\partial b_2^{(1)}} \\
&= \frac{\partial L}{\partial z_1^{(2)}} \cdot w_{12}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot 1 + \frac{\partial L}{\partial z_2^{(2)}} \cdot w_{22}^{(2)} \cdot \sigma'(z_2^{(1)}) \cdot 1 = (-0.985 \cdot (-0.1) \cdot 1 \cdot 1) + (-1.015 \cdot 0.2 \cdot 1 \cdot 1) = -0.1045
\end{aligned}$$

where $\sigma'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$.

The backward pass for this layer is much more interesting, and it's worth stepping through carefully. Let's walk through how we computed $\frac{\partial L}{\partial w_{11}^{(1)}}$. Looking at the graph, we see that $w_{11}^{(1)}$ is connected to the loss through $a_1^{(1)}$, which impacts both $z_1^{(2)}$ and $z_2^{(2)}$. Thus, we figure out how the loss is affected by $z_1^{(2)}$ and $z_2^{(2)}$: $\frac{\partial L}{\partial z_1^{(2)}}$, $\frac{\partial L}{\partial z_2^{(2)}}$. Then we figure out how $a_1^{(1)}$ impacts $z_1^{(2)}$ and $z_2^{(2)}$: $\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}}$, $\frac{\partial z_2^{(2)}}{\partial a_1^{(1)}}$. We then figure out how $z_1^{(1)}$ impacts $a_1^{(1)}$: $\frac{\partial a_1^{(1)}}{\partial z_1^{(1)}}$. Finally, we figure out how $w_{11}^{(1)}$ impacts $z_1^{(1)}$: $\frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}$. Given all of these relationships, we apply the chain rule, summing over the two paths through which $w_{11}^{(1)}$ reaches the loss: $\frac{\partial L}{\partial w_{11}^{(1)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}$.

This argument can be generalized to the other partial derivatives we computed.
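
The whole backward pass for our tiny network fits in a few lines of NumPy. This sketch reuses `x`, `y`, `z1`, `a1`, and `z2` from the forward-pass snippet; the values in the comments match the hand-computed partials above:

```python
# Backward pass for our two-layer network. Each dW has the same shape (and indexing) as its W.
dz2 = 2 * (z2 - y)            # [-0.985, -1.015], the partials w.r.t. z_1^(2), z_2^(2)

dW2 = np.outer(dz2, a1)       # [[-1.787775, -0.054175], [-1.842225, -0.055825]]
db2 = dz2                     # [-0.985, -1.015]

da1 = W2.T @ dz2              # how the loss changes with each hidden activation
dz1 = da1 * (z1 > 0)          # multiply elementwise by ReLU'(z1), which is 1 here
dW1 = np.outer(dz1, x)        # [[-1.194, -2.0895, -0.253725], [-0.418, -0.7315, -0.088825]]
db1 = dz1                     # [-0.2985, -0.1045]
```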

Representing the Backward Pass in Matrix Form

If you thought that was a lot of numbers to keep track of, you aren't alone. Just as we saw for the forward pass, there is also a notational trick to package up the backward pass using matrices.

From above, let's say we wish to find $\frac{\partial L}{\partial w_{ji}^{(L)}}$. Let $k$ be the index of the training example in our dataset. We observe that:

$$\frac{\partial L}{\partial w_{ji}^{(L)}} = \sum_k \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot a_{ki}^{(L-1)}$$

Here we have the extra index $k$ so we can refer to the values of $z$ and $a$ for the $k$-th training example.

Now, let's package this up: let $dZ^{(L)}$ be the matrix whose entry $(dZ^{(L)})_{kj}$ is the gradient of the loss with respect to $z_j^{(L)}$ for the $k$-th data point, and let $A^{(L-1)}$ be the matrix whose entry $(A^{(L-1)})_{ki}$ is the value of neuron $i$ in layer $L-1$ for the $k$-th data point.

Then, we just want to sum over all of the data points. We can express this as a matrix multiplication:

$$\frac{\partial L}{\partial W^{(L)}} = \left(dZ^{(L)}\right)^T A^{(L-1)}$$

The $ji$-th entry of $\left(dZ^{(L)}\right)^T A^{(L-1)}$ is exactly $\sum_k (dZ^{(L)})_{kj} \cdot a_{ki}^{(L-1)}$, which matches our sum from above.

For the bias gradient, only $z_{kj}^{(L)}$ depends on $b_j^{(L)}$, so the chain rule sums over training examples:

$$\frac{\partial L}{\partial b_j^{(L)}} = \sum_k \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot \frac{\partial z_{kj}^{(L)}}{\partial b_j^{(L)}} \implies \frac{\partial L}{\partial \mathbf{b}^{(L)}} = \left(dZ^{(L)}\right)^T \mathbf{1}$$

where $\mathbf{1} \in \mathbb{R}^N$ is a column vector of ones.

For the activations of the previous layer, the structure is similar, but now the sum is over output neurons, since each activation $a_{ki}^{(L-1)}$ fans out into every neuron in layer $L$:

$$\frac{\partial L}{\partial a_{ki}^{(L-1)}} = \sum_j \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot \frac{\partial z_{kj}^{(L)}}{\partial a_{ki}^{(L-1)}} \implies \frac{\partial L}{\partial A^{(L-1)}} = dZ^{(L)} W^{(L)}$$

Finally, to push that gradient through the activation function for the next backward step, we multiply element-wise by $\sigma'$:

$$dZ^{(L-1)} = \frac{\partial L}{\partial A^{(L-1)}} \odot \sigma'\left(Z^{(L-1)}\right)$$

where $\odot$ is the Hadamard (element-wise) product. With these formulas, every layer's gradients reuse the same patterns; just substitute $L \to L-1$ and keep going.
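
Here is a sketch of these matrix formulas for one generic layer, under the assumption that `dZ` is $N \times n_L$, `A_prev` is $N \times n_{L-1}$, and the hidden activation is the ReLU from before (the function names are illustrative):

```python
def relu_prime(z):
    return (z > 0).astype(float)

def backward_layer(dZ, A_prev, W, Z_prev=None):
    # dZ:     (N, n_L)      dL/dZ^(L) for each of the N training examples
    # A_prev: (N, n_{L-1})  activations of layer L-1 for each example
    dW = dZ.T @ A_prev        # dL/dW^(L), shape (n_L, n_{L-1})
    db = dZ.sum(axis=0)       # dL/db^(L) = (dZ)^T 1, shape (n_L,)
    dA_prev = dZ @ W          # dL/dA^(L-1), shape (N, n_{L-1})
    # Push the gradient through the activation to get dZ^(L-1) for the next step.
    dZ_prev = dA_prev * relu_prime(Z_prev) if Z_prev is not None else None
    return dW, db, dZ_prev
```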

Applications to Training

Now that we've computed the forward pass and then computed the partial derivative of the loss with respect to a specific parameter $\theta$, training the neural network amounts to nudging the parameter $\theta$ by a small amount in the direction opposite its partial derivative:

$$\theta \leftarrow \theta - \alpha \cdot \frac{\partial L}{\partial \theta}$$

where $\alpha \in \mathbb{R}$ is a small step size (the learning rate).

In fact, you can try nudging our parameters right now using the partial derivatives we found in the Backward Pass section. For example, with $\alpha = 0.01$, we can update $w_{11}^{(2)}$:

$$w_{11}^{(2)} \leftarrow 0.2 - 0.01 \cdot (-1.787775) = 0.21787775$$

Now update all parameters in the network using the same process, and redo the forward pass. The loss will drop. Repeat over many iterations and the network gradually fits the data; every gradient that drives it really was just the chain rule.
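
Putting it all together, here is a rough training loop for our single-example dataset, reusing the parameter arrays from the earlier sketches; the learning rate of $0.01$ and the $200$ iterations are arbitrary choices:

```python
lr = 0.01                 # the step size alpha
for step in range(200):
    # Forward pass.
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)
    z2 = W2 @ a1 + b2
    loss = np.sum((z2 - y) ** 2)

    # Backward pass: the same gradients we derived by hand.
    dz2 = 2 * (z2 - y)
    dW2, db2 = np.outer(dz2, a1), dz2
    dz1 = (W2.T @ dz2) * (z1 > 0)
    dW1, db1 = np.outer(dz1, x), dz1

    # Nudge every parameter against its gradient.
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

print(loss)   # should be far below the initial 0.5001125
```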

Appendix

Derivation of the Chain Rule

Earlier, we said that if you compose functions and then want to take the derivative, you must apply the chain rule:

$$\frac{d f(g(x))}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

$\frac{d f(g(x))}{dx}$ asks how $f(g(x))$ changes with respect to $x$, or more intuitively, how $f(g(x))$ changes if we nudge $x$ slightly. Our nudge to $x$ is denoted $dx$.

Now, the nudge to $g(x)$ is denoted $dg$, where $dg = \frac{dg}{dx} \cdot dx$. In words, this tells us to examine how much $g$ changes with respect to $x$, and then multiply by the nudge in $x$.

Finally, the nudge to $f(g(x))$ is denoted $df$, where $df = \frac{df}{dg} \cdot dg$. Again, this tells us to examine how much $f$ changes with respect to $g$, and then multiply by the nudge in $g$.

Taking these facts together:

$$df = \frac{df}{dg} \cdot dg \implies df = \frac{df}{dg} \cdot \frac{dg}{dx} \cdot dx \implies \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
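
If you want to convince yourself numerically, here is a small finite-difference check of the chain rule (a standalone sketch with an arbitrarily chosen $f$, $g$, and point):

```python
import numpy as np

# Sanity-check the chain rule numerically with f(u) = u**2 and g(x) = sin(x).
f = lambda u: u ** 2
g = lambda x: np.sin(x)

x0, h = 0.7, 1e-6
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)   # central finite difference
analytic = 2 * np.sin(x0) * np.cos(x0)              # (df/dg) * (dg/dx)

print(numeric, analytic)   # the two agree to about six decimal places
```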