Every Gradient in Your Neural Network Is Just the Chain Rule
Hand-compute every gradient in a neural network. By the end, you'll know why we perform backpropagation to train a neural network.
Published
May 4, 2026
Category
Machine Learning
Reading time
18 min
Tags
calculus, neural networks, backpropagation
Neural networks are extremely powerful tools. In fact, it has been
shown that a neural network can approximate any Lipschitz
continuous function arbitrarily well. In machine learning, though, you are often just given a
dataset that you want a neural network to generalize from. But how do we “train” a neural network to
approximate this dataset well? We choose the parameters of a neural network to minimize a loss
function:
$$L_T(w) = \frac{1}{|T|} \sum_{(x_i, y_i) \in T} l(y_i, h(x_i, w))$$
where $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ is the training set,
$h$ is the predictor with weights $w$ that predicts value
$\hat{y} = h(x_i, w)$, and
$l(y, \hat{y}) = \|\hat{y} - y\|_2^2$ is a loss
defined for a single prediction (the loss can be defined differently, but this definition is common
for regression).
Do you see why minimizing this loss function corresponds to training a neural network to approximate
the dataset?
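To make the loss concrete, here is a minimal sketch that evaluates $L_T(w)$ for a toy predictor. The linear predictor `h` and the two-point dataset below are made up for illustration; they are not the network from this article.

```python
# Sketch of the training loss L_T(w): mean squared error over the training set.
# The linear predictor h and the toy dataset T are made-up for illustration.

def h(x, w):
    # A hypothetical predictor: simple dot product of input and weights.
    return sum(xi * wi for xi, wi in zip(x, w))

def loss(T, w):
    # L_T(w) = (1/|T|) * sum over (x_i, y_i) in T of (h(x_i, w) - y_i)^2
    return sum((h(x, w) - y) ** 2 for x, y in T) / len(T)

T = [((1.0, 2.0), 5.0), ((3.0, 1.0), 4.0)]  # made-up dataset
print(loss(T, (1.0, 1.0)))  # average squared error with weights (1, 1)
```

Choosing $w$ to shrink this number is exactly what "training to approximate the dataset" means.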
What do Gradients Tell Us?
A unique property of the gradient is that it points in the direction of maximum ascent. To see why this
is true, let’s take a look at the directional derivative:

$$D_u f(x) = u \cdot \nabla f(x)$$
This tells us by how much $f$ increases when we move along $u$. Now, let’s use the geometric
form of the dot product to study this closely:

$$u \cdot \nabla f(x) = \|u\|_2 \cdot \|\nabla f(x)\|_2 \cdot \cos(\theta)$$

where $\theta$ is the angle between $u$ and $\nabla f(x)$.
Well, we want to find the directions of maximal/minimal ascent. Because we are interested in
directions, we need to compare them in an apples-to-apples fashion, so we restrict $u$
to be a unit vector; we don’t want magnitude to influence the result here. That means
$\|u\|_2 = 1$. Additionally,
$\|\nabla f(x)\|_2$ remains fixed, as it’s just a number once $x$
is fixed. Thus, $\cos(\theta)$ is the only thing that can be maximized/minimized.
$\cos(\theta)$ is maximized when $\theta = 0$, which implies that $f$ increases most when you move in
the same direction as the gradient. Similarly, $\cos(\theta)$ is minimized when $\theta = 180^\circ$
(opposite), which implies that $f$ decreases most by going in the opposite direction of the gradient.
To minimize our original loss function with respect to the weights, we need to move along the
opposite direction of the gradient.
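A quick numerical sanity check of this argument: for $f(x, y) = x^2 + y^2$, sampling unit directions around a circle should find that the largest increase of $f$ occurs (approximately) along the gradient direction. The test point and sampling resolution below are arbitrary choices.

```python
import math

# Check that the gradient direction gives the largest increase of
# f(x, y) = x^2 + y^2 among unit directions sampled around a circle.

def f(p):
    x, y = p
    return x * x + y * y

def grad_f(p):
    x, y = p
    return (2 * x, 2 * y)  # analytic gradient

p = (1.0, 2.0)
eps = 1e-6
best_dir, best_gain = None, -float("inf")
for k in range(360):  # sample unit vectors u at 1-degree increments
    theta = math.radians(k)
    u = (math.cos(theta), math.sin(theta))
    gain = f((p[0] + eps * u[0], p[1] + eps * u[1])) - f(p)
    if gain > best_gain:
        best_gain, best_dir = gain, u

gx, gy = grad_f(p)
norm = math.hypot(gx, gy)
unit_grad = (gx / norm, gy / norm)
# The best sampled direction should approximately equal the unit gradient.
print(best_dir, unit_grad)
```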
Training a Simple Neural Network
We learned that moving along the opposite direction of the gradient helps minimize our loss function
defined above. This is the core idea behind gradient descent. You compute the loss of the neural
network given the current weights, then compute the gradient of the loss function with respect to
the weights, move a tiny step in the opposite direction of the gradient, and then repeat until some
pre-defined stopping criterion is met. The goal of gradient descent is to arrive at a local minimum.
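Here is the loop in its simplest form, on a one-dimensional loss $f(w) = (w - 3)^2$ rather than a neural network. The step size and iteration count are arbitrary choices.

```python
# Minimal gradient descent loop on a one-dimensional loss f(w) = (w - 3)^2,
# whose minimum is at w = 3. Step size and iteration count are arbitrary.

def grad(w):
    return 2 * (w - 3)  # derivative of (w - 3)^2

w = 0.0       # starting point
alpha = 0.1   # step size
for _ in range(100):
    w = w - alpha * grad(w)  # move opposite the gradient

print(w)  # converges toward 3
```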
To make this concrete, let’s compute every gradient by hand for the simplest network that’s still
interesting: three inputs, one hidden layer of two neurons, and a two-dimensional output.
As an example, let’s say we are estimating a student’s test scores. The input to the neural network
is

$$x = \begin{bmatrix} \text{hours studied} \\ \text{hours slept} \\ \text{prior grade} \end{bmatrix}$$

and the output is $\hat{y} = \begin{bmatrix} \text{math score} \\ \text{reading score} \end{bmatrix}$. To
keep things simple, our training set consists of just a single point

$$T = \left\{ \left( \begin{bmatrix} 4 \\ 7 \\ 0.85 \end{bmatrix}, \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix} \right) \right\}.$$
We also need starting values for the network’s parameters. In practice, we’d generate these
randomly, but here we’ll hand-pick them for our simple example. Recall from the diagram that
$w_{ji}^{(L)}$ is the weight from neuron $i$ in layer $L-1$ to neuron $j$ in layer $L$, and
$b_j^{(L)}$ is the bias on neuron $j$ in layer $L$.
Layer 1 connects the three inputs to the two hidden neurons, so we need 6 weights and 2 biases:
Before we can compute the gradient of the loss function, we first need to compute the loss. This
involves performing the “forward pass” on the network where we compute what the neural network
predicts given an input. Let’s perform the forward pass on the one training example in our training
set.
The formula to compute the value of neuron $j$ at layer $L$, denoted $a_j^{(L)}$, is to first compute
$z_j^{(L)} = b_j^{(L)} + \sum_{i=1}^{n} w_{ji}^{(L)} \cdot a_i^{(L-1)}$, where $n$ is
the number of neurons in the previous layer $L-1$. The value of $a_j^{(L)}$ is then the result of
passing $z_j^{(L)}$ through a non-linear activation function $\sigma$, where a common
choice is $\sigma(x) = \mathrm{ReLU}(x) = \max(0, x)$. The zero-th layer, however, is just defined as
the input layer, so $a_j^{(0)} = x_j$.
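The per-neuron formula can be sketched directly in code. The weights and biases below are made-up placeholders, not the hand-picked values from this article’s example.

```python
# Per-neuron forward pass for one layer: z_j = b_j + sum_i w_ji * a_i(prev),
# then a_j = ReLU(z_j). The weights W1 and biases b1 are made-up placeholders.

def relu(z):
    return max(0.0, z)

def layer_forward(a_prev, W, b):
    out = []
    for j in range(len(b)):  # one output neuron at a time
        z_j = b[j] + sum(W[j][i] * a_prev[i] for i in range(len(a_prev)))
        out.append(relu(z_j))
    return out

a0 = [4.0, 7.0, 0.85]                      # the example input x
W1 = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]  # made-up weights
b1 = [0.05, -0.1]                          # made-up biases
print(layer_forward(a0, W1, b1))
```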
The output layer’s activation depends on the task: sigmoid for binary classification, softmax for
multi-class, and often just the identity (no activation) for regression. Since we’re predicting
continuous test scores, we’ll use the identity here. That’s why the output layer in the diagram
shows $z_j^{(L)}$ rather than $a_j^{(L)}$. In practice, since our scores are normalized to $[0, 1]$,
you might add a sigmoid to keep predictions in range, but we’ll skip it to keep the chain-rule story
uncluttered.
Now, let’s run through a forward pass on this network with our one training example:
$$\left( \begin{bmatrix} 4 \\ 7 \\ 0.85 \end{bmatrix}, \begin{bmatrix} 0.9 \\ 0.8 \end{bmatrix} \right).$$
The zero-th layer is just our input: $a_1^{(0)} = 4$, $a_2^{(0)} = 7$, $a_3^{(0)} = 0.85$.
The loss looks really bad right now, but that’s expected! We just chose a bunch of weights and
biases at random, but we will improve our parameters in training so that our predictions become
better.
Representing the Forward Pass in Matrix Form
While the above representation works, keeping track of all of the weights and biases gets convoluted
fast. To ease the complexity, the literature often uses a notational trick to represent each layer
with a single matrix multiplication:

$$a^{(L)} = \sigma(b^{(L)} + W^{(L)} a^{(L-1)})$$
In this formula, $a^{(L)}$ is the value of the neurons in layer $L$, and
$a^{(L-1)}$ is the value of the neurons in layer $L-1$ (in this case $a^{(L)}$ and
$a^{(L-1)}$ are vectors). The weight matrix $W^{(L)}$ is defined such that the $j$-th row
and $i$-th column of the matrix is the weight going from neuron $i$ in layer $L-1$ to neuron $j$ in
layer $L$. $\sigma$ is the activation function that is applied element-wise to the vector
$z^{(L)} = b^{(L)} + W^{(L)} a^{(L-1)}$.
It’s worth pondering why this matrix multiplication exactly reproduces the per-neuron calculation we
did above. Conceptually speaking, the matrix multiplication takes the previous layer and computes the
dot product with the weights in each row of the weight matrix. Each row $j$ of the weight matrix
contains all the weights needed to go from the previous layer $L-1$ to neuron $j$ in the new layer
$L$ (based on how we defined $W$ above). We then apply the activation $\sigma$ and get the values of
the neurons at layer $L$.
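A quick check that the two views agree, using NumPy and made-up parameter values (not the article’s example weights):

```python
import numpy as np

# Check that the matrix form sigma(b + W a) matches the per-neuron computation.
# The weights and biases are made-up placeholders.

relu = lambda z: np.maximum(0.0, z)

a_prev = np.array([4.0, 7.0, 0.85])
W = np.array([[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]])  # row j holds neuron j's weights
b = np.array([0.05, -0.1])

# Matrix form: one multiply handles the whole layer.
a_matrix = relu(b + W @ a_prev)

# Per-neuron form: explicit dot product for each row j.
a_loop = np.array([relu(b[j] + W[j] @ a_prev) for j in range(len(b))])

print(a_matrix, a_loop)  # identical
```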
Backward Pass
Now that we ran through the forward pass and computed a loss (which we saw was bad), the natural
question is how do we improve? That’s where the backward pass comes in. Our goal now is to change
the parameters of the neural network to reduce the overall loss. As discussed above, we do this by
iteratively moving the parameters in the opposite direction of the gradient of the loss with respect
to the parameter we are moving. So in our simple neural network, we want to compute
$\frac{\partial L}{\partial w_{11}^{(2)}}$, $\frac{\partial L}{\partial w_{12}^{(2)}}$,
$\frac{\partial L}{\partial w_{21}^{(2)}}$, $\frac{\partial L}{\partial w_{22}^{(2)}}$,
$\frac{\partial L}{\partial b_1^{(2)}}$, $\frac{\partial L}{\partial b_2^{(2)}}$ (gradients for the
second layer), and $\frac{\partial L}{\partial w_{11}^{(1)}}$,
$\frac{\partial L}{\partial w_{12}^{(1)}}$, $\frac{\partial L}{\partial w_{13}^{(1)}}$,
$\frac{\partial L}{\partial w_{21}^{(1)}}$, $\frac{\partial L}{\partial w_{22}^{(1)}}$,
$\frac{\partial L}{\partial w_{23}^{(1)}}$, $\frac{\partial L}{\partial b_1^{(1)}}$,
$\frac{\partial L}{\partial b_2^{(1)}}$ (gradients for the first layer).
As a reminder, the loss function we defined earlier was:

$$L_T(w) = \frac{1}{N} \sum_{i=1}^{N} \|h(x_i, w) - y_i\|_2^2$$
Looking at our loss function, our weights don’t directly show up in it, so a natural question to ask
is: are their gradients just 0? Well, no; our parameters obviously play a role in the loss:
$z^{(2)}$ is a function of all of our parameters. This means we can apply the chain rule.
From calculus, if we are given $f(g(x))$ and we want to find the derivative with respect to $x$, we
apply the chain rule (see Appendix for an intuitive derivation):

$$\frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Given the chain rule, we can compute our derivatives. Working backward (hence the name backward
pass), let’s figure out how our loss is affected by $z_1^{(2)}$ and $z_2^{(2)}$ (since $N = 1$ in our
training set, $\frac{2}{N}$ simplifies to $2$):
While we don’t have direct control over $z_1^{(2)}$ and $z_2^{(2)}$, and
$\frac{\partial L}{\partial z_1^{(2)}}$ and $\frac{\partial L}{\partial z_2^{(2)}}$ weren’t in the
original list of partials we cared about, these intermediary partials will be helpful to us when
computing the partials with respect to the weights/biases for the second layer of the neural
network:
To add some more detail, let’s step through how we computed
$\frac{\partial L}{\partial w_{11}^{(2)}}$ carefully. Looking at the graph of the neural network we
defined above, we see that $w_{11}^{(2)}$ is connected to the loss only through $z_1^{(2)}$. Thus,
we first figure out how the loss is affected by $z_1^{(2)}$:
$\frac{\partial L}{\partial z_1^{(2)}}$. Then we figure out how $w_{11}^{(2)}$ affects $z_1^{(2)}$:
$\frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}$. Applying the chain rule:

$$\frac{\partial L}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}.$$
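One way to convince yourself the chain rule gives the right answer is a finite-difference check. The sketch below uses a stripped-down one-weight version of the same structure, $L = (z - y)^2$ with $z = b + w \cdot a$, and made-up numbers rather than the article’s network:

```python
# Finite-difference check of the chain rule: dL/dw = (dL/dz) * (dz/dw).
# Here L = (z - y)^2 with z = b + w * a, using made-up values for a, b, y, w.

a, b, y = 0.5, 0.1, 0.9
w = 0.2

def loss(w):
    z = b + w * a
    return (z - y) ** 2

# Chain rule: dL/dz = 2 * (z - y), and dz/dw = a.
z = b + w * a
analytic = 2 * (z - y) * a

# Finite-difference approximation of the same derivative.
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytic, numeric)  # the two should agree closely
```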
Now that we have the partial derivatives with respect to the parameters of the second layer of the
network, we can keep going backwards and compute the partial derivatives with respect to the
parameters of the first layer of the network:
The backward pass for this layer is much more interesting and it’s worth stepping through carefully.
Let’s walk through how we computed $\frac{\partial L}{\partial w_{11}^{(1)}}$. Looking at the graph,
we see that $w_{11}^{(1)}$ is connected to the loss through $a_1^{(1)}$, which impacts both
$z_1^{(2)}$ and $z_2^{(2)}$. Thus, we figure out how the loss is affected by $z_1^{(2)}$ and
$z_2^{(2)}$: $\frac{\partial L}{\partial z_1^{(2)}}$, $\frac{\partial L}{\partial z_2^{(2)}}$. Then
we figure out how $a_1^{(1)}$ impacts $z_1^{(2)}$ and $z_2^{(2)}$:
$\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}}$, $\frac{\partial z_2^{(2)}}{\partial a_1^{(1)}}$. We
then figure out how $z_1^{(1)}$ impacts $a_1^{(1)}$:
$\frac{\partial a_1^{(1)}}{\partial z_1^{(1)}}$. Then finally, we can figure out how $w_{11}^{(1)}$
impacts $z_1^{(1)}$: $\frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}$. Given all of these
relationships, we can apply the chain rule to compute the partial derivative of the loss with
respect to $w_{11}^{(1)}$:

$$\frac{\partial L}{\partial w_{11}^{(1)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} + \frac{\partial L}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}.$$
This argument can be generalized to the other partial derivatives we computed.
Representing the Backward Pass in Matrix Form
If you thought that was a lot of numbers to keep track of, you aren’t alone. Just like how we saw
for the forward pass, there is also a notational trick to package up the backwards pass using
matrices.
From above, let’s say we wish to find $\frac{\partial L}{\partial w_{ji}^{(L)}}$. Let $k$ be the
index of the training example in our dataset. We observe that:

$$\frac{\partial L}{\partial w_{ji}^{(L)}} = \sum_k \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot a_{ki}^{(L-1)}$$

Here we have the extra index $k$ so we can refer to the values of $z$ and $a$ for the $k$-th training
example.
Now, let’s package this so that $dZ^{(L)}$ has entries where $(dZ^{(L)})_{kj}$ is the gradient of
the loss with respect to $z_{kj}^{(L)}$, the pre-activation of neuron $j$ in layer $L$ for the
$k$-th data point. Additionally, $A^{(L-1)}$ has entries where $(A^{(L-1)})_{ki}$ is the value of
neuron $i$ in layer $L-1$ for the $k$-th data point.
Then, we just want to sum over all of the data points. We can express this as a matrix
multiplication:

$$\frac{\partial L}{\partial W^{(L)}} = (dZ^{(L)})^T \cdot A^{(L-1)}$$

The $ji$-th entry of $(dZ^{(L)})^T \cdot A^{(L-1)}$ is exactly
$\sum_k (dZ^{(L)})_{kj} \cdot a_{ki}^{(L-1)}$, which matches our sum from above.
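This identity is easy to verify numerically. The shapes and values below are made up:

```python
import numpy as np

# Verify that (dZ)^T @ A reproduces the entrywise sum over training examples k:
# dL/dw_ji = sum_k dZ[k, j] * A[k, i]. Shapes and values are made-up.

rng = np.random.default_rng(0)
K, J, I = 5, 2, 3          # examples, layer-L neurons, layer-(L-1) neurons
dZ = rng.normal(size=(K, J))
A = rng.normal(size=(K, I))

dW_matrix = dZ.T @ A        # the packaged matrix form

# Entrywise sum, exactly as in the formula above.
dW_loop = np.zeros((J, I))
for j in range(J):
    for i in range(I):
        dW_loop[j, i] = sum(dZ[k, j] * A[k, i] for k in range(K))

print(np.allclose(dW_matrix, dW_loop))  # True
```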
For the bias gradient, only $z_{kj}^{(L)}$ depends on $b_j^{(L)}$ (with
$\frac{\partial z_{kj}^{(L)}}{\partial b_j^{(L)}} = 1$), so the chain rule sums over
training examples:

$$\frac{\partial L}{\partial b_j^{(L)}} = \sum_k \frac{\partial L}{\partial z_{kj}^{(L)}}$$
For the activations of the previous layer, the structure is similar, but now the sum is over output
neurons, since each activation $a_{ki}^{(L-1)}$ fans out into every neuron in layer $L$:

$$\frac{\partial L}{\partial a_{ki}^{(L-1)}} = \sum_j \frac{\partial L}{\partial z_{kj}^{(L)}} \cdot w_{ji}^{(L)}, \quad \text{or in matrix form,} \quad \frac{\partial L}{\partial A^{(L-1)}} = dZ^{(L)} \cdot W^{(L)}$$
Finally, to push that gradient through the activation function for the next backward step, we
multiply element-wise by $\sigma'$:

$$dZ^{(L-1)} = \frac{\partial L}{\partial A^{(L-1)}} \odot \sigma'(Z^{(L-1)})$$

where $\odot$ is the Hadamard product. With
these formulas, every layer’s gradients reuse the same patterns; just substitute $L \to L-1$ and
keep going.
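Putting the three formulas together, one backward step through a layer might be sketched as follows. The shapes and values are made up, and `relu_grad` and `backward_layer` are hypothetical helper names:

```python
import numpy as np

# One backward step through a layer, using the matrix formulas above:
# dW = dZ^T A, db = column sums of dZ, dA_prev = dZ W, then
# dZ_prev = dA_prev (elementwise *) relu'(Z_prev). Inputs are made-up.

def relu_grad(z):
    return (z > 0).astype(float)  # derivative of ReLU, elementwise

def backward_layer(dZ, W, A_prev, Z_prev):
    dW = dZ.T @ A_prev                     # gradient for the layer's weights
    db = dZ.sum(axis=0)                    # gradient for the layer's biases
    dA_prev = dZ @ W                       # gradient into previous activations
    dZ_prev = dA_prev * relu_grad(Z_prev)  # push through the activation
    return dW, db, dZ_prev

rng = np.random.default_rng(1)
dZ = rng.normal(size=(4, 2))       # 4 examples, 2 neurons in layer L
W = rng.normal(size=(2, 3))        # 2 layer-L neurons, 3 layer-(L-1) neurons
A_prev = rng.normal(size=(4, 3))
Z_prev = rng.normal(size=(4, 3))

dW, db, dZ_prev = backward_layer(dZ, W, A_prev, Z_prev)
print(dW.shape, db.shape, dZ_prev.shape)  # shapes match W, b, and Z_prev
```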
Applications to Training
Now that we’ve computed the forward pass and then computed the partial derivative of the loss with
respect to a specific parameter $\theta$, training the neural network amounts to nudging the
parameter $\theta$ by a small amount in the opposite direction of the partial derivative:

$$\theta \leftarrow \theta - \alpha \cdot \frac{\partial L}{\partial \theta}$$

where $\alpha \in \mathbb{R}$ is the small step size.
In fact, you can try nudging our parameters now by the partial derivatives we found in the
Backward Pass section. For example, if we took $\alpha = 0.01$, we can update
$w_{11}^{(2)}$:

$$w_{11}^{(2)} \leftarrow 0.2 - 0.01 \cdot (-1.787775) = 0.21787775$$
Now update all parameters in the network using the same process, and redo the forward pass. The loss
will drop. Repeat over many iterations and the network gradually fits the data, and every gradient
that drives it really was just the chain rule.
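The whole procedure, forward pass, backward pass, and update, can be sketched end to end. This uses the same 3-2-2 architecture on the single training example, but with random initial weights rather than the article’s hand-picked ones:

```python
import numpy as np

# End-to-end sketch: train a 3-2-2 network (ReLU hidden layer, identity
# output) on the single training example via gradient descent. The random
# initialization here does NOT reproduce the article's hand-picked weights.

rng = np.random.default_rng(42)
x = np.array([[4.0, 7.0, 0.85]])   # hours studied, hours slept, prior grade
y = np.array([[0.9, 0.8]])         # math score, reading score

W1, b1 = rng.normal(scale=0.1, size=(2, 3)), np.zeros(2)
W2, b2 = rng.normal(scale=0.1, size=(2, 2)), np.zeros(2)
alpha = 0.01

for step in range(2000):
    # Forward pass: ReLU hidden layer, identity output.
    Z1 = x @ W1.T + b1
    A1 = np.maximum(0.0, Z1)
    Z2 = A1 @ W2.T + b2
    loss = np.mean(np.sum((Z2 - y) ** 2, axis=1))

    # Backward pass: the matrix formulas from the previous section.
    dZ2 = 2 * (Z2 - y) / len(x)
    dW2, db2 = dZ2.T @ A1, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2) * (Z1 > 0)
    dW1, db1 = dZ1.T @ x, dZ1.sum(axis=0)

    # Gradient descent update on every parameter.
    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1

print(loss)  # near zero after training
```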
Appendix
Derivation of the Chain Rule
Earlier, we said that if you compose functions and then want to take the derivative, you must apply
the chain rule:

$$\frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx}$$
$\frac{d}{dx} f(g(x))$ asks how $f(g(x))$ changes with respect to $x$, or more intuitively, how does
$f(g(x))$ change if we nudge $x$ slightly. Our nudge to $x$ is denoted as $dx$.
Now, the nudge to $g(x)$ is denoted as $dg$, where $dg = \frac{dg}{dx} \cdot dx$. In words, this
tells us to examine how much $g$ changes with respect to $x$, and then multiply by the nudge in $x$.
Finally, the nudge to $f(g(x))$ is denoted as $df$, where $df = \frac{df}{dg} \cdot dg$. Again, this
tells us to examine how much $f$ changes with respect to $g$, and then multiply by the nudge in $g$.
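The nudge picture is easy to check numerically; $f(u) = \sin(u)$ and $g(x) = x^2$ below are arbitrary example functions:

```python
import math

# Numeric illustration of the chain rule for f(g(x)) with f(u) = sin(u) and
# g(x) = x^2, so df/dx = cos(x^2) * 2x. The functions are arbitrary examples.

def fg(x):
    return math.sin(x ** 2)

x = 1.3
analytic = math.cos(x ** 2) * 2 * x   # df/dg * dg/dx

# "Nudge" version: push a tiny dx through g, then through f.
dx = 1e-7
numeric = (fg(x + dx) - fg(x)) / dx

print(analytic, numeric)
```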