Backpropagation is the method neural networks use to compute gradients: how much the loss would change if you changed each weight or bias a little. In plain language, it tells the model which parameters pushed the prediction the wrong way and by how much.
The short version is simple: run the network forward, measure the error, then move backward through the same computation with the chain rule. That makes a deep model manageable, because each layer only has to contribute a small local derivative.
What backpropagation computes
Backpropagation does not update parameters by itself. It computes gradients such as ∂L/∂w and ∂L/∂b, where L is the loss. An optimizer such as gradient descent uses those gradients to make the actual update.
If the model and loss are differentiable, or at least piecewise differentiable enough for gradient methods, backpropagation lets you compute those gradients efficiently in one backward pass.
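To keep the two steps distinct, here is a minimal Python sketch. The names (`backward`, `sgd_step`) are illustrative, and the model is deliberately a bare linear neuron with squared-error loss rather than anything from the worked example below:

```python
# Sketch: backpropagation produces gradients; the optimizer applies them.
# Toy model: y_hat = w*x + b with loss L = 0.5 * (y - y_hat)**2.

def backward(x, y, w, b):
    """Backprop step: return dL/dw and dL/db, nothing more."""
    y_hat = w * x + b
    dL_dyhat = -(y - y_hat)           # dL/d(y_hat)
    return dL_dyhat * x, dL_dyhat     # chain rule: dz/dw = x, dz/db = 1

def sgd_step(w, b, grad_w, grad_b, lr=0.1):
    """Gradient descent: the separate step that actually updates parameters."""
    return w - lr * grad_w, b - lr * grad_b

grad_w, grad_b = backward(x=1.0, y=1.0, w=0.5, b=0.0)  # backprop: gradients only
w, b = sgd_step(0.5, 0.0, grad_w, grad_b)              # optimizer: the update
```

Swapping `sgd_step` for a different optimizer changes the update rule, but the gradients coming out of `backward` stay the same.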
Why the chain rule is the key idea
Think of a neural network as a long chain of computations. Each layer takes an input, transforms it, and hands the result to the next layer. By the time you reach the loss, the final error depends on every earlier choice.
Backpropagation asks a local question at each step: if this intermediate value changed a little, how would the final loss change? Those local effects multiply together as you move backward. That is the chain rule in plain language.
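The same idea in runnable form, using an arbitrary three-step chain of toy functions rather than a real network:

```python
# The chain rule in code: three small steps, each answering the local question
# "if my input changed a little, how would my output change?"
# The functions are arbitrary toy choices, not a real network.
def h(w): return 3.0 * w      # local derivative dh/dw = 3
def g(u): return u * u        # local derivative dg/du = 2u
def f(v): return v + 1.0      # local derivative df/dv = 1

w = 2.0
u = h(w)       # 6.0
v = g(u)       # 36.0
out = f(v)     # 37.0

# Moving backward, the local effects multiply together:
grad = 1.0 * (2.0 * u) * 3.0   # df/dv * dg/du * dh/dw = 36.0

# Finite-difference check that the product matches the end-to-end slope.
eps = 1e-6
numeric = (f(g(h(w + eps))) - out) / eps
```

The product of local derivatives agrees with the numerical slope of the whole chain, which is exactly what the chain rule promises.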
Backpropagation example with one neuron
Use one neuron with one input:

ŷ = σ(wx + b), with loss L = ½(y − ŷ)²

Here x is the input, w is the weight, b is the bias, ŷ is the prediction, y is the target, and σ is the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ).

Take x = 1.0, w = 0.5, b = 0.0, and y = 1.0.
Step 1: Forward pass
First compute the neuron's weighted sum:

z = wx + b = 0.5 · 1.0 + 0.0 = 0.5

Now apply the sigmoid:

ŷ = σ(0.5) ≈ 0.622

Now compute the loss:

L = ½(1.0 − 0.622)² ≈ 0.071

The prediction is below the target, so the loss is positive.
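The forward pass can be sketched directly, assuming example values of x = 1.0, w = 0.5, b = 0.0, target y = 1.0, and the squared-error loss L = ½(y − ŷ)²:

```python
import math

# Forward pass for one sigmoid neuron (assumed example values).
x, w, b, y = 1.0, 0.5, 0.0, 1.0

z = w * x + b                        # weighted sum: 0.5
y_hat = 1.0 / (1.0 + math.exp(-z))   # sigmoid(0.5), approx 0.622
loss = 0.5 * (y - y_hat) ** 2        # approx 0.071, positive since y_hat < y
```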
Step 2: Backward pass
Now compute the gradient with respect to the weight.
Start from the loss and work inward:

∂L/∂ŷ = −(y − ŷ) = −(1.0 − 0.622) ≈ −0.378

For sigmoid,

∂ŷ/∂z = ŷ(1 − ŷ) ≈ 0.622 · 0.378 ≈ 0.235

And for the weighted sum,

∂z/∂w = x = 1.0 and ∂z/∂b = 1

Now chain the pieces together:

∂L/∂w = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

Plug in the numbers:

∂L/∂w ≈ (−0.378) · (0.235) · (1.0) ≈ −0.089, and likewise ∂L/∂b ≈ −0.089

The negative signs matter. They say increasing w or b slightly would reduce the loss here, which fits the situation because the current prediction is too low.

If you use gradient descent with learning rate η = 0.1, then

w ← w − η · ∂L/∂w ≈ 0.5 − 0.1 · (−0.089) ≈ 0.509, and b ← 0.0 − 0.1 · (−0.089) ≈ 0.009
That is the whole idea in miniature: forward pass, loss, backward pass, update.
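The backward pass and update can be checked in code under the same assumed values (x = 1.0, w = 0.5, b = 0.0, y = 1.0, learning rate 0.1):

```python
import math

# Backward pass and one gradient-descent step for a single sigmoid neuron
# (assumed example values).
x, w, b, y, lr = 1.0, 0.5, 0.0, 1.0, 0.1

z = w * x + b
y_hat = 1.0 / (1.0 + math.exp(-z))

dL_dyhat = -(y - y_hat)          # approx -0.378
dyhat_dz = y_hat * (1 - y_hat)   # sigmoid derivative, approx 0.235
dz_dw, dz_db = x, 1.0            # local derivatives of the weighted sum

grad_w = dL_dyhat * dyhat_dz * dz_dw   # approx -0.089
grad_b = dL_dyhat * dyhat_dz * dz_db   # approx -0.089

w_new = w - lr * grad_w   # approx 0.509: the weight moves up, as the sign predicts
b_new = b - lr * grad_b   # approx 0.009
```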
Why backpropagation works for deep networks
In a deeper network, you do the same thing layer by layer. The main difference is that each hidden layer affects the loss indirectly through later layers, so its gradient includes more chain-rule factors.
Backpropagation stays practical because each layer only needs its local derivative and the signal coming from the layer after it. You do not re-derive the whole network from scratch for every parameter.
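A minimal two-layer sketch, with toy values chosen purely for illustration, makes the "local derivative times upstream signal" pattern concrete:

```python
import math

# Two-layer sketch: each layer multiplies the upstream signal by its own
# local derivative, then passes the result further back (toy values assumed).
x, w1, w2, y = 1.0, 0.8, -0.4, 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward through two layers.
h = sigmoid(w1 * x)       # hidden activation
y_hat = sigmoid(w2 * h)   # output
loss = 0.5 * (y - y_hat) ** 2

# Backward: one upstream signal flows layer by layer.
upstream = -(y - y_hat)            # dL/d(y_hat)
upstream *= y_hat * (1 - y_hat)    # through the output sigmoid -> dL/dz2
grad_w2 = upstream * h             # local: dz2/dw2 = h
upstream *= w2                     # through the weight -> dL/dh
upstream *= h * (1 - h)            # through the hidden sigmoid -> dL/dz1
grad_w1 = upstream * x             # local: dz1/dw1 = x
```

Notice that `grad_w1` reuses the signal already computed for the output layer; nothing about the later layers is re-derived from scratch.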
Common backpropagation mistakes
Confusing backpropagation with gradient descent
Backpropagation computes gradients. Gradient descent uses those gradients to update parameters. They are closely connected, but they are not the same step.
Forgetting that the loss sits at the end
The backward pass starts from the loss, not from an arbitrary hidden layer. If you lose track of what the loss depends on, the derivative chain usually breaks.
Ignoring activation behavior
Some activation functions produce very small gradients in some regions. If that happens repeatedly across many layers, learning can become slow.
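A rough worst-case sketch for sigmoid: its derivative ŷ(1 − ŷ) never exceeds 0.25, so a backward product of many such factors shrinks geometrically. This is an illustration of the shrinking product, not a full network:

```python
# The sigmoid derivative is at most 0.25, so stacking ten such factors
# leaves almost no gradient for the earliest layers (worst-case sketch).
factor = 0.25
signal = 1.0
for layer in range(10):
    signal *= factor
print(signal)   # 0.25**10, roughly 9.5e-07
```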
Assuming one backward pass means the model has learned
One backward pass gives one set of gradients for one batch. Training usually needs many updates across many examples.
When backpropagation is used
Backpropagation is the standard gradient-computation method for training many neural networks, including multilayer perceptrons, convolutional networks, recurrent models, and transformers.
The exact optimizer may change, and some architectures add practical tricks, but the core idea is usually the same: compute the loss, propagate gradients backward, and update parameters to reduce future error.
A practical way to remember it
Backpropagation is a structured way to assign credit and blame inside a layered model. If the output is wrong, the method traces that error backward so each parameter gets a signal about how it contributed.
That is why the phrase "how neural networks learn" is mostly accurate. The learning happens through repeated parameter updates, and backpropagation is what makes those updates informed instead of random.
Try a similar problem
Keep the same example, but change the target from y = 1.0 to y = 0.0. Recompute ∂L/∂w and ∂L/∂b, then check how the signs flip. That one change makes the role of the loss much clearer than memorizing the formulas alone.
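If you want to check your recomputation, here is a sketch under the same assumed example values (x = 1.0, w = 0.5, b = 0.0) with the new target:

```python
import math

# Same neuron, new target y = 0.0 (other values assumed as before).
x, w, b, y = 1.0, 0.5, 0.0, 0.0

z = w * x + b
y_hat = 1.0 / (1.0 + math.exp(-z))   # approx 0.622, now ABOVE the target

dL_dyhat = -(y - y_hat)              # approx +0.622: sign flipped
grad_w = dL_dyhat * y_hat * (1 - y_hat) * x
grad_b = dL_dyhat * y_hat * (1 - y_hat)
# grad_w and grad_b are now positive, so gradient descent lowers w and b
```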