Gradient Descent — Algorithm, Learning Rate & Variants

Gradient descent is an algorithm for minimizing a differentiable function by taking repeated steps in the direction that decreases it most locally. If you are searching for "what is gradient descent," the core idea is simple: compute the slope, move a little downhill, and repeat.

It is widely used in calculus-based optimization and in machine learning. The method works best when you can compute a derivative or gradient and choose a learning rate that is small enough to stay stable but large enough to make progress.

In one variable, the update rule is

x_{k+1} = x_k - \eta f'(x_k),

and in several variables it becomes

\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k),

where $\eta > 0$ is the learning rate. The learning rate controls how far each step moves, so it directly affects whether the algorithm converges, stalls, or overshoots.

Gradient Descent Intuition

The gradient points uphill. If your goal is minimization, the natural local move is to go the other way.

That local rule does not guarantee the best possible answer in every problem. On a convex function, gradient descent can lead to the global minimum. On a non-convex function, it may settle at a local minimum, a flat region, or another stationary point.

How The Gradient Descent Algorithm Works

Each iteration uses the current slope information, updates the point, and checks whether it should keep going.

Start with an initial guess $x_0$ or $\mathbf{x}_0$ .
Compute the derivative or gradient at the current point.
Update by subtracting $\eta$ times that derivative or gradient.
Stop when the gradient is small, the updates become tiny, or a preset iteration limit is reached.

The standard update rule assumes the objective is differentiable at the points where you apply it. Some optimization methods use subgradients for nonsmooth problems, but that is a different setup.

Why The Learning Rate Matters In Gradient Descent

The learning rate $\eta$ is the step size.

If $\eta$ is too small, gradient descent usually moves in the right direction but can be painfully slow. If $\eta$ is too large, the updates can overshoot the minimum, bounce back and forth, or even diverge.

You can see the tradeoff clearly in a quadratic function, where the slope gets steeper as you move away from the minimum. A step size that seems safe in one place can be too aggressive in another.

Worked Example: Gradient Descent On A Quadratic

Consider

f(x) = (x-3)^2.

This function has its minimum at $x=3$ . Its derivative is

f'(x) = 2(x-3).

Use gradient descent with learning rate $\eta = 0.1$ and starting point $x_0 = 0$ .

Then the update rule is

x_{k+1} = x_k - 0.1 \cdot 2(x_k-3) = x_k - 0.2(x_k-3).

Starting from $x_0 = 0$ :

x_1 = 0 - 0.2(0-3) = 0.6.

Then

x_2 = 0.6 - 0.2(0.6-3) = 1.08.

and

x_3 = 1.08 - 0.2(1.08-3) = 1.464.

Each step moves closer to $3$ , and the function value decreases each time. That is the main pattern to notice: gradient descent does not jump straight to the answer. It improves the estimate by repeated local corrections.

Common Gradient Descent Variants

Batch Gradient Descent

Batch gradient descent uses the full dataset to compute each update. For a fixed objective, this gives a deterministic step, but it can be expensive when the dataset is large.

Stochastic Gradient Descent

Stochastic gradient descent updates using one sample at a time. Each step is cheaper and noisier. That noise can help the method keep moving, but it also makes the path less smooth.

Mini-Batch Gradient Descent

Mini-batch gradient descent uses a small group of samples per step. This is often a practical compromise because it reduces noise compared with pure stochastic updates while staying much cheaper than full-batch updates.

These variants matter most in machine learning, where the objective is often an average loss over many training examples.

Common Mistakes With Gradient Descent

Treating The Learning Rate As Cosmetic

Changing $\eta$ changes the behavior of the algorithm itself. A method that converges for one learning rate may fail for another.

Assuming Gradient Descent Always Finds The Global Minimum

That conclusion needs conditions. For example, convexity gives much stronger guarantees than a general non-convex landscape.

Ignoring Feature Scale In Applied Problems

In optimization problems with badly scaled variables, one direction can change much faster than another. Gradient descent can then zigzag and converge slowly unless the problem is reformulated or scaled more carefully.

Stopping Only Because The Gradient Is Not Exactly Zero

Numerical algorithms rarely wait for a perfect zero. Practical stopping rules usually check whether the gradient norm, parameter change, or objective change is small enough.

When Gradient Descent Is Used

Gradient descent is used in numerical optimization, statistics, and machine learning. It is especially common when an exact closed-form solution is unavailable or too expensive to compute directly.

For small problems with simple formulas, calculus may give the minimum exactly. Gradient descent becomes more useful when the parameter space is large, the objective has many variables, or the loss comes from large datasets.

Try A Similar Problem

Try your own version with $f(x) = (x-5)^2$ and starting point $x_0 = 12$ . Run one case with $\eta = 0.1$ and another with $\eta = 1.2$ . Seeing one stable run and one unstable run makes the role of the learning rate much clearer than the formula alone.

Need help with a problem?

Upload your question and get a verified, step-by-step solution in seconds.

Open GPAI Solver →