Gradient descent is an algorithm for minimizing a differentiable function by taking repeated steps in the direction that decreases it most locally. Use it when you can compute a derivative or gradient but an exact closed-form minimum is unavailable or too expensive, which is the usual situation in large-scale optimization and machine learning.
The procedure is short to state: compute the slope, move a little downhill, and repeat. The method works best when you can compute the gradient and choose a learning rate that is small enough to stay stable but large enough to make progress.
The Procedure, Step By Step
- Choose a starting point. Pick an initial guess for the parameter or parameter vector you want to improve.
- Compute the gradient. Evaluate the derivative in one variable, or the gradient in several variables, at the current point.
- Step downhill. Move in the direction opposite the gradient, scaled by a step size often called the learning rate.
- Repeat until stable. Keep updating until the change becomes small enough or another stopping rule is met.
In one variable, the update rule is
$$x_{k+1} = x_k - \eta \, f'(x_k)$$
and in several variables it becomes
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \, \nabla f(\mathbf{x}_k)$$
where $\eta$ is the learning rate. The gradient points uphill, so for minimization the natural local move is the opposite direction. That local rule does not guarantee the best possible answer: on a convex function, gradient descent can lead to the global minimum, but on a non-convex function it may settle at a local minimum, a flat region, or another stationary point.
The stopping rule in the last step typically fires when the gradient is small, the updates become tiny, or a preset iteration limit is reached. The standard update assumes the objective is differentiable at the points where you apply it; nonsmooth problems call for subgradient methods, which are a different setup.
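The four steps and the stopping rules can be put together in a few lines. What follows is a minimal sketch, not a reference implementation: the function name `gradient_descent`, its default tolerance and iteration limit, and the example objective are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def gradient_descent(
    grad: Callable[[Vector], Vector],
    start: Vector,
    learning_rate: float,
    tol: float = 1e-8,
    max_iters: int = 10_000,
) -> Tuple[Vector, int]:
    """Run plain gradient descent; return (final point, updates performed)."""
    x = list(start)                      # choose a starting point
    for k in range(max_iters):
        g = grad(x)                      # compute the gradient at the current point
        grad_norm = sum(gi * gi for gi in g) ** 0.5
        if grad_norm < tol:              # stopping rule: the gradient is small
            return x, k
        # step downhill: move against the gradient, scaled by the learning rate
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x, max_iters                  # stopping rule: iteration limit reached


# Hypothetical objective f(x, y) = (x - 1)^2 + (y + 2)^2, minimized at (1, -2).
def example_grad(p: Vector) -> Vector:
    return [2.0 * (p[0] - 1.0), 2.0 * (p[1] + 2.0)]


point, iters = gradient_descent(example_grad, [0.0, 0.0], learning_rate=0.1)
print(point, iters)
```

The sketch stops on the gradient norm or the iteration limit; a check on the size of the update or on the change in the objective could be added in the same place.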
The learning rate is the step size, and it decides the character of the whole run. If the learning rate is too small, gradient descent usually moves in the right direction but can be painfully slow. If it is too large, the updates can overshoot the minimum, bounce back and forth, or even diverge. In a quadratic function the slope gets steeper away from the minimum, so a step size that seems safe in one place can be too aggressive in another.
Running The Procedure On A Quadratic
Consider
This function has its minimum at . Its derivative is
Use gradient descent with learning rate and starting point . Then the update rule is
Starting from :
Then
and
Each step moves closer to the minimum, and the function value decreases each time. That is the main pattern to notice: gradient descent does not jump straight to the answer. It improves the estimate by repeated local corrections.
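The same pattern can be reproduced in code. The sketch below uses assumed values throughout: the quadratic f(x) = (x - 3)^2, the learning rate 0.1, and the starting point 0 are hypothetical choices for illustration and are not claimed to be the numbers of the worked example.

```python
def f(x: float) -> float:
    """Hypothetical quadratic with its minimum at x = 3."""
    return (x - 3.0) ** 2


def df(x: float) -> float:
    """Derivative of the hypothetical quadratic."""
    return 2.0 * (x - 3.0)


learning_rate = 0.1   # assumed step size
x = 0.0               # assumed starting point
history = [(x, f(x))]

for step in range(1, 4):
    x = x - learning_rate * df(x)   # the gradient descent update
    history.append((x, f(x)))
    print(f"step {step}: x = {x:.4f}, f(x) = {f(x):.4f}")
```

Under these assumed values the iterates are 0.6, 1.08, and 1.464: each update shrinks the distance to the minimizer by a factor of 0.8, so the function value falls at every step without ever landing on the minimum exactly.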
Variants You Will Meet In Practice
The basic procedure stays the same across variants. What changes is how much data each gradient computation uses.
Batch Gradient Descent
Batch gradient descent uses the full dataset to compute each update. For a fixed objective, this gives a deterministic step, but it can be expensive when the dataset is large.
Stochastic Gradient Descent
Stochastic gradient descent updates using one sample at a time. Each step is cheaper and noisier. That noise can help the method keep moving, but it also makes the path less smooth.
Mini-Batch Gradient Descent
Mini-batch gradient descent uses a small group of samples per step. This is often a practical compromise because it reduces noise compared with pure stochastic updates while staying much cheaper than full-batch updates.
These variants matter most in machine learning, where the objective is often an average loss over many training examples.
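One way to see that the three variants share a single loop is to make the batch size a parameter. The sketch below is an illustration under stated assumptions: the tiny noise-free dataset, the one-parameter model y ≈ w * x, the learning rate, and the names `batch_gradient` and `train` are all hypothetical.

```python
import random
from typing import List

# Hypothetical noise-free data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def batch_gradient(w: float, idx: List[int]) -> float:
    """Gradient of the mean squared error of y ~ w * x over the chosen samples."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)


def train(batch_size: int, epochs: int = 200, lr: float = 0.01, seed: int = 0) -> float:
    """Run gradient descent where each update uses `batch_size` samples."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                       # visit samples in random order
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            w -= lr * batch_gradient(w, idx)     # one update per batch
    return w


w_batch = train(batch_size=len(xs))  # full dataset per update: batch
w_sgd = train(batch_size=1)          # one sample per update: stochastic
w_mini = train(batch_size=2)         # small group per update: mini-batch
print(w_batch, w_sgd, w_mini)
```

Because the assumed data has no noise, all three settings approach the same slope. On real data the single-sample run would keep fluctuating around the minimizer unless the learning rate is reduced over time.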
Where Runs Go Wrong, And How To Check
Treating the learning rate as cosmetic
Changing the learning rate changes the behavior of the algorithm itself. Self-check: a method that converges for one learning rate may fail for another, so test before trusting it.
Assuming gradient descent always finds the global minimum
That conclusion needs conditions. Self-check: convexity gives much stronger guarantees than a general non-convex landscape.
Ignoring feature scale in applied problems
With badly scaled variables, one direction can change much faster than another, so gradient descent zigzags and converges slowly. Self-check: rescale or reformulate the problem if you see zigzagging.
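The cost of bad scaling can be measured by counting updates. The sketch below is illustrative only: the objective f(a, b) = a^2 + 100 b^2, the substitution c = 10 b, the learning rates, and the helper name `steps_to_converge` are assumptions.

```python
from typing import Callable, List

Vector = List[float]


def steps_to_converge(
    grad: Callable[[Vector], Vector],
    start: Vector,
    lr: float,
    tol: float = 1e-6,
    max_iters: int = 100_000,
) -> int:
    """Count updates until the gradient norm drops below tol."""
    x = list(start)
    for k in range(max_iters):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return k
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return max_iters


# Badly scaled hypothetical objective: f(a, b) = a^2 + 100 * b^2.
def grad_bad(p: Vector) -> Vector:
    return [2.0 * p[0], 200.0 * p[1]]


# Same objective after substituting c = 10 * b: f(a, c) = a^2 + c^2.
def grad_scaled(p: Vector) -> Vector:
    return [2.0 * p[0], 2.0 * p[1]]


# The steep b direction forces the learning rate below 0.01 to stay stable,
# which makes progress along the shallow a direction very slow.
bad_steps = steps_to_converge(grad_bad, [1.0, 1.0], lr=0.009)

# The point (a, b) = (1, 1) becomes (a, c) = (1, 10) after rescaling.
scaled_steps = steps_to_converge(grad_scaled, [1.0, 10.0], lr=0.4)

# One update on the badly scaled problem flips the sign of b: the zigzag.
b_after_one_step = 1.0 - 0.009 * grad_bad([1.0, 1.0])[1]

print(bad_steps, scaled_steps, b_after_one_step)
```

In this assumed setup the rescaled problem tolerates a much larger learning rate, so it needs far fewer updates to reach the same gradient tolerance.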
Refusing to stop until the gradient is exactly zero
Numerical algorithms rarely wait for a perfect zero. Self-check: use a practical stopping rule based on the gradient norm, parameter change, or objective change.
When Gradient Descent Is Used
Gradient descent is used in numerical optimization, statistics, and machine learning, especially when an exact closed-form solution is unavailable or too expensive to compute directly. For small problems with simple formulas, calculus may give the minimum exactly; gradient descent becomes more useful when the parameter space is large, the objective has many variables, or the loss comes from large datasets.
A revealing exercise: run the procedure on from , once with and once with . Seeing one stable run and one unstable run teaches the role of the learning rate far better than the formula alone.
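A comparison of this kind takes only a few lines to set up. The sketch below uses hypothetical values chosen for illustration: the function f(x) = x^2, the starting point 1, and the learning rates 0.1 and 1.1 are assumptions.

```python
from typing import List


def run(learning_rate: float, start: float = 1.0, steps: int = 20) -> List[float]:
    """Gradient descent on the hypothetical f(x) = x^2, whose derivative is 2x."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x   # update: x minus learning rate times f'(x)
        path.append(x)
    return path


stable = run(0.1)     # each update multiplies x by 1 - 2 * 0.1 = 0.8
unstable = run(1.1)   # each update multiplies x by 1 - 2 * 1.1 = -1.2

print(abs(stable[-1]), abs(unstable[-1]))
```

With the smaller rate the iterate shrinks toward the minimum at 0. With the larger rate it flips sign on every update and grows in magnitude, which is the overshoot-and-diverge behavior in its simplest form.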
Frequently Asked Questions
- What is gradient descent in simple terms?
- Gradient descent is an algorithm for minimizing a differentiable function by taking repeated steps in the direction that decreases it most locally. The core loop is: compute the slope at the current point, move a little downhill, and repeat. Each update subtracts the learning rate times the derivative or gradient from the current point.
- What happens if the learning rate is too large or too small?
- The learning rate is the step size. If it is too small, gradient descent usually moves in the right direction but progresses painfully slowly. If it is too large, updates can overshoot the minimum, bounce back and forth, or even diverge. A step size that seems safe in one region can be too aggressive where the slope is steeper.
- Does gradient descent always find the global minimum?
- No. On a convex function, gradient descent can lead to the global minimum. On a non-convex function, the local downhill rule may settle at a local minimum, a flat region, or another stationary point. The method only uses local slope information, so it has no built-in way to guarantee the best possible answer everywhere.
- When does gradient descent stop?
- Common stopping rules are: the gradient becomes small, the updates become tiny, or a preset iteration limit is reached. In practice these conditions signal that the algorithm is no longer making meaningful progress. The method assumes the objective is differentiable at the points where the update rule is applied.