⚡ Gradient descent is the optimisation algorithm that trains virtually every machine learning model. It measures how wrong the model is (the loss), calculates which direction to move the weights to reduce that error (the gradient), and takes a small step in that direction. Repeated millions of times, this simple process — always step downhill — is how neural networks learn.
Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read
Gradient Descent — How AI Learns by Rolling Downhill Toward the Right Answer
What is Gradient Descent?
Imagine you are blindfolded in a hilly landscape and your goal is to reach the lowest point — the valley. You cannot see anything. But you can feel the slope under your feet. At every step you feel which direction is downhill and take one step that way. Eventually, step by step, you reach the valley floor.
That is gradient descent. The landscape is the loss function: a mathematical surface where the height at every point represents how wrong the model is with those particular weights. The valley is the minimum loss, where the model fits the data best. Each step is a weight update. The gradient points in the direction in which the loss increases fastest, so you move in the opposite direction, downhill.
The brilliant thing is that this works in spaces of millions or billions of dimensions. A neural network with 1 billion parameters has a loss landscape in a 1-billion-dimensional space. Gradient descent navigates it using the same simple principle: always step downhill.
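In symbols, the downhill step described above is usually written as the update rule below. The notation is an assumption added for illustration: w_t stands for the weights after t steps, L for the loss function, and \eta for the step size (the learning rate).

```latex
w_{t+1} = w_t - \eta \, \nabla_w L(w_t)
```

The minus sign is the "opposite direction" part: the gradient points uphill, so subtracting it moves the weights downhill.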
How Gradient Descent works
- Initialise weights randomly — start at a random point in the loss landscape.
- Make a forward pass — compute the model’s prediction on a batch of data.
- Compute the loss — measure how wrong the prediction is using the loss function.
- Compute the gradient — use backpropagation to calculate the slope of the loss with respect to every weight.
- Update weights — subtract a small fraction of the gradient from each weight. The fraction is the learning rate.
- Repeat — thousands to millions of times until the loss converges.
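The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production training loop: it assumes a linear model with mean squared error, the function name `train_linear_regression` and the synthetic data are hypothetical, and the gradient is derived by hand (for deep networks, backpropagation computes it automatically).

```python
import numpy as np

def train_linear_regression(X, y, learning_rate=0.1, steps=500, seed=0):
    """Fit y ~ X @ w + b with full-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # 1. Initialise weights randomly (small values near zero).
    w = rng.normal(scale=0.01, size=n_features)
    b = 0.0
    losses = []
    for _ in range(steps):  # 6. Repeat.
        # 2. Forward pass: compute predictions.
        pred = X @ w + b
        # 3. Loss: mean squared error.
        error = pred - y
        losses.append(np.mean(error ** 2))
        # 4. Gradient of the loss with respect to each parameter.
        grad_w = 2.0 * X.T @ error / n_samples
        grad_b = 2.0 * np.mean(error)
        # 5. Update: step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses

# Hypothetical synthetic data generated from known weights [3, -2] and bias 0.5.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.5
w, b, losses = train_linear_regression(X, y)
print(w, b, losses[-1])
```

Because the data here is noise-free and the loss is convex, the recovered `w` and `b` should land very close to the values used to generate the data.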
Variants
Full-batch gradient descent: compute gradient over all training data before each update. Most accurate gradient but impractically slow for large datasets.
Stochastic Gradient Descent (SGD): compute gradient on one example at a time. Fast but very noisy — the gradient estimate is rough. The noise can actually help escape local minima.
Mini-batch gradient descent: compute gradient on a small batch (32–512 examples). The standard in practice. Balances accuracy of gradient estimate with computational efficiency.
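The three batching variants above differ only in how many examples feed each gradient estimate, so a single loop with a `batch_size` argument can cover all of them. The sketch below assumes a linear model with mean squared error; the function name `minibatch_gd`, the data, and the hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, learning_rate=0.05, epochs=100, seed=0):
    """Gradient descent on mean squared error using batches of `batch_size` examples."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        # Shuffle once per epoch so each batch is a random subset.
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            # Gradient estimated from this batch only.
            grad = 2.0 * X[idx].T @ error / len(idx)
            w -= learning_rate * grad
    return w

# Hypothetical noisy data generated from known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w_full = minibatch_gd(X, y, batch_size=len(X))  # full-batch: 1 update per epoch
w_sgd = minibatch_gd(X, y, batch_size=1)        # SGD: 256 noisy updates per epoch
w_mini = minibatch_gd(X, y, batch_size=32)      # mini-batch: 8 updates per epoch
```

All three runs should end near `true_w` on this small problem. The SGD result typically jitters slightly around the solution because each update sees only one example.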
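Adam (Adaptive Moment Estimation): a common default optimiser for deep learning. Unlike the three variants above, which differ in how much data each gradient uses, Adam changes the update rule itself. It adapts the step size per parameter based on gradient history, so parameters with consistently large gradients take smaller effective steps and parameters with small gradients take larger ones. It often converges faster than plain SGD and needs less learning-rate tuning, although well-tuned SGD with momentum can match it on some tasks.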
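A minimal NumPy sketch of the Adam update may make the per-parameter adaptation concrete. It uses the commonly cited default values for `beta1`, `beta2`, and `eps`; the function name `adam_minimise` and the badly scaled quadratic loss are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def adam_minimise(grad_fn, w0, learning_rate=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Minimise a function given its gradient, using the Adam update rule."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # running average of gradients (first moment)
    v = np.zeros_like(w)  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Each parameter's step is scaled by its own gradient history.
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Hypothetical loss: 0.5 * (100 * w[0]**2 + 0.01 * w[1]**2), very different curvatures.
scales = np.array([100.0, 0.01])

def grad(w):
    return scales * w

w_adam = adam_minimise(grad, w0=[1.0, 1.0])

# Plain gradient descent with the same learning rate, for comparison.
w_gd = np.array([1.0, 1.0])
for _ in range(2000):
    w_gd -= 0.01 * scales * w_gd
```

On this contrived problem Adam moves both coordinates toward zero at a similar pace, while plain gradient descent with the same learning rate barely moves the flat coordinate `w_gd[1]`. This illustrates the mechanism only; it is not evidence that Adam wins on every problem.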
Real-world examples
Not theory — what real teams actually shipped using this technique.
- Training GPT-4 relied on gradient-based optimisation at very large scale. OpenAI has not published the model's parameter count, training duration, or hardware, but each update is still a tiny step through an enormously complex loss landscape, and the resulting model scored around the top 10% of test takers on a simulated bar exam according to OpenAI's technical report.
- AlphaGo used gradient descent to train its policy and value networks, first on games played by human experts and then on millions of self-play games, with each batch of games producing gradient updates that nudged the networks toward better Go play.
- A simple linear regression trained with gradient descent on house prices converges to the optimal weights in under a second on a laptop. It is the same algorithm applied to a problem that is orders of magnitude simpler.
Common pitfalls
- Learning rate too high — the model overshoots the minimum, oscillates, or diverges entirely. Loss starts increasing instead of decreasing.
- Learning rate too low — training is extremely slow. The model makes progress but takes far longer than necessary.
- Local minima — gradient descent can get stuck in a local minimum rather than the global minimum. In practice, for deep neural networks, local minima are rarely a serious problem — saddle points are more common, and SGD noise helps escape them.
- Vanishing gradients: in very deep networks, gradients shrink as they propagate backwards, so early layers barely update. Mitigated by ReLU activations, batch normalisation, and residual connections.
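The two learning-rate pitfalls above are easy to reproduce on a toy problem. The sketch below assumes the one-dimensional loss w², whose gradient is 2w; the function name `run_gd` and the specific learning rates are hypothetical, chosen only to show the three regimes.

```python
def run_gd(learning_rate, steps=50, w0=1.0):
    """Minimise loss(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

# Each step multiplies w by (1 - 2 * learning_rate).
too_high = run_gd(1.1)     # factor -1.2: overshoots further every step and diverges
reasonable = run_gd(0.1)   # factor 0.8: shrinks steadily toward the minimum at 0
too_low = run_gd(0.0001)   # factor 0.9998: has barely moved after 50 steps
print(too_high, reasonable, too_low)
```

For this loss, any learning rate above 1.0 makes the multiplying factor larger than 1 in magnitude, which is why the first run blows up instead of settling.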
Frequently asked questions
QUESTION 1 What is gradient descent in simple terms?
ANSWER 1 Navigating a mountain blindfolded — feel the slope, step downhill, repeat until you reach the valley. In ML, the mountain is the loss function, and each step adjusts the model’s weights.
QUESTION 2 What is the learning rate?
ANSWER 2 How large each step is. Too large overshoots the minimum. Too small takes forever. Finding the right learning rate is one of the most important training decisions.
QUESTION 3 What is the difference between SGD, mini-batch, and full-batch?
ANSWER 3 Full batch: accurate but slow. SGD: one example, fast but noisy. Mini-batch: small batch, balances both — the standard in practice.
QUESTION 4 What is the Adam optimiser?
ANSWER 4 A gradient descent variant that adapts the step size per parameter based on gradient history. A common default for deep learning. It often converges faster than plain SGD and needs less learning-rate tuning.