The learning rate controls how large each weight update step is during gradient descent. Too high, and the model overshoots the optimal weights and training diverges. Too low, and training takes forever or gets stuck. Choosing it well is often the single most impactful hyperparameter decision when training a neural network. Standard starting point: 1e-3 for the Adam optimiser.

Category: Machine Learning · Difficulty: Beginner · Last updated: 15 May 2026 · 4 min read


Learning Rate — The Single Most Important Setting in Neural Network Training

What is Learning Rate?

Gradient descent navigates a loss landscape by always stepping downhill. But how big is each step? That is the learning rate.

Imagine learning to park a car. You turn the wheel a little, check your position, turn a bit more. This is a low learning rate — small, careful adjustments. You could also spin the wheel fully every time — a high learning rate — but you would wildly overshoot your parking spot every time. The right technique is confident but measured: enough movement to make progress, small enough to stop when you have reached the target.

Neural network training is the same. The gradient tells the model which direction to move. The learning rate tells it how far. A single number — often something like 0.001 — determines whether your model learns well or fails to train at all.
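To make "direction" and "how far" concrete, here is a minimal sketch of one plain gradient descent update. The toy loss (w - 3)² and the names sgd_step, loss and grad are illustrative assumptions, not part of any library.

```python
def sgd_step(weight, gradient, learning_rate):
    """One plain gradient descent update: move against the gradient."""
    return weight - learning_rate * gradient


def loss(w):
    # Toy loss with its minimum at w = 3 (an assumption for illustration).
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of (w - 3)**2 with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0
for step in range(5):
    w = sgd_step(w, grad(w), learning_rate=0.1)
    print(f"step {step + 1}: w = {w:.4f}, loss = {loss(w):.4f}")
```

With a rate of 0.1, each step closes 20% of the remaining distance to the minimum, so w creeps from 0 towards 3 without ever jumping past it.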

THE GOLDILOCKS PROBLEM

Too high: weight updates overshoot the minimum. Loss oscillates or explodes. Training diverges — you will see loss increasing after initially decreasing, or loss becoming NaN (not a number). The model never converges.

Too low: steps are so small that training takes impractically long. The model makes progress, but very slowly. It may also get stuck in poor local minima or on flat plateaus because it cannot take large enough steps to escape.

Just right: loss decreases smoothly and consistently. The model converges to a good solution within a reasonable number of steps.
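The three regimes are easy to reproduce on a toy problem. The sketch below minimises f(w) = w² with three hand-picked rates. The specific values (1.1, 0.001, 0.3) are assumptions chosen to make the behaviour obvious on this function and are not recommendations for real models.

```python
def run_gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimise f(w) = w**2 and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        w = w - learning_rate * (2.0 * w)  # gradient of w**2 is 2w
        losses.append(w ** 2)
    return losses


for name, lr in [("too high", 1.1), ("too low", 0.001), ("reasonable", 0.3)]:
    losses = run_gradient_descent(lr)
    print(f"{name:>10} (lr={lr}): first loss {losses[0]:.3g}, final loss {losses[-1]:.3g}")
```

On this function each update multiplies w by (1 - 2 × learning rate). At 1.1 that factor is -1.2, so w flips sign and grows every step. At 0.001 it is 0.998, so the loss has barely moved after 50 steps. At 0.3 it is 0.4, and the loss collapses towards zero.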

The optimal learning rate depends on the model architecture, dataset, batch size, and optimiser — there is no universal answer. This is why learning rate is the first hyperparameter to tune.

LEARNING RATE SCHEDULES

Fixed learning rate — simplest approach. Same rate throughout training. Often suboptimal but a useful baseline.

Step decay — reduce the learning rate by a fixed factor (e.g. divide by 10) at specified epochs. Allows fast initial learning followed by fine-grained convergence.

Cosine annealing — smoothly decrease the learning rate following a cosine curve from the initial value to near zero. Produces good results on many tasks; widely used in deep learning.

Warmup — start with a very low learning rate, gradually increase to the target rate over the first N steps, then decay. Used in transformer training because large models are unstable with high learning rates early in training when weights are poorly initialised.

Cyclical learning rates — oscillate between a minimum and maximum learning rate. Can help escape local minima and often finds better final solutions than monotonically decreasing schedules.
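Each schedule is just a function from the current step (or epoch) to a learning rate. The sketch below writes three of them as plain Python functions. The function names and default values (drop_every=30, half_cycle=100) are assumptions for illustration. Deep learning frameworks ship their own scheduler classes with different interfaces.

```python
import math


def step_decay(base_lr, epoch, drop_every=30, factor=0.1):
    """Multiply the rate by `factor` once every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)


def cosine_annealing(base_lr, step, total_steps, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def triangular_cyclical(min_lr, max_lr, step, half_cycle=100):
    """Ramp linearly from min_lr up to max_lr and back down, repeating."""
    position = step % (2 * half_cycle)
    # 1.0 at the start and end of a cycle, 0.0 at the peak in the middle.
    distance_from_peak = abs(position - half_cycle) / half_cycle
    return min_lr + (max_lr - min_lr) * (1.0 - distance_from_peak)


for epoch in (0, 29, 30, 60):
    print(f"step decay, epoch {epoch}: {step_decay(0.1, epoch):.4g}")
for step in (0, 500, 1000):
    print(f"cosine, step {step}: {cosine_annealing(1e-3, step, 1000):.4g}")
for step in (0, 50, 100, 200):
    print(f"cyclical, step {step}: {triangular_cyclical(1e-4, 1e-3, step):.4g}")
```

A fixed learning rate is the degenerate case: a function that ignores the step and always returns the same number.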

Real-world examples

Not theory — what real teams actually shipped using this technique.

  • BERT was pre-trained with a learning rate of 1e-4 with warmup over the first 10,000 steps and linear decay to zero — a schedule that became the standard template for fine-tuning transformers.
  • GPT-3 was trained with cosine learning rate decay after a short linear warmup. Peak rates in the model family ranged from 6×10⁻⁴ for the smallest model down to 0.6×10⁻⁴ for the 175B-parameter model, in line with the paper's observation that larger models generally need smaller learning rates.
  • A common ML bug: training a neural network with learning rate 1.0 (about 1000x the usual 1e-3 starting point for Adam), watching loss explode within the first few steps, and spending days debugging the data pipeline when the fix is changing one number.
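The BERT-style schedule from the first example (linear warmup to a peak, then linear decay to zero) can be sketched in a few lines. The peak rate and the 10,000 warmup steps come from the text above. The total of 1,000,000 steps is an assumption added so the function is complete, and warmup_linear_decay is a hypothetical helper name, not a library function.

```python
def warmup_linear_decay(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0.

    total_steps is an assumed value for illustration; set it to the real
    length of your training run.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(f"step {step:>9,}: lr = {warmup_linear_decay(step):.2e}")
```

The rate climbs for the first 1% of training, peaks exactly at the end of warmup, and then shrinks steadily until it reaches zero on the final step.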

Common pitfalls

  • Not the same as step size in every context — with Adam optimiser, the effective step size per parameter depends on the gradient history, not just the learning rate. The rate you set is a global scaling factor, not a direct step size.
  • Batch size and learning rate interact: larger batches give less noisy gradient estimates, so a rate that worked at one batch size is rarely the best choice at another. Linear scaling rule (a heuristic, established mainly for SGD): if you double batch size, double learning rate (with appropriate warmup).
  • Different layers may need different rates — when fine-tuning pretrained models, earlier layers (general features) often benefit from lower learning rates than later layers (task-specific features). Layer-wise learning rate decay implements this.
  • Learning rate finder is not magic — the LR finder gives a range of viable rates, not the single optimal rate. Use it as a starting point, not a final answer.
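Three of these pitfalls are easier to see in code. The sketch below computes the size of Adam's very first update for a single parameter, applies the linear scaling rule, and builds a layer-wise decayed set of rates. All three helper names are hypothetical, the Adam defaults shown are the commonly used ones, and the decay factor of 0.9 is an assumption for illustration.

```python
import math


def adam_first_step(gradient, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update for one parameter (moments start at zero)."""
    m = (1 - beta1) * gradient           # first moment estimate
    v = (1 - beta2) * gradient ** 2      # second moment estimate
    m_hat = m / (1 - beta1)              # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return learning_rate * m_hat / (math.sqrt(v_hat) + eps)


def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: scale the rate by the same factor as the batch."""
    return base_lr * new_batch_size / base_batch_size


def layerwise_lrs(top_lr, num_layers, decay=0.9):
    """One rate per layer: index 0 is the earliest layer, the last is the top."""
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Gradients four orders of magnitude apart produce almost the same step.
print(adam_first_step(0.01), adam_first_step(100.0))
print(linearly_scaled_lr(0.1, base_batch_size=256, new_batch_size=512))
print(layerwise_lrs(1e-4, num_layers=4, decay=0.5))
```

The first line of output is the point of the Adam pitfall: on the first update the step is close to the learning rate itself whether the gradient is 0.01 or 100, because Adam divides by a running estimate of the gradient's magnitude. Later steps depend on the gradient history, so this is only a snapshot of the first one.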

Frequently asked questions

QUESTION 1 What is the learning rate in simple terms?

ANSWER 1 How big each weight update step is during training. Too high, and training overshoots and diverges. Too low, and training is impractically slow or stalls. It is usually the most impactful hyperparameter to tune.

QUESTION 2 What happens if the learning rate is too high?

ANSWER 2 Weight updates overshoot the minimum. Loss oscillates or explodes. Training diverges — loss increases instead of decreasing. The model never converges.

QUESTION 3 What is a learning rate schedule?

ANSWER 3 Changing the learning rate during training, typically starting high for fast progress and reducing to allow fine-grained convergence. Warmup, cosine annealing, and step decay are common examples.

QUESTION 4 What learning rate should I start with?

ANSWER 4 1e-3 (0.001) for Adam on most tasks. Roughly 5e-5 to 1e-4 for fine-tuning transformers. Use a learning rate finder to validate empirically on your specific setup.
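For readers curious what a learning rate finder does under the hood, here is a rough sketch of the idea on the toy function f(w) = w²: raise the rate exponentially, take one update per rate, and stop once the loss blows up. The function name lr_range_test, the sweep bounds, and the blow-up threshold of 4x are all assumptions for illustration. Real finders run on mini-batches of your actual model and usually smooth the noisy loss curve.

```python
def lr_range_test(min_lr=1e-4, max_lr=10.0, num_steps=50, start=5.0, blowup=4.0):
    """Sweep the learning rate upward on f(w) = w**2, one update per rate.

    Returns (learning_rate, loss) pairs, stopping once the loss exceeds
    `blowup` times the best loss seen so far.
    """
    growth = (max_lr / min_lr) ** (1.0 / (num_steps - 1))
    w, lr, best = start, min_lr, float("inf")
    history = []
    for _ in range(num_steps):
        w = w - lr * (2.0 * w)        # gradient of w**2 is 2w
        loss = w ** 2
        history.append((lr, loss))
        best = min(best, loss)
        if loss > blowup * best:      # loss has clearly started to diverge
            break
        lr *= growth
    return history


history = lr_range_test()
best_lr, best_loss = min(history, key=lambda pair: pair[1])
print(f"sweep stopped at lr = {history[-1][0]:.3g}")
print(f"lowest loss seen at lr = {best_lr:.3g}")
```

The sweep gives a range, not an answer: rates well below the blow-up point are viable, and a common heuristic is to start somewhat below the rate where the loss was lowest rather than right at it.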

