Learning rate as step size: fragile knob

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

The learning rate $\eta$ controls how far each descent step moves in weight space. Too small and training crawls: thousands of tiny steps may never reach a good basin before your compute budget expires. Too large and updates overshoot narrow valleys, bouncing across the surface or diverging outright .

Picture a steep-walled ravine whose floor curves gently toward a minimum. A large $\eta$ jumps from one wall to the other each iteration, oscillating instead of sliding along the floor. A tiny $\eta$ makes progress along the floor but wastes steps on the walls. This instability is the discrete-time cousin of trying to integrate a continuous gradient flow with an oversized time step .

Practitioners rarely keep $\eta$ fixed forever. Learning-rate schedules decay or warm up the step size; line search methods probe a few points along the descent direction before committing. Adaptive optimizers (next card) rescale effective steps per parameter. None of these remove the core tradeoff: step size is the most fragile hyperparameter in vanilla SGD .

On MNIST-scale problems, practitioners often bracket $\eta$ with coarse grids (for example $0.01$, $0.1$, $1.0$), then refine when validation loss plateaus. Safe ranges shift with network depth, batch size, and weight initialization scale, so copying one paper's learning rate without context frequently fails .

University approvals: 0
Related cards
Video Content
Tasks
Question 1

Using too large a learning rate in a narrow valley tends to:

Hint

Skim the paragraphs on Using large learning rate narrow in Learning rate as step size before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Line-search methods choose a step size by:

Hint

Skim the paragraphs on Line search methods choose step in Learning rate as step size before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

A learning-rate warmup gradually:

Hint

Skim the paragraphs on learning rate warmup gradually in Learning rate as step size before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why might the same learning rate behave differently when you multiply the batch size by 4?

Hint

Skim the paragraphs on might the same learning rate behave differently when in Learning rate as step size before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Beginner
  • Completed: 0 users
Creator
Best
Best
BestBuddy