Momentum and adaptive methods (conceptual)

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Plain SGD uses only the current batch gradient. Momentum maintains a velocity vector $\mathbf{v}$ that accumulates past gradients with exponential decay, then updates the weights along $\mathbf{v}$ rather than along the raw gradient alone. In narrow ravines, gradient components that flip sign from step to step cancel while consistent downhill components reinforce, much like a ball gaining speed down a valley.
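To make the velocity idea concrete, here is a minimal NumPy sketch comparing plain gradient descent with momentum on a toy ravine. The quadratic objective, the function names, the learning rate, and the decay value 0.9 are illustrative assumptions, not part of the card. The update uses the convention $\mathbf{v} \leftarrow \beta \mathbf{v} + \mathbf{g}$; other texts scale the gradient by $(1 - \beta)$ or fold the learning rate into $\mathbf{v}$.

```python
import numpy as np

def grad(w):
    # Gradient of an assumed toy objective f(w) = 0.5 * (100 * w[0]**2 + w[1]**2):
    # a narrow ravine, steep along w[0] and shallow along w[1].
    return np.array([100.0 * w[0], 1.0 * w[1]])

def sgd(w0, lr, steps):
    # Plain gradient descent: each step uses only the current gradient.
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def sgd_momentum(w0, lr, beta, steps):
    # Momentum: v is an exponentially decayed sum of past gradients,
    # and the weights move along v instead of the raw gradient.
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

# Same start, learning rate, and step budget for both methods (all assumed values).
w_plain = sgd([1.0, 1.0], lr=0.015, steps=100)
w_mom = sgd_momentum([1.0, 1.0], lr=0.015, beta=0.9, steps=100)

print("distance to minimum, plain SGD:", np.linalg.norm(w_plain))
print("distance to minimum, momentum: ", np.linalg.norm(w_mom))
```

On this particular problem the momentum run should end noticeably closer to the minimum at the origin, because progress along the shallow direction compounds across steps. The gap depends on the chosen values and is not a general guarantee.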

Adaptive methods rescale steps per parameter. AdaGrad divides each step by the square root of a growing sum of squared past gradients, shrinking steps for coordinates that receive frequent or large gradients. Adam combines a momentum-style moving average of gradients (the first moment) with a moving average of squared gradients (the second moment), giving each weight its own effective learning rate.
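The per-parameter rescaling can be sketched in a few lines of NumPy. The function names, the dictionary-based state, and the default hyperparameter values below are assumptions chosen for illustration; library implementations differ in details such as where epsilon is added.

```python
import numpy as np

def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    # AdaGrad sketch: accumulate squared gradients per coordinate and
    # divide the step by the square root of that growing sum.
    state["sum_sq"] = state["sum_sq"] + g * g
    return w - lr * g / (np.sqrt(state["sum_sq"]) + eps)

def adam_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam sketch: moving averages of the gradient (m) and squared gradient (v),
    # each bias-corrected for its zero initialisation.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Hypothetical gradient whose two coordinates differ in scale by a factor of 10,000.
g = np.array([100.0, 0.01])

# AdaGrad: apply the same gradient twice and compare step sizes.
ada_state = {"sum_sq": np.zeros(2)}
w_a0 = np.zeros(2)
w_a1 = adagrad_step(w_a0, g, ada_state)
w_a2 = adagrad_step(w_a1, g, ada_state)
adagrad_first = np.abs(w_a1 - w_a0)
adagrad_second = np.abs(w_a2 - w_a1)

# Adam: a single step from zero-initialised state.
adam_state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
w_adam = adam_step(np.zeros(2), g, adam_state)

print("AdaGrad step sizes:", adagrad_first, adagrad_second)
print("Adam first step:   ", w_adam)
```

Despite the large difference in gradient scale, both coordinates move by roughly the same amount under either method, and the second AdaGrad step is smaller than the first because the accumulated sum has grown.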

These optimizers are workhorses in practice, yet they introduce new hyperparameters (decay rates, epsilon stabilizers, weight-decay coupling). This card foregrounds intuition; production training stacks add gradient clipping, warmup, and schedule tweaks on top.
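As a rough illustration of two of those add-ons, the sketch below shows gradient clipping by norm and a linear learning-rate warmup. The helper names `clip_by_norm` and `warmup_lr`, the thresholds, and the exact warmup shape are assumptions; real training stacks implement many variants.

```python
import numpy as np

def clip_by_norm(g, max_norm):
    # Rescale the gradient so its Euclidean norm never exceeds max_norm.
    norm = np.linalg.norm(g)
    if norm > max_norm:
        return g * (max_norm / norm)
    return g

def warmup_lr(step, base_lr, warmup_steps):
    # Linear warmup: ramp the learning rate up to base_lr over
    # warmup_steps steps, then hold it constant.
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# A gradient of norm 5 is scaled down to norm 1; its direction is unchanged.
clipped = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)

# Learning rates for the first six steps with a four-step warmup.
lrs = [warmup_lr(s, base_lr=0.1, warmup_steps=4) for s in range(6)]

print(clipped)
print(lrs)
```

Clipping bounds the size of any single update, while warmup keeps early steps small while the optimizer's moving averages are still poorly estimated.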

Tracking moving averages of gradients therefore serves two roles: damping oscillation across ravine walls and accelerating motion along directions where the signal is consistent across iterations.
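A short worked calculation shows both roles at once. It assumes the velocity convention $v_t = \beta v_{t-1} + g_t$ with $v_0 = 0$ and looks at a single coordinate; the symbols $\beta$, $g_t$, and $a$ are introduced here for illustration. Unrolling the recursion gives

$$v_t = \sum_{k=1}^{t} \beta^{\,t-k} g_k .$$

If the gradient is consistent, $g_k = a$ for every step, the sum grows toward

$$v_t = a \, \frac{1 - \beta^{t}}{1 - \beta} \;\longrightarrow\; \frac{a}{1 - \beta},$$

whereas if it flips sign each step, $g_k = (-1)^k a$, the terms largely cancel:

$$|v_t| = a \, \frac{1 - (-\beta)^{t}}{1 + \beta} \;\longrightarrow\; \frac{a}{1 + \beta}.$$

With $\beta = 0.9$, the consistent direction is amplified to about $10a$, while the oscillating direction is damped to about $0.53a$.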

Nesterov momentum looks ahead along the velocity direction before evaluating the gradient, which often improves convergence in convex-like valleys. RMSProp and AdamW appear in modern vision and language stacks, but the intuition here carries over: smooth noisy directions and adapt step sizes.
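The look-ahead step can be sketched as follows. This is one common formulation, written with the velocity convention $\mathbf{v} \leftarrow \beta \mathbf{v} + \mathbf{g}$; the toy quadratic, the function names, and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def grad(w):
    # Gradient of an assumed toy ravine f(w) = 0.5 * (100 * w[0]**2 + w[1]**2).
    return np.array([100.0 * w[0], 1.0 * w[1]])

def nesterov(w0, lr, beta, steps):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        # Look ahead: where the existing velocity alone would carry the weights.
        lookahead = w - lr * beta * v
        # Evaluate the gradient at the look-ahead point, not at w itself.
        v = beta * v + grad(lookahead)
        w = w - lr * v
    return w

# Assumed values: start at (1, 1), learning rate 0.005, decay 0.9, 100 steps.
w_nag = nesterov([1.0, 1.0], lr=0.005, beta=0.9, steps=100)
print("distance to minimum, Nesterov:", np.linalg.norm(w_nag))
```

The only change from ordinary momentum is the point at which the gradient is evaluated; everything else in the update stays the same.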

Related cards
Builds on Local minima, saddles, and plateaus · Machine learning
Next Bridge to automatic differentiation next · Machine learning
Tasks
Question 1

Momentum modifies SGD by:

Hint

Skim the paragraph on momentum in Momentum and adaptive methods (conceptual) before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

AdaGrad-style methods divide by a growing sum of squared past gradients in order to:

Hint

Skim the paragraph on AdaGrad in Momentum and adaptive methods (conceptual) before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

The Adam optimizer combines:

Hint

Skim the paragraph on Adam in Momentum and adaptive methods (conceptual) before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why do practitioners track moving averages of gradients?

Hint

Skim the paragraph on moving averages of gradients in Momentum and adaptive methods (conceptual) before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Beginner