Stochasticity: minibatches approximate the full-data gradient

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Exact (full-batch) gradient descent recomputes $\nabla C$ from every training example at each step. For MNIST that is tolerable; for web-scale corpora it is not. Stochastic gradient descent (SGD) instead estimates the gradient from a minibatch: a small random subset of examples drawn at each iteration.
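As a minimal sketch of the difference, assume a toy one-parameter least-squares cost $C(w) = \frac{1}{2n}\sum_i (w x_i - y_i)^2$. The cost, the synthetic data, and the helper names `full_gradient` and `minibatch_gradient` are illustrative assumptions, not part of the card's digit example.

```python
import random

def full_gradient(w, xs, ys):
    """Gradient of C(w) = (1/2n) * sum((w*x - y)^2), using every example."""
    n = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n

def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Same formula, evaluated on a uniformly sampled subset of the data."""
    # Draw batch_size distinct indices uniformly at random (no replacement).
    idx = rng.sample(range(len(xs)), batch_size)
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size

# Synthetic data (assumption): y is roughly 3*x plus a little Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

g = full_gradient(0.0, xs, ys)                                    # touches 1000 examples
g_hat = minibatch_gradient(0.0, xs, ys, batch_size=32, rng=rng)   # touches 32 examples
print(g, g_hat)
```

The minibatch call does about 3% of the arithmetic of the full pass here, and returns an estimate that scatters around `g` from one draw to the next.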

If the full-data gradient is $\mathbf{g}$, a minibatch gradient $\hat{\mathbf{g}}$ is a noisy draw whose expectation equals $\mathbf{g}$ under uniform sampling. The noise variance shrinks as batch size grows, but wall-clock time per step rises because every example in the batch still needs a forward and backward pass.
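One way to make these two claims concrete is to assume that the cost is an average of per-example costs, $C = \frac{1}{n}\sum_{i=1}^{n} C_i$, and that the $B$ batch indices $i_1, \dots, i_B$ are drawn uniformly and independently (with replacement). The symbols $C_i$, $n$, $B$, and $\Sigma$ are notation introduced here for illustration:

$$
\hat{\mathbf{g}} = \frac{1}{B}\sum_{j=1}^{B} \nabla C_{i_j}, \qquad
\mathbb{E}[\hat{\mathbf{g}}] = \frac{1}{n}\sum_{i=1}^{n} \nabla C_i = \mathbf{g}, \qquad
\operatorname{Cov}[\hat{\mathbf{g}}] = \frac{\Sigma}{B}, \quad
\Sigma = \frac{1}{n}\sum_{i=1}^{n} (\nabla C_i - \mathbf{g})(\nabla C_i - \mathbf{g})^{\top}.
$$

Under these assumptions the typical size of the noise falls like $1/\sqrt{B}$, while the work per step grows roughly in proportion to $B$. Sampling without replacement, as in a shuffled epoch, changes the covariance by a finite-population factor but leaves the expectation unchanged.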

An epoch is one pass through the training set under whatever sampling schedule you use. Shuffling between epochs reduces order bias: without it, the same examples land in the same batches in the same order every epoch, so consecutive batches might systematically skew the noise direction. Batch size, in turn, trades the variance of the gradient estimate against the compute cost of each update.
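The epoch-and-shuffle schedule described above can be sketched as an index generator. The helper name `epoch_batches` and the tiny sizes are hypothetical, chosen only to keep the printout readable.

```python
import random

def epoch_batches(n_examples, batch_size, rng):
    """Yield index batches that cover every example exactly once, in a fresh random order."""
    order = list(range(n_examples))
    rng.shuffle(order)  # new permutation each time the generator is created
    for start in range(0, n_examples, batch_size):
        # The last batch is shorter when batch_size does not divide n_examples.
        yield order[start:start + batch_size]

rng = random.Random(0)
for epoch in range(2):
    batches = list(epoch_batches(10, 4, rng))
    print(epoch, batches)
```

Because the permutation is redrawn for each epoch, a given example is grouped with different neighbours from one pass to the next. That regrouping is what breaks up any ordering present in the stored dataset.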

Small-batch noise is not purely harmful. It can nudge iterates out of shallow basins and flat regions, acting like implicit regularization in some regimes. Large batches give cleaner gradient signals but may need learning-rate retuning to preserve similar training dynamics.

The digit-classification example processes batches of tens of images per step rather than all 60,000 MNIST points at once. That choice makes each epoch affordable while preserving enough noise that optimization does not freeze on the first shallow basin it encounters.
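To put rough numbers on that trade-off, the short calculation below counts parameter updates per epoch for a 60,000-example training set. The card does not state the exact batch size, so the sizes tried here (10, 32, 100) are illustrative assumptions.

```python
import math

N_TRAIN = 60_000  # MNIST training-set size quoted above

def updates_per_epoch(n_examples, batch_size):
    """Number of minibatch steps needed to visit every example once."""
    return math.ceil(n_examples / batch_size)

# Hypothetical batch sizes in the "tens of images" range.
for batch_size in (10, 32, 100):
    print(batch_size, updates_per_epoch(N_TRAIN, batch_size))
# prints:
# 10 6000
# 32 1875
# 100 600
```

Full-batch gradient descent would make a single update from the same pass over the data, so a minibatch schedule gets hundreds or thousands of noisier steps for roughly the same per-epoch compute.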

Related cards
Builds on Learning rate as step size: fragile knob · Machine learning
Next Local minima, saddles, and plateaus · Machine learning
Tasks
Question 1

Compared with full-batch gradient descent, SGD with small minibatches gives:

Hint

Reread the paragraphs in this card that compare full-batch gradient descent with minibatch SGD before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

One 'epoch' of training means:

Hint

Reread the paragraph in this card that defines an epoch before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Shuffling the data between epochs helps by:

Hint

Reread the paragraph in this card on shuffling between epochs before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

A 'minibatch' is:

Hint

Reread the paragraph in this card that introduces minibatches before choosing. Eliminate options that contradict a definition stated in the card.
