Training stability at scale

Intermediate · Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Training billion-parameter transformers is as much engineering as theory. AdamW, a standard optimizer for LLM pretraining, decouples weight decay from Adam's adaptive learning-rate scaling. Gradient clipping caps the global gradient norm before the update step, so occasional loss spikes do not blow up the weights in deep stacks.
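Global-norm clipping can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's implementation: the function name `clip_by_global_norm`, the nested-list representation of gradients, and the small `1e-6` stabilizer are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one shared factor so their joint L2 norm
    is at most max_norm. `grads` is a list of flat lists, one per tensor
    (a hypothetical stand-in for real parameter gradients)."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: gradients already under the cap pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Two "tensors" whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))
```

Because every tensor is multiplied by the same factor, the update direction is preserved and only its length changes.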

Mixed precision (FP16 or BF16) cuts memory use and increases throughput. Loss scaling multiplies the loss before the backward pass so that tiny gradients remain representable in low precision; this matters mainly for FP16, since BF16 keeps the exponent range of FP32. Gradient accumulation sums gradients over micro-batches, simulating a larger effective batch size when GPU memory cannot hold the full batch.
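Both ideas can be checked numerically with the standard library alone. The sketch below is illustrative only: `to_fp16` simulates half-precision rounding via `struct`, the scale factor of 1024 is an arbitrary choice, and the scalar linear model in `grad_mse` is a hypothetical stand-in for a real network.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Loss scaling: a gradient of 1e-8 is below FP16's smallest subnormal
# (about 6e-8), so it flushes to zero unless it is scaled up first.
tiny_grad = 1e-8
lost = to_fp16(tiny_grad)                    # stored directly in FP16
scale = 1024.0                               # assumed loss-scale factor
recovered = to_fp16(tiny_grad * scale) / scale  # unscale in higher precision

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each divided by the number of micro-batches.
    Assumes len(xs) is a multiple of micro_batch_size."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.9, 6.1, 8.2, 9.8, 12.1, 14.0, 15.9]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

Dividing each micro-batch gradient by the number of micro-batches is what makes the accumulated result match the gradient of the mean loss over the full batch; without it the effective learning rate would grow with the number of accumulation steps.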

Learning-rate warmup gradually increases the learning rate $\eta$ early in training. Large adaptive steps on freshly initialized random weights can destabilize attention logits and the optimizer's moment estimates; warmup lets these scales settle before full-speed optimization.
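A warmup schedule is usually just a function of the step count. The sketch below assumes a linear ramp followed by cosine decay; the function name `lr_at`, the linear shape of the ramp, and all parameter values are illustrative assumptions rather than something this card prescribes.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given 0-indexed step: linear warmup to peak_lr,
    then cosine decay down to min_lr at total_steps."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 100 warmup steps out of 1000, peak eta = 1e-3.
schedule = [lr_at(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
```

The first steps use a small fraction of the peak rate, so the optimizer's moment estimates are built from gentle updates before the full step size applies.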

Weight decay shrinks weights toward zero at each step and, paired with AdamW, improves generalization in many LLM runs. Because gradient clipping caps the global norm, one bad batch cannot dominate the update.
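To make the decoupling concrete, here is a minimal single-parameter sketch of one AdamW step. The function name `adamw_step` and the default hyperparameters are assumptions chosen for illustration; real optimizers apply the same arithmetic per tensor.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar weight w with gradient g.
    m, v are the running first/second moments; t is the 1-indexed step."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    # Decoupled decay: shrink w directly, outside the adaptive scaling.
    # (Adam with L2 regularization would instead add weight_decay * w to g.)
    w = w - lr * weight_decay * w
    # Adaptive gradient step.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero gradient the weight still shrinks by the factor (1 - lr * wd).
w_decay_only, _, _ = adamw_step(w=1.0, g=0.0, m=0.0, v=0.0, t=1)
# With a unit gradient the first step moves by roughly lr plus the decay.
w_full, m1, v1 = adamw_step(w=1.0, g=1.0, m=0.0, v=0.0, t=1)
```

Since the decay term never passes through the division by the second-moment estimate, every weight is shrunk at the same relative rate regardless of its gradient history.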

When a run does diverge, the usual remedy is to restart from a checkpoint. Monitoring validation loss, gradient norms, and activation statistics is routine at scale.
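One simple way to monitor a metric such as the gradient norm is to compare each new value against a recent running median. The class below is a hypothetical sketch: the name `SpikeMonitor`, the window size, the threshold factor, and the minimum history length are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import median

class SpikeMonitor:
    """Flag a value as a spike when it exceeds `factor` times the median
    of the most recent `window` values."""

    def __init__(self, window=100, factor=3.0, min_history=10):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.min_history = min_history

    def update(self, value):
        """Return True if `value` is a spike, then record it."""
        is_spike = (len(self.history) >= self.min_history
                    and value > self.factor * median(self.history))
        self.history.append(value)
        return is_spike

monitor = SpikeMonitor()
# Twenty ordinary gradient norms, one outlier, then a normal value again.
flags = [monitor.update(x) for x in [1.0] * 20 + [10.0, 1.1]]
```

Using the median rather than the mean keeps a single outlier from raising the threshold for the steps that follow it.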

Checkpointing every few thousand steps is cheap insurance: loss spikes sometimes recover under cosine decay, but a diverged run without a checkpoint wastes days of GPU time.
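The save-and-resume pattern can be sketched with the standard library. Everything here is a toy stand-in: the helper names, the JSON file format, and the "optimizer update" that simply adds 1.0 are assumptions for illustration. A real checkpoint would also need optimizer state, the random-number state, and the data-loader position.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, weights):
    """Write to a temporary file, then rename, so a crash mid-write
    cannot leave a truncated checkpoint at `path`."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]

def train(total_steps, ckpt_path, every=1000):
    """Resume from ckpt_path if it exists, otherwise start from step 0."""
    if os.path.exists(ckpt_path):
        step, weights = load_checkpoint(ckpt_path)
    else:
        step, weights = 0, [0.0]
    while step < total_steps:
        weights = [w + 1.0 for w in weights]  # stand-in for an optimizer update
        step += 1
        if step % every == 0:
            save_checkpoint(ckpt_path, step, weights)
    return step, weights

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ckpt.json")
    first = train(2500, path)        # stops at step 2500; last save was step 2000
    saved = load_checkpoint(path)
    resumed = train(3000, path)      # restarts from the step-2000 checkpoint
```

In this toy run, the work done between step 2000 and step 2500 is repeated after the restart, which is the cost the checkpoint interval trades against write overhead.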

Related cards
Builds on Encoder vs decoder masking · Machine learning
Next Inference systems: KV-cache and batching · Machine learning
Tasks
Question 1

AdamW differs from plain Adam by decoupling:

Hint

Skim the paragraph on AdamW in Training stability at scale before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Mixed-precision training usually adds:

Hint

Skim the paragraph on mixed-precision training in Training stability at scale before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Gradient accumulation lets you:

Hint

Skim the paragraph on gradient accumulation in Training stability at scale before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why is learning-rate warmup used when training transformers?

Hint

Skim the paragraph on learning-rate warmup in Training stability at scale before choosing. Eliminate options that contradict a definition stated in the card.

Creator
Best
BestBuddy