Training stability at scale

Training billion-parameter transformers is as much engineering as theory. AdamW decouples weight decay from Adam's adaptive learning-rate correction, a standard optimizer for LLM pretraining . Gradient clipping caps global norm before the update step, preventing occasional loss spikes from exploding weights in deep stacks .

Mixed precision (FP16 or BF16) cuts memory and increases throughput; loss scaling multiplies the loss before backward pass so tiny gradients remain representable in low precision . Gradient accumulation sums gradients over micro-batches, simulating larger effective batch sizes when GPU memory cannot hold the full batch .

Learning rate warmup gradually increases $\eta$ early in training. Large adaptive steps on cold random weights can destabilize attention logits and optimizer moment estimates; warmup lets scale settle before full-speed optimization .

Weight decay shrinks weights toward zero each step, improving generalization in many LLM runs when paired with AdamW. Gradient clipping uses a global norm cap so one bad batch cannot dominate the update .

restart from a checkpoint. Monitoring validation loss, gradient norms, and activation statistics is routine at scale .

Checkpointing every few thousand steps is cheap insurance: loss spikes sometimes recover under cosine decay, but a diverged run without a checkpoint wastes days of GPU time .

Training stability at scale

Related cards

Video Content

Tasks

Question 1

Question 2

Question 3

Question 4

Card Info

Creator