Debugging gradients in practice

A wrong backward implementation can train for hours before NaNs appear. Standard debugging starts on tiny models, where you can compare autograd against finite differences on a micro-batch. Discrepancies beyond numerical tolerance usually mean a sign error, a missing sum over fan-out (a value used in several places must accumulate the gradient from each use), or a wrong reduction axis.
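A minimal sketch of such a check, using only the Python standard library. The one-unit tanh model, the function names, and the step size `eps=1e-6` are illustrative assumptions, not taken from any framework. The sketch compares a hand-written backward pass against central differences, then repeats the comparison with a deliberately negated gradient to show what a sign error looks like.

```python
import math

def loss(w, x, y):
    """Squared error of a one-unit tanh model: 0.5 * (tanh(w . x) - y)^2."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (math.tanh(z) - y) ** 2

def grad_analytic(w, x, y):
    """Hand-written backward pass for loss()."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    a = math.tanh(z)
    dz = (a - y) * (1.0 - a * a)  # chain rule through the square and tanh
    return [dz * xi for xi in x]

def grad_buggy(w, x, y):
    """Same as grad_analytic but with a deliberate sign error."""
    return [-g for g in grad_analytic(w, x, y)]

def grad_finite_diff(f, w, eps=1e-6):
    """Central differences, perturbing one coordinate at a time."""
    g = []
    for i in range(len(w)):
        wp = list(w)
        wm = list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((f(wp) - f(wm)) / (2.0 * eps))
    return g

def max_rel_error(a, b, floor=1e-12):
    """Largest elementwise relative difference between two gradient lists."""
    return max(abs(ai - bi) / max(abs(ai), abs(bi), floor) for ai, bi in zip(a, b))

# Fixed toy inputs (assumed values, chosen only for illustration).
w = [0.5, -0.3, 0.8]
x = [1.0, 2.0, -1.5]
y = 0.3

g_fd = grad_finite_diff(lambda v: loss(v, x, y), w)
err_good = max_rel_error(grad_analytic(w, x, y), g_fd)
err_bad = max_rel_error(grad_buggy(w, x, y), g_fd)
print("correct backward:", err_good)
print("sign error:      ", err_bad)
```

The correct backward pass should agree with finite differences to many digits, while the negated one gives a relative error near 2. In double precision a relative error well below about 1e-5 is commonly treated as a pass, though the right tolerance depends on the function and the step size.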

Gradient clipping caps the global gradient norm or individual gradient values before the optimizer step, preventing one bad batch from exploding the weights. If the loss is constant w.r.t. a subtree, gradients there are identically zero: the branch is inactive for that input.
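The two clipping variants can be sketched in plain Python as follows. The helper names and the nested-list layout for per-layer gradients are assumptions made for illustration; frameworks ship their own clipping utilities with different signatures.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one shared factor so that their joint
    L2 norm is at most max_norm. Returns (clipped grads, norm before)."""
    total = math.sqrt(sum(g * g for layer in grads for g in layer))
    scale = min(1.0, max_norm / (total + 1e-12))  # small constant avoids 0/0
    return [[g * scale for g in layer] for layer in grads], total

def clip_by_value(grads, limit):
    """Clamp each gradient entry independently to [-limit, limit]."""
    return [[max(-limit, min(limit, g)) for g in layer] for layer in grads]

# Two "layers" of gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for layer in clipped for g in layer))
value_clipped = clip_by_value(grads, 5.0)

print(norm_before, norm_after)
print(value_clipped)
```

Norm clipping keeps the direction of the update and only shortens it, whereas value clipping can change the direction because each entry is clamped on its own.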

Taylor tests check whether a directional derivative from AD matches a finite-difference estimate along a random direction. Anomaly-detection hooks in frameworks stop on the first NaN/Inf during backward, shortening the search.
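One common form of the Taylor test looks at the first-order remainder |f(w + h v) - f(w) - h g·v|, which should shrink like h² when g is the true gradient. The sketch below assumes a softplus-sum test function, a fixed seed, and a particular list of step sizes, all chosen only for illustration; the hand-written `grad_f` stands in for the AD gradient under test.

```python
import math
import random

def f(w):
    """Smooth scalar test function: sum of softplus(w_i)."""
    return sum(math.log1p(math.exp(wi)) for wi in w)

def grad_f(w):
    """Gradient of f: the logistic sigmoid of each coordinate."""
    return [1.0 / (1.0 + math.exp(-wi)) for wi in w]

def taylor_remainders(f, grad, w, v, steps):
    """|f(w + h v) - f(w) - h * <grad(w), v>| for each step size h."""
    g_dot_v = sum(gi * vi for gi, vi in zip(grad(w), v))
    f0 = f(w)
    out = []
    for h in steps:
        wh = [wi + h * vi for wi, vi in zip(w, v)]
        out.append(abs(f(wh) - f0 - h * g_dot_v))
    return out

random.seed(1)
w = [0.4, -1.2, 2.0]
v = [random.gauss(0.0, 1.0) for _ in w]  # random probe direction
steps = [1e-2, 5e-3, 2.5e-3]             # each step halves the previous one

rem = taylor_remainders(f, grad_f, w, v, steps)
# Observed convergence order between consecutive step sizes.
rates = [math.log2(rem[i] / rem[i + 1]) for i in range(len(rem) - 1)]
print(rates)
```

Rates close to 2 indicate that the gradient matches the function to first order. A wrong gradient typically leaves a term linear in h, so the observed rate drops to about 1.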

Common failure signatures: all-zero gradients on a new custom layer, loss flat on toy data, or gradients that disagree with finite differences by orders of magnitude. NaNs often trace to bad logits, division by zero, or exploding activations without clipping.
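Two of these signatures, all-zero gradients and non-finite values, are cheap to detect automatically. The helper below is a hypothetical sketch: the function name, the layer names, and the dictionary-of-lists layout are assumptions, and a real project would read gradients from its framework's parameter objects instead.

```python
import math

def diagnose_gradients(named_grads, zero_tol=0.0):
    """Map each layer name to 'nonfinite', 'all_zero', or 'ok'."""
    report = {}
    for name, g in named_grads.items():
        if any(not math.isfinite(v) for v in g):
            report[name] = "nonfinite"   # NaN or Inf somewhere in this layer
        elif all(abs(v) <= zero_tol for v in g):
            report[name] = "all_zero"    # branch inactive or backward not wired up
        else:
            report[name] = "ok"
    return report

# Hypothetical gradients captured after one backward pass.
named_grads = {
    "embed": [0.01, -0.02, 0.005],
    "custom_layer": [0.0, 0.0, 0.0],
    "logits": [float("nan"), 0.3, float("inf")],
}

report = diagnose_gradients(named_grads)
for name, status in report.items():
    print(name, status)
```

Running a check like this on the first few steps narrows the search to a specific layer before any finite-difference comparison is needed.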

When porting a paper architecture, reproduce its toy setup first: a single batch, tiny width, and a known random seed. Only scale up after finite-difference checks pass on the micro model.

Logging gradient norms per layer during the first epochs quickly reveals whether a new architecture starts in a healthy range or already explodes before meaningful learning occurs.
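A minimal sketch of per-layer norm logging in plain Python. The layer names, the example gradient values, and the `low`/`high` thresholds are arbitrary assumptions for illustration; what counts as a healthy range depends on the architecture and the loss scale.

```python
import math

def layer_grad_norms(named_grads):
    """L2 norm of the gradient for each named layer."""
    return {name: math.sqrt(sum(v * v for v in g)) for name, g in named_grads.items()}

def flag_norms(norms, low=1e-7, high=1e2):
    """Label each layer by comparing its norm to assumed thresholds."""
    flags = {}
    for name, n in norms.items():
        if n < low:
            flags[name] = "vanishing"
        elif n > high:
            flags[name] = "exploding"
        else:
            flags[name] = "ok"
    return flags

history = []  # one dict of norms per training step

# Hypothetical gradients from a single early training step.
step_grads = {
    "layer1": [3.0, 4.0],
    "layer2": [1e-9, 0.0],
    "layer3": [300.0, 400.0],
}

norms = layer_grad_norms(step_grads)
history.append(norms)
flags = flag_norms(norms)
for name in norms:
    print(f"{name}: norm={norms[name]:.3e} status={flags[name]}")
```

Keeping the full `history` rather than only the latest step makes it possible to see whether a layer starts out of range or drifts there over the first epochs.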
