Debugging gradients in practice

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

A wrong backward implementation can train for hours before NaNs appear. Standard debugging therefore starts on tiny models, where you can compare autograd against finite differences on a micro-batch. Discrepancies beyond numerical tolerance usually mean a sign error, a missing sum over fan-out (a value used in several places must accumulate the gradients from all of them), or a wrong reduction axis.
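The comparison described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a framework API: the toy model (a tanh unit with squared error), the helper names `loss`, `backward`, `finite_difference_grad` and `rel_error`, and the step size `eps` are all chosen here for the example.

```python
import math
import random

def loss(w, xs, ys):
    """0.5 * sum of squared errors of tanh(w . x) over a micro-batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += 0.5 * (math.tanh(z) - y) ** 2
    return total

def backward(w, xs, ys):
    """Hand-written gradient of loss with respect to w (the code under test)."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x))
        t = math.tanh(z)
        dz = (t - y) * (1.0 - t * t)  # chain rule through tanh
        for j, xj in enumerate(x):
            grad[j] += dz * xj        # accumulate over the batch
    return grad

def finite_difference_grad(f, w, eps=1e-6):
    """Central differences, perturbing one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * eps))
    return grad

def rel_error(a, b):
    """Norm of the difference divided by the sum of the norms."""
    diff = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    scale = math.sqrt(sum(p * p for p in a)) + math.sqrt(sum(q * q for q in b))
    return diff / max(scale, 1e-30)

rng = random.Random(0)  # fixed seed so the check is reproducible
w = [rng.uniform(-1, 1) for _ in range(4)]
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
ys = [rng.uniform(-1, 1) for _ in range(3)]

analytic = backward(w, xs, ys)
numeric = finite_difference_grad(lambda v: loss(v, xs, ys), w)
err_good = rel_error(analytic, numeric)

# Simulate a sign error in the backward pass and check that it is caught.
err_bad = rel_error([-g for g in analytic], numeric)

print("correct backward, relative error:", err_good)
print("sign-flipped backward, relative error:", err_bad)
```

With a correct backward the relative error should sit near floating-point noise, while a sign flip pushes it to about 1. The acceptable tolerance depends on precision and step size, so treat any fixed threshold as a choice to tune rather than a rule.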

Gradient clipping caps the global gradient norm or the individual gradient values before the optimizer step, preventing one bad batch from exploding the weights. If the loss is constant with respect to a subtree of the graph, the gradients there are identically zero: the branch is inactive for that input.
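The two clipping variants can be illustrated with a short sketch. It assumes gradients are stored as a dict of per-layer lists; the function names `clip_by_global_norm` and `clip_by_value` and the layer names are hypothetical and do not refer to any particular library.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one factor so their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for layer in grads.values() for g in layer))
    scale = min(1.0, max_norm / (total + 1e-12))  # never scale up
    clipped = {name: [g * scale for g in layer] for name, layer in grads.items()}
    return clipped, total

def clip_by_value(grads, limit):
    """Clamp every gradient entry independently to [-limit, limit]."""
    return {
        name: [max(-limit, min(limit, g)) for g in layer]
        for name, layer in grads.items()
    }

grads = {"layer1": [3.0, 4.0], "layer2": [0.0, 12.0]}

norm_clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for layer in norm_clipped.values() for g in layer))
value_clipped = clip_by_value(grads, limit=5.0)

print("global norm before:", norm_before)  # 13.0 for this example
print("global norm after:", norm_after)
print("value-clipped:", value_clipped)
```

Norm clipping keeps the direction of the overall gradient and only shortens it, whereas value clipping can change the direction because each entry is clamped on its own.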

A Taylor test probes whether the directional derivative from automatic differentiation (AD) matches a finite-difference estimate along a random direction. Anomaly-detection hooks in frameworks stop at the first NaN or Inf encountered during the backward pass, which shortens the search.
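A directional check of this kind might look as follows. This is a sketch under assumptions: log-sum-exp stands in for the function under test, its softmax gradient stands in for the AD result, and the names `logsumexp`, `grad_logsumexp`, `taylor_test` and the step size `h` are chosen for this example.

```python
import math
import random

def logsumexp(w):
    """Numerically stable log(sum(exp(w_i)))."""
    m = max(w)
    return m + math.log(sum(math.exp(wi - m) for wi in w))

def grad_logsumexp(w):
    """Gradient of logsumexp, which is the softmax of w."""
    m = max(w)
    e = [math.exp(wi - m) for wi in w]
    s = sum(e)
    return [ei / s for ei in e]

def taylor_test(f, grad, w, rng, h=1e-5):
    """Compare grad(w) . v with a central difference of f along a random v."""
    v = [rng.gauss(0.0, 1.0) for _ in w]
    analytic = sum(g * vi for g, vi in zip(grad(w), v))
    f_plus = f([wi + h * vi for wi, vi in zip(w, v)])
    f_minus = f([wi - h * vi for wi, vi in zip(w, v)])
    numeric = (f_plus - f_minus) / (2.0 * h)
    return analytic, numeric

rng = random.Random(0)
w = [rng.uniform(-2, 2) for _ in range(5)]
analytic, numeric = taylor_test(logsumexp, grad_logsumexp, w, rng)

print("directional derivative from gradient:", analytic)
print("central-difference estimate:", numeric)
print("absolute difference:", abs(analytic - numeric))
```

Unlike a coordinate-wise check, this costs two function evaluations per direction regardless of the number of parameters, so it stays cheap on larger models. Repeating it over a few random directions reduces the chance that an error happens to be orthogonal to the sampled one.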

Common failure signatures: all-zero gradients on a new custom layer, a loss that stays flat on toy data, or gradients that disagree with finite differences by orders of magnitude. NaNs often trace to bad logits, division by zero, or activations that explode when no clipping is applied.
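To narrow down where a NaN or Inf first appears, one option is to scan recorded intermediate values in the order they were produced. The sketch below only imitates the idea behind framework anomaly detection; the helper `first_non_finite` and the trace entries are hypothetical and made up for illustration.

```python
import math

def first_non_finite(named_values):
    """Return the name of the first entry that contains NaN or Inf, else None.

    named_values is an ordered iterable of (name, list_of_floats) pairs,
    for example gradients recorded in the order backward produced them.
    """
    for name, values in named_values:
        if any(not math.isfinite(v) for v in values):
            return name
    return None

# Hypothetical backward trace: the Inf in log_prob appears first and
# then turns into NaN further down the chain.
trace = [
    ("loss", [1.0]),
    ("log_prob", [-0.5, float("inf")]),
    ("logits", [float("nan"), 0.2]),
]

culprit = first_non_finite(trace)
print("first non-finite entry:", culprit)
```

Stopping at the first offending entry matters because later NaNs are usually consequences rather than causes.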

When porting a paper architecture, reproduce its toy setup first: a single batch, a tiny width, and a known random seed. Scale up only after finite-difference checks pass on the micro model.

Logging gradient norms per layer during the first epochs quickly reveals whether a new architecture starts in a healthy range or already explodes before meaningful learning occurs.
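Per-layer logging can be as simple as the sketch below. It assumes gradients arrive as a dict of per-layer lists; the names `layer_grad_norms` and `flag_unhealthy`, the layer names, and the thresholds `low` and `high` are illustrative assumptions, since what counts as a healthy range depends on the model.

```python
import math

def layer_grad_norms(grads):
    """L2 norm of the gradient for each layer."""
    return {
        name: math.sqrt(sum(g * g for g in values))
        for name, values in grads.items()
    }

def flag_unhealthy(norms, low=1e-7, high=1e3):
    """Label layers whose norm is non-finite, below low, or above high.

    The default thresholds are arbitrary placeholders, not recommendations.
    """
    flags = {}
    for name, norm in norms.items():
        if not math.isfinite(norm):
            flags[name] = "non-finite"
        elif norm < low:
            flags[name] = "vanishing"
        elif norm > high:
            flags[name] = "exploding"
    return flags

# Hypothetical gradients from one early training step.
grads = {
    "embed": [0.3, -0.4],
    "block1": [0.0, 0.0],
    "head": [3000.0, 4000.0],
}

norms = layer_grad_norms(grads)
flags = flag_unhealthy(norms)

for name, norm in norms.items():
    print(name, norm, flags.get(name, "ok"))
```

Logging these values every few steps and plotting them per layer usually shows a problem earlier than the loss curve does.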

Tasks
Question 1

Gradient clipping works by monitoring and capping:

Hint

Reread the paragraph on gradient clipping in "Debugging gradients in practice" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

If the loss does not depend at all on some subtree of the graph, the gradients in that subtree are:

Hint

Reread the paragraph on a loss that does not depend on some subtree in "Debugging gradients in practice" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

A 'Taylor test' for a backward implementation checks:

Hint

Reread the paragraph on what a Taylor test of a backward implementation checks in "Debugging gradients in practice" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Which is a common symptom of a backward-pass bug in a new custom layer?

Hint

Reread the paragraph on common symptoms of a backward-pass bug in "Debugging gradients in practice" before choosing. Eliminate options that contradict a definition stated in the card.
