Forward-mode vs reverse-mode: when each wins

Intermediate · Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Automatic differentiation has two standard modes. Forward mode seeds a perturbation along one input direction and propagates its influence forward to every output, so it is cheap when inputs are few and outputs are many. Reverse mode seeds a sensitivity on one scalar output and propagates it backward to all inputs, so it is cheap when the output is a single loss and the inputs are millions of weights.
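A minimal sketch of forward mode using dual numbers, assuming a made-up function `f` with one input and three outputs. The `Dual` class and `dual_sin` helper are illustrative names, not part of any library. One forward pass with the input seeded to tangent 1 yields the derivative of every output at once.

```python
import math


class Dual:
    """A value paired with its derivative along one seeded input direction."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: the tangent travels alongside the value.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def dual_sin(x):
    """sin for Dual inputs, with the chain rule applied to the tangent."""
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)


def f(x):
    """Hypothetical function: one input, three outputs."""
    return [x * x, dual_sin(x), 3.0 * x + 1.0]


# Seed the single input with tangent 1, then run one forward pass.
outputs = f(Dual(2.0, 1.0))
derivatives = [out.tangent for out in outputs]  # d(output_k)/dx for all k
```

With more inputs, each extra input direction would need its own seeded pass, which is why this mode suits the few-inputs case.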

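Neural network training is the canonical reverse-mode case: one scalar cross-entropy loss and a huge parameter vector. A single forward-mode pass yields one Jacobian column, which for a scalar loss is just one entry of the gradient, so the full gradient would take one pass per parameter. That is untenable at this scale.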
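To make the pass count concrete, here is a toy reverse-mode sketch, assuming a hypothetical scalar loss built from five weights. The `Var` class and `backward` function are illustrative and far simpler than a real framework's tape. A single reverse sweep fills in the gradient for every weight.

```python
class Var:
    """Scalar node that records its parents and local partial derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into every node with one reverse sweep."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0  # scalar seed on the loss
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


# Hypothetical "model": loss = sum_i (w_i * x_i)^2 with five weights.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [Var(0.5) for _ in xs]
loss = Var(0.0)
for w, x in zip(weights, xs):
    term = w * Var(x)
    loss = loss + term * term

backward(loss)  # one backward pass, regardless of the number of weights
grads = [w.grad for w in weights]
```

Forward mode would need five seeded passes to recover the same five numbers, and the gap grows linearly with the parameter count.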

For a wide layer the full Jacobian can be $4096\times 4096$ or larger, so frameworks avoid materializing it during training. Instead they expose Jacobian-vector products (JVPs, the forward-mode primitive) and vector-Jacobian products (VJPs, the reverse-mode primitive).
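As an illustration, the sketch below computes a JVP and a VJP for a single layer $y = \tanh(Wx)$ without ever storing its Jacobian $J = \mathrm{diag}(1 - y^2)\,W$. The layer, the helper names, and the tiny sizes are assumptions chosen for readability, not a framework API.

```python
import math


def matvec(matrix, vec):
    """Compute matrix @ vec for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def rmatvec(matrix, vec):
    """Compute matrix^T @ vec without building the transpose."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(matrix[i][j] * vec[i] for i in range(rows))
            for j in range(cols)]


def layer(W, x):
    return [math.tanh(z) for z in matvec(W, x)]


def layer_jvp(W, x, v):
    """J @ v for y = tanh(W x), where J = diag(1 - y^2) W."""
    y = layer(W, x)
    return [(1.0 - yi * yi) * t for yi, t in zip(y, matvec(W, v))]


def layer_vjp(W, x, u):
    """J^T @ u for the same layer."""
    y = layer(W, x)
    scaled = [(1.0 - yi * yi) * ui for yi, ui in zip(y, u)]
    return rmatvec(W, scaled)


W = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]  # 2 outputs, 3 inputs
x = [1.0, 2.0, -1.0]
v = [1.0, 0.0, 0.0]  # input direction: picks out Jacobian column 0
u = [0.0, 1.0]       # output weighting: picks out Jacobian row 1

column0 = layer_jvp(W, x, v)  # length 2, one entry per output
row1 = layer_vjp(W, x, u)     # length 3, one entry per input
```

Both products touch only vectors and the weight matrix itself; `column0[1]` and `row1[0]` are the same Jacobian entry reached from the two directions.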

Pearlmutter-style tricks compute Hessian-vector products with two AD passes (for example, forward mode applied to a reverse-mode gradient) without forming the Hessian explicitly. This enables curvature probes for research optimizers while staying tractable at moderate width.
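The idea can be sketched by pushing a dual-number perturbation through a gradient function, since $Hv = \frac{d}{d\epsilon}\nabla L(\theta + \epsilon v)\big|_{\epsilon=0}$. In this toy version the gradient is written by hand as a stand-in for a reverse-mode pass, and the loss, `Dual`, and `hvp` are all hypothetical names chosen for the example.

```python
class Dual:
    """Value plus tangent; just enough arithmetic for this example."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)


def grad_loss(theta):
    """Hand-written gradient of L = (t0^3 + t1^3) / 3 + t0 * t1.

    Stands in for the reverse-mode pass; works on floats or Dual values.
    """
    t0, t1 = theta
    return [t0 * t0 + t1, t1 * t1 + t0]


def hvp(theta, v):
    """Hessian-vector product via forward mode over the gradient."""
    seeded = [Dual(t, d) for t, d in zip(theta, v)]
    return [g.tangent for g in grad_loss(seeded)]


theta = [1.0, 2.0]
v = [1.0, -1.0]
hv = hvp(theta, v)  # Hessian at theta is [[2, 1], [1, 4]], never built here
```

The cost is one extra pass over the gradient computation, independent of the number of Hessian entries.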

If you implement a custom loss that returns a vector of per-example costs, remember that reverse mode expects a scalar output to seed. Sum or weight those per-example values explicitly before calling backward.
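A small sketch of why the reduction matters, assuming a hypothetical one-weight regression with squared-error costs. Summing the per-example losses is the same as applying a VJP with a seed of ones, and averaging corresponds to a seed of $1/n$; the function names here are illustrative.

```python
def per_example_losses(w, xs, ys):
    """Vector-valued loss: one squared error per example."""
    return [(w * x - y) ** 2 for x, y in zip(xs, ys)]


def per_example_grads(w, xs, ys):
    """Jacobian of the loss vector w.r.t. the single weight w."""
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]


def vjp(seed, jac):
    """Weight each output's sensitivity by the seed and accumulate."""
    return sum(s * j for s, j in zip(seed, jac))


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
w = 1.0
n = len(xs)

jac = per_example_grads(w, xs, ys)
grad_of_sum = vjp([1.0] * n, jac)       # gradient of sum(losses)
grad_of_mean = vjp([1.0 / n] * n, jac)  # gradient of mean(losses)
```

The two seeds give gradients that differ by a factor of `n`, so the choice of reduction silently rescales the effective learning rate.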

A Jacobian column gives the sensitivity of every output to one input coordinate; a Jacobian row gives the sensitivity of one output coordinate to every input. Training cares about the row picture, aggregated through the scalar loss.
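In symbols, and assuming the notation $y = f(\theta)$ with Jacobian $J = \partial f / \partial \theta$ and scalar loss $L = \ell(y)$ (names introduced here for illustration), the column picture, the row picture, and the training gradient are:

$$
J e_j = \frac{\partial f}{\partial \theta_j}, \qquad
e_i^\top J = \frac{\partial f_i}{\partial \theta}, \qquad
\nabla_\theta L = J^\top \nabla_y \ell .
$$

The last expression is a single VJP whose seed $\nabla_y \ell$ weights the rows of $J$, which is the sense in which the scalar loss aggregates them.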

Tasks
Question 1

Training a neural network usually prefers:

Hint

Reread the paragraph on neural network training in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Forward-mode AD is the better choice when:

Hint

Reread the paragraph on when forward mode is cheap in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Pearlmutter-style tricks compute Hessian-vector products using:

Hint

Reread the paragraph on Pearlmutter-style Hessian-vector products in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

What is the acronym for the 'vector-Jacobian product' that backprop applies at each node?

Hint

Reread the paragraph that introduces Jacobian-vector and vector-Jacobian products in this card before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Intermediate