Memoization: forward activations feed backward formulas

Intermediate · Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Automatic differentiation frameworks record the forward computation on a tape: each op stores the inputs or outputs and the metadata its backward rule needs. ReLU backward needs to know which preactivations were positive; softmax backward needs the normalized probabilities (once these are saved, the logits themselves are not required); max-pooling backward needs the argmax indices that record which input won each window.
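
As a minimal sketch of this idea (plain Python lists, not any framework's real API), each forward function below returns its output together with a cache holding exactly what its backward rule needs. The function names and the non-overlapping 1-D pooling layout are assumptions made for illustration.

```python
# Illustrative only: every forward op returns (output, cache_for_backward).

def relu_forward(z):
    mask = [v > 0 for v in z]                      # saved: which preactivations were positive
    out = [v if m else 0.0 for v, m in zip(z, mask)]
    return out, mask

def relu_backward(grad_out, mask):
    return [g if m else 0.0 for g, m in zip(grad_out, mask)]

def maxpool1d_forward(x, window):
    out, argmax = [], []
    for start in range(0, len(x) - window + 1, window):
        idx = max(range(start, start + window), key=lambda i: x[i])
        out.append(x[idx])
        argmax.append(idx)                         # saved: which input won each window
    return out, (argmax, len(x))

def maxpool1d_backward(grad_out, cache):
    argmax, n = cache
    grad_in = [0.0] * n
    for g, idx in zip(grad_out, argmax):
        grad_in[idx] += g                          # gradient is routed to the winner only
    return grad_in

z = [-1.0, 2.0, 0.5, -3.0]
a, relu_cache = relu_forward(z)
p, pool_cache = maxpool1d_forward(a, window=2)

# Backward pass replays the caches in reverse order.
grad_a = maxpool1d_backward([1.0, 1.0], pool_cache)
grad_z = relu_backward(grad_a, relu_cache)
```

Neither backward function looks at the original input values: the boolean mask and the winning indices are enough, which is why these caches can be much smaller than the activations themselves.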

This memoization is not free. Storing every intermediate for a deep network can exhaust GPU memory. Gradient checkpointing trades compute for memory: only selected activations are kept during the forward pass, and the rest are discarded and recomputed during the backward pass when their local formulas need them.
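
The trade can be sketched on a toy chain of scalar layers, assuming each layer is `tanh(w * x)` and that a checkpoint is kept every `every` layers. Both choices are hypothetical and only serve to make the recomputation visible; real frameworks expose this through their own checkpointing utilities.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad_x(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2); requires the layer's input x
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_full(ws, x0):
    """Baseline: save the input of every layer during the forward pass."""
    xs = [x0]
    for w in ws:
        xs.append(layer(w, xs[-1]))
    g = 1.0
    for w, x_in in zip(reversed(ws), reversed(xs[:-1])):
        g *= layer_grad_x(w, x_in)
    return g, len(xs) - 1                  # gradient, number of saved inputs

def grad_checkpointed(ws, x0, every):
    """Save only the input of every `every`-th layer; recompute the rest."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(ws):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    g = 1.0
    for start in sorted(checkpoints, reverse=True):
        segment = ws[start:start + every]
        xs = [checkpoints[start]]
        for w in segment[:-1]:             # extra forward work inside the segment
            xs.append(layer(w, xs[-1]))
        for w, x_in in zip(reversed(segment), reversed(xs)):
            g *= layer_grad_x(w, x_in)
    return g, len(checkpoints)

ws = [0.5, -1.2, 0.8, 1.5, -0.7, 0.9]
g_full, saved_full = grad_full(ws, 0.3)
g_ckpt, saved_ckpt = grad_checkpointed(ws, 0.3, every=3)
```

Both functions return the same gradient, but the checkpointed version holds two long-lived activations instead of six, at the cost of re-running part of the forward pass. During backward it still needs temporary storage for one segment at a time.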

Framework APIs let you detach a tensor to cut gradient flow through an edge, which is useful for frozen layers or reinforcement-learning baselines. Higher-order gradients require differentiating through the backward implementation itself; this adds memory pressure, because the backward computation must then be recorded as well, and it is far less common in standard training.
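
To make the effect of detaching concrete, here is a small scalar autograd sketch. The `Value` class and its `detach` method are hypothetical stand-ins written for this card, not a real framework's API. The example mimics a reinforcement-learning baseline: the loss uses the baseline's value, but no gradient should reach the baseline parameter `v`.

```python
class Value:
    """Tiny scalar autograd node (illustrative, not a real library class)."""

    def __init__(self, data, parents=(), backward_rule=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_rule = backward_rule   # upstream grad -> tuple of parent grads

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), lambda g: (g, -g))

    def detach(self):
        # Same number, but no parents: the edge back into the graph is cut.
        return Value(self.data)

    def backward(self):
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):         # reverse topological order
            if node.backward_rule is not None:
                for parent, g in zip(node.parents, node.backward_rule(node.grad)):
                    parent.grad += g

def policy_and_baseline_grads(use_detach):
    x, w, v, reward = Value(2.0), Value(0.5), Value(1.5), Value(4.0)
    log_prob = w * x                         # stands in for a policy term
    baseline = v * x
    if use_detach:
        baseline = baseline.detach()
    loss = log_prob * (reward - baseline)
    loss.backward()
    return w.grad, v.grad

grads_detached = policy_and_baseline_grads(use_detach=True)
grads_attached = policy_and_baseline_grads(use_detach=False)
```

With the baseline detached, `w` receives the same gradient as before while `v` receives none, so the loss treats the baseline as a constant.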

At minimum, each graph node must know how to compute its output from its parents and how to implement the local vector-Jacobian product (VJP) that maps an upstream gradient into gradients with respect to its inputs.
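
A matrix-vector product makes a compact illustration of these two responsibilities. The `MatVec` class below is a hypothetical node written with plain lists; it shows that the VJP can be computed directly from the saved inputs without ever building the full Jacobian.

```python
class MatVec:
    """Illustrative graph node for y = W x (not a real framework class)."""

    def forward(self, W, x):
        self.W, self.x = W, x                 # saved for the backward rule
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

    def vjp(self, g):
        # Gradient w.r.t. x is W^T g; gradient w.r.t. W is the outer product g x^T.
        rows, cols = len(self.W), len(self.x)
        grad_x = [sum(self.W[i][j] * g[i] for i in range(rows)) for j in range(cols)]
        grad_W = [[g_i * x_j for x_j in self.x] for g_i in g]
        return grad_W, grad_x

node = MatVec()
y = node.forward([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1.0, -1.0])
grad_W, grad_x = node.vjp([1.0, 0.5, 2.0])    # upstream gradient, one entry per output
```

The upstream gradient has the shape of the output, and each returned gradient has the shape of the corresponding input, which is what lets nodes be chained in reverse order.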

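During training, memory planners must budget for the weights, their gradients and any optimizer state, and the activations saved for the backward pass. Inference needs no saved activations, which is why inference-only deployment can use less memory than training the same architecture at the same batch size.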
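
A rough back-of-the-envelope estimate shows where the gap comes from. The sketch below assumes a fully connected network in float32, counts one saved input per layer for training, and assumes inference only keeps one layer's input and output alive at a time. It deliberately ignores weight gradients, optimizer state, and framework workspace, so the real training footprint would be larger still. The layer widths and batch size are made-up numbers.

```python
BYTES_PER_FLOAT32 = 4

def mlp_memory_bytes(widths, batch_size):
    pairs = list(zip(widths, widths[1:]))
    weight_count = sum(n_in * n_out + n_out for n_in, n_out in pairs)
    # Training: each layer's input is kept until that layer's backward rule runs.
    saved_for_backward = batch_size * sum(widths[:-1])
    # Inference: only the current layer's input and output are live at once.
    live_at_inference = batch_size * max(n_in + n_out for n_in, n_out in pairs)
    weights = weight_count * BYTES_PER_FLOAT32
    return {
        "weights": weights,
        "train": weights + saved_for_backward * BYTES_PER_FLOAT32,
        "inference": weights + live_at_inference * BYTES_PER_FLOAT32,
    }

widths = [1024] + [4096] * 8 + [10]           # hypothetical architecture
mem = mlp_memory_bytes(widths, batch_size=256)
for name, size in mem.items():
    print(f"{name:>9}: {size / 2**20:.1f} MiB")
```

The saved-activation term grows with both depth and batch size, while the inference term depends only on the widest pair of adjacent layers.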

ReLU backward is especially simple: the mask of positive preactivations multiplies the upstream gradient elementwise. Sigmoid backward instead multiplies by $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which shrinks toward zero when activations saturate.
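
The contrast can be checked numerically with a short sketch. It relies on the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so the sigmoid's backward rule only needs the saved output; the function names are illustrative.

```python
import math

def sigmoid_forward(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s, s                       # output, and the cache (the output itself)

def sigmoid_backward(grad_out, s):
    return grad_out * s * (1.0 - s)   # sigma'(z) expressed through the saved output

def relu_backward(grad_out, z):
    return grad_out if z > 0 else 0.0

# Local gradient for an upstream gradient of 1.0 at increasingly large inputs.
sigmoid_local = {z: sigmoid_backward(1.0, sigmoid_forward(z)[1]) for z in (0.0, 2.0, 10.0)}
relu_local = {z: relu_backward(1.0, z) for z in (0.5, 2.0, 10.0)}
```

The sigmoid's local gradient peaks at 0.25 for `z = 0` and falls off quickly as `z` grows, whereas the ReLU passes the upstream gradient through unchanged for every positive input.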

Related cards
Builds on Backprop as structured chain rule on a DAG · Machine learning
Tasks
Question 1

Gradient checkpointing recomputes some forward activations during the backward pass in order to:

Hint

Reread the paragraph on gradient checkpointing before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Calling .detach() (or similar) on a tensor stops:

Hint

Reread the paragraph on detaching tensors before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Computing higher-order gradients (e.g. Hessian information) requires:

Hint

Reread the paragraph on higher-order gradients before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

At minimum, each node in a reverse-mode computation graph must know:

Hint

Reread the paragraph on what each graph node must know before choosing. Eliminate options that contradict a definition stated in the card.
