Worked tiny graphs before code-size graphs

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Before trusting a million-line framework, trace a three-node or five-node graph by hand. The same pattern repeats at scale: compute the local partials, take the upstream sensitivity, multiply them, and sum at joins. Hand work also reveals when an operation couples every output to every input.
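The pattern can be traced on a graph small enough to check by hand. The sketch below is an illustrative example, not taken from any framework: the functions `forward` and `backward` and the graph itself (`a = x*x`, `b = 3*x`, `y = a*b`) are assumptions chosen so that the answer, $dy/dx = 9x^2$, is easy to verify.

```python
# Tiny graph: x fans out to two branches that meet again at y.
#   a = x * x
#   b = 3 * x
#   y = a * b          (so y = 3 * x**3 and dy/dx = 9 * x**2)

def forward(x):
    a = x * x
    b = 3.0 * x
    y = a * b
    return a, b, y

def backward(x):
    a, b, _ = forward(x)
    dy_dy = 1.0                 # upstream sensitivity at the output itself
    dy_da = dy_dy * b           # local partial of y = a * b with respect to a is b
    dy_db = dy_dy * a           # local partial of y = a * b with respect to b is a
    # x feeds both a and b, so the two backward paths join here and are summed.
    dy_dx = dy_da * (2.0 * x) + dy_db * 3.0
    return dy_dx

x0 = 2.0
grad = backward(x0)             # hand check: 6 * 4 + 4 * 3 = 36 = 9 * x0**2
```

Each line of `backward` is one "multiply local partial by upstream sensitivity" step; the only addition happens where two paths meet.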

Softmax is the standard coupling example. Each output $p_i = e^{z_i}/\sum_j e^{z_j}$ depends on all logits $z_j$, so the Jacobian $\partial p/\partial z$, with entries $\partial p_i/\partial z_j = p_i(\delta_{ij} - p_j)$, is generally dense, not diagonal.
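A short NumPy sketch makes the density visible. It assumes the standard closed form $\partial p_i/\partial z_j = p_i(\delta_{ij} - p_j)$; the function names and the sample logits are illustrative choices, not part of any library.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)           # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z):
    p = softmax(z)
    # J[i, j] = dp_i/dz_j = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

z = np.array([1.0, 2.0, 0.5])   # arbitrary example logits
J = softmax_jacobian(z)
off_diagonal = J[~np.eye(3, dtype=bool)]   # the six entries with i != j
```

Every entry of `off_diagonal` is nonzero (in fact negative), which is the coupling described above: raising one logit lowers every other probability.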

Cross-entropy with softmax yields a logit gradient with the same shape as the logits themselves: for a one-hot target $y$ it is $p - y$, one error component per class. That compact formula is what frameworks implement in fused kernels.
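The compact gradient can be checked numerically. The following is a minimal sketch assuming a one-hot target `y` and the loss $L = -\sum_i y_i \log p_i$; the function names are illustrative, and a real fused kernel would also handle batching and numerical edge cases.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot target; loss = -sum_i y_i * log p_i
    return -np.sum(y * np.log(softmax(z)))

def logit_gradient(z, y):
    # dL/dz = p - y: same shape as z, one error term per class
    return softmax(z) - y

z = np.array([2.0, -1.0, 0.3])  # arbitrary example logits
y = np.array([0.0, 1.0, 0.0])   # true class is index 1
g = logit_gradient(z, y)
```

The component for the true class is negative (its probability should go up) and the others are positive, and the gradient never requires forming the dense Jacobian explicitly.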

When a forward operation broadcasts, automatic differentiation must track which axes were expanded so that the backward pass sums gradients over the correct dimensions. The upstream gradient at a node is the sensitivity of the final loss to an infinitesimal change in that node's output.
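As a rough illustration of the bookkeeping, the sketch below reduces an upstream gradient back to the shape of a broadcast operand. The helper `unbroadcast` is a hypothetical name written for this example, not a NumPy function, and it assumes NumPy-style broadcasting rules (missing leading axes are prepended, size-1 axes are stretched).

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum `grad` over the axes that broadcasting expanded, returning an array of `shape`."""
    # Sum away leading axes that broadcasting prepended.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that had size 1 in the operand and were stretched.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

x = np.arange(6.0).reshape(2, 3)    # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])    # shape (3,), broadcast across both rows
out = x + b
upstream = np.ones_like(out)        # dL/d(out) for the loss L = out.sum()
grad_x = unbroadcast(upstream, x.shape)
grad_b = unbroadcast(upstream, b.shape)   # each b[j] was used in 2 rows
```

Because every element of `b` contributed to two rows of `out`, its gradient is the sum of the upstream gradient over the row axis.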

For a two-layer sigmoid network on MNIST, walking through one hidden unit's influence on one output logit already shows fan-out: that hidden unit feeds all ten output neurons, so its upstream gradient is a sum over ten downstream paths.
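The ten-path sum can be written out directly. This sketch uses random stand-in values rather than a trained MNIST network: the hidden width `H`, the weight matrix `W2`, the activations `h`, and the output-layer gradients `delta_out` are all hypothetical, chosen only to show the indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4                                  # hypothetical hidden width, kept tiny
W2 = rng.normal(size=(10, H))          # output weights: 10 logits from H hidden units
h = rng.uniform(0.1, 0.9, size=H)      # stand-in sigmoid activations of the hidden layer
delta_out = rng.normal(size=10)        # stand-in dL/dz_i for the 10 output logits

k = 2                                  # hidden unit under inspection
# Logit z_i = sum_k W2[i, k] * h[k] + bias, so the local partial dz_i/dh_k is W2[i, k].
paths = W2[:, k] * delta_out           # one term per downstream path
dL_dh_k = paths.sum()                  # sum at the join: ten paths into one gradient
# Sigmoid local partial: dh/da = h * (1 - h), where a is the pre-activation.
dL_da_k = dL_dh_k * h[k] * (1.0 - h[k])
```

Doing this for every `k` at once is the matrix product `W2.T @ delta_out`, which is how the same sum appears in vectorized code.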

Doing one row of the softmax Jacobian by hand shows why frameworks ship fused backward kernels: the algebra is repetitive but easy to mis-index when rushed.
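As a worked example, assume three classes and write $S = e^{z_1} + e^{z_2} + e^{z_3}$, so that $p_1 = e^{z_1}/S$. Applying the quotient rule to the first row gives:

$$
\frac{\partial p_1}{\partial z_1} = \frac{e^{z_1} S - e^{z_1} e^{z_1}}{S^2} = p_1(1 - p_1), \qquad
\frac{\partial p_1}{\partial z_2} = -\frac{e^{z_1} e^{z_2}}{S^2} = -p_1 p_2, \qquad
\frac{\partial p_1}{\partial z_3} = -\frac{e^{z_1} e^{z_3}}{S^2} = -p_1 p_3
$$

A quick sanity check is that the row sums to zero: $p_1(1 - p_1 - p_2 - p_3) = 0$, because the probabilities sum to one. The diagonal entry has a different form from the off-diagonal ones, which is exactly where index slips tend to happen.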

Tasks
Question 1

The Jacobian of softmax is:

Hint

Skim the paragraph on the softmax Jacobian in "Worked tiny graphs before code-size graphs" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Cross-entropy on top of softmax produces a gradient with respect to the logits that has:

Hint

Skim the paragraph on the gradient of cross-entropy with softmax in "Worked tiny graphs before code-size graphs" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

When a forward op uses broadcasting, the backward pass must track:

Hint

Skim the paragraph on broadcasting in the backward pass in "Worked tiny graphs before code-size graphs" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

During backprop, the 'upstream gradient' at a node represents:

Hint

Skim the paragraph on the upstream gradient at a node in "Worked tiny graphs before code-size graphs" before choosing. Eliminate options that contradict a definition stated in the card.
