Structured graphs: weight sharing and modular layers

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Production networks are not bare MLP chains. Convolutions reuse the same kernel at every spatial location; batch normalization makes each activation depend on statistics of the whole batch; attention mixes tokens through softmax weights. Each block is still a DAG node with registered forward and backward rules, often fused into a single GPU kernel.

Weight tying means one parameter tensor feeds multiple operations. The gradient with respect to that tensor is the sum of the contributions from every use site, exactly like fan-out summation in the chain rule.
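The summation over use sites can be checked on a scalar toy. The sketch below is a hypothetical example (the function names and the two-site graph are assumptions, not taken from any library): one weight w is used twice, and the gradient is the sum of the two per-site contributions.

```python
# Hypothetical toy graph: the same weight w is used at two sites.
def forward(w, x):
    h = w * x      # use site 1
    y = w * h      # use site 2 (tied to the same w)
    return h, y

def grad_w(w, x):
    h, _ = forward(w, x)
    grad_site_1 = w * x   # dy/dh * dh/dw, the path through site 1
    grad_site_2 = h       # dy/dw at site 2 with h held fixed
    return grad_site_1 + grad_site_2   # contributions add

w, x = 1.5, 2.0
analytic = grad_w(w, x)

# Central finite difference on y = w * w * x as an independent check.
eps = 1e-6
numeric = (forward(w + eps, x)[1] - forward(w - eps, x)[1]) / (2 * eps)
```

Here y = w²x, so the exact derivative is 2wx, which is what the two summed terms give.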

Softmax is a coupling example: each output probability depends on all logits, so its Jacobian is generally dense, not diagonal. Numerical stability is a separate concern: a naive softmax followed by log can underflow a small probability to zero before the log is taken, while a fused log_softmax plus negative log-likelihood loss keeps the computation in log space.
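Both points can be seen in a few lines of plain Python. This is a minimal sketch, not a framework implementation: the helper names are hypothetical, and the Jacobian uses the standard identity J[i][j] = p_i (δ_ij − p_j).

```python
import math

def softmax(z):
    m = max(z)                              # shift so exp never overflows
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    # J[i][j] = p_i * (delta_ij - p_j): every output depends on every logit.
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def log_softmax(z):
    # log-sum-exp form: stays in log space, never takes log of a probability.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

jac = softmax_jacobian([1.0, 2.0, 3.0])     # all off-diagonal entries nonzero

logits = [0.0, 1000.0]
probs = softmax(logits)        # first entry underflows to exactly 0.0
stable = log_softmax(logits)   # finite log-probabilities
```

Taking math.log(probs[0]) at this point would fail because the probability has already underflowed to zero, whereas log_softmax returns a finite value for the same logit.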

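In-place tensor operations can break autograd, sometimes silently, if they overwrite values still needed for the backward pass. Custom layers must register a backward rule that matches their forward computation; otherwise the gradients are wrong without any error being raised, and training degrades or diverges in ways that are hard to trace.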
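The failure mode can be reproduced without a framework. The toy op below is a hypothetical sketch (the class and its methods are assumptions for illustration): it saves its forward output for backward, as an exp node would, and has no safeguard against that buffer being modified.

```python
import math

class Exp:
    """Toy op: backward reuses the saved output, since d/dx exp(x) = exp(x)."""

    def forward(self, x):
        self.out = [math.exp(v) for v in x]   # saved for backward
        return self.out                       # caller receives the same list

    def backward(self, grad_out):
        return [g * y for g, y in zip(grad_out, self.out)]

op = Exp()
y = op.forward([0.0, 1.0])
correct = op.backward([1.0, 1.0])     # gradient from the untouched buffer

# In-place update of the returned buffer, like a later "y += 1".
for i in range(len(y)):
    y[i] += 1.0

corrupted = op.backward([1.0, 1.0])   # same call, now a different answer
```

No exception is raised here; the second gradient is simply wrong. Frameworks such as PyTorch guard against many of these cases by tracking tensor versions and raising an error at backward time, but custom code without such a check behaves like this toy.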

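Batch normalization backward depends on the batch statistics computed during the forward pass, because the mean and variance are themselves functions of every input in the batch. In eval mode the layer normalizes with frozen running averages instead, so that coupling disappears. Each such layer is another node whose local Jacobian structure must be coded correctly.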
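A minimal training-mode sketch makes the coupling concrete. It assumes a batch of scalars, no learnable scale or shift, and hypothetical function names; the backward formula is the standard closed form, checked here against central finite differences.

```python
import math

def bn_forward(x, eps=1e-5):
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n      # biased batch variance
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * inv_std for v in x]
    return xhat, inv_std

def bn_backward(grad_y, xhat, inv_std):
    # The two batch sums carry the dependence through the mean and variance.
    n = len(grad_y)
    s1 = sum(grad_y)
    s2 = sum(g * h for g, h in zip(grad_y, xhat))
    return [inv_std / n * (n * g - s1 - h * s2) for g, h in zip(grad_y, xhat)]

def loss(x, weights):
    xhat, _ = bn_forward(x)
    return sum(w * h for w, h in zip(weights, xhat))

x = [1.0, 2.0, 4.0]
weights = [0.3, -0.5, 1.2]      # arbitrary upstream gradient
xhat, inv_std = bn_forward(x)
dx = bn_backward(weights, xhat, inv_std)

step = 1e-6
numeric = []
for i in range(len(x)):
    hi = x[:i] + [x[i] + step] + x[i + 1:]
    lo = x[:i] + [x[i] - step] + x[i + 1:]
    numeric.append((loss(hi, weights) - loss(lo, weights)) / (2 * step))
```

The entries of dx sum to zero, reflecting that shifting the whole batch by a constant leaves the normalized output unchanged. With frozen statistics, as in eval mode, the backward would reduce to an elementwise scale by inv_std.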

Convolution backward is not magic: it is the same fan-out summation. Because the kernel is shared, many edges reuse the same weight indices, so the gradient for the kernel accumulates contributions from every spatial position.
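The accumulation over positions can be written out for a 1-D case. This sketch assumes a "valid" cross-correlation with no padding or stride, and the function names are hypothetical.

```python
def conv1d(x, w):
    # y[i] = sum_k w[k] * x[i + k]; the same w is applied at every position i.
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n_out)]

def conv1d_backward_kernel(x, grad_y, k_size):
    # dL/dw[k] = sum_i grad_y[i] * x[i + k]: one term per output position.
    grad_w = [0.0] * k_size
    for i, g in enumerate(grad_y):
        for k in range(k_size):
            grad_w[k] += g * x[i + k]
    return grad_w

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0]
y = conv1d(x, w)                                      # [-1.5, -2.0, -2.5]
grad_y = [1.0, 1.0, 1.0]                              # upstream gradient
grad_w = conv1d_backward_kernel(x, grad_y, len(w))    # [6.0, 9.0]
```

Each kernel entry receives three contributions, one per output position, which is the weight-tying rule applied along the spatial axis.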

Related cards
Builds on: Forward-mode vs reverse-mode: when each wins · Machine learning
Next: Debugging gradients in practice · Machine learning
Tasks
Question 1

'Weight tying' (weight sharing) means:

Hint

Skim the paragraph on weight tying in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Softmax's backward pass couples all the logits because:

Hint

Skim the paragraph on softmax and its Jacobian in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

In-place tensor operations can break autograd when:

Hint

Skim the paragraph on in-place tensor operations in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why is fused log_softmax + nll_loss more numerically stable than separate softmax then log?

Hint

Skim the paragraph on fused log_softmax and negative log-likelihood loss in this card before choosing. Eliminate options that contradict a definition stated in the card.
