Structured graphs: weight sharing and modular layers

Production networks are not bare MLP chains. Convolutions reuse the same kernel at every spatial location; batch normalization couples batch statistics to each activation; attention mixes tokens through softmax weights. Each block is still a DAG node with registered forward and backward rules, often fused into single GPU kernels .

Weight tying means one parameter tensor feeds multiple operations. Gradients w.r.t. that tensor sum contributions from every use site, exactly like fan-out summation in the chain rule .

Softmax is a coupling example: each output probability depends on all logits, so its Jacobian is generally dense, not diagonal. That is why naive softmax followed by log can underflow, while fused log_softmax plus negative log-likelihood loss keeps computations in log space .

In-place tensor operations can silently break autograd if they overwrite values still needed for backward. Custom layers must register correct backward kernels or training will diverge in subtle ways .

Batch normalization backward depends on batch statistics computed during the forward pass; eval mode freezes running averages instead. Each such layer is another node whose local Jacobian structure must be coded correctly .

Convolution backward is not magic: it is the same fan-out summation, but the shared kernel means many edges reuse identical weight indices, so gradients to that kernel accumulate from every spatial position .

Structured graphs: weight sharing and modular layers

Related cards

Video Content

Tasks

Question 1

Question 2

Question 3

Question 4

Card Info

Creator