Educational Cards
Learn from video content, text, and interactive tasks
Sparse autoencoders and feature directions
Sparse autoencoders (SAEs) train on internal activations to reconstruct them from a larger sparse...
Catastrophic forgetting and continual learning pain
Fine-tuning updates weights to fit new data. Those same weights encoded prior abilities;...
MLP neurons as key-value-ish nonlinear transforms
Transformer MLP blocks widen activations, apply a smooth gate (GeLU, SiLU), then project back to...
Parametric memory is distributed, not a tidy file cabinet
When a model answers "Paris is the capital of France," where did that fact live? Not in a single...
Efficient attention approximations (concept map)
Full softmax attention costs O(n^2) per layer in sequence length. When n reaches tens of thousands,...
Masking bugs and leakage
Two mask families appear in almost every codebase. Padding masks zero out attention to padded...
Tensor programs: batch × heads × sequence × dim
Production attention is a stack of GEMMs (general matrix multiplies) on GPU. Typical shapes: batch...
Differentiable key-value lookup view
Attention can be read as a soft dictionary lookup. Keys index rows of a memory table; values store...
Score matrix and softmax along keys
Stack all query-key scores into a matrix $S \in \mathbb{R}^{n \times n}$ with $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$....
Linear projections produce Q, K, V
The previous chapter showed attention as a story; this one implements it as matrix multiplies....
Inference systems: KV-cache and batching
Training processes full sequences in parallel; inference generates one token at a time. Recomputing...
Training stability at scale
Training billion-parameter transformers is as much engineering as theory. AdamW decouples weight...