Educational Cards

Learn from video content, text, and interactive tasks

Sparse autoencoders and feature directions

Sparse autoencoders (SAEs) train on internal activations to reconstruct them from a larger sparse...

Advanced Machine learning
Catastrophic forgetting and continual learning pain

Fine-tuning updates weights to fit new data. Those same weights encoded prior abilities;...

Advanced Machine learning
MLP neurons as key-value-ish nonlinear transforms

Transformer MLP blocks widen activations, apply a smooth gate (GELU, SiLU), then project back to...

Advanced Machine learning
Parametric memory is distributed, not a tidy file cabinet

When a model answers "Paris is the capital of France," where did that fact live? Not in a single...

Advanced Machine learning
Efficient attention approximations (concept map)

Full softmax attention costs O(n^2) per layer in sequence length. When n reaches tens of thousands,...

Advanced Machine learning
Masking bugs and leakage

Two mask families appear in almost every codebase. Padding masks zero out attention to padded...

Advanced Machine learning
Tensor programs: batch × heads × sequence × dim

Production attention is a stack of GEMMs (general matrix multiplies) on GPU. Typical shapes: batch...

Advanced Machine learning
Differentiable key-value lookup view

Attention can be read as a soft dictionary lookup. Keys index rows of a memory table; values store...

Advanced Machine learning
Score matrix and softmax along keys

Stack all query-key scores into a matrix $S \in \mathbb{R}^{n \times n}$ with $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$....

Advanced Machine learning
Linear projections produce Q, K, V

The previous chapter showed attention as a story; this one implements it as matrix multiplies....

Advanced Machine learning
Inference systems: KV-cache and batching

Training processes full sequences in parallel; inference generates one token at a time. Recomputing...

Intermediate Machine learning
Training stability at scale

Training billion-parameter transformers is as much engineering as theory. AdamW decouples weight...

Intermediate Machine learning