Educational Cards

Learn from video content, text, and interactive tasks

Sparse autoencoders and feature directions

Sparse autoencoders (SAEs) train on internal activations to reconstruct them from a larger sparse...

Advanced Machine learning

Catastrophic forgetting and continual learning pain

Fine-tuning updates weights to fit new data. Those same weights encoded prior abilities;...

Advanced Machine learning

MLP neurons as key-value-ish nonlinear transforms

Transformer MLP blocks widen activations, apply a smooth gate (GELU, SiLU), then project back to...

Advanced Machine learning

Parametric memory is distributed, not a tidy file cabinet

When a model answers "Paris is the capital of France," where did that fact live? Not in a single...

Advanced Machine learning

Efficient attention approximations (concept map)

Full softmax attention costs O(n^2) per layer in the sequence length n. When n reaches tens of thousands,...

Advanced Machine learning

Masking bugs and leakage

Two mask families appear in almost every codebase. Padding masks zero out attention to padded...

Advanced Machine learning

Tensor programs: batch × heads × sequence × dim

Production attention is a stack of GEMMs (general matrix multiplies) on GPU. Typical shapes: batch...

Advanced Machine learning

Differentiable key-value lookup view

Attention can be read as a soft dictionary lookup. Keys index rows of a memory table; values store...

Advanced Machine learning

Score matrix and softmax along keys

Stack all query-key scores into a matrix $S \in \mathbb{R}^{n \times n}$ with $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$....

Advanced Machine learning

Linear projections produce Q, K, V

The previous chapter showed attention as a story; this one implements it as matrix multiplies....

Advanced Machine learning

Inference systems: KV-cache and batching

Training processes full sequences in parallel; inference generates one token at a time. Recomputing...

Intermediate Machine learning

Training stability at scale

Training billion-parameter transformers is as much engineering as theory. AdamW decouples weight...

Intermediate Machine learning
