Neural networks (3Blue1Brown)
Intermediate
Machine learning
by Best
Feedforward networks, gradient descent, backpropagation, transformers, and large language models. Nine modules with exercises and short animations, aligned with 3Blue1Brown's neural networks series.
University approvals: 1
(ZHAW - Zürcher Hochschule für Angewandte Wissenschaften: 1)
Layers, weights, activations, and the handwritten-digit example that threads through the series.
- Neurons: weighted sums, bias, then a nonlinearity
- Why depth needs nonlinear breaks
- MNIST-style digits as a running example
- Parameter counting and width vs depth intuition
- Classification heads and probabilistic outputs
- Pedagogical edge cases: saturation, dead ReLUs, initialization
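A minimal sketch of the neuron and parameter-counting cards above, in plain Python. The sigmoid activation and the layer sizes 784, 16, 16, 10 are illustrative assumptions for an MNIST-style digit classifier; the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, plus a bias, then a nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """One dense layer: every neuron sees the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# z = 2.0*0.5 + 1.0*(-1.0) + 0.0 = 0.0, so the activation is 0.5
activation = neuron([0.5, -1.0], [2.0, 1.0], 0.0)

# Assumed layer sizes: 28x28 pixel input, two hidden layers of 16, ten digits
total_parameters = count_parameters([784, 16, 16, 10])
```

Swapping `sigmoid` for a ReLU changes only the nonlinearity; the weighted-sum-plus-bias structure stays the same.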
Cost landscapes, partial derivatives, learning rates, and minibatch noise.
- The cost function as a high-dimensional surface
- Learning rate as step size: fragile knob
- Stochasticity: minibatches approximate the full-data gradient
- Local minima, saddles, and plateaus
- Momentum and adaptive methods (conceptual)
- Bridge to automatic differentiation next
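To make the "fragile knob" concrete, here is a small sketch of plain gradient descent on a toy two-parameter cost. The cost function, starting point, and learning rates are assumptions chosen for illustration, not taken from the course.

```python
def cost(p):
    """Toy cost surface: a bowl that is 10x steeper along y than along x."""
    x, y = p
    return x ** 2 + 10.0 * y ** 2

def grad(p):
    """Partial derivatives of the toy cost with respect to x and y."""
    x, y = p
    return [2.0 * x, 20.0 * y]

def gradient_descent(start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    p = list(start)
    for _ in range(steps):
        g = grad(p)
        p = [pi - learning_rate * gi for pi, gi in zip(p, g)]
    return p

# A modest step size settles into the minimum at (0, 0).
converged = gradient_descent([1.0, 1.0], learning_rate=0.05, steps=100)

# A slightly larger one overshoots along the steep y direction and blows up.
diverged = gradient_descent([1.0, 1.0], learning_rate=0.11, steps=100)
```

Along y, each step multiplies the coordinate by `1 - 20 * learning_rate`, so any rate above 0.1 makes that factor exceed 1 in magnitude and the iterates grow without bound.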
Computation graphs, chain rule along edges, and reuse vs naive perturbation.
- Backprop as structured chain rule on a DAG
- Memoization: forward activations feed backward formulas
- Forward-mode vs reverse-mode: when each wins
- Structured graphs: weight sharing and modular layers
- Debugging gradients in practice
- Handoff to the calculus-heavy walkthrough
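The chain-rule and gradient-debugging cards above can be sketched on the smallest possible graph: one sigmoid neuron feeding a squared-error cost. The function names and test values are hypothetical; the finite-difference comparison is a common sanity check, not a method the course is confirmed to use.

```python
import math

def forward(w, b, x, y):
    """Forward pass: keep the intermediate values for reuse in the backward pass."""
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    c = (a - y) ** 2
    return z, a, c

def backward(w, b, x, y):
    """Reverse-mode chain rule: dC/dw and dC/db from stored forward values."""
    _, a, _ = forward(w, b, x, y)
    dc_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = a * (1.0 - a)      # sigmoid derivative, written via its own output
    dc_dz = dc_da * da_dz
    return dc_dz * x, dc_dz    # dz/dw = x, dz/db = 1

def numeric_gradient(w, b, x, y, eps=1e-6):
    """Naive perturbation: nudge each parameter and re-run the forward pass."""
    dw = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
    db = (forward(w, b + eps, x, y)[2] - forward(w, b - eps, x, y)[2]) / (2 * eps)
    return dw, db

analytic = backward(0.7, -0.2, 1.5, 1.0)
numeric = numeric_gradient(0.7, -0.2, 1.5, 1.0)
```

The perturbation approach needs two forward passes per parameter, which is why it serves as a debugging check rather than a training method; the backward pass gets every partial derivative from one sweep.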
Partial derivatives with careful indexing; softmax/log tricks; numerical hygiene.
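One common form of the softmax/log trick is to subtract the largest logit before exponentiating. The sketch below assumes that variant, with hypothetical helper names; it is one way to keep the arithmetic finite, not necessarily the one the module presents.

```python
import math

def log_softmax(logits):
    """Log-probabilities via the max-shift (log-sum-exp) trick."""
    m = max(logits)
    shifted = [v - m for v in logits]            # largest entry becomes 0
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

def softmax(logits):
    """Probabilities recovered from the stable log-probabilities."""
    return [math.exp(v) for v in log_softmax(logits)]

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct class."""
    return -log_softmax(logits)[target_index]

# math.exp(1000.0) alone would raise OverflowError; the shifted version is fine.
big_logits = [1000.0, 1000.0]
probs = softmax(big_logits)
loss = cross_entropy(big_logits, 0)
```

Shifting every logit by the same constant leaves the softmax output unchanged, so the trick costs nothing in accuracy.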
Autoregressive next-token modeling, scale, alignment, and failure modes.
Self-attention, positional info, residuals, and transformer blocks.
Q/K/V linear maps, softmax weights, masking, and batched tensor programs.
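A rough sketch of single-head causal attention for one sequence, written with plain lists so every step is visible. The identity projection matrices and the three-token input are placeholder assumptions; a real implementation would use a tensor library and a batch dimension.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_row(row):
    """Stable softmax over one row; -inf entries come out as exactly 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention where token i attends only to tokens <= i."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = []
    for i, row in enumerate(scores):
        masked = [s / math.sqrt(d_k) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        weights.append(softmax_row(masked))
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]                # placeholder projections
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 tokens, 2 dimensions each
output, weights = causal_attention(tokens, identity, identity, identity)
```

Setting future positions to negative infinity before the softmax is what makes each row of `weights` a proper distribution over past and present tokens only.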
Distributed representations, interpretability probes, forgetting, retrieval, and editing.
- Parametric memory is distributed, not a tidy file cabinet
- MLP neurons as key-value-ish nonlinear transforms
- Catastrophic forgetting and continual learning pain
- Sparse autoencoders and feature directions
- Retrieval-augmented generation and tool use
- Knowledge editing and limits of surgical updates
Diffusion and latent-space models for images and video; guest episode by Welch Labs.
- Guest bridge: from classical optimization to generative stacks
- Diffusion intuition: destroy, then learn to undo
- Video is harder than still images (temporal coherence)
- Engineering trade-offs: steps, guidance, distillation
- Safety stacks beyond the loss function
- Where this leaves the 3b1b neural-networks arc
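The "destroy" half of the diffusion card can be sketched as a forward noising process. The linear noise schedule, its endpoints, and the 1000-step count below are assumptions borrowed from common DDPM-style setups, not details stated in the course; the learned "undo" network is deliberately left out.

```python
import math
import random

def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal variance left at each step."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def noise_sample(x0, alpha_bar, rng):
    """Jump straight to a noise level: sqrt(a)*x0 + sqrt(1 - a)*gaussian."""
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = cumulative_alphas(linear_betas(1000))
rng = random.Random(0)                     # seeded for repeatability
x0 = [1.0, -1.0, 0.5]                      # stand-in for image pixels
slightly_noisy = noise_sample(x0, alpha_bars[0], rng)
mostly_noise = noise_sample(x0, alpha_bars[-1], rng)
```

By the last step almost none of the original signal remains, which is why a trained model can start generation from pure noise and work backward.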