Self-attention mixes tokens via learned compatibilities

Chapter 5 framed language as next-token prediction; this chapter asks how a model can mix information across an entire sentence in parallel. Self-attention is the core mechanism: each token position emits a query vector expressing what it needs, a key vector advertising what it offers, and a value vector carrying the content to be blended.

Fix one query position $i$. Score every key $j$ with a compatibility function, typically a scaled dot product $Q_i \cdot K_j / \sqrt{d_k}$. Apply softmax across $j$ to obtain nonnegative weights $\alpha_{ij}$ that sum to 1. The output at $i$ is the weighted sum $\sum_j \alpha_{ij} V_j$, a convex combination of the value vectors.
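The recipe for a single query position can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `softmax` and `attend` and the toy keys, values, and query are assumptions chosen for the example, and real models compute all positions at once with matrix operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query position (illustrative helper)."""
    d_k = len(query)
    # Compatibility score of the query with every key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    # Softmax across key positions j gives the weights alpha_ij.
    weights = softmax(scores)
    # Output is the weighted sum of value vectors, one component at a time.
    d_v = len(values[0])
    output = [
        sum(w * v[c] for w, v in zip(weights, values))
        for c in range(d_v)
    ]
    return weights, output

# Toy data (hypothetical): three key/value positions, d_k = d_v = 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)  # nonnegative, sums to 1
print(output)   # lies inside the convex hull of the value vectors
```

Because the query aligns equally with the first and third keys and not at all with the second, the first and third weights are equal and larger than the second.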

Without positional information, this recipe is permutation equivariant: shuffling token order while keeping the same multiset of embeddings yields the same pairwise scores, and hence the same outputs, up to reordering. Word order carries meaning in language, so transformers inject positional encodings (next card). Multi-head attention runs several attention maps in parallel subspaces, then concatenates and projects the results, letting different heads specialize in syntax, coreference, or local patterns.
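The order-blindness described above can be checked numerically. The sketch below assumes a single head with identity projections (queries, keys, and values are all the raw token vectors), which is a simplification made only for this example; the helper `self_attention` and the toy tokens are hypothetical. Shuffling the input rows should shuffle the output rows in exactly the same way and change nothing else.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Single-head self-attention with Q = K = V = xs (assumed identity projections)."""
    d_k = len(xs[0])
    outputs = []
    for q in xs:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in xs
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, xs))
            for c in range(d_k)
        ])
    return outputs

# Toy token embeddings with no positional encoding added.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
perm = [2, 0, 1]
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Apply the same permutation to the original outputs for comparison.
expected = [out[p] for p in perm]
same = all(
    abs(a - b) < 1e-9
    for row_a, row_b in zip(out_shuffled, expected)
    for a, b in zip(row_a, row_b)
)
print(same)
```

The comparison uses a small tolerance because the permuted sums are accumulated in a different order, which can change the last few floating-point digits.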

Scaling dot products by $1/\sqrt{d_k}$ keeps softmax from saturating as head dimension grows: if query and key components are independent with zero mean and unit variance, the raw dot product has variance $d_k$, which pushes softmax toward one-hot regimes where gradients vanish.
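A quick simulation illustrates the variance argument. It assumes query and key components are independent standard normal draws, an idealization that trained models need not satisfy; the helper `dot_variance`, the sample count, and the seed are arbitrary choices for this sketch.

```python
import math
import random

def dot_variance(d_k, scale, n_samples=2000, seed=0):
    """Empirical variance of q.k for random q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        # Optionally apply the 1/sqrt(d_k) scaling used in attention.
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / n_samples

raw = {d: dot_variance(d, scale=False) for d in (4, 64)}
scaled = {d: dot_variance(d, scale=True) for d in (4, 64)}

print(raw)     # variance grows roughly in proportion to d_k
print(scaled)  # variance stays close to 1 for both head sizes
```

With scaling, the scores fed to softmax stay on a comparable scale regardless of head dimension, so the attention weights do not collapse toward a single position merely because $d_k$ is large.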
