Self-attention mixes tokens via learned compatibilities

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Chapter 5 framed language as next-token prediction; this chapter asks how a model can mix information across an entire sentence in parallel. Self-attention is the core mechanism: each token position emits a query vector that asks what it needs, a key vector that advertises what it offers, and a value vector that carries the content to be blended.

Fix one query position $i$. Score every key position $j$ with a compatibility function, typically the scaled dot product $Q_i \cdot K_j / \sqrt{d_k}$, where $d_k$ is the dimension of the query and key vectors. Apply softmax across $j$ to obtain positive weights $\alpha_{ij}$ that sum to 1. The output at $i$ is the weighted sum $\sum_j \alpha_{ij} V_j$, a convex combination of the value vectors.
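The recipe above can be sketched for a single query in plain Python. This is a minimal illustration rather than a reference implementation: the function names `softmax` and `attend` and the toy vectors are assumptions made for this example, and real models use batched matrix operations instead of Python loops.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query position.

    Returns (weights, output), where output = sum_j weights[j] * values[j].
    """
    d_k = len(query)
    # Compatibility score of the query with every key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output coordinate at a time.
    d_v = len(values[0])
    output = [
        sum(w * v[c] for w, v in zip(weights, values))
        for c in range(d_v)
    ]
    return weights, output

# Toy example (hypothetical numbers): 3 positions, d_k = 2, d_v = 2.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print(weights)  # largest weight on the key most aligned with the query
print(output)   # a blend of the three value vectors
```

Because the weights are positive and sum to 1, every coordinate of the output lies between the smallest and largest corresponding coordinate of the value vectors.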

Without positional information, this recipe is permutation equivariant: shuffling the token order while keeping the same multiset of embeddings yields the same pairwise scores, and hence the same outputs, up to the same reordering. Word order carries meaning in language, so transformers inject positional encodings (next card). Multi-head attention runs several attention maps in parallel over lower-dimensional subspaces, then concatenates the head outputs and applies a final linear projection, which lets different heads specialize, for example in syntax, coreference, or local patterns.
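The order-blindness of plain attention can be checked numerically. The sketch below assumes the simplest possible self-attention layer, in which queries, keys, and values are all the raw embeddings themselves (no learned projections, a simplification made only for this example). The name `self_attention` and the toy embeddings are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Self-attention with Q = K = V = x: no projections, no positions."""
    d_k = len(x[0])
    outputs = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        w = softmax(scores)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d_k)]
        )
    return outputs

# Hypothetical embeddings for a three-token sequence.
x = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
perm = [2, 0, 1]  # new position p holds the token that was at perm[p]

out_original = self_attention(x)
out_shuffled = self_attention([x[p] for p in perm])

# Shuffling the input should only shuffle the output rows in the same way.
expected = [out_original[p] for p in perm]
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(out_shuffled, expected)
    for a, b in zip(row_a, row_b)
)
print(same)
```

Each token receives the same output vector wherever it sits in the sequence, so nothing in the layer itself can distinguish one ordering from another. That is the gap positional encodings are meant to fill.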

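Scaling dot products by $1/\sqrt{d_k}$ keeps softmax from saturating as the head dimension grows. If the components of $Q_i$ and $K_j$ are independent with zero mean and unit variance, the raw dot product has variance $d_k$, so large heads push softmax toward near one-hot outputs where gradients become vanishingly small. Dividing by $\sqrt{d_k}$ brings the variance back to 1.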
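A small simulation can make the variance argument concrete. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization and not a property of trained models. The helper names, the choice of 16 keys, and the number of trials are arbitrary choices made for this example.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_scores(d_k, n_keys, rng):
    """Raw dot products of one random query with n_keys random keys."""
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    return [sum(qi * rng.gauss(0.0, 1.0) for qi in q) for _ in range(n_keys)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 200
results = {}
for d_k in (4, 64, 256):
    raw_scores = []
    peak_raw = 0.0     # average largest softmax weight, unscaled scores
    peak_scaled = 0.0  # same, after dividing scores by sqrt(d_k)
    for _ in range(trials):
        scores = sample_scores(d_k, n_keys=16, rng=rng)
        raw_scores.extend(scores)
        peak_raw += max(softmax(scores)) / trials
        scaled = [s / math.sqrt(d_k) for s in scores]
        peak_scaled += max(softmax(scaled)) / trials
    results[d_k] = (variance(raw_scores), peak_raw, peak_scaled)
    print(d_k, results[d_k])
```

Under these assumptions the measured variance of the raw scores should track $d_k$, and the average largest weight without scaling should approach 1 as $d_k$ grows, while the scaled version stays well away from one-hot.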

Tasks
Question 1

Scaled dot-product attention divides the scores by $\sqrt{d_k}$ mainly to:

Hint

Reread the paragraph on scaling the dot products by $1/\sqrt{d_k}$ before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Multi-head attention works by:

Hint

Reread the paragraph on multi-head attention before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Applying softmax over the keys for a fixed query produces:

Hint

Reread the paragraph on applying softmax over the keys for a fixed query before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why must positional information be added if word order matters?

Hint

Reread the paragraph on positional information and word order before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Intermediate