Linear projections produce Q, K, V
The previous chapter showed attention as a story; this one implements it as matrix multiplies. Start with token embeddings $X \in \mathbb{R}^{n \times d_{\text{model}}}$ for sequence length $n$. Three learned weight matrices produce queries, keys, and values:
$$Q = X W^Q,\quad K = X W^K,\quad V = X W^V.$$
In multi-head attention, the outputs of $W^Q, W^K, W^V$ are split into $h$ heads, each with subdimension $d_h$, where typically $d_{\text{model}} = h \cdot d_h$.
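
As a rough illustration of the projections and the head split, the following NumPy sketch may help. The sizes (`n = 5`, `d_model = 16`, `h = 4`), the random stand-ins for the learned weights, and the helper name `split_heads` are all assumptions made for this example, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, h = 5, 16, 4          # assumed toy sizes: sequence length, model width, heads
d_h = d_model // h                # per-head subdimension, so d_model = h * d_h

X = rng.normal(size=(n, d_model))             # token embeddings
W_Q = rng.normal(size=(d_model, d_model))     # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each has shape (n, d_model)

def split_heads(M, h):
    """Reshape (n, h * d_h) into (h, n, d_h): one (n, d_h) slice per head."""
    n, d = M.shape
    return M.reshape(n, h, d // h).transpose(1, 0, 2)

Q_heads, K_heads, V_heads = (split_heads(M, h) for M in (Q, K, V))
print(Q.shape, Q_heads.shape)                 # (5, 16) (4, 5, 4)
```

Here head `j` simply receives columns `j * d_h` through `(j + 1) * d_h - 1` of each projection; the reshape happens before the transpose so that each token's row stays intact.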

Queries and keys must share an inner-product dimension so that the scores $Q_i \cdot K_j$ are well-defined. Values supply the vectors that get mixed; attention weights act on values, not on raw embeddings. Shape bugs involving the (batch, heads, seq, dim) axes are the most common category of bug when porting attention to code.

Write the output at position $i$ as $\sum_j \alpha_{ij} V_j$, where the weights $\alpha_{ij}$ are the scores of $Q_i$ against the keys $K_j$, softmax-normalized over $j$. That single formula is the entire forward pass of one attention head before the output projection.
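
The weighted sum above can be sketched for a single head in NumPy. This is a minimal version under stated assumptions: the function names `softmax` and `attention_head` are made up for the example, the inputs are random, and the scores are divided by $\sqrt{d_h}$ as in standard scaled dot-product attention, a scaling the text above does not spell out.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    """One attention head: Q, K are (n, d_h); V is (n, d_v)."""
    d_h = Q.shape[-1]
    # scores[i, j] = Q_i . K_j, scaled by sqrt(d_h) (assumed scaling)
    scores = Q @ K.T / np.sqrt(d_h)       # (n, n)
    alpha = softmax(scores, axis=-1)      # each row i sums to 1 over j
    return alpha @ V, alpha               # out[i] = sum_j alpha[i, j] * V[j]

rng = np.random.default_rng(0)
n, d_h = 4, 8                             # assumed toy sizes
Q, K, V = (rng.normal(size=(n, d_h)) for _ in range(3))
out, alpha = attention_head(Q, K, V)
print(out.shape, alpha.shape)             # (4, 8) (4, 4)
```

Returning `alpha` alongside the output is a convenience for inspection; each row of `alpha` should sum to 1.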
When implementing from scratch, print the shapes of $Q$, $K$, and $V$ after every projection. The most common bug transposes the sequence and head axes, producing attention maps that look plausible in plots but scramble token alignment. Unit tests on a two-token toy sequence catch this before scaling to batch size 32.
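
One possible shape for such a two-token test is sketched below. The helpers `split_heads`, `split_heads_buggy`, and `split_is_aligned` are hypothetical names invented for this example; the buggy variant reshapes straight to `(h, n, d_h)`, which gives the right shape but mixes up tokens and heads.

```python
import numpy as np

def split_heads(M, h):
    """Intended split: (n, h * d_h) -> (h, n, d_h), rows stay aligned to tokens."""
    n, d = M.shape
    return M.reshape(n, h, d // h).transpose(1, 0, 2)

def split_heads_buggy(M, h):
    """Hypothetical bug: same output shape, but token and head axes are swapped."""
    n, d = M.shape
    return M.reshape(h, n, d // h)

def split_is_aligned(split_fn):
    """Check on a two-token toy input that head j, token i holds M[i, j*d_h:(j+1)*d_h]."""
    n, h, d_h = 2, 2, 2
    # token 0 is [0, 1, 2, 3], token 1 is [4, 5, 6, 7]
    M = np.arange(n * h * d_h).reshape(n, h * d_h)
    heads = split_fn(M, h)
    if heads.shape != (h, n, d_h):
        return False
    return all(
        np.array_equal(heads[j, i], M[i, j * d_h:(j + 1) * d_h])
        for j in range(h)
        for i in range(n)
    )

print(split_is_aligned(split_heads))        # True
print(split_is_aligned(split_heads_buggy))  # False
```

Both functions return an array of shape `(2, 2, 2)`, so a shape printout alone would not separate them; the value check on distinct integers is what exposes the misalignment.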