Linear projections produce Q, K, V

Advanced Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

The previous chapter showed attention as a story; this one implements it as matrix multiplies. Start with token embeddings $X \in \mathbb{R}^{n \times d_{\text{model}}}$ for sequence length $n$. Three learned weight matrices produce queries, keys, and values:

$$Q = X W^Q,\quad K = X W^K,\quad V = X W^V.$$
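The three projections above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes `n = 4` and `d_model = 8` are arbitrary, the names `W_q`, `W_k`, `W_v` are hypothetical, and the weights are random stand-ins for matrices that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen only for illustration.
n, d_model = 4, 8

# Token embeddings X: one row per token.
X = rng.standard_normal((n, d_model))

# Random stand-ins for the learned projection matrices W^Q, W^K, W^V.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

# Q = X W^Q, K = X W^K, V = X W^V
Q = X @ W_q
K = X @ W_k
V = X @ W_v

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

Each row of `Q`, `K`, and `V` still corresponds to one token; the projection only changes the feature axis.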

In multi-head attention, the outputs of the projections $W^Q, W^K, W^V$ are split along the feature axis into $h$ heads, each of dimension $d_h$, where typically $d_{\text{model}} = h \cdot d_h$.
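One way to realize the head split is a reshape followed by a transpose. The sketch below assumes an unbatched input of shape `(n, d_model)` and a `(h, n, d_h)` output layout; the helper names `split_heads` and `merge_heads` are hypothetical, and other layouts are equally valid as long as they are used consistently.

```python
import numpy as np

def split_heads(x, h):
    """Reshape (n, d_model) into (h, n, d_h), assuming d_model = h * d_h."""
    n, d_model = x.shape
    if d_model % h != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_h = d_model // h
    # (n, d_model) -> (n, h, d_h) -> (h, n, d_h)
    return x.reshape(n, h, d_h).transpose(1, 0, 2)

def merge_heads(x):
    """Inverse of split_heads: (h, n, d_h) back to (n, h * d_h)."""
    h, n, d_h = x.shape
    return x.transpose(1, 0, 2).reshape(n, h * d_h)

# Three tokens, d_model = 8, split into h = 2 heads of size d_h = 4.
Q = np.arange(3 * 8, dtype=float).reshape(3, 8)
Q_heads = split_heads(Q, h=2)
print(Q_heads.shape)  # (2, 3, 4)
```

With this layout, head 0 holds the first $d_h$ features of every token and head 1 the next $d_h$, so merging the heads recovers the original array exactly.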

Queries and keys must share the same inner-product dimension so that the scores $Q_i \cdot K_j$ are well-defined. Values supply the vectors that get mixed; attention weights act on values, not on raw embeddings. Shape bugs across the (batch, heads, seq, dim) axes are among the most common problems when porting attention to code.
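The shape constraints above can be checked explicitly before any attention is computed. The following is a sketch under stated assumptions: arrays are laid out as `(batch, heads, seq, dim)`, and the function name `check_qkv_shapes` is hypothetical. It deliberately allows the key sequence length and the value dimension to differ from those of the queries, which is more general than the self-attention case described in this card.

```python
import numpy as np

def check_qkv_shapes(Q, K, V):
    """Validate (batch, heads, seq, dim) layouts and return the output shape."""
    if not (Q.ndim == K.ndim == V.ndim == 4):
        raise ValueError("expected 4-D arrays laid out as (batch, heads, seq, dim)")
    if Q.shape[:2] != K.shape[:2] or K.shape[:2] != V.shape[:2]:
        raise ValueError("batch and head axes must match across Q, K, V")
    if Q.shape[-1] != K.shape[-1]:
        raise ValueError("queries and keys must share the inner-product dimension")
    if K.shape[2] != V.shape[2]:
        raise ValueError("there must be exactly one value vector per key")
    batch, heads, n_queries, _ = Q.shape
    # One output vector per query, with the dimension of the values.
    return (batch, heads, n_queries, V.shape[-1])

Q = np.zeros((2, 4, 5, 16))
K = np.zeros((2, 4, 7, 16))
V = np.zeros((2, 4, 7, 32))
print(check_qkv_shapes(Q, K, V))  # (2, 4, 5, 32)
```

Raising early with a descriptive message is usually easier to debug than a broadcasting error deep inside a matrix multiply.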

Write the output at position $i$ as $\sum_j \alpha_{ij} V_j$, where the weights $\alpha_{ij}$ come from a softmax over $j$ of the scores of $Q_i$ against the keys $K_j$ (commonly scaled by $1/\sqrt{d_h}$). That single formula is the entire forward pass of one attention head before the output projection.
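The formula for one head translates almost directly into code. The sketch below assumes unbatched inputs of shape `(n, d_h)` and applies the usual scaled dot-product convention of dividing scores by $\sqrt{d_h}$, which is included here as an assumption. The names `softmax` and `attention_head` are illustrative, and no masking is applied.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    """Single-head attention: returns (output, weights)."""
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)    # scores[i, j] = Q_i . K_j / sqrt(d_h)
    alpha = softmax(scores, axis=-1)   # normalize over keys j for each query i
    return alpha @ V, alpha            # output[i] = sum_j alpha[i, j] * V[j]

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

out, alpha = attention_head(Q, K, V)
print(out.shape)           # (3, 4)
print(alpha.sum(axis=-1))  # each row of weights sums to 1 (up to rounding)
```

The softmax axis matters: normalizing over the key axis makes each output row a convex combination of value vectors.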

When implementing from scratch, print the shapes of $Q$, $K$, and $V$ after every projection. The most common bug transposes the sequence and head axes, producing attention maps that look plausible in plots but scramble token alignment. Unit tests on a two-token toy sequence catch this before scaling to batch size 32.
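A two-token test of this kind might look like the sketch below. It assumes the `(h, n, d_h)` head layout and uses hypothetical helpers: `split_heads` is one plausible correct split, and `split_heads_buggy` is a deliberately wrong variant that reshapes straight to the target shape. Both return arrays of the same shape, which is why a shape check alone does not catch the bug.

```python
import numpy as np

def split_heads(x, h):
    """Correct split: (n, d_model) -> (n, h, d_h) -> (h, n, d_h)."""
    n, d_model = x.shape
    return x.reshape(n, h, d_model // h).transpose(1, 0, 2)

def split_heads_buggy(x, h):
    """Deliberately wrong: a direct reshape mixes token and head axes."""
    n, d_model = x.shape
    return x.reshape(h, n, d_model // h)

def test_two_token_split():
    # Token 0 has features 0..3, token 1 has features 10..13,
    # so every entry reveals which token and feature it came from.
    x = np.array([[0.0, 1.0, 2.0, 3.0],
                  [10.0, 11.0, 12.0, 13.0]])
    heads = split_heads(x, h=2)
    assert heads.shape == (2, 2, 2)
    # Head 0 must hold features 0..1 of each token, head 1 features 2..3.
    assert np.array_equal(heads[0], [[0.0, 1.0], [10.0, 11.0]])
    assert np.array_equal(heads[1], [[2.0, 3.0], [12.0, 13.0]])
    # The buggy version has the right shape but the wrong contents.
    bad = split_heads_buggy(x, h=2)
    assert bad.shape == heads.shape
    assert not np.array_equal(bad, heads)

test_two_token_split()
print("two-token split test passed")
```

Using inputs whose values encode their own position makes misalignment visible at a glance, which random inputs do not.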

Tasks
Question 1

In multi-head attention, the per-head dimension $d_h$ usually satisfies:

Hint

Reread the paragraph on multi-head attention and the per-head dimension in "Linear projections produce Q, K, V" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Queries and keys must share the same inner-product dimension so that:

Hint

Reread the paragraph on why queries and keys must share a dimension in "Linear projections produce Q, K, V" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

In attention, the value vectors are the ones that get:

Hint

Reread the paragraph on what the value vectors contribute in "Linear projections produce Q, K, V" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

The attention output at position $i$ can be written as:

Hint

Reread the paragraph on how the attention output at a position is written in "Linear projections produce Q, K, V" before choosing. Eliminate options that contradict a definition stated in the card.
