Score matrix and softmax along keys

Advanced Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Stack all query-key scores into a matrix $S \in \mathbb{R}^{n \times n}$ with $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$. For each row $i$ (fixed query), apply softmax across columns $j$ (keys):

$$\alpha_{ij} = \frac{e^{S_{ij}}}{\sum_{j'=1}^{n} e^{S_{ij'}}}.$$

Each row of the resulting weights lies on the probability simplex: entries are nonnegative and sum to 1.
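The row-wise normalization can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `scaled_scores` and `softmax_rows` and the toy `Q` and `K` values are assumptions made for the example, and subtracting the row maximum is a standard stability step that leaves the softmax result unchanged.

```python
import math

def scaled_scores(Q, K):
    """S[i][j] = (Q_i . K_j) / sqrt(d_k), for lists of equal-length vectors."""
    d_k = len(K[0])
    return [[sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in K]
            for q_i in Q]

def softmax_rows(S):
    """Apply softmax across the columns (keys) of each row (query)."""
    A = []
    for row in S:
        m = max(row)                       # shift by the row max; ratios are unchanged
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

# Toy inputs (hypothetical values): 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A = softmax_rows(scaled_scores(Q, K))

for row in A:
    print([round(a, 3) for a in row], "sum =", round(sum(row), 6))
```

Each printed row sums to 1, and the first query puts its largest weight on the first key, the one it aligns with best.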

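The attention output matrix is $A V$, where $A = [\alpha_{ij}]$ is the softmax-normalized score matrix and $V$ stacks the value vectors as rows, so output row $i$ is the weighted average $\sum_j \alpha_{ij} V_j$. In batched multi-head form, the same pattern repeats independently for each head and each batch item.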
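A small sketch of the product $A V$ and its per-head repetition, using nested lists so it runs without dependencies. The names `attention_output` and `multi_head` and the numeric values are hypothetical, and the weights are assumed to be already softmax-normalized.

```python
def attention_output(A, V):
    """Row i of the result is sum_j A[i][j] * V[j], a weighted average of value rows."""
    d_v = len(V[0])
    return [[sum(a_ij * v_j[c] for a_ij, v_j in zip(a_i, V)) for c in range(d_v)]
            for a_i in A]

def multi_head(A_heads, V_heads):
    """Apply the same pattern to each head independently.

    A batch dimension would add one more outer loop of the same shape.
    """
    return [attention_output(A, V) for A, V in zip(A_heads, V_heads)]

# Hypothetical, already-normalized weights for two heads over two tokens.
A_heads = [[[1.0, 0.0], [0.5, 0.5]],
           [[0.25, 0.75], [0.0, 1.0]]]
V_heads = [[[2.0, 0.0], [0.0, 4.0]],
           [[1.0, 1.0], [3.0, 5.0]]]
out = multi_head(A_heads, V_heads)
print(out)
```

In head 0, the one-hot first row copies the first value vector unchanged, while the uniform second row returns the mean of the two value vectors.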

Numerical stability matters in FP16: if scores grow too large, softmax saturates toward one-hot rows and gradients vanish; if scores collapse toward a common value, rows become nearly uniform and attention fails to focus. The $1/\sqrt{d_k}$ scaling is one guardrail. Dropout on attention weights randomly zeroes some $\alpha_{ij}$ during training and acts as a mild regularizer in some architectures.
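The three points in this paragraph can be checked numerically. The sketch below rests on stated assumptions: query and key entries are drawn as independent unit-variance Gaussians, the FP16 limit is represented only by the constant 65504 (the largest finite half-precision value) rather than by real half-precision arithmetic, and `dropout_weights` is a hypothetical helper using the common inverted-dropout convention of rescaling survivors by $1/(1-p)$.

```python
import math
import random

FP16_MAX = 65504.0  # largest finite half-precision value

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

# 1) Scaling: dot products of unit-variance vectors have std near sqrt(d_k).
rng = random.Random(0)
d_k = 256
raw = [dot([rng.gauss(0, 1) for _ in range(d_k)],
           [rng.gauss(0, 1) for _ in range(d_k)]) for _ in range(500)]
raw_std = std(raw)                                    # roughly sqrt(256) = 16
scaled_std = std([r / math.sqrt(d_k) for r in raw])   # roughly 1

# 2) Overflow: exp of a modest logit already exceeds the FP16 range,
#    while exp after subtracting the row max stays in (0, 1].
logits = [12.0, 3.0, -1.0]
naive_exps = [math.exp(s) for s in logits]
shifted_exps = [math.exp(s - max(logits)) for s in logits]

# 3) Attention dropout (assumed inverted-dropout convention).
def dropout_weights(A, p, rng):
    """Zero each weight with probability p and rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [[a / keep if rng.random() < keep else 0.0 for a in row] for row in A]

dropped = dropout_weights([[0.5, 0.3, 0.2]], p=0.5, rng=rng)

print(round(raw_std, 2), round(scaled_std, 2))
print(naive_exps[0] > FP16_MAX, max(shifted_exps))
print(dropped)
```

After dropout a row generally no longer sums to 1; the rescaling only preserves the expected value of each weight.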

Temperature on attention logits is rarely used in standard transformers, but the softmax saturation story still helps debug collapsed rows: if all mass lands on one key, check whether the $\sqrt{d_k}$ scaling was omitted or whether FP16 overflow clipped the logits.
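One way to act on this debugging advice is to flag rows whose largest weight is close to 1 and compare the result with and without the scaling. The sketch treats division by $\sqrt{d_k}$ as a softmax temperature, which is the same operation written differently. The helper `collapsed_rows`, the 0.99 threshold, and the raw dot products are hypothetical choices for illustration.

```python
import math

def softmax(row, temperature=1.0):
    """Softmax of row / temperature, shifted by the max for stability."""
    scaled = [s / temperature for s in row]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def collapsed_rows(A, threshold=0.99):
    """Indices of rows where a single key holds at least `threshold` of the mass."""
    return [i for i, row in enumerate(A) if max(row) >= threshold]

d_k = 64
# Hypothetical unscaled dot products for two queries over four keys.
raw = [[24.0, 8.0, 0.0, -8.0],
       [1.0, 0.5, 0.0, -0.5]]

unscaled = [softmax(r) for r in raw]                              # scaling omitted
scaled = [softmax(r, temperature=math.sqrt(d_k)) for r in raw]    # scaling applied

print(collapsed_rows(unscaled))  # [0]
print(collapsed_rows(scaled))    # []
```

Here the first query collapses onto one key only when the scaling is left out, which is the pattern the check is meant to surface.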

As a sanity check, visualize the attention rows of a single head on a short sentence: in a trained model you should see mass concentrate on syntactic dependents and repeated tokens. Uniform rows suggest a bug or a dead head, which is a candidate for pruning during analysis.
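Alongside a visual check, a numeric summary per row can help. The sketch below uses normalized Shannon entropy, which is a diagnostic chosen for this example rather than something the card prescribes: it is 0 for a one-hot row and 1 for a uniform row. The function names and the sample rows are hypothetical.

```python
import math

def row_entropy(row):
    """Shannon entropy in nats; zero-weight entries contribute nothing."""
    return -sum(a * math.log(a) for a in row if a > 0.0)

def normalized_entropy(row):
    """Entropy divided by log(n): 0 for one-hot, 1 for uniform (needs n > 1)."""
    return row_entropy(row) / math.log(len(row))

# Hypothetical attention rows for one head over four keys.
A = [[0.25, 0.25, 0.25, 0.25],   # uniform: no focus
     [0.97, 0.01, 0.01, 0.01],   # sharply focused
     [0.0, 1.0, 0.0, 0.0]]       # fully saturated (one-hot)
scores = [normalized_entropy(r) for r in A]

for row, s in zip(A, scores):
    print(row, round(s, 3))
```

Rows with values near 1 are the uniform ones worth inspecting; rows near 0 are saturated.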

Related cards
Builds on Linear projections produce Q, K, V · Machine learning
Next Differentiable key-value lookup view · Machine learning
Tasks
Question 1

After softmax over the keys, each row of the attention matrix:

Hint

Reread the passage on row-wise softmax over keys in "Score matrix and softmax along keys" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Dropping (zeroing) some attention heads or weights during training is:

Hint

Reread the passage on dropout applied to attention weights in "Score matrix and softmax along keys" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Additive logit biases such as ALiBi modify:

Hint

Skim the paragraphs on attention logits and softmax in "Score matrix and softmax along keys" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

If a query's attention row is nearly uniform across keys, the output is:

Hint

Reread the passages on nearly uniform attention rows in "Score matrix and softmax along keys" before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Advanced