Differentiable key-value lookup view

Attention can be read as a soft dictionary lookup. Keys index rows of a memory table; values store content; queries ask which rows to retrieve. Hard argmax attention picks a single index, a step that passes no useful gradient back to the scores; softmax attention blends many rows with differentiable weights.
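As a minimal sketch of the two lookups, the snippet below scores one query against three keys and returns either the single best-matching value (hard) or a softmax-weighted blend (soft). The toy keys, values, and query, and the helper names `soft_lookup` and `hard_lookup`, are illustrative assumptions rather than part of any library.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_lookup(query, keys, values):
    # Blend all value rows, weighted by softmax of query-key scores.
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def hard_lookup(query, keys, values):
    # Return only the value row whose key scores highest.
    scores = [dot(query, k) for k in keys]
    best = max(range(len(scores)), key=scores.__getitem__)
    return list(values[best])

# Hypothetical toy memory table: three keys, three values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.5]

soft_out = soft_lookup(query, keys, values)
hard_out = hard_lookup(query, keys, values)
```

Here the hard lookup returns the first value row unchanged, while the soft lookup returns a vector dominated by that row but pulled slightly toward the others.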

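Dot-product scores measure compatibility between query and key directions in the head subspace. If $Q_i$ and $K_j$ are nearly orthogonal, the raw score is close to zero, so position $j$ receives little mass relative to keys that align with $Q_i$, unless an added term such as a positional bias raises its score. Scaling changes how sharply the weights concentrate but leaves a zero score at zero.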
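A small sketch of this effect, assuming the common $1/\sqrt{d}$ scaling of the scores and a hypothetical additive bias vector (both are assumptions for illustration): one key is aligned with the query, the other is orthogonal to it.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, keys, bias=None):
    # Scaled dot-product scores; the optional bias is added per key position.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * dot(query, k) for k in keys]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    return softmax(scores)

query = [1.0, 0.0, 0.0, 0.0]
keys = [
    [3.0, 0.0, 0.0, 0.0],  # aligned with the query
    [0.0, 3.0, 0.0, 0.0],  # orthogonal to the query
]

plain = attention_weights(query, keys)
# Hypothetical positional bias that favors the orthogonal key.
biased = attention_weights(query, keys, bias=[0.0, 3.0])
```

Without the bias, the orthogonal key gets the smaller share of the mass; with the bias, it overtakes the aligned key even though its dot product with the query is still zero.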

Some architectures multiply attention outputs by learned gates (sigmoid or similar) to suppress heads dynamically. The core lookup picture still applies: queries route information from a pool of value vectors.
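One way such gating could look, as a sketch only: each head's output vector is multiplied by a sigmoid of a gate logit. The head outputs and gate logits below are made-up numbers; in a real model the logits would be produced by learned parameters, and the exact gating form varies by architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical outputs of two attention heads for one position.
head_outputs = [[4.0, -2.0], [1.0, 3.0]]
# Hypothetical gate logits: strongly negative closes a head, strongly positive opens it.
gate_logits = [-6.0, 6.0]

# Scale each head's output by its gate value in (0, 1).
gated = [
    [sigmoid(g) * x for x in out]
    for g, out in zip(gate_logits, head_outputs)
]
```

The first head is suppressed to nearly zero while the second passes through almost unchanged; the lookup inside each head is untouched by the gate.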

This view explains why attention excels at long-range dependencies: any query can reach any key in one hop, unlike RNNs that propagate signal through $O(n)$ sequential steps.

The lookup metaphor breaks down if values are identical across positions: softmax weights still mix, but because they sum to 1, the output equals that shared vector regardless of the query. Differing token embeddings and upstream layers (including MLPs) create the value diversity that makes attention nontrivial.
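The degenerate case can be checked directly. In this sketch (toy keys and queries chosen arbitrarily), every position holds the same value vector, and three very different queries all retrieve it.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_lookup(query, keys, values):
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
shared = [2.0, -1.0]
values = [shared, shared, shared]  # identical content at every position

queries = [[5.0, 0.0], [0.0, 5.0], [-3.0, -3.0]]
outputs = [soft_lookup(q, keys, values) for q in queries]
```

Each entry of `outputs` matches `shared` up to floating-point rounding, because a weighted average of identical vectors is that vector whatever the weights are.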

Compare softmax attention to hard attention on the same toy keys: the soft output is smoother and trainable, which is why transformers use softmax even though hard routing looks cleaner in diagrams.
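To make the smoothness concrete, the sketch below nudges a query across the point where two keys tie and measures how far each output moves. The two keys, the scalar values, and the query path `[1 - t, t]` are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [0.0, 1.0]  # scalar values keep the comparison easy to read

def soft_out(t):
    # The query moves from key 0 toward key 1 as t goes from 0 to 1.
    query = [1.0 - t, t]
    weights = softmax([dot(query, k) for k in keys])
    return sum(w * v for w, v in zip(weights, values))

def hard_out(t):
    query = [1.0 - t, t]
    scores = [dot(query, k) for k in keys]
    best = max(range(len(scores)), key=scores.__getitem__)
    return values[best]

# Step across the tie at t = 0.5.
soft_jump = soft_out(0.51) - soft_out(0.49)
hard_jump = hard_out(0.51) - hard_out(0.49)
```

The hard output flips from one value to the other across the tie, and is flat everywhere else, so it offers no gradient signal for adjusting the query. The soft output shifts by only about 0.01 over the same step, giving a usable slope.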
