Differentiable key-value lookup view

Advanced · Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Attention can be read as a soft dictionary lookup. Keys index the rows of a memory table, values store the content of those rows, and queries determine which rows to retrieve. Hard (argmax) attention picks a single row; because argmax has zero gradient almost everywhere, it cannot be trained by plain backpropagation. Softmax attention instead blends many rows with differentiable weights.
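The lookup reading can be made concrete with a minimal NumPy sketch. The toy keys, values, and query below are made-up numbers, and the division by the square root of the key dimension is the usual scaled dot-product convention, assumed here rather than stated in this card.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def soft_lookup(query, keys, values):
    """Blend rows of `values` using softmax-normalized query-key dot products."""
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one compatibility score per memory row
    weights = softmax(scores)            # differentiable distribution over rows
    return weights @ values, weights

# Toy memory with three rows (illustrative numbers only).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]])
query = np.array([4.0, 0.0])   # points along the first key

output, weights = soft_lookup(query, keys, values)
print(weights)  # the largest weight falls on row 0
print(output)   # a weighted blend of all three value rows
```

The output is a convex combination of the value rows, so it stays close to row 0 without being exactly equal to it.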

Dot-product scores measure compatibility between query and key directions in the head subspace. If $Q_i$ and $K_j$ are nearly orthogonal, the raw score $Q_i \cdot K_j$ is close to zero, so position $j$ receives little mass whenever better-aligned keys score higher, unless other terms (positional bias, scaling) dominate.
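A small sketch of the orthogonality argument, using hand-picked vectors. The additive `bias` is a hypothetical stand-in for a positional term; it is not a specific scheme from this card.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

q = np.array([3.0, 0.0, 0.0])
k_aligned = np.array([3.0, 0.0, 0.0])      # same direction as q
k_orthogonal = np.array([0.0, 3.0, 0.0])   # perpendicular to q

scores = np.array([q @ k_aligned, q @ k_orthogonal])
weights = softmax(scores)
print(scores)   # the orthogonal key scores exactly zero
print(weights)  # almost all mass goes to the aligned key

# Hypothetical additive bias favoring the second position: it can
# override the geometry even though the raw score there is zero.
bias = np.array([0.0, 12.0])
biased_weights = softmax(scores + bias)
print(biased_weights)  # most mass now sits on the orthogonal key
```

Note that a zero score is only "small" relative to the other scores in the same row; softmax mass depends on score differences, not absolute values.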

Some architectures multiply attention outputs by learned gates (sigmoid or similar) to suppress heads dynamically. The core lookup picture still applies: queries route information from a pool of value vectors.
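As a rough illustration of gating, the sketch below scales per-head outputs by a sigmoid gate. The random `head_outputs` stand in for real attention outputs, and the fixed `gate_logits` are hypothetical; in a real gated variant the logits would typically be computed from the input so the gate can change per token.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_heads, d_head = 2, 4

# Stand-in for the outputs of two attention heads at one position.
head_outputs = rng.normal(size=(n_heads, d_head))

# Hypothetical gate logits: head 0 is let through, head 1 is suppressed.
gate_logits = np.array([6.0, -6.0])
gates = sigmoid(gate_logits)[:, None]   # shape (n_heads, 1), values in (0, 1)

gated = gates * head_outputs            # elementwise scaling per head
print(gates.ravel())
print(gated)
```

The gate only rescales what the lookup already retrieved; it does not change which value rows were mixed.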

This view explains why attention handles long-range dependencies well: any query can reach any key in one hop, whereas an RNN must propagate the signal through $O(n)$ sequential steps.
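The path-length contrast can be sketched on a toy sequence. The recurrence below is a made-up `tanh` update chosen only to count sequential steps, not a particular RNN architecture, and the inputs are random.

```python
import numpy as np

n, d = 6, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))   # toy sequence of n token vectors

# Attention: one matrix product scores the last query against every key,
# so position 0 is reachable from position n - 1 in a single hop.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
direct_weight = weights[-1, 0]   # strictly positive softmax weight

# Toy recurrence: information from position 0 reaches the final state
# only after passing through every intermediate update.
h = np.tanh(x[0])                # state after reading position 0
sequential_steps = 0
for t in range(1, n):
    h = np.tanh(0.5 * h + x[t])
    sequential_steps += 1

print(direct_weight)
print(sequential_steps)   # n - 1 updates separate position 0 from the end
```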

The lookup metaphor breaks down if values are identical across positions: softmax weights still mix, but because they sum to one, the output equals that shared vector regardless of the query. Upstream layers (token embeddings and earlier attention and MLP blocks) create the value diversity that makes attention nontrivial.
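The degenerate case follows from the weights summing to one: $\sum_j a_{ij} v = v \sum_j a_{ij} = v$. A quick check with made-up keys and two very different queries, as a sketch:

```python
import numpy as np

def attend(query, keys, values):
    scores = keys @ query / np.sqrt(keys.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
shared = np.array([2.0, -1.0, 0.5])
values = np.tile(shared, (3, 1))   # every memory row holds the same vector

out_a = attend(np.array([5.0, 0.0]), keys, values)
out_b = attend(np.array([-3.0, 7.0]), keys, values)

print(out_a)  # equals `shared` up to floating-point error
print(out_b)  # same result, even though the attention weights differ
```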

Compare softmax attention to hard attention on the same toy keys: the soft output varies smoothly with the query and is trainable by gradient descent, which is why transformers use softmax even though hard routing looks cleaner in diagrams.
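One way to run that comparison, as a minimal sketch with two hand-picked keys and scalar values: nudge the query across the decision boundary and measure how far each output moves.

```python
import numpy as np

keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[1.0], [-1.0]])

def soft(query):
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return (w @ values).item()

def hard(query):
    # argmax selects exactly one row; the selection is piecewise constant.
    return values[np.argmax(keys @ query)].item()

# Two nearly identical queries on opposite sides of the tie between keys.
q_before = np.array([1.00, 0.99])   # slightly favors key 0
q_after = np.array([0.99, 1.00])    # slightly favors key 1

soft_jump = abs(soft(q_after) - soft(q_before))
hard_jump = abs(hard(q_after) - hard(q_before))

print(soft_jump)  # tiny change in the blended output
print(hard_jump)  # the hard output flips from one value to the other
```

Away from the tie, the hard output does not respond to small query changes at all, which is the zero-gradient problem; the soft output always responds a little.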

Related cards
Builds on Score matrix and softmax along keys · Machine learning
Tasks
Question 1

Dot-product attention scores measure:

Hint

Skim the paragraph on what dot-product attention scores measure in "Differentiable key-value lookup view" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

If a query and a key are nearly orthogonal, their raw dot-product score is:

Hint

Skim the paragraph on nearly orthogonal queries and keys in "Differentiable key-value lookup view" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Gated attention variants multiply:

Hint

Skim the paragraph on gated attention variants in "Differentiable key-value lookup view" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

How does soft (softmax) attention differ from hard (argmax) attention?

Hint

Skim the paragraphs on how soft (softmax) attention differs from hard (argmax) attention in "Differentiable key-value lookup view" before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Advanced
Creator
Best
BestBuddy