Efficient attention approximations (concept map)

Created by Best · 01.06.2026 at 06:20 UTC

Full softmax attention costs $O(n^2)$ per layer in sequence length $n$. Once $n$ reaches tens of thousands of tokens, quadratic memory and compute dominate. Research explores approximations that trade fidelity for subquadratic cost.
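To make the quadratic term concrete, here is a rough sketch of the memory needed to store one $n \times n$ score matrix. It assumes the matrix is materialized naively at 2 bytes per element (for example fp16); the function name and the sequence lengths are illustrative choices, not figures from the card. Fused kernels can avoid storing the full matrix, although compute stays quadratic.

```python
def attention_matrix_bytes(n: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one n x n attention score matrix (one head, one layer)."""
    return n * n * bytes_per_element


# Illustrative sequence lengths (assumed, not taken from the card).
for n in (1_024, 8_192, 65_536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>6}: {gib:.3f} GiB per head")
```

Growing $n$ by a factor of 8 grows this matrix by a factor of 64, which is why long contexts run into memory limits before model size changes at all.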

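Linear attention methods replace the softmax with kernel feature maps, so attention can be computed as an associative scan in $O(n)$. Sparse patterns attend only to fixed or learned subsets of positions. Other variants compress the key-value sequence or approximate the attention matrix: Linformer projects keys and values to a shorter length, Performer approximates the softmax kernel with random features, and Nyström variants reconstruct the attention matrix from a small set of landmark points.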
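The kernel trick behind linear attention can be sketched in a few lines. The example below assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$, which is one common choice and not something the card specifies; the function names are hypothetical. The causal version keeps a running sum $S = \sum_j \phi(k_j) v_j^\top$ and a normalizer $z = \sum_j \phi(k_j)$, so each new token costs a constant amount of work in $n$. A quadratic reference is included only to check that both forms agree.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice for this sketch)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(Q, K, V):
    """O(n) causal attention using a running state instead of an n x n matrix."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                        # running sum of phi(k_j)
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = phi(q), phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


def causal_quadratic_reference(Q, K, V):
    """Same kernelized attention, computed with explicit pairwise weights."""
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        w = [sum(x * y for x, y in zip(fq, phi(K[j]))) for j in range(i + 1)]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(i + 1)) / total
                    for b in range(len(V[0]))])
    return out


# Tiny made-up inputs: 3 tokens, key dimension 2, value dimension 2.
Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fast = causal_linear_attention(Q, K, V)
slow = causal_quadratic_reference(Q, K, V)
```

Both functions compute the same kernelized attention, which is not the same thing as softmax attention; the gap between the two is the fidelity being traded away.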

State space models (S4, Mamba) use structured linear recurrent dynamics as an alternative sequence mixer, with $O(n)$ inference cost in sequence length on long sequences. Hybrid stacks combine attention in some layers with SSM or convolution in others.
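A minimal sketch of the linear recurrence behind these models, assuming a discrete, time-invariant SSM with a diagonal state matrix: $h_t = a \odot h_{t-1} + b\,x_t$, $y_t = c^\top h_t$. This is a simplification for illustration; S4 uses a structured state matrix with a discretization step, and Mamba makes the parameters depend on the input. The function names and parameter values are hypothetical.

```python
def ssm_scan(x, a, b, c):
    """Run the recurrence token by token: O(n) time, O(state size) memory."""
    h = [0.0] * len(a)
    ys = []
    for x_t in x:
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def ssm_conv(x, a, b, c):
    """Equivalent convolutional view with kernel k_j = sum_i c_i * a_i**j * b_i."""
    n = len(x)
    kernel = [sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
              for j in range(n)]
    return [sum(kernel[j] * x[t - j] for j in range(t + 1)) for t in range(n)]


# Impulse input with a one-dimensional state: the output is the kernel itself.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
print(impulse_response)  # [2.0, 1.0, 0.5, 0.25]
```

The recurrent form is what gives constant work per generated token, while the convolutional form is one way a time-invariant model of this kind can be trained in parallel.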

Frontier models still use full attention in many layers because aggressive sparsity struggles to match its global mixing without quality loss on some tasks. The choice depends on sequence length distribution, hardware, and evaluation budget.

When evaluating sparse or linear attention, measure perplexity on long documents, not only wall-clock speed. Approximations that look fine on short prompts can fail where global mixing matters.
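One way to make long-document degradation visible is to report perplexity per position bucket instead of a single average. The sketch below assumes you already have per-token negative log-likelihoods in nats from your own evaluation loop; the function name and the bucket size are hypothetical.

```python
import math


def perplexity_by_position(token_nlls, bucket_size):
    """Perplexity for consecutive position buckets of one document.

    token_nlls: per-token negative log-likelihoods in nats, in document order.
    """
    buckets = []
    for start in range(0, len(token_nlls), bucket_size):
        chunk = token_nlls[start:start + bucket_size]
        buckets.append(math.exp(sum(chunk) / len(chunk)))
    return buckets


# Made-up values: the model is confident early and much worse later on.
example = perplexity_by_position([0.0, 0.0, math.log(4), math.log(4)], bucket_size=2)
```

A curve that rises with position for the approximate model but stays flat for the full-attention baseline points to lost long-range information that an aggregate number would hide.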

Hybrid models may use full attention in early layers for global layout and sparse attention deeper in the stack; ablations should report quality per layer group, not only aggregate benchmarks.
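A hybrid schedule of this kind can be sketched as a list of per-layer attention masks. The example assumes a causal sliding window as the sparse pattern, which is only one of several options, and the helper names, layer counts, and window size are hypothetical.

```python
def causal_mask(n, window=None):
    """mask[i][j] is True where query i may attend to key j.

    window=None gives full causal attention; otherwise each query sees
    only the most recent `window` positions, itself included.
    """
    return [[j <= i and (window is None or i - j < window) for j in range(n)]
            for i in range(n)]


def hybrid_masks(num_layers, num_full_layers, n, window):
    """Full attention in the first num_full_layers, sliding window after that."""
    return [causal_mask(n, None if layer < num_full_layers else window)
            for layer in range(num_layers)]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


masks = hybrid_masks(num_layers=4, num_full_layers=2, n=8, window=3)
print([attended_pairs(m) for m in masks])  # [36, 36, 21, 21]
```

Grouping layers by mask type in this way also gives natural groups for the per-layer-group ablations mentioned above.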

Related cards
Builds on Masking bugs and leakage · Machine learning
Tasks
Question 1

Linear-attention methods reduce cost by:

Hint

Skim the paragraph on linear attention methods in Efficient attention approximations (concept map) before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

State-space models (S4, Mamba) mix sequences using:

Hint

Skim the paragraph on state space models (S4, Mamba) in Efficient attention approximations (concept map) before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Sparse-attention patterns:

Hint

Skim the paragraphs on sparse attention patterns in Efficient attention approximations (concept map) before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why do frontier models still use full attention in at least some layers despite its cost?

Hint

Skim the paragraph on why frontier models still use full attention in Efficient attention approximations (concept map) before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Advanced
  • Completed: 0 users
Creator
Best
BestBuddy