Inference systems: KV-cache and batching
Training processes full sequences in parallel; inference generates one token at a time. Recomputing keys and values for all prior tokens at every step would repeat work already done. A KV-cache stores the past $K_j, V_j$ projections, so each new token computes only its own query, key, and value and attends over the cached history.
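The caching idea can be illustrated with a minimal single-head sketch in plain Python. Everything here is a toy assumption made for illustration: the names `KVCache`, `decode_step`, and `recompute_step`, the list-of-lists weight matrices, and the absence of multiple heads, layers, and positional encoding. It only shows that appending one key/value pair per step gives the same output as re-projecting the whole history.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(x, W):
    """Toy linear projection: row vector x times matrix W (list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """One query attends over all given keys/values (scaled dot-product)."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Hypothetical cache: one list of keys and one of values per head."""
    def __init__(self):
        self.keys = []
        self.values = []

def decode_step(x, Wq, Wk, Wv, cache):
    """Project only the new token, append its K/V, attend over the cache."""
    q = project(x, Wq)
    cache.keys.append(project(x, Wk))
    cache.values.append(project(x, Wv))
    return attend(q, cache.keys, cache.values)

def recompute_step(xs, Wq, Wk, Wv):
    """Baseline without a cache: re-project every token seen so far."""
    q = project(xs[-1], Wq)
    keys = [project(x, Wk) for x in xs]
    values = [project(x, Wv) for x in xs]
    return attend(q, keys, values)
```

In this sketch the cached path does one key and one value projection per step, while the baseline does one per token in the history, so the saved work grows with context length.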
Memory grows linearly with context length: for $L$ layers, $h$ heads, sequence length $n$, and head dimension $d_h$, the cached keys and values hold $2 \cdot L \cdot h \cdot n \cdot d_h$ elements per batch item, where the factor 2 accounts for storing both keys and values.
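As a rough worked example of this scaling, the helper below turns the proportionality into bytes. The function name `kv_cache_bytes`, the explicit factor 2 for keys plus values, and the model shape in the usage lines are illustrative assumptions, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, num_heads, seq_len, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size; bytes_per_elem=2 assumes FP16/BF16 storage."""
    # Factor 2: one tensor for keys and one for values.
    elems = 2 * num_layers * num_heads * seq_len * head_dim * batch_size
    return elems * bytes_per_elem

# Hypothetical shape: 32 layers, 32 heads, head dimension 128, 4096 tokens.
size = kv_cache_bytes(num_layers=32, num_heads=32, seq_len=4096, head_dim=128)
print(size / 1024**3, "GiB per sequence")  # 2.0 GiB per sequence
```

Doubling either the context length or the batch size doubles this number, which is why cache memory rather than weights often limits how many requests fit on one GPU.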

Serving systems batch concurrent user requests, padding sequences to a common length. That trades latency (waiting to fill a batch) against throughput (amortizing each GPU kernel launch and weight read over more requests). PagedAttention and similar memory managers reduce fragmentation when many requests share one GPU.
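The padding step and its cost can be sketched in a few lines. The helpers `pad_batch` and `padding_waste` and the choice of `pad_id=0` are hypothetical; real serving stacks operate on tensors and often avoid padding altogether.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build an attention mask."""
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def padding_waste(sequences):
    """Fraction of batch slots occupied by padding rather than real tokens."""
    max_len = max(len(s) for s in sequences)
    real = sum(len(s) for s in sequences)
    return 1 - real / (max_len * len(sequences))

requests = [[5, 6, 7], [8], [9, 10]]
padded, mask = pad_batch(requests)
print(padded)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

In this toy batch a third of the slots are padding, which hints at why mixing very short and very long requests in one batch wastes compute and cache memory.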
Quantization (INT8/INT4 weights, FP8 activations) further cuts bandwidth during decode; quality loss depends on calibration data and whether KV caches stay in higher precision.
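To make the bandwidth/precision trade concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization using absmax scaling. This is one simple scheme chosen for illustration, not necessarily what any given serving stack uses; production methods typically quantize per channel or per group and pick scales from calibration data. The names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per element is bounded by half a quantization step.
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(worst <= scale / 2 + 1e-12)  # True
```

Each value is stored in one byte instead of two or four, and the error bound depends on the scale, so a single outlier widens the step for every other value. That is one reason calibration and finer-grained scales matter.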

Speculative decoding proposes several tokens with a small draft model, then verifies them in parallel with the target model, reducing wall-clock latency when drafts are accepted. These tricks sit on top of the same attention math; they change how often you pay the quadratic cost, not whether attention exists.
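The propose-then-verify loop can be sketched for the greedy case, where a draft token is accepted only if it equals the target model's own choice. The functions `speculative_step` and `generate` and the two toy "models" are hypothetical stand-ins. A real system verifies all positions in one batched forward pass of the target model (simulated here with a loop), and sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Return the tokens accepted in one draft/verify round (greedy variant)."""
    # The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))
    # The target model scores every proposed position plus one extra.
    target_choices = [target_next(prefix + proposal[:i]) for i in range(k + 1)]
    accepted = []
    for i in range(k):
        if proposal[i] != target_choices[i]:
            # First disagreement: keep the target's token and stop.
            accepted.append(target_choices[i])
            return accepted
        accepted.append(proposal[i])
    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_choices[k])
    return accepted

def generate(prefix, draft_next, target_next, num_tokens, k=4):
    """Generate num_tokens tokens and count target verification rounds."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < num_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prefix) + num_tokens], rounds

# Toy models: the target counts upward mod 10; the draft errs after token 2.
def target_model(tokens):
    return (tokens[-1] + 1) % 10

def draft_model(tokens):
    return 9 if tokens[-1] == 2 else (tokens[-1] + 1) % 10

tokens, rounds = generate([0], draft_model, target_model, num_tokens=12)
```

Under these assumptions the output is identical to plain greedy decoding with the target model, while the number of target rounds is smaller than the number of generated tokens whenever some drafts are accepted.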
Card Info
- Topic: Machine learning
- Difficulty: Intermediate