Inference systems: KV-cache and batching

Training processes full sequences in parallel; inference generates one token at a time. Recomputing keys and values for all prior tokens at every step would waste compute. A KV-cache stores the past $K_j, V_j$ projections, so each new token computes only its own query, key, and value, appends the latter two to the cache, and attends over the cached history.
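A minimal sketch of one cached decode step, assuming a single attention head and assuming the query, key, and value vectors for the new token have already been projected. The names `attend`, `KVCache`, and `decode_step` are illustrative, not taken from any library.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical per-head cache: one key and one value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Store the new token's K and V, then attend over everything cached so far."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)
```

Each call does work linear in the number of cached tokens, because the earlier keys and values are read back rather than recomputed.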

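Memory grows linearly with context length: for $L$ layers, $h$ heads, sequence length $n$, and head dimension $d_h$, the cached keys and values hold $2 \cdot L \cdot h \cdot n \cdot d_h$ elements per batch item (the factor 2 covers keys and values), so the footprint is proportional to $L \cdot h \cdot n \cdot d_h$.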
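To make the scaling concrete, the count can be turned into bytes. This is a rough sketch: the function name and the example model shape (32 layers, 32 heads, head dimension 128, 16-bit cache entries) are assumptions chosen for illustration, and it assumes every head keeps its own keys and values.

```python
def kv_cache_bytes(num_layers, num_heads, seq_len, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values; the leading 2 covers K and V."""
    return (2 * num_layers * num_heads * seq_len * head_dim
            * bytes_per_elem * batch_size)


# Hypothetical model shape with a 4096-token context and 2-byte entries.
size = kv_cache_bytes(num_layers=32, num_heads=32, seq_len=4096, head_dim=128)
print(size / 2**30, "GiB per sequence")  # prints: 2.0 GiB per sequence
```

Doubling either the context length or the batch size doubles this figure.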

Serving systems batch concurrent user requests, padding sequences to a common length. That trades latency (waiting to fill a batch) against throughput (amortizing each kernel launch and weight read across many requests). PagedAttention and similar memory managers reduce KV-cache fragmentation when many requests share one GPU.
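A small sketch of padded batching, assuming right-padding with a hypothetical pad token id of 0. The helper names are illustrative. The second function measures how many slots in the batch are padding, which is compute and memory spent on tokens nobody asked for.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def padding_fraction(sequences):
    """Fraction of slots in the padded batch that hold padding rather than tokens."""
    max_len = max(len(seq) for seq in sequences)
    total_slots = max_len * len(sequences)
    real_tokens = sum(len(seq) for seq in sequences)
    return (total_slots - real_tokens) / total_slots


batch, mask = pad_batch([[11, 12, 13], [21]])
# batch == [[11, 12, 13], [21, 0, 0]]
# mask  == [[1, 1, 1], [1, 0, 0]]
```

Batches that mix very short and very long requests push the padding fraction up, which is one reason memory management matters for shared GPUs.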

Quantization (INT8/INT4 weights, FP8 activations) further cuts memory bandwidth during decode; the quality loss depends on the calibration data and on whether the KV cache stays in higher precision.
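As one illustration of where the quality loss comes from, here is a sketch of symmetric per-tensor INT8 quantization using the maximum absolute value as the range. This scheme is an assumption picked for simplicity; it does not model calibration data or the INT4 and FP8 formats mentioned above, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; each entry is off by at most about scale / 2."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each stored value shrinks from 2 or 4 bytes to 1, at the cost of a rounding error bounded by roughly half the scale. A single outlier inflates the scale and with it the error on every other value.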

Speculative decoding proposes several tokens with a small draft model, then verifies them in parallel with the target model, reducing wall-clock latency when drafts are accepted. These tricks sit on top of the same attention math; they change how often you pay the quadratic cost, not whether attention exists.
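The propose-then-verify loop can be sketched with greedy decoding, where a drafted token is accepted only if it matches the target model's own greedy choice. This is a simplified variant under stated assumptions: both "models" are toy functions from a token list to the next token, the verification loop below stands in for what would be one parallel forward pass of the target model, and sampling-based acceptance rules are not shown.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Draft k tokens, keep the longest run the target agrees with, then add one target token."""
    drafted, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        drafted.append(token)
        context.append(token)

    accepted, context = [], list(prefix)
    for token in drafted:
        expected = target_next(context)
        if expected != token:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # every draft accepted: one bonus token
    return accepted


def generate(prefix, draft_next, target_next, k, max_new):
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]


# Toy models (hypothetical): the target counts upward mod 10,
# and the draft agrees except right after a 4.
def toy_target(context):
    return (context[-1] + 1) % 10


def toy_draft(context):
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


result = generate([0], toy_draft, toy_target, k=3, max_new=8)
# result == [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

With greedy verification the output is identical to decoding with the target model alone; the gain is that each step can emit up to `k + 1` tokens when the draft is right.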
