Inference systems: KV-cache and batching

Intermediate · Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Training processes full sequences in parallel; inference generates one token at a time. Recomputing keys and values for all prior tokens at every step would waste compute. A KV-cache stores the past key and value projections $K_j, V_j$, so each new token computes only its own query, key, and value, then attends over the cached history.
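To make the saving concrete, here is a minimal sketch of cached decoding for a single attention head, written in pure Python. The names `KVCache`, `matvec`, `attend`, and `recompute_step` are hypothetical helpers invented for this illustration, and the random weights stand in for a trained model. It only aims to show that the cached path produces the same output as recomputing every projection from scratch.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Keeps the K and V projections of every token seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token; earlier projections are reused.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

def recompute_step(xs, Wq, Wk, Wv):
    """Baseline without a cache: re-project every token at every step."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 4  # toy head dimension
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

cache = KVCache()
cached_outputs = [cache.step(x, Wq, Wk, Wv) for x in tokens]
baseline_outputs = [recompute_step(tokens[: t + 1], Wq, Wk, Wv)
                    for t in range(len(tokens))]
```

At step $t$ the baseline performs $t$ key and value projections, while the cached version performs one; both still attend over $t$ entries.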

Memory grows linearly with context length: for $L$ layers, $h$ heads, sequence length $n$, and head dimension $d_h$, the cache holds $2 \cdot L \cdot h \cdot n \cdot d_h$ elements per batch item (the factor 2 covers keys and values), so its size is proportional to $L \cdot h \cdot n \cdot d_h$.
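The formula can be turned into a quick back-of-the-envelope calculator. This is a sketch under stated assumptions: the function name `kv_cache_bytes` is hypothetical, the factor 2 accounts for storing both keys and values, and the model shape in the example (32 layers, 32 heads, head dimension 128, 2 bytes per element) is an illustrative configuration rather than a figure from this card.

```python
def kv_cache_bytes(layers, heads, seq_len, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `batch` sequences."""
    elements = 2 * layers * heads * seq_len * head_dim  # 2 = keys + values
    return elements * bytes_per_elem * batch

# Illustrative shape with 16-bit cache entries.
short_ctx = kv_cache_bytes(layers=32, heads=32, seq_len=4096, head_dim=128)
long_ctx = kv_cache_bytes(layers=32, heads=32, seq_len=8192, head_dim=128)

print(short_ctx / 1024**3, "GiB at 4096 tokens")
print(long_ctx / 1024**3, "GiB at 8192 tokens")
```

Doubling `seq_len` doubles the result while the model weights stay the same size, which is why long contexts and large batches are often limited by cache memory rather than by parameters.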

Serving systems batch concurrent user requests, padding sequences to a common length where necessary. Batching trades latency (waiting to fill a batch) against throughput (amortizing each GPU kernel launch and weight read over more requests). PagedAttention and similar memory managers reduce KV-cache fragmentation when many requests share one GPU.
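The cost of padding can be seen in a small sketch. The helpers `pad_batch` and `padding_waste` are hypothetical names for this illustration, and `pad_id=0` is an assumed padding token. The example pads three requests of different lengths and measures the fraction of batch slots that hold padding instead of real tokens.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token lists to a common length and build an attention mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def padding_waste(sequences):
    """Fraction of slots in the padded batch that are padding."""
    max_len = max(len(s) for s in sequences)
    real_tokens = sum(len(s) for s in sequences)
    return 1 - real_tokens / (max_len * len(sequences))

requests = [[11, 12, 13], [21, 22, 23, 24, 25], [31, 32]]
padded, mask = pad_batch(requests)
waste = padding_waste(requests)
```

Here 5 of the 15 slots are padding. Grouping requests of similar length, or managing cache memory in blocks as PagedAttention does, are ways to reduce that kind of waste.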

Quantization (INT8/INT4 weights, FP8 activations) further cuts memory bandwidth during decode; quality loss depends on the calibration data and on whether the KV cache stays in higher precision.
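As a rough illustration of the idea, the sketch below applies symmetric per-tensor INT8 quantization, taking the range from the absolute maximum of the values. This is one simple scheme chosen for clarity, not necessarily what a given serving stack uses; production systems typically pick ranges from calibration data and often quantize per channel or per group. The function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.02, -1.27, 0.635, 0.4, -0.003]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in 1 byte instead of 2 or 4, and the rounding error per value is at most half of `scale`. A single large outlier inflates `scale` for the whole tensor, which is one reason the choice of range matters for quality.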

Speculative decoding proposes several tokens with a small draft model, then verifies them in parallel with the target model, which reduces wall-clock latency when most drafted tokens are accepted. These techniques sit on top of the same attention math: they change how often you pay attention's quadratic cost, not whether attention is computed.
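The propose-then-verify loop can be sketched with toy stand-ins for the two models. Everything here is an assumption made for illustration: `target_next` and `draft_next` are hypothetical deterministic functions rather than neural networks, and the acceptance rule is the simple greedy variant (keep drafted tokens while they match the target's own greedy choice), not the rejection-sampling rule used when sampling. The target is also called once per position in a Python loop, whereas a real system would score all drafted positions in one batched forward pass.

```python
def target_next(context):
    """Toy 'target model': a deterministic next token for a context."""
    return (sum(context) * 31 + len(context)) % 7

def draft_next(context):
    """Toy 'draft model': agrees with the target except on some contexts."""
    guess = target_next(context)
    return guess if len(context) % 3 else (guess + 1) % 7

def greedy_decode(prompt, num_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, num_new, k=4):
    """Greedy speculative decoding; returns tokens and verification passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. The draft model proposes k tokens one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target scores every drafted position plus one extra.
        #    Counted as a single pass, as it would be when batched.
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        # 3. Keep the longest matching prefix, then one target token.
        accepted = 0
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1
        tokens.extend(draft[:accepted])
        tokens.append(verified[accepted])
    return tokens[: len(prompt) + num_new], target_passes

prompt = [1, 2, 3]
spec_tokens, passes = speculative_decode(prompt, num_new=12)
base_tokens = greedy_decode(prompt, num_new=12)
```

Under this greedy rule the output is identical to decoding with the target alone; the gain is that `passes` is smaller than the 12 sequential target calls the baseline needs, and it shrinks further as the draft agrees more often.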

Related cards
Builds on Training stability at scale · Machine learning
Tasks
Question 1

A KV-cache stores:

Hint

Skim the paragraph on what a KV-cache stores in Inference systems before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Batching concurrent user requests trades:

Hint

Skim the paragraph on what batching concurrent user requests trades off in Inference systems before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Speculative decoding speeds up generation by:

Hint

Skim the paragraph on how speculative decoding speeds up generation in Inference systems before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why does KV-cache memory grow with context length for a fixed model size?

Hint

Skim the paragraph on why KV-cache memory grows with context length in Inference systems before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Intermediate
  • Completed: 0 users
Creator
Best