Positional encodings and length generalization

Attention scores depend on content vectors alone unless you tell the model where each token sits in the sequence. Positional encodings break this permutation symmetry by injecting order information, either into the embeddings before attention runs or into the attention computation itself.
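The permutation symmetry can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and no mask; the function name `self_attention` and the sizes are illustrative, not taken from any particular library.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 5, 8                                   # illustrative sizes
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)

# Shuffling the input tokens only shuffles the output rows the same way,
# so the layer by itself cannot tell one ordering from another.
order_blind = np.allclose(out[perm], out_perm)
```

Here `order_blind` comes out true: without positional information, reordering the tokens changes nothing except the order of the outputs.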

Classic sinusoidal encodings use sines and cosines at multiple frequencies so each position gets a unique vector with smooth relative structure. Learned absolute embeddings store one trainable vector per integer position index, which is simple but limited when inference exceeds the trained context length. RoPE (rotary position embedding) rotates pairs of dimensions in query and key space to encode relative offsets; many modern LLMs use it.
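A small sketch of the sinusoidal table and of RoPE may help here. It assumes an even dimension, the conventional base of 10000, and rotation of consecutive dimension pairs; some implementations pair the first half of the vector with the second half instead. The function names are illustrative.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim, base=10000.0):
    """Sinusoidal table: even columns hold sines, odd columns hold cosines."""
    positions = np.arange(num_positions)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per pair
    angles = positions * freqs
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of a 1-D vector by position-dependent angles."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

table = sinusoidal_encoding(num_positions=16, dim=8)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (4) at two different absolute locations.
score_near = rope(q, 3) @ rope(k, 7)
score_far = rope(q, 103) @ rope(k, 107)
```

The two scores agree up to floating-point error, which is the sense in which RoPE encodes relative offsets: the query-key dot product depends on the distance between the two positions, not on where the pair sits.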

ALiBi adds a distance-based bias to attention logits before softmax, penalizing far-apart positions without an explicit embedding table. Each scheme trades off extrapolation beyond training length, compute cost, and inductive bias.
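The ALiBi bias can be sketched as follows. This assumes causal attention and the geometric slope schedule commonly used when the head count is a power of two; the function names and shapes are illustrative.

```python
import numpy as np

def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/h), 2^(-16/h), ... (assumes num_heads is a power of two)."""
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(seq_len, num_heads):
    """Return a (heads, seq, seq) additive bias for causal attention logits."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]            # query index minus key index
    bias = -alibi_slopes(num_heads)[:, None, None] * distance
    # Future keys (negative distance) are masked out entirely.
    return np.where(distance[None] >= 0, bias, -np.inf)

bias = alibi_bias(seq_len=4, num_heads=8)
# Typical use: logits = q @ k.T / sqrt(d) + bias[head], followed by softmax.
```

Each head gets its own slope, so some heads decay quickly with distance while others stay nearly flat. No table is indexed by position, so the same function produces a bias for any `seq_len`.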

In deployment, context truncation interacts with positional schemes: when chat history exceeds the window, the earliest tokens are dropped. Absolute tables must remap indices or accept a discontinuity at the cut. Sliding-window attention caps how far each layer can attend, trading single-layer global mixing for a stable local bias.
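Both mechanisms can be sketched in a few lines. This is an illustration under simple assumptions: the history is a flat list of token ids, `max_len` is positive, and positions are renumbered from zero after the cut. The helper names are hypothetical.

```python
import numpy as np

def truncate_history(token_ids, max_len):
    """Keep only the most recent max_len tokens (assumes max_len > 0)."""
    return token_ids[-max_len:]

def sliding_window_mask(seq_len, window):
    """Boolean (seq, seq) mask: each query sees itself and the window - 1 tokens before it."""
    pos = np.arange(seq_len)
    offset = pos[:, None] - pos[None, :]   # query index minus key index
    return (offset >= 0) & (offset < window)

history = list(range(10))                  # stand-in token ids
kept = truncate_history(history, max_len=4)

# Remapping: surviving tokens are renumbered from 0, so a token that sat at
# index 6 before the cut is now presented to the model at position 0.
new_positions = np.arange(len(kept))

mask = sliding_window_mask(seq_len=5, window=2)
```

With a window of 2, one layer lets each token see only itself and its immediate predecessor. Information from further back has to travel through several stacked layers.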

Quality usually degrades past the trained context length unless architecture and training explicitly target length generalization. Long-context research (YaRN, position interpolation, sliding windows) is largely about making positional structure stable as the sequence length $n$ grows.
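Position interpolation is the simplest of these ideas to sketch: positions are scaled down so that a longer sequence still falls inside the range seen during training. The snippet below assumes plain linear scaling and a hypothetical helper name. YaRN adjusts frequencies in a more elaborate way and is not shown.

```python
import numpy as np

def interpolated_positions(seq_len, trained_len):
    """Linearly squeeze positions into [0, trained_len) when seq_len exceeds it."""
    scale = min(1.0, trained_len / seq_len)   # never stretch shorter sequences
    return np.arange(seq_len) * scale

short = interpolated_positions(seq_len=3, trained_len=4)   # left unchanged
long = interpolated_positions(seq_len=8, trained_len=4)    # compressed by 0.5
```

The resulting fractional positions can be fed to a scheme such as RoPE, which accepts non-integer positions. Models typically need some fine-tuning at the new scale to work well.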

When benchmarking long-context models, report both retrieval accuracy at the context frontier and latency per token; positional failures often appear only at the far end of the window where training saw few examples.
