Positional encodings and length generalization

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Attention scores depend on content vectors alone unless you tell the model where each token sits in the sequence. Positional encodings break this permutation symmetry by injecting order information, either into the embeddings before attention runs or into the attention computation itself.
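The permutation symmetry can be checked directly. The sketch below is a minimal single-head dot-product attention in plain Python with no positional signal; the toy embeddings and the permutation are arbitrary values chosen for illustration. Shuffling the input tokens only shuffles the outputs in the same way, so the layer cannot tell the two orderings apart.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head dot-product attention with no positional information."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

# Arbitrary toy embeddings, used as queries, keys and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = attention(tokens, tokens, tokens)
out_shuffled = attention(shuffled, shuffled, shuffled)
# out_shuffled[j] matches out[perm[j]] up to floating-point rounding.
```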

Classic sinusoidal encodings use sines and cosines at multiple frequencies so each position gets a unique vector with smooth relative structure. Learned absolute embeddings store one trainable vector per integer position index; they are simple, but they have no vector for positions beyond the trained context length. RoPE (rotary position embedding) rotates pairs of dimensions in query and key space so that attention scores encode relative offsets; many modern LLMs use it.
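A minimal sketch of the sinusoidal and rotary schemes, assuming the common base of 10000 and a consecutive-pair layout (real implementations differ in how they pair dimensions). The query and key values are arbitrary. The point to notice is that the rotary score depends only on the offset between the two positions, not on where the pair sits in the sequence.

```python
import math

def sinusoidal_encoding(pos, dim, base=10000.0):
    """Interleaved sin/cos features for one position (dim assumed even)."""
    enc = []
    for i in range(dim // 2):
        angle = pos / (base ** (2 * i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair (x, y) of vec by the angle pos * theta_i."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = pos * base ** (-2 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary query/key vectors for illustration.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, so the two scores agree
# up to floating-point rounding.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```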

ALiBi adds a distance-based bias to attention logits before softmax, penalizing far-apart positions without an explicit embedding table. Each scheme trades off extrapolation beyond the training length, compute cost, and inductive bias.
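The ALiBi bias can be sketched in a few lines. This version assumes causal attention and a power-of-two head count, for which the ALiBi paper uses a geometric sequence of per-head slopes; the function names are illustrative, not taken from any library.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for n = num_heads (power of two assumed)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: entry [i][j] is -slope * (i - j) for j <= i, -inf for future keys."""
    neg_inf = float("-inf")
    return [[-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)          # one slope per head
bias = alibi_bias(4, slopes[0])   # added to the logits of head 0 before softmax
```

Because the penalty is a function of distance only, the same rule applies unchanged to sequence lengths never seen in training.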

In deployment, context truncation interacts with positional schemes: when chat history exceeds the window, the earliest tokens drop out. Absolute tables must remap indices or accept a discontinuity at the cut. Sliding-window attention caps how far each layer can attend, trading single-layer global mixing for a stable local bias.
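A small sketch of both ideas, with hypothetical helper names and stand-in token ids: truncating history to the most recent tokens (then re-indexing positions from zero, which is the remapping an absolute table needs), and building a causal sliding-window mask.

```python
def truncate_history(token_ids, window):
    """Keep only the most recent `window` tokens; the earliest ones drop out."""
    return token_ids[-window:] if window > 0 else []

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, at most `window` keys)."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

history = list(range(10))              # stand-in token ids 0..9
kept = truncate_history(history, 4)    # the last four tokens survive
positions = list(range(len(kept)))     # positions re-indexed from 0 after the cut
mask = sliding_window_mask(5, 2)       # each query sees itself and one earlier key
```

With a window of `w` keys per layer, information can still travel further across stacked layers, since each layer extends the reach of the one below it.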

Models generally fail to extrapolate to sequences longer than those seen in training unless architecture and training explicitly target length generalization. Long-context research (YaRN, position interpolation, sliding windows) is largely about making positional structure stable as the sequence length $n$ grows.
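Position interpolation, one of the techniques named above, is the simplest to sketch: position indices are rescaled linearly so that a longer sequence still maps into the range seen during training. The function name and the lengths below are illustrative assumptions; YaRN applies a more elaborate, frequency-dependent scaling that is not shown here.

```python
def interpolate_positions(seq_len, train_len):
    """Linearly rescale indices so they stay within the trained range [0, train_len)."""
    scale = min(1.0, train_len / seq_len)  # never stretch short sequences
    return [p * scale for p in range(seq_len)]

# A sequence twice the trained length gets half-integer positions.
positions = interpolate_positions(8, 4)
unchanged = interpolate_positions(3, 4)
```

The resulting fractional positions would be passed to a rotary or sinusoidal function in place of integer indices, usually followed by a short fine-tuning run.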

When benchmarking long-context models, report both retrieval accuracy at the context frontier and latency per token; positional failures often appear only at the far end of the window where training saw few examples.
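One way to make far-end failures visible is to bucket retrieval results by where the target sat in the context. The helper names are hypothetical, and the records below are made-up placeholders that only exercise the functions; they are not measurements.

```python
def accuracy_by_depth(records, context_len, num_buckets):
    """records: iterable of (target_position, is_correct).

    Returns one accuracy per depth bucket, or None for an empty bucket.
    """
    hits = [0] * num_buckets
    totals = [0] * num_buckets
    for position, correct in records:
        bucket = min(num_buckets - 1, position * num_buckets // context_len)
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return [h / t if t else None for h, t in zip(hits, totals)]

def latency_per_token(elapsed_seconds, tokens_generated):
    """Average wall-clock seconds spent per generated token."""
    return elapsed_seconds / tokens_generated

# Placeholder records (position of the target, whether it was retrieved).
records = [(10, True), (20, True), (500, True), (900, False), (990, True)]
report = accuracy_by_depth(records, context_len=1000, num_buckets=4)
```

Reporting the last bucket separately keeps a weak context frontier from being averaged away by strong results near the start of the window.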

Tasks
Question 1

Rotary position embeddings (RoPE) encode position by:

Hint

Skim the paragraph on rotary position embeddings (RoPE) in "Positional encodings and length generalization" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Learned absolute positional embeddings:

Hint

Skim the paragraph on learned absolute positional embeddings in "Positional encodings and length generalization" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

ALiBi biases attention scores based on:

Hint

Skim the paragraph on ALiBi attention biases in "Positional encodings and length generalization" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why do extremely long contexts strain GPU memory faster than linearly in $n$?

Hint

Skim the paragraphs on why extremely long contexts strain GPU memory in "Positional encodings and length generalization" before choosing. Eliminate options that contradict a definition stated in the card.
