Autoregressive core loop: predict the next token

Large language models treat text as a sequence of discrete tokens (subwords or bytes). At each position the model outputs a distribution over the vocabulary; training pushes probability mass onto the next token observed in the corpus, typically by minimizing cross-entropy computed from the logits.
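To make the training objective concrete, here is a minimal sketch of cross-entropy for a single position, assuming a hypothetical 4-token vocabulary and made-up logit values. The function names are illustrative and not taken from any particular library.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability the model assigns to the observed next token."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]

# Hypothetical logits over a 4-token vocabulary; the observed next token has ID 2.
logits = [1.0, 0.5, 3.0, -1.0]
probs = softmax(logits)
loss = cross_entropy(logits, target_id=2)
```

The loss shrinks as the logit of the observed token rises relative to the others, which is what "pushing probability mass" onto that token means in practice.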

Generation repeats the loop: sample a token from the distribution (or take its argmax), append it to the context, and run another forward pass. This autoregressive structure is the same idea used in character-level models; only the vocabulary and depth have grown.
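The loop can be sketched as follows. This is an illustration only: the "model" is a hypothetical hand-written bigram table that looks at just the last token, whereas a real LLM conditions on the whole context. The names `BIGRAM`, `next_token_distribution`, and `generate` are assumptions made for this example.

```python
import random

# Hypothetical stand-in for a trained model: last token -> next-token distribution.
BIGRAM = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(context):
    """One 'forward pass': return a distribution over the next token."""
    return BIGRAM[context[-1]]

def generate(context, max_new_tokens=10, greedy=True, rng=None):
    rng = rng or random.Random(0)
    context = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        if greedy:
            token = max(dist, key=dist.get)  # argmax decoding
        else:
            tokens, weights = zip(*dist.items())
            token = rng.choices(tokens, weights=weights, k=1)[0]  # sampling
        context.append(token)  # the new token becomes part of the context
        if token == "</s>":  # stop at the end-of-sequence marker
            break
    return context

greedy_output = generate(["<s>"])
```

Each iteration feeds the extended context back in, which is the only coupling between steps.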

A tokenizer maps Unicode text to token IDs; detokenization maps back for display. Context length limits how many prior tokens can influence the current position through attention or recurrence.
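As a rough illustration, the simplest possible tokenizer treats each UTF-8 byte as one token. The sketch below assumes that byte-level scheme (production tokenizers usually use learned subword vocabularies) and shows how a fixed context length drops older tokens. The helper names are hypothetical.

```python
def encode(text):
    """Byte-level tokenization: each UTF-8 byte becomes one token ID in 0..255."""
    return list(text.encode("utf-8"))

def decode(token_ids):
    """Detokenization: map IDs back to text; invalid byte runs become U+FFFD."""
    return bytes(token_ids).decode("utf-8", errors="replace")

def truncate_to_context(token_ids, context_length):
    """Keep only the most recent `context_length` tokens."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return token_ids[-context_length:]

ids = encode("naïve café")           # non-ASCII characters take two bytes each
window = truncate_to_context(ids, 4)  # only the last 4 tokens stay visible
```

Note that truncating at an arbitrary byte boundary can split a multi-byte character, which is one reason `decode` uses `errors="replace"` here.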

At inference, temperature rescales logits before softmax: low temperature sharpens the distribution (more deterministic), high temperature flattens it (more random). Top-$p$ (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches $p$, discards the remaining low-probability tail, and renormalizes before sampling.
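Both controls fit in a few lines. The sketch below is one common way to implement them, assuming temperature divides the logits and that nucleus filtering keeps the smallest high-probability set whose mass reaches `p` and then renormalizes. Function names and example values are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Zero out the tail beyond cumulative mass p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample(logits, temperature=1.0, p=0.9, rng=None):
    """Draw one token ID after temperature scaling and nucleus filtering."""
    rng = rng or random.Random(0)
    probs = top_p_filter(softmax_with_temperature(logits, temperature), p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

With these logits, the top token gets more probability at temperature 0.5 than at 2.0, matching the sharpening and flattening described above.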

Pre-training therefore optimizes next-token predictive accuracy over massive text, not explicit human preference labels.

Each forward pass produces a vector of logits with length equal to vocabulary size, often tens of thousands of entries. The model reads context left to right during causal pretraining, so position $t$ never sees tokens after $t$.
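The causal constraint can be written down directly. The sketch below, with hypothetical helper names, builds the lower-triangular visibility mask and the (context, target) pairs that a single training sequence yields under that constraint.

```python
def causal_mask(seq_len):
    """mask[t][s] is True when position t may use position s, i.e. when s <= t."""
    return [[s <= t for s in range(seq_len)] for t in range(seq_len)]

def training_pairs(token_ids):
    """Each prefix ending at position t is paired with the token at t + 1."""
    return [
        (token_ids[: t + 1], token_ids[t + 1])
        for t in range(len(token_ids) - 1)
    ]

mask = causal_mask(3)
pairs = training_pairs([5, 6, 7])  # hypothetical token IDs
```

A sequence of n tokens therefore supplies n - 1 prediction targets, and no target ever appears inside its own context.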
