Autoregressive core loop: predict the next token

Intermediate · Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Large language models treat text as a sequence of discrete tokens (subwords or bytes). At each position the model outputs a probability distribution over the vocabulary; training pushes probability mass onto the next token observed in the corpus, typically by minimizing a cross-entropy loss computed from the logits.
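The effect of the training signal can be illustrated with a minimal sketch. The four-token vocabulary, the logit values, and the target ID below are hypothetical, chosen only to show that raising the logit of the observed next token lowers the cross-entropy.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability the model assigns to the observed next token."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

# Hypothetical 4-token vocabulary; the observed next token has ID 2.
logits = [1.0, 0.5, 3.0, -1.0]
loss_before = cross_entropy(logits, target_id=2)

# Pushing probability mass onto the observed token (a higher logit) lowers the loss.
logits[2] += 1.0
loss_after = cross_entropy(logits, target_id=2)

print(loss_before > loss_after)
```

In real training the logits are not edited directly; gradient descent adjusts the model weights so that the logits move in this direction.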

Generation repeats the loop: sample or argmax a token, append it to the context, run another forward pass. This autoregressive structure is the same idea used in character-level models; only the vocabulary and the network depth have grown.
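The loop can be sketched as follows. This is an illustration under stated assumptions, not a real model: `toy_model`, `VOCAB_SIZE`, and `EOS_ID` are hypothetical stand-ins, and the "forward pass" simply favors the token whose ID follows the last one in the context.

```python
import math
import random

VOCAB_SIZE = 5
EOS_ID = 0  # hypothetical end-of-sequence token ID

def toy_model(context):
    """Stand-in for a forward pass: returns one logit per vocabulary entry."""
    logits = [0.0] * VOCAB_SIZE
    logits[(context[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens, greedy=True, rng=None):
    """Autoregressive loop: pick a token, append it, run the model again."""
    rng = rng or random.Random(0)
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(context)
        if greedy:
            # Argmax decoding: always take the highest-scoring token.
            next_id = max(range(VOCAB_SIZE), key=lambda i: logits[i])
        else:
            # Sampling: draw a token in proportion to its probability.
            next_id = rng.choices(range(VOCAB_SIZE), weights=softmax(logits))[0]
        context.append(next_id)
        if next_id == EOS_ID:
            break
    return context

print(generate([1], max_new_tokens=10))
```

Each iteration feeds the whole context back into the model, which is why generation cost grows with the number of tokens produced.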

A tokenizer maps Unicode text to token IDs; detokenization maps the IDs back to text for display. Context length limits how many prior tokens can influence the current position through attention or recurrence.
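As a rough illustration, the simplest possible tokenizer uses UTF-8 bytes as token IDs. The sketch below assumes that byte-level scheme (production tokenizers usually learn subword vocabularies instead), and `truncate_to_context` is a hypothetical helper showing one way a context limit can be enforced.

```python
def encode(text):
    """Map Unicode text to token IDs; here each UTF-8 byte is one token (0-255)."""
    return list(text.encode("utf-8"))

def decode(token_ids):
    """Map token IDs back to text for display."""
    return bytes(token_ids).decode("utf-8", errors="replace")

def truncate_to_context(token_ids, context_length):
    """Keep only the most recent tokens that fit in the context window."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return token_ids[-context_length:]

ids = encode("héllo")
print(ids)          # 'é' takes two bytes, so there are 6 IDs for 5 characters
print(decode(ids))
print(truncate_to_context(ids, 3))
```

Tokens dropped by the truncation can no longer influence later predictions, which is the practical meaning of the context limit.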

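At inference, temperature rescales logits before softmax by dividing them by the temperature: low temperature (below 1) sharpens the distribution (more deterministic), high temperature (above 1) flattens it (more random). Top-$p$ (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches $p$, discards the remaining low-probability tail, and renormalizes before sampling.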
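Both controls can be sketched in a few lines. The function names and example values here are illustrative assumptions, not a specific library's API, and the temperature is assumed to be strictly positive.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize the surviving tokens
    return filtered

def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Draw one token ID after temperature scaling and nucleus truncation."""
    rng = rng or random.Random(0)
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0] > flat[0])
```

Tokens outside the nucleus receive probability zero, so they can never be sampled regardless of the random draw.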

Pretraining therefore optimizes next-token prediction (the cross-entropy of the observed next token) over massive text corpora, not explicit human preference labels.
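Written out, the objective is commonly expressed as an average negative log-likelihood. The symbols are notational assumptions introduced here: $x_1, \dots, x_T$ are the tokens of a training sequence, $x_{<t}$ is the prefix before position $t$, and $p_\theta$ is the model's predicted distribution with parameters $\theta$.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Nothing in this expression refers to human ratings; the only supervision is the text itself.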

For each position, a forward pass produces a vector of logits whose length equals the vocabulary size, often tens of thousands of entries. The model reads context left to right during causal pretraining, so the prediction at position $t$ never sees tokens after $t$.
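In attention-based models this restriction is usually enforced with a causal mask. The sketch below is a simplified illustration with hypothetical helper names and made-up scores: disallowed (future) positions get a score of negative infinity, so they receive zero weight after softmax.

```python
import math

def causal_mask(seq_len):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(seq_len)] for t in range(seq_len)]

def masked_attention_weights(scores, mask_row):
    """Softmax over one row of scores, with masked positions forced to zero."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, mask_row)]
    m = max(masked)  # finite, because a position can always attend to itself
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Hypothetical raw attention scores for position t = 1 over positions 0..2.
weights = masked_attention_weights([0.2, 1.5, 9.0], mask[1])
print(weights)
```

The large score at position 2 has no effect on the result, because that position lies in the future relative to $t = 1$.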

Tasks
Question 1

During pretraining, the model's logits are trained to match:

Hint

Reread the paragraph in this card on what the logits are trained toward during pretraining before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

The context-window length limits:

Hint

Reread the paragraph in this card on what context length limits before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

A tokenizer maps text to:

Hint

Reread the paragraph in this card on what a tokenizer maps text to before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

What is being optimized during base-model pretraining?

Hint

Reread the paragraph in this card on what pretraining optimizes before choosing. Eliminate options that contradict a definition stated in the card.
