Bridge to transformer mechanisms

Before transformers, language models often used RNNs: a hidden state is updated serially, one time step at a time, so everything the model remembers must pass through that single state, which becomes a bottleneck for very long contexts. Transformers replace much of that recurrence with self-attention, which mixes tokens in parallel within each layer and relies on positional encodings to retain order information.
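To make the serial dependency concrete, here is a minimal sketch of a scalar toy RNN. The function name `rnn_scan` and the weights `w_x` and `w_h` are illustrative assumptions, not taken from any particular model; the point is only that step t cannot start until step t-1 has finished.

```python
import math


def rnn_scan(inputs, w_x=0.5, w_h=0.9, h0=0.0):
    """Run a scalar toy RNN over a sequence (hypothetical weights).

    Each new hidden state depends on the previous one, so the loop
    cannot be parallelized across time steps.
    """
    h = h0
    states = []
    for x in inputs:
        # h_t = tanh(w_x * x_t + w_h * h_{t-1})
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


# A single impulse at step 0, then silence: the only trace of the first
# input is whatever survives in the hidden state, and it fades each step.
states = rnn_scan([1.0, 0.0, 0.0, 0.0])
print(states)
```

The fading values illustrate the bottleneck described above: information from early tokens reaches later steps only through the repeatedly squashed hidden state.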

Naive full self-attention compares every token to every other token in a layer, giving $O(n^2)$ time and memory in sequence length $n$. That cost motivates the sparse and linear attention variants, and the state-space alternatives, explored in research models.
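The quadratic term is easiest to see in code. The following is a minimal sketch, assuming a single head with no learned projections (queries, keys, and values are all the raw token vectors) and using only the standard library; the name `naive_self_attention` is hypothetical. Real implementations use batched tensor operations, but the n-by-n score matrix is the same.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def naive_self_attention(x):
    """Single-head self-attention with Q = K = V = x (a simplification).

    x is a list of n token vectors, each of length d.
    Returns the mixed outputs and the n x n attention weights.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    # Every token is compared with every token: n * n dot products.
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all value vectors.
    out = [
        [sum(w * v[c] for w, v in zip(w_row, x)) for c in range(d)]
        for w_row in weights
    ]
    return out, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = naive_self_attention(tokens)
n_scores = sum(len(row) for row in weights)
print(n_scores)  # n * n entries for n tokens
```

Doubling the number of tokens quadruples the number of entries in `weights`, which is the $O(n^2)$ growth in both compute and memory.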

Hybrid stacks selectively combine attention with convolutions, recurrence, or state-space blocks. Later chapters in the playlist work through the query-key-value maps, masks, and transformer blocks that make the autoregressive loop practical at scale.

Understanding the LLM story at a high level sets up why attention, rather than deeper RNN unrolling, became the default long-range mixer.

Chapter 6 opens the transformer block: query-key-value attention, positional encodings, residual streams, and the engineering tricks that make billion-parameter training stable.

Attention will reappear as the central mechanism for mixing token information; the LLM overview you just finished supplies motivation for why that mechanism replaced recurrence at scale.

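Attention cost is also why serving long contexts is expensive. Caching keys and values avoids recomputing them for earlier tokens at each generation step, but the cache itself grows linearly with sequence length, and every new token still attends over the entire cached context.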
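A back-of-the-envelope estimate shows how quickly that cache grows. This is a sketch under stated assumptions: the helper `kv_cache_bytes` is hypothetical, and the configuration (32 layers, 32 key-value heads, head dimension 128, 2-byte values) is an illustrative example rather than the specification of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Estimate key-value cache size in bytes (illustrative formula).

    Each layer stores one key tensor and one value tensor, hence the
    leading factor of 2. Size is linear in seq_len.
    """
    return (2 * batch * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value)


# Hypothetical configuration, chosen only for round numbers.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **config)
at_4k = kv_cache_bytes(4096, **config)
at_8k = kv_cache_bytes(8192, **config)

print(per_token / 2**20, "MiB per token")
print(at_4k / 2**30, "GiB at 4096 tokens")
print(at_8k / 2**30, "GiB at 8192 tokens")
```

Under these assumptions each token costs half a mebibyte of cache, so doubling the context doubles the memory per sequence before any model weights are counted.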
