Bridge to transformer mechanisms

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Before transformers, language models often used RNNs, which update a hidden state serially along the time axis. Compressing everything seen so far into that single state creates a bottleneck for very long contexts. Transformers replace much of that recurrence with self-attention, which mixes tokens in parallel within each layer and relies on positional encodings to supply order information.
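To make the serial dependency concrete, here is a minimal sketch of a toy scalar RNN. The function name `rnn_final_state` and the weights `w_h` and `w_x` are illustrative assumptions, not part of any library or of this card's later chapters.

```python
import math

def rnn_final_state(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    w_h and w_x are hypothetical fixed weights chosen for illustration.
    """
    h = 0.0
    for x in inputs:
        # Step t needs the h produced by step t-1, so the loop cannot be
        # parallelized across time steps.
        h = math.tanh(w_h * h + w_x * x)
    # The whole sequence is summarized in this single number.
    return h
```

However long `inputs` is, everything the model remembers has to pass through `h`, which is the bottleneck described above.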

Naive full self-attention compares every token to every other token in a layer, giving $O(n^2)$ time and memory in sequence length $n$. That cost motivates sparse and linear attention approximations, as well as state-space alternatives, in research models.
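The quadratic term can be seen directly in a small sketch of naive self-attention. This is a simplified illustration in pure Python: it assumes the token vectors themselves serve as queries, keys, and values (no learned projections), and the names `softmax` and `naive_self_attention` are hypothetical helpers.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def naive_self_attention(x):
    """x: list of n token vectors, each a list of d floats.

    Assumption for illustration: x is used as queries, keys, and values.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    # n x n score matrix: one dot product per (query, key) pair.
    scores = [[scale * sum(qi * ki for qi, ki in zip(q, k)) for k in x]
              for q in x]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all n value vectors.
    out = [[sum(w * v[j] for w, v in zip(wrow, x)) for j in range(d)]
           for wrow in weights]
    return out, weights
```

The `scores` matrix holds $n \times n$ entries, so doubling the sequence length quadruples both the number of dot products and the memory needed to store them.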

Hybrid stacks selectively combine attention with convolutions, recurrence, or state-space blocks. Later chapters in the playlist work through the mechanics of query-key-value maps, masks, and the transformer blocks that make the autoregressive loop practical at scale.

Understanding the LLM story at a high level explains why attention, rather than deeper RNN unrolling, became the default long-range mixer.

Chapter 6 opens the transformer block: query-key-value attention, positional encodings, residual streams, and the engineering tricks that make billion-parameter training stable.

Attention will reappear as the central mechanism for mixing token information; the LLM overview you just finished supplies the motivation for why that mechanism replaced recurrence at scale.

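Quadratic attention cost is one reason serving long contexts is expensive. Caching keys and values avoids recomputing them for earlier tokens during generation, but the cache itself grows linearly with sequence length, so memory remains a constraint even when compute is optimized.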
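A rough back-of-the-envelope sketch shows how cache memory scales. The function `kv_cache_bytes` and the model shape used in the usage note (32 layers, 32 key/value heads, head dimension 128, 2 bytes per value) are hypothetical assumptions for illustration, not figures from this card.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate key-value cache size for one sequence, in bytes.

    Assumes one key tensor and one value tensor per layer, each holding
    n_kv_heads * head_dim values per token.
    """
    # Factor 2: keys and values are stored separately.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def attention_score_count(seq_len):
    """Number of query-key scores in one naive full attention map."""
    return seq_len * seq_len
```

Under those assumed dimensions, `kv_cache_bytes(1, 32, 32, 128)` comes to 524288 bytes (0.5 MiB) per token, so a 4096-token context needs about 2 GiB of cache. The cache grows linearly with `seq_len`, while `attention_score_count` grows quadratically.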

University approvals: 0
Tasks
Question 1

Recurrent neural networks (RNNs) process a sequence:

Hint

Skim the paragraph on how recurrent neural networks (RNNs) process a sequence in "Bridge to transformer mechanisms" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Transformers replace recurrence with:

Hint

Skim the paragraph on what transformers use in place of recurrence in "Bridge to transformer mechanisms" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Hybrid sequence architectures may combine attention with:

Hint

Skim the paragraph on what hybrid sequence architectures combine with attention in "Bridge to transformer mechanisms" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

What is the time/space complexity of naive full self-attention over a sequence of length $n$ (per layer)?

Hint

Skim the paragraph on the time and space complexity of naive full self-attention in "Bridge to transformer mechanisms" before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Intermediate
  • Completed: 0 users
Creator
Best
BestBuddy