Transformer block: attention + MLP + residuals + norm

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

A transformer block wraps two sublayers around a residual highway. First, multi-head self-attention mixes information across tokens; second, a positionwise feed-forward network (FFN) applies the same MLP independently at each position.

The FFN is typically two linear maps with a nonlinearity in between, often expanding to $4\times$ model width before projecting back:

$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2.$$
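
As a rough illustration of the formula above, here is a minimal NumPy sketch. It assumes a row-vector layout (each row of `x` is one token, so `x @ W1` plays the role of $W_1 x$), a tanh-approximated GELU for $\sigma$ (the card does not fix the nonlinearity), and made-up toy sizes; none of these choices come from the card itself.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU. Assumption: the card leaves sigma unspecified.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Positionwise FFN. x has shape (seq_len, d_model); each row is one token."""
    hidden = gelu(x @ W1 + b1)   # expand to the hidden width
    return hidden @ W2 + b2      # project back to the model width

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8          # toy sizes, chosen only for illustration
d_hidden = 4 * d_model           # the 4x expansion mentioned in the text

W1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=(d_hidden, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)

# "Positionwise": running a single token alone gives the same row as the full pass.
one_token = ffn(x[2:3], W1, b1, W2, b2)
print(y.shape, np.allclose(one_token, y[2:3]))
```

The final check is the point of the sketch: the output for a token depends only on that token's own input row, never on its neighbours.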

Attention decides which tokens exchange information; the FFN then applies a per-token nonlinear transformation to the mixed result.
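
To make the "who talks to whom" part concrete, the following sketch implements a single unmasked attention head in NumPy. The card describes multi-head attention; using one head, no mask, and random toy weights is a simplifying assumption made here for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head, unmasked scaled dot-product self-attention (simplified)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (seq_len, seq_len) token-to-token scores
    weights = softmax(scores, axis=-1)        # row i: how much token i reads from each token
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 3, 4                       # toy sizes, chosen only for illustration
Wq, Wk, Wv = (rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(3))

x = rng.normal(size=(seq_len, d_model))
out, weights = self_attention(x, Wq, Wk, Wv)

# Perturb only token 0 and check whether the outputs at tokens 1 and 2 move.
x_perturbed = x.copy()
x_perturbed[0] += 1.0
out_perturbed, _ = self_attention(x_perturbed, Wq, Wk, Wv)
other_tokens_changed = not np.allclose(out[1:], out_perturbed[1:])
print(weights.sum(axis=-1), other_tokens_changed)
```

Changing one token shifts the outputs at other positions, which is exactly the cross-token mixing that a positionwise FFN cannot perform on its own.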

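Residual connections add each sublayer's input to its output: $x + \mathrm{Attention}(x)$ and $x + \mathrm{FFN}(x)$. They provide gradient shortcuts through depth, mitigating vanishing gradients in very deep stacks. LayerNorm or RMSNorm normalizes activations per token, across the feature dimension: LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm only divides by the root mean square. In pre-norm the normalization is applied to the sublayer input, giving $x + \mathrm{Sublayer}(\mathrm{Norm}(x))$; in post-norm it is applied after the residual add, giving $\mathrm{Norm}(x + \mathrm{Sublayer}(x))$. Pre-norm often trains more stably than post-norm in large models.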
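
The two norm placements can be sketched as follows. This is a simplified illustration, not a reference implementation: the learnable gain and bias that real LayerNorm and RMSNorm layers usually carry are omitted, and the helper names `pre_norm_step` and `post_norm_step` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization over the feature axis (learnable gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # Rescales by the root mean square only; no mean subtraction (gain omitted).
    return x / np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)

def pre_norm_step(x, sublayer, norm=layer_norm):
    # Norm first, then sublayer, then residual add: the skip path is left untouched.
    return x + sublayer(norm(x))

def post_norm_step(x, sublayer, norm=layer_norm):
    # Residual add first, then norm: the skip path itself gets normalized.
    return norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # (seq_len, d_model), toy sizes

# A sublayer that outputs zeros isolates what each arrangement does to the skip path.
zero_sublayer = lambda h: np.zeros_like(h)
pre = pre_norm_step(x, zero_sublayer)
post = post_norm_step(x, zero_sublayer)
print(np.allclose(pre, x), np.allclose(post, x))
```

With a zero sublayer the pre-norm step returns its input unchanged, while the post-norm step still rescales it. That difference in how the skip path is treated is one way to see why the two arrangements behave differently in deep stacks.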

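Modern LLM FFNs often use SwiGLU gating: one linear branch is multiplied elementwise by a Swish-activated gate branch before the down-projection. At equal hidden width this adds a third weight matrix, so implementations commonly shrink the hidden width (to roughly two thirds) to keep the parameter count comparable. The gated form has improved quality per FLOP in many pretraining runs.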
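
A minimal sketch of a SwiGLU feed-forward layer, under a few assumptions: no bias terms, a row-vector layout, and hypothetical weight names `W_gate`, `W_up`, and `W_down`. The toy widths are picked so that the gated layer, with its hidden width cut to two thirds, has the same parameter count as a plain two-matrix FFN at $4\times$ width.

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: a Swish-activated gate branch multiplies a linear branch elementwise."""
    gated = swish(x @ W_gate) * (x @ W_up)
    return gated @ W_down                     # down-projection back to d_model

rng = np.random.default_rng(0)
seq_len, d_model = 5, 6                       # toy sizes, chosen only for illustration
d_hidden_plain = 4 * d_model                  # 24: hidden width of a plain 4x FFN
d_hidden_gated = (2 * d_hidden_plain) // 3    # 16: two thirds of that width

W_gate = rng.normal(scale=0.1, size=(d_model, d_hidden_gated))
W_up = rng.normal(scale=0.1, size=(d_model, d_hidden_gated))
W_down = rng.normal(scale=0.1, size=(d_hidden_gated, d_model))

x = rng.normal(size=(seq_len, d_model))
y = swiglu_ffn(x, W_gate, W_up, W_down)

# Parameter counts, ignoring biases: two matrices (plain) versus three (gated).
plain_params = 2 * d_model * d_hidden_plain
gated_params = W_gate.size + W_up.size + W_down.size
print(y.shape, plain_params, gated_params)
```

The two counts match because three matrices at two thirds of the width hold as many weights as two matrices at full width.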

Stacking $L$ blocks of identical architecture (each with its own weights) yields depth $L$. Each block refines the representation while the residual path lets earlier information propagate forward unchanged. A common explanation for why pre-norm stacks often tolerate larger learning rates is that the residual path itself is never normalized, so gradients reach early blocks through an identity connection rather than through a chain of normalization layers.
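
The stacking idea can be sketched with a deliberately simplified block. Assumption: each block's attention and FFN are replaced by a single toy update, `tanh(rms_norm(x) @ W)`, so that the residual bookkeeping stays visible; `make_update_fn` and `run_stack` are hypothetical helper names.

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    return x / np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)

def make_update_fn(W):
    # Toy stand-in for a block's sublayers (NOT real attention or a real FFN).
    def update(x):
        return np.tanh(rms_norm(x) @ W)       # pre-norm: normalize, then transform
    return update

def run_stack(x, blocks):
    """Apply the blocks in order, adding each block's update onto the residual path."""
    updates = []
    for update in blocks:
        delta = update(x)
        updates.append(delta)
        x = x + delta                         # residual add
    return x, updates

rng = np.random.default_rng(0)
L, seq_len, d_model = 6, 4, 8                 # toy sizes, chosen only for illustration

# Same architecture in every block, but each block draws its own weights.
blocks = [make_update_fn(rng.normal(scale=0.1, size=(d_model, d_model))) for _ in range(L)]
zero_blocks = [make_update_fn(np.zeros((d_model, d_model))) for _ in range(L)]

x0 = rng.normal(size=(seq_len, d_model))
out, updates = run_stack(x0, blocks)
zero_out, _ = run_stack(x0, zero_blocks)

# The final state is the raw input plus the sum of all per-block updates.
print(np.allclose(out, x0 + sum(updates)), np.allclose(zero_out, x0))
```

Both checks reflect the same fact: the input is carried through the whole stack by the residual adds, and each block only contributes an additive refinement on top of it.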

Related cards
Builds on Positional encodings and length generalization · Machine learning
Next Encoder vs decoder masking · Machine learning
Tasks
Question 1

Residual connections in a transformer block:

Hint

Skim the paragraph on residual connections in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

LayerNorm / RMSNorm in transformers normalize:

Hint

Skim the paragraph on LayerNorm and RMSNorm in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

The positionwise feed-forward sublayer (FFN) is:

Hint

Skim the paragraphs on the positionwise feed-forward sublayer in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Which problem do residual connections primarily mitigate in very deep networks?

Hint

Skim the paragraph on what residual connections mitigate in very deep networks before choosing. Eliminate options that contradict a definition stated in the card.
