MLP neurons as key-value-ish nonlinear transforms

Transformer MLP blocks widen activations, apply a smooth gate (GeLU, SiLU), then project back to model width. This invites a loose analogy to key-value memory: each row of the up-projection acts like a key matched against the input, the nonlinear gate decides how strongly each match fires, and the down-projection sums the corresponding value vectors back into model width.
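The analogy above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `mlp_key_value`, `keys`, and `values` are hypothetical, biases are omitted, and the tiny $2 \to 4 \to 2$ shapes are chosen only so the numbers are easy to follow.

```python
import math

def gelu(x):
    # Exact GeLU: x times the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp_key_value(x, keys, values):
    """Ungated MLP written as a soft key-value lookup (biases omitted).

    keys:   d_ff rows of length d_model (rows of the up-projection)
    values: d_ff rows of length d_model (one output vector per neuron)
    """
    # Each neuron's coefficient is the gated match between x and its key.
    coeffs = [gelu(sum(k_i * x_i for k_i, x_i in zip(k, x))) for k in keys]
    # The output is the coefficient-weighted sum of the value vectors.
    out = [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(len(x))]
    return out, coeffs

# Toy example (assumed values): d_model = 2, d_ff = 4.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
out, coeffs = mlp_key_value([2.0, 0.0], keys, values)
print(coeffs)  # the first key matches strongly; the opposite key is almost gated off
print(out)
```

Unlike a hard lookup, every key contributes a little: the neuron whose key points away from the input still gets a small negative coefficient from GeLU rather than an exact zero.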

A typical expansion ratio is $4\times$: if $d_{\text{model}} = 4096$, the intermediate layer is 16384 wide before the second linear map shrinks it back. SwiGLU variants multiply one up-projection elementwise by a second, swish-activated (SiLU) projection; this gating is common in the PaLM and LLaMA families.
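A minimal sketch of the gated variant, assuming three bias-free weight matrices named `W_gate`, `W_up`, and `W_down` (hypothetical names used for illustration). It also restates the width arithmetic from the paragraph above.

```python
import math

def silu(x):
    # SiLU / swish: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * x_i for w, x_i in zip(row, x)) for row in W]

def swiglu_mlp(x, W_gate, W_up, W_down):
    """Gated MLP: down( silu(W_gate x) * (W_up x) ), biases omitted."""
    gate = [silu(g) for g in matvec(W_gate, x)]   # swish-activated branch
    up = matvec(W_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # back to d_model

# Width arithmetic from the text: a 4x expansion of d_model = 4096.
d_model = 4096
d_ff = 4 * d_model
print(d_ff)  # 16384

# Tiny worked example (assumed values): d_model = 2, d_ff = 3.
x = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = swiglu_mlp(x, W_gate, W_up, W_down)
print(y)
```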

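The block is an inverted bottleneck: the nonlinear gate acts in the wide intermediate space, not a compressed one, and the result is then squeezed back into the narrower residual stream. Both linear maps therefore have rank at most $d_{\text{model}}$, so the gate sees at most $d_{\text{model}}$ independent input directions and the block can write at most $d_{\text{model}}$ independent output directions. Keeping the residual stream narrow also keeps the parameter count manageable: roughly $2\, d_{\text{model}}\, d_{\text{ff}}$ weights for an ungated block, ignoring biases.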

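GeLU and SiLU differ from ReLU near zero: ReLU has a kink at the origin and zero gradient for every negative input, while GeLU and SiLU are smooth and pass a small nonzero gradient for mildly negative inputs. This changes gradient flow during training compared with hard zeroing.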
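The difference shows up directly in the derivatives. The sketch below uses the closed-form derivatives of exact GeLU and SiLU; the function names are illustrative, and the ReLU derivative at exactly zero is set to 0 by convention.

```python
import math

def relu_grad(x):
    # Hard gate: gradient is 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with Phi/phi the normal CDF/PDF.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def silu_grad(x):
    # d/dx [x * s(x)] = s + x * s * (1 - s), with s the logistic sigmoid.
    s = 1.0 / (1.0 + math.exp(-x))
    return s + x * s * (1.0 - s)

for x in (-0.5, 0.0, 0.5):
    print(x, relu_grad(x), round(gelu_grad(x), 4), round(silu_grad(x), 4))
```

At $x = 0$ both smooth gates have slope exactly $0.5$, and at $x = -0.5$ they still pass gradient where ReLU passes none.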

Unlike a database, MLP weights are trained jointly with attention and updated by every gradient step. Because features are stored in superposition, with many features sharing the same neurons, editing one fact can ripple into others.
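A toy illustration of the ripple effect, under loud assumptions: a two-neuron bias-free MLP with ReLU (swapped in for GeLU so the numbers stay exact), hand-picked weights, and two inputs `x_a` and `x_b` that stand in for two unrelated "facts". All names and values are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def mlp(x, keys, values):
    # Gated match against each key, then a weighted sum of value vectors.
    coeffs = [relu(sum(k_i * x_i for k_i, x_i in zip(k, x))) for k in keys]
    return [sum(c * v[j] for c, v in zip(coeffs, values))
            for j in range(len(values[0]))]

keys = [[1.0, 1.0], [1.0, -1.0]]      # neuron 0 fires for both inputs
values = [[1.0, 0.0], [0.0, 1.0]]
x_a, x_b = [1.0, 0.0], [0.0, 1.0]

before_a, before_b = mlp(x_a, keys, values), mlp(x_b, keys, values)

# "Edit" the output for x_a by shifting neuron 0's value vector.
edited_values = [[1.0, 0.5], [0.0, 1.0]]
after_a, after_b = mlp(x_a, keys, edited_values), mlp(x_b, keys, edited_values)

print(before_a, "->", after_a)   # the intended change
print(before_b, "->", after_b)   # the side effect on the other input
```

Because neuron 0 is shared, the edit aimed at `x_a` also moves the output for `x_b`; a database row update would not have this side effect.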

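The PaLM and LLaMA families switched to gated activations, and their FFN ratios differ: PaLM keeps $4\times$, while LLaMA uses roughly $\tfrac{8}{3}\times$ to offset the extra gate matrix. When comparing memory analogies across architectures, match both expansion ratio and activation choice.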
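One way to make such comparisons concrete is to count weights. The helper below is a rough sketch: `ffn_params` is a hypothetical name, biases are ignored, and the $\tfrac{8}{3}$ ratio is floored to an integer width, whereas real implementations often round the width to a hardware-friendly multiple.

```python
def ffn_params(d_model, d_ff, gated):
    # Ungated: up + down matrices. Gated: gate + up + down matrices.
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 4096
plain_count = ffn_params(d_model, 4 * d_model, gated=False)
gated_count = ffn_params(d_model, (8 * d_model) // 3, gated=True)

print(plain_count)                # 134217728
print(gated_count)                # 134209536
print(gated_count / plain_count)  # close to 1: the two budgets nearly match
```

Under these assumptions, a gated block at about $\tfrac{8}{3}\times$ and an ungated block at $4\times$ carry nearly the same number of weights, so a neuron-for-neuron comparison between them is not like-for-like even though the budgets match.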
