Encoder vs decoder masking

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Not every transformer attends the same way. Encoder blocks (BERT-style) use bidirectional attention: every token may attend to every other token in the input. That suits masked language modeling, which predicts held-out tokens from the context on both sides.

Decoder blocks in autoregressive models (GPT-style) apply a causal (triangular) mask: position $i$ may attend only to positions $j \le i$. Future tokens are invisible during training, so each prediction depends only on the prefix that will be available at generation time, which preserves the left-to-right generation order.
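The rule $j \le i$ can be written down directly. The following is a minimal sketch in plain Python (no framework assumed); the function name `causal_mask` is hypothetical and chosen only for illustration.

```python
def causal_mask(n):
    """Return an n x n boolean matrix; mask[i][j] is True when
    query position i is allowed to attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed position, "." a forbidden (future) one
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The allowed region is the lower triangle including the diagonal, which is why the mask is often called triangular.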

Encoder-decoder architectures (T5, original translation transformers) pair an encoder that reads the source bidirectionally with a decoder that cross-attends to the encoder outputs while still using causal self-attention over its own tokens.
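An encoder-decoder model therefore involves three masks with different shapes. The sketch below lists them side by side; it ignores padding for simplicity, and the names `full_mask`, `causal_mask`, and `encoder_decoder_masks` are hypothetical helpers, not part of any library.

```python
def full_mask(rows, cols):
    """Every query position may attend to every key position."""
    return [[True] * cols for _ in range(rows)]


def causal_mask(n):
    """Query position i may attend only to key positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def encoder_decoder_masks(src_len, tgt_len):
    """Collect the three attention masks, indexed as mask[query][key]."""
    return {
        # encoder self-attention: bidirectional over the source
        "encoder_self": full_mask(src_len, src_len),
        # decoder self-attention: causal over the target
        "decoder_self": causal_mask(tgt_len),
        # cross-attention: each target position sees the whole source
        "cross": full_mask(tgt_len, src_len),
    }


masks = encoder_decoder_masks(src_len=3, tgt_len=2)
for name, m in masks.items():
    print(name, f"{len(m)}x{len(m[0])}")
```

Only the decoder self-attention mask is triangular; the cross-attention mask is rectangular because queries come from the target and keys from the source.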

Mixing these regimes breaks objectives. If the causal mask is disabled during GPT training, future tokens leak into predictions, inflating metrics and destroying valid autoregressive generation. If you need bidirectional context, you must train with a bidirectional objective and use a matching inference protocol.
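The leak can be made visible with a toy check: edit only the last token and see whether the output at the first position changes. This is a simplified sketch, assuming uniform attention scores instead of real query-key products; `attend` and `first_position_output` are hypothetical names.

```python
import math


def attend(scores_row, values, allowed_row):
    """Masked softmax-weighted average of values for one query position."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores_row, allowed_row)]
    top = max(masked)
    weights = [math.exp(s - top) for s in masked]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


def first_position_output(values, causal):
    """Attention output at position 0, with or without the causal mask."""
    n = len(values)
    allowed = [(j <= 0) or not causal for j in range(n)]
    scores = [0.0] * n  # assumption: uniform scores keep the example small
    return attend(scores, values, allowed)


original = [1.0, 2.0, 3.0]
edited = [1.0, 2.0, 99.0]  # only the last (future) token differs

causal_same = first_position_output(original, True) == first_position_output(edited, True)
leaky_same = first_position_output(original, False) == first_position_output(edited, False)
print(causal_same, leaky_same)
# True False
```

With the mask, position 0 is unaffected by the edit. Without it, the future value changes the output, which is exactly the information a model could exploit during training but will not have at generation time.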

Unit tests for attention masks should assert that forbidden positions receive exactly zero probability mass after softmax, not merely that their logits are large and negative, because finite fill values could underflow inconsistently across hardware.
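A test along these lines might look like the sketch below. It is written as plain pytest-style functions in stdlib Python; `masked_softmax` is a hypothetical reference implementation, and the fill value of -30 in the second test is an assumption chosen only to make the residual mass visible in double precision.

```python
import math


def masked_softmax(logits, allowed, fill=float("-inf")):
    """Softmax over one row, replacing forbidden logits with `fill`."""
    masked = [x if ok else fill for x, ok in zip(logits, allowed)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def test_forbidden_positions_get_exactly_zero():
    logits = [0.3, -1.2, 2.0, 0.7, -0.5]
    n = len(logits)
    for i in range(n):
        allowed = [j <= i for j in range(n)]  # causal row for query i
        probs = masked_softmax(logits, allowed)
        for j in range(i + 1, n):
            assert probs[j] == 0.0  # exact zero, not "close to zero"
        assert abs(sum(probs) - 1.0) < 1e-12


def test_finite_fill_can_leave_residual_mass():
    # A finite fill keeps a small but nonzero probability on a
    # position that should be forbidden.
    probs = masked_softmax([0.0, 0.0], [True, False], fill=-30.0)
    assert probs[1] > 0.0


test_forbidden_positions_get_exactly_zero()
test_finite_fill_can_leave_residual_mass()
```

Filling with negative infinity works here because every causal row keeps at least its diagonal entry, so the row maximum is always finite.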

Production code often factors mask construction into a reusable helper shared by training and inference so eval mode cannot accidentally drop causality.
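One way to arrange this is sketched below, with a single builder that both code paths call. All names here (`build_self_attention_mask`, `training_mask`, `inference_mask`) are hypothetical, and a real codebase would return framework tensors instead of nested lists.

```python
def build_self_attention_mask(seq_len, causal=True):
    """Single source of truth for the self-attention mask.

    mask[i][j] is True when query position i may attend to key position j.
    """
    if causal:
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    return [[True] * seq_len for _ in range(seq_len)]


def training_mask(batch_seq_len):
    """Mask used by the (hypothetical) training loop."""
    return build_self_attention_mask(batch_seq_len, causal=True)


def inference_mask(prompt_len, generated_len):
    """Mask used by the (hypothetical) generation loop for the same model."""
    return build_self_attention_mask(prompt_len + generated_len, causal=True)


# Both paths agree for the same total length because they share the builder.
assert training_mask(6) == inference_mask(4, 2)
```

Because neither caller builds a mask on its own, a change to the masking rule only has to be made, and tested, in one place.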

Document which mask tensor is applied in eval versus train; many production bugs come from forgetting to rebuild the causal mask when exporting to ONNX or TensorRT.

Tasks
Question 1

A causal (triangular) attention mask enforces that:

Hint

Reread the paragraph on the causal (triangular) attention mask in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

BERT-style masked language modeling predicts:

Hint

Reread the paragraph on BERT-style masked language modeling in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Classic encoder-decoder models (e.g. T5) use:

Hint

Reread the paragraph on encoder-decoder architectures in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

What goes wrong if the causal mask is accidentally disabled during autoregressive training?

Hint

Reread the paragraph on what breaks when masking regimes are mixed in this card before choosing. Eliminate options that contradict a definition stated in the card.
