Masking bugs and leakage

Advanced Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Two mask families appear in almost every codebase. Padding masks zero out attention to padded positions in variable-length batches, because those slots carry no semantic content. Causal masks set scores to $-\infty$ (or a large negative value) for future positions $j \gt i$ before the softmax, enforcing autoregressive visibility.
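A minimal NumPy sketch of the two mask families follows. The helper names (`padding_mask`, `causal_mask`, `masked_softmax`) and the `lengths` argument are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def padding_mask(lengths, max_len):
    """Boolean (batch, max_len) array: True where a key position holds a real token."""
    positions = np.arange(max_len)
    return positions[None, :] < np.asarray(lengths)[:, None]

def causal_mask(n):
    """Boolean (n, n) array: True where query i may attend to key j, i.e. j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, allowed):
    """Softmax over the last axis after setting forbidden cells to -inf.

    Assumes every row has at least one allowed cell; a fully masked row
    would produce NaN and needs separate handling.
    """
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Causal: query i spreads its weight over keys 0..i only.
causal_weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))

# Padding: two sequences of lengths 2 and 4, padded to 4 keys.
pad = padding_mask([2, 4], max_len=4)                 # shape (2, 4)
pad_weights = masked_softmax(np.zeros((2, 4, 4)), pad[:, None, :])
```

With all-zero scores the weights are uniform over the allowed keys, which makes the effect of each mask easy to inspect by eye.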

Swapping masks is catastrophic. A causal decoder given only a padding mask can attend to future real tokens; a BERT-style encoder given a causal mask loses useful bidirectional context. Each objective requires a matching mask and a matching inference protocol.
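One way to see the decoder-side failure is a perturbation check: change only the last token and test whether outputs at earlier positions move. The sketch below uses a toy single-head attention with identity projections, which is an assumption made purely for illustration; `prefix_changed` is a hypothetical helper name.

```python
import numpy as np

def attention(x, allowed):
    """Toy single-head self-attention with identity Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
x_future_changed = x.copy()
x_future_changed[-1] = rng.normal(size=d)     # perturb only the last token

causal = np.tril(np.ones((n, n), dtype=bool))
padding_only = np.ones((n, n), dtype=bool)    # no padding here, so every key is visible

def prefix_changed(allowed):
    """True if outputs at positions 0..n-2 depend on the last token."""
    before = attention(x, allowed)[:-1]
    after = attention(x_future_changed, allowed)[:-1]
    return not np.allclose(before, after)

leak_with_causal = prefix_changed(causal)
leak_with_padding_only = prefix_changed(padding_only)
```

Under the causal mask the earlier positions should be unaffected, while the padding-only mask lets the perturbation reach them, which is exactly the leak a decoder must not have.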

Sliding-window (local) attention restricts each token to a neighborhood of width $w$, reducing the $O(n^2)$ cost to $O(nw)$. Custom boolean masks define which $(i,j)$ pairs are allowed; implementation bugs often show up as sudden, implausibly large drops in validation loss that trace back to label leakage.
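A sliding-window mask can be sketched as a boolean matrix over $(i,j)$ pairs. The window conventions below are assumptions, since the exact meaning of "width $w$" varies between implementations: the causal variant lets token $i$ see itself and the $w-1$ positions before it, and the symmetric variant allows $|i-j| \le \lfloor w/2 \rfloor$.

```python
import numpy as np

def sliding_window_mask(n, w, causal=True):
    """Boolean (n, n) mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    if causal:
        # Keys i-w+1 .. i: the w most recent positions, including i itself.
        return (j <= i) & (j > i - w)
    # Symmetric neighborhood centered on i.
    return np.abs(i - j) <= w // 2

mask = sliding_window_mask(6, 3)
allowed_per_row = mask.sum(axis=1)   # at most w allowed keys per query
```

Each row allows at most $w$ keys, so the number of allowed pairs grows as $nw$ rather than $n^2$. Note that a dense boolean mask like this one still costs $O(n^2)$ memory; the savings only materialize in kernels that skip the masked cells.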

Regression tests should include both padded and unpadded batches to verify that the mask broadcasts correctly for both shapes. Causal and padding masks compose by adding large negative values to forbidden cells before a single softmax.
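The composition and the padded-versus-unpadded regression check might look like the sketch below. The `(batch, heads, T, T)` score layout, the constant `NEG`, and the function names are assumptions chosen for illustration; the right large-negative value depends on the dtype in use.

```python
import numpy as np

NEG = -1e9  # stand-in for -inf; an assumption, choose per dtype

def combined_bias(lengths, max_len):
    """Additive bias (batch, 1, T, T): 0.0 where allowed, NEG where forbidden."""
    key_is_real = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]  # (B, T)
    causal = np.tril(np.ones((max_len, max_len), dtype=bool))                   # (T, T)
    allowed = causal[None, None, :, :] & key_is_real[:, None, None, :]         # (B, 1, T, T)
    return np.where(allowed, 0.0, NEG)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(scores, lengths):
    """scores: (batch, heads, T, T). Bias is added once, then one softmax."""
    return softmax(scores + combined_bias(lengths, scores.shape[-1]))

rng = np.random.default_rng(0)
short = rng.normal(size=(1, 2, 3, 3))     # one length-3 sequence, 2 heads, no padding
padded = rng.normal(size=(1, 2, 5, 5))    # same sequence padded to 5, garbage in pad cells
padded[..., :3, :3] = short

w_short = attention_weights(short, [3])
w_padded = attention_weights(padded, [3])

# Regression check: padding must not change the weights among real tokens.
same_on_real_tokens = np.allclose(w_padded[..., :3, :3], w_short)
mass_on_padding = w_padded[..., :3, 3:].max()
```

Combining the boolean masks first and applying `NEG` once avoids stacking two large negatives on the same cell, which can overflow in low-precision dtypes.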

Log the maximum attention mass on padded positions during training; any positive mass indicates a mask leak that will hurt validation perplexity in subtle ways.
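A hedged sketch of such a diagnostic is shown here. It assumes post-softmax weights of shape `(batch, heads, T_query, T_key)` and right-padded sequences described by `lengths`; the function name `max_padded_attention_mass` is hypothetical.

```python
import numpy as np

def max_padded_attention_mass(weights, lengths):
    """Largest total probability that any real query row assigns to padded keys."""
    lengths = np.asarray(lengths)
    t_q, t_k = weights.shape[-2], weights.shape[-1]
    key_is_pad = np.arange(t_k)[None, :] >= lengths[:, None]       # (B, T_k)
    query_is_real = np.arange(t_q)[None, :] < lengths[:, None]     # (B, T_q)
    # Sum the weight each query places on padded keys.
    mass = (weights * key_is_pad[:, None, None, :]).sum(axis=-1)   # (B, H, T_q)
    # Ignore rows that belong to padded queries; their output is discarded anyway.
    mass = np.where(query_is_real[:, None, :], mass, 0.0)
    return float(mass.max())

# Unmasked uniform attention over 4 keys; sequence 0 has only 2 real tokens.
leaky = np.full((2, 1, 4, 4), 0.25)
leak_mass = max_padded_attention_mass(leaky, [2, 4])

# Correctly masked weights put nothing on the padded keys.
clean = leaky.copy()
clean[0, :, :, :2] = 0.5
clean[0, :, :, 2:] = 0.0
clean_mass = max_padded_attention_mass(clean, [2, 4])
```

Logging this scalar once per step is cheap, and a value that stays at zero is a useful standing check that later refactors have not broken the mask.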

Tasks
Question 1

A padding mask ensures that:

Hint

Skim the paragraph on padding masks in "Masking bugs and leakage" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Sliding-window (local) attention restricts:

Hint

Skim the paragraph on sliding-window (local) attention in "Masking bugs and leakage" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

You cannot swap causal and bidirectional attention without:

Hint

Skim the paragraph on swapping causal and bidirectional masks in "Masking bugs and leakage" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

In self-supervised language modeling, what is 'label leakage' when future positions are visible?

Hint

Skim the passages on label leakage and causal masking in "Masking bugs and leakage" before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Advanced