Masking bugs and leakage

Two mask families appear in almost every codebase. Padding masks zero out attention to padded positions in variable-length batches, since those slots carry no semantic content. Causal masks set scores to $-\infty$ (or a large negative value) for future positions $j > i$ before the softmax, enforcing autoregressive visibility.
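As a minimal sketch of the two mask families, the NumPy snippet below builds each mask as a boolean "allowed" array and applies it before the softmax. The function names, the `True = allowed` convention, and the `-1e9` stand-in for $-\infty$ are assumptions made for illustration, not a reference to any particular library.

```python
import numpy as np

NEG = -1e9  # assumed large-negative stand-in for -inf


def padding_mask(lengths, max_len):
    """Boolean (batch, max_len): True where the position holds a real token."""
    positions = np.arange(max_len)
    return positions[None, :] < np.asarray(lengths)[:, None]


def causal_mask(n):
    """Boolean (n, n): True where query i may attend to key j, i.e. j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def masked_softmax(scores, allowed):
    """Softmax over the last axis; cells where allowed is False get weight 0."""
    masked = np.where(allowed, scores, NEG)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


# Padding: two sequences of length 3 and 2, padded to 4 key positions.
key_is_real = padding_mask([3, 2], 4)                      # (batch, keys)
pad_weights = masked_softmax(np.zeros((2, 4, 4)), key_is_real[:, None, :])

# Causal: query i sees keys 0..i only.
causal_weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
```

The padding mask is indexed by key position and broadcast over queries, while the causal mask depends on both indices, which is why the two cannot stand in for each other.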

Swapping masks is catastrophic. Using only a padding mask on a causal decoder lets each position attend to future real tokens; applying a causal mask to a BERT-style encoder blocks useful bidirectional context. Each objective requires a matching mask and inference protocol.

Sliding-window (local) attention restricts each token to a neighborhood of width $w$, reducing the $O(n^2)$ cost to $O(nw)$. Custom boolean masks define which $(i,j)$ pairs are allowed; implementation bugs often show up as sudden, implausibly large drops in validation loss that trace back to label leakage.
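A sliding-window mask and a cheap structural leak check might look like the sketch below. The window convention is an assumption: in the causal case each query sees itself and the previous $w-1$ keys, and in the bidirectional case it sees keys within $\lfloor w/2 \rfloor$ positions on either side. Other codebases define the width differently, and `leaks_future` is a hypothetical helper name.

```python
import numpy as np


def sliding_window_mask(n, w, causal=True):
    """Boolean (n, n): True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    if causal:
        # Assumed convention: keys i-w+1 .. i are visible.
        return (j <= i) & (i - j < w)
    # Assumed convention: keys within w // 2 positions on either side.
    return np.abs(i - j) <= w // 2


def leaks_future(allowed):
    """True if any allowed cell lies above the diagonal (j > i)."""
    return bool(np.triu(allowed, k=1).any())


local_causal = sliding_window_mask(5, 2, causal=True)
local_bidirectional = sliding_window_mask(5, 3, causal=False)
```

Running `leaks_future` on every custom mask used with an autoregressive objective is a quick guard: a bidirectional window passed to a decoder by mistake is flagged immediately rather than showing up later as a suspicious loss curve.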

Regression tests should include both padded and unpadded batches to ensure the mask broadcasts correctly for both shapes. Causal and padding masks compose by adding large negative values to every forbidden cell before a single softmax.
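One way to sketch both ideas is shown below: a combined additive bias built from the causal and padding masks, and a regression check that a sequence gives the same attention weights whether or not it is padded. All names, shapes, and the `-1e9` constant are illustrative assumptions.

```python
import numpy as np

NEG = -1e9  # assumed large-negative stand-in for -inf


def combined_bias(lengths, n):
    """Additive bias (batch, n, n): 0.0 where allowed, NEG where forbidden."""
    causal = np.tril(np.ones((n, n), dtype=bool))                        # (n, n)
    key_is_real = np.arange(n)[None, :] < np.asarray(lengths)[:, None]   # (batch, n)
    allowed = causal[None, :, :] & key_is_real[:, None, :]               # (batch, n, n)
    return np.where(allowed, 0.0, NEG)


def attention_weights(scores, bias):
    """Single softmax over keys after adding the combined bias."""
    z = scores + bias
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def padded_matches_unpadded(seed=0):
    """Length-3 sequence: weights must agree with and without padding to 5."""
    rng = np.random.default_rng(seed)
    short = rng.normal(size=(1, 3, 3))
    padded = rng.normal(size=(1, 5, 5))   # junk scores in the padded cells
    padded[:, :3, :3] = short
    w_short = attention_weights(short, combined_bias([3], 3))
    w_padded = attention_weights(padded, combined_bias([3], 5))
    return bool(np.allclose(w_padded[:, :3, :3], w_short))
```

The padded batch deliberately carries random scores in its padded cells, so the check only passes if the mask removes their influence rather than relying on them being zero.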

Log the maximum attention mass on padded positions during training; any mass above numerical noise indicates a mask leak that will hurt validation perplexity in subtle ways.
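The logged quantity could be computed roughly as follows. This is a sketch that assumes attention weights of shape `(batch, heads, queries, keys)` and a boolean `key_is_real` array of shape `(batch, keys)`; the function name and shapes are hypothetical and would need adapting to the model at hand.

```python
import numpy as np


def max_padded_attention_mass(weights, key_is_real):
    """Largest total attention weight that any query places on padded keys.

    weights:     (batch, heads, queries, keys), rows sum to 1
    key_is_real: (batch, keys) bool, False at padded positions
    """
    is_pad = ~key_is_real[:, None, None, :]               # broadcast over heads, queries
    mass = np.where(is_pad, weights, 0.0).sum(axis=-1)    # (batch, heads, queries)
    return float(mass.max())


key_is_real = np.array([[True, True, True, False]])

# Leaky case: uniform attention puts 0.25 on the padded key.
leaky = np.full((1, 1, 2, 4), 0.25)

# Correctly masked case: all weight on the three real keys.
clean = np.zeros((1, 1, 2, 4))
clean[..., :3] = 1.0 / 3.0

leak_value = max_padded_attention_mass(leaky, key_is_real)
clean_value = max_padded_attention_mass(clean, key_is_real)
```

Taking the maximum rather than the mean keeps a leak in a single head or a single batch element from being averaged away.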
