Pedagogical edge cases: saturation, dead ReLUs, initialization

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

The closing interview contrasts historical sigmoid activations, motivated by biological on/off neurons, with modern ReLU units $\max(0,a)$, which pass positive preactivations through unchanged and zero out negative ones. Sigmoid and tanh saturate: for large $|a|$ their derivatives approach 0, so every saturated layer shrinks the backpropagated signal, which hurts the training of very deep networks (vanishing gradients). ReLU's constant slope of 1 on the active half-line eased optimization for deep stacks in practice.
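The saturation argument can be checked numerically. The following is a minimal sketch, not taken from the interview: it compares the local derivative of a sigmoid with that of a ReLU, then multiplies the same local derivative across several layers. The helper `chained_gradient` is a hypothetical name, and it deliberately ignores the weight matrices, so it only illustrates the activation's contribution to the backpropagated signal.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    # Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at a = 0).
    s = sigmoid(a)
    return s * (1.0 - s)

def relu_grad(a):
    # Slope 1 on the active half-line, 0 otherwise.
    return 1.0 if a > 0 else 0.0

def chained_gradient(local_grad, a, depth):
    # Hypothetical helper: product of `depth` identical local derivatives.
    # Weights are ignored on purpose, so this isolates the activation's effect.
    return local_grad(a) ** depth

for a in (0.0, 2.0, 5.0):
    print(f"a={a}: sigmoid'={sigmoid_grad(a):.5f}, relu'={relu_grad(a)}")

print("10 sigmoid layers at a=0:", chained_gradient(sigmoid_grad, 0.0, 10))
print("10 ReLU layers at a=2:  ", chained_gradient(relu_grad, 2.0, 10))
```

Even in the sigmoid's best case (a = 0, derivative 0.25) ten layers shrink the factor below one millionth, while an active ReLU path keeps it at 1.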

ReLU has its own failure mode: if weights and biases push a unit's preactivation permanently negative for the training inputs you care about, the neuron outputs zero and receives zero gradient, so its weights stop updating. This is a dead ReLU. Initialization schemes (Xavier/He) scale the variance of the random weights by layer width (fan-in, and for Xavier also fan-out) so activations and gradients do not explode or vanish at startup.
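Both ideas in this paragraph can be sketched in a few lines of plain Python. The example below is an illustration under stated assumptions, not a reference implementation: `is_dead` and `mean_square_after` are hypothetical helpers, the inputs are random numbers in [0, 1), and the "naive" scheme (standard deviation 1 regardless of width) is only a strawman for comparison. The He rule used here is the commonly cited standard deviation $\sqrt{2/\text{fan\_in}}$ for ReLU layers.

```python
import math
import random

def relu(a):
    return a if a > 0 else 0.0

def is_dead(weights, bias, inputs):
    """True if the preactivation is <= 0 for every input, so ReLU passes no gradient."""
    return all(
        sum(w * xi for w, xi in zip(weights, x)) + bias <= 0
        for x in inputs
    )

def he_std(fan_in):
    # He initialization for ReLU layers: std = sqrt(2 / fan_in).
    return math.sqrt(2.0 / fan_in)

def naive_std(fan_in):
    # Strawman: ignore the layer width entirely.
    return 1.0

def mean_square_after(std_fn, width, depth, rng):
    """Push one random vector through `depth` random ReLU layers and
    return the mean squared activation of the last layer."""
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = std_fn(width)
    for _ in range(depth):
        x = [relu(sum(rng.gauss(0.0, std) * xi for xi in x)) for _ in range(width)]
    return sum(v * v for v in x) / width

rng = random.Random(0)
inputs = [[rng.random() for _ in range(4)] for _ in range(100)]

# A large negative bias keeps the preactivation below zero on all inputs.
print("bias -10.0 dead?", is_dead([0.5] * 4, -10.0, inputs))
print("bias +0.1 dead? ", is_dead([0.5] * 4, 0.1, inputs))

he_ms = mean_square_after(he_std, 64, 6, random.Random(1))
naive_ms = mean_square_after(naive_std, 64, 6, random.Random(1))
print("mean square after 6 layers, He init:   ", he_ms)
print("mean square after 6 layers, naive init:", naive_ms)
```

With He scaling the activation magnitude stays on the order of the input's, while the width-agnostic scheme grows by a large factor per layer. A real framework would also offer Xavier scaling, which this sketch leaves out.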

Batch normalization (a later technique) standardizes intermediate activations within each training mini-batch, often stabilizing optimization; it does not remove gradients or linearize the model. Practitioners also watch validation metrics, calibration plots, and failure modes under shift, not only the training loss curve.
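The standardization step can be written out directly. This is a minimal sketch for a single feature in training mode, assuming the usual formulation with a learned scale `gamma`, a learned shift `beta`, and a small `eps` for numerical stability. It omits the running statistics that a full implementation keeps for inference.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (divide-by-n) variance of the batch.
    var = sum((v - mean) ** 2 for v in batch) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

activations = [10.0, 12.0, 14.0, 16.0]
normalized = batch_norm(activations)
shifted = batch_norm(activations, gamma=2.0, beta=3.0)

print("normalized:", normalized)
print("scaled and shifted:", shifted)
```

The output is still a smooth function of the inputs, `gamma`, and `beta`, so gradients continue to flow through the layer, and the nonlinearity that follows keeps the model nonlinear.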

Finally, a decreasing training loss is necessary but not sufficient for a production-ready system. You still need checks for robustness, calibration, bias, and behavior under distribution shift.
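As one concrete example of such a check, calibration can be summarized with an expected calibration error (ECE): group predictions by stated confidence and compare each group's average confidence with its actual accuracy. The sketch below is an illustrative assumption rather than something the card prescribes. The function name, the equal-width binning, and the toy data are all hypothetical choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data: a model that says 0.9 but is right half the time (overconfident),
# and one that says 0.8 and is right 8 times out of 10 (well calibrated).
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)

print("ECE, overconfident model:", overconfident)
print("ECE, calibrated model:   ", calibrated)
```

Both toy models could reach a low training loss, yet only the second one reports confidences that match how often it is right.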

Related cards
Builds on Classification heads and probabilistic outputs · Machine learning
Tasks
Question 1

A ReLU unit is 'dead' (outputs zero and gets ~zero gradient) when:

Hint

Reread the paragraph on dead ReLU units in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Xavier/He weight initialization schemes aim to keep:

Hint

Reread the paragraph on Xavier/He weight initialization in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Batch normalization often helps optimization by:

Hint

Reread the paragraph on batch normalization in this card before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why is 'training loss is decreasing' not enough to call a model production-ready?

Hint

Reread the final paragraph, on why a decreasing training loss is not sufficient, before choosing. Eliminate options that contradict a definition stated in the card.
