Why depth needs nonlinear breaks

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Before the algebra, the narrative starts from a hope about hidden layers: maybe a penultimate layer detects human-interpretable parts (loops, stems, long vertical strokes), and the final layer mixes those parts into digit decisions. A 9 shares a top loop with an 8; a 4 breaks into line segments; recognizing a loop itself decomposes into edge detectors.

That story only works if intermediate features can be recombined nonlinearly. If every layer were a plain matrix multiply (with or without a bias), the whole stack would still collapse to one affine transformation of the input; extra layers would not add new kinds of folding in representation space. Inserting a nonlinearity such as sigmoid, tanh, or ReLU between layers lets later layers build on curved features instead of a single flat mix.
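The collapse can be checked numerically. The sketch below is a minimal illustration, not part of the original card: the weights `W1`, `b1`, `W2`, `b2` are made-up values, and the helper names are hypothetical. It stacks two affine layers, builds the single equivalent layer, and then shows that inserting ReLU between the layers breaks the equivalence.

```python
# Assumed toy weights (illustrative only): 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, 2.0], [-1.0, 1.0]]
b1 = [0.5, -0.5]
W2 = [[2.0, 1.0]]
b2 = [1.0]

def matvec(W, x):
    # Matrix-vector product, one dot product per row of W.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product A @ B using the columns of B.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

def linear_stack(x):
    # Two affine layers with nothing in between.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

def relu_stack(x):
    # Same two layers, with ReLU applied to the hidden units.
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

# The single affine layer equivalent to linear_stack:
# W = W2 @ W1 and b = W2 @ b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def collapsed(x):
    return add(matvec(W, x), b)

x = [3.0, -1.0]
print(linear_stack(x), collapsed(x))  # identical outputs
print(relu_stack(x))                  # differs once ReLU zeroes a hidden unit
```

For this input the second hidden unit is negative before the activation, so ReLU discards it and the output no longer matches any single affine map fitted to the linear stack.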

Width also matters: a shallow network with enough hidden units can approximate many functions in principle, yet deep stacks often reuse intermediate abstractions so the same edge detectors serve many downstream patterns. That is an efficiency story, not a claim that shallow nets are useless.

At the output, class scores are often written as logits: real numbers before any normalization. Softmax (covered next) turns logits into probabilities; in this chapter's visualization, output activations are already squashed into $(0,1)$, but the pedagogical point remains: the network emits comparative evidence per class, not arbitrary unrelated labels. ReLU activations $\max(0,x)$ became the modern default partly because, for positive inputs, they avoid the flat saturated regions of sigmoid/tanh that shrink gradients in very deep stacks.
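The gradient-shrinking effect can be made concrete with a small calculation. The sketch below is a simplified illustration under stated assumptions: it multiplies the same local activation derivative `depth` times and ignores the weight matrices that a real backward pass would also include. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), which never exceeds 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise
    # (the value at exactly 0 is a convention).
    return 1.0 if x > 0 else 0.0

def chained_grad(local_grad, pre_activation, depth):
    # Assumption: every layer sees the same pre-activation and weights are ignored,
    # so the chain-rule product reduces to a power of one local derivative.
    return local_grad(pre_activation) ** depth

for depth in (1, 5, 10):
    print(depth,
          chained_grad(sigmoid_grad, 0.0, depth),  # best case for sigmoid
          chained_grad(relu_grad, 2.0, depth))     # active ReLU unit
```

Even at its steepest point the sigmoid contributes a factor of 0.25 per layer, so the product decays geometrically with depth, while an active ReLU unit contributes a factor of 1. A ReLU unit with negative input contributes 0, which is the trade-off on the other side.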

Tasks
Question 1

Compared with a single wide hidden layer, deep models often:

Hint

Skim the paragraph comparing a single wide hidden layer with deep stacks in "Why depth needs nonlinear breaks" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

ReLU-family activations became popular partly because they:

Hint

Skim the paragraph on why ReLU-family activations became popular in "Why depth needs nonlinear breaks" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

The logits produced just before a softmax are:

Hint

Skim the paragraph on the logits produced just before a softmax in "Why depth needs nonlinear breaks" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

What is one drawback of classical sigmoid activations in very deep networks?

Hint

Skim the paragraph on the drawback of classical sigmoid activations in very deep networks in "Why depth needs nonlinear breaks" before choosing. Eliminate options that contradict a definition stated in the card.
