The cost function as a high-dimensional surface

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Chapter 1 showed how a forward pass turns pixels into digit guesses; this chapter asks how those thousands of weights get chosen. Training defines a cost function $C$ that measures the average mismatch between predictions and labels. For classification, that is typically cross-entropy over softmax outputs. The exact formula matters less here than the picture: every parameter setting $\mathbf{w}$ is assigned a height $C(\mathbf{w})$.
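To make "average mismatch" concrete, here is a minimal sketch of a cross-entropy cost over softmax outputs. The scores and labels are made-up illustration values, and the function names `softmax` and `cross_entropy_cost` are hypothetical helpers, not taken from any particular library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_cost(batch_logits, labels):
    """Average of -log(probability assigned to the correct class)."""
    losses = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Hypothetical scores for two examples over three classes.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
good_labels = [0, 2]  # labels agree with the largest scores
bad_labels = [2, 0]   # labels disagree with the largest scores

print(cross_entropy_cost(batch_logits, good_labels))
print(cross_entropy_cost(batch_logits, bad_labels))
```

The first printed cost is smaller than the second: predictions that agree with the labels sit lower on the surface.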

Imagine only two weights so you can draw a surface. Valleys correspond to lower error; ridges and bumps correspond to worse fits. Real networks have millions of coordinates, so the true landscape lives in a space you cannot plot, but the metaphor still guides algorithm design: we want to roll downhill.
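The two-weight picture can be sketched by sampling a toy cost on a grid. The bowl-shaped function below is an assumption chosen for illustration; it is not the cost of any real network.

```python
def cost(w1, w2):
    """Hypothetical bowl-shaped cost with its lowest point at (1.0, -0.5)."""
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2

# Sample the surface on a coarse grid from -2.0 to 2.0 in steps of 0.5.
grid = [i * 0.5 for i in range(-4, 5)]
heights = {(w1, w2): cost(w1, w2) for w1 in grid for w2 in grid}

# The grid point with the smallest height is the bottom of the valley.
lowest = min(heights, key=heights.get)
print(lowest, heights[lowest])
```

A grid search like this only works because there are two weights. With millions of coordinates the number of grid points explodes, which is why training relies on local slope information instead.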

The gradient $\nabla_{\mathbf{w}} C$ collects partial derivatives $\partial C/\partial w_i$. It points in the direction of steepest increase of the cost at the current point. Gradient descent steps opposite that vector:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} C,$$

where the step size $\eta \gt 0$ is called the learning rate.
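The update rule can be sketched in a few lines of plain Python. The cost, its hand-derived gradient, the starting point, and the learning rate of 0.1 are all assumptions picked for illustration.

```python
def cost(w):
    """Hypothetical two-weight cost with its minimum at (1.0, -0.5)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def grad(w):
    """Partial derivatives of the cost above with respect to each weight."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)]

def gradient_descent(w, eta, steps):
    """Repeat w <- w - eta * grad(w) and record the cost after each step."""
    history = [cost(w)]
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        history.append(cost(w))
    return w, history

w_final, history = gradient_descent([-1.5, 2.0], eta=0.1, steps=100)
print(w_final, history[0], history[-1])
```

For this particular cost and learning rate the recorded cost never goes up from one step to the next, and the weights end up close to the minimum. A much larger learning rate would not give the same guarantee.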

A critical point satisfies $\nabla C = \mathbf{0}$ in smooth unconstrained settings. That includes minima, maxima, and saddles; with a sufficiently small step, descent only guarantees local improvement, not a global optimum. The two-weight slice shown is a pedagogical cartoon, not a faithful projection of MNIST weight space.
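A saddle is easy to see on a toy example. The function below is a hypothetical cost chosen only because its gradient vanishes at the origin without the origin being a minimum.

```python
def cost(w):
    """Hypothetical saddle: curves up along w[0] and down along w[1]."""
    return w[0] ** 2 - w[1] ** 2

def grad(w):
    """Partial derivatives of the saddle cost."""
    return [2.0 * w[0], -2.0 * w[1]]

origin = [0.0, 0.0]

# Every partial derivative is zero, so the origin is a critical point.
is_critical = all(g == 0.0 for g in grad(origin))

# Moving along w[0] raises the cost; moving along w[1] lowers it.
uphill = cost([0.1, 0.0]) > cost(origin)
downhill = cost([0.0, 0.1]) < cost(origin)

print(is_critical, uphill, downhill)
```

All three checks hold, so a zero gradient alone does not tell you that you have reached the bottom of a valley.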

Tasks
Question 1

Plain gradient descent decides where to step using:

Hint

Reread the passage on how gradient descent chooses its step in "The cost function as a high-dimensional surface" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

The gradient $\nabla_{\mathbf{w}} C$ points in the direction of:

Hint

Reread the passage on the direction in which the gradient points in "The cost function as a high-dimensional surface" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

A critical point of a smooth unconstrained cost has:

Hint

Reread the passage on critical points of a smooth unconstrained cost in "The cost function as a high-dimensional surface" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Which update rule implements gradient descent with learning rate $\eta \gt 0$?

Hint

Reread the passage on the gradient descent update rule and the learning rate in "The cost function as a high-dimensional surface" before choosing. Eliminate options that contradict a definition stated in the card.
