Classification heads and probabilistic outputs

Beginner Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

The output layer presents ten competing scores, one per digit. For probabilistic training we often write those scores as a vector of logits $\mathbf{z}\in\mathbb{R}^{10}$ and apply softmax:

$$\mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}.$$

Softmax outputs are strictly positive and sum to 1, so they form a valid categorical distribution over the classes.
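A minimal sketch of the formula above in plain Python. The logit values are made up for illustration, and the function name `softmax` is our own choice. Subtracting the largest logit before exponentiating is a common numerical safeguard that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Return softmax probabilities for a list of real-valued logits."""
    # Subtract the max logit so exp() never overflows; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the ten digit classes 0-9 (illustrative values only).
logits = [1.2, -0.3, 0.5, 3.1, 0.0, -1.4, 0.8, 2.2, -0.6, 0.1]
probs = softmax(logits)

print(all(p > 0 for p in probs))      # every probability is positive
print(abs(sum(probs) - 1.0) < 1e-9)   # probabilities sum to 1
print(probs.index(max(probs)))        # prints 3, the class with the largest logit
```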

Adding the same constant $c$ to every logit multiplies numerator and denominator by the same factor $e^{c}$, so softmax is unchanged; only relative differences between logits matter. Multiplying all logits by a positive scalar preserves their order, so argmax predictions stay the same even though the probabilities change: a scalar greater than 1 sharpens the distribution toward the top class, and a scalar between 0 and 1 flattens it toward uniform.
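The two properties can be checked numerically. This is a small sketch with hypothetical logits; `softmax` and `argmax` are helper names introduced here, not part of the card.

```python
import math

def softmax(logits):
    m = max(logits)  # stabilising shift, does not change the output
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

z = [2.0, 1.0, 0.1]                 # hypothetical logits
p = softmax(z)
p_shifted = softmax([v + 5.0 for v in z])   # add the same constant to every logit
p_sharper = softmax([v * 3.0 for v in z])   # scale by a factor > 1
p_flatter = softmax([v * 0.5 for v in z])   # scale by a factor in (0, 1)

# Shift invariance: probabilities agree up to floating-point rounding.
shift_invariant = all(abs(a - b) < 1e-12 for a, b in zip(p, p_shifted))

# Positive scaling keeps the winner but changes how confident the output looks.
same_winner = argmax(p) == argmax(p_sharper) == argmax(p_flatter)

print(shift_invariant, same_winner)
print(max(p_flatter) < max(p) < max(p_sharper))
```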

Training with cross-entropy compares the softmax probabilities to a one-hot label vector and pushes probability mass onto the correct class while suppressing the others. The gradient of the loss with respect to the logits is the softmax output minus the one-hot label, so it is largest when the model is confidently wrong. In the digit demo, the brightest output activation after a forward pass is the network's guess; softmax training makes that competition explicit as a probability vector rather than an informal brightness contest.
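To make the gradient claim concrete, here is a sketch for a single example with a one-hot label. It relies on the standard result that the gradient of softmax cross-entropy with respect to the logits is $\mathbf{p} - \mathbf{y}$. The three-class logits and the function names are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Loss -log p[label], computed via log-sum-exp for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def grad_wrt_logits(logits, label):
    """Gradient of the loss with respect to each logit: softmax minus one-hot."""
    p = softmax(logits)
    return [p_k - (1.0 if k == label else 0.0) for k, p_k in enumerate(p)]

label = 0                            # true class
confident_right = [6.0, 0.0, 0.0]    # true class has the largest logit
confident_wrong = [0.0, 6.0, 0.0]    # a wrong class has the largest logit

g_right = grad_wrt_logits(confident_right, label)
g_wrong = grad_wrt_logits(confident_wrong, label)

print(cross_entropy(confident_right, label), max(abs(g) for g in g_right))
print(cross_entropy(confident_wrong, label), max(abs(g) for g in g_wrong))
```

In the confidently wrong case the true-class entry of the gradient is close to -1 and the wrongly favoured class is close to +1, so a gradient-descent step raises the first logit and lowers the second. In the confidently right case every entry is near zero.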

Edge case: softmax of the all-zero vector of length $K$ is the uniform distribution, with every class probability equal to $1/K$. This is the maximum-uncertainty output, and it arises whenever all logits tie.
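A quick check of this edge case, assuming $K = 10$ as in the digit example. Entropy is used here as one common way to quantify "maximum uncertainty"; that choice is ours, not the card's.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

K = 10
uniform = softmax([0.0] * K)   # all-zero logits
tied = softmax([4.2] * K)      # any tied logits give the same result

# Shannon entropy in nats; for a uniform distribution it equals log(K).
entropy = -sum(p * math.log(p) for p in uniform)

print(all(abs(p - 1.0 / K) < 1e-12 for p in uniform))
print(all(abs(a - b) < 1e-12 for a, b in zip(uniform, tied)))
print(abs(entropy - math.log(K)) < 1e-9)
```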

Tasks
Question 1

Softmax is unchanged when you add the same constant to every logit because:

Hint

Skim the paragraph on adding the same constant to every logit in Classification heads and probabilistic outputs before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

The argmax prediction after softmax is unchanged if you:

Hint

Skim the paragraph on which changes to the logits leave the argmax prediction unchanged in Classification heads and probabilistic outputs before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

The cross-entropy gradient with respect to the logits pushes the model to:

Hint

Skim the paragraph on the cross-entropy gradient with respect to the logits in Classification heads and probabilistic outputs before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Softmax applied to the all-zero logit vector (length $K$) gives:

Hint

Skim the paragraph on softmax applied to the all-zero logit vector in Classification heads and probabilistic outputs before choosing. Eliminate options that contradict a definition stated in the card.

Card Info
  • Topic: Machine learning
  • Difficulty: Beginner
Creator
Best
BestBuddy