Higher-order vs first-order in deep learning

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

SGD and Adam use first derivatives only. Newton methods incorporate curvature via the Hessian, giving better local step directions but costing far more per iteration. At LLM scale, exact second-order steps are impractical except in tiny subspaces.
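To make the trade-off concrete, here is a minimal sketch comparing a gradient step with a Newton step on a toy two-parameter quadratic. The curvature values `A` and `B`, the learning rate, and the tolerance are made up for illustration; nothing here is specific to any real training setup.

```python
# Toy loss f(w) = 0.5 * (A * w0^2 + B * w1^2) with very different curvatures.
# A and B are illustrative values, not taken from any real model.
A, B = 100.0, 1.0

def grad(w):
    return [A * w[0], B * w[1]]

def gradient_step(w, lr):
    g = grad(w)
    return [w[0] - lr * g[0], w[1] - lr * g[1]]

def newton_step(w):
    # The Hessian is diag(A, B), so applying its inverse divides each
    # gradient component by the curvature along that axis.
    g = grad(w)
    return [w[0] - g[0] / A, w[1] - g[1] / B]

def gd_steps_to_converge(w, lr, tol=1e-6, max_steps=100_000):
    # Count plain gradient steps until every coordinate is within tol of 0.
    steps = 0
    while max(abs(x) for x in w) > tol and steps < max_steps:
        w = gradient_step(w, lr)
        steps += 1
    return steps

w0 = [1.0, 1.0]
newton_result = newton_step(w0)               # lands on the minimum in one step
gd_steps = gd_steps_to_converge(w0, lr=0.01)  # lr must stay below 2 / A to be stable
```

On a quadratic, the Newton step rescales each direction by its curvature and reaches the minimum immediately, while gradient descent is limited by the steepest direction and needs more than a thousand steps along the flat one. The catch is that the diagonal Hessian here is free to invert; a dense Hessian over billions of parameters is not.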

The Gauss-Newton approximation exploits least-squares structure: it keeps the Jacobian product J^T J and drops the second-derivative terms that are weighted by the residuals. L-BFGS stores a short history of parameter and gradient differences (low-rank secant information) to approximate inverse curvature without forming full matrices.
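As a small illustration of the Gauss-Newton idea, the sketch below fits a single parameter `k` in the model y = exp(k * x). The data, the model, and the function names are hypothetical choices for this example only.

```python
import math

# Hypothetical noise-free data generated from y = exp(0.5 * x).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.5 * x) for x in xs]

def residuals(k):
    # r_i(k) = model prediction minus target
    return [math.exp(k * x) - y for x, y in zip(xs, ys)]

def jacobian(k):
    # d r_i / d k = x_i * exp(k * x_i)
    return [x * math.exp(k * x) for x in xs]

def gauss_newton_step(k):
    r, J = residuals(k), jacobian(k)
    jtr = sum(j * ri for j, ri in zip(J, r))  # gradient of 0.5 * sum(r_i^2)
    jtj = sum(j * j for j in J)               # Gauss-Newton curvature J^T J
    # The exact Hessian would also include sum_i r_i * d^2 r_i / dk^2.
    # Gauss-Newton drops that residual-weighted term.
    return k - jtr / jtj

k = 0.0
for _ in range(20):
    k = gauss_newton_step(k)
```

Because the dropped term is scaled by the residuals, the approximation is most accurate when the fit is good, which is the case in this noise-free example.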

Research approximations such as K-FAC factor curvature blockwise, but most transformer training still relies on AdamW, warmup, gradient clipping, and mixed precision rather than exact Newton steps.
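The first-order recipe named above can be sketched in plain Python on a toy problem. This is an illustrative simplification, not a production optimizer: the hyperparameters, the linear warmup shape, and the toy loss are assumptions, and mixed precision is left out entirely.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

def warmup_lr(step, base_lr, warmup_steps):
    # Linear warmup to base_lr, then constant (real schedules usually decay later).
    return base_lr * min(1.0, step / warmup_steps)

def adamw_step(params, grads, m, v, step, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # m and v are running moment estimates, updated in place.
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** step)   # bias correction
        v_hat = v[i] / (1 - beta2 ** step)
        p = p - lr * weight_decay * p        # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
    return new_params

# Toy loss 0.5 * sum(p^2), whose gradient is simply p.
params = [5.0, -3.0]
m, v = [0.0, 0.0], [0.0, 0.0]
for step in range(1, 201):
    grads = clip_by_global_norm(list(params), max_norm=1.0)
    lr = warmup_lr(step, base_lr=0.1, warmup_steps=20)
    params = adamw_step(params, grads, m, v, step, lr)
```

Every operation here is elementwise over the parameters, so memory and compute scale linearly with model size. Note that the second-moment estimate `v` is built from squared gradients, not from second derivatives.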

Second-order methods remain rare at frontier scale because Hessian memory, which grows quadratically with parameter count, and per-step linear algebra dominate budgets unless heavily approximated.
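A quick back-of-the-envelope calculation shows the scale of the memory problem. The 7-billion-parameter model size and the 2 bytes per entry are hypothetical figures chosen only to make the arithmetic concrete.

```python
def dense_hessian_bytes(num_params, bytes_per_entry=2):
    # A dense Hessian has one entry per pair of parameters: n * n entries.
    return num_params * num_params * bytes_per_entry

def gradient_bytes(num_params, bytes_per_entry=2):
    # A gradient has one entry per parameter.
    return num_params * bytes_per_entry

n = 7_000_000_000                    # hypothetical 7B-parameter model
hessian = dense_hessian_bytes(n)     # 9.8e19 bytes, about 98 exabytes
gradient = gradient_bytes(n)         # 1.4e10 bytes, about 14 gigabytes
ratio = hessian // gradient          # the Hessian is n times larger
```

Exploiting symmetry would halve the Hessian figure, which does not change the conclusion.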

First-order methods tolerate minibatch noise well, whereas Hessian estimates built from minibatches are poor at scale anyway, which partly explains why AdamW plus careful engineering dominates transformer pretraining pipelines.

Newton steps also require solving linear systems involving curvature; that cost dwarfs a single gradient pass when parameter counts reach billions, even before accounting for noisy minibatch estimates of the Hessian.
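Quasi-Newton methods such as L-BFGS sidestep the linear solve by applying an approximate inverse Hessian directly to the gradient. The sketch below shows the standard two-loop recursion in plain Python; the helper names and the list-based vectors are illustrative choices, and a practical version would also need a line search and a bounded history.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, s_list, y_list):
    """Approximate H^{-1} grad from stored secant pairs, oldest pair first.

    s_k = w_{k+1} - w_k and y_k = grad_{k+1} - grad_k.
    """
    q = list(grad)
    alphas = []
    # First loop: walk the history from the newest pair to the oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        alphas.append((rho, alpha))
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    s, y = s_list[-1], y_list[-1]
    gamma = dot(s, y) / dot(y, y)
    r = [gamma * qi for qi in q]
    # Second loop: walk the history from the oldest pair to the newest.
    for (s, y), (rho, alpha) in zip(zip(s_list, y_list), reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r

# Toy check on a quadratic with Hessian diag(100, 1), where y = H s exactly.
s_pairs = [[1.0, 0.0], [0.0, 1.0]]
y_pairs = [[100.0, 0.0], [0.0, 1.0]]
direction = lbfgs_direction([100.0, 1.0], s_pairs, y_pairs)
```

With m stored pairs over n parameters, the recursion needs only dot products and vector updates, so storage and work are proportional to m times n and no n-by-n matrix is ever formed or factored.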

When you do use curvature, it is usually through diagonal or block-diagonal approximations that ignore cross-parameter coupling, another reminder that exact second-order methods are ideals rather than defaults.
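The cost of ignoring cross-parameter coupling can be seen on a two-parameter quadratic. The Hessian entries below are hypothetical and chosen so that the off-diagonal coupling is strong.

```python
# Hypothetical 2x2 Hessian with strong coupling between the two parameters.
H = [[2.0, 1.8],
     [1.8, 2.0]]

def grad(w):
    # Gradient of f(w) = 0.5 * w^T H w.
    return [H[0][0] * w[0] + H[0][1] * w[1],
            H[1][0] * w[0] + H[1][1] * w[1]]

def full_newton_step(w):
    g = grad(w)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    # Explicit 2x2 inverse of H applied to the gradient.
    d0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    d1 = (-H[1][0] * g[0] + H[0][0] * g[1]) / det
    return [w[0] - d0, w[1] - d1]

def diagonal_step(w):
    g = grad(w)
    # Keep only the diagonal entries; the off-diagonal coupling is ignored.
    return [w[0] - g[0] / H[0][0], w[1] - g[1] / H[1][1]]

w0 = [1.0, 1.0]
w_full = full_newton_step(w0)  # approximately [0.0, 0.0]
w_diag = diagonal_step(w0)     # approximately [-0.9, -0.9]
```

The full Newton step reaches the minimum, while the diagonal step overshoots to the other side and only shrinks the error by about 10 percent. The diagonal version, however, needs one division per parameter and no matrix storage, which is why it is the variant that scales.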

Treat second-order intuition as a lens on loss geometry, not a recipe you must implement to train MNIST.

Related cards
Builds on Log-space and stabilization patterns · Machine learning
Tasks
Question 1

The Gauss-Newton approximation simplifies the Hessian for least-squares-type problems by:

Hint

Skim the paragraph on how the Gauss-Newton approximation simplifies the Hessian before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

L-BFGS approximates curvature by storing:

Hint

Skim the paragraph on how L-BFGS approximates curvature before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Most large transformer training in practice relies on:

Hint

Skim the paragraphs on what most large transformer training relies on in practice before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why are exact second-order methods rare at LLM scale?

Hint

Skim the paragraphs on why exact second-order methods are rare at LLM scale before choosing. Eliminate options that contradict a definition stated in the card.
