Information of language

Intermediate Information theory, 3Blue1Brown
Created by Best · 07.06.2026 at 20:46 UTC

English letter frequencies are wildly skewed, so a code that assumes twenty-six equally likely letters spends $\log_2 26 \approx 4.7$ bits per character, more than the actual frequencies require. Letters are also not independent: after "th" you expect a vowel, and wider context lowers the information of the next symbol further.
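
As a rough illustration of the gap between the uniform cost and the cost under skewed frequencies, the sketch below measures single-letter entropy on a short sample. The function name `unigram_entropy_bits` and the sample sentence are made up for this example, and a sample this small only hints at real English statistics.

```python
import math
from collections import Counter

def unigram_entropy_bits(text: str) -> float:
    """Entropy in bits per letter of the empirical a-z frequencies in text."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    total = len(letters)
    # Sum of -p * log2(p) over the letters that actually appear.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical sample; any longer English text can be substituted.
sample = "the quick brown fox jumps over the lazy dog and then the other dog"

uniform_bits = math.log2(26)  # cost when all 26 letters are equally likely
print(f"uniform: {uniform_bits:.2f} bits/char")
print(f"unigram: {unigram_entropy_bits(sample):.2f} bits/char")
```

The unigram figure can never exceed the uniform one, because the uniform distribution maximizes entropy over a fixed alphabet.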

Given a predictor that assigns probability $p$ to the symbol that actually occurs, the realized information is $-\log_2 p$ bits. Averaging this over a long sample estimates the entropy rate (strictly, an upper bound that tightens as the predictor improves). Shannon's human guessing experiments suggested about one bit per English character, far below what letter counts alone suggest.
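
The averaging procedure can be sketched with a very simple predictor. The example below assumes a character bigram model with add-one smoothing over a 27-symbol alphabet (letters plus space). The helper names and the toy training text are hypothetical, and the resulting number says nothing about real English.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed alphabet: 26 letters + space

def clean(text):
    """Lowercase the text and drop symbols outside ALPHABET."""
    return "".join(c for c in text.lower() if c in ALPHABET)

def train_bigram(text):
    """Count how often each character follows each previous character."""
    counts = {}
    for prev, cur in zip(text, text[1:]):
        counts.setdefault(prev, Counter())[cur] += 1
    return counts

def prob(counts, prev, cur):
    """Add-one smoothed estimate of P(cur | prev)."""
    row = counts.get(prev, Counter())
    return (row[cur] + 1) / (sum(row.values()) + len(ALPHABET))

def bits_per_char(counts, text):
    """Average of -log2 p over every predicted character in text."""
    pairs = list(zip(text, text[1:]))
    return sum(-math.log2(prob(counts, p, c)) for p, c in pairs) / len(pairs)

# Toy data, chosen only to make the example self-contained.
train = clean("the theory then says that the other thing is there " * 20)
held_out = clean("then the other theory is that thing")

model = train_bigram(train)
print(f"no context: {math.log2(len(ALPHABET)):.2f} bits/char")
print(f"bigram    : {bits_per_char(model, held_out):.2f} bits/char")
```

Scoring text that the model was not trained on matters: averaging over the training text itself would understate the cost.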

Modern tokenized language models report cross-entropy as expected bits per token, but the arithmetic is unchanged: unlikely tokens cost more bits. Approaching the compression limit of language requires a predictor whose probabilities are close to the true generative ones.
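
To see why the predictor must match the true probabilities, the sketch below compares cross-entropy under a matching model and under a poorer one. The three-symbol distribution and both models are invented for illustration; `cross_entropy_bits` is a hypothetical helper, not a library function.

```python
import math

def cross_entropy_bits(p, q):
    """Expected bits per symbol when data follows p but is coded with q."""
    return -sum(pi * math.log2(q[s]) for s, pi in p.items() if pi > 0)

# Invented toy distribution over three symbols.
true_dist = {"a": 0.5, "b": 0.25, "c": 0.25}
good_model = dict(true_dist)                      # matches the source exactly
poor_model = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # ignores the skew

entropy = cross_entropy_bits(true_dist, true_dist)
good = cross_entropy_bits(true_dist, good_model)
poor = cross_entropy_bits(true_dist, poor_model)

print(f"entropy   : {entropy:.3f} bits")  # 1.500
print(f"good model: {good:.3f} bits")     # 1.500
print(f"poor model: {poor:.3f} bits")     # 1.585
```

Cross-entropy equals the entropy only when the model matches the source; any mismatch adds bits, so a worse predictor always yields a higher average cost.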

Related cards
Builds on Information as negative log probability · Information theory, 3Blue1Brown
Next Entropy as average information per symbol · Information theory, 3Blue1Brown
Tasks
Question 1

Context lowers bit cost because:

Question 2

Shannon's human experiments suggested about:

Question 3

A worse predictor yields:

Question 4

How do you estimate entropy from a predictor on a text sample?
