Text encodings (UTF-8, Latin-1, Windows-1252)

Beginner Data Science Praktikum
Created by Pavel · 21.03.2026 at 01:05 UTC · 1 completed

You open a CSV from a colleague's Excel on Windows and Genève turns into mojibake. Nothing is wrong with the file on their machine: the disk only ever stores bytes, and those bytes become readable text only once writer and reader agree on an encoding. On the modern web and in most open-source tools, that agreement is usually UTF-8, which can represent every Unicode character and spends just one byte on each ASCII character.
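To make the bytes-versus-text distinction concrete, here is a minimal sketch using only built-in string methods. The variable names are illustrative and not part of the card's starter code.

```python
# The same text, written out under two different encodings.
text = "Genève"
utf8_bytes = text.encode("utf-8")      # è becomes two bytes: 0xC3 0xA8
cp1252_bytes = text.encode("cp1252")   # è becomes one byte: 0xE8

# Reading UTF-8 bytes as cp1252 silently produces mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Reading cp1252 bytes as strict UTF-8 fails loudly instead.
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True

print(utf8_bytes, cp1252_bytes, repr(mojibake), strict_utf8_failed)
```

The two failure modes are asymmetric: one direction gives wrong characters without complaint, the other raises an exception you can catch.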

Legacy exports from Western European Windows often use cp1252 (Windows-1252) instead; trying UTF-8 first and falling back to cp1252 is a classic ingestion pattern. Then there is the seductive trap of Latin-1 (iso-8859-1): it assigns a character to every one of the 256 possible byte values, so decoding never raises UnicodeDecodeError. If the file was actually UTF-8, you get plausible-looking nonsense instead of an honest failure.
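The Latin-1 trap can be checked directly. The sketch below assumes CPython's built-in codecs; it also shows that cp1252 is not simply Latin-1 under another name, since the two disagree in the 0x80 to 0x9F range and Python's cp1252 codec leaves a few byte values undefined.

```python
utf8_bytes = "Genève".encode("utf-8")

# Latin-1 defines all 256 byte values, so this decode cannot raise.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")

# That is why UTF-8 data "decodes" into plausible nonsense without an error.
wrong_but_silent = utf8_bytes.decode("latin-1")

# cp1252 differs from Latin-1 in 0x80-0x9F: byte 0x80 is the euro sign.
euro = b"\x80".decode("cp1252")

# Python's cp1252 codec treats a few bytes (for example 0x81) as undefined.
try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True

print(len(latin1_text), repr(wrong_but_silent), euro, cp1252_rejects_0x81)
```

A practical consequence is that a cp1252 fallback can still raise on rare byte values, whereas a Latin-1 fallback never raises and therefore never warns you.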

When you must move on despite damage, errors='replace' or errors='ignore' trades correctness for progress: 'replace' puts the replacement character U+FFFD where input could not be decoded, while 'ignore' silently drops it. Libraries like charset-normalizer or chardet offer guesses, not guarantees. If the file starts with a UTF-8 BOM, encoding='utf-8-sig' strips it cleanly. Treat encoding as part of the data contract, not an afterthought.
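A short sketch of what these options do to real bytes, using only the standard library. The sample byte strings are made up for illustration.

```python
# cp1252-encoded "Genève", deliberately decoded as UTF-8.
damaged = b"Gen\xe8ve"

# 'replace' marks the damage with U+FFFD; 'ignore' hides it entirely.
replaced = damaged.decode("utf-8", errors="replace")
ignored = damaged.decode("utf-8", errors="ignore")

# A UTF-8 file that starts with a byte order mark (EF BB BF).
with_bom = b"\xef\xbb\xbfcity;country"

# Plain 'utf-8' keeps the BOM as an invisible U+FEFF in the first field.
kept_bom = with_bom.decode("utf-8")

# 'utf-8-sig' removes it, and also accepts input that has no BOM.
stripped = with_bom.decode("utf-8-sig")

print(repr(replaced), repr(ignored), repr(kept_bom), repr(stripped))
```

Of the two error handlers, 'replace' is usually the safer choice for inspection, because the U+FFFD marker stays visible in the output and can be searched for later.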

This complements CSV files and tabular I/O in Python: even with correct quoting and delimiters, a CSV reads wrongly or fails outright if its bytes are decoded with the wrong charset.
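As a rough illustration of how the two concerns interact, the sketch below parses the same in-memory bytes with the standard csv module under two encodings. The helper read_rows, the sample data, and the semicolon delimiter are assumptions made for this example, not part of the card.

```python
import csv
import io

# Hypothetical export: semicolon-delimited and cp1252-encoded.
raw = "city;country\r\nGenève;CH\r\n".encode("cp1252")


def read_rows(data: bytes, encoding: str) -> list:
    """Decode bytes with the given encoding, then parse them as CSV."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(stream, delimiter=";"))


# The right encoding yields clean rows.
rows = read_rows(raw, "cp1252")

# The wrong encoding fails before the CSV settings even matter.
try:
    read_rows(raw, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(rows, utf8_failed)
```

With a real file, the same idea applies by passing encoding= and newline="" to open() before handing the file object to csv.reader.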

Guides: [1], official Unicode HOWTO [2].


Sources

Tasks
Question 1

You open a CSV saved by Excel on a German Windows PC. French city names look wrong when you use encoding='utf-8'. What is a reasonable next encoding to try for Western European text?

Hint

Think about the most common legacy Windows code page in Western Europe.

Question 2

Why can data.decode('latin-1') succeed even when the bytes are actually UTF-8?

Hint

Think about whether any byte value is illegal.

Question 3

Implement try_utf8_then_cp1252(data: bytes) -> str: try strict UTF-8 first; on UnicodeDecodeError, decode with cp1252.

Submit the function only; tests use expression mode.

Hint

Catch UnicodeDecodeError around decode('utf-8').
