Text encodings (UTF-8, Latin-1, Windows-1252)

Beginner Data Science Praktikum

Created by Pavel · 21.03.2026 at 01:05 UTC · 1 completed

You open a CSV exported from a colleague's Excel on Windows, and Genève turns into mojibake. Nothing is wrong with the file on their machine: a disk only ever stores bytes, and those bytes become readable text only once writer and reader agree on an encoding. On the modern web and in most open-source tools, that agreement is usually UTF-8, which can represent every Unicode code point and stays compact by using a single byte for ASCII characters.
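To make the "bytes plus agreement" idea concrete, here is a minimal sketch. The variable names are illustrative and not part of the card; the word is the example from the paragraph above.

```python
# Illustrative names only; "Genève" is the example word used in the text.
word = "Genève"

# The same text becomes different bytes depending on the encoding.
utf8_bytes = word.encode("utf-8")      # è is stored as two bytes: 0xC3 0xA8
cp1252_bytes = word.encode("cp1252")   # è is stored as one byte: 0xE8

print(utf8_bytes)    # b'Gen\xc3\xa8ve'
print(cp1252_bytes)  # b'Gen\xe8ve'

# Reading UTF-8 bytes under the wrong agreement gives mojibake, not an error.
mojibake = utf8_bytes.decode("cp1252")   # 'GenÃ¨ve'

# Reading them with the encoding used for writing restores the text.
restored = utf8_bytes.decode("utf-8")    # 'Genève'
```

The bytes never changed between the last two lines; only the reader's assumption did.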

Legacy exports from Western European Windows often use cp1252 (Windows-1252) instead; trying UTF-8 first and falling back to cp1252 is a common ingestion pattern. Then there is the seductive trap of Latin-1 (iso-8859-1): it maps every possible byte value to some character, so decoding never raises UnicodeDecodeError. If the file was actually UTF-8, you get plausible-looking nonsense instead of an honest failure.
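The Latin-1 trap can be sketched as follows. This is an illustration with made-up variable names, not a recommended ingestion routine; it only contrasts which decoders can fail.

```python
# Latin-1 assigns a character to each of the 256 byte values, so it never fails.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
print(len(decoded))  # 256

# UTF-8 bytes read as Latin-1: no exception, but the text is wrong.
utf8_bytes = "Genève".encode("utf-8")
silently_wrong = utf8_bytes.decode("latin-1")   # 'GenÃ¨ve'

# Strict UTF-8 on cp1252 bytes fails loudly, which is the honest outcome.
try:
    "Genève".encode("cp1252").decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Python's cp1252 codec leaves a few byte values (such as 0x81) undefined,
# so unlike Latin-1 it can still raise.
try:
    b"\x81".decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
```

A decoder that can fail is useful as a check; one that cannot fail tells you nothing about whether the guess was right.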

When you must move on despite damage, errors='replace' (which substitutes U+FFFD for each undecodable sequence) or errors='ignore' (which silently drops it) trades correctness for progress; libraries like charset-normalizer or chardet offer guesses, not guarantees. If the file starts with a UTF-8 BOM, encoding='utf-8-sig' strips it on read. Treat encoding as part of the data contract, not an afterthought.
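A short sketch of the error handlers and the BOM-aware codec mentioned above, using only the standard library. The sample bytes and variable names are assumptions chosen for illustration.

```python
import codecs

# cp1252 bytes for "Genève", wrongly treated as UTF-8 (illustrative input).
damaged = b"Gen\xe8ve"

# 'replace' marks the damage with U+FFFD; 'ignore' hides it entirely.
replaced = damaged.decode("utf-8", errors="replace")   # 'Gen\ufffdve'
ignored = damaged.decode("utf-8", errors="ignore")     # 'Genve'

# A UTF-8 file that starts with a byte order mark (EF BB BF).
with_bom = codecs.BOM_UTF8 + "Genève".encode("utf-8")

# Plain 'utf-8' keeps the BOM as an invisible U+FEFF at the start.
kept = with_bom.decode("utf-8")          # '\ufeffGenève'

# 'utf-8-sig' removes the BOM if present and otherwise behaves like UTF-8.
stripped = with_bom.decode("utf-8-sig")  # 'Genève'
```

Of the two handlers, 'replace' at least leaves a visible marker where data was lost, which makes later auditing easier.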

This complements CSV files and tabular I/O in Python: correct quoting and delimiters still fail if bytes are decoded with the wrong charset.
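As a hedged illustration of that point, the sketch below parses the same semicolon-delimited data twice with the standard csv module. The helper read_rows and the sample data are hypothetical, not taken from the card.

```python
import csv
import io


def read_rows(data: bytes, encoding: str) -> list:
    """Hypothetical helper: decode bytes with the given encoding, then parse CSV."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text, delimiter=";"))


# Sample data written as UTF-8 (an assumption for this example).
raw = "city;country\r\nGenève;CH\r\n".encode("utf-8")

# Correct charset: delimiter handling and cell contents are both right.
good = read_rows(raw, "utf-8")     # [['city', 'country'], ['Genève', 'CH']]

# Wrong charset: rows and columns still split correctly, but the cell is mojibake.
bad = read_rows(raw, "cp1252")     # [['city', 'country'], ['GenÃ¨ve', 'CH']]
```

The structure of the table survives the wrong charset, which is exactly why this kind of error is easy to miss.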

Guides: [1], official Unicode HOWTO [2].

Sources


Tasks

Question 1

You open a CSV saved by Excel on a German Windows PC. French city names look wrong when you use encoding='utf-8'. What is a reasonable next encoding to try for Western European text?

Hint

Think about the most common legacy Windows code page in Western Europe.

ascii

cp1252

utf-32

rot13

Question 2

Why can data.decode('latin-1') succeed even when the bytes are actually UTF-8?

Hint

Think about whether any byte value is illegal.

Latin-1 verifies Unicode normalization before decoding

Latin-1 maps every byte to a character, so decoding never fails

Python converts Latin-1 to UTF-8 automatically

Latin-1 rejects all non-European bytes

Question 3

Implement try_utf8_then_cp1252(data: bytes) -> str: try strict UTF-8 first; on UnicodeDecodeError, decode with cp1252.

Submit the function only; tests use expression mode.

Hint

Catch UnicodeDecodeError around decode('utf-8').

def try_utf8_then_cp1252(data: bytes) -> str:
    # TODO: try UTF-8 first, then fallback to cp1252.
    pass

Starter code is prefilled; replace TODO blocks with your solution.


2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.
