Text encodings (UTF-8, Latin-1, Windows-1252)
You open a CSV from a colleague's Excel on Windows and Genève turns into mojibake. Nothing is wrong with the file on their machine: the disk only ever stores bytes, and those bytes only become readable text once writer and reader agree on an encoding. On the modern web and in most open-source tools, that agreement is usually UTF-8, which can represent every Unicode character and uses one to four bytes per character (a single byte for plain ASCII).
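To make the "bytes first, text second" point concrete, here is a minimal sketch. The variable names are illustrative only; the calls are the standard `str.encode` and `bytes.decode` methods.

```python
text = "Genève"

# The same text becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # è is stored as two bytes: 0xC3 0xA8
cp1252_bytes = text.encode("cp1252")   # è is stored as one byte: 0xE8

print(utf8_bytes)    # b'Gen\xc3\xa8ve'
print(cp1252_bytes)  # b'Gen\xe8ve'

# Reading UTF-8 bytes with the wrong encoding produces mojibake, not an error.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)      # GenÃ¨ve

# Reading cp1252 bytes as strict UTF-8 fails loudly instead.
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True
print(strict_utf8_failed)  # True
```

Neither byte string is "wrong"; each is only readable with the encoding that produced it.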
Legacy exports from Western European Windows often use cp1252 (Windows-1252) instead; trying UTF-8 first and falling back to cp1252 is a classic ingestion pattern. The order matters: cp1252 text containing accented characters is almost never valid UTF-8, so a failed strict UTF-8 decode is a useful signal to fall back. Then there is the seductive trap of Latin-1 (iso-8859-1): it maps every possible byte value to some character, so decoding in Python never raises UnicodeDecodeError. If the file was actually UTF-8, you get plausible-looking nonsense instead of an honest failure.
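The fallback described above could be sketched as follows. The helper name `decode_with_fallback` is hypothetical, not part of any library, and the sketch assumes the input really is one of the two encodings.

```python
from typing import Tuple


def decode_with_fallback(raw: bytes) -> Tuple[str, str]:
    """Try strict UTF-8 first, then cp1252; return (text, encoding_used).

    Hypothetical helper for illustration only.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Python's cp1252 codec leaves a few byte values undefined (e.g. 0x81),
    # so even the fallback can fail.
    raise ValueError("bytes are neither valid UTF-8 nor cp1252")


print(decode_with_fallback("Genève".encode("utf-8")))   # ('Genève', 'utf-8')
print(decode_with_fallback("Genève".encode("cp1252")))  # ('Genève', 'cp1252')

# The Latin-1 trap: UTF-8 bytes decode without any exception, but wrongly.
latin1_result = "Genève".encode("utf-8").decode("latin-1")
print(latin1_result)  # GenÃ¨ve
```

Returning the encoding that succeeded alongside the text makes it easy to log which files needed the fallback.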
When you must move on despite damage, errors='replace' (which substitutes U+FFFD for each undecodable byte) or errors='ignore' (which silently drops those bytes) trades correctness for progress; libraries like charset-normalizer or chardet offer guesses, not guarantees. If the file starts with a UTF-8 BOM, encoding='utf-8-sig' strips it cleanly. Treat encoding as part of the data contract, not an afterthought.
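A small sketch of the error handlers and the BOM handling mentioned above, using in-memory bytes rather than a real file. The sample data and variable names are made up for illustration; the same `encoding` and `errors` arguments are accepted by the built-in `open()`.

```python
import codecs

damaged = b"Gen\xe8ve"  # cp1252 bytes, which are not valid UTF-8

replaced = damaged.decode("utf-8", errors="replace")
ignored = damaged.decode("utf-8", errors="ignore")
print(repr(replaced))  # 'Gen\ufffdve' (the bad byte becomes U+FFFD)
print(repr(ignored))   # 'Genve' (the bad byte is dropped without a trace)

# A UTF-8 file that starts with a byte order mark (BOM).
with_bom = codecs.BOM_UTF8 + "city\nGenève\n".encode("utf-8")

plain = with_bom.decode("utf-8")        # keeps the BOM as '\ufeff'
stripped = with_bom.decode("utf-8-sig") # removes the BOM if present
print(repr(plain[:5]))     # '\ufeffcity'
print(repr(stripped[:4]))  # 'city'
```

With plain `utf-8`, the leftover `\ufeff` typically ends up glued to the first column name of a CSV, which is why `utf-8-sig` is the safer choice for files that may carry a BOM.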
This complements CSV files and tabular I/O in Python: correct quoting and delimiters still fail if bytes are decoded with the wrong charset.
Guides: [1], official Unicode HOWTO [2].
Card Info
- Topic: Data Science Praktikum
- Difficulty: Beginner