CSV files and tabular I/O in Python

Beginner Data Science Praktikum

Created by Pavel · 21.03.2026 at 01:05 UTC · 1 completed

Most data still arrives as rows in a text file: someone exports a spreadsheet, a sensor dumps a log, or an open-data portal offers a download. You pull it in, turn strings into numbers, and aggregate. The first trap is treating each line as "split on commas." A free-text column might hold "Zurich, downtown" inside quotes; a naive split breaks that one field in two, so the row gains an extra column and your pipeline quietly drifts.
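
To make the quoting trap concrete, here is a minimal sketch comparing a plain split with the standard library's csv.reader. The sample line and the variable names are made up for illustration.

```python
import csv
import io

# Illustrative line: the second field is quoted and contains a comma.
line = 'Zurich,"Zurich, downtown",12.5'

# Naive approach: the comma inside the quoted field is treated as a delimiter,
# and the quote characters stay attached to the pieces.
naive_fields = line.split(",")

# csv.reader applies the CSV quoting rules and keeps the quoted field whole.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split yields four pieces for a three-column row, while csv.reader returns three fields with the quotes removed.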

The csv module and pandas.read_csv exist because real CSV has rules for quoting and delimiters. Once the table is in memory, a second trap appears: European exports often use commas as decimal separators (1,5 instead of 1.5), so you learn to normalize before casting. That journey, from messy export to trustworthy dataframe, is what this stack is for.
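
One way to normalize decimal commas before casting is sketched below using only the standard library. The export text, the column names, and the helper parse_decimal_comma are hypothetical, and the sketch assumes a semicolon-delimited file with no thousands separators.

```python
import csv
import io

# Hypothetical export: semicolon-delimited, comma as the decimal separator.
export_text = "city;rainfall_mm\nZurich;1,5\nBern;12,25\n"


def parse_decimal_comma(text: str) -> float:
    """Convert a decimal-comma string such as '1,5' to a float.

    Assumes there are no thousands separators in the input.
    """
    return float(text.strip().replace(",", "."))


reader = csv.DictReader(io.StringIO(export_text), delimiter=";")
rainfall = {row["city"]: parse_decimal_comma(row["rainfall_mm"]) for row in reader}

print(rainfall)
```

When reading with pandas instead, read_csv accepts sep and decimal parameters that cover the same case without a manual replace.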

Encoding issues (mojibake) are a different failure mode from bad splitting; the card "Text encodings (UTF-8, Latin-1, Windows-1252)" in this deck covers byte-to-text contracts.
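
As a small illustration of how mojibake differs from a splitting bug, the sketch below decodes the same UTF-8 bytes with two different encodings. The sample word is chosen only for the demonstration.

```python
original = "Zürich"

# The bytes a UTF-8 file would contain for this word.
raw_bytes = original.encode("utf-8")

# Decoding with the wrong encoding does not raise an error here;
# it silently produces different characters (mojibake).
wrong = raw_bytes.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
right = raw_bytes.decode("utf-8")

print(wrong)
print(right)
```

The two-byte UTF-8 sequence for "ü" becomes two separate Latin-1 characters, so the wrongly decoded string is one character longer than the original.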

Further reading: Real Python on CSV and pathlib [1]; read_csv reference [2].

Sources

Tasks

Question 1

You receive a CSV where some text fields contain commas inside double quotes. What is the main reason plain line.split(',') is unsafe for parsing?

Hint

Think about how spreadsheets export free-text columns.

Quoted fields may contain commas that must not start a new column

CSV files never use commas as delimiters

Double quotes are forbidden in the CSV standard

Unicode characters always break split-based parsing

Question 2

Why do data science tutorials often pass encoding='utf-8' (or rely on UTF-8 defaults) when reading CSV?

Hint

Think about accented letters in European place names.

UTF-8 is the only encoding Python supports

UTF-8 avoids many cross-platform mojibake issues for international text

CSV files are always valid UTF-8 by specification

Pandas can only read UTF-8

Question 3

Implement row_numeric_sum(line: str) -> float that parses one CSV data line with two columns name,value (value may be an int or a float) and returns the value as a float. The name contains no commas.

Example: Zurich,12.5 → 12.5.

Submit the full function; tests call it in expression mode.

Hint

Split once from the left so the name stays intact.

def row_numeric_sum(line: str) -> float:
    # TODO: parse the second column as float; name has no commas.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Question 4

Using pandas, implement csv_row_count(csv_text: str) -> int that returns the number of data rows for UTF-8 CSV text in memory. Ignore blank lines and rows whose first non-space character in the first column is #.

Example: city,pop\nZurich,400\n# comment,0\nBern,130\n\n → 2.

Use pd.read_csv with io.StringIO. Submit the full function; tests use expression mode on the Studdyco runner (pandas is installed in the default code sandbox image).
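
For orientation, reading CSV text that is already in memory might look like the sketch below. It only shows the io.StringIO wrapper and the default handling of blank lines; the sample text is illustrative, and the comment filtering the task asks for is deliberately left out.

```python
import io

import pandas as pd

# Illustrative in-memory CSV text with one blank line in the middle.
sample_text = "city,pop\nZurich,400\n\nBern,130\n"

# io.StringIO wraps the string in a file-like object that read_csv can consume.
df = pd.read_csv(io.StringIO(sample_text))

# read_csv skips blank lines by default (skip_blank_lines=True).
print(df)
print(len(df))
```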

Hint

Read first, then filter rows using a boolean mask on the first column.

import io

import pandas as pd


def csv_row_count(csv_text: str) -> int:
    # TODO: count data rows after filtering comments/blank rows.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.