Handling missing data: NaN, fillna, and dropna

Beginner Data Science Praktikum

Created by Pavel · 03.04.2026 at 11:49 UTC · 1 completed

Real datasets have holes. A sensor goes offline, a respondent skips a question, a join finds no match — and pandas fills those gaps with NaN (Not a Number), a special IEEE 754 float value that propagates silently through arithmetic. 10 + NaN is NaN, NaN > 5 is False, and NaN == NaN is also False — which is why you use df.isna() instead of == to detect it.
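The comparison rules above are easy to check interactively. A minimal sketch, using a made-up three-element Series purely for illustration:

```python
import numpy as np
import pandas as pd

nan = np.nan

# Arithmetic with NaN yields NaN.
total = 10 + nan

# Ordering and equality comparisons that involve NaN evaluate to False.
greater = nan > 5
equal = nan == nan

# Illustrative data: one missing value in the middle.
s = pd.Series([1.0, nan, 3.0])

# == never matches NaN, so this mask misses the gap entirely.
mask_eq = (s == nan).tolist()

# isna() is the reliable detector.
mask_isna = s.isna().tolist()

print(total, greater, equal)
print(mask_eq)
print(mask_isna)
```

The == mask comes back all False even though the Series has a gap, which is exactly the trap isna() avoids.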

df.isna().sum() is usually the first line you run on a new dataset: it gives you missing counts per column, and from there you decide strategy. dropna() is the blunt instrument — df.dropna() removes any row with at least one NaN, which can silently halve your dataset if missingness is spread across many columns. dropna(subset=['age']) is more surgical, only looking at one column.
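To see how quickly a blanket dropna() eats rows, here is a small sketch. The survey-style columns and values are invented for illustration; each row is missing at most one value, yet most rows still disappear.

```python
import numpy as np
import pandas as pd

# Hypothetical data: missingness is spread thinly across three columns.
df = pd.DataFrame({
    "age": [31, np.nan, 47, 52, 29, 38],
    "income": [np.nan, 4200, 3900, np.nan, 5100, 4700],
    "city": ["Graz", "Linz", None, "Wien", "Wien", "Graz"],
})

# Missing count per column: the usual first look at a new dataset.
missing_per_column = df.isna().sum()

rows_total = len(df)
# Drops every row that has a NaN in any column.
rows_after_dropna = len(df.dropna())
# Drops only rows where 'age' itself is missing.
rows_after_subset = len(df.dropna(subset=["age"]))

print(missing_per_column)
print(rows_total, rows_after_dropna, rows_after_subset)
```

No column here is missing more than two values, but the unrestricted dropna() keeps only two of the six rows, while the subset version keeps five.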

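fillna() replaces NaN with something: a constant (fillna(0)), the column mean (df['col'].fillna(df['col'].mean())), or, for time series where the previous reading is a reasonable stand-in, the last known value. For that last case use df['col'].ffill(); the older spelling fillna(method='ffill') is deprecated in pandas 2.1 and later. A more sophisticated approach fills with the group mean rather than the global mean, which preserves distributional differences between groups: df['value'].fillna(df.groupby('category')['value'].transform('mean')). The transform('mean') call only computes a per-row group mean; it still has to be passed to fillna() to fill anything.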
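The fill strategies can be compared side by side on a toy frame. This is a sketch with invented category and value data; forward fill is written as ffill(), the method form that current pandas versions recommend.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with very different value ranges.
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan, 20.0],
})

# Constant fill.
filled_constant = df["value"].fillna(0)

# Global mean fill; mean() skips NaN by default.
filled_global_mean = df["value"].fillna(df["value"].mean())

# Forward fill: carry the last known value forward.
filled_ffill = df["value"].ffill()

# Group mean fill: transform('mean') returns one group mean per row,
# aligned with the original index, so it can be passed to fillna().
group_means = df.groupby("category")["value"].transform("mean")
filled_group_mean = df["value"].fillna(group_means)

print(pd.DataFrame({
    "constant": filled_constant,
    "global_mean": filled_global_mean,
    "ffill": filled_ffill,
    "group_mean": filled_group_mean,
}))
```

With the global mean, both gaps receive the same value even though the groups sit in different ranges; the group mean gives each gap a value typical for its own category.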

The key decision — drop, fill with constant, fill with statistic, or interpolate — affects every downstream number. Document the choice; your future self will thank you when the model behaves oddly six months later.
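How much the choice matters can be shown by computing one downstream number, the mean, under each strategy. A rough sketch on made-up sensor readings; interpolate() is used with its default linear method, which treats the rows as evenly spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-step gap.
readings = pd.Series([10.0, np.nan, np.nan, 16.0, 40.0])

strategies = {
    "drop": readings.dropna(),
    "fill_zero": readings.fillna(0),
    "fill_mean": readings.fillna(readings.mean()),
    "ffill": readings.ffill(),
    # Linear interpolation between the neighbouring known values.
    "interpolate": readings.interpolate(),
}

# The same statistic, computed after each strategy.
means = {name: float(s.mean()) for name, s in strategies.items()}

for name, value in means.items():
    print(f"{name:12s} {value:.1f}")
```

The five strategies yield four different means from the same five readings, which is the practical reason to write down which one was used.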

Missing data guide: [1].

Sources

[1] https://pandas.pydata.org/docs/user_guide/missing_data.html

Tasks

Question 1

What does this code print?

import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [25, np.nan, 35, np.nan, 45]})
print(df.dropna(subset=['age']).shape[0])

Hint

dropna(subset=['age']) removes rows where age is NaN. Count the non-NaN values.

Question 2

What does df['x'].fillna(df['x'].mean()) do to the values that are not NaN?

Hint

fillna only touches cells that are NaN.

Replaces them with the mean as well

Leaves them unchanged — only NaN cells are replaced

Converts them to float even if they were integer

Raises ValueError if any non-NaN values exist

Question 3

Using pandas, implement fill_age_mean(csv_text: str) -> list. The function reads CSV text with the columns name and age (some ages may be missing), fills each missing age with the mean of the non-missing ages, rounded to 1 decimal place, and returns the age column as a list of floats.

Example: name,age\nAlice,25.0\nBob,\nCharlie,35.0 → [25.0, 30.0, 35.0].

Submit the function; tests use expression mode.

Hint

df['age'].mean() ignores NaN by default; pass that value into fillna().

import io
import pandas as pd


def fill_age_mean(csv_text: str) -> list:
    # TODO: read CSV, fill missing ages with column mean, return list.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.
