Handling missing data and validating input

Intermediate Python for Data Science
Created by Best · 24.06.2026 at 14:03 UTC

In pandas, the everyday moves are to detect missing values, then deliberately decide what to do with them:

n_missing = df["score"].isna().sum()   # how many are missing?
filled = df["score"].fillna(0)         # fill them (returns a new Series)...
kept = df.dropna(subset=["score"])     # ...or drop those rows (returns a new DataFrame)

The word "deliberately" is the lesson. There is no universally correct choice: filling with 0 can drag an average down and pretend a gap is a real value, while dropping rows can bias your sample if the missingness isn't random. You decide per column, and you write down why.
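To make the cost of filling with 0 concrete, here is a minimal sketch. The scores are made up for illustration, and it assumes pandas is installed; it compares the mean pandas reports when it skips the gap with the mean after the gap has been filled with 0.

```python
import pandas as pd

# Hypothetical scores; None marks a measurement that was never taken.
scores = pd.Series([80.0, 90.0, None, 70.0], name="score")

# Series.mean() skips NaN by default, so only the three real values count.
mean_skipping_gaps = scores.mean()

# After fillna(0) the gap is treated as a real score of 0 and counts as a fourth value.
mean_zero_filled = scores.fillna(0).mean()

print(mean_skipping_gaps)  # 80.0
print(mean_zero_filled)    # 60.0
```

Neither number is automatically the right one. The point is that the choice changes the result, so it belongs in your notes next to the column it applies to.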

Beyond missing values, the broader discipline is validation at the boundary: checking types and ranges as data comes in, so the clean core of your program can assume its inputs are sane. For a small trusted object a dataclass is enough; at a noisy external edge, a library like Pydantic validates each declared field on the way in.
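One way to sketch boundary validation with only the standard library is a dataclass that checks itself on construction. The class name, field names, and the 0 to 100 range below are assumptions made for illustration, not part of the card; a Pydantic model would play the same role at a noisier edge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """A validated record. Field names and the score range are hypothetical."""
    student_id: str
    score: Optional[float]  # None means "missing" and stays distinct from 0

    def __post_init__(self) -> None:
        if not isinstance(self.student_id, str) or not self.student_id:
            raise ValueError("student_id must be a non-empty string")
        if self.score is None:
            return  # a missing score is allowed, but it stays explicitly missing
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.score, bool) or not isinstance(self.score, (int, float)):
            raise TypeError("score must be a number or None")
        # NaN fails this comparison too, so it is rejected here as well.
        if not 0 <= self.score <= 100:
            raise ValueError("score must be between 0 and 100")


def parse_measurement(raw: dict) -> Measurement:
    """Boundary function: turn an untrusted dict into a validated Measurement."""
    return Measurement(student_id=raw.get("student_id"), score=raw.get("score"))


ok = parse_measurement({"student_id": "s1", "score": 87.5})
gap = parse_measurement({"student_id": "s2"})  # score is None, not 0
```

Code past this function can rely on every Measurement having a usable id and a score that is either in range or explicitly None.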

Treat incoming data as untrusted until you've validated it, because the alternative is silent corruption: a NaN that turns a plain Python or NumPy mean into nan, a string where a number belonged, a missing field read as zero. Clean, validated, missing-aware data is the precondition for everything that follows.
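The three failure modes named above can each be reproduced in a few lines of plain Python. This is a small sketch with made-up values; the thing to notice is that none of these lines raises an error.

```python
import math

# 1. A single NaN poisons a hand-rolled mean.
values = [4.0, 5.0, math.nan]
naive_mean = sum(values) / len(values)
print(math.isnan(naive_mean))  # True

# 2. A string where a number belonged: "doubling" it repeats the text.
raw_score = "5"
doubled = raw_score * 2
print(doubled)  # 55

# 3. A missing field read as zero through a convenient default.
record = {"student_id": "s1"}    # hypothetical record; "score" never arrived
score = record.get("score", 0)
print(score)  # 0
```

Every one of these results flows on into later calculations looking like ordinary data, which is why the check has to happen where the data comes in.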

Related cards
Builds on Reading JSON and the many faces of missing · Python for Data Science
Next Confusion matrix, precision, and recall · Python for Data Science
Tasks
Question 1

Why is filling missing numeric values with 0 a decision to make carefully?

Question 2

What does "validating data at the boundary" mean?

Question 3

Select every honest way to handle a missing measurement.

Select all that apply.