Handling missing data and validating input
In pandas, the everyday moves are to detect, then deliberately decide:
df["score"].isna().sum()       # how many are missing?
df["score"].fillna(0)          # fill them (returns a new Series; df is unchanged)...
df.dropna(subset=["score"])    # ...or drop those rows (returns a new DataFrame)
The word "deliberately" is the lesson. There is no universally correct choice: filling with 0 can drag an average down and pretend a gap is a real value, while dropping rows can bias your sample if the missingness isn't random. You decide per column, and you write down why.
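To make the trade-off concrete, here is a minimal sketch using made-up scores. The values are purely illustrative and not taken from any real dataset. It compares the mean you get after filling with 0 against the mean you get after dropping the incomplete row.

```python
import pandas as pd

# Hypothetical scores; None marks a missing measurement.
df = pd.DataFrame({"score": [80.0, 90.0, None, 70.0]})

n_missing = int(df["score"].isna().sum())

# Option 1: fill with 0, which treats the gap as a real zero score.
mean_filled = df["score"].fillna(0).mean()

# Option 2: drop the row, which averages only the observed scores.
mean_dropped = df.dropna(subset=["score"])["score"].mean()

print(n_missing)     # 1
print(mean_filled)   # 60.0
print(mean_dropped)  # 80.0
```

One missing value moves the average from 80 to 60 under zero-filling. Neither number is "the" answer; which one is defensible depends on why the value is missing.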
Beyond missing values, the broader discipline is validation at the boundary — checking types and ranges as data comes in, so the clean core of your program can assume its inputs are sane. For a small trusted object a dataclass is enough; at a noisy external edge, a library like Pydantic checks every field on the way in.
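As a rough sketch of boundary validation with only the standard library, a dataclass can check its own fields in `__post_init__`. The `ScoreRecord` class, its field names, the 0 to 100 range, and the `parse_record` helper are all assumptions made for illustration. A library such as Pydantic would typically express the same checks more declaratively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical record; field names and the 0-100 range are assumptions."""
    student_id: str
    score: float

    def __post_init__(self):
        # Dataclasses do not enforce type hints, so check types explicitly.
        if not isinstance(self.student_id, str) or not self.student_id:
            raise ValueError("student_id must be a non-empty string")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.score, bool) or not isinstance(self.score, (int, float)):
            raise TypeError("score must be a number")
        # NaN fails both comparisons, so it is rejected as out of range.
        if not (0 <= self.score <= 100):
            raise ValueError("score must be between 0 and 100")


def parse_record(raw: dict) -> ScoreRecord:
    """Validate one raw row at the boundary; a missing key raises KeyError."""
    return ScoreRecord(student_id=raw["student_id"], score=raw["score"])


good = parse_record({"student_id": "a1", "score": 87.5})

try:
    parse_record({"student_id": "a2", "score": "87.5"})  # a string, not a number
    rejected_string = False
except TypeError:
    rejected_string = True

try:
    parse_record({"student_id": "a3", "score": float("nan")})
    rejected_nan = False
except ValueError:
    rejected_nan = True
```

Because the checks run at construction time, any `ScoreRecord` that exists has already passed them, which is what lets the rest of the program skip re-checking.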
Treat incoming data as untrusted until you've validated it, because the alternative is silent corruption: a NaN that turns a NumPy mean into nan (or that pandas quietly skips, since Series.mean defaults to skipna=True), a string where a number belonged, a missing field read as zero. Clean, validated, missing-aware data is the precondition for everything that follows.
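The failure modes named above can each be reproduced in a few lines. This is an illustrative sketch with invented values; the variable names and the `"n/a"` placeholder are assumptions, not part of any particular dataset.

```python
import numpy as np
import pandas as pd

scores = [80.0, 90.0, float("nan")]

# NumPy propagates NaN: one missing value turns the whole mean into nan.
numpy_mean = np.mean(scores)

# pandas skips NaN by default (skipna=True), so the gap is hidden instead.
pandas_mean = pd.Series(scores).mean()

# A string where a number belonged: the column silently becomes dtype object.
mixed = pd.Series([80.0, "n/a", 70.0])

# Coercing makes the bad value an explicit NaN that can be counted.
coerced = pd.to_numeric(mixed, errors="coerce")
n_bad = int(coerced.isna().sum())

# A missing field read as zero: dict.get with a default hides the absence.
raw = {"student_id": "a3"}  # no "score" key at all
silent_zero = raw.get("score", 0)

print(numpy_mean)     # nan
print(pandas_mean)    # 85.0
print(mixed.dtype)    # object
print(n_bad)          # 1
print(silent_zero)    # 0
```

None of these lines raises an error, which is the point: without an explicit check at the boundary, each problem flows straight into later calculations.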
This card leads into “Confusion matrix, precision, and recall”.
Card Info
- Topic: Python for Data Science
- Difficulty: Intermediate