Handling missing data: NaN, fillna, and dropna
Real datasets have holes. A sensor goes offline, a respondent skips a question, a join finds no match — and pandas fills those gaps with NaN (Not a Number), a special IEEE 754 float value that propagates silently through arithmetic. 10 + NaN is NaN, NaN > 5 is False, and NaN == NaN is also False — which is why you use df.isna() instead of == to detect it.
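The comparison rules above are easy to check directly. The following is a minimal sketch; the `readings` Series and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan

# NaN propagates through arithmetic.
total = 10 + nan

# Comparisons involving NaN evaluate to False, including equality with itself.
greater = nan > 5
equal_to_self = nan == nan

# Hypothetical sensor readings with one gap (illustrative values only).
readings = pd.Series([1.5, np.nan, 3.0])

# Elementwise == never matches NaN, while isna() flags the gap.
via_equality = (readings == np.nan).tolist()
via_isna = readings.isna().tolist()

print(total, greater, equal_to_self)
print(via_equality)  # [False, False, False]
print(via_isna)      # [False, True, False]
```

Because the equality check returns False everywhere, a filter written with == would quietly find no missing values at all.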
df.isna().sum() is usually the first line you run on a new dataset: it gives you missing counts per column, and from there you decide on a strategy. dropna() is the blunt instrument: df.dropna() removes any row with at least one NaN, which can silently halve your dataset if missingness is spread across many columns. dropna(subset=['age']) is more surgical: it drops only the rows where age is missing and ignores gaps in every other column.
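As a rough illustration of how much the two dropna() variants can differ, here is a small sketch. The DataFrame, its column names other than 'age', and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; values are invented for illustration.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, 48000],
    "city": ["Berlin", None, "Hamburg", "Munich"],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Drop every row that has at least one missing value.
complete_rows = df.dropna()

# Drop only the rows where 'age' is missing.
rows_with_age = df.dropna(subset=["age"])

print(missing_per_column)
print(len(df), len(complete_rows), len(rows_with_age))  # 4 2 3
```

Each column here is missing only one value, yet the plain dropna() call discards half the rows, because the gaps fall in different rows.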
fillna() replaces NaN with something: a constant (fillna(0)), the column mean (df['col'].fillna(df['col'].mean())), or, for time series where the previous reading is a reasonable stand-in, the last known value (df['col'].ffill(); the older fillna(method='ffill') spelling has been deprecated since pandas 2.1). A more sophisticated approach fills with the group mean rather than the global mean, which preserves distributional differences between groups: df['value'].fillna(df.groupby('category')['value'].transform('mean')).
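The fill strategies can be compared side by side on a toy frame. This is a sketch under assumptions: the 'category' and 'value' column names follow the text, the data is invented, and forward fill is written as .ffill(), which is the spelling current pandas versions recommend.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per group (illustrative values only).
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan, 20.0],
})

# Constant fill.
filled_zero = df["value"].fillna(0)

# Global mean fill: mean() skips NaN, so it averages 1, 3, 10 and 20.
filled_mean = df["value"].fillna(df["value"].mean())

# Forward fill: carry the last known value forward.
filled_ffill = df["value"].ffill()

# Group mean fill: transform('mean') returns a Series aligned to df's index,
# so it can be passed straight to fillna().
group_means = df.groupby("category")["value"].transform("mean")
filled_group = df["value"].fillna(group_means)

print(filled_mean.tolist())   # [1.0, 8.5, 3.0, 10.0, 8.5, 20.0]
print(filled_group.tolist())  # [1.0, 2.0, 3.0, 10.0, 15.0, 20.0]
```

The global mean of 8.5 is a poor fit for both groups, while the group means (2.0 and 15.0) stay within each group's own range.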
The key decision — drop, fill with constant, fill with statistic, or interpolate — affects every downstream number. Document the choice; your future self will thank you when the model behaves oddly six months later.
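To see how much the choice can move a downstream number, the sketch below computes the mean of one series under four strategies. The series is hypothetical, and interpolate() is used with its default linear method, which treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap (illustrative values only).
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Same data, four strategies, four candidate inputs to the next step.
mean_dropped = s.dropna().mean()            # uses only 10 and 40
mean_zero = s.fillna(0).mean()              # gaps become 0
mean_ffill = s.ffill().mean()               # gaps become 10
mean_interpolated = s.interpolate().mean()  # gaps become 20 and 30

print(mean_dropped, mean_zero, mean_ffill, mean_interpolated)
```

On this toy series the mean ranges from 12.5 (zero fill) to 25.0 (drop or interpolate), which is the kind of spread worth writing down next to the choice you made.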
Missing data guide: [1].
Sources
Card Info
- Topic: Data Science Praktikum
- Difficulty: Beginner