groupby, merge, and counting carefully
The workhorse is groupby — the "split by key, then reduce" sentence from the aggregation topic, now vectorised and able to produce several summaries at once:
agg = (df.groupby("category")
.agg(events=("id", "size"),
mean_score=("score", "mean"))
.reset_index())
This is exactly GROUP BY category in SQL: one output row per category, with the chosen aggregates. The same logic you wrote by hand with defaultdict now fits on three lines.
Joining two tables on a shared key is merge (SQL JOIN):
merged = predictions.merge(labels, on="event_id", how="inner")
The thing to watch is cardinality: if the key isn't unique on one side, rows multiply, and a careless join can silently inflate your row count. A common cleaning step before joining is keeping the latest record per key:
latest = (df.sort_values("ingestion_ts")
.drop_duplicates("event_id", keep="last"))
pandas rewards a little vigilance about counting, because the wrong choice quietly changes a denominator — and therefore a rate. Three near-synonyms differ exactly on missing values:
sizecounts every row in a group, missing values included;countcounts only non-null values;value_counts()dropsNaNby default.
Choosing the wrong one produces a confident wrong number. This is the daily craft of analysis: read, filter, group, join, summarise — fast and expressive, as long as you carry over the rigour about which denominator you're using and where the missing values went.
columns” and leads into “Why look first, and the figure/axes model”.*
Related cards
Tasks
Card Info
- Topic: Python for Data Science
- Difficulty: Intermediate
- Completed: 0 users