groupby, merge, and counting carefully

Intermediate Python for Data Science
Created by Best · 24.06.2026 at 14:03 UTC

The workhorse is groupby — the "split by key, then reduce" sentence from the aggregation topic, now vectorised and able to produce several summaries at once:

agg = (df.groupby("category")
         .agg(events=("id", "size"),
              mean_score=("score", "mean"))
         .reset_index())

This is exactly GROUP BY category in SQL: one output row per category, with the chosen aggregates. The same logic you wrote by hand with defaultdict now fits on three lines.

Joining two tables on a shared key is merge (SQL JOIN):

merged = predictions.merge(labels, on="event_id", how="inner")

The thing to watch is cardinality: if the key isn't unique on one side, rows multiply, and a careless join can silently inflate your row count. A common cleaning step before joining is keeping the latest record per key:

latest = (df.sort_values("ingestion_ts")
            .drop_duplicates("event_id", keep="last"))

pandas rewards a little vigilance about counting, because the wrong choice quietly changes a denominator — and therefore a rate. Three near-synonyms differ exactly on missing values:

  • size counts every row in a group, missing values included;
  • count counts only non-null values;
  • value_counts() drops NaN by default.

Choosing the wrong one produces a confident wrong number. This is the daily craft of analysis: read, filter, group, join, summarise — fast and expressive, as long as you carry over the rigour about which denominator you're using and where the missing values went.
columns” and leads into “Why look first, and the figure/axes model”.*

University approvals: 0
Related cards
Builds on DataFrames: reading, filtering, adding columns · Python for Data Science
Next Why look first, and the figure/axes model · Python for Data Science
Tasks
Question 1

Using pandas, read CSV from stdin with columns category,score (a header line, then rows). Print, one per line, category mean for each category sorted by mean DESC, mean rounded to 2 decimals.

Example input:

category,score
a,10
a,20
b,5

Expected output:

a 15.0
b 5.0
3 test cases will be used for grading
Run checks runtime behavior only. Final correctness is evaluated when you submit.
Question 2

What is the main risk when you merge two tables on a key that is not unique on one side?

Question 3

In pandas, what is the difference between groupby(k)['x'].size() and groupby(k)['x'].count()?

Card Info
  • Topic: Python for Data Science
  • Difficulty: Intermediate
  • Completed: 0 users
Creator
Best
Best
BestBuddy