groupby, merge, and counting carefully

Intermediate Python for Data Science

Created by Best · 24.06.2026 at 14:03 UTC

The workhorse is groupby — the "split by key, then reduce" sentence from the aggregation topic, now vectorised and able to produce several summaries at once:

agg = (df.groupby("category")
         .agg(events=("id", "size"),
              mean_score=("score", "mean"))
         .reset_index())

This is exactly GROUP BY category in SQL: one output row per category, with the chosen aggregates. The same logic you wrote by hand with defaultdict now fits on three lines.

Joining two tables on a shared key is merge (SQL JOIN):

merged = predictions.merge(labels, on="event_id", how="inner")

The thing to watch is cardinality: if the key isn't unique on one side, rows multiply, and a careless join can silently inflate your row count. A common cleaning step before joining is keeping the latest record per key:

latest = (df.sort_values("ingestion_ts")
            .drop_duplicates("event_id", keep="last"))

pandas rewards a little vigilance about counting, because the wrong choice quietly changes a denominator — and therefore a rate. Three near-synonyms differ exactly on missing values:

size counts every row in a group, missing values included;
count counts only non-null values;
value_counts() drops NaN by default.

Choosing the wrong one produces a confident wrong number. This is the daily craft of analysis: read, filter, group, join, summarise — fast and expressive, as long as you carry over the rigour about which denominator you're using and where the missing values went.
columns” and leads into “Why look first, and the figure/axes model”.*

University approvals: 0

Related cards

Builds on DataFrames: reading, filtering, adding columns · Python for Data Science

Next Why look first, and the figure/axes model · Python for Data Science

Tasks

Question 1

Using pandas, read CSV from stdin with columns category,score (a header line, then rows). Print, one per line, category mean for each category sorted by mean DESC, mean rounded to 2 decimals.

Example input:

category,score
a,10
a,20
b,5

Expected output:

a 15.0
b 5.0

Runtime output (stdout/stderr)

3 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Question 2

What is the main risk when you merge two tables on a key that is not unique on one side?

The merge silently drops all rows

Rows multiply, inflating the result's row count

pandas converts everything to strings

The index is lost permanently

Question 3

In pandas, what is the difference between groupby(k)['x'].size() and groupby(k)['x'].count()?

They are identical

size counts all rows in the group; count counts only non-null values of x

count includes NaN; size excludes it

size only works on strings

Card Info

Topic: Python for Data Science
Difficulty: Intermediate
Completed: 0 users

Creator

Best

BestBuddy