GroupBy: split-apply-combine in pandas

Intermediate Data Science Praktikum
Created by Pavel · 03.04.2026 at 11:49 UTC · 1 completed

Almost every analytical question sounds like "for each category, compute something": total sales per city, average rating per product, count of events per day. The answer in pandas is always groupby, which implements the split-apply-combine pattern — split the DataFrame into groups defined by one or more columns, apply a function to each group independently, then glue the results back together.

The simplest path is df.groupby('city')['sales'].sum(), which returns a Series indexed by city. When you need several statistics at once, named aggregation is cleaner than calling agg multiple times: df.groupby('city').agg(total=('sales','sum'), avg_price=('price','mean'), n=('id','count')) gives you a DataFrame with columns total, avg_price, n — and the code reads almost like a SQL SELECT city, SUM(sales) AS total, ….

A subtler tool is transform(): unlike agg(), which collapses each group to one row, transform('mean') broadcasts the group mean back to every row in the original DataFrame. This lets you write df['city_avg'] = df.groupby('city')['sales'].transform('mean') to add a per-group statistic without losing row-level detail — invaluable for normalization or residual calculations.

filter() keeps or drops entire groups: df.groupby('city').filter(lambda g: g['sales'].sum() > 1000) removes any city whose total sales are too small. apply() is the escape hatch when built-in aggregations cannot express the logic.

GroupBy guide: [1].


Sources

University approvals: 0
Tasks
Question 1

What does this code print?

import pandas as pd
df = pd.DataFrame({'city': ['NYC','LA','NYC','LA'],
                   'sales': [100, 200, 150, 50]})
print(df.groupby('city')['sales'].transform('sum').tolist())
Hint

transform broadcasts the group result back to every row in the original DataFrame.

Question 2

What is the key difference between df.groupby('city')['sales'].agg('sum') and df.groupby('city')['sales'].transform('sum')?

Hint

Compare the shapes of their outputs.

Question 3

Using pandas, implement top_city(csv_text: str) -> str that reads a CSV with columns city and sales, uses groupby to sum sales per city, and returns the city name with the highest total.

Example: city,sales\nNYC,100\nLA,200\nNYC,180\nLA,40'NYC' (NYC=280, LA=240).

Submit the function; tests use expression mode.

Hint

groupby('city')['sales'].sum() gives total per city; .idxmax() returns the label of the maximum.

Starter code is prefilled; replace TODO blocks with your solution.
2 test cases will be used for grading
Run checks runtime behavior only. Final correctness is evaluated when you submit.
Card Info
  • Topic: Data Science Praktikum
  • Difficulty: Intermediate
  • Completed: 1 users
Creator
Pavel
Pavel