GroupBy: split-apply-combine in pandas

Intermediate Data Science Praktikum

Created by Pavel · 03.04.2026 at 11:49 UTC · 1 completed

Almost every analytical question sounds like "for each category, compute something": total sales per city, average rating per product, count of events per day. The answer in pandas is always groupby, which implements the split-apply-combine pattern — split the DataFrame into groups defined by one or more columns, apply a function to each group independently, then glue the results back together.

The simplest path is df.groupby('city')['sales'].sum(), which returns a Series indexed by city. When you need several statistics at once, named aggregation is cleaner than calling agg multiple times: df.groupby('city').agg(total=('sales','sum'), avg_price=('price','mean'), n=('id','count')) gives you a DataFrame with columns total, avg_price, n — and the code reads almost like a SQL SELECT city, SUM(sales) AS total, ….

A subtler tool is transform(): unlike agg(), which collapses each group to one row, transform('mean') broadcasts the group mean back to every row in the original DataFrame. This lets you write df['city_avg'] = df.groupby('city')['sales'].transform('mean') to add a per-group statistic without losing row-level detail — invaluable for normalization or residual calculations.

filter() keeps or drops entire groups: df.groupby('city').filter(lambda g: g['sales'].sum() > 1000) removes any city whose total sales are too small. apply() is the escape hatch when built-in aggregations cannot express the logic.

GroupBy guide: [1].

Sources

[1]https://pandas.pydata.org/docs/user_guide/groupby.html Return to text

University approvals: 0

Tasks

Question 1

What does this code print?

import pandas as pd
df = pd.DataFrame({'city': ['NYC','LA','NYC','LA'],
                   'sales': [100, 200, 150, 50]})
print(df.groupby('city')['sales'].transform('sum').tolist())

Hint

transform broadcasts the group result back to every row in the original DataFrame.

[250, 250, 250, 250]

[250, 250]

[100, 200, 150, 50]

Error — transform cannot use 'sum'

Question 2

What is the key difference between df.groupby('city')['sales'].agg('sum') and df.groupby('city')['sales'].transform('sum')?

Hint

Compare the shapes of their outputs.

agg is faster; transform is slower but more accurate

agg returns one value per group (fewer rows); transform returns one value per original row (same length as df)

transform can only compute mean; agg can compute any function

They return identical results

Question 3

Using pandas, implement top_city(csv_text: str) -> str that reads a CSV with columns city and sales, uses groupby to sum sales per city, and returns the city name with the highest total.

Example: city,sales\nNYC,100\nLA,200\nNYC,180\nLA,40 → 'NYC' (NYC=280, LA=240).

Submit the function; tests use expression mode.

Hint

groupby('city')['sales'].sum() gives total per city; .idxmax() returns the label of the maximum.

import io
import pandas as pd


def top_city(csv_text: str) -> str:
    # TODO: groupby city, sum sales, return the city with highest total.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

Runtime output (stdout/stderr)

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Card Info

Topic: Data Science Praktikum
Difficulty: Intermediate
Completed: 1 users

Creator

Pavel