GroupBy: split-apply-combine in pandas
Many analytical questions take the form "for each category, compute something": total sales per city, average rating per product, count of events per day. In pandas the usual answer is groupby, which implements the split-apply-combine pattern: split the DataFrame into groups defined by one or more columns, apply a function to each group independently, then combine the results into a single object.
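To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with the built-in one-liner. The DataFrame and its values are hypothetical toy data, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "city": ["Berlin", "Munich", "Berlin", "Munich", "Hamburg"],
    "sales": [100, 200, 150, 50, 300],
})

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
# Apply: reduce each sub-DataFrame to a single number.
manual = {city: group["sales"].sum() for city, group in df.groupby("city")}

# Combine: assemble the per-group results into one Series.
manual_result = pd.Series(manual, name="sales")

# The built-in aggregation performs all three steps in one call.
builtin_result = df.groupby("city")["sales"].sum()

print(manual_result.equals(builtin_result))
```

The manual loop is only there to show what happens conceptually; in practice the built-in call is shorter and faster.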
The simplest path is df.groupby('city')['sales'].sum(), which returns a Series indexed by city. When you need several statistics at once, named aggregation is cleaner than calling agg multiple times: df.groupby('city').agg(total=('sales','sum'), avg_price=('price','mean'), n=('id','count')) gives you a DataFrame indexed by city with columns total, avg_price, and n, and the code reads almost like a SQL SELECT city, SUM(sales) AS total, …. Note that 'count' counts only non-null values, so n can be smaller than the group size when id contains missing values.
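The two calls described above can be sketched as follows. The column names match the ones used in the text, while the data itself is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data using the column names from the text.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Berlin", "Berlin", "Munich", "Munich"],
    "sales": [100, 300, 200, 400],
    "price": [10.0, 20.0, 30.0, 50.0],
})

# Single statistic: a Series indexed by city.
totals = df.groupby("city")["sales"].sum()

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, aggregation) pair.
summary = df.groupby("city").agg(
    total=("sales", "sum"),
    avg_price=("price", "mean"),
    n=("id", "count"),
)

print(totals)
print(summary)
```

If you prefer city as an ordinary column rather than the index, calling reset_index() on the result is one common option.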
A subtler tool is transform(): unlike agg(), which collapses each group to one row, transform('mean') returns a result aligned with the original index, so every row receives the mean of its own group. This lets you write df['city_avg'] = df.groupby('city')['sales'].transform('mean') to add a per-group statistic without losing row-level detail, which is useful for normalization or residual calculations.
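A short sketch of the transform() pattern, including the residual calculation mentioned above. The data and the residual column name are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Munich", "Munich"],
    "sales": [100, 300, 200, 400],
})

# transform keeps one value per original row, so the result can be
# assigned straight back as a new column.
df["city_avg"] = df.groupby("city")["sales"].transform("mean")

# Residual: how far each row is from the mean of its own city.
# The column name "residual" is illustrative.
df["residual"] = df["sales"] - df["city_avg"]

print(df)
```

The row count is unchanged, which is the key difference from agg().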
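filter() keeps or drops entire groups: df.groupby('city').filter(lambda g: g['sales'].sum() > 1000) keeps only the rows of cities whose total sales exceed 1000 and drops every other city. apply() is the escape hatch when built-in aggregations cannot express the logic: it calls an arbitrary Python function on each group, which is flexible but usually slower than the built-in methods.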
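The sketch below illustrates both methods on hypothetical data. The apply() example computes a per-city sales range purely as an illustration of custom logic; it is not a recommendation over built-in aggregations where those suffice.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Munich", "Munich", "Hamburg"],
    "sales": [600, 700, 200, 300, 1000],
})

# filter: the function returns one boolean per group; rows of groups
# that return False are dropped, the remaining rows keep their index.
big = df.groupby("city").filter(lambda g: g["sales"].sum() > 1000)

# apply: run arbitrary Python on each group. Here each group is the
# "sales" Series of one city, reduced to max minus min.
sales_range = df.groupby("city")["sales"].apply(lambda s: s.max() - s.min())

print(big)
print(sales_range)
```

Hamburg is dropped by the filter because its total is exactly 1000, and the condition is a strict comparison.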
GroupBy guide: [1].
Sources