Counting and accumulating per key
Most analytics questions have the same shape: split the rows into groups by some key, then reduce each group to a number. "How many events per user?" "Average score per category?" Learn to say that in plain Python and every later groupby is just the same idea made fast.
Start with counting. Counter from the collections module tallies how often each key appears:
from collections import Counter
regions = ["west", "east", "west", "west", "east"]
counts = Counter(regions) # Counter({'west': 3, 'east': 2})
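Counter accepts any iterable of hashable keys, so the same tally works on a field pulled out of each row. The sketch below is illustrative only: the `events` list and its field names are hypothetical, made up for this example. `most_common` is the standard Counter method for ranking keys by count.

```python
from collections import Counter

# Hypothetical event log, made up for illustration.
events = [
    {"user": "u1", "action": "click"},
    {"user": "u2", "action": "view"},
    {"user": "u1", "action": "view"},
    {"user": "u1", "action": "click"},
]

# Count events per user by feeding Counter one key per row.
events_per_user = Counter(e["user"] for e in events)

# most_common(n) returns the n highest counts as (key, count) pairs.
top_user = events_per_user.most_common(1)

# A key that never appeared reads as 0 instead of raising KeyError.
missing = events_per_user["u3"]
```

Reading a missing key this way does not add it to the Counter, so probing for absent users leaves the tally unchanged.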
When you need to accumulate per key rather than just count (collect each group's values, or keep a running total), the tool is defaultdict. A plain dict raises KeyError the first time you read a key that isn't there, and an update like += reads the old value before it writes the new one. A defaultdict instead creates a starting value automatically the first time a missing key is accessed:
from collections import defaultdict
records = [{"user": "u1", "score": 5},
           {"user": "u1", "score": 2},
           {"user": "u2", "score": 9}]
totals = defaultdict(int)  # a missing key starts at 0
for r in records:
    totals[r["user"]] += r["score"]
# totals == {'u1': 7, 'u2': 9}
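For comparison, here is roughly what the same loop looks like without defaultdict. This is a sketch, not a recommended replacement: it reuses the shape of the records above, and the names `plain` and `failed` are hypothetical helpers introduced only to show the error.

```python
# Same shape as the records used above.
records = [
    {"user": "u1", "score": 5},
    {"user": "u1", "score": 2},
    {"user": "u2", "score": 9},
]

# With a plain dict, += on a missing key fails: the old value is read first.
plain = {}
try:
    plain["u1"] += 5
    failed = False
except KeyError:
    failed = True

# dict.get with a default of 0 does by hand what defaultdict(int) does for you.
totals = {}
for r in records:
    totals[r["user"]] = totals.get(r["user"], 0) + r["score"]
```

Both versions produce the same totals. The defaultdict form just moves the "start at 0" rule out of the loop body and into the container.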
The argument is a factory — a function defaultdict calls to make a fresh starting value. defaultdict(int) starts each new key at 0; defaultdict(list) starts each at a new empty list.
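With defaultdict(list), the split-then-reduce shape from the introduction can be written in two steps. A minimal sketch, assuming hypothetical `rows` with a made-up "category" field: collect each group's values first, then reduce each list to an average.

```python
from collections import defaultdict

# Hypothetical rows; the "category" field is made up for illustration.
rows = [
    {"category": "a", "score": 4},
    {"category": "b", "score": 10},
    {"category": "a", "score": 6},
]

# Split: collect each category's scores in its own list.
groups = defaultdict(list)
for row in rows:
    groups[row["category"]].append(row["score"])

# Reduce: turn each list into one number, here the mean.
averages = {cat: sum(scores) / len(scores) for cat, scores in groups.items()}
```

One thing to watch for: indexing a defaultdict with a missing key inserts that key with its starting value. To check whether a key exists without adding it, use `in` or `.get()`.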
This card leads into “Ranking, ties, and toward groupby”.
Card Info
- Topic: Python for Data Science
- Difficulty: Intermediate