Vectorized string operations with .str
String cleaning is the unglamorous core of most data pipelines: phone numbers arrive with dashes and parentheses, names have trailing spaces, categories are mixed case, and addresses need the city extracted from after the comma. Doing this with a Python for-loop works but defeats the purpose of having a DataFrame — and once you see df['phone'].str.replace(r'\D', '', regex=True) strip every non-digit from an entire column in one line, the loop version feels like washing dishes by hand when you own a dishwasher.
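To make the loop-versus-vectorized comparison concrete, here is a minimal sketch. The sample phone numbers and the column name phone_clean are made up for illustration; only the .str.replace call comes from the text above.

```python
import pandas as pd

# Hypothetical sample data: the same kind of messy phone numbers described above.
df = pd.DataFrame({"phone": ["(555) 123-4567", "555-987-6543", "555.222.3333"]})

# Loop version: works, but handles one value at a time in Python.
loop_result = []
for value in df["phone"]:
    loop_result.append("".join(ch for ch in value if ch.isdigit()))

# Vectorized version: one call removes every non-digit in the whole column.
df["phone_clean"] = df["phone"].str.replace(r"\D", "", regex=True)

print(df["phone_clean"].tolist())
```

Both approaches produce the same digit-only strings for this data; the vectorized call is shorter and keeps the work inside pandas.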
The .str accessor mirrors Python's string methods but applies them element-wise across a Series: .str.lower(), .str.strip(), .str.startswith('A'), .str.len(). It also handles regex: .str.contains(r'\d{3}-\d{4}', na=False) returns a boolean mask marking the rows that contain a phone-number pattern. The na=False matters: without it, a missing value in the column can come back as missing in the result rather than False, so the result is no longer a clean boolean mask, and filtering an object-dtype column with it can raise an error instead of selecting rows.
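The following sketch shows these methods on a small Series that includes a missing value. The sample values and variable names are assumptions for illustration, and dtype="object" is set explicitly so the NaN behaviour described above is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical names with stray whitespace, mixed case, and one missing value.
names = pd.Series(["  Alice ", "BOB", np.nan, "alan  "], dtype="object")

# Element-wise cleanup: strip whitespace, then lower-case.
cleaned = names.str.strip().str.lower()

# Boolean test and length per element; na=False turns the missing value into False.
starts_with_a = cleaned.str.startswith("a", na=False)
lengths = cleaned.str.len()

# Regex search for a phone-number pattern in free text.
notes = pd.Series(["call 555-1234", "no number", np.nan], dtype="object")
mask_default = notes.str.contains(r"\d{3}-\d{4}")      # missing value stays missing
mask = notes.str.contains(r"\d{3}-\d{4}", na=False)    # clean True/False mask

# The clean mask can be used directly to filter rows.
matches = notes[mask]
print(matches.tolist())
```

Here mask_default still carries one missing entry, while mask contains only True and False and is safe to use for filtering.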
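Chaining works naturally: df['address'].str.split(',').str[0] first splits each address on the comma (producing a list per cell), then .str[0] indexes into each list to grab the first part. .str.extract(r'(\d+)') returns the capture group from the first match in each string, which is useful for pulling out numbers embedded in text. By default the result is a DataFrame with one column per group; pass expand=False to get a Series when there is a single group.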
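As a rough illustration of the split-then-index chain and of .str.extract, consider the sketch below. The addresses and the new column names (street, city, house_number) are invented for the example; the third address deliberately has no comma to show what happens when a part is missing.

```python
import pandas as pd

# Hypothetical addresses; the last one has no comma and therefore no city part.
df = pd.DataFrame(
    {"address": ["12 Main St, Springfield", "400 Oak Ave, Portland", "7 Elm Rd"]}
)

# Split once, then index into the per-row lists.
parts = df["address"].str.split(",")
df["street"] = parts.str[0]
# Indexing past the end of a list yields a missing value instead of an error.
df["city"] = parts.str[1].str.strip()

# First run of digits in each address; expand=False returns a Series.
df["house_number"] = df["address"].str.extract(r"(\d+)", expand=False)

print(df)
```

Note that the extracted house numbers are still strings; converting them to integers would be a separate step.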
Always clean strings before using them as groupby keys or merge keys: 'NYC' and ' NYC ' are different strings, so they end up in separate groups and fail to match each other in a merge.
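A small sketch of the groupby problem, using made-up sales data. The column name city_key and the choice to upper-case the keys are assumptions for this example; any consistent normalization would do.

```python
import pandas as pd

# Hypothetical data: the same city written three different ways.
sales = pd.DataFrame(
    {"city": ["NYC", " NYC ", "nyc", "Boston"], "amount": [10, 20, 30, 40]}
)

# Grouping on the raw column treats every spelling as its own group.
raw_groups = sales.groupby("city")["amount"].sum()

# Normalize whitespace and case first, then group on the cleaned key.
sales["city_key"] = sales["city"].str.strip().str.upper()
clean_groups = sales.groupby("city_key")["amount"].sum()

print(len(raw_groups), len(clean_groups))
print(clean_groups)
```

With the raw column there are four groups; after cleaning there are two, and all three New York rows are summed together. The same normalization applies to both sides of a merge.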
String methods guide: [1].
Sources
Card Info
- Topic: Data Science Praktikum
- Difficulty: Beginner