Vectorized string operations with .str

Beginner Data Science Praktikum
Created by Pavel · 03.04.2026 at 11:49 UTC

String cleaning is the unglamorous core of most data pipelines: phone numbers arrive with dashes and parentheses, names have trailing spaces, categories are mixed case, and addresses need the city extracted from after the comma. Doing this with a Python for-loop works but defeats the purpose of having a DataFrame — and once you see df['phone'].str.replace(r'\D', '', regex=True) strip every non-digit from an entire column in one line, the loop version feels like washing dishes by hand when you own a dishwasher.

The .str accessor mirrors Python's string methods but applies them element-wise across a Series: .str.lower(), .str.strip(), .str.startswith('A'), .str.len(). It also handles regex: .str.contains(r'\d{3}-\d{4}', na=False) returns a boolean mask for phone-number patterns. The na=False is important — without it, any NaN in the column propagates as NaN in the result, which quietly breaks downstream boolean logic.

Chaining works naturally: df['address'].str.split(',').str[0] first splits each address on the comma (producing a list per cell), then .str[0] indexes into each list to grab the first part. .str.extract(r'(\d+)') pulls the first match group from a regex — useful for extracting numbers embedded in strings.

Always clean strings before using them as groupby keys or merge keys — 'NYC' and ' NYC ' are different strings.

String methods guide: [1].


Sources

University approvals: 0
Tasks
Question 1

What does this code print?

import pandas as pd
s = pd.Series(['212-555-1234', '(310) 555-5678', None])
print(s.str.replace(r'\D', '', regex=True).tolist())
Hint

str.replace with regex removes non-digits. What happens to the None?

Question 2

What does df['addr'].str.split(',').str[0] return if df['addr'] contains '123 Main St, NYC'?

Hint

split produces a list; .str[0] takes the first element.

Question 3

Using pandas, implement extract_area_codes(csv_text: str) -> list that reads a CSV with a phone column containing strings like '212-555-1234'. Use the .str accessor to split on '-' and return the first part (the area code) of each phone as a list.

Example: phone\n212-555-1234\n310-555-5678['212', '310'].

Submit the function; tests use expression mode.

Hint

str.split('-') produces a list per cell; .str[0] extracts the first element from each list.

Starter code is prefilled; replace TODO blocks with your solution.
2 test cases will be used for grading
Run checks runtime behavior only. Final correctness is evaluated when you submit.
Card Info
  • Topic: Data Science Praktikum
  • Difficulty: Beginner
  • Completed: 0 users
Creator
Pavel
Pavel