Vectorized string operations with .str

Beginner Data Science Praktikum

Created by Pavel · 03.04.2026 at 11:49 UTC

String cleaning is the unglamorous core of most data pipelines: phone numbers arrive with dashes and parentheses, names have trailing spaces, categories are mixed case, and addresses need the city extracted from after the comma. Doing this with a Python for-loop works but defeats the purpose of having a DataFrame — and once you see df['phone'].str.replace(r'\D', '', regex=True) strip every non-digit from an entire column in one line, the loop version feels like washing dishes by hand when you own a dishwasher.
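As a small sketch of that comparison, the loop and the vectorized call can be put side by side. The sample phone numbers and the DataFrame below are made up for illustration; only the column name 'phone' and the regex come from the text.

```python
import re

import pandas as pd

# Hypothetical sample data; the 'phone' column name follows the text above.
df = pd.DataFrame({"phone": ["212-555-1234", "(310) 555-5678", "415.555.0000"]})

# Loop version: works, but handles one Python string at a time.
looped = [re.sub(r"\D", "", p) for p in df["phone"]]

# Vectorized version: one call strips every non-digit from the whole column.
vectorized = df["phone"].str.replace(r"\D", "", regex=True)

print(vectorized.tolist())  # ['2125551234', '3105555678', '4155550000']
```

Both versions produce the same digits here; the .str version also skips missing values without extra code, which the plain loop would not.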

The .str accessor mirrors Python's string methods but applies them element-wise across a Series: .str.lower(), .str.strip(), .str.startswith('A'), .str.len(). It also handles regex: .str.contains(r'\d{3}-\d{4}', na=False) returns a boolean mask marking the rows that contain a phone-number pattern. The na=False matters: without it, a missing value in the column stays missing in the result instead of becoming False, so the result is no longer a clean boolean mask and can break filtering or other boolean logic downstream.
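The methods named above can be tried on a tiny Series. This is a minimal sketch with invented sample values (including one missing entry), not a recommended cleaning recipe.

```python
import pandas as pd

# Hypothetical sample data with stray whitespace, mixed case, and a missing value.
s = pd.Series(["  Alice ", "bob", None, "ANNA 555-1234"])

# Element-wise equivalents of str.strip() and str.lower(); the missing value stays missing.
cleaned = s.str.strip().str.lower()

# Other accessor methods from the text; missing input gives missing output here too.
starts_a = cleaned.str.startswith("a")
lengths = cleaned.str.len()

# na=False turns the missing entry into False, so the mask is purely boolean.
has_phone = s.str.contains(r"\d{3}-\d{4}", na=False)

print(has_phone.tolist())     # [False, False, False, True]
print(s[has_phone].tolist())  # ['ANNA 555-1234']
```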

Chaining works naturally: df['address'].str.split(',').str[0] first splits each address on the comma (producing a list per cell), then .str[0] indexes into each list to grab the first part. .str.extract(r'(\d+)') returns the capture group from the first regex match in each cell, which is useful for pulling out numbers embedded in strings. By default (expand=True) it returns a DataFrame with one column per capture group.
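The chain can be followed step by step on two made-up addresses. The variable names below are illustrative, and expand=False is used only so that the single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; the 'address' column name follows the text above.
df = pd.DataFrame({"address": ["123 Main St, NYC", "9 Elm Ave, Boston"]})

parts = df["address"].str.split(",")  # one list per cell
street = parts.str[0]                 # first element of each list
city = parts.str[1].str.strip()       # second element, minus the space after the comma

# First run of digits in each address, returned as a Series of strings.
number = df["address"].str.extract(r"(\d+)", expand=False)

print(street.tolist())  # ['123 Main St', '9 Elm Ave']
print(city.tolist())    # ['NYC', 'Boston']
print(number.tolist())  # ['123', '9']
```

Note that the extracted numbers are still strings; converting them to integers is a separate step.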

Always clean strings before using them as groupby keys or merge keys — 'NYC' and ' NYC ' are different strings.
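A short sketch shows what goes wrong with uncleaned keys. The data and the 'city_clean' column are invented for illustration, and upper-casing is an extra assumption added here to cover mixed-case values as well as stray spaces.

```python
import pandas as pd

# Hypothetical sample data: the same city written three different ways.
df = pd.DataFrame({
    "city": ["NYC", " NYC ", "nyc", "Boston"],
    "sales": [10, 20, 30, 5],
})

# Grouping on the raw strings treats every spelling as its own key.
raw = df.groupby("city")["sales"].sum()
print(len(raw))  # 4

# Strip whitespace and normalize case before grouping.
df["city_clean"] = df["city"].str.strip().str.upper()
clean = df.groupby("city_clean")["sales"].sum()
print(clean.to_dict())  # {'BOSTON': 5, 'NYC': 60}
```

The same normalization should be applied to both sides before a merge, otherwise rows that look identical to a human will fail to match.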

String methods guide: [1].

Sources

[1] https://pandas.pydata.org/docs/user_guide/text.html

Tasks

Question 1

What does this code print?

import pandas as pd
s = pd.Series(['212-555-1234', '(310) 555-5678', None])
print(s.str.replace(r'\D', '', regex=True).tolist())

Hint

str.replace with regex removes non-digits. What happens to the None?

['2125551234', '3105555678', None]

['2125551234', '3105555678', '']

['212-555-1234', '(310) 555-5678', None]

Error — cannot apply regex to a Series with None

Question 2

What does df['addr'].str.split(',').str[0] return if df['addr'] contains '123 Main St, NYC'?

Hint

split produces a list; .str[0] takes the first element.

'123 Main St, NYC'

'123 Main St'

'NYC'

['123 Main St', ' NYC']

Question 3

Using pandas, implement extract_area_codes(csv_text: str) -> list that reads a CSV with a phone column containing strings like '212-555-1234'. Use the .str accessor to split on '-' and return the first part (the area code) of each phone as a list.

Example: phone\n212-555-1234\n310-555-5678 → ['212', '310'].

Submit the function; tests use expression mode.

Hint

str.split('-') produces a list per cell; .str[0] extracts the first element from each list.

import io
import pandas as pd


def extract_area_codes(csv_text: str) -> list:
    # TODO: read CSV, split phone on '-', take first part.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

2 test cases will be used for grading
