Vectorized operations vs apply in pandas

Intermediate Data Science Praktikum

Created by Pavel · 03.04.2026 at 11:49 UTC

The performance gap between vectorized pandas code and row-wise apply is not 2x or 5x: on a million-row DataFrame it is often 100x or more. The reason is architectural. A vectorized expression like df['price'] * df['quantity'] hands two contiguous NumPy arrays to a compiled C routine that processes them in a single pass. df.apply(lambda row: row['price'] * row['quantity'], axis=1) instead loops in Python: it builds a Series object for every row, makes one Python function call per row, and collects the results one at a time. Both approaches scale linearly with row count, but apply pays interpreter overhead on every row, so its per-row cost is orders of magnitude higher.
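A rough way to see the gap on your own machine is to time both forms on synthetic data. The sketch below is only illustrative: the frame size, the random values, and the variable names are assumptions, and the measured ratio will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data; the size and value ranges are arbitrary choices for illustration.
rng = np.random.default_rng(0)
n = 50_000
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, size=n),
    "quantity": rng.integers(1, 20, size=n),
})

# Vectorized: one multiplication over two whole columns.
start = time.perf_counter()
vectorized = df["price"] * df["quantity"]
t_vec = time.perf_counter() - start

# Row-wise: one Python function call (and one Series object) per row.
start = time.perf_counter()
row_wise = df.apply(lambda row: row["price"] * row["quantity"], axis=1)
t_apply = time.perf_counter() - start

# Both approaches compute the same values; only the cost differs.
same = np.allclose(vectorized.to_numpy(), row_wise.to_numpy())
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s, same result: {same}")
```

A single perf_counter measurement is noisy; for a more careful comparison, repeat each measurement several times (for example with the timeit module) and compare the best runs.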

The rule of thumb: if you can express the operation as column arithmetic, np.where, a pandas method, or a combination of these, do that. Reach for apply(axis=1) only when the per-row logic is genuinely complex and resists decomposition into column operations. Even then, first consider np.select for multi-branch conditionals, or convert the columns with .to_numpy() and operate on the arrays directly.
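The three alternatives named above can be sketched on a tiny frame. The data, the tier labels, and the discount thresholds below are made up for illustration and are not part of the lesson's tasks.

```python
import numpy as np
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"price": [100.0, 80.0, 50.0, 20.0], "qty": [3, 10, 25, 1]})

# np.where: a two-way branch evaluated for the whole column at once.
df["tier"] = np.where(df["qty"] >= 10, "bulk", "single")

# np.select: several branches; conditions are checked in order, first match wins.
qty = df["qty"].to_numpy()
rate = np.select([qty > 20, qty > 5], [0.8, 0.9], default=1.0)
df["discounted"] = df["price"] * rate

# Raw NumPy: pull the columns out as arrays and compute on them directly.
price = df["price"].to_numpy()
df["total"] = price * qty

print(df)
```

Order matters in np.select: the stricter condition (qty > 20) is listed first, otherwise every row with qty > 5 would already match the looser one.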

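df.query('age > 30 and city == "NYC"') can be another fast path: pandas parses the string expression and, when the optional numexpr package is installed, evaluates it without materializing a separate intermediate boolean array for each sub-condition. The benefit shows up mainly on large frames; on small ones the parsing overhead can make query slower than a plain boolean mask. df.eval('total = price * qty') evaluates column expressions the same way. Note that it returns a new DataFrame unless you pass inplace=True.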
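As a small illustration of how query and eval behave (not of their speed, which only matters on large frames), the sketch below uses a made-up four-row frame and compares query against the equivalent boolean mask.

```python
import pandas as pd

# Hypothetical data; column names follow the expressions used in the text.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF"],
    "age": [25, 35, 40, 52],
    "price": [10.0, 20.0, 30.0, 40.0],
    "qty": [1, 2, 3, 4],
})

# query returns the matching rows as a new DataFrame.
older_nyc = df.query('age > 30 and city == "NYC"')

# The same filter written as a boolean mask.
mask = (df["age"] > 30) & (df["city"] == "NYC")
print(older_nyc.equals(df[mask]))

# eval with an assignment returns a new DataFrame; df itself is unchanged
# because inplace defaults to False.
with_total = df.eval("total = price * qty")
print(with_total["total"].tolist())
print("total" in df.columns)
```

Which form is faster depends on frame size and on whether numexpr is available, so it is worth timing both on your real data before committing to one.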

Another common speedup: df['category'] = df['category'].astype('category') converts a string column with few distinct, frequently repeated values into integer codes plus a small lookup table of the unique values, which saves memory and usually makes groupby faster.
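The memory effect is easy to check with memory_usage(deep=True). The sketch below assumes a synthetic column with four distinct labels; the label names and the row count are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic column with only four distinct string values (hypothetical labels).
rng = np.random.default_rng(0)
labels = np.array(["electronics", "groceries", "clothing", "toys"])
n = 100_000
df = pd.DataFrame({
    "category": labels[rng.integers(0, len(labels), size=n)],
    "price": rng.uniform(1.0, 100.0, size=n),
})

# deep=True counts the bytes of the strings themselves, not just the pointers.
bytes_before = df["category"].memory_usage(deep=True)
df["category"] = df["category"].astype("category")
bytes_as_category = df["category"].memory_usage(deep=True)
print(f"before: {bytes_before:,} bytes, as category: {bytes_as_category:,} bytes")

# observed=True restricts the result to categories that actually occur.
means = df.groupby("category", observed=True)["price"].mean()
print(means)
```

The saving shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for low-cardinality columns.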

Performance guide: [1], np.where: [2].

Tasks

Question 1

A colleague writes:

df['total'] = df.apply(lambda r: r['price'] * r['qty'], axis=1)

Which rewrite is both faster and equivalent?

Hint

Direct column arithmetic stays in C-compiled NumPy.

df['total'] = df['price'] * df['qty']

df['total'] = [df['price'][i] * df['qty'][i] for i in range(len(df))]

df['total'] = df.iterrows().map(lambda r: r['price'] * r['qty'])

df['total'] = df['price'].apply(lambda x: x * df['qty'])

Question 2

What does this code print?

import pandas as pd
df = pd.DataFrame({'city': ['NYC','LA','NYC'], 'age': [25, 35, 40]})
result = df.query('age > 30 and city == "NYC"')
print(len(result))

Hint

Row 0: age 25, not > 30. Row 1: age 35 but city LA. Row 2: age 40 and city NYC.

Error — query cannot combine conditions

Question 3

Using pandas and numpy, implement add_discount(csv_text: str) -> list that reads a CSV with columns price and qty. Add a column final that equals price * 0.9 where qty > 5, and price otherwise. Use np.where (not apply). Return the final column as a list of floats rounded to 2 decimals.

Example: price,qty\n100.0,3\n80.0,10 → [100.0, 72.0].

Submit the function; tests use expression mode.

Hint

np.where(df['qty'] > 5, df['price'] * 0.9, df['price']) — the vectorized version of a per-row if/else.

import io
import numpy as np
import pandas as pd


def add_discount(csv_text: str) -> list:
    # TODO: read CSV, add 'final' column using np.where, return as list.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

2 test cases will be used for grading
