Vectorized operations vs apply in pandas

Intermediate Data Science Praktikum
Created by Pavel · 03.04.2026 at 11:49 UTC

The performance gap between vectorized pandas code and row-wise apply is not 2x or 5x; on a million-row DataFrame it is often 100x or more. The reason is architectural: a vectorized expression like df['price'] * df['quantity'] hands two packed NumPy arrays to a compiled C routine that processes them in a single pass. By contrast, df.apply(lambda row: row['price'] * row['quantity'], axis=1) loops in the Python interpreter: it builds a Series object for every row, makes one Python function call per row, and collects the results one at a time. Both approaches scale linearly with row count, but the per-row cost of the interpreter path is far higher.
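The comparison above can be tried with a small timing sketch. The data here is synthetic and the row count is kept well below a million so the slow path finishes quickly; the helper name timed is made up for this illustration, and the measured ratio will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: column names follow the text, sizes and ranges are arbitrary.
rng = np.random.default_rng(0)
n_rows = 20_000
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, n_rows),
    "quantity": rng.integers(1, 20, n_rows),
})


def timed(func):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Vectorized: one compiled pass over two arrays.
vectorized, t_vec = timed(lambda: df["price"] * df["quantity"])

# Row-wise: one Series object and one Python call per row.
row_wise, t_apply = timed(
    lambda: df.apply(lambda row: row["price"] * row["quantity"], axis=1)
)

# Both approaches compute the same values; only the cost differs.
same_values = np.allclose(vectorized.to_numpy(), row_wise.to_numpy())
print(f"same values: {same_values}")
print(f"vectorized: {t_vec:.4f} s, apply(axis=1): {t_apply:.4f} s")
```

Increasing n_rows makes the difference easier to see, since the row-wise path pays its per-row overhead again for every added row.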

The rule of thumb: if you can express the operation as column arithmetic, np.where, a pandas method, or a combination of these, do that. Only reach for apply(axis=1) when the per-row logic is genuinely complex and resists decomposition into column operations — and even then, consider np.select for multi-branch conditionals or converting to raw NumPy with .to_numpy() and operating on the arrays directly.
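As a rough illustration of these alternatives, the sketch below uses np.where for a two-way branch, np.select for a multi-way branch, and .to_numpy() for plain array arithmetic. The column names, thresholds, and discount rates are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical order data, used only for this illustration.
df = pd.DataFrame({
    "price": [100.0, 80.0, 50.0, 20.0],
    "qty": [3, 10, 25, 1],
})

# Two-way branch: np.where(condition, value_if_true, value_if_false).
df["bulk"] = np.where(df["qty"] > 5, "yes", "no")

# Multi-way branch: conditions are checked in order and the first match wins,
# so the stricter threshold has to come first.
conditions = [df["qty"] >= 20, df["qty"] >= 5]
rates = [0.20, 0.10]
df["discount"] = np.select(conditions, rates, default=0.0)

# Raw NumPy arrays: arithmetic without pandas index alignment.
price = df["price"].to_numpy()
discount = df["discount"].to_numpy()
df["net"] = price * (1.0 - discount)

print(df)
```

Each of these replaces what would otherwise be an if/elif/else chain inside a function passed to apply(axis=1).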

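df.query('age > 30 and city == "NYC"') can be another fast path: pandas parses the string expression and, when the optional numexpr library is installed, evaluates it without building each intermediate boolean array as a separate Python-level temporary. df.eval('total = price * qty') evaluates column expressions the same way. The benefit shows up mainly on large frames; on small ones the parsing overhead can make both slower than plain boolean indexing or column arithmetic.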
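A minimal sketch of both calls on a toy frame follows. The data is invented, and on a frame this small no speedup should be expected; the point is only to show the syntax and that the results match the plain pandas equivalents.

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF"],
    "age": [25, 35, 40, 52],
    "price": [10.0, 20.0, 30.0, 40.0],
    "qty": [1, 2, 3, 4],
})

# Local variables can be referenced inside the query string with @.
min_age = 30
by_query = df.query('age > @min_age and city == "NYC"')

# The same filter written as an ordinary boolean mask.
by_mask = df[(df["age"] > min_age) & (df["city"] == "NYC")]
print(by_query.equals(by_mask))

# With the default inplace=False, eval returns a new DataFrame that has the
# extra column and leaves df itself unchanged.
with_total = df.eval("total = price * qty")
print(with_total)
```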

Another common speedup: df['category'] = df['category'].astype('category') converts a string column with repeated values into a compact integer representation, saving memory and making groupby faster.
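The memory effect can be checked directly, as in the sketch below. The labels and sizes are made up, and the exact byte counts depend on the pandas version and on how it stores strings, so only the direction of the change is worth reading off.

```python
import numpy as np
import pandas as pd

# Synthetic column with four repeated labels; names and sizes are arbitrary.
rng = np.random.default_rng(0)
labels = np.array(["books", "games", "music", "tools"])
n_rows = 100_000
df = pd.DataFrame({
    "category": rng.choice(labels, size=n_rows),
    "price": rng.uniform(1.0, 100.0, size=n_rows),
})

# Baseline: memory and group means with the column stored as strings.
bytes_as_text = df["category"].memory_usage(deep=True)
means_text = df.groupby("category")["price"].mean()

# Convert: each value becomes a small integer code into a list of categories.
df["category"] = df["category"].astype("category")
bytes_as_category = df["category"].memory_usage(deep=True)

# observed=True restricts the result to categories that occur in the data.
means_category = df.groupby("category", observed=True)["price"].mean()

print(f"as strings:  {bytes_as_text:,} bytes")
print(f"as category: {bytes_as_category:,} bytes")
print(df["category"].cat.codes.dtype)
```

The conversion pays off when the number of distinct values is small relative to the number of rows; a column of mostly unique strings gains little from it.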

Performance guide: [1], np.where: [2].


Sources

Tasks
Question 1

A colleague writes:

df['total'] = df.apply(lambda r: r['price'] * r['qty'], axis=1)

Which rewrite is both faster and equivalent?

Hint

Direct column arithmetic stays in C-compiled NumPy.

Question 2

What does the following code print?

import pandas as pd
df = pd.DataFrame({'city': ['NYC','LA','NYC'], 'age': [25, 35, 40]})
result = df.query('age > 30 and city == "NYC"')
print(len(result))

Hint

Row 0: age 25, not > 30. Row 1: age 35 but city LA. Row 2: age 40 and city NYC.

Question 3

Using pandas and numpy, implement add_discount(csv_text: str) -> list that reads a CSV with columns price and qty. Add a column final that equals price * 0.9 where qty > 5, and price otherwise. Use np.where (not apply). Return the final column as a list of floats rounded to 2 decimals.

Example: 'price,qty\n100.0,3\n80.0,10' → [100.0, 72.0].

Submit the function; tests use expression mode.

Hint

np.where(df['qty'] > 5, df['price'] * 0.9, df['price']) — the vectorized version of a per-row if/else.
