Vectorized operations vs apply in pandas
The performance gap between vectorized pandas code and row-wise apply is not 2x or 5x; it is often 100x or more on a million-row DataFrame. The reason is architectural: a vectorized expression like df['price'] * df['quantity'] hands two packed NumPy arrays to a compiled C routine that processes them in a single pass. df.apply(lambda row: row['price'] * row['quantity'], axis=1) instead loops in Python: it builds a Series object for every row, makes one Python function call per row, and collects the results one at a time. Both approaches scale linearly with row count, but the per-row interpreter overhead of apply is orders of magnitude larger than the per-element cost of the compiled loop.
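To make the comparison concrete, here is a minimal timing sketch. The data is synthetic and the row count is an assumption chosen so that the slow path finishes quickly; absolute timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration); fewer rows than the
# million-row case in the text so the row-wise path stays tolerable.
rng = np.random.default_rng(0)
n = 50_000
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, n),
    "quantity": rng.integers(1, 10, n),
})

# Vectorized: one call into compiled code over whole columns.
start = time.perf_counter()
vectorized = df["price"] * df["quantity"]
t_vec = time.perf_counter() - start

# Row-wise: one Series construction and one Python call per row.
start = time.perf_counter()
row_wise = df.apply(lambda row: row["price"] * row["quantity"], axis=1)
t_apply = time.perf_counter() - start

# Both paths compute the same values; only the cost differs.
assert np.allclose(vectorized, row_wise)
print(f"vectorized: {t_vec:.4f}s  apply(axis=1): {t_apply:.4f}s")
```

A single timing run like this is noisy; for a more careful measurement, timeit or IPython's %timeit would be the usual choice.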
The rule of thumb: if you can express the operation as column arithmetic, np.where, a pandas method, or a combination of these, do that. Only reach for apply(axis=1) when the per-row logic is genuinely complex and resists decomposition into column operations — and even then, consider np.select for multi-branch conditionals or converting to raw NumPy with .to_numpy() and operating on the arrays directly.
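The alternatives named above could look roughly like this. The column names, thresholds, and labels are hypothetical, chosen only to show the shape of each call.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "price": [5.0, 20.0, 80.0, 150.0],
    "quantity": [10, 3, 1, 2],
})

# Two-way branch: np.where(condition, value_if_true, value_if_false).
df["bulk"] = np.where(df["quantity"] >= 3, "yes", "no")

# Multi-way branch: conditions are checked in order and the first match
# wins; rows matching none of them get the default.
conditions = [df["price"] < 10, df["price"] < 100]
choices = ["cheap", "mid"]
df["tier"] = np.select(conditions, choices, default="premium")

# Dropping to raw arrays skips index alignment and Series overhead.
price = df["price"].to_numpy()
quantity = df["quantity"].to_numpy()
df["total"] = price * quantity

print(df)
```

Because np.select takes the first matching condition, the conditions are ordered from most to least specific, the same way an if/elif chain would be.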
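df.query('age > 30 and city == "NYC"') can be another fast path on large DataFrames: pandas parses the string expression and, when the optional numexpr library is installed, evaluates it in compiled code instead of building each intermediate boolean array as a separate Python-level object. df.eval('total = price * qty') similarly evaluates column expressions without Python-level temporaries; it returns a new DataFrame unless inplace=True is passed. On small frames the parsing overhead outweighs the savings, so plain boolean indexing is usually faster there.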
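As a small sketch of how these two calls behave, the following checks that query selects the same rows as an explicit boolean mask and that eval adds a derived column. The data is made up for illustration, and a frame this small shows only the semantics, not any speed difference.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "age": [25, 35, 42, 31],
    "city": ["NYC", "NYC", "LA", "NYC"],
    "price": [10.0, 20.0, 30.0, 40.0],
    "qty": [1, 2, 3, 4],
})

# Explicit boolean mask: builds two boolean Series, then combines them.
mask_result = df[(df["age"] > 30) & (df["city"] == "NYC")]

# Same filter written as a query string.
query_result = df.query('age > 30 and city == "NYC"')
assert mask_result.equals(query_result)

# eval with an assignment returns a new DataFrame by default;
# df itself is left unchanged unless inplace=True is passed.
with_total = df.eval("total = price * qty")
print(with_total)
```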
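Another common speedup: df['category'] = df['category'].astype('category') converts a string column with many repeated values into small integer codes plus a lookup table of the distinct values, which saves memory and usually makes groupby faster. The benefit shrinks as the number of distinct values approaches the number of rows.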
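A rough way to see the memory effect is to compare memory_usage before and after the conversion. The four city labels and the row count below are assumptions for illustration; the exact byte counts depend on the pandas version and string storage.

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 distinct labels over 100,000 rows.
labels = ["NYC", "LA", "Chicago", "Boston"]
n = 100_000
df = pd.DataFrame({
    "category": [labels[i % len(labels)] for i in range(n)],
    "value": range(n),
})

before = df["category"].memory_usage(deep=True)
df["category"] = df["category"].astype("category")
after = df["category"].memory_usage(deep=True)
print(f"string column: {before:,} bytes  categorical: {after:,} bytes")

# Each row now stores a small integer code; the labels live once in
# df["category"].cat.categories.
codes = df["category"].cat.codes

# observed=True restricts the result to categories present in the data.
sums = df.groupby("category", observed=True)["value"].sum()
print(sums)
```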
Performance guide: [1], np.where: [2].