Measure first, then vectorise

Advanced Python for Data Science
Created by Best · 24.06.2026 at 14:03 UTC

Your intuition about where a program spends its time is almost always wrong, so you ask the program directly with a profiler:

import cProfile
cProfile.run("run_pipeline(data)", sort="cumulative")     # reports time per function, sorted by cumulative time

In a notebook, %timeit some_expression() gives quick, repeated timings of a single line. Almost always you find the same pattern: a small fraction of the code — often a single loop — accounts for the overwhelming majority of the runtime. That hot spot is the only place worth touching.
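A slightly fuller sketch of the same idea, for cases where you want only the top few rows of the report. The functions run_pipeline and slow_square_sum below are hypothetical stand-ins, not the card's actual pipeline; the profiling calls themselves come from the standard library's cProfile and pstats modules.

```python
import cProfile
import io
import pstats


def slow_square_sum(xs):
    # Hypothetical hot spot: a pure-Python loop over every element.
    total = 0.0
    for x in xs:
        total += x * x
    return total


def run_pipeline(data):
    # Hypothetical stand-in for a real pipeline: filter, then aggregate.
    cleaned = [x for x in data if x >= 0]
    return slow_square_sum(cleaned)


data = list(range(200_000))

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
result = run_pipeline(data)
profiler.disable()

# Write the report into a string, sorted by cumulative time, top 5 rows only.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In a report like this, the row to look at is the function whose own time (the tottime column) dominates; in this toy example that should be the loop in slow_square_sum.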

Once you've found the hot spot, apply fixes cheapest first — and the cheapest big win is usually vectorisation. Moving a Python loop into a whole-array NumPy or pandas expression pushes the iteration into compiled C:

# slow: a Python loop
total = 0.0
for x in xs:
    total += x * x

# fast: vectorised
import numpy as np
total = float(np.sum(np.asarray(xs) ** 2))

This typically gives the largest speed-up without adding any dependency beyond the NumPy or pandas you already use, which is why it's the first thing to try.
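To check the gain on your own data rather than take it on trust, you can time both versions with the standard library's timeit module. This is a minimal sketch: the helper names and the input size are illustrative assumptions, and the actual ratio depends on your machine and array size.

```python
import timeit

import numpy as np

# Illustrative input: 100,000 floats, once as a list and once as an array.
xs = [float(i) for i in range(100_000)]
arr = np.asarray(xs)


def loop_sum_of_squares(values):
    # Pure-Python loop: one interpreter step per element.
    total = 0.0
    for x in values:
        total += x * x
    return total


def vectorised_sum_of_squares(array):
    # Whole-array expression: the iteration happens inside NumPy.
    return float(np.sum(array ** 2))


# Confirm both versions agree before comparing their speed.
loop_result = loop_sum_of_squares(xs)
vec_result = vectorised_sum_of_squares(arr)
assert np.isclose(loop_result, vec_result)

# Take the best of several repeats to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum_of_squares(xs), number=5, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorised_sum_of_squares(arr), number=5, repeat=3))

print(f"loop:       {loop_time:.4f} s for 5 runs")
print(f"vectorised: {vec_time:.4f} s for 5 runs")
```

The array is built once, outside the timed call. Converting a list with np.asarray on every call has its own cost, so the gain is largest when the data already lives in an array.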

Related cards
Builds on Reading the report; leakage and cross-validation · Python for Data Science
Next Numba and better algorithms · Python for Data Science
Tasks
Question 1

You suspect a script is slow. What is the correct FIRST step?

Question 2

For a numeric pipeline, which optimisation should you usually try FIRST (cheapest, no new dependency)?

Card Info
  • Topic: Python for Data Science
  • Difficulty: Advanced