Measure first, then vectorise
Your intuition about where a program spends its time is almost always wrong, so you ask the program directly with a profiler:
import cProfile
cProfile.run("run_pipeline(data)", sort="cumulative")  # per-function timings, sorted by cumulative time
In a notebook, %timeit some_expression() gives quick, repeated timings of a single line. Almost always you find the same pattern: a small fraction of the code — often a single loop — accounts for the overwhelming majority of the runtime. That hot spot is the only place worth touching.
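For a report you can trim and sort yourself, the standard-library pstats module reads the profiler's output. The sketch below is illustrative only: run_pipeline and slow_sum_of_squares are hypothetical stand-ins for your own code, with one deliberately slow loop so the hot spot is visible in the report.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(xs):
    """Hypothetical hot spot: a pure-Python loop."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


def run_pipeline(data):
    """Hypothetical pipeline: a cheap conversion step, then the expensive loop."""
    cleaned = [float(x) for x in data]
    return slow_sum_of_squares(cleaned)


def profile_report(data, top=5):
    """Profile run_pipeline and return its result plus a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = run_pipeline(data)
    profiler.disable()

    # Write the report into a string instead of stdout so it can be inspected.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)  # show only the top entries
    return result, buffer.getvalue()


result, report = profile_report(range(200_000))
print(report)
```

In the printed table, the rows with the largest cumulative time are the candidates for optimisation; here that should be slow_sum_of_squares.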
Once you've found the hot spot, apply fixes cheapest first — and the cheapest big win is usually vectorisation. Moving a Python loop into a whole-array NumPy or pandas expression pushes the iteration into compiled C:
# slow: a Python loop
total = 0.0
for x in xs:
    total += x * x
# fast: vectorised
import numpy as np
total = float(np.sum(np.asarray(xs) ** 2))
Because NumPy and pandas are already part of a typical data-science stack, this usually gives the largest speed-up without adding a dependency, which is why it's the first thing to try.
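To check the gain on your own data rather than take it on trust, you can time both versions with the standard-library timeit module. This is a minimal sketch: the function names, the array size, and the repeat counts are arbitrary choices for illustration, and the measured ratio will vary with machine and input.

```python
import timeit

import numpy as np


def loop_sum_of_squares(xs):
    """Pure-Python loop: one interpreter step per element."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


def vectorised_sum_of_squares(xs):
    """Whole-array expression: the iteration runs inside NumPy."""
    return float(np.sum(np.asarray(xs) ** 2))


# Illustrative input: 100,000 random floats from a seeded generator.
xs = np.random.default_rng(0).random(100_000)

# Confirm both versions agree before comparing their speed.
loop_result = loop_sum_of_squares(xs)
vec_result = vectorised_sum_of_squares(xs)

# Take the best of several repeats to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum_of_squares(xs), number=5, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorised_sum_of_squares(xs), number=5, repeat=3))

print(f"loop:       {loop_time:.4f} s")
print(f"vectorised: {vec_time:.4f} s")
```

The two results can differ in the last few digits because NumPy adds the terms in a different order, so compare them with a tolerance rather than with ==.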
cross-validation” and leads into “Numba and better algorithms”.*
Card Info
- Topic: Python for Data Science
- Difficulty: Advanced