Reductions, statistics, and views vs copies
The real reason to switch is vectorisation: you write whole-array expressions and NumPy runs the loop for you in compiled code. To scale every element, write a * 2. For a per-column or per-row total, reduce along an axis:
# assuming a = np.array([[1, 2, 3], [4, 5, 6]], dtype=float), which matches the sums shown
a.sum(axis=0) # column sums: collapse the rows -> array([5., 7., 9.])
a.sum(axis=1) # row sums: collapse the cols -> array([ 6., 15.])
The rule worth memorising: axis=0 collapses the rows, leaving one value per column; axis=1 collapses the columns, leaving one value per row. Mixing up the axes is one of the most common NumPy slips.
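One way to make the axis rule stick is to watch the shapes: the axis you name is the one that disappears from the result. The sketch below assumes a small 2x3 array (values chosen to match the sums quoted above; the variable names are illustrative) and also shows keepdims=True, which keeps the collapsed axis with length 1.

```python
import numpy as np

# Assumed example data: a 2x3 array whose sums match the ones quoted above.
a = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=float)

doubled = a * 2                  # element-wise scaling, no explicit loop

col_sums = a.sum(axis=0)         # axis 0 (the rows) disappears -> shape (3,)
row_sums = a.sum(axis=1)         # axis 1 (the columns) disappears -> shape (2,)

# keepdims=True keeps the collapsed axis as length 1 instead of dropping it
col_sums_2d = a.sum(axis=0, keepdims=True)   # shape (1, 3)

print(a.shape, col_sums.shape, row_sums.shape, col_sums_2d.shape)
```

If you are unsure which axis you need, printing .shape before and after the reduction is usually the fastest check.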
Arrays carry their statistics as methods that run in compiled code:
a = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
a.mean() # 5.0
a.std() # 2.0 (population standard deviation, the NumPy default)
One detail to remember: .std() defaults to the population standard deviation (dividing by n, i.e. ddof=0); pass ddof=1 if you want the sample version. These one-line reductions replace the explicit loops you wrote earlier and are far faster on large arrays.
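To see what ddof actually changes, the sketch below recomputes both versions by hand from the sum of squared deviations and checks them against .std(). The intermediate names (squared_dev, pop_std, sample_std) are illustrative, not part of NumPy.

```python
import math
import numpy as np

a = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

n = a.size
squared_dev = ((a - a.mean()) ** 2).sum()   # sum of squared deviations from the mean

pop_std = a.std()              # ddof=0: divides by n
sample_std = a.std(ddof=1)     # ddof=1: divides by n - 1

# Both agree with the textbook formulas
assert math.isclose(pop_std, math.sqrt(squared_dev / n))
assert math.isclose(sample_std, math.sqrt(squared_dev / (n - 1)))

print(pop_std, sample_std)
```

The sample version is always slightly larger, and the gap shrinks as n grows.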
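Two subtleties to file away. First, basic slicing (a[1:4], a[:, 0]) returns a view: a window onto the same memory, not a copy, so modifying the slice modifies the original array. Avoiding the copy is part of what makes NumPy fast, but it can surprise you; use .copy() when you need an independent array. Indexing with a boolean mask or a list of indices, by contrast, returns a copy.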
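The view behaviour is easy to confirm for yourself. This minimal sketch (array contents and variable names are arbitrary) writes through a slice, then repeats the edit on a .copy(), and uses np.shares_memory to show which of the two still points at the original buffer.

```python
import numpy as np

a = np.arange(6, dtype=float)    # [0., 1., 2., 3., 4., 5.]

view = a[1:4]                    # basic slice: shares memory with a
view[0] = 99.0                   # writes through to a[1]

independent = a[1:4].copy()      # separate buffer
independent[1] = -1.0            # a is unaffected by this edit

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, independent))
```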
Second, an array is homogeneous: every element shares one dtype. If you accidentally mix numbers and strings, NumPy converts everything to a string dtype (or falls back to dtype=object for mixes such as numbers and None), and you lose both the arithmetic and the speed. Keep a numeric array numeric. A further limitation is that plain arrays have no column names and can't join tables, which is exactly pandas' job, coming next.
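A quick way to catch an accidentally non-numeric array is to inspect .dtype right after building it. In this sketch (the sample values are made up), a stray string turns the whole array into a Unicode string array, while a stray None produces dtype=object; in both cases the fast numeric path is gone.

```python
import numpy as np

numeric = np.array([1, 2, 3])
mixed = np.array([1, 2, "three"])     # the numbers are converted to strings
with_none = np.array([1, 2, None])    # elements stay as generic Python objects

print(numeric.dtype.kind)    # 'i' for a signed integer dtype
print(mixed.dtype.kind)      # 'U' for a Unicode string dtype
print(with_none.dtype)       # object
```

Checking dtype.kind against "iuf" (signed int, unsigned int, float) is one simple guard before running heavy numeric code.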