Reductions, statistics, and views vs copies

The real reason to switch is vectorisation: you write whole-array expressions and NumPy runs the loop for you in compiled code. To scale every element, write a * 2. For a per-column or per-row total, reduce along an axis:

# assuming a is a 2x3 float array such as np.array([[1., 2., 3.], [4., 5., 6.]])
a.sum(axis=0)    # column sums: collapse the rows -> array([5., 7., 9.])
a.sum(axis=1)    # row sums:    collapse the cols -> array([6., 15.])

The rule worth memorising is "axis=0 reduces down the rows, leaving one value per column." Put another way, the axis you name is the one that disappears from the shape. Mixing up the axes is the most common NumPy slip.
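To see the axis rule in terms of shapes, here is a minimal sketch. The 2x3 array is an assumption, chosen only because it reproduces the sums quoted above; the section itself does not show how a was built.

```python
import numpy as np

# Assumed 2x3 array (not given in the text), picked to match the quoted sums.
a = np.array([[1., 2., 3.],
              [4., 5., 6.]])

doubled = a * 2             # vectorised: every element scaled, no Python loop
col_sums = a.sum(axis=0)    # axis 0 (the rows) collapses -> one value per column
row_sums = a.sum(axis=1)    # axis 1 (the columns) collapses -> one value per row

print(a.shape, col_sums.shape, row_sums.shape)   # (2, 3) (3,) (2,)
```

Checking .shape before and after a reduction is a cheap way to confirm you collapsed the axis you meant to.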

Arrays carry their statistics as methods that run in compiled code:

a = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
a.mean()    # 5.0
a.std()     # 2.0  (population standard deviation, the NumPy default)

One detail to remember: .std() defaults to the population standard deviation (dividing by n, i.e. ddof=0); pass ddof=1 if you want the sample version. These one-line reductions replace the explicit loops you wrote earlier and are far faster on large arrays.
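The difference between the two defaults is easy to verify by hand. The sketch below reuses the eight values from the example above and writes both formulas out explicitly; the manual_* names are illustrative only.

```python
import numpy as np

a = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

pop_std = a.std()             # ddof=0: divide by n
sample_std = a.std(ddof=1)    # ddof=1: divide by n - 1

# The same two quantities written out by hand, for comparison.
squared_dev = (a - a.mean()) ** 2
manual_pop = np.sqrt(squared_dev.sum() / a.size)
manual_sample = np.sqrt(squared_dev.sum() / (a.size - 1))

print(pop_std, manual_pop)          # both 2.0
print(sample_std, manual_sample)    # both sqrt(32 / 7), a little above 2.0
```

The sample version is always at least as large as the population version, and the gap shrinks as n grows.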

Two subtleties to file away. First, basic slicing (a[1:4], for example) returns a view: a window onto the same memory, not a copy, so modifying a slice modifies the original array. Indexing with a boolean mask or an integer array, by contrast, returns a copy. Views keep NumPy fast but can surprise you; use .copy() when you need an independent array.
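A short sketch of the view behaviour, using a made-up six-element array. np.shares_memory is used here only as a way to confirm whether two arrays overlap in memory.

```python
import numpy as np

a = np.arange(6, dtype=float)    # [0., 1., 2., 3., 4., 5.]

view = a[1:4]                    # basic slice: shares memory with a
view[0] = 99.0                   # writes through, so a[1] changes too

independent = a[1:4].copy()      # separate buffer
independent[1] = -1.0            # a is left alone

print(a)                                   # [ 0. 99.  2.  3.  4.  5.]
print(np.shares_memory(a, view))           # True
print(np.shares_memory(a, independent))    # False
```

If a function should not alter its caller's data, copying the slice at the top of the function is the simplest safeguard.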

Second, an array is homogeneous: every element shares one dtype. If you accidentally mix numbers and strings, NumPy converts every element to a string, and other mixtures (numbers and None, say) fall back to dtype=object. Either way you lose the fast numeric path, so keep a numeric array numeric. A separate limitation is that arrays have no column names and can't join tables, which is exactly pandas' job, coming next.
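A quick way to catch this is to inspect .dtype right after building an array. The sketch below uses made-up values to show how a stray string or None changes the result. The exact string width NumPy reports can vary, so it prints only the dtype kind for that case.

```python
import numpy as np

numeric = np.array([1, 2, 3.5])           # ints and floats promote to float64
mixed_str = np.array([1, 2, "three"])     # a stray string: all elements become strings
mixed_none = np.array([1, 2, None])       # a stray None: generic Python objects

print(numeric.dtype)           # float64
print(mixed_str.dtype.kind)    # U (unicode string)
print(mixed_none.dtype)        # object

# A cheap guard before doing arithmetic on an array you did not build yourself.
is_numeric = np.issubdtype(numeric.dtype, np.number)
```

A check like np.issubdtype(arr.dtype, np.number) at the point where data enters your code tends to surface the problem earlier than a slow or failing computation would.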
