Why pandas uses columnar storage
You hand someone a spreadsheet of 50 000 rows and ask "what's the average age?" They don't read every row one by one; they run their finger straight down the age column. Pandas does much the same thing internally: with the default NumPy-backed dtypes, each column's values sit in a typed NumPy array (columns that share a dtype may be grouped into one block), so df['age'].mean() is a single pass over packed int64 values in compiled code rather than a Python loop. This is columnar storage, often described as a dictionary of lists: one key per column name, one typed array of values per key.
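A minimal sketch of that idea, using made-up data (the names and ages are illustrative, not from any real dataset). It only shows that a column can be viewed as a typed NumPy array and aggregated in one call; it does not expose pandas' internal block layout.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; column names and values are invented.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dev"],
    "age": pd.Series([34, 28, 45, 39], dtype="int64"),
})

# A single column comes back as a NumPy array with one dtype.
ages = df["age"].to_numpy()
print(type(ages).__name__, ages.dtype)  # ndarray int64

# The mean runs over the whole array at once, not row by row.
mean_age = df["age"].mean()
print(mean_age)  # 36.5
```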
The alternative is what JSON gives you: a list of dictionaries, one dict per row, every dict repeating all the column names. In Python, each dict carries its own hash table and each value is a separate boxed object, so the same data typically takes several times more memory than a packed array. Aggregating means hopping from dict to dict and extracting 'age' each time through the Python interpreter, which is typically tens to hundreds of times slower on large frames.
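The two layouts can be put side by side in a small sketch. The data is generated and purely hypothetical, and plain NumPy arrays stand in for pandas columns here. Both aggregations give the same answer; the difference is how many Python-level steps each one takes.

```python
import numpy as np

# Row-oriented: a list of dicts, the shape JSON decodes to (invented data).
rows = [{"name": f"user{i}", "age": 20 + i % 50} for i in range(50_000)]

# Column-oriented: one typed array per column name.
columns = {
    "name": np.array([r["name"] for r in rows]),
    "age": np.array([r["age"] for r in rows], dtype=np.int64),
}

# Row-wise mean: one dict lookup per row, driven by the interpreter.
row_mean = sum(r["age"] for r in rows) / len(rows)

# Column-wise mean: a single call over a packed int64 buffer.
col_mean = columns["age"].mean()

print(row_mean, col_mean)
# 50 000 values at 8 bytes each.
print(columns["age"].nbytes)  # 400000
```

Wrapping each aggregation in timeit shows the speed gap on your own machine; the exact ratio depends on hardware and data size.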
Columnar wins for analytical workloads: filtering, aggregation, vectorized math, and ML preprocessing. Row-oriented wins for transactional systems and APIs where you always want the full record and rarely compute column-wide statistics. When you call df.to_dict(orient='records') to feed a JSON API, you are crossing from one world to the other — sometimes necessary, but worth knowing the cost.
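As a small illustration of that crossing, the sketch below converts a tiny invented frame to records and back. DataFrame.to_dict and DataFrame.from_records are real pandas calls; the data is hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [34, 28]})

# Columnar -> row-oriented: one dict per row, column names repeated in each.
records = df.to_dict(orient="records")
print(records)

# Row-oriented -> columnar: pandas regroups the values by column.
df_again = pd.DataFrame.from_records(records)
print(df_again["age"].tolist())
```

Each conversion touches every cell, so on large frames it is worth doing once at the boundary of the pipeline rather than repeatedly inside it.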
If your pipeline is slow, a good first question is whether you accidentally left a row-wise loop (a for loop, iterrows, or apply with axis=1) where a column operation would do. The card "Vectorized operations vs apply in pandas" in this deck covers that in detail.
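A sketch of what such a leftover loop can look like next to its column-wise replacement. The price and qty columns are invented for illustration.

```python
import pandas as pd

# Hypothetical order lines.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-wise: iterrows builds a Series per row and the loop runs in Python.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Column-wise: one vectorized multiplication over the two columns.
totals_vec = df["price"] * df["qty"]

print(totals_loop)          # [10.0, 40.0, 90.0]
print(totals_vec.tolist())  # [10.0, 40.0, 90.0]
```

Both give the same totals; only the second keeps the work inside compiled array operations.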
Pandas internals overview: [1], NumPy array docs: [2].
Card Info
- Topic: Data Science Praktikum
- Difficulty: Beginner