Why pandas uses columnar storage

Beginner Data Science Praktikum
Created by Pavel · 03.04.2026 at 11:49 UTC · 1 completed

You hand someone a spreadsheet of 50 000 rows and ask "what's the average age?" They don't read every row one by one — they run their finger straight down the age column. Pandas does the same thing internally: it stores each column as its own contiguous NumPy array, so df['age'].mean() is a single pass over a packed block of int64 values at C speed. This is columnar storage, often described as a dictionary of lists — one key per column name, one typed array per value list.

The alternative is what JSON gives you: a list of dictionaries, one dict per row, every dict repeating all the column names. Memory doubles from the repeated keys alone, and aggregating means hopping from dict to dict, extracting 'age' each time through the Python interpreter — thousands of times slower for large frames.

Columnar wins for analytical workloads: filtering, aggregation, vectorized math, and ML preprocessing. Row-oriented wins for transactional systems and APIs where you always want the full record and rarely compute column-wide statistics. When you call df.to_dict(orient='records') to feed a JSON API, you are crossing from one world to the other — sometimes necessary, but worth knowing the cost.

If your pipeline is slow, the first question is often whether you accidentally left a row-wise loop where a column operation would do. Vectorized operations vs apply in pandas in this deck covers that in detail.

Pandas internals overview: [1], NumPy array docs: [2].


Sources

University approvals: 0
Tasks
Question 1

What does this code produce?

import pandas as pd
data = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(data)
print(type(df['age'].values).__name__)
Hint

.values extracts the underlying storage from a pandas Series.

Question 2

A colleague writes [row['price'] * row['qty'] for _, row in df.iterrows()] to compute totals. What is the main performance problem?

Hint

Compare one C-level multiply over two arrays vs thousands of Python-level multiplies.

Question 3

What does this code print?

import pandas as pd
df = pd.DataFrame({'x': [10, 20, 30]})
print(df.to_dict(orient='list'))
Hint

orient='list' mirrors pandas' internal columnar layout.

Card Info
  • Topic: Data Science Praktikum
  • Difficulty: Beginner
  • Completed: 1 users
Creator
Pavel
Pavel