Choosing charts and headless plotting
Four charts do most of the everyday work, and each answers a different question:
- a histogram shows the distribution of one variable — its shape, spread, and outliers;
- a scatter shows the relationship between two variables;
- a line shows a trend over an ordered axis, usually time;
- a bar compares groups.
Choosing the wrong chart hides the very thing you're looking for: a scatter won't show you skew, and a histogram won't show you a trend. Tie every mark directly to an array or column so the figure says exactly, and only, what the numbers say.
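As a rough illustration of the four chart types above, here is a minimal sketch that draws each one from its own array. All of the data and variable names (wait_minutes, hours, score, sales, group_means) are made up for the example, and the Agg backend is selected only so the sketch also runs where there is no display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the made-up data is reproducible

# Hypothetical illustration data; none of these names come from a real dataset.
wait_minutes = rng.exponential(scale=10, size=500)     # one skewed variable
hours = rng.uniform(0, 10, size=100)                   # first of two related variables
score = 50 + 4 * hours + rng.normal(0, 5, size=100)    # second, depends on hours
months = np.arange(1, 13)                              # an ordered axis
sales = 100 + 5 * months + rng.normal(0, 8, size=12)   # values along that axis
groups = ["A", "B", "C"]                               # categories to compare
group_means = [23.0, 31.5, 18.2]

fig, axes = plt.subplots(2, 2, figsize=(9, 7))

# Histogram: shape, spread, and outliers of a single variable.
axes[0, 0].hist(wait_minutes, bins=30)
axes[0, 0].set_title("Histogram: one variable's distribution")

# Scatter: relationship between two variables.
axes[0, 1].scatter(hours, score)
axes[0, 1].set_title("Scatter: relationship between two variables")

# Line: trend over an ordered axis.
axes[1, 0].plot(months, sales)
axes[1, 0].set_title("Line: trend over an ordered axis")

# Bar: comparison across groups.
axes[1, 1].bar(groups, group_means)
axes[1, 1].set_title("Bar: comparison of groups")

fig.tight_layout()
```

Each call takes its arrays directly, so every bar, point, and line segment traces back to a named variable.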
One detail trips people up when plotting outside a notebook: on a server or in automated grading there is no display, so you set a non-interactive backend with matplotlib.use("Agg") before importing pyplot. The Agg backend renders straight to image files (PNG and the like) without needing a graphical window.
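A minimal sketch of that headless setup might look like the following. The data and the output filename trend.png are placeholders chosen for the example; the point is the order of the first three lines.

```python
import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before pyplot is imported
import matplotlib.pyplot as plt

values = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up data for illustration

fig, ax = plt.subplots()
ax.plot(values)
ax.set_xlabel("index")
ax.set_ylabel("value")

out_path = "trend.png"  # hypothetical output filename
fig.savefig(out_path, dpi=150)  # write the image to disk instead of opening a window
plt.close(fig)  # release the figure's memory, useful in long-running scripts
```

Saving with fig.savefig and then closing the figure takes the place of plt.show(), which has nothing to display on a server.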
Keep in mind matplotlib's limitation: it draws what you give it but can't fix it. No plot rescues dirty data, and a beautiful chart of the wrong column is just a confident mistake. What exploratory data analysis (EDA) surfaces here sets the agenda for the validation and feature choices that follow.
This card leads into “Reading JSON and the many faces of missing”.
Card Info
- Topic: Python for Data Science
- Difficulty: Intermediate