The capstone and the full pipeline

Advanced Python for Data Science

Created by Best · 24.06.2026 at 14:03 UTC

The capstone is not a new topic — it's every topic, used at once. You take a dataset from a raw file all the way to a defended conclusion, and you make the work reproducible by someone who has never met you.

It's the artefact you can show an employer or collaborator, and the proof that you can choose the right tool for each step rather than just recognising tools in isolation. What distinguishes it from a notebook that merely "gets a number" is not the sophistication of the model, but that the result is reproducible and honestly evaluated.

The project runs the full pipeline you've assembled piece by piece:

Load and validate the data — parse each column into the right type, handle missing values deliberately.
Explore it — plot distributions and relationships before modelling.
Engineer features — small, named, typed transforms.
Model it — a clean train/test split and scikit-learn's estimator API.
Evaluate honestly — the right metric, scored on held-out data, reported with its uncertainty, never just bare accuracy.
Write it up — question, method, results and their limitations, and what you'd do next.
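
To make those steps concrete, here is a minimal sketch of the load, engineer, model and evaluate stages in one script. Everything specific in it is an assumption for illustration: the table is synthetic, the column names (age, income, churned) and the helper functions are hypothetical, and logistic regression, balanced accuracy and a percentile bootstrap are just one reasonable set of choices, not the required ones. Exploration plots and the write-up are left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # one fixed seed so a rerun gives the same split and the same scores


def make_raw_table(n_rows: int = 400) -> pd.DataFrame:
    """Hypothetical stand-in for a raw file: all columns are text, some values missing."""
    rng = np.random.default_rng(SEED)
    age = rng.integers(18, 80, size=n_rows)
    income = rng.normal(50_000, 15_000, size=n_rows).round(2)
    # Synthetic label, loosely driven by both columns plus noise.
    score = 0.04 * (age - 50) + (income - 50_000) / 15_000 + rng.normal(0, 1, n_rows)
    raw = pd.DataFrame({
        "age": age.astype(str),
        "income": income.astype(str),
        "churned": np.where(score > 0.8, "yes", "no"),
    })
    raw.loc[rng.choice(n_rows, size=20, replace=False), "income"] = ""  # missing
    return raw


def load_and_validate(raw: pd.DataFrame) -> pd.DataFrame:
    """Parse each column into an explicit type and fail loudly on bad values."""
    income_text = raw["income"].mask(raw["income"] == "")  # "" means missing
    df = pd.DataFrame({
        "age": pd.to_numeric(raw["age"], errors="raise").astype("int64"),
        "income": pd.to_numeric(income_text, errors="raise").astype("float64"),
        "churned": raw["churned"].map({"yes": 1, "no": 0}),
    })
    if df["churned"].isna().any():
        raise ValueError("unexpected label value")
    if not df["age"].between(0, 120).all():
        raise ValueError("age out of range")
    df["churned"] = df["churned"].astype("int64")
    return df


def add_log_income(df: pd.DataFrame) -> pd.DataFrame:
    """One small, named transform: log-scale income; missing values stay missing."""
    out = df.copy()
    out["log_income"] = np.log1p(out["income"].clip(lower=0))
    return out


def fit_and_evaluate(df: pd.DataFrame, n_boot: int = 500) -> dict:
    """Fit on the training split, score on held-out rows, bootstrap an interval."""
    features = ["age", "log_income"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["churned"], test_size=0.25,
        stratify=df["churned"], random_state=SEED,
    )
    # Imputation and scaling sit inside the pipeline, so they are fitted on the
    # training rows only and nothing leaks from the test set.
    model = make_pipeline(
        SimpleImputer(strategy="median"), StandardScaler(), LogisticRegression()
    )
    model.fit(X_train, y_train)
    y_true = y_test.to_numpy()
    y_pred = model.predict(X_test)

    # Percentile bootstrap over test rows: a rough 95% interval for the metric.
    rng = np.random.default_rng(SEED)
    idx = rng.integers(0, len(y_true), size=(n_boot, len(y_true)))
    boot = [balanced_accuracy_score(y_true[i], y_pred[i]) for i in idx]
    low, high = np.percentile(boot, [2.5, 97.5])
    return {
        "balanced_accuracy": float(balanced_accuracy_score(y_true, y_pred)),
        "ci_low": float(low), "ci_high": float(high), "n_test": len(y_true),
    }


if __name__ == "__main__":
    table = add_log_income(load_and_validate(make_raw_table()))
    print(fit_and_evaluate(table))
```

Two design choices are worth noticing. The imputer and scaler live inside the scikit-learn pipeline rather than being applied to the whole table first, which keeps the test rows out of every fitted statistic. And the score is returned together with an interval and the test-set size, so the write-up can report how much the number might move instead of quoting it bare.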

This card builds on “Reproducibility: environments, containers, seeds” and leads into “What you hand in and what good looks like”.

Related cards

Builds on Reproducibility: environments, containers, seeds · Python for Data Science

Next What you hand in and what good looks like · Python for Data Science

Tasks

Question 1

Which single property most distinguishes a professional data science capstone from a notebook that merely "gets a number"?

It uses the most advanced model available

It is reproducible and evaluated honestly — runs from a clean checkout with held-out evaluation and stated uncertainty

It has the most lines of code

It avoids using any libraries

Question 2

In the capstone, where do you measure your model's performance?

On the training data, since the model has already seen it

On held-out test data, using the right metric and acknowledging uncertainty

On whichever split gives the best number

On any data, since accuracy alone is enough

Card Info

Topic: Python for Data Science
Difficulty: Advanced
