Project paths and feature pipelines
The same instinct produces small utilities that keep a project tidy. A recurring nuisance is file paths: a notebook that finds data/... when run from one folder fails to find it when run from another. A repo_root() helper walks upward from the current directory until it reaches the folder that contains data/, so paths resolve no matter where the code runs:
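from pathlib import Path

def repo_root() -> Path:
    """Return the nearest folder, at or above the current one, that contains data/."""
    here = Path.cwd().resolve()
    # Check the current directory first, then each parent up to the filesystem root.
    for p in [here, *here.parents]:
        if (p / "data").is_dir():
            return p
    raise FileNotFoundError("Run from inside the project.")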
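To see the upward walk in action without depending on your own folder layout, here is a minimal sketch. It uses a hypothetical variant, find_root(start, marker), which is not part of the course code: it takes the starting folder and the marker directory name as parameters so the lookup can be tried against a throwaway project. The folder names and example.csv are made up for illustration.

```python
from pathlib import Path
import tempfile


def find_root(start: Path, marker: str = "data") -> Path:
    """Hypothetical variant of repo_root(): search upward from `start`
    for the nearest folder that contains a `marker` directory."""
    here = start.resolve()
    for p in [here, *here.parents]:
        if (p / marker).is_dir():
            return p
    raise FileNotFoundError(f"No '{marker}' directory at or above {here}.")


# Build a throwaway project and look up its root from a nested folder.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    (project / "data").mkdir(parents=True)
    nested = project / "notebooks" / "week1"
    nested.mkdir(parents=True)

    found = find_root(nested)
    same_root = found == project               # the walk stops at the project folder
    csv_path = found / "data" / "example.csv"  # made-up file name, never opened
```

Passing the start folder in, rather than always reading Path.cwd(), is one way to make such a helper easy to test. The course's repo_root() keeps the simpler zero-argument form.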
The payoff of all this naming is that feature engineering, often the work that most decides whether a model succeeds, becomes a sequence of named, testable steps (standardize, clip_outliers, encode_flag) instead of a wall of inline arithmetic copied between notebooks. Each function can be checked in isolation, and each name tells the reader what happens without re-reading the body.
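As a rough illustration of what "named, testable steps" can look like, here is a minimal sketch. The function names come from the text above, but the signatures, the bodies, and the sample values are assumptions made for this example, not the course's reference implementations.

```python
from statistics import mean, pstdev


def standardize(xs: list[float]) -> list[float]:
    """Shift to mean 0 and scale by the population standard deviation."""
    mu = mean(xs)
    sigma = pstdev(xs)
    if sigma == 0:
        # All values are identical, so there is no spread to scale by.
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


def clip_outliers(xs: list[float], low: float, high: float) -> list[float]:
    """Clamp every value into the closed interval [low, high]."""
    return [min(max(x, low), high) for x in xs]


def encode_flag(value: str, positive: str = "yes") -> int:
    """Map a text flag to 1 or 0, ignoring case and surrounding spaces."""
    return 1 if value.strip().lower() == positive else 0


# Each step is one named call instead of inline arithmetic.
raw = [1.0, 2.0, 3.0, 100.0]                              # made-up sample values
clipped = clip_outliers(raw, low=0.0, high=10.0)          # 100.0 is clamped to 10.0
scaled = standardize(clipped)
flags = [encode_flag(v) for v in ["Yes", "no", " yes "]]  # [1, 0, 1]
```

Because each step is its own function, a check such as "clipping never returns a value above high" can be written once and rerun whenever the function changes.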
The honest caveat: these hand-written helpers are for learning. NumPy, pandas and scikit-learn provide faster, more robust versions, and later topics swap your implementations for theirs. But from here on the course assumes you reach for a small typed function by default.
Card Info
- Topic: Python for Data Science
- Difficulty: Intermediate