Project paths and feature pipelines

Intermediate Python for Data Science

Created by Best · 24.06.2026 at 14:03 UTC

The same instinct produces small utilities that keep a project tidy. A recurring nuisance is file paths: a notebook run from one folder can't find data/... when run from another. A repo_root() helper walks upward from the current working directory until it reaches a folder that contains data/, and treats that folder as the project root, so paths resolve no matter where inside the project the code runs:

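from pathlib import Path

def repo_root() -> Path:
    """Return the nearest directory at or above the current one that contains data/."""
    here = Path.cwd().resolve()
    # Check the current directory first, then each parent up to the filesystem root.
    for p in [here, *here.parents]:
        if (p / "data").is_dir():
            return p
    # No ancestor has a data/ folder, so we are outside the project.
    raise FileNotFoundError("Run from inside the project.")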
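
To see the upward walk in action without depending on where you happen to be, here is a minimal sketch of a variant that takes the starting folder as an argument. The names find_root, start and marker are hypothetical and not part of the course code; the demo builds a throwaway project tree in a temporary directory, and file.csv is only a placeholder name.

```python
import tempfile
from pathlib import Path


def find_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical variant of repo_root(): the starting point is a parameter,
    so the search can be exercised without changing the working directory.
    """
    here = start.resolve()
    for candidate in [here, *here.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory at or above {here}.")


# Demo on a throwaway tree: <tmp>/project/data and <tmp>/project/notebooks/eda
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "data").mkdir(parents=True)
    nested = project / "notebooks" / "eda"
    nested.mkdir(parents=True)

    # Starting two levels down still lands on the project folder.
    found = find_root(nested)
    found_matches = found == project.resolve()

    # Paths are then built from the root, not from the current folder.
    csv_path = found / "data" / "file.csv"
    csv_parent_is_data = csv_path.parent == project.resolve() / "data"
```

Passing the start folder explicitly is a design choice for testability; in a notebook you would most likely call it as find_root(Path.cwd()), which behaves like repo_root() above.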

The payoff of all this naming is that feature engineering — often the work that most decides whether a model succeeds — becomes a sequence of named, testable steps (standardize, clip_outliers, encode_flag) instead of a wall of inline arithmetic copied between notebooks. Each function can be checked in isolation, and each name tells the reader what happens without re-reading the body.
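
As a rough illustration of what such named steps might look like, here is a minimal sketch in plain Python. Only the three function names come from the text above; the bodies, the parameters (low, high, positive) and the sample values are assumptions made for this example, not the course's reference implementations.

```python
from statistics import fmean, pstdev


def clip_outliers(values: list[float], low: float, high: float) -> list[float]:
    """Limit every value to the closed range [low, high]."""
    return [min(max(v, low), high) for v in values]


def standardize(values: list[float]) -> list[float]:
    """Rescale to zero mean and unit population standard deviation."""
    mean = fmean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


def encode_flag(labels: list[str], positive: str = "yes") -> list[int]:
    """Map the `positive` label to 1 and every other label to 0."""
    return [1 if label == positive else 0 for label in labels]


# Made-up sample data for the demo.
incomes = [20.0, 30.0, 40.0, 1000.0]
subscribed = ["yes", "no", "yes", "no"]

# The pipeline reads as a list of named steps.
clipped = clip_outliers(incomes, low=0.0, high=50.0)
scaled = standardize(clipped)
flags = encode_flag(subscribed)

# Each step can be checked in isolation.
assert clipped == [20.0, 30.0, 40.0, 50.0]
assert abs(sum(scaled)) < 1e-9
assert flags == [1, 0, 1, 0]
```

Because each step is a separate function, a wrong clipping bound or a mislabeled flag shows up in one small check rather than somewhere inside a long notebook cell.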

The honest caveat: these hand-written helpers are for learning. NumPy, pandas and scikit-learn provide faster, more robust versions, and later topics swap your implementations for theirs. But from here on the course assumes you reach for a small typed function by default.


Related cards

Builds on: Type hints and a numeric helper · Python for Data Science

Next: Counting and accumulating per key · Python for Data Science

Tasks

Question 1

Why write a repo_root() helper instead of hard-coding a path like "data/file.csv"?

It makes the code run faster

So the data files are found no matter which folder the code is run from

Because pandas requires it

To hide the file location

Question 2

What is the main benefit of splitting feature engineering into small, named functions?

The program runs faster

Each step becomes readable, testable in isolation, and reusable

Python refuses to run long functions

It removes the need for type hints

Card Info

Topic: Python for Data Science
Difficulty: Intermediate