Validation vs sanitization: the boundary-first principle

Beginner Data Science Praktikum
Created by Pavel · 03.04.2026 at 12:12 UTC

Your ML pipeline does not fail because of a bad model — it fails because someone uploaded a CSV with trailing spaces in the column names, or an API returned "row_count": "none" instead of null, or a partner feed swapped the date format overnight. These are data quality failures, and the fix is to enforce contracts at the boundary before data enters your system.

Two operations handle this, and they are not the same thing. Sanitization transforms input into a canonical form: strip whitespace, lowercase strings, normalize separators, convert " Customer Churn 2025 " into "customer_churn_2025". It does not reject — it repairs. Validation checks whether the (already sanitized) input satisfies rules: is this an integer? Is it positive? Is this email address well-formed? Validation either accepts or rejects — it does not transform.
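To make the split concrete, here is a minimal sketch using only the standard library. The function names sanitize_name and is_valid_name, and the rule "lowercase letters, digits and underscores, at least 3 characters", are assumptions chosen for illustration; they are not part of any library.

```python
import re


def sanitize_name(raw: str) -> str:
    """Repair step: never rejects, only transforms into a canonical form."""
    # Strip the edges, lowercase, then collapse whitespace/hyphen runs into "_".
    return re.sub(r"[\s\-]+", "_", raw.strip().lower())


def is_valid_name(name: str) -> bool:
    """Check step: never transforms, only accepts or rejects."""
    # Assumed rule: lowercase letters, digits, underscores, length >= 3.
    return re.fullmatch(r"[a-z0-9_]{3,}", name) is not None


raw = " Customer Churn 2025 "
clean = sanitize_name(raw)
print(clean)                 # customer_churn_2025
print(is_valid_name(raw))    # False: the raw string still contains spaces
print(is_valid_name(clean))  # True
```

Keeping the two functions separate makes it obvious which one is allowed to change the data and which one is allowed to say no.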

Order matters. If you validate before sanitizing, the check runs on the wrong string. A padded name such as ' ab ' (4 characters) passes a min-length check of 3 even though its stripped form 'ab' is too short, and a padded name can fail a max-length or pattern check that its stripped form would pass. Both outcomes are confusing and wrong. The correct sequence is: raw input → sanitize → validate → trusted domain object → business logic. This is the boundary-first principle: treat every external data source (API, CSV, JSON, user form) as untrusted, clean it up, verify it, and only then allow it into your core logic.
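The sequence raw input → sanitize → validate → trusted domain object can be sketched with a frozen dataclass as the trusted object. Everything named here (DatasetInfo, parse_dataset_info, the set of null-like tokens, the minimum length of 3) is a hypothetical example, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumed set of strings that a sloppy upstream feed might send instead of null.
_NULL_TOKENS = {"", "none", "null", "n/a"}


@dataclass(frozen=True)
class DatasetInfo:
    """Trusted domain object: only created after sanitize + validate."""
    name: str
    row_count: Optional[int]


def parse_dataset_info(raw: dict[str, Any]) -> DatasetInfo:
    # 1. Sanitize: bring both fields into a canonical text form.
    name = "_".join(str(raw.get("name", "")).strip().lower().split())
    count_text = str(raw.get("row_count", "")).strip().lower()

    # 2. Validate: accept or reject, with an actionable message.
    if len(name) < 3:
        raise ValueError(f"name must have at least 3 characters, got {name!r}")
    if count_text in _NULL_TOKENS:
        row_count = None
    elif count_text.isdigit():
        row_count = int(count_text)
    else:
        raise ValueError(f"row_count must be a non-negative integer, got {count_text!r}")

    # 3. Only now does the data enter the core logic.
    return DatasetInfo(name=name, row_count=row_count)


info = parse_dataset_info({"name": " Customer Churn 2025 ", "row_count": "none"})
print(info)  # DatasetInfo(name='customer_churn_2025', row_count=None)
```

Business logic that accepts only DatasetInfo instances never has to re-check names or row counts, because the boundary function is the single place where untrusted input is handled.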

Pydantic's @field_validator decorator fits this pattern: in its default 'after' mode, each validator receives the already coerced field value, can sanitize and validate it in one step, and either returns the clean value or raises ValueError with an actionable message.
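A small model shows how this can look in practice. This is a sketch assuming Pydantic v2 is installed; the model name DatasetConfig, its two fields and the specific rules are made up for illustration.

```python
from pydantic import BaseModel, ValidationError, field_validator


class DatasetConfig(BaseModel):
    dataset_name: str
    row_count: int

    @field_validator("dataset_name")
    @classmethod
    def clean_name(cls, v: str) -> str:
        # Sanitize first, then validate the sanitized value.
        cleaned = "_".join(v.strip().lower().split())
        if len(cleaned) < 3:
            raise ValueError(
                f"dataset_name needs at least 3 characters after cleaning, got {cleaned!r}"
            )
        return cleaned

    @field_validator("row_count")
    @classmethod
    def check_positive(cls, v: int) -> int:
        # v is already an int here: coercion runs before an 'after' validator.
        if v <= 0:
            raise ValueError("row_count must be positive")
        return v


cfg = DatasetConfig(dataset_name="  Sales   Q1 ", row_count="1200")
print(cfg.dataset_name, cfg.row_count)  # sales_q1 1200

try:
    DatasetConfig(dataset_name="sales", row_count="none")
except ValidationError as exc:
    # The string "none" cannot be coerced to int, so the model is rejected.
    print(exc.error_count(), "validation error")
```

Note that the length check lives inside the validator, after the cleaning step, so it always sees the sanitized value.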

Pydantic docs: [1], dataclasses docs: [2].


Sources

Tasks
Question 1

What does this Pydantic validator return for the input ' Customer Churn 2025 '?

@field_validator('dataset_name')
@classmethod
def normalize(cls, v: str) -> str:
    return '_'.join(v.strip().lower().split())
Hint

Trace each step: strip → lower → split → join with '_'.

Question 2

A colleague validates the dataset name before stripping whitespace, and names like ' AB ' (2 characters of content, 4 characters in total) pass a min_length=3 check even though the cleaned name is too short. What is the root cause?

Hint

What would the length be after stripping?

Question 3

Implement sanitize_and_validate(name: str) -> str that: (1) strips leading/trailing whitespace, (2) lowercases, (3) replaces internal whitespace runs with a single underscore, (4) raises ValueError if the result is shorter than 3 characters. Return the clean name.

Example: sanitize_and_validate(' Customer Churn ') → 'customer_churn'.

Submit the function; tests use expression mode.

Hint

strip() + lower() + re.sub(r'\s+', '_', ...) does the sanitization; then check len() for validation.
