Validation vs sanitization: the boundary-first principle

Beginner Data Science Praktikum

Created by Pavel · 03.04.2026 at 12:12 UTC

Your ML pipeline often fails not because of a bad model but because someone uploaded a CSV with trailing spaces in the column names, an API returned "row_count": "none" instead of null, or a partner feed swapped the date format overnight. These are data quality failures, and the fix is to enforce contracts at the boundary, before data enters your system.

Two operations handle this, and they are not the same thing. Sanitization transforms input into a canonical form: strip whitespace, lowercase strings, normalize separators, convert " Customer Churn 2025 " into "customer_churn_2025". It does not reject — it repairs. Validation checks whether the (already sanitized) input satisfies rules: is this an integer? Is it positive? Is this email address well-formed? Validation either accepts or rejects — it does not transform.
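A minimal sketch of the two operations kept separate, using the "row_count": "none" case mentioned earlier. The names sanitize_row_count, validate_row_count, and NULL_LIKE are hypothetical, and the set of null-like spellings is an assumption that would differ per data source.

```python
from typing import Optional

# Assumed spellings that should count as "missing"; adjust per data source.
NULL_LIKE = {"", "none", "null", "n/a"}


def sanitize_row_count(raw: object) -> Optional[str]:
    """Repair only: trim, lowercase, map null-like text to None. Never rejects."""
    text = str(raw).strip().lower()
    return None if text in NULL_LIKE else text


def validate_row_count(value: Optional[str]) -> None:
    """Check only: accept None or a string of decimal digits; raise otherwise."""
    if value is not None and not value.isdecimal():
        raise ValueError(f"row_count must be a non-negative integer, got {value!r}")


for raw in (" 42 ", "none", None):
    clean = sanitize_row_count(raw)   # repair first
    validate_row_count(clean)         # then check; raises on bad input
    print(repr(raw), "->", repr(clean))
```

The sanitizer never raises and the validator never changes its argument, which keeps the two responsibilities easy to test in isolation.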

Order matters. If you validate before sanitizing, the checks see whitespace that is not part of the real value: a padded name such as " AB " passes a min-length check of 3 that its stripped content would fail, and a name with a trailing space can fail a max-length or pattern check that it would have passed after stripping. Either way the result is confusing and wrong. The correct sequence is: raw input → sanitize → validate → trusted domain object → business logic. This is the boundary-first principle: treat every external data source (API, CSV, JSON, user form) as untrusted, clean it up, verify it, and only then allow it into your core logic.
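The sequence raw input → sanitize → validate → trusted domain object can be sketched with a standard-library dataclass. DatasetName, from_raw, and MIN_LEN are hypothetical names chosen for illustration, and the length rule is an assumption. The sketch also shows that a length check on the raw string and the same check on the sanitized string can disagree.

```python
from dataclasses import dataclass

MIN_LEN = 3  # assumed rule, for illustration only


@dataclass(frozen=True)
class DatasetName:
    """Trusted domain object: intended to be built only via from_raw()."""
    value: str

    @classmethod
    def from_raw(cls, raw: str) -> "DatasetName":
        cleaned = raw.strip()                  # 1. sanitize
        if len(cleaned) < MIN_LEN:             # 2. validate the sanitized value
            raise ValueError(
                f"name needs at least {MIN_LEN} characters after stripping, "
                f"got {cleaned!r}"
            )
        return cls(cleaned)                    # 3. hand over a trusted object


raw = " AB "
print(len(raw) >= MIN_LEN)          # True: the raw string counts the padding
print(len(raw.strip()) >= MIN_LEN)  # False: the actual content is too short

try:
    DatasetName.from_raw(raw)
except ValueError as exc:
    print("rejected:", exc)

print(DatasetName.from_raw("  churn ").value)  # churn
```

Business logic that only accepts DatasetName instances never has to repeat the strip or the length check.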

Pydantic's @field_validator decorator (Pydantic v2) fits this pattern well: in its default mode ('after'), each validator receives the already coerced field value, can sanitize and validate in one step, and either returns the clean value or raises ValueError with an actionable message.
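A minimal sketch of a validator that sanitizes and then validates, assuming Pydantic v2 is installed. The model ColumnSpec and its field column_name are hypothetical and echo the "trailing spaces in the column names" case; only BaseModel, field_validator, and ValidationError come from Pydantic.

```python
from pydantic import BaseModel, ValidationError, field_validator


class ColumnSpec(BaseModel):
    """Hypothetical model for one column header read from an external CSV."""
    column_name: str

    @field_validator("column_name")
    @classmethod
    def clean_column_name(cls, v: str) -> str:
        cleaned = v.strip().lower()            # sanitize
        if not cleaned:                        # validate the sanitized value
            raise ValueError("column_name must not be empty after stripping whitespace")
        return cleaned                         # the returned value is what gets stored


spec = ColumnSpec(column_name="  Revenue ")
print(spec.column_name)  # revenue

try:
    ColumnSpec(column_name="   ")
except ValidationError as exc:
    # Pydantic wraps the ValueError raised above in a ValidationError.
    print(exc.errors()[0]["msg"])
```

Callers catch ValidationError rather than the ValueError raised inside the validator, because Pydantic collects field errors and reports them together.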

Pydantic docs: [1], dataclasses docs: [2].

Sources

Tasks

Question 1

What does this Pydantic validator return for the input ' Customer Churn 2025 '?

@field_validator('dataset_name')
@classmethod
def normalize(cls, v: str) -> str:
    return '_'.join(v.strip().lower().split())

Hint

Trace each step: strip → lower → split → join with '_'.

'customer_churn_2025'

' customer_churn_2025 '

'Customer Churn 2025'

ValueError — spaces are not allowed

Question 2

A colleague validates the dataset name before stripping whitespace, and names like ' AB ' (2 characters of content, 4 characters total) pass a min_length=3 check even though the stripped name is too short. What is the root cause?

Hint

What would the length be after stripping?

Pydantic always strips whitespace automatically

Validation ran before sanitization: the raw string, whitespace included, was passed to the length check

min_length counts bytes, not characters

The validator must use @classmethod to access the raw value

Question 3

Implement sanitize_and_validate(name: str) -> str that: (1) strips leading/trailing whitespace, (2) lowercases, (3) replaces internal whitespace runs with a single underscore, (4) raises ValueError if the result is shorter than 3 characters. Return the clean name.

Example: sanitize_and_validate(' Customer Churn ') → 'customer_churn'.

Submit the function; tests use expression mode.

Hint

strip() + lower() + re.sub(r'\s+', '_', ...) does the sanitization; then check len() for validation.

import re


def sanitize_and_validate(name: str) -> str:
    # TODO: sanitize (strip, lower, replace spaces), then validate length >= 3.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

2 test cases will be used for grading
