Validation vs sanitization: the boundary-first principle
An ML pipeline often fails not because of a bad model but because someone uploaded a CSV with trailing spaces in the column names, an API returned "row_count": "none" instead of null, or a partner feed changed its date format overnight. These are data quality failures, and the fix is to enforce contracts at the boundary, before data enters your system.
Two operations handle this, and they are not the same thing. Sanitization transforms input into a canonical form: strip whitespace, lowercase strings, normalize separators, convert " Customer Churn 2025 " into "customer_churn_2025". It does not reject — it repairs. Validation checks whether the (already sanitized) input satisfies rules: is this an integer? Is it positive? Is this email address well-formed? Validation either accepts or rejects — it does not transform.
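The distinction can be sketched as two separate functions. This is a minimal illustration using only the standard library; the names sanitize_name and validate_name, the allowed character set, and the minimum length of 3 are assumptions made for the example, not rules from any particular library.

```python
import re


def sanitize_name(raw: str) -> str:
    """Repair step: trim, lowercase, turn runs of spaces/hyphens into '_'.

    Never rejects; it always returns a string.
    """
    cleaned = raw.strip().lower()
    return re.sub(r"[\s\-]+", "_", cleaned)


def validate_name(name: str) -> None:
    """Check step: return silently on success, raise on failure.

    Never modifies the input.
    """
    # Hypothetical rules, chosen only for illustration.
    if len(name) < 3:
        raise ValueError(f"name must have at least 3 characters, got {name!r}")
    if not re.fullmatch(r"[a-z0-9_]+", name):
        raise ValueError(f"name may contain only a-z, 0-9 and '_', got {name!r}")


clean = sanitize_name(" Customer Churn 2025 ")
validate_name(clean)  # passes: no exception
print(clean)  # customer_churn_2025
```

Keeping the two functions apart makes each easy to test: the sanitizer is checked by its return value, the validator by whether it raises.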
Order matters. If you validate before sanitizing, a name with a trailing space fails a max-length check that it would have passed after stripping, which produces a confusing false rejection. The reverse also happens: a string of spaces passes a min-length check that it should fail. The correct sequence is: raw input → sanitize → validate → trusted domain object → business logic. This is the boundary-first principle: treat every external data source (API, CSV, JSON, user form) as untrusted, clean it up, verify it, and only then allow it into your core logic.
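A small sketch of that sequence, using a frozen dataclass as the trusted domain object. The Username type, the parse_username helper, and the length limit of 10 are hypothetical and exist only to show how a trailing space causes a false rejection when validation runs first.

```python
from dataclasses import dataclass

MAX_LEN = 10  # hypothetical limit, chosen for illustration


@dataclass(frozen=True)
class Username:
    """Trusted domain object: intended to be built only by parse_username."""
    value: str


def sanitize(raw: str) -> str:
    # Repair step: trim surrounding whitespace.
    return raw.strip()


def validate(value: str) -> None:
    # Check step: accept silently or raise with an actionable message.
    if not value:
        raise ValueError("username must not be empty")
    if len(value) > MAX_LEN:
        raise ValueError(
            f"username must be at most {MAX_LEN} characters, got {len(value)}"
        )


def parse_username(raw: str) -> Username:
    # raw input -> sanitize -> validate -> trusted domain object
    clean = sanitize(raw)
    validate(clean)
    return Username(clean)


raw = "data_team1 "  # 10 real characters plus one trailing space

try:
    validate(raw)  # wrong order: the trailing space pushes the length to 11
except ValueError as err:
    print("validate-first:", err)

user = parse_username(raw)  # correct order: stripped first, then accepted
print("sanitize-first:", user.value)
```

Business logic then accepts only Username values, so it never has to re-check raw strings.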
Pydantic's @field_validator decorator fits this pattern well: in its default "after" mode, each validator receives the already coerced field value, can sanitize and validate in one step, and either returns the clean value or raises ValueError with an actionable message. When the cleanup has to run on the raw input before type coercion, declare the validator with mode="before".
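A minimal sketch of such a model, assuming Pydantic v2 is installed. The Upload model, its two fields, and the specific rules are hypothetical examples, not part of Pydantic itself.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Upload(BaseModel):
    name: str
    row_count: int

    @field_validator("name")
    @classmethod
    def clean_name(cls, v: str) -> str:
        # Sanitize: trim, lowercase, spaces to underscores.
        v = v.strip().lower().replace(" ", "_")
        # Validate: reject names that are too short after cleanup.
        if len(v) < 3:
            raise ValueError("name must have at least 3 characters after trimming")
        return v

    @field_validator("row_count")
    @classmethod
    def non_negative(cls, v: int) -> int:
        # v is already an int here: Pydantic coerced it before this validator ran.
        if v < 0:
            raise ValueError("row_count must be >= 0")
        return v


# The string "42" is coerced to the integer 42 in Pydantic's default (lax) mode.
ok = Upload(name=" Customer Churn 2025 ", row_count="42")
print(ok.name, ok.row_count)

try:
    # "none" cannot be parsed as an integer, and " x " is too short after trimming.
    Upload(name=" x ", row_count="none")
except ValidationError as err:
    print(err)  # lists one error per failing field
```

Because Pydantic collects the errors for all fields before raising, a single ValidationError can report every problem in the payload at once.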
Pydantic docs: [1], dataclasses docs: [2].