Damage Regression I: Data Ingestion, Cleaning, and Vocabulary Mapping
Case setup:
City-report platforms receive free-text damage descriptions with noisy punctuation and inconsistent wording. The first stage of the pipeline converts raw CSV rows into structured training records.
Lecture-style ingestion logic:
- read CSV rows line by line,
- handle quoted description fields that may contain commas,
- clean the category code and each word with a cleanup routine,
- keep candidate words that match heuristic rules (for example, length >= 6 and starts with an uppercase letter),
- map words and category codes to integer IDs using dictionaries,
- keep reverse maps for inspection and debugging.
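The steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's actual code. The column order (category code first, description second), the behavior of `cleanup`, and the helper names `keep_word`, `get_or_add`, and `ingest` are all assumptions made for the example.

```python
import csv
import io
import string


def cleanup(token):
    """Hypothetical cleanup: strip surrounding whitespace and punctuation."""
    return token.strip(string.punctuation + string.whitespace)


def keep_word(word, min_len=6):
    """Heuristic from the list above: length >= 6 and starts with uppercase."""
    return len(word) >= min_len and word[0].isupper()


def get_or_add(key, forward, reverse):
    """Return the integer ID for key, assigning the next free ID if unseen."""
    if key not in forward:
        new_id = len(forward)
        forward[key] = new_id
        reverse[new_id] = key  # reverse map kept for inspection and debugging
    return forward[key]


def ingest(lines):
    """Turn raw CSV lines into (category_id, [word_ids]) records."""
    word_to_id, id_to_word = {}, {}
    cat_to_id, id_to_cat = {}, {}
    records = []
    # csv.reader handles quoted description fields that contain commas.
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # assumed layout: category code, then description
        category = cleanup(row[0])
        words = [cleanup(w) for w in row[1].split()]
        kept = [w for w in words if keep_word(w)]
        cat_id = get_or_add(category, cat_to_id, id_to_cat)
        word_ids = [get_or_add(w, word_to_id, id_to_word) for w in kept]
        records.append((cat_id, word_ids))
    return records, (word_to_id, id_to_word), (cat_to_id, id_to_cat)


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    'RD-01,"Large Pothole near Station, blocking Traffic!"\n'
    'LT-02,"Broken Streetlight on Elm"\n'
)
records, (word_to_id, id_to_word), (cat_to_id, id_to_cat) = ingest(sample)
print(records)
print(id_to_word)
```

IDs are assigned in order of first appearance, so the same input always yields the same maps. That is one simple way to keep preprocessing stable across runs.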
Why this stage is critical:
- model quality is bounded by parsing quality,
- unstable preprocessing creates label noise,
- traceable maps are needed to interpret predictions.
Edge cases:
- unmatched quotes in CSV rows,
- empty description after cleaning,
- duplicate words in one report,
- non-bijective ID maps when the mapping policy is inconsistent (for example, one word assigned two IDs, or two words sharing one ID).
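One way to guard against these cases is a handful of small checks, sketched below. The policies are assumptions for illustration: rows with unmatched quotes are rejected, records with no surviving words are skipped, and duplicates are collapsed while keeping first-seen order. The function names are hypothetical, and the quote check assumes single-line rows where embedded quotes are escaped by doubling.

```python
def has_unmatched_quotes(line):
    """Heuristic: an odd number of double quotes means one is unclosed."""
    return line.count('"') % 2 == 1


def dedupe(words):
    """Drop duplicate words in one report, keeping first-seen order."""
    return list(dict.fromkeys(words))


def is_bijective(forward, reverse):
    """Check that forward (key -> ID) and reverse (ID -> key) are inverses."""
    return len(forward) == len(reverse) and all(
        reverse.get(i) == key for key, i in forward.items()
    )


def check_record(line, kept_words):
    """Return (words, reason); words is None when the record is rejected."""
    if has_unmatched_quotes(line):
        return None, "unmatched quotes"
    words = dedupe(kept_words)
    if not words:
        return None, "empty description after cleaning"
    return words, "ok"


# Made-up inputs, for illustration only.
print(check_record('RD-01,"Pothole near Station', ["Pothole", "Station"]))
print(check_record('RD-01,"on the road"', []))
print(check_record('RD-01,"Pothole Pothole Station"',
                   ["Pothole", "Pothole", "Station"]))
print(is_bijective({"Pothole": 0, "Station": 0}, {0: "Station"}))
```

Whether duplicates should be collapsed or counted depends on the downstream model. The sketch collapses them only to show where that policy decision sits.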
Card Info
- Topic: Damage Regression
- Difficulty: Intermediate