Damage Regression I: Data Ingestion, Cleaning, and Vocabulary Mapping

Intermediate Damage Regression
Created by Pavel · 12.03.2026 at 07:54 UTC · 2 completed

Case setup:
City-report platforms receive free-text damage descriptions with noisy punctuation and inconsistent wording. The first stage of the pipeline converts raw CSV rows into structured training records.

Lecture-style ingestion logic:
- read CSV rows line by line,
- handle quoted description fields that may contain commas,
- clean category code and words via cleanup,
- keep candidate words matching heuristic rules (for example length >= 6 and starts with uppercase),
- map words and category codes to integer IDs using dictionaries,
- keep reverse maps for inspection and debugging.

Why this stage is critical:
- model quality is bounded by parsing quality,
- unstable preprocessing creates label noise,
- traceable maps are needed to interpret predictions.

Edge cases:
- unmatched quotes in CSV rows,
- empty description after cleaning,
- duplicate words in one report,
- non-bijective ID maps if mapping policy is inconsistent.

University approvals: 0
Tasks
Question 1

Why do we keep both datamap and revmap in this case study?

Hint

Think interpretability and debugging.

Question 2

Code task: implement the cleaning helper used in loading.

def cleanup(word):
    # TODO
    pass

It should remove non-letter characters from both ends.

Submission format: submit the full function snippet shown above with # TODO/pass replaced.

Hint

Use while-loops and isalpha() checks on first/last characters.

Starter code is prefilled; replace TODO blocks with your solution.
3 test cases will be used for grading
Run checks runtime behavior only. Final correctness is evaluated when you submit.
Question 3

A report has extracted tokens ['LampPost', 'BrokenGlass', 'StreetLight', 'Sidewalk']. If the pipeline keeps at most first 3 tokens, what remains?

Hint

The rule is cap to first three in current order.

Question 4

Code task: implement vocabulary mapping with reverse dictionary.

def map_word(word, datamap, revmap):
    """Return integer id for word, creating one if missing."""
    # TODO

Submission format: submit a full function definition def map_word(...): ....

Hint

Use current map size as next ID when word is unseen.

Starter code is prefilled; replace TODO blocks with your solution.
2 test cases will be used for grading
Run checks runtime behavior only. Final correctness is evaluated when you submit.
Card Info
  • Topic: Damage Regression
  • Difficulty: Intermediate
  • Completed: 2 users
Creator
Pavel
Pavel