Damage Regression I: Data Ingestion, Cleaning, and Vocabulary Mapping

Intermediate Damage Regression

Created by Pavel · 12.03.2026 at 07:54 UTC · 2 completed

Case setup:
City-report platforms receive free-text damage descriptions with noisy punctuation and inconsistent wording. The first stage of the pipeline converts raw CSV rows into structured training records.

Lecture-style ingestion logic:
- read CSV rows line by line,
- handle quoted description fields that may contain commas,
- clean category codes and words with cleanup(),
- keep candidate words that match heuristic rules (for example, length >= 6 and an uppercase first letter),
- map words and category codes to integer IDs using dictionaries,
- keep reverse maps for inspection and debugging.
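The first four steps above can be sketched as follows. This is a minimal illustration, not the course's reference loader: the two-column layout (category code, description), the name load_records, and the stand-in demo_cleanup are assumptions made for the example. The cleaner is passed in as a parameter so that any cleanup implementation can be plugged in.

```python
import csv
import io
import string


def load_records(csv_text, cleanup, min_len=6):
    """Parse rows of (category_code, description) into (code, words) records.

    Hypothetical helper: the column layout is an assumption for this sketch.
    """
    records = []
    # csv.reader keeps a quoted field intact even when it contains commas.
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        code = cleanup(row[0])
        words = [cleanup(w) for w in row[1].split()]
        # Heuristic filter: long enough and starting with an uppercase letter.
        words = [w for w in words if len(w) >= min_len and w[0].isupper()]
        records.append((code, words))
    return records


def demo_cleanup(token):
    # Stand-in cleaner for this demo only (ASCII punctuation, digits, spaces).
    return token.strip(string.punctuation + string.whitespace + string.digits)


sample = 'LIGHT,"Broken StreetLight, near the Sidewalk!"\nROAD,"small crack"\n'
records = load_records(sample, demo_cleanup)
print(records)
```

The second row yields an empty word list, which is the "empty description after cleaning" situation the pipeline has to decide how to handle.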

Why this stage is critical:
- model quality is bounded by parsing quality,
- unstable preprocessing creates label noise,
- traceable maps are needed to interpret predictions.

Edge cases:
- unmatched quotes in CSV rows,
- empty description after cleaning,
- duplicate words in one report,
- non-bijective ID maps if the mapping policy is inconsistent.
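A few small guards can make these edge cases visible early. The helpers below are a hedged sketch with hypothetical names. The quote check assumes one record per physical line and quotes escaped by doubling, so it does not apply to descriptions that legitimately span several lines.

```python
def has_unmatched_quote(line):
    # Doubled quotes ("") keep the count even, so an odd count
    # signals an unmatched quote on a single-line record.
    return line.count('"') % 2 == 1


def dedupe_keep_order(words):
    # dict preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(words))


def maps_are_consistent(datamap, revmap):
    # The maps are inverse to each other only if they have the same size
    # and every word's ID decodes back to that same word.
    return len(datamap) == len(revmap) and all(
        revmap.get(idx) == word for word, idx in datamap.items()
    )


print(has_unmatched_quote('LIGHT,"Broken lamp'))
print(dedupe_keep_order(["Sidewalk", "Pothole", "Sidewalk"]))
print(maps_are_consistent({"Sidewalk": 0, "Pothole": 0}, {0: "Pothole"}))
```

An empty word list after cleaning can be detected with a plain truthiness check on the list. Whether to drop such a report or keep it as an empty record is a policy choice the source leaves open.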


scikit-learn.org/stable/modules/feature_extraction.html


Tasks

Question 1

Why do we keep both datamap and revmap in this case study?

Hint

Think interpretability and debugging.

Only to reduce runtime complexity to O(1).

To map words to IDs and also decode IDs back to readable tokens.

Because regression requires two maps by API contract.

To avoid any duplicate words in data.

Question 2

Code task: implement the cleaning helper used in loading.

def cleanup(word):
    # TODO
    pass

It should strip non-letter characters from both ends of the word and return the result.

Submission format: submit the full function snippet shown above with # TODO/pass replaced.

Hint

Use while loops with isalpha() checks on the first and last characters.


3 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Question 3

A report yields the extracted tokens ['LampPost', 'BrokenGlass', 'StreetLight', 'Sidewalk']. If the pipeline keeps at most the first 3 tokens, what remains?

Hint

The rule caps the list at the first three tokens, in their current order.

['LampPost', 'BrokenGlass', 'StreetLight']

['BrokenGlass', 'StreetLight', 'Sidewalk']

All 4 tokens remain.

Only ['LampPost'] remains.

Question 4

Code task: implement vocabulary mapping with a reverse dictionary.

def map_word(word, datamap, revmap):
    """Return integer id for word, creating one if missing."""
    # TODO

Submission format: submit the full function definition, def map_word(...): ...

Hint

Use the current map size as the next ID when the word is unseen.


2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.
