HTML pages as messy data sources

Beginner Data Science Praktikum
Created by Pavel · 21.03.2026 at 01:05 UTC · 1 completed

Sometimes the only "API" is a website built for eyes, not for your dataframe: tables are repeated for layout, rows appear only after JavaScript runs, and the same statistic hides in three places. That is HTML's job: to present. CSV and JSON, by contrast, are meant to carry data cleanly.

When you must extract anyway, Beautiful Soup or pandas.read_html treat the page as structure instead of text, which is why regex-only scraping of arbitrary HTML becomes a horror story of missed closing tags and nested tables. Before you scale that up, the professional subplot matters: robots.txt, terms of use, rate limits, and a bias toward official feeds or downloads when they exist.
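To make "structure instead of text" concrete, here is a minimal sketch that reads table rows with a real HTML parser. It uses the standard library's html.parser rather than Beautiful Soup or pandas.read_html, purely so that it runs without extra installs. The class name TableRows and the sample page are made up for illustration, and the sketch deliberately ignores nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings.

    Illustrative only: nested tables, rowspan and colspan are not handled.
    """

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default: entities arrive decoded
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text fragments of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        # Attributes and whitespace inside the tag are already dealt with by the parser.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment with the kind of noise that trips up regexes:
# extra attributes, odd whitespace, inline markup, and an entity.
page = """
<table class="stats">
  <tr><th>Company</th><th>Revenue</th></tr>
  <tr><td  class="name">AT&amp;T</td><td>
      <b>120</b></td></tr>
</table>
"""

parser = TableRows()
parser.feed(page)
parser.close()
print(parser.rows)  # [['Company', 'Revenue'], ['AT&T', '120']]
```

The parser never has to guess where a tag ends or which attributes might appear. It reacts to start tags, text, and end tags as events, which is the same idea Beautiful Soup and pandas.read_html build on.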

Downstream, scraped strings still carry entities such as &amp; and &eacute;, and html.unescape is the small cleanup step between what the browser showed and what you analyze.
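A short sketch of that cleanup step, using only html.unescape from the standard library. The sample strings are invented for illustration.

```python
import html

# A string as it might come out of a scraper, entities still encoded.
scraped = "AT&amp;T caf&eacute; &lt;b&gt;"
clean = html.unescape(scraped)
print(clean)  # AT&T café <b>

# One call removes exactly one layer of encoding.
double_encoded = "AT&amp;amp;T"
once = html.unescape(double_encoded)
print(once)  # AT&amp;T
```

The second case shows why a single pass is sometimes not enough: data that was escaped twice upstream still contains an entity after one call.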

Fetching HTML over the network is covered separately in "requests, JSON APIs, and robust fetching". Docs: Beautiful Soup [1], html standard library [2].


Sources

Tasks
Question 1

Why is using regular expressions alone a fragile strategy to parse arbitrary HTML tables?

Hint

Think about optional attributes, whitespace, and broken tags.

Question 2

Before scraping a website at scale for a course project, what should you check first besides technical feasibility?

Hint

Legal and ethical constraints are part of professional practice.

Question 3

Implement strip_html_entities(text: str) -> str that decodes HTML entities repeatedly until the result stops changing (or after 5 iterations as a safety bound). This should handle double-encoded values such as AT&amp;amp;T becoming AT&T while still decoding single-pass entities like &lt; and &eacute;.

Use html.unescape. Submit the function; expression mode tests call it.

Hint

Run unescape in a loop until no further change, but cap iterations.
