HTML pages as messy data sources

Beginner Data Science Praktikum

Created by Pavel · 21.03.2026 at 01:05 UTC · 1 completed

Sometimes the only "API" is a website built for eyes, not for your dataframe: tables repeat for layout, rows appear after JavaScript runs, and the same statistic hides in three places. That is HTML's job—to present—while CSV and JSON are meant to carry data cleanly.

When you must extract anyway, Beautiful Soup or pandas.read_html treat the page as structure instead of text, which is why regex-only scraping of arbitrary HTML becomes a horror story of missed closing tags and nested tables. Before you scale that up, the professional subplot matters: robots.txt, terms of use, rate limits, and a bias toward official feeds or downloads when they exist.
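The difference between treating a page as structure and treating it as text can be sketched in a few lines. The example below is only an illustration: it uses the standard library's html.parser as a dependency-free stand-in for Beautiful Soup or pandas.read_html, and the page fragment, the CellCollector class, and the naive pattern are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: one cell has an attribute, one has stray whitespace
# inside the tag, and one row never closes its last <td> or its <tr>.
PAGE = """
<table class="stats">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Graz</td><td align="right">300000
  <tr><td >Linz</td><td>210000</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of table cells, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row also ends any cell that was left open.
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows and self.rows[-1]:
            self.rows[-1][-1] += data.strip()


collector = CellCollector()
collector.feed(PAGE)
rows = collector.rows

# Regex-only attempt: it only sees cells written exactly as <td>...</td>.
naive = re.findall(r"<td>(.*?)</td>", PAGE)

print(rows)   # all three rows, including the cells with irregular markup
print(naive)  # only the two cells that happen to match the literal pattern
```

The tag-aware parser recovers every cell, while the pattern silently drops the cell with an attribute, the cell with whitespace in its tag, and the cell that was never closed. This sketch does not handle nested tables; real parsers such as Beautiful Soup do more work than this.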

Downstream, scraped strings still carry entities such as &amp; and &eacute;; html.unescape is the small cleanup step between what the browser showed and what you analyze.
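As a small illustration of that cleanup step, the sketch below runs html.unescape once over a few made-up cell values. The sample strings are assumptions chosen for the example, not data from any real page.

```python
import html

# Hypothetical raw cell values as they might come out of a scraper.
raw_cells = ["AT&amp;T", "caf&eacute;", "3 &lt; 5", "AT&amp;amp;T"]

# One pass of html.unescape over each value.
decoded = [html.unescape(cell) for cell in raw_cells]

for before, after in zip(raw_cells, decoded):
    print(f"{before!r} -> {after!r}")
```

The last value is double-encoded, so a single pass still leaves an entity behind (it becomes AT&amp;T rather than AT&T).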

Fetching HTML over the network is covered in "requests, JSON APIs, and robust fetching". Docs: Beautiful Soup [1], html standard library [2].
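Before any fetching, the robots.txt check mentioned above can be sketched with the standard library's urllib.robotparser. The rules, the user-agent name course-project-bot, and the example.org URLs below are hypothetical, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "course-project-bot"  # assumed name for this example

allowed_public = parser.can_fetch(AGENT, "https://example.org/stats/table.html")
allowed_private = parser.can_fetch(AGENT, "https://example.org/private/raw.html")
delay = parser.crawl_delay(AGENT)  # seconds between requests, if declared

print(allowed_public, allowed_private, delay)
```

This covers only the machine-readable part. Terms of use and the availability of an official feed or download still have to be checked by reading the site.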

Sources

Tasks

Question 1

Why is using regular expressions alone a fragile strategy to parse arbitrary HTML tables?

Hint

Think about optional attributes, whitespace, and broken tags.

HTML forbids tables

Real-world HTML is often irregular or malformed; regex does not understand tag nesting

Regex is slower than Beautiful Soup in all cases

Browsers never render invalid HTML

Question 2

Before scraping a website at scale for a course project, what should you check first besides technical feasibility?

Hint

Legal and ethical constraints are part of professional practice.

Only the color scheme of the site

Whether the site uses HTTPS on the homepage only

The programming language of the backend

Question 3

Implement strip_html_entities(text: str) -> str that decodes HTML entities repeatedly until the result stops changing (or after 5 iterations as a safety bound). This should handle double-encoded values such as AT&amp;amp;T becoming AT&T while still decoding single-pass entities like &lt; and &eacute;.

Use html.unescape. Submit the function; expression mode tests call it.

Hint

Run unescape in a loop until no further change, but cap iterations.

import html


def strip_html_entities(text: str) -> str:
    # TODO: unescape repeatedly with a bounded loop.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

3 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.
