HTML pages as messy data sources
Sometimes the only "API" is a website built for eyes, not for your dataframe: tables are repeated for layout, rows appear only after JavaScript runs, and the same statistic hides in three places. HTML's job is to present, while CSV and JSON are meant to carry data cleanly.
When you must extract anyway, Beautiful Soup and pandas.read_html parse the page into a tree of elements instead of treating it as flat text, which is why regex-only scraping of arbitrary HTML becomes a horror story of missed closing tags and nested tables. Before you scale that up, the professional subplot matters: check robots.txt and the terms of use, respect rate limits, and prefer official feeds or downloads when they exist.
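To make the "structure instead of text" point concrete, here is a minimal sketch that pulls one data table out of a layout table wrapped around it. The HTML snippet, the `stats` id, the cell values, and the helper name `extract_rows` are all invented for illustration, and the sketch assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: an outer table used only for layout,
# with the real data table nested inside one of its cells.
HTML = """
<table class="layout">
  <tr><td>
    <table id="stats">
      <tr><th>Name</th><th>Value</th></tr>
      <tr><td>Alpha</td><td>10</td></tr>
      <tr><td>Beta</td><td>20</td></tr>
    </table>
  </td></tr>
</table>
"""


def extract_rows(html, table_id):
    """Return the rows of the table with the given id as lists of cell text."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []

    rows = []
    for tr in table.find_all("tr"):
        # Keep only rows whose nearest enclosing table is the one we asked for,
        # so rows of any table nested deeper are not mixed in.
        if tr.find_parent("table") is not table:
            continue
        cells = tr.find_all(["th", "td"], recursive=False)
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = extract_rows(HTML, "stats")
print(rows)
```

Because the parser tracks which table each row belongs to, the outer layout row never leaks into the result. A regex over the raw text would have to reimplement that nesting logic by hand.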
Downstream, scraped strings can still carry HTML entities such as &amp;amp; and &amp;eacute;; html.unescape is the small cleanup step that turns them back into the characters the browser showed (& and é) before you analyze them.
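A small sketch of that cleanup step, using only the standard library. The sample strings are made up for illustration.

```python
import html

# Hypothetical cell values as they might come out of raw page source.
raw_cells = ["Caf&eacute; &amp; Bar", "R&amp;D", "5 &lt; 10"]

# html.unescape converts named and numeric character references
# back into the corresponding Unicode characters.
clean_cells = [html.unescape(cell) for cell in raw_cells]

for raw, clean in zip(raw_cells, clean_cells):
    print(f"{raw!r} -> {clean!r}")
```

Text obtained through Beautiful Soup's get_text is normally decoded already. The step matters mostly for strings taken from raw source, attributes copied verbatim, or data that was escaped twice upstream.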
Fetching HTML over the network is covered in "requests, JSON APIs, and robust fetching". Docs: Beautiful Soup [1], the html module in the Python standard library [2].
Card Info
- Topic: Data Science Praktikum
- Difficulty: Beginner