XML data and ElementTree

Intermediate Data Science Praktikum

Created by Pavel · 21.03.2026 at 01:05 UTC · 2 completed

New APIs often speak JSON, but many regulated corners of the world still ship XML: a bank's rate feed, an enterprise invoice, a years-old integration nobody dares rewrite. The format is deliberately hierarchical—elements inside elements, attributes on tags—so contracts can be nailed down with schemas (XSD) and validators in a way JSON ecosystems sometimes add later.

In Python you step into that world with xml.etree.ElementTree: parse once, walk children, read element.attrib like a dict, and pull text from .text (remembering that mixed content splits text across children and tail segments). Namespaces show up as tags like {uri}local, so real pipelines learn to compare the local name after the closing }.

When the file is gigabytes instead of kilobytes, the story changes: loading the whole tree can exhaust memory, and iterparse lets you stream and react instead of memorizing the entire document.

Many newer services expose JSON instead; requests, JSON APIs, and robust fetching covers that path. Reference: [1].

Sources

[1]https://docs.python.org/3/library/xml.etree.elementtree.html Return to text

University approvals: 0

Tasks

Question 1

Compared to typical REST JSON APIs, what is a common reason XML still appears in data engineering pipelines?

Hint

Think finance and enterprise interoperability.

XML cannot represent nested structures

Regulated industries and legacy systems standardized on XML schemas and tooling

XML files are always smaller than JSON

Python cannot parse XML

Question 2

What does element.text usually contain in ElementTree?

Hint

Mixed content splits text between children.

The concatenation of all descendant elements as JSON

The text immediately inside the element before the first child sub-element

Always the full file path of the XML document

The XML declaration version string

Question 3

Implement first_child_tag_text(xml_text: str, child_name: str) -> str that parses xml_text, finds the first direct child of the root whose tag equals child_name (ignore namespace prefixes: compare the part after } if the tag looks like {uri}local), and returns that child's .text or empty string if missing.

Assume simple XML like <root><city>Zurich</city></root> and child_name city → Zurich.

Submit the function; tests use expression mode.

Hint

ET.fromstring parses a string; iterate root's children and compare local tag names.

import xml.etree.ElementTree as ET


def first_child_tag_text(xml_text: str, child_name: str) -> str:
    # TODO: parse XML and return the first matching child's text.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

Runtime output (stdout/stderr)

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Card Info

Topic: Data Science Praktikum
Difficulty: Intermediate
Completed: 2 users

Creator

Pavel