ORM: bridging objects and SQL with SQLAlchemy

Intermediate Data Science Praktikum
Created by Pavel · 03.04.2026 at 12:12 UTC

Software objects and relational tables represent data in fundamentally different ways. In Python, a Dataset object has identity in memory, references other objects directly (dataset.runs), and vanishes when the process ends. In SQL, the same data lives in rows with primary keys, and the reference between a dataset and its experiment runs is a foreign key column — not a Python pointer. This tension is called the object-relational impedance mismatch, and ORM exists to bridge it.

SQLAlchemy is the standard Python ORM. You define model classes that inherit from DeclarativeBase, map fields to columns with mapped_column, and express relationships with relationship(). Then you work with Python objects while SQLAlchemy writes the SQL. session.add(obj) stages an insert; session.commit() writes all staged changes to the database in one transaction — the unit of work pattern.

Two internals worth knowing: the identity map guarantees that loading the same database row twice in one session returns the same Python object (not a copy). And lazy loading of relationships — convenient but dangerous — leads to the N+1 query problem: iterate 100 datasets and access d.runs for each, and the ORM fires 100 extra queries instead of one join. Fix it with explicit join queries or eager loading.

In a well-layered system, the ORM sits at the persistence layer, separate from domain logic. The canonical flow is: external data → Pydantic (validate) → dataclass (domain logic) → ORM (persist to SQL). Putting ORM models at the API boundary leaks database schema to callers; putting Pydantic deep inside business logic ties domain code to a validation framework. Keeping each tool in its layer is what makes the system maintainable.

SQLAlchemy ORM docs: [1], session basics: [2].


Sources

University approvals: 0
Tasks
Question 1

What does this ORM code print?

with Session(engine) as session:
    ds = DatasetORM(name='churn', storage_tier='silver')
    session.add(ds)
    session.add(DatasetORM(name='fraud', storage_tier='gold'))
    session.commit()
    count = session.scalar(select(func.count()).select_from(DatasetORM))
    print(count)
Hint

session.add() stages objects; commit() writes them all to the database.

Question 2

You iterate 100 Dataset objects and access d.runs for each in a loop. With lazy loading, how many SQL queries are issued in total?

Hint

The first query loads all datasets. Each .runs access triggers a separate query.

Question 3

Using sqlite3, implement best_model_with_tier(datasets, runs) -> list where:
- datasets is a list of (name, tier) tuples
- runs is a list of (dataset_name, model_name, rmse) tuples

Create an in-memory SQLite database with tables datasets(name TEXT, tier TEXT) and runs(dataset_name TEXT, model_name TEXT, rmse REAL). Insert the data, then write a single SQL query that JOINs both tables to return, for each dataset, its name, the model with the lowest RMSE, and the dataset's storage tier. Return the result as a list of (dataset_name, best_model, tier) tuples, sorted alphabetically by dataset_name.

Example: with datasets [('churn','silver')] and runs [('churn','xgb',0.42),('churn','lgbm',0.45)], return [('churn', 'xgb', 'silver')].

Submit the function; tests use expression mode.

Hint

You need two JOINs: one to link runs to the best-RMSE subquery, and one to bring in the tier from the datasets table. ORDER BY d.name ensures consistent output.

Starter code is prefilled; replace TODO blocks with your solution.
2 test cases will be used for grading
Run checks runtime behavior only. Final correctness is evaluated when you submit.
Card Info
  • Topic: Data Science Praktikum
  • Difficulty: Intermediate
  • Completed: 0 users
Creator
Pavel
Pavel