ORM: bridging objects and SQL with SQLAlchemy

Intermediate Data Science Praktikum

Created by Pavel · 03.04.2026 at 12:12 UTC

Software objects and relational tables represent data in fundamentally different ways. In Python, a Dataset object has identity in memory, references other objects directly (dataset.runs), and vanishes when the process ends. In SQL, the same data lives in rows with primary keys, and the reference between a dataset and its experiment runs is a foreign key column — not a Python pointer. This tension is called the object-relational impedance mismatch, and ORM exists to bridge it.

SQLAlchemy is the standard Python ORM. You define model classes that inherit from DeclarativeBase, map fields to columns with mapped_column, and express relationships with relationship(). Then you work with Python objects while SQLAlchemy writes the SQL. session.add(obj) stages an insert; session.commit() writes all staged changes to the database in one transaction — the unit of work pattern.

Two internals worth knowing: the identity map guarantees that loading the same database row twice in one session returns the same Python object (not a copy). And lazy loading of relationships — convenient but dangerous — leads to the N+1 query problem: iterate 100 datasets and access d.runs for each, and the ORM fires 100 extra queries instead of one join. Fix it with explicit join queries or eager loading.

In a well-layered system, the ORM sits at the persistence layer, separate from domain logic. The canonical flow is: external data → Pydantic (validate) → dataclass (domain logic) → ORM (persist to SQL). Putting ORM models at the API boundary leaks database schema to callers; putting Pydantic deep inside business logic ties domain code to a validation framework. Keeping each tool in its layer is what makes the system maintainable.

SQLAlchemy ORM docs: [1], session basics: [2].

Sources

University approvals: 0

Tasks

Question 1

What does this ORM code print?

with Session(engine) as session:
    ds = DatasetORM(name='churn', storage_tier='silver')
    session.add(ds)
    session.add(DatasetORM(name='fraud', storage_tier='gold'))
    session.commit()
    count = session.scalar(select(func.count()).select_from(DatasetORM))
    print(count)

Hint

session.add() stages objects; commit() writes them all to the database.

0 — commit does not persist without flush

Error — cannot add two objects in one session

Question 2

You iterate 100 Dataset objects and access d.runs for each in a loop. With lazy loading, how many SQL queries are issued in total?

Hint

The first query loads all datasets. Each .runs access triggers a separate query.

1 — the ORM batches everything

2 — one for datasets, one for all runs

101 — one for the initial dataset load, plus one per lazy relationship access

100 — one per dataset, no initial query

Question 3

Using sqlite3, implement best_model_with_tier(datasets, runs) -> list where:
- datasets is a list of (name, tier) tuples
- runs is a list of (dataset_name, model_name, rmse) tuples

Create an in-memory SQLite database with tables datasets(name TEXT, tier TEXT) and runs(dataset_name TEXT, model_name TEXT, rmse REAL). Insert the data, then write a single SQL query that JOINs both tables to return, for each dataset, its name, the model with the lowest RMSE, and the dataset's storage tier. Return the result as a list of (dataset_name, best_model, tier) tuples, sorted alphabetically by dataset_name.

Example: with datasets [('churn','silver')] and runs [('churn','xgb',0.42),('churn','lgbm',0.45)], return [('churn', 'xgb', 'silver')].

Submit the function; tests use expression mode.

Hint

You need two JOINs: one to link runs to the best-RMSE subquery, and one to bring in the tier from the datasets table. ORDER BY d.name ensures consistent output.

import sqlite3


def best_model_with_tier(datasets, runs) -> list:
    # TODO: create both tables, insert data, JOIN them to find best model + tier per dataset.
    pass

Starter code is prefilled; replace TODO blocks with your solution.

Runtime output (stdout/stderr)

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Card Info

Topic: Data Science Praktikum
Difficulty: Intermediate
Completed: 0 users

Creator

Pavel