The estimator API and the train/test split

Intermediate Python for Data Science
Created by Best · 24.06.2026 at 14:03 UTC

scikit-learn is the standard library for classical machine learning, and its great gift is consistency: hundreds of models wear the same small interface. Every estimator speaks the same verbs: you fit it on training data (it learns), you predict on new data (it applies what it learned), and you score its predictions against known labels.

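from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # learn the parameters
pred = model.predict(X_test)         # apply them to new rows
acc = model.score(X_test, y_test)    # mean accuracy on the held-out rows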

Learn this one interface and you can drive any estimator, from logistic regression to random forests.
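To make the point concrete, here is a minimal sketch that drives two different estimators through the same loop. The synthetic dataset from make_classification, the dictionary of models, and the variable names are assumptions chosen for illustration, not part of the original card.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real X, y (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Two very different models, one shared interface.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # learn from the training rows
    pred = model.predict(X_test)                 # apply to unseen rows
    scores[name] = model.score(X_test, y_test)   # mean accuracy for classifiers
    print(f"{name}: {scores[name]:.3f}")
```

The loop body never needs to know which model it holds; swapping in another classifier only means adding an entry to the dictionary.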

The single most important line in any modelling script is the train/test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

You must evaluate on data the model never saw during training; otherwise you measure memorisation, not the ability to generalise to new cases — which is the only thing that matters in practice. Holding out a portion (here 25%) and scoring only on it gives an honest estimate of future performance, and random_state makes the split reproducible.
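The gap between memorisation and generalisation can be seen in a small experiment. The sketch below is an illustration under stated assumptions: the data is synthetic with deliberately noisy labels (flip_y), and an unrestricted DecisionTreeClassifier is used because it can fit its training rows almost perfectly. It also repeats the split with the same random_state to check reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), assumed here for illustration only.
X, y = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A tree with no depth limit can memorise the training rows, noise included.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # scored on rows the model has seen
test_acc = tree.score(X_test, y_test)      # scored on held-out rows
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")

# Repeating the split with the same random_state yields the same rows.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=0)
same_split = bool((X_test == X_test_again).all())
print("same split:", same_split)
```

The training accuracy flatters the model because it is measured on rows the tree has already seen; the test accuracy is the figure to report.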

Tasks
Question 1

In scikit-learn's estimator API, which method LEARNS parameters from training data?

Question 2

Why split data into train and test sets before evaluating a model?

Question 3

Which methods are part of scikit-learn's standard estimator API? Select all that apply.
