The estimator API and the train/test split

scikit-learn is the standard library for classical machine learning, and its great gift is consistency: hundreds of models wear the same small interface. Every estimator speaks the same verbs — you fit it on training data (it learns), you predict on new data (it applies what it learned), and you score the result.

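from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # learn the parameters
pred = model.predict(X_test)         # apply them to new rows
acc = model.score(X_test, y_test)    # mean accuracy on the held-out rows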

Learn this one interface and you can drive any scikit-learn classifier or regressor, from logistic regression to random forests.
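
To make the shared interface concrete, here is a minimal sketch that pushes two different models through the same three calls. The synthetic data from make_classification, the simple slice used to hold out rows, and the results dictionary are assumptions chosen for illustration; they are not part of the card's own example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, assumed purely for illustration (rows are already shuffled).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

estimators = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

results = {}
for model in estimators:
    model.fit(X_train, y_train)      # same verb for every model
    pred = model.predict(X_test)     # one predicted label per held-out row
    # score() on a classifier returns mean accuracy
    results[type(model).__name__] = model.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Only the constructor line differs between the two models; the loop body never needs to know which one it is driving.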

The single most important line in any modelling script is the train/test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

You must evaluate on data the model never saw during training; otherwise you measure memorisation, not the ability to generalise to new cases — which is the only thing that matters in practice. Holding out a portion (here 25%) and scoring only on it gives an honest estimate of future performance, and random_state makes the split reproducible.
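
The gap between memorising and generalising can be seen directly by scoring one model on both halves of the split. The sketch below assumes a synthetic, deliberately noisy dataset and an unrestricted DecisionTreeClassifier, picked because it can fit its training rows exactly; neither comes from the card above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), assumed for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A tree with no depth limit keeps splitting until every training leaf is pure.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # scored on rows it has already seen
test_acc = tree.score(X_test, y_test)      # scored on held-out rows
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=0)
same_split = np.array_equal(X_test, X_test_again)
print("same split:", same_split)
```

The training score flatters the model because the tree has memorised those rows; the held-out score is the one to report.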
