### Training-validation-test split and cross-validation performed right

One critical step in machine learning is choosing the model. A suitable model with suitable hyperparameters is the key to good prediction results. When we are faced with a choice between models, how should the decision be made?

This is why we have cross validation. scikit-learn provides a family of functions that help us do this. Quite often, however, we see cross validation used improperly, or its result interpreted incorrectly.

In this guide by AICorespot, you will discover the correct procedure for using cross validation and a dataset to select the best model for a project.

After going through this guide, you will know:

- The significance of the training-validation-test split of data and the trade-off in different ratios of the split
- The metrics used to evaluate a model and how to compare models
- How to use cross validation to evaluate a model
- What we should do once we have made a decision based on cross validation

**Tutorial Overview**

This tutorial is divided into three parts:

- The problem of model selection
- Out-of-sample evaluation
- Example of the model selection workflow using cross-validation

**The problem of model selection**

The outcome of machine learning is a model that can make predictions. The most common cases are the classification model and the regression model: the former predicts the class membership of an input, and the latter predicts the value of a dependent variable based on the input. In either case, however, we have a variety of models to choose from. Classification models, for instance, include decision trees, support vector machines, and neural networks, to name a few. Each of these depends on some hyperparameters. Therefore, we have to decide on a number of settings before we even start training a model.

If we have two candidate models based on our intuition, and we want to pick one to use in our project, how should we choose?

There are some standard metrics we can generally use. In regression problems, we commonly use one of the following:

- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Mean absolute error (MAE)
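
As a minimal sketch of how these three metrics are computed, the snippet below evaluates them with plain numpy. The values in y_true and y_pred are made up purely for illustration and do not come from this guide's dataset.

```python
import numpy as np

# Hypothetical true values and predictions, made up for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
mae = np.mean(np.abs(errors))   # mean absolute error

print("MSE: ", mse)
print("RMSE:", rmse)
print("MAE: ", mae)
```

RMSE is in the same unit as the target, which often makes it easier to interpret than MSE, while MAE is less sensitive to a few large errors.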

And for classification problems, frequently used metrics include:

- Accuracy
- Log-loss
- F-measure
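
For a comparable sketch on the classification side, the snippet below computes the three metrics with functions from sklearn.metrics. The labels and predicted probabilities are hypothetical, and the 0.5 threshold used to turn probabilities into class labels is an assumption of this example.

```python
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Hypothetical binary labels and predicted probabilities of class 1
y_true = [0, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]

# Assumed decision rule: predict class 1 when the probability is at least 0.5
y_pred = [int(p >= 0.5) for p in y_prob]

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
ll = log_loss(y_true, y_prob)          # uses the probabilities, not the labels

print("Accuracy: ", acc)
print("F-measure:", f1)
print("Log-loss: ", ll)
```

Note that accuracy and F-measure are higher-is-better while log-loss is lower-is-better, which matters when comparing models.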

The metrics page from scikit-learn has a longer, but not exhaustive, list of common evaluation metrics grouped into different categories. If we have a sample dataset and want to train a model to predict it, we can use one of these metrics to evaluate how good the model is.

However, there is a problem: for the sample dataset, we evaluated the model only once. Assuming we correctly separated the dataset into a training set and a test set, fitted the model on the training set, and evaluated it on the test set, we obtained only a single sample point of evaluation from one test set. How can we be sure it is an accurate evaluation, rather than a value that is too low or too high by chance? If we have two models and find that one is better than the other based on the evaluation, how can we know this is not also due to chance?

The reason we are concerned about this is to avoid surprisingly low accuracy when the model is deployed and used in the future on entirely new data, different from what we have collected.
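
To illustrate how much a single evaluation can move by chance, the following sketch repeats a train-test split with different random seeds and records the test RMSE of each. The dataset (a noisy straight line), the 20 seeds, and the variable names are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration: a noisy straight line
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=2.0, size=100)

# Same model, same data, only the random split changes
rmse_per_split = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    rmse_per_split.append(rmse)

rmse_per_split = np.array(rmse_per_split)
print("lowest RMSE: ", rmse_per_split.min())
print("highest RMSE:", rmse_per_split.max())
print("mean RMSE:   ", rmse_per_split.mean())
```

The spread between the lowest and highest value gives a feel for how unreliable any one of these numbers would be on its own.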

**Out-of-sample evaluation**

The solution to this issue is the training-validation-test split.

The model is first fitted on a training data set. The fitted model is then used to predict the responses for the observations in a second data set called the validation data set. Finally, the test data set is a data set used to provide an unbiased evaluation of a final model fitted on the training data set. If the data in the test data set has never been used in training (for example, in cross-validation), the test data set is also called a holdout data set.
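
One simple way to obtain the three sets is to call train_test_split() twice. The sketch below assumes a 60/20/20 ratio and a toy dataset of 100 points; both are illustrative choices rather than a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of 100 points, assumed for illustration
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Step 1: hold out 20% of all data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Step 2: hold out 25% of the remainder as validation (0.25 * 0.80 = 0.20 of the total)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

A larger training share usually gives a better-fitted model but a noisier evaluation, and a larger validation or test share does the opposite.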

The reason for such practice lies in the concept of preventing data leakage.

“What gets measured gets improved”, or as Goodhart’s law puts it, “When a measure becomes a target, it ceases to be a good measure.” If we use one set of data to choose a model, the model we choose will, with certainty, do well on that same set of data under the same evaluation metric. What we should care about, however, is the evaluation metric on unseen data.

Therefore, we need to keep a slice of data out of the entire model selection and training process and save it for the final evaluation. This slice of data is the “final exam” for our model, and the exam questions must not be seen by the model beforehand. Precisely, this is the workflow of how the data is used:

1. The training dataset is used to train a few candidate models
2. The validation dataset is used to evaluate the candidate models
3. One of the candidates is chosen
4. The chosen model is retrained from scratch on the training and validation data combined
5. The retrained model is evaluated with the test dataset

In steps 1 and 2, we do not want to evaluate the candidate models only once. Instead, we prefer to evaluate each model multiple times with different datasets and take the average score for our decision at step 3. If we have the luxury of vast amounts of data, this can be done easily. Otherwise, we can use the trick of k-fold to resample the same dataset multiple times and pretend the resamples are different. As we are evaluating the model, or hyperparameter, the model has to be trained from scratch each time, without reusing the training result from previous attempts. We call this process cross validation.
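
The following sketch spells out what k-fold cross validation does by hand, using KFold and clone() from scikit-learn. The synthetic data and the choice of five folds are assumptions for illustration; in practice, cross_validate() wraps a loop of this kind for us.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data, assumed for illustration
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=60)

template = LinearRegression()   # never fitted directly, only cloned
kfold = KFold(n_splits=5)

fold_rmse = []
for train_idx, val_idx in kfold.split(X):
    model = clone(template)                  # fresh, untrained copy for every fold
    model.fit(X[train_idx], y[train_idx])    # train on k-1 folds
    pred = model.predict(X[val_idx])         # evaluate on the held-out fold
    fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))

mean_rmse = float(np.mean(fold_rmse))
print("RMSE per fold:", fold_rmse)
print("Mean RMSE:", mean_rmse)
```

Each sample is used for validation exactly once and for training in the other four folds, which is how the same dataset yields five evaluations.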

From the result of cross validation, we can conclude whether one model is better than another. Since cross validation is done on a smaller dataset, we may want to retrain the model once we have decided on it. The reason is the same as why we need to use k-fold in cross validation: we do not have a lot of data, and the smaller dataset we used previously had part of it held out for validation. We believe combining the training and validation datasets can produce a better model. This is what happens in step 4.

The dataset for evaluation in step 5 and the one we used in cross validation are different because we do not want data leakage. If they were the same, we would see the same score as we have already seen from the cross validation. Or even worse, the test score would be guaranteed to be good because the data was part of what we used to train the chosen model, and we have tuned the model for that test dataset.

Once we have finished the training, we want to (1) compare this model to our previous evaluation and (2) estimate how it will perform if we deploy it.

We use the test dataset, which was never used in previous steps, to evaluate the performance. Because this is unseen data, it can help us evaluate the generalization, or out-of-sample, error. This should simulate what the model will do when we deploy it. If there is overfitting, we would expect the error to be high at this evaluation.
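
As a rough illustration of this point, the sketch below fits an unconstrained decision tree, a model that can memorize its training data, and compares its training RMSE with its RMSE on held-out data. The dataset and the choice of DecisionTreeRegressor are assumptions for the demonstration and are not part of the example in this guide.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data, assumed for illustration
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit can fit the training noise exactly
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))

print("Training RMSE:", train_rmse)
print("Test RMSE:    ", test_rmse)
```

The gap between the two numbers is the symptom to look for: the in-sample error says little about how the model behaves on unseen data.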

Similarly, we do not expect this evaluation score to be very different from the one we obtained from cross validation in the previous step, if we did the model training correctly. This can serve as a confirmation of our model selection.

**Example of the model selection workflow using cross-validation**

In the following, we fabricate a regression problem to illustrate how a model selection workflow should look.

First, we use numpy to generate a dataset:

    ...
    # Generate data and plot
    N = 300
    x = np.linspace(0, 7*np.pi, N)
    smooth = 1 + 0.5*np.sin(x)
    y = smooth + 0.2*np.random.randn(N)

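We generate a sine curve and add some noise to it. Essentially, the data is

$$y = 1 + 0.5\sin(x) + \epsilon$$

for some small noise signal $\epsilon$ (in the code above, Gaussian noise with standard deviation 0.2). Plotting $y$ against $x$ shows a noisy signal fluctuating around the smooth sine curve.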

Then we perform a train-test split and hold out the test set until we finish our final model. Because we are going to use scikit-learn models for regression, and they expect the input x to be a two-dimensional array, we reshape it here first. Also, to make the effect of model selection more pronounced, we do not shuffle the data in the split. In practice, this is usually not a good idea.

    ...
    # Train-test split, intentionally use shuffle=False
    X = x.reshape(-1,1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
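
To see what the reshape and the unshuffled split do, the sketch below rebuilds x the same way and inspects the resulting shapes. It omits the noise term for brevity, which is a simplification made only in this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(0, 7 * np.pi, 300)
y = 1 + 0.5 * np.sin(x)          # noise omitted in this sketch

X = x.reshape(-1, 1)             # one column, as many rows as needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

print(x.shape, X.shape)              # (300,) (300, 1)
print(X_train.shape, X_test.shape)   # (240, 1) (60, 1)

# Without shuffling, the test set is simply the last 20% of the points
print(X_train.max() < X_test.min())  # True
```

Because every test input lies to the right of every training input, the models are asked to extrapolate, which is what makes the difference between them more pronounced here.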

Next, we create two models for regression. They are quadratic:

$$y = c + bx + ax^2$$

and linear:

$$y = b + ax$$

There is no polynomial regression class in scikit-learn, but we can combine PolynomialFeatures with LinearRegression to achieve the same thing. PolynomialFeatures(2) converts the input $x$ into $1, x, x^2$, and linear regression on these three features finds the coefficients $a, b, c$ in the formula above.

    ...
    # Create two models: Quadratic and linear regression
    polyreg = make_pipeline(PolynomialFeatures(2), LinearRegression(fit_intercept=False))
    linreg = LinearRegression()
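
As a small check of what PolynomialFeatures(2) produces, the sketch below transforms two sample inputs. The input values 2 and 3 and the name x_demo are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical inputs, one feature each
x_demo = np.array([[2.0], [3.0]])

# Each row becomes [1, x, x^2]
features = PolynomialFeatures(2).fit_transform(x_demo)
print(features)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

Because the first column is already the constant 1, the pipeline above sets fit_intercept=False so that the constant term is not modeled twice.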

Next, we use only the training set and apply k-fold cross validation to each of the two models:

    ...
    # Cross-validation
    scoring = "neg_root_mean_squared_error"
    polyscores = cross_validate(polyreg, X_train, y_train, scoring=scoring, return_estimator=True)
    linscores = cross_validate(linreg, X_train, y_train, scoring=scoring, return_estimator=True)

The function cross_validate() returns a Python dictionary like the following:

    {'fit_time': array([0.00177097, 0.00117302, 0.00219226, 0.0015142 , 0.00126314]),
     'score_time': array([0.00054097, 0.0004108 , 0.00086379, 0.00092077, 0.00043106]),
     'estimator': [Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                                   ('linearregression', LinearRegression(fit_intercept=False))]),
                   Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                                   ('linearregression', LinearRegression(fit_intercept=False))]),
                   Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                                   ('linearregression', LinearRegression(fit_intercept=False))]),
                   Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                                   ('linearregression', LinearRegression(fit_intercept=False))]),
                   Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                                   ('linearregression', LinearRegression(fit_intercept=False))])],
     'test_score': array([-1.00421665, -0.53397399, -0.47742336, -0.41834582, -0.68043053])}

Here the key test_score holds the score for each fold. We are using negative root mean squared error for the cross validation: the higher the score, the smaller the error, and hence the better the model.

The above is from the quadratic model. The corresponding test score from the linear model is as follows:

    array([-0.43401194, -0.52385836, -0.42231028, -0.41532203, -0.43441137])

By comparing the average scores, we find that the linear model performs better than the quadratic model.

    ...
    # Which one is better? Linear and polynomial
    print("Linear regression score:", linscores["test_score"].mean())
    print("Polynomial regression score:", polyscores["test_score"].mean())
    print("Difference:", linscores["test_score"].mean() - polyscores["test_score"].mean())

    Linear regression score: -0.4459827970437929
    Polynomial regression score: -0.6228780695994603
    Difference: 0.17689527255566745
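
Beyond the mean, it can help to look at the spread of the fold scores and at the fold-by-fold difference. The sketch below does so with plain numpy; the two arrays are copied from the test_score outputs printed above (rounded to eight decimals), and the variable names are introduced here for illustration.

```python
import numpy as np

# Per-fold scores (negative RMSE) copied from the outputs shown above
poly = np.array([-1.00421665, -0.53397399, -0.47742336, -0.41834582, -0.68043053])
lin = np.array([-0.43401194, -0.52385836, -0.42231028, -0.41532203, -0.43441137])

for name, scores in [("linear", lin), ("quadratic", poly)]:
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std(ddof=1):.4f}")

# Both models were scored on the same folds, so the scores can be paired
diff = lin - poly
wins = int((diff > 0).sum())
print("Per-fold difference (linear - quadratic):", diff)
print("Folds where linear scores higher:", wins, "of", len(diff))
```

A model that is ahead on most or all folds, and not only on average, is a more convincing choice than one whose lead comes from a single fold.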

Before we proceed to train our model of choice, we can illustrate what happened. Taking the first cross-validation iteration as an example, we can see that the coefficients of the quadratic regression are as follows:

    ...
    # Let's show the coefficients of the first fitted polynomial regression
    # This starts from the constant term and is in ascending order of powers
    print(polyscores["estimator"][0].steps[1][1].coef_)

    [-0.03190358 0.20818594 -0.00937904]

This means our fitted quadratic model is:

$$y = -0.0319 + 0.2082x - 0.0094x^2$$

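and the coefficients of the linear regression at the first iteration of its cross validation are

    ...
    # And show the coefficients of the first fitted linear regression
    # (index 0 for both attributes, so they come from the same fold)
    print(linscores["estimator"][0].intercept_, linscores["estimator"][0].coef_)

    0.856999187854241 [-0.00918622]

which means that the fitted linear model is

$$y = 0.8570 - 0.0092x$$

(The output above was produced with the slope read from the last fold's estimator, index -1, so the slope printed by the line as written here may differ slightly.)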

We can see how they look in a plot:

    ...
    # Plot and compare
    plt.plot(x, y)
    plt.plot(x, smooth)
    plt.plot(x, polyscores["estimator"][0].predict(X))
    plt.plot(x, linscores["estimator"][0].predict(X))
    plt.ylim(0,2)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()

In the resulting plot, the red line is the linear regression while the green line is the quadratic regression. We can see the quadratic curve is far off from the input data (blue curve) at both ends.

Since we decided to use the linear model for regression, we need to retrain the model and test it using our held-out test data.

    ...
    # Retrain the model and evaluate
    linreg.fit(X_train, y_train)
    # RMSE as the square root of MSE (the squared=False argument is deprecated in newer scikit-learn releases)
    print("Test set RMSE:", np.sqrt(mean_squared_error(y_test, linreg.predict(X_test))))
    print("Mean validation RMSE:", -linscores["test_score"].mean())

    Test set RMSE: 0.4403109417232645
    Mean validation RMSE: 0.4459827970437929

Here, since scikit-learn clones a new model on each iteration of cross validation, the model we created remains untrained after cross validation. Otherwise, we would need to reset the model by cloning a new one with linreg = sklearn.base.clone(linreg). From the above, we see that we obtained a root mean squared error of 0.440 on our test set, while the score we obtained from cross validation is 0.446. This is not much of a difference, and hence we conclude that this model should see an error of similar magnitude on new data.
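
The cloning behavior can be checked directly. The sketch below, on a small synthetic dataset assumed for illustration, tests for the fitted attribute coef_ at three points: after cross validation, after an explicit fit, and on a clone of the fitted model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data, assumed for illustration
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(50, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=50)

model = LinearRegression()

# cross_validate() fits clones, so the original object stays unfitted
cross_validate(model, X, y, scoring="neg_root_mean_squared_error")
fitted_after_cv = hasattr(model, "coef_")

# An explicit fit adds the learned attributes to the object
model.fit(X, y)
fitted_after_fit = hasattr(model, "coef_")

# clone() copies the hyperparameters but none of the fitted state
fresh = clone(model)
fresh_is_fitted = hasattr(fresh, "coef_")

print(fitted_after_cv, fitted_after_fit, fresh_is_fitted)  # False True False
```

This is why the retraining step can call fit() on linreg right after cross validation without resetting it first.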

Tying all of these together, the complete example is listed below.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import cross_validate, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    np.random.seed(42)

    # Generate data and plot
    N = 300
    x = np.linspace(0, 7*np.pi, N)
    smooth = 1 + 0.5*np.sin(x)
    y = smooth + 0.2*np.random.randn(N)
    plt.plot(x, y)
    plt.plot(x, smooth)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.ylim(0,2)
    plt.show()

    # Train-test split, intentionally use shuffle=False
    X = x.reshape(-1,1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

    # Create two models: Polynomial and linear regression
    degree = 2
    polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression(fit_intercept=False))
    linreg = LinearRegression()

    # Cross-validation
    scoring = "neg_root_mean_squared_error"
    polyscores = cross_validate(polyreg, X_train, y_train, scoring=scoring, return_estimator=True)
    linscores = cross_validate(linreg, X_train, y_train, scoring=scoring, return_estimator=True)

    # Which one is better? Linear and polynomial
    print("Linear regression score:", linscores["test_score"].mean())
    print("Polynomial regression score:", polyscores["test_score"].mean())
    print("Difference:", linscores["test_score"].mean() - polyscores["test_score"].mean())

    print("Coefficients of polynomial regression and linear regression:")
    # Let's show the coefficients of the first fitted polynomial regression
    # This starts from the constant term and is in ascending order of powers
    print(polyscores["estimator"][0].steps[1][1].coef_)
    # And show the coefficients of the first fitted linear regression
    # (index 0 for both attributes, so they come from the same fold)
    print(linscores["estimator"][0].intercept_, linscores["estimator"][0].coef_)

    # Plot and compare
    plt.plot(x, y)
    plt.plot(x, smooth)
    plt.plot(x, polyscores["estimator"][0].predict(X))
    plt.plot(x, linscores["estimator"][0].predict(X))
    plt.ylim(0,2)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()

    # Retrain the model and evaluate
    linreg = clone(linreg)
    linreg.fit(X_train, y_train)
    # RMSE as the square root of MSE (the squared=False argument is deprecated in newer scikit-learn releases)
    print("Test set RMSE:", np.sqrt(mean_squared_error(y_test, linreg.predict(X_test))))
    print("Mean validation RMSE:", -linscores["test_score"].mean())

**Further Reading**

This section provides more resources on the topic if you are looking to go deeper.

**APIs**

- model_selection.KFold API
- model_selection.cross_val_score API
- model_selection.cross_validate API

**Articles**

- Cross-validation (statistics), Wikipedia

**Conclusion**

In this guide, you discovered how to perform a training-validation-test split of a dataset and k-fold cross validation to select a model correctly, and how to retrain the model after the selection.

Specifically, you learned:

- The significance of the training-validation-test split in helping model selection
- How to evaluate and compare machine learning models using k-fold cross-validation on a training set
- How to retrain a model after selecting it from the candidates based on the advice from cross-validation
- How to use the test set to confirm our model selection