An Introduction to Machine Learning Modelling Pipelines
Applied machine learning usually focuses on finding a single model that performs well, or best, on a given dataset.
Using that model effectively requires appropriate preparation of the input data and tuning of the model's hyperparameters.
Taken together, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is referred to as the modelling pipeline. Modern machine learning libraries, such as the scikit-learn Python library, allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during both evaluation and prediction).
Nonetheless, working with modelling pipelines can be confusing for beginners, as it requires a shift in perspective on the applied machine learning process.
In this guide, you will discover modelling pipelines for applied machine learning.
After completing this guide, you will know:
- Applied machine learning is concerned with more than finding a well-performing model; it also requires finding an appropriate sequence of data preparation steps and steps for post-processing predictions.
- Collectively, the operations required to address a predictive modelling problem can be treated as an atomic unit called a modelling pipeline.
- Approaching applied machine learning through the lens of modelling pipelines requires a shift in thinking from evaluating specific model configurations to evaluating sequences of transforms and algorithms.
This tutorial is divided into three parts:
1. Identifying a skillful model is not enough
2. What is a modelling pipeline?
3. Implications of a modelling pipeline
Identifying a Skillful Model Is Not Enough
Applied machine learning is often described as the process of finding the model that performs best on a given predictive modelling dataset.
In fact, it involves more than this.
In addition to discovering which model performs best on your dataset, you may also need to discover:
- Data transforms that best expose the unknown underlying structure of the problem to the learning algorithms.
- Model hyperparameters that result in a good or best configuration of a chosen model.
There may also be further considerations, such as techniques that transform the predictions made by the model, like threshold moving or calibration of predicted probabilities.
As such, it is common to think of applied machine learning as a large combinatorial search problem over data transforms, models, and model configurations.
This can be challenging in practice, as it requires that the sequence of one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes be evaluated consistently and correctly on a given test harness.
Although tricky, this may be manageable with a simple train-test split, but it quickly becomes unmanageable when using k-fold cross-validation, or even repeated k-fold cross-validation.
The solution is to use a modelling pipeline to keep everything straight.
What is a Modelling Pipeline?
A pipeline is a linear sequence of data preparation options, modelling operations, and prediction transform operations.
It allows the sequence of steps to be specified, evaluated, and used as an atomic unit.
Pipeline: A linear sequence of data preparation and modelling steps that can be treated as an atomic unit.
To make the idea clear, let's look at two simple examples:
The first example normalizes the input variables and fits a logistic regression model.
- [Input], [Normalization], [Logistic Regression], [Predictions]
The second example standardizes the input variables, applies RFE feature selection, and fits a support vector machine.
- [Input], [Standardization], [RFE], [SVM], [Predictions]
You can imagine other examples of modelling pipelines.
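As a sketch, the two example sequences could be defined with scikit-learn's Pipeline class as follows. The step names and the synthetic dataset are illustrative assumptions, not part of the original examples:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# a small synthetic classification dataset as a stand-in for a real one
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# first example: normalize the inputs, then fit a logistic regression
pipeline1 = Pipeline(steps=[
    ('norm', MinMaxScaler()),
    ('model', LogisticRegression()),
])

# second example: standardize, select 5 features with RFE, then fit an SVM
pipeline2 = Pipeline(steps=[
    ('std', StandardScaler()),
    ('rfe', RFE(estimator=LogisticRegression(), n_features_to_select=5)),
    ('model', SVC()),
])

# each pipeline is fit and used exactly like a single model
pipeline1.fit(X, y)
print(pipeline1.predict(X[:3, :]))
```

Calling fit() runs each transform in order on the training data before fitting the final estimator, and predict() applies the same fitted transforms before calling the model.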
As an atomic unit, the pipeline can be evaluated using a preferred resampling scheme, such as a train-test split or k-fold cross-validation.
This is important for two main reasons:
- Avoiding data leakage
- Consistency and reproducibility
A modelling pipeline avoids the most common type of data leakage, where data preparation techniques, such as scaling input values, are applied to the entire dataset. This is data leakage because it shares knowledge of the test dataset (such as observations that contribute to a mean or a known maximum value) with the training dataset and, in turn, may result in overly optimistic estimates of model performance.
Instead, data transforms must be prepared on the training dataset only, then applied to the training dataset, test dataset, validation dataset, and any other datasets that require the transform before being used with the model.
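As an illustrative sketch on a synthetic dataset, evaluating a pipeline with k-fold cross-validation refits the scaler on the training portion of each fold only, so no statistics from the held-out fold leak into training:

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic dataset as a stand-in for a real predictive modelling problem
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# the scaler is part of the pipeline, so it is refit within each
# training fold rather than being prepared on the entire dataset
pipeline = Pipeline(steps=[('norm', MinMaxScaler()), ('model', LogisticRegression())])
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```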
A modelling pipeline ensures that the sequence of data preparation operations performed is reproducible.
Without a modelling pipeline, the data preparation steps may be performed manually twice: once when evaluating the model and once when making predictions. Any changes to the sequence must be kept consistent in both cases, otherwise differences will affect the capability and skill of the model.
A pipeline ensures that the sequence of operations is defined once and is consistent whether it is used for model evaluation or for making predictions.
The scikit-learn Python machine learning library provides a machine learning modelling pipeline via the Pipeline class.
Implications of a Modelling Pipeline
The modelling pipeline is an important tool for machine learning practitioners.
Nonetheless, there are important implications that must be considered when using pipelines.
The main confusion for beginners using pipelines lies in understanding what the pipeline has learned, or the specific configuration discovered by the pipeline.
For example, a pipeline may use a data transform that configures itself automatically, such as the RFECV technique for feature selection.
When evaluating a pipeline that uses an automatically configured data transform, what configuration does it choose? Or when fitting this pipeline as a final model to make predictions, what configuration did it choose?
The answer is, it doesn't matter.
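For example, a pipeline with an automatically configured RFECV step can be scored as a whole; the number of features it keeps is decided internally and never inspected. A minimal sketch on a synthetic dataset:

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# synthetic dataset as a stand-in
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# RFECV decides for itself how many features to keep; we score the
# pipeline as an atomic unit without asking which features it chose
pipeline = Pipeline(steps=[
    ('rfe', RFECV(estimator=LogisticRegression())),
    ('model', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
print('Mean accuracy: %.3f' % mean(scores))
```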
Another example is the use of hyperparameter tuning as the final step of the pipeline.
The grid search will be performed on the data produced by any prior transform steps in the pipeline; it will then search for the best combination of hyperparameters for the model using that data, and fit a model with those hyperparameters on that data.
When evaluating a pipeline that grid searches model hyperparameters, what configuration does it choose? Or when fitting this pipeline as a final model to make predictions, what configuration did it choose?
The answer, again, is that it doesn't matter.
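A sketch of this pattern: wrapping the model in a GridSearchCV and using it as the final pipeline step means the tuning is performed on whatever data the earlier transforms produce. The parameter grid and dataset here are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic dataset as a stand-in
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# the grid search is the final pipeline step: it receives the
# standardized data, searches the grid, and refits the best model
search = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1.0, 10.0]}, cv=3)
pipeline = Pipeline(steps=[('std', StandardScaler()), ('search', search)])

# fit as a final model and predict; the chosen value of C
# does not need to be inspected in order to use the pipeline
pipeline.fit(X, y)
print(pipeline.predict(X[:3, :]))
```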
The same answer applies when using a threshold-moving or probability-calibration step at the end of the pipeline.
The reason is the same as the reason we are not concerned with the specific internal structure or coefficients of the chosen model.
For example, when evaluating a logistic regression model, we do not need to inspect the coefficients chosen in each k-fold cross-validation round in order to choose the model. Instead, we focus on its out-of-fold predictive skill.
Likewise, when using a logistic regression model as the final model for making predictions on new data, we do not need to inspect the coefficients chosen when fitting the model on the entire dataset before making predictions.
We can inspect and discover the coefficients used by the model as an exercise in analysis, but it does not influence the selection and use of the model.
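To illustrate with a minimal sketch on a synthetic dataset, the fitted coefficients can be read off the final model purely as analysis; nothing about using the model depends on looking at them:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic dataset as a stand-in for the real problem
X, y = make_classification(n_samples=100, n_features=5, random_state=1)

# fit the final model on the entire dataset before making predictions
model = LogisticRegression()
model.fit(X, y)

# inspecting the learned coefficients is optional analysis only;
# it plays no part in selecting or using the model
print(model.coef_)
```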
The same answer generalizes to a modelling pipeline.
We are not concerned with which features may have been automatically selected by a data transform in the pipeline. Nor are we concerned with which hyperparameters were chosen for the model when using a grid search as the final step of the modelling pipeline.
In all three cases (the single model, the pipeline with automatic feature selection, and the pipeline with a grid search), we are evaluating the “model” or “modelling pipeline” as an atomic unit.
The pipeline allows us, as machine learning practitioners, to move up one level of abstraction and be less concerned with the specific outputs of the algorithms and more concerned with the capability of a sequence of procedures.
As such, we can focus on evaluating the capability of the algorithms on the dataset, not the product of the algorithms, that is, the model. Once we have an estimate of the pipeline's performance, we can apply it and be confident of achieving similar performance.
This is a shift in thinking and may take some time to get used to.
It is also the philosophy underlying AutoML (automated machine learning) techniques, which treat applied machine learning as a large combinatorial search problem.
In this guide, you discovered modelling pipelines for applied machine learning.
Specifically, you learned:
- Applied machine learning is concerned with more than finding a well-performing model; it also requires finding an appropriate sequence of data preparation steps and steps for post-processing predictions.
- Collectively, the operations required to address a predictive modelling problem can be treated as an atomic unit called a modelling pipeline.
- Approaching applied machine learning through the lens of modelling pipelines requires a shift in thinking from evaluating specific model configurations to evaluating sequences of transforms and algorithms.