Tuning Machine Learning Models with the caret R Package
Machine learning algorithms are parameterized so that they can be adapted to the problem at hand. The complication is that configuring an algorithm for a given problem can be a project in and of itself.
As with choosing the best algorithm for a problem, you cannot know beforehand which parameter values will work best. The best approach is to investigate empirically with controlled experiments.
The caret R package was designed to make finding optimal parameters for an algorithm straightforward. It provides a grid search method for searching parameters, combined with a number of methods for estimating the performance of a given model.
In this post you will discover five recipes that you can use to tune machine learning algorithms and find optimal parameters for your problems using the caret R package.
Model Tuning
The caret R package provides a grid search where you or the package can specify the parameters to try on your problem. It will trial all combinations and locate the single combination that gives the best results.
The examples in this post show how you can use the caret R package to tune a machine learning algorithm.
Learning Vector Quantization (LVQ) is used in all examples because of its simplicity. It is like k-nearest neighbours, except the database of samples is smaller and adapted based on the training data. It has two parameters to tune: the number of instances (codebook vectors) in the model, called the size, and the number of instances to check when making predictions, called k.
Each example also uses the iris flowers dataset, which ships with R. This classification dataset provides 150 observations of three species of iris flower, along with their petal and sepal measurements in centimetres.
Each example also assumes that classification accuracy is the metric we are optimizing, although this can be changed, as the sketch below shows. Also, each example estimates the performance of a given model (size and k parameter combination) using repeated n-fold cross validation, with 10 folds and 3 repeats. This too can be changed if you like.
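For instance, the metric optimized by train() can be switched from Accuracy to Kappa via the metric argument. The snippet below is a minimal sketch of that change; the rest of the call mirrors the recipes that follow.

# optimize Cohen's Kappa instead of Accuracy via the metric argument
library(caret)
data(iris)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
model <- train(Species~., data=iris, method="lvq", metric="Kappa", trControl=control, tuneLength=5)
print(model)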
Grid Search: Automatic Grid
There are two ways to tune an algorithm in the caret R package; the first is to allow the system to do it automatically. This is done by setting tuneLength to indicate the number of different values to try for each algorithm parameter.
This only supports integer and categorical algorithm parameters, and it makes a crude guess as to which values to try, but it can get you up and running very quickly.
The following recipe demonstrates an automatic grid search of the size and k attributes of LVQ, with 5 values of each (tuneLength=5), for 25 models in total.
Automatic parameter tuning in R

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(Species~., data=iris, method="lvq", trControl=control, tuneLength=5)
# summarize the model
print(model)
The final values used for the model were size = 10 and k = 1.
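Beyond print(), the fitted train object carries the full resampling results; the small sketch below (assuming the model object from the recipe above) shows how to inspect them.

# accuracy estimated for every size/k combination that was tried
print(model$results)
# the single best parameter combination selected by caret
print(model$bestTune)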
Grid Search: Manual Grid
The second way to search algorithm parameters is to specify a tuning grid manually. In the grid, each algorithm parameter is specified as a vector of possible values. Together these vectors define all the possible combinations to try.
The recipe below demonstrates the search of a manual tuning grid with 4 values for the size parameter and 5 values for the k parameter (20 combinations).
Grid search with the caret R package

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# design the parameter tuning grid
grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
# train the model
model <- train(Species~., data=iris, method="lvq", trControl=control, tuneGrid=grid)
# summarize the model
print(model)
The final values used for the model were size = 50 and k = 5.
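After tuning, train() refits the model on all of the training data using the winning parameters, so the object can be used for prediction directly. The sketch below assumes the model object from the recipe above and re-predicts on iris purely to illustrate the calls; in practice you would predict on held-out data.

# predict classes with the final tuned model
predictions <- predict(model, newdata=iris[, 1:4])
# confusion matrix of predicted versus actual species
print(confusionMatrix(predictions, iris$Species))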
Data Pre-Processing
The dataset can be preprocessed as part of the parameter tuning. It is important that this happens within the resamples used to evaluate each model, so that the results account for all of the variability in the test. If the dataset were normalized or standardized before the tuning process, the model would have access to extra knowledge (bias) and would not give an accurate estimate of performance on unseen data.
The attributes in the iris dataset are all in the same units and generally on the same scale, so normalization and standardization are not really needed. Nevertheless, the example below demonstrates tuning the size and k parameters of LVQ while rescaling the dataset with preProcess="scale".
Grid search with preprocessing with the caret R package

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(Species~., data=iris, method="lvq", preProcess="scale", trControl=control, tuneLength=5)
# summarize the model
print(model)
The final values used for the model were size = 8 and k = 6.
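Several transforms can also be combined by passing a vector to preProcess. The sketch below standardizes the predictors (centering then scaling) within each resample before tuning; the recipe otherwise matches the one above.

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# standardize (center, then scale) within each resample, then tune
model <- train(Species~., data=iris, method="lvq", preProcess=c("center","scale"), trControl=control, tuneLength=5)
# summarize the model
print(model)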
Parallel Processing
The caret package supports parallel processing in order to decrease the compute time for a given experiment. It is used automatically as long as a parallel backend is configured. In this example we load the doMC package and set the number of cores to 4, making 4 worker threads available to caret while tuning the model. These are used for the loops over the repeats of cross validation for each parameter combination.
Grid search with parallel processing with the caret R package

# ensure results are repeatable
set.seed(7)
# configure multicore
library(doMC)
registerDoMC(cores=4)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(Species~., data=iris, method="lvq", trControl=control, tuneLength=5)
# summarize the model
print(model)
The results are the same as in the first example, only completed faster.
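Note that doMC relies on forking and is generally limited to Unix-like systems. As an alternative, the doParallel package provides a similar backend that also works on Windows; the minimal sketch below shows the setup.

# register a 4-worker parallel backend with doParallel instead of doMC
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# ... run the caret tuning recipe as before ...
# release the workers when finished
stopCluster(cl)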
Visualization of Performance
It can be useful to plot the performance of different algorithm parameter combinations to look for trends and to assess the sensitivity of the model. Caret supports plotting the model directly, which compares the accuracy of the different parameter combinations.
In this recipe, a larger manual grid of algorithm parameters is defined and the results are plotted. The plot shows size on the x-axis and model accuracy on the y-axis, with one line drawn for each value of k. The plot shows the general trend of performance increasing with size, and that larger values of k are likely preferred.
Grid search with visualization with the caret R package
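A minimal sketch of such a recipe is shown below; the particular grid values are illustrative assumptions, and plot() on the fitted train object draws the accuracy profile.

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# design a larger parameter grid (values are illustrative)
grid <- expand.grid(size=c(5,10,20,50,100), k=c(1,2,3,4,5))
# train the model over the grid
model <- train(Species~., data=iris, method="lvq", trControl=control, tuneGrid=grid)
# plot accuracy versus size, one line per value of k
plot(model)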