Nearest shrunken centroids with Python
Nearest Centroids is a linear classification machine learning algorithm.
It involves predicting a class label for new instances based on which class-based centroid, computed from the training dataset, the instance is closest to.
The Nearest Shrunken Centroids algorithm is an extension that shrinks the class-based centroids toward the centroid of the entire training dataset and removes those input variables that are least useful for discriminating between the classes.
As such, the Nearest Shrunken Centroids algorithm carries out an automatic form of feature selection, making it appropriate for datasets with very large numbers of input variables.
In this guide, you will discover the Nearest Shrunken Centroids classification machine learning algorithm.
After completing this guide, you will know:
- The Nearest Shrunken Centroids algorithm is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with scikit-learn.
- How to tune the hyperparameters of the Nearest Shrunken Centroids Algorithm on a provided dataset.
Tutorial Overview
This tutorial is divided into three parts, which are:
1. Nearest Centroids Algorithm
2. Nearest Centroids with Scikit-learn
3. Tuning Nearest Centroid Hyperparameters
Nearest Centroids Algorithm
Nearest centroids is a classification ML algorithm.
The algorithm works by first summarizing the training dataset into a set of centroids (centers), then using the centroids to make predictions for new instances.
For every class, the centroid of the data is identified by taking the average value of each predictor (per class) in the training set. The overall centroid is computed leveraging the data from all of the classes.
A centroid is the geometric center of a data distribution, such as the mean. In multiple dimensions, this is the mean value along each dimension, forming a point at the center of the distribution across all variables.
The Nearest Centroids algorithm assumes that the centroids in the input feature space differ for each target label. The training data is split into groups by class label, then the centroid for each group of data is calculated. Each centroid is simply the mean value of each of the input variables. If there are two classes, then two centroids or points are calculated; three classes give three centroids, and so on.
The centroids then represent the “model”. Given new instances, such as those in the test set or new data, the distance between a given row of data and each centroid is calculated and the closest centroid is used to assign a class label to the instance.
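To make this concrete, below is a minimal from-scratch sketch (separate from the scikit-learn workflow used later in this tutorial) that computes one centroid per class with NumPy and assigns a new row to the class with the closest centroid. The toy data and variable names are purely illustrative.
# minimal nearest centroid sketch on toy data (illustrative only)
from numpy import array, unique
from numpy.linalg import norm
# toy training data: two input variables, two classes
X = array([[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]])
y = array([0, 0, 1, 1])
# one centroid per class: the mean of each input variable within that class
centroids = {label: X[y == label].mean(axis=0) for label in unique(y)}
# assign a new row to the class whose centroid is closest (Euclidean distance)
row = array([2.9, 3.0])
distances = {label: norm(row - centroid) for label, centroid in centroids.items()}
print(min(distances, key=distances.get))
Running this toy sketch assigns the row to class 1, whose centroid it is nearest to.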
Distance measures, such as Euclidean distance for numerical data or Hamming distance for categorical data, are used, in which case it is best practice to scale input variables via normalization or standardization before training the model. This ensures that input variables with large values do not dominate the distance calculation.
An extension to the nearest centroid approach for classification is to shrink the centroids of each input variable toward the centroid of the entire training dataset. Those variables that are shrunken down to the value of the data centroid can then be removed, as they do not help discriminate between the class labels.
As such, the amount of shrinkage applied to the centroids is a hyperparameter that can be tuned for the dataset and used to carry out an automatic form of feature selection. This makes the method appropriate for datasets with a large number of input variables, some of which may be irrelevant or noisy.
Consequently, the nearest shrunken centroid model also performs feature selection during the model training procedure.
This approach is referred to as “Nearest Shrunken Centroids” and was first described by Robert Tibshirani, et al. in their 2002 research paper titled “Diagnosis of multiple cancer types by shrunken centroids of gene expression.”
Nearest Centroids with Scikit-learn
The Nearest Shrunken Centroids algorithm is available in the scikit-learn Python machine learning library via the NearestCentroid class.
The class allows the distance metric used in the algorithm to be configured via the “metric” argument, which defaults to “euclidean” for the Euclidean distance metric.
This can be changed to other built-in metrics, such as “manhattan”.
...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean')
By default, no shrinkage is leveraged, but shrinkage can be mentioned through the “shrink_threshold” argument which takes on a floating point value ranging between 0 and 1.
...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean', shrink_threshold=0.5)
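As a side note, and as an assumption on our part rather than something covered in the original tutorial, one rough way to see the effect of shrinkage is to inspect the fitted model's centroids_ attribute: input variables whose per-class centroids have all been shrunken to the same value no longer help discriminate the classes. The sketch below uses the same synthetic dataset that the worked example defines next.
# sketch: count input variables shrunken out of the model (illustrative inspection)
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# fit a model with shrinkage applied
model = NearestCentroid(shrink_threshold=0.5)
model.fit(X, y)
# variables whose per-class centroids are identical carry no class information
uninformative = (model.centroids_ == model.centroids_[0]).all(axis=0)
print('Variables shrunken out: %d' % uninformative.sum())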
We can demonstrate the Nearest Shrunken Centroids with a worked instance.
To start with, let’s define a synthetic classification dataset.
We will leverage the make_classification() function to develop a dataset with 1,000 instances, each with 20 input variables.
The instance creates and summarizes the dataset.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Running the instance creates the dataset and confirms the number of rows and columns of the dataset.
(1000, 20) (1000,)
We can fit and assess a Nearest Shrunken Centroids model leveraging repeated stratified k-fold cross-validation through the RepeatedStratifiedKFold class. We will leverage 10 folds and three repeats in the test harness.
We will leverage the default configuration of Euclidean distance and no shrinkage.
...
# create the nearest centroid model
model = NearestCentroid()
The complete instance of assessing the Nearest Shrunken Centroids model for the synthetic binary classification activity is detailed below.
# evaluate a nearest centroid model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Running the instance evaluates the Nearest Shrunken Centroids algorithm on the synthetic dataset and reports the mean accuracy across the three repeats of 10-fold cross-validation.
Your specific results may vary given the stochastic nature of the learning algorithm or evaluation procedure. Consider running the instance a few times.
In this scenario, we can see that the model achieved a mean accuracy of approximately 71 percent.
Mean Accuracy: 0.711 (0.055)
We might decide to use the Nearest Shrunken Centroids model as our final model and make predictions on new data.
This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.
We can demonstrate this with a complete instance detailed below.
# make a prediction with a nearest centroid model on the dataset
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# fit model
model.fit(X, y)
# define new data
row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat)
Running the instance fits the model and makes a class label prediction for a fresh row of data.
Predicted Class: 0
Then, we can look into configuring the model hyperparameters.
Tuning Nearest Centroid Hyperparameters
The hyperparameters for the Nearest Shrunken Centroid strategy must be configured for your particular dataset.
Probably the most critical hyperparameter is the amount of shrinkage, controlled via the “shrink_threshold” argument. It is a good idea to test values between 0 and 1 on a grid with a spacing such as 0.1 or 0.01.
The instance below demonstrates this leveraging the GridSearchCV class with a grid of values we have defined.
# grid search shrinkage for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
Running the instance will evaluate each combination of configurations using repeated cross-validation.
Your specific results may vary given the stochastic nature of the learning algorithm or evaluation procedure. Try running the instance a few times.
In this scenario, we can see that we achieved slightly better results than the default, with 71.4 percent versus 71.1 percent. We can see that the best configuration used a shrink_threshold value of 0.53.
Mean Accuracy: 0.714
Config: {'shrink_threshold': 0.53}
The other critical configuration is the distance measure leveraged, which can be selected on the basis of the distribution of the input variables.
Any of the built-in distance measures can be leveraged, as detailed here:
- metrics.pairwise.pairwise_distances API
Typical distance measures include:
- cityblock, cosine, euclidean, l1, l2, manhattan
Given that our input variables are numeric, our dataset only supports 'euclidean' and 'manhattan'.
We can add these metrics to our grid search; the complete instance is detailed below.
# grid search shrinkage and distance metric for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
grid['metric'] = ['euclidean', 'manhattan']
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
Running the instance fits the model and discovers the hyperparameters that give the best results using cross-validation.
Your specific results may vary given the stochastic nature of the learning algorithm or evaluation procedure. Try running the instance a few times.
In this scenario, we can see that we get slightly better accuracy of 75 percent using no shrinkage and the Manhattan distance measure instead of the Euclidean distance measure.
Mean Accuracy: 0.750
Config: {'metric': 'manhattan', 'shrink_threshold': 0.0}
A good extension to these experiments would be to add data normalization or standardization to the data as part of a modeling pipeline, as sketched below.
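A minimal sketch of that extension is given below, assuming the same synthetic dataset as above; the pipeline step names ('scaler' and 'model') are illustrative choices rather than anything prescribed by the tutorial.
# sketch: grid search a nearest centroid model with standardization in a pipeline
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# standardize inputs before fitting the nearest centroid model
pipeline = Pipeline([('scaler', StandardScaler()), ('model', NearestCentroid())])
# grid of shrinkage values; pipeline parameters use the '<step>__<param>' naming convention
grid = {'model__shrink_threshold': arange(0, 1.01, 0.01)}
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# perform the search and report the best configuration found
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=cv, n_jobs=-1)
results = search.fit(X, y)
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)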
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression, 2002
Books
Section 12.6 Nearest Shrunken Centroids, Applied Predictive Modeling, 2013.
APIs
sklearn.neighbors.NearestCentroid API
metrics.pairwise.pairwise_distances API
Articles
Nearest centroid classifier, Wikipedia
Centroid, Wikipedia
Conclusion
In this guide, you discovered the Nearest Shrunken Centroids classification machine learning algorithm.
Particularly, you learned:
- The Nearest Shrunken Centroids algorithm is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with scikit-learn.
- How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a provided dataset.