
Nearest shrunken centroids with Python

Nearest Centroids is a linear classification machine learning algorithm. 

It involves predicting a class label for new examples based on which class-based centroid from the training dataset the example is closest to. 

The Nearest Shrunken Centroids algorithm is an extension that shrinks the class-based centroids toward the centroid of the entire training dataset and removes input variables that are less useful for discriminating between the classes. 

As such, the Nearest Shrunken Centroids algorithm performs an automatic form of feature selection, making it appropriate for datasets with very large numbers of input variables. 

In this guide, you will discover the Nearest Shrunken Centroids classification machine learning algorithm. 

After going through this guide, you will know: 

  • The Nearest Shrunken Centroids algorithm is a simple linear machine learning algorithm for classification. 
  • How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with scikit-learn. 
  • How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a provided dataset. 

Tutorial Overview 

This tutorial is divided into three parts, which are: 

1] Nearest Centroids Algorithm 

2] Nearest Centroids with Scikit-learn 

3] Tuning Nearest Centroid Hyperparameters 

Nearest Centroids Algorithm 

Nearest Centroids is a classification machine learning algorithm. 

The algorithm first summarizes the training dataset into a set of centroids (centers), then uses the centroids to make predictions for new examples.  

For each class, the centroid of the data is found by taking the average value of each predictor (per class) in the training set. The overall centroid is computed using the data from all of the classes. 

A centroid is the geometric center of a data distribution, such as the mean. In multiple dimensions, this is the mean value along each dimension, forming the center point of the distribution across all variables. 

The Nearest Centroids algorithm assumes that the centroids in the input feature space differ for each target label. The training data is split into groups by class label, then the centroid for each group is calculated. Each centroid is simply the mean value of each of the input variables. If there are two classes, then two centroids (points) are calculated; three classes give three centroids, and so on. 

The centroids then represent the “model”. Given new examples, such as those in a test set or genuinely new data, the distance between a given row of data and each centroid is calculated, and the closest centroid is used to assign a class label to the example. 
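
To make this concrete, below is a minimal from-scratch sketch of the idea (not scikit-learn's implementation), assuming a small hypothetical toy dataset of numeric inputs and integer class labels: 

# sketch: the nearest centroid "model" is one mean vector per class plus a distance rule
import numpy as np

def fit_centroids(X, y):
    # one centroid per class: the mean of each input variable within that class
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_one(row, classes, centroids):
    # assign the class whose centroid is closest to the row (Euclidean distance)
    distances = np.linalg.norm(centroids - row, axis=1)
    return classes[np.argmin(distances)]

# hypothetical toy data: two classes in two dimensions
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X, y)
print(centroids)  # [[1.1 0.9] [5.1 4.9]]
print(predict_one(np.array([4.5, 5.0]), classes, centroids))  # 1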

A distance measure, such as Euclidean distance for numerical data or Hamming distance for categorical data, is used to find the closest centroid. In either case, it is best practice to scale the input variables via normalization or standardization before training the model, to ensure that input variables with large values do not dominate the distance calculation. 
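
As a rough illustration of that best practice, the sketch below normalizes the inputs with scikit-learn's MinMaxScaler before fitting the NearestCentroid class introduced in the next section; the synthetic dataset here simply mirrors the one defined later in this tutorial. 

# sketch: scale the inputs so variables with large values do not dominate the distance calculation
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import MinMaxScaler

# a synthetic dataset of the same kind used later in this tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# rescale each input variable to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# fit the model on the scaled data
model = NearestCentroid()
model.fit(X_scaled, y)
# new rows must be transformed with the same fitted scaler before calling predict()
print(model.predict(scaler.transform(X[:1])))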

An extension to the nearest centroid approach for classification is to shrink the centroids of each input variable towards the centroid of the entire training dataset. Those variables that are shrunk all the way to the value of the data centroid can then be removed, as they do not help to discriminate between the class labels. 

As such, the amount of shrinkage applied to the centroids is a hyperparameter that can be tuned for the dataset and used to perform an automatic form of feature selection. This makes the method well suited to datasets with a large number of input variables, some of which may be irrelevant or noisy. 

Consequently, the Nearest Shrunken Centroids model also performs feature selection during model training. 
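
To see this feature selection effect directly, we can inspect the fitted centroids: any input variable whose shrunken class centroids collapse to the same value no longer separates the classes. The sketch below is a minimal illustration using scikit-learn's NearestCentroid class (introduced in the next section) and the synthetic dataset defined later in this tutorial; depending on the shrink_threshold value, the list of removed features may well be empty. 

# sketch: shrinkage pulls class centroids toward the overall data centroid;
# features whose class centroids become identical no longer discriminate the classes
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
shrunk = NearestCentroid(shrink_threshold=0.5).fit(X, y)
# a feature is effectively removed when its centroid value is the same for every class
collapsed = np.all(np.isclose(shrunk.centroids_, shrunk.centroids_[0]), axis=0)
print('features effectively removed:', np.flatnonzero(collapsed))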

This approach is referred to as “Nearest Shrunken Centroids” and was first described by Robert Tibshirani, et al. in their 2002 paper “Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression.” 

Nearest Centroids with Scikit-learn 

The Nearest Shrunken Centroids algorithm is available in the scikit-learn Python machine learning library via the NearestCentroid class. 

The class allows the configuration of the distance metric used in the algorithm via the “metric” argument, which defaults to 'euclidean' for the Euclidean distance metric. 

This can be changed to other built-in metrics, such as 'manhattan'. 


# create the nearest centroid model 

model = NearestCentroid(metric='euclidean') 

 

By default, no shrinkage is used, but shrinkage can be specified via the “shrink_threshold” argument, which takes a floating point value between 0 and 1. 

 


# create the nearest centroid model 

model = NearestCentroid(metric='euclidean', shrink_threshold=0.5) 

 

We can demonstrate the Nearest Shrunken Centroids algorithm with a worked example.  

To start with, let's define a synthetic classification dataset. 

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables. 

The example below creates and summarizes the dataset. 

 


# test classification dataset 

from sklearn.datasets import make_classification 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) 

# summarize the dataset 

print(X.shape, y.shape) 

 

Running the example creates the dataset and confirms the number of rows and columns of the dataset. 

(1000, 20) (1000,) 

We can fit and evaluate a Nearest Shrunken Centroids model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness. 

We will use the default configuration of Euclidean distance and no shrinkage.  

 


# create the nearest centroid model 

model = NearestCentroid() 

 

The complete example of evaluating the Nearest Shrunken Centroids model on the synthetic binary classification task is listed below. 


# evaluate a nearest centroid model on the dataset 

from numpy import mean 

from numpy import std 

from sklearn.datasets import make_classification 

from sklearn.model_selection import cross_val_score 

from sklearn.model_selection import RepeatedStratifiedKFold 

from sklearn.neighbors import NearestCentroid 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) 

# define model 

model = NearestCentroid() 

# define model evaluation method 

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 

# evaluate model 

scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) 

# summarize result 

print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores))) 

 

Running the example evaluates the Nearest Shrunken Centroids algorithm on the synthetic dataset and reports the mean accuracy across the three repeats of 10-fold cross-validation. 

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure. Consider running the example a few times. 

In this case, we can see that the model achieved a mean accuracy of approximately 71%.  

Mean Accuracy: 0.711 (0.055) 

We may decide to use the Nearest Shrunken Centroids model as our final model and make predictions on new data. 

This can be done by fitting the model on all available data and calling the predict() function, passing in a new row of data. 

We can demonstrate this with the complete example listed below. 

 


# make a prediction with a nearest centroid model on the dataset 

from sklearn.datasets import make_classification 

from sklearn.neighbors import NearestCentroid 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) 

# define model 

model = NearestCentroid() 

# fit model 

model.fit(X, y) 

# define new data 

row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579] 

# make a prediction 

yhat = model.predict([row]) 

# summarize prediction 

print('Predicted Class: %d' % yhat) 

 

Running the example fits the model and makes a class label prediction for a new row of data. 

Predicted Class: 0 

Next, we can look at configuring the model hyperparameters. 

Tuning Nearest Centroid Hyperparameters 

The hyperparameters for the Nearest Shrunken Centroids method must be configured for your specific dataset. 

Perhaps the most important hyperparameter is the shrinkage, controlled via the “shrink_threshold” argument. It is a good idea to test values between 0 and 1 on a grid with a spacing such as 0.1 or 0.01. 

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined. 

# grid search shrinkage for nearest centroid 

from numpy import arange 

from sklearn.datasets import make_classification 

from sklearn.model_selection import GridSearchCV 

from sklearn.model_selection import RepeatedStratifiedKFold 

from sklearn.neighbors import NearestCentroid 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) 

# define model 

model = NearestCentroid() 

# define model evaluation method 

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 

# define grid 

grid = dict() 

grid['shrink_threshold'] = arange(0, 1.01, 0.01) 

# define search 

search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1) 

# perform the search 

results = search.fit(X, y) 

# summarize 

print('Mean Accuracy: %.3f' % results.best_score_) 

print('Config: %s' % results.best_params_) 

 

Running the example evaluates each combination of configurations using repeated cross-validation. 

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure. Try running the example a few times. 

In this case, we can see slightly better results than the default, with an accuracy of 71.4% versus 71.1%. We can see that the search selected a shrink_threshold value of 0.53. 

Mean Accuracy: 0.714 

Config: {'shrink_threshold': 0.53} 

 

The other key configuration is the distance measure used, which can be chosen based on the distribution of the input variables. 

Any of the built-in distance measures can be used, as listed here: 

  • metrics.pairwise.pairwise_distances API 

Typical distance measures include: 

  • 'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan' 

Given that our input variables are numeric, our dataset only supports 'euclidean' and 'manhattan'. 

We can add these metrics to our grid search; the complete example is listed below. 

# grid search shrinkage and distance metric for nearest centroid 

from numpy import arange 

from sklearn.datasets import make_classification 

from sklearn.model_selection import GridSearchCV 

from sklearn.model_selection import RepeatedStratifiedKFold 

from sklearn.neighbors import NearestCentroid 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) 

# define model 

model = NearestCentroid() 

# define model evaluation method 

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) 

# define grid 

grid = dict() 

grid['shrink_threshold'] = arange(0, 1.01, 0.01) 

grid['metric'] = ['euclidean', 'manhattan'] 

# define search 

search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1) 

# perform the search 

results = search.fit(X, y) 

# summarize 

print('Mean Accuracy: %.3f' % results.best_score_) 

print('Config: %s' % results.best_params_) 

 

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation. 

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure. Try running the example a few times.  

In this case, we can see that we achieve a slightly better accuracy of 75% using no shrinkage and the Manhattan distance measure instead of the Euclidean distance measure. 

Mean Accuracy: 0.750 

Config: {'metric': 'manhattan', 'shrink_threshold': 0.0} 

 

A good extension to these experiments would be to add data normalization or standardization as part of a modeling pipeline. 
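
As a rough sketch of that extension (under the same assumptions as the earlier examples), the scaler and model can be wrapped in a Pipeline and the grid search pointed at the pipeline's step parameters: 

# sketch: standardize inputs inside a pipeline while grid searching shrink_threshold
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# chain standardization and the model so scaling is learned only from the training folds
pipeline = Pipeline([('scale', StandardScaler()), ('model', NearestCentroid())])
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# pipeline parameters use the '<step>__<parameter>' convention; None means no shrinkage
grid = {'model__shrink_threshold': [None] + list(arange(0.01, 1.01, 0.01))}
# define and perform the search
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=cv, n_jobs=-1)
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)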

Further Reading 

This section provides additional resources on the topic if you are looking to go deeper. 

Papers 

Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression, 2002 

Books 

Section 12.6 Nearest Shrunken Centroids, Applied Predictive Modeling, 2013. 

APIs 

sklearn.neighbors.NearestCentroid API 

metrics.pairwise.pairwise_distances API 

Articles 

Nearest centroid classifier, Wikipedia 

Centroid, Wikipedia 

Conclusion 

In this guide, you discovered the Nearest Shrunken Centroids classification machine learning algorithm. 

Specifically, you learned: 

  • The Nearest Shrunken Centroids algorithm is a simple linear machine learning algorithm for classification. 
  • How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with scikit-learn. 
  • How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a provided dataset. 