Semi-supervised learning with label spreading
Semi-supervised learning refers to algorithms that attempt to leverage both labelled and unlabelled training data. This sets them apart from supervised learning algorithms, which can only learn from labelled training data.
A widespread approach to semi-supervised learning is to build a graph that connects the instances in the training dataset and then propagate known labels along the edges of the graph to label the unlabelled instances. One instance of this strategy is the label spreading algorithm for classification predictive modelling.
In this guide, you will find out how to apply the label spreading algorithm to a semi-supervised learning classification dataset.
After going through this guide, you will know:
- An intuition for how the label spreading semi-supervised learning algorithm operates.
- How to create a semi-supervised classification dataset and establish a performance baseline with a supervised learning algorithm.
- How to develop and evaluate a label spreading model and use its output to train a supervised learning algorithm.
Tutorial Overview
This tutorial is divided into three parts:
1. Label Spreading Algorithm
2. Semi-Supervised Classification Dataset
3. Label Spreading for Semi-Supervised Learning
Label Spreading Algorithm
Label spreading is a semi-supervised learning algorithm.
The algorithm was put forth by Dengyong Zhou, et al. in their 2003 paper entitled “Learning With Local and Global Consistency.”
The intuition behind the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points on the same structure or manifold in the input space should have the same label.
Critical to semi-supervised learning problems is the prior assumption of consistency, which implies: nearby points are likely to have the same label, and points on the same structure (usually referred to as a cluster or a manifold) are likely to have the same label.
Label spreading draws inspiration from a technique in experimental psychology referred to as spreading activation networks.
This algorithm can be comprehended intuitively in terms of spreading activation networks from experimental psychology.
Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like in spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space. The approach is very similar to label propagation for semi-supervised learning.
Another similar label propagation algorithm was given by Zhou et al.: at every step, a node i receives a contribution from its neighbours j (weighted by the normalized weight of the edge (i, j)), and an additional small contribution given by its initial value.
After convergence, labels are assigned based on the nodes that passed on the most information.
Lastly, the label of every unlabelled point is set to the class from which it received the most information during the iteration procedure.
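To make the iteration concrete, below is a minimal NumPy sketch of the update described in the paper, F <- alpha * S * F + (1 - alpha) * Y, where S is the symmetrically normalized affinity matrix and Y holds the one-hot initial labels. The RBF affinity, the gamma value, and the fixed iteration count are assumptions made for illustration; the LabelSpreading class used later in this tutorial provides a complete, tested implementation.

```python
# a minimal sketch of the label spreading iteration from Zhou et al. (2003);
# the RBF affinity, gamma, and iteration count are illustrative assumptions
import numpy as np

def label_spreading(X, y, alpha=0.2, gamma=20.0, n_iter=30):
    # y holds class indices 0..k-1 for labelled points and -1 for unlabelled points
    n_classes = y.max() + 1
    # affinity matrix W with a zero diagonal (no self-loops)
    dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * dists)
    np.fill_diagonal(W, 0.0)
    # symmetric normalization S = D^-1/2 W D^-1/2, much like spectral clustering
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # initial label matrix Y: one-hot rows for labelled points, zeros elsewhere
    Y = np.zeros((len(y), n_classes))
    Y[y >= 0, y[y >= 0]] = 1.0
    # spread labels: each node receives contributions from its neighbours
    # plus a small contribution from its initial value
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    # assign each point the class from which it received the most information
    return F.argmax(axis=1)
```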
Now that we are acquainted with the label spreading algorithm, let’s observe how we might leverage it on a project. To start with, we must define a semi-supervised classification dataset.
Semi-Supervised Classification Dataset
In this portion of the blog, we will define a dataset for semi-supervised learning and establish a performance baseline on the dataset.
To start with, we can define a synthetic classification dataset using the make_classification() function.
We will define the dataset with two classes (binary classification), two input variables, and 1,000 instances.
```python
...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
```
Then, we will split the dataset into train and test datasets with an equal 50-50 split (i.e., 500 rows in each).
```python
...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
```
Lastly, we will split the training dataset in half again, into a portion that will have labels and a portion that we will pretend is unlabelled.
```python
...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
```
Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.
```python
# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)
```
Running the example prepares the dataset and then summarizes the shapes of the three portions.
The results confirm that we have a test dataset of 500 rows, a labelled training dataset of 250 rows, and 250 rows of unlabelled data.
```
Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)
```
A supervised learning algorithm will have only 250 rows from which to train a model.
A semi-supervised learning algorithm will have the 250 labelled rows as well as the 250 unlabelled rows, which could be leveraged in various ways to enhance the labelled training dataset.
Next, we can establish a performance baseline on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labelled training data.
This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labelled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.
In this scenario, we will leverage a logistic regression algorithm fitted on the labelled portion of the training dataset.
```python
...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
```
The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.
```python
...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
```
Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below:
```python
# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
```
Running the example fits the model on the labelled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the algorithm achieved a classification accuracy of approximately 84.8%.
We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.
```
Accuracy: 84.800
```
Now, let’s look into how to go about applying the label spreading algorithm to the dataset.
Label Spreading for Semi-Supervised Learning
The label spreading algorithm is available in the scikit-learn Python ML library through the LabelSpreading class.
The model can be fit just like any other classification model by calling the fit() function, and used to make predictions for new data via the predict() function.
```python
...
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)
```
Critically, the training dataset provided to the fit() function must include labelled instances that are ordinal encoded (as normal) and unlabelled instances marked with a label of -1.
The model will then determine a label for the unlabelled instances as part of fitting the model.
After the model is fit, the estimated labels for the labelled and unlabelled data in the training dataset are available via the “transduction_” attribute on the LabelSpreading class.
```python
...
# get labels for entire training dataset
tran_labels = model.transduction_
```
Now that we are acquainted with how to leverage the label spreading algorithm within scikit-learn, let’s observe how we might apply it to our semi-supervised learning dataset.
To start with, we must prep the training dataset.
We can concatenate the input data of the training dataset into a single array.
```python
...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
```
We can then create a list of -1 values (marking “no label”) for every row in the unlabelled portion of the training dataset.
```python
...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
```
This list can then be concatenated with the labels from the labelled portion of the training dataset to match the input array for the training dataset.
```python
...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
```
We can then train the LabelSpreading model on the entire training dataset.
```python
...
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
```
Then, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.
```python
...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
```
Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below:
```python
# evaluate label spreading on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
```
Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the label spreading model achieves a classification accuracy of approximately 85.4%, which is slightly higher than the logistic regression fit only on the labelled training dataset, which achieved an accuracy of approximately 84.8%.
```
Accuracy: 85.400
```
So far so good.
Another strategy we can use with the semi-supervised model is to take the estimated labels for the training dataset and use them to fit a supervised learning model.
Remember that we can retrieve the labels for the entire training dataset from the label spreading model as follows.
```python
...
# get labels for entire training dataset
tran_labels = model.transduction_
```
We can then leverage these labels, combined with all of the input data, to train and assess a supervised learning algorithm, like a logistic regression model.
The hope is that the supervised learning model fit on the entire training dataset would achieve better performance than the semi-supervised learning model alone.
```python
...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
```
Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below:
```python
# evaluate logistic regression fit on label spreading for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
```
Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels, evaluates it on the holdout dataset, and prints the classification accuracy.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that this hierarchical approach of a semi-supervised model followed by a supervised model achieves a classification accuracy of approximately 85.8% on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone, which achieved an accuracy of approximately 85.4%.
```
Accuracy: 85.800
```
Can you achieve better results by tuning the hyperparameters of the LabelSpreading model? A starting point is sketched below.
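The sketch below loops over a few candidate values of the model’s alpha hyperparameter (the clamping factor that balances neighbour information against each point’s initial label) and reports the holdout accuracy for each. The candidate values are assumptions for illustration, and other hyperparameters such as kernel, gamma, and n_neighbors could be searched in the same way; it assumes the X_train_mixed, y_train_mixed, X_test, and y_test arrays prepared earlier in this tutorial.

```python
# sketch: compare a few alpha values for LabelSpreading on the holdout set
# (the candidate values are illustrative, not a recommended grid)
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import accuracy_score

for alpha in [0.01, 0.1, 0.2, 0.5, 0.9]:
    # define and fit the model with this alpha
    model = LabelSpreading(alpha=alpha)
    model.fit(X_train_mixed, y_train_mixed)
    # evaluate on the holdout test set
    yhat = model.predict(X_test)
    print('alpha=%.2f Accuracy: %.3f' % (alpha, accuracy_score(y_test, yhat) * 100))
```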
Further Reading
This section furnishes additional resources on the subject if you are seeking to delve deeper.
Books
Introduction to Semi-Supervised Learning, 2009.
Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.
Papers
Learning with Local and Global Consistency, 2003.
APIs
sklearn.semi_supervised.LabelSpreading API
Section 1.14 Semi-Supervised, Scikit-Learn User Guide
sklearn.model_selection.train_test_split API
sklearn.linear_model.LogsiticRegression API
sklearn.datasets.make_classification API
Articles
Semi-supervised learning, Wikipedia
Conclusion
In this guide, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset.
Specifically, you learned:
- An intuition for how the label spreading semi-supervised learning algorithm operates.
- How to create a semi-supervised classification dataset and establish a performance baseline with a supervised learning algorithm.
- How to develop and evaluate a label spreading model and use its output to train a supervised learning algorithm.