Semi-supervised learning with label propagation
Semi-supervised learning refers to algorithms that attempt to leverage both labelled and unlabelled training data.
This sets them apart from supervised learning algorithms, which are only able to learn from labelled training data.
A widespread strategy in semi-supervised learning is to build a graph that connects the instances in the training dataset and then propagate known labels through the edges of the graph to label the unlabelled instances. An example of this strategy is the label propagation algorithm for classification predictive modelling.
In this guide, you will find out how to apply the label propagation algorithm to a semi-supervised learning classification dataset.
After going through this guide, you will be aware of:
- An intuition for how the label propagation semi-supervised learning algorithm functions.
- How to produce a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and assess a label propagation algorithm and leverage the model output to train a supervised learning algorithm.
Tutorial Overview
This guide is divided into three parts; they are:
1. Label Propagation Algorithm
2. Semi-Supervised Classification Dataset
3. Label Propagation for Semi-Supervised Learning
Label Propagation Algorithm
Label propagation is a semi-supervised learning algorithm.
The algorithm was proposed in a 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani entitled “Learning from Labeled and Unlabeled Data with Label Propagation.”
The intuition for the algorithm is that a graph is created connecting all instances (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then take on soft labels, or a label distribution, based on the labels or label distributions of instances connected nearby in the graph.
Several semi-supervised learning algorithms rely on the geometry of the data induced by both labelled and unlabelled instances to improve on supervised methods that use only the labelled data. This geometry can be naturally represented by an empirical graph g = (V, E), where the nodes V = {1, …, n} represent the training data and the edges E represent similarities among them.
Propagation refers to the iterative nature by which labels are assigned to nodes in the graph and propagated along the edges of the graph to connected nodes.
This process is at times referred to as label propagation, as it “propagates” labels from the labelled vertices (which are fixed) gradually through the edges to all the unlabelled vertices.
The procedure is repeated for a fixed number of iterations to strengthen the labels assigned to the unlabelled examples.
Beginning with nodes 1, 2, …, l labelled with their known label (1 or -1) and nodes l + 1, …, n labelled with 0, every node begins to propagate its label to its neighbours, and the procedure is repeated until convergence.
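To make this iterative process concrete, below is a minimal sketch of the propagation step in NumPy. This is not the scikit-learn implementation; the fully connected RBF similarity graph, the toy data, and the fixed iteration count are all illustrative assumptions.
# illustrative sketch of label propagation (not the scikit-learn implementation)
import numpy as np
# toy dataset: 4 points, first two labelled (+1 and -1), last two unlabelled (0)
X = np.array([[0.0, 0.0], [3.0, 3.0], [0.5, 0.5], [2.5, 2.5]])
labels = np.array([1.0, -1.0, 0.0, 0.0])
# build a fully connected similarity graph with an RBF kernel on Euclidean distance
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
W = np.exp(-dist ** 2)
# row-normalize so each node averages over its neighbours
T = W / W.sum(axis=1, keepdims=True)
# propagate labels along the edges, clamping the known labels after each pass
F = labels.copy()
for _ in range(100):
    F = T @ F           # each node takes a weighted average of its neighbours' labels
    F[:2] = labels[:2]  # the labelled nodes are fixed
# unlabelled nodes inherit the sign of nearby labelled nodes
print(np.sign(F))
Running this sketch, the two unlabelled points end up with the labels of the labelled points they sit closest to in the graph.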
Now that we are acquainted with the Label Propagation algorithm, let’s look at how we might leverage it on a project. To start with, we must define a semi-supervised classification dataset.
Semi-Supervised Classification Dataset
In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.
To start with, we can define a synthetic classification dataset leveraging the make_classification() function.
We will define the dataset with two classes (binary classification), two input variables, and 1,000 instances.
...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
Then, we will split the dataset into train and test datasets with an equal 50-50 split, i.e., 500 rows in each.
...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
Lastly, we will split the training dataset in half again, into a portion that will have labels and a portion that we will pretend is unlabelled.
...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
Tying all of this together, the complete example of preparing the semi-supervised learning dataset is listed below.
# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)
Running the example prepares the dataset and then summarizes the shape of each of the three portions.
The results confirm that we have a test dataset of 500 rows, a labelled training dataset of 250 rows, and 250 rows of unlabelled data.
Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)
A supervised learning algorithm will only have 250 rows from which to train a model.
A semi-supervised learning algorithm will have the 250 labelled rows as well as the 250 unlabelled rows, which could be used in various ways to enhance the labelled training dataset.
Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labelled training data.
This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labelled data alone. If it does not, then the semi-supervised learning algorithm has no skill.
In this scenario, we will leverage a logistic regression algorithm fitted on the labelled portion of the training dataset.
...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.
...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.
# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
Running the example fits the model on the labelled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Try running the example a few times and compare the average outcome.
In this case, we can see that the algorithm achieved a classification accuracy of approximately 84.8 percent. We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.
Accuracy: 84.800
Next, let’s explore how to apply the label propagation algorithm to the dataset.
Label Propagation for Semi-Supervised Learning
The Label Propagation algorithm is available in the scikit-learn Python machine learning library through the LabelPropagation class.
The model can be fit like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.
...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)
Critically, the training dataset provided to the fit() function must include labelled instances that are integer encoded (as normal) and unlabelled instances marked with a label of -1.
The model will then infer labels for the unlabelled instances as part of fitting the model.
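For example, a mixed label vector for a hypothetical six-row training set might look as follows, with the first three rows labelled and the last three marked as unlabelled (the values here are purely illustrative):
...
# hypothetical mixed labels: integer class labels, then -1 for unlabelled rows
y_mixed = [0, 1, 0, -1, -1, -1]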
After the model is fit, the estimated labels for the labelled and unlabelled data in the training dataset are available via the “transduction_” attribute on the LabelPropagation class.
...
# get labels for entire training dataset
tran_labels = model.transduction_
Now that we are familiar with how to use the label propagation algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.
To start with, we must prepare the training dataset.
We can concatenate the input data of the training dataset into a single array.
...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
We can then create a list of -1 values (unlabelled), one for every row in the unlabelled portion of the training dataset.
...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
This list can then be concatenated with the labels from the labelled portion of the training dataset to match the input array for the training dataset.
...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
We can now train the LabelPropagation model on the entire training dataset.
...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.
...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.
# evaluate label propagation on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Try running the example a few times and compare the average outcome.
In this case, we can see that the label propagation model achieves a classification accuracy of approximately 85.6 percent, which is slightly higher than the logistic regression fit only on the labelled training dataset, which achieved an accuracy of approximately 84.8 percent.
Accuracy: 85.600
So far, so good.
Another strategy we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model on them.
Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:
...
# get labels for entire training dataset
tran_labels = model.transduction_
We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.
The hope is that the supervised learning model fit on the entire training dataset will achieve even better performance than the semi-supervised learning model alone.
...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.
# evaluate logistic regression fit on label propagation for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))
Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels, evaluates it on the holdout dataset, and prints the classification accuracy.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Try running the example a few times and compare the average outcome.
In this case, we can see that this hierarchical strategy of a semi-supervised model followed by a supervised model achieves a classification accuracy of approximately 86.2 percent on the holdout dataset, even better than semi-supervised learning used alone, which achieved an accuracy of approximately 85.6 percent.
Accuracy: 86.200
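The result above uses the default LabelPropagation configuration, which builds the graph with an RBF kernel; the class also supports a ‘knn’ kernel via its kernel and n_neighbors arguments. As a sketch, the snippet below tries a few neighbourhood sizes on the same dataset; the specific values tried are illustrative assumptions, not tuned recommendations.
# illustrative sketch: vary the LabelPropagation kernel configuration
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# prepare the same semi-supervised dataset as above
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
y_train_mixed = concatenate((y_train_lab, [-1 for _ in range(len(y_test_unlab))]))
# try the 'knn' kernel with a few arbitrary neighbourhood sizes
for k in [3, 7, 15]:
    model = LabelPropagation(kernel='knn', n_neighbors=k)
    model.fit(X_train_mixed, y_train_mixed)
    yhat = model.predict(X_test)
    print('n_neighbors=%d, Accuracy: %.3f' % (k, accuracy_score(y_test, yhat) * 100))
Whether either kernel helps depends on the dataset, so a configuration like this would normally be selected with a proper tuning procedure rather than a single holdout comparison.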
Further Reading
This section provides additional resources on the subject if you are looking to delve deeper.
Books
Introduction to Semi-Supervised Learning, 2009.
Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.
Papers
Learning from Labeled and Unlabeled Data with Label Propagation, 2002.
APIs
sklearn.semi_supervised.LabelPropagation API.
Section 1.14. Semi-supervised, Scikit-Learn User Guide.
sklearn.model_selection.train_test_split API.
sklearn.linear_model.LogisticRegression API.
sklearn.datasets.make_classification API.
Articles
Semi-supervised learning, Wikipedia.
Conclusion
In this guide, you found out how to apply the label propagation algorithm to a semi-supervised learning classification dataset.
Specifically, you learned:
- An intuition for how the label propagation semi-supervised learning algorithm operates.
- How to produce a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to produce and assess a label propagation algorithm and leverage the model output to train a supervised learning algorithm.