
Semi-supervised learning with label propagation

Semi-supervised learning refers to algorithms that attempt to make use of both labelled and unlabelled training data.

This sets them apart from supervised learning algorithms, which are only able to learn from labelled training data.

A common approach to semi-supervised learning is to build a graph that connects the examples in the training dataset and to propagate known labels along the edges of the graph in order to label the unlabelled examples. One example of this approach is the label propagation algorithm for classification predictive modelling.

In this guide, you will find out how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

After going through this guide, you will know:

  • An intuition for how the label propagation semi-supervised learning algorithm works.
  • How to create a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Tutorial Overview

This guide is divided into three parts; they are:

1. Label Propagation Algorithm

2. Semi-Supervised Classification Dataset

3. Label Propagation for Semi-Supervised Learning

Label Propagation Algorithm

Label propagation is a semi-supervised learning algorithm. 

The algorithm was introduced in the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled “Learning from Labeled and Unlabeled Data with Label Propagation.”

The intuition for the algorithm is that a graph is created connecting all of the examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then take on soft labels, or label distributions, based on the labels or label distributions of the examples connected nearby in the graph.

Many semi-supervised learning algorithms rely on the geometry of the data induced by both labelled and unlabelled examples to improve on supervised methods that use only the labelled data. This geometry can be naturally represented by an empirical graph g = (V, E), where the nodes V = {1, …, n} represent the training data and the edges E represent similarities between them.

Propagation refers to the iterative manner in which labels are assigned to the nodes of the graph and propagated along the edges to connected nodes.

This process is at times referred to as label propagation, as it “propagates” labels from the labelled vertices (which are fixed) gradually through the edges to all the unlabelled vertices. 

The process is repeated for a fixed number of iterations to strengthen the labels assigned to the unlabelled examples.

Starting with nodes 1, 2, …, l labelled with their known label (1 or -1) and nodes l + 1, …, n labelled with 0, each node starts to propagate its label to its neighbours, and the process is repeated until convergence.
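
To make the mechanics concrete, here is a minimal NumPy sketch of this iteration, assuming an RBF (Gaussian) affinity as the similarity measure; the function name, the gamma value, and the iteration count are illustrative choices rather than details taken from the original report.

# minimal sketch of iterative label propagation (illustrative, not the scikit-learn internals)
import numpy as np

def propagate_labels(X, y, n_iter=100, gamma=1.0):
    # y holds 1/-1 for labelled rows and 0 for unlabelled rows, as in the quote above
    # build a dense RBF (Gaussian) affinity matrix from pairwise squared distances
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dists)
    np.fill_diagonal(W, 0.0)
    # row-normalize so each node averages over its neighbours
    T = W / W.sum(axis=1, keepdims=True)
    f = y.astype(float)
    labelled = y != 0
    for _ in range(n_iter):
        f = T @ f                  # each node takes the weighted average of its neighbours
        f[labelled] = y[labelled]  # clamp the known labels (they are fixed)
    return np.sign(f)              # hard 1/-1 labels; nodes never reached stay at 0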

Now that we are familiar with the label propagation algorithm, let's look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on it.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples.

 


 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) 

 

Next, we will split the dataset into train and test datasets with an equal 50-50 split, i.e. 500 rows in each.

 


 

# split into train and test 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) 

 

Finally, we will split the training dataset in half again, into a portion that will have labels and a portion that we will pretend is unlabelled.


 

# split train into labeled and unlabeled 

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

 

Tying this all together, the complete example of preparing the semi-supervised learning dataset is listed below.

 


# prepare semi-supervised learning dataset 

from sklearn.datasets import make_classification 

from sklearn.model_selection import train_test_split 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) 

# split into train and test 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) 

# split train into labeled and unlabeled 

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) 

# summarize training set size 

print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)

print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)

# summarize test set size

print('Test Set:', X_test.shape, y_test.shape)

 

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labelled training dataset of 250 rows, and 250 rows of unlabelled data.


Labeled Train Set: (250, 2) (250,) 

Unlabeled Train Set: (250, 2) (250,) 

Test Set: (500, 2) (500,) 

 

A supervised learning algorithm will have only 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labelled rows as well as the 250 unlabelled rows, which could be used in various ways to augment the labelled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labelled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labelled data alone. If it does not, the semi-supervised learning algorithm has no skill.

In this case, we will use a logistic regression algorithm fit on the labelled portion of the training dataset.


 

# define model 

model = LogisticRegression() 

# fit model on labeled dataset 

model.fit(X_train_lab, y_train_lab) 

 

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.

 


 

# make predictions on hold out test set 

yhat = model.predict(X_test) 

# calculate score for test set 

score = accuracy_score(y_test, yhat) 

# summarize score 

print('Accuracy: %.3f' % (score*100))

 

Tying this all together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

 


# baseline performance on the semi-supervised learning dataset 

from sklearn.datasets import make_classification 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import accuracy_score 

from sklearn.linear_model import LogisticRegression 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) 

# split into train and test 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) 

# split train into labeled and unlabeled 

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) 

# define model 

model = LogisticRegression() 

# fit model on labeled dataset 

model.fit(X_train_lab, y_train_lab) 

# make predictions on hold out test set 

yhat = model.predict(X_test) 

# calculate score for test set 

score = accuracy_score(y_test, yhat) 

# summarize score 

print('Accuracy: %.3f' % (score*100))

 

Running the example fits the model on the labelled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of approximately 84.8 percent. We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.

Accuracy: 84.800 
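
As a quick way to gauge that variance, the baseline evaluation can be repeated over several random splits and the scores averaged. The sketch below assumes the imports and the X, y arrays from the complete example above; the ten seed values are arbitrary choices.

# repeat the baseline evaluation over several random splits (seed values are arbitrary)
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.50, random_state=seed, stratify=y)
    X_tr_lab, _, y_tr_lab, _ = train_test_split(X_tr, y_tr, test_size=0.50, random_state=seed, stratify=y_tr)
    model = LogisticRegression().fit(X_tr_lab, y_tr_lab)
    scores.append(accuracy_score(y_te, model.predict(X_te)))
print('Mean accuracy: %.3f' % (100 * sum(scores) / len(scores)))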

Next, let's explore how to apply the label propagation algorithm to the dataset.

Label Propagation for Semi-Supervised Learning

The Label Propagation algorithm is available in the scikit-learn Python machine learning library through the LabelPropagation class. 
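
By default, the class builds its similarity graph with an RBF kernel. The graph construction can be adjusted when the model is defined; as a hedged illustration (the values below are arbitrary rather than tuned, and LabelPropagation is assumed imported from sklearn.semi_supervised), a k-nearest-neighbour graph could be requested instead:

# illustrative configuration: a k-nearest-neighbour graph instead of the default RBF kernel
model = LabelPropagation(kernel='knn', n_neighbors=7, max_iter=1000)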

The model can be fit like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.


 

# define model 

model = LabelPropagation() 

# fit model on training dataset 

model.fit(…, …) 

# make predictions on hold out test set 

yhat = model.predict(…) 

 

Importantly, the training dataset provided to the fit() function must include labelled examples that are integer encoded (as per normal) and unlabelled examples marked with a label of -1.

The model will then infer a label for the unlabelled examples as part of fitting the model.
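
For example, a mixed label vector for six training rows in which the last two rows are unlabelled might look as follows (the values are purely illustrative):

# 0 and 1 are known class labels; -1 marks the unlabelled rows
y_mixed = [0, 1, 1, 0, -1, -1]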

After the model is fit, the estimated labels for both the labelled and unlabelled data in the training dataset are available via the “transduction_” attribute on the LabelPropagation class.

 


 

# get labels for entire training dataset

tran_labels = model.transduction_ 

 

Now that we are familiar with how to use the label propagation algorithm in scikit-learn, let's look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

 


 

# create the training dataset input 

X_train_mixed = concatenate((X_train_lab, X_test_unlab)) 

 

We can then create a list of -1 values (no label) for each row in the unlabelled portion of the training dataset.


 

# create "no label" for unlabeled data

nolabel = [-1 for _ in range(len(y_test_unlab))] 

 

This list can then be concatenated with the labels from the labelled portion of the training dataset to correspond with the input array for the training dataset.

 


 

# recombine training dataset labels 

y_train_mixed = concatenate((y_train_lab, nolabel)) 

 

We can now train the LabelPropagation model on the entire training dataset.


 

# define model 

model = LabelPropagation() 

# fit model on training dataset 

model.fit(X_train_mixed, y_train_mixed) 

 

Next, we can use the model to make predictions on the holdout dataset and evaluate it using classification accuracy.

 


 

# make predictions on hold out test set 

yhat = model.predict(X_test) 

# calculate score for test set 

score = accuracy_score(y_test, yhat) 

# summarize score 

print('Accuracy: %.3f' % (score*100))

 

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

 


# evaluate label propagation on the semi-supervised learning dataset 

from numpy import concatenate 

from sklearn.datasets import make_classification 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import accuracy_score 

from sklearn.semi_supervised import LabelPropagation 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) 

# split into train and test 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) 

# split train into labeled and unlabeled 

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) 

# create the training dataset input 

X_train_mixed = concatenate((X_train_lab, X_test_unlab)) 

# create "no label" for unlabeled data

nolabel = [-1 for _ in range(len(y_test_unlab))] 

# recombine training dataset labels 

y_train_mixed = concatenate((y_train_lab, nolabel)) 

# define model 

model = LabelPropagation() 

# fit model on training dataset 

model.fit(X_train_mixed, y_train_mixed) 

# make predictions on hold out test set 

yhat = model.predict(X_test) 

# calculate score for test set 

score = accuracy_score(y_test, yhat) 

# summarize score 

print('Accuracy: %.3f' % (score*100))

 

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the label propagation model achieves a classification accuracy of approximately 85.6 percent, slightly higher than the logistic regression fit only on the labelled training dataset, which achieved an accuracy of approximately 84.8 percent.

Accuracy: 85.600 

So far, so good. 

Another approach we can take with the semi-supervised model is to use the estimated labels for the training dataset to fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:

 


 

# get labels for entire training dataset

tran_labels = model.transduction_ 

 

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset will achieve even better performance than the semi-supervised learning model alone.

 


 

# define supervised learning model 

model2 = LogisticRegression() 

# fit supervised learning model on entire training dataset 

model2.fit(X_train_mixed, tran_labels) 

# make predictions on hold out test set 

yhat = model2.predict(X_test) 

# calculate score for test set 

score = accuracy_score(y_test, yhat) 

# summarize score 

print('Accuracy: %.3f' % (score*100))

 

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

 


# evaluate logistic regression fit on label propagation for semi-supervised learning 

from numpy import concatenate 

from sklearn.datasets import make_classification 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import accuracy_score 

from sklearn.semi_supervised import LabelPropagation 

from sklearn.linear_model import LogisticRegression 

# define dataset 

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) 

# split into train and test 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) 

# split train into labeled and unlabeled 

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) 

# create the training dataset input 

X_train_mixed = concatenate((X_train_lab, X_test_unlab)) 

# create "no label" for unlabeled data

nolabel = [-1 for _ in range(len(y_test_unlab))] 

# recombine training dataset labels 

y_train_mixed = concatenate((y_train_lab, nolabel)) 

# define model 

model = LabelPropagation() 

# fit model on training dataset 

model.fit(X_train_mixed, y_train_mixed) 

# get labels for entire training dataset

tran_labels = model.transduction_ 

# define supervised learning model 

model2 = LogisticRegression() 

# fit supervised learning model on entire training dataset 

model2.fit(X_train_mixed, tran_labels) 

# make predictions on hold out test set 

yhat = model2.predict(X_test) 

# calculate score for test set 

score = accuracy_score(y_test, yhat) 

# summarize score 

print('Accuracy: %.3f' % (score*100))

 

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels, evaluates it on the holdout dataset, and prints the classification accuracy.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that this hierarchical approach of the semi-supervised model followed by a supervised model achieves a classification accuracy of approximately 86.2 percent on the holdout dataset, even better than the semi-supervised learning model used alone, which achieved an accuracy of approximately 85.6 percent.

Accuracy: 86.200 

Further Reading 

This section provides additional resources on the topic if you are looking to go deeper.

Books 

Introduction to Semi-Supervised Learning, 2009.

Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.

Papers 

Learning from Labeled and Unlabeled Data with Label Propagation, 2002.

APIs 

sklearn.semi_supervised.LabelPropagation API. 

Section 1.14. Semi-supervised, scikit-learn User Guide.

sklearn.model_selection.train_test_split API.

sklearn.linear_model.LogisticRegression API.

sklearn.datasets.make_classification API.

Articles 

Semi-supervised learning, Wikipedia.

Conclusion 

In this guide, you discovered how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

  • An intuition for how the label propagation semi-supervised learning algorithm works.
  • How to create a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.