Semi-supervised learning with label spreading

Semi-supervised learning refers to algorithms that attempt to make use of both labelled and unlabelled training data. 

This is unlike supervised learning algorithms, which are only able to learn from labelled training data. 

A widespread approach to semi-supervised learning is to construct a graph that connects instances in the training dataset and to propagate known labels through the edges of the graph in order to label unlabelled instances. An example of this approach is the label spreading algorithm for classification predictive modelling. 

In this guide, you will find out how to apply the label spreading algorithm to a semi-supervised learning classification dataset. 

After completing this guide, you will know: 

  • An intuition for how the label spreading semi-supervised learning algorithm works. 
  • How to develop a semi-supervised classification dataset and establish a baseline of performance with a supervised learning algorithm. 
  • How to develop and evaluate a label spreading model and use the model output to train a supervised learning algorithm. 

Tutorial Overview 

This tutorial is divided into three parts, which are: 

1. Label Spreading Algorithm 

2. Semi-Supervised Classification Dataset 

3. Label Spreading for Semi-Supervised Learning 

Label Spreading Algorithm 

Label spreading is a semi-supervised learning algorithm. 

The algorithm was introduced by Dengyong Zhou et al. in their 2003 paper titled “Learning with Local and Global Consistency”. 

The intuition behind the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points in the same structure or manifold in the input space should have the same label. 

Critical to semi-supervised learning problems is the prior assumption of consistency, which implies: nearby points are likely to have the same label, and points on the same structure (usually referred to as a cluster or a manifold) are likely to have the same label. 

Label spreading is inspired by a technique from experimental psychology called spreading activation networks. 

This algorithm can be comprehended intuitively in terms of spreading activation networks from experimental psychology. 

Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like in spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space. The approach is very similar to the label propagation algorithm for semi-supervised learning. 

A similar label propagation algorithm was given by Zhou et al.: at every step, a node i receives a contribution from its neighbours j (weighted by the normalized weight of the edge (i, j)), and an additional small contribution given by its initial value. 

After convergence, labels are applied based on the nodes that passed on the most information. 

Finally, the label of each unlabelled point is set to the class from which it has received the most information during the iteration process. 
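
To make this concrete, the following is a minimal NumPy sketch of the core update described in the paper, where the label matrix is repeatedly updated as F = alpha * S * F + (1 - alpha) * Y, with S the symmetrically normalized weight matrix. This is an illustration only and not the scikit-learn implementation: the RBF affinity, sigma, alpha, and the fixed iteration count are all assumed choices.

# a minimal sketch of the label spreading update from Zhou et al. (2003);
# the RBF affinity, sigma, alpha, and iteration count are assumed choices,
# and this is not the scikit-learn implementation
import numpy as np

def label_spreading_sketch(X, y, sigma=1.0, alpha=0.2, n_iter=100):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # affinity matrix from pairwise distances (RBF kernel, zero diagonal)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalization S = D^-1/2 W D^-1/2, much like spectral clustering
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # one-hot initial labels; unlabelled rows (marked -1) start as all zeros
    Y = np.zeros((len(X), y.max() + 1))
    Y[y >= 0, y[y >= 0]] = 1.0
    # each step mixes propagated information with the initial labels
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    # each point takes the class from which it received the most information
    return F.argmax(axis=1)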

Now that we are familiar with the label spreading algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset. 

Semi-Supervised Classification Dataset 

In this section, we will define a dataset for semi-supervised learning and establish a baseline of performance on the dataset. 

First, we can define a synthetic classification dataset using the make_classification() function. 

We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples. 

 

# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

 

Next, we will split the dataset into train and test datasets with an equal 50-50 split (i.e., 500 rows in each). 

 

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

 

Finally, we will split the training dataset in half again, into a portion that will have labels and a portion that we will pretend is unlabelled. 

 

# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

 

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below. 

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

 

Running the example prepares the dataset and then summarizes the shape of each of the three portions. 

The results confirm that we have a test dataset of 500 rows, a labelled training dataset of 250 rows, and 250 rows of unlabelled data. 

 


Labeled Train Set: (250, 2) (250,) 

Unlabeled Train Set: (250, 2) (250,) 

Test Set: (500, 2) (500,) 

 

A supervised learning algorithm will have only 250 rows from which to train a model. 

A semi-supervised learning algorithm will have the 250 labelled rows as well as the 250 unlabelled rows, which could be used in various ways to improve the labelled training dataset. 

Next, we can establish a baseline of performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labelled training data. 

This is important, as we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labelled data alone. If it does not, then the semi-supervised learning algorithm has no skill. 

In this case, we will use a logistic regression algorithm fit on the labelled portion of the training dataset. 

 

# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

 

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy. 

 

# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

 

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below: 

 

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

 

Running the example fits the model on the labelled training dataset and evaluates it on the holdout dataset, printing the classification accuracy. 

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome. 

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8%. 

We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this. 

Accuracy: 84.800 

Now, let’s look at how to apply the label spreading algorithm to the dataset. 

Label Spreading for Semi-Supervised Learning 

The label spreading algorithm is available in the scikit-learn Python ML library through the LabelSpreading class. 

The model can be fit just like any other classification model by calling the fit() function, and it can be used to make predictions for new data via the predict() function. 

# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

 

Importantly, the training dataset provided to the fit() function must include labelled examples that are ordinal encoded (as per normal) and unlabelled examples marked with a label of -1. 
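
For example, a label vector for five training rows in which the last two rows are unlabelled would look as follows (a hypothetical illustration):

# hypothetical mixed label vector: ordinal class labels plus -1 for unlabeled rows
y_mixed = [0, 1, 1, -1, -1]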

The model will then determine a label for the unlabelled instances as part of fitting the model. 

After the model is fit, the estimated labels for the labelled and unlabelled data in the training dataset are available via the “transduction_” attribute on the LabelSpreading class. 

 

# get labels for entire training dataset data
tran_labels = model.transduction_

 

Now that we are familiar with how to use the label spreading algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset. 

First, we must prepare the training dataset. 

We can concatenate the input data of the training dataset into a single array. 

# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

 

We can then create a list of -1 values (unlabelled) for each row in the unlabelled portion of the training dataset. 

 

# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

 

This list can then be concatenated with the labels from the labelled portion of the training dataset to correspond with the input array for the training dataset. 

 

# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

 

We can then train the LabelSpreading model on the entire training dataset. 

 

# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

 

Next, we can use the model to make predictions on the holdout dataset and evaluate it using classification accuracy. 

 

# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

 

Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below: 

 

# evaluate label spreading on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

 

Running the example fits the model on the entire training dataset and evaluates it on the holdout dataset, printing the classification accuracy. 

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome. 

In this case, we can see that the label spreading model achieves a classification accuracy of about 85.4%, which is slightly higher than a logistic regression fit only on the labelled training dataset, which achieved an accuracy of about 84.8%. 

Accuracy: 85.400 

So far so good. 

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model. 

Recall that we can retrieve the labels for the entire training dataset from the label spreading model as follows: 

 

# get labels for entire training dataset data
tran_labels = model.transduction_
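
Because we held back the true labels for the rows we pretended were unlabelled, we can also check how accurately the spreading step labelled them. This is an optional diagnostic, assuming the variables from the previous complete example; the transduced labels follow the same row order as X_train_mixed.

# optional diagnostic: compare the transduced labels for the pretend-unlabeled
# rows against the held-back true labels (assumes the variables defined above)
n_lab = len(y_train_lab)
spread_score = accuracy_score(y_test_unlab, tran_labels[n_lab:])
print('Transduction Accuracy: %.3f' % (spread_score * 100))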

 

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model. 

The hope is that the supervised learning model fit on the entire training dataset will achieve even better performance than the semi-supervised learning model alone. 

 

# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

 

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below: 

 

# evaluate logistic regression fit on label spreading for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset data
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

 

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels and evaluates it on the holdout dataset, printing the classification accuracy. 

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome. 

In this case, we can see that this hierarchical approach of a semi-supervised model followed by a supervised model achieves a classification accuracy of about 85.8% on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone, which achieved an accuracy of about 85.4%. 

Accuracy: 85.800 

Can you achieve better results by tuning the hyperparameters of the LabelSpreading model? See the sketch below for a starting point. 
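
For instance, a minimal sketch of trying a few values of the “alpha” hyperparameter, reusing the arrays from the complete example above, might look as follows; the candidate values are illustrative only.

# sketch of tuning the LabelSpreading alpha hyperparameter; the candidate
# values are illustrative and the arrays come from the complete example above
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import accuracy_score
for alpha in [0.01, 0.1, 0.2, 0.5, 0.9]:
    model = LabelSpreading(alpha=alpha)
    model.fit(X_train_mixed, y_train_mixed)
    yhat = model.predict(X_test)
    print('alpha=%.2f Accuracy: %.3f' % (alpha, accuracy_score(y_test, yhat) * 100))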

Further Reading 

This section furnishes additional resources on the subject if you are seeking to delve deeper. 

Books 

Introduction to Semi-Supervised Learning, 2009. 

Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006. 

Papers 

Learning with Local and Global Consistency, 2003. 

APIs 

sklearn.semi_supervised.LabelSpreading API 

Section 1.14 Semi-Supervised, Scikit-Learn User Guide 

sklearn.model_selection.train_test_split API 

sklearn.linear_model.LogisticRegression API 

sklearn.datasets.make_classification API 

Articles 

Semi-supervised learning, Wikipedia 

Conclusion 

In this guide, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset. 

Specifically, you learned: 

  • An intuition for how the label spreading semi-supervised learning algorithm works. 
  • How to develop a semi-supervised classification dataset and establish a baseline of performance with a supervised learning algorithm. 
  • How to develop and evaluate a label spreading model and use the model output to train a supervised learning algorithm. 