Bayesian Semi-supervised learning under nonparanormality

01/11/2020 ∙ by Rui Zhu, et al. ∙ NC State University

Semi-supervised learning is a classification method which makes use of both labeled data and unlabeled data for training. In this paper, we propose a semi-supervised learning algorithm using a Bayesian semi-supervised model. We make a general assumption that, after the same unknown transformation, the observations follow one of two multivariate normal distributions depending on their true labels. We use B-splines to put a prior on the transformation function for each component. To use unlabeled data in a semi-supervised setting, we assume the labels are missing at random. Under these assumptions the posterior distributions can be characterized, and we compute them by Gibbs sampling. The proposed method is then compared with several other available methods through an extensive simulation study. Finally, we apply the proposed method in real data contexts for diagnosing breast cancer and classifying radar returns. We conclude that the proposed method has better prediction accuracy in a wide variety of cases.




1 Introduction

Semi-supervised learning is a classification method which makes use of both labeled data and unlabeled data for training. There has been a growing interest in semi-supervised learning in recent years. Traditional classification methods are supervised in nature: they use only labeled data for training. To train a traditional classifier, we have to obtain the true class information for each unit in the dataset, along with the measurements associated with it. However, labels are often difficult to obtain. Labeling can be expensive and time consuming, and it requires human labor and expertise. For example, obtaining labels can be a hard problem in a fraud detection system. One way to identify fraudulent transactions or account takeovers is through client reports. However, clients are unlikely to notice every criminal act, so a large amount of fraud could go undetected if the labeling process relied only on client reports. The only way to obtain a thorough and reliable label for every instance is through human evaluation, meaning that a verification team makes a judgement based on the available information, which costs both time and human labor. On the other hand, unlabeled data are in most cases abundant and easy to obtain. Unsupervised learning methods like clustering provide a way to make use of unlabeled data. However, they may not be appropriate or useful in a classification problem. Semi-supervised learning seems to be the best solution for these cases. When only a few units in a dataset have labels, a semi-supervised learning technique can use the large number of unlabeled units in classification instead of discarding them, which can greatly improve the performance of the classifier. In summary, semi-supervised learning lies in between supervised learning and unsupervised learning, and the goal is to make the most efficient use of a dataset that contains both labeled and unlabeled observations.

Semi-supervised learning methods can be divided into two broad classes: generative methods and discriminative methods. Generative methods are mainly based on the Expectation-Maximization (EM) algorithm (dempster1977maximum), and they have to make some assumptions on the underlying distributions of the different classes. Discriminative methods, on the other hand, only learn the boundary between classes. In the recent literature, discriminative methods have been explored much more than generative methods. Self-Training (yarowsky1995unsupervised) is the most widely used semi-supervised learning method. The basic idea is that the chosen classifier assigns labels to unlabeled data itself and then learns from them iteratively. Co-Training (blum1998combining) is conducted by splitting the features into two subsets. The idea is that a classifier trained on each subset can teach the other the labels it is confident of. Another commonly used algorithm is the Semi-Supervised Support Vector Machine. Its goal is to find a linear boundary which has the maximum margin for both labeled and unlabeled data. This is an NP-hard problem, and many algorithms have been proposed to solve it. In our simulations, we compare with several different SVM algorithms, namely those of belkin2006manifold, li2013convex and sindhwani2006large. The method is based on the assumption that the boundary will not cut through dense regions. Other methods like the Gaussian Process approach (lawrence2005semi) also make use of this assumption. There are also a large number of methods based on graphical analysis, for example, balcan2009person, zemel2005proximity, zhang2007hyperparameter, hein2007manifold. A thorough introduction to semi-supervised learning methods can be found in zhu05survey.

Generative methods based on the EM algorithm have two widely known disadvantages. First, some assumptions on the underlying distributions have to be made. If the assumptions are wrong, then the mislabeled data will hurt the accuracy (cozman2003semi). Second, the EM algorithm has a tendency to get stuck at a local maximum instead of reaching the global maximum. This may also cause problems when using unlabeled data.

In this paper, we propose a new generative method for semi-supervised learning which addresses the two problems mentioned. First, we make a flexible semiparametric modeling assumption. We assume that the two underlying class distributions map to multivariate normal distributions under the same unknown transformation. This is much more general than a specific parametric assumption. It is called the nonparanormal assumption in the graphical model literature (liu2012high; mulgrave2018bayesian). The assumption is nonparametric in the sense that the transformation has an unrestricted functional form. We shall use B-splines to estimate the transformation, so the problem reduces to parameter estimation. Second, instead of finding point estimates using the EM algorithm, we obtain the posterior distributions within a Gibbs sampling framework. This avoids the problem of getting trapped at local maxima.

The rest of the paper is organized as follows. We describe the model and the Gibbs sampling algorithm in Section 2. We give a method of selecting hyperparameters in Section 3. Then we present the results of a simulation study comparing our method with other semi-supervised learning methods in Section 4. Finally, we apply our method to real data sets in Section 5.

2 Model

Below we shall use the following notation: $\mathrm{N}_p(\mu, \Sigma)$ denotes a $p$-dimensional normal distribution with mean $\mu$ and covariance $\Sigma$; $\Phi(\cdot; \mu, \Sigma)$ and $\phi(\cdot; \mu, \Sigma)$ denote the cumulative distribution function and the density function of a normal distribution with mean $\mu$ and covariance $\Sigma$, respectively; $\mathrm{TN}(\mu, \Sigma, T)$ stands for a truncated normal distribution restricted to a set $T$; $\mathrm{W}(\nu, \Psi)$ stands for the Wishart distribution with $\nu$ degrees of freedom and scale matrix $\Psi$; $\mathrm{Beta}(a, b)$ stands for the beta distribution with shape parameters $a$ and $b$.

Suppose we observe $X_1, \ldots, X_n$ independently, which take values in $\mathbb{R}^p$ for some $p$, each observation written as $X_i = (X_{i1}, \ldots, X_{ip})$. We denote the class of $X_i$ by $Z_i$. The observation belongs to Class 0 if $Z_i = 0$ and to Class 1 otherwise, $i = 1, \ldots, n$. Notice that in the semi-supervised learning setting, not all $Z_i$ are available to us; for each $i$ we record whether the label is observed or missing. We assume that the labels are missing at random. That is, given the observation $X_i$, whether the $i$-th label is missing does not depend on the true class $Z_i$. This assumption makes sense because at the time we decide whether or not to verify an instance, we do not know the true class $Z_i$; the only information we have is the observation $X_i$.

When we talk about a generic observation, we omit the index $i$ from $X_i$ and just write $X$. We assume that under some unknown increasing transformation $f$, the transformed observations follow one of two normal distributions according to their classes,

\[ f(X) \mid Z = k \sim \mathrm{N}_p(\mu_k, \Sigma_k), \quad k = 0, 1, \]

where $f = (f_1, \ldots, f_p)$ is a $p$-dimensional vector of functions, $f(X) = (f_1(X_1), \ldots, f_p(X_p))$. Notice that any continuous random variable can be transformed to a normal variable by a strictly increasing transformation, and hence this is not an assumption if considered individually. The model assumption here is that the distributions of the two classes are mapped to normal distributions under the same transformation. This is called the nonparanormality assumption in the literature.


The method we use to estimate the transformation is very similar to the method used by mulgrave2018bayesian. However, the purpose of estimating this transformation is completely different. Here we are estimating the transformation for semi-supervised learning, while mulgrave2018bayesian used this approach to learn the graphical structure. Like mulgrave2018bayesian, we shall estimate each component of the transformation $f$. We denote the $j$-th component of the transformation by $f_j$. We put prior distributions on the unknown transformation functions through a random series based on B-splines, i.e.,

\[ f_j(x_j) = \sum_{k=1}^{J} \theta_{jk} B_k(x_j), \]

where $x_j$ is the $j$-th dimension of an observation, $B_k$ are the B-spline basis functions, $\theta_{jk}$ is a coefficient in the expansion of the function, $j = 1, \ldots, p$, $k = 1, \ldots, J$, and $J$ is the number of B-spline basis functions with equispaced knots used in the expansion. The coefficients are ordered to induce monotonicity, and the smoothness is controlled by the order of the B-splines and the number of basis functions used in the expansion. The posterior means of the coefficients give a monotone smooth Bayes estimate of the transformations. In this paper, we choose cubic splines, which correspond to B-splines of order 4.
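To make the B-spline expansion concrete, the following minimal sketch evaluates a cubic (order 4) B-spline basis on [0, 1] and checks that ordered coefficients give a non-decreasing function. It is an illustration under stated assumptions, not the authors' implementation: the clamped knot vector with equispaced interior knots, the function names `bspline_basis` and `transform`, and the coefficient values are all hypothetical choices.

```python
def bspline_basis(x, num_basis, order=4):
    """Return the values of all B-spline basis functions at x in [0, 1].

    Assumption: clamped knot vector with equispaced interior knots.
    The recursion is the standard Cox-de Boor formula.
    """
    if num_basis < order:
        raise ValueError("need at least `order` basis functions")
    num_interior = num_basis - order
    interior = [(i + 1) / (num_interior + 1) for i in range(num_interior)]
    knots = [0.0] * order + interior + [1.0] * order

    # Degree-0 basis: indicator of the knot interval that contains x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    if x == 1.0:
        basis[num_basis - 1] = 1.0  # close the last interval on the right

    # Raise the degree one step at a time, up to degree order - 1.
    for d in range(1, order):
        raised = []
        for i in range(len(basis) - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis


def transform(x, theta, order=4):
    """Evaluate f(x) = sum_k theta[k] * B_k(x) for one coordinate."""
    basis = bspline_basis(x, len(theta), order)
    return sum(t * b for t, b in zip(theta, basis))


# Hypothetical ordered coefficients (non-decreasing), J = 8 basis functions.
theta = [-2.0, -1.5, -0.2, 0.0, 0.4, 1.1, 1.9, 2.5]
grid = [i / 200 for i in range(201)]
values = [transform(x, theta) for x in grid]
```

With non-decreasing coefficients, `values` is non-decreasing along the grid, which is the property the ordering of the coefficients is meant to induce.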

Notice that for B-spline functions, $x_j$ can only take values in the range $[0, 1]$. Because of that, we have to transform the data if they are not in this range. For example, we can calculate the mean $\hat\mu_j$ and variance $\hat\sigma_j^2$ of the $j$-th dimension of the training data, and then apply the cumulative distribution function $\Phi(\cdot; \hat\mu_j, \hat\sigma_j^2)$ to the $j$-th dimension of the data, $j = 1, \ldots, p$.
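The rescaling step can be sketched as follows for a single coordinate. This is a minimal illustration, assuming the sample standard deviation is used and that new data reuse the mean and variance fitted on the training data; the function name `fit_unit_interval_map` and the numbers are hypothetical.

```python
from statistics import NormalDist, fmean, stdev


def fit_unit_interval_map(train_column):
    """Return a function mapping raw values into (0, 1).

    The map is the CDF of a normal distribution whose mean and standard
    deviation are estimated from the training values of one coordinate.
    """
    dist = NormalDist(fmean(train_column), stdev(train_column))
    return dist.cdf


# Hypothetical training values for one coordinate.
train = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8]
to_unit = fit_unit_interval_map(train)

scaled_train = [to_unit(v) for v in train]
scaled_new = to_unit(3.0)  # a new observation reuses the training fit
```

Because the normal CDF is strictly increasing, this preprocessing preserves the ordering of the data and composes with the monotone transformation estimated afterwards.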

2.1 Prior distributions

2.1.1 Prior on the B-spline coefficients

The prior we use was introduced by mulgrave2018bayesian. Here we include the prior specification for completeness.

Let us first put the monotonicity aside and put a normal prior on the coefficients of the B-splines, i.e., $\theta_j \sim \mathrm{N}_J(\zeta, \sigma^2 I)$, where $\sigma^2$ is some positive constant, $\zeta$ is a $J$-dimensional vector of constants, and $I$ is the identity matrix. We choose a normal prior for conjugacy. Because the means and the covariances of the normal distributions are unknown, there is flexibility in the location and the scale of the transformation, which causes identifiability issues. To address this while retaining the conjugate normal prior, we impose the following constraints on the location and the scale of the transformation function $f_j$:

The constraints can be written in matrix form , where

and .

By the property of the normal distribution, we have

where , . Because of the two linear constraints, the covariance matrix is actually singular. So we remove two coefficients by representing them as linear combinations of the others. Here we choose where is the largest (middle one if

is odd, either of middle two if

is even) and where is the largest (upper

th quantile one) and that

. In principle, any and with non-zero coefficients can work. Here we choose them in this way to guarantee numerical stability in later calculation. Then by we can get


where is the reduced vector after removing and from , and and can be calculated correspondingly.

The reduced vector follows the prior distribution:

where and are obtained by removing the and dimension correspondingly.

Finally, consider the monotonicity constraint . Written in matrix form, that is, , where

This can be further reduced to according to (2).

The final prior is given by


where . Notice that this prior preserves the conjugacy.

The parameter is chosen such that , where is a constant and is a positive constant. The motivation behind this prior specification is that this is an approximation for the expected values of the order statistics of a random variable.

2.1.2 Prior on the means and covariances

Because the means and covariances correspond to transformed measurements, it is hard to obtain any prior information. Hence, we put a noninformative prior on them, i.e., , .

2.2 Gibbs sampling algorithm

Denote the transformed observations by , i.e., , where , . Because of (2), we can calculate based on , i.e.,


where is obtained by removing the -th and the -th columns of , and is given by .

Before doing the Gibbs sampling, we need to assign some initial values. We assign the initial values for the B-spline coefficients by assuming that the true transformation is . These initial values will also be useful when sampling from the posteriors, which are truncated normal distributions. After finding the initial values of the coefficients, we can calculate the initial values of the transformed observations $Y_i$ according to (4). The initial values for $\mu_0$ and $\Sigma_0$ are the mean and the covariance matrix of the $Y_i$ whose observed label is Class 0, respectively; similarly, the initial values for $\mu_1$ and $\Sigma_1$ are the mean and the covariance matrix of the $Y_i$ whose observed label is Class 1. The initial value for the class is given as follows: for each $i$ whose label is missing,

\[ Z_i = \begin{cases} 0, & \|Y_i - \mu_0\| \le \|Y_i - \mu_1\|, \\ 1, & \text{otherwise}, \end{cases} \]

where $\|\cdot\|$ stands for the Euclidean distance.
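The initialization of the missing classes can be sketched as a nearest-mean rule on the transformed observations. This is a hedged illustration: the exact rule is assumed here to compare Euclidean distances to the two labeled class means, and the names `mean_vector` and `initial_classes` as well as the data are hypothetical.

```python
import math


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def initial_classes(Y, labels):
    """Initial class for every unit.

    labels[i] is 0 or 1 when observed and None when missing. Observed
    labels are kept; a missing label is set to the class whose labeled
    mean is closer in Euclidean distance (assumed rule).
    """
    mu0 = mean_vector([y for y, lab in zip(Y, labels) if lab == 0])
    mu1 = mean_vector([y for y, lab in zip(Y, labels) if lab == 1])
    init = []
    for y, lab in zip(Y, labels):
        if lab is None:
            lab = 0 if math.dist(y, mu0) <= math.dist(y, mu1) else 1
        init.append(lab)
    return init


# Hypothetical transformed observations with two missing labels.
Y = [[0.0, 0.0], [0.2, -0.1], [3.0, 3.0], [2.8, 3.1], [0.1, 0.3], [2.9, 2.7]]
labels = [0, 0, 1, 1, None, None]
Z_init = initial_classes(Y, labels)
```

The sketch assumes that each class has at least one labeled unit, otherwise the class mean is undefined.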

  1. First sample the B-spline coefficients for .

    The joint posterior for the B-spline coefficients is a truncated normal with density

    where is the prior distribution of given by (3).

    Let denote the vector . We have


    So we can sample from , where

    The truncated normal distributions are sampled using the method proposed by li2015efficient. After obtaining the posterior samples of , we can update according to (4).

  2. Update

    according to their posterior probability. That is,

    where , , , .

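  3. Update the missing labels according to the current distributions; for $i = 1, \ldots, n$, if the label is missing, update $Z_i$ by sampling from

    \[ P(Z_i = k \mid Y_i, \mu_0, \Sigma_0, \mu_1, \Sigma_1) = \frac{\pi_k \, \phi(Y_i; \mu_k, \Sigma_k)}{\pi_0 \, \phi(Y_i; \mu_0, \Sigma_0) + \pi_1 \, \phi(Y_i; \mu_1, \Sigma_1)}, \quad k = 0, 1, \]

    which follows from the missing at random assumption, where $\pi_0$ and $\pi_1$ stand for the proportions of Class 0 and Class 1. There are two ways to determine $\pi_0$ and $\pi_1$. One is to specify them in advance if we know the proportion of each category in the population. The other is to treat them as unknown parameters: we can specify a prior distribution and update them in each MCMC iteration.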
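The label update in step 3 can be sketched as below. This is a minimal illustration under assumptions, not the authors' code: a missing label is drawn with probability proportional to the class proportion times the class normal density at the transformed observation, and the class proportion is given a Beta(a, b) prior (an assumed choice) so that its update is conjugate. All function names and numbers are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def mvn_logpdf(y, mean, cov):
    """Log density of a multivariate normal at y."""
    L = cholesky(cov)
    z = []  # solves L z = y - mean by forward substitution
    for i in range(len(y)):
        s = (y[i] - mean[i]) - sum(L[i][k] * z[k] for k in range(i))
        z.append(s / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(y)))
    return -0.5 * (len(y) * math.log(2.0 * math.pi) + log_det
                   + sum(v * v for v in z))


def prob_class0(y, pi0, mu0, cov0, mu1, cov1):
    """P(Z = 0 | y) under the two-component normal model."""
    a = math.log(pi0) + mvn_logpdf(y, mu0, cov0)
    b = math.log(1.0 - pi0) + mvn_logpdf(y, mu1, cov1)
    m = max(a, b)  # subtract the maximum for numerical stability
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


def gibbs_label_step(Y, labels, Z, pi0, mu0, cov0, mu1, cov1, rng):
    """Resample Z[i] only where labels[i] is None (missing)."""
    for i, (y, lab) in enumerate(zip(Y, labels)):
        if lab is None:
            p0 = prob_class0(y, pi0, mu0, cov0, mu1, cov1)
            Z[i] = 0 if rng.random() < p0 else 1
    return Z


def gibbs_proportion_step(Z, a, b, rng):
    """Draw pi0 from Beta(a + n0, b + n1), assuming a Beta(a, b) prior on pi0."""
    n1 = sum(Z)
    n0 = len(Z) - n1
    return rng.betavariate(a + n0, b + n1)


rng = random.Random(0)
identity = [[1.0, 0.0], [0.0, 1.0]]
Y = [[0.0, 0.0], [2.0, 2.0], [-5.0, -5.0]]
labels = [0, 1, None]
Z = gibbs_label_step(Y, labels, [0, 1, 1], 0.5,
                     [0.0, 0.0], identity, [2.0, 2.0], identity, rng)
pi0 = gibbs_proportion_step(Z, 1.0, 1.0, rng)
```

Working on the log scale avoids underflow when an observation is far from both class means, which can easily happen in higher dimensions.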

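We can then obtain the posterior means of the model parameters (and of $\pi_0$ and $\pi_1$ if they are treated as unknown parameters). For new data coming in, we first apply the same transformation as the one applied to the original training data, and then calculate the transformed observation $Y$ according to (4). If $\pi_0 \, \phi(Y; \mu_0, \Sigma_0) > \pi_1 \, \phi(Y; \mu_1, \Sigma_1)$, with the parameters replaced by their posterior means, we assign the observation to Class 0; otherwise, we assign it to Class 1. We will call our method the Nonparanormal method in the simulations.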

3 Model selection

Notice that we have to choose the number of basis functions $J$ used to estimate the transformation function. If the labeled data are sufficiently extensive, we can consider using cross-validation to choose $J$. However, in semi-supervised learning settings, the number of labeled units in the data is usually very limited. The method we propose here is inspired by the low-density assumption, which is widely used in the semi-supervised learning literature. The idea is that we choose the classifier which best satisfies the low-density assumption, i.e., the one with the fewest points on the boundary. More specifically, we define the set of points that are close to the boundary by


where are the posterior means of respectively, is a fixed value we shall specify. Obviously, the set is more precise to the boundary when is closer to 1. However, this may lead to a small number of points in the boundary set and therefore cannot serve the purpose of distinguishing different . According to our numerical experiment, moderate value is a good choice. Notice that here is calculated based on the posterior mean . We choose the which has the smallest number of points in the set defined by (6).

In the simulation studies, to save computation time, we run the procedure with , each for iterations to get a decision rule. Then we choose the best according to the low density assumption introduced above and then run iterations to get the final classifier.
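The selection rule can be sketched as follows. Since the exact definition of the boundary set is not reproduced here, this sketch assumes that a point is "close to the boundary" when the ratio of its two posterior class weights lies between a threshold `c` and `1 / c`, with `c` in (0, 1); the names `boundary_count` and `select_num_basis`, the value of `c`, and the probabilities are hypothetical.

```python
def boundary_count(p0_values, c):
    """Count points whose class-0 versus class-1 weight ratio is in [c, 1/c].

    With p0 the posterior probability of Class 0, the ratio p0 / (1 - p0)
    lies in [c, 1/c] exactly when p0 lies in [c / (1 + c), 1 / (1 + c)].
    """
    lower, upper = c / (1.0 + c), 1.0 / (1.0 + c)
    return sum(1 for p in p0_values if lower <= p <= upper)


def select_num_basis(p0_by_candidate, c=0.5):
    """Pick the candidate number of basis functions with the fewest boundary points."""
    return min(p0_by_candidate,
               key=lambda J: boundary_count(p0_by_candidate[J], c))


# Hypothetical posterior class-0 probabilities of the training points,
# one list per candidate number of basis functions.
candidates = {
    6: [0.02, 0.50, 0.45, 0.97, 0.60],
    8: [0.01, 0.05, 0.40, 0.99, 0.95],
    10: [0.30, 0.50, 0.50, 0.60, 0.90],
}
best_J = select_num_basis(candidates, c=0.5)
```

A threshold closer to 1 narrows the band around the decision boundary, so fewer points are counted and ties between candidates become more likely.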

4 Simulations

Since our proposed method relies on the nonparanormality assumption, in the simulation studies we consider two cases: the case when the assumption is satisfied and the case when the assumption is violated.

4.1 Nonparanormality assumption satisfied

We consider $p = 5, 10, 15$ in our simulations. For each dimension $p$, instead of specifying values for the means and covariances of the underlying Gaussian distributions $\mathrm{N}_p(\mu_0, \Sigma_0)$ and $\mathrm{N}_p(\mu_1, \Sigma_1)$, we randomly generate them as follows:

  • Generate and

    independently and identically from a uniform distribution ranging in

    for .

  • Generate by , where every element in is generated from the uniform distribution over , is generated the same way independently.

Here we considered two true transformations:

  1. The logistic transformation for each dimension .

  2. The probit transformation for each dimension .

For each of the settings above, we generate $n$ observations for each class and randomly select $m$ of them to be labeled. Here $n$ takes values in $\{50, 100\}$ and $m$ takes values in $\{3, 5, 10\}$. For the testing set, we generate data for each class. We compare our Nonparanormal (NN) method to other widely used semi-supervised learning methods in the R package 'RSSL' (krijthe2016rssl). We tried all methods included in this package. However, the EM Linear Discriminant Classifier (Expectation Maximization applied to the linear discriminant classifier assuming Gaussian classes with a shared covariance matrix) (dempster1977maximum), IC Linear Discriminant Classifier (krijthe2014implicitly), Kernel Least Squares Classifier, Laplacian Kernel Least Squares Classifier (belkin2006manifold), MC Linear Discriminant Classifier (loog2011semi) and Quadratic Discriminant Classifier do not work for our generated datasets. Entropy Regularized Logistic Regression (grandvalet2005semi), Linear SVM, Logistic Loss Classifier, Logistic Regression and MC Nearest Mean Classifier (loog2010constrained) do not perform well. Here we only include the 5 methods which turned out to be the best ones for our data: IC Least Squares Classifier (ICLS) (krijthe2015implicitly), Laplacian SVM (LSVM) (belkin2006manifold), Well SVM (WSVM) (li2013convex), svmlin (SVML) (sindhwani2006large) and EM Nearest Mean Classifier (EM) (dempster1977maximum). To show that using unlabeled data can actually improve the accuracy of the classifier, we also include Support Vector Machine (SVM) and Random Forest (RF), two supervised classification methods trained only on labeled data, for comparison. We simulate 30 datasets for each setting, and the results shown in Tables 1 and 2 are the averages of the classification error rates over the 30 datasets.

Footnote 1: All the code for simulations and real data is available in …

p   (n, m)    NN    ICLS   LSVM    WSVM   SVML   EM     SVM    RF
5   (50,3)    3.39  15.82  *29.29  19.7   29.20  41.00  26.00  31.55
    (50,5)    5.35  12.48  29.00   17.65  26.06  37.24  21.11  26.43
    (50,10)   3.23  7.70   30.37   12.00  19.75  31.02  13.73  19.94
    (100,3)   3.90  18.16  *29.46  19.86  28.88  41.94  28.40  29.76
    (100,5)   2.81  13.61  *30.01  14.03  24.05  40.08  21.09  24.70
    (100,10)  2.80  8.24   31.94   10.7   19.44  34.09  14.72  19.84
10  (50,3)    0.20  18.13  *13.98  6.26   11.52  7.44   16.54  16.89
    (50,5)    0.20  11.43  10.13   5.62   9.17   7.53   10.57  14.44
    (50,10)   0.20  4.66   7.05    4.84   6.46   7.05   6.15   11.23
    (100,3)   0.12  18.38  *12.35  4.53   10.71  7.76   17.23  16.82
    (100,5)   0.12  12.28  10.38   4.42   8.83   7.72   11.26  15.39
    (100,10)  0.12  5.20   7.51    3.98   6.52   7.51   6.63   11.15
15  (50,3)    0.10  27.29  13.99   3.48   13.98  4.33   17.10  18.29
    (50,5)    0.08  16.36  8.09    3.17   8.97   4.12   7.64   13.93
    (50,10)   0.10  6.91   5.11    2.43   5.87   3.93   3.81   8.89
    (100,3)   0.04  26.38  14.67   2.08   14.05  3.97   17.94  18.41
    (100,5)   0.04  18.78  8.36    1.95   9.07   3.97   10.44  13.58
    (100,10)  0.04  7.55   6.17    1.90   6.55   3.88   4.49   10.09
Table 1: Classification error rate (%) for the test data when the data are generated with a logistic transformation. (Here * means 1–6 cases failed to give an output. The error is calculated based on the remaining outputs.)
p   (n, m)    NN    ICLS   LSVM    WSVM   SVML   EM     SVM    RF
5   (50,3)    3.88  18.47  *30.31  22.44  28.97  42.77  26.22  31.11
    (50,5)    3.75  14.96  29.18   18.49  25.11  39.14  22.12  26.16
    (50,10)   3.66  9.77   31.52   13.23  18.51  31.84  14.57  20.12
    (100,3)   6.21  19.37  *29.32  20.72  28.47  43.99  28.55  29.24
    (100,5)   3.10  14.92  30.18   15.29  23.26  42.53  21.29  24.49
    (100,10)  3.08  10.39  27.96   12.26  18.37  36.81  17.61  19.79
10  (50,3)    0.35  20.08  *13.66  6.89   12.11  7.95   16.57  16.72
    (50,5)    0.36  12.77  10.27   6.19   9.52   7.97   10.40  14.45
    (50,10)   0.33  6.74   7.60    5.39   7.00   7.54   6.54   11.34
    (100,3)   0.23  19.87  *12.86  5.10   11.42  8.10   17.44  16.80
    (100,5)   0.22  14.95  10.56   4.99   9.32   8.08   11.33  15.69
    (100,10)  0.21  7.16   7.93    4.47   6.91   7.83   7.07   11.17
15  (50,3)    0.10  27.21  *14.04  3.97   14.29  4.66   17.50  18.18
    (50,5)    0.11  16.91  8.44    3.55   9.44   4.44   8.04   13.89
    (50,10)   0.11  7.57   5.54    2.80   6.39   4.26   4.25   8.86
    (100,3)   0.05  26.50  14.79   2.61   14.37  4.33   18.09  18.62
    (100,5)   0.05  20.21  8.92    2.28   9.56   4.34   10.58  13.56
    (100,10)  0.06  8.95   6.57    2.26   7.15   4.26   4.99   10.11
Table 2: Classification error rate (%) for the test data when the data are generated with a probit transformation. (Here * means 1–6 cases failed to give an output. The error is calculated based on the remaining outputs.)

From the above results, we can see that our Nonparanormal method clearly outperforms the other methods when our assumption is satisfied. It is worth pointing out that because the parameters generated for different dimensions are different, the difficulty of classification is different as well. Here the difficulty of classification decreases as the dimension increases. The accuracy increases as the number of labels increases, which is expected. The accuracy also increases in general as the number of samples increases (except when $p = 5$, comparing $n = 100$ to $n = 50$, maybe because the problem is too difficult and the labeling proportion is playing a more important part). Comparing different methods, the EM algorithm is highly dependent on the assumption that the underlying distributions are Gaussian. This assumption does not hold in these cases, so, as expected, EM does not perform well. Indeed, it is the worst when $p = 5$. It is a bit surprising that when $p = 10$ and $p = 15$, EM performs reasonably well, maybe because the distributions can be approximated by Gaussians to some extent. Among the SVM methods, WSVM has the best performance; LSVM sometimes even failed to give a result. The reason why the SVM methods do not work well in our cases is that the low-density assumption on the decision boundary is not satisfied. Looking at the supervised classifiers trained with only labeled data, in some cases they perform even better than some semi-supervised classifiers. For example, SVM is better than LSVM, SVML and EM when $p = 5$. This shows that unlabeled data can hurt accuracy when used inappropriately. Only when used in the right way can unlabeled data greatly improve the accuracy, as the NN method we propose does in these cases.

4.2 Nonparanormality assumption fails

To simulate the case when the nonparanormality assumption is violated, we still generate the Gaussian distributions $\mathrm{N}_p(\mu_0, \Sigma_0)$ and $\mathrm{N}_p(\mu_1, \Sigma_1)$ as the underlying distributions, but we apply different transformations to different classes. For simplicity, we only consider the case $p = 5$, and we use the same values generated above for the means and covariances. The transformations we apply are:

  1. The logistic transformation for each dimension , on observations from Class 0.

  2. The probit transformation for each dimension , on observations from Class 1.

We generate $n$ observations for each class and randomly select $m$ of them to be labeled. Once again, $n$ takes values in $\{50, 100\}$ and $m$ takes values in $\{3, 5, 10\}$. For the testing set, we generate data for each class. For each setting, 30 datasets are generated, and we compare our proposed method with the other methods mentioned above. The results are shown in Table 3.

p   (n, m)    NN    ICLS   LSVM   WSVM   SVML   EM     SVM    RF
5   (50,3)    5.32  16.38  31.58  17.41  28.45  42.95  22.54  31.30
    (50,5)    2.17  14.19  29.74  16.03  25.75  38.14  17.18  23.99
    (50,10)   2.00  8.89   31.73  10.40  18.65  31.26  10.41  16.31
    (100,3)   3.46  18.97  30.49  17.95  28.18  44.01  23.24  29.12
    (100,5)   1.66  13.82  33.13  11.96  23.39  40.46  17.42  23.11
    (100,10)  1.67  9.51   31.97  9.23   18.81  35.83  11.07  16.04
Table 3: Classification error rate (%) for the test data when the data violate the Nonparanormal assumption. (Here * means 7–9 cases failed to give an output. The error is calculated based on the remaining outputs.)

From the results above, we can see that our proposed NN method has substantially lower error rates than the others even when the nonparanormality assumption fails. This shows that our method is robust with respect to this assumption.

5 Real data

For the real data part, we follow the literature by choosing datasets with labels and then generating missing labels, so that we can compare the performance of different methods. Here we consider two datasets from the UCI Machine Learning Repository. For each dataset, we randomly select 70% of the data for training and 30% of the data as the testing set. For the training set, we again randomly select 15% of the labels and regard the rest as unlabeled data for our semi-supervised learning method. For each dataset, we repeat the process 10 times and take the mean of the false positive rate, the false negative rate, the overall error rate and the Matthews correlation coefficient (MCC) for all semi-supervised methods considered. We calculate the MCC because it takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The two data sets we considered are:

  • Breast Cancer Wisconsin (Diagnostic) Data Set

    The purpose of this data is to use the features of the cell nucleus to predict whether the disease is malignant or benign. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Those features include radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. For each feature, the mean, standard error, and ”worst” or largest (mean of the three largest values) are calculated for each image. All features are continuous variables. The data have 357 benign cases and 212 malignant cases.

  • Ionosphere Data Set

    The purpose of this data is to classify the radar returns of free electrons in the ionosphere as ”good” or ”bad”. ”Good” radar returns are the ones showing evidence of some type of structure in the ionosphere, while ”bad” returns are those whose signals pass through the ionosphere. There are 17 pulse numbers for the system, and instances are described by 2 attributes per pulse number. There are 34 features in this data set and all of them are continuous. The data contain 126 bad cases and 225 good cases.
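The four evaluation measures used for the real data (false positive rate, false negative rate, overall error rate and MCC) can be computed from the confusion counts as in the minimal sketch below. It assumes labels coded as 0 and 1 with 1 treated as the positive class, which is a choice made here for illustration; the function name `binary_metrics` and the example labels are hypothetical.

```python
import math


def binary_metrics(y_true, y_pred):
    """False positive rate, false negative rate, error rate and MCC.

    Assumes labels are 0/1 with 1 as the positive class, and that both
    classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "error": (fp + fn) / len(pairs),
        "mcc": mcc,
    }


# Hypothetical test labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Unlike the overall error rate, the MCC uses all four confusion counts, so a classifier that always predicts the larger class gets an MCC of 0 even when its error rate looks low.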

                     NN     ICLS   LSVM   WSVM   SVML   EM
False Positive Rate  0.089  0.025  0.014  0.030  0.057  0.004
False Negative Rate  0.074  0.196  0.157  0.177  0.139  0.361
Overall Error Rate   0.083  0.091  0.068  0.085  0.087  0.140
MCC                  0.829  0.810  0.858  0.821  0.814  0.717
Table 4: Classification results on Breast Cancer Wisconsin (Diagnostic) Data Set
                     NN     ICLS   LSVM   WSVM   SVML   EM
False Positive Rate  0.087  0.456  0.564  0.337  0.450  0.276
False Negative Rate  0.150  0.043  0.034  0.088  0.047  0.339
Overall Error Rate   0.127  0.192  0.229  0.181  0.196  0.317
MCC                  0.741  0.581  0.504  0.612  0.576  0.369
Table 5: Classification results on Ionosphere Data Set

The results are given in Table 4 and Table 5. For the Breast Cancer Wisconsin (Diagnostic) Data Set, LSVM has the best performance in terms of overall error rate and MCC. Our proposed NN method has a slightly higher error rate and a lower MCC. However, unlike LSVM, whose false negative rate is much higher than its false positive rate, NN has false positive and false negative rates that are almost the same. WSVM has an overall error rate and MCC comparable to those of NN. For the Ionosphere Data Set, NN has much better results in terms of both overall error rate and MCC. In conclusion, NN is a safe choice for semi-supervised learning in the real world.