In statistical machine learning, the size and quality of the training data set have a notable impact on performance. Labeled data is often expensive to obtain and is therefore typically limited in size. By contrast, unlabeled data are often readily available or can be collected at reasonable cost. There is therefore a strong motivation to improve performance by utilizing both labeled and unlabeled data, which is known as semi-supervised learning.
Semi-supervised learning is readily achieved using generative models, cf. Zhu & Goldberg (2009); Kingma et al. (2014); Gordon & Hernandez-Lobato (2017). However, several studies have reported that incorporating unlabeled data does not guarantee improved performance (cf. Cozman et al., 2003; Krijthe & Loog, 2016; Oliver et al., 2018), especially for unlabeled data with features that are rarely observed in the labeled data.
Let the pair $(x, y)$ denote features and an associated class. For unlabeled data, $y$ is missing and we only observe $x$. Let $\ell \in \{0, 1\}$ indicate whether a sample is labeled ($\ell = 1$) or unlabeled ($\ell = 0$). The appropriate utilization of unlabeled data depends critically on the assumptions made about the data-generating distributions $p(x, y \mid \ell = 1)$ and $p(x, y \mid \ell = 0)$ underlying labeled and unlabeled pairs, respectively (see Chawla et al. (2005)).
Missing completely at random (MCAR): The labeled and unlabeled data-generating processes match exactly, i.e.,
$$p(x, y \mid \ell = 0) = p(x, y \mid \ell = 1).$$
MCAR is a common assumption, but it is highly restrictive and requires careful consideration when collecting unlabeled data. In a medical diagnosis scenario, it means that the features from an unscreened population must match those of the population of screened patients. When this assumption fails, the performance of a semi-supervised learning method can degrade drastically. For an illustration of this limitation, consider Figure 0(a).
Missing at random (MAR): The feature distributions may not match each other, i.e.,
$$p(x \mid \ell = 0) \neq p(x \mid \ell = 1),$$
but the conditional class distributions do match,
$$p(y \mid x, \ell = 0) = p(y \mid x, \ell = 1). \tag{3}$$
The features by which it is possible to discriminate between classes in the labeled data are the same as those for the unlabeled data, which can therefore potentially be used in semi-supervised learning. In the medical diagnosis case, this means that unscreened population data can be used in conjunction with screened patient data.
Missing not at random (MNAR): Neither the feature distributions nor the conditional class distributions may match each other, i.e.,
$$p(x \mid \ell = 0) \neq p(x \mid \ell = 1) \quad \text{and} \quad p(y \mid x, \ell = 0) \neq p(y \mid x, \ell = 1).$$
MNAR is a very conservative assumption since there is no necessary relation between labeled and unlabeled data. The features by which it is possible to discriminate between classes in the labeled data may therefore be different from those for the unlabeled data. This effectively means that the unlabeled data cannot be used to improve a classifier, and this case is therefore not considered here.
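The difference between MCAR and MAR can be made concrete with a small simulation. The sketch below is illustrative only: the two-class one-dimensional Gaussian model, the labeling probabilities and the function names `simulate` and `group_stats` are assumptions made for this example and do not come from the paper. Under MCAR the labeling ignores the features, whereas under MAR it depends on the features alone, so the feature distributions of the two populations differ while the class frequencies at a given feature value stay comparable.

```python
import math
import random


def simulate(mechanism, n=50000, seed=0):
    """Draw (x, y, labeled) triples from a hypothetical two-class 1-D model."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        y = rng.randint(0, 1)                        # class, uniform prior
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)  # class-conditional feature
        if mechanism == "MCAR":
            p_label = 0.3                            # labeling ignores x and y
        else:
            # MAR: the labeling probability depends on the feature x only
            p_label = 1.0 / (1.0 + math.exp(-2.0 * x))
        samples.append((x, y, rng.random() < p_label))
    return samples


def group_stats(samples, labeled, lo=0.4, hi=0.6):
    """Mean feature value, and fraction of class 1 among features in [lo, hi]."""
    xs = [x for x, _, flag in samples if flag == labeled]
    ys_in_bin = [y for x, y, flag in samples if flag == labeled and lo <= x <= hi]
    return sum(xs) / len(xs), sum(ys_in_bin) / len(ys_in_bin)


for mechanism in ("MCAR", "MAR"):
    data = simulate(mechanism)
    mean_l, frac_l = group_stats(data, True)
    mean_u, frac_u = group_stats(data, False)
    print(f"{mechanism}: mean x labeled={mean_l:.2f}, unlabeled={mean_u:.2f}; "
          f"class-1 fraction in bin labeled={frac_l:.2f}, unlabeled={frac_u:.2f}")
```

In this toy setting the labeled and unlabeled feature means nearly coincide under MCAR and are far apart under MAR, while the class-1 fraction inside a narrow feature bin remains similar for both populations under MAR.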
In this paper, we consider the use of generative models for semi-supervised learning under MAR. We develop an approach that utilizes part of the unlabeled data to refine the models obtained from the labeled data, while the remaining unlabeled data is used to robustify the learned class probabilities of rarely observed features. The method is applicable using any generative model and associated learning algorithm, cf. Hastie et al. (2016); Bishop (2016); Murphy & Bach (2012).
2 Problem Formulation
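The observed datasets are denoted
$$\mathcal{D}_1 = \{(x_i, y_i)\}_{i=1}^{n_1} \quad \text{and} \quad \mathcal{D}_0 = \{x_i\}_{i=1}^{n_0},$$
where the features $x$ belong to a $d$-dimensional space and we consider a finite set of class labels $\mathcal{Y}$. The samples are obtained as i.i.d. from unknown distributions $p(x, y \mid \ell = 1)$ and $p(x \mid \ell = 0)$, respectively, where $\ell \in \{0, 1\}$ indicates whether the data is labeled ($\ell = 1$) or unlabeled ($\ell = 0$).
Using $\mathcal{D}_1$ and $\mathcal{D}_0$, the goal is to develop a classifier that provides both class predictions for test points $x$ and a robust estimate of the class probability. The semi-supervised learning problem is motivated by the fact that typically $n_0 \gg n_1$.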
2.1 Optimal Classifier under MAR
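For a test sample $x$, we pursue the optimal classification rule $\hat{y}(x) \in \mathcal{Y}$, which minimizes an expected loss function. The most common loss function for classification problems is the zero-one loss function (Hastie et al., 2016),
$$L(y, \hat{y}) = \begin{cases} 0, & \hat{y} = y, \\ 1, & \hat{y} \neq y. \end{cases}$$
The expected loss function is
$$\mathbb{E}\big[L(y, \hat{y}(x)) \mid x\big] = \sum_{y \in \mathcal{Y}} L(y, \hat{y}(x))\, p(y \mid x) = 1 - p(\hat{y}(x) \mid x)$$
and the optimal classifier is given by
$$\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} p(y \mid x) = \arg\max_{y \in \mathcal{Y}} p(x \mid y)\, p(y),$$
where $p(x \mid y)$ is the likelihood of observing $x$, and $p(y)$ is the prior class probability. The error probability of the optimal classifier, evaluated at the test sample $x$, is given by
$$p_e(x) = 1 - \max_{y \in \mathcal{Y}} p(y \mid x) = 1 - \frac{\max_{y} p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}.$$
Note that the above distributions marginalize over the labeling process, i.e., over the indicator $\ell \in \{0, 1\}$ of whether a sample is labeled,
$$p(x \mid y) = \sum_{\ell} p(x \mid y, \ell)\, p(\ell \mid y) \quad \text{and} \quad p(y) = \sum_{\ell} p(y \mid \ell)\, p(\ell). \tag{9}$$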
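The classification rule under the zero-one loss can be sketched in a few lines. The example below is a minimal illustration, assuming known one-dimensional Gaussian class likelihoods and equal priors; the helper names `gaussian_pdf` and `bayes_classify` and all numerical values are hypothetical. It returns the class with the largest posterior probability together with the pointwise error probability, which is one minus that posterior.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_classify(x, likelihoods, priors):
    """Return (predicted class, posterior, error probability) at x.

    likelihoods maps each class to a callable p(x | y);
    priors maps each class to p(y).
    """
    joint = {y: likelihoods[y](x) * priors[y] for y in priors}
    evidence = sum(joint.values())
    posterior = {y: joint[y] / evidence for y in joint}
    y_hat = max(posterior, key=posterior.get)  # maximizes p(x | y) p(y)
    return y_hat, posterior, 1.0 - posterior[y_hat]


# Hypothetical two-class model with unit-variance likelihoods at -1 and +1.
likelihoods = {0: lambda x: gaussian_pdf(x, -1.0, 1.0),
               1: lambda x: gaussian_pdf(x, 1.0, 1.0)}
priors = {0: 0.5, 1: 0.5}

y_hat, posterior, p_err = bayes_classify(0.5, likelihoods, priors)
print(y_hat, round(p_err, 4))
```

At the decision boundary of this toy model, $x = 0$, both posteriors equal one half and the error probability reaches its maximum for two classes.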
2.2 Learning Models of $p(x \mid y)$ and $p(y)$
The labeled data $\mathcal{D}_1$ provides information about $p(x \mid y, \ell = 1)$ and $p(y \mid \ell = 1)$, where $\ell = 1$ denotes the labeled and $\ell = 0$ the unlabeled population. Together with the unlabeled data $\mathcal{D}_0$, it also provides information about the prior probability $p(\ell = 1)$ of obtaining samples from the labeled population. However, under MAR, the unlabeled data does not necessarily provide information about $p(x \mid y, \ell = 0)$ or $p(y \mid \ell = 0)$. Learning models of (9) using both $\mathcal{D}_1$ and $\mathcal{D}_0$ is therefore an open question.
In supervised learning, $p(x \mid y)$ and $p(y)$ are replaced by $p(x \mid y, \ell = 1)$ and $p(y \mid \ell = 1)$, and thus $\mathcal{D}_0$ is discarded. This approach, however, can lead to serious misclassifications and highly inaccurate error probabilities in regions of the feature space where we observe only unlabeled data, see Figure 0(a) for an illustration. Since only samples from the top-right and bottom-right regions of the feature space are labeled, the labeling process is clearly selective in the feature space.
Most semi-supervised learning algorithms are based on the MCAR assumption, in which
$$p(x \mid y, \ell = 0) = p(x \mid y, \ell = 1) \quad \text{and} \quad p(y \mid \ell = 0) = p(y \mid \ell = 1).$$
These methods are therefore not robust to MAR data (Chawla et al., 2005). As exemplified by a generative model in Figure 0(a), the MCAR assumption may lead to serious performance degradation, cf. the discussion in Zhu & Goldberg (2009).
Considering Figure 0(a), we draw two conclusions about $\mathcal{D}_0$. In the top-left and bottom-right regions of the feature space, the unlabeled data is not informative about the classes. By contrast, the top-right and bottom-left regions represent features that are shared with the labeled population.
Next, we generalize these observations to a robust learning approach.
3 Learning Approach under MAR
We now develop a semi-supervised approach for learning (9) that is robust to MAR data. The approach is applicable to any generative model of the data using any supervised method of choice.
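Under MAR, feature regions represented in the unlabeled data $\mathcal{D}_0$ that are
- not shared with the labeled data $\mathcal{D}_1$ provide no information about $y$,
- shared with $\mathcal{D}_1$ may provide information about $y$.
For unlabeled features in the first case, the principle of insufficient reason dictates a robust model of the prior class probability as uniform, i.e.,
$$p(y \mid \ell = 0) = \frac{1}{|\mathcal{Y}|}. \tag{10}$$
For the same reason, the class will not provide any information about these features, so a robust model of $p(x \mid y, \ell = 0)$ should be class independent, i.e.,
$$p(x \mid y, \ell = 0) = p(x \mid \ell = 0). \tag{11}$$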
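The effect of these two modeling choices can be checked with Bayes' rule. The short derivation below is a sketch under assumed notation: $\ell = 0$ marks the unlabeled population, the uniform prior is written $p(y \mid \ell = 0) = 1/|\mathcal{Y}|$, and the class-independent likelihood is written $p(x \mid y, \ell = 0) = p(x \mid \ell = 0)$. Together they imply a uniform class posterior, i.e., maximal uncertainty, for features that occur only in the unlabeled data.

```latex
\begin{align*}
p(y \mid x, \ell = 0)
  &= \frac{p(x \mid y, \ell = 0)\, p(y \mid \ell = 0)}
          {\sum_{y' \in \mathcal{Y}} p(x \mid y', \ell = 0)\, p(y' \mid \ell = 0)} \\
  &= \frac{p(x \mid \ell = 0)\, |\mathcal{Y}|^{-1}}
          {\sum_{y' \in \mathcal{Y}} p(x \mid \ell = 0)\, |\mathcal{Y}|^{-1}}
   = \frac{1}{|\mathcal{Y}|}.
\end{align*}
```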
The unlabeled features in the second case are, however, statistically indistinguishable from the labeled features and thus informative of the class $y$ under (3). Such unlabeled data can be used to improve the estimates of the models learned from the labeled data, as we show below.
Next, we partition the feature space using the likelihood ratio and use this partitioning to utilize the unlabeled data $\mathcal{D}_0$ in a robust manner.
3.1 Regions of Statistically Similar Features
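Consider learning initial generative models
$$\hat{p}(x \mid \ell = 1) \quad \text{and} \quad \hat{p}(x \mid \ell = 0)$$
of the labeled and unlabeled features, with which we partition the feature space using the likelihood ratio $\hat{p}(x \mid \ell = 1) / \hat{p}(x \mid \ell = 0)$. Let $\mathcal{X}_1$ denote the region obtained by thresholding this ratio. Thus all features in $\mathcal{X}_1$ are statistically indistinguishable from features of the labeled population. These contain information about $y$.
Testing whether a feature $x$ belongs to $\mathcal{X}_1$ corresponds to a likelihood ratio test. When an unlabeled feature $x$ belongs to $\mathcal{X}_1$, we assign it a class with the appropriate uncertainty under MAR. That is,
$$y \sim p(y \mid x, \ell = 0),$$
where we use (3) to obtain
$$p(y \mid x, \ell = 0) = p(y \mid x, \ell = 1).$$
All unlabeled features in $\mathcal{D}_0$ that are statistically indistinguishable from labeled samples, i.e.,
$$x_i \in \mathcal{X}_1,$$
are assigned a class in a manner that propagates their uncertainty consistent with MAR. The resulting pairs $(x_i, y_i)$ are added to the labeled data to form a dataset $\widetilde{\mathcal{D}}_1$, while the remaining unlabeled samples form the set $\widetilde{\mathcal{D}}_0$.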
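The partitioning step can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the one-dimensional Gaussian densities, the threshold value of 1.0 on the likelihood ratio, and the names `split_unlabeled`, `p_labeled`, `p_unlabeled` and `labeled_posterior` are all assumptions made for the example. Unlabeled features that pass the likelihood ratio test receive a class sampled from the posterior of the labeled model, which is one way to propagate class uncertainty; the rest remain unlabeled.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical initial models: labeled features cluster around 1 (class 0)
# and 3 (class 1); unlabeled features are spread widely around 0.
def p_labeled(x):
    return 0.5 * gaussian_pdf(x, 1.0, 0.5) + 0.5 * gaussian_pdf(x, 3.0, 0.5)


def p_unlabeled(x):
    return gaussian_pdf(x, 0.0, 3.0)


def labeled_posterior(x):
    """Class posterior p(y | x) under the labeled-population model."""
    joint = {0: 0.5 * gaussian_pdf(x, 1.0, 0.5),
             1: 0.5 * gaussian_pdf(x, 3.0, 0.5)}
    total = sum(joint.values())
    return {y: value / total for y, value in joint.items()}


def split_unlabeled(xs, threshold=1.0, seed=0):
    """Split unlabeled features and sample classes for the shared ones."""
    rng = random.Random(seed)
    pseudo_labeled, remaining = [], []
    for x in xs:
        if p_labeled(x) / p_unlabeled(x) >= threshold:  # likelihood ratio test
            probs = labeled_posterior(x)
            classes = list(probs)
            # Sample the class instead of taking the arg max, so that the
            # posterior uncertainty carries over to the assigned labels.
            y = rng.choices(classes, weights=[probs[c] for c in classes])[0]
            pseudo_labeled.append((x, y))
        else:
            remaining.append(x)
    return pseudo_labeled, remaining


pseudo_labeled, remaining = split_unlabeled([1.0, 3.0, -4.0])
print(pseudo_labeled, remaining)
```

In this toy example the features at 1.0 and 3.0 fall in the shared region and are given a sampled class, whereas the feature at -4.0 lies far from any labeled data and stays in the remaining unlabeled set.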
3.2 Robust Classifier
Using the augmented dataset $\widetilde{\mathcal{D}}_1$ and the remaining unlabeled set $\widetilde{\mathcal{D}}_0$ together with (10) and (11), we can learn robust models of (9), denoted $\hat{p}(x \mid y)$ and $\hat{p}(y)$. The procedure is summarized in Algorithm 1 and can be implemented using any generative model and learning algorithm of choice. This general applicability is similar to the self-training approach (Zhu & Goldberg, 2009), but unlike that approach it achieves robustness by cautiously assigning labels to parts of the unlabeled data under the MAR assumption while preserving the uncertainty with respect to the class $y$.
For a test sample $x$, the resulting classifier is given by
$$\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} \hat{p}(x \mid y)\, \hat{p}(y)$$
and the error probability is
$$\hat{p}_e(x) = 1 - \frac{\max_{y} \hat{p}(x \mid y)\, \hat{p}(y)}{\sum_{y'} \hat{p}(x \mid y')\, \hat{p}(y')}.$$
Using the learned model, we may also introduce a reject option if the error probability $\hat{p}_e(x)$ exceeds a given threshold for additional robustness, see Bishop (2016). An illustration of the learned model under MAR is shown in Figure 0(b), where uncertainty about the class is preserved in regions where there is only unlabeled data.
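One way to read the resulting classifier is as a combination of a class-specific component learned from the labeled population and a class-independent component with a uniform class prior for the remaining unlabeled population. The sketch below follows that reading under stated assumptions: the Gaussian densities, the labeled-population probability of 0.5, the rejection threshold `max_error=0.3`, and the function names are hypothetical choices for illustration, not values from the paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def robust_posterior(x, lik_labeled, prior_labeled, p_x_unlabeled, p_is_labeled):
    """Class posterior when the unlabeled component is class independent
    and carries a uniform class prior."""
    n_classes = len(prior_labeled)
    joint = {}
    for y in prior_labeled:
        labeled_part = lik_labeled[y](x) * prior_labeled[y] * p_is_labeled
        unlabeled_part = p_x_unlabeled(x) * (1.0 - p_is_labeled) / n_classes
        joint[y] = labeled_part + unlabeled_part
    total = sum(joint.values())
    return {y: joint[y] / total for y in joint}


def classify_with_reject(posterior, max_error=0.3):
    """Return (class or None if rejected, error probability)."""
    y_hat = max(posterior, key=posterior.get)
    p_err = 1.0 - posterior[y_hat]
    return (None if p_err > max_error else y_hat), p_err


# Hypothetical learned models.
lik_labeled = {0: lambda x: gaussian_pdf(x, 1.0, 0.5),
               1: lambda x: gaussian_pdf(x, 3.0, 0.5)}
prior_labeled = {0: 0.5, 1: 0.5}


def p_x_unlabeled(x):
    return gaussian_pdf(x, -4.0, 1.0)


for x in (1.0, -4.0):
    post = robust_posterior(x, lik_labeled, prior_labeled, p_x_unlabeled, 0.5)
    print(x, classify_with_reject(post))
```

Near the labeled data the class-specific component dominates and the prediction is confident, while in a region covered only by unlabeled data the posterior approaches the uniform distribution and the test point is rejected.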
4 Experimental Results
4.1 Two-moons dataset
Here we use a Gaussian mixture model that is learned using a variational approximation. Each class-conditional distribution is estimated using 8 components with full covariance matrices and a Dirichlet process prior with the weight concentration prior set to 1e-2. The mean precision prior was set to 1 and the covariance prior to the identity matrix. In the self-training method, only one label was added at each iteration.
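The settings described above coincide with parameter names of scikit-learn's `BayesianGaussianMixture`. Whether the authors used that library is an assumption; the sketch below only shows how such a configuration could be expressed with it. The helper name `make_class_conditional_model` and the `random_state` value are hypothetical.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def make_class_conditional_model(n_features=2, seed=0):
    """Variational GMM configured with the settings stated in the text."""
    return BayesianGaussianMixture(
        n_components=8,                                       # 8 components
        covariance_type="full",                               # full covariances
        weight_concentration_prior_type="dirichlet_process",  # DP prior
        weight_concentration_prior=1e-2,
        mean_precision_prior=1.0,
        covariance_prior=np.eye(n_features),                  # identity matrix
        random_state=seed,                                    # assumed, for repeatability
    )


model = make_class_conditional_model()
```

One such model would be fitted per class with `model.fit(X_class)`, after which `model.score_samples(X)` returns log-likelihoods that can serve as the class-conditional terms of the classifier.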
We contrast the cases in which the MCAR assumption holds and in which it fails in Figures 1(a) and 1(b), respectively. Comparing the self-training method in both cases, we clearly see that the resulting model is very accurate under MCAR in Fig. 2(a) but grossly inaccurate for about half of the data region under MAR in Fig. 2(b). By contrast, our proposed method is conservative when MCAR data is given in Fig. 3(a), leaving the estimated class probability near 0.50 for large portions of the unlabeled data. However, under MAR this corresponds to robustness, cf. Fig. 3(b) and 1(b).
4.2 MNIST dataset with adversarial features
To illustrate the robustness of our approach in a more realistic scenario, we consider training a classifier for handwritten digits using MNIST with unlabeled data. For each class, we introduce a certain proportion of adversarial features in the unlabeled dataset by rotating their images upside-down. We then consider a test set with adversarial features, see Fig. 5. Since some handwritten digits are invariant to the rotation, ideally such features, for example those representing '0', should be correctly classified, while others, representing, say, '4', should have a low class probability and thus be rejected.
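The construction of an adversarial feature can be sketched in a few lines. The example assumes that images are stored as flattened row-major vectors, in which case reversing the vector is equivalent to rotating the image by 180 degrees; the function name `rotate_180` and the 3x3 toy patterns are hypothetical stand-ins for 28x28 MNIST digits.

```python
def rotate_180(pixels, side=28):
    """Rotate a flattened, row-major square image by 180 degrees."""
    if len(pixels) != side * side:
        raise ValueError("expected a flattened side x side image")
    # Pixel (r, c) moves to (side - 1 - r, side - 1 - c), which is the
    # reversed position in the flattened vector.
    return pixels[::-1]


# A 'T' shape changes under the rotation, like the digit '4' or '7'.
t_shape = [1, 1, 1,
           0, 1, 0,
           0, 1, 0]

# A '+' shape is invariant under the rotation, like the digit '0'.
plus_shape = [0, 1, 0,
              1, 1, 1,
              0, 1, 0]

print(rotate_180(t_shape, side=3))
print(rotate_180(plus_shape, side=3) == plus_shape)
```

The rotated 'T' is a new pattern that a classifier trained on upright images has never seen, whereas the rotated '+' is identical to the original.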
Here we use the generative model considered in Bishop (2016). Each class-conditional distribution of the features is approximated with a mixture of Bernoulli models with 784 dimensions and 3 components. The parameters of the mixture models are optimized with the expectation-maximization (EM) algorithm.
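EM for a mixture of Bernoulli distributions can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the toy data with 6 binary pixels and 2 components (instead of 784 dimensions and 3 components), the random initialization, the clipping constant `eps` and all function names are assumptions made for the example.

```python
import math
import random


def component_log_probs(x, weights, means):
    """Log of weight times Bernoulli likelihood of x for each component."""
    return [math.log(w) + sum(math.log(mu[j]) if x[j] else math.log(1.0 - mu[j])
                              for j in range(len(x)))
            for w, mu in zip(weights, means)]


def log_likelihood(x, weights, means):
    """Log-likelihood of x under the mixture, via the log-sum-exp trick."""
    log_p = component_log_probs(x, weights, means)
    m = max(log_p)
    return m + math.log(sum(math.exp(lp - m) for lp in log_p))


def fit_bernoulli_mixture(data, n_components=3, n_iter=20, seed=0, eps=1e-6):
    """Fit mixture weights and per-pixel means with EM."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / n_components] * n_components
    means = [[rng.uniform(0.25, 0.75) for _ in range(d)]
             for _ in range(n_components)]
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        resp = []
        for x in data:
            log_p = component_log_probs(x, weights, means)
            m = max(log_p)
            unnorm = [math.exp(lp - m) for lp in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: update weights and means, clipped away from 0 and 1.
        counts = [sum(r[k] for r in resp) for k in range(n_components)]
        raw = [max(c / n, eps) for c in counts]
        weights = [w / sum(raw) for w in raw]
        for k in range(n_components):
            for j in range(d):
                mean = sum(r[k] * x[j] for r, x in zip(resp, data)) / max(counts[k], eps)
                means[k][j] = min(max(mean, eps), 1.0 - eps)
    return weights, means


# Toy data: two binary prototype patterns, repeated 20 times each.
data = [[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20
weights, means = fit_bernoulli_mixture(data, n_components=2)
print(log_likelihood([1, 1, 1, 0, 0, 0], weights, means))
print(log_likelihood([1, 0, 1, 0, 1, 0], weights, means))
```

A pattern seen during training obtains a much higher log-likelihood than an unseen pattern, which is the property the class-conditional models rely on when evaluating rotated digits.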
We compare the supervised case, which discards the unlabeled data, the self-training method, and the proposed approach, all using an option to reject test points for which the estimated error probability exceeds a threshold. The results are summarized in Table 1. Neither the supervised nor the self-trained models reject any adversarial examples, and they consequently make significant errors for certain classes that are not invariant to flipping (such as class '7'). By contrast, the robust approach rejects many more adversarial examples, or erroneously classifies them to a lesser degree, than the standard approaches.
| Class | Supervised | Self-training | Proposed |
|---|---|---|---|
| 0 | 62.5 | 45 | 17.5 + 57.5 |
| 1 | 42.5 | 42.5 | 7.5 + 37.5 |
| 2 | 92.5 | 97.5 | 52.5 + 47.5 |
| 3 | 100 | 95.0 | 40.0 + 60.0 |
| 4 | 95.0 | 97.5 | 10.0 + 90.0 |
| 5 | 85.0 | 80.0 | 22.5 + 75.0 |
| 6 | 100 | 100 | 72.5 + 27.5 |
| 7 | 100 | 100 | 32.5 + 67.5 |
| 8 | 90 | 95.0 | 35.0 + 60.0 |
| 9 | 100 | 100 | 35.0 + 65.0 |
Examples of a valid and an adversarial sample of the handwritten digit 2. The adversarial example is obtained by flipping the original feature vector.
We have developed a semi-supervised learning approach that is robust to cases in which labels are missing at random. Unlike methods based on labels missing completely at random, this approach does not make the restrictive assumption that labeled and unlabeled features have matching distributions. The proposed approach ensures that uncertainty about the classes is propagated to the unlabeled features in a robust manner. Moreover, it is widely applicable using any generative model with an associated learning algorithm. Finally, we demonstrated the robustness of the method on both synthetic datasets and real datasets with adversarial examples.
- Bishop (2016) C.M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer New York, 2016. ISBN 9781493938438.
- Chawla et al. (2005) Nitesh V Chawla et al. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research, 23:331–366, 2005.
- Cozman et al. (2003) Fabio G Cozman, Ira Cohen, and Marcelo C Cirelo. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 99–106, 2003.
- Gordon & Hernandez-Lobato (2017) Jonathan Gordon and Jose Miguel Hernandez-Lobato. Bayesian semi-supervised learning with deep generative models. Second workshop on Bayesian Deep Learning (NIPS 2017), 2017.
- Hastie et al. (2016) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: Data mining, inference, and prediction. 2nd ed., corrected at 11th printing. Springer, New York, 2016. ISBN 0172-7397.
- Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
- Krijthe & Loog (2016) Jesse H Krijthe and Marco Loog. The pessimistic limits of margin-based losses in semi-supervised learning. arXiv preprint arXiv:1612.08875, 2016.
- Murphy & Bach (2012) K.P. Murphy and F. Bach. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machi. MIT Press, 2012. ISBN 9780262018029.
- Oliver et al. (2018) Avital Oliver, Augustus Odena, Colin Raffel, Ekin D Cubuk, and Ian J Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. arXiv preprint arXiv:1804.09170, 2018.
- Zhu & Goldberg (2009) Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.