An Overview of Deep Semi-Supervised Learning

06/09/2020 · by Yassine Ouali, et al.

Deep neural networks have demonstrated their ability to provide remarkable performance on certain supervised learning tasks (e.g., image classification) when trained on extensive collections of labeled data (e.g., ImageNet). However, creating such large datasets requires a considerable amount of resources, time, and effort. Such resources may not be available in many practical cases, limiting the adoption and application of many deep learning methods. In a search for more data-efficient deep learning methods to overcome the need for large annotated datasets, there has been a rising research interest in recent years in applying semi-supervised learning to deep neural networks as a possible alternative, by developing novel methods and adapting existing semi-supervised learning frameworks to the deep learning setting. In this paper, we provide a comprehensive overview of deep semi-supervised learning, starting with an introduction to semi-supervised learning, followed by a summary of the dominant semi-supervised approaches in deep learning.


1 Introduction

In recent years, semi-supervised learning (SSL) has emerged as an exciting research direction in deep learning for dealing with situations where few labeled training examples are available together with a much larger number of unlabeled samples. This makes it applicable to real-world settings where unlabeled data are readily available and easy to acquire, while labeled instances are often hard, expensive, and time-consuming to collect. SSL is capable of building better classifiers that compensate for the lack of labeled training data. However, to avoid a poor match between the problem structure and the model assumptions, which can lead to a degradation in classification performance [191], SSL is only effective under certain assumptions, such as assuming that the decision boundary should avoid regions of high density, facilitating the extraction of additional information from the unlabeled instances to regularize training. In this paper, we start with an introduction to SSL, its main assumptions and methods, followed by a summary of the dominant semi-supervised approaches in deep learning. For a detailed and comprehensive review of the field, the Semi-Supervised Learning book [19] is a good resource.

1.1 Semi-supervised learning

“Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information – but not necessarily for all examples. Often, this information will be the targets associated with some of the examples. In this case, the data set $X = (x_i)_{i \in [n]}$ can be divided into two parts: the points $X_l := (x_1, \dots, x_l)$, for which labels $Y_l := (y_1, \dots, y_l)$ are provided, and the points $X_u := (x_{l+1}, \dots, x_{l+u})$, the labels of which are not known.” – Chapelle et al. [19].

As stated in the definition above, in SSL we are provided with a dataset containing both labeled and unlabeled examples. The portion of labeled examples is usually quite small compared to the unlabeled examples (e.g., 1 to 10% of the total number of examples). So, given a dataset $\mathcal{D}$ containing a labeled subset $\mathcal{D}_l$ and an unlabeled subset $\mathcal{D}_u$, the objective, or rather the hope, is to leverage the unlabeled examples to train a better-performing model than what can be obtained using only the labeled portion, and hopefully get closer to the desired optimal performance obtained when the whole dataset is labeled.

More formally, the goal of SSL is to leverage the unlabeled data $\mathcal{D}_u$ to produce a prediction function $f_\theta$ with trainable parameters $\theta$ that is more accurate than what would have been obtained by using only the labeled data $\mathcal{D}_l$. For instance, $\mathcal{D}_u$ might provide us with additional information about the structure of the data distribution $p(x)$ to better estimate the decision boundary between the different classes. For example, as shown in fig. 1, where the data points with distinct labels are separated by low-density regions, leveraging unlabeled data with an SSL approach can provide additional information about the shape of the decision boundary between two classes and reduce the ambiguity present in the supervised case.

SSL first appeared in the form of self-training [19], which is also known as self-labeling or self-teaching. A model is first trained on labeled data. Then, iteratively, a portion of the unlabeled data is annotated using the trained model and added to the training set for the next training iteration. SSL took off in the 1970s after its success with iterative algorithms such as the expectation-maximization algorithm [107], in which the labeled and unlabeled data are jointly used to maximize the likelihood of the model.

Figure 1: SSL toy example. The decision boundaries obtained on the two-moons dataset with a supervised approach and different SSL approaches, using 6 labeled examples (3 per class) and the rest of the points as unlabeled data.

1.2 SSL Methods

Many SSL methods and approaches have been introduced over the years. These algorithms can be broadly divided into the following categories:

  • Consistency Regularization (a.k.a. Consistency Training). Based on the assumption that if a realistic perturbation is applied to an unlabeled data point, the prediction should not change significantly. The model can then be trained to have consistent predictions on a given unlabeled example and its perturbed version.

  • Proxy-label Methods. Such methods leverage a model trained on the labeled set to produce additional training examples by labeling instances of the unlabeled set based on some heuristics. These approaches can also be referred to as bootstrapping algorithms. We follow Ruder et al. [131] and refer to them as proxy-label methods. Some examples of such methods are Self-training, Co-training and Multi-View Learning.

  • Generative Models. Similar to the supervised setting, where features learned on one task can be transferred to other downstream tasks, generative models that are able to generate samples from the data distribution $p(x)$ must learn features that are transferable to a supervised task on the same data with targets $y$.

  • Graph-Based Methods. The labeled and unlabeled data points can be considered as nodes of a graph, and the objective is to propagate the labels from the labeled nodes to the unlabeled ones by utilizing the similarity of two nodes $i$ and $j$, which is reflected by the strength of the edge between them.

In addition to these main categories, there is also some SSL work on entropy minimization, where we force the model to make confident predictions by minimizing the entropy of its outputs. Consistency training can also be considered a proxy-label method, with a subtle difference: instead of considering the predictions as ground truths and computing the cross-entropy loss, we enforce consistency of predictions by minimizing a given distance between the outputs.

SSL methods can also be categorized based on two dominant learning paradigms: transductive learning and inductive learning. Transductive learning aims to apply the trained classifier only to the unlabeled instances observed at training time; in this case, it does not generalize to unobserved instances. This type of algorithm is mainly used on graphs, such as random walks for node embedding [117, 57], where the objective is to label the unlabeled nodes of the graph that are present at training time. The more popular paradigm, inductive learning, aims, on the other hand, to learn a classifier capable of generalizing to unobserved instances at test time.

1.3 Main Assumptions in SSL

The first question we need to answer is: under what assumptions can we apply SSL algorithms? SSL algorithms only work under certain conditions, where some assumptions about the structure of the data need to hold. Without such assumptions, it would not be possible to generalize from a finite training set to a set of possibly infinitely many unseen test cases. The main assumptions in SSL are:

  • The Smoothness Assumption. If two points $x_1$, $x_2$ residing in a high-density region are close, then so should be their corresponding outputs $y_1$, $y_2$ [19]. Meaning that if two inputs are of the same class and belong to the same cluster, which is a high-density region of the input space, then their corresponding outputs need to be close. Conversely, if the two points are separated by a low-density region, their outputs must be distant from each other. This assumption can be quite helpful in a classification task, but not so much for regression.

  • The Cluster Assumption. If points are in the same cluster, they are likely to be of the same class [19]. In this special case of the smoothness assumption, we suppose that input data points form clusters and that each cluster corresponds to one of the output classes. The cluster assumption can also be seen as the low-density separation assumption: the decision boundary should lie in low-density regions. The relation between the two assumptions is easy to see: if a given decision boundary lies in a high-density region, it will likely cut a cluster into two different classes, resulting in samples from different classes belonging to the same cluster, which is a violation of the cluster assumption. In this case, we can restrict our model to have consistent predictions on the unlabeled data over small perturbations, pushing its decision boundary toward low-density regions.

  • The Manifold Assumption. The (high-dimensional) data lie (roughly) on a low-dimensional manifold [19]. In high-dimensional spaces, where the volume grows exponentially with the number of dimensions, it can be quite hard to estimate the true data distribution for generative tasks. For discriminative tasks, pairwise distances become similar regardless of the class, making classification quite challenging. However, if our input data lie on some lower-dimensional manifold, we can try to find a low-dimensional representation using the unlabeled data and then use the labeled data to solve the simplified task.

1.4 Related Problems

Active Learning

In active learning [138, 61], the learning algorithm is provided with a large pool of unlabeled data points, with the ability to request the labels of any given examples from the unlabeled set in an interactive manner. As opposed to classical passive learning, in which the examples to be labeled are chosen randomly from the unlabeled pool, active learning aims to carefully choose the examples to be labeled to achieve a higher accuracy while using as few requests as possible, thereby minimizing the cost of obtaining labeled data. This is of particular interest in problems where data may be abundant, but labels are scarce or expensive to obtain.

Although it is not possible to obtain a universally good active learning strategy [31], there exist many heuristics [138], which have been proven to be effective in practice. The two widely used selection criteria are informativeness and representativeness [70, 187]. Informativeness measures how well an unlabeled instance helps reduce the uncertainty of a statistical model, while representativeness measures how well an instance helps represent the structure of input patterns.

Active learning and SSL are naturally related, since both aim to use a limited amount of data to improve a learner. Several works considered combining SSL and AL in different tasks. [39] demonstrated a significant error reduction with limited labeled data for speech understanding, [127] propose an active semi-supervised learning system for pedestrian detection, [190] combine AL and SSL using Gaussian fields applied to synthetic datasets, and [49] exploit both labeled and unlabeled data using SSL to distill information from unlabeled data that improves representation learning and sample selection.

Transfer Learning and Domain Adaptation

Transfer learning [114, 160] is used to improve a learner on one domain, called the target domain, by transferring the knowledge learned from a related domain, referred to as the source domain. For instance, we may wish to train the model on synthetic, cheap-to-generate data, with the goal of using it on real data. In this case, the source domain used to train the model is related to but different from the target domain used to test the model. When the source and target differ but are related, transfer learning can be applied to obtain higher accuracy on the target data.

One popular type of transfer learning is domain adaptation [120, 116, 164]. Domain adaptation is a type of transductive transfer learning, where the target task remains the same as the source task, but the domain differs. The objective of domain adaptation is to train a learner capable of generalizing across domains of different distributions, in which labeled data are available for the source domain. As for the target domain, there are different categories describing it: we refer to the case where no labeled data are available on the target as unsupervised domain adaptation, while semi-supervised and supervised domain adaptation refer to situations where we have a partially or a fully labeled target domain, respectively [10].

SSL and unsupervised domain adaptation are closely related; in both cases, we are provided with labeled and unlabeled data, with the objective of learning a function capable of generalizing to the unlabeled data and unseen examples. However, in SSL, both the labeled and unlabeled sets come from the same distribution, while in unsupervised domain adaptation, the target and source distributions differ. Methods in both subjects can be leveraged interchangeably. In SSL, [102] proposed to use adversarial distribution alignment [48] for semi-supervised image classification using only a small amount of labeled samples. As for unsupervised domain adaptation, semi-supervised methods, such as consistency regularization [140, 93, 45], co-regularization [89] or proxy labeling [132, 131] demonstrated their effectiveness in domain adaptation.

Weakly-Supervised Learning

To overcome the need for large hand-labeled and expensive training sets, most sizeable deep learning systems use some form of weak supervision: lower-quality but larger-scale training sets constructed via strategies such as using cheap annotators or programmatic labeling [124]. In weakly-supervised learning, the objective is the same as in supervised learning; however, instead of a ground-truth labeled training set, we are provided with one or more weakly annotated sets of examples, whose labels could come from crowd workers, be the output of heuristic rules, the result of distant supervision [104], or the output of other classifiers. For example, in weakly-supervised semantic segmentation, pixel-level labels, which are harder and more expensive to acquire, are replaced with inexact annotations, e.g., image labels [157, 182, 159, 95, 92], points [9], scribbles [98] and bounding boxes [142, 29]. In such a scenario, SSL approaches can be used to further enhance the performance if a limited number of strongly labeled examples are available, while still taking advantage of the weakly labeled examples.

Learning with Noisy Labels

Learning from noisy labels [44, 50] can be challenging given the negative impact label noise can have on the performance of deep learning methods if the noise is significant. To overcome this, most existing methods for training deep neural networks with noisy labels seek to correct the loss function. One type of correction consists of treating all the examples as equal and relabeling the noisy examples, where proxy-label methods can be used for relabeling [172, 99, 125]. Another type of correction applies a reweighting to the training examples to distinguish between clean and noisy samples [26, 147]. Other works [33, 67, 85, 94] have shown that SSL can be useful in learning from noisy labels, where the noisy labels are discarded, and the noisy examples are considered as unlabeled data and used to regularize training using SSL methods.

1.5 Evaluating SSL Approaches

The conventional experimental procedure used to evaluate SSL methods consists of choosing a dataset commonly used for supervised learning (e.g., CIFAR-10 [86], SVHN [108], ImageNet [32], IMDb [101], Yelp reviews [178]); a large portion of the labels is then ignored, resulting in a small labeled set $\mathcal{D}_l$ and a larger unlabeled set $\mathcal{D}_u$. A deep learning model is trained with a given SSL approach, and the results are reported on the original test set over various standardized portions of labeled examples. In order to make this procedure applicable to real-world settings, Oliver et al. [111] proposed the following ways to improve this experimental methodology:

  • A Shared Implementation. For a realistic comparison of different SSL methods, they must share the same underlying architectures and other implementation details (hyperparameters, parameter initialization, data augmentation, regularization, etc.).

  • High-Quality Supervised Baseline. The main objective of SSL is to obtain better performance than what can be obtained in a supervised manner. This is why it is essential to provide a strong baseline consisting of training the same model on the labeled set in a supervised manner, with modified hyperparameters to report the best-case performance of the fully-supervised model.

  • Comparison to Transfer Learning. Another strong baseline to compare SSL methods against can be obtained by training the model on a large labeled dataset and then fine-tuning it on the small labeled set $\mathcal{D}_l$.

  • Considering Class Distribution Mismatch. The possible distribution mismatch between the labeled and unlabeled examples can be ignored when doing evaluation since both sets come from the same dataset. Still, such a mismatch is prevalent in real-world applications, where the unlabeled data can have different class distributions compared to the labeled data. The effect of this discrepancy needs to be addressed for better real-world adoption of SSL.

  • Varying the Amount of Labeled and Unlabeled Data. A common practice is to vary the number of labeled examples, but varying the size of the unlabeled set in a systematic way to simulate realistic scenarios, such as a relatively small unlabeled set, can provide additional insights into the effectiveness of SSL approaches.

  • Realistically Small Validation Sets. In many practical cases, we might end up with a validation set that is significantly larger than the labeled set used for training; in such a setting, extensive hyperparameter tuning might result in overfitting to the validation set. In contrast, small validation sets constrain the ability to select models [19, 43], resulting in a more realistic assessment of the performance of SSL methods.
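To make the evaluation protocol described at the start of this subsection concrete, the following is a minimal sketch, assuming PyTorch/torchvision and CIFAR-10, of how a standard SSL split can be simulated by discarding most of the training labels; the function name and the class-balanced sampling strategy are illustrative choices rather than a prescribed recipe.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def ssl_split(num_labeled=4000, seed=0):
    """Keep num_labeled labels (balanced over the 10 classes) and treat the
    remaining training images as the unlabeled set; the test set is untouched."""
    train = CIFAR10(root="./data", train=True, download=True)
    labels = np.array(train.targets)
    rng = np.random.RandomState(seed)
    labeled_idx = np.hstack([
        rng.choice(np.where(labels == c)[0], num_labeled // 10, replace=False)
        for c in range(10)
    ])
    unlabeled_idx = np.setdiff1d(np.arange(len(labels)), labeled_idx)
    return labeled_idx, unlabeled_idx
```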

2 Consistency Regularization

A recent line of work in deep semi-supervised learning utilizes the unlabeled data to enforce the trained model to be in line with the cluster assumption, i.e., the learned decision boundary must lie in low-density regions. These methods are based on the simple concept that if a realistic perturbation were applied to an unlabeled example, the prediction should not change significantly, given that under the cluster assumption, data points with distinct labels are separated by low-density regions, so the likelihood of an example switching classes after a perturbation is small (e.g., fig. 1).

More formally, with consistency regularization, we favor functions that give consistent predictions for similar data points. So rather than minimizing the classification cost only at the zero-dimensional data points of the input space, the regularized model minimizes the cost on a manifold around each data point, pushing the decision boundaries away from the unlabeled data points and smoothing the manifold on which the data reside [191]. Given an unlabeled data point $x \in \mathcal{D}_u$ and its perturbed version $\hat{x}$, the objective is to minimize the distance between the two outputs, $d(f_\theta(x), f_\theta(\hat{x}))$. The popular distance measures are mean squared error (MSE), Kullback-Leibler divergence (KL) and Jensen-Shannon divergence (JS). For two outputs in the form of probability distributions over the $C$ classes, $y = f_\theta(x)$ and $\hat{y} = f_\theta(\hat{x})$, we can compute these measures as follows:

$d_{\mathrm{MSE}}(y, \hat{y}) = \frac{1}{C}\sum_{k=1}^{C}\left(y_k - \hat{y}_k\right)^2$ (2.1)

$d_{\mathrm{KL}}(y, \hat{y}) = \sum_{k=1}^{C} y_k \log\frac{y_k}{\hat{y}_k}$ (2.2)

$d_{\mathrm{JS}}(y, \hat{y}) = \frac{1}{2}\, d_{\mathrm{KL}}\!\left(y, \frac{y+\hat{y}}{2}\right) + \frac{1}{2}\, d_{\mathrm{KL}}\!\left(\hat{y}, \frac{y+\hat{y}}{2}\right)$ (2.3)

Note that we can also enforce consistency over two perturbed versions of $x$, $\hat{x}_1$ and $\hat{x}_2$.
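As a concrete illustration, the snippet below sketches how the three consistency measures of eqs. (2.1)–(2.3) might be computed in PyTorch between the predictions for a batch of unlabeled inputs and their perturbed versions; it assumes the model outputs have already been turned into probabilities with a softmax, and the helper names are placeholders.

```python
import torch
import torch.nn.functional as F

def d_mse(p, p_hat):
    # Eq. (2.1): mean squared error over the C class probabilities.
    return ((p - p_hat) ** 2).mean(dim=-1)

def d_kl(p, p_hat, eps=1e-8):
    # Eq. (2.2): KL divergence D_KL(p || p_hat).
    return (p * (torch.log(p + eps) - torch.log(p_hat + eps))).sum(dim=-1)

def d_js(p, p_hat):
    # Eq. (2.3): Jensen-Shannon divergence via the mixture (p + p_hat) / 2.
    m = 0.5 * (p + p_hat)
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(p_hat, m)

# Example: two stochastic forward passes over the same unlabeled batch.
logits, logits_hat = torch.randn(4, 10), torch.randn(4, 10)
p, p_hat = F.softmax(logits, dim=-1), F.softmax(logits_hat, dim=-1)
consistency_loss = d_mse(p, p_hat).mean()  # any of the three measures can be used
```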

2.1 Ladder Networks

Figure 2: Ladder Networks. An illustration of one forward pass of Ladder Networks. The objective is to reconstruct the clean activations of the encoder using a denoising decoder that takes as input the corrupted activations of the noisy encoder.

To take any well-performing feed-forward network on supervised data and augment it with additional branches that can utilize unlabeled data, Rasmus et al. [123] propose to use Ladder Networks [151] with an additional encoder and decoder for SSL. As illustrated in fig. 2, the network consists of two encoders, a corrupted and a clean one, and a decoder. At each training iteration, the input $x$ is passed through both encoders. In the corrupted encoder, Gaussian noise is injected at each layer after batch normalization, so the forward passes produce two outputs: a clean prediction $y$ and a prediction $\tilde{y}$ based on the corrupted activations. The corrupted activations are then fed into the decoder to reconstruct the uncorrupted input and the clean hidden activations. The unsupervised training loss $\mathcal{L}_u$ is then computed as the MSE between the activations of the clean encoder $z^{(l)}$ and the reconstructed activations $\hat{z}^{(l)}$ (i.e., after batch normalization), computed over all layers, from the input $l = 0$ to the last layer $L$, with a weighting $\lambda_l$ for each layer's contribution to the total loss:

$\mathcal{L}_u = \frac{1}{N}\sum_{n=1}^{N}\sum_{l=0}^{L} \lambda_l \left\| z^{(l)}(n) - \hat{z}^{(l)}(n) \right\|^2$ (2.4)

If the input is a labeled data point with a label $t$, a supervised cross-entropy loss term can be added to $\mathcal{L}_u$ to obtain the final loss:

$\mathcal{L} = \mathcal{L}_u + \mathcal{L}_s = \mathcal{L}_u + H(t, \tilde{y})$ (2.5)

The method can be easily adapted for convolutional neural networks (CNNs) by replacing the fully-connected layers with convolution and deconvolution layers for semi-supervised vision tasks. However, the ladder network is quite heavy computationally, approximately tripling the computation needed for one training iteration. To mitigate this, the authors propose a variant of ladder networks called the Γ-model, in which $\lambda_l = 0$ when $l < L$. In this case, the decoder is omitted, and the unsupervised loss is computed as the MSE between the two outputs $y$ and $\tilde{y}$.

2.2 Pi-Model

Figure 3: Π-Model. Loss computation for the Π-Model: the MSE between the two outputs is computed for the unsupervised loss, and if the input is a labeled example, we add the supervised loss to the weighted unsupervised loss.

The Π-Model [90] is a simplification of the Γ-model of Ladder Networks, where the corrupted encoder is removed and the same network is used to get the predictions for both corrupted and uncorrupted inputs. Specifically, the Π-Model takes advantage of the stochastic nature of the prediction function in neural networks due to common regularization techniques, such as data augmentation and dropout, that typically do not alter the model's predictions. For any given input $x$, the objective is to reduce the distance between the predictions of two forward passes with $x$ as input. Concretely, as illustrated in fig. 3, we would like to minimize $d(\tilde{y}_1, \tilde{y}_2)$, where we consider one of the two outputs as the target. Given the stochastic nature of the prediction function (e.g., using dropout as a noise source), the two outputs $\tilde{y}_1$ and $\tilde{y}_2$ will be distinct, and the objective is to obtain consistent predictions for both of them. In case the input is a labeled data point, we also compute the cross-entropy supervised loss using the provided label $y$:

$\mathcal{L} = \frac{1}{|\mathcal{B}_l|}\sum_{(x, y) \in \mathcal{B}_l} H(y, f_\theta(x)) + w(t)\, \frac{1}{|\mathcal{B}_u|}\sum_{x \in \mathcal{B}_u} d_{\mathrm{MSE}}(\tilde{y}_1, \tilde{y}_2)$ (2.6)

with $w(t)$ as a weighting function that ramps up from 0 to a fixed maximum weight (e.g., 30) over a given number of epochs (e.g., the first 20% of training time). This way, we avoid using the untrained and random prediction function, which would provide unstable predictions at the start of training.
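A minimal sketch of one Π-Model training step is given below, assuming a PyTorch classifier `model` whose stochasticity comes from dropout, a hypothetical `augment` function providing random augmentations, and a commonly used Gaussian-shaped ramp-up for w(t); all names are placeholders.

```python
import math
import torch
import torch.nn.functional as F

def pi_model_step(model, x_l, y_l, x_u, augment, w_t):
    """One Pi-Model iteration: supervised cross-entropy on the labeled batch plus a
    weighted MSE consistency term between two stochastic forward passes (eq. 2.6)."""
    p1 = F.softmax(model(augment(x_u)), dim=-1)
    with torch.no_grad():                    # one of the two outputs acts as the target
        p2 = F.softmax(model(augment(x_u)), dim=-1)
    loss_u = F.mse_loss(p1, p2)
    loss_s = F.cross_entropy(model(augment(x_l)), y_l)
    return loss_s + w_t * loss_u

def rampup_weight(step, rampup_steps, max_w=30.0):
    """Gaussian-shaped ramp-up of the unsupervised weight w(t) from 0 to max_w."""
    t = min(step / float(rampup_steps), 1.0)
    return max_w * math.exp(-5.0 * (1.0 - t) ** 2)
```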

2.3 Temporal Ensembling

The Π-Model can be divided into two stages: we first classify all of the training data without updating the weights of the model, obtaining the predictions $\tilde{z}$; in the second stage, we consider these predictions as targets for the unsupervised loss and enforce consistency of predictions by minimizing the distance between the current outputs and the outputs of the first stage under different dropouts and augmentations.

The problem with this approach is that the targets are based on a single evaluation of the network and can rapidly change. This instability in the targets can lead to instability during training and reduces the amount of training signal that can be extracted from the unlabeled examples. To solve this, Laine et al. [90] propose a second version of the Π-Model called Temporal Ensembling, where the targets are the aggregation of all the previous predictions. This way, during training, we only need a single forward pass to get the current predictions $z$ and the aggregated targets $\tilde{z}$, speeding up training by approximately a factor of 2. The training process is illustrated in fig. 4.

Figure 4: Temporal Ensembling. Loss computation for Temporal Ensembling: the MSE between the current prediction and the aggregated target is computed for the unsupervised loss, and if the input is a labeled example, we add the supervised loss to the weighted unsupervised loss.

For each training example, at each training iteration, the current output $z$ is accumulated into the ensemble output $Z$ by an exponential moving average update:

$Z \leftarrow \alpha Z + (1 - \alpha)\, z$ (2.7)

where $\alpha$ is a momentum term that controls how far the ensemble reaches into training history. $Z$ can also be seen as the output of an ensemble of networks from previous training epochs, with the recent ones having greater weight than the distant ones.

At the start of training, temporal ensembling reduces to the Π-Model since the aggregated targets are very noisy. To overcome this, similar to the bias correction used in the Adam optimizer [80], the targets are corrected for the startup bias at training step $t$ as follows:

$\tilde{z} = Z \,/\, (1 - \alpha^{t})$ (2.8)

The loss computation in temporal ensembling remains the same as in the Π-Model, but with two essential benefits. First, the training is faster since we only need a single forward pass through the network to obtain the current predictions, while maintaining an exponential moving average (EMA) of the label predictions on each training example and penalizing predictions that are inconsistent with these targets. Second, the targets are more stable during training, yielding better results. The downside of such a method is the large amount of memory needed to keep an aggregate of the predictions for all of the training examples, which can become quite memory-intensive for large datasets and dense tasks (e.g., semantic segmentation).
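The bookkeeping for temporal ensembling can be summarized in a few lines; the sketch below, assuming per-example integer indices and softmax outputs, maintains the ensemble predictions of eq. (2.7) and applies the bias correction of eq. (2.8). The class name and the epoch-counter handling are illustrative.

```python
import torch

class TemporalEnsembler:
    """EMA of per-example predictions (eq. 2.7) with startup-bias correction
    (eq. 2.8). Memory cost is O(num_examples x num_classes)."""
    def __init__(self, num_examples, num_classes, alpha=0.6):
        self.alpha = alpha
        self.Z = torch.zeros(num_examples, num_classes)  # accumulated ensemble outputs
        self.epoch = 0                                   # number of completed epochs

    def update(self, indices, probs):
        # probs: current-epoch softmax outputs for the examples at `indices`.
        self.Z[indices] = self.alpha * self.Z[indices] + (1.0 - self.alpha) * probs

    def targets(self, indices):
        # Bias-corrected targets z~ = Z / (1 - alpha^t).
        t = max(self.epoch, 1)
        return self.Z[indices] / (1.0 - self.alpha ** t)
```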

2.4 Mean Teachers

The Π-Model and its improved version with Temporal Ensembling provide a better and more stable teacher model by maintaining an EMA of the predictions of each example, formed by an ensemble of the model's current version and those earlier versions that evaluated the same example. This ensembling improves the quality of the predictions, and using them as the teacher predictions improves results. However, the newly learned information is incorporated into training at a slow pace, since each target is updated only once per epoch, and the larger the dataset, the longer the span between updates gets.

Figure 5: Mean Teacher. The teacher model, which is an EMA of the student model, is responsible for generating the targets for consistency training. The student model is then trained to minimize the supervised loss over labeled examples and the consistency loss over unlabeled examples. At each training iteration, both models are evaluated with injected noise ($\eta$, $\eta'$), and the weights of the teacher model are updated using the current student model to incorporate the learned information at a faster pace.

Additionally, in the previous approaches, the same model plays a dual role, as a teacher and a student. Given a set of unlabeled data, as a teacher, the model generates the targets, which are then used by itself as a student for learning through a consistency loss. These targets may very well be misclassified. If the weight of the unsupervised loss outweighs that of the supervised loss, the model is prevented from learning new information and keeps predicting its own targets, resulting in a form of confirmation bias. To solve this, the quality of the targets must be improved. The quality of targets can be improved by either (1) carefully choosing the perturbations instead of merely injecting additive or multiplicative noise, or (2) carefully choosing the teacher model responsible for generating the targets, instead of using a replica of the student model.

To overcome these limitations, Mean Teacher [146] proposes using a separate teacher model for faster incorporation of the learned signal and to avoid the problem of confirmation bias. A training iteration of Mean Teacher (fig. 5) is very similar to the previous methods; the main difference is that the Π-Model uses the same model as student and teacher, and Temporal Ensembling approximates a stable teacher as an ensemble function with a weighted average of successive predictions, while Mean Teacher defines the weights $\theta'_t$ of the teacher model at training step $t$ as the EMA of the successive student weights $\theta_t$:

$\theta'_t = \alpha\, \theta'_{t-1} + (1 - \alpha)\, \theta_t$ (2.9)

The loss computation in this case is the sum of the supervised and unsupervised losses, where the teacher model is used to obtain the targets for the unsupervised loss for a given input $x$:

$\mathcal{L} = H(y, f_\theta(x)) + w(t)\, d_{\mathrm{MSE}}\big(f_\theta(x), f_{\theta'}(x)\big)$ (2.10)
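The teacher update of eq. (2.9) and the combined loss of eq. (2.10) are straightforward to implement; the following is a minimal PyTorch sketch in which `student` and `teacher` are two copies of the same architecture and `noise` is a placeholder for the injected perturbation (augmentation, dropout, etc.).

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Eq. (2.9): teacher weights as an EMA of the successive student weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)

def mean_teacher_step(student, teacher, x_l, y_l, x_u, noise, w_t):
    """Eq. (2.10): supervised cross-entropy plus consistency to the teacher's targets."""
    with torch.no_grad():
        targets = F.softmax(teacher(noise(x_u)), dim=-1)
    preds = F.softmax(student(noise(x_u)), dim=-1)
    return F.cross_entropy(student(noise(x_l)), y_l) + w_t * F.mse_loss(preds, targets)

# Typical usage: teacher = copy.deepcopy(student); after every optimizer step on the
# student, call ema_update(teacher, student). The teacher itself receives no gradients.
```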

2.5 Dual Students

One of the main drawbacks of using Mean Teacher is that, given a large number of training iterations, the teacher model weights will converge to those of the student model, and any biased and unstable predictions will be carried over to the student.

To solve this, Ke et al. [78] propose a Dual Students setup. Two student models with different initializations are simultaneously trained, and at a given iteration, one of them provides the targets for the other. To choose which one, we test for the most stable predictions, i.e., those that satisfy the following stability conditions:

  • The predictions using the two input versions, a clean version $x$ and a perturbed version $\hat{x}$, give the same predicted class.

  • Both predictions are confident, i.e., far from the decision boundary. This can be tested by checking whether $\max_k f(x)_k$ (resp. $\max_k f(\hat{x})_k$) is greater than a confidence threshold $\xi$, e.g., 0.1.

Given two student models $f_{\theta_1}$ and $f_{\theta_2}$, an unlabeled input $x$ and its perturbed version $\hat{x}$, we compute four predictions: $f_{\theta_1}(x)$, $f_{\theta_1}(\hat{x})$, $f_{\theta_2}(x)$ and $f_{\theta_2}(\hat{x})$. In addition to training each model to minimize both the supervised and unsupervised losses:

$\mathcal{L}_i = H(y, f_{\theta_i}(x)) + \lambda_1\, d_{\mathrm{MSE}}\big(f_{\theta_i}(x), f_{\theta_i}(\hat{x})\big)$ (2.11)

we also force one of the students to have similar predictions to its counterpart. To choose which one to update, we check the stability conditions for both models: if the predictions of only one of the models are unstable, we update its weights; if both are stable, we update the model with the largest prediction variation $\| f_{\theta_i}(x) - f_{\theta_i}(\hat{x}) \|^2$, i.e., the least stable one. In this case, the least stable model is trained with an additional loss:

$\mathcal{L}_{\mathrm{sta}} = \lambda_2\, d_{\mathrm{MSE}}\big(f_{\theta_i}(x), f_{\theta_j}(x)\big)$ (2.12)

where $\lambda_1$ and $\lambda_2$ are hyperparameters specifying the contribution of each loss term.
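The per-sample stability test and the stabilization loss of eq. (2.12) can be sketched as follows; this is a simplified reading of the conditions listed above (the exact tie-breaking rules of the paper may differ), with PyTorch tensors of softmax probabilities as inputs and the loss shown only for student 1.

```python
import torch

def stable(p_clean, p_pert, conf_thresh=0.1):
    """A sample is 'stable' for a student if the clean and perturbed inputs give the
    same predicted class and at least one of the two predictions is confident."""
    same_class = p_clean.argmax(-1) == p_pert.argmax(-1)
    confident = (p_clean.max(-1).values > conf_thresh) | (p_pert.max(-1).values > conf_thresh)
    return same_class & confident

def stabilization_loss_student1(p1_clean, p1_pert, p2_clean, p2_pert, lambda2=1.0):
    """Eq. (2.12) applied to student 1: it is pulled toward student 2 on samples where
    it is the unstable one, or where both are stable but it varies more."""
    s1, s2 = stable(p1_clean, p1_pert), stable(p2_clean, p2_pert)
    var1 = ((p1_clean - p1_pert) ** 2).sum(-1)   # prediction variation of student 1
    var2 = ((p2_clean - p2_pert) ** 2).sum(-1)   # prediction variation of student 2
    update_1 = (~s1 & s2) | (s1 & s2 & (var1 > var2))
    per_sample = ((p1_clean - p2_clean.detach()) ** 2).sum(-1)
    return lambda2 * (per_sample * update_1.float()).mean()
```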

2.6 Fast-SWA

Athiwaratkun et al. [5] observed that the Π-Model and Mean Teacher continue taking significant steps in the weight space at the end of training: stochastic gradient descent (SGD) traverses a large flat region of the weight space late in training, ending up iterating at the periphery of this flat region and continuing to actively explore the set of plausible solutions, producing diverse predictions on the test set even in the late stages of training. Based on this behavior, averaging the SGD iterates moves the final weights toward the center of the flat region, stabilizing the SGD trajectory, improving generalization, and leading to significant gains in performance.

One way to produce an ensemble of the model late in training is Stochastic Weight Averaging (SWA) [73], an approach based on averaging the weights traversed by SGD at the end of training with a cyclic learning rate (fig. 6). After a given number of epochs, the schedule changes to a cyclic learning rate and training continues for several cycles; the weights at the end of each cycle, corresponding to the minimum values of the learning rate, are stored and averaged together to obtain a model with the averaged weights $\theta_{\mathrm{SWA}}$, which is then used to make predictions.

Figure 6: SWA and fast-SWA. Left and Center. Cyclical cosine learning rate schedule used at the end of training for SWA and fast-SWA with different averaging strategies. Right. 2d illustration of the impact of SWA and fast-SWA averaging strategies on the final weights. Based on: [5].

Motivated by the observation that the benefits of averaging are most prominent when the distance between the averaged points is large, and given that SWA collects weights only once per cycle, so that many additional training epochs are needed to collect enough weights for averaging, the authors propose fast-SWA, a modification of SWA that averages the networks corresponding to many points within the same cycle, resulting in a better final model and a faster ensembling procedure.
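Weight averaging itself requires very little machinery; the sketch below maintains a running average of collected weights, where plain SWA would call `collect` once per cycle (at the learning-rate minimum) and fast-SWA would call it at several points within each cycle. The class name is a placeholder.

```python
import copy
import torch

class WeightAverager:
    """Running average of collected model weights (SWA / fast-SWA style)."""
    def __init__(self, model):
        self.avg_model = copy.deepcopy(model)
        self.n = 0

    @torch.no_grad()
    def collect(self, model):
        self.n += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            p_avg.add_((p - p_avg) / self.n)    # incremental mean of the weights

# Note: before using avg_model for prediction, BatchNorm statistics are usually
# re-estimated with a forward pass over the training data.
```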

2.7 Virtual Adversarial Training

The previous approaches focused on applying random perturbations to each input to generate artificial input points, encouraging the model to assign similar outputs to the unlabeled data points and their perturbed versions, thereby pushing for a smoother output distribution. As a result, the generalization performance of the model can be improved. However, such random noise and random data augmentation often leave the predictor particularly vulnerable to small perturbations in a specific direction, namely the adversarial direction, which is the direction in the input space in which the model's label probability is most sensitive.

To overcome this, and inspired by adversarial training [53], which trains the model to assign to each input a label similar to the labels of its neighbors in the adversarial direction, Miyato et al. [106] propose Virtual Adversarial Training (VAT), a regularization technique that enhances the model's robustness around each input data point against random and local perturbations. The term "virtual" comes from the fact that the adversarial perturbation is approximated without label information, and VAT is hence applicable to SSL to smooth the output distribution.

Concretely, VAT trains the output distribution to be isotropically smooth around each data point by selectively smoothing the model in its most adversarial direction. For a given data point $x$, we would like to compute the adversarial perturbation $r_{\mathrm{adv}}$ that will alter the model's predictions the most. We start by sampling a Gaussian noise $r$ of the same dimensions as the input $x$; we then compute the gradient $g$ of the distance between the two predictions, with and without the injection of the noise (KL-divergence is used as the distance measure $d$), with respect to $r$. $r_{\mathrm{adv}}$ can then be obtained by normalizing $g$ and scaling it by a hyperparameter $\epsilon$. The computation can be summarized in the following steps:

  • Sample a random noise $r \sim \mathcal{N}(0, I)$ of the same shape as $x$ and scale it to be small.

  • Compute the gradient $g = \nabla_r\, d_{\mathrm{KL}}\big(f_\theta(x), f_\theta(x + r)\big)$.

  • Normalize and scale: $r_{\mathrm{adv}} = \epsilon\, g / \|g\|_2$.

Note that the computation above is a single iteration of the approximation of $r_{\mathrm{adv}}$; for a more accurate estimate, we can set $r = r_{\mathrm{adv}}$ and recompute the last two steps. But in general, given how computationally expensive this computation is, requiring additional forward and backward passes, we only apply a single power iteration to compute the adversarial perturbation. With the approximated perturbation $r_{\mathrm{adv}}$, we can then compute the unsupervised loss as the MSE between the two predictions of the model, with and without the injection of $r_{\mathrm{adv}}$:

$\mathcal{L}_u = d_{\mathrm{MSE}}\big(f_\theta(x), f_\theta(x + r_{\mathrm{adv}})\big)$ (2.13)

For more stable training, Mean Teacher can be used to generate stable targets by replacing $f_\theta(x)$ with $f_{\theta'}(x)$, where $\theta'$ is an EMA of the student weights $\theta$.
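A compact sketch of the single power iteration described above is given below, assuming a PyTorch `model` that returns logits; the scaling constant `xi` and the perturbation magnitude `eps` are hyperparameters, and the final consistency term follows eq. (2.13) with the clean prediction treated as a fixed target.

```python
import torch
import torch.nn.functional as F

def kl(p_logits, q_logits):
    p = F.softmax(p_logits, dim=-1)
    return (p * (F.log_softmax(p_logits, dim=-1) - F.log_softmax(q_logits, dim=-1))).sum(-1).mean()

def vat_loss(model, x, xi=1e-6, eps=8.0):
    """One power iteration to approximate r_adv, then the consistency loss of eq. (2.13)."""
    with torch.no_grad():
        pred = model(x)                                   # clean prediction (target)
    r = torch.randn_like(x)                               # step 1: random direction
    r = r / (r.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    r.requires_grad_(True)
    dist = kl(pred, model(x + xi * r))                    # step 2: divergence at x + xi * r
    grad = torch.autograd.grad(dist, r)[0]
    r_adv = eps * grad / (grad.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    p_clean = F.softmax(pred, dim=-1)                     # step 3 done; compute the loss
    p_adv = F.softmax(model(x + r_adv.detach()), dim=-1)
    return F.mse_loss(p_adv, p_clean)
```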

Figure 7: Virtual Adversarial Examples. Examples of perturbed ImageNet images for different values of the scaling hyperparameter $\epsilon$.

2.8 Adversarial Dropout

Instead of using additive adversarial noise as in VAT, Park et al. [115] propose adversarial dropout (AdD), a.k.a. element-wise adversarial dropout (EAdD), in which dropout masks are adversarially optimized to alter the model's predictions. With this type of perturbation, we induce a sparse structure in the neural network, whereas additive noise does not directly change the structure of the network.

The first step is to find the dropout condition to which the model's predictions are most sensitive. In an SSL setting, where we do not have access to the true labels, we use the model's predictions on the unlabeled data points to approximate the adversarial dropout mask $\epsilon^{\mathrm{adv}}$, which is subject to a boundary condition that restricts it to be only infinitesimally different from the random dropout mask $\epsilon$, controlled by a hyperparameter $\delta$ and the dropout layer dimension $H$. Without this constraint, the adversarial dropout might induce a layer without any connections. By restricting the adversarial dropout to be similar to the random dropout, we prevent finding such an irrational layer, which would not support backpropagation.

Similar to VAT, we start from a random dropout mask and compute a KL-divergence loss between the outputs with and without dropout; given the gradients of this loss with respect to the activations before the dropout layer, we update the random dropout mask in an adversarial manner. The prediction function is divided into two parts, $f_{\theta_1}$ and $f_{\theta_2}$, where $f_\theta(x, \epsilon) = f_{\theta_2}\big(\epsilon \odot f_{\theta_1}(x)\big)$ and $\epsilon$ is the dropout mask. We start by computing an approximation of the Jacobian matrix as follows:

$J(x, \epsilon) \approx f_{\theta_1}(x) \odot \nabla_{f_{\theta_1}(x)}\, d_{\mathrm{KL}}\big(f_\theta(x), f_\theta(x, \epsilon)\big)$ (2.14)

Using $J(x, \epsilon)$, we can then update the random dropout mask $\epsilon$ to obtain $\epsilon^{\mathrm{adv}}$: if $\epsilon(i) = 0$ and $J(x, \epsilon)(i) > 0$, or $\epsilon(i) = 1$ and $J(x, \epsilon)(i) < 0$ at a given position $i$, we invert the value of $\epsilon$ at that location. The resulting mask $\epsilon^{\mathrm{adv}}$ can then be used to compute the unsupervised loss:

$\mathcal{L}_u = d_{\mathrm{KL}}\big(f_\theta(x), f_\theta(x, \epsilon^{\mathrm{adv}})\big)$ (2.15)
Figure 8: EAdD and CAdD. EAdD drops activations individually regardless of the spatial correlation, while CAdD drops entire feature maps, making it more suitable for convolutional layers. Image Source: [93].
Channel-wise Adversarial Dropout

The element-wise adversarial dropout (EAdD) introduced by Park et al. is limited to fully-connected networks. To use AdD in a wider range of tasks, Lee et al. [93] proposed channel-wise AdD (CAdD), an extension of the element-wise masking in AdD to convolutional layers (fig. 8). In these layers, standard dropout is relatively ineffective due to the strong spatial correlation between individual activations of a feature map [148], and EAdD suffers from the same issues when naively applied to convolutional layers. To solve this, CAdD adversarially drops entire feature maps rather than individual activations. While the general procedure is similar to that of EAdD, an additional constraint is imposed on the mask to represent spatial dropout [148]. In this case, the mask is of the same shape as the activations, and the adversarial dropout mask is approximated under the following new condition:

(2.16)

where $\delta$ is a hyperparameter restricting the difference between the two masks to be small, and $\epsilon_i$ is the mask corresponding to the $i$-th activation map. The process of finding the channel-wise adversarial dropout mask is similar to that of element-wise adversarial dropout, but with a per-activation-map approximation.

2.9 Interpolation Consistency Training

Figure 9: ICT. A student model is trained to have consistent predictions at different interpolations of unlabeled data points, where a teacher is used to generate the targets before the MixUp operation.

As discussed earlier, random perturbations are inefficient in high dimensions, given that only a limited subset of the input perturbations are capable of pushing the decision boundary into low-density regions. VAT and AdD find the adversarial perturbations that will maximize the change in the model's predictions, which involves multiple forward and backward passes to compute these perturbations. This additional computation can be restrictive in many cases and makes such methods less appealing. As an alternative, Verma et al. [154] propose Interpolation Consistency Training (ICT) as an efficient consistency regularization technique for SSL.

Consider a MixUp operation [176], $\mathrm{Mix}_\lambda(a, b) = \lambda a + (1 - \lambda) b$, that outputs an interpolation between two inputs with a weight $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. As shown in fig. 9, ICT trains the prediction function $f_\theta$ to provide consistent predictions at different interpolations of unlabeled data points $u_i$ and $u_j$, where the targets are generated using a teacher model $f_{\theta'}$, which is an EMA of $f_\theta$:

$f_\theta\big(\mathrm{Mix}_\lambda(u_i, u_j)\big) \approx \mathrm{Mix}_\lambda\big(f_{\theta'}(u_i), f_{\theta'}(u_j)\big)$ (2.17)

The unsupervised objective is to have similar values between the student model’s prediction given a mixed input of two unlabeled data points, and the mixed outputs of the teacher model.

$\mathcal{L}_u = d_{\mathrm{MSE}}\Big(f_\theta\big(\mathrm{Mix}_\lambda(u_i, u_j)\big),\, \mathrm{Mix}_\lambda\big(f_{\theta'}(u_i), f_{\theta'}(u_j)\big)\Big)$ (2.18)

The benefit of ICT compared to random perturbations can be analyzed by considering the MixUp operation as a perturbation applied to a given unlabeled example, $u_i + \delta$ with $\delta = (1 - \lambda)(u_j - u_i)$. For a large number of classes and a similar distribution of examples per class, it is likely that a pair of points $u_i$ and $u_j$ lie in different clusters and belong to different classes. If one of these two data points lies in a low-density region, the interpolation points toward a low-density region, which is a good direction in which to push the decision boundary.
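A minimal sketch of the ICT objective (eqs. 2.17–2.18) is shown below, with `student` and `teacher` as PyTorch models (the teacher being an EMA copy, as in Mean Teacher) and `u1`, `u2` two halves of an unlabeled batch; the names and the Beta parameter are placeholders.

```python
import torch
import torch.nn.functional as F

def ict_loss(student, teacher, u1, u2, alpha=1.0):
    """The student's prediction on a MixUp of two unlabeled inputs should match
    the MixUp of the teacher's predictions on the individual inputs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * u1 + (1.0 - lam) * u2
    with torch.no_grad():
        target = lam * F.softmax(teacher(u1), dim=-1) + (1.0 - lam) * F.softmax(teacher(u2), dim=-1)
    return F.mse_loss(F.softmax(student(mixed), dim=-1), target)
```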

2.10 Unsupervised Data Augmentation

Unsupervised Data Augmentation (UDA) [167] uses advanced data augmentation methods, such as AutoAugment [27], RandAugment [28] and Back-Translation [41, 137], as perturbations for consistency-training-based SSL. Similar to supervised learning, advanced data augmentation methods can provide extra advantages over simple augmentations and random noise for consistency training, given that (1) they generate realistic augmented examples, making it safe to encourage consistency between predictions on the original and augmented examples, (2) they can generate a diverse set of examples, improving sample efficiency, and (3) they are capable of providing the missing inductive biases for different tasks.

Motivated by these points, Xie et al. [167] propose to apply the following augmentations to generate transformed versions of the unlabeled inputs:

  • RandAugment for Image Classification. Consists of uniformly sampling from a set of possible transformations in PIL, without requiring any labeled data to search for a good augmentation strategy.

  • Back-translation for Text Classification. Consists of translating an existing example in language A into another language B, and then translating it back into A to obtain an augmented example.

Figure 10: UDA. The training procedure consists of computing the supervised loss for the labeled examples and the consistency loss between the two outputs of the augmented and clean input.

After defining the augmentations to be applied during training, the training procedure (fig. 10) is straightforward. The objective is to produce the correct predictions over the labeled set and consistent predictions on the original and augmented examples from the unlabeled set.
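The following is a minimal sketch of this objective, assuming a PyTorch `model`, a `strong_aug` callable standing in for RandAugment or back-translation, and a fixed weighting; the paper's additional refinements (e.g., confidence-based masking and training-signal annealing) are omitted here.

```python
import torch
import torch.nn.functional as F

def uda_step(model, x_l, y_l, x_u, strong_aug, lambda_u=1.0):
    """Supervised cross-entropy on labeled data plus a consistency term between the
    predictions on an unlabeled example and its strongly augmented version."""
    loss_s = F.cross_entropy(model(x_l), y_l)
    with torch.no_grad():                                   # clean prediction acts as the target
        target = F.softmax(model(x_u), dim=-1)
    log_p_aug = F.log_softmax(model(strong_aug(x_u)), dim=-1)
    loss_u = F.kl_div(log_p_aug, target, reduction="batchmean")
    return loss_s + lambda_u * loss_u
```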

3 Entropy Minimization

In the previous section, in a setting where the cluster assumption holds, we enforced consistency of predictions to push the decision boundary into low-density regions and avoid classifying samples from the same cluster with distinct classes, which would violate the cluster assumption. Another way to enforce this is to encourage the network to make confident (low-entropy) predictions on unlabeled data regardless of the predicted class, discouraging the decision boundary from passing near data points where it would otherwise be forced to produce low-confidence predictions. This is done by adding a loss term that minimizes the entropy of the prediction function $f_\theta(x)$; e.g., for a categorical output space with $C$ possible classes, the entropy minimization term [56] is:

$\mathcal{H} = -\sum_{k=1}^{C} f_\theta(x)_k \log f_\theta(x)_k$ (3.1)

However, with high-capacity models such as neural networks, the model can quickly overfit to low-confidence data points by simply outputting large logits, resulting in very confident predictions [111]. On its own, entropy minimization does not produce competitive results compared to other SSL methods, but it can produce state-of-the-art results when combined with other approaches.
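In code, the term of eq. (3.1), averaged over a batch of unlabeled logits, is a one-liner; the sketch below assumes PyTorch and is typically added to the total loss with a small weight.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    """Eq. (3.1): entropy of the predicted class distribution on unlabeled examples."""
    p = F.softmax(logits, dim=-1)
    return -(p * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```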

4 Proxy-label Methods

Proxy-label methods are the class of SSL algorithms that produce proxy labels on unlabeled data using the prediction function itself or some variant of it, without any supervision. These proxy labels are then used as targets together with the labeled data, providing additional training information even if the produced labels are often noisy or weak and do not reflect the ground truth. These methods can be divided mainly into two groups: self-training, where the model itself produces the proxy labels, and multi-view learning, where the proxy labels are produced by models trained on different views of the data.

4.1 Self-training

In self-training [171, 136, 130, 129], the small amount of labeled data $\mathcal{D}_l$ is first used to train a prediction function $f_\theta$. The trained model is then used to assign pseudo-labels to the unlabeled data points $x \in \mathcal{D}_u$. Given an output $f_\theta(x)$ for an unlabeled data point $x$ in the form of a probability distribution over the classes, the pair $(x, \arg\max f_\theta(x))$ is added to the labeled set if the probability assigned to its most likely class is higher than a predetermined threshold $\tau$. The process of training the model using the augmented labeled set, and then using it to label the remainder of $\mathcal{D}_u$, is repeated until the model is incapable of producing confident predictions. Other heuristics can be used to decide which proxy-labeled examples to retain, such as using the relative confidence instead of the absolute confidence, where the top unlabeled samples predicted with the highest confidence after every epoch are added to the labeled training set. The impact of self-training is similar to that of entropy minimization; in both cases, the network is forced to make more confident predictions. The main downside of such methods is that the model is unable to correct its own mistakes, and any biased or wrong classification can be quickly amplified, resulting in confident but erroneous proxy labels on the unlabeled data points.
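One round of the labeling step described above can be sketched as follows, assuming a PyTorch `model`, a loader over unlabeled inputs, and an absolute confidence threshold; the outer loop that retrains on the augmented labeled set is left implicit.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.95):
    """Keep only the unlabeled examples whose most likely class exceeds the
    confidence threshold tau, and return them with their proxy labels."""
    kept_x, kept_y = [], []
    for x in unlabeled_loader:
        probs = F.softmax(model(x), dim=-1)
        conf, label = probs.max(dim=-1)
        mask = conf > threshold
        kept_x.append(x[mask])
        kept_y.append(label[mask])
    return torch.cat(kept_x), torch.cat(kept_y)

# Self-training alternates: train on the labeled set, call pseudo_label to grow it,
# and stop once no confident predictions remain.
```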

Yalniz et al. [169] proposed to use self-training to improve the top-1 accuracy of ResNet-50 [65] and enhance the robustness of the trained model to various perturbations (e.g., ImageNet-A, C and P [66]), where the model is first trained only on unlabeled images and then fine-tuned on labeled images in the final stage. Instead of using the same model for both proxy-label generation and training, Xie et al. [168] proposed to use the student-teacher setting. In an iterative manner, the teacher model is first trained on labeled examples and used to generate soft proxy labels on unlabeled data. The student can then be trained on both the labeled set and the proxy labels while aggressively injecting noise to train a better student. In the next iteration, the student is considered as a teacher, a bigger version of EfficientNet [144] is used for the student, and the same procedure is repeated up to the largest model.

In addition to image classification, self-training was also successfully applied to a variety of tasks, such as semantic segmentation [7], text classification [96, 77], machine translation [137, 62, 23, 63] and learning from noisy data [152].

Pseudo-labeling

[91, 2, 72, 139]. Similar to self-training, the objective of pseudo-labeling is to generate proxy labels to enhance the learning process. A first attempt at adapting pseudo-labeling [91] for deep learning constrained the usage of the proxy labels to a fine-tuning stage after pretraining the network. Shi et al. [139] proposed to adapt transductive SSL [75, 76, 179, 158] by treating the labels of unlabeled examples as variables and trying to determine their optimal labels, together with the optimal model parameters, by minimizing the proposed loss function through an iterative training process. The generated proxy labels are considered as hard labels for the unlabeled examples; an uncertainty weight is then introduced, with large weights for examples with distant k-nearest neighbors in the feature space, in addition to two loss terms encouraging intra-class compactness and inter-class separation, and a consistency term between samples with different perturbations. Iscen et al. [72] integrated label propagation [189, 162, 52] within pseudo-labeling. The method alternates between training the network on the labeled examples and pseudo-labels, and then leveraging the learned representations to build a nearest-neighbor graph on which label propagation is applied to refine the hard pseudo-labels. They also introduce two uncertainty scores: one for every sample, based on the entropy of the output probabilities, to overcome the unequal confidence in the predictions, and a per-class score based on the class population to deal with class imbalance. Arazo et al. [2] showed that naive pseudo-labeling overfits to incorrect pseudo-labels due to the so-called confirmation bias, and demonstrated that MixUp [176] and setting a minimum number of labeled samples per mini-batch are effective regularization techniques for reducing this bias.

Meta Pseudo Labels
Figure 11: The MPL training procedure. At each training iteration, the teacher model is trained along with a student model to set the student’s target distributions and adapt to the student’s learning state. Image Source: [118].

Given how important the heuristics used to generate the proxy labels are, where a proper method can lead to a sizable gain, Pham et al. [118] propose to use the student-teacher setting, where the teacher model is responsible for producing the proxy labels based on an efficient meta-learning algorithm called Meta Pseudo Labels (MPL), which encourages the teacher to adjust the target distributions of training examples in a manner that improves the learning of the student model. The teacher is updated by policy gradients computed by evaluating the student model on a held-out validation set.

A given training step of MPL consists of two phases (fig. 11):

  • Phase 1: The Student learns from the Teacher. In this phase, given a single input example $x$, the teacher produces a target class distribution to train the student; the pair of the input and the target distribution is shown to the student, which updates its parameters by back-propagating the cross-entropy loss.

  • Phase 2: The Teacher learns from the Student’s Validation Loss. After the student updates its parameters in the first phase, its new parameters are evaluated on an example from the held-out validation dataset using the cross-entropy loss. Since the student's new parameters depend on the teacher via the first phase, this validation cross-entropy loss is also a function of the teacher's weights. This dependency allows us to compute the gradients of the validation loss with respect to the teacher's weights, and then update the teacher to minimize the validation loss using policy gradients.

While the student's performance allows the teacher to adjust and adapt to the student's learning state, this signal alone is not sufficient to train the teacher, since by the time the teacher has observed enough evidence to produce meaningful target distributions to teach the student, the student might have already entered a bad region of the parameter space. To overcome this, the teacher is also trained using the pairs of labeled data points from the held-out validation set.

4.2 Multi-view training

Multi-view training (MVL) [87, 180] utilizes multi-view data, which is very common in real-world applications, where different views can be collected by different measuring methods (e.g., color information and texture information for images) or by creating limited views of the original data. In such a setting, MVL aims to learn a distinct prediction function $f_j$ to model a given view $v_j(x)$ of a data point $x$, and jointly optimizes all the functions to improve the generalization performance. Ideally, the possible views complement each other so that the produced models can collaborate in improving each other's performance.

4.2.1 Co-training

Co-training [15] requires that each data point $x$ can be represented using two conditionally independent views $v_1(x)$ and $v_2(x)$, and that each view is sufficient to train a good model. After training two prediction functions $f_1$ and $f_2$, each on one of the two views of the labeled set, we start the proxy-labeling procedure. At each iteration, an unlabeled data point is added to the training set of one model if the other model outputs a confident prediction, with a probability higher than a threshold $\tau$. This way, one of the models provides newly labeled examples where the other model is uncertain. Co-training has been combined with deep learning for some applications, such as object recognition [22] by utilizing RGB-D data, with RGB and depth as the two views used to train the two models, or for combining multi-modal data [3] (i.e., image and text) by training each model on a given modality and using it to provide pseudo-labels for the other models. However, in most real applications the data have only one view rather than two; in this case, different learning algorithms or different parameter configurations can be employed to learn two different classifiers. The two views can also be generated by injecting noise or by applying different augmentations; for example, Qiao et al. [119] used adversarial perturbations to produce new views for deep co-training for image classification, where the models are encouraged to have the same predictions on the clean examples but make different errors when they are exposed to adversarial attacks.

Democratic Co-training

[185]. An extension of co-training that replaces the different views of the input data with a number of models with different architectures and learning algorithms, which are first trained on the labeled examples. The trained models are then used to label a given example if a majority of the models confidently agree on its label.

4.2.2 Tri-Training

Tri-training [186] tries to overcome the lack of data with multiple views and to reduce the bias of the predictions on unlabeled data produced with self-training by utilizing the agreement of three independently trained models instead of a single model. First, the labeled data is used to train three prediction functions $f_1$, $f_2$ and $f_3$. An unlabeled data point is then added to the supervised training set of the function $f_i$ if the other two models agree on its predicted label. The training stops when no data points are being added to any of the models' training sets. Tri-training requires neither the existence of multiple views nor unique learning algorithms, so it can be applied to more real applications. Using tri-training with neural networks can be very expensive, requiring predictions from each one of the three models on all the unlabeled data. Ruder et al. [131] propose to sample a limited number of unlabeled data points at each training epoch; the candidate pool size is increased as training progresses and the models become more accurate.
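The agreement rule at the core of tri-training fits in a few lines; the sketch below, assuming three PyTorch models returning logits, builds the proxy-labeled set for model i from the consensus of the other two.

```python
import torch

@torch.no_grad()
def tri_training_targets(models, x_u, i):
    """Keep an unlabeled example for model i only if the other two models agree
    on its predicted class; that agreed class becomes the proxy label."""
    j, k = [m for m in range(3) if m != i]
    pred_j = models[j](x_u).argmax(dim=-1)
    pred_k = models[k](x_u).argmax(dim=-1)
    agree = pred_j == pred_k
    return x_u[agree], pred_j[agree]
```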

Multi-task tri-training

[131] can also be used to reduce the time and sample complexity, where all three models share the same feature extractor with model-specific classification layers. This way, the models are trained jointly, with an additional orthogonality constraint on two of the three classification layers added to the loss term to avoid learning similar models and falling back to the standard case of self-training. Tri-Net [37] also falls in this category, with a shared module for joint learning and three output modules for tri-training, in addition to utilizing output smearing [16] to initialize these modules, followed by a fine-tuning stage on labeled data to augment diversity and eliminate unstable and suspicious pseudo-labeled data.

Cross-View Training
Figure 12: Cross-view Training. An example of auxiliary student prediction modules. Each student sees a restricted view of the input. For instance, the forward prediction module does not see any context to the right of the current token when predicting that token's label. Image Source: [25]

In self-training, the model plays the dual role of a teacher and a student, producing the predictions it is being trained on, which results in very moderate performance gains. As a solution, and taking inspiration from multi-view learning and consistency training, Clark et al. [25] proposed Cross-View Training, where the model is trained to produce consistent predictions across different views of the inputs. Instead of using a single model as teacher and student, they propose to use a shared encoder, and then add auxiliary prediction modules that transform the encoder representations into predictions. These modules are divided into auxiliary student modules and a primary teacher module. The input to each student prediction module is a subset of the model's intermediate representations, corresponding to a restricted view of the input, such as feeding one of the students only the forward LSTM of a given Bi-LSTM layer, so that it makes predictions without seeing any tokens to the right of the current one (fig. 12). The primary teacher module is trained only on labeled examples and is responsible for generating the pseudo-labels, taking as input the full view of the unlabeled inputs; the students are trained to have predictions consistent with the teacher module. Given an encoder $h$, a primary teacher module $f_t$ and $K$ auxiliary student modules $f_j$ with $j \in \{1, \dots, K\}$, where each student $f_j$ receives a restricted view $h_j(x)$ of the input, the training objective is written as follows:

$\mathcal{L} = \mathbb{E}_{(x, y) \sim \mathcal{D}_l}\big[\mathrm{CE}\big(y, p_\theta(y \mid x)\big)\big] + \mathbb{E}_{x \sim \mathcal{D}_u}\Big[\sum_{j=1}^{N} \mathrm{KL}\big(p_\theta(y \mid x) \,\big\|\, p_\theta^{(j)}(y \mid x^{(j)})\big)\Big]$    (4.1)

Cross-view training takes advantage of unlabeled data by improving the encoder's representation learning. The student prediction modules can learn from the teacher module's predictions because this primary module has a better, unrestricted view of the inputs. As the student modules learn to make accurate predictions despite their restricted views of the input, they improve the quality of the representations produced by the encoder, which, in turn, improves the full model, since it uses the same shared representations.
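The consistency objective can be sketched as follows, assuming PyTorch tensors of logits; the teacher's predictions are treated as fixed targets, and the choice of KL divergence as the consistency measure is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def cvt_loss(teacher_logits_lab, labels, teacher_logits_unlab, student_logits_unlab):
    """Cross-view training loss sketch: supervised cross-entropy for the
    primary (full-view) module plus a consistency term pulling each
    restricted-view student towards the teacher's soft predictions."""
    sup = F.cross_entropy(teacher_logits_lab, labels)
    target = F.softmax(teacher_logits_unlab, dim=-1).detach()   # teacher is not updated by this term
    cons = sum(F.kl_div(F.log_softmax(s, dim=-1), target, reduction="batchmean")
               for s in student_logits_unlab)
    return sup + cons
```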

5 Holistic Methods

An emerging line of work in SSL is a set of holistic approaches that try to unify the current dominant methods in SSL within a single framework, achieving better performance.

5.1 MixMatch

Figure 13: MixMatch. The label guessing process used in MixMatch, taking as input a batch of unlabeled examples and outputting augmented versions of each input with corresponding sharpened proxy labels. Image Source: [118].

Berthelot et al. [13] propose MixMatch, a holistic approach which gracefully unifies ideas and components from the dominant paradigms for SSL, resulting in an algorithm that is greater than the sum of its parts and surpasses the performance of traditional approaches.

MixMatch takes as input a batch from the labeled set containing pairs of inputs and their corresponding one-hot targets, a batch from the unlabeled set containing only unlabeled data, and a set of hyperparameters: the sharpening softmax temperature $T$, the number of augmentations $K$, and the Beta distribution parameter $\alpha$ for MixUp. It produces a batch of augmented labeled examples and a batch of augmented unlabeled examples with their proxy labels. These augmented examples can then be used to compute the losses and train the model. Precisely, MixMatch consists of the following steps:

  • Step 1: Data Augmentation. Using a given stochastic transformation, each labeled example from the labeled batch is transformed, generating an augmented version. For an unlabeled example, the augmentation function is applied $K$ times, resulting in $K$ augmented versions of the unlabeled example.

  • Step 2: Label Guessing. The second step consists of producing proxy labels for the unlabeled examples. First, we generate predictions for the $K$ augmented versions of each unlabeled example using the prediction function. These predictions are then averaged together, obtaining a proxy or pseudo-label that is shared by all $K$ augmentations of the unlabeled example.

  • Step 3: Sharpening. To push the model to produce confident predictions and minimize the entropy of the output distribution, the proxy labels generated in step 2, which take the form of a probability distribution over $C$ classes, are sharpened by adjusting the temperature of the categorical distribution, computed as follows, where $p_i$ refers to the probability of class $i$ out of the $C$ classes:

    $\mathrm{Sharpen}(p, T)_i = p_i^{1/T} \Big/ \sum_{j=1}^{C} p_j^{1/T}$    (5.1)
  • Step 4: MixUp. The previous steps result in two new augmented batches: a batch $\hat{\mathcal{X}}$ of augmented labeled examples and their targets, and a batch $\hat{\mathcal{U}}$ of augmented unlabeled examples and their sharpened proxy labels. Note that $\hat{\mathcal{U}}$ is $K$ times larger than the original unlabeled batch, given that each example is replaced by its $K$ augmented versions. In the last step, we mix these two batches. First, a new batch $\mathcal{W}$ merging both batches is created and shuffled. $\mathcal{W}$ is then divided into two batches: $\mathcal{W}_1$ of the same size as $\hat{\mathcal{X}}$ and $\mathcal{W}_2$ of the same size as $\hat{\mathcal{U}}$. Using the MixUp operation, slightly adjusted so that the mixed example stays closer to the original augmented example, the final step is to create the new labeled and unlabeled batches by mixing the produced batches together using MixUp as follows:

$\mathcal{X}' = \mathrm{MixUp}(\hat{\mathcal{X}}_i, \mathcal{W}_{1,i}), \quad i \in \{1, \dots, |\hat{\mathcal{X}}|\}$    (5.2)
$\mathcal{U}' = \mathrm{MixUp}(\hat{\mathcal{U}}_i, \mathcal{W}_{2,i}), \quad i \in \{1, \dots, |\hat{\mathcal{U}}|\}$    (5.3)

After creating the two augmented batches $\mathcal{X}'$ and $\mathcal{U}'$ using MixMatch, we can then train the model using the standard SSL losses, computing a cross-entropy loss for the supervised term and a consistency loss for the unsupervised term over the augmented batches as follows:

$\mathcal{L} = \mathcal{L}_{\mathcal{X}'} + \lambda_{\mathcal{U}} \mathcal{L}_{\mathcal{U}'}, \quad \mathcal{L}_{\mathcal{X}'} = \frac{1}{|\mathcal{X}'|} \sum_{(x, p) \in \mathcal{X}'} \mathrm{CE}\big(p, f_\theta(x)\big), \quad \mathcal{L}_{\mathcal{U}'} = \frac{1}{C\,|\mathcal{U}'|} \sum_{(u, q) \in \mathcal{U}'} \big\| q - f_\theta(u) \big\|_2^2$    (5.4)
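The core MixMatch operations can be sketched as follows in PyTorch; the function names, default hyperparameters, and the `augment` callable are illustrative assumptions, while the sharpening and MixUp formulas follow eqs. (5.1)-(5.3).

```python
import torch
import torch.nn.functional as F

def sharpen(p, T=0.5):
    """Temperature sharpening of a batch of class distributions (eq. 5.1)."""
    p = p ** (1.0 / T)
    return p / p.sum(dim=-1, keepdim=True)

def mixup(x1, y1, x2, y2, alpha=0.75):
    """MixUp with lambda' = max(lambda, 1 - lambda), keeping the mixed pair
    closer to its first argument."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def guess_labels(model, u, augment, K=2, T=0.5):
    """Average the model's predictions over K augmentations of the unlabeled
    batch `u`, then sharpen the result (steps 2 and 3)."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(augment(u)), dim=-1) for _ in range(K)])
    return sharpen(probs.mean(dim=0), T)
```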

5.2 ReMixMatch

Berthelot et al. [12] propose to improve MixMatch by introducing two new techniques: distribution alignment and augmentation anchoring. Distribution alignment encourages the marginal distribution of predictions on unlabeled data to be close to the marginal distribution of ground-truth labels. Augmentation anchoring feeds multiple strongly augmented versions of the input into the model. It encourages each output to be close to the prediction for a weakly-augmented version of the same input.

Figure 14: ReMixMatch. Left: Distribution alignment adjusts a guessed label distribution by the ratio of the ground-truth class distribution to the running average of model predictions on unlabeled data. Right: Augmentation anchoring uses the prediction obtained on a weakly augmented image as the target for strongly augmented versions of the same image. Image Source: [12].

Distribution alignment. Distribution alignment forces the aggregate of predictions on unlabeled data to match the class distribution of the provided labeled data. Over the course of training, a running average $\tilde{p}(y)$ of the model's predictions on unlabeled data is maintained over the last 128 batches. The marginal class distribution $p(y)$ is estimated from the labeled examples seen during training. Given a prediction $q$ on an unlabeled example $u$, the output probability distribution is aligned as follows: $\tilde{q} = \mathrm{Normalize}\big(q \times p(y) / \tilde{p}(y)\big)$.
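A minimal sketch of the alignment step, with the running average maintained outside the function (the variable names and the epsilon are assumptions):

```python
import torch

def distribution_align(q, running_avg, class_marginal, eps=1e-6):
    """Scale a batch of guessed label distributions `q` (N x C) by the ratio of
    the labeled-data class marginal to the running average of model
    predictions on unlabeled data, then renormalize each row to sum to one."""
    q = q * (class_marginal / (running_avg + eps))
    return q / q.sum(dim=-1, keepdim=True)
```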

Augmentation Anchoring. MixMatch uses a simple flip-and-crop augmentation strategy; ReMixMatch replaces these weak augmentations with strong augmentations learned using a control-theory-based augmentation strategy (CTAugment) following AutoAugment. With such augmentations, the model's prediction for a weakly augmented unlabeled image is used as a proxy label for many strongly augmented versions of the same image in a standard cross-entropy loss.

For training, MixMatch is applied to the unlabeled and labeled batches with two modifications: distribution alignment is applied, and each weakly augmented example is replaced by a strongly augmented one, with the weakly augmented examples used only to predict the proxy labels for the strongly augmented unlabeled examples. With the two augmented batches, the supervised and unsupervised losses are both computed using the cross-entropy loss as follows:

(5.5)

In addition to these losses, the authors add a self-supervised loss. First, a new unlabeled batch is created by rotating all of the examples by an angle $r \in \{0°, 90°, 180°, 270°\}$. The rotated examples are then used to compute a self-supervised loss, in which a classification layer on top of the model predicts the applied rotation, in addition to the cross-entropy loss over the rotated examples:

(5.6)

5.3 FixMatch

Figure 15: FixMatch. The model's prediction on a weakly augmented input is used as a target if the maximum output class probability is above a threshold; this target is then used to train the model on a strongly augmented version of the same input with a standard cross-entropy loss. Image Source: [12].

FixMatch [141] is a simple SSL algorithm that combines consistency regularization and pseudo-labeling. In FixMatch (fig. 15), both the supervised and unsupervised losses are computed using a cross-entropy loss. For labeled examples, the provided targets are used. For an unlabeled example $u$, a weakly augmented version is first computed using a weak augmentation function $\alpha(\cdot)$. As in self-training, the predicted label is considered a proxy label if the highest class probability is greater than a threshold $\tau$. Given such a proxy label, strongly augmented versions are generated using a strong augmentation function $\mathcal{A}(\cdot)$, and these augmented versions are assigned the proxy label obtained from the weakly augmented version. The unsupervised loss can be written as follows:

$\mathcal{L}_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}\big(\max(q_b) \geq \tau\big)\, \mathrm{CE}\big(\hat{q}_b,\; p_{\text{model}}(y \mid \mathcal{A}(u_b))\big)$    (5.7)

where $q_b = p_{\text{model}}(y \mid \alpha(u_b))$, $\hat{q}_b = \arg\max(q_b)$, $B$ is the labeled batch size and $\mu$ is the ratio of unlabeled to labeled examples in a batch.
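A sketch of eq. (5.7) in PyTorch, assuming `weak_aug` and `strong_aug` are callables applying $\alpha(\cdot)$ and $\mathcal{A}(\cdot)$ respectively; the names and structure are illustrative.

```python
import torch
import torch.nn.functional as F

def fixmatch_unsup_loss(model, u, weak_aug, strong_aug, tau=0.95):
    """Pseudo-label the weakly augmented view, then apply a confidence-masked
    cross-entropy on the strongly augmented view of the same inputs."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(u)), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= tau).float()              # keep only confident pseudo-labels
    loss = F.cross_entropy(model(strong_aug(u)), pseudo, reduction="none")
    return (mask * loss).mean()
```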

Augmentations. Weak augmentations consist of a standard flip-and-shift augmentation strategy. Specifically, the images are flipped horizontally with a probability of 50% on all datasets except SVHN, in addition to randomly translating images by up to 12.5% vertically and horizontally. For the strong augmentations, RandAugment and CTAugment [12] are used, where a given transformation (e.g., color inversion, translation, contrast adjustment) is randomly selected for each sample in a batch of training examples, and the amplitude of the transformation is a hyperparameter that is optimized during training.

Other important factors in FixMatch are the choice of optimizer, weight decay regularization, and the learning rate schedule: the authors use SGD with momentum rather than Adam [80], and propose a cosine learning rate decay of $\eta \cos\left(\frac{7 \pi k}{16 K}\right)$, where $\eta$ is the initial learning rate, $k$ is the current training step, and $K$ is the total number of training iterations.
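The schedule amounts to a few lines of code; the following sketch assumes the cosine form given above:

```python
import math

def cosine_lr(eta0: float, step: int, total_steps: int) -> float:
    """Cosine learning-rate decay: starts at eta0 and decays following
    eta0 * cos(7 * pi * k / (16 * K)) over K total training steps."""
    return eta0 * math.cos(7.0 * math.pi * step / (16.0 * total_steps))
```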

6 Generative Models

In unsupervised learning, we are provided with samples drawn i.i.d. from an unknown data distribution, and the objective is to estimate its density. Supervised learning, on the other hand, consists of estimating a functional relationship between the inputs and the labels, with the goal of minimizing a functional of the joint distribution [19]. Classification can be treated as a special case of estimating this relationship, where we are only interested in the conditional distribution $p(y \mid x)$, without the need to estimate the input distribution $p(x)$, since $x$ will always be given at prediction time. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p(x)$ provided by the unlabeled data) or as an extension of unsupervised learning (clustering plus the labels provided for part of the data). In this section, we explore some generative approaches for deep SSL.

6.1 Variational Autoencoders for SSL

Variational Autoencoders (VAEs) [81, 35] have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. A standard VAE is an autoencoder trained with a reconstruction objective between the inputs and their reconstructed versions, together with a variational objective that pushes the learned latent space to roughly follow a unit Gaussian distribution; this objective is implemented as the KL-divergence between the latent distribution and the standard Gaussian. For an input $x$, the conditional distribution $q_\phi(z \mid x)$ is modeled by an encoder, the prior $p(z)$ is the standard Gaussian distribution, and the reconstructed input is generated using a decoder $p_\theta(x \mid z)$. The parameters $\phi$ and $\theta$ are trained to minimize the following objective:

$\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$    (6.1)
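A sketch of eq. (6.1) for a Gaussian encoder parameterized by a mean and log-variance; using a squared error for the reconstruction term is an assumption of this sketch, corresponding to a Gaussian decoder.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction error plus the KL divergence between the
    posterior N(mu, exp(log_var)) and the standard Gaussian prior N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```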

6.1.1 Variational Autoencoder

Kingma et al. [82] extended the work on variational generative techniques [81, 126] to SSL, exploiting generative descriptions of the data to improve upon the classification performance that would be obtained using the labeled data alone.

Standard VAEs for SSL (M1 Model)

The first model consists of an unsupervised pretraining stage, in which a VAE is trained using both the labeled and unlabeled examples. Using the fully trained VAE, the observed labeled data are transformed into the latent space defined by the encoder $q_\phi(z \mid x)$; the standard supervised task can then be solved on the pairs of latent codes and their labels. With this approach, the classification is performed in a lower-dimensional space, since the dimensionality of the latent variables $z$ is much smaller than that of the observations. These low-dimensional embeddings are also more easily separable, since the latent space is formed by independent Gaussian posteriors parameterized by an encoder built as a sequence of non-linear transformations of the inputs.

Extending VAEs for SSL (M2 Model)

In the M1 model, the labels of the labeled examples were ignored when training the VAE. With the second model, in addition to the latent variable $z$, the class label $y$ is treated as a latent variable whenever it is not available and is used for training. The network in this case contains three components: a classification network modeling $q_\phi(y \mid x)$, an encoder modeling $q_\phi(z \mid x, y)$, and a decoder modeling $p_\theta(x \mid y, z)$, with parameters $\phi$ and $\theta$. The training is similar to that of a standard VAE, with the addition of the posterior over $y$ and extra loss terms to train the classifier when the labels are available. The distribution $q_\phi(y \mid x)$ can then be used at test time to obtain predictions for unseen data.

Stacked VAEs (M1+M2 Model)

The two previous models can be stacked to form a joint model. In this case, the model M1 is first trained to obtain the latent variables $z_1$; the model M2 then uses these latent variables $z_1$ as the new representation of the data, instead of the raw inputs $x$, and introduces its own latent variable $z_2$. The final model can be described as follows:

$p_\theta(x, y, z_1, z_2) = p(y)\,p(z_2)\,p_\theta(z_1 \mid y, z_2)\,p_\theta(x \mid z_1)$    (6.2)

6.1.2 Variational Auxiliary Autoencoder

The Variational Auxiliary Autoencoder [100, 122] extends the variational distribution with auxiliary variables $a$, $q(a, z \mid x) = q(z \mid a, x)\,q(a \mid x)$, such that the marginal distribution $q(z \mid x)$ can fit more complicated posteriors while improving the flexibility of inference. In order to leave the generative model $p(x \mid z)$ unchanged, it is required that the joint model gives back the original $p(x, z)$ under marginalization over $a$, thus $p(x, z, a) = p(a \mid x, z)\,p(x, z)$, with $p(a \mid x, z) \neq p(a)$ to avoid falling back to the original VAE model.

In SSL, to incorporate the class information, an additional latent variable $y$ is introduced into the generative model, with $a$, $y$, $z$ as the auxiliary variable, class label, and latent features respectively. In this case, the auxiliary unit introduces a latent feature extractor into the inference model, giving a richer mapping between $x$ and $y$. The resulting model is parameterized by five neural networks: 1) an auxiliary inference model, 2) a latent inference model, 3) a classification model, 4) a generative model for the auxiliary variable, and 5) a generative model for the inputs, which are trained on both generative and discriminative tasks simultaneously.

6.1.3 Infinite Variational Autoencoder

Another variation of VAEs for SSL is the Infinite Variational Autoencoder [42], which overcomes the VAE limitation of having to fix the dimensionality of the latent space and the number of parameters of the generative model in advance, where the capacity of the model must be chosen a priori with some foreknowledge of the training data characteristics. Infinite VAE solves this by producing an infinite mixture of autoencoders capable of growing with the complexity of the data to best capture its intrinsic structure. After training the generative model using unlabeled data, this model can be combined with the available labeled data to train a discriminative model, which is also a mixture of experts, for classification. For a given test example, each discriminative expert produces a tentative output that is then weighted by the generative model. As such, each discriminative expert learns to perform better on instances that are more structurally similar from the generative model's perspective. With a higher modeling capability, the infinite VAE is able to capture the distribution of the unlabeled data more accurately, providing a generative model that allows the discriminative model, trained based on its output, to be learned more effectively using a small number of samples.

6.2 Generative Adversarial Networks for SSL

Figure 16: GAN framework. During training, the discriminator alternates between receiving real images from the data distribution, with the goal of correctly classifying them as real, i.e., $D(x) = 1$, and generated samples, with the aim of correctly classifying them as fake, i.e., $D(G(z)) = 0$, while competing with the generator, which tries to generate real-looking samples that fool the discriminator, i.e., $D(G(z)) = 1$.

A Generative Adversarial Network (GAN) [54] (fig. 16) consists of a generator network $G$ and a discriminator network $D$. The generator receives a latent variable $z$ sampled from a prior distribution $p(z)$ and maps it to the input space. The discriminator takes an input, either coming from the real data or generated by $G$, and outputs the probability of that input coming from the real data distribution, represented by an empirical distribution over the training samples. The standard training procedure of GANs minimizes two objectives by alternating between training the discriminator $D$ and the generator $G$:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$    (6.3)

where $p(z)$ is usually chosen as a standard normal distribution. Other formulations have been proposed to improve and stabilize the training procedure, such as the hinge-loss version of the adversarial loss [97, 149] and the Wasserstein GAN (WGAN) [4], which are subsequently improved in several ways [174, 105, 175], for example by using spectral normalization [105] on both the generator and the discriminator, or consistency regularization on the discriminator [175].
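A minimal alternating update for eq. (6.3), assuming a discriminator that outputs probabilities of shape (N, 1) and using the commonly adopted non-saturating generator loss (an assumption, since the pure minimax loss is often replaced in practice):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_g, opt_d, z_dim=100):
    """One alternating GAN update: train D to separate real from generated
    samples, then train G to make D label its samples as real."""
    n = x_real.size(0)
    z = torch.randn(n, z_dim)
    # discriminator step: real -> 1, fake -> 0
    d_loss = F.binary_cross_entropy(D(x_real), torch.ones(n, 1)) + \
             F.binary_cross_entropy(D(G(z).detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # generator step: fool D into outputting 1 for generated samples
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```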

6.2.1 CatGAN

Categorical Generative Adversarial Networks (CatGAN) [143] combine both the generative and the discriminative perspectives within the training procedure. The discriminator in this case plays the role of a classifier and is trained to maximize the mutual information between the inputs and the predicted labels for a number of unknown classes. To aid this classifier in its task of discovering categories that generalize well to unseen data, and to avoid overfitting to spurious correlations in the data, the adversarial generator comes into play and provides the examples the discriminator must become robust to.

The traditional two-player game of the GAN framework is extended in CatGAN by having a discriminator that assigns each example to one of $K$ classes, instead of outputting a probability of the input belonging to the real data distribution, while staying uncertain about the class assignment for the samples generated by $G$. After training such a classifier-generator pair, where the discovered classes coincide with the classification problem we are interested in, the classifier can be used at inference time, having been trained only on unlabeled data.

The CatGAN objective dictates three requirements for the discriminator and two requirements for the generator that should be fulfilled:

  • Discriminator requirements: it should (1) be certain of the class assignment for samples from the real data distribution, (2) be uncertain of the assignment for generated samples, and (3) assuming a uniform prior over classes, use all classes equally.

  • Generator requirements: it should (1) generate samples with highly certain class assignments, and (2) similar to the discriminator, distribute the generated samples equally across all classes.

In order for the output class distribution to be highly peaked, i.e., for the discriminator to be certain about the class assignment of a real example, the entropy of the conditional class distribution must be low. For generated samples, the predictions should be highly uncertain, with a class distribution close to uniform, so the entropy must be high. The first two requirements can then be enforced by simply minimizing the conditional entropy on real inputs and maximizing it on generated samples. To meet the third requirement, that all classes should be used equally, the entropy of the marginal class distribution, as measured empirically over both the real and generated data, needs to be maximized:

$H\big[p(y \mid D)\big] \approx H\Big[\frac{1}{N} \sum_{i=1}^{N} p(y \mid x_i, D)\Big]$    (6.4)

Combining these requirements, the CatGAN objectives for the discriminator and the generator are:

(6.5)

In SSL, if the input comes from the labeled set with a label $y$ in the form of a one-hot vector, the discriminator is trained with a cross-entropy loss in addition to the unsupervised objective above:

(6.6)

where $\lambda$ is a cost weighting term.
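The entropy terms behind the CatGAN requirements can be sketched as follows; the names and signs are illustrative, written so that the returned value is maximized with respect to the discriminator.

```python
import torch

def entropy(p, eps=1e-8):
    """Shannon entropy of each row of a batch of class distributions."""
    return -(p * (p + eps).log()).sum(dim=-1)

def catgan_discriminator_objective(p_real, p_fake):
    """CatGAN discriminator terms: be certain on real samples (low conditional
    entropy), uncertain on generated samples (high conditional entropy), and
    use all classes equally (high marginal entropy over the batch)."""
    cond_real = entropy(p_real).mean()
    cond_fake = entropy(p_fake).mean()
    marginal = entropy(p_real.mean(dim=0, keepdim=True)).squeeze()
    return marginal - cond_real + cond_fake   # to be maximized w.r.t. the discriminator
```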

6.2.2 DCGAN

Another way of using GANs for SSL is to leverage the unlabeled examples to learn good and transferable intermediate representations, which can then be used for a variety of supervised learning tasks, such as image classification based on a small labeled set. Radford et al. [121] propose to build good image representations by training GANs and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. The authors propose Deep Convolutional GANs (DCGAN), a class of architectures with a set of constraints on the architectural topology of convolutional GANs that allows them to be scaled while remaining stable to train in most settings, such as replacing pooling layers with strided convolutions in the discriminator and fractionally-strided convolutions in the generator, using batch normalization [71] in both the generator and the discriminator, and removing fully connected layers in deeper architectures.

After training DCGANs for image generation, the representations learned by the discriminator can be utilized for downstream tasks, either by fine-tuning the discriminator features with an additional classification layer added on top and trained on the labeled set, or by flattening and concatenating the learned features and training a linear classifier on top of them.

6.2.3 SGAN

DCGAN demonstrated the utility of the learned representations for SSL, but it has several undesirable properties. Using the learned representations of the discriminator after the fact does not allow the classifier and the generator to be trained simultaneously; doing so is not only more efficient, but, more importantly, improving the discriminator improves the classifier, and improving the classifier improves the discriminator, which in turn improves the generator. Semi-Supervised GAN (SGAN) [110] takes advantage of this feedback loop by learning a generative model and a classifier simultaneously, significantly improving the classification performance and the quality of the generated samples while reducing training time.

Instead of a discriminator network outputting an estimated probability that the input image is drawn from the data distribution, for $K$ classes, SGAN uses a discriminator with $K+1$ outputs: one output per class, plus an additional output for the fake class. Training an SGAN is similar to training a GAN; the only difference is using the labels to train the discriminator when the input is drawn from the labeled set. The discriminator is trained to minimize the negative log-likelihood with respect to the given labels, and the generator is trained to maximize it.

6.2.4 Feature Matching GAN

Training GANs consists of finding a Nash equilibrium of a two-player non-cooperative game, with each player trying to minimize its own cost function. To find such an equilibrium, GAN training applies gradient descent on each player's cost simultaneously, but with such a training procedure, there is no guarantee of convergence. Feature matching [133] was proposed to encourage convergence; it addresses the instability of GANs by specifying a new objective for the generator that prevents it from overtraining on the current discriminator. Instead of directly maximizing the output of the discriminator, the new objective requires the generator to produce data that matches the first-order statistics of features of the data distribution, i.e., of the hidden representations of the discriminator. For the activations $f(x)$ of a given intermediate layer of the discriminator, the new objective is defined as:

$\big\| \mathbb{E}_{x \sim p_{data}} f(x) - \mathbb{E}_{z \sim p(z)} f(G(z)) \big\|_2^2$    (6.7)
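Eq. (6.7) is straightforward to express in code; here `f` stands for an intermediate activation of the discriminator (an assumed callable):

```python
import torch

def feature_matching_loss(f, x_real, x_fake):
    """Match the batch means of an intermediate discriminator activation
    between real and generated samples (eq. 6.7)."""
    return (f(x_real).mean(dim=0) - f(x_fake).mean(dim=0)).pow(2).sum()
```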

The problem of generator mode collapse, where the generator always emits the same point, is still present even with feature matching, because the discriminator processes each example independently: there is no coordination between its gradients, and thus no mechanism to push the outputs of the generator to become more dissimilar to each other. In addition to feature matching, a technique called minibatch discrimination is therefore integrated into the training procedure to allow the discriminator to look at multiple data examples in combination; the discriminator still classifies single examples as real or generated, but it can now use the other examples in the minibatch as side information.

For SSL, similar to SGAN, the discriminator in a feature matching GAN employs a $(K+1)$-class objective instead of binary classification, where true samples are classified into the first $K$ classes and generated samples into the $(K+1)$-th, fake, class; the probability of being fake in this case is $p_{\text{model}}(y = K+1 \mid x)$, corresponding to $1 - D(x)$ in the original GAN framework. The loss function for training the classifier then becomes $\mathcal{L} = \mathcal{L}_{\text{supervised}} + \mathcal{L}_{\text{unsupervised}}$, where:

$\mathcal{L}_{\text{supervised}} = -\mathbb{E}_{(x, y) \sim p_{data}}\big[\log p_{\text{model}}(y \mid x, y < K+1)\big]$
$\mathcal{L}_{\text{unsupervised}} = -\mathbb{E}_{x \sim p_{data}}\big[\log\big(1 - p_{\text{model}}(y = K+1 \mid x)\big)\big] - \mathbb{E}_{x \sim G}\big[\log p_{\text{model}}(y = K+1 \mid x)\big]$    (6.8)

The above objective is similar to the original GAN formulation if we consider $p_{\text{model}}(y = K+1 \mid x)$ to be the probability of a sample being fake; the only difference is that the probability of a sample being true is split into $K$ sub-classes. This $(K+1)$-class discriminator objective led to strong empirical results and was later widely used to evaluate the effectiveness of generative models [40, 150]. The main drawback is that feature matching works well in classification but fails to generate indistinguishable samples, while minibatch discrimination is good at realistic image generation but cannot predict labels accurately.
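A sketch of the $(K+1)$-class losses of eq. (6.8), assuming the discriminator returns $K+1$ logits with index $K$ reserved for the fake class; the names and indexing are illustrative.

```python
import torch
import torch.nn.functional as F

def kplus1_losses(logits_lab, y_lab, logits_unlab, logits_fake, K):
    """Supervised CE on labeled data over the K real classes; unlabeled data
    should be classified as real (any of the first K classes) and generated
    data as the (K+1)-th, fake, class."""
    sup = F.cross_entropy(logits_lab[:, :K], y_lab)
    log_p = F.log_softmax(logits_unlab, dim=-1)
    real_unlab = -torch.logsumexp(log_p[:, :K], dim=-1).mean()   # -log p(real | x)
    fake_target = torch.full((logits_fake.size(0),), K, dtype=torch.long)
    fake_gen = F.cross_entropy(logits_fake, fake_target)
    return sup, real_unlab + fake_gen
```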

6.2.5 Bad GAN

The feature matching GAN formulation raises two questions. First, it is not clear why this discriminator formulation improves performance when combined with a generator. Second, it seems that good semi-supervised learning and a good generator cannot be obtained at the same time. Dai et al. [30] addressed these questions by showing that, for a $(K+1)$-class discriminator formulation of GAN-based SSL, good semi-supervised learning requires a bad generator that does not match the true data distribution, but simply plays the role of a complement generator that helps the discriminator obtain correct decision boundaries in high-density areas of the feature space.

To overcome the drawbacks of feature matching GANs, the new objective function of the generator is:

(6.9)

where $p_G$ is the distribution induced by the generator $G$, $\mathbb{1}[\cdot]$ is an indicator function and $\epsilon$ is a threshold. The first term maximizes the entropy of the generator to avoid the collapsing issues that are a clear sign of low entropy; given that GANs are implicit generative models that only provide samples rather than an analytic density, the entropy can either be optimized in the input space, i.e., using variational inference, or in the feature space, i.e., using a pull-away term (PT) [181] as an auxiliary cost for the entropy. The second term enforces the generation of samples with low density in the input space by pushing the generated samples towards low-density regions, defined by a probability distribution over images that is estimated using a PixelCNN++ [134] model pretrained on the training set and kept fixed during semi-supervised training. The last term is the feature matching objective. This method substantially improves image classification performance over vanilla feature matching GANs on several benchmark datasets.

6.2.6 Triple-GAN

As discussed for Bad GAN, the generator and the discriminator (i.e., the classifier) may not be optimal at the same time: for an optimal generator whose samples match the true data distribution, an optimal discriminator should identify the generated samples as fake, yet, as a classifier, the same discriminator should confidently predict the correct class of such realistic samples, indicating that the discriminator and generator cannot both be optimal simultaneously. Instead of learning a complement generator for classification, Triple-GAN [24] is designed to simultaneously achieve good generation of realistic-looking samples conditioned on class labels and produce a good classifier with the smallest possible prediction error.

Triple-GAN consists of three components: (1) a classifier that characterizes the conditional distribution $p_c(y \mid x)$; (2) a class-conditional generator that characterizes the conditional distribution in the other direction, $p_g(x \mid y)$; and (3) a discriminator that distinguishes whether a pair $(x, y)$ comes from the true joint distribution $p(x, y)$. All the components are parameterized as neural networks. The desired equilibrium is that the joint distributions defined by the classifier and by the generator both converge to the true data distribution.

With $p(x)$ as the empirical distribution of inputs and $p(y)$ as a uniform distribution, which is assumed to be the same as the distribution of labels on labeled data, the classifier produces pseudo-labels $y \sim p_c(y \mid x)$ given $x$; in this case, the examples and the pseudo-labels are drawn from the joint distribution $p_c(x, y) = p(x)\,p_c(y \mid x)$. Similarly, the generator produces examples $x$ conditioned on $y \sim p(y)$ and on latent variables $z$, so the generated examples and labels are drawn from the joint distribution $p_g(x, y) = p(y)\,p_g(x \mid y)$. These pseudo input-label pairs produced by both the classifier and the generator are sent to the single discriminator. The objective function is formulated as:

(6.10)

where $\alpha$ is a constant that controls the relative importance of generation and classification. To properly leverage unlabeled data, additional regularization is enforced on the classifier, consisting of minimizing the conditional entropy of the classifier's predictions on unlabeled data, the cross-entropy between the true labels and the classifier's predictions on labeled data, and a consistency regularization term with dropout as the source of noise. In such a setting, the classifier achieves high accuracy with only a very small labeled set, while the generator produces state-of-the-art images, even when conditioned on labels.

Enhanced TripleGAN (EnhancedTGAN) [165] improves Triple-GAN by adopting class-wise mean feature matching to regularize the generator and a semantic matching term to ensure the semantic consistency of the synthesized data between the generator and the classifier, further improving the state-of-the-art results in both SSL and instance synthesis.

6.2.7 BiGAN

One of the limitations of the traditional GAN framework is that it cannot infer latent representations that could serve as rich features of the data for more efficient training. Unlike VAEs, whose inference network (i.e., the encoder) learns a variational posterior over the latent variables, the generator of a GAN is typically a directed latent variable model with latent variables $z$ and observed variables $x$, making it unable to infer the latent feature representation for a given data point. BiGAN [36] solves this by introducing an encoder $E$ as an additional component in the GAN framework, which maps data $x$ to latent representations $z$. The BiGAN discriminator discriminates not only in data space between $x$ and $G(z)$, but jointly in data and latent space, between the pairs $(x, E(x))$ and $(G(z), z)$, where the latent component is either an encoder output $E(x)$ or a generator input $z$. A trained BiGAN encoder can then serve as a feature extractor for downstream tasks. The BiGAN training objective is defined as a minimax objective:

$\min_{G, E} \max_D \; \mathbb{E}_{x \sim p_{data}}\big[\log D(x, E(x))\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z), z)\big)\big]$    (6.11)

Kumar et al. [88] proposed Augmented-BiGAN, an improved version of BiGAN for SSL. Augmented-BiGAN is similar to other GAN frameworks used for SSL, treating the generated samples as an additional class beyond the regular classes that the classifier aims to label, with an additional Jacobian-based regularization that encourages the classifier to be robust to local variations along the tangent space of the input manifold. The trained BiGAN encoder is used in computing these Jacobians, yielding an efficient estimation of the tangent space at each training sample and avoiding the expensive SVD-based method used in contractive autoencoders [128].

7 Graph-Based SSL

Graphs are a powerful tool to model interactions and relations between different entities, in order to understand the represented system in both a global and a local manner. In graph-based SSL [191] methods, each data point, be it labeled or unlabeled, is represented as a node in the graph, and the edge connecting each pair of nodes reflects their similarity. Formally, a graph is a collection of vertices, or nodes, and edges. The adjacency matrix of a graph describes its structure, with each element $a_{ij}$ being a non-negative weight associated with the edge between nodes $i$ and $j$; if two nodes are not connected, then $a_{ij} = 0$. The adjacency matrix can either be derived using a similarity measure between the data points [188, 72], or be explicitly derived from external data, such as a knowledge graph [163], and provided as input. Graph-based tasks can be broadly categorized into four categories [55]: node classification, link prediction, clustering, and visualization. Graph methods can also be transductive or inductive in nature; transductive methods are only capable of producing label assignments for the examples seen during training (i.e., the unlabeled nodes of the graph), while inductive methods are more generalizable and can be transferred and applied to unseen examples. In this section, we will discuss node classification approaches, given that the objective in SSL is to assign labels to the unlabeled examples. Node classification approaches can be broadly grouped into methods that propagate labels from labeled nodes to unlabeled nodes, based on the assumption that nearby nodes tend to have the same labels [6, 188, 183], and methods that learn node embeddings, based on the assumption that nearby nodes should have similar embeddings in vector space, and then apply classifiers on the learned embeddings [57]. First, we start with some graph construction approaches and then discuss several popular methods for graph-based SSL.
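When the graph is not given, a common similarity-based construction of the adjacency matrix is a fully connected graph with Gaussian (RBF) edge weights; a minimal sketch follows, where the bandwidth sigma is an assumed hyperparameter.

```python
import numpy as np

def rbf_adjacency(X, sigma=1.0):
    """Fully connected similarity graph: a_ij = exp(-||x_i - x_j||^2 / sigma^2),
    with zeros on the diagonal so nodes have no self-loops."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq_dists / sigma ** 2)
    np.fill_diagonal(A, 0.0)
    return A
```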

7.1 Graph Construction

To apply graph-based SSL, we first need a graph. The graph can either be provided as input in the form of an adjacency matrix or be constructed to reflect the similarity of the nodes. A useful graph should reflect our prior knowledge about the domain, and it is the practitioner's responsibility to feed a good graph to graph-based SSL algorithms in order to produce valuable outputs (for more details, see Chapters 3 & 7 of [188]).

In case we have limited domain knowledge about the dataset at hand, Zhu et al. describe some common ways to create graphs:

  • Fully connected graphs. A simple form the graph can take is a fully connected graph with weighted edges between all pairs of nodes. With full connectivity, the derivatives of the graph-based objective w.r.t. the weights can be computed to update the weights