S^4L: Self-Supervised Semi-Supervised Learning

by   Xiaohua Zhai, et al.

This work tackles the problem of semi-supervised learning of image classifiers. Our main insight is that the field of semi-supervised learning can benefit from the quickly advancing field of self-supervised visual representation learning. Unifying these two approaches, we propose the framework of self-supervised semi-supervised learning (S^4L) and use it to derive two novel semi-supervised image classification methods. We demonstrate the effectiveness of these methods in comparison to both carefully tuned baselines, and existing semi-supervised learning methods. We then show that S^4L and existing semi-supervised methods can be jointly trained, yielding a new state-of-the-art result on semi-supervised ILSVRC-2012 with 10


1 Introduction

Modern computer vision systems demonstrate outstanding performance on a variety of challenging computer vision benchmarks, such as image recognition [34], object detection [22], semantic image segmentation [8], etc. Their success relies on the availability of a large amount of annotated data that is time-consuming and expensive to acquire. Moreover, the applicability of such systems is typically limited to the scope defined by the dataset they were trained on.

Many real-world computer vision applications are concerned with visual categories that are not present in standard benchmark datasets, or with applications of dynamic nature where visual categories or their appearance may change over time. Unfortunately, building large labeled datasets for all these scenarios is not practically feasible. Therefore, it is an important research challenge to design a learning approach that can successfully learn to recognize new concepts by leveraging only a small amount of labeled examples. The fact that humans quickly understand new concepts after seeing only a few (labeled) examples suggests that this goal is achievable in principle.

Figure 1: A schematic illustration of one of the proposed self-supervised semi-supervised techniques: S^4L-Rotation. Our model makes use of both labeled and unlabeled images. The first step is to create four input images for any image by rotating it by 0°, 90°, 180°, and 270° (inspired by [10]). Then, we train a single network that predicts which rotation was applied to all these images and, additionally, predicts semantic labels of annotated images. This conceptually simple technique is competitive with existing semi-supervised learning methods.

Notably, a large research effort is dedicated towards learning from unlabeled data that, in many realistic applications, is much less onerous to acquire than labeled data. Within this effort, the field of self-supervised visual representation learning has recently demonstrated the most promising results [17]. Self-supervised learning techniques define pretext tasks which can be formulated using only unlabeled data, but do require higher-level semantic understanding in order to be solved. As a result, models trained for solving these pretext tasks learn representations that can be used for solving other downstream tasks of interest, such as image recognition.

Despite demonstrating encouraging results [17], purely self-supervised techniques learn visual representations that are significantly inferior to those delivered by fully-supervised techniques. Thus, their practical applicability is limited and as of yet, self-supervision alone is insufficient.

We hypothesize that self-supervised learning techniques could dramatically benefit from a small number of labeled examples. By investigating various ways of doing so, we bridge self-supervised and semi-supervised learning, and propose a framework of semi-supervised losses arising from self-supervised learning targets. We call this framework self-supervised semi-supervised learning or, in short, S^4L. The techniques derived in that way can be seen as new semi-supervised learning techniques for natural images. Figure 1 illustrates the idea of the proposed techniques. We thus evaluate our models both in the semi-supervised setup, as well as in the transfer setup commonly used to evaluate self-supervised representations. Moreover, we design strong baselines for benchmarking methods which learn using only 10% or 1% of the labels in ILSVRC-2012.

We further investigate experimentally whether our methods could benefit from regularizations proposed in the semi-supervised literature, and discover that they are complementary, i.e., combining them leads to improved results.

Our main contributions can be summarized as follows:

  • We propose a new family of techniques for semi-supervised learning with natural images that leverage recent advances in self-supervised representation learning.

  • We demonstrate that the proposed self-supervised semi-supervised (S^4L) techniques outperform carefully tuned baselines that are trained with no unlabeled data, and achieve performance competitive with previously proposed semi-supervised learning techniques.

  • We further demonstrate that by combining our best methods with existing semi-supervised techniques, we achieve new state-of-the-art performance on the semi-supervised ILSVRC-2012 benchmark.

2 Related Work

In this work we build on top of the current state-of-the-art in both fields of semi-supervised and self-supervised learning. Therefore, in this section we review the most relevant developments in these fields.

2.1 Semi-supervised Learning

Semi-supervised learning describes a class of algorithms that seek to learn from both unlabeled and labeled samples, typically assumed to be sampled from the same or similar distributions. Approaches differ on what information to gain from the structure of the unlabeled data.

Given the wide variety of semi-supervised learning techniques proposed in the literature, we refer to [4] for an extensive survey. For more context, we focus on recent developments based on deep neural networks.

The standard protocol for evaluating semi-supervised learning algorithms works as follows: (1) Start with a standard labeled dataset; (2) Keep only a portion of the labels (say, 10%) on that dataset; (3) Treat the rest as unlabeled data. While this approach may not reflect realistic settings for semi-supervised learning [30], it remains the standard evaluation protocol, which we follow in this work.
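The three-step protocol above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the function name `split_labeled_unlabeled`, the toy dataset, and the `seed` argument are hypothetical, and the class-balanced sampling is an assumption borrowed from the experimental setup in Section 4.

```python
import random
from collections import defaultdict

def split_labeled_unlabeled(labels, fraction, seed=0):
    """Keep a class-balanced `fraction` of the labels; treat the rest as unlabeled.

    `labels` holds one class id per example. Returns two lists of example
    indices: (labeled_indices, unlabeled_indices).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    labeled = []
    for indices in by_class.values():
        # Keep at least one labeled example per class.
        keep = max(1, round(fraction * len(indices)))
        labeled.extend(rng.sample(indices, keep))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled

labels = [i % 5 for i in range(100)]  # toy dataset: 5 classes, 20 examples each
labeled_idx, unlabeled_idx = split_labeled_unlabeled(labels, fraction=0.1)
```

The labels of the examples in `unlabeled_idx` are simply ignored during training, which is what makes the benchmark "semi-supervised" even though the underlying dataset is fully annotated.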

Many of the initial results on semi-supervised learning with deep neural networks were based on generative models such as denoising autoencoders [33], variational autoencoders [16] and generative adversarial networks [29, 35]. More recently, a line of research showed improved results on standard baselines by adding consistency regularization losses computed on unlabeled data. These consistency regularization losses measure discrepancy between predictions made on perturbed unlabeled data points. Additional improvements have been shown by smoothing predictions before measuring these perturbations. Approaches of this kind include the Π-Model [19], Temporal Ensembling [19], Mean Teacher [41] and Virtual Adversarial Training [24]. Recently, fast-SWA [1] showed improved results by training with cyclic learning rates and measuring discrepancy with an ensemble of predictions from multiple checkpoints. By minimizing consistency losses, these models implicitly push the decision boundary away from high-density parts of the unlabeled data. This may explain their success on typical image classification datasets, where points in each cluster typically share the same class.

Two additional important approaches for semi-supervised learning, which have shown success both in the context of deep neural networks and other types of models, are Pseudo-Labeling [20], where one imputes approximate classes on unlabeled data by making predictions from a model trained only on labeled data, and conditional entropy minimization [11], where all unlabeled examples are encouraged to make confident predictions on some class.

Semi-supervised learning algorithms are typically [30, 24, 2, 42, 1, 23] evaluated on small-scale datasets such as CIFAR-10 [18] and SVHN [25]. We are aware of very few examples in the literature where semi-supervised learning algorithms are evaluated on larger, more challenging datasets such as ILSVRC-2012 [34]. To our knowledge, Mean Teacher [41] currently holds the state-of-the-art result on ILSVRC-2012 when using only 10% of the labels. Recent concurrent work [43, 13] presents competitive results on ILSVRC-2012.

2.2 Self-supervised Learning

Self-supervised learning is a general learning framework that relies on surrogate (pretext) tasks that can be formulated using only unsupervised data. A pretext task is designed in a way that solving it requires learning of a useful image representation. Self-supervised techniques have a variety of applications in a broad range of computer vision topics [15, 37, 7, 31, 36].

In this paper we employ self-supervised learning techniques that are designed to learn useful visual representations from image databases. These techniques achieve state-of-the-art performance among approaches that learn visual representations from unsupervised images only. Below we provide a non-comprehensive summary of the most important developments in this direction.

Doersch et al. propose to train a CNN model that predicts relative location of two randomly sampled non-overlapping image patches [5]. Follow-up papers [26, 28] generalize this idea for predicting a permutation of multiple randomly sampled and permuted patches.

Beside the above patch-based methods, there are self-supervised techniques that employ image-level losses. Among those, the authors of [44] propose to use grayscale image colorization as a pretext task. Another example is a pretext task [10] that predicts the angle of the rotation transformation that was applied to an input image.

Some techniques go beyond solving surrogate classification tasks and enforce constraints on the representation space. A prominent example is the exemplar loss from [6] that encourages the model to learn a representation that is invariant to heavy image augmentations. Another example is [27], which enforces an additivity constraint on the visual representation: the sum of the representations of all image patches should be close to the representation of the whole image. Finally, [3] proposes a learning procedure that alternates between clustering images in the representation space and learning a model that assigns images to their clusters.

3 Methods

In this section we present our self-supervised semi-supervised learning (S^4L) techniques. We first provide a general description of our approach. Afterwards, we introduce specific instantiations of our approach.

We focus on the semi-supervised image classification problem. Formally, we assume an (unknown) data generating joint distribution $p(X, Y)$ over images and labels. The learning algorithm has access to a labeled training set $D_l$, which is sampled i.i.d. from $p(X, Y)$, and an unlabeled training set $D_u$, which is sampled i.i.d. from the marginal distribution $p(X)$.

The semi-supervised methods we consider in this paper have a learning objective of the following form:

$$\min_{\theta} \; \mathcal{L}_l(D_l, \theta) + w \, \mathcal{L}_u(D_u, \theta), \qquad (1)$$

where $\mathcal{L}_l$ is a standard cross-entropy classification loss of all labeled images in the dataset, $\mathcal{L}_u$ is a loss defined on unsupervised images (we discuss its particular instances later in this section), $w$ is a non-negative scalar weight and $\theta$ is the parameters for model $f_\theta(\cdot)$. Note that the learning objective can be extended to multiple unsupervised losses.

3.1 Self-supervised Semi-supervised Learning

We now describe our self-supervised semi-supervised learning techniques. For simplicity, we present our approach in the context of multiclass image recognition, even though it can be easily generalized to other scenarios, such as dense image segmentation.

It is important to note that in practice the objective (1) is optimized using stochastic gradient descent (or a variant) that uses mini-batches of data to update the parameters $\theta$. In this case the size of a supervised mini-batch $x_l, y_l \subset D_l$ and an unsupervised mini-batch $x_u \subset D_u$ can be chosen arbitrarily. In our experiments we always default to the simplest possible option of using minibatches of equal sizes.

We also note that we can choose whether to include the minibatch $x_l$ into the self-supervised loss, i.e. apply $\mathcal{L}_{self}$ to the union of $x_u$ and $x_l$. We experimentally study the effect of this choice in Section 4.4.

We demonstrate our framework on two prominent self-supervised techniques: predicting image rotation [10] and exemplar [6]. Note that with our framework, more self-supervised losses can be explored in the future.

S^4L-Rotation.  The key idea of rotation self-supervision is to rotate an input image and then predict which rotation degree was applied to the rotated image. The loss is defined as:

$$\mathcal{L}_{rot} = \frac{1}{|R|} \sum_{r \in R} \sum_{x \in D_u} \mathcal{L}(f_\theta(x^r), r),$$

where $R$ is the set of the four rotation degrees $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$, $x^r$ is the image $x$ rotated by $r$, $f_\theta(\cdot)$ is the model with parameters $\theta$, and $\mathcal{L}$ is the cross-entropy loss. This results in a 4-class classification problem. We follow a recommendation from [10] and in a single optimization step we always apply and predict all four rotations for every image in a minibatch.

We also apply the self-supervised loss to the labeled images in each minibatch. Since we process rotated supervised images in this case, we suggest also applying a classification loss to these images. This can be seen as an additional way to regularize a model in a regime where only a small number of labeled images is available. We measure the effect of this choice later in Section 4.4.
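The rotation pretext task and its place in the overall objective can be sketched as follows. This is a toy illustration under stated assumptions: images are plain nested lists, `predict_rotation` is a hypothetical stand-in for the rotation head of the network, and the loss is averaged over all rotated copies, which differs from the summed form only by a constant factor that can be absorbed into the loss weight.

```python
import math

def rot90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def four_rotations(image):
    """Return [(rotated_image, rotation_class)] for 0, 90, 180 and 270 degrees."""
    rotated, current = [], [list(row) for row in image]
    for rotation_class in range(4):
        rotated.append((current, rotation_class))
        current = rot90(current)
    return rotated

def cross_entropy(probs, target):
    return -math.log(probs[target])

def rotation_loss(predict_rotation, images):
    """Cross-entropy of the 4-way rotation prediction, averaged over all rotated copies."""
    losses = [cross_entropy(predict_rotation(copy), rotation_class)
              for image in images
              for copy, rotation_class in four_rotations(image)]
    return sum(losses) / len(losses)

def total_objective(supervised_loss, self_supervised_loss, w):
    """Supervised loss plus the weighted self-supervised loss."""
    return supervised_loss + w * self_supervised_loss

image = [[1, 2], [3, 4]]
uniform = lambda rotated_image: [0.25, 0.25, 0.25, 0.25]  # untrained stand-in model
loss_rot = rotation_loss(uniform, [image])                 # equals log(4)
objective = total_objective(supervised_loss=1.0, self_supervised_loss=loss_rot, w=0.5)
```

Because the rotation labels are generated from the images themselves, `rotation_loss` can be evaluated on labeled and unlabeled minibatches alike.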

S^4L-Exemplar.  The idea of exemplar self-supervision [6] is to learn a visual representation that is invariant to a wide range of image transformations. Specifically, we use “Inception” cropping [40], random horizontal mirroring, and HSV-space color randomization as in [6] to produce 8 different instances of each image in a minibatch. Following [17], we implement $\mathcal{L}_{exemplar}$ as the batch hard triplet loss [14] with a soft margin. This encourages transformations of the same image to have similar representations and, conversely, encourages transformations of different images to have diverse representations.

Similarly to the rotation self-supervision case, the classification loss $\mathcal{L}_l$ is applied to all eight instances of each image.
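A batch hard triplet loss with a soft margin can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distances between embeddings, uses the softplus function as the soft margin, and the names `batch_hard_triplet_loss` and `image_ids` are hypothetical. Each embedding is tagged with the id of the source image it was augmented from.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def softplus(z):
    # Numerically stable log(1 + exp(z)); plays the role of the soft margin.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def batch_hard_triplet_loss(embeddings, image_ids):
    """For every anchor, contrast its farthest positive with its closest negative."""
    losses = []
    for i, anchor in enumerate(embeddings):
        positives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if j != i and image_ids[j] == image_ids[i]]
        negatives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if image_ids[j] != image_ids[i]]
        if positives and negatives:  # skip anchors without a valid triplet
            losses.append(softplus(max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0

# Two source images with two augmented instances each (toy 2-D embeddings).
embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
well_grouped = batch_hard_triplet_loss(embeddings, image_ids=[0, 0, 1, 1])
badly_grouped = batch_hard_triplet_loss(embeddings, image_ids=[0, 1, 0, 1])
```

The loss is small when instances of the same image lie close together and far from instances of other images, so `well_grouped` comes out lower than `badly_grouped`.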

3.2 Semi-supervised Baselines

In the following section, we compare to several leading semi-supervised learning algorithms that are not based on self-supervised objectives. We now describe the approaches that we compare to.

Our proposed objective (1) is applicable to semi-supervised learning methods as well, where the loss $\mathcal{L}_u$ is the standard semi-supervised loss as described below.

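Virtual Adversarial Training  (VAT) [24]: The idea is to make the predicted labels robust to local perturbations around each input data point. It approximates the maximal change in predictions within an $\epsilon_{vat}$ vicinity of unlabeled data points, where $\epsilon_{vat}$ is a hyperparameter. Concretely, the VAT loss for a model $f_\theta$ is:

$$\mathcal{L}_{vat}(D_u, \theta) = \frac{1}{|D_u|} \sum_{x \in D_u} \mathrm{KL}\big(f_\theta(x) \,\|\, f_\theta(x + \Delta x)\big),$$

$$\text{where} \quad \Delta x = \arg\max_{\delta \ \text{s.t.}\ \|\delta\|_2 = \epsilon_{vat}} \mathrm{KL}\big(f_\theta(x) \,\|\, f_\theta(x + \delta)\big).$$

While computing $\Delta x$ directly is not tractable, it can be efficiently approximated at the cost of an extra forward and backward pass for every optimization step [24].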
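The approximation of the adversarial direction can be illustrated with a single power-iteration step. The sketch below rests on several assumptions that are not from the paper: a toy linear-softmax classifier stands in for the network, a central finite difference replaces backpropagation, and the parameters `xi`, `h`, and `seed` are illustrative choices.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Toy linear-softmax classifier standing in for the network."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]

def vat_perturbation(weights, x, eps, xi=1e-2, h=1e-5, seed=0):
    """One power-iteration step: the gradient of the KL term at a small random offset."""
    rng = random.Random(seed)
    p = predict(weights, x)
    r = [xi * c for c in normalize([rng.gauss(0.0, 1.0) for _ in x])]

    def divergence(offset):
        return kl(p, predict(weights, [a + b for a, b in zip(x, offset)]))

    grad = []
    for i in range(len(x)):
        plus, minus = list(r), list(r)
        plus[i] += h
        minus[i] -= h
        grad.append((divergence(plus) - divergence(minus)) / (2 * h))
    return [eps * c for c in normalize(grad)]  # rescale to the eps-sphere

def vat_loss(weights, unlabeled, eps):
    total = 0.0
    for x in unlabeled:
        delta = vat_perturbation(weights, x, eps)
        shifted = [a + b for a, b in zip(x, delta)]
        total += kl(predict(weights, x), predict(weights, shifted))
    return total / len(unlabeled)

weights = [[1.0, 0.0], [0.0, 1.0]]
x = [0.5, -0.5]
delta = vat_perturbation(weights, x, eps=0.1)
```

For this toy model the prediction depends only on the difference of the two input coordinates, so the recovered direction is proportional to (1, -1), the direction in which the prediction changes fastest.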

Conditional Entropy Minimization  (EntMin) [11]: This works under the assumption that unlabeled data indeed has one of the classes that we are training on, even when the particular class is not known during training. It adds a loss for unlabeled data that, when minimized, encourages the model to make confident predictions on unlabeled data. Specifically, the conditional entropy minimization loss for a model $f_\theta$ (treating $f_\theta$ as a conditional distribution of labels over images) is:

$$\mathcal{L}_{entmin}(D_u, \theta) = -\frac{1}{|D_u|} \sum_{x \in D_u} \sum_{y \in Y} f_\theta(y|x) \log f_\theta(y|x).$$

Alone, the EntMin loss is not useful in the context of deep neural networks because the model can easily become extremely confident by increasing the weights of the last layer. One way to resolve this is to encourage the model predictions to be locally-Lipschitz, which VAT does [38]. Therefore, we only consider VAT and EntMin combined, not just EntMin alone, i.e.

$$\mathcal{L}_u(D_u, \theta) = w_{vat} \, \mathcal{L}_{vat}(D_u, \theta) + w_{entmin} \, \mathcal{L}_{entmin}(D_u, \theta).$$
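The conditional entropy term is straightforward to compute from predicted class probabilities. The following is a minimal sketch in which `prob_rows` is a hypothetical list of per-example probability vectors produced by the model on unlabeled data.

```python
import math

def conditional_entropy(prob_rows):
    """Mean entropy of the predicted class distributions over unlabeled examples."""
    total = 0.0
    for probs in prob_rows:
        # Terms with zero probability contribute nothing to the entropy.
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_rows)

uncertain = conditional_entropy([[0.25, 0.25, 0.25, 0.25]])  # maximal: log(4)
confident = conditional_entropy([[1.0, 0.0, 0.0, 0.0]])      # minimal: 0
```

Minimizing this quantity pushes predictions toward one-hot distributions, which is why it needs to be paired with a smoothness-inducing loss such as VAT.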

Pseudo-Label  [20] is a simple approach: train a model only on labeled data, then make predictions on unlabeled data. Then enlarge the training set with the predicted classes of the unlabeled data points whose predictions are confident past some threshold. Re-train the model with this enlarged labeled dataset. While [30] shows that on a simple "two moons" dataset pseudo-label fails to learn a good model, on many real datasets this approach does show meaningful gains.
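The selection step of pseudo-labeling can be sketched as follows. The function name `select_pseudo_labels` and the probability vectors are hypothetical; in practice the probabilities would come from the model trained on the labeled subset.

```python
def select_pseudo_labels(prob_rows, threshold):
    """Return (example_index, predicted_class) for sufficiently confident predictions."""
    selected = []
    for index, probs in enumerate(prob_rows):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if probs[predicted] >= threshold:
            selected.append((index, predicted))
    return selected

predictions = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
confident_only = select_pseudo_labels(predictions, threshold=0.7)
everything = select_pseudo_labels(predictions, threshold=0.0)
```

Setting the threshold to zero keeps every prediction as a pseudo-label, which corresponds to the variant without confidence filtering.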

4 ILSVRC-2012 Experiments and Results

In this section, we present the results of our main experiments. We used the ILSVRC-2012 dataset due to its widespread use in self-supervised learning methods, and to see how well semi-supervised methods scale.

Since the test set of ILSVRC-2012 is not available, and numbers from the validation set are usually reported in the literature, we performed all hyperparameter selection for all models that we trained on a custom train/validation split of the public training set. This custom split contains training and validation images. We then retrain the model using the best hyperparameters on the full training set ( images), possibly with fewer labels, and report final results obtained on the public validation set ( images).

We follow standard practice [41, 32] and perform experiments where class-balanced labels are available for only of the dataset. Note that of ILSVRC-2012 still corresponds to roughly labeled images, and that previous work uses the full (public) validation set for model selection. While we use a custom validation set extracted from the training set, using such a large validation set does not correspond to a realistic scenario, as already discussed by [33, 41, 30]. We also want to cover more realistic cases in our evaluation. We thus perform experiments on of labeled examples (roughly labeled images), while also using a validation set of only images. We analyze the impact of validation set size in Section 7.

We always define epochs in terms of the available labeled data, i.e., one epoch corresponds to one full pass through the labeled data, regardless of how many unlabeled examples have been seen. We optimize our models using stochastic gradient descent with momentum on minibatches of size unless specified otherwise. While we do tune the learning rate, we keep the momentum fixed at across all experiments. Table 1 summarizes our main results.

4.1 Plain Supervised Learning

Whenever new methods are introduced, it is crucial to compare them against a solid baseline of existing methods. The simplest baseline to which any semi-supervised learning method should be compared is a plain supervised model trained on the available labeled data.

Oliver et al. [30] discovered that reported baselines trained on labeled examples alone are unfairly weak, perhaps because there is not a strong community behind tuning those baselines. They provide strong supervised-only baselines for SVHN and CIFAR-10, and show that the gap shown by the use of unlabeled data is smaller than reported.

We observed the same in the case of ILSVRC-2012. Thus, we aim to provide a strong baseline for future research by performing a relatively large search over training hyperparameters for training a model on only of ILSVRC-2012. Specifically, we try weight-decay values in , learning rates in , four different learning rate schedules spanning to epochs, and finally we explore various model architectures: ResNet50, ResNet34, ResNet18, in both “regular” (v1) and “pre-activation” (v2) variants, as well as half-, double-, triple-, and quadruple-width variants of these, testing the assumption that smaller or shallower models overfit less.

In total, we trained several thousand models on our custom training/validation split of the public training set of ILSVRC-2012. In summary, it is crucial to tune both weight decay and training duration while, perhaps surprisingly, model architecture, depth, and width only have a small influence on the final results. We thus use a standard, unmodified ResNet50v2 as model, trained with weight decay of for epochs, using a standard learning rate of , ramped up from for the first five epochs, and decayed by a factor of at epochs , , and . We train in total for epochs. The standard augmentation procedure of random cropping and horizontal flipping is used during training, and predictions are made using a single central crop keeping aspect ratio.
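The shape of the learning rate schedule described above, a linear ramp-up followed by step decays, can be sketched as follows. All numeric values in the usage lines are illustrative placeholders rather than the paper's settings, and starting the ramp-up at zero is an assumption.

```python
def learning_rate(epoch, base_lr, warmup_epochs, decay_epochs, decay_factor=0.1):
    """Linear ramp-up to `base_lr`, then multiply by `decay_factor` at each decay epoch."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs  # assumed to start from zero
    drops = sum(1 for boundary in decay_epochs if epoch >= boundary)
    return base_lr * decay_factor ** drops

# Placeholder schedule: 5 warm-up epochs, decays at epochs 100, 150 and 190.
schedule = [learning_rate(e, base_lr=0.1, warmup_epochs=5, decay_epochs=(100, 150, 190))
            for e in range(200)]
```

Since epochs are defined in terms of the labeled data, the same schedule covers very different numbers of optimization steps depending on how many labels are available.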

We perform a similar search when training our baseline on of ILSVRC-2012, but additionally include two choices of data augmentation (whether or not to apply random color augmentation) and two minibatch sizes ( and ) in the hyperparameter search. Perhaps somewhat surprisingly, the results here are similar, in that tuning the weight decay and training duration is crucial, but model architecture does not matter much. Additionally, performing color augmentation becomes important. Here too, we use a standard, unmodified ResNet50v2 as model, trained with weight decay of for epochs, using a learning rate of (while the standard learning rate of worked equally well, learning curves seemed significantly less stable), ramped up from for the first ten epochs (this was likely not necessary, but kept for consistency), and decayed by a factor of at epochs , , and . We train in total for epochs. A more detailed presentation of the results is provided in the supplementary material.

Using this only slightly altered training procedure, our baseline models achieve top5 accuracy ( top1) on the public ILSVRC-2012 validation set when trained on only of the full training set. Our baseline achieves top5 accuracy ( top1). These results form a solid baseline to compare to, considering that the same standard ResNet50v2 model achieves top5 accuracy ( top1) on of the labels.

For all further experiments, we reuse the best hyperparameters discovered here, except that we try two additional learning rates: for and for , and two additional weight decays: for and for . We also try two different weights for the additionally introduced loss : .

ILSVRC-2012 labels:                      10%      1%
(i.e. images per class)                  (128)    (13)
Supervised Baseline (Section 4.1)        80.43    48.43
Pseudolabels [20]                        82.41    51.56
VAT [24]                                 82.78    44.05
VAT + Entropy Minimization [11]          83.39    46.96
Self-sup. Rotation [17] + Linear         39.75    25.98
Self-sup. Exemplar [17] + Linear         32.32    21.33
Self-sup. Rotation [17] + Fine-tune      78.53    45.11
Self-sup. Exemplar [17] + Fine-tune      81.01    44.90
S^4L-Rotation                            83.82    53.37
S^4L-Exemplar                            83.72    47.02

Table 1: Top-5 accuracy [%] obtained by individual methods when training them on ILSVRC-2012 with a subset of labels. All methods use the same standard width ResNet50v2 architecture.

4.2 Semi-supervised Baselines

We train semi-supervised baseline models using (1) Pseudo-Label, (2) VAT, and (3) VAT+EntMin. To the best of our knowledge, we present the first evaluation of these techniques on ILSVRC-2012.

Pseudo-Label  Using the plain supervised learning models from Section 4.1, we assign pseudo-labels to the full dataset. Then, in a second step, we train a ResNet50v2 from scratch following standard practice, i.e., with learning rate , weight decay , and epochs on the full (pseudo-labeled) dataset.

We try both using all predictions as pseudo-labels, as well as using only those predictions with a confidence above . Both perform closely on our validation set, and we choose no filtering for the final model for simplicity.

Table 1 shows that a second training step with pseudo-labels consistently improves the results in both the 10% and the 1% labels case. This motivates us to apply the idea to our best semi-supervised model, which is discussed in Section 5.

VAT  We first verify our VAT implementation on CIFAR-10. With labels, we are able to achieve top-1 accuracy, which is in line with the reported in [30].

Besides the previously mentioned hyperparameters common to all methods, VAT needs tuning $\epsilon_{vat}$. Since it corresponds to a distance in pixel space, we use a simple heuristic for defining a range of values to try for $\epsilon_{vat}$: values should be lower than half the distance between neighbouring images in the dataset. Based on this heuristic, we try values of and found to work best.

VAT+EntMin  VAT is intended to be used together with an additional entropy minimization (EntMin) loss. EntMin adds a single hyperparameter to our best VAT model: the weight of the EntMin loss, for which we try , without re-tuning $\epsilon_{vat}$.

The results of our best VAT and VAT+EntMin model are shown in Table 1. As can be seen, VAT performs well in the 10% case, and adding entropy minimization consistently improves its performance. In Section 5, we further extend the co-training idea to include the self-supervised rotation loss.

4.3 Self-supervised Baselines

Previous work has evaluated features learned via self-supervision on the unlabeled data in a “semi-supervised” way by either freezing the features and learning a linear classifier on top, or by using the self-supervised model as an initialization and fine-tuning, using a subset of the labels in both cases. In order to compare our proposed way to do self-supervised semi-supervised learning to these common evaluations, we train a rotation and an exemplar model following the best practice from [17] but with standard width (“” in [17]).

Following our established protocol, we tune the weight decay and learning rate for the logistic regression, although interestingly the standard values from [12] of weight decay and learning rate worked best.

The results of evaluating these models with both 10% and 1% of the labels are presented in Table 1 as “Self-sup. + Linear” and “Self-sup. + Fine-tune”. Note that while our results for the linear experiment are similar to those reported in [17], they are not directly comparable. This is due to 1) our models being evaluated on the public validation set, while theirs were evaluated on a custom validation set, and 2) their use of L-BFGS, while we use SGD with standard augmentations. Furthermore, fine-tuning approaches or slightly surpasses the supervised baseline.

4.4 Self-supervised Semi-supervised Learning (S^4L)

For training our full self-supervised semi-supervised models (S^4L), we follow the same protocol as for our semi-supervised baselines, i.e., we use the best settings of the plain supervised baseline and only tune the learning rate, weight decay, and weight of the newly introduced loss. We found that for both S^4L-Rotation and S^4L-Exemplar, the self-supervised loss weight worked best (though not by much) and the optimal weight decay and learning rate were the same as for the supervised baseline.

As described in Section 3.1, we apply the self-supervised loss on both labeled and unlabeled images. Furthermore, both Rotation and Exemplar self-supervision generate augmented copies of each image, and we do apply the supervision loss on all copies of the labeled images. We performed one case study on S^4L-Rotation in order to investigate this choice, and found that whether or not the self-supervision loss is also applied on the labeled images does not have a significant impact. On the other hand, applying the supervision loss on the augmented images generated by self-supervision does indeed improve performance by almost . Furthermore, this allows using multiple transformed copies of an image at inference time (e.g., four rotations) and taking the average of their predictions. While this 4-rot prediction is   to   more accurate, the results we report do not make use of this in order to keep the comparison fair.
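Averaging predictions over the four rotations at inference time can be sketched as follows. The classifier `toy_model` is a hypothetical stand-in that only inspects the top-left pixel, and images are plain nested lists.

```python
def rot90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def predict_with_rotations(predict_class_probs, image):
    """Average class probabilities over the four rotated copies of `image`."""
    predictions, current = [], [list(row) for row in image]
    for _ in range(4):
        predictions.append(predict_class_probs(current))
        current = rot90(current)
    return [sum(column) / len(predictions) for column in zip(*predictions)]

# Toy stand-in classifier: votes for class 0 only when the top-left pixel is 1.
toy_model = lambda img: [1.0, 0.0] if img[0][0] == 1 else [0.0, 1.0]
averaged = predict_with_rotations(toy_model, [[1, 2], [3, 4]])
```

This kind of averaging is only sensible because the classification loss was also applied to rotated copies during training, so the model has seen labeled images in every orientation.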

The results shown in Table 1 show that our proposed way of doing self-supervised semi-supervised learning is indeed effective for the two self-supervision methods we tried. We hypothesize that such approaches can be designed for other self-supervision objectives.

We additionally verified that our proposed method is not sensitive to the random seed, nor the split of the dataset, see Appendix B for details.

Finally, in order to explore the limits of our proposed models and match the capacity of the architectures used in concurrent papers (e.g. [13]), we train the S^4L-Rotation model with a more powerful architecture, such as ResNet152v2 wider, and also use a large computational budget to tune hyperparameters. In this case our model achieves even better results: top-5 accuracy with labels and with labels.

5 Semi-supervised Learning is Complementary to S^4L

10% labels:                          Top-5    Top-1
MOAM full (proposed)                 91.23    73.21
MOAM + pseudo label (proposed)       89.96    71.56
MOAM (proposed)                      88.80    69.73
ResNet50v2 (wider)                   81.29    58.15
VAE + Bayesian SVM [32]              64.76    48.41
Mean Teacher [41]                    90.89    -
UDA [43] †                           88.52    68.66
CPCv2 [13] †                         84.88    64.03
Training with all labels:
ResNet50v2 (wider)                   94.10    78.57
MOAM (proposed)                      94.97    80.17
UDA [43] †                           94.45    79.04
CPCv2 [13] †                         93.35    -
† marks concurrent work.

Table 2: Comparing our MOAM to previous methods in the literature on ILSVRC-2012 with 10% of the labels. Note that different models use different architectures, larger than those in Table 1.

Since we found that different types of models perform similarly well, the natural next question is whether they are complementary, in which case a combination would lead to an even better model, or whether they all reach a common “intrinsic” performance plateau.

In this section, we thus describe our Mix Of All Models (MOAM). In short: in a first step, we combine S^4L-Rotation and VAT+EntMin to learn a wider [17] model. We then use this model in order to generate pseudo labels for a second training step, followed by a final fine-tuning step. Results of the final model, as well as the models obtained in the two intermediate steps, are reported in Table 2 along with previous results reported in the literature.

Step 1: Rotation+VAT+EntMin  In the first step, our model jointly optimizes the S^4L-Rotation loss and the VAT and EntMin losses. We iterated on hyperparameters for this setup in a less structured way than in our controlled experiments above (always on our custom validation set) and only mention the final values here. Our model was trained with batch size , learning rate , weight decay , training for epochs with linear learning rate ramp-up up to epoch , then 10-fold decays at 100, 150, and 190 epochs. We use inception crop augmentation as well as horizontal mirroring. We used the following relative loss weights: , , , . We tried a few heuristics for setting those weights automatically, but found that manually tuning them led to better performance. We also applied Polyak averaging to the model parameters, choosing the decay factor such that parameters decay by over each epoch. Joint training of these losses consistently improves over the models with a single objective. The model obtained after this first step achieves 88.80% top-5 accuracy on the ILSVRC-2012 dataset.
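Polyak averaging with a decay factor tied to the epoch length can be sketched as follows. The interpretation used here is an assumption: the per-step decay is chosen so that the weight of an old parameter value shrinks by a given factor after one epoch of updates. The function names and the numbers in the usage lines are illustrative, not the paper's settings.

```python
def per_step_decay(per_epoch_decay, steps_per_epoch):
    """Per-step decay d such that d ** steps_per_epoch == per_epoch_decay."""
    return per_epoch_decay ** (1.0 / steps_per_epoch)

def polyak_update(averaged, current, decay):
    """One exponential-moving-average update of the parameter vector."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(averaged, current)]

decay = per_step_decay(per_epoch_decay=0.5, steps_per_epoch=100)  # placeholder values
averaged = [0.0, 0.0]
for _ in range(100):  # one epoch in which the raw parameters stay at [1.0, 2.0]
    averaged = polyak_update(averaged, [1.0, 2.0], decay)
```

After one epoch the averaged parameters have moved half of the way from their initial value toward the raw parameters, which matches the chosen per-epoch decay of 0.5.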

Step 2: Retraining on Pseudo Labels  Using the above model, we assign pseudo labels to the full dataset by averaging predictions across five crops and four rotations of each image (generating pseudo-labels using 20 crops only slightly improved performance by , but is cheap and simple). We then train the same network again in the exact same way (i.e., with all the losses) except for the following three differences: (1) the network is initialized using the weights obtained in the first step; (2) every example has a label: the pseudo label; (3) due to this, an epoch now corresponds to the full dataset; we thus train for 18 epochs, decaying the learning rate after 6 and 12 epochs.

Step 3: Fine-tuning the model  Finally, we fine-tune the model obtained in the second step on the original labels only. This step is trained with weight decay and learning rate for 20 epochs, decaying the learning rate 10-fold every 5 epochs.

Remember that all hyperparameters described here were selected on our custom validation set, which is taken from the training set. The final model “MOAM (full)” achieves top-5 accuracy, which sets a new state of the art.

We conduct additional experiments and report performance of MOAM (i.e., only Step 1) with labels in Table 2. Interestingly, MOAM achieves promising results even in the high-data regime with labels, outperforming the fully supervised baseline: for top-5 accuracy and for top-1 accuracy.

Figure 2: Places205 learning curves of logistic regression on top of the features learned by pre-training a self-supervised versus an S^4L-Rotation model on ILSVRC-2012. The significantly faster convergence (“long” training schedule vs. “short” one) suggests that more easily separable features are learned.

6 Transfer of Learned Representations

Figure 3: Correlation between validation score on a (custom) validation set of , , and images on ILSVRC-2012. Each point corresponds to a trained model during a sweep for plain supervised baseline for the labeled case. The best model according to the validation set of is marked in red. As can be seen, evaluating our models even with only a single validation image per class is robust, and in particular selecting an optimal model with this validation set works as well as with the full validation set.

Self-supervision methods are typically evaluated in terms of how generally useful their learned representation is. This is done by treating the learned model as a fixed feature extractor and training a linear logistic regression model on top of the features it extracts on a different dataset, usually Places205 [45]. We perform such an evaluation on our models in order to gain some insight into the generality of the learned features, and how they compare to those obtained by pure self-supervision.

We closely follow the protocol defined by [17]. The representation is extracted from the pre-logits layer. We use stochastic gradient descent (SGD) with momentum for training these linear evaluation models with a minibatch size of 2048 and an initial learning rate of 0.1, warmed up in the first epoch.

While Kolesnikov et al. [17] show that a very long training schedule (520 epochs) is required for the linear model to converge using self-supervised representations, we observe dramatically different behaviour when evaluating our self-supervised semi-supervised representations. Figure 2 shows the accuracy curve of the plain self-supervised rotation method [17] and our proposed S^4L-Rotation method trained on of ILSVRC-2012. As can be seen, the logistic regression is able to find a good separating hyperplane in very few epochs and then plateaus, whereas in the self-supervised case it struggles for a very large number of epochs. This indicates that the addition of labeled data leads to much more separable representations, even across datasets.

We further investigate the gap between the representation learned by a good model (MOAM) and a corresponding baseline trained on of the labels (the baseline from Table 2). Surprisingly, we found that the representation learned by “MOAM (full)” transfers slightly better than the baseline, which used ten times more labelled data: accuracy vs. accuracy, respectively. We provide full details of this experiment in the Supplementary Material.

7 Is a Tiny Validation Set Enough?

Current standard practice in semi-supervised learning is to use a subset of the labels for training on a large dataset, but still perform model selection using scores obtained on the full validation set. (Footnote 4: To make matters worse, in the case of ILSVRC-2012, this validation set is used both to select hyperparameters and to report final performance. Remember that we avoid this by creating a custom validation set from part of the training set for all hyperparameter selections.)

But having a large labeled validation set at hand is at odds with the promised practicality of semi-supervised learning, which is all about having only few labeled examples. This fact has been acknowledged by [33], but has been mostly ignored in the semi-supervised literature. Oliver et al. [30] question the viability of tuning with small validation sets by comparing the estimated model accuracy on small validation sets. They find that the variance of the estimated accuracy gap between two models can be larger than the actual gap between those models, hinting that model selection with small validation sets may not be viable. That said, they did not empirically evaluate whether it is possible to find the best model with a small validation set, especially when choosing hyperparameters for a particular semi-supervised method.
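To make the variance argument concrete, the following sketch estimates the noise of an accuracy measured on a small validation set under a simple binomial model. This is an illustration and not the analysis of [30]: it assumes independently sampled validation images and, for the gap between two models, independent errors. The independence assumption tends to overstate the noise when both models are evaluated on the same images and their errors are positively correlated. The function names are hypothetical.

```python
import math


def accuracy_std_error(accuracy, num_images):
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_images)


def gap_std_error(acc_a, acc_b, num_images):
    """Standard error of the accuracy gap, assuming independent errors."""
    return math.sqrt(
        accuracy_std_error(acc_a, num_images) ** 2
        + accuracy_std_error(acc_b, num_images) ** 2
    )


# Illustrative numbers only: two models near 50% accuracy on 100 images.
single = accuracy_std_error(0.5, 100)
gap = gap_std_error(0.5, 0.5, 100)
```

Under this model, two models whose true accuracies differ by less than a few multiples of `gap` cannot be reliably ordered by a single evaluation, which is the concern that motivates the empirical check in this section.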

We now describe our analysis of this important question. We look at the many models we trained for the plain supervised baseline on of ILSVRC-2012. For each model, we compute a validation score on a validation set of labeled images (i.e., one labeled image per class), labeled images (i.e., five labeled images per class), and compare these scores to those obtained on a “full-size” validation set of labeled images. The result is shown in Figure 3 and it is striking: there is a very strong correlation between performance on the tiny and the full validation set. In particular, while there is high variability in places, the hyperparameters that work best do so in either case. Most notably, the best model tuned on a small validation set is also the best model tuned on a large validation set. We thus conclude that for selecting hyperparameters of a model, a tiny validation set is enough.
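The comparison described above can be sketched as follows: given one score per trained model on the tiny and on the full validation set, compute a rank correlation and check whether both sets select the same best model. Spearman rank correlation is used here as an assumption; the text does not state which correlation measure underlies Figure 3, and the scores below are made up for illustration.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def same_best_model(tiny_scores, full_scores):
    """True if both validation sets pick the same model index as best."""
    best_tiny = max(range(len(tiny_scores)), key=tiny_scores.__getitem__)
    best_full = max(range(len(full_scores)), key=full_scores.__getitem__)
    return best_tiny == best_full


# Hypothetical scores for four models (index = model).
tiny = [0.50, 0.62, 0.58, 0.70]
full = [0.52, 0.60, 0.61, 0.69]
rho = spearman(tiny, full)
agree = same_best_model(tiny, full)
```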

8 Discussion and Future Work

In this paper, we have bridged the gap between self-supervision methods and semi-supervised learning by suggesting a framework (S^4L) that can be used to turn any self-supervision method into a semi-supervised learning algorithm.

We instantiated two such methods, S^4L-Rotation and S^4L-Exemplar, and have shown that they perform competitively with methods from the semi-supervised literature on the challenging ILSVRC-2012 dataset. We further showed that S^4L methods are complementary to existing semi-supervision techniques, and that MOAM, our proposed combination of those, leads to state-of-the-art performance.

While all of the methods we investigated show promising results for learning with of the labels on ILSVRC-2012, the picture is much less clear when using only . It is possible that in this low data regime, when only labeled examples per class are available, the setting fades into the few-shot scenario, and a very different set of methods would be required for reaching much better performance.

Nevertheless, we hope that this work inspires other researchers in the field of self-supervision to consider extending their methods into semi-supervised methods using our framework, as well as researchers in the field of semi-supervised learning to take inspiration from the vast amount of recently proposed self-supervision methods.

Acknowledgements.  We thank the Google Brain Team in Zürich, and especially Sylvain Gelly for discussions.


Appendix A Detailed Results of the Supervised Baselines

Figure 4: The “hypersweep curves” for the supervised baseline trained on of ILSVRC-2012. See text for details.
Figure 5: The “hypersweep curves” for the supervised baseline trained on of ILSVRC-2012. See text for details.

Since we performed quite extensive hyperparameter search and trained many models in order to find a solid fully-supervised baseline on and of ILSVRC-2012, we believe that it is valuable to report the full results to the community, instead of just providing the final best model.

We present the results in the form of what we call “hypersweep curves” in Figures 4 and 5.

Each plot shows a large collection of models: each point on each plot is a fully trained model. The curves are sorted by accuracy, which makes it possible to assess sensitivity to different hyperparameters rather than only comparing the best models.

For each curve, we plot the accuracy of models where one of the hyperparameters is fixed.

Thus, by comparing curves, one can see:

  1. Which value of a hyperparameter performs best by looking at which curve’s rightmost point is highest.

  2. How sensitive the model is to a hyperparameter in the best case by looking at how far apart the curves are from each other at their rightmost point.

  3. How robust a hyperparameter is on average by looking at how similar the curves are overall.

  4. How independent a specific hyperparameter value is from all others by looking at the curve’s shape, and whether curves cross-over (strong interplay) or not (strong independence).
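A minimal sketch of how such curves could be assembled from a sweep is given below. It is not the plotting code behind Figures 4 and 5; the record layout (one dictionary per trained model with an "accuracy" entry), the function names, and the toy numbers are assumptions made for illustration.

```python
from collections import defaultdict


def hypersweep_curves(models, hyperparameter):
    """Group models by one hyperparameter value and sort each group by accuracy.

    Returns {value: accuracies sorted ascending}; each list is one curve, with
    the best model for that value at the rightmost position.
    """
    curves = defaultdict(list)
    for model in models:
        curves[model[hyperparameter]].append(model["accuracy"])
    return {value: sorted(accs) for value, accs in curves.items()}


def best_value(curves):
    """Hyperparameter value whose curve has the highest rightmost point."""
    return max(curves, key=lambda value: curves[value][-1])


# Hypothetical sweep over two hyperparameters.
sweep = [
    {"weight_decay": 1e-3, "epochs": 100, "accuracy": 0.70},
    {"weight_decay": 1e-3, "epochs": 200, "accuracy": 0.65},
    {"weight_decay": 1e-4, "epochs": 100, "accuracy": 0.60},
    {"weight_decay": 1e-4, "epochs": 200, "accuracy": 0.68},
]
wd_curves = hypersweep_curves(sweep, "weight_decay")
wd_best = best_value(wd_curves)
```

Plotting each list against its index then gives one curve per hyperparameter value, so the four readings listed above correspond to comparing the right ends, the gaps at the right ends, the overall similarity, and the crossings of these curves.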

While the results shown in Figure 4 use the full (custom) validation set, those in Figure 5 were computed using the validation set of size , i.e., with only one image per class. As we have shown in Section 7, this is sufficient to determine the best hyperparameters, and we encourage the community to follow this more realistic protocol.

As can be seen, weight decay and number of training epochs are the two things which matter most when training using only a fraction of ILSVRC-2012.

Perhaps the most surprising finding is that, contrary to current folklore, reducing model capacity is detrimental to performance on the smaller dataset. Neither reducing depth nor reducing width improves performance. In fact, the deeper and wider models still outperform their shallower and thinner counterparts, even when using only of the training data. Moreover, the wider models are more robust to the values of other hyperparameters, as evidenced by their curves being significantly higher on the left end. This is in line with recent findings suggesting wider models ease optimization [21, 9, 39].

Furthermore, when reducing the dataset size to , we found that adding the same color augmentation as introduced by Exemplar is helpful. We thereafter tried adding it to our best few models on , but it did not help there.

Finally, while in the case, learning-rate of and seem to perform equally well in the good cases (right hand side of curves), we manually inspected training curves and found that is significantly less robust, typically not learning anything before the first decay, and only catching up later on.

While we trained thousands of models in order to rigorously test multiple hypotheses (such as that of reducing model capacity), almost all of the performance gain could have been achieved in just a few dozen trials with intuitively important hyperparameters (weight decay and epochs), which would take about a week on a modern four-GPU machine.

Overall, we hope that this thorough baseline investigation inspires the semi-supervised learning community to be more careful with baselines, as those that we found perform almost absolute better than those previously reported in the literature.

Appendix B Randomness of

Method 10% ImageNet 1% ImageNet
Table 3: performance for 9 runs with random image subsets. Top-5 accuracies [%] are reported as mean ± standard deviation.

There are two sources of randomness in a semi-supervised model: (1) the sampling of the labeled subset, and (2) the random seed of the run. In order to estimate the resulting variation in performance, we train 9 models with random data subsets and random seeds for our proposed method. Table 3 presents the detailed results. Overall, we observe that the standard deviation is fairly small across both subsets and different runs and, therefore, our empirical evaluation provides a robust comparison of various techniques.
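The summary statistic reported in Table 3 can be computed as in the short sketch below. The accuracies are invented for illustration, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_summary(accuracies):
    """Format repeated-run accuracies as 'mean ± std'."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical top-5 accuracies [%] from three runs, not results from the paper.
summary = format_summary([80.0, 81.0, 82.0])
```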

Appendix C More Results in the Transfer Setup

Method                          %-labels  top-5  top-1
Supervised                      1         65.4   36.2
Supervised                      10        75.0   44.7
Supervised                      100       81.9   52.5
Supervised                      100       83.1   53.7
SS Rotation [17]                0         71.4   41.7
SS Exemplar [17]                0         69.0   39.8
Pseudolabels [20]               1         71.6   41.8
VAT [24]                        1         64.9   35.9
VAT + EntMin [11]               1         65.9   36.4
Pseudolabels [20]               10        78.1   48.2
VAT [24]                        10        76.4   45.8
VAT + EntMin [11]               10        76.4   46.2
SS Rotation [17] + Fine-tune    1         66.1   36.3
SS Exemplar [17] + Fine-tune    1         60.0   31.1
SS Rotation [17] + Fine-tune    10        75.4   45.9
SS Exemplar [17] + Fine-tune    10        75.6   45.9
-Rotation                       1         67.3   38.0
-Exemplar                       1         61.2   32.2
-Rotation                       10        76.4   46.6
-Exemplar                       10        75.9   45.9
MOAM full                       10        83.3   54.2
MOAM + pseudo label             10        83.3   54.2
                                10        79.2   49.5
Table 4: Accuracy (in percent) obtained by various individual methods when transferring their representation to the Places205 dataset using linear models on frozen representations. All methods use the same plain ResNet50v2 base model, except for the ones marked by , which use a wider network. When it was necessary, a marks longer transfer training of 520 epochs. The “%-labels” column shows the percentage of ILSVRC-2012 labels that was used for training the model.

In this section we present more results from the transfer evaluation task on Places205 [45]. Table 4 shows the results for the models mentioned in our main paper. For each method, we select the best model and evaluate its transfer to Places205.

We follow the same setup as [17] to train linear models with SGD on top of frozen representations. The only difference is the number of training epochs: we train for 30 epochs in total, with the learning rate decayed at 10 and 20 epochs. The learning rate is linearly ramped up for the first epoch. Kolesnikov et al. [17] train for 520 epochs with learning rate decays at 480 and 500 epochs. The schedule used in our paper is much shorter because of our finding that representations learned with labels are more separable and converge significantly faster. (See Section 6 of the main paper for details.) To make a fair comparison with the self-supervised models, results in Table 4 with labels are trained for 520 epochs to ensure their convergence.

From the plain supervised baselines, we observe that either more labels or wider networks lead to more transferable representations. Surprisingly, we found that pseudo labels outperform the other two semi-supervised baselines in the transfer setup. On the labels evaluation setup, pseudo labels achieve the best result compared to the other methods. With labels, S^4L is comparable to the semi-supervised baselines, and our MOAM clearly outperforms all other models trained on of labels. More interestingly, the MOAM (full) model on is slightly better than the supervised baseline with the same wider network. This indicates that learning a model with multiple losses may lead to representations that generalize better to unseen tasks.