Big Self-Supervised Models are Strong Semi-Supervised Learners

06/17/2020 · Ting Chen et al. · Google

One paradigm for learning from few labeled examples while making the best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to most previous approaches to semi-supervised learning for computer vision, we show that it is surprisingly effective for semi-supervised learning on ImageNet. A key ingredient of our approach is the use of a big (deep and wide) network during pretraining and fine-tuning. We find that the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After fine-tuning, the big network can be further improved and distilled into a much smaller one with little loss in classification accuracy by using the unlabeled examples a second time, but in a task-specific way. The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2 (a modification of SimCLR), supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge. This procedure achieves 73.9% ImageNet top-1 accuracy with just 1% of the labels (<13 labeled images per class) using ResNet-50, a 10× improvement in label efficiency over the previous state-of-the-art. With 10% of labels, ResNet-50 trained with our method achieves 77.5% top-1 accuracy, outperforming standard supervised training with all of the labels.


1 Introduction

Figure 1: Bigger models yield larger gains when fine-tuning with fewer labeled examples.
Figure 2: Top-1 accuracy of previous state-of-the-art (SOTA) methods Chen et al. (2020a); Pham et al. (2020) and our method (SimCLRv2) on ImageNet using only 1% or 10% of the labels. Dashed line denotes fully supervised ResNet-50 trained with 100% of labels. Full comparisons in Table  3.

Learning from just a few labeled examples while making best use of a large amount of unlabeled data is a long-standing problem in machine learning. One approach to semi-supervised learning involves unsupervised or self-supervised pretraining, followed by supervised fine-tuning Hinton et al. (2006); Bengio et al. (2007). This approach leverages unlabeled data in a task-agnostic way during pretraining, as the supervised labels are only used during fine-tuning. Although it has received little attention in computer vision, this approach has become predominant in natural language processing, where one first trains a large language model on unlabeled text (e.g., Wikipedia), and then fine-tunes the model on a few labeled examples Dai and Le (2015); Kiros et al. (2015); Radford et al. (2018); Peters et al. (2018); Devlin et al. (2018); Radford et al. An alternative approach, common in computer vision, directly leverages unlabeled data during supervised learning, as a form of regularization. This approach uses unlabeled data in a task-specific way to encourage class label prediction consistency on unlabeled data among different models Lee; Xie et al. (2019a); Pham et al. (2020) or under different data augmentations Berthelot et al. (2019); Xie et al. (2019b); Sohn et al. (2020).

Motivated by recent advances in self-supervised learning of visual representations Oord et al. (2018); Wu et al. (2018); Bachman et al. (2019); Hénaff et al. (2019); He et al. (2019a); Chen et al. (2020a), this paper first presents a thorough investigation of the “unsupervised pretrain, supervised fine-tune” paradigm for semi-supervised learning on ImageNet Russakovsky et al. (2015). During self-supervised pretraining, images are used without class labels (in a task-agnostic way), hence the representations are not directly tailored to a specific classification task. With this task-agnostic use of unlabeled data, we find that network size is important: using a big (deep and wide) neural network for self-supervised pretraining and fine-tuning greatly improves accuracy. In addition to the network size, we characterize a few important design choices for contrastive representation learning that benefit supervised fine-tuning and semi-supervised learning.

Once a convolutional network is pretrained and fine-tuned, we find that its task-specific predictions can be further improved and distilled into a smaller network. To this end, we make use of unlabeled data for a second time to encourage the student network to mimic the teacher network’s label predictions. Thus, the distillation Hinton et al. (2015); Buciluǎ et al. (2006) phase of our method using unlabeled data is reminiscent of the use of pseudo labels Lee in self-training Yalniz et al. (2019); Xie et al. (2019a), but without much extra complexity.

In summary, the proposed semi-supervised learning framework comprises three steps as shown in Figure 3: (1) unsupervised or self-supervised pretraining, (2) supervised fine-tuning, and (3) distillation using unlabeled data. We develop an improved variant of a recently proposed contrastive learning framework, SimCLR Chen et al. (2020a), for unsupervised pretraining of a ResNet architecture He et al. (2016). We call this framework SimCLRv2. We assess the effectiveness of our method on ImageNet ILSVRC-2012 Russakovsky et al. (2015) with only 1% and 10% of the labeled images available. Our main findings and contributions can be summarized as follows:


  • Our empirical results suggest that for semi-supervised learning (via the task-agnostic use of unlabeled data), the fewer the labels, the more it is possible to benefit from a bigger model (Figure 1). Bigger self-supervised models are more label-efficient, performing significantly better when fine-tuned on only a few labeled examples, even though they have more capacity to potentially overfit.

  • We show that although big models are important for learning general (visual) representations, the extra capacity may not be necessary when a specific target task is concerned. Therefore, with the task-specific use of unlabeled data, the predictive performance of the model can be further improved and transferred into a smaller network.

  • We further demonstrate the importance of the nonlinear transformation (a.k.a. projection head) after convolutional layers used in SimCLR for semi-supervised learning. A deeper projection head not only improves the representation quality measured by linear evaluation, but also improves semi-supervised performance when fine-tuning from a middle layer of the projection head.

We combine these findings to achieve a new state-of-the-art in semi-supervised learning on ImageNet, as summarized in Figure 2. Under the linear evaluation protocol, SimCLRv2 achieves 79.8% top-1 accuracy, a 4.3% relative improvement over the previous state-of-the-art Chen et al. (2020a). When fine-tuned on only 1% / 10% of labeled examples and distilled to the same architecture using unlabeled examples, it achieves 76.6% / 80.9% top-1 accuracy, a 21.6% / 8.7% relative improvement over the previous state-of-the-art. With distillation, these improvements can also be transferred to smaller ResNet-50 networks to achieve 73.9% / 77.5% top-1 accuracy using 1% / 10% of labels. By comparison, a standard supervised ResNet-50 trained on all of the labeled images achieves a top-1 accuracy of 76.6%.

Figure 3: The proposed semi-supervised learning framework leverages unlabeled data in two ways: (1) task-agnostic use in unsupervised pretraining, and (2) task-specific use in self-training / distillation.

2 Method

Inspired by the recent successes of learning from unlabeled data Hénaff et al. (2019); He et al. (2019a); Chen et al. (2020a); Lee; Yalniz et al. (2019); Xie et al. (2019a), the proposed semi-supervised learning framework leverages unlabeled data in both task-agnostic and task-specific ways. The first time the unlabeled data is used, it is in a task-agnostic way, for learning general (visual) representations via unsupervised pretraining. The general representations are then adapted for a specific task via supervised fine-tuning. The second time the unlabeled data is used, it is in a task-specific way, for further improving predictive performance and obtaining a compact model. To this end, we train student networks on the unlabeled data with imputed labels from the fine-tuned teacher network. Our method can be summarized in three main steps: pretrain, fine-tune, and then distill. The procedure is illustrated in Figure 3. We introduce each specific component in detail below.

Self-supervised pretraining with SimCLRv2. To learn general visual representations effectively with unlabeled images, we adopt and improve SimCLR Chen et al. (2020a), a recently proposed approach based on contrastive learning. SimCLR learns representations by maximizing agreement Becker and Hinton (1992) between differently augmented views of the same data example via a contrastive loss in the latent space. More specifically, given a randomly sampled mini-batch of images, each image $x_k$ is augmented twice using random crop, color distortion, and Gaussian blur, creating two views of the same example, $x_{2k-1}$ and $x_{2k}$. The two images are encoded via an encoder network $f(\cdot)$ (a ResNet He et al. (2016)) to generate representations $h_{2k-1}$ and $h_{2k}$. The representations are then transformed again with a non-linear transformation network $g(\cdot)$ (an MLP projection head), yielding $z_{2k-1}$ and $z_{2k}$, which are used for the contrastive loss. With a mini-batch of $2N$ augmented examples, the contrastive loss between a pair of positive examples $(i, j)$ (augmented from the same image) is given as follows:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \qquad (1)$$

where $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity between two vectors, and $\tau$ is a temperature scalar.
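To make the objective concrete, here is a minimal PyTorch-style sketch of the loss in Eq. (1). The official SimCLRv2 implementation is in TensorFlow; the function name, batch layout (the two views of image $k$ stored at rows $2k$ and $2k{+}1$), and default temperature below are illustrative assumptions.

```python
# Minimal sketch of the NT-Xent contrastive loss in Eq. (1); illustrative, not the official code.
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z: [2N, d] projections, where rows 2k and 2k+1 hold the two augmented views of image k."""
    z = F.normalize(z, dim=1)                           # unit vectors, so dot products are cosine similarities
    sim = torch.matmul(z, z.t()) / temperature          # [2N, 2N] pairwise sim(z_i, z_k) / tau
    mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))          # exclude k = i from the softmax denominator
    targets = torch.arange(z.shape[0], device=z.device) ^ 1   # the positive of row 2k is row 2k+1
    return F.cross_entropy(sim, targets)                # mean of -log softmax(positive) over all 2N anchors

# Example: a mini-batch of N = 256 images gives 2N = 512 projections of dimension 128.
# loss = nt_xent_loss(torch.randn(512, 128), temperature=0.1)
```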

In this work, we propose SimCLRv2, which improves upon SimCLR Chen et al. (2020a) in three major ways. Below we summarize the changes as well as the resulting accuracy improvements on ImageNet ILSVRC-2012 Russakovsky et al. (2015).


  • To fully leverage the power of general pretraining, we explore larger ResNet models. Unlike SimCLR Chen et al. (2020a) and other previous work Kolesnikov et al. (2019a); He et al. (2019a), whose largest model is ResNet-50 (4×), we train models that are deeper but less wide. The largest model we train is a 152-layer ResNet He et al. (2016) with 3× wider channels and selective kernels (SK) Li et al. (2019), a channel-wise attention mechanism that improves the parameter efficiency of the network. By scaling up the model from ResNet-50 to ResNet-152 (3×+SK), we obtain a 29% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples.

  • We also increase the capacity of the non-linear network (a.k.a. the projection head) by making it deeper. (In our experiments, we set the width of the projection head's middle layers to that of its input, so it is also adjusted by the width multiplier; however, a wider projection head improves performance even when the base network remains narrow.) Furthermore, instead of throwing the projection head away entirely after pretraining as in SimCLR Chen et al. (2020a), we fine-tune from a middle layer of it (detailed later). This small change yields a significant improvement for both linear evaluation and fine-tuning with only a few labeled examples. Compared to SimCLR with a 2-layer projection head, using a 3-layer projection head and fine-tuning from its first layer results in as much as a 14% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples (see Figure E.1).

  • Motivated by Chen et al. (2020b), we also incorporate the memory mechanism from MoCo He et al. (2019a), which designates a memory network (with a moving average of weights for stabilization) whose outputs are buffered as negative examples. Since our training is based on large mini-batches, which already supply many contrasting negative examples, this change yields an improvement of about 1% for linear evaluation as well as when fine-tuning on 1% of labeled examples (see Appendix D). A simplified sketch of this memory mechanism is given below.
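As a rough illustration of this memory mechanism, the sketch below keeps an exponential-moving-average copy of the encoder and a FIFO buffer of its projections that can be appended to the in-batch negatives. The class, buffer layout, and update signature are our own illustrative assumptions (the paper follows MoCo He et al. (2019a)); this is not the SimCLRv2 implementation.

```python
# Simplified MoCo-style memory sketch (illustrative; names and layout are assumptions).
import copy
import torch

class MomentumMemory:
    def __init__(self, encoder: torch.nn.Module, buffer_size: int = 65536, ema_decay: float = 0.999):
        self.ema_encoder = copy.deepcopy(encoder)      # "memory network": EMA copy of the weights
        for p in self.ema_encoder.parameters():
            p.requires_grad_(False)
        self.buffer = []                               # FIFO queue of buffered projections (negatives)
        self.buffer_size = buffer_size
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update(self, encoder: torch.nn.Module, images: torch.Tensor) -> None:
        # EMA update of the memory network: theta_ema <- m * theta_ema + (1 - m) * theta.
        for p_ema, p in zip(self.ema_encoder.parameters(), encoder.parameters()):
            p_ema.mul_(self.ema_decay).add_(p, alpha=1.0 - self.ema_decay)
        # Buffer the memory network's output as extra negative examples.
        self.buffer.append(self.ema_encoder(images).detach())
        while sum(b.shape[0] for b in self.buffer) > self.buffer_size:
            self.buffer.pop(0)

    def negatives(self) -> torch.Tensor:
        # Concatenated buffered projections, to be added to the contrastive-loss denominator.
        return torch.cat(self.buffer, dim=0)
```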

Fine-tuning. Fine-tuning is a common way to adapt the task-agnostically pretrained network for a specific task. In SimCLR Chen et al. (2020a), the MLP projection head is discarded entirely after pretraining, while only the ResNet encoder is used during the fine-tuning. Instead of throwing it all away, we propose to incorporate part of the MLP projection head into the base encoder during the fine-tuning. This is equivalent to fine-tuning from a middle layer of the projection head, instead of the input layer of the projection head as in SimCLR.

We elaborate with a 3-layer projection head, $g(h) = W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} h))$, where $\sigma$ is a ReLU non-linearity and we ignore the bias terms here for brevity. For fine-tuning, SimCLR uses $f^{\mathrm{task}}(x) = W^{\mathrm{task}} f(x)$ to compute the logits of pre-defined classes, where $W^{\mathrm{task}}$ is the weight of the added task-specific linear layer (bias also ignored for brevity). This is fine-tuning from the input layer of the projection head. To fine-tune from the first layer of the projection head, we have a new encoder function $f^{\mathrm{task}}(x) = W^{\mathrm{task}} \sigma(W^{(1)} f(x))$, which is a ResNet followed by fully connected layers.
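The PyTorch-style sketch below illustrates this choice for a 3-layer projection head: keeping zero projection layers reproduces SimCLR's fine-tuning, while keeping the first layer fine-tunes from $\sigma(W^{(1)} f(x))$. The class and argument names are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of fine-tuning from a middle layer of the projection head.
import torch
import torch.nn as nn

class FineTuneModel(nn.Module):
    """ResNet encoder + the first `keep_layers` projection-head layers + a task-specific linear head."""
    def __init__(self, resnet: nn.Module, proj_layers: nn.ModuleList,
                 feature_dim: int, num_classes: int = 1000, keep_layers: int = 1):
        super().__init__()
        self.resnet = resnet
        kept = []
        for layer in list(proj_layers)[:keep_layers]:
            kept += [layer, nn.ReLU(inplace=True)]     # each retained W^(l) is followed by a ReLU
        self.head_prefix = nn.Sequential(*kept)        # empty when keep_layers = 0 (SimCLR's choice)
        # feature_dim is the output width of the last retained layer (or of the ResNet if none is kept).
        self.classifier = nn.Linear(feature_dim, num_classes)   # W^task

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.resnet(x)              # f(x)
        h = self.head_prefix(h)         # sigma(W^(1) f(x)) when keep_layers = 1
        return self.classifier(h)       # task logits
```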

Self-training / knowledge distillation via unlabeled examples. To further improve the network for the target task, here we leverage the unlabeled data directly for the target task. Inspired by Buciluǎ et al. (2006); Lee; Hinton et al. (2015); Yalniz et al. (2019); Xie et al. (2019a), we use the fine-tuned network as a teacher to impute labels for training a student network. Specifically, we minimize the following distillation loss, where no real labels are used:

$$\mathcal{L}^{\mathrm{distill}} = -\sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^T(y \mid x_i; \tau) \, \log P^S(y \mid x_i; \tau) \Big] \qquad (2)$$

where $P(y \mid x_i; \tau) = \exp\!\big(f^{\mathrm{task}}(x_i)[y]/\tau\big) \big/ \sum_{y'} \exp\!\big(f^{\mathrm{task}}(x_i)[y']/\tau\big)$, and $\tau$ is a scalar temperature parameter. The teacher network, which produces $P^T(y \mid x_i)$, is fixed during the distillation; only the student network, which produces $P^S(y \mid x_i)$, is trained. While we focus on distillation using only unlabeled examples in this work, when the number of labeled examples is significant, one can also combine the distillation loss with ground-truth labeled examples using a weighted combination:

$$\mathcal{L} = -(1-\alpha) \sum_{(x_i, y_i) \in \mathcal{D}^L} \log P^S(y_i \mid x_i) \;-\; \alpha \sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^T(y \mid x_i; \tau) \, \log P^S(y \mid x_i; \tau) \Big] \qquad (3)$$
This procedure can be performed either with a student that has the same model architecture as the teacher (self-distillation), which further improves the task-specific performance, or with a smaller model architecture, which leads to a compact model.
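For concreteness, a minimal sketch of the distillation objective in Eq. (2), together with the optional weighted combination of Eq. (3), is given below. The function and argument names are illustrative, and the teacher logits are assumed to come from the frozen fine-tuned model.

```python
# Illustrative sketch of the distillation losses in Eq. (2) and Eq. (3).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Eq. (2): cross-entropy between the frozen teacher's soft labels and the student's predictions."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)       # P^T(y | x; tau)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)   # log P^S(y | x; tau)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def combined_loss(student_logits_unlabeled, teacher_logits_unlabeled,
                  student_logits_labeled, labels,
                  alpha: float = 1.0, temperature: float = 1.0) -> torch.Tensor:
    """Eq. (3): weighted combination; alpha = 1 recovers the pure distillation loss of Eq. (2)."""
    distill = distillation_loss(student_logits_unlabeled, teacher_logits_unlabeled, temperature)
    supervised = F.cross_entropy(student_logits_labeled, labels)
    return alpha * distill + (1.0 - alpha) * supervised
```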

3 Empirical Study

3.1 Settings and Implementation Details

Following the semi-supervised learning setting in Zhai et al. (2019); Hénaff et al. (2019); Chen et al. (2020a), we evaluate the proposed method on ImageNet ILSVRC-2012 Russakovsky et al. (2015). While all 1.28 million images are available, only a randomly sub-sampled 1% (12,811 images) or 10% (128,116 images) are associated with labels. (See https://www.tensorflow.org/datasets/catalog/imagenet2012_subset for the details of the 1%/10% subsets.)

As in previous work, we also report performance when training a linear classifier on top of a fixed representation with all labels Zhang et al. (2016); Oord et al. (2018); Wu et al. (2018); Chen et al. (2020a) to directly evaluate SimCLRv2 representations. We use the LARS optimizer You et al. (2017) (with a momentum of 0.9) throughout, for pretraining, fine-tuning, and distillation.

For pretraining, similar to Chen et al. (2020a), we train our model on 128 Cloud TPUs, with a batch size of 4096 and global batch normalization Ioffe and Szegedy (2015), for a total of 800 epochs. The learning rate is linearly increased for the first 5% of epochs, reaching a maximum of 6.4, and is then decayed with a cosine decay schedule. Weight decay is applied. We use a 3-layer MLP projection head on top of a ResNet encoder. The memory buffer is set to 64K, and the exponential moving average (EMA) decay is set to 0.999, following He et al. (2019a). We use the same set of simple augmentations as SimCLR Chen et al. (2020a), namely random crop, color distortion, and Gaussian blur.
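For concreteness, the warmup-then-cosine schedule described above can be written as a small helper. This is a generic sketch (the step-based parameterization and function name are our own), not the authors' training code.

```python
# Sketch of the pretraining learning-rate schedule: linear warmup, then cosine decay.
import math

def pretrain_lr(step: int, total_steps: int, peak_lr: float = 6.4, warmup_frac: float = 0.05) -> float:
    """Linear warmup over the first `warmup_frac` of training, then cosine decay toward zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: 800 epochs at batch size 4096 over ~1.28M images is roughly 250k steps in total.
```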

For fine-tuning, by default we fine-tune from the first layer of the projection head for 1%/10% of labeled examples, but from the input of the projection head when 100% of labels are present. We use global batch normalization, but we remove weight decay and learning rate warmup, and use a much smaller learning rate, i.e., 0.16 for standard ResNets He et al. (2016) and 0.064 for larger ResNet variants (with width multiplier larger than 1 and/or SK Li et al. (2019)). A batch size of 1024 is used. Similar to Chen et al. (2020a), we fine-tune for 60 epochs with 1% of labels, and 30 epochs with 10% of labels as well as with full ImageNet labels.

For distillation, we only use unlabeled examples, unless otherwise specified. We consider two types of distillation: self-distillation, where the student has the same model architecture as the teacher (excluding the projection head), and big-to-small distillation, where the student is a much smaller network. We use the same learning rate schedule, weight decay, and batch size as for pretraining, and the models are trained for 400 epochs. Only random crops and horizontal flips of training images are applied during fine-tuning and distillation.
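To keep the hyperparameters of the three stages in one place, they can be gathered into a configuration like the one below. The dictionary layout and key names are our own; values not stated in the text (e.g., the exact weight decay) are deliberately omitted.

```python
# Illustrative summary of the training configuration described in this subsection.
SIMCLRV2_STAGES = {
    "pretrain": {
        "optimizer": "LARS (momentum 0.9)",
        "batch_size": 4096,
        "epochs": 800,
        "peak_lr": 6.4,                      # linear warmup for the first 5% of epochs, then cosine decay
        "projection_head_layers": 3,
        "memory_buffer_size": 64 * 1024,     # 64K, with EMA decay 0.999
        "augmentations": ["random_crop", "color_distortion", "gaussian_blur"],
    },
    "fine_tune": {
        "optimizer": "LARS (momentum 0.9)",
        "batch_size": 1024,
        "epochs": {"1%": 60, "10%": 30, "100%": 30},
        "lr": {"standard_resnet": 0.16, "larger_resnet": 0.064},
        "fine_tune_from": "1st projection-head layer (projection-head input for 100% labels)",
        "weight_decay": None,                # removed for fine-tuning
        "lr_warmup": None,                   # removed for fine-tuning
    },
    "distill": {
        "optimizer": "LARS (momentum 0.9)",
        "epochs": 400,
        "data": "unlabeled examples only (by default)",
        "schedule": "same LR schedule, weight decay, and batch size as pretraining",
        "augmentations": ["random_crop", "horizontal_flip"],
    },
}
```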

3.2 Bigger Models Are More Label-Efficient

Depth  Width  SK     Param (M)  F-T (1%)  F-T (10%)  F-T (100%)  Linear eval  Supervised
50     1×     False   24        57.9      68.4       76.3        71.7         76.6
50     1×     True    35        64.5      72.1       78.7        74.6         78.5
50     2×     False   94        66.3      73.9       79.1        75.6         77.8
50     2×     True   140        70.6      77.0       81.3        77.7         79.3
101    1×     False   43        62.1      71.4       78.2        73.6         78.0
101    1×     True    65        68.3      75.1       80.6        76.3         79.6
101    2×     False  170        69.1      75.8       80.7        77.0         78.9
101    2×     True   257        73.2      78.8       82.4        79.0         80.1
152    1×     False   58        64.0      73.0       79.3        74.5         78.3
152    1×     True    89        70.0      76.5       81.3        77.2         79.9
152    2×     False  233        70.2      76.6       81.1        77.4         79.1
152    2×     True   354        74.2      79.4       82.9        79.4         80.4
152    3×     True   795        74.9      80.1       83.1        79.8         80.5
Table 1: Top-1 accuracy of fine-tuning SimCLRv2 (on varied label fractions) or training a linear classifier on the ResNet output. The supervised baselines are trained from scratch using all labels for 90 epochs. The parameter count only includes the ResNet up to the final average pooling layer. For fine-tuning results with 1% and 10% labeled examples, the models include additional non-linear projection layers, which incur additional parameters (4M for 1× models and 17M for 2× models). See Table G.1 for top-5 accuracy.
(a) Supervised
(b) Semi-supervised
(c) Semi-supervised (y-axis zoomed)
Figure 4: Top-1 accuracy for supervised vs semi-supervised (SimCLRv2 fine-tuned) models of varied sizes on different label fractions. ResNets with depths of 50, 101, 152 and width multipliers of 1×, 2× (w/o SK) are presented here. For supervised models on 1%/10% labels, AutoAugment Cubuk et al. (2019) and label smoothing Szegedy et al. (2016) are used. Increasing the size of SimCLRv2 models by 10×, from ResNet-50 to ResNet-152 (2×), improves label efficiency by 10×.

In order to study the effectiveness of big models, we train ResNet models by varying width and depth, as well as whether or not to use selective kernels (SK) Li et al. (2019). (Although we do not use grouped convolution in this work, we believe it can further improve parameter efficiency.) Whenever SK is used, we also use the ResNet-D He et al. (2019b) variant of ResNet. The smallest model is the standard ResNet-50, and the biggest model is ResNet-152 (3×+SK).

Table 1 compares self-supervised learning and supervised learning under different model sizes and evaluation protocols, including both fine-tuning and linear evaluation. We can see that increasing width and depth, as well as using SK, all improve the performance. These architectural manipulations have relatively limited effects for standard supervised learning (a 4% difference between the smallest and largest models), but for self-supervised models, accuracy can differ by as much as 8% for linear evaluation and 17% for fine-tuning on 1% of labeled images. We also note that ResNet-152 (3×+SK) is only marginally better than ResNet-152 (2×+SK), though the parameter count is almost doubled, suggesting that the benefits of width may have plateaued.

Figure 4 shows the performance as model size and label fraction vary. These results show that bigger models are more label-efficient for both supervised and semi-supervised learning, but gains appear to be larger for semi-supervised learning (more discussions in Appendix A). Furthermore, it is worth pointing out that although bigger models are better, some models (e.g. with SK) are more parameter efficient than others (Appendix B), suggesting that searching for better architectures is helpful.

3.3 Bigger/Deeper Projection Heads Improve Representation Learning

(a) Effect of projection head’s depth when fine-tuning from optimal middle layer.
(b) Effect of fine-tuning from middle of a 3-layer projection head (0 is SimCLR).
Figure 5: Top-1 accuracy via fine-tuning under different projection head settings and label fractions (using ResNet-50).

To study the effects of the projection head for fine-tuning, we pretrain ResNet-50 using SimCLRv2 with different numbers of projection head layers (from 2 to 4 fully connected layers), and examine performance when fine-tuning from different layers of the projection head. We find that using a deeper projection head during pretraining is better when fine-tuning from the optimal layer of the projection head (Figure 5(a)), and this optimal layer is typically the first layer of the projection head rather than the input (0th layer), especially when fine-tuning on fewer labeled examples (Figure 5(b)).

It is also worth noting that when using bigger ResNets, the improvements from having a deeper projection head are smaller (see Appendix E). In our experiments, wider ResNets also have wider projection heads, since the width multiplier is applied to both. Thus, it is possible that increasing the depth of the projection head has limited effect when the projection head is already relatively wide.

When varying architecture, the accuracy of fine-tuned models is correlated with the accuracy of linear evaluation (see Appendix C). Although we use the input of the projection head for linear classification, we find that the correlation is higher when fine-tuning from the optimal middle layer of the projection head than when fine-tuning from the projection head input.

3.4 Distillation Using Unlabeled Data Improves Semi-Supervised Learning

Distillation typically involves both a distillation loss that encourages the student to match a teacher and an ordinary supervised cross-entropy loss on the labels (Eq. 3). In Table 2, we demonstrate the importance of using unlabeled examples when training with the distillation loss. Furthermore, using the distillation loss alone (Eq. 2) works almost as well as balancing distillation and label losses (Eq. 3) when the labeled fraction is small. For simplicity, Eq. 2 is our default for all other experiments.

Method                                                        Label fraction
                                                              1%      10%
Label only                                                    12.3    52.0
Label + distillation loss (on labeled set)                    23.6    66.2
Label + distillation loss (on labeled+unlabeled sets)         69.0    75.1
Distillation loss (on labeled+unlabeled sets; our default)    68.9    74.3
Table 2: Top-1 accuracy of a ResNet-50 trained on different types of targets. For distillation, the teacher is ResNet-50 (2×+SK), and the temperature is set to 1.0. The distillation loss (Eq. 2) does not use label information. Neither strong augmentation nor extra regularization is used.

Distillation with unlabeled examples improves fine-tuned models in two ways, as shown in Figure 6: (1) when the student model has a smaller architecture than the teacher model, it improves the model efficiency by transferring task-specific knowledge to a student model, (2) even when the student model has the same architecture as the teacher model (excluding the projection head after ResNet encoder), self-distillation can still meaningfully improve the semi-supervised learning performance. To obtain the best performance for smaller ResNets, the big model is self-distilled before distilling it to smaller models.

(a) Label fraction 1%
(b) Label fraction 10%
Figure 6: Top-1 accuracy of distilled SimCLRv2 models compared to the fine-tuned models as well as supervised learning with all labels. The self-distilled student has the same ResNet as the teacher (without the MLP projection head). The distilled student is trained using the self-distilled ResNet-152 (2×+SK) model, which is the largest model included in this figure.
Method                                                      Architecture          Top-1 (1%)  Top-1 (10%)  Top-5 (1%)  Top-5 (10%)
Supervised baseline Zhai et al. (2019)                      ResNet-50             25.4        56.4         48.4        80.4
Methods using unlabeled data in a task-specific way:
Pseudo-label Lee; Zhai et al. (2019)                        ResNet-50             -           -            51.6        82.4
VAT+Entropy Min. Grandvalet and Bengio (2005); Miyato et al. (2018); Zhai et al. (2019)    ResNet-50    -    -    47.0    83.4
UDA (w. RandAug) Xie et al. (2019b)                         ResNet-50             -           68.8         -           88.5
FixMatch (w. RandAug) Sohn et al. (2020)                    ResNet-50             -           71.5         -           89.1
S4L (Rot+VAT+Entropy Min.) Zhai et al. (2019)               ResNet-50 (4×)        -           73.2         -           91.2
MPL (w. RandAug) Pham et al. (2020)                         ResNet-50             -           73.8         -           -
Methods using unlabeled data in a task-agnostic way:
BigBiGAN Donahue and Simonyan (2019)                        RevNet-50 (4×)        -           -            55.2        78.8
PIRL Misra and van der Maaten (2019)                        ResNet-50             -           -            57.2        83.8
CPC v2 Hénaff et al. (2019)                                 ResNet-161            52.7        73.1         77.9        91.2
SimCLR Chen et al. (2020a)                                  ResNet-50             48.3        65.6         75.5        87.8
SimCLR Chen et al. (2020a)                                  ResNet-50 (4×)        63.0        74.4         85.8        92.6
Methods using unlabeled data in both ways:
SimCLRv2 distilled (ours)                                   ResNet-50             73.9        77.5         91.5        93.4
SimCLRv2 distilled (ours)                                   ResNet-50 (2×+SK)     75.9        80.2         93.0        95.0
SimCLRv2 self-distilled (ours)                              ResNet-152 (3×+SK)    76.6        80.9         93.4        95.5
Table 3: ImageNet accuracy of models trained under semi-supervised settings. For our methods, we report results with distillation after fine-tuning. For our smaller models, we use self-distilled ResNet-152 (3×+SK) as the teacher.

We compare our best models with previous state-of-the-art semi-supervised learning methods on ImageNet in Table 3. Our approach greatly improves upon previous results, for both small and big ResNet variants.

4 Related work

Task-agnostic use of unlabeled data. Unsupervised or self-supervised pretraining followed by supervised fine-tuning on a few labeled examples has been extensively used in natural language processing Kiros et al. (2015); Dai and Le (2015); Radford et al. (2018); Peters et al. (2018); Devlin et al. (2018), but has only shown promising results in computer vision very recently Hénaff et al. (2019); He et al. (2019a); Chen et al. (2020a). Our work builds upon recent success on contrastive learning of visual representations Dosovitskiy et al. (2014); Oord et al. (2018); Wu et al. (2018); Bachman et al. (2019); Hénaff et al. (2019); Tian et al. (2019); Misra and van der Maaten (2019); He et al. (2019a); Li et al. (2020); Tian et al. (2020); Chen et al. (2020a), a sub-area within self-supervised learning. These contrastive learning based approaches learn representations in a discriminative fashion instead of a generative one as in Hinton et al. (2006); Kingma and Welling (2013); Goodfellow et al. (2014); Donahue and Simonyan (2019); Chen et al. (2019). There are other approaches to self-supervised learning that are based on handcrafted pretext tasks Doersch et al. (2015); Zhang et al. (2016); Noroozi and Favaro (2016); Gidaris et al. (2018); Kolesnikov et al. (2019a). We also note a concurrent work on advancing self-supervised pretraining without using negative examples Grill et al. (2020), which we compare against in Appendix G. Our work also extends the “unsupervised pretrain, supervised fine-tune” paradigm by combining it with (self-)distillation Buciluǎ et al. (2006); Hinton et al. (2015); Lee using unlabeled data.

Task-specific use of unlabeled data. Aside from the representation learning paradigm, there is a large and diverse set of approaches for semi-supervised learning; we refer readers to Chapelle et al. (2006); Zhu and Goldberg (2009); Oliver et al. (2018) for surveys of classical approaches. Here we only review methods closely related to ours (especially within computer vision). One family of highly relevant methods is based on pseudo-labeling Lee; Sohn et al. (2020) or self-training Xie et al. (2019a); Yalniz et al. (2019). The main differences between these methods and ours are that our initial / teacher model is trained using SimCLRv2 (with unsupervised pretraining and supervised fine-tuning), and that the student models can also be smaller than the initial / teacher model. Furthermore, we use temperature scaling instead of confidence-based thresholding, and we do not use strong augmentation for training the student. Another family of methods is based on label consistency regularization Bachman et al. (2014); Laine and Aila (2016); Sajjadi et al. (2016); Tarvainen and Valpola (2017); Xie et al. (2019b); Berthelot et al. (2019); Verma et al. (2019); Sohn et al. (2020), where unlabeled examples are directly used as a regularizer to encourage task prediction consistency. Although our SimCLRv2 pretraining maximizes the agreement/consistency of representations of the same image under different augmented views, no supervised label is utilized in our contrastive loss, a crucial difference from label consistency losses.

5 Discussion

In this work, we present a simple framework for semi-supervised ImageNet classification in three steps: unsupervised pretraining, supervised fine-tuning, and distillation with unlabeled data. Although similar approaches are common in NLP, we demonstrate that this approach can also be a surprisingly strong baseline for semi-supervised learning in computer vision, outperforming the state-of-the-art by a large margin.

We observe that bigger models can produce larger improvements with fewer labeled examples. The effectiveness of big models has been demonstrated for supervised learning Sun et al. (2017); Hestness et al. (2017); Kaplan et al. (2020); Mahajan et al. (2018), fine-tuning supervised models on a few examples Kolesnikov et al. (2019b), and unsupervised learning on language Devlin et al. (2018); Raffel et al. (2019); Radford et al.; Brown et al. (2020). However, it is still somewhat surprising that bigger models, which could easily overfit with few labeled examples, can generalize much better. With task-agnostic use of unlabeled data, we conjecture that bigger models can learn more general features, which increases the chances of learning task-relevant features. However, further work is needed to gain a better understanding of this phenomenon. Beyond model size, we also see increasing parameter efficiency as another important dimension of improvement.

Although big models are important for pretraining and fine-tuning, given a specific task, such as classifying images into 1000 ImageNet classes, we demonstrate that task-agnostically learned general representations can be distilled into a more specialized and compact network using unlabeled examples. We simply use the teacher to impute labels for the unlabeled examples for this purpose, without using noise, augmentation, confidence thresholding, or consistency regularization. When the student network has the same or similar architecture as the teacher, this process can consistently improve the classification performance. We believe our framework can benefit from better approaches to leverage the unlabeled data for improving and transferring task-specific knowledge.

6 Broader Impact

The findings described in this paper can potentially be harnessed to improve accuracy in any application of computer vision where it is more expensive or difficult to label additional data than to train larger models. Some such applications are clearly beneficial to society. For example, in medical applications where acquiring high-quality labels requires careful annotation by clinicians, better semi-supervised learning approaches can potentially help save lives. Applications of computer vision to agriculture can increase crop yields, which may help to improve the availability of food. However, we also recognize that our approach could become a component of harmful surveillance systems. Moreover, there is an entire industry built around human labeling services, and technology that reduces the need for these services could lead to a short-term loss of income for some of those currently employed or contracted to provide labels.

Acknowledgements

We would like to thank David Berthelot, Han Zhang, Lala Li, Xiaohua Zhai, Lucas Beyer, and Alexander Kolesnikov for their helpful feedback on the draft. We are also grateful for general support from Google Research teams in Toronto and elsewhere.

References

Appendix A When Do Bigger Models Help More?

Figure A.1 shows the relative improvement from increasing the model size under different amounts of labeled examples. Both supervised learning and semi-supervised learning (i.e., SimCLRv2) seem to benefit from bigger models. The benefits are larger when (1) regularization techniques (such as augmentation and label smoothing) are used, or (2) the model is pretrained using unlabeled examples. It is also worth noting that these results may reflect a “ceiling effect”: as the performance gets closer to the ceiling, the improvement becomes smaller.

(a) Supervised
(b) Supervised (auto-augment+LS)
(c) Semi-supervised
Figure A.1: Relative improvement (top-1) when model size is increased: (a) supervised learning without extra regularization, (b) supervised learning with AutoAugment Cubuk et al. (2019) and label smoothing Szegedy et al. (2015) applied for 1%/10% label fractions, and (c) semi-supervised learning by fine-tuning SimCLRv2.

Appendix B Parameter Efficiency Also Matters

Figure B.1 shows the top-1 accuracy of fine-tuned SimCLRv2 models of different sizes. It shows that (1) bigger models are better, but (2) with SK Li et al. (2019), better performance can be achieved with the same parameter count. It is worth noting that, in this work, we do not leverage group convolution for SK Li et al. (2019) and we use only 3×3 kernels. We expect further improvement in terms of parameter efficiency if group convolution is utilized.

(a) Models without SK Li et al. (2019)
(b) Models with SK Li et al. (2019)
Figure B.1: Top-1 accuracy of fine-tuned SimCLRv2 models of different sizes on three label fractions. ResNets with depths in {50, 101, 152} and widths in {1×, 2×} are included here. Parameter efficiency also plays an important role. For fine-tuning on 1% of labels, SK is much more efficient.

Appendix C The Correlation Between Linear Evaluation and Fine-tuning

Most existing work Wu et al. (2018); Hénaff et al. (2019); Bachman et al. (2019); Misra and van der Maaten (2019); He et al. (2019a); Chen et al. (2020a) on self-supervised learning leverages linear evaluation as a main metric for evaluating representation quality, and it is not clear how it correlates with semi-supervised learning through fine-tuning. Here we further study the correlation between fine-tuning and linear evaluation (the linear classifier is trained on the ResNet output instead of some middle layer of the projection head). Figure C.1 shows the correlation under two different fine-tuning strategies: fine-tuning from the input of the projection head, or fine-tuning from a middle layer of the projection head. We observe that overall there is a linear correlation, and when fine-tuning from a middle layer of the projection head, the linear correlation is even stronger. Additionally, we notice that the slope of the correlation becomes smaller as the number of labeled images for fine-tuning increases.

Figure C.1: The effect of the projection head on the correlation between fine-tuning and linear evaluation. When allowing fine-tuning from the middle of the projection head, the linear correlation becomes stronger. Furthermore, as the label fraction increases, the slope decreases. The points here are from variants of ResNets with depths in {50, 101, 152}, widths in {1×, 2×}, and with/without SK.

Appendix D The Impact of Memory

Figure D.1 shows the top-1 comparisons for SimCLRv2 models trained with or without memory (MoCo) He et al. (2019a). Memory provides modest advantages in terms of linear evaluation and fine-tuning with 1% of the labels; the improvement is around 1%. We believe the reason that memory only provides marginal improvement is that we already use a big batch size (i.e. 4096).

Figure D.1: Top-1 results of ResNet-50, ResNet-101, and ResNet-152 trained with or without memory.

Appendix E The Impact of Projection Head Under Different Model Sizes

(a) w/o SK (label fraction: 1%)
(b) w/o SK (label fraction: 10%)
(c) w/o SK (label fraction: 100%)
(d) w/ SK (label fraction: 1%)
(e) w/ SK (label fraction: 10%)
(f) w/ SK (label fraction: 100%)
Figure E.1: Top-1 fine-tuning performance under different projection head settings (number of layers included for fine-tuning) and model sizes. With fewer labeled examples, fine-tuning from the first layer of a 3-layer projection head is better, especially when the model is small. Points reflect ResNets with depths in {50, 101, 152} and width multipliers in {1×, 2×}. Networks in the first row are without SK, and those in the second row are with SK.

To understand the effects of projection head settings across model sizes, Figure E.1 shows effects of fine-tuning from different layers of 2- and 3-layer projection heads. These results confirm that with only a few labeled examples, pretraining with a deeper projection head and fine-tuning from a middle layer can improve the semi-supervised learning performance. The improvement is larger with a smaller model size.

Figure E.2 shows fine-tuning performance for different projection head settings of a ResNet-50 pretrained using SimCLRv2. Figure 5 in the main text is an aggregation of results from this Figure.

Figure E.2: Top-1 accuracy of ResNet-50 with different projection head settings. A deeper projection head helps more when fine-tuning from a middle layer of the projection head.

Appendix F Further Distillation Ablations

Figure F.1 shows the impact of the distillation weight (α) in Eq. 3 and of the temperature used for distillation. We see that distillation without actual labels (i.e., a distillation weight of 1.0) works on par with distillation that also uses actual labels. Furthermore, temperatures of 0.1 and 1.0 work similarly, but 2.0 is significantly worse. For our distillation experiments in this work, we use a temperature of 0.1 by default when the teacher is a fine-tuned model, and 1.0 otherwise.

(a) label fraction: 1%
(b) Label fraction: 10%
Figure F.1: Top-1 accuracy with different distillation weights (α) and temperatures (τ).

We further study the distillation performance with teachers that are fine-tuned using different projection head settings. More specifically, we pretrain two ResNet-50 (2×+SK) models, with two or three layers of projection head, and fine-tune from a middle layer. This gives us five different teachers, corresponding to different projection head settings. Unsurprisingly, as shown in Figure F.2, distillation performance is strongly correlated with the top-1 accuracy of the fine-tuned teacher. This suggests that a better fine-tuned model (measured by its top-1 accuracy), regardless of its projection head settings, is a better teacher for transferring task-specific knowledge to the student using unlabeled data.

(a) Label fraction: 1%
(b) Label fraction: 10%
Figure F.2: The strong correlation between teacher’s task performance and student’s task performance.

Appendix G Extra Results

Table G.1 shows top-5 accuracy of the fine-tuned SimCLRv2 (under different model sizes) on ImageNet.

Depth  Width  SK     Param (M)  F-T (1%)  F-T (10%)  F-T (100%)  Linear eval  Supervised
50     1×     False   24        82.5      89.2       93.3        90.4         93.3
50     1×     True    35        86.7      91.4       94.6        92.3         94.2
50     2×     False   94        87.4      91.9       94.8        92.7         93.9
50     2×     True   140        90.2      93.7       95.9        93.9         94.5
101    1×     False   43        85.2      90.9       94.3        91.7         93.9
101    1×     True    65        89.2      93.0       95.4        93.1         94.8
101    2×     False  170        88.9      93.2       95.6        93.4         94.4
101    2×     True   257        91.6      94.5       96.4        94.5         95.0
152    1×     False   58        86.6      91.8       94.9        92.4         94.2
152    1×     True    89        90.0      93.7       95.9        93.6         95.0
152    2×     False  233        89.4      93.5       95.8        93.6         94.5
152    2×     True   354        92.1      94.7       96.5        94.7         95.0
152    3×     True   795        92.3      95.0       96.6        94.9         95.1
Table G.1: Top-5 accuracy of fine-tuning SimCLRv2 (on varied label fractions) or training a linear classifier on the ResNet output. The supervised baselines are trained from scratch using all labels for 90 epochs. The parameter count only includes the ResNet up to the final average pooling layer. For fine-tuning results with 1% and 10% labeled examples, the models include additional non-linear projection layers, which incur additional parameters (4M for 1× models and 17M for 2× models).

Here we also include results for a concurrent work, BYOL Grill et al. (2020), that also significantly improves previous state-of-the-art in self-supervised learning He et al. (2019a); Tian et al. (2020); Chen et al. (2020a). We were not aware of BYOL Grill et al. (2020) until it appeared on arXiv one day before we released our paper. Table G.2 extends our comparisons of semi-supervised learning settings and includes this concurrent work.

Method                                                      Architecture          Top-1 (1%)  Top-1 (10%)  Top-5 (1%)  Top-5 (10%)
Supervised baseline Zhai et al. (2019)                      ResNet-50             25.4        56.4         48.4        80.4
Methods using unlabeled data in a task-specific way:
Pseudo-label Lee; Zhai et al. (2019)                        ResNet-50             -           -            51.6        82.4
VAT+Entropy Min. Grandvalet and Bengio (2005); Miyato et al. (2018); Zhai et al. (2019)    ResNet-50    -    -    47.0    83.4
Mean teacher Tarvainen and Valpola (2017)                   ResNeXt-152           -           -            -           90.9
UDA (w. RandAug) Xie et al. (2019b)                         ResNet-50             -           68.8         -           88.5
FixMatch (w. RandAug) Sohn et al. (2020)                    ResNet-50             -           71.5         -           89.1
S4L (Rot+VAT+Entropy Min.) Zhai et al. (2019)               ResNet-50 (4×)        -           73.2         -           91.2
MPL (w. RandAug) Pham et al. (2020)                         ResNet-50             -           73.8         -           -
Methods using unlabeled data in a task-agnostic way:
InstDisc Wu et al. (2018)                                   ResNet-50             -           -            39.2        77.4
BigBiGAN Donahue and Simonyan (2019)                        RevNet-50 (4×)        -           -            55.2        78.8
PIRL Misra and van der Maaten (2019)                        ResNet-50             -           -            57.2        83.8
CPC v2 Hénaff et al. (2019)                                 ResNet-161            52.7        73.1         77.9        91.2
SimCLR Chen et al. (2020a)                                  ResNet-50             48.3        65.6         75.5        87.8
SimCLR Chen et al. (2020a)                                  ResNet-50 (2×)        58.5        71.7         83.0        91.2
SimCLR Chen et al. (2020a)                                  ResNet-50 (4×)        63.0        74.4         85.8        92.6
BYOL Grill et al. (2020) (concurrent work)                  ResNet-50             53.2        68.8         78.4        89.0
BYOL Grill et al. (2020) (concurrent work)                  ResNet-200 (2×)       71.2        77.7         89.5        93.7
Methods using unlabeled data in both ways:
SimCLRv2 distilled (ours)                                   ResNet-50             73.9        77.5         91.5        93.4
SimCLRv2 distilled (ours)                                   ResNet-50 (2×+SK)     75.9        80.2         93.0        95.0
SimCLRv2 self-distilled (ours)                              ResNet-152 (3×+SK)    76.6        80.9         93.4        95.5
Table G.2: ImageNet accuracy of models trained under semi-supervised settings. For our methods, we report results with distillation after fine-tuning. For our smaller models, we use self-distilled ResNet-152 (3×+SK) as the teacher. This table includes more comparisons (including concurrent work Grill et al. (2020)).