Revisiting Self-Supervised Visual Representation Learning

by Alexander Kolesnikov, et al.

Unsupervised visual representation learning remains a largely unsolved problem in computer vision research. Among a large body of recently proposed approaches for unsupervised learning of visual representations, a class of self-supervised techniques achieves superior performance on many challenging benchmarks. A large number of pretext tasks for self-supervised learning have been studied, but other important aspects, such as the choice of convolutional neural network (CNN) architecture, have not received equal attention. Therefore, we revisit numerous previously proposed self-supervised models, conduct a thorough large-scale study and, as a result, uncover multiple crucial insights. We challenge a number of common practices in self-supervised visual representation learning and observe that standard recipes for CNN design do not always translate to self-supervised representation learning. As part of our study, we drastically boost the performance of previously proposed techniques and outperform previously published state-of-the-art results by a large margin.



1 Introduction

Automated computer vision systems have recently made drastic progress. Many models for tackling challenging tasks such as object recognition, semantic segmentation or object detection can now compete with humans on complex visual benchmarks [15, 48, 14]. However, the success of such systems hinges on a large amount of labeled data, which is not always available and is often prohibitively expensive to acquire. Moreover, these systems are tailored to specific scenarios, e.g. a model trained on the ImageNet (ILSVRC-2012) dataset [41] can only recognize 1000 semantic categories, and a model trained to perceive road traffic in daylight may not work in darkness [5, 4].

Figure 1: Quality of visual representations learned by various self-supervised learning techniques significantly depends on the convolutional neural network architecture that was used for solving the self-supervised learning task. In our paper we provide a large scale in-depth study in support of this observation and discuss its implications for evaluation of self-supervised models.

As a result, a large research effort is currently focused on systems that can adapt to new conditions without leveraging a large amount of expensive supervision. This effort includes recent advances in transfer learning, domain adaptation, semi-supervised, weakly-supervised and unsupervised learning. In this paper, we concentrate on self-supervised visual representation learning, which is a promising sub-class of unsupervised learning. Self-supervised learning techniques produce state-of-the-art unsupervised representations on standard computer vision benchmarks [11, 37, 3].

The self-supervised learning framework requires only unlabeled data in order to formulate a pretext learning task such as predicting context [7] or image rotation [11], for which a target objective can be computed without supervision. These pretext tasks must be designed in such a way that high-level image understanding is useful for solving them. As a result, the intermediate layers of convolutional neural networks (CNNs) trained for solving these pretext tasks encode high-level semantic visual representations that are useful for solving downstream tasks of interest, such as image recognition.

Most of the prior work, which aims at improving performance of self-supervised techniques, does so by proposing novel pretext tasks and showing that they result in improved representations. Instead, we propose to have a closer look at CNN architectures. We revisit a prominent subset of the previously proposed pretext tasks and perform a large-scale empirical study using various architectures as base models. As a result of this study, we uncover numerous crucial insights. The most important are summarized as follows:

  • Standard architecture design recipes do not necessarily translate from the fully-supervised to the self-supervised setting. Architecture choices which negligibly affect performance in the fully labeled setting, may significantly affect performance in the self-supervised setting.

  • In contrast to previous observations with the AlexNet architecture [11, 51, 34], the quality of learned representations in CNN architectures with skip-connections does not degrade towards the end of the model.

  • Increasing the number of filters in a CNN model and, consequently, the size of the representation significantly and consistently increases the quality of the learned visual representations.

  • The evaluation procedure, where a linear model is trained on a fixed visual representation using stochastic gradient descent, is sensitive to the learning rate schedule and may take many epochs to converge.

In Section 4 we present experimental results supporting the above observations and offer additional in-depth insights into the self-supervised learning setting. We make the code for reproducing our core experimental results publicly available.

In our study we obtain new state-of-the-art results for visual representations learned without labeled data. Interestingly, the context prediction technique [7], which sparked interest in self-supervised visual representation learning and serves as the baseline for follow-up research, outperforms all currently published results (among papers on self-supervised learning) if the appropriate CNN architecture is used.

2 Related Work

Self-supervision is a learning framework in which a supervised signal for a pretext task is created automatically, in an effort to learn representations that are useful for solving real-world downstream tasks. Being a generic framework, self-supervision enjoys a wide range of applications, ranging from robotics to image understanding.

In robotics, both the result of interacting with the world, and the fact that multiple perception modalities simultaneously get sensory inputs are strong signals which can be exploited to create self-supervised tasks [22, 44, 29, 10].

Similarly, when learning representation from videos, one can either make use of the synchronized cross-modality stream of audio, video, and potentially subtitles [38, 42, 26, 47], or of the consistency in the temporal dimension [44].

In this paper we focus on self-supervised techniques that learn from image databases. These techniques have demonstrated impressive results for learning high-level image representations. Inspired by unsupervised methods from the natural language processing domain that rely on predicting words from their context [31], Doersch et al. [7] proposed a practically successful pretext task of predicting the relative location of image patches. This work spawned a line of patch-based self-supervised visual representation learning methods. These include a model from [34] that predicts the permutation of a “jigsaw puzzle” created from the full image and recent follow-ups [32, 36].

In contrast to patch-based methods, some methods generate cleverly designed image-level classification tasks. For instance, in [11] Gidaris et al. propose to randomly rotate an image by one of four possible angles and let the model predict that rotation. Another way to create class labels is to use clustering of the images [3]. Yet another class of pretext tasks contains tasks with dense spatial outputs. Some prominent examples are image inpainting, image colorization [50], its improved variant split-brain [51] and motion segmentation prediction [39]. Other methods instead enforce structural constraints on the representation space. Noroozi et al. propose an equivariance relation to match the sum of multiple tiled representations to a single scaled representation [35]. The authors of [37] propose to predict future patches in representation space via autoregressive predictive coding.

Our work is complementary to the previously discussed methods, which introduce new pretext tasks, since we show how existing self-supervision methods can significantly benefit from our insights.

Finally, many works have tried to combine multiple pretext tasks in one way or another. For instance, Kim et al. extend the “jigsaw puzzle” task by combining it with colorization and inpainting in [25]. Combining the jigsaw puzzle task with clustering-based pseudo labels as in [3] leads to the method called Jigsaw++ [36]. Doersch and Zisserman [8] implement four different self-supervision methods and make one single neural network learn all of them in a multi-task setting.

The latter work is similar to ours since it contains a comparison of different self-supervision methods using a unified neural network architecture, but with the goal of combining all these tasks into a single self-supervision task. The authors use a modified ResNet101 architecture [16] without further investigation and explore the combination of multiple tasks, whereas our focus lies on investigating the influence of architecture design on the representation quality.

3 Self-supervised study setup

In this section we describe the setup of our study and motivate our key choices. We begin by introducing six CNN models in Section 3.1 and proceed by describing the four self-supervised learning approaches used in our study in Section 3.2. Subsequently, we define our evaluation metrics and datasets in Sections 3.3 and 3.4. Further implementation details can be found in the Supplementary Material.

3.1 Architectures of CNN models

A large part of the self-supervised techniques for visual representation learning uses the AlexNet [27] architecture. In our study, we investigate whether the landscape of self-supervision techniques changes when using modern network architectures. Thus, we employ variants of ResNet and a batch-normalized VGG architecture, all of which achieve high performance in the fully-supervised training setup. VGG is structurally close to AlexNet as it does not have skip-connections and uses fully-connected layers.

In our preliminary experiments, we observed an intriguing property of ResNet models: the quality of the representations they learn does not degrade towards the end of the network (see Section 4.5). We hypothesize that this is a result of skip-connections making residual units invertible under certain circumstances [2], hence facilitating the preservation of information across the depth even when it is irrelevant for the pretext task. Based on this hypothesis, we include RevNet [12] in our study, which comes with stronger invertibility guarantees while being structurally similar to ResNet.

ResNet  was introduced by He et al. [16], and we use the width-parametrization proposed in [49]: the first convolutional layer outputs 16 × k channels, where k is the widening factor, defaulting to 4. This is followed by a series of residual units of the form y := x + F(x), where F is a residual function consisting of multiple convolutions, ReLU non-linearities [33] and batch normalization layers [20]. The variant we use, ResNet50, consists of four blocks with 3, 4, 6, and 3 such units respectively, and we refer to the output of each block as block1, block2, etc. The network ends with a global spatial average pooling producing a vector of size 512 × k, which we call pre-logits as it is followed only by the final, task-specific logits layer. More details on this architecture are provided in [16].

In our experiments we explore k ∈ {4, 8, 12, 16}, resulting in pre-logits of size 2048, 4096, 6144, and 8192 respectively. For some self-supervised techniques we skip configurations that do not fit into memory.
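As a small sanity check of this width parametrization (a sketch assuming the 512 × k pre-logits rule stated above), the representation sizes follow directly from the widening factor:

```python
# Pre-logits dimensionality as a function of the widening factor k,
# assuming the 512 * k parametrization described above.
def prelogits_size(k: int, base: int = 512) -> int:
    return base * k

# Widening factors explored in the study (larger ones are skipped
# for pretext tasks where they do not fit into memory).
sizes = {k: prelogits_size(k) for k in (4, 8, 12, 16)}
```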

Moreover, we analyze the sensitivity of the self-supervised setting to underlying architectural details by using two variants of operation ordering known as ResNet v1 [16] and ResNet v2 [17], as well as a variant without the ReLU preceding the global average pooling, which we mark by a “(-)”. Notably, these variants perform similarly on the pretext task.

RevNet  slightly modifies the design of the residual unit such that it becomes analytically invertible [12]. We note that the residual unit used in [12] is equivalent to a double application of the residual unit from [21] or [6]. Thus, for conceptual simplicity, we employ the latter type of unit, which can be defined as follows. The input x is split channel-wise into two equal parts x1 and x2. The output y is then the concatenation of y1 := x2 and y2 := x1 + F(x2).

It is easy to see that this residual unit is invertible, because its inverse can be computed in closed form as x2 = y1 and x1 = y2 − F(y1).
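This invertibility can be sketched in a few lines of NumPy; here `residual_f` is only a toy stand-in (a linear map plus ReLU) for the convolutional residual function F used in the actual network:

```python
import numpy as np

def residual_f(x, w):
    # Toy stand-in for the residual function F
    # (multiple convolutions with batch norm in the actual model).
    return np.maximum(0.0, x @ w)

def revnet_unit(x, w):
    # Split the input channel-wise into two equal halves x1, x2,
    # then output the concatenation of y1 := x2 and y2 := x1 + F(x2).
    x1, x2 = np.split(x, 2, axis=-1)
    y1 = x2
    y2 = x1 + residual_f(x2, w)
    return np.concatenate([y1, y2], axis=-1)

def revnet_unit_inverse(y, w):
    # Closed-form inverse: x2 = y1 and x1 = y2 - F(y1).
    y1, y2 = np.split(y, 2, axis=-1)
    return np.concatenate([y2 - residual_f(y1, w), y1], axis=-1)
```

Composing `revnet_unit_inverse` with `revnet_unit` recovers the input exactly, which is the “stronger invertibility guarantee” referred to above.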

Apart from this slightly different residual unit, RevNet is structurally identical to ResNet, and thus we use the same overall architecture and nomenclature for both. In our experiments we use the RevNet50 network, which has the same depth and number of channels as the original ResNet50 model. In the fully labelled setting, RevNet performs only marginally worse than its architecturally equivalent ResNet.

VGG  as proposed in [45] consists of a series of convolutions followed by ReLU non-linearities, arranged into blocks separated by max-pooling operations. The VGG19 variant we use has 5 such blocks of 2, 2, 4, 4, and 4 convolutions respectively. We follow the common practice of adding batch normalization between the convolutions and non-linearities.

In an effort to unify the nomenclature with ResNets, we introduce the widening factor k such that k = 8 corresponds to the architecture in [45], i.e. the initial convolution produces 8 × k channels and the fully-connected layers have 512 × k channels. Furthermore, we call the inputs to the second, third, fourth, and fifth max-pooling operations block1 to block4, respectively, and the input to the last fully-connected layer pre-logits.

3.2 Self-supervised techniques

In this section we describe the self-supervised techniques that are used in our study.

Rotation [11]: Gidaris et al. propose to produce 4 copies of a single image by rotating it by {0°, 90°, 180°, 270°} and let a single network predict the rotation which was applied—a 4-class classification task. Intuitively, a good model should learn to recognize canonical orientations of objects in natural images.
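The input generation for this pretext task is simple enough to sketch directly (NumPy; `make_rotation_batch` is an illustrative helper name, not from the paper):

```python
import numpy as np

def make_rotation_batch(image):
    """Produce the 4 rotated copies of one H x W x C image and their labels."""
    copies = [np.rot90(image, k) for k in range(4)]  # 0, 90, 180, 270 degrees
    labels = np.arange(4)                            # the 4-class target
    return np.stack(copies), labels
```

A single network then receives each copy and is trained with a standard 4-way cross-entropy loss on these labels.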

Exemplar [9]: In this technique, every individual image corresponds to its own class, and multiple examples of it are generated by heavy random data augmentation such as translation, scaling, rotation, and contrast and color shifts. We use the data augmentation mechanism from [46]. [8] proposes to use the triplet loss [43, 18] in order to scale this pretext task to the large number of images (hence, classes) present in the ImageNet dataset. The triplet loss avoids explicit class labels and, instead, encourages examples of the same image to have representations that are close in the Euclidean space while also being far from the representations of different images. Example representations are given by a 1000-dimensional logits layer.
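The triplet loss itself has a compact form; the following is a per-example sketch on squared Euclidean distances (the margin value and exact formulation here are illustrative assumptions, not taken from [43, 18]):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pulls the positive (another augmentation of the same
    image) closer to the anchor than the negative, by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```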

Jigsaw [34]: the task is to recover the relative spatial position of 9 randomly sampled image patches after a random permutation of these patches was performed. All of these patches are sent through the same network, then their representations from the pre-logits layer are concatenated and passed through a fully-connected multi-layer perceptron (MLP) with two hidden layers, which needs to predict the permutation that was used. In practice, the fixed set of 100 permutations from [34] is used.

In order to avoid shortcuts relying on low-level image statistics such as chromatic aberration [34] or edge alignment, patches are sampled with a random gap between them. Each patch is then independently converted to grayscale with probability 2/3 and normalized to zero mean and unit standard deviation. More details on the preprocessing are provided in the Supplementary Material. After training, we extract representations by averaging the representations of nine uniformly sampled, colorful, and normalized patches of an image.

Relative Patch Location [7]: The pretext task consists of predicting the relative location of two given patches of an image. The model is similar to the Jigsaw one, but in this case the 8 possible relative spatial relations between two patches need to be predicted, e.g. “below” or “on the right and above”. We use the same patch preprocessing as in the Jigsaw model and also extract final image representations by averaging representations of 9 cropped patches.
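The 8-way label space can be written down directly; the (row, column) offset convention and the class ordering below are illustrative choices, not taken from [7]:

```python
# The 8 possible positions of the second patch relative to the first,
# as (row_offset, col_offset); the class ordering is an arbitrary convention.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def relative_position_label(offset):
    """Map a relative spatial offset to its 8-way class index."""
    return OFFSETS.index(offset)
```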

3.3 Evaluation of Learned Visual Representations

We follow common practice and evaluate the learned visual representations by using them for training a linear logistic regression model to solve multiclass image classification tasks requiring high-level scene understanding. These tasks are called downstream tasks. We extract the representation from the (frozen) network at the pre-logits level, but investigate other possibilities in Section 4.5.

In order to enable fast evaluation, we use an efficient convex optimization technique for training the logistic regression model unless specified otherwise. Specifically, we precompute the visual representation for all training images and train the logistic regression using L-BFGS [30].
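This evaluation step can be sketched as follows, using SciPy's L-BFGS-B optimizer on precomputed features (a minimal stand-in for the authors' implementation; the function name and regularization strength are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def train_linear_probe(features, labels, num_classes, l2=1e-4):
    """Multinomial logistic regression on frozen, precomputed features,
    trained with L-BFGS."""
    n, d = features.shape

    def loss_and_grad(w_flat):
        w = w_flat.reshape(d, num_classes)
        logits = features @ w
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        # Cross-entropy plus L2 regularization.
        loss = -np.mean(np.log(p[np.arange(n), labels] + 1e-12))
        loss += 0.5 * l2 * np.sum(w * w)
        dp = p
        dp[np.arange(n), labels] -= 1.0
        grad = features.T @ dp / n + l2 * w
        return loss, grad.ravel()

    res = minimize(loss_and_grad, np.zeros(d * num_classes),
                   jac=True, method="L-BFGS-B")
    return res.x.reshape(d, num_classes)
```

Because the features are frozen and precomputed, this convex problem converges deterministically, avoiding the learning-rate-schedule sensitivity of SGD discussed in Section 4.7.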

For consistency and fair evaluation, when comparing to the prior literature in Table 2, we opt for using stochastic gradient descent (SGD) with momentum and use data augmentation during training.

We further investigate this common evaluation scheme in Section 4.3, where we use a more expressive model: an MLP with a single hidden layer of 1000 channels followed by a ReLU non-linearity. More details are given in the Supplementary Material.

3.4 Datasets

In our experiments, we consider two widely used image classification datasets: ImageNet and Places205.

ImageNet contains roughly 1.3 million natural images that represent 1000 various semantic classes. There are 50,000 images in the official validation and test sets, but since the official test set is held private, results in the literature are reported on the validation set. In order to avoid overfitting to the official validation split, we report numbers on our own validation split (50,000 random images held out from the training split) for all our studies except in Table 2, where for a fair comparison with the literature we evaluate on the official validation set.
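Holding out such a custom validation split can be sketched in a few lines (the sizes below are placeholders, not the actual dataset sizes):

```python
import numpy as np

def split_train_val(num_examples, val_size, seed=0):
    """Hold out `val_size` random examples from the training set as a custom
    validation split, leaving the official validation set untouched."""
    perm = np.random.default_rng(seed).permutation(num_examples)
    return perm[val_size:], perm[:val_size]
```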

The Places205 dataset consists of roughly 2.5 million images depicting 205 different scene types such as airfield, kitchen, coast, etc. This dataset is qualitatively different from ImageNet and, thus, a good candidate for evaluating how well the learned representations generalize to new unseen data of a different nature. We follow the same procedure as for ImageNet regarding validation splits for the same reasons.

4 Experiments and Results

In this section we present and interpret results of our large-scale study. All self-supervised models are trained on ImageNet (without labels) and consequently evaluated on our own hold-out validation splits of ImageNet and Places205. Only in Table 2, when we compare to the results from the prior literature, we use the official ImageNet and Places205 validation splits.

Model             Rotation                   Exemplar             RelPatchLoc     Jigsaw
                  4x    8x    12x   16x      4x    8x    12x      4x    8x       4x    8x
RevNet50          47.3  50.4  53.1  53.7     42.4  45.6  46.4     40.6  45.0     40.1  43.7
ResNet50 v2       43.8  47.5  47.2  47.6     43.0  45.7  46.6     42.2  46.7     38.4  41.3
ResNet50 v1       41.7  43.4  43.3  43.2     42.8  46.9  47.7     46.8  50.5     42.2  45.4
RevNet50 (-)      45.2  51.0  52.8  53.7     38.0  42.6  44.3     33.8  43.5     36.1  41.5
ResNet50 v2 (-)   38.6  44.5  47.3  48.2     33.7  36.7  38.2     38.6  43.4     32.5  34.4
VGG19-BN          16.8  14.6  16.6  22.7     26.4  28.3  29.0     28.5  29.4     19.8  21.1
Table 1: Evaluation of representations from self-supervised techniques based on various CNN architectures. The scores are accuracies (in %) of a linear logistic regression model trained on top of these representations using the ImageNet training split. Our validation split is used for computing accuracies. The architectures marked by a “(-)” are slight variations described in Section 3.1. Sub-columns such as 4x correspond to widening factors k. Top-performing architectures in a column are bold; the best pretext task for each model is underlined.

4.1 Evaluation on ImageNet and Places205

In Table 1 we highlight our main evaluation results: we measure the representation quality produced by six different CNN architectures with various widening factors (Section 3.1), trained using four self-supervised learning techniques (Section 3.2). We use the pre-logits of the trained self-supervised networks as representations. We follow the standard evaluation protocol (Section 3.3), which measures representation quality as the accuracy of a linear logistic regression model trained and evaluated on the ImageNet dataset.

Figure 2: Different network architectures perform significantly differently across self-supervision tasks. This observation generalizes across datasets: ImageNet evaluation is shown on the left and Places205 is shown on the right.


                            ImageNet        Places205
                            Prev.  Ours     Prev.  Ours
A  Rotation [11]            38.7   55.4     35.1   48.0
R  Exemplar [8]             31.5   46.0     -      42.7
R  Rel. Patch Loc. [8]      36.2   51.4     -      45.3
A  Jigsaw [34, 51]          34.7   44.6     35.5   42.2
V  CC+vgg-Jigsaw++ [36]     37.3   -        37.5   -
A  Counting [35]            34.3   -        36.3   -
A  Split-Brain [51]         35.4   -        34.1   -
V  DeepClustering [3]       41.0   -        39.8   -
R  CPC [37]                 48.7†  -        -      -
R  Supervised RevNet50      74.8   74.4     -      58.9
R  Supervised ResNet50 v2   76.0   75.8     -      61.6
V  Supervised VGG19         72.7   75.0     58.9   61.5
† marks results reported in unpublished manuscripts.
Table 2: Comparison of the published self-supervised models to our best models. The scores correspond to accuracy of linear logistic regression that is trained on top of representations provided by self-supervised models. Official validation splits of ImageNet and Places205 are used for computing accuracies. The “Family” column shows which basic model architecture was used in the referenced literature: AlexNet (A), VGG-style (V), or Residual (R).

Now we discuss key insights that can be learned from the table and that motivate our further in-depth analysis. First, we observe that similar models often result in visual representations with significantly different performance. Importantly, neither is the ranking of architectures consistent across different methods, nor is the ranking of methods consistent across architectures. For instance, the RevNet50 model excels under Rotation self-supervision, but is not the best model in other scenarios. Similarly, relative patch location seems to be the best method when basing the comparison on the ResNet50 v1 architecture, but not otherwise. Notably, VGG19-BN consistently demonstrates the worst performance, even though it achieves performance similar to ResNet50 models on standard vision benchmarks [45]. VGG19-BN performs better when representations from layers earlier than the pre-logits layer are used, though it still falls short; we investigate this in Section 4.5. We depict the performance of the models with the largest widening factor in Figure 2 (left), which displays these ranking inconsistencies.

Our second observation is that increasing the number of channels in CNN models improves performance of self-supervised models. While this finding is in line with the fully-supervised setting [49], we note that the benefit is more pronounced in the context of self-supervised representation learning, a fact not yet acknowledged in the literature.

We further evaluate how visual representations trained in a self-supervised manner on ImageNet generalize to other datasets. Specifically, we evaluate all our models on the Places205 dataset using the same evaluation protocol. The performance of models with the largest widening factor is reported in Figure 2 (right) and the full result table is provided in the Supplementary Material. We observe the following pattern: the ranking of models evaluated on Places205 is consistent with that of models evaluated on ImageNet, indicating that our findings generalize to new datasets.

4.2 Comparison to prior work

In order to put our findings in context, we select the best model for each self-supervision task from Table 1 and compare them to the numbers reported in the literature. For this experiment only, we precisely follow the standard protocol by training the linear model with stochastic gradient descent (SGD) on the full ImageNet training split and evaluating it on the public validation set of both ImageNet and Places205. We note that in this case the learning rate schedule of the evaluation plays an important role, on which we elaborate in Section 4.7.

Figure 3: Comparing linear evaluation of the representations to non-linear evaluation, i.e. training a multi-layer perceptron instead of a linear model. Linear evaluation is not limiting: conclusions drawn from it carry over to the non-linear evaluation.

Table 2 summarizes our results. Surprisingly, as a result of selecting the right architecture for each self-supervision task and increasing the widening factor, our models significantly outperform previously reported results. Notably, context prediction [7], one of the earliest published methods, achieves 51.4% top-1 accuracy on ImageNet. Our strongest model, using Rotation, attains an unprecedentedly high accuracy of 55.4%. Similar observations hold when evaluating on Places205.

Importantly, our design choices result in almost halving the gap between previously published self-supervised results and fully-supervised results on two standard benchmarks. Overall, these results reinforce our main insight that in self-supervised learning the choice of architecture matters as much as the choice of pretext task.

4.3 A linear model is adequate for evaluation.

Using a linear model for evaluating the quality of a representation requires that the information relevant to the evaluation task is linearly separable in representation space. This is not necessarily a prerequisite for a “useful” representation. Furthermore, using a more powerful model in the evaluation procedure might make the architecture choice for a self-supervised task less important. Hence, we consider an alternative evaluation scenario where we use a multi-layer perceptron (MLP) for solving the evaluation task, details of which are provided in Supplementary Material.

Figure 3 clearly shows that the MLP provides only marginal improvement over the linear evaluation and the relative performance of various settings is mostly unchanged. We thus conclude that the linear model is adequate for evaluation purposes.

4.4 Better performance on the pretext task does not always translate to better representations.

Figure 4: A look at how predictive pretext performance is of eventual downstream performance. Colors correspond to the architectures in Figure 3 and circle size to the widening factor k. Within an architecture, pretext performance is somewhat predictive, but it is not so across architectures. For instance, according to pretext accuracy, the widest VGG model is the best one for Rotation, but it performs poorly on the downstream task.

In many potential applications of self-supervised methods, we do not have access to downstream labels for evaluation. In that case, how can a practitioner decide which model to use? Is performance on the pretext task a good proxy?

Figure 5: Evaluating the representation from various depths within the network. The vertical axis corresponds to downstream ImageNet performance in percent. For residual architectures, the pre-logits are always best.

In Figure 4 we plot the performance on the pretext task against the evaluation on ImageNet. It turns out that performance on the pretext task is a good proxy only once the model architecture is fixed, but it can unfortunately not be used to reliably select the model architecture. Other label-free mechanisms for model-selection need to be devised, which we believe is an important and underexplored area for future work.

4.5 Skip-connections prevent degradation of representation quality towards the end of CNNs.

We are interested in how representation quality depends on the layer choice and how skip-connections affect this dependency. Thus, we evaluate representations from five intermediate layers in three models: ResNet50 v2, RevNet50 and VGG19-BN. The results are summarized in Figure 5.

Figure 6: Disentangling the performance contribution of network widening factor versus representation size. Both matter independently, and larger is always better. Scores are accuracies of logistic regression on ImageNet. Black squares mark models which are also present in Table 1.

Similar to prior observations [11, 51, 34] for AlexNet [28], the quality of representations in VGG19-BN deteriorates towards the end of the network. We believe that this happens because the models specialize to the pretext task in the later layers and, consequently, discard more general semantic features present in the middle layers.

In contrast, we observe that this is not the case for models with skip-connections: representation quality in ResNet consistently increases up to the final pre-logits layer. We hypothesize that this is a result of ResNet’s residual units being invertible under some conditions [2]. Invertible units preserve all information learned in intermediate layers, and, thus, prevent deterioration of representation quality.

We further test this hypothesis by using the RevNet model, which has stronger invertibility guarantees. Indeed, it boosts performance considerably on the Rotation task, although it does not result in improvements across the other tasks. We leave identifying further scenarios where RevNet models result in a significant boost of performance for future research.

4.6 Model width and representation size strongly influence the representation quality.

Table 1 shows that using a wider network architecture consistently leads to better representation quality. It should be noted that increasing the network’s width has the side-effect of also increasing the dimensionality of the final representation (Section 3.1). Hence, it is unclear whether the increase in performance is due to increased network capacity or to the use of higher-dimensional representations, or to the interplay of both.

In order to answer this question, we take the best rotation model (RevNet50) and disentangle the network width from the representation size by adding an additional linear layer to control the size of the pre-logits layer. We then vary the widening factor and the representation size independently of each other, training each model from scratch on ImageNet with the Rotation pretext task. The results, evaluated on the ImageNet classification task, are shown in Figure 6. In essence, it is possible to increase performance by increasing either the model capacity or the representation size, but increasing both jointly helps most. Notably, one can significantly boost the performance of a very thin model by increasing the representation size alone.

Figure 7: Performance of the best models evaluated using all data as well as a subset of the data. The trend is clear: increased widening factor increases performance across the board.

Low-data regime.  In principle, the effectiveness of increasing model capacity and representation size might only hold on relatively large datasets for downstream evaluation, and might hurt representation usefulness in the low-data regime. In Figure 7, we depict how the number of channels affects the evaluation using both the full and heavily subsampled ImageNet and Places205 datasets.

We observe that increasing the widening factor consistently boosts performance in both the full- and low-data regimes. We present more low-data evaluation experiments in the Supplementary Material. This suggests that self-supervised learning techniques are likely to benefit from using CNNs with an increased number of channels across a wide range of scenarios.

4.7 SGD for training the linear model takes a long time to converge

In this section we investigate the importance of the SGD optimization schedule for training logistic regression in downstream tasks. We illustrate our findings for linear evaluation of the Rotation task; the other tasks behave similarly and are provided in the Supplementary Material.

We train the linear evaluation models with a mini-batch size of 2048 and an initial learning rate of 0.1, which we decay twice by a factor of 10. Our initial experiments suggest that the timing of the first decay has a large influence on the final accuracy. Thus, we vary the moment of the first decay, applying it after 30, 120 or 480 epochs. After this first decay, we train for 40 extra epochs, with a second decay after the first 20.
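The resulting step schedule can be written down directly from the description above: a base learning rate of 0.1, a 10x decay at the chosen first-decay epoch, and another 10x decay 20 epochs later, for a total of `first_decay + 40` epochs.

```python
def eval_lr(epoch, first_decay, base_lr=0.1):
    """Step schedule for linear evaluation: decay by 10x at `first_decay`
    epochs and again 20 epochs later; training runs for a total of
    `first_decay + 40` epochs."""
    lr = base_lr
    if epoch >= first_decay:
        lr /= 10.0
    if epoch >= first_decay + 20:
        lr /= 10.0
    return lr

# e.g. with the longest schedule (first decay after 480 epochs):
# epochs [0, 480) -> 0.1, [480, 500) -> 0.01, [500, 520) -> 0.001
```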

Figure 8 depicts how accuracy on our validation split progresses depending on when the learning rate is first decayed. Surprisingly, we observe that very long training results in higher accuracy. Thus, we conclude that SGD optimization hyperparameters play an important role and need to be reported.

Figure 8: Downstream task accuracy curve of the linear evaluation model trained with SGD on representations from the Rotation task. The first learning rate decay starts after 30, 120 and 480 epochs. We observe that accuracy on the downstream task improves even after a very large number of epochs.

5 Conclusion

In this work, we have investigated self-supervised visual representation learning from previously unexplored angles. Doing so, we uncovered multiple important insights, namely that (1) lessons from architecture design in the fully-supervised setting do not necessarily translate to the self-supervised setting; (2) contrary to previously popular architectures like AlexNet, in residual architectures the final pre-logits layer consistently results in the best performance; (3) the widening factor of CNNs has a drastic effect on the performance of self-supervised techniques; and (4) SGD training of the linear logistic regression model may require a very long time to converge. In our study we demonstrated that the performance of existing self-supervision techniques can be consistently boosted, and that this halves the gap between self-supervision and fully labeled supervision.

Most importantly, though, we reveal that neither is the ranking of architectures consistent across different methods, nor is the ranking of methods consistent across architectures. This implies that pretext tasks for self-supervised learning should not be considered in isolation, but in conjunction with underlying architectures.


Appendix A Self-supervised model details

For training all self-supervised models we use stochastic gradient descent (SGD) with momentum. We train for 35 epochs in total and decay the learning rate by a factor of 10 after 15 and 25 epochs. As we use large mini-batch sizes during training, we leverage two recommendations from [13]: (1) linear learning-rate scaling, where the learning rate is scaled proportionally to the mini-batch size, and (2) a linear learning-rate warm-up during the initial 5 epochs.
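The full schedule can be sketched as below. The base learning rate is an assumption here (the exact value is not reproduced), and the `batch_size / 256` scaling rule follows the linear-scaling recommendation of Goyal et al. [13]; the warm-up and decay epochs come from the description above.

```python
def pretrain_lr(epoch, base_lr, batch_size):
    """Sketch of the pre-training schedule: linear warm-up over the first
    5 epochs, then 10x decays after epochs 15 and 25 (35 epochs total).
    base_lr is an assumed value; the batch_size / 256 scaling follows the
    linear-scaling rule of Goyal et al. [13]."""
    scaled = base_lr * batch_size / 256.0
    if epoch < 5:                        # linear warm-up
        return scaled * (epoch + 1) / 5.0
    lr = scaled
    if epoch >= 15:
        lr /= 10.0
    if epoch >= 25:
        lr /= 10.0
    return lr
```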

In the following we give additional details that are specific to the choice of self-supervised learning technique.

Rotation:  During training we use the data augmentation mechanism from [46]. We use mini-batches in which each image is repeated 4 times: once for every rotation. The model is trained on 128 TPU [24] cores.
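The construction of the four rotated training examples per image can be sketched as follows, with an image represented as a list of rows for simplicity:

```python
def rot90(img):
    """Rotate a 2D image (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def rotation_examples(img):
    """Each image yields 4 training examples, one per rotation, labelled
    0..3; the pretext task is to predict the rotation label."""
    examples = []
    current = img
    for label in range(4):
        examples.append((current, label))
        current = rot90(current)
    return examples
```

Applying `rot90` four times returns the original image, so the four labels cover all distinct rotations.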

Exemplar:  In order to generate image examples, we use the data augmentation mechanism from [46]. During training, for each image in a mini-batch we randomly generate 8 examples. We use an implementation of the triplet loss [43] from the tensorflow package [1] with a fixed margin parameter. We use 32 TPU cores for training.
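For reference, the core of the triplet loss can be sketched as follows. This is the basic margin-based form on embedding vectors, not the tensorflow implementation used in the paper, and the margin value shown is illustrative.

```python
def triplet_loss(anchor, positive, negative, margin=0.5):
    """Basic margin-based triplet loss on embedding vectors:
    max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance.
    The margin value is illustrative, not the paper's setting."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)
```

The loss pulls augmented examples of the same image (anchor/positive) together while pushing examples of other images (negatives) at least `margin` further away.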

Jigsaw:  Similar to [34], we preprocess the input images by: (1) resizing the input image and randomly cropping it; (2) converting the image to grayscale with probability 2/3 by averaging the color channels; (3) splitting the image into a regular 3×3 grid of cells and randomly cropping a patch inside every cell; (4) standardizing every patch individually such that its pixel intensities have zero mean and unit variance. We use SGD with a large mini-batch size. For each image individually, we randomly select 16 out of the 100 pre-defined permutations and apply all of them. The model is trained on 32 TPU cores.
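Step (4) of the preprocessing, per-patch standardization, can be sketched as:

```python
def standardize_patch(patch, eps=1e-8):
    """Standardize one patch (a 2D list of pixel intensities) so that it
    has zero mean and unit variance, as in step (4) of the Jigsaw
    preprocessing."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = var ** 0.5
    return [[(p - mean) / (std + eps) for p in row] for row in patch]
```

Standardizing each patch independently removes low-level color and luminance statistics that would otherwise give away the patch arrangement.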

Relative Patch Location: 

We use the same patch preprocessing, representation extraction and training setup as in the Jigsaw model. The only difference is the loss function, as discussed in the main text.

Appendix B Downstream training details

Training linear models with SGD:  For training linear models with SGD, we use a standard data augmentation technique in the Rotation and Exemplar cases: (1) resize the image, preserving its aspect ratio, such that its smallest side is 256; (2) apply a random crop. For the patch-based methods, we extract representations by averaging the representations of all nine standardized color patches of an image. At final evaluation time, fixed patches are obtained by scaling the image, cropping the central region and taking a regular 3×3 grid of patches from it.
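The patch-averaging step amounts to an element-wise mean over the nine per-patch representation vectors; a minimal sketch:

```python
def image_representation(patch_representations):
    """For patch-based pretext tasks, the image-level representation is the
    element-wise mean of the per-patch representation vectors (nine patches
    in the 3x3 grid case)."""
    n = len(patch_representations)
    dim = len(patch_representations[0])
    return [sum(rep[i] for rep in patch_representations) / n
            for i in range(dim)]
```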

We use one batch size for evaluating representations from Rotation and Exemplar models and another for Jigsaw and Relative Patch Location models. As we use large mini-batch sizes, we perform learning-rate scaling, as suggested in [13].

Training linear models with L-BFGS:  We use a publicly available implementation of the L-BFGS algorithm [30] from the scipy [23] package with the default parameters and set the maximum number of updates to 800. For training all our models we apply an L2 penalty on the weight matrix, normalized by the size of the representation and the number of classes, with a fixed regularization coefficient.
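The penalty term can be sketched as below; the normalization by representation size times number of classes matches the description above, while the value of the coefficient `lam` is not reproduced here and is an assumption.

```python
def l2_penalty(W, lam):
    """L2 penalty on the weight matrix W (representation size d by number
    of classes c), normalized by d * c.  The coefficient `lam` is an
    assumed placeholder, not the paper's value."""
    d, c = len(W), len(W[0])
    sq = sum(w * w for row in W for w in row)
    return lam * sq / (d * c)
```

This term is simply added to the multinomial logistic regression loss that L-BFGS minimizes.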

Training MLP models with SGD:  In the MLP evaluation scenario, we use a single hidden layer. At training time, we apply dropout [19] to the hidden layer with a drop rate of 50%. The regularization scheme is the same as in the L-BFGS setting. We optimize the MLP model using stochastic gradient descent with momentum (the momentum coefficient is 0.9) for 180 epochs. The batch size is 512, the initial learning rate is 0.01, and we decay it twice by a factor of 10: after 60 and 120 epochs.
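The dropout applied to the hidden layer can be sketched in its standard inverted form, in which surviving units are rescaled so that expected activations are unchanged at test time:

```python
import random

def dropout(x, rate, rng=random):
    """Inverted dropout: zero each unit with probability `rate` and scale
    survivors by 1 / (1 - rate), so the expected activation is unchanged
    and no rescaling is needed at test time."""
    if rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]
```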

Appendix C Training linear models with SGD

In Figure 9 we demonstrate how accuracy on the validation data progresses during the course of SGD optimization. We observe that in all cases achieving top accuracy requires training for a very large number of epochs.

Appendix D More Results on Places205 and ImageNet

For completeness, we present full result tables for various settings considered in the main paper. These include numbers for ImageNet evaluated on a subset of the data (Table 3), as well as all results when evaluating on the Places205 dataset (Table 4) and on a random subset of the Places205 dataset (Table 5).

Finally, Table 6 is an extended version of Table 2 in the main paper, additionally providing the top-5 accuracies of our various best models on the public ImageNet validation set.

Model Rotation Exemplar RelPatchLoc Jigsaw
RevNet50 31.3 34.6 37.9 38.4 27.1 30.0 31.1 24.6 27.8 25.0 24.2
ResNet50 v2 28.2 31.7 32.2 33.3 28.3 30.1 31.2 25.8 29.4 23.3 24.1
ResNet50 v1 26.8 27.2 27.4 27.8 28.7 30.8 31.7 30.2 33.2 26.4 28.3
RevNet50 (-) 30.2 32.3 33.3 33.4 25.7 26.3 26.4 21.6 25.0 24.1 24.9
ResNet50 v2 (-) 28.4 28.6 28.2 28.5 26.5 27.3 27.3 26.1 26.3 23.9 23.1
VGG19-BN 08.8 06.7 07.6 13.1 16.6 17.7 18.2 15.8 16.8 10.6 10.7
Table 3: Evaluation on ImageNet with a subset of the data.
Model Rotation Exemplar RelPatchLoc Jigsaw
RevNet50 41.8 45.3 47.4 47.9 39.4 43.1 44.5 37.5 41.9 37.1 40.7
ResNet50 v2 39.8 43.2 44.2 44.8 39.5 42.8 44.3 38.7 43.2 36.3 39.2
ResNet50 v1 38.1 40.0 41.3 42.0 39.3 43.1 44.5 42.3 46.2 39.4 42.9
RevNet50 (-) 39.5 44.3 46.3 47.5 35.8 39.3 40.7 32.5 39.7 34.5 38.5
ResNet50 v2 (-) 35.5 39.5 41.8 42.8 32.6 34.9 36.0 35.8 39.1 31.6 33.2
VGG19-BN 22.6 21.6 23.8 30.7 29.3 32.0 33.3 31.5 33.6 24.6 27.2
Table 4: Evaluation on Places205.
Model Rotation Exemplar RelPatchLoc Jigsaw
RevNet50 32.1 33.4 34.5 34.8 30.7 31.2 31.6 28.9 29.7 29.3 29.3
ResNet50 v2 30.6 31.8 31.8 32.0 32.1 31.8 32.2 29.8 31.1 29.4 28.9
ResNet50 v1 30.0 29.2 29.0 29.2 32.5 32.5 32.7 33.2 33.9 31.2 31.3
RevNet50 (-) 33.5 34.4 34.5 34.3 31.0 32.2 32.2 27.4 30.8 29.8 31.1
ResNet50 v2 (-) 31.6 33.2 33.6 33.6 30.0 31.4 31.9 30.9 31.9 28.4 28.9
VGG19-BN 16.8 13.9 15.3 20.2 23.5 23.4 23.7 23.8 24.0 19.3 18.7
Table 5: Evaluation on Places205 with a subset of the data.


ImageNet Places205
Family Model Prev. top1 Ours top1 Ours top5 Prev. top1 Ours top1 Ours top5
A Rotation[11] 38.7 55.4 77.9 35.1 48.0 77.9
R Exemplar[8] 31.5 46.0 68.8 - 42.7 72.5
R Rel. Patch Loc.[8] 36.2 51.4 74.0 - 45.3 75.6
A Jigsaw[34, 51] 34.7 44.6 68.0 35.5 42.2 71.6
R Supervised RevNet50 74.8 74.4 91.9 - 58.9 87.5
R Supervised ResNet50 v2 76.0 75.8 92.8 - 61.6 89.0
V Supervised VGG19 72.7 75.0 92.3 58.9 61.5 89.3
Table 6: Comparison of the published self-supervised models to our best models. The scores correspond to accuracy of linear logistic regression that is trained on top of representations provided by self-supervised models. Official validation splits of ImageNet and Places205 are used for computing accuracies. The “Family” column shows which basic model architecture was used in the referenced literature: AlexNet, VGG-style, or Residual.