Modern image-recognition systems learn image representations from large collections of images and corresponding semantic annotations. These annotations can be provided in the form of class labels, hashtags, bounding boxes [15, 39], etc. Pre-defined semantic annotations scale poorly to the long tail of visual concepts, which hampers further improvements in image recognition.
Self-supervised learning tries to address these limitations by learning image representations from the pixels themselves without relying on pre-defined semantic annotations. Often, this is done via a pretext task that applies a transformation to the input image and requires the learner to predict properties of the transformation from the transformed image (see Figure 1). Examples of image transformations used include rotations, affine transformations [76, 49, 57, 31], and jigsaw transformations. As the pretext task involves predicting a property of the image transformation, it encourages the construction of image representations that are covariant to the transformations. Although such covariance is beneficial for tasks such as predicting 3D correspondences [49, 31, 57], it is undesirable for most semantic recognition tasks. Representations ought to be invariant under image transformations to be useful for image recognition [29, 13] because the transformations do not alter visual semantics.
Motivated by this observation, we propose a method that learns invariant representations rather than covariant ones. Instead of predicting properties of the image transformation, Pretext-Invariant Representation Learning (PIRL) constructs image representations that are similar to the representation of transformed versions of the same image and different from the representations of other images. We adapt the “Jigsaw” pretext task to work with PIRL and find that the resulting invariant representations perform better than their covariant counterparts across a range of vision tasks. PIRL substantially outperforms all prior art in self-supervised learning from ImageNet (Figure 2) and from uncurated image data (Table 4). Interestingly, PIRL even outperforms supervised pre-training in learning image representations suitable for object detection (Tables 1 & 6).
2 PIRL: Pretext-Invariant Representation Learning
Our work focuses on pretext tasks for self-supervised learning in which a known image transformation is applied to the input image. For example, the “Jigsaw” task divides the image into nine patches and perturbs the image by randomly permuting the patches. Prior work used Jigsaw as a pretext task by predicting the permutation from the perturbed input image. This requires the learner to construct a representation that is covariant to the perturbation. The same is true for a range of other pretext tasks that have recently been studied [18, 9, 44, 76]. In this work, we adapt the existing Jigsaw pretext task in a way that encourages the image representations to be invariant to the image patch perturbation. While we focus on the Jigsaw pretext task in this paper, our approach is applicable to any pretext task that involves image transformations (see Section 4.3).
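The patch perturbation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `jigsaw_transform` is hypothetical, the image is a plain list of pixel rows, and the patches tile the image exactly (practical Jigsaw pipelines typically also crop or jitter patches, which is omitted here).

```python
import random


def jigsaw_transform(image, grid=3, rng=None):
    """Split `image` (a list of pixel rows) into grid x grid patches and
    reassemble them in random order. Returns (shuffled_image, permutation)."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid  # patch size; any remainder is cropped

    # Cut out the patches in row-major order.
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            patch = [row[gx * pw:(gx + 1) * pw]
                     for row in image[gy * ph:(gy + 1) * ph]]
            patches.append(patch)

    # Draw a random permutation: grid position k receives original patch perm[k].
    perm = list(range(grid * grid))
    rng.shuffle(perm)

    # Paste the permuted patches back onto the grid.
    shuffled = []
    for gy in range(grid):
        for y in range(ph):
            row = []
            for gx in range(grid):
                row.extend(patches[perm[gy * grid + gx]][y])
            shuffled.append(row)
    return shuffled, perm
```

A covariant learner would be trained to predict `perm` from the shuffled image, whereas an invariant learner is trained so that the shuffled and original images receive similar representations.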
2.1 Overview of the Approach
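Suppose we are given an image dataset, $\mathcal{D} = \{I_1, \dots, I_{|\mathcal{D}|}\}$ with $I_n \in \mathbb{R}^{H \times W \times 3}$, and a set of image transformations, $\mathcal{T}$. The set $\mathcal{T}$ may contain transformations such as a re-shuffling of patches in the image, image rotations, etc. We aim to train a convolutional network, $\phi_\theta(\cdot)$, with parameters $\theta$ that constructs image representations $v_I = \phi_\theta(I)$ that are invariant to image transformations $t \in \mathcal{T}$. We adopt an empirical risk minimization approach to learning the network parameters $\theta$. Specifically, we train the network by minimizing the empirical risk:
$$\ell_{\mathrm{inv}}(\theta; \mathcal{D}) = \mathbb{E}_{t \sim p(\mathcal{T})}\left[\frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} L\left(v_I, v_{I^t}\right)\right], \qquad (1)$$
where $p(\mathcal{T})$ is some distribution over the transformations in $\mathcal{T}$, and $I^t$ denotes image $I$ after application of transformation $t$, that is, $I^t = t(I)$. The function $L(\cdot, \cdot)$ is a loss function that measures the similarity between two image representations. Minimization of this loss encourages the network $\phi_\theta(\cdot)$ to produce the same representation for image $I$ as for its transformed counterpart $I^t$, i.e., to make the representation invariant under transformation $t$.
We contrast this loss with the losses used by covariant pretext tasks, which predict properties of the transformation from the representation of the transformed image by minimizing:
$$\ell_{\mathrm{co}}(\theta; \mathcal{D}) = \mathbb{E}_{t \sim p(\mathcal{T})}\left[\frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} L_{\mathrm{co}}\left(v_{I^t}, z(t)\right)\right], \qquad (2)$$
where $z$ is a function that measures some properties of transformation $t$. Such losses encourage network $\phi_\theta(\cdot)$ to learn image representations that contain information on transformation $t$, thereby encouraging it to maintain information that is not semantically relevant.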
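Loss function. We implement $L(\cdot, \cdot)$ using a contrastive loss function. Specifically, we define a matching score, $s(\cdot, \cdot)$, that measures the similarity of two image representations and use this matching score in a noise contrastive estimator. In our noise contrastive estimator (NCE), each “positive” sample $(I, I^t)$ has $N$ corresponding “negative” samples. The negative samples are obtained by computing features from other images, $I' \neq I$. The noise contrastive estimator models the probability of the binary event that $(I, I^t)$ originates from the data distribution as:
$$h(v_I, v_{I^t}) = \frac{\exp\left(s(v_I, v_{I^t}) / \tau\right)}{\exp\left(s(v_I, v_{I^t}) / \tau\right) + \sum_{I' \in \mathcal{D}_N} \exp\left(s(v_{I^t}, v_{I'}) / \tau\right)}. \qquad (3)$$
Herein, $\mathcal{D}_N \subseteq \mathcal{D} \setminus \{I\}$ is a set of $N$ negative samples that are drawn uniformly at random from dataset $\mathcal{D}$ excluding image $I$, $\tau$ is a temperature parameter, and $s(\cdot, \cdot)$ is the cosine similarity between the representations.
In practice, we do not use the convolutional features $v$ directly but apply two different “heads” to the features before computing the score $s(\cdot, \cdot)$. Specifically, we apply head $f(\cdot)$ on features ($v_I$) of $I$ and head $g(\cdot)$ on features ($v_{I^t}$) of $I^t$; see Figure 3 and Section 2.3. NCE then amounts to minimizing the following loss:
$$L_{\mathrm{NCE}}(I, I^t) = -\log\left[h\left(f(v_I), g(v_{I^t})\right)\right] - \sum_{I' \in \mathcal{D}_N} \log\left[1 - h\left(g(v_{I^t}), f(v_{I'})\right)\right]. \qquad (4)$$
This loss encourages the representation of image $I$ to be similar to that of its transformed counterpart $I^t$, whilst also encouraging the representation of $I^t$ to be dissimilar to that of other images $I'$.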
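To make the contrastive objective concrete, the following is a minimal sketch of an NCE-style loss with a cosine-similarity score and a temperature. It is a simplification under stated assumptions rather than the exact estimator of the text: all scores are computed against a single query vector and share one normalizer, the function names are hypothetical, and the default temperature is an arbitrary illustrative value.

```python
import math


def cosine_similarity(u, v):
    """Matching score s(u, v): cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nce_loss(query, positive, negatives, tau=0.1):
    """Simplified NCE loss: -log h(positive) - sum over negatives of log(1 - h(negative)).

    h(c) is the temperature-scaled, exponentiated score of candidate c against
    `query`, normalized over the positive and all negatives (an assumption made
    here to keep the sketch short).
    """
    candidates = [positive] + list(negatives)
    scores = [math.exp(cosine_similarity(query, c) / tau) for c in candidates]
    total = sum(scores)
    probs = [score / total for score in scores]
    loss = -math.log(probs[0])                        # pull the positive pair together
    loss -= sum(math.log(1.0 - p) for p in probs[1:])  # push the negatives away
    return loss
```

The loss is small when the query is close to the positive and far from every negative, and it grows when a negative scores higher than the positive.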
2.2 Using a Memory Bank of Negative Samples
Prior work has found that it is important to use a large number of negatives in the NCE loss of Equation 4 [51, 72]. In a mini-batch SGD optimizer, it is difficult to obtain a large number of negatives without increasing the batch to an infeasibly large size. To address this problem, we follow prior work and use a memory bank of “cached” features. Concurrent work used a similar memory-bank approach.
The memory bank, $\mathcal{M}$, contains a feature representation $m_I$ for each image $I$ in dataset $\mathcal{D}$. The representation $m_I$ is an exponential moving average of feature representations $f(v_I)$ that were computed in prior epochs. This allows us to replace negative samples, $f(v_{I'})$, by their memory bank representations, $m_{I'}$, in Equation 4 without having to increase the training batch size. We emphasize that the representations that are stored in the memory bank are all computed on the original images, $I$, without the transformation $t$.
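A memory bank of this kind can be sketched as follows. The class and method names are hypothetical, the momentum value is an illustrative placeholder, and details such as random initialization or re-normalization of the stored vectors are deliberately left out.

```python
import random


class MemoryBank:
    """Stores one exponential-moving-average feature vector per dataset image."""

    def __init__(self, num_images, dim, momentum=0.5):
        self.momentum = momentum  # weight on the old entry (illustrative value)
        self.features = [[0.0] * dim for _ in range(num_images)]
        self.initialized = [False] * num_images

    def update(self, index, feature):
        """Blend a freshly computed feature of image `index` into the bank."""
        if not self.initialized[index]:
            # First visit: store the feature as-is.
            self.features[index] = list(feature)
            self.initialized[index] = True
            return
        m = self.momentum
        old = self.features[index]
        self.features[index] = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]

    def sample_negatives(self, index, count, rng):
        """Draw `count` cached features uniformly at random, excluding image `index`."""
        candidates = [i for i in range(len(self.features)) if i != index]
        return [self.features[i] for i in rng.sample(candidates, count)]
```

Because negatives are read from the cache, the number of negatives per positive is decoupled from the mini-batch size, at the cost of the cached features being slightly stale.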
Final loss function. A potential issue of the loss in Equation 4 is that it does not compare the representations of untransformed images $I$ and $I'$. We address this issue by using a convex combination of two NCE loss functions in the empirical risk $\ell_{\mathrm{inv}}(\cdot)$:
$$L(I, I^t) = \lambda\, L_{\mathrm{NCE}}\left(m_I, g(v_{I^t})\right) + (1 - \lambda)\, L_{\mathrm{NCE}}\left(m_I, f(v_I)\right).$$
Herein, the first term is simply the loss of Equation 4 but uses memory representations $m_I$ and $m_{I'}$ instead of $f(v_I)$ and $f(v_{I'})$, respectively. The second term does two things: (1) it encourages the representation $f(v_I)$ to be similar to its memory representation $m_I$, thereby dampening the parameter updates; and (2) it encourages the representations $f(v_I)$ and $m_{I'}$ to be dissimilar. We note that both the first and the second term use $m_{I'}$ instead of $f(v_{I'})$ in Equation 4. Setting $\lambda = 0$ in Equation 2.2 leads to the loss used in prior work (NPID). We study the effect of $\lambda$ on the learned representations in Section 4.
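The weighting between the two terms can be illustrated with a short sketch. It assumes that the mixing weight (called `lam` here) multiplies the pretext-invariance term and that its complement multiplies the instance term; the NCE loss itself is passed in as a callable, and `toy_loss` is a hypothetical stand-in used only to show the limiting cases.

```python
def pirl_loss(nce_loss, memory_I, f_vI, g_vIt, memory_negatives, lam):
    """Convex combination of two NCE-style terms.

    nce_loss(anchor, candidate, negatives) -> float is supplied by the caller.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Term 1: memory representation of I vs. representation of the transformed image.
    pretext_term = nce_loss(memory_I, g_vIt, memory_negatives)
    # Term 2: memory representation of I vs. fresh representation of the untransformed image.
    instance_term = nce_loss(memory_I, f_vI, memory_negatives)
    return lam * pretext_term + (1.0 - lam) * instance_term


def toy_loss(anchor, candidate, negatives):
    """Hypothetical stand-in for an NCE loss: squared distance to the anchor."""
    return sum((a - c) ** 2 for a, c in zip(anchor, candidate))
```

Under this assumed convention, `lam = 0` keeps only the instance term (no pretext invariance), while `lam = 1` keeps only the pretext-invariance term.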
2.3 Implementation Details
Although PIRL can be used with any pretext task that involves image transformations, we focus on the Jigsaw pretext task in this paper. To demonstrate that PIRL is more generally applicable, we also experiment with the Rotation pretext task and with a combination of both tasks in Section 4.3. Below, we describe the implementation details of PIRL with the Jigsaw pretext task.
Convolutional network. We use a ResNet-50 (R-50) network architecture in our experiments. The network is used to compute image representations for both $I$ and $I^t$. These representations are obtained by applying function $f(\cdot)$ or $g(\cdot)$ on features extracted from the network.
Specifically, we compute the representation of $I$, $f(v_I)$, by extracting res5 features, average pooling them, and applying a linear projection to obtain a -dimensional representation.
To compute the representation $g(v_{I^t})$ of a transformed image $I^t$, we closely follow [46, 19]. We: (1) extract nine patches from image $I$, (2) compute an image representation for each patch separately by extracting activations from the res5 layer of the ResNet-50 and average pooling the activations, (3) apply a linear projection to obtain a -dimensional representation for each patch, and (4) concatenate the patch representations in random order and apply a second linear projection on the result to obtain the final -dimensional image representation, $g(v_{I^t})$. Our motivation for this design of $g(\cdot)$ is the desire to remain as close as possible to the covariant pretext task of [46, 18, 19]. This allows apples-to-apples comparisons between the covariant approach and our invariant approach.
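Steps (3) and (4) of the procedure above can be sketched as follows. The pooled per-patch features are taken as given (they stand in for average-pooled res5 activations), the weight matrices are random placeholders, and all names and dimensions are hypothetical choices for illustration.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Placeholder for learned projection weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def jigsaw_head(patch_features, w_patch, w_out, rng):
    """Project each patch feature, concatenate in random order, project again."""
    # Step 3: the same linear projection is applied to every patch.
    projected = [matvec(w_patch, p) for p in patch_features]
    # Step 4: shuffle the patch order, concatenate, and apply the second projection.
    rng.shuffle(projected)
    concatenated = [value for patch in projected for value in patch]
    return matvec(w_out, concatenated)
```

With nine patches projected to `d` dimensions each, the second projection must accept an input of size `9 * d`; the random ordering means the head never sees the patches in a canonical arrangement.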
Hyperparameters. We implement the memory bank as described in prior work and use the same hyperparameters for the memory bank. Specifically, we set the temperature in Equation 3 to , and use a weight of to compute the exponential moving averages in the memory bank. Unless stated otherwise, we use in Equation 2.2.
3 Experiments
We evaluate the performance of PIRL in transfer-learning experiments. We perform experiments on a variety of datasets, focusing on object detection and image classification tasks. Our empirical evaluations cover: (1) a learning setting in which the parameters of the convolutional network are finetuned during transfer, thus evaluating the network “initialization” obtained using self-supervised learning, and (2) a learning setting in which the parameters of the network are fixed during transfer learning, thus using the network as a feature extractor. Code reproducing the results of our experiments will be published online.
Baselines. Our most important baseline is the Jigsaw ResNet-50 model from prior work. This baseline implements the covariant counterpart of our PIRL approach with the Jigsaw pretext task.
We also compare PIRL to a range of other self-supervised methods. An important comparison is to NPID. NPID is a special case of PIRL: setting $\lambda = 0$ in Equation 2.2 leads to the loss function of NPID. We found it is possible to improve the original implementation of NPID by using more negative samples and training for more epochs (see Section 4). We refer to our improved version of NPID as NPID++. Comparisons between PIRL and NPID++ allow us to study the effect of the pretext-invariance that PIRL aims to achieve, i.e., the effect of using $\lambda > 0$ in Equation 2.2.
Pre-training data. To facilitate comparisons with prior work, we use the M images from the ImageNet  train split (without labels) to pre-train our models.
Training details. We train our models using mini-batch SGD using the cosine learning rate decay  scheme with an initial learning rate of and a final learning rate of . We train the models for epochs using a batch size of images and using negative samples in Equation 3. We do not use data-augmentation approaches such as Fast AutoAugment  because they are the result of supervised-learning approaches. We provide a full overview of all hyperparameter settings that were used in the supplemental material.
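For reference, a cosine learning-rate decay between an initial and a final value can be written as below. This is a generic sketch of the schedule, with a hypothetical function name; the learning rates in the usage note are placeholders and not the values used in the experiments.

```python
import math


def cosine_learning_rate(step, total_steps, initial_lr, final_lr):
    """Anneal from `initial_lr` to `final_lr` along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return final_lr + (initial_lr - final_lr) * cosine
```

For example, `cosine_learning_rate(step, 1000, 0.1, 0.0001)` starts at the initial rate, passes through the midpoint of the two rates halfway through training, and ends at the final rate.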
Transfer learning. Prior work suggests that the hyperparameters used in transfer learning can play an important role in the evaluation of pre-trained representations [78, 19, 33]. To facilitate fair comparisons with prior work, we closely follow the transfer-learning setup described in [19, 78].
3.1 Object Detection
Following prior work [19, 72], we perform object-detection experiments on the Pascal VOC dataset using the VOC07+12 train split. We use the Faster R-CNN C4 object-detection model implemented in Detectron2 with a ResNet-50 (R-50) backbone. We pre-train the ResNet-50 using PIRL to initialize the detection model before finetuning it on the VOC training data. We use the same training schedule as prior work for all models finetuned on VOC and follow [19, 71] to keep the BatchNorm parameters fixed during finetuning. We evaluate object-detection performance in terms of AP, AP, and AP.
The results of our detection experiments are presented in Table 1. The results demonstrate the strong performance of PIRL: it outperforms all alternative self-supervised learners in terms of all three AP measures. Compared to pre-training on the Jigsaw pretext task, PIRL achieves AP improvements of points. These results underscore the importance of learning invariant (rather than covariant) image representations. PIRL also outperforms NPID++, which demonstrates the benefits of learning pretext invariance.
Interestingly, PIRL even outperforms the supervised ImageNet-pretrained model in terms of the more conservative AP and AP metrics. Similar to concurrent work , we find that a self-supervised learner can outperform supervised pre-training for object detection. We emphasize that PIRL achieves this result using the same backbone model, the same number of finetuning epochs, and the exact same pre-training data (but without the labels). This result is a substantial improvement over prior self-supervised approaches that obtain slightly worse performance than fully supervised baselines despite using orders of magnitude more curated training data  or much larger backbone models . In Table 6, we show that PIRL also outperforms supervised pretraining when finetuning is done on the much smaller VOC07 train+val set. This suggests that PIRL learns image representations that are amenable to sample-efficient supervised learning.
Table 2: Image classification with linear models. We train linear classifiers on image representations obtained by self-supervised learners that were pre-trained on ImageNet (without labels). We report the performance for the best-performing layer for each method. We measure mean average precision (mAP) on the VOC07 dataset and top-1 accuracy on all other datasets. Numbers for PIRL, NPID++, and Rotation were obtained by us; the other numbers were adopted from their respective papers. Some numbers (marked in the table) were measured using 10-crop evaluation. The best-performing self-supervised learner on each dataset is boldfaced. The table groups methods into two blocks: ResNet-50 models using the evaluation setup of prior work, and models with a different architecture or evaluation setup.
3.2 Image Classification with Linear Models
Next, we assess the quality of image representations by training linear classifiers on fixed image representations. We follow the evaluation setup from prior work and measure the performance of such classifiers on four image-classification datasets: ImageNet, VOC07, Places205, and iNaturalist2018. These datasets involve diverse tasks such as object classification, scene recognition, and fine-grained recognition. Following that setup, we evaluate representations extracted from all intermediate layers of the pre-trained network, and report the image-classification results for the best-performing layer in Table 2.
ImageNet results. The results on ImageNet highlight the benefits of learning invariant features: PIRL improves recognition accuracies by over compared to its covariant counterpart, Jigsaw. PIRL achieves the highest single-crop top-1 accuracy of all self-supervised learners that use a single ResNet-50 model.
The benefits of pretext invariance are further highlighted by comparing PIRL with NPID. Our re-implementation of NPID (called NPID++) substantially outperforms the originally reported NPID results. Specifically, NPID++ achieves a single-crop top-1 accuracy of , which is higher than or on par with existing work that uses a single ResNet-50. Yet, PIRL substantially outperforms NPID++. We note that PIRL also outperforms concurrent work in this setting.
Akin to prior approaches, the performance of PIRL improves with network size. For example, CMC  uses a combination of two ResNet-50 models and trains the linear classifier for longer to obtain accuracy. We performed an experiment in which we did the same for PIRL, and obtained a top-1 accuracy of ; see “PIRL-ens.” in Figure 2. To compare PIRL with larger models, we also performed experiments in which we followed [33, 75] by doubling the number of channels in ResNet-50; see “PIRL-c2x” in Figure 2. PIRL-c2x achieves a top-1 accuracy of , which is close to the accuracy obtained by AMDIM  with a model that has more parameters.
Altogether, the results in Figure 2 demonstrate that PIRL outperforms all prior self-supervised learners on ImageNet in terms of the trade-off between model accuracy and size. Indeed, PIRL even outperforms most self-supervised learners that use much larger models [51, 26].
Results on other datasets. The results on the other image-classification datasets in Table 2 are in line with the results on ImageNet: PIRL substantially outperforms its covariant counterpart (Jigsaw). The performance of PIRL is within 2% of fully supervised representations on Places205, and improves the previous best results of  on VOC07 by more than 16 AP points. On the challenging iNaturalist dataset, which has over classes, we obtain a gain of 11% in top-1 accuracy over the prior best result . We observe that the NPID++ baseline performs well on these three datasets but is consistently outperformed by PIRL. Indeed, PIRL sets a new state-of-the-art for self-supervised representations in this learning setting on the VOC07, Places205, and iNaturalist datasets.
3.3 Semi-Supervised Image Classification
We perform semi-supervised image classification experiments on ImageNet following the experimental setup of [26, 72, 75]. Specifically, we randomly select and of the ImageNet training data (with labels). We finetune our models on these training-data subsets following the procedure of . Table 3 reports the top-5 accuracy of the resulting models on the ImageNet validation set.
The results further highlight the quality of the image representations learned by PIRL: finetuning the models on just (13,000) labeled images leads to a top-5 accuracy of . PIRL performs at least as well as SL and better than VAT, which are both methods specifically designed for semi-supervised learning. In line with earlier results, PIRL also outperforms Jigsaw and NPID++.
|Random initialization ||R-50||22.0||59.0|
|VAT + Ent Min. [45, 20]||R-50v2||47.0||83.4|
|SL Exemplar ||R-50v2||47.0||83.7|
|SL Rotation ||R-50v2||53.4||83.8|
|CPC-Largest ||R-170 and R-11||64.0||84.9|
3.4 Pre-Training on Uncurated Image Data
Most representation learning methods are sensitive to the data distribution used during pre-training [30, 62, 41, 19]. To study how much changes in the data distribution impact PIRL, we pre-train models on uncurated images from the unlabeled YFCC dataset . Following [7, 19], we randomly select a subset of 1 million images (YFCC-1M) from the 100 million images in YFCC. We pre-train PIRL ResNet-50 networks on YFCC-1M using the same procedure that was used for ImageNet pre-training. We evaluate using the setup in Section 3.2 by training linear classifiers on fixed image representations.
Table 4 reports the top-1 accuracy of the resulting classifiers. In line with prior results, PIRL outperforms competing self-supervised learners. In fact, PIRL even outperforms Jigsaw and DeeperCluster models that were trained on more data from the same distribution. Comparing pre-training on ImageNet (Table 2) with pre-training on YFCC-1M (Table 4) leads to a mixed set of observations. On ImageNet classification, pre-training (without labels) on ImageNet works substantially better than pre-training on YFCC-1M. In line with prior work [30, 19], however, pre-training on YFCC-1M leads to better representations for image classification on the Places205 dataset.
|DeepCluster [6, 7]||YFCC1M||34.1||63.9||35.4||–|
4 Analysis
We performed a set of experiments aimed at better understanding the properties of PIRL. To make it feasible to train the larger number of models needed for these experiments, we train the models we study in this section for fewer epochs () and with fewer negatives () than in Section 3. As a result, we obtain lower absolute performance. Apart from that, we did not change the experimental setup or any of the other hyperparameters. Throughout the section, we use the evaluation setup from Section 3.2, which trains linear classifiers on fixed image representations, to measure the quality of image representations.
4.1 Analyzing PIRL Representations
Does PIRL learn invariant representations?
PIRL was designed to learn representations that are invariant to image transformation $t \in \mathcal{T}$. We analyzed whether the learned representations actually have the desired invariance properties. Specifically, we normalize the representations to have unit norm and compute distances between the (normalized) representation of image $I$, $f(v_I)$, and the (normalized) representation of its transformed version, $g(v_{I^t})$. We repeat this for all transforms $t \in \mathcal{T}$ and for a large set of images. We plot histograms of the distances thus obtained in Figure 4. The figure shows that, for PIRL, an image representation and the representation of a transformed version of that image are generally similar. This suggests that PIRL has learned representations that are invariant to the transformations. By contrast, the distances between Jigsaw representations have a much larger mean and variance, which suggests that Jigsaw representations covary with the image transformations that were applied.
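The measurement behind such a histogram can be sketched as follows. The sketch assumes Euclidean distance between unit-normalized vectors (so distances lie in the range 0 to 2); the function names and the bin count are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def invariance_distance(rep_original, rep_transformed):
    """Euclidean distance between the unit-normalized representations."""
    a = l2_normalize(rep_original)
    b = l2_normalize(rep_transformed)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def histogram(distances, num_bins=10, upper=2.0):
    """Count distances per equal-width bin over [0, upper]."""
    counts = [0] * num_bins
    for d in distances:
        index = min(int(d / upper * num_bins), num_bins - 1)  # d == upper goes in the last bin
        counts[index] += 1
    return counts
```

A representation that is invariant to the transformations concentrates its mass in the lowest bins, while a covariant one spreads out towards larger distances.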
Which layer produces the best representations?
All prior experiments used PIRL representations that were extracted from the res5 layer and Jigsaw representations that were extracted from the res4 layer (which work better for Jigsaw). Figure 5 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. We surmise this happens because the res5 representations in the last layer of the network covary with the image transformation and are not encouraged to contain semantic information. By contrast, PIRL representations are invariant to image transformations, which allows them to focus on modeling semantic information. As a result, the best image representations are extracted from the res5 layer of PIRL-trained networks.
4.2 Analyzing the PIRL Loss Function
What is the effect of $\lambda$ in the PIRL loss function?
The PIRL loss function in Equation 2.2 contains a hyperparameter $\lambda$ that trades off between two NCE losses. All prior experiments were performed with . NPID(++) is a special case of PIRL in which $\lambda = 0$, effectively removing the pretext-invariance term from the loss. At $\lambda = 1$, the network does not compare untransformed images at training time and updates to the memory bank are not dampened.
We study the effect of $\lambda$ on the quality of PIRL representations. As before, we measure representation quality by the top-1 accuracy of linear classifiers operating on fixed ImageNet representations. Figure 6 shows the results of these experiments. The results show that the performance of PIRL is quite sensitive to the setting of $\lambda$, and that the best performance is obtained by setting .
What is the effect of the number of image transforms?
Both in PIRL and Jigsaw, it is possible to vary the complexity of the task by varying the number of permutations of the nine image patches that are included in the set of image transformations, $\mathcal{T}$. Prior work on Jigsaw suggests that increasing the number of possible patch permutations leads to better performance [46, 19]. However, the largest value $|\mathcal{T}|$ can take is restricted because the number of learnable parameters in the output layer grows linearly with the number of patch permutations in models trained to solve the Jigsaw task. This problem does not apply to PIRL because it never outputs the patch permutations, and thus has a fixed number of model parameters. As a result, PIRL can use all $9! = 362{,}880$ (approximately 0.36 million) permutations in $\mathcal{T}$.
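The size of the permutation set is easy to check, and a restricted set of a given cardinality can be drawn as in the sketch below. Sampling uniformly at random is an assumption made for brevity (Jigsaw implementations often select permutations by other criteria), and the function name is hypothetical.

```python
import math
import random


def sample_permutation_set(num_patches, set_size, rng):
    """Draw `set_size` distinct permutations of `num_patches` patches."""
    total = math.factorial(num_patches)
    if set_size > total:
        raise ValueError("cannot draw more distinct permutations than exist")
    chosen = set()
    # Rejection sampling: adequate when set_size is well below `total`.
    while len(chosen) < set_size:
        perm = list(range(num_patches))
        rng.shuffle(perm)
        chosen.add(tuple(perm))
    return sorted(chosen)


# Nine patches admit 9! distinct orderings.
NUM_JIGSAW_PERMUTATIONS = math.factorial(9)
```

A classifier over permutations needs one output unit per element of the set, which is why the covariant Jigsaw task is limited to small sets, whereas an invariant objective has no output layer that depends on the set size.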
We study the quality of PIRL and Jigsaw as a function of the number of patch permutations included in $\mathcal{T}$. To facilitate comparison with prior work, we measure quality in terms of performance of linear models on image classification using the VOC07 dataset, following the same setup as in Section 3.2. The results of these experiments are presented in Figure 7. The results show that PIRL outperforms Jigsaw for all cardinalities of $\mathcal{T}$ but that PIRL particularly benefits from being able to use very large numbers of image transformations (i.e., large $|\mathcal{T}|$) during training.
What is the effect of the number of negative samples?
We study the effect of the number of negative samples, $N$, on the quality of the learned image representations. We measure the accuracy of linear ImageNet classifiers on fixed representations produced by PIRL as a function of the value of $N$ used in pre-training. The results of these experiments are presented in Figure 8. They suggest that increasing the number of negatives tends to have a positive influence on the quality of the image representations constructed by PIRL.
4.3 Generalizing PIRL to Other Pretext Tasks
Although we studied PIRL in the context of Jigsaw in this paper, PIRL can be used with any set of image transformations, $\mathcal{T}$. We performed an experiment evaluating the performance of PIRL using the Rotation pretext task. We define $\mathcal{T}$ to contain image rotations by , and measure representation quality in terms of image-classification accuracy of linear models (see the supplemental material for details).
The results of these experiments are presented in Table 5 (top). In line with earlier results, models trained using PIRL (Rotation) outperform those trained using the covariant Rotation pretext task. The performance gains obtained from learning a rotation-invariant representation are substantial, e.g. top-1 accuracy on ImageNet. We also note that PIRL (Rotation) outperforms NPID++ (see Table 2). In a second set of experiments, we combined the pretext image transforms from both the Jigsaw and Rotation tasks in the set of image transformations, $\mathcal{T}$. Specifically, we obtain $I^t$ by first applying a rotation and then performing a Jigsaw transformation. The results of these experiments are shown in Table 5 (bottom). The results demonstrate that combining image transforms from multiple pretext tasks can further improve image representations.
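Composing transformations from several pretext tasks amounts to applying them in sequence. The sketch below assumes rotations by multiples of 90 degrees (the angle set is not specified in this text) on an image stored as a list of pixel rows; `rotate90` and `compose` are hypothetical helper names, and any patch-shuffling function could be passed as the second transform.

```python
def rotate90(image, k):
    """Rotate a list-of-rows image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def compose(*transforms):
    """Return a function that applies the given transforms from left to right."""
    def combined(image):
        for transform in transforms:
            image = transform(image)
        return image
    return combined
```

For instance, `compose(lambda im: rotate90(im, 1), shuffle_patches)` would rotate first and shuffle patches second, where `shuffle_patches` stands for any Jigsaw-style function supplied by the caller.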
5 Related Work
Our study is related to prior work that tries to learn characteristics of the image distribution without considering a corresponding (image-conditional) label distribution. A variety of work has studied reconstructing images from a small, intermediate representation, e.g., using sparse coding, adversarial training [11, 43, 12], autoencoders [42, 55, 67], or probabilistic versions thereof.
More recently, interest has shifted to specifying pretext tasks  that require modeling a more limited set of properties of the data distribution. For video data, these pretext tasks learn representations by ordering video frames [44, 16, 37, 70, 1, 74, 32], tracking [68, 54], or using cross-modal signals like audio [53, 3, 2, 34, 17, 52].
Our work focuses on image-based pretext tasks. Prior pretext tasks include image colorization [8, 28, 35, 36, 77, 78], orientation prediction, affine transform prediction, predicting contextual image patches, re-ordering image patches [19, 46, 48, 5], counting visual primitives, or their combinations. In contrast, our work learns image representations that are invariant to the image transformations rather than covariant.
PIRL is related to approaches that learn invariant image representations via contrastive learning [72, 68, 27, 14, 60], clustering [6, 7, 48, 69], or maximizing mutual information [29, 27, 4]. PIRL is most similar to methods that learn representations that are invariant under standard data augmentation [27, 4, 72, 29, 13, 73]. PIRL learns representations that are invariant to both the data augmentation and to the pretext image transformations.
|PIRL (Rotation; ours)||25.6M||60.2||77.1||47.6||31.2|
|Combining pretext tasks using PIRL|
|PIRL (Jigsaw; ours)||25.6M||62.2||79.8||48.5||31.2|
|PIRL (Rotation + Jigsaw; ours)||25.6M||63.1||80.3||49.7||33.6|
Finally, PIRL is also related to approaches that use a contrastive loss  in predictive learning [24, 26, 51, 64, 23, 61]. These prior approaches predict missing parts of the data, e.g., future frames in videos [51, 23], or operate on multiple views . In contrast to those approaches, PIRL learns invariances rather than predicting missing data.
6 Discussion and Conclusion
We studied Pretext-Invariant Representation Learning (PIRL) for learning representations that are invariant to image transformations applied in self-supervised pretext tasks. The rationale behind PIRL is that invariance to image transformations maintains semantic information in the representation. We obtain state-of-the-art results on multiple benchmarks for self-supervised learning in image classification and object detection. PIRL even outperforms supervised ImageNet pre-training on object detection.
In this paper, we used PIRL with the Jigsaw and Rotation image transformations. In future work, we aim to extend to richer sets of transformations. We also plan to investigate combinations of PIRL with clustering-based approaches [6, 7]. Like PIRL, those approaches use inter-image statistics but they do so in a different way. A combination of the two approaches may lead to even better image representations.
We are grateful to Rob Fergus and Andrea Vedaldi for encouragement and their feedback on early versions of the manuscript; Yann LeCun for helpful discussions; Aaron Adcock, Naman Goyal, Priya Goyal, and Myle Ott for their help with the code development for this research; and Rohit Girdhar and Ross Girshick for feedback on the manuscript. We thank Yuxin Wu and Kaiming He for help with Detectron2.
Unaiza Ahsan, Rishi Madhok, and Irfan Essa.
Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition.In WACV, 2019.
-  Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, 2017.
-  Relja Arandjelovic and Andrew Zisserman. Objects that sound. In ECCV, 2018.
-  Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
-  Fabio M Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In CVPR, 2019.
-  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
-  Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
-  Aditya Deshpande, Jason Rock, and David Forsyth. Learning large-scale automatic image colorization. In ICCV, 2015.
-  Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
-  Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
-  J. Donahue, P. Krahenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2016.
-  Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544, 2019.
Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin
Riedmiller, and Thomas Brox.
Discriminative unsupervised feature learning with exemplar convolutional neural networks.TPAMI, 38(9), 2016.
-  Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In CVPR, 2019.
-  M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, Jan. 2015.
Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould.
Self-supervised video representation learning with odd-one-out networks.In CVPR, 2017.
-  Ruohan Gao, Rogerio Feris, and Kristen Grauman. Learning to separate object sounds by watching unlabeled video. In ECCV, 2018.
-  Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
-  Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
-  Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
-  Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
-  Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
-  Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In ICCV Workshop, 2019.
-  Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
-  R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
-  Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics, 35(4):110, 2016.
-  X. Ji, J.F. Henriques, and A. Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, 2019.
-  Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
-  Angjoo Kanazawa, David W Jacobs, and Manmohan Chandraker. Warpnet: Weakly supervised matching for single-view reconstruction. In CVPR, pages 3253–3261, 2016.
-  Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, volume 33, 2019.
-  Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
-  Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
-  Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
-  Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, 2017.
-  Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In CVPR, 2017.
-  Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. In NeurIPS, 2019.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
-  Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
-  J. Masci, U. Meier, D. Cires, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN, pages 52–59, 2011.
-  L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. TPAMI, 41(8), 2018.
-  Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
-  Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.
-  Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
-  David Novotny, Samuel Albanie, Diane Larlus, and Andrea Vedaldi. Self-supervised learning of geometrically stable features through probabilistic introspection. In CVPR, pages 3637–3645, 2018.
-  B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.
-  Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
-  Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018.
-  Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
-  Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
-  Marc’Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
-  Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional neural network architecture for geometric matching. In CVPR, pages 6148–6157, 2017.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115, 2015.
-  R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AI-STATS, pages 448–455, 2009.
-  Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
-  Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
-  Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
-  Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
-  Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
-  Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2018.
-  Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
-  Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
-  Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, pages 1329–1338, 2017.
-  Donglai Wei, Joseph Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In , 2018.
-  Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
-  Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
-  Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
-  Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
-  Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670, 2019.
-  Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data. arXiv preprint arXiv:1901.04596, 2019.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
-  Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
-  Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, pages 6002–6012, 2019.
Appendix A Training architecture and
Base Architecture for PIRL
PIRL models in Sections 3, 4, and 5 of the main paper are standard ResNet-50 models with 25.6 million parameters. Figure 2 and Section 3.2 also use a larger model, ‘PIRL-c2x’, which has double the number of channels in each ‘stage’ of the ResNet (e.g., 128 channels in the first convolutional layer and 4096 channels in the final res5 layer), for a total of 98 million parameters. The two heads are linear layers that are used only in the pre-training stage (Figure 3). When evaluating the model using transfer learning (Sections 4 and 5 of the main paper), both heads are removed.
Training Hyperparameters for Section 3
PIRL models in Section 3 are trained for epochs using the ImageNet training set (1.28 million images). We use a batchsize of 32 per GPU and use a total of 32 GPUs to train each model. We optimize the models using mini-batch SGD with the cosine learning rate decay  scheme with an initial learning rate of and a final learning rate of , momentum of and a weight decay of . We use a total of negatives to compute the NCE loss (Equation 2.2) and the negatives are sampled randomly for each data point in the batch.
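As a rough illustration of the schedule described above, the following sketch computes the global batch size implied by 32 images per GPU on 32 GPUs, the resulting number of SGD iterations per ImageNet epoch, and a cosine decay between an initial and a final learning rate. The function name `cosine_lr` and the values of `num_epochs`, `base_lr`, and `final_lr` are placeholder assumptions chosen for illustration; they are not the settings used in the paper.

```python
import math

# Values stated in the text.
IMAGES_PER_GPU = 32
NUM_GPUS = 32
IMAGENET_TRAIN_IMAGES = 1_280_000  # "1.28 million images"

GLOBAL_BATCH = IMAGES_PER_GPU * NUM_GPUS                 # 1024 images per step
ITERS_PER_EPOCH = IMAGENET_TRAIN_IMAGES // GLOBAL_BATCH  # 1250 steps per epoch


def cosine_lr(step, total_steps, base_lr, final_lr):
    """Cosine decay from base_lr (at step 0) to final_lr (at total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine


# Placeholder settings (assumptions, not taken from the paper).
num_epochs = 10
base_lr, final_lr = 0.1, 0.0001
total_steps = num_epochs * ITERS_PER_EPOCH

# Learning rate at the start of each epoch, plus the final value.
schedule = [cosine_lr(s, total_steps, base_lr, final_lr)
            for s in range(0, total_steps + 1, ITERS_PER_EPOCH)]
```

The schedule is evaluated per iteration rather than per epoch, so the learning rate changes smoothly within an epoch; whether the original training code steps per iteration or per epoch is not stated in the text.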
Training Hyperparameters for Section 4
The hyperparameters used are exactly the same as Section 3, except that the models are trained with negatives and for epochs only. This results in a lower absolute performance, but the main focus of Section 4 is to analyze PIRL.
Common Data Pre-processing
We use data augmentations as implemented in PyTorch and the Python Imaging Library. These data augmentations are used for all methods, including PIRL and NPID++. We do not use supervised policies like Fast AutoAugment. Our ‘geometric’ data pre-processing involves extracting a randomly resized crop from the image and horizontally flipping it. We follow this with ‘photometric’ pre-processing that alters the RGB color values by applying random color jitter, contrast, brightness, saturation, sharpness, equalize, etc. transforms to the image.
a.1 Details for PIRL with Jigsaw
We base our Jigsaw implementation on [19, 33]. To construct , we extract a random resized crop that occupies at least 60% of the image. We resize this crop to , and then divide into a grid. We extract a patch of from each of these grids and get patches. We apply photometric data augmentation (random color jitter, hue, contrast, saturation) to each patch independently and finally obtain the patches that constitute .
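The patch extraction described above can be sketched as follows. The 3×3 grid follows from the 9 patches used by the method; the crop size (240), the patch size (64), the random placement of each patch inside its grid cell, and the function name `sample_jigsaw_patches` are assumptions made for illustration only. The sketch computes patch boxes and leaves the per-patch photometric augmentation out.

```python
import random


def sample_jigsaw_patches(crop_size, grid, patch_size, rng):
    """Return one (left, top, right, bottom) box per grid cell.

    The square crop is split into grid x grid cells, and a
    patch_size x patch_size patch is placed at a random offset
    inside each cell (random placement is an assumption here).
    """
    cell = crop_size // grid
    if patch_size > cell:
        raise ValueError("patch must fit inside a grid cell")
    boxes = []
    for row in range(grid):
        for col in range(grid):
            dx = rng.randint(0, cell - patch_size)  # inclusive bounds
            dy = rng.randint(0, cell - patch_size)
            left = col * cell + dx
            top = row * cell + dy
            boxes.append((left, top, left + patch_size, top + patch_size))
    return boxes


# Hypothetical sizes; only the 3 x 3 grid (9 patches) comes from the text.
rng = random.Random(0)
boxes = sample_jigsaw_patches(crop_size=240, grid=3, patch_size=64, rng=rng)
```

Each box uses the (left, upper, right, lower) convention, so it can be passed directly to a cropping routine that accepts that form, such as `Image.crop` in the Python Imaging Library.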
The image is a image obtained by applying standard data augmentation (flip, random resized crop, color jitter, hue, contrast, saturation) to the image from the dataset.
The 9 patches of are individually input to the network to get their res5 features, which are average pooled to get a single dimensional vector for each patch. We then apply a linear projection to these features to get a dimensional vector for each patch. These patch features are concatenated to get a dimensional vector, which is then input to another single linear layer to get the final dimensional feature .
We feed forward the image and average pool its res5 feature to get a dimensional vector for the image. We then apply a single linear projection to get a dimensional feature .
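To make the flow of features through the two heads concrete, here is a toy sketch in plain Python. It projects each pooled patch feature, concatenates the 9 projections, and applies a second linear layer; the image branch uses a single linear projection. The dimensions (`POOLED_DIM`, `PATCH_DIM`, `OUT_DIM`) are deliberately tiny placeholders rather than the sizes used in the paper, and sharing one projection across all patches, the bias terms, and all function names are assumptions.

```python
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is indexed [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (illustrative initialisation)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def jigsaw_head(patch_features, patch_proj, fusion):
    """Project each patch feature, concatenate, then project again."""
    projected = [linear(f, *patch_proj) for f in patch_features]
    concatenated = [v for p in projected for v in p]
    return linear(concatenated, *fusion)


def image_head(image_feature, image_proj):
    """Single linear projection of the pooled image feature."""
    return linear(image_feature, *image_proj)


# Toy dimensions (assumptions); only NUM_PATCHES = 9 comes from the text.
NUM_PATCHES, POOLED_DIM, PATCH_DIM, OUT_DIM = 9, 16, 4, 4
rng = random.Random(0)
patch_proj = init_linear(POOLED_DIM, PATCH_DIM, rng)
fusion = init_linear(NUM_PATCHES * PATCH_DIM, OUT_DIM, rng)
image_proj = init_linear(POOLED_DIM, OUT_DIM, rng)

# Stand-ins for average-pooled res5 features.
patch_features = [[rng.random() for _ in range(POOLED_DIM)]
                  for _ in range(NUM_PATCHES)]
image_feature = [rng.random() for _ in range(POOLED_DIM)]

v_jigsaw = jigsaw_head(patch_features, patch_proj, fusion)
v_image = image_head(image_feature, image_proj)
```

Both branches end in vectors of the same length, which is what allows the two representations to be compared directly during pre-training.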
a.2 Details for PIRL with Rotation
The image is a image obtained by applying standard data augmentation (flip, random resized crop, color jitter, hue, contrast, saturation) to the image from the dataset.
We feed forward the image , average pool the res5 feature, and apply a single linear layer to get the final dimensional feature for .
We feed forward the image and average pool its res5 feature to get a dimensional vector for the image. We then apply a single linear projection to get a dimensional feature .
Appendix B Object Detection
b.1 VOC07 train+val set for detection
In Section 3.1 of the main paper, we presented object detection results using the VOC07+12 training set which has 16K images. In this section, we use the smaller VOC07 train+val set (5K images) for finetuning the Faster R-CNN C4 detection models. All models are finetuned using the Detectron2  codebase and hyperparameters from . We report the detection AP in Table 6. We see that the PIRL model outperforms the ImageNet supervised pretrained models on the stricter AP and AP metrics without extra pretraining data or changes to the network architecture.
Detection hyperparameters: We use a batchsize of 2 images per GPU, a total of 8 GPUs and finetune models for K iterations with the learning rate dropped by at K iterations. The supervised and Jigsaw baseline models use an initial learning rate of with a linear warmup for iterations with a slope of , while the NPID++ and PIRL models use an initial learning rate of with a linear warmup for iterations and a slope of . We keep the BatchNorm parameters of all models fixed during the detection finetuning stage.
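A finetuning schedule of this shape (linear warmup followed by a single learning rate drop) can be sketched as below. The sketch treats the warmup "slope" as the factor by which the base learning rate is scaled at iteration 0, which is one common convention and an assumption here; every numeric value in the example call is a placeholder, not a setting from the paper.

```python
# Values stated in the text.
IMAGES_PER_GPU = 2
NUM_GPUS = 8
GLOBAL_BATCH = IMAGES_PER_GPU * NUM_GPUS  # 16 images per iteration


def detection_lr(step, base_lr, warmup_iters, warmup_factor,
                 drop_step, drop_factor):
    """Linear warmup to base_lr, then one multiplicative drop at drop_step."""
    if step < warmup_iters:
        alpha = step / warmup_iters
        # Ramp the scale linearly from warmup_factor up to 1.0.
        scale = warmup_factor * (1.0 - alpha) + alpha
    else:
        scale = 1.0
    if step >= drop_step:
        scale *= drop_factor
    return base_lr * scale


# Placeholder settings (assumptions, not taken from the paper).
example = [detection_lr(s, base_lr=0.02, warmup_iters=100,
                        warmup_factor=0.001, drop_step=1000, drop_factor=0.1)
           for s in (0, 50, 100, 999, 1000)]
```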
b.2 VOC07+12 train set for detection
Detection hyperparameters: We use a batchsize of 2 images per GPU, a total of 8 GPUs and finetune models for K iterations with the learning rate dropped by at K iterations. The supervised and Jigsaw baseline models use an initial learning rate of with a linear warmup for iterations with a slope of , while the NPID++ and PIRL models use an initial learning rate of with a linear warmup for iterations and a slope of . We keep the BatchNorm parameters of all models fixed during the detection finetuning stage.
Appendix C Linear Models for Transfer
We train linear models on representations from the intermediate layers of a ResNet-50 model following the procedure outlined in . We briefly outline the hyperparameters from their work. The features from each of the layers are average pooled such that they are of about dimensions each. The linear model is trained with mini-batch SGD using a learning rate of decayed by at two equally spaced intervals, momentum of and weight decay of . We train the linear models for epochs on ImageNet (1.28M training images), epochs on Places205 (2.4M training images) and for epochs on iNaturalist-2018 (437K training images). Thus, the number of parameter updates for training the linear models is roughly constant across all these datasets. We report the center crop top-1 accuracy for the ImageNet, Places205 and iNaturalist-2018 datasets in Table 2.
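The relationship between dataset size and epoch count can be illustrated with a short sketch. With a fixed batch size, the number of parameter updates is proportional to epochs times dataset size, so keeping updates roughly constant means scaling epochs inversely with dataset size. The reference budget of 30 ImageNet epochs, the base learning rate, the decay factor, the placement of the two drops at one third and two thirds of training, and both function names are assumptions for illustration, not the values used in the paper.

```python
# Training-set sizes stated in the text.
DATASET_SIZES = {
    "ImageNet": 1_280_000,
    "Places205": 2_400_000,
    "iNaturalist-2018": 437_000,
}


def epochs_for_equal_updates(reference, reference_epochs, sizes):
    """Epoch counts giving each dataset about the same number of updates."""
    budget = reference_epochs * sizes[reference]  # total images seen
    return {name: max(1, round(budget / n)) for name, n in sizes.items()}


def step_lr(epoch, total_epochs, base_lr, decay):
    """Multiply base_lr by `decay` at 1/3 and again at 2/3 of training."""
    drops = min((3 * epoch) // total_epochs, 2)
    return base_lr * decay ** drops


# Hypothetical reference budget (assumption): 30 epochs on ImageNet.
epochs = epochs_for_equal_updates("ImageNet", 30, DATASET_SIZES)
lrs = [step_lr(e, 30, base_lr=0.01, decay=0.1) for e in (0, 9, 10, 20, 29)]
```

Under this hypothetical budget, the larger Places205 training set gets about half as many epochs as ImageNet and the smaller iNaturalist-2018 set gets roughly three times as many, which matches the qualitative pattern described in the text.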
For training linear models on VOC07, we train linear SVMs following the procedure in  and report mean average precision (mAP).
|Different Evaluation Setup|
|Kolesnikov et al. ||–||–||–||47.7||–||–||–||–||–||–|
Per layer results
The results of all these models are in Table 7.