Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?

01/13/2022
by   Nenad Tomašev, et al.

Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al., 2021, we propose ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 accuracy on ImageNet under linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.


1 Introduction

Large-scale foundation models (bommasani2021opportunities)—in particular for language (devlin2018bert; brown2020language) and multimodal domains (radford2021learning)—are an important recent development in representation learning. The idea that massive models can be trained without labels in an unsupervised (or self-supervised) manner and be readily adapted, in a few- or zero-shot setting, to perform well on a variety of downstream tasks is important for many problem areas for which labeled data is expensive and impractical to obtain.

Multi-view contrastive objectives have emerged as a successful strategy for representation learning (chen2020simple; he2019momentum). However, the downstream utility of these representations (commonly measured by how well a method performs under the standard linear evaluation protocol on ImageNet; see section 3.1) has until now never exceeded the performance of supervised learning, limiting their usefulness. In this work we show that it is possible to train a representation without using labels which outperforms an established supervised baseline using the same network architecture in a like-for-like comparison on ImageNet.

Figure 1: Top-1 linear evaluation accuracy on ImageNet using ResNet50 encoders with 1x, 2x and 4x width multipliers and a ResNet200 encoder with a 2x width multiplier.

To achieve this we build on the Representation Learning via Invariant Causal Mechanisms (ReLIC) framework (mitrovic2020representation), which learns representations based on the principle of invariant prediction. Unlike other methods, ReLIC explicitly enforces invariance over the relationship between similar and dissimilar points in the dataset via an additional term in the loss function. ReLIC learns representations that more closely follow the geometry of the underlying data (mitrovic2020representation). This property ensures the learned representations transfer well to downstream tasks.

In this paper, we propose ReLICv2, which leverages the theoretical understanding of ReLIC together with better strategies for selecting similar and dissimilar points, and incorporates these into both the contrastive and invariance objectives. As a result, ReLICv2 achieves new state-of-the-art performance in self-supervised learning on a wide range of ResNet architectures. Furthermore, ReLICv2 is the first self-supervised representation learning method that outperforms the supervised ResNet50 baseline on linear ImageNet evaluation across the 1x, 2x and 4x width variants (figure 1), as established in (chen2020simple); note that other work outperforms this baseline (caron2021emerging) but does so by using a different network architecture to the baseline, and thus is not a like-for-like comparison of network architectures. We demonstrate how to outperform the supervised baseline by changing the self-supervised training scheme without needing to change the network architecture. In terms of top-1 classification accuracy on ImageNet, ReLICv2 achieves 77.1% with a ResNet50 and up to 80.6% with larger ResNet models.

Furthermore, ReLICv2 also outperforms the supervised baseline on larger ResNet architectures such as ResNet101, ResNet152 and ResNet200. We also demonstrate that ReLICv2 exhibits competitive performance on a variety of other tasks including transfer learning, semi-supervised learning, and robustness and out-of-distribution generalization. Finally, although it uses ResNet architectures, ReLICv2 demonstrates comparable performance to the latest vision transformer-based methods (figure 6).

Summary of contributions.

We briefly review ReLIC and introduce ReLICv2, which incorporates our proposed improvements, in section 2. We present our main results in section 3. To accompany our empirical results, we provide further insights and analysis into how ReLICv2 learns representations, as well as its scaling capabilities, in section 4, and place our contributions in the wider context of recent developments in representation learning in section 5. Finally, in section 6 we show that ReLICv2 performs comparably to the latest vision transformer (ViT) architectures (dosovitskiy2020image; caron2021emerging; li2021efficient) and argue that the concepts and results developed in this work could have important implications for the wider adoption of self-supervised pre-training in a variety of domains, as well as for the design of objectives for foundational machine learning systems.

2 Method

We consider the large family of multi-view representation learning methods (bachman2019learning; chen2020simple; he2019momentum; caron2020unsupervised; grill2020bootstrap; gidaris2020learning; dwibedi2021little; gidaris2021obow). In particular, contrastive methods are a subset of these methods that leverage an instance classification problem in order to learn representations. Given a batch of data B, these methods learn the representation of an anchor point x^a sampled from B by comparing it against a set of positives P(x^a) and a set of negatives N(x^a). (The positives and negatives may be sampled from B or, more generally, from some other data structure which accumulates points or statistics from successive batches, such as a queue (dwibedi2021little; he2019momentum).)

The goal of contrastive methods is to classify the positive point as an instance of the anchor point against the negatives.

In the simplest setting, x^a and x^p consist of two differently augmented versions of the same image and N(x^a) comprises the rest of the images in the batch (augmented using the same procedure) (chen2020simple). The standard contrastive objective can then be concisely described by the following likelihood function

p(x^p \mid x^a) = \frac{\exp(\phi(f_o(x^a), g_t(x^p)))}{\exp(\phi(f_o(x^a), g_t(x^p))) + \sum_{x^n \in \mathcal{N}(x^a)} \exp(\phi(f_o(x^a), g_t(x^n)))}    (1)

where φ is a similarity function which compares the outputs of two networks: the online network f_o and the target network g_t, each consisting of an encoder followed by a multi-layer perceptron (the comparison_net in Listing 1). We use the target network setting (grill2020bootstrap), in which f_o and g_t have the same architecture but the weights of g_t are an exponential moving average of the weights of f_o. Note that p(x^p | x^a) is also a function of the set of negatives and, due to the comparisons in the denominator, it is not symmetric in its arguments.

ReLIC (mitrovic2020representation) introduces an invariance loss defined as the Kullback-Leibler divergence between the likelihood of the anchor point x^a and that of one of its positives x^p, both computed as in equation 1:

\mathcal{L}_{\mathrm{inv}}(x^a, x^p) = \mathrm{KL}\big(\mathrm{sg}[p(\cdot \mid x^a)] \,\|\, p(\cdot \mid x^p)\big)    (2)

We apply the stop gradient operator, sg[·], which does not affect the value of the KL-divergence but avoids degenerate solutions during optimization (xie2019unsupervised). Due to the presence of the negatives in equation 1, this loss effectively measures the similarity of x^a and x^p relative to the points in N(x^a).

ReLICv2.

Similarly to ReLIC, the basic objective of ReLICv2 is to minimize a combination of the contrastive negative log-likelihood and the invariance loss. Given a minibatch B, the loss for a sample x^a in B with positives P(x^a) and negatives N(x^a) is

\mathcal{L}(x^a) = -\alpha \sum_{x^p \in \mathcal{P}(x^a)} \log p(x^p \mid x^a) + \beta \sum_{x^p \in \mathcal{P}(x^a)} \mathcal{L}_{\mathrm{inv}}(x^a, x^p)    (3)

where α and β are scalar hyper-parameters which weight the relative importance of the contrastive and invariance losses in the overall objective.
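To make equations 1-3 concrete, the following is a minimal PyTorch-style sketch of a per-anchor loss of this form. It is our illustration rather than the authors' implementation: the choice of φ as a temperature-scaled cosine similarity, the hyper-parameter names and default values, and applying the stop-gradient to the anchor's distribution are all assumptions.

import torch
import torch.nn.functional as F

def loss_relicv2(online_a, online_p, target_p, target_negs,
                 temperature=0.1, alpha=1.0, beta=1.0):
    """Illustrative per-anchor ReLICv2-style loss (cf. eqs. 1-3).

    online_a, online_p: online-network embeddings of two views of the same
                        image (anchor and positive), each of shape (d,).
    target_p:           target-network embedding of the positive view, (d,).
    target_negs:        target-network embeddings of n_e sampled negatives,
                        shape (n_e, d).
    temperature, alpha, beta: assumed hyper-parameter names/values.
    """
    # Candidate set for the denominator of eq. (1): positive + negatives.
    candidates = F.normalize(
        torch.cat([target_p.unsqueeze(0), target_negs], dim=0), dim=-1)

    def log_likelihood(z):
        # phi: temperature-scaled cosine similarity, followed by a softmax
        # over the candidate set (the positive sits at index 0).
        logits = candidates @ F.normalize(z, dim=-1) / temperature
        return F.log_softmax(logits, dim=0)

    log_q_a = log_likelihood(online_a)   # anchor view, as in eq. (1)
    log_q_p = log_likelihood(online_p)   # positive view, as in eq. (1)

    # Contrastive term: negative log-likelihood of the positive (index 0).
    contrastive = -log_q_a[0]

    # Invariance term, eq. (2): KL between the two likelihoods, with a
    # stop-gradient (detach) on the anchor's distribution (an assumption).
    invariance = F.kl_div(log_q_p, log_q_a.detach(),
                          log_target=True, reduction="sum")

    # Eq. (3): weighted combination of the two terms.
    return alpha * contrastive + beta * invariance

The function name mirrors loss_relicv2 in Listing 1, although the listing's version takes the number of negatives n_e as an argument and is assumed to draw those negatives from the rest of the batch internally.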

ReLICv2 differs from ReLIC in the selection of appropriate sets of positive and negative points and how the resulting views of data are combined in the objective function.

The set of positives P(x^a) is particularly important in our method due to the addition of the explicit invariance term in the loss (c.f. equation 2). P(x^a) is defined by the set of augmentations we apply to each image. In addition to the standard SimCLR augmentations (chen2020simple), we apply two further augmentation strategies: multi-crop augmentations and saliency-based background removal.

Multi-crop augmentations were introduced by (caron2020unsupervised) and consist of comparing several larger (224×224) and smaller (96×96) random crops of the same image. In a notable difference from (caron2020unsupervised), we use four larger and two smaller crops.

We probabilistically apply saliency-based background removal to the larger crops. To achieve this, we use a fully unsupervised version of DeepUSPS (deepusps) which we trained using a small number of images from the ImageNet training set. We then randomly apply the saliency mask to the image with probability p_m to separate the image foreground from the background; this enables us to learn representations that can localize the objects in the image (zhao2020distilling). We set p_m = 0.1 for most of our experiments, with more details provided in the supplementary material.

We then apply the standard SimCLR augmentations: a random horizontal flip and color distortion consisting of a random sequence of brightness, saturation, contrast and hue changes and an optional grayscale conversion. As a final step, Gaussian blur and solarization are applied.

There are several valid strategies for defining the set of negatives, including importance sampling and hard negative mining (robinson2020contrastive). For simplicity, we settled on sampling a fixed number of negatives n_e uniformly from the batch. This approach was found by (mitrovic2020less) to eliminate the issue of false negatives (sampling a negative from the same class as the anchor) in expectation and leads to good performance.
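As an illustration of this uniform subsampling (a sketch with our own helper name; the batch size and n_e shown in the usage line are arbitrary example values, not the paper's settings):

import torch

def sample_negative_indices(batch_size, n_e, anchor_idx, generator=None):
    """Uniformly sample n_e negative indices for one anchor from a batch.

    Excludes the anchor's own index so that other augmented views of the
    same image are never used as negatives (illustrative sketch).
    """
    candidates = torch.tensor(
        [i for i in range(batch_size) if i != anchor_idx])
    choice = torch.randperm(len(candidates), generator=generator)[:n_e]
    return candidates[choice]

# Example usage with arbitrary illustrative values.
neg_idx = sample_negative_indices(batch_size=4096, n_e=10, anchor_idx=3)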

Listing 1 provides PyTorch-like pseudo-code detailing how we compute equation 3 for our choices of P(x^a) and N(x^a) and, specifically, how the different views of data are combined in the target network setting. Similar to previous work (chen2020simple; grill2020bootstrap), we minimize our objective using the LARS optimizer (you2017large) with a cosine decay learning rate schedule without restarts. Unless otherwise indicated, we train our models with a short warm-up period and a large batch size; precise architectural and implementation details, including the number of training epochs and the batch size, are provided in the supplementary material.

# f_o: online network (encoder + comparison_net)
# g_t: target network (encoder + comparison_net)
# gamma: target EMA coefficient
# n_e: number of negatives
# p_m: mask apply probability

for x in batch:  # load a batch of B samples
    # Apply saliency mask and remove the background
    x_m = remove_background(x)

    ol, tl, os = [], [], []
    for i in range(num_large_crops):
        # Select either the original or the background-removed
        # image with probability p_m
        x_i = x_m if bernoulli(p_m) else x
        # Take a large random crop and augment it
        xl_i = aug(crop_l(x_i))
        ol.append(f_o(xl_i))  # online embedding
        tl.append(g_t(xl_i))  # target embedding

    for i in range(num_small_crops):
        # Take a small random crop and augment it
        xs_i = aug(crop_s(x))
        # Small crops only go through the online network
        os.append(f_o(xs_i))

    loss = 0
    # Compute the loss between all pairs of large crops
    for i in range(num_large_crops):
        for j in range(num_large_crops):
            loss += loss_relicv2(ol[i], tl[j], n_e)

    # Compute the loss between small crops and large crops
    for i in range(num_small_crops):
        for j in range(num_large_crops):
            loss += loss_relicv2(os[i], tl[j], n_e)

    scale = (num_large_crops + num_small_crops) * num_large_crops
    loss /= scale

    # Compute gradients, update online and target networks
    loss.backward()
    update(f_o)
    g_t = gamma * g_t + (1 - gamma) * f_o
Listing 1: Pseudo-code for ReLICv2.
Theory and intuitions.

Similarly to several other works (saunshi2019theoretical; haochen2021provable; robinson2020contrastive), ReLIC (mitrovic2020representation) was theoretically shown to learn good representations in the sense that they cluster according to the latent class structure of the data and are therefore useful for solving a downstream classification task (mitrovic2020representation). ReLIC and ReLICv2 differ from other works in the use of an explicit invariance loss (eq. 2) in conjunction with a contrastive loss.

The contrastive part of the loss repels the representations of negative points with different strengths, corresponding to their distance from the anchor point. This has the effect of separating the mean representations of the classes. This phenomenon is described in Theorem 2 of (mitrovic2020representation) and Theorem 5.1 of (saunshi2019theoretical). It also illustrates the problem of false negatives and why selecting hard negatives is difficult: negative points closest to the anchor contribute most to the contrastive objective (mitrovic2020less; chuang2020debiased; robinson2020contrastive).

The invariance loss encourages the representation of positive points to be similar to that of the anchor leading to representations with tighter within-class concentration as described in Lemma 1 of (mitrovic2020representation). However, since the measure of similarity is the KL divergence between contrastive likelihoods (c.f. eq. 2), this enforces that the predictive distribution of a positive relative to the other negative samples in the batch should be similar to that of the anchor point. In effect, the invariance loss allows us to enforce constraints between all points in the batch, rather than just pairwise constraints between the positives, leading to representations which better respect the underlying structure of the data. We investigate how this additional loss affects the learned representations in section 4.

3 Experimental results

Method Top-1 Top-5
  Supervised (chen2020simple) 76.5 93.7
  SimCLR (chen2020simple) 69.3 89.0
  MoCo v2 (Chen2020ImprovedBW) 71.1 -
  InfoMin Aug. (Tian2020WhatMF) 73.0 91.1
  BYOL (grill2020bootstrap) 74.3 91.6
ReLIC (mitrovic2020representation) 74.8 92.2
  SwAV (caron2020unsupervised) 75.3 -
  NNCLR (dwibedi2021little) 75.6 92.4
  C-BYOL (lee2021compressive) 75.6 92.7
ReLICv2 (ours) 77.1 93.3
Table 1: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet test set for a ResNet50 encoder for different representation learning methods. For clarity of exposition we only compare against methods which were at one point state-of-the-art.
Figure 2: Transfer performance relative to the supervised baseline (a value of 0 indicates equal performance to supervised).

We pretrain representations without using labels on the training set of the ImageNet ILSVRC-2012 dataset (russakovsky2015imagenet) and then evaluate the learned representations in a wide variety of downstream settings, datasets and tasks. First we examine the performance under the standard linear evaluation and semi-supervised protocols on the ImageNet validation set. Next we investigate the transfer capabilities of ReLICv2 representations on other image classification datasets as well as on semantic segmentation. We also test the robustness and out-of-distribution (OOD) generalization capabilities of ReLICv2 on a series of challenging datasets. Finally, to investigate the scalability and generality of our approach, we also pretrain on the much larger and more complex Joint Foto Tree (JFT-300M) dataset (sun2017revisiting) and report the results of the linear evaluation protocol on the ImageNet validation set. A complete set of results and details of all experimental settings is provided in the supplementary material.

The supervised baseline we consider refers to the ResNet50 architecture trained with cross-entropy loss and full access to labels using the same set of basic data augmentations for 1000 epochs as proposed by (chen2020simple) and used throughout the representation learning literature (c.f. (grill2020bootstrap; caron2020unsupervised; dwibedi2021little)).

3.1 Linear evaluation on ImageNet

We first evaluate ReLICv2's representations by training a linear classifier on top of the frozen representation according to the procedure described in (chen2020simple; grill2020bootstrap; caron2020unsupervised; dwibedi2021little) and the supplementary material. We report top-1 and top-5 accuracies on the ImageNet test set in table 1. ReLICv2 outperforms all previous state-of-the-art self-supervised approaches by a significant margin in terms of both top-1 and top-5 accuracy. Remarkably, ReLICv2 even outperforms the supervised baseline in terms of top-1 accuracy despite using no label information to pretrain the representation.

Figure 1 compares the performance of ReLICv2 against the supervised baseline and other competing methods for both the standard ResNet50 architecture as well as configurations with 2x and 4x wider layers and a 2x wider ResNet200. ReLICv2 not only outperforms competing methods but is also the first self-supervised representation learning method which consistently outperforms the supervised baseline for all encoder configurations. Furthermore, ReLICv2 also outperforms the supervised baseline for 101, 152 and 200-layer ResNet architectures (grill2020bootstrap) and performs competitively with the latest vision transformer architectures at similar parameter counts. See figure 6 and the supplementary material for detailed results.

Method Top-1 Top-5
1% 10% 1% 10%
Supervised (chen2020simple) 25.4 56.4 48.4 80.4
SimCLR (chen2020simple) 48.3 65.6 75.5 87.8
BYOL (grill2020bootstrap) 53.2 68.8 78.4 89.0
SwAV (caron2020unsupervised) 53.9 70.2 78.5 89.9
NNCLR (dwibedi2021little) 56.4 69.8 80.7 89.3
C-BYOL (lee2021compressive) 60.6 70.5 83.4 90.0
ReLICv2 (ours) 58.1 72.4 81.3 91.2
Table 2: Top-1 and top-5 accuracy (in %) after semi-supervised training with a fraction of ImageNet labels on a ResNet50 encoder for different representation learning methods.

3.2 Semi-supervised training on ImageNet

Next we evaluate the performance of ReLICv2 in a semi-supervised setting. We pretrain the representation and leverage a small subset of the available labels in the ImageNet training set to refine the learned representation, following the protocol described in (chen2020simple; grill2020bootstrap; caron2020unsupervised; dwibedi2021little; lee2021compressive) and the supplementary material. Top-1 and top-5 accuracy on the ImageNet validation set are reported in table 2. ReLICv2 outperforms both the supervised baseline and all previous state-of-the-art self-supervised methods when using 10% of the data for fine-tuning. When using 1% of the data, only C-BYOL performs better than ReLICv2. For further semi-supervised results using larger ResNet models and different dataset splits see the supplementary material.

3.3 Transfer to other tasks

We evaluate the generality of ReLICv2 representations by testing whether the learned features are useful across image domains.

Classification.

We perform linear evaluation and fine-tuning on the same set of classification tasks used in (chen2020simple; grill2020bootstrap; dwibedi2021little) and follow their evaluation protocol, detailed in the supplementary material. We report standard metrics for each dataset and report performance on a held-out test set. Figure 2 compares the transfer performance of representations pre-trained using BYOL (grill2020bootstrap), NNCLR (dwibedi2021little) and ReLICv2 relative to supervised pre-training. Overall, ReLICv2 improves upon both the supervised baseline and competing methods, performing best on 7 out of 11 tasks. ReLICv2's average relative improvement over the supervised baseline across all tasks is more than double that of NNCLR. Detailed results for both linear and fine-tuned evaluation protocols are in the supplementary material.

Other vision tasks.

To further evaluate the generality of the learned representation, we assess the performance of ReLICv2 on other challenging vision tasks via fine-tuning, more specifically semantic segmentation on PASCAL (Everingham10) and Cityscapes (Cordts2016Cityscapes). In accordance with (he2019momentum), we use the ReLICv2 ImageNet representation to initialise a fully convolutional backbone, which we fine-tune on the PASCAL train_aug2012 set for 45 epochs and report the mean intersection over union (mIoU) on the val2012 set. The fine-tuning on Cityscapes is done on the train_fine set for 160 epochs and evaluated on the val_fine set.

Method PASCAL (mIoU) Cityscapes (mIoU)
BYOL (grill2020bootstrap) 75.7 74.6
DetCon (detcon) 77.3 77.0
ReLICv2 (ours) 77.9 75.2

On both PASCAL and Cityscapes, ReLICv2 outperforms BYOL by a significant margin and, remarkably, on PASCAL it even outperforms DetCon (detcon), which has been specifically trained for detection.

3.4 Robustness and OOD generalization

We evaluate the robustness and out-of-distribution (OOD) generalization of ReLICv2 representations on a wide variety of datasets. To evaluate the robustness of ReLICv2 we use the ImageNetV2 (recht2019imagenet) and ImageNet-C (hendrycks2019benchmarking) datasets. For evaluating OOD generalization, we use ImageNet-R (hendrycks2021many), ImageNet-Sketch (wang2019learning) and ObjectNet (barbu2019objectnet). On all datasets, we evaluate the representations from a standard ResNet50 encoder under a linear evaluation protocol akin to section 3.1, i.e. we train a linear classifier on top of the frozen representation using the labelled ImageNet training set; the test evaluation is performed zero-shot, i.e. no training is done on the above datasets. ReLICv2 learns more robust representations and outperforms both the supervised baseline and the competing self-supervised methods on ImageNetV2 and ImageNet-C. Moreover, ReLICv2 learns representations that outperform competing self-supervised methods while being on par with supervised performance in terms of OOD generalization. For a detailed explanation of the datasets and a full breakdown of the results see the supplementary material.
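This protocol can be summarized in a few lines. The sketch below is our illustration, assuming encoder and linear_probe are torch.nn.Module objects and that the loader already maps labels into the ImageNet label space; no training is performed on the robustness or OOD datasets.

import torch

@torch.no_grad()
def zero_shot_eval(encoder, linear_probe, loader, device="cuda"):
    """Evaluate a frozen encoder plus an ImageNet-trained linear probe on an
    out-of-distribution test loader (no training on the OOD data)."""
    encoder.eval()
    linear_probe.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        features = encoder(images)                 # frozen representation
        preds = linear_probe(features).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# e.g. accuracy = zero_shot_eval(resnet50_encoder, probe, imagenet_r_loader)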

3.5 Large-scale transfer with JFT-300M

Next we test how well ReLICv2 scales to much larger datasets by pretraining representations using the Joint Foto Tree (JFT-300M) dataset, which consists of 300 million images from more than 18k classes (hinton2015distilling; chollet2017xception; sun2017revisiting). We then evaluate the learned representations on the ImageNet validation set under the same linear evaluation protocol as described in section 3.1. We compare ReLICv2 against BYOL and Divide and Contrast (DnC) (tian2021divide), a method that was specifically designed to handle large and uncurated datasets and represents the current state of the art in self-supervised JFT-300M pretraining. Table 3 reports the top-1 accuracy when training the various methods using the standard ResNet50 architecture as the backbone for different numbers of ImageNet-equivalent epochs on JFT-300M; implementation details can be found in the supplementary material. ReLICv2 improves over DnC by more than 2% when training on JFT for 1000 epochs and achieves better overall performance than competing methods while needing a smaller number of training epochs. In the supplementary material we also report robustness and OOD generalization results after JFT pretraining.

4 Analysis

Figure 3: Distances between nearest-neighbour representations. Each coloured point in a row represents one of the five nearest neighbours of the representation of that image where the colour indicates the distance between the points.
Figure 4: Distribution of the linear discriminant ratio: the ratio of between-class distances and within-class distances of embeddings computed on the ImageNet validation set.

4.1 Latent space analysis.

In order to understand the effect of the explicit invariance term in the loss function on the representations learned by ReLICv2, we look at the distances between learned representations of closely related classes. Figure 3 illustrates the Euclidean distances between nearest-neighbour representations learned by ReLICv2 and BYOL on ImageNet using the protocol described in section 3. Here we pick two breeds of dog and two breeds of cat. Each of these four classes has 50 points associated with it from the ImageNet validation set, ordered contiguously. Each row represents an image and each coloured point in a row represents one of the five nearest neighbours of the representation of that image where the colour indicates the distance between the image and the nearest neighbour. Representations which align perfectly with the underlying class structure would exhibit a perfect block-diagonal structure; that is their nearest neighbours all belong to the same underlying class. We see that ReLICv2 learns representations whose nearest neighbours are closer and exhibit less confusion between classes and super-classes than BYOL.

To quantify the overall structure of the learned latent space, we examine the within- and between-class distances for all classes. Figure 4 compares the distribution of the ratios of between-class to within-class Euclidean distances of the representations of points in the ImageNet validation set learned by ReLICv2 against those learned by the supervised baseline (both ReLICv2 and the supervised baseline were trained on the ImageNet training set). A larger ratio implies that the representation is better concentrated within the corresponding classes and better separated between classes, and therefore more easily linearly separated (c.f. Fisher's linear discriminants (friedman2001elements)). We see that ReLICv2's distribution is shifted to the right (i.e. has a higher ratio) compared to the supervised baseline, suggesting that the representations can be better separated using a linear classifier. The empirical results in this section further confirm the theoretical insights of (mitrovic2020representation) and explain the superior performance of ReLICv2 reported in section 3.1.
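One plausible way to compute such a per-class ratio is sketched below; this is our illustration and the authors' exact definition may differ (for instance, it could be based on class means rather than average pairwise distances).

import torch

def discriminant_ratios(embeddings, labels):
    """Per-class ratio of mean between-class to mean within-class L2 distance.

    embeddings: (N, d) tensor of validation-set representations.
    labels:     (N,) tensor of integer class labels.
    Returns a dict {class_id: ratio}; larger ratios indicate classes that are
    tighter internally and better separated from the rest.
    """
    ratios = {}
    for c in labels.unique():
        inside = embeddings[labels == c]
        outside = embeddings[labels != c]
        within = torch.pdist(inside).mean()            # within-class pairs
        between = torch.cdist(inside, outside).mean()  # class vs rest
        ratios[int(c)] = (between / within).item()
    return ratios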

Method Epochs Top-1
  BYOL (grill2020bootstrap) 1000 67.0
  Divide and Contrast (tian2021divide) 1000 67.9
ReLICv2 (ours) 1000 70.3
  BYOL (grill2020bootstrap) 3000 67.6
  Divide and Contrast (tian2021divide) 3000 69.8
ReLICv2 (ours) 3000 71.1
  BYOL (grill2020bootstrap) 5000 67.9
  Divide and Contrast (tian2021divide) 4500 70.7
ReLICv2 (ours) 5000 71.4
Table 3: Top-1 accuracy (in %) on ImageNet when learning representations using the JFT-300M dataset. Each method is pre-trained on JFT-300M for an ImageNet-equivalent number of epochs and evaluated on the ImageNet validation set under a linear evaluation protocol.

4.2 Scaling analysis

Figure 5 shows the ImageNet linear evaluation accuracy obtained by representations learned using ReLICv2 as a function of the number of images seen during pre-training on the ImageNet training set. It can be seen that in order to reach 70% accuracy the ResNet50 model requires approximately twice the number of iterations as the ResNet295 model. The ResNet295 has approximately 3.6 times the number of parameters of the ResNet50 (87M vs 24M, respectively). This finding is in accordance with other works which show that larger models are more sample efficient, i.e. they require fewer samples to reach a given accuracy (zhai2021scaling).

4.3 Ablations

The two main distinctions between prior work (mitrovic2020representation; mitrovic2020less) and ReLICv2 are the use of multi-crop augmentations and saliency masking. Here we ablate these techniques using top-1 ImageNet validation accuracy under the linear evaluation protocol with a ResNet50 pretrained using the same schedule as in section 3.1. In summary, starting from ReLIC (74.3%), we gain 2.5% by adding multi-crop and another 0.3% by adding saliency masking on top of that, which yields the final ReLICv2 performance of 77.1%.

Multi-crop.

ReLIC (mitrovic2020representation) constructs views using two crops of size 224×224. (caron2020unsupervised) suggested using two crops of size 224×224 and six crops of size 96×96. Here we ablate the use of different numbers of large and small crops using only the standard SimCLR augmentations (without saliency masks).

Crops [# large, # small] [2, 0] [2, 2] [2, 6] [4, 2] [6, 2] [8, 2]
Top-1 74.3 76.2 76.0 76.8 76.5 76.5

We find that using multi-crop improves results significantly over the ReLIC baseline ([2, 0]). However, we observe that using more large crops and fewer small crops performs better than the configuration proposed by (caron2020unsupervised).

Saliency masking.

We apply saliency masks on top of multi-crop during training enabling us to learn representations that focus on the semantically-relevant parts of the image, i.e. the foreground objects, and that are more robust to background changes. We report the top-1 accuracy under linear evaluation on ImageNet for different probabilities of removing the background of the large augmented crops during training.

Mask probability 0.0 0.1 0.15 0.2 0.25
Top-1 76.8 77.1 76.8 76.8 76.7

Applying the saliency masks 10% of the time results in the best performance and significantly improves over not using masking (77.1% vs 76.8%). Moreover, we also explored using different datasets for training DeepUSPS (deepusps) in an unsupervised way to obtain the saliency masks. We found this to have little overall effect on the results. See the supplementary material for full details.

Figure 5: ImageNet accuracy obtained by ReLICv2 as a function of the number of images seen during pre-training for a variety of ResNet architectures. The number of parameters of each model is given in parentheses.

5 Related work

Unsupervised learning of representations by combining multiple views of data is a classical problem in machine learning which builds on earlier foundational work on co-training (blum1998combining) and multi-view learning (kakade2007multi; sridharan2008information; chaudhuri2009multi; mcwilliams2012multi; mcwilliams2013correlated). More recently, contrastive multi-view approaches to representation learning have become an important area of research owing to their excellent performance in visual recognition tasks (oord2018representation; bachman2019learning; chen2020simple). However, the underlying mechanisms for why these techniques work are less well understood (tschannen2019mutual).

One intuitive avenue, borrowing from earlier work (blum1998combining), is to analyse contrastive methods from the perspective of an appropriate evaluation metric: a downstream classification task (saunshi2019theoretical; haochen2021provable). In this way, optimizing an unsupervised loss can be connected to performance on a supervised problem. The ReLIC family of methods also approaches representation learning through this lens.

In this review we focus on how two important algorithmic choices, namely explicitly enforcing invariance and a more considered treatment of positive and negative examples, are key factors in improving the downstream classification performance of unsupervised representations. A detailed comparison is provided in the supplementary material. Outside of this scope, (lee2021compressive) have taken a conditional entropy bottleneck approach which is particularly notable as it has also resulted in better-than-supervised performance, albeit only on the ResNet50 (2x) architecture (see figure 1).

Negatives.

A key observation of (chen2020simple) was that large batches (up to 4096) improve results. This was partly attributed to the effect of having more negatives, and motivated the incorporation of queues that function as large reservoirs of negative examples into contrastive learning (he2019momentum). However, subsequent work has shown that naively using a large number of negatives can have a detrimental effect on learning (mitrovic2020less; saunshi2019theoretical; chuang2020debiased; robinson2020contrastive). One reason for this is false negatives, that is, points in the set of negatives which actually belong to the same latent class as the anchor point. These points are likely to have a high relative similarity to the anchor and therefore contribute disproportionately to the loss. This has the effect of pushing apart points belonging to the same class in representation space.

Subsampling-based approaches have been proposed to avoid false negatives, either via importance sampling which attempts to find true negatives that are close to the latent class boundary of the anchor point (robinson2020contrastive), or by uniformly-at-random sampling a small number of points (mitrovic2020less).

Positives and invariance.

Learning representations which are invariant to data augmentation is known to be important for self-supervised learning. Invariance is usually achieved heuristically by comparing two different augmentations of the same anchor point. Incorporating an explicit clustering step is another way of enforcing some notion of invariance (caron2020unsupervised). However, neither of these strategies can be directly linked theoretically to learning more compact representations. More rigorously, (mitrovic2020representation) approach invariance from a causal perspective. They show that invariance must be explicitly enforced, via an invariance loss in addition to the contrastive loss, in order to obtain guaranteed generalization performance. Most recently, (dwibedi2021little) and (assran2021semi) use nearest neighbours to identify other elements from the batch which potentially belong to the same class as the anchor point.

Beyond ResNets.

New training methods (pham2021meta; wightman2021resnet), architectural innovations (brock2021high) and most recently transformer-based (ViT) approaches (dosovitskiy2020image) have improved fully supervised performance beyond the ResNet50 baseline commonly referred to in the contrastive learning literature (chen2020simple). Supervised contrastive methods have yet to achieve comparable performance (khosla2020supervised). Some recent works have also explored self-supervised learning using ViT architectures achieving promising results (caron2021emerging; li2021efficient).

Figure 6: Comparison of ImageNet top-1 accuracy between ReLICv2 and recent vision transformer-based architectures (Swin (liu2021swin) represents a fully supervised transformer baseline).

6 Discussion

ReLICv2 demonstrates for the first time that representations learned without access to labels can consistently outperform a strong supervised baseline on ImageNet. In a like-for-like comparison using ResNet50 encoders, ReLICv2 represents a substantial improvement over the current state of the art. This is a direct consequence of incorporating better strategies for selecting positive and negative points into the ReLIC framework, as suggested by the theoretical results of (mitrovic2020representation).

Although several works present similar ideas for negative (chuang2020debiased; robinson2020contrastive) and positive selection (dwibedi2021little) individually, we are the first to a) combine both positive and negative selection and b) incorporate an invariance loss to fully exploit the selection of better positives. Our approach is general and is not dependent on specific positive/negative selection strategies. Indeed, the optimal choice of positives remains an open question and is likely to be highly problem and data dependent.

Finally, as noted in section 5 vision transformers (ViTs) have emerged as promising architectures for visual representation learning. Figure 6 compares recent ViT-based methods against ReLICv2 using a variety of larger ResNet architectures. Notably, ReLICv2 outperforms DINO (caron2021emerging) and MoCo v3 (chen2021empirical) and exhibits similar performance to EsViT (li2021efficient) for comparable parameter counts despite these methods using more powerful architectures and more involved training procedures. Our results suggest that combining the insights we have developed with ReLICv2 alongside recent architectural innovations could lead to further improvements in representation learning and more powerful foundation models.

Acknowledgements

We thank Relja Arandjelovic, Yusuf Aytar, Andrew Zisserman, Koray Kavukcuoglu, Daan Wierstra and Karen Simonyan for discussions on this work and feedback on the manuscript.

References

Appendix A Image Preprocessing

a.1 Augmentations

Following the data augmentation protocols of [chen2020simple, grill2020bootstrap, caron2020unsupervised], ReLICv2 uses a set of augmentations to generate different views of the original image, which has three colour channels, red r, green g and blue b, with r, g, b in [0, 1].

The augmentations used (corresponding to aug in Listing 1) are the same as in [grill2020bootstrap] and are generated as follows; for the exact augmentation parameters see table 4. The following sequence of operations is performed in the given order.

  1. Crop the image: randomly select a patch of the image, between a minimum and maximum crop area of the image (given in table 4), with aspect ratio sampled log-uniformly in [3/4, 4/3]. Upscale the patch, via bicubic interpolation, to a square image of the target size (224×224 for large crops, 96×96 for small crops).

  2. Flip the image horizontally.

  3. Colour jitter: randomly adjust the brightness, contrast, saturation and hue of the image, in a random order, each uniformly by an amount bounded by the corresponding maximum adjustment specified in table 4.

  4. Grayscale the image, such that the three colour channels are combined into a single luminance channel.

  5. Randomly blur: apply a Gaussian kernel with standard deviation sampled uniformly in [0.1, 2.0].

  6. Randomly solarize: threshold each channel value x such that all values less than 0.5 are left unchanged and all values greater than or equal to 0.5 are replaced with 1 - x.

Apart from the initial step of image cropping, each step is executed with some probability to generate the final augmented image. These probabilities and other parameters are given in table 4, separately for augmenting the original image and the positives. Note that we use 4 large views of 224×224 pixels and 2 small views of 96×96 pixels; to generate the first and third large views and the first small view we use the parameters listed below for odd views, while for the second and fourth large views and the second small view we use the parameters for even views.

Parameter Even views Odd views
Probability of randomly cropping 50% 50%
Probability of horizontal flip 50% 50%
Probability of colour jittering 80% 80%
Probability of grayscaling 20% 20%
Probability of blurring 100% 10%
Probability of solarization 0% 20%
Maximum adjustment of brightness 0.4 0.4
Maximum adjustment of contrast 0.4 0.4
Maximum adjustment of saturation 0.2 0.2
Maximum adjustment of hue 0.1 0.1
Crop size 224 96 (small), 224 (large)
Crop minimum area 8% 5% (small), 14% (large)
Crop maximum area 100% 14% (small), 100% (large)
Table 4: Parameters of data augmentation scheme. Small/large indicates small or large crop.
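A minimal torchvision-style sketch of this pipeline for an odd large view, using the probabilities and maximum adjustments from table 4, is given below. This is our illustration: the blur kernel size, the 8-bit solarization threshold of 128 (roughly 0.5), always applying the random crop, and the exact transform ordering are assumptions.

import torchvision.transforms as T

# Odd large view: 224x224 crop with 14-100% of the image area (table 4).
odd_large_view = T.Compose([
    T.RandomResizedCrop(224, scale=(0.14, 1.0),
                        interpolation=T.InterpolationMode.BICUBIC),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomApply([T.ColorJitter(brightness=0.4, contrast=0.4,
                                 saturation=0.2, hue=0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.1),
    T.RandomSolarize(threshold=128, p=0.2),
])

The even views and the small 96×96 views would be built analogously from the corresponding columns of table 4 (for example, a blur probability of 100% and no solarization for even views, and a 96-pixel crop with 5-14% area for small views).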

a.2 Saliency Masking

Using unsupervised saliency masking enables us to create positives for the anchor image with the background largely removed and thus the learning process will rely less on the background to form representations. This encourages the representation to localize the objects in the image [zhao2020distilling].

We develop a fully unsupervised version of DeepUSPS [deepusps] to compute saliency masks for each image in the ImageNet training set. By applying the saliency masks on top of the large views, we obtain masked images with the background removed. To further increase background variability, instead of using a black background we apply a homogeneous grayscale background, with the grayscale level randomly sampled for each image during training. We also use a foreground threshold such that we apply the saliency mask only if the foreground covers at least a minimum fraction of the image. The masked images with the grayscaled background are used only during training: with a small probability we select the masked version of a large view in place of the original large view. Figure 7 shows how the saliency masks are applied on top of the images to obtain images with a grayscale background.

Figure 7: Illustration of how for each image in the ImageNet training set (left) we use our unsupervised version of DeepUSPS to obtain the saliency mask (middle) which we then apply on top of the image to obtain the image with the background removed (right).
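A sketch of this masking step with our own helper name is given below; the minimum-foreground threshold and the grayscale sampling range are placeholders rather than the values used in the paper.

import torch

def apply_saliency_mask(image, mask, min_foreground=0.05):
    """Replace the background of `image` with a random homogeneous grayscale
    level, keeping the salient foreground.

    image: (3, H, W) float tensor in [0, 1].
    mask:  (H, W) binary saliency mask (1 = foreground).
    min_foreground: assumed minimum fraction of foreground pixels required
                    for the mask to be applied at all (placeholder value).
    """
    if mask.float().mean() < min_foreground:
        return image                       # mask covers too little: skip it
    gray_level = torch.rand(()).item()     # random background gray level
    background = torch.full_like(image, gray_level)
    return torch.where(mask.bool().unsqueeze(0), image, background)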

a.2.1 Using DeepUSPS to obtain saliency masks

DeepUSPS [deepusps] is an unsupervised saliency prediction method that uses self-supervision to refine pseudo-labels produced by a number of handcrafted saliency methods. To train DeepUSPS, we first sample a random subset of ImageNet images; note that the original implementation of DeepUSPS uses images from the MSRA-B dataset. We instead use a randomly selected subset of the ImageNet training set of the same size to ensure a fair comparison to previous work. As the handcrafted saliency methods we use Robust Background Detection (RBD) [zhu2014saliency], Manifold Ranking (MR) [yang2013saliency], Dense and Sparse Reconstruction (DSR) [li2013saliency] and Markov Chain (MC) [jiang2013saliency] to compute the initial saliency masks. Note that these methods do not make use of any supervised label information.

For training DeepUSPS, we closely follow the approach described by [deepusps]. We employ the two-stage mechanism for DeepUSPS. In the first stage, the noisy pseudo-labels from each handcrafted method are iteratively refined. In the second stage, these refined labels from each handcrafted saliency method are used to train the final saliency detection network. The saliency detection network is then used to compute the saliency masks for all images in the ImageNet training set. We use the publicly available code for training DeepUSPS: https://tinyurl.com/wtlhgo3.

Note that the official implementation for DeepUSPS uses as backbone a DRN-network [yu2017dilated] which was pretrained on CityScapes [Cordts2016Cityscapes] with supervised labels. To be consistent with our fully-unsupervised setting, we replace this network with a ResNet50 2x model which was pretrained on ImageNet using the self-supervised objective from SWaV [caron2020unsupervised]. We used the publicly available pretrained SWaV model from: https://github.com/facebookresearch/swav.

To account for this change in the architecture, we adjust some of the model hyperparameters of DeepUSPS. In the first stage of DeepUSPS training, the pseudo-label generation networks used for refining the noisy pseudo-labels from each of the handcrafted methods are trained over three self-supervised iterations, starting from a base learning rate which is doubled during each iteration. In the second stage, the saliency detection network is trained on the refined labels. We use the Adam optimizer. The remaining hyperparameters are set in the same way as they are in the original DeepUSPS code.

Appendix B Pretraining on ImageNet – implementation details and additional results

b.1 Linear evaluation

Following the approach of [chen2020simple, grill2020bootstrap, caron2020unsupervised, dwibedi2021little], we use the standard linear evaluation protocol on ImageNet. We train a linear classifier on top of the frozen, pretrained representation, i.e. the encoder parameters as well as the batch statistics are not updated. For training the linear layer, we preprocess the data by applying standard spatial augmentations, i.e. randomly cropping the image with subsequent resizing to 224×224 and then randomly applying a horizontal flip. At test time, we resize images to 256 pixels along the shorter side with bicubic resampling and apply a 224×224 center crop. Both for training and testing, after performing the above processing, we normalize the color channels by subtracting the average channel value and dividing by the standard deviation of the channel value (as computed on ImageNet). To train the linear classifier, we optimize the cross-entropy loss with stochastic gradient descent with Nesterov momentum; we do not use any weight decay or other regularization techniques. In the following tables we report the top-1 and top-5 accuracies of different methods for a varied set of ResNet encoders of different sizes, spanning ResNet50, ResNet101, ResNet152 and ResNet200 with layer widths of 1x, 2x and 4x. ResNet50 with 2x and 4x wider layers has 94 and 375 million parameters, respectively; ResNet101, ResNet152 and ResNet200 have approximately 44, 60 and 65 million parameters, and ResNet200 2x has approximately 250 million.
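A condensed sketch of this linear-evaluation setup is given below; it is our illustration, the default epochs, learning rate and momentum mirror the JFT-300M linear-evaluation values quoted in Appendix C.1 and are placeholders for the ImageNet protocol, and the 2048-dimensional feature width assumes a standard ResNet50.

import torch
import torch.nn as nn

def train_linear_probe(encoder, train_loader, num_classes=1000,
                       epochs=100, lr=0.5, momentum=0.9, device="cuda"):
    """Train a linear classifier on top of a frozen, pretrained encoder."""
    encoder.eval()                                   # freeze batch statistics
    for p in encoder.parameters():
        p.requires_grad = False

    probe = nn.Linear(2048, num_classes).to(device)  # ResNet50 feature width
    opt = torch.optim.SGD(probe.parameters(), lr=lr,
                          momentum=momentum, nesterov=True)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                features = encoder(images)           # frozen representation
            loss = criterion(probe(features), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe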

In the following table 5, we present results under linear evaluation on the ImageNet validation set for a varied set of ResNet architectures; we compare against different unsupervised representation learning methods and use as supervised baselines the results reported in [chen2020simple, grill2020bootstrap]. Note that the supervised baselines reported in [chen2020simple] are used extensively throughout the self-supervised literature to compare performance against supervised learning. For architectures for which supervised baselines are not available in [chen2020simple], we use the supervised baselines reported in [grill2020bootstrap], which use stronger augmentations for training the supervised models than [chen2020simple] and as such do not represent a direct like-for-like comparison with self-supervised methods.

Across this varied set of ResNet architectures, ReLICv2 outperforms the supervised baselines in all cases, with margins of up to 1.2% in absolute terms.

Method Top-1 Top-5
Supervised [chen2020simple] 77.8 -
MoCo [he2019momentum] 65.4 -
SimCLR [chen2020simple] 74.2 92.0
BYOL [grill2020bootstrap] 77.4 93.6
SwAV [caron2020unsupervised] 77.3 -
C-BYOL [lee2021compressive] 78.8 94.5
ReLICv2 (ours) 79.0 94.5
(a) ResNet50 2x encoder.
Method Top-1 Top-5
Supervised [chen2020simple] 78.9 -
MoCo [he2019momentum] 68.6 -
SimCLR [chen2020simple] 76.5 93.2
SwAV [caron2020unsupervised] 77.9 -
BYOL [grill2020bootstrap] 78.6 94.2
ReLICv2 (ours) 79.4 94.3
(b) ResNet50 4x encoder.
Method Top-1 Top-5
Supervised [grill2020bootstrap] 78.0 94.0
BYOL [grill2020bootstrap] 76.4 93.0
ReLICv2 (ours) 78.7 94.4
(c) ResNet101 encoder.
Method Top-1 Top-5
Supervised [grill2020bootstrap] 79.1 94.5
BYOL [grill2020bootstrap] 77.3 93.7
ReLICv2 (ours) 79.3 94.6
(d) ResNet152 encoder.
Method Top-1 Top-5
Supervised [grill2020bootstrap] 79.3 94.6
BYOL [grill2020bootstrap] 77.8 93.9
ReLICv2 (ours) 79.8 95.0
(e) ResNet200 encoder.
Method Top-1 Top-5
Supervised [grill2020bootstrap] 80.1 95.2
BYOL [grill2020bootstrap] 79.6 94.8
ReLICv2 (ours) 80.6 95.2
(f) ResNet200 2x encoder.
Table 5: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for a varied set of ResNet architectures.

b.2 Semi-supervised learning

We further test ReLICv2 representations learned with bigger ResNet models in the semi-supervised setting. For this, we follow the semi-supervised protocol as in [chen2020simple, grill2020bootstrap, caron2020unsupervised]. First, we initialize the encoder with the parameters of the pretrained representation and add on top of this encoder a randomly initialized linear classifier. Then we train both the encoder and the linear layer using either 1% or 10% of the ImageNet training data; for this we use the splits introduced in [chen2020simple] which have been used by all the methods we compare to [grill2020bootstrap, caron2020unsupervised, dwibedi2021little, lee2021compressive]. For training, we randomly crop the image, resize it to 224×224 and then randomly apply a horizontal flip. At test time, we resize images to 256 pixels along the shorter side with bicubic resampling and apply a 224×224 center crop. Both for training and testing, after performing the above processing, we normalize the color channels by subtracting the average channel value and dividing by the standard deviation of the channel value (as computed on ImageNet). Note that this is the same data preprocessing protocol as in the linear evaluation protocol. To train the model, we use a cross-entropy loss with stochastic gradient descent with Nesterov momentum, and we decay the initial learning rates in steps during training. Following the approach of [caron2020unsupervised], we use different learning rates for the encoder and the linear classifier parameters. For the two label fractions we use different batch sizes and base learning rates for the linear layer and the encoder; weight decay is used in only one of the two settings, and no other regularization techniques are used. From table 6, we see that ReLICv2 outperforms competing self-supervised methods on ResNet50 2x in both the 1% and 10% settings. For the larger ResNet50 4x and ResNet200 2x encoders, ReLICv2 is state-of-the-art with respect to top-1 accuracy in the low-data regime of 1% labels. On these networks, BYOL outperforms ReLICv2 in the higher-data regime of 10% labels. Note that BYOL fine-tunes its semi-supervised models for more epochs than ReLICv2; we hypothesize that longer training is needed for ReLICv2 representations on larger ResNets as they have more model parameters.
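The key implementation detail, separate learning rates for the encoder and the freshly initialized linear classifier, can be sketched as follows; this is our illustration and the learning rates, number of epochs and feature width are placeholders rather than the paper's values.

import torch
import torch.nn as nn

def finetune_semi_supervised(encoder, train_loader, num_classes=1000,
                             encoder_lr=0.01, head_lr=0.1, epochs=30,
                             momentum=0.9, device="cuda"):
    """Fine-tune a pretrained encoder plus a fresh linear head on a labelled
    subset of ImageNet (illustrative; hyper-parameters are placeholders)."""
    head = nn.Linear(2048, num_classes).to(device)   # ResNet50 feature width
    opt = torch.optim.SGD(
        [{"params": encoder.parameters(), "lr": encoder_lr},
         {"params": head.parameters()}],
        lr=head_lr, momentum=momentum, nesterov=True)
    criterion = nn.CrossEntropyLoss()

    encoder.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(head(encoder(images)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder, head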

Method Top-1 Top-5
1% 10% 1% 10%
SimCLR [chen2020simple] 58.5 71.7 83.0 91.2
BYOL [grill2020bootstrap] 62.2 73.5 84.1 91.7
ReLICv2 (ours) 64.7 73.7 85.4 92.0
ResNet50 2x encoder.
Method Top-1 Top-5
1% 10% 1% 10%
SimCLR [chen2020simple] 63.0 74.4 85.8 92.6
BYOL [grill2020bootstrap] 69.1 75.7 87.9 92.5
ReLICv2 (ours) 69.5 74.6 87.3 91.6
ResNet50 4x encoder.

Method Top-1 Top-5
1% 10% 1% 10%
BYOL [grill2020bootstrap] 71.2 77.7 89.5 93.7
ReLICv2 (ours) 72.1 76.4 89.5 93.0
ResNet200 2x encoder.
Table 6: Top-1 and top-5 accuracy (in %) after semi-supervised training with a fraction of ImageNet labels for different ResNet encoders and unsupervised representation learning methods. Results are reported on the ImageNet validation set.

b.3 Transfer

We follow the transfer performance evaluation protocol outlined in [grill2020bootstrap, chen2020simple]. We evaluate ReLICv2 in both transfer settings: linear evaluation and fine-tuning. For the linear evaluation protocol we freeze the encoder and train only a randomly initialized linear classifier which is put on top of the encoder. For fine-tuning, in addition to training the randomly initialized linear classifier, we also allow gradients to propagate into the encoder, which has been initialized with the parameters of the pretrained representation. In line with prior work [chen2020simple, grill2020bootstrap, dwibedi2021little], we test ReLICv2 representations on the following datasets: Food101 [bossard2014food], CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning], Birdsnap [berg2014birdsnap], SUN397 (split 1) [xiao2010sun], DTD (split 1) [cimpoi2014describing], Cars [krause20133d], Aircraft [maji2013fine], Pets [parkhi2012cats], Caltech101 [fei2004learning], and Flowers [nilsback2008automated].

Again in line with previous methods [chen2020simple, grill2020bootstrap, dwibedi2021little], for Food101 [bossard2014food], CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning], Birdsnap [berg2014birdsnap], SUN397 (split 1) [xiao2010sun], DTD (split 1) [cimpoi2014describing], and Cars [krause20133d] we report the top-1 accuracy on the test set, and for Aircraft [maji2013fine], Pets [parkhi2012cats], Caltech101 [fei2004learning], and Flowers [nilsback2008automated] we report the mean per-class accuracy as the relevant metric in the comparisons. For DTD and SUN397, we only use the first of the 10 provided splits, as per [chen2020simple, grill2020bootstrap, dwibedi2021little].

We train on the training sets of the individual datasets and sweep over different values of the model hyperparameters. To select the best hyperparameters, we use the validation sets of the individual datasets. Using the chosen hyperparameters, we train the appropriate model on the merged training and validation data and test on the held-out test data in order to obtain the numbers reported in table 7. We swept over learning rates, batch sizes, weight decay values, the number of warm-up epochs, momentum values, the use of Nesterov momentum, and the number of training epochs; for fine-tuning we also considered longer training schedules for datasets where lower learning rates were preferable. Models were trained with the SGD optimizer with momentum.

As can be seen from table 7, ReLICv2 representations yield better performance than both state-of-the-art self-supervised methods as well as the supervised baseline across a wide range of datasets. Specifically, ReLICv2 is best on 7 out of 11 datasets and on 8 out of 11 datasets in the linear and fine-tuning settings, respectively.

Method Food101 CIFAR10 CIFAR100 Birdsnap SUN397 Cars Aircraft DTD Pets Caltech101 Flowers
Linear evaluation:
Supervised-IN [chen2020simple] 94.5
SimCLR [chen2020simple]
BYOL [grill2020bootstrap] 96.1
NNCLR [dwibedi2021little] 93.7 79.0
ReLICv2 (ours) 80.6 65.4 66.2 75.1 64.8 77.4 92.4
Fine-tuned:
Random Init [chen2020simple]
Supervised-IN [chen2020simple] 86.4
SimCLR [chen2020simple]
BYOL [grill2020bootstrap] 97.8 93.8
ReLICv2 (ours) 88.7 76.7 64.7 92.3 88.7 76.9 92.2 97.9
Table 7: Accuracy (in %) of transfer performance of a ResNet50 pretrained on ImageNet.

b.4 Semantic segmentation

We evaluate the ability of ReLICv2 to facilitate successful transfer of the learned representations to PASCAL [Everingham10] and Cityscapes [Cordts2016Cityscapes] semantic segmentation tasks.

In accordance with [he2019momentum], we use the ReLICv2 ImageNet representation to initialise a fully convolutional backbone, which we fine-tune on the PASCAL train_aug2012 set for 45 epochs and report the mean intersection over union (mIoU) on the val2012 set. The fine-tuning on Cityscapes is done on the train_fine set for 160 epochs and evaluated on the val_fine set.

The results in the main text demonstrate that ReLICv2 compares favourably in terms of semantic segmentation to both BYOL and DetCon on PASCAL, reaching 77.9 mIoU. ReLICv2 also outperforms BYOL on Cityscapes, with 75.2 vs 74.6 mIoU.

b.5 Robustness and OOD Generalization

The robustness and out-of-distribution (OOD) generalization abilities of ReLICv2 representations are tested on several datasets. We use the ImageNetV2 [recht2019imagenet] and ImageNet-C [hendrycks2019benchmarking] datasets to evaluate robustness. ImageNetV2 [recht2019imagenet] has three sets of images that were collected to have a similar distribution to the original ImageNet validation set, while ImageNet-C [hendrycks2019benchmarking] consists of 15 synthetically generated corruptions (e.g. blur, noise) that are added to the ImageNet validation set.

For OOD generalization we examine the performance on ImageNet-R [hendrycks2021many], ImageNet-Sketch [wang2019learning] and ObjectNet [barbu2019objectnet]. ImageNet-R [hendrycks2021many] consists of different renditions (e.g. paintings, cartoons) of ImageNet classes, while ImageNet-Sketch [wang2019learning] consists of black-and-white sketches of objects from each ImageNet class. These datasets aim to test robustness to different textures and other naturally occurring style changes and are out-of-distribution with respect to the ImageNet training data. ObjectNet [barbu2019objectnet] has images with differing viewpoints and backgrounds compared to ImageNet.

On all datasets we evaluate the representations of a standard ResNet50 encoder under a linear evaluation protocol akin to Section 3.1, i.e. we freeze the pretrained representations and train a linear classifier using the labelled ImageNet training set; the test evaluation is performed zero-shot, i.e. no training is done on the above datasets. As can be seen from table 8(a), ReLICv2 learns more robust representations and outperforms both the supervised baseline and the competing self-supervised methods on ImageNetV2 and ImageNet-C. We provide a detailed breakdown across the different ImageNet-C corruptions in table 9. Furthermore, ReLICv2 learns representations that outperform competing self-supervised methods while being on par with supervised performance in terms of OOD generalization; see table 8(b).

Method MF T-0.7 TI IN-C
  Supervised 65.1 73.9 78.4 40.9
  SimCLR [chen2020simple] 53.2 61.7 68.0 31.1
  BYOL [grill2020bootstrap] 62.2 71.6 77.0 42.8
ReLIC [mitrovic2020representation] 63.1 72.3 77.7 44.5
ReLICv2 (ours) 65.4 74.5 79.5 44.8
(a) Datasets testing robustness.
Method IN-R IN-S ObjectNet
Supervised 24.0 6.1 26.6
SimCLR [chen2020simple] 18.3 3.9 14.6
BYOL [grill2020bootstrap] 23.0 8.0 23.0
ReLIC [mitrovic2020representation] 23.8 9.1 23.8
ReLICv2 (ours) 23.9 9.9 25.9
(b) Datasets testing OOD generalization.
Table 8: Top-1 Accuracy (in %) under linear evaluation on the ImageNetV2 and ImageNet-C datasets (robustness datasets) and the ImageNet-R (IN-R), ImageNet-Sketch (IN-S) and ObjectNet (out-of-distribution) datasets for different unsupervised representation learning methods. We evaluate on all three variants of ImageNetV2 – matched frequency (MF), Threshold 0.7 (T-0.7) and Top Images (TI). The results for ImageNet-C (IN-C) are averaged across the 15 different corruptions.

Method Gauss Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic Pixel JPEG
  Supervised [lim2019fast] 37.1 35.1 30.8 36.8 25.9 34.9 38.1 34.5 40.7 56.9 68.1 40.6 45.6 32.6 56.0
  SimCLR [chen2020simple] 29.1 26.3 17.3 22.1 14.7 20.0 18.6 27.2 33.3 46.2 59.7 53.9 31.0 24.2 43.9
  BYOL [grill2020bootstrap] 41.5 38.7 31.9 37.8 22.5 31.6 29.6 35.1 42.9 60.1 69.0 58.4 41.5 46.3 55.9
ReLIC [mitrovic2020representation] 43.4 40.7 36.6 40.5 24.5 34.3 30.5 36.6 43.8 61.4 69.5 59.5 42.8 46.8 57.3
ReLICv2 (ours) 41.6 39.0 31.1 39.7 22.6 35.2 34.5 40.1 46.1 64.5 71.0 60.0 44.6 46.6 58.4

Table 9: Top-1 accuracy (in %) on ImageNet-C for each of the 15 corruptions, grouped into the Noise (Gauss, Shot, Impulse), Blur (Defocus, Glass, Motion, Zoom), Weather (Snow, Frost, Fog, Bright) and Digital (Contrast, Elastic, Pixel, JPEG) corruption types.

Appendix C Pretraining on Joint Foto Tree (JFT-300M) – implementation details and additional results

c.1 Linear evaluation

For the results reported in table 3, we use the following training and evaluation protocol. To pretrain ReLICv2 on the Joint Foto Tree (JFT-300M) dataset, we used a base learning rate of for pretraining the representations for ImageNet-equivalent epochs. For longer pretraining of and ImageNet-equivalent epochs, we use a lower base learning rate of . We set the target exponential moving average to , the contrast scale to , the temperature to and the saliency mask apply probability to for all lengths of pretraining. For and ImageNet-equivalent epochs we use as the invariance scale, while for ImageNet-equivalent epochs, we use invariance scale . We then follow the linear evaluation protocol on ImageNet described in Appendix B.1. We train a linear classifier on top of the pretrained JFT-300M representations with stochastic gradient descent with Nesterov momentum for 100 epochs, using a batch size of 256, a learning rate of 0.5 and a momentum of 0.9.
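For concreteness, the linear-classifier optimiser described above corresponds to a configuration along the following lines (a sketch; only the optimiser settings and schedule length are taken from the text, the feature and class dimensions are those of a standard ResNet50 evaluated on ImageNet):

import torch

head = torch.nn.Linear(2048, 1000)   # ResNet50 features -> ImageNet classes
optimizer = torch.optim.SGD(
    head.parameters(),
    lr=0.5,            # learning rate of 0.5
    momentum=0.9,      # momentum of 0.9
    nesterov=True,     # SGD with Nesterov momentum
)
num_epochs = 100       # trained for 100 epochs with batch size 256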

c.2 Transfer

We evaluate the transfer performance of the JFT-300M pretrained representations under the linear evaluation protocol. For this, we freeze the encoder and train only a linear classifier on top of the frozen encoder output, i.e. the representation. As in B.3, we follow the transfer evaluation protocol outlined in [grill2020bootstrap, chen2020simple]. In line with prior work, for Food101 [bossard2014food], CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning], Birdsnap [berg2014birdsnap], SUN397 (split 1) [xiao2010sun], DTD (split 1) [cimpoi2014describing], and Cars [krause20133d] we report the top-1 accuracy on the test set, and for Aircraft [maji2013fine], Pets [parkhi2012cats], Caltech101 [fei2004learning], and Flowers [nilsback2008automated] we report the mean per-class accuracy as the relevant metric in the comparisons. For DTD and SUN397, we only use the first of the 10 provided splits.
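The two reported metrics differ only in how errors are aggregated over classes; a minimal sketch of both (plain NumPy, our own function names):

import numpy as np

def top1_accuracy(preds, labels):
    # Fraction of test examples whose predicted class matches the label.
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float((preds == labels).mean())

def mean_per_class_accuracy(preds, labels, num_classes):
    # Accuracy computed separately for each class and then averaged,
    # so that rare classes count as much as frequent ones.
    preds, labels = np.asarray(preds), np.asarray(labels)
    per_class = [(preds[labels == c] == c).mean()
                 for c in range(num_classes) if np.any(labels == c)]
    return float(np.mean(per_class))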

We train on the training sets of the individual datasets and sweep over different values of the model's hyperparameters. To select the best hyperparameters, we use the validation sets of the individual datasets. Using the chosen hyperparameters, we train the linear layer from scratch on the merged training and validation data and test on the held-out test data in order to obtain the numbers reported in table 10. We swept over learning rates {, , , , , , , , }, batch sizes {, , , }, weight decay in {, , , , , }, warmup epochs {, }, momentum {, }, Nesterov {True, False}, and the number of training epochs {, , }. Models were trained with the SGD optimizer with momentum.
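The selection procedure itself can be summarised as follows (a sketch with a purely illustrative, much smaller grid; train_linear_fn and eval_fn stand in for the actual linear-classifier training and evaluation routines, and datasets are assumed to be list-like so that they can be concatenated):

from itertools import product

def sweep_select_and_retrain(train_linear_fn, eval_fn,
                             train_set, val_set, test_set, grid):
    # grid: dict mapping hyperparameter names to lists of candidate values.
    best_score, best_cfg = float("-inf"), None
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        head = train_linear_fn(train_set, **cfg)   # fit on the training split
        score = eval_fn(head, val_set)             # select on the validation split
        if score > best_score:
            best_score, best_cfg = score, cfg
    # Retrain from scratch on the merged train + validation data with the
    # chosen hyperparameters and report the held-out test accuracy.
    final_head = train_linear_fn(train_set + val_set, **best_cfg)
    return eval_fn(final_head, test_set), best_cfg

# Illustrative grid only; the actual sweep is described in the text above.
example_grid = dict(learning_rate=[0.01, 0.1, 1.0], weight_decay=[0.0, 1e-4])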

As can be seen from table 10, longer pretraining benefits transfer performance of ReLICv2. Although DnC [tian2021divide] was specifically developed to handle uncurated datasets such as JFT-300M, we see that ReLICv2 has comparable performance to DnC in terms of the number of datasets with state-of-the-art performance among self-supervised representation learning methods; this showcases the generality of ReLICv2.

Method Food101 CIFAR10 CIFAR100 Birdsnap SUN397 Cars Aircraft DTD Pets Caltech101 Flowers
BYOL-5k [grill2020bootstrap] 73.3 89.8 72.4 38.2 61.8 64.4 54.4 75.5 77 90.1 94.3
DnC-4.5k [tian2021divide] 78.7 91.7 74.9 42.1 65.0 75.3 54.1 76.6 86.1 90.2 98.2
ReLICv2-1k (ours) 77.5 90.2 72.6 47.4 64.5 74.4 62.9 77.0 84.9 92.2 94.5
ReLICv2-5k (ours) 78.3 89.9 73.0 49.4 65.6 76.9 65.5 76.8 85.1 91.4 95.7
Table 10: Accuracy (in %) of transfer performance of a ResNet50 pretrained on JFT under the linear transfer evaluation protocol. xk refers to the length of pretraining in ImageNet-equivalent epochs, e.g. 1k corresponds to 1000 ImageNet-equivalent epochs of pretraining.

c.3 Robustness and OOD Generalization

We also tested the robustness and out-of-distribution (OOD) generalization of ReLICv2 representations pretrained on JFT. We use the same set-up described in B.5: we freeze the representations pretrained on JFT-300M, train a linear classifier using the labelled ImageNet training set and perform zero-shot test evaluation on the datasets testing robustness and OOD generalization. As in B.5, we evaluated robustness using the ImageNetV2 [recht2019imagenet] and ImageNet-C [hendrycks2019benchmarking] datasets and OOD generalization using the ImageNet-R [hendrycks2021many], ImageNet-Sketch [wang2019learning] and ObjectNet [barbu2019objectnet] datasets. We report the robustness results in table 11(a) and the OOD generalization results in table 11(b). We notice that ReLICv2 representations pretrained on JFT-300M for different numbers of ImageNet-equivalent epochs have worse robustness and OOD generalization performance than ReLICv2 representations pretrained directly on ImageNet (see tables 8(a) and 8(b) for reference). Given that the above datasets have been specifically constructed to measure the robustness and OOD generalization abilities of models pretrained on ImageNet (as they were constructed in relation to ImageNet), this result is not entirely surprising. We hypothesize that there is a larger discrepancy between these datasets and JFT-300M than between these datasets and ImageNet, and as such JFT-300M-pretrained representations perform worse than ImageNet-pretrained ones. Additionally, note that pretraining on JFT-300M for longer does not necessarily result in better downstream performance on the robustness and out-of-distribution datasets.

Epochs MF T-0.7 TI IN-C
1000 57.6 66.7 73.0 32.9
3000 58.6 67.5 73.4 32.8
5000 59.1 67.3 73.3 33.5
(a) ImageNetV2 and ImageNet-C datasets.
Epochs IN-R IN-Sketch ObjectNet
1000 20.4 6.7 20.3
3000 20.3 8.7 21.3
5000 20.3 5.4 20.9
(b) ImageNet-R, ImageNet-Sketch and ObjectNet datasets.
Table 11: Top-1 Accuracy (in %) under linear evaluation on the ImageNet-R (IN-R), ImageNet-Sketch (IN-S) and ObjectNet out-of-distribution datasets and on the ImageNetV2 and ImageNet-C datasets for ReLICv2 pretrained on JFT-300M for different numbers of ImageNet-equivalent epochs. We evaluate on all three variants of ImageNetV2 – matched frequency (MF), Threshold 0.7 (T-0.7) and Top Images (TI). The results for ImageNet-C (IN-C) are averaged across the 15 different corruptions.

Appendix D Ablations

In order to determine the sensitivity of ReLICv2 to different model hyperparameters, we perform an extensive ablation study. Unless otherwise noted, in this section we report results after epochs of pretraining. As saliency masking is one of the main additions of ReLICv2 on top of ReLIC and was not covered extensively in the main text, we start our ablation analysis by looking into the effect of different modelling choices for it.

d.1 Using different datasets for obtaining the saliency masks

In the main text in Sections 3.1, 3.2, 3.3, 3.4 we used a DeepUSPS [deepusps] saliency detection network trained only on a randomly selected subset of ImageNet images. Here we explore whether using additional data could help improve the performance of the saliency estimation and of the overall representations learnt by ReLICv2. For this purpose, we use the MSRA-B dataset [liu2010learning], which was originally used by DeepUSPS to train their saliency detection network. MSRA-B consists of training images for which handcrafted masks computed with the methods Robust Background Detection (RBD) [zhu2014saliency], Hierarchy-associated Rich Features (HS) [zou2015harf], Dense and Sparse Reconstruction (DSR) [li2013saliency] and Markov Chain (MC) [jiang2013saliency] are already available. We use the same hyperparameters as described in Section A.2.1 to train DeepUSPS on MSRA-B.

We explored whether using saliency masks obtained from training DeepUSPS on the MSRA-B dataset affects the performance of ReLICv2 pretraining on ImageNet. We noticed that for ReLICv2 representations pretrained on ImageNet for epochs, we get top-1 and top-5 accuracy under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder. The slight performance gains may be due to the larger variety of images in MSRA-B used for training the saliency detection network, as opposed to the random sample of ImageNet images that we used for training DeepUSPS directly on the ImageNet dataset.

We also explored training DeepUSPS on randomly selected images from the ImageNet dataset; this resulted in the DeepUSPS model overfitting, which degraded the quality of the saliency masks and resulted in a ReLICv2 performance of top-1 and top-5 accuracy on the ImageNet validation set after epochs of pretraining on the ImageNet training set.

The results for ReLICv2 in Section 3.5 are obtained by applying the DeepUSPS saliency detection network trained on MSRA-B to all images in JFT-300M and then applying the saliency masks to the large augmented views during training as described in Section A.2.

d.2 Analysis and ablations for saliency masks

Using saliency masking during ReLICv2 training enables us to learn representations that focus on the semantically relevant parts of the image, i.e. the foreground objects, and as such the learned representations should be more robust to background changes. We investigate the impact of using saliency masks with competing self-supervised methods, the effect of the probability with which the saliency mask is applied to each large augmented view during training, and the robustness of ReLICv2 to random masks and mask corruptions. For the ablation experiments described in this section, we train the models for epochs.

Using saliency masks with competing self-supervised methods.

We evaluate the impact of using saliency masks with competing self-supervised methods such as BYOL [grill2020bootstrap]. BYOL only uses two large augmented views during training; we randomly apply the saliency masks, in a similar way as described in Section A.2, to each large augmented view with probability . We report in table 12 the top-1 and top-5 accuracy under linear evaluation on ImageNet for different settings of the probability of removing the background of the augmented images. We notice that saliency masking also helps to improve the performance of BYOL.
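A minimal sketch of this masking step (NumPy; the function name is ours, and the saliency mask is assumed to be a binary foreground map with the same spatial size as the augmented view):

import numpy as np

def maybe_apply_saliency_mask(view, mask, p, rng=np.random):
    # view: (H, W, 3) augmented image; mask: (H, W) binary foreground map.
    # With probability p, zero out the background; otherwise leave the view unchanged.
    if rng.random() < p:
        return view * mask[..., None]
    return view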

Mask probability 0 0.1 0.15 0.2 0.25 0.3
BYOL Top-1 73.1 73.4 73.2 73.3 72.8 71.8
Top-5 91.2 91.3 91.2 91.3 90.8 90.1
Table 12: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for BYOL trained using different probabilities of using the saliency mask to remove the background of the augmented images. Models are trained for 300 epochs.
Mask apply probability.

We also investigate the effect of using probabilities ranging from 0 to 1 for applying the saliency mask during training for ReLICv2. In addition, we explore further the effect of using different datasets for training the saliency detection network in DeepUSPS that is subsequently used for computing the saliency masks. Table 13 reports the top-1 and top-5 accuracy for varying the mask apply probability between 0 and 1 and for using the ImageNet vs. the MSRA-B dataset [liu2010learning] for training DeepUSPS. Note that using the additional images from the MSRA-B dataset to train the DeepUSPS saliency detection network results in better saliency masks which translates to better performance when using the saliency masks during ReLICv2 training.

DeepUSPS trained on ImageNet DeepUSPS trained on MSRA-B
Mask probability Top-1 Top-5 Top-1 Top-5
  0 75.2 92.4 75.2 92.4
  0.05 75.3 92.6 75.2 92.6
  0.1 75.4 92.5 75.3 92.4
  0.15 75.2 92.5 75.5 92.5
  0.2 75.2 92.5 75.6 92.6
  0.25 75.0 92.3 75.3 92.5
  0.3 75.1 92.3 74.8 92.4
  0.4 75.0 92.3 75.3 92.5
  0.5 74.7 92.2 75.0 92.4
  0.6 75.0 92.3 75.0 92.3
  0.7 74.4 92.3 74.6 92.0
  0.8 73.9 91.7 75.0 92.1
  0.9 74.0 91.7 74.6 92.0
  1.0 73.7 91.7 74.5 92.0
Table 13: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder for different probabilities of using the saliency mask to remove the background of the large augmented views during training and for different datasets used to train the DeepUSPS saliency detection network that computes the saliency masks. Models are trained for 300 epochs.
Random masks and mask corruptions.

To understand how important accurate saliency masks are for the downstream performance of the representations, we also investigated using random masks, corrupting the saliency masks obtained from DeepUSPS, and using a bounding box around the saliency masks during ReLICv2 training.

We explored using completely random masks, setting the saliency mask to be a random rectangle of the image, and setting it to a centered rectangle. As ImageNet images generally contain objects centered in the middle of the image, we expect that a rectangle centered around the middle will cover a reasonable portion of the object. Table 14 reports the performance under linear evaluation on the ImageNet validation set when varying the size of the random masks to cover different percentage areas of the full image. We notice that improving the quality of the masks, by using random rectangle patches instead of completely random points in the image as the mask, results in better performance. However, the performance with random masks is lower than with the saliency masks from DeepUSPS. As expected, using centered rectangles instead of randomly positioned rectangles as masks results in better performance.
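For reference, the random masks used here can be generated roughly as follows (a sketch under our own naming; we assume square rectangles, which the text does not specify):

import numpy as np

def rectangle_mask(height, width, area_fraction, centered=False, rng=np.random):
    # Binary mask whose rectangular foreground covers ~area_fraction of the image.
    side_h = max(1, int(round(height * np.sqrt(area_fraction))))
    side_w = max(1, int(round(width * np.sqrt(area_fraction))))
    if centered:
        top, left = (height - side_h) // 2, (width - side_w) // 2
    else:
        top = rng.randint(0, height - side_h + 1)
        left = rng.randint(0, width - side_w + 1)
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:top + side_h, left:left + side_w] = 1.0
    return mask

def random_points_mask(height, width, area_fraction, rng=np.random):
    # "Completely random" mask: a random subset of pixels marked as foreground.
    return (rng.random((height, width)) < area_fraction).astype(np.float32)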

Random Points Random Rectangle Centered Rectangle
Image percentage area Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
  10% 70.8 89.9 70.9 90.3 71.3 90.1
  20% 72.2 90.7 73.1 91.3 73.4 91.3
  30% 72.9 91.3 73.8 91.8 73.8 91.9
  40% 73.1 91.4 74.2 91.9 74.1 92.0
  50% 73.3 91.5 74.0 92.0 74.3 92.0
  60% 73.6 91.8 74.2 92.1 74.3 92.2
  70% 73.7 91.9 74.4 92.1 74.4 92.2
  80% 74.1 92.1 74.4 92.2 74.2 92.1
  90% 74.1 92.2 74.4 92.1 74.2 92.2
Table 14: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder for different types of random masks that cover various percentage areas of the full image. These random masks are applied on top of the large augmented views during training with probability 0.1. Models are trained for 300 epochs.

Moreover, to test the robustness of ReLICv2 to corruptions of the saliency masks, we add to/remove from the masks a rectangle whose area is proportional to the area of the saliency mask. The rectangle is added/removed at the image center. Table 15 reports the results when varying the area of the rectangle so that it covers different percentages of the saliency masks obtained from DeepUSPS. We notice that while ReLICv2 is robust to small corruptions of the saliency mask, its performance drops as the quality of the saliency masks degrades.
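The corruption step can be sketched as follows (our own naming; we assume the added/removed region is a square centered in the image, consistent with the description above):

import numpy as np

def corrupt_saliency_mask(mask, fraction, mode="add"):
    # Add to / remove from `mask` a centered square whose area is
    # `fraction` of the saliency mask area (fraction in [0, 1]).
    height, width = mask.shape
    side = int(round(np.sqrt(fraction * mask.sum())))
    side = min(side, height, width)
    top, left = (height - side) // 2, (width - side) // 2
    corrupted = mask.copy()
    corrupted[top:top + side, left:left + side] = 1.0 if mode == "add" else 0.0
    return corrupted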

Add rectangle to mask Remove rectangle from mask
Mask percentage area Top-1 Top-5 Top-1 Top-5
  10% 75.2 92.5 75.2 92.3
  20% 75.3 92.6 75.1 92.4
  30% 75.1 92.3 74.7 92.2
  40% 74.9 92.2 74.6 92.2
  50% 74.9 92.4 74.5 92.0
  60% 74.9 92.2 74.0 91.7
  70% 74.8 92.2 73.6 91.7
  80% 74.8 92.4 73.4 91.4
  90% 74.7 92.2 73.0 91.3
  100% 74.6 92.3 72.6 90.9
Table 15: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder for corrupting the saliency masks by adding/removing a rectangle at the image center. The rectangle covers a percentage of the saliency mask area (the higher the percentage, the stronger the corruption). The corrupted saliency masks are applied on top of the large augmented views during training with probability 0.1.

Finally, we also explore corrupting the masks by using a bounding box around the saliency mask obtained from DeepUSPS. The resulting top-1 and top-5 accuracy under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder is comparable to using random rectangles to mask the large augmented views during training (see table 14) and is lower than directly using the saliency masks from DeepUSPS.

d.3 Other model hyperparameters

Now we turn our attention to ablating the effect of other model hyperparameters on the downstream performance of ReLICv2 representations. Note that these hyperparameters have been introduced and extensively ablated in prior work [grill2020bootstrap, mitrovic2020representation, mitrovic2020less].

Number of negatives.

As mentioned in Section 2, ReLICv2 selects negatives by randomly subsampling the minibatch in order to avoid false negatives. We investigate the effect of changing the number of negatives in table 16. We can see that the best performance can be achieved with relatively low numbers of negatives, i.e. just 10 negatives. Furthermore, we see that using the whole batch as negatives yields one of the lowest performances.
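A minimal sketch of this subsampling (NumPy; names are ours): for every anchor in the minibatch we draw a small random subset of the remaining batch elements to act as negatives.

import numpy as np

def sample_negative_indices(batch_size, num_negatives, rng=np.random):
    # For each anchor i, pick num_negatives indices uniformly at random
    # from the other batch elements (no self-comparisons, no replacement).
    negatives = np.empty((batch_size, num_negatives), dtype=np.int64)
    for i in range(batch_size):
        candidates = np.delete(np.arange(batch_size), i)
        negatives[i] = rng.choice(candidates, size=num_negatives, replace=False)
    return negatives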

In further experiments, we observed that for longer pretraining (e.g. epochs) there is even less variation in performance across the number of negatives than for the shorter pretraining reported here, where the variation is itself already quite low.

Number of negatives Top-1 Top-5
  1 75.1 92.4
  5 75.2 92.6
  10 75.4 92.5
  20 75.3 92.7
  50 75.5 92.5
  100 75.4 92.5
  500 75.1 92.4
  1000 75.3 92.6
  2000 75.4 92.5
  4096 75.2 92.6
Table 16: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder for different numbers of randomly selected negatives. All settings are trained for epochs.
Target EMA.

ReLICv2 uses a target network whose weights are an exponential moving average (EMA) of the online encoder network, which is trained normally using stochastic gradient descent; this setup was first introduced in [grill2020bootstrap] and subsequently used in [mitrovic2020representation] among others. The target network weights at iteration t are updated as ξ_t = τ ξ_{t-1} + (1 − τ) θ_t, where τ is the EMA parameter which controls the stability of the target network (τ = 0 sets ξ_t = θ_t), θ_t are the parameters of the online encoder at iteration t, and ξ_t are the parameters of the target encoder at iteration t. As can be seen from table 17, a wide range of decay rates yields similar top-1 accuracy on the ImageNet validation set after pretraining for 300 epochs, indicating that ReLICv2 is robust to the choice of τ in that range. For values of τ close to 1 the performance quickly degrades, indicating that the target network is updated too slowly. Note that, contrary to [grill2020bootstrap], where top-1 accuracy drops much lower for τ = 1, ReLICv2 is significantly more robust to this setting, achieving roughly double that accuracy.
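In code, this update is the usual exponential moving average; a minimal sketch (parameters are represented here as plain dicts of arrays rather than the actual network state):

def ema_update(target_params, online_params, tau):
    # tau = 0 copies the online network into the target at every step;
    # tau = 1 never updates the target network.
    return {name: tau * target_params[name] + (1.0 - tau) * online_params[name]
            for name in target_params}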

τ Top-1 Top-5
  0 73.5 91.5
  0.9 74.6 92.2
  0.99 75.5 92.6
  0.993 75.4 92.5
  0.996 74.4 92.0
  0.999 70.5 89.8
  1.0 39.6 63.6
Table 17: Top-1 and top-5 accuracy (in %) under linear evaluation on the ImageNet validation set for a ResNet50 (1x) encoder for different settings of the target exponential moving average (EMA) parameter τ. All settings are trained for 300 epochs.

Appendix E Comparison between self-supervised methods

As a companion to section 5, table 18 provides a detailed comparison of how prominent representation learning methods utilize positive and negative examples and how they incorporate explicit contrastive and invariance losses. The positive views considered are the standard set of SimCLR augmentations [chen2020simple], a scheme which selects nearest neighbours of the anchor point, multicrop augmentations (c.f. [caron2020unsupervised]), prototypes computed via an explicit clustering step (c.f. [caron2020unsupervised]), and a scheme which computes saliency masks of the anchor point and removes backgrounds as described in section 2. Note that SwAV first computes a clustering of the batch and then contrasts the embedding of the point and its nearest cluster centroid against the remaining cluster centroids; invariance is implicitly enforced in the clustering step.

Method Contrastive Invariance Positives Negatives
SimCLR [chen2020simple] full batch
BYOL [grill2020bootstrap] n/a
NNCLR [dwibedi2021little] , full batch
MoCo [he2019momentum] queue
SwAV [caron2020unsupervised] , ,
Debiased [chuang2020debiased] importance sample
Hard Negatives [robinson2020contrastive] importance sample
ReLICv1 [mitrovic2020representation] subsample
ReLICv2 (ours) , , subsample
Table 18: The role of positives and negatives in recent unsupervised representation learning algorithms.