Self-supervised Pre-training with Hard Examples Improves Visual Representations

Self-supervised pre-training (SSP) employs random image transformations to generate training data for visual representation learning. In this paper, we first present a modeling framework that unifies existing SSP methods as learning to predict pseudo-labels. Then, we propose new data augmentation methods of generating training examples whose pseudo-labels are harder to predict than those generated via random image transformations. Specifically, we use adversarial training and CutMix to create hard examples (HEXA) to be used as augmented views for MoCo-v2 and DeepCluster-v2, leading to two variants HEXA_MoCo and HEXA_DCluster, respectively. In our experiments, we pre-train models on ImageNet and evaluate them on multiple public benchmarks. Our evaluation shows that the two new algorithm variants outperform their original counterparts, and achieve new state-of-the-art on a wide range of tasks where limited task supervision is available for fine-tuning. These results verify that hard examples are instrumental in improving the generalization of the pre-trained models.



1 Introduction

Self-supervised visual representation learning aims to learn image features from raw pixels without relying on manual supervision. Recent results show that self-supervised pre-training (SSP) outperforms state-of-the-art (SoTA) fully-supervised pre-training methods [22, 4], and is becoming a building block in many computer vision applications. The pre-trained model produces general-purpose features and serves as the backbone of various downstream tasks such as classification, detection and segmentation, improving the generalization of task-specific models that are often trained on limited amounts of task labels.

Most state-of-the-art SSP methods focus on designing novel pretext objectives, ranging from traditional prototype learning [51, 53, 3, 26, 56], to the recently popular contrastive learning [7, 22, 8, 21], and combinations of both [4, 30]. Despite their different objectives, all these methods rely heavily on data augmentation to create different views of an image using image transformations, such as random crop (with flip and resize), color distortion, and Gaussian blur. Recent studies show that SSP performance can be further improved with more aggressively constructed views, e.g., by increasing the number of views [4] or by selecting more distinctive views via minimizing mutual information [45].

However, image transformations are agnostic to the pretext objectives, and it remains unknown how to augment views based on the pre-training tasks themselves, and how different augmentation methods affect the generalization of the learned models. To tailor data augmentation to pre-training tasks, we explicitly formulate SSP as a problem of predicting pseudo-labels, based on which we propose to generate hard examples (Hexa), a family of augmented views whose pseudo-labels are difficult to predict. Specifically, two schemes are considered. Adversarial examples are created with the intention to cause an SSP model to make prediction mistakes and thus improve the generalization of the model [50]. Cut-mixed examples are created by cutting and pasting patches among different images [54], so that their content is a mixture of multiple images.

Our contributions include:

A pseudo-label perspective is formulated to motivate the concept of hard examples in self-supervised learning.

Two novel algorithms are proposed by applying our framework to two distinctly different existing approaches. Experiments are conducted on a wide range of tasks in self-supervised benchmarks, showing that Hexa consistently outperforms its original counterparts and achieves SoTA performance under the same settings. This demonstrates the generality and effectiveness of the proposed framework in constructing hard examples for improving visual representations via SSP.

2 Self-supervision: A Pseudo-label View

Self-supervised learning learns representations by leveraging the weak signals intrinsically existing in images as pseudo-labels, and maximizing agreement between pseudo-labels and the learned representations. This framework comprises the following four major components [7]. A data augmentation t that randomly transforms any given image x, resulting in multiple correlated views of the same example, denoted as v = t(x). A backbone network f parameterized by θ that extracts a feature representation h = f_θ(v) from an augmented view v. A projection head g parameterized by φ that maps the feature representation to a latent representation z = g_φ(h), on which the self-supervised loss is applied. A self-supervised loss function that aims to predict the pseudo-label y based on z. Different self-supervised learning methods differ in their exploited weak signals, based on which different kinds of pseudo-labels are constructed. We cast a broad family of SSP methods as a pseudo-label classification task of predicting y from the latent representation via a linear classifier, where W are the classifier parameters; after training, f_θ provides generic visual representations. Following this point of view, we revisit two types of methods.

Type I: Contrastive Learning.

Contrastive learning is a framework that learns representations by maximizing agreement between differently augmented views of the same image via a contrastive loss in the latent space. For a given query q, we identify its positive sample from a set of keys {k_0, k_1, ...}, where the positive sample is indexed by p and negative samples are indexed by n ≠ p. The pseudo-labels in contrastive learning are defined by feature pairwise comparisons: y = 1 for the pair (q, k_p) and y = 0 for a pair (q, k_n). For a query with K+1 pairs, its pseudo-label vector is the one-hot vector y = (y_0, ..., y_K) with y_p = 1 and y_n = 0, and the pseudo-label classification loss is

    L = − Σ_{i=0}^{K} y_i · log [ exp(q·k_i / τ) / Σ_{j=0}^{K} exp(q·k_j / τ) ].    (1)

The contrastive prediction task can be formulated as a dictionary look-up problem. By mapping a view v into a latent representation q = g_φ(f_θ(v)) using the function composition, an effective contrastive loss function, called InfoNCE, can be derived as:

    L = − log [ exp(q·k_p / τ) / Σ_{j=0}^{K} exp(q·k_j / τ) ],    (2)

where {θ, φ} denotes the set of trainable parameters and τ is a temperature hyper-parameter. From (1) to (2), only the loss term indexed with p remains, while the ones indexed with n are excluded, because their corresponding pseudo-labels y_n = 0.
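As a concrete illustration, the InfoNCE loss in (2) can be sketched in a few lines of PyTorch. This is a minimal sketch, not the authors' released code; the function name `info_nce_loss`, the tensor shapes, and the default temperature are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, tau=0.2):
    """InfoNCE loss of Eq. (2) for a batch of queries.

    q:      (B, D) query embeddings
    k_pos:  (B, D) positive-key embeddings (same image, different view)
    queue:  (K, D) negative keys from the memory bank
    """
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    queue = F.normalize(queue, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)      # (B, 1) positive logits
    l_neg = q @ queue.t()                             # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau   # (B, 1 + K)
    # the pseudo-label vector is one-hot: the positive sits at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Cross-entropy against the one-hot pseudo-label recovers exactly the dictionary look-up view of (1)-(2): only the positive pair contributes to the numerator.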

In the instance discrimination pretext task (used by MoCo and SimCLR), a query and a key form a positive pair if they are data-augmented versions of the same image, and otherwise form a negative pair. The contrastive loss (2) can be minimized by various mechanisms that differ in how the keys (or negative samples) are maintained [8].

  • SimCLR [7]  The negative keys are from the same batch and updated end-to-end by back-propagation. SimCLR is based on this mechanism and requires a large batch to provide a large set of negatives.

  • MoCo [22, 8]  In the MoCo mechanism, the negative keys are maintained in a queue, and only the queries and positive keys are encoded in each training batch. A momentum encoder is adopted to improve the representation consistency between the current and earlier keys. MoCo decouples the batch size from the number of negatives. MoCo-v2 [8] is an improved version that uses stronger augmentation (i.e., more aggressive image transformations) and the MLP projection head proposed in SimCLR.
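The MoCo-style momentum update and queue maintenance described above can be sketched as follows. This is our own minimal sketch; the helper names and the circular-buffer pointer convention are illustrative assumptions, not MoCo's released implementation.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Momentum encoder update: theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    """Overwrite the oldest entries of the (K, D) queue with new keys and
    return the advanced pointer (circular buffer)."""
    K = queue.size(0)
    idx = torch.arange(ptr, ptr + keys.size(0)) % K
    queue[idx] = keys
    return (ptr + keys.size(0)) % K
```

Because only the query encoder receives gradients, the key encoder changes slowly, which keeps the queued keys consistent with the current keys.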

Type II: Prototype Learning.

The prototype learning methods [3, 30, 4] introduce a “prototype” as the centroid for a cluster formed by similar image views. The latent representations are fed into a clustering algorithm to produce the prototype/cluster assignments, which are subsequently used as “pseudo-labels” to supervise representation learning.

DeepCluster is a representative prototype learning work. It employs k-means as the clustering algorithm, which takes a set of latent vectors {z} as input, clusters them into K distinct groups with prototypes C = {c_1, ..., c_K}, and simultaneously outputs the optimal cluster assignment y* for each sample as a one-hot probability simplex. The model is trained to predict the optimal assignment:

    L = − Σ_{k=1}^{K} y*_k · log [ exp(z·c_k / τ) / Σ_{k'=1}^{K} exp(z·c_{k'} / τ) ]    (3)
      = − log [ exp(z·c_s / τ) / Σ_{k'=1}^{K} exp(z·c_{k'} / τ) ],    (4)

where c_k is the k-th prototype/cluster centroid, and s is the index of the assigned cluster for z. DeepCluster alternates between two steps: feature clustering using k-means and feature learning by predicting these pseudo-labels. Prototype learning addresses a limitation of contrastive instance discrimination by allowing views with similar semantics but from different source images to be pulled together.
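In code, predicting the cluster assignment reduces to a softmax over prototype similarities. A minimal sketch follows; the cosine normalization, temperature, and function name are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def prototype_loss(z, prototypes, assignments, tau=0.1):
    """Cross-entropy of cluster assignments, in the spirit of Eq. (4).

    z:           (B, D) latent representations
    prototypes:  (K, D) cluster centroids from k-means
    assignments: (B,)   index of the assigned cluster for each sample
    """
    z = F.normalize(z, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = z @ prototypes.t() / tau   # (B, K) similarity to each centroid
    return F.cross_entropy(logits, assignments)
```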

Note that (2) and (4) represent the traditional view of self-supervised learning formulations, while (1) and (3) are our derived pseudo-label view, where the pseudo-label y is explicitly involved in the learning objectives. This opens opportunities to study new data augmentations based on y, to improve the robustness and generalization of the learned representations.

(a) Image transformations vs. hard examples (b) Augmented view space
Figure 1: Illustration of Hexa. (a) Hard examples: for the original dog image, existing SSP methods employ random transformations to generate an augmented example v; we propose two types of hard examples. Adversarial examples add perturbations to v, and cut-mixed examples cut and paste patches between different views. (b) A visualization of the augmented view space. Each circle indicates an augmented example v. An adversarial example fools the model into making a prediction mistake, and a cut-mixed example is created between two standard augmentations.

3 Pre-training with Hard Examples

Augmented views play a vital role in SSP. Most existing methods synthesize views through random image transformations, without carefully considering their difficulty for the self-supervised learning task: predicting pseudo-labels. By contrast, we focus on hard examples, defined as augmented views whose pseudo-labels are difficult to predict. Specifically, we consider two schemes: adversarial examples and cut-mixed examples. Figure 1 illustrates how hard examples are constructed from image transformations; we detail the derivations below.

3.1 Adversarial Examples

Adversarial robustness refers to a model's invariance to small (often imperceptible) perturbations of its inputs (i.e., clean examples). Adversarial examples are produced by adding perturbations to clean examples that maximally fool the predictions of a trained model. In self-supervised learning, we propose to add perturbations to the augmented views to fool the prediction of their pseudo-labels.

Adversarial Contrastive Learning.

For the instance contrastive discrimination methods, we focus on the MoCo algorithm. Specifically, we propose to generate adversarial examples for the query q only. Since both the key k and the pseudo-label y are fixed, it is feasible to compute the gradient on the query q, leading to the adversarial training objective:

    min_θ E [ L(q, k, y) + α · max_{‖δ‖ ≤ ε} L(q + δ, k, y) ],    (5)

where α is a hyper-parameter governing how invariant the resulting model should be to adversarial attacks, and δ is the perturbation. In practice, (5) is optimized in two steps. By applying Projected Gradient Descent (PGD) [19, 2], we obtain adversarial examples on-the-fly:

    δ ← Π_{‖δ‖ ≤ ε} ( δ + η · sign( ∇_δ L(q + δ, k, y) ) ),    (6)

where η is the step size for PGD and Π projects the perturbation back onto the ε-ball. The generated q_adv = q + δ can be viewed as a new data augmentation of the query q, and we then feed q_adv into the model to update the parameters. Note that q_adv differs from traditional random augmentations in that it takes into account the relationship between the positive example and all negative examples within the memory bank, and tends to be a "harder" query than q for the dictionary look-up problem. Adversarial examples for SimCLR are easier to construct, and can be viewed as a special case where all negative examples come from the current batch rather than from a memory bank.
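The PGD update (6) can be sketched generically: given a closure that maps a perturbed input to the self-supervised loss, we ascend the sign of its gradient and project back onto the ε-ball. The function name and default values below are illustrative assumptions, not the paper's released code.

```python
import torch

def pgd_perturb(x, loss_fn, eps=8 / 255, step=2 / 255, n_steps=1):
    """Generate an adversarial view of x by maximizing loss_fn, as in Eq. (6).

    loss_fn: maps an image batch to a scalar self-supervised loss, e.g. the
             InfoNCE loss of the query encoded from the perturbed input.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_steps):
        loss = loss_fn(x + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()   # gradient-ascent step on the loss
            delta.clamp_(-eps, eps)       # project onto the L_inf epsilon-ball
    return (x + delta).detach()
```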

Adversarial Prototype Learning.

Adversarial training for prototype-based methods is similar to the supervised setting, once the cluster assignments are learned. We treat these pseudo-labels y* as targets to fool the model:

    min_θ E [ L(v, y*) + α · max_{‖δ‖ ≤ ε} L(v + δ, y*) ],    (7)

where L is the cross-entropy loss in (3). Similarly, (7) is also optimized in two steps: adversarial example generation, followed by a model update. The adversarial example v_adv = v + δ is "harder" than v to be correctly aligned to its cluster.


It is shown in AdvProp [50] that clean examples and adversarial examples tend to have different batch statistics, due to the salient divergence between their empirical distributions. Thus, we adopt the AdvProp training scheme, where two separate sets of batch normalization (BN) [25] parameters are maintained, summarizing the statistics for clean examples and for adversarial examples, respectively. In Figure 1, we use the dog image as input and visualize the perturbations as noisy grey maps, which are added to the augmented views. Though the adversarial examples look visually indistinguishable from the clean ones, their corresponding pseudo-labels can change significantly, depending on how far they move across the decision boundary (i.e., how many PGD steps are applied). We study the impact of the PGD hyper-parameters in the Appendix, and choose a small number of PGD steps for computational efficiency, with the perturbation threshold ε and step size η set accordingly.
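The dual-BN scheme can be sketched as a small drop-in module that routes clean and adversarial batches through separate statistics. This is a simplified sketch of the AdvProp idea; the class name and the way it would be wired into a network are our own assumptions.

```python
import torch
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """Two sets of BN parameters/statistics: one branch summarizes clean
    examples, the other adversarial examples, as in the AdvProp scheme."""

    def __init__(self, num_features):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(num_features)
        self.bn_adv = nn.BatchNorm2d(num_features)

    def forward(self, x, adversarial=False):
        # route the batch through the BN branch matching its distribution
        return self.bn_adv(x) if adversarial else self.bn_clean(x)
```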

1: Initialize network parameters θ_q for the query encoder and θ_k for the key encoder; random image transformations t and t'; a queue Q for the memory bank; momentum decay coefficient m.
2: for a number of training iterations do
3:      Sample a batch of images x from the full dataset;
4:      Clean query q = t(x) and key k = t'(x);
5:      Forward q and k;
6:      Synthesize the adversarial query q_adv using (6);
7:      Synthesize the cut-mixed examples (q_cmx, y_cmx) using (8);
8:      Forward q, q_adv and q_cmx using (10);
9:      Compute the loss (10) and update θ_q;
10:      θ_k ← m θ_k + (1 − m) θ_q;
11:      Enqueue k into Q and dequeue the oldest elements;
12: end for
Algorithm 1 Hexa_MoCo

3.2 Cut-Mixed Examples

Cutmix [54] is a recent image augmentation technique for supervised learning: patches are cut and pasted among images to create a new example, and the ground-truth labels are mixed proportionally to the area of the patches. Specifically, for two randomly selected images, we consider augmented views v_A and v_B, where y_A and y_B are the corresponding pseudo-labels (i.e., instance identity in contrastive learning or cluster index in prototype learning). The cut-mixed example and its pseudo-label are generated using the combining operation:

    v_cmx = M ⊙ v_A + (1 − M) ⊙ v_B,    y_cmx = λ y_A + (1 − λ) y_B,    (8)

where M ∈ {0, 1}^{W×H} (with image width W and height H) denotes a binary mask indicating where to drop out and fill in from the two images, 1 is a binary mask filled with ones, ⊙ is element-wise multiplication, and λ is the combination ratio between the two views. Following [54], λ is initially sampled from a beta distribution Beta(a, a), and finally set as the area percentage that view v_A occupies in v_cmx. The beta distribution controls how much the two views are mixed. We empirically study this hyper-parameter in the Appendix and fix a single value in our experiments.

Since v_cmx has mixed contents from two different source images, it tends to be a hard example for predicting either of its labels. The added patches further enhance localization ability by requiring the model to identify an object from a partial view. To train the model, an objective can be written with the standard loss function as:

    L_cmx = λ L(v_cmx, y_A) + (1 − λ) L(v_cmx, y_B).    (9)

1: Initialize network parameters θ; random image transformations t; initialize a set of prototypes C and compute the initial assignments y*.
2: for a number of training epochs do
3:      for a number of training iterations do
4:           Sample an image batch x from the full dataset;
5:           Augmented views v = t(x);
6:           Forward v;
7:           Synthesize the adversarial examples v_adv using (7);
8:           Synthesize the cut-mixed examples (v_cmx, y_cmx) using (8);
9:           Forward v, v_adv and v_cmx using (10);
10:           Compute the loss (10) and update θ;
11:      end for
12:      Collect the latent representations z from the inner loop above;
13:      Solve k-means to update C and the assignment y* for each image;
14: end for
Algorithm 2 Hexa_DCluster


In each training iteration, we treat the images in the original batch as v_A, randomly permute the images in the batch to create v_B, and generate cut-mixed samples by combining examples from the two batches at the same index, according to (8). For MoCo, we perform cutmix on queries and leave keys unchanged. When multiple crops are considered for each image in DeepCluster, the same permutation index is shared among the crops. In Figure 1, transformations of the dog image yield v_A, and transformations of the cat image yield v_B. The cut-mixed examples are the "dog-cat" images shown on the right side of Figure 1(a). One may imagine that v_cmx often lies near the decision boundary, depending on how much content is mixed from each source.
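The batch-permutation construction of (8) can be sketched as follows; `cutmix_batch`, its defaults, and the patch-sampling details are our own illustrative choices in the spirit of [54], not the paper's released code.

```python
import torch

def cutmix_batch(x, labels, alpha=1.0):
    """Cut-mix a batch with a permuted copy of itself, as in Eq. (8).

    Returns the mixed images, the two label sets being mixed, and lam,
    the area fraction kept from the original batch.
    """
    B, _, H, W = x.shape
    perm = torch.randperm(B)                              # pairs v_A with v_B
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut_h = int(H * (1.0 - lam) ** 0.5)                   # patch height
    cut_w = int(W * (1.0 - lam) ** 0.5)                   # patch width
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    x_mix = x.clone()
    x_mix[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]  # paste the patch
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)           # actual area ratio
    return x_mix, labels, labels[perm], lam
```

Recomputing lam from the realized patch area (last line) matches the paper's convention that λ is the area percentage that v_A occupies in the mixed view.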

3.3 Full Hexa Objective

The overall self-adversarial training objective considers both clean and hard examples constructed by adversarial and cutmix augmentations:

    L_Hexa = L_std + α_adv · L_adv + α_cmx · L_cmx,    (10)

where α_adv and α_cmx are weighting hyper-parameters that control the effect of adversarial examples and cut-mixed examples, respectively. In our experiments, we set α_adv and/or α_cmx to fixed non-zero values. Note that α_adv = α_cmx = 0 reduces the objective to the standard self-supervised training algorithms. Concretely, we consider two novel algorithms:

  • Hexa_MoCo  Plugging (2), (8) and (5) into (10) yields the full self-adversarial contrastive learning objective. The Hexa_MoCo training procedure is detailed in Algorithm 1. We build Hexa_MoCo on top of MoCo-v2; the two algorithms differ in the steps where hard examples are computed on the query and subsequently employed in the model update.

  • Hexa_DCluster  The full self-adversarial prototype learning objective is obtained by plugging (4), (8) and (7) into (10). We build Hexa_DCluster on DeepCluster-v2 [30], which improves DeepCluster [3] to reach performance comparable with recent state-of-the-art methods. The Hexa_DCluster training procedure is detailed in Algorithm 2; it differs from DeepCluster-v2 in the steps where hard examples are computed to train the network in conjunction with clean examples.
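In code, the full objective (10) is simply a weighted sum of the clean and hard-example losses. The function and parameter names below are our own; setting both weights to zero recovers the baseline objective, as noted above.

```python
def hexa_objective(loss_std, loss_adv, loss_cmx, alpha_adv=1.0, alpha_cmx=1.0):
    """Full Hexa objective of Eq. (10): clean loss plus weighted losses on
    adversarial and cut-mixed examples."""
    return loss_std + alpha_adv * loss_adv + alpha_cmx * loss_cmx
```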

4 Related Works

4.1 Self-supervised Pre-training

Pretext task taxonomy.

Self-supervised learning is a popular form of unsupervised learning, where labels annotated by humans are replaced by “pseudo-labels” directly extracted from the raw input data by leveraging its intrinsic structures. We broadly categorize existing self-supervised learning methods into three classes:

Handcrafted pretext tasks. These include many traditional self-supervised methods such as relative position of patches [12, 37], masked pixel/patch prediction [39, 46], auto-regressive modeling [5], rotation prediction [18], image colorization [58, 29], cross-channel prediction [59] and generative modeling [40, 13]. These approaches typically exploit domain knowledge to carefully design a pretext task, with the learned features often focusing on one particular aspect of images, leading to limited transfer ability. Contrastive learning. An instance-level classification task is considered [14, 60], where each image in a dataset is treated as a unique class, and various augmented views of an image are the examples to be classified. Some recent works in this line are CPC [38], Deep InfoMax [24, 1], MoCo [22], SimCLR [7], and BYOL [21]. Prototype learning. Clustering is employed for deep representation learning, including DeepCluster [3], SwAV [4] and PCL [30], among many others [51, 53, 26, 56]. The proposed Hexa can in principle be applied to all three classes, as long as the notion of pseudo-labels exists. In this paper, we focus on the latter two classes, as they have shown SoTA representation learning performance, surpassing the ImageNet-supervised counterpart on multiple downstream vision tasks.

The role of augmentations.

Image data augmentations/transformations such as cropping and blurring play a crucial role in modern self-supervised learning pipelines. It has been empirically shown that visual representations can be improved by employing stronger image transformations [8] and by increasing the number of augmented views of an image [4]. InfoMin [45] studied the principles of good views for contrastive learning and suggested selecting views with less mutual information. By definition, adversarial and cut-mixed examples tend to be harder than transformation-augmented examples for self-supervised problems, and are complementary to the above techniques.

4.2 Hard Examples


A vast majority of works view adversarial examples as a threat to models [19, 33], and suggest that training with adversarial examples leads to an accuracy drop on clean data [41, 34]. Adversarial training has been studied for self-supervised pre-training [6]. Our work differs significantly from Chen et al. [6] in two aspects. Motivations: we aim to use adversarial examples to boost standard recognition accuracy on large-scale datasets such as ImageNet, while Chen et al. [6] mainly study model robustness on small datasets such as CIFAR-10. Algorithms: we focus on modern contrastive/prototype learning methods (the last two categories of SSP methods in Section 4.1), while Chen et al. [6] work on traditional handcrafted SSP methods (the first category).

Improved standard accuracy.

Hard examples have been shown to be effective in improving recognition accuracy in supervised learning settings. For adversarial examples, one early attempt is virtual adversarial training (VAT) [36], a regularization method that improves semi-supervised learning tasks. This success was recently extended to natural language processing [47, 9, 32] and vision-and-language tasks [17]. In computer vision, AdvProp [50] is a recent work showing that adversarial examples improve recognition accuracy on ImageNet in supervised settings. Salman et al. further show that adversarially robust ImageNet models transfer better [43]. Cut-mixed examples were first studied by Yun et al. [54]; similar augmentation schemes using a mixture of images include mixup [57] and cutout [11]. All of the above hard examples are constructed in supervised settings; thanks to the proposed pseudo-label formulation, Hexa is the first work to systematically study hard examples in large-scale self-supervised settings. We confirm that hard examples improve the model's transfer ability.

5 Experimental Results

All of our unsupervised pre-training (learning the encoder network without labels) is done on the ImageNet ILSVRC-2012 dataset [10]. We implement Hexa_MoCo based on the pre-training schedule of MoCo-v2, and Hexa_DCluster based on the pre-training schedule of DeepCluster-v2. Both use a cosine learning rate schedule and an MLP projection head. Due to limited computational resources, all experiments are conducted with ResNet-50 and pre-trained for 200/800 epochs unless otherwise mentioned. Once the model is pre-trained, we follow the same fine-tuning protocols/schedules as the baseline methods [22, 4]. Following common practice in evaluating pre-trained visual representations, we test the model's transfer learning ability on a wide range of datasets/tasks in the self-supervised learning benchmark [20], based on the principle that a good representation should transfer with limited supervision and limited fine-tuning.

5.1 On the impact of different hard examples

To understand different design choices in our framework, we compare different schemes of adding hard examples into SSP. Std baseline: only standard random transformations are used. Std + Adv: adversarial examples are added to Std. Std + Cmx: cut-mixed examples are computed on the standard views and added. Std + Adv + Cmx: both types of hard examples are added. Std + Adv + Cmx_adv: as an ablation, we compute cut-mixed examples on the adversarial views, denoted as Cmx_adv. Std + Adv + Cmx + Cmx_adv: all types of hard examples are added.

We conduct the comparison experiments with a small number of pre-training steps on ImageNet. For Hexa_MoCo, we pre-train for 20 epochs. For Hexa_DCluster, we pre-train for 5 epochs, but with 6 crops per image: 2 crops at resolution 160 and 4 crops at resolution 96. The last checkpoint is used to extract features, on which a linear classifier is trained for 1 epoch on ImageNet. The results are reported in Figure 2. Interestingly, cut-mixed examples computed on the standard views are more effective than those computed on the adversarial views. This is expected: since the pseudo-label of an adversarial view should differ from that of the corresponding clean view, the mixed label cannot reflect the ground-truth content in that case. Further, both adversarial and cut-mixed examples improve the baseline method, regardless of whether they are added separately or simultaneously, showing the effectiveness of the proposed methods.

(a) Hexa_MoCo (b) Hexa_DCluster
Figure 2: Impact of different hard-example combination schemes in Hexa.

In what follows, we denote by Hexa_MoCo and Hexa_DCluster two variants that are both constructed with 2 random crops at resolution 224 and adversarial examples. More specifically, Hexa_MoCo follows MoCo-v2: one crop for the query and the other for the key; Hexa_DCluster is always compared with the DeepCluster-v2 variant with 2 crops. The current SoTA method is SwAV [4], which employs 8 random crops: 2 crops at resolution 224 and 6 crops at resolution 96. To compare with SoTA, we also increase the number of crops to 8 and consider two variants: Hexa(8-crop) with adversarial examples only, and Hexa(8-crop) with both adversarial and cut-mixed examples. All 2-crop methods use a mini-batch size of B=256, and 8-crop methods use a mini-batch size of B=4096.

5.2 Linear classification

To evaluate the learned representations, we first follow the widely used linear evaluation protocol, where a linear classifier is trained on top of the frozen base network, and test accuracy is used as a proxy for representation quality. We follow the previous setup [20] and evaluate such linear classifiers on four datasets: ImageNet [10], PASCAL VOC2007 (VOC07) [15], CIFAR10 (C10) and CIFAR100 (C100) [28]. A softmax classifier is trained for ImageNet/CIFAR, while a linear SVM [16] is trained for VOC07. We report single-crop Top-1 validation accuracy for ImageNet/CIFAR and mAP for VOC07.

Method Epoch   ImageNet VOC07 C10 C100
Supervised - 76.5 87.5 93.6 78.3
Instance D. [49] 200 54.0 - - -
Jigsaw [37] 90 45.7 64.5 - -
BigBiGAN [13] - 56.6 - - -
CPC-v2 [23] 200 63.8 - - -
CMC [44] 200 66.2 - - -
SimCLR [7] 200 61.9 - - -
SimCLR [7] 1000 69.3 80.5 90.6 71.6
MoCo [22] 200 60.6 79.2 - -
PIRL [35] 800 63.6 81.1 - -
PCL-v2 [30] 200 67.6 85.4 - -
BYOL [21] 800 74.3 - 91.3 78.4
SwAV (B=256) [4] 200 72.7 87.5 91.8 74.2
SwAV (B=4096) [4] 200 73.9 87.9 92.0 76.0
SwAV (B=4096) [4] 800 75.3 88.1 93.1 77.0
InfoMin [45] 200 70.1 - - -
InfoMin [45] 800 73.0 - - -
MoCo-v2 200 67.5 84.5 89.4 70.1
Hexa 200 68.9 85.0 90.4 71.5
MoCo-v2 800 71.1 86.8 90.6 71.8
Hexa 800 71.7 87.0 91.5 73.2
DeepCluster-v2 200 67.6 85.4 89.6 70.9
Hexa 200 68.1 85.9 90.7 71.5
Hexa(8-crop ) 200 74.0 88.1 92.9 76.6
Hexa(8-crop ) 200 73.4 88.8 91.9 75.2
DeepCluster-v2 800 75.2 87.6 93.2 77.3
Hexa(8-crop ) 800 75.1 88.2 93.5 78.0
Table 1: Linear classification performance of learned representations using ResNet-50. All numbers for baselines are from their corresponding papers or [30], except that we use the released pretrained model for SwAV.
(a) Pre-training (b) Linear classification
Figure 3: Learning curves on ImageNet. (a) For 800 pre-training epochs of contrastive methods, the Top-1 accuracy is measured for checkpoints at every 200 epochs. (b) Training a linear classifier on the last checkpoints produced by prototype methods for 100 epochs.

Table 1 shows the results of linear classification. It is interesting to observe that DeepCluster-v2 is slightly better than MoCo-v2, indicating that traditional prototype methods can be on par with the popular contrastive methods under the same pre-training epochs and data augmentation strategies. We hope this result can inspire future research to more carefully compare different pretext objectives. Further, the Hexa variants consistently outperform their counterparts for both contrastive and prototype methods, demonstrating that the proposed hard examples effectively improve the learned visual representations in SSP.

We also pre-train Hexa_MoCo for 800 epochs, the longer schedule used in MoCo-v2 [8]. The learning curves are compared in Figure 3(a). Hexa_MoCo is consistently better than MoCo-v2, and the gap is larger at the beginning. We hypothesize that the augmentation space is explored more efficiently with hard examples than with traditional image transformations, but this advantage translates into smaller accuracy gains as the augmentation space gradually becomes fully occupied towards the end of training. When comparing with SoTA methods equipped with multi-crop [4], we see that Hexa(8-crop) achieves slightly better accuracy than SwAV on ImageNet, and even outperforms InfoMin pre-trained for 800 epochs. By plotting the training curves of their linear classifiers in Figure 3(b), we observe that Hexa(8-crop) clearly outperforms SwAV with limited fine-tuning (e.g., 20 epochs of training). The advantage of Hexa over SwAV is more significant with limited supervision, as seen from the larger performance gap on VOC07 in Table 1.

Method Epoch 2 4 8 16 32
Supervised - 67.8 73.9 79.6 82.3 83.8
Jigsaw [37] 200 31.1 40.0 46.7 51.8 -
SimCLR [7] 200 43.1 52.5 61.0 67.1 -
MoCo [22] 200 42.0 49.5 60.0 65.9 -
PCL-v2 [30] 200 59.6 66.2 74.5 78.3 -
SwAV (B=4096) [4] 200 53.8 65.0 73.9 78.6 82.3
SwAV (B=4096) [4] 800 54.6 64.6 73.4 79.0 82.5
MoCo-v2 [8] 200 56.4 67.2 72.0 77.2 79.5
Hexa 200 57.0 68.5 73.1 78.0 80.1
MoCo-v2 [8] 800 60.6 72.1 77.1 80.9 82.8
Hexa 800 61.5 72.8 77.5 81.5 83.0
DeepCluster-v2 [3, 4] 200 57.7 66.5 74.1 77.6 80.7
Hexa 200 55.7 65.3 74.1 78.0 81.2
Hexa(8-crop ) 200 55.5 66.2 75.2 79.5 83.1
Hexa(8-crop ) 200 56.9 67.3 76.4 81.1 84.0
DeepCluster-v2 [3, 4] 800 53.5 65.6 73.3 78.9 82.5
Hexa(8-crop ) 800 54.8 65.8 74.2 79.4 83.1
Table 2: Low-shot classification on VOC07 using linear SVMs trained on fixed representations. We vary the number of labeled examples per class and report the mAP across 5 runs. All baseline numbers are from [30] except that we use the released pretrained model for SwAV.

Low-shot classification.

We evaluate the learned representations on image classification tasks with few training samples per category. We follow the setup in Goyal et al. [20] and train linear SVMs on fixed representations on VOC07 for object classification. We vary the number of training samples per class and report the average result across 5 independent runs. The results are shown in Table 2. Hard examples help improve the performance of both contrastive and prototype learning, especially with more labeled samples per class; with very few labels, performance is highly sensitive to the choice of selected samples, rendering the evaluation less stable. Pre-training longer (MoCo-v2 with 800 epochs) helps reduce this issue, and the proposed hard examples can further boost performance. When 32 samples per class are considered, the proposed scheme surpasses the ImageNet-supervised pre-training approach. To the best of our knowledge, Hexa is the first work to surpass the supervised baseline with such a small number of labelled samples on VOC07, showing the high sample-efficiency of the learned representations. Hexa(8-crop) pre-trained for 200 epochs also outperforms SwAV (pre-trained for both 200 and 800 epochs) in all cases.

1% labels 10% labels 100% labels
Method Epoch Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
Supervised - 25.4 48.4 56.4 80.4 76.5 93.0
Pseudolabels [55] - - 51.6 - 82.4 - -
VAT  [36, 55] - - 47.0 - 83.4 - -
S4L Rotation [55] - - 53.4 - 83.8 - -
UDA [52] - - - 68.8 88.5 - -
FixMatch - - - 71.5 89.1 - -
Instance D. [49] 200 - 39.2 - 77.4 - -
Jigsaw [37] 90 - 45.3 - 79.3 - -
SimCLR [7] 200 - 56.5 - 82.7 - -
SimCLR [7] 1000 48.3 75.5 65.6 87.8 76.5 93.5
MoCo [22] 200 - 56.9 - 83.0 - -
PIRL [35] 800 - 57.2 - 83.8 - -
PCL [30] 200 - 75.3 - 85.6 - -
SwAV (B=256) [4] 200 51.3 76.6 67.8 88.6 75.5 92.9
SwAV (B=4096) [4] 200 52.6 77.7 68.5 89.2 76.3 93.2
SwAV (B=4096) [4] 800 53.9 78.5 70.2 89.9 78.3 94.1
BYOL [21] 800 53.2 78.4 68.8 89.0 77.7 93.9
MoCo-v2 [8] 200 38.9 67.4 61.5 84.6 74.6 92.5
Hexa 200 39.4 67.6 62.3 85.1 74.8 92.4
MoCo-v2 [8] 800 42.3 70.1 63.8 86.2 75.5 92.8
Hexa 800 42.4 70.2 64.1 86.3 75.7 93.0
DeepCluster-v2 [3, 4] 200 46.7 72.9 63.5 86.3 71.9 91.0
Hexa 200 48.9 74.7 64.9 87.3 73.9 92.2
Hexa (8-crop) 200 54.1 78.6 69.3 89.3 76.9 93.6
Hexa (8-crop) 200 54.9 79.3 69.4 89.7 77.2 93.9
DeepCluster-v2 [3, 4] 800 55.6 79.3 70.6 90.2 78.0 94.0
Hexa (8-crop) 800 57.3 80.7 71.8 90.8 78.6 94.4
Table 3: Semi-supervised classification on ImageNet. We use the released pretrained model for MoCo/SwAV. All other numbers are adopted from corresponding papers.

5.3 Semi-supervised learning on ImageNet

We perform semi-supervised learning experiments to evaluate whether the learned representations provide a good basis for fine-tuning. Following the setup of Chen et al. [7], we select a subset (1% or 10%) of the ImageNet training data (the same labelled images as Chen et al. [7]) and fine-tune the entire self-supervised model on these subsets. For the proposed Hexa variants, we fine-tune the models using the same schedule. SwAV with 8 augmentation crops and 200 pre-training epochs is used as a fair baseline.

Table 3 reports the Top-1 and Top-5 accuracy on the ImageNet validation set. Hexa improves over its counterparts MoCo-v2 and DeepCluster-v2 in all cases. Comparing different variants of Hexa, we see that cut-mixed examples are important in boosting performance, especially with 1% labels. Hexa (8-crop) sets a new SoTA under 200 training epochs, outperforming all existing self-supervised learning methods; it even outperforms BYOL pre-trained for 800 epochs in both cases. SwAV pre-trained for 200 epochs is significantly inferior to Hexa in the same setting. SwAV pre-trained for 800 epochs achieves 53.9% Top-1 and 78.5% Top-5 accuracy with 1% labelled images, which is lower than Hexa pre-trained for 200 epochs by a notable margin. This again shows the effectiveness of hard examples in improving visual representations in low-resource settings.

We also fine-tune on 100% of the ImageNet labels for 20 epochs: Hexa reaches 78.6% Top-1 accuracy, outperforming the supervised approach (76.5%) with the same ResNet-50 architecture by a large margin (2.1% absolute accuracy). Hexa also achieves higher performance than all existing self-supervised learning methods in both the 200- and 800-epoch pre-training settings. This shows that hard examples can effectively improve SSP, which can be viewed as a promising approach to further improve standard supervised learning, such as Big Transfer [27], in the future.

Methods Epoch AP AP50 AP75
Supervised - 53.5 81.3 58.8
MoCo-v2 200 57.0 82.4 63.6
Hexa 200 57.1 82.4 63.8
MoCo-v2 800 57.4 82.5 64.0
Hexa 800 57.7 82.8 64.9
Table 4: Object detection results on VOC. The numbers for MoCo-v2 are from [8].

5.4 Object detection

It is standard practice in data-scarce object detection tasks to initialize earlier model layers with the weights from ImageNet-trained networks. We study the benefits of using hard-examples-trained networks to initialize object detection. On the VOC object detection task, a Faster R-CNN detector [42] is fine-tuned end-to-end on the VOC 07+12 trainval set and evaluated on the VOC 07 test set using the COCO suite of metrics [31]. The results are shown in Table 4. We find that Hexa consistently outperforms MoCo-v2, which is pre-trained with standard image transformations.

6 Conclusion

We have presented a comprehensive study of utilizing hard examples to improve visual representations in image self-supervised learning. By treating SSP as a pseudo-label classification task, we introduce a general framework for generating harder augmented views that boost the discriminative power of self-supervised models. Two novel algorithmic variants are proposed: Hexa_MoCo for contrastive learning and Hexa_DCluster for prototype learning. Both variants outperform their counterparts, often by a notable margin, and achieve SoTA under the same settings. Future research directions include incorporating more advanced hard examples under this framework and exploring their performance with larger networks.


  • [1] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, Cited by: §4.1.
  • [2] S. Bubeck (2014) Convex optimization: algorithms and complexity. arXiv preprint arXiv:1405.4980. Cited by: §3.1.
  • [3] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2, 2nd item, §4.1, Table 2, Table 3.
  • [4] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: Appendix B, Appendix B, Table 5, §1, §1, §2, §4.1, §4.1, §5.1, §5.2, Table 1, Table 2, Table 3, §5.
  • [5] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, P. Dhariwal, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, Cited by: §4.1.
  • [6] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang (2020) Adversarial robustness: from self-supervised pre-training to fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. ICML. Cited by: Table 5, §1, 1st item, §2, §4.1, §5.3, Table 1, Table 2, Table 3.
  • [8] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: Appendix B, Appendix B, §1, 2nd item, §2, §4.1, §5.2, Table 2, Table 3, Table 4.
  • [9] Y. Cheng, L. Jiang, and W. Macherey (2019) Robust neural machine translation with doubly adversarial inputs. arXiv preprint arXiv:1906.02443. Cited by: §4.2.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §5.2, §5.
  • [11] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §4.2.
  • [12] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, Cited by: §4.1.
  • [13] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, Cited by: §4.1, Table 1.
  • [14] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox (2015) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.1.
  • [15] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. International journal of computer vision. Cited by: §5.2.
  • [16] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of machine learning research 9 (Aug), pp. 1871–1874. Cited by: Appendix B, §5.2.
  • [17] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu (2020) Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195. Cited by: §4.2.
  • [18] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §4.1.
  • [19] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §3.1, §4.2.
  • [20] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400. Cited by: Appendix B, §5.2, §5.2, §5.
  • [21] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §1, §4.1, Table 1, Table 3.
  • [22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: Appendix B, §1, §1, 2nd item, §4.1, Table 1, Table 2, Table 3, §5.
  • [23] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: Table 1.
  • [24] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §4.1.
  • [25] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
  • [26] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §4.1.
  • [27] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby (2019) Big transfer (BIT): general visual representation learning. arXiv preprint arXiv:1912.11370 6, pp. 2. Cited by: §5.3.
  • [28] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.2.
  • [29] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European conference on computer vision, Cited by: §4.1.
  • [30] J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. Hoi (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966. Cited by: Appendix B, §1, §2, 2nd item, §4.1, Table 1, Table 2, Table 3.
  • [31] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §5.4.
  • [32] X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, and J. Gao (2020) Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994. Cited by: §4.2.
  • [33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §4.2.
  • [34] Y. Min, L. Chen, and A. Karbasi (2020) The curious case of adversarially robust models: more data can help, double descend, or hurt generalization. arXiv preprint arXiv:2002.11080. Cited by: §4.2.
  • [35] I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: Table 1, Table 3.
  • [36] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.2, Table 3.
  • [37] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §4.1, Table 1, Table 2, Table 3.
  • [38] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §4.1.
  • [39] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §4.1.
  • [40] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin (2016) Variational autoencoder for deep learning of images, labels and captions. Advances in neural information processing systems. Cited by: §4.1.
  • [41] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang (2019) Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032. Cited by: §4.2.
  • [42] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, Cited by: §5.4.
  • [43] H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry (2020) Do adversarially robust imagenet models transfer better?. arXiv preprint arXiv:2007.08489. Cited by: §4.2.
  • [44] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: Table 1.
  • [45] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §1, §4.1, Table 1.
  • [46] T. H. Trinh, M. Luong, and Q. V. Le (2019) Selfie: self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940. Cited by: §4.1.
  • [47] D. Wang, C. Gong, and Q. Liu (2019) Improving neural language modeling via adversarial training. arXiv preprint arXiv:1906.03805. Cited by: §4.2.
  • [48] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: Appendix B.
  • [49] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: Table 1, Table 3.
  • [50] C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le (2020) Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.1, §4.2.
  • [51] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International conference on machine learning, Cited by: §1, §4.1.
  • [52] Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: Table 3.
  • [53] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.1.
  • [54] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §3.2, §4.2.
  • [55] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning. In Proceedings of the IEEE international conference on computer vision, pp. 1476–1485. Cited by: Table 3.
  • [56] X. Zhan, J. Xie, Z. Liu, Y. Ong, and C. C. Loy (2020) Online deep clustering for unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6688–6697. Cited by: §1, §4.1.
  • [57] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §4.2.
  • [58] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, Cited by: §4.1.
  • [59] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1.
  • [60] C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6002–6012. Cited by: §4.1.

Appendix A Hyper-parameter Choice

Adversarial examples.

We study two hyper-parameter choices for generating adversarial images: the attack perturbation threshold and the PGD step size. For Hexa, we grid search over a range of values for both. Each variant is pre-trained for 40 epochs; a linear classifier is then added on the pre-trained checkpoint and trained for 100 epochs. The results are shown in Figure 4. Adding too large a perturbation can hurt model performance significantly. Otherwise, the model performs similarly across different ways of adding small perturbations, allowing either a large threshold with a small step size or a large step size with a small threshold. We chose a setting in this stable small-perturbation regime for convenience.

Figure 4: Top-1 accuracy on ImageNet is measured for different adversarial attack settings.
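The projected-gradient (PGD) construction that these two hyper-parameters control can be sketched as follows on a toy differentiable loss. This is a hedged illustration: the toy loss, `grad_fn`, and the specific `eps`/`alpha` values are assumptions, and a real run would backpropagate the pseudo-label loss through the encoder instead.

```python
import numpy as np

def pgd_perturb(x, grad_fn, eps, alpha, n_steps):
    """Sketch of a PGD attack: take ascent steps of size alpha on the loss
    gradient's sign, then project the perturbation back into an L_inf ball
    of radius eps around the clean input x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)        # project to eps-ball
    return x_adv

# toy loss L(x) = 0.5 * ||x||^2 + sum(x), whose gradient is x + 1
x0 = np.zeros(4)
x_adv = pgd_perturb(x0, grad_fn=lambda z: z + 1.0,
                    eps=0.1, alpha=0.05, n_steps=3)
```

The projection step is what makes the threshold and step size partly interchangeable, consistent with the observation above that a large threshold with a small step size behaves like a small threshold with a large step size.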

Cut-mixed examples.

We study the choice of the Beta-distribution hyper-parameter used to generate cut-mixed images. We consider 6 random crops for each image: 2 crops at resolution 160 and 4 crops at resolution 96. The model is pre-trained for 5 epochs, and a linear classifier on the checkpoint is trained for 1 epoch. The results are shown in Figure 5. Cut-mixed examples improve performance across a wide range of settings. We used the best-performing setting in our experiments.

Figure 5: The impact of the hyper-parameter for mixing two images, measured by Top-1 accuracy on ImageNet. (a) The probability density function (PDF) of the Beta distribution. (b) Top-1 accuracy of Hexa checkpoints after 5 pre-training epochs.
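The cut-mix operation studied above, where a mixing ratio is drawn from a Beta distribution and a box of the corresponding area is pasted from one view into another, can be sketched as follows. The box-sampling details follow the common CutMix recipe [54] and are an assumption here rather than a restatement of the paper's exact implementation.

```python
import numpy as np

def cutmix(img_a, img_b, beta_param, rng):
    """Sketch of CutMix: sample lam ~ Beta(beta_param, beta_param), cut a
    box covering roughly (1 - lam) of the area from img_b, and paste it
    into img_a. Images are HxWxC arrays."""
    h, w = img_a.shape[:2]
    lam = rng.beta(beta_param, beta_param)
    cut_ratio = np.sqrt(1.0 - lam)             # box side relative to image
    ch, cw = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = rng.integers(h), rng.integers(w)  # random box centre
    y1, y2 = max(cy - ch // 2, 0), min(cy + ch // 2, h)
    x1, x2 = max(cx - cw // 2, 0), min(cx + cw // 2, w)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    # adjust lam to the exact pasted area, as in the original CutMix
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return mixed, lam

# toy demo: paste a box from an all-ones image into an all-zeros image
rng = np.random.default_rng(0)
a, b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
mixed, lam = cutmix(a, b, beta_param=1.0, rng=rng)
```

The shape of Beta(beta_param, beta_param) in Figure 5(a) governs how often lam is extreme (tiny or near-full boxes) versus balanced mixes.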

Appendix B Experiments details for transfer learning

Linear classification on ImageNet

The main network is fixed, and the global average pooling features (2048-D) of ResNet-50 are extracted. We train the linear classifier for 100 epochs. For Hexa_MoCo, we follow the step-decay training schedule of [22, 8], with an initial learning rate of 30.0 decayed by a factor of 0.1 at fixed milestone epochs. For Hexa_DCluster, we follow the cosine-decay training schedule of [4] with an initial learning rate of 0.3. The logistic regression classifier is trained using SGD with a momentum of 0.9.
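The two schedules above (step decay versus cosine decay) can be sketched as simple per-epoch learning-rate functions; the milestone epochs in the demo below are illustrative assumptions, since the exact values are not restated in this appendix.

```python
import math

def step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Step-decay schedule (as in the linear evaluation following [22, 8]):
    multiply the rate by gamma at each milestone epoch passed."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def cosine_lr(base_lr, epoch, total_epochs):
    """Cosine-decay schedule (as in the linear evaluation following [4]):
    anneal smoothly from base_lr to 0 over the run."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# 100-epoch linear evaluation; [60, 80] are illustrative milestones
lr_step = step_lr(30.0, epoch=70, milestones=[60, 80])
lr_cos = cosine_lr(0.3, epoch=50, total_epochs=100)
```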

Linear classification on VOC07

For training linear SVMs on VOC07, we follow the procedure in [20, 30] and use the LIBLINEAR package [16]. We pre-process all images by resizing to 256 pixels along the shorter side and taking a center crop. The linear SVMs are trained on the global average pooling features of ResNet-50.

Linear classification on Cifar10 and Cifar100

We trained a linear classifier on features extracted from the frozen pre-trained network. We used Adamax to optimize the softmax cross-entropy objective for 20 epochs, with a batch size of 256, a learning rate selected from [0.1, 0.01, 0.001], and decay at epochs [7, 14] with a factor of 0.1. All images were resized and then center-cropped, and we did not apply data augmentation.

Semi-supervised learning on ImageNet

We follow [4] to fine-tune ResNet-50 with pretrained weights on the labelled subset (1% or 10%) of ImageNet. We optimize the model with SGD, using a batch size of 256, a momentum of 0.9, and a weight decay of 0.0005. We apply different learning rates to the ConvNet and the linear classifier: the learning rate for the ConvNet is 0.01, and the learning rate for the classifier is 0.1 (for 10% labels) or 1.0 (for 1% labels). We train for 20 epochs and drop the learning rate by a factor of 0.2 at epochs 12 and 16.
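The two-learning-rate setup above (a small rate for the ConvNet backbone, a large rate for the linear classifier) amounts to per-group SGD updates, which can be sketched as follows. This is a minimal NumPy illustration of the optimizer behaviour with toy stand-in parameter arrays, not the actual training code.

```python
import numpy as np

class GroupSGD:
    """Minimal SGD with momentum supporting per-group learning rates,
    mirroring the backbone-vs-classifier split described above."""
    def __init__(self, groups, momentum=0.9, weight_decay=5e-4):
        # groups: list of dicts {"params": [ndarray, ...], "lr": float}
        self.groups = groups
        self.momentum, self.wd = momentum, weight_decay
        self.buf = [[np.zeros_like(p) for p in g["params"]] for g in groups]

    def step(self, grads):
        # grads mirrors the nested structure of self.groups
        for gi, g in enumerate(self.groups):
            for pi, p in enumerate(g["params"]):
                d = grads[gi][pi] + self.wd * p          # weight decay
                self.buf[gi][pi] = self.momentum * self.buf[gi][pi] + d
                p -= g["lr"] * self.buf[gi][pi]          # in-place update

# toy stand-ins for the backbone and classifier weights (1%-label setting)
backbone_w = np.ones(3)
classifier_w = np.ones(3)
opt = GroupSGD([{"params": [backbone_w], "lr": 0.01},
                {"params": [classifier_w], "lr": 1.0}])
opt.step([[np.ones(3)], [np.ones(3)]])
```

With identical gradients, one step moves the classifier weights 100x further than the backbone weights, which is the intended effect of the split.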

Object detection on VOC

We follow [8] and use the R50-FPN backbone for the Faster R-CNN detector available in the Detectron2 codebase [48]. We freeze all the conv layers and also fix the BatchNorm parameters. The model is optimized with SGD, using a batch size of 8, a momentum of 0.9, and a weight decay of 0.0001. The initial learning rate is set to 0.05. We fine-tune the models for 15 epochs and drop the learning rate by a factor of 0.1 at epoch 12.

Appendix C Experiments on Fine-tuning

We fine-tuned the entire network using the weights of the pre-trained network as initialization. We trained for 20 epochs with a batch size of 256 using Adamax, with the learning rate decayed at epochs [7, 14] by a factor of 0.1, and grid-searched the learning rate over [0.0005, 0.001, 0.01]. The results are shown in Table 5. Our Hexa variants consistently improve over their original counterparts on both datasets.

Method Epoch C10 C100
Supervised [7] - 97.5 86.4
Supervised - 96.7 83.5
Random Init [7] - 95.9 80.2
SimCLR [7] 1000 97.7 85.9
SwAV (B=256) [4] 200 96.6 82.7
SwAV (B=4096) [4] 200 96.4 83.2
MoCo-v2 200 95.6 80.8
Hexa 200 96.1 81.3
MoCo-v2 800 96.1 83.0
Hexa 800 96.5 83.5
DeepCluster-v2 200 96.2 82.1
Hexa 200 97.2 84.4
Hexa (8-crop) 200 97.0 85.5
Hexa (8-crop) 200 96.9 84.9
Table 5: Image classification performance when fine-tuning the entire ResNet-50 network. All baseline numbers are from [7], except that we use the released pretrained model for SwAV; baseline results without a citation are based on our runs using the same training schedules.