Supervised learning with Deep Neural Networks has been the de facto approach for feature learning in Computer Vision over the last decade. Recently, there has been a surge of interest in learning features in an unsupervised manner. This has the advantage of exploiting massive amounts of unlabelled, uncurated data for feature extraction and network pre-training, and is envisaged to surpass the standard approach of transfer learning from ImageNet or other large labelled datasets.
The approach we describe in this paper builds upon the widely used framework of contrastive learning [23, 22, 8, 17, 10, 21, 1], which uses a contrastive loss to maximize the similarity between the representations of two different instances of the same training sample while simultaneously minimizing the similarity with the representations computed from different samples. A key requirement for contrastive learning is the availability of a large number of negative samples for computing the contrastive loss; these are typically stored in a memory bank. Since the memory bank is updated rarely, it is believed to hamper training stability; hence recent methods such as [23, 8, 5] advocate online learning without a memory bank.
For this reason, the method of [23] advocates an online approach: it defines the positive pair from two differently augmented versions of the same training sample and considers as negatives all other pairs from the same batch, eliminating the memory bank. As opposed to [23], we show how to train a powerful network in an unsupervised manner with a memory-bank-based training approach. Momentum Contrast [8] maintains and updates a separate encoder for the negative samples, in a fashion similar to the “mean teacher” [20], rather than storing a memory bank. More recently, SimCLR [5] emphasized the importance of composite augmentations, large batch sizes, bigger models and the use of a nonlinear projection head, and suggested that a large mini-batch can replace a memory bank. In contrast, our approach employs a memory bank for contrastive learning.
Our main contribution is to show how to massively improve the vanilla memory bank approach of [22] by introducing minimal changes. We explore two key ideas: (1) What is the effect of larger batch sizes on contrastive learning with a memory bank? Concurrent work [5] has advocated the use of a large batch size for online training, i.e. without a memory bank, as it increases the number of negative pairs. We show that a large batch size is also effective for contrastive learning with a memory bank (hence decoupling its positive effect from the number of negative pairs), which identifies a connection with gradient smoothing and improved memory bank updates. Furthermore, we show that if a larger mini-batch is constructed so that a set of augmentations is used for each instance, additional consistency between the instance augmentations can be enforced to further enhance training. (2) Is contrastive learning effective when instances are too visually similar? Intuitively, instance discrimination is not meaningful in such cases. We show that if these samples are “merged” in the memory bank, a much more powerful network can be trained.
[...] on CIFAR-10 and of up to [...] on STL-10. Furthermore, with these improvements, our method surpasses [...] and [...] by [...] on CIFAR-10 and by [...] on STL-10, setting a new state of the art for these datasets. Overall, we make the following three contributions:
- We propose a large mini-batch for memory-bank-based contrastive learning, built by drawing, for each sample, a set of augmentations within the same batch. We show that this approach leads to stronger networks and improves the memory bank representation (Section 2.2).
- With a set of augmentations at our disposal, we also propose a simple consistency loss which enforces the logits obtained from different augmentations of the same sample to be close to each other. Notably, this is achieved without trying to enforce discrimination with respect to the negative samples, as proposed by previous approaches (Section 2.3).
- We observe that instance discrimination is not meaningful for samples that are too visually similar. Hence, we propose a hard negative mining approach for improving the memory bank that gradually merges extremely visually similar data samples that were previously forced apart by the instance-level classification loss (Section 2.4).
Given a set of $n$ unlabelled images $\{x_1, \dots, x_n\}$, our goal is to learn a mapping $f_\theta$ from the data to a $d$-dimensional feature embedding $v = f_\theta(x)$. Typically, $f$ is a neural network and $\theta$ its parameters. Throughout the paper we will simply refer to the feature embedding of the $i$-th sample as $v_i$ and assume that $\|v_i\|_2 = 1$. Following [7, 22], our pretext task consists of distinguishing the $i$-th instance from the rest of the samples present in the dataset (i.e. each data sample is treated as a separate class). The training objective is thus formulated as minimizing the negative log-likelihood over all instances of the training set:
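As a sketch of this objective, the non-parametric softmax form of [22] can be written as below. The symbols are notational assumptions and not necessarily those of the original equation: $n$ is the number of training instances, $f_\theta(x_i)$ the unit-norm embedding of the $i$-th sample, $v_j$ the entries read from the memory bank, and $\tau$ a temperature.

```latex
J(\theta) = -\sum_{i=1}^{n} \log P\left(i \mid f_\theta(x_i)\right),
\qquad
P(i \mid v) = \frac{\exp\left(v_i^{\top} v / \tau\right)}{\sum_{j=1}^{n} \exp\left(v_j^{\top} v / \tau\right)}
```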
2.2 Large mini-batch with multiple augmentations
In online contrastive learning (where no memory bank is used), a large mini-batch is motivated by the larger number of negative samples it provides. With a memory bank, however, the number of negative samples is fixed and independent of the batch size.
We observe that, in the memory-bank case, a large mini-batch is useful because it results in more frequent updates of any given feature inside the memory bank. For example, if the batch size is doubled, then each feature $v_i$ is updated twice as frequently. As already mentioned in [22], a memory-bank approach comes at the cost of large oscillations during training, due to inconsistencies caused by updating the feature representations of different samples at very different time instances. Hence, the more frequent updates of the memory bank offered by a larger mini-batch can help stabilize training. We consider increasing the batch size by an expansion factor of $k$. There are two ways to achieve this. The standard way is simply to increase the number of samples at each iteration; in this case, all samples are different from each other. Table 2 shows the results obtained by training a network with contrastive learning for [...] on CIFAR-10 using the kNN evaluation protocol. Clearly, a large batch size results in much higher accuracy, showcasing its benefit in contrastive learning.
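As an illustration of the kNN evaluation protocol mentioned above, the sketch below classifies each test embedding by a vote among its nearest training embeddings. It is a minimal sketch and not the authors' implementation: the function names are hypothetical, features are assumed to be L2-normalized (so a dot product equals the cosine similarity), and a plain majority vote is assumed for brevity.

```python
from collections import Counter


def knn_predict(query, bank_features, bank_labels, num_neighbors):
    """Predict a label for `query` by majority vote among its nearest neighbours.

    Assumes all feature vectors are L2-normalized, so the dot product is the
    cosine similarity. Plain majority voting is an assumption made for brevity.
    """
    similarities = [
        (sum(q * f for q, f in zip(query, feature)), index)
        for index, feature in enumerate(bank_features)
    ]
    # Highest similarity first.
    similarities.sort(reverse=True)
    votes = Counter(bank_labels[index] for _, index in similarities[:num_neighbors])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_features, test_labels, bank_features, bank_labels, num_neighbors):
    """Fraction of test samples whose kNN prediction matches the true label."""
    correct = sum(
        knn_predict(feature, bank_features, bank_labels, num_neighbors) == label
        for feature, label in zip(test_features, test_labels)
    )
    return correct / len(test_labels)
```

Here `bank_features` would hold the embeddings of the training images and `test_features` those of the held-out images, both produced by the frozen network.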
The second way to increase the batch size, which we explore in this work, is to use multiple (in particular, $k$) augmentations per sample within the same batch. Specifically, for every input sample $x_i$ from the batch, we propose to construct a series of $k$ perturbed copies using a randomly composed set of augmentations. As such, the loss from one batch (with size $b$) becomes:
where $\hat{x}_{i,j}$ is the $j$-th augmented copy of image $x_i$, transformed using a randomly selected set of chained augmentation operators (e.g. flipping, color jittering), and $v_{i,j}$ is the corresponding embedding produced by passing the sample through the network. This is illustrated in Fig. 1, where different shades of the same color represent different augmentations of the same instance. This second way is primarily motivated by being able to enforce the consistency loss described in the next section. The results, shown in Table 2, confirm that the proposed way achieves even higher accuracy.
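The batch loss described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: it uses a full softmax over the memory bank in the style of [22], averages over all augmented copies in the batch, and the function names and the `temperature` argument are hypothetical.

```python
import math


def instance_nll(embedding, memory_bank, index, temperature):
    """Negative log-probability that `embedding` is recognised as instance `index`.

    Logits are dot products with every memory bank entry, scaled by the
    temperature (assumed full-softmax form, not an NCE approximation).
    """
    logits = [
        sum(a * b for a, b in zip(embedding, entry)) / temperature
        for entry in memory_bank
    ]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[index]


def multi_augmentation_loss(batch_embeddings, batch_indices, memory_bank, temperature):
    """Average instance-discrimination loss over every augmented copy in the batch.

    batch_embeddings[s][j] is the embedding of the j-th augmented copy of the
    s-th sample in the batch; batch_indices[s] is that sample's dataset index.
    """
    total, count = 0.0, 0
    for copies, index in zip(batch_embeddings, batch_indices):
        for embedding in copies:
            total += instance_nll(embedding, memory_bank, index, temperature)
            count += 1
    return total / count
```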
We note that when the batch size is increased in the proposed way (i.e. using multiple augmentations) by a factor of $k$, the feature $v_i$ is still updated after the same number of iterations regardless of the value of $k$, i.e. exactly as often as when the batch size is not increased. To overcome this issue, we propose updating the feature $v_i$ by aggregating the features produced by the $k$ augmented versions of $x_i$ (see Fig. 1(a)).
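One possible form of such an aggregated update is sketched below. The aggregation rule is an assumption: the copies are averaged and blended into the stored entry with a momentum coefficient before re-normalization, and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def update_memory_entry(old_entry, augmented_embeddings, momentum):
    """Update one memory bank entry from the embeddings of its augmented copies.

    Assumed rule: average the k embeddings, blend the average with the old
    entry using `momentum`, then project back onto the unit sphere.
    """
    k = len(augmented_embeddings)
    mean = [
        sum(embedding[d] for embedding in augmented_embeddings) / k
        for d in range(len(old_entry))
    ]
    blended = [
        momentum * old + (1.0 - momentum) * new
        for old, new in zip(old_entry, mean)
    ]
    return l2_normalize(blended)
```

With `momentum` close to 1 the stored feature changes slowly; with `momentum` equal to 0 it is replaced by the normalized average of the augmented embeddings.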
The latter observation allows us to study further where the accuracy improvement in Table 2 comes from. To this end, we consider the case of using $k$ augmentations to calculate the loss of Eq. (2) but updating the memory bank only once (equivalent to using $k = 1$ for the update). The results for this case for [...] are shown in Table 2. Interestingly, we observe a significant accuracy improvement over the baseline (no augmentation). Since the memory bank is updated in the same way as for the case $k = 1$, we conjecture that this accuracy improvement comes from the smoothed gradients due to the use of the large batch size. When measured, the (average) cosine distance between the memory bank representations at adjacent epochs becomes smaller as $k$ increases. Overall, we conclude that a large batch size helps to improve both network training and the memory bank updates.
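The stability measurement mentioned above, the average cosine distance between memory bank representations at adjacent epochs, could be computed along the following lines. This is a sketch with hypothetical function names; it assumes two snapshots of the memory bank with matching entry order.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_bank_drift(bank_previous, bank_current):
    """Average cosine distance between corresponding entries of two snapshots."""
    distances = [
        cosine_distance(previous, current)
        for previous, current in zip(bank_previous, bank_current)
    ]
    return sum(distances) / len(distances)
```

A smaller value indicates that the stored features changed less from one epoch to the next.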
2.3 Instance consistency
[Table: consistency loss variants, None vs. KL (Eq. 3)]
With the introduction, in the previous subsection, of multiple instantiations of the same sample within the batch, each generated by applying a different set of randomly selected transformations, we propose to explicitly enforce consistency between the representations of the augmented copies of the same image. A similar idea has been explored for semi-supervised learning [19, 14, 2]; however, to our knowledge, it has not been explored before in the context of contrastive learning. Notably, this consistency is enforced without trying to enforce discrimination with respect to the negative samples, as proposed by recent contrastive approaches [23, 22, 8, 17, 10, 21, 1]. More specifically, given the set of logits produced by each of the $k$ augmented copies of $x_i$, we define our consistency loss as follows:
Note that the proposed loss term performs dense matching (i.e. every possible pair formed from the augmented samples is considered). This is illustrated in Fig. 3. For completeness, we also evaluated an alternative loss for enforcing consistency, applied as a regularizer directly on the feature embeddings. As the results in Table 4 show, the proposed consistency loss offers noticeable improvements over both the vanilla training process and this alternative form of regularization.
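A dense pairwise consistency term of this kind can be sketched as below. The sketch assumes a KL divergence between the softmax distributions of every ordered pair of augmented copies, averaged over pairs; the exact weighting and direction used in Eq. (3) are not recoverable here, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return [l - log_norm for l in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def consistency_loss(logits_per_copy):
    """Average KL divergence over all ordered pairs of augmented copies.

    logits_per_copy[j] holds the logits of the j-th augmented copy of one
    sample; at least two copies are required.
    """
    if len(logits_per_copy) < 2:
        raise ValueError("at least two augmented copies are required")
    log_probs = [log_softmax(logits) for logits in logits_per_copy]
    total, pairs = 0.0, 0
    for a, log_p in enumerate(log_probs):
        for b, log_q in enumerate(log_probs):
            if a != b:
                total += kl_divergence(log_p, log_q)
                pairs += 1
    return total / pairs
```

The loss is zero when all copies produce the same distribution and grows as their predictions diverge.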
2.4 Hard negative mining
Unsupervised learning with instance discrimination assumes that each sample within the dataset forms a unique class. An obvious limitation of this approach is that near-identical or very similar samples are artificially forced apart in the embedding space. To alleviate this, we propose an offline kNN-based strategy that merges similar instances into a single class. As opposed to the deep clustering approach of [4], we neither seek to construct large clusters in an online manner via K-means nor replace the instance-level discrimination task; instead, during an offline grouping stage, we compute the nearest neighbours of each memory bank feature representation and then group the ones located in its immediate vicinity (see Fig. 1(b)). This process is reminiscent of hard negative mining, with the difference that, once the hard negative samples are identified, they are treated as positives. Once the selected instances are merged, they have a common representation and share the same location inside the memory bank. Similarly, during training, for the grouped instances, instead of using augmentations of the same image, we uniformly sample and augment images located within the same group. Because the grouping is conservative, the large majority of the samples remain ungrouped after the grouping stage (only 5-10% of samples are grouped). As such, the effect of the proposed approach is to stop very similar samples from being forced to produce different features. Our conservative hard mining strategy is run offline near the end of training, each time grouping the most similar samples as measured by their cosine distance. Firstly, we notice that the gains flatten out after the algorithm has been run twice (each run is denoted as a stage in the tables). Secondly, while the method offers improvements even when the model is retrained from scratch using the computed assignments, we find that the gains are significantly larger if we continue training from the current checkpoint. Table 4 summarizes the results, showcasing the large impact of our hard negative mining approach.
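The offline grouping stage can be sketched as follows. This is a minimal sketch under assumptions: neighbours are merged when their cosine distance falls below a threshold, groups are tracked with a union-find structure, and the names `num_neighbors` and `max_distance` are hypothetical parameters rather than the paper's.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def group_similar_instances(memory_bank, num_neighbors, max_distance):
    """Return a group id per instance; merged instances share the same id."""
    n = len(memory_bank)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        neighbours = sorted(
            (cosine_distance(memory_bank[i], memory_bank[j]), j)
            for j in range(n)
            if j != i
        )
        for distance, j in neighbours[:num_neighbors]:
            # Only merge neighbours in the immediate vicinity.
            if distance <= max_distance:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(n)]
```

A tight `max_distance` keeps the grouping conservative, so most instances remain in singleton groups. The brute-force neighbour search is quadratic in the dataset size and is meant only to show the logic.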
We report results for two popular settings: seen testing categories (training and testing are performed on images from the same categories) and unseen testing categories (training and testing categories are disjoint). All methods were implemented using PyTorch [18].
Seen Testing Categories. Following [22, 23], the experiments are performed on the CIFAR-10 [12] and STL-10 [6] datasets under the same settings. In particular, we use a ResNet18 [9] as the feature extractor, setting the output embedding size to 128. As per [...], the network is trained for 300 epochs using a starting learning rate of [...], which is then dropped by [...] at epochs 80, 140 and 200. The network is optimized using SGD with momentum ([...]) and a weight decay of [...]. During training, each input sample is randomly augmented using a combination of the following transformations: random resize and crop, random grayscale, random mirroring and color jittering. The temperature is set to [...], the memory bank momentum to [...] and the consistency regularization factor to [...]. Following [...], we adhere to the linear and kNN protocols. As Tables 5 and 6 show, our method surpasses other methods, including our direct baseline, the method of [22], by a significant margin.
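The step-wise learning rate schedule described above can be sketched as a small helper. Only the drop epochs (80, 140 and 200) come from the text; the function name is hypothetical, the base learning rate and drop factor are left as arguments because their values are not given here, and it is assumed that a drop takes effect from the listed epoch onwards with epochs counted from 0.

```python
def learning_rate_at(epoch, base_lr, drop_factor, milestones=(80, 140, 200)):
    """Learning rate for a given epoch under a step schedule.

    The rate is multiplied by `drop_factor` once for every milestone that has
    been reached. `base_lr` and `drop_factor` are caller-supplied values.
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * (drop_factor ** drops)
```

For example, with purely illustrative values `base_lr=1.0` and `drop_factor=0.5`, the rate stays at 1.0 until epoch 79 and is halved at each of the three milestones.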
| Method | Accuracy (%) |
| DeepCluster (1000) [4] | 67.6 |
| Triplet (Hard) | 78.4 |
| Invariant Instance [23] | 83.6 |
Unseen Testing Categories. Following Song et al. [16], we report results on unseen categories by training a ResNet-18 model on the Stanford Online Products [16] dataset. The images corresponding to the first half of the categories are used for training in an unsupervised manner, without using their labels, while testing is done on images belonging to unseen categories. We closely align our setting and training details with [23, 15]: we report results in terms of clustering quality and NN retrieval performance. We denote with R@k the probability of any correct match occurring among the top-k retrieved samples. NMI, the second reported metric, measures the quality of the clustering. As Table 7 shows, our method improves R@1 over the state of the art by almost 4% and over our baseline from [22] by 9%.
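The retrieval metric can be illustrated with the short sketch below, which counts a query as a hit when at least one of its top-k retrieved samples shares its label. It assumes L2-normalized features, retrieval within the test set with the query itself excluded, and a hypothetical function name.

```python
def recall_at_k(features, labels, k):
    """Fraction of queries with at least one same-label sample in their top-k.

    Each sample is used in turn as the query and is excluded from its own
    retrieval set. Dot products are used as similarity (unit-norm features).
    """
    n = len(features)
    hits = 0
    for i in range(n):
        ranked = sorted(
            (
                (sum(a * b for a, b in zip(features[i], features[j])), j)
                for j in range(n)
                if j != i
            ),
            reverse=True,
        )
        if any(labels[j] == labels[i] for _, j in ranked[:k]):
            hits += 1
    return hits / n
```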
| Method | Training set | Linear | kNN |
| DeepCluster (100) [4] | 5K | 56.6 | 61.2 |
Top-1 accuracy (%) on STL-10 using a linear and a kNN classifier.
We described three simple yet powerful ways to improve unsupervised contrastive learning with a memory bank. Firstly, we proposed a large mini-batch with multiple instance augmentations, which provides smoother gradients that improve network training and increases the quality of the features stored in the memory bank. Secondly, we introduced a simple yet effective intra-instance consistency loss that encourages the distribution of each augmented sample to match that of the remaining augmentations. Finally, we presented our hard negative mining strategy, which attempts to overcome one of the problems of unsupervised instance discrimination: that of trying to push apart near-identical images. We exhaustively evaluated the proposed improvements and reported large accuracy gains.
- [1] (2019) Learning representations by maximizing mutual information across views. In NeurIPS.
- [2] (2019) ReMixMatch: semi-supervised learning with distribution alignment and augmentation anchoring. arXiv.
- [3] (2013) Unsupervised feature learning for RGB-D based object recognition. In Experimental Robotics.
- [4] (2018) Deep clustering for unsupervised learning of visual features. In ECCV.
- [5] (2020) A simple framework for contrastive learning of visual representations. arXiv.
- [6] (2011) An analysis of single-layer networks in unsupervised feature learning. In AISTATS.
- [7] Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI.
- [8] (2019) Momentum contrast for unsupervised visual representation learning. arXiv.
- [9] (2016) Deep residual learning for image recognition. In CVPR.
- [10] (2019) Data-efficient image recognition with contrastive predictive coding. arXiv.
- [11] (2018) Mining on manifolds: metric learning without labels. In CVPR.
- [12] (2009) Learning multiple layers of features from tiny images.
- [13] (2019) Self-supervised learning of pretext-invariant representations. arXiv.
- [14] (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. TPAMI.
- [15] (2017) No fuss distance metric learning using proxies. In ICCV.
- [16] (2016) Deep metric learning via lifted structured feature embedding. In CVPR.
- [17] (2018) Representation learning with contrastive predictive coding. arXiv.
- [18] (2017) Automatic differentiation in PyTorch.
- [19] (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS.
- [20] Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS.
- [21] (2019) Contrastive multiview coding. arXiv.
- [22] (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR.
- [23] (2019) Unsupervised embedding learning via invariant and spreading instance feature. In CVPR.
- [24] (2015) Stacked what-where auto-encoders. arXiv.