Rethinking Image Mixture for Unsupervised Visual Representation Learning

03/11/2020 ∙ by Zhiqiang Shen, et al. ∙ 8

In supervised learning, smoothing label/prediction distribution in neural network training has been proven useful in preventing the model from being over-confident, and is crucial for learning more robust visual representations. This observation motivates us to explore the way to make predictions flattened in unsupervised learning. Considering that human annotated labels are not adopted in unsupervised learning, we introduce a straightforward approach to perturb input image space in order to soften the output prediction space indirectly. Despite its conceptual simplicity, we show empirically that with the simple solution – image mixture, we can learn more robust visual representations from the transformed input, and the benefits of representations learned from this space can be inherited by the linear classification and downstream tasks.




1 Introduction

Recently, unsupervised visual representation learning has attracted increasing attention [noroozi2016unsupervised, zhang2016colorful, oord2018representation, hjelm2018learning, gidaris2018unsupervised, wu2018unsupervised, he2019momentum, misra2019self, tian2019contrastive, chen2020simple] due to its enormous potential of being free from human-annotated supervision and its extraordinary capability of leveraging boundless unlabeled data. Previous studies in this field address this problem mainly in two directions. One is realized via a heuristic task design that applies a transformation to the input image, such as colorization [zhang2016colorful], rotations [gidaris2018unsupervised], jigsaw [noroozi2016unsupervised], etc., where the corresponding labels are derived from the properties of the transformation on the unlabeled data. The other direction is contrastive learning based approaches [he2019momentum, tian2019contrastive] in the latent feature space, such as maximizing mutual information between different views [bachman2019learning, tian2019contrastive], momentum contrast learning [he2019momentum, MoCov2] with an instance discrimination task [wu2018unsupervised, ye2019unsupervised], learning pretext-invariant representations [misra2019self], or training with a composition of data augmentations, larger batch sizes and nonlinear transformation [chen2020simple]. These methods have recently shown great promise for this task, achieving state-of-the-art accuracy.

Figure 1: Illustration of our motivation on contrastive-based unsupervised learning approaches. Contrastive learning measures the similarity of sample pairs in the latent representation space. With flattened prediction, the model is encouraged to treat each incorrect instance as equally probable, which will smooth decision boundaries and prevent the learner from becoming over-confident.

Our proposed approach applies to contrastive methods and stems from some simple observations of label smoothing in supervised learning [szegedy2016rethinking]. Interestingly, visualizations in previous literature [he2019bag, muller2019does] show that label smoothing tends to make the output predictions of networks less confident (i.e., lower maximum probability of predictions), yet the overall performance increases significantly. The explanation for this seemingly contradictory phenomenon is that, with label smoothing, the learner is encouraged to treat each incorrect instance/class as equally probable. Thus more structure is enforced in latent representations, enabling less variation across predicted instances and/or across samples. This further prevents the network from overfitting to the training data; otherwise, the network can often output incorrect yet confident predictions when evaluated on slightly different test samples. Considering that contrastive learning essentially classifies positive congruent and negative incongruent pairs with a cross-entropy loss, this observation suggests that a typical contrastive-based method will also encounter the over-confident prediction problem raised in supervised learning.

A major challenge is that contrastive learning approaches do not involve any explicit labels, so the conventional label smoothing operation cannot be applied directly. Therefore, as illustrated in Figure 1, in this paper we propose to leverage semantic interpolations in image space as the new training signal, obtaining neural networks with smoother decision boundaries at the latent level of representation. As a result, neural networks trained with this new space learn flatter class-agnostic and unsupervised representations, that is, with fewer directions of variance.

We choose two recently proposed contrastive-based methods, momentum contrast learning [he2019momentum] and contrastive multiview coding [tian2019contrastive], as our baselines. We conduct extensive experiments on ImageNet classification and downstream recognition tasks (PASCAL VOC [everingham2010pascal] and COCO [lin2014microsoft] object detection) to demonstrate the effectiveness of our approach. Code is available at:

Our contributions are summarized as follows:

  • We provide empirical analysis to reveal the fact that smoothing prediction could improve performance favorably for contrastive-based unsupervised learning. We present two simple image mixture methods based on the previous literature [zhang2018mixup, yun2019cutmix] to encourage neural networks to predict less confidently.

  • We show that training space matters. We provide evidence on why this flattening happens under ideal conditions of latent space, validate it empirically on practical situations of contrastive learning, and connect it to the previous works on analyzing the discipline inside the unsupervised learning behavior. We explain the difficulties that arise with original image space when visualizing these trajectories of predictions. Thus we derive the conclusion that a good training space is crucial for unsupervised optimization.

  • We observe that more training epochs can dramatically improve the representation ability for unsupervised learning (3% to 5% improvement on linear evaluation), especially when the scale of training data is not large enough, such as ImageNet-100 (a randomly selected subset of ImageNet with 100 classes from [tian2019contrastive]), which we used for the ablation study in our experiments.

  • We conduct extensive experiments to demonstrate that our approach can surpass the MoCo and CMC baselines. Our learned representations further benefit downstream visual tasks such as object detection and segmentation, improving their performance.

2 Related Work

In this paper, we focus on flattening latent predictions for unsupervised learning circumstance, thus our study is not only related to prior works that learn unsupervised representations from unlabeled data and transfer to subsequent/downstream tasks, but also related to works that smooth label/prediction in supervised learning paradigm for better generalization. Moreover, data augmentation is a closely related aspect that we will elaborate and review in this section.

Unsupervised Visual Representation Learning.

Unsupervised learning aims to explore the internal distributions of data and learn a representation without human-annotated labels. To achieve this purpose, early work mainly focused on reconstructing images from a latent representation, such as autoencoders [vincent2008extracting, vincent2010stacked, masci2011stacked], sparse coding [olshausen1996emergence], adversarial learning [goodfellow2014generative, donahue2016adversarial, donahue2019large], etc. After that, more and more studies tried to design handcrafted pretext tasks such as image colorization [zhang2016colorful, zhang2017split], solving jigsaw puzzles [noroozi2016unsupervised], counting visual primitives [noroozi2017representation], rotation prediction [gidaris2018unsupervised], etc. Recently, contrastive-based visual representation learning [hadsell2006dimensionality] has attracted many researchers' attention and achieved promising results. For example, Oord et al. [oord2018representation] proposed to use autoregressive models to predict future samples in latent space with a probabilistic contrastive loss. Wu et al. [wu2018unsupervised] proposed a non-parametric memory bank to store the instance representations, in order to tackle the computational issue imposed by the large number of instances. Hjelm et al. [hjelm2018learning] proposed to maximize mutual information between the inputs and outputs of a deep neural network encoder. Bachman et al. [bachman2019learning] further extended this idea to multiple views of a shared context, and a similar method is applied in CMC [tian2019contrastive]. Moreover, He et al. [he2019momentum] proposed to adopt momentum contrast to update the models, and Misra & Maaten [misra2019self] developed the pretext-invariant representation learning strategy that learns invariant representations from pre-designed pretext tasks.

Smoothing Label/Prediction in Supervised Learning. Explicit label smoothing has been adopted successfully to improve the performance of deep neural models across a wide range of tasks, including image classification [szegedy2016rethinking, he2019bag], object detection [zhang2019bag], machine translation [vaswani2017attention], and speech recognition [chorowski2016towards]. Moreover, motivated by mixup, Verma et al. [verma2019manifold] proposed to implicitly interpolate hidden states as a regularizer that encourages neural networks to predict less confidently (softer predictions) on interpolations of hidden representations. They found that neural networks trained with this kind of operation learn flatter class representations, which possess better generalization as well as better robustness to novel deformations of testing data and even adversarial examples. Some recent work [muller2019does] further demonstrated that label smoothing implicitly calibrates the predictions of learned networks, so that the confidences of their outputs are more aligned with the true labels of the training dataset. However, all of these studies lie in supervised learning. To the best of our knowledge, there is no existing work focusing on smoothing predictions for unsupervised learning, and thus this is the first attempt to explore this direction.

Data Augmentation. Conventional data augmentation strategies such as flipping, rotation, color distortion, contrast adjustment, scaling, cropping, filtering, translation, adding Gaussian noise/blur, etc., are among the most popular and important techniques for training deep neural networks and improving the generalization capabilities of models. They have also shown promising potential in both supervised learning [cubuk2019autoaugment, lim2019fast] and unsupervised learning [chen2020simple, he2019momentum]. Recently, a new body of semantic manipulation based augmentation methods emerged in supervised learning, such as Cutout [devries2017improved]/Random Erase [zhong2017random], Mixup [zhang2018mixup], CutMix [yun2019cutmix], etc. Specifically, Cutout and Random Erase randomly mask out square regions of the input or fill them with different values during training, to improve the diversity and robustness of training samples and, in turn, the overall performance. However, some studies [lim2019fast] pointed out that they are not effective for large-scale data like ImageNet [russakovsky2015imagenet], since they drop information inside images during training and overfitting is not the key problem for large-scale datasets. Mixup [zhang2018mixup] trains a neural network on convex combinations of image pairs and their corresponding labels, which helps to increase generalization and robustness and to stabilize the training of neural networks. CutMix is motivated by the Mixup [zhang2018mixup] operation but considers the mixture at the object level. In this method, regions are randomly cut and pasted among training images to generate new images, and the ground truth labels are mixed proportionally to the area of the regions relative to the whole image.

3 Image Mixture for Unsupervised Learning

In this section, we first show that contrastive learning effectively learns an embedding that contrasts samples from two different distributions, as shown in Figure 1, so flattening distributions could be crucial to smooth decision boundaries and to prevent the learner from becoming over-confident. Then we elaborate our method, which softens distributions indirectly from the input image space, since directly softening predictions is not feasible in practice. Lastly, we compare our method with previous data augmentation mechanisms.

3.1 Contrastive Learning

Contrastive learning aims to train an encoder $f_\theta$ with parameters $\theta$ (e.g., convolutional neural networks in most literature) that can contrast image representations $f_\theta(x)$ for the input image $x$. Previous studies [he2019momentum, tian2019contrastive] usually take two random “views” ($x$, $x^+$) of the same image under random data augmentation, or generate them from different sensors (i.e., sensory views, such as depth, grey images, etc.), to form a positive pair. The negative pairs are obtained by computing features from different images ($x$, $x_k^-$). Here we take the noise contrastive estimator (NCE) [gutmann2010noise] as an example: a standard log-softmax function is trained to pick one positive sample out of $K$ negative samples, and it predicts the probability of the data distribution as:

$$p(x^+ \mid x) = \frac{\exp\big(h(f_\theta(x), f_\theta(x^+))/\tau\big)}{\exp\big(h(f_\theta(x), f_\theta(x^+))/\tau\big) + \sum_{k=1}^{K} \exp\big(h(f_\theta(x), f_\theta(x_k^-))/\tau\big)}$$

where $\tau$ is a temperature hyper-parameter, and $h(\cdot, \cdot)$ is a matching function or other critic metric for measuring the similarity of two representations of images/views. NCE can then be formulated as the following loss:

$$\mathcal{L}_{\mathrm{NCE}} = -\,\mathbb{E}_{x}\big[\log p(x^+ \mid x)\big]$$

This loss encourages the representations of images $x$ and $x^+$ forming positive pairs to be similar, while encouraging the representations of negative pairs to be dissimilar.
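As a minimal sketch of this loss for a single query, the log-softmax over one positive and several negatives can be written in NumPy as below. The cosine-similarity critic, the temperature of 0.07, and all function and argument names are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def nce_loss(query, positive, negatives, temperature=0.07):
    """Negative log-probability of the positive among one positive and K negatives.

    query, positive: 1-D arrays of length D; negatives: array of shape (K, D).
    The critic is cosine similarity (an assumption made for this sketch).
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q, p, n = normalize(query), normalize(positive), normalize(negatives)
    # Logit 0 belongs to the positive pair; logits 1..K belong to the negatives.
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_positive
```

A sharply peaked softmax drives this loss towards zero on the training pairs, which is the over-confident regime that the flattening strategy is meant to counteract.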

3.2 Motivation of Flattening Predictions for Contrastive Loss

Deep neural networks often produce incorrect, yet extremely confident predictions on testing samples that slightly differ from those seen in training. This problem is one of the core challenges in supervised learning. In this paper, we investigate this issue from the perspective of representations learned by contrastive loss in unsupervised scenarios. We observed that deep neural networks spread the vanilla training data (input space) widely throughout the latent representation space and deliver high-confidence predictions for almost all training samples. Consequently, we impose interpolation with mixed context on the input space to push the networks to produce a compressed distribution and thereby flatten the output predictions.

We would like to learn a critic that predicts high probability for positives and low probability for negatives. In unsupervised learning, the generalization ability is crucial since the downstream data usually differs from the pre-training images. If the predictions of positive and negative pairs are sharp and clear-cut, the network will easily find a low-loss solution to fit the distributions, resulting in poor generalization of the representations. Common practice leverages random data augmentation (e.g., rotation, colorization, flipping, color jittering, etc.) to force the network to predict slightly different outputs for the same image. However, these augmentation methods do not change the content of an image sharply, so the representation does not change much once the network is well trained. In this work, we rethink the image mixture operation of interpolating different image contents for unsupervised learning: the previous image interpolation methods [zhang2018mixup, yun2019cutmix] were originally designed for supervised learning, and we adapt and develop them to accommodate the unsupervised scenario. We empirically show that the proposed interpolation methods not only improve the generalization of the representation, but are also compatible with conventional data augmentation and with different pretext tasks, as we will introduce later.

Figure 2: Our global image mixture strategy. As the corresponding mixed label is not required in unsupervised learning, we simply extend mixup [zhang2018mixup] to enable multiple images mixture.

3.3 Global Mixture

Mixup [zhang2018mixup] is a common way to obtain the weighted mixture of two whole images. In supervised learning it is usually restricted to two images, so that the corresponding mixed labels can be obtained in practice. Here we extend it to an unlimited number of mixtures by applying a simple iterative mixture strategy to maximize the potential of this technique. As shown in Figure 2 (left), we can formulate the process as:

$$\tilde{x}_1 = \lambda_1 x_1 + (1 - \lambda_1)\, x_2, \qquad \tilde{x}_k = \lambda_k \tilde{x}_{k-1} + (1 - \lambda_k)\, x_{k+1}, \quad k = 2, \dots, n-1$$

where $x_1, \dots, x_n$ denote the images that we want to mix and $\tilde{x}_{n-1}$ is the output mixture. $\lambda_k$ are mixture coefficients and are restricted to $[0, 1]$ following [zhang2018mixup].

From the formulation we can observe that, regardless of how many images are mixed, the last one is mixed only once, so it is effectively the main image in terms of content/context during training. Therefore, the proposed strategy achieves the purpose of flattening the input space while ensuring that the final mixed output is not too ambiguous in its content.
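The iterative strategy can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the released implementation: the coefficients are drawn from a Beta(alpha, alpha) distribution as in mixup, and the function name `global_mixture` and its arguments are hypothetical.

```python
import numpy as np

def global_mixture(images, alpha=1.0, rng=None):
    """Iteratively mix a list of same-shape images.

    Returns the final mixture and the list of sampled coefficients.
    The image that enters last is mixed only once.
    """
    rng = np.random.default_rng() if rng is None else rng
    mixed = np.asarray(images[0], dtype=np.float64)
    lambdas = []
    for image in images[1:]:
        lam = rng.beta(alpha, alpha)  # coefficient in [0, 1] (assumed sampling scheme)
        mixed = lam * mixed + (1.0 - lam) * np.asarray(image, dtype=np.float64)
        lambdas.append(lam)
    return mixed, lambdas
```

Because no mixed label is needed, the loop can run over any number of images; only the images themselves are combined.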

Figure 3: Our region-level mixture with context pixel decay. We found that in most cases, randomly mixing regions [yun2019cutmix] cannot obtain semantically harmonious output between context and objects (as illustrated in the second row of the figure, red cross denotes the disharmonious results and green tick denotes the harmonious ones). Decaying context pixels can alleviate the disharmony and achieve desired performance.

3.4 Context Decay for Region-based Mixture

In most cases, randomly mixing regions [yun2019cutmix] cannot obtain semantically harmonious output between context and objects (as illustrated in the second row of Figure 3). This is acceptable in supervised learning, since the mixed labels guide the network to learn the correct and desired information, but it does not hold in unsupervised learning, where no labels are available. We address this issue by developing a structured form of pixel decay that is applied to context pixels while the object areas are maintained. Our method can be seen as increasing the image contrast, but we conduct it at the region level. During training, a contiguous context area of an image is decayed by a coefficient, instead of individual pixels. We observe that decaying context pixels can alleviate the disharmony and achieve the desired performance, and that it is particularly effective in regularizing deep neural models.

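As shown in Figure 3, our decay operation is applied on the first iteration of region-level mixture, which can be formulated as:

$$\tilde{x}_1 = \gamma\, \mathbf{M}_1 \odot x_1 + (\mathbf{1} - \mathbf{M}_1) \odot x_2$$

where $\gamma$ denotes the pixel decay coefficient, $\mathbf{M}_1$ denotes a binary mask as defined in [yun2019cutmix], $\mathbf{1}$ is a binary mask with all values equal to one, and $\odot$ denotes element-wise multiplication. The following iterations are similar to global mixture:

$$\tilde{x}_k = \mathbf{M}_k \odot \tilde{x}_{k-1} + (\mathbf{1} - \mathbf{M}_k) \odot x_{k+1}$$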
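The first region-level iteration with context decay can be sketched as follows. This is a sketch under stated assumptions: the box is sampled in the CutMix style, the image that receives the pasted region is treated as the context and is the one scaled by the decay coefficient, and all function names are hypothetical.

```python
import numpy as np

def random_box_mask(height, width, area_ratio, rng):
    """Binary mask that is 0 inside a random box and 1 elsewhere (CutMix-style)."""
    cut_h = int(height * np.sqrt(area_ratio))
    cut_w = int(width * np.sqrt(area_ratio))
    cy, cx = rng.integers(height), rng.integers(width)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    mask = np.ones((height, width))
    mask[top:bottom, left:right] = 0.0
    return mask

def region_mixture_with_decay(context_image, object_image, mask, decay):
    """Paste the masked-out region from `object_image` and scale the context by `decay`."""
    # Broadcast the 2-D mask over the channel axis for H x W x C images.
    m = mask if context_image.ndim == 2 else mask[..., None]
    return decay * m * context_image + (1.0 - m) * object_image
```

With `decay=1.0` the function reduces to a plain region mixture, so the decayed and undecayed variants can be compared by changing a single argument.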


3.5 Compatibility with Different Pretext Tasks

The central idea of this paper is to demonstrate that a good training space is crucial for contrastive-based unsupervised learning. We choose MoCo with instance discrimination (i.e., an exemplar-based pretext task) and CMC with a multi-view discrimination pretext task as our base models. Taking MoCo as an example, our flattening operation is applied after augmenting the input image and before feeding it into the encoders, i.e., the neural networks, so our method is a simple addition to the original methods. To be more precise, we consider unsupervised learning from three aspects: 1) training space generation or data processing; 2) pretext tasks; and 3) loss function design. This work focuses on the first aspect, i.e., generating a better training space for unsupervised learning, so our method is naturally compatible with any type of pretext task based on contrastive learning.
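To illustrate where such an operation sits in a training pipeline, the sketch below mixes an already augmented batch with a shuffled copy of itself, with a given probability, before the batch would be passed to an encoder. The pairing by permutation, the Beta-distributed coefficient, and the name `flatten_batch` are assumptions made for this example; the paper does not specify these details here.

```python
import numpy as np

def flatten_batch(batch, probability, alpha=1.0, rng=None):
    """With the given probability, mix each image with a randomly chosen partner.

    batch: array of shape (N, ...) holding already augmented images.
    Returns the input unchanged when the mixture is not applied.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= probability:
        return batch
    lam = rng.beta(alpha, alpha)  # one coefficient for the whole batch (assumption)
    partners = batch[rng.permutation(len(batch))]
    return lam * batch + (1.0 - lam) * partners
```

Since the function only transforms the input tensor, the pretext task and the loss function stay untouched, which is what makes the approach an add-on to existing contrastive methods.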

3.6 Relations to Previous Data Augmentation Mechanisms

Previous methods [bachman2019learning, misra2019self, he2019momentum, wu2018unsupervised] in unsupervised learning conservatively choose augmentations that do not dramatically alter the content of an image. In general, most of the methods maximize mutual information between representations extracted from multiple views of the shared context/content. Even though a recent study [chen2020simple] pointed out that data augmentation plays a critical role, it still chooses the safest augmentations to build its framework. In this paper, we aim to show that, with appropriate usage, a more radical interpolation method can be more effective and robust for learning better representations. This kind of method generates a larger training space, and unsupervised learning can benefit from this space. The benefits of representations from this new space can be further inherited by linear classification and downstream tasks.

4 Experiments

In this section, we first introduce our experimental configurations for the unsupervised pre-training. Then we provide details on linear classification protocol with our ablation studies. Finally, we transfer our learned features to downstream tasks such as object detection and instance segmentation.

4.1 Datasets and Implementation Detail for Unsupervised Pre-training

Datasets. We conduct ablation studies on ImageNet-100 [tian2019contrastive], a randomly selected subset of ImageNet [russakovsky2015imagenet] with 100 classes. The complete ImageNet-100 class list is shown in our Appendix. Our final experiments are conducted on the full ImageNet (called ImageNet-1K in the following context), which has 1.28 million images in 1000 classes.

Implementation Detail. We use a mini-batch size of 128 with 4 GPUs on ImageNet-100 and 256 with 8 GPUs on ImageNet-1K following [tian2019contrastive, he2019momentum]. We use 0.99 as the MoCo momentum value on ImageNet-100 as recommended by [tian2019contrastive] (for ImageNet-1K we still use 0.999 as in [he2019momentum]). The number of negatives is set to 16384 on ImageNet-100 for both MoCo and CMC, and we use YCbCr space and Softmax-CE for CMC on ImageNet-100. Unless otherwise stated, the other hyperparameter configurations strictly follow the baselines MoCo and CMC on ImageNet-1K according to their original papers. For example, we use SGD as the optimizer with weight decay 1e-4 and SGD momentum 0.9. On ImageNet-1K, all methods for a given architecture are run for the same number of epochs and with the same learning rate schedule as [he2019momentum, tian2019contrastive], since our prime objective is to verify the effectiveness of our proposed method rather than to surpass state-of-the-art results, even though we found that longer training can deliver better performance. We implement our method in the PyTorch [paszke2019pytorch] framework.
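For readers unfamiliar with the MoCo momentum value mentioned above, it controls how slowly the key encoder follows the query encoder, following the update rule described in [he2019momentum]. The sketch below applies that rule to plain lists of numbers; the function name and the list-based parameter representation are illustrative only.

```python
def momentum_update(key_params, query_params, momentum=0.99):
    """Move each key-encoder parameter a small step towards the query encoder."""
    return [
        momentum * k + (1.0 - momentum) * q
        for k, q in zip(key_params, query_params)
    ]
```

With a momentum of 0.99 each update moves the key parameters 1% of the way towards the query parameters, and with 0.999 only 0.1% of the way.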

4.2 Evaluation with Linear Classification

Our linear classification consists of two parts: ablation studies on ImageNet-100 to explore the best hyperparameter setting for our proposed method; and the final results on ImageNet-1K with the same configurations as the baselines.

Setup. We first conduct unsupervised pre-training on the two datasets ImageNet-100 and ImageNet-1K respectively. Then we fix all the parameters and train a supervised linear classifier as [he2019momentum]. We also report single crop, top-1 accuracy on the validation set. For ImageNet-100, we use the initial learning rate of 10 and weight decay 0. We train with 60 epochs and the learning rate multiplied by 0.1 at 30, 40 and 50 epochs. For ImageNet-1K, we use the initial learning rate of 30 and weight decay also is set to 0. We train with 100 epochs and the learning rate multiplied by 0.1 at 60, 80 and 90 epochs.
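The two step schedules described in this setup can be written down compactly. The sketch below assumes 0-indexed epochs and that each drop takes effect at the start of the listed epoch; the helper name and the configuration dictionaries are hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at `epoch`, multiplied by `gamma` at every milestone passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Linear-evaluation settings as stated in the text (weight decay is 0 in both).
IMAGENET100 = dict(base_lr=10.0, milestones=(30, 40, 50), epochs=60)
IMAGENET1K = dict(base_lr=30.0, milestones=(60, 80, 90), epochs=100)

def schedule(config):
    """Per-epoch learning rates for one of the configurations above."""
    return [
        step_lr(e, config["base_lr"], config["milestones"])
        for e in range(config["epochs"])
    ]
```

For example, `schedule(IMAGENET100)` yields 60 values that start at 10 and fall by a factor of ten at epochs 30, 40 and 50.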

Ablation Study. We investigate and compare the following four aspects of our method: 1) the number of mixtures on both global and region levels; 2) the ratio of pixel decay; 3) the budget of unsupervised pre-training; and 4) the sensitivity of our method to other pretext tasks. The pre-training dataset is ImageNet-100 for all of these ablation studies.

1) Mixture Numbers of Global and Region Images. We first investigate the influence of the number of mixtures on global and region-level images with the MoCo method. The results are summarized in Table 1 and Table 2. Overall, mixing more than 2 images can obtain higher performance, which accords with our intuition. However, we observe that more is not always better: if the number of mixtures exceeds a threshold, the performance begins to decrease. In our experiments, 4 and 5 are optimal for the global and region levels, respectively.

2) Pixel Decay. The results are shown in Table 3. Our region-level flattening is fairly robust to a variety of decay values. In the table, a decay rate of 1.0 indicates that pixel decay is not applied during training, which is the baseline for our method. The best performance (76.8%) is obtained with a decay rate of 0.2. These results support our motivation of decaying context pixels to alleviate the disharmony in region-level flattening.

3) More Budget for Unsupervised Pre-training. We investigate the influence of the unsupervised training budget on the linear classification evaluation, which reflects the capability of the representations from the pre-trained models. Our results are shown in Figure 4, and we make three main observations. 1) More training epochs during pre-training can significantly improve the accuracy of linear classification (by 3% to 5%). We conjecture that this is because ImageNet-100 is relatively small with a limited number of images, so the pre-trained model needs more training budget to converge well. 2) The improvement from our method keeps increasing as more training budget is given in the unsupervised phase. This indicates that our generated training space is more informative and diverse, with larger capacity and potential, which further helps to avoid overfitting on small-scale training data. 3) The performance gap between MoCo and CMC decreases sharply as the training budget becomes larger. This observation indicates that our flattened training space can help different pretext methods reach performance saturation with the given architecture.

#Global 2 3 4 5 6 7
Acc. (%) 76.16 76.24 76.30 75.64 75.86 75.96
Table 1: Ablation results for different numbers of mixtures on global level. The base method is MoCo. The training budget is 400 epochs.
#Regions 2 3 4 5 6 7
Acc. (%) 76.02 76.16 75.56 76.36 75.66 75.44
Table 2: Ablation results for different numbers of mixtures on region level. The base method is MoCo. The training budget is 400 epochs.
Pixel Decay Rate 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.99 1.0
Acc. (%) 76.3 76.8 76.3 76.0 76.2 76.5 75.9 75.8 76.2 76.3 76.5
Table 3: Ablation results for different ratios of pixel decay. The base method is MoCo. The training budget is 400 epochs.
Figure 4: Linear classification accuracy with different pre-training budget on ImageNet-100. The dotted lines denote the baseline results and the solid lines are with our method. The red color denotes MoCo method and blue color is CMC.
Prob Mixture Accuracy (%)
baseline 78.7
0.1 G 79.4
0.1 G+R 79.7
0.2 G 79.1
0.2 G+R 79.4
0.3 G 78.6
0.3 G+R 79.0
0.4 G 77.7
0.4 G+R 78.2
Table 4: Sensitivity to different mixture probabilities on CMC with a training budget of 800 epochs. “G” is the global and “R” is the region-level mixture. Note that the global mixture number is set to 2 in this ablation study to speed up training, while we obtain the best performance of 80.0% with 4, as shown in Table 5.

4) Sensitivity. We study the influence of global flattening for a multi-view pretext task like CMC, as this kind of method essentially splits images into different color spaces with a global image transformation. We observe that our global flattening is consistently effective for it, but CMC is more sensitive than other pretext tasks such as instance discrimination. We tried different values of the probability that controls whether our global flattening is applied during training. The results are shown in Table 4; the training budget is 800 epochs for these experiments to make sure that the models are fully converged. The baseline result is 78.7%, obtained without flattening in training. The probabilities in the table refer to the global level; we use a fixed probability of 0.1 for region-level flattening in all experiments, since this experiment examines the sensitivity to global mixture. Generally, using both global and region-level flattening obtains higher performance. We found that the multi-view method prefers a small proportion of global flattening. However, for MoCo with the instance discrimination task, values of 0.5 or higher perform better.

The Effectiveness of Each Flattening Level. The results are summarized in Table 5. As the baselines with 240 epochs are not fully converged, we first train our models for 800 epochs to generate strong baselines, and then verify the effectiveness of each proposed component. Adding either global or region-level flattening independently yields a 1.0% to 1.3% improvement, and using both of them together further improves accuracy by 0.3% to 0.4%. We argue that the improvement is not trivial since the baseline is already very high. Moreover, based on the tendency of the curves in Figure 4, we can expect that more training budget would bring more improvement. As stated above, however, the focus of this paper is not to surpass the state-of-the-art performance but to validate the effectiveness of our method.

Summary and Best-Practices. As different pretext tasks usually behave differently in training, here we summarize several observations and suggestions for using our method in practice. Generally, either of our two mixture methods can improve the performance on its own, but our empirical study suggests that region-level flattening is slightly more effective than global-level flattening. We also found that MoCo (instance discrimination task) can use a relatively large probability of global mixture, while CMC (multi-view task) needs a small one, which indicates that CMC is more sensitive to global mixture. In some cases ImageNet-100 (a subset of ImageNet-1K) behaves slightly differently from the full ImageNet, but most phenomena and conclusions are consistent between them. Moreover, our ImageNet-1K settings are mainly inherited from the ImageNet-100 subset, as we do not have enough resources to explore hyper-parameters on the full ImageNet-1K directly. Thus, the settings on ImageNet-1K may not be optimal and could be improved further by simply adjusting the hyper-parameters.

Setting MoCo Acc. (%) CMC Acc. (%)
Baseline 73.4 75.7
+More Budget 78.2 78.7
+More Budget +Global 79.3 79.7
+More Budget +Region-level 79.5 79.6
+More Budget +Global +Region-level 79.8 80.0
Table 5: Ablation results on ImageNet-100 [tian2019contrastive] with MoCo [he2019momentum] and CMC [tian2019contrastive]. +More Budget indicates that we train the models for 800 epochs. We use #Negative=16384 as the number of negative sample pairs and Softmax-CE as our loss function for all the models. ResNet50 is adopted for MoCo; two ResNet50-half [tian2019contrastive] networks and YCbCr space are used for CMC in this ablation study. For the CMC baseline, we found that #Negative=4096 performs slightly better than 16384 when the number of training epochs is small (e.g., 240 epochs), so we report the better result as the baseline.

Comparison to State-of-the-art Methods. As we found that the training budget is crucial for unsupervised representation learning, for a fair comparison we report accuracy vs. #parameters vs. pre-training epochs as in [he2019momentum]. The results are shown in Table 6. Our proposed method consistently improves the baselines by more than 1.0% with the same training budget (59.5% to 60.8% for R50 and 64.6% to 65.9% for R50w).

| Architecture | Method | #Params (M) | Budget (#epochs) | Accuracy (%) |
|---|---|---|---|---|
| R50w | Exemplar [dosovitskiy2014discriminative] | 211 | | 46.0 |
| R50w | RelativePosition [doersch2015unsupervised] | 94 | | 51.4 |
| R50w | Jigsaw [noroozi2016unsupervised] | 94 | | 44.6 |
| Rv50w | Rotation [gidaris2018unsupervised] | 86 | | 55.4 |
| R101 | Colorization [zhang2016colorful] | 28 | | 39.6 [doersch2017multi] |
| VGG [simonyan2014very] | DeepCluster [caron2018deep] | 15 | | 48.4 [caron2019unsupervised] |
| R50 | BigBiGAN [donahue2019large] | 24 | | 56.6 |
| Rv50w | BigBiGAN [donahue2019large] | 86 | | 61.3 |
| Methods using contrastive learning: | | | | |
| R50 | InstDisc [wu2018unsupervised] | 24 | 200 | 54.0 |
| R50 | LocalAgg [zhuang2019local] | 24 | 200 | 58.8 |
| R50 | MoCo [he2019momentum] | 24 | 200 | 60.6 |
| R50w | MoCo [he2019momentum] | 94 | 200 | 65.4 |
| R50 | PIRL [misra2019self] | 24 | 800 | 63.6 |
| R101 | CPC v1 [oord2018representation] | 28 | | 48.7 |
| R170 | CPC v2 [henaff2019data] | 305 | | 71.5 |
| AMDIM | AMDIM [bachman2019learning] | 194 | 50 | 63.5 |
| AMDIM | AMDIM [bachman2019learning] | 626 | 150 | 68.1 |
| R50 | MoCo (re-implemented) | 24 | 200 | 59.5 |
| R50 | Ours | 24 | 200 | 60.8 |
| R50w | MoCo (re-implemented) | 94 | 200 | 64.6 |
| R50w | Ours | 94 | 200 | 65.9 |
Table 6: Comparison of linear classification accuracy on ImageNet-1K. All results are obtained by unsupervised pre-training on the ImageNet-1K training set, followed by supervised linear classification on fixed features, evaluated on the validation set. Table notes: (i) some methods use customized network structures; (ii) some results are taken from [kolesnikov2019revisiting]; (iii) some methods use FastAutoAugment policies [lim2019fast] during pre-training, which are searched with ImageNet labels; (iv) our MoCo code base is re-implemented from CMC, and we found it to be 0.8 to 1.0% lower than the results reported in MoCo.
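As a quick check of the comparison under equal budget, the gains over our re-implemented MoCo baselines can be recomputed from the last four rows of Table 6. The snippet below is only a worked-arithmetic sketch; the dictionary layout is an illustrative choice, and the accuracy values are copied from the table.

```python
# (baseline accuracy, our accuracy) in %, both pre-trained for 200 epochs (Table 6).
rows = {
    "R50": (59.5, 60.8),
    "R50w": (64.6, 65.9),
}

# Absolute gain in percentage points, rounded to one decimal like the table.
gains = {arch: round(ours - base, 1) for arch, (base, ours) in rows.items()}

for arch, gain in gains.items():
    print(f"{arch}: +{gain} points")
```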
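| pre-train | AP | Recall |
|---|---|---|
| random init. | | |
| MoCo | 75.1 | 90.2 |
| Ours (#Negative=16384) | 75.8 | 90.7 |
| Ours (#Negative=65536) | 76.4 | 90.5 |

(a) Faster R-CNN, R50-FPN, 12 epochs, No Freeze

| pre-train | AP | Recall |
|---|---|---|
| random init. | | |
| MoCo | 72.7 | 92.3 |
| Ours (#Negative=16384) | 73.3 | 92.8 |
| Ours (#Negative=65536) | 73.8 | 93.3 |

(b) Faster R-CNN, R50-FPN, 12 epochs, Freeze BN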
Table 9: Object detection fine-tuned on PASCAL VOC trainval07+12. Evaluation is on test2007; AP and Recall are the default VOC metrics. All models are fine-tuned for 12 epochs. “Freeze BN” indicates that we fix all BN parameters in the network during training, as is done for the supervised pre-training counterpart. The MoCo model is trained with #Negative=16384. The result of random initialization is from [he2019momentum] with the R50-dilated-C5 architecture and 18 training epochs.
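| pre-train | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| MoCo | 30.7 | 48.2 | 32.7 | 15.7 | 33.5 | 40.7 |
| Ours (#Negative=16384) | 31.8 | 49.8 | 34.2 | 16.7 | 34.7 | 42.2 |
| Ours (#Negative=65536) | 31.9 | 50.2 | 34.1 | 16.5 | 35.0 | 41.7 |

(a) RetinaNet, R50-FPN, 12 epochs, Freeze BN

| pre-train | AP^bb | AP^bb_50 | AP^bb_75 | AP^mk | AP^mk_50 | AP^mk_75 |
|---|---|---|---|---|---|---|
| MoCo | 31.2 | 51.3 | 33.3 | 29.1 | 48.2 | 30.6 |
| Ours (#Negative=16384) | 32.8 | 53.3 | 35.1 | 30.3 | 50.1 | 32.1 |
| Ours (#Negative=65536) | 33.7 | 54.4 | 36.5 | 31.1 | 51.3 | 32.7 |

(b) Mask R-CNN, R50-FPN, 12 epochs, Freeze BN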
Table 12: Object detection and instance segmentation fine-tuned on COCO: bounding-box AP (AP^bb) and mask AP (AP^mk) evaluated on val2017. The MoCo model is trained with #Negative=16384.

4.3 Downstream Tasks

In this section, we evaluate the transferability of our learned representations on object detection and instance segmentation. We use PASCAL VOC [everingham2010pascal] and COCO [lin2014microsoft] as our benchmarks and closely follow the transfer learning setup of prior work [he2019momentum, wu2018unsupervised, misra2019self]. We use Faster R-CNN [ren2015faster], RetinaNet [lin2017focal] and Mask R-CNN [he2017mask] as implemented in mmdetection with a ResNet-50 backbone, because we found that Detectron2 [wu2019detectron2] (used in [he2019momentum, misra2019self]) does not support torchvision models. Our baseline is a downloaded MoCo model trained with #Negative=16384; for a fair comparison, we evaluate our model with both 16384 and 65536 negatives. We first pre-train the ResNet-50 with our method and then initialize the detection models with the learned parameters. As recommended by [he2019momentum, misra2019self], we run experiments both with fine-tuning all BN layers to calibrate the distributions and with freezing all BN parameters during training. Unlike [he2019momentum], however, we do not use BN in the neck structure (newly initialized layers) such as FPN [lin2017feature].
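The two BN regimes can be written as an mmdetection-style backbone configuration. The sketch below is an assumption about how the settings might be expressed, not a copy of our configuration files: the helper name `backbone_cfg` is hypothetical, and the mapping of “Freeze BN” to `norm_eval=True` together with `requires_grad=False` is one plausible reading of "fix all BN parameters".

```python
def backbone_cfg(freeze_bn):
    """Return a ResNet-50 backbone dict in the style of mmdetection configs.

    freeze_bn=True  -> BN affine parameters are not updated and BN layers stay
                       in eval mode, so running statistics are kept fixed.
    freeze_bn=False -> BN layers are fine-tuned together with the other layers.
    """
    return dict(
        type="ResNet",
        depth=50,
        norm_cfg=dict(type="BN", requires_grad=not freeze_bn),
        norm_eval=freeze_bn,
    )


no_freeze = backbone_cfg(freeze_bn=False)   # hypothetical "No Freeze" setting
frozen = backbone_cfg(freeze_bn=True)       # hypothetical "Freeze BN" setting
```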

PASCAL VOC Object Detection. We train our models on the trainval07+12 split and evaluate on VOC test2007, following [wu2018unsupervised, he2019momentum, misra2019self]. We use an image size of [1000, 600] pixels for both training and testing, which is the default setting in mmdetection. All models are evaluated with the default VOC metrics of AP and recall. Results are shown in Table 9, where the MoCo baseline is trained with #Negative=16384. Our results are consistently better than the baseline under both the “No Freeze” and “Freeze BN” scenarios.
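To make the image-size setting concrete, the following sketch shows how a [1000, 600] target is commonly applied as an aspect-preserving bound (long side at most 1000, short side at most 600). That the size is used in this keep-ratio manner is an assumption about the default pipeline, and the function name `rescale_keep_ratio` is hypothetical.

```python
def rescale_keep_ratio(width, height, max_long=1000, max_short=600):
    """Return (new_width, new_height) that fits the bound without distortion.

    The scale is the largest factor for which the long side does not exceed
    max_long and the short side does not exceed max_short.
    """
    scale = min(max_long / max(width, height), max_short / min(width, height))
    # Round to the nearest integer pixel size.
    return int(width * scale + 0.5), int(height * scale + 0.5)


# A 500x375 image is limited by its short side; a 2000x500 image by its long side.
print(rescale_keep_ratio(500, 375))
print(rescale_keep_ratio(2000, 500))
```

The same helper covers the COCO setting by passing `max_long=1333, max_short=800`.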

COCO Object Detection and Segmentation. We train RetinaNet [lin2017focal] and Mask R-CNN [he2017mask] models with the FPN [lin2017feature] architecture. We use an image size of [1333, 800] pixels for training all our models. We fine-tune on COCO train2017 and evaluate on the val2017 split. The total training budget is 12 epochs. The initial learning rate is set to 0.02 and the rest of the schedule follows the mmdetection default setting. Our results are shown in Table 12. We fine-tune the parameters of all layers but freeze the BN parameters, because we found that unfreezing them makes training unstable and because our main purpose is to verify the better transferability of our learned representations. For RetinaNet, we report bounding-box AP, AP_50, AP_75 and AP_S, AP_M, AP_L; for Mask R-CNN, we report both bounding-box and mask evaluations. Our results are consistently better than the baseline.
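The 12-epoch budget with an initial learning rate of 0.02 can be illustrated with a step schedule. The decay points (after epochs 8 and 11) and the decay factor of 0.1 are assumptions based on the customary 12-epoch detection schedule, and warmup is omitted; the text above only fixes the initial learning rate and the total budget.

```python
def step_lr(epoch, base_lr=0.02, milestones=(8, 11), gamma=0.1):
    """Learning rate used during `epoch` (0-indexed) under a step schedule.

    milestones and gamma are assumed values, not settings stated in the paper.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate for each of the 12 fine-tuning epochs.
schedule = [step_lr(e) for e in range(12)]
```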

4.4 Visualizations of Flattening Predictions

Because our model is learned from mixtures of diverse images, it is less confident on the original images and should produce flatter distributions when the original images are fed in. To verify this assumption, we visualize the prediction distributions of our model and MoCo. In Figure 5, we plot the sorted distribution of the 2048-dimensional outputs from the last layer of ResNet-50. The blue and red dotted lines denote the maximum prediction probabilities of our model and MoCo, respectively. Our model has a lower maximum than MoCo, which supports the assumption above.

Figure 5: 2048-dimensional predictions from unsupervised pre-trained ResNet-50 on the ImageNet-1K val set. Our model yields a flatter distribution than MoCo when the same image is fed into the two contrastively pre-trained models.
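The quantities behind this kind of plot can be sketched in a few lines. The example assumes that the output vector is normalized with a softmax before sorting, which is our reading of "maximum probabilities" and not a stated detail; the entropy is an extra flatness measure added for illustration, and the function names and toy inputs are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def flatness_stats(outputs):
    """Return (sorted probabilities, maximum probability, entropy in nats)."""
    probs = sorted(softmax(outputs), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return probs, probs[0], entropy


# Toy 4-dim outputs standing in for the 2048-dim vectors.
_, peaked_max, peaked_entropy = flatness_stats([5.0, 0.0, 0.0, 0.0])
_, flat_max, flat_entropy = flatness_stats([1.0, 0.5, 0.5, 0.0])
```

A flatter prediction shows up as a lower maximum probability and a higher entropy, which is the pattern Figure 5 reports for our model relative to MoCo.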

5 Conclusion

We have investigated the use of image mixture for flattening predictions in unsupervised learning. Through a variety of experiments on linear classification, object detection and instance segmentation, we have shown that neural networks trained on our newly generated input space have better representation capability in terms of generalization and transferability, as well as better robustness across different pretext tasks. We also show that a larger training budget is crucial for unsupervised tasks. Since the proposed method is easy to implement and incurs minimal additional computational cost, we hope it can be a useful tool for the unsupervised learning problem.



Appendix A ImageNet-100 Category List from CMC [tian2019contrastive]

n02869837 n01749939 n02488291 n02107142 n13037406 n02091831 n04517823 n04589890 n03062245 n01773797 n01735189 n07831146 n07753275 n03085013 n04485082 n02105505 n01983481 n02788148 n03530642 n04435653 n02086910 n02859443 n13040303 n03594734 n02085620 n02099849 n01558993 n04493381 n02109047 n04111531 n02877765 n04429376 n02009229 n01978455 n02106550 n01820546 n01692333 n07714571 n02974003 n02114855 n03785016 n03764736 n03775546 n02087046 n07836838 n04099969 n04592741 n03891251 n02701002 n03379051 n02259212 n07715103 n03947888 n04026417 n02326432 n03637318 n01980166 n02113799 n02086240 n03903868 n02483362 n04127249 n02089973 n03017168 n02093428 n02804414 n02396427 n04418357 n02172182 n01729322 n02113978 n03787032 n02089867 n02119022 n03777754 n04238763 n02231487 n03032252 n02138441 n02104029 n03837869 n03494278 n04136333 n03794056 n03492542 n02018207 n04067472 n03930630 n03584829 n02123045 n04229816 n02100583 n03642806 n04336792 n03259280 n02116738 n02108089 n03424325 n01855672 n02090622
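A list like the one above can be used to carve the subset out of an ImageNet-1K directory tree in which each class has a folder named by its WordNet ID. The sketch below assumes that folder convention; the function names are hypothetical, and only the first three IDs of the list are included to keep the example short.

```python
def parse_wnids(text):
    """Split a whitespace-separated ID list and check the 'n' + 8 digits format."""
    ids = text.split()
    bad = [i for i in ids if not (len(i) == 9 and i[0] == "n" and i[1:].isdigit())]
    if bad:
        raise ValueError(f"malformed WordNet IDs: {bad}")
    return ids


def select_subset(class_dirs, wnids):
    """Keep only the class folders whose names appear in the ID list."""
    wanted = set(wnids)
    return sorted(d for d in class_dirs if d in wanted)


# First three IDs of the list above; a full run would paste all 100.
subset_ids = parse_wnids("n02869837 n01749939 n02488291")

# Hypothetical folder names: three from the subset and one outside it.
folders = ["n01749939", "n09999999", "n02488291", "n02869837"]
kept = select_subset(folders, subset_ids)
```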