Prototypical Contrastive Learning of Unsupervised Representations

05/11/2020 ∙ by Junnan Li, et al.

This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of the popular instance-wise contrastive learning. PCL implicitly encodes semantic structures of the data into the learned embedding space, and prevents the network from solely relying on low-level cues for solving unsupervised learning tasks. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework. We iteratively perform the E-step as finding the distribution of prototypes via clustering and the M-step as optimizing the network via contrastive learning. We propose ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes. PCL achieves state-of-the-art results on multiple unsupervised representation learning benchmarks, with >10% accuracy improvement in low-resource transfer learning.


1 Introduction

Unsupervised visual representation learning aims to learn image representations from the pixels themselves without relying on semantic annotations, and recent advances are largely driven by instance discrimination tasks Wu et al. (2018); Ye et al. (2019); He et al. (2020); Misra and van der Maaten (2019); Hjelm et al. (2019); Oord et al. (2018); Tian et al. (2019). These methods usually consist of two key components: image transformation and contrastive loss. Image transformation aims to generate multiple embeddings that represent the same image, by data augmentation Ye et al. (2019); Hjelm et al. (2019); Chen et al. (2020a), patch perturbation Misra and van der Maaten (2019), or using momentum features He et al. (2020). The contrastive loss, in the form of a noise contrastive estimator Gutmann and Hyvärinen (2010), aims to bring samples from the same instance closer and to separate samples from different instances. Essentially, instance-wise contrastive learning leads to an embedding space where all instances are well-separated, and each instance is locally smooth (i.e. inputs with perturbations have similar representations).

Despite their improved performance, instance discrimination based methods share a common fundamental weakness: semantic structure of data is not encoded by the learned representations. This problem arises because instance-wise contrastive learning treats two samples as a negative pair as long as they are from different instances, regardless of their semantic similarity. This is magnified by the fact that thousands of negative samples are generated to form the contrastive loss, leading to many negative pairs that share similar semantics but are undesirably pushed apart in the embedding space.

Figure 1: Illustration of Prototypical Contrastive Learning. Each instance is assigned to multiple prototypes with different granularity. PCL learns an embedding space which encodes the semantic structure of data.

In this paper, we propose prototypical contrastive learning (PCL), a new framework for unsupervised representation learning that implicitly encodes the semantic structure of data into the embedding space. Figure 1 shows an illustration of PCL. A prototype is defined as “a representative embedding for a group of semantically similar instances”. We assign several prototypes of different granularity to each instance, and construct a contrastive loss which enforces the embedding of a sample to be more similar to its corresponding prototypes compared to other prototypes. In practice, we can find prototypes by performing standard clustering on the embeddings.

We formulate prototypical contrastive learning as an Expectation-Maximization (EM) algorithm, where the goal is to find the parameters of a Deep Neural Network (DNN) that best describe the data distribution, by iteratively approximating and maximizing the log-likelihood function. Specifically, we introduce prototypes as additional latent variables, and estimate their probability in the E-step by performing k-means clustering. In the M-step, we update the network parameters by minimizing our proposed contrastive loss, namely ProtoNCE. We show that minimizing ProtoNCE is equivalent to maximizing the estimated log-likelihood, under the assumption that the data distribution around each prototype is an isotropic Gaussian. Under the EM framework, the widely used instance discrimination task can be explained as a special case of prototypical contrastive learning, where the prototype for each instance is its augmented feature, and the Gaussian distribution around each prototype has the same fixed variance. The contributions of this paper can be summarized as follows:

  • We propose prototypical contrastive learning, a novel framework for unsupervised representation learning. The learned representation not only preserves the local smoothness of each image instance, but also captures the hierarchical semantic structure of the global dataset.

  • We give a theoretical framework that formulates PCL as an Expectation-Maximization (EM) based algorithm. The iterative steps of clustering and representation learning can be interpreted as approximating and maximizing the log-likelihood function. The previous methods based on instance discrimination form a special case in the proposed EM framework.

  • We propose ProtoNCE, a new contrastive loss which improves the widely used InfoNCE Wu et al. (2018); He et al. (2020); Oord et al. (2018); Chen et al. (2020a) by dynamically estimating the concentration for the feature distribution around each prototype. We provide explanations for PCL from an information theory perspective, by showing that the learned prototypes contain more information about the image classes.

  • PCL sets a new state-of-the-art for unsupervised representation learning on multiple benchmarks.

2 Related Work

Our work is closely related to two main branches of studies in unsupervised/self-supervised learning: instance-wise contrastive learning and deep unsupervised clustering.

Instance-wise Contrastive Learning. At the core of state-of-the-art unsupervised representation learning algorithms Wu et al. (2018); Ye et al. (2019); He et al. (2020); Misra and van der Maaten (2019); Zhuang et al. (2019); Hjelm et al. (2019); Oord et al. (2018); Tian et al. (2019); Chen et al. (2020a), instance-wise contrastive learning aims to learn an embedding space where samples (e.g. crops) from the same instance (e.g. an image) are pulled closer and samples from different instances are pushed apart. To construct the contrastive loss for a mini-batch of samples, positive and negative instance features are generated for each sample. Contrastive learning methods differ in their strategy for generating instance features. The memory bank approach Wu et al. (2018) stores the features of all samples calculated in the previous step, and selects from the memory bank to form positive and negative pairs. The end-to-end approach Ye et al. (2019); Tian et al. (2019); Chen et al. (2020a) generates instance features using all samples within the current mini-batch, and applies the same encoder to both the original samples and their augmented versions. Recently, the momentum encoder approach (MoCo) He et al. (2020) was proposed, which encodes samples on-the-fly with a momentum-updated encoder and maintains a queue of instance features.

Despite their improved performance, the existing methods based on instance-wise contrastive learning have the following two major weaknesses, which can be overcome by the proposed PCL framework.

  • The task of instance discrimination could be solved by exploiting low-level image differences, so the learned embeddings do not necessarily capture high-level semantics. This is supported by the fact that the accuracy of instance classification often rises rapidly to a high level (>90% within 10 epochs) and further training gives limited informative signals. A recent study also shows that better performance on instance discrimination can worsen the performance on downstream tasks Tschannen et al. (2020).

  • A sufficiently large number of negative instances needs to be sampled, which inevitably yields negative pairs that share similar semantic meaning and should be closer in the embedding space, but are instead undesirably pushed apart by the contrastive loss. This problem is referred to as class collision in Saunshi et al. (2019) and is shown to hurt representation learning. Essentially, instance discrimination learns an embedding space that only preserves the local smoothness around each instance but largely ignores the global semantic structure of the dataset.

Deep Unsupervised Clustering. Clustering-based methods have been proposed for deep unsupervised learning. Approaches in Xie et al. (2016); Yang et al. (2016); Liao et al. (2016); Yang et al. (2017); Chang et al. (2017); Ji et al. (2019) jointly learn image embeddings and cluster assignments, but they have not shown the ability to learn transferable representations from large-scale image datasets. Closer to our work, DeepCluster Caron et al. (2018) learns from millions of images by performing iterative clustering and unsupervised representation learning. However, our method is conceptually different from DeepCluster. In DeepCluster, the cluster assignments are treated as pseudo-labels and a classification objective is optimized, which results in two weaknesses: (1) the high-dimensional features from the penultimate layer of a ConvNet are not optimal for clustering and need to be PCA-reduced; (2) an additional linear classification layer is frequently re-initialized, which interferes with representation learning. In our method, representation learning happens directly in a low-dimensional embedding space, by optimizing a contrastive loss on the prototypes (cluster centroids). This frees our method from the computationally expensive linear layer, and enables a much larger number of clusters in which each instance is assigned to multiple clusters of different granularity.

Self-supervised Pretext Tasks. Another line of self-supervised learning methods focuses on training DNNs to solve pretext tasks that lead to good image representations. These tasks usually involve hiding certain information about the input and training the network to recover that missing information. Examples include image inpainting Pathak et al. (2016), colorization Zhang et al. (2016, 2017), prediction of patch orderings Doersch et al. (2015); Noroozi and Favaro (2016), and image transformations Dosovitskiy et al. (2014); Gidaris et al. (2018); Caron et al. (2019); Zhang et al. (2019). However, most of these pretext tasks exploit specific structures of visual data, making them harder to generalize to other domains. The proposed PCL is a more general learning framework with better theoretical justification. Furthermore, PCL can incorporate the pretext tasks (e.g. Jigsaw Noroozi and Favaro (2016) or Rotation Gidaris et al. (2018)) as a form of image transformation, which could potentially lead to improved performance.

3 Prototypical Contrastive Learning

3.1 Preliminaries

Given a training set $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ images, unsupervised visual representation learning aims to learn an embedding function $f_\theta$ (realized via a DNN) that maps $X$ to $V = \{v_1, v_2, \ldots, v_n\}$ with $v_i = f_\theta(x_i)$, such that $v_i$ best describes $x_i$. Instance-wise contrastive learning achieves this objective by optimizing a contrastive loss function, such as InfoNCE Oord et al. (2018); He et al. (2020), defined as:

$$\mathcal{L}_{\mathrm{InfoNCE}} = \sum_{i=1}^{n} -\log \frac{\exp(v_i \cdot v_i' / \tau)}{\sum_{j=0}^{r} \exp(v_i \cdot v_j' / \tau)} \qquad (1)$$

where $v_i'$ is a positive embedding for instance $i$, $\{v_j'\}$ includes one positive embedding and $r$ negative embeddings for other instances, and $\tau$ is a temperature hyper-parameter. In MoCo He et al. (2020), these embeddings are obtained by feeding $x_i$ to a momentum encoder parametrized by $\theta'$, $v_i' = f_{\theta'}(x_i)$, where $\theta'$ is a moving average of $\theta$.
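
To make Equation 1 concrete, the following PyTorch sketch computes an InfoNCE loss for a batch of queries with one positive each and a shared pool of negatives; tensor names, shapes, and the temperature value are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(v, v_pos, v_neg, tau=0.07):
    """InfoNCE loss for l2-normalized embeddings.

    v:     (B, D) query embeddings v_i
    v_pos: (B, D) positive embeddings v_i' (e.g. from a momentum encoder)
    v_neg: (r, D) negative embeddings from other instances
    """
    l_pos = torch.einsum('bd,bd->b', v, v_pos).unsqueeze(1)   # (B, 1) positive similarities
    l_neg = torch.einsum('bd,rd->br', v, v_neg)               # (B, r) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau           # (B, 1+r)
    labels = torch.zeros(v.size(0), dtype=torch.long, device=v.device)
    return F.cross_entropy(logits, labels)                    # -log softmax of the positive

# toy usage
B, r, D = 4, 16, 128
v = F.normalize(torch.randn(B, D), dim=1)
v_pos = F.normalize(v + 0.1 * torch.randn(B, D), dim=1)
v_neg = F.normalize(torch.randn(r, D), dim=1)
print(info_nce(v, v_pos, v_neg).item())
```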

In prototypical contrastive learning, we use prototypes $c$ instead of $v'$, and we replace the fixed temperature $\tau$ with a per-prototype concentration estimation $\phi$. An overview of our training framework is shown in Figure 2, where clustering and representation learning are performed iteratively at each epoch. Next, we delineate the theoretical framework of PCL based on EM. Pseudo-code of our algorithm is given in Appendix B.

Figure 2: Training framework of Prototypical Contrastive Learning.

3.2 PCL as Expectation-Maximization

Our objective is to find the network parameters $\theta$ that maximize the log-likelihood function of the observed $n$ samples:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i; \theta) \qquad (2)$$

We assume that the observed data $\{x_i\}_{i=1}^{n}$ are related to latent variable $C = \{c_i\}_{i=1}^{k}$, which denotes the prototypes of the data. In this way, we can re-write the log-likelihood function as:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log \sum_{c_i \in C} p(x_i, c_i; \theta) \qquad (3)$$

It is hard to optimize this function directly, so we use a surrogate function to lower-bound it:

$$\sum_{i=1}^{n} \log \sum_{c_i \in C} Q(c_i) \frac{p(x_i, c_i; \theta)}{Q(c_i)} \;\geq\; \sum_{i=1}^{n} \sum_{c_i \in C} Q(c_i) \log \frac{p(x_i, c_i; \theta)}{Q(c_i)} \qquad (4)$$

where $Q(c_i)$ denotes some distribution over $c$'s ($\sum_{c_i \in C} Q(c_i) = 1$), and the last step of the derivation uses Jensen's inequality. To make the inequality hold with equality, we require $\frac{p(x_i, c_i; \theta)}{Q(c_i)}$ to be a constant. Therefore, we have:

$$Q(c_i) = \frac{p(x_i, c_i; \theta)}{\sum_{c_i \in C} p(x_i, c_i; \theta)} = \frac{p(x_i, c_i; \theta)}{p(x_i; \theta)} = p(c_i; x_i, \theta) \qquad (5)$$

By ignoring the constant $-\sum_{i=1}^{n} \sum_{c_i \in C} Q(c_i) \log Q(c_i)$ in Equation 4, we should maximize:

$$\sum_{i=1}^{n} \sum_{c_i \in C} Q(c_i) \log p(x_i, c_i; \theta) \qquad (6)$$

E-step. In this step, we aim to estimate $p(c_i; x_i, \theta)$. To this end, we perform $k$-means clustering on the features $v_i' = f_{\theta'}(x_i)$ given by the momentum encoder to obtain $k$ clusters. We define prototype $c_i$ as the centroid of the $i$-th cluster. Then, we compute $p(c_i; x_i, \theta) = \mathbb{1}(x_i \in c_i)$, where $\mathbb{1}(x_i \in c_i) = 1$ if $x_i$ belongs to the cluster represented by $c_i$, and $0$ otherwise. Similar to MoCo He et al. (2020), we found that features from the momentum encoder yield more consistent clusters.
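
Below is a minimal sketch of this E-step, assuming momentum features for the whole training set have already been extracted; it uses the faiss k-means wrapper (the clustering tool adopted in Section 3.6) to obtain prototypes and hard assignments, so that $p(c_i; x_i, \theta)$ is 1 for the assigned cluster and 0 otherwise. Function and variable names are placeholders.

```python
import numpy as np
import faiss

def e_step(features, k):
    """Cluster momentum features into k prototypes (E-step sketch).

    features: (N, D) float32, l2-normalized momentum features v'
    returns:  l2-normalized centroids (k, D) and hard assignments (N,)
    """
    n, d = features.shape
    kmeans = faiss.Kmeans(d, k, niter=20, seed=0, gpu=False)
    kmeans.train(features)
    _, assignments = kmeans.index.search(features, 1)   # nearest centroid per sample
    centroids = kmeans.centroids.reshape(k, d)
    # l2-normalize prototypes so that dot products are cosine similarities
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids.astype(np.float32), assignments.squeeze(1)

# toy usage with random features
feats = np.random.randn(1000, 128).astype(np.float32)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos, assign = e_step(feats, k=10)
```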

M-step. Based on the E-step, we are ready to maximize the lower bound in Equation 6:

$$\sum_{i=1}^{n} \sum_{c_i \in C} Q(c_i) \log p(x_i, c_i; \theta) = \sum_{i=1}^{n} \sum_{c_i \in C} \mathbb{1}(x_i \in c_i) \log p(x_i, c_i; \theta) \qquad (7)$$

Under the assumption of a uniform prior over cluster centroids, we have:

$$p(x_i, c_i; \theta) = p(x_i; c_i, \theta)\, p(c_i; \theta) = \frac{1}{k} \cdot p(x_i; c_i, \theta) \qquad (8)$$

where we set the prior probability $p(c_i; \theta)$ for each $c_i$ to $\frac{1}{k}$, since no prior information about the clusters is available.

We assume that the distribution around each prototype is an isotropic Gaussian, which leads to:

$$p(x_i; c_i, \theta) = \exp\!\left(\frac{-(v_i - c_s)^2}{2\sigma_s^2}\right) \Big/ \sum_{j=1}^{k} \exp\!\left(\frac{-(v_i - c_j)^2}{2\sigma_j^2}\right) \qquad (9)$$

where $v_i = f_\theta(x_i)$ and $x_i \in c_s$. If we apply $\ell_2$-normalization to both $v$ and $c$, then $(v - c)^2 = 2 - 2\,v \cdot c$. Combining this with Equations 3, 4, 6, 7, 8, and 9, we can write the maximum log-likelihood estimation as

$$\theta^* = \arg\min_\theta \sum_{i=1}^{n} -\log \frac{\exp(v_i \cdot c_s / \phi_s)}{\sum_{j=1}^{k} \exp(v_i \cdot c_j / \phi_j)} \qquad (10)$$

where $\phi$ denotes the concentration level of the feature distribution around a prototype and will be introduced later. Note that Equation 10 has a similar form as the InfoNCE loss in Equation 1. Therefore, InfoNCE can be interpreted as a special case of this maximum log-likelihood estimation, where the prototype for a feature is the augmented feature from the same instance (i.e. $c = v'$), and the concentration of the feature distribution around each instance is fixed (i.e. $\phi = \tau$).

In practice, we take the same approach as NCE and sample $r$ negative prototypes to calculate the normalization term. We also cluster the samples $M$ times with different numbers of clusters $K = \{k_m\}_{m=1}^{M}$, which gives a more robust probability estimation of prototypes that encode the hierarchical structure. Furthermore, we add the InfoNCE loss to retain the property of local smoothness and to help bootstrap clustering. Our overall objective, namely ProtoNCE, is defined as

$$\mathcal{L}_{\mathrm{ProtoNCE}} = \sum_{i=1}^{n} -\left( \log \frac{\exp(v_i \cdot v_i' / \tau)}{\sum_{j=0}^{r} \exp(v_i \cdot v_j' / \tau)} + \frac{1}{M} \sum_{m=1}^{M} \log \frac{\exp(v_i \cdot c_s^m / \phi_s^m)}{\sum_{j=0}^{r} \exp(v_i \cdot c_j^m / \phi_j^m)} \right) \qquad (11)$$
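
A compact PyTorch sketch of a ProtoNCE-style objective in the form of Equation 11 is given below: the InfoNCE instance term plus an average over M prototype terms, each scaled by its own concentration. For brevity, all prototypes appear in the denominator instead of r sampled negatives, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def proto_nce(v, v_pos, v_neg, proto_list, phi_list, assign_list, tau=0.07):
    """ProtoNCE-style objective (sketch of Equation 11).

    v, v_pos: (B, D) query / positive embeddings; v_neg: (r, D) negatives -> InfoNCE term
    proto_list[m]:  (k_m, D) l2-normalized prototypes of clustering m
    phi_list[m]:    (k_m,)   concentration estimate per prototype
    assign_list[m]: (B,)     index of the prototype assigned to each sample
    """
    # instance-wise term (identical to InfoNCE)
    l_pos = torch.einsum('bd,bd->b', v, v_pos).unsqueeze(1)
    l_neg = torch.einsum('bd,rd->br', v, v_neg)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    target = torch.zeros(v.size(0), dtype=torch.long, device=v.device)
    loss = F.cross_entropy(logits, target)

    # prototype terms, averaged over the M clusterings
    for protos, phi, assign in zip(proto_list, phi_list, assign_list):
        proto_logits = torch.mm(v, protos.t()) / phi   # per-prototype concentration scaling
        loss = loss + F.cross_entropy(proto_logits, assign) / len(proto_list)
    return loss
```

In this sketch, the prototypes, concentrations, and assignments are assumed to come from the E-step of the current epoch.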

3.3 Concentration Estimation

The distribution of embeddings around each prototype has a different level of concentration. We use $\phi$ to denote the concentration estimation, where a smaller $\phi$ indicates larger concentration. Here we calculate $\phi$ using the momentum features $\{v_z'\}_{z=1}^{Z}$ that are within the same cluster as a prototype $c$. The desired $\phi$ should be small (high concentration) if (1) the average distance between $v_z'$ and $c$ is small, and (2) the cluster contains more feature points (i.e. $Z$ is large). Therefore, we define $\phi$ as:

$$\phi = \frac{\sum_{z=1}^{Z} \lVert v_z' - c \rVert_2}{Z \log(Z + \alpha)} \qquad (12)$$

where $\alpha$ is a smoothing parameter that ensures small clusters do not have an overly large $\phi$. We normalize $\phi$ for each set of prototypes $C^m$ such that they have a mean of $\tau$.
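
A NumPy sketch of this estimate is shown below, assuming l2-normalized momentum features and prototypes; the α and τ values, and the small clamp on degenerate clusters, are assumptions for illustration.

```python
import numpy as np

def estimate_phi(features, assignments, centroids, alpha=10.0, tau=0.07):
    """Per-prototype concentration (Equation 12), normalized to have mean tau.

    features:    (N, D) l2-normalized momentum features v'
    assignments: (N,)   cluster index of each feature
    centroids:   (k, D) l2-normalized prototypes c
    """
    k = centroids.shape[0]
    phi = np.zeros(k, dtype=np.float32)
    for c in range(k):
        members = features[assignments == c]                  # momentum features in cluster c
        z = max(len(members), 1)
        dist = np.linalg.norm(members - centroids[c], axis=1).sum()
        phi[c] = dist / (z * np.log(z + alpha))                # Equation 12
    phi = np.maximum(phi, 1e-6)                                # guard empty/singleton clusters (assumption)
    return phi * (tau / phi.mean())                            # normalize so that mean(phi) == tau
```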

In the ProtoNCE loss (Equation 11), $\phi_s^m$ acts as a scaling factor on the similarity between an embedding $v_i$ and its prototype $c_s^m$. With the proposed $\phi$, the similarities in a loose cluster (larger $\phi$) are down-scaled, pulling embeddings closer to the prototype. Conversely, embeddings in a tight cluster (smaller $\phi$) have an up-scaled similarity, and are thus less encouraged to approach the prototype. Therefore, learning with ProtoNCE yields more balanced clusters with similar concentration, as shown in Figure 3(a). It prevents a trivial solution where most embeddings collapse to a single cluster, a problem that could only be heuristically addressed by data re-sampling in DeepCluster Caron et al. (2018).

Figure 3: (a) Histogram of cluster sizes for PCL (number of clusters k = 50000) with fixed or estimated concentration. Using a different $\phi$ for each prototype yields more balanced clusters with similar sizes, which leads to better representation learning. (b) Mutual information between instance features (or their assigned prototypes) and the class labels of all images in ImageNet. Compared to InfoNCE, our ProtoNCE learns better prototypes with more semantics.

3.4 Mutual Information Analysis

It has been shown that minimizing InfoNCE maximizes a lower bound on the mutual information (MI) between representations $V$ and $V'$ Oord et al. (2018). Similarly, minimizing the proposed ProtoNCE can be considered as simultaneously maximizing the mutual information between $V$ and all the prototypes $\{V', C^1, \ldots, C^M\}$. This leads to better representation learning, for two reasons.

First, the encoder would learn the shared information among prototypes, and ignore the individual noise that exists in each prototype. The shared information is more likely to capture higher-level semantic knowledge. Second, we show that compared to instance features, prototypes have a larger mutual information with the class labels. We estimate the mutual information between the instance features (or their assigned prototypes) and the ground-truth class labels for all images in the ImageNet Deng et al. (2009) training set, following the method in Ross (2014). We compare the obtained MI of our method (ProtoNCE) and that of MoCo He et al. (2020) (InfoNCE). As shown in Figure 3(b), compared to instance features, the prototypes have a larger MI with the class labels due to the effect of clustering. Furthermore, compared to InfoNCE, training on ProtoNCE can increase the MI of prototypes as training proceeds, indicating that better representations are learned to form more semantically-meaningful clusters.
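
As a rough illustration, the mutual information between the (discrete) prototype assignments and the class labels can be computed with scikit-learn as below; for continuous instance features the analysis above instead relies on the estimator of Ross (2014). The arrays here are random placeholders.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# MI between (discrete) prototype assignments and ground-truth class labels
proto_assignments = np.random.randint(0, 50, size=10000)   # placeholder cluster ids
class_labels = np.random.randint(0, 10, size=10000)        # placeholder ImageNet labels
mi = mutual_info_score(class_labels, proto_assignments)
print(f"MI(prototype, label) = {mi:.4f} nats")
```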

3.5 Prototypes as Linear Classifier

Another interpretation of PCL provides more insight into the nature of the learned prototypes. The optimization in Equation 10 is similar to optimizing the cluster-assignment probability $p(s; x_i, \theta)$ with a cross-entropy loss, where the prototypes $c$ represent the weights of a linear classifier. With $k$-means clustering, this linear classifier has a fixed set of weights given by the mean vectors of the representations in each cluster, $c = \frac{1}{Z}\sum_{z=1}^{Z} v_z'$. A similar idea has been used for few-shot learning Snell et al. (2017), where a non-parametric prototypical classifier performs better than a parametric linear classifier.
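
For intuition, the sketch below builds such a non-parametric prototypical classifier: class means of l2-normalized features serve as fixed linear-classifier weights, and prediction is a nearest-prototype (maximum dot-product) rule. This illustrates the interpretation only and is not part of PCL's training procedure.

```python
import numpy as np

def fit_prototypes(features, labels):
    """Class means of l2-normalized features act as linear-classifier weights."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    return classes, protos

def predict(features, classes, protos):
    """Assign each feature to its nearest prototype (maximum cosine similarity)."""
    return classes[np.argmax(features @ protos.T, axis=1)]

# toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.integers(0, 5, size=200)
classes, protos = fit_prototypes(x, y)
pred = predict(x, classes, protos)
```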

3.6 Implementation Details

It has been shown that the performance of unsupervised learned representations can be improved by using a larger network, adopting stronger data augmentation, training for more epochs, or using a larger batch size Chen et al. (2020b, a); Hénaff et al. (2019); Misra and van der Maaten (2019). However, these improvements usually come at the cost of more computational resources. Therefore, to enable a fair and direct comparison with previous methods, we follow the same setting for unsupervised training as MoCo He et al. (2020). We perform training on the ImageNet-1M dataset with 1000 classes. A ResNet-50 He et al. (2016) is adopted as the encoder, whose last fully-connected layer outputs a 128-D, L2-normalized feature Wu et al. (2018). We perform additional experiments using a non-linear projection head (a 2-layer MLP), which has been shown to improve representation learning Chen et al. (2020a, b). For efficient clustering, we adopt the GPU $k$-means implementation in faiss Johnson et al. (2017). More details about our method (pseudo-code of our algorithm, convergence proof, experimental settings, cluster analysis, and representation visualization) are given in the appendices.
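
A sketch of the encoder configuration described above, wiring a torchvision ResNet-50 to a 128-D, L2-normalized output through either a single linear layer or a 2-layer MLP head; the exact wiring is an assumption rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class Encoder(nn.Module):
    def __init__(self, dim=128, mlp=False):
        super().__init__()
        self.backbone = resnet50(weights=None)
        feat_dim = self.backbone.fc.in_features          # 2048 for ResNet-50
        if mlp:  # 2-layer MLP projection head
            self.backbone.fc = nn.Sequential(
                nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
                nn.Linear(feat_dim, dim))
        else:    # single linear layer to 128-D
            self.backbone.fc = nn.Linear(feat_dim, dim)

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=1)       # l2-normalized embedding

encoder = Encoder(dim=128, mlp=True)
out = encoder(torch.randn(2, 3, 224, 224))                # shape (2, 128)
```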

4 Experiments

Following common practice in self-supervised learning Goyal et al. (2019), we evaluate PCL on transfer learning tasks, based on the principle that a good representation should transfer with limited supervision and limited fine-tuning. The most important baseline is MoCo He et al. (2020), the recent state-of-the-art instance-wise contrastive learning method. To enable fair and direct comparisons, we follow the settings in He et al. (2020).

4.1 Image Classification with Limited Training Data

Low-shot Classification. We evaluate the learned representation on image classification tasks with few training samples per category. We follow the setup in Goyal et al. (2019) and train linear SVMs using fixed representations on two datasets: Places205 Zhou et al. (2014) for scene recognition and PASCAL VOC2007 Everingham et al. (2010) for object classification. We vary the number of samples per class and report the average result across 5 independent runs. Table 1 shows the results, in which our method substantially outperforms both MoCo He et al. (2020) and SimCLR Chen et al. (2020a).

Method architecture VOC07 Places205
k=1 k=2 k=4 k=8 k=16 k=1 k=2 k=4 k=8 k=16
Random ResNet-50 8.0 8.2 8.2 8.2 8.5 0.7 0.7 0.7 0.7 0.7
Supervised 55.6 65.0 73.9 79.4 81.7 15.5 21.0 26.7 31.9 35.9
Jigsaw Noroozi and Favaro (2016); Goyal et al. (2019) ResNet-50 26.5 31.1 40.0 46.7 51.8 4.6 6.4 9.4 12.9 17.4
MoCo He et al. (2020) 31.2 40.5 50.6 58.9 65.6 9.1 13.2 17.7 23.3 28.4
PCL (ours) 40.9 52.7 61.4 68.1 73.7 11.4 15.7 20.3 25.0 29.5
SimCLR Chen et al. (2020a) ResNet-50-MLP 35.2 42.9 53.7 60.5 67.0 9.9 14.1 19.3 23.8 28.5
PCL (ours) 47.1 54.7 64.1 70.9 76.5 12.1 17.2 21.6 27.0 31.0
Table 1: Low-shot image classification on both VOC07 and Places205 datasets using linear SVMs trained on fixed representations. All methods were pretrained on the ImageNet-1M dataset (except for Jigsaw Noroozi and Favaro (2016); Goyal et al. (2019), which was trained on ImageNet-14M). We vary the number k of labeled examples per class and report the mAP (for VOC) and accuracy (for Places) across 5 runs. Results for Jigsaw are taken from Goyal et al. (2019). We use the released pretrained model for MoCo, and re-implement SimCLR. MoCo, SimCLR, and PCL are trained for the same number of epochs (200).

Semi-supervised Image Classification. We perform semi-supervised learning experiments to evaluate whether the learned representation can provide a good basis for fine-tuning. Following the setup from Wu et al. (2018); Misra and van der Maaten (2019), we randomly select a subset (1% or 10%) of the ImageNet training data (with labels), and fine-tune the self-supervised trained model on these subsets. Table 2 reports the top-5 accuracy on the ImageNet validation set. Our method sets a new state-of-the-art under 200 training epochs, outperforming both self-supervised learning methods and semi-supervised learning methods.

Method architecture #pretrain Top-5 Accuracy
epochs 1% 10%
Random Wu et al. (2018) ResNet-50 - 22.0 59.0
Supervised baseline Zhai et al. (2019) ResNet-50 - 48.4 80.4
Semi-supervised learning methods:
Pseudolabels Zhai et al. (2019) ResNet-50v2 - 51.6 82.4
VAT + Entropy Min. Miyato et al. (2019); Zhai et al. (2019) ResNet-50v2 - 47.0 83.4
SL Exemplar Zhai et al. (2019) ResNet-50v2 - 47.0 83.7
SL Rotation Zhai et al. (2019) ResNet-50v2 - 53.4 83.8
Self-supervised learning methods:
Instance Discrimination Wu et al. (2018) ResNet-50 200 39.2 77.4
Jigsaw Noroozi and Favaro (2016); Misra and van der Maaten (2019) ResNet-50 90 45.3 79.3
SimCLR Chen et al. (2020a) ResNet-50-MLP 200 56.5 82.7
MoCo He et al. (2020) ResNet-50 200 56.9 83.0
PIRL Misra and van der Maaten (2019) ResNet-50 800 57.2 83.8
PCL (ours) ResNet-50 200 75.6 86.2
Table 2: Semi-supervised learning on ImageNet. We report top-5 accuracy on the ImageNet validation set of self-supervised models that are finetuned on 1% or 10% of labeled data. We use the released pretrained model for MoCo, and re-implement SimCLR; all other numbers are adopted from corresponding papers.

4.2 Image Classification Benchmarks

Linear Classifiers. Next, we train linear classifiers on fixed image representations using the entire labeled training data. We follow previous setup Goyal et al. (2019); Misra and van der Maaten (2019) and evaluate the performance of such linear classifiers on three datasets, including ImageNet, VOC07, and Places205. Table 3 reports the results. PCL achieves the highest single-crop top-1 accuracy of all self-supervised methods that use a ResNet-50 model with no more than 200 pretrain epochs.

KNN Classifiers. Following Wu et al. (2018); Zhuang et al. (2019), we perform k-nearest neighbor (kNN) classification on ImageNet using the learned representations. For a query image with feature $v$, we take its top $k$ nearest neighbors among the momentum features, and perform a weighted combination of their labels, where the weights are calculated by $\exp(v \cdot v_i' / \tau)$. Table 4 reports the accuracy. Our method significantly outperforms previous methods while requiring a smaller number of neighbors (20 neighbors as compared to 200 in Wu et al. (2018); Zhuang et al. (2019)).
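
A NumPy sketch of this weighted kNN evaluation is shown below; variable names, the number of neighbors, and τ are placeholders.

```python
import numpy as np

def weighted_knn(query, bank_feats, bank_labels, num_classes, k=20, tau=0.07):
    """Weighted kNN over l2-normalized features.

    query:       (D,)   query embedding
    bank_feats:  (N, D) stored (momentum) features
    bank_labels: (N,)   integer labels of the stored features
    """
    sims = bank_feats @ query                 # cosine similarity (features are l2-normalized)
    top = np.argsort(-sims)[:k]               # indices of the k nearest neighbors
    weights = np.exp(sims[top] / tau)         # similarity-based weights
    scores = np.zeros(num_classes)
    for idx, w in zip(top, weights):
        scores[bank_labels[idx]] += w         # weighted vote per class
    return int(np.argmax(scores))
```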

Method architecture #pretrain Dataset
(#params) epochs ImageNet VOC07 Places205
Colorization Zhang et al. (2016); Goyal et al. (2019) R50 (24M) 28 39.6 55.6 37.5
Jigsaw Noroozi and Favaro (2016); Goyal et al. (2019) R50 (24M) 90 45.7 64.5 41.2
Rotation Gidaris et al. (2018); Misra and van der Maaten (2019) R50 (24M) 48.9 63.9 41.4
DeepCluster Caron et al. (2018, 2019) VGG(15M) 100 48.4 71.9 37.9
BigBiGAN Donahue and Simonyan (2019) R50 (24M) 56.6
InstDisc Wu et al. (2018) R50 (24M) 200 54.0 45.5
MoCo He et al. (2020) R50 (24M) 200 60.6  79.2  48.9
SimCLR Chen et al. (2020a) R50-MLP (28M) 200 61.9
PCL (ours) R50 (24M) 200 62.2 82.2 49.2
PCL (ours) R50-MLP (28M) 200 65.9 84.0 49.8
LocalAgg Zhuang et al. (2019) R50 (24M) 200  60.2  50.1
SelfLabel Asano et al. (2020) R50 (24M) 400  61.5
CPC Oord et al. (2018) R101 (28M) 48.7
CPCv2 Hénaff et al. (2019) R170 (303M) 200 65.9
CMC Tian et al. (2019) R50 (47M) 280  64.1
PIRL Misra and van der Maaten (2019) R50 (24M) 800 63.6 81.1 49.8
AMDIM Hjelm et al. (2019) Custom (626M) 150  68.1  55.0
SimCLR Chen et al. (2020a) R50-MLP (28M) 1000  69.3  80.5
Table 3: Image classification with linear models. We report top-1 accuracy. Numbers obtained from released pretrained models are marked in the table; all other numbers are adopted from the corresponding papers. LocalAgg and SelfLabel use 10-crop evaluation. CMC and AMDIM use FastAutoAugment Lim et al. (2019), which is supervised by ImageNet labels. SimCLR requires a large batch size of 4096 allocated on 128 TPUs.
Method Instance Disc. Wu et al. (2018) MoCo He et al. (2020) Local Agg. Zhuang et al. (2019) PCL (ours)
Accuracy 46.5 47.1 49.4 54.5
Table 4: Image classification with kNN classifiers using ResNet-50 features on ImageNet. We report top-1 accuracy. Results for Wu et al. (2018); Zhuang et al. (2019) are taken from corresponding papers. Result for MoCo is from released model.

4.3 Object Detection

We further assess the generalization capacity of the learned representation on object detection. Following Goyal et al. (2019), we train a Faster R-CNN Ren et al. (2015) model on VOC07 or VOC07+12, and evaluate on the VOC07 test set. It has been shown that fine-tuning models from random initialization can achieve competitive results He et al. (2018). Therefore, we keep the pretrained backbone frozen to better evaluate the learned representation. Note that the same training schedule is used for both the supervised and the self-supervised methods. Table 5 reports the average mAP across three independent runs. Our method substantially closes the gap between self-supervised and supervised methods.

Method Pretrain Dataset Architecture Training data
 VOC07 VOC07+12
Supervised Goyal et al. (2019) ImageNet-1M Resnet-50-C4 67.1 68.3
Supervised ImageNet-1M Resnet-50-FPN 72.8 79.3
Jigsaw Noroozi and Favaro (2016); Goyal et al. (2019) ImageNet-14M Resnet-50-C4 62.7 64.8
MoCo He et al. (2020) ImageNet-1M Resnet-50-FPN 66.4 73.5
PCL (ours) ImageNet-1M Resnet-50-FPN 71.7 78.5
Table 5: Object detection for frozen conv body on VOC using Faster R-CNN. We measure the average mAP on VOC07 test set across three runs.

5 Conclusion

This paper proposed Prototypical Contrastive Learning, a generic unsupervised representation learning framework that finds network parameters to maximize the log-likelihood of the observed data. We introduce prototypes as latent variables, and perform iterative clustering and representation learning in an EM-based framework. PCL learns an embedding space which encodes the semantic structure of data, by training on the proposed ProtoNCE loss. Our extensive experiments on multiple benchmarks demonstrate the state-of-the-art performance of PCL for unsupervised representation learning.

6 Broader Impacts

Our research advances unsupervised representation learning, especially for computer vision, which alleviates the need for expensive human annotation when training deep neural network models. By utilizing the enormous amount of unlabeled images available on the web, smarter AI systems can be built with stronger visual representation abilities. However, unsupervised representation learning places heavy requirements on computational resources during the pretraining stage, which can be costly both financially and environmentally. Therefore, more effort is needed to reduce the computational cost of unsupervised learning. As part of this effort, we will release our pretrained models to facilitate future research in downstream applications without the need for expensive retraining.

References

  • Y. M. Asano, C. Rupprecht, and A. Vedaldi (2020) Self-labelling via simultaneous clustering and representation learning. In ICLR, Cited by: Table 3.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, pp. 139–156. Cited by: §2, §3.3, Table 3.
  • M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In ICCV, pp. 2959–2968. Cited by: §2, Table 3.
  • J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan (2017) Deep adaptive image clustering. In ICCV, pp. 5880–5888. Cited by: §2.
  • K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §D.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: 3rd item, §1, §2, §3.6, §4.1, Table 1, Table 2, Table 3.
  • X. Chen, H. Fan, R. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §3.6.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §3.4.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, pp. 1422–1430. Cited by: §2.
  • J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. In NeurIPS, pp. 10541–10551. Cited by: Table 3.
  • A. Dosovitskiy, J. T. Springenberg, M. A. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pp. 766–774. Cited by: §2.
  • M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §4.1.
  • R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: A library for large linear classification. JMLR 9, pp. 1871–1874. Cited by: §D.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §2, Table 3.
  • P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. In ICCV, pp. 6391–6400. Cited by: §4.1, §4.2, §4.3, Table 1, Table 3, Table 5, §4, §D.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pp. 297–304. Cited by: §1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: 3rd item, §1, §A, §2, §3.1, §3.2, §3.4, §3.6, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5, §4, §D.
  • K. He, R. B. Girshick, and P. Dollár (2018) Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883. Cited by: §4.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.6.
  • O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §3.6, Table 3.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR, Cited by: §1, §2, Table 3.
  • X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In ICCV, pp. 9865–9874. Cited by: §2.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §A, §3.6.
  • R. Liao, A. G. Schwing, R. S. Zemel, and R. Urtasun (2016) Learning deep parsimonious representations. In NIPS, pp. 5076–5084. Cited by: §2.
  • S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim (2019) Fast autoaugment. In NeurIPS, pp. 6662–6672. Cited by: Table 3.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §F.
  • I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991. Cited by: §1, §2, §3.6, §4.1, §4.2, Table 2, Table 3.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2019) Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 41 (8), pp. 1979–1993. Cited by: Table 2.
  • X. V. Nguyen, J. Epps, and J. Bailey (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, pp. 2837–2854. Cited by: §E.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pp. 69–84. Cited by: §2, Table 1, Table 2, Table 3, Table 5.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: 3rd item, §1, §2, §3.1, §3.4, Table 3.
  • D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544. Cited by: §2.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §4.3.
  • B. C. Ross (2014) Mutual information between discrete and continuous data sets. PloS one 9 (2). Cited by: §3.4.
  • N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar (2019) A theoretical analysis of contrastive unsupervised representation learning. In ICML, pp. 5628–5637. Cited by: 2nd item.
  • J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In NIPS, pp. 4077–4087. Cited by: §3.5.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1, §2, Table 3.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2020) On mutual information maximization for representation learning. In ICLR, Cited by: 1st item.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pp. 3733–3742. Cited by: 3rd item, §1, §A, §2, §3.6, §4.1, §4.2, Table 2, Table 3, Table 4.
  • J. Xie, R. B. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML, pp. 478–487. Cited by: §2.
  • B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In ICML, pp. 3861–3870. Cited by: §2.
  • J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In CVPR, pp. 5147–5156. Cited by: §2.
  • M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, pp. 6210–6219. Cited by: §1, §2.
  • X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning. In ICCV, pp. 1476–1485. Cited by: Table 2.
  • L. Zhang, G. Qi, L. Wang, and J. Luo (2019) AET vs. AED: unsupervised representation learning by auto-encoding transformations rather than data. In CVPR, Cited by: §2.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, pp. 649–666. Cited by: §2, Table 3.
  • R. Zhang, P. Isola, and A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In CVPR, pp. 1058–1067. Cited by: §2.
  • B. Zhou, À. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using places database. In NIPS, pp. 487–495. Cited by: §4.1.
  • C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In ICCV, pp. 6002–6012. Cited by: §2, §4.2, Table 3, Table 4.

A Training Details for Unsupervised Learning

For the unsupervised learning experiments, we follow previous works He et al. (2020); Wu et al. (2018) and perform data augmentation with random crop, random color jittering, random horizontal flip, and random grayscale conversion. We use SGD as our optimizer, with a weight decay of 0.0001, a momentum of 0.9, and a batch size of 256. We train for 200 epochs, where we warm up the network in the first 20 epochs by using only the InfoNCE loss. The initial learning rate is 0.03, and is multiplied by 0.1 at 120 and 160 epochs. In terms of the hyper-parameters, we set the temperature $\tau = 0.1$, the number of negatives $r = 16000$, the smoothing parameter $\alpha = 10$, and the number of clusters $K = \{25000, 50000, 100000\}$. The clustering is performed once per epoch on center-cropped images. We find over-clustering to be beneficial. We use the GPU $k$-means implementation in faiss Johnson et al. (2017), which takes less than 20 seconds. Overall, PCL introduces some additional computational overhead compared to MoCo.
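
The optimization schedule described above can be sketched as follows; the warm-up loss selection and the helper names are illustrative assumptions.

```python
import torch

def make_optimizer(model):
    """SGD with lr 0.03, momentum 0.9, weight decay 1e-4, as described above."""
    return torch.optim.SGD(model.parameters(), lr=0.03,
                           momentum=0.9, weight_decay=1e-4)

def adjust_lr(optimizer, epoch, base_lr=0.03, milestones=(120, 160)):
    """Multiply the learning rate by 0.1 at epochs 120 and 160."""
    lr = base_lr * (0.1 ** sum(epoch >= m for m in milestones))
    for group in optimizer.param_groups:
        group['lr'] = lr

def loss_for_epoch(epoch, info_nce_loss, proto_nce_loss, warmup_epochs=20):
    # warm-up: train with the InfoNCE term only, before the first clusterings are reliable
    return info_nce_loss if epoch < warmup_epochs else proto_nce_loss
```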

B Pseudo-code for Prototypical Contrastive Learning

Input: encoder f_θ, training dataset X, numbers of clusters K = {k_m}, m = 1, …, M
θ' ← θ   // initialize the momentum encoder as the encoder
while not MaxEpoch do
        /* E-step */
        V' = f_{θ'}(X)   // get momentum features for all training data
        for m = 1 to M do
                C^m = k-means(V', k_m)   // cluster V' into k_m clusters, return prototypes
                φ^m = Concentration(C^m, V')   // estimate the distribution concentration around each prototype with Equation 12
        end for
        /* M-step */
        for each minibatch x in X do
                v = f_θ(x), v' = f_{θ'}(x)   // forward pass through the encoder and the momentum encoder
                L(θ) = ProtoNCE(v, v', {C^m}, {φ^m})   // calculate the loss with Equation 11
                θ ← SGD(L(θ), θ)   // update encoder parameters
                θ' ← m_e·θ' + (1 − m_e)·θ   // update the momentum encoder with coefficient m_e
        end for
end while
Algorithm 1: Prototypical Contrastive Learning.

C Convergence Proof

Here we provide a proof that the proposed PCL converges. Let

$$L(\theta, Q) = \sum_{i=1}^{n} \sum_{c_i \in C} Q(c_i) \log \frac{p(x_i, c_i; \theta)}{Q(c_i)} \;\leq\; \sum_{i=1}^{n} \log p(x_i; \theta) \qquad (13)$$

We have shown in Section 3.2 that the above inequality holds with equality when $Q(c_i) = p(c_i; x_i, \theta)$.

At the $t$-th E-step, we have estimated $Q^t(c_i) = p(c_i; x_i, \theta^t)$. Therefore we have:

$$L(\theta^t, Q^t) = \sum_{i=1}^{n} \log p(x_i; \theta^t) \qquad (14)$$

At the $t$-th M-step, we fix $Q^t$ and train the parameters $\theta$ to maximize Equation 14. Therefore we always have:

$$\sum_{i=1}^{n} \log p(x_i; \theta^{t+1}) \;\geq\; L(\theta^{t+1}, Q^t) \;\geq\; L(\theta^t, Q^t) = \sum_{i=1}^{n} \log p(x_i; \theta^t) \qquad (15)$$

The above result shows that the log-likelihood $\sum_{i=1}^{n} \log p(x_i; \theta^t)$ monotonically increases with more iterations. Hence the algorithm will converge.

D Training Details for Transfer Learning Experiments

For training linear SVMs on Places and VOC, we follow the procedure in Goyal et al. (2019) and use the LIBLINEAR Fan et al. (2008) package. We preprocess all images by resizing to 256 pixels along the shorter side and taking a center crop. The linear SVMs are trained on the global average pooling features of ResNet-50.
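
A hedged sketch of this evaluation using scikit-learn's LinearSVC (which wraps liblinear) on frozen features; the cost value and preprocessing are placeholders rather than the exact protocol of Goyal et al. (2019).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_linear_svm(train_feats, train_labels, cost=1.0):
    """Linear SVM on frozen global-average-pooled features."""
    clf = LinearSVC(C=cost, max_iter=2000)
    clf.fit(train_feats, train_labels)
    return clf

# toy usage with random 2048-D pooled features
x = np.random.randn(100, 2048)
y = np.random.randint(0, 5, size=100)
clf = train_linear_svm(x, y)
acc = clf.score(x, y)
```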

For image classification with linear models, we use the pretrained representations from the global average pooling features (2048-D) for ImageNet and VOC, and the conv5 features (average-pooled to 9000-D) for Places. We train a linear SVM for VOC, and a logistic regression classifier (a fully-connected layer followed by softmax) for ImageNet and Places. The logistic regression classifier is trained using SGD with a momentum of 0.9. For ImageNet, we train for 100 epochs with an initial learning rate of 10 and a weight decay of 0. Similar hyper-parameters are used by He et al. (2020). For Places, we train for 40 epochs with an initial learning rate of 0.3 and a weight decay of 0.

For semi-supervised learning, we fine-tune ResNet-50 with pretrained weights on a subset of ImageNet with labels. We optimize the model with SGD, using a batch size of 256, a momentum of 0.9, and a weight decay of 0.0005. We apply different learning rates to the ConvNet and the linear classifier. The learning rate for the ConvNet is 0.01, and the learning rate for the classifier is 0.1 (for 10% labels) or 1 (for 1% labels). We train for 20 epochs, and drop the learning rate by a factor of 0.2 at 12 and 16 epochs.
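
A sketch of this fine-tuning setup with separate learning rates for the backbone and the classifier via PyTorch parameter groups, using the 10%-label learning rates from the text; the model construction details are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None)                       # load PCL-pretrained weights in practice
model.fc = nn.Linear(model.fc.in_features, 1000)     # linear classifier on top

backbone_params = [p for n, p in model.named_parameters() if not n.startswith('fc')]
classifier_params = list(model.fc.parameters())

optimizer = torch.optim.SGD(
    [{'params': backbone_params, 'lr': 0.01},        # ConvNet learning rate
     {'params': classifier_params, 'lr': 0.1}],      # classifier learning rate (10% labels)
    momentum=0.9, weight_decay=0.0005)

# drop both learning rates by a factor of 0.2 at epochs 12 and 16
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12, 16], gamma=0.2)
```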

For object detection, we use the R50-FPN backbone for the Faster R-CNN detector available in the MMDetection Chen et al. (2019) codebase. We freeze all the conv layers and also fix the BatchNorm parameters. The model is optimized with SGD, using a batch size of 8, a momentum of 0.9, and a weight decay of 0.0001. The initial learning rate is set to 0.05. We fine-tune the models for 15 epochs, and drop the learning rate by a factor of 0.1 at 12 epochs.
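
A generic PyTorch sketch of freezing the conv body and BatchNorm parameters is shown below; it illustrates the idea and is not the MMDetection configuration used in our experiments.

```python
import torch.nn as nn
from torchvision.models import resnet50

def freeze_backbone(backbone: nn.Module):
    """Freeze conv weights and BatchNorm layers of a pretrained backbone."""
    for p in backbone.parameters():
        p.requires_grad = False                  # no gradient updates for backbone weights
    for m in backbone.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()                             # keep running statistics fixed
            for p in m.parameters():
                p.requires_grad = False          # freeze affine parameters too
    # note: re-apply m.eval() after every model.train() call in the training loop

backbone = resnet50(weights=None)                # load pretrained weights in practice
freeze_backbone(backbone)
```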

E Adjusted Mutual Information

In order to evaluate the quality of the clustering, we compute the adjusted mutual information (AMI) score Nguyen et al. (2010) between the clusterings and the ground-truth labels for the ImageNet training data. AMI is adjusted for chance, which accounts for the bias of MI towards giving higher values to clusterings with a larger number of clusters. AMI has a value of 1 when two partitions are identical, and an expected value of 0 for random (independent) partitions. In Figure 4, we show the AMI scores for the three clusterings obtained by PCL, with the numbers of clusters $k \in \{25000, 50000, 100000\}$.
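
The AMI score can be computed with scikit-learn, which implements the adjusted mutual information of Nguyen et al. (2010); the arrays below are random placeholders.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

cluster_assignments = np.random.randint(0, 50, size=10000)   # e.g. k-means assignments
class_labels = np.random.randint(0, 10, size=10000)          # ground-truth labels
ami = adjusted_mutual_info_score(class_labels, cluster_assignments)
print(f"AMI = {ami:.4f}")   # ~0 for random partitions, 1 for identical partitions
```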

Figure 4: Adjusted mutual information score between the clusterings generated by PCL and the ground-truth labels for ImageNet training data.

F Visualization of Learned Representation

In Figure 5, we visualize the unsupervised learned representation of ImageNet training images using t-SNE Maaten and Hinton (2008). Compared to the representation learned by MoCo, the representation learned by the proposed PCL forms more separated clusters, which also suggests representations of lower entropy.

Figure 5: t-SNE visualization of the unsupervised learned representation for ImageNet training images from the first 60 classes. Left: MoCo; Right: PCL (ours). Colors represent classes.
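
A sketch of producing such a visualization with scikit-learn's t-SNE; the feature arrays, sample counts, and perplexity are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(3000, 128)                 # learned 128-D embeddings (placeholder)
labels = np.random.randint(0, 60, size=3000)          # class labels of the first 60 classes

xy = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(features)
plt.figure(figsize=(6, 6))
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=2, cmap='tab20')
plt.axis('off')
plt.savefig('tsne_pcl.png', dpi=200)
```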

G Visualization of Clusters

In Figure 6, we show ImageNet training images that are randomly chosen from clusters generated by the proposed PCL. PCL not only clusters images from the same class together, but also finds fine-grained patterns that distinguish sub-classes, demonstrating its capability to learn useful semantic representations.

Figure 6: Visualization of randomly chosen clusters generated by PCL. A green border marks the top-5 images closest to fine-grained prototypes (larger number of clusters). An orange border marks images randomly chosen from coarse-grained clusters (smaller number of clusters) that also cover the same green images. PCL can discover hierarchical semantic structures within the data (e.g. images with a horse and a man form a fine-grained cluster within the coarse-grained horse cluster).