The explosion of unlabeled data, especially visual content, in recent years has led to a growing demand for effective organization of these data into semantically distinct groups in an unsupervised manner. Such data clustering facilitates downstream machine learning and reasoning tasks. Since labels are unavailable, clustering algorithms mainly rely on the similarity between samples to predict the cluster assignment. However, common similarity metrics such as cosine similarity or (negative) Euclidean distance are ineffective when applied to high-dimensional data like images. Modern image clustering methods [7, 19, 20, 44, 46, 47], therefore, leverage deep neural networks (e.g., CNNs, RNNs) to transform high-dimensional data into low-dimensional representation vectors in the latent space and perform clustering in that space. Ideally, a good clustering model assigns data to clusters so that inter-group similarity stays low while intra-group similarity stays high. Most existing deep clustering methods do not satisfy both of these properties. For example, autoencoder-based clustering methods [21, 46, 48] often learn representations that capture too much information, including distracting information such as background or texture. This prevents them from computing proper similarity scores between samples at the cluster level. Autoencoder-based methods have also only been tested on simple image datasets like MNIST. Another class of methods [7, 19, 20] directly uses cluster-assignment probabilities rather than representation vectors to compute the similarity between samples. These methods can only differentiate objects belonging to different clusters but not objects in the same cluster, and hence may incorrectly group distinct objects into the same cluster. This leads to low intra-group similarity.
To address the limitations of existing methods, we propose a novel framework for image clustering called Contrastive Representation Learning and Clustering (CRLC). CRLC consists of two heads sharing the same backbone network: a “representation learning” head (RL-head) that outputs a continuous feature vector, and a “clustering” head (C-head) that outputs a cluster-assignment probability vector. The RL-head computes the similarity between objects at the instance level while the C-head separates objects into different clusters. The backbone network serves as a medium for information transfer between the two heads, allowing the C-head to leverage discriminative fine-grained patterns captured by the RL-head to extract correct coarse-grained cluster-level patterns. Via the two heads, CRLC can effectively modulate the inter-cluster and intra-cluster similarities between samples. CRLC is trained in an end-to-end manner by minimizing a weighted sum of two sample-oriented contrastive losses w.r.t. the two heads. To ensure that the contrastive loss corresponding to the C-head leads to the tightest InfoNCE lower bound, we propose a novel critic called “log-of-dot-product” to be used in place of the conventional “dot-product” critic.
In our experiments, we show that CRLC significantly outperforms a wide range of state-of-the-art single-stage clustering methods on five standard image clustering datasets: CIFAR10/20, STL10, and ImageNet10/Dogs. The “two-stage” variant of CRLC also achieves better results than SCAN, a powerful two-stage clustering method, on three challenging ImageNet subsets with 50, 100, and 200 classes. When some labeled data are provided, CRLC, with only a small change in its objective, can surpass many state-of-the-art semi-supervised learning algorithms by a large margin.
In summary, our main contributions are:
- A novel framework for joint representation learning and clustering trained via two sample-oriented contrastive losses on feature and probability vectors;
- An optimal critic for the contrastive loss on probability vectors; and,
- Extensive experiments and ablation studies to validate our proposed method against baselines.
2.1 Representation learning by maximizing mutual information across different views
(Here, we use “views” as a generic term to indicate different transformations of the same data sample.)
Maximizing mutual information across different views (or ViewInfoMax for short) allows us to learn view-invariant representations that capture the semantic information of data important for downstream tasks (e.g., classification). This learning strategy is also the key factor behind recent successes in representation learning [18, 33, 39, 42].
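Since direct computation of mutual information is difficult [29, 37], people usually maximize variational lower bounds of mutual information instead. The most common lower bound is InfoNCE, whose formula is given by:

\begin{align}
I(X, \tilde{X}) &\geq I_{\text{InfoNCE}}(X, \tilde{X}) \tag{1} \\
&\triangleq \mathbb{E}_{p(x_{1:M})\, p(\tilde{x} \mid x_1)} \left[ \log \frac{e^{f(\tilde{x}, x_1)}}{\sum_{i=1}^{M} e^{f(\tilde{x}, x_i)}} \right] + \log M \tag{2} \\
&= -\mathcal{L}_{\text{contrast}} + \log M \tag{3}
\end{align}

where $X$, $\tilde{X}$ denote random variables from 2 different views. $x_{1:M}$ are $M$ samples from $p_X$, and $\tilde{x}$ is a sample from $p_{\tilde{X}}$ associated with $x_1$. $(\tilde{x}, x_1)$ is called a “positive” pair and $(\tilde{x}, x_i)$ ($i = 2, \ldots, M$) are called “negative” pairs. $f(x, y)$ is a real-valued function called the “critic” that characterizes the similarity between $x$ and $y$. $\mathcal{L}_{\text{contrast}}$ is often known as the “contrastive loss” in other works [8, 39].
Since $\mathcal{L}_{\text{contrast}} \geq 0$, $I_{\text{InfoNCE}}(X, \tilde{X})$ is upper-bounded by $\log M$. It means that: i) the InfoNCE bound is very loose if $I(X, \tilde{X}) \gg \log M$, and ii) by increasing $M$, we can achieve a better bound. Despite being biased, $I_{\text{InfoNCE}}(X, \tilde{X})$ has much lower variance than other unbiased lower bounds of $I(X, \tilde{X})$, which allows stable training of models.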
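To make the role of the number of samples concrete, the following minimal Python sketch evaluates the InfoNCE estimate for a single anchor from a list of critic values. The function name `infonce_estimate` and the toy scores are illustrative assumptions, not part of the original method; the first score is assumed to belong to the positive pair.

```python
import math

def infonce_estimate(scores):
    """Return (bound, contrastive_loss) for one anchor.

    scores[0] is the critic value of the positive pair and scores[1:]
    are the critic values of the negative pairs (assumed layout).
    """
    m = len(scores)
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    contrastive_loss = log_sum_exp - scores[0]  # -log softmax of the positive pair
    bound = math.log(m) - contrastive_loss      # never exceeds log(m)
    return bound, contrastive_loss

# Toy example: one well-separated positive among three negatives.
bound, loss = infonce_estimate([5.0, 0.0, 0.0, 0.0])
print(f"loss={loss:.4f}, bound={bound:.4f}, log M={math.log(4):.4f}")
```

Because the contrastive loss is non-negative, the returned bound cannot exceed the logarithm of the number of samples, however good the critic is.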
Implementing the critic
In practice, $f$ is implemented as the scaled cosine similarity between the representations of $\tilde{x}$ and $x_i$ as follows:

$$ f(\tilde{x}, x_i) = \tilde{z}^\top z_i / \tau \tag{4} $$

where $\tilde{z}$ and $z_i$ are unit-normed representation vectors of $\tilde{x}$ and $x_i$, respectively, i.e., $\|\tilde{z}\|_2 = \|z_i\|_2 = 1$. $\tau > 0$ is the “temperature” hyperparameter. Interestingly, $f$ in Eq. 4 matches the theoretically optimal critic that leads to the tightest InfoNCE bound for unit-normed representation vectors (detailed explanation in Appdx. A.4).
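A minimal sketch of such a scaled cosine critic in plain Python is given below. The helper names are hypothetical, and the default temperature of 0.1 is the value reported later in the experimental setup; the vectors are normalized inside the function, so the critic always lies between minus and plus the inverse temperature.

```python
import math

def unit_norm(v):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine_critic(z_a, z_b, tau=0.1):
    """Cosine similarity of two feature vectors divided by the temperature."""
    u, w = unit_norm(z_a), unit_norm(z_b)
    return sum(a * b for a, b in zip(u, w)) / tau

print(cosine_critic([1.0, 0.0], [2.0, 0.0]))   # parallel vectors
print(cosine_critic([1.0, 0.0], [0.0, 3.0]))   # orthogonal vectors
```

A smaller temperature stretches the range of the critic, which sharpens the softmax inside the contrastive loss.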
3.1 Clustering by maximizing mutual information across different views
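In the clustering problem, we want to learn a parametric classifier $s_\theta$ that maps each unlabeled sample $x_i$ to a cluster-assignment probability vector $q_i = (q_{i,1}, \ldots, q_{i,C})$ ($C$ is the number of clusters) whose component $q_{i,c}$ characterizes how likely $x_i$ belongs to the cluster $c$ ($c \in \{1, \ldots, C\}$). Intuitively, we can consider $q_i$ as a representation of $x_i$ and use this vector to capture the cluster-level information in $x_i$ by leveraging the “ViewInfoMax” idea discussed in Section 2.1. This leads to the following loss for clustering:

\begin{align}
\mathcal{L}_{\text{cluster}} &= \mathbb{E}_{p(x_{1:M})\, p(\tilde{x} \mid x_1)} \left[ -\log \frac{e^{f(\tilde{q}, q_1)}}{\sum_{i=1}^{M} e^{f(\tilde{q}, q_i)}} \right] - \lambda_1 H(\bar{q}) \tag{7} \\
&= \mathcal{L}_{\text{PC}} - \lambda_1 H(\bar{q}) \tag{8}
\end{align}

where $\lambda_1 \geq 0$ is a coefficient; $\tilde{q}$, $q_i$ are probability vectors associated with $\tilde{x}$ and $x_i$, respectively. $\mathcal{L}_{\text{PC}}$ is the probability contrastive loss, similar to the feature contrastive loss $\mathcal{L}_{\text{FC}}$ (Eq. 5) but with feature vectors replaced by probability vectors. $H(\bar{q})$ is the entropy of the marginal cluster-assignment probability $\bar{q} = \mathbb{E}_{p(x)}[q]$. Here, we maximize $H(\bar{q})$ to avoid a degenerate solution in which all samples fall into the same cluster (e.g., $q_i$ is the same one-hot vector for all samples). However, one is free to use other regularizers on $\bar{q}$ rather than $-H(\bar{q})$.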
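The effect of the entropy regularizer can be illustrated with a short Python sketch. Here the marginal cluster-assignment probability is estimated by averaging the probability vectors in a batch, which is an assumption made for illustration; the function name is hypothetical.

```python
import math

def marginal_entropy(probs):
    """Entropy of the batch-averaged cluster-assignment probability."""
    n, c = len(probs), len(probs[0])
    avg = [sum(p[k] for p in probs) / n for k in range(c)]
    return sum(-a * math.log(a) for a in avg if a > 0.0)

collapsed = [[1.0, 0.0, 0.0]] * 4   # every sample assigned to cluster 0
balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(marginal_entropy(collapsed))  # zero entropy: the degenerate solution
print(marginal_entropy(balanced))   # log(3): all clusters used equally
```

Maximizing this quantity therefore pushes the model away from the collapsed assignment and towards using all clusters.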
Choosing a suitable critic
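It is possible to use the conventional “dot-product” critic for the probability contrastive loss $\mathcal{L}_{\text{PC}}$ as for the feature contrastive loss $\mathcal{L}_{\text{FC}}$ (Eq. 4). However, this will lead to suboptimal results (Section 5.3) since $\mathcal{L}_{\text{PC}}$ is applied to categorical probability vectors rather than continuous feature vectors. Therefore, we need to choose a suitable critic for $\mathcal{L}_{\text{PC}}$ so that the InfoNCE bound associated with $\mathcal{L}_{\text{PC}}$ is tightest. Ideally, $f(\tilde{x}, x_i)$ should match the theoretically optimal critic $f^*(\tilde{x}, x_i)$, which is proportional to $\log p(\tilde{x} \mid x_i)$ (detailed explanation in Appdx. A.3). Denoting by $\tilde{y}$ and $y_i$ the cluster labels of $\tilde{x}$ and $x_i$, respectively, we then have:

$$ \log p(\tilde{x} \mid x_i) \propto \log \sum_{c=1}^{C} p(\tilde{y} = c \mid \tilde{x})\, p(y_i = c \mid x_i) = \log(\tilde{q}^\top q_i) \tag{9} $$

Thus, the most suitable critic is $f(\tilde{x}, x_i) = \log(\tilde{q}^\top q_i)$, which we refer to as the “log-of-dot-product” critic. This critic achieves its maximum value when $\tilde{q}$ and $q_i$ are the same one-hot vector and its minimum value when $\tilde{q}$ and $q_i$ are different one-hot vectors. Apart from this critic, we also list other nonoptimal critics in Appdx. A.1. An empirical comparison of the “log-of-dot-product” critic with other critics is provided in Section 5.3.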
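The behaviour of this critic can be checked with a small Python sketch. The function names and the guard constant `eps` are assumptions introduced for illustration (the guard only prevents taking the logarithm of zero for different one-hot vectors). The sketch also shows that, with this critic, the softmax inside the contrastive loss reduces to a ratio of dot products.

```python
import math

def log_dot_critic(q_a, q_b, eps=1e-12):
    """Log of the dot product of two probability vectors (eps is a hypothetical guard)."""
    return math.log(sum(a * b for a, b in zip(q_a, q_b)) + eps)

def prob_contrastive_loss(q_anchor, q_pos, q_negs):
    """-log softmax of the positive pair under the log-of-dot-product critic."""
    scores = [log_dot_critic(q_anchor, q) for q in [q_pos] + list(q_negs)]
    dots = [math.exp(s) for s in scores]  # exp(log(dot)) recovers the dot products
    return -math.log(dots[0] / sum(dots))

same = log_dot_critic([1.0, 0.0], [1.0, 0.0])       # identical one-hot vectors
different = log_dot_critic([1.0, 0.0], [0.0, 1.0])  # different one-hot vectors
print(same, different)
print(prob_contrastive_loss([0.9, 0.1], [0.8, 0.2], [[0.1, 0.9], [0.2, 0.8]]))
```

The first printed value is close to zero, the largest value the critic can take, while the second is strongly negative and is bounded below only by the guard.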
In addition, to avoid the gradient saturation problem of minimizing $\mathcal{L}_{\text{PC}}$ when probabilities are close to one-hot (explanation in Appdx. A.5), we smooth out the probabilities as follows:

$$ q \leftarrow (1 - \gamma)\, q + \gamma\, r $$

where $r = (1/C, \ldots, 1/C)$ is the uniform probability vector over $C$ classes; $0 \leq \gamma \leq 1$ is the smoothing coefficient, set to 0.01 unless otherwise specified.
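A minimal sketch of this smoothing step is shown below, assuming it takes the usual convex-combination form between the predicted vector and the uniform vector. The function name is hypothetical, and the default coefficient of 0.01 is the value stated in the text.

```python
def smooth_probs(q, gamma=0.01):
    """Mix a probability vector with the uniform distribution over its classes."""
    c = len(q)
    return [(1.0 - gamma) * p + gamma / c for p in q]

one_hot = [1.0] + [0.0] * 9      # a fully confident prediction over 10 clusters
smoothed = smooth_probs(one_hot)
print(smoothed[0], smoothed[1], sum(smoothed))
```

After smoothing, no entry is exactly zero, so the logarithm of a dot product between two smoothed vectors stays finite.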
Implementing the contrastive probability loss
To implement $\mathcal{L}_{\text{PC}}$, we can use either the SimCLR framework or the MemoryBank framework. If the SimCLR framework is chosen, both $\tilde{q}$ and $q_i$ ($i = 1, \ldots, M$) are computed directly from $\tilde{x}$ and $x_i$, respectively, via the parametric classifier $s_\theta$. On the other hand, if the MemoryBank framework is chosen, we maintain a nonparametric memory bank $\mathcal{M}$, a matrix of size $N \times C$ containing the cluster-assignment probabilities of all $N$ training samples, and update its rows once a new probability is computed as follows:

$$ q_{n,t+1} = \alpha\, q_{n,t} + (1 - \alpha)\, \hat{q}_n $$

where $\alpha$ is the momentum, which is set to 0.5 in our work unless otherwise specified; $q_{n,t}$ is the probability vector of the training sample $x_n$ at step $t$, corresponding to the $n$-th row of $\mathcal{M}$; $\hat{q}_n = s_\theta(x_n)$ is the new probability vector. Then, except for $\tilde{q}$, which is computed via $s_\theta$ as normal, all $q_i$ in Eq. 7 are sampled uniformly from $\mathcal{M}$. At step 0, all the rows of $\mathcal{M}$ are initialized with the same probability of $(1/C, \ldots, 1/C)$. We also tried implementing $\mathcal{L}_{\text{PC}}$ using the MoCo framework but found that it leads to unstable training. The main reason is that during the early stage of training, the EMA model in MoCo often produces inconsistent cluster-assignment probabilities for different views.
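The memory-bank variant can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the authors' implementation: the class and method names are hypothetical, the momentum is assumed to weight the stored row, every row is assumed to start as the uniform distribution, and negatives are drawn with replacement.

```python
import random

class ProbMemoryBank:
    """Toy memory bank holding one cluster-assignment probability vector per sample."""

    def __init__(self, num_samples, num_clusters, momentum=0.5):
        self.momentum = momentum
        # Assumed initialization: every row starts as the uniform distribution.
        self.rows = [[1.0 / num_clusters] * num_clusters for _ in range(num_samples)]

    def update(self, index, new_prob):
        """Moving-average update of one row (momentum weights the stored value)."""
        m = self.momentum
        self.rows[index] = [m * old + (1.0 - m) * new
                            for old, new in zip(self.rows[index], new_prob)]

    def sample(self, k, rng=random):
        """Draw k rows uniformly at random (with replacement) to serve as negatives."""
        return [self.rows[rng.randrange(len(self.rows))] for _ in range(k)]

bank = ProbMemoryBank(num_samples=4, num_clusters=2)
bank.update(0, [1.0, 0.0])
print(bank.rows[0])          # [0.75, 0.25]
print(len(bank.sample(3)))   # 3
```

Since each update is a convex combination of two probability vectors, every row remains a valid probability vector throughout training.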
3.2 Incorporating representation learning
Due to the limited representation capability of categorical probability vectors, models trained by minimizing the loss in Eq. 7 are not able to discriminate objects in the same cluster. Thus, they may capture suboptimal cluster-level patterns, which leads to unsatisfactory results.
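To overcome this problem, we propose to combine clustering with contrastive representation learning into a unified framework called CRLC (Contrastive Representation Learning and Clustering). As illustrated in Fig. 1, CRLC consists of a “clustering” head (C-head) and a “representation learning” head (RL-head) sharing the same backbone network. The backbone network is usually a convolutional neural network which maps an input image $x$ into a hidden vector $h$. Then, $h$ is fed to the C-head and the RL-head to produce a cluster-assignment probability vector $q$ and a continuous feature vector $z$, respectively. We simultaneously apply the clustering loss $\mathcal{L}_{\text{cluster}}$ (Eq. 8) and the feature contrastive loss $\mathcal{L}_{\text{FC}}$ (Eq. 6) on $q$ and $z$, respectively, and train the whole model with the weighted sum of $\mathcal{L}_{\text{cluster}}$ and $\mathcal{L}_{\text{FC}}$ as follows:

$$ \mathcal{L}_{\text{CRLC}} = \mathcal{L}_{\text{cluster}} + \lambda_2 \mathcal{L}_{\text{FC}} = \mathcal{L}_{\text{PC}} - \lambda_1 H(\bar{q}) + \lambda_2 \mathcal{L}_{\text{FC}} \tag{11} $$

where $\lambda_1, \lambda_2 \geq 0$ are coefficients, $\mathcal{L}_{\text{PC}}$ is the probability contrastive loss, and $H(\bar{q})$ is the entropy of the marginal cluster-assignment probability.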
3.3 A simple extension to semi-supervised learning
Although CRLC is originally proposed for unsupervised clustering, it can be easily extended to semi-supervised learning (SSL). There are numerous ways to adjust CRLC so that it can incorporate labeled data during training. However, within the scope of this work, we only consider a simple approach, which is adding a cross-entropy loss on labeled data to $\mathcal{L}_{\text{CRLC}}$. The new loss is given by:

$$ \mathcal{L}_{\text{CRLC-semi}} = \mathcal{L}_{\text{CRLC}} + \mathbb{E}_{(x_l, y_l)} \left[ -\log p(y_l \mid x_l) \right] \tag{12} $$

We call this variant of CRLC “CRLC-semi”. Despite its simplicity, we will empirically show that CRLC-semi outperforms many state-of-the-art SSL methods when only a few labeled samples are available. We conjecture that the clustering objective arranges the data into disjoint clusters, making classification easier.
4 Related Work
There are a large number of clustering and representation learning methods in the literature. However, within the scope of this paper, we only discuss works on two related topics, namely, contrastive learning and deep clustering.
4.1 Contrastive Learning
Despite many recent successes in learning representations, the idea of contrastive learning appeared a long time ago. In 2006, Hadsell et al. proposed a max-margin contrastive loss and linked it to a mechanical spring system. In fact, from a probabilistic view, contrastive learning arises naturally when working with energy-based models. For example, in many problems, we want to maximize $\log p(y \mid x) = \log \frac{e^{f(y, x)}}{\sum_{y' \in \mathcal{Y}} e^{f(y', x)}}$ where $y$ is the output associated with a context $x$ and $\mathcal{Y}$ is the set of all possible outputs or vocab. This is roughly equivalent to maximizing $f(y, x)$ and minimizing $f(y', x)$ for all $y' \neq y$ but in a normalized setting. However, in practice, the size of $\mathcal{Y}$ is usually very large, making the computation of $p(y \mid x)$ expensive. This problem was addressed in [32, 42] by using Noise Contrastive Estimation (NCE) to approximate $p(y \mid x)$. The basic idea of NCE is to transform the density estimation problem into a binary classification problem: “Are samples drawn from the data distribution or from a known noise distribution?”. Based on NCE, Mikolov et al. and Oord et al. derived a simpler contrastive loss which was later referred to as the InfoNCE loss and was adopted by many subsequent works [8, 12, 16, 31, 39, 49] for learning representations.
PCL alternates between clustering data via K-means and contrasting samples based on their views and their assigned cluster centroids (or prototypes). SwAV does not contrast two sample views directly but uses one view to predict the code of assigning the other view to a set of learnable prototypes. InterCLR and ODC avoid offline clustering on the entire training dataset after each epoch by storing a pseudo-label for every sample in the memory bank (along with the feature vector) and maintaining a set of cluster centroids. These pseudo-labels and cluster centroids are updated on-the-fly at each step via mini-batch K-means.
4.2 Deep Clustering
Traditional clustering algorithms such as K-means or Gaussian Mixture Model (GMM) are mainly designed for low-dimensional vector-like data and hence do not perform well on high-dimensional structural data like images. Deep clustering methods address this limitation by leveraging the representation power of deep neural networks (e.g., CNNs, RNNs) to effectively transform data into low-dimensional feature vectors which are then used as inputs for a clustering objective. For example, DCN applies K-means to the latent representations produced by an autoencoder. The reconstruction loss and the K-means clustering loss are minimized simultaneously. DEC, by contrast, uses only an encoder rather than a full autoencoder like DCN to compute latent representations. This encoder and the cluster centroids are learned together via a clustering loss proposed by the authors. JULE uses an RNN to implement agglomerative clustering on top of the representations output by a CNN and trains the two networks in an end-to-end manner. VaDE regards clustering as an inference problem and learns the cluster-assignment probabilities of data using a variational framework. Meanwhile, DAC treats clustering as a binary classification problem: “Does a pair of samples belong to the same cluster or not?”. To obtain a pseudo label for a pair, the cosine similarity between the cluster-assignment probabilities of the two samples in that pair is compared with an adaptive threshold. IIC learns cluster assignments via maximizing the mutual information between clusters under two different data augmentations. PICA, instead, minimizes the contrastive loss derived from the mutual information in IIC. The cluster contrastive loss in PICA is cluster-oriented and can have at most $C - 1$ negative pairs ($C$ is the number of clusters). Our probability contrastive loss, by contrast, is sample-oriented and can have as many negative pairs as the number of training data. Thus, in theory, our proposed model can capture more information than PICA. In a real implementation, in order to gain more information from data, PICA has to make use of the “over-clustering” trick. It alternates between minimizing its loss for $C$ clusters and minimizing it for $kC$ clusters ($k$ denotes the “over-clustering” coefficient). DRC and CC enhance PICA by combining clustering with contrastive representation learning, which follows the same paradigm as our proposed CRLC. However, like PICA, DRC and CC use cluster-oriented representations rather than sample-oriented representations.
In addition to end-to-end deep clustering methods, some multi-stage clustering methods have been proposed recently [34, 40]. The most notable one is SCAN. This method uses representations learned via contrastive learning during the first stage to find nearest neighbors for every sample in the training set. In the second stage, neighboring samples are forced to have similar cluster-assignment probabilities. Our probability contrastive loss can easily be extended to handle neighboring samples (see Section 5.1.2).
Table 1: End-to-end clustering results on 5 standard image datasets. Due to space limits, we only show the means of the results. For the standard deviations, please refer to Appdx. A.8.
|ImageNet||50 classes||100 classes||200 classes|
We evaluate our proposed method on 5 standard datasets for image clustering, namely CIFAR10/20, STL10, ImageNet10 [11, 7], and ImageNet-Dogs [11, 7], and on 3 big ImageNet subsets, namely ImageNet50/100/200 with 50/100/200 classes, respectively [11, 40]. A description of these datasets is given in Appdx. A.6. Our data augmentation setting follows [16, 42]. We first randomly crop images to a desired size (32×32 for CIFAR, 96×96 for STL10, and 224×224 for ImageNet subsets). Then, we perform random horizontal flip, random color jittering, and random grayscale conversion. For datasets which are ImageNet subsets, we further apply Gaussian blurring at the last step. Similar to previous works [7, 20, 19], both the training and test sets are used for CIFAR10, CIFAR20 and STL10 while only the training set is used for other datasets. We also provide results where only the training set is used for CIFAR10, CIFAR20 and STL10 in Appdx. A.8. For STL10, 100,000 auxiliary unlabeled samples are additionally used to train the “representation learning” head. However, when training the “clustering” head, these auxiliary samples are not used since their classes may not appear in the training set.
Model architecture and training setups
as the backbone network when working on the 5 standard datasets and on the 3 big ImageNet subsets, respectively. The “representation learning” head (RL-head) and the “clustering” head (C-head) are two-layer neural networks with ReLU activations. The length of the output vector of the RL-head is 128. The temperature $\tau$ (Eq. 5) is fixed at 0.1. To reduce variance in learning, we train our model with 10 C-subheads (the final $\mathcal{L}_{\text{cluster}}$ in Eq. 8 is the average of $\mathcal{L}_{\text{cluster}}$ over these C-subheads) similar to . This adds only a little extra computation to our model. However, unlike [19, 20, 53], we do not use an auxiliary “over-clustering” head to exploit additional information from data since we think our RL-head can do that effectively.
Training setups for end-to-end and two-stage clustering are provided in Appdx. A.7.
We use three popular clustering metrics, namely Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI), for evaluation. For unlabeled data, ACC is computed via the Kuhn-Munkres algorithm. All of these metrics range from 0 to 1, and higher values indicate better performance. In this work, we report them as percentages.
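For illustration, clustering accuracy can be computed by searching for the one-to-one mapping between predicted clusters and ground-truth classes that maximizes the number of matches. The sketch below uses a brute-force search over permutations as a stand-in for the Kuhn-Munkres algorithm, which is only practical for a handful of clusters; the function name and the toy labels are assumptions.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids, num_clusters):
    """Best-match accuracy between cluster ids and class labels (brute force)."""
    # counts[c][k]: number of samples in predicted cluster c with true class k
    counts = [[0] * num_clusters for _ in range(num_clusters)]
    for y, c in zip(true_labels, cluster_ids):
        counts[c][y] += 1
    # Try every one-to-one assignment of clusters to classes and keep the best.
    best = max(sum(counts[c][perm[c]] for c in range(num_clusters))
               for perm in permutations(range(num_clusters)))
    return best / len(true_labels)

# Cluster ids are a pure relabeling of the classes, so the accuracy is perfect.
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0], 3))  # 1.0
# One sample falls into the wrong cluster.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 1, 0], 2))              # 0.75
```

The Kuhn-Munkres algorithm finds the same optimal mapping in polynomial time, which matters once the number of clusters grows beyond a few.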
5.1.1 End-to-end training
Table 1 compares the performance of our proposed CRLC with a wide range of state-of-the-art deep clustering methods. CRLC clearly outperforms all baselines by a large margin on most datasets. For example, in terms of clustering accuracy (ACC), our method improves over the best baseline (DRC) by 5-7% on CIFAR10/20, STL10, and ImageNet-Dogs. Gains are even larger if we compare with methods that do not explicitly learn representations such as PICA and IIC. CRLC only performs worse than DRC on ImageNet10, which we attribute to our selection of hyperparameters. In addition, even when only the “clustering” head is used, our method still surpasses most of the baselines (e.g., DCCM, IIC). These results suggest that: i) we can learn semantic clusters from data just by minimizing the probability contrastive loss, and ii) combining it with contrastive representation learning improves the quality of the cluster assignment.
To gain better insight into the performance of CRLC, we visualize some success and failure cases in Fig. 2 (and also in Appdx. A.11). We see that samples predicted correctly with high confidence are usually representative of the cluster they belong to. This suggests that CRLC has learned coarse-grained patterns that separate objects at the cluster level. Besides, CRLC has also captured fine-grained instance-level information and is thus able to find nearest neighbors with great similarities in shape, color and texture to the original image. Another interesting observation from Fig. 2 is that the predicted label of a sample is often strongly correlated with that of the majority of its neighbors. It means that: i) CRLC has learned a smooth mapping from images to cluster assignments, and ii) CRLC tends to make “collective” errors (the first and third rows in Fig. 2c). Other kinds of errors may come from the closeness between classes (e.g., horse vs. dog), or from some adversarial signals in the input (e.g., the second row in Fig. 2b). Solutions for fixing these errors are out of the scope of this paper and are left for future work.
5.1.2 Two-stage training
Although CRLC is originally proposed as an end-to-end clustering algorithm, it can be easily extended to a two-stage clustering algorithm similar to SCAN. To do that, we first pretrain the RL-head and the backbone network with $\mathcal{L}_{\text{FC}}$ (Eq. 6). Next, for every sample in the training data, we find a set of nearest neighbors based on the cosine similarity between feature vectors produced by the pretrained network. In the second stage, we train the C-head by minimizing $\mathcal{L}_{\text{cluster}}$ (Eq. 8) with the positive pair consisting of a sample and a neighbor drawn from its set of nearest neighbors. We call this variant of CRLC “two-stage” CRLC. In fact, we did try training both the C-head and the RL-head in the second stage by minimizing $\mathcal{L}_{\text{CRLC}}$ but could not achieve good results compared to training only the C-head. We hypothesize that finetuning the RL-head causes the model to capture too much fine-grained information and ignore important cluster-level information, which hurts the clustering performance.
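The neighbor-mining step between the two stages can be sketched as follows. This is a brute-force illustration with quadratic cost, assuming the pretrained features are available as plain lists of floats; the function name and the toy features are hypothetical.

```python
import math

def nearest_neighbors(features, k):
    """For each feature vector, return the indices of its k most cosine-similar others."""
    normed = []
    for v in features:
        norm = math.sqrt(sum(a * a for a in v))
        normed.append([a / norm for a in v])
    neighbors = []
    for i, zi in enumerate(normed):
        sims = [(sum(a * b for a, b in zip(zi, zj)), j)
                for j, zj in enumerate(normed) if j != i]
        sims.sort(reverse=True)  # highest cosine similarity first
        neighbors.append([j for _, j in sims[:k]])
    return neighbors

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(nearest_neighbors(feats, k=1))  # [[1], [0], [3], [2]]
```

In the second stage, a sample and one index drawn from its neighbor list would form the positive pair, while other samples supply the negatives.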
In Table 2, we show the clustering results of “two-stage” CRLC on ImageNet50/100/200. Results on CIFAR10/20 and STL10 are provided in Appdx. A.9. For fair comparison with SCAN, we use the same settings as SCAN (details in Appdx. A.7). It is clear that “two-stage” CRLC outperforms SCAN on all datasets. A possible reason is that besides pulling neighboring samples close together, our proposed probability contrastive loss also pushes apart samples that are not neighbors (in the negative pairs), while SCAN’s loss does not. Thus, by seeing more pairs of samples, our model is likely to form better clusters.
5.2 Semi-supervised Learning
Given the good performance of CRLC on clustering, it is natural to ask whether this model also performs well on semi-supervised learning (SSL) or not. To adapt for this new task, we simply train CRLC with the new objective (Eq. 12). The model architecture and training setups remain almost the same (changes in Appdx. A.13).
From Table 3, we see that CRLC-semi, though not designed especially for SSL, significantly outperforms many state-of-the-art SSL methods (brief discussion in Appdx. A.12). For example, CRLC-semi achieves about 30% and 10% lower error than MixMatch and UDA, respectively, on CIFAR10 with 4 labeled samples per class. Interestingly, the power of CRLC-semi becomes obvious when the number of labeled data is pushed to the limit. While most baselines cannot work with 1 or 2 labeled samples per class, CRLC-semi still performs consistently well with very low standard deviations. We hypothesize the reason is that CRLC-semi, via minimizing its contrastive objective, models the “smoothness” of data better than the SSL baselines. For more results on SSL, please check Appdx. A.14.
5.3 Ablation Study
Comparison of different critics in the probability contrastive loss
In Fig. 3 left, we show the performance of CRLC on CIFAR10 and CIFAR20 w.r.t. different critic functions. Apparently, the theoretically sound “log-of-dot-product” critic (Eq. 9) gives the best results. The “negative-L2-distance” critic is slightly worse than the “log-of-dot-product” critic while the “dot-product” and the “negative-JS-divergence” critics are the worst.
Contribution of the feature contrastive loss
We investigate how much our model’s performance is affected when we change the coefficient of $\mathcal{L}_{\text{FC}}$ ($\lambda_2$ in Eq. 11) to different values. Results on CIFAR20 are shown in Fig. 3 middle, right. Interestingly, minimizing both $\mathcal{L}_{\text{PC}}$ and $\mathcal{L}_{\text{FC}}$ simultaneously results in lower values of $\mathcal{L}_{\text{PC}}$ than minimizing only $\mathcal{L}_{\text{PC}}$ ($\lambda_2 = 0$). It implies that $\mathcal{L}_{\text{FC}}$ provides the model with more information to form better clusters. In order to achieve good clustering results, $\lambda_2$ should be large enough relative to the coefficient of $\mathcal{L}_{\text{PC}}$, which is 1. However, too large a $\lambda_2$ results in a high value of $\mathcal{L}_{\text{PC}}$, which may hurt the model’s performance. For most datasets including CIFAR20, the optimal value of $\lambda_2$ is 10.
Nonparametric implementation of CRLC
Besides using SimCLR, we can also implement the two contrastive losses in CRLC using MemoryBank (Section 3.1). This reduces the memory storage by about 30% and the training time by half (on CIFAR10 with ResNet34 as the backbone and a minibatch size of 512). However, MemoryBank-based CRLC usually takes longer to converge and performs worse than its SimCLR-based counterpart, as shown in Fig. 4 left. The contributions of the number of negative samples and the momentum coefficient to the performance of MemoryBank-based CRLC are analyzed in Appdx. A.10.2.
We visualize the manifold of the continuous features learned by CRLC in Fig. 4 middle. We observe that CRLC usually groups features into well-separated clusters. This is because the information captured by the C-head has affected the RL-head. However, if the RL-head is learned independently (e.g., in SimCLR), the clusters also emerge but are usually close together (Fig. 4 right). Through both cases, we see the importance of contrastive representation learning for clustering.
We proposed a novel clustering method named CRLC that exploits both the fine-grained instance-level information and the coarse-grained cluster-level information from data via a unified sample-oriented contrastive learning framework. CRLC showed promising results not only in clustering but also in semi-supervised learning. In the future, we plan to enhance CRLC so that it can handle neighboring samples in a principled way rather than just views. We also want to extend CRLC to other domains (e.g., videos, graphs) and problems (e.g., object detection).
-  (2019) Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371. Cited by: §4.1.
-  (2019) ReMixMatch: semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785. Cited by: §A.12, §A.14, Table 8, Table 3.
-  (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050–5060. Cited by: §A.12, §A.14, Table 8, §5.2, Table 3.
Deep clustering for unsupervised learning of visual features. In
Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §4.1.
-  (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: §4.1.
Deep discriminative clustering analysis. arXiv preprint arXiv:1905.01681. Cited by: Table 1.
-  (2017) Deep adaptive image clustering. In Proceedings of the IEEE international conference on computer vision, pp. 5879–5887. Cited by: §1, §4.2, §5, Table 1.
-  (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §A.5, §A.7, §2.1, §3.1, §4.1, Figure 4, §5, §5.3.
-  (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: §5.
-  (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §A.14.
Imagenet: a large-scale hierarchical image database.
2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §5.
-  (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1, pp. 766–774. Cited by: §4.1.
-  (2017) Nonparametric variational auto-encoders for hierarchical representation learning.. In ICCV, pp. 5104–5112. Cited by: §A.7.
-  (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §4.1.
-  (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §4.1.
-  (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §A.7, §A.7, §3.1, §4.1, §5.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.
-  (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.1.
-  (2020) Deep semantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8849–8858. Cited by: §1, §4.2, §5, §5, §5.1.1, Table 1.
-  (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9865–9874. Cited by: §1, §4.2, §5, §5, §5.1.1, Table 1.
-  (2017) Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1965–1972. Cited by: §1, §4.2.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.2.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §5.
-  (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: Table 8.
-  (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 2. Cited by: §A.14, Table 8.
-  (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966. Cited by: §4.1.
-  (2020) Contrastive clustering. arXiv preprint arXiv:2009.09687. Cited by: §4.2.
SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §A.13, §A.7.
-  (2020) Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pp. 875–884. Cited by: §2.1.
-  (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pp. 3111–3119. Cited by: §4.1.
-  (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §4.1.
A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 419–426. Cited by: §4.1.
-  (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.1, §4.1.
-  (2020) Improving unsupervised image clustering with robust learning. arXiv preprint arXiv:2012.11150. Cited by: §4.2.
-  (2019) On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. Cited by: §A.3, §1, §2.1, §2.1, §4.1.
-  (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685. Cited by: §A.12, §A.13, §A.14, §A.14, Table 8, Table 3.
-  (2019) Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222. Cited by: §2.1.
-  (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204. Cited by: Table 8.
-  (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §A.7, §2.1, §2.1, §4.1.
-  (2020) SCAN: learning to classify images without labels. In European Conference on Computer Vision, pp. 268–285. Cited by: §A.7, Table 7, §4.2, §5, §5, §5.1.2, §5.1.2, Table 2.
-  (2019) Deep comprehensive correlation mining for image clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8150–8159. Cited by: Table 1.
-  (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §A.5, §2.1, §3.1, §4.1, Figure 4, §5, §5.3.
-  (2020) Delving into inter-image invariance for unsupervised visual representations. arXiv preprint arXiv:2008.11702. Cited by: §4.1.
-  (2016) Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487. Cited by: §1, §4.2, Table 1.
-  (2019) Unsupervised data augmentation for consistency training. Cited by: §A.12, §A.14, Table 8, §5.2, Table 3.
-  (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In International Conference on Machine Learning, pp. 3861–3870. Cited by: §1, §4.2.
-  (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156. Cited by: §1, §4.2, Table 1.
-  (2019) Deep clustering by Gaussian mixture variational autoencoders with graph embedding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6440–6449. Cited by: §1.
-  (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §4.1.
-  (2020) Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems 33. Cited by: Table 1.
-  (2020) Online deep clustering for unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6688–6697. Cited by: §4.1.
-  (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §A.12.
-  (2020) Deep robust clustering by contrastive learning. arXiv preprint arXiv:2008.03030. Cited by: §4.2, §5, §5.1.1, Table 1.
-  (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012. Cited by: §4.1.
Appendix A Appendix
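A.1 Possible critics for the probability contrastive loss
We list here several critics that could be used in the probability contrastive loss $\mathcal{L}_{\text{PC}}$. If we simply regard a critic $f$ as a similarity measure between two probability vectors $p$ and $q$, then $f$ could be the negative Jensen-Shannon (JS) divergence between $p$ and $q$ (the JS divergence is chosen for its symmetry, and the negative sign reflects the fact that $f$ is a similarity measure rather than a divergence):
$$f(p, q) = -D_{\text{JS}}(p \,\|\, q) = -\frac{1}{2}\left[ D_{\text{KL}}\left(p \,\Big\|\, \frac{p+q}{2}\right) + D_{\text{KL}}\left(q \,\Big\|\, \frac{p+q}{2}\right) \right]$$
or the negative L2 distance between $p$ and $q$:
$$f(p, q) = -\|p - q\|_2$$
In both cases, $f$ achieves its maximum value when $p = q$ and its minimum value when $p$ and $q$ are different one-hot vectors.
We can also define $f$ as the dot product of $p$ and $q$:
$$f(p, q) = p^\top q = \sum_{c=1}^{C} p_c q_c$$
where $C$ is the number of clusters. However, the maximum value of this critic is not attained whenever $p = q$ but only when $p$ and $q$ are the same one-hot vector (see Appdx. A.2 for details). This means that maximizing this critic encourages not only the consistency between $p$ and $q$ but also the confidence of $p$ and $q$.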
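The behaviour of the three critics above can be checked numerically. The following is a minimal sketch rather than the authors' implementation: the function names (`neg_js`, `neg_l2`, `dot`) and the example vectors are hypothetical, and the L2 critic is assumed to be the unsquared distance.

```python
import math

def kl(p, q):
    # KL(p || q) with the convention 0 * log(0 / x) = 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def neg_js(p, q):
    # Negative Jensen-Shannon divergence
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return -0.5 * (kl(p, m) + kl(q, m))

def neg_l2(p, q):
    # Negative (unsquared) L2 distance
    return -math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def dot(p, q):
    # Dot product critic
    return sum(pi * qi for pi, qi in zip(p, q))

soft = [0.5, 0.3, 0.2]    # a non-confident probability vector (hypothetical)
hot_a = [1.0, 0.0, 0.0]   # one-hot at component 0
hot_b = [0.0, 1.0, 0.0]   # one-hot at component 1

for name, f in [("neg_js", neg_js), ("neg_l2", neg_l2), ("dot", dot)]:
    # columns: f(soft, soft), f(hot_a, hot_a), f(hot_a, hot_b)
    print(name, f(soft, soft), f(hot_a, hot_a), f(hot_a, hot_b))
```

For the JS and L2 critics, the pair (soft, soft) already attains the maximum value, whereas the dot product critic scores (soft, soft) strictly below (hot_a, hot_a), which illustrates why it also rewards confidence.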
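A.2 Global maxima and minima of the dot product critic for probabilities
The dot product critic $f(p, q) = p^\top q$ achieves its global maximum value of 1 when $p$ and $q$ are the same one-hot vector, and its global minimum value of 0 when $p$ and $q$ are different one-hot vectors.
Proof. Since $p_c \geq 0$ and $q_c \geq 0$ for all $c$, we have $p^\top q = \sum_{c=1}^{C} p_c q_c \geq 0$. This minimum value is achieved when $p_c q_c = 0$ for all $c \in \{1, \dots, C\}$, that is, when $p$ and $q$ place their mass on disjoint sets of components. Different one-hot vectors satisfy this condition.
In addition, since $q_c \leq 1$, we also have $p^\top q = \sum_{c=1}^{C} p_c q_c \leq \sum_{c=1}^{C} p_c = 1$. This maximum value is achieved when $p_c = 0$ or $q_c = 1$ for all $c$. Because $\sum_c p_c = 1$, at least one component has $p_c > 0$ and therefore $q_c = 1$, so $q$ is one-hot. Every other component $c'$ has $q_{c'} = 0$ and hence $p_{c'} = 0$, which means $p$ and $q$ must be the same one-hot vector. ∎
Since the gradient of $p^\top q$ w.r.t. $p$ is proportional to $q$, if we fix $q$ and only optimize $p$, maximizing $p^\top q$ via gradient ascent will encourage $p$ to be one-hot at the component at which $q$ is the largest. Similarly, minimizing $p^\top q$ via gradient descent will encourage $p$ to be one-hot at the component at which $q$ is the smallest.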
When $q$ is close to the uniform vector $(1/C, \dots, 1/C)$, all the components of $p$ have similar gradients. Although the update does not change the relative order of the components of $p$, it still pushes $p$ towards the saddle point $(1/C, \dots, 1/C)$. However, the chance that a model gets stuck at this saddle point is tiny unless we explicitly force it to happen (e.g., by maximizing the entropy of $p$).
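The claim that gradient ascent on the dot product critic with the second argument fixed drives the first argument towards a one-hot vector can be illustrated with a small simulation. This is only a sketch under stated assumptions: the optimized vector is parameterized as a softmax over logits (an assumption made here for convenience), and the fixed vector, step size and number of steps are hypothetical.

```python
import math

def softmax(u):
    # Numerically stable softmax
    m = max(u)
    e = [math.exp(x - m) for x in u]
    s = sum(e)
    return [x / s for x in e]

def ascend_dot(q, steps=2000, lr=1.0):
    """Gradient ascent on p^T q w.r.t. the logits u of p = softmax(u), with q fixed."""
    u = [0.0] * len(q)  # start from the uniform distribution
    for _ in range(steps):
        p = softmax(u)
        pq = sum(pi * qi for pi, qi in zip(p, q))
        # Chain rule through softmax: d(p^T q)/du_c = p_c * (q_c - p^T q)
        u = [uc + lr * pc * (qc - pq) for uc, pc, qc in zip(u, p, q)]
    return softmax(u)

q = [0.2, 0.5, 0.3]  # hypothetical fixed vector, largest at component 1
p = ascend_dot(q)
print([round(x, 3) for x in p])
```

After enough steps, most of the mass of the optimized vector sits on the component at which the fixed vector is largest, even though the fixed vector itself is far from one-hot.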
For better understanding of the optimization dynamics, we visualize the surface of with in Fig. (a)a. has the same global optimal values and surface as
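A.3 Derivation of the InfoNCE lower bound
The variational lower bound of the mutual information $I(X; \tilde{X})$ can be computed as follows:
$$I(X; \tilde{X}) = \mathbb{E}_{p(x, \tilde{x})}\left[\log \frac{p(\tilde{x}|x)}{p(\tilde{x})}\right] \geq \mathbb{E}_{p(x, \tilde{x})}\left[\log \frac{q_\theta(\tilde{x}|x)}{p(\tilde{x})}\right]$$
where $q_\theta(\tilde{x}|x)$ is the variational approximation of $p(\tilde{x}|x)$.
Following prior work on variational bounds of mutual information, we assume that $q_\theta(\tilde{x}|x)$ belongs to the energy-based variational family that uses a critic $f(x, \tilde{x})$ and is scaled by the data density $p(\tilde{x})$:
$$q_\theta(\tilde{x}|x) = \frac{p(\tilde{x})\, e^{f(x, \tilde{x})}}{Z(x)}$$
where $Z(x) = \mathbb{E}_{p(\tilde{x})}\left[e^{f(x, \tilde{x})}\right]$ is the partition function, which does not depend on $\tilde{x}$.
Since the optimal value of $q_\theta(\tilde{x}|x)$ is $p(\tilde{x}|x)$, we have:
$$\frac{p(\tilde{x})\, e^{f^*(x, \tilde{x})}}{Z(x)} = p(\tilde{x}|x) \;\Leftrightarrow\; e^{f^*(x, \tilde{x})} = Z(x)\, \frac{p(\tilde{x}|x)}{p(\tilde{x})}$$
which means that, at the optimum, $e^{f^*(x, \tilde{x})}$ is proportional to $p(\tilde{x}|x) / p(\tilde{x})$, with a factor $Z(x)$ that does not depend on $\tilde{x}$.
Substituting the energy-based family into the lower bound gives $I(X; \tilde{X}) \geq \mathbb{E}_{p(x, \tilde{x})}\left[f(x, \tilde{x})\right] - \mathbb{E}_{p(x)}\left[\log Z(x)\right]$. Here, we encounter the intractable $\log Z(x)$. To form a tractable lower bound of $I(X; \tilde{X})$, we replace $\log Z(x)$ with its variational upper bound $\log Z(x) \leq \frac{Z(x)}{a(x)} + \log a(x) - 1$:
$$I(X; \tilde{X}) \geq \mathbb{E}_{p(x, \tilde{x})}\left[f(x, \tilde{x})\right] - \mathbb{E}_{p(x)}\left[\frac{Z(x)}{a(x)} + \log a(x) - 1\right] \tag{21}$$
where $a(x) > 0$ is the variational approximation of $Z(x)$. We should choose $a(x)$ close to $Z(x)$ so that the variance of the bound in Eq. 21 is small. Recalling that $Z(x) = \mathbb{E}_{p(\tilde{x})}\left[e^{f(x, \tilde{x})}\right]$, we define $a(x)$ as the following Monte Carlo estimate over $M$ samples $\tilde{x}_1, \dots, \tilde{x}_M$:
$$a(x) = \frac{1}{M} \sum_{j=1}^{M} e^{f(x, \tilde{x}_j)}$$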
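To make the role of the sample-based approximation of the partition function concrete, the sketch below evaluates the standard InfoNCE estimate from a square matrix of critic values. It is an illustration under assumptions, not the authors' code: the function name `info_nce` and the convention that `scores[i][j]` holds the critic value between sample i and the second view of sample j (with positive pairs on the diagonal) are hypothetical choices.

```python
import math

def info_nce(scores):
    """InfoNCE estimate from an M x M matrix of critic values.

    scores[i][j] is assumed to be the critic value between sample i and the
    second view of sample j; diagonal entries are the positive pairs.
    """
    M = len(scores)
    total = 0.0
    for i in range(M):
        # log of (1/M) * sum_j exp(scores[i][j]), computed via log-sum-exp
        m = max(scores[i])
        log_a = m + math.log(sum(math.exp(s - m) for s in scores[i]) / M)
        total += scores[i][i] - log_a
    return total / M

M = 4
uninformative = [[0.0] * M for _ in range(M)]
sharp = [[10.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
print(info_nce(uninformative), info_nce(sharp), math.log(M))
```

A critic that cannot tell positives from negatives yields an estimate of zero, while a critic that scores positives much higher than negatives approaches, but never exceeds, the logarithm of the number of samples.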
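A.4 Derivation of the scaled dot product critic in representation learning
Recall that in contrastive representation learning, the critic $f$ is defined as the scaled dot product between two unit-normed feature vectors $z$ and $\tilde{z}$:
$$f(x, \tilde{x}) = \frac{z^\top \tilde{z}}{\tau}$$
where $\tau > 0$ is a scaling (temperature) coefficient. Interestingly, this form of $f$ is consistent with the form of the optimal critic $f^*$ and equals $\log p(\tilde{z}|z)$ up to an additive constant. To see why, let us assume that the distribution of $\tilde{z}$ given $z$ is modeled by an isotropic Gaussian distribution with $z$ as the mean vector and $\tau \mathrm{I}$ as the covariance matrix. Then, we have:
$$\log p(\tilde{z}|z) = -\frac{\|\tilde{z} - z\|_2^2}{2\tau} + \text{const} = -\frac{\|\tilde{z}\|_2^2 - 2 z^\top \tilde{z} + \|z\|_2^2}{2\tau} + \text{const} = \frac{z^\top \tilde{z}}{\tau} - \frac{1}{\tau} + \text{const}$$
where $\|z\|_2^2 = \|\tilde{z}\|_2^2 = 1$ due to the fact that $z$ and $\tilde{z}$ are unit-normed vectors.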
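The identity behind this derivation, namely that for unit-normed vectors the Gaussian exponent and the scaled dot product differ only by a constant, can be verified numerically. The sketch below assumes the covariance is the scaling coefficient times the identity; the vectors, the value of `tau` and the helper names are hypothetical.

```python
import math

def normalize(v):
    # Project a vector onto the unit sphere
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gaussian_exponent(z, z_tilde, tau):
    # Exponent of an isotropic Gaussian with mean z and covariance tau * I:
    # -||z_tilde - z||^2 / (2 * tau)
    return -sum((a - b) ** 2 for a, b in zip(z_tilde, z)) / (2.0 * tau)

def scaled_dot(z, z_tilde, tau):
    # Scaled dot product critic: z^T z_tilde / tau
    return sum(a * b for a, b in zip(z, z_tilde)) / tau

tau = 0.1
z = normalize([1.0, 2.0, -1.0])
z_tilde = normalize([0.5, 1.5, 0.3])

# For unit-normed vectors the two quantities differ by the constant 1 / tau.
gap = scaled_dot(z, z_tilde, tau) - gaussian_exponent(z, z_tilde, tau)
print(gap, 1.0 / tau)
```

Because the gap does not depend on the particular pair of vectors, it cancels in the softmax-style normalization of a contrastive loss.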
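A.5 Analysis of the gradient of the probability contrastive loss
Recall that the probability contrastive loss for a sample $i$ with the “log-of-dot-product” critic is computed as follows:
$$\mathcal{L}_{\text{PC}, i} = -\log \frac{q_i^\top \tilde{q}_i}{\sum_{k=1}^{M} q_i^\top \tilde{q}_k} = -\log\left(q_i^\top \tilde{q}_i\right) + \log \sum_{k=1}^{M} q_i^\top \tilde{q}_k$$
where the sum runs over the positive sample and the negative samples. Because $q_i$ is always parametric while $\tilde{q}_k$ ($k = 1, \dots, M$) can be either parametric (if the loss is implemented via the SimCLR framework) or non-parametric (if the loss is implemented via the MemoryBank framework), we focus on the gradient of $\mathcal{L}_{\text{PC}, i}$ back-propagating through $q_i$. In practice, $q_i$ is usually implemented by applying softmax to the logit vector $u_i$:
$$q_{i,c} = \frac{\exp(u_{i,c})}{\sum_{c'=1}^{C} \exp(u_{i,c'})}$$
where $q_{i,c}$ denotes the $c$-th component of $q_i$. Similarly, $u_{i,c}$ is the $c$-th component of $u_i$.
The gradient of $\mathcal{L}_{\text{PC}, i}$ w.r.t. $u_{i,c}$ is given by:
$$\frac{\partial \mathcal{L}_{\text{PC}, i}}{\partial u_{i,c}} = -\frac{\partial \log\left(q_i^\top \tilde{q}_i\right)}{\partial u_{i,c}} + \frac{\partial \log \sum_{k=1}^{M} q_i^\top \tilde{q}_k}{\partial u_{i,c}} \tag{27}$$
Using $\partial q_{i,c'} / \partial u_{i,c} = q_{i,c'}\left(\mathbb{1}[c' = c] - q_{i,c}\right)$, the first term in Eq. 27 is equivalent to:
$$-\frac{1}{q_i^\top \tilde{q}_i} \sum_{c'=1}^{C} \tilde{q}_{i,c'} \frac{\partial q_{i,c'}}{\partial u_{i,c}} = q_{i,c} - \frac{q_{i,c}\, \tilde{q}_{i,c}}{q_i^\top \tilde{q}_i}$$
And the second term in Eq. 27 is equivalent to:
$$\frac{1}{\sum_{k=1}^{M} q_i^\top \tilde{q}_k} \sum_{k=1}^{M} \sum_{c'=1}^{C} \tilde{q}_{k,c'} \frac{\partial q_{i,c'}}{\partial u_{i,c}} = \frac{q_{i,c} \sum_{k=1}^{M} \tilde{q}_{k,c}}{\sum_{k=1}^{M} q_i^\top \tilde{q}_k} - q_{i,c}$$
Thus, we have:
$$\frac{\partial \mathcal{L}_{\text{PC}, i}}{\partial u_{i,c}} = q_{i,c} \left( \frac{\sum_{k=1}^{M} \tilde{q}_{k,c}}{\sum_{k=1}^{M} q_i^\top \tilde{q}_k} - \frac{\tilde{q}_{i,c}}{q_i^\top \tilde{q}_i} \right)$$
where $u_{i,c}$ is the $c$-th logit of $q_i$. Since $q_i$ is encouraged to be one-hot during training (see Appdx. A.2), the gradient may not be defined, because the denominator $q_i^\top \tilde{q}_i$ becomes zero if we do not prevent $\tilde{q}_i$ from being a different one-hot vector. However, even when the denominator is nonzero, the update still does not happen as expected when $q_i$ is one-hot. To see why, consider a simple scenario in which $q_i$ is one-hot and $\tilde{q}_i$ assigns probability 0.001 to the component at which $q_i$ equals 1. The denominator is then $q_i^\top \tilde{q}_i = 0.001 \neq 0$. By maximizing $\log\left(q_i^\top \tilde{q}_i\right)$, we want to push $q_i$ toward $\tilde{q}_i$. Thus, we expect the gradient to be negative w.r.t. the logit of the component at which $q_i$ equals 1 and positive w.r.t. the logits of the components favored by $\tilde{q}_i$. However, the gradients w.r.t. $u_{i,c}$ are 0 for all $c$:
$$\frac{\partial \log\left(q_i^\top \tilde{q}_i\right)}{\partial u_{i,c}} = q_{i,c} \left( \frac{\tilde{q}_{i,c}}{q_i^\top \tilde{q}_i} - 1 \right) = \begin{cases} 1 \cdot \left( \frac{0.001}{0.001} - 1 \right) = 0 & \text{if } q_{i,c} = 1 \\ 0 & \text{if } q_{i,c} = 0 \end{cases}$$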
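The vanishing gradient at a one-hot assignment can be reproduced with a few lines of code. This is a hedged sketch rather than the authors' implementation: it assumes the softmax parameterization described above, and the two-cluster target `q_tilde = [0.999, 0.001]`, the logit values and the function names are hypothetical, chosen only so that the dot product equals the 0.001 quoted in the text.

```python
import math

def softmax(u):
    # Numerically stable softmax
    m = max(u)
    e = [math.exp(x - m) for x in u]
    s = sum(e)
    return [x / s for x in e]

def grad_log_dot(u, q_tilde):
    """Gradient of log(q^T q_tilde) w.r.t. the logits u, where q = softmax(u)."""
    q = softmax(u)
    qq = sum(a * b for a, b in zip(q, q_tilde))
    # d log(q^T q_tilde) / du_c = q_c * (q_tilde_c / (q^T q_tilde) - 1)
    return [qc * (qtc / qq - 1.0) for qc, qtc in zip(q, q_tilde)]

q_tilde = [0.999, 0.001]                         # hypothetical target assignment
saturated = grad_log_dot([-40.0, 40.0], q_tilde)  # q is numerically [0, 1]
soft = grad_log_dot([0.0, 0.0], q_tilde)          # q = [0.5, 0.5]
print(saturated, soft)
```

With saturated logits the gradient is numerically zero in every component even though the two assignments disagree, whereas a non-saturated assignment receives a clear signal towards the target.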