Clustering is one of the fundamental tasks in computer vision and machine learning. With the development of the Internet, we can easily collect thousands of images and videos every day, most of which are unlabeled, and manually labeling these data is expensive and time-consuming. In order to make use of these unlabeled data and investigate their correlations, unsupervised clustering has drawn much attention recently; it aims to categorize similar data into one cluster based on some similarity measure.
Image clustering is a challenging task due to the variance of image shape and appearance in the wild. Traditional clustering methods [47, 18, 6, 33] and subspace clustering [29, 16] may fail in this case for two obvious reasons: firstly, hand-crafted features have limited capacity and cannot dynamically adjust to capture the prior distribution, especially when dealing with large-scale real-world images; secondly, the separation of feature extraction and clustering leads to sub-optimal solutions. Recently, with the boom of deep unsupervised representation learning, many researchers have shifted their attention to deep unsupervised clustering [38, 22, 8], which can address the aforementioned limitations. Typically, to learn a better representation, [2, 43, 44] adopt the auto-encoder or maximize the mutual information between features. DAC
constructs positive and negative pairs based on cosine similarity to guide network training.
However, these methods still miss several points. Firstly, feature representations that only consider reconstruction or mutual information lack discriminative power. Secondly, traditional clustering methods like K-means make effective use of the category assumption on data; in contrast, DAC only focuses on pair-wise correlation and neglects the category information, which limits its performance. Thirdly, there are other correlations that are helpful for deep image feature learning; for example, it has been shown that measuring feature equivariance can benefit image representation understanding.
To tackle above issues, as shown in Figure 1
(a), we propose a novel method, namely deep comprehensive correlation mining (DCCM), which comprehensively explores the correlations among different samples (red line), the local robustness to geometric transformations (yellow line), the correlation between different layer features of the same sample (blue line), and their inter-correlations (green lines) to learn discriminative representations and train the network in a progressive manner. First, for the correlation among different samples, we adopt a deep convolutional neural network (CNN) to generate the prediction feature for the input image. With proper constraints, the learned prediction feature tends to be one-hot. We then compute the cosine similarity and construct the similarity graph. Based on the similarity graph and prediction feature, we apply a large threshold to obtain a highly-confident pseudo-graph and pseudo-labels to guide the feature learning. Secondly, for the local robustness to small perturbations, we add a small perturbation or transformation to the original input image to generate a transformed image. Under the local robustness assumption, the prediction of the transformed image should be consistent with that of the original image, so we can use the prediction of the original image to guide the feature learning of the transformed image. Thirdly, the feature representation of a deep layer should preserve the distinct information of the input, so we maximize the mutual information between the deep layer feature and shallow layer feature of the same sample. To make the representation more discriminative, we further extend it to a triplet form by incorporating the graph information above. Finally, we combine the loss functions of these three different aspects and jointly investigate these correlations in an end-to-end way. Results in Figure 1(c) show the superiority of our method (purple curve) over the state-of-the-art method DAC (red curve).
Our main contributions are summarized as follows:
We propose a novel end-to-end deep clustering framework to comprehensively mine various kinds of correlations, and select highly-confident information to train the network in a progressive way;
We first derive the rationality of the pseudo-label and introduce the highly-confident pseudo-label loss to directly exploit the category information and guide the unsupervised training of the deep network;
We make use of the local robustness assumption to small perturbations, aiming at learning better representations. Instead of simply constraining the feature distance, the above pseudo-graph and pseudo-labels are utilized to guide discriminative feature learning for samples after small perturbations;
We extend the instance-level mutual information to triplet-level, and come up with triplet mutual information loss to learn more discriminative features.
2 Related Work
2.1 Deep Clustering
Deep clustering methods mainly aim to combine deep feature learning [2, 41, 46] with traditional clustering methods [47, 18, 6]. The auto-encoder (AE) is a very popular feature learning method for deep clustering, and many methods minimize the loss of a traditional clustering method to regularize the learning of the auto-encoder's latent representation. For example, [43, 19] propose deep embedding clustering, which utilizes the KL-divergence loss. A related method also uses the KL-divergence loss, but adds a noisy encoder to learn a more robust representation. Another adopts the K-means loss, and [22, 38, 42] incorporate the self-representation based subspace clustering loss.
Besides the auto-encoder, some methods directly design specific loss functions on the last layer output. One approach introduces a recurrent-agglomerative framework to merge clusters that are close to each other. Another explores the correlation among different samples based on the label features and uses such similarity as supervision. A further method extends spectral clustering into a deep formulation.
2.2 Deep Unsupervised Feature Learning
These methods mainly focus on deep unsupervised learning of representations. Building on Generative Adversarial Networks (GAN), one method adds an encoder to extract visual features. Another directly uses fixed targets, uniformly sampled from a unit sphere, to constrain the deep feature assignment. A third utilizes pseudo-labels computed by K-means on the output features as supervision to train deep neural networks. Deep InfoMax maximizes the mutual information between the input and output of a deep neural network encoder.
2.3 Self-supervised Learning
Self-supervised learning [23, 25] generally needs to design a pretext task, where a target objective can be computed without supervision. The assumption is that the representations learned for the pretext task contain high-level semantic information that is useful for solving downstream tasks of interest, such as image classification. For example, one pretext task predicts the relative location of image patches, and [34, 35] predict the permutation of a "jigsaw puzzle" created from the full image. Another method regards each image as an individual class and generates multiple transformed versions of it by data augmentation to train the network. A further task rotates an image randomly by one of four angles and lets the deep model predict the rotation.
3 Deep Comprehensive Correlation Mining
Without labels, correlation plays the most important role in deep clustering. In this section, we first construct the pseudo-graph to explore binary correlations between samples and start the network training. Then we propose the pseudo-label loss to make full use of the category information behind the data. Next, we mine the local robustness of predictions before and after applying transformations to the input image. We also lift the instance-level mutual information to a triplet form to introduce discriminability. Finally, we combine these parts to obtain our proposed method.
3.1 Preliminary: Pseudo-graph Supervision
We first compute the similarity among samples and select highly-confident pair-wise information to guide the network training by constructing a pseudo-graph. Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$ be the unlabeled dataset, where $x_i$ is the $i$-th image and $n$ is the total number of images. Denote by $k$ the total number of classes. We aim to learn a deep CNN based mapping function $f_\theta$, which is parameterized by $\theta$. Then we use $l_i = f_\theta(x_i)$ to represent the prediction feature of image $x_i$ after the soft-max layer of the CNN. It has the following properties: $l_i \geq 0$ and $\mathbf{1}^\top l_i = 1$.
Based on the label features, the cosine similarity between the $i$-th and the $j$-th samples can be computed by $s_{ij} = \frac{\langle l_i, l_j \rangle}{\|l_i\|_2 \|l_j\|_2}$, where $\langle \cdot, \cdot \rangle$ is the dot product of two vectors. Similar to DAC, we can construct the pseudo-graph $W$ by setting a large threshold: $W_{ij} = 1$ if $s_{ij}$ exceeds the threshold, and $W_{ij} = 0$ otherwise.
If the similarity between two samples is larger than the threshold, we judge that these two samples belong to the same class ($W_{ij} = 1$), and the similarity of these samples should be maximized. Otherwise ($W_{ij} = 0$), the similarity of these samples should be minimized. The pseudo-graph supervision can be defined accordingly. (For this loss function there are many choices, such as the contrastive Siamese net loss [4, 30] regularizing the distance between two samples, and the binary cross-entropy loss regularizing the similarity.)
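The pseudo-graph construction and its binary cross-entropy supervision can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation; the function names and the threshold value are assumptions.

```python
import torch
import torch.nn.functional as F

def pseudo_graph(features, thresh=0.95):
    """Build a highly-confident pseudo-graph from softmax prediction features.

    features: (n, k) tensor of per-sample prediction features l_i.
    Returns W with W[i, j] = 1 when cosine similarity exceeds `thresh`, else 0.
    """
    normed = F.normalize(features, dim=1)   # unit-norm rows
    sim = normed @ normed.t()               # pairwise cosine similarity
    return (sim > thresh).float()

def pseudo_graph_loss(features, W):
    """Binary cross-entropy between pairwise similarities and the pseudo-graph."""
    normed = F.normalize(features, dim=1)
    sim = (normed @ normed.t()).clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(sim, W)
```

Because the features are softmax outputs, all similarities fall in [0, 1], so they can be fed to the binary cross-entropy directly.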
Please note that there are two differences between our pseudo-graph and that in DAC: 1) Unlike the strong norm constraint in DAC, we relax this assumption and only take the output after the softmax layer. This relaxation increases the capacity of the labeling feature and finally induces a better result in our experiments. 2) Instead of the dynamically decreasing threshold in DAC, we only need a fixed threshold. This prevents the training from being harmed by noisy false-positive pairs.
3.2 Pseudo-label Supervision
The correlation explored in the pseudo-graph is not transitive and is limited to pair-wise samples. To address this issue, in this subsection we propose the novel pseudo-label loss and prove its rationality. We first prove the existence of a $k$-partition of the pseudo-graph, which can naturally be regarded as pseudo-labels. We then state that this partition makes the optimal solution of Eq. (3) lead to one-hot predictions, which formulate the pseudo-labels. Finally, the pseudo-label loss is introduced to optimize the convolutional neural network.
Existence of the $k$-partition. The binary relation between samples defined in Eq. (3) is not transitive: $W_{ik}$ is not determined by $W_{ij}$ and $W_{jk}$, and this may lead to instability in training. Therefore, we introduce Lemma 1 to extend it to a stronger relation.
Lemma 1. For any weighted complete graph $G$ with weight $w_{ij}$ for edge $(i, j)$, if all edge weights are distinct, then for any $k \in \{1, \dots, n\}$ there exists a threshold $\tau$ such that the sub-graph keeping only edges with $w_{ij} > \tau$ has exactly $k$ partitions (connected components).
If we assume that the similarities in the similarity graph are distinct from each other, the graph can be seen as a weighted complete graph satisfying the assumption of Lemma 1. Then there exists a threshold dividing the samples into exactly $k$ partitions.
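The lemma can be checked on a toy graph: with distinct edge weights, sweeping the threshold over the weights yields every possible number of connected components. Below is a minimal sketch; `components` and `threshold_partitions` are hypothetical helper names, not the paper's code.

```python
def components(n, edges):
    """Number of connected components of a graph on n nodes via union-find."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    return len({find(i) for i in range(n)})

def threshold_partitions(n, weights, k):
    """Sweep the edge weights as thresholds; return a threshold t such that
    keeping edges with weight > t leaves exactly k components, or None."""
    for t in sorted(weights.values()):
        edges = [e for e, w in weights.items() if w > t]
        if components(n, edges) == k:
            return t
    return None
```

Because adding one edge merges at most two components, the component count sweeps through every value from $n$ down to 1 as the threshold decreases, which is exactly what the lemma asserts.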
Formulation of the Pseudo-label. Let $x_i \in C_m$ denote that sample $x_i$ belongs to partition $C_m$; we can then define a transitive relation as:
which indicates that pairs with high cosine similarity are guaranteed to be in the same partition. That is to say, as the quality of the similarity matrix increases during training, this partition gets closer to the ground-truth partition and can therefore be regarded as a target to guide and speed up training. Hence, we set the partition of each sample as its pseudo-label.
The following claim reveals the relationship between the assigned pseudo-label and the prediction after softmax:
Claim 1. (The proof is presented in the supplementary materials.)
Let $\theta^*$ denote the optimal solution to Eq. (3). If $W$ has $k$ partitions, then the prediction would be one-hot:
Hence we can formulate our pseudo-label as:
where the subscript denotes the component of the prediction vector. The corresponding probability of the predicted pseudo-label can be computed by taking the maximum component of the prediction. In practice, the prediction does not strictly follow the one-hot property, since it is difficult to attain the optimal solution of the problem in Eq. (3) due to its non-convexity. So we also set a large threshold on this probability to select highly-confident pseudo-labels for supervision:
A probability above this threshold indicates that the predicted pseudo-label is highly confident, and only in this situation will the pseudo-label of the $i$-th sample join the network training.
Pseudo-label Loss. The pseudo-label supervision loss is formulated as:
The loss function is often defined by the cross-entropy loss. By combining the supervision of the highly-confident pseudo-graph and pseudo-labels, we explore the correlation among different samples by minimizing:
where the coefficient is a balance parameter. The selected highly-confident information can supervise the training of the deep network in a progressive manner.
3.3 The Local Robustness
An ideal image representation should be invariant to geometric transformations, which can be regarded as the local robustness assumption. Mathematically, given an image sample $x_i$ and a geometric transformation $T$, we denote $T(x_i)$ as the transformed sample; a good feature extractor should then produce the same label and consistent predictions for the two samples. Thus we can incorporate the distance between the two predictions as a feature-invariance loss:
where the norm measures the distance between the predictions of the original and transformed samples. The pair $(x_i, T(x_i))$ generated by the transformation can be regarded as an 'easy' positive pair, which can well stabilize the training and boost the performance.
Moreover, recall that for the original samples we compute the pseudo-graph and pseudo-labels as supervision. Instead of simply minimizing the distance between predictions, we require the graph and label information computed on transformed samples to be consistent with those of the original samples. On the one hand, given an image with a highly-confident pseudo-label, we force its transformed version to have the same pseudo-label. On the other hand, we also investigate the correlation among the transformed samples with the highly-confident pseudo-graph computed on the original samples, which is beneficial for increasing the network's robustness. The loss function achieving the above targets can be formulated as:
Deep unsupervised learning can benefit a lot from the above strategy. Since we set a high confidence for the construction of the pseudo-graph and pseudo-labels, an original sample can be regarded as an easy sample, which contributes little to the parameter learning. After adding a small perturbation, the prediction of the transformed sample is no longer as easy as that of the original sample, so it contributes much more in return.
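The combination of the invariance term and the pseudo-label supervision of transformed samples can be sketched as below. This is a simplified illustration, not the paper's code: `model` is assumed to output softmax probabilities, `transform` is any small geometric perturbation (shift, rotation, rescale), and the threshold value is a placeholder.

```python
import torch
import torch.nn.functional as F

def local_robustness_loss(model, x, transform, conf_thresh=0.95):
    """Supervise predictions of perturbed samples with the highly-confident
    pseudo-labels computed on the clean samples, plus an invariance term."""
    with torch.no_grad():                    # pseudo-labels from the clean pass
        probs = model(x)
        max_prob, pseudo = probs.max(dim=1)
        mask = max_prob > conf_thresh        # keep only confident samples
    probs_t = model(transform(x))            # forward pass on perturbed input
    inv = F.mse_loss(probs_t, probs)         # feature-invariance term
    if mask.any():                           # pseudo-label term on confident ones
        ce = F.nll_loss(torch.log(probs_t[mask].clamp_min(1e-7)), pseudo[mask])
    else:
        ce = probs_t.new_zeros(())
    return inv + ce
```

Note that gradients flow only through the perturbed forward pass, so the clean predictions act purely as targets, matching the "original guides transformed" direction described above.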
3.4 Triplet Mutual Information
In this section, we explore the correlation between deep and shallow layer representations of each instance and propose a novel loss, named triplet mutual information loss, to make full use of the feature correspondence information. Firstly, we introduce the mutual information loss which is proposed in [36, 20] and analyze its limitation. Next, the concept of triplet correlations is described. Finally, we propose the triplet mutual information loss that enables convolutional neural networks to learn discriminative features.
The mutual information (MI) between deep and shallow layer features of the same sample should be maximized, which guarantees the consistency of the representation. Similar to [36, 20], we convert the MI of two random variables to the Jensen-Shannon divergence (JSD) between samples coming from their joint distribution and samples coming from the product of their marginals. Correspondingly, features of different layers should follow the joint distribution only when they are features of the same sample; otherwise, they follow the product of marginals. So the JSD version of MI is defined as:
where one variable corresponds to the deep layer features, the other to the shallow layer features, the discriminator is trained to distinguish whether a pair is sampled from the joint distribution or not, and sp denotes the softplus function. For the discriminator implementation, it has been shown that incorporating knowledge about locality in the input can improve the quality of the representations.
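The softplus form of the JSD MI objective can be sketched as follows. This is a minimal illustration under stated assumptions: `discriminator` is any scoring network mapping a (deep, shallow) pair to a scalar, and negatives are built by shuffling the batch, which is one common choice rather than necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def jsd_mi_loss(deep_feat, shallow_feat, discriminator):
    """JSD-style mutual information objective: aligned (deep, shallow) pairs of
    the same sample act as samples from the joint distribution, shuffled pairs
    as samples from the product of marginals."""
    joint = discriminator(deep_feat, shallow_feat)           # positive pairs
    perm = torch.randperm(shallow_feat.size(0))              # mismatch the pairs
    marginal = discriminator(deep_feat, shallow_feat[perm])  # negative pairs
    # maximizing E_joint[-sp(-T)] - E_marg[sp(T)] == minimizing this loss
    return F.softplus(-joint).mean() + F.softplus(marginal).mean()
```

Minimizing this loss pushes the discriminator score up on aligned pairs and down on mismatched pairs, which estimates (a lower bound on) the MI between the two layers.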
Please note that so far we do not incorporate any class information. For two different samples, the mutual information between one sample's shallow-layer representation and the other's deep-layer representation will be minimized even if they belong to the same class, which is not reasonable. We therefore fix this issue by introducing the mutual information loss of positive pairs. As shown in the bottom right of Figure 2, with the generated pseudo-graph described in Section 3.1, we select positive pairs and negative pairs sharing the same anchor to construct triplet correlations. Analogous to supervised learning, this approach lifts the instance-level mutual information supervision to triplet-level supervision.
We then show how this approach is theoretically formulated by extending Eq. (13). We let the samples of the two random variables be sets instead of instances: the deep layer features of all samples belonging to one class form one feature set, and their shallow layer features form another. Then we can obtain the following extension of Eq. (13):
where we investigate the mutual information based on class-related feature sets. In this case, besides considering the features of the same sample, we also maximize the mutual information between different layers' features for samples belonging to the same class. The overview of the triplet mutual information loss is shown in the bottom right of Figure 2. Specifically, we compute the loss function in Eq. (3.4) by pair-wise sampling: for each sample, we construct positive and negative pairs based on the pseudo-graph W to compute the triplet mutual information loss, which is very helpful for learning more discriminative representations.
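The pair-wise sampling above can be sketched as follows. This is an illustrative implementation, not the paper's code: for every anchor, pseudo-graph neighbors are scored as joint ("same class") samples and non-neighbors as marginal samples, using the same softplus scoring form.

```python
import torch
import torch.nn.functional as F

def triplet_mi_loss(deep_feat, shallow_feat, W, discriminator):
    """Triplet-level MI: for anchor i, pairs (i, j) with W[i, j] = 1 are treated
    as joint samples and pairs with W[i, j] = 0 as marginal samples, so MI
    between layers is also maximized across pseudo-graph neighbors."""
    n = W.size(0)
    idx = torch.arange(n)
    loss = deep_feat.new_zeros(())
    for i in range(n):
        pos, neg = idx[W[i] > 0], idx[W[i] == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue                          # anchor needs both pair types
        anchor = deep_feat[i].unsqueeze(0)
        t_pos = discriminator(anchor.expand(len(pos), -1), shallow_feat[pos])
        t_neg = discriminator(anchor.expand(len(neg), -1), shallow_feat[neg])
        loss = loss + F.softplus(-t_pos).mean() + F.softplus(t_neg).mean()
    return loss / n
```

Setting W to the identity matrix recovers the instance-level objective, which makes explicit how the pseudo-graph lifts it to the triplet level.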
3.5 The Unified Model and Optimization
By combining the investigations of the three aspects in the above subsections and jointly training the network, we arrive at our deep comprehensive correlation mining for unsupervised learning and clustering. The final objective function of DCCM can be formulated as:
where the balance coefficients are constants controlling the contributions of the different terms, which include the overall pseudo-graph loss and the overall pseudo-label loss. The framework of DCCM is presented in Figure 2. Based on the ideally one-hot prediction feature, we compute the highly-confident pseudo-graph and pseudo-labels to guide the feature learning of both original and transformed samples, investigating both the correlations among different samples and the local robustness to small perturbations. In the meantime, to investigate feature correspondence for discriminative feature learning, the pseudo-graph is also utilized to select highly-confident positive and negative pairs for the triplet mutual information optimization.
Our proposed method can be trained in a minibatch-based end-to-end way and optimized efficiently. After training, the prediction feature is ideally one-hot, and the predicted cluster label for each sample is exactly its pseudo-label, which is easily computed by Eq. (7). We summarize the overall training process in Algorithm 1.
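A single minibatch step of the joint training can be sketched as below. This is a deliberately simplified, self-contained version covering only the pseudo-graph, pseudo-label, and local-robustness terms; the model is assumed to output softmax probabilities, and the weight and thresholds are placeholders for the paper's balance parameters.

```python
import torch
import torch.nn.functional as F

def dccm_step(model, x, transform, optimizer, lam=1.0, thresh=0.95):
    """One simplified DCCM-style minibatch step (illustrative sketch)."""
    probs = model(x)
    # pseudo-graph from cosine similarity of predictions
    normed = F.normalize(probs, dim=1)
    sim = (normed @ normed.t()).clamp(1e-7, 1 - 1e-7)
    W = (sim > thresh).float().detach()
    graph_loss = F.binary_cross_entropy(sim, W)
    # highly-confident pseudo-labels from the clean predictions
    max_prob, pseudo = probs.detach().max(dim=1)
    mask = max_prob > thresh
    label_loss = (F.nll_loss(torch.log(probs[mask].clamp_min(1e-7)), pseudo[mask])
                  if mask.any() else probs.new_zeros(()))
    # local robustness: perturbed samples follow the clean pseudo-labels
    probs_t = model(transform(x))
    robust_loss = (F.nll_loss(torch.log(probs_t[mask].clamp_min(1e-7)), pseudo[mask])
                   if mask.any() else probs.new_zeros(()))
    loss = graph_loss + label_loss + lam * robust_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Early in training, few samples pass the confidence threshold, so only the graph term drives learning; as predictions sharpen, the label and robustness terms take over progressively.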
4 Experiments
We organize our experiments into a few sections. We first examine the effectiveness of DCCM by comparing it against other state-of-the-art algorithms. After that, we conduct ablation studies by controlling several influence factors. Finally, we run a series of analysis experiments to verify the effectiveness of the unified training framework. We first introduce the experimental setting.
Datasets. We select six challenging image datasets for deep unsupervised learning and clustering: CIFAR-10, CIFAR-100, STL-10, ImageNet-10, ImageNet-dog-15, and Tiny-ImageNet. We summarize the statistics of these datasets in Table 1.
For the clustering task, we adopt the same setting as DAC, where the training and validation images of each dataset are jointly utilized, and the 20 superclasses are considered for the CIFAR-100 dataset. The ImageNet-10 and ImageNet-dog-15 datasets used in our experiments are the same as those in DAC, where 10 subjects and 15 kinds of dog images are randomly chosen from the ImageNet dataset and resized to a fixed size. As for the Tiny-ImageNet dataset, a reduced version of ImageNet, it contains 200 classes of images in total, which makes it a very challenging dataset for clustering.
For the transfer-learning classification task, we adopt a similar setting to prior work, mainly considering the CIFAR-10 and CIFAR-100 datasets. Training and testing samples are separated.
Evaluation Metrics. To evaluate the clustering performance, we adopt three commonly used metrics: normalized mutual information (NMI), accuracy (ACC), and adjusted rand index (ARI). These three metrics favour different properties of a clustering; for details, please refer to the appendix. For all three metrics, a higher value indicates better performance.
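Of the three metrics, clustering ACC is the least standard: because cluster indices are arbitrary, it requires finding the best one-to-one mapping between predicted clusters and ground-truth classes. A common implementation uses the Hungarian algorithm via SciPy (a sketch; the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering ACC: fraction of samples correctly assigned under the best
    one-to-one cluster-to-class mapping (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # co-occurrence counts
    row, col = linear_sum_assignment(-cost)  # negate to maximize matched mass
    return cost[row, col].sum() / len(y_true)
```

NMI and ARI are available directly as `sklearn.metrics.normalized_mutual_info_score` and `sklearn.metrics.adjusted_rand_score`.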
To evaluate the quality of the feature representation, we adopt the same non-linear classification task as in prior work. Specifically, after training DCCM, we fix the parameters of the deep neural network and train a multilayer perceptron with a single hidden layer on top of the last convolutional-layer and fully-connected-layer features separately in a supervised way.
Figure 3: Visualizations of embeddings for different stages of DCCM and DAC on the CIFAR-10 dataset. Different colors denote different clusters. From (a) to (c), as epochs increase, DCCM progressively learns more discriminative features. Comparing (c) and (d), the features of DCCM are more discriminative than those of DAC.
Implementation Details. The network architecture used in our framework is a shallow version of AlexNet (details for different datasets are described in the supplementary materials). Similar to previous work, we adopt the RMSprop optimizer. The hyper-parameters are fixed across all datasets and are relatively stable within a certain range. The thresholds for constructing the highly-confident pseudo-graph and selecting highly-confident pseudo-labels are both set to large values. The small perturbations used in the experiments include rotation, shift, rescale, etc. For the discriminator of the mutual information estimation, we adopt a network with three convolutional layers, the same as in prior work. We use PyTorch to implement our approach.
4.1 Main Results
We first compare DCCM with other state-of-the-art methods on the clustering task. The results are shown in Table 2; most results of the other methods are directly copied from DAC. DCCM surpasses the other methods by a large margin on these benchmarks under all three evaluation metrics. Concretely, the improvement of DCCM is significant even compared with the state-of-the-art DAC: taking clustering ACC as an example, our result is clearly higher than that of DAC on the CIFAR-10 dataset, and DCCM shows a similarly clear gain over DAC on the CIFAR-100 dataset.
Figure 3 visualizes the feature embeddings of DCCM and DAC on CIFAR-10 using t-SNE. Compared with DAC, DCCM exhibits a more discriminative feature representation. The above results sufficiently verify the effectiveness and superiority of our proposed DCCM.
To further evaluate the quality of the feature representations, we adopt the classification task and compare DCCM against several deep unsupervised feature learning methods, including the variational AE (VAE), adversarial AE (AAE), BiGAN, noise as targets (NAT), and deep infomax (DIM). The top-1 non-linear classification accuracy comparison is presented in Figure 4. DCCM achieves much better results than the other methods on the CIFAR-10 and CIFAR-100 datasets. Especially on CIFAR-10, our results on both the convolutional and fully-connected layer features are clearly higher than those of the second-best method, DIM. Since we incorporate the graph-based class information and lift the instance-level mutual information to the triplet level, our method learns much more discriminative features, which accounts for the obvious improvement.
4.2 Correlation Analysis
We analyze the effectiveness of various correlations from three aspects: Local Robustness, Pseudo-label and Triplet Mutual Information in this section. The results are shown in Table 3.
Local Robustness Influence. The only difference between methods M2 and M1 is whether the local robustness mechanism is used. M2 significantly surpasses M1, which demonstrates the effectiveness of local robustness. Because we set a high threshold to select positive pairs, without transformation these easy pairs contribute little to parameter learning. With the local robustness loss, we construct many hard sample pairs that benefit the network training, so it significantly boosts the performance.
Effectiveness of Pseudo-label. With the help of the pseudo-label, M3 (with both pseudo-graph and pseudo-label) achieves much better results than M2 (with only the pseudo-graph) under all metrics, including a clear improvement in clustering ACC. The reason is that the pseudo-label makes full use of the category information behind the feature distribution, which benefits the clustering.
Triplet Mutual Information Analysis. Comparing the results of M4 and M3, we can see that the triplet mutual information further improves the clustering ACC. As analyzed in Section 3.4, with the help of the pseudo-graph, the triplet mutual information not only makes use of the feature correspondence of the same sample, but also introduces a discriminative property by constructing positive and negative pairs, so it further improves the result.
4.3 Overall Study of DCCM
In this section, we conduct analysis experiments on CIFAR-10 to investigate the behavior of deep comprehensive correlation mining. The model is trained with the unified model optimization introduced in Section 3.5.
BCubed Precision and Recall of the Pseudo-graph. BCubed is a metric for evaluating the quality of partitions in clustering. We validate that our method learns better representations in a progressive manner using the BCubed precision and recall curves, computed on the pseudo-graphs of different epochs, shown in Figure 5. It is obvious that with increasing epochs, the precision of the pseudo-graph improves, which in turn improves the clustering performance.
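BCubed precision and recall admit a compact item-level definition: for each item, precision is the fraction of items sharing its predicted cluster that also share its true class, and recall is the symmetric quantity, both averaged over all items. A minimal sketch (the function name is ours):

```python
import numpy as np

def bcubed(y_true, y_pred):
    """Item-averaged BCubed precision and recall for a hard clustering."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    prec = rec = 0.0
    for i in range(n):
        same_cluster = y_pred == y_pred[i]   # items in i's predicted cluster
        same_class = y_true == y_true[i]     # items in i's true class
        both = np.logical_and(same_cluster, same_class).sum()
        prec += both / same_cluster.sum()
        rec += both / same_class.sum()
    return prec / n, rec / n
```

Putting every item in one cluster yields perfect recall but low precision, which is why the precision curve is the informative one for judging pseudo-graph quality.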
Statistics of Prediction Features. According to Claim 1, the ideal prediction features have the one-hot property, so we can use the highly-confident pseudo-labels to guide the training. To verify this, we compare the distribution of the largest prediction probability between the initial stage and the final stage. The results on the CIFAR-10 dataset are presented in Figure 6(a). For CIFAR-10, the largest probability lies in the range [0.1, 1], and we count the probabilities over nine disjoint intervals of width 0.1. In the initial stage, only a small fraction of samples fall in the highest interval, while after training, nearly all samples have a largest probability close to 1. These results imply that the largest probability tends to 1 and the others tend to 0, which is consistent with Claim 1.
Influence of Thresholds.
In Figure 6(b), we test the influence of the threshold used to select highly-confident pseudo-labels for training. As the threshold increases, the performance also increases. The reason is that with a low threshold, some incorrect pseudo-labels are adopted for network training, which hurts the performance. It is therefore important to set a relatively high threshold to select highly-confident pseudo-labels for supervision.
5 Conclusion
For deep unsupervised learning and clustering, we propose DCCM to learn discriminative feature representations by mining comprehensive correlations. Besides the correlation among different samples, we also make full use of the mutual information between corresponding features, the local robustness to small perturbations, and their inter-correlations. We conduct extensive experiments on several challenging datasets and two different tasks to thoroughly evaluate the performance. DCCM achieves significant improvement over the state-of-the-art methods.
-  E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval, 12(4):461–486, 2009.
-  Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153–160, 2007.
-  P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In ICML, pages 517–526, 2017.
-  J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, pages 737–744, 1994.
-  D. Cai, X. He, and J. Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 17(12):1624–1637, 2005.
-  D. Cai, X. He, X. Wang, H. Bao, and J. Han. Locality preserving nonnegative matrix factorization. In IJCAI, 2009.
-  M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
-  J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In IEEE ICCV, pages 5879–5887, 2017.
-  A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE CVPR, 2009.
-  K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In IEEE ICCV, pages 5747–5756, 2017.
-  C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
-  J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
-  A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.
-  Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou. Deep adversarial metric learning. In IEEE CVPR, pages 2780–2789, 2018.
-  E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE CVPR, pages 2790–2797, 2009.
-  S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
-  K. C. Gowda and G. Krishna. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern recognition, 10(2):105–112, 1978.
-  X. Guo, L. Gao, X. Liu, and J. Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, pages 1753–1759, 2017.
-  R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In NIPS, pages 24–33, 2017.
-  L. Jing and Y. Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  A. Krizhevsky, V. Nair, and G. Hinton. Cifar-10 and cifar-100 datasets. URl: https://www. cs. toronto. edu/kriz/cifar. html, 6, 2009.
-  K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In IEEE CVPR, pages 991–999, 2015.
-  G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE TPAMI, 35(1):171–184, 2013.
-  Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In IEEE CVPR, pages 8896–8905, 2018.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2002.
-  M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84. Springer, 2016.
-  M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In IEEE CVPR, pages 9359–9367, 2018.
-  S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NIPS, pages 271–279, 2016.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
-  X. Peng, S. Xiao, J. Feng, W.-Y. Yau, and Y. Zhang. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  U. Shaham, K. Stanton, H. Li, B. Nadler, R. Basri, and Y. Kluger. Spectralnet: Spectral clustering using deep neural networks. In ICLR, 2018.
-  P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
-  P. Xi, F. Jiashi, L. Jiwen, Y. Wei-Yun, and Y. Zhang. Cascade subspace clustering. In AAAI, 2017.
-  J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, pages 478–487, 2016.
-  B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards K-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, pages 3861–3870, 2017.
-  J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In IEEE CVPR, pages 5147–5156, 2016.
-  M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE CVPR, pages 2528–2535, 2010.
-  L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, pages 1601–1608, 2005.
6 Supplementary Material
6.1 Proof of Lemma and Claim
Proof of Lemma : Since for , there exists a strictly increasing sequence of weights , we can remove edges from in order from the smallest weight to the largest by increasing the threshold . Each removal either increases the current number of partitions to or leaves it unchanged. At the beginning of the process we have partition, and at the end of the process we have partitions. Since , there exists a partition in the process.
Proof of Claim : Select samples from partition , and denote the cosine similarity matrix of their corresponding optimal features as ; equals its partition's pseudo-graph , which is an identity matrix. Denote as , where denotes the -th element of the vector .
The set can contain no more than positive elements; otherwise, by the pigeonhole principle, there would exist a such that and , which contradicts .
On the other hand, since every vector is the output of a softmax layer, it has at least one positive entry. Therefore, every vector has exactly one positive element, which equals .
6.2 Definitions of Metrics
We introduce the three standard metrics used to evaluate our model:
Normalized Mutual Information (NMI): Let $Y$ and $C$ denote the predicted partition and the ground-truth partition, respectively. The NMI metric is calculated as:
$$\mathrm{NMI}(Y, C) = \frac{I(Y; C)}{\sqrt{H(Y)\,H(C)}},$$
where $I(\cdot\,;\cdot)$ denotes mutual information and $H(\cdot)$ denotes entropy.
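As an illustration, the NMI defined above can be computed directly from the contingency counts of the two labelings; the following minimal NumPy sketch (not the paper's code) uses the $\sqrt{H(Y)H(C)}$ normalization:

```python
import numpy as np

def nmi(y_pred, y_true):
    """NMI = I(Y;C) / sqrt(H(Y) * H(C)), estimated from label counts."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    eps = 1e-12
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    # joint distribution p(y, c) from the contingency table
    pxy = np.array([[np.mean((y_pred == u) & (y_true == v)) for v in classes]
                    for u in clusters])
    px, py = pxy.sum(1), pxy.sum(0)          # marginals
    mi = np.sum(pxy * np.log((pxy + eps) / (np.outer(px, py) + eps)))
    hx = -np.sum(px * np.log(px + eps))
    hy = -np.sum(py * np.log(py + eps))
    return mi / max(np.sqrt(hx * hy), eps)
```

A perfect clustering scores 1 regardless of how the cluster indices are permuted, while independent labelings score 0.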
Adjusted Rand Index (ARI): Given a set of $n$ elements and two partitions (e.g., clustering results) of these elements with $r$ and $s$ groups, namely $X = \{X_1, \dots, X_r\}$ and $Y = \{Y_1, \dots, Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table, where each element $n_{ij} = |X_i \cap Y_j|$ denotes the number of objects in common between $X_i$ and $Y_j$, with row sums $a_i = \sum_j n_{ij}$ and column sums $b_j = \sum_i n_{ij}$. ARI is defined by:
$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{n}{2}}{\frac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big] - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{n}{2}}.$$
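The ARI formula above maps directly onto the contingency table; a short NumPy/SciPy sketch (illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.special import comb

def ari(y_pred, y_true):
    """Adjusted Rand Index computed from the contingency table n_ij."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    # contingency table: n_ij = |X_i intersect Y_j|
    nij = np.array([[np.sum((y_pred == u) & (y_true == v)) for v in classes]
                    for u in clusters])
    a, b, n = nij.sum(1), nij.sum(0), nij.sum()
    sum_ij = comb(nij, 2).sum()               # sum of C(n_ij, 2)
    sum_a, sum_b = comb(a, 2).sum(), comb(b, 2).sum()
    expected = sum_a * sum_b / comb(n, 2)     # expected index under chance
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Unlike the raw Rand index, ARI is corrected for chance: random labelings score near zero and can even be negative.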
Accuracy (ACC): Suppose the clustering algorithm is tested on $n$ samples. For a sample $x_i$, we denote its cluster label as $c_i$ and its ground-truth label as $l_i$. The clustering accuracy is defined by:
$$\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(c_i)\}}{n},$$
where $m$ denotes the best one-to-one permutation mapping between cluster labels and ground-truth labels, obtained by the Hungarian algorithm .
6.3 Compared Methods
For clustering, we adopt both traditional methods and deep learning based methods, including K-means, spectral clustering (SC) , agglomerative clustering (AC) , nonnegative matrix factorization (NMF) based clustering , auto-encoder (AE) , denoising auto-encoder (DAE) , GAN , deconvolutional networks (DECNN) , variational auto-encoder (VAE) , deep embedding clustering (DEC) , jointly unsupervised learning (JULE) , and deep adaptive image clustering (DAC) .
6.4 Architecture Details
In Table 4, we present the architectures for different datasets.
For CIFAR-10/CIFAR-100 , we set conv layers and pooling layers, followed by fully-connected layers. Batch Normalization and ReLU are used on all hidden layers. The output features after the second conv layer (S for shallow) and the first fc layer (D for deep) are used to compute the mutual information () loss, concatenated as the input of the discriminator. For other datasets, such as Tiny-ImageNet  and STL-10 , we set conv layers instead of . Due to their larger input size, we use the feature maps after the third conv layer as S. For all experiments, the output is a -dimensional vector.
6.5 Sampling Strategy
| Version | Sampling strategy | Result |
|---|---|---|
| V1 | nearest pos + random* neg | 0.744 |
| V2 | nearest pos + farthest neg | 0.713 |
| V3 | random* pos + random* neg | 0.731 |
| V4 | top-n pos + random* neg | 0.698 |
The experimental results corresponding to the analysis in line - are listed in Table 5. We tried four strategies to fetch positive and negative pairs from the pseudo-graph , and the terms used in the table are as follows:
nearest means that for each sample, we select its nearest sample from the minibatch to construct a positive pair, while farthest means taking the farthest one to construct a negative pair.
random* means that we randomly take a positive sample that satisfies as a positive pair or a negative sample that satisfies as a negative pair.
top-n pos means that we select the top-n most confident pairs from the graph to construct positive pairs.
For each strategy, we take positive pairs and negative pairs into account, where is our batch size. This ensures that the computational complexity of each approach is nearly the same for a fair comparison; we have also explored more costly approaches and found the improvement negligible.
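To make the strategies concrete, here is a minimal sketch of the V1 strategy (nearest positive + random* negative) over a cosine-similarity pseudo-graph. The thresholds `upper` and `lower` are illustrative placeholders, not the paper's actual hyper-parameters:

```python
import numpy as np

def select_pairs(features, upper=0.9, lower=0.3, rng=None):
    """V1-style pair selection from a minibatch of features:
    nearest neighbor above `upper` -> positive pair,
    random sample below `lower`   -> negative pair."""
    if rng is None:
        rng = np.random.default_rng(0)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                    # cosine-similarity pseudo-graph
    np.fill_diagonal(sim, -np.inf)   # exclude self-pairs
    pos, neg = [], []
    for i in range(len(f)):
        j = int(np.argmax(sim[i]))   # nearest sample in the minibatch
        if sim[i, j] > upper:
            pos.append((i, j))
        cand = np.where(sim[i] < lower)[0]
        cand = cand[cand != i]       # random* negative candidates
        if len(cand):
            neg.append((i, int(rng.choice(cand))))
    return pos, neg
```

Swapping `np.argmax` for `np.argmin` on the negative side would give the V2 (farthest-negative) variant, which the ablation shows performs worse.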
To clearly illustrate how is affected, we fix a model trained with only . Then, with the pseudo-graph it generates, we train a new model from scratch using only . It can be concluded that positive pairs are sensitive to noise, since strategy V1 achieves better results than V3, and that harder negative pairs are beneficial for training, as V1 also outperforms V2. Besides, we also note the importance of uniform sampling within the minibatch: the top-n pairs in V4 have higher confidence than those in V1, but training collapses because only part of the samples in the batch are covered by the top-n strategy.