1 Introduction
Clustering is one of the fundamental tasks in computer vision and machine learning. With the development of the Internet, we can easily collect thousands of images and videos every day, most of which are unlabeled, and manually labeling these data is expensive and time-consuming. To make use of such unlabeled data and investigate their correlations, unsupervised clustering, which aims to categorize similar data into one cluster based on some similarity measure, has drawn much attention recently.
Image clustering is a challenging task due to the large variance of shape and appearance of images in the wild. Traditional clustering methods [47, 18, 6], such as K-means, spectral clustering [33], and subspace clustering [29, 16], may fail in this case for two obvious reasons: firstly, hand-crafted features have limited capacity and cannot dynamically adjust to capture the prior distribution, especially when dealing with large-scale real-world images; secondly, the separation of feature extraction and clustering leads to suboptimal results. Recently, with the booming of deep unsupervised representation learning, many researchers have shifted their attention to deep unsupervised clustering [38, 22, 8], which can well address the aforementioned limitations. Typically, to learn a better representation, [2, 43, 44] adopt the autoencoder and [20] maximizes the mutual information between features. DAC [8] constructs positive and negative pairs based on cosine similarity to guide network training.
However, for these methods, several points are still missing. Firstly, feature representations that only consider reconstruction or mutual information lack discriminative power. Secondly, traditional clustering methods like K-means effectively use the category assumption on data; in contrast, DAC only focuses on pairwise correlation and neglects the category information, which limits its performance. Thirdly, there are other correlations that are helpful for deep image feature learning; for example, [28] shows that measuring feature equivariance can benefit image representation understanding.
To tackle the above issues, as shown in Figure 1(a), we propose a novel method, namely deep comprehensive correlation mining (DCCM), which comprehensively explores correlations among different samples (red line), local robustness to geometric transformation (yellow line), correlations between different layer features of the same sample (blue line), and their inter-correlations (green lines) to learn discriminative representations and train the network in a progressive manner. First of all, for the correlation among different samples, we adopt a deep convolutional neural network (CNN) to generate a prediction feature for the input image. With proper constraints, the learned prediction feature tends to be one-hot. Then we compute the cosine similarity and construct the similarity graph. Based on the similarity graph and prediction feature, we apply a large threshold to obtain a highly-confident pseudo-graph and pseudo-labels to guide the feature learning. Secondly, for local robustness to small perturbations, we add a small perturbation or transformation to the original input image to generate a transformed image. Under the local robustness assumption, the prediction of the transformed image should be consistent with that of the original image, so we can use the prediction of the original image to guide the feature learning of the transformed image. Thirdly, the feature representation of a deep layer should preserve distinct information of the input, so we maximize the mutual information between the deep layer feature and the shallow layer feature of the same sample. To make the representation more discriminative, we further extend it to a triplet form by incorporating the graph information above. Finally, we combine the loss functions of these three different aspects and jointly investigate these correlations in an end-to-end way. Results in Figure 1(c) show the superiority of our method (purple curve) over the state-of-the-art method DAC [8] (red curve). Our main contributions are summarized as follows:

We propose a novel end-to-end deep clustering framework to comprehensively mine various kinds of correlations, and select highly-confident information to train the network in a progressive way;

We first derive the rationality of the pseudo-label and introduce the highly-confident pseudo-label loss to directly investigate the category information and guide the unsupervised training of the deep network;

We make use of the local robustness assumption to small perturbations to learn better representations. Instead of simply constraining the feature distance, the above pseudo-graph and pseudo-label are utilized to guide discriminative feature learning of samples after small perturbations;

We extend the instance-level mutual information to the triplet level, and come up with a triplet mutual information loss to learn more discriminative features.
2 Related Work
2.1 Deep Clustering
Existing deep clustering methods [45, 43, 8] mainly aim to combine deep feature learning [2, 41, 46] with traditional clustering methods [47, 18, 6]. The autoencoder (AE) [2] is a very popular feature learning method for deep clustering, and many methods minimize the loss of a traditional clustering method to regularize the learning of the autoencoder's latent representation. For example, [43, 19] propose deep embedded clustering with a KL-divergence loss. [11] also uses the KL-divergence loss, but adds a noisy encoder to learn a more robust representation. [44] adopts the K-means loss, and [22, 38, 42] incorporate the self-representation based subspace clustering loss. Besides the autoencoder, some methods directly design a specific loss function based on the last layer output. [45] introduces a recurrent-agglomerative framework to merge clusters that are close to each other. [8] explores the correlation among different samples based on the label features, and uses such similarity as supervision. [40] extends spectral clustering into a deep formulation.
2.2 Deep Unsupervised Feature Learning
Instead of clustering, several approaches [2, 24, 32] mainly focus on deep unsupervised learning of representations. Based on Generative Adversarial Networks (GANs), [13] proposes to add an encoder to extract visual features. [3] directly uses fixed targets uniformly sampled from a unit sphere to constrain the assignment of deep features. [7] utilizes pseudo-labels computed by K-means on output features as supervision to train deep neural networks. [20] proposes Deep InfoMax to maximize the mutual information between the input and output of a deep neural network encoder.

2.3 Self-supervised Learning
Self-supervised learning [23, 25] generally designs a pretext task whose target objective can be computed without supervision. These methods assume that the representations learned for the pretext task contain high-level semantic information that is useful for solving downstream tasks of interest, such as image classification. For example, [12] tries to predict the relative location of image patches, and [34, 35] predict the permutation of a "jigsaw puzzle" created from the full image. [14] regards each image as an individual class and generates multiple images of it by data augmentation to train the network. [17] rotates an image randomly by one of four different angles and lets the deep model predict the rotation.

3 Deep Comprehensive Correlation Mining
Without labels, correlation is the most important signal in deep clustering. In this section, we first construct a pseudo-graph to explore the binary correlation between samples and start the network training. Then we propose the pseudo-label loss to make full use of the category information behind the data. Next, we mine the local robustness of predictions before and after applying a transformation to the input image. We also lift the instance-level mutual information to a triplet form to introduce discriminability. Finally, we combine them to obtain our proposed method.
3.1 Preliminary: Pseudo-graph Supervision
We first compute the similarity among samples and select highly-confident pairwise information to guide the network training by constructing a pseudo-graph. Let $\mathcal{X}=\{x_i\}_{i=1}^{N}$ be the unlabeled dataset, where $x_i$ is the $i$-th image and $N$ is the total number of images. Denote by $K$ the total number of classes. We aim to learn a deep-CNN-based mapping function $f_{\theta}(\cdot)$ parameterized by $\theta$. Then we use $z_i = f_{\theta}(x_i) \in \mathbb{R}^{K}$ to represent the prediction feature of image $x_i$ after the softmax layer of the CNN. It has the following properties:

$$z_{ij} \geq 0, \qquad \sum_{j=1}^{K} z_{ij} = 1. \quad (1)$$

Based on the label feature $z_i$, the cosine similarity between the $i$-th and the $j$-th samples can be computed by $s_{ij} = \frac{z_i^{\top} z_j}{\|z_i\|_2 \|z_j\|_2}$, where $z_i^{\top} z_j$ is the dot product of the two vectors. Similar to DAC [8], we can construct the pseudo-graph $W$ by setting a large threshold $\tau$:

$$W_{ij} = \begin{cases} 1, & s_{ij} \geq \tau, \\ 0, & s_{ij} < \tau. \end{cases} \quad (2)$$

If the similarity between two samples is larger than the threshold, we judge that these two samples belong to the same class ($W_{ij} = 1$) and their similarity should be maximized. Otherwise ($W_{ij} = 0$), their similarity should be minimized. The pseudo-graph supervision can then be defined by:¹

$$\mathcal{L}_{W}(\theta) = \sum_{i,j} \ell_{w}(W_{ij}, s_{ij}). \quad (3)$$

¹For the loss function $\ell_{w}$, there are many choices, such as the contrastive Siamese net loss [4, 30] regularizing the distance between two samples, and the binary cross-entropy loss [8] regularizing the similarity.
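The construction above can be sketched in a few lines. The following is a minimal numpy sketch of Eq. (2): L2-normalize the softmax predictions, take pairwise cosine similarities, and threshold them. The function name and the clipping details are our own illustration, not the paper's code.

```python
import numpy as np

def pseudo_graph(z, tau):
    """Build a highly-confident pseudo-graph W from softmax predictions.

    z   : (N, K) array of prediction features (rows sum to 1).
    tau : similarity threshold; pairs with cosine similarity >= tau are
          treated as positive (same class), the rest as negative.
    Returns the binary pseudo-graph W and the similarity matrix s.
    """
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    zn = z / np.clip(norm, 1e-12, None)       # L2-normalize each prediction
    s = zn @ zn.T                             # pairwise cosine similarities
    return (s >= tau).astype(np.float32), s
```

With near one-hot predictions, two samples that peak on the same class end up connected, while samples peaking on different classes do not, which is exactly what a large threshold such as 0.9 selects for.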
Please note that there are two differences between our pseudo-graph and that in DAC [8]: 1) Unlike the strong norm constraint in DAC, we relax this assumption and only take the output after the softmax layer. This relaxation increases the capacity of the labeling feature and finally induces a better result in our experiments. 2) Instead of the dynamically decreasing threshold in DAC, we only need a fixed threshold $\tau$. This protects the training from the harm caused by noisy false-positive pairs.

3.2 Pseudo-label Supervision
The correlation explored in the pseudo-graph is not transitive and is limited to pairwise samples. To address this issue, in this subsection we propose the novel pseudo-label loss and prove its rationality. We first prove the existence of a partition of the pseudo-graph, which can naturally be regarded as a pseudo-label assignment. We then show that this partition makes the optimal solution of Eq. (3) lead to a one-hot prediction, which formulates the pseudo-label. Finally, the pseudo-label loss is introduced to optimize the convolutional neural network.
Existence of partition. The binary relation between samples defined in Eq. (3) is not transitive: $W_{ik}$ is not determined given $W_{ij}$ and $W_{jk}$, and this may lead to instability in training. Therefore, we introduce Lemma 1 to extend it to a stronger relation.
Lemma 1.
For any weighted complete graph with weight $s_{ij}$ on edge $(i, j)$, if the weights are pairwise distinct, then there exists a threshold under which the graph has exactly $K$ partitions, where
(4)
If we assume that the similarities in the graph are distinct from each other, the similarity graph can be seen as a weighted complete graph satisfying the assumption of Lemma 1. Then there exists a threshold dividing it into exactly $K$ partitions.
Formulation of the Pseudo-label. Let $x_i \in \mathcal{P}_c$ denote that sample $x_i$ belongs to partition $\mathcal{P}_c$, and we can define a transitive relation:
(5)
which indicates that pairs with high cosine similarity are guaranteed to be in the same partition. That is to say, as the quality of the similarity matrix increases during training, this partition gets closer to the ground-truth partition, and can therefore be regarded as a target to guide and speed up training. Hence, we set the partition index of each sample as its pseudo-label.
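Extending the pairwise relation to a transitive one amounts to taking connected components of the thresholded similarity graph. Below is a toy union-find sketch of that idea; the paper does not prescribe a particular algorithm for extracting the partition, so the function and its name are purely illustrative.

```python
def graph_partition(W):
    """Partition samples into groups via the transitive closure of the
    pseudo-graph W (connected components): if i~j and j~k are linked,
    then i, j, k all receive the same partition index."""
    n = len(W)
    parent = list(range(n))

    def find(a):
        # Path-halving find for the union-find structure.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if W[i][j]:
                parent[find(i)] = find(j)   # union the two components

    roots, labels = {}, []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels
```

Note how transitivity is recovered: even if `W[0][2]` is 0, samples 0 and 2 share a label whenever both are linked to sample 1.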
The following claim reveals the relationship between the assigned pseudo-label and the prediction after softmax:
Claim 1.²
Let $z^{*}$ denote the optimal solution to Eq. (3). If the graph has $K$ partitions, then the prediction is one-hot:
(6)
²The proof is presented in the supplementary materials.
Hence we can formulate our pseudo-label as:

$$\hat{y}_i = \arg\max_{j} z_{ij}, \quad (7)$$

where $z_{ij}$ denotes the $j$-th component of the prediction vector. The corresponding probability of the predicted pseudo-label can be computed by $p_i = \max_{j} z_{ij}$. In practice, $z_i$ does not strictly follow the one-hot property, since it is difficult to attain the optimal solution of the problem in Eq. (3) due to its non-convexity. So we also set a large threshold $\tau_{l}$ for the probability to select highly-confident pseudo-labels for supervision:

$$v_i = \begin{cases} 1, & p_i > \tau_{l}, \\ 0, & \text{otherwise}. \end{cases} \quad (8)$$

$v_i = 1$ indicates that the predicted pseudo-label is highly-confident, and only in this case will the pseudo-label of the $i$-th sample join the network training.
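The selection step of Eqs. (7) and (8) is a one-liner in practice: take the argmax as the pseudo-label and keep only samples whose maximum probability clears the threshold. A minimal numpy sketch (names are ours):

```python
import numpy as np

def select_pseudo_labels(z, tau_l):
    """Assign pseudo-labels by argmax over the softmax prediction and
    keep only highly-confident ones whose max probability exceeds tau_l.

    z     : (N, K) softmax predictions.
    tau_l : confidence threshold for the pseudo-label.
    Returns (labels, mask) where mask[i] plays the role of v_i."""
    labels = z.argmax(axis=1)   # hat{y}_i
    conf = z.max(axis=1)        # p_i
    mask = conf > tau_l         # v_i = 1 only for confident predictions
    return labels, mask
```

Only samples with `mask[i] == True` would contribute to the cross-entropy term during training.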
Pseudo-label Loss. The pseudo-label supervision loss is formulated as:

$$\mathcal{L}_{Y}(\theta) = \sum_{i} v_i \, \ell_{y}(\hat{y}_i, z_i), \quad (9)$$

where the loss function $\ell_{y}$ is typically the cross-entropy loss, and $v_i$ and $\hat{y}_i$ are the confidence indicator and pseudo-label from Eqs. (7) and (8). By combining the supervision of the highly-confident pseudo-graph and pseudo-labels, we explore the correlation among different samples by minimizing:

$$\mathcal{L}_{S} = \mathcal{L}_{W} + \lambda \mathcal{L}_{Y}, \quad (10)$$

where $\lambda$ is a balance parameter. The selected highly-confident information supervises the training of the deep network in a progressive manner.
3.3 The Local Robustness
An ideal image representation should be invariant to geometric transformations, which can be regarded as the local robustness assumption. Mathematically, given an image sample $x_i$ and a geometric transformation $T$, we denote the transformed sample by $T(x_i)$; a good feature extractor should then assign the two samples the same label, with $f_{\theta}(x_i) \approx f_{\theta}(T(x_i))$. Thus we can take the distance between the two predictions as a feature-invariance loss:

$$\mathcal{L}_{inv}(\theta) = \sum_{i} \left\| f_{\theta}(x_i) - f_{\theta}(T(x_i)) \right\|, \quad (11)$$

where $\|\cdot\|$ is a norm measuring the distance between the predictions of the original and transformed samples. The pair $(x_i, T(x_i))$ generated by the transformation can be regarded as an 'easy' positive pair, which stabilizes the training and boosts the performance.
Moreover, recall that for the original samples we compute the pseudo-graph and pseudo-labels as supervision. Instead of simply minimizing the distance between predictions, we require the graph and label information computed on the transformed samples to be consistent with those of the original samples. On the one hand, given an image $x_i$ with a highly-confident pseudo-label $\hat{y}_i$, we force $T(x_i)$ to have the same pseudo-label. On the other hand, we also investigate the correlation among the transformed samples with the highly-confident pseudo-graph $W$ computed on the original samples, which helps increase the network's robustness. The loss function that achieves the above targets can be formulated as:

$$\mathcal{L}_{T}(\theta) = \sum_{i} v_i \, \ell_{y}\!\left(\hat{y}_i, \tilde{z}_i\right) + \sum_{i,j} \ell_{w}\!\left(W_{ij}, \tilde{s}_{ij}\right), \quad (12)$$

where $\tilde{z}_i = f_{\theta}(T(x_i))$ is the prediction on the transformed data, $\tilde{s}_{ij}$ is the cosine similarity between transformed samples, and $v_i$, $\hat{y}_i$, and $W_{ij}$ are the same as those of the original set in Eqs. (2) and (8).
Deep unsupervised learning benefits a great deal from the above strategy. Since we require high confidence for the construction of the pseudo-graph and pseudo-labels, the original samples can be regarded as easy samples, which contribute little to the parameter learning [15]. After adding a small perturbation, the prediction of a transformed sample is no longer as easy as that of the original sample, so it contributes much more in return.
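The label side of this consistency supervision can be sketched concretely: take the confident pseudo-labels from the original predictions and apply a cross-entropy penalty to the predictions on the transformed images. This is a minimal numpy sketch under our own assumptions about shapes and the cross-entropy form, not the paper's implementation.

```python
import numpy as np

def robustness_loss(z_orig, z_trans, mask, eps=1e-12):
    """Cross-entropy between confident pseudo-labels of the original
    samples and predictions on their transformed versions.

    z_orig  : (N, K) softmax predictions on original images.
    z_trans : (N, K) softmax predictions on transformed images.
    mask    : (N,) boolean array selecting samples whose original
              prediction passed the confidence threshold (v_i = 1)."""
    labels = z_orig.argmax(axis=1)
    picked = z_trans[mask, labels[mask]]   # probability assigned by the
    if picked.size == 0:                   # transformed prediction to the
        return 0.0                         # original pseudo-label
    return float(-np.mean(np.log(picked + eps)))
```

If the transformed prediction agrees with the original pseudo-label the loss is near zero; a transformed sample that drifts toward another class is penalized heavily, which is the "hard pair" effect described above.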
3.4 Triplet Mutual Information
In this section, we explore the correlation between the deep- and shallow-layer representations of each instance and propose a novel loss, named the triplet mutual information loss, to make full use of the feature correspondence information. Firstly, we introduce the mutual information loss proposed in [36, 20] and analyze its limitation. Next, the concept of triplet correlations is described. Finally, we propose the triplet mutual information loss that enables convolutional neural networks to learn discriminative features.
The mutual information (MI) between the deep- and shallow-layer features of the same sample should be maximized, which guarantees the consistency of the representation. Similar to [36], we convert the MI of two random variables ($X$ and $Y$) to the Jensen-Shannon divergence (JSD) between samples coming from the joint distribution $\mathbb{J}$ and from the product of marginals $\mathbb{M}$. Correspondingly, features of different layers follow the joint distribution only when they are features of the same sample; otherwise, they follow the product of marginals. The JSD version of MI is defined as:

$$\hat{I}^{(\mathrm{JSD})}(X; Y) = \mathbb{E}_{\mathbb{J}}\!\left[-\mathrm{sp}\!\left(-D_{\phi}(x, y)\right)\right] - \mathbb{E}_{\mathbb{M}}\!\left[\mathrm{sp}\!\left(D_{\phi}(x, y)\right)\right], \quad (13)$$

where $X$ corresponds to the deep-layer features, $Y$ corresponds to the shallow-layer features, $D_{\phi}$ is a discriminator trained to distinguish whether $x$ and $y$ are sampled from the joint distribution or not, and $\mathrm{sp}(z) = \log(1 + e^{z})$ is the softplus function. For the discriminator implementation, [20] shows that incorporating knowledge about locality in the input can improve the representations' quality.
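Given discriminator scores, the JSD objective of Eq. (13) reduces to two softplus terms. The sketch below takes the scores as given and computes the loss (the negative of the estimator, so that minimizing it maximizes MI); function names are our own.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.logaddexp(0.0, x)

def jsd_mi_loss(d_joint, d_marginal):
    """JSD-based mutual-information objective.

    d_joint    : discriminator scores on pairs drawn from the joint
                 distribution (deep/shallow features of the SAME sample).
    d_marginal : scores on pairs from the product of marginals
                 (features of mismatched samples).
    Minimizing this pushes joint scores up and marginal scores down."""
    return float(np.mean(softplus(-d_joint)) + np.mean(softplus(d_marginal)))
```

A discriminator that scores matched pairs high and mismatched pairs low yields a small loss, while the reverse configuration is heavily penalized.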
Please note that, so far, we do not incorporate any class information. For two different samples $x_i$ and $x_j$, the mutual information between $x_i$'s shallow-layer representation and $x_j$'s deep-layer representation will be minimized even if they belong to the same class, which is not reasonable. We fix this issue by introducing the mutual information loss of positive pairs. As shown in the bottom right of Figure 2, with the generated pseudo-graph described in Section 3.1, we select positive pairs and negative pairs with the same anchor to construct triplet correlations. Analogous to supervised learning, this approach lifts the instance-level mutual information supervision to triplet-level supervision.
Then we show how this approach is theoretically formulated by extending Eq. (13). We set the samples of the random variables $X$ and $Y$ to be sets instead of instances: the deep-layer features of samples belonging to the same class form one set, and their shallow-layer features form another. Then we can obtain the following extension of Eq. (13):
(14)
where we investigate the mutual information based on class-related feature sets. In this case, besides considering the features of the same sample, we also maximize the mutual information between different layers' features for samples belonging to the same class. The overview of the triplet mutual information loss is shown in the bottom right of Figure 2. Specifically, we compute the loss function in Eq. (14) by pairwise sampling. For each sample, we construct the positive pairs and negative pairs based on the pseudo-graph $W$ to compute the triplet mutual information loss, which is very helpful for learning more discriminative representations.
3.5 The Unified Model and Optimization
By combining the investigations of the three aspects in the above subsections and jointly training the network, we arrive at our deep comprehensive correlation mining for unsupervised learning and clustering. The final objective function of DCCM can be formulated as:

$$\mathcal{L} = \mathcal{L}_{W} + \lambda \mathcal{L}_{Y} + \beta \mathcal{L}_{MI}, \quad (15)$$

where $\lambda$ and $\beta$ are constants balancing the contributions of the different terms, $\mathcal{L}_{W}$ is the overall pseudo-graph loss (over both original and transformed samples), and $\mathcal{L}_{Y}$ is the overall pseudo-label loss. The framework of DCCM is presented in Figure 2. Based on the ideally one-hot prediction feature, we compute the highly-confident pseudo-graph and pseudo-labels to guide the feature learning of both the original and transformed samples, investigating both the correlations among different samples and the local robustness to small perturbations. In the meantime, to investigate the feature correspondence for discriminative feature learning, the pseudo-graph is also utilized to select highly-confident positive and negative pairs for the triplet mutual information optimization.
Our proposed method can be trained in a mini-batch based end-to-end way and optimized efficiently. After training, the prediction feature is ideally one-hot, and the predicted cluster label for a sample is exactly its pseudo-label, easily computed by Eq. (7). We summarize the overall training process in Algorithm 1.
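To make the combination concrete, the following toy numpy sketch computes one mini-batch of the pseudo-graph plus pseudo-label part of the objective (applied to both original and transformed predictions), without gradients or the MI term. The binary cross-entropy form for the graph term, the default thresholds, and the weight `lam` are our own illustrative assumptions; the real method trains a CNN end-to-end with all three loss families.

```python
import numpy as np

def batch_losses(z, z_t, tau=0.95, tau_l=0.95, lam=1.0, eps=1e-12):
    """One mini-batch of the combined pseudo-graph / pseudo-label objective.

    z   : (N, K) softmax predictions on original images.
    z_t : (N, K) softmax predictions on transformed images."""
    # Pseudo-graph from cosine similarities of the original predictions.
    zn = z / np.clip(np.linalg.norm(z, axis=1, keepdims=True), eps, None)
    s = zn @ zn.T
    W = (s >= tau).astype(float)
    # Binary cross-entropy between the pseudo-graph and the similarities.
    s_c = np.clip(s, eps, 1 - eps)
    l_graph = float(-np.mean(W * np.log(s_c) + (1 - W) * np.log(1 - s_c)))
    # Highly-confident pseudo-label CE, applied to original AND transformed
    # predictions so the transformed samples inherit the same supervision.
    y, conf = z.argmax(1), z.max(1)
    m = conf > tau_l
    l_label = 0.0
    if m.any():
        l_label = float(-np.mean(np.log(z[m, y[m]] + eps))
                        - np.mean(np.log(z_t[m, y[m]] + eps)))
    return l_graph + lam * l_label
```

In a real training loop this scalar would be produced by an autodiff framework and backpropagated through the CNN, with the triplet MI term added on top.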
Datasets  CIFAR10  CIFAR100  STL10  ImageNet-10  ImageNet-dog-15  Tiny-ImageNet  

Methods\Metrics  NMI  ACC  ARI  NMI  ACC  ARI  NMI  ACC  ARI  NMI  ACC  ARI  NMI  ACC  ARI  NMI  ACC  ARI 
K-means  0.087  0.229  0.049  0.084  0.130  0.028  0.125  0.192  0.061  0.119  0.241  0.057  0.055  0.105  0.020  0.065  0.025  0.005 
SC [47]  0.103  0.247  0.085  0.090  0.136  0.022  0.098  0.159  0.048  0.151  0.274  0.076  0.038  0.111  0.013  0.063  0.022  0.004 
AC [18]  0.105  0.228  0.065  0.098  0.138  0.034  0.239  0.332  0.140  0.138  0.242  0.067  0.037  0.139  0.021  0.069  0.027  0.005 
NMF [6]  0.081  0.190  0.034  0.079  0.118  0.026  0.096  0.180  0.046  0.132  0.230  0.065  0.044  0.118  0.016  0.072  0.029  0.005 
AE [2]  0.239  0.314  0.169  0.100  0.165  0.048  0.250  0.303  0.161  0.210  0.317  0.152  0.104  0.185  0.073  0.131  0.041  0.007 
DAE [41]  0.251  0.297  0.163  0.111  0.151  0.046  0.224  0.302  0.152  0.206  0.304  0.138  0.104  0.190  0.078  0.127  0.039  0.007 
GAN [39]  0.265  0.315  0.176  0.120  0.151  0.045  0.210  0.298  0.139  0.225  0.346  0.157  0.121  0.174  0.078  0.135  0.041  0.007 
DeCNN [46]  0.240  0.282  0.174  0.092  0.133  0.038  0.227  0.299  0.162  0.186  0.313  0.142  0.098  0.175  0.073  0.111  0.035  0.006 
VAE [24]  0.245  0.291  0.167  0.108  0.152  0.040  0.200  0.282  0.146  0.193  0.334  0.168  0.107  0.179  0.079  0.113  0.036  0.006 
JULE [45]  0.192  0.272  0.138  0.103  0.137  0.033  0.182  0.277  0.164  0.175  0.300  0.138  0.054  0.138  0.028  0.102  0.033  0.006 
DEC [43]  0.257  0.301  0.161  0.136  0.185  0.050  0.276  0.359  0.186  0.282  0.381  0.203  0.122  0.195  0.079  0.115  0.037  0.007 
DAC [8]  0.396  0.522  0.306  0.185  0.238  0.088  0.366  0.470  0.257  0.394  0.527  0.302  0.219  0.275  0.111  0.190  0.066  0.017 
DCCM (ours)  0.496  0.623  0.408  0.297  0.340  0.181  0.376  0.482  0.262  0.608  0.710  0.555  0.321  0.383  0.182  0.224  0.108  0.038 
4 Experiments
We organize our experiments into several parts. We first examine the effectiveness of DCCM by comparing it against other state-of-the-art algorithms. After that, we conduct ablation studies by controlling several influence factors. Finally, we perform a series of analysis experiments to verify the effectiveness of the unified model training framework. We first introduce the experimental setting.
Datasets. We select six challenging image datasets for deep unsupervised learning and clustering: CIFAR10 [26], CIFAR100 [26], STL10 [9], ImageNet-10, ImageNet-dog-15, and Tiny-ImageNet [10]. We summarize the statistics of these datasets in Table 1.
For the clustering task, we adopt the same setting as in [8], where the training and validation images of each dataset are jointly utilized, and the superclasses are considered for the CIFAR100 dataset. The ImageNet-10 and ImageNet-dog-15 subsets used in our experiments are the same as in [8], which randomly chooses subjects and kinds of dog images from the ImageNet dataset and resizes the images. As for Tiny-ImageNet, a reduced version of the ImageNet dataset [10], it contains a large number of classes, making it a very challenging dataset for clustering.
For the transfer learning classification task, we adopt a similar setting to that in [20], where we mainly consider CIFAR10 and CIFAR100. Training and testing samples are separated.
Evaluation Metrics. To evaluate clustering performance, we adopt three commonly used metrics: normalized mutual information (NMI), accuracy (ACC), and adjusted rand index (ARI). These three metrics favor different properties of the clustering task; for details, please refer to the appendix. For all three metrics, a higher value indicates better performance.
To evaluate the quality of the feature representation, we adopt the nonlinear classification task as in [20]. Specifically, after the training of DCCM, we fix the parameters of the deep neural network and train a multilayer perceptron with a single hidden layer on top of the last convolutional layer and fully-connected layer features separately, in a supervised way.
Figure 3: Visualizations of embeddings at different stages of DCCM and for DAC on the CIFAR10 dataset. Different colors denote different clusters. From (a) to (c), as the number of epochs increases, DCCM progressively learns more discriminative features. Comparing (c) and (d), the features of DCCM are more discriminative than those of DAC.
Implementation Details. The network architecture used in our framework is a shallow version of AlexNet (details for the different datasets are described in the supplementary materials). Similar to [8], we adopt the RMSprop optimizer. For the hyper-parameters $\lambda$ and $\beta$, we use the same values for all datasets, and performance is relatively stable within a certain range. The thresholds used to construct the highly-confident pseudo-graph and to select highly-confident pseudo-labels are both set to large values. The small perturbations used in the experiments include rotation, shift, rescaling, etc. For the discriminator of the mutual information estimation, we adopt a network with three convolutional layers, the same as [20]. We use PyTorch [37] to implement our approach.

4.1 Main Results
We first compare DCCM with other state-of-the-art clustering methods on the clustering task. The results are shown in Table 2. Most results of the other methods are directly copied from DAC [8]. DCCM significantly surpasses the other methods by a large margin on these benchmarks under all three evaluation metrics. Concretely, the improvement of DCCM is substantial even compared with the state-of-the-art method DAC [8]: taking clustering ACC as an example, our result is clearly higher than that of DAC [8] on the CIFAR10 dataset, and DCCM also shows a clear gain over DAC [8] on the CIFAR100 dataset.
Figure 3 visualizes the feature embeddings of DCCM and DAC on CIFAR10 using t-SNE [31]. We can see that, compared with DAC, DCCM exhibits more discriminative feature representations. The above results sufficiently verify the effectiveness and superiority of our proposed DCCM.
To further evaluate the quality of the feature representations, we adopt the classification task and compare DCCM against several deep unsupervised feature learning methods, including the variational AE (VAE) [24], adversarial AE (AAE) [32], BiGAN [13], noise as targets (NAT) [3], and Deep InfoMax (DIM) [20]. The top-1 nonlinear classification accuracy comparison is presented in Figure 4. We observe that DCCM achieves much better results than the other methods on the CIFAR10 and CIFAR100 datasets. Especially on CIFAR10, our results on both the convolutional and fully-connected layer features are substantially higher than those of the second-best method, DIM. Since we incorporate the graph-based class information and lift the instance-level mutual information to the triplet level, our method learns much more discriminative features, which accounts for the obvious improvement.
Methods  Correlations  Metrics  
  LR  PL  MI  NMI  ACC  ARI  
M1        0.304  0.405  0.232  
M2  ✓      0.412  0.512  0.323  
M3  ✓  ✓    0.448  0.583  0.358  
M4  ✓  ✓  ✓  0.496  0.623  0.408 
4.2 Correlation Analysis
In this section we analyze the effectiveness of the various correlations from three aspects: local robustness, pseudo-label, and triplet mutual information. The results are shown in Table 3.
Local Robustness Influence. The only difference between methods M2 and M1 lies in whether the local robustness mechanism is used. We can see that M2 significantly surpasses M1, which demonstrates the effectiveness of the local robustness term. Because we set a high threshold to select positive pairs, without transformation these easy pairs contribute little to the parameter learning. With the local robustness loss, we construct many hard sample pairs that benefit the network training, which significantly boosts the performance.
Effectiveness of Pseudo-label. With the help of the pseudo-label, M3 (with both pseudo-graph and pseudo-label) achieves much better results than M2 (with only the pseudo-graph) under all metrics, including a clear improvement in clustering ACC. The reason is that the pseudo-label makes full use of the category information behind the feature distribution, which benefits the clustering.
Triplet Mutual Information Analysis. Comparing the results of M4 and M3, we can see that the triplet mutual information further improves the clustering ACC. As analyzed in Section 3.4, with the help of the pseudo-graph, the triplet mutual information not only makes use of the feature correspondence of the same sample, but also introduces a discriminative property by constructing positive and negative pairs, so it further improves the result.
4.3 Overall Study of DCCM
In this section, we conduct analysis experiments on CIFAR10 [26] to investigate the behavior of deep comprehensive correlation mining. The model is trained with the unified model optimization introduced in Section 3.5.
BCubed Precision and Recall of the Pseudo-graph. BCubed [1] is a metric to evaluate the quality of partitions in clustering. We validate that our method learns better representations in a progressive manner using the BCubed [1] precision and recall curves, computed from the pseudo-graphs of different epochs, in Figure 5. It is clear that as the number of epochs increases, the precision of the pseudo-graph becomes much better, which in turn improves the clustering performance.
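For reference, BCubed precision and recall over a predicted partition can be computed as follows. This is a standard item-level formulation of the metric [1], not code from this work.

```python
from collections import Counter

def bcubed(pred, truth):
    """BCubed precision/recall for a predicted partition vs. ground truth.

    For each item, precision is the fraction of items in its predicted
    cluster that share its true label; recall is the fraction of items
    with its true label that land in its predicted cluster. Both are
    averaged over all items."""
    n = len(pred)
    cluster_sizes = Counter(pred)
    class_sizes = Counter(truth)
    pair = Counter(zip(pred, truth))   # cluster-class intersection sizes
    prec = sum(pair[(c, t)] / cluster_sizes[c] for c, t in zip(pred, truth)) / n
    rec = sum(pair[(c, t)] / class_sizes[t] for c, t in zip(pred, truth)) / n
    return prec, rec
```

Merging everything into one cluster keeps recall perfect but drives precision down, which is why tracking both curves over epochs (as in Figure 5) reveals whether the pseudo-graph is becoming genuinely cleaner rather than merely denser.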
Statistics of Prediction Features. According to Claim 1, the ideal prediction features have the one-hot property, so we can use the highly-confident pseudo-labels to guide the training. To verify this, we compare the distribution of the largest prediction probability between the initial stage and the final stage. The results on the CIFAR10 dataset are presented in Figure 6(a). For CIFAR10, the largest probability lies in the range $[0.1, 1]$, which we count over nine disjoint intervals. We can see that in the initial stage only a small fraction of samples have a probability falling in the highest interval, while after training the vast majority of samples do. These results imply that the largest probability tends to 1 and the others tend to 0, which is consistent with Claim 1.
Influence of Thresholds. In Figure 6(b), we test the influence of the threshold used to select highly-confident pseudo-labels for training. We can see that performance increases with the threshold. The reason is that with a low threshold, some incorrect pseudo-labels are adopted for network training, which hurts the performance. So it is important to set a relatively high threshold to select highly-confident pseudo-labels for supervision.
5 Conclusions
For deep unsupervised learning and clustering, we propose DCCM to learn discriminative feature representations by mining comprehensive correlations. Besides the correlation among different samples, we also make full use of the mutual information between corresponding features, the local robustness to small perturbations, and their inter-correlations. We conduct extensive experiments on several challenging datasets and two different tasks to thoroughly evaluate the performance. DCCM achieves significant improvements over the state-of-the-art methods.
References
 [1] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval, 12(4):461–486, 2009.
 [2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layerwise training of deep networks. In NIPS, pages 153–160, 2007.
 [3] P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In ICML, pages 517–526, 2017.
 [4] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, pages 737–744, 1994.
 [5] D. Cai, X. He, and J. Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 17(12):1624–1637, 2005.
 [6] D. Cai, X. He, X. Wang, H. Bao, and J. Han. Locality preserving nonnegative matrix factorization. In IJCAI, 2009.
 [7] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
 [8] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In IEEE ICCV, pages 5879–5887, 2017.
 [9] A. Coates, A. Ng, and H. Lee. An analysis of singlelayer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.
 [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE CVPR, 2009.

 [11] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In IEEE ICCV, pages 5747–5756, 2017.
 [12] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
 [13] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
 [14] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.
 [15] Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou. Deep adversarial metric learning. In IEEE CVPR, pages 2780–2789, 2018.
 [16] E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE CVPR, pages 2790–2797, 2009.
 [17] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
 [18] K. C. Gowda and G. Krishna. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern recognition, 10(2):105–112, 1978.
 [19] X. Guo, L. Gao, X. Liu, and J. Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, pages 1753–1759, 2017.
 [20] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
 [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [22] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In NIPS, pages 24–33, 2017.
 [23] L. Jing and Y. Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019.
 [24] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [25] A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
 [26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [27] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/kriz/cifar.html, 2009.
 [28] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In IEEE CVPR, pages 991–999, 2015.
 [29] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by lowrank representation. IEEE TPAMI, 35(1):171–184, 2013.

 [30] Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In IEEE CVPR, pages 8896–8905, 2018.
 [31] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [32] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

 [33] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2002.
 [34] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84. Springer, 2016.
 [35] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In IEEE CVPR, pages 9359–9367, 2018.
 [36] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, pages 271–279, 2016.
 [37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
 [38] X. Peng, S. Xiao, J. Feng, W.Y. Yau, and Y. Zhang. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
 [39] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [40] U. Shaham, K. Stanton, H. Li, B. Nadler, R. Basri, and Y. Kluger. SpectralNet: Spectral clustering using deep neural networks. In ICLR, 2018.

 [41] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
 [42] P. Xi, F. Jiashi, L. Jiwen, Y. Wei-Yun, and Y. Zhang. Cascade subspace clustering. In AAAI, 2017.
 [43] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, pages 478–487, 2016.

 [44] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards K-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, pages 3861–3870, 2017.
 [45] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In IEEE CVPR, pages 5147–5156, 2016.
 [46] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE CVPR, pages 2528–2535, 2010.
 [47] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, pages 1601–1608, 2005.
6 Supplementary Material
6.1 Proof of Lemma and Claim
Proof of Lemma: Since the edge weights of the graph form a strictly increasing sequence, we can remove edges in order from the smallest weight to the largest by raising the threshold. Each removal either increases the current number of partitions by one or leaves it unchanged. At the beginning of the process the graph forms a single partition, and at the end of the process the number of partitions is maximal. Since the partition number changes by at most one at each step, every intermediate number of partitions is attained at some point in the process.
∎
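The threshold argument in the proof can be illustrated numerically: removing the edges of a toy weighted graph in increasing weight order makes the number of connected components pass through every intermediate value. The graph and weights below are illustrative assumptions, not from the paper:

```python
def components(n, edges):
    """Count connected components of an n-node graph via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return sum(1 for i in range(n) if find(i) == i)

# A connected 4-node graph with strictly increasing edge weights.
weighted_edges = [((0, 1), 0.1), ((1, 2), 0.2), ((2, 3), 0.3), ((0, 2), 0.4)]
counts = []
for _, w in sorted(weighted_edges, key=lambda e: e[1]):
    # keep only edges whose weight exceeds the current threshold w
    remaining = [e for e, wt in weighted_edges if wt > w]
    counts.append(components(4, remaining))
# counts climbs 1 -> 2 -> 3 -> 4, hitting every intermediate value
```

Because the component count changes by at most one per removal, any target number of partitions between the two extremes is achieved at some threshold, which is exactly the intermediate-value argument of the lemma.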
Proof of Claim: Select samples from one partition and consider the cosine similarity matrix of their corresponding optimal features; it equals the partition's pseudo-graph, which is an identity matrix. Consider the set of positive entries of each feature vector. Distinct feature vectors cannot share a positive entry: otherwise, by the pigeonhole principle, two distinct vectors would have a common positive coordinate and hence positive cosine similarity, contradicting the zero off-diagonal entries of the identity matrix.
On the other hand, since each feature vector is the output of a softmax layer, it has at least one positive entry. Therefore, every vector has exactly one positive element, and that element equals 1.
∎
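The one-hot structure asserted by the claim can be checked numerically; the sketch below verifies both directions on toy vectors (all values are illustrative):

```python
import numpy as np

# One-hot features (the optimal solution described in the claim):
# their cosine-similarity matrix is exactly the identity pseudo-graph.
F = np.eye(3)
norms = np.linalg.norm(F, axis=1)
cos_sim = (F @ F.T) / np.outer(norms, norms)
# cos_sim equals np.eye(3)

# The pigeonhole step: two nonnegative (softmax-style) vectors that
# share a positive coordinate necessarily have positive cosine
# similarity, so features with identity similarity cannot overlap
# in support.
u = np.array([0.6, 0.4, 0.0])
v = np.array([0.0, 0.7, 0.3])
# u @ v == 0.28 > 0 because both are positive at coordinate 1
```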
6.2 Definitions of Metrics
We introduce the three standard metrics used to evaluate our model:

Normalized Mutual Information (NMI): Let $C$ and $Y$ denote the predicted partition and the ground-truth partition, respectively. The NMI metric is calculated as:
$$\mathrm{NMI}(Y, C) = \frac{I(Y; C)}{\sqrt{H(Y)\,H(C)}}, \quad (16)$$
where $I(\cdot\,;\cdot)$ denotes mutual information and $H(\cdot)$ denotes entropy.
Adjusted Rand Index (ARI): Given a set of $n$ elements and two groupings or partitions (e.g. clustering results) of these elements into $r$ and $s$ groups, namely $X = \{X_1, \ldots, X_r\}$ and $Y = \{Y_1, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $[n_{ij}]$, where each element denotes the number of objects in common between $X_i$ and $Y_j$:
$$n_{ij} = |X_i \cap Y_j|. \quad (17)$$
With row sums $a_i = \sum_j n_{ij}$ and column sums $b_j = \sum_i n_{ij}$, ARI is defined by:
$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}. \quad (18)$$
Accuracy (ACC): Suppose the clustering algorithm is tested on $N$ samples. For a sample $x_i$, we denote its cluster label by $c_i$ and its ground truth by $l_i$. The clustering accuracy is defined by:
$$\mathrm{ACC} = \frac{\sum_{i=1}^{N} \delta\big(l_i, m(c_i)\big)}{N}, \quad (19)$$
where
$$\delta(x, y) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise}, \end{cases} \quad (20)$$
and the function $m$ denotes the best permutation mapping function obtained by the Hungarian algorithm [5].
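As a concrete reference, all three metrics can be computed directly from two label arrays. The sketch below is a self-contained illustration; the brute-force permutation search in `acc` stands in for the Hungarian algorithm and is only practical for a small number of clusters:

```python
import itertools
from math import comb

import numpy as np

def _contingency(y_true, y_pred):
    """Contingency table n_ij = |{samples with true label i in cluster j}|."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    return np.array([[np.sum((y_true == a) & (y_pred == b))
                      for b in clusters] for a in classes])

def nmi(y_true, y_pred):
    """NMI(Y, C) = I(Y; C) / sqrt(H(Y) H(C)) -- Eq. (16)."""
    joint = _contingency(np.asarray(y_true), np.asarray(y_pred))
    joint = joint / joint.sum()
    py, pc = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(py, pc)[nz]))
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / np.sqrt(h(py) * h(pc))

def ari(y_true, y_pred):
    """Adjusted Rand Index from the contingency table -- Eq. (18)."""
    n_ij = _contingency(np.asarray(y_true), np.asarray(y_pred))
    a, b, n = n_ij.sum(axis=1), n_ij.sum(axis=0), n_ij.sum()
    s_ij = sum(comb(int(x), 2) for x in n_ij.ravel())
    s_a = sum(comb(int(x), 2) for x in a)
    s_b = sum(comb(int(x), 2) for x in b)
    expected = s_a * s_b / comb(int(n), 2)
    return (s_ij - expected) / ((s_a + s_b) / 2 - expected)

def acc(y_true, y_pred):
    """Best-mapping accuracy -- Eq. (19), brute-forced over permutations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    return max(np.mean(np.array(perm)[y_pred] == y_true)
               for perm in itertools.permutations(range(k)))

# A partition that matches the ground truth up to a label swap scores
# perfectly on all three metrics:
y_true, y_pred = [0, 0, 1, 1], [1, 1, 0, 0]
# nmi, ari and acc all evaluate to 1.0 here
```

All three metrics are invariant to the naming of clusters, which is why the label-swapped prediction above still scores 1.0.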
6.3 Compared Methods
For clustering, we compare against both traditional methods and deep-learning-based methods, including K-means, spectral clustering (SC) [47], agglomerative clustering (AC) [18], nonnegative matrix factorization (NMF) based clustering [6], auto-encoder (AE) [2], denoising auto-encoder (DAE) [41], GAN [39], deconvolutional networks (DECNN) [46], variational auto-encoder (VAE) [24], deep embedded clustering (DEC) [43], joint unsupervised learning (JULE) [45], and deep adaptive image clustering (DAC) [8].
6.4 Architecture Details
In Table 4, we present the architectures for the different datasets. For CIFAR-10/CIFAR-100 [27], the network consists of convolutional and pooling layers, followed by fully-connected layers. Batch Normalization [21] and ReLU are used on all hidden layers. The output features after the second convolutional layer (S, for shallow) and the first fully-connected layer (D, for deep) are used to compute the mutual information loss, concatenated as the input of the discriminator. For the other datasets, such as Tiny-ImageNet [10] and STL-10 [9], we use more convolutional layers; due to their larger input size, we use the feature maps after the third convolutional layer as S. For all experiments, the output is a fixed-dimensional vector.
6.5 Sampling Strategy
Methods                              Classification ACC (Y64)
V1   nearest pos + random* neg       0.744
V2   nearest pos + farthest neg      0.713
V3   random* pos + random* neg       0.731
V4   top-n pos + random* neg         0.698
The experimental results corresponding to the analysis in the main paper are listed in Table 5. We tried four strategies to fetch positive and negative pairs from the pseudo-graph, and the terms used in the table are defined as follows:

nearest means that for each sample, we select its nearest sample within the mini-batch to construct a positive pair, while farthest means taking the farthest one to construct a negative pair.

random* means that we randomly take a sample marked as positive in the pseudo-graph to construct a positive pair, or a sample marked as negative to construct a negative pair.

top-n pos means that we select the top-n most confident pairs from the graph to construct positive pairs.
For each strategy, we take the same number of positive pairs and negative pairs into account, determined by the batch size. This ensures that the computational complexity of each approach is nearly the same for a fair comparison; we have also explored more costly approaches and found the improvement to be negligible.
To clearly illustrate the effect, we first fix a model trained with a single loss term. Then, with the pseudo-graph generated by this model, we train a new model from scratch using only the pairwise loss. It can be concluded that the positive pairs are sensitive to noise, since strategy V1 achieves better results than V3, and that harder negative pairs are beneficial for training, since V1 also outperforms V2. Besides, we also note the importance of uniform sampling within the mini-batch: the top-n pairs in V4 have higher confidence than those in V1, but training collapses because only part of the samples in the batch are covered by the top-n strategy.
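The best-performing V1 strategy (nearest positive plus random negative) can be sketched as follows. This is a minimal illustration in which `sample_pairs`, the toy features, and the binary pseudo-graph layout are all assumptions for the sketch, not the paper's actual implementation:

```python
import numpy as np

def sample_pairs(features, graph, rng=None):
    """Strategy V1 sketch: pair each sample with its nearest positive
    neighbour and one uniformly random negative, according to a binary
    pseudo-graph over the mini-batch (1 = similar, 0 = dissimilar)."""
    rng = rng or np.random.default_rng(0)
    features = np.asarray(features, dtype=float)
    sim = features @ features.T      # inner products (L2-normalized inputs)
    np.fill_diagonal(sim, -np.inf)   # never pair a sample with itself
    pos_pairs, neg_pairs = [], []
    for i in range(len(features)):
        pos = np.nonzero(graph[i])[0]
        pos = pos[pos != i]
        neg = np.nonzero(graph[i] == 0)[0]
        if len(pos):                 # nearest positive
            pos_pairs.append((i, int(pos[np.argmax(sim[i, pos])])))
        if len(neg):                 # random negative
            neg_pairs.append((i, int(rng.choice(neg))))
    return pos_pairs, neg_pairs

# Toy batch: samples 0 and 1 are marked as similar, sample 2 as dissimilar.
features = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
graph = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
pos, neg = sample_pairs(features, graph)
# pos == [(0, 1), (1, 0)]; each sample also gets one random negative
```

Swapping `np.argmax` for the minimum over negatives would give V2 (farthest negative), and drawing positives uniformly would give V3, so the four strategies in Table 5 differ only in these selection rules.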