Deep Robust Clustering by Contrastive Learning

08/07/2020 · by Huasong Zhong, et al.

Recently, many unsupervised deep learning methods have been proposed to learn clustering from unlabelled data. By introducing data augmentation, most of the latest methods look into deep clustering from the perspective that the original image and its transformation should share a similar semantic clustering assignment. However, the representation features before the softmax activation function could be quite different even when the assignment probabilities are very similar, since softmax is only sensitive to the maximum value. This may result in high intra-class diversity in the representation feature space, which leads to unstable local optima and thus harms the clustering performance. By investigating the internal relationship between mutual information and contrastive learning, we summarize a general framework that can turn any maximization of mutual information into the minimization of a contrastive loss. We apply it to both the semantic clustering assignment and the representation feature and propose a novel method named Deep Robust Clustering by Contrastive Learning (DRC). Different from existing methods, DRC aims to increase inter-class diversity and decrease intra-class diversity simultaneously, and thereby achieve more robust clustering results. Extensive experiments on six widely-adopted deep clustering benchmarks demonstrate the superiority of DRC in both stability and accuracy, e.g., attaining 71.6% mean accuracy on CIFAR-10, which is 7.1% higher than state-of-the-art results.

Introduction

Clustering aims to separate samples into different groups such that samples in the same cluster are as similar as possible while samples in different clusters are as dissimilar as possible; it is one of the most basic problems in machine learning [40, 32]. Clustering in computer vision is especially difficult due to the lack of low-dimensional discriminative representations. With the development of deep learning [31], more and more researchers focus on simultaneously learning features and cluster assignments from unlabelled images. Although deep clustering methods perform significantly better than traditional ones, they are still far from satisfactory on many large and complicated image datasets. How to improve the accuracy and stability of clustering remains an important but challenging problem.

Most of the existing methods alternately update the cluster assignment and the inter-sample similarities that are used to guide model training [44, 6, 42]. Nevertheless, they are susceptible to the inevitable errors distributed in the neighborhoods and suffer from error propagation during training. To solve this problem, some methods proposed to take advantage of mutual information and data augmentation [1] to simultaneously learn the representation and the cluster assignment [2, 24, 20]. Specifically, they tried to maximize the mutual information between the assignment distributions of the original images and their augmentations, which helps to greatly improve the performance. In a general deep clustering framework, the representation is first extracted by a convolutional neural network (CNN), then the $K$-dimensional logits (which we call the assignment feature) are obtained after a fully connected layer. After that, the $K$-dimensional assignment probability can be calculated by the softmax function. The main idea of the latest methods is that the original image and its augmentation should share a similar assignment probability. However, the assignment features could be quite different even when the assignment probabilities are almost the same, since the assignment probability is only sensitive to the maximum value of the assignment feature (see Figure 1(a)). Therefore, this leads to unstable clustering results with high intra-class diversity. As shown in Figure 1(b), many boundary points can be misclassified if only the assignment probability is used, which greatly harms the clustering performance.

In order to obtain more stable clusters and improve the clustering accuracy, we propose a novel method named Deep Robust Clustering by Contrastive Learning (DRC). Different from existing methods, DRC tries to learn not only invariant clustering results but also invariant features. From the perspective of the assignment probability, DRC aims to maximize the mutual information between the cluster assignment distributions of the original images and their augmentations from a global view, which helps to increase inter-class variance and leads to highly confident partitions. From the perspective of the assignment feature, DRC aims to maximize the mutual information between the assignment features of the original image and its augmentation from a local view, which helps to decrease intra-class variance and achieve more robust clusters (see Figure 1(c)). In addition, we demonstrate that maximizing the mutual information is equivalent to minimizing two contrastive losses, which has been proven powerful and friendly for training in unsupervised learning [17, 38, 15, 7]. The main contributions can be summarized as:

  • We point out the drawback of existing deep clustering methods and propose a new method named Deep Robust Clustering by Contrastive Learning (DRC). As far as we know, DRC is the first work to look into deep clustering from the perspectives of both the assignment probability and the assignment feature, which helps to increase inter-class diversity and decrease intra-class diversity simultaneously.

  • We investigate the internal relationship between mutual information and contrastive learning and summarize a general framework that can turn any maximization of mutual information into the minimization of a contrastive loss. DRC is the first work that successfully applies contrastive learning to deep clustering and achieves significant improvement.

  • Extensive experiments on six widely-adopted deep clustering benchmarks show that DRC achieves more robust clusters and outperforms a wide range of state-of-the-art methods.

Related Work

Deep Clustering.

There exist two categories of deep clustering approaches: those that explicitly learn the cluster assignment [45, 44, 5, 13, 41, 6, 46] and those that combine deep representation learning with cluster analysis [22, 36, 14, 25]. The former usually aims to mine estimated ground-truth information to train the network in the manner of a supervised method. DEC [44] introduced K-means [27] to conduct cluster assignment on pre-trained image features and then iteratively optimized the network with more confidently estimated ground truth. IIC [23] and DCCM [41] exploited inter-sample relations based on the pairwise relationships between the latest sample features and optimized the model accordingly. However, these pseudo relations or pseudo labels may cause severe error propagation in the early stage of training, which limits their performance. In contrast, the latter category focuses on exploiting a good cluster structure to train the network. PICA [21] maximized the global partition confidence of the clustering solution.

Mutual Information.

Information theory has been utilized as a tool to train deep networks. IMSAT [19] used data augmentation to impose invariance on discrete representations by maximizing the mutual information between data and their representations. DeepINFOMAX [18] simultaneously estimated and maximized the mutual information between input data and learned high-level representations. However, these methods computed mutual information over continuous random variables, which requires complex estimators. In contrast, IIC [23] did so for discrete variables with simple and exact computations.

Contrastive Learning.

Contrastive learning has been widely used in unsupervised deep learning. [8] proposed this technique, which used a max-margin approach to separate positive from negative examples based on triplet losses. [11] proposed a parametric method that treats each instance as a class represented by a feature vector. [43] introduced a memory bank to store the instance-class representation embeddings. [50, 38] then adopted and extended this memory-bank-based approach in their recent work. In contrast, [10, 47] replaced the memory bank with in-batch samples for negative sampling.

Method

Figure 2: Overview of the proposed Deep Robust Clustering (DRC) method for unsupervised deep clustering.

In this section, we first present the problem definition of deep clustering and propose our novel end-to-end deep clustering framework. Then, we demonstrate the relationship between mutual information and contrastive learning. Next, we introduce two contrastive losses based on this relationship: an assignment feature loss and an assignment probability loss. Finally, we describe the model training procedure for DRC.

Problem Formulation

Given a set of $N$ unlabelled images $\{x_i\}_{i=1}^{N}$ drawn from $K$ different semantic classes, deep clustering aims to separate the images into $K$ clusters using convolutional neural network (CNN) models, such that images with the same semantic label are grouped into the same cluster. Here we aim to learn a deep CNN mapping function $f_{\theta}$ with parameters $\theta$, so that each image $x_i$ is mapped to a $K$-dimensional assignment feature $z_i = f_{\theta}(x_i)$. After that, the assignment probability vector $p_i$ can be obtained by the softmax function, defined by

$p_{i,k} = \frac{\exp(z_{i,k})}{\sum_{j=1}^{K} \exp(z_{i,j})}, \quad k = 1, \dots, K.$

Then the cluster assignment can be predicted by maximum likelihood:

$c_i = \arg\max_{k} p_{i,k}.$
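To make the mapping concrete, here is a minimal PyTorch sketch of the logits-to-cluster step; the backbone producing the assignment features is left abstract, and the batch and cluster sizes are hypothetical:

```python
import torch
import torch.nn.functional as F

# z: (N, K) assignment features (logits) produced by the CNN mapping f_theta
z = torch.randn(8, 10)        # hypothetical mini-batch of 8 images, K = 10 clusters
p = F.softmax(z, dim=1)       # assignment probability vectors, each row sums to 1
c = p.argmax(dim=1)           # predicted cluster index per image (maximum likelihood)
```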

Framework

To address the above problem, we introduce a novel end-to-end deep clustering framework that takes advantage of both the assignment probability and the assignment feature. As shown in Figure 2, we first adopt a deep convolutional neural network (CNN) to generate $K$-dimensional assignment features and assignment probabilities. After that, a contrastive loss based on the assignment probability is used to hold the assignment consistency between the original images and their augmentations, which helps to increase inter-class variance and form well-separated clusters. A contrastive loss based on the assignment feature is used to capture the representation consistency between the original images and their augmentations, which helps to decrease intra-class variance and achieve more robust clusters.

Mutual Information and Contrastive Learning

Contrastive learning has been proven to be powerful in unsupervised and self-supervised learning, helping to achieve state-of-the-art results in many tasks, and the contrastive loss is also strongly related to mutual information. Let $X = \{x_1, \dots, x_N\}$ be samples in a given space, and let $x_i'$ denote the transformation of $x_i$, with $X' = \{x_1', \dots, x_N'\}$. Since we know nothing about the ground truth of $x_i$, all we know is that $x_i'$ can be viewed as a positive sample of $x_i$ for any $i$. In other words, $p(x_i' \mid x_i)$ should be much larger than $p(x_j' \mid x_i)$ for $j \neq i$. A very natural idea is to maximally preserve the mutual information between $X$ and $X'$, defined as

$I(X; X') = \sum_{x, x'} p(x, x') \log \frac{p(x, x')}{p(x)\, p(x')}.$   (1)

If we assume

$\frac{p(x_i, x_j')}{p(x_i)\, p(x_j')} = e^{g(x_i, x_j')},$   (2)

where $g$ is a score function that can take different forms in different situations, then we have the following theorem.

Theorem 1

Assume there exists a constant such that holds for all , then

holds.

Proof: Denote , then we have

Define

(3)

so minimizing the contrastive loss is equivalent to maximizing a lower bound of the mutual information $I(X; X')$.
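For orientation, the standard InfoNCE-style relationship that this kind of argument yields can be written out as follows; this is a sketch of the generic result, with $g$ a score function measuring the agreement between a sample and a transformation, and not necessarily the exact statement or constants of Theorem 1:

$\mathcal{L}_{\mathrm{con}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{g(x_i, x_i')}}{\sum_{j=1}^{N} e^{g(x_i, x_j')}}, \qquad I(X; X') \ \geq\ \log N - \mathcal{L}_{\mathrm{con}},$

so driving the contrastive loss toward its minimum pushes up a lower bound on $I(X; X')$.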

Loss Function

Our loss function consists of three parts: (1) a contrastive loss based on the assignment feature, which preserves mutual information at the feature level; (2) a contrastive loss based on the assignment probability, which maximizes the mutual information between the predicted labels of the original images and those of the transformed images; and (3) a cluster regularization loss that avoids trivial solutions.

Let $x_i' = T_i(x_i)$ be the augmentation of $x_i$, where $T_i$ is a random transformation applied to $x_i$.

Assignment Feature Loss.

From the perspective of the assignment feature, we can take $X = \{z_1, \dots, z_N\}$ and $X' = \{z_1', \dots, z_N'\}$ in Theorem 1, where $z_i = f_{\theta}(x_i)$ and $z_i' = f_{\theta}(x_i')$.

A basic assumption is that the assignment features of an image and its augmentation should be similar. To maximize the mutual information between them, it is reasonable to define

(4)

according to Theorem 1, where a temperature parameter is used to scale the similarity. Then we can define the assignment feature loss as:

(5)
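As a concrete reference, here is a minimal PyTorch sketch of one plausible instantiation of the assignment feature loss, assuming a cosine-similarity score scaled by a temperature tau; the exact normalization and negative set of Eqs. (4)–(5) may differ from this sketch:

```python
import torch
import torch.nn.functional as F

def assignment_feature_loss(z, z_aug, tau=0.5):
    """InfoNCE-style contrastive loss between assignment features of images
    and their augmentations (a sketch; the paper's exact Eq. (5) may differ)."""
    z = F.normalize(z, dim=1)            # (N, K) features of the original images
    z_aug = F.normalize(z_aug, dim=1)    # (N, K) features of the augmented images
    sim = z @ z_aug.t() / tau            # (N, N) scaled cosine similarities
    targets = torch.arange(z.size(0), device=z.device)
    # Row i: the positive pair is (x_i, x_i'); other augmentations act as negatives.
    return F.cross_entropy(sim, targets)
```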

Assignment Probability Loss.

As mentioned in the problem formulation, let $P, P' \in \mathbb{R}^{N \times K}$ be the assignment probability matrices for $X$ and $X'$, respectively. We can write the matrices column-wise as $P = [q_1, \dots, q_K]$ and $P' = [q_1', \dots, q_K']$, where the columns $q_k$ and $q_k'$ tell us which images in $X$ and $X'$ will be assigned to cluster $k$. Since $X'$ consists of augmentations of $X$, the cluster assignments should be consistent, which is equivalent to maximizing the mutual information

(6)

where the mutual information is computed from the joint assignment distribution of $X$ and $X'$ and the corresponding marginal distributions. Based on Theorem 1, we can define

(7)

where a second temperature parameter is used. Then we can define the assignment probability loss as:

(8)
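Analogously, here is a minimal PyTorch sketch of a cluster-level contrastive loss that treats the $K$ columns of the two assignment probability matrices as positive pairs; this is one plausible reading of Eqs. (7)–(8), not necessarily the exact form used in the paper:

```python
import torch
import torch.nn.functional as F

def assignment_probability_loss(p, p_aug, tau=1.0):
    """Cluster-level contrastive loss: column k of P and column k of P'
    (the same cluster seen through the two views) form a positive pair."""
    q = F.normalize(p.t(), dim=1)        # (K, N): cluster k's assignments over the batch
    q_aug = F.normalize(p_aug.t(), dim=1)
    sim = q @ q_aug.t() / tau            # (K, K) similarities between cluster columns
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(sim, targets)
```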

Cluster regularization Loss.

In deep clustering, it is easy to fall into a local optimal solution that assigns most samples to a minority of clusters. Inspired by the group lasso [34], we introduce a cluster regularization loss to address this problem, which can be formulated as:

(9)

where $p_{i,k}$ indicates the $k$-th element of $p_i$.

Then the overall objective function of DRC can be formulated as:

(10)

where $\lambda$ is a weight parameter.
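Putting the pieces together, the sketch below combines the two loss sketches above into an overall objective. Because the exact group-lasso-style form of Eq. (9) is not recoverable from this extract, the regularizer shown is a commonly used stand-in (negative entropy of the average assignment) named here only for illustration:

```python
import torch

def cluster_regularization(p, eps=1e-8):
    """Stand-in regularizer discouraging degenerate solutions: penalize the
    negative entropy of the average cluster assignment (NOT necessarily the
    exact group-lasso form of Eq. (9))."""
    mean_p = p.mean(dim=0)                         # average assignment mass per cluster
    return (mean_p * (mean_p + eps).log()).sum()   # minimized when clusters are balanced

def drc_objective(z, z_aug, p, p_aug, lam=1.0):
    """Overall DRC-style objective in the spirit of Eq. (10), reusing the
    assignment_feature_loss and assignment_probability_loss sketches above."""
    return (assignment_feature_loss(z, z_aug)
            + assignment_probability_loss(p, p_aug)
            + lam * cluster_regularization(p))
```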

Model training

The objective function (Eq. 10) is differentiable end-to-end, enabling the conventional stochastic gradient descent algorithm for model training. The assignment probability loss and assignment feature loss are calculated on a random mini-batch of images and their augmentations. The training procedure is summarized in Algorithm 1.

Input: Training images $\{x_i\}_{i=1}^{N}$, training epochs $E$, cluster number $K$.
Output: A deep clustering model with parameters $\theta$.
for each epoch do
       Step 1: Sample a random mini-batch of images;
       Step 2: Generate augmentations for the sampled images;
       Step 3: Compute the assignment feature loss according to Eq. 5;
       Step 4: Compute the assignment probability loss according to Eq. 8;
       Step 5: Compute the cluster regularization loss according to Eq. 9;
       Step 6: Update $\theta$ with Adam by minimizing the overall loss according to Eq. 10;
end for
Algorithm 1: Training algorithm for DRC
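For concreteness, here is a PyTorch-style rendering of Algorithm 1, assuming a generic backbone `model` that maps a batch of images to $K$-dimensional assignment features, an `augment` function, and the loss sketches from the previous section (all hypothetical names):

```python
import torch
import torch.nn.functional as F

def train_drc(model, loader, augment, epochs=500, lr=1e-4, lam=1.0, device="cuda"):
    """Training loop following Algorithm 1 (a sketch reusing the loss sketches above)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for x in loader:                        # Step 1: random mini-batch of images
            x = x.to(device)
            x_aug = augment(x)                  # Step 2: generate augmentations
            z, z_aug = model(x), model(x_aug)   # K-dimensional assignment features
            p, p_aug = F.softmax(z, dim=1), F.softmax(z_aug, dim=1)
            loss = (assignment_feature_loss(z, z_aug)         # Step 3: Eq. (5)
                    + assignment_probability_loss(p, p_aug)   # Step 4: Eq. (8)
                    + lam * cluster_regularization(p))        # Step 5: Eq. (9) stand-in
            opt.zero_grad()
            loss.backward()                     # Step 6: update theta with Adam (Eq. (10))
            opt.step()
    return model
```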

Experiments

Datasets & Metrics

We conduct extensive experiments on six widely-adopted benchmark datasets. For fair comparison, we adopt the same experimental setting as  [6, 21].

  • CIFAR-10/100: [28] Natural image datasets with 50,000 training and 10,000 test samples from 10 (or 100) classes; the training and test images of each dataset are jointly used for clustering.

  • STL-10: [9] An ImageNet-sourced dataset containing 500/800 training/test images from each of 10 classes and an additional 100,000 unlabelled samples from several unknown categories.

  • ImageNet-10 and ImageNet-Dogs: [6] Two subsets of ImageNet [29]: the former with 10 randomly selected subjects and the latter with 15 dog breeds.

  • Tiny-ImageNet: [30] A subset of ImageNet with 200 classes, which is a very challenging dataset for clustering. There are 100,000/10,000 training/test images evenly distributed across the categories.

  • Evaluation Metrics: We used three standard clustering performance metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI); a standard recipe for computing them is sketched after this list.
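For reference, a standard recipe for these metrics (not necessarily the authors' exact evaluation code): clustering ACC uses the Hungarian algorithm to find the best cluster-to-label mapping, while NMI and ARI come directly from scikit-learn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one mapping between predicted clusters and labels (Hungarian algorithm)."""
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # count co-occurrences of (cluster, label)
    row, col = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[row, col].sum() / len(y_true)

# NMI and ARI:
# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```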

Figure 3: The variance of ACC for DRC and PICA on six different datasets.
Datasets CIFAR-10 CIFAR-100 STL-10 ImageNet-10 Imagenet-dog-15 Tiny-ImageNet
Methods\Metrics NMI ACC ARI NMI ACC ARI NMI ACC ARI NMI ACC ARI NMI ACC ARI NMI ACC ARI
K-means 0.087 0.229 0.049 0.084 0.130 0.028 0.125 0.192 0.061 0.119 0.241 0.057 0.055 0.105 0.020 0.065 0.025 0.005
SC 0.103 0.247 0.085 0.090 0.136 0.022 0.098 0.159 0.048 0.151 0.274 0.076 0.038 0.111 0.013 0.063 0.022 0.004
AC 0.105 0.228 0.065 0.098 0.138 0.034 0.239 0.332 0.140 0.138 0.242 0.067 0.037 0.139 0.021 0.069 0.027 0.005
NMF 0.081 0.190 0.034 0.079 0.118 0.026 0.096 0.180 0.046 0.132 0.230 0.065 0.044 0.118 0.016 0.072 0.029 0.005
AE 0.239 0.314 0.169 0.100 0.165 0.048 0.250 0.303 0.161 0.210 0.317 0.152 0.104 0.185 0.073 0.131 0.041 0.007
DAE 0.251 0.297 0.163 0.111 0.151 0.046 0.224 0.302 0.152 0.206 0.304 0.138 0.104 0.190 0.078 0.127 0.039 0.007
GAN 0.265 0.315 0.176 0.120 0.151 0.045 0.210 0.298 0.139 0.225 0.346 0.157 0.121 0.174 0.078 0.135 0.041 0.007
DeCNN 0.240 0.282 0.174 0.092 0.133 0.038 0.227 0.299 0.162 0.186 0.313 0.142 0.098 0.175 0.073 0.111 0.035 0.006
VAE 0.245 0.291 0.167 0.108 0.152 0.040 0.200 0.282 0.146 0.193 0.334 0.168 0.107 0.179 0.079 0.113 0.036 0.006
JULE 0.192 0.272 0.138 0.103 0.137 0.033 0.182 0.277 0.164 0.175 0.300 0.138 0.054 0.138 0.028 0.102 0.033 0.006
DEC 0.257 0.301 0.161 0.136 0.185 0.050 0.276 0.359 0.186 0.282 0.381 0.203 0.122 0.195 0.079 0.115 0.037 0.007
DAC 0.396 0.522 0.306 0.185 0.238 0.088 0.366 0.470 0.257 0.394 0.527 0.302 0.219 0.275 0.111 0.190 0.066 0.017
DCCM 0.496 0.623 0.408 0.285 0.327 0.173 0.376 0.482 0.262 0.608 0.710 0.555 0.321 0.383 0.182 0.224 0.108 0.038
IIC - 0.617 - - 0.257 - - 0.610 - - - - - - - - - -
PICA:(Mean) 0.561 0.645 0.467 0.296 0.322 0.159 0.592 0.693 0.504 0.782 0.850 0.733 0.336 0.324 0.179 0.277 0.094 0.016
PICA:(Best) 0.591 0.696 0.512 0.310 0.337 0.171 0.611 0.713 0.531 0.802 0.870 0.761 0.352 0.352 0.201 0.277 0.098 0.040
DRC:(Mean) 0.612 0.716 0.534 0.343 0.355 0.196 0.639 0.744 0.564 0.828 0.883 0.796 0.377 0.373 0.222 0.315 0.132 0.053
DRC:(Best) 0.621 0.727 0.547 0.356 0.367 0.208 0.644 0.747 0.569 0.830 0.884 0.798 0.384 0.389 0.233 0.321 0.139 0.056
Table 1: Clustering performance of different methods on six challenging datasets. The best and second-best results are highlighted in blue / red.

Implementation Details

We adopt PyTorch [35] to implement our approach. The network architecture used in our framework is a variant of ResNet [16], the same as in [21]. For fair comparison with other approaches, we followed most of the settings in [21, 23]. We used the Adam [25] optimizer and trained for 500 epochs. We set the batch size to 256 and repeated each in-batch sample twice for contrastive learning. For hyper-parameters, we used the same setting for all datasets, with separate temperatures for the assignment feature loss and the assignment probability loss. Similar to PICA [21], we also utilized the same auxiliary over-clustering method in a separate clustering head to exploit additional data from irrelevant classes when available. To report the stable performance of the approach, we trained our model on all datasets with 5 trials and report the average and best results separately.

Comparisons to State-of-the-Art Methods.

For clustering, we compare with both traditional methods and deep learning based methods, including K-means, spectral clustering (SC) [49], agglomerative clustering (AC) [12], nonnegative matrix factorization (NMF) based clustering [4], auto-encoder (AE) [3], denoising auto-encoder (DAE) [39], GAN [37], deconvolutional networks (DeCNN) [48], variational auto-encoder (VAE) [26], deep embedded clustering (DEC) [44], joint unsupervised learning (JULE) [46], deep adaptive image clustering (DAC) [6], invariant information clustering (IIC) [23], deep comprehensive correlation mining (DCCM) [41], and partition confidence maximisation (PICA) [21]. The results are shown in Table 1; most results of the other methods are directly copied from PICA [21]. Our approach DRC surpasses the other methods by a large margin on the six widely-used deep clustering benchmarks under three different evaluation metrics. The improvement of DRC is significant even compared with the state-of-the-art method PICA: taking the mean clustering ACC as an example, our results are 7.1%, 3.3%, and 5.1% higher than those of PICA on CIFAR-10, CIFAR-100, and STL-10, respectively. We further compare the variance of ACC between DRC and PICA [21]; the results are shown in Figure 3. DRC has a much smaller variance than PICA on five of the six datasets, which implies that DRC gives much more robust clustering results under different initializations.

Ablation Study

In this section, we use ablation analysis to demonstrate that all three loss terms in DRC are important for achieving state-of-the-art performance.

Effect of two contrastive losses.

We first investigate how the assignment probability loss and the assignment feature loss affect the clustering performance on CIFAR-10, CIFAR-100, and ImageNet-10; the results are shown in Table 2. It is clear that both losses help on all three datasets. It is also reasonable that the assignment probability loss plays the greater role, since it directly affects the clustering result. The assignment feature loss is nevertheless indispensable, especially on CIFAR-10 and CIFAR-100.

(a) AP-PICA
(b) AP-DRC
(c) AF-PICA
(d) AF-DRC
Figure 4: Visualization of assignment probability and assignment feature on CIFAR-10 dataset. (a) assignment probability of PICA, (b) assignment probability of DRC, (c) assignment feature of PICA, (d) assignment feature of DRC.
Figure 5: Cases studies on STL-10. (Left) Successful cases; (Middle) False negative and (Right) false positive failure cases.
Method CIFAR-10 CIFAR-100 ImageNet-10
DRC w/o AP 0.319 0.174 0.445
DRC w/o AF 0.661 0.296 0.875
DRC 0.716 0.355 0.883
Table 2: Effect of two contrastive losses. Metric: ACC.
(a) Intra-Class Variance
(b) Inter-Class Variance
Figure 6: Variance analysis on CIFAR-10.

Effect of cluster regularization loss.

Deep clustering can easily fall into a degenerate local optimum in which most samples are assigned to the same cluster. We then examine how the cluster regularization loss addresses this problem. As shown in Table 3, it significantly helps to improve the clustering performance. It is interesting that the assignment feature loss and cluster regularization loss have little impact on ImageNet-10, since it is a relatively easy dataset in which images from different classes are already well separated.

Method CIFAR-10 CIFAR-100 ImageNet-10
DRC w/o CR 0.613 0.265 0.868
DRC 0.716 0.355 0.883
Table 3: Effect of cluster regularization loss. Metric: ACC.

Effect of batch size.

According to [7], contrastive learning benefits from larger batch sizes. To evaluate the effect of batch size, we trained DRC on CIFAR-10 with batch sizes in {32, 64, 128, 256, 512, 1024}. The results are shown in Table 4. We find that larger batch sizes achieve better performance.

Batch-size 64 128 256 512 1024
NMI 0.603 0.604 0.612 0.623 0.634
ACC 0.684 0.712 0.716 0.718 0.722
ARI 0.515 0.528 0.534 0.538 0.542
Table 4: Effect of batch size on CIFAR-10 dataset.

Variance analysis.

A good cluster embedding should have small intra-class variance and large inter-class variance. To demonstrate the superiority of DRC in this respect, we randomly select 6,000 samples from CIFAR-10 and calculate the intra-class and inter-class variance using the assignment probability. As shown in Figure 6, DRC achieves a relatively smaller intra-class variance and a larger inter-class variance than PICA [21]. This also demonstrates that DRC obtains more robust clusters than the existing state-of-the-art methods.
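This comparison can be reproduced along the following lines; the exact variance definitions behind Figure 6 are an assumption here, so treat this as one plausible formulation:

```python
import numpy as np

def intra_inter_variance(p, labels):
    """Intra-class: mean squared distance of assignment-probability vectors to their
    class mean. Inter-class: mean squared distance of class means to the global mean.
    (One plausible definition; the paper's exact formulation may differ.)"""
    classes = np.unique(labels)
    means = np.stack([p[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([((p[labels == c] - means[i]) ** 2).sum(axis=1).mean()
                     for i, c in enumerate(classes)])
    inter = ((means - means.mean(axis=0)) ** 2).sum(axis=1).mean()
    return intra, inter
```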

Qualitative Study

Visualization of cluster assignment.

To further illustrate that DRC yields more robust clustering results, we compare it with PICA on CIFAR-10 by visualising the assignment feature and the assignment probability. We plot the predictions of 6,000 randomly selected samples with the ground-truth classes color-encoded using t-SNE [33]. Figures 4(a) and 4(b) show the results for the assignment probability: samples of the same class are closer and samples of different classes are better separated for DRC. For the assignment feature, we observe a similar phenomenon in Figures 4(c) and 4(d).
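A minimal sketch of this kind of visualization with scikit-learn's t-SNE and matplotlib; `feats` and `labels` are hypothetical inputs holding the 6,000 sampled features (or probabilities) and their ground-truth classes:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels):
    """2-D t-SNE embedding of assignment features/probabilities, colored by ground truth."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=3)
    plt.axis("off")
    plt.show()
```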

Success vs. failure cases.

Finally, we investigate both success and failure cases to gain extra insight into our method. Specifically, we study three cases over four classes from STL-10: (1) success cases, (2) false negative failure cases, and (3) false positive failure cases. As shown in Figure 5, DRC can successfully group together images of the same class with different backgrounds and angles. The two types of failure cases tell us that DRC mainly learns the shape of objects: samples of different classes with a similar pattern may be grouped together, and samples of the same class with different patterns may be separated into different clusters. It is hard to look into such details in the absence of ground-truth labels, which remains an unsolved problem for unsupervised learning and clustering.

Conclusion

We summarized a general framework that can turn any maximization of mutual information into the minimization of a contrastive loss, and we applied it to both the semantic clustering assignment and the representation feature, which helps to increase inter-class diversity and decrease intra-class diversity and thereby perform more robust clustering. Extensive experiments on six challenging datasets demonstrate that DRC achieves state-of-the-art results.

References

  • [1] A. Antoniou, A. Storkey, and H. Edwards (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340. Cited by: Introduction.
  • [2] Y. M. Asano, C. Rupprecht, and A. Vedaldi (2019) Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371. Cited by: Introduction.
  • [3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153–160. Cited by: Comparisons to State-of-the-Art Methods..
  • [4] D. Cai, X. He, X. Wang, H. Bao, and J. Han (2009) Locality preserving nonnegative matrix factorization.. In IJCAI, Vol. 9, pp. 1010–1015. Cited by: Comparisons to State-of-the-Art Methods..
  • [5] J. Chang, Y. Guo, L. Wang, G. Meng, S. Xiang, and C. Pan (2019) Deep discriminative clustering analysis. arXiv preprint arXiv:1905.01681. Cited by: Deep Clustering..
  • [6] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan (2017) Deep adaptive image clustering. In Proceedings of the IEEE international conference on computer vision, pp. 5879–5887. Cited by: Introduction, Deep Clustering., 3rd item, Datasets & Metrics, Comparisons to State-of-the-Art Methods..
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: Introduction, Effect of batch size..
  • [8] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, pp. 539–546. Cited by: Contrastive Learning..
  • [9] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: 2nd item.
  • [10] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060. Cited by: Contrastive Learning..
  • [11] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pp. 766–774. Cited by: Contrastive Learning..
  • [12] K. C. Gowda and G. Krishna (1978) Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern recognition 10 (2), pp. 105–112. Cited by: Comparisons to State-of-the-Art Methods..
  • [13] X. Guo, L. Gao, X. Liu, and J. Yin (2017) Improved deep embedded clustering with local structure preservation.. In IJCAI, pp. 1753–1759. Cited by: Deep Clustering..
  • [14] P. Haeusser, J. Plapp, V. Golkov, E. Aljalbout, and D. Cremers (2018) Associative deep clustering: training a classification network with no labels. In German Conference on Pattern Recognition, pp. 18–32. Cited by: Deep Clustering..
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: Introduction.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Implementation Details.
  • [17] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: Introduction.
  • [18] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: Mutual Information..
  • [19] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017) Learning discrete representations via information maximizing self-augmented training. arXiv preprint arXiv:1702.08720. Cited by: Mutual Information..
  • [20] J. Huang, S. Gong, and X. Zhu (2020) Deep semantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8849–8858. Cited by: Introduction.
  • [21] J. Huang, S. Gong, and X. Zhu (2020) Deep semantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8849–8858. Cited by: Deep Clustering., Datasets & Metrics, Implementation Details, Comparisons to State-of-the-Art Methods., Variance analysis..
  • [22] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid (2017) Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pp. 24–33. Cited by: Deep Clustering..
  • [23] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874. Cited by: Deep Clustering., Mutual Information., Implementation Details, Comparisons to State-of-the-Art Methods..
  • [24] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874. Cited by: Introduction.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Deep Clustering., Implementation Details.
  • [26] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: Comparisons to State-of-the-Art Methods..
  • [27] K. Krishna and M. N. Murty (1999) Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29 (3), pp. 433–439. Cited by: Deep Clustering..
  • [28] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: 1st item.
  • [29] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: 3rd item.
  • [30] Y. Le and X. Yang (2015) Tiny imagenet visual recognition challenge. CS 231N 7. Cited by: 4th item.
  • [31] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: Introduction.
  • [32] D. D. Lee and H. S. Seung (2001) Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562. Cited by: Introduction.
  • [33] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Visualization of cluster assignment..
  • [34] L. Meier, S. Van De Geer, and P. Bühlmann (2008) The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (1), pp. 53–71. Cited by: Cluster regularization Loss..
  • [35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: Implementation Details.
  • [36] X. Peng, J. Feng, J. Lu, W. Yau, and Z. Yi (2017) Cascade subspace clustering. In Thirty-First AAAI conference on artificial intelligence, Cited by: Deep Clustering..
  • [37] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: Comparisons to State-of-the-Art Methods..
  • [38] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: Introduction, Contrastive Learning..
  • [39] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. Manzagol, and L. Bottou (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (12). Cited by: Comparisons to State-of-the-Art Methods..
  • [40] U. Von Luxburg (2007) A tutorial on spectral clustering. Statistics and computing 17 (4), pp. 395–416. Cited by: Introduction.
  • [41] J. Wu, K. Long, F. Wang, C. Qian, C. Li, Z. Lin, and H. Zha (2019) Deep comprehensive correlation mining for image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8150–8159. Cited by: Deep Clustering., Deep Clustering., Comparisons to State-of-the-Art Methods..
  • [42] J. Wu, K. Long, F. Wang, C. Qian, C. Li, Z. Lin, and H. Zha (2019) Deep comprehensive correlation mining for image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8150–8159. Cited by: Introduction.
  • [43] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: Contrastive Learning..
  • [44] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: Introduction, Deep Clustering., Deep Clustering., Comparisons to State-of-the-Art Methods..
  • [45] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In international conference on machine learning, pp. 3861–3870. Cited by: Deep Clustering..
  • [46] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156. Cited by: Deep Clustering., Comparisons to State-of-the-Art Methods..
  • [47] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 6210–6219. Cited by: Contrastive Learning..
  • [48] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus (2010) Deconvolutional networks. In 2010 IEEE Computer Society Conference on computer vision and pattern recognition, pp. 2528–2535. Cited by: Comparisons to State-of-the-Art Methods..
  • [49] L. Zelnik-Manor and P. Perona (2005) Self-tuning spectral clustering. In Advances in neural information processing systems, pp. 1601–1608. Cited by: Comparisons to State-of-the-Art Methods..
  • [50] C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6002–6012. Cited by: Contrastive Learning..