Deep Discriminative Clustering Analysis

05/05/2019 ∙ by Jianlong Chang, et al. ∙ Intel 14

Traditional clustering methods often perform clustering with low-level indiscriminative representations and ignore relationships between patterns, resulting in limited achievements in the era of deep learning. To handle this problem, we develop Deep Discriminative Clustering (DDC), which models the clustering task by investigating relationships between patterns with a deep neural network. Technically, a global constraint is introduced to adaptively estimate the relationships, and a local constraint is developed to endow the network with the capability of learning high-level discriminative representations. By iteratively training the network and estimating the relationships in a mini-batch manner, DDC theoretically converges, and the trained network is able to generate a group of discriminative representations that can be treated as clustering centers for direct clustering. Extensive experiments strongly demonstrate that DDC outperforms current methods on eight image, text and audio datasets.




1 Introduction

Clustering is one of the most extensively used statistical tools in computer vision and pattern recognition [1, 3, 10, 12, 33, 43]. In a wide range of fields, many applications where the target information is unobservable can be intrinsically instantiated as clustering tasks, including text mining [42] and multimedia content-based retrieval [2, 20, 83, 86].

In the literature, much research has been dedicated to clustering analysis [7, 13, 15, 30, 42]. In terms of algorithmic structure and operation, clustering methods can be roughly divided into two categories, i.e., partitional and agglomerative methods [31]. By definition, partitional methods split all patterns into several partitions until stopping criteria are met. Typical methods of this kind include Kmeans [70] and spectral clustering [82]. In contrast, agglomerative methods [13, 21] start with each pattern as a single cluster and gradually merge clusters until stopping criteria are satisfied. Despite these conceptual breakthroughs, such clustering methods suffer from a severe obstacle, i.e., how to acquire effective representations purely from unlabeled data.

In order to acquire effective representations, unsupervised techniques have been developed to represent data. Handcrafted representations derived from professional knowledge, such as SIFT [46] and HoG [17], were utilized originally. However, they are limited to simple scenarios and may degrade severely in more complex ones, including large-scale image, text, and audio data. By employing deep unsupervised learning techniques [6, 69], better representations can be acquired and considerable gains are achieved consequently. Unfortunately, these learned representations are fixed during clustering and cannot be further modified to yield better performance. As previous studies in deep supervised learning suggest, learnable representations are usually more effective than fixed ones. To benefit from learnable representations, one straightforward way is to integrate deep representation learning and clustering in a joint framework [11, 60, 73, 76], but there are still several challenges. First, how can an effective and general objective be defined to train deep networks in an unsupervised way? Second, how can it be guaranteed that the representations learned by deep networks favour clustering? Third, how can the number of clusters be discovered automatically rather than predefined?

To address such challenges, we propose Deep Discriminative Clustering (DDC), which manages the clustering task by iteratively exploring relationships between patterns and learning representations in a mini-batch manner. In each iteration, a global constraint is used to guide the estimation of the relationships. Then, under a local constraint, the relationships are fed back to train the network for learning high-level discriminative representations. Consequently, DDC is theoretically convergent and the trained network is capable of generating a group of discriminative representations that can be treated as clustering centers for direct clustering. Benefiting from this modeling, DDC is independent of the number of patterns and the number of clusters, which endows the model with the capability of dealing with tasks that have a large number of patterns or an unobservable number of clusters.

To sum up, the key contributions of this work are:

  • Under the global and local constraints, DDC can endow networks with the capability of mapping all patterns to discriminative clustering centers for clustering.

  • By virtue of the mini-batch based optimization, DDC alleviates the impact of the numbers of patterns and clusters, further enhancing its practicality.

  • The validity and the convergence of the developed DDC model are mathematically analyzed, which provides the requisite theoretical foundations and guarantees.

  • Extensive experiments strongly verify that our DDC model is consistently superior to current methods on various datasets, including image, text, and audio data.

2 Related Work

In this section, we make a brief review of the related work on the clustering and deep learning methods.

2.1 Clustering

By definition, clustering tries to generate a semantic organization of data based on similarities, namely, patterns within the same cluster are similar to each other, while those in different clusters are dissimilar. Technically, clustering has been widely studied in an unsupervised setting from two main aspects: how to define a cluster and how to extract effective representations for clustering.

To define clusters, extensive efforts have been made. In particular, the Gaussian mixture distribution is one of the simple yet popular techniques to describe clusters. Relying on this technique, a series of clustering methods have been developed, such as Kmeans [70] and its variants [61, 70, 78]. In contrast to methods with predefined cluster forms, many methods adaptively estimate the forms of clusters purely from datasets. Typically, there are two frequently used ways, i.e., density-based and connectivity-based estimations. For the former, density-based methods define clusters via a proper density function, which describes the density around each pattern in the feature space. Depending on the density function, various density-based methods [50, 59, 72] have been proposed. For the latter, patterns are considered highly connected if they belong to the same cluster, as in spectral clustering [4, 71]. Furthermore, these ideas also form the basis of a number of methods [14, 77], such as discriminative clustering [25, 47, 53, 54, 67, 68, 75, 78, 88] and subspace-based clustering [40, 56, 65, 79, 80, 87].

To extract effective representations for clustering, much attention has been paid. Initially, many handcrafted representations were employed to encode patterns in an unsupervised way, including SIFT [46] and HoG [17] on images. To eliminate influences from trivial variables (e.g., variations of scenes and objects), high-level representations have recently been learned in a data-driven manner with deep unsupervised learning [6, 19, 62, 69, 81, 85].

In spite of the remarkable successes [23, 34, 36, 51, 83], these two aspects are always treated separately. As a result, the extracted representations are fixed during clustering and cannot be further improved to obtain better performance.

2.2 Deep Neural Network

Deep Neural Network (DNN) is a significant technique in machine learning, which has been developed to manage tasks like humans do [37]. With respect to the observability of label information, deep learning can be summarized into two categories, i.e., deep supervised learning and deep unsupervised learning.

In deep supervised learning, much interesting research has been proposed over the last decade. Thanks to efficient hardware and a large amount of labeled data, remarkable successes have been achieved in various pattern analysis and machine learning problems, especially in computer vision, such as image classification and object detection [9, 24, 27, 28, 45, 64]. Although the achievements are significant, deep supervised learning applies only to tasks that have a large amount of labeled data.

To eliminate the requirement for labeled data, deep unsupervised learning, which trains DNNs in an unsupervised manner and extracts representations of data simultaneously, has drawn attention [8, 52, 58, 84]. In general, the vital problem in deep unsupervised learning is how to provide pseudo labels to train DNNs. Initially, most unsupervised learning methods attempted to reconstruct inputs with DNNs, namely, the inputs themselves serve as labels. Typical approaches include the autoencoder [6] and its variants [49, 69]. Similar to traditional machine learning methods, more robust and interpretable representations can be learned by introducing additional regularization terms. Recently, deep generative networks [26], an important branch of deep unsupervised learning, have been developed. They train networks by representing data distributions with DNNs. Akin to traditional methods, representations can be learned by modifying generative networks, such as generative adversarial clustering [32, 44, 55, 57, 62] and variational autoencoder based clustering [19, 49].

While many endeavors have been devoted to deep unsupervised learning, several intrinsic challenges still remain. First, how can a proper driving force be provided to train DNNs in an unsupervised manner? Second, clustering results cannot be obtained directly from the feature representations learned by these methods, which implies that the generated representations are not naturally suited to clustering.

Figure 1: The flowchart of DDC. In each iteration, DDC consists of three essential steps. First, a DNN is built to generate indicator features on a mini batch of patterns. Second, the relationships between patterns are estimated based on the generated indicator features. Third, the DNN is trained by optimizing the relationships, i.e., increasing the similarities of similar patterns (shown by green arrows) and reducing the similarities of dissimilar patterns (shown by red arrows). Steps 1 to 3 are iterated until the DDC model cannot be further improved. Finally, the trained DNN is able to generate discriminative indicator features that can be treated as clustering centers for direct clustering.

3 Deep Discriminative Clustering Model

Clustering, intrinsically, is a function that is in a position to assign patterns into a group of clusters whose number is either observable or unobservable. Formally, considering the problem of clustering patterns into clusters, i.e., , which are represented by discriminative clustering centers . By this definition, clustering is naturally formulated as


where is a clustering function with the learnable parameter . In clustering, generally, a fundamental problem is how to model an objective function on a set of clustering centers to estimate the parameter . In the following, we focus on dealing with this problem. The flowchart of DDC is visually illustrated in Figure 1.

3.1 Relational Objective Function

The main obstacle of modeling an objective function in clustering is that the correspondence between patterns and clustering centers is non-unique. To eliminate this obstacle, relationships between patterns are employed as the label information, since whether pairwise patterns belong to the same cluster is unique. From this perspective, we reform the dataset $\mathcal{X} = \{x_i\}_{i=1}^{n}$ as a relational dataset $\mathcal{D} = \{(x_i, x_j, r_{ij})\}_{i,j=1}^{n}$, where $r_{ij} = 1$ represents that $x_i$ and $x_j$ belong to the same cluster, and $r_{ij} = 0$ otherwise. Accordingly, the objective function in clustering can be reformulated as

$$\min_{w} E(w) = \sum_{i,j} L\big(r_{ij},\, g(x_i, x_j; w)\big), \tag{2}$$

where $g(x_i, x_j; w)$ is a learnable function to estimate similarities between patterns, and $L$ is the binary cross-entropy to measure the error between $r_{ij}$ and $g(x_i, x_j; w)$. In Eq. (2), two issues need to be handled. First, the ground-truth relationship $r_{ij}$ between $x_i$ and $x_j$ is missing in clustering. Second, clustering labels of $x_i$ and $x_j$ cannot be explicitly acquired even though the ground-truth relationship is observable. In the following, we manage these issues by introducing a global constraint and a local constraint.

3.1.1 Global Constraint

To tackle these issues, we first explore inherent prior knowledge in clustering, termed the global constraint, which can be employed to instantiate and refine the basic model in Eq. (2). Specifically, the global constraint naturally models the relationships between patterns with three essential rules, i.e., reflexivity, symmetry and transitivity.

Reflexivity The reflexivity signifies that every pattern and itself strictly belong to the same cluster, i.e., $r_{ii} = 1$ is met for arbitrary $x_i$. Under this rule, we have

$$g(x_i, x_i; w) = 1. \tag{3}$$

Intrinsically, the reflexivity can be considered as a “reconstruction” constraint for every pattern. Contrary to the reconstruction in AE [6] that attempts to learn features to represent inputs, the reflexivity focuses on modeling the similar relationship in a high-level latent space.

Symmetry The symmetry indicates that $x_i$ is (dis)similar to $x_j$ if and only if $x_j$ is (dis)similar to $x_i$, i.e., $r_{ij} = r_{ji}$ is met for arbitrary $x_i$ and $x_j$. In order to meet this rule, the function $g$ should be a symmetric function for the inputs $x_i$ and $x_j$. For this purpose, the function $g$ is decomposed into the dot product between the same function $f$, i.e.,

$$g(x_i, x_j; w) = f(x_i; w) \cdot f(x_j; w), \tag{4}$$

where “$\cdot$” represents the dot product. Generally, a deep network is employed to parametrize $f$ because of its recognized capability in feature learning.
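As an illustration of the dot-product decomposition in Eq. (4) combined with the binary cross-entropy of Eq. (2), the following minimal NumPy sketch computes pairwise similarities for a toy batch and scores them against a given relation matrix. The function names, the clipping constant `eps`, and the toy feature values are assumptions made for this example only.

```python
import numpy as np

def pairwise_similarity(features):
    """Dot-product similarity f(x_i) . f(x_j) for every pair in a batch."""
    return features @ features.T

def relational_bce(relations, similarities, eps=1e-7):
    """Mean binary cross-entropy between 0/1 relations and predicted similarities."""
    # Clipping avoids log(0); eps is an illustrative choice.
    s = np.clip(similarities, eps, 1.0 - eps)
    return float(np.mean(-relations * np.log(s) - (1.0 - relations) * np.log(1.0 - s)))

# Toy batch: three unit-norm, non-negative feature vectors (hypothetical values).
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sims = pairwise_similarity(feats)

# Relations stating that patterns 0 and 1 share a cluster and pattern 2 is alone.
rels = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
loss = relational_bce(rels, sims)
```

Because the similarity is a Gram matrix, it is symmetric by construction, which is the point of the decomposition.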

Transitivity The transitivity means that $x_i$ is similar to $x_k$ if $x_i$ is similar to $x_j$ and $x_j$ is similar to $x_k$, i.e., $r_{ik} = 1$ if $r_{ij} = 1$ and $r_{jk} = 1$ for arbitrary $x_i$, $x_j$ and $x_k$. Naturally, the transitivity signifies that the ground-truth similarity matrices in clustering possess a specific form, although the real similarities are unobservable.
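The three rules together say that a valid relation matrix is an equivalence relation. A small sketch of how this could be checked on a binary matrix is given below; the helper name `satisfies_global_constraint` and the example matrices are hypothetical and are not part of the DDC model itself.

```python
import numpy as np

def satisfies_global_constraint(r):
    """Check reflexivity, symmetry and transitivity of a 0/1 relation matrix."""
    r = np.asarray(r, dtype=bool)
    reflexive = bool(np.all(np.diag(r)))
    symmetric = bool(np.array_equal(r, r.T))
    # Transitivity: every two-step link i -> j -> k must also be a direct link i -> k.
    two_step = (r.astype(int) @ r.astype(int)) > 0
    transitive = bool(np.all(r[two_step]))
    return reflexive and symmetric and transitive

# Patterns 0 and 1 form one cluster, pattern 2 forms another.
valid = satisfies_global_constraint([[1, 1, 0],
                                     [1, 1, 0],
                                     [0, 0, 1]])

# 0 ~ 1 and 1 ~ 2 but 0 is not related to 2, so transitivity fails.
invalid = satisfies_global_constraint([[1, 1, 0],
                                       [1, 1, 1],
                                       [0, 1, 1]])
```

A matrix passing this check is block-diagonal up to a reordering of the patterns, which is the "specific form" of ground-truth similarity matrices mentioned above.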

Inspired by the above three rules, we attempt to estimate the ground-truth relationships from coarse relationships. In detail, given a deep network , the coarse similarity matrix is first obtained according to Eq. (4). Then under the global constraint, the ground-truth similarity matrix is estimated by finding the most similar matrix to the coarse similarity matrix, i.e.,


where is the coarse similarity matrix calculated with the network , “” implies that meets the above global constraint, and is a global constraint error to measure the error between and . By introducing the global constraint into Eq. (2), the basic DDC model can be rewritten as follows:


where the first term is employed to learn the parameter for estimating faithful similarities, and the second term acts as a penalty to explore the ground-truth relationships from coarse relationships.

3.1.2 Local Constraint

To learn beneficial representations for clustering, we introduce indicator features , which are capable of assigning each pattern to a cluster automatically. To this end, we impose a non-negative local constraint on every element in indicator features, i.e.,


where signifies the -th element in the indicator feature , and is satisfied. Compared with the global constraint that focuses on the relationships between patterns, the local constraint is only concerned with local information, i.e., each element in indicator features. By integrating Eq. (7) into Eq. (6), the objective function of DDC is formulated as follows:


In the following, we give Theorem 1 and Theorem 2 to theoretically analyze the availability of our DDC model.

Theorem 1

If the optimal minimum of Eq. (8) is attained, indicator features are either equivalent or orthogonal, i.e.,

or .

Further, the locations of the largest responses in indicator features are different if , and otherwise.

Theorem 1 indicates that the deep network in DDC intrinsically learns a group of discriminative clustering centers. Since different locations of the largest response are uniquely corresponding to different clustering centers, all patterns can be grouped by locating the largest response in indicator features, i.e.,


where and are the cluster label and the learned indicator feature of , respectively. We also provide a sufficient condition for achieving the global minimum as follows.
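The grouping rule described above, locating the largest response in each indicator feature, can be sketched as follows. The function name `assign_clusters` and the feature values are hypothetical; the features are deliberately given more dimensions than occupied clusters to show that unused dimensions simply receive no patterns.

```python
import numpy as np

def assign_clusters(indicator_features):
    """Label each pattern by the index of its largest indicator response."""
    return np.argmax(indicator_features, axis=1)

# Four patterns with 4-dimensional, non-negative indicator features (toy values).
feats = np.array([
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.00, 0.20, 0.00],
    [0.00, 0.00, 1.00, 0.00],
    [0.10, 0.00, 0.95, 0.00],
])

labels = assign_clusters(feats)
# Number of clusters actually occupied; dimensions 1 and 3 stay empty here.
num_clusters_used = len(np.unique(labels))
```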

Theorem 2

For an arbitrary similarity matrix with clusters, the global minimum of Eq. (8) can always be attained via a powerful enough function , provided that the dimension of indicator features is not smaller than .

Theorem 2 implies that DDC can handle a set of unlabeled data with clusters provided that the dimension of indicator features is not smaller than . This expands the practicability of DDC, since it is unnecessary to predefine the number of clusters before clustering.

3.2 In Context of Related Work

Our DDC model is established by introducing the local and global constraints into a supervised objective. Generally, the local constraint enforces the non-negativity of each element in indicator features, and thus focuses merely on local information. In contrast, the global constraint is capable of considering relationships among multiple patterns. In addition, the developed DDC model can be treated as a generalization of the DAC model [11]. In the following, we discuss the crucial differences in detail.

Technically, DAC is able to learn one-hot representations for clustering. Compared with our DDC, three practical drawbacks make DAC less general and adaptive. First, DAC focuses solely on relationships between pairwise patterns, which is insufficient for clustering. For example, $x_i$ may be dissimilar to $x_k$ in DAC even if $x_i$ is similar to $x_j$ and $x_j$ is similar to $x_k$. Because of the transitivity, DDC is able to avoid such an inconsistent scenario. Second, DAC requires the number of clusters to be predefined, which may hardly be observable before clustering. By comparison, DDC can automatically estimate the number of clusters in a purely data-driven way, as indicated by Theorem 2. Third, the performance of DAC is very sensitive to the hyper-parameters for estimating relationships. Relying on the transitivity in the global constraint, in contrast, DDC is capable of estimating relationships without additional hyper-parameters, further enhancing its dependability.

Figure 2: The motivation of the DDC algorithm. The tendencies signify that DNNs are relatively robust to noisy labels.

4 Deep Discriminative Clustering Algorithm

For optimizing the DDC model, we develop a straightforward yet effective DDC algorithm. Naturally, the algorithm focuses on two problems in Eq. (8), namely the constraint on indicator features and the optimization.

4.1 Implementation of Indicator Features

A constraint layer is built to learn indicator features that meet the non-negativity and the reflexivity in the local and global constraints, respectively. Formally, the layer consists of two mapping functions, i.e.,


where , and indicate the input, temporary variable and output of the layer, respectively. Through the transformation of the above functions, outputs of arbitrary DNNs can be mapped as the indicator features consequently.
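The two mapping functions are not reproduced in this text, so the sketch below is only one plausible instantiation and should not be read as the layer used by the authors. It assumes an exponential map (to obtain non-negative elements) followed by L2 normalization (so that the dot product of a feature with itself equals 1, matching the reflexivity rule). The function name `constraint_layer` and the input values are hypothetical.

```python
import numpy as np

def constraint_layer(layer_input):
    """Map arbitrary network outputs to non-negative, unit-norm indicator features."""
    # Step 1 (assumed): subtract the row maximum for numerical stability, then
    # exponentiate, which makes every element strictly positive.
    temp = np.exp(layer_input - layer_input.max(axis=1, keepdims=True))
    # Step 2 (assumed): scale each row to unit L2 norm so that f(x) . f(x) = 1.
    return temp / np.linalg.norm(temp, axis=1, keepdims=True)

# Raw outputs of some network for two patterns (toy values).
logits = np.array([[2.0, -1.0, 0.5],
                   [0.0, 0.0, 0.0]])
indicators = constraint_layer(logits)
```

With both properties in place, every pairwise dot product lies in [0, 1], so it can be compared directly with a binary relation through the cross-entropy.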

4.2 Alternating Optimization on Mini Batches

We solve the DDC model by iteratively optimizing the two terms in Eq. (8) on mini batches of patterns, which is inspired by an observation about training DNNs. As illustrated in Figure 2, DNNs are considerably robust to noisy labels when the rate of noisy labels is small [74]. Intrinsically, the core problem in training DNNs is to find proper gradients on mini batches. If only a few noisy labels are introduced, the real gradients may not be severely degraded. Benefiting from this observation, the DDC algorithm iteratively updates the to alleviate the impact of noise in the estimated similarity matrices. In summary, the algorithm is illustrated in Algorithm 1 and elaborated as follows.

Input:  Dataset .
Output:  Clustering label of .
1:  Randomly initialize ;
2:  repeat
3:     for all  do
4:        Sample batch from ; patterns in
5:        Generate on the batch ; Eq. (11)
6:        Update by minimizing Eq. (12);
7:     end for
8:  until 
9:  for all  do
10:     ;
11:     ;
12:  end for
Algorithm 1 Deep Discriminative Clustering

When in the function is fixed, the DDC model in Eq. (8) degenerates as follows:


which can be recast as the normalized cut problem on . Therefore, we acquire relying on the spectral clustering [82] method. Because of the mini-batch based optimization, the efficiency of our DDC model can be guaranteed for large datasets.
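To make the relation-estimation step concrete, the sketch below partitions a coarse similarity matrix of one mini batch and converts the resulting labels into a binary relation matrix. It is a simplified stand-in rather than the procedure of [82]: it handles only two clusters through the standard normalized-cut relaxation (second eigenvector of the normalized affinity), whereas a general number of clusters would need several eigenvectors followed by a grouping step. The function names and the toy matrix are assumptions.

```python
import numpy as np

def spectral_bipartition(similarity):
    """Two-way normalized-cut relaxation on a symmetric, non-negative similarity matrix."""
    w = np.asarray(similarity, dtype=float)
    d = w.sum(axis=1)                      # node degrees, assumed strictly positive
    d_inv_sqrt = 1.0 / np.sqrt(d)
    normalized = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-largest eigenvalue carries the relaxed two-way cut.
    _, vecs = np.linalg.eigh(normalized)
    indicator = d_inv_sqrt * vecs[:, -2]
    return (indicator > 0).astype(int)

def relations_from_labels(labels):
    """Binary relation matrix: 1 if two patterns share a label, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# Coarse similarities for a batch of four patterns (toy values).
coarse = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])
labels = spectral_bipartition(coarse)
relations = relations_from_labels(labels)
```

Relations built from hard labels are reflexive, symmetric and transitive by construction, which is why a partition of the batch is a convenient way to obtain a matrix that meets the global constraint.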

Once the similarity matrix is calculated, the DDC model in Eq. (8) is equivalent to


which is a typical supervised task, and can be solved with the back-propagation algorithm.

Overall, the convergence of the DDC algorithm is guaranteed by the following Theorem 3, i.e.,

Theorem 3

If the sampled mini batches in DDC are identically distributed, the DDC algorithm is convergent.

Since the assumption in Theorem 3 is always satisfied in general scenarios, the DDC model is convergent.

5 Experiments

In this section, we systematically carry out extensive experiments to verify the capability of our DDC model. Due to the space restriction we report additional experimental details in the supplement, e.g., network architectures.

5.1 Datasets

For a comprehensive comparison, eight popular image, text and audio datasets are utilized in experiments. For the image datasets, MNIST [38], CIFAR-10 [35], STL-10 [16], ImageNet-10 [11], and ImageNet-Dog [11] are used. As for the text datasets, 20NEWS and REUTERS [39] are used for comparison. An audio dataset, AudioSet-20, which is randomly chosen from AudioSet [22], is employed. On these datasets, the training and testing patterns of each dataset are jointly utilized in our experiments, as described in [11, 73, 76]. Specifically, term frequency-inverse document frequency features and Mel-frequency cepstral coefficients are employed to encode texts and audio, respectively. In summary, the number of clusters and the dimensions of patterns on each dataset are listed in Table 1.

5.2 Evaluation Metrics

There are three frequently-used metrics for evaluating the clustering results: Normalized Mutual Information (NMI) [63], Adjusted Rand Index (ARI) [29], and clustering Accuracy (ACC) [41]. Intrinsically, these metrics range in , and higher scores always support that more accurate clustering results are achieved.
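Since cluster indices are arbitrary, clustering accuracy is computed under the best one-to-one mapping between predicted clusters and ground-truth classes. The sketch below finds that mapping by brute force over permutations, which is only practical for a handful of clusters; larger problems are usually solved with the Hungarian algorithm instead. The function name `clustering_accuracy` and the label lists are illustrative.

```python
from itertools import permutations

def clustering_accuracy(true_labels, predicted_labels):
    """Best-match accuracy over all one-to-one mappings of predicted to true labels."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(mapping[p] == t for t, p in zip(true_labels, predicted_labels))
        best = max(best, correct)
    return best / len(true_labels)

# Same partition with permuted cluster indices.
acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# One pattern placed in the wrong cluster.
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 1, 1, 1])
```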

5.3 Compared Methods

Several existing clustering methods are compared with our approach. Specifically, the traditional methods, including Kmeans [70], SC [82] and AC [21], are adopted for comparison. For the representation-based clustering approaches, as described in [73], we employ some unsupervised learning methods, including AE [6], SAE [6], DAE [69], DeCNN [81], and SWWAE [85], to learn feature representations and use Kmeans [70] to cluster data as a post-processing step. For a comprehensive comparison, recent single-stage methods, including CatGAN [62], GMVAE [19], DAC [11], DAC* [11], JULE-SF [76], JULE-RC [76], and DEC [73], are employed.

Dataset Numbers Clusters Dimensions
MNIST [38]
CIFAR-10 [35]
STL-10 [16]
ImageNet-10 [11]
ImageNet-Dog [11]
20NEWS [39]
REUTERS [39] 2000
AudioSet-20 [22]
Table 1: The experimental datasets.

5.4 Experimental Settings

| Method | MNIST [38] NMI | MNIST [38] ARI | MNIST [38] ACC | CIFAR-10 [35] NMI | CIFAR-10 [35] ARI | CIFAR-10 [35] ACC | STL-10 [16] NMI | STL-10 [16] ARI | STL-10 [16] ACC | ImageNet-10 [11] NMI | ImageNet-10 [11] ARI | ImageNet-10 [11] ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kmeans [70] | 0.4997 | 0.3652 | 0.5723 | 0.0871 | 0.0487 | 0.2289 | 0.1245 | 0.0608 | 0.1920 | 0.1186 | 0.0571 | 0.2409 |
| SC [82] | 0.6626 | 0.5214 | 0.6958 | 0.1028 | 0.0853 | 0.2467 | 0.0978 | 0.0479 | 0.1588 | 0.1511 | 0.0757 | 0.2740 |
| AC [21] | 0.6094 | 0.4807 | 0.6953 | 0.1046 | 0.0646 | 0.2275 | 0.2386 | 0.1402 | 0.3322 | 0.1383 | 0.0674 | 0.2420 |
| AE [6] | 0.7257 | 0.6139 | 0.8123 | 0.2393 | 0.1689 | 0.3135 | 0.2496 | 0.1610 | 0.3030 | 0.2099 | 0.1516 | 0.3170 |
| SAE [6] | 0.7565 | 0.6393 | 0.8271 | 0.2468 | 0.1555 | 0.2973 | 0.2520 | 0.1605 | 0.3203 | 0.2122 | 0.1740 | 0.3254 |
| DAE [69] | 0.7563 | 0.6467 | 0.8316 | 0.2506 | 0.1627 | 0.2971 | 0.2242 | 0.1519 | 0.3022 | 0.2064 | 0.1376 | 0.3044 |
| DeCNN [81] | 0.7577 | 0.6691 | 0.8179 | 0.2395 | 0.1736 | 0.2820 | 0.2267 | 0.1621 | 0.2988 | 0.1856 | 0.1421 | 0.3130 |
| SWWAE [85] | 0.7360 | 0.6518 | 0.8251 | 0.2330 | 0.1638 | 0.2840 | 0.1962 | 0.1358 | 0.2704 | 0.1761 | 0.1603 | 0.3238 |
| CatGAN [62] | 0.7637 | 0.7360 | 0.8279 | 0.2646 | 0.1757 | 0.3152 | 0.2100 | 0.1390 | 0.2984 | 0.2250 | 0.1571 | 0.3459 |
| GMVAE [19] | 0.7364 | 0.7129 | 0.8317 | 0.2451 | 0.1674 | 0.2908 | 0.2004 | 0.1464 | 0.2815 | 0.1934 | 0.1683 | 0.3344 |
| DAC [11] | 0.9351 | 0.9486 | 0.9775 | 0.3959 | 0.3059 | 0.5218 | 0.3656 | 0.2565 | 0.4699 | 0.3944 | 0.3019 | 0.5272 |
| DAC* [11] | 0.9246 | 0.9406 | 0.9660 | 0.3793 | 0.2802 | 0.4982 | 0.3474 | 0.2351 | 0.4337 | 0.3693 | 0.2837 | 0.5026 |
| JULE-SF [76] | 0.9063 | 0.9139 | 0.9592 | 0.1919 | 0.1357 | 0.2643 | 0.1754 | 0.1622 | 0.2741 | 0.1597 | 0.1205 | 0.2927 |
| JULE-RC [76] | 0.9130 | 0.9270 | 0.9640 | 0.1923 | 0.1377 | 0.2715 | 0.1815 | 0.1643 | 0.2769 | 0.1752 | 0.1382 | 0.3004 |
| DEC [73] | 0.7716 | 0.7414 | 0.8430 | 0.2568 | 0.1607 | 0.3010 | 0.2760 | 0.1861 | 0.3590 | 0.2819 | 0.2031 | 0.3809 |
| DDC | 0.9514 | 0.9667 | 0.9800 | 0.4242 | 0.3285 | 0.5238 | 0.3712 | 0.2674 | 0.4891 | 0.4327 | 0.3451 | 0.5766 |

| Method | ImageNet-Dog [11] NMI | ImageNet-Dog [11] ARI | ImageNet-Dog [11] ACC | 20NEWS [39] NMI | 20NEWS [39] ARI | 20NEWS [39] ACC | REUTERS [39] NMI | REUTERS [39] ARI | REUTERS [39] ACC | AudioSet-20 [22] NMI | AudioSet-20 [22] ARI | AudioSet-20 [22] ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kmeans [70] | 0.0548 | 0.0204 | 0.1054 | 0.2154 | 0.0826 | 0.2328 | 0.3275 | 0.2593 | 0.5329 | 0.1901 | 0.0780 | 0.2074 |
| SC [82] | 0.0383 | 0.0133 | 0.1111 | 0.2147 | 0.0934 | 0.2486 | N/A | N/A | N/A | 0.1774 | 0.0741 | 0.2195 |
| AC [21] | 0.0368 | 0.0207 | 0.1385 | 0.2024 | 0.0963 | 0.2397 | N/A | N/A | N/A | 0.1854 | 0.0643 | 0.2286 |
| AE [6] | 0.1039 | 0.0728 | 0.1851 | 0.4439 | 0.3237 | 0.4907 | 0.3564 | 0.3164 | 0.7197 | 0.1933 | 0.1000 | 0.2300 |
| SAE [6] | 0.1129 | 0.0729 | 0.1830 | 0.4575 | 0.3265 | 0.4869 | 0.3556 | 0.3186 | 0.7256 | 0.1925 | 0.1061 | 0.2382 |
| DAE [69] | 0.1043 | 0.0779 | 0.1903 | 0.4558 | 0.3274 | 0.5027 | 0.3675 | 0.3386 | 0.7246 | 0.1964 | 0.1048 | 0.2366 |
| DeCNN [81] | 0.0983 | 0.0732 | 0.1747 | 0.4450 | 0.3526 | 0.5199 | 0.3750 | 0.3497 | 0.7221 | 0.2028 | 0.1313 | 0.2465 |
| SWWAE [85] | 0.0936 | 0.0760 | 0.1585 | 0.4646 | 0.3247 | 0.5103 | 0.3805 | 0.3538 | 0.7284 | 0.1991 | 0.1323 | 0.2261 |
| CatGAN [62] | 0.1213 | 0.0776 | 0.1738 | 0.4064 | 0.3024 | 0.4754 | 0.3553 | 0.3246 | 0.6245 | 0.2102 | 0.0946 | 0.2174 |
| GMVAE [19] | 0.1074 | 0.0786 | 0.1788 | 0.4266 | 0.3356 | 0.4792 | 0.3823 | 0.3475 | 0.6365 | 0.2224 | 0.1066 | 0.2441 |
| DAC [11] | 0.2185 | 0.1105 | 0.2748 | 0.5893 | 0.5005 | 0.5841 | 0.7116 | 0.5892 | 0.7875 | 0.2730 | 0.1372 | 0.2832 |
| DAC* [11] | 0.1815 | 0.0953 | 0.2455 | 0.5655 | 0.4774 | 0.5495 | 0.6835 | 0.5537 | 0.7535 | 0.2556 | 0.1223 | 0.2645 |
| JULE-SF [76] | 0.1213 | 0.0776 | 0.1738 | 0.4976 | 0.4506 | 0.5689 | 0.3986 | 0.3594 | 0.6111 | 0.1935 | 0.1128 | 0.2163 |
| JULE-RC [76] | 0.0492 | 0.0261 | 0.1115 | 0.5126 | 0.4679 | 0.5834 | 0.4012 | 0.3825 | 0.6336 | 0.2106 | 0.1108 | 0.2336 |
| DEC [73] | 0.1216 | 0.0788 | 0.1949 | 0.5286 | 0.4683 | 0.5832 | 0.6832 | 0.5635 | 0.7563 | 0.2280 | 0.1102 | 0.2461 |
| DDC | 0.2395 | 0.1283 | 0.3063 | 0.6083 | 0.5247 | 0.6470 | 0.7328 | 0.6044 | 0.8277 | 0.2967 | 0.1744 | 0.3060 |
Table 2: The clustering results of the methods on the experimental datasets. The best results are highlighted in bold.

In our experiments, DNNs with the constraint layer are devised to learn indicator features (the details of the devised DNNs can be found in the supplementary material). Specifically, we update parameters in models on mini batches with patterns, relying on the traditional spectral clustering [82]. The normalized Gaussian initialization strategy [28] is utilized to initialize the devised deep networks. The RMSProp optimizer [66] where the initial learning rate is set to is utilized to train the networks in DDC. For a reasonable evaluation, we perform random restarts for all experiments and the averages are employed to compare with the other methods.

5.5 Clustering Results

The clustering results of the compared methods are listed in Table 2. In the table, DDC achieves the best performance on all the datasets (including images, texts and audios), with significant gains. It demonstrates that DDC is effective for managing general clustering tasks in practice. Furthermore, several tendencies can be observed from Table 2.

First, representation learning is crucial in clustering. From the table, the traditional methods (e.g., Kmeans [70]) always perform worse than the representation-based clustering methods (e.g., AE [6]). Since the only difference between these methods is the features utilized, the performance gains demonstrate that feature learning plays an important role in clustering.

Second, combining representation learning with traditional methods is beneficial to clustering. From the table, the approaches (e.g., DEC [73], JULE [76]) that simultaneously perform feature learning and clustering outperform the representation-based approaches. This verifies our motivation in DDC that representation learning and clustering reinforce each other.

Finally, DDC is able to handle large-scale datasets (e.g., ImageNet-10, REUTERS and AudioSet-20), and is not limited to simple datasets (e.g., MNIST). Specifically, “large” covers two aspects, i.e., the dimension of patterns and the number of patterns. For the dimension, DDC is able to cluster high-dimensional patterns with DNNs, which can partially mitigate the curse of dimensionality. As for the number, DDC optimizes DNNs on mini batches, which is independent of the number of patterns and consequently enhances the practicability.

5.6 Ablation Study

In this subsection, we carry out extensive ablation studies to comprehensively analyze the developed DDC. All the results are illustrated in Figure 3.

5.6.1 Unobservable Number of Clusters

An experiment on MNIST is performed to demonstrate the validity of DDC when the number of clusters is unobservable. In the basic network on MNIST, we select from to investigate the variations of global constraint in DDC. From Figure 3 (a), there are obvious differences for different . Specifically, smaller global constraint errors are achieved, as increases from to . For the cases when , larger global constraint errors are obtained as gradually increases. When , the smallest global constraint error on MNIST is achieved. That is, the global constraint error can be employed to search for an appropriate number of clusters. This verifies that DDC can handle clustering tasks when the number of clusters is unobservable. In Figure 3, the learned indicator features are visualized in order with t-SNE [48] when . It indicates that the real number of clusters also corresponds to the fastest convergence rate.

Figure 3: Ablation studies in DDC. (a) Unobservable number of clusters. Specifically, indicates the dimensionality of indicator features in DDC, and the clustering results () at different stages are orderly illustrated on the right side. (b) Performance on larger number of clusters. (c) Impact of number of patterns. (d) Validity of learned indicator features. (e) Sensitivity to initializations, where solid circles illustrate the different initial states. (f) Serviceability of learned DNNs, where means the number of labeled patterns. For each , the solid and dashed lines correspond to the pretrained and un-pretrained DNNs, respectively.

5.6.2 Performance on Larger Number of Clusters

We establish five different datasets to evaluate the performance of DDC when the number of clusters is very large. By varying the number of clusters between and with an interval , subsets of ILSVRC2012-1K [18] are randomly sampled. Specifically, we observe the following two tendencies from Figure 3 (b). First, the performance of all the methods gradually degrades as the number of clusters increases. Second, the evident superiority of DDC holds on every dataset, which signifies that DDC is able to handle clustering tasks with various numbers of clusters.

5.6.3 Impact of Number of Patterns

Akin to deep supervised learning tasks, we study the impact of the number of patterns on our DDC model. To this end, we vary the number of patterns in the CIFAR-10 dataset between and with an interval . From the clustering results illustrated in Figure 3 (c), one can observe that the performance of DDC is gradually enhanced as the number of patterns increases. This is reasonable since deep networks always possess a large number of learnable parameters. By utilizing more data to train the networks, these parameters can be estimated more accurately. Consequently, stronger deep networks are learned. This observation also indicates that DDC may improve its performance when the amount of data increases. Because unlabeled data are easily accessible in practice, there is still much room for improvement.

5.6.4 Validity of Learned Indicator Features

To assess the validity of the learned indicator features in DDC, we report the clustering results obtained by applying four popular traditional clustering methods, i.e., Kmeans [70], SC [82], AC [21], and Birch [63], to the learned indicator features. As shown in Figure 3 (d), all four traditional methods yield almost the same clustering processes. The slight mismatch may originate from the randomness in training, including random mini-batch selections or initializations. This result demonstrates that DDC is capable of learning high-level discriminative representations for clustering, which in turn reduces the influence of the choice of clustering method.
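A small sketch can illustrate why the choice of clustering method matters little once the features are discriminative. It rests on several assumptions: the features are synthetic and close to one-hot vectors (a stand-in for learned indicator features), a plain Lloyd's k-means written here replaces the Kmeans baseline, and the Rand index is used as a simple agreement measure rather than the metrics plotted in Figure 3 (d). All function names are hypothetical.

```python
import random


def argmax_labels(features):
    """Assign each pattern to the index of its largest indicator entry."""
    return [max(range(len(f)), key=f.__getitem__) for f in features]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(features, k, iters=10):
    """Plain Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [features[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(features, key=lambda f: min(sq_dist(f, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(f, centers[j])) for f in features]
        for j in range(k):
            members = [f for f, l in zip(features, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rand_index(a, b):
    """Fraction of pattern pairs on which two partitions agree (permutation invariant)."""
    n = len(a)
    agree = sum(
        (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return agree / (n * (n - 1) / 2)


# Synthetic near one-hot features: 3 clusters with 20 patterns each (assumption).
rng = random.Random(0)
features = []
for c in range(3):
    for _ in range(20):
        f = [0.05 + rng.uniform(-0.03, 0.03) for _ in range(3)]
        f[c] = 0.9 + rng.uniform(-0.03, 0.03)
        features.append(f)

agreement = rand_index(argmax_labels(features), kmeans_labels(features, k=3))
print(f"Rand index between argmax and k-means partitions: {agreement:.3f}")
```

When the features are this well separated, any reasonable partitioning method recovers the same groups, which is consistent with the near-identical curves reported for the four traditional methods.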

5.6.5 Sensitivity to Initializations

As previous studies have shown [6], the initial state is crucial for deep networks. We therefore study the effect of initializations on DDC. Four frequently used initialization strategies are compared in this experiment, i.e., the normalized Gaussian initialization strategy [28], AE [6], SAE [6] and DAE [69]. Figure 3 (e) illustrates the clustering processes on AudioSet-20 with these initializations and indicates that better initializations benefit our DDC model. This is expected, since better initializations generate more suitable indicator features at the start and consequently converge to better partitions.
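For reference, the first of these strategies can be sketched in a few lines, assuming that the "normalized Gaussian initialization" refers to the zero-mean Gaussian scheme proposed in [28], where each weight is drawn with variance 2 / fan_in. The function name `he_normal_init` and the layer sizes are hypothetical; the autoencoder-based strategies (AE, SAE, DAE) require pretraining and are not shown.

```python
import math
import random


def he_normal_init(fan_in, fan_out, seed=0):
    """Sample a fan_out x fan_in weight matrix from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)  # standard deviation suggested in [28]
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


# Hypothetical layer: 256 inputs, 128 outputs.
weights = he_normal_init(fan_in=256, fan_out=128)

# Check that the empirical standard deviation is close to the target.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(f"target std = {math.sqrt(2.0 / 256):.4f}, empirical std = {empirical_std:.4f}")
```

Scaling the variance by the number of inputs keeps the magnitude of activations roughly stable across layers with rectifier units, which is the motivation given in [28].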

5.6.6 Serviceability of Learned DNNs

In this experiment, we validate the serviceability of the DNNs learned in DDC. Clustering with DDC can naturally be treated as a process of training deep networks in a purely unsupervised manner. To verify the validity of the trained networks, we fine-tune them with a small number of labeled patterns (i.e., 32, 128, 512, 2048). As Figure 3 (f) shows, there are dramatic margins between the pretrained networks (solid lines) and the un-pretrained networks (dashed lines), especially when labeled patterns are very limited, e.g., 32 (purple lines). These results demonstrate that DDC is capable of training deep networks from random states to informative states in an unsupervised way.

6 Conclusion

We develop Deep Discriminative Clustering (DDC) to provide a unified mechanism for learning representations and clustering in a purely data-driven manner. For this purpose, global and local constraints are introduced to guide the estimation of the relationships between patterns and the learning of discriminative representations, respectively. By optimizing the DDC model in a mini-batch manner, DDC theoretically converges, and the trained network can directly generate clustering centers for clustering. Extensive experimental results demonstrate that DDC outperforms current methods on popular image, text and audio datasets. In the future, a potential direction is to combine DDC with graph networks [5] to explore high-level relationships and further improve performance.

References


  • [1] R. Achanta and S. Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In CVPR, 2017.
  • [2] R. Achanta and S. Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In CVPR, 2017.
  • [3] A. Agudo, M. Pijoan, and F. Moreno-Noguer. Image collection pop-up: 3d reconstruction and clustering of rigid and non-rigid categories. In CVPR, 2018.
  • [4] C. Bajaj, T. Gao, Z. He, Q. Huang, and Z. Liang. SMAC: simultaneous mapping and clustering using spectral decompositions. In ICML, 2018.
  • [5] P. Battaglia, J. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, 2018.
  • [6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.
  • [7] A. Bhaskara and M. Wijewardena. Distributed clustering via LSH based data partitioning. In ICML, 2018.
  • [8] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • [9] J. Chang, J. Gu, L. Wang, G. Meng, S. Xiang, and C. Pan. Structure-aware convolutional neural networks. In NeurIPS, pages 11–20, 2018.
  • [10] J. Chang, G. Meng, L. Wang, S. Xiang, and C. Pan. Deep self-evolution clustering. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
  • [11] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In ICCV, 2017.
  • [12] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep unsupervised learning with consistent inference of latent representations. Pattern Recognition, 2017.
  • [13] V. Chatziafratis, R. Niazadeh, and M. Charikar. Hierarchical clustering with structural constraints. In ICML, 2018.
  • [14] C. Chen and N. Quadrianto. Clustering high dimensional categorical data via topographical features. In ICML, 2016.
  • [15] X. Chen, J. Huang, F. Nie, R. Chen, and Q. Wu. A self-balanced min-cut algorithm for image clustering. In ICCV, 2017.
  • [16] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
  • [17] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [18] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [19] N. Dilokthanakul, P. Mediano, M. Garnelo, M. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. 2016.
  • [20] K. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In ICCV, 2017.
  • [21] P. Fränti, O. Virmajoki, and V. Hautamäki. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell., 2006.
  • [22] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.
  • [23] B. Gholami and V. Pavlovic. Probabilistic temporal subspace clustering. In CVPR, 2017.
  • [24] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [25] R. Gomes, A. Krause, and P. Perona. Discriminative clustering by regularized information maximization. In NIPS, 2010.
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  • [29] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 1985.
  • [30] D. Ikami, T. Yamasaki, and K. Aizawa. Local and global optimization techniques in graph-based clustering. In CVPR, 2018.
  • [31] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Comput. Surv., 1999.
  • [32] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In IJCAI, 2017.
  • [33] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010.
  • [34] K. Kamnitsas, D. Castro, L. Folgoc, I. Walker, R. Tanno, D. Rueckert, B. Glocker, A. Criminisi, and A. Nori. Semi-supervised learning via compact latent space clustering. In ICML, 2018.
  • [35] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s Thesis, Department of Computer Science, University of Toronto, 2009.
  • [36] J. Lange, A. Karrenbauer, and B. Andres. Partial optimality and fast lower bounds for weighted correlation clustering. In ICML, 2018.
  • [37] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
  • [38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • [39] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.
  • [40] C. Li and R. Vidal. Structured sparse subspace clustering: A unified optimization framework. In CVPR, 2015.
  • [41] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM, 2006.
  • [42] W. Lin, S. Liu, J. Lai, and Y. Matsushita. Dimensionality’s blessing: Clustering images by underlying distribution. In CVPR, 2018.
  • [43] Z. Liu, G. Lin, S. Yang, J. Feng, W. Lin, and W. Ling Goh. Learning Markov clustering networks for scene text detection. In CVPR, 2018.
  • [44] F. Locatello, D. Vincent, I. Tolstikhin, G. Rätsch, S. Gelly, and B. Schölkopf. Clustering meets implicit generative models. CoRR, 2018.
  • [45] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [46] D. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
  • [47] D. Luo, H. Huang, and C. Ding. Discriminative high order SVD: adaptive tensor subspace selection for image classification, clustering, and retrieval. In ICCV, 2011.
  • [48] L. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
  • [49] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. CoRR, 2015.
  • [50] D. Marin, M. Tang, I. Ayed, and Y. Boykov. Kernel clustering: density biases and solutions. IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • [51] L. Martin, A. Loukas, and P. Vandergheynst. Fast approximate spectral clustering for dynamic networks. In ICML, 2018.
  • [52] L. Mi, W. Zhang, X. Gu, and Y. Wang. Variational Wasserstein clustering. In ECCV, 2018.
  • [53] A. Miech, J. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic. Learning from video and text via large-scale discriminative clustering. In ICCV, 2017.
  • [54] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In NIPS, 2006.
  • [55] S. Mukherjee, H. Asnani, E. Lin, and S. Kannan. ClusterGAN: Latent space clustering in generative adversarial networks. CoRR, 2018.
  • [56] C. Peng, Z. Kang, and Q. Cheng. Subspace clustering via variance regularized ridge regression. In CVPR, 2017.
  • [57] V. Premachandran and A. Yuille. Unsupervised learning using generative adversarial training and clustering. 2016.
  • [58] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, 2015.
  • [59] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 2014.
  • [60] S. Shah and V. Koltun. Deep continuous clustering. CoRR, abs/1803.01449, 2018.
  • [61] K. Sinha. K-means clustering using random matrix sparsification. In ICML, 2018.
  • [62] J. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. 2015.
  • [63] A. Strehl and J. Ghosh. Cluster ensembles — A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2002.
  • [64] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [65] M. Sznaier and O. Camps. SOS-RSC: A sum-of-squares polynomial approach to robustifying subspace clustering algorithms. In CVPR, 2018.
  • [66] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • [67] F. Torre and T. Kanade. Discriminative cluster analysis. In ICML, 2006.
  • [68] Z. Tu. Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In ICCV, 2005.
  • [69] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010.
  • [70] J. Wang, J. Wang, J. Song, X. Xu, H. Shen, and S. Li. Optimized Cartesian k-means. IEEE Trans. Knowl. Data Eng., 2015.
  • [71] Q. Wang, J. Gao, and H. Li. Grassmannian manifold optimization assisted sparse spectral clustering. In CVPR, 2017.
  • [72] W. Wang, Y. Wu, C. Tang, and M. Hor. Adaptive density-based spatial clustering of applications with noise (DBSCAN) according to data. In ICMLC, 2015.
  • [73] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
  • [74] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel: Regularizing CNN on the loss layer. In CVPR, 2016.
  • [75] J. Xu, J. Han, and F. Nie. Discriminatively embedded k-means for multi-view clustering. In CVPR, 2016.
  • [76] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, 2016.
  • [77] Z. Yang, J. Corander, and E. Oja. Low-rank doubly stochastic matrix decomposition for cluster analysis. Journal of Machine Learning Research, 2016.
  • [78] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In NIPS, 2007.
  • [79] M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie. Kernel sparse subspace clustering on symmetric positive definite manifolds. In CVPR, 2016.
  • [80] C. You, D. Robinson, and R. Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, 2016.
  • [81] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.
  • [82] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.
  • [83] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao. Latent multi-view subspace clustering. In CVPR, 2017.
  • [84] D. Zhang, Y. Sun, B. Eriksson, and L. Balzano. Deep unsupervised clustering using mixture of autoencoders. CoRR, 2017.
  • [85] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. 2015.
  • [86] K. Zhao, W. Chu, and A. Martinez. Learning facial action units from web images with scalable weakly supervised clustering. In CVPR, 2018.
  • [87] P. Zhou, Y. Hou, and J. Feng. Deep adversarial subspace clustering. In CVPR, 2018.
  • [88] V. Zografos, L. Ellis, and R. Mester. Discriminative subspace clustering. In CVPR, 2013.