1 Introduction
Clustering is one of the most extensively used statistical tools in computer vision and pattern recognition [1, 3, 10, 12, 33, 43]. In a wide range of fields, many applications where the target information is unobservable can be intrinsically instantiated as clustering tasks, including text mining [42] and multimedia content-based retrieval [2, 20, 83, 86].

In the literature, much research has been dedicated to clustering analysis [7, 13, 15, 30, 42]. In terms of algorithmic structure and operation, clustering methods can be roughly divided into two categories, i.e., partitional and agglomerative methods [31]. By definition, partitional methods split all patterns into several partitions until the stopping criteria are met. Typical basic methods of this kind include K-means [70] and spectral clustering [82]. In contrast, agglomerative methods [13, 21] start with each pattern as a single cluster and gradually merge clusters until the stopping criteria are satisfied. Despite these conceptual breakthroughs, such clustering methods face a severe obstacle, i.e., how to acquire effective representations purely from unlabeled data?

In order to acquire effective representations, unsupervised techniques have been developed to represent data. Handcrafted representations derived from professional knowledge, such as SIFT [46] and HoG [17], were utilized originally. However, they are limited to simple scenarios and may degenerate severely in more complex ones, including large-scale images, texts, and audio. By employing deep unsupervised learning techniques [6, 69], preferable representations can be acquired and considerable gains are achieved consequently. Unfortunately, these learned representations are fixed during clustering and cannot be further modified to yield better performance. As previous studies in deep supervised learning suggest, learnable representations are usually more effective than fixed ones. To benefit from learnable representations, one straightforward way is to integrate deep representation learning and clustering in a joint framework [11, 60, 73, 76], but several challenges remain. First, how to define an effective and general objective to train deep networks in an unsupervised way? Second, how to guarantee that the representations learned by deep networks are in favour of clustering? Third, how to discover the number of clusters automatically, rather than predefining one?

To address these challenges, we propose Deep Discriminative Clustering (DDC), which manages the clustering task by iteratively exploring relationships between patterns and learning representations in a mini-batch manner. In each iteration, a global constraint is used to guide the estimation of the relationships. Then, under a local constraint, the relationships are fed back to train the network for learning high-level discriminative representations. Consequently, DDC is theoretically convergent, and the trained network is capable of generating a group of discriminative representations that can be treated as clustering centers for direct clustering. Benefiting from this modeling, DDC is independent of the number of patterns and the number of clusters, which endows the model with the capability of dealing with tasks that involve a large number of patterns or an unobservable number of clusters.
To sum up, the key contributions of this work are:

Under the global and local constraints, DDC endows networks with the capability of mapping all patterns to discriminative clustering centers for clustering.

By virtue of the mini-batch based optimization, DDC alleviates the impact of the numbers of patterns and clusters, further enhancing its practicality.

The validity and the convergence of the developed DDC model are mathematically analyzed, which provides the requisite theoretical foundations and guarantees.

Extensive experiments verify that our DDC model is superior to current methods on various datasets, including images, texts, and audio.
2 Related Work
In this section, we briefly review the related work on clustering and deep learning methods.
2.1 Clustering
By definition, clustering tries to generate a semantic organization of data based on similarities, namely patterns within the same cluster are similar to each other, while those in different clusters are dissimilar. Technically, clustering has been widely studied in an unsupervised setting from two main aspects: how to define a cluster and how to extract effective representations for clustering.
To define clusters, extensive efforts have been made. In particular, the Gaussian mixture distribution is one of the simple yet popular techniques to describe clusters. Relying on this technique, a series of clustering methods have been developed, such as K-means [70] and its variants [61, 70, 78]. In contrast to methods with predefined clusters, many methods prefer to adaptively estimate the forms of clusters purely from datasets. Typically, there are two frequently-used ways, i.e., density-based and connectivity-based estimations. For the former, density-based methods define clusters via a proper density function, which can be utilized to describe the density around each pattern in the feature space. Depending on diverse density functions, various density-based methods [50, 59, 72] have been proposed. For the latter, patterns are considered highly connected if they belong to the same cluster, as in spectral clustering [4, 71]. Furthermore, these ideas also form the basis of a number of methods [14, 77], such as discriminative clustering [25, 47, 53, 54, 67, 68, 75, 78, 88] and subspace based clustering [40, 56, 65, 79, 80, 87].
To extract effective representations for clustering, much attention has been paid. Initially, many handcrafted representations were employed to encode patterns in an unsupervised way, including SIFT [46] and HoG [17] on images. To eliminate influences from trivial variables (e.g., variations of scenes and objects), high-level representations have recently been learned in a data-driven manner with deep unsupervised learning [6, 19, 62, 69, 81, 85].
2.2 Deep Neural Network
Deep Neural Network (DNN) is a significant technique in machine learning, which has been developed to manage tasks like humans do [37]. With respect to the observability of label information, deep learning can be divided into two categories, i.e., deep supervised learning and deep unsupervised learning.
In deep supervised learning, much interesting research has been proposed over the last decade. Thanks to efficient hardware and a large amount of labeled data, remarkable successes have been achieved in various pattern analysis and machine learning problems, especially in computer vision, such as image classification and object detection [9, 24, 27, 28, 45, 64]. Although the achievements are significant, deep supervised learning solely pertains to tasks with a large amount of labeled data.
To eliminate the requirement for labeled data, deep unsupervised learning has drawn attention as a way to train DNNs in an unsupervised manner and extract representations of data simultaneously [8, 52, 58, 84]. In general, the vital problem in deep unsupervised learning is how to provide pseudo labels to train DNNs. Initially, most unsupervised learning methods attempted to reconstruct inputs with DNNs, namely the labels are the inputs themselves. Typical approaches include the autoencoder [6] and its variants [49, 69]. Similar to traditional machine learning methods, more robust and interpretable representations can be learned by introducing additional regularization terms. Recently, deep generative networks [26], one important branch of deep unsupervised learning, have been developed. They train networks by representing data distributions with DNNs. Akin to traditional methods, representations can be learned by modifying generative networks, such as generative adversarial clustering [32, 44, 55, 57, 62] and variational autoencoder based clustering [19, 49].

While many endeavors have been devoted to deep unsupervised learning, several intrinsic challenges still remain. First, how to provide a proper driving force to train DNNs in an unsupervised manner? Second, clustering results cannot be obtained directly from the feature representations learned by these methods, which implies that the generated representations are not naturally suitable for clustering.
3 Deep Discriminative Clustering Model
Clustering, intrinsically, is a function that assigns patterns to a group of clusters whose number is either observable or unobservable. Formally, consider the problem of clustering $n$ patterns $\{x_i\}_{i=1}^{n}$ into $k$ clusters, which are represented by $k$ discriminative clustering centers. By this definition, clustering is naturally formulated as
$$c_i = f(x_i; w), \tag{1}$$
where $c_i$ denotes the cluster assigned to $x_i$ and $f$ is a clustering function with the learnable parameter $w$. In clustering, generally, a fundamental problem is how to model an objective function on a set of clustering centers to estimate the parameter $w$. In the following, we focus on dealing with this problem. The flowchart of DDC is illustrated in Figure 1.
3.1 Relational Objective Function
The main obstacle to modeling an objective function in clustering is that the correspondence between patterns and clustering centers is non-unique. To eliminate this obstacle, relationships between patterns are employed as the label information, since whether a pair of patterns belongs to the same cluster is unique. From this perspective, we reform the dataset as a relational dataset $\mathcal{D} = \{(x_i, x_j, r_{ij})\}$, where $r_{ij} = 1$ represents that $x_i$ and $x_j$ belong to the same cluster, and $r_{ij} = 0$ otherwise. Accordingly, the objective function in clustering can be reformulated as
$$\min_{w} \sum_{i,j} L\big(r_{ij},\, g(x_i, x_j; w)\big), \tag{2}$$
where $g$ is a learnable function to estimate similarities between patterns, and $L$ is the binary cross-entropy to measure the error between $r_{ij}$ and $g(x_i, x_j; w)$. In Eq. (2), two issues need to be handled. First, the ground-truth relationship $r_{ij}$ between $x_i$ and $x_j$ is missing in clustering. Second, the clustering labels of $x_i$ and $x_j$ cannot be explicitly acquired even when the ground-truth relationship is observable. In the following, we manage these issues by introducing a global constraint and a local constraint.
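The relational reformulation in Eq. (2) can be illustrated with a minimal NumPy sketch. It builds the pairwise relation matrix from cluster labels and evaluates a mean binary cross-entropy against candidate similarities. The function names `relation_matrix` and `pairwise_bce`, the clipping constant `eps`, and the toy labels are illustrative assumptions and not part of the DDC model; in actual clustering the labels are unobservable, which is exactly the first issue noted above.

```python
import numpy as np

def relation_matrix(labels):
    """r_ij = 1 if patterns i and j share a cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def pairwise_bce(r, s, eps=1e-7):
    """Mean binary cross-entropy between relations r and similarities s in [0, 1]."""
    s = np.clip(s, eps, 1.0 - eps)  # avoid log(0); eps is an assumed constant
    return float(np.mean(-(r * np.log(s) + (1.0 - r) * np.log(1.0 - s))))

labels = [0, 0, 1, 2, 1]                 # hypothetical labels, for illustration only
R = relation_matrix(labels)
perfect = pairwise_bce(R, R)             # similarities that match the relations
uninformed = pairwise_bce(R, np.full_like(R, 0.5))  # constant similarity 0.5
```

The loss is close to zero when the similarities agree with the relations and equals log 2 for the uninformative constant similarity, which is why minimizing it pushes similarities towards the pairwise relations.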
3.1.1 Global Constraint
To tackle these issues, we first explore a piece of inherent prior knowledge in clustering, termed the global constraint, which can be employed to instantiate and refine the basic model in Eq. (2). Specifically, the global constraint models the relationships between patterns with three essential rules, i.e., reflexivity, symmetry and transitivity.
Reflexivity. The reflexivity signifies that every pattern and itself strictly belong to the same cluster, i.e., the relationship $r_{ii} = 1$ is met for arbitrary $x_i$. Under this rule, the similarity function $g$ with the parameter $w$ satisfies
$$g(x_i, x_i; w) = 1. \tag{3}$$
Intrinsically, the reflexivity can be considered as a “reconstruction” constraint for every pattern. Contrary to the reconstruction in AE [6], which attempts to learn features to represent inputs, the reflexivity focuses on modeling the similarity relationship in a high-level latent space.
Symmetry. The symmetry indicates that $x_i$ is (dis)similar to $x_j$ if and only if $x_j$ is (dis)similar to $x_i$, i.e., $r_{ij} = r_{ji}$ is met for arbitrary $x_i$ and $x_j$. In order to meet this rule, the similarity function $g$ should be symmetric in its inputs $x_i$ and $x_j$. For this purpose, $g$ is decomposed into the dot product of the same function $f$ evaluated on both inputs, i.e.,
$$g(x_i, x_j; w) = f(x_i; w) \cdot f(x_j; w), \tag{4}$$
where “$\cdot$” represents the dot product. Generally, a deep network is employed to parametrize $f$ because of its recognized capability in feature learning.
Transitivity. The transitivity means that $x_i$ is similar to $x_k$ if $x_i$ is similar to $x_j$ and $x_j$ is similar to $x_k$, i.e., $r_{ik} = 1$ if $r_{ij} = 1$ and $r_{jk} = 1$ for arbitrary $x_i$, $x_j$ and $x_k$. Naturally, the transitivity signifies that the ground-truth similarity matrices in clustering possess a specific form, although the real similarities are unobservable.
Inspired by the above three rules, we attempt to estimate the ground-truth relationships from coarse relationships. In detail, given a deep network $f$, the coarse similarity matrix is first obtained according to Eq. (4). Then, under the global constraint, the ground-truth similarity matrix is estimated by finding the matrix most similar to the coarse similarity matrix, i.e.,
$$\min_{R}\; \Phi(R, S) \quad \text{s.t.} \quad R \in \mathcal{C}, \tag{5}$$
where $S$ is the coarse similarity matrix calculated with the network $f$, “$R \in \mathcal{C}$” implies that $R$ meets the above global constraint, and $\Phi$ is a global constraint error to measure the error between $R$ and $S$. By introducing the global constraint into Eq. (2), the basic DDC model can be rewritten as follows:
$$\min_{w, R}\; \sum_{i,j} L\big(r_{ij},\, f(x_i; w) \cdot f(x_j; w)\big) + \Phi(R, S) \quad \text{s.t.} \quad R \in \mathcal{C}, \tag{6}$$
where $r_{ij}$ is the $(i, j)$-th element of $R$, the first term is employed to learn the parameter $w$ for estimating faithful similarities, and the second term acts as a penalty to explore the ground-truth relationships from coarse relationships.
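The three rules of the global constraint require the relation matrix to describe an equivalence relation. The following sketch, under stated assumptions, computes dot-product similarities as in Eq. (4) and checks reflexivity, symmetry and transitivity on binary relation matrices. The helper names, the toy features, and the threshold of 0.5 are hypothetical choices for illustration only.

```python
import numpy as np

def similarity_matrix(features):
    """Eq. (4): S[i, j] is the dot product of the features of patterns i and j."""
    features = np.asarray(features, dtype=float)
    return features @ features.T

def satisfies_global_constraint(R):
    """Check reflexivity, symmetry and transitivity of a binary relation matrix."""
    R = np.asarray(R, dtype=bool)
    reflexive = bool(np.all(np.diag(R)))
    symmetric = bool(np.array_equal(R, R.T))
    # two_step[i, k] is True when some j gives r_ij = 1 and r_jk = 1.
    two_step = (R.astype(int) @ R.astype(int)) > 0
    transitive = bool(np.all(R[two_step]))
    return reflexive and symmetric and transitive

# Hypothetical unit-norm, non-negative features for four patterns.
features = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.6, 0.8]])
S = similarity_matrix(features)

# A block-structured relation (patterns 1 and 3 share a cluster) is valid.
valid = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 1, 0, 1]])
# Thresholding coarse similarities can break transitivity:
# pattern 0 ~ 1 and 1 ~ 2 hold, but 0 ~ 2 does not.
thresholded = S >= 0.5
```

The dot product makes `S` symmetric by construction, and unit-norm features give a unit diagonal, so only transitivity has to be enforced when relationships are estimated from `S`.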
3.1.2 Local Constraint
To learn beneficial representations for clustering, we introduce indicator features $I_i = f(x_i; w)$, which are capable of assigning each pattern to a cluster automatically. To this end, we impose a non-negative local constraint on every element in indicator features, i.e.,
$$I_{ih} \ge 0, \quad \forall\, i, h, \tag{7}$$
where $I_{ih}$ signifies the $h$-th element in the indicator feature $I_i$, and $\|I_i\|_2 = 1$ is satisfied. Compared with the global constraint, which focuses on the relationships between patterns, the local constraint is only concerned with local information, i.e., each element in indicator features. By integrating Eq. (7) into Eq. (6), the objective function of DDC is formulated as follows:
$$\min_{w, R}\; \sum_{i,j} L\big(r_{ij},\, I_i \cdot I_j\big) + \Phi(R, S) \quad \text{s.t.} \quad R \in \mathcal{C}, \;\; I_{ih} \ge 0, \;\forall\, i, h. \tag{8}$$
In the following, we give Theorem 1 and Theorem 2 to theoretically analyze the validity of our DDC model.
Theorem 1
If the minimum of Eq. (8) is attained, the indicator features $I_i$ and $I_j$ of any two patterns are either equivalent or orthogonal, i.e., $I_i = I_j$ or $I_i \cdot I_j = 0$. Further, the locations of the largest responses in the indicator features are different if $I_i \cdot I_j = 0$, and identical otherwise.
Theorem 1 indicates that the deep network in DDC intrinsically learns a group of discriminative clustering centers. Since different locations of the largest response uniquely correspond to different clustering centers, all patterns can be grouped by locating the largest response in indicator features, i.e.,
$$c_i = \arg\max_{h}\; I_{ih}, \tag{9}$$
where $c_i$ and $I_i$ are the cluster label and the learned indicator feature of $x_i$, respectively, and $I_{ih}$ is the $h$-th element of $I_i$. We also provide a sufficient condition for achieving the global minimum as follows.
Theorem 2
For an arbitrary similarity matrix with $k$ clusters, the global minimum of Eq. (8) can always be attained via a powerful enough function $f$, provided that the dimension of indicator features is not smaller than $k$.
Theorem 2 implies that DDC can handle a set of unlabeled data with $k$ clusters provided that the dimension of indicator features is not smaller than $k$. This expands the practicability of DDC, since it is unnecessary to predefine the number of clusters before clustering.
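The structure described by Theorem 1 and the assignment rule in Eq. (9) can be illustrated as follows. This is a minimal sketch with hand-written indicator features (an assumption made for illustration, not network outputs): they are non-negative, have unit norm, and are pairwise equal or orthogonal. The feature dimension (3) is not smaller than the number of clusters, as Theorem 2 requires.

```python
import numpy as np

# Hypothetical indicator features, one row per pattern.
indicators = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

gram = indicators @ indicators.T      # pairwise dot products, each 0 or 1 here
labels = indicators.argmax(axis=1)    # Eq. (9): location of the largest response
same_cluster = labels[:, None] == labels[None, :]
```

In this idealized case a dot product of 1 coincides exactly with two patterns sharing the same arg-max location, so the labels can be read off without running a separate clustering method.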
3.2 In Context of Related Work
Our DDC model is established by introducing the local and global constraints into a supervised objective. The local constraint enforces the non-negativity of each element in indicator features and thus focuses on local information only. In contrast, the global constraint considers relationships among multiple patterns. In addition, the developed DDC model can be treated as a generalization of the DAC model [11]. In the following, we discuss the crucial differences in detail.
Technically, DAC learns one-hot representations for clustering. Compared with our DDC, three practical drawbacks make DAC less general and adaptive. First, DAC focuses solely on relationships between pairs of patterns, which is insufficient for clustering. For example, a pattern $x_i$ may be dissimilar to $x_k$ in DAC even if $x_i$ is similar to $x_j$ and $x_j$ is similar to $x_k$. Because of the transitivity, DDC avoids such an inconsistent scenario. Second, DAC requires the number of clusters to be predefined, although it may hardly be observable before clustering. By comparison, DDC can estimate the number of clusters in a purely data-driven way, as indicated by Theorem 2. Third, the performance of DAC is very sensitive to the hyperparameters for estimating relationships. In contrast, relying on the transitivity in the global constraint, DDC estimates relationships without additional hyperparameters, further enhancing its dependability.
4 Deep Discriminative Clustering Algorithm
For optimizing the DDC model, we develop a straightforward yet effective DDC algorithm. Naturally, the algorithm focuses on two problems in Eq. (8), namely the constraint on indicator features and the optimization.
4.1 Implementation of Indicator Features
A constraint layer is built to learn indicator features that meet the non-negativity in the local constraint and the reflexivity in the global constraint. Formally, the layer consists of two mapping functions, i.e.,
(10a)  
(10b) 
where , and indicate the input, temporary variable and output of the layer, respectively. Through the transformation of the above functions, outputs of arbitrary DNNs can be mapped as the indicator features consequently.
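One possible realization of such a two-step layer is sketched below. The two mappings used here, a shifted exponential followed by L2 normalization, are assumptions chosen because they yield non-negative features with a unit self-similarity; they are not claimed to be the exact functions of Eqs. (10a) and (10b). The function name `constraint_layer` is hypothetical.

```python
import numpy as np

def constraint_layer(z):
    """Map raw network outputs to non-negative, unit-norm indicator features."""
    z = np.asarray(z, dtype=float)
    # Step 1 (assumed): a shifted exponential makes every element strictly positive.
    t = np.exp(z - z.max(axis=1, keepdims=True))
    # Step 2 (assumed): L2 normalization gives each feature a unit dot product with itself.
    return t / np.linalg.norm(t, axis=1, keepdims=True)

raw = np.array([[2.0, -1.0, 0.5],
                [-3.0, 4.0, 0.0]])     # hypothetical outputs of an arbitrary DNN
features = constraint_layer(raw)
```

Subtracting the row maximum before the exponential is a standard numerical-stability measure and does not change the normalized result.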
4.2 Alternating Optimization on Mini Batches
We solve the DDC model by iteratively optimizing the two terms in Eq. (8) on mini batches of patterns, which is inspired by an observation in training DNNs. As illustrated in Figure 2, DNNs have considerable robustness to noisy labels when the proportion of noisy labels is small [74]. Intrinsically, the core problem in training DNNs is to find proper gradients on mini batches. If only a few noisy labels are introduced, the real gradients may not be severely degraded. Benefiting from this observation, the DDC algorithm iteratively updates the estimated relationships and the network parameters to alleviate the impact of noise in the estimated similarity matrices. In summary, the algorithm is illustrated in Algorithm 1 and elaborated as follows.
When the parameter of the network is fixed, the DDC model in Eq. (8) degenerates as follows:
(11)  
which can be recast as the normalized cut problem on the coarse similarity matrix. Therefore, we acquire the estimated similarity matrix by relying on the spectral clustering method [82]. Because of the mini-batch based optimization, the efficiency of our DDC model can be maintained on large datasets.
Once the similarity matrix is calculated, the DDC model in Eq. (8) is equivalent to
(12)  
which is a typical supervised task, and can be solved with the backpropagation algorithm.
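The two alternating half-steps on one mini batch can be sketched as follows. To stay dependency-free, this sketch swaps in a simple stand-in for the spectral clustering step: it thresholds the coarse similarities and takes the transitive closure, so that the estimated relations satisfy the global constraint. The threshold of 0.5, the function names, and the toy batch are assumptions, and the gradient update itself is omitted; only the loss that backpropagation would minimize is evaluated.

```python
import numpy as np

def estimate_relations(S, threshold=0.5):
    """Stand-in for the spectral step: threshold S, then take the transitive closure."""
    R = S >= threshold
    for _ in range(len(R)):                  # repeated squaring reaches the closure
        R_next = (R.astype(int) @ R.astype(int)) > 0
        if np.array_equal(R_next, R):
            break
        R = R_next
    return R.astype(float)

def relation_loss(features, R, eps=1e-7):
    """Mean binary cross-entropy between fixed relations R and dot-product similarities."""
    S = np.clip(features @ features.T, eps, 1.0 - eps)
    return float(np.mean(-(R * np.log(S) + (1.0 - R) * np.log(1.0 - S))))

# Hypothetical mini batch of non-negative features, normalized to unit length.
batch = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 1.0]])
batch /= np.linalg.norm(batch, axis=1, keepdims=True)

R = estimate_relations(batch @ batch.T)      # half-step 1: network fixed, estimate relations
loss_now = relation_loss(batch, R)           # half-step 2 would minimize this over the network
ideal = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss_ideal = relation_loss(ideal, R)         # one-hot features consistent with R
```

With the relations held fixed, features that are equal within a cluster and orthogonal across clusters give a much smaller loss than the current features, which is the direction a supervised update would move the network in.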
The convergence of the DDC algorithm is guaranteed by the following Theorem 3.
Theorem 3
If the sampled mini batches in DDC are identically distributed, the DDC algorithm is convergent.
Since the assumption in Theorem 3 is generally satisfied in practice, e.g., when mini batches are randomly sampled from the same dataset, the DDC algorithm is convergent in general scenarios.
5 Experiments
In this section, we carry out extensive experiments to verify the capability of our DDC model. Due to space restrictions, we report additional experimental details, e.g., network architectures, in the supplement.
5.1 Datasets
For a comprehensive comparison, eight popular datasets covering images, texts and audio are utilized in the experiments. For the image datasets, MNIST [38], CIFAR-10 [35], STL-10 [16], ImageNet-10 [11], and ImageNet-Dog [11] are used. As for the text datasets, 20NEWS and REUTERS [39] are used for comparison. An audio dataset, AudioSet-20, which is randomly sampled from AudioSet [22], is also employed. On these datasets, the training and testing patterns of each dataset are jointly utilized in our experiments, as described in [11, 73, 76]. Specifically, the term frequency-inverse document frequency features (https://scikit-learn.org/stable) and the Mel-frequency cepstral coefficients (https://librosa.github.io/librosa/feature.html) are employed to encode texts and audio, respectively. The number of clusters and the dimensions of patterns on each dataset are listed in Table 1.
5.2 Evaluation Metrics
Three frequently-used metrics are employed for evaluating the clustering results: Normalized Mutual Information (NMI) [63], Adjusted Rand Index (ARI) [29], and clustering Accuracy (ACC) [41]. Each of these metrics attains its maximum of 1 for a perfect clustering, and higher scores indicate more accurate clustering results.
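Because cluster indices are arbitrary, ACC is computed after matching predicted clusters to ground-truth classes. The following minimal sketch searches all one-to-one relabelings with the standard library. The function name and the toy labels are illustrative, and the exhaustive search is an assumption suitable only for a handful of clusters; larger problems are typically solved with the Hungarian algorithm.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy over all one-to-one relabelings of predicted clusters."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so every predicted cluster can be mapped even if counts differ.
    size = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (size - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 0]      # clusters 0 and 1 are swapped, one pattern is misplaced
acc = clustering_accuracy(true, pred)
```

Here five of the six patterns are matched under the best relabeling, so the swapped cluster indices are not penalized while the misplaced pattern is.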
5.3 Compared Methods
Several existing clustering methods are compared with our approach. Specifically, the traditional methods, including K-means [70], SC [82] and AC [21], are adopted for comparison. For the representation-based clustering approaches, as described in [73], we employ several unsupervised learning methods, including AE [6], SAE [6], DAE [69], DeCNN [81], and SWWAE [85], to learn feature representations and use K-means [70] to cluster data as a post-processing step. For a comprehensive comparison, recent single-stage methods, including CatGAN [62], GMVAE [19], DAC [11], DAC* [11], JULE-SF [76], JULE-RC [76], and DEC [73], are also employed.
5.4 Experimental Settings
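Table 2: Clustering results of the compared methods on MNIST [38], CIFAR-10 [35], STL-10 [16], and ImageNet-10 [11].
Method  |  MNIST NMI  |  MNIST ARI  |  MNIST ACC  |  CIFAR-10 NMI  |  CIFAR-10 ARI  |  CIFAR-10 ACC  |  STL-10 NMI  |  STL-10 ARI  |  STL-10 ACC  |  ImageNet-10 NMI  |  ImageNet-10 ARI  |  ImageNet-10 ACC
K-means [70]  |  0.4997  |  0.3652  |  0.5723  |  0.0871  |  0.0487  |  0.2289  |  0.1245  |  0.0608  |  0.1920  |  0.1186  |  0.0571  |  0.2409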
SC [82]  0.6626  0.5214  0.6958  0.1028  0.0853  0.2467  0.0978  0.0479  0.1588  0.1511  0.0757  0.2740 
AC [21]  0.6094  0.4807  0.6953  0.1046  0.0646  0.2275  0.2386  0.1402  0.3322  0.1383  0.0674  0.2420 
AE [6]  0.7257  0.6139  0.8123  0.2393  0.1689  0.3135  0.2496  0.1610  0.3030  0.2099  0.1516  0.3170 
SAE [6]  0.7565  0.6393  0.8271  0.2468  0.1555  0.2973  0.2520  0.1605  0.3203  0.2122  0.1740  0.3254 
DAE [69]  0.7563  0.6467  0.8316  0.2506  0.1627  0.2971  0.2242  0.1519  0.3022  0.2064  0.1376  0.3044 
DeCNN [81]  0.7577  0.6691  0.8179  0.2395  0.1736  0.2820  0.2267  0.1621  0.2988  0.1856  0.1421  0.3130 
SWWAE [85]  0.7360  0.6518  0.8251  0.2330  0.1638  0.2840  0.1962  0.1358  0.2704  0.1761  0.1603  0.3238 
CatGAN [62]  0.7637  0.7360  0.8279  0.2646  0.1757  0.3152  0.2100  0.1390  0.2984  0.2250  0.1571  0.3459 
GMVAE [19]  0.7364  0.7129  0.8317  0.2451  0.1674  0.2908  0.2004  0.1464  0.2815  0.1934  0.1683  0.3344 
DAC [11]  0.9351  0.9486  0.9775  0.3959  0.3059  0.5218  0.3656  0.2565  0.4699  0.3944  0.3019  0.5272 
DAC* [11]  0.9246  0.9406  0.9660  0.3793  0.2802  0.4982  0.3474  0.2351  0.4337  0.3693  0.2837  0.5026 
JULE-SF [76]  |  0.9063  |  0.9139  |  0.9592  |  0.1919  |  0.1357  |  0.2643  |  0.1754  |  0.1622  |  0.2741  |  0.1597  |  0.1205  |  0.2927
JULE-RC [76]  |  0.9130  |  0.9270  |  0.9640  |  0.1923  |  0.1377  |  0.2715  |  0.1815  |  0.1643  |  0.2769  |  0.1752  |  0.1382  |  0.3004
DEC [73]  0.7716  0.7414  0.8430  0.2568  0.1607  0.3010  0.2760  0.1861  0.3590  0.2819  0.2031  0.3809 
DDC  0.9514  0.9667  0.9800  0.4242  0.3285  0.5238  0.3712  0.2674  0.4891  0.4327  0.3451  0.5766 
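Table 2 (continued): Clustering results of the compared methods on ImageNet-Dog [11], 20NEWS [39], REUTERS [39], and AudioSet-20 [22].
Method  |  ImageNet-Dog NMI  |  ImageNet-Dog ARI  |  ImageNet-Dog ACC  |  20NEWS NMI  |  20NEWS ARI  |  20NEWS ACC  |  REUTERS NMI  |  REUTERS ARI  |  REUTERS ACC  |  AudioSet-20 NMI  |  AudioSet-20 ARI  |  AudioSet-20 ACC
K-means [70]  |  0.0548  |  0.0204  |  0.1054  |  0.2154  |  0.0826  |  0.2328  |  0.3275  |  0.2593  |  0.5329  |  0.1901  |  0.0780  |  0.2074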
SC [82]  0.0383  0.0133  0.1111  0.2147  0.0934  0.2486  N/A  N/A  N/A  0.1774  0.0741  0.2195 
AC [21]  0.0368  0.0207  0.1385  0.2024  0.0963  0.2397  N/A  N/A  N/A  0.1854  0.0643  0.2286 
AE [6]  0.1039  0.0728  0.1851  0.4439  0.3237  0.4907  0.3564  0.3164  0.7197  0.1933  0.1000  0.2300 
SAE [6]  0.1129  0.0729  0.1830  0.4575  0.3265  0.4869  0.3556  0.3186  0.7256  0.1925  0.1061  0.2382 
DAE [69]  0.1043  0.0779  0.1903  0.4558  0.3274  0.5027  0.3675  0.3386  0.7246  0.1964  0.1048  0.2366 
DeCNN [81]  0.0983  0.0732  0.1747  0.4450  0.3526  0.5199  0.3750  0.3497  0.7221  0.2028  0.1313  0.2465 
SWWAE [85]  0.0936  0.0760  0.1585  0.4646  0.3247  0.5103  0.3805  0.3538  0.7284  0.1991  0.1323  0.2261 
CatGAN [62]  0.1213  0.0776  0.1738  0.4064  0.3024  0.4754  0.3553  0.3246  0.6245  0.2102  0.0946  0.2174 
GMVAE [19]  0.1074  0.0786  0.1788  0.4266  0.3356  0.4792  0.3823  0.3475  0.6365  0.2224  0.1066  0.2441 
DAC [11]  0.2185  0.1105  0.2748  0.5893  0.5005  0.5841  0.7116  0.5892  0.7875  0.2730  0.1372  0.2832 
DAC* [11]  0.1815  0.0953  0.2455  0.5655  0.4774  0.5495  0.6835  0.5537  0.7535  0.2556  0.1223  0.2645 
JULE-SF [76]  |  0.1213  |  0.0776  |  0.1738  |  0.4976  |  0.4506  |  0.5689  |  0.3986  |  0.3594  |  0.6111  |  0.1935  |  0.1128  |  0.2163
JULE-RC [76]  |  0.0492  |  0.0261  |  0.1115  |  0.5126  |  0.4679  |  0.5834  |  0.4012  |  0.3825  |  0.6336  |  0.2106  |  0.1108  |  0.2336
DEC [73]  0.1216  0.0788  0.1949  0.5286  0.4683  0.5832  0.6832  0.5635  0.7563  0.2280  0.1102  0.2461 
DDC  0.2395  0.1283  0.3063  0.6083  0.5247  0.6470  0.7328  0.6044  0.8277  0.2967  0.1744  0.3060 
In our experiments, DNNs with the constraint layer are devised to learn indicator features (the details of the devised DNNs can be found in the supplementary material). Specifically, we update the model parameters on mini batches of patterns, relying on the traditional spectral clustering [82] to estimate the relationships. The normalized Gaussian initialization strategy [28] is utilized to initialize the devised deep networks. The RMSProp optimizer [66], with a preset initial learning rate, is utilized to train the networks in DDC. For a reasonable evaluation, we perform multiple random restarts for all experiments, and the averages are compared with the other methods.
5.5 Clustering Results
The clustering results of the compared methods are listed in Table 2. In the table, DDC achieves the best performance on all the datasets (including images, texts and audio). In terms of ACC, the gains over the strongest baseline range from 0.0020 on CIFAR-10 to 0.0629 on 20NEWS. This demonstrates that DDC is effective for managing general clustering tasks in practice. Furthermore, several tendencies can be observed from Table 2.
First, representation learning is crucial in clustering. From the table, the traditional methods (e.g., K-means [70]) generally achieve inferior performance to the representation-based clustering methods (e.g., AE [6]). Since the only difference between these methods is the utilized features, the performance gains demonstrate that feature learning plays an important role in clustering.
Second, combining representation learning with the traditional methods is beneficial to clustering. From the table, the approaches that simultaneously perform feature learning and clustering (e.g., DEC [73], JULE [76]) often achieve superior performance to the representation-based approaches. This observation supports our motivation in DDC that representation learning and clustering reinforce each other.
Finally, DDC is able to handle large-scale datasets (e.g., ImageNet-10, REUTERS and AudioSet-20), and is not limited to simple datasets (e.g., MNIST). Specifically, “large” covers two aspects, i.e., the dimension of patterns and the number of patterns. For the dimension, DDC clusters high-dimensional patterns with DNNs, which can partially manage the curse of dimensionality. As for the number, DDC optimizes DNNs on mini batches, which is independent of the number of patterns and consequently enhances the practicability.
5.6 Ablation Study
In this subsection, we carry out extensive ablation studies to analyze the developed DDC. All the results are illustrated in Figure 3.
5.6.1 Unobservable Number of Clusters
An experiment on MNIST is performed to demonstrate the validity of DDC when the number of clusters is unobservable. In the basic network on MNIST, we vary the dimension of indicator features to investigate the variations of the global constraint error in DDC. From Figure 3 (a), there are obvious differences between different dimensions. Specifically, the global constraint error first decreases as the dimension increases, and then grows as the dimension increases further, so the smallest global constraint error on MNIST is achieved at an intermediate dimension. That is, the global constraint error can be employed to search for an appropriate number of clusters. This verifies that DDC can handle clustering tasks when the number of clusters is unobservable. In Figure 3, the learned indicator features are visualized with t-SNE [48]. The visualization indicates that the real number of clusters also corresponds to the fastest convergence rate.
5.6.2 Performance on Larger Number of Clusters
We establish five different datasets to evaluate the performance of DDC when the number of clusters is very large. By varying the number of clusters over a range with a fixed interval, subsets of ILSVRC2012-1K [18] are randomly sampled. Specifically, we observe the following two tendencies from Figure 3 (b). First, the performance of all the methods gradually degrades as the number of clusters increases. Second, the superiority of DDC holds on every dataset, which signifies that DDC can handle clustering tasks with various numbers of clusters.
5.6.3 Impact of Number of Patterns
Akin to deep supervised learning tasks, we study the impact of the number of patterns on our DDC model. To this end, we vary the number of patterns in the CIFAR-10 dataset over a range with a fixed interval. From the clustering results illustrated in Figure 3 (c), one can observe that the performance of DDC gradually improves as the number of patterns increases. This is reasonable since deep networks possess a large number of learnable parameters. By utilizing more data to train the networks, these parameters can be estimated more accurately. Consequently, stronger deep networks are learned. This observation also indicates that DDC can improve its performance by increasing the amount of data. Because unlabeled data is easily accessible in practice, there is still much room for improvement.
5.6.4 Validity of Learned Indicator Features
To reveal the validity of the learned indicator features in DDC, we report the clustering results obtained by applying four popular traditional clustering methods, i.e., K-means [70], SC [82], AC [21], and Birch [63], to the learned indicator features. From Figure 3 (d), almost the same clustering processes are generated for all traditional clustering methods. The slight mismatch may originate from the randomness in training, including random mini-batch selections or initializations. This observation demonstrates that DDC is capable of learning high-level discriminative representations for clustering, which consequently reduces the dependence on the choice of clustering method.
5.6.5 Sensitivity to Initializations
As previous studies have shown [6], the initial state is crucial for deep networks. Therefore, we study the contribution of initializations to DDC. Four frequently-used initialization strategies are utilized in this experiment, i.e., the normalized Gaussian initialization strategy [28], AE [6], SAE [6] and DAE [69]. Figure 3 (e) illustrates the clustering processes on AudioSet-20 with diverse initializations, indicating that better initializations are beneficial to our DDC model. This is predictable, since preferable initializations can generate proper indicator features and consequently converge to better partitions.
5.6.6 Serviceability of Learned DNNs
In this experiment, the serviceability of the DNNs learned in DDC is validated. Naturally, clustering with DDC can be treated as a process of training deep networks in a purely unsupervised manner. To verify the validity of the trained networks, we fine-tune the networks with a small number of labeled patterns (i.e., 32, 128, 512, 2048). From Figure 3 (f), there are large margins between the pretrained networks (solid lines) and the unpretrained networks (dashed lines), especially when the labeled patterns are very limited, e.g., 32 (purple lines). These results demonstrate that DDC is capable of training deep networks from random states to informative states in an unsupervised way.
6 Conclusion
We develop Deep Discriminative Clustering (DDC) to yield a unified mechanism of learning representations and clustering in a purely data-driven manner. For this purpose, the global and local constraints are introduced, which can be employed to guide the estimation of the relationships between patterns and the learning of discriminative representations. By optimizing the DDC model in a mini-batch way, DDC theoretically converges and the network in DDC can directly generate clustering centers for clustering. Extensive experimental results demonstrate that DDC outperforms current methods on popular image, text and audio datasets. In the future, a potential direction is to combine DDC with graph networks [5] to explore high-level relationships for improving the performance.
References
 [1] R. Achanta and S. Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In CVPR, 2017.
 [2] R. Achanta and S. Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In CVPR, 2017.
 [3] A. Agudo, M. Pijoan, and F. Moreno-Noguer. Image collection pop-up: 3D reconstruction and clustering of rigid and non-rigid categories. In CVPR, 2018.
 [4] C. Bajaj, T. Gao, Z. He, Q. Huang, and Z. Liang. SMAC: simultaneous mapping and clustering using spectral decompositions. In ICML, 2018.
 [5] P. Battaglia, J. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, 2018.
 [6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.
 [7] A. Bhaskara and M. Wijewardena. Distributed clustering via LSH based data partitioning. In ICML, 2018.
 [8] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.

 [9] J. Chang, J. Gu, L. Wang, G. Meng, S. Xiang, and C. Pan. Structure-aware convolutional neural networks. In NeurIPS, pages 11–20, 2018.
 [10] J. Chang, G. Meng, L. Wang, S. Xiang, and C. Pan. Deep self-evolution clustering. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
 [11] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In ICCV, 2017.
 [12] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep unsupervised learning with consistent inference of latent representations. Pattern Recognition, 2017.
 [13] V. Chatziafratis, R. Niazadeh, and M. Charikar. Hierarchical clustering with structural constraints. In ICML, 2018.
 [14] C. Chen and N. Quadrianto. Clustering high dimensional categorical data via topographical features. In ICML, 2016.
 [15] X. Chen, J. Huang, F. Nie, R. Chen, and Q. Wu. A self-balanced min-cut algorithm for image clustering. In ICCV, 2017.
 [16] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
 [17] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
 [18] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
 [19] N. Dilokthanakul, P. Mediano, M. Garnelo, M. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. 2016.
 [20] K. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In ICCV, 2017.
 [21] P. Fränti, O. Virmajoki, and V. Hautamäki. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell., 2006.
 [22] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.
 [23] B. Gholami and V. Pavlovic. Probabilistic temporal subspace clustering. In CVPR, 2017.
 [24] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 [25] R. Gomes, A. Krause, and P. Perona. Discriminative clustering by regularized information maximization. In NIPS, 2010.
 [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [27] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [28] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
 [29] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 1985.
 [30] D. Ikami, T. Yamasaki, and K. Aizawa. Local and global optimization techniques in graph-based clustering. In CVPR, 2018.
 [31] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Comput. Surv., 1999.
 [32] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In IJCAI, 2017.
 [33] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010.
 [34] K. Kamnitsas, D. Castro, L. Folgoc, I. Walker, R. Tanno, D. Rueckert, B. Glocker, A. Criminisi, and A. Nori. Semi-supervised learning via compact latent space clustering. In ICML, 2018.
 [35] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s Thesis, Department of Computer Science, University of Toronto, 2009.
 [36] J. Lange, A. Karrenbauer, and B. Andres. Partial optimality and fast lower bounds for weighted correlation clustering. In ICML, 2018.
 [37] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
 [38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [39] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.
 [40] C. Li and R. Vidal. Structured sparse subspace clustering: A unified optimization framework. In CVPR, 2015.
 [41] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM, 2006.
 [42] W. Lin, S. Liu, J. Lai, and Y. Matsushita. Dimensionality’s blessing: Clustering images by underlying distribution. In CVPR, 2018.
 [43] Z. Liu, G. Lin, S. Yang, J. Feng, W. Lin, and W. Ling Goh. Learning markov clustering networks for scene text detection. In CVPR, 2018.
 [44] F. Locatello, D. Vincent, I. Tolstikhin, G. Rätsch, S. Gelly, and B. Schölkopf. Clustering meets implicit generative models. CoRR, 2018.
 [45] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [46] D. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.

 [47] D. Luo, H. Huang, and C. Ding. Discriminative high order SVD: adaptive tensor subspace selection for image classification, clustering, and retrieval. In ICCV, 2011.
 [48] L. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
 [49] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. CoRR, 2015.
 [50] D. Marin, M. Tang, I. Ayed, and Y. Boykov. Kernel clustering: density biases and solutions. IEEE Trans. Pattern Anal. Mach. Intell., 2017.
 [51] L. Martin, A. Loukas, and P. Vandergheynst. Fast approximate spectral clustering for dynamic networks. In ICML, 2018.
 [52] L. Mi, W. Zhang, X. Gu, and Y. Wang. Variational wasserstein clustering. In ECCV, 2018.
 [53] A. Miech, J. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic. Learning from video and text via large-scale discriminative clustering. In ICCV, 2017.
 [54] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In NIPS, 2006.
 [55] S. Mukherjee, H. Asnani, E. Lin, and S. Kannan. ClusterGAN: Latent space clustering in generative adversarial networks. CoRR, 2018.

 [56] C. Peng, Z. Kang, and Q. Cheng. Subspace clustering via variance regularized ridge regression. In CVPR, 2017.
 [57] V. Premachandran and A. Yuille. Unsupervised learning using generative adversarial training and clustering. 2016.
 [58] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, 2015.
 [59] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 2014.
 [60] S. Shah and V. Koltun. Deep continuous clustering. CoRR, abs/1803.01449, 2018.
 [61] K. Sinha. K-means clustering using random matrix sparsification. In ICML, 2018.
 [62] J. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. 2015.
 [63] A. Strehl and J. Ghosh. Cluster ensembles — A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2002.
 [64] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [65] M. Sznaier and O. Camps. SoS-RSC: A sum-of-squares polynomial approach to robustifying subspace clustering algorithms. In CVPR, 2018.
 [66] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 [67] F. Torre and T. Kanade. Discriminative cluster analysis. In ICML, 2006.
 [68] Z. Tu. Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In ICCV, 2005.

 [69] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010.
 [70] J. Wang, J. Wang, J. Song, X. Xu, H. Shen, and S. Li. Optimized Cartesian k-means. IEEE Trans. Knowl. Data Eng., 2015.
 [71] Q. Wang, J. Gao, and H. Li. Grassmannian manifold optimization assisted sparse spectral clustering. In CVPR, 2017.
 [72] W. Wang, Y. Wu, C. Tang, and M. Hor. Adaptive density-based spatial clustering of applications with noise (DBSCAN) according to data. In ICMLC, 2015.
 [73] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
 [74] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel: Regularizing CNN on the loss layer. In CVPR, 2016.
 [75] J. Xu, J. Han, and F. Nie. Discriminatively embedded k-means for multi-view clustering. In CVPR, 2016.
 [76] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, 2016.

 [77] Z. Yang, J. Corander, and E. Oja. Low-rank doubly stochastic matrix decomposition for cluster analysis. Journal of Machine Learning Research, 2016.
 [78] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In NIPS, 2007.
 [79] M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie. Kernel sparse subspace clustering on symmetric positive definite manifolds. In CVPR, 2016.
 [80] C. You, D. Robinson, and R. Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, 2016.
 [81] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.
 [82] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.
 [83] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao. Latent multi-view subspace clustering. In CVPR, 2017.
 [84] D. Zhang, Y. Sun, B. Eriksson, and L. Balzano. Deep unsupervised clustering using mixture of autoencoders. CoRR, 2017.
 [85] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where autoencoders. 2015.
 [86] K. Zhao, W. Chu, and A. Martinez. Learning facial action units from web images with scalable weakly supervised clustering. In CVPR, 2018.
 [87] P. Zhou, Y. Hou, and J. Feng. Deep adversarial subspace clustering. In CVPR, 2018.
 [88] V. Zografos, L. Ellis, and R. Mester. Discriminative subspace clustering. In CVPR, 2013.