
Deep k-Means: Jointly Clustering with k-Means and Learning Representations

We study in this paper the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so when the two tasks are performed jointly. We propose here such an approach for k-Means clustering, based on a continuous reparametrization of the objective function that leads to a truly joint solution. The behavior of our approach is illustrated on various datasets, showing its efficacy in learning representations for objects while clustering them.



1 Introduction

Clustering is a long-standing problem in the machine learning and data mining fields, and has accordingly fostered abundant research. Traditional clustering methods, e.g., k-Means MacQueen1967 and Gaussian Mixture Models (GMMs) Bishop2006, fully rely on the original data representations and may then be ineffective when the data points (e.g., images and text documents) live in a high-dimensional space, a problem commonly known as the curse of dimensionality. Significant progress has been made in the last decade or so to learn better, low-dimensional data representations. The most successful techniques to achieve such high-quality representations rely on deep neural networks (DNNs), which apply successive non-linear transformations to the data in order to obtain increasingly high-level features. Auto-encoders (AEs) are a special instance of DNNs which are trained to embed the data into a (usually dense and low-dimensional) vector at the bottleneck of the network, and then attempt to reconstruct the input based on this vector. The appeal of AEs lies in the fact that they are able to learn representations in a fully unsupervised way. The representation learning breakthrough enabled by DNNs spurred the recent development of numerous deep clustering approaches which aim at jointly learning the data points’ representations as well as their cluster assignments.

In this study, we specifically focus on the k-Means-related deep clustering problem. Contrary to previous approaches that alternate between continuous gradient updates and discrete cluster assignment steps Yang2017, we show here that one can solely rely on gradient updates to learn, truly jointly, representations and clustering parameters. This ultimately leads to a better deep k-Means method which is also more scalable, as it can fully benefit from the efficiency of stochastic gradient descent (SGD). In addition, we perform a careful comparison of different methods by (a) relying on the same auto-encoders, as the choice of auto-encoders impacts the results obtained, (b) tuning the hyperparameters of each method on a small validation set, instead of setting them without clear criteria, and (c) enforcing, whenever possible, that the same initialization and sequence of SGD minibatches are used by the different methods. The last point is crucial when comparing methods, as these two factors play an important role and the variance of each method is usually not negligible.

2 Related work

In the wake of the groundbreaking results obtained by DNNs in computer vision, several deep clustering algorithms were specifically designed for image clustering Yang2016 ; Chang2017 ; Dizaji2017 ; Hu2017 ; Hsu2018. These works have in common the exploitation of Convolutional Neural Networks (CNNs), which extensively contributed to last decade’s significant advances in computer vision. Inspired by agglomerative clustering, Yang2016 proposed a recurrent process which successively merges clusters and learns image representations based on CNNs. In Chang2017, the clustering problem is formulated as binary pairwise-classification so as to identify the pairs of images which should belong to the same cluster. Due to the unsupervised nature of clustering, the CNN-based classifier in this approach is only trained on noisily labeled examples obtained by selecting increasingly difficult samples in a curriculum learning fashion. Dizaji2017 jointly trained a CNN auto-encoder and a multinomial logistic regression model applied to the AE’s latent space. Similarly, Hsu2018 alternate between representation learning and clustering, where mini-batch k-Means is utilized as the clustering component. Differently from these works, Hu2017 proposed an information-theoretic framework based on data augmentation to learn discrete representations, which may be applied to clustering or hash learning. Although these different algorithms obtained state-of-the-art results on image clustering Aljalbout2018, their ability to generalize to other types of data (e.g., text documents) is not guaranteed due to their reliance on essentially image-specific techniques, namely Convolutional Neural Network architectures and data augmentation.

Nonetheless, many general-purpose (non-image-specific) approaches to deep clustering have also been recently designed Huang2014 ; Peng2016 ; Xie2016 ; Dilokthanakul2017 ; Guo2017b ; Hu2017 ; Ji2017 ; Jiang2017 ; Peng2017 ; Yang2017. Generative models were proposed in Dilokthanakul2017 ; Jiang2017 which combine variational AEs and GMMs to perform clustering. Alternatively, Peng2016 ; Peng2017 ; Ji2017 framed deep clustering as a subspace clustering problem in which the mapping from the original data space to a low-dimensional subspace is learned by a DNN. Xie2016 defined the Deep Embedded Clustering (DEC) method which simultaneously updates the data points’ representations, initialized from a pre-trained AE, and cluster centers. DEC uses soft assignments which are optimized to match stricter assignments through a Kullback-Leibler divergence loss. IDEC was subsequently proposed in Guo2017b as an improvement to DEC by integrating the AE’s reconstruction error in the objective function.

Few approaches were directly influenced by k-Means clustering Huang2014 ; Yang2017. The Deep Embedding Network (DEN) model Huang2014 first learns representations from an AE while enforcing locality-preserving constraints and group sparsity; clusters are then obtained by simply applying k-Means to these representations. Yet, as representation learning is decoupled from clustering, the performance is not as good as the one obtained by methods that rely on a joint approach. Besides Hsu2018, mentioned before in the context of images, the only study, to our knowledge, that directly addresses the problem of jointly learning representations and clustering with k-Means (and not an approximation of it) is the Deep Clustering Network (DCN) approach Yang2017. However, as in Hsu2018, DCN alternately learns (rather than jointly learns) the object representations, the cluster centroids and the cluster assignments, the latter being based on discrete optimization steps which cannot benefit from the efficiency of stochastic gradient descent. The approach proposed here, entitled Deep k-Means (DKM), addresses this problem.

3 Deep k-Means

In the remainder, $x$ denotes an object from a set $\mathcal{X}$ of objects to be clustered. $\mathbb{R}^p$ represents the space in which learned data representations are to be embedded. $K$ is the number of clusters to be obtained, $\mathbf{r}_k \in \mathbb{R}^p$ the representative of cluster $k$, $1 \le k \le K$, and $\mathcal{R} = \{\mathbf{r}_1, \ldots, \mathbf{r}_K\}$ the set of representatives. Functions $f$ and $g$ define some distance in $\mathbb{R}^p$ which are assumed to be fully differentiable wrt their variables. For any vector $\mathbf{y} \in \mathbb{R}^p$, $c_f(\mathbf{y}; \mathcal{R})$ gives the closest representative of $\mathbf{y}$ according to $f$.

The deep k-Means problem takes the following form:

$$\min_{\mathcal{R}, \theta} \sum_{x \in \mathcal{X}} g(x, A(x; \theta)) + \lambda \, f(h_\theta(x), c_f(h_\theta(x); \mathcal{R})) \qquad (1)$$

$g$ measures the error between an object $x$ and its reconstruction $A(x; \theta)$ provided by an auto-encoder, $\theta$ representing the set of the auto-encoder’s parameters. A regularization term on $\theta$ can be included in the definition of $g$. However, as most auto-encoders do not use regularization, we dispense with such a term here. $h_\theta(x)$ denotes the representation of $x$ in $\mathbb{R}^p$ output by the AE’s encoder part, and $f(h_\theta(x), c_f(h_\theta(x); \mathcal{R}))$ is the clustering loss corresponding to the k-Means objective function in the embedding space. Finally, $\lambda$ in Problem (1) regulates the trade-off between seeking good representations for $x$ (i.e., representations that are faithful to the original examples) and representations that are useful for clustering purposes. Similar optimization problems can be formulated when $f$ and $g$ are similarity functions or a mix of similarity and distance functions. The approach proposed here directly applies to such cases.
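As a rough illustration of Problem (1), the sketch below evaluates the objective for a toy dataset. It assumes that both distance functions are instantiated with the squared Euclidean distance, and the names `deep_kmeans_objective`, `encode` and `decode` are hypothetical placeholders for the encoder and decoder halves of an auto-encoder (here replaced by the identity map); none of this comes from the authors' code.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def sq_euclidean(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed choice for both losses)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def closest_representative(y: Vector, reps: List[Vector]) -> Vector:
    """Return the representative closest to y (the hard k-Means assignment)."""
    return min(reps, key=lambda r: sq_euclidean(y, r))


def deep_kmeans_objective(
    xs: List[Vector],
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector], Vector],
    reps: List[Vector],
    lam: float,
) -> float:
    """Sum over objects of reconstruction error + lam * clustering loss."""
    total = 0.0
    for x in xs:
        h = encode(x)                                   # embedding of x
        recon = sq_euclidean(x, decode(h))              # reconstruction error
        clust = sq_euclidean(h, closest_representative(h, reps))
        total += recon + lam * clust
    return total


# Toy usage: an identity "auto-encoder" reconstructs perfectly,
# so only the clustering term contributes.
xs = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
reps = [[0.0, 0.5], [4.0, 4.0]]
identity = lambda v: list(v)
print(deep_kmeans_objective(xs, identity, identity, reps, lam=1.0))  # 0.5
```

The `min` inside `closest_representative` is the non-differentiable step that the continuous generalization of Section 3.1 replaces.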

Figure 1 illustrates the overall framework retained in this study with $f$ and $g$ both based on the Euclidean distance. The closeness term in the clustering loss will be further clarified below.

Figure 1: Overview of the proposed Deep k-Means approach instantiated with losses based on the Euclidean distance.

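3.1 Continuous generalization of Deep k-Means

We now introduce a parameterized version of the above problem that constitutes a continuous generalization, by which we mean that all functions considered are continuous wrt the introduced parameter. (Note that, independently from this work, a similar relaxation has been previously proposed in Agustsson2017, wherein soft-to-hard quantization is performed on an embedding space learned by an AE for compression. However, given the different nature of the goal here, namely clustering, our proposed learning framework substantially differs from theirs.) To do so, we first note that the clustering objective function can be rewritten as $f(h_\theta(x), c_f(h_\theta(x); \mathcal{R})) = \sum_{k=1}^{K} f_k(h_\theta(x); \mathcal{R})$ with:

$$f_k(h_\theta(x); \mathcal{R}) = \begin{cases} f(h_\theta(x), \mathbf{r}_k) & \text{if } \mathbf{r}_k = c_f(h_\theta(x); \mathcal{R}) \\ 0 & \text{otherwise} \end{cases}$$

Let us now assume that we know some function $G_{k,f}(h_\theta(x), \alpha; \mathcal{R})$ such that:

  (i) $G_{k,f}$ is differentiable wrt $\theta$ and $\mathcal{R}$, and continuous wrt $\alpha$ (differentiability wrt $\mathcal{R}$ means differentiability wrt all dimensions of $\mathbf{r}_k$, $1 \le k \le K$);

  (ii) there exists $\alpha_0 \in \mathbb{R} \cup \{-\infty, +\infty\}$ such that:

$$\lim_{\alpha \to \alpha_0} G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \begin{cases} 1 & \text{if } \mathbf{r}_k = c_f(h_\theta(x); \mathcal{R}) \\ 0 & \text{otherwise} \end{cases}$$

Then, one has, $\forall x \in \mathcal{X}$: $\lim_{\alpha \to \alpha_0} f(h_\theta(x), \mathbf{r}_k) \, G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = f_k(h_\theta(x); \mathcal{R})$, showing that the problem in (1) is equivalent to:

$$\min_{\mathcal{R}, \theta} \lim_{\alpha \to \alpha_0} \sum_{x \in \mathcal{X}} F(x, \alpha; \theta, \mathcal{R}), \quad F(x, \alpha; \theta, \mathcal{R}) = g(x, A(x; \theta)) + \lambda \sum_{k=1}^{K} f(h_\theta(x), \mathbf{r}_k) \, G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) \qquad (2)$$

All functions in the above formulation are fully differentiable wrt both $\theta$ and $\mathcal{R}$. One can thus estimate $\theta$ and $\mathcal{R}$ through a simple, joint optimization based on stochastic gradient descent (SGD) for a given $\alpha$:

$$(\theta, \mathcal{R}) \leftarrow (\theta, \mathcal{R}) - \eta \left( \frac{1}{|\tilde{\mathcal{X}}|} \nabla_{(\theta, \mathcal{R})} \sum_{x \in \tilde{\mathcal{X}}} F(x, \alpha; \theta, \mathcal{R}) \right) \qquad (3)$$

with $\eta$ the learning rate and $\tilde{\mathcal{X}}$ a random mini-batch of $\mathcal{X}$.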

3.2 Choice of $G_{k,f}$

Several choices are possible for $G_{k,f}$. A simple choice, used throughout this study, is based on a parameterized softmax function. The fact that the softmax function can be used as a differentiable surrogate to argmax or argmin is well known and has been applied in different contexts, as in the recently proposed Gumbel-softmax distribution employed to approximate categorical samples Jang2017 ; Maddison2017. The parameterized softmax function which we adopted takes the following form:

$$G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \frac{e^{-\alpha f(h_\theta(x), \mathbf{r}_k)}}{\sum_{k'=1}^{K} e^{-\alpha f(h_\theta(x), \mathbf{r}_{k'})}} \qquad (4)$$

with $\alpha \in [0, +\infty)$. The function $G_{k,f}$ defined by Eq. 4 is differentiable wrt $\theta$ and $\mathcal{R}$, and continuous wrt $\alpha$ (condition (i)), as it is a composition of functions differentiable wrt these variables. Furthermore, one has:

Property 3.1

(condition (ii)) If $c_f(h_\theta(x); \mathcal{R})$ is unique for all $x \in \mathcal{X}$, then:

$$\lim_{\alpha \to +\infty} G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \begin{cases} 1 & \text{if } \mathbf{r}_k = c_f(h_\theta(x); \mathcal{R}) \\ 0 & \text{otherwise} \end{cases}$$

The proof, which is straightforward, is detailed in the Supplementary Material.

The assumption that $c_f(h_\theta(x); \mathcal{R})$ is unique for all objects is necessary for $G_{k,f}$ to take on binary values in the limit; it does not need to hold for small values of $\alpha$. In the unlikely event that the above assumption does not hold for some $x$ and large $\alpha$, one can slightly perturb the representatives equidistant to $x$ prior to updating them. We have never encountered this situation in practice.

Finally, Eq. 4 defines a valid (according to conditions (i) and (ii)) function $G_{k,f}$ that can be used to solve the deep k-Means problem (2). We adopt this function in the remainder of this study.
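To make the role of the softmax parameter concrete, the following sketch computes a softmax over negated, scaled distances for a single embedded point and prints the assignment weights for increasing values of the inverse temperature. The function name `soft_assignments` and the toy distances are illustrative assumptions; subtracting the maximum score before exponentiating is a standard numerical-stability device, not something stated in the paper.

```python
import math
from typing import List


def soft_assignments(distances: List[float], alpha: float) -> List[float]:
    """Softmax over -alpha * distance, one weight per cluster representative."""
    scores = [-alpha * d for d in distances]
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Distances from one embedded point to three representatives (toy values).
distances = [0.2, 1.0, 3.0]
for alpha in (0.0, 1.0, 10.0, 1000.0):
    weights = soft_assignments(distances, alpha)
    print(alpha, [round(w, 3) for w in weights])
```

With an inverse temperature of 0 the weights are uniform, and as it grows the weight concentrates on the closest representative, which is the limiting behavior described by Property 3.1 when the closest representative is unique.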

3.3 Choice of $\alpha$

The parameter $\alpha$ can be defined in different ways. Indeed, $\alpha$ can play the role of an inverse temperature such that, when $\alpha$ is 0, each data point in the embedding space is equally close, through $G_{k,f}$, to all the representatives (corresponding to a completely soft assignment), whereas when $\alpha$ is $+\infty$, the assignment is hard. In the first case, for the deep k-Means optimization problem, all representatives are equal and set to the point $\mathbf{r} \in \mathbb{R}^p$ that minimizes $\sum_{x \in \mathcal{X}} f(h_\theta(x), \mathbf{r})$. In the second case, the solution corresponds to exactly performing k-Means in the embedding space, the latter being learned jointly with the clustering process. Following a deterministic annealing approach rose-90, one can start with a low value of $\alpha$ (close to 0), and gradually increase it until a sufficiently large value is obtained. At first, representatives are randomly initialized. As the problem is smooth when $\alpha$ is close to 0, different initializations are likely to lead to the same local minimum in the first iteration; this local minimum is used for the new values of the representatives for the second iteration, and so on. The continuity of $G_{k,f}$ wrt $\alpha$ implies that, provided the increment in $\alpha$ is not too large, one evolves smoothly from the initial local minimum to the last one. In the above deterministic annealing scheme, $\alpha$ allows one to initialize cluster representatives. The initialization of the auto-encoder can as well have an important impact on the results obtained, and prior studies (e.g., Huang2014 ; Xie2016 ; Guo2017b ; Yang2017) have relied on pretraining for this matter. In such a case, one can choose a high value for $\alpha$ to directly obtain the behavior of the k-Means algorithm in the embedding space after pretraining. We evaluate both approaches in our experiments.

Algorithm 1 summarizes the deep k-Means algorithm for the deterministic annealing scheme, where $m_\alpha$ (respectively $M_\alpha$) denotes the minimum (respectively maximum) value of $\alpha$, and $T$ is the number of epochs per each value of $\alpha$ for the stochastic gradient updates. Even though $M_\alpha$ is finite, it can be set sufficiently large to obtain in practice a hard assignment to representatives. Alternatively, when using pretraining, one sets $m_\alpha = M_\alpha$ (i.e., a constant $\alpha$ is used).

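Input: data $\mathcal{X}$, number of clusters $K$, balancing parameter $\lambda$, scheme for $\alpha$, number of epochs $T$, number of minibatches $N$, learning rate $\eta$
Output: auto-encoder parameters $\theta$, cluster representatives $\mathcal{R}$
Initialize $\theta$ and $\mathbf{r}_k$, $1 \le k \le K$ (randomly or through pretraining)
for $\alpha = m_\alpha$ to $M_\alpha$ do  # inverse temperature
       for $t = 1$ to $T$ do  # epochs per $\alpha$
             for $n = 1$ to $N$ do  # minibatches
                   Draw a minibatch $\tilde{\mathcal{X}} \subset \mathcal{X}$
                   Update $(\theta, \mathcal{R})$ using SGD (Eq. 3)
             end for
       end for
end for
Algorithm 1 Deep k-Means algorithm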
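The loop structure of Algorithm 1 can be sketched in plain Python under strong simplifying assumptions: the encoder is taken to be a fixed identity map (so only the cluster representatives are updated, with no auto-encoder and no reconstruction term), the distance is the squared Euclidean distance, and the gradient of the relaxed clustering loss is written out by hand. All function names, the toy data and the hyperparameter values are hypothetical; this is not the authors' TensorFlow implementation.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_assign(dists, alpha):
    """Parameterized softmax over -alpha * distance (numerically stable)."""
    scores = [-alpha * d for d in dists]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def relaxed_loss(h, reps, alpha):
    """Relaxed clustering loss for one point: sum_k f_k * G_k."""
    dists = [sq_dist(h, r) for r in reps]
    return sum(d * g for d, g in zip(dists, soft_assign(dists, alpha)))


def relaxed_grad(h, reps, alpha):
    """Gradient of relaxed_loss wrt each representative.

    d loss / d r_j = 2 * G_j * (1 - alpha * (f_j - loss)) * (r_j - h).
    """
    dists = [sq_dist(h, r) for r in reps]
    g = soft_assign(dists, alpha)
    loss = sum(d * gk for d, gk in zip(dists, g))
    grads = []
    for r, d, gk in zip(reps, dists, g):
        coeff = 2.0 * gk * (1.0 - alpha * (d - loss))
        grads.append([coeff * (ri - hi) for ri, hi in zip(r, h)])
    return grads


def anneal_representatives(points, k, alphas, epochs_per_alpha,
                           batch_size, lr, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    reps = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(k)]
    for alpha in alphas:                          # inverse temperature
        for _ in range(epochs_per_alpha):         # epochs per alpha
            order = list(range(len(points)))
            rng.shuffle(order)
            for start in range(0, len(order), batch_size):   # minibatches
                batch = [points[i] for i in order[start:start + batch_size]]
                avg = [[0.0] * dim for _ in range(k)]
                for h in batch:
                    for j, gj in enumerate(relaxed_grad(h, reps, alpha)):
                        for c in range(dim):
                            avg[j][c] += gj[c] / len(batch)
                for j in range(k):                # SGD step on representatives
                    for c in range(dim):
                        reps[j][c] -= lr * avg[j][c]
    return reps


# Toy data: two Gaussian blobs standing in for embedded points.
rng = random.Random(1)
points = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(30)]
points += [[rng.gauss(3, 0.3), rng.gauss(3, 0.3)] for _ in range(30)]
reps = anneal_representatives(points, k=2, alphas=[0.1, 0.5, 2.0, 10.0],
                              epochs_per_alpha=20, batch_size=10, lr=0.05)
print(reps)
```

In the full method the same minibatch gradient step would also update the auto-encoder parameters, which is what makes the optimization joint; this sketch only shows the nesting of the three loops and the representative update.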

3.4 Shrinking phenomenon

The loss functions defined in Problems (1) and (2), as well as the loss used in the DCN approach Yang2017 and potentially in other approaches, might in theory induce a degenerate behavior in the learning procedure. Indeed, the clustering loss could be made arbitrarily small while preserving the reconstruction capacity of the AE by “shrinking” the subspace where the object embeddings and the cluster representatives live, thus reducing the distance between embeddings and representatives. We tested L2 regularization on the auto-encoder parameters to alleviate this potential issue by preventing the weights from arbitrarily shrinking the embedding space (indeed, by symmetry of the encoder and decoder, having small weights in the encoder, leading to shrinking, requires having large weights in the decoder for reconstruction; L2 regularization penalizes such large weights). However, we did not observe any difference in our experiments with the case where no regularization is used, suggesting that the shrinking problem may not be important in practice. For the sake of simplicity, we dispense with this regularization in the remainder.
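The shrinking argument can be checked numerically on a toy example. The sketch below assumes a squared Euclidean clustering loss and uses made-up embeddings and representatives; scaling both by a factor s multiplies the clustering loss by s squared while leaving the cluster assignments unchanged.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_loss(embeddings, reps):
    """Hard k-Means loss: squared distance to the closest representative."""
    return sum(min(sq_dist(h, r) for r in reps) for h in embeddings)


def scale(vectors, s):
    """Shrink (or expand) every vector by the same factor s."""
    return [[s * v for v in vec] for vec in vectors]


embeddings = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
reps = [[1.0, 0.0], [10.0, 9.0]]

for s in (1.0, 0.5, 0.1):
    loss = clustering_loss(scale(embeddings, s), scale(reps, s))
    print(f"scale={s}: clustering loss={loss:.4f}")
```

A decoder that multiplies its input by 1/s would undo the shrinkage, which is why the reconstruction error alone does not prevent this behavior in theory.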

4 Experiments

In order to evaluate the clustering results of our approach, we conducted experiments on different datasets and compared it against state-of-the-art standard and k-Means-related deep clustering models.

4.1 Datasets

The datasets used in the experiments are standard clustering benchmark collections. We considered both image and text datasets to demonstrate the general applicability of our approach. Image datasets consist of MNIST (70,000 images, $28 \times 28$ pixels, 10 classes) and USPS (9,298 images, $16 \times 16$ pixels, 10 classes), which both contain hand-written digit images. We reshaped the images to one-dimensional vectors and normalized the pixel intensity levels (between 0 and 1 for MNIST, and between -1 and 1 for USPS). The text collections we considered are the 20 Newsgroups dataset (hereafter, 20NEWS) and the RCV1-v2 dataset (hereafter, RCV1). For 20NEWS, we used the whole dataset comprising 18,846 documents labeled into 20 different classes. Similarly to Xie2016 ; Guo2017b, we sampled from the full RCV1-v2 collection a random subset of 10,000 documents, each of which pertains to only one of the four largest classes. Because of the text datasets’ sparsity, and as proposed in Xie2016, we selected the 2000 words with the highest tf-idf values to represent each document.

4.2 Baselines and deep k-Means variants

Clustering models may use different strategies and different clustering losses, leading to different properties. As our goal in this work is to study the k-Means clustering algorithm in embedding spaces, we focus on the family of k-Means-related models and compare our approach against state-of-the-art models from this family, using both standard and deep clustering models. For the standard clustering methods, we used: the k-Means clustering approach MacQueen1967 with initial cluster center selection Arthur2007, denoted KM; and an approach denoted AE-KM, in which dimensionality reduction is first performed using an auto-encoder, followed by k-Means applied to the learned representations. (We did not consider variational auto-encoders Kingma2014 in our baselines, as Jiang2017 previously compared variational AE + GMM and “standard” AE + GMM, and found that the latter consistently outperformed the former.) We compared as well against the only previous, “true” deep clustering k-Means-related method, the Deep Clustering Network (DCN) approach described in Yang2017. DCN is, to the best of our knowledge, the current most competitive clustering algorithm among k-Means-related models.

In addition, we consider here the Improved Deep Embedded Clustering (IDEC) model Guo2017b as an additional baseline. IDEC is a general-purpose state-of-the-art approach in the deep clustering family. It is an improved version of the DEC model Xie2016 and thus constitutes a strong baseline. For both DCN and IDEC, we studied two variants: with pretraining (DCN^p and IDEC^p) and without pretraining (DCN^np and IDEC^np). The pretraining we performed here simply consists in initializing the weights by training the auto-encoder on the data to minimize the reconstruction loss in an end-to-end fashion; greedy layer-wise pretraining Bengio2006 did not lead to improved clustering in our preliminary experiments.

The proposed Deep k-Means (DKM) is, as DCN, a “true” k-Means approach in the embedding space; it jointly learns AE-based representations and relaxes the k-Means problem by introducing a parameterized softmax as a differentiable surrogate to the k-Means argmin. In the experiments, we considered two variants of this approach. DKM^a implements an annealing strategy for the inverse temperature $\alpha$ and does not rely on pretraining. The scheme we used for the evolution of the inverse temperature in DKM^a is given by the following recursive sequence: $\alpha_{n+1} = 2^{1/\log(n)^2} \times \alpha_n$ with $m_\alpha = \alpha_1 = 0.1$. The rationale behind the choice of this scheme is that we want to spend more iterations on smaller $\alpha$ values and fewer on larger values while preserving a gentle slope. Alternatively, we studied the variant DKM^p, which is initialized by pretraining an auto-encoder and then follows Algorithm 1 with a constant $\alpha$ such that $m_\alpha = M_\alpha = 1000$. Such a high $\alpha$ is equivalent to having hard cluster assignments while maintaining the differentiability of the optimization problem.
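The annealing schedule can be generated with a few lines of Python. The sketch assumes the recursion multiplies the previous inverse temperature by 2 raised to 1/log(n)^2 and starts from 0.1; because log(1) = 0, the index inside the logarithm is shifted by one here, which is an indexing assumption of this sketch rather than something the text specifies. The function name `annealing_schedule` is likewise hypothetical, and the default of 40 terms follows the number used in the experimental setup.

```python
import math
from typing import List


def annealing_schedule(n_terms: int = 40, alpha_1: float = 0.1) -> List[float]:
    """Inverse-temperature sequence with geometrically shrinking growth factors."""
    alphas = [alpha_1]
    for n in range(1, n_terms):
        # Assumption: log(n + 1) is used so that the first factor is finite.
        factor = 2.0 ** (1.0 / math.log(n + 1) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas


schedule = annealing_schedule()
print([round(a, 3) for a in schedule[:5]])
print(round(schedule[-1], 3))
```

The growth factor is large for the first few terms and approaches 1 afterwards, so the sequence rises quickly away from the fully soft regime and then increases gently.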

Implementation details.

For IDEC, we used the Keras code shared by their authors. (We used this version rather than the alternative one, as only the former enables auto-encoder pretraining in a non-layer-wise fashion.) Our own code for DKM is based on TensorFlow. To enable full control of the comparison between DCN and DKM, DCN being the closest competitor to DKM, we also re-implemented DCN in TensorFlow. The code for both DKM and DCN is available online.

Choice of $f$ and $g$.

The functions $f$ and $g$ in Problem (1) define which distance functions are used for the clustering loss and reconstruction error, respectively. In this study, both $f$ and $g$ are simply instantiated with the Euclidean distance on all datasets. For the sake of comprehensiveness, we report in the supplementary material results for the cosine distance on 20NEWS.

4.3 Experimental setup

Auto-encoder description and training details.

The auto-encoder we used in the experiments is the same across all datasets and is borrowed from previous deep clustering studies Xie2016 ; Guo2017b. Its encoder is a fully-connected multilayer perceptron with dimensions $d$-500-500-2000-$K$, where $d$ is the original data space dimension and $K$ is the number of clusters to obtain. The decoder is a mirrored version of the encoder. A ReLU activation function is applied to the output of every layer before it is fed to the next layer, except for the layer preceding the embedding layer and the layer preceding the output layer. For the sake of simplicity, we did not rely on any complementary training or regularization strategies such as batch normalization or dropout. The auto-encoder weights are initialized following the Xavier scheme Glorot2010. For all deep clustering approaches, the training is based on the Adam optimizer Kingma2015 with standard learning rate $\eta = 0.001$ and momentum rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The minibatch size is set to 256 on all datasets following Guo2017b. We emphasize that we chose exactly the same training configuration for all models to facilitate a fair comparison.
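The layer layout described above can be written down explicitly. The sketch below lists each fully-connected layer of the encoder and mirrored decoder with its activation, and counts trainable parameters. It assumes one bias vector per layer and, for the usage example, an input dimension of 784 (flattened 28 x 28 MNIST images) with 10 clusters; the helper names `autoencoder_layers` and `count_parameters` are hypothetical.

```python
from typing import List, Tuple

Layer = Tuple[int, int, str]  # (input size, output size, activation)


def autoencoder_layers(d: int, k: int) -> List[Layer]:
    """Encoder d-500-500-2000-k followed by the mirrored decoder."""
    enc_dims = [d, 500, 500, 2000, k]
    dec_dims = enc_dims[::-1]
    layers: List[Layer] = []
    for dims in (enc_dims, dec_dims):
        for i in range(len(dims) - 1):
            # The embedding layer and the output layer are left linear;
            # every other layer is followed by a ReLU.
            is_last = i == len(dims) - 2
            layers.append((dims[i], dims[i + 1], "linear" if is_last else "relu"))
    return layers


def count_parameters(layers: List[Layer]) -> int:
    """Weights plus biases (assumption: one bias per output unit)."""
    return sum(n_in * n_out + n_out for n_in, n_out, _ in layers)


layers = autoencoder_layers(d=784, k=10)
for n_in, n_out, act in layers:
    print(f"{n_in:>5} -> {n_out:<5} {act}")
print(count_parameters(layers))  # 3330794
```

Note that the embedding dimension equals the number of clusters in this architecture, so the representation space changes with the dataset.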

The number of pretraining epochs is set to 50 for all models relying on pretraining. The number of fine-tuning epochs for DCN^p and IDEC^p is fixed to 50 (or equivalently in terms of iterations: 50 times the number of minibatches). We set the number of training epochs for DCN^np and IDEC^np to 200. For DKM^a, we used the 40 terms of the sequence described in Section 4.2 as the annealing scheme and performed 5 epochs for each term (i.e., 200 epochs in total). DKM^p is fine-tuned by performing 100 epochs with constant $\alpha = 1000$. The cluster representatives are initialized randomly from a uniform distribution $\mathcal{U}(-1, 1)$ for models without pretraining. In case of pretraining, the cluster representatives are initialized by applying k-Means to the pretrained embedding space.

Hyperparameter selection.

The hyperparameters $\lambda$ for DCN and DKM and $\gamma$ for IDEC, which define the trade-off between the reconstruction and the clustering error in the loss function, were determined by performing a line search over a set of candidate values. To do so, we randomly split each dataset into a validation set (10% of the data) and a test set (90%). Each model is trained on the whole data and only the validation set labels are leveraged in the line search to identify the optimal $\lambda$ or $\gamma$ (optimality is measured with respect to the clustering accuracy metric). We provide the validation-optimal $\lambda$ and $\gamma$ obtained for each model and dataset in the supplementary material. The performance reported in the following sections corresponds to the evaluation performed only on the held-out test set.

While one might argue that such procedure affects the unsupervised nature of the clustering approaches, we believe that a clear and transparent hyperparameter selection methodology is preferable to a vague or hidden one. Moreover, although we did not explore such possibility in this study, it might be possible to define this trade-off hyperparameter in a data-driven way.

Experimental protocol.

We observed in pilot experiments that the clustering performance of the different models is subject to non-negligible variance from one run to another. This variance is due to the randomness in the initialization and in the minibatch sampling for the stochastic optimizer. When pretraining is used, the variance of the general pretraining phase and that of the model-specific fine-tuning phase add up, which makes it difficult to draw any confident conclusion about the clustering ability of a model. To alleviate this issue, we compared the different approaches using seeded runs whenever this was possible. This has the advantage of removing the variance of pretraining, as seeds guarantee exactly the same results at the end of pretraining (since the same pretraining is performed for the different models). Additionally, it ensures that the same sequence of minibatches will be sampled. In practice, we used seeds for the models implemented in TensorFlow (KM, AE-KM, DCN and DKM). Because of implementation differences, seeds could not give the same pretraining states in the Keras-based IDEC. All in all, we randomly selected 10 seeds and for each model performed one run per seed. Additionally, to account for the remaining variance and to report statistical significance, we performed a Student’s t-test from the 10 collected samples (i.e., runs).

Model     MNIST                 USPS                  20NEWS                RCV1
          ACC        NMI        ACC        NMI        ACC        NMI        ACC        NMI
KM        53.5±0.3   49.8±0.5   67.3±0.1   61.4±0.1   23.2±1.5   21.6±1.8   50.8±2.9   31.3±5.4
AE-KM     80.8±1.8   75.2±1.1   72.9±0.8   71.7±1.2   49.0±2.9   44.5±1.5   56.7±3.6   31.5±4.3
Deep clustering approaches without pretraining
DCN^np    34.8±3.0   18.1±1.0   36.4±3.5   16.9±1.3   17.9±1.0    9.8±0.5   41.3±4.0    6.9±1.8
DKM^a     82.3±3.2   78.0±1.9   75.5±6.8   73.0±2.3   44.8±2.4   42.8±1.1   53.8±5.5   28.0±5.8
Deep clustering approaches with pretraining
DCN^p     81.1±1.9   75.7±1.1   73.0±0.8   71.9±1.2   49.2±2.9   44.7±1.5   56.7±3.6   31.6±4.3
DKM^p     84.0±2.2   79.6±0.9   75.7±1.3   77.6±1.1   51.2±2.8   46.7±1.2   58.3±3.8   33.1±4.9
Table 1: Clustering results of the k-Means-related methods. Performance is measured in terms of NMI and ACC (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. Bold (resp. underlined) values correspond to results with no significant difference ($p > 0.05$) to the best approach with (resp. without) pretraining for each dataset/metric pair.

Model     MNIST                 USPS                  20NEWS                RCV1
          ACC        NMI        ACC        NMI        ACC        NMI        ACC        NMI
Deep clustering approaches without pretraining
IDEC^np   61.8±3.0   62.4±1.6   53.9±5.1   50.0±3.8   22.3±1.5   22.3±1.5   56.7±5.3   31.4±2.8
DKM^a     82.3±3.2   78.0±1.9   75.5±6.8   73.0±2.3   44.8±2.4   42.8±1.1   53.8±5.5   28.0±5.8
Deep clustering approaches with pretraining
IDEC^p    85.7±2.4   86.4±1.0   75.2±0.5   74.9±0.6   40.5±1.3   38.2±1.0   59.5±5.7   34.7±5.0
DKM^p     84.0±2.2   79.6±0.9   75.7±1.3   77.6±1.1   51.2±2.8   46.7±1.2   58.3±3.8   33.1±4.9
Table 2: Clustering results of the DKM and IDEC methods. Performance is measured in terms of NMI and ACC (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. Bold (resp. underlined) values correspond to results with no significant difference ($p > 0.05$) to the best approach with (resp. without) pretraining for each dataset/metric pair.

4.4 Clustering results

The results for the evaluation of the k-Means-related clustering methods on the different benchmark datasets are summarized in Table 1. The clustering performance is evaluated with respect to two standard measures Cai2011: Normalized Mutual Information (NMI) and the clustering accuracy (ACC). We report for each dataset/method pair the average and standard deviation of these metrics computed over 10 runs and conduct significance testing as previously described in the experimental protocol. The bold (resp. underlined) values in each column of Table 1 correspond to results with no statistically significant difference ($p > 0.05$) to the best result with (resp. without) pretraining for the corresponding dataset/metric.
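Clustering accuracy requires matching predicted cluster indices to ground-truth classes, since cluster labels are arbitrary. The sketch below finds the best one-to-one mapping by brute force over permutations, which is only practical for a handful of clusters; in practice the same matching is usually obtained with the Hungarian algorithm. The function name and toy labels are illustrative assumptions, not taken from the paper's evaluation code.

```python
from itertools import permutations
from typing import List


def clustering_accuracy(y_true: List[int], y_pred: List[int], k: int) -> float:
    """Fraction of points correctly labeled under the best cluster-to-class map."""
    # counts[c][t] = number of points in predicted cluster c with true class t
    counts = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        counts[p][t] += 1
    # Try every one-to-one mapping cluster c -> class perm[c] (k! candidates).
    best = max(
        sum(counts[c][perm[c]] for c in range(k))
        for perm in permutations(range(k))
    )
    return best / len(y_true)


y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 0, 0, 0, 0]   # same grouping up to relabeling, one error
print(clustering_accuracy(y_true, y_pred, k=3))
```

Here the best mapping sends cluster 1 to class 0, cluster 2 to class 1 and cluster 0 to class 2, leaving a single misassigned point out of nine.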

We first observe that when no pretraining is used, DKM with annealing (DKM^a) markedly outperforms DCN^np on all datasets. DKM^a achieves clustering performance similar to that obtained by pretraining-based methods. This confirms our intuition that the proposed annealing strategy can be seen as an alternative to pretraining.

Among the approaches integrating representation learning with pretraining, the AE-KM method, which separately performs dimension reduction and k-Means clustering, overall obtains the worst results. This observation is in line with prior studies Yang2017 ; Guo2017b and underlines again the importance of jointly learning representations and clustering. We note as well that, apart from DKM, pretraining-based deep clustering approaches substantially outperform their non-pretrained counterparts, which stresses the importance of pretraining.

Furthermore, DKM^p yields significant improvements on all collections except RCV1 over DCN^p, the other “true” deep k-Means approach. In all cases, DCN^p shows performance on par with that of AE-KM. This places, to the best of our knowledge, DKM^p as the current best deep k-Means clustering method.

To further confirm DKM’s efficacy, we also compare it against IDEC, a state-of-the-art deep clustering algorithm which is not based on k-Means. We report the corresponding results in Table 2. Once again, DKM^a significantly outperforms its non-pretrained counterpart, IDEC^np, except on RCV1. We note as well that, with the exception of the NMI results on MNIST, DKM^p is always either significantly better than IDEC^p or not significantly different from it. This shows that the proposed DKM is not only the strongest k-Means-related clustering approach, but is also remarkably competitive wrt the deep clustering state of the art.

4.5 Illustration of learned representations

While the quality of the clustering results and that of the representations learned by the models are likely to be correlated, it is relevant to study to what extent learned representations are distorted to facilitate clustering. To provide a more interpretable view of the representations learned by k-Means-related deep clustering algorithms, we illustrate the embedded samples provided by AE (for comparison), DCN^p, DKM^a, and DKM^p on USPS in Figure 2 (best viewed in color). DCN^np was discarded due to its poor clustering performance. We used for that matter the t-SNE visualization method vanderMaaten2008 to project the embeddings into a 2D space. We observe that the representations for points from different clusters are clearly better separated and disentangled in DKM than in other models. This brings further support to our experimental results, which showed the superior ability of DKM to learn representations that facilitate clustering.

(a) AE
(b) DCN^p
(c) DKM^a
(d) DKM^p
Figure 2: t-SNE visualization of the embedding spaces learned on USPS.

5 Conclusion

We have presented in this paper a new approach for jointly clustering with k-Means and learning representations by considering the k-Means clustering loss as the limit of a differentiable function. To the best of our knowledge, this is the first approach that truly jointly optimizes, through simple stochastic gradient descent updates, representation and k-Means clustering losses. In addition to pretraining, which can be used in all methods, this approach can also rely on a deterministic annealing scheme for parameter initialization.

We further conducted careful comparisons with previous approaches by ensuring that the same architecture, initialization and minibatches are used. The experiments conducted on several datasets confirm the good behavior of Deep k-Means, which outperforms DCN, the current best approach for k-Means clustering in embedding spaces, on all the collections considered.


Funding: This work was supported by the French National Agency for Research through the LOCUST project [grant number ANR-15-CE23-0027]; France’s Auvergne-Rhône-Alpes region through the AISUA project [grant number 17011072 01 - 4102]. Declarations of interest: none.


  • [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. In Proceedings of NIPS, NIPS ’17, pages 1141–1151, 2017.
  • [2] E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. Clustering with Deep Learning: Taxonomy and New Methods. arXiv:1801.07648, 2018.
  • [3] D. Arthur and S. Vassilvitskii. K-Means++: The Advantages of Careful Seeding. In Proceedings of SODA, SODA ’07, pages 1027–1035, 2007.
  • [4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy Layer-Wise Training of Deep Networks. In Proceedings of NIPS, NIPS ’06, pages 153–160, 2006.
  • [5] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The Fuzzy c-Means Clustering Algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.
  • [6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [7] D. Cai, X. He, and J. Han. Locally Consistent Concept Factorization for Document Clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
  • [8] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep Adaptive Image Clustering. In Proceedings of ICCV, ICCV ’17, pages 5879–5887, 2017.
  • [9] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. arXiv:1611.02648, 2017.
  • [10] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization. In Proceedings of ICCV, ICCV ’17, pages 5736–5745, 2017.
  • [11] X. Glorot and Y. Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of AISTATS, AISTATS ’10, 2010.
  • [12] X. Guo, L. Gao, X. Liu, and J. Yin. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of IJCAI, IJCAI ’17, pages 1753–1759, 2017.
  • [13] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
  • [14] C.-C. Hsu and C.-W. Lin. CNN-Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large-Scale Image Data. IEEE Transactions on Multimedia, 20(2):421–429, 2018.
  • [15] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning Discrete Representations via Information Maximizing Self-Augmented Training. In Proceedings of ICML, ICML ’17, pages 1558–1567, 2017.
  • [16] P. Huang, Y. Huang, W. Wang, and L. Wang. Deep Embedding Network for Clustering. In Proceedings of ICPR, ICPR ’14, pages 1532–1537, 2014.
  • [17] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, ICLR ’17, 2017.
  • [18] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep Subspace Clustering Networks. In Proceedings of NIPS, NIPS ’17, pages 23–32, 2017.
  • [19] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of IJCAI, IJCAI ’17, pages 1965–1972, 2017.
  • [20] D. P. Kingma and J. L. Ba. Adam: a Method for Stochastic Optimization. In Proceedings of ICLR, ICLR ’15, 2015.
  • [21] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, ICLR ’14, 2014.
  • [22] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • [23] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
  • [24] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, ICLR ’17, 2017.
  • [25] V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of ICML, ICML ’10, pages 807–814, 2010.
  • [26] X. Peng, J. Feng, J. Lu, W.-y. Yau, and Z. Yi. Cascade Subspace Clustering. In Proceedings of AAAI, AAAI ’17, pages 2478–2484, 2017.
  • [27] X. Peng, S. Xiao, J. Feng, W. Y. Yau, and Z. Yi. Deep Subspace Clustering with Sparsity Prior. In Proceedings of IJCAI, IJCAI ’16, pages 1925–1931, 2016.
  • [28] K. Rose, E. Gurewitz, and G. Fox. A Deterministic Annealing Approach to Clustering. Pattern Recognition Letters, 11(9):589–594, 1990.
  • [29] L. van der Maaten and G. Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [30] N. X. Vinh, J. Epps, and J. Bailey. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research, 11:2837–2854, 2010.
  • [31] J. Xie, R. Girshick, and A. Farhadi. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of ICML, ICML ’16, pages 478–487, 2016.
  • [32] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of ICML, ICML ’17, pages 3861–3870, 2017.
  • [33] J. Yang, D. Parikh, and D. Batra. Joint Unsupervised Learning of Deep Representations and Image Clusters. In Proceedings of CVPR, CVPR ’16, pages 5147–5156, 2016.

Appendix A Proof of Property 1

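For the parametrized softmax $G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \dfrac{e^{-\alpha f(h_\theta(x), r_k)}}{\sum_{k'=1}^{K} e^{-\alpha f(h_\theta(x), r_{k'})}}$, we recall the following property:

Property A.1

(condition (ii)) If $c_f(h_\theta(x); \mathcal{R})$ is unique for all $x \in \mathcal{X}$, then:

$$\lim_{\alpha \to +\infty} G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \begin{cases} 1 & \text{if } r_k = c_f(h_\theta(x); \mathcal{R}) \\ 0 & \text{otherwise} \end{cases}$$

Proof: Let $r_j = c_f(h_\theta(x); \mathcal{R})$ and let us assume that $f$ is a distance. One has:

$$G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \frac{e^{-\alpha \left[ f(h_\theta(x), r_k) - f(h_\theta(x), r_j) \right]}}{1 + \sum_{k' \neq j} e^{-\alpha \left[ f(h_\theta(x), r_{k'}) - f(h_\theta(x), r_j) \right]}}$$

As $f(h_\theta(x), r_{k'}) > f(h_\theta(x), r_j)$ for all $k' \neq j$, one has:

$$\lim_{\alpha \to +\infty} e^{-\alpha \left[ f(h_\theta(x), r_{k'}) - f(h_\theta(x), r_j) \right]} = 0 \quad \text{for all } k' \neq j$$

Thus $\lim_{\alpha \to +\infty} G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = 0$ if $k \neq j$ and $\lim_{\alpha \to +\infty} G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = 1$ if $k = j$.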
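The limit stated in this property can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function names and the three distances are hypothetical, and the softmax is taken over negative distances scaled by the inverse temperature, as assumed in the proof.

```python
import math


def softmax_weights(distances, alpha):
    """Parametrized softmax over negative distances (alpha = inverse temperature)."""
    # Subtract the smallest distance before exponentiating for numerical
    # stability; this common factor cancels in the ratio.
    d_min = min(distances)
    exps = [math.exp(-alpha * (d - d_min)) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_loss(distances, alpha):
    """Softmax-weighted sum of distances; approaches min(distances) for large alpha."""
    weights = softmax_weights(distances, alpha)
    return sum(w * d for w, d in zip(weights, distances))


# Hypothetical distances from one embedded point to K = 3 representatives.
distances = [0.9, 0.4, 1.3]

for alpha in (0.0, 1.0, 10.0, 100.0):
    weights = softmax_weights(distances, alpha)
    print(alpha, [round(w, 4) for w in weights], round(relaxed_loss(distances, alpha), 4))
```

With alpha equal to 0 the weights are uniform, and as alpha grows they concentrate on the closest representative, so the weighted loss approaches the hard k-Means term.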

Appendix B Alternative choice for $G_{k,f}$

As $G_{k,f}$ plays the role of a closeness function for an object $x$ wrt representative $r_k$, membership functions used in fuzzy clustering are potential candidates for $G_{k,f}$. In particular, the membership function of the fuzzy c-Means algorithm [5] is a valid candidate according to conditions (i) and (ii). It takes the following form:

$$G_{k,f}(h_\theta(x), \alpha; \mathcal{R}) = \left( \sum_{k'=1}^{K} \left( \frac{f(h_\theta(x), r_k)}{f(h_\theta(x), r_{k'})} \right)^{\frac{2}{\alpha - 1}} \right)^{-1}$$

with $\alpha$ defined on $(1, +\infty)$ and $\alpha_0$ (condition (ii)) equal to 1. However, in addition to being slightly more complex than the parametrized softmax, this formulation presents the disadvantage that it may be undefined when a representative coincides with an object; another assumption (in addition to the uniqueness assumption) is required here to avoid such a case.
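To illustrate both the hardening behavior and the undefined case mentioned above, here is a small Python sketch of a fuzzy c-Means style membership. It assumes the standard fuzzy c-Means exponent 2/(alpha - 1), with alpha playing the role of the fuzzifier; the function name and the distances are hypothetical.

```python
def fuzzy_membership(distances, alpha):
    """Fuzzy c-Means style membership of one point to each representative.

    `alpha` must be strictly greater than 1; memberships harden as alpha
    approaches 1 from above.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be strictly greater than 1")
    if any(d == 0.0 for d in distances):
        # The ratio of distances is undefined when a representative
        # coincides with the object.
        raise ZeroDivisionError("a representative coincides with the object")
    power = 2.0 / (alpha - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** power for d_j in distances)
        for d_k in distances
    ]


# Hypothetical distances from one embedded point to K = 3 representatives.
distances = [0.9, 0.4, 1.3]

for alpha in (3.0, 2.0, 1.1, 1.01):
    print(alpha, [round(m, 4) for m in fuzzy_membership(distances, alpha)])
```

The explicit check on zero distances makes visible the extra assumption that the softmax formulation does not need.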

Appendix C Annealing scheme for $\alpha$ in DKM

The scheme we used for the evolution of the inverse temperature $\alpha$ in DKM is given by the following recursive sequence: with . The first 40 terms of the sequence are plotted in Figure 3.

Figure 3: Annealing scheme for inverse temperature $\alpha$, following the sequence ; .

Appendix D Evaluation measures

In our experiments, the clustering performance of the methods is evaluated with respect to two standard measures [7]: Normalized Mutual Information (NMI) and clustering accuracy (ACC). NMI is an information-theoretic measure based on the mutual information of the ground-truth classes and the obtained clusters, normalized using the entropy of each. Formally, let $S = \{S_1, \ldots, S_K\}$ and $C = \{C_1, \ldots, C_K\}$ denote the ground-truth classes and the obtained clusters, respectively. $S_i$ (resp. $C_j$) is the subset of data points from class $i$ (resp. cluster $j$). Let $N$ be the number of points in the dataset. The NMI is computed according to the following formula:

$$\mathrm{NMI}(C, S) = \frac{I(C, S)}{\sqrt{H(C) \, H(S)}}$$

where $I(C, S) = \sum_{i,j} \frac{|C_i \cap S_j|}{N} \log \frac{N \, |C_i \cap S_j|}{|C_i| \, |S_j|}$ corresponds to the mutual information between the partitions $C$ and $S$, and $H(C) = -\sum_{i} \frac{|C_i|}{N} \log \frac{|C_i|}{N}$ is the entropy of $C$.
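As an illustration of how NMI can be computed from two label lists, here is a pure-Python sketch. It assumes normalization by the square root of the product of the two entropies; other normalizations (mean or maximum of the entropies) exist. The function names and the zero-entropy convention are choices made for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a partition given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    """Mutual information between two partitions of the same points."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    return sum(
        (n_ij / n) * math.log(n * n_ij / (cluster_sizes[c] * class_sizes[s]))
        for (c, s), n_ij in joint.items()
    )


def nmi(clusters, classes):
    """NMI normalized by sqrt(H(C) * H(S)) (assumed normalization)."""
    h_c, h_s = entropy(clusters), entropy(classes)
    if h_c == 0.0 or h_s == 0.0:
        return 0.0  # convention chosen here for degenerate partitions
    return mutual_information(clusters, classes) / math.sqrt(h_c * h_s)


# Cluster labels are arbitrary: a perfect clustering up to relabeling.
print(nmi([1, 1, 0, 0], [0, 0, 1, 1]))
```

Because NMI only depends on the co-occurrence counts, it needs no explicit matching between cluster labels and class labels.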

On the other hand, ACC measures the proportion of data points for which the obtained clusters can be correctly mapped to ground-truth classes, where the matching is based on the Hungarian algorithm [22]. Let $s_i$ and $c_i$ further denote the ground-truth class and the obtained cluster, respectively, to which data point $x_i$, $i \in \{1, \ldots, N\}$, is assigned. Then the clustering accuracy is defined as follows:

$$\mathrm{ACC}(C, S) = \max_{\phi} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ s_i = \phi(c_i) \}$$

where $\mathbb{1}$ denotes the indicator function: $\mathbb{1}\{\text{true}\} = 1$ and $\mathbb{1}\{\text{false}\} = 0$; $\phi$ is a mapping from cluster labels to class labels.
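The maximization over mappings can be sketched as follows. Note that this example deliberately replaces the Hungarian algorithm with an exhaustive search over one-to-one mappings, which gives the same optimum but is only practical for a handful of clusters; the function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(clusters, classes):
    """Best accuracy over one-to-one mappings from cluster labels to class labels.

    Exhaustive search is used here instead of the Hungarian algorithm,
    so this is only suitable for a small number of clusters.
    """
    cluster_ids = sorted(set(clusters))
    class_ids = sorted(set(classes))
    if len(cluster_ids) > len(class_ids):
        raise ValueError("this sketch expects at most as many clusters as classes")
    best_hits = 0
    # Try every one-to-one mapping from cluster labels to class labels.
    for assigned in permutations(class_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, assigned))
        hits = sum(1 for c, s in zip(clusters, classes) if mapping[c] == s)
        best_hits = max(best_hits, hits)
    return best_hits / len(clusters)


# Hypothetical cluster assignments and ground-truth classes for six points.
print(clustering_accuracy([1, 1, 0, 0, 2, 2], [0, 0, 1, 1, 2, 1]))
```

For realistic numbers of clusters, an assignment solver such as the Hungarian algorithm cited above avoids the factorial cost of this search.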

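We additionally report in this supplementary material the clustering performance wrt the adjusted Rand index (ARI) [30]. ARI counts the pairs of data points on which the classes and clusters agree or disagree, and is corrected for chance. Formally, ARI is given by:

$$\mathrm{ARI}(C, S) = \frac{\sum_{i,j} \binom{|C_i \cap S_j|}{2} - \left[ \sum_{i} \binom{|C_i|}{2} \sum_{j} \binom{|S_j|}{2} \right] / \binom{N}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{|C_i|}{2} + \sum_{j} \binom{|S_j|}{2} \right] - \left[ \sum_{i} \binom{|C_i|}{2} \sum_{j} \binom{|S_j|}{2} \right] / \binom{N}{2}}$$

where $|C_i \cap S_j|$ is the number of points shared by cluster $C_i$ and class $S_j$, and $N$ is the total number of points.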
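A pure-Python sketch of the standard pair-counting form of ARI is given below for illustration. The function name and the handling of the degenerate case are choices made for this sketch; `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(clusters, classes):
    """Adjusted Rand index between two partitions of the same points."""
    n = len(clusters)
    if n < 2:
        raise ValueError("at least two points are needed to count pairs")
    # Pairs of points that fall in the same cluster and the same class.
    sum_joint = sum(comb(v, 2) for v in Counter(zip(clusters, classes)).values())
    sum_clusters = sum(comb(v, 2) for v in Counter(clusters).values())
    sum_classes = sum(comb(v, 2) for v in Counter(classes).values())
    expected = sum_clusters * sum_classes / comb(n, 2)
    max_index = 0.5 * (sum_clusters + sum_classes)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_joint - expected) / (max_index - expected)


# A perfect clustering up to relabeling, then a clustering that cuts across classes.
print(adjusted_rand_index([1, 1, 0, 0], [0, 0, 1, 1]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Unlike NMI and ACC, the chance correction means ARI can be negative when the agreement is worse than expected at random.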

Appendix E Dataset statistics and optimal hyperparameters

We summarize in Table 3 the statistics of the different datasets used in the experiments, as well as the dataset-specific optimal values of the hyperparameter ($\lambda$ for DKM-based and DCN-based methods and $\gamma$ for IDEC-based ones), which trades off between the reconstruction loss and the clustering loss. Recall that this optimal value was determined using a validation set, disjoint from the test set on which we reported results in the paper.

| | MNIST | USPS | 20NEWS | RCV1 |
|---|---|---|---|---|
| #Samples | 70,000 | 9,298 | 18,846 | 10,000 |
| #Classes | 10 | 10 | 20 | 4 |
| Dimensions | 28×28 | 16×16 | 2,000 | 2,000 |
| | 1e-1 | 1e-1 | 1e-4 | 1e-4 |
| | 1e+0 | 1e+0 | 1e-1 | 1e-2 |
| | 1e+1 | 1e-1 | 1e-1 | 1e-1 |
| | 1e-2 | 1e-1 | 1e-4 | 1e-3 |
| | 1e-2 | 1e-3 | 1e-1 | 1e-3 |
| | 1e-3 | 1e-1 | 1e-3 | 1e-4 |

Table 3: Statistics of the datasets and dataset-specific optimal trade-off hyperparameters ($\lambda$ for DKM-based and DCN-based methods and $\gamma$ for IDEC-based ones) determined on the validation set.

Appendix F Additional results

The additional results given in this section have also been computed from 10 seeded runs whenever possible, and Student’s t-test was performed on those 10 samples.

F.1 ARI results

We report in Table 4 the results obtained by k-Means-related methods wrt the ARI measure on the datasets used in the paper. Similarly, Table 5 compares the results of the approaches based on DKM and IDEC in terms of ARI.

| Method | MNIST | USPS | 20NEWS | RCV1 |
|---|---|---|---|---|
| KM | 36.6±0.1 | 53.5±0.1 | 7.6±0.9 | 20.6±2.8 |
| AE-KM | 69.4±1.8 | 63.2±1.5 | 31.0±1.6 | 23.9±4.3 |
| Deep clustering approaches without pretraining | | | | |
| DCN | 15.6±1.1 | 14.7±1.8 | 5.7±0.5 | 6.9±2.1 |
| DKM | 73.6±3.1 | 66.3±4.9 | 26.7±1.5 | 20.7±4.4 |
| Deep clustering approaches with pretraining | | | | |
| DCN | 70.2±1.8 | 63.4±1.5 | 31.3±1.6 | 24.0±4.3 |
| DKM | 75.0±1.8 | 68.5±1.8 | 33.9±1.5 | 26.5±4.9 |

Table 4: Clustering results of the k-Means-related methods. Performance is measured in terms of ARI (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. Bold (resp. underlined) values correspond to results with no significant difference to the best approach with (resp. without) pretraining for each dataset/metric pair.

| Method | MNIST | USPS | 20NEWS | RCV1 |
|---|---|---|---|---|
| Deep clustering approaches without pretraining | | | | |
| IDEC | 49.1±3.0 | 40.2±5.1 | 9.8±1.5 | 28.5±5.3 |
| DKM | 73.6±3.1 | 66.3±4.9 | 26.7±1.5 | 20.7±4.4 |
| Deep clustering approaches with pretraining | | | | |
| IDEC | 81.5±2.4 | 68.1±0.5 | 26.0±1.3 | 32.9±5.7 |
| DKM | 75.0±1.8 | 68.5±1.8 | 33.9±1.5 | 26.5±4.9 |

Table 5: Clustering results of the DKM and IDEC methods. Performance is measured in terms of ARI (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. Bold (resp. underlined) values correspond to results with no significant difference to the best approach with (resp. without) pretraining for each dataset/metric pair.

F.2 Cosine distance

The proposed Deep k-Means framework enables the use of different distance and similarity functions to compute the AE’s reconstruction error (based on function $g$) and the clustering loss (based on function $f$). In the paper, we adopted the Euclidean distance for both $f$ and $g$. For the sake of comprehensiveness, we performed additional experiments on DKM using different such functions. In particular, we showcase in Table 6 the results obtained by choosing the cosine distance for and the Euclidean distance for (both DKM variants), in comparison to using the Euclidean distance for both functions (both DKM variants); the latter correspond to the approaches reported in the paper. Note that for the cosine-based DKM variants as well, the reported results correspond to those obtained with the optimal $\lambda$ determined on the validation set.

| Method | ACC | NMI | ARI |
|---|---|---|---|
| = Euclidean distance, = Euclidean distance | | | |
| DKM (without pretraining) | 44.8±2.4 | 42.8±1.1 | 26.7±1.5 |
| DKM (with pretraining) | 51.2±2.8 | 46.7±1.2 | 33.9±1.5 |
| = cosine distance, = Euclidean distance | | | |
| DKM (without pretraining) | 51.3±1.5 | 44.4±0.7 | 32.6±1.0 |
| DKM (with pretraining) | 51.0±2.6 | 45.1±1.2 | 33.0±1.1 |

Table 6: Clustering results for DKM using the Euclidean and cosine distances on 20NEWS. Each cell contains the average and standard deviation computed over 10 runs. Bold values correspond to results with no significant difference to the best for each dataset/metric pair.

| Method | MNIST ACC | MNIST NMI | USPS ACC | USPS NMI | 20NEWS ACC | 20NEWS NMI | RCV1 ACC | RCV1 NMI |
|---|---|---|---|---|---|---|---|---|
| AE-KM | 80.8±1.8 | 75.2±1.1 | 72.9±0.8 | 71.7±1.2 | 49.0±2.9 | 44.5±1.5 | 56.7±3.6 | 31.5±4.3 |
| DCN + KM | 84.9±3.1 | 79.4±1.5 | 73.9±0.7 | 74.1±1.1 | 50.5±3.1 | 46.5±1.6 | 57.3±3.6 | 32.3±4.4 |
| DKM (without pretraining) + KM | 84.8±1.3 | 78.7±0.8 | 76.9±4.9 | 74.3±1.5 | 49.0±2.5 | 44.0±1.0 | 53.4±5.9 | 27.4±5.3 |
| DKM (with pretraining) + KM | 85.1±3.0 | 79.9±1.5 | 75.7±1.3 | 77.6±1.1 | 52.1±2.7 | 47.1±1.3 | 58.3±3.8 | 33.0±4.9 |

Table 7: Clustering results for k-Means applied to different learned embedding spaces to measure the k-Means-friendliness of each method. Performance is measured in terms of NMI and clustering accuracy (%), averaged over 10 runs with standard deviation. Bold values correspond to results with no significant difference to the best for each dataset/metric.

F.3 k-Means-friendliness of learned representations

In addition to the previous experiments, which evaluate the clustering ability of the different approaches, we analyzed how effective applying k-Means to the representations learned by DCN and by the two DKM variants is in comparison to applying k-Means to the AE-based representations (i.e., AE-KM). In other words, we evaluate the “k-Means-friendliness” of the learned representations. The results of this experiment are reported in Table 7. We can observe that on most datasets the representations learned by k-Means-related deep clustering approaches lead to significant improvement wrt AE-learned representations. This confirms that all these deep clustering methods truly bias their representations. Overall, although the difference is not statistically significant on all datasets/metrics, the representations learned by DKM are shown to be the most appropriate to k-Means. This is in line with the insight gathered from the previous experiments.
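The evaluation protocol described here amounts to freezing a learned embedding and running plain k-Means on it. The sketch below shows that second step with Lloyd iterations in pure Python; the toy 2-D embeddings, the initial centers, and the function names are hypothetical stand-ins, and the initialization actually used in the experiments is not specified in this section.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = [list(c) for c in centers]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center.
        assignments = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


# Hypothetical 2-D embeddings standing in for encoder outputs.
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
labels, centers = kmeans(embeddings, centers=[(0.0, 0.0), (1.0, 1.0)])
print(labels)
print(centers)
```

The resulting labels would then be scored against the ground-truth classes with the measures of Appendix D, which is how a more k-Means-friendly embedding shows up as higher ACC and NMI.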