1 Introduction
Clustering is a longstanding problem in the machine learning and data mining fields and has accordingly fostered abundant research. Traditional clustering methods,
e.g., k-Means MacQueen1967 and Gaussian Mixture Models (GMMs)
Bishop2006 , fully rely on the original data representations and may thus be ineffective when the data points (e.g., images and text documents) live in a high-dimensional space – a problem commonly known as the curse of dimensionality. Significant progress has been made in the last decade or so to learn better, low-dimensional data representations
Hinton2006 . The most successful techniques to achieve such high-quality representations rely on deep neural networks (DNNs), which apply successive non-linear transformations to the data in order to obtain increasingly high-level features. Autoencoders (AEs) are a special instance of DNNs which are trained to embed the data into a (usually dense and low-dimensional) vector at the bottleneck of the network, and then attempt to reconstruct the input from this vector. The appeal of AEs lies in the fact that they are able to learn representations in a fully unsupervised way. The representation learning breakthrough enabled by DNNs spurred the recent development of numerous deep clustering approaches which aim at jointly learning the data points' representations as well as their cluster assignments.
In this study, we specifically focus on the k-Means-related deep clustering problem. Contrary to previous approaches that alternate between continuous gradient updates and discrete cluster assignment steps Yang2017 , we show here that one can rely solely on gradient updates to learn, truly jointly, representations and clustering parameters. This ultimately leads to a better deep k-Means method which is also more scalable, as it can fully benefit from the efficiency of stochastic gradient descent (SGD). In addition, we perform a careful comparison of different methods by
(a) relying on the same autoencoders, as the choice of autoencoder impacts the results obtained, (b) tuning the hyperparameters of each method on a small validation set, instead of setting them without clear criteria, and
(c) enforcing, whenever possible, that the different methods use the same initialization and the same sequence of SGD minibatches. The last point is crucial when comparing different methods, as these two factors play an important role and the variance of each method is usually not negligible.
2 Related work
In the wake of the groundbreaking results obtained by DNNs in computer vision, several deep clustering algorithms were specifically designed for image clustering
Yang2016 ; Chang2017 ; Dizaji2017 ; Hu2017 ; Hsu2018 . These works have in common the exploitation of Convolutional Neural Networks (CNNs), which extensively contributed to the last decade's significant advances in computer vision. Inspired by agglomerative clustering,
Yang2016 proposed a recurrent process which successively merges clusters and learns image representations based on CNNs. In Chang2017 , the clustering problem is formulated as binary pairwise classification so as to identify the pairs of images which should belong to the same cluster. Due to the unsupervised nature of clustering, the CNN-based classifier in this approach is only trained on noisily labeled examples obtained by selecting increasingly difficult samples in a curriculum learning fashion.
Dizaji2017 jointly trained a CNN autoencoder and a multinomial logistic regression model applied to the AE's latent space. Similarly,
Hsu2018 alternate between representation learning and clustering, where mini-batch k-Means is utilized as the clustering component. Differently from these works, Hu2017 proposed an information-theoretic framework based on data augmentation to learn discrete representations, which may be applied to clustering or hash learning. Although these different algorithms obtained state-of-the-art results on image clustering Aljalbout2018 , their ability to generalize to other types of data (e.g., text documents) is not guaranteed due to their reliance on essentially image-specific techniques – Convolutional Neural Network architectures and data augmentation.
Nonetheless, many general-purpose – non-image-specific – approaches to deep clustering have also been recently designed Huang2014 ; Peng2016 ; Xie2016 ; Dilokthanakul2017 ; Guo2017b ; Hu2017 ; Ji2017 ; Jiang2017 ; Peng2017 ; Yang2017 . Generative models were proposed in Dilokthanakul2017 ; Jiang2017 which combine variational AEs and GMMs to perform clustering. Alternatively, Peng2016 ; Peng2017 ; Ji2017 framed deep clustering as a subspace clustering problem in which the mapping from the original data space to a low-dimensional subspace is learned by a DNN. Xie2016
defined the Deep Embedded Clustering (DEC) method, which simultaneously updates the data points' representations, initialized from a pretrained AE, and the cluster centers. DEC uses soft assignments which are optimized to match stricter assignments through a Kullback-Leibler divergence loss. IDEC was subsequently proposed in
Guo2017b as an improvement to DEC, obtained by integrating the AE's reconstruction error in the objective function.
Few approaches were directly influenced by k-Means clustering Huang2014 ; Yang2017 . The Deep Embedding Network (DEN) model Huang2014 first learns representations from an AE while enforcing locality-preserving constraints and group sparsity; clusters are then obtained by simply applying k-Means to these representations. Yet, as representation learning is decoupled from clustering, the performance is not as good as that obtained by methods relying on a joint approach. Besides Hsu2018 , mentioned before in the context of images, the only study, to our knowledge, that directly addresses the problem of jointly learning representations and clustering with k-Means (and not an approximation of it) is the Deep Clustering Network (DCN) approach Yang2017 . However, as in Hsu2018 , DCN alternately learns (rather than jointly learns) the object representations, the cluster centroids and the cluster assignments, the latter being based on discrete optimization steps which cannot benefit from the efficiency of stochastic gradient descent. The approach proposed here, entitled Deep k-Means (DKM), addresses this problem.
3 Deep k-Means
In the remainder, x denotes an object from a set X of objects to be clustered. ℝ^p represents the space in which learned data representations are to be embedded. K is the number of clusters to be obtained, r_k ∈ ℝ^p the representative of cluster k, 1 ≤ k ≤ K, and R = {r_1, …, r_K} the set of representatives. Functions f and g define distances used for the clustering loss and the reconstruction error, respectively, and are assumed to be fully differentiable wrt their variables. For any vector y ∈ ℝ^p, c_f(y; R) gives the closest representative of y according to f.
The deep k-Means problem takes the following form:

min_{θ, R} ∑_{x ∈ X} g(A(x; θ), x) + λ f(h(x; θ), c_f(h(x; θ); R))    (1)

g(A(x; θ), x) measures the error between an object x and its reconstruction A(x; θ) provided by an autoencoder, θ representing the set of the autoencoder's parameters. A regularization term on θ can be included in the definition of g. However, as most autoencoders do not use regularization, we dispense with such a term here. h(x; θ) denotes the representation of x output by the AE's encoder part, and f(h(x; θ), c_f(h(x; θ); R)) is the clustering loss corresponding to the k-Means objective function in the embedding space. Finally, λ in Problem (1) regulates the trade-off between seeking good representations for X – i.e., representations that are faithful to the original examples – and representations that are useful for clustering purposes. Similar optimization problems can be formulated when f and g are similarity functions or a mix of similarity and distance functions. The approach proposed here directly applies to such cases.
Figure 1 illustrates the overall framework retained in this study, with f and g both based on the Euclidean distance. The closeness term in the clustering loss will be further clarified below.
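Under the Euclidean instantiation of f and g, the per-object objective of Problem (1) can be sketched as follows. This is a toy numpy version: `x_rec` and `h` stand in for the decoder output A(x; θ) and the encoder embedding h(x; θ), which would in practice be produced by the autoencoder.

```python
import numpy as np

def dkm_loss(x, x_rec, h, R, lam):
    """Toy per-object Deep k-Means objective with Euclidean f and g.

    x     : original data point
    x_rec : autoencoder reconstruction of x (stand-in for A(x; theta))
    h     : embedding of x (stand-in for the encoder output h(x; theta))
    R     : (K, p) array of cluster representatives
    lam   : trade-off between reconstruction and clustering losses
    """
    g = np.sum((x - x_rec) ** 2)          # reconstruction error g(A(x), x)
    dists = np.sum((h - R) ** 2, axis=1)  # f(h, r_k) for every representative
    f = dists.min()                       # distance to the closest representative
    return g + lam * f

# Example: a 2-D embedding with two representatives.
R = np.array([[0.0, 0.0], [5.0, 5.0]])
loss = dkm_loss(np.array([1.0, 1.0]), np.array([0.9, 1.1]),
                np.array([0.5, 0.5]), R, lam=0.1)
```

The `min` over representatives is exactly the non-differentiable term that the continuous generalization of the next section replaces.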
3.1 Continuous generalization of Deep k-Means
We now introduce a parameterized version of the above problem that constitutes a continuous generalization, whereby we mean here that all the functions considered are continuous wrt the introduced parameter.^1 To do so, we first note that the clustering objective function can be rewritten as ∑_{x ∈ X} ∑_{k=1}^{K} f(h(x; θ), r_k) δ_k(h(x; θ); R) with:

δ_k(y; R) = 1 if r_k = c_f(y; R), and 0 otherwise.

Let us now assume that we know some function G_k(y; α, R), parameterized by α, such that:

(i) G_k is differentiable wrt y and R and continuous wrt α (differentiability wrt R means differentiability wrt all dimensions of r_1, …, r_K);

(ii) there exists α_0 ∈ ℝ ∪ {−∞, +∞} such that:

lim_{α → α_0} G_k(y; α, R) = δ_k(y; R).

Then, one has, ∀x ∈ X: lim_{α → α_0} ∑_{k=1}^{K} f(h(x; θ), r_k) G_k(h(x; θ); α, R) = f(h(x; θ), c_f(h(x; θ); R)), showing that the problem in (1) is equivalent, in the limit α → α_0, to:

min_{θ, R} ∑_{x ∈ X} g(A(x; θ), x) + λ ∑_{k=1}^{K} f(h(x; θ), r_k) G_k(h(x; θ); α, R)    (2)

^1 Note that, independently from this work, a similar relaxation has been previously proposed in Agustsson2017 – wherein soft-to-hard quantization is performed on an embedding space learned by an AE for compression. However, given the different nature of the goal here – clustering – our proposed learning framework substantially differs from theirs.
All the functions in the above formulation are fully differentiable wrt both θ and R. One can thus estimate θ and R through a simple, joint optimization based on stochastic gradient descent (SGD) for a given α:

(θ, R) ← (θ, R) − η (1/|B|) ∑_{x ∈ B} ∇_{(θ, R)} [ g(A(x; θ), x) + λ ∑_{k=1}^{K} f(h(x; θ), r_k) G_k(h(x; θ); α, R) ]    (3)

with η the learning rate and B a random minibatch of X.
3.2 Choice of G_k
Several choices are possible for G_k. A simple choice, used throughout this study, is based on a parameterized softmax function. The fact that the softmax function can be used as a differentiable surrogate to argmax or argmin is well known and has been exploited in different contexts, as in the recently proposed Gumbel-softmax distribution employed to approximate categorical samples Jang2017 ; Maddison2017 . The parameterized softmax function which we adopted takes the following form:
G_k(y; α, R) = exp(−α f(y, r_k)) / ∑_{k'=1}^{K} exp(−α f(y, r_{k'}))    (4)

with α ∈ [0, +∞). The function G_k defined by Eq. 4 is differentiable wrt y and R and continuous wrt α (condition (i)), as it is a composition of functions differentiable wrt these variables. Furthermore, one has:
Property 3.1
(condition (ii)) If arg min_{1 ≤ k ≤ K} f(y, r_k) is unique for all y, then: lim_{α → +∞} G_k(y; α, R) = δ_k(y; R).
The proof, which is straightforward, is detailed in the Supplementary Material.
The assumption that the closest representative is unique for all objects is necessary for G_k to take on binary values in the limit; it does not need to hold for small values of α. In the unlikely event that the above assumption does not hold for some object and large α, one can slightly perturb the representatives equidistant to its embedding prior to updating them. We have never encountered this situation in practice.
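The parameterized softmax of Eq. 4 and its hard-assignment limit can be sketched in a few lines of numpy, taking f to be the squared Euclidean distance as in our experiments. Shifting the exponent by the minimum distance is a standard numerical-stability trick that leaves the softmax unchanged:

```python
import numpy as np

def G(h, R, alpha):
    """Parameterized softmax over representatives (Eq. 4), with f Euclidean.

    Returns a length-K vector: a soft assignment of embedding h to each
    representative in R that tends to a hard argmin assignment as alpha grows.
    """
    d = np.sum((h - R) ** 2, axis=1)
    e = np.exp(-alpha * (d - d.min()))  # shift by min for numerical stability
    return e / e.sum()

R = np.array([[0.0], [1.0], [4.0]])
h = np.array([0.4])

soft = G(h, R, alpha=1.0)     # smooth assignment, all entries non-zero
hard = G(h, R, alpha=1000.0)  # numerically indistinguishable from argmin
```

With a small α, every representative receives some mass; with a large α, essentially all the mass concentrates on the closest representative, recovering the k-Means assignment while remaining differentiable.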
3.3 Choice of α
The parameter α can be defined in different ways. Indeed, α can play the role of an inverse temperature such that, when α is 0, each data point in the embedding space is equally close, through G_k, to all the representatives (corresponding to a completely soft assignment), whereas when α is +∞, the assignment is hard. In the first case, for the deep k-Means optimization problem, all representatives are equal and set to the point that minimizes the sum of distances to the data embeddings. In the second case, the solution corresponds to exactly performing k-Means in the embedding space, the latter being learned jointly with the clustering process. Following a deterministic annealing approach rose90 , one can start with a low value of α (close to 0) and gradually increase it until a sufficiently large value is reached. At first, representatives are randomly initialized. As the problem is smooth when α is close to 0, different initializations are likely to lead to the same local minimum in the first iteration; this local minimum provides the new values of the representatives for the second iteration, and so on. The continuity of G_k wrt α implies that, provided the increments in α are not too large, one evolves smoothly from the initial local minimum to the last one. In the above deterministic annealing scheme, α allows one to initialize the cluster representatives. The initialization of the autoencoder can also have an important impact on the results obtained, and prior studies (e.g., Huang2014 ; Xie2016 ; Guo2017b ; Yang2017 ) have relied on pretraining for this matter. In such a case, one can choose a high value for α to directly obtain the behavior of the k-Means algorithm in the embedding space after pretraining. We evaluate both approaches in our experiments.
Algorithm 1 summarizes the deep k-Means algorithm for the deterministic annealing scheme, where α_min (respectively α_max) denotes the minimum (respectively maximum) value of α, and a fixed number of epochs of stochastic gradient updates is performed for each value of α. Even though α_max is finite, it can be set sufficiently large to obtain, in practice, a hard assignment to representatives. Alternatively, when using pretraining, one sets α_min = α_max (i.e., a constant α is used).

3.4 Shrinking phenomenon
The loss functions defined in Problems (1) and (2) – as well as the loss used in the DCN approach Yang2017 and potentially in other approaches – might in theory induce a degenerate behavior in the learning procedure. Indeed, the clustering loss could be made arbitrarily small, while preserving the reconstruction capacity of the AE, by "shrinking" the subspace in which the object embeddings and the cluster representatives live – thus reducing the distance between embeddings and representatives. We tested L2 regularization on the autoencoder parameters to alleviate this potential issue by preventing the weights from arbitrarily shrinking the embedding space (indeed, by symmetry of the encoder and decoder, having small weights in the encoder, leading to shrinking, requires having large weights in the decoder for reconstruction; L2 regularization penalizes such large weights). We have however not observed any difference in our experiments with the case where no regularization is used, showing that the shrinking problem may not be important in practice. For the sake of simplicity, we dispense with it in the remainder.

4 Experiments
In order to evaluate the clustering results of our approach, we conducted experiments on different datasets and compared it against state-of-the-art standard and k-Means-related deep clustering models.
4.1 Datasets
The datasets used in the experiments are standard clustering benchmark collections. We considered both image and text datasets to demonstrate the general applicability of our approach. The image datasets consist of MNIST (70,000 images, 28×28 pixels, 10 classes) and USPS (9,298 images, 16×16 pixels, 10 classes), which both contain handwritten digit images. We reshaped the images to one-dimensional vectors and normalized the pixel intensity levels (between 0 and 1 for MNIST, and between −1 and 1 for USPS). The text collections we considered are the 20 Newsgroups dataset (hereafter, 20NEWS) and the RCV1-v2 dataset (hereafter, RCV1). For 20NEWS, we used the whole dataset comprising 18,846 documents labeled into 20 different classes. Similarly to Xie2016 ; Guo2017b , we sampled from the full RCV1-v2 collection a random subset of 10,000 documents, each of which pertains to only one of the four largest classes. Because of the text datasets' sparsity, and as proposed in Xie2016 , we selected the 2,000 words with the highest tf-idf values to represent each document.
4.2 Baselines and deep Means variants
Clustering models may use different strategies and different clustering losses, leading to different properties. As our goal in this work is to study the k-Means clustering algorithm in embedding spaces, we focus on the family of k-Means-related models and compare our approach against state-of-the-art models from this family, using both standard and deep clustering models. For the standard clustering methods, we used: the k-Means clustering approach MacQueen1967 with k-means++ initial cluster center selection Arthur2007 , denoted KM; and an approach denoted AE-KM in which dimensionality reduction is first performed using an autoencoder, followed by k-Means applied to the learned representations.^2 We compared as well against the only previous, "true" deep clustering k-Means-related method, the Deep Clustering Network (DCN) approach described in Yang2017 . DCN is, to the best of our knowledge, the most competitive clustering algorithm among k-Means-related models.

^2 We did not consider variational autoencoders Kingma2014 in our baselines as Jiang2017 previously compared variational AE + GMM and "standard" AE + GMM, and found that the latter consistently outperformed the former.
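The k-means++ seeding used for the KM baseline Arthur2007 draws each new center with probability proportional to its squared distance to the nearest center already chosen. A minimal pure-Python sketch, assuming Euclidean distance and points given as tuples (the function name and signature are illustrative, not the baseline's actual implementation):

```python
import random

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: pick the first center uniformly, then draw each
    subsequent center with probability proportional to the squared distance
    to its nearest already-chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its closest current center.
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers) for pt in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
    return centers
```

Points already chosen have zero weight, so duplicates are avoided and well-separated regions tend to each receive a seed.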
In addition, we consider here the Improved Deep Embedded Clustering (IDEC) model Guo2017b as an additional baseline. IDEC is a general-purpose state-of-the-art approach in the deep clustering family. It is an improved version of the DEC model Xie2016 and thus constitutes a strong baseline. For both DCN and IDEC, we studied two variants: with pretraining (DCN_p and IDEC_p) and without pretraining (DCN_np and IDEC_np). The pretraining we performed here simply consists in initializing the weights by training the autoencoder on the data to minimize the reconstruction loss in an end-to-end fashion – greedy layer-wise pretraining Bengio2006 did not lead to improved clustering in our preliminary experiments.
The proposed Deep k-Means (DKM) is, as DCN, a "true" k-Means approach in the embedding space; it jointly learns AE-based representations and relaxes the k-Means problem by introducing a parameterized softmax as a differentiable surrogate to the k-Means argmin. In the experiments, we considered two variants of this approach. DKM_a implements an annealing strategy for the inverse temperature α and does not rely on pretraining. The scheme we used for the evolution of the inverse temperature in DKM_a is given by the recursive sequence α_{m+1} = 2^{1/log(m)²} × α_m with α_1 = 0.1. The rationale behind the choice of this scheme is that we want to spend more iterations on smaller α values and fewer on larger ones while preserving a gentle slope. Alternatively, we studied the variant DKM_p which is initialized by pretraining an autoencoder and then follows Algorithm 1 with a constant α set to a high value. Such a high α is equivalent to having hard cluster assignments while maintaining the differentiability of the optimization problem.
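An annealing schedule of this kind can be sketched as follows; we assume the multiplicative update α_{m+1} = 2^{1/log(m)²} × α_m starting from α_1 = 0.1, with the recursion applied from m = 2 since the factor is undefined at m = 1. The growth factor shrinks toward 1 as m increases, so consecutive α values get closer together for later terms:

```python
import math

def annealing_schedule(alpha_init=0.1, n_terms=40):
    """Inverse-temperature schedule alpha_{m+1} = 2**(1/log(m)**2) * alpha_m.

    The multiplicative factor 2**(1/log(m)**2) tends to 1 as m grows, so the
    sequence increases quickly at first and then flattens out.
    """
    alphas = [alpha_init]
    for m in range(2, n_terms + 1):
        alphas.append(2 ** (1.0 / math.log(m) ** 2) * alphas[-1])
    return alphas

alphas = annealing_schedule()  # the 40 terms used for DKM with annealing
```

Running a fixed number of training epochs per term then gives the total epoch budget described in the experimental setup.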
Implementation details.
For IDEC, we used the Keras code shared by its authors.^3 Our own code for DKM is based on TensorFlow. To enable full control of the comparison between DCN and DKM – DCN being the closest competitor to DKM – we also reimplemented DCN in TensorFlow. The code for both DKM and DCN is available online.^4

^3 https://github.com/XifengGuo/IDEC-toy. We used this version instead of https://github.com/XifengGuo/IDEC as only the former enables autoencoder pretraining in a non-layerwise fashion.
^4 https://github.com/MaziarMF/deep-k-means

Choice of f and g.
The functions f and g in Problem (1) define which distance functions are used for the clustering loss and the reconstruction error, respectively. In this study, both f and g are simply instantiated with the Euclidean distance on all datasets. For the sake of comprehensiveness, we report in the supplementary material results for the cosine distance on 20NEWS.
4.3 Experimental setup
Autoencoder description and training details.
The autoencoder we used in the experiments is the same across all datasets and is borrowed from previous deep clustering studies Xie2016 ; Guo2017b . Its encoder is a fully-connected multilayer perceptron with dimensions d–500–500–2000–K, where d is the original data space dimension and K is the number of clusters to obtain. The decoder is a mirrored version of the encoder. All layers except the one preceding the embedding layer and the one preceding the output layer are applied a ReLU activation function Nair2010 before being fed to the next layer. For the sake of simplicity, we did not rely on any complementary training or regularization strategies such as batch normalization or dropout. The autoencoder weights are initialized following the Xavier scheme Glorot2010 . For all deep clustering approaches, the training is based on the Adam optimizer Kingma2015 with standard learning rate and momentum rates. The minibatch size is set to 256 on all datasets following Guo2017b . We emphasize that we chose exactly the same training configuration for all models to facilitate a fair comparison.

The number of pretraining epochs is set to 50 for all models relying on pretraining. The number of fine-tuning epochs for DCN_p and IDEC_p is fixed to 50 (or equivalently, in terms of iterations: 50 times the number of minibatches). We set the number of training epochs for DCN_np and IDEC_np to 200. For DKM_a, we used the 40 terms of the sequence described in Section 4.2 as the annealing scheme and performed 5 epochs for each term (i.e., 200 epochs in total). DKM_p is fine-tuned by performing 100 epochs with constant α. The cluster representatives are initialized randomly from a uniform distribution for models without pretraining. In case of pretraining, the cluster representatives are initialized by applying k-Means to the pretrained embedding space.

Hyperparameter selection.
The hyperparameters λ for DCN and DKM and γ for IDEC, which define the trade-off between the reconstruction error and the clustering error in the loss function, were determined by performing a line search over a small grid of values. To do so, we randomly split each dataset into a validation set (10% of the data) and a test set (90%). Each model is trained on the whole data, and only the validation set labels are leveraged in the line search to identify the optimal λ or γ (optimality is measured with respect to the clustering accuracy metric). We provide the validation-optimal λ and γ obtained for each model and dataset in the supplementary material. The performance reported in the following sections corresponds to the evaluation performed only on the held-out test set.
While one might argue that such a procedure affects the unsupervised nature of the clustering approaches, we believe that a clear and transparent hyperparameter selection methodology is preferable to a vague or hidden one. Moreover, although we did not explore this possibility in this study, it might be possible to define this trade-off hyperparameter in a data-driven way.
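The line search amounts to a simple loop over candidate trade-off values, keeping the one with the best validation clustering accuracy. A toy sketch, where `train_and_eval` is a hypothetical stand-in for training a model with a given λ and scoring it on the validation set:

```python
def line_search(train_and_eval, grid):
    """Return the trade-off value from `grid` maximizing the validation
    clustering accuracy reported by `train_and_eval`."""
    best_lam, best_acc = None, -1.0
    for lam in grid:
        acc = train_and_eval(lam)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc

# Toy stand-in: validation accuracies that a model might reach per lambda.
scores = {1e-4: 0.52, 1e-3: 0.61, 1e-2: 0.58, 1e-1: 0.49, 1.0: 0.40}
best_lam, best_acc = line_search(scores.get, sorted(scores))
```

Only the validation labels enter this loop; the held-out test set is never touched during selection.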
Experimental protocol.
We observed in pilot experiments that the clustering performance of the different models is subject to non-negligible variance from one run to another. This variance is due to the randomness in the initialization and in the minibatch sampling for the stochastic optimizer. When pretraining is used, the variance of the general pretraining phase and that of the model-specific fine-tuning phase add up, which makes it difficult to draw any confident conclusion about the clustering ability of a model. To alleviate this issue, we compared the different approaches using seeded runs whenever this was possible. This has the advantage of removing the variance of pretraining, as seeds guarantee exactly the same results at the end of pretraining (since the same pretraining is performed for the different models). Additionally, it ensures that the same sequence of minibatches will be sampled. In practice, we used seeds for the models implemented in TensorFlow (KM, AE-KM, DCN and DKM). Because of implementation differences, seeds could not give the same pretraining states in the Keras-based IDEC. All in all, we randomly selected 10 seeds and, for each model, performed one run per seed. Additionally, to account for the remaining variance and to report statistical significance, we performed a Student's t-test from the 10 collected samples (i.e., runs).
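The seeding guarantee can be illustrated with a minimal numpy sketch (the function and sizes are illustrative): two runs sharing a seed traverse exactly the same sequence of minibatches, while different seeds produce different sequences.

```python
import numpy as np

def minibatch_order(seed, n_points=1000, batch_size=256):
    """Return the sequence of minibatch index arrays drawn under a seed.

    Seeding the RNG fixes the permutation, so two runs with the same seed
    see identical minibatches in identical order.
    """
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_points)
    return [idx[i:i + batch_size] for i in range(0, n_points, batch_size)]

run_a = minibatch_order(seed=42)
run_b = minibatch_order(seed=42)  # same seed: same minibatch sequence
run_c = minibatch_order(seed=7)   # different seed: different sequence
```

In practice one must also seed the framework's own RNG (e.g., TensorFlow's graph-level seed) so that weight initialization is reproducible as well.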
Model | MNIST ACC | MNIST NMI | USPS ACC | USPS NMI | 20NEWS ACC | 20NEWS NMI | RCV1 ACC | RCV1 NMI
KM | 53.5±0.3 | 49.8±0.5 | 67.3±0.1 | 61.4±0.1 | 23.2±1.5 | 21.6±1.8 | 50.8±2.9 | 31.3±5.4
AE-KM | 80.8±1.8 | 75.2±1.1 | 72.9±0.8 | 71.7±1.2 | 49.0±2.9 | 44.5±1.5 | 56.7±3.6 | 31.5±4.3
Deep clustering approaches without pretraining:
DCN_np | 34.8±3.0 | 18.1±1.0 | 36.4±3.5 | 16.9±1.3 | 17.9±1.0 | 9.8±0.5 | 41.3±4.0 | 6.9±1.8
DKM_a | 82.3±3.2 | 78.0±1.9 | 75.5±6.8 | 73.0±2.3 | 44.8±2.4 | 42.8±1.1 | 53.8±5.5 | 28.0±5.8
Deep clustering approaches with pretraining:
DCN_p | 81.1±1.9 | 75.7±1.1 | 73.0±0.8 | 71.9±1.2 | 49.2±2.9 | 44.7±1.5 | 56.7±3.6 | 31.6±4.3
DKM_p | 84.0±2.2 | 79.6±0.9 | 75.7±1.3 | 77.6±1.1 | 51.2±2.8 | 46.7±1.2 | 58.3±3.8 | 33.1±4.9

Table 1: Clustering results of k-Means-related methods. Performance is measured in terms of ACC and NMI (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. Bold (resp. underlined) values correspond to results with no significant difference to the best approach with (resp. without) pretraining for each dataset/metric pair.

Model | MNIST ACC | MNIST NMI | USPS ACC | USPS NMI | 20NEWS ACC | 20NEWS NMI | RCV1 ACC | RCV1 NMI
Deep clustering approaches without pretraining:
IDEC_np | 61.8±3.0 | 62.4±1.6 | 53.9±5.1 | 50.0±3.8 | 22.3±1.5 | 22.3±1.5 | 56.7±5.3 | 31.4±2.8
DKM_a | 82.3±3.2 | 78.0±1.9 | 75.5±6.8 | 73.0±2.3 | 44.8±2.4 | 42.8±1.1 | 53.8±5.5 | 28.0±5.8
Deep clustering approaches with pretraining:
IDEC_p | 85.7±2.4 | 86.4±1.0 | 75.2±0.5 | 74.9±0.6 | 40.5±1.3 | 38.2±1.0 | 59.5±5.7 | 34.7±5.0
DKM_p | 84.0±2.2 | 79.6±0.9 | 75.7±1.3 | 77.6±1.1 | 51.2±2.8 | 46.7±1.2 | 58.3±3.8 | 33.1±4.9

Table 2: Clustering results of DKM and the IDEC baseline (same setting and notation as Table 1).
4.4 Clustering results
The results for the evaluation of the k-Means-related clustering methods on the different benchmark datasets are summarized in Table 1. The clustering performance is evaluated with respect to two standard measures Cai2011 : Normalized Mutual Information (NMI) and clustering accuracy (ACC). We report for each dataset/method pair the average and standard deviation of these metrics computed over 10 runs and conduct significance testing as previously described in the experimental protocol. The bold (resp. underlined) values in each column of Table 1 correspond to results with no statistically significant difference to the best result with (resp. without) pretraining for the corresponding dataset/metric.
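The ACC metric is commonly computed by finding the best one-to-one mapping between predicted cluster labels and ground-truth classes, typically via the Hungarian algorithm Kuhn1955 . A brute-force sketch over permutations, adequate for small numbers of clusters (the function name and signature are illustrative):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred, n_clusters):
    """Clustering accuracy (ACC): fraction of points correctly labeled under
    the best one-to-one cluster-to-class relabeling.

    Brute-force over the n_clusters! permutations; the Hungarian algorithm
    handles the general case in polynomial time.
    """
    best = 0
    for perm in permutations(range(n_clusters)):
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == perm[p])
        best = max(best, hits)
    return best / len(y_true)

# Cluster ids are arbitrary: swapping labels 0 and 1 still yields a perfect score.
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], 3)
```

Because cluster indices carry no intrinsic meaning, this relabeling step is what makes ACC invariant to permutations of the predicted labels.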
We first observe that when no pretraining is used, DKM with annealing (DKM_a) markedly outperforms DCN_np on all datasets. DKM_a achieves clustering performance similar to that obtained by pretraining-based methods. This confirms our intuition that the proposed annealing strategy can be seen as an alternative to pretraining.
Among the approaches integrating representation learning with pretraining, the AE-KM method, which separately performs dimensionality reduction and k-Means clustering, obtains the worst results overall. This observation is in line with prior studies Yang2017 ; Guo2017b and underlines again the importance of jointly learning representations and clustering. We note as well that, apart from DKM, pretraining-based deep clustering approaches substantially outperform their non-pretrained counterparts, which stresses the importance of pretraining.
Furthermore, DKM_p yields significant improvements over DCN_p, the other "true" deep k-Means approach, on all collections except RCV1. In all cases, DCN_p shows performance on par with that of AE-KM. This places, to the best of our knowledge, DKM as the current best deep k-Means clustering method.
To further confirm DKM's efficacy, we also compare it against IDEC, a state-of-the-art deep clustering algorithm which is not based on k-Means. We report the corresponding results in Table 2. Once again, DKM_a significantly outperforms its non-pretrained counterpart, IDEC_np, except on RCV1. We note as well that, with the exception of the NMI results on MNIST, DKM_p is always either significantly better than IDEC_p or with no significant difference from the latter. This shows that the proposed DKM is not only the strongest k-Means-related clustering approach, but is also remarkably competitive wrt the deep clustering state of the art.
4.5 Illustration of learned representations
While the quality of the clustering results and that of the representations learned by the models are likely to be correlated, it is relevant to study to what extent learned representations are distorted to facilitate clustering. To provide a more interpretable view of the representations learned by k-Means-related deep clustering algorithms, we illustrate in Figure 2 the embedded samples on USPS provided by AE (for comparison), DCN_p, DKM_a, and DKM_p (best viewed in color). DCN_np was discarded due to its poor clustering performance. We used for that matter the t-SNE visualization method vanderMaaten2008 to project the embeddings into a 2D space. We observe that the representations for points from different clusters are clearly better separated and disentangled in DKM than in the other models. This brings further support to our experimental results, which showed the superior ability of DKM to learn representations that facilitate clustering.
5 Conclusion
We have presented in this paper a new approach for jointly clustering with k-Means and learning representations, by considering the k-Means clustering loss as the limit of a differentiable function. To the best of our knowledge, this is the first approach that truly jointly optimizes, through simple stochastic gradient descent updates, the representation and k-Means clustering losses. In addition to pretraining, which can be used in all methods, this approach can also rely on a deterministic annealing scheme for parameter initialization.
We further conducted careful comparisons with previous approaches by ensuring that the same architecture, initialization and minibatches are used. The experiments conducted on several datasets confirm the good behavior of Deep k-Means, which outperforms DCN, the previous best approach for k-Means clustering in embedding spaces, on all the collections considered.
Acknowledgments
Funding: This work was supported by the French National Agency for Research through the LOCUST project [grant number ANR-15-CE23-0027] and France's Auvergne-Rhône-Alpes region through the AISUA project [grant number 17011072 01 - 4102]. Declarations of interest: none.
References
 [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. In Proceedings of NIPS, NIPS ’17, pages 1141–1151, 2017.
 [2] E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. Clustering with Deep Learning: Taxonomy and New Methods. arXiv:1801.07648, 2018.
 [3] D. Arthur and S. Vassilvitskii. K-Means++: The Advantages of Careful Seeding. In Proceedings of SODA, SODA ’07, pages 1027–1035, 2007.
 [4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy Layer-Wise Training of Deep Networks. In Proceedings of NIPS, NIPS ’06, pages 153–160, 2006.
 [5] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The Fuzzy c-Means Clustering Algorithm. Computers & Geosciences, 10(2–3):191–203, 1984.
 [6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [7] D. Cai, X. He, and J. Han. Locally Consistent Concept Factorization for Document Clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
 [8] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep Adaptive Image Clustering. In Proceedings of ICCV, ICCV ’17, pages 5879–5887, 2017.
 [9] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. arXiv:1611.02648, 2017.
 [10] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization. In Proceedings of ICCV, ICCV ’17, pages 5736–5745, 2017.
 [11] X. Glorot and Y. Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of AISTATS, AISTATS ’10, 2010.
 [12] X. Guo, L. Gao, X. Liu, and J. Yin. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of IJCAI, IJCAI ’17, pages 1753–1759, 2017.
 [13] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
 [14] C.-C. Hsu and C.-W. Lin. CNN-Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large-Scale Image Data. IEEE Transactions on Multimedia, 20(2):421–429, 2018.
 [15] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning Discrete Representations via Information Maximizing Self-Augmented Training. In Proceedings of ICML, ICML ’17, pages 1558–1567, 2017.
 [16] P. Huang, Y. Huang, W. Wang, and L. Wang. Deep Embedding Network for Clustering. In Proceedings of ICPR, ICPR ’14, pages 1532–1537, 2014.
 [17] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, ICLR ’17, 2017.
 [18] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep Subspace Clustering Networks. In Proceedings of NIPS, NIPS ’17, pages 23–32, 2017.
 [19] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of IJCAI, IJCAI ’17, pages 1965–1972, 2017.
 [20] D. P. Kingma and J. L. Ba. Adam: a Method for Stochastic Optimization. In Proceedings of ICLR, ICLR ’15, 2015.
 [21] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, ICLR ’14, 2014.
 [22] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.

 [23] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
 [24] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, ICLR ’17, 2017.
 [25] V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of ICML, ICML ’10, pages 807–814, 2010.
 [26] X. Peng, J. Feng, J. Lu, W.-Y. Yau, and Z. Yi. Cascade Subspace Clustering. In Proceedings of AAAI, AAAI ’17, pages 2478–2484, 2017.
 [27] X. Peng, S. Xiao, J. Feng, W. Y. Yau, and Z. Yi. Deep Subspace Clustering with Sparsity Prior. In Proceedings of IJCAI, IJCAI ’16, pages 1925–1931, 2016.
 [28] K. Rose, E. Gurewitz, and G. Fox. A Deterministic Annealing Approach to Clustering. Pattern Recognition Letters, 11(9):589–594, 1990.
 [29] L. van der Maaten and G. Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 [30] N. X. Vinh, J. Epps, and J. Bailey. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research, 11:2837–2854, 2010.

 [31] J. Xie, R. Girshick, and A. Farhadi. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of ICML, ICML ’16, pages 478–487, 2016.
 [32] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of ICML, ICML ’17, pages 3861–3870, 2017.
 [33] J. Yang, D. Parikh, and D. Batra. Joint Unsupervised Learning of Deep Representations and Image Clusters. In Proceedings of CVPR, CVPR ’16, pages 5147–5156, 2016.
Appendix A Proof of Property 1
For , we recall the following property:
Property A.1
(condition (ii)) If is unique for all , then:
Proof: Let and let us assume that is a distance. One has:
As , one has:
Thus if and if .
Appendix B Alternative choice for
As plays the role of a closeness function for an object wrt representative , membership functions used in fuzzy clustering are potential candidates for . In particular, the membership function of the fuzzy c-Means algorithm [5] is a valid candidate according to conditions and . It takes the following form:
with defined on and (condition ) equal to 1. However, in addition to being slightly more complex than the parametrized softmax, this formulation presents the disadvantage that it may be undefined when a representative coincides with an object; another assumption (in addition to the uniqueness assumption) is required here to avoid such a case.
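For reference, the standard fuzzy c-Means membership function can be written as follows (a sketch; the notation x_i for objects, r_k for representatives, K for the number of clusters, and fuzziness exponent m > 1 is assumed here, as the paper's own symbols were not preserved in this version):

```latex
% Standard fuzzy c-Means membership of object x_i to cluster k,
% with representatives r_k and fuzziness exponent m > 1
u_k(x_i) \;=\; \Bigg[\, \sum_{k'=1}^{K} \bigg( \frac{\| x_i - r_k \|}{\| x_i - r_{k'} \|} \bigg)^{\!\frac{2}{m-1}} \Bigg]^{-1}
```

One can see directly from this expression that the membership is undefined whenever x_i coincides with some representative r_{k'}, which is the degenerate case mentioned above.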
Appendix C Annealing scheme for the inverse temperature in DKM
The scheme we used for the evolution of the inverse temperature in DKM is given by the following recursive sequence: with . The first 40 terms of are plotted in Figure 3.
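As a toy illustration of the effect of such annealing, the following Python sketch shows how increasing the inverse temperature in a parametrized softmax drives soft cluster assignments towards hard, one-hot ones (the distance values and the exact form of the closeness function are assumptions made for illustration, not taken from the paper):

```python
import math

def soft_assign(dists, alpha):
    """Parametrized softmax over distances to the cluster representatives,
    scaled by the inverse temperature alpha."""
    m = min(dists)  # subtract the min distance for numerical stability
    weights = [math.exp(-alpha * (d - m)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

# Distances of one point to 3 representatives (made-up values).
dists = [0.5, 1.0, 2.0]
for alpha in (0.001, 1.0, 100.0):
    print(alpha, [round(p, 3) for p in soft_assign(dists, alpha)])
# Low alpha gives near-uniform assignments; high alpha is near one-hot.
```

With a low inverse temperature the assignment is almost uniform over the clusters, while a high inverse temperature concentrates almost all mass on the closest representative, which is the behavior the annealing scheme exploits.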
Appendix D Evaluation measures
In our experiments, the clustering performance of the evaluated methods is measured with respect to two standard measures [7]: Normalized Mutual Information (NMI) and clustering accuracy (ACC). NMI is an information-theoretic measure based on the mutual information of the ground-truth classes and the obtained clusters, normalized using the entropy of each. Formally, let and denote the ground-truth classes and the obtained clusters, respectively. (resp. ) is the subset of data points from class (resp. cluster ). Let be the number of points in the dataset. The NMI is computed according to the following formula:
where corresponds to the mutual information between the partitions and , and is the entropy of .
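As a minimal illustration, NMI can be computed from the two label assignments as follows (a plain-Python sketch; note that several normalizations exist in the literature [30], and the geometric mean of the two entropies is assumed here):

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two label assignments,
    normalized by the geometric mean of the two entropies."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information from the joint and marginal counts.
    mi = sum((n_ab / n) * math.log(n * n_ab / (count_t[a] * count_p[b]))
             for (a, b), n_ab in joint.items())
    # Entropies of the two partitions.
    h_t = -sum((v / n) * math.log(v / n) for v in count_t.values())
    h_p = -sum((v / n) * math.log(v / n) for v in count_p.values())
    return mi / max(math.sqrt(h_t * h_p), 1e-12)
```

Identical partitions, up to a relabeling of the clusters, yield an NMI of 1, while independent partitions yield an NMI of 0.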
On the other hand, ACC measures the proportion of data points for which the obtained clusters can be correctly mapped to ground-truth classes, where the matching is based on the Hungarian algorithm [22]. Let and further denote the ground-truth class and the obtained cluster, respectively, to which data point is assigned. Then the clustering accuracy is defined as follows:
where denotes the indicator function: and ; is a mapping from cluster labels to class labels.
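A small Python sketch of ACC follows (for illustration only: the cluster-to-class mapping is brute-forced over permutations, assuming at most as many clusters as classes, whereas the Hungarian algorithm [22] solves this matching in polynomial time):

```python
from itertools import permutations

def clustering_accuracy(labels_true, labels_pred):
    """Best accuracy over all one-to-one mappings from cluster labels
    to class labels (assumes #clusters <= #classes)."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    n = len(labels_true)
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t
                   for c, t in zip(labels_pred, labels_true))
        best = max(best, hits)
    return best / n
```

In practice, the optimal mapping is obtained efficiently by running scipy.optimize.linear_sum_assignment on the negated contingency matrix between clusters and classes.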
We additionally report in this supplementary material the clustering performance wrt the adjusted Rand index (ARI) [30]. ARI counts the pairs of data points on which the classes and clusters agree or disagree, and is corrected for chance. Formally, ARI is given by:
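The pair-counting computation of ARI can be sketched as follows (plain Python, using the standard contingency-table formulation; this is an illustration, not the paper's code):

```python
from collections import Counter

def _comb2(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) // 2

def ari(labels_true, labels_pred):
    """Adjusted Rand index from pair counts, corrected for chance."""
    n = len(labels_true)
    # Pairs agreeing in both partitions (contingency-table cells).
    index = sum(_comb2(v)
                for v in Counter(zip(labels_true, labels_pred)).values())
    sum_classes = sum(_comb2(v) for v in Counter(labels_true).values())
    sum_clusters = sum(_comb2(v) for v in Counter(labels_pred).values())
    expected = sum_classes * sum_clusters / _comb2(n)
    max_index = (sum_classes + sum_clusters) / 2
    if max_index == expected:  # degenerate partitions (e.g. all singletons)
        return 1.0
    return (index - expected) / (max_index - expected)
```

An ARI of 1 indicates identical partitions, while a random partitioning has an expected ARI of 0 (negative values are possible).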
Appendix E Dataset statistics and optimal hyperparameters
We summarize in Table 3 the statistics of the different datasets used in the experiments, as well as the dataset-specific optimal values of the hyperparameter ( for DKM-based and DCN-based methods and for IDEC-based ones) which trades off between the reconstruction loss and the clustering loss. We recall that this optimal value was determined on a validation set, disjoint from the test set on which we reported results in the paper.
Dataset  MNIST  USPS  20NEWS  RCV1 

#Samples  70,000  9,298  18,846  10,000 
#Classes  10  10  20  4 
Dimensions  28×28  16×16  2,000  2,000 
1e-1  1e-1  1e-4  1e-4  
1e+0  1e+0  1e-1  1e-2  
1e+1  1e-1  1e-1  1e-1  
1e-2  1e-1  1e-4  1e-3  
1e-2  1e-3  1e-1  1e-3  
1e-3  1e-1  1e-3  1e-4 
Appendix F Additional results
The additional results given in this section have also been computed from 10 seeded runs whenever possible, and Student’s t-test was performed on those 10 samples.
F.1 ARI results
We report in Table 4 the results obtained by k-Means-related methods wrt the ARI measure on the datasets used in the paper. Similarly, Table 5 compares the results of the approaches based on DKM and IDEC in terms of ARI.
Model  MNIST  USPS  20NEWS  RCV1 
KM  36.6±0.1  53.5±0.1  7.6±0.9  20.6±2.8 
AEKM  69.4±1.8  63.2±1.5  31.0±1.6  23.9±4.3 
Deep clustering approaches without pretraining  
DCN  15.6±1.1  14.7±1.8  5.7±0.5  6.9±2.1 
DKM  73.6±3.1  66.3±4.9  26.7±1.5  20.7±4.4 
Deep clustering approaches with pretraining  
DCN  70.2±1.8  63.4±1.5  31.3±1.6  24.0±4.3 
DKM  75.0±1.8  68.5±1.8  33.9±1.5  26.5±4.9 
Model  MNIST  USPS  20NEWS  RCV1 

Deep clustering approaches without pretraining  
IDEC  49.1±3.0  40.2±5.1  9.8±1.5  28.5±5.3 
DKM  73.6±3.1  66.3±4.9  26.7±1.5  20.7±4.4 
Deep clustering approaches with pretraining  
IDEC  81.5±2.4  68.1±0.5  26.0±1.3  32.9±5.7 
DKM  75.0±1.8  68.5±1.8  33.9±1.5  26.5±4.9 
F.2 Cosine distance
The proposed Deep k-Means framework enables the use of different distance and similarity functions to compute the AE’s reconstruction error (based on function ) and the clustering loss (based on function ). In the paper, we adopted the Euclidean distance for both and . For the sake of comprehensiveness, we performed additional experiments on DKM using different such functions. In particular, we showcase in Table 6 the results obtained by choosing the cosine distance for and the Euclidean distance for (DKM and DKM) in comparison to using the Euclidean distance for both and (DKM and DKM), the latter corresponding to the approaches reported in the paper. Note that for DKM and DKM as well, the reported results correspond to those obtained with the optimal lambda determined on the validation set.
Model  ACC  NMI  ARI 

= Euclidean distance, = Euclidean distance  
DKM  44.8±2.4  42.8±1.1  26.7±1.5 
DKM  51.2±2.8  46.7±1.2  33.9±1.5 
= cosine distance, = Euclidean distance  
DKM  51.3±1.5  44.4±0.7  32.6±1.0 
DKM  51.0±2.6  45.1±1.2  33.0±1.1 
Model  MNIST  USPS  20NEWS  RCV1  

ACC  NMI  ACC  NMI  ACC  NMI  ACC  NMI  
AEKM  80.8±1.8  75.2±1.1  72.9±0.8  71.7±1.2  49.0±2.9  44.5±1.5  56.7±3.6  31.5±4.3 
DCN + KM  84.9±3.1  79.4±1.5  73.9±0.7  74.1±1.1  50.5±3.1  46.5±1.6  57.3±3.6  32.3±4.4 
DKM + KM  84.8±1.3  78.7±0.8  76.9±4.9  74.3±1.5  49.0±2.5  44.0±1.0  53.4±5.9  27.4±5.3 
DKM + KM  85.1±3.0  79.9±1.5  75.7±1.3  77.6±1.1  52.1±2.7  47.1±1.3  58.3±3.8  33.0±4.9 
F.3 k-Means-friendliness of learned representations
In addition to the previous experiments – which evaluate the clustering ability of the different approaches – we analyzed how effective it is to apply k-Means to the representations learned by DCN, DKM, and DKM in comparison to applying k-Means to the AE-based representations (i.e., AEKM). In other words, we evaluate the “k-Means-friendliness” of the learned representations. The results of this experiment are reported in Table 7. We can observe that on most datasets the representations learned by k-Means-related deep clustering approaches lead to significant improvements wrt AE-learned representations. This confirms that all these deep clustering methods truly bias the representations they learn towards clustering. Overall, although the difference is not statistically significant on all datasets/metrics, the representations learned by DKM are shown to be the most appropriate for k-Means. This is in line with the insight gathered from the previous experiments.