Deep clustering with concrete k-means

by   Boyan Gao, et al.

We address the problem of simultaneously learning a k-means clustering and deep feature representation from unlabelled data, which is of interest due to the potential of deep k-means to outperform traditional two-step feature extraction and shallow-clustering strategies. We achieve this by developing a gradient-estimator for the non-differentiable k-means objective via the Gumbel-Softmax reparameterisation trick. In contrast to previous attempts at deep clustering, our concrete k-means model can be optimised with respect to the canonical k-means objective and is easily trained end-to-end without resorting to alternating optimisation. We demonstrate the efficacy of our method on standard clustering benchmarks.




1 Introduction

Clustering is a fundamental task in unsupervised machine learning, and one with numerous applications. A key challenge for clustering in practice is the inter-dependence between the chosen representation of the data and the measured distances that drive clustering. For example, the classic and ubiquitous k-means algorithm assumes a fixed feature representation, distance metric, and underlying spherical cluster distribution. This set of assumptions leads to poor performance if k-means is applied directly to complex high-dimensional data such as raw image pixels, where the right representation for clustering is a highly non-linear transformation of the input pixels. This observation motivates the vision of end-to-end deep clustering. Joint learning of data representation and k-means clustering has the potential to learn a “k-means friendly” space in which high-dimensional data can be well clustered, without problem-specific hand-engineering of feature representations. More generally, unifying unsupervised clustering and representation learning has the potential to help alleviate the data annotation bottleneck in the standard supervised deep learning paradigm.

The key challenge in realising this deep clustering vision is the non-differentiability of the discrete cluster assignment in the k-means objective. Two recent methods, DEC [9] and DCN [11], attempt to address this issue by proposing surrogate losses and alternating optimisation heuristics, respectively. However, the surrogate loss used by DEC may not lead to the optimal solution of the k-means objective. Furthermore, it makes use of soft instance-cluster assignment, which is known to favour overlapping clusters compared to hard assignment methods [4], and more importantly does not provide the discrete assignments necessary for interpretability in some applications of k-means [4]. In contrast, DCN resorts to alternating optimisation rather than end-to-end gradient-based learning. This is sub-optimal and, more importantly, restricts the ability to integrate clustering as a module in a larger backprop-driven deep network. In this paper we propose concrete k-means (CKM), the first end-to-end solution to solving the true k-means objective jointly with representation learning. We achieve this by adapting the Gumbel-Softmax reparameterisation trick [3] to allow differentiation and backpropagation through the discrete cluster assignment. This CKM algorithm enables joint training of cluster centroids and feature representation, and is easy and fast to optimise using standard deep learning optimisation methods. Furthermore, we show that, as a byproduct, CKM also provides a solver for shallow k-means with comparable performance to the standard k-means++ [1].

To summarise, our main contribution is the concrete k-means algorithm, the first joint end-to-end solution to the learning of clusters and representations in discrete-assignment deep k-means.

2 Related Work

Clustering methods aim to find subgroups of data that are related according to some distance metric or notion of density. The performance of distance-based clustering algorithms is highly dependent on the data representation, and the goal of deep clustering is to learn a representation of the data that best facilitates clustering. The k-means algorithm is perhaps the most ubiquitous clustering method [6, 1, 10], and it is widely used due to its simplicity, efficacy, and the interpretability of its hard cluster assignment. For this reason, several attempts have been made to develop deep k-means generalisations. However, this is challenging because the hard assignment between data points and cluster centres required by the k-means objective is difficult to reconcile with gradient-based end-to-end learning. Xie et al. [9] show how to jointly optimise an autoencoder and a k-means model to get a “k-means friendly” latent space. The hard assignment in the k-means objective prevents them from optimising the true loss function, so their DEC method makes use of an approximation based on soft assignment of instances to clusters. However, this surrogate objective means that the solution to their model is not necessarily a minimum of the k-means objective. In contrast, DCN [11] resolves the issue by alternating optimisation. Each minibatch of training data is first used to update the deep representation while keeping the centroids held constant, and then used to update the centroids while holding the representation constant. However, alternating optimisation may be slow and ineffective compared to an end-to-end solution. More importantly, it hampers integration of clustering as a module in a larger end-to-end deep learning system. In contrast to these methods, we show how one can jointly train a deep representation and cluster centroids with the standard k-means objective using backpropagation and conventional deep learning optimisers.

3 Concrete k-Means

We first introduce the conventional k-means model. Following this, we show how to adapt the k-means objective to train cluster centroids and a deep neural network representation simultaneously. We refer to this novel generalisation as Concrete k-Means (CKM), due to the use of the Concrete distribution [7].

3.1 Conventional k-Means

The k-means algorithm groups data points, $x_i$ from some space $\mathcal{X}$, into $k$ different clusters parameterised by centroids, $\mu_j$, also from $\mathcal{X}$. By stacking each $\mu_j$, the centroids can be collectively represented as a matrix, $M$, where each row corresponds to a cluster centre. The k-means objective is to find the assignment and set of centroids that minimise the distance between each point and its associated centroid,

$$\min_{M, Z} \sum_{i=1}^{n} \| x_i - z_i M \|_2^2, \tag{1}$$

where $Z \in \{0, 1\}^{n \times k}$ is a binary matrix, with exactly one non-zero entry per row, that represents the cluster assignments of each point, $z_i$ is the $i$th row of $Z$, and $n$ is the number of data points.

The most common method for learning k-means clusters is Lloyd’s algorithm [6], which can be formalised as an alternating optimisation problem. The first step is to find the optimal cluster assignments, $Z$ (with rows $z_i$), given the current cluster centres, $M$,

$$Z \leftarrow \arg\min_{Z} \sum_{i=1}^{n} \| x_i - z_i M \|_2^2. \tag{2}$$

The second step is to optimise with respect to the cluster centres, keeping the assignments fixed,

$$M \leftarrow \arg\min_{M} \sum_{i=1}^{n} \| x_i - z_i M \|_2^2. \tag{3}$$

Both of these optimisation problems permit closed-form solutions: Equation 2 can be solved by finding the cluster centre closest to each data point, and Equation 3 is minimised when each cluster centre is set to the mean of its assigned data points. Lloyd’s algorithm alternates between finding locally optimal solutions to these two problems until the cluster assignments become stable.
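The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names `lloyd_kmeans` and `kmeans_objective`, the initialisation from randomly chosen data points, and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Alternate the assignment and centroid updates until assignments are stable."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (an assumed, simple choice).
    M = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        sq_dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        # Centroid step: mean of the points assigned to each cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                M[j] = members.mean(axis=0)
    return M, labels


def kmeans_objective(X, M, labels):
    """Sum of squared distances between each point and its assigned centroid."""
    return float(((X - M[labels]) ** 2).sum())
```

Here the integer vector `labels` plays the role of the one-hot assignment matrix: `M[labels]` selects the same rows that multiplying each one-hot row by the centroid matrix would.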

3.2 Deep k-Means with Concrete Gradients

3.2.1 A Regularised Deep Embedding Space

Deep k-means strategies aim to cluster the data in a learned embedding space, $\mathcal{H}$, rather than the raw input space, $\mathcal{X}$. The embedding is defined via a learned neural network, $f_\theta$, and we will train it to support k-means clustering better than the original space. Following previous work [9, 11], we avoid degenerate solutions by defining an autoencoder that regularises the latent space by reconstructing the original input. Specifically, we define an encoder, $f_\theta : \mathcal{X} \to \mathcal{H}$, which maps from the input space to the latent space, and a decoder, $g_\phi : \mathcal{H} \to \mathcal{X}$, which maps from the latent space back to the input space. These networks are then composed and their parameters, $\theta$ and $\phi$, are trained to minimise the reconstruction error,

$$\mathcal{L}_{AE}(\theta, \phi) = \sum_{i=1}^{n} \| g_\phi(f_\theta(x_i)) - x_i \|_2^2. \tag{4}$$
Figure 1: Illustration of the concrete k-means architecture. An input item $x$ is embedded by the encoder $f_\theta$, and clusters are learned in this low-dimensional latent space. The decoder $g_\phi$ regularises the latent space.

3.2.2 Differentiable Clustering in the Latent Space

The proposed algorithm performs clustering in the latent space, $\mathcal{H}$, rather than the input space, $\mathcal{X}$. In the conventional k-means algorithm, a data point is assigned to the cluster with the nearest centroid, as measured by Euclidean distance. The hard assignment operation is not differentiable, thus precluding the direct use of standard gradient-based optimisation techniques for training neural networks. In order to both perform the hard assignment operation and obtain gradient estimates for training the k-means objective, we borrow the idea of Straight-Through Gumbel-Softmax [3]. This reparameterisation trick enables the use of a probabilistic hard assignment during the forward propagation, while also allowing gradients to be backpropagated through a soft assignment in order to train the network. We keep the Euclidean distance of the traditional k-means algorithm, and model cluster assignment probabilities using normalised radial basis functions (RBFs),

$$p(E_{i,j}) = \frac{\exp(-\| f_\theta(x_i) - \mu_j \|_2^2)}{\sum_{l=1}^{k} \exp(-\| f_\theta(x_i) - \mu_l \|_2^2)}, \tag{5}$$

where $E_{i,j}$ is the event that instance $i$ is assigned to cluster $j$, and $\pi_{i,j} = p(E_{i,j})$, and we have omitted the dependence of $\pi_{i,j}$ on the cluster centres, $M$, and network parameters, $\theta$, to keep notation compact. We would like to draw samples represented as one-hot vectors from $\pi_i$, while simultaneously being able to backpropagate through the sampling process. This can be accomplished by instead sampling from a Gumbel-Softmax distribution, a continuous relaxation of the distribution of one-hot encoded samples from $\pi_i$. By introducing Gumbel distributed random variables, $g_{i,j} \sim \mathrm{Gumbel}(0, 1)$, one can make use of a reparameterisation trick to sample from the Gumbel-Softmax distribution,

$$\hat{z}_{i,j} = \frac{\exp((\log \pi_{i,j} + g_{i,j}) / \tau)}{\sum_{l=1}^{k} \exp((\log \pi_{i,l} + g_{i,l}) / \tau)}, \tag{6}$$

where $\hat{z}_{i,j}$ is the $j$th component of the vector $\hat{z}_i$, corresponding to instance $i$, and $\tau$ is a temperature hyperparameter used for controlling the entropy of the continuous relaxation. As $\tau$ goes to zero, $\hat{z}_i$ converges towards true one-hot samples from $\pi_i$. In contrast, as $\tau$ goes to infinity, the $\hat{z}_i$ converge towards a uniform distribution. In practice, we start training with a high temperature and gradually anneal it towards zero as training progresses.
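To make the sampling step concrete, the following NumPy sketch computes normalised RBF assignment probabilities from latent codes and centroids, and then draws relaxed one-hot samples at a given temperature. The function names, the small constants added for numerical stability, and the use of inverse-CDF sampling for the Gumbel noise are choices made for this illustration and are not taken from the paper.

```python
import numpy as np


def rbf_assignment_probs(H, M):
    """Normalised RBF probabilities: softmax over negative squared distances."""
    sq_dists = ((H[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dists
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def gumbel_softmax_sample(probs, tau, rng):
    """Draw one relaxed one-hot sample per row at temperature tau."""
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=probs.shape)
    g = -np.log(-np.log(u))
    scores = (np.log(probs + 1e-12) + g) / tau
    scores -= scores.max(axis=1, keepdims=True)  # stabilise the exponentials
    z_soft = np.exp(scores)
    return z_soft / z_soft.sum(axis=1, keepdims=True)
```

Calling `gumbel_softmax_sample` with a small `tau` yields rows that are close to one-hot, while a very large `tau` yields rows that are close to uniform, matching the limiting behaviour described above.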

At test time, the argmax of $\pi_i$ is taken, rather than sampling via $\hat{z}_i$. The vectors $\hat{z}_i$ can be discretised by rounding the largest component to one, and all others to zero, giving a truly discrete sample distributed according to $\pi_i$. We denote the discretisation of $\hat{z}_i$ by $z_i$. With this notation, we define the concrete k-means loss as

$$\mathcal{L}_{CKM}(M, \theta) = \sum_{i=1}^{n} \| f_\theta(x_i) - z_i M \|_2^2, \tag{7}$$

noting that $z_i$ is a row vector. During the forward propagation, $z_i$ is used for evaluating the k-means loss. During the backward pass, the gradient is estimated by back-propagating through the same loss, but parameterised by $\hat{z}_i$ instead of $z_i$. This method of computing gradients for one-hot encoded categorical variables is known as the straight-through Gumbel-Softmax estimator [3], or the concrete estimator [7].
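The forward half of this procedure can be illustrated with NumPy. The sketch below covers only the discretisation and the loss evaluation with hard assignments; the backward pass needs an automatic differentiation framework and is not shown. The names `discretise` and `ckm_forward_loss` are hypothetical, and the idiom mentioned in the comment is the common way such estimators are written in practice, not necessarily how the authors implemented theirs.

```python
import numpy as np


def discretise(z_soft):
    """Round the largest component of each row to one and the rest to zero."""
    z_hard = np.zeros_like(z_soft)
    z_hard[np.arange(len(z_soft)), z_soft.argmax(axis=1)] = 1.0
    return z_hard


def ckm_forward_loss(H, z_hard, M):
    """k-means loss with hard assignments: sum_i ||h_i - z_i M||^2."""
    # In an autodiff framework, a straight-through estimator is commonly
    # written as z_soft + stop_gradient(z_hard - z_soft), so that the value
    # equals z_hard while gradients flow through z_soft.
    return float(((H - z_hard @ M) ** 2).sum())
```

For example, with three clusters, a relaxed sample of `[0.2, 0.7, 0.1]` is discretised to `[0, 1, 0]`, so the corresponding latent code is compared against the second centroid only.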

3.2.3 Summary

To train our concrete k-means model, we optimise the main CKM objective in Equation 7 along with the autoencoder, with respect to the encoder and decoder parameters as well as the cluster centres. The full objective is

$$\min_{M, \theta, \phi} \; \mathcal{L}_{CKM}(M, \theta) + \lambda \, \mathcal{L}_{AE}(\theta, \phi), \tag{8}$$

where $\lambda$ is a regularisation strength hyperparameter. The stochastic computational graph [8] in Figure 2 illustrates the flow of information during training for both the forward and backward passes. Dashed arrows indicate the flow of gradients, and solid arrows are activations computed during the forward propagation. The red dashed arrows represent the gradients estimated by our method that would typically be blocked by hard assignment or generated by soft assignment in other methods.

In practice, pretraining the feature extractor using the autoencoder reconstruction loss before jointly training the full objective improves the final clustering solution. The algorithm and architecture for training deep CKM are outlined in Algorithm 1 and Figure 1, respectively.

Figure 2: A computational graph view of the information flow for the concrete k-means algorithm. Solid arrows indicate computation during forward propagation, and dashed arrows indicate gradient flow during backpropagation. The red dashed arrows show which gradients are computed by the concrete gradient estimator.


3.2.4 Shallow Concrete k-Means

Our algorithm is motivated by the vision of joint clustering and representation learning. Nevertheless, it is worth noting that, as a byproduct, it provides a novel optimisation strategy for the conventional k-means objective in Equation 1. We simply run CKM on raw features, which can be interpreted as fixing the encoder and decoder to the identity function, and solve Equation 8 for the centroids alone. Thus we use stochastically estimated gradients to solve conventional k-means by gradient descent rather than alternating minimisation [6].

4 Experiments

In this section, we evaluate CKM in conventional shallow and deep clustering.

4.1 Shallow Clustering

The concrete k-means method presented in Section 3.2 does not require the presence of a feature extraction network, and can thus be used to optimise the k-means objective in the ‘shallow’ setting where Lloyd’s algorithm [6] and k-means++ [1] are typically applied. Our first experiment aims to confirm whether the CKM gradient-based stochastic optimisation matches the performance of the standard k-means solvers. Table 1 reports the clustering results of our shallow CKM and scikit-learn’s k-means++ implementation on ten UCI datasets. The evaluation metrics used for these experiments are normalised mutual information (NMI) [2], adjusted Rand index (ARI) [12], and cluster purity (ACC). The values of ACC and NMI are rescaled to lie between zero and one, with higher values indicating better performance. The range of the ARI is negative one to one. We can see that CKM performs comparably to the standard k-means optimiser.

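| Dataset | Shallow CKM NMI | Shallow CKM ARI | Shallow CKM ACC | k-means++ NMI | k-means++ ARI | k-means++ ACC |
|---|---|---|---|---|---|---|
| pendigits | 0.50 ± 0.04 | 0.33 ± 0.05 | 0.49 ± 0.05 | 0.51 ± 0.04 | 0.34 ± 0.05 | 0.49 ± 0.05 |
| dig44 | 0.33 ± 0.04 | 0.20 ± 0.05 | 0.40 ± 0.05 | 0.33 ± 0.04 | 0.20 ± 0.05 | 0.40 ± 0.05 |
| vehicle | 0.15 ± 0.03 | 0.09 ± 0.03 | 0.40 ± 0.03 | 0.15 ± 0.03 | 0.09 ± 0.03 | 0.40 ± 0.03 |
| letter | 0.35 ± 0.01 | 0.13 ± 0.01 | 0.26 ± 0.01 | 0.35 ± 0.01 | 0.13 ± 0.01 | 0.25 ± 0.01 |
| segment | 0.41 ± 0.05 | 0.27 ± 0.05 | 0.46 ± 0.04 | 0.41 ± 0.05 | 0.27 ± 0.05 | 0.46 ± 0.04 |
| waveform | 0.35 ± 0.04 | 0.27 ± 0.04 | 0.57 ± 0.05 | 0.36 ± 0.01 | 0.25 ± 0.01 | 0.52 ± 0.02 |
| vowel | 0.41 ± 0.01 | 0.21 ± 0.01 | 0.36 ± 0.02 | 0.42 ± 0.01 | 0.21 ± 0.01 | 0.36 ± 0.02 |
| spambase | 0.10 ± 0.03 | 0.09 ± 0.05 | 0.66 ± 0.04 | 0.10 ± 0.03 | 0.09 ± 0.05 | 0.66 ± 0.04 |
| twonorm | 0.84 ± 0.00 | 0.91 ± 0.00 | 0.98 ± 0.00 | 0.84 ± 0.01 | 0.91 ± 0.01 | 0.98 ± 0.00 |
| sat | 0.58 ± 0.05 | 0.48 ± 0.08 | 0.64 ± 0.07 | 0.58 ± 0.05 | 0.48 ± 0.08 | 0.64 ± 0.07 |

Table 1: Shallow CKM uses gradient estimation to solve the standard fixed-feature k-means problem as well as the conventional alternating-minimisation-based k-means++ [1] implemented in scikit-learn.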
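For readers unfamiliar with the metrics, the sketch below shows one way to compute cluster purity and NMI from ground-truth classes and predicted cluster indices. It is an illustration only: NMI has several normalisation conventions, and the arithmetic mean of the two entropies used here is an assumption that may differ from the definition in [2]. All function names are hypothetical.

```python
import numpy as np


def contingency(labels_true, labels_pred):
    """Count matrix with rows for true classes and columns for predicted clusters."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    counts = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(counts, (t, p), 1)
    return counts


def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    counts = contingency(labels_true, labels_pred)
    return float(counts.max(axis=0).sum() / counts.sum())


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies (assumed)."""
    joint = contingency(labels_true, labels_pred)
    joint /= joint.sum()
    pt, pp = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pt, pp)[nz])).sum()
    denom = 0.5 * (entropy(pt) + entropy(pp))
    return float(mi / denom) if denom > 0 else 1.0
```

Both metrics are invariant to how the cluster indices are numbered, which is why a clustering that recovers the classes under a permuted labelling still scores one.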

4.2 Deep Clustering

4.2.1 Datasets and Settings

Datasets: We conduct deep clustering experiments using the following datasets from the image and natural language domains. MNIST [5] consists of 70,000 greyscale images of handwritten digits. There are 10 classes and each image is $28 \times 28$ pixels, with the digits appearing inside the central $20 \times 20$ pixel area. USPS is a dataset of $16 \times 16$ pixel handwritten digit images. The first 7,291 images are designated as the training fold, and the remaining 2,007 are used for evaluating the final performance of the models. 20Newsgroups was generated by collecting a total of 18,846 posts over 20 different newsgroups. We use the same preprocessing as [11], where the tf-idf representation of the 2,000 most frequently occurring words is used as features.

Architecture: Like most deep clustering methods (e.g., [11] and [9]), our approach involves pretraining an autoencoder before optimising the clustering objective. The encoder architecture used for the clustering experiments on MNIST and USPS contains four fully connected layers with 500, 500, 2000, and 10 units, respectively. For the 20Newsgroups experiments, the smaller encoder with 250, 100, and 20 units described by [11] is used. The decoder that maps the hidden representation back to the input space is the mirror version of the encoder.
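As a rough illustration of this architecture, the NumPy sketch below builds the four-layer encoder and its mirrored decoder and evaluates the reconstruction error on a batch. Only the layer widths come from the text; the ReLU activations, the Gaussian weight initialisation, the linear output layers, and all function names are assumptions made for the example.

```python
import numpy as np

ENCODER_UNITS = [500, 500, 2000, 10]  # layer widths quoted for MNIST and USPS


def init_layers(sizes, rng):
    """One (weights, bias) pair per consecutive pair of layer sizes."""
    return [
        (rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(layers, x):
    """Fully connected stack with ReLU on every layer except the last (assumed)."""
    for idx, (W, b) in enumerate(layers):
        x = x @ W + b
        if idx < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x


def build_autoencoder(input_dim, rng):
    """Encoder with the quoted widths and a decoder that mirrors it."""
    enc_sizes = [input_dim] + ENCODER_UNITS
    dec_sizes = enc_sizes[::-1]
    return init_layers(enc_sizes, rng), init_layers(dec_sizes, rng)


def reconstruction_error(encoder, decoder, X):
    """Sum of squared differences between inputs and their reconstructions."""
    H = forward(encoder, X)
    X_hat = forward(decoder, H)
    return float(((X_hat - X) ** 2).sum()), H
```

For flattened MNIST images, `build_autoencoder(784, rng)` gives a 10-dimensional latent code per image, which is the space in which the centroids are learned.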

Competitors: Comparisons are made with DEC [9] and DCN [11], as well as some simple baselines. KM applies classic shallow k-means to raw input features from $\mathcal{X}$. AE+KM performs two-step dimensionality reduction and clustering by training an autoencoder with the same architecture as CKM to embed instances into the latent space, $\mathcal{H}$, and then fixes this space before applying classic k-means clustering.

4.2.2 Results

| Method | MNIST NMI | MNIST ARI | MNIST ACC | USPS NMI | USPS ARI | USPS ACC | 20Newsgroups NMI | 20Newsgroups ARI | 20Newsgroups ACC |
|---|---|---|---|---|---|---|---|---|---|
| KM | 51.8 ± 0.4 | 36.5 ± 0.4 | 53.3 ± 0.5 | 60 ± 0.7 | 44 ± 0.9 | 58 ± 0.9 | 22.7 ± 1.7 | 8.0 ± 1.4 | 22.6 ± 2.1 |
| AE+KM | 74.3 ± 0.9 | 66.9 ± 0.8 | 80.6 ± 1.2 | 68.1 ± 0.3 | 59.4 ± 0.4 | 68.4 ± 0.7 | 42.0 ± 1.7 | 28.3 ± 1.2 | 44.3 ± 2.3 |
| DEC | 80.4 ± 1.3 | 76.3 ± 1.8 | 84.2 ± 1.7 | 72.6 ± 1.1 | 63.8 ± 0.9 | 71.1 ± 2.5 | 48.6 ± 1.2 | 35.4 ± 1.4 | 49.1 ± 2.5 |
| DCN | 81.7 ± 1.1 | 75.2 ± 1.2 | 83.1 ± 1.9 | 71.9 ± 1.2 | 61.9 ± 1.4 | 73.9 ± 0.8 | 44.7 ± 1.5 | 34.4 ± 1.3 | 46.3 ± 2.9 |
| CKM | 81.4 ± 1.8 | 77.7 ± 1.1 | 85.4 ± 2.1 | 70.7 ± 0.2 | 61.3 ± 0.2 | 72.1 ± 0.4 | 46.5 ± 1.4 | 34.1 ± 1.6 | 47.3 ± 2.3 |

Table 2: Deep clustering results on MNIST, USPS and 20Newsgroups. Only our concrete k-means combines interpretable hard assignments with end-to-end learning by backpropagation.

Table 2 reports the results of CKM alongside the competing models. For all experiments in this section, $k$ is set to the number of classes present in the dataset. We run each method 15 times with different initial random seeds and report the mean and standard deviation of each result.

From the results, we can see that all the deep methods outperform shallow k-means on raw features (KM), and furthermore all the jointly trained methods outperform the two-step baseline (AE+KM). Compared to the published state-of-the-art methods, our CKM approach achieves the best ARI and ACC on MNIST, and performs comparably to DEC and DCN on USPS and 20Newsgroups. Importantly, CKM is the only high-performing method to combine the favourable properties of hard assignment, which is important for interpretability in many applications [4], and end-to-end deep learning, which is important for integrating clustering functionality as a module into a larger backpropagation-driven system.

Runtime Efficiency:

Comparing the three deep clustering methods, DCN’s alternating optimisation is slower than the end-to-end DEC and CKM. For MNIST, the clock time per epoch is 11, 10, and 36 for CKM, DEC and DCN respectively.

5 Conclusion

This paper proposes the concrete k-means deep clustering framework. Our stochastic hard assignment method is able to estimate gradients of the non-differentiable k-means loss function with respect to cluster centres. This, in turn, enables end-to-end training of a neural network feature extractor and a set of cluster centroids in this latent space. Our experimental results show that the proposed method is competitive with state-of-the-art approaches for solving deep clustering problems.


  • [1] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms.
  • [2] D. Cai, X. He, and J. Han (2010) Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering 23 (6), pp. 902–913.
  • [3] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-Softmax. In ICLR.
  • [4] M. Kearns, Y. Mansour, and A. Y. Ng (1997) An information-theoretic analysis of hard and soft assignment methods for clustering. In UAI.
  • [5] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • [6] S. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137.
  • [7] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR.
  • [8] J. Schulman, N. Heess, T. Weber, and P. Abbeel (2015) Gradient estimation using stochastic computation graphs. In NIPS.
  • [9] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML.
  • [10] J. Xu and K. Lange (2019) Power k-means clustering. In ICML.
  • [11] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In ICML.
  • [12] K. Y. Yeung and W. L. Ruzzo (2001) Details of the adjusted rand index and clustering algorithms, supplement to the paper an empirical study on principal component analysis for clustering gene expression data. Bioinformatics 17 (9), pp. 763–774.