1 Introduction
Clustering is a fundamental task in unsupervised machine learning, and one with numerous applications. A key challenge for clustering in practice is the interdependence between the chosen representation of the data and the measured distances that drive clustering. For example, the classic and ubiquitous k-means algorithm assumes a fixed feature representation, distance metric, and underlying spherical cluster distribution. This set of assumptions leads to poor performance if k-means is applied directly to complex high-dimensional data such as raw image pixels, where the right representation for clustering is a highly nonlinear transformation of the input pixels. This observation motivates the vision of end-to-end deep clustering. Joint learning of the data representation and k-means clustering has the potential to learn a “k-means friendly” space in which high-dimensional data can be well clustered, without problem-specific hand-engineering of feature representations. More generally, unifying unsupervised clustering and representation learning has the potential to help alleviate the data annotation bottleneck of the standard supervised deep learning paradigm.
The key challenge in realising this deep clustering vision is the non-differentiability of the discrete cluster assignment in the k-means objective. Two recent methods, DEC [9] and DCN [11], attempt to address this issue by proposing surrogate losses and alternating optimisation heuristics, respectively. However, the surrogate loss used by DEC may not lead to the optimal solution of the k-means objective. Furthermore, it makes use of soft instance-cluster assignment, which is known to favour overlapping clusters compared to hard assignment methods [4], and, more importantly, does not provide the discrete assignments necessary for interpretability in some applications of k-means [4]. In contrast, DCN resorts to alternating optimisation rather than end-to-end gradient-based learning. This is suboptimal and, more importantly, restricts the ability to integrate clustering as a module in a larger backprop-driven deep network. In this paper we propose concrete k-means (CKM), the first end-to-end solution for optimising the true k-means objective jointly with representation learning. We achieve this by adapting the Gumbel-Softmax reparameterisation trick [3] to allow differentiation and backpropagation through the discrete cluster assignment. The CKM algorithm enables joint training of cluster centroids and the feature representation, and is easy and fast to optimise using standard deep learning optimisation methods. Furthermore, we show that, as a byproduct, CKM also provides a solver for shallow k-means with performance comparable to the standard k-means++ [1]. To summarise, our main contribution is the concrete k-means algorithm, the first joint end-to-end solution to the learning of clusters and representations in discrete-assignment deep k-means.
2 Related Work
Clustering methods aim to find subgroups of data that are related according to some distance metric or notion of density. The performance of distance-based clustering algorithms is highly dependent on the data representation, and the goal of deep clustering is to learn a representation of the data that best facilitates clustering. k-means is perhaps the most ubiquitous clustering method [6, 1, 10], and it is widely used due to its simplicity, efficacy, and the interpretability of its hard cluster assignments. For this reason, several attempts have been made to develop deep k-means generalisations. However, this is challenging because the hard assignment between data points and cluster centres required by the k-means objective is hard to reconcile with gradient-based end-to-end learning. Xie et al. [9] show how to jointly optimise an autoencoder and a k-means model to get a “k-means friendly” latent space. The hard assignment in the k-means objective prevents them from optimising the true loss function, so their DEC method makes use of an approximation based on soft assignment of instances to clusters. However, this surrogate objective means that the solution to their model is not necessarily a minimum of the k-means objective. In contrast, DCN [11] resolves the issue by alternating optimisation. Each minibatch of training data is first used to update the deep representation while the centroids are held constant, and then used to update the centroids while the representation is held constant. However, alternating optimisation may be slow and ineffective compared to an end-to-end solution. More importantly, it hampers integration of clustering as a module in a larger end-to-end deep learning system. In contrast to these methods, we show how one can jointly train a deep representation and cluster centroids with the standard k-means objective using backpropagation and conventional deep learning optimisers.

3 Concrete k-Means
We first introduce the conventional k-means model. Following this, we show how to adapt the k-means objective to train cluster centroids and a deep neural network representation simultaneously. We refer to this novel generalisation as Concrete k-Means (CKM), due to its use of the Concrete distribution [7].

3.1 Conventional k-Means
The k-means algorithm groups $n$ data points, $x_i$, from some space, $\mathcal{X}$, into $k$ different clusters parameterised by centroids, $c_j$, also from $\mathcal{X}$. By stacking each $c_j$, the centroids can be collectively represented as a matrix, $C$, where each row corresponds to a cluster centre. The k-means objective is to find the assignment and set of centroids that minimise the distance between each point and its associated centroid,

(1)  $\min_{A, C} \sum_{i=1}^{n} \lVert x_i - a_i C \rVert_2^2$
s.t. $a_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{k} a_{ij} = 1$

where $A$ is a binary matrix that represents the cluster assignments of each point, $a_i$ is the $i$th row of $A$, and $n$ is the number of data points.
The most common method for learning k-means clusters is Lloyd’s algorithm [6], which can be formalised as an alternating optimisation procedure. The first step is to find the optimal cluster assignments given the current cluster centres,

(2)  $A^{*} = \arg\min_{A} \sum_{i=1}^{n} \lVert x_i - a_i C \rVert_2^2$
s.t. $a_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{k} a_{ij} = 1$

The second step is to optimise with respect to the cluster centres, keeping the assignments fixed,

(3)  $C^{*} = \arg\min_{C} \sum_{i=1}^{n} \lVert x_i - a_i C \rVert_2^2$
Both of these optimisation problems permit closed form solutions: Equation 2 can be solved by finding the cluster centre closest to each data point, and Equation 3 is minimised when each cluster centre is set to the mean of its assigned data points. Lloyd’s algorithm alternates between finding locally optimal solutions to these two problems until the cluster assignments become stable.
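For reference, Lloyd’s alternating optimisation can be sketched in a few lines of pure Python (a minimal illustration with random initialisation; the function name is ours, not from the paper):

```python
import math
import random

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Alternating optimisation of the k-means objective (Equations 2 and 3):
    assign each point to its nearest centroid, then move each centroid to the
    mean of its assigned points, until the assignments stabilise."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iters):
        # Step 1 (Equation 2): closed-form optimal assignments given centroids.
        new_assignments = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable; a local optimum has been reached
        assignments = new_assignments
        # Step 2 (Equation 3): each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments
```

Run on two well-separated blobs, the algorithm recovers the obvious two-cluster partition after a handful of iterations.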
3.2 Deep k-Means with Concrete Gradients
3.2.1 A Regularised Deep Embedding Space
Deep k-means strategies aim to cluster the data in a learned embedding space, $\mathcal{Z}$, rather than the raw input space, $\mathcal{X}$. The embedding is defined via a learned neural network, and we will train it to support k-means clustering better than the original space. Following previous work [9, 11], we avoid degenerate solutions by defining an autoencoder that regularises the latent space by reconstructing the original input. Specifically, we define an encoder, $f_{\theta}: \mathcal{X} \to \mathcal{Z}$, which maps from the input space to the latent space, and a decoder, $g_{\phi}: \mathcal{Z} \to \mathcal{X}$, which maps from the latent space back to the input space. These networks are then composed and their parameters, $\theta$ and $\phi$, are trained to minimise the reconstruction error,

(4)  $\mathcal{L}_{\text{rec}} = \sum_{i=1}^{n} \lVert x_i - g_{\phi}(f_{\theta}(x_i)) \rVert_2^2$
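As a concrete reading of the reconstruction objective, the following pure-Python sketch computes the summed squared reconstruction error for arbitrary encoder and decoder callables (the function name is illustrative, not from the paper):

```python
def reconstruction_loss(xs, encode, decode):
    """Summed squared error between each input and its reconstruction
    through the composed encoder and decoder, i.e. g_phi(f_theta(x))."""
    total = 0.0
    for x in xs:
        x_hat = decode(encode(x))  # reconstruct via the latent space
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total
```

A lossless (identity) autoencoder attains zero error, while any mapping that discards information incurs a positive penalty, which is what prevents degenerate latent spaces.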
3.2.2 Differentiable Clustering in the Latent Space
The proposed algorithm performs clustering in the latent space, $\mathcal{Z}$, rather than the input space, $\mathcal{X}$. In the conventional k-means algorithm, a data point is assigned to the cluster with the nearest centroid, as measured by Euclidean distance. This hard assignment operation is not differentiable, thus precluding the direct use of standard gradient-based optimisation techniques for training neural networks. In order to perform the hard assignment operation while still obtaining gradient estimates for training with the k-means objective, we borrow the idea of the Straight-Through Gumbel-Softmax [3]. This reparameterisation trick enables the use of a probabilistic hard assignment during forward propagation, while also allowing gradients to be backpropagated through a soft assignment in order to train the network. We keep the Euclidean distance of the traditional k-means algorithm, and model cluster assignment probabilities using normalised radial basis functions (RBFs),

(5)  $\pi_{ij} = P(z_i = j) = \dfrac{\exp(-\lVert f_{\theta}(x_i) - c_j \rVert_2^2)}{\sum_{j'=1}^{k} \exp(-\lVert f_{\theta}(x_i) - c_{j'} \rVert_2^2)}$

where $z_i = j$ is the event that instance $i$ is assigned to cluster $j$, and we have omitted the dependence of $\pi_{ij}$ on the cluster centres, $C$, and network parameters, $\theta$, to keep notation compact. We would like to draw samples represented as one-hot vectors from $\pi_i$, while simultaneously being able to backpropagate through the sampling process. This can be accomplished by instead sampling from a Gumbel-Softmax distribution, a continuous relaxation of the distribution of one-hot encoded samples from $\pi_i$. By introducing Gumbel-distributed random variables, $g_{ij} \sim \text{Gumbel}(0, 1)$, one can make use of a reparameterisation trick to sample from the Gumbel-Softmax distribution,

(6)  $\hat{a}_{ij} = \dfrac{\exp((\log \pi_{ij} + g_{ij}) / \tau)}{\sum_{j'=1}^{k} \exp((\log \pi_{ij'} + g_{ij'}) / \tau)}$

where $\hat{a}_{ij}$ is the $j$th component of the vector $\hat{a}_i$, corresponding to instance $i$, and $\tau$ is a temperature hyperparameter used for controlling the entropy of the continuous relaxation. As $\tau$ goes to zero, the $\hat{a}_i$ converge towards true one-hot samples from $\pi_i$. In contrast, as $\tau$ goes to infinity, the $\hat{a}_i$ converge towards a uniform distribution. In practice, we start training with a high temperature and gradually anneal it towards zero as training progresses.
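To make the sampling procedure concrete, here is a small pure-Python sketch (standard library only; function names are our own) of the RBF assignment probabilities and a temperature-controlled Gumbel-Softmax sample:

```python
import math
import random

def assignment_probs(z_embed, centroids):
    """Normalised RBF assignment probabilities for one embedded point;
    returns log pi_j for numerical stability (log-sum-exp trick)."""
    neg_sq = [-sum((a - b) ** 2 for a, b in zip(z_embed, c)) for c in centroids]
    m = max(neg_sq)
    log_z = m + math.log(sum(math.exp(s - m) for s in neg_sq))
    return [s - log_z for s in neg_sq]

def gumbel_softmax_sample(log_probs, tau, rng):
    """Relaxed one-hot sample: perturb log-probabilities with Gumbel(0, 1)
    noise, divide by the temperature tau, and apply a softmax."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in log_probs]
    scores = [(l + g) / tau for l, g in zip(log_probs, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

At a low temperature the sampled vector is nearly one-hot, while at a high temperature it spreads towards uniform, matching the limiting behaviour described above.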
At test time, the argmax of $\pi_i$ is taken, rather than sampling via Equation 6. The $\hat{a}_i$ vectors can be discretised by rounding the largest component to one, and all others to zero, giving a truly discrete sample distributed according to $\pi_i$. We denote the discretisation of $\hat{a}_i$ by $\tilde{a}_i$. With this notation, we define the concrete k-means loss as

(7)  $\mathcal{L}_{\text{CKM}} = \sum_{i=1}^{n} \lVert f_{\theta}(x_i) - \tilde{a}_i C \rVert_2^2$

noting that $\tilde{a}_i$ is a row vector. During the forward propagation, $\tilde{a}_i$ is used for evaluating the k-means loss. During the backward pass, the gradient is estimated by backpropagating through the same loss, but parameterised by $\hat{a}_i$ instead of $\tilde{a}_i$. This method of computing gradients for one-hot encoded categorical variables is known as the straight-through Gumbel-Softmax estimator [3], or the concrete estimator [7].

3.2.3 Summary
To train our Concrete k-Means, we optimise the main CKM objective in Equation 7 along with the autoencoder objective, with respect to the encoder and decoder parameters as well as the cluster centres. The full objective is

(8)  $\mathcal{L} = \mathcal{L}_{\text{CKM}} + \lambda \mathcal{L}_{\text{rec}}$

where $\lambda$ is a regularisation strength hyperparameter. The stochastic computational graph [8] in Figure 2 illustrates the flow of information during training for both the forward and backward passes. Dashed arrows indicate the flow of gradients, and solid arrows are activations computed during forward propagation. The red dashed arrows represent the gradients estimated by our method, which would typically be blocked by hard assignment, or generated by soft assignment in other methods.
In practice, pretraining the feature extractor using the autoencoder reconstruction before jointly training the full objective improves the final clustering solution. The algorithm and architecture for training deep CKM are outlined in Algorithm 1 and Figure 1 respectively.
3.2.4 Shallow Concrete k-Means
Our algorithm is motivated by the vision of joint clustering and representation learning. Nevertheless, it is worth noting that, as a byproduct, it provides a novel optimisation strategy for the conventional k-means objective in Equation 1. We simply run CKM on the raw features, which can be interpreted as fixing the encoder and decoder to the identity function, and solve Equation 8 for the centroids alone. Thus we use stochastically estimated gradients to solve conventional k-means by gradient descent rather than alternating minimisation [6].
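A deliberately simplified pure-Python sketch of this shallow variant follows. It assumes plain SGD on one point at a time, samples hard assignments with the Gumbel-max trick (equivalent to sampling from the RBF probabilities of Equation 5), and, for brevity, ignores the gradient component that flows through the soft assignment probabilities, so it illustrates the idea rather than reproducing the exact CKM estimator:

```python
import math
import random

def shallow_ckm(points, k, lr=0.05, steps=3000, seed=0):
    """Gradient-descent k-means with stochastic hard assignments.
    Each step: sample a point, sample a hard cluster for it via Gumbel-max
    over the (unnormalised) RBF logits, then take an SGD step on the
    squared distance to the sampled centroid."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(steps):
        x = points[rng.randrange(len(points))]
        # Negative squared distances serve as logits; normalisation is
        # unnecessary for Gumbel-max sampling.
        logits = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        # argmax of logits + Gumbel(0, 1) noise is a sample from softmax(logits).
        j = max(range(k),
                key=lambda i: logits[i] - math.log(-math.log(rng.random())))
        # SGD step on ||x - c_j||^2 for the sampled centroid only.
        centroids[j] = [c + 2 * lr * (a - c) for a, c in zip(x, centroids[j])]
    return centroids
```

On well-separated data the stochastically updated centroids drift towards the two cluster means, mirroring how the full CKM solves the shallow objective without alternating minimisation.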
4 Experiments
In this section, we evaluate CKM in conventional shallow and deep clustering.
4.1 Shallow Clustering
The concrete k-means method presented in Section 3.2 does not require the presence of a feature extraction network, and can thus be used to optimise the k-means objective in the ‘shallow’ setting where Lloyd’s algorithm [6] and k-means++ [1] are typically applied. Our first experiment aims to confirm whether CKM’s gradient-based stochastic optimisation matches the performance of the standard k-means solvers. Table 1 reports the clustering results of our shallow CKM and sklearn’s k-means++ implementation on ten UCI datasets. The evaluation metrics used for these experiments are normalised mutual information (NMI) [2], adjusted Rand index (ARI) [12], and cluster purity (ACC). The values of ACC and NMI are rescaled to lie between zero and one, with higher values indicating better performance. The range of the ARI is negative one to one. We can see that CKM performs comparably to the standard k-means optimiser.

Table 1:
            Shallow CKM                        k-means++
            NMI        ARI        ACC          NMI        ARI        ACC
pendigits   0.50±0.04  0.33±0.05  0.49±0.05    0.51±0.04  0.34±0.05  0.49±0.05
dig44       0.33±0.04  0.20±0.05  0.40±0.05    0.33±0.04  0.20±0.05  0.40±0.05
vehicle     0.15±0.03  0.09±0.03  0.40±0.03    0.15±0.03  0.09±0.03  0.40±0.03
letter      0.35±0.01  0.13±0.01  0.26±0.01    0.35±0.01  0.13±0.01  0.25±0.01
segment     0.41±0.05  0.27±0.05  0.46±0.04    0.41±0.05  0.27±0.05  0.46±0.04
waveform    0.35±0.04  0.27±0.04  0.57±0.05    0.36±0.01  0.25±0.01  0.52±0.02
vowel       0.41±0.01  0.21±0.01  0.36±0.02    0.42±0.01  0.21±0.01  0.36±0.02
spambase    0.10±0.03  0.09±0.05  0.66±0.04    0.10±0.03  0.09±0.05  0.66±0.04
twonorm     0.84±0.00  0.91±0.00  0.98±0.00    0.84±0.01  0.91±0.01  0.98±0.00
sat         0.58±0.05  0.48±0.08  0.64±0.07    0.58±0.05  0.48±0.08  0.64±0.07
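Of the three metrics, cluster purity is the simplest to state. A small sketch of how it can be computed (our own helper, not from the paper): each cluster votes for its majority ground-truth label, and purity is the fraction of points matching their cluster's vote.

```python
from collections import Counter

def cluster_purity(assignments, labels):
    """Cluster purity: for each predicted cluster, count the points that
    share that cluster's most common ground-truth label, then divide the
    total by the number of points."""
    clusters = {}
    for a, y in zip(assignments, labels):
        clusters.setdefault(a, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in clusters.values())
    return correct / len(labels)
```

For example, with assignments [0, 0, 0, 1, 1, 1] against labels [0, 0, 1, 1, 1, 1], the first cluster contributes 2 of 3 correct and the second 3 of 3, giving a purity of 5/6.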
4.2 Deep Clustering
4.2.1 Datasets and Settings
Datasets We conduct deep clustering experiments using the following datasets from the image and natural language domains. MNIST [5] consists of 70,000 greyscale images of handwritten digits. There are 10 classes and each image is 28×28 pixels, with the digits appearing inside the central 20×20 pixel area. USPS is a dataset of 16×16 pixel handwritten digit images. The first 7,291 images are designated as the training fold, and the remaining 2,007 are used for evaluating the final performance of the models. 20Newsgroups was generated by collecting a total of 18,846 posts from 20 different newsgroups. We use the same preprocessing as [11], where the tf-idf representations of the 2,000 most frequently occurring words are used as features.
Architecture Like most deep clustering methods (e.g., [11] and [9]), our approach involves pretraining an autoencoder before optimising the clustering objective. The encoder architecture used for the clustering experiments on MNIST and USPS contains four fully connected layers with 500, 500, 2000, and 10 units, respectively. For the 20Newsgroup experiments, the smaller encoder with 250, 100, and 20 units described by [11]
is used. The decoder that maps the hidden representation back to the input space is the mirror version of the encoder.
Competitors Comparisons are made with DEC [9] and DCN [11], as well as two simple baselines. KM applies classic shallow k-means to the raw input features from $\mathcal{X}$. AE+KM performs two-step dimensionality reduction and clustering by training an autoencoder with the same architecture as CKM to embed instances into the latent space $\mathcal{Z}$, and then fixes this space before applying classic k-means clustering.
4.2.2 Results
In Table 2 we present the results of our CKM and the comparison with other models. For all experiments in this section, $k$ is set to the number of classes present in the dataset. We run each method 15 times with different initial random seeds and report the mean and standard deviation of each result.
From the results, we can see that all the deep methods outperform shallow k-means on raw features (KM), and furthermore all the jointly trained methods outperform the two-step baseline (AE+KM). Compared to the published state-of-the-art methods, our CKM approach generally performs best on MNIST, and comparably to DEC and DCN on USPS and 20Newsgroups. Importantly, our CKM is the only high-performing method to combine the favourable properties of hard assignment, which is important for interpretability in many applications [4], and end-to-end deep learning, which is important for integrating clustering functionality as a module into a larger backpropagation-driven system.
Runtime Efficiency: Comparing the three deep clustering methods, DCN’s alternating optimisation is slower than the end-to-end DEC and CKM. For MNIST, the wall-clock time per epoch is 11, 10, and 36 seconds for CKM, DEC, and DCN respectively.

5 Conclusion
This paper proposes the concrete k-means deep clustering framework. Our stochastic hard assignment method is able to estimate gradients of the non-differentiable k-means loss function with respect to the cluster centres. This, in turn, enables end-to-end training of a neural network feature extractor and a set of cluster centroids in the resulting latent space. Our experimental results show that the proposed method is competitive with state-of-the-art approaches for deep clustering problems.
References
[1] D. Arthur and S. Vassilvitskii (2007) k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms.
[2] D. Cai, X. He and J. Han (2010) Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering 23 (6), pp. 902–913.
[3] E. Jang, S. Gu and B. Poole (2017) Categorical reparameterization with Gumbel-Softmax. In ICLR.
[4] M. Kearns, Y. Mansour and A. Y. Ng (1997) An information-theoretic analysis of hard and soft assignment methods for clustering. In UAI.
[5] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[6] S. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137.
[7] C. J. Maddison, A. Mnih and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR.
[8] J. Schulman, N. Heess, T. Weber and P. Abbeel (2015) Gradient estimation using stochastic computation graphs. In NIPS.
[9] J. Xie, R. Girshick and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML.
[10] J. Xu and K. Lange (2019) Power k-means clustering. In ICML.
[11] B. Yang, X. Fu, N. D. Sidiropoulos and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In ICML.
[12] K. Y. Yeung and W. L. Ruzzo (2001) Details of the adjusted Rand index and clustering algorithms, supplement to the paper “An empirical study on principal component analysis for clustering gene expression data”. Bioinformatics 17 (9), pp. 763–774.