1 Introduction
The deep learning revolution has been fueled by the explosion of large-scale datasets with meaningful labels. Applications range from image classification [2] to text sentiment classification [18]. However, for many applications, we do not have meaningful data labels available. In this regime, deep autoencoders are gaining momentum [8] as a way to effectively map data to a low-dimensional feature space where data are more separable and hence more easily clustered [29].
Long-established methods for unsupervised clustering such as K-means and Gaussian mixture models (GMMs) are still the workhorses of many applications due to their simplicity. However, their distance measures are limited to local relations in the data space, and they tend to be ineffective for high-dimensional data that often has significant overlaps across clusters [29, 31].
Recently, there has been a surge of interest in developing more powerful clustering methods by leveraging deep neural networks. Various methods [29, 3, 31, 19] have been proposed to conduct clustering on the latent representations learned by (variational) autoencoders. Although these methods perform well in clustering, a weakness is that they use a single low-dimensional manifold to represent the data. As a consequence, for more complex data, the latent representations can be poorly separated. Learning good representations by leveraging the underlying structure of the data has been left largely unexplored and is the topic of our work. Our underlying assumption is that each data cluster is associated with a separate manifold. Therefore, modeling the dataset as a mixture of low-dimensional nonlinear manifolds is a natural and promising framework for clustering data generated from different categories.
In this paper we develop a novel deep architecture for multiple-manifold clustering. Manifold learning and clustering have a rich literature, with parametric estimation methods [4, 22] and spectral methods being the most common approaches [26, 17]. These methods require either a parametric model or distance metrics that capture the relationship among points in the dataset (or both). An autoencoder, on the other hand, identifies a nonlinear function mapping the high-dimensional points to a low-dimensional latent representation without any metric. While autoencoders are parametric in some sense, they are often trained with a large number of parameters, resulting in a high degree of flexibility in the final low-dimensional representation.
Our approach therefore is to use a MIXture of AutoEncoders (MIXAE), each of which should identify a nonlinear mapping suitable for a particular cluster. The autoencoders are trained simultaneously with a mixture assignment network via a composite objective function, thereby jointly encouraging low reconstruction error (for learning each manifold) and low cluster identification error. This kind of joint optimization has been shown to have good performance in other unsupervised architectures [31] as well. The main advantage of combining clustering with representation learning this way is that the two parts collaborate with each other to reduce the complexity and improve representation ability: the latent vector itself is a low-dimensional data representation that allows a much smaller classifier network for mixture assignment, and is itself learned to be well-separated by clustering (see Section 4).
Our contributions are: (i) a novel deep learning architecture for unsupervised clustering with a mixture of autoencoders, (ii) a joint optimization framework for simultaneously learning a union of manifolds and clustering assignment, and (iii) state-of-the-art performance on established large-scale benchmark datasets.
2 Related Work
The most fundamental method for clustering is the K-means algorithm [7], which assumes that the data are centered around some centroids, and seeks to find clusters that minimize the sum of squared distances to the centroid within each cluster. Instead of modeling each cluster with a single point (centroid), another approach called K-subspaces clustering assumes the dataset can be well approximated by a union of subspaces (linear manifolds); this field is well studied and a large body of work has been proposed [25, 5]. However, neither K-means nor K-subspaces clustering is designed to separate clusters that have nonlinear and non-separable structure.
Nonlinear manifold clustering has been studied as a more promising generalization of linear models and has an extensive literature [14, 6, 26, 27, 22, 28]. In particular, graph-based methods like spectral clustering [20, 17, 26] and its variants [30] are popular ways of handling nonlinearly separated clusters. However, these methods in general can be computationally intensive, and still have difficulties in separating clusters with intersecting regions.
Recent work [24] extends spectral clustering by replacing the eigenvector representation of data with the embeddings from a deep autoencoder. Since training an autoencoder is linear in the number of samples $N$, finding the embeddings is much more scalable than traditional spectral clustering, despite the difficulties in training autoencoders. However, this approach requires a normalized adjacency matrix as input, which is a heavy burden on both computation and memory for very large $N$.
Mixture models are a computationally scalable probabilistic approach to clustering that also allows for overlapping clusters. The most popular mixture model is the Gaussian Mixture Model (GMM), which assumes that data are generated from a mixture of Gaussian distributions with unknown parameters, and the parameters are optimized by the Expectation Maximization (EM) algorithm. However, as with the K-means method, GMMs require strong assumptions on the distribution of the data, which are often not satisfied in practice.
A recent stream of work has focused on optimizing a clustering objective over the low-dimensional feature space of an autoencoder [29] or a variational autoencoder [31, 3]. Notably, the Deep Embedded Clustering (DEC) model [29] iteratively minimizes the within-cluster KL-divergence and the reconstruction error. The Variational Deep Embedding (VaDE) [31] and Gaussian Mixture Variational Autoencoder (GMVAE) [3] models extend the DEC approach by training variational autoencoders, iteratively learning clusters and feature representation distribution parameters. In particular, [31] emphasizes a noticeable gain from training the autoencoder and the GMM components jointly rather than alternately, which is in the same spirit as our joint representation and clustering framework. A weakness in these models is that they require careful initialization of model parameters, and often exhibit separation of clusters before actual training even begins. In contrast, the proposed MIXAE model can be trained from scratch.
Similarly, the DLGMM model [16] and CVAE model [21] also combine variational autoencoders with GMMs for clustering, but are primarily used for different applications. Adversarial autoencoders [13] are another popular extension, and both variational and adversarial autoencoders are also popular for semi-supervised learning [13, 1]. However, adversarial models are reputedly difficult to train.
3 Clustering with Mixture of Autoencoders
We now describe our MIXture of AutoEncoders (MIXAE) model in detail, giving the intuition behind our customized architecture and specialized objective function.
3.1 Autoencoders
An autoencoder is a common neural network architecture used for unsupervised representation learning. An autoencoder consists of an encoder $E$ and a decoder $D$. Given the input data $x \in \mathbb{R}^n$, the encoder first maps $x$ to its latent representation $z = E(x) \in \mathbb{R}^d$, where typically $d < n$. The decoder then maps $z$ to a reconstruction $\tilde{x} = D(z) \in \mathbb{R}^n$, with reconstruction error $L(x, \tilde{x})$ measuring the deviation between $x$ and $\tilde{x}$. The parameters of the network are updated via backpropagation with the target of minimizing the reconstruction error. By restricting the latent space to lower dimensionality than the input space ($d < n$), the trained autoencoder parametrizes a low-dimensional nonlinear manifold that captures the data structure.
3.2 Mixture of Autoencoders (MIXAE) Model
Our goal is to cluster a collection of $N$ data points $\{x^{(i)}\}_{i=1}^{N}$ into $K$ clusters, under the assumption that data from each cluster is sampled from a different low-dimensional manifold. A natural choice is to use a separate autoencoder to model each data cluster, and thereby the entire dataset as a collection of autoencoders. The cluster assignment is performed with an additional neural network, which infers the cluster labels of each data sample based on the autoencoders’ latent features. Specifically, for each data sample $x^{(i)}$, this mixture assignment network takes the concatenation of the latent representations of each autoencoder, $z^{(i)} = (z_1^{(i)}, \dots, z_K^{(i)})$, as input, and outputs a probabilistic vector $p^{(i)} = [p_1^{(i)}, \dots, p_K^{(i)}]$ that infers the distribution of $x^{(i)}$ over the $K$ clusters, i.e., for $k = 1, \dots, K$,
$$p_k^{(i)} = \Pr\big(x^{(i)} \text{ belongs to cluster } k \,\big|\, z^{(i)}\big). \tag{1}$$
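The mixture assignment step described above can be sketched in a few lines. This is only a minimal illustration, assuming a single linear layer followed by a softmax; the function names, the toy weights, and the one-layer architecture are hypothetical and are not the network sizes used in the experiments.

```python
import math

def softmax(logits):
    """Numerically stable softmax: returns probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_assignment(latents, weights, bias):
    """Map the K per-autoencoder latent vectors to K cluster probabilities.

    latents: list of K latent vectors, one per autoencoder.
    weights: K rows, each as long as the concatenated latent vector
             (a hypothetical single linear layer, assumed for illustration).
    bias:    list of K floats.
    """
    # Concatenate z_1, ..., z_K into one input vector.
    z = [v for latent in latents for v in latent]
    logits = [sum(w * v for w, v in zip(row, z)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy example: K = 2 autoencoders, each with a 2-dimensional latent space.
latents = [[1.0, 0.0], [0.0, 0.5]]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
p = mixture_assignment(latents, weights, bias)
```

The softmax output is what makes the result a valid probability vector over clusters; any deeper network could replace the linear layer without changing that property.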
To jointly optimize the autoencoders and the mixture assignment network, we use a composite objective function consisting of three important components.
Weighted reconstruction error
The mixture aggregation is done in the weighted reconstruction error term
$$R(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} p_k^{(i)} \, L\big(x^{(i)}, \tilde{x}_k^{(i)}\big), \tag{2}$$
where $x^{(i)}$ is the $i$th data sample, $\tilde{x}_k^{(i)}$ is the reconstructed output of autoencoder $k$ for sample $i$, $L(\cdot, \cdot)$ is the reconstruction error, and $p_k^{(i)}$ are the soft probabilities from the mixture assignment network for sample $i$, calculated via (1). Typical choices for $L$ are squared errors and KL-divergence. Intuitively, (2) will achieve its minimum when the $p^{(i)}$’s are one-hot vectors and select the autoencoder with minimum reconstruction error for that sample.
Sample-wise entropy
To encourage sparse mixture assignment probabilities (so that each data sample ultimately receives one dominant label assignment) we add a sample entropy penalty:
$$S(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} p_k^{(i)} \log p_k^{(i)}. \tag{3}$$
Specifically, (3) achieves its minimum only if each $p^{(i)}$ is a one-hot vector, specifying a deterministic distribution. We refer to this as the sample-wise entropy.
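To make the behavior of the sample-wise entropy concrete, the following sketch evaluates it on a one-hot and on a uniform assignment vector. It assumes the natural logarithm and the convention that a zero probability contributes zero; the function names are illustrative only.

```python
import math

def sample_entropy(p):
    """Entropy -sum_k p_k log p_k of one soft assignment vector.

    Uses the natural log; terms with p_k = 0 contribute 0 by convention.
    """
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def samplewise_entropy(P):
    """Sum of per-sample entropies over a list of assignment vectors."""
    return sum(sample_entropy(p) for p in P)

one_hot = [0.0, 1.0, 0.0, 0.0]      # deterministic assignment
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally ambiguous assignment

h_one_hot = sample_entropy(one_hot)
h_uniform = sample_entropy(uniform)
h_total = samplewise_entropy([one_hot, uniform])
```

The one-hot vector attains the minimum (zero), while the uniform vector attains the maximum, the logarithm of the number of clusters.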
Batch-wise entropy
One trivial minimizer of the sample-wise entropy is for the mixture assignment neural network to output a constant one-hot vector for all input data, i.e., selecting a single autoencoder for all of the data. To avoid this local minimum, we encourage equal usage of all autoencoders via a batch-wise entropy term
$$BE(\theta) = -\sum_{k=1}^{K} \bar{p}_k \log \bar{p}_k, \qquad \bar{p}_k = \frac{1}{B} \sum_{i=1}^{B} p_k^{(i)}. \tag{4}$$
Here, $B$ is the minibatch size and $\bar{p}$ is the average soft cluster assignment over an entire minibatch. If (4) reaches its maximum value of $\log K$, then $\bar{p}_k = 1/K$ for all $k$, i.e., each cluster is selected with uniform probability. This is a valid assumption for a large enough minibatch, randomly selected over balanced data.
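The difference between the collapsed solution and a balanced one can be checked numerically. The sketch below, with illustrative names and the natural logarithm assumed, computes the entropy of the minibatch-averaged assignment for both cases.

```python
import math

def batch_entropy(P):
    """Entropy of the average soft assignment over a minibatch.

    P: list of B assignment vectors, each of length K.
    """
    B, K = len(P), len(P[0])
    p_bar = [sum(p[k] for p in P) / B for k in range(K)]
    return -sum(v * math.log(v) for v in p_bar if v > 0.0)

# Every sample is sent to the same autoencoder: the trivial minimizer
# of the sample-wise entropy.
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Each autoencoder is used equally often.
balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

be_collapsed = batch_entropy(collapsed)
be_balanced = batch_entropy(balanced)
```

Both batches have zero sample-wise entropy, but only the balanced batch reaches the maximal batch-wise entropy, which is why maximizing this term discourages collapse.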
Objective function
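Let $\theta$ be the parameters of the autoencoders and mixture assignment network. We minimize the composite cost function
$$\min_{\theta} \; R(\theta) + \alpha \, S(\theta) - \beta \, BE(\theta), \tag{5}$$
where $R$, $S$, and $BE$ denote the weighted reconstruction error (2), the sample-wise entropy (3), and the batch-wise entropy (4), and $\alpha, \beta \ge 0$ weight the two entropy terms.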
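A rough sketch of how the three terms combine is given below. It assumes squared error as the reconstruction loss, sums (rather than averages) over samples, and a sign convention in which the sample-wise entropy is penalized and the batch-wise entropy is rewarded; `alpha`, `beta`, and all function names are placeholders rather than the authors' implementation.

```python
import math

def entropy(p):
    """Shannon entropy (natural log); zero probabilities contribute 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def mixae_objective(X, X_hat, P, alpha, beta):
    """Composite cost for one minibatch.

    X:      list of samples.
    X_hat:  X_hat[i][k] is the reconstruction of sample i by autoencoder k.
    P:      P[i] is the soft cluster assignment of sample i.
    """
    n, K = len(X), len(P[0])
    recon = sum(P[i][k] * squared_error(X[i], X_hat[i][k])
                for i in range(n) for k in range(K))
    sample_ent = sum(entropy(p) for p in P)
    p_bar = [sum(p[k] for p in P) / n for k in range(K)]
    batch_ent = entropy(p_bar)
    total = recon + alpha * sample_ent - beta * batch_ent
    return total, recon, sample_ent, batch_ent

# Toy setup: autoencoder 0 always outputs 0, autoencoder 1 always outputs 1.
X = [[0.0], [1.0]]
X_hat = [[[0.0], [1.0]], [[0.0], [1.0]]]
good = mixae_objective(X, X_hat, [[1.0, 0.0], [0.0, 1.0]], 1.0, 1.0)
collapsed = mixae_objective(X, X_hat, [[1.0, 0.0], [1.0, 0.0]], 1.0, 1.0)
```

In this toy case the assignment that routes each sample to its best autoencoder scores lower than the collapsed assignment, both through the reconstruction term and through the batch-wise entropy term.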
An important consideration is the choice of the weights $\alpha$ and $\beta$ on the sample-wise and batch-wise entropy terms, which can significantly affect the final clustering quality. We adjust these parameters dynamically during the training process. Intuitively, initially we should prioritize batch-wise entropy and sample-wise entropy in order to encourage equal use of autoencoders while avoiding the case where all autoencoders are equally optimized for each input, i.e., the probabilistic vector characterizes a uniform distribution for each input. Asymptotically, we should prioritize minimizing the reconstruction error to promote better learning of the manifolds for each cluster, and minimizing sample-wise entropy to ensure assignment of every data sample to only one autoencoder. A simple and effective scheme is to use comparatively larger $\alpha$ and $\beta$ at the beginning of the training process, while adjusting $\alpha$ and $\beta$ at each epoch such that the three terms in the objective function are approximately equal as the training process goes on. Empirically, this produces better results than static choices of $\alpha$ and $\beta$.
4 Experimental Results
We evaluate our MIXAE on three datasets representing different applications: images, texts, and sensor outputs.
MNIST
The MNIST [11] dataset contains 70000 28×28 pixel images of handwritten digits (0, 1, …, 9), each cropped and centered.
Reuters
The original Reuters dataset contains about 810000 English news stories labeled by a category tree. Following [29], we choose four root categories (corporate/industrial, government/social, markets, and economics) as labels, and remove all of the documents that are labeled by multiple root categories, which results in a dataset with 685071 documents. We then compute the tf-idf features on the 2000 most frequent words.
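The tf-idf step can be sketched as follows. Many tf-idf variants exist and the exact one used here is not specified, so this example assumes term frequency normalized by document length and an unsmoothed logarithmic inverse document frequency; the tokenized toy documents and the function name are invented for illustration, and `vocab_size` would be 2000 in the setting above.

```python
import math
from collections import Counter

def tfidf_features(docs, vocab_size):
    """Compute tf-idf vectors over the vocab_size most frequent words.

    docs: list of documents, each a list of tokens.
    Returns (vocab, rows), where rows[i][j] is the tf-idf weight of
    vocab[j] in document i.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for s in doc_sets if w in s) for w in vocab}
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        rows.append([tf[w] / length * idf[w] for w in vocab])
    return vocab, rows

# Toy corpus (hypothetical tokens).
docs = [["stocks", "rise", "markets"],
        ["markets", "fall"],
        ["election", "vote", "markets"]]
vocab, rows = tfidf_features(docs, vocab_size=2)
```

A word that appears in every document, such as "markets" in this toy corpus, receives zero weight under this variant, which is the intended effect of the inverse document frequency.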
HHAR
The Heterogeneity Human Activity Recognition (HHAR, [23]) dataset contains 10299 samples of smartphone and smartwatch sensor time series feeds, each of length 561. There are 6 categories of human activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.
A summary of the dataset statistics is also provided in Table 1.
Network architecture
We use different autoencoders and mixture assignment network sizes for different datasets, summarized in Figure 2. We use convolutional autoencoders for MNIST and fully connected autoencoders for the other (non-image) datasets. For each dataset, we train MIXAE with the ADAM [9] optimizer, using TensorFlow. We use a decaying learning rate, starting from an initial value and reduced by a constant factor every fixed number of epochs.

Table 1: Dataset statistics. LC / SC: fraction of samples in the largest / smallest class.

| Dataset | Dim | Samples ($N$) | Classes ($K$) | LC / SC |
|---|---|---|---|---|
| MNIST | 784 | 70000 | 10 | 11% / 9% |
| Reuters | 2000 | 685071 | 4 | 41% / 9% |
| HHAR | 561 | 10299 | 6 | 19% / 14% |

Evaluation metric
Following DEC [29], the clustering accuracy of all algorithms is measured by the unsupervised clustering accuracy (ACC):
$$\mathrm{ACC} = \max_{m \in \mathcal{M}} \frac{\sum_{i=1}^{N} \mathbf{1}\{l_i = m(c_i)\}}{N}, \tag{6}$$
where $l_i$ is the ground-truth label, $c_i$ is the cluster assignment produced by the mixture assignment network, i.e., $c_i = \arg\max_k p_k^{(i)}$, and $\mathcal{M}$ is the set of all possible one-to-one mappings between clusters and labels. This coincides with the cluster purity when the majority-label mapping is itself one-to-one, and is a common metric in clustering (see also [29]). Finding the optimal mapping can be done efficiently using the Hungarian algorithm [15].
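The accuracy metric can be illustrated with a small brute-force version. This sketch enumerates every one-to-one mapping between clusters and labels, which is only practical for a handful of clusters; it stands in for the Hungarian algorithm, which solves the same assignment problem in polynomial time. The function name and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(labels, clusters, K):
    """Unsupervised clustering accuracy via exhaustive search.

    labels, clusters: equal-length lists of ints in range(K).
    Tries all K! one-to-one mappings from cluster index to label.
    """
    # counts[c][l] = number of samples in cluster c whose true label is l.
    counts = [[0] * K for _ in range(K)]
    for l, c in zip(labels, clusters):
        counts[c][l] += 1
    best = max(sum(counts[c][m[c]] for c in range(K))
               for m in permutations(range(K)))
    return best / len(labels)

labels   = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 2, 2, 0, 1]  # cluster ids are arbitrary; one sample is misplaced
acc = clustering_accuracy(labels, clusters, K=3)
```

Because cluster indices carry no meaning, the metric is invariant to relabeling the clusters: only the best matching between clusters and labels counts.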
4.1 Clustering results
An overall comparison of each clustering method is given in Table 2. As we can see, the deep learning models (DEC, VaDE and MIXAE) all perform much better than traditional machine learning methods (K-means and GMM). This suggests that using autoencoders to extract the latent features of the data and then clustering on these latent features is advantageous for these challenging datasets.
Comparing the deep learning models, we see that MIXAE outperforms DEC, a single manifold model, suggesting the advantage of a mixture of manifolds in cluster representability. Additionally, note that both DEC and VaDE use stacked autoencoders to pretrain their models, which can introduce significant computational overhead, especially if the autoencoders are deep. In contrast, MIXAE trains from a random initialization.
At the same time, though MIXAE achieves the same or better performance compared with VaDE on Reuters and HHAR, VaDE outperforms MIXAE and DEC on MNIST. In general it has been observed that variational autoencoders have better representability than deterministic autoencoders (e.g., [10]). This suggests that leveraging a mixture of variational autoencoders may further improve our model’s performance and is an interesting direction for future work.
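
Table 2: Clustering accuracy (ACC) of each method, and whether the method requires pretraining.

| Method | MNIST | Reuters | HHAR | Pretrain? |
|---|---|---|---|---|
| K-means | 53.5% | 53.3% | 60.1% | No |
| GMM | 53.7% | 55.8% | 60.3% | No |
| DEC [29] | 84.3% | 75.6% | 79.9% | Yes |
| VaDE [31] | 94.5% | 79.4% | 84.5% | Yes |
| MIXAE | 85.6% | 79.4% | 87.8% | No |
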
Figure 3 shows some samples grouped by cluster label. We see that MIXAE clusters a variety of writing styles well. There are some mistakes, but they are consistent with frequently observed mistakes in supervised classification (e.g., 4 and 9 confusion).
4.2 Latent space separability
Figure 4 shows the t-SNE projection of the concatenated latent vectors to 2D space. We see that as training progresses, the latent feature clusters become more and more separated, suggesting that the overall architecture encourages finding representations with better clustering performance.
4.3 Effect of balanced data
As we can see in Table 2, the deep clustering methods all have noticeably lower performance on Reuters (an unbalanced dataset) than on MNIST and HHAR (balanced datasets). We investigate the effect of balanced data on MIXAE in Table 3 and Figure 5.
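
Table 3: Batch-wise entropy (BE) and sample-wise entropy (SE) of the ground-truth labels and of the actual cluster assignments after training.

| Dataset | $\log K$ | Ground truth BE | Ground truth SE | Actual BE | Actual SE |
|---|---|---|---|---|---|
| MNIST | 2.31 | 2.30 | 0 | 2.28 | 0.026 |
| Reuters | 1.39 | 1.26 | 0 | 1.38 | 0.598 |
| HHAR | 1.79 | 1.79 | 0 | 1.77 | 0.054 |
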
In Table 3, for each dataset, we record the values of batch-wise entropy (BE) and sample-wise entropy (SE) over the entire dataset after training, and we compare them with the ground-truth entropies of the true labels. The batch entropy regularization (4) forces the final actual batch-wise entropy to be very close to the maximal value of $\log K$ for all three datasets.
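As a quick arithmetic check of the maximal entropy values, the sketch below evaluates the natural logarithm of the number of clusters for each dataset and compares the label entropy of a balanced and a skewed toy label set. The toy label proportions are invented for illustration and are not the actual Reuters class frequencies.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Number of clusters per dataset, as listed in the dataset summary.
num_clusters = {"MNIST": 10, "Reuters": 4, "HHAR": 6}
max_entropy = {name: math.log(K) for name, K in num_clusters.items()}

# Hypothetical label sets with 4 classes each.
balanced = [0, 1, 2, 3] * 25
skewed = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10

h_balanced = label_entropy(balanced)
h_skewed = label_entropy(skewed)
```

The logarithms come out to roughly 2.30, 1.39, and 1.79, in line with the tabulated maxima up to rounding. A balanced label set attains the maximum, whereas any skew lowers the entropy, which is why a regularizer that pushes the batch-wise entropy to its maximum is biased on unbalanced data.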
In other words, our batch entropy encourages the final cluster assignments to be uniform, which is valid for balanced datasets (MNIST and HHAR) but biased for unbalanced datasets (Reuters). Specifically, in Fig. 5, the sample covariance matrix of the true labels of Reuters has one dominant diagonal value, but the converged sample covariance matrix diagonal is much more even, suggesting that samples that should have gone to a dominant cluster are evenly (incorrectly) distributed to the other clusters.
Additionally, note that the converged sample-wise entropy (actual SE value) for Reuters is far from 0 (Table 3), suggesting that even after convergence, there is considerable ambiguity in cluster assignment.
4.4 Varying K
We also explore the clustering performance of MIXAE with more autoencoders than natural clusters; i.e., for MNIST, $K > 10$. In Figure 6, we plot the evolution of the three components of our objective function (5), as well as the final cluster purity. This purity is defined as the percentage of “correct” labels, where the “correct” label for a cluster is defined as the majority of the true labels for that cluster.
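The purity definition above translates directly into code. The following sketch, with illustrative names and toy data, assigns each cluster its majority true label and counts the fraction of samples that match; unlike a one-to-one matching, several clusters may share the same label, which is what makes the measure usable when there are more clusters than classes.

```python
from collections import Counter, defaultdict

def cluster_purity(labels, clusters):
    """Fraction of samples whose true label is the majority label
    of the cluster they were assigned to."""
    members = defaultdict(list)
    for l, c in zip(labels, clusters):
        members[c].append(l)
    correct = sum(Counter(ls).most_common(1)[0][1] for ls in members.values())
    return correct / len(labels)

# Two true classes split over three clusters: every cluster is pure.
purity_split = cluster_purity([0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 2, 2])
# One sample of class 1 falls into the cluster dominated by class 0.
purity_mixed = cluster_purity([0, 0, 1, 1], [0, 0, 0, 1])
```

Note that purity can only stay equal or increase as clusters are split further, so it should be read alongside the entropy terms rather than on its own.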
As we can see in Figure 7, the clustering accuracy for larger $K$ converges to higher values. One possible explanation is that with larger $K$, the final clusters split each digit group into more clusters, and this reduces the overlap in underlying manifolds corresponding to different digit groups. On the other hand, the sample-wise entropy no longer converges to 0, and the final probabilistic vectors are observed to have 2 or 3 significant nonzeros instead of only one; this suggests that the learned manifolds corresponding to each digit group may have certain overlap.
Figure 6(a) shows again the covariance matrices for MNIST, now with the larger values of $K$. Interestingly, here the final covariance diagonals are extremely uneven, suggesting that final cluster assignments are more and more unbalanced as we increase $K$. Since intuitively each digit group may have different magnitudes of variance in writing styles, this result is consistent with what we may expect. Figure 6(b) shows digit examples, sorted according to the finalized cluster assignments. Here, we see the emergence of “stylistic” clusters, with straight vs slanted 1’s, thin vs round 0’s, etc.
5 Conclusions and Future Work
In this paper, we introduce the MIXAE architecture that uses a combination of small autoencoders and a cluster assignment network to intelligently cluster unlabeled data. This is based on the assumption that data from each cluster is generated from a separate low-dimensional manifold, and thus the aggregate data is modeled as a mixture of manifolds. Using this model, we produce improved performance over deterministic deep clustering models on established datasets.
There are several interesting extensions to pursue. First, though we have improved performance on the unbalanced dataset over DEC, we still find Reuters a challenging dataset due to its imbalanced distribution over natural clusters. One potential improvement is to replace the batch entropy regularization with cross-entropy regularization, using knowledge about cluster sizes. However, knowing the sizes of clusters is not a realistic assumption in online machine learning. Additionally, this would force each autoencoder to take a pre-assigned cluster identity, which might negatively affect the training.
Another important extension is in the direction of variational and adversarial autoencoders. We have seen that in single autoencoder models, VaDE outperforms DEC, which they also attribute to a KL penalty term for encouraging cluster separation. This extension can be done in our model to encourage separation in the latent representation variables.
Our model also has an interesting interpretation in terms of dictionary learning, where a small set of basis vectors characterizes a structured high-dimensional dataset. Specifically, we can consider the manifolds learned by the autoencoders as “codewords” and the sample entropy applied to the mixture assignment as the sparse regularization. An interesting extension is to apply this model to multi-label clustering, to see if each autoencoder can learn distinctive atomic features of each data point, for example the components of an image or voice signal.
References
 [1] E. Abbasnejad, A. Dick, and A. v. d. Hengel. Infinite variational autoencoder for semi-supervised learning. arXiv preprint arXiv:1611.07800, 2016.
 [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [3] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
 [4] E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In Advances in neural information processing systems, pages 55–63, 2011.
 [5] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
 [6] A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas. Dimension induced clustering. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 51–60. ACM, 2005.
 [7] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
 [8] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [9] D. Kingma and J. Ba. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [12] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [13] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

 [14] P. Mordohai and G. G. Medioni. Unsupervised dimensionality estimation and manifold learning in high-dimensional spaces by tensor voting. In IJCAI, pages 798–803, 2005.
 [15] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.
 [16] E. Nalisnick, L. Hertel, and P. Smyth. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, 2016.

 [17] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
 [18] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, 2010.
 [19] X. Peng, J. Feng, S. Xiao, J. Lu, Z. Yi, and S. Yan. Deep sparse subspace clustering. arXiv preprint arXiv:1709.08374, 2017.
 [20] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
 [21] R. Shu, J. Brofos, F. Zhang, H. H. Bui, M. Ghavamzadeh, and M. Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, 2016.
 [22] R. Souvenir and R. Pless. Manifold clustering. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 648–653. IEEE, 2005.
 [23] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 127–140. ACM, 2015.
 [24] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.
 [25] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.
 [26] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
 [27] X. Wang, K. Slavakis, and G. Lerman. Riemannian multi-manifold modeling. arXiv preprint arXiv:1410.0095, 2014.
 [28] Y. Wang, Y. Jiang, Y. Wu, and Z.-H. Zhou. Multi-manifold clustering. In PRICAI, pages 280–291. Springer, 2010.
 [29] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
 [30] Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 19(10):2761–2773, 2010.
 [31] Y. Zheng, H. Tan, B. Tang, H. Zhou, et al. Variational deep embedding: A generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.