The deep learning revolution has been fueled by the explosion of large-scale datasets with meaningful labels. Applications range from image classification to text sentiment classification. However, for many applications, meaningful data labels are not available. In this regime, deep autoencoders are gaining momentum as a way to effectively map data to a low-dimensional feature space where data are more separable and hence more easily clustered.
Long-established methods for unsupervised clustering such as K-means and Gaussian mixture models (GMMs) are still the workhorses of many applications due to their simplicity. However, their distance measures are limited to local relations in the data space, and they tend to be ineffective for high-dimensional data that often have significant overlaps across clusters [29, 31].
Recently, there has been a surge of interest in developing more powerful clustering methods by leveraging deep neural networks. Various methods [31, 3, 19] have been proposed to conduct clustering on the latent representations learned by (variational) autoencoders. Although these methods perform well in clustering, a weakness is that they use a single low-dimensional manifold to represent the data. As a consequence, for more complex data, the latent representations can be poorly separated. Learning good representations by leveraging the underlying structure of the data has been left largely unexplored and is the topic of our work. Our underlying assumption is that each data cluster is associated with a separate manifold. Therefore, modeling the dataset as a mixture of low-dimensional nonlinear manifolds is a natural and promising framework for clustering data generated from different categories.
In this paper we develop a novel deep architecture for multiple-manifold clustering. Manifold learning and clustering have a rich literature, with parametric estimation methods [4, 22] and spectral methods [26, 17] being the most common approaches. These methods require either a parametric model or distance metrics that capture the relationship among points in the dataset (or both). An autoencoder, on the other hand, identifies a nonlinear function mapping the high-dimensional points to a low-dimensional latent representation without any metric; and while autoencoders are parametric in some sense, they are often trained with a large number of parameters, resulting in a high degree of flexibility in the final low-dimensional representation.
Our approach therefore is to use a MIXture of AutoEncoders (MIXAE), each of which should identify a nonlinear mapping suitable for a particular cluster. The autoencoders are trained simultaneously with a mixture assignment network via a composite objective function, thereby jointly penalizing reconstruction error (for learning each manifold) and cluster assignment error. This kind of joint optimization has been shown to perform well in other unsupervised architectures as well. The main advantage of combining clustering with representation learning in this way is that the two parts collaborate to reduce complexity and improve representation ability: the latent vector is itself a low-dimensional data representation that allows a much smaller classifier network for mixture assignment, and it is in turn learned to be well separated for clustering (see Section 4).
Our contributions are: (i) a novel deep learning architecture for unsupervised clustering with mixture of autoencoders, (ii) a joint optimization framework for simultaneously learning a union of manifolds and clustering assignment, and (iii) state-of-the-art performance on established benchmark large-scale datasets.
2 Related Work
The most fundamental method for clustering is the K-means algorithm, which assumes that the data are centered around some centroids, and seeks clusters that minimize the sum of squared distances to the centroid within each cluster. Instead of modeling each cluster with a single point (centroid), another approach called K-subspaces clustering assumes the dataset can be well approximated by a union of subspaces (linear manifolds); this field is well studied and a large body of work exists [25, 5]. However, neither K-means nor K-subspaces clustering is designed to separate clusters that have nonlinear and non-separable structure. Graph-based methods like spectral clustering [20, 17, 26] and its variants are popular ways of handling nonlinearly separated clusters. However, these methods are in general computationally intensive, and still have difficulties separating clusters with intersecting regions.
Recent work extends spectral clustering by replacing the eigenvector representation of the data with the embeddings from a deep autoencoder. Since training an autoencoder is linear in the number of samples, finding the embeddings is much more scalable than traditional spectral clustering, despite the difficulties of training autoencoders. However, this approach requires a normalized adjacency matrix as input, which is a heavy burden on both computation and memory for very large datasets.
Mixture models are a computationally scalable probabilistic approach to clustering that also allows for overlapping clusters. The most popular mixture model is the Gaussian Mixture Model (GMM), which assumes that data are generated from a mixture of Gaussian distributions with unknown parameters, and the parameters are optimized by the Expectation Maximization (EM) algorithm. However, as with the K-means method, GMMs require strong assumptions on the distribution of the data, which are often not satisfied in practice.
A recent stream of work has focused on optimizing a clustering objective over the low-dimensional feature space of an autoencoder or a variational autoencoder [31, 3]. Notably, the Deep Embedded Clustering (DEC) model iteratively minimizes the within-cluster KL-divergence and the reconstruction error. The Variational Deep Embedding (VaDE) and Gaussian Mixture Variational Autoencoder (GMVAE) models extend the DEC approach by training variational autoencoders, iteratively learning clusters and the parameters of the feature representation distribution. In particular, one of these works emphasizes a noticeable gain from training the autoencoder and the GMM components jointly rather than alternately, which shares the spirit of our joint representation and clustering framework. A weakness of these models is that they require careful initialization of model parameters, and often exhibit separation of clusters before actual training even begins. In contrast, the proposed MIXAE model can be trained from scratch.
Adversarial autoencoders are another popular extension, and are also popular for semi-supervised learning [13, 1]. However, adversarial models are notoriously difficult to train.
3 Clustering with Mixture of Autoencoders
We now describe our MIXture of AutoEncoders (MIXAE) model in detail, giving the intuition behind our customized architecture and specialized objective function.
3.1 Autoencoder

An autoencoder is a common neural network architecture used for unsupervised representation learning, consisting of an encoder $E$ and a decoder $D$. Given input data $x \in \mathbb{R}^d$, the encoder first maps $x$ to its latent representation $z = E(x) \in \mathbb{R}^{d'}$, where typically $d' < d$. The decoder then maps $z$ to a reconstruction $\hat{x} = D(z)$, with the reconstruction error measuring the deviation between $x$ and $\hat{x}$. The parameters of the network are updated via backpropagation to minimize the reconstruction error. By restricting the latent space to lower dimensionality than the input space ($d' < d$), the trained autoencoder parametrizes a low-dimensional nonlinear manifold that captures the data structure.
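As a toy illustration of the encoder-decoder mapping and reconstruction error, here is a minimal numpy sketch; the random weight matrices `W_enc`/`W_dec` and the tanh/linear layer choice are stand-ins for a trained architecture, not the one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_latent = 8, 2                  # input and latent dimensions (d_latent < d)

# Random weights -- stand-ins for trained encoder/decoder parameters.
W_enc = rng.normal(size=(d, d_latent))
W_dec = rng.normal(size=(d_latent, d))

def encode(x):
    """Map input x to its low-dimensional latent representation z = E(x)."""
    return np.tanh(x @ W_enc)

def decode(z):
    """Map latent z back to a reconstruction x_hat = D(z) in the input space."""
    return z @ W_dec

def reconstruction_error(x):
    """Squared-error deviation between x and its reconstruction."""
    x_hat = decode(encode(x))
    return float(np.sum((x - x_hat) ** 2))

x = rng.normal(size=d)
z = encode(x)                       # lives in the lower-dimensional latent space
err = reconstruction_error(x)       # >= 0; training minimizes this via backprop
```

Training would adjust `W_enc` and `W_dec` by gradient descent on `err`; here they remain random, so the error is merely well-defined rather than small.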
3.2 Mixture of Autoencoders (MIXAE) Model
Our goal is to cluster a collection of data points $\{x_i\}$ into $K$ clusters, under the assumption that data from each cluster are sampled from a different low-dimensional manifold. A natural choice is to use a separate autoencoder to model each data cluster, and thereby model the entire dataset as a collection of $K$ autoencoders. Cluster assignment is performed by an additional neural network, which infers the cluster label of each data sample based on the autoencoders' latent features. Specifically, for each data sample $x_i$, this mixture assignment network takes the concatenation of the latent representations $z_i^{(1)}, \ldots, z_i^{(K)}$ from the $K$ autoencoders as input, and outputs a probability vector $p_i$ that infers the distribution of $x_i$ over the $K$ clusters, i.e.,

$$p_i^{(k)} = \Pr(x_i \in \text{cluster } k), \quad k = 1, \ldots, K. \tag{1}$$
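A minimal sketch of this assignment step in numpy, using a hypothetical single linear layer followed by a softmax in place of the full mixture assignment network:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax, producing a probability vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

K, d_latent = 3, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(K * d_latent, K))   # hypothetical assignment-net weights

def mixture_assignment(latents):
    """latents: list of K latent vectors, one per autoencoder.
    Returns the probability vector p over the K clusters, as in (1)."""
    concat = np.concatenate(latents)     # input is the concatenation of all latents
    return softmax(concat @ W)

p = mixture_assignment([rng.normal(size=d_latent) for _ in range(K)])
# p is a valid distribution over the K autoencoders/clusters.
```

In the actual model this network is deeper and is trained jointly with the autoencoders; the sketch only fixes the input/output interface.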
To jointly optimize the autoencoders and the mixture assignment network, we use a composite objective function consisting of three important components.
Weighted reconstruction error
The mixture aggregation is done in the weighted reconstruction error term

$$\mathcal{L}_{\text{rec}} = \sum_{i=1}^{M} \sum_{k=1}^{K} p_i^{(k)} \, d\big(x_i, \hat{x}_i^{(k)}\big), \tag{2}$$

where $x_i$ is the $i$-th data sample, $\hat{x}_i^{(k)}$ is the reconstructed output of autoencoder $k$ for sample $i$, $d(\cdot, \cdot)$ is the reconstruction error, and $p_i = (p_i^{(1)}, \ldots, p_i^{(K)})$ are the soft probabilities from the mixture assignment network for sample $i$, calculated via (1). Typical choices for $d$ are the squared error and the KL-divergence. Intuitively, (2) achieves its minimum when the $p_i$'s are one-hot vectors that select, for each sample, the autoencoder with minimum reconstruction error.
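The per-sample term in (2), and the intuition that a one-hot $p_i$ selecting the lowest-error autoencoder is optimal, can be checked directly (toy error values assumed):

```python
import numpy as np

def weighted_reconstruction_error(p, errors):
    """One sample's term in (2): sum_k p[k] * d(x, x_hat_k),
    where errors[k] is autoencoder k's reconstruction error for this sample."""
    return float(np.dot(p, errors))

errors = np.array([0.9, 0.1, 0.5])       # hypothetical per-autoencoder errors

# One-hot on the best autoencoder beats any blended assignment:
one_hot_best = np.array([0.0, 1.0, 0.0])
uniform = np.full(3, 1 / 3)
assert weighted_reconstruction_error(one_hot_best, errors) <= \
       weighted_reconstruction_error(uniform, errors)
```

Since the term is linear in `p`, its minimum over the probability simplex is always attained at a vertex, i.e., a one-hot vector, which is what drives the cluster selection.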
To motivate sparse mixture assignment probabilities (so that each data sample ultimately receives one dominant label assignment) we add a sample entropy deterrent:

$$\mathcal{L}_{\text{samp}} = \frac{1}{M} \sum_{i=1}^{M} H(p_i), \qquad H(p) = -\sum_{k=1}^{K} p^{(k)} \log p^{(k)}. \tag{3}$$

Specifically, (3) achieves its minimum only if each $p_i$ is a one-hot vector, specifying a deterministic distribution. We refer to this as the sample-wise entropy.
One trivial minimizer of the sample-wise entropy is for the mixture assignment network to output a constant one-hot vector for all input data, i.e., to select a single autoencoder for all of the data. To avoid this local minimum, we motivate equal usage of all autoencoders via a batch-wise entropy term

$$\mathcal{L}_{\text{batch}} = H(\bar{p}), \qquad \bar{p} = \frac{1}{M} \sum_{i=1}^{M} p_i. \tag{4}$$

Here, $M$ is the minibatch size and $\bar{p}$ is the average soft cluster assignment over an entire minibatch. If (4) reaches its maximum value of $\log K$, then $\bar{p} = (1/K, \ldots, 1/K)$, i.e., each cluster is selected with uniform probability. This is a valid assumption for a large enough minibatch, randomly selected over balanced data.
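The two entropy terms and their extremal values (minimum 0 for a one-hot sample assignment in (3), maximum log K for uniform batch usage in (4)) can be sketched as:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_k p_k log p_k (eps guards log 0)."""
    p = np.asarray(p)
    return float(-np.sum(p * np.log(p + eps)))

K, M = 4, 8
one_hot = np.eye(K)[0]
# Sample-wise entropy (3) is minimized (= 0) by a one-hot assignment:
assert abs(entropy(one_hot)) < 1e-9

# Batch-wise entropy (4): entropy of the average assignment over a minibatch.
batch_p = np.eye(K)[np.arange(M) % K]      # each autoencoder used equally often
p_bar = batch_p.mean(axis=0)               # uniform average assignment
assert abs(entropy(p_bar) - np.log(K)) < 1e-6   # maximum value log K
```

Note the tension between the two terms: (3) pushes each row of `batch_p` toward a vertex, while (4) pushes their average toward the center of the simplex, so minimizing one while maximizing the other spreads one-hot assignments across all autoencoders.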
Let $\theta$ denote the parameters of the autoencoders and the mixture assignment network. We minimize the composite cost function

$$\min_{\theta} \; \mathcal{L}_{\text{rec}} + \alpha \, \mathcal{L}_{\text{samp}} - \beta \, \mathcal{L}_{\text{batch}}. \tag{5}$$
An important consideration is the choice of $\alpha$ and $\beta$, which can significantly affect the final clustering quality. We adjust these parameters dynamically during the training process. Intuitively, we should initially prioritize the batch-wise and sample-wise entropies in order to encourage equal use of the autoencoders while avoiding the case where all autoencoders are equally optimized for each input, i.e., where the probability vector is uniform for every input. Asymptotically, we should prioritize minimizing the reconstruction error, to promote better learning of the manifolds for each cluster, and minimizing the sample-wise entropy, to ensure assignment of every data sample to only one autoencoder. A simple and effective scheme is to use comparatively larger $\alpha$ and $\beta$ at the beginning of the training process, and then to adjust $\alpha$ and $\beta$ at each epoch so that the three terms in the objective function remain approximately equal as training goes on. Empirically, this produces better results than static choices of $\alpha$ and $\beta$.
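One way to realize this balancing heuristic, sketched here with a hypothetical `rebalance` update (the paper's exact rule is not reproduced), is to rescale $\alpha$ and $\beta$ each epoch so the weighted entropy terms match the reconstruction term:

```python
def rebalance(rec, samp, batch, eps=1e-12):
    """Given the current epoch's average values of the three loss terms,
    return (alpha, beta) making alpha*samp and beta*batch equal to rec."""
    alpha = rec / (samp + eps)
    beta = rec / (batch + eps)
    return alpha, beta

# Example: reconstruction error dominates, so the entropy weights are scaled up.
alpha, beta = rebalance(rec=2.0, samp=0.5, batch=1.0)
assert abs(alpha * 0.5 - 2.0) < 1e-9 and abs(beta * 1.0 - 2.0) < 1e-9
```

Starting from larger initial values of `alpha` and `beta` and then applying such a rule each epoch matches the schedule described above: entropy terms dominate early, reconstruction takes over as it shrinks.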
4 Experimental Results
We evaluate our MIXAE on three datasets representing different applications: images, texts, and sensor outputs.
The MNIST dataset contains 70000 28×28-pixel images of handwritten digits (0, 1, …, 9), each cropped and centered.
The original Reuters dataset contains about 810000 English news stories labeled by a category tree. Following prior work, we choose four root categories (corporate/industrial, government/social, markets, and economics) as labels, and remove all documents labeled by multiple root categories, which results in a dataset of 685071 documents. We then compute tf-idf features on the 2000 most frequent words.
The Heterogeneity Human Activity Recognition (HHAR) dataset contains 10299 samples of smartphone and smartwatch sensor time series, each of length 561. There are 6 categories of human activity: walking, walking upstairs, walking downstairs, sitting, standing, and lying down.
A summary of the dataset statistics is also provided in Table 1.
We use different autoencoder and mixture assignment network sizes for different datasets, summarized in Figure 2. We use convolutional autoencoders for MNIST and fully connected autoencoders for the other (non-image) datasets. For each dataset, we train MIXAE with the ADAM optimizer, implemented in TensorFlow. We use a decaying learning rate, reduced by a constant factor every few epochs.
Table 1: Dataset statistics.

| Dataset | Dimension | # Samples | # Clusters | Largest / smallest cluster |
|---------|-----------|-----------|------------|----------------------------|
| MNIST   | 784       | 70000     | 10         | 11% / 9%                   |
| Reuters | 2000      | 685071    | 4          | 41% / 9%                   |
| HHAR    | 561       | 10299     | 6          | 19% / 14%                  |
Following the work of DEC, the clustering accuracy of all algorithms is measured by the unsupervised clustering accuracy (ACC):

$$\text{ACC} = \max_{m \in \mathcal{M}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{y_i = m(c_i)\},$$

where $y_i$ is the ground-truth label, $c_i$ is the cluster assignment produced by the mixture assignment network, i.e., $c_i = \arg\max_k p_i^{(k)}$, and $\mathcal{M}$ is the set of all possible one-to-one mappings between clusters and labels. This is equivalent to cluster purity and is a common metric in clustering. Finding the optimal mapping can be done efficiently using the Hungarian algorithm.
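For small $K$, the maximization over one-to-one mappings can be done by brute force, which makes the definition of ACC concrete (the Hungarian algorithm computes the same optimum in polynomial time):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred, K):
    """Unsupervised clustering accuracy (ACC): best one-to-one mapping m
    from cluster ids to labels, maximizing the fraction of matches
    y_true[i] == m(y_pred[i]). Brute force over all K! permutations."""
    best = 0
    for m in permutations(range(K)):
        correct = sum(1 for t, c in zip(y_true, y_pred) if t == m[c])
        best = max(best, correct)
    return best / len(y_true)

# A pure relabeling of the ground truth achieves ACC = 1.0:
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]
assert clustering_accuracy(y_true, y_pred, K=3) == 1.0
```

This illustrates why ACC is invariant to how clusters are numbered: only the partition matters, not the arbitrary cluster ids.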
4.1 Clustering results
An overall comparison of each clustering method is given in Table 2. As we can see, the deep learning models (DEC, VaDE and MIXAE) all perform much better than traditional machine learning methods (K-means and GMM). This suggests that using autoencoders to extract the latent features of the data and then clustering on these latent features is advantageous for these challenging datasets.
Comparing the deep learning models, we see that MIXAE outperforms DEC, a single manifold model, suggesting the advantage of a mixture of manifolds in cluster representability. Additionally, note that both DEC and VaDE use stacked autoencoders to pretrain their models, which can introduce significant computational overhead, especially if the autoencoders are deep. In contrast, MIXAE trains from a random initialization.
At the same time, though MIXAE achieves the same or slightly better performance than VaDE on Reuters and HHAR, VaDE outperforms both MIXAE and DEC on MNIST. In general, variational autoencoders have been observed to have better representability than deterministic autoencoders. This suggests that leveraging a mixture of variational autoencoders may further improve our model's performance, which is an interesting direction for future work.
Figure 3 shows some samples grouped by cluster label. We see that MIXAE clusters a variety of writing styles well. There are some mistakes, but these are consistent with frequently observed mistakes in supervised classification (e.g., confusion between 4 and 9).
4.2 Latent space separability
Figure 4 shows the t-SNE projection of the concatenated latent vectors into 2-D space. We see that as training progresses, the latent feature clusters become more and more separated, suggesting that the overall architecture promotes representations with better clustering performance.
4.3 Effect of balanced data
As we can see in Table 2, all methods have significantly lower performance on Reuters (an unbalanced dataset) than MNIST and HHAR (balanced datasets). We investigate the effect of balanced data on MIXAE in Table 3 and Figure 5.
In Table 3, for each dataset, we record the values of the batch-wise entropy (BE) and sample-wise entropy (SE) over the entire dataset after training, and compare them with the entropies of the ground-truth labels. The batch entropy regularization (4) forces the final batch-wise entropy to be very close to its maximal value $\log K$ for all three datasets.
In other words, our batch entropy encourages the final cluster assignments to be uniform, which is valid for balanced datasets (MNIST and HHAR) but biased for unbalanced datasets (Reuters). Specifically, in Fig. 5, the sample covariance matrix of the true labels of Reuters has one dominant diagonal value, but the converged sample covariance matrix diagonal is much more even, suggesting that samples that should have gone to a dominant cluster are evenly (incorrectly) distributed to the other clusters.
Additionally, note that the converged sample-wise entropy (actual SE value) for Reuters is far from 0 (Table 3), suggesting that even after convergence, there is considerable ambiguity in cluster assignment.
4.4 Varying K
We also explore the clustering performance of MIXAE with more autoencoders than natural clusters, i.e., values of $K$ larger than the number of digit classes for MNIST. In Figure 6, we plot the evolution of the three components of our objective function (5), as well as the final cluster purity. The purity is defined as the percentage of "correct" labels, where the "correct" label for a cluster is the majority of the true labels for that cluster.
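The purity metric used here, which unlike ACC allows more clusters than labels, can be computed as follows:

```python
from collections import Counter

def cluster_purity(y_true, y_pred):
    """Purity when K may exceed the number of labels: each cluster's
    'correct' label is the majority true label among its members, and
    purity is the overall fraction of samples carrying that label."""
    clusters = {}
    for t, c in zip(y_true, y_pred):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(y_true)

# Two true classes split across three pure clusters: purity is still perfect,
# which is why purity (not ACC) is the right metric for K > #labels.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 2, 2]
assert cluster_purity(y_true, y_pred) == 1.0
```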
As we can see in Figure 7, the clustering accuracy for larger $K$ converges to higher values. One possible explanation is that with larger $K$, the final clusters split each digit group into several clusters, which reduces the overlap between the underlying manifolds corresponding to different digit groups. On the other hand, the sample-wise entropy no longer converges to 0, and the final probability vectors are observed to have 2 or 3 significant nonzeros instead of only one; this suggests that the learned manifolds corresponding to each digit group may have some overlap.
Figure 6(a) shows again the covariance matrices for MNIST with larger $K$. Interestingly, here the final covariance diagonals are extremely uneven, suggesting that the final cluster assignments become more and more unbalanced as we increase $K$. Since each digit group may intuitively have a different variance in writing styles, this result is consistent with what we might expect. Figure 6(b) shows digit examples, sorted according to the final cluster assignments. Here we see the emergence of "stylistic" clusters, with straight vs. slanted 1's, thin vs. round 0's, etc.
5 Conclusions and Future Work
In this paper, we introduce the MIXAE architecture that uses a combination of small autoencoders and a cluster assignment network to intelligently cluster unlabeled data. This is based on the assumption that data from each cluster is generated from a separate low-dimensional manifold, and thus the aggregate data is modeled as a mixture of manifolds. Using this model, we produce improved performance over deterministic deep clustering models on established datasets.
There are several interesting extensions to pursue. First, though we have improved performance on the unbalanced dataset over DEC, we still find Reuters a challenging dataset due to its imbalanced distribution over natural clusters. One potential improvement is to replace the batch entropy regularization with cross-entropy regularization, using knowledge about cluster sizes. However, knowing the sizes of the clusters is not a realistic assumption in online machine learning. Additionally, this would force each autoencoder to take a pre-assigned cluster identity, which might negatively affect training.
Another important extension is in the direction of variational and adversarial autoencoders. We have seen that in single-autoencoder models, VaDE outperforms DEC, which its authors attribute in part to a KL penalty term that encourages cluster separation. A similar term can be added to our model to encourage separation of the latent representation variables.
Our model also has an interesting interpretation in terms of dictionary learning, where a small set of basis vectors characterizes a structured high-dimensional dataset. Specifically, we can view the manifolds learned by the autoencoders as "codewords" and the sample entropy applied to the mixture assignment as the sparse regularization. An interesting extension is to apply this model to multilabel clustering, to see whether each autoencoder can learn distinctive atomic features of each datapoint, for example the components of an image or a voice signal.
-  E. Abbasnejad, A. Dick, and A. v. d. Hengel. Infinite variational autoencoder for semi-supervised learning. arXiv preprint arXiv:1611.07800, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
-  E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In Advances in neural information processing systems, pages 55–63, 2011.
-  E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
-  A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas. Dimension induced clustering. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 51–60. ACM, 2005.
-  J. A. Hartigan and M. A. Wong. Algorithm as 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
-  G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
-  D. Kingma and J. Ba. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  P. Mordohai and G. G. Medioni. Unsupervised dimensionality estimation and manifold learning in high-dimensional spaces by tensor voting. In IJCAI, pages 798–803, 2005.
-  J. Munkres. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics, 5(1):32–38, 1957.
-  E. Nalisnick, L. Hertel, and P. Smyth. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, 2016.
-  A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
-  A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, 2010.
-  X. Peng, J. Feng, S. Xiao, J. Lu, Z. Yi, and S. Yan. Deep sparse subspace clustering. arXiv preprint arXiv:1709.08374, 2017.
-  J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
-  R. Shu, J. Brofos, F. Zhang, H. H. Bui, M. Ghavamzadeh, and M. Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, 2016.
-  R. Souvenir and R. Pless. Manifold clustering. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 648–653. IEEE, 2005.
-  A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 127–140. ACM, 2015.
-  F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.
-  R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.
-  U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
-  X. Wang, K. Slavakis, and G. Lerman. Riemannian multi-manifold modeling. arXiv preprint arXiv:1410.0095, 2014.
-  Y. Wang, Y. Jiang, Y. Wu, and Z.-H. Zhou. Multi-manifold clustering. In PRICAI, pages 280–291. Springer, 2010.
-  J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
-  Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 19(10):2761–2773, 2010.
-  Y. Zheng, H. Tan, B. Tang, H. Zhou, et al. Variational deep embedding: A generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.