Deep Unsupervised Clustering Using Mixture of Autoencoders

12/21/2017 ∙ by Dejiao Zhang, et al. ∙ University of Michigan ∙ Adobe

Unsupervised clustering is one of the most fundamental challenges in machine learning. A popular hypothesis is that data are generated from a union of low-dimensional nonlinear manifolds; thus an approach to clustering is identifying and separating these manifolds. In this paper, we present a novel approach to solve this problem by using a mixture of autoencoders. Our model consists of two parts: 1) a collection of autoencoders where each autoencoder learns the underlying manifold of a group of similar objects, and 2) a mixture assignment neural network, which takes the concatenated latent vectors from the autoencoders as input and infers the distribution over clusters. By jointly optimizing the two parts, we simultaneously assign data to clusters and learn the underlying manifolds of each cluster.




1 Introduction

The deep learning revolution has been fueled by the explosion of large scale datasets with meaningful labels. Applications range from image classification [2] to text sentiment classification [18]. However, for many applications, we do not have meaningful data labels available. In this regime, deep autoencoders are gaining momentum [8] as a way to effectively map data to a low-dimensional feature space where data are more separable and hence more easily clustered [29].

Long-established methods for unsupervised clustering such as K-means and Gaussian mixture models (GMMs) are still the workhorses of many applications due to their simplicity. However, their distance measures are limited to local relations in the data space and they tend to be ineffective for high dimensional data that often has significant overlaps across clusters [29, 31].

Recently, there has been a surge of interest in developing more powerful clustering methods by leveraging deep neural networks. Various methods [31, 3, 31, 19] have been proposed to conduct clustering on the latent representations learned by (variational) autoencoders. Although these methods perform well in clustering, a weakness is that they use one single low-dimensional manifold to represent the data. As a consequence, for more complex data, the latent representations can be poorly separated. Learning good representations by leveraging the underlying structure of the data has been left largely unexplored and is the topic of our work. Our underlying assumption is that each data cluster is associated with a separate manifold. Therefore, modeling the dataset as a mixture of low-dimensional nonlinear manifolds is a natural and promising framework for clustering data generated from different categories.

In this paper we develop a novel deep architecture for multiple manifold clustering. Manifold learning and clustering have a rich literature, with parametric estimation methods [4, 22] and spectral methods being the most common approaches [26, 17]. These methods require either a parametric model or distance metrics that capture the relationship among points in the dataset (or both). An autoencoder, on the other hand, identifies a nonlinear function mapping the high-dimensional points to a low-dimensional latent representation without any metric, and while autoencoders are parametric in some sense, they are often trained with a large number of parameters, resulting in a high degree of flexibility in the final low-dimensional representation.

Our approach therefore is to use a MIXture of AutoEncoders (MIXAE), each of which should identify a nonlinear mapping suitable for a particular cluster. The autoencoders are trained simultaneously with a mixture assignment network via a composite objective function, thereby jointly encouraging low reconstruction error (for learning each manifold) and low cluster identification error. This kind of joint optimization has been shown to have good performance in other unsupervised architectures [31] as well. The main advantage in combining clustering with representation learning this way is that the two parts collaborate with each other to reduce the complexity and improve representation ability: the latent vector itself is a low-dimensional data representation that allows a much smaller classifier network for mixture assignment, and is itself learned to be well-separated by clustering (see section


Our contributions are: (i) a novel deep learning architecture for unsupervised clustering with mixture of autoencoders, (ii) a joint optimization framework for simultaneously learning a union of manifolds and clustering assignment, and (iii) state-of-the-art performance on established benchmark large-scale datasets.

2 Related Work

The most fundamental method for clustering is the K-means algorithm [7], which assumes that the data are centered around some centroids, and seeks to find clusters that minimize the sum of the squares of the norm distances to the centroid within each cluster. Instead of modeling each cluster with a single point (centroid), another approach called K-subspaces clustering assumes the dataset can be well approximated by a union of subspaces (linear manifolds); this field is well studied and a large body of work has been proposed [25, 5]. However, neither K-means nor K-subspaces clustering is designed to separate clusters that have nonlinear and non-separable structure.

Nonlinear manifold clustering has been studied as a more promising generalization of linear models and has an extensive literature [14, 6, 26, 27, 22, 28]. In particular, graph-based methods like spectral clustering [20, 17, 26] and its variants [30] are popular ways of handling nonlinearly separated clusters. However, these methods in general can be computationally intensive, and still have difficulties in separating clusters with intersecting regions.

Recent work [24] extends spectral clustering by replacing the eigenvector representation of data with the embeddings from a deep autoencoder. Since training an autoencoder is linear in the number of samples $N$, finding the embeddings is much more scalable than traditional spectral clustering, despite the difficulties in training autoencoders. However, this approach requires a normalized adjacency matrix as input, which is a heavy burden on both computation and memory for very large $N$.

Mixture models are a computationally scalable probabilistic approach to clustering that also allows for overlapping clusters. The most popular mixture model is the Gaussian Mixture Model (GMM), which assumes that data are generated from a mixture of Gaussian distributions with unknown parameters, and the parameters are optimized by the Expectation Maximization (EM) algorithm. However, as with the K-means method, GMMs require strong assumptions on the distribution of the data, which are often not satisfied in practice.

A recent stream of work has focused on optimizing a clustering objective over the low-dimensional feature space of an autoencoder [29] or a variational autoencoder [31, 3]. Notably, the Deep Embedded Clustering (DEC) model [29] iteratively minimizes the within-cluster KL-divergence and the reconstruction error. The Variational Deep Embedding (VaDE) [31] and Gaussian Mixture Variational Autoencoder (GMVAE) [3] models extend the DEC approach by training variational autoencoders, iteratively learning clusters and feature representation distribution parameters. In particular, [31] emphasizes a noticeable gain from training the autoencoder and the GMM components jointly rather than alternately, which is in the same spirit as our joint representation and clustering framework. A weakness in these models is that they require careful initialization of model parameters, and often exhibit separation of clusters before actual training even begins. In contrast, the proposed MIXAE model can be trained from scratch.

Similarly, the DLGMM model [16] and CVAE model [21] also combine variational autoencoders with GMM for clustering, but are primarily used for different applications. Adversarial autoencoders [13] are another popular extension, and are also widely used for semi-supervised learning [13, 1]. However, adversarial models are reputedly difficult to train.

3 Clustering with Mixture of Autoencoders

Figure 1: Overall architecture of the MIXAE model. The MIXAE architecture contains several parts: (a) a collection of autoencoders, each of them seeking to learn the underlying manifold of one data cluster; (b) for each input data sample, the mixture assignment network takes the concatenated latent features as input and outputs soft clustering assignments; (c) the mixture aggregation, which is done via the weighted reconstruction error together with proper regularizations on the soft clustering assignments.

We now describe our MIXture of AutoEncoders (MIXAE) model in detail, giving the intuition behind our customized architecture and specialized objective function.

3.1 Autoencoders

An autoencoder is a common neural network architecture used for unsupervised representation learning. An autoencoder consists of an encoder $E$ and a decoder $D$. Given the input data $x \in \mathbb{R}^n$, the encoder first maps $x$ to its latent representation $z = E(x) \in \mathbb{R}^d$, where typically $d < n$. The decoder then maps $z$ to a reconstruction $\tilde{x} = D(z) \in \mathbb{R}^n$, with reconstruction error $L(x, \tilde{x})$ measuring the deviation between $x$ and $\tilde{x}$. The parameters of the network are updated via backpropagation with the target of minimizing the reconstruction error. By restricting the latent space to lower dimensionality than the input space ($d < n$), the trained autoencoder parametrizes a low-dimensional nonlinear manifold that captures the data structure.

3.2 Mixture of Autoencoders (MIXAE) Model

Our goal is to cluster a collection of data points $\{x^{(i)}\}_{i=1}^{N}$ into $K$ clusters, under the assumption that data from each cluster is sampled from a different low-dimensional manifold. A natural choice is to use a separate autoencoder to model each data cluster, and thereby the entire dataset as a collection of autoencoders. The cluster assignment is performed with an additional neural network, which infers the cluster labels of each data sample based on the autoencoders' latent features. Specifically, for each data sample $x^{(i)}$, this mixture assignment network takes the concatenation of the latent representations of each autoencoder, $z^{(i)} = [z_1^{(i)}, \dots, z_K^{(i)}]$ with $z_k^{(i)} = E_k(x^{(i)})$ the output of the $k$-th encoder, as input, and outputs a probabilistic vector $p^{(i)} = [p_1^{(i)}, \dots, p_K^{(i)}]$ that infers the distribution of $x^{(i)}$ over clusters, i.e., for $k = 1, \dots, K$,

$$p_k^{(i)} = P\big(x^{(i)} \text{ belongs to cluster } k \mid z^{(i)}\big). \tag{1}$$
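As a rough illustration of this assignment step (not the authors' implementation), the sketch below concatenates the per-autoencoder latent vectors and passes them through a single linear layer followed by a softmax. The single-layer form, the function names `softmax` and `mixture_assignment`, and the toy weights are assumptions made for illustration; the actual networks are dataset-specific.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_assignment(latents, weights, biases):
    """Map K latent vectors to a length-K probability vector.

    latents: list of K latent vectors, one per autoencoder
    weights: K rows, each as long as the concatenated latent vector
    biases:  list of K floats
    Hypothetical single linear layer + softmax, for illustration only.
    """
    z = [v for latent in latents for v in latent]  # concatenation
    logits = [sum(w * v for w, v in zip(row, z)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Toy example: K = 2 autoencoders, each with a 2-dimensional latent code.
latents = [[1.0, 0.0], [0.0, 2.0]]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
biases = [0.0, 0.0]
p = mixture_assignment(latents, weights, biases)  # sums to 1
```

Because the input is only the concatenated low-dimensional codes, the assignment network can stay small relative to the autoencoders.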


To jointly optimize the autoencoders and the mixture assignment network, we use a composite objective function consisting of three important components.

Weighted reconstruction error

The mixture aggregation is done in the weighted reconstruction error term

$$\sum_{k=1}^{K} p_k^{(i)} \, L\big(x^{(i)}, \tilde{x}_k^{(i)}\big), \tag{2}$$

where $x^{(i)}$ is the $i$-th data sample, $\tilde{x}_k^{(i)}$ is the reconstructed output of autoencoder $k$ for sample $i$, $L(\cdot, \cdot)$ is the reconstruction error, and $p^{(i)} = [p_1^{(i)}, \dots, p_K^{(i)}]$ are the soft probabilities from the mixture assignment network for sample $i$, calculated via (1). Typical choices for $L$ are squared errors and KL-divergence. Intuitively, (2) will achieve its minimum when the $p^{(i)}$'s are one-hot vectors and select the autoencoder with minimum reconstruction error for that sample.
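The weighted reconstruction error for one sample can be sketched as follows, assuming squared error as the reconstruction loss. The function names and the toy reconstructions are hypothetical; the point is only that a one-hot assignment on the best autoencoder gives a lower value than a soft split.

```python
def squared_error(x, x_hat):
    """Squared Euclidean distance between a sample and a reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def weighted_reconstruction_error(x, reconstructions, p):
    """Sum over autoencoders k of p[k] * L(x, reconstructions[k])."""
    return sum(pk * squared_error(x, xh)
               for pk, xh in zip(p, reconstructions))

x = [1.0, 2.0]
# Two hypothetical reconstructions with squared errors 0.25 and 5.0.
reconstructions = [[1.0, 2.5], [0.0, 0.0]]

soft = weighted_reconstruction_error(x, reconstructions, [0.5, 0.5])
hard = weighted_reconstruction_error(x, reconstructions, [1.0, 0.0])
```

Here `hard` equals the error of the better autoencoder alone, while `soft` is pulled up by the poorly fitting one.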

Sample-wise entropy

To motivate sparse mixture assignment probabilities (so that each data sample ultimately receives one dominant label assignment) we add a sample entropy deterrent:

$$S\big(p^{(i)}\big) = -\sum_{k=1}^{K} p_k^{(i)} \log p_k^{(i)}, \tag{3}$$

where $p^{(i)}$ is the assignment vector of sample $i$. Specifically, (3) achieves its minimum only if $p^{(i)}$ is a one-hot vector, specifying a deterministic distribution. We refer to this as the sample-wise entropy.
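A minimal sketch of the sample-wise entropy, using the natural logarithm and the usual convention that $0 \log 0 = 0$ (both are assumptions here, since the text does not state the log base):

```python
import math

def entropy(p):
    """Shannon entropy -sum_k p_k log p_k in nats, with 0 log 0 := 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

one_hot = entropy([1.0, 0.0, 0.0, 0.0])      # deterministic assignment
uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximally ambiguous
```

The one-hot vector gives zero entropy, and the uniform vector gives the largest possible value for four clusters, $\log 4$.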

Batch-wise entropy

One trivial minimizer of the sample-wise entropy is for the mixture assignment neural network to output a constant one-hot vector for all input data, i.e., selecting a single autoencoder for all of the data. To avoid this local minimum, we motivate equal usage of all autoencoders via a batch-wise entropy term

$$\mathrm{BE} = -\sum_{k=1}^{K} \bar{p}_k \log \bar{p}_k, \qquad \bar{p} = \frac{1}{B} \sum_{i=1}^{B} p^{(i)}. \tag{4}$$

Here, $B$ is the minibatch size and $\bar{p}$ is the average soft cluster assignment over an entire minibatch. If (4) reaches its maximum value of $\log K$, then $\bar{p}_k = 1/K$ for all $k$, i.e., each cluster is selected with uniform probability. This is a valid assumption for a large enough minibatch, randomly selected over balanced data.
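The difference between the two entropy terms can be seen in a small sketch (function names are illustrative). Both toy minibatches below contain only one-hot assignments, so their sample-wise entropy is zero, but only the second one uses both clusters.

```python
import math

def entropy(p):
    """Shannon entropy in nats, with 0 log 0 := 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def batch_entropy(batch_p):
    """Entropy of the average assignment vector over a minibatch."""
    batch_size = len(batch_p)
    num_clusters = len(batch_p[0])
    p_bar = [sum(p[k] for p in batch_p) / batch_size
             for k in range(num_clusters)]
    return entropy(p_bar)

collapsed = [[1.0, 0.0], [1.0, 0.0]]  # every sample sent to cluster 0
balanced = [[1.0, 0.0], [0.0, 1.0]]   # clusters used equally

be_collapsed = batch_entropy(collapsed)
be_balanced = batch_entropy(balanced)
```

The collapsed batch has zero batch-wise entropy, while the balanced batch attains the maximum $\log 2$, which is why rewarding this term discourages the trivial single-autoencoder solution.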

Objective function

Let $\theta$ be the parameters of the autoencoders and mixture assignment network. We minimize the composite cost function

$$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} p_k^{(i)} L\big(x^{(i)}, \tilde{x}_k^{(i)}\big) \;+\; \alpha \, \frac{1}{N} \sum_{i=1}^{N} S\big(p^{(i)}\big) \;-\; \beta \, \mathrm{BE}, \tag{5}$$

whose three terms are the weighted reconstruction error (2), the sample-wise entropy (3), and the batch-wise entropy (4), with weights $\alpha, \beta > 0$. The batch-wise entropy enters with a negative sign because it is to be maximized.

An important consideration is the choice of $\alpha$ and $\beta$, which can significantly affect the final clustering quality. We adjust these parameters dynamically during the training process. Intuitively, initially we should prioritize batch-wise entropy and sample-wise entropy in order to encourage equal use of autoencoders while avoiding the case where all autoencoders are equally optimized for each input, i.e., the probabilistic vector $p^{(i)}$ characterizes a uniform distribution for each input. Asymptotically, we should prioritize minimizing the reconstruction error to promote better learning of the manifolds for each cluster, and minimizing sample-wise entropy to ensure assignment of every data sample to only one autoencoder. A simple and effective scheme is to use comparatively larger $\alpha$ and $\beta$ at the beginning of the training process, while adjusting $\alpha$ and $\beta$ at each epoch such that the three terms in the objective function are approximately equal as the training process goes on. Empirically, this produces better results than static choices of $\alpha$ and $\beta$.
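One way the per-epoch balancing could look is sketched below. The text only says that the three terms are kept approximately equal; the specific update rule in `rebalance` (rescaling each weight so its term matches the reconstruction term) and the sign convention for the batch-wise entropy are assumptions for illustration.

```python
def composite_loss(rec, sample_ent, batch_ent, alpha, beta):
    """Reconstruction + alpha * sample entropy - beta * batch entropy.

    The batch entropy is subtracted because it is to be maximized.
    """
    return rec + alpha * sample_ent - beta * batch_ent

def rebalance(rec, sample_ent, batch_ent, eps=1e-8):
    """Hypothetical per-epoch update of the two weights.

    Chooses alpha and beta so that alpha * sample_ent and
    beta * batch_ent both match the current reconstruction term.
    """
    alpha = rec / max(sample_ent, eps)
    beta = rec / max(batch_ent, eps)
    return alpha, beta

# Toy values for the three terms at the end of an epoch.
rec, sample_ent, batch_ent = 0.8, 0.2, 2.0
alpha, beta = rebalance(rec, sample_ent, batch_ent)
loss = composite_loss(rec, sample_ent, batch_ent, alpha, beta)
```

With these toy values the two weighted entropy terms both equal the reconstruction term, so neither regularizer dominates the next epoch.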

4 Experimental Results

We evaluate our MIXAE on three datasets representing different applications: images, texts, and sensor outputs.


The MNIST [11] dataset contains 70000 28×28 pixel images of handwritten digits (0, 1, …, 9), each cropped and centered.


The original Reuters dataset contains about 810000 English news stories labeled by a category tree. Following [29], we choose four root categories: corporate/industrial, government/social, markets, and economics as labels, and remove all of the documents that are labeled by multiple root categories, which results in a dataset with 685071 documents. We then compute the tf-idf features on the 2000 most frequent words.


The Heterogeneity Human Activity Recognition (HHAR, [23]) dataset contains 10299 samples of smartphone and smartwatch sensor time series feeds, each of length 561. There are 6 categories of human activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

A summary of the dataset statistics is also provided in Table 1.

Figure 2: Network Architecture.

Autoencoder and mixture assignment networks for (a) MNIST, (b) Reuters, and (c) HHAR experiments. Thin solid, thick solid, and dashed lines show the output of fully-connected, CNN, and softmax layers respectively.

Network architecture

We use different autoencoders and mixture assignment network sizes for different datasets, summarized in Figure 2. We use convolutional autoencoders for MNIST and fully connected autoencoders for the other (non-image) datasets. For each dataset, we train MIXAE with Adam [9] acceleration, using TensorFlow. We use a decaying learning rate that starts from an initial value and is reduced by a constant factor at fixed epoch intervals.
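A step-decay schedule of this kind can be sketched as below. The initial rate, decay factor, and interval used in the example are placeholders chosen for illustration, not the values used in the experiments.

```python
def step_decay(initial_lr, factor, step, epoch):
    """Learning rate after `epoch` epochs, cut by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

# Placeholder hyperparameters (hypothetical, for illustration only).
lr_start = step_decay(1e-3, 0.1, 10, 0)    # before any decay
lr_mid = step_decay(1e-3, 0.1, 10, 9)      # still within the first interval
lr_late = step_decay(1e-3, 0.1, 10, 25)    # after two decays
```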

| Dataset | Dim | $N$ | $K$ | LC / SC |
|---|---|---|---|---|
| MNIST | 784 | 70000 | 10 | 11% / 9% |
| Reuters | 2000 | 685071 | 4 | 41% / 9% |
| HHAR | 561 | 10299 | 6 | 19% / 14% |

Table 1: Datasets. $N$: # samples. $K$: # clusters. The last column shows class balance by giving the percent of data in the largest class (LC) / smallest class (SC).

Evaluation metric

Following the work of DEC, the clustering accuracy of all algorithms is measured by the unsupervised clustering accuracy (ACC):

$$\mathrm{ACC} = \max_{m \in \mathcal{M}} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ l_i = m(c_i) \},$$

where $l_i$ is the ground-truth label, $c_i$ is the cluster assignment produced by the mixture assignment network, i.e., $c_i = \arg\max_k p_k^{(i)}$, and $\mathcal{M}$ is the set of all possible one-to-one mappings between clusters and labels. This is equivalent to the cluster purity and is a common metric in clustering (see also [29]). Finding the optimal mapping can be done efficiently using the Hungarian algorithm [15].
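For small numbers of clusters, ACC can be computed by brute force over all one-to-one mappings, as in the sketch below. This is a stand-in for the Hungarian algorithm, not an implementation of it: it is exponential in the number of clusters, and in practice an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be used on the same count matrix. The function name and toy data are illustrative.

```python
from itertools import permutations

def clustering_accuracy(labels, clusters, num_classes):
    """Best accuracy over all one-to-one cluster -> label mappings.

    Brute force over permutations, so only viable for small num_classes.
    """
    # counts[c][l] = number of samples in cluster c with true label l
    counts = [[0] * num_classes for _ in range(num_classes)]
    for l, c in zip(labels, clusters):
        counts[c][l] += 1
    best = max(sum(counts[c][m[c]] for c in range(num_classes))
               for m in permutations(range(num_classes)))
    return best / len(labels)

labels = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 0]  # cluster ids are arbitrary
acc = clustering_accuracy(labels, clusters, 3)
```

In this toy case the best mapping sends cluster 1 to label 0, cluster 0 to label 1, and cluster 2 to label 2, leaving one sample misassigned.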

4.1 Clustering results

An overall comparison of each clustering method is given in Table 2. As we can see, the deep learning models (DEC, VaDE and MIXAE) all perform much better than traditional machine learning methods (K-means and GMM). This suggests that using autoencoders to extract the latent features of the data and then clustering on these latent features is advantageous for these challenging datasets.

Comparing the deep learning models, we see that MIXAE outperforms DEC, a single manifold model, suggesting the advantage of a mixture of manifolds in cluster representability. Additionally, note that both DEC and VaDE use stacked autoencoders to pretrain their models, which can introduce significant computational overhead, especially if the autoencoders are deep. In contrast, MIXAE trains from a random initialization.

At the same time, although MIXAE matches VaDE on Reuters and outperforms it on HHAR, VaDE outperforms both MIXAE and DEC on MNIST. In general it has been observed that variational autoencoders have better representability than deterministic autoencoders (e.g., [10]). This suggests that leveraging a mixture of variational autoencoders may further improve our model's performance and is an interesting direction for future work.

Method MNIST Reuters HHAR Pretrain?
K-means 53.5% 53.3% 60.1% No
GMM 53.7% 55.8% 60.3% No
DEC [29] 84.3% 75.6% 79.9% Yes
VaDE [31] 94.5% 79.4% 84.5% Yes
MIXAE 85.6% 79.4% 87.8% No
Table 2: Clustering accuracy. Comparison of unsupervised clustering accuracy (ACC) on different datasets.
Figure 3: Sample output. Visualization of the clustering results on MNIST for . (Each row within each subfigure is a cluster.)
Figure 4: Latent space visualization. t-SNE projection of the 80-dim concatenated latent vectors from the MNIST experiment, projected to a 2-dimensional space [12].

Figure 3 shows some samples grouped by cluster label. We see that MIXAE clusters a variety of writing styles well. There are some mistakes, but they are consistent with frequently observed mistakes in supervised classification (e.g., 4 and 9 confusion).

4.2 Latent space separability

Figure 4 shows the t-SNE projection of the 80-dimensional concatenated latent vectors to 2-D space. Specifically, we see that as training progresses, the latent feature clusters become more and more separated, suggesting that the overall architecture motivates finding representations with better clustering performance.

4.3 Effect of balanced data

As we can see in Table 2, the deep clustering methods (DEC, VaDE, and MIXAE) all have noticeably lower accuracy on Reuters (an unbalanced dataset) than on MNIST and HHAR (balanced datasets). We investigate the effect of balanced data on MIXAE in Table 3 and Figure 5.

| Dataset | Ground truth: Max BE | Ground truth: Expected BE | Ground truth: SE | Actual: BE | Actual: SE |
|---|---|---|---|---|---|
| MNIST | 2.31 | 2.30 | 0 | 2.28 | 0.026 |
| Reuters | 1.39 | 1.26 | 0 | 1.38 | 0.598 |
| HHAR | 1.79 | 1.79 | 0 | 1.77 | 0.054 |

Table 3: Data imbalance. Left: the expected batch entropy assuming uniform clusters (max value) and given cluster imbalance. Right: the actual batch and sample entropy values for each dataset. Max BE (batch entropy) $= \log K$. Expected BE $= -\sum_k q_k \log q_k$, where $q_k =$ # samples with label $k$ / # samples. Actual BE and SE (sample entropy) are converged values.

In Table 3, for each dataset, we record the values of batch-wise entropy (BE) and sample-wise entropy (SE) over the entire dataset after training, and we compare them with the ground truth entropies of the true labels. The batch entropy regularization (4) forces the final actual batch-wise entropy to be very close to the maximal value of $\log K$ for all of the three datasets.
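The ground-truth entropies can be reproduced approximately with a few lines. The maximal batch entropy depends only on the number of clusters given in Table 1; the imbalanced label proportions in the second part are hypothetical (only the 41% and 9% extremes for Reuters come from Table 1), so the resulting value illustrates the effect of imbalance rather than reproducing the table entry.

```python
import math

def entropy(q):
    """Shannon entropy in nats, with 0 log 0 := 0."""
    return -sum(v * math.log(v) for v in q if v > 0.0)

# Maximal batch entropy log K for the cluster counts in Table 1.
num_clusters = {"MNIST": 10, "Reuters": 4, "HHAR": 6}
max_be = {name: math.log(k) for name, k in num_clusters.items()}

# Expected batch entropy under label imbalance.
# HYPOTHETICAL proportions: only the 0.41 and 0.09 extremes are from Table 1.
q = [0.41, 0.30, 0.20, 0.09]
expected_be = entropy(q)
```

Using natural logarithms, `max_be` is about 2.30, 1.39, and 1.79 for the three datasets, and any imbalanced distribution such as `q` falls below the four-cluster maximum.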

Figure 5: Clustering covariance. Covariance matrices $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (p^{(i)} - \bar{p})(p^{(i)} - \bar{p})^{T}$, where $p^{(i)}$ is the output of the mixture assignment network at sample $i$, and $\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p^{(i)}$. The true covariance matrix is computed similarly, but replacing $p^{(i)}$ with the one-hot representation of the true label of sample $i$.
Figure 6: Evolution for varying $K$. Key components of the objective function (5) during the training over the MNIST dataset, for three values of $K$.

In other words, our batch entropy encourages the final cluster assignments to be uniform, which is valid for balanced datasets (MNIST and HHAR) but biased for unbalanced datasets (Reuters). Specifically, in Fig. 5, the sample covariance matrix of the true labels of Reuters has one dominant diagonal value, but the converged sample covariance matrix diagonal is much more even, suggesting that samples that should have gone to a dominant cluster are evenly (incorrectly) distributed to the other clusters.

Additionally, note that the converged sample-wise entropy (actual SE value) for Reuters is far from 0 (Table 3), suggesting that even after convergence, there is considerable ambiguity in cluster assignment.

4.4 Varying K

We also explore the clustering performance of MIXAE with more autoencoders than natural clusters, i.e., $K > 10$ for MNIST. In Figure 6, we plot the evolution of the three components of our objective function (5), as well as the final cluster purity. This purity is defined as the percentage of “correct” labels, where the “correct” label for a cluster is defined as the majority of the true labels for that cluster.
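The purity defined above can be computed directly from label and cluster lists, as in this small sketch (the function name and toy data are illustrative). Unlike a one-to-one mapping, several clusters may share the same majority label, which is what makes the measure usable when there are more clusters than classes.

```python
from collections import Counter

def purity(labels, clusters):
    """Fraction of samples whose label is the majority label of their cluster."""
    by_cluster = {}
    for l, c in zip(labels, clusters):
        by_cluster.setdefault(c, Counter())[l] += 1
    correct = sum(count.most_common(1)[0][1]
                  for count in by_cluster.values())
    return correct / len(labels)

# Two true classes split over three clusters.
labels = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 2, 2, 1]
score = purity(labels, clusters)
```

Clusters 0 and 2 are pure, while cluster 1 mixes one sample from each class, so five of the six samples count as correct.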

As we can see in Figure 6, the clustering accuracy for larger $K$ converges to higher values. One possible explanation is that with larger $K$, the final clusters split each digit group into more clusters, and this reduces the overlap in underlying manifolds corresponding to different digit groups. On the other hand, the sample-wise entropy no longer converges to 0, and the final probabilistic vectors are observed to have 2 or 3 significant nonzeros instead of only one; this suggests that the learned manifolds corresponding to each digit group may have certain overlap.

Figure 7(a) again shows the covariance matrices for MNIST, now for the larger values of $K$. Interestingly, here the final covariance diagonals are extremely uneven, suggesting that final cluster assignments are more and more unbalanced as we increase $K$. Since intuitively each digit group may have different magnitudes of variance in writing styles, this result is consistent with what we may expect. Figure 7(b) shows digit examples, sorted according to the finalized cluster assignments. Here, we see the emergence of “stylistic” clusters, with straight vs slanted 1’s, thin vs round 0’s, etc.

(a) Covariance matrix of the estimated vector.
(b) Sample outputs (4 per cluster).
Figure 7: Variant K. Visualization of the clustering results of MNIST for two values of $K > 10$.

5 Conclusions and Future Work

In this paper, we introduce the MIXAE architecture that uses a combination of small autoencoders and a cluster assignment network to intelligently cluster unlabeled data. This is based on the assumption that data from each cluster is generated from a separate low-dimensional manifold, and thus the aggregate data is modeled as a mixture of manifolds. Using this model, we produce improved performance over deterministic deep clustering models on established datasets.

There are several interesting extensions to pursue. First, though we have improved performance on the unbalanced dataset over DEC, we still find Reuters a challenging dataset due to its imbalanced distribution over natural clusters. One potential improvement is to replace the batch entropy regularization with cross-entropy regularization, using knowledge about cluster sizes. However, knowing the sizes of clusters is not a realistic assumption in online machine learning. Additionally this would force each autoencoder to take a pre-assigned cluster identity, which might negatively affect the training.

Another important extension is in the direction of variational and adversarial autoencoders. We have seen that in single autoencoder models, VaDE outperforms DEC, which its authors also attribute to a KL penalty term for encouraging cluster separation. This extension can be done in our model to encourage separation in the latent representation variables.

Our model also has an interesting interpretation in terms of dictionary learning, where a small set of basis vectors characterizes a structured high dimensional dataset. Specifically, we can consider the manifolds learned by the autoencoders as “codewords” and the sample entropy applied to the mixture assignment as the sparse regularization. An interesting extension is to apply this model to multilabel clustering, to see if each autoencoder can learn distinctive atomic features of each datapoint, for example, the components of an image or voice signal.