Clustering is a fundamental problem in data analysis where a set of objects is partitioned into clusters according to similarities between the objects. Objects within a cluster are similar to each other, and objects across clusters are different, according to some criteria. Clustering has its roots in the 1960s [cluster_og1, cluster_og2], but is still researched heavily today [cluster_review, jain]. Clustering can be applied to many different problems such as separating potential customers into market segments [cluster_market], segmenting satellite images to measure land cover [cluster_satellite], or identifying when different images contain the same person [cluster_face].
A popular technique for clustering is Gaussian mixture modeling. In this approach, a Gaussian mixture is fit to the observed data via maximum likelihood estimation. The flexibility of the Gaussian mixture model, however, comes at the cost of hyperparameters that can be difficult to tune, and model assumptions that can be difficult to choose [jain]. If users make assumptions about the model's covariance matrices, they risk inappropriately restricting the model. On the other hand, relaxing covariance assumptions leads to a large number of parameters to estimate. Users are also forced to choose the number of mixture components and how to initialize the estimation procedure.
This paper presents AutoGMM, a Gaussian mixture model based algorithm implemented in Python that automatically chooses the initialization, number of clusters, and covariance constraints. Inspired by the mclust package in R [mclust5], our algorithm iterates through different clustering options and cluster numbers and evaluates each according to the Bayesian Information Criterion. The algorithm starts with agglomerative clustering, then fits a Gaussian mixture model with a dynamic regularization scheme that discourages singleton clusters. We compared the algorithm to mclust on several datasets, and they perform similarly.
2.1 Gaussian Mixture Models
The most popular statistical model of clustered data is the Gaussian mixture model (GMM). A Gaussian mixture is simply a composition of multiple normal distributions. Each component $k$ has a "weight", $w_k$: the proportion of the overall data that belongs to that component. Therefore, the combined probability distribution, $f(x)$, is of the form:

$$f(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x; \mu_k, \Sigma_k), \qquad x \in \mathbb{R}^d,$$

where $K$ is the number of clusters, $d$ is the dimensionality of the data, and $\mu_k$ and $\Sigma_k$ are the mean and covariance of the $k$th component.
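As a concrete check of the mixture density, the following sketch evaluates $f(x)$ directly with SciPy; the weights, means, and covariances are arbitrary illustrative values, not parameters from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate f(x) = sum_k w_k * N(x; mu_k, Sigma_k)."""
    return sum(
        w * multivariate_normal.pdf(x, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    )

# Two equally weighted components in d = 2 dimensions
weights = [0.5, 0.5]
means = [np.zeros(2), 2 * np.ones(2)]
covs = [np.eye(2), np.eye(2)]

density_at_origin = gmm_density(np.zeros(2), weights, means, covs)
```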
The maximum likelihood estimate (MLE) of Gaussian mixture parameters cannot be directly computed, so the Expectation-Maximization (EM) algorithm is typically used to estimate model parameters [mclachlan]. The EM algorithm is guaranteed to monotonically increase the likelihood with each iteration [em]. A drawback of the EM algorithm, however, is that it can produce singular covariance matrices if not adequately constrained. The computational complexity of a single EM iteration with respect to the number of data points $n$ is $\mathcal{O}(n)$.
After running EM, the fitted GMM can be used to “hard cluster” data by calculating which mixture component was most likely to produce a data point. Soft clusterings of the data are also available upon running the EM algorithm, as each point is assigned a weight corresponding to all components.
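The hard/soft distinction can be seen with scikit-learn's GaussianMixture, whose predict and predict_proba methods return the two kinds of assignment; the two-blob dataset below is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters of 100 points each
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = gmm.predict(X)        # hard labels: the most likely component per point
soft = gmm.predict_proba(X)  # soft weights: one probability per component

# Soft assignments for each point sum to one across components
assert np.allclose(soft.sum(axis=1), 1.0)
```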
To initialize the EM algorithm, typically all points are assigned a cluster, which is then fed as input into the M-step. The key question in the initialization then becomes how to initially assign points to clusters.
The simplest way to initialize the EM algorithm is by randomly choosing data points to serve as the initial mixture component means. This method is simple and fast, but different initializations can lead to drastically different results. In order to alleviate this issue, it is common to perform random initialization and subsequent EM several times, and choose the best result. However, there is no guarantee the random initializations will lead to satisfactory results, and running EM many times can be computationally costly.
Another strategy is to use the k-means algorithm to initialize the mixture component means. K-means is perhaps the most popular clustering algorithm [jain], and it seeks to minimize the squared distance within clusters. The k-means algorithm is usually fast, since the computational complexity of performing a fixed number of iterations is $\mathcal{O}(n)$ [cluster_review]. K-means itself needs to be initialized, and k-means++ is a principled choice, since it bounds the k-means cost function [kmeans++]. Since there is randomness in k-means++, running this algorithm on the same dataset may result in different clusterings. GraSPy, a Python package for graph statistics, performs EM initialization this way in its GaussianCluster class.
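A minimal sketch of this initialization strategy with scikit-learn, seeding EM with centers from a k-means++-initialized k-means run; the dataset and parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4, 1, (150, 2)), rng.normal(4, 1, (150, 2))])

# k-means++ seeding, then k-means, then hand the centers to EM
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(
    n_components=2, means_init=km.cluster_centers_, random_state=0
).fit(X)
```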
2.2.3 Agglomerative Clustering
Agglomerative clustering is a hierarchical technique that starts with every data point as its own cluster. Then, the two closest clusters are merged until the desired number of clusters is reached. In scikit-learn's AgglomerativeClustering class, "closeness" between clusters can be quantified by L1 distance, L2 distance, or cosine similarity.
Additionally, there are several linkage criteria that can be used to determine which clusters should be merged next. Complete linkage, which merges clusters according to the maximally distant data points within a pair of clusters, tends to find compact clusters of similar size. On the other hand, single linkage, which merges clusters according to the closest pairs of data points, is more likely to result in unbalanced clusters with more variable shape. Average linkage merges according to the average distance between points of different clusters, and Ward linkage merges clusters that cause the smallest increase in within-cluster variance. All four of these linkage criteria are implemented in AgglomerativeClustering, and further comparisons between them can be found in [everitt].
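The four linkage criteria can be compared directly in scikit-learn; on compact, well-separated blobs such as the illustrative data below, all four recover the same partition:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two compact, well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Cluster the same data under each of the four linkage criteria
partitions = {
    linkage: AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    for linkage in ["ward", "complete", "average", "single"]
}
```

The linkage criteria differ most on elongated or unevenly sized clusters; on data like this they agree.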
The computational complexity of agglomerative clustering can be prohibitive in large datasets [xu]. Naively, agglomerative clustering has computational complexity of $\mathcal{O}(n^3)$. However, algorithmic improvements have improved this upper bound [hclust_eff]. scikit-learn uses minimum spanning tree and nearest neighbor chain methods to achieve $\mathcal{O}(n^2)$ complexity. Efforts to make faster agglomerative methods involve novel data structures [birch] and cluster summary statistics [cure], which approximate standard agglomeration methods. The algorithm in [mclust5] caps the number of data points on which it performs agglomeration. If the number of data points exceeds this cap, it agglomerates a random subset of points, and uses those results to initialize the M step of EM. So as $n$ increases beyond the cap, the computational complexity of agglomeration remains constant with respect to $n$.
2.3 Covariance Constraints
There are many possible constraints that can be made on the covariance matrices in Gaussian mixture modeling [constraints, mclust5]. Constraints lower the number of parameters in the model, which can reduce overfitting, but can introduce unnecessary bias. scikit-learn’s GaussianMixture class implements four covariance constraints (see Table 1).
| Constraint name | Equivalent model in mclust | Description |
|---|---|---|
| Full | VVV | Covariances are unconstrained and can vary between components. |
| Tied | EEE | All components have the same, unconstrained, covariance. |
| Diag | VVI | Covariances are diagonal and can vary between components. |
| Spherical | VII | Covariances are spherical and can vary between components. |
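The four constraints trade flexibility for parameter count, which is visible in the shapes of the fitted covariance arrays in scikit-learn; the two-component, three-dimensional dataset below is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(6, 1, (200, 3))])

# The stored covariance shape reflects each constraint (2 components, d = 3)
shapes = {}
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(
        n_components=2, covariance_type=cov_type, random_state=0
    ).fit(X)
    shapes[cov_type] = gmm.covariances_.shape
```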
2.4 Automatic Model Selection
When clustering data, the user must decide how many clusters to use. In Gaussian mixture modeling, this cannot be done with the typical likelihood ratio test approach because mixture models do not satisfy regularity conditions [mclachlan].
One approach to selecting the number of components is to use a Dirichlet process model [rasmussen, ferguson]. The Dirichlet process is an extension of the Dirichlet distribution, which is the conjugate prior to the multinomial distribution. The Dirichlet process models the probability that a data point comes from the same mixture component as other data points, or a new component altogether. This approach requires approximating posterior probabilities of clusterings with a Markov Chain Monte Carlo method, which is rather computationally expensive.
Another approach is to use metrics such as the Bayesian information criterion (BIC) [bic] or the Akaike information criterion (AIC) [aic]. BIC approximates the posterior probability of a model with a uniform prior, while AIC uses a prior that incorporates sample size and number of parameters [aicbic]. From a practical perspective, BIC is more conservative because its penalty scales with $\ln(n)$, while AIC does not directly depend on $n$. AIC and BIC can also be used to evaluate constraints on covariance matrices, unlike the Dirichlet process model. Our algorithm relies on BIC, as computed by:
$$\mathrm{BIC} = 2\ln(\hat{L}) - p\ln(n),$$

where $\hat{L}$ is the maximized data likelihood, $p$ is the number of parameters, and $n$ is the number of data points. We chose BIC as our evaluation criterion so we can make more direct comparisons with mclust, and because it performed empirically better than AIC on the datasets presented here (not shown).
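A sketch of this computation with scikit-learn; note that sklearn's GaussianMixture.bic uses the opposite sign convention, $-2\ln(\hat{L}) + p\ln(n)$, so lower is better there, and _n_parameters is a private scikit-learn helper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(6, 1, (150, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

n = X.shape[0]
log_lik = gmm.score(X) * n   # score() is the mean log-likelihood, so scale by n
p = gmm._n_parameters()      # free parameter count (private sklearn helper)
bic_paper = 2 * log_lik - p * np.log(n)

# scikit-learn's bic() is -2 ln(L) + p ln(n), i.e. lower is better
assert np.isclose(gmm.bic(X), -bic_paper)
```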
This work is directly inspired by mclust, a clustering package available only in R. The original mclust publication derived various agglomeration criteria from different covariance constraints [mclust_original]. The different covariance constraints are denoted by three letter codes. For example, "EII" means that the covariance eigenvalues are Equal across mixture components, the covariance eigenvalues are Identical to each other, and the orientation is given by the Identity matrix (the eigenvectors are elements of the standard basis).
In subsequent work, mclust was updated to include the fitting of GMMs, and the models were compared via BIC [mclust_em]. Later, model selection was made according to a modified version of BIC that avoids singular covariance matrices [mclust_regularize]. The most recent version of mclust was released in 2016 [mclust5].
2.6 Comparing Clusterings
There are several ways to evaluate a given clustering, and they can be broadly divided into two categories. The first compares distances between points in the same cluster to distances between points in different clusters. The Silhouette Coefficient and the Davies-Bouldin Index are two examples of this type of metric. The second type of metric compares the estimated clustering to a ground truth clustering. Examples of this are Mutual Information and Rand Index. The Rand Index is the fraction of times that two clusterings agree whether a pair of points are in the same cluster or different clusters. Adjusted Rand Index (ARI) corrects for chance and takes values in the interval $[-1, 1]$. If the clusterings are identical, ARI is one, and if one of the clusterings is random, then the expected value of ARI is zero.
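For example, with scikit-learn's adjusted_rand_score (the label vectors are illustrative):

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]

# Identical clustering up to label permutation: ARI is exactly one
assert adjusted_rand_score(truth, [1, 1, 1, 0, 0, 0]) == 1.0

# A clustering that splits one true cluster scores between zero and one
split_ari = adjusted_rand_score(truth, [0, 0, 1, 2, 2, 2])
assert 0 < split_ari < 1
```

Because ARI is invariant to label permutation, it is suitable for comparing clusterings whose label names carry no meaning.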
We evaluate the performance of our algorithm as compared to mclust on three datasets. For each dataset, the algorithms search over all of their clustering options, and across all cluster numbers between 1 and 20.
3.1.1 Synthetic Gaussian Mixture
For the synthetic Gaussian mixture dataset, we sampled 100 data points from a Gaussian mixture with three equally weighted components in three dimensions. All components have an identity covariance matrix and the means are:
We include this dataset to verify that the algorithms can cluster data with clear group structure.
3.1.2 Wisconsin Breast Cancer Diagnostic Dataset
The Wisconsin Breast Cancer Diagnostic Dataset contains data from 569 breast masses that were biopsied with fine needle aspiration. Each data point includes 30 quantitative cytological features, and is labeled by the clinical diagnosis of benign or malignant. The dataset is available through the UCI Machine Learning Repository [bc]. We include this dataset because it was used in one of the original mclust publications [mclust_bc]. As in [mclust_bc], we only include the extreme area, extreme smoothness, and mean texture features.
3.1.3 Spectral Embedding of Larval Drosophila Mushroom Body Connectome
[drosophila] analyzes a Drosophila connectome that was obtained via electron microscopy [drosophila_connectome]. As in [drosophila], we cluster the first six dimensions of the right hemisphere's adjacency spectral embedding. The neuron types (Kenyon cells, input neurons, output neurons, and projection neurons) are considered the true clustering.
AutoGMM evaluates different combinations of clustering options and selects the one that results in the best BIC (see Appendix A for details).
1. For each of the 10 available agglomerative techniques (from sklearn.cluster.AgglomerativeClustering), perform an initial clustering on up to a user-specified maximum number of points. We also perform one k-means clustering initialized with k-means++.
2. Compute the cluster sample means and covariances in each clustering. These are the 11 initializations.
3. For each of the four available covariance constraints (from sklearn.mixture.GaussianMixture), initialize the M step of the EM algorithm with a result from Step 2 and run EM with no regularization.
   (a) If the EM algorithm diverges or any cluster contains only a single data point, restart the EM algorithm, this time adding a small regularization factor to the covariance matrices' diagonals.
   (b) Increase the regularization by a factor of 10 if the EM algorithm diverges or any cluster contains only a single data point. If the regularization is increased beyond a maximum value, report that this GMM constraint has failed and proceed to Step 4.
   (c) If the EM algorithm successfully converges, save the resulting GMM and proceed to Step 4.
4. Repeat Step 3 for all of the 11 initializations from Step 2.
5. Repeat Steps 1-4 for all cluster numbers (by default, 1 through 20).
6. For each GMM that did not fail, compute its BIC.
7. Select the optimal clustering (the one with the largest BIC) as the triple of (i) initialization algorithm, (ii) number of clusters, and (iii) GMM covariance constraint.
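The selection loop above can be sketched as follows. This is a condensed illustration, not the actual AutoGMM implementation: it omits the k-means initialization, the regularization escalation, and failure handling, and uses a smaller option grid (one linkage, five cluster numbers).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def select_gmm(X, max_components=5):
    """Grid-search cluster number and covariance constraint by BIC."""
    best = None
    for k in range(1, max_components + 1):
        # Steps 1-2: agglomerative initialization -> cluster sample means
        if k == 1:
            means = X.mean(axis=0, keepdims=True)
        else:
            labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
            means = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 3: run EM under each covariance constraint
        for cov in ["spherical", "diag", "tied", "full"]:
            gmm = GaussianMixture(n_components=k, covariance_type=cov,
                                  means_init=means, random_state=0).fit(X)
            # sklearn's bic() is lower-is-better, so keep the smallest
            if best is None or gmm.bic(X) < best.bic(X):
                best = gmm
    return best

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(7, 1, (100, 2))])
model = select_gmm(X)
```

On this clean two-component data, BIC selects a two-component model.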
By default, AutoGMM iterates through all combinations of 11 agglomerative methods (Step 1), 4 EM methods (Step 3), and 20 cluster numbers (Step 5). However, users are allowed to restrict the set of options. AutoGMM limits the number of data points in the agglomerative step and limits the number of iterations of EM, so the computational complexity with respect to the number of data points is $\mathcal{O}(n)$.
The EM algorithm can run into singularities if clusters contain only a single element ("singleton clusters"), so the regularization scheme described above avoids such clusters. At first, EM is run with no regularization. However, if this attempt fails, then EM is run again with regularization. As in scikit-learn, regularization involves adding a regularization factor to the diagonal of the covariance matrices to ensure that they are positive definite. This method does not modify the eigenvectors of the covariance matrix, making it rotationally invariant [ledoit].
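The escalation idea can be sketched as follows; the helper name and thresholds are illustrative rather than AutoGMM's actual defaults, and scikit-learn's reg_covar parameter plays the role of the diagonal regularization factor:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_with_escalating_reg(X, k, reg=1e-6, reg_max=1.0):
    """Refit EM, multiplying the diagonal regularization by 10 until no
    cluster is a singleton (illustrative thresholds)."""
    while reg <= reg_max:
        # reg_covar adds `reg` to every covariance diagonal entry
        gmm = GaussianMixture(n_components=k, reg_covar=reg,
                              random_state=0).fit(X)
        # Treat singleton clusters as a failure, as AutoGMM does
        if np.bincount(gmm.predict(X), minlength=k).min() > 1:
            return gmm, reg
        reg *= 10
    raise RuntimeError("regularization exceeded maximum value")

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
gmm, reg = fit_with_escalating_reg(X, k=2)
```

On well-behaved data like this, the first (smallest) regularization already succeeds.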
AutoGMM has a scikit-learn compliant API and is freely available at https://github.com/tathey1/graspy.
3.3 Reference Clustering Algorithms
We compare our algorithm to two other clustering algorithms. The first, mclust v5.4.2, is available on CRAN [mclust5]. We use the package's Mclust function. The second, which we call GraSPyclust, uses the GaussianCluster function in GraSPy. Since initialization is random in GraSPyclust, we perform clustering 50 times and select the result with the best BIC. Both of these algorithms limit the number of EM iterations performed, so their computational complexities are linear with respect to the number of data points.
The data described in Section 3.1 has underlying labels so we choose ARI to evaluate the clustering algorithms.
3.4 Statistical Comparison
In order to statistically compare the clustering methods, we evaluate their performances on random subsets of the data. For each dataset, we take ten independently generated random subsamples, each containing 80% of the total data points. We compile ARI and runtime data from each clustering method on each subsample. Since the same subsamples are used by each clustering method, we can perform a Wilcoxon signed-rank test to statistically evaluate whether the methods perform differently on the datasets.
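The paired comparison can be sketched with SciPy's wilcoxon; the ARI values below are invented for illustration, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired ARI scores of two methods on the same ten subsamples (invented)
ari_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79])
ari_b = np.array([0.71, 0.70, 0.75, 0.69, 0.74, 0.68, 0.73, 0.76, 0.70, 0.72])

# Signed-rank test on the paired differences: does one method
# consistently score higher than the other?
stat, p_value = wilcoxon(ari_a, ari_b)
```

Because the test is paired, it controls for subsample difficulty: each difference compares the two methods on identical data.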
We also cluster the complete data to analyze how each model was chosen, according to BIC. All algorithms were run on a single CPU core with 16GB RAM.
Table 2 shows the models that were chosen by each clustering algorithm on the complete datasets, and the corresponding BIC and ARI values. The actual clusterings are shown in Figures 1-3. In the synthetic dataset, all three methods chose a spherical covariance constraint, which was the correct underlying covariance structure. The GraSPyclust algorithm, however, failed on this dataset, partitioning the data into 10 clusters.
In the Wisconsin Breast Cancer dataset, the different algorithms chose component numbers of three or four, when there were only 2 underlying labels. All algorithms achieved similar BIC and ARI values. In the Drosophila dataset, all algorithms left the mixture model covariances completely unconstrained. Even though AutoGMM achieved the highest BIC, it had the lowest ARI.
In both the Wisconsin Breast Cancer dataset and the Drosophila dataset, AutoGMM achieved ARI values between 0.5 and 0.6, which are not particularly impressive. In the Drosophila data, most of the disagreement between the AutoGMM clustering and the neuron type classification arises from the subdivision of the Kenyon cell type into multiple subgroups (Figure 3). The authors of [drosophila], who used mclust, also note this result.
Figure 4 shows results from clustering random subsets of the data. The results were compared with the Wilcoxon signed-rank test at $\alpha = 0.05$. On all three datasets, AutoGMM and mclust achieved similar ARI values. GraSPyclust resulted in lower ARI values on the synthetic dataset as compared to mclust, but was not statistically different on the other datasets. Figure 4 shows that in all datasets, mclust was the fastest algorithm, and GraSPyclust was the second fastest algorithm.
Figure 5 shows the BIC curves that demonstrate model selection of AutoGMM and mclust on the Drosophila dataset. We excluded the BIC curves from GraSPyclust for simplicity. The curves peak at the chosen models.
We also investigated how algorithm runtimes scale with the number of data points. Figure 6 shows how the runtimes of all clustering options of the different algorithms scale with large datasets. We used linear regression on the log-log data to estimate computational complexity. The slopes for the AutoGMM, mclust, and GraSPyclust runtimes were 0.98, 0.86, and 1.09 respectively. This supports our calculations that the runtime of the three algorithms is linear with respect to $n$.
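The slope-estimation procedure can be sketched as follows, using synthetic runtimes rather than the measured ones:

```python
import numpy as np

# Synthetic runtimes that grow linearly with n (illustrative, not measured)
n = np.array([1e3, 2e3, 4e3, 8e3, 16e3])
runtime = 2e-4 * n

# The slope of the log-log fit estimates the exponent b in runtime ~ n^b
slope, intercept = np.polyfit(np.log(n), np.log(runtime), 1)
```

A slope near 1 indicates linear scaling, matching the empirical values of 0.98, 0.86, and 1.09 reported above.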
In this paper we present an algorithm, AutoGMM, that performs automatic model selection for Gaussian mixture modeling in Python. To the best of our knowledge, this algorithm is the first of its kind that is freely available in Python. AutoGMM iterates through 44 combinations of clustering options in Python’s scikit-learn package and chooses the GMM that achieves the best BIC. The algorithm avoids Gaussian mixtures whose likelihoods diverge, or have singleton clusters.
AutoGMM was compared to mclust, a state-of-the-art clustering package in R, and achieved similar BIC and ARI values on three datasets. Results from the synthetic Gaussian mixture (Table 2, Figure 1) highlight the intuition behind AutoGMM's regularization scheme. GraSPyclust did not perform well on the synthetic data, because it erroneously subdivided the three-cluster data into 10 clusters. AutoGMM avoids this problem because its regularization does not allow singleton clusters. In all fairness, GraSPyclust's performance on subsets of the synthetic data (Figure 4) is much better than its performance on the complete data. However, its random initialization leaves it more susceptible to inconsistent results.
Figure 4 shows that on our datasets, mclust is the fastest algorithm. However, the computational complexity of all algorithms is linear with respect to the number of data points, and this is empirically validated by Figure 6. Thus, mclust is faster by only a constant factor. Several features of the mclust algorithm contribute to this factor. One is that much of the computation in mclust is written in Fortran, a compiled programming language. AutoGMM and GraSPyclust, on the other hand, are written exclusively in Python, an interpreted programming language. Compiled programming languages are typically faster than interpreted ones. Another is that mclust evaluates fewer clustering methods (14) than AutoGMM (44). Indeed, using AutoGMM trades runtime for a larger space of GMMs. However, when users have an intuition into the structure of the data, they can restrict the modeling options and make the algorithm run faster.
One opportunity to speed up the runtime of AutoGMM would involve a more principled approach to selecting the regularization factor. Currently, the algorithm iterates through 8 fixed regularization factors until it achieves a satisfactory solution. However, the algorithm could be improved to choose a regularization factor that is appropriate for the data at hand.
The most obvious theoretical shortcoming of AutoGMM is that there is no explicit handling of outliers or singleton clusters. Since the algorithm does not allow for clusters with only one member, it may perform poorly with stray points of any sort. Future work to mitigate this problem could focus on the data or the model. Data-based approaches include preprocessing for outliers, or clustering subsets of the data. Alternatively, the model could be modified to allow singleton clusters, while still regularizing to avoid singularities.
In the future, we are interested in high dimensional clustering using statistical models. The EM algorithm, a mainstay of GMM research, is even more likely to run into singularities in high dimensions, so we would need to modify AutoGMM accordingly. One possible approach would use random projections, as originally proposed by [dasgupta] in the case where the mixture components have means that are adequately separated, and the same covariance. Another approach would involve computing spectral decompositions of the data, which can recover the true clustering and Gaussian mixture parameters under less restrictive conditions [vempalaspectral, achspectral].