AutoGMM: Automatic Gaussian Mixture Modeling in Python

Thomas L. Athey, et al., Johns Hopkins University. 09/06/2019.

Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. In R, a popular package called mclust addresses both of these problems. However, Python has lacked such a package. We therefore introduce AutoGMM, a Python algorithm for automatic Gaussian mixture modeling. AutoGMM builds upon scikit-learn's AgglomerativeClustering and GaussianMixture classes, with certain modifications to make the results more stable. Empirically, on several different applications, AutoGMM performs approximately as well as mclust. This algorithm is freely available and therefore further shrinks the gap between functionality of R and Python for data science.


1 Introduction

Clustering is a fundamental problem in data analysis where a set of objects is partitioned into clusters according to similarities between the objects. Objects within a cluster are similar to each other, and objects across clusters are different, according to some criteria. Clustering has its roots in the 1960s [cluster_og1, cluster_og2], but is still researched heavily today [cluster_review, jain]. Clustering can be applied to many different problems such as separating potential customers into market segments [cluster_market], segmenting satellite images to measure land cover [cluster_satellite], or identifying when different images contain the same person [cluster_face].

A popular technique for clustering is Gaussian mixture modeling. In this approach, a Gaussian mixture is fit to the observed data via maximum likelihood estimation. The flexibility of the Gaussian mixture model, however, comes at the cost of hyperparameters that can be difficult to tune, and model assumptions that can be difficult to choose [jain]. If users make assumptions about the model's covariance matrices, they risk inappropriate model restriction. On the other hand, relaxing covariance assumptions leads to a large number of parameters to estimate. Users are also forced to choose the number of mixture components and how to initialize the estimation procedure.

This paper presents AutoGMM, a Gaussian mixture model-based algorithm implemented in Python that automatically chooses the initialization, the number of clusters, and the covariance constraints. Inspired by the mclust package in R [mclust5], our algorithm iterates through different clustering options and cluster numbers and evaluates each according to the Bayesian Information Criterion. The algorithm starts with agglomerative clustering, then fits a Gaussian mixture model with a dynamic regularization scheme that discourages singleton clusters. We compared the algorithm to mclust on several datasets, and they perform similarly.

2 Background

2.1 Gaussian Mixture Models

The most popular statistical model of clustered data is the Gaussian mixture model (GMM). A Gaussian mixture is simply a composition of multiple normal distributions. Each component $k$ has a "weight" $w_k$: the proportion of the overall data that belongs to that component. Therefore, the combined probability distribution $f$ is of the form:

$$f(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad x \in \mathbb{R}^d \tag{1}$$

where $K$ is the number of clusters, $d$ is the dimensionality of the data, and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of component $k$.

The maximum likelihood estimate (MLE) of Gaussian mixture parameters cannot be directly computed, so the Expectation-Maximization (EM) algorithm is typically used to estimate model parameters [mclachlan]. The EM algorithm is guaranteed to monotonically increase the likelihood with each iteration [em]. A drawback of the EM algorithm, however, is that it can produce singular covariance matrices if not adequately constrained. The computational complexity of a single EM iteration with respect to the number of data points is $O(n)$.

After running EM, the fitted GMM can be used to "hard cluster" data by calculating which mixture component was most likely to have produced each data point. Soft clusterings of the data are also available upon running the EM algorithm, as each point is assigned a weight corresponding to every component.
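To make the hard/soft distinction concrete, here is a minimal scikit-learn sketch (the two-component data below is hypothetical, not one of the datasets used in this paper):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical data: two well-separated Gaussian blobs in two dimensions.
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    hard_labels = gmm.predict(X)        # hard clustering: most likely component per point
    soft_labels = gmm.predict_proba(X)  # soft clustering: posterior weight of each component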

To initialize the EM algorithm, typically all points are assigned a cluster, which is then fed as input into the M-step. The key question in the initialization then becomes how to initially assign points to clusters.

2.2 Initialization

2.2.1 Random

The simplest way to initialize the EM algorithm is by randomly choosing data points to serve as the initial mixture component means. This method is simple and fast, but different initializations can lead to drastically different results. In order to alleviate this issue, it is common to perform random initialization and subsequent EM several times, and choose the best result. However, there is no guarantee the random initializations will lead to satisfactory results, and running EM many times can be computationally costly.
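A minimal sketch of this best-of-several random restarts strategy with scikit-learn (the call below is illustrative, not the initialization that AutoGMM uses by default):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

    # Run EM from 10 random initializations and keep the highest-likelihood fit.
    gmm = GaussianMixture(n_components=2, init_params='random', n_init=10,
                          random_state=0)
    labels = gmm.fit_predict(X)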

2.2.2 K-Means

Another strategy is to use the k-means algorithm to initialize the mixture component means. K-means is perhaps the most popular clustering algorithm [jain], and it seeks to minimize the squared distance within clusters. The k-means algorithm is usually fast, since the computational complexity of performing a fixed number of iterations is $O(n)$ [cluster_review]. K-means itself needs to be initialized, and k-means++ is a principled choice, since it bounds the k-means cost function [kmeans++]. Since there is randomness in k-means++, running this algorithm on the same dataset may result in different clusterings. GraSPy, a Python package for graph statistics, performs EM initialization this way in its GaussianCluster class.
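GraSPy's GaussianCluster API is not reproduced here; the sketch below illustrates the general idea with scikit-learn only: k-means++ seeding followed by k-means, whose centers then initialize the EM component means.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

    # k-means++ seeding followed by k-means (scikit-learn's default init for KMeans).
    km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)

    # Use the k-means centers to initialize the EM component means.
    gmm = GaussianMixture(n_components=2, means_init=km.cluster_centers_,
                          random_state=0).fit(X)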

2.2.3 Agglomerative Clustering

Agglomerative clustering is a hierarchical technique that starts with every data point as its own cluster. Then, the two closest clusters are merged until the desired number of clusters is reached. In scikit-learn's AgglomerativeClustering class, "closeness" between clusters can be quantified by L1 distance, L2 distance, or cosine similarity.

Additionally, there are several linkage criteria that can be used to determine which clusters should be merged next. Complete linkage, which merges clusters according to the maximally distant data points within a pair of clusters, tends to find compact clusters of similar size. On the other hand, single linkage, which merges clusters according to the closest pairs of data points, is more likely to result in unbalanced clusters with more variable shape. Average linkage merges according to the average distance between points of different clusters, and Ward linkage merges clusters that cause the smallest increase in within-cluster variance. All four of these linkage criteria are implemented in AgglomerativeClustering, and further comparisons between them can be found in [everitt].
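A short sketch of the four linkage criteria with scikit-learn (note that, depending on the scikit-learn version, the distance argument is named affinity or metric; Ward linkage requires Euclidean distances):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

    # Compare the four linkage criteria on the same data; the default distance
    # is Euclidean (L2). Non-ward linkages also accept L1 or cosine distances
    # via the affinity/metric argument, depending on the library version.
    for linkage in ['ward', 'complete', 'average', 'single']:
        labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)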

The computational complexity of agglomerative clustering can be prohibitive for large datasets [xu]. Naively, agglomerative clustering has a computational complexity of $O(n^3)$. However, algorithmic improvements have lowered this upper bound [hclust_eff]. scikit-learn uses minimum spanning tree and nearest neighbor chain methods to achieve $O(n^2)$ complexity. Efforts to make agglomerative methods faster involve novel data structures [birch] and cluster summary statistics [cure], which approximate standard agglomeration methods. The algorithm in [mclust5] caps the number of data points on which it performs agglomeration. If the number of data points exceeds this cap, it agglomerates a random subset of that size and uses those results to initialize the M step of GMM fitting. So as $n$ grows beyond the cap, the computational complexity of agglomeration remains constant with respect to $n$.

2.3 Covariance Constraints

There are many possible constraints that can be made on the covariance matrices in Gaussian mixture modeling [constraints, mclust5]. Constraints lower the number of parameters in the model, which can reduce overfitting, but can introduce unnecessary bias. scikit-learn’s GaussianMixture class implements four covariance constraints (see Table 1).

Constraint name | Equivalent model in mclust | Description
Full            | VVV                        | Covariances are unconstrained and can vary between components.
Tied            | EEE                        | All components have the same, unconstrained, covariance.
Diag            | VVI                        | Covariances are diagonal and can vary between components.
Spherical       | VII                        | Covariances are spherical and can vary between components.
Table 1: Covariance Constraints in scikit-learn's GaussianMixture.
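The constraints in Table 1 map onto GaussianMixture's covariance_type argument; a minimal sketch on hypothetical data:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(5, 1, size=(100, 3))])

    # The four constraints of Table 1 correspond to the four covariance_type options.
    for covariance_type in ['full', 'tied', 'diag', 'spherical']:
        gmm = GaussianMixture(n_components=2, covariance_type=covariance_type,
                              random_state=0).fit(X)
        print(covariance_type, gmm.bic(X))  # scikit-learn's BIC is lower-is-better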

2.4 Automatic Model Selection

When clustering data, the user must decide how many clusters to use. In Gaussian mixture modeling, this cannot be done with the typical likelihood ratio test approach because mixture models do not satisfy regularity conditions [mclachlan].

One approach to selecting the number of components is to use a Dirichlet process model [rasmussen, ferguson]. The Dirichlet process is an extension of the Dirichlet distribution, which is the conjugate prior to the multinomial distribution. The Dirichlet process models the probability that a data point comes from the same mixture component as other data points, or from a new component altogether. This approach requires approximating posterior probabilities of clusterings with a Markov chain Monte Carlo method, which is rather computationally expensive.

Another approach is to use metrics such as the Bayesian information criterion (BIC) [bic] or the Akaike information criterion (AIC) [aic]. BIC approximates the posterior probability of a model with a uniform prior, while AIC uses a prior that incorporates sample size and the number of parameters [aicbic]. From a practical perspective, BIC is more conservative because its penalty scales with $\ln(n)$, and AIC's does not directly depend on $n$. AIC and BIC can also be used to evaluate constraints on covariance matrices, unlike the Dirichlet process model. Our algorithm relies on BIC, as computed by:

$$\mathrm{BIC} = 2\ln(\hat{L}) - p\ln(n) \tag{2}$$

where $\hat{L}$ is the maximized data likelihood, $p$ is the number of parameters, and $n$ is the number of data points. We chose BIC as our evaluation criterion so we can make more direct comparisons with mclust, and because it performed empirically better than AIC on the datasets presented here (not shown).
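Note that scikit-learn's GaussianMixture.bic() returns $-2\ln(\hat{L}) + p\ln(n)$, i.e., the negative of Equation (2), so lower is better there; a minimal sketch of converting between the two conventions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(5, 1, size=(100, 3))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    # scikit-learn computes -2*ln(L) + p*ln(n) (lower is better); negating it gives
    # the 2*ln(L) - p*ln(n) convention used in Equation (2) (higher is better).
    bic_sklearn = gmm.bic(X)
    bic_higher_is_better = -bic_sklearn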

2.5 mclust

This work is directly inspired by mclust, a clustering package available only in R. The original mclust publication derived various agglomeration criteria from different covariance constraints [mclust_original]. The different covariance constraints are denoted by three-letter codes. For example, "EII" means that the covariance eigenvalues are Equal across mixture components, the covariance eigenvalues are Identical to each other, and the orientation is given by the Identity matrix (the eigenvectors are elements of the standard basis).

In subsequent work, mclust was updated to include the fitting of GMMs, and the models were compared via BIC [mclust_em]. Later, model selection was made according to a modified version of BIC that avoids singular covariance matrices [mclust_regularize]. The most recent version of mclust was released in 2016 [mclust5].

2.6 Comparing Clusterings

There are several ways to evaluate a given clustering, and they can be broadly divided into two categories. The first compares distances between points in the same cluster to distances between points in different clusters. The Silhouette Coefficient and the Davies-Bouldin Index are two examples of this type of metric. The second type of metric compares the estimated clustering to a ground-truth clustering. Examples of this are Mutual Information and the Rand Index. The Rand Index is the fraction of times that two clusterings agree on whether a pair of points is in the same cluster or in different clusters. The Adjusted Rand Index (ARI) corrects for chance and takes values in the interval $[-1, 1]$. If the clusterings are identical, ARI is one, and if one of the clusterings is random, then the expected value of ARI is zero.
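A small sketch of ARI with scikit-learn, using hypothetical label vectors:

    from sklearn.metrics import adjusted_rand_score

    true_labels = [0, 0, 0, 1, 1, 1]   # hypothetical ground truth
    pred_labels = [1, 1, 1, 0, 0, 2]   # hypothetical clustering (label names do not matter)

    ari = adjusted_rand_score(true_labels, pred_labels)  # 1.0 for identical clusterings,
                                                         # ~0 expected for random ones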

3 Methods

3.1 Datasets

We evaluate the performance of our algorithm as compared to mclust on three datasets. For each dataset, the algorithms search over all of their clustering options, and across all cluster numbers between 1 and 20.

3.1.1 Synthetic Gaussian Mixture

For the synthetic Gaussian mixture dataset, we sampled 100 data points from a Gaussian mixture with three equally weighted components in three dimensions. All components have an identity covariance matrix and the means are:

(3)

We include this dataset to verify that the algorithms can cluster data with clear group structure.
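A sketch of how such a sample could be drawn with NumPy; the means below are placeholders, not the values in Equation (3):

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder means (Equation (3) gives the actual values used in the paper);
    # three equally weighted spherical components with identity covariance.
    means = np.array([[0.0, 0.0, 0.0],
                      [5.0, 0.0, 0.0],
                      [0.0, 5.0, 0.0]])
    components = rng.integers(0, 3, size=100)    # equally weighted component labels
    X = np.array([rng.multivariate_normal(means[c], np.eye(3)) for c in components])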

3.1.2 Wisconsin Breast Cancer Diagnostic Dataset

The Wisconsin Breast Cancer Diagnostic Dataset contains data from 569 breast masses that were biopsied with fine needle aspiration. Each data point includes 30 quantitative cytological features and is labeled by the clinical diagnosis of benign or malignant. The dataset is available through the UCI Machine Learning Repository [bc]. We include this dataset because it was used in one of the original mclust publications [mclust_bc]. As in [mclust_bc], we only include the extreme area, extreme smoothness, and mean texture features.
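A sketch of how this subset of features might be extracted from scikit-learn's copy of the dataset, assuming that the UCI "extreme" features correspond to the "worst ..." columns there:

    import numpy as np
    from sklearn.datasets import load_breast_cancer

    data = load_breast_cancer()

    # Assumed mapping: UCI "extreme" features appear as "worst ..." columns here.
    feature_names = list(data.feature_names)
    cols = [feature_names.index(f)
            for f in ('worst area', 'worst smoothness', 'mean texture')]
    X = data.data[:, cols]    # shape (569, 3)
    y = data.target           # benign / malignant labels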

3.1.3 Spectral Embedding of Larval Drosophila Mushroom Body Connectome

[drosophila] analyzes a Drosophila connectome that was obtained via electron microscopy [drosophila_connectome]. As in [drosophila], we cluster the first six dimensions of the right hemisphere's adjacency spectral embedding. The neuron types (Kenyon cells, input neurons, output neurons, and projection neurons) are considered the true clustering.

3.2 AutoGMM

AutoGMM performs different combinations of clustering options and selects the method that results in the best BIC (see Appendix A for details).

  1. For each of the 10 available agglomerative techniques (from sklearn.cluster.AgglomerativeClustering), perform an initial clustering on up to a user-specified number of points (default 2000). We also perform one k-means clustering initialized with k-means++.

  2. Compute cluster sample means and covariances for each clustering. These are the 11 initializations.

  3. For each of the four available covariance constraints (from sklearn.mixture.GaussianMixture), initialize the M step of the EM algorithm with a result from Step 2. Run EM with no regularization.

     (a) If the EM algorithm diverges or any cluster contains only a single data point, restart the EM algorithm, this time adding a small factor (10^-6) to the covariance matrices' diagonals as regularization.

     (b) Increase the regularization by a factor of 10 if the EM algorithm diverges or any cluster contains only a single data point. If the regularization is increased beyond 1, simply report that this GMM constraint has failed and proceed to Step 4.

     (c) If the EM algorithm successfully converges, save the resulting GMM and proceed to Step 4.

  4. Repeat Step 3 for all of the 11 initializations from Step 2.

  5. Repeat Steps 1-4 for all cluster numbers in the user-specified range.

  6. For each of the GMMs that did not fail, compute its BIC.

  7. Select the optimal clustering (the one with the largest BIC) as the triple of (i) initialization algorithm, (ii) number of clusters, and (iii) GMM covariance constraint.

By default, AutoGMM iterates through all combinations of the 11 initialization methods (Step 1), 4 covariance constraints (Step 3), and 20 cluster numbers (Step 5). However, users are allowed to restrict the set of options. AutoGMM limits the number of data points in the agglomerative step and limits the number of iterations of EM, so the computational complexity with respect to the number of data points is $O(n)$.

The EM algorithm can run into singularities if clusters contain only a single element ("singleton clusters"), so the regularization scheme described above avoids such clusters. At first, EM is run with no regularization. However, if this attempt fails, then EM is run again with regularization. As in scikit-learn, regularization involves adding a regularization factor to the diagonal of the covariance matrices to ensure that they are positive definite. This method does not modify the eigenvectors of the covariance matrix, making it rotationally invariant [ledoit].
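A simplified sketch of this escalation scheme using scikit-learn's reg_covar argument (this mirrors the spirit of Algorithm 3 in the appendix, not the packaged implementation; the failure and singleton checks are deliberately crude):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_with_escalating_reg(X, k, covariance_type, max_reg=1.0):
        """Try EM with increasing diagonal regularization until the fit has
        no singleton (or empty) clusters, or give up once the factor exceeds max_reg."""
        reg = 0.0
        while reg <= max_reg:
            gmm = GaussianMixture(n_components=k, covariance_type=covariance_type,
                                  reg_covar=reg, random_state=0)
            try:
                labels = gmm.fit_predict(X)
            except ValueError:              # EM failed, e.g. ill-defined covariance
                labels = None
            if labels is not None and np.min(np.bincount(labels, minlength=k)) > 1:
                return gmm, labels, reg     # success: every cluster has > 1 point
            reg = 1e-6 if reg == 0.0 else reg * 10
        return None, None, None             # this covariance constraint failed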

AutoGMM has a scikit-learn compliant API and is freely available at https://github.com/tathey1/graspy.

3.3 Reference Clustering Algorithms

We compare our algorithm to two other clustering algorithms. The first, mclust v5.4.2, is available on CRAN [mclust5]. We use the package's Mclust function. The second, which we call GraSPyclust, uses the GaussianCluster function in GraSPy. Since initialization is random in GraSPyclust, we perform clustering 50 times and select the result with the best BIC. Both of these algorithms limit the number of EM iterations performed, so their computational complexities are linear with respect to the number of data points.

The data described in Section 3.1 have underlying labels, so we choose ARI to evaluate the clustering algorithms.

3.4 Statistical Comparison

In order to statistically compare the clustering methods, we evaluate their performances on random subsets of the data. For each dataset, we take ten independently generated, random subsamples, containing 80% of the total data points. We compile ARI and Runtime data from each clustering method on each subsample. Since the same subsamples are used by each clustering method, we can perform a Wilcoxon signed-rank test to statistically evaluate whether the methods perform differently on the datasets.
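A sketch of the paired test with SciPy, using hypothetical ARI values for two methods on the same ten subsamples:

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical ARI values for two methods on the same ten 80% subsamples.
    ari_autogmm = np.array([0.55, 0.60, 0.52, 0.58, 0.57, 0.54, 0.59, 0.56, 0.53, 0.61])
    ari_mclust  = np.array([0.57, 0.59, 0.55, 0.56, 0.58, 0.55, 0.60, 0.57, 0.52, 0.60])

    # Paired (same subsamples) Wilcoxon signed-rank test.
    stat, p_value = wilcoxon(ari_autogmm, ari_mclust)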

We also cluster the complete data to analyze how each model was chosen, according to BIC. All algorithms were run on a single CPU core with 16GB RAM.

4 Results

Table 2 shows the models that were chosen by each clustering algorithm on the complete datasets, and the corresponding BIC and ARI values. The actual clusterings are shown in Figures 1-3. In the synthetic dataset, all three methods chose a spherical covariance constraint, which was the correct underlying covariance structure. The GraSPyclust algorithm, however, failed on this dataset, partitioning the data into 10 clusters.

In the Wisconsin Breast Cancer dataset, the different algorithms chose component numbers of three or four, when there were only two underlying labels. All algorithms achieved similar BIC and ARI values. In the Drosophila dataset, all algorithms left the mixture model covariances completely unconstrained. Even though AutoGMM achieved the highest BIC, it had the lowest ARI.

In both the Wisconsin Breast Cancer dataset and the Drosophila dataset, AutoGMM achieved ARI values between 0.5 and 0.6, which are not particularly impressive. In the Drosophila data, most of the disagreement between the AutoGMM clustering and the neuron type classification arises from the subdivision of the Kenyon cell type into multiple subgroups (Figure 3). The authors of [drosophila], who used mclust, also note this result.

Dataset       | Agglomeration (AutoGMM) | Covariance Constraint (AutoGMM) | Covariance Constraint (mclust) | Covariance Constraint (GraSPyclust) | Regularization (AutoGMM)
Synthetic     | L2, Ward                | Spherical                       | EII                            | Spherical                           | 0
Breast Cancer | None                    | Diagonal                        | VVI                            | Diagonal                            | 0
Drosophila    | Cosine, Complete        | Full                            | VVV                            | Full                                | 0

Dataset       | Number of Clusters (AutoGMM) | Number of Clusters (mclust) | Number of Clusters (GraSPyclust)
Synthetic     | 3                            | 3                           | 10
Breast Cancer | 4                            | 3                           | 4
Drosophila    | 6                            | 7                           | 7

Dataset       | BIC (AutoGMM) | BIC (mclust) | BIC (GraSPyclust) | ARI (AutoGMM) | ARI (mclust) | ARI (GraSPyclust)
Synthetic     | -1120         | -1111        | -1100             | 1             | 1            | 0.64
Breast Cancer | -8976         | -8970        | -8974             | 0.56          | 0.57         | 0.54
Drosophila    | 4608          | 4430         | 3299              | 0.5           | 0.57         | 0.55

Table 2: Models Chosen by Different Clustering Methods. There are 3 true clusters in the Synthetic dataset, 2 in the Breast Cancer dataset, and 4 in the Drosophila dataset. BIC indicates how well the estimated GMM parameters fit the data (higher is better). ARI indicates how well the clustering agrees with the true clustering (higher is better).
Figure 1: Clustering results of different algorithms on the synthetic dataset (Section 3.1.1). (b, c) AutoGMM and mclust correctly clustered the data. (d) GraSPyclust erroneously subdivided the true clusters.
Figure 2: Clustering results of different algorithms on the breast cancer dataset (Section 3.1.2). The original data was partitioned into two clusters (benign and malignant), but all algorithms here further subdivided the data into three or four clusters.
Figure 3: Clustering results of different algorithms on the Drosophila dataset (Section 3.1.3). There is considerable variation in the different algorithms' results. One similarity, however, is that all algorithms subdivided the Kenyon cell cluster (red points in (a)) into several clusters.

Figure 4 shows results from clustering random subsets of the data. The results were compared with the Wilcoxon signed-rank test. On all three datasets, AutoGMM and mclust achieved similar ARI values. GraSPyclust resulted in lower ARI values on the synthetic dataset as compared to mclust, but was not statistically different on the other datasets. Figure 4 also shows that, on all datasets, mclust was the fastest algorithm and GraSPyclust was the second fastest.

Figure 5 shows the BIC curves that demonstrate model selection of AutoGMM and mclust on the Drosophila dataset. We excluded the BIC curves from GraSPyclust for simplicity. The curves peak at the chosen models.

Figure 4: By clustering random subsets of the data, ARI and runtime values can be compared via the Wilcoxon signed-rank test. (a) On the synthetic dataset, GraSPyclust had a significantly lower ARI than mclust. (b), (c) There were no statistically significant differences between the algorithms on the other datasets. (d-f) mclust was the fastest algorithm, and AutoGMM the slowest, on all three datasets. The p-value was the same (0.005) for all of the statistical tests in (d-f).
Figure 5: BIC values of all clustering options in AutoGMM and mclust on the Drosophila dataset. (a) There are 44 total clustering options in AutoGMM. Each curve corresponds to an agglomeration method, and each subplot corresponds to a covariance constraint (Table 1). (b) The 14 curves correspond to the 14 clustering options in mclust. The chosen models, from Table 2, are marked with a vertical dashed line. Missing or truncated curves indicate that the algorithm did not converge to a satisfactory solution at those points.
Figure 6: Clustering runtimes on datasets of varying size. Each algorithm has several clustering options, and each dotted line corresponds to one of these options. The solid lines are a linear regression between log(runtime) and log(n) on the large datasets. The slopes of these regressions are shown in the legend, and approximate the order of the computational complexity with respect to the number of data points (n). All clustering methods seem to scale linearly with n, as expected. Each dataset contains randomly generated data from a three-component Gaussian mixture in three dimensions, as described in Section 3.1.1.

We also investigated how algorithm runtimes scale with the number of data points. Figure 6 shows how the runtimes of all clustering options of the different algorithms scale on large datasets. We used linear regression on the log-log data to estimate computational complexity. The slopes for the AutoGMM, mclust, and GraSPyclust runtimes were 0.98, 0.86, and 1.09, respectively. This supports our calculation that the runtime of the three algorithms is linear with respect to $n$.
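A sketch of this slope estimate with NumPy, using hypothetical runtimes:

    import numpy as np

    # Hypothetical (n, runtime-in-seconds) pairs for one clustering option.
    n = np.array([1e3, 1e4, 1e5, 1e6])
    runtime = np.array([0.5, 4.8, 52.0, 510.0])

    # Fit log(runtime) ~ slope * log(n) + intercept; the slope approximates the
    # order of the complexity with respect to n (a slope near 1 means linear scaling).
    slope, intercept = np.polyfit(np.log(n), np.log(runtime), 1)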

5 Discussion

In this paper we present an algorithm, AutoGMM, that performs automatic model selection for Gaussian mixture modeling in Python. To the best of our knowledge, this algorithm is the first of its kind that is freely available in Python. AutoGMM iterates through 44 combinations of clustering options in Python’s scikit-learn package and chooses the GMM that achieves the best BIC. The algorithm avoids Gaussian mixtures whose likelihoods diverge, or have singleton clusters.

AutoGMM was compared to mclust, a state of the art clustering package in R, and achieved similar BIC and ARI values on three datasets. Results from the synthetic Gaussian mixture (Table 2, Figure 1) highlight the intuition behind AutoGMM’s regularization scheme. GraSPyclust did not perform well on the synthetic data, because it erroneously subdivided the 3 cluster data into 10 clusters. AutoGMM avoids this problem because its regularization does not allow singleton clusters. In all fairness, GraSPyclust’s performance on subsets of the synthetic data (Figure 4) is much better than its performance on the complete data. However, its random initialization leaves it more susceptible to inconsistent results.

Figure 4 shows that, on our datasets, mclust is the fastest algorithm. However, the computational complexity of all algorithms is linear with respect to the number of data points, and this is empirically validated by Figure 6. Thus, mclust is faster by only a constant factor. Several features of the mclust algorithm contribute to this factor. One is that much of the computation in mclust is written in Fortran, a compiled programming language. AutoGMM and GraSPyclust, on the other hand, are written exclusively in Python, an interpreted programming language. Compiled programming languages are typically faster than interpreted ones. Another is that mclust evaluates fewer clustering methods (14) than AutoGMM (44). Indeed, using AutoGMM trades runtime for a larger space of GMMs. However, when users have an intuition into the structure of the data, they can restrict the modeling options and make the algorithm run faster.

One opportunity to speed up the runtime of AutoGMM would involve a more principled approach to selecting the regularization factor. Currently, the algorithm iterates through 8 fixed regularization factors until it achieves a satisfactory solution. However, the algorithm could be improved to choose a regularization factor that is appropriate for the data at hand.

The most obvious theoretical shortcoming of AutoGMM is that there is no explicit handling of outliers or singleton clusters. Since the algorithm does not allow for clusters with only one member, it may perform poorly with stray points of any sort. Future work to mitigate this problem could focus on the data or the model. Data-based approaches include preprocessing for outliers, or clustering subsets of the data. Alternatively, the model could be modified to allow singleton clusters, while still regularizing to avoid singularities.

In the future, we are interested in high-dimensional clustering using statistical models. The EM algorithm, a mainstay of GMM research, is even more likely to run into singularities in high dimensions, so we would need to modify AutoGMM accordingly. One possible approach would use random projections, as originally proposed by [dasgupta] for the case where the mixture components have means that are adequately separated and share the same covariance. Another approach would involve computing spectral decompositions of the data, which can recover the true clustering and Gaussian mixture parameters under less restrictive conditions [vempalaspectral, achspectral].

References

Appendix A Algorithms

Input:
  • X - n samples of d-dimensional data points.

Optional:
  • k_min - minimum number of clusters, default 2.

  • k_max - maximum number of clusters, default 20.

  • affinities - affinities to use in agglomeration, subset of {L1, L2, cosine, none}, default to all. Note: "none" indicates the k-means initialization option, so it is not paired with linkages.

  • linkages - linkages to use in agglomeration, subset of {ward, complete, average, single}, default to all. Note: ward linkage is only compatible with L2 affinity.

  • cov_constraints - covariance constraints to use in GMM, subset of {full, tied, diag, spherical}, default to all.

Output:
  • best_clustering - label vector indicating the cluster assignment of each data point.

  • best_k - number of clusters.

  • best_init - initialization method.

  • best_cov_constraint - covariance constraint.

  • best_reg_covar - regularization factor.
Algorithm 1: AutoGMM works by performing an initial clustering, then fitting a GMM. The GMM with the best BIC is identified and used to cluster the data. The 44 available clustering options are composed of [(10 agglomeration methods + k-means) for initialization] × [4 covariance constraints in EM]. By default, AutoGMM iterates through all 44 clustering options for all cluster numbers (k) from 2 to 20. However, users can modify the range of cluster numbers and restrict the algorithm to a subset of the clustering options. Lastly, our algorithm saves the results of all clustering options, but those details are beyond the scope of this pseudocode.
function AutoGMM(X, k_min, k_max, affinities, linkages, cov_constraints)
    best_bic = None
    for k = k_min to k_max do
        for affinity in affinities do
            for linkage in linkages do                        ▷ ward linkage is only compatible with L2 affinity
                if affinity ≠ none then
                    X_s = Subset(X)                           ▷ Algorithm 2
                    init = sklearn.cluster.AgglomerativeClustering(k, affinity, linkage).fit_predict(X_s)   [scikit-learn]
                else
                    init = None                               ▷ use the default initialization of GaussianMixture, which is 1 repetition of k-means++ followed by k-means
                end if
                for cov_constraint in cov_constraints do
                    clustering, bic, reg_covar = GaussianCluster(X, k, cov_constraint, init)   ▷ Algorithm 3
                    if best_bic = None or bic > best_bic then
                        best_bic = bic
                        best_clustering = clustering
                        best_k = k
                        best_init = [affinity, linkage]
                        best_cov_constraint = cov_constraint
                        best_reg_covar = reg_covar
                    end if
                end for
            end for
        end for
    end for
end function
Input: X - n samples of d-dimensional data points.
Output: X_s - a random subset of the data.
function Subset(X)
    n_samples = length(X)                    ▷ number of data points
    if n_samples > 2000 then
        X_s = RandomSubset(X, 2000)          ▷ sample 2000 random data points
    else
        X_s = X
    end if
end function
Algorithm 2: Sample a random subset of the data if the number of data points (n) exceeds 2000. So, for n > 2000, the computational complexity of agglomeration is constant with respect to n.
Input:
  • X - n samples of d-dimensional data points.

  • k - number of clusters.

  • cov_constraint - covariance constraint.

Optional:
  • init - initialization (cluster labels).

Output:
  • labels - cluster labels.

  • bic - Bayesian Information Criterion.

  • reg_covar - regularization factor.

function GaussianCluster(X, k, cov_constraint, init)
    if init ≠ None then
        weights_init, means_init, precisions_init = EstimateGaussianParameters(X, init)   ▷ compute the sample weight, mean, and precision of each cluster
    else
        weights_init, means_init, precisions_init = None, None, None
    end if
    reg_covar = 0
    bic = None
    converged = False
    while reg_covar ≤ 1 and converged = False do
        gmm = sklearn.mixture.GaussianMixture(k, cov_constraint, reg_covar, weights_init, means_init, precisions_init)   [scikit-learn]
        labels = gmm.fit_predict(X)                   ▷ EM algorithm
        if SingletonCluster(labels) or EMError then   ▷ if there is a singleton cluster, or if EM failed
            reg_covar = IncreaseReg(reg_covar)        ▷ Algorithm 4
        else
            bic = gmm.bic(X)
            converged = True
        end if
    end while
end function
Algorithm 3: Regularized Gaussian Cluster. Initially, the algorithm attempts to fit a GMM without regularization. If the EM algorithm diverges, or if the GMM clustering has a cluster with a single data point ("singleton cluster"), then EM is repeated with a higher regularization. If there continue to be singleton clusters even when the regularization is increased to a certain point (1), then this clustering attempt terminates and is considered a failure.
Input: reg_covar - current covariance regularization factor.
Output: reg_covar_new - incremented covariance regularization factor.
function IncreaseReg(reg_covar)
    if reg_covar = 0 then
        reg_covar_new = 1e-6
    else
        reg_covar_new = 10 × reg_covar
    end if
end function
Algorithm 4: Increase Covariance Regularization Factor. The lowest nonzero regularization factor is 1e-6, which is subsequently increased by a factor of 10.