 # An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework

In order to cluster or partition data, we often use the Expectation-Maximization (EM) algorithm or a variational approximation with a Gaussian Mixture Model (GMM), a parametric probability density function represented as a weighted sum of K̂ Gaussian component densities. However, model selection, i.e. finding the underlying K̂, is one of the key concerns in GMM clustering, since the desired clusters can be obtained only when K̂ is known. In this paper, we propose a new model selection algorithm to explore K̂ in a Bayesian framework. The proposed algorithm builds the density of the model order, which information criteria such as AIC and BIC fundamentally fail to reconstruct. In addition, it reconstructs this density quickly compared to time-consuming Monte Carlo simulation.


## 1 Introduction

The Gaussian Mixture Model (GMM) is a well-known approach for clustering data. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities. Given an assumed model $M_K$, the GMM takes the form

$$p(y \mid M_K) = \sum_{k=1}^{K} \pi_k\, p(y \mid \mu_k, Q_k) \tag{1}$$

where $y$ is a set of measurements (observations) with a multivariate, continuous-valued form. Here, $\pi_k$ and $p(y \mid \mu_k, Q_k)$ represent the mixture weight and the Gaussian density of the $k$-th component, respectively. Each component density is a multivariate Gaussian with the mean $\mu_k$ and covariance $Q_k$ of the $k$-th component. Further, the non-negative weights sum to one, i.e. $\pi_k \ge 0$ and $\sum_{k=1}^{K}\pi_k = 1$. In this parameterized form, we can collectively represent the hidden variables by $x_{1:K} = \{\pi_k, \mu_k, Q_k\}$ for $k = 1, \dots, K$. Now, our interest is to reconstruct the posterior distribution $p(x_{1:K} \mid y, K)$ given the model assumption that the Gaussian Mixture Model has $K$ components.
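The mixture density of Eq. (1) can be evaluated directly. A minimal sketch in Python (the component parameters below are arbitrary illustrative values, not taken from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(y, weights, means, covs):
    """Evaluate Eq. (1): a weighted sum of K Gaussian component
    densities at a single point y."""
    return sum(w * multivariate_normal.pdf(y, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Two components in 2-D; the non-negative weights sum to one.
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
p = gmm_density(np.array([0.0, 0.0]), weights, means, covs)
```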

Let $\hat{K}$ be the optimal number of Gaussian components. It is known that if $\hat{K}$ is known, the desired posterior $p(x_{1:\hat{K}} \mid y, \hat{K})$ is straightforwardly estimated by a variational approximation or the classic Expectation-Maximization (EM) algorithm. However, in general the optimal number of components $\hat{K}$ is not known, and it is therefore rather difficult to estimate the hidden parameters $x_{1:K}$, because the dimension of $x_{1:K}$ changes with varying $K$. Generically, the model selection problem for GMM involves finding the optimal $\hat{K}$ by $\hat{K} = \arg\max_K p(K \mid y)$. Many studies in the literature have addressed model order estimation for GMM [Fruhwirth-Schnatter_2007; Kadane_Lazar_2004; BISHEM01; Roberts98:ModelSelectionGMM; Constantinopoulos_bayesianfeature]. For instance, Keribin et al. [Keribin_2000] estimated the number of components for mixture models using a maximum penalized likelihood. Information criteria have also been applied to GMM model selection, such as AIC [Aitkin85:ModelSelection4MixtureModel], BIC [Fraley98:numClust] and the entropy criterion [Celeux_Soromenho_1996].

Using Monte Carlo simulation, Richardson et al. developed inference for GMM model selection via the reversible jump Markov chain Monte Carlo (RJMCMC) [Richardson97:GMM]. Recently, Nobile et al. introduced an efficient clustering algorithm, the so-called allocation sampler [Nobile_Fearnside_2010], which infers the model order and the clusters using Monte Carlo sampling with an efficient marginalized proposal distribution.
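For context, the information-criterion baselines mentioned above can be sketched with scikit-learn's EM-based `GaussianMixture` (a third-party library; the two-cluster toy data here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])

# Fit an EM-based GMM for each candidate K and score it with BIC;
# the criterion's minimizer is the selected model order.
candidates = list(range(1, 7))
bic = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in candidates]
k_hat = candidates[int(np.argmin(bic))]
```

Note that this yields only a point estimate of the model order, not a density over it, which is exactly the limitation the paper targets.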

## 2 Statistical Background

### 2.1 Integrated Nested Laplace Approximation (INLA)

Suppose that we have a set of hidden variables $f$ with hyperparameters $\theta$ and a set of observations $Y$. Integrated Nested Laplace Approximation (INLA) [Rue09:INLA] approximates the marginal posterior by

$$p(f \mid Y) = \int p(f \mid Y, \theta)\, p(\theta \mid Y)\, d\theta \tag{2}$$

$$\approx \int \tilde{p}(f \mid Y, \theta)\, \tilde{p}(\theta \mid Y)\, d\theta \approx \sum_{i} \tilde{p}(f \mid Y, \theta_i)\, \tilde{p}(\theta_i \mid Y)\, \Delta\theta_i \tag{3}$$

where

$$\tilde{p}(\theta \mid Y) \propto \left.\frac{p(f, Y, \theta)}{p_F(f \mid Y, \theta)}\right|_{f = f^*(\theta)} = \left.\frac{p(Y \mid f, \theta)\, p(f \mid \theta)\, p(\theta)}{p_F(f \mid Y, \theta)}\right|_{f = f^*(\theta)}. \tag{4}$$

Here, $p_F(f \mid Y, \theta)$ denotes a simple functional approximation close to $p(f \mid Y, \theta)$, such as a Gaussian approximation, and $f^*(\theta)$ is a value of the functional approximation. For the simple Gaussian approximation case, the proper choice of $f^*(\theta)$ is the mode of the Gaussian approximation of $p(f \mid Y, \theta)$. Given the log of the posterior, we can calculate the mode and its Hessian matrix via quasi-Newton style optimization, and then explore $\tilde{p}(\theta \mid Y)$ by a grid search from the mode in all directions until the log-density falls below a given threshold.
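The mode-finding and grid exploration step can be sketched for a scalar hyperparameter. This is a simplified illustration, assuming SciPy's BFGS optimizer; the function name `explore_hyperparameter` and the settings `delta` and `step` are hypothetical choices, not from INLA's reference implementation:

```python
import numpy as np
from scipy.optimize import minimize

def explore_hyperparameter(log_post, theta0, delta=2.5, step=0.5):
    """Find the mode of log p~(theta|Y) by quasi-Newton (BFGS), then
    step outwards from the mode in both directions, keeping grid
    points until the log-density drops more than `delta` below the
    mode, and finally normalize the grid weights."""
    res = minimize(lambda t: -log_post(t[0]), x0=[theta0], method="BFGS")
    mode = float(res.x[0])
    points = [mode]
    for direction in (1.0, -1.0):
        t = mode + direction * step
        while log_post(mode) - log_post(t) < delta:
            points.append(t)
            t += direction * step
    points = np.sort(np.array(points))
    weights = np.exp(np.array([log_post(t) for t in points]))
    weights /= weights.sum()  # discrete approximation of p(theta|Y)
    return points, weights

# Toy example: a quadratic log-density whose mode is at theta = 1.
points, weights = explore_hyperparameter(lambda t: -0.5 * (t - 1.0) ** 2,
                                         theta0=0.0)
```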

## 3 Proposed Approach

In this study, we extend our previous work, which addressed model selection for the K-nearest neighbour classifier using the K-ORder Estimation Algorithm (KOREA) [Yoon13:PKNN_arXiv], to resolve the model selection problem in the clustering domain. Our proposed algorithm reconstructs the distribution of the number of components using Eq. (4).

### 3.1 Obtaining the optimal number of components

Let $y$ denote a set of observations and let $x_{1:K}$ be the set of model parameters given a model order $K$. The first step of our algorithm is to estimate the optimal number of components $\hat{K}$: $\hat{K} = \arg\max_K \tilde{p}(K \mid y)$. According to Eq. (4), we can obtain an approximated marginal posterior distribution by

$$\tilde{p}(K \mid y) \propto \left.\frac{p(y, x_{1:K}, K)}{p_F(x_{1:K} \mid y, K)}\right|_{x_{1:K} = x^*_{1:K}(K)}. \tag{5}$$

This equation has the property that $K$ is an integer variable, while $\theta$ in Eq. (4) is, in general, continuous. Ignoring this difference, we can still use a quasi-Newton method to obtain the optimal $\hat{K}$ efficiently. Alternatively, we can evaluate all candidate values over a range of $K$ if the range is not too large. Otherwise, we may still use a quasi-Newton style algorithm with a rounding operator that transforms a real value to an integer $K$.
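When the candidate range is modest, the exhaustive alternative above amounts to evaluating and normalizing a score over the integer grid. A minimal sketch, in which `log_marginal` is a hypothetical callable standing in for the log of Eq. (5):

```python
import numpy as np

def select_model_order(log_marginal, k_min=1, k_max=10):
    """Evaluate an approximate log marginal log p~(K|y) at every
    integer candidate K and normalize, returning the argmax and the
    reconstructed density over K."""
    ks = np.arange(k_min, k_max + 1)
    scores = np.array([log_marginal(k) for k in ks])
    post = np.exp(scores - scores.max())   # subtract max for stability
    post /= post.sum()
    return int(ks[np.argmax(post)]), dict(zip(ks.tolist(), post.tolist()))

# Toy stand-in score, peaked at K = 3.
k_hat, density = select_model_order(lambda k: -float((k - 3) ** 2))
```

Unlike AIC/BIC, this returns the whole density over $K$, not only the point estimate.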

### 3.2 Bayesian Model Selection for GMM

In the GMM model of Eq. (1), we have four different types of hidden variables describing the profile of the components: the means ($\mu_{1:K}$), the precisions ($Q_{1:K}$), the weights ($\pi_{1:K}$) of the components, and the unknown number of components $K$. Therefore, given Eq. (5), we can write the parameters collectively as $x_{1:K} = \{\mu_{1:K}, Q_{1:K}, \pi_{1:K}\}$. However, it is rather difficult to obtain an approximated distribution $p_F$ close to the target distribution, since there is no closed form. Worse, it is infeasible to build a Hessian matrix via a quasi-Newton method, since this is extremely slow when the dimension of $x_{1:K}$ is large and $Q_k$ is not a vector but a matrix. Therefore, we introduce a labeling indicator $z$ to decompose the mixture model and apply the variational approach [Bishop06:MachineLearning] to obtain $q(z, x_{1:K})$ such that $q(z, x_{1:K}) \approx p(z, x_{1:K} \mid y, K)$. We thus re-define the problem by adding the component indicators $z$ of the observations. We finally obtain $\tilde{p}(K \mid y)$ in a form similar to that of Eq. (5):

$$\tilde{p}(K \mid y) \propto \left.\frac{p(y, z, x_{1:K}, K)}{p_F(z, x_{1:K} \mid y, K)}\right|_{(z,\, x_{1:K}) = (z,\, x_{1:K})^*(K)} \tag{6}$$

$$= \frac{p(y \mid z^*, x^*_{1:K})\, p(z^* \mid x^*_{1:K})\, p(x^*_{1:K})\, p(K)}{q^*(z^*, x^*_{1:K})} \tag{7}$$

where $z^*$ and $x^*_{1:K}$ are the mode of $q(z, x_{1:K})$, the approximated posterior obtained by the variational approximation, and $q^*$ denotes $q$ evaluated at that mode.
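The variational decomposition above can be illustrated with scikit-learn's `BayesianGaussianMixture`, which performs variational inference for a GMM. Note this is only a rough stand-in: it scores each candidate $K$ by the variational evidence lower bound rather than by the paper's exact Eq. (7), and the toy data are illustrative:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Toy data: two well-separated clusters.
X = np.vstack([rng.normal(-4.0, 1.0, size=(150, 2)),
               rng.normal(4.0, 1.0, size=(150, 2))])

# Fit a variational GMM for each candidate K and record the final
# evidence lower bound as a per-K score.
scores = {}
for k in range(1, 6):
    vb = BayesianGaussianMixture(n_components=k, max_iter=500,
                                 random_state=0).fit(X)
    scores[k] = vb.lower_bound_
k_best = max(scores, key=scores.get)
```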

## 4 Evaluation

In order to evaluate the performance of our proposed approach, we simulated our clustering algorithm using an artificial dataset and several real experimental datasets.

We first investigated the performance of the proposed algorithm on two-dimensional synthetic datasets for GMM clustering. Given $K$ and $N$, data were generated by the hierarchical model:

$$\mu_j = \left[\, 20\cos\!\left(\tfrac{2\pi j}{K}\right),\ 20\sin\!\left(\tfrac{2\pi j}{K}\right) \right] \tag{8}$$

$$\Sigma_j \sim \mathcal{W}(I_{2\times 2},\, 5) \quad \text{for } j \in \{1, 2, \dots, K\} \tag{9}$$

$$\pi_{1:K} \sim \mathrm{DP}(1/K, 1/K, \dots, 1/K) \tag{10}$$

$$y_i \sim \sum_{s=1}^{K} \pi_s\, \mathcal{N}(\,\cdot\,;\, \mu_s, \Sigma_s) \quad \text{for } i \in \{1, 2, \dots, N\} \tag{11}$$

where $\mathcal{W}$ and $\mathrm{DP}$ represent the Wishart distribution and the Dirichlet process, respectively. In the experiments, we tested the performance by varying the number of clusters $K$ and the number of data points $N$. Figures 1 and 2 compare the performance of our approach with that of other model selection algorithms, AIC and BIC.

Fig. 1: Clustered synthetic dataset where the number of clusters K and the number of observations N are varied: (a) AIC, (b) BIC, and (c) our proposed approach
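A sketch of this generator using SciPy's Wishart and Dirichlet samplers (the weight draw in Eq. (10) is realized here as a symmetric Dirichlet with concentration $1/K$, a finite stand-in for the stated process; the seed and sizes are illustrative):

```python
import numpy as np
from scipy.stats import wishart, dirichlet

def generate_synthetic(K, N, seed=0):
    """Draw 2-D data following Eqs. (8)-(11): means on a circle of
    radius 20, Wishart covariances, and mixture weights drawn from a
    symmetric Dirichlet with concentration 1/K."""
    rng = np.random.default_rng(seed)
    mus = np.array([[20 * np.cos(2 * np.pi * j / K),
                     20 * np.sin(2 * np.pi * j / K)]
                    for j in range(1, K + 1)])
    sigmas = wishart.rvs(df=5, scale=np.eye(2), size=K, random_state=rng)
    pis = dirichlet.rvs([1.0 / K] * K, random_state=rng)[0]
    labels = rng.choice(K, size=N, p=pis)              # Eq. (11), component draw
    y = np.array([rng.multivariate_normal(mus[z], sigmas[z]) for z in labels])
    return y, labels

y, labels = generate_synthetic(K=4, N=500)
```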

Figure 1 shows the outputs of the three model selection approaches with different $K$ and $N$. Interestingly, our proposed approach is effective even in the regimes where both AIC and BIC fail. In addition, whereas AIC and BIC can find $\hat{K}$ only when the number of data $N$ is large, our proposed approach builds a distinguishable and clear posterior distribution across all tested cases, and this enables us to detect the apparent mode close to $\hat{K}$ easily.

Fig. 2: Mean of K* in five different runs and MSE with varying K̂

The next question concerns the stability of our approach in noisy environments. We therefore ran five parallel, random simulations with different seeds. The mean and MSE (mean square error) of $K^*$ over the five runs are displayed in Figure 2. We find that our proposed algorithm remains stable in the settings where AIC and BIC are not effective. Furthermore, the MSE of AIC and BIC increases sharply as $\hat{K}$ grows, whereas our approach maintains a small (close to zero) and stable MSE even as $\hat{K}$ increases.

We also evaluated the performance of our algorithm on the three well-known datasets used in [Richardson97:GMM] as real experimental data: enzymatic activity in the blood of 245 unrelated individuals [Bechtel93:Mixture], acidity in a sample of 155 lakes in the northeastern United States [Crawford94:Mixture], and galaxy data with the velocities of 82 distant galaxies.

Fig. 3: Histogram and its clustering results of one dimensional real datasets: (a) Enzyme (b) Acidity and (c) Galaxy

Figure 3 shows the performance comparison between AIC, BIC, MCMC, and our approach. The top sub-graphs show the histograms of the datasets with different numbers of mixture components. AIC and BIC with varying $K$ are plotted in the center row of sub-graphs. The bottom sub-figures display the reconstructed distributions of $K$ by the MCMC method used in [Richardson97:GMM] and by our approach (the mathematical models used in MCMC and our approach are slightly different).

## 5 Conclusion

For Gaussian mixture clustering, we proposed a novel model selection algorithm based on functional approximation in a Bayesian framework. This algorithm has several advantages over conventional model selection techniques. First, the proposed approach can quickly provide a proper distribution of the model order, which other approaches do not provide; only a few time-consuming techniques, such as Monte Carlo simulation, can reconstruct it. In addition, since the proposed algorithm is based on a Bayesian scheme, we do not need to run cross-validation, as is usually done in performance evaluation.