Dirichlet Process Parsimonious Mixtures for clustering

01/14/2015 ∙ by Faicel Chamroukhi, et al.

Parsimonious Gaussian mixture models, which exploit an eigenvalue decomposition of the group covariance matrices of the Gaussian mixture, have proven successful in cluster analysis. Their estimation is generally performed by maximum likelihood and has also been considered from a parametric Bayesian perspective. We propose new Dirichlet Process Parsimonious Mixtures (DPPM), a Bayesian nonparametric formulation of these parsimonious Gaussian mixture models. The proposed DPPM models are Bayesian nonparametric parsimonious mixture models that simultaneously infer the model parameters, the optimal number of mixture components and the optimal parsimonious mixture structure from the data. We develop a Gibbs sampling technique for maximum a posteriori (MAP) estimation of the DPPM models and provide a Bayesian model selection framework based on Bayes factors. We apply the models to cluster simulated and real data sets, and compare them to standard parsimonious mixture models. The obtained results highlight the effectiveness of the proposed nonparametric parsimonious mixture models as a good alternative to the parametric parsimonious models.


1 Introduction

Clustering is one of the essential tasks in statistics and machine learning. Model-based clustering, that is, the clustering approach based on the parametric finite mixture model (1), is one of the most popular and successful approaches in cluster analysis (2; 3; 4). The finite mixture model decomposes the density of the observed data as a weighted sum of a finite number of component densities. Most often, the model used for multivariate real data is the finite Gaussian mixture model (GMM), in which each mixture component is Gaussian. This paper focuses on Gaussian mixture modeling for multivariate real data.

In (3) and (5), the authors developed a parsimonious GMM clustering approach by exploiting an eigenvalue decomposition of the group covariance matrices of the GMM components, which provides a wide range of very flexible models with different clustering criteria. It was also demonstrated in (4) that the parsimonious mixture model-based clustering framework provides very good results in density estimation, cluster analysis and discriminant analysis.

In model-based clustering using GMMs, the parameters of the Gaussian mixture are usually estimated in a maximum likelihood estimation (MLE) framework by maximizing the observed-data likelihood. This is usually performed by the Expectation-Maximization (EM) algorithm (6; 7) or EM extensions (7). The parameters of the parsimonious Gaussian mixture models may also be estimated in a MLE framework by using the EM algorithm (5).

However, a first issue in the MLE approach using the EM algorithm for normal mixtures is that it may fail due to singularities or degeneracies, as highlighted for instance in (8; 9; 10). Bayesian estimation methods for mixture models have led to intensive research in the field for dealing with the problems encountered in MLE for mixtures (11; 12; 13; 14; 8; 15; 16; 17; 18). They avoid these problems by replacing the MLE with the maximum a posteriori (MAP) estimator. This is achieved by introducing a regularization over the model parameters via prior parameter distributions, which are assumed to be uniform in the case of MLE.

The MAP estimation for the Bayesian Gaussian mixture is performed by maximizing the posterior parameter distribution. In some situations, this can be performed by an EM-MAP scheme as in (9; 10), where the authors proposed an EM algorithm for estimating Bayesian parsimonious Gaussian mixtures. However, the common estimation approach for Bayesian mixtures is still Bayesian sampling, such as Markov Chain Monte Carlo (MCMC), namely Gibbs sampling (11; 8; 15) when the number of mixture components is known, or reversible jump MCMC, introduced by (19), as in (14; 8). The flexible eigenvalue decomposition of the group covariance matrix described previously was also exploited in Bayesian parsimonious model-based clustering by (15; 16), where the authors used a Gibbs sampler for model inference.

For these model-based clustering approaches, the number of mixture components is usually assumed to be known. Another issue in the finite mixture model-based clustering approach, for the MLE approach as well as the MAP approach, is therefore that of selecting the optimal number of mixture components, that is, the problem of model selection. Model selection is in general performed through a two-fold strategy, by selecting the best model from pre-established inferred model candidates. For the MLE approach, the choice of the optimal number of mixture components can be performed via penalized log-likelihood criteria such as the Bayesian Information Criterion (BIC) (20), the Akaike Information Criterion (AIC) (21), the Approximate Weight of Evidence (AWE) criterion (3), or the Integrated Classification Likelihood criterion (ICL) (22), etc. For the MAP approach, this can still be performed via modified penalized log-likelihood criteria, such as a modified version of BIC computed for the posterior mode as in (10), and more generally via Bayes factors (23), as in (15) for parsimonious mixtures. Bayes factors are indeed the natural Bayesian criterion for model selection and comparison in the Bayesian framework, for which criteria such as BIC, AWE, etc. in fact represent approximations. There are also Bayesian extensions of mixture models that analyse mixtures with an unknown number of components, for example the one in (14) using RJMCMC and those in (24; 8) using birth-and-death processes. They are referred to as fully Bayesian mixture models (14), as they consider the number of mixture components as a parameter to be inferred from the data, jointly with the mixture model parameters, based on the posterior distributions.

However, these standard finite mixture models, both non-Bayesian and Bayesian, are parametric and may not be well adapted to the case of unknown and complex data structure. Recently, the Bayesian nonparametric (BNP) formulation of mixture models, which goes back to (25) and (26), has attracted much attention as a nonparametric alternative for formulating mixtures. BNP methods (13; 27) have indeed recently become popular due to their flexible modeling capabilities and advances in inference techniques, in particular for mixture models, namely by using MCMC sampling techniques (28; 29) or variational inference (30). BNP methods for clustering (13; 27), including Dirichlet Process Mixtures (DPM) and Chinese Restaurant Process (CRP) mixtures (25; 26; 31; 32; 33), represented as Infinite Gaussian Mixture Models (IGMM) (29), provide a principled way to overcome the issues in standard model-based clustering and classical Bayesian mixtures for clustering. BNP mixtures for clustering are fully Bayesian approaches that offer a principled alternative for jointly inferring the number of mixture components (i.e., clusters) and the mixture parameters from the data. By using general processes as priors, they avoid the problem of singularities and degeneracies of the MLE and simultaneously infer the optimal number of clusters from the data, in a one-fold scheme, rather than in the two-fold approach of standard model-based clustering. They also avoid assuming restricted functional forms and thus allow the complexity and accuracy of the inferred models to grow as more data are observed. They also represent a good alternative to the difficult problem of model selection in parametric mixture models.

In this paper, we propose a new BNP formulation of the Gaussian mixture with the eigenvalue decomposition of the group covariance matrix of each Gaussian component, which has proven its flexibility in cluster analysis in the parametric case (3; 5; 4; 15). A first idea of this approach was presented in (34). We develop new Dirichlet Process mixture models with parsimonious covariance structure, which results in Dirichlet Process Parsimonious Mixtures (DPPM). They represent a Bayesian nonparametric formulation of the parsimonious Gaussian mixture models. The proposed DPPM models are Bayesian parsimonious mixture models with a Dirichlet Process prior and thus provide a principled way to overcome the issues encountered in the parametric Bayesian and non-Bayesian cases; they automatically and simultaneously infer the model parameters and the optimal model structure from the data, among different models going from the simplest spherical ones to the more complex general one. We develop a Gibbs sampling technique for maximum a posteriori (MAP) estimation of the various models and provide a unifying framework for model selection and comparison, namely by using Bayes factors, to simultaneously select the optimal number of mixture components and the best parsimonious mixture structure. The proposed DPPMs are more flexible in terms of modeling and use in clustering, and automatically infer the number of clusters from the data.

The paper is organized as follows. Section 2 describes and discusses previous work on model-based clustering. Section 3 then presents the proposed models and the learning technique. In Section 4, we give experimental results evaluating the proposed models on simulated and real data. Finally, Section 5 is devoted to a discussion and concluding remarks.

2 Parametric model-based clustering

Let $X = (x_1,\ldots,x_n)$ be a sample of $n$ i.i.d. observations in $\mathbb{R}^d$, and let $z = (z_1,\ldots,z_n)$ be the corresponding unknown cluster labels, where $z_i \in \{1,\ldots,K\}$ represents the cluster label of the $i$th data point $x_i$, $K$ being the possibly unknown number of clusters.

2.1 Model-based clustering

Parametric Gaussian clustering, also called model-based clustering (2; 4), is based on the finite GMM (1), in which the probability density function of the data is given by:

$$p(x_i|\theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i|\mu_k,\Sigma_k) \qquad (1)$$

where the $\pi_k$'s are the non-negative mixing proportions that sum to one, $\mu_k$ and $\Sigma_k$ are respectively the mean vector and the covariance matrix of the $k$th Gaussian component density, and $\theta = (\pi_1,\ldots,\pi_K,\mu_1,\ldots,\mu_K,\Sigma_1,\ldots,\Sigma_K)$ is the GMM parameter vector. From a generative point of view, the generative process of the data for the finite mixture model can be stated as follows. First, a mixture component $z_i$ is sampled independently from a Multinomial distribution given the mixing proportions $(\pi_1,\ldots,\pi_K)$. Then, given the mixture component $z_i = k$ and the corresponding parameters $(\mu_k,\Sigma_k)$, the data $x_i$ are generated independently from a Gaussian with the parameters of component $k$. This is summarized by the two steps:

$$z_i \sim \mathcal{M}(1;\pi_1,\ldots,\pi_K) \qquad (2)$$
$$x_i|z_i=k \sim \mathcal{N}(\mu_k,\Sigma_k) \qquad (3)$$
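For illustration, the following Python sketch implements this two-step generative process; the mixture parameters in the example are arbitrary placeholders, not values taken from the paper.

```python
import numpy as np

def sample_finite_gmm(n, weights, means, covs, seed=None):
    """Draw n points from a finite Gaussian mixture via the two-step process (2)-(3)."""
    rng = np.random.default_rng(seed)
    # Step 1: sample the component label z_i from a categorical (multinomial) distribution
    z = rng.choice(len(weights), size=n, p=weights)
    # Step 2: sample x_i from the Gaussian of its component
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

# Example: a two-component mixture in R^2 with illustrative parameters
x, z = sample_finite_gmm(
    n=500,
    weights=[0.4, 0.6],
    means=[np.zeros(2), np.array([3.0, 3.0])],
    covs=[np.eye(2), 0.5 * np.eye(2)],
)
```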

The mixture model parameters can be estimated in a maximum likelihood estimation (MLE) framework by maximizing the observed-data likelihood (4):

$$p(X|\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i|\mu_k,\Sigma_k) \qquad (4)$$

The maximum likelihood estimation usually relies on the Expectation-Maximization (EM) algorithm (6; 7) or EM extensions (7).
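As a hedged illustration, this EM-based fitting step can be reproduced with an off-the-shelf implementation such as scikit-learn's GaussianMixture; this is a generic stand-in on placeholder data, not the implementation used in the cited works.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder two-cluster data in R^2
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(4.0, 1.0, size=(150, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=200, random_state=0)
gmm.fit(X)               # EM iterations maximizing the observed-data likelihood (4)
labels = gmm.predict(X)  # MAP component assignments, i.e. the clustering
```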

2.2 Bayesian model-based clustering

As stated in Section 1, the MLE approach using the EM algorithm for normal mixtures may fail due to singularities or degeneracies (8; 9; 10). The Bayesian approach to mixture models avoids these problems. Parameter estimation for the Gaussian mixture model in the Bayesian approach is performed in a MAP estimation framework by maximizing the posterior parameter distribution

$$\hat{\theta} = \arg\max_{\theta}\, p(\theta|X) = \arg\max_{\theta}\, p(X|\theta)\, p(\theta), \qquad (5)$$

$p(\theta)$ being a chosen prior distribution over the model parameters $\theta$. For the GMM, the prior distribution in general takes the following form:

$$p(\theta|\omega) = p(\pi_1,\ldots,\pi_K)\, \prod_{k=1}^{K} p(\mu_k|\Sigma_k)\, p(\Sigma_k) \qquad (6)$$

where $\omega$ denotes the hyperparameters. A common choice for the GMM is to assume conjugate priors, that is, a Dirichlet distribution for the mixing proportions, as in (14; 35), and a multivariate normal Inverse-Wishart prior for the Gaussian parameters, that is, a multivariate normal for the means and an Inverse-Wishart for the covariance matrices, as for example in (15; 9; 10).

From a generative point of view, to generate data from the Bayesian GMM, a first step is to sample the model parameters from the prior, that is, to sample the mixing proportions from their conjugate Dirichlet prior distribution, and the mean vectors and covariance matrices of the Gaussian components from the corresponding conjugate multivariate normal Inverse-Wishart prior. The generative procedure then remains the same as the previously described generative process, and is summarized by the following steps:

$$(\pi_1,\ldots,\pi_K) \sim \mathcal{D}(\alpha_1,\ldots,\alpha_K) \qquad (7)$$
$$z_i|(\pi_1,\ldots,\pi_K) \sim \mathcal{M}(1;\pi_1,\ldots,\pi_K) \qquad (8)$$
$$(\mu_k,\Sigma_k) \sim G_0 \qquad (9)$$
$$x_i|z_i=k,\ (\mu_k,\Sigma_k) \sim \mathcal{N}(\mu_k,\Sigma_k) \qquad (10)$$

where $(\alpha_1,\ldots,\alpha_K)$ are the hyperparameters of the Dirichlet prior distribution and $G_0$ is a prior distribution for the parameters of the Gaussian components, that is, a multivariate Normal Inverse-Wishart distribution in the GMM case:

$$\mu_k|\Sigma_k \sim \mathcal{N}(\mu_0, \Sigma_k/\kappa_0) \qquad (11)$$
$$\Sigma_k \sim \mathcal{IW}(\nu_0, \Lambda_0) \qquad (12)$$

where $\mathcal{IW}$ stands for the Inverse-Wishart distribution and $(\mu_0,\kappa_0,\nu_0,\Lambda_0)$ are its hyperparameters.
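A minimal sketch of drawing GMM parameters from this Dirichlet plus Normal-Inverse-Wishart prior is given below; the hyperparameter names and values are illustrative assumptions, not those used later in the experiments.

```python
import numpy as np
from scipy.stats import dirichlet, invwishart, multivariate_normal

def sample_bayesian_gmm_prior(K, alpha0, mu0, kappa0, nu0, Lambda0, seed=None):
    """Draw GMM parameters from a Dirichlet + Normal-Inverse-Wishart prior (sketch)."""
    rng = np.random.default_rng(seed)
    # Mixing proportions from a symmetric Dirichlet prior
    pi = dirichlet.rvs(alpha0 * np.ones(K), random_state=rng)[0]
    mus, Sigmas = [], []
    for _ in range(K):
        # Covariance from an Inverse-Wishart, then mean from a Normal given that covariance
        Sigma = invwishart.rvs(df=nu0, scale=Lambda0, random_state=rng)
        mu = multivariate_normal.rvs(mean=mu0, cov=Sigma / kappa0, random_state=rng)
        mus.append(mu)
        Sigmas.append(Sigma)
    return pi, mus, Sigmas

# Example in dimension d = 2 with K = 3 components (illustrative hyperparameters)
pi, mus, Sigmas = sample_bayesian_gmm_prior(
    K=3, alpha0=1.0, mu0=np.zeros(2), kappa0=1.0, nu0=4.0, Lambda0=np.eye(2))
```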

The parameters of the Bayesian Gaussian mixture are estimated by MAP estimation, by maximizing the posterior parameter distribution (5). The MAP estimation can still be performed by EM, namely in the case of conjugate priors where the prior distribution is only considered for the parameters of the Gaussian components, as in (9; 10). However, in general, the common estimation approach for the Bayesian GMM described above is Bayesian sampling, such as MCMC sampling techniques, namely the Gibbs sampler (36; 37; 11; 15; 35; 13; 8).

2.3 Parsimonious Gaussian mixture models

GMM clustering has been extended to parsimonious GMM clustering (5; 3) by exploiting an eigenvalue decomposition of the group covariance matrices, which provides a wide range of very flexible models with different clustering criteria. In these parsimonious models, the group covariance matrix of each cluster $k$ is decomposed as

$$\Sigma_k = \lambda_k D_k A_k D_k^T \qquad (13)$$

where $\lambda_k = |\Sigma_k|^{1/d}$, $D_k$ is an orthogonal matrix of eigenvectors of $\Sigma_k$ and $A_k$ is a diagonal matrix with determinant 1 whose diagonal elements are the normalized eigenvalues of $\Sigma_k$ in decreasing order. As pointed out in (5), the scalar $\lambda_k$ determines the volume of cluster $k$, $D_k$ its orientation and $A_k$ its shape. Thus, this decomposition leads to several flexible models (5), going from the simplest spherical models to the most complex general one, and hence is adapted to various clustering situations.
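The decomposition (13) can be computed numerically from any covariance matrix; the short sketch below, using an arbitrary example matrix, recovers the volume, orientation and shape factors and recomposes the covariance.

```python
import numpy as np

def decompose_covariance(Sigma):
    """Decompose Sigma into volume lambda_k, orientation D_k and shape A_k as in Eq. (13)."""
    d = Sigma.shape[0]
    eigvals, D = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]        # eigh is ascending; reorder to decreasing
    eigvals, D = eigvals[order], D[:, order]
    lam = np.prod(eigvals) ** (1.0 / d)      # volume factor: |Sigma|^(1/d)
    A = np.diag(eigvals / lam)               # normalized eigenvalues, det(A) = 1
    return lam, D, A

Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])    # arbitrary illustrative covariance
lam, D, A = decompose_covariance(Sigma)
assert np.allclose(Sigma, lam * D @ A @ D.T)  # recomposition: Sigma = lambda * D A D^T
```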

The parameters of the parsimonious Gaussian mixture models are estimated in a MLE framework by using the EM algorithm. The details of the EM algorithm for the different parsimonious finite GMMs are given in (5). The parsimonious GMMs have also attracted much attention from the Bayesian perspective. For example, in (15), the authors proposed a fully Bayesian formulation for inferring the previously described parsimonious finite Gaussian mixture models. This Bayesian formulation was applied in model-based cluster analysis (15; 16). Model inference in this Bayesian formulation is performed in a MAP estimation framework by using MCMC sampling techniques; see for example (15; 16). Another Bayesian regularization for the parsimonious GMM was proposed by (9; 10), in which the maximization of the posterior can still be performed by the EM algorithm in the MAP framework.

2.4 Model selection in finite mixture models

Finite mixture model-based clustering requires specifying the number of mixture components (i.e., clusters) and, in the case of parsimonious models, the type of the model. The main issues in this parametric setting are therefore those of selecting the number of mixture components (clusters), and possibly the type of model, that best fit the data. This problem can be tackled by penalized log-likelihood criteria such as BIC (20), penalized classification log-likelihood criteria such as AWE (3) or ICL (22), or more generally by Bayes factors (23), which provide a general way to select and compare models in (Bayesian) statistical modeling, namely in Bayesian mixture models.
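As a hedged illustration of this two-fold strategy in the MLE setting, the sketch below selects the number of components by BIC with scikit-learn on placeholder data; it is a generic stand-in for the penalized-likelihood criteria mentioned above, not the Bayes-factor machinery developed later for the DPPM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Placeholder data: two well-separated Gaussian clusters in R^2
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 1.0, size=(200, 2))])

# Fit a GMM for each candidate K and keep the K minimizing BIC
bic = {K: GaussianMixture(n_components=K, random_state=0).fit(X).bic(X)
       for K in range(1, 7)}
best_K = min(bic, key=bic.get)
```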

3 Dirichlet Process Parsimonious Mixture (DPPM)

The Bayesian and non-Bayesian finite mixture models described previously are, however, parametric and may not be well adapted to represent complex and realistic data sets. Recently, Bayesian nonparametric (BNP) mixtures, in particular the Dirichlet Process Mixture (DPM) (25; 26; 32; 33) or, equivalently, the Chinese Restaurant Process (CRP) mixture (38; 39; 33), which may be seen as an infinite mixture model (29), have provided a principled way to overcome the issues in standard model-based clustering and classical Bayesian mixtures for clustering. They are fully Bayesian approaches and offer a principled alternative for jointly inferring the number of mixture components (i.e., clusters) and the mixture parameters from the data. In the next section, we rely on the Dirichlet Process Mixture (DPM) formulation to derive the proposed approach.

BNP mixture approaches for clustering assume a general process as a prior over the infinite possible partitions, which is less restrictive than classical Bayesian inference. Such a prior can be a Dirichlet Process (25; 26; 33) or, equivalently, a Chinese Restaurant Process (39; 33).

3.1 Dirichlet Process Parsimonious Mixture

A Dirichlet Process (DP) (25) is a distribution over distributions and has two parameters, the concentration parameter $\alpha$ and the base measure $G_0$; we denote it by $\mathrm{DP}(\alpha,G_0)$. Assume there is a parameter $\theta_i$ following a distribution $G$, that is $\theta_i \sim G$. Modeling with a DP means that we assume the prior over $G$ is a DP, that is, $G$ is itself generated from a DP: $G \sim \mathrm{DP}(\alpha,G_0)$. This can be summarized by the following generative process:

$$G|\alpha,G_0 \sim \mathrm{DP}(\alpha,G_0) \qquad (14)$$
$$\theta_i|G \sim G \qquad (15)$$

The DP has two important properties (25). First, random distributions drawn from a DP, that is $G \sim \mathrm{DP}(\alpha,G_0)$, are discrete. Thus, there is a strictly positive probability of multiple observations taking identical values within the set $(\theta_1,\ldots,\theta_n)$. Suppose we have a random distribution $G$ drawn from a DP followed by repeated draws $(\theta_1,\ldots,\theta_n)$ from that random distribution. (40) introduced a Polya urn representation of the joint distribution of the random variables $(\theta_1,\ldots,\theta_n)$, that is

$$p(\theta_1,\ldots,\theta_n) = p(\theta_1)\, p(\theta_2|\theta_1)\, p(\theta_3|\theta_1,\theta_2)\cdots p(\theta_n|\theta_1,\ldots,\theta_{n-1}), \qquad (16)$$

which is obtained by marginalizing out the underlying random measure $G$:

$$p(\theta_1,\ldots,\theta_n) = \int \Big(\prod_{i=1}^{n} p(\theta_i|G)\Big)\, dP(G|\alpha,G_0) \qquad (17)$$

and results in the following Polya urn representation for the calculation of the predictive terms of the joint distribution (16):

$$\theta_i|\theta_1,\ldots,\theta_{i-1} \sim \frac{\alpha}{\alpha+i-1}\, G_0 + \sum_{j=1}^{i-1} \frac{1}{\alpha+i-1}\, \delta_{\theta_j} \qquad (18)$$
$$\qquad\qquad\;\; \sim \frac{\alpha}{\alpha+i-1}\, G_0 + \sum_{k=1}^{K_{i-1}} \frac{n_k}{\alpha+i-1}\, \delta_{\theta^*_k} \qquad (19)$$

where $K_{i-1}$ is the number of clusters after $i-1$ samples and $n_k$ denotes the number of times each of the unique parameter values $\theta^*_k$ occurred in the set $(\theta_1,\ldots,\theta_{i-1})$. The DP therefore places its probability mass on a countably infinite collection of points, also called atoms, that is, an infinite mixture of Dirac deltas (25; 41; 33):

$$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta^*_k}, \qquad \theta^*_k \sim G_0, \qquad (20)$$

where $\pi_k$ represents the probability assigned to the $k$th atom, the probabilities satisfying $\sum_{k=1}^{\infty}\pi_k = 1$, and $\theta^*_k$ is the location or value of that atom. These atoms are drawn independently from the base measure $G_0$. Hence, according to the DP, the generated parameters exhibit a clustering property, that is, they share repeated values with positive probability, where the unique values shared among the variables are independent draws from the base distribution $G_0$ (25; 33). The Dirichlet process therefore provides a very interesting approach from a clustering perspective when we do not have a fixed number of clusters, in other words when we have an infinite mixture, that is, $K$ tending to infinity. Consider a set of observations $(x_1,\ldots,x_n)$ to be clustered. Clustering with the DP adds a third step to the DP generative process (14)-(15): we assume that the random variables $x_i$, given the distribution parameters $\theta_i$ which are generated from a DP, are generated from a distribution $f(\cdot|\theta_i)$. This is the DP Mixture model (DPM) (26; 42; 32; 33). The generative process of the DPM is as follows:

$$G|\alpha,G_0 \sim \mathrm{DP}(\alpha,G_0) \qquad (21)$$
$$\theta_i|G \sim G \qquad (22)$$
$$x_i|\theta_i \sim f(x_i|\theta_i) \qquad (23)$$

where $f(x_i|\theta_i)$ is a cluster-specific density, for example a multivariate Gaussian density in the case of a DP multivariate Gaussian mixture, in which $\theta_i$ is composed of a mean vector and a covariance matrix. In that case, $G_0$ may be a multivariate normal Inverse-Wishart conjugate prior. When $K$ tends to infinity, it can be shown that the finite mixture model (1) converges to a Dirichlet process mixture model (43; 28; 29). The Dirichlet process has a number of properties which make inference based on this nonparametric prior computationally tractable. It also has an interpretation in terms of the CRP mixture (39; 33). Indeed, the second property of the DP, namely that random parameters drawn from a DP exhibit a clustering property, connects the DP to the CRP. Consider a random distribution $G$ drawn from a DP followed by repeated draws $(\theta_1,\ldots,\theta_n)$ from that random distribution. The structure of shared values defines a partition of the integers from $1$ to $n$, and the distribution of this partition is a CRP (25; 33).
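The atomic representation (20) can be simulated via the (truncated) stick-breaking construction; the sketch below is purely illustrative, with a standard normal base measure chosen arbitrarily as $G_0$.

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation=200, seed=None):
    """Truncated stick-breaking draw of G ~ DP(alpha, G0), cf. Eq. (20) (sketch)."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                     # atom probabilities pi_k (approximately sum to 1)
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])  # atom locations from G0
    return weights, atoms

# Example: base measure G0 = N(0, 1); the draw G is a discrete distribution over atoms
weights, atoms = stick_breaking_dp(alpha=2.0,
                                   base_sampler=lambda rng: rng.normal(0.0, 1.0))
```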

3.2 Chinese Restaurant Process (CRP) parsimonious mixture

Consider the unknown cluster labels $z = (z_1,\ldots,z_n)$, where each value $z_i$ is an indicator random variable that represents the label of the unique value $\theta^*_{z_i}$ of $\theta_i$, such that $\theta_i = \theta^*_{z_i}$ for all $i \in \{1,\ldots,n\}$. The CRP provides a distribution over the infinite partitions of the data, that is, a distribution over the positive integers. Consider the following joint distribution of the unknown cluster assignments $(z_1,\ldots,z_n)$:

$$p(z_1,\ldots,z_n) = p(z_1)\, p(z_2|z_1)\cdots p(z_n|z_1,\ldots,z_{n-1}). \qquad (24)$$

From the Polya urn distribution (19), each predictive term of the joint distribution (24) is given by:

$$p(z_i = k|z_1,\ldots,z_{i-1};\alpha) = \begin{cases} \dfrac{n_k}{\alpha+i-1} & \text{if } k \le K_{i-1} \text{ (existing cluster)}\\[2mm] \dfrac{\alpha}{\alpha+i-1} & \text{if } k = K_{i-1}+1 \text{ (new cluster)} \end{cases} \qquad (25)$$

where $n_k$ is the number of indicator random variables taking the value $k$ and $k = K_{i-1}+1$ denotes a previously unseen (new) cluster. From this distribution, one can therefore assign new data to possibly previously unseen (new) clusters as the data are observed, after starting with one cluster. The distribution on partitions induced by the sequence of conditional distributions in Eq. (25) is commonly referred to as the Chinese Restaurant Process (CRP). It can be interpreted as follows. Suppose there is a restaurant with an infinite number of tables, and customers enter and sit at tables. We assume that customers are social, so that the $i$th customer sits at table $k$ with probability proportional to the number $n_k$ of already seated customers ($k$ being a previously occupied table), and may choose a new table ($k = K_{i-1}+1$, a new table to be occupied) with probability proportional to a small positive real number $\alpha$, which represents the CRP concentration parameter.
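The sketch below draws a partition from this CRP prior by iterating the predictive rule (25); it is illustrative only and not part of the inference procedure.

```python
import numpy as np

def sample_crp(n, alpha, seed=None):
    """Draw a partition of n items from a CRP with concentration alpha, cf. Eq. (25)."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(n, dtype=int)
    counts = [1]                                   # the first customer opens the first table
    for i in range(1, n):
        # Existing table k chosen w.p. n_k / (i + alpha); new table w.p. alpha / (i + alpha)
        probs = np.array(counts + [alpha]) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                       # previously unseen (new) cluster
        else:
            counts[k] += 1
        labels[i] = k
    return labels

labels = sample_crp(100, alpha=1.0)
```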

In clustering with the CRP, customers correspond to data points and tables correspond to clusters. In the CRP mixture, the CRP prior is completed with a likelihood with parameters $\theta^*_k$ associated with each table (cluster) (i.e., a multivariate Gaussian likelihood with mean vector $\mu_k$ and covariance matrix $\Sigma_k$ in the GMM case), and a prior distribution $G_0$ for the parameters. For example, in the GMM case, one can use a conjugate multivariate normal Inverse-Wishart prior distribution for the mean vectors and the covariance matrices. This corresponds to the $i$th customer sitting at table $z_i$ and choosing a dish (the parameter $\theta^*_{z_i}$) from the prior of that table (cluster). The CRP mixture can be summarized by the following generative process:

$$z_i|z_1,\ldots,z_{i-1} \sim \mathrm{CRP}(z_1,\ldots,z_{i-1};\alpha) \qquad (26)$$
$$\theta^*_k \sim G_0 \qquad (27)$$
$$x_i|\theta^*_{z_i} \sim f(x_i|\theta^*_{z_i}) \qquad (28)$$

where the CRP distribution is given by Eq. (24), $G_0$ is a base measure (the prior distribution) and $f(x_i|\theta^*_{z_i})$ is a cluster-specific density. In the DPM and CRP mixtures with multivariate Gaussian components, the parameters $\theta^*_k$ of each cluster density are composed of a mean vector and a covariance matrix. In that case, a common base measure $G_0$ is a multivariate normal Inverse-Wishart conjugate prior.

We note that in the proposed DP parsimonious mixture or, equivalently, CRP parsimonious mixture, the cluster covariance matrices are parametrized in terms of an eigenvalue decomposition to provide more flexible clusters with possibly different volumes, shapes and orientations. In terms of the CRP interpretation, this can be seen as a variability of dishes for each table (cluster). We use the eigenvalue decomposition described in Section 2.3, which until now has been considered only in the case of parametric finite mixture model-based clustering (e.g., see (5; 3)) and Bayesian parametric finite mixture model-based clustering (e.g., see (15; 16; 9; 10)). We investigate eight parsimonious models, covering the three families of mixture models: the general, the diagonal and the spherical family. The parsimonious models therefore go from the simplest spherical one to the most general full model. Table 1 summarizes the considered parsimonious Gaussian mixture models, the corresponding prior distribution for each model, and the corresponding number of free parameters for a mixture model with $K$ components for data of dimension $d$.

Model Type Prior Applied to # free parameters
Spherical
Spherical
Diagonal diag. elements of
Diagonal diag. elements of
General
General and and
General diag. elements of
General diag. elements
General
Table 1: Considered parsimonious models via eigenvalue decomposition, the associated prior for the covariance structure and the corresponding number of free parameters, where the covariance priors used are inverse-Gamma, Gamma and Wishart-type distributions, $K$ being the number of mixture components and $d$ the number of variables for each individual.

We used conjugate priors, that is, a Dirichlet distribution for the mixing proportions (14; 35), a multivariate Normal for the mean vector, and an Inverse-Wishart or an Inverse-Gamma prior, depending on the parsimonious model, for the covariance matrix (10; 15).

3.3 Bayesian learning via Gibbs sampling

Given $n$ observations $(x_1,\ldots,x_n)$ modeled by the proposed Dirichlet process parsimonious mixture (DPPM), the aim is to infer the number $K$ of latent clusters underlying the observed data, their parameters $(\theta^*_1,\ldots,\theta^*_K)$ and the latent cluster labels $(z_1,\ldots,z_n)$. We developed an MCMC Gibbs sampling technique, as in (28; 29; 32), to learn the proposed Bayesian nonparametric parsimonious mixture models.

The Gibbs sampler for mixtures proceeds iteratively as follows. Given initial mixture parameters and a prior over the missing labels (here a conjugate Dirichlet prior), the Gibbs sampler, instead of estimating the missing labels, simulates them from their posterior distribution at each iteration, which is in this case a Multinomial distribution whose parameters are the posterior class probabilities. Then, given the completed data and the prior distribution over the mixture parameters, the Gibbs sampler generates the mixture parameters from the corresponding posterior distribution, which is in this case a multivariate Normal Inverse-Wishart or a Normal Inverse-Gamma distribution, depending on the parsimonious model. This Bayesian sampling procedure produces an ergodic Markov chain of samples whose stationary distribution is the posterior. Therefore, after initial burn-in Gibbs samples, the retained variables can be considered to be approximately distributed according to the posterior distribution. The Gibbs sampler consists in sampling the couple $(z,\theta)$ from its posterior distribution. The posterior distribution for the parameters $\theta^*_k$ of cluster $k$, given all the other variables, is given by

$$p(\theta^*_k|z, X; \omega) \propto p(X_k|\theta^*_k)\, p(\theta^*_k|\omega) \qquad (29)$$

where $X_k$ denotes the observations currently assigned to cluster $k$ and $p(\theta^*_k|\omega)$ is the prior distribution for $\theta^*_k$, that is, the base measure $G_0$, with $\omega$ being the hyperparameters of the model. The cluster labels $z_i$ are similarly sampled from their posterior distribution, which is given by

$$p(z_i = k|z_{-i}, x_i, \theta; \alpha) \propto p(z_i = k|z_{-i};\alpha)\, f(x_i|\theta^*_k) \qquad (30)$$

where $z_{-i}$ denotes the cluster labels of all the observations except $x_i$, and $p(z_i = k|z_{-i};\alpha)$ is the prior predictive distribution, which corresponds to the CRP distribution computed as in Equation (25). The prior distribution and the resulting posterior distribution for each of the considered models are close to those in (15) and are provided in detail in the supplementary material.
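For illustration, a hedged sketch of the label-sampling step (30) for a CRP Gaussian mixture is given below; the prior predictive of a new cluster is left as a user-supplied callable, and a full implementation would also prune emptied clusters and use the model-specific parsimonious priors of Table 1.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_label(i, X, z, params, alpha, new_cluster_marginal, seed=None):
    """Sample z_i in a CRP Gaussian mixture Gibbs sweep, cf. Eq. (30) (sketch).

    params: list of (mu_k, Sigma_k) for the currently occupied clusters.
    new_cluster_marginal(x): approximate marginal likelihood of x under a new cluster.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z_minus = np.delete(z, i)
    weights = []
    for k, (mu_k, Sigma_k) in enumerate(params):
        n_k = np.sum(z_minus == k)
        # CRP prior term (Eq. 25) times the Gaussian likelihood of x_i under cluster k
        weights.append(n_k / (n - 1 + alpha) * multivariate_normal.pdf(X[i], mu_k, Sigma_k))
    # New-cluster term: alpha / (n - 1 + alpha) times the prior predictive of x_i
    weights.append(alpha / (n - 1 + alpha) * new_cluster_marginal(X[i]))
    weights = np.asarray(weights)
    return rng.choice(len(weights), p=weights / weights.sum())
```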

3.3.1 Sampling the hyperparameter of the DPPM

The number of mixture components in the models depends on the concentration hyperparameter $\alpha$ of the Dirichlet Process (26). We therefore choose to sample it rather than fixing it to an arbitrary value. We follow the method introduced by (12), which consists in sampling $\alpha$ by assuming a Gamma prior $\alpha \sim \mathcal{G}(a, b)$ with shape hyperparameter $a$ and rate hyperparameter $b$. Then, an auxiliary variable $\eta$ is introduced and sampled conditionally on $\alpha$ and the number of clusters $K$, according to a Beta distribution $\eta|\alpha,K \sim \mathcal{B}(\alpha+1, n)$. The resulting posterior distribution for the hyperparameter $\alpha$ is given by:

$$p(\alpha|\eta,K) = \varpi_\eta\, \mathcal{G}(a+K,\, b-\log\eta) + (1-\varpi_\eta)\, \mathcal{G}(a+K-1,\, b-\log\eta) \qquad (31)$$

where the weights satisfy $\varpi_\eta/(1-\varpi_\eta) = (a+K-1)/\big(n(b-\log\eta)\big)$. A sketch of this update is given below; the full Gibbs sampler is then summarized in Algorithm 1.
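A minimal sketch of this concentration-parameter update follows, assuming the Escobar-West-style shape/rate parameterization of the Gamma prior used in Eq. (31).

```python
import numpy as np

def sample_alpha(alpha, K, n, a, b, seed=None):
    """One update of the DP concentration alpha from the mixture-of-Gammas posterior (31).

    Assumes a Gamma(a, b) prior on alpha with shape a and rate b; K is the current
    number of clusters and n the number of observations.
    """
    rng = np.random.default_rng(seed)
    eta = rng.beta(alpha + 1.0, n)                     # auxiliary variable eta | alpha, K
    rate = b - np.log(eta)
    odds = (a + K - 1.0) / (n * rate)                  # mixing odds between the two Gammas
    pi_eta = odds / (1.0 + odds)
    if rng.random() < pi_eta:
        return rng.gamma(a + K, 1.0 / rate)            # numpy's gamma takes a scale = 1/rate
    return rng.gamma(a + K - 1.0, 1.0 / rate)
```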

Inputs: Data set $(x_1,\ldots,x_n)$ and the number of Gibbs samples

1:  Initialize the model hyperparameters.
2:  Start with one cluster ($K = 1$)
3:  for each Gibbs sample do
4:     for $i = 1,\ldots,n$ do
5:         for $k = 1,\ldots,K$ do
6:            if cluster $k$ becomes empty once $x_i$ is removed then
7:               Decrease $K$; relabel the remaining clusters accordingly
8:            end if
9:         end for
10:         Sample a cluster label $z_i$ from the posterior (30)
11:         if $z_i$ corresponds to a new cluster then
12:            Increase $K$ (we get a new cluster) and sample a new cluster parameter from the prior distribution as in Table 1
13:         end if
14:     end for
15:     for $k = 1,\ldots,K$ do
16:         Sample the parameters $\theta^*_k$ from the posterior distribution (29)
17:     end for
18:     Sample the hyperparameter $\alpha$ from the posterior (31)
19:     Store the current values of $K$, the labels and the cluster parameters
20:  end for

Outputs: the posterior samples of the cluster labels, the cluster parameters and the number of clusters

Algorithm 1 Gibbs sampling for the proposed IPGMM

The retained solution is the one corresponding to the posterior mode of the number of mixture components, that is, the number of components that appears most frequently during the sampling.

3.4 Bayesian model selection and comparison via Bayes factors

This section presents the strategy used for model selection and comparison, that is, the choice of the number of mixture components (clusters) for a given model, and the selection of the best model among the different parsimonious models. We use Bayes factors (23), which provide a general way to compare models in (Bayesian) statistical modeling and have been widely studied in the case of mixture models (23; 15; 44; 45; 46). Suppose we have two candidate models $M_1$ and $M_2$. If we assume that the two models have the same prior probability $p(M_1)=p(M_2)$, the Bayes factor is given by

$$BF_{12} = \frac{p(X|M_1)}{p(X|M_2)}, \qquad (32)$$

which corresponds to the ratio between the marginal likelihoods of the two models $M_1$ and $M_2$. It is a summary of the evidence for model $M_1$ against model $M_2$ given the data $X$. The marginal likelihood of model $M_m$, also called the integrated likelihood, is given by

$$p(X|M_m) = \int p(X|\theta_m, M_m)\, p(\theta_m|M_m)\, d\theta_m \qquad (33)$$

where $p(X|\theta_m, M_m)$ is the likelihood of model $M_m$ with parameters $\theta_m$ and $p(\theta_m|M_m)$ is the prior density of the mixture parameters for model $M_m$. As the marginal likelihood (33) is difficult to compute analytically, several approximations have been proposed. One of the most used is the Laplace-Metropolis approximation (47), given by

$$\hat{p}(X|M_m) = (2\pi)^{\nu_m/2}\, |\hat{H}_m|^{1/2}\, p(X|\hat{\theta}_m, M_m)\, p(\hat{\theta}_m|M_m) \qquad (34)$$

where $\hat{\theta}_m$ is the posterior estimate (posterior mode) of $\theta_m$ for model $M_m$, $\nu_m$ is the number of free parameters of the mixture model as given in Table 1, and $\hat{H}_m$ is minus the inverse Hessian of the function $\log\big(p(X|\theta_m, M_m)\, p(\theta_m|M_m)\big)$ evaluated at the posterior mode $\hat{\theta}_m$. The matrix $\hat{H}_m$ is asymptotically equal to the posterior covariance matrix (47) and is computed as the sample covariance matrix of the simulated posterior sample. We note that, in the proposed DPPM models, as the number of components is itself a parameter and changes during the sampling, which leads to parameter vectors of different dimensions, we compute the Hessian matrix in Eq. (34) from the posterior samples corresponding to the posterior mode of the number of components. Once the Bayes factors are estimated, they can be interpreted as described in Table 2, as suggested by (48); see also (23). A small computational sketch of Eq. (34) is given after the table.

2 log BF    Evidence for model $M_1$
< 0         Negative ($M_2$ is selected)
0 - 2       Not bad
2 - 6       Substantial
6 - 10      Strong
> 10        Decisive
Table 2: Model comparison and selection using Bayes factors.
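Returning to Eq. (34), the sketch below computes its logarithm from posterior draws; the log-likelihood and log-prior evaluated at the posterior mode are assumed to be available from the sampler.

```python
import numpy as np

def laplace_metropolis_log_evidence(log_lik_mode, log_prior_mode, posterior_samples):
    """Laplace-Metropolis approximation of the log marginal likelihood, cf. Eq. (34).

    posterior_samples: array of shape (S, nu) of posterior draws of the parameter vector
    (taken at the posterior mode of the number of components, as discussed above);
    log_lik_mode, log_prior_mode: log-likelihood and log-prior at the posterior mode.
    """
    _, nu = posterior_samples.shape
    # Posterior covariance estimated by the sample covariance of the draws
    H = np.atleast_2d(np.cov(posterior_samples, rowvar=False))
    _, logdet = np.linalg.slogdet(H)
    return 0.5 * nu * np.log(2.0 * np.pi) + 0.5 * logdet + log_lik_mode + log_prior_mode
```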

4 Experiments

We performed experiments on both simulated and real data in order to evaluate our proposed DPPM models. We assess their flexibility in terms of modeling, their use for clustering and their ability to infer the number of clusters from the data. We show how the proposed DPPM approach is able to automatically and simultaneously select the best model with the optimal number of clusters by using Bayes factors, which are also used to evaluate the results. We also perform comparisons with the finite model-based clustering approach (as in (15; 10)), abbreviated as the PGMM approach. We further use the Rand index to evaluate and compare the provided partitions, and the misclassification error rate when the number of estimated components equals the actual one.

For the simulations, we consider several situations of simulated data, from different models and with different levels of cluster separation, in order to assess the efficiency of the proposed approach in retrieving the actual partition with the actual number of clusters. We also assess the stability of our proposed DPPM models with respect to the choice of the hyperparameter values, by considering several situations and varying them. Then, we perform experiments on several real data sets and provide numerical results in terms of comparisons of the Bayes factors (via the log marginal likelihood values), as well as the Rand index and the misclassification error rate for data sets with a known actual partition. In the experiments, for each of the compared approaches and for each model, each Gibbs sampler is run ten times with different initializations. Each Gibbs run generates 2000 samples, of which 100 burn-in samples are removed. The solution corresponding to the highest Bayes factor over those ten runs is then selected.

4.1 Experiments on simulated data

4.1.1 Varying the clusters shapes, orientations, volumes and separation

In this experiment, we apply the proposed models to data simulated according to different models and with different levels of mixture separation, going from poorly separated to very well separated mixtures. To simulate the data, we first consider an experimental protocol close to the one used by (5), where the authors considered parsimonious mixture estimation within a MLE framework. This therefore allows us to see how the proposed Bayesian nonparametric DPPM performs compared to the standard parametric non-Bayesian approach. We note, however, that in (5) the number of components was known a priori and the problem of estimating the number of classes was not considered. We have performed extensive experiments involving all the models and many Monte Carlo simulations for several data structure situations. Given the variety of models, data structures, levels of separation, etc., it is not possible to display all the results in the paper. We therefore proceed in the same way as in the reference paper (5) by selecting, for the experiments on simulated data, the results of six models of different structures. The data are generated from a two-component Gaussian mixture. The six different structures of the mixture considered to generate the data are two spherical models, two diagonal models and two general models. Table 3 shows the considered model structures and the respective model parameter values used to generate the data sets.

Model Parameters values
Table 3: Considered two-component Gaussian mixture with different structures.

Let us recall that the variation in volume is related to $\lambda_k$, the variation in shape to $A_k$ and the variation in orientation to $D_k$. Furthermore, for each type of model structure, we consider three different levels of mixture separation, that is, poorly separated, well separated and very well separated mixtures. This is achieved by varying a separation distance between the two mixture components, for which three increasing values are considered. As a result, we obtain 18 different data structures with poorly, well and very well separated mixture components. As it is difficult to show the figures for all the situations and the corresponding results, Figure 1 shows, for three models with equal volume across the mixture components, data sets with varying levels of mixture separation. Respectively, Figure 2 shows, for the models with varying volume across the mixture components, data sets with varying levels of mixture separation.

Figure 1: Examples of simulated data with the same volume across the mixture components: spherical model with poor separation (left), diagonal model with good separation (middle), and general model with very good separation (right).
Figure 2: Examples of simulated data with the volume changing across the mixture components: spherical model with poor separation (left), diagonal model with good separation (middle), and general model with very good separation (right).

We compare the proposed DPPM to the parametric PGMM approach in model-based clustering (15), for which the number of mixture components varies over a range of candidate values and the optimal number of mixture components is selected by using the Bayes factor (via the log marginal likelihoods). For these data sets, the hyperparameters were chosen as follows: the prior mean was set equal to the mean of the data, the shrinkage and the degrees of freedom were fixed, the scale matrix was set equal to the covariance of the data, and the hyperparameter for the spherical models was set to the greatest eigenvalue of that covariance matrix.

4.1.2 Obtained results

Tables 4, 5 and 6 report the approximate log marginal likelihoods obtained by the PGMM and the proposed DPPM models for, respectively, the equal-volume spherical data structure with a poorly separated mixture, the equal-volume diagonal data structure with a well separated mixture, and the equal-volume general data structure with a very well separated mixture. Tables 7, 8 and 9 report the approximate log marginal likelihoods obtained by the PGMM and the proposed DPPM models for, respectively, the different-volume spherical data structure with a poorly separated mixture, the different-volume diagonal data structure with a well separated mixture, and the different-volume general data structure with a very well separated mixture.

DPPM PGMM
Model
2 -604.54 -633.88 -631.59 -635.07 -587.41 -595.63
2 -589.59 -592.80 -589.88 -592.87 -593.26 -602.98
2 -589.74 -591.67 -590.10 -593.04 -598.67 -599.75
2 -591.65 -594.37 -592.46 -595.88 -607.01 -611.36
2 -590.65 -592.20 -589.65 -596.29 -598.63 -607.74
2 -591.77 -594.33 -594.89 -597.96 -594.49 -601.84
Table 4: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with model structure and poorly separated mixture ().
DPPM PGMM
Model
2 -730.31 -771.39 -702.38 -703.90 -708.71 -840.49
2 -702.89 -730.26 -702.30 -704.68 -708.43 -713.58
2 -679.76 -704.40 -680.03 -683.13 -686.19 -691.93
2 -685.33 -707.26 -688.69 -696.46 -703.68 -712.93
2 -681.84 -693.44 -682.63 -688.39 -694.25 -717.26
2 -693.70 -695.81 -684.63 -688.17 -694.02 -695.75
Table 5: Log marginal likelihood values obtained by the proposed DPPM and the PGMM for the generated data with model structure and well separated mixture ().
DPPM PGMM
Model
2 -762.16 -850.66 -747.29 -746.09 -744.63 -824.06
2 -748.97 -809.46 -748.17 -751.08 -756.59 -766.26
2 -746.05 -778.42 -746.32 -749.59 -753.64 -758.92
2 -751.17 -781.31 -752.66 -761.02 -772.44 -780.34
2 -701.94 -746.11 -698.54 -702.79 -707.83 -716.43
2 -702.79 -748.36 -703.35 -708.77 -715.10 -722.25
Table 6: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with model structure and very well separated mixture ().
DPPM PGMM
Model
3 -843.50 -869.52 -825.68 -890.26 -906.44 -1316.40
2 -805.24 -828.39 -805.21 -808.43 -811.43 -822.99
2 -820.33 -823.55 -821.22 -825.58 -828.86 -838.82
2 -808.32 -826.34 -808.46 -816.65 -824.20 -836.85
2 -824.00 -823.72 -821.92 -830.44 -841.22 -852.78
2 -821.29 -826.05 -803.96 -813.61 -819.66 -821.75
Table 7: Log marginal likelihood values and estimated number of clusters for the generated data with model structure and poorly separated mixture ().
DPPM PGMM
Model
3 -927.01 -986.12 -938.65 -956.05 -1141.00 -1064.90
3 -912.27 -944.87 -925.75 -911.31 -914.33 -918.99
3 -899.00 -918.47 -906.59 -911.13 -917.18 -926.69
2 -883.05 -921.44 -883.22 -897.99 -909.26 -928.90
2 -903.43 -918.19 -902.23 -906.40 -914.35 -924.12
2 -894.05 -920.65 -876.62 -886.86 -904.45 -919.45
Table 8: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with model structure and well separated mixture ().
DPPM PGMM
Model
2 -984.33 -1077.20 -1021.60 -1012.30 -1021.00 -987.06
3 -963.45 -1035.80 -972.45 -961.91 -967.64 -970.93
2 -980.07 -1012.80 -980.92 -986.39 -992.05 -999.14
2 -988.75 -1015.90 -991.21 -1007.00 -1023.70 -1041.40
3 -931.42 -984.93 -939.63 -944.89 -952.35 -963.04
2 -921.90 -987.39 -921.99 -930.61 -946.18 -956.35
Table 9: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with model structure and very well separated mixture ().

From these results, we can see that the proposed DPPM, in all situations (except for the first situation in Table 4), retrieves the actual model with the actual number of clusters. We can also see that, except for two situations, the selected DPPM model has the highest log marginal likelihood value compared to the PGMM. We also observe that the solutions provided by the proposed DPPM are in some cases more parsimonious than those provided by the PGMM, and in the other cases the same as those provided by the PGMM. For example, in Table 4, which corresponds to data from a poorly separated mixture, we can see that the proposed DPPM selects a spherical model, which is more parsimonious than the general model selected by the PGMM, with a better misclassification error (see Table 10). The same can be observed in Table 8, where the proposed DPPM selects the actual diagonal model whereas the PGMM selects a general model, even though the clusters are well separated.

Also, in terms of misclassification error, as shown in Table 10, the proposed DPPM models, compared to the PGMM ones, provide partitions with a lower misclassification error, for situations with poorly, well or very well separated clusters, and for clusters with equal and different volumes (except for one situation).

PGMM
DPPM
Table 10: Misclassification error rates obtained by the proposed DPPM and the PGMM approach. From left to right, the situations respectively shown in Table 4, 5, 6, 7, 8, 9

On the other hand, for the DPPM models, from the log marginal likelihoods shown in Tables 4 to 9, we can see that the evidence for the selected model, compared to the majority of the other alternatives, is, according to Table 2, in general decisive. Indeed, it can easily be seen that the Bayes factor value between the selected model and the other models is more than 10, which corresponds to decisive evidence for the selected model. Also, if we consider the evidence of the selected model against the most competitive one, one can see from Table 11 and Table 12 that, for the situation with very poor mixture separation and clusters having the same volume, the evidence is not bad (0.3). However, for all the other situations, the optimal model is selected with evidence ranging from almost substantial (a value of 1.7) to strong and decisive, especially for the models with different cluster volumes. We can also conclude that the models with different cluster volumes may work better in practice, as highlighted by (5).

Selected model vs. closest competitor
0.30 4.16 1.70
Table 11: Bayes factor values obtained by the proposed DPPM comparing the selected model and its closest competitor. From left to right, the situations respectively shown in Tables 4, 5 and 6.
Selected model vs. closest competitor
6.16 22 19.04
Table 12: Bayes factor values obtained by the proposed DPPM comparing the selected model and its closest competitor. From left to right, the situations respectively shown in Tables 7, 8 and 9.

Finally, Figure (3) shows the best estimated partitions for the data structures with equal volume across the mixture components shown in Fig. 1 and the posterior distribution over the number of clusters.

Figure 3: Partitions obtained by the proposed DPPM for the data sets in Fig. 1.

One can see that, for the case of clusters with equal volume, the diagonal data structure with a well separated mixture and the general data structure with a very well separated mixture lead to a correct estimate of the number of clusters with the actual model. However, for the equal-volume spherical data structure, the estimated model is a different, but still spherical, model. Figure 4 shows the best estimated partitions for the data structures with different volumes across the mixture components shown in Fig. 2 and the posterior distribution over the number of clusters.

Figure 4: Partitions obtained by the proposed DPPM for the data sets in Fig. 2.

One can see that, for all the data structures with different volumes across the mixture components (spherical, diagonal and general), the proposed DPPM approach succeeds in estimating the correct number of clusters with the actual cluster structure.

4.1.3 Stability with respect to the variation of the hyperparameters values

In order to illustrate the effect of the choice of the hyperparameter values of the mixture on the estimations, we considered two-class situations identical to those used in the parametric parsimonious mixture approach proposed in (15). The data set consists of a sample of observations from a two-component Gaussian mixture with specified mixing proportions and mean vectors, and two spherical covariances with different volumes. Figure 5 shows a simulated data set from this experiment with the corresponding actual partition and density ellipses.

Figure 5: A two-class data set simulated according to , and the actual partition.

In order to assess the stability of the models with respect to the values of the hyperparameters, we consider four situations with different hyperparameter values. The hyperparameters related to the number of degrees of freedom and to the prior mean are assumed to be the same for the four situations, the latter being set to the empirical mean vector of the data. We vary the two hyperparameters that control the prior over the mean and the prior over the covariance. The four considered situations are shown in Table 13.

Sit.
4
Table 13: Four different situations of hyperparameter values.

We consider and compare four models covering the spherical, diagonal and general families. Table 14 shows the obtained log marginal likelihood values for the four models for each of the hyperparameter situations. One can see that, for all the situations, the selected model is the one that corresponds to the actual model and has the correct number of clusters (two clusters).

Model
Sit.
1 2 -919.3150 2 -865.9205 3 -898.7853 3 -885.9710
2 3 -898.6422 2 -860.1917 2 -890.6766 2 -885.5094
3 2 -927.8240 2 -884.6627 2 -906.7430 2 -901.0774
4 2 -919.4910 2 -861.0925 2