I Introduction
Statistical model selection is concerned with choosing a model that adequately explains the observations from a family of candidate models. Many methods have been proposed in the literature, see for example [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25] and the review in [26]. Model selection problems arise in various applications, such as the estimation of the number of signal components [15, 23, 24, 25, 18, 20, 19]
, the selection of the number of nonzero regression parameters in regression analysis
[4, 22, 14, 5, 6, 11, 12, 21], and the estimation of the number of data clusters in unsupervised learning problems [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]. In this paper, our focus lies on the derivation of a Bayesian model selection criterion for cluster analysis.The estimation of the number of clusters, also called cluster enumeration, has been intensively researched for decades [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45] and a popular approach is to apply the Bayesian Information Criterion (BIC) [31, 37, 32, 33, 29, 38, 39, 40, 41, 44]. The BIC finds the large sample limit of the Bayes’ estimator which leads to the selection of a model that is a posteriori most probable. It is consistent if the true data generating model belongs to the family of candidate models under investigation. The BIC was originally derived by Schwarz in [8] assuming that (i) the observations are independent and identically distributed (iid), (ii) they arise from an exponential family of distributions, and (iii) the candidate models are linear in parameters. Ignoring these rather restrictive assumptions, the BIC has been used in a much larger scope of model selection problems. A justification of the widespread applicability of the BIC was provided in [16] by generalizing Schwarz’s derivation. In [16], the authors drop the first two assumptions made by Schwarz given that some regularity conditions are satisfied. The BIC is a generic criterion in the sense that it does not incorporate information regarding the specific model selection problem at hand. As a result, it penalizes two structurally different models the same way if they have the same number of unknown parameters.
The works in [46, 15] have shown that model selection rules that penalize for model complexity have to be examined carefully before they are applied to specific model selection problems. Nevertheless, despite the widespread use of the BIC for cluster enumeration [31, 37, 32, 33, 29, 38, 39, 40, 41, 44], very little effort has been made to check the appropriateness of the original BIC formulation [16] for cluster analysis. One noticeable work towards this direction was made in [38] by providing a more accurate approximation to the marginal likelihood for small sample sizes. This derivation was made specifically for mixture models assuming that they are well separated. The resulting expression contains the original BIC term plus some additional terms that are based on the mixing probability and the Fisher Information Matrix (FIM) of each partition. The method proposed in [38]
requires the calculation of the FIM for each cluster in each candidate model, which is computationally very expensive and impractical in real world applications with high dimensional data. This greatly limits the applicability of the cluster enumeration method proposed in
[38]. Other than the above mentioned work, to the best of our knowledge, no one has thoroughly investigated the derivation of the BIC for cluster analysis using large sample approximations.We derive a new BIC by formulating the problem of estimating the number of partitions (clusters) in an observed data set as maximization of the posterior probability of the candidate models. Under some mild assumptions, we provide a general expression for the BIC, , which is applicable to a broad class of data distributions. This serves as a starting point when deriving the BIC for specific data distributions in cluster analysis. Along this line, we simplify by imposing an assumption on the data distribution. A closedform expression, , is derived assuming that the data set is distributed as a multivariate Gaussian. The derived model selection criterion, , is based on large sample approximations and it does not require the calculation of the FIM. This renders our criterion computationally cheap and practical compared to the criterion presented in [38].
Standard clustering methods, such as the ExpectationMaximization (EM) and Kmeans algorithm, can be used to cluster data only when the number of clusters is supplied by the user. To mitigate this shortcoming, we propose a twostep cluster enumeration algorithm which provides a principled way of estimating the number of clusters by utilizing existing clustering algorithms. The proposed twostep algorithm uses a modelbased unsupervised learning algorithm to partition the observed data into the number of clusters provided by the candidate model prior to the calculation of
for that particular model. We use the EM algorithm, which is a modelbased unsupervised learning algorithm, because it is suitable for Gaussian mixture models and this complies with the Gaussianity assumption made by
. However, the model selection criterion that we propose is more general and can be used as a wrapper around any clustering algorithm, see [47] for a survey of clustering methods.The paper is organized as follows. Section II formulates the problem of estimating the number of clusters given data. The proposed generic Bayesian cluster enumeration criterion, , is introduced in Section III. Section IV presents the proposed Bayesian cluster enumeration algorithm for multivariate Gaussian data in detail. A brief description of the existing BICbased cluster enumeration methods is given in Section V. Section VI provides a comparison of the penalty terms of different cluster enumeration criteria. A detailed performance evaluation of the proposed criterion and comparisons to existing BICbased cluster enumeration criteria using simulated and real data sets are given in Section VII. Finally, concluding remarks are drawn and future directions are briefly discussed in Section VIII. A detailed proof is provided in Appendix A, whereas Appendix B
contains the vector and matrix differentiation rules that we used in the derivations.
Notation: Lower and uppercase boldface letters stand for column vectors and matrices, respectively; Calligraphic letters denote sets with the exception of which is used for the likelihood function; represents the set of real numbers; denotes the set of positive integers; Probability density and mass functions are denoted by and , respectively;
represents a Gaussian distributed random variable
with mean and covariance matrix ; stands for the estimator (or estimate) of the parameter ; denotes the natural logarithm; iid stands for independent and identically distributed; (A.) denotes an assumption, for example (A.1) stands for the first assumption; represents Landau’s term which tends to a constant as the data size goes to infinity; stands for an identity matrix; denotes anall zero matrix;
represents the cardinality of the set ; stands for vector or matrix transpose; denotes the determinant of the matrix ; represents the trace of a matrix; denotes the Kronecker product; refers to the staking of the columns of an arbitrary matrix into one long column vector.Ii Problem Formulation
Given a set of dimensional vectors , let be a partition of into clusters for . The subsets (clusters) are independent, mutually exclusive, and nonempty. Let be a family of candidate models that represent a partitioning of into subsets, where . The parameters of each model are denoted by which lies in a parameter space . Let
denote the probability density function (pdf) of the observation set
given the candidate model and its associated parameter matrix . Let be the discrete prior of the model over the set of candidate models and let denote a prior on the parameter vectors in given .According to Bayes’ theorem, the joint posterior density of
and given the observed data set is given by(1) 
where is the pdf of . Our objective is to choose the candidate model , where , which is most probable a posteriori assuming that

the true number of clusters in the observed data set satisfies the constraint .
Mathematically, this corresponds to solving
(2) 
where is the posterior probability of given the observations . can be written as
(3) 
where is the likelihood function. can also be determined via
(4) 
instead of Eq. (2) since is a monotonic function. Hence, taking the logarithm of Eq. (3) results in
(5) 
where is replaced by (a constant) since it is not a function of and thus has no effect on the maximization of over . Since the partitions (clusters) are independent, mutually exclusive, and nonempty, and can be written as
(6)  
(7) 
Substituting Eqs. (6) and (7) into Eq. (5) results in
(8) 
Maximizing over all candidate models involves the computation of the logarithm of a multidimensional integral. Unfortunately, the solution of the multidimensional integral does not possess a closed analytical form for most practical cases. This problem can be solved using either numerical integration or approximations that allow a closedform solution. In the context of model selection, closedform approximations are known to provide more insight into the problem than numerical integration [15]. Following this line of argument, we use Laplace’s method of integration [15, 46, 48] and provide an asymptotic approximation to the multidimensional integral in Eq. (8).
Iii Proposed Bayesian Cluster Enumeration Criterion
In this section, we derive a general BIC expression for cluster analysis, which we call . Under some mild assumptions, we provide a closedform expression that is applicable to a broad class of data distributions.
In order to provide a closedform analytic approximation to Eq. (8), we begin by approximating the multidimensional integral using Laplace’s method of integration. Laplace’s method of integration makes the following assumptions.

with has first and secondorder derivatives which are continuous over the parameter space .

with has a global maximum at , where is an interior point of .

with is continuously differentiable and its firstorder derivatives are bounded on .

The negative of the Hessian matrix of
(9) is positive definite, where is the cardinality of . That is, for and , where is the
th eigenvalue of
and is a small positive constant.
The first step in Laplace’s method of integration is to write the Taylor series expansion of and around . We begin by approximating by its secondorder Taylor series expansion around as follows:
(10) 
where . The first derivative of evaluated at vanishes because of assumption (A.3). With
(11) 
substituting Eq. (10) into Eq. (11) and approximating by its Taylor series expansion yields
(12) 
where HOT denotes higher order terms and is a Gaussian kernel with mean and covariance matrix . The second term in the first line of Eq. (12) vanishes because it simplifies to , where is a constant (see [48, p. 53] for more details). Consequently, Eq. (12) reduces to
(13) 
where stands for the determinant, given that . Using Eq. (13), we are thus able to provide an asymptotic approximation to the multidimensional integral in Eq. (8). Now, substituting Eq. (13) into Eq. (8), we arrive at
(14) 
where
(15) 
is the Fisher Information Matrix (FIM) of data from the th partition.
In the derivation of , so far, we have made no distributional assumption on the data set except that the loglikelihood function and the prior on the parameter vectors , for , should satisfy some mild conditions under each model . Hence, Eq. (14) is a general expression of the posterior probability of the model given for a general class of data distributions that satisfy assumptions (A.2)(A.5). The BIC is concerned with the computation of the posterior probability of candidate models and thus Eq. (14) can also be written as
(16) 
After calculating for each candidate model , the number of clusters in is estimated as
(17) 
However, calculating using Eq. (16) is a computationally expensive task as it requires the estimation of the FIM, , for each cluster in the candidate model . Our objective is to find an asymptotic approximation for in order to simplify the computation of . We solve this problem by imposing specific assumptions on the distribution of the data set . In the next section, we provide an asymptotic approximation for assuming that contains iid multivariate Gaussian data points.
Iv Proposed Bayesian Cluster Enumeration Algorithm for Multivariate Gaussian Data
We propose a twostep approach to estimate the number of partitions (clusters) in and provide an estimate of cluster parameters, such as cluster centroids and covariance matrices, in an unsupervised learning framework. The proposed approach consists of a modelbased clustering algorithm, which clusters the data set according to each candidate model , and a Bayesian cluster enumeration criterion that selects the model which is a posteriori most probable.
Iva Proposed Bayesian Cluster Enumeration Criterion for Multivariate Gaussian Data
Let denote the observed data set which can be partitioned into clusters . Each cluster contains data vectors that are realizations of iid Gaussian random variables , where and represent the centroid and the covariance matrix of the th cluster, respectively. Further, let denote a set of Gaussian candidate models and let there be a clustering algorithm that partitions into independent, mutually exclusive, and nonempty subsets (clusters) by providing parameter estimates for each candidate model , where and . Assume that (A.1)(A.7) are satisfied.
Theorem 1.
The posterior probability of given can be asymptotically approximated as
(18) 
where is the cardinality of the subset and it satisfies . The term sums the logarithms of the number of data vectors in each cluster .
Proof.
Once the Bayesian Information Criterion, , is computed for each candidate model , the number of partitions (clusters) in is estimated as
(19) 
Remark.
The first step in calculating for each model is the partitioning of the data set into clusters and the estimation of the associated cluster parameters using an unsupervised learning algorithm. Since the approximations in are based on maximizing the likelihood function of Gaussian distributed random variables, we use a clustering algorithm that is based on the maximum likelihood principle. Accordingly, a natural choice is the EM algorithm for Gaussian mixture models.
IvB The ExpectationMaximization (EM) Algorithm for Gaussian Mixture Models
The EM algorithm finds maximum likelihood solutions for models with latent variables [49]. In our case, the latent variables are the cluster memberships of the data vectors in , given that the component Gaussian mixture distribution of a data vector can be written as
(20) 
where represents the variate Gaussian pdf and is the mixing coefficient of the th cluster. The goal of the EM algorithm is to maximize the loglikelihood function of the data set with respect to the parameters of interest as follows:
(21) 
where and . Maximizing Eq. (21) with respect to the elements of results in coupled equations. The EM algorithm solves these coupled equations using a twostep iterative procedure. The first step (E step) evaluates , which is an estimate of the probability that data vector belongs to the th cluster at the th iteration, for and . is calculated as follows:
(22) 
where and represent the centroid and covariance matrix estimates, respectively, of the th cluster at the previous iteration (). The second step (M step) reestimates the cluster parameters using the current values of as follows:
(23)  
(24)  
(25) 
The E and M steps are performed iteratively until either the cluster parameter estimates or the loglikelihood estimate converges.
A summary of the estimation of the number of clusters in an observed data set using the proposed twostep approach is provided in Algorithm 1. Note that the computational complexity of is only , which can easily be ignored during the runtime analysis of the proposed approach. Hence, since the EM algorithm is run for all candidate models in , the computational complexity of the proposed twostep approach is , where is a fixed stopping threshold of the EM algorithm.
V Existing BICBased Cluster Enumeration Methods
As discussed in Section I, existing cluster enumeration algorithms that are based on the original BIC use the criterion as it is known from parameter estimation tasks without questioning its validity on cluster analysis. Nevertheless, since these criteria have been widely used, we briefly review them to provide a comparison to the proposed criterion , which is given by Eq. (18).
The original BIC, as derived in [16], evaluated at a candidate model is written as
(26) 
where denotes the likelihood function and . In Eq. (26), denotes the datafidelity term, while is the penalty term. Under the assumption that the observed data is Gaussian distributed, the datafidelity terms of and the ones of our proposed criterion, , are exactly the same. The only deference between the two is the penalty term. Hence, we use a similar procedure as in Algorithm 1 to implement the original BIC as a wrapper around the EM algorithm.
Moreover, the original BIC is commonly used as a wrapper around Kmeans by assuming that the data points that belong to each cluster are iid as Gaussian and all clusters are spherical with an identical variance, i.e.
for , where is the common variance of the clusters in [29, 32, 33]. Under these assumptions, the original BIC is given by(27) 
where denotes the original BIC of the candidate model derived under the assumptions stated above and is the number of estimated parameters in . Ignoring the model independent terms, can be written as
(28) 
where
(29) 
is the maximum likelihood estimator of the common variance.
In our experiments, we implement as a wrapper around the Kmeans++ algorithm [51]. The implementation of the proposed BIC as a wrapper around the Kmeans++ algorithm is given by Eq. (37).
Vi Comparison of the Penalty Terms of Different Bayesian Cluster Enumeration Criteria
Comparing Eqs. (18), (26), and (27), we notice that they have a common form [46, 11], that is,
(30) 
but with different penalty terms, where
(31)  
(32)  
(33) 
Remark.
The penalty terms of and depend linearly on , while the penalty term of our proposed criterion, , depends on in a nonlinear manner. Comparing the penalty terms in Eqs. (31)(33), has the weakest penalty term. In the asymptotic regime, the penalty terms of and coincide. But, in the finite sample regime, for values of , the penalty term of is stronger than the penalty term of . Note that the penalty term of our proposed criterion, , depends on the number of data vectors in each cluster, , of each candidate model , while the penalty term of the original BIC depends only on the total number of data vectors in the data set. Hence, the penalty term of our proposed criterion might exhibit sensitivities to the initialization of cluster parameters and the associated number of data vectors per cluster.
Vii Experimental Evaluation
In this section, we compare the cluster enumeration performance of our proposed criterion, given by Eq. (18), with the cluster enumeration methods discussed in Section V, namely and given by Eqs. (26) and (28), respectively, using synthetic and real data sets. We first describe the performance measures used for comparing the different cluster enumeration criteria. Then, the numerical experiments performed on synthetic data sets and the results obtained from real data sets are discussed in detail. For all simulations, we assume that a family of candidate models is given with and , where is the true number of clusters in the data set . All simulation results are an average of Monte Carlo experiments unless stated otherwise. The compared cluster enumeration criteria are based on the same initial cluster parameters in each Monte Carlo experiment, which allows for a fair comparison.
The MATLAB^{©} code that implements the proposed twostep algorithm and the Bayesian cluster enumeration methods discussed in Section V is available at:
https://github.com/FreTekle/BayesianClusterEnumeration
Viia Performance Measures
The empirical probability of detection , the empirical probability of underestimation , the empirical probability of selection, and the Mean Absolute Error (MAE) are used as performance measures. The empirical probability of detection is defined as the probability with which the correct number of clusters is selected and it is calculated as
(34) 
where MC is the total number of Monte Carlo experiments, is the estimated number of clusters in the th Monte Carlo experiment, and is an indicator function which is defined as
(35) 
The empirical probability of underestimation is the probability that . The empirical probability of selection is defined as the probability with which the number of clusters specified by each candidate model in is selected. The last performance measure, which is the Mean Absolute Error (MAE), is computed as
(36) 
ViiB Numerical Experiments
ViiB1 Simulation Setup
We consider two synthetic data sets, namely Data1 and Data2, in our simulations. Data1, shown in Fig. 0(a), contains realizations of the random variables , where , with cluster centroids and covariance matrices
The first cluster is linearly separable from the others, while the remaining clusters are overlapping. The number of data vectors per cluster is specified as , , and , where is a constant.
The second data set, Data2, contains realizations of the random variables , where , with cluster centroids , , , , , , , , , , and covariance matrices
where . As depicted in Fig. 0(b), Data2 contains eight identical and spherical clusters and two elliptical clusters. There exists an overlap between two clusters, while the rest of the clusters are well separated. All clusters in this data set have the same number of data vectors.
ViiB2 Simulation Results
Data1 is particularly challenging for cluster enumeration criteria because it has not only overlapping but also unbalanced clusters. Cluster unbalance refers to the fact that different clusters have a different number of data vectors, which might result in some clusters dominating the others. The impact of cluster overlap and unbalance on and MAE is displayed in Table I. This table shows and MAE as a function of , where is allowed to take values from the set . The cluster enumeration performance of is lower than the other methods because it is designed for spherical clusters with identical variance, while Data1 has one elliptical and two spherical clusters with different covariance matrices. Our proposed criterion performs best in terms of and MAE for all values of . As increases, which corresponds to an increase in the number of data vectors in the data set, the cluster enumeration performance of and greatly improves, while the performance of deteriorates because of the increase in overestimation. The total criterion (BIC) and penalty term of different Bayesian cluster enumeration criteria as a function of the number of clusters specified by the candidate models for is depicted in Fig. 2. The BIC plot in Fig. 1(a) is the result of one Monte Carlo run. It shows that and have a maximum at the true number of clusters , while overestimates the number of clusters to . As shown in Fig. 1(b), our proposed criterion, , has the second strongest penalty term. Note that, the penalty term of our proposed criterion shows a curvature at the true number of clusters, while the penalty terms of and are uninformative on their own.
MAE 

Comments
There are no comments yet.