Cluster analysis is an unsupervised learning task that finds the intrinsic structure in a set of unlabeled data by grouping similar objects into clusters. Cluster analysis plays a crucial role in a wide variety of fields of study, such as social sciences, biology, medical sciences, statistics, machine learning, pattern recognition, and computer vision [1, 2, 3, 4]. A major challenge in cluster analysis is that the number of clusters is usually unknown, yet it must be specified in order to cluster the data. The estimation of the number of clusters, also called cluster enumeration, has attracted the interest of researchers for decades and various methods have been proposed in the literature, see for example [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] and the reviews in [22, 23, 24, 25, 4]. However, to this day, no single best cluster enumeration method exists.
In real-world applications, the observed data is often subject to heavy tailed noise and outliers [3, 26, 27, 28, 29, 30] which obscure the true underlying structure of the data. Consequently, cluster enumeration becomes even more challenging when either the data is contaminated by a fraction of outliers or there exist deviations from the distributional assumptions. To this end, many robust cluster enumeration methods have been proposed in the literature, see [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 27, 42, 43, 44, 45] and the references therein. A popular approach in robust cluster analysis is to use the Bayesian information criterion (BIC), as derived by Schwarz, to estimate the number of data clusters after either removing outliers from the data [32, 33, 34, 31], modeling noise or outliers using an additional component in a mixture modeling framework [35, 36], or exploiting the idea that the presence of outliers causes the distribution of the data to be heavy tailed and, subsequently, modeling the data as a mixture of heavy tailed distributions [37, 38]. For example, modeling the data using a family of $t$ distributions [47, 48, 49, 50, 51, 52, 53] provides a principled way of dealing with outliers by giving them less weight in the objective function. The family of $t$ distributions is flexible as it contains the heavy tailed Cauchy distribution for $\nu = 1$ and the Gaussian distribution for $\nu \to \infty$ as special cases. Consequently, we model the clusters using a family of multivariate $t$ distributions and derive robust cluster enumeration criteria that account for outliers given that the degree of freedom parameter $\nu$ is sufficiently small.
In statistical model selection, it is known that the original BIC [46, 54] penalizes two structurally different models the same way if they have the same number of unknown parameters [55, 56]. Hence, careful examination of the original BIC is a necessity prior to its application in specific model selection problems. Following this line of argument, we have recently derived the BIC for cluster analysis by formulating cluster enumeration as maximization of the posterior probability of candidate models [20, 21]. In that work, we showed that the BIC derived specifically for cluster enumeration has a different penalty term compared to the original BIC. However, robustness was not considered there, since a family of multivariate Gaussian candidate models was used to derive the criterion, which we refer to as .
To the best of our knowledge, this is the first attempt made to derive a robust cluster enumeration criterion by formulating the cluster enumeration problem as maximization of the posterior probability of multivariate candidate models. Under some mild assumptions, we derive a robust Bayesian cluster enumeration criterion, . We show that has a different penalty term compared to the original BIC [46, 54], given that the candidate models in the original BIC are represented by a family of multivariate distributions. Interestingly, for both the data fidelity and the penalty terms depend on the assumed distribution for the data, while for the original BIC changes in the data distribution only affect the data fidelity term. Asymptotically, converges to . As a result, our derivations also provide a justification for the use of the original BIC with multivariate candidate models from a cluster analysis perspective. Further, we refine the derivation of by providing an exact expression for its penalty term. This results in a robust criterion, , which behaves better than in the finite sample regime and converges to in the asymptotic regime.
In general, BIC based cluster enumeration methods require a clustering algorithm that partitions the data according to the number of clusters specified by each candidate model and provides an estimate of cluster parameters. Hence, we apply the expectation maximization (EM) algorithm to partition the data prior to the calculation of an enumeration criterion, resulting in a two-step approach. The proposed algorithm provides a unified framework for the robust estimation of the number of clusters and cluster memberships.
The paper is organized as follows. Section II formulates the cluster enumeration problem and Section III introduces the proposed robust cluster enumeration criterion. Section IV presents the two-step cluster enumeration algorithm. A comparison of different Bayesian cluster enumeration criteria is given in Section V. A performance evaluation and comparison to existing methods using numerical and real data experiments is provided in Section VI. Finally, Section VII contains concluding remarks and highlights future research directions. Notably, a detailed proof is provided in Appendix B.
Lower- and upper-case boldface letters represent column vectors and matrices, respectively; calligraphic letters denote sets with the exception of , which represents the likelihood function; , , and denote the set of real numbers, the set of positive real numbers, and the set of positive integers, respectively; and denote probability mass function and probability density function (pdf), respectively; represents a multivariate $t$ distributed random variable with location parameter , scatter matrix , and degree of freedom ; denotes the estimator (or estimate) of the parameter ; iid stands for independent and identically distributed; (A.) denotes an assumption; stands for the natural logarithm; represents the expectation operator; stands for the limit; represents vector or matrix transpose; denotes the determinant when its argument is a matrix and an absolute value when its argument is scalar; represents the Kronecker product; refers to the stacking of the columns of an arbitrary matrix into a long column vector; denotes Landau’s term which tends to a constant as the data size goes to infinity; stands for an dimensional identity matrix; and represent an dimensional all zero and all one matrix, respectively; denotes the cardinality of the set ; represents equality by definition; denotes mathematical equivalence.
II Problem Formulation
Let denote the observed data set which can be partitioned into independent, mutually exclusive, and non-empty clusters . Each cluster , for , contains data vectors that are realizations of iid multivariate random variables , where , , and represent the centroid, the scatter matrix, and the degree of freedom of the th cluster, respectively. Let be a family of multivariate candidate models, where and represent the specified minimum and maximum number of clusters, respectively. Each candidate model , for and , represents a partition of into clusters with associated cluster parameter matrix , which lies in a parameter space . Assuming that (A.1) the degree of freedom parameter , for , is fixed at some prespecified value, the parameters of interest reduce to . Our research goal is to estimate the number of clusters in given assuming that (A.2) the constraint is satisfied.
Note that, given some mild assumptions are satisfied, we have recently derived a general Bayesian cluster enumeration criterion, which we refer to as . However, since we assume multivariate candidate models, some of the assumptions made in the derivation of require mathematical justification. In the next section, we highlight the specific assumptions that require justification and, in an attempt to keep the article self contained, provide all necessary derivations.
III Robust Bayesian Cluster Enumeration Criterion
Our objective is to select a model which is a posteriori most probable among the set of candidate models . Mathematically
where is the posterior probability of given , which can be written as
where denotes the joint posterior density of and given . According to Bayes’ theorem
we maximize over the family of candidate models for mathematical convenience. Hence, taking the natural logarithm of Eq. (4) results in
where is a constant that is independent of and, consequently, has no effect on the maximization of over . As the partitions (clusters) , for , are independent, mutually exclusive, and non-empty, the following holds:
Maximization of Eq. (9) over involves the maximization of the natural logarithm of a multidimensional integral. The multidimensional integral can be solved using either numerical integration or asymptotic approximations that result in a closed-form solution. Closed-form solutions are known to provide an insight into the model selection problem at hand . Hence, we apply Laplace’s method of integration [55, 56, 57, 20], which makes asymptotic approximations, to simplify the multidimensional integral in Eq. (9).
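To illustrate the idea behind Laplace's method in isolation, the following minimal sketch approximates the logarithm of a multidimensional integral from the maximizer of the log-integrand and the Hessian of the negative log-integrand at that point. It is a generic illustration, not the derivation used in this article; the function name `laplace_log_integral` and its arguments are hypothetical.

```python
import math
import numpy as np

def laplace_log_integral(log_f, theta_hat, neg_hessian):
    """Laplace approximation of log( integral of exp(log_f(theta)) d theta ).

    log_f       : callable returning the scalar log-integrand at theta
    theta_hat   : maximizer of log_f (assumed to be an interior point)
    neg_hessian : Hessian of -log_f at theta_hat (must be positive definite)
    """
    theta_hat = np.atleast_1d(np.asarray(theta_hat, dtype=float))
    H = np.atleast_2d(np.asarray(neg_hessian, dtype=float))
    q = theta_hat.size  # number of parameters being integrated out
    sign, logdet = np.linalg.slogdet(H)
    if sign <= 0:
        raise ValueError("Hessian of -log_f must be positive definite")
    # log f(theta_hat) + (q/2) log(2 pi) - (1/2) log det(H)
    return log_f(theta_hat) + 0.5 * q * math.log(2.0 * math.pi) - 0.5 * logdet

# Usage: integrating theta^a * exp(-theta) over theta > 0 gives Gamma(a + 1).
# The maximizer is theta = a and the Hessian of -log_f there is 1 / a, so the
# Laplace approximation reproduces Stirling's formula.
a = 50.0
approx = laplace_log_integral(
    lambda th: float(a * np.log(th[0]) - th[0]), [a], [[1.0 / a]]
)
exact = math.lgamma(a + 1.0)
```

For a Gaussian-shaped integrand the approximation is exact; for other integrands its error shrinks as the integrand becomes more sharply peaked, which is why the method is described as asymptotic.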
For ease of notation, Eq. (9) is written as
(A.3) , for , has first- and second-order derivatives that are continuous over the parameter space ,
(A.4) , for , has a global maximum at , where is an interior point of , and
(A.5) the Hessian matrix of , which is given by
where is the cardinality of , is positive definite for ,
can be approximated by its second-order Taylor series expansion around as follows:
where . The first derivative of evaluated at vanishes because of assumption (A.4).
Note that , for , is known to have multiple local maxima [53, 51]. For assumption (A.4) to hold, we have to show that is the global maximum of , for . We know that the global maximizer of , is , where is the true parameter vector. is the maximum likelihood estimator and its derivation and the final expressions are given in Appendix A. In , it was proven that
with probability one. As a result, asymptotically, assumption (A.4) holds. Assumption (A.5) follows because is a maximizer of .
, for , is continuously differentiable and its first-order derivatives are bounded on with .
where HOT denotes higher order terms and is a Gaussian kernel with mean and covariance matrix . Ignoring the higher order terms, Eq. (13) reduces to
is the Fisher information matrix (FIM) of the data vectors that belong to the th partition.
IV Proposed Robust Bayesian Cluster Enumeration Algorithm
We propose a robust cluster enumeration algorithm to estimate the number of clusters in the data set . The presented two-step approach utilizes an unsupervised learning algorithm to partition into the number of clusters specified by each candidate model prior to the computation of one of the proposed robust cluster enumeration criteria for that particular model.
IV-A Proposed Robust Bayesian Cluster Enumeration Criteria
For each candidate model , let there be a clustering algorithm that partitions into clusters and provides parameter estimates , for . Assume that (A.1)-(A.7) are fulfilled.
The posterior probability of given can be asymptotically approximated by
where represents the number of estimated parameters per cluster and
The likelihood function, also called the data fidelity term, is given by
where , denotes the gamma function, and is the squared Mahalanobis distance. The second term in the second line of Eq. (18) is referred to as the penalty term.
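As an illustration of the ingredients named above (the gamma function and the squared Mahalanobis distance), the following sketch evaluates the log-density of a multivariate $t$ distribution in its standard textbook form. The function name `mvt_logpdf` and the symbol `r` for the dimension are assumptions made for this example; the sketch does not reproduce the enumeration criterion itself.

```python
import math
import numpy as np

def mvt_logpdf(X, mu, scatter, nu):
    """Log-density of a multivariate t distribution (standard form).

    X       : array of shape (N, r), one data vector per row
    mu      : location parameter (centroid), shape (r,)
    scatter : scatter matrix, shape (r, r), positive definite
    nu      : degree of freedom parameter (> 0)
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    scatter = np.atleast_2d(np.asarray(scatter, dtype=float))
    r = mu.size
    diff = X - mu
    # Squared Mahalanobis distance of every row of X to the centroid
    delta = np.sum(diff * np.linalg.solve(scatter, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(scatter)
    log_norm = (math.lgamma(0.5 * (nu + r)) - math.lgamma(0.5 * nu)
                - 0.5 * r * math.log(math.pi * nu) - 0.5 * logdet)
    return log_norm - 0.5 * (nu + r) * np.log1p(delta / nu)
```

With `nu = 1` and `r = 1` this reduces to the Cauchy log-density, and for very large `nu` it approaches the Gaussian log-density, matching the two special cases of the $t$ family.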
Once is computed for each candidate model , the number of clusters in is estimated as
When the data size is finite, one can opt to compute , without asymptotic approximations to obtain a more accurate penalty term. In such cases, the posterior probability of given becomes
where the expression for is given in Appendix C.
IV-B The Expectation Maximization (EM) Algorithm for Mixture Models
We consider maximum likelihood estimation of the parameters of the -component mixture of distributions
where denotes the -variate pdf and . are the mixing coefficients and are assumed to be known or estimated, e.g. using . The mixing coefficients satisfy the constraints for , and .
The EM algorithm is widely used to estimate the parameters of the -component mixture of distributions [48, 47, 49, 59]. The EM algorithm contains two basic steps, namely the E step and the M step, which are performed iteratively until a convergence condition is satisfied. The E step computes
where is the posterior probability that belongs to the th cluster at the th iteration and is the weight given to by the th cluster at the th iteration. Once and are calculated, the M step updates cluster parameters as follows:
Algorithm 1 summarizes the working principle of the proposed robust two-step cluster enumeration approach. Given that the degree of freedom parameter is fixed at some finite value, the computational complexity of Algorithm 1 is the sum of the run times of the two steps. Since the initialization, i.e., the K-medians algorithm, is performed only for a few iterations, the computational complexity of the first step is dominated by the EM algorithm and it is given by for a single candidate model , where is a fixed stopping threshold of the EM algorithm. The computational complexity of is , which is much smaller than the run-time of the EM algorithm and, as a result, it can easily be ignored in the run-time analysis of the proposed algorithm. Hence, the total computational complexity of Algorithm 1 is .
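The E and M steps described above can be sketched as follows for a mixture of multivariate $t$ distributions with a fixed degree of freedom. This is a minimal illustration under several assumptions that are not taken from the article: the updates follow the common textbook form (scatter normalized by the sum of posterior probabilities, mixing coefficients re-estimated as mean posterior probabilities), the K-medians initialization is replaced by user-supplied or randomly drawn centroids, and the names `em_t_mixture` and `init_centroids` are hypothetical.

```python
import math
import numpy as np

def _t_logpdf(X, mu, S, nu):
    """Log-density of a multivariate t and squared Mahalanobis distances."""
    r = X.shape[1]
    diff = X - mu
    delta = np.sum(diff * np.linalg.solve(S, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(S)
    log_norm = (math.lgamma(0.5 * (nu + r)) - math.lgamma(0.5 * nu)
                - 0.5 * r * math.log(math.pi * nu) - 0.5 * logdet)
    return log_norm - 0.5 * (nu + r) * np.log1p(delta / nu), delta

def em_t_mixture(X, n_clusters, nu=3.0, init_centroids=None,
                 n_iter=200, tol=1e-6, seed=0):
    X = np.asarray(X, dtype=float)
    N, r = X.shape
    rng = np.random.default_rng(seed)
    if init_centroids is None:
        # Simplified initialization (assumption): random data points
        mus = X[rng.choice(N, size=n_clusters, replace=False)].copy()
    else:
        mus = np.array(init_centroids, dtype=float)
    S0 = np.atleast_2d(np.cov(X, rowvar=False)) + 1e-6 * np.eye(r)
    Ss = np.stack([S0.copy() for _ in range(n_clusters)])
    pis = np.full(n_clusters, 1.0 / n_clusters)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: posterior probabilities (tau) and robustness weights (w)
        log_comp = np.empty((N, n_clusters))
        w = np.empty((N, n_clusters))
        for m in range(n_clusters):
            lp, delta = _t_logpdf(X, mus[m], Ss[m], nu)
            log_comp[:, m] = np.log(pis[m] + 1e-300) + lp
            w[:, m] = (nu + r) / (nu + delta)  # small weight for outliers
        log_mix = np.logaddexp.reduce(log_comp, axis=1)
        tau = np.exp(log_comp - log_mix[:, None])
        ll = float(log_mix.sum())
        # M step: weighted updates of centroids, scatter matrices, mixing
        for m in range(n_clusters):
            tw = tau[:, m] * w[:, m]
            mus[m] = tw @ X / tw.sum()
            diff = X - mus[m]
            # Tiny ridge keeps the scatter matrix numerically invertible
            Ss[m] = ((diff * tw[:, None]).T @ diff / tau[:, m].sum()
                     + 1e-9 * np.eye(r))
        pis = tau.mean(axis=0)
        if abs(ll - prev_ll) < tol * max(1.0, abs(ll)):
            break
        prev_ll = ll
    return mus, Ss, pis, tau, ll
```

The weight `w` decreases as the squared Mahalanobis distance grows, so data vectors far from a cluster contribute less to its centroid and scatter estimates, and a smaller `nu` strengthens this effect.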
V Comparison of Different Bayesian Cluster Enumeration Criteria
where is the data fidelity term and is the penalty term. The proposed robust cluster enumeration criteria, and , and the original BIC with multivariate candidate models, , [48, 37, 38] have an identical data fidelity term. The difference in these criteria lies in their penalty terms, which are given by
where and are given by Eq. (19) and Eq. (69), respectively. Note that calculates an exact value of the penalty term, while and compute its asymptotic approximation. In the finite sample regime the penalty term of is stronger than the penalty term of , while asymptotically all three criteria have an identical penalty term.
When the degree of freedom parameter converges to
where is the asymptotic criterion derived in  assuming a family of Gaussian candidate models.
A modification of the data distribution of the candidate models only affects the data fidelity term of the original BIC [46, 54]. However, given that the BIC is specifically derived for cluster analysis, we showed that both the data fidelity and penalty terms change as the data distribution of the candidate models changes, see Eq. (18) and the expression for in .
A related robust cluster enumeration method that uses the original BIC to estimate the number of clusters is the trimmed BIC (TBIC). The TBIC estimates the number of clusters using Gaussian candidate models after trimming some percentage of the data. In TBIC, the fast trimmed likelihood estimator (FAST-TLE) is used to obtain maximum likelihood estimates of cluster parameters. The FAST-TLE is computationally expensive since it carries out a trial and a refinement step multiple times, see for details.
VI Experimental Results
In this section, we compare the performance of the proposed robust two-step algorithm with state-of-the-art cluster enumeration methods using numerical and real data experiments. In addition to the methods discussed in Section V, we compare our cluster enumeration algorithm with the gravitational clustering (GC)  and the X-means  algorithms. All experimental results are an average of Monte Carlo runs. The degree of freedom parameter is set to for all methods that have multivariate candidate models. We use the author’s implementation of the gravitational clustering algorithm . For the TBIC, we trim of the data and perform iterations of the trial and refinement steps. For the model selection based methods, the minimum and maximum number of clusters is set to and , where denotes the true number of clusters in the data under consideration.
VI-A Performance Measures
Performance is assessed in terms of the empirical probability of detection and the mean absolute error (MAE), which are defined as
where is the total number of Monte Carlo experiments, is the estimated number of clusters in the th Monte Carlo experiment, and is the indicator function defined as
In addition to these two performance measures, we also report the empirical probability of underestimation and overestimation, which are defined as
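These performance measures can be computed from the estimated number of clusters in each Monte Carlo run as sketched below. The sketch assumes the usual definitions: fractions of runs in which the estimate equals, falls below, or exceeds the true number of clusters, and the average absolute deviation from the true number. The function name `enumeration_metrics` and the dictionary keys are hypothetical.

```python
import numpy as np

def enumeration_metrics(k_hat, k_true):
    """Empirical detection, under- and overestimation probabilities and MAE.

    k_hat  : estimated number of clusters, one entry per Monte Carlo run
    k_true : true number of clusters
    """
    k_hat = np.asarray(k_hat)
    return {
        "p_det": float(np.mean(k_hat == k_true)),    # correct estimates
        "p_under": float(np.mean(k_hat < k_true)),   # underestimation
        "p_over": float(np.mean(k_hat > k_true)),    # overestimation
        "mae": float(np.mean(np.abs(k_true - k_hat))),
    }

# Usage with five illustrative (made-up) estimates for a true value of 3
metrics = enumeration_metrics([3, 3, 2, 4, 3], k_true=3)
```

By construction the three probabilities sum to one, so reporting two of them determines the third.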
VI-B Numerical Experiments
VI-B1 Analysis of the sensitivity of different cluster enumeration methods to outliers
We generate two data sets which contain realizations of -dimensional random variables , where , with cluster centroids , , , and covariance matrices
The first data set (Data-1), as depicted in Fig. 1(a), replaces a randomly selected data point with an outlier that is generated from a uniform distribution over the range on each variate at each iteration. The sensitivity of different cluster enumeration methods to a single replacement outlier over iterations as a function of the number of data vectors per cluster is displayed in Table I. Among the compared methods, our robust criterion has the best performance in terms of both and MAE. Except for and the TBIC, the performance of all methods deteriorates when , for , is small and, notably, performs poorly. This behavior is attributed to the fact that is an asymptotic criterion and in the small sample regime its penalty term becomes weak, which results in an increase in the empirical probability of overestimation. and X-means are very sensitive to the presence of a single outlier because they model individual clusters as multivariate Gaussian. X-means performs even worse than since it uses the K-means algorithm to cluster the data, which is ineffective in handling elliptical clusters. An illustrative example of the sensitivity of and to the presence of an outlier is displayed in Fig. 2. Despite the difference in , when the outlier is either in one of the clusters or very close to one of the clusters, both and are able to estimate the correct number of clusters reasonably well. The difference between these two methods arises when the outlier is far away from the bulk of data. While is still able to estimate the correct number of clusters, starts to overestimate the number of clusters.
The second data set (Data-2), shown in Fig. 1(b), contains data points in each cluster and replaces a certain percentage of the data set with outliers that are generated from a uniform distribution over the range on each variate. Data-2 is generated in a way that no outlier lies inside one of the data clusters. In this manner, we make sure that outliers are points that do not belong to the bulk of data. Fig. 3 shows the empirical probability of detection as a function of the percentage of outliers . GC is able to correctly estimate the number of clusters for . The proposed robust criteria, and , and the original BIC, , behave similarly and are able to estimate the correct number of clusters when . The behavior of these methods is rather intuitive because, as the number of outliers increases, the methods try to explain the outliers by opening a new cluster. A similar trend is observed for the TBIC even though its curve decays slowly. is able to estimate the correct number of clusters of the time when there are no outliers in the data set. However, even of outliers is enough to drive into overestimating the number of clusters.
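The replacement-outlier contamination described above can be sketched as follows. The sketch assumes a simple scheme in which a chosen fraction of the data vectors is overwritten with draws from a uniform distribution on each variate; the range bounds `low` and `high`, the function name, and the seed are hypothetical, and the additional requirement that no outlier falls inside a cluster is not enforced here (it would need an extra rejection step).

```python
import numpy as np

def replace_with_uniform_outliers(X, fraction, low, high, seed=0):
    """Replace a fraction of the rows of X with uniformly distributed outliers.

    Returns the contaminated copy of X and a boolean mask marking the rows
    that were replaced.
    """
    rng = np.random.default_rng(seed)
    X_cont = np.array(X, dtype=float)  # work on a copy, keep X untouched
    N, r = X_cont.shape
    n_out = int(round(fraction * N))
    idx = rng.choice(N, size=n_out, replace=False)
    # Each variate of every outlier is drawn independently from [low, high)
    X_cont[idx] = rng.uniform(low, high, size=(n_out, r))
    mask = np.zeros(N, dtype=bool)
    mask[idx] = True
    return X_cont, mask
```

Returning the mask makes it possible to check afterwards which data vectors an enumeration method had to cope with as outliers.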
VI-B2 Impact of the increase in the number of features on the performance of cluster enumeration methods
We generate realizations of the random variables , for , whose cluster centroids and scatter matrices are given by
with , denoting an dimensional all one column vector, and representing an dimensional identity matrix. For this data set, referred to as Data-3, the number of features is varied in the range and the number of data points per cluster is set to . Because , Data-3 contains realizations of heavy tailed distributions and, as a result, the clusters contain outliers. The empirical probability of detection as a function of the number of features is displayed in Fig. 4. The performance of GC appears to be invariant to the increase in the number of features, while the remaining methods are affected. However, compared to the other cluster enumeration methods, GC is computationally very expensive. outperforms and the TBIC when the number of features is low, while the proposed criterion outperforms both methods in high dimensions. is not computed for this data set because it is computationally expensive and offers no benefit given the large number of samples.
VI-B3 Analysis of the sensitivity of different cluster enumeration methods to cluster overlap
here, we use Data-2 with