# Robust Bayesian Cluster Enumeration

A major challenge in cluster analysis is that the number of data clusters is mostly unknown and must be estimated prior to clustering the observed data. In real-world applications, the observed data is often subject to heavy-tailed noise and outliers which obscure the true underlying structure of the data. Consequently, estimating the number of clusters becomes challenging. To this end, we derive a robust cluster enumeration criterion by formulating the problem of estimating the number of clusters as maximization of the posterior probability of multivariate t_ν candidate models. We utilize Bayes' theorem and asymptotic approximations to come up with a robust criterion that possesses a closed-form expression. Further, we refine the derivation and provide a robust cluster enumeration criterion for the finite sample regime. The robust criteria require an estimate of the cluster parameters for each candidate model as an input. Hence, we propose a two-step cluster enumeration algorithm that uses the expectation maximization algorithm to partition the data and estimate the cluster parameters prior to the calculation of one of the robust criteria. The performance of the proposed algorithm is tested and compared to existing cluster enumeration methods using numerical and real data experiments.

## I Introduction

Cluster analysis is an unsupervised learning task that finds the intrinsic structure in a set of unlabeled data by grouping similar objects into clusters. Cluster analysis plays a crucial role in a wide variety of fields of study, such as social sciences, biology, medical sciences, statistics, machine learning, pattern recognition, and computer vision [1, 2, 3, 4]. A major challenge in cluster analysis is that the number of clusters is usually unknown but it is required to cluster the data. The estimation of the number of clusters, also called cluster enumeration, has attracted the interest of researchers for decades and various methods have been proposed in the literature, see for example [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] and the reviews in [22, 23, 24, 25, 4]. However, to this day, no single best cluster enumeration method exists.

In real-world applications, the observed data is often subject to heavy-tailed noise and outliers [3, 26, 27, 28, 29, 30] which obscure the true underlying structure of the data. Consequently, cluster enumeration becomes even more challenging when either the data is contaminated by a fraction of outliers or there exist deviations from the distributional assumptions. To this end, many robust cluster enumeration methods have been proposed in the literature, see [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 27, 42, 43, 44, 45] and the references therein. A popular approach in robust cluster analysis is to use the Bayesian information criterion (BIC), as derived by Schwarz [46], to estimate the number of data clusters after either removing outliers from the data [32, 33, 34, 31], modeling noise or outliers using an additional component in a mixture modeling framework [35, 36], or exploiting the idea that the presence of outliers causes the distribution of the data to be heavy tailed and, subsequently, modeling the data as a mixture of heavy tailed distributions [37, 38]. For example, modeling the data using a family of t_ν distributions [47, 48, 49, 50, 51, 52, 53] provides a principled way of dealing with outliers by giving them less weight in the objective function. The family of t_ν distributions is flexible as it contains the heavy-tailed Cauchy distribution, for ν = 1, and the Gaussian distribution, for ν → ∞, as special cases. Consequently, we model the clusters using a family of multivariate t_ν distributions and derive robust cluster enumeration criteria that account for outliers, given that the degree of freedom parameter ν is sufficiently small.

In statistical model selection, it is known that the original BIC [46, 54] penalizes two structurally different models the same way if they have the same number of unknown parameters [55, 56]. Hence, careful examination of the original BIC is a necessity prior to its application in specific model selection problems [55]. Following this line of argument, we have recently derived the BIC for cluster analysis by formulating cluster enumeration as maximization of the posterior probability of candidate models [20, 21]. In [20] we showed that the BIC derived specifically for cluster enumeration has a different penalty term compared to the original BIC. However, robustness was not considered in [20], where a family of multivariate Gaussian candidate models was used to derive the criterion, which we refer to as BIC_N.

To the best of our knowledge, this is the first attempt made to derive a robust cluster enumeration criterion by formulating the cluster enumeration problem as maximization of the posterior probability of multivariate t_ν candidate models. Under some mild assumptions, we derive a robust Bayesian cluster enumeration criterion, BIC_tν. We show that BIC_tν has a different penalty term compared to the original BIC [46, 54], given that the candidate models in the original BIC are represented by a family of multivariate t_ν distributions. Interestingly, for BIC_tν both the data fidelity and the penalty terms depend on the assumed distribution for the data, while for the original BIC changes in the data distribution only affect the data fidelity term. Asymptotically, BIC_tν converges to the original BIC. As a result, our derivations also provide a justification for the use of the original BIC with multivariate t_ν candidate models from a cluster analysis perspective. Further, we refine the derivation of BIC_tν by providing an exact expression for its penalty term. This results in a robust criterion, BIC_Ftν, which behaves better than BIC_tν in the finite sample regime and converges to BIC_tν in the asymptotic regime.

In general, BIC based cluster enumeration methods require a clustering algorithm that partitions the data according to the number of clusters specified by each candidate model and provides an estimate of cluster parameters. Hence, we apply the expectation maximization (EM) algorithm to partition the data prior to the calculation of an enumeration criterion, resulting in a two-step approach. The proposed algorithm provides a unified framework for the robust estimation of the number of clusters and cluster memberships.

The paper is organized as follows. Section II formulates the cluster enumeration problem and Section III introduces the proposed robust cluster enumeration criterion. Section IV presents the two-step cluster enumeration algorithm. A comparison of different Bayesian cluster enumeration criteria is given in Section V. A performance evaluation and comparison to existing methods using numerical and real data experiments is provided in Section VI. Finally, Section VII contains concluding remarks and highlights future research directions. Notably, a detailed proof is provided in Appendix B.

Notation:

Lower- and upper-case boldface letters represent column vectors and matrices, respectively; calligraphic letters denote sets, with the exception of ℒ, which represents the likelihood function; ℝ, ℝ₊, and ℤ₊ denote the set of real numbers, the set of positive real numbers, and the set of positive integers, respectively; p(·) and f(·) denote a probability mass function and a probability density function (pdf), respectively; x ∼ t_ν(μ, Ψ) represents a multivariate t distributed random variable x with location parameter μ, scatter matrix Ψ, and degree of freedom ν; θ̂ denotes the estimator (or estimate) of the parameter θ; iid stands for independent and identically distributed; (A.) denotes an assumption; log stands for the natural logarithm; E[·] represents the expectation operator; lim stands for the limit; ⊤ represents vector or matrix transpose; |·| denotes the determinant when its argument is a matrix and the absolute value when its argument is a scalar; ⊗ represents the Kronecker product; vec(·) refers to the stacking of the columns of an arbitrary matrix into a long column vector; 𝒪(1) denotes Landau's term, which tends to a constant as the data size goes to infinity; I_r stands for an r × r dimensional identity matrix; 0 and 1 represent an all-zero and an all-one matrix of appropriate dimension, respectively; |𝒳| denotes the cardinality of the set 𝒳; ≜ represents equality by definition; ≡ denotes mathematical equivalence.

## II Problem Formulation

Let 𝒳 ≜ {x₁, …, x_N} denote the observed data set, which can be partitioned into K independent, mutually exclusive, and non-empty clusters 𝒳₁, …, 𝒳_K. Each cluster 𝒳_k, for k = 1, …, K, contains N_k data vectors that are realizations of iid multivariate t distributed random variables x_k ∼ t_{ν_k}(μ_k, Ψ_k), where μ_k, Ψ_k, and ν_k represent the centroid, the scatter matrix, and the degree of freedom of the kth cluster, respectively. Let ℳ ≜ {M_{L_min}, …, M_{L_max}} be a family of multivariate t_ν candidate models, where L_min and L_max represent the specified minimum and maximum number of clusters, respectively. Each candidate model M_l, for l = L_min, …, L_max, represents a partition of 𝒳 into l clusters with associated cluster parameter matrix Θ_l ≜ [θ₁, …, θ_l], which lies in a parameter space Ω_l. Assuming that

(A.1) the degree of freedom parameter ν_m, for m = 1, …, l, is fixed at some prespecified value,

the parameters of interest reduce to the centroids and scatter matrices, θ_m ≜ [μ_m, Ψ_m]. Our research goal is to estimate the number of clusters K in 𝒳 given ℳ, assuming that

(A.2) the constraint L_min ≤ K ≤ L_max is satisfied.

Note that, given some mild assumptions are satisfied, we have recently derived a general Bayesian cluster enumeration criterion in [20]. However, since we assume multivariate t_ν candidate models, some of the assumptions made in the derivation of that criterion require mathematical justification. In the next section, we highlight the specific assumptions that require justification and, in an attempt to keep the article self contained, provide all necessary derivations.

## III Robust Bayesian Cluster Enumeration Criterion

Our objective is to select the model M̂_K that is a posteriori most probable among the set of candidate models ℳ. Mathematically,

$$\hat{M}_K = \underset{M_l \in \mathcal{M}}{\arg\max}\; p(M_l \mid \mathcal{X}), \qquad (1)$$

where p(M_l | 𝒳) is the posterior probability of M_l given 𝒳, which can be written as

$$p(M_l \mid \mathcal{X}) = \int_{\Omega_l} f(M_l, \Theta_l \mid \mathcal{X})\, d\Theta_l, \qquad (2)$$

where f(M_l, Θ_l | 𝒳) denotes the joint posterior density of M_l and Θ_l given 𝒳. According to Bayes' theorem,

$$f(M_l, \Theta_l \mid \mathcal{X}) = \frac{p(M_l)\, f(\Theta_l \mid M_l)\, \mathcal{L}(\Theta_l \mid \mathcal{X})}{f(\mathcal{X})}, \qquad (3)$$

where p(M_l) denotes a discrete prior on M_l, f(Θ_l | M_l) represents the prior on Θ_l given M_l, f(𝒳) denotes the pdf of the data set 𝒳, and ℒ(Θ_l | 𝒳) is the likelihood function. Substituting Eq. (3) into Eq. (2) results in

$$p(M_l \mid \mathcal{X}) = f(\mathcal{X})^{-1}\, p(M_l) \int_{\Omega_l} f(\Theta_l \mid M_l)\, \mathcal{L}(\Theta_l \mid \mathcal{X})\, d\Theta_l. \qquad (4)$$

Since

$$\underset{M_l \in \mathcal{M}}{\arg\max}\; p(M_l \mid \mathcal{X}) \equiv \underset{M_l \in \mathcal{M}}{\arg\max}\; \log p(M_l \mid \mathcal{X}), \qquad (5)$$

we maximize log p(M_l | 𝒳) over the family of candidate models for mathematical convenience. Hence, taking the natural logarithm of Eq. (4) results in

$$\log p(M_l \mid \mathcal{X}) = \log p(M_l) + \log \int_{\Omega_l} f(\Theta_l \mid M_l)\, \mathcal{L}(\Theta_l \mid \mathcal{X})\, d\Theta_l + \rho, \qquad (6)$$

where ρ ≜ −log f(𝒳) is a constant that is independent of M_l and, consequently, has no effect on the maximization of log p(M_l | 𝒳) over ℳ. As the partitions (clusters) 𝒳_m, for m = 1, …, l, are independent, mutually exclusive, and non-empty, the following holds:

$$f(\Theta_l \mid M_l) = \prod_{m=1}^{l} f(\theta_m \mid M_l) \qquad (7)$$

$$\mathcal{L}(\Theta_l \mid \mathcal{X}) = \prod_{m=1}^{l} \mathcal{L}(\theta_m \mid \mathcal{X}_m) \qquad (8)$$

Substituting Eqs. (7) and (8) into Eq. (6) results in

$$\log p(M_l \mid \mathcal{X}) = \log p(M_l) + \sum_{m=1}^{l} \log \int_{\mathbb{R}^q} f(\theta_m \mid M_l)\, \mathcal{L}(\theta_m \mid \mathcal{X}_m)\, d\theta_m + \rho. \qquad (9)$$

Maximization of Eq. (9) over ℳ involves the maximization of the natural logarithm of a multidimensional integral. The multidimensional integral can be solved using either numerical integration or asymptotic approximations that result in a closed-form solution. Closed-form solutions are known to provide an insight into the model selection problem at hand [55]. Hence, we apply Laplace’s method of integration [55, 56, 57, 20], which makes asymptotic approximations, to simplify the multidimensional integral in Eq. (9).

For ease of notation, Eq. (9) is written as

$$\log p(M_l \mid \mathcal{X}) = \log p(M_l) + \sum_{m=1}^{l} \log U + \rho, \qquad (10)$$

where

$$U \triangleq \int_{\mathbb{R}^q} f(\theta_m \mid M_l)\, \mathcal{L}(\theta_m \mid \mathcal{X}_m)\, d\theta_m. \qquad (11)$$

Given that

(A.3) log ℒ(θ_m | 𝒳_m), for m = 1, …, l, has first- and second-order derivatives that are continuous over the parameter space,

(A.4) log ℒ(θ_m | 𝒳_m), for m = 1, …, l, has a global maximum at θ̂_m, where θ̂_m is an interior point of the parameter space, and

(A.5) the Hessian matrix

$$\hat{H}_m \triangleq -\frac{1}{N_m}\, \frac{d^2 \log \mathcal{L}(\theta_m \mid \mathcal{X}_m)}{d\theta_m\, d\theta_m^{\top}} \Bigg|_{\theta_m = \hat{\theta}_m} \in \mathbb{R}^{q \times q},$$

where N_m is the cardinality of 𝒳_m, is positive definite for m = 1, …, l,

log ℒ(θ_m | 𝒳_m) can be approximated by its second-order Taylor series expansion around θ̂_m as follows:

$$\begin{aligned} \log \mathcal{L}(\theta_m \mid \mathcal{X}_m) &\approx \log \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) + \tilde{\theta}_m^{\top}\, \frac{d \log \mathcal{L}(\theta_m \mid \mathcal{X}_m)}{d\theta_m}\Bigg|_{\theta_m = \hat{\theta}_m} + \frac{1}{2}\, \tilde{\theta}_m^{\top} \Bigg[ \frac{d^2 \log \mathcal{L}(\theta_m \mid \mathcal{X}_m)}{d\theta_m\, d\theta_m^{\top}} \Bigg|_{\theta_m = \hat{\theta}_m} \Bigg] \tilde{\theta}_m \\ &= \log \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) - \frac{N_m}{2}\, \tilde{\theta}_m^{\top} \hat{H}_m \tilde{\theta}_m, \end{aligned} \qquad (12)$$

where θ̃_m ≜ θ_m − θ̂_m. The first derivative of log ℒ(θ_m | 𝒳_m) evaluated at θ̂_m vanishes because of assumption (A.4).

Note that ℒ(θ_m | 𝒳_m), for m = 1, …, l, is known to have multiple local maxima [53, 51]. For assumption (A.4) to hold, we have to show that θ̂_m is the global maximum of ℒ(θ_m | 𝒳_m), for m = 1, …, l. We know that the global maximizer of ℒ(θ_m | 𝒳_m) is, asymptotically, θ_m⁰, where θ_m⁰ is the true parameter vector. θ̂_m is the maximum likelihood estimator and its derivation and the final expressions are given in Appendix A. In [58], it was proven that

$$\lim_{N_m \to \infty} \hat{\theta}_m = \theta_m^{0}$$

with probability one. As a result, asymptotically, assumption (A.4) holds. Assumption (A.5) follows because θ̂_m is a maximizer of ℒ(θ_m | 𝒳_m).

Assume that

(A.6) f(θ_m | M_l), for m = 1, …, l, is continuously differentiable and its first-order derivatives are bounded on the parameter space.

Then, substituting Eq. (12) into Eq. (11) and approximating f(θ_m | M_l) by its Taylor series expansion around θ̂_m yields

$$U \approx \int_{\mathbb{R}^q} \Bigg[ f(\hat{\theta}_m \mid M_l) + \tilde{\theta}_m^{\top}\, \frac{d f(\theta_m \mid M_l)}{d\theta_m}\Bigg|_{\theta_m = \hat{\theta}_m} + \text{HOT} \Bigg] \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) \exp\Big( -\frac{N_m}{2}\, \tilde{\theta}_m^{\top} \hat{H}_m \tilde{\theta}_m \Big)\, d\theta_m, \qquad (13)$$

where HOT denotes higher order terms and exp(−(N_m/2) θ̃_m⊤ Ĥ_m θ̃_m) is, up to a normalization constant, a Gaussian kernel with mean θ̂_m and covariance matrix N_m⁻¹ Ĥ_m⁻¹. Ignoring the higher order terms, Eq. (13) reduces to

$$\begin{aligned} U &\approx f(\hat{\theta}_m \mid M_l)\, \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) \int_{\mathbb{R}^q} \exp\Big( -\frac{N_m}{2}\, \tilde{\theta}_m^{\top} \hat{H}_m \tilde{\theta}_m \Big)\, d\theta_m \\ &\quad + \int_{\mathbb{R}^q} \tilde{\theta}_m^{\top}\, \frac{d f(\theta_m \mid M_l)}{d\theta_m}\Bigg|_{\theta_m = \hat{\theta}_m} \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) \exp\Big( -\frac{N_m}{2}\, \tilde{\theta}_m^{\top} \hat{H}_m \tilde{\theta}_m \Big)\, d\theta_m. \end{aligned} \qquad (14)$$

The second term in Eq. (14) is equivalent to a constant times the mean of θ̃_m under the Gaussian kernel, which vanishes since the kernel is centered at θ̂_m. Consequently, Eq. (14) reduces to

$$\begin{aligned} U &\approx f(\hat{\theta}_m \mid M_l)\, \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) \int_{\mathbb{R}^q} \exp\Big( -\frac{N_m}{2}\, \tilde{\theta}_m^{\top} \hat{H}_m \tilde{\theta}_m \Big)\, d\theta_m \\ &= f(\hat{\theta}_m \mid M_l)\, \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m)\, (2\pi)^{q/2}\, \big| N_m^{-1} \hat{H}_m^{-1} \big|^{1/2}, \end{aligned} \qquad (15)$$

given that the pdf of a Gaussian random variable with mean θ̂_m and covariance matrix N_m⁻¹ Ĥ_m⁻¹ integrates to one. Substituting Eq. (15) into Eq. (10) results in

$$\begin{aligned} \log p(M_l \mid \mathcal{X}) &\approx \log p(M_l) + \sum_{m=1}^{l} \log\big( f(\hat{\theta}_m \mid M_l)\, \mathcal{L}(\hat{\theta}_m \mid \mathcal{X}_m) \big) \\ &\quad + \frac{lq}{2} \log 2\pi - \frac{1}{2} \sum_{m=1}^{l} \log \big| \hat{J}_m \big| + \rho, \end{aligned} \qquad (16)$$

where

$$\hat{J}_m \triangleq N_m \hat{H}_m = -\frac{d^2 \log \mathcal{L}(\theta_m \mid \mathcal{X}_m)}{d\theta_m\, d\theta_m^{\top}} \Bigg|_{\theta_m = \hat{\theta}_m} \qquad (17)$$

is the Fisher information matrix (FIM) of the data vectors that belong to the mth partition.
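As a sanity check, Laplace's method of integration used above can be verified numerically in a simple one-dimensional setting. The sketch below is our own illustration (not part of the paper): for a unit-variance Gaussian likelihood with q = 1, the Hessian is Ĥ = 1 and θ̂ is the sample mean, so the Laplace approximation of the integral in Eq. (11), normalized by the peak likelihood, is f(θ̂)·(2π/N)^{1/2}.

```python
import math
import random

random.seed(0)
N = 200
data = [random.gauss(1.5, 1.0) for _ in range(N)]
theta_hat = sum(data) / N  # MLE of the location under unit variance

def log_lik(theta):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2
               for x in data)

def prior(theta, s=10.0):
    # wide Gaussian prior f(theta | M), nearly flat around theta_hat
    return math.exp(-0.5 * (theta / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Numerical value of U / L(theta_hat), computed via likelihood ratios
# exp(logL(theta) - logL(theta_hat)) to avoid underflow; the integrand
# is negligible outside theta_hat +/- 1 for N = 200.
L0 = log_lik(theta_hat)
step = 5e-4
grid = [theta_hat - 1 + k * step for k in range(int(2 / step))]
U_num = sum(prior(t) * math.exp(log_lik(t) - L0) for t in grid) * step

# Laplace's method with q = 1 and Hessian 1: U / L(theta_hat) ~ f(theta_hat) * sqrt(2*pi/N)
U_lap = prior(theta_hat) * math.sqrt(2 * math.pi / N)

rel_err = abs(U_num - U_lap) / U_lap
print(rel_err < 0.01)
```

The relative error shrinks as N grows, which is the asymptotic behavior the derivation relies on.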

## IV Proposed Robust Bayesian Cluster Enumeration Algorithm

We propose a robust cluster enumeration algorithm to estimate the number of clusters in the data set 𝒳. The presented two-step approach utilizes an unsupervised learning algorithm to partition 𝒳 into the number of clusters specified by each candidate model prior to the computation of one of the proposed robust cluster enumeration criteria for that particular model.

### IV-A Proposed Robust Bayesian Cluster Enumeration Criteria

For each candidate model M_l ∈ ℳ, let there be a clustering algorithm that partitions 𝒳 into l clusters and provides parameter estimates θ̂_m, for m = 1, …, l. Assume that (A.1)-(A.7) are fulfilled.

###### Theorem 1.

The posterior probability of M_l given 𝒳 can be asymptotically approximated by

$$\text{BIC}_{t_\nu}(M_l) \triangleq \log p(M_l \mid \mathcal{X}) \approx \log \mathcal{L}(\hat{\Theta}_l \mid \mathcal{X}) - \frac{q}{2} \sum_{m=1}^{l} \log \epsilon, \qquad (18)$$

where q represents the number of estimated parameters per cluster and

$$\epsilon = \max\Big( \sum_{x_n \in \mathcal{X}_m} \hat{w}_n^2,\; N_m \Big). \qquad (19)$$

The likelihood function, also called the data fidelity term, is given by

$$\begin{aligned} \log \mathcal{L}(\hat{\Theta}_l \mid \mathcal{X}) &\approx \sum_{m=1}^{l} N_m \log N_m - \sum_{m=1}^{l} \frac{N_m}{2} \log \big| \hat{\Psi}_m \big| \\ &\quad + \sum_{m=1}^{l} N_m \log \frac{\Gamma\big( (\nu_m + r)/2 \big)}{\Gamma(\nu_m/2)\, (\pi \nu_m)^{r/2}} \\ &\quad - \frac{1}{2} \sum_{m=1}^{l} \sum_{x_n \in \mathcal{X}_m} (\nu_m + r) \log\Big( 1 + \frac{\delta_n}{\nu_m} \Big), \end{aligned} \qquad (20)$$

where r denotes the number of features, Γ(·) denotes the gamma function, and δ_n is the squared Mahalanobis distance. The second term of Eq. (18) is referred to as the penalty term.

###### Proof.

Proving that Eq. (16) reduces to Eq. (18) requires approximating log |Ĵ_m| and, consequently, writing a closed-form expression for it. A detailed proof is given in Appendix B. ∎

Once BIC_tν(M_l) is computed for each candidate model M_l ∈ ℳ, the number of clusters in 𝒳 is estimated as

$$\hat{K}_{\text{BIC}_{t_\nu}} = \underset{l = L_{\min}, \ldots, L_{\max}}{\arg\max}\; \text{BIC}_{t_\nu}(M_l). \qquad (21)$$
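To make Eqs. (18), (19), and (21) concrete, the following sketch (entirely our own; the data, partitions, and parameter values are hypothetical) evaluates the criterion for two candidate models of a toy one-dimensional data set (r = 1, so q = 2 parameters per cluster) and selects the number of clusters via the argmax rule. The hard partitions and parameter estimates that the clustering step would normally provide are hard-coded here.

```python
import math

def log_lik_t(cluster, mu, psi, nu):
    """Data fidelity contribution of one cluster, Eq. (20) restricted to r = 1."""
    r = 1
    Nm = len(cluster)
    const = math.lgamma((nu + r) / 2) - math.lgamma(nu / 2) \
        - (r / 2) * math.log(math.pi * nu)
    ll = Nm * math.log(Nm) - (Nm / 2) * math.log(psi) + Nm * const
    for x in cluster:
        delta = (x - mu) ** 2 / psi  # squared Mahalanobis distance
        ll -= 0.5 * (nu + r) * math.log(1 + delta / nu)
    return ll

def bic_t(partition, params, nu, q=2):
    """Criterion of Eq. (18): data fidelity minus (q/2) * sum_m log(eps)."""
    total = 0.0
    for cluster, (mu, psi) in zip(partition, params):
        # weights w_n = (nu + r) / (nu + delta_n) with r = 1, cf. Eq. (25)
        w2 = sum(((nu + 1) / (nu + (x - mu) ** 2 / psi)) ** 2 for x in cluster)
        eps = max(w2, len(cluster))  # Eq. (19)
        total += log_lik_t(cluster, mu, psi, nu) - (q / 2) * math.log(eps)
    return total

# Two well-separated 1-D clusters; candidate models with l = 1 and l = 2.
a = [-5.2, -4.9, -5.1, -4.8, -5.0, -5.3, -4.7, -5.1]
b = [4.8, 5.1, 4.9, 5.2, 5.0, 4.7, 5.3, 4.9]
models = {
    1: ([a + b], [(0.0, 25.0)]),
    2: ([a, b], [(-5.0, 0.04), (5.0, 0.04)]),
}
scores = {l: bic_t(p, th, nu=3.0) for l, (p, th) in models.items()}
K_hat = max(scores, key=scores.get)  # argmax rule of Eq. (21)
print(K_hat)  # 2
```

As expected, the two-cluster model wins by a wide margin on this toy data.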
###### Corollary 1.

When the data size is finite, one can opt to compute log |Ĵ_m| without asymptotic approximations to obtain a more accurate penalty term. In such cases, the posterior probability of M_l given 𝒳 becomes

$$\text{BIC}_{\text{F}t_\nu}(M_l) \approx \log \mathcal{L}(\hat{\Theta}_l \mid \mathcal{X}) - \frac{1}{2} \sum_{m=1}^{l} \log \big| \hat{J}_m \big|, \qquad (22)$$

where the expression for |Ĵ_m| is given in Appendix C.

### IV-B The Expectation Maximization (EM) Algorithm for t_ν Mixture Models

We consider maximum likelihood estimation of the parameters of the l-component mixture of t_ν distributions

$$f(x_n \mid M_l, \Phi_l) = \sum_{m=1}^{l} \tau_m\, g(x_n; \mu_m, \Psi_m, \nu_m), \qquad (23)$$

where g(x_n; μ_m, Ψ_m, ν_m) denotes the r-variate t_ν pdf and Φ_l collects the cluster parameters and the mixing coefficients τ_m. The degrees of freedom ν_m, for m = 1, …, l, are assumed to be known or estimated, e.g. using [48]. The mixing coefficients satisfy the constraints 0 ≤ τ_m ≤ 1 for m = 1, …, l, and Σ_{m=1}^{l} τ_m = 1.

The EM algorithm is widely used to estimate the parameters of the l-component mixture of t_ν distributions [48, 47, 49, 59]. The EM algorithm contains two basic steps, namely the E step and the M step, which are performed iteratively until a convergence condition is satisfied. The E step computes

$$\hat{\upsilon}_{nm}^{(i)} = \frac{\hat{\tau}_m^{(i-1)}\, g\big( x_n; \hat{\mu}_m^{(i-1)}, \hat{\Psi}_m^{(i-1)}, \nu_m \big)}{\sum_{j=1}^{l} \hat{\tau}_j^{(i-1)}\, g\big( x_n; \hat{\mu}_j^{(i-1)}, \hat{\Psi}_j^{(i-1)}, \nu_j \big)} \qquad (24)$$

$$\hat{w}_{nm}^{(i)} = \frac{\nu_m + r}{\nu_m + \delta_n^{(i-1)}}, \qquad (25)$$

where υ̂_nm^(i) is the posterior probability that x_n belongs to the mth cluster at the ith iteration and ŵ_nm^(i) is the weight given to x_n by the mth cluster at the ith iteration. Once υ̂_nm^(i) and ŵ_nm^(i) are calculated, the M step updates the cluster parameters as follows:

$$\hat{\tau}_m^{(i)} = \frac{1}{N} \sum_{n=1}^{N} \hat{\upsilon}_{nm}^{(i)} \qquad (26)$$

$$\hat{\mu}_m^{(i)} = \frac{\sum_{n=1}^{N} \hat{\upsilon}_{nm}^{(i)}\, \hat{w}_{nm}^{(i)}\, x_n}{\sum_{n=1}^{N} \hat{\upsilon}_{nm}^{(i)}\, \hat{w}_{nm}^{(i)}} \qquad (27)$$

$$\hat{\Psi}_m^{(i)} = \frac{\sum_{n=1}^{N} \hat{\upsilon}_{nm}^{(i)}\, \hat{w}_{nm}^{(i)}\, \big( x_n - \hat{\mu}_m^{(i)} \big)\big( x_n - \hat{\mu}_m^{(i)} \big)^{\top}}{\sum_{n=1}^{N} \hat{\upsilon}_{nm}^{(i)}} \qquad (28)$$
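The recursions of Eqs. (24)-(28) can be sketched for a one-dimensional (r = 1) two-component mixture as follows. This is a minimal pure-Python illustration under our own simplifying assumptions (fixed ν and a fixed iteration count instead of a convergence test), not the paper's implementation.

```python
import math
import random

def t_pdf(x, mu, psi, nu, r=1):
    """Univariate (r = 1) t density g(x; mu, psi, nu)."""
    delta = (x - mu) ** 2 / psi  # squared Mahalanobis distance
    logc = math.lgamma((nu + r) / 2) - math.lgamma(nu / 2) \
        - (r / 2) * math.log(math.pi * nu)
    return math.exp(logc - 0.5 * math.log(psi)
                    - 0.5 * (nu + r) * math.log(1 + delta / nu))

def em_t_mixture(data, mu, psi, tau, nu=3.0, r=1, n_iter=50):
    N, l = len(data), len(mu)
    for _ in range(n_iter):
        # E step: responsibilities (Eq. 24) and robustness weights (Eq. 25).
        ups, w = [], []
        for x in data:
            dens = [tau[m] * t_pdf(x, mu[m], psi[m], nu) for m in range(l)]
            s = sum(dens)
            ups.append([d / s for d in dens])
            w.append([(nu + r) / (nu + (x - mu[m]) ** 2 / psi[m])
                      for m in range(l)])
        # M step: Eqs. (26)-(28).
        for m in range(l):
            sv = sum(ups[n][m] for n in range(N))
            svw = sum(ups[n][m] * w[n][m] for n in range(N))
            tau[m] = sv / N
            mu[m] = sum(ups[n][m] * w[n][m] * data[n] for n in range(N)) / svw
            psi[m] = sum(ups[n][m] * w[n][m] * (data[n] - mu[m]) ** 2
                         for n in range(N)) / sv
    return mu, psi, tau

random.seed(1)
data = ([random.gauss(-4, 0.5) for _ in range(100)]
        + [random.gauss(4, 0.5) for _ in range(100)]
        + [25.0, -30.0])  # two gross outliers, down-weighted via Eq. (25)
mu, psi, tau = em_t_mixture(data, mu=[-1.0, 1.0], psi=[4.0, 4.0],
                            tau=[0.5, 0.5])
print(sorted(round(m, 1) for m in mu))  # centroids close to -4 and 4
```

The weights ŵ_nm shrink toward zero for points far from a cluster, which is what makes the location and scatter updates robust against the two injected outliers.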

Algorithm 1 summarizes the working principle of the proposed robust two-step cluster enumeration approach. Given that the degree of freedom parameter is fixed at some finite value, the computational complexity of Algorithm 1 is the sum of the run times of the two steps. Since the initialization, i.e., the K-medians algorithm, is performed only for a few iterations, the computational complexity of the first step is dominated by the EM algorithm, whose run time for a single candidate model M_l additionally depends on a fixed stopping threshold of the EM algorithm. The computational complexity of computing BIC_tν(M_l) is much smaller than the run time of the EM algorithm and, as a result, it can easily be ignored in the run-time analysis of the proposed algorithm. Hence, the total computational complexity of Algorithm 1 is dominated by the EM steps over all candidate models.

Note that if BIC_Ftν is used in Algorithm 1 instead of BIC_tν, the computational complexity of the algorithm increases significantly with the increase in the number of features due to the calculation of Eq. (69).

## V Comparison of Different Bayesian Cluster Enumeration Criteria

Model selection criteria that are derived by maximizing the posterior probability of candidate models given data are known to have the common form [56, 60, 20]

$$\log \mathcal{L}(\hat{\Theta}_l \mid \mathcal{X}) - \eta, \qquad (29)$$

where log ℒ(Θ̂_l | 𝒳) is the data fidelity term and η is the penalty term. The proposed robust cluster enumeration criteria, BIC_tν and BIC_Ftν, and the original BIC with multivariate t_ν candidate models, BIC_Otν [48, 37, 38], have an identical data fidelity term. The difference in these criteria lies in their penalty terms, which are given by

$$\text{BIC}_{t_\nu}: \quad \eta = \frac{q}{2} \sum_{m=1}^{l} \log \epsilon \qquad (30)$$

$$\text{BIC}_{\text{F}t_\nu}: \quad \eta = \frac{1}{2} \sum_{m=1}^{l} \log \big| \hat{J}_m \big| \qquad (31)$$

$$\text{BIC}_{\text{O}t_\nu}: \quad \eta = \frac{ql}{2} \log N, \qquad (32)$$

where ε and |Ĵ_m| are given by Eq. (19) and Eq. (69), respectively. Note that BIC_Ftν calculates an exact value of the penalty term, while BIC_tν and BIC_Otν compute its asymptotic approximation. In the finite sample regime the penalty term of BIC_tν is stronger than the penalty term of BIC_Otν, while asymptotically all three criteria have an identical penalty term.

###### Remark.

When the degree of freedom parameter converges to infinity, BIC_tν(M_l) converges to

$$\text{BIC}_{\text{N}}(M_l) \approx \sum_{m=1}^{l} N_m \log N_m - \sum_{m=1}^{l} \frac{N_m}{2} \log \big| \hat{\Sigma}_m \big| - \frac{q}{2} \sum_{m=1}^{l} \log N_m, \qquad (33)$$

where BIC_N is the asymptotic criterion derived in [20] assuming a family of Gaussian candidate models.

###### Remark.

A modification of the data distribution of the candidate models only affects the data fidelity term of the original BIC [46, 54]. However, given that the BIC is specifically derived for cluster analysis, we showed that both the data fidelity and penalty terms change as the data distribution of the candidate models changes, see Eq. (18) and the expression for BIC_N in [20].

A related robust cluster enumeration method that uses the original BIC to estimate the number of clusters is the trimmed BIC (TBIC) [32]. The TBIC estimates the number of clusters using Gaussian candidate models after trimming some percentage of the data. In TBIC, the fast trimmed likelihood estimator (FAST-TLE) is used to obtain maximum likelihood estimates of cluster parameters. The FAST-TLE is computationally expensive since it carries out a trial and a refinement step multiple times, see [32] for details.

## VI Experimental Results

In this section, we compare the performance of the proposed robust two-step algorithm with state-of-the-art cluster enumeration methods using numerical and real data experiments. In addition to the methods discussed in Section V, we compare our cluster enumeration algorithm with the gravitational clustering (GC) [41] and the X-means [10] algorithms. All experimental results are averages over the Monte Carlo runs. The degree of freedom parameter is fixed at the same value for all methods that have multivariate t_ν candidate models. We use the authors' implementation of the gravitational clustering algorithm [41]. For the TBIC, a fixed percentage of the data is trimmed and several iterations of the trial and refinement steps are performed. For the model selection based methods, the minimum and maximum number of clusters are set relative to the true number of clusters K in the data under consideration.

### VI-A Performance Measures

Performance is assessed in terms of the empirical probability of detection p_det and the mean absolute error (MAE), which are defined as

$$p_{\det} = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}_{\{\hat{K}_i = K\}} \qquad (34)$$

$$\text{MAE} = \frac{1}{I} \sum_{i=1}^{I} \big| K - \hat{K}_i \big|, \qquad (35)$$

where I is the total number of Monte Carlo experiments, K̂_i is the estimated number of clusters in the ith Monte Carlo experiment, and 𝟙_{K̂_i = K} is the indicator function defined as

$$\mathbb{1}_{\{\hat{K}_i = K\}} \triangleq \begin{cases} 1, & \text{if } \hat{K}_i = K \\ 0, & \text{otherwise.} \end{cases} \qquad (36)$$
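As a small worked example of Eqs. (34)-(36) and of the under-/overestimation probabilities, the sketch below computes all four measures for a made-up sequence of Monte Carlo estimates (the numbers are purely illustrative, not results from the paper).

```python
# Hypothetical Monte Carlo estimates of the number of clusters, true K = 3.
K = 3
K_hat = [3, 3, 2, 3, 4, 3, 3, 5, 3, 3]
I = len(K_hat)

p_det = sum(1 for k in K_hat if k == K) / I    # Eq. (34)
mae = sum(abs(K - k) for k in K_hat) / I       # Eq. (35)
p_under = sum(1 for k in K_hat if k < K) / I   # empirical underestimation
p_over = sum(1 for k in K_hat if k > K) / I    # empirical overestimation

print(p_det, mae, p_under, p_over)  # 0.7 0.4 0.1 0.2
```

Note that p_det + p_under + p_over = 1 by construction, which is a useful consistency check when tabulating results.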

In addition to these two performance measures, we also report the empirical probability of underestimation and overestimation, which are defined as

$$p_{\text{under}} = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}_{\{\hat{K}_i < K\}}, \qquad p_{\text{over}} = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}_{\{\hat{K}_i > K\}},$$

respectively.

### VI-B Numerical Experiments

#### VI-B1 Analysis of the sensitivity of different cluster enumeration methods to outliers

We generate two data sets which contain realizations of 2-dimensional random variables x_k ∼ 𝒩(μ_k, Σ_k), for k = 1, 2, 3, with cluster centroids μ_1, μ_2, μ_3 and covariance matrices

$$\Sigma_1 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 2 & -0.5 \\ -0.5 & 0.5 \end{bmatrix}.$$

The first data set (Data-1), as depicted in Fig. 1(a), replaces a randomly selected data point with an outlier that is generated from a uniform distribution over a specified range on each variate at each iteration. The sensitivity of different cluster enumeration methods to a single replacement outlier over the iterations as a function of the number of data vectors per cluster is displayed in Table I. From the compared methods, our robust criteria have the best performance in terms of both p_det and MAE. Except for the proposed robust criteria and the TBIC, the performance of all methods deteriorates when N_k, for k = 1, 2, 3, is small and, notably, BIC_Otν performs poorly. This behavior is attributed to the fact that BIC_Otν is an asymptotic criterion and in the small sample regime its penalty term becomes weak, which results in an increase in the empirical probability of overestimation. BIC_N and X-means are very sensitive to the presence of a single outlier because they model individual clusters as multivariate Gaussian. X-means performs even worse than BIC_N since it uses the K-means algorithm to cluster the data, which is ineffective in handling elliptical clusters. An illustrative example of the sensitivity of BIC_tν and BIC_N to the presence of an outlier is displayed in Fig. 2. When the outlier is either in one of the clusters or very close to one of the clusters, both BIC_tν and BIC_N are able to estimate the correct number of clusters reasonably well. The difference between these two methods arises when the outlier is far away from the bulk of data. While BIC_tν is still able to estimate the correct number of clusters, BIC_N starts to overestimate the number of clusters.

The second data set (Data-2), shown in Fig. 1(b), contains a fixed number of data points in each cluster and replaces a certain percentage of the data set with outliers that are generated from a uniform distribution over a specified range on each variate. Data-2 is generated in a way that no outlier lies inside one of the data clusters. In this manner, we make sure that outliers are points that do not belong to the bulk of data. Fig. 3 shows the empirical probability of detection as a function of the percentage of outliers. GC is able to correctly estimate the number of clusters only for small outlier percentages. The proposed robust criteria, BIC_tν and BIC_Ftν, and the original BIC, BIC_Otν, behave similarly and are able to estimate the correct number of clusters as long as the outlier percentage remains moderate. The behavior of these methods is rather intuitive because, as the amount of outliers increases, the methods try to explain the outliers by opening a new cluster. A similar trend is observed for the TBIC, even though its curve decays slowly. BIC_N is able to estimate the correct number of clusters most of the time when there are no outliers in the data set. However, even a small percentage of outliers is enough to drive BIC_N into overestimating the number of clusters.

#### VI-B2 Impact of the increase in the number of features on the performance of cluster enumeration methods

We generate realizations of the random variables x_k ∼ t_ν(μ_k, Ψ_k), for k = 1, …, K, whose cluster centroids and scatter matrices are given by

$$\mu_k = c\, \mathbf{1}_{r \times 1}, \qquad \Psi_k = \mathbf{I}_r,$$

with a cluster-dependent constant c, 1_{r×1} denoting an r-dimensional all-one column vector, and I_r representing an r × r identity matrix. For this data set, referred to as Data-3, the number of features r is varied over a range of values and the number of data points per cluster is fixed. Because ν is small, Data-3 contains realizations of heavy tailed distributions and, as a result, the clusters contain outliers. The empirical probability of detection as a function of the number of features is displayed in Fig. 4. The performance of GC appears to be invariant to the increase in the number of features, while the remaining methods are affected. But, compared to the other cluster enumeration methods, GC is computationally very expensive. Among the remaining methods, some remain competitive with the TBIC when the number of features is low, while the proposed criterion outperforms both competing methods in high dimensions.
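A Data-3-like heavy-tailed cluster can be generated with the classical Gaussian-over-chi-squared representation of the t distribution: each sample is μ + z·sqrt(ν/u) with z Gaussian and u ∼ χ²_ν, with one u shared by all r coordinates. The sketch below is our own illustration (the constants are not the paper's).

```python
import math
import random

random.seed(2)
nu, r, Nk = 3, 5, 500  # illustrative values, not the paper's settings

def t_sample(mu, nu, r):
    # u ~ chi-squared with nu degrees of freedom (sum of nu squared Gaussians)
    u = sum(random.gauss(0, 1) ** 2 for _ in range(nu))
    scale = math.sqrt(nu / u)  # one scale shared by all r coordinates
    return [mu + random.gauss(0, 1) * scale for _ in range(r)]  # Psi = I_r

cluster = [t_sample(0.0, nu, r) for _ in range(Nk)]
# Heavy tails: t_3 puts far more mass beyond |x| > 3 than the Gaussian's ~0.3%.
tail = sum(1 for x in cluster if abs(x[0]) > 3) / Nk
print(round(tail, 3))  # noticeably larger than 0.003
```

Sharing a single u across coordinates is what makes the sample multivariate t rather than a product of independent univariate t variables.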