 # Weighted likelihood mixture modeling and model based clustering

A weighted likelihood approach for robust fitting of a mixture of multivariate Gaussian components is developed in this work. Two approaches have been proposed that are driven by a suitable modification of the standard EM and CEM algorithms, respectively. In both techniques, the M-step is enhanced by the computation of weights aimed at downweighting outliers. The weights are based on Pearson residuals stemming from robust Mahalanobis-type distances. Formal rules for robust clustering and outlier detection can be also defined based on the fitted mixture model. The behavior of the proposed methodologies has been investigated by some numerical studies and real data examples in terms of both fitting and classification accuracy and outlier detection.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Multivariate normal mixture models represent a very popular tool for both density estimation and clustering

(McLachlan and Peel, 2004). The parameters of a mixture model are commonly estimated by maximum likelihood by resorting to the EM algorithm (Dempster et al., 1977). Let be a random sample of size . The mixture likelihood can be expressed as

 L(y;τ)=n∏i=1K∑k=1πkϕp(yi;μk,Σk) , (1)

where , is the -dimensional multivariate normal density,

denotes the vector of prior membership probabilities and

are the mean vector and variance-covariance matrix of the

component, respectively. Rather than using the likelihood in (1), the EM algorithm works with the complete likelihood function

 Lc(y;τ)=n∏i=1K∏k=1[πkϕp(yi;μk,Σk)]uik , (2)

where is an indicator of the unit belonging to the component. The EM algorithm iteratively alternates between two steps: expectation (E) and maximization (M). In the E-step, the posterior expectation of (2) is evaluated by setting

equal to the posterior probability that

belongs to the component, i.e.

 uik∝πkϕp(yi;μk,Σk) ,

whereas at the M step , and are estimated conditionally on .

An alternative strategy is given by the penalized Classification EM (CEM) algorithm (Symon, 1977; Bryant, 1991; Celeux and Govaert, 1993): the substantial difference is that the E-step is followed by a C-step (where C stands for classification) in which is estimated as either or , meaning that each unit is assigned to the most likely component, conditionally on the current parameters’ values, i.e. , and for . The classification approach is aimed at maximizing the corresponding classification likelihood (2) over both the mixture parameters and the individual components’ labels. In the case , then the standard CEM algorithm is recovered. A detailed comparison of the EM and CEM algorithms can be found in Celeux and Govaert (1993).

When the sample data are prone to contamination and several unexpected outliers occur with respect to (w.r.t.) the assumed mixture model, maximum likelihood is likely to lead to unrealistic estimates and to fail in recovering the underlying clustering structure of the data (see Farcomeni and Greco (2015a) for a recent account). In the presence of noisy data that depart from the underlying mixture model, there is the need to replace maximum likelihood with a suitable robust procedure, leading to estimates and clustering rules that are not badly affected by contamination.

The need for robust tools in the estimation of mixture models has been first addressed in Campbell (1984), who suggested to replace standard maximum likelihood with M-estimation. In a more general fashion, Farcomeni and Greco (2015b)

proposed to resort to multivariate S-estimation of location and scatter in the M-step. Actually, the authors focused their attention on Hidden Markov Models but their approach can be adapted from dynamic to static finite mixtures. According to such strategies, each data point is attached a weight lying in

(a strategy commonly addressed as soft trimming). An alternative approach to robust fitting and clustering is based on hard trimming procedures, i.e. a crispy weight is attached to each observation: atypical observations are expected to be trimmed and the model is fitted by using a subset of the original data. The tclust methodology (Garcia-Escudero et al., 2008; Fritz et al., 2013) is particularly appealing. A very recent extension has been discussed in Dotto et al. (2016), who proposed a reweighted trimming procedure named rtclust. Both the tclust and rtclust are based on penalized CEM algorithms modified by a trimming strategy: at each iteration the most distant data points, with distances computed conditionally on the current cluster assignments, are discarded and a trimmed classification likelihood is maximized. A related proposal has been presented in Neykov et al. (2007) based on the so called trimmed likelihood methodology. Furthermore, it is worth to mention that mixture model estimation and clustering can be also implemented by using the adaptive hard trimming strategy characterizing the Forward Search (Atkinson et al., 2013).

There are also different proposals aimed at being robust that are not based on soft or hard trimming procedures. Some of them are characterized by the use of flexible components in the mixture. The idea is that of embedding the Gaussian mixture in a supermodel: McLachlan and Peel (2004)

introduced a mixture of Student’s t distributions, a mixture of skewed Student’s t distributions has been proposed in

Lin (2010) and Lee and McLachlan (2014), whereas Fraley and Raftery (1998, 2002) considered an additional component modeled as a Poisson process to handle noisy data (the method is available from package mclust in R (Fraley et al., 2012). A robust approach, named otrimle, has been proposed recently by Coretto and Hennig (2016), who considered the addition of an improper uniform mixture component to accomodate outliers.

We propose a robust version of both the EM and the penalized CEM algorithms to fit a mixture of multivariate Gaussian components based on soft trimming, in which weights are evaluated according to the weighted likelihood methodology (Markatou et al., 1998). A first attempt in this direction has been pursued by Markatou (2000). Here, that approach has been developed further and made more general leading to a newly established technique, in which weights are based on the recent results stated in Agostinelli and Greco (2018)

. The methodology leads to a robust fit and is aimed at providing both cluster assignment of genuine data points and outlier detection rules. Data points flagged as anomalous are not meant to be classified into any of the clusters. Furthermore, a relevant aspect of our proposal is represented by the introduction of constraints, not considered in

Markatou (2000), aimed at avoiding local or spurious solutions (Fritz et al., 2013).

Some necessary preliminaries on weighted likelihood estimation are given in Section 2. The weighted EM and penalized CEM algorithms are introduced in Section 3 and classification and outlier detection rules are outlined. Some illustrative example based on simulated data are given in Section 4, whereas Section 5 is devoted to model selection. Numerical studies are presented in Section 6 and real data examples are discussed in Section 7.

## 2 Background

Let us assume a mixture model composed by heterogeneous multivariate Gaussian components, where is fixed in advance, with density function denoted by . Markatou (2000) suggested to work with the following weighted likelihood estimating equation (WLEE) in the M-step of the EM algorithm:

 n∑i=1wik∑j=1uij∂∂τ[logπj+logϕp(yi;μj,Σj)]=0 . (3)

We notice that maximum likelihood estimates are replaced by weighted estimates. The weights are defined as

 wi=w(δ(yi))=[A(δ(yi))+1]+δ(yi)+1 , (4)

where denotes the positive part, is the Pearson residual function and is the Residual Adjustment Function (RAF, Basu and Lindsay (1994)). The Pearson residual gives a measure of the agreement between the assumed model and the data, that are summarized by a non-parametric density estimate , based on a kernel indexed by a bandwidth , that is

 δ(y)=^mn(y)m(y;τ)−1 , (5)

with . In the construction of Pearson residuals, Markatou (2000) suggested to use a smoothed model density in the continuous case, by using the same kernel involved in non-parametric density estimation (see Basu and Lindsay (1994); Markatou et al. (1998) for general results), i.e.

 m∗(y;τ)=∫k(t;y,h)m(t;τ)dt.

When the model is correctly specified, the Pearson residual function (5) evaluated at the true parameter value converges almost surely to zero, whereas, otherwise, for each value of the parameters, large Pearson residuals detect regions where the observation is unlikely to occur under the assumed model. The weight function (4) can be choosen to be unimodal so that it declines smoothly as the residual departs from zero. Hence, those observations lying in such regions are attached a weight that decreases with increasing Pearson residual. Large Pearson residuals and small weights will correspond to data points that are likely to be outliers. The RAF plays the role to bound the effect of large residuals on the fitting procedure, as well as the Huber and Tukey-bisquare function bound large distances in M-type estimation and we assume is such that . Here, we consider the families of RAF based on the Power Divergence Measure

 Apdm(δ)={τ((δ+1)1/τ−1)τ<∞log(δ+1)τ→∞

Special cases are maximum likelihood (, as the weights become all equal to one), Hellinger distance (

) and Neyman’s Chi–Square (). Another example is given by the Generalized Kullback-Leibler divergence (GKL) defined as

 Agkl(δ)=log(τδ+1)/τ, 0≤τ≤1.

Maximum likelihood is a special case when and Kullback–Leibler divergence is obtained for .

The shape of the kernel function has a very limited effect on weighted likelihood estimation. On the contrary, the smoothing parameter directly affects the robustness/efficiency trade-off of the methodology in finite samples. Actually, large values of

lead to Pearson residuals all close to zero and weights all close to one and, hence, large efficiency, since the kernel density estimate is stochastically close to the postulated model. On the other hand, small values of

make the kernel density estimate more sensitive to the occurrence of outliers and the Pearson residuals become large for those data points that are in disagreement with the model. In other words, in finite samples more smoothing will lead to higher efficiency but larger bias under contamination.

## 3 Weighted likelihood mixture modeling

The approach proposed by Markatou (2000) exhibits the same limitations that have been highlighted in Agostinelli and Greco (2018)

in the case of weighted likelihood estimation of multivariate location and scatter. The main drawbacks are driven by the employ of multivariate kernels. Actually, in the multivariate setting the computation of weights becomes troublesome because multivariate non-parametric density estimation is prone to the curse of dimensionality and the selection of

becomes problematic, as well. The novel technique proposed in Agostinelli and Greco (2018) is based on the Mahalanobis distances

 d=d(y;μ,Σ)=[(y−μ)TΣ−1(y−μ)]1/2.

Then, Pearson residuals and weights are obtained by comparing a univariate kernel density estimate based on squared distances and their underlying (asymptotic) distribution at the assumed multivariate normal model, rather than working with multivariate data and multivariate kernel density estimates, i.e

 δ(y)=^mn(d2)mχ2p(d2)−1 , (6)

where

 ^mn(t)=n−1n∑i=1k(t;d2,h)

is an unbiased at the boundary univariate kernel density estimate over and denotes the density function. It is worth noting that Pearson residuals can be evaluated w.r.t. the original density, so avoiding model smoothing (see also Kuchibhotla and Basu (2015, 2018)).

By exploiting the approach developed in Agostinelli and Greco (2018), we propose a weighted EM algorithm and penalized CEM algorithm whose M-steps are characterized by soft trimming procedures based on the Pearson residuals introduced in (6).

The Weighted EM algorithm (WEM) is structured as follows:

1. Initialization:

 τ(0)=(π(0),μ(0)1,…,μ(0)K,Σ(0)1,…,Σ(K)1) .

An appealing starting solution is given by a very robust method such as tclust based on a large rate of trimming but other robust candidate solutions can be used to initialize the algorithm. In order to avoid the algorithm to be dependent on initial values, a simple and common strategy is to run the algorithm from a number (say, 20 to 50) of starting values. For instance, different initial solutions can be obtained by randomly perturbing the deterministic starting solution and/or the final one obtained from it (Farcomeni and Greco, 2015b). Details on how to choose the best solution will be given at the end of the Section.

2. E step: the standard E step is left unchanged, with

 u(s)ik=π(s−1)kϕp(yi;μ(s−1)k,Σ(s−1)k)∑Kk=1π(s−1)kϕp(yi;μ(s−1)k,Σ(s−1)k)
3. Weighted M step: based on current parameter estimates,

1. Soft-trimming: let us evaluate component-wise Mahalanobis-type distances

 d(s)ik=d(yi;μ(s−1)k,Σ(s−1)k) .

Then, for each group, compute Pearson residuals and weights as

 δ(s)ik=^mn(d(s)2ik)mχ2p(d(s)2ik)−1

and

 w(s)ik=[A(δ(s)ik)+1]+δ(s)ik+1

respectively.

2. Update membership probabilities and component specific parameter estimates:

 π(s+1)k = n∑i=1u(s)ikw(s)ik∑ni=1∑Kk=1u(s)ikw(s)ik μ(s+1)k = n∑i=1yiw(s)iku(s)ik∑ni=1w(s)iku(s)ik Σ(s+1)k = n∑i=1(yi−μ(s+1)k)(yi−μ(s+1)k)Tw(s)iku(s)ik∑ni=1w(s)iku(s)ik
3. Set  .

It is worth noting that at the M-step it is proposed to solve the following WLEE

 n∑i=1k∑j=1uij∂∂τ[logπj+logϕp(yi;μj,Σj)]wij , (7)

that is characterized by the evaluation of component-wise sets of weights, rather than one weight for each observation, as in equation (3).

The Weighted penalized CEM algorithm (WCEM) is obtained by introducing a standard C-step between the E-step and the weighted M-step. The main feature of the WCEM algorithm is that one single weight is attached to each unit, based on its current assignment after the C-step, rather than component-wise weights. Then, the resulting WLEE shows the same structure as in (3) but with the difference that or .The WCEM is described as follows:

1. Initialization:

 τ(0)=(π(0),μ(0)1,…,μ(0)K,Σ(0)1,…,Σ(0)K) .
2. E step:

 u(s)ik=π(s−1)kϕp(yi;μ(s−1)k,Σ(s−1)k)∑Kk=1π(s−1)kϕp(yi;μ(s−1)k,Σ(s−1)k)
3. C step: let identify the cluster assignment for the unit at the iteration. Then

 ~u(s)ik={1if k=ki0if k≠ki .
4. Weighted M step: based on current parameter estimates and cluster assignments ,

1. Soft-trimming: evaluate the Mahalanobis-type distances of each point w.r.t. the component it belongs in

Then, compute the corresponding Pearson residuals and weights as

 δ(s)i=δ(s)iki=^mn(d(s)2iki)mχ2p(d(s)2iki)−1

and

 w(s)i=w(s)iki=[A(δ(s)iki)+1]+δ(s)iki+1

respectively, where

 ^mn(d2iki) = 1∑ni=1~uikik(d2;d2iki,h) = 1∑ni=1~uikk(d2;d2ik,h)~uik .

Hence, component-wise kernel density estimates only involve distances conditionally on cluster assignment.

2. Update membership probabilities and component specific parameter estimates:

 π(s+1)k = n∑i=1~u(s)ikw(s)iki∑ni=1w(s)iki μ(s+1)k = n∑i=1yiw(s)iki~u(s)ik∑ni=1w(s)iki~u(s)ik Σ(s+1)k = n∑i=1(yi−μ(s+1)k)(yi−μ(s+1)k)Tw(s)iki~u(s)ik∑ni=1w(s)iki~u(s)ik.
3. Set

It is worth noting that both the weighted algorithms returns weighted estimates of covariance. The final output can be suitably modified in order to provide unbiased weighted estimates.

A strategy to select the best root is given in Agostinelli and Greco (2018). The selected root is that with the lowest fitted probability

 Pr^τ[δ(^d2;^τ,^Mn)<−0.95] . (8)

It is worth to stress that Pearson residuals involved in (8) have to be evaluated conditionally on cluster assignment based on the original multivariate data for purely selection purposes. The sum of the weights also provides some guidance for root selection. Actually, if than the WLE is close to the MLE, whereas if is too small, than the corresponding WLE is a degenerate solution, indicating that it only represents a small subset of the data. The fitted weights are evaluated w.r.t. the final cluster assignment, as well.

### 3.1 Eigen-ratio constraint

It is well known that maximization of the mixture likelihood (1) or the classification likelihood (2) is an ill-posed problem since the objective function may be unbounded (Day, 1969; Maronna and Jacovkis, 1974). Therefore, in order to avoid such problems, the optimization is performed under suitable constraints. In particular, we employed the eigen-ratio constraint defined as

 maxjmaxkλj(Σk)minjminkλj(Σk)≤c,j=1,2…,p, k=1,2,…,K (9)

where denoted the eigenvalue of the covariance matrix and is a fixed constant not smaller than one aimed at tuning the strength of the constraint. For spherical clusters are imposed, while as increases varying shaped clusters are allowed. The eigen-ratio constraint (9) can be satisfied at each iteration by adjusting the eigenvalues of each . This is achieved by replacing them with a truncated version

 λ∗j(Σk)=⎧⎪ ⎪⎨⎪ ⎪⎩cif λj(Σk)cθc

where is an unknown bound depending on . The reader is pointed to Fritz et al. (2013); Garcia-Escudero et al. (2015) for a feasible solution to the problem of finding .

### 3.2 The selection of h

The selection of is a crucial task. Let be the average of the weights at the true parameter value. Then, the expected value of tells us about the expected rate of contamination in the data. In the case of a normal kernel, Markatou (2000) suggested to tune for a fixed (approximated) expected downweighting level. According to authors’ experience (see Agostinelli and Greco (2018), Agostinelli and Greco (2017) and Greco (2017), for instance), but also as already suggested by Markatou et al. (1998), a safe selection of can be achieved by monitoring the empirical downweighting level as varies, with , where the weights at convergence are evaluated at the fitted parameter value, conditionally on the final cluster assignment. The monitoring of WLE analyses has been applied successfully in Agostinelli and Greco (2017) to the case of robust estimation of multivariate location and scatter. The reader is pointed to the paper by Cerioli et al. (2017) for an account on the benefits of monitoring.

The tuning of the smoothing parameter could be based on several quantities of interest stemming from the fitted mixture model. For instance, one could monitor the weighted log-likelihood at convergence or unit-specific robust distances conditionally on the final cluster assignment, rather than the empirical downweighting level. A good strategy would be to monitor different quantities, for a varying number of clusters. For instance, if an abrupt change in the monitored empirical downweighting level or in the robust distances is observed, it may indicate the transition from a robust to a non robust fit and aid in the selection of a value of that gives an appropriate compromise between efficiency and robustness. It is worth to note that, the trimming level in tclust or the improper density constant in otrimle are selected in a monitoring fashion, as well.

### 3.3 Classification and outlier detection

The WCEM automatically provides a classification of the sample units, since the value of at convergence is either zero or one. With the WEM, by paralleling a common approach, a Maximum-A-Posteriori criterion can be used for cluster assignment, that is a C-step is applied after the last E-step. Such criteria lead to classify all the observations, both genuine and contaminated data, meaning that also outliers are assigned to a cluster. Actually, we are not interested in classifying outliers and for purely clustering purposes outliers have to be discarded.

We distinguish two main approaches to outlier detection. According to the first, outlier detection should be based on the robust fitted model and performed separately by using formal rules. The key ingredients in multivariate outlier detection are the robust distances (Rousseeuw and Van Zomeren, 1990; Cerioli, 2010). The reader is pointed to Cerioli and Farcomeni (2011) for a recent account on outlier detection. An observation is flagged as an outlier when its squared robust distance exceeds a fixed threshold, corresponding to the

level quantile of the reference (asymptotic) distribution of the squared robust distances. A common solution is represented by the use of the

and popular choices are and . In the case of finite mixtures, the main idea is that the outlyingness of each data point should be measured conditionally on the final assignment. Hence, according to a proper testing strategy, an observation is declared as outlying when

 d2iki>χ2p;1−α , d2iki=(yi−^μki)T^Σki(yi−^μki) . (10)

The second approach stems from hard trimming procedures, such as tclust, rtclust and otrimle. These techniques are not meant to provide simultaneous robust fit and outlier detection based on formal testing rules, but outliers are identified with those data points falling in the trimmed set or assigned to the improper density component, respectively. Therefore, by paralleling what happens with hard trimming, one could flag as outliers those data points whose weight, conditionally on the final cluster assignment, is below a fixed (small) threshold. Values as 0.10 or 0.20 seem reasonable choices. Furthermore, the empirical downweighting level represents a natural upper bound for the cut-off value, that would give an indication of the largest tolerable swamping and of the minimum feasible masking for the given level of smoothing. This approach is motivated by the fact that the multivariate WLE shares important features with hard trimming procedures, even if it is based on soft trimming, as claimed in Agostinelli and Greco (2017).

The process of outlier detection may result in type-I and type-II errors. In the former case, a genuine observation is wrongly flagged as outlier (swamping), in the latter case, a true outlier is not identified (masking). Swamped genuine observations are false positives, whereas masked outliers are false negatives. According to the first strategy, the larger

the more swamping and the less masking. In a similar fashion, the higher the threshold the more swamping and the less masking will characterize the second approach to outlier detection.

In the following, both approaches to outlier detection will be taken into account and critically compared.

## 4 Illustrative examples

### 4.1 Simulated data

Let us consider a three component mixture model with , , , and

 Σ1=(1−0.5−0.51),Σ2=(21.251.252),Σ3=(3−1.75−1.753).

and a simulated sample of size , with of background noise. Outliers have been generated uniformly within an hypercube whose dimensions include the range of the data and are such that the distance to the closest component is larger than the -level quantile of a distribution. WEM and WCEM have been run by setting the eigen-ratio restriction constant to (the true value is 9.5). The weights are based on the Generalized Kullback-Libler divergence and a folded normal kernel. The smoothing parameter has been selected by monitoring the empirical downweighting level and unit specific clustering-conditioned distances over a grid of values (Agostinelli and Greco, 2017). Figure 1 displays the monitoring analyses of the empirical downweighting level and the robust distances for the WEM. In both panels an abrupt change is detected, meaning that for values on the right side of the vertical line the procedure is no more able to identify the outliers, hence being not robust w.r.t. the presence of contamination. Similar trajectories are observed for the WCEM. A color map has been used that goes from light gray to dark gray in order to highlight those trajectories corresponding to observations that are flagged as outlying for most of the monitoring. Figures 2 displays the result of applying both the WEM and WCEM algorithm to the sample at hand with an outlier detection rule based on the 0.99-level quantile of the distribution and on a threshold for weights set at 0.2. Component-specific tolerance ellipses are based on the 0.95-level quantile of the distribution. We notice that both methods succeed in recovering the underlying structure of the clean data and that the outlier detection rules provide quite similar and satisfactory outcomes. The entries in Table 1 gives the rate of detected outliers , swamping and masking stemming from the alternative strategies. Figure 1: Simulated data. Monitoring the empirical downweighting level (left) and robust distances (right) based on WEM. The vertical lines give the selected h. The horizontal line gives the χ22;0.99 quantile. Figure 2: Simulated data. Fitted components, cluster assignments and outlier detection by WEM (left) and WCEM (right). Top row: outlier detection based on d2ki<χ2p;0.99. Bottom row: outlier detection based on wki<0.2 95% tolerance ellipses overimposed.

## 5 Model selection

In model based clustering, formal approaches to choose the number of components are based on the value of the log-likelihood function at convergence. Criteria such as the BIC or the AIC are commonly used to select when running the classical EM algorithm. In a robust setting, in tclust the number of clusters is chosen by monitoring the changes in the trimmed classification likelihood over different values of and contamination levels. A formal approach has not been investigated yet in the case of the otrimle, even if the authors conjectures that a monitoring approach or the development of information criteria can be pursued as well.

Here, when the robust fit is achieved by the WEM algorithm, we suggest to resort to a weighted counterparts of the classical AIC or BIC criteria, according to the results stated in Agostinelli (2002). Then, the proposed strategy is based on minimizing

 Qw(K)=−2ℓw(y;^τ)+m(K) (11)

where and is a penalty term reflecting model complexity.

Figure 3 displays the behavior of the weighted BIC for the set of simulated data of Section 4, for different choices of over a grid of values for the smoothing parameter . A similar trajectory is observed for both and . The abrupt change is detected at close values of but the choice is preferred since it leads to a smaller weighted BIC before the robust fit turns into a non robust one. Figure 3: Example 1. Monitoring the weighted BIC. The vertical line gives the selected h.

Furthermore, for what concerns the WCEM algorithm, one could mimic the approach used in tclust and monitor the weighted conditional likelihood at convergence for varying and . Then, the number of clusters should be set equal to the minimum for which there is no substantial improvement in the objective function when adding one group more.

It can be proved that the robust criterion in (11) is asymptotically equivalent to its classical counterparts at the assumed model, i.e. when the data are not prone to contamination. The proof is based on some regularity conditions about the kernel and the model that are required to assess the asymptotic behavior of the WLE (Agostinelli, 2002; Agostinelli and Greco, 2013, 2018). In the case of finite mixture models, it is assumed further that an ideal clustering setting holds under the postulated mixture model, that is data are assumed to be well clustered. The following Theorem holds.

Theorem. Let be the set of points belonging to the component, whose cardinality is . The full data is defined as with and . Assume that (i) the model is correctly specified, (ii) the WLE is a consistent estimator of , (iii) . Then, .
Proof. Let denote the maximum likelihood estimate.

 12|Qw(k)−Q(k)| = ∣∣ ∣∣∑iwiℓ(yi;^τ)−∑iℓ(yi;~τ)∣∣ ∣∣ ≤ {∣∣ ∣∣∑i(wi−1)ℓ(yi;^τ)∣∣ ∣∣} + {|ℓ(yi;^τ)−ℓ(yi;~τ)|} ≤ ∑i|(wi−1)ℓ(yi;^τ)| ≤ supy|wi−1|∑iℓ(yi;^τ) p→0 as nj→∞

## 6 Numerical studies

In this section, we investigate the finite sample behavior of the proposed WEM and WCEM algorithms. Both algorithms are still based on non optimized R code. Nevertheless, the results that follow are satisfactory and computational time always lays in a feasible range. We set , and simulate data according to the M5 scheme as introduced in Garcia-Escudero et al. (2008). Clusters have been generated by

-variate Gaussian distributions with parameters

 μ1 = (−β,−β,0,…,0), μ2 = (0,β,0,…,0), μ3 = (β,0,0,…,0)

and

where is a null row vector of dimension and is the identity matrix. Dimensions have been taken into account. The parameter regulates the degree of overlapping among clusters: smaller values yield severe overlapping whereas larger values give a better separation. Here, we set . Theoretical cluster weights are fixed as . Outliers have been generated uniformly within an hypercube whose dimensions include the range of the data and are such that the distance to the closest component is larger than the -level quantile of a distribution. The rate of contamination has been set to . The case has been used to evaluate the efficiency of the proposed techniques when applied to clean data. The numerical studies are based on Monte Carlo trials. The weighted likelihood algorithms are both based on a folded normal kernel and a GKL RAF (with ), whereas we set as eigen-ratio constraint. The smoothing parameter has been selected in such a way that the empirical downweighting level lies in the range under contamination, whereas it is about when no outliers occur. Fitting accuracy has been evaluated according to the following measures:

1. , where and are matrices with and in each row, respectively, for ;

2. , where denotes the condition number of the matrix ;

3. .

For what concerns the task of outlier detection, several strategies have been compared: we considered a detection rule based on the 0.99-level quantile of the distribution, according to (10), but also based on the fitted weights, with thresholds set at , and . For each decision rule, for the contaminated scenario, we report (a) the rate of detected outliers ; (b) the swamping rate; (c) the masking rate. The first is a measure of the fitted contamination level, whereas the others give insights on the level and power of the outlier detection procedure. We notice that comparisons across the different methods should be considered the more fair the closer are the values of . For , swamping only is taken into account.

Classification accuracy has been measured by (i) the Rand index and (ii) the misclassification error rate (MCE), both evaluated over true negatives for the robust techniques. In order to avoid problems due to label switching issues, cluster labels have been sorted according to the first entry of the fitted location vectors.

Under the assumed model, WEM and WCEM have been initialized by tclust with of trimming and their behavior have been compared with the EM and CEM algorithms and the otrimle, for the same eigen ratio constraint and the same initial values. In the presence of contamination, we do not report the results concerning the non robust EM and CEM but only those regarding the WEM, WCEM, otrimle and oracle tclust, i.e. with trimming level equal to the actual contamination level (tclust10 and tclust20). In this case, starting values have been driven by tclust with of trimming.

It is worth to stress, here, that the comparison in terms of outlier detection reliability between weighted likelihood estimation, tclust and otrimle can be considered fair only by looking at the rate of weights below the fixed threshold for the former methodology and trimmed observation or those assigned to the improper density group for the latter techniques, since formal testing rules have not ben considered neither for tclust or otrimle.

First, let us consider the behavior of WEM and WCEM at the assumed model. The entries in Table 2 give the considered average measures of fitting accuracy; Table 3 gives the level of swamping according to the different strategies for WEM and WCEM, that are based on the distribution and the inspection of weights; Table 4 reports classification accuracy. The overall behavior of WEM and WCEM is appreciable: we observe a tolerable efficiency loss, a negligible swamping effect and a reliable classification accuracy, indeed, as compared with the non robust procedures. Furthermore, the results are quite similar to those stemming from otrimle and quite often the inspection of the weights from WEM and WCEM leads to a smaller number of false positives, on average.

The performance of WEM and WCEM under contamination is explored next. The fitting accuracy provided by the proposed weighted likelihood based strategies is illustrated in Tables 5, 6, 7. In all considered scenarios, the behavior of WEM and WCEM is satisfactory and they both compare well with the oracle tclust and otrimle. In particular, the good performance of WEM and WCEM has to be remarked in the challenging situation of severe overlapping. Furthermore, for all data configurations, we notice the ability of WEM to combine accurate estimates of component-specific parameters with those of the cluster weights. The entries in Tables 8, 9, 10 show the behavior of the testing procedure based on the distribution and the inspection of weights for all considered scenarios. The empirical level of contamination is always larger than the nominal one but it is acceptable and stable as and change. Masking is always negligible hence highlighting the appreciable power of the testing procedure. We remark that one could also consider multiple testing adjustments in outlier detection as outlined in Cerioli and Farcomeni (2011). To conclude the analysis, Tables 11, 12, 13 give the considered measures of classification accuracy as varies. The results are quite stable across the four methods and all dimensions. As well as before, WEM and WCEM lead to a satisfactory classification, even in the challenging case of severe overlapping.

## 7 Real data examples

### 7.1 Swiss bank note data

Let us consider the well known Swiss banknote dataset concerning measurements of old Swiss 1000-franc banknotes, half of which are counterfeit. The weighted likelihood strategy is based on a gamma kernel and a symmetric Chi-square RAF. Our first task is to choose the number of clusters. To this end we look at the weighted BIC (11) on a fixed grid of values for and a restriction factor . The inspection of Figure 4 clearly suggests a two groups structure for all considered values of the smoothing parameter . The empirical downweighting level is fairly stable for a wide range of values. We decided to set leading to an empirical downweighting level equal to . The WEM algorithm based on the testing rule (10) with leads to identify 21 outliers, that include 15 forged and 6 genuine bills. On the contrary, there are 19 data points whose weight is lower than , that include 14 forged and 5 genuine bills.The cluster assignments stemming from the latter approach is displayed in Figure 5. It is worth to note that the outlying forged bills coincide with the group that has been recognized to follow a different forgery pattern and characterized by a peculiar length of the diagonal (see García-Escudero et al. (2011); Dotto et al. (2016) and references therein). On the other hand, the outlying genuine bills all exhibit some extreme measures. For the same value of the eigen-ratio constraint, the otrimle assignes 19 bills to the improper component density, leading to the same classification of the WEM, whereas rtclust includes in the trimming set one counterfeit bill more. A visual comparison between the three results is possible from Figure 6, whose panels show a scatterplot of the fourth against the sixth variable with the classification resulting from WEM (with both outlier detection rules), rtclust and otrimle, respectively. The WCEM has been tuned to achieve the same empirical downweighting level and leads to the same results. Figure 4: Swiss banknote data. Monitoring the weighted BIC for WEM, K=1,2,3,4. Figure 5: Swiss banknote data. Cluster assignments by WEM. Observation whose weight is lower than 1−¯w are considered outliers. Genuine bills are denoted by a green +, forged bills by a red △. Outliers are denoted by a black filled circle. Figure 6: Swiss banknote data. Fourth against the sixth variable with cluster assignments by WEM (α=0.01), WEM (wki<1−¯w), rtclust and otrimle in clockwise fashion. Genuine bills are denoted by a green +, forged bills by a red △. Outliers are denoted by a black filled circle.

### 7.2 2018 World Happiness Report data

In this section the weighted likelihood methodology is applied to a dataset from the 2018 World Happiness Report by the United Nations Sustainable Development Solutions Network (Helliwell et al., 2018) (hereafter denoted by WHR18). The data give measures about six key variables used to explain the variation of subjective well-being across countries: per capita Gross Domestic Product (on log scale), Social Support, i.e. the national average of the binary responses to the Gallup World Poll (GWP) question If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?, Health Life Expectancy at birth, Freedom to make life choices, i.e. the national average of binary responses to the GWP question Are you satisfied or not with your freedom to choose what you do with your life?, Generosity, measured by the residual of regressing the national average of GWP responses to the question Have you donated money to a charity in the past month? on GDP per capita, perception of Corruption, i.e. the national average of binary responses to the GWP questions Is corruption widespread throughout the government or not? and Is corruption widespread within businesses or not?. The dataset is made of 142 rows, after the removal of some countries characterized by missing values. The objective is to obtain groups of countries with a similar behavior, to identify possible countries with anomalous and unexpected traits and to highlight those features that are the main source of separation among clusters.

In this example, a GKL RAF has been chosen, the unbiased at the boundary kernel density estimate has been obtain by first evaluating a kernel density estimate on the log-transformed squared distance over the whole real line and then back-transforming the fitted density to (Agostinelli and Greco, 2018), and we set .

In order to select the number of clusters, we monitored the weighted BIC stemming from WEM and the Classification log-Likelihood at convergence from WCEM for different values of and . The corresponding monitoring plots are given in Figure 7, respectively. Based on the WEM algorithm, is to be preferred, even if the gap with the case is very small for all considered values of the smoothing parameter. On the other hand, the inspection of the weighted Classification log-Likelihood driven by the WCEM suggests . Therefore, we have applied our WEM and WCEM algorithms both based on and . As with two groups are not very separated, we preferred and reported only those results for reasons of space. Moreover, the results stemming from WEM and WCEM were very similar both in terms of fitted parameters, cluster assignments and detected outliers. Then, in the following we only give the results driven by WEM. The empirical downweighting level was found not to depend in a remarkable fashion on the number of groups. In particular, for , in the monitoring process of we did not observe any abrupt change but a smooth decline until a stabilization of the level of contamination occurred. Then, we decided to use a value leading to . Figure 8 displays the distance plot stemming from WEM. According to (10) for a level , 12 outliers are detected. A closer inspection of the plot unveils that some of such points are close to the cut-off value. Therefore, they are not considered as outliers but correctly assigned to the corresponding cluster. Furthermore, we notice that all the points leading to the largest distances are attached a very small weight (). The weight corresponding to Myanmar is about 0.40. The other countries near the threshold line all show weights about equal to 0.80.

The cluster profiles and raw measurements for the detected outlying countries are reported in Table 14. The three clusters are well separated in terms of all of the items considered, even if small differences are seen in terms of perception of Corruption between clusters 1 and 2. Cluster 3 includes all the countries characterized by the highest level of subjective well-being. The differences with the other two clusters concerning GDP, HLE and Corruption are outstanding. On the opposite, in cluster 1 we find all the countries with the hardest economic and social conditions. For what concerns the explanation of the outliers, we notice that the subgroup of African countries composed by Central African Republic, Chad, Ivory Coast, Lesotho, Nigeria and Sierra Leone might belong to group 1 but is characterized by the six lowest HLE values. For what concerns Myanmar, it exhibits extremely large Freedom and Generosity indexes and a surprising small value for Corruption. It should belong to cluster 1 but it is closer to cluster 3 according to the last three measurements, indeed. For instance, it may be supposed that such measurements are not completely reliable because of problems with the questionnaires and the sampling or they may revel a surprising positive attitude despite of the difficult economic and life conditions.

A spatial map of cluster assignments is given in Figure 9, that confirms the goodness and coherence of the results and the ability of the considered six features to find reasonable clusters. Cluster 1 is mainly localized in Africa, cluster 2 is composed by developing countries, whereas cluster 3 includes the world leading countries, among which there are USA, Canada, Australia and the countries in the European Union. Figure 7: WHR18 data. Monitoring the weighted BIC for WEM (left) and the weighted Classification log-Likelihood for WCEM, K=2,3,4,5,6. Figure 8: WHR18 data. Distance plot for WEM with K=3. The horizontal line gives the χ26;0.99 quantile. Figure 9: WHR18 data. Spatial classification from WEM with K=3.

## 8 Conclusions

We have proposed a robust technique for fitting a finite mixture of multivariate Gaussian components based on recent developments in weighted likelihood estimation. Actually, the proposed methodology is meant to provide a step further with respect to the original proposal in Markatou (2000). The method is based on the idea of using a univariate kernel density estimate based on robust distances rather than a multivariate one based on the data in order to compute weights. Furthermore, the proposed technique is characterized by the introduction of an eigen constraint aimed at avoiding problems connected with an unbounded likelihood or spurious solutions.

Based on the robustly fitted mixture model, a model based clustering strategy can be built in a standard fashion by looking at the value of posterior membership probabilities. At the same time, formal rules for outlier detection can be derived, as well. Then, one could assign units to clusters provided that the corresponding outlyingness test is not significant, that means that detected outliers have to be discarded and not assigned to any group. The numerical studies and the real data examples showed the satisfactory reliability of the proposed methodology.

There is still room for further work, along a path shared with tclust, rtclust and otrimle. Actually the proposed method works for a given smoothing parameter and a fixed number of clusters

. In addition, outlier detection depends upon a fixed threshold. At the moment, the selection of

stemming from the monitoring of several quantities, such as the empirical downweighting level, the unit specific robust distances or even the fitted parameters, provides an acceptable adaptive solution. Such a procedure is not different form the implementation of a sequence of refinement steps of an initial robust partition stemming from a sequence of decreasing values of . The selection of remains a difficult problem to deal with too, despite the satisfactory behavior of the proposed criteria, i.e. the weighted BIC and the weighted Classification log-Likelihood. Outlier detection is a novel aspect in the framework of robust mixture modeling and model based clustering. In the specific context, the outlyingness of each unit is tested conditionally on the final cluster assignment. The number of outliers clearly depends on the chosen level or the selected threshold for the final weights. A fair choice of the level of the test is still an open problem in outlier detection. However, the suggested testing strategies work satisfactory, at least in those considered scenarios, and provide a good compromise between swamping and masking that could be improved further by using multiplicity adjustments (Cerioli and Farcomeni, 2011).

## References

• Agostinelli (2002) Agostinelli C (2002) Robust model selection in regression via weighted likelihood methodology. Statistics & probability letters 56(3):289–300
• Agostinelli and Greco (2013)

Agostinelli C, Greco L (2013) A weighted strategy to handle likelihood uncertainty in bayesian inference. Computational Statistics 28(1):319–339

• Agostinelli and Greco (2017) Agostinelli C, Greco L (2017) Discussion on ”the power of monitoring: How to make the most of a contaminated sample”. Statistical Methods & Applications DOI 10.1007/s10260-017-0416-9
• Agostinelli and Greco (2018) Agostinelli C, Greco L (2018) Weighted likelihood estimation of multivariate location and scatter. Test DOI https://doi.org/10.1007/s11749-018-0596-0
• Atkinson et al. (2013) Atkinson A, Riani M, Cerioli A (2013) Exploring multivariate data with the forward search. Springer Science Business Media
• Basu and Lindsay (1994) Basu A, Lindsay B (1994) Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics 46(4):683–705
• Bryant (1991) Bryant P (1991) Large-sample results for optimization-based clustering methods. Journal of Classification 8(1):31–44
• Campbell (1984) Campbell N (1984) Mixture models and atypical values. Mathematical Geology 16(5):465–477
• Celeux and Govaert (1993)

Celeux G, Govaert G (1993) Comparison of the mixture and the classification maximum likelihood in cluster analysis. Journal of Statistical Computation and simulation 47(3-4):127–146

• Cerioli (2010) Cerioli A (2010) Multivariate outlier detection with high-breakdown estimators. Journal of the American Statistical Association 105(489):147–156
• Cerioli and Farcomeni (2011) Cerioli A, Farcomeni A (2011) Error rates for multivariate outlier detection. Computational Statistics & Data Analysis 55(1):544–553
• Cerioli et al. (2017) Cerioli A, Riani M, Atkinson A, Corbellini A (2017) The power of monitoring: How to make the most of a contaminated sample. Statistical Methods & Applications DOI 10.1007/s10260-017-0409-8
• Coretto and Hennig (2016) Coretto P, Hennig C (2016) Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust gaussian clustering. Journal of the American Statistical Association 111(516):1648–1659
• Day (1969)

Day N (1969) Estimating the components of a mixture of normal distributions. Biometrika 56(3):463–474

• Dempster et al. (1977) Dempster A, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society Series B (methodological) pp 1–38
• Dotto et al. (2016) Dotto F, Farcomeni A, Garcia-Escudero LA, Mayo-Iscar A (2016) A reweighting approach to robust clustering. Statistics and Computing pp 1–17
• Farcomeni and Greco (2015a) Farcomeni A, Greco L (2015a) Robust methods for data reduction. CRC press
• Farcomeni and Greco (2015b) Farcomeni A, Greco L (2015b) S-estimation of hidden markov models. Computational Statistics 30(1):57–80
• Fraley and Raftery (1998) Fraley C, Raftery A (1998) How many clusters? which clustering method? answers via model-based cluster analysis. The computer journal 41(8):578–588
• Fraley and Raftery (2002) Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis, and density estimation. Journal of the American statistical Association 97(458):611–631
• Fraley et al. (2012) Fraley C, Raftery A, Murphy T, Scrucca L (2012) mclust version 4 for r: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, University of Washington, Seattle
• Fritz et al. (2013) Fritz H, Garcia-Escudero L, Mayo-Iscar A (2013) A fast algorithm for robust constrained clustering. Computational Statistics Data Analysis 61:124–136
• Garcia-Escudero et al. (2008) Garcia-Escudero L, Gordaliza A, Matran C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. The Annals of Statistics pp 1324–1345
• Garcia-Escudero et al. (2015) Garcia-Escudero L, Gordaliza A, Matran C, Mayo-Iscar A (2015) Avoiding spurious local maximizers in mixture modeling. Statistics and Computing 25(3):619–633
• García-Escudero et al. (2011) García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2011) Exploring the number of groups in robust model-based clustering. Statistics and Computing 21(4):585–599
• Greco (2017) Greco L (2017) Weighted likelihood based inference for . Communications in Statistics-Simulation and Computation 46(10):7777–7789
• Helliwell et al. (2018) Helliwell J, Layard R, Sachs J (2018) World Happiness Report 2018
• Kuchibhotla and Basu (2015) Kuchibhotla A, Basu A (2015) A general set up for minimum disparity estimation. Statistics and Probability Letters 96:68–74
• Kuchibhotla and Basu (2018) Kuchibhotla A, Basu A (2018) A minimum distance weighted likelihood method of estimation. Tech. rep., Interdisciplinary Statistical Research Unit (ISRU), Indian Statistical Institute, Kolkata, India, URL https://faculty.wharton.upenn.edu/wp-content/uploads/2018/02/attemptv4p1.pdf
• Lee and McLachlan (2014) Lee S, McLachlan G (2014) Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing 24(2):181–202
• Lin (2010) Lin T (2010) Robust mixture modeling using multivariate skew t distributions. Statistics and Computing 20(3):343–356
• Markatou (2000) Markatou M (2000) Mixture models, robustness, and the weighted likelihood methodology. Biometrics 56(2):483–486
• Markatou et al. (1998) Markatou M, Basu A, Lindsay BG (1998) Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association 93(442):740–750
• Maronna and Jacovkis (1974) Maronna R, Jacovkis P (1974) Multivariate clustering procedures with variable metrics. Biometrics pp 499–505
• McLachlan and Peel (2004) McLachlan G, Peel D (2004) Finite mixture models. John Wiley & Sons
• Neykov et al. (2007) Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Computational Statistics & Data Analysis 52(1):299–308
• Rousseeuw and Van Zomeren (1990) Rousseeuw P, Van Zomeren B (1990) Unmasking multivariate outliers and leverage points. Journal of the American Statistical association 85(411):633–639
• Symon (1977) Symon M (1977) Clustering criterion and multi-variate normal mixture. Biometrics 77:35–43