 # Robust Mixture Modeling using Weighted Complete Estimating Equations

Mixture modeling that takes account of potential heterogeneity in data is widely adopted for classification and clustering problems. However, it can be sensitive to outliers especially when the mixture components are Gaussian. In this paper, we introduce the robust estimating methods using the weighted complete estimating equations for robust fitting of multivariate mixture models. The proposed approach is based on a simple modification of the complete estimating equation given the latent variables for grouping indicators with the weights that depend on the components of mixture distributions for downweighting outliers. We develop a simple expectation-estimating-equation (EEE) algorithm to solve the weighted complete estimating equations. As examples, the multivariate Gaussian mixture, mixture of experts and multivariate skew normal mixture are considered. In particular, we derive a novel EEE algorithm for the skew normal mixture which results in the closed form expressions for both the E- and EE-steps by slightly extending the proposed method. The numerical performance of the proposed method is examined through the simulated and real datasets.

Comments

There are no comments yet.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Mixture modeling (McLachlan and Peel, 2004) is a very popular statistical method for distribution estimation, regression and model-based clustering by taking account of potential heterogeneity of data. Typically, such mixture models are fitted by the maximum likelihood method using the well-known EM algorithm. However, data often contain outliers which can highly affect the maximum likelihood method. The presence of such outliers would result in biased and inefficient statistical inference on the parameters of interest and would thus make recovering underlying clustering structure of the data very difficult. A typical approach to this problem is to use a heavy tailed distribution for the mixture components (e.g. Peel and McLachlan, 2004; Früwirth-Schnatter and Pyne, 2010; Nguyen and McLachlan, 2016). However, this approach cannot distinguish meaningful observations from outliers straightforwardly since it fits a model to all the observations including outliers. Apart from using heavy-tailed distributions, there are some other robust approaches using the ideas of the extended likelihood inference (e.g. Fujisawa and Eguchi, 2006; Notsu and Eguchi, 2016; Coretto and Hennig, 2016; Greco and Agostinelli, 2019)

. However, these approaches may suffer from computational problems in that the objective functions may contain integrals especially when the component distributions are not simple like Gaussian distributions.

In this paper, we introduce a new approach to robust mixture modeling using the idea of the weighted complete estimating equations (WCE). The weight is defined based on the assumed distributions, thereby the outliers are downweighted and their information is automatically omitted from the estimating equations. Instead of directly using the weighted estimating equations for the mixture models, we consider the estimating equations given the latent grouping variables of the mixture models, which leads to the weighted estimating equations for single component distributions. Since the derived WCE depends on the unknown latent variables, the latent variables are augmented via the expectations to solve the WCE calling for an expectation-estimating-equation (EEE) algorithm. The proposed EEE algorithm is general and can be applied to a variety of mixture models. The proposed WCE method is then applied to three types of mixture models, the multivariate Gaussian mixture, mixture of experts (Jacobs et al., 1991) and multivariate skew-normal mixture (Lin et al., 2007; Lin, 2009)

. For the multivariate Gaussian mixture, updating steps of the proposed EEE algorithm are obtained in the closed forms without requiring numerical integration and optimization steps. A similar algorithm can be derived for the mixture of experts models when the component distributions are Gaussian. Moreover, for multivariate the skew normal mixtures, by introducing additional stochastic representation of the skew normal distribution, we can obtain a novel EEE algorithm as a slight extension of the proposed general EEE algorithm in which all the updating steps are proceeded analytically. Due to the rather complicated structures of the skew normal distribution compared with the standard normal distributions, the derived algorithm would be the first one to provide a feasible method for robust fitting of the skew normal mixtures.

As related works, Greco and Agostinelli (2019) employed a similar idea using the weighted likelihood for robust fitting of the multivariate Gaussian mixtures, in which the weights are modeled via the Pearson residuals with kennel density estimation rather than the Gaussian density. Thus, their approach cannot be extended straightforwardly to other mixture models such as the multivariate skew normal mixtures. On the other hand, the proposed weighted estimating equations is closely related to the density power divergence (Basu et al., 1998). A similar divergence is adopted in Fujisawa and Eguchi (2006) for robust estimation of the Gaussian mixtures against outliers. However, since the objective function includes an integral with respect to the mixture model which cannot admit an analytical form, the estimation procedure would be computationally intensive. Compared with these works, the proposed method can be applied to a variety of mixture models and the proposed algorithm can be carried out easily for some mixture models as the updating steps are obtained in analytical forms.

The rest of this paper is organized as follows. In Section 2, we first describe the proposed WCE method under the general mixture model and derive a general EEE algorithm for solving WCE. Then, the general algorithm are applied to the specific mixture models, the multivariate Gaussian mixtures (Section 3), mixtures of experts (Section 4) and multivariate skew normal mixtures (Section 5). The details of the EEE algorithms are described. In Section 6, the proposed method is demonstrated for the multivariate skew normal mixtures through the simulation studies. Section illustrates the practical advantage of the proposed method using the real data. We finally give conclusions and discussions in Section 8.

## 2 Weighted Estimating Equations for Mixture Modeling

### 2.1 Weighted estimating equations and EEE algorithm

Let

be the random variables on

. We consider the following mixture models:

 f(yi;Ψ)=K∑k=1πkf(yi;θk), (1)

where is the set of model parameters in the th component,

is the vector of grouping probabilities or prior membership probabilities, and

is the collection of all the model parameters. For fitting the model (1), we introduce the latent membership variable defined as , thereby the conditional distribution of given is . For notional simplicity, we let , the indicator function being .

The complete estimating equations for given ’s are given as follows:

 n∑i=1uik∂∂θklogf(yi;θk)=0 n∑i=1uikπk−n∑i=1uiKπK=0,k=1,…,K.

Since the above estimating equations may be sensitive to outliers, we introduce the weight to control the amount of contribution from the th observation. We then consider the following modified estimating equations:

 n∑i=1uik{w(yi;θk)∂∂θklogf(yi;θk)−C(θk)}=0,n∑i=1uikπkw(yi;θk)B(θk)−n∑i=1uiKπKw(yi;θK)B(θK)=0, (2)

for , where is the weight function which may depend on some tuning parameter and

 C(θk)=∫Rpf(t;θk)w(t;θk)∂∂θklogf(t;θk)dt, B(θk)=∫Rpf(t;θk)w(t;θk)dt.

Note that the weighted estimation equations (2

) is unbiased, that is, the expectations of these estimating functions with respect to the joint distribution of

and are zero. We consider the specific form of the weight function given by with . The weight would be small if is an outlier, i.e. is located far in the tail of the distribution . The weighted estimation equations reduce to the original complete estimating equation when .

Starting from some initial values, the above modified estimating equation can be iteratively solved by the expectation and estimating equation (EEE) algorithm given as follows:

• E-step:

Compute the posterior probability:

 u(s)ik=π(s)kf(yi;θ(s)k)∑Kℓ=1π(s)ℓf(yi;θ(s)ℓ),    k=1,…,K.
• EE-step:   Update the membership probabilities ’s as

 π(s+1)k=∑ni=1u(s)ikw(yi;θ(s)k)/B(θ(s)k)∑Kℓ=1∑ni=1u(s)iℓw(yi;θ(s)ℓ)/B(θ(s)ℓ)

and component-specific parameters by solving the estimating equations:

 n∑i=1u(s)ik{w(yi;θ(s)k)∂∂θklogf(yi;θk)−C(θk)}=0.

To apply the above general algorithm to specific mixture models such as the multivariate Gaussian mixtures, the only thing we need to work on is to calculate the bias correction terms and . As shown in Section 3, the bias correction terms are quite simple under the Gaussian distribution. Moreover, the above algorithm can be easily modified to the case where the distribution of each component admits a hierarchical or stochastic representation. For instance, the multivariate skew normal distribution has a hierarchical representation based on the multivariate normal distribution, which allows us to derive tractable weighted complete estimating equations to carry out the proposed robust EEE algorithm, as demonstrated in Section 5.

### 2.2 Selecting the number of components

In practice, the number of components is unknown and is to be suitably chosen in a reasonable way. When the data contains outliers, they should be adequately omitted for selecting , otherwise the selected can be different from the true one. In order to appropriately downweight outliers and select reasonably, we first the define normalized density weight as , where is the mixture model (1) fitted to the data and is the robust estimator based on the proposed method. Note that . When is an outlier, the corresponding weight is supposed to be small. Then we may employ the following BIC-type criteria for selecting :

 Q(K)=n∑i=1˜wilogf(yi;ˆΨ)+|Ψ|logn, (3)

where is the dimension of which depends on . Since for all under , the above criteria reduces to the original BIC for the mixture model (1).

### 2.3 Asymptotic variance-covariance matrix

The proposed method depends on the tuning parameter which controls the degree of robustness of the complete estimating equation (2), although the specific value of is not interpretable. Here we investigate the role of based on the asymptotic relative efficiency. We first note that when the assumed distribution is correct, i.e. the data does not contain outliers, the choice of a non-zero value of leads to inefficient estimation of the model parameters. On the other hand, the estimating equation (2

) is robust against outliers, there is thus a trade-off between the efficiency under the correct model specification and robustness in the presence of outliers. To quantify the inefficiency, the asymptotic variance-covariance matrix of the estimator is considered.

Let be the complete estimating functions based on the th observations given in (2) and let be the augmented estimating equations and let denote the estimator which is the solution of . Note that the augmented estimating equations are unbiased since the complete estimating equations (2) are unbiased. Under some regularity conditions, the asymptotic distribution of is where the asymptotic variance-covariance matrix is given by

 Vγ=limn→∞(1nn∑i=1∂Gi(Φ)∂Φ)−1{1nn∑i=1Var(Gi(Φ))}(1nn∑i=1∂Gi(Φ)t∂Φ)−1.

This can be consistently estimated by replacing with and with . In practice, it would be difficult to obtain an analytical expression for the derivative of . Therefore, the derivative is numerically computed by using the outputs of the EEE algorithm. Let be the final estimates of the EEE algorithm. Then the derivative evaluated at can be numerically approximated by , namely the -element of corresponds to the ratio between the th component of and th component of , respectively.

Based on , the relative inefficiency of the estimator against the maximum likelihood estimator is defined as , which would be an increasing function of . This may suggest a possible way to specify through relative efficiency, that is, specifying a percentage of relative inefficiency denoted by , is selected by solving the equation with respect to . The value of can be or . A practical approach to solving the equation would be to select a value for from some candidates that minimizes the difference between and .

## 3 Robust Gaussian mixture

Gaussian mixture models are the most famous and widely adopted mixture models for model-based clustering while the performance can be severely affected by outliers due to the light tails of the normal distribution. Here we suggest a new approach to robust fitting of the Gaussian mixture model using the proposed robust EEE algorithm. Let denote the -dimensional normal density . It holds that

 C(θk)=−γ2|Σk|−γ/2Σ−1k(2π)−pγ/2(1+γ)−p/2−1, B(θk)=|Σk|−γ/2(2π)−pγ/2(1+γ)−p/2

so that the weighted complete estimating equations for and are given by

 n∑i=1uikw(yi;θk)(yi−μk)=0 n∑i=1uik{w(yi;θk)Σk−w(yi;θk)(yi−μk)(yi−μk)t−g(Σk)Σk},

where Hence, starting from some initial values , the proposed robust EEE algorithm repeats the following two steps:

• E-step:   The standard E-step is left unchanged as

 u(s)ik=π(s)kϕp(yi;μ(s)k,Σ(s)k)∑Kℓ=1π(s)ℓϕp(yi;μ(s)ℓ,Σ(s)ℓ).
• EE-step:   Update the membership probabilities ’s as in Section 2. The component-specific parameters ’s are updated as

 π(s+1)k=∑ni=1u(s)ikw(yi;θ(s)k)/B(θ(s)k)∑Kℓ=1∑ni=1u(s)iℓw(yi;θ(s)ℓ)/B(θ(s)ℓ) μ(s+1)k=∑ni=1u(s)ikw(yi;θ(s)k)yi∑ni=1u(s)ikw(yi;θ(s)k), Σ(s+1)k=∑ni=1u(s)ikw(yi;θ(s)k)(yi−μ(s+1)k)(yi−μ(s+1)k)t∑ni=1u(s)ikw(yi;θ(s)k)−g(Σ(s)k)∑ni=1u(s)ik.

Note that all the steps in the above algorithms are obtained in the close forms. This is one of the attractive features of the proposed method as it does not require any computationally intensive methods such as Monte Carlo integration. We note that the proposed estimating equations could be an ill-posed problem due to the same reason that the likelihood function for the Gaussian mixture models may be unbounded (e.g. Day, 1969; Maronna and Jacovkis, 1974). Hence, the following eigen-ratio constraint to avoid the problem is employed:

 maxjmaxkλj(Σk)minjminkλj(Σk)≤c,   j=1,…,p,  k=1,…,K,

where denotes the

th eigenvalues of the covariance matrix

in the th component and is a fixed constant. When , a spherical structure is imposed on and a more flexible structure is allowed under a large value of . In order to reflect the eigen-ratio constraint in our EEE algorithm, we can simply replace the eigen-values of with the truncated version where if , if and if . Here is a unknown constant that depends on , and we employed the finding procedure by Fritz et al. (2013).

## 4 Robust mixture of experts

The Mixture of experts model (e.g. Jacobs et al., 1991; McLachlan and Peel, 2004) is known for a useful tool for modeling nonlinear regression relationships. The model that includes a simple mixture of normal regression models and its robust versions have been proposed in the literature (e.g. Bai et al., 2012; Song et al., 2014). Also the robust versions of mixture of experts based on the non-normal distributions were considered in Nguyen and McLachlan (2016) and Chamroukhi (2016). The general model is described as

 f(yi|xi;Ψ)=K∑k=1gk(xi;ηk)fk(yi|xi;θk),

where is the vector of covariates, is the mixing proportion as a function of satisfying . For identifiability of the parameters, we assume that is an empty set since is completely determined by

. A typical form for the continuous response variables adopts

and for . Let be the set of the unknown parameters, and . Compared with the standard mixture model (1), the mixing proportion is a function of parameterized by .

As before, the latent variable such that for is introduced. Then, the complete weighted estimating equations are given by

 n∑i=1uik{w(yi;θk)∂∂θklogfk(yi;θk)−Ci(θk)}=0,n∑i=1uikg(xi;ηk)∂g(xi;ηk)∂ηkw(yi;θk)Bi(θk)−n∑i=1uiKg(xi;ηK)∂g(xi;ηk)∂ηkw(yi;θK)Bi(θK)=0,

where

 Ci(θk)=∫fk(t|xi;θk)w(t;θk)∂∂θklogfk(t|xi;θk)dt, Bi(θk)=∫fk(t|xi;θk)w(t;θk)dt.

Starting from some initial values , the proposed EEE algorithm repeats the following two steps:

• E-step:   The standard E-step is left unchanged as

 u(s)ik=gk(xi;η(s)k)fk(yi|xi;θ(s)k)∑Kℓ=1gℓ(xi;η(s)ℓ)fℓ(yi|xi;θ(s)ℓ).
• EE-step:   Update and by solving the following equations:

 n∑i=1u(s)ik{w(yi;θ(s)k)∂∂θklogf(yi;θk)−Ci(θk)}=0,n∑i=1u(s)ikg(xi;ηk)∂gk∂ηkw(yi;θ(s+1)k)Bi(θ(s+1)k)−n∑i=1u(s)iKgk(xi;ηK)∂gk∂ηkw(yi;θ(s+1)K)Bi(θ(s+1)K)=0. (4)

When the mixture components are the normal linear regression models given by

, the bias correction terms and can be analytically obtained as and . The first equation in (4) can be solved to obtain the closed form updating steps similar to those in Section 3. On the other hand, the second equation is a function of and cannot be solved in a analytical way. Note that the solution corresponds to the maximizer of the weighted log-likelihood function of the multinomial distribution given by

 n∑i=1K∑k=1u(s)ikw(yi;θ(s+1)k)Bi(θ(s+1)k)logg(xi;ηk),

since its first order partial derivatives with respect to reduces to the second equation in (4). Thus, the updating step for can be readily carried out.

## 5 Robust skew normal mixture

We next consider the use of the -dimensional skew normal distribution (e.g. Azzalini and Valle, 1996) for the th component, which is more flexible than the multivariate Gaussian mixtures especially when the cluster-specific distributions are not symmetric but skewed. There exist several works regarding the maximum likelihood estimation of the skew-normal mixtures (e.g. Lin et al., 2007; Lin, 2009). The direct application of the skew normal distribution to the proposed EEE algorithm would be computationally intensive since the bias correction term cannot be obtained analytically and Monte Carlo approximation is required in each iteration. Instead, we employ the stochastic representation of the multivariate skew normal distribution used in Früwirth-Schnatter and Pyne (2010) which admits the following hierarchical representation for given :

 yi|(zi=k)=μk+ψkvik+εik,ϵik∼N(0,Σk),vik∼N+(0,1), (5)

where is the vector of the location parameters, is the vector of the skewness parameters, is the covariance matrix and denotes the truncated normal distribution on the positive real line with the mean and variance parameters and , respectively. By defining

 Ωk=Σk+ψkψtk,αk=ωkΩ−1kψk√1−ψtkΩ−1kψk,

where

. The probability density function of the multivariate skew normal distribution of

Azzalini and Valle (1996) is given by

 fSN(y;μk,Ωk,αk)=2ϕp(y;μk,Ωk)Φ(αtkω−1k(y−μk)). (6)

From the representation (5), the conditional distribution of given both and are normal, namely, , so that the bias correction terms under given and can be easily obtained in the same way as the Gaussian mixture models. Therefore, we consider the following complete estimating equations for the parameters conditional on both and :

 n∑i=1uikwik(yi−μk−ψkvik)=0,n∑i=1uik{wikΣk−wik(yi−μk−ψkvik)(yi−μk−ψkvik)t−g(Σk)Σk}=0,n∑i=1uikvikwik(yi−μk−ψkvik)=0,

where and are in the same forms as the Gaussian mixture case.

The proposed EEE algorithm for the robust fitting of the skew normal mixture is obtained after some modification of that for the normal case. Since the complete estimating equations contain the additional latent variables

, some additional steps are added to compute the moments of

in the equations with respect to the conditional posterior distribution of given , which is where

 δik=ψtkΣ−1k(yi−μk)ψtkΣ−1kψk+1,    τ2ik=1ψtkΣ−1kψk+1.

We define the following quantities:

 t2ik=(γψtkΣ−1kψk+1τ2ik)−1,mik=t2ik{γψtkΣ−1k(yi−μk)+δikτ2ik},Uik=|Σk|−γ/2(2π)−γp/2Φ(mik/tik)Φ(δik/τik)(t2ikτ2ik)1/2     ×exp{−γ2(yi−μk)tΣ−1k(yi−μk)−δ2ik2τ2ik+m2ik2t2ik}. (7)

Note that we have for and

 E[uikwikvjik|yi]=E[E[uikwikvjik|yi,uik]|yi]=E[uik|yi]E[wikvjik|uik,yi].

The conditional distribution of given is with the density , where

 ϕ+(x;a,b2)=1Φ(a/b)(2πb2)−1/2exp{−(x−a)22b2}I(x≥0).

Then, we have

 E[vjikwik|uik,yi] =∫∞0|Σk|−γ/2(2π)−γp/2exp{−γ2(yi−μk−vikψk)tΣ−1k(yi−μk−vikψk)} ×vjikΦ(δik/τik)(2πτ2ik)−1/2exp{−(vik−δik)22τ2ik}dvik =|Σk|−γ/2(2π)−γp/2Φ(mik/tik)Φ(δik/τik)(t2ikτ2ik)1/2∫∞0vjikϕ+(vik;mik,t2ik)dvik ×exp{−γ2(yi−μk)tΣ−1k(yi−μk)−δ2ik2τ2ik+m2ik2t2ik},

and it follows from Lin et al. (2007) that

 ∫∞0v1ikϕ+(vik;mik,t2ik)dvik =mik+tikϕ(mik/tik)Φ(mik/tik), ∫∞0v2ikϕ+(vik;mik,t2ik)dvik =m2ik+t2ik+miktikϕ(mik/tik)Φ(mik/tik),

which leads to the analytical expressions of the updating steps in the E-step. Starting from some initial values of the parameters, , the proposed EEE algorithm iteratively updates the parameters as follows:

• E-step:   Compute the posterior expectations:

 u(s)ik≡E(s)[uik|yi]=π(s)kfSN(yi;μ(s)k,Ω(s)k,α(s)ℓ)∑Kℓ=1π(s)ℓfSN(yi;μ(s)ℓ,Ω(s)ℓ,α(s)ℓ),V0(s)ik≡E(s)[wik|yi,uik]=U(s)ik,V1(s)ik≡E(s)[wikvik|yi,uik]=U(s)ik⎧⎨⎩m(s)ik+t(s)ikϕ(m(s)ik/t(s)ik)Φ(m(s)ik/t(s)ik)⎫⎬⎭,V2(s)ik≡E(s)[wikv2ik|yi,uik]=U(s)ik⎧⎨⎩m2(s)ik+t2(s)ik+m(s)ikt(s)ikϕ(m(s)ik/t(s)ik)Φ(m(s)ik/t(s)ik)⎫⎬⎭,

where and respectively stand for and given in (7) evaluated with the current parameter values.

• EE-step:   Update the membership probabilities ’s as in Section 2 and component-specific parameters ’s as

 π(s+1)k=∑ni=1u(s)ikV0(s)ik/B(θ(s)k)∑Kℓ=1∑ni=1u(