# A New Algorithm using Component-wise Adaptive Trimming For Robust Mixture Regression

Mixture regression provides a statistical model for teasing out latent heterogeneous relationships between a response and independent variables. EM-based estimation of mixture regression is highly sensitive to outliers. To enable simultaneous outlier detection and robust parameter estimation, we propose a fast and efficient robust mixture regression algorithm based on Component-wise Adaptive Trimming (CAT). Compared with multiple existing algorithms, it strikes a good balance between computational efficiency and robustness across different scenarios of simulated data, with unequal component proportions and variances, different levels of outlier contamination, and different sample sizes. The adaptive trimming ability of CAT makes it a promising tool for mining latent relationships among variables in the big data era. CAT is implemented in the R package 'RobMixReg', available on CRAN.


## 1 Introduction

Finite Mixture Gaussian Regression (FMGR) was first introduced by goldfeld1973estimation, and has been widely used to explore relationships among variables coming from several unknown latent classes in many fields bohning1999computer; hennig2000identifiablity; jiang1999hierarchical; mclachlan2004finite; xu1996convergence; fruhwirth2006finite. Inference of the parameters in FMGR is usually carried out with the EM algorithm under the assumption of normally distributed component errors, which makes it vulnerable to outliers and heavy-tailed noise. Many algorithms have been developed to estimate the FMGR parameters robustly yu2020selective. To robustify the estimation procedure, Markatou markatou2000mixture and Shen et al. shen2004outlier proposed using a weight factor for each data point. García-Escudero et al. garcia2017robust proposed robust model estimation complemented with trimming and constrained estimation. Bai et al. bai2012robust proposed a modified EM algorithm that replaces the least squares criterion in the M-step with a robust bi-square criterion (MIXBI). Bashir and Carter bashir2012robust extended the idea of the S-estimator to mixtures of linear regressions. Yao et al. yao2014robust extended the mixture of t-distributions proposed by Peel and McLachlan peel2000robust from clustering to the regression setting (MIXT). Similarly, Song et al. song2014robust proposed using the Laplace distribution to model the error distribution (MIXL). These methods seek robust parameter estimates in the presence of outliers; however, the identities of the outliers remain unknown. The identities of the outliers are often of interest for two reasons: first, removal of the outliers could improve the parameter estimates; second, outlying samples could be caused by measurement errors, or could represent a novel mechanism not captured by the current observations, both of which merit further investigation. To enable outlier detection, Neykov et al. neykov2007robust proposed robust fitting of mixtures using the trimmed likelihood estimator (TLE), where, given a trimming parameter, the outliers are defined as the observations with the smallest sample likelihoods; Yu et al. yu2017new proposed a penalized mean-shift mixture model for simultaneous outlier detection and robust parameter estimation. The challenge with TLE and the mean-shift model is the involvement of hyperparameters, namely the trimming parameter in TLE and the penalty parameter in the mean-shift model, which can heavily impact the performance of the two algorithms. Yu et al. yu2017new proposed a BIC procedure for hyperparameter selection; however, the BIC criterion becomes highly unstable when the total number of parameters, which equals the total number of outliers, becomes large.

To address the challenges of simultaneous outlier detection and robust parameter estimation in FMGR, we adopt the Classification-Expectation-Maximization (CEM) algorithm celeux1992classification, in which individual observations are assigned to a definite cluster as part of the maximization step, unlike in the EM algorithm. Essentially, CEM maximizes the complete-data likelihood instead of the observed-data likelihood used in EM. Under CEM, each component has its own exclusive members, which makes it possible to apply a trimmed likelihood approach designed for (single-component) linear regression to those members, and hence enables both robust parameter estimation and outlier detection for the component. Our major contribution in this paper is the introduction of CEM to FMGR, which provides a platform that migrates the robustness issue from mixture regression to (single-component) linear regression, for which robust estimators have been extensively studied and many algorithms with high breakdown points have been developed. The task of outlier detection is distributed to each component, making it possible to formally define outliers in FMGR. The resulting algorithm, Component-wise Adaptive Trimming (CAT), detects outliers in a data-driven fashion free of hyperparameters, and is hence computationally efficient.

The remainder of the article is organized as follows. In Section 2, we introduce the complete-data maximum likelihood and the CEM algorithm, on which our component-wise adaptive trimming method is built. In Section 3, we compare the performance of our method with five other state-of-the-art methods using synthetic datasets.

## 2 Component-wise adaptive trimming

### 2.1 The complete data maximum likelihood estimation

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a finite set of observations, $X \in \mathbb{R}^{N \times p}$ the design matrix, and $Y \in \mathbb{R}^{N}$ the response vector. Consider a FMGR model parameterized by $\theta = (\pi_1, \dots, \pi_K, \beta_1, \dots, \beta_K, \sigma_1^2, \dots, \sigma_K^2)$: it is assumed that when observation $i$ belongs to the $k$-th component, $y_i = x_i^T \beta_k + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma_k^2)$. Then the conditional density of $y_i$ given $x_i$ is $f(y_i \mid x_i, \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(y_i; x_i^T \beta_k, \sigma_k^2)$, where $\mathcal{N}(\cdot\,; \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. Let $z_i \in \{1, \dots, K\}$ be the membership indicator for observation $i$, so that $z_i = k$ when observation $i$ belongs to the $k$-th component. The maximum likelihood estimate of $\theta$ is obtained by minimizing the following negative log-likelihood:

$$\mathcal{L}_{X,Y}(\theta) := -\sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k\, \mathcal{N}(y_i; x_i^T \beta_k, \sigma_k^2)\right) \tag{1}$$
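For concreteness, the negative log-likelihood in (1) is straightforward to evaluate numerically. Below is a minimal Python sketch; the paper's implementation is in R, so all names here are illustrative:

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return np.exp(-(y - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_nll(pi, beta, sigma2, X, y):
    """Negative log-likelihood of equation (1).

    pi: (K,) mixing proportions; beta: (K, p) coefficients;
    sigma2: (K,) component variances; X: (N, p); y: (N,).
    """
    means = X @ beta.T                                   # (N, K) component means
    dens = pi * normal_pdf(y[:, None], means, sigma2)    # pi_k * N(y_i; x_i^T beta_k, sigma_k^2)
    return -np.log(dens.sum(axis=1)).sum()               # -sum_i log sum_k
```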

The EM algorithm is usually applied to obtain the MLE, treating the cluster memberships $z_i$ as missing random variables.

Assume we are given a set of observations $\{(x_i, y_i)\}_{i=1}^{N}$ and assignments $z = (z_1, \dots, z_N)$. Then the likelihood that all observations have been drawn from an FMGR, and that each observation $i$ has been generated by the $z_i$-th component, is given by

$$\prod_{i=1}^{N} p(y_i, z_i \mid x_i, \theta) = \prod_{i=1}^{N} \pi_{z_i}\, \mathcal{N}(y_i; x_i^T \beta_{z_i}, \sigma_{z_i}^2) \tag{2}$$

This is called the complete-data likelihood. Note that the assignments define a partition of the observations, $C = \{C_1, \dots, C_K\}$, such that $i \in C_k$ iff $z_i = k$. Hence we can rewrite Equation (2) in its negative logarithmic form as

$$\mathcal{L}^{f}_{X,Y}(\theta, C) := -\sum_{k=1}^{K}\left(\sum_{i \in C_k} \log \mathcal{N}(y_i; x_i^T \beta_k, \sigma_k^2) + |C_k| \log \pi_k\right) \tag{3}$$

We introduce the complete-data maximum likelihood estimates (CMLE) as follows.

###### Definition 2.1.

(Complete-data Maximum Likelihood Estimates, CMLE) Let $X$ be the design matrix and $Y$ the response vector of the $N$ observations. Given an integer $K$, find a partition $C = \{C_1, \dots, C_K\}$ of the $N$ observations and FMGR parameters $\theta$ that minimize $\mathcal{L}^{f}_{X,Y}(\theta, C)$ defined in Equation (3).

Note that CMLE is not well defined in this form. For example, for an observation $(x_i, y_i)$, if $\beta_k$ is chosen such that $y_i = x_i^T \beta_k$ and we let $\sigma_k^2 \to 0$, then $\mathcal{N}(y_i; x_i^T \beta_k, \sigma_k^2) \to \infty$, which results in an infinite likelihood. Under mild restrictions on the cluster sizes, the variance associated with each regression line can be bounded from below, and the CMLE is well defined blomer2016hard.
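The degeneracy is easy to verify numerically: for a line that interpolates a point, the residual is zero, so the log-density of that point grows without bound as the variance shrinks. A small illustration:

```python
import numpy as np

def normal_logpdf(y, mean, var):
    """Log-density of N(mean, var) at y."""
    return -0.5 * np.log(2 * np.pi * var) - (y - mean) ** 2 / (2 * var)

# A component line passing exactly through (x_i, y_i) leaves zero residual,
# so the log-likelihood contribution diverges as var -> 0.
vals = [normal_logpdf(0.0, 0.0, v) for v in (1.0, 1e-2, 1e-6, 1e-12)]
```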

### 2.2 Alternating Optimization Scheme with the CEM algorithm

We introduce an alternating optimization algorithm to solve the CMLE problem. Clearly, fixing the partition $C$, the optimal mixture parameter $\theta$ is given by

$$\pi_k = \frac{|C_k|}{\sum_{l=1}^{K} |C_l|}, \qquad (\beta_k, \sigma_k^2) = \mathrm{OLS}(Y_{C_k}, X_{C_k})$$

Here, $|C_k|$ denotes the cardinality of the set $C_k$; $\mathrm{OLS}(Y_{C_k}, X_{C_k})$ means the OLS solution to regressing $Y$ on $X$ using only the observations in $C_k$.

Fixing the FMGR parameters $\theta$, the optimal partition is given by assigning each point to its most likely component, i.e.

$$i \in C_k \iff k = \operatorname*{argmax}_{l \in \{1, \dots, K\}} p(z_i = l \mid x_i, y_i, \theta)$$

where

$$p(z_i = k \mid x_i, y_i, \theta) = \frac{\pi_k\, \mathcal{N}(y_i; x_i^T \beta_k, \sigma_k^2)}{\sum_{l=1}^{K} \pi_l\, \mathcal{N}(y_i; x_i^T \beta_l, \sigma_l^2)}$$

is the posterior probability that observation $i$ lies on the $k$-th regression line of the mixture. By alternately updating $\theta$ and $C$, the solution converges to a stationary point of the likelihood function, as we show in Lemma 2.1. We call this alternating scheme the CEM algorithm.

Note: Here $Y_{C_k}$ denotes the observations indexed by $C_k$; $\mathrm{OLS}(Y_{C_k}, X_{C_k})$ denotes the ordinary least squares estimates of regressing $Y_{C_k}$ on $X_{C_k}$.
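The two alternating updates above (M-step, then C-step) can be sketched in a few lines. This is an illustrative Python version of the plain, non-robust CEM iteration with OLS in the M-step, not the authors' R implementation, and it does not handle clusters that become empty:

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return np.exp(-(y - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def m_step(X, y, z, K, min_var=1e-8):
    """Fit (pi_k, beta_k, sigma_k^2) by OLS within each cluster."""
    N, p = X.shape
    pi = np.array([(z == k).mean() for k in range(K)])
    beta = np.zeros((K, p))
    sigma2 = np.ones(K)
    for k in range(K):
        Xk, yk = X[z == k], y[z == k]          # assumes every cluster is non-empty
        sol, *_ = np.linalg.lstsq(Xk, yk, rcond=None)
        beta[k] = sol
        resid = yk - Xk @ beta[k]
        sigma2[k] = max(resid @ resid / len(yk), min_var)  # lower-bound the variance
    return pi, beta, sigma2

def c_step(X, y, pi, beta, sigma2):
    """Assign each observation to its most likely component."""
    dens = pi * normal_pdf(y[:, None], X @ beta.T, sigma2)
    return dens.argmax(axis=1)

def cem(X, y, z, K, max_iter=100):
    """Alternate M-step and C-step until the partition stabilizes."""
    for _ in range(max_iter):
        pi, beta, sigma2 = m_step(X, y, z, K)
        z_new = c_step(X, y, pi, beta, sigma2)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return pi, beta, sigma2, z
```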

###### Lemma 2.1.

The complete-data negative log-likelihood, $\mathcal{L}^{f}_{X,Y}(\theta^{(m)}, C^{(m)})$, is non-increasing for any sequence $(\theta^{(m)}, C^{(m)})$ defined as in Algorithm 1, and it converges to a stationary value. Moreover, if the maximum likelihood estimates of the parameters are well defined, the sequence $(\theta^{(m)}, C^{(m)})$ converges to a stationary point.

###### Proof.

We first show that the sequence $\mathcal{L}^{f}_{X,Y}(\theta^{(m)}, C^{(m)})$ is non-increasing. Since $\theta^{(m+1)}$ minimizes $\mathcal{L}^{f}_{X,Y}(\cdot\,, C^{(m)})$, we have

$$\mathcal{L}^{f}_{X,Y}(\theta^{(m+1)}, C^{(m)}) \le \mathcal{L}^{f}_{X,Y}(\theta^{(m)}, C^{(m)})$$

And since

$$i \in C_k^{(m+1)} \iff \pi_k^{(m+1)}\, \mathcal{N}(y_i; x_i^T \beta_k^{(m+1)}, \sigma_k^{2\,(m+1)}) \ge \pi_{k'}^{(m+1)}\, \mathcal{N}(y_i; x_i^T \beta_{k'}^{(m+1)}, \sigma_{k'}^{2\,(m+1)})$$

for all $k'$, reassigning each observation to its most likely component cannot increase the negative log-likelihood, so

$$\mathcal{L}^{f}_{X,Y}(\theta^{(m+1)}, C^{(m+1)}) \le \mathcal{L}^{f}_{X,Y}(\theta^{(m+1)}, C^{(m)}) \le \mathcal{L}^{f}_{X,Y}(\theta^{(m)}, C^{(m)})$$

Since there are finitely many partitions of the samples into $K$ clusters, the non-increasing sequence takes a finite number of values and thus converges to a stationary value. Hence $C^{(m+1)} = C^{(m)}$ for $m$ large enough; from the first inequality and the assumption that the maximum likelihood estimates are well defined, we deduce that $\theta^{(m+1)} = \theta^{(m)}$. ∎

### 2.3 A new definition of outliers under CEM

In linear regression, outliers are understood as observations that deviate from the model assumptions, and samples with lower likelihood are more likely to be outliers. Powerful tools that simultaneously identify and down-weight outlying data points in linear regression have been developed. In robust linear regression, the outliers are either identified as a fixed trimming ratio of the total observations with the lowest likelihoods, or identified in a completely data-driven manner without a pre-specified trimming ratio pison2002small; leroy1987robust; rousseeuw1984least; rousseeuw1999fast.

Unfortunately, such a definition of outliers becomes less applicable in the case of mixture regression. Given a robust mixture regression model and a trimming ratio, if we follow the same logic as in linear regression, then the observations with the smallest overall likelihoods will be detected as outliers, as in neykov2007robust. This trimmed likelihood approach implies that an observation with a lower overall likelihood is more likely to be an outlier than one with a higher overall likelihood. However, the overall likelihood depends not only on the likelihood of the observation with respect to each component, but also on the proportion of each component, and such a criterion for outliers becomes problematic if the mixing components are unbalanced. In other words, a low mixing proportion for the $k$-th component will distort the "outlierness" of its observations. In addition, if we argue that, given a set of observations, we can always find some mixture model that explains it well, there is no basis for calling any observation an outlier.

The complete-data likelihood approach via the CEM algorithm disentangles the mixture distribution into exclusive clusters, within which the robustness issue can be handled much more easily, given the tremendous amount of research on robust linear regression. More importantly, it allows a more natural definition of outliers.

###### Definition 2.2.

(Outlier of FMGR) Given an FMGR model parameterized by $\theta$, under CMLE, an observation $i$ is considered an outlier if $i \in C_k$ and $(x_i, y_i)$ is an outlier with respect to the linear regression of the $k$-th component. In other words, an observation is considered an outlier if it is an outlier to the component it belongs to.

This new definition shifts the robustness issue from a mixture model to its linear regression components, for which outlierness criteria have been well defined and studied. Naturally, to obtain robust parameter estimation for FMGR under CMLE, we can replace the least squares criterion in the M-step with a robust criterion; to further enable simultaneous outlier detection, we can use any trimmed likelihood approach with a high breakdown point.

### 2.4 The robust CEM algorithm

Under Definition 2.2, detecting the outliers of an FMGR model can be accomplished by detecting the component-wise outliers. The fact that outlier detection in linear regression can be completely data-adaptive makes it possible to develop a data-driven algorithm for simultaneous outlier detection and robust parameter estimation in FMGR.

Our Component-wise Adaptive Trimming method, CAT, starts by initializing the posterior probability matrix, as in Algorithm 2. For each component, we randomly sample a subset of observations to build a robust linear regression model, and the posterior probability of sample $i$ for component $k$ is initialized as the density of the residual of sample $i$ under the $k$-th robust regression line. For robust linear regression, we use the "ltsReg" function in the "robustbase" R package pison2002small; leroy1987robust; rousseeuw1984least; rousseeuw1999fast, which detects outliers in a data-driven manner in addition to providing robust parameter estimates.

Note: Function RLM outputs a robust linear regression model and its parameter estimates.

With each initialization, CAT runs a robust CEM algorithm in which the OLS estimates in the M-step are replaced by robust estimates from a trimmed likelihood method. At the end of each iteration, an MLE estimate is obtained on the samples excluding the outliers detected across all components, using the function "flexmix" from the "flexmix" R package leisch2004flexmix. When the set of outliers does not change between two iterations (or a pre-specified number of iterations is reached), the outliers and parameter estimates are finalized for this random start.

CAT uses multiple random starts to stabilize the results, and selects the start whose detected outliers are closest to the average frequency of outliers across all random starts. The complete CAT algorithm is given in Algorithm 3.

Note: Function RLM outputs a robust linear regression model and its parameter estimates, using ltsReg; function outlier outputs the outliers identified by robust linear regression with the trimmed likelihood method, using ltsReg; function MLE outputs the regular MLE estimates of the FMGR model based on the EM algorithm, using flexmix. $I(\cdot)$ is an indicator function, taking value 1 if its argument holds and 0 otherwise.
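The start-selection rule can be sketched as follows, under one plausible reading of "closest": represent each start's detected outliers as an indicator vector and pick the start minimizing the squared distance to the average outlier-frequency vector. Function and variable names are illustrative:

```python
import numpy as np

def select_start(outlier_sets, N):
    """outlier_sets: list of detected outlier index sets, one per random start."""
    # Indicator matrix: one row per start, one column per observation.
    ind = np.zeros((len(outlier_sets), N))
    for s, idx in enumerate(outlier_sets):
        ind[s, list(idx)] = 1.0
    freq = ind.mean(axis=0)                      # average outlier frequency per observation
    dists = ((ind - freq) ** 2).sum(axis=1)      # distance of each start to the average
    return int(dists.argmin())                   # index of the selected start
```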

## 3 Simulation studies

We evaluate the performance of CAT on synthetic datasets and compare it with several existing methods: MLE, TLE, MIXT, MIXL and MIXBI. These stand for the MLE approach leisch2004flexmix, the trimmed likelihood approach neykov2007robust, the mixture t approach yao2014robust, the mixture Laplacian approach song2014robust, and the mixture bisquare approach bai2012robust.

To compare the methods' performance on simultaneous outlier detection and robust parameter estimation, we simulate data using the following mean-shift model yu2017new:

$$f(y_i \mid x_i, \theta, \gamma_i) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(y_i; x_i^T \beta_k + \gamma_{ik} \sigma_k, \sigma_k^2), \quad i = 1, \dots, N$$

where $\gamma_{ik} = 0$ for non-outlying observations. Here, for each observation, a mean-shift parameter, $\gamma_{ik}$, is added to its mean structure in each mixture component. We consider scenarios in which the observations are drawn from mixture regression models with different component proportions, component variances, feature numbers and sample sizes, and contaminated with different levels of additive outliers.
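The mean-shift contamination mechanism can be sketched as a data generator. The settings below are illustrative, not the paper's simulation parameters:

```python
import numpy as np

def simulate_meanshift(N, pi, beta, sigma, out_frac, gamma, rng):
    """Draw from an FMGR with mean-shift outliers.

    For a contaminated observation i in component k, the mean is shifted
    by gamma * sigma_k, matching the mean-shift model above.
    pi: mixing proportions; beta: (K, p) coefficients; sigma: (K,) sds.
    """
    K, p = beta.shape
    z = rng.choice(K, size=N, p=pi)              # latent component labels
    X = rng.normal(size=(N, p))                  # covariates (illustrative choice)
    is_out = rng.random(N) < out_frac            # which observations are contaminated
    shift = np.where(is_out, gamma, 0.0)         # gamma_ik = 0 for clean observations
    y = (X * beta[z]).sum(axis=1) + shift * sigma[z] + rng.normal(size=N) * sigma[z]
    return X, y, z, is_out
```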

Example 1: For each $i$, $y_i$ is independently generated from a two-component model, where $z_i$ is a component indicator generated from a Bernoulli distribution, the covariates are independently generated, and the error terms $\epsilon_{i1}$ and $\epsilon_{i2}$ are independently generated.

Scenario 1:
Scenario 2:
Scenario 3:
Scenario 4:
Scenario 5:
Scenario 6:

Example 2: For each $i$, $y_i$ is independently generated with

$$y_i = \begin{cases} 1 - x_{i1} + \gamma_i \sigma_1 + \epsilon_{i1} & \text{if } z_i = 1 \\ 1 + 3x_{i1} + \gamma_i \sigma_2 + \epsilon_{i2} & \text{if } z_i = 2 \\ -1 + 0.1x_{i1} + \gamma_i \sigma_3 + \epsilon_{i3} & \text{if } z_i = 3 \end{cases}$$

where $z_i$ is a component indicator generated from a multinomial distribution. $x_{i1}$ is independently generated, and the error terms $\epsilon_{i1}$, $\epsilon_{i2}$, $\epsilon_{i3}$ are independently generated from $N(0, \sigma_1^2)$, $N(0, \sigma_2^2)$, $N(0, \sigma_3^2)$, respectively.

Scenario 1:
Scenario 2:
Scenario 3:
Scenario 4:
Scenario 5:
Scenario 6:

For each scenario in Examples 1 and 2, we simulate data with sample sizes 200 and 400. The bias and MSE of the parameter estimates over 100 repetitions under each scenario are examined for each competing method, covering the linear regression coefficients and mixing proportions of all components. The label switching issue celeux2000computational; stephens2000dealing; yao2009bayesian complicates aligning the parameters of a component in the estimated model with those of the true model: different component orderings can give totally different results, and there is no widely accepted method of adjustment. In our simulation study, we simply order the components in the estimated parameter matrix by minimizing the Euclidean distance to the true parameter matrix.
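The alignment step can be written as an exhaustive search over the $K!$ component orderings, choosing the one that minimizes the Euclidean distance to the true parameter matrix. A sketch with illustrative names:

```python
import itertools
import numpy as np

def align_components(est, truth):
    """Reorder the rows of `est` (K x p parameter matrix) to best match `truth`."""
    K = truth.shape[0]
    best, best_dist = None, np.inf
    for perm in itertools.permutations(range(K)):
        d = np.linalg.norm(est[list(perm)] - truth)  # Frobenius distance under this ordering
        if d < best_dist:
            best, best_dist = est[list(perm)], d
    return best
```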

Tables 1 and 2 report the bias and MSE of the parameter estimates of each method for sample sizes N=200 and N=400, respectively, in Example 1. Tables 3 and 4 report the bias and MSE for sample sizes N=200 and N=400, respectively, in Example 2. The first rows are for regression coefficients, and the last rows for mixing proportions. As seen from the four tables, CAT performs comparably to MLE when no outliers are present. When the observations are contaminated by high-leverage outliers, CAT is able to trim off the outliers and reach robust parameter estimates, regardless of component variances, feature numbers and sample sizes. Its performance is always better than, or comparable to, that of the five state-of-the-art methods.

## 4 Conclusion

We proposed solving FMGR with the CEM algorithm, under which outliers are more naturally defined and the robustness issue is handled more conveniently. The CEM algorithm is a variant of the EM algorithm that maximizes the complete-data likelihood. It enables a more natural definition of outliers for FMGR, and in turn the simultaneous detection of outliers and robust estimation of parameters. Most importantly, adaptive trimming in FMGR boils down to trimming in (single-component) linear regression, for which many powerful tools have been developed. In summary, CAT is an automatic algorithm with high potential for mining heterogeneous relationships among variables in the big data era.