1 Introduction
We consider high dimensional generalized linear models and we study a robust method for estimating its parameters. Robust estimators for generalized linear models (GLM) have been studied by Künsch et al. [1989], Cantoni and Ronchetti [2001], Bergesio and Yohai [2011], Bianco et al. [2013], Valdora and Yohai [2014] and Alqallaf and Agostinelli [2016]. However, these proposals either lack robustness or require a robust initial estimator. We propose a method for computing an initial estimator which can be used to begin an iterative algorithm, as needed by redescending estimators. We apply this method in the computation of Mestimators based on transformations (MT) proposed by Valdora and Yohai [2014]
. MTestimators are a family of Mestimators based on variance stabilizing transformations which are shown to be highly robust and efficient by means of a Monte Carlo study. These estimators are redescending Mestimators applied after transforming the responses by means of a variance stabilizing function. Stabilizing the variance allows the correct scaling of the loss function used in the definition of the Mestimator.
Consider a GLM in which is the response and is a
dimensional vector of explanatory variables. We assume that
(1) 
where is an unknown vector of parameters and is a known link function. We further assume that
(2) 
where is a discrete or continuous model in the exponential family of distributions in , that is to say, it has a density of the form
(3) 
for given functions , and . We assume is known. MTestimators are defined as follows
(4) 
where is a symmetric, bounded, continuous and non decreasing on function, is a variance stabilizing transformation and is the function defined by
(5) 
where denotes the expectation of when has distribution . It is assumed that is univocally defined, therefore (5) implies the Fisher consistency of . Other assumptions necessary to have consistency and asymptotic normality of this estimators are listed in Valdora and Yohai [2014]. The solution to (4) can be found by iterative methods which typically solve the corresponding system of estimating equations
(6) 
where is the derivative with respect to of . In the Appendix we provide an iteratively reweighted least squares (IRWLS) algorithm to find a solution to equation (6). The difficulty in the case of redescending Mestimators is that the goal function might have several local minima. As a consequence, it might happen that the iterative procedure converges to a solution of equation (6) that is not a solution of the optimization problem (4). To avoid this, one must begin the iterative algorithm at an initial estimator which is a very good approximation of the absolute minimum of , i.e. the solution of (4). If is small, this approximate solution may be obtained by the subsampling method Valdora and Yohai [2014, see]. Based on the algorithm described in Rousseeuw and Leroy [1987] for linear models, this method consists in computing a finite set of candidate solutions to (4) and then replace the minimization over by a minimization over . The set is obtained by randomly drawing subsamples of size and computing the maximum likelihood (ML) estimator based on the subsample. If the original sample contains a proportion of outliers, then the probability that a given subsample is free of outliers is and the probability of having at least one subsample free of outliers is , where is the number of subsamples drawn. If we want this probability to be greater than a given , we must draw a number of subsamples such that
that is to say,
This makes the algorithm infeasible for large .
Peña and Yohai [1999] studied this problem in the case of linear models, introducing an alternative method to compute the set of candidate solutions . Their proposal succeeds in obtaining a set which contains very good approximations of the actual solution and, on the other hand, requires the computation of a small number of subsamples, namely . This makes the algorithm much faster and feasible even for very large values of .
We modify the method introduced by Peña and Yohai [1999] in order to apply it to generalized linear models. We study its application to MTestimators by means of an extensive MonteCarlo study, which shows that the method is very fast and robust for large values of .
As a particular case of the MTestimator we define the Least Squares estimator based on Transformations (LST), which corresponds to , in the following way
(7) 
This estimator can be seen as a natural generalization of the Least Squares estimator (LS) for linear models to the case of GLM. LST estimators are Fisher consistent, however since is not bounded, they are, in general, non robust. In the Appendix we provide an iteratively reweighted least squares algorithm to find the solution to the optimization problem (7).
2 Detecting Outliers Using Principal Sensitivity Components
The classical statistic used to measure the influence of an observation is the Cook statistic introduced by Cook [1977] for linear models, which can be adapted for generalized linear models (see Chapter 12 of McCullagh and Nelder [1989]). This statistic is a measure of the distance between , the maximum likelihood estimator and , the maximum likelihood estimator computed without observation . However, these measure is nonrobust and therefore, when there are several ouliers, it may be completely unreliable. In these cases, some outliers with high influence may have a small Cook statistic if there are other similar outliers that still influence . This is known as masking effect. To make things worse, high leverage outliers may have small residuals making their detection difficult. This situation usually arises when there are several similar, or highly correlated outliers.
The proposal of Peña and Yohai (1999) follows the same idea as the subsampling method but it computes the set of candidate solutions in a different way. The candidates are obtained, as before, by computing the least squares estimates on subsamples. However, the subsamples are not chosen at random. Instead, they are chosen by deleting from the sample, groups of similar or highly correlated outliers, which can potentially cause a masking effect. The set will, in this way, contain candidates which are already quite robust estimates and therefore it will not need to have a large number of candidates as it happens using randomly chosen subsamples. In fact the number of candidates in the set is only .
Let be random vectors which follow a generalized linear model as defined by (2) and (1). Let be the LST estimator and
be the vector of fitted values. Let be the fitted value for observation computed without using observation , that is , where is the LST estimate based on the original sample without observation . We define the th residual as the difference between and its predicted value , that is . Following the ideas introduced by Peña and Yohai (1999) for linear models, we define the sensitivity vectors as the vectors with entries
where is the predicted value of computed without using observation . Then, is the sensitivity of the forecast of the to the deletion of observation and the sensitivity vectors are defined by
The sensitivity matrix is defined as the matrix whose rows are the vectors . Let
(8) 
is the direction in which the projections of the sensitivity vectors is largest. Let
(9) 
then is the vector whose entries are the terms of the sum in (8). Therefore, the largest entries in correspond to the largest terms in the sum in (8), which in turn correspond to the observations that have the largest projected sensitivity in the direction .
In the same way, we can define recursively as the solution of
(10)  
(11) 
The vectors are the directions in which the projected sensitivity of the observations are the largest. The corresponding projections
(12) 
are called the principal sensitivity components. The entries of are the projections of the sensitivity vectors on the direction . Large entries correspond to observations whose projected sensitivity in the direction is large. Therefore, large entries are considered potential outliers.
High leverage observations typically have large sensitivity because a small change in the estimated slopes will cause a large change in the fitted values. Peña and Yohai [1999] prove that, in the case of linear models, if the sample is contaminated with less than
high leverage outliers, then, at least for one eigenvector, the coordinates corresponding to the outliers have absolute value larger than the median.
3 Procedure for obtaining a robust initial estimate in generalized linear models
Consider a random sample following a generalized linear model as defined by (1), (2) and (3). The following procedure computes an approximation of which will be used as an initial estimator in the IRWLS algorithm for the estimating equation (6). The procedure has two stages. Stage 1 aims at finding a highly robust but possibly inefficient estimate and stage 2 aims at increasing its efficiency.
Stage 1. In this stage, the idea is to find a robust, but possibly inefficient, estimate of by an iterative procedure. In each iteration we get
(13) 
In the first iteration () the set is constructed as follows. We begin by computing the LST estimate with the complete sample and the principal sensitivity components. For each principal sensitivity component we compute three estimates by the LST method. The first eliminating the half of the observations corresponding to the smallest entries in , the second eliminating the half corresponding to the largest entries in and the third eliminating the half corresponding to the largest absolute values. To these initial candidates we add the LST estimate computed using the complete sample, obtaining a set of elements. Once we have we obtain by minimizing over the elements of .
Suppose now that we are on stage Let be a trimming constant; in all our applications we set . Then, for , we first delete the observations such that or where and, with the remaining observations, we recompute the LST estimator and the principal sensitivity components. Let us remark that, for the computation of we have deleted the observations that have large residuals, since is the fitted value obtained using . In this way, while candidates on the first step of the iteration are protected from high leverage outliers, candidate is protected from low leverage outliers, which may not be extreme entries of the .
Now the set will contain , and the LST estimates computed deleting extreme values according to the principal sensitivity components as in the first iteration. is the element of minimizing .
The iterations will continue until . Let be the final estimate obtained at this stage.
Stage 2. We first delete the observations () such that or , where and compute the LST estimate with the reduced sample. Then for each of the deleted observations we check whether or , where . Observations which are not within these bounds are finally eliminated and those which are, are restored to the sample. With the resulting set of observations we compute the LST estimate which is our proposal as a starting value for solving the estimating equations of the MTestimates.
4 Monte Carlo Study
In this section we report the results of a Monte Carlo study in which we compare the MTestimator computed with the proposed initial estimate (FMT), to the robust quasi likelihood estimator (RQL) proposed by Cantoni and Ronchetti [2001], the Conditionally Unbiased Bounded Influence (CUBIF) estimator proposed by Künsch et al. [1989], and the MTestimate beginning at an intitial estimator computing by subsampling (SMT). For computing the RQL estimator, we used function glmrob from the R package robustbase with method ”Mqle” and argument weights.on.x set to ”robCov” so that weights based on robust Mahalanobis distance of the design matrix (intercept excluded) are used to downweight potential outliers in xspace. The Conditionally Unbiased Bounded Influence (CUBIF) estimators proposed by Künsch et al. [1989] was computed using an implementation kindly provided by Prof. Alfio Marazzi (personal communication); the implementation available in function glmrob from the R package robustbase with method ”cubif” perform substantially bad and it is not reported here. For the computation of the SMT estimator, the number of subsamples was set to . Both FMT and SMT are computed using an iteratively reweighted least squares method described in the appendix. They only differ in the starting point. We study the case of Poisson regression and link.
Let be a random vector in such that has distribution and let
be a random variable such that
. We consider and three different models. In model 1 data are generated with , in model 2 and in model 3 , where is the vector of with all entries equal to zero except for the th entry which is equal to one. For each of these models we simulate the case in which the samples do not contain outliers and the case in which the samples have per cent of identical outliers at point . The outliers are located at . The values of are taken in a grid ranging from to where . The values and and the grid step are chosen so that the maximum mean squared error of our proposed estimator can be identified. For model 1 we also consider high leverage outliers with . We have considered samples of size an , but since the behaviour is similar we report the results only for the larger sample size. Given an estimator , we denote by MSE, the mean squared error defined by , where denotes the norm. We estimate the MSE bywhere is the value of the estimator at the th replication and is the number of replications which was chosen equal to .
Our simulations show that the proposed estimator has smaller MSE than all other proposals for almost all the contaminations considered. CUBIF estimator has a smaller MSE for some values of but, since it is based on a monotone score function, its MSE increases as increases. On the other hand, the MSE of FMT estimator is bounded; we observe that it decreases as increases beyond a certain value. To see this we study the MSE as a function of and consider, as a measure of robustness, the maximum MSE for . The proposed estimator has the smallest maximum MSE for all the models considered. In Figures 1 to 3 we plot the MSE as a function of for samples of size with contamination level.
In Figure 5 we report the execution time for the different methods. This figure shows that our proposed method is a great improvement over the subsampling method, as far as computational time is concerned.
5 Conclusion
We introduce a deterministic robust initial estimate for generalized linear models. This initial estimate is used in an iteratively reweighted least squares algorithm to obtain a solution of a transformed Mestimator. We illustrate the procedures for the Poisson model. Monte Carlo experiments show that the performance of MTestimators computed with the proposed initial estimator have a small bounded mean squared error exhibiting a redescending behavior. This is not the case for other proposals such RQL, CUBIF and MT with initial estimator based on subsampling.
Computational details and algorithms
In this Appendix we describe the iteratively reweighted least squares algorithms that were used to compute the LST and the MT estimators.
Suppose that we have an initial estimator and call then using a Taylor expansion of order one we can approximate by
(14) 
Then an approximate value to the LST estimator can be found by the value that minimizes
Therefore is the LS estimator for a linear model with responses and regressor vectors and consequently
(15) 
where is the matrix whose th row is , is the diagonal matrix with diagonal elements and
An iterative procedure to compute the LST estimator can be obtained putting
(16) 
and stopping when where is the error tolerance.
Suppose that converges to then this value should satisfy the LST estimating equation . In fact, taking limit in both sides of (16) we get
which is equivalent to
and then satisfies the estimating equation of the LST estimator.
To start the algorithm, it will be convenient to write equation (15) in the following slightly different way
(17) 
Observe that according to (17) to compute we only need to give . Then, since for Poisson regression and log link it holds it seems reasonable to take The value 0.1 is added to avoid numerical problem when To compute the estimators only one iteration is performed. The reason is that for these auxiliary estimators the accuracy is not as important as the speed at which they can be computed. Our experiments show that there is no noticeable loss in the precision of the final estimate by doing this but, on the other hand, the computation times decrease significantly.
We describe now an analogous iterative algorithm for computing the MT estimator. Suppose that we have an initial robust estimator We compute a new value using two approximations. As in the case of the LST estimator, replacing, in (4), by (14) we consider the approximate loss function
Differentiating with respect to we obtain the estimating equation
(18) 
where Note that this equation can be written as
(19) 
where
Since should be close to the second approximation is to replace, in (19), by Then is defined as the solution of the approximate estimating equation
and is given by
where is the diagonal matrix with diagonal elements
Then, the iterative procedure to compute the MT estimator is given by
(20) 
Suppose that then taking limits in both sides of (20), we get
and this is equivalent to
(21) 
where . Then satisfies the estimating equation of the MT estimator.
References
 Alqallaf and Agostinelli [2016] F. Alqallaf and C. Agostinelli. Robust inference in generalized linear models. Communications in Statistics  Simulation and Computation, 45(9):3053–3073, 2016. doi: 10.1080/03610918.2014.911896.
 Bergesio and Yohai [2011] A. Bergesio and V.J. Yohai. Projection estimators for generalized linear models. Journal of the American Statistical Association, 106:661–671, 2011.

Bianco et al. [2013]
A.M. Bianco, G. Boente, and I.M. Rodrigues.
Resistant estimators in poisson and gamma models with missing responses and an application to outlier detection.
Journal of Multivariate Analysis
, 114:209–226, 2013.  Cantoni and Ronchetti [2001] E. Cantoni and E. Ronchetti. Robust inference for generalized linear models. Journal of the American Statistical Association, 96:1022–1030, 2001.

Cook [1977]
R.D. Cook.
Detection of influential observation in linear regression.
Technometrics, 19(1):15–18, 1977. doi: 10.2307/1268249.  Künsch et al. [1989] H. Künsch, L. Stefanski, and R. Carroll. Conditionally unbiased boundedinfluence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84:460–466, 1989.
 McCullagh and Nelder [1989] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall/CRC, second edition, 1989.
 Peña and Yohai [1999] D. Peña and V.J. Yohai. A fast procedure for outlier diagnostics in large regression problems. Journal of the American Statistical Association, 94:434–445, 1999.
 Rousseeuw and Leroy [1987] P.J. Rousseeuw and A.M. Leroy. Robust regression and outlier detection. Wiley and Sons, 1987.
 Valdora and Yohai [2014] M. Valdora and V.J. Yohai. Robust estimators for generalized linear models. Journal of Statistical Planning and Inference, 146:31–48, 2014.