Robust location estimators in regression models with covariates and responses missing at random

05/07/2020 ∙ by Ana M. Bianco, et al. ∙ 0

This paper deals with robust marginal estimation under a general regression model when missing data occur in the response and also in some of covariates. The target is a marginal location parameter which is given through an M-functional. To obtain robust Fisher–consistent estimators, properly defined marginal distribution function estimators are considered. These estimators avoid the bias due to missing values by assuming a missing at random condition. Three methods are considered to estimate the marginal distribution function which allows to obtain the M-location of interest: the well-known inverse probability weighting, a convolution–based method that makes use of the regression model and an augmented inverse probability weighting procedure that prevents against misspecification. The robust proposed estimators and the classical ones are compared through a numerical study under different missing models including clean and contaminated samples. We illustrate the estimators behaviour under a nonlinear model. A real data set is also analysed.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

As is well known, the basis for any regression analysis is to record a response variable

and a covariate vector

which are linked through the expression


where generally it is assumed that the error is independent of . It is worth noticing that in the classical approach, it is usually assumed that the errors are centred, i.e.,

with finite variance

. In contrast, in a robust framework no moment conditions are required and two branches have been developed. The most studied setting considers that

has a symmetric distribution with

the unknown scale parameter. In contrast, when the regression errors are skewed, a given family of densities has to be assumed, see, for instance, Cantoni and Ronchetti (2006) for an approach under a linear regression model with log–Gamma errors. In this paper,

is a general regression function which may link the responses with the independent variables linearly, nonlinearly, nonparametrically or either through a semiparametric model. To perform the analysis the practitioner records independent copies , , of , that is, satisfy (1) with the errors i.i.d. and independent of .

Suppose we are interested in estimating a location parameter of the response

. In the classical setting, the target parameter is the mean and it is well known that, even when all observations are available, the mean is very sensitive to the presence of outliers in the sample. Just one outlying observation can take this estimator beyond any limit. Forty years after Huber’s (1964) seminal paper, robust estimators are a popular choice that protects against outliers. Among others, the median or

estimators, which are given through a continuous location functional , have been developed to overcome the mean sensitivity towards atypical data. In this robust context, the target is now the robust location functional related to the estimation procedure defined through a score function.

Beyond the robust point of view, the effect of ignoring missing observations from the analysis is well known. In particular, when only responses are missing, the estimation of the response mean based on the observed data has deserved a lot of attention. Several strategies have been developed to alleviate the effect on the bias of missing values. Some of them use that additional variables with predictive ability are recorded and include inverse probability weighted (ipw) or regression based procedures. In addition to missing responses, some covariates may be dropped out making the problem more complex. In order to adjust for missing values both in the response or design variables, it would be necessary to extend existing practice so as to take benefit of the predictive capability of the always observed covariates. The marginal estimation task is even more challenging when atypical responses arise in the sample, since the standard procedures based on maximum likelihood are very sensitive to the occurrence of a few atypical observations. Furthermore, the challenge is even greater if the aim is to obtain a robust estimator which at the same time protects against misspecification in the missing pattern or in the regression model.

As a motivation, we consider the environmental data analysed in Cleveland (1985) related to air quality measurements. This data set consists in 153 observations that include daily record readings of ozone, wind speed and solar radiation. Several authors have described the nonlinear relation between these variables. For that reason, they have fitted a nonlinear model to the ozone measurements, over a subset of the data, corresponding to the always observed cases since missing variables occur. The considered covariates include the wind speed and the solar radiation. A robust fit for nonlinear models with missing responses was given in Bianco and Spano (2017) who consider an exponential growth model to explain the ozone daily behaviour in terms of wind speed and identify several atypical data. However, the inclusion of a linear component based on the solar radiation may add valuable information to the analysis of the ozone variation. An appealing characteristic of these data is that not only some of the ozone records are missing, but also the variable solar radiation has dropouts, while the variable wind speed is completely observed. Hence, the detection of possible atypical observations when solar radiation is included in the analysis remains open, as well as the estimation of a reliable location marginal parameter.

For that reason, in this paper, we address robust location estimation in the framework of the regression model (1) when the response and a fixed subset of independent variables are subject to missingness, while the remaining covariates are always observed. This situation may arise, for instance, in environmental observational studies as in our example, in biological essays when some independent variables can be controlled, while others not or in epidemiological studies where multivariate survival analyses are performed. The key point is that when treated inappropriately, missing values among covariates may affect the postulation of an appropriate model. As mentioned above, the simple method of deleting from the analysis those cases with missing values, either in the response or the independent variables, may produce biased estimators that may lead to wrong conclusions. For a recent discussion, see for instance, Chen et al. (2008) and Hristache and Patilea (2017). In order to avoid these bias problems and to provide a unified approach to handle missing data, Chen et al.

(2015) consider a missing at random model where responses and covariates are jointly missing. It is worth to emphasize that instead of the quantile regression model studied in Chen

et al. (2015) that concerns the conditional distribution, throughout this paper we use the regression model (1) as a tool to estimate the marginal location measure of interest. With these ideas in mind, to obtain robust marginal estimators under this missing scheme, we will consider the propensity model defined in Chen et al. (2015).

More precisely, throughout this paper, we assume that the observations consist on the triplets , , such that with the vector containing the variables subject to missingness, the dimensional vector containing the always observed variables, . Since our goal is to estimate a marginal parameter, we focus only on the situation in which is one of the components of . The presence indicator variable is such that if all the values in are observed and otherwise. The missing at random model (mar) assumes that


allowing to identify the parameter of interest just in terms of the distribution of the data available at hand. Note that (2) enables to deal with situations in which missing values occur only among the responses and also with those cases where data are missing from the response and a subset of covariates. We refer to Chen et al. (2015) for a thorough analysis regarding the missing scenarios modelled with this framework.

The goal of this paper is to introduce, in the context of the regression model (1), resistant estimators for the marginal location of , say , where is the distribution of , when there are missing values both in the responses and in some (but not all) covariates. When missing data arise only in the responses and all the covariates are fully observed, median estimators have been studied in Zhang et al. (2012) and Diaz (2017), while robust type location procedures have been considered in Bianco et al. (2010) and Sued and Yohai (2013). To deal with the situation in which missing covariates may also arise, we introduce different methods under (2). The first method to be considered is based on the ipw approach introduced in Horvitz and Thompson (1952), where each observation is weighted according to the inverse of the estimated probability of dropouts. The second approach extends the ideas given in Müller (2009) to a robust setting when also covariates may be missing. To this end, it is necessary to have a robust and strongly consistent estimator of the regression function given in (1) that will allow to estimate the errors distribution and the distribution of as well. An estimator of the distribution function of the responses, , is then constructed by considering their convolution. Finally, the obtained marginal distribution estimator ensures that it is possible to obtain robust estimators of the marginal quantity , given through a continuous functional . In particular, in this paper we focus on location estimates.

As when estimating the mean, the estimators of the marginal distribution based on inverse propensity score weighting require a correct postulated propensity model, while those based on the convolution method require also a correct postulated regression model. For that reason, another important novelty of the paper is that we introduce an augmented inverse probability weighting (aipw) that prevents against misspecification of the regression model or the dropouts probability while protecting against atypical observations. As far as we know, when missing data occur on the responses and some (but not all) of the covariates, our estimators proposal gives the first attempt to obtain valid estimates when atypical observations arise, either if the model on the regression function or on the missing probability are correct. In this sense, the new estimator copes with two major purposes: to be robust and double protected.

The paper is organized as follows. Section 2 describes some marginal measures of interest to be used in the sequel. The estimators when missing data occur in the responses and some of the covariates are described in Section 3, where also their asymptotic behaviour is studied. The double protected robust estimator is introduced in Section 4. A numerical study is carried out in Section 5 to examine the small sample properties of the proposed procedures under a nonlinear regression model. The ozone data set is analysed in Section 6, where the advantage of the proposed aipw procedure over the convolution–based approach is illustrated, while some concluding remarks and recommendations are given in Section 7. All proofs are relegated to the Appendix.

2 Notation and preliminaries

Throughout this paper, we denote as the response marginal probability measure and as its related distribution function. Let be any marginal location functional, where we will use indistinctly the notation or . Some examples of usual interest are the marginal mean or median of which are special cases of functionals.

Mmany estimators are defined using a previously computed nuisance parameter estimator. A typical example consists on the traditional marginal location–scale model, where has scale . The scale plays the role of the nuisance parameter and to obtain a scale equivariant procedure, an initial estimator is needed.

From now on, stands for a rho–function as defined in Maronna et al. (2006, Chapter 2), i.e., a function such that

  • is a nondecreasing function of ,

  • ,

  • is increasing for when ,

  • if is bounded, it is also assumed that .

The related location functional of may be defined as

where is the scale functional to be defined below. Usually, with where is a function and is a tuning constant chosen to attain a given efficiency. A common choice for a bounded function is the bisquare Tukey’s function . For example, the choice gives a 95% of efficiency with respect to the mean under normality.

If is continuously differentiable with derivative , when considering the differentiating equations, one has that where . In particular, when


the location estimator distribution is independent of the preliminary scale estimator distribution.

Two well known examples of functionals are the mean and the median which correspond to and , respectively. A feature of the estimators related to these functionals is that, in both cases, it is not necessary to have a preliminary estimation of the nuisance parameter. Moreover, these functionals allow to have a deeper insight on the interpretation of an location parameter either for symmetric or skewed distributions. When is symmetric around , both functionals coincide with . Furthermore, we also have that for any function and (3

) holds for any odd function

. On the contrary, when is skewed, the situation is different. In fact, both functionals are well identified, but they do not coincide; an illustrative example may be the distribution. The same assertion holds for any general location functional.

Since the scale of a distribution measures its dispersion, it is sensible to choose scale estimators that are invariant under translations and equivariant under scale transformations (see Maronna et al. 2006). Among other robust scale functionals, common choices are the mad (median of the absolute values around the median) and dispersion functionals, which are related to scale estimators (see Huber and Ronchetti, 2009). To define the latter, let be a function. One possible choice is , as above, where the user–chosen tuning constant guarantees Fisher–consistency under the underlying distribution. The dispersion functional is then defined as


where and is usually called the location functional. For instance, when and , , leading to the least median location estimator. As mentioned in Maronna et al. (2006), when is bounded, the breakdown point of the scale estimator is , since . When, as in our simulation study, is the Tukey’s bisquare function and we take and

, the estimator is Fisher–consistent at the normal distribution and has breakdown point 50%.

An estimator of may be obtained from a random sample plugging–in an estimator of the marginal distribution function . When all the observations are available, the empirical distribution, , can be computed and thus, the estimator may be defined as . Hence, the location estimator is the value such that

where stands for a robust consistent estimator of the marginal scale of the response variable, such as with defined in (4) and for any . In particular, if , we have that

When missing data arise, the estimators described above cannot be computed in practice or will be biased if only the available observations are used. Section 3 describes some alternatives to solve this problem using the information provided by those covariates that are always observed.

3 Marginal location estimators when missing covariates and responses arise

In this section, we face the problem of estimating an location parameter under the regression model (1), when missing data arise both on the responses and on some covariates and when, at the same time, anomalous responses (vertical outliers) occur.

We will consider an incomplete data set , , where are defined as in Section 1

. The binary variable

is modelled through (2). As mentioned above, among the missing variables we will always include the responses, i.e., is one of the components of .

To adapt to the missing values, two estimators of the marginal probability measure can be defined (see Bianco et al., 2018). The first one is an inverse probability weighting estimator that corrects the bias caused in the estimation by the missing mechanism using an estimator of the missingness probability

. The second one uses the information given by the assumed regression model. For that purpose, a convolution type estimator, as the one described in Müller (2009) for a fully parametric model with missing responses, is defined.

Denote as any of these marginal distribution estimators. Then, an estimator of the location parameter may be defined as . In Section 3.1 and 3.2, we give a precise definition of the location estimators and we study their asymptotic behaviour. In particular, to derive consistency results for the inverse probability weighting and the convolution–based location estimators, some of the following assumptions will be needed

  • , where is the support of the distribution of .

  • .

  • , for any compact set .

3.1 The inverse probability weighted estimator

The ipw estimator exploits the regressors potential to predict the propensity function . More precisely, as it usual when dealing with missing data, using the fully observed data and inverse probability weighting, can be estimated by


where is the point mass at point and is an estimator of the missing probability .

Theorem 3.1 in Bianco et al. (2018) ensures that, under A1 and A2, , where stands for the Prohorov distance. Therefore, for any functional continuous with respect to the Prohorov distance, we have that . Recall that continuity with respect to is a usual requirement when considering robust estimators. In particular, let be a robust consistent estimator of the marginal scale . A possible choice is with the dispersion functional related to a function as defined in (4). In this case, we have that .

Regarding the location estimators, let be a function such that is bounded. Two possible families for the function

may be chosen corresponding to increasing scores or redescending ones. For the first family, the related loss function

is a convex one, such as the well known Huber’s function, while for the second one is a bounded function such as the Tukey’s bisquare function. In the latter, it is usually required , for any , as mentioned in Section 2. Denote as , then is the solution of with


For redesceding functions, in order to identify a proper solution, it is better to define as the value , where

When is a differentiable function with bounded derivative , such that , standard arguments allow to conclude that , since the scale estimators are consistent and .

Different location estimators are obtained according to the procedure chosen to estimate the missing probability. Under certain experimental designs, the propensity may be assumed to be known. This can also be though as the case of an oracle situation. In contrast, if is unknown, it may be estimated using a either nonparametric approach or a parametric one based on previous information.

More precisely, when the propensity is fully known, the marginal estimator denoted solves , i.e., we have that


When a parametric model is assumed for the missing probability, usually the propensity is estimated by plugging–in a consistent estimator of the unknown parameter , where the dimension may be different from , the dimension of . More precisely, let be any consistent estimator of , i.e., such that . Hence, the estimator of the missingness probability defined as satisfies A2 if is equicontinuous in at . This condition holds, for instance, when is a continuous function of all its arguments and the support is a bounded set. Under a parametric model, we denote as the solution of , i.e.,


Finally, when considering a nonparametric smoother, a kernel estimator of the propensity may be defined as where


with a kernel function and the smoothing parameter. In this case, if is a uniformly continuous function, , and , where is a bounded variation function, analogous arguments to those considered in Chapter 2 of Pollard (1984) allow to show that A2 holds. We will denote as , the marginal estimators obtained when using the nonparametric estimator of the missingness probability, that is, solves


and provides strongly consistent estimators, under A1 and the conditions on the kernel and bandwidth mentioned above.

It is worth mentioning that, under A1, A2 holds and regularity conditions on the function , , and are strongly consistent to .

From now on, let be a random vector with the same distribution as , where, as above, . Henceforth, we will denote by .

Theorem 3.1.1 summarizes the asymptotic behaviour of , , when the missingness probability is either known or estimated under a parametric model or using a kernel approach. To establish their asymptotic distribution assumptions N1 to N7 given in the Appendix are needed.

As mentioned in Section 2, for data with no missing observations, if


the asymptotic distribution of the location estimator does not depend on that of the scale estimator and only its consistency is required. The inverse probability weighting estimators have the same behaviour as shown in Theorem 3.1.1. Among other scale consistent estimators, the practitioner may choose the mad of or more generally, an dispersion estimator . Remark 3.1.1 discusses the situation in which does not satisfy (11), which includes skewed distributions.

For simplicity of notation, we denote as and as


where stands for the gradient of with respect to .

Theorem 3.1.1. Assume that A1, N1 and N2 hold. Let be such that (11) holds. Furthermore, let be a scale estimator such that .

  • Let be defined in (7). If , we have that , where

  • Assume that , where . Let , with an estimator of such that . Denote as the ipw estimator given by (8). If, in addition, and N3 and N4 hold, we have that , where with

    and and given in N4.

  • Let be the kernel estimator defined in (9) and be the ipw estimator defined in (10). If, in addition, and N5 to N7 hold, we have that , where and

Remark 3.1.1. Even when the propensity is assumed to be known, the efficiency with respect to the ipw mean estimator depends on the proportion of missing data appearing in the sample, an effect that has been already addressed in the literature when missing values arise only in the responses. Besides, since , we have that and so, the marginal location estimator computed estimating the missing probability through a kernel estimator is more efficient than that computed with the true propensity, . As discussed among others in Wang et al. (1997), the better efficiency of the marginal location estimator

is related to the sample adjustment obtained through the propensity kernel estimator. The reader can find an heuristic justification of this behaviour for regression estimators in Robins

et al. (1994) when only covariates are missing.

Note that since we focus on marginal measures, we have considered location estimators to protect against outliers in the responses. The situation where atypical data in the covariates used to model the propensity is not considered here and we refer to Molina et al. (2017) for further discussion. In particular, if the propensity is modelled nonparametrically the lack of atypical observations in the covariate space is a usual assumption to avoid isolated points.

Remark 3.1.2. When has a skewed distribution, one cannot ensure that (11) holds, which is a condition used to guarantee that only consistency is required to the scale estimator . To solve this problem, a Bahadur expansion for the scale estimators is useful to derive the results. For that purpose, assume that the scale is related to an estimator with function , i.e., that the scale equals with defined in (4) and recall that . Using analogous arguments to those considered in Sued and Yohai (2013), it is easy to see that the asymptotic variance of the estimators can be obtained as in Theorem 3.1.1, if we replace in the expressions for , given above, the function by

with defined in (4) and

3.2 The convolution based marginal estimator

For the situation in which the missing values are restricted to occur only on the responses, Müller (2009) and Sued and Yohai (2013) noted that a different estimator of may be obtained using the regression model and the fact that is the convolution of the errors and the regression function distributions. From now on, we denote as and the distribution function of the errors and of the true regression function , respectively. The probability measures and are defined similarly. Using the convolution property, i.e., , a consistent estimator for can be obtained plugging–in consistent estimators and of and , respectively. More precisely, the

fully imputed

estimator introduced in Müller (2009) was adapted to the situation of missing values in the responses and covariates in Bianco et al. (2018) with the purpose of estimating the marginal quantiles. We recall its definition.

Let be a consistent estimator of . This consistent estimation can be accomplished in different ways according to the model structure assumed on the regression function which may be parametric, nonparametric or semiparametric. Bianco et al. (2018) illustrates through a detailed discussion how classical consistent estimators of may be obtained in different regression scenarios when missing observations occur in the responses and some of the covariates. However, the estimators defined therein are sensitive to atypical observations since they are mainly based on a least squares approach. It is worth to highlight that in our framework, besides consistency, robustness is also a desirable property for the estimators . Remark 3.2.1 discusses some robust consistent alternatives when a parametric model is considered.

Using the robust regression estimator , define


where the weights are normalized to guarantee that is a probability measure. Note that (13) involves not only the regression estimator but also, due to the missingness of some covariates, a propensity estimator . Hence, to avoid biases in the estimation of , both the regression and the propensity models must be correctly specified.

When , the residuals can be effectively predicted as , so that an estimator of can be computed as , with . The convolution–based estimator of is then defined as . As when missing values arise only on the responses, is a weighted empirical distribution since it can be written as , where , for .

Under mild conditions, is a consistent estimator of , since condition (2) holds. More precisely, if A1 to A3 hold, Theorem 3.2 in Bianco et al. (2018) entails that , which leads to the strong consistency of .

As above, let be a robust consistent estimator of the marginal scale , for instance, the mad of or with defined in (4). Note that is the solution of where

As in Section 3.1, we denote respectively as and the convolution–based estimators obtained assuming that the propensity is known () and that the propensity is estimated using a parametric model, that is, and with an estimator of .

Theorem 3.2.1 below provides the asymptotic distribution of , , when (11) holds and has a parametric form, i.e., when as stated in assumption N8. Otherwise, when (11) does not hold, as in Section 3.1, a Bahadur expansion for is needed to obtain an expression for the asymptotic variance of .

Let where is defined in N8 and


with the gradient vector of the function with respect to . Furthermore, when the propensity is estimated using the parametric model , define where is given in N4 and

Theorem 3.2.1. Let be such that (11) holds. Assume that A1, N1, N2 and N8 hold, is bounded and .

  • Denote as the solution of . If , we have that , where and .

  • Assume that , where . Let , with an estimator of such that and denote as the solution of . If and N3 and N4 hold, we have that , where with