Robust semiparametric estimators: missing data and causal inference

by   Eva Cantoni, et al.

Semiparametric inference with missing outcome data (including causal inference) is based on partially specified models which are not of direct interest (e.g., model for missingness/treatment assignment mechanism). Different class of estimators exist, which are more or less robust to misspecification of these models. Another type of threat to the validity of the inference occur in situations where some observations are contaminated (generated by some nuisance distribution). Classical semiparametric inference is not robust to such contamination, and a single observation may have an arbitrary large effect on bias as measured by the influence function. We introduce inverse probability weighted, double robust and outcome regression estimators of location and scale parameters, which are robust to contamination in the sense that their influence function is bounded. We give asymptotic properties and study finite sample behaviour. Our simulated experiments show that contamination can be more serious a threat to the quality of inference than model misspecification. An interesting aspect of our results is that the auxiliary outcome model used to adjust for ignorable missingness (confounding) is also useful to protect against contamination. We also illustrate through a case study how both adjustment to ignorable missingness and protection against contamination are achieved through weighting schemes, which can be contrasted to gain further insights.



There are no comments yet.


page 1

page 2

page 3

page 4


Causal Inference: A Missing Data Perspective

Inferring causal effects of treatments is a central goal in many discipl...

Navigated Weighting to Improve Inverse Probability Weighting for Missing Data Problems and Causal Inference

The inverse probability weighting (IPW) is broadly utilized to address m...

Inverse Conditional Probability Weighting with Clustered Data in Causal Inference

Estimating the average treatment causal effect in clustered data often i...

A stableness of resistance model for nonresponse adjustment with callback data

The survey world is rife with nonresponse and in many situations the mis...

Covariate balancing for causal inference on categorical and continuous treatments

We propose novel estimators for categorical and continuous treatments by...

Double Robust Semi-Supervised Inference for the Mean: Selection Bias under MAR Labeling with Decaying Overlap

Semi-supervised (SS) inference has received much attention in recent yea...

Regression-based causal inference with factorial experiments: estimands, model specifications, and design-based properties

Factorial designs are widely used due to their ability to accommodate mu...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many data analyses are concerned with drawing inference on a parameter partially characterising a distribution law of interest from which data is assumed to be a random sample. However, most often the observed data deviates from this ideal random sample scenario, for instance such that some of the observations are contaminated, i.e. drawn from a nuisance distribution law. Another common deviation is that the random sample is incomplete: some data are missing, due to dropout in follow up studies, non-response in surveys, etc. Such corrupted random samples are indeed the rule rather than the exception in applications, and we give a telling example in Section 4

, where we study BMI change in a ten year follow up study. While methods are available to deal with these two different problems separately as described below, it is essential to have inferential methods able to deal with situations where both types of corruption (missingness and contamination) arise simultaneously. Indeed, while it is well known that many estimating procedures, including OLS, ML, and method of moments, lack robustness to contamination

(a single observation can have arbitrarily large influence, e.g., Hampel et al., 1986; Heritier et al., 2009), it is seldom acknowledged that this sensitivity to contamination can be exacerbated with estimators able to deal with missing data; see, however, Hulliger (1995) and Beaumont et al. (2013), where this increased sensitivity has been pointed out in the context of surveys of finite populations. The potential increase sensitivity to contamination arises in particular for estimators overweighting some of the observations (those representing part of the data which are missing), and if the overweighted observations are by chance contaminated, this will have large negative impact on the inference. Thus, while robust methods are important in general, they are even more so when missing data needs to be accounted for.

In this paper we focus on situations where, while observations may be missing for the response, there is a set of background variables (covariates) which are observed for all units, and we can assume that outcomes are independent of missingness given the observed covariates (ignorable missingness assumption). Under the latter assumption, auxiliary (nuisance) models explaining the missingness mechanism and the outcome given the covariates can be combined in different ways to obtain semiparametric estimators of . Classical examples include inverse probability weighted estimators (IPW), using the missingness mechanism model as weights (Horovitz and Thompson, 1952), and augmented inverse probability weighted estimators (AIPW, Robins et al., 1994) using both auxiliary models. AIPW estimators are then robust to the misspecification of one of these two auxiliary models at a time (thus the name doubly robust estimator often used); see, e.g., Tsiatis (2006), Rotnitzky and Vansteelandt (2015)

. Finally, outcome regression imputation (OR) estimators using only the model for the outcome may also be used, thereby avoiding weighting

(Kang and Schafer, 2007; Tan, 2007).

Within this context of ignorable missing data in the outcome, we introduce and study estimators that are able to deal with situations where most of the units in the sample are randomly drawn from the distribution of interest while a smaller number of units is possibly drawn from another nuisance distribution. An estimator is considered robust to such contamination if it has bounded influence function, see Hampel et al. (1986). This is because the influence function measures the asymptotic bias due to an infinitesimal contamination. A single observation can thus yield arbitrarily large bias if the influence function of the estimator is not bounded. Classical IPW, AIPW and OR estimators have unbounded influence function. They are not robust in this sense, even though AIPW has a robustness property, but merely to misspecification of one of the auxiliary models used. In a full data and finite parametric context, bounded influence function estimators are most naturally introduced as M-estimators (Huber, 1964; Hampel, 1974). Here we take advantage of the fact that IPW, AIPW and OR estimators are partial M-estimators (Newey and McFadden, 1994; Stefanski and Boos, 2002; Zhelonkin et al., 2012) to propose bounded influence function estimators. An interesting result of the introduced estimators is that the auxiliary outcome regression model used by AIPW to improve on efficiency compared to IPW, happens to also be useful in improving on the robustness properties of AIPW and OR. Robustness to contamination is typically obtained at the price of a loss in efficiency, although the latter can be controlled and set to say approximately 5% under some conditions. On the other hand, our simulated experiments show that moderate contamination seriously affects the quality of classical semiparametric inference, more so than model misspecification. Our approach is general and we fully spell out the case where is the two dimensional location-scale parameter.

The paper is organized as follows. Section 2 presents formally the context, and introduces robust estimators for missing data situations, together with their asymptotic properties. Section 3 studies finite sample properties through simulation designs previously used by Lunceford and Davidian (2004)

, to which we have added several contamination schemes. This allows us to study robustness due both to model misspecification and to contamination. In Section 4 a longitudinal study of BMI based on electronic record linkage data is used to illustrate, e.g., how the robustification introduced can be seen as a weighting scheme which can be compared to the weighting used to correct for ignorable missingness. The paper is concluded with a discussion in Section 5. Regularity conditions, proofs, implementation details and exhaustive results from the simulations are relegated to the Appendix.

2 Theory and method

2.1 Notation and context

Let a vector variable

be partitioned as , and consider the ideal situation when , are independently drawn from a probability law with density for unknown values and , where , of finite dimension, is the parameter of interest describing some aspects of the distribution, is a nuisance parameter possibly of infinite dimension, and and are variationally independent (semiparametric model; see Tsiatis, 2006, Chap. 4). We consider simultaneously two types of deviation from the above ideal random sample setting.

First, situations where atypical observations can occur in (and possibly ), i.e. where the majority of the data is generated as described above, but some of the observations may be issued from a different, but unknown, distribution. The final goal is to draw inference about , even in the presence of a small fraction of spurious data points.

Further, we also want to allow for incomplete data situations, where we observe only , , with

a binary variable indicating the observation status of

: if observed and if missing. We make throughout the missing at random assumption (also called ignorable missingness), i.e. , with on the support of . The missing assignment mechanism is modelled up to a parameter , , and we distinguish cases where this model is correctly specified, i.e. for a given but unknown , and cases where it is misspecified, i.e. an incorrect model for is used.

2.2 Full data case: robust M-estimators

Let us first consider an estimating function , which would be used if we had no missing data ( for all ):


The choice of may be done based on desired properties for the resulting M-estimator for (in the complete data case); e.g., such that for consistency. The study of robustness properties to contamination was formalised in Hampel (1974). The influence function plays a central role because it can be interpreted as measuring the asymptotic bias due to an infinitesimal contamination. Here, the influence function for the resulting estimator solution of (1) is


under suitable regularity conditions (Stefanski and Boos, 2002).

In the sequel we focus on the location-scale parameter . A commonly used choice is , because the resulting estimator is efficient in the Gaussian case. For this choice of estimating function, the influence function will not be bounded in and therefore not robust to contamination; see, e.g., Maronna et al. (2006, Chap. 2). A general class of M-estimators for and are solution of (1) for



is an odd function, and where

and in order to ensure that at , the true unknown value for . Bounded influence function estimators are obtained by using bounded functions, e.g., the Huber function , and the Tukey biweight function if and otherwise, see Heritier et al. (2009) for further details. The value for can be chosen appropriately to control efficiency under the non-contaminated Gaussian case. Equations (1) using (3) need to be solved simultaneously for and .

2.3 Robust estimation with missing data

Semiparametric estimation with missing data has been reviewed for instance in Tsiatis (2006). We introduce below novel bounded influence function estimators.


be a well specified parametric model, i.e. such that for

for an unknown value . Assume that we have an estimator of solution of estimating equations


such that .

Definition 1.

A robust inverse probability weighted (RIPW) estimator
of is solution of the estimating equation:



with and

A similar estimator was proposed and studied in Hulliger (1995) in the context, however, of finite populations and surveys. Note that letting , the identity function, yields a classical inverse probability weighted estimator (Horovitz and Thompson, 1952).

Remark 1.

RIPW estimation can be interpreted as a double weighting scheme estimator, where observations are weighted with inverse propensity scores (i.e., observations lying on the covariate support where the probability of dropout is higher are overweighted) and with weights (i.e., outlying observations are downweighted). These weights as well as the compound weights may be looked at in applications to gain insight in how the two weighting schemes interact; see Section 4 for an illustration.

Proposition 1.

Let be correctly specified with (4) such that . Then, under regularity conditions given in Appendix B, is consistent for

and has the following asymptotic multivariate normal distribution as

where is the influence function:


Thus, from (6) we see that the influence function of RIPW is bounded in if the function is bounded. This is not the case for the classical IPW, cor responding to .

The implementation of RIPW requires the computation of and . If the standardized quantity is satisfactorily approximated by a variate, then (since is odd) and can be approximated by Monte Carlo simulations.

In an attempt to improve efficiency one may consider a working model (parametrised with ) for . This model is correctly specified for if for a value . However, we call it working model because we will also consider situations where it is not necessarily correctly specified. Assume we have estimators of and of , respectively solutions of (4) and


such that and , for some fix values and . In the correctly specified cases and .

Definition 2.

A robust augmented IPW (RAIPW) estimator of is solution of the estimating equation:



is a working model for and for , and and

Using the identity function for yields a classical augmented inverse probability weighting (AIPW) estimator (Robins et al., 1994).

Proposition 2.

Let be correctly specified with (4) such that
and/or let be correctly specified with (7) such that . Then, under regularity conditions given in Appendix A, is consistent for and has the following asymptotic multivariate normal distribution as



Thus, RAIPW is as AIPW doubly robust in the sense that only one of the two auxiliary models used must be correctly specified in order to obtain consistent and asymptotic normal estimators. Moreover, the influence function of RAIPW is bounded in if the function is bounded (assuming the estimating equation (7) of the auxiliary model has also bounded influence function in ; see Exemple 1), while this is not the case for the classical AIPW.

Example 1 (RAIPW estimator of location and scale).

Let us specify a working model parametrized by as


with ). Note that this does not constrain to have a symmetric distribution as was the case for RIPW. The corresponding working model for is such that and . Estimators of with bounded influence function in this context are described in Appendix C.2. Then,


may be computed using numerical integration for the conditional expectations, where is the expectation under model (10). Under the latter model, and Monte Carlo simulations can be used to obtain B. Both (11) and (12) can be used to obtain RAIPW estimators through (8). See Appendix C for more implementation details.

Finally, when the outcome model is correctly specified yet another robust estimator can be introduced.

Definition 3.

A robust outcome regression estimator (ROR) of () is solution of


by using a correctly specified working model together with , an M-estimator (7) of with bounded influence function.

Proposition 3.

Let be correctly specified with (7) such that . Then, under regularity conditions given in Appendix B, is consistent for and has the following asymptotic multivariate normal distribution as


Example 2 (ROR estimator of location and scale).

Within the context of Example 1, assume that model (10) holds. Then, , and is estimated with a bounded influence function estimator; see Appendix C.2 for details.

Unlike for RAIPW, the regularity conditions apply to the working model only. For instance, to characterise the influence function one need to be more specific about the working model (which needs to be correctly specified). On the other hand, the results of Propositions 1 and 2 for RIPW and RAIPW respectively give specifically the regularity conditions that must apply to the functions used, and the influence functions resulting.

We have focused on robustness properties to contamination in the outcome . Contamination in the covariates may also happen. This is typically tackled by using the Tukey’s redescending function which protect against high leverage points, i.e. outlying values in the design space; see, e.g., Maronna et al. (2006, Chap. 4 and 5) and Cantoni and Ronchetti (2001).

3 Simulation experiments

We present a large simulation exercise to assess several aspects of our procedure for the joint estimation of location and scale: behaviour for clean data, robustness to the presence of contamination, and sensitivity to model misspecification.

3.1 Simulation setting

We implement the same simulation design as Lunceford and Davidian (2004). We consider the covariates associated with both the missingness mechanism and the outcome, and the covariates which are associated only with the outcome. The variables

are realizations of the joint distribution of

built by first taking . Then, conditionally on , is generated as Bernoulli with and finally is taken from a multivariate normal distribution , where , and

For each individual , the missingess mechanism indicator is generated as a Bernoulli variable with probability of missingness () defined by

which corresponds to the control group in Lunceford and Davidian (2004).

The response is generated according to the model


where and in our notation .

The parameter values are kept fixed throughout, whereas different scenarios are considered for and , namely




Notice that when , is associated with neither the outcome nor the missingness mechanism. The values of and are such that lower response values and lower probabilities of missingness are obtained when , and conversely when .

We generate realisations of size and , called clean datasets, i.e. free of contamination. We present results for , while the larger sample size confirmed the results and are omitted. Departing from the clean datasets, we obtain corresponding contaminated datasets according to different schemes as we describe in Section 3.3.

The combination of parameters in (16) and (17) gives six designs. For each design, we fit a total of 20 estimators of . They differ in the choice of estimation strategy (IPW, AIPW, OR), whether they are in their classical or robust versions, and whether the auxiliary models are misspecified or not. Thus, we consider
IPW(), AIPW(), AIPW(), OR() and OR(),
and their robust versions
RIPW(), RAIPW(), RAIPW(), ROR() and ROR(),
where the covariate sets used in the auxiliary models are given within parentheses, and, e.g., AIPW(), means that the first set is used to explain and the second set is used to explain . All these estimators use well specified auxiliary models. We, moreover, consider estimators using misspecified auxiliary models as follows:
IPW(), AIPW(), AIPW(), AIPW(), and OR(),
and their robust versions
and ROR(),
where and . Auxiliary models explaining and

are fitted using, respectively, logistic regression and ordinary least squares for the classical versions, and robust logistic regression and robust linear regression for the robust versions. For RIPW and RAIPW estimators Tukey’s

function is used in (8). Tukey’s function is usually preferred over Huber’s with asymmetric contamination. The robust estimators are tuned to have approximately efficiency at the correctly specified models for clean data. The values of the corresponding tuning constants are given in Appendix D.1. For details on the computation see Appendix C.

3.2 Results for clean data

The top half of Figure 1 summarises with boxplots the estimates of (left) and (right) for clean data, i.e. when the 1000 replicates are generated from the design introduced in Section 3.1, with moderate and moderate. The first row of panels show that for both and all the estimators (classical and robust) except RIPW are, as expected, unbiased. The bias of RIPW is due to the correction terms ( and in (5)) which are in this setting badly approximated based on the assumption that is normally distributed. This is improved for RAIPW, because the use of the outcome model allows for a better approximation of the correction terms. Also as expected, (R)IPW is more variable than (R)AIPW.

Figure 1: Estimates of (left) and (right) for the moderate- moderate scenario for clean data and under the C-asym contamination. The vertical lines represent the true underlying values.

The second row of panels confirms some other well known properties of the (A)IPW estimators: the bias due to misspecification of the missingness mechanism for IPW, the double robustness property of AIPW (i.e., unbiasedness if only one of the auxiliary models is misspecified) and the sensitivity of the OR estimator to the misspecification of the outcome regression model. Essentially, these properties are preserved for the robust versions introduced herein. These results are also summarised numerically in Table 4 (Appendix D.3

), where bias, standard deviations and root mean squared error of the estimators are reported. The results for the other five combinations of parameters ((

16) - (17)) deliver a similar general message, with different magnitudes. The corresponding figures and tables supporting this claim are provided in Appendix D.4.

3.3 Results under contamination

With the result above that expected behaviours are obtained with clean data, we study now the effect on estimation of deviations from the data generating mechanism of interest. To generate a contaminated sample, of the observed responses (i.e. data points for which ) issued from model (15) were randomly chosen and changed to the realization of:




, where is Bernoulli(probability=0.5),



For the C-asym case, the range of the uniform distribution has been set such that it falls approximately outside the observed range of clean

. C-sym is the symmetric version of C-asym, and the C-hidden case is such that the contamination is not clearly visible when looking at the observed marginally. Figure 2 displays a realization of each scheme for the scenario with moderate and moderate: the values of are plotted against , the linear predictor of the outcome model (15), with a histogram of the marginal distribution of .

C-asym C-sym C-hidden
Figure 2: One realization of size 1000 for the three contamination schemes considered: black circles observed outcomes () and circles unobserved outcomes (). The histograms are over the observed outcomes .

We present the results for the moderate and moderate design and contamination C-asym in bottom half of Figure 1. The results for the other - combinations are given in Appendix D.4. The third row of panels of Figure 1 displays the results for the correctly specified models. We can see that for all the classical methods suffer a negative bias (underestimation) due to the presence of contamination, and these bias are of similar magnitude. For , the biases of the classical methods are positive (overestimation), with OR estimators being even more affected than (A)IPW estimators. On the other hand, RAIPW and ROR perform well, producing estimates in target with the true underlying values. A slight negative bias remains for RAIPW(X,XV) for . However, the size of this bias is negligible compared to the bias induced by contamination on the classical estimators.

The fourth row of panels of Figure 1 shows the effects of both misspecification and contamination. We observe that the bias due to contamination is more severe than bias due to misspecification in the setting simulated (compare with the second row of the same figure, i.e. clean data).

Root mean squared errors (RMSE) are given in Table 1, yielding further insights. In particular, while AIPW and OR with correct model specification were comparable in terms of empirical RMSE in the clean data designs (see RMSE tables in Appendix D.3), we observe that ROR outperforms RAIPW in this contamination case, particularly so when estimating (upper half of Table 1

). In fact we see that ROR has both lower bias and variance in this setting. When model misspecification occurs (lower half of Table

1), ROR also outperforms RAIPW if the latter misspecifies both auxiliary models, otherwise RAIPW has lowest RMSE for estimation of when one of the model used is correct.

Results for the other parameter combinations under C-asym contamination carry similar messages, with different magnitudes; see figures and tables in Appendix D.4.

() bias() sd( bias() sd(
IPW(X) -8.835 1.506 8.962 15.903 1.511 15.974
AIPW(X,X) -8.864 1.266 8.954 15.941 1.308 15.994
AIPW(X,XV) -8.858 1.254 8.946 15.947 1.297 15.999
OR(X) -8.873 1.292 8.967 17.684 1.023 17.714
OR(XV) -8.866 1.278 8.958 17.774 1.008 17.803
RIPW(X) 3.583 1.702 3.966 -2.370 2.051 3.134
RAIPW(X,X) 0.182 1.477 1.487 -1.233 1.392 1.859
RAIPW(X,XV) 0.165 1.452 1.460 -1.207 1.357 1.815
ROR(X) 0.017 1.298 1.297 0.295 1.021 1.062
ROR(XV) 0.026 1.274 1.274 0.194 0.980 0.999
IPW() -7.618 1.443 7.754 15.815 1.451 15.881
AIPW() -8.865 1.261 8.954 16.153 1.292 16.204
AIPW() -8.845 1.261 8.934 15.939 1.297 15.991
AIPW() -8.088 1.247 8.184 15.947 1.283 15.999
OR() -8.091 1.264 8.189 17.559 1.003 17.588
RIPW() 4.834 1.656 5.109 -2.911 1.898 3.474
RAIPW() 0.113 1.447 1.451 -1.193 1.330 1.786
RAIPW() 0.244 1.480 1.499 -1.256 1.459 1.924
RAIPW() 1.034 1.452 1.782 -1.616 1.386 2.128
ROR() 0.853 1.291 1.547 0.041 1.040 1.041
Table 1: Bias, standard deviation and root mean squared error (times 10) of the estimates across simulations for the moderate- moderate scenario under C-asym contamination.
Figure 3: Estimates of (left) and (right) for the moderate- moderate scenario under the C-asym and C-hidden contamination. The vertical lines represent the true underlying values.

Results for the C-sym contamination are summarized in the top half of Figure 3, which shows the same patterns as C-asym in Figure 1, with a major difference that the biases of RAIPW(X) and RAIPW(XV) in the estimation of have now disappeared. Finally, the C-hidden configuration, whose results are summarized in the bottom half of Figure 3, confirms the expectation that this contamination scheme is most challenging. Here RAIPW is clearly biased for both and , and ROR behaves best both in terms of bias and efficiency. This is due to the fact that ROR only considers the conditional distribution of given , through the outcome regression model, where the contamination is most visible, while RAIPW considers this conditional distribution but also the marginal of where the contamination is hidden. RMSE tables provided in Appendix D.3 confirm the visual impression of the figures.

4 Application: BMI change

Figure 4: BMI change after ten years versus BMI measured at 40 years of age for 5553 men living in the county of Västerbotten in Sweden and born 1950-58. In red baseline BMI observed for the 2002 men not returning at follow up.

We illustrate the methods presented in this paper with a population based 10 year follow up study of body mass index (kg/m, BMI). The analysis is performed on data from the Umeå SIMSAM Lab database (Lindgren et al., 2016), which makes available record linkage information from several population based registers. In particular the database includes BMI data from an intervention program where all individuals living in the county of Västerbotten in Sweden and turning 40 and 50 years of age are called for a health examination. Thanks to the Swedish individual identification number, this collected health data can be record linked to population wide health and administrative registers, which allows us to retrieve useful auxiliary information on individuals hospitalisation and socio-economy adding to self reported variables available from the intervention program.

We consider men born between 1950 and 1958 and observe their BMI when they turn 40 years of age as well as at a 10 year follow up. Figure 4 displays a scatter plot of BMI change versus BMI at baseline for the 5553 men who came back at the follow up examination when they turned 50 out of the 7555 that were measured at baseline (40 years of age). Extreme BMI values are observed (both at baseline and changes at follow up), giving an interesting case to illustrate the robust estimators introduced herein.

The set of baseline covariates used in the auxiliary models fitted are: (from the health examinations) measured BMI, self reported health, and tobacco use; (from Statistics Sweden registers) education level, number of children under 3 years of age, log earnings, parental benefits, sick leave benefits, unemployment benefits, urban living; (from the hospitalisation register) annual hospitalisation days. These variables are used to explain dropout () using a logistic regression and BMI change () using linear regression. Estimation of these auxiliary models is performed using maximum likelihood or robust regression methods depending on the estimators used. More details on baseline covariates and results on the estimation of the auxiliary models are provided and discussed in Appendix E.

1.43 1.36 1.38 1.41 1.34 1.34 1.34
s.e. 0.039 0.073 0.062 0.044 0.024 0.024 0.026
2.87 3.67 3.64 2.88 1.56 1.58 1.66
s.e. 1.524 0.700 0.665 0.264 0.025 0.025 0.024
Table 2:

Estimates and their standard errors (s.e.) for

and using the different estimators defined in the article, and where ”CC” stands for complete case sample moments.

Table 2 displays estimated mean () and standard deviation (

) in BMI change. The estimators used are the naive complete case (CC) sample mean and standard deviation (i.e., not taking into account selective dropout and outlier contamination); IPW, OR, and AIPW taking into account selective dropout but not outlier contamination; and RIPW, ROR and RAIPW, taking into account both selective dropout and contamination.

Figure 5: Tukey’s weights versus inverse of fitted propensity scores and compound weights (product of Tukey’s weights and inverse propensity scores) used in RIPW.

All three introduced robust versions (RIPW, RAIPW, ROR) give similar estimates of mean and variance BMI change. Robust estimation seems to be here most relevant for the scale parameter , where the robust versions are two BMI units smaller than the non-robust versions. This has also consequence for the estimation of standard errors for , IPW, OR and AIPW having standard errors 50% larger than their robust counterparts. In summary, while robust estimation does not yield notably different results in mean BMI change, the results indicate a clear overestimation of the variability in BMI change if the non-robust estimators are used. Furthermore, correcting for selective dropout without taking into account contamination yields even larger overestimation of the variability of BMI change than using the naive estimator.

Taking into account selective dropout and contamination can both be seen as weighting schemes as described in Remark 1. Thus, the propensity score weighting and the Tukey weighting, as well as the compound weights, are plotted in Figure 5 against each other. This plot highlights which individuals are downweighted due to outlying BMI change and overweighted due to selective dropout. We observe that one observation has very high inverse propensity score and zero Tukey weight. This observation corresponds to the outlying individual with BMI at baseline close to 100 and large BMI decrease (see Figure 4). Thus, while IPW and AIPW give it a large weight because it lies in a region with high probability of missingness, its outlying nature is noticed by the robust estimators which discard it from estimation. Its overweight in IPW and AIPW estimation is a contributing factor to their seemingly overestimation of noticed above.

5 Discussion

In this paper we have studied semiparametric inference when outcome data is missing at random (ignorable given observed covariates) and the observed data is possibly contaminated by a nuisance process. We have proposed estimators which have bounded bias for an arbitrary large contamination. Many alternative estimators have been proposed in the literature (Rotnitzky and Vansteelandt, 2015, for a review). In order to obtain bounded influence function versions of those, the approach presented here can be followed. In particular, an interesting family of AIPW estimators are those which are bounded in the sense that they cannot produce an estimate outside the range of the observed outcomes (Tan, 2010; Gruber and van der Laan, 2010), in order to avoid the inverse probability weighting to have too drastic consequences. Such estimators are still not robust to contamination (unbounded influence function) and may therefore be robustified as proposed herein. Moreover, while we have focused on the location-scale parameter of the marginal distribution of the outcome of interest, the framework can readily be extended to other parameters, e.g., parameters of the conditional distribution of the outcome given some covariates, and to causal parameters defined using the potential outcome framework (Rubin, 1974).

Finally, an interesting aspect of our results is that the auxiliary outcome regression model is not only useful to improve efficiency over IPW estimation but it is also useful for the robustness properties of the RAIPW and ROR estimators. This is the case in two ways, first because it allows us to relax assumptions (otherwise commonly made in robust estimation of location and scale) on the marginal distribution of the outcome, and second because it allows us to deal with contaminations which may be hidden when looking at the marginal distribution of the outcome, but become apparent in the conditional distribution of the outcome given the covariates.


We are grateful to Ingeborg Waernbaum and Chris Field for comments that have improved the paper. We acknowledge the support of the Marianne and Marcus Wallenberg Foundation, the CRoNoS COST Action IC1408 and the Einar and Per Wikström Foundation. The Umeå SIMSAM Lab data infrastructure used in the application section was developed with support from the Swedish Research Council and by strategic funds from Umeå University. The simulations were performed at University of Geneva on the Baobab cluster.


  • Beaumont et al. (2013) Beaumont, J.-F., Haziza, D., and Ruiz-Gazen, A. (2013). A unified approach to robust estimation in finite population sampling. Biometrika100,(3), 555–569.
  • Cantoni and Ronchetti (2001) Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association96,(455), 1022–1030.
  • Gruber and van der Laan (2010) Gruber, S. and van der Laan, M. J. (2010). An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. The International Journal of Biostatistics6,(1).
  • Hampel (1974) Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association69,(346), 383–393.
  • Hampel et al. (1986) Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.
  • Heritier et al. (2009) Heritier, S., Cantoni, E., Copt, S., and Victoria-Feser, M.-P. (2009). Robust Methods in Biostatistics. Wiley-Interscience.
  • Horovitz and Thompson (1952) Horovitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association47,(260), 663–685.
  • Huber (1964) Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statististics35,(1), 73–101.
  • Hulliger (1995) Hulliger, B. (1995). Outlier robust Horvitz-Thompson estimators. Survey Methodology21,(1), 79–87.
  • Kang and Schafer (2007) Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science22,(4), 523–539.
  • Lindgren et al. (2016) Lindgren, U., Nilsson, K., de Luna, X., and Ivarsson, A. (2016). Data resource profile: Swedish microdata research from childhood into lifelong health and welfare (Umeå SIMSAM Lab). International Journal of Epidemiology45,(4), 1075–1075.
  • Lunceford and Davidian (2004) Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine23,(19), 2937–2960.
  • Maronna et al. (2006) Maronna, R., Martin, R., and Yohai, V. (2006). Robust statistics. Wiley New York.
  • Newey and McFadden (1994) Newey, W. K. and McFadden, D. L. (1994). Large sample estimation and hypothesis testing. In R. F. Engle and D. L. McFadden (Eds.), Handbook of Econometrics, Volume IV, Chapter 36, pp. 2111–2245. Amsterdam: Elsevier Science.
  • Robins et al. (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association89,(427), 846–866.
  • Rotnitzky and Vansteelandt (2015) Rotnitzky, A. and Vansteelandt, S. (2015). Double-robust methods. In G. Molenberghs, G. Fitzmaurice, M. G. Kenward, A. Tsiatis, and G. Verbeke (Eds.), Handbook of Missing Data Methodology, Chapter 9, pp. 185–212. London: Chapman and Hall/CRC.
  • Rubin (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology66,(5), 688–701.
  • Stefanski and Boos (2002) Stefanski, L. A. and Boos, D. D. (2002). The calculus of M-estimation. The American Statistician56,(1), 29–38.
  • Tan (2007) Tan, Z. (2007). Comment: Understanding OR, PS and DR. Statistical Science22,(4), 560–568.
  • Tan (2010) Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika97,(3), 661–6823.
  • Tsiatis (2006) Tsiatis, A. (2006). Semiparametric theory and missing data. Springer Science & Business Media.
  • Zhelonkin et al. (2012) Zhelonkin, M., Genton, M. G., and Ronchetti, E. (2012). On the robustness of two-stage estimators. Statistics & Probability Letters82,(4), 726–732.

Appendix A Proposition 2

We give below proofs of consistency and asymptotic normality for the RAIPW estimator. Assumptions made in Section 2 on the data generating mechanism hold throughout. Let us use the simpler notation for
given in (8), where the dependence on the data is shown merely by the index . For convenience, the estimator defined as solving (8) is instead (and equivalently when such a solution exists) defined as a minimum distance estimator (between the empirical moment and zero):


where . This allows us to utilize some general asymptotic results given in Newey and McFadden (1994) as specified below. Proposition 2 given earlier is a summary of the two following propositions (Proposition 2 (consistency) and Proposition 2 (asymptotic normality)).

a.1 Consistency

Regularity conditions

  • i) , and ii) ;

    • is differentiable with respect to on the open interval with its derivative continuous on the closed interval between and , where ;

    • and are differentiable with respect to on the open interval with their derivatives continuous on the closed interval between and , where ;

  • where is compact, and the equality

    holds only for ;

  • for with probability (wp) 1;

  • is continuous at each wp 1;

  • and are continuous at each wp 1;

Condition A.3) is an identification condition. Compactness of may be considered restrictive and can be relaxed at the cost of other assumptions (Newey and McFadden, 1994). For Huber’s function, compactness is, for instance, not necessary (Huber, 1964), while for Tukey’s function identification holds only locally.

Proposition 2 (consistency).

Let either be correctly specified with and/or