Averaging causal estimators in high dimensions

06/21/2019 · by Joseph Antonelli, et al.

There has been increasing interest in recent years in the development of approaches to estimate causal effects when the number of potential confounders is prohibitively large. This growth in interest has led to a number of potential estimators one could use in this setting. Each of these estimators has different operating characteristics, and it is unlikely that one estimator will outperform all others across all possible scenarios. Coupling this with the fact that an analyst can never know which approach is best for their particular data, we propose a synthetic estimator that averages over a set of candidate estimators. Averaging is widely used in statistics for problems such as prediction, where there are many possible models and averaging can improve performance and increase robustness to using incorrect models. We show that these ideas carry over into the estimation of causal effects in high-dimensional scenarios. We show theoretically that averaging provides robustness against choosing a bad model, and we show empirically via simulation that the averaging estimator performs quite well, in most cases nearly as well as the best among all possible candidate estimators. Finally, we illustrate these ideas in an environment-wide association study and see that averaging provides the largest benefit in the more difficult scenarios that have large numbers of confounders.


1 Introduction

Estimating the causal effect of a treatment on an outcome with observational data relies on the assumption that the treatment assignment is unconfounded conditional on a set of pre-treatment covariates. There is a large literature on methods to adjust for confounding in order to satisfy this assumption, such as matching (Rubin & Thomas, 1996), propensity score weighting (Rosenbaum & Rubin, 1983), stratification, outcome regression, and doubly robust estimators (Bang & Robins, 2005). These approaches have all been shown to provide valid estimates of causal effects as long as the covariates necessary for the unconfoundedness assumption to hold are measured in the data. In recent years, the number of data sets with large numbers of covariates has been growing, and there has been subsequent interest in the estimation of causal effects in such high-dimensional settings. This is both a blessing and a curse: the increasing number of measured covariates makes it more plausible that the covariates necessary for confounding adjustment have been included; however, estimation becomes considerably more difficult with large numbers of parameters. As a result, there has been a surge of interest in methods that adjust for large numbers of covariates and produce consistent estimates of causal effects from observational data.

Much of the interest began as papers started to consider the problem of variable selection or model selection for causal estimation in low-dimensional settings (Crainiceanu et al., 2008; Vansteelandt et al., 2012; Wang et al., 2012). Since then, a large number of manuscripts have been published that aim to estimate causal effects when the number of covariates is large relative to the sample size. Wilson & Reich (2014) tailored the adaptive lasso (Zou, 2006) to effect estimation instead of prediction by creating weights that involve both outcome model and treatment model coefficients. Other approaches have utilized penalization to estimate causal effects in high dimensions. Shortreed & Ertefaie (2017) used the adaptive lasso within the propensity score framework, but allowed the weights to depend on the outcome model coefficients. Ertefaie et al. (2018) create a new penalized estimator that utilizes information from both the treatment and outcome to identify confounders, and then estimate causal effects conditional on this set of variables. The lasso (Tibshirani, 1996) was used to estimate both propensity score and prognostic score models in Antonelli et al. (2016) to obtain doubly robust estimates via matching. Antonelli et al. (2017a) utilize a fully Bayesian approach to estimating treatment effects in high dimensions that attempts to reduce finite sample bias while eliminating the impact of instrumental variables. Hahn et al. (2016) utilize horseshoe priors on a re-parameterized likelihood to reduce shrinkage of important coefficients. A large number of approaches have focused on obtaining uniformly valid inference for causal effects in high dimensions (Belloni et al., 2014; Farrell, 2015; Chernozhukov et al., 2018). Belloni et al. (2014) use lasso models to estimate both the treatment and outcome model parameters, take the union of the identified covariates from the two models, and then estimate effects conditional on the chosen model. Farrell (2015) also utilizes the lasso for a treatment and an outcome model, and then fits post-selection estimators to obtain the traditional doubly robust estimator. Chernozhukov et al. (2018) estimate nuisance parameters, such as treatment or outcome model parameters, via penalization, and then solve the efficient score to estimate treatment effects. Similar ideas are used from the Bayesian perspective in Antonelli & Dominici (2018), where it is shown that this leads to improved finite sample performance on frequentist metrics such as interval coverage. Other approaches have focused on covariate balance in high-dimensional settings. Athey et al. (2018) combine high-dimensional outcome regression with balancing weights that eliminate any residual confounding left after the outcome regression. Propensity score estimation is made robust to both high dimensionality and model misspecification using calibration weights in Tan (2017, 2018). Propensity scores are again used in high-dimensional scenarios by Ning et al. (2018), but are instead estimated by their ability to balance important covariates. Lastly, targeted maximum likelihood (TMLE, Van Der Laan & Rubin (2006)) can be used in conjunction with high-dimensional models to produce valid estimates of treatment effects.

This recent interest has led to a large and diverse set of potential estimators to use when one is faced with high-dimensional observational data. Choosing among this large set of estimators is a subjective choice, and it is not clear if, or when, any estimators are preferable to others. Given the complexity of high-dimensional data, it is highly unlikely that any one estimator will perform well universally. In this manuscript, we propose a synthetic estimator that is a convex combination of candidate estimators. The idea of combining models, estimators, or tests has a rich history in the statistical literature. For example, model uncertainty can be accounted for through model averaging, which exists in both the Bayesian and frequentist frameworks (Draper, 1995; Hoeting et al., 1999; Hjort & Claeskens, 2003; Hansen, 2007). Others have proposed combinations of estimators (Mittelhammer & Judge, 2005; Antonelli et al., 2017b), including in the field of small area estimation (Ghosh et al., 1994). Omnibus tests for the global null hypothesis simultaneously test multiple hypotheses (e.g., ANOVA), and several methods for combining individual p-values from genome-wide association studies have been proposed (Chen & Yang, 2017). These are just a few examples of combining models, estimators, or tests, but they all share a common goal: combining statistics for improved performance.

In this manuscript, we leverage a large set of potential estimators to our advantage by introducing a simple, intuitive estimator that averages over all of the candidate estimators. We will show both theoretically and empirically that this leads to robust performance for estimating treatment effects in high dimensions. In particular, averaging over all existing estimators can lead to an estimate with reduced bias and mean squared error (MSE), as well as improved coverage probabilities.

2 Methodology

Throughout, we will assume that we have $n$ i.i.d. samples of $(Y_i, T_i, \boldsymbol{X}_i)$. $Y_i$ and $T_i$ are the outcome and treatment of interest, respectively, while $\boldsymbol{X}_i$ is a $p$-dimensional vector of pre-treatment covariates. We will assume that $T_i$ is a binary treatment, as many of the existing estimators are only applicable to binary treatments. We will not discuss the rates at which $p$ grows, but we will be working in the high-dimensional regime in which $p > n$ and $p$ grows with $n$. For simplicity, we will restrict attention to the average treatment effect defined as $\Delta = E[Y(1) - Y(0)]$, where $Y(t)$ is the potential outcome we would have observed for a subject had they been exposed to treatment $t$. We will be working under the assumptions of no unmeasured confounding and positivity, defined below:

Unconfoundedness: $Y(t) \perp T \mid \boldsymbol{X}$ for $t = 0, 1$
Positivity: There exists $\delta > 0$ such that $\delta \leq P(T = 1 \mid \boldsymbol{X}) \leq 1 - \delta$ with probability 1,

where $P(T = 1 \mid \boldsymbol{X})$ denotes the propensity score (Rosenbaum & Rubin, 1983). We will also assume the stable unit treatment value assumption (SUTVA), which states that the treatment is well-defined and that the outcomes for one subject are not affected by the treatment status of other subjects. Further, in high-dimensional settings, many of the estimators we consider will rely on some form of sparsity, essentially stating that unconfoundedness holds conditional on a subset of the covariates. The exact form of the sparsity condition varies by estimator and will not be discussed in detail in this manuscript.

2.1 Averaged estimator

We now detail how to construct the averaged estimator, given that there exist $K$ candidate estimators from which to choose. We will denote each estimator by $\widehat{\Delta}_k$ for $k = 1, \dots, K$. Each estimator will also have a corresponding variance estimate, which we will denote by $\widehat{V}_k$ for $k = 1, \dots, K$. The averaged estimator takes the form

$$\widehat{\Delta}_{avg} = \sum_{k=1}^{K} w_k \widehat{\Delta}_k,$$

where $\boldsymbol{w} = (w_1, \dots, w_K)$ is a weight vector that must be chosen. In an ideal world, we would estimate the weight vector to obtain the optimal choice that minimizes a pre-specified loss function. One example would be to find the weight vector that minimizes the mean squared error of $\widehat{\Delta}_{avg}$, which would amount to solving

$$\widehat{\boldsymbol{w}} = \underset{\boldsymbol{w}}{\arg\min} \left\{ \left(\boldsymbol{w}^T \boldsymbol{b}\right)^2 + \boldsymbol{w}^T \Sigma \boldsymbol{w} \right\},$$

where $\boldsymbol{b}$ and $\Sigma$ are the bias vector and covariance matrix for the $K$ estimators. Unfortunately, both of these quantities are quite difficult to estimate. In the context of noncompliance in clinical trials, Antonelli et al. (2017b) are able to estimate these two quantities and thereby estimate the weight vector $\boldsymbol{w}$. The bias is estimated by identifying an unbiased estimator among the candidate estimators, and the covariance is estimated using the bootstrap. These ideas do not carry over to the high-dimensional estimators considered here. The bootstrap does not apply to most of the high-dimensional estimators based on variable selection or regularization, and estimating the bias of each estimator is infeasible. Further, most of the papers mentioned in the introduction went to great lengths simply to estimate the variance of their estimators in high dimensions, so finding the covariance between each pair of estimators is not possible at this point. With this in mind, we will proceed with a fixed weight vector $\boldsymbol{w} = (1/K, \dots, 1/K)$, such that each estimator gets equal weight. This provides a number of advantages over simply choosing a single estimator, which are detailed in the subsequent sections. The averaging will provide a form of robustness in terms of bias and MSE by ensuring that the averaged estimator does better with respect to these metrics than the worst of the $K$ estimators. Our empirical studies in Section 3 show that we not only provide protection against choosing the worst estimator, but the averaged estimator typically performs close to, if not better than, the best of the $K$ estimators. This is due to large decreases in the variance of the averaged estimator relative to any one estimator, particularly if the individual estimators are not strongly correlated. We note that unequal weights can easily be applied here as long as they are chosen a priori and not chosen simply to produce a desired result.
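To make the construction concrete, below is a minimal sketch of the equal-weight averaged estimator; the candidate point estimates are hypothetical placeholders, and in practice each would come from one of the high-dimensional approaches listed in Section 3.

```python
import numpy as np

# Hypothetical ATE estimates from K candidate estimators; in practice
# each would come from one of the approaches listed in Section 3.
delta_hat = np.array([1.10, 0.95, 1.30, 1.05, 0.88])
K = len(delta_hat)

# Fixed, a priori weight vector: each estimator gets weight 1/K.
w = np.full(K, 1.0 / K)

# The averaged estimator is the convex combination of the candidates.
delta_avg = w @ delta_hat
print(delta_avg)  # 1.056
```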

2.2 Variance estimation

One reason that utilizing the averaged estimator can be so powerful is that it can lead to large gains in efficiency. If the covariance matrix $\Sigma$ of the estimators is known, the variance of the averaged estimator can be written as

$$Var\left(\widehat{\Delta}_{avg}\right) = \boldsymbol{w}^T \Sigma \boldsymbol{w} = \frac{1}{K^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \Sigma_{kl}.$$

We can explore this quantity in various situations to see how the variance can be greatly improved with the averaged estimator. In one extreme case, where all the estimators are perfectly correlated, the variance of the averaged estimator is equal to the average of the pairwise products of the standard deviations for each estimator. If each estimator has the same variance, this would lead to the averaged estimator having exactly the same variance as the other estimators. In a different extreme case, where all of the estimators are uncorrelated, the variance of the averaged estimator would be $\frac{1}{K^2} \sum_{k=1}^{K} V_k$. This means that the variance of the averaged estimator would be an order of magnitude (a factor of $1/K$) smaller than the average of the individual estimator variances. Even bigger gains can be seen if the estimators are negatively correlated, but this is very unlikely to happen in practice, so it will not be discussed any further. In our experience, these estimators will be positively, but not perfectly, correlated with each other, meaning that the truth lies between the two extremes discussed. Nonetheless, the variance of the averaged estimator can either be a big improvement over the individual estimators or, in the worst case, no worse.
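The two extremes are easy to check numerically. A small sketch, assuming for illustration that every estimator has variance 1:

```python
import numpy as np

K = 5
w = np.full(K, 1.0 / K)

# Extreme 1: perfectly correlated estimators, each with variance 1.
sigma_corr = np.ones((K, K))
print(w @ sigma_corr @ w)   # 1.0 -- no better than any single estimator

# Extreme 2: uncorrelated estimators, each with variance 1.
sigma_indep = np.eye(K)
print(w @ sigma_indep @ w)  # 0.2 -- smaller by a factor of 1/K
```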

The difficult aspect of using the proposed high-dimensional averaged estimator is estimating its corresponding variance. We could plug estimates into the expression above; however, as discussed earlier, it is very difficult to produce an estimate of $\Sigma$ in high dimensions. We can, however, estimate the individual estimator variances $V_k$. Since the correlation is bounded between -1 and 1, we can bound the variance of the averaged estimator via

$$Var\left(\widehat{\Delta}_{avg}\right) = \frac{1}{K^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \Sigma_{kl} \leq \frac{1}{K^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \sqrt{V_k V_l} = \left( \frac{1}{K} \sum_{k=1}^{K} \sqrt{V_k} \right)^2.$$

Therefore, we can propose a conservative estimate of the variance as

$$\widehat{Var}\left(\widehat{\Delta}_{avg}\right) = \left( \frac{1}{K} \sum_{k=1}^{K} \sqrt{\widehat{V}_k} \right)^2.$$

How conservative this estimator is depends on the correlation between the individual estimators. If they are highly positively correlated, then it will not be very conservative. If the estimators are close to independent, then it will be a very conservative estimate of the variance. We argue that being conservative in this setting is good for two reasons. First, the estimates of the standard errors $\sqrt{\widehat{V}_k}$ frequently rely on asymptotic approximations, which might not hold, and therefore these could be underestimates of the true individual variances. Second, there is frequently bias in estimating treatment effects in high dimensions due to penalization or variable selection, and therefore conservative estimates of the variance might lead to better interval coverage that helps account for some additional bias. We will see empirically that this conservative variance does quite well in Section 3.
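A minimal sketch of the conservative variance estimate and the resulting 95% interval, using hypothetical point and variance estimates:

```python
import numpy as np

# Hypothetical point estimates and variance estimates from K candidates.
delta_hat = np.array([1.10, 0.95, 1.30, 1.05, 0.88])
v_hat = np.array([0.040, 0.055, 0.090, 0.035, 0.060])

delta_avg = delta_hat.mean()

# Conservative variance: bound every covariance by the product of the
# standard deviations, so Var(avg) <= (mean of the individual SEs)^2.
se_avg = np.sqrt(v_hat).mean()

# Conservative 95% interval for the ATE.
lower, upper = delta_avg - 1.96 * se_avg, delta_avg + 1.96 * se_avg
print(delta_avg, se_avg, (lower, upper))
```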

2.3 Robustness to choosing the worst model

How well the averaged estimator performs relative to the candidate estimators is highly data dependent. However, we can show that the averaged estimator protects against choosing a potentially very bad estimator among the class of estimators. Intuitively, this is because the averaged estimator assigns a weight of only $1/K$ to any one estimator, including the worst estimator with respect to a given metric, and therefore it will perform as well as or better than this worst-case estimator. The following result shows this for both bias and mean squared error:

Result 1: If each estimator is assigned equal weight, and we let $\widehat{\Delta}_{(B)}$ and $\widehat{\Delta}_{(M)}$ be the worst estimators in terms of bias and MSE, respectively, then

$$\left| Bias\left(\widehat{\Delta}_{avg}\right) \right| \leq \left| Bias\left(\widehat{\Delta}_{(B)}\right) \right| \quad \text{and} \quad MSE\left(\widehat{\Delta}_{avg}\right) \leq MSE\left(\widehat{\Delta}_{(M)}\right).$$

A proof of this result can be found in the Appendix. While this result only guarantees that the averaged estimator does better than the worst model, in practice these are typically very conservative bounds: the averaged estimator not only performs better than the worst estimators, but frequently performs similarly to the best individual estimators.
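Result 1 can also be verified by simulation. The sketch below draws correlated Gaussian estimators with assumed biases and standard deviations, and checks the bias and MSE bounds empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
truth, K, reps = 1.0, 5, 20_000

bias = np.array([0.00, 0.05, -0.10, 0.20, -0.30])  # assumed biases
sd = np.array([0.20, 0.25, 0.15, 0.30, 0.25])      # assumed SDs

# Equicorrelated draws (correlation 0.5) around the biased means.
corr = np.full((K, K), 0.5) + 0.5 * np.eye(K)
cov = corr * np.outer(sd, sd)
draws = rng.multivariate_normal(truth + bias, cov, size=reps)

avg = draws.mean(axis=1)
mse_each = ((draws - truth) ** 2).mean(axis=0)
mse_avg = ((avg - truth) ** 2).mean()

# Bias bound: |mean of the biases| <= worst absolute bias (0.03 vs 0.30).
print(abs(avg.mean() - truth), np.abs(bias).max())
# MSE bound: averaged MSE <= worst individual MSE.
print(mse_avg, mse_each.max())
```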

3 Simulation study

We will empirically investigate the ability of the averaged estimator to estimate treatment effects in high-dimensional settings. We focus on 10 potential estimators that are either available in an R package or straightforward to implement ourselves. The estimators entering the averaged estimator are listed below:

  1. The double post selection approach (Double PS, Belloni et al.  (2014))

  2. The approximate residual debiasing approach (Debiasing, Athey et al.  (2018))

  3. The doubly robust lasso approach (DR-Lasso, Farrell (2015))

  4. Doubly robust matching (DRME, Antonelli et al.  (2016))

  5. High-dimensional confounding with spike-and-slab priors (HDC, Antonelli et al.  (2017a))

  6. Bayesian doubly robust estimator (DR-Bayes, Antonelli & Dominici (2018))

  7. Targeted maximum likelihood with lasso models (TMLE, Van Der Laan & Rubin (2006))

  8. Targeted maximum likelihood with an initial screening step followed by unpenalized linear models (TMLE-screen)

  9. Double machine learning with lasso models (DML, Chernozhukov et al.  (2018))

  10. Double machine learning with post selection estimators (DML-PS)

We will not go into the details of the respective approaches or how they are fit, as our aim here is to show the performance of the averaged estimator when there is a large number of estimators to average over, each with operating characteristics that are unknown to the user beforehand. For this reason, we will focus on results for the averaged estimator, as well as the best and worst estimators with respect to any given metric. For mean squared error, the best estimator is the one with the smallest MSE, and the worst is the one with the largest MSE. For ease of discussion, the MSE for each simulation scenario is normalized so that the best estimator has MSE equal to 1. For interval coverage, the worst estimator is the one with the lowest coverage, and the best estimator is the one with coverage closest to 0.95 without being 1. We don’t want an estimator with a coverage probability of 1, as it could be overly conservative; this was not an issue in any of the five scenarios we considered. For the first three simulation scenarios, we generated data under the following models for the treatment and outcome processes:

$$T_i \mid \boldsymbol{X}_i \sim \text{Bernoulli}\left\{ \text{expit}\left( \boldsymbol{X}_i \boldsymbol{\psi} \right) \right\}, \qquad Y_i \mid T_i, \boldsymbol{X}_i \sim N\left( T_i \Delta + \boldsymbol{X}_i \boldsymbol{\beta}, \sigma^2 \right), \qquad \boldsymbol{X}_i \sim N(\boldsymbol{0}, \Sigma_x),$$

where $\Sigma_x$ is an exchangeable correlation matrix with correlation 0.3. In all three of these scenarios, the sample size $n$ and covariate dimension $p$ are held fixed. Scenarios 1 and 3 use different sparse, nonzero settings of $\Delta$, $\boldsymbol{\beta}$, and $\boldsymbol{\psi}$. In scenario 2, both $\boldsymbol{\psi}$ and $\boldsymbol{\beta}$ are $\boldsymbol{0}$ and there is no confounding. The final two simulation scenarios are based on a simulation from Athey et al. (2018) that has a dense treatment model. For the fourth scenario, we define 20 clusters with centers $\boldsymbol{c}_1, \dots, \boldsymbol{c}_{20}$, and draw each subject's cluster membership uniformly at random from among the 20. Then, we draw the covariates from a multivariate normal distribution centered at the subject's cluster center with the identity matrix as the covariance. We set $T_i = 1$ with probability 0.1 for the first 10 clusters, and $T_i = 1$ with probability 0.9 for the remaining clusters. Finally, we generate data from the outcome model defined as $Y_i = T_i \Delta + \boldsymbol{X}_i \boldsymbol{\beta} + \varepsilon_i$, where $\varepsilon_i$ is Gaussian noise and $\boldsymbol{\beta}$ is normalized to have fixed norm. The fifth scenario is the same as the fourth, except that $T_i = 1$ with probability 0.25 and 0.75 for the first 10 and remaining clusters, respectively, and the normalization of $\boldsymbol{\beta}$ differs.
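For reference, a sketch of the generic data-generating process for the first three scenarios is given below. The values of n, p, Delta, and the nonzero coefficients are placeholders, as each scenario uses its own settings, and the logistic link for the treatment model is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder settings: the paper fixes its own n, p, Delta, and sparse
# coefficient vectors per scenario (scenario 2 sets psi = beta = 0).
n, p, Delta = 200, 500, 1.0
beta = np.zeros(p); beta[:5] = 0.5  # outcome model coefficients
psi = np.zeros(p); psi[:5] = 0.5    # treatment model coefficients

# Covariates: multivariate normal, exchangeable correlation 0.3.
Sigma_x = np.full((p, p), 0.3) + 0.7 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)

# Binary treatment (logistic link assumed) and Gaussian outcome.
T = rng.binomial(1, expit(X @ psi))
Y = Delta * T + X @ beta + rng.normal(size=n)
```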

3.1 Results

Figures 1 and 2 show the results, averaged over 1000 simulations, for both MSE and interval coverage. We see that the averaged estimator does quite favorably in terms of both MSE and interval coverage. In terms of MSE, the worst estimator does significantly worse than the averaged estimator in all scenarios, while the averaged estimator is able to achieve an MSE that is close to the best of the individual estimators. While it is hard to see in the figure because the differences are small, the averaged estimator does slightly better than the best individual estimator in Scenario 3. As for interval coverage, the averaged estimator achieves coverage near the nominal 0.95 level in nearly all of the simulation scenarios considered, and significantly outperforms the worst estimator. There is one scenario, Scenario 4, in which the coverage of the averaged estimator is overly conservative, with a coverage probability of 1. The reason for this was not our use of an upper bound for the variance; it was that one of the estimators performed poorly, with extremely high standard errors that propagated into the averaged estimator’s standard error. Nonetheless, we would argue that erring on the side of being conservative is preferable to anti-conservative intervals. In practice, we would exclude any estimator with an extreme estimated standard error from the average, which should avoid this issue.

Overall, the results confirm the theoretical result that the averaged estimator does better than the worst individual estimator; more importantly, they show that the averaged estimator tends to perform close to the best individual estimator. Figure 3 shows the absolute bias and variance for each individual estimator across all the simulation scenarios considered. The circles represent the individual estimators, while the triangles represent the averaged estimator for each scenario. The averaged estimator tends to lie in the lower left of the plot compared with the individual estimators for a given simulation, indicating that it has both small bias and small variance in each scenario.

Figure 1: Mean squared error for the estimator with the lowest MSE, highest MSE, and the averaged estimator.
Figure 2: Interval coverage for the estimator with the best interval coverage, worst interval coverage, and for the averaged estimator.
Figure 3: Bias and variance for each of the estimators, over all scenarios looked at. The averaged estimators are denoted by large triangles, while the other estimators are denoted by small circles.

4 Application to NHANES data

Environment-wide association studies (EWAS) measure a variety of chemicals and toxins that humans are exposed to, in an effort to understand how they affect biological processes in the human body (Wild, 2005; Patel & Ioannidis, 2014). The National Health and Nutrition Examination Survey (NHANES) is a cross-sectional data source made publicly available by the Centers for Disease Control and Prevention (CDC). We aim to estimate the effects of a number of environmental exposures on three different outcomes in the NHANES study: HDL cholesterol, LDL cholesterol, and triglyceride levels. We use the data found in Wilson et al. (2018), which contain a large number of potential confounders. Participants both fill out questionnaires regarding their health status and receive clinical and laboratory tests that contain information on environmental factors such as pollutants, allergens, bacterial/viral organisms, chemical toxicants, and nutrients. In previous work (Patel et al., 2012), the environmental agents were separated into different groups containing similar agents that might affect similar biological pathways. We look at the effects of 22 different environmental agent groups on the three outcomes, leading to 66 different analyses. Each exposure is defined as the average exposure level across all agents within the same grouping. As most of the approaches considered apply only to binary treatments, we dichotomize the continuous exposure by classifying each subject as being above or below the median exposure level. Each of the environmental agents is measured in a different subset of the NHANES data, leading to different populations, sample sizes, and covariate dimensions for each of the 22 different exposures. Some of the data sets have very small $p/n$ ratios that do not require high-dimensional techniques, while others have $p/n$ ratios between 0.3 and 0.4 and require some form of dimension reduction. In the following sections, we will see how the averaged estimator performs in comparison with the individual estimators, and how this varies as a function of the $p/n$ ratio.
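Operationally, the exposure construction amounts to averaging the agents within a group and then splitting at the median; a minimal sketch with hypothetical agent names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical measured agents belonging to one environmental group.
agents = ["agent_1", "agent_2", "agent_3"]
df = pd.DataFrame(rng.lognormal(size=(100, 3)), columns=agents)

# Exposure = average level across all agents in the group, then
# dichotomized at the median to give a binary treatment.
exposure = df[agents].mean(axis=1)
df["T"] = (exposure > exposure.median()).astype(int)
```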

4.1 Agreement of statistical decisions

Here we evaluate whether the averaged estimator would lead to any different decisions in terms of hypothesis testing; in particular, when testing whether the ATE is 0, do the different estimators lead to different conclusions? Figure 4 shows the results of this statistical test for every data set in which there was disagreement among the estimators. Each row corresponds to a particular estimator, while each column corresponds to a particular data set. We see that the averaged estimator generally sides with the majority of the individual estimators, though this is not always the case. In the sixth column, only 3 of the 10 individual estimators show a significant treatment effect, yet the averaged estimator also leads to a rejection of the null hypothesis of no treatment effect. It is important to emphasize that Figure 4 is limited to cases where different estimators lead to different conclusions, which represent only 30% of the analyses performed.

Figure 4: Plot of whether or not each estimator rejected a particular null hypothesis. The results only show data sets in which there was some disagreement among the 11 estimators (the 10 individual estimators and the averaged estimator) in terms of the final hypothesis test.

4.2 When does averaging matter?

Here we evaluate the potential impact that averaging can have on estimation within the NHANES data, and we try to identify where it is most useful. Assuming that each estimator has only a small amount of bias, the main advantage of averaging is the efficiency gain one can achieve. When the individual estimators are highly correlated, there is very little to be gained (though also very little to be lost) by averaging. On the other hand, as the estimators become less correlated with each other, the averaged estimator gains more and more efficiency. The left panel of Figure 5 shows the correlation matrix of the 10 estimators across the data sets examined. We see that the correlation is generally very high, with the majority of the off-diagonal values above 0.9. The right panel, however, depicts a somewhat different story when we restrict to the data sets with $p/n$ ratios above 0.25. In these settings we see much smaller correlations between the estimators, with some down in the 0.4 range, and many more in the 0.6 to 0.8 range. These two panels show that when $p/n$ is small, there is very little to be gained by averaging. In the more difficult settings with larger numbers of covariates, averaging has the potential to reduce the variance of the treatment effect estimate substantially. These more difficult scenarios are exactly the settings that the aforementioned high-dimensional approaches were developed for, and we can see that in these scenarios, averaging over all the potential estimators can lead to improved estimation.

Figure 5: Correlation matrix across the 10 estimators over the analyses in the NHANES data. The right panel only shows this correlation for the data sets with a $p/n$ ratio above 0.25.
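The correlations summarized in Figure 5 can be computed by stacking the point estimates into an analyses-by-estimators matrix. A sketch with simulated placeholder values standing in for the actual estimates and p/n ratios:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical matrix of ATE estimates: one row per NHANES analysis,
# one column per candidate estimator (66 analyses, 10 estimators).
estimates = rng.normal(size=(66, 10))
p_over_n = rng.uniform(0.0, 0.4, size=66)  # hypothetical p/n ratios

# Pairwise correlations of the estimators across all analyses (left
# panel) and across the harder analyses with p/n > 0.25 (right panel).
corr_all = np.corrcoef(estimates, rowvar=False)
corr_hard = np.corrcoef(estimates[p_over_n > 0.25], rowvar=False)
print(corr_all.shape, corr_hard.shape)  # (10, 10) (10, 10)
```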

5 Discussion

We have presented a simple approach to estimating average treatment effects in high dimensions that leverages the many estimators that have been developed in recent years. As more approaches are developed, it becomes more difficult for an analyst to choose an approach for their particular study. Further, there is likely no best approach for all scenarios, and how well each estimator performs is highly dependent on the unknown data generating process. We have argued in this manuscript that averaging over all of the estimators is an appealing choice when presented with so many estimators. Not only does the averaged estimator guarantee improved performance over the worst estimator in terms of bias and MSE, but our empirical results indicate that it can perform much more closely to the best of the individual estimators, without knowing beforehand which estimators are best. At times, it can even outperform the best of the individual estimators. Further, we saw that our conservative estimate of the variance of the averaged estimator leads to good confidence interval coverage. In high-dimensional settings there is typically both bias in statistical estimators and underestimation of the uncertainty associated with them, and our conservative variance estimate helps mitigate these issues to obtain better finite sample performance.

A potential criticism of the averaged estimator is that each individual estimator relies on its own set of assumptions, and therefore the averaged estimator requires all of these assumptions in order to achieve unbiased inference. While this is generally true, we argue that in some ways this shows the strength of the averaged estimator. All of these assumptions are untestable, and we don’t know which estimator’s assumptions are being violated. By averaging over all estimators, we mitigate the impact that gross violation of one of these assumptions would have had if we had simply chosen that one, bad estimator. Ideally, we would estimate the weights to optimize the averaged estimator and reduce the impact that one bad estimator could have. Unfortunately, this is a very difficult problem, because it requires at least the covariance matrix of the individual estimators, if not also the bias of each estimator. In high dimensions it is not possible to obtain these quantities for a general set of estimators, and even if we could, the estimation of the weights would likely increase the variance of our averaged estimator enough to remove any benefit of adaptive weight estimation.

References

  • Antonelli & Dominici (2018) Antonelli, Joseph, & Dominici, Francesca. 2018. A Bayesian semiparametric framework for causal inference in high-dimensional data. arXiv preprint arXiv:1805.04899.
  • Antonelli et al.  (2016) Antonelli, Joseph, Cefalu, Matthew, Palmer, Nathan, & Agniel, Denis. 2016. Doubly robust matching estimators for high dimensional confounding adjustment. Biometrics.
  • Antonelli et al.  (2017a) Antonelli, Joseph, Parmigiani, Giovanni, & Dominici, Francesca. 2017a. High dimensional confounding adjustment using continuous spike and slab priors. arXiv preprint arXiv:1704.07532.
  • Antonelli et al.  (2017b) Antonelli, Joseph, Han, Bing, & Cefalu, Matthew. 2017b. A synthetic estimator for the efficacy of clinical trials with all-or-nothing compliance. Statistics in medicine, 36(29), 4604–4615.
  • Athey et al.  (2018) Athey, Susan, Imbens, Guido W, & Wager, Stefan. 2018. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4), 597–623.
  • Bang & Robins (2005) Bang, Heejung, & Robins, James M. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4), 962–973.
  • Belloni et al.  (2014) Belloni, Alexandre, Chernozhukov, Victor, & Hansen, Christian. 2014. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608–650.
  • Chen & Yang (2017) Chen, Chia-Wei, & Yang, Hsin-Chou. 2017. OPATs: Omnibus P-value association tests. Briefings in Bioinformatics, 20(1), 1–14.
  • Chernozhukov et al.  (2018) Chernozhukov, Victor, Chetverikov, Denis, Demirer, Mert, Duflo, Esther, Hansen, Christian, Newey, Whitney, & Robins, James. 2018. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
  • Crainiceanu et al.  (2008) Crainiceanu, Ciprian M, Dominici, Francesca, & Parmigiani, Giovanni. 2008. Adjustment uncertainty in effect estimation. Biometrika, 95(3), 635–651.
  • Draper (1995) Draper, David. 1995. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 45–70.
  • Ertefaie et al.  (2018) Ertefaie, Ashkan, Asgharian, Masoud, & Stephens, David A. 2018. Variable selection in causal inference using a simultaneous penalization method. Journal of Causal Inference, 6(1).
  • Farrell (2015) Farrell, Max H. 2015. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1), 1–23.
  • Ghosh et al.  (1994) Ghosh, Malay, Rao, JNK, et al. 1994. Small area estimation: an appraisal. Statistical science, 9(1), 55–76.
  • Hahn et al.  (2016) Hahn, P Richard, Carvalho, Carlos, & Puelz, David. 2016. Bayesian Regularized Regression for Treatment Effect Estimation from Observational Data. Available at SSRN.
  • Hansen (2007) Hansen, Bruce E. 2007. Least squares model averaging. Econometrica, 75(4), 1175–1189.
  • Hjort & Claeskens (2003) Hjort, Nils Lid, & Claeskens, Gerda. 2003. Frequentist model average estimators. Journal of the American Statistical Association, 98(464), 879–899.
  • Hoeting et al.  (1999) Hoeting, Jennifer A, Madigan, David, Raftery, Adrian E, & Volinsky, Chris T. 1999. Bayesian model averaging: a tutorial. Statistical science, 382–401.
  • Mittelhammer & Judge (2005) Mittelhammer, Ron C, & Judge, George G. 2005. Combining estimators to improve structural model estimation and inference under quadratic loss. Journal of econometrics, 128(1), 1–29.
  • Ning et al.  (2018) Ning, Yang, Peng, Sida, & Imai, Kosuke. 2018. Robust Estimation of Causal Effects via High-Dimensional Covariate Balancing Propensity Score. arXiv preprint arXiv:1812.08683.
  • Patel & Ioannidis (2014) Patel, Chirag J, & Ioannidis, John PA. 2014. Studying the elusive environment in large scale. Jama, 311(21), 2173–2174.
  • Patel et al.  (2012) Patel, Chirag J, Cullen, Mark R, Ioannidis, John PA, & Butte, Atul J. 2012. Systematic evaluation of environmental factors: persistent pollutants and nutrients correlated with serum lipid levels. International journal of epidemiology, 41(3), 828–843.
  • Rosenbaum & Rubin (1983) Rosenbaum, Paul R, & Rubin, Donald B. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
  • Rubin & Thomas (1996) Rubin, Donald B, & Thomas, Neal. 1996. Matching using estimated propensity scores: relating theory to practice. Biometrics, 249–264.
  • Shortreed & Ertefaie (2017) Shortreed, Susan M, & Ertefaie, Ashkan. 2017. Outcome-adaptive lasso: Variable selection for causal inference. Biometrics.
  • Tan (2017) Tan, Zhiqiang. 2017. Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. arXiv preprint arXiv:1710.08074.
  • Tan (2018) Tan, Zhiqiang. 2018. Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. arXiv preprint arXiv:1801.09817.
  • Tibshirani (1996) Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
  • Van Der Laan & Rubin (2006) Van Der Laan, Mark J, & Rubin, Daniel. 2006. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1).
  • Vansteelandt et al.  (2012) Vansteelandt, Stijn, Bekaert, Maarten, & Claeskens, Gerda. 2012. On model selection and model misspecification in causal inference. Statistical methods in medical research, 21(1), 7–30.
  • Wang et al.  (2012) Wang, Chi, Parmigiani, Giovanni, & Dominici, Francesca. 2012. Bayesian effect estimation accounting for adjustment uncertainty. Biometrics, 68(3), 661–671.
  • Wild (2005) Wild, Christopher Paul. 2005. Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiology Biomarkers & Prevention, 14(8), 1847–1850.
  • Wilson & Reich (2014) Wilson, Ander, & Reich, Brian J. 2014. Confounder selection via penalized credible regions. Biometrics, 70(4), 852–861.
  • Wilson et al.  (2018) Wilson, Ander, Zigler, Corwin, Patel, Chirag, & Dominici, Francesca. 2018. Model-averaged confounder adjustment for estimating multivariate exposure effects with linear regression. Biometrics.
  • Zou (2006) Zou, Hui. 2006. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418–1429.

Appendix A Proof of Result 1

First we will show the result for bias. Letting $b_k$ denote the bias of $\widehat{\Delta}_k$, the triangle inequality gives

$$\left| Bias\left(\widehat{\Delta}_{avg}\right) \right| = \left| \frac{1}{K} \sum_{k=1}^{K} b_k \right| \leq \frac{1}{K} \sum_{k=1}^{K} \left| b_k \right| \leq \max_k \left| b_k \right| = \left| Bias\left(\widehat{\Delta}_{(B)}\right) \right|.$$

Now for the MSE. Since $x \mapsto (x - \Delta)^2$ is convex, Jensen's inequality gives

$$MSE\left(\widehat{\Delta}_{avg}\right) = E\left[ \left( \frac{1}{K} \sum_{k=1}^{K} \widehat{\Delta}_k - \Delta \right)^2 \right] \leq \frac{1}{K} \sum_{k=1}^{K} E\left[ \left( \widehat{\Delta}_k - \Delta \right)^2 \right] \leq \max_k MSE\left(\widehat{\Delta}_k\right) = MSE\left(\widehat{\Delta}_{(M)}\right).$$