1 Introduction
A large literature focuses on estimating average treatment effects under unconfoundedness (see, e.g., Blundell and Costa Dias 2009, Imbens and Wooldridge 2009). [Footnote 1: The unconfoundedness assumption may also be referred to as exogeneity, ignorability, or selection on observables.] Many estimators are available to researchers in this context, and many of these estimators have similar asymptotic properties. This can make it difficult to select which estimator to use. Monte Carlo studies are a useful tool for examining the small-sample properties of these estimation methods, which can guide estimator choice. [Footnote 2: See, for example, Frölich (2004); Lunceford and Davidian (2004); Zhao (2004, 2008); Busso et al. (2009); Millimet and Tchernis (2009); Austin (2010); Abadie and Imbens (2011); Khwaja et al. (2011); Diamond and Sekhon (2013); Huber et al. (2013); Busso et al. (2014); Frölich et al. (2017), and Bodory et al. (2018), all studying the finite-sample performance of estimators of average treatment effects under unconfoundedness.] Early contributions, such as Frölich (2004), demonstrate estimator performance in stylised data generating processes (DGPs) which do not resemble any empirical settings. This reliance on unrealistic DGPs is criticised by Huber et al. (2013) and Busso et al. (2014). Both recommend that Monte Carlo studies should intend to replicate actual datasets of interest, although they suggest different procedures for doing this. Huber et al. (2013) describe this approach to examining the small-sample properties of estimators as an ‘empirical Monte Carlo study’ (EMCS). An important question is whether either type of EMCS can help applied researchers in choosing what estimator(s) to prefer in a given context. Busso et al. (2014) indicate this might be possible, noting that their results ‘suggest the wisdom of conducting a small-scale simulation study tailored to the features of the data at hand’. [Footnote 3: Similarly, Huber et al. (2013) suggest that ‘the advantage [of an EMCS] is that it is valid in at least one relevant environment’, i.e. that it is informative at least about the performance of estimators in the dataset on which it was conducted.]
In this paper we evaluate the premise that EMCS is ‘internally valid’: that it can be informative about the performance of estimators in the particular data which are the basis for the EMCS. [Footnote 4: This usage of ‘internal validity’ is somewhat nonstandard. However, it is consistent with the definition given by Angrist and Krueger (1999). They define internal validity as the question of ‘whether an empirical relationship has a causal interpretation in the setting where it is observed’. In our case the relationship is between the performance of estimators in the original data and their performance in the EMCS implemented on these data.] We first show theoretically that these approaches are expected to be informative only under very restrictive conditions. These conditions are unlikely to hold in many practical examples faced by a researcher. We then test EMCS performance in a real-world case where we know the actual behaviour of estimators. We find that in terms of selecting estimators on absolute bias they are often worse than choosing randomly. On mean squared error (MSE) they perform better than random, but no better than selecting an estimator based on simple bootstrap estimates of MSEs. Their performance in absolute terms may also still be poor.
The first type of EMCS we consider is the placebo EMCS (Huber et al., 2013). [Footnote 5: It is also applied by Lechner and Wunsch (2013); Huber et al. (2016); Lechner and Strittmatter (2016); Frölich et al. (2017), and Bodory et al. (2018). A related approach is proposed by Schuler et al. (2017).] This proposes a way to assign ‘realistic placebo treatments among the non-treated’, using information about the predictors of treatment status in the original data. It then tests how well estimators can recover the zero effect of the placebo treatment. The performance of estimators in this exercise is hypothesised to be informative about their performance in the original data.
The second type we describe as the structured EMCS. An exercise of this type is undertaken by Busso et al. (2014). [Footnote 6: A similar approach is also used by Abadie and Imbens (2011), Lee (2013), and Díaz et al. (2015).] Here a parameterised approximation of the original data generating process is created, using functional form assumptions about the distributions of observed covariates. Parameters of their marginal (or conditional) distributions are estimated from the original data. [Footnote 7: The distribution of particular covariates may be allowed to depend on the realisations of others, in which case parameters of the conditional distributions are needed.] Samples can be drawn from this approximate DGP, to which the estimators can be applied. Since the treatment effect in this DGP can be calculated directly from knowledge of the parameters, performance of the estimators in these samples can be measured. The performance of estimators in this exercise is also hypothesised to be informative about their performance in the original data.
To examine whether or not EMCS can correctly choose a best performing estimator, for various definitions of performance, we first focus on a simple example with two estimators that have Gaussian sampling distributions. We show analytically that both these approaches will only be guaranteed to correctly select the preferred estimator if they can correctly reproduce both the biases and the ordering of the variances of estimators. These are restrictive conditions that we show can easily fail in practical applications, such as when the EMCS procedures fail to recover heteroskedastic errors or misspecify the regression equations or propensity scores. In two sets of simulations based on a stylised DGP, both approaches select the better estimator less than 3% of the time, much worse than the 50% achievable by selecting randomly.
To study the extent of the problem in a real-world circumstance, we apply both methods to the National Supported Work (NSW) Demonstration data on men, previously analysed by LaLonde (1986), Heckman and Hotz (1989), Dehejia and Wahba (1999, 2002), Smith and Todd (2001, 2005), and many others. In these data participation in a job training programme was randomly assigned, so the treatment effect of the programme can be estimated by comparing sample means. LaLonde (1986) used these data to test the performance of estimators at reproducing this treatment effect when an artificial comparison group (rather than the experimental controls) was used. We instead use the data to test how well the two EMCS procedures can inform us about the performance of the estimators: Can EMCS tell us which estimator to use? On average how much worse than the optimal estimator is the one chosen by EMCS? How well can EMCS reproduce the ranking of performance across all estimators?
Applying the two EMCS procedures we find three main results. First, in terms of absolute bias, the EMCS procedures are no better, and often noticeably worse, than selecting an estimator at random. In two out of three cases we study, the rankings produced are negatively correlated with the true ranking. In one case the preferred estimator selected by EMCS is on average 30–37 times worse than the actual best estimator.
Second, EMCS does better at reproducing the performance of estimators in terms of MSE. This is because the MSEs of the estimators are mostly driven by their variances, and EMCS appears more effective at capturing variances. The rankings of estimators are consistently positively correlated with the true rankings, although the estimator preferred by EMCS has an MSE up to twice as high as the best estimator.
Third, given the variance result, we also compare EMCS procedures to choosing estimators based on which has the lowest variance from a simple bootstrap. We find that the bootstrap is as good as, and often much better than, either of the EMCS procedures. Hence even when the procedures are somewhat informative, they are not superior to a procedure that relies on fewer design choices.
These results are unfortunate, but nevertheless important. They caution against treating either of these approaches as general solutions to the problem of estimator choice. There remains no silver bullet that can assist empirical researchers with the ‘right’ or ‘best’ estimator for a particular context. In the absence of a clear choice driven by research design, the best advice at this stage is likely to be implementing a number of estimators, and then considering the range of estimates provided, as Busso et al. (2014) also suggest.
Our results also have implications for researchers studying the small-sample properties of treatment effect estimators (see footnote 2). It has been argued that ‘it is preferable to study DGPs that are empirically relevant’ (Busso et al., 2014). [Footnote 8: A similar argument is made in Huber et al. (2013).] Our theoretical and empirical results suggest there is little support for this claim. We show theoretically that misspecification in the construction of the DGP can lead the ranking of estimators to be incorrect for the original dataset. In our empirical example, we see that EMCS is not better than using a bootstrap (and sometimes not better than random) to predict performance in the data on which the EMCS was performed. There seems little reason to then think it is particularly informative about performance in other unrelated real datasets, i.e. that testing small-sample properties of estimators in ‘real data’ is necessarily better than in completely artificial data. A more fruitful path might be to test sensitivity of estimator performance to parameters of the simulation, such as sample size and the degree of heteroskedasticity. This approach is also taken by Huber et al. (2013), and might be more helpful in understanding what characteristics of samples most affect the performance of particular estimators.
2 EMCS Designs
We first describe the two main approaches to conducting an EMCS, namely the placebo design of Huber et al. (2013) and the structured design of Busso et al. (2014). In either EMCS design, one simulates many ‘empirical Monte Carlo’ replication samples from a known data generating process. By implementing the estimators on the simulated replications, one obtains estimates of the sampling distributions and performance criteria (e.g., MSEs) of the estimators, according to which one ranks the candidate estimators. Note that the researcher needs to make a choice of what criteria to use to rank estimators.
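The loop shared by both designs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of either paper: the toy estimators `"A"` and `"B"`, their sampling distributions, and the helper names `simulated_mse` and `rank_estimators` are hypothetical. Given replication estimates from a DGP whose true effect is known, the sketch estimates each MSE and orders the estimators by it.

```python
import random
import statistics

def simulated_mse(estimates, true_value):
    """Mean squared error of replication estimates around the known true effect."""
    return statistics.fmean([(e - true_value) ** 2 for e in estimates])

def rank_estimators(replications, true_value):
    """Return estimator names ordered from lowest to highest simulated MSE."""
    mses = {name: simulated_mse(draws, true_value)
            for name, draws in replications.items()}
    return sorted(mses, key=mses.get), mses

# Hypothetical replication estimates: "A" is unbiased but noisy,
# "B" is biased but precise. In a real EMCS these would come from
# applying each estimator to each simulated sample.
rng = random.Random(0)
true_value = 0.0
replications = {
    "A": [rng.gauss(0.0, 1.0) for _ in range(5000)],
    "B": [rng.gauss(0.5, 0.2) for _ in range(5000)],
}
ranking, mses = rank_estimators(replications, true_value)
```

Swapping the squared error for an absolute error inside `simulated_mse` would rank on a different criterion, which is the choice the researcher has to make.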
2.1 The Placebo Design
The idea of the placebo design is to assign placebo treatments to some control observations, and attempt to recover the true effect which by construction is zero. [Footnote 9: A similar approach is developed by Bertrand et al. (2004) who study inference in difference-in-differences methods using simulations with randomly generated ‘placebo laws’ in state-level data, i.e. policy changes which never actually happened. For follow-up studies, see Hansen (2007), Cameron et al. (2008), and Brewer et al. (2018).] In particular, covariates and outcomes are first drawn jointly by sampling (with replacement) from the empirical distribution of control observations. [Footnote 10: In this paper the sample size is always equal to the size of the original control subsample.] Using the original dataset, the propensity score is estimated (e.g., using a logit model). The estimated parameters of this model, $\hat{\beta}$, are then used to assign placebo treatments to the generated sample in the following way:
$$D_i = 1\left(D_i^{*} > 0\right), \tag{1}$$
$$D_i^{*} = \gamma + \lambda X_i \hat{\beta} + \varepsilon_i, \tag{2}$$
where $\varepsilon_i$ is an iid error, and both $\gamma$ and $\lambda$ are additional parameters to be selected. While $\gamma$ shifts the proportion of observations that are treated, $\lambda$ controls the extent of selection: with $\lambda = 1$ selection on observables takes the same form in the Monte Carlo sample as in the original dataset.
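The assignment step can be sketched as follows. This is an illustrative sketch, not the implementation of Huber et al. (2013): the names `gamma` (shift parameter), `lam` (selection-strength parameter), `beta_hat` (estimated propensity score slope), the toy control sample, and the choice of a standard logistic error are all assumptions made here for the example.

```python
import math
import random

def assign_placebo_treatment(index_values, gamma, lam, rng):
    """Return placebo indicators 1{gamma + lam * index_i + eps_i > 0}."""
    treated = []
    for index in index_values:
        # Inverse-CDF draw from a standard logistic distribution (assumed here;
        # the text only requires an iid error). Clamp u away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        eps = math.log(u / (1.0 - u))
        treated.append(1 if gamma + lam * index + eps > 0 else 0)
    return treated

rng = random.Random(1)
# Hypothetical control sample: (covariate, outcome) pairs.
controls = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]
beta_hat = 0.8  # hypothetical slope from a logit fitted on the original data

# Step 1: resample covariates and outcomes jointly, with replacement.
sample = rng.choices(controls, k=len(controls))
# Step 2: build the estimated index and assign placebo treatments.
index = [beta_hat * x for x, _ in sample]
d_low = assign_placebo_treatment(index, gamma=-1.0, lam=1.0, rng=random.Random(2))
d_high = assign_placebo_treatment(index, gamma=1.0, lam=1.0, rng=random.Random(2))
share_low = sum(d_low) / len(d_low)
share_high = sum(d_high) / len(d_high)
```

Because outcomes are left untouched, the true effect of the placebo treatment in `sample` is zero by construction; raising `gamma` only raises the treated share.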
2.2 The Structured Design
The idea of the structured design is instead to create a parameterised approximation to the original (unknown) data generating process, and then draw samples from the approximated process. To begin, a fixed number of treated and control observations are created, to match the number of each in the original dataset. Covariates and outcome variables are then drawn from parameterised distributions where the parameters are estimated from the original dataset. For example, the variable black might come from a Bernoulli with mean estimated from the data, and the variable earnings from a lognormal distribution with mean and variance estimated from the data. The parameters of these distributions are typically estimated conditional on treatment status. Parameters of some distributions might also be conditional on the value of other variables; e.g., earnings might be conditional on race as well as treatment status. More conditioning will improve the match of the joint distribution of simulated data to the joint distribution of the original data, but will increase the number of parameters that need to be estimated.
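A minimal sketch of this design, using only the two variables named above, might look as follows. The record layout, the helper names `fit_group` and `simulate_structured`, and the toy data are assumptions for illustration; the sketch also assumes strictly positive earnings so that the lognormal fit on the log scale is defined, which real earnings data need not satisfy.

```python
import math
import random
import statistics

def fit_group(rows):
    """Estimate distribution parameters within one treatment arm."""
    log_earnings = [math.log(r["earnings"]) for r in rows]
    return {
        "n": len(rows),
        "p_black": statistics.fmean([r["black"] for r in rows]),  # Bernoulli mean
        "mu_log": statistics.fmean(log_earnings),                 # lognormal location
        "sd_log": statistics.pstdev(log_earnings),                # lognormal scale
    }

def simulate_structured(data, rng):
    """Draw one sample with the same number of treated and controls as `data`."""
    out = []
    for arm in (0, 1):
        p = fit_group([r for r in data if r["treated"] == arm])
        for _ in range(p["n"]):
            out.append({
                "treated": arm,
                "black": 1 if rng.random() < p["p_black"] else 0,
                "earnings": rng.lognormvariate(p["mu_log"], p["sd_log"]),
            })
    return out

# Hypothetical original data: 40 treated and 60 control observations.
rng = random.Random(0)
data = [
    {"treated": int(i < 40),
     "black": int(rng.random() < 0.6),
     "earnings": rng.lognormvariate(8.0, 0.5)}
    for i in range(100)
]
simulated = simulate_structured(data, random.Random(1))
```

Conditioning earnings on race as well would mean calling `fit_group` within each (arm, black) cell, at the cost of more parameters per cell.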
3 Theory
To understand the conditions under which an EMCS might be informative about the preferred estimator in some particular dataset, we first construct a simple example. Here we have only two estimators, with a straightforward and restricted joint sampling distribution (bivariate Gaussian). This bivariate Gaussian setting mimics an ideal situation in which the finite sample distribution of the estimators is well approximated by their asymptotic distribution. We show that even in such an ideal, large sample situation, EMCS can fail to select the best estimator if the bias in any one of the estimators or the ranking of variances is not correctly replicated in the simulated samples. [Footnote 11: In Section 5 we will show that in practice this means that, when estimators are unbiased, rankings based on a simple bootstrap perform at least as well and sometimes better than the more involved EMCS procedures.] We provide simple common cases for treatment effect estimation in which failure to capture the biases and heteroskedasticity contaminates EMCS, and provide results from a simple simulation illustrating this. [Footnote 12: The role of the simulations is to show a quantitative example of how performance might look. Since we are aware this simulation is stylised, in Section 4 we also provide simulation results based on real-world data.] We then extend the example to the case of more than two estimators.
3.1 Simple Example: Two Estimator Case
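Suppose the researcher wants to rank two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ according to their statistical performances under repeated sampling. These estimators are estimating the same object of interest $\theta$, but their constructions are different. For simplicity of the illustration, assume that the joint sampling distribution of the two estimators is bivariate Gaussian:
$$\begin{pmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \end{pmatrix} \sim N\left(\mu,\; n^{-1}\Sigma\right), \tag{3}$$
where $\mu = (\mu_1, \mu_2)'$, $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$, and $n$ is the sample size. Here, our implicit assumption is that the estimators converge to $\mu$ at the $\sqrt{n}$ rate. Let $\theta_0$ be the true value of the parameter of interest. We allow $\hat{\theta}_1$ and/or $\hat{\theta}_2$ to be biased so that $\mu_1$ and/or $\mu_2$ can differ from $\theta_0$.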
We rank these estimators according to their statistical performances. Given that we often assess the performance of an estimator by its mean squared error (MSE) or mean absolute error (MAE), we may, for instance, rank the estimators according to their MSEs or MAEs. [Footnote 13: MSE and MAE criteria do not take into account the dependence of the estimators. One way to rank the estimators that takes into account their dependence is based on the probability of being closer to the truth, $\Pr(|e_1| < |e_2|)$, where $e_1 = \hat{\theta}_1 - \theta_0$ and $e_2 = \hat{\theta}_2 - \theta_0$ are the estimation errors of the two estimators. That is, $\hat{\theta}_1$ is preferred to $\hat{\theta}_2$ if $\Pr(|e_1| < |e_2|) > 1/2$ and $\hat{\theta}_2$ is preferred to $\hat{\theta}_1$ if $\Pr(|e_1| < |e_2|) < 1/2$. Considering this criterion instead of MSE does not affect the main results in our simple example.] Given the Gaussian assumption, the MSE of each estimator, $\hat{\theta}_j$, $j = 1, 2$, is
$$MSE_j = E\left[(\hat{\theta}_j - \theta_0)^2\right] = (\mu_j - \theta_0)^2 + \frac{\sigma_j^2}{n}.$$
We denote by $j^{*}$ the index of the strictly preferred estimator, assuming it exists. Ranking the estimators is difficult in practice since we do not know the mean and variances of the estimators as well as the true value of $\theta_0$. Proposals of the EMCS literature aim to infer a best performing estimator by estimating the sampling distribution of $\hat{\theta}_1$ and $\hat{\theta}_2$ via some Monte Carlo studies. For simplicity, we assume that the estimators simulated in EMCS also follow bivariate Gaussian,
$$\begin{pmatrix} \tilde{\theta}_1 \\ \tilde{\theta}_2 \end{pmatrix} \sim N\left(\tilde{\mu},\; m^{-1}\tilde{\Sigma}\right), \tag{4}$$
where $\tilde{\mu} = (\tilde{\mu}_1, \tilde{\mu}_2)'$, $\tilde{\Sigma} = \begin{pmatrix} \tilde{\sigma}_1^2 & \tilde{\rho}\tilde{\sigma}_1\tilde{\sigma}_2 \\ \tilde{\rho}\tilde{\sigma}_1\tilde{\sigma}_2 & \tilde{\sigma}_2^2 \end{pmatrix}$, and $m$ is the size of a simulated sample that may differ from the size of the original sample $n$. The underlying parameters in EMCS, $(\tilde{\mu}, \tilde{\Sigma})$, generally depend on the original sample, but we assume for simplicity that the dependence is negligible and they can be treated as constants. EMCS computes $\tilde{\theta}_1$ and $\tilde{\theta}_2$ repeatedly using simulated samples of size $m$ drawn from a data generating process with the parameter value set at known value $\tilde{\theta}_0$. For instance, the placebo EMCS approach of Huber et al. (2013) sets $\tilde{\theta}_0 = 0$ and $m = n_0$, the size of the control group in the original data. The approach of structured EMCS sets $\tilde{\theta}_0$ at an estimate of $\theta_0$ constructed from the original sample. In implementing EMCS, we do not have to know the mean and variance parameters of $(\tilde{\theta}_1, \tilde{\theta}_2)$, and they can be estimated with arbitrary accuracy based on the simulated estimators. EMCS accordingly obtains the MSE of each estimator, $j = 1, 2$, by
$$\widetilde{MSE}_j = (\tilde{\mu}_j - \tilde{\theta}_0)^2 + \frac{\tilde{\sigma}_j^2}{m}.$$
We denote by $\hat{j}$ the index for a best performing estimator estimated from EMCS, $\hat{j} = \arg\min_{j} \widetilde{MSE}_j$. To assess the validity of EMCS, we define a criterion of EMCS internal validity by the probability that $\hat{j}$ coincides with $j^{*}$, $\Pr(\hat{j} = j^{*})$, where the probability is evaluated under repeated sampling of the original samples. In the examples to follow, we investigate how this criterion of EMCS validity becomes one or zero depending on the parameter values in the bivariate Gaussian distributions of (3) and (4). [Footnote 14: We assume away the dependence of the parameters in (4) on the original sample for simplicity of illustration. In such a case the MSE estimates in EMCS and resulting selection of a best estimator are nonrandom. The criterion of EMCS internal validity in this case is either 1 or 0.] We can also consider the average regret type criterion such as $E\left[MSE_{\hat{j}}\right] - MSE_{j^{*}}$ to quantify EMCS internal validity. Here, the expectation concerns the sampling distribution of EMCS’s selection of an optimal estimator $\hat{j}$. This average regret criterion can quantify severity of a wrong choice of the estimators in terms of how much MSE is on average sacrificed relative to the true best-performing estimator.
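The two criteria can be illustrated numerically under the Gaussian setup, where each MSE is the squared bias plus the sample-size-adjusted variance divided by the sample size. The sketch below is an assumption-laden illustration: the function names and all parameter values are hypothetical, the EMCS parameters are treated as fixed constants, and estimators are indexed 0 and 1 rather than 1 and 2.

```python
def gaussian_mse(mean, variance, sample_size, true_value):
    """MSE of an estimator distributed N(mean, variance / sample_size)."""
    return (mean - true_value) ** 2 + variance / sample_size

def best_index(means, variances, sample_size, true_value):
    """Return the index of the lowest-MSE estimator and the list of MSEs."""
    mses = [gaussian_mse(m, v, sample_size, true_value)
            for m, v in zip(means, variances)]
    return min(range(len(mses)), key=mses.__getitem__), mses

# Original sampling process (hypothetical values): estimator 0 is unbiased,
# estimator 1 has bias 0.3 but a smaller variance. True parameter is 1.0.
j_star, true_mses = best_index(means=[1.0, 1.3], variances=[4.0, 1.0],
                               sample_size=100, true_value=1.0)

# EMCS (hypothetical values): the variances are reproduced, but both simulated
# estimators are centred on the simulated truth of 0, so the bias is missed.
j_hat, emcs_mses = best_index(means=[0.0, 0.0], variances=[4.0, 1.0],
                              sample_size=100, true_value=0.0)

internally_valid = (j_hat == j_star)             # validity criterion is 0 or 1 here
regret = true_mses[j_hat] - true_mses[j_star]    # MSE sacrificed by EMCS's choice
```

With these values EMCS picks the biased estimator, so the validity indicator is false and the regret is the gap between the two true MSEs.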
3.1.1 Scenario 1
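Denote the biases in $(\hat{\theta}_1, \hat{\theta}_2)$ by $b = (b_1, b_2)' = (\mu_1 - \theta_0, \mu_2 - \theta_0)'$ and the biases in $(\tilde{\theta}_1, \tilde{\theta}_2)$ by $\tilde{b} = (\tilde{b}_1, \tilde{b}_2)' = (\tilde{\mu}_1 - \tilde{\theta}_0, \tilde{\mu}_2 - \tilde{\theta}_0)'$. We start with a scenario in which $(\hat{\theta}_1, \hat{\theta}_2)$ are unbiased and the distribution of $(\tilde{\theta}_1, \tilde{\theta}_2)$ well replicates the distribution of $(\hat{\theta}_1, \hat{\theta}_2)$ in the following sense:
$$\tilde{b} = b = 0, \qquad \tilde{\Sigma} = \Sigma. \tag{5}$$
Here, the biases and the sample-size-adjusted variances of the estimators simulated in EMCS coincide with those of the estimators in the original data generating process. Note that the true value of parameter assumed in EMCS, $\tilde{\theta}_0$, does not have to agree with the true parameter value in the original sampling process, $\theta_0$.
In the current scenario, the ranking of the true MSEs clearly coincides with the ranking of the MSE estimates in EMCS, implying $\Pr(\hat{j} = j^{*}) = 1$. This is a benchmark case in which EMCS works. The next two scenarios show that once we depart from the assumptions in (5), EMCS may no longer be valid.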
3.1.2 Scenario 2
Assume that the estimators are free from biases both in the original data generating process and EMCS, $b = \tilde{b} = 0$, but EMCS fails to replicate the normalised covariance matrix of the estimators, $\tilde{\Sigma} \neq \Sigma$. In this case, the MSE estimates in EMCS correctly rank the true MSEs of the estimators (assuming $\sigma_1^2 \neq \sigma_2^2$) if and only if the ordering of the variances of the two estimators agrees between the original sampling process and the simulated sampling process, i.e. $\mathrm{sign}(\tilde{\sigma}_1^2 - \tilde{\sigma}_2^2) = \mathrm{sign}(\sigma_1^2 - \sigma_2^2)$. Otherwise, EMCS reverses the ranking of the estimators and incorrectly selects a suboptimal estimator as optimal, $\hat{j} \neq j^{*}$.
Hence, even when EMCS well replicates the biases of the estimators, it can fail to select a best performing estimator due to an incorrect variance ordering.
3.1.3 Scenario 3
In the third scenario, we assume that EMCS correctly replicates the variance ordering of the estimators, i.e. $\mathrm{sign}(\tilde{\sigma}_1^2 - \tilde{\sigma}_2^2) = \mathrm{sign}(\sigma_1^2 - \sigma_2^2)$, but fails to replicate the biases, $\tilde{b} \neq b$. To be specific, we set $b_1 = \tilde{b}_1 = \tilde{b}_2 = 0$, but $b_2 \neq 0$. This can correspond to a situation in which estimator 1 is correctly specified and has no bias, whereas estimator 2 is misspecified and is subject to bias in the original data generating process. EMCS, however, fails to capture the misspecification bias in estimator 2.
Suppose $\sigma_1^2 > \sigma_2^2$ holds. The true MSEs are $MSE_1 = \sigma_1^2 / n$ and $MSE_2 = b_2^2 + \sigma_2^2 / n$, while the MSE estimates in EMCS are $\widetilde{MSE}_1 = \tilde{\sigma}_1^2 / m$ and $\widetilde{MSE}_2 = \tilde{\sigma}_2^2 / m$. Since we assumed that EMCS correctly replicates the variance ordering of the estimators, EMCS selects $\hat{\theta}_2$ as a best estimator. This selection of the estimator is indeed misleading if $b_2$ is far from zero, since if $b_2^2 > (\sigma_1^2 - \sigma_2^2)/n$, $\hat{\theta}_1$ outperforms $\hat{\theta}_2$ in terms of MSE.
This scenario highlights that EMCS-based selection of the estimator can fail if any one of the estimators is misspecified and the simulation design in EMCS does not replicate the misspecification bias.
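A worked numerical case may help fix ideas. The values are hypothetical and the notation is assumed for this illustration: $\sigma_1^2 = 4$ and $\sigma_2^2 = 1$ are the sample-size-adjusted variances of estimators 1 and 2, $n = 100$ is the sample size, estimator 1 is unbiased, and $b_2$ is the bias of estimator 2.

```latex
\begin{align*}
MSE_1 &= \frac{\sigma_1^2}{n} = \frac{4}{100} = 0.04, \\
MSE_2 &= b_2^2 + \frac{\sigma_2^2}{n} = b_2^2 + 0.01, \\
MSE_1 < MSE_2 &\iff b_2^2 > \frac{\sigma_1^2 - \sigma_2^2}{n} = 0.03
  \iff |b_2| > \sqrt{0.03} \approx 0.173 .
\end{align*}
```

An EMCS that reproduces the variances with $m = n$ but sets both simulated biases to zero reports MSE estimates of $0.04$ and $0.01$ and therefore always selects estimator 2, which is the wrong choice whenever $|b_2|$ exceeds roughly $0.173$. Because the threshold shrinks at rate $1/\sqrt{n}$, even a small bias is enough in large samples.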
3.2 Are Scenarios 2 and 3 Relevant in Treatment Effect Estimation?
We next provide simple but empirically relevant examples where we focus on the estimation of treatment effects, and show that both types of EMCS may yield misleading choices of the estimators for the reasons illustrated in Scenarios 2 and 3 above.
Data are given by a random sample of $(Y_i, D_i, X_i)$, $i = 1, \dots, n$, where $Y_i$ is unit $i$’s observed post-treatment outcome, $D_i \in \{0, 1\}$ is her treatment status, and $X_i$ is a vector of her pre-treatment characteristics whose support is assumed to be bounded. We denote unit $i$’s potential outcomes by $(Y_i(1), Y_i(0))$. We assume the unconfoundedness assumption, $(Y_i(1), Y_i(0)) \perp D_i \mid X_i$, throughout. The propensity score is denoted by $p(x) = \Pr(D_i = 1 \mid X_i = x)$.
3.2.1 An Example for Scenario 2
To keep our example as simple as possible, consider the following data generating processes:
(6)  
The specified mean equations for both potential outcomes imply that the conditional average treatment effects are homogeneous over observable characteristics and equal to . The potential outcomes are heteroskedastic if . We assume a linear probability for the propensity score in order to simplify analytical comparisons of the variances of the estimators we introduce below.
Suppose that the parameter of interest is the population average treatment effect for the treated (ATT), . Since specification (6) implies homogeneous treatment effects, , the true value of ATT is .
We consider two different estimators to estimate the population ATT. The first estimator is a semiparametric estimator for ATT, which is consistent without assuming functional forms for the outcome and propensity score equations, and asymptotically attains the semiparametric efficiency bound (SEB) of ATT derived by Hahn (1998). Estimators that attain this property include the inverse probability weighting (IPW) estimator with nonparametrically estimated propensity scores (Hirano et al., 2003), doubly robust estimators of Hahn (1998), covariate or propensity score matching estimators with a single covariate (Abadie and Imbens, 2006, 2016), and covariate balancing estimators of Graham et al. (2012, 2016). We can set any one of these estimators as our first estimator without affecting the analysis below.
We specify the second estimator
as the ordinary least squares estimator of
in the following regression equation:(7) 
In other words, . The specification of (6) implies that is unbiased and consistent for the population ATT, . We consider a situation that the finite sample distribution of is well approximated by its large sample normal approximation, i.e.
where is the asymptotic variance of given by SEB for ATT without the knowledge of propensity scores, and is the asymptotic variance of . Under the current specification, they are obtained as
(8)  
(9) 
See Appendix A for their derivations.
When and share the variance (), it can be shown that the OLS estimator is more efficient than the semiparametric estimator, , due to exploitation of the correct functional form of the regression equation. In contrast, if the variance of the treated outcome is higher than the variance of the control outcome (), the simple OLS estimator that does not take into account the heteroskedastic errors can become less efficient than the semiparametric estimator. Specifically, we show in Appendix A that
(10)  
Hence, if the degree of heteroskedasticity satisfies the condition in (10), the semiparametric estimator is strictly preferred to the OLS estimator .
Given that meets (10), consider applying the placebo EMCS proposed in Huber et al. (2013). We assume that the two estimators are centred at zero and their simulated distributions can be well approximated by bivariate Gaussian,
where is the sample size of control group in the original sample. Suppose also that the propensity scores used to generate the placebo treatment coincide with the true propensity scores in the original data. Since the placebo treated group is generated from the original control group, it fails to replicate the variance of the treatment outcomes in the original data. As a result, the variances of and are given by the homoskedastic version () of (8) and (9),
(11) 
where and are the probability and expectation with respect to the sampling distribution specified in the placebo EMCS. This inequality is strict if is nondegenerate. EMCS therefore incorrectly selects the OLS estimator as a preferred estimator.
The underlying mechanism for why EMCS goes wrong is in line with Scenario 2 in the previous subsection. Even in a rather ideal situation where EMCS well replicates the unbiasedness of the estimators, artificially creating a placebo treated group from the control group in the original sample distorts the variance ordering among the estimators.
Exactly the same reasoning can also invalidate structured EMCS designs if the estimated data generating process from which the data are to be simulated ignores or fails to replicate the underlying heteroskedasticity of the potential outcome distributions.
This problem can be seen in a simple simulation study. We draw 1,000 samples from a data generating process of the form given by equation (6) with 1,000 observations per sample. [Footnote 15: For full details of our procedure and parameter values see Appendix B. For detailed simulation results see Appendix C.] For each sample we run 1,000 replications of the placebo and structured EMCS procedures, considering IPW and OLS as our two estimators. This gives us ‘the true MSE’ for each estimator (based on the original samples) as well as 1,000 estimates of the MSE for each combination of an estimator and an EMCS design. Looking at a simple count of how many times each procedure selects the right estimator, we see that the placebo approach selects the superior estimator only 19 times (1.9% of the time) and the structured approach is little better at 30 times (3.0%). This compares with 97.6% and 100% for the placebo and structured procedures, respectively, when there is no heteroskedasticity. Of course this is a single example, and in a very stylised context; in Section 4 we will see that the performance of these methods is also poor in a ‘real-world’ example.
3.2.2 An Example for Scenario 3
We shift our focus to Scenario 3. We now introduce a bias in one of the estimators in the original data generating process. For this purpose, we maintain the two estimators as in the previous example, but alter the potential outcome equations from (6) with
(12)  
with distinct slopes in the two potential outcome equations. This causes the regression specification of (7) to be misspecified so that the OLS estimator is no longer consistent for the population ATT. See, e.g., Słoczyński (2017) for analytical characterizations of the bias. On the other hand, the semiparametric estimator remains consistent and semiparametrically efficient (asymptotically attains SEB). [Footnote 16: Due to the misspecification of the regression equation, the asymptotic variance of the OLS estimator differs from, and is generally greater than, the variance in (9).] Hence, assuming that the finite sample distribution of the two estimators is well approximated by its asymptotic normal approximation, we have
As we argued in Scenario 3 above, the bias in the OLS estimator makes it inferior to the unbiased semiparametric estimator, even when its variance is smaller, if the bias or the sample size is sufficiently large.
In the placebo EMCS procedure of Huber et al. (2013), the fact that the placebo treated group is generated from the original control group removes the misspecification issue of the OLS estimator caused by the nonparallel treatment outcome equation. Hence, the simulated OLS estimator behaves as a correctly specified OLS estimator with homoskedastic errors, and its simulated distribution fails to replicate the bias of the OLS estimator in the original data. Since the variance ordering in EMCS obtained in (11) is preserved in the current example, EMCS erroneously concludes that the OLS estimator dominates the semiparametric estimator.
In the case of structured EMCS procedures, if the data generating process from which Monte Carlo samples are drawn is estimated under misspecification, the structured EMCS misleads the estimator selection for exactly the same reason. For example, if one were to construct the Monte Carlo data generating process using linear regressions additive in the treatment indicator and the covariates, structured EMCS will then wrongly conclude that the OLS estimator outperforms the semiparametric estimator.
Again we perform a simple simulation, analogous to the previous subsection but modifying the potential outcome equation as given by equation (12). [Footnote 17: For full details of our procedure and parameter values again see Appendix B. Similarly, for detailed simulation results see Appendix C.] We perform 1,000 replications of each EMCS procedure using the same estimators, and then compare in how many cases the EMCS correctly selects the estimator with the lower MSE. Again the performance of EMCS is rather poor: placebo EMCS correctly selects IPW 2.3% of the time, and structured EMCS is correct only 0.2% of the time.
3.3 More Than Two Estimators
Applications of EMCS often consider comparing more than two estimators. Fragility of EMCS-based estimator selection highlighted in the two estimator examples above naturally carries over to settings with more than two estimators, since ranking over multiple estimators consists of transitive pairwise rankings of any two candidate estimators.
The Monte Carlo exercises and the empirical application below consider a setting with seven estimators in the context of program evaluation with observational data. Let $\{\hat{\theta}_1, \dots, \hat{\theta}_J\}$ be the pool of candidate estimators, and let the purpose of EMCS be to obtain a complete ordering among these estimators according to the MSE criteria. [Footnote 18: The pairwise ordering criterion defined in footnote 13 is not suitable to generate a complete ordering among several estimators since the ordering criterion described there is not transitive. For instance, consider three random variables such that for ,] The internal validity criteria for EMCS introduced above, $\Pr(\hat{j} = j^{*})$ and $E\left[MSE_{\hat{j}}\right] - MSE_{j^{*}}$, can be straightforwardly extended to the case with several estimators. In addition, to measure similarity or dissimilarity between the true ranking and estimated rankings in EMCS, it can be of interest to look at the distribution of the Kendall’s tau,
$$\tau = \frac{2}{J(J-1)} \sum_{j < k} \mathrm{sign}(r_j - r_k)\, \mathrm{sign}(\hat{r}_j - \hat{r}_k),$$
where $r_j$ and $\hat{r}_j$, $j = 1, \dots, J$, are the ranks of estimator $j$ with respect to the true MSE and estimated MSE in EMCS, respectively. Noting $\tau$ has a distribution under repeated sampling, its mean or other location parameters can summarise how well EMCS can assess the relative performances among the candidate estimators.
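Kendall’s tau for two rankings of the same estimators can be computed directly from its pairwise definition: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The sketch below assumes no ties in either ranking; the function names and the example rankings of seven estimators are hypothetical.

```python
from itertools import combinations

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def kendall_tau(true_ranks, emcs_ranks):
    """Kendall's tau between two rankings of the same estimators (no ties assumed)."""
    if len(true_ranks) != len(emcs_ranks):
        raise ValueError("both rankings must cover the same estimators")
    pairs = list(combinations(range(len(true_ranks)), 2))
    # Each pair contributes +1 if the two rankings order it the same way,
    # and -1 if they order it in opposite ways.
    total = sum(
        sign(true_ranks[j] - true_ranks[k]) * sign(emcs_ranks[j] - emcs_ranks[k])
        for j, k in pairs
    )
    return total / len(pairs)

# Hypothetical ranks of seven estimators: EMCS swaps the top two and the bottom two.
true_ranks = [1, 2, 3, 4, 5, 6, 7]
emcs_ranks = [2, 1, 3, 4, 5, 7, 6]
tau = kendall_tau(true_ranks, emcs_ranks)
```

Here 2 of the 21 pairs are discordant, so `tau` equals 17/21. Averaging this quantity over repeated original samples gives the summary of ranking agreement described above.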
4 Application
To demonstrate the empirical relevance of the theoretical results discussed above, and consider the extent to which they might be a problem in practice, we provide an application of EMCS procedures to a real dataset. In these data we have an experimental estimate of the treatment effect. By (initially) treating the experimental estimate as the true treatment effect, the aim is to show whether (or not) EMCS procedures can accurately recover the ranking of estimators that we see from the experiment. We first discuss the data used, then our approach, next the estimators, and finally the details of how the EMCS procedures were conducted.
4.1 Data and Context
We focus on the data on men from LaLonde (1986), used also by Heckman and Hotz (1989), Dehejia and Wahba (1999, 2002), and Smith and Todd (2001, 2005).^{19} A subset of these data comes from the National Supported Work (NSW) Demonstration, which was a work experience programme that operated in the mid-1970s at 15 locations in the United States (for a detailed description of the programme see Smith and Todd, 2005). This programme served several groups of disadvantaged workers, such as women with dependent children receiving welfare, former drug addicts, ex-convicts, and school dropouts. Unlike many similar programmes, the NSW implemented random assignment among eligible participants. This random selection allowed for straightforward evaluation of the programme via a comparison of mean outcomes in the treatment and control groups.
^{19} Recent work by Calónico and Smith (2017) highlights the effects of the NSW programme for women. Prior to this, women were largely ignored in the NSW literature subsequent to LaLonde (1986) because the analysis datafile for women was not preserved.
In an influential paper, LaLonde (1986) uses the design of this programme to assess the performance of a large number of nonexperimental estimators of average treatment effects, many of which are based on the assumption of unconfoundedness. He discards the original control group from the NSW data and creates several alternative comparison groups using data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID), two standard datasets on the U.S. population. His key insight is that a ‘good’ estimator should be able to closely replicate the experimental estimate of the effect of NSW using nonexperimental data. He finds that very few of the estimates are close to this benchmark. This result motivated a large number of replications and follow-ups, and established a testbed for estimators of average treatment effects under unconfoundedness (see, e.g., Heckman and Hotz 1989; Dehejia and Wahba 1999, 2002; Smith and Todd 2001, 2005; Abadie and Imbens 2011; Diamond and Sekhon 2013). Like many other papers, we use the largest of the six nonexperimental comparison groups constructed by LaLonde (1986), which he refers to as CPS1.
4.2 Approach
In this paper we take the key insight of LaLonde (1986) one step further. We treat the NSW–CPS data from LaLonde (1986) as a finite population, with 185 treated observations and 7,660 comparison observations in our main example.^{20} From this we draw 1,000 samples, each composed of 100 treated observations and 1,900 comparison observations. We then implement the estimators described below. For each sample and each estimator we compute the difference between the estimate and the ‘true effect’ ($1,794), which comes from the experimental estimate of the impact of NSW on earnings. With 1,000 such differences for each estimator, we can compute the MSE and other performance measures for that estimator in these data. Then, on each of the 1,000 samples, we implement the two EMCS procedures described in Section 2, and compare their performance in terms of the criteria introduced in Section 3.
^{20} This comes from taking the treated sample used by Dehejia and Wahba (1999) and a trimmed version of the CPS1 dataset. We use a logit model to predict the propensity to be in the experimental data (either as treatment or control) versus being in the CPS1 data. We then drop all CPS1 observations with propensity scores below the minimum or above the maximum in the experimental data. This is the trimmed CPS1 dataset, which we then combine with the NSW treated observations from Dehejia and Wahba (1999).
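The sampling and scoring steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the paper: the outcomes are synthetic placeholders rather than the NSW–CPS data, the difference in means stands in for the seven estimators, and drawing without replacement within each group is an assumption. The helper names `draw_sample` and `performance` are hypothetical.

```python
import numpy as np

TRUE_EFFECT = 1794.0  # experimental benchmark quoted in the text

def draw_sample(n_treated_pop, n_comparison_pop, rng, n_treated=100, n_comparison=1900):
    """Return row indices of one sample from a population stored treated-first.

    Drawing without replacement within each group is an assumption.
    """
    treated = rng.choice(n_treated_pop, size=n_treated, replace=False)
    comparison = n_treated_pop + rng.choice(n_comparison_pop, size=n_comparison, replace=False)
    return treated, comparison

def performance(estimates, true_effect):
    """Bias, variance and MSE of a set of estimates around a known effect."""
    errors = np.asarray(estimates, dtype=float) - true_effect
    bias = errors.mean()
    return {"bias": bias, "abs_bias": abs(bias),
            "variance": errors.var(), "mse": np.mean(errors ** 2)}

# Synthetic placeholder outcomes, NOT the NSW-CPS data.
rng = np.random.default_rng(0)
outcome = np.concatenate([rng.normal(6000, 3000, 185), rng.normal(4200, 3000, 7660)])

estimates = []
for _ in range(1000):
    t, c = draw_sample(185, 7660, rng)
    estimates.append(outcome[t].mean() - outcome[c].mean())  # toy estimator

summary = performance(estimates, TRUE_EFFECT)
print(summary)
```

Because the variance is computed with divisor equal to the number of samples, the MSE returned here decomposes exactly into squared bias plus variance, which is the decomposition the later discussion of bias-driven versus variance-driven rankings relies on.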
One limitation of this approach is that the ‘true effect’ we calculate is subject to sampling error. We therefore consider a second case, where we apply the insight of Smith and Todd (2005) that the control sample from the NSW can be compared to the same nonexperimental comparison group. The NSW control sample includes people who were selected in the same way as those actually treated, but who were randomised out of treatment. Now we know that the ‘true effect’ is a precise zero, since the control sample did not actually receive treatment. Thus, we have an original dataset of 142 ‘treated’ observations (who in reality received no treatment) and 7,467 comparison units.^{21} Again we draw samples by selecting 100 treated observations and 1,900 comparison observations from this population, with the true effect being precisely zero in each sample, and then perform EMCS on these samples.
^{21} This comes from taking the control sample used by Smith and Todd (2005) and a trimmed version of the CPS1 dataset. The size of the comparison subsample (7,467 observations) is different from that in the first case (7,660 observations) because our logit model – which we use to predict propensity to be in the experimental data and to trim – is now fitted on a different dataset; while the comparison units remain the same, the treated and control subsamples are different.
Another possible worry might be that our example applies estimators that are suitable under unconfoundedness, i.e. when potential outcomes are independent of treatment assignment, conditional on observables. One of the main conclusions in Smith and Todd (2005) is that such conditional independence is not plausible in the context of the NSW–CPS data. To address this concern, we take a third approach. The basic idea is to construct a population similar to the NSW–CPS data where unconfoundedness holds by construction, and then draw samples from this. We begin with a trimmed version of the Dehejia and Wahba (1999) dataset used in the first case. Next, we perform 4-nearest-neighbour matching (with replacement) to impute the ‘missing’ potential outcome for each observation. This is our new population. We then draw random subsamples of 2,000 observations (covariates, potential outcomes, and propensity scores) from the data, add a logistic error to the estimated propensity score, and assign to treatment the individuals in the top quarter of the adjusted propensity score distribution (giving 500 treated and 1,500 nontreated in each sample). By construction treatment is now independent of potential outcomes, conditional on observables. We also use our knowledge of potential outcomes of all individuals to calculate the true value (or ‘pseudo-true’ value) of ATT.^{22} This ‘pseudo-true’ value of ATT is equal to –$405. We implement EMCS on the samples drawn in this way.
^{22} More precisely, we use the representation of ATT as an average of unit-level treatment effects, weighted by each unit’s probability of treatment and normalised by the overall share of treated units, to calculate its value in our application. By design, we know both potential outcomes for all units and we set the share of treated units at 25%. We approximate the probability of treatment for all units with the empirical probabilities of treatment in 10,000 samples from the original data generating process.
4.3 Estimators
In all our simulations we study the impact of the NSW programme on earnings in 1978. We consider seven nonexperimental estimators: linear regression, Oaxaca–Blinder, inverse probability weighting (IPW), doubly-robust regression, uniform kernel matching, nearest neighbour matching, and bias-adjusted nearest neighbour matching. For details see Appendix D. In each case we focus on the average treatment effect on the treated (ATT), unless a given method does not allow for heterogeneity in effects (in which case we estimate the overall effect of treatment). As noted above, all of these estimators are based on the assumption of unconfoundedness.
We use a single set of control variables in all our simulations. Following Dehejia and Wahba (1999) and Smith and Todd (2005), we control for age, age squared, age cubed, education, education squared, whether a high school dropout, whether married, whether black, whether Hispanic, earnings in months 13–24 prior to randomization, earnings in 1975, nonemployment in months 13–24 prior to randomization, nonemployment in 1975, and the interaction of education and earnings in months 13–24 prior to randomization.
4.4 Procedures
In Section 2 we noted that the placebo design requires a choice of two parameters: one determines the degree of covariate overlap between the ‘placebo treated’ and ‘placebo control’ observations, and the other determines the proportion of the ‘placebo treated’. We choose the latter to ensure that the proportion of the ‘placebo treated’ observations in each placebo EMCS replication is equal to the proportion of treated units in the sample.^{23} We also follow Huber et al. (2013) in choosing the former, as well as in using a logit model to estimate the propensity score.
^{23} It should be noted, however, that the way these datasets were constructed by LaLonde (1986) results in samples that are best described as choice-based. More precisely, the treatment and control groups are heavily overrepresented relative to their population proportions. See Smith and Todd (2005) for a further discussion of this issue.
The structured design requires more choices, in particular how we specify the joint probability distribution as the product of the marginal distribution for treatment status and some conditional distributions. As discussed in Section 2, we begin each structured EMCS replication by generating a fixed number of treated and nontreated observations to match the numbers in the sample. We then order the covariates, regress each covariate on the preceding covariates (using logistic regression for binary covariates), and use this to define the conditional distribution for that covariate. In EMCS replications the covariates are then drawn in the same order, from the appropriate conditional distribution. Full details of the procedure are provided in Appendix E.
5 Results
We now describe the results of our tests of the two EMCS procedures – placebo and structured – in the context of our real-world data. As described in Subsection 4.2, we perform three sets of tests. First, we apply the two procedures to the NSW treatment sample, combined with the CPS1 comparison dataset. We find performance of the procedures to be poor when it comes to finding the estimator with the lowest bias. When we study MSE (i.e., account also for variance), performance is better. This is because the rankings of estimators are mainly being driven by the variance, and both EMCS methods do well at replicating the variance components. However, given this, we also test a simple bootstrap procedure and find that it is more effective at picking the best estimator. Then, we follow Smith and Todd (2005) in using the NSW controls as our ‘treated’ sample instead: now the effect we intend to estimate is known to be exactly zero, removing worries that poor performance might be an artefact caused by sampling uncertainty around the true effect. We find that the previous results are maintained. Finally, we use an adjusted version of the original data, constructed so that conditional independence necessarily holds, to allay concerns that poor performance is driven by a context in which unconfoundedness may not hold. Again we find that the EMCS procedures do not perform well on bias, and are better on MSE, although here the bootstrap does not clearly dominate.
5.1 Testing EMCS in the NSW Data
Our first results using ‘real-world’ data focus on the variant of the original NSW treatment sample constructed by Dehejia and Wahba (1999), combined with a trimmed version of the CPS1 comparison dataset. We create 1,000 samples from the original dataset by sampling 100 treated and 1,900 nontreated observations from the 185 possible treated and 7,660 comparison units in the original data. We implement the two EMCS procedures 1,000 times on each of the 1,000 samples, giving a total of 1,000,000 replications for each EMCS procedure. In each replication we implement the seven estimators described earlier, and measure how well the two EMCS procedures help us assess the relative performance of the estimators. We might measure performance of an estimator in terms of absolute bias or MSE (which also takes into account its variance). Performance (‘internal validity’) of EMCS is then measured by how well the EMCS procedure replicates these features of an estimator in the original samples. In Section 3, we described two measures of EMCS performance suitable for when we have many estimators:

the average regret, i.e. average difference in absolute bias/MSE between the estimator selected by EMCS and the estimator with the actual minimum absolute bias/MSE; and

the average Kendall’s tau (Kendall’s rank correlation coefficient), which measures the similarity between the ranking of estimators suggested by EMCS and the ‘true’ ranking from the original samples.^{24}
^{24} See Subsection 3.3 for the precise calculation.
For ease of interpretation, it is also useful to normalise the values of average regret. Our discussion below focuses on the average regret as a percentage of the minimum value of absolute bias/MSE. However, we also consider an alternative normalisation, where we divide the average regret for a given EMCS procedure by the average regret for random selection of estimators (which we discuss further below). Finally, we also consider an additional measure, which is straightforward to interpret, namely

the average correlation in absolute bias/MSE (rather than in the rankings, as given by Kendall’s tau).
In each case the comparison is between what the EMCS procedure suggests and the results from taking the ‘true effect’ in the original data, and then calculating the absolute bias/MSE of each estimator across the 1,000 samples.
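The two main measures listed above can be sketched for a single sample as follows. This is an illustrative sketch, not the code used for the paper: the loss vectors are made-up numbers, the function names are hypothetical, and breaking ties by position is an assumption. Averaging these per-sample quantities over samples would give the average regret and average Kendall’s tau.

```python
import numpy as np

def ranks(values):
    """Rank 1 = smallest loss; ties are broken by position (an assumption)."""
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

def kendall_tau(true_loss, emcs_loss):
    """Kendall's tau between the true ranking and the EMCS ranking."""
    r, r_hat = ranks(true_loss), ranks(emcs_loss)
    n = len(r)
    total = 0
    for j in range(n):
        for k in range(j + 1, n):
            total += np.sign(r[j] - r[k]) * np.sign(r_hat[j] - r_hat[k])
    return 2.0 * total / (n * (n - 1))

def regret(true_loss, emcs_loss):
    """True loss of the estimator EMCS selects, minus the best attainable loss."""
    true_loss = np.asarray(true_loss, dtype=float)
    chosen = int(np.argmin(emcs_loss))
    return true_loss[chosen] - true_loss.min()

def regret_percent(true_loss, emcs_loss):
    """Regret as a percentage of the minimum true loss."""
    return 100.0 * regret(true_loss, emcs_loss) / np.min(true_loss)

# Made-up losses (absolute bias or MSE) for four hypothetical estimators.
true_loss = [4.0, 1.0, 3.0, 2.0]
emcs_loss = [3.5, 2.0, 1.5, 2.5]

print(regret(true_loss, emcs_loss))          # 2.0
print(regret_percent(true_loss, emcs_loss))  # 200.0
print(kendall_tau(true_loss, emcs_loss))     # 0.333...
```

In the toy example EMCS selects the third estimator, whose true loss is 3.0 against a minimum of 1.0, so the regret is 2.0, or 200% of the minimum. Four of the six pairs are ordered the same way in both rankings, giving a tau of (4 − 2)/6 = 1/3.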
To provide a benchmark for the performance of EMCS, we also include results from two other procedures. In the first we simply draw nonparametric bootstrap samples.^{25} We can then compare estimators on variance, and also see how the ranking compares to the ranking on MSE from the original samples. In the second we do not create any samples, but simply rank estimators randomly. This provides a ‘worst-case’ benchmark: suppose a researcher knows nothing at all about performance and just picks an estimator blindly, how would they do? Here we cannot compute a result for the correlation, but can for average regret and Kendall’s tau. Table 1 shows the results from these simulations. Appendix F provides further details.
^{25} Precisely, we sample with replacement, and draw replication samples of the same size as the original sample.
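The two benchmark procedures can be sketched as below. This is a hedged illustration on synthetic data: the two toy estimators (difference in means and an OLS coefficient) stand in for the seven estimators in the paper, resampling all rows jointly rather than within treatment groups is an assumption, and the function names are hypothetical.

```python
import numpy as np

def diff_in_means(y, d, x):
    return y[d == 1].mean() - y[d == 0].mean()

def ols_coefficient(y, d, x):
    """Coefficient on treatment from a regression of y on (1, d, x)."""
    design = np.column_stack([np.ones_like(y), d, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

def bootstrap_variances(y, d, x, estimators, n_boot, rng):
    """Variance of each estimator across nonparametric bootstrap samples."""
    n = len(y)
    draws = {name: [] for name in estimators}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # with replacement, same size as the sample
        for name, estimator in estimators.items():
            draws[name].append(estimator(y[idx], d[idx], x[idx]))
    return {name: float(np.var(values)) for name, values in draws.items()}

def random_ranking(names, rng):
    """'Worst-case' benchmark: order the estimators at random."""
    return list(rng.permutation(names))

# Synthetic sample in which the covariate explains much of the outcome.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
d = (rng.uniform(size=n) < 0.3).astype(float)
y = 1.0 + 2.0 * x + 0.5 * d + rng.normal(size=n)

estimators = {"diff_in_means": diff_in_means, "ols": ols_coefficient}
variances = bootstrap_variances(y, d, x, estimators, n_boot=200, rng=rng)
bootstrap_ranking = sorted(variances, key=variances.get)  # lowest variance first
print(bootstrap_ranking, random_ranking(list(estimators), rng))
```

The bootstrap ranking uses variance only, so it can match an MSE ranking only to the extent that differences in variance, rather than bias, dominate the differences in MSE.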
The first result is that performance of both EMCS procedures in terms of bias is very poor. The average regret in terms of absolute bias, as a percentage of the absolute bias for the best estimator, is 3,067% (3,766%) for placebo (structured), i.e. an order of magnitude larger than the minimum value. It is worse than choosing completely randomly, which would be 1,184% worse than the best estimator. Looking at the ranking across estimators, the average Kendall’s tau is –.21 (–.37) for placebo (structured). So the rankings produced by EMCS are, on average, negatively correlated with the ranking in the original samples. This is worse than random, which gives .00. The same pattern is seen in the average correlation coefficients for absolute bias, which are –.44 (–.51).
A researcher might be interested in knowing about performance of estimators in terms of MSE rather than only considering bias. Here EMCS performs substantially better. The average regret for placebo (structured) is now only 18% (16%), much better than random (142%). Similarly, average Kendall’s tau is now .60 and .64 for placebo and structured, respectively, much better than .00 for random. The lowest panel of Table 1 shows that this is driven by the much better performance in replicating the variances. Since the rankings here are mostly determined by the variance, being able to reproduce variances substantially improves the measures of performance relative to the metrics based on absolute bias.
However, looking at our other benchmark case – the bootstrap – we see that it outperforms both EMCS methods in terms of MSE. Average regret is lower at 7.9%, and the average Kendall’s tau is much higher at .83. Given that MSE performance for EMCS is driven by the variance components, this does not seem surprising. The bootstrap is a simpler procedure than the two EMCS methods, and its ability to help us understand the variability of estimators is well known. It therefore seems like a potentially valuable path which has fewer design choices than EMCS.
5.2 Removing Sampling Error from the ‘True Effect’
The previous subsection calculated the MSE for each estimator by comparing the value of the estimate in each sample to a ‘true effect’ measured using the experiment. One concern might be that the estimate from the experiment is subject to sampling error, and this might somehow negatively affect our performance measures for EMCS. To test this, we now use as our ‘treated’ observations the NSW control sample from Smith and Todd (2005). Since these individuals were selected for the programme in the same way as those actually treated, but were then randomised out, the actual treatment effect for them is precisely zero. We therefore repeat the exercise on these data, again implementing the two EMCS procedures 1,000 times on each of the 1,000 original samples. Table 2 documents the results. Appendix F provides further details.
Our conclusions are similar to those in the previous subsection. In terms of absolute bias, the average regret is much lower than previously, at 30% (42%) for placebo (structured), although this is mostly driven by a large increase in the minimum value of absolute bias.^{26} As noted by Smith and Todd (2005), it is much more difficult to recover the true effect of NSW in these data. The fact that the difference in the values of average regret between Tables 1 and 2 is driven by the minimum absolute bias can also be seen by normalising these values by the average regret for random selection of estimators. In the first simulation study, the average regret for placebo (structured) is approximately 2.6 (3.2) times as large as for random; in the second simulation study, it is approximately 1.6 (2.2) times as large. While these ratios are smaller in the second simulation study, their overall magnitudes are similar in both cases. In particular, the average regret of both EMCS procedures is still worse than that of choosing at random (19%). As before, the average Kendall’s tau is negative for placebo (structured) at –.27 (–.47), which is worse than random (.00) as well. On MSE, performance is better, with average regret of 23% (32%) and average Kendall’s tau of .65 (.55). These are much better than random (94% and .00), but worse than the bootstrap (17% and .81).
^{26} In Subsection 5.1, the minimum absolute bias in the original data was equal to 16 (nearest neighbour matching); now, it is equal to 954 (Oaxaca–Blinder). See also Appendix F.
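As a worked check, the normalisation by random selection can be reproduced from the rounded percentages quoted in Subsections 5.1 and 5.2. Only figures stated in the text are used, so small discrepancies due to rounding are possible.

```latex
\frac{3{,}067}{1{,}184} \approx 2.6, \qquad
\frac{3{,}766}{1{,}184} \approx 3.2, \qquad
\frac{30}{19} \approx 1.6, \qquad
\frac{42}{19} \approx 2.2
```

The first two ratios refer to placebo and structured in the first simulation study, and the last two to placebo and structured in the second.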
5.3 Ensuring Unconfoundedness Holds
Another potential concern is whether the conditional independence assumption holds. Here we take the approach described in Subsection 4.2 to generate 1,000 samples in which conditional independence holds by construction. Then, we implement the two EMCS procedures 500 times on each of these samples.^{27} Table 3 displays the results. Appendix F provides further details.
^{27} The smaller number of replications per sample follows from the significant computational burden of this simulation study, which originates from the larger size of the treated subsamples (500 instead of 100 observations per sample).
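The sample-generation step from Subsection 4.2 can be sketched roughly as follows. Everything here is illustrative: the population is synthetic, the logistic error uses NumPy's default unit scale (the scale used in the paper is not stated), subsamples are drawn without replacement, and the weighted-average form in `pseudo_true_att` is one standard representation of ATT that is assumed rather than taken from the text. The loop uses 200 draws to keep the example fast, whereas the text refers to 10,000.

```python
import numpy as np

def assign_treatment(pscore, rng, treated_share=0.25):
    """Add logistic noise to the propensity score and treat the top share."""
    adjusted = pscore + rng.logistic(size=len(pscore))  # unit scale is an assumption
    n_treated = int(round(treated_share * len(pscore)))
    d = np.zeros(len(pscore), dtype=int)
    d[np.argsort(adjusted)[-n_treated:]] = 1
    return d

def pseudo_true_att(y1, y0, treat_prob):
    """ATT as a treatment-probability-weighted mean of unit-level effects."""
    return np.sum(treat_prob * (y1 - y0)) / np.sum(treat_prob)

# Synthetic stand-in population with both potential outcomes known.
rng = np.random.default_rng(0)
n_pop = 3000
pscore = rng.uniform(0.05, 0.95, n_pop)
y0 = rng.normal(5000, 2000, n_pop)
y1 = y0 + rng.normal(-400, 500, n_pop)

# Approximate each unit's treatment probability by its empirical frequency.
treated_count = np.zeros(n_pop)
sampled_count = np.zeros(n_pop)
for _ in range(200):
    idx = rng.choice(n_pop, size=2000, replace=False)
    d = assign_treatment(pscore[idx], rng)
    sampled_count[idx] += 1
    treated_count[idx] += d

treat_prob = treated_count / np.maximum(sampled_count, 1)
att = pseudo_true_att(y1, y0, treat_prob)
print(d.sum(), round(att, 1))
```

Because treatment depends only on the propensity score (a function of the covariates) and independent noise, treatment is independent of the potential outcomes conditional on the covariates in every sample drawn this way.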
The previous results are broadly maintained even after ensuring conditional independence. In terms of absolute bias the performance of both EMCS approaches is similar to random. In terms of MSE both procedures perform better than random selection of estimators and also marginally better than bootstrap. Average regret in terms of MSE is worse than in the first case, though average Kendall’s tau is a little higher, so it is also not obvious that contexts where conditional independence holds should necessarily see better performance of EMCS procedures.
6 Discussion
Advances in econometrics have left the empirical researcher blessed with a wealth of possible treatment effect estimators from which to choose. They have not yet provided clear guidance on which of these estimators should be preferred in which context. In this paper we studied two proposals which suggest an approach to choosing an appropriate estimator for a given context. The first approach (placebo) suggests a way to introduce placebo treatments to some control observations in a dataset, and studies how well estimators can pick up the true zero effect. The second approach (structured) creates data from a known DGP whose parameters are estimated from features of the original data, and studies how well estimators can pick up the implied true effect in the DGP.
We showed theoretically that both approaches can only be guaranteed to work under rather restrictive conditions: when they can correctly reproduce the biases and the ordering of the variances of estimators. We showed simple practical cases where one or other of these might fail, and gave an example of the consequences based on simulations from an artificial DGP. To provide a real-world example, we also implemented the EMCS procedures in the NSW–CPS data, where we know the ‘true effect’ of the programme. This allowed us to compute actual performance of the estimators in samples from the original data, and compare this to what EMCS would suggest if applied to these samples. We showed that in this example EMCS performs badly on ordering estimators in terms of absolute bias, and the estimator it suggests is often many times worse than the best (or even than selecting randomly). In this example both EMCS procedures perform much better in terms of MSE because reproducing the variance term turns out to drive the MSE in these data. However, this leaves the methods no better (and sometimes substantially worse) than a simple bootstrap procedure.
These results are unfortunate, but nevertheless important. There remains no silver bullet that can assist empirical researchers with the ‘right’ or ‘best’ estimator for a particular context. In the absence of a clear choice driven by research design, the best advice at this stage is likely to be implementing a number of estimators, and then considering the range of estimates provided, as Busso et al. (2014) also suggest.
One possible future alternative, recently proposed, is synth-validation (Schuler et al., 2017). This approach is related to cross-validation and is based on estimating ‘the estimation error of causal inference methods applied to a given dataset’. The authors provide simulations which suggest that this ‘lowers the expected estimation error relative to consistently using any single method’. Further work is needed to test how general this approach is, and whether it can reliably guide researchers in selecting estimators.
References
 Abadie et al. (2004) Abadie, A., D. Drukker, J. L. Herr, and G. W. Imbens (2004): “Implementing Matching Estimators for Average Treatment Effects in Stata,” Stata Journal, 4, 290–311.
 Abadie and Imbens (2006) Abadie, A. and G. W. Imbens (2006): “Large Sample Properties of Matching Estimators for Average Treatment Effects,” Econometrica, 74, 235–267.
 Abadie and Imbens (2011) ——— (2011): “Bias-Corrected Matching Estimators for Average Treatment Effects,” Journal of Business & Economic Statistics, 29, 1–11.
 Abadie and Imbens (2016) ——— (2016): “Matching on the Estimated Propensity Score,” Econometrica, 84, 781–807.
 Angrist and Krueger (1999) Angrist, J. D. and A. B. Krueger (1999): “Empirical Strategies in Labor Economics,” in Handbook of Labor Economics, ed. by O. C. Ashenfelter and D. Card, Elsevier, vol. 3, 1277–1366.
 Austin (2010) Austin, P. C. (2010): “The Performance of Different Propensity-Score Methods for Estimating Differences in Proportions (Risk Differences or Absolute Risk Reductions) in Observational Studies,” Statistics in Medicine, 29, 2137–2148.
 Bertrand et al. (2004) Bertrand, M., E. Duflo, and S. Mullainathan (2004): “How Much Should We Trust Differences-in-Differences Estimates?” Quarterly Journal of Economics, 119, 249–275.
 Blundell and Costa Dias (2009) Blundell, R. and M. Costa Dias (2009): “Alternative Approaches to Evaluation in Empirical Microeconomics,” Journal of Human Resources, 44, 565–640.
 Bodory et al. (2018) Bodory, H., L. Camponovo, M. Huber, and M. Lechner (2018): “The Finite Sample Performance of Inference Methods for Propensity Score Matching and Weighting Estimators,” Journal of Business & Economic Statistics, forthcoming.
 Brewer et al. (2018) Brewer, M., T. F. Crossley, and R. Joyce (2018): “Inference with Difference-in-Differences Revisited,” Journal of Econometric Methods, 7.
 Busso et al. (2009) Busso, M., J. DiNardo, and J. McCrary (2009): “Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects,” Unpublished.
 Busso et al. (2014) ——— (2014): “New Evidence on the Finite Sample Properties of Propensity Score Reweighting and Matching Estimators,” Review of Economics and Statistics, 96, 885–897.
 Calónico and Smith (2017) Calónico, S. and J. Smith (2017): “The Women of the National Supported Work Demonstration,” Journal of Labor Economics, 35, S65–S97.
 Cameron et al. (2008) Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008): “Bootstrap-Based Improvements for Inference with Clustered Errors,” Review of Economics and Statistics, 90, 414–427.
 Dehejia and Wahba (1999) Dehejia, R. H. and S. Wahba (1999): “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs,” Journal of the American Statistical Association, 94, 1053–1062.
 Dehejia and Wahba (2002) ——— (2002): “Propensity Score-Matching Methods for Nonexperimental Causal Studies,” Review of Economics and Statistics, 84, 151–161.
 Diamond and Sekhon (2013) Diamond, A. and J. S. Sekhon (2013): “Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies,” Review of Economics and Statistics, 95, 932–945.
 Díaz et al. (2015) Díaz, J., T. Rau, and J. Rivera (2015): “A Matching Estimator Based on a Bilevel Optimization Problem,” Review of Economics and Statistics, 97, 803–812.
 Frölich (2004) Frölich, M. (2004): “Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators,” Review of Economics and Statistics, 86, 77–90.
 Frölich et al. (2017) Frölich, M., M. Huber, and M. Wiesenfarth (2017): “The Finite Sample Performance of Semi- and Nonparametric Estimators for Treatment Effects and Policy Evaluation,” Computational Statistics & Data Analysis, 115, 91–102.
 Graham et al. (2012) Graham, B. S., C. Campos de Xavier Pinto, and D. Egel (2012): “Inverse Probability Tilting for Moment Condition Models with Missing Data,” Review of Economic Studies, 79, 1053–1079.
 Graham et al. (2016) ——— (2016): “Efficient Estimation of Data Combination Models by the Method of Auxiliary-to-Study Tilting (AST),” Journal of Business & Economic Statistics, 31, 288–301.
 Hahn (1998) Hahn, J. (1998): “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects,” Econometrica, 66, 315–331.
 Hansen (2007) Hansen, C. B. (2007): “Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects,” Journal of Econometrics, 140, 670–694.
 Heckman and Hotz (1989) Heckman, J. J. and V. J. Hotz (1989): “Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training,” Journal of the American Statistical Association, 84, 862–874.
 Hirano et al. (2003) Hirano, K., G. W. Imbens, and G. Ridder (2003): “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71, 1161–1189.
 Huber et al. (2016) Huber, M., M. Lechner, and G. Mellace (2016): “The Finite Sample Performance of Estimators for Mediation Analysis Under Sequential Conditional Independence,” Journal of Business & Economic Statistics, 34, 139–160.
 Huber et al. (2013) Huber, M., M. Lechner, and C. Wunsch (2013): “The Performance of Estimators Based on the Propensity Score,” Journal of Econometrics, 175, 1–21.
 Imbens and Wooldridge (2009) Imbens, G. W. and J. M. Wooldridge (2009): “Recent Developments in the Econometrics of Program Evaluation,” Journal of Economic Literature, 47, 5–86.
 Jann (2008) Jann, B. (2008): “The Blinder–Oaxaca Decomposition for Linear Regression Models,” Stata Journal, 8, 453–479.
 Khwaja et al. (2011) Khwaja, A., G. Picone, M. Salm, and J. G. Trogdon (2011): “A Comparison of Treatment Effects Estimators Using a Structural Model of AMI Treatment Choices and Severity of Illness Information from Hospital Charts,” Journal of Applied Econometrics, 26, 825–853.
 Kline (2011) Kline, P. (2011): “Oaxaca-Blinder as a Reweighting Estimator,” American Economic Review: Papers & Proceedings, 101, 532–537.
 LaLonde (1986) LaLonde, R. J. (1986): “Evaluating the Econometric Evaluations of Training Programs with Experimental Data,” American Economic Review, 76, 604–620.
 Lechner and Strittmatter (2016) Lechner, M. and A. Strittmatter (2016): “Practical Procedures to Deal with Common Support Problems in Matching Estimation,” Econometric Reviews, forthcoming.
 Lechner and Wunsch (2013) Lechner, M. and C. Wunsch (2013): “Sensitivity of Matching-Based Program Evaluations to the Availability of Control Variables,” Labour Economics, 21, 111–121.
 Lee (2013) Lee, W.S. (2013): “Propensity Score Matching and Variations on the Balancing Test,” Empirical Economics, 44, 47–80.
 Leuven and Sianesi (2003) Leuven, E. and B. Sianesi (2003): “PSMATCH2: Stata Module to Perform Full Mahalanobis and Propensity Score Matching, Common Support Graphing, and Covariate Imbalance Testing,” This version 4.0.6.
 Lunceford and Davidian (2004) Lunceford, J. K. and M. Davidian (2004): “Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study,” Statistics in Medicine, 23, 2937–2960.
 Millimet and Tchernis (2009) Millimet, D. L. and R. Tchernis (2009): “On the Specification of Propensity Scores, with Applications to the Analysis of Trade Policies,” Journal of Business & Economic Statistics, 27, 397–415.
 Schuler et al. (2017) Schuler, A., K. Jung, R. Tibshirani, T. Hastie, and N. Shah (2017): “Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset,” Unpublished.
 Słoczyński (2017) Słoczyński, T. (2017): “A General Weighted Average Representation of the Ordinary and Two-Stage Least Squares Estimands,” Unpublished.
 Słoczyński and Wooldridge (2018) Słoczyński, T. and J. M. Wooldridge (2018): “A General Double Robustness Result for Estimating Average Treatment Effects,” Econometric Theory, 34, 112–133.
 Smith and Todd (2001) Smith, J. A. and P. E. Todd (2001): “Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods,” American Economic Review: Papers & Proceedings, 91, 112–118.
 Smith and Todd (2005) ——— (2005): “Does Matching Overcome LaLonde’s Critique of Nonexperimental Estimators?” Journal of Econometrics, 125, 305–353.
 Wooldridge (2007) Wooldridge, J. M. (2007): “Inverse Probability Weighted Estimation for General Missing Data Problems,” Journal of Econometrics, 141, 1281–1301.
 Zhao (2004) Zhao, Z. (2004): “Using Matching to Estimate Treatment Effects: Data Requirements, Matching Metrics, and Monte Carlo Evidence,” Review of Economics and Statistics, 86, 91–107.
 Zhao (2008) ——— (2008): “Sensitivity of Propensity Score Methods to the Specifications,” Economics Letters, 98, 309–319.
Appendix
Appendix A Theory
Derivations of (8) and (9): A general expression of SEB for ATT in the absence of knowledge of the propensity score is given by
Plugging the current specifications for and and noting for all , the expression of (8) follows.
By the partialling out argument of the least squares regression and the linear probability specification of the propensity score, the asymptotic variance of can be written as
where the third line follows from Bayes rule applied to each denominator and numerator.
Appendix B Stylised Simulations: Design
Here we provide further details on the parameters and procedures for the stylised simulations described in Subsection 3.2.
B.1 Details of Simulation for Scenario 2
For each sample we generate 1,000 observations, and for each observation draw a covariate from a truncated standard normal distribution with the left truncation point at –4 and the right truncation point at 6.
The propensity score is then constructed as
(16) 
For each observation we draw a random number from a standard uniform distribution, and assign treated status,
, if exceeds that random number. We next generate an unobservable
drawn from a normal distribution with mean zero. Since Scenario 2 is the heteroskedastic case, the standard deviation for those not treated is
, while for those who are treated it is . [Footnote 28: In the benchmark case (Scenario 1), mentioned at the end of Subsubsection 3.2.1, .] Finally, the outcome is generated as
(17) 
and hence ATT is equal to .5.
This completes the generation of a Scenario 2 sample, which can then be used to implement the two EMCS procedures described in Section 2. For each EMCS design, we consider 1,000 samples and 1,000 replications per sample.
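The sample-generation steps above can be sketched in Python with NumPy. Equations (16) and (17) are not reproduced in this text, so the propensity score coefficients (`a`, `b`), the two standard deviations (`sd0`, `sd1`) and the form of the outcome equation are illustrative assumptions. They are chosen only so that the propensity score is linear in the covariate, stays within [0, 1] on the truncated support, and gives an ATT of .5.

```python
import numpy as np

def draw_truncated_normal(rng, n, low=-4.0, high=6.0):
    """Draw n standard normal values truncated to [low, high] by rejection."""
    out = np.empty(0)
    while out.size < n:
        z = rng.standard_normal(n)
        out = np.concatenate([out, z[(z >= low) & (z <= high)]])
    return out[:n]

def generate_scenario2_sample(rng, n=1000, a=0.4, b=0.1, sd0=0.5, sd1=1.0, tau=0.5):
    """Generate one heteroskedastic sample.

    ASSUMED placeholders: a, b, sd0, sd1 and the outcome equation below are
    not taken from equations (16) and (17); only n, the truncation points
    and ATT = .5 come from the text.
    """
    x = draw_truncated_normal(rng, n)
    p = a + b * x                                   # linear propensity score
    d = (p > rng.uniform(size=n)).astype(int)       # treated if p exceeds a U(0,1) draw
    u = rng.standard_normal(n) * np.where(d == 1, sd1, sd0)  # group-specific SD
    y = tau * d + x + u                             # constant effect, so ATT = tau
    return x, d, y, p

rng = np.random.default_rng(0)
x, d, y, p = generate_scenario2_sample(rng)
```

Rejection sampling is used for the truncated normal only to keep the sketch dependency-free; any truncated normal sampler would serve the same purpose.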
In the placebo design, we additionally require some choice of and , where determines the degree of covariate overlap between the ‘placebo treated’ and ‘placebo control’ observations and determines the proportion of the ‘placebo treated’. We choose to ensure that the proportion of the ‘placebo treated’ observations in each placebo EMCS replication is equal to the proportion of treated units in the sample. We follow Huber et al. (2013) in choosing . We also use a linear model to estimate the propensity score, as this corresponds to the true model in equation (16).
In the structured design, we first estimate the mean and variance of in a given sample, conditional on treatment status. We also regress on and , excluding the interaction of and . Next, in the simulated dataset, is drawn from a normal distribution with mean and variance conditional on treatment status and equal to the estimates above. Whenever the draw of lies outside the support observed in the data, conditional on treatment status, the observation is replaced with the limit point of the support. Finally, the simulated outcome, , is generated in two steps. In the first step, we calculate its conditional mean based on the estimated coefficients from the regression above. In the second step, the simulated outcome is determined as a draw from a normal distribution with the conditional mean determined above and the variance that is equal to the variance of the residuals in the regression model estimated on the original data. [Footnote 29: Thus, by using a single value of variance for both treated and control units, we fail to account for heteroskedasticity of the potential outcome equations. This is the source of misspecification of the structured design in Scenario 2.] Again, we replace extreme values with the limit of the support, conditional on treatment status.
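A minimal sketch of one structured-design replication, following the steps in the paragraph above, might look as follows. The function name and the choice to hold each unit's treatment status at its sample value are assumptions made for illustration; the text here does not spell out how treatment status is assigned in the stylised simulations.

```python
import numpy as np

def structured_replication(x, d, y, rng):
    """One structured-design replication from an original sample (x, d, y)."""
    n = len(y)
    # Regress y on a constant, d and x (no interaction term).
    design = np.column_stack([np.ones(n), d, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    # A single residual SD is used for both groups, as described in the text.
    resid_sd = np.std(y - design @ beta, ddof=design.shape[1])

    d_sim = d.copy()  # ASSUMPTION: treatment status is held at its sample values
    x_sim = np.empty(n)
    y_sim = np.empty(n)
    for g in (0, 1):
        m = d == g
        # Covariate: normal with group-specific mean and variance, clipped to the group support.
        xg = rng.normal(x[m].mean(), x[m].std(ddof=1), size=m.sum())
        x_sim[m] = np.clip(xg, x[m].min(), x[m].max())
        # Outcome: fitted conditional mean plus normal noise, clipped to the group support.
        mean_g = beta[0] + beta[1] * g + beta[2] * x_sim[m]
        y_sim[m] = np.clip(rng.normal(mean_g, resid_sd), y[m].min(), y[m].max())
    return x_sim, d_sim, y_sim, beta[1]  # beta[1]: effect implied by the fitted model

# Illustrative original sample (hypothetical data, not the paper's DGP).
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
d = (rng.uniform(size=1000) < 0.4).astype(int)
y = 0.5 * d + x + rng.standard_normal(1000)
x_sim, d_sim, y_sim, att_model = structured_replication(x, d, y, rng)
```

Because the fitted model has no interaction term, the coefficient on the treatment indicator is the effect implied by the model, which is the natural benchmark for bias in this design.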
We use two estimators in our stylised simulations: linear regression (OLS) and inverse probability weighting (IPW). In the latter case, we first estimate the propensity score using a linear model, as this corresponds to the true model in equation (16), and then use inverse weighting with normalised weights to estimate the ATT.
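The two estimators can be sketched as below. This is an illustration under stated assumptions rather than the code used for the reported results: the function names are hypothetical, the simulated data are made up, and the clipping of fitted propensity scores away from 0 and 1 is a safeguard added here that the text does not mention.

```python
import numpy as np

def ols_att(x, d, y):
    """Coefficient on d from a regression of y on a constant, d and x."""
    design = np.column_stack([np.ones(len(y)), d, x])
    return float(np.linalg.lstsq(design, y, rcond=None)[0][1])

def ipw_att(x, d, y, eps=1e-6):
    """Normalised IPW estimate of the ATT with a linear propensity score model."""
    design = np.column_stack([np.ones(len(y)), x])
    p_hat = design @ np.linalg.lstsq(design, d, rcond=None)[0]
    p_hat = np.clip(p_hat, eps, 1 - eps)      # ASSUMPTION: guard against fitted values outside (0, 1)
    w = p_hat[d == 0] / (1 - p_hat[d == 0])   # odds weights applied to control units
    # Normalised weights: control weights are rescaled to sum to one.
    return float(y[d == 1].mean() - np.sum(w * y[d == 0]) / np.sum(w))

# Hypothetical data with a constant effect of 0.5 and a linear propensity score.
rng = np.random.default_rng(0)
n = 20000
x = np.clip(rng.standard_normal(n), -4, 6)
d = (0.4 + 0.1 * x > rng.uniform(size=n)).astype(int)
y = 0.5 * d + x + rng.standard_normal(n)
att_ols, att_ipw = ols_att(x, d, y), ipw_att(x, d, y)
```

Dividing by the sum of the control weights, rather than by the number of treated units, is what makes the weights normalised.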
B.2 Details of Simulation for Scenario 3
A similar procedure to that detailed in the previous subsection is followed. Two changes are made. First, we now have homoskedasticity so . Second, in each sample, we now generate the outcome as
(18) 
and hence ATT is equal to . [Footnote 30: In practice, we estimate using the mean of for the treated observations in 1,000 samples from the true data generating process. As a result, ATT is equal to (approximately) .625.]
The source of misspecification of the structured design in Scenario 3 is in its failure to account for the interaction of and when generating the simulated outcomes.
Appendix C Stylised Simulations: Detailed Results
| | Absolute bias | RMSE | SD |
|---|---|---|---|
| Original samples | | | |
| IPW | .000 | .034 | .034 |
| OLS | .000 | .032 | .032 |
| Placebo | | | |
| IPW | .002 | .044 | .044 |
| | (.001) | (.002) | (.002) |
| OLS | .001 | .042 | .042 |
| | (.001) | (.002) | (.002) |
| Structured | | | |
| IPW | .007 | .035 | .034 |
| | (.005) | (.002) | (.001) |
| OLS | .001 | .033 | .033 |
| | (.001) | (.001) | (.001) |

Notes: Results for ‘Original samples’ correspond to the true values of all features of interest (absolute bias, RMSE, and SD) in the original data generating process. Measures of absolute bias and RMSE are centred around the true value of ATT, reported in Appendix B. All calculations are based on 1,000 samples. For each of these 1,000 samples, ‘Placebo’ and ‘Structured’ generate 1,000 new replications using the placebo and structured approaches described in Section 2. In each case, we report both the mean and the standard deviation (in brackets) of EMCS estimates of all features of interest across all replications. Estimates of absolute bias and RMSE are centred around 0 for placebo and around the model-implied value for structured. For ease of interpretation, RMSE and SD are reported instead of MSE and variance (as elsewhere in the paper).
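For readers who wish to reproduce summaries of this kind, the three reported features can be computed from a vector of replication estimates as in the sketch below. The function name and the example numbers are hypothetical. The standard deviation uses the population form (divisor equal to the number of replications), an assumption made here so that the decomposition of the squared RMSE into squared bias plus variance holds exactly.

```python
import numpy as np

def summarise(estimates, benchmark):
    """Absolute bias, RMSE and SD of replication estimates around a benchmark value."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - benchmark
    sd = estimates.std()  # population form, so rmse**2 == bias**2 + sd**2
    rmse = np.sqrt(np.mean((estimates - benchmark) ** 2))
    return abs(bias), rmse, sd

# Hypothetical replication estimates; the benchmark would be the true ATT,
# 0 for the placebo design, or the model-implied value for the structured design.
rng = np.random.default_rng(0)
abs_bias, rmse, sd = summarise(rng.normal(0.55, 0.04, size=1000), benchmark=0.5)
```

The same decomposition explains why RMSE and SD coincide to three decimal places whenever the absolute bias is close to zero.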
| | Absolute bias | RMSE | SD |
|---|---|---|---|
| Original samples | | | |
| IPW | .003 | .079 | .079 |
| OLS | .002 | .080 | .080 |
| Placebo | | | |
| IPW | .002 | .044 | .044 |
| | (.001) | (.002) | (.002) |
| OLS | .001 | .042 | .042 |
| | (.001) | (.002) | (.002) |
| Structured | | | |
| IPW | .016 | .070 | .067 |
| | (.012) | (.005) | (.003) |
| OLS | .010 | .067 | .066 |
| | (.008) | (.003) | (.003) |

Notes: Results for ‘Original samples’ correspond to the true values of all features of interest (absolute bias, RMSE, and SD) in the original data generating process. Measures of absolute bias and RMSE are centred around the true value of ATT, reported in Appendix B. All calculations are based on 1,000 samples. For each of these 1,000 samples, ‘Placebo’ and ‘Structured’ generate 1,000 new replications using the placebo and structured approaches described in Section 2. In each case, we report both the mean and the standard deviation (in brackets) of EMCS estimates of all features of interest across all replications. Estimates of absolute bias and RMSE are centred around 0 for placebo and around the model-implied value for structured. For ease of interpretation, RMSE and SD are reported instead of MSE and variance (as elsewhere in the paper).
| | Absolute bias | RMSE | SD |
|---|---|---|---|
| Original samples | | | |
| IPW | .001 | .044 | .044 |
| OLS | .081 | .089 | .037 |
| Placebo | | | |
| IPW | .002 | .043 | .043 |
| | (.001) | (.002) | (.002) |
| OLS | .001 | .042 | .042 |
| | (.001) | (.002) | (.002) |
| Structured | | | |
| IPW | .011 | .040 | .038 |
| | (.009) | (.004) | (.001) |
| OLS | .003 | .037 | .036 |
| | (.003) | (.001) | (.001) |

Notes: Results for ‘Original samples’ correspond to the true values of all features of interest (absolute bias, RMSE, and SD) in the original data generating process. Measures of absolute bias and RMSE are centred around the true value of ATT, reported in Appendix B. All calculations are based on 1,000 samples. For each of these 1,000 samples, ‘Placebo’ and ‘Structured’ generate 1,000 new replications using the placebo and structured approaches described in Section 2. In each case, we report both the mean and the standard deviation (in brackets) of EMCS estimates of all features of interest across all replications. Estimates of absolute bias and RMSE are centred around 0 for placebo and around the model-implied value for structured. For ease of interpretation, RMSE and SD are reported instead of MSE and variance (as elsewhere in the paper).
Appendix D Empirical Application: Estimators
We use seven estimators in our empirical application.

Linear regression (OLS).

Oaxaca–Blinder – we follow Kline (2011) in using the Oaxaca–Blinder decomposition to estimate the ATT.

Inverse probability weighting (IPW) – we first estimate the propensity score using a logit model, and then use inverse weighting with normalised weights to estimate the ATT.

Doubly-robust regression – as in Wooldridge (2007) and Słoczyński and Wooldridge (2018), we use the inverse-probability-weighted regression-adjustment (IPWRA) estimator. This is effectively a combination of the two estimators above, IPW and Oaxaca–Blinder. It satisfies the double robustness property.

Uniform kernel matching – we first estimate the propensity score using a logit model, and then match on propensity scores using a uniform kernel. We select the bandwidth on the basis of leave-one-out cross-validation (as in Busso et al. 2009 and Huber et al. 2013), using a search grid for . The computational time of doing this for each replication is prohibitive. Consequently, we calculate this once for each original sample, and use the recovered optimal bandwidth in all EMCS replications for that sample.

Nearest neighbour matching – nearest neighbour matching on propensity scores, which are first estimated from a logit regression, with matching on the single nearest neighbour. We match with replacement; if there are ties, all of the tied observations are used.

Bias-adjusted nearest neighbour matching – as above, but correcting bias as in Abadie and Imbens (2011), since nearest neighbour matching is not consistent.
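As an illustration of the matching rule used by the nearest neighbour estimator in the list above (single nearest neighbour, with replacement, using all tied observations), a minimal Python sketch is given below. It takes fitted propensity scores as given, so the logit step is omitted. The function name, the tie tolerance `tol` and the toy data are assumptions for illustration, and no bias adjustment is applied.

```python
import numpy as np

def nn_match_att(p_hat, d, y, tol=1e-12):
    """ATT by single nearest neighbour matching on the propensity score.

    Matching is with replacement; all controls tied at the minimum distance
    are used, and their outcomes are averaged.
    """
    p_hat, d, y = map(np.asarray, (p_hat, d, y))
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    effects = []
    for i in treated:
        dist = np.abs(p_hat[controls] - p_hat[i])
        tied = controls[dist <= dist.min() + tol]  # every control at the minimum distance
        effects.append(y[i] - y[tied].mean())
    return float(np.mean(effects))

# Toy example: the treated unit has two equally close controls.
att = nn_match_att(p_hat=[0.5, 0.25, 0.75, 0.875],
                   d=[1, 0, 0, 0],
                   y=[10.0, 2.0, 4.0, 100.0])
```

In the toy example the two tied controls have outcomes 2 and 4, so the matched counterfactual is their average and the distant control plays no role.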
Appendix E Empirical Application: Structured EMCS Procedure
Here we detail precisely the procedure followed to implement the structured EMCS in our empirical application. As noted previously, we begin each structured EMCS replication by generating a fixed number of treated and non-treated observations to match the number in the sample. We then draw an employment status pair of u74 and u75 (nonemployment in months 13–24 prior to randomization and nonemployment in 1975), conditional on treatment status, to match the observed conditional joint probability. For individuals who are employed in only one period, an income is drawn from a log normal distribution conditional on treatment and employment statuses, with mean and variance calibrated to the respective conditional moments in the data. Where individuals are employed in both periods a joint log normal distribution is used, again conditioning on treatment status. In all cases, whenever the income draw in a particular year lies outside the relevant support observed in the data, conditional on treatment status, the observation is replaced with the limit point of the empirical support, as also suggested by Busso et al. (2014).
We model the joint distribution of the remaining control variables as a particular tree-structured conditional probability distribution, so that we can better fit the correlation structure in the data. The process for generating these covariates is as follows:
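The first two steps of this paragraph can be sketched as below for a single treatment group. The function names and the example data are hypothetical, and the log-normal distribution is calibrated here on the mean and variance of log income, which is one possible reading of the calibration described above. The joint log-normal case for individuals employed in both periods is omitted.

```python
import numpy as np

def draw_employment_pairs(rng, u74, u75, n_draws):
    """Draw (u74, u75) pairs with the empirical joint probabilities of one treatment group."""
    pairs, counts = np.unique(np.column_stack([u74, u75]), axis=0, return_counts=True)
    idx = rng.choice(len(pairs), size=n_draws, p=counts / counts.sum())
    return pairs[idx]

def draw_lognormal_income(rng, income, n_draws):
    """Draw incomes from a log-normal fitted to observed positive incomes.

    ASSUMPTION: the distribution is calibrated on the moments of log income.
    Draws outside the observed range are replaced by the nearest limit point.
    """
    logs = np.log(income)
    draws = np.exp(rng.normal(logs.mean(), logs.std(ddof=1), size=n_draws))
    return np.clip(draws, income.min(), income.max())

# Hypothetical data for one treatment group.
rng = np.random.default_rng(0)
u74 = rng.integers(0, 2, size=200)
u75 = rng.integers(0, 2, size=200)
income = rng.lognormal(mean=8.0, sigma=1.0, size=120)
pairs_sim = draw_employment_pairs(rng, u74, u75, n_draws=200)
income_sim = draw_lognormal_income(rng, income, n_draws=50)
```

In a full implementation these functions would be applied separately to treated and non-treated observations, and to each employment status, so that all draws are conditional as described.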

1. The covariates are ordered: treatment status, employment statuses, income in each period, whether a high school dropout (nodegree), education (educ), age, whether married, whether black, and whether Hispanic. This ordering is arbitrary, and a similar correlation structure would be generated if the ordering were changed.

2. Using the sample on which the EMCS is being performed, each covariate from nodegree onward is regressed on all the covariates listed before it (we use the logit model for binary variables). [Footnote 31: One exception is educ, which is regressed on the prior listed covariates conditional on nodegree. Clearly, it is not possible for a high school dropout to have twelve years of schooling or more; it is also not possible for a non-dropout to have less than twelve years of schooling.] These regressions are not to be interpreted causally; they simply give the conditional mean of each variable given all preceding covariates.

3. In the simulated dataset, covariates are drawn sequentially in the same order. For binary covariates a temporary value is drawn from a standard uniform distribution. Then the covariate is equal to one if the temporary value is less than the conditional probability for that observation. The conditional probability is found using the values of the existing generated covariates and the estimated coefficients from step 2. Age and education are drawn from a normal distribution whose mean depends on the other covariates and whose variance is equal to that of the residuals from the relevant model. Again, we replace extreme values with the limit of the support, conditional on treatment status (for education, also conditional on dropout status).
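The sequential drawing of covariates can be illustrated with the sketch below, which covers one binary and one continuous covariate. All coefficients, residual standard deviations and support limits are hypothetical stand-ins for the step 2 estimates and the observed supports, and the function names are introduced here only for illustration.

```python
import numpy as np

def draw_binary(rng, design, coef):
    """Set the covariate to 1 when a U(0,1) draw is below the fitted logit probability."""
    prob = 1.0 / (1.0 + np.exp(-(design @ coef)))
    return (rng.uniform(size=len(prob)) < prob).astype(int)

def draw_continuous(rng, design, coef, resid_sd, low, high):
    """Draw from a normal around the fitted mean, clipped to the support [low, high]."""
    return np.clip(rng.normal(design @ coef, resid_sd), low, high)

rng = np.random.default_rng(1)
n = 500
treat = (rng.uniform(size=n) < 0.4).astype(int)
const = np.ones(n)

# HYPOTHETICAL coefficients standing in for the estimates from step 2.
nodegree = draw_binary(rng, np.column_stack([const, treat]), np.array([0.5, 0.3]))

# Each new covariate conditions on everything generated before it.
# HYPOTHETICAL support limits, which differ by treatment status.
low = np.where(treat == 1, 17.0, 16.0)
high = np.where(treat == 1, 49.0, 55.0)
age = draw_continuous(rng, np.column_stack([const, treat, nodegree]),
                      np.array([26.0, -1.0, -0.5]), resid_sd=7.0, low=low, high=high)
```

Passing arrays as the clipping limits lets the support depend on treatment status observation by observation, which mirrors the conditioning described in step 3.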
The outcome studied is earnings in 1978, re78. The simulated outcome, for individual , is then generated in two steps. In the first step, we generate a conditional mean using the parameters of a flexible linear model fitted to the sampled data. Precisely, we estimate from the following linear model:
(19) 
The predicted conditional mean in the replication is then calculated using the estimated coefficients from above, and the simulated treatment status and covariates, and . In the second step, the simulated outcome, , is determined as a draw from a normal distribution with the estimated conditional mean and the variance that is fitted to that of the residuals from the model in equation (19), conditional on treatment status. Once again, we replace extreme values of re78 with the limit point of the support, also conditional on treatment status. ‘True effects’ in each replication, , are calculated using the conditional means for both treatment statuses, and the difference in conditional means, i.e. the individual-level treatment effect, is averaged over the subsample of treated units. [Footnote 32: Thus, we implicitly focus on the sample average treatment effect on the treated (SATT), not on the population average treatment effect on the treated (ATT). Both of these measures can be used as the benchmark effect in simulations and we have no particular preference for either.]
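The calculation of the ‘true effect’ in a replication can be sketched as follows. Equation (19) is not reproduced in this text, so the specification in `design_matrix` (a constant, the treatment indicator, the covariates and their interactions with treatment) is a hypothetical stand-in for the flexible linear model, and the coefficient values are made up for the example.

```python
import numpy as np

def design_matrix(d, x):
    """HYPOTHETICAL specification: constant, d, x and d*x (not equation (19) itself)."""
    d = np.asarray(d, dtype=float)[:, None]
    x = np.asarray(x, dtype=float)
    return np.hstack([np.ones_like(d), d, x, d * x])

def satt_from_model(beta, d_sim, x_sim):
    """Average the difference in fitted conditional means over simulated treated units."""
    d_sim = np.asarray(d_sim)
    n = len(d_sim)
    mean1 = design_matrix(np.ones(n), x_sim) @ beta   # conditional mean if treated
    mean0 = design_matrix(np.zeros(n), x_sim) @ beta  # conditional mean if not treated
    return float(np.mean((mean1 - mean0)[d_sim == 1]))

# Toy example with one covariate: the individual-level effect is 2 + 1.0 * x.
beta = np.array([1.0, 2.0, 0.5, 1.0])
x_sim = np.array([[0.0], [1.0], [2.0], [3.0]])
d_sim = np.array([0, 1, 1, 0])
satt = satt_from_model(beta, d_sim, x_sim)
```

In the toy example the treated units have covariate values 1 and 2, so their individual-level effects are 3 and 4 and the benchmark effect is their average.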