1 Introduction
When researchers in econometrics develop new econometric methods it is common practice to assess the performance of the new methods in Monte Carlo studies. In such Monte Carlo studies, artificial data are often generated using very simple distributions with a high degree of smoothness and limited dependence between different variables. The performance of the new methods in those settings may not be indicative of the performance in realistic settings where many variables have mixed discretecontinuous distributions, sometimes with long tails and outliers, and correlation patterns are complex. For a recent discussion of these issues, see
Advani et al. (2019). In this paper we discuss how Generative Adversial Nets (GANs, Goodfellow et al. (2014)), and in particular GANs minimizing the Wasserstein distance (WGANs, Arjovsky et al. (2017)) can be used to systematically generate data that closely mimic actual data sets. Given an actual data set these methods allow researchers to systematically assess the performance of new methods in a realistic setting. Moreoever, it would preempt concerns that researchers chose particular simulation designs to favor their proposed methods.An important feature of many econometric methods is that they focus on causal effects, rather than predictions. This means that in general we need to have more than simply a single real data set to evaluate new methods: we need methods that allow us to generate large amounts of data beyond any given sample to establish the ground truth. This is where the WGANs can be very helpful.
We apply these ideas to a classic data set in the program evaluation literature, originally put together by LaLonde (1986). We use the specific sample subsequently recovered by Dehejia and Wahba (1999) that is available online on Dehejia’s website. We refer to this as the LalondeDehejiaWahba (LDW) data set. The LDW data set has some special features that make it a challenging setting for estimators for average treatment effects under unconfoundedness. That makes it an attractive starting point for comparing different estimators. First we demonstrate how WGANs can generate artificial data in this setting. Second, we discuss how similar the generated data are to the actual sample. Third, we assess the properties of a set of estimators. Finally, we evaluate the robustness of these results to sampling variation. Our approach to robustness relies on starting with a very large dataset, and repeating our entire analysis treating different samples from the large dataset as the “real” dataset. The very large dataset is generated using the WGAN that was based on our original real dataset.
We use three specific samples created from the LDW data to create a range of settings. First, in what we call the LDWE (experimental sample), we use the data from the original experiment. Second, LDWCPS (observational sample) contains the experimental treated units and data on individuals from the CPS comparison sample. Third, in the LDWPSID (observational sample) we use the experimental treated units and data from the PSID comparison sample. In our analysis we compare the performance of twelve estimators for the average effect for the treated proposed in the literature. As a baseline case we use the difference in means by treatment status. In addition we consider a number of more sophisticated estimators, all based on the assumption of unconfounded treatment assignment. Some of these use flexible estimators of the conditional outcome means, of the propensity score, or of both. These estimators are based on generalized linear models, random forests, or neural nets.
2 Wasserstein Generative Adversarial Networks
Suppose we have a sample of observations on
component vectors,
, drawn randomly from a distribution with cumulative distribution function
, density and domain . We are interested in drawing new samples that are similar to samples drawn from this distribution.2.1 Conventional Alternatives
A conventional approach in econometrics is to estimate the distribution
using kernel density estimation (
Silverman (2018); Härdle (1990)). Given a bandwidth and a kernel , the standard kernel density estimator isConventional kernels include the Gaussian kernel and the Epanechnikov kernel. Bandwidth choices have been proposed to minimize squarederror loss. These methods tend to perform poorly especially in highdimensional settings. The estimated densities tend to oversmooth and look substantially different from the actual ones.
An alternative approach to generate new samples that are similar to an existing sample is the bootstrap (Efron (1982); Efron and Tibshirani (1994)), which estimates the cumulative distribution function as
The major reason this does not work for our purposes is that it cannot generate observations that differ from those seen in the sample and that the sampled data contains an unrealistic amount of identical data points.
2.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs) are a recently developed an alternative approach to generating data that are similar to a particular sample (Goodfellow et al. (2014); Arjovsky and Bottou (2017)). GANs can be thought of as implicitly estimating the distribution, although they do not directly produce estimates of the density or distribution function at a particular point. GANs have two components. The first is a generator
that is a flexible parametric model for the distribution of data. The second is a
discriminatorthat classifies observations as coming from the training sample rather than from the generator. If the generator performs well, the discriminator cannot distinguish between the training data and the artificial data. The training procedure for the generative model then tries to maximize the probability of the discriminator making a mistake. This can be viewed as a twoplayer minimax game. In typical applications, and in our application, both the generative model and the discriminator are multilayer neural nets, but in principle they could be simpler models. GANs have recently become popular in the machine learning literature, but have not received much attention in the econometrics literature. An exception is
Kaji et al. (2019) who use WGANs for estimation of structural models.The starting point is a random noise distribution, , with domain
, selected by the researcher. This can be a multivariate uniform or Gaussian distribution. Second, we have the generator, a mapping from
to the data space with parameter . Call this mapping . This may be a multilayer neural net but could be a much simpler model. For a given parameter the generator, in combination with the distribution of the inputs implies a distribution for . We denote this distribution by . Next, we have a second model, the discriminator with parameter . The output of the generator is the probability that came from the real data rather than from the generator. Again, typically the discriminator is a multilayer neural net, but it could be a very simple model. We train the discriminator, using samples from the training data and samples drawn from , to maximize the probability of assigning the correct label to the generated and the real data. At the same time we train the generator to minimize the the logarithm of the probability of assigning the generated data to the correct label, that is we minimize the logarithm of for observations from the generated data.2.3 A Simple Example
To fix ideas, let us consider a GAN in a very simple example where both models, the generator and the discriminator, are very parsimonious parametric models rather than multilayer neural nets. We have a random sample , the “real” data, with
observations. Suppose the distribution of the scalar input is a Normal distribution with mean zero and unit variance,
Suppose also that the generator is an additive shift:
This implies a model for the generated data that is :
This is of course a very simple model, and given this model we can estimate on the real data just as the sample average:
Note that if we generated our artificial observations from a Normal distribution , the artificial data would look very similar to the real data, and as a result any discriminator would find it impossible to systematically distinguish between the real data and the artificial data.
Let us see how the GAN analyses this problem. For the discriminator we use a logit model with two parameters,
, with outcome (a real observation) or (a fake observation):For the real data we create an outcome . We also generate a sample of observations from the input distribution, , and create an outcome for these fake observations, . Given the parameter for the generator, , the discriminator chooses to maximize the standard logit log likelihood function:
where
We then choose the parameter for the generator to minimize the concentrated log likelihood function for the discriminant:
Let us implement this by iterating between the following two steps. Given values after steps of the algorithm we update the two parameters:

Update the discriminator parameter as

Update the generator parameter by taking a small step (small learning rate ) along the derivative:
To see what this algorithm leads to, consider when the generator is successful in the sense that the discriminator is unable to tell apart the real and fake data. This happens when the solution for the parameter for the discriminant is . Then, inspecting the optimization problem for , it must be the case that is equal to , and thus . For this to be the solution to the optimization problem for , it must be the case that
If the number of observations from the generated model ( is large, then , and this will converge to .
Note that there are some subtleties in this algorithm. An important one is that at each iteration we can solve for the optimal value of , given the current value for . However, in the second step of each iteration we cannot optimize for given . In this simple example it is easy to see that as a function of the log likelihood function is monotone, and so would diverge. We avoid that by taking a small step in the direction of the derivative.
2.4 Earth Moving Distance and Wasserstein Generative Adversial Networks
The simple GAN as decribed in the previous section does not work well in more complex settings. An attractive alternative is based on a modification of the distance measure. First, for two distributions and
, let the KullbackLeibler divergence be
where both and are absolutely continuous with respect to the same measure. Also define the JensenShannon divergence as
Goodfellow et al. (2014) and Arjovsky et al. (2017)) show that the GAN as described in Section 2.2 chooses in large samples the generator to minimize the JensenShannon divergence between the generator and the distribution that generates the true data. This JensenShannon divergence metric turns out to have some disadvantages, especially when there is a difference in the support of the distributions. This can lead to substantial computational problems where the discriminator is too good early on so that it classifies the real and fake observations well, and as a result the derivatives for the generator vanish. See Gulrajani et al. (2017); Arjovsky and Bottou (2017) for details.
An attractive alternative to the JensenShannon divergence is the EarthMover or Wasserstein distance (Arjovsky et al. (2017)):
where
is the set of joint distributions that have marginals equal to
and . The term EarthMover distance comes from the interpretation that is the amount of probability mass that needs to be transported to move from the distribution to the distribution . It is well defined irrespective of the support of the distributions.Arjovsky et al. (2017) show that an alternative representation of the Wasserstein distance is
where we take the supremum of the functions over all Lipschitz functions with Lipschitz constant equal to 1. The function is known as the critic.
Suppose we parametrize the critic as . Ignoring the Lipschitz constraint the optimization problem would be
(2.1) 
Given the generator, we choose the parameters of the critic, , to maximize the difference between the average of over the real data and the average over the generated data. We then choose the parameter of the generator, , to minimize this maximum difference. Again we need to be careful in taking small steps for the generator, whereas we can take many steps for the critic. Ideally we would restrict the search to parameters that ensure that the critic is Lipschitz with constant 1. That is difficult computationally. In the literature two approaches have been followed (Gulrajani et al. (2017)). The first solution is to restrict the parameters to lie in some compact set, the Cartesian product of . That implies that the function is Lipschitz for some , so that we mininimize times the Wasserstein distance, for some unknown . This approach, known as clipping, still turns out to have computational problems. A second suggestion is to add a penalty term to the objective function for the critic. The penalty term depends on the average the square of the positive part of the difference between the norm of the derivative of with respect to the input and one. Specificially, it has the form
where the are a convex combination of the real and generated observations. Note that here we use batches of the real and generated data of the same size .
2.5 The Algorithm
Instead of using all the data in each step of the algorithm, we repeatedly use random batches of the real data with batch size , denoted by , and each iteration generate the same number of new fake observations from the input distribution, denoted by .
The general algorithm is described in Algorithm 2
. For the optimization we use a modification of the SGD (Stochastic Gradient Descent) algorithm (
e.g., Bottou (2010)), the Adam (Adaptive moment estimation,
Kingma and Ba (2014)) algorithm. The Adam algorithm combines the estimate of the (stochastic) gradient with previous estimates of the gradient, and scales this using an estimate of the second moment of the unitlevel gradients, the latter part being somewhat akin to the BerndtHallHallHausman algorithm proposed in Berndt et al. (1974) Details on the Adam algorithm are provided in the appendix. Our specific algorithm uses dropout (WardeFarley et al. (2013)) for regularization purposes as well as a modification of the Adam algorithm that adapts the learning rate (Baydin et al. (2017)), thus reducing the time for hyperparameter search.
2.6 Conditional WGANs
The algorithm discussed in Section 2.5 learns to generate draws from an unconditional distribution. In many cases we want to generate data from a conditional distribution. For example, for the causal settings that motivate this paper, we may wish to keep fixed the number of treated and control units. That would be simple to implement by just running two marginal WGANs. More importantly, we wish to generate potential treated and control outcomes given a common set of pretreatment variables. For that reason it is important to generate data from a conditional distribution (Mirza and Osindero (2014); Odena et al. (2017); Liu et al. (2018); Kocaoglu et al. (2017)).
Suppose we have a sample of real data , . We wish to train a generator to sample from the conditional distribution of . The conditioning variables are often referred to as labels in this literature. Let be the generator, and let
be the critic. As before, both will be neural networks, although this is not essential. The specific algorithm is described in Algorithm
2. ^{1}^{1}1 In a related paper, Kocaoglu et al. (2017) propose an architecture that preserves dependencies represented in a directed acyclic graph (DAG).3 Simulating the LalondeDehejiaWahba Data
In this section we discuss using WGANs to simulate data for the LalondeDehejiaWahba (LDW) data.
3.1 Simulation Studies for Average Treatment Effects
In the setting of interest we have data on an outcome , a set of pretreatment variables and a binary treatment . We postulate that there exists for each unit in the population two potential outcomes and , with the observed outcome equal to corresponding to the potential outcome for the treatment received, . We are interested in the average treatment effect for the treated,
assuming unconfoundedness (Rosenbaum and Rubin (1983); Imbens and Rubin (2015)):
and overlap
for all in the support of the pretreatment variables. Let (which by unconfoundedness is equal to ) be the conditional outcome mean, and let be the propensity score. There a large literature developing methods for estimating average and conditional average treatment effects in this literature (see Imbens (2004); Abadie and Cattaneo (2018) for surveys).
Given a sample we cannot directly establish the true value of the average treatment effect, so standard Machine Learning comparisons of estimators based on crossvalidation methods are not feasible. In this setting, researchers have often conducted simulation studies to assess the properties of proposed methods (Abadie and Imbens (2011); Belloni et al. (2014); Athey et al. (2018)). Using the LDW sample Abadie and Imbens (2011) estimate a model for the conditional means and the propensity score allowing for linear terms and second order terms. To account for the mass points at zero, they model separately the probability of the outcome being equal to zero and outcome conditional on being positive. Schuler et al. (2017) also start with a real data set. They postulate a value for the conditional average treatment effect . They then use the empirical distribution of as the true distribution. They estimate the conditional means using flexible models, imposing the constraint implied by the choice of conditional average treatment effect . Given these estimates they estimate the residual distribution as the empirical distribution of
. Then they impute outcomes for new samples using the estimated regression functions and random draws from the empirical residual distribution. Note that this procedure imposes homoskedasticity. Note also that the
Schuler et al. (2017) choice for the joint distribution of can create violations of the overlap requirement if the pretreatment variables are continuous. Because they specify the conditional average treatment effect that does not create problems for estimating the ground truth.3.2 The LDW Data
The data set we use in this paper was originally constructed by LaLonde (1986), and later recovered by Dehejia and Wahba (1999). The version we use is available on Dehejia’s website. This data set has been widely used in the program evaluation literature to compare different methods for estimating average treatment effects (e.g., Dehejia and Wahba (2002); Heckman and Hotz (1989); Abadie and Imbens (2011)). We use three versions of the data. The first, which we refer to as the experimental sample, LDWE, contains the observations from the actual experiment. This sample contains observations, with control observations and treated observations. For each individual in this sample we observe a set of eight pretreatment variables, denoted by . These include two earnings measures, two indicators for ethnicity, marital status, and two education measures, and age. denotes the matrix with each row corresponding to the pretreatment variables for one of these units, and denoting the for the treated units in this sample. Let denote the matrix with all the covariates. Simiarly, let denote the vector of outcomes for the control units in this sample, and denote the vector of outcomes for the treated units, and let denote the vector of treatment indicators for the control units in this sample (all zeros), and denote the vector of outcomes for the treated units (all ones). The outcome is a measure of earnings in 1978.
The second sample is the CPS sample, LDWCPS. It combines the treated observations from the experimental sample with control observations drawn from the Current Population Survey, for a total of observations. The third sample is the PSID sample, LDWPSID. It combines the treated observations from the experimental sample with control observations drawn from the Panel Survey of Income Dynamics, for a total of observations. Table 1 presents summary statistics for the eight pretreatment variables in these samples, the treatment indicator and the outcome.
Experimental  Experimental  CPS  PSID  
trainees (185)  controls (260)  controls (15,992)  controls (2,490)  
mean  s.d.  mean  s.d.  mean  s.d.  mean  s.d.  
black  0.84  (0.36)  0.83  (0.38)  0.07  (0.26)  0.25  (0.43) 
hispanic  0.06  (0.24)  0.11  (0.31)  0.07  (0.26)  0.03  (0.18) 
age  25.82  (7.16)  25.05  (7.06)  33.23  (11.05)  34.85  (10.44) 
married  0.19  (0.39)  0.15  (0.36)  0.71  (0.45)  0.87  (0.34) 
nodegree  0.71  (0.46)  0.83  (0.37)  0.30  (0.46)  0.31  (0.46) 
education  10.35  (2.01)  10.09  (1.61)  12.03  (2.87)  12.12  (3.08) 
earn ’74  2.10  (4.89)  2.11  (5.69)  14.02  (9.57)  19.43  (13.41) 
earn ’75  1.53  (3.22)  1.27  (3.10)  13.65  (9.27)  19.06  (13.60) 
treatment  1    0    0    0   
earn ’78  6.35  (7.87)  4.55  (5.48)  14.85  (9.65)  21.55  (15.56) 
3.3 A Conditional WGAN for the LDW Data
Consider the experimental data set LDWE. The goal is to create samples of observations, containing control units and treated units, where the samples are similar to the real sample. We proceed as follows. First, we run a conditional WGAN on the sample , conditional on . Let the parameters of the generator of the WGAN be . During training of the models, each batch of training data contains treated and control units with equal probability, to avoid estimation issues when the fraction of treated units is close to zero (for example, it is equal to 0.011 in the CPS dataset).
In each case, for the generator we use a neural net with the following architecture. There are three hidden layers in the neural net, with the number of inputs and outputs equal to and respectively. Here is the dimension of the vectors whose distribution we are modeling and is the dimension of the conditioning variables. For the unconditional WGAN this is , and , and for the conditional WGANs this is , and
. We use the rectified linear transformation,
in the hidden layers. For the final layer we have 128 inputs andoutputs. Here we use for binary variables a sigmoid transformation, for censored variables a rectified linear transformation, and for continuous variables the identify function.
For the critic we use the same architecture with three layers, with the number of inputs and outputs equal to and respectively. For the final layer we have 128 inputs, and 1 linear output.
We did not adapt the architectures to the individual settings, so these hyperparameters should not be thought of as optimal. In spite of this, they yield a wellperforming WGAN. This is to emphasize that the exact architectural choices do not matter in settings like ours, so long as the overall size of the network is large enough to capture the complexity of the data and the amount of regularization (i.e. dropout probability) is high enough to avoid overfitting.
Given the parameters for the generators, , , we first create a very large sample, with units. We use this sample as our population for the simulations. To create the large sample, first we draw separately the covariates for the treated and control units using the generator with parameter . In this step, we create the sample keeping the fraction of treated units equal to that in the sample. Next we draw independently and for each observation in this large sample, using the and as the conditioning variables, using the generators with parameters . Unlike in any real dataset, we observe both and for each unit, simplifying the task of estimating the ground truth in the simulated data. We use this large sample to calculate the approximate true average effect for the treated as the average difference between the two potential outcomes for the treated units:
For this fixed population we report in Table 2
the means and standard deviations for the same ten variables as in Table
1.experimental  cps  psid  

treated  control  treated  control  treated  control  
mean  sd  mean  sd  mean  sd  mean  sd  mean  sd  mean  sd  
black  0.85  (0.35)  0.83  (0.38)  0.83  (0.37)  0.09  (0.28)  0.84  (0.36)  0.29  (0.45) 
hispanic  0.07  (0.25)  0.10  (0.30)  0.05  (0.22)  0.08  (0.28)  0.07  (0.16)  0.03  (0.16) 
age  26.4  (7.0)  26.0  (7.0)  25.6  (7.0)  33.2  (11.3)  26.9  (7.3)  36.5  (11.2) 
married  0.19  (0.39)  0.18  (0.38)  0.19  (0.39)  0.72  (0.45)  0.19  (0.39)  0.85  (0.35) 
nodegree  0.72  (0.45)  0.83  (0.37)  0.71  (0.45)  0.29  (0.45)  0.72  (0.45)  0.33  (0.47) 
education  9.9  (2.0)  9.8  (1.6)  10.3  (2.0)  11.9  (2.9)  9.8  (2.0)  11.4  (3.0) 
earn ‘74  2.13  (4.89)  2.121  (5.35)  2.02  (4.15)  14.61  (9.95)  1.82  (4.16)  18.02  (12.40) 
earn ’75  1.50  (2.93)  1.42  (3.39)  1.43  (2.86)  14.32  (9.60)  0.93  (2.99)  16.72  (11.92) 
earn ’78  6.21  (6.35)  4.40  (4.70)  6.00  (5.34)  15.49  (9.46)  5.13  (5.44)  19.28  (13.20) 
In Figures 15 we present some graphical evidence on the comparison of the actual data and the generate data for the CPS control sample. In general the generated data and the actual data are quite similar. This is true for the first two moments, the marginal distributions, the correlations, as well as the conditional distributions. In particular it is impressive to see in Figure 5 that the conditional distribution of earnings for two groups for the actual data (those with 1974 earnings positive or zero), which has substantially different shapes, is still well matched by the artificial data. The fact that the first two moments of the generated data closely match those of the actual data is only limited comfort. There are simple ways in which to generate data for which the first two moments of each of the variables match exactly those of the actual data, such as the standard bootstrap. However, our generator allows us to generate new samples that contain observations not seen in the actual data, and with no duplicate observations. For some of the estimators, notably the nearest neighbor matching estimators, this can make a substantial difference.
Next we repeatedly construct samples of size , with control units and treated units. From our large population of size , we randomly draw observations with , and record their , their , and their outcome . Next we draw randomly observations with and record their , their , and their outcome . This gives us our sample with units, with control units and treated units. If we want another sample we use the next control units and the next treated observations.
4 Comparing Estimators for Average Treatment Effects
In this section we implement the WGANs to generate data sets to compare different estimators for the average treatment effects. We do this in three settings, first with the experimental LDWE data, second with the LDWCPS comparison group, and third with the LDWPSID comparison group.
4.1 Estimators
We compare eleven estimators for the average effect for the treated. Nine of them fit into a set where we compare three methods for estimating the two nuisance functions, the propensity score and the conditional outcome mean (linear and logit models, random forests, and neural nets), with three ways of combining these estimates of the nuisance functions (difference in estimated conditional means, propensity score weighting, and double robust methods), and two are standalone estimators. All estimators that involve estimating the propensity score use trimming on the estimated propensity score, dropping observations with an estimated propensity score larger than 0.95. See Crump et al. (2009) for discussions of the importance of trimming in general. In future versions we intend to include additional estimators including the Residual Balancing and dragonnet estimators (Athey et al. (2018); Shi et al. (2019)).
For estimating the two nuisance functions and we consider three methods:

A logit model for the propensity score and a linear model for the conditional outcome mean, given the set of eight pretreatment variables. Denote the estimator for the conditional mean by , and the estimator for the propensity score by .

Random Forests for the propensity score and the conditional outcome mean. Denote the estimator for the conditional mean by , and the estimator for the propensity score by . See Athey and Wager (forthcoming) for an analysis using generalized random forests Athey et al. (2019) in doubly robust estimation.

Neural Nets for the propensity score and the conditional outcome mean.Denote the estimator for the conditional mean by , and the estimator for the propensity score by . SeeFarrell et al. (2018) for theoretical analysis of neural nets.
We also consider three methods for incorporating the estimated nuisance functions into an estimator for the ATE:

Use the estimated conditional outcome mean by averaging the difference between the realized outcome and the esitmated control outcome, averaging this over the treated observations:

Use the estimated propensity score to weight the control observations:

Use both the estimated conditional mean and the estimated propensity score in a double robust approach:
Note that we do not use the sample splitting here.
4.2 Estimates for LDW Data
First we use all the estimators on the actual LDW data. The results are reported in Table 3.
Experimental  CPS  PSID  

est  s.e.  est  s.e.  est  s.e.  
Difference in Means  1.79  (0.67)  8.50  (0.58)  15.20  (0.66) 
Bias Corrected Matching  1.90  ()  2.35  ()  1.47  () 
Outcome Models  
Linear  1.00  (0.57)  0.69  (0.60)  0.79  (0.60) 
Random Forest  1.73  (0.58)  0.92  (0.6)  0.06  (0.63) 
Neural Nets  2.07  (0.59)  1.43  (0.59)  2.12  (0.59) 
Propensity Score Weighting  
Linear  1.81  (0.83)  1.18  (0.77)  1.26  (1.13) 
Random Forest  1.78  (0.94)  0.65  (0.77)  0.46  (1.00) 
Neural Nets  1.92  (0.87)  1.26  (0.93)  0.10  (1.28) 
Double Robust Methods  
Linear  1.80  (0.67)  1.27  (0.65)  1.50  (0.97) 
Random Forest  1.84  (0.8)  1.46  (0.63)  1.34  (0.85) 
Neural Nets  2.15  (0.74)  1.52  (0.75)  1.14  (1.08) 
4.3 Simulation Results for the Experimental Control Sample
First we report results for the comparison of all the estimators for the experimental sample in Table 4
. We draw 10,000 samples from the population distribution. We report the bias of the estimators, and standard deviation, the rootmeansquarederror, the coverage rates over the 10,000 replications, and the rmse ranking over the estimators. The rmse’s for the different estimators are fairly similar, ranging from 0.064 (for the residual balancing estimator) to 0.77 for the bias corrected matching estimator. Since there is balance between the treatment and control group due to random assignment of the treatment, it is not surprising that all methods perform fairly well. The ibiases are low relative to the standard deviations, and as a result the coverage rates for the nominal 95% confidence intervals are accurate.
rmse  

Estimator  rmse  rank  bias  sdev  coverage 
Difference in Means  0.62  9  0.29  0.55  0.90 
Bias Corrected Matching  0.64  10  0.08  0.64  
Outcome Models  
Linear  0.56  2  0.06  0.56  0.90 
Random Forest  0.58  4  0.15  0.56  0.89 
Neural Nets  0.65  11  0.17  0.63  0.85 
Propensity Score Weighting  
Linear  0.56  3  0.04  0.56  0.99 
Random Forest  0.60  7  0.17  0.58  0.99 
Neural Nets  0.59  5  0.11  0.58  0.99 
Double Robust Methods  
Linear  0.56  1  0.04  0.56  0.95 
Random Forest  0.60  8  0.08  0.60  0.95 
Neural Nets  0.59  6  0.09  0.59  0.95 
4.4 Simulation Results for the CPS Control Sample
Next, we report results for the comparison of the twelve estimators for the CPS comparison sample in Table 5. As expected, given the substantial differences in characteristics between the treatment group and the control group, in this exercise we see substantial differences in the performances of the different estimators. The methods relying on outcome modeling only, or propensity score modelling only, do poorly other than those that rely on neural nets for estimation of nuisance parameters. The flexible double robust methods do well here. The biases for some of the estimators are substantial, contributing to their confidence intervals having poor coverage rates.
Estimator  rmse  rank  bias  sdev  coverage 

Difference in Means  10.50  11  10.49  0.40  0.00 
Bias Corrected Matching  0.71  7  0.37  0.61  0.00 
Outcome Models  
Linear  0.77  8  0.62  0.45  0.67 
Random Forest  0.80  9  0.67  0.44  0.62 
Neural Nets  0.51  3  0.10  0.50  0.89 
Propensity Score Weighting  
Linear  0.66  6  0.47  0.46  0.95 
Random Forest  0.89  10  0.77  0.45  0.86 
Neural Nets  0.52  4  0.09  0.51  0.98 
Double Robust Methods  
Linear  0.64  5  0.45  0.45  0.84 
Random Forest  0.50  1  0.13  0.48  0.92 
Neural Nets  0.50  2  0.01  0.50  0.96 
4.5 Simulation Results for the PSID Control Sample
Third, we report results for the comparison of the twelve estimators for the psid comparison sample in Table 6. Here the linear methods do suprisingly well, outperforming all the methods using random forests and neural nets. The double robust methods do reasonably well overall. Note that the linear methods do particularly well in terms of bias.
185 treated, 2,490 Controls  945 Treated, 12,450 Controls  
Estimator  rmse  rank  bias  sdev  cov  rmse  rank  bias  sdev  cov 
DIM  15.18  12  15.17  0.48  0.00  15.17  11  15.17  0.21  0.00 
BCM  0.88  8  0.42  0.77  0.00  0.51  7  0.38  0.35  0.00 
Outcome Models  
L  0.57  1  0.09  0.56  0.88  0.27  1  0.09  0.25  0.86 
RF  0.97  10  0.79  0.57  0.52  0.63  10  0.57  0.27  0.22 
NN  1.20  11  0.85  0.85  0.48  0.62  9  0.50  0.36  0.36 
Propensity Score Weighting  
L  0.67  2  0.01  0.67  0.98  0.29  2  0.02  0.29  0.99 
RF  0.91  9  0.65  0.64  0.94  0.53  8  0.44  0.29  0.81 
NN  0.83  7  0.40  0.73  0.96  0.31  3  0.06  0.30  0.98 
Double Robust Methods  
L  0.72  4  0.27  0.67  0.93  0.38  6  0.23  0.30  0.87 
RF  0.70  3  0.11  0.69  0.91  0.33  4  0.07  0.32  0.89 
NN  0.76  6  0.35  0.67  0.90  0.34  5  0.17  0.30  0.91 
4.6 How Credible are the Rankings
The algorithm developed in this paper leads, for a given data set, to a rmse for each estimator, and, based on that, a unique ranking of a set of estimators. However, it does not come with a measure of robustness of that ranking. The question arises becaues in principle the ranking of the estimators could be quite different if we change the environment slightly. In particular we may be concerned with the robustness of the bias component of the rmse. In this section we discuss two approaches to assessing how robust the rankings are.
In the first approach we change the number of observations we draw from the population distribution generated by the WGAN. We focus here on the PSID estimates. Given the estimated WGAN, we originally drew many samples with treated units and control units. Now we change this by drawing repeatedly samples with treated units and control units, five times as many from each group as previously. In the second part of Table 6
we present the same statistics we reported previously for these samples. We find that for estimators that are increasingly flexible as the sample size increases, those involving random forests or neural nets, and the matching estimator, the bias goes down, sometimes substantially. The standard deviation goes down for all estimators, roughly to the extent expected. As a result the rankings do not change much. It remains the case that the linear model and the logit propensity score weighting do well, but the double robust methods based on random forests and neural nets remain competitive. It would be interesting to see if there are other robustness analyses that would change the relative performance of the linear regression estimator here.
In the second approach we do the following. Given our trained WGAN, we take random samples from its distribution. Now we treat in turn each of these artificial samples as the original sample. We then go through the same procedure as before, estimating a WGAN on the th artificial sample, and comparing the twelve estimators on the distribution implied by the WGAN estimated on the th sample. We then compare the rmse from the procedure applied to the th artificial sample to the rmse based on the WGAN trained on the actual sample. In Table 7 we report the results for the LDWCPS. We report the average, over the replications/samples the rmse, bias and standard deviation, and we report the standard deviation over these replications. The methods that did well on the original sample typically do well in these replications. For example the double robust random forest had an rmse of 0.50, whereas the linear double robust method has on the original sample an rmse of 0.64. Over the artificial samples the rmse for the double robust RF was 0.45, with a standard deviation of 0.05. For the double robust L method the average rmse is 0.66, with a standard deviation of 0.05. It typically ranks below the double robust RF method.
For this sample the tentative conclusion is that the flexible double robust methods (using neural nets or random forests do well), consistent with the theory. The neural net version of propensity score weighting also does well, but since this is less consistent with theory, we would not recommend this.
Both these exercises suggest that the WGANbased simulations do a good job in recovering the bias, and that ranking estimators based on such WGANs may be an effective way of choosing between competing methods.
Original Estimator  Ave Simulations  Standard Dev Sims  
RMSE  Bias  sdev  RMSE  Bias  sdev  RMSE  Bias  sdev  
DIM  10.50  10.49  0.40  10.46  10.45  0.38  0.52  0.52  0.03 
BCM  0.71  0.37  0.61  0.78  0.11  0.59  0.15  0.55  0.03 
Outcome Models  
L  0.77  0.62  0.45  0.89  0.66  0.42  0.40  0.61  0.02 
RF  0.80  0.67  0.44  0.66  0.50  0.40  0.19  0.24  0.03 
NN  0.51  0.10  0.50  0.50  0.10  0.46  0.05  0.17  0.03 
Propensity Score Weighting  
L  0.66  0.47  0.46  0.66  0.44  0.43  0.24  0.35  0.04 
RF  0.89  0.77  0.45  0.74  0.60  0.40  0.23  0.27  0.03 
NN  0.52  0.09  0.51  0.46  0.06  0.45  0.05  0.06  0.05 
Double Robust Methods  
L  0.64  0.45  0.45  0.66  0.44  0.43  0.23  0.34  0.04 
RF  0.50  0.13  0.48  0.45  0.09  0.43  0.05  0.13  0.04 
NN  0.50  0.01  0.50  0.45  0.02  0.44  0.05  0.05  0.05 
5 Conclusion
In this paper we show how WGANs can be used to tie Monte Carlo studies to real data. This has the benefit of ensuring that simulation studies are grounded in realistic settings, and not chosen to support the preferred methods. In this way the simulations studies will be more informative to the readers. We illustrate these methods comparing different estimators for average treatment effects using the LalondeDehejiaWahba. There are a number of findings. First, in the three different settings, the experimental data, the CPS control group and the PSID control group, different estimators emerge at the top. Within a particular sample the results appear to be relatively robust to changes in the analysis (e.g., changing the sample size, or doing the twostage WGAN robustness analysis). Second, the preference in the theoretical literature for double robust estimators is broadly mirrored in our results. Although the flexible double robust estimators (using random forests or neural nets) do not always outperform the other estimators, the loss in terms of rootmeansquarederror is always modest, where other estimators often perform particularly poorly in some settings. If one were to look for a single estimator in all settings, our recommendation would thereefore be the double robust estimator using random forests or neural nets. However, one may do better in a specific setting by using the WGANs to assess the relative performance of a wider range of estimators.
Appendix: The Estimators

Difference in Means

The BiasAdjusted Matching estimator
Bias Corrected Matching: 1. match all treated units with replacement to control units using diagonal version of Mahalanobis matching. 2. Regress difference between treated and control outcome for matched pairs on difference in covariates. See Abadie and Imbens (2011).

Conditional Outcome Model, Linear Model

Conditional Outcome Model, Random Forest

Conditional Outcome Model, Neural Nets

The HorowitzThompson Estimator, Logit Model
See Hirano et al. (2003).

The HorowitzThompson Estimator, Random Forest

The HorowitzThompson Estimator, Neural Net

The Double Robust Estimator, Linear and Logit Model

The Double Robust Estimator, Random Forest

The Double Robust Estimator, Neural Nets
Appendix: The Adam Algorithm
Comments
There are no comments yet.