Difference-in-differences is an an increasingly popular observational study design for estimating causal effects from repeated cross sections (e.g., Angrist and Pischke, 2008; Bertrand, Duflo, and Mullainathan, 2004; Card and Krueger, 1994; Lechner, 2011; Obenauer and von der Nienburg, 1915). In the simplest setting, difference-in-differences starts with data on some outcome (e.g., employment) from two comparable state (or cities, regions, etc.), one of which enacts some policy of interest (e.g., a minimum wage increase) and the other of which doesn’t; it then attributes any difference in trends between the two states to the effect of the policy change. The key assumption underlying any such argument is one of “parallel trends”: if neither state had enacted the policy, then their trends would have evolved in the same way.
In many applications, however, such global parallel trends assumptions are hard to justify, as states may have different subgroups of people that exhibit markedly different trends on their own. For example, when studying the effect of minimum wages on employment, we may find that there are subgroups within states (e.g., based on age, income or gender) that have different baseline trends; then, if our two states under comparison have different proportions of these subgroups, the global parallel trends assumption immediately becomes questionable. Instead, we may want to control for these covariates, and only assume parallel trends once we have conditioned on them (e.g., Abadie, 2005; Acemoglu and Angrist, 2001; Blundell, Dias, Meghir, and Van Reenen, 2004; Heckman, Ichimura, and Todd, 1998).
The goal of this paper is to develop flexible methods for nonparametric difference-in-differences estimation under a conditional parallel trends assumption. More formally, suppose we observe independent samples , where the state indicator denotes whether the -th individual is in the control or exposed state. denotes the time of the observation (pre vs post), is a set of potential confounders and is the outcome of interest, and suppose that only samples in the “exposed” state and “post” time period get treated, i.e., we can write the treatment indicator as . Following the potential outcomes framework (Imbens and Rubin, 2015), and denote the control and treated potential outcomes and suppose we observe . We are interested in estimating the average treatment effect , where is the expected treatment effect conditional on covariates and on being treated:
Because is not observed, we impose a parallel trends assumption conditional on covariates, so that
which then allows us to estimate the treatment effect as follows,
As discussed in Abadie (2005), (3) may be more credible than the standard parallel trends assumption that holds without conditioning on as it enables us to control for known sources of confounding. In this paper, we take the estimand with defined via (3) as given; Abadie (2005) provides further discussion of this identification strategy.
. All these papers, however, make an additional assumption that there is no covariate shift across cross-sections from the same state: they require that the joint distribution ofdoes not vary with , i.e., that that . And this assumption may be hard to justify with cross-sectional data where we are not able to survey exactly the same people in the pre and post periods. For example, in a ride-sharing application, Lu et al. (2018) estimates the effects of a dynamic pricing feature on drivers’ behaviors by leveraging a natural experiment where a software bug temporarily disables a dynamic pricing feature for certain drivers. In this case, we may expect the distribution of the covariates for active drivers varies both with exposure and time .
Another widely used approach for difference-in-differences estimation with covariates entails fitting a two-way fixed effect linear model for in terms of of the form
then interpreting the coefficient as the treatment effect (e.g., Anzia and Berry, 2011). This approach, however, makes several restrictive assumptions. It assumes that the effect of the state and that of are fixed, and that the treatment effect is homogeneous. It also assumes that any confounding effects of should affect the outcome linearly (Angrist and Pischke, 2008; Ding and Li, 2019; Keele and Minozzi, 2013; Lechner, 2011). These constraining assumptions can be difficult to satisfy in practice, and in particular the estimator (4) is not justified by a nonparametric version of the assumption (3).
In this paper, we propose flexible and robust nonparametric approaches for the difference-in-differences design that address both of these limitations, and allow us to estimate given only the assumption (3) along with a relevant form of overlap. Throughout this paper, we assume the data is generated from the following generic specification:
where the joint distribution of may be arbitrary and
are the conditional effect of alone and alone, and .
Our contributions are as follows. We start by developing an orthogonal transformation of (5) that generalizes the transformation of Robinson (1988) for the conditionally linear model. In the case where the underlying treatment effect is constant, this representation allows us to build a transformed regression estimator (TR) to estimate the treatment effect. Our TR estimator achieves the parametric rate of convergence while allowing for slower estimation rates on all nuisance components. We further discuss the properties of the transformed regression estimator when the underlying linearity assumption is misspecified.
In the case where we allow for treatment heterogeneity in the data generating process, we propose a heterogeneous treatment effect estimator for , and then use a balancing approach to estimate the average treatment effect. We name this approach the Difference-in-Differences Augmented Minimax Linear Estimator (DiD-AMLE). Building on ideas from Chernozhukov et al. (2018b) and Hirshberg and Wager (2018), we show that DiD-AMLE is semiparametrically efficient and and find it to have promising empirical performance in a variety of simulation setups. Our proposed heterogeneous treatment effect estimator may also be of independent interest.
1.1 Related Works
The difference-in-differences approach to treatment effect estimation was popularized by Card and Krueger (1994), and has since become ubiquitous in the social sciences. Angrist and Pischke (2008) and Lechner (2011) provide a textbook treatment and a broad literature review. Building on Abadie (2005), we are here most interested in extensions of classical difference-in-differences methods that leverage covariate information to make the parallel trends assumption more plausible. Several authors have also recently extended difference-in-differences analyses in other complementary directions. Arkhangelsky (2018) and Athey and Imbens (2006) consider difference-in-differences type designs where there may be non-additive treatment effects. Abadie et al. (2010), Arkhangelsky et al. (2018), Athey et al. (2018a), Ben-Michael et al. (2018) and Xu (2017) develop methods that can be applied in the panel setting in which the same individuals are observed in both the pre- and post-periods, whereas in our setting we observe separate cross-sectional data in each period.
Our goal is to estimate the average treatment effect in difference-in-differences settings, which involves two parts. The first part is to obtain good point estimates of the heterogeneous treatment effect. There have been many recent papers on how to perform this non-parametric estimation well. One approach is to reduce the “regularization bias” that might occur. Examples of this line of work include Athey and Imbens (2016), Hahn, Murray, and Carvalho (2017), Imai and Ratkovic (2013), and Shalit, Johansson, and Sontag (2016)
. Another approach, the one we choose to adopt, is to develop meta-learning procedures that do not depend on a specific machine learning method. Key examples of such works areKünzel et al. (2019) and Nie and Wager (2017). Our decomposition of
is conceptually similar to the orthogonal moments constructions fromRobinson (1988), and more broadly, from Belloni et al. (2011), Bickel et al. (1998), Newey (1994), Scharfstein et al. (1999), Van Der Laan and Rubin (2006) and others. The second part is to obtain a good estimate of the average treatment effect from the heterogeneous effect and other nuisance parameter estimates. Our approach here is closely related to Chernozhukov et al. (2018b), Hirshberg and Wager (2018) and Robins and Rotnitzky (1995). Other related works include Athey, Imbens, and Wager (2018b), Graham, Pinto, and Egel (2012), Hainmueller (2012), Imai and Ratkovic (2014), Kallus (2017), Zubizarreta (2015).
Fleixble modeling and estimation that goes beyond the standard two-way fixed effects and linearity assumptions has also drawn considerable interest. Abadie (2005) considers inverse propensity stratification based methods. Another approach considers more flexible outcome models, see (e.g., Heckman, Ichimura, and Todd, 1998; Meyer, 1995). Recently, Li and Li (2019), Sant’Anna and Zhao (2018) and Zimmert (2018) proposed doubly robust variants of the approach of Abadie (2005) that also allow for heterogeneity in . Our approach improve upon these in two main ways. Firstly, Li and Li (2019), Sant’Anna and Zhao (2018), and Zimmert (2018) make the same assumption as Abadie (2005), i.e., that the time when an individual is observed, , is independent of the joint distribution of , whereas our approach does not require this assumption (see Proposition 1 and the following comment). Secondly, the recent doubly robust proposals focus on methods based on augmented inverse propensity weighting, whereas we develop estimators based on either transformed regression (when the treatment effect is constant) or balancing estimation (when is heterogenous). Instead of using inverse propensity weight as in augmented inverse propensity weighting methods, we use convex optimization to learn the weights, which appear to be more stable in practice especially when the sample size is small.
2 An Orthogonal Transformation for DID
We start by constructing a new decomposition for the outcome model (5) that enables us to express in terms of various marginal effects. Generalizing the transformation of Robinson (1988), our decomposition underlies an orthogonal moments condition and can thus be used for -consistent estimation of with nuisance components estimated via flexible machine learning methods (Chernozhukov et al., 2018a).
The conditional probabilities of an observation being in stateor time period conditionally on play a central role in our analysis (Rosenbaum and Rubin, 1983). We write these quantities as
We write . We also write for the conditional response function marginalizing over and , and
for the conditional effect of marginalizing over and respectively. We write the conditional covariance of and as
Finally, for convenience, we let . Given this notation, we can verify the following (the derivation is given in Appendix A).
Suppose we have access to an independent and identically distributed sequence of tuples and . Under the model (3), our data-generating distribution admits a representation
where , and
Furthermore, all terms in the above decomposition are orthogonal in the following sense:
The key property of this representation is the orthogonality property (12), which will enable flexible estimation of treatment effects at parametric rates as discussed in the following section. In the setting of Abadie (2005), Sant’Anna and Zhao (2018) and Zimmert (2018), their assumption gives , which implies . As an immediate corollary to Proposition 1, this decomposition then simplifies to a functional form closely reminiscent of Robinson’s transformation (Robinson, 1988):
More generally, we see that when is close to 0, all expressions underlying (10) and (11) are well-conditioned, and we expect estimation using (10) to be stable. Conversely, if and are highly correlated conditionally on , then and (11) could become unstable; this is as expected, because if and are highly correlated, then we do not expect their interaction effect to be well identified.
Finally, we note that all nuisance components in the decomposition above are marginal quantities, and thus can be estimated using all of the data. This property is desirable for empirical performance as it is more data efficient when we need to estimate them in a small-sample regime.
3 The Transformed Regression Estimator
As a first application of the orthogonal decomposition given above, we consider estimation in a setting where the treatment effect itself is constant in the representation (10), but all other nuisance components defined above, i.e., , , , , and , may vary with . The standard approach to estimating in this setting is to write a linear regression model of the form in (4) and to interpret the coefficient on as an estimate of the treatment effect. However, as shown in our experiments, this simple linear regression-based approach to treatment effect estimation may be severely biased in the setting where the linear model (4) is misspecified.
Here, we propose the transformed regression (TR) estimator with cross-fitting for (shown in Algorithm 1). The method is based on the decomposition (10), which is motivated by a decomposition used by Robinson (1988) to estimate parametric components in partial linear models. Robinson’s decomposition has also been used in many other recent works, such as in Athey et al. (2019) for causal forests, Robins (2004) for G-estimation, as well as in Chernozhukov et al. (2018a) and Zhao et al. (2017). The transformed regression estimator, motivated by Robinson, also has good theoretical properties. In Theorem 2, we show that the transformed regression estimator is -consistent and asymptotically normal under considerably more generality than simply running an OLS regression with the model (4). The proof of the theorem is in Appendix B.
, using any supervised learning method for prediction accuracy (the superscript ofdenotes using data not in the -th fold). Estimate as a heterogeneous “treatment effect” of while ignoring ; and estimate as a “treatment effect” of ignoring . Both can leverage methods designed for heterogeneous treatment estimation in the single cross-section case. Construct , , and , where , using the estimated nuisance parameters following (11). Then, for , obtain point estimates and
Under the conditions of Proposition 1, suppose furthermore that is constant and that the following conditions hold:
Overlap: the conditional probabilities are bounded away from by some small for all values of , and .
Consistency: for any estimated nuisance parameter , such as , and , we have that:
Risk decay: for any estimated nuisance parameter , we have:
Boundedness: all the nuisance parameters are uniformly bounded:
for some constant .
Then, writing as the transformed regression estimater obtained using Algorithm 1, and as the transformed regression estimator with oracle nuisance parameters, we have
and . In the case when , and is constant, the expression for simplifies to .
In step 3 of Algorithm 1, and can be estimated with methods for heterogeneous treatment effect estimation, for which there exists a large body of work (e.g., Hill, 2011; Imai and Ratkovic, 2013; Künzel, Sekhon, Bickel, and Yu, 2019; Shalit, Johansson, and Sontag, 2016; Wager and Athey, 2018; Zhao, Small, and Ertefaie, 2017). In our simulations, we use the -learner as advocated in Nie and Wager (2017).
If we ever want to use the transformed regression estimator which assumes a constant treatment effect, it is important to understand how it behaves under misspecification. Interestingly, as shown in Proposition 3, even when is not constant, the transformed regression estimator converges to a weighted average of with positive weights, mirroring the findings in Crump, Hotz, Imbens, and Mitnik (2009) and Li, Morgan, and Zaslavsky (2018). The proof for Proposition 3 is found in Appendix C.
When is not constant, the transformed regression estimator can thus be thought of as a weighted mean of the treatment effect, where more weight is given to the data points that are likely to appear with all four pairs.
4 DiD-AMLE: A Balancing ATE Estimator
In this section, we relax the assumption from the previous section that the underlying treatment effect is constant, and aim to estimate the average treatment effect while allowing for heterogeneity. We first propose a flexible nonparametric heterogeneous treatment effect estimator in the difference-in-differences setup that draws inspiration from recent advances in heterogeneous treatment effect estimation in the single cross-section case. Then, we leverage this estimator to build a balancing average treatment effect estimator that allows for -consistent and semiparametrically efficient inference of the average treatment effect in a non-parametric specification of the outcomes model.
4.1 Estimating Treatment Heterogeneity in Difference-in-Differences
Our goal is to estimate the average treatment effect , assuming that is heterogeneous. The plan is first to obtain a good non-parametric estimate of the function , then leverage balancing techniques from Chernozhukov et al. (2018a) or Hirshberg and Wager (2018) to estimate the ATE. The task of estimating is related to that of estimating the conditional average treatment effect (CATE), which has been studied extensively in the literature. Recall our expression for in (3). From that expression, one might attempt to estimate
for the four pairs of and on the corresponding subsets of the data, and then estimate . However, this approach is often not robust. As an example where this method might fail, consider a high dimensional linear model, , with , and . We might consider fitting the Lasso (Tibshirani, 1996) for each separately, and estimate . However, the lasso regularizes each towards separately, which might result in being regularized away from , even when everywhere. See Künzel et al. (2019) and Nie and Wager (2017) for a similar discussion on the -Learner for the CATE.
We proceed by turning our estimator of a constant causal parameter into an estimator for a heterogeneous treatment function
, by turning the estimation equation underlying the former estimator into a loss function. Following this strategy and using the decomposition from (10), we can estimate treatment effect heterogeneity with the following algorithm:
Algorithm 2 is similar to the -learner from Nie and Wager (2017), which finds a similar empirical risk minimizer as the heterogeneous treatment effect estimator. They further showed that if the squared error from estimating the nuisance components (including the marginal effects and the propensity scores) decays at a rate faster than , then error bounds for can match the best available bounds for estimating using oracle nuisance parameters. We do not demonstrate such a result here, but the similarity of our problem to previous settings suggests that a similar result may hold. In the next section, we use Algorithm 2 to build an average treatment effect estimator, but we note that Algorithm 2 can also be of separate interest on its own for heterogeneous treatment effects estimation.
4.2 Estimating the Average Treatment Effect
Next, we consider estimating the average treatment effect in the case where we allow for potential treatment heterogeneity in . One way to proceed is by building on results for doubly robust estimation as developed in Chernozhukov et al. (2018b). In order to do so, we first note that the average treatment parameter can be written as a weighted average of outcomes using inverse-probability-style weights as follows: with
Recall . The result of Chernozhukov et al. (2018b) implies that if we obtain good estimates both of and the inverse-probability-style weights outlined above, then we can obtain semiparametrically efficient estimates of using the doubly robust form such as in Augmented Inverse Propensity Weighitng (AIPW) (Robins and Rotnitzky, 1995). We will refer to this algorithm as DiD-AIPW, which takes on the following steps:
Following the cross-fitting steps 2 to 4 of Algorithm 1, estimate the nuisance parameters , , , , , using data not in the , where , …, are the folds of the data. In the same way, fit nonparametric regressions for the propensities , , and .
Run step 3 of Algorithm 2 to obtain cross-fitted point estimates , for each .
For each , produce the cross-fitted point estimates:
Estimate the average treatment effect as:
Abadie (2005), Sant’Anna and Zhao (2018) and Zimmert (2018) explored a special case of this approach in cases where are independent from . As a result, only the estimate of the single propensity is needed to perform a similar estimation, as opposed to the quantity .
While plug-in estimation with the doubly robust score admits for algorithmically simple estimation of the average treatment effect, we find this approach to perform quite poorly in experiments. The main issue is that the probabilities etc. may get quite small, and so inverting even slightly inaccurate propensity estimates may result in instability. This mirrors observations made by Athey, Imbens, and Wager (2018b), Graham, Pinto, and Egel (2012), Hainmueller (2012), Imai and Ratkovic (2014), Kallus (2017), Zubizarreta (2015) and others in the case of treatment effect estimation under selection on observables, and these paper all consider different strategies to stabilizing inverse probability weights. The problem highlighted by these papers is exacerbated in the difference-in-differences setting because the probabilities we need to invert are comparably even smaller—and so the upside from using a stable weighting method is accentuated.
Here, we follow the Augmented Minimax Linear Estimation (AMLE) approach of Hirshberg and Wager (2018), and estimate a set of weights by directly solving a quadratic program that directly minimizes the worst-case bias and variance of an estimator of the type (24). Because this approach explicitly considers variance when estimating , the resulting weights are guaranteed not to blow up even when the underlying probabilities may be small. Assume that we have access to an estimator of that is consistent in the gauge-norm of some convex class , where . As discussed in Hirshberg and Wager (2018), this assumption is often justified if is derived via penalized regression; and we discuss practical choices of in our experiments section. When estimating , we use the building blocks discussed in Section 4.1. Given these preliminaries, we propose the estimator DiD-AMLE, given in Algorithm 3.
Most of the steps of Algorithm 3 involve nuisance parameter estimations that are similar to those in Algorithms 1 and 2. The most involved part is step 3, where we solve an optimization problem to obtain linear minimax weights. The challenge lies in the fact that we need to specify a convex function class , so that we can actually solve the optimization problem in (25)and obtain the weights . For example, in our simulations, we define as
where is an absolutely convex class, spanned by Hermite polynomials (details see Appendix E). As a result of this choice , the optimization for then becomes