## 1 Introduction

In this paper we study estimation of and inference for average treatment effects in a setting with panel data. We focus on the setting where units, e.g., individuals, firms, or states, adopt the policy or treatment of interest at a particular point in time, and then remain exposed to this treatment at all times afterwards. The adoption date at which units are first exposed to the policy may, but need not, vary by unit. We refer to this as a staggered adoption design (SAD); such designs are sometimes also referred to as event study designs. An early example is Athey and Stern (1998), where adoption of an enhanced 911 technology by counties occurs over time, with the adoption date varying by county. This setting is a special case of the general Difference-In-Differences (DID) set up (e.g., Card (1990); Meyer et al. (1995); Angrist and Pischke (2008); Angrist and Krueger (2000); Abadie et al. (2010); Borusyak and Jaravel (2016); Athey and Imbens (2006); Card and Krueger (1994); Freyaldenhoven et al. (2018); de Chaisemartin and D’Haultfœuille (2018); Abadie (2005)) where, at least in principle, units can switch back and forth between being exposed or not to the treatment. In this SAD setting we are concerned with identification issues as well as estimation and inference. In contrast to most of the DID literature, e.g., Bertrand et al. (2004); Shah et al. (1977); Conley and Taber (2011); Donald and Lang (2007); Stock and Watson (2008); Arellano (1987, 2003); Abraham and Sun (2018); Wooldridge (2010); de Chaisemartin and D’Haultfœuille (2017, 2018), we take a design-based perspective where the stochastic nature and properties of the estimators arise from the stochastic nature of the assignment of the treatments, rather than a sampling-based perspective where the uncertainty arises from the random sampling of units from a large population. Such a design perspective is common in the analysis of randomized experiments, e.g., Neyman (1923/1990); Rosenbaum (2002, 2017).
See also Aronow and Samii (2016); Abadie et al. (2016, 2017) for this approach in cross-section regression settings. This perspective is particularly attractive in the current setting when the sample comprises the entire population, e.g., all states of the US, or all countries of the world. Our critical assumptions involve restrictions on the assignment process as well as exclusion restrictions, but in general do not involve functional form assumptions. Commonly made common trend assumptions (de Chaisemartin and D’Haultfœuille (2018); Abraham and Sun (2018)) follow from some of our assumptions, but are not the starting point.

As in Abraham and Sun (2018) we set up the problem with the adoption date, rather than the actual exposure to the intervention, as the basic treatment defining the potential outcomes. We consider assumptions under which this discrete multivalued treatment (the adoption date) can be reduced to a binary one, defined as the indicator whether or not the treatment has already been adopted. We then investigate the interpretation of the standard DID estimator under assumptions about the assignment of the adoption date and under various exclusion restrictions. We show that under a random adoption date assumption, the standard DID estimator can be interpreted as the weighted average of several types of causal effects; within our framework, these concern the impact of different types of changes in the adoption date of the units. We also consider design-based inference for this estimand. We derive the exact variance of the DID estimator in this setting. We show that under a random adoption date assumption the standard Liang-Zeger (LZ) variance estimator (Liang and Zeger (1986); Bertrand et al. (2004)), or the clustered bootstrap, are conservative. For this case we propose an improved (but still conservative) variance estimator.

Our paper is most closely related to a set of recent papers on DID methods that explicitly focus on issues with heterogeneous treatment effects (Abraham and Sun (2018); de Chaisemartin and D’Haultfœuille (2018); Han (2018); Goodman-Bacon (2017); Callaway and Sant’Anna (2018); Hull (2018); Strezhnev (2018); Imai et al. (2018); Hazlett and Xu (2018), and Borusyak and Jaravel (2016)). Among other things these papers derive interpretations of the DID estimator as weighted averages of causal effects and bias terms under various assumptions. In many cases they find that these interpretations involve weighted averages of basic average causal effects with potentially negative weights, and they propose alternative estimators that do not involve negative weights.

## 2 Set Up

Using the potential outcome framework for causal inference, we consider a setting with a population of $N$ units. Each of these $N$ units is characterized by a set of potential outcomes in $T$ periods for $T+1$ treatment levels, $Y_{it}(a)$. Here $i \in \{1,\ldots,N\}$ indexes the units, $t \in \{1,\ldots,T\}$ indexes the time periods, and the argument of the potential outcome function $Y_{it}(\cdot)$, $a \in \{1,\ldots,T,\infty\}$, indexes the discrete treatment, the date that the binary policy was first adopted by a unit. Units can adopt the policy at any of the time periods $1,\ldots,T$, or not adopt the policy at all during the period of observation, in which case we code the adoption date as $\infty$. Once a unit adopts the treatment, it remains exposed to the treatment for all periods afterwards. This set up is like that in Abraham and Sun (2018); Hazlett and Xu (2018), and in contrast to most of the DID literature where the binary indicator whether a unit is exposed to the treatment in the current period indexes the potential outcomes. We observe for each unit in the population the adoption date $A_i$ and the sequence of $T$ realized outcomes, $Y_{it}$, for $t = 1,\ldots,T$, where

$$Y_{it} \equiv Y_{it}(A_i)$$

is the realized outcome for unit $i$ at time $t$. We may also observe pre-treatment characteristics, denoted by the $K$-component vector $X_i$, although for most of the discussion we abstract from their presence. Let $\mathbf{Y}$, $\mathbf{A}$, and $\mathbf{X}$ denote the $N \times T$, $N \times 1$, and $N \times K$ matrices with typical elements $Y_{it}$, $A_i$, and $X_{ik}$ respectively. Implicitly we have already made a SUTVA-type assumption (Rubin (1978); Imbens and Rubin (2015)) that units are not affected by the treatments (adoption dates) for other units. Our design-based analysis views the potential outcomes as deterministic, and only the adoption dates $A_i$, as well as functions thereof such as the realized outcomes, as stochastic. Distributions of estimators will be fully determined by the adoption date distribution, with the number of units $N$ and the number of time periods $T$ fixed, unless explicitly stated otherwise. Following the literature we refer to this as a randomization, or design-based, distribution (Rosenbaum (2017); Imbens and Rubin (2015); Abadie et al. (2017)), as opposed to a sampling-based distribution.

In many cases the units themselves are clusters of units of a lower level of aggregation. For example, the units may be states, and the outcomes could be averages of outcomes for individuals in that state, possibly of samples drawn from subpopulations from these states. In such cases $N$ and $T$ may be as small as 2, although in many of the cases we consider $N$ will be at least moderately large. The distinction between cases where $Y_{it}$ is itself an average over basic units or not affects some, but not all, of the formal statistical analyses. It may make some of the assumptions more plausible, and it may affect the inference, especially if individual level outcomes and covariates are available.

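Define $W(a,t) \equiv \mathbf{1}_{a \le t}$ to be the binary indicator for the adoption date $a$ preceding $t$, and define $W_{it}$ to be the indicator for the policy having been adopted by unit $i$ prior to, or at, time $t$:

$$W_{it} \equiv W(A_i, t) = \mathbf{1}_{A_i \le t},$$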

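so that the $N \times T$ matrix $\mathbf{W}$ with typical element $W_{it}$ has, after ordering the units from never adopters to the earliest adopters, the staircase form:

$$\mathbf{W} = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 1 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 1 & \cdots & 1 \\ 0 & 1 & 1 & \cdots & 1 \\ 1 & 1 & 1 & \cdots & 1 \end{pmatrix}.$$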

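Let $N_a \equiv \sum_{i=1}^N \mathbf{1}_{A_i = a}$ be the number of units in the sample with adoption date $a$, and define $\pi_a \equiv N_a/N$, for $a \in \{1,\ldots,T,\infty\}$, as the fraction of units with adoption date equal to $a$, and $\Pi_t \equiv \sum_{s=1}^{t} \pi_s$, for $t \in \{1,\ldots,T\}$, as the fraction of units with an adoption date on or prior to $t$.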
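As a minimal sketch of these definitions, the following Python snippet builds the exposure indicator matrix from a vector of adoption dates and computes the adoption fractions. The helper name `exposure_matrix`, the five adoption dates, and the use of `np.inf` to code "never adopts" are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def exposure_matrix(adoption, T):
    """Entry [i, t-1] is 1 if unit i has adopted by period t, else 0."""
    periods = np.arange(1, T + 1)
    return (adoption[:, None] <= periods[None, :]).astype(int)

# Hypothetical example: 5 units, 4 periods; np.inf codes "never adopts".
A = np.array([np.inf, 4, 3, 3, 1])
T = 4
W = exposure_matrix(A, T)
print(W)  # rows ordered from the never adopter to the earliest adopter

# Fraction of units with each adoption date, and cumulative fraction adopted by t.
pi = {a: np.mean(A == a) for a in list(range(1, T + 1)) + [np.inf]}
Pi = np.array([np.mean(A <= t) for t in range(1, T + 1)])
```

The column means of `W` coincide with the cumulative fractions `Pi`, which is the sense in which the time averages of the exposure indicator depend only on the marginal distribution of adoption dates.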

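Also define $\overline{Y}_t(a)$ to be the population average of the potential outcome in period $t$ for adoption date $a$:

$$\overline{Y}_t(a) \equiv \frac{1}{N}\sum_{i=1}^N Y_{it}(a).$$

Define the average causal effect of adoption date $a'$ relative to $a$, on the outcome in period $t$, as

$$\tau_{t,aa'} \equiv \overline{Y}_t(a') - \overline{Y}_t(a) = \frac{1}{N}\sum_{i=1}^N \bigl(Y_{it}(a') - Y_{it}(a)\bigr).$$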

Abraham and Sun (2018) focus on slightly different building blocks, what they call , which, for , are the super-population equivalent of . The average causal effects are the building blocks of many of the estimands we consider later. A particularly interesting average effect is

$$\tau_{t,\infty 1} \equiv \frac{1}{N}\sum_{i=1}^N \bigl(Y_{it}(1) - Y_{it}(\infty)\bigr),$$

the average effect of switching the entire population from never adopting the policy ($a = \infty$), to adopting the policy in the first period ($a = 1$). Formally there is nothing special about the particular average effect $\tau_{t,\infty 1}$ relative to any other $\tau_{t,aa'}$, but $\tau_{t,\infty 1}$ will be useful as a benchmark. Part of the reason is that for all $t$ the comparison is between potential outcomes for adoption prior to or at time $t$ (namely adoption date $a = 1$) and potential outcomes for adoption later than $t$ (namely, never adopting, $a = \infty$). In contrast, any other average effect $\tau_{t,aa'}$ will for some $t$ involve comparing potential outcomes neither of which corresponds to having adopted the treatment yet, or comparing potential outcomes both of which correspond to having adopted the treatment already. Therefore, $\tau_{t,\infty 1}$ reflects more on the effect of having adopted the policy than any other $\tau_{t,aa'}$.

## 3 Assumptions

We consider three sets of assumptions. The first set, containing only a single assumption, is about the design, that is, the assignment of the treatment, here the adoption date, conditional on the potential outcomes and possibly pretreatment variables. We refer to this as a design assumption because it can be guaranteed by design. The second set of assumptions is about the potential outcomes, and rules out the presence of certain treatment effects. These exclusion restrictions are substantive assumptions, and they cannot be guaranteed by design. The third set of assumptions consists of four auxiliary assumptions, two about homogeneity of certain causal effects, one about sampling from a large population, and one about an outcome model in a large population. The nature of these three sets of assumptions, and their plausibility, is very different, and it is in our view useful to carefully distinguish between them. The current literature often combines various parts of these assumptions implicitly in the notation used and in assumptions about the statistical models for the realized outcomes.

### 3.1 The Design Assumption

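The first assumption is about the assignment process for the adoption date $A_i$. Our starting point is to assume that the adoption date is completely random:

###### Assumption 1.

(Random Adoption Date) For some set of positive integers $N_a$, for $a \in \{1,\ldots,T,\infty\}$,

$$\mathrm{pr}(\mathbf{A} = \mathbf{a}) = \left(\frac{N!}{\prod_{a} N_a!}\right)^{-1},$$

for all $N$-vectors $\mathbf{a}$ such that for all $a$, $\sum_{i=1}^N \mathbf{1}_{a_i = a} = N_a$.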

This assumption is obviously very strong. However, without additional assumptions that restrict either the potential outcomes, or expand what we observe, for example by including pre-treatment variables or covariates, this assumption has no testable implications in a setting with exchangeable units.

###### Lemma 1.

(No Testable Restrictions) Suppose all units are exchangeable. Then Assumption 1 has no testable implications for the joint distribution of the realized outcomes and adoption dates, $(\mathbf{Y}, \mathbf{A})$.

All proofs are given in the Appendix.

Hence, if we wish to relax the assumptions, we need to bring in additional information. Such additional information can come in the form of pretreatment variables, that is, variables that are known not to be affected by the treatment. In that case we can relax the assumption by requiring only that the adoption date is completely random within subpopulations with the same values for the pre-treatment variables. Additional information can also come in the form of limits on the treatment effects. The implications of such restrictions for the ability to relax the random adoption assumption are more complex, as discussed in more detail in Section 3.2.

Under Assumption 1 the marginal distribution of the adoption dates is fixed, and so the fraction $\pi_a$ of units with each adoption date $a$ is also fixed in the repeated sampling thought experiment. This part of the set up is similar in spirit to fixing the number of treated units in the sample in a completely randomized experiment. It is convenient for obtaining finite sample results. Note that it implies that the adoption dates for units $i$ and $i'$ are not independent. Note also that in the standard framework where the uncertainty arises solely from random sampling, this fraction does not remain constant in the repeated sampling thought experiment.
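To make the fixed-margins property concrete, here is a small sketch of one way to draw adoption dates under a random adoption date assumption: shuffle a fixed multiset of dates across units. The function name `draw_adoption_dates` and the counts per adoption date are hypothetical choices for illustration only.

```python
import numpy as np

def draw_adoption_dates(counts, rng):
    """Draw one assignment: a uniformly random permutation of a fixed multiset of dates."""
    dates = np.repeat(list(counts.keys()), list(counts.values()))
    return rng.permutation(dates)

rng = np.random.default_rng(0)
counts = {1.0: 2, 2.0: 3, np.inf: 5}   # hypothetical counts per adoption date, 10 units
draws = np.array([draw_adoption_dates(counts, rng) for _ in range(20000)])

# The number of never adopters is identical in every draw.
never_counts = np.sum(np.isinf(draws), axis=1)

# Adoption dates of two distinct units are negatively dependent:
# if unit 0 never adopts, fewer "never" slots remain for unit 1.
p_joint = np.mean(np.isinf(draws[:, 0]) & np.isinf(draws[:, 1]))
p_indep = np.mean(np.isinf(draws[:, 0])) * np.mean(np.isinf(draws[:, 1]))
```

In this sketch the joint probability that two given units both never adopt is (5/10)(4/9), which is below the product of the marginals (5/10)(5/10), so the simulated `p_joint` falls below `p_indep`.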

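An important role is played by what we label the adjusted treatment, adjusted for unit and time period averages:

$$\dot{W}_{it} \equiv W_{it} - \overline{W}_{\cdot t} - \overline{W}_{i\cdot} + \overline{W},$$

where $\overline{W}_{\cdot t}$, $\overline{W}_{i\cdot}$, and $\overline{W}$ are averages over units, time periods, and both, respectively:

$$\overline{W}_{\cdot t} \equiv \frac{1}{N}\sum_{i=1}^N W_{it} = \sum_{s \le t} \pi_s, \qquad \overline{W}_{i\cdot} \equiv \frac{1}{T}\sum_{t=1}^T W_{it} = \frac{(T+1)\,\mathbf{1}_{A_i \le T} - A_i\,\mathbf{1}_{A_i \le T}}{T},$$

and

$$\overline{W} \equiv \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T W_{it} = \frac{1}{T}\sum_{s=1}^T (T+1-s)\,\pi_s,$$

where, with some minor abuse of notation, we adopt the convention that $a\,\mathbf{1}_{a \le T}$ is zero if $a = \infty$. Note that under Assumption 1, $\overline{W}_{\cdot t}$ and $\overline{W}$ are non-stochastic. Using these representations we can write the adjusted treatment indicator as

$$\dot{W}_{it} = g(t, A_i),$$

where

$$g(t,a) \equiv \Bigl(\mathbf{1}_{a \le t} - \sum_{s \le t} \pi_s\Bigr) + \frac{1}{T}\Bigl(a\,\mathbf{1}_{a \le T} - \sum_{s=1}^T s\,\pi_s\Bigr) + \frac{T+1}{T}\bigl(\mathbf{1}_{a = \infty} - \pi_\infty\bigr). \tag{3.1}$$

Because the marginal distribution of $A_i$ is fixed under Assumption 1, the sum $\sum_{i,t} \dot{W}_{it}^2$ is non-stochastic under this assumption, even though $A_i$ and thus $\dot{W}_{it}$ are stochastic. This fact enables us to derive exact finite sample results for the standard DID estimator as discussed in Section 4. This is similar in spirit to the derivation of the exact variance for the estimator for the average treatment effect in completely randomized experiments when we fix the number of treated and controls.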
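The two claims in this passage can be checked numerically. The sketch below assumes the adjusted treatment is the exposure indicator with unit and period means removed and the grand mean added back, and it uses a closed form in the period and the adoption date that we derived from that definition. The function names `adjusted_treatment` and `g` and the eight adoption dates are hypothetical.

```python
import numpy as np

def adjusted_treatment(A, T):
    """Double-demean the exposure indicator 1{A_i <= t} over units and periods."""
    W = (A[:, None] <= np.arange(1, T + 1)[None, :]).astype(float)
    return W - W.mean(axis=0, keepdims=True) - W.mean(axis=1, keepdims=True) + W.mean()

def g(t, a, pi, T):
    """Closed form for the adjusted treatment; pi maps each adoption date to its fraction."""
    finite = [s for s in pi if np.isfinite(s)]
    term1 = float(a <= t) - sum(pi[s] for s in finite if s <= t)
    term2 = ((a if np.isfinite(a) else 0.0) - sum(s * pi[s] for s in finite)) / T
    term3 = (T + 1) / T * (float(np.isinf(a)) - pi.get(np.inf, 0.0))
    return term1 + term2 + term3

rng = np.random.default_rng(1)
T = 4
A = np.array([1, 2, 2, 3, 4, 4, np.inf, np.inf])   # hypothetical adoption dates
pi = {a: np.mean(A == a) for a in np.unique(A)}

W_dot = adjusted_treatment(A, T)
W_closed = np.array([[g(t, a, pi, T) for t in range(1, T + 1)] for a in A])

# Re-randomizing who gets which date changes W_dot but not its sum of squares.
ss_original = np.sum(W_dot ** 2)
ss_permuted = np.sum(adjusted_treatment(rng.permutation(A), T) ** 2)
```

Because the closed form depends on a unit only through its own adoption date, permuting the dates across units merely reorders the rows of the adjusted treatment matrix, which is why the sum of squares is unchanged.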

### 3.2 Exclusion Restrictions

The next two assumptions concern the potential outcomes. Their formulation does not involve the assignment mechanism, that is, the distribution of the adoption date. In essence these are exclusion restrictions, assuming that particular causal effects are absent. Collectively these two assumptions imply that we can think of the treatment as a binary one, the only relevant component of the adoption date being whether a unit is exposed to the treatment at the time we measure the outcome. Versions of such assumptions are also considered in Borusyak and Jaravel (2016); de Chaisemartin and D’Haultfœuille (2018); Abraham and Sun (2018); Hazlett and Xu (2018) and Imai and Kim (2016), where in the latter a graphical approach is taken in the spirit of the work by Pearl (2000).

The first of the two assumptions, and likely the more plausible of the two in practice, rules out effects of future adoption dates on current outcomes. More precisely, it assumes that if the policy has not been adopted yet, the exact future date of the adoption has no causal effect on potential outcomes for the current period.

###### Assumption 2.

(No Anticipation) For all units $i$, all time periods $t$, and for all adoption dates $a$ such that $a > t$,

$$Y_{it}(a) = Y_{it}(\infty).$$

We can also write this assumption as requiring that for all $(i, t, a)$,

$$Y_{it}(a) = \mathbf{1}_{a \le t}\, Y_{it}(a) + \mathbf{1}_{a > t}\, Y_{it}(\infty), \qquad \text{or} \qquad \mathbf{1}_{a > t}\,\bigl(Y_{it}(a) - Y_{it}(\infty)\bigr) = 0,$$

with the last representation showing most clearly how the assumption rules out certain causal effects. Note that this assumption does not involve the adoption date, and so does not restrict the distribution of the adoption dates. Violations of this assumption may arise if the policy is anticipated prior to its implementation.

The next assumption is arguably much stronger. It asserts that for potential outcomes in period $t$ it does not matter how long the unit has been exposed to the treatment, only whether the unit is exposed at time $t$.

###### Assumption 3.

(Invariance to History) For all units $i$, all time periods $t$, and for all adoption dates $a$ such that $a \le t$,

$$Y_{it}(a) = Y_{it}(1).$$

This assumption can also be written as

$$Y_{it}(a) = \mathbf{1}_{a \le t}\, Y_{it}(1) + \mathbf{1}_{a > t}\, Y_{it}(a), \qquad \text{or} \qquad \mathbf{1}_{a \le t}\,\bigl(Y_{it}(a) - Y_{it}(1)\bigr) = 0,$$

with again the last version of the assumption illustrating the exclusion restriction in this assumption. Again, the assumption does not rule out any correlation between the potential outcomes and the adoption date, only that there is no causal effect of an early adoption versus a later adoption on the outcome in period $t$, as long as adoption occurred before or in period $t$.

In general, this assumption is very strong. However, there are important cases where it may be more plausible. Suppose the units are clusters of individuals, where in each period we observe different sets of individuals. To be specific, suppose the units are states, the time periods are years, the outcome is the employment rate for twenty-five year olds, and the treatment is the presence or absence of some regulation, say a subsidy for college tuition. In that case it may well be reasonable to assume that the educational choices for students graduating high school in a particular state depend on what the prevailing subsidy is, but much less on the presence of subsidies in previous years.

If both the exclusion restrictions, that is, both Assumptions 2 and 3, hold, then the potential outcome $Y_{it}(a)$ can be indexed by the binary indicator $W(a,t) = \mathbf{1}_{a \le t}$:

###### Lemma 2.

If these two assumptions hold, we can therefore simplify the notation for the potential outcomes and focus on and .
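The reduction to a binary treatment can be spelled out in two cases. The display below is a sketch in notation we assume here: $Y_{it}(a)$ is the potential outcome of unit $i$ in period $t$ under adoption date $a$, and $a = \infty$ codes never adopting.

```latex
\begin{aligned}
a > t:   &\quad Y_{it}(a) = Y_{it}(\infty) && \text{(no anticipation)},\\
a \le t: &\quad Y_{it}(a) = Y_{it}(1)      && \text{(invariance to history)},\\
\text{hence} &\quad Y_{it}(a) = Y_{it}(\infty)
  + \mathbf{1}\{a \le t\}\,\bigl(Y_{it}(1) - Y_{it}(\infty)\bigr)
  && \text{for all } a.
\end{aligned}
```

Under both restrictions the adoption date therefore matters for the period-$t$ outcome only through whether adoption has occurred by $t$.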

Note that these two assumptions are substantive, and cannot be guaranteed by design. This is in contrast to Assumption 1, which can be guaranteed by randomization of the adoption date. It is also important to note that in many empirical studies Assumptions 2 and 3 are made, often implicitly by writing a model for the realized outcome that depends solely on the contemporaneous treatment exposure $W_{it}$, and not on the actual adoption date $A_i$ or treatment exposure $W_{it'}$ in other periods $t' \neq t$. In the current discussion we want to be explicit about the fact that this restriction is an assumption, and that it does not automatically hold. Note that the assumption does not restrict the time series dependence between the potential outcomes.

It is trivial to see that without additional information, the exclusion restrictions in Assumptions 2 and 3 have no testable implications because they impose restrictions on pairs of potential outcomes that cannot be observed together. However, in combination with random assignment (Assumption 1), Assumptions 2 and 3 do have testable implications as long as $T \ge 2$ and there is some variation in the adoption date.

### 3.3 Auxiliary Assumptions

In this section we consider four auxiliary assumptions that are convenient for some analyses, and in particular can have implications for the variance of specific estimators, but that are not essential in many cases. These assumptions are often made in empirical analyses without researchers explicitly discussing them.

The first of these assumptions assumes that the effect of adoption date $a'$, relative to adoption date $a$, on the outcome in period $t$, is the same for all units.

###### Assumption 4.

(Constant Treatment Effect Over Units) For all units $i$ and $j$, and for all time periods $t$ and all adoption dates $a$ and $a'$,

$$Y_{it}(a') - Y_{it}(a) = Y_{jt}(a') - Y_{jt}(a).$$

The second assumption restricts the heterogeneity of the treatment effects over time.

###### Assumption 5.

(Constant Treatment Effect over Time) For all units $i$ and all time periods $t$ and $t'$,

$$Y_{it}(1) - Y_{it}(\infty) = Y_{it'}(1) - Y_{it'}(\infty).$$

We only restrict the time variation for comparisons of the adoption dates $1$ and $\infty$ because we typically use this assumption in combination with Assumptions 2 and 3. In that case we obtain a constant binary treatment effect set up, as summarized in the following Lemma.

###### Lemma 4.

The final assumption allows us to view the potential outcomes as random by postulating a large population from which the sample is drawn.

###### Assumption 6.

(Random Sampling) The sample can be viewed as a random sample from an infinitely large population, with joint distribution for denoted by .

Under this assumption we can put additional structure on average potential outcomes.

###### Assumption 7.

(Additivity)

## 4 Difference-In-Differences Estimators: Interpretation and Inference

In this section we consider the standard DID set up (e.g., Meyer et al. (1995); Bertrand et al. (2004); Angrist and Pischke (2008); Donald and Lang (2007); de Chaisemartin and D’Haultfœuille (2018)). In the simplest setting with $N$ units and $T$ time periods, without additional covariates, the realized outcome in period $t$ for unit $i$ is modeled as

$$Y_{it} = \alpha_i + \beta_t + \tau W_{it} + \varepsilon_{it}. \tag{4.1}$$

In this model there are unit effects $\alpha_i$ and time effects $\beta_t$, but both are additive with interactions between them ruled out. The effect of the treatment is implicitly assumed to be additive and constant across units and time periods.

We interpret the DID estimand under the randomized adoption date assumption, leading to a different setting from that considered in de Chaisemartin and D’Haultfœuille (2018); Abraham and Sun (2018); Goodman-Bacon (2017). We also derive its variance and show that in general it is lower than the standard random-sampling based variance. Finally we propose a variance estimator that is smaller than the regular variance estimators such as the Liang-Zeger and clustered bootstrap variance estimators.

### 4.1 Difference-In-Differences Estimators

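Consider the least squares estimator for $\tau$ based on the specification in (4.1):

$$\bigl(\hat{\tau}^{\mathrm{did}}, \{\hat{\alpha}_i\}, \{\hat{\beta}_t\}\bigr) = \arg\min_{\tau, \{\alpha_i\}, \{\beta_t\}} \sum_{i=1}^N \sum_{t=1}^T \bigl(Y_{it} - \alpha_i - \beta_t - \tau W_{it}\bigr)^2.$$

It is convenient to write $\hat{\tau}^{\mathrm{did}}$ in terms of the adjusted treatment indicator $\dot{W}_{it}$ as

$$\hat{\tau}^{\mathrm{did}} = \frac{\sum_{i,t} \dot{W}_{it}\, Y_{it}}{\sum_{i,t} \dot{W}_{it}^2}.$$

The primary question of interest in this section concerns the properties of the estimator $\hat{\tau}^{\mathrm{did}}$. This includes the interpretation of its expectation under various sets of assumptions, and its variance. Mostly we focus on exact properties in finite samples.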
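The equivalence between the two-way fixed effects regression and the ratio based on the adjusted treatment can be verified on a toy balanced panel. In the sketch below the adoption dates, the outcome-generating process, and all variable names are hypothetical; only the algebraic equivalence (a standard partialling-out result for balanced panels) is being illustrated.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 6, 4
A = np.array([1, 2, 3, 3, np.inf, np.inf])           # hypothetical adoption dates
W = (A[:, None] <= np.arange(1, T + 1)).astype(float)
Y = rng.normal(size=(N, T)) + 1.5 * W                # hypothetical realized outcomes

# (a) Least squares with the treatment, unit dummies, and time dummies (one dropped).
unit_d = np.kron(np.eye(N), np.ones((T, 1)))         # (N*T) x N
time_d = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]  # (N*T) x (T-1)
X = np.column_stack([W.reshape(-1), unit_d, time_d])
coef, *_ = np.linalg.lstsq(X, Y.reshape(-1), rcond=None)
tau_ols = coef[0]

# (b) Ratio form using the double-demeaned treatment indicator.
W_dot = W - W.mean(axis=0, keepdims=True) - W.mean(axis=1, keepdims=True) + W.mean()
tau_ratio = np.sum(W_dot * Y) / np.sum(W_dot ** 2)
```

Both routes return the same number up to floating point error, so the ratio form can be used whenever only the treatment coefficient is of interest.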

In order to interpret the expected value of $\hat{\tau}^{\mathrm{did}}$ we consider some intermediate objects. Define, for all adoption dates $a$, and all time periods $t$, the average of the outcome in period $t$ for units with adoption date $a$:

$$\overline{Y}_{t,a} \equiv \frac{1}{N_a} \sum_{i: A_i = a} Y_{it}.$$

Under Assumption 1 the stochastic properties of these averages are well-defined because the $N_a$ are fixed over the randomization distribution. The averages are stochastic because the realized outcomes depend on the adoption date. Define also the following two differences between outcome averages:

In general these differences do not have a causal interpretation. Such an interpretation requires some assumptions, for example, on random assignment of the adoption date.

Example: To facilitate the interpretation of some of the results it is useful to consider a special case where the results from completely randomized experiments directly apply. Suppose , and , with a fraction adopting the policy in the second period. Suppose also that for all and . Then the DID estimator is

the simple difference in means for the second period outcomes for adopters and non-adopters. Under Assumption 1, the standard results for the variance of the difference in means for randomized experiments apply (e.g., Neyman (1923/1990); Imbens and Rubin (2015)), and the exact variance of the estimator is,

The standard Neyman estimator for this variance ignores the third term, and uses unbiased estimators for the first two terms, leading to:
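As a hedged illustration of the Neyman result invoked in this example, the sketch below enumerates every assignment in a tiny completely randomized experiment with fixed potential outcomes. The potential outcome values and the group sizes are made up; the variance formula used is the textbook one, with a treated-variance term, a control-variance term, and a subtracted term for the variance of unit-level effects.

```python
import numpy as np
from itertools import combinations

# Hypothetical fixed second-period potential outcomes for 6 units.
y0 = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 2.5])   # never adopt
y1 = np.array([2.0, 2.5, 2.5, 3.5, 1.0, 4.0])   # adopt in the second period
N, N1 = len(y0), 3
N0 = N - N1

# Exact randomization distribution: enumerate every set of N1 adopters.
estimates, neyman = [], []
for treated in combinations(range(N), N1):
    t = np.zeros(N, dtype=bool)
    t[list(treated)] = True
    estimates.append(y1[t].mean() - y0[~t].mean())
    neyman.append(y1[t].var(ddof=1) / N1 + y0[~t].var(ddof=1) / N0)

exact_var = np.var(estimates)                    # variance over all assignments
formula_var = (y1.var(ddof=1) / N1 + y0.var(ddof=1) / N0
               - (y1 - y0).var(ddof=1) / N)      # three-term exact variance
expected_neyman = np.mean(neyman)                # expectation of the two-term estimator
```

The enumerated variance matches the three-term formula, while the two-term estimator is on average too large by exactly the dropped third term, which is the sense in which it is conservative.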

### 4.2 The Interpretation of Difference-In-Differences Estimators

The following weights play an important role in the interpretation of the DID estimand:

(4.2) |

with as defined in (3.1). Note that these weights are non-stochastic, that is, fixed over the randomization distribution.

Example (ctd): Continuing the example with two periods and adoption in the second period or never, we have in that case

The weights have some important properties,

Now we can state the first main result of the paper.

###### Lemma 5.

We can write as

(4.3) |

###### Comment 2.

The lemma implies that the DID estimator has an interpretation as a weighted average of simple estimators for the causal effect of changes in adoption dates, the . Moreover, the estimator can be written as the sum of three averages of these . The first is a weighted average of the , which are all averages of switching from never adopting to adopting in the first period, meaning that these are averages of changes in adoption dates that involve switching from not being treated at time to being treated at time . The sum of the weights for these averages is one, although not all the weights are necessarily non-negative. The second sum is a weighted sum of , for , so that the causal effect always involves changing the adoption date from never adopting to adopting some time after , meaning that the comparison is between potential outcomes neither of which involves being treated at the time. The sum of the weights for these averages is one again. The third sum is a weighted sum of , for , so that the causal effect always involves changing the adoption date from adopting prior to, or at time, relative to adopting at the initial time, meaning that the comparison is between potential outcomes both of which involve being treated at the time. These weights sum to minus one.

If we are willing to make the random adoption date assumption we can give this representation a causal interpretation:

###### Theorem 1.

Part of the theorem where we make the no-anticipation assumption is closely related to one of the results in Abraham and Sun (2018), who make a super-population common trend assumption that, in the super-population context, weakens our random adoption date assumption. Part of the theorem, where we assume both the exclusion restrictions so that the treatment is effectively a binary one, is related to the results in de Chaisemartin and D’Haultfœuille (2018), although unlike those authors we do not restrict the trends in the potential outcomes.

Without either Assumption 2 or Assumption 3, the estimand has a causal interpretation, but it is not clear it is a very interesting one concerning the receipt of the treatment. With the no-anticipation assumption (Assumption 2), the interpretation, as given in part of the theorem, is substantially more interesting. Now the estimand is a weighted average of for , with weights summing to one. These are the average causal effects of changing the adoption date from never adopting to some adoption date prior to, or equal to, time , so that the average always involves switching from not being exposed to the treatment to being exposed to the treatment.

### 4.3 The Randomization Variance of the Difference-In-Differences Estimators

In this section we derive the randomization variance for under the randomized adoption date assumption. We do not rely on other assumptions here, although they may be required for making the estimand a substantively interesting one. The starting point is the representation . Because under Assumption 1 the weights are fixed, the variance is

Note that the are known. Working out the variance , and finding an unbiased estimator for it, is straightforward. It is more challenging to infer the covariance terms and even more difficult to estimate them. In general that is not possible. Note that for a sampling-based variance the are not fixed, because in different samples the fractions with a particular adoption date will be stochastic. This in general leads to a larger variance, as we verify in the simulations.
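The randomization distribution described here can be computed exactly in a very small panel by enumerating all assignments of a fixed multiset of adoption dates. The sketch below assumes, purely for illustration, potential outcomes with a constant additive effect `tau` once a unit has adopted; the function names `did` and `realized`, the panel size, and the dates are hypothetical.

```python
import numpy as np
from itertools import permutations

def did(Y, A):
    """DID estimator written as a ratio in the double-demeaned exposure indicator."""
    T = Y.shape[1]
    W = (A[:, None] <= np.arange(1, T + 1)).astype(float)
    W_dot = W - W.mean(axis=0, keepdims=True) - W.mean(axis=1, keepdims=True) + W.mean()
    return np.sum(W_dot * Y) / np.sum(W_dot ** 2)

rng = np.random.default_rng(3)
N, T, tau = 6, 3, 2.0
base = rng.normal(size=(N, T))                  # fixed never-adopt outcomes (hypothetical)
dates = (2.0, 2.0, 3.0, 3.0, np.inf, np.inf)    # fixed multiset of adoption dates

def realized(A):
    # Constant additive effect once adopted (an illustrative assumption).
    return base + tau * (A[:, None] <= np.arange(1, T + 1))

assignments = sorted(set(permutations(dates)))  # all distinct assignments
estimates = np.array([did(realized(np.array(a)), np.array(a)) for a in assignments])

exact_mean = estimates.mean()   # expectation over the randomization distribution
exact_var = estimates.var()     # exact randomization variance
```

In this constant-effect design the estimator is exactly unbiased for `tau` over the randomization distribution, and all of its variability comes from which units receive which adoption date, since the potential outcomes are held fixed.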

Define

Now we can write as

Define also

and

###### Theorem 2.

Comment (ctd): In our two period example with some units adopting in the second period and the others not at all, and , we have

so that in this special case

which agrees with the Neyman variance for a completely randomized experiment.

### 4.4 Estimating the Randomization Variance of the Difference-In-Differences Estimators

In this section we discuss estimating the variance of the DID estimator. In general there is no unbiased estimator for . This is not surprising, because there is no such estimator for the simple difference in means estimator in a completely randomized experiment, and this corresponds to the special case with . However, it turns out that just like in the simple randomized experiment case, there is a conservative variance estimator. In the current case it is based on using unbiased estimators for the terms involving , and ignoring the terms involving . Because the latter are non-negative, and enter with a minus sign, ignoring them leads to an upwardly biased variance estimator. One difference with the simple randomized experiment case is that there is no simple case with constant treatment effects such that the variance estimator is unbiased.

Next, define the estimated variance of this by adoption date:

Now we can characterize the proposed variance estimator as

###### Theorem 3.

There are two important issues regarding this variance estimator. The first is its relation to the standard variance estimator for DID estimators. The second is whether one can improve on this variance estimator given that in general it is conservative.

The relevant variance estimators are the Liang-Zeger clustered variance estimator and the clustered bootstrap (Bertrand et al. (2004); Liang and Zeger (1986)). Both have large sample justifications under random sampling from a large population, so they are in general not equal to the variance estimator here. In large samples both the Liang-Zeger and bootstrap variance will be more conservative than because they also take into account variation in the weights . These weights are kept fixed under the randomization scheme, because that keeps fixed the marginal distribution of the adoption dates. In contrast, under the Liang-Zeger calculations and the clustered bootstrap, the fraction of units with a particular adoption date varies, and that introduces additional uncertainty.
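The source of the extra uncertainty can be seen by comparing how the share of never adopters behaves under the two resampling schemes. This is only a sketch: the 40 adoption dates are hypothetical, and the clustered bootstrap is represented simply as resampling whole units with replacement.

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([1.0] * 10 + [2.0] * 10 + [np.inf] * 20)   # hypothetical adoption dates
N = len(A)

# Clustered bootstrap: resample whole units with replacement.
boot_share = np.array([
    np.mean(np.isinf(A[rng.integers(0, N, size=N)])) for _ in range(2000)
])

# Randomization scheme: permute dates across units, keeping their counts fixed.
perm_share = np.array([
    np.mean(np.isinf(rng.permutation(A))) for _ in range(2000)
])
```

The permutation shares are identical in every draw, whereas the bootstrap shares fluctuate around one half, so any quantity that depends on the adoption fractions inherits additional variability under the bootstrap.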

The second issue is whether we can improve on the conservative variance estimator . In general there is only a limited ability to do so. Note, for example, that in the two period example this variance reduces to the Neyman variance in randomized experiments. In that case we know we can improve on this variance a little bit exploiting heteroskedasticity, e.g., Aronow et al. (2014), but in general those gains are modest.

## 5 Some Simulations

The goal is to compare the exact variance, and the corresponding estimator in the paper, to the two leading alternatives, the Liang-Zeger (Stata) clustered standard errors and the clustered bootstrap. We want to confirm settings where the proposed variance estimator differs from the Liang-Zeger clustered variance, and settings where it is the same. We have units, observed for time periods. We focus primarily on the case with . The adoption date is randomly assigned, with , and . We consider three designs for the potential outcome distributions in the population, the for . In design A the potential outcomes, are generated as

In this design the treatment effect is constant, and depends only on whether the adoption date precedes the potential outcome date, or

where the are correlated over time.

In design B the potential outcomes are generated as

Here the treatment effects depend on the treatment having been adopted, but the effect differs by the adoption date.

In design C the potential outcomes are generated with positive correlations between the potential outcomes as

###### Comment 1.

Alternative characterizations of the DID estimator or estimand as a weighted average of potentially causal comparisons are presented in Abraham and Sun (2018); de Chaisemartin and D’Haultfœuille (2018); Han (2018); Goodman-Bacon (2017), and Borusyak and Jaravel (2016). The characterizations differ in terms of the building blocks that are used in the representation and the assumptions made. Like our representation, the representation in Abraham and Sun (2018) is in terms of average causal effects of different adoption dates, but it imposes no-anticipation. Goodman-Bacon (2017) presents the DID estimator in terms of basic two-group DID estimators. Like our representation, the Goodman-Bacon (2017) representation is mechanical and does not rely on any assumptions. To endow the building blocks and the representation itself with a causal interpretation requires some assumption on, for example, the assignment mechanism. □