Estimating heterogeneous causal effects in time series settings with staggered adoption: An application to neighborhood policing

06/13/2020, by Joseph Antonelli, et al.

Communities often self select into implementing a regulatory policy, and adopt the policy at different time points. Researchers are interested in (1) evaluating the impact of the policy, and (2) understanding what types of communities are most impacted by the policy, raising questions of heterogeneous treatment effects. We develop novel statistical approaches to study the causal effect of policies implemented at the community level. Using techniques from high-dimensional Bayesian time-series modeling, we estimate treatment effects by predicting counterfactual values of what would have happened in the absence of the policy. We couple the posterior predictive distribution of the treatment effect with flexible modeling to identify how the impact of the policy varies across time and community characteristics. This allows us to identify effect modifying variables and capture nonlinear heterogeneous treatment effects. Importantly, our approach is robust to unmeasured confounding bias. Our methodology is motivated by studying the effect of neighborhood policing on arrest rates in New York City. Using realistic simulations based on the policing data in New York City, we show our approach produces unbiased estimates of treatment effects with valid measures of uncertainty. Lastly, we find that neighborhood policing slightly decreases arrest rates immediately after treatment adoption, but has little to no effect in other time periods or on other outcomes of interest.




1 Introduction

A common scientific goal is to understand the impact of a particular policy or treatment occurring over time. The gold standard for estimating causal effects is the randomized controlled trial (RCT); however, RCTs are frequently unavailable due to ethical or cost considerations (Victora et al., 2004). Instead, observational data, in which subjects were not randomized to any particular policy or treatment, are typically used to infer causality. While this greatly broadens the scope of questions that can be empirically evaluated, it comes with far more assumptions than a traditional RCT, and any violation of these assumptions can invalidate results. When data are observed repeatedly over time, both before and after a particular intervention is applied, the assumptions required to infer causality are different and potentially more believable than those from single time point studies. In this paper, we consider the problem of estimating the effect of neighborhood policing on arrest rates in New York City (NYC) over the years 2006-2018. Neighborhood policing is a policy implemented at the police precinct level in which the New York Police Department (NYPD) restructured its precincts and hired police officers whose priority is community engagement and problem solving rather than responding to 911 calls. Dedicated officers build community relationships and patrol a small sector with the twin goals of reducing crime and promoting trust between the police and constituents (Bratton, 2015). The City began implementation in May 2015 with two of the 77 police precincts, and all police precincts had adopted neighborhood policing by October 2018. In the statistics and economics literature these types of policy implementations are sometimes referred to as “staggered adoption” (Athey and Imbens, 2018; Shaikh and Toulis, 2019; Ben-Michael et al., 2019). 
Our overarching goal is two-fold: 1) to understand the causal effect of this policy on crime and arrest rates, and 2) to estimate how the effect varies over time and how communities with different characteristics might respond differently.

1.1 Review of neighborhood policing

Neighborhood policing is an updated version of an approach commonly used in the 1980s: community policing. Community policing de-emphasizes traditional police actions like arrests and prioritizes problem solving with community members. Community policing has been popular for several decades, but research on its efficacy and consequences is mixed. One reason for the unclear findings is that “many of the [past] studies were characterized by weak evaluation designs” (National Academies of Sciences, Engineering, and Medicine, 2018, p. 7). In Gill et al. (2014) a meta-analysis found that many studies compared only one treated and one control neighborhood. When looking at studies with rigorous pre- and post-intervention measures, they found that community policing did not have a significant effect on crime, and concluded there is “a need for further research around community policing” (Gill et al., 2014, p.399). Another review of studies found that high levels of police-community contact had the promise to reduce crime, but such findings did not hold up when examining only RCTs (Sherman and Eck, 2003).

Despite these null findings of community policing’s effect on crime, the NYPD made crime reduction the first goal of its new community-oriented policy (O’Neill, 2018). When announcing the policy, Mayor Bill de Blasio suggested several reasons why the policy might decrease crime, such as improved relationships between police and community members, the “constant vigilance” of neighborhood policing discouraging crime, and officers being more inclined to help would-be criminals desist from crime (de Blasio, 2015). One issue that has not been evaluated in previous research is the potential for unintended consequences of neighborhood policing, such as changes in misdemeanor arrests and in the racial disparity of misdemeanor arrests.

Neighborhood policing is designed to increase contact between police and community members. The hope is that by developing relationships with constituents, police will garner the respect of the community, serve as positive role models, and develop informant networks. Neighborhood policing promotes problem-solving as the measure of an officer’s success and thereby eschews the traditional policing approach of treating arrests as the most important sign of productivity. So, the new practice could decrease arrests and racial disparities by reorienting the goals of officers and managers away from arrests. It is possible, however, that the increased police-community contact the policy generates will also increase the amount of a traditional police outcome: arrests. With more exposure to the community, especially in a city with a history of intense misdemeanor enforcement like New York, police might fall back on their more arrest-intensive practices. What’s more, if police are deployed in higher numbers to predominantly black or Latino neighborhoods and/or if individual bias makes officers more likely to arrest black or Latino residents, neighborhood policing might increase the racial disparity in low-level, discretionary arrests through this increased contact. In this paper, we analyze the effects of neighborhood policing on each of these outcomes and aim to answer each of these questions.

1.2 Statistical literature on causal inference and panel data

There is an extensive literature on the estimation of causal effects from observational, time-series data. The overarching goal across all methods is relatively similar, which is to leverage the repeated nature of the data in order to identify the counterfactual time series that is not observed. Interrupted time series (ITS) methods are one of the most common methods for estimating causal effects in such situations and have been used for decades in economics and epidemiology (Campbell and Cook, 1979; Gillings et al., 1981). See Bernal et al. (2017) for both a review of ITS and practical details on their implementation. Intuitively, ITS methods fit a time-series model that allows the outcome to depend on time and an indicator of whether the policy has been initiated. Under certain causal assumptions, the effect of the policy at any particular time can be obtained from the parameters of the model. Similar ideas have been extended to far more complex problems, such as estimating spillover effects in time series studies by using flexible outcome regression modeling (Bojinov et al., 2019).
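As a concrete sketch, the segmented-regression form of an ITS analysis can be written in a few lines. Everything below is simulated and illustrative (the variable names, effect sizes, and intervention month are not from the paper):

```python
import numpy as np

# Illustrative interrupted time series: a segmented regression with a level
# change and a slope change at the intervention time, fit by ordinary least
# squares on simulated monthly data.
rng = np.random.default_rng(0)
T, t0 = 120, 80                       # months of data, intervention month
t = np.arange(T)
post = (t >= t0).astype(float)        # indicator: policy in effect
y = 50 + 0.1 * t - 5.0 * post + 0.05 * post * (t - t0) + rng.normal(0, 1, T)

# Design matrix: intercept, time trend, level change, post-intervention slope
X = np.column_stack([np.ones(T), t, post, post * (t - t0)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
level_change, slope_change = beta[2], beta[3]
```

Under the causal assumptions discussed above, `level_change` and `slope_change` estimate the immediate and gradual effects of the policy, respectively.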

Another commonly used method for identifying causal effects from observational time series data is the difference in differences (DiD) design (Ashenfelter, 1978; Angrist and Pischke, 2008). Traditionally these methods have been employed in settings with two groups and two time points. In the first time point, neither group has received treatment yet, while in the second time point only one group has received treatment. These designs rely on the parallel trends assumption, which states that the counterfactual outcome under no treatment has the same trend over time in both groups. This assumption can be weakened to only hold conditionally on observed covariates; however, most DiD methods rely on some form of this assumption. The parallel trends assumption allows for estimation of what would have happened to the treatment group had they not received treatment, the fundamental problem of causal inference in this setting. For a review of DiD methods and some of their many extensions to more complex settings, see Lechner et al. (2011) and references within. Of most relevance to the problem at hand are methods that account for the staggered adoption of treatment over time, in which units adopt treatment at different time points. This is a problem that has seen a spike in interest recently within DiD methodology (Athey and Imbens, 2018; Goodman-Bacon, 2018; Callaway and Sant’Anna, 2019). These methods focus on first defining relevant causal estimands in the presence of multiple time periods where treatment is initiated. Correspondences between these estimators and traditional DiD estimators are highlighted, and multiple different inferential strategies are utilized to provide uncertainty measures for treatment effects. Athey and Imbens (2018) take a design-based approach to causal inference where uncertainty only stems from the treatment assignment, and derive conservative variance estimators in this setting. Callaway and Sant’Anna (2019) highlight new causal estimands unique to the multiple time point setting, and derive a novel bootstrap approach to inference that is able to account for correlation and clustering within their data, which is common in time series analyses.
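The canonical two-group, two-period DiD contrast is simple to compute. The sketch below uses simulated data with a shared time trend, so the parallel trends assumption holds by construction (all numbers are illustrative):

```python
import numpy as np

# Two-group, two-period difference-in-differences on simulated outcomes.
# Both groups share the same time trend (+2.0), so the DiD contrast
# recovers the treatment effect.
rng = np.random.default_rng(1)
n = 500
effect = -3.0
control_pre  = rng.normal(10.0, 1.0, n)
control_post = rng.normal(12.0, 1.0, n)                   # trend only
treated_pre  = rng.normal(15.0, 1.0, n)
treated_post = rng.normal(15.0 + 2.0 + effect, 1.0, n)    # trend + effect

# (change in treated group) minus (change in control group)
did = (treated_post.mean() - treated_pre.mean()) - \
      (control_post.mean() - control_pre.mean())
```

The subtraction of the control group's change removes the common time trend, which is exactly what the parallel trends assumption licenses.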

Another methodological area that can be used to tackle the problem of staggered adoption is the use of synthetic controls (Abadie et al., 2010, 2015). Synthetic controls were initially developed for the setting where only one unit receives treatment, and its potential outcome under control is estimated using a weighted average of control units, with weights estimated using data prior to treatment initiation. This method has been extended to multiple treated units with staggered adoption by applying the synthetic control method separately to each treated unit (Dube and Zipperer, 2015; Donohue et al., 2019). This approach has been shown not to be optimal when interest lies in average treatment effects, and was extended to the staggered adoption regime in Ben-Michael et al. (2019).
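A minimal sketch of the synthetic control idea: match the treated unit's pre-treatment series with a weighted average of controls, under the usual constraint that weights are nonnegative and sum to one. Here the weights are found with a simple projected gradient loop on simulated data (real implementations typically add covariate matching and regularization):

```python
import numpy as np

# Synthetic control weights via projected gradient descent on the simplex.
# Data are simulated: the treated unit is a noisy mixture of two controls.
rng = np.random.default_rng(2)
T_pre, n_ctrl = 60, 10
controls = rng.normal(0, 1, (T_pre, n_ctrl)).cumsum(axis=0)  # random-walk series
true_w = np.zeros(n_ctrl); true_w[:2] = [0.7, 0.3]
treated = controls @ true_w + rng.normal(0, 0.05, T_pre)

def project_simplex(v):
    # Euclidean projection onto {w : w >= 0, sum(w) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0)

w = np.full(n_ctrl, 1.0 / n_ctrl)
step = 1.0 / np.linalg.norm(controls, 2) ** 2
for _ in range(5000):
    grad = controls.T @ (controls @ w - treated)
    w = project_simplex(w - step * grad)

synthetic = controls @ w   # counterfactual path for the treated unit
```

Post-treatment, `synthetic` would be extrapolated forward and compared against the treated unit's observed series to estimate the effect.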

More recently, time series methods have been used to estimate causal effects by forecasting what would happen in the absence of treatment (Brodersen et al., 2015; Papadogeorgou et al., 2018; Miratrix et al., 2019). The original approach in Brodersen et al. (2015), termed causal impact, applies when the observed data consists of one time series measured over many time points both in the pre- and post-treatment time periods. The pre-treatment data is used to estimate a Bayesian state-space model that is then used to predict the counterfactual outcome under control during the post-treatment period. Utilizing state-space models to estimate causal effects was extended to settings with many time series in Li and Bühlmann (2018). This approach extends the work of causal impact to heterogeneous treatment effects, and to settings with both treated and control units.
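The forecasting logic of these approaches can be sketched with an ordinary AR(1) fit standing in for the Bayesian state-space machinery; the data, model, and effect size below are all illustrative:

```python
import numpy as np

# Forecast-based causal effect estimation: fit a simple AR(1) model on
# pre-treatment data, forecast the untreated counterfactual forward, and take
# the effect as observed minus forecast. Simulated data throughout.
rng = np.random.default_rng(3)
T, t0, effect = 200, 150, -4.0
mu, phi = 20.0, 0.8
y = np.empty(T); y[0] = mu
for t in range(1, T):
    y[t] = mu + phi * (y[t - 1] - mu) + rng.normal(0, 1.0)
y[t0:] += effect                      # policy shifts the observed series

# Fit AR(1) by regressing y_t on y_{t-1}, pre-treatment data only
X = np.column_stack([np.ones(t0 - 1), y[:t0 - 1]])
(a, b), *_ = np.linalg.lstsq(X, y[1:t0], rcond=None)

# Iterated forecast of the counterfactual after t0
forecast = np.empty(T - t0); prev = y[t0 - 1]
for k in range(T - t0):
    prev = a + b * prev
    forecast[k] = prev

est_effect = (y[t0:] - forecast).mean()
```

The Bayesian versions replace the point forecast with a full posterior predictive distribution, which is what provides uncertainty quantification for the effect.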

1.3 Review of our contribution

We develop a Bayesian framework for causal inference with time series in the presence of staggered adoption that allows estimation of overall causal effects as well as heterogeneous treatment effects that vary over time or across precinct characteristics. Our paper extends the existing literature in a number of ways. Synthetic controls require there to be units without the treatment, while in our study every precinct adopts the policy by the end of the study. Many of the estimators proposed in the difference in differences literature with staggered treatment target average treatment effects of a policy, while our goal is to estimate conditional treatment effects that are functions of observed precinct characteristics. Further, we have large amounts of pre-treatment data for each subject, and difference in differences approaches do not utilize all of this information to efficiently estimate the causal effects of interest. Li and Bühlmann (2018) estimate heterogeneous treatment effects in time-series settings, but they also require both treated and control units, while we only observe treated units. Additionally, they rely on an assumption of no unmeasured confounding, while we show our approach is not susceptible to confounding bias from unmeasured covariates. Another challenge we address is that our data at the precinct level are highly correlated across space, as neighboring precincts tend to function in similar manners with respect to crime and arrest numbers. We need an approach that can account for spatial correlation in the data when performing inference. We use multivariate Bayesian time series models that allow for a high-dimensional set of observed time series to produce posterior predictive distributions of subject-specific treatment effects. By using the Bayesian paradigm, we are able to produce flexible estimates of conditional average treatment effects that naturally account for all sources of uncertainty. We show that the posterior predictive distribution of counterfactual predictions can be coupled with flexible models, such as machine learning approaches, to find conditional average treatment effect functions in a straightforward manner. Finally, we apply our approach to address important questions in understanding the effects of policing policies in New York City.

2 Policing in New York City

We gathered data from three sources. Data on crimes reported to the police come from the NYPD’s Historic Complaint Database. Arrest data are from the NYPD’s Arrest Database. Data on precincts’ demographic, economic, and housing characteristics come from the Census Bureau’s American Community Survey (ACS) five-year estimates. The data are publicly available on New York City’s Open Data Portal or the Census Bureau’s API. The ACS data are estimates of census tract characteristics across five years. We use the final period before neighborhood policing initiation (2010-2014) as the observed characteristics for each precinct. The NYPD’s address-level data were placed into precincts using a precinct shapefile map provided by the City and Stata’s geoinpoly command (Picard, 2015). The ACS data are tract-level, a smaller geography than the precinct. Each tract was put into the precinct that hosted its centroid, also using Stata’s geoinpoly command. In total, the data contains 156 months of data for 76 precincts.

We are interested in studying how the adoption of neighborhood policing affected overall crime and arrest levels, and racial disparity in police enforcement. To study the effect on overall crime, we utilize the violent crime rate, defined as the number of murders, manslaughters, robberies, and felony assaults made known to the police per 100,000 people in each precinct. To protect victims’ privacy, the NYPD does not geocode reports of sexual assault, so rape is not included in this measure. We use the violent crime rate rather than the total crime rate to purge subjective classification and enforcement decisions by the police from the measure of crime. Violent crimes are more representative of the amount of crime in an area because they are reported regardless of police intervention, while many misdemeanor crimes are not reported unless observed by a police officer, thereby reflecting police enforcement more than actual crime levels. To analyze whether the increased police-community contact that neighborhood policing generates increases arrest rates, we construct a measure of the misdemeanor arrest rate. This is the number of arrests a precinct made per 100,000 residents for 55 low-level offenses, the most common of which are trespassing, theft, misdemeanor assault, and drug possession. We measure misdemeanor arrests, rather than all arrests, because they reflect police enforcement priorities and police choices more than they reflect actually occurring crime. It is these discretionary arrests that are most likely to fluctuate as a result of policy changes. An illustration of misdemeanor arrest rates over time can be found in Figure 1. We see that misdemeanor arrest rates were generally increasing in the early years of the study, though this trend shifts to decreasing arrest rates in recent years. To understand the impact of neighborhood policing on racial disparities in arrest rates, we construct a ratio of racial disparity in misdemeanor arrests. 
This is a ratio of two rates: (1) the number of misdemeanor arrests a precinct makes of black people per the number of black residents of that precinct divided by (2) the number of misdemeanor arrests of white people per the number of white residents.
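The three outcome measures can be computed from raw counts as follows; the counts and populations below are made up for illustration, not taken from the NYPD data:

```python
# Constructing the outcome measures described above from raw precinct counts.
# All numbers are hypothetical; the dictionary schema is illustrative only.
def per_100k(count, population):
    return 100_000 * count / population

precinct = {
    "population": 80_000,
    "violent_crimes": 120,        # murders, manslaughters, robberies, felony assaults
    "misd_arrests": 900,          # arrests for the 55 low-level offenses
    "misd_arrests_black": 500, "black_pop": 30_000,
    "misd_arrests_white": 150, "white_pop": 25_000,
}

violent_rate = per_100k(precinct["violent_crimes"], precinct["population"])
misd_rate = per_100k(precinct["misd_arrests"], precinct["population"])
# Racial disparity ratio: black misdemeanor arrest rate over white rate
disparity = (precinct["misd_arrests_black"] / precinct["black_pop"]) / \
            (precinct["misd_arrests_white"] / precinct["white_pop"])
```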

We obtained the implementation date of neighborhood policing for each precinct from the NYPD’s Commissioner Report in 2018 (O’Neill, 2018). Treatment adoption times can be found in Figure 1, which highlights the fact that precincts steadily began to implement neighborhood policing in 2015, and that there was no period of time when the majority of precincts began neighborhood policing. We measure a vector of demographic, economic, and housing variables that have been shown to relate to crime and arrest rates. To capture socio-economic status, we utilize five precinct characteristics: the unemployment rate, the percent of people living below the poverty line, the percent of housing units that are owner-occupied, the percent of the housing units that are vacant, and the percent of residents with a B.A. degree. We also measure the percent of the precinct that is black or Latino, the percent of the precinct that are men aged 15 to 34, and the percent of the precinct that is foreign-born.

Figure 1: Preliminary look at the NYC policing data showing the time series for misdemeanor arrest rates for each precinct and the city average. The points marked by an x are the times at which each precinct adopted neighborhood policing.

3 Potential outcomes, estimands, and identification

Throughout we assume that interest lies in the effect of a particular treatment $W$ on an outcome $Y$, where we have measured both the treatment and outcome at repeated time intervals. Our observed data are therefore $W_{it}$ and $Y_{it}$ for units $i = 1, \dots, N$ and time points $t = 1, \dots, T$. We will be working under the framework that once a unit receives treatment, it cannot revert back to the control condition, i.e. $W_{it} = 1$ for all $t \geq T_i$, where $T_i$ is the random variable denoting when subject $i$ initiates treatment. Further, every unit will eventually commence with the treatment condition and therefore $T_i \leq T$. We also assume that for each unit we have a vector of potentially time varying covariates $X_{it}$.

3.1 Potential outcomes

Typically in causal inference with a binary treatment and a single time point, it is assumed that there are two potential outcomes of interest for subject $i$: the potential outcome had they received treatment, and the potential outcome had they not received treatment. In our setting, however, every unit receives treatment, but there are many time points at which a unit could have begun receiving treatment. For this reason we denote potential outcomes by $Y_{it}(s)$, where $s$ is the time at which treatment is initiated. Therefore $Y_{it}(s)$ represents the potential outcome we would observe for subject $i$ at time $t$ had they initiated treatment at time $s$. Similar to Athey and Imbens (2018) we let $Y_{it}(\infty)$ denote the potential outcome for a subject if they never receive treatment. To link potential outcomes to the observed data we make a standard consistency assumption that $Y_{it} = Y_{it}(T_i)$. Denoting potential outcomes by the time of treatment initiation for each unit requires the assumption of no interference between units. This assumption states that the treatment status of one unit can not affect the outcomes of other units. This is a reasonable assumption in the motivating example as it is unlikely for neighborhood policing in one precinct to affect crime or arrest rates in neighboring precincts. Lastly, we assume that there are no anticipatory effects, i.e. that $Y_{it}(s) = Y_{it}(\infty)$ for all $s$ and $t$ with $t < s$. This states that units do not respond to treatment before it gets initiated. This could be violated if units are aware of the impending treatment change, and subsequently change their behavior due to the upcoming change in policy. In other words, this assumption states that the potential outcome if the subject never initiates treatment is the same as the potential outcome if the subject initiates treatment at a later time.

3.2 Individual treatment effects and policy-relevant estimands

We first focus our attention on defining individual treatment effects at specific time points for specific units in the sample, and then we extend them to sample-level treatment effects, as well as estimands that target heterogeneity of the treatment effect. Let us first define the individual causal effect for unit $i$ at time $t$ if they initiated treatment at time $s$ as:

$$\Delta_{it}(s) = Y_{it}(s) - Y_{it}(\infty).$$

This contrast compares what would have happened if a unit initiated treatment at time $s$ versus what would have happened if they never received the treatment. While individual treatment effects are of interest themselves, in many settings average treatment effects are of more interest as they highlight the impact of a policy over an entire region, such as New York City. In our motivating example, average treatment effects correspond to questions such as the change in the number of misdemeanor arrests over the entire time period of the study, or the impact of a policy at a particular point in time across all of New York City. Here, we define relevant estimands that summarize the impact of a particular policy. We define the sample average treatment effect as

$$\tau_k = \frac{1}{N} \sum_{i=1}^{N} \Delta_{i, T_i + k}(T_i) = \frac{1}{N} \sum_{i=1}^{N} \left\{ Y_{i, T_i + k}(T_i) - Y_{i, T_i + k}(\infty) \right\}.$$

This is the average impact of the treatment $k$ time periods after treatment initiation over all units. For different values of $k$, $\tau_k$ illuminates how the treatment effect varies over time after treatment initiation. It is possible that there is a lag before the treatment affects the outcome or that the effect dissipates over time. In our motivating example, this estimand targets the effectiveness of neighborhood policing $k$ time points after the observed initiation times, which measures the impact that the policy induced on crime and arrest rates. We also define the cumulative impact as:

$$\tau^{C}_{K} = \sum_{k=0}^{K} \tau_k.$$

This represents the total impact of the treatment after $K$ time periods post-intervention. In our motivating example, this estimand represents the total impact of neighborhood policing over the first $K$ months after its initiation, averaged over precincts.
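Given draws of the individual effects, the two estimands reduce to simple averages and cumulative sums. In this sketch `delta[i, k]` stands in for a (simulated) individual effect of unit i at k periods after adoption:

```python
import numpy as np

# Computing the sample average treatment effect k periods after adoption, and
# its cumulative version, from a matrix of simulated individual effects.
rng = np.random.default_rng(4)
N, K = 76, 12                         # precincts, post-adoption horizon (toy sizes)
delta = -2.0 + rng.normal(0, 0.5, (N, K))   # simulated individual effects

tau = delta.mean(axis=0)              # average effect k periods after adoption
tau_cumulative = np.cumsum(tau)       # total impact through the first k periods
```

In the actual procedure, `delta` would be a posterior draw of observed-minus-counterfactual differences, and these summaries would be computed for every draw to obtain a posterior distribution for each estimand.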

3.3 Identification using time series forecasting

Our target parameter of interest is $\tau_k$, which is a function of both $Y_{i, T_i + k}(T_i)$ and $Y_{i, T_i + k}(\infty)$ for all $i$. Due to consistency, we know that $Y_{i, T_i + k}(T_i) = Y_{i, T_i + k}$, which is an observed quantity. However, we do not observe $Y_{it}(\infty)$ for any $t \geq T_i$.

Briefly, our identification strategy will be to use the data in the pre-treatment period to predict what would have happened in the absence of treatment for the post-treatment period. Fortunately, we observe many time points before treatment initiation, and by the consistency and no anticipatory effects assumptions we have $Y_{it} = Y_{it}(\infty)$ for all $t < T_i$. We leverage that by using the observed data in the pre-treatment period to identify

$$f\left( Y_{it}(\infty) \mid Y_{i, t-1}(\infty) \right), \qquad (1)$$

where $t < T_i$. It is important to note that while we have conditioned on $Y_{i, t-1}(\infty)$ for simplicity, our identification strategy does not rely on this choice, and we can choose to condition on more (or fewer) components of the data. We could allow the model to depend on more time points than simply the previous time point, or allow the model to include covariates $X_{it}$. Our observed data allow us to identify the distribution of the counterfactual value $Y_{it}(\infty)$, given the previous data. We can use this model to forecast what the counterfactual value is for time points $t \geq T_i$, when this quantity is unobserved. If our goal is to identify the treatment effect $k$ time points post-treatment, we rely on equation (1) holding, which is analogous to the time series being stationary, for $k$ time points after treatment initiation. Each unit initiates treatment at possibly different times and therefore we need this model to hold at all time points for which we need to predict the missing counterfactual. The estimands presented above are sample-specific estimands and are therefore not strictly identifiable from the observed data, in the sense that they cannot be written as a function of the observed data distribution in finite samples (Balzer et al., 2016). In Appendix A we show that the population version of our estimand is identified given stationarity of equation (1). In Section 5 we will evaluate how plausible the stationarity assumption is in the policing example, and find that it holds reasonably well for small values of $k$.

A crucial point regarding our strategy for identifying causal effects is that we do not rely on an assumption that there are no unmeasured confounders, a common assumption in causal inference with observational data. Each subject acts as their own control, and therefore our approach is not susceptible to bias from either time-varying or time-invariant covariates. This means that there can be unmeasured variables that affect both the time of treatment initiation and the outcome, yet we are still able to identify causal effects. The only assumption we rely on is the stationarity of equation (1). Importantly, this does not assume that we have the true data generating model for the time series in equation (1). We simply need whichever model we choose to use (whether the true data generating model or not) to remain stationary over the relevant time period. For instance, the inclusion of covariates does not immediately affect the validity of the procedure, but it could make the stationarity assumption more plausible by conditioning on covariates that are highly predictive of the outcome, and could also improve the efficiency of the model forecasts. This further illuminates how unmeasured covariates are not problematic as long as stationarity of equation (1) still holds.

3.4 Heterogeneous treatment effects

Many times interest also lies in understanding how treatment effects vary as a function of observed characteristics, frequently referred to as treatment effect heterogeneity. For the motivating example we are interested in understanding heterogeneity as a function of time since treatment initiation or precinct-level characteristics such as the racial or socioeconomic distribution in a precinct. We study treatment effect heterogeneity by examining how $\Delta_{it}(T_i)$ varies with $t$, $T_i$, and $X_{it}$ using the following formulation:

$$\Delta_{it}(T_i) = g(t, T_i, X_{it}) + \epsilon_{it}.$$

The function $g(\cdot)$ is important in its own right, as it can convey how treatment effects are expected to change as a function of precinct level characteristics. We use $g(\cdot)$ to define an estimand that describes the impact of the policy on different types of precincts. Let $X^{(-j)}_{t}$ be the matrix of observed covariates in our data at time $t$ excluding covariate $j$, and let $x_1$ and $x_2$ denote two distinct values for covariate $j$. Our estimand that targets treatment effect heterogeneity with respect to covariate $j$ at $k$ time points after treatment initiation is defined as

$$\tau^{(j)}_{k}(x_1, x_2) = \frac{1}{N} \sum_{i=1}^{N} \left\{ g\left(T_i + k, T_i, x_1, X^{(-j)}_{i}\right) - g\left(T_i + k, T_i, x_2, X^{(-j)}_{i}\right) \right\}. \qquad (2)$$

This estimand highlights the difference in causal effects for units with the same $X^{(-j)}$ but different values for covariate $j$, when looking $k$ time points after treatment initiation. In our motivating example, we explore shifts where $x_1$ and $x_2$ are chosen quantiles of covariate $j$. For instance, if we were looking at socioeconomic status as an effect-modifying variable, equation (2) would represent how the treatment effect changes when comparing a precinct with lower socioeconomic status to one with higher socioeconomic status, fixing the remaining covariates. We could also define the heterogeneous treatment effect estimand to compare the treatment effects of two precincts with different covariate vectors, rather than modifying only covariate $j$. These estimands can be aggregated over time to provide a cumulative estimand similar to what we used for aggregated average treatment effects:

$$\tau^{C,(j)}_{K}(x_1, x_2) = \sum_{k=0}^{K} \tau^{(j)}_{k}(x_1, x_2).$$
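A sketch of this workflow: regress individual effects on covariates with a flexible model (a quadratic basis below, standing in for the machine-learning approaches mentioned above), then contrast fitted values at two quantiles of one covariate while holding the others fixed. All data are simulated:

```python
import numpy as np

# Heterogeneous-effect sketch: fit a flexible surface g(x) to simulated
# individual effects, then compare predictions at two covariate values.
rng = np.random.default_rng(5)
N = 76
X = rng.uniform(0, 1, (N, 2))                 # e.g. poverty rate, pct foreign-born
delta = -1.0 - 2.0 * X[:, 0] ** 2 + rng.normal(0, 0.2, N)  # effects vary with X1

# Flexible regression of effects on covariates (quadratic basis expansion)
B = np.column_stack([np.ones(N), X, X ** 2, X[:, 0] * X[:, 1]])
coef, *_ = np.linalg.lstsq(B, delta, rcond=None)

def g_hat(x1, x2):
    b = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
    return b @ coef

# Effect-modification contrast: shift covariate 1 from its 25th to 75th
# quantile, holding covariate 2 at its median
q25, q75 = np.quantile(X[:, 0], [0.25, 0.75])
med2 = np.median(X[:, 1])
effect_shift = g_hat(q75, med2) - g_hat(q25, med2)
```

In the full procedure this regression would be rerun on each posterior draw of the individual effects, so that `effect_shift` itself acquires a posterior distribution.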

4 Estimation and inference

Here we detail our strategy for estimating the causal estimands described in the previous section, through estimation of model (1). As discussed in Section 3.3, we require a model that predicts future values of the time series given previous time points. We develop a Bayesian multivariate time series model that accounts for spatial correlation across precincts. Our overarching goal from these models is to find the posterior predictive distribution of the unknown time series values. If we let $\widetilde{Y}$ represent all unknown time series values that we are interested in predicting, then interest lies in $f(\widetilde{Y} \mid Y)$, the posterior predictive distribution of these predictions given the observed data $Y$. To obtain unbiased estimates of the estimands presented in Section 3 we simply need $E(\widetilde{Y} \mid Y)$ to be an unbiased estimate of the true, unknown counterfactual time series. Of equal importance, however, is the uncertainty around the unknown counterfactuals, and therefore we need the posterior distribution to accurately quantify the uncertainty around the unknown values. Both temporal and spatial correlation in the data must be accounted for properly if we want our estimators to have good inferential properties. We now highlight our spatiotemporal model that can be used to target the unknown counterfactual time series, but it should be stressed that the ideas that follow will hold for any time series model, as long as the posterior predictive distribution of future time points can be extracted from the model.

4.1 Bayesian vector autoregressive model

To account for both spatial and temporal dependencies, we specify a joint autoregressive process for the time series of all precincts. Letting $\boldsymbol{Y}_t = (Y_{1t}, \dots, Y_{Nt})'$, a local mean first-order vector autoregressive model takes the form

$$\boldsymbol{Y}_t = \boldsymbol{\mu}(t) + \Phi \left( \boldsymbol{Y}_{t-1} - \boldsymbol{\mu}(t-1) \right) + \boldsymbol{\epsilon}_t, \qquad (3)$$

where $\boldsymbol{\epsilon}_t \sim \mathcal{N}(0, \Sigma)$. For a more general discussion of these models, see Banbura and van Vlodrop (2018). The vector of functions $\boldsymbol{\mu}(t) = (\mu_1(t), \dots, \mu_N(t))'$ accounts for unit-specific mean values and trends over time. These functions will be estimated using basis functions by allowing $\mu_i(t) = \sum_{l} \beta_{il} b_l(t)$, where $b_l(t)$ are pre-specified basis functions that include an intercept. $\Phi$ is an $N \times N$ matrix of parameters that control the extent of temporal dependence over time. Diagonal elements of $\Phi$ allow for correlation across time within each subject, while off-diagonal elements of $\Phi$ dictate the amount of correlation across subjects over time. This allows for a wide range of temporal dependencies across units in our study. The error term $\boldsymbol{\epsilon}_t$ allows for contemporaneous correlation across units in the study at a particular time point through the covariance matrix $\Sigma$. We empirically evaluate the quality of model (3) in Section 5 and find that it performs quite well on the New York City policing data and leads to credible intervals with nominal coverage rates.
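The structure of this local mean first-order VAR can be illustrated by simulating a small system and recovering the temporal-dependence matrix by least squares (the paper's estimation is Bayesian; the dimensions and parameter values here are toy choices):

```python
import numpy as np

# Simulate a 3-unit local mean VAR(1) with constant unit-specific means, then
# recover the dependence matrix Phi by least squares on the demeaned series.
rng = np.random.default_rng(6)
N, T = 3, 400
mu = np.array([10.0, 20.0, 30.0])             # unit-specific means
Phi = np.array([[0.5, 0.2, 0.0],
                [0.1, 0.4, 0.0],
                [0.0, 0.0, 0.6]])             # temporal / cross-unit dependence
Sigma = 0.1 * np.eye(N)                       # contemporaneous covariance

Y = np.empty((T, N)); Y[0] = mu
L = np.linalg.cholesky(Sigma)
for t in range(1, T):
    Y[t] = mu + Phi @ (Y[t - 1] - mu) + L @ rng.normal(size=N)

# Least-squares estimate of Phi from the demeaned series
Z = Y - Y.mean(axis=0)
Phi_hat = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)[0].T
```

The diagonal entries capture each unit's own persistence, while nonzero off-diagonal entries let one unit's past help predict another's future, which is what makes the model useful for spatially correlated precincts.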

4.2 Reducing the parameter space of $\Phi$

In high-dimensional time series settings, we do not have sufficient data to estimate all parameters in the $N \times N$ matrix $\Phi$, and some form of dimension reduction or shrinkage is required. In low-dimensional settings, the so-called Minnesota prior (Litterman et al., 1979) assigns a Gaussian prior distribution on the coefficients in $\Phi$ that favors larger values on the diagonal, as it is expected that a subject’s own time series will be more predictive of its future time points than other subjects’ values. In high-dimensional settings, shrinkage priors (Bańbura et al., 2010; Kastner and Huber, 2017; Ghosh et al., 2019) or point mass priors (Korobilis, 2013) have been used to improve estimation of $\Phi$.
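The core of the Minnesota-prior idea, shrinking cross-unit coefficients more than own-lag ones, can be sketched as a matrix of prior standard deviations (real implementations also scale by lag order and residual variances; the hyperparameters here are illustrative):

```python
import numpy as np

# Minnesota-style prior standard deviations for a VAR coefficient matrix:
# larger prior sd on the diagonal (own-lag) than off the diagonal (cross-unit),
# so cross-unit coefficients are shrunk more aggressively toward zero.
def minnesota_prior_sd(N, lam=0.2, theta=0.5):
    # lam: overall tightness; theta < 1: extra shrinkage for cross-unit terms
    sd = np.full((N, N), lam * theta)
    np.fill_diagonal(sd, lam)
    return sd

sd = minnesota_prior_sd(4)
```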

All of these approaches could be implemented to estimate A. However, one complicating factor unique to our setting is that the observed time series lengths in the pre-treatment period may vary drastically across subjects. This complicates MCMC updates, as nearly all algorithms have been developed for situations with equal numbers of time points. More importantly, it can cause a problem for forecasting. Suppose that subjects i and j are observed for 140 and 100 time points, respectively. If the time series for subject j is highly predictive of the time series for subject i, then the corresponding element of A would be nonzero. When predicting time point 141 for subject i, data from subject j at time point 140 would be used, but the most recent time point at which we observe subject j is 100. This could possibly be fixed by imputing the missing values in a fully Bayesian framework, but doing so would complicate computation and greatly increase the variance of our forecasts. For this reason, we force the corresponding elements of A to be zero for any subjects i and j that initiate treatment at substantially different times. Note that this is simply a matter of efficiency, and in no way restricts the identifiability results in Section 3.3. This greatly increases the sparsity in A, leading to more stable estimation, and also avoids any problems caused by two time series being observed at very different time points. Further dimension reduction can be achieved by setting the elements of A to zero for subjects that are not geographic neighbors. This is based on the fact that geographic neighbors should be more correlated than geographically distant subjects, and is a strategy we will utilize for the New York City policing data. The nonzero entries of A, along with the parameters for the time trends, are given independent normal prior distributions with diffuse variances, as the dimension of the parameter space has been reduced enough to alleviate the need for shrinkage priors. It is important to note that estimation of A is more about improving efficiency and properly accounting for temporal uncertainty in the time series. Unbiased estimation of treatment effects can still be attained using an A matrix with incorrect elements set to zero; however, there could be a loss of efficiency, and uncertainty quantification could be negatively impacted. We evaluate the impact of misspecification of A in Section 5.3 and Appendix D.
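The sparsity rules just described might be sketched as follows. This is a hypothetical illustration: the cutoff `max_gap`, the adjacency structure, and all names are ours, not values from the paper.

```python
# Hypothetical sketch of the sparsity rules for A: an entry A[i, j] is allowed
# to be nonzero only if units i and j adopt treatment at similar times AND are
# geographic neighbors. All values here are illustrative, not from the paper.
import numpy as np

treat_time = np.array([30, 32, 80, 85])            # adoption month per unit
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # geographic adjacency
max_gap = 6                                         # "substantially different" cutoff

N = len(treat_time)
mask = np.eye(N, dtype=bool)                        # diagonal entries always free
for i in range(N):
    for j in neighbors[i]:
        if abs(int(treat_time[i]) - int(treat_time[j])) <= max_gap:
            mask[i, j] = True

# Entries of A where mask is False are fixed at zero during estimation.
```

Units 0 and 1 are neighbors with similar adoption times, so their cross-terms stay free; units 1 and 2 are neighbors but adopt roughly four years apart, so their cross-terms are zeroed out.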

4.3 Estimation of Sigma

The covariance matrix Sigma is also high-dimensional, and there are many ways to reduce its parameter space, such as imposing sparsity on its inverse or assuming it has a low-rank structure (Fox and Dunson, 2015). In the Bayesian VAR literature, decomposing the covariance matrix into a product of upper triangular matrices has been utilized to simplify estimation with efficient MCMC sampling (George and Sun, 2008).

We take an alternative strategy: we construct an estimate of Sigma and proceed with inference conditionally on this estimate. While a fully Bayesian treatment is advantageous in that it accounts for uncertainty in the estimation of Sigma, it also greatly increases the computational complexity of any MCMC algorithm, particularly because we observe different numbers of data points for each unit. Our approach finds an initial estimate of model 3, then uses the model residuals and an optimization algorithm to acquire an estimate of Sigma. We use an alternating least squares algorithm, described in Appendix B, to estimate model 3. From the fitted model, we acquire predicted values at all time points prior to the earliest time point at which treatment is initiated, and calculate the sample covariance matrix of the residual vectors,

Sigma_hat_sample = (1 / (T_0 - q)) sum_t r_t r_t',

where r_t is the residual vector at time t, T_0 is the number of pre-treatment time points used, and q is the number of parameters in the mean estimate for each time series. For high-dimensional data sets, this sample covariance estimator will be very unstable; in our motivating example, this ratio is 76/113. We regularize the estimated sample covariance matrix by imposing sparsity on its inverse. It is known from Gaussian graphical models that if the (i, j) element of the inverse covariance matrix is zero, then units i and j are conditionally independent given the remaining observations. In our policing example, a reasonable assumption is that precincts that are more than one neighbor apart are conditionally independent. We enforce this in the estimation of Sigma by solving the constrained optimization

maximize over Omega in S of   log det(Omega) - tr(Sigma_hat_sample Omega),    with Sigma_hat = Omega^{-1},

where S is the space of all positive semi-definite matrices whose (i, j) element is zero for any i and j that are not neighbors in the data. The definition of neighbors can be relaxed to allow second-order neighbors to be conditionally correlated in the same manner. We proceed with this estimate for the rest of the paper, and see empirically in Section 5 that this strategy leads to inference with nominal coverage rates, even though we condition on the estimate of Sigma.
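The constrained estimate can be sketched numerically. This is our own minimal illustration, not the paper's algorithm: a tiny 3-unit example in which units 0 and 2 are not neighbors, optimizing the Gaussian log-likelihood over the free entries of the precision matrix with a generic derivative-free optimizer. All data, names, and the optimizer choice are assumptions.

```python
# Minimal numerical sketch of the constrained covariance estimate:
# maximize log det(Omega) - tr(S_hat @ Omega) over precision matrices Omega
# whose (i, j) entry is zero for non-neighbors. 3-unit toy example, ours.
import numpy as np
from scipy.optimize import minimize

S_hat = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])              # residual sample covariance
free = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 2)]  # (0, 2) not neighbors -> fixed at 0

def build_omega(theta):
    omega = np.zeros((3, 3))
    for val, (i, j) in zip(theta, free):
        omega[i, j] = omega[j, i] = val
    return omega

def neg_loglik(theta):
    omega = build_omega(theta)
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:
        return 1e10                               # outside the positive-definite cone
    return -(logdet - np.trace(S_hat @ omega))

theta0 = np.array([1.0, 1.0, 1.0, 0.0, 0.0])      # start at the identity matrix
res = minimize(neg_loglik, theta0, method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8,
                        "maxiter": 5000, "maxfev": 10000})
Omega_hat = build_omega(res.x)
Sigma_hat = np.linalg.inv(Omega_hat)
```

A known property of this constrained Gaussian MLE is that the fitted covariance matches the sample covariance on the unconstrained entries, while the precision matrix is exactly zero on the non-neighbor pattern.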

4.4 Combining time series predictions for estimating treatment effects

Sampling from the posterior distribution of all parameters in model 3 allows us to obtain the posterior predictive distribution of future time points. This characterizes our uncertainty about what would have happened in the absence of treatment for all units in the sample. The individual treatment effect at each post-treatment time point is the difference between the observed outcome and the counterfactual outcome in the absence of treatment. The first of these two values is observed and known, as it is simply the observed outcome after treatment is initiated, while the second is the unknown quantity for which we have a posterior distribution. We therefore automatically obtain the posterior distribution of the individual treatment effects by subtracting each posterior predictive draw of the counterfactual time series from the corresponding vector of observed outcomes after treatment initiation. Now that we have posterior distributions of the individual treatment effects for all units and all post-treatment time points, we can proceed with obtaining estimates and credible intervals for the estimands of interest from Section 3. All marginal estimands, such as the time-specific and cumulative average effects, are obtained directly by averaging the relevant individual treatment effects over the correct time periods. Inference is straightforward using the posterior distribution of these quantities.

For treatment effect heterogeneity, we focus on inference for the conditional treatment effect function. To estimate it, we first draw a sample from the posterior distribution of the individual treatment effects. We then regress these values on the observed characteristics of each subject, as well as on time and time since treatment initiation. This can be done using linear models, nonlinear models, or more complicated machine learning approaches. Importantly, this second-stage model is solely a function of the observed characteristics, time, and time since treatment initiation. We repeat this process across posterior draws, each time keeping track of the estimates of treatment effect heterogeneity that we are interested in, and inference proceeds from the posterior distribution of these quantities.
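The two-stage procedure above can be sketched as follows. Everything here is synthetic and illustrative: the stand-in posterior draws, the linear second stage (one of the options the text allows), and the quartile contrast summarizing heterogeneity are our assumptions.

```python
# Sketch of the two-stage heterogeneity procedure: for each posterior draw of
# the individual treatment effects, regress the draw on unit covariates and
# time since treatment, then summarize a covariate contrast across draws.
# All data are synthetic; names and the linear second stage are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_units, n_lags, n_draws = 50, 3, 200
X = rng.normal(size=(n_units, 2))                     # unit-level covariates
lag = np.tile(np.arange(1, n_lags + 1), n_units)      # time since treatment
Xrep = np.repeat(X, n_lags, axis=0)                   # covariates per (unit, lag)

contrasts = []
for _ in range(n_draws):
    # stand-in for one posterior draw of the individual treatment effects
    tau_draw = 0.5 * Xrep[:, 0] - 0.1 * lag + rng.normal(0, 0.3, len(lag))
    D = np.column_stack([np.ones(len(lag)), Xrep, lag])
    coef, *_ = np.linalg.lstsq(D, tau_draw, rcond=None)
    # effect of moving covariate 1 from its 25th to its 75th percentile
    lo, hi = np.percentile(X[:, 0], [25, 75])
    contrasts.append(coef[1] * (hi - lo))

est = np.mean(contrasts)                              # posterior mean contrast
ci = np.percentile(contrasts, [2.5, 97.5])            # 95% credible interval
```

Swapping the least-squares step for a random forest fit would give the nonlinear variant used later in the paper; the outer loop over posterior draws is unchanged.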

5 Simulation study based on NYC precinct data

Here we present simulation studies using the observed data on arrest rates in New York City. These simulations allow us to test a number of features of our approach, such as 1) how plausible our identification assumptions are in the New York City data, 2) how well our model performs on the observed data, and 3) whether conditioning on an estimate of Sigma negatively impacts inference. We follow an approach similar to Schell et al. (2018). We use the observed data before treatment initiation as the counterfactual outcome in the absence of treatment, and apply a known shift to the observed pre-treatment time series. The new outcome values after the shift are the observed outcomes after treatment initiation, while the shift represents the treatment effect. Specifically, for each precinct we randomly assign a treatment time between 10 and 40 months prior to its actual treatment time. We apply the known shift to the data at each time point after the new treatment time, and this shift is the individual treatment effect for that particular precinct.
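The shift-based design just described can be sketched for a single precinct. The series, effect size, and 5-month effect window are illustrative values of ours (the 5-month window matches the first simulation design described below).

```python
# Hypothetical sketch of the simulation design: shift a precinct's observed
# pre-treatment series by a known effect after a randomly assigned pseudo
# treatment time. Series length, effect size, and window are ours.
import numpy as np

rng = np.random.default_rng(2)
y_pre = rng.normal(10.0, 1.0, size=120)           # observed pre-treatment series
true_treat_month = 120                            # actual adoption month

# pseudo treatment time 10-40 months before the actual one
pseudo_t = true_treat_month - rng.integers(10, 41)

effect = 0.5                                      # the known shift = treatment effect
y_sim = y_pre.copy()
y_sim[pseudo_t:pseudo_t + 5] += effect            # effect in first 5 months only

# the truth the estimator should recover at each post-treatment lag
true_effect = y_sim[pseudo_t:] - y_pre[pseudo_t:]
```

Because the untouched pre-treatment series plays the role of the counterfactual, the data generating process for the outcome is never specified by the analyst, which is the point of the design.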

This simulation framework is extremely informative about the performance of our approach in the application of interest, because we only control the magnitude and form of the treatment effect and have no control over the data generating process for the outcome. This is far more realistic than simulations based entirely on user-specified data generating models, and it provides more insight into the performance of our approach on the data set at hand and the exact scientific questions we aim to answer. If our approach performs well in these simulations, that indicates that our model assumptions (such as stationarity) hold reasonably well in this data set, and we have much more reason to trust them in the actual data application. We use the 2010-2014 census data for the covariates in the simulation, where all covariates are zero centered.

We consider three distinct estimands for the simulation study: time-specific treatment effects at each of the first ten post-treatment time points, cumulative treatment effects over the same time points, and a heterogeneous effect contrast for each covariate in our study. For the contrasts, we compare the treatment effect when a covariate is set to the 25th and 75th quantiles of its observed distribution. To estimate the heterogeneous effect function, we specify a form under which the treatment effect depends on time solely through the time since treatment initiation. We focus on bias, interval coverage, and efficiency for each estimand. Interval coverage is the proportion of simulations in which our 95% credible interval covers the true parameter. For brevity, we explore only two simulation designs here; additional extensive simulations can be found in Appendix D and are summarized in Section 5.3. The first simulation design assumes there is no treatment effect heterogeneity by the observed covariates and a positive impact of the treatment: the treatment increases the outcome for each precinct by a fixed amount in the first 5 months after treatment, and there is no effect of the treatment after that point. In the second simulation design there is no marginal effect of the treatment, but there is treatment effect heterogeneity, with a treatment effect function that varies with the observed covariates. Lastly, we take two strategies to estimating the A matrix: 1) we specify A to be diagonal, or 2) we allow nonzero off-diagonal elements, capping the number of nonzero elements in each row of A at three. We prioritize the elements of row i that correspond to first- or second-order neighbors of precinct i. Results are similar for both approaches, but they are generally better for the non-diagonal A matrix, so we present those results here and move the diagonal results to Appendix D.

5.1 Results for homogeneous treatment effect

Here we focus on the setting where there is a positive effect of treatment that is constant across units. Note that while there is no heterogeneity, we still estimate the heterogeneous treatment effects of interest. Figure 2 shows boxplots of our estimates of both the time-specific and cumulative effects for ten time points post-treatment, where the estimates are shifted by the true mean so that unbiased estimates would be centered around zero. We see that estimates at early lags are unbiased, with a relatively small amount of bias appearing at later lags. Importantly, the bias is small relative to the uncertainty in the estimates, which leads to interval coverages close to 95% for all time points considered. The bottom-left panel of Figure 2 shows the coverage for the individual and cumulative estimands at each of the ten time points considered: coverage is around 95% for the first eight time points and drops to around 93% for the final two. These plots suggest that we do a very good job estimating marginal treatment effects; a small amount of bias appears at later time points, but this does not come at a large loss of interval coverage. In terms of heterogeneous treatment effects, the bottom-center panel of Figure 2 shows the interval coverage of the covariate contrasts using both linear regression and random forests to estimate the heterogeneous effect function. Contrary to the marginal effects, we see some coverages below the nominal rate, with some far below it, as the coverage drops to 40% for covariate 5 and 50% for covariate 8. We also consider estimation in which only the first three post-treatment time periods are used in the second-stage model. As shown in the bottom-right panel of Figure 2, when we restrict attention to only 3 time periods, these coverages are much closer to the desired rate: most are near 95%, and the two problematic covariates, numbers 5 and 8, have coverages of 80% and above.

Figure 2: Results from the homogeneous treatment effect simulation study. The top-left panel shows the estimates of the time-specific effects and the top-right panel shows estimates of the cumulative effects for the first ten post-treatment time points. Estimates are mean shifted so that an unbiased estimator will be centered at zero. The bottom-left panel shows coverage for all marginal estimands across the ten time points. The bottom-center and bottom-right panels show coverage for heterogeneous treatment effects using 10 and 3 time points, respectively.

5.2 Results for heterogeneous treatment effects

The results for the heterogeneous treatment effect setting are nearly identical to those from the homogeneous setting. We see no bias at early lags, and then bias slowly increases with time since treatment initiation; however, coverages of marginal treatment effects are close to 95% at all time points considered. Heterogeneous treatment effects tell a similar story to the previous simulation study. The 3rd, 5th, and 8th covariates have low interval coverage when considering 10 time points into the future, but as soon as we restrict to 3 time points into the future, coverages increase to a reasonable level. Estimation of treatment effects becomes considerably more difficult the farther out in time we consider, and it appears that heterogeneous treatment effects are more sensitive to this degradation, possibly due to the large impact of biased individual effect estimates for a small number of precincts with extreme values of those covariates.

Figure 3: Results from the heterogeneous treatment effect simulation study. The top-left panel shows the estimates of the time-specific effects and the top-right panel shows estimates of the cumulative effects for the first ten post-treatment time points. Estimates are mean shifted so that an unbiased estimator will be centered at zero. The bottom-left panel shows coverage for all marginal estimands across the ten time points. The bottom-center and bottom-right panels show coverage for heterogeneous treatment effects using 10 and 3 time points, respectively.

5.3 Summary of additional simulations

We have run a number of additional simulation studies, found in Appendix D, that cover a wide range of situations: misspecification of the A matrix, the presence of an unmeasured variable affecting both treatment times and the outcome, smoothed estimates of the time-specific effects, and data sets generated entirely by us that do not use the New York City policing data in any way. One key takeaway from these simulations is that the presence of an unmeasured covariate affecting treatment times and the outcome does not bias our results or affect the validity of our inferential procedure, which confirms the theoretical results in Section 3 suggesting that our approach is robust to unmeasured confounders. Another key takeaway is that misspecifying the A matrix does not bias our results, though it can slightly affect the construction of credible intervals: while we still obtain unbiased estimates of treatment effects when nonzero elements of A are forced to be zero, our uncertainty estimates do not account for all of the temporal correlation inherent in the data, and credible interval coverages fall slightly below the nominal level. We also explore scenarios in which the true treatment effects are smooth in time since treatment initiation, and show that if this smoothness is incorporated into the estimation procedure, more efficient estimates of marginal treatment effects can be obtained. Lastly, we show the results of the simulation study of Section 5 when we utilize the diagonal A matrix. The diagonal A matrix has similar performance in the first few time periods, but performs worse than the non-diagonal model in terms of bias and interval coverage at later time periods. The coverage of heterogeneous treatment effects is also worsened by the diagonal model.

6 Estimating the effects of neighborhood policing

We utilize our approach to estimate the impact of neighborhood policing on three outcomes: the misdemeanor arrest rate, the rate of violent crimes, and the ratio of black to white misdemeanor arrest rates. We choose these outcomes because they are plausibly affected by the initiation of neighborhood policing and are of substantial interest for understanding the impact of neighborhood policing throughout New York City. The distribution of treatment times and a visualization of the observed data for misdemeanor arrest rates can be found in Figure 1. We include ten covariates as potential effect modifiers: precinct size, percentage of the population that is black, percentage of the population that is Latino, the percentage of vacant housing units, the percentage of unemployed individuals, the percentage living in poverty, the percentage of young men, the percentage of foreign born residents, the percentage of owner occupied housing units, and the percentage of people with a bachelor's degree. We study both marginal effects that represent the overall impact of the policy in New York City, as well as heterogeneous treatment effects that describe how the treatment effects vary as a function of precinct-level characteristics. Informed by the results of our simulation studies based on the New York City data, we specify A to be a non-diagonal matrix with up to three nonzero elements in each row. These elements are chosen to represent precinct pairs that are first- or second-order neighbors of each other and for which there is sufficient pre-treatment data from one precinct to predict future values of the other.

6.1 Marginal effects

First we focus on time-specific effects at each of the first ten post-treatment time points for each of the three outcomes. Throughout, we assume the effects are smooth in time since treatment initiation by modeling the time-specific effects with splines with 3 degrees of freedom. For every posterior draw of the individual treatment effects, we fit this model and use the fitted values to estimate the time-specific and cumulative effects. We show in Appendix D that this approach attains nominal coverage rates while providing more efficient estimates when the true treatment effects are smooth in time since treatment initiation. Further, in Appendix E we present estimates of the effect of neighborhood policing on crime and arrest rates without the smoothness assumption. Those results are similar to, though slightly more variable than, the results here.
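The smoothing step can be sketched per posterior draw. We use a cubic polynomial basis as a simple stand-in for the paper's 3-degree-of-freedom splines, and the draws themselves are synthetic.

```python
# Sketch of smoothing the time-specific effects in the lag, per posterior draw.
# A cubic polynomial stands in for the 3-df spline basis; data are synthetic.
import numpy as np

rng = np.random.default_rng(4)
lags = np.arange(1, 11)
n_draws = 500

smoothed = np.empty((n_draws, lags.size))
for d in range(n_draws):
    # noisy stand-in for one posterior draw of the averaged effects by lag
    raw = -0.3 / lags + rng.normal(0, 0.1, lags.size)
    coef = np.polyfit(lags, raw, deg=3)
    smoothed[d] = np.polyval(coef, lags)          # fitted (smoothed) values

est = smoothed.mean(axis=0)                       # smoothed effect curve
ci = np.percentile(smoothed, [2.5, 97.5], axis=0)  # pointwise 95% band
```

Because the smoothing is applied separately to every posterior draw, the resulting pointwise bands still reflect full posterior uncertainty, just with the draw-level noise in the lag direction averaged out.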

The estimates and pointwise 95% credible intervals are depicted in Figure 4. Estimates are generally negative for the first few time points for misdemeanor arrests, indicating that neighborhood policing leads to an initial reduction in misdemeanor arrests. The only time point with a statistically significant effect is the first month in which neighborhood policing was adopted. As time since adoption increases, the estimates tend towards zero, but the widths of the credible intervals also increase due to the increasing uncertainty in our time series forecasts as we project further into the future. The estimates of the effects on violent crime show essentially no effect of the policy. While there is a slight indication that neighborhood policing increases violent crimes (middle panel of the top row), this is far from significant, as the credible interval covers zero at all time points. The policy has no effect on the ratio of black to white arrest rates, as the effect estimates are extremely close to zero at every lag considered.

Now we look at cumulative impacts of the policy over time, which can be found in the second row of Figure 4. At each time point, the cumulative estimate is the sum of the time-specific estimates up to that time point. We see that there is a negative impact on arrest rates when viewed cumulatively over time. The credible interval does not contain zero at either of the first two time points, indicating a significant effect of the policy on misdemeanor arrest rates. The cumulative estimates for both violent crimes and the ratio of black to white arrest rates tell the same story as the time-specific effects: there appears to be no effect of the policy on either violent crimes or the ratio of black to white misdemeanor arrest rates.
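Because the cumulative estimand is a deterministic function of the time-specific effects, its posterior follows directly from the draws. A minimal sketch with synthetic draws:

```python
# Sketch: cumulative effects and credible intervals follow directly from
# posterior draws of the time-specific effects. Draws here are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_lags = 1000, 10

# stand-in posterior draws of time-specific effects, one row per draw
time_specific = rng.normal(-0.2, 0.5, size=(n_draws, n_lags))

cumulative = np.cumsum(time_specific, axis=1)     # sum up to each lag, per draw
est = cumulative.mean(axis=0)
lower, upper = np.percentile(cumulative, [2.5, 97.5], axis=0)
```

Applying the cumulative sum within each draw, rather than to posterior summaries, correctly propagates the correlation across lags into the interval widths.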

Figure 4: Estimates and 95% credible intervals for time-specific effects (first row) and cumulative effects (second row) of neighborhood policing on misdemeanor arrest rates (first column), violent crimes (second column), and the ratio of black to white misdemeanor arrest rates (third column).

6.2 Heterogeneous effects

Understanding the impact of the policy on different communities can potentially be even more useful than marginal effects themselves. It is possible that marginal effects show no effect of the policy, but that is only because certain communities have a positive treatment effect while others have a negative treatment effect. Ignoring such differences could lead to misleading conclusions about the success (or lack thereof) of the policy.

We estimate the contrast for each covariate using the 25th and 75th quantiles of the observed covariate distribution as the low and high values. We only utilize the first three post-treatment time points, as our simulation study on the New York City data in Section 5 indicated that heterogeneous treatment effect estimates are not as reliable when predicting 10 time points into the future. We present here only the results for misdemeanor arrests; heterogeneous treatment effect estimates for the other outcomes were all close to zero, indicating no effect of the policy even within subgroups of the city. Figure 5 shows the estimates for each of the 10 covariates considered in our study. We present results where the heterogeneous effect function is estimated using both linear regression and random forests, the latter of which allows for more complex relationships among the covariates. The random forest estimates shrink toward zero more than the linear regression estimates, which leads to point estimates that are closer to zero with shorter credible intervals. The qualitative results, however, are very similar across the two types of models for the heterogeneous treatment effect function. Two covariates have heterogeneous treatment effects that are negative and close to being separated from zero: the percent unemployment in a precinct and the percentage of foreign born residents. Both effects are negative, indicating that the reduction in arrests caused by neighborhood policing seen in the marginal effects is even larger within communities with larger proportions of foreign born or unemployed residents. We see negative estimates using both linear models and random forests, though the unemployment effect is only significant for the random forest model. We also see that the percentage of young men has a somewhat positive effect, though the credible intervals still cover zero.

Figure 5: Estimate of changing one covariate from its lower quartile to its upper quartile, while fixing the other effect modifiers at their observed values.

7 Discussion

In this paper we propose a novel approach to estimating the causal effect of a policy with staggered adoption. We utilize approaches from the high-dimensional Bayesian time series literature to forecast what the counterfactual time series would be for each area in the study in the absence of treatment. We showed how these forecasts can be combined with flexible models to study treatment effect heterogeneity as a function of observed characteristics. By placing our approach within the Bayesian paradigm, inference is relatively straightforward for all estimands once we have obtained the joint posterior distribution of the counterfactual time series in the absence of treatment. We showed through realistic simulations based on the observed NYC data that our approach is able to estimate the causal effects of a precinct-level treatment with good finite sample properties. We applied our approach to study policing in New York City and found that neighborhood policing leads to slight reductions in the number of misdemeanor arrests. Further, we saw potential heterogeneity of the treatment effect by percent unemployment, percentage of foreign born residents, and percentage of residents who are young men. Our analyses also suggest that neighborhood policing does not reduce violent crime, and that it has no impact on racial disparities in misdemeanor arrest rates.

Our approach is robust in certain important ways, namely that unmeasured covariates do not bias our estimates, as our approach does not rely on an unconfoundedness assumption. This is critical because we can never know whether we have measured all relevant covariates in an observational study, and unmeasured confounders are always a primary concern. Our approach instead relies on a time-series stationarity assumption that is often more plausible than unconfoundedness. While this is an assumption in its own right, we were able to assess it empirically in the simulation study based on the observed New York City policing data, where our approach attained credible interval coverage at or near the nominal rate for all estimands. This shows that the proposed procedure is well-suited to the problem at hand, making the results of the actual application to the New York City data far more believable.


The authors would like to thank Georgia Papadogeorgou, Aaron Molstad, and Rohit Patra for extremely insightful comments on the manuscript.


  • Abadie et al. (2010) Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of california’s tobacco control program. Journal of the American statistical Association, 105(490):493–505.
  • Abadie et al. (2015) Abadie, A., Diamond, A., and Hainmueller, J. (2015). Comparative politics and the synthetic control method. American Journal of Political Science, 59(2):495–510.
  • Angrist and Pischke (2008) Angrist, J. D. and Pischke, J.-S. (2008). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
  • Ashenfelter (1978) Ashenfelter, O. (1978). Estimating the effect of training programs on earnings. The Review of Economics and Statistics, pages 47–57.
  • Athey and Imbens (2018) Athey, S. and Imbens, G. W. (2018). Design-based analysis in difference-in-differences settings with staggered adoption. Technical report, National Bureau of Economic Research.
  • Balzer et al. (2016) Balzer, L. B., Petersen, M. L., van der Laan, M. J., and Collaboration, S. (2016). Targeted estimation and inference for the sample average treatment effect in trials with and without pair-matching. Statistics in medicine, 35(21):3717–3732.
  • Bańbura et al. (2010) Bańbura, M., Giannone, D., and Reichlin, L. (2010). Large bayesian vector auto regressions. Journal of applied Econometrics, 25(1):71–92.
  • Banbura and van Vlodrop (2018) Banbura, M. and van Vlodrop, A. (2018). Forecasting with bayesian vector autoregressions with time variation in the mean.
  • Ben-Michael et al. (2019) Ben-Michael, E., Feller, A., and Rothstein, J. (2019). Synthetic controls and weighted event studies with staggered adoption. arXiv preprint arXiv:1912.03290.
  • Bernal et al. (2017) Bernal, J. L., Cummins, S., and Gasparrini, A. (2017). Interrupted time series regression for the evaluation of public health interventions: a tutorial. International journal of epidemiology, 46(1):348–355.
  • Bojinov et al. (2019) Bojinov, I., Tu, Y., Liu, M., and Xu, Y. (2019). Causal inference from observational data: Estimating the effect of contributions on visitation frequency at LinkedIn. arXiv preprint arXiv:1903.07755.
  • Bratton (2015) Bratton, W. J. (2015). The NYPD plan of action and the neighborhood policing plan: A realistic framework for connecting police and communities. n.d., http://home.nyc.gov.
  • Brodersen et al. (2015) Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., Scott, S. L., et al. (2015). Inferring causal impact using bayesian structural time-series models. The Annals of Applied Statistics, 9(1):247–274.
  • Callaway and Sant’Anna (2019) Callaway, B. and Sant’Anna, P. H. (2019). Difference-in-differences with multiple time periods. Available at SSRN 3148250.
  • Campbell and Cook (1979) Campbell, D. T. and Cook, T. D. (1979). Quasi-experimentation: Design & analysis issues for field settings. Rand McNally College Publishing Company Chicago.
  • de Blasio (2015) de Blasio, B. (2015). Transcript: Mayor de Blasio, Commissioner Bratton Unveil New, Groundbreaking Neighborhood Policing Vision. New York City Press Office.
  • Donohue et al. (2019) Donohue, J. J., Aneja, A., and Weber, K. D. (2019). Right-to-carry laws and violent crime: a comprehensive assessment using panel data and a state-level synthetic control analysis. Journal of Empirical Legal Studies, 16(2):198–247.
  • Dube and Zipperer (2015) Dube, A. and Zipperer, B. (2015). Pooling multiple case studies using synthetic controls: An application to minimum wage policies.
  • Fox and Dunson (2015) Fox, E. B. and Dunson, D. B. (2015). Bayesian nonparametric covariance regression. The Journal of Machine Learning Research, 16(1):2501–2542.
  • Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
  • George and Sun (2008) George, E. and Sun, D. (2008). Stochastic search model selection for restricted var models.
  • Ghosh et al. (2019) Ghosh, S., Khare, K., and Michailidis, G. (2019). High-dimensional posterior consistency in bayesian vector autoregressive models. Journal of the American Statistical Association, 114(526):735–748.
  • Gill et al. (2014) Gill, C., Weisburd, D., Telep, C. W., Vitter, Z., and Bennett, T. (2014). Community-oriented policing to reduce crime, disorder and fear and increase satisfaction and legitimacy among citizens: A systematic review. Journal of experimental criminology, 10(4):399–428.
  • Gillings et al. (1981) Gillings, D., Makuc, D., and Siegel, E. (1981). Analysis of interrupted time series mortality trends: an example to evaluate regionalized perinatal care. American journal of public health, 71(1):38–46.
  • Goodman-Bacon (2018) Goodman-Bacon, A. (2018). Difference-in-differences with variation in treatment timing. Technical report, National Bureau of Economic Research.
  • Kastner and Huber (2017) Kastner, G. and Huber, F. (2017). Sparse bayesian vector autoregressions in huge dimensions. arXiv preprint arXiv:1704.03239.
  • Korobilis (2013) Korobilis, D. (2013). Var forecasting using bayesian variable selection. Journal of Applied Econometrics, 28(2):204–230.
  • Lechner et al. (2011) Lechner, M. et al. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends® in Econometrics, 4(3):165–224.
  • Li and Bühlmann (2018) Li, S. and Bühlmann, P. (2018). Estimating heterogeneous treatment effects in nonstationary time series with state-space models. arXiv preprint arXiv:1812.04063.
  • Litterman et al. (1979) Litterman, R. B. et al. (1979). Techniques of forecasting using vector autoregressions. Technical report.
  • Miratrix et al. (2019) Miratrix, L., Anderson, C., Henderson, B., Redcross, C., and Valentine, E. (2019). Simulating for uncertainty with interrupted time series designs.
  • National Academies of Sciences, Engineering, and Medicine (2018) National Academies of Sciences, Engineering, and Medicine (2018). Proactive policing: Effects on crime and communities. National Academies Press.
  • O’Neill (2018) O’Neill, J. P. (2018). The Police Commissioner’s Report 2018. New York City Police Department.
  • Papadogeorgou et al. (2018) Papadogeorgou, G., Mealli, F., Zigler, C. M., Dominici, F., Wasfy, J. H., and Choirat, C. (2018). Causal impact of the hospital readmissions reduction program on hospital readmissions and mortality. arXiv preprint arXiv:1809.09590.
  • Picard (2015) Picard, R. (2015). Geoinpoly: Stata module to match geographic locations to shapefile polygons.
  • Schell et al. (2018) Schell, T. L., Griffin, B. A., and Morral, A. R. (2018). Evaluating methods to estimate the effect of state laws on firearm deaths: A simulation study. RAND Corporation.
  • Shaikh and Toulis (2019) Shaikh, A. and Toulis, P. (2019). Randomization tests in observational studies with staggered adoption of treatment. University of Chicago, Becker Friedman Institute for Economics Working Paper, (2019-144).
  • Sherman and Eck (2003) Sherman, L. W. and Eck, J. E. (2003). Policing for crime prevention. In Evidence-based crime prevention, pages 309–343. Routledge.
  • Victora et al. (2004) Victora, C. G., Habicht, J.-P., and Bryce, J. (2004). Evidence-based public health: moving beyond randomized trials. American Journal of Public Health, 94(3):400–405.

Appendix A Identification of population estimands

As discussed in the manuscript, sample treatment effects are generally not identified solely as a function of the observed data distribution, though we will show that the population counterpart of our estimands is identified under the stationarity, consistency, and no anticipatory effects assumptions. Our sample average treatment effect is the average difference between the potential outcome at the observed treatment time and the potential outcome assuming the treatment is never initiated. The population version of this estimand that we will target is

Note that we condition on the observed treatment times in the population estimand, because the sample estimand considers the observed treatment times only, and does not ask what would have happened had everyone received treatment at a particular time point, which would correspond to a marginal estimand that does not condition on them. The first of these terms is immediately identifiable by the consistency assumption, which equates it with a function of the observed data distribution. Now, we can identify the second term:

The last equality holds by the consistency assumption. This final expression is a function solely of the observed data if the outcome process is stationary up to time points after treatment initiation. This is because the conditional density of the outcome given its past is the same as the corresponding conditional distribution in the pre-treatment period, which is fully observed. Note that throughout we have not conditioned on covariates or on more than one past time point; however, both can easily be conditioned on, and the identification formula holds in the same manner. While not necessary for identification, conditioning on covariates or additional time points can (1) improve the efficiency of the results, and (2) make it more likely that the aforementioned stationarity assumption holds.

Appendix B Alternating least squares estimate of

Here we detail how we find initial estimates of (), which can be used to construct an estimate of the residual covariance matrix. We will be constructing as , and therefore our model can be expressed as follows:

where is an diagonal matrix with each element of the diagonal equal to . is a vector of length representing coefficients for . We will construct an algorithm to estimate all unknown parameters, which are given by . Note that we will also adopt the convention that . Our algorithm iterates across all unknown parameters, applying least squares to each in turn while conditioning on the current estimates of the remaining parameters. We repeat this until the estimates have converged, which we assess by checking whether the L2 norm of the difference between successive updates has dropped below a pre-chosen threshold. Below are the specific updates for each parameter involved in the iterative algorithm.
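The skeleton of such an alternating scheme can be sketched generically; the function names, the toy rank-one factorization used to exercise it, and the tolerance below are illustrative and not the paper's notation.

```python
import numpy as np

def alternating_least_squares(update_fns, params, tol=1e-8, max_iter=500):
    """Cycle through conditional least-squares updates until the L2 norm
    of the change in the stacked parameter vector falls below tol."""
    params = [p.astype(float).copy() for p in params]
    for _ in range(max_iter):
        old = np.concatenate([p.ravel() for p in params])
        for j, fn in enumerate(update_fns):
            params[j] = fn(params)               # conditional LS update of block j
        new = np.concatenate([p.ravel() for p in params])
        if np.linalg.norm(new - old) < tol:      # convergence criterion from the text
            break
    return params

# Toy illustration: rank-one factorization Y ~ u v' fit by alternating LS
u_true = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
v_true = np.array([0.5, -1.0, 2.0, 1.0])
Y = np.outer(u_true, v_true)
update_u = lambda p: Y @ p[1] / (p[1] @ p[1])    # LS update of u given v
update_v = lambda p: Y.T @ p[0] / (p[0] @ p[0])  # LS update of v given u
u_hat, v_hat = alternating_least_squares([update_u, update_v],
                                         [np.ones(5), np.ones(4)])
```

Each block update is the exact conditional minimizer, so the objective is non-increasing across sweeps; in the toy example the fitted outer product recovers Y after a couple of iterations.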

B.1 Update of

Let us first describe how to estimate given current estimates of the other parameters, for , and . Let us define

for , and the first time point is defined as

Lastly, we must define the following:

for , and . As we are estimating the parameters with least squares, our goal is to minimize the following quantity:

Taking the derivative of this expression with respect to , setting the derivative equal to zero, and solving for , we can see that our estimate is


B.2 Update of

To update we can separately estimate for . As described in the manuscript, many elements of will be zero by construction. To simplify notation, let be a vector containing only the nonzero elements of . For instance, if and , then . We will adopt the same convention for the vectors and , and we will let be the rows of corresponding to the indices in . We can now write our model as follows

Now we can define , where . Further, define to be a matrix where row is given by

Our least squares estimate of is then equal to
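Each such conditional update is an ordinary least-squares solve. As a generic sketch (the design matrix and coefficients below are illustrative, not the paper's quantities), the closed-form normal-equations solution agrees with a QR-based solver, which is numerically preferable to forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                 # illustrative design matrix
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(50)

# QR/SVD-based least-squares solve
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# Equivalent closed form from the normal equations: (X'X)^{-1} X'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```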

B.3 Update of

Once estimates of all parameters are obtained from the above algorithm, estimates of can be obtained using the following

Once these have been obtained, we can construct an estimate of the covariance matrix by first computing the sample covariance matrix defined by

Lastly, this covariance matrix is unstable unless is large relative to , which is not the case in the policing example described in the manuscript. To improve estimation, we will enforce sparsity on the inverse of by solving the following constrained optimization problem:

where is the space of all positive semi-definite matrices whose element is zero for any and that are not neighbors in the data. The resulting estimate can be inverted to provide our final estimate of .
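To see why the raw sample covariance is unusable here, note that when the number of time points is small relative to the number of units the sample covariance is rank-deficient and cannot be inverted. A minimal sketch with illustrative dimensions (not the NYC data's):

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_times = 40, 25                    # more units than time points
resid = rng.standard_normal((n_times, n_units))

S = np.cov(resid, rowvar=False)              # n_units x n_units sample covariance
rank = np.linalg.matrix_rank(S)              # at most n_times - 1 because of centering
# S is singular, so its inverse does not exist; some regularization, such as
# the sparsity-constrained precision estimate above, is therefore required.
```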

Appendix C Computational details for MCMC sampling

All unknown parameters have full conditional distributions of known form, and therefore the model can be fit using a standard Gibbs sampling algorithm. First we will detail the Gibbs sampling update for for . Let us define

for , and the first time point is defined as

Further, we must define the following:

for , and . Next, define to be the set of subjects who have not yet been exposed to the treatment. We will use the superscript to denote versions of all relevant vectors and matrices that have elements corresponding to indices not in set to zero. Specifically, let be defined such that

identical notation will be used for . For matrices, we will let be equal to except all values in row and column will be set to zero if . Lastly, we will let be an matrix with any elements in row and column set to zero for all . The remaining elements of will be set to the inverse of the submatrix of defined by indices in .

The update for then proceeds as follows: for sample

from a multivariate normal distribution with mean

and variance defined by

Now we will detail the update for , which can be done in a very similar manner to if structured properly. To simplify the updates, assume that we are trying to update one element in each row of simultaneously, implying that we are updating values at a time, one for each subject in the data. This is nearly identical to the situation posed when updating the vector , with some slight modifications needed. We again need to define a residual value as

where is equal to with the elements we are updating all set to zero. If we let be the index of that we are updating, we can define to be a diagonal matrix with the element equal to the element of . We will again use the superscript to zero out the relevant elements of all vectors and matrices as we did for the update of . We can then update the vector of values from a multivariate normal distribution with mean and variance defined by

This process is then iterated until all elements of have been updated. If each row in has exactly nonzero elements, then this process can simply be iterated times. If there is an unequal number of nonzero terms in the rows of , then this process is repeated times, where is the maximum number of nonzero elements in a row of . In this setting, rows of with fewer than nonzero elements can either recycle their nonzero elements, so that they are updated more than once per MCMC iteration, or the relevant matrices and vectors in the construction of and can have the indices corresponding to these rows set to zero. In the latter setup we update the parameters of from a multivariate normal with mean and variance given by only the elements of and that correspond to the indices being updated.
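Each of the Gibbs updates above reduces to drawing from a multivariate normal with a computed mean and covariance. A standard way to do this (names and the illustrative two-dimensional parameters below are not the paper's) is via a Cholesky factor of the covariance:

```python
import numpy as np

def draw_mvn(mean, cov, rng):
    """One draw from N(mean, cov) via the Cholesky factor of cov."""
    L = np.linalg.cholesky(cov)              # cov = L @ L.T
    return mean + L @ rng.standard_normal(len(mean))

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
draws = np.array([draw_mvn(mu, Sigma, rng) for _ in range(20000)])
```

Because the factorization is done once per update, repeated draws with the same covariance can reuse the factor.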

C.1 Posterior predictive distribution and causal effects

Once we have posterior samples of and , we can produce posterior samples of the outcome values in the absence of treatment, denoted by . For all time periods before the treatment is initiated, this value is observed and known. We will use time-series forecasting from our model to predict these values for time points after treatment initiation. Let be the first time point at which treatment is initiated anywhere in the study. The following algorithm generates the posterior draw from : for , perform the following steps:

  1. Draw values for and from the posterior distribution of both parameters.

  2. Obtain residuals from the previous time point as

  3. For the current time point, calculate

  4. If then draw from a multivariate normal distribution with mean and variance . If there exists an such that , then split the mean and covariance matrices as

    where represents the components of corresponding to subjects with . Further, corresponds to the covariance matrix of residual errors for the same set of subjects. We can also let and be defined similarly. We are able to observe , whereas we will draw from a multivariate normal distribution with mean

    and variance

    Now that we have posterior draws of , we automatically have posterior draws of the treatment effect at any time point as , the difference between the observed data (under treatment) and the prediction of what would have happened in the absence of treatment.
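Step 4 above is the standard conditioning formula for a partitioned multivariate normal: counterfactuals for treated units are drawn given the observed outcomes of not-yet-treated units. A self-contained sketch (index sets and the two-unit example are illustrative):

```python
import numpy as np

def conditional_mvn(mean, cov, obs_idx, obs_vals):
    """Conditional mean and covariance of the unobserved (treated) block of a
    multivariate normal, given the observed (not-yet-treated) block."""
    n = len(mean)
    mis_idx = np.setdiff1d(np.arange(n), obs_idx)
    mu1, mu2 = mean[mis_idx], mean[obs_idx]
    S11 = cov[np.ix_(mis_idx, mis_idx)]
    S12 = cov[np.ix_(mis_idx, obs_idx)]
    S22 = cov[np.ix_(obs_idx, obs_idx)]
    adj = np.linalg.solve(S22, obs_vals - mu2)        # S22^{-1} (x2 - mu2)
    cond_mean = mu1 + S12 @ adj
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return cond_mean, cond_cov

# Two units: unit 0 is treated, unit 1 is not yet treated and observed at 1.0
cm, cc = conditional_mvn(np.zeros(2),
                         np.array([[1.0, 0.5],
                                   [0.5, 1.0]]),
                         obs_idx=np.array([1]),
                         obs_vals=np.array([1.0]))
```

With unit variances and correlation 0.5, conditioning on an observed value of 1.0 gives conditional mean 0.5 and conditional variance 0.75 for the treated unit.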

Appendix D Additional simulation studies

Here we present additional simulation results based both on the NYC policing data and on fully simulated data sets in which we have full control of the data-generating process. We present results from four simulation studies that examine our approach in the following situations: (1) misspecification of the A matrix, (2) unmeasured confounding bias, (3) a diagonal matrix for in the simulations of the main manuscript, and (4) smoothed estimates of .

D.1 Misspecification of A

In this simulation we will use the same dimensions as the New York City data, which is and . We will, however, fully control the data generating process by simulating outcomes according to the model

where . We will specify where . The focus of this simulation is on , and therefore we will simply set and we will correctly assume a linear model in for this mean component. We will specify the diagonal elements of to be 0, and we will let for . We will apply our proposed procedure under two scenarios: 1) assuming that is only nonzero on the diagonal, and 2) assuming that the true nonzero elements are nonzero and that the diagonal elements of are nonzero. The first scenario is incorrect as it forces elements of to be zero that are truly nonzero, while the second approach allows the true nonzero elements to be nonzero.

Figure A.1: Comparison between diagonal and non-diagonal matrices. The top-left and top-right panels show box plots for the non-diagonal and diagonal matrices, respectively. The bottom-left panel shows the 95% credible interval coverage for both the diagonal and non-diagonal matrices. The bottom-right panel shows the ratio of the empirical standard errors between the non-diagonal and diagonal models for estimating .

In the top panels of Figure A.1, we can see that both approaches to estimating lead to unbiased estimates of treatment effects at all ten time points considered. This is expected, as our theory requires only that the working model be stationary, not that it equal the true data-generating model, in order to identify the treatment effect. The difference between the two approaches can be seen in the bottom panels of Figure A.1. The bottom-left panel shows that the correctly specified matrix attains the nominal interval coverage, while the diagonal matrix leads to slight under-coverage, particularly at later time points. The bottom-right panel shows that the correctly specified matrix leads to more efficient estimates than the misspecified one; however, the differences are not substantial.

D.2 Unmeasured confounding bias

In this simulation we will use the same dimensions as the New York City data, which is and . We will simulate outcomes according to the model