Gun violence and gun deaths are pressing public health and safety concerns in the United States, where tens of thousands of people die from gun-related injuries each year (gramlich2019pew). In the absence of federal legislation, states have implemented their own policies to reduce gun violence ranging from arming school teachers (green2018nyt) to restricting access to weapons with high capacity magazines (donohue2019assault).
In 2006 California introduced an innovative program designed to curb gun violence, known as the Armed and Prohibited Persons System (APPS). Both state and federal laws prohibit certain people from owning firearms, including those with felony convictions, history of mental illness, individuals subject to restraining orders and others (giffords2019). The APPS program maintains a list of all known firearm owners in the state, and checks this list against a record of new events that would prohibit an individual from owning a firearm. If this check finds an individual to be in possession of a firearm, the enforcement arm of the program attempts to recover the firearm. The APPS program has led to the removal of tens of thousands of firearms. See apps_paper for detailed discussion.
Our goal is to estimate the impact of APPS on the number of murders in California using annual, state-level data for the 50 states. Unfortunately, simple approaches are unlikely to perform well here; for example, the assumptions underlying the workhorse “difference-in-differences” regression approach do not hold in this application. In one recent analysis, apps_paper instead consider a variant of the synthetic control method (abadie2010synthetic). This approach, however, fails to address many important features of the problem, such as multiple outcome series, and does not lead to convincing uncertainty quantification in this case.
In this paper, we adapt a non-parametric Bayesian approach, multitask Gaussian Processes (MTGPs), to estimate the causal effect of APPS. Originally developed for the setting with multiple outcomes rather than multiple units (bonilla2008multi; alvarez2017gp), Multitask GPs separately parameterize dependence across time and dependence across units. This allows for a flexible and parsimonious panel data model that nests many existing approaches: Our primary model is a natural generalization of low-rank factor models commonly used in panel data settings (e.g., xu2017gsynth; athey2018matrix) with an additional prior that encourages smoothness in the underlying factors. We also take advantage of the large literature on the statistical and computational properties of Gaussian Processes (williams2006gaussian; gelman2013bayesian) to extend this model. Immediate extensions include: allowing for a count (Poisson) observational model; allowing for multiple outcome series; incorporating a mean model; and incorporating auxiliary covariates.
This flexibility, however, also comes with a cost. In settings such as ours, with relatively few units and (pre-treatment) time periods, estimates can be especially sensitive to the prior distributions. Thus, careful model checking is an important feature of the proposed workflow. In particular, we re-frame many standard panel data diagnostics in terms of standard Bayesian workflows, such as posterior predictive checks. This avoids many of the issues with, e.g., inflated Type I error due to pre-testing for parallel trends(roth2019pre).
In addition to diagnostics, the Bayesian approach also yields coherent uncertainty quantification, appropriately propagating uncertainty from the different sources of information in the model. In particular, the specific model we consider automatically produces uncertainty intervals that grow over time post-treatment. Surprisingly, this last point is a departure from many existing approaches, which sometimes have constant uncertainty intervals post-treatment. We also report many flexible summaries of the posterior distribution, allowing us to obtain posterior distributions for estimands like the benefit-cost ratio, a key quantity in policy analysis.
Finally, we adapt results from kanagawa2018gaussian showing that the Gaussian Process can be represented as a weighting estimator. We then use this representation to show that the overall weights decompose into separate unit and time weights. As a result, the proposed MTGP approach has the form of a doubly-weighted average over units and time periods, similar to recent proposals in the panel data literature (arkhangelsky2019synthetic; ben2018augmented).
Applying these ideas to the impact of APPS yields large, negative effect estimates on gun-related homicides in California. Our preferred Poisson model estimates an impact of several hundred prevented homicides per year, suggesting very large benefits from this program. In contrast, using the same methodology, we estimate that there is effectively no impact of APPS on non-gun-related homicides, as expected.
As we discuss in Section 2.3, our proposed approach combines the growing literature on Gaussian Processes for causal inference (see oganisian2020practical) with the robust literature on estimating causal effects for a single treated unit with panel data (see samartsidis2019assessing). Of particular relevance are modi2019generative and carlson2020gp, who also consider GPs in this context, albeit with different structures. More broadly, our proposed framework is a natural (Bayesian) generalization of recent proposals for low-rank factor models with panel data (e.g., xu2017gsynth; athey2018matrix); see also recent extensions from pang2020bayesian.
Our paper proceeds as follows. In Section 1.1 we give a brief overview of the APPS program and other institutional details. In Section 2 we describe the underlying causal problem and review related work. In Section 3 we review single- and multi-task Gaussian Processes and apply these ideas to our application setting. In Section 4 we represent the MTGP approach as a weighting estimator. In Section 5 we report model diagnostics and the estimated impact of APPS under different models. Finally, in Section 6 we discuss open questions and possible extensions. The Appendix includes additional computational details, diagnostics, and results. All code and replication files are available at github.com/darbour/mtgp_panel.
1.1 Armed and Prohibited Persons System
The core goal of the Armed and Prohibited Persons System program is to remove firearms from those who cannot own them under federal and state law. Two key parts of the program work in conjunction to achieve this: (1) a state-wide database of known gun owners; and (2) an enforcement operation that removes weapons from those who become ineligible to own firearms. This enforcement action, which began in December 2006, is the key innovation of the APPS program: the state has maintained a list of known firearm owners since the mid 1990s (CA_DOJ_2015). We provide a brief overview here; see apps_paper for further details.
The APPS program builds the list of all known gun owners from sales records and voluntary gun registrations, checking against other statewide databases that record criminal history, mental health reporting, restraining and protective orders, and more. If any of these daily checks show that a gun owner is now prohibited from owning a firearm, and a specialist confirms that they are indeed prohibited, the Bureau of Firearms (BOF) attempts to recover the firearm(s). This list has grown over time, with 900,000 known firearm owners in 2006 rising to 2.5 million in 2019. Of those 2.5 million, approximately 23,000 were armed and prohibited with cases at various stages of investigation (CA_DOJ_2019).
Oftentimes, the prohibited individual is no longer armed, having already relinquished their firearms to local law enforcement. If this is not the case, officers from the BOF attempt to contact the individual and search for firearms. Along with known firearms, on many occasions officers find additional, unregistered ones. In other cases, they also fail to recover the firearms, e.g. in instances where a gun has been stolen or given to a family member. Following the firearm seizures, the individuals are then removed from the APPS database. In decade after the start of the program, the BOF confiscated almost 32,000 firearms, with potentially many more retrieved by local law enforcement.
Figure 1 shows the annual homicide rate per 100,000 people in California versus the rest of the United States (with state-level estimates weighted by population); the dotted line indicates 2006, the year the APPS program was first launched (in December). The figure shows several striking patterns. First, the overall murder rate in the United States largely declined over this period, with a major increase in the final years of the panel. Second, the murder rate in California was substantially higher than the national average in the late 1990s and mid 2000s, but eventually converged to the rest of the country by the mid 2010s.
The methodological question is whether and how we can use this panel data structure to estimate the impact of APPS on the murder rate. Furthermore, we would like to separate out the effect on gun related and non-gun related homicides. The APPS program should primarily impact gun-related homicides by removing firearms from a high-risk population, though substitution effects may lead to changes in non-gun-related homicides. Therefore our methodological goal also includes estimating effects on multiple outcomes simultaneously.
Unfortunately, relatively simple approaches are unlikely to perform well here. For example, the workhorse “difference-in-differences” regression model assumes parallel trends: in the absence of the intervention, the differences between the murder rates in California and the rest of the country are constant over time. This is clearly violated in Figure 1, with much wider differences in the mid 2000s than in the late 1990s.
In a recent paper, apps_paper instead consider a variant of the synthetic control method (SCM) to estimate this effect, constructing a weighted average of other states (“synthetic control”), which closely matches California’s pre-treatment murder rate. apps_paper then compare the observed murder rate in California to that in “synthetic California,” finding a reduction of around 0.75 homicides per 100,000 people, roughly 10 percent below pre-intervention baseline.
While a promising initial analysis, several open questions remain. First, the estimated SCM weights on control units are sensitive to modeling choices (even if the overall SCM estimate is relatively stable), with little guidance on how to choose among them. Second, apps_paper propose a preliminary SCM analysis using multiple outcomes series: gun-related, non-gun-related, and overall homicide rates. However, the methodology for panel data with multiple outcomes is under-developed, and it is unclear how the approach used in apps_paper can be applied more generally. Relatedly, the number of homicides per state are inherently count outcomes, a fact ignored in existing SCM analyses. Finally, quantifying uncertainty with synthetic controls is challenging. apps_paper consider both placebo and conformal inference approaches (abadie2010synthetic; chernozhukov2017exact), which rest on versions of exchangeability either over units or over time. These assumptions are difficult to justify in our setting. Moreover, the uncertainty intervals from the conformal inference approach are constant over time even though, intuitively, we anticipate far greater uncertainty for the counterfactual murder rate in 2017 than immediately after the implementation of APPS in 2007. These concerns motivate our alternative approach, which we turn to next.
2 Setup and Background
2.1 Problem setup
We now describe the standard panel data setting with units observed for time periods. In our application, we observe states for years. Let be an indicator that unit is treated at time where units with never receive the treatment. While our setup is more general, we restrict our attention to the case where a single unit receives treatment, and follow the convention that this is the first one, . The remaining units are untreated. In our application, the treatment of interest is the APPS program, treated at time ; California is the treated unit and the remaining states are possible controls.
We describe our problem of interest in the potential outcomes framework. We assume no interference between units and stable treatment, which allows us to write the potential outcomes for unit in time under control and treatment as and respectively.111While this is not always the case, in our application, it is reasonable to assume that any state could launch an APPS-style program at any point. Thus, both potential outcomes could reasonably exist for all units and all time periods. Our primary outcome of interest is the overall murder rate; we consider multiple outcomes in Section 3.3.3. For ease of exposition, we initially ignore the presence of background covariates from our analyses; we discuss extensions to incorporate them in Section 3.3.1.
Since the first unit is treated at time , the observed outcomes are therefore:
The goal is to estimate the causal effect for the treated unit at each post-treatment time , , as well as the average treatment effect after time ,
Since we observe the treated outcome for the treated unit at
, the challenge is to impute the missing potential outcome for the treated unit at time. Note that we adopt a finite sample causal inference framework here: our goal is to estimate the impact for California, rather than for a hypothetical population (see imbens2015causal).
2.2 Bayesian causal inference
We adopt a Bayesian approach to estimating causal quantities. There are many subtle issues that arise in Bayesian causal inference that are not central to our discussion. For instance, identification in the Bayesian approach is conceptually distinct from identification in non-Bayesian causal inference: if the prior distribution is proper then the posterior distribution will also be proper, regardless of whether the parameters in the likelihood are fully or partially identified (see gustafson2010bayesian; imbens2015causal). See ding2018causal and oganisian2020practical for additional background. pang2020bayesian and menchetti2020estimating give overviews of Bayesian causal inference for panel or time series data.
At a high level, Bayesian causal inference views missing potential outcomes as unobserved random variables to be imputed(rubin1978bayesian; ding2018causal). We therefore break the problem into two parts: first, estimate a Bayesian model using the observed data; second, use this model to obtain the posterior (predictive) distribution for the missing potential outcomes. Specifically, let be the missing potential outcomes for the treated unit after , and let be the set of observed data, including all control outcomes, the treated unit’s pre-treatment outcomes, the treated unit’s observed outcomes post-treatment, and the set of treatment assignments. We introduce auxiliary covariates in Section 3.3.1.
The goal is to impute the missing potential outcomes, , by drawing from the posterior predictive distribution, , governed by a model parameter . We can then estimate at post-treatment time , where is a draw from the posterior predictive distribution for for unit 1 at time . To do so, we follow richardson2011transparent and use the fact that:
where is the posterior distribution of our model parameters and the missing potential outcomes given observed data. Now, to obtain the target posterior predictive distribution, we first specify a prior distribution and use Bayes rule to obtain the posterior distribution . With this posterior over the model parameters, we can repeatedly draw samples of , yielding the joint posterior on the left-hand side of the above expression. Obtaining the target posterior predictive distribution is then a matter of rearranging the above expression to isolate . Intuitively, this approach rests entirely on the form of the model and the estimated model parameters. See pang2020bayesian for an alternative framing that explicitly states sufficient conditions in the form of identifying assumptions.
Formally, we assume that units and time periods are exchangeable conditional on the model parameters , which have a prior distribution . Following richardson2011transparent, we partition the parameter space into , which governs the marginal potential outcome distribution, and , which governs the association between and . In principle, we could treat as a sensitivity parameter; for the purposes of this paper, we assume that the potential outcomes are conditionally independent given data and model parameters, allowing us to ignore . The posterior for is then:
where denotes the “control” unit and time periods: for and for other units. Let denote the posterior predictive distribution over the missing potential outcomes, . The estimated treatment effect at time is then the observed treated potential outcome minus the predicted missing potential outcome at post-treatment time , , where repeatedly drawing from the posterior predictive distribution propagates the uncertainty.
A key consequence of this setup is that we never use post-treatment data from the treated unit in our modeling (or, equivalently, we have an arbitrarily flexible model for the observed treated data). Rather, our focus is on modeling the observed control potential outcomes. And by imputing from the posterior predictive distribution, our inference is inherently finite sample, allowing us to directly reason about the effect of APPS on California’s murder rate (imbens2015causal).
Finally, while the setup here is quite general, in practice we restrict ourselves to a structure where the control potential outcomes are a model component plus additive noise.
Assumption 1 (Additive, separable error structure).
The control potential outcomes, , are generated as:
where is the outcome model for unit at time , governed by parameters ; and is mean-zero noise.
This assumption, sometimes referred to strict exogeneity in the panel data setting (see, e.g., imai2019use), places no restrictions on the model terms and . This allows for non-parametric modelling of the outcomes where is potentially infinitely dimensional — but does not allow treatment assignment to depend on the errors. In other words, given the model and the possibly infinitely many parameters , the error distribution for the treated unit’s (control) potential outcomes are exchangeable with other units’ errors. As we discuss in Section 6, we can relax this restriction within the fully Bayesian workflow, for instance, via sensitivity analysis (franks2019flexible).
2.3 Related work
Our paper bridges several robust literatures. First, we build on the many existing methods for estimating causal effects for a single treated unit with panel data; see samartsidis2019assessing for a recent review. Within this, there are several important threads. Most directly relevant are recent papers on Bayesian implementations of (or alternatives to) the synthetic control method, including (tuomaala2019bayesian; kim2020bayesian; pang2020bayesian; pinkney2021improved). There are also several recent papers that directly estimate factor models for causal effects (e.g., xu2017gsynth; athey2018matrix), as well as approaches that directly address multiple outcomes (samartsidis2020bayesian). Finally, in an important addition, causalImpact propose a Bayesian structural time series model for estimating causal effects, focused on a single treated series; see menchetti2020estimating for a recent extension. As we discuss below, we incorporate many of the novel ideas in these papers, including many of the prior choices. However, our proposed MTGP framework is typically more general, and, we believe, more conducive to estimating causal effects with panel data.
Next, there is a small but growing set of papers on the use of Gaussian Processes for estimating causal effects. To date, nearly all of these papers have focused on the cross-sectional setting (alaa2017bayesian; schulam2017reliable; ray2018semiparametric; huang2019gpmatch; branson2019nonparametric; witty2020gp; ren2021bayesian). See oganisian2020practical for a recent review.
We are aware of very few papers specifically using GPs to estimate causal effects in panel data settings, although, as we discuss in Section 4, many existing estimators can be written as special cases of this approach.222See also karch2018gaussian, who use GPs for latent growth curve modeling in psychometrics, and huang2015identification, who instead focus on causal discovery. Two working papers are especially relevant for our work. The first is recent work from modi2019generative
, who also propose a GP approach for estimating causal effects in this setting. Their main proposal, however, is very different than ours, with a focus on GP for the frequency domain. Thus, we view our model as a useful complement to theirs. A second recent proposal iscarlson2020gp, who also uses a GP approach but does not exploit the multitask structure. Finally, while not explicitly causal, of particular relevance to our approach is flaxman2015fast, who propose a hierarchical GP model that is similar to what we discuss here.
3 Multitask Gaussian Processes for Causal Inference
We now give a high-level description of Multitask Gaussian Processes for causal effects with panel data, beginning with the simpler setting of single-task GPs. We discuss specific model implementation choices in Appendix A.
3.1 Review: Single Task Gaussian Processes
We begin by reviewing the setup and properties of a Single-Task GP, which focuses solely on the treated unit; see williams2006gaussian for a textbook discussion. While the original GP setup is quite general, to fix ideas we initially focus on applying GPs to the interrupted time series or horizontal regression setting (athey2018matrix), in which we use the pre-treatment outcomes for California to forecast post-treatment outcomes in the absence of the intervention. Specifically, we initially model the (control) murder rate for California at time , , as the sum of a model component and a mean-zero, independent noise component (as in Assumption 1):
where is a Gaussian Process prior and
is the vector of model components for the treated unit. The key idea is that the GP prior overincorporates smoothness over time via a kernel function , where similar values of imply larger covariances. Here we use the squared exponential kernel:
where is the length scale.
The choice of kernel function plays an important role in defining the behavior of the model. For the family of kernels we consider, the relation between the time difference and the kernel value is central in determining how we expect the model to behave over time. If is large even for far-apart time periods, then the model components are assumed to be highly correlated, which corresponds to an assumption that the underlying model is very smooth over time. Conversely, if is low even for close time periods, then the model components are assumed to be close to independent, which corresponds to assuming that the model varies strongly over time. The length scale hyper-parameter explicitly controls the level of smoothness in the model components. Many other kernels are possible and appropriate for other settings, such as periodic kernels; see williams2006gaussian for a textbook discussion.
To write the MTGP model in matrix notation, let be the corresponding time kernel matrix, where . Then, we can write as:
Because the model is a multivariate Normal, the conditional distributions have a convenient form. Specifically, as in Section 2.2, we can partition the vector of control potential outcomes for the treated unit into the observed and missing components: , and . Suppressing the superscript time to reduce clutter, we can then partition this multivariate Normal model as:
where is the kernel matrix for observed points, is the kernel matrix for unobserved points, and is a rectangular matrix of kernel evaluations for pre- and post-treatment times, whose element is the kernel value . We can then write the the posterior predictive distribution for California’s outcomes at future time points, , as:
As we discuss in Section 4, the posterior mean is a linear combination of the pre-treatment observations, .
3.2 Multitask Gaussian Processes
We now turn to modeling the murder rate for all states jointly, rather than modeling California’s outcome series alone. The primary change is that each observation is now a unit-time pair , so the kernel measuring similarity between points is doubly-indexed: . The multitask GP framework, originally developed for the setting with a single unit but many outcomes (goovaerts1997geostatistics; bonilla2008multi), is a natural approach for capturing similarity across multiple dimensions. As in the single-task GP setting and Assumption 1, we assume that the control potential outcomes consist of a model component plus additive (Normal) noise:
where we want to flexibly model . While this framework is quite general, we focus on two common simplifications: allowing for separable unit and time covariances, and imposing a low-rank structure on the unit covariance.
Separable unit and time kernels.
The first key idea is to decompose the GP kernel into separate kernels that capture similarity across time (the “time covariance”) and similarity across units (the “unit covariance”):
where defines similarity between the units and defines similarity between the time periods. This is a strong restriction that implies that the covariance across units, , is constant in time. (Conversely, this implies that the time covariance is constant across units.) We discuss some approaches for relaxing this restriction in Section 6.
Low-rank unit covariance.
Even with the separable kernel restriction, the resulting model is overparameterized. In particular, the similarity between the time points, , has a natural parameterization in terms of the difference in time, , which is a scalar. By contrast, the similarity between the units, , involves comparisons. Following goovaerts1997geostatistics and bonilla2008multi, we therefore simplify the problem using the intrinsic coregionalization model (ICM), which imposes a low-rank assumption on the unit covariance (see also flaxman2015fast, Section 4.2). While initially proposed in terms of restrictions on the matrices, we can equivalently write the ICM model as an assumption that the outcome model for each of the units is a unit-specific linear combination of latent draws from a Gaussian process, i.e.,
where each latent is itself a GP with kernel ,333We can similarly write this restriction as , i.e., a linear kernel between weights for units and over the latent functions inferred from the data. and where each unit has its own set of unit-specific factor loadings with corresponding hyper-parameters, . Thus, the underlying model components are shared across units and induce a dependence across them. The number of shared latent functions corresponds to the rank of the unit kernel matrix . This is a key hyper-parameter governing the overall complexity of the model; a higher rank will lead to a more flexible model that potentially has the risk of overfitting. We discuss choosing the rank via Bayesian model criticism in Section 5.1.
The resulting model is then:
where . We estimate this model using the probabilistic programming language, Stan (stan); see Appendix A
for hyperprior distributions and additional implementation details.
As above, we can re-write this model in matrix notation. Define the unit covariance matrix as and the time covariance matrix as . Then we can represent the decomposition in Equation (4) as the Kronecker product between the two covariances:
where any one entry of can be written as the product of the unit and time covariance, as in Equation (4).
We can similarly divide this (admittedly unwieldy) kernel matrix into a matrix for observed unit-time pairs, , a matrix for the unobserved counterfactual post-treatment outcomes for the treated unit , and a rectangular matrix for the kernel between the outcomes for the observed unit-times and the post-treatment outcomes for the treated unit
. With this setup, the joint distribution of the missing post-treatment outcomes for the treated unit and the observed untreated outcomes are multivariate Normal:
The resulting conditional mean and variance have the same form as in Equations (2) and (3).
As we discuss in Section 4 below, there are two important special cases of this model, which correspond to setting either the time or unit covariances to be the identity. Setting the time covariance matrix to the identity, , corresponds to a model with no smoothness in the underlying latent factors, , allowing the function value to change arbitrarily between time points. The resulting approach is a (Bayesian) linear factor model (e.g., xu2017gsynth; athey2018matrix; samartsidis2020bayesian). In Section 4 below, we further connect this to “vertical regression,” which finds a set of weights over units that optimize some imbalance criterion (doudchenko2016balancing). On the other hand, setting the unit covariance matrix to the identity, , corresponds to a model in which there is no correlation in the underlying process for each unit, and each model component is an independent GP. The model therefore reduces back to the single-task GP discussed in Section 3.1 above. Finally, while we focus on the ICM model with separate time and unit covariances, there is a large literature on more elaborate generalizations, such as the semiparametric latent factor model and the linear model of coregionalization; see alvarez2017gp for a review.
Building on the broad GP literature, we can immediately extend the MTGP model above to better match our application. We focus on three main extensions here: incorporating a mean model; allowing for a non-Gaussian likelihood; and modeling multiple outcomes simultaneously.
3.3.1 Incorporating a mean model
Thus far, we have focused on mean-zero Gaussian Processes, . We can extend this to incorporate a mean model, :
A particularly natural model for the panel data setting is to set to have unit- and time-specific intercepts:
where is a global intercept, are the unit-specific intercepts, and are the time-specific intercepts. Here we place independent Normal priors on the unit-specific intercepts,444With independent priors, these are often referred to as fixed effects in the panel data literature. We could instead model random effects using a hierarchical prior. See hazlett2020understanding for discussion. and a global Gaussian Process prior on the time-specific intercepts.
The distinction between the mean function and the covariance matrix in the MTGP is somewhat artificial, and is typically chosen for clarity and to facilitate the choice of prior (kanagawa2018gaussian). To see this, consider a simplified MTGP in which the mean model only includes unit-specific intercepts, . This is equivalent to a mean-zero MTGP with kernel function:
where is the delta kernel:
Including the unit-specific intercepts in the mean function, however, allows us to directly specify reasonable priors and makes the role of these intercepts more transparent.
Finally, we can immediately extend the mean model to incorporate (time invariant) auxiliary covariates, :555With additional restrictions (e.g., “exogeneity”), we can extend this to include time-varying covariates as well. See imai2019use for discussion.
where is a vector of regression coefficients. As with the unit-specific intercepts, we could instead include this as part of the GP kernel, for example by incorporating it into the unit covariance.
3.3.2 Non-Gaussian Likelihoods and heteroskedasticity
Another natural GP extension is to allow for non-Gaussian observation models, which is essential for well calibrated uncertainty quantification. Analogous to using alternative link functions with generalized linear models (gelman2013bayesian, Ch. 21.3), we can model the control potential outcomes as:
where is an appropriate link function and is the latent GP factor. As above, we could also incorporate a mean model, . See naish2008generalized; hensman2015scalable for discussions of related computational issues.
We highlight two link functions here. First, the outcome in our application is the number of murders, which is more naturally modeled via a Poisson (or, perhaps, Negative Binomial) link. A Poisson model also captures the restriction that the variance in the murder rate increases with the mean murder rate. We use this as our primary model below. Alternatively, we could allow for a “robust” Gaussian Process by allowing for -distributed rather than Normal errors; see jylanki2011robust. In our application, the estimates using -distributed errors are nearly identical to those using Normal errors.
Finally, an important practical feature of our application is that populations vary considerably across states. Thus, we expect that the underlying variability in the murder rate will be much larger in, say Vermont or Wyoming than in California or Texas. With a Gaussian likelihood, we can parametrize this as:
where is the population for state at time . We can similarly allow for a population offset in a Poisson model.666More broadly, we could incorporate much more complex noise models, such as an additional GP on the noise term (e.g., naish2008generalized; hensman2015scalable). This is a challenging model to fit given our limited sample size. Allowing for heteroskedasticity by size is an important departure from many Frequentist panel data methods used in this setting; see samartsidis2019assessing.
3.3.3 Multiple outcomes
Finally, we can extend the MTGP approach to allow for multiple correlated outcomes, following the large literature on multi-output GPs (alvarez2017gp); see also samartsidis2020bayesian for a non-GP setting. This is particularly important for our application because we are interested in assessing the impact of APPS on both gun- and non-gun-related homicides.
To do so, we can write each data point as a unit-time-outcome triple, , with control potential outcome, . While many models are possible, we focus on a simple version with a shared factor structure across outcomes, which we can write as a multivariate Normal observation model with common covariance across units and time:
Equivalently, we can write this as an MTGP where the resulting kernel decomposes into separate unit, time, and outcome kernels:
Similarly, we can write this as a Kronecker structured overall kernel , as in flaxman2015fast. Unlike the unit kernel , which we restrict to be low-rank, the outcome kernel can be full rank because the number of outcomes (2) is less than the number of units (50).
4 MTGP as a weighting estimator
Our discussion thus far has been entirely Bayesian. Gaussian Processes, however, are fundamentally linked with (Frequentist) kernel ridge regression; a large literature exploits this connection to describe the Frequentist property of GPs; seekanagawa2018gaussian for a recent review. Here we adapt those results to our panel data setting. We first represent the MTGP estimate as a weighting estimator with separate unit and time weights, and give error bounds under fairly general conditions on the latent (noiseless) outcomes. We then describe several special cases to better illustrate this result.
4.1 Weighting representation
We now show that the MTGP estimate with separable unit and time kernels has a corresponding representation as a weighting estimator with separate unit and time weights. This setup differs from our fully Bayesian formulation above; to make this connection with weighting, we instead consider a Frequentist setup that assumes a known kernel. Specifically, conditional on hyperparameters, the MTGP estimate can be written as the solution to a constrained optimization problem that minimizes the worst-case mean-square error across the “noiseless” latent functions,. This representation exploits the connection between Gaussian Processes and kernel ridge regression (kanagawa2018gaussian); see hazlett2018trajectory for additional discussion of kernel weighting methods for panel data.
To set up the problem, we again assume that (control) potential outcomes are the sum of structural and error components, , where is mean-zero noise. We assume that is “well behaved” in the sense of belonging to a Reproducing Kernel Hilbert Space (RKHS); this setup is quite flexible and incorporates a wide range of function classes (see, e.g., wainwright2019high). We focus on the ICM kernel, , where, as in other Frequentist analyses, the kernel is assumed known.
We can then represent the MTGP as a weighting estimator. We consider two basic forms, corresponding to the weight-space and function-space interpretations of Gaussian processes (williams2006gaussian, Ch. 2). We first show that the MTGP minimizes the worst-case imbalance in the noiseless latent functions . We then show the equivalent formulation in terms of imbalance on the observed outcomes .
Let , where is a fixed function. Let be the RKHS for kernel with Hilbert-Schmidt norm, . Then let be contained in the unit ball of the reproducing Hilbert space, , where , and are iid mean-zero random variables with observed variance . Further, model the time covariance with a squared exponential kernel with length scale , and model the unit covariance . Finally, let and be the vectors of unit and time weights for target unit at time , and let and , respectively, be the set of weights across all targets. Then:
The unit and time weights, and , have the following closed form:
The posterior (predictive) mean estimate for target observation , for is:
The unit and time weights also solve the following, equivalent optimization problems, in terms of (1) the weight-space formulation:
and (2) the function-space formulation:
where and where and are the coefficient vectors for target unit at time , with and , and with corresponding sets and .
This estimate has the following out-of-training-sample estimation error for the structural component of the missing potential outcome ,
Proposition 4.1 begins with the algebraic result that the posterior mean estimate of the MTGP can be written as a weighting estimator, and that these overall weights separate into unit- and time-weights. This separation does not hold in general, but, in our setting with an ICM kernel, follows immediately from the restriction that the unit and time covariances are separable (bonilla2008multi). As a result, we can view our proposed MTGP approach as a doubly-weighted panel data estimator, similar to recent proposals from e.g. ben2018augmented and arkhangelsky2019synthetic that also impute the counterfactual as a double weighted average. Importantly, the MTGP implicitly estimates unit and time weights simultaneously, while previous proposals construct them separately.
Proposition 4.1 then leverages Frequentist results on GPs to provide additional intuition for these weights (see kanagawa2018gaussian). First, Equation (7) shows that the MTGP weights minimize the error on the noiseless outcome functions, subject to regularization on the weights. Importantly, this formulation does not rely on parametric assumptions, only that the noiseless outcomes, , are relatively “well behaved” in the sense of belonging to an RKHS. Equivalently, Equation (8) shows that the weights can be seen as the solution to an equivalent optimization problem in terms of minimizing imbalance in the observed outcomes directly, but subject to regularization on the “coefficients” and , rather than on the weights themselves. Finally, building off this, Equation (9) shows that the error bound depends directly on the distance between the control observations, , and the unobserved point, . Specifically, with a squared exponential kernel, , the bound is increasing for . See fiedler2021practical for recent generalizations of such bounds under model mis-specification.
We can incorporate a (prior) outcome model into the posterior mean estimate, echoing several recent proposals that combine outcome modeling and weighting in the panel data setting (see, e.g. ben2018augmented; ferman2019synthetic; arkhangelsky2021double):
where is the outcome (prior mean) model for unit at time . Equation (10) has a form similar to bias correction for matching (rubin1973matching), which corrects the doubly-weighted average by an estimate of the difference in the outcome model . Equation (11) instead has the form of augmented inverse propensity score weighting (robins1997causal), which re-weights the outcome model residuals.
4.2 Special cases
To help build intuition for the weighting representation, we consider two special cases: (1) the unit kernel is the identity; and (2) the time kernel
is the identity matrix.
Identity unit covariance.
First, let the unit covariance be the identity matrix. We can view this as the setting where there is no correlation between the underlying processes for each unit, so each model component is an independent GP. In this setting, there is nothing to be learned from the comparison units, and the MTGP estimate only compares to the pre-treatment outcomes of the treated unit. In particular, following the setup in Proposition 4.1, if we set the unit covariance matrix to be the identity, then the weights for post-treatment time period are
with posterior mean estimate:
This is a horizontal regression formulation (see athey2018matrix). As these weights ignore the outcomes for the comparison units, we can view this case as fitting separate single-task GP models, one for each unit. Thus, when the unit covariance is the identity matrix, this problem ignores the outcomes for comparison units and reduces to the single-task GP for the treated unit discussed in Section 3.1.
Identity time kernel.
Next, let the time kernel be the identity matrix. This corresponds to the setting where there is no smoothness at all in the underlying latent functions, so the function value can change wildly between time periods. In this case, it is impossible to use outcomes from different time periods to inform our estimates, and so only the unit weights remain. Again following the setup in Proposition 4.1, if the time covariance matrix is the identity, then the weights for the treated unit are:
with posterior mean estimate:
This is a vertical regression setup (see doudchenko2016balancing), and corresponds to the implicit weights of a linear factor model or interactive fixed effects model (gobillon2016regional; xu2017gsynth; athey2018matrix). The estimator only uses the learned unit covariance to implicitly construct weights over comparison units.
5 Model diagnostics and estimated impact for APPS
We now use the MTGP framework to estimate the impact of APPS on homicides. We begin by using posterior predictive checks to guide the choice of model. We then estimate the impacts both on overall homicides and separately on gun- and non-gun-related homicides.
5.1 Posterior Predictive Checks and other model diagnostics
Diagnostics and other forms of model criticism are important components of the Bayesian workflow, helping researchers diagnose and compare the fit of the proposed models (gelman2013bayesian). We begin with posterior predictive checks
(PPCs). First, we choose a test statistic,, which reflects a relevant aspect of model fit and is a function of both data and model parameters. We then compare the posterior predictive distribution of this statistic, , to the value of the test statistic for the observed data, , over draws of .
PPCs play an especially important role in panel data settings like ours, where it is often difficult to decide between modeling approaches given limited data. In particular, posterior predictive checks avoid many of the inferential issues associated with, for example, testing for parallel trends prior to estimating a difference-in-differences model (roth2019pre). See liu2020practical for a review of Frequentist approaches to model checks for panel data, including equivalence and placebo tests.
We focus on two common measures of model fit: the overall pre-treatment fit and the pre-treatment imbalance at each time point. First, we evaluate the pre-treatment fit for California by comparing the posterior predictive distribution of the root mean squared error (RMSE),
to the observed RMSE, , where is a posterior predictive draw of the overall murder rate for California at (pre-treatment) time . Figure 2 shows the distributions for both Gaussian and Poisson observation models with ranks . For each combination, we also compute the posterior predictive p-value, the fraction of posterior predictive test statistics that are larger than the observed test statistic. For both Gaussian and Poisson models, the observed RMSE is larger than we would expect for ranks 0 through 3, with minimal discrepancies by rank 5 for both models. The Poisson model has slightly lower RMSE, suggesting that Poisson with rank 5 is a reasonable initial choice here.
Figure 3 shows the imbalance at each pre-treatment time point for the Poisson MTGP model with rank 5. The upper pane shows the observed data (in red) as well as posterior predictive draws from the model on the scale of the original outcome (homicides per 100,000). The lower pane instead shows the “gap” between the outcome and the posterior mean, .777This is the Bayesian analog to estimates of placebo impacts that are standard in Frequentist panel data studies, although we compare the observed estimates to the posterior predictive distribution rather than to zero. Both plots show little difference between the observed data and the posterior predictive distribution, suggesting reasonable model fit along the dimensions we investigate here.
Finally, the Appendix includes several additional model diagnostics. Appendix Figure C.2 shows the coverage of the 50% and 95% posterior predictive intervals for pre-treatment outcomes across all 50 states, computed as where and
are the lower and upper endpoints of the credible region with probabilityfor unit at time . The Poisson intervals have close to nominal coverage for ranks 4 and 5, but under-cover for lower ranks and over-cover for higher ranks. By contrast, the corresponding Gaussian intervals over-cover across all ranks, with especially poor coverage for the 50% intervals for higher ranks, leading further credence to the Poisson model over the Gaussian model. This difference is likely due to the fact that the sampling variance increases with the mean in the Poisson model but is constant in the mean and across units for the Gaussian model.
In Appendix Figures C.3 and C.4 we show the inferred Gaussian Process for the time-specific intercept and the estimated factors from the rank 5 Poisson model next to the factor loadings for California. In Appendix Figures C.5 and C.6, we show the inferred time and unit weights for California in the first post-treatment period using a rank 5 Gaussian multitask Gaussian process without the global time series trend GP. Each of the time and unit weights are obtained by marginalizing over the non-relevant component (unit, or time, respectively) for each sample from the MCMC sampler.
5.2 Impact of APPS on homicides
We now turn to estimating the impact of APPS on homicides in California. We begin with the overall impact and then disaggregate by gun and non-gun-related homicides.
Overall impact of APPS.
Based on our model diagnostics above, we focus on the MTGP estimated with rank , Poisson observation model, and unit- and time-specific intercepts. Figure 4 shows the estimated impact of APPS on homicide rates in California over time.888Appendix Figure C.10 shows the corresponding estimates after adjusting for auxiliary covariates and several measures of crime, all for 2005. The auxiliary covariates are: the prison incarceration rate, the age distribution (percent 0-17, 18-24, 25-44, 45-64, 65 and older), the percent Black, the unemployment rate, the poverty rate, and log median income. The crime measures, normalized by population, are: overall violent crime, rape, robbery, assault, property crime, burglary, larceny, and motor/vehicular crime. As the results are largely unchanged, we focus on the unadjusted estimate here. The upper pane shows the posterior predictive distribution from the MTGP, with the posterior mean as well as 50% and 95% posterior predictive intervals. The red line shows the observed murder rate for California. The estimates pre-treatment repeat the pre-treatment imbalance checks in Figure 3; the counterfactual estimates post-treatment are consistently higher than the observed murder rate. The lower pane shows this difference explicitly, where zero is no impact. These estimates suggest that APPS reduced murders in the state, with both impacts and uncertainty growing over time.
shows the posterior estimates for the average annual impact. The posterior mean estimate is roughly 1.4 murders avoided per 100,000 people per year, with a 95% credible interval of
murders per 100,000 people; the posterior probability that the impact is negative is 99.6%. To contextualize these estimates, California’s annual murder rate in 2004 and 2005 was 6.8 per 100,000. Focusing on the smaller end of the 95% credible interval suggests a decline of at least 8 percent from baseline. We can also compare the annual number of murders prevented by this system to expenditures for APPS investigative teams in a single year, 2018, when California’s population was 39.6 million. A decrease in the murder rate of 0.5 corresponds to nearly 200 murders avoided per year, while the total funding for APPS investigation teams in fiscal year 2017-2018 was $11.3 million(petek2019budget). Dividing this cost by the number of avoided murders gives a (conservative) estimate of roughly $50,000 per murder prevented, dramatically lower than the estimated benefit from the avoided crime (heaton2010hidden; dominguez2015role). We include the posterior distribution of expidentures per murder avoided in Figure 5.
Separate impacts on gun- and non-gun-related homicides.
Substantively, an important check is to estimate separate impacts of APPS on gun- and non-gun-related. In particular, we expect the impact of APPS to be entirely on gun-related homicides, with negligible impacts on non-gun homicides except through possible substitution effects. Following apps_paper, we also apply our approach to a different data source that disaggregates overall homicides by whether a firearm was used (kaplan2019).
Figure 6 shows estimates for gun- and non-gun-related homicides, estimated via a joint multitask GP. Fitting both outcomes jointly is advantageous since these two outcomes are positively correlated due to shared latent factors. (See Appendix Figure C.11, for the posterior predictive correlation between gun and non-gun related homicides for California). While estimates are derived from a different data source than used above, the impacts for gun-related homicides are nearly identical to the estimated overall impacts in Figure 5. In contrast, the estimated impact of APPS on non-gun-related homicides is essentially zero.
This paper uses a multitask Gaussian Processes framework in a panel data setting to estimate the causal effect of an innovative gun control program in California. MTGPs find a natural middle ground between pure “horizontal” time series forecasting models, based on extrapolating the treated unit’s outcomes series alone, and pure “vertical” regression models, based on a weighted average of control units’ outcomes. Further, MTGPs automatically quantify the relevant sources of uncertainty, providing a unified Bayesian framework in a setting where uncertainty quantification is notoriously challenging. They also allow for extensive model diagnostics via posterior predictive checks. Building on a robust GP literature, there are also many immediate extensions to the standard MTGP setup, including allowing for multiple outcomes and incorporating a mean model. Using this flexible approach, we find large effects of APPS on homicides in California, with program benefits far exceeding costs under typical assumptions.
There are several promising directions for future research. First, there are many existing results for Gaussian Processes that would apply immediately in other settings but are not as relevant in our application. For instance, standard panel data models face a range of practical challenges when the data structure deviates from a so-called “balanced panel,” either due to irregularly sampled observations over time or due to missing data (see imai2019use). The MTGP framework naturally accommodates both irregularly sampled time points and missing values, and, unlike most Frequentist approaches, will automatically propagate the corresponding uncertainty.
We can also extend this approach to allow for multiple treated units that adopt treatment over time, also known as staggered adoption (e.g., ben2019synthetic). First, we can immediately extend to the case where multiple treated units adopt treatment at the same time (i.e., simultaneous adoption) by changing the target to the average of the treated units. However, the extension to the case when units adopt over time (i.e., staggered adoption) is slightly more complicated. For instance, we could adapt the model to have a separate kernel for each treatment time, possibly including a hierarchical structure on the kernels (flaxman2015fast).
Next, we can develop a framework for assessing sensitivity to departures from key modeling assumptions for the MTGP approach, especially the restriction that the time and unit kernels are separable. While there are many possible generalizations (see, for example alvarez2017gp), these are inherently under-specified in settings like ours with relatively few units and time periods. Thus, a sensitivity analysis approach in the spirit of franks2019flexible is a natural avenue for appropriately quantifying uncertainty in this case.
Finally, our proposed MTGP approach implicitly puts the same weight on the error for California as on other states, even though California is the target unit. We can explore alternative approaches for parameterizing similarity across units and time, such as by imposing an “arrowhead” structure on the overall covariance, or by learning a richer dependence structure across units; see, for example, li2020negative. Moreover, our approach focuses on modeling the control potential outcomes for California, in the spirit of finite sample causal inference — but allowing for potentially arbitrary treatment effects. We could also directly model the treatment effect, including by incorporating a prior, as part of the MTGP framework.
Appendix A Estimation and implementation choices
a.1 Single outcome
Both the Gaussian and Poisson model share the following three components to model the latent process
, is given an independent offset drawn from a standard normal distribution,, providing a unit-specific intercept.
The global trend is modeled as a single-task Gaussian process with lengthscale ,
where is a covariance matrix with each entry given by the squared exponential kernel with lengthscale, , i.e. .
Individual components are modeled with a multitask Gaussian process with a rank task-covariance matrix,
The variance for a Normal outcome is drawn from
With is a squared exponential kernel of the same form as
. For the Gaussian link, a shared standard deviation is drawn,, and the observed control outcomes for each unit, , are modeled as
where is a matrix containing the population for each state for all time periods. Similarly, for the Poisson link function the observed untreated outcomes for each unit is modeled as
where is an offset.
a.2 Multiple outcomes
The multi-outcome models analogously use the following components,
The hyperparameters of the global and individual kernels are drawn as
The global time trend for each outcome is drawn as
Individual components are modeled as
Each unit is given an independent intercept for each outcome drawn from a standard normal distribution .
The covariance between outcomes is modeled as
and the latent function is drawn as
The variance for each Normal outcome is drawn from
Finally the probability of each outcome is modeled as
for the normal outcome for unit and outcome , and
for the Poisson.
a.3 Additional computational details
Naive estimation of the latent variable MTGP formulation used in our model has complexity due of the cost of taking the Cholesky decomposition of , with an additional complexity from multiplying the Cholesky decomposition with an length draw from a standard normal. This can be reduced by utilizing properties of the Kronecker product, namely that
Which avoids performing Cholesky decomposition on the combined matrix , and instead entails taking separate decompositions of the matrices individually which results in a complexity of . We reduce complexity further by taking advantage of the form of , where which, when is sufficiently small, is substantially smaller than . The sampling procedure then consists of drawing , and taking . The final complexity with this low rank parameterization is then which provides substantial advantage when is sufficiently large. Finally, we note that in cases where is large, a number of methods can be used to improve computational efficiency (see solin2020hilbert; wilson2015thoughts; flaxman2015fast).
Appendix B Proofs
The proof largely follows from propositions in kanagawa2018gaussian, presented here for the sake of completeness.
The form of the weights follows directly from the conditional expectation of the multivariate normal in Equation (2). Specifically, the posterior (predictive) mean of the MTGP for unit at time is:
for general weights for target observation .
The separation of the overall weights into the product of unit and time weights and follows from fundamental properties of kernels. Specifically, the form of the ICM kernel is given by the product of its constituent components, i.e.