1 Introduction
In a wide range of scientific analyses, the quantity of interest is the causal effect of an exposure $x$ on an outcome $y$, which may vary across space. Consider, for instance, the causal effect of fertilizer usage on crop yield. The effect, denoted $\tau$, quantifies how the average crop yield changes per unit change of fertilizer quantity. Due to variations in the soil, $\tau$ will vary as a function of spatial location $s$.
The goal in this paper is to infer the effect from a finite dataset $\mathcal{D} = \{(x_i, y_i, s_i)\}_{i=1}^{n}$. Unmeasured confounding variables, however, render this problem nontrivial. Consider the case illustrated in Figure 0(a), where there is a positive association between exposure and outcome yet the causal effect is zero, i.e., $\tau \equiv 0$. The causal structure underlying the data-generating process can be expressed using structural equations and summarized by a directed acyclic graph (DAG) (Peters et al., 2017; Pearl, 2009; Hernán & Robins, 2019).
The structure is illustrated in Figure 0(b), where $z$ denotes an unmeasured confounding variable which gives rise to a strong association between $x$ and $y$ despite there being no causal effect. This situation is referred to as spatial confounding.
The bulk of the existing literature considers a fixed effect across space and, moreover, treats the confounder as residual spatial variation using a spatial random effect (SRE) model (Cressie, 1992). For example, Reich et al. (2006) estimate the effect of socioeconomic factors on stomach cancer incidence ratios using a Gaussian prior model for the spatial variation. Such approaches end up biasing the effect estimates, and various restricted spatial regression techniques attempt to mitigate these problems, cf. (Hodges & Reich, 2010; Hughes & Haran, 2013; Hanks et al., 2015; Paciorek, 2010; Page et al., 2017). The geoadditive structural equation model (gSEM) of Thaden & Kneib (2018) eliminates the bias problem by exploiting the causal structure underlying the data-generating process. Specifically, they employ a linear Gaussian structural equation model and implicitly remove the spatial variation in the exposure and outcome due to the confounder. While it is capable of removing systematic errors, the gSEM method is only applicable to discrete space and homogeneous or fixed causal effects. The effect may, however, be heterogeneous with respect to spatial location. For example, when applying the method to study the effect of work experience on monthly income (Thaden & Kneib, 2018), it is quite possible that experience has a greater effect in regions with dense industries and high-skilled jobs than in areas dominated by low-skilled jobs.
Existing methods for heterogeneous causal effects build upon statistical machine learning but do not address spatially varying effects specifically and are, moreover, mainly restricted to binary or categorical exposures (Imai et al., 2013; Wager & Athey, 2017; Künzel et al., 2017; Nie & Wager, 2017). The standard formulation of the inference problem requires estimating a nuisance function, which leads to significant biases when inferring the effect, even in the large-sample regime. The orthogonalized formulation introduced by Chernozhukov et al. (2016) circumvents this problem but results in multiplicative errors that are not addressed in the finite-sample regime.
In this paper, we develop an inference method based on the orthogonalized formulation that
- can infer heterogeneous causal effects with respect to spatial location,
- is operational in discretized as well as continuous space, and
- provides robust and conservative estimates in the finite sample regime.
The outline of the paper is as follows. In Section 2, we present the problem setup and define the heterogeneous effect. In Section 3, we introduce a model of the effect and develop a robust orthogonalized formulation. The resulting method is demonstrated on both simulated and real data in Section 4.
Notation: $\operatorname{col}(a, b)$ stacks both elements into a single column vector; $\otimes$ denotes the Kronecker product; $\widehat{\mathbb{E}}[\cdot]$ denotes the sample mean.

2 Problem Formulation
We consider continuous exposures $x \in \mathbb{R}$ and outcomes $y \in \mathbb{R}$. Spatial location or region is indexed by $s$, where $s \in \mathbb{R}^d$ if space is continuous, or $s \in \{1, \dots, R\}$ when space is partitioned into $R$ discrete regions.
Let $y(x)$ denote the counterfactual outcome had the exposure been assigned to $x$, contrary to the observed value. The causal effect of the exposure at location $s$ is then defined as

$$\tau(s) \;\triangleq\; \frac{\partial}{\partial x}\,\mathbb{E}\big[y(x) \mid s\big]. \qquad (1)$$
We consider scenarios in which the conditional independence

$$y(x) \;\perp\!\!\!\perp\; x \mid s \qquad (2)$$
holds. Figure 2 shows examples of causal structures for which (2) is valid. To illustrate the effect of an unmeasured confounding variable, we consider a simple example based on the structure in Fig. 1(a).
Let space be continuous and one-dimensional. We generate the data shown in Fig. 0(a) from the structure corresponding to Fig. 1(a), where $z$ is an unmeasured confounding variable described as a spatial Gaussian process with a Matérn-plus-white-noise covariance function (Williams & Rasmussen, 2006) and $\varepsilon$ is white Gaussian noise. We consider two cases of effects:

Case 1: Fixed effect $\tau$. A standard approach is to parameterize the unknown effect by a constant and to adjust, or control, for the unmeasured $z$ by treating it as spatially structured noise (Cressie, 1992). The effect is then readily estimated using a generalized least-squares method that weights the residuals based on the covariance structure of $z$. Fig. 2(a) shows the resulting estimate and its confidence interval (CI) along with the true effect. As can be seen, the estimate is biased and provides misleading inferences. By contrast, the method we propose in this work produces an estimate whose CI clearly contains the true effect; see Fig. 2(b).
Case 2: Heterogeneous effect $\tau(s)$. In some applications, the effect may vary significantly across space and even change sign. Unlike the standard approaches, the proposed method can also estimate a spatially varying effect, as shown in Fig. 2(c). The estimate closely follows the true effect, whose sign changes across space.
The inability to control for unmeasured $z$ by exploiting its spatial correlation structure arises because the variable is taken to be independent of the exposure (Paciorek, 2010). As we see in the example, however, $z$ is highly correlated with $x$.
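To make the example concrete, the confounding mechanism can be simulated as follows. The structural form below ($z$ driving both exposure and outcome, with zero true effect) and all numerical settings are our own illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
s = np.sort(rng.uniform(0, 10, n))          # 1-D spatial locations

# Matern-3/2 plus white-noise covariance for the unmeasured confounder z
d = np.abs(s[:, None] - s[None, :])
ell = 1.0
K = (1 + np.sqrt(3) * d / ell) * np.exp(-np.sqrt(3) * d / ell) + 0.01 * np.eye(n)
z = rng.multivariate_normal(np.zeros(n), K)

# Structural equations (assumed form): the true causal effect is tau = 0
x = z + 0.3 * rng.standard_normal(n)        # exposure driven by z
y = z + 0.3 * rng.standard_normal(n)        # outcome driven by z only

# Naive regression of y on x is badly biased away from tau = 0
tau_hat = np.cov(x, y)[0, 1] / np.var(x)
print(f"naive estimate of tau: {tau_hat:.2f} (true value 0)")
```

The fitted slope is close to one rather than zero, reproducing the spurious positive association of Fig. 0(a).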
We now formulate the problem in the general case. Building upon (2) and modularity (Pearl, 2009; Richardson & Robins, 2013; Hernán & Robins, 2019), we obtain the identity

$$\mathbb{E}\big[y(x) \mid s\big] = \mathbb{E}\big[y \mid x, s\big]. \qquad (3)$$
Locally at $s$, we assume that the expected outcome (3) is well approximated by an affine function of the exposure $x$. Then it follows from (1) that the effect is a function of space alone, that is, $\tau = \tau(s)$. By combining (1) and (3), we get

$$\frac{\partial}{\partial x}\,\mathbb{E}\big[y \mid x, s\big] = \tau(s).$$

After integrating both sides with respect to $x$, this yields

$$\mathbb{E}\big[y \mid x, s\big] = \tau(s)\,x + \gamma(s), \qquad (4)$$

which implies the following form of the outcome

$$y = \tau(s)\,x + \gamma(s) + \varepsilon, \qquad (5)$$
where $\gamma(s)$ is a nuisance function and $\varepsilon$ is a zero-mean error term that is uncorrelated with any function of $x$ and $s$. We see that in the general case, $\gamma(s)$ will be correlated with a spatially varying exposure. Omitting it, as in the example above, therefore gives rise to bias due to spatial confounding. On the other hand, taking this term directly into account requires estimating a high-dimensional nuisance function jointly with the effect $\tau(s)$. This leads to non-negligible biases in the finite-sample case, as pointed out by Chernozhukov et al. (2016) (illustrated in Fig. 4 below). These limitations motivate an alternative approach to infer $\tau(s)$ from $\mathcal{D}$.
3 Method
In this section, we begin by deriving an orthogonalized formulation of the problem that circumvents the need to directly estimate the nuisance function $\gamma(s)$ in (5). Then we develop a robust errors-in-variables method that yields heterogeneous effect estimates as a function of the spatial location $s$. Finally, we specify a spatial predictive model that we employ in the subsequent experimental section.
3.1 Orthogonalized formulation
Consider residualized versions of the outcome and exposure, $\tilde{y} = y - \mathbb{E}[y \mid s]$ and $\tilde{x} = x - \mathbb{E}[x \mid s]$, respectively. These residuals are orthogonal to any function of $s$, and in combination with (4) and (5) we obtain an orthogonalized formulation

$$\tilde{y} = \tau(s)\,\tilde{x} + \varepsilon, \qquad (6)$$

which eliminates the nuisance function $\gamma(s)$.
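The elimination can be verified directly. Writing the outcome model of (5) as $y = \tau(s)\,x + \gamma(s) + \varepsilon$ with nuisance $\gamma(s)$ and zero-mean error $\varepsilon$ (symbols are our reconstruction), taking conditional expectations given $s$ and subtracting gives

\begin{align*}
\mathbb{E}[y \mid s] &= \tau(s)\,\mathbb{E}[x \mid s] + \gamma(s), \\
y - \mathbb{E}[y \mid s] &= \tau(s)\,\big(x - \mathbb{E}[x \mid s]\big) + \varepsilon,
\end{align*}

so the residual-on-residual relation carries the effect $\tau(s)$ while $\gamma(s)$ cancels.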
To estimate the effect, which is a general function of $s$, we consider a flexible parametric model class

$$\tau(s) = \varphi(s)^\top \theta, \qquad (7)$$

where $\varphi(s)$ is a spatial basis vector described below and $\theta$ is a parameter vector. Thus the effect is directly identifiable from the residuals given in (6).
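If the residuals in (6) were available exactly, the parameters of the model class (7) could be recovered by ordinary least squares on the products of exposure residuals and basis functions. A minimal sketch, where the basis choice, true coefficients and noise level are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
s = rng.uniform(0, 1, n)                      # 1-D spatial locations

# Illustrative basis [1, s], so tau(s) = theta0 + theta1 * s
Phi = np.column_stack([np.ones(n), s])
theta_true = np.array([1.0, -2.0])            # tau(s) = 1 - 2s changes sign
tau = Phi @ theta_true

x_t = rng.standard_normal(n)                  # oracle exposure residuals
y_t = tau * x_t + 0.1 * rng.standard_normal(n)  # orthogonalized outcome (6)

# Regress y_t on the regressors x_t * phi(s): LS recovers theta
Z = x_t[:, None] * Phi
theta_hat, *_ = np.linalg.lstsq(Z, y_t, rcond=None)
print(theta_hat)   # close to [1, -2]
```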
3.2 Robust errors-in-variables approach
In practice, however, the conditional means $\mathbb{E}[x \mid s]$ and $\mathbb{E}[y \mid s]$ are unknown and are replaced by empirical predictors $\hat{\mu}_x(s)$ and $\hat{\mu}_y(s)$. Thus the residuals can be written as

$$\hat{x} = x - \hat{\mu}_x(s) = \tilde{x} + \delta_x, \qquad (8a)$$
$$\hat{y} = y - \hat{\mu}_y(s) = \tilde{y} + \delta_y, \qquad (8b)$$

where $\delta_x$ and $\delta_y$ denote errors. Let

$$\tilde{z} = \tilde{x}\,\varphi(s) \qquad (9)$$

denote an unobservable random vector. Plugging (8) and the parametric model from (7) into (6), we obtain the following relation

$$\hat{y} = (\hat{z} + \delta)^\top \theta + \varepsilon, \qquad (10)$$

where $\hat{z} = \hat{x}\,\varphi(s)$ is observable and $\delta = \tilde{z} - \hat{z}$ is an unknown error (the error $\delta_y$ is absorbed into $\varepsilon$). Using (10), it is therefore possible to estimate the effect as

$$\hat{\tau}(s) = \varphi(s)^\top \hat{\theta}, \qquad (11)$$

where the parameter vector $\hat{\theta}$ is learned from the data. This can be done using the least-squares (LS) method, which ignores the unknown deviation $\delta$ in (10). It was shown by Chernozhukov et al. (2016) that LS nevertheless has good asymptotic properties in the case of fixed effects, that is, when $\tau(s) \equiv \tau$. The deviation $\delta$, however, yields errors that are correlated with $\hat{z}$, so that the LS-based estimate exhibits notable finite-sample biases.
To tackle this problem, we propose using a robust learning method that takes into account the errors-in-variables arising from the unknown deviations. Specifically, we formulate a convex minimax problem

$$\hat{\theta} = \arg\min_{\theta}\,\max_{\Delta \in \mathcal{U}}\; \big\| \hat{y} - (\hat{Z} + \Delta)\,\theta \big\|_2, \qquad (12)$$

where $\hat{Z}$ stacks the regressor vectors $\hat{z}_i^\top$ row-wise and $\mathcal{U}$ denotes an uncertainty set for the deviation $\Delta$. In this way the method mitigates the worst-case deviations in the set, which we now determine on conservative statistical grounds.
When using an empirical predictor of the exposure, it is reasonable to assume that the error term $\delta_x$ in (8) tends to be smaller than the residual $\tilde{x}$ in terms of their second-order moments. This assumption motivates a conservative upper bound on the unknown deviation, as specified by the uncertainty set $\mathcal{U}$. Using (12), we ensure that the estimated effect is robust against the worst deviations in the set.
To provide further insight into (12), we may reformulate the minimax problem as

$$\hat{\theta} = \arg\min_{\theta}\; \big\| \hat{y} - \hat{Z}\theta \big\|_2 + \sum_j \lambda_j\,|\theta_j|,$$

using Theorem 1 in Xu et al. (2009), where the weights $\lambda_j$ are given by the bounds that define $\mathcal{U}$. Note that the first term is the square root of the LS criterion and the second term results in a data-adaptive regularization that mitigates overfitting and biases the estimator against spurious effects. Moreover, the problem can be solved efficiently using the algorithm by Zachariah & Stoica (2015). This property enables the efficient computation of bootstrap confidence intervals for the effect. We use the pivotal bootstrap method (Wasserman, 2013) to obtain CIs below.
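A minimal sketch of the square-root-LS-plus-penalty reformulation, solved by proximal gradient descent. We use a single uniform weight `lam` instead of the data-adaptive weights, and the solver itself is our own simple stand-in for the algorithm of Zachariah & Stoica (2015):

```python
import numpy as np

def robust_theta(Z, y, lam, n_iter=20000):
    """Proximal-gradient solver for the regularized reformulation of (12):
        min_theta ||y - Z theta||_2 + lam * sum_j |theta_j|
    (square-root LS plus an L1 penalty, cf. Xu et al., 2009)."""
    L = np.linalg.norm(Z, 2) ** 2            # spectral bound for the step size
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        r = y - Z @ theta
        nr = np.linalg.norm(r)
        if nr < 1e-12:                       # interpolating solution reached
            break
        step = nr / L                        # local smoothness of the sqrt-LS term
        theta = theta + step * (Z.T @ r) / nr      # gradient step on ||r||_2
        theta = np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)
    return theta
```

The soft-thresholding step is what biases small coefficients toward zero, which is the source of the conservative behavior discussed above.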
3.3 Estimating residuals
The conditional means in (8) are the MSE-optimal predictors of the exposure and the outcome. In lieu of these, we use the empirical predictors $\hat{\mu}_x(s)$ and $\hat{\mu}_y(s)$ as approximations.
Any reasonable prediction method is applicable, since (12) mitigates both of the errors $\delta_x$ and $\delta_y$. To match the form of the spatial model in (7), we consider here predictors that are linear in the parameters, so that

$$\hat{\mu}_x(s) = \varphi(s)^\top \hat{\alpha}, \qquad (13a)$$
$$\hat{\mu}_y(s) = \varphi(s)^\top \hat{\beta}, \qquad (13b)$$

where $\hat{\alpha}$ and $\hat{\beta}$ denote the parameters and, for simplicity, the same basis $\varphi(s)$ is used. To learn the parameters, we use the method in Zachariah et al. (2017), in which a weighted penalty term yields tuning-free regularization that mitigates overfitting in a data-adaptive manner. In addition, $\hat{\alpha}$ and $\hat{\beta}$ can be obtained in a computationally efficient manner. Plugging $\hat{\alpha}$ and $\hat{\beta}$ into (13) yields the residuals $\hat{x}$ and $\hat{y}$, respectively.
Combining the predictors with (12) leads to a robust orthogonalized method for spatially varying causal effect estimation (Rosce).
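The residualization step can be sketched as follows, with ridge regression standing in for the tuning-free regularized method of Zachariah et al. (2017); the penalty weight `lam` is our assumption:

```python
import numpy as np

def residualize(phi, x, y, lam=1e-2):
    """Form empirical residuals (8) from linear-in-basis predictors (13).
    phi: n x p basis matrix evaluated at the sample locations."""
    p = phi.shape[1]
    G = phi.T @ phi + lam * np.eye(p)
    alpha = np.linalg.solve(G, phi.T @ x)   # exposure predictor weights
    beta = np.linalg.solve(G, phi.T @ y)    # outcome predictor weights
    return x - phi @ alpha, y - phi @ beta  # residuals x_hat, y_hat
```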
3.4 Spatial basis function
The form of the spatial basis vector $\varphi(s)$ in (7) depends on whether the space is continuous or discrete.
Continuous space
In the continuous case, we consider $s \in \mathbb{R}^d$. For clarity of explanation we restrict our description here to the case $d = 2$, but modifications to one- or three-dimensional space are straightforward. In two dimensions, let $s = (s_1, s_2)$ and

$$\varphi(s) = \varphi_1(s_1) \otimes \varphi_2(s_2), \qquad (14)$$

where each per-dimension vector $\varphi_k(s_k)$ is composed of $m$ components. Based on their attractive approximation and computational properties, we use the cubic b-spline functions in (15), whose knot placement determines the location of each component and whose knot spacing determines the function support (Wasserman, 2013).

Thus the effect is approximated as a linear combination of b-splines placed at different points across space via the model class (7). The maximum value of $m$ is practically limited by the number of data points and the available computational resources. For $d$-dimensional space with $m$ components per dimension, the dimension of the parameter $\theta$ in (7) is $m^d$.
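A sketch of the tensor-product basis (14) using clamped cubic B-splines; the knot placement and helper names are our own choices:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, m, lo=0.0, hi=1.0, k=3):
    """Evaluate m clamped degree-k B-spline basis functions on [lo, hi] at u."""
    # m basis functions of degree k require m + k + 1 knots
    t = np.concatenate([
        np.full(k, lo),
        np.linspace(lo, hi, m - k + 1),
        np.full(k, hi),
    ])
    cols = [BSpline.basis_element(t[j:j + k + 2], extrapolate=False)(u)
            for j in range(m)]
    return np.nan_to_num(np.column_stack(cols))  # zero outside local supports

def phi_2d(s, m):
    """Tensor-product (Kronecker) basis (14) for 2-D locations s (n x 2)."""
    B1 = bspline_basis(s[:, 0], m)
    B2 = bspline_basis(s[:, 1], m)
    # row-wise Kronecker product -> n x m^2 design matrix
    return np.einsum('ni,nj->nij', B1, B2).reshape(len(s), -1)
```

Clamped B-splines form a partition of unity on the interior of the domain, which keeps the basis well conditioned.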
The approximation accuracy of the model class introduced above is readily refined by incorporating b-splines with multiple supports. For instance,

$$\varphi(s) = \operatorname{col}\big(\varphi^{(h_1)}(s),\, \varphi^{(h_2)}(s)\big) \qquad (16)$$

accommodates two different support sizes $h_1$ and $h_2$. This enables inferring the effects at different spatial resolutions.
Discrete space
In the discrete case, the space consists of $R$ disjoint regions and the spatial basis vector is simply

$$\varphi(s) = \big[\mathbb{1}(s = 1), \dots, \mathbb{1}(s = R)\big]^\top, \qquad (17)$$

where $\mathbb{1}(\cdot)$ denotes the indicator function. The above choice of $\varphi(s)$ yields a model class for $\tau(s)$ with an average effect for every discrete region.
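The indicator basis (17) amounts to one-hot encoding of the region label; a minimal sketch (zero-based labels are our convention):

```python
import numpy as np

def phi_discrete(s, R):
    """Indicator basis (17): s holds region labels in {0, ..., R-1}."""
    phi = np.zeros((len(s), R))
    phi[np.arange(len(s)), s] = 1.0
    return phi
```

With this basis, each component of $\theta$ in (7) is the average effect in the corresponding region.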
4 Experimental Results
In this section, we demonstrate the proposed Rosce method using simulated data for both continuous and discrete space. We subsequently apply the method to real data.
4.1 Continuous two-dimensional space
First, we illustrate Rosce in continuous two-dimensional space. The outcome is generated according to the model (5), where the effect of the exposure varies across space as shown in Figure 3(a). Note that the true effect does not belong to the model class (7), but can nevertheless be well approximated using the b-spline basis. The nuisance function $\gamma(s)$ is constructed from a b-spline basis with randomly drawn coefficients, cf. Sec. 3. The exposure and error are drawn from Gaussian distributions. We generate a dataset $\mathcal{D}$ whose spatial samples are drawn uniformly across the domain.

For estimation, we use a b-spline basis vector. To capture multiple resolutions, we use three levels of supports, cf. (16). For the sake of comparison, we consider a direct approach to estimating the effect using (5), which requires the joint estimation of the nuisance function $\gamma(s)$. For the latter, we use a well-specified model class that contains the true $\gamma(s)$. The effect and nuisance functions are jointly estimated using the LS method.
Figures 3(b) and 3(c) show the estimates obtained using Rosce and the direct approach, respectively. It can be clearly seen that the latter results in very poor effect estimates, due to the high dimensionality of the jointly estimated nuisance function. In contrast, the Rosce method approximates the true effect well. Moreover, we evaluate confidence intervals (CIs) using the pivotal bootstrap method (Wasserman, 2013) for both methods. Figure 3(d) shows the effect along the left diagonal with the CIs of the two methods. The direct approach infers insignificant effects both where the true effect is positive and where it is negative. The inference is also inconsistent, reporting a significant effect of opposite sign to the true effect. The CI of Rosce, on the other hand, is consistent with the true effect.
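The pivotal bootstrap interval used here takes the form $[2\hat{t} - q_{1-\alpha/2},\ 2\hat{t} - q_{\alpha/2}]$, where the $q$ are quantiles of the bootstrap replicates (Wasserman, 2013). A generic sketch for a scalar statistic, resampling whole observations (function names are ours):

```python
import numpy as np

def pivotal_bootstrap_ci(estimate, data, n_boot=1000, alpha=0.05, seed=0):
    """Pivotal (basic) bootstrap CI for a scalar statistic.
    `estimate` maps a data array to a scalar; rows are resampled."""
    rng = np.random.default_rng(seed)
    t_hat = estimate(data)
    n = len(data)
    reps = np.array([
        estimate(data[rng.integers(0, n, n)]) for _ in range(n_boot)
    ])
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return 2 * t_hat - hi, 2 * t_hat - lo
```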
4.2 Discrete spatial regions
Next, we illustrate Rosce in the simple case of discrete space and compare it with a naive LS approach. The outcome is generated according to (5), where the effect of the exposure and the nuisance function both vary across regions. The exposure and error are Gaussian. We generate a dataset whose samples are drawn uniformly across the regions.
The basis vector $\varphi(s)$ is given by (17). For the sake of comparison, we consider a standard LS regression approach, in which the predictive effect of the exposure in each region is estimated separately. Figures 4(a) and 4(b) illustrate the difference between Rosce, which takes the causal structure into account, and the standard regression approach. The latter leads to non-negligible biases and misleading inferences; see, for instance, the inferred effect in region 5.
4.3 Robustness to errors-in-variables
Finally, we illustrate the robustness of the method against the unobserved deviations $\delta$. We consider one-dimensional continuous space with an effect parameterized as in (7). We generate samples and, for clarity, let the error $\varepsilon$ in (10) be zero, so that the only estimation error arises via $\delta$. The residual in (8) is generated with large errors $\delta_x$.

We compare Rosce with an LS estimator applied to (10), which assumes $\delta \equiv 0$. An example realization of the estimates is shown in Figures 5(a) and 5(b). We see that Rosce biases the estimates towards zero and thereby reduces the risk of reporting spurious effects, whereas the LS method reports larger regions in which effects appear significant (via 95% CIs). We repeat the experiment using Monte Carlo simulations and confirm in Figure 5(c) that the proposed method is less prone to reporting spurious effects under errors-in-variables than an LS-based approach.
4.4 German income data
We proceed to illustrate the method with real datasets. In these examples, the space is partitioned into discrete regions. First, we consider the average household income in Euros at the county level in Germany (INKAR, 2012).
Suppose we are interested in studying the effect of age on income across Germany's federal states. The observed data spans the federal states and is standardized to have zero mean and unit standard deviation. For the sake of illustration, suppose that different regions have different age demographics but also different economic structures with varying levels of unemployment. Then the unemployment level, denoted $z$, acts as a spatial confounder on the relation between age and income (Thaden & Kneib, 2018); see Figure 7.

Assuming that $z$ is unknown, we use Rosce to infer the average change in income per year of age. The result is shown in Figure 7(a), where we do not infer any significant causal effects across states. By contrast, when using a standard regression approach to infer the predictive effect in each state, we find significant negative effects of age in four states; see Figure 7(b). For comparison with Rosce, we control for county-level unemployment rates in each state by partialling out $z$ from both exposure and outcome, and then estimating the effect from the residuals. The results in Figure 7(c) corroborate those reported in Figure 7(a), in that the effects remain insignificant when the unobserved variables are taken into account.
4.5 US Crime data
Next, we are interested in studying the effect of poverty on crime, which has been a long-standing research topic in criminology. Based on a review of different studies, Ellis & McDonald (2001) concluded that there is an inverse association between an individual's socioeconomic status and criminal behavior. We use county-level data on the number of crimes and the number of poor families from the 2000 US census (Census Bureau, 2000) to test whether our method is able to infer such a causal effect. The outcome considered is the total number of crimes due to theft, larceny and burglary. The exposure considered is the number of families that are classified as poor.

Figure 8(a) shows the estimated effect of the number of poor families on crimes due to theft, larceny and burglary. It is estimated to be positive for all states, and the estimates are significant in all but three states, as shown in Figure 8(b). The results are thus consistent with previous findings. We find that across most states, eliminating poverty among ten families reduces, on average, three crimes (of either theft, larceny or burglary).
We also consider the effect of education on income, using the percentage of people attaining at least a high-school education and the median annual household income at the county level from the 2000 US census (Census Bureau, 2000). Figure 10 shows the estimated effect of educational attainment percentage on annual median household income. Our method infers a positive effect, which one would intuitively expect between these two quantities.
5 Conclusion
We have proposed a method for estimating heterogeneous effects of an exposure on an outcome variable in the presence of possibly unmeasured, spatially varying confounding variables. The method is applicable to both continuous and discrete space. The orthogonalized approach circumvents joint estimation of a possibly high-dimensional nuisance function, while the robust estimation approach mitigates the resulting errors-in-variables. We demonstrated the properties of the method on synthetic data and illustrated its potential use in real-world applications.
Acknowledgements
The authors would like to thank Fredrik Johansson (MIT) for feedback on an early version of this work. This research was financially supported by the projects Counterfactual Prediction Methods for Heterogeneous Populations (contract number: 201805040) and NewLEADS - New Directions in Learning Dynamical Systems (contract number: 621201606079), funded by the Swedish Research Council, and by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT150012).
References
 Census Bureau (2000) Census Bureau. USA data, 2000. URL https://www.census.gov/prod/www/decennial.html.
 Chernozhukov et al. (2016) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice, 2016.
 Cressie (1992) Cressie, N. Statistics for spatial data. Terra Nova, 4(5):613–617, 1992.
 Ellis & McDonald (2001) Ellis, L. and McDonald, J. N. Crime, delinquency, and social status: A reconsideration. Journal of Offender Rehabilitation, 32(3):23–52, 2001.
 Hanks et al. (2015) Hanks, E. M., Schliep, E. M., Hooten, M. B., and Hoeting, J. A. Restricted spatial regression in practice: geostatistical models, confounding, and robustness under model misspecification. Environmetrics, 26(4):243–254, 2015.
 Hernán & Robins (2019) Hernán, M. A. and Robins, J. M. Causal inference. Boca Raton: Chapman Hall/CRC, forthcoming, 2019.
 Hodges & Reich (2010) Hodges, J. S. and Reich, B. J. Adding spatiallycorrelated errors can mess up the fixed effect you love. The American Statistician, 64(4):325–334, 2010.

 Hughes & Haran (2013) Hughes, J. and Haran, M. Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):139–159, 2013.
 Imai et al. (2013) Imai, K., Ratkovic, M., et al. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
 INKAR (2012) INKAR. German income, age and unemployment data, 2012. URL http://www.inkar.de/.
 Künzel et al. (2017) Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. Metalearners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461, 2017.
 Nie & Wager (2017) Nie, X. and Wager, S. Quasioracle estimation of heterogeneous treatment effects, 2017.
 Paciorek (2010) Paciorek, C. J. The importance of scale for spatialconfounding bias and precision of spatial regression estimators. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):107, 2010.
 Page et al. (2017) Page, G. L., Liu, Y., He, Z., and Sun, D. Estimation and prediction in the presence of spatial confounding for spatial linear models. Scandinavian Journal of Statistics, 44(3):780–797, 2017.
 Pearl (2009) Pearl, J. Causality. Cambridge university press, 2009.
 Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
 Reich et al. (2006) Reich, B. J., Hodges, J. S., and Zadnik, V. Effects of residual smoothing on the posterior of the fixed effects in diseasemapping models. Biometrics, 62(4):1197–1206, 2006.
 Richardson & Robins (2013) Richardson, T. S. and Robins, J. M. Single world intervention graphs (swigs): A unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper, 128(30):2013, 2013.
 Thaden & Kneib (2018) Thaden, H. and Kneib, T. Structural equation models for dealing with spatial confounding. The American Statistician, 72(3):239–252, 2018.

 Wager & Athey (2017) Wager, S. and Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted), 2017.
 Wasserman (2013) Wasserman, L. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
 Williams & Rasmussen (2006) Williams, C. K. and Rasmussen, C. E. Gaussian processes for machine learning. the MIT Press, 2(3):4, 2006.
 Xu et al. (2009) Xu, H., Caramanis, C., and Mannor, S. Robust regression and lasso. In Advances in Neural Information Processing Systems, pp. 1801–1808, 2009.

 Zachariah & Stoica (2015) Zachariah, D. and Stoica, P. Online hyperparameter-free sparse estimation method. IEEE Trans. Signal Processing, 63(13):3348–3359, 2015.
 Zachariah et al. (2017) Zachariah, D., Stoica, P., and Schön, T. B. Online learning for distribution-free prediction. arXiv preprint arXiv:1703.05060, 2017.