Combining Heterogeneous Spatial Datasets with Process-based Spatial Fusion Models: A Unifying Framework

06/02/2019 ∙ by Craig Wang, et al. ∙ 0

In modern spatial statistics, the structure of data that is collected has become more heterogeneous. Depending on the type of spatial data, different modeling strategies for spatial data are used. For example, a kriging approach for geostatistical data; a Gaussian Markov random field model for lattice data; or a log Gaussian Cox process for point-pattern data. Despite these different modeling choices, the nature of underlying scientific data-generating (latent) processes is often the same, which can be represented by some continuous spatial surfaces. In this paper, we introduce a unifying framework for process-based multivariate spatial fusion models. The framework can jointly analyze all three aforementioned types of spatial data (or any combinations thereof). Moreover, the framework accommodates different conditional distributions for geostatistical and lattice data. We show that some established approaches, such as linear models of coregionalization, can be viewed as special cases of our proposed framework. We offer flexible and scalable implementations in R using Stan and INLA. Simulation studies confirm that the predictive performance of latent processes improves as we move from univariate spatial models to multivariate spatial fusion models. The introduced framework is illustrated using a cross-sectional study linked with a national cohort dataset in Switzerland, we examine differences in underlying spatial risk patterns between respiratory disease and lung cancer.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 23

page 24

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In statistics, spatial models are useful when residuals exhibit correlation in space after accounting for known covariates in a regression-type setting. Spatial data has long been classified into three categories, namely geostatistical (point-level) data, lattice (area-level) data, and point-pattern data

(Cressie, 1991). Due to reasons such as measurement method constraints and privacy considerations, different types of spatial data may be collected in different settings. Depending on the data type, different statistical models that capture the residual spatial correlation are used. In a nutshell, (1) geostatistical data are commonly collected in environmental science. For example, rainfall at weather stations (Kyriakidis et al., 2001), where the exact geo-coordinates for each observation are known. The strength of dependency is a function of the distance separation between two locations. Kriging can be used to model a smooth surface. (2) Lattice data can be either gridded or irregularly aligned, and occur in the form of aggregated observation over areas. They are often collected in epidemiology, such as disease prevalence of each district (Chammartin et al., 2016)

. Another source of lattice data is from measuring instruments, such as satellites where the spatial resolution is intrinsically limited, resulting in gridded observations. Gaussian Markov random field (GMRF) models such as conditionally autoregressive models are typically used to capture the spatial dependency between neighboring areas. Finally, (3) there is point-pattern data, where the locations themselves are stochastic. They are used to model epidemiological data of case locations

(Gatrell et al., 1996) or other event locations such as epicenter of earthquakes (Ogata, 1988)

. One approach to model such data is using a Poisson process, where the intensity function may depend on observed covariates. Despite having different modeling strategies for different types of data, a common purpose of all the aforementioned statistical models is to capture the residual spatial dependency between different observations. A natural way to do this is using Gaussian processes to model continuous spatial surfaces, which represent the underlying scientific process that drives the response variables together with observed covariates.

There are already attempts to analyze lattice data and point-pattern data using Gaussian process-based models. For example, Kelsall and Wakefield (2002)

modeled aggregated disease counts using a Gaussian process approach. The authors derived analytical approximations to area-level Poisson mean and produced continuous underlying relative risk functions using Markov chain Monte Carlo (MCMC). Point-pattern data can be linked to Gaussian process with a log-Gaussian Cox process (LGCP)

(Møller et al., 1998). Instead of having a fixed intensity at each location, the intensity is modeled as a log-Gaussian process, yielding a doubly stochastic Poisson process. In modern spatial statistics, researchers are dealing with increased heterogeneity in the structure of spatial data that is collected. Different data sources may contain overlapping information concerning the same research questions. Recently, there are several approaches in the literature that proposed to use Gaussian process as a basis to fuse spatial data of different types. Moraga et al. (2017) proposed a model to analyze spatial data available in both geostatistical and lattice type, with the same set of covariates and a response variable observed at two different spatial resolutions. Their computation is made efficient using integrated nested Laplace approximations (INLA) (Rue et al., 2009). Wilson and Wakefield (2018) extended the work by allowing non-Gaussian response variables. This opens up possibilities of modeling count data which commonly occurs in epidemiological settings. Shi and Kang (2017) proposed a fixed rank kriging-based fusion model to combine multiple lattice-type remote sensing datasets. Other works on spatial fusion models (Berrocal et al., 2010; McMillan et al., 2010; Sahu et al., 2010) also implemented efficient algorithms for specific model structure introduced in their applications. In general, spatial fusion models are challenged by their flexibility and computational efficiency. There exists a trade-off between them, i.e. more flexible modeling structure comes with a higher computational cost. For example, as shown in Wilson and Wakefield (2018)

, their fusion model with normally-distributed response variables took an order of minutes using INLA to compute. A more flexible modeling structure for Poisson-distributed response took several weeks using Hamiltonian Monte Carlo-based inference method.

In this paper, we extend previous works in those two aspects. In terms of flexibility, our framework incorporates an additional data type, namely point-pattern data. To the best of our knowledge, this is the first spatial fusion framework that incorporates all three types of spatial data. We additionally allow arbitrary combinations of those three data types in multivariate settings. We propose a unifying framework that includes the features of several well-established models, such as linear models of co-regionalization (LMC) (Wackernagel, 2003; MacNab, 2016), spatial factor model (Wang and Wall, 2003), and shared component model (Knorr-Held and Best, 2001). In terms of computational cost, we implement a fully Bayesian-based approach using Stan modeling language (Carpenter et al., 2017)

. As a more efficient alternative, we offer INLA-based implementation which significantly reduces computation time from hours to minutes for thousands of observations. Last but not least, we benchmark the performance of these two implementations in terms of prediction and parameter estimation in simulation studies.

The rest of this paper is structured as follows: Section 2 introduces the unifying framework and explicitly show its link to existing spatial models. Section 3 discusses implementation strategies in Stan and INLA. Section 4 illustrates the framework using two simulated scenarios and an analysis on epidemiological datasets, as well as comparing the performance of our two implementations. Finally, we end with a summary followed by some discussion on identifiability problems and research outlook in Section 5.

2 Process-based Spatial Fusion Model

2.1 The Unifying Framework

For , we let denote the th response variable with observations, with a conditional distribution that belongs to the exponential family. Each of the response can take any of the following data types: i) geostatistical data, observed at locations ; ii) lattice data observed at areas ; or iii) point-pattern data that has been discretized to regular fine grid containing mostly zeros or ones, observed at gridded locations , where denotes the number of events in the grid cell containing . Further, we let denote a full (column) rank matrix of spatially-referenced covariates that are observed at the same spatial units as the corresponding response variables,

denote a vector

of fixed effect coefficients. We assume there is a

vector of zero-mean, unit variance, independent latent Gaussian processes

having a design matrix , i.e. is the th linear combination of Gaussian processes. Each Gaussian process is parameterized by its own covariance function. Finally, non-linear operator subsets and aggregates some components of such that it matches the spatial resolution of the corresponding response variable. Overall, the framework can be formulated as

(1)

where is a link function that corresponds to the conditional distribution of . Fig. 1 outlines a graphical formulation of the framework.

Figure 1: A graphical formulation of the spatial fusion model framework, consisting of multiple response variables of different type and multiple latent processes.

Change of support problems (Gotway and Young, 2002) arise when lattice data needs to be modeled. We only observe aggregated information while the underlying process is continuous. We employ a sampling-points approximation approach to stochastic integrals (Gelfand et al., 2001; Fuentes and Raftery, 2005) for aggregating latent processes. Let denote the set of all sampling points and each area contains sampling points. For the th area in th response, under a linear link function we obtain

(2)

When a nonlinear link function is used for a response variable, the aggregation in (2) will result ecological bias (Greenland, 1992). For a general link function , we have the following approximation,

(3)

Typically, a small is chosen to balance the trade-off between computational efficiency and model accuracy (Fuentes and Raftery, 2005; Liu et al., 2011).

Albeit the latent processes have a continuous index, we work with a finite set of locations in practice. The set of locations to be modeled in the latent processes comprise the locations where geostatistical data are observed, the locations of sampling points for lattice data and gridded locations for point-pattern data. The non-linear operator takes a different form for different data types. For geostatistical and point-pattern data, the non-linear operator subsets to the corresponding locations of the th response variable. For lattice data and a linear link function, subsets to the sampling point locations and aggregates them to the corresponding areas by taking averages. With non-linear link functions, first applies an inverse link function and then aggregates.

2.2 Link to Existing Models

Our proposed unifying framework utilizes elements from existing literature and combines them to create a flexible yet efficient spatial fusion model framework. As a result, there are some strong links in terms of model structure between this framework and some established methods in spatial statistics. At the same time, they share the same potential identifiability issues.

In univariate settings, the unifying framework allows us to model each type of spatial data individually with a latent Gaussian process. When we have geostatistical data, it results in a geostatistical regression (Cressie, 1991). With Poisson-distributed lattice data, we obtain a sampling-points approximation to the model used in Kelsall and Wakefield (2002), which is an alternative modeling strategy to Besag-York-Mollé model (Besag et al., 1991). With point-pattern data, we obtain a discretized LGCP (Møller et al., 1998).

In multivariate geostatistical data settings, the design matrix plays a pivotal role in the identifiability of model parameters. When the number of independent Gaussian processes is less than the number of responses , we obtain a spatial factor model (Wang and Wall, 2003). The latent spatial factors are assumed to have zero-mean unit-variance Gaussian processes, such that controls the variance (partial sill) of latent processes. When , we obtain a general LMC framework (Wackernagel, 2003; Schmidt and Gelfand, 2003). A similar LMC framework also exists for lattice data (MacNab, 2016)

. Identifiability issues occur in the LMC since the number of latent values to be estimated in the latent processes is equal to the total number of observations in response variables. Additional spatial hyper-parameters, and fixed-effect coefficients also need to be estimated. For this reason, regularization is done via one of the following: 1) empirical Bayes method by fixing some of the hyper-parameters; 2) choosing informative prior distributions in Bayesian models; or 3) using a lower triangular matrix for

(Schmidt and Gelfand, 2003). In cases of , we acquire a similar model structure as shared component models (Knorr-Held and Best, 2001) for Gaussian processes, where multiple outcomes have their own latent spatial components plus some shared spatial components. In this setting, the values in need to be even further constrained to avoid identifiability issues (Knorr-Held and Best, 2001).

Our framework is also linked to other process-based spatial data fusion models, which combine geostatistical and lattice data types. When we let the response variables represent the same information but have different data types, we obtain the model presented in Wilson and Wakefield (2018), where an explicit relationship is used to link multiple response variables. If we further allow different information to be represented in the response variables, we reach the generalized spatial fusion model framework proposed in Wang et al. (2018).

To the best of our knowledge, there is no existing approach or implementation that jointly models all three types of spatial data in a multivariate framework. With those links to the existing approaches, our framework extends upon them by combining different features and enhance the overall flexibility of spatial fusion models.

3 Model Implementations

It is well known that fitting full Gaussian processes in Bayesian models is computationally expensive in both univariate and multivariate settings. Marginalized and conjugate Gaussian process models dramatically save computation time but they are only available when fitting geostatistical data with normally-distributed outcomes (Banerjee et al., 2014; Zhang et al., 2019). There exist several approaches to reduce the computational burden, such as low rank (Cressie and Johannesson, 2007; Banerjee et al., 2008; Stein, 2008) and sparse (Furrer et al., 2006; Rue et al., 2009; Datta et al., 2016) methods. Some of those approaches are utilized in existing spatial fusion models. Shi and Kang (2017) adapted the spatial basis function approach from fixed rank kriging (Cressie and Johannesson, 2007). Moraga et al. (2017) used integrated nested Laplace approximations (Rue et al., 2009). Wang et al. (2018) exploited the nearest neighbor Gaussian process (NNGP) (Datta et al., 2016). In this paper, we offer two efficient implementation strategies for the unifying spatial fusion model framework. The first strategy follows an adaptation of NNGP implementation in Wang and Furrer (2019). The second strategy follows Wilson and Wakefield (2018) to use INLA, with additional approximations for non-linear link functions.

3.1 Implementation using NNGP

Fitting full Gaussian processes in a Bayesian hierarchical model is costly, therefore we seek for efficient methods to speed up the inference. An NNGP implementation approximates a full Gaussian process by assuming that latent variables in the Gaussian process are conditionally independent given their neighborhood sets, hence introducing sparsity in the precision matrix. Datta et al. (2016) showed that it significantly reduces computation time in geostatistical models while yielding results close to a full Gaussian process-based inference. In our implementation, we let each latent spatial process following an independent NNGP. Let denote a latent process on the set of locations , then the NNGP likelihood according to Datta et al. (2016) can be written as

(4)

where is the set of nearest neighbors from for location with a fixed constant , is the cross-covariance matrix between the latent process and its neighbors , is the covariance matrix of , and is the variance of

. The variance and covariance matrices are parameterized by spatial hyperparameters.

The full Bayesian hierarchical model is then implemented using Stan modeling language via the rstan package (Stan Development Team, 2018) in R (R Core Team, 2018), consisting of likelihoods for each outcome variable and NNGP, as well as priors for fixed effect coefficients and spatial hyperparameters. Stan implements the No-U-Turn sampler (Homan and Gelman, 2014)

based on Hamiltonian Monte Carlo, which provides efficient means of conducting full Bayesian inference for complex hierarchical structures.

3.2 Implementation using INLA

Although the computational efficiency can be improved by using NNGPs instead of full Gaussian processes, it is still not feasible to fit multiple latent processes with more than thousands of locations in . Therefore, we implement an alternative strategy using INLA.

Over a fixed set of locations, a Gaussian process is equivalent to a Gaussian random field (GRF). Lindgren et al. (2011)

established a connection between GRFs and GMRFs through a stochastic partial differential equation approach, where a GRF can be approximated by triangulating the spatial domain and using a weighted sum of basis functions as

(5)

where is the number of points in the triangulation,

are Gaussian distributed weights and

are basis functions. The weights forms a GMRF with sparse precision matrix which makes computation efficient. In this approach, the covariance function must be a member of the Matérn family defined as

(6)

where is the Euclidean distance between and , is the modified Bessel function of second kind with order , is the partial sill and relates to the spatial range, with being the smoothness parameter. The approximation in Eq. (5) can be written as , where is a projection matrix that maps a GMRF defined on the triangulation mesh nodes to the observations’ locations.

The key to implementing the spatial fusion models in INLA lies within the projection matrix, with a different structure required for each data type (Krainski et al., 2018). For geostatistical data, the th row of the projection matrix corresponds to the th location, it is filled with zeros except where 1) the location is on the th vertex, then the th column is 1 or 2) the location is within a triangulation area, then three cells get values based on a mixture of barycentric based weights from three neighboring vertices of the triangulation. For lattice data, we construct a projection matrix that links the th area with the mean value of the GRF at mesh nodes which falls into the th area. If the link function is linear, increasing the mesh density will increase the number of mesh nodes that falls into each area hence better approximate the average. However, for non-linear link functions, it is preferable to have less-dense mesh due to Jensen’s inequality (Jensen, 1906), which states

(7)

for the th area in the th response. The approximation is better when there is only a smaller number of mesh nodes within each area (Follestad and Rue, 2003). Finally, for the point-pattern data, we use an augmentation approach by Simpson et al. (2016)

, which avoids discretizing the spatial domain into grid cells. The projection matrix is built as an identity matrix with dimension equal to the total number of mesh nodes, row-binded with a projection matrix that is constructed on observed locations in the same way as the geostatistical case.

The final model fitting is done by stacking the projection matrices together and assigning appropriate priors using the INLA (Lindgren and Rue, 2015) package in R. Advances in INLA (Martins et al., 2013) such as allowing multiple likelihoods and ‘copy’ feature made this implementation possible.

4 Illustrations

In this section, we conduct two simulation studies and an analysis of epidemiological datasets to illustrate our proposed framework. All results are obtained in R version 3.5.0 (R Core Team, 2018), on a Linux server with 256GB of RAM and two Intel Xeon 6-core 2.5GHz processors. All R codes used in the simulation studies are provided in the supplementary material.

4.1 Simulation Study One

We are interested in modeling a single latent spatial process within a

square, using three spatial response variables with one being from each type. First, we simulate a zero-mean GRF on densely uniformly distributed locations with a covariance matrix

. We then sub-sample 200 locations to obtain the latent process at observed locations. For lattice observations, we divide the square into 100 Voronoi cells and compute aggregated GRF from all locations while accounting for ecological bias using Eq. (3). In addition, we generate a covariate for geostatistical and lattice response by sampling from a standard normal distribution. Afterwards, we generate a normally-distributed geostatistical response at the same sampled locations and a Poisson-distributed lattice response for each area. For point-pattern observations, we simulate from the same GRF on a coarse 2020 grid, then exponentiate the values to obtain intensity at the grid cells, afterwards we generate Poisson point process using each intensity value multiplied by cell area and an offset term as the final intensity. In summary, the response variables are generated according to

(8)

In the simulation, we use an exponential covariance function (), i.e. . The influence of varying sample sizes and spatial hyperparameters on predictive performance was investigated by Wang et al. (2018), therefore we only consider a single setup by setting and . In addition, we set and . is the product of grid cell area and an offset term which takes value 0.25.

We consider seven different model specifications within our proposed framework. (i - iii) three univariate models using a single data type each, namely one of geostatistical, lattice and point-pattern data, (iv - vi) three fusion models using different combinations of two data types, and (vii) a multivariate fusion model combining all three response variables. In Stan implementation, the intercepts and coefficients are assigned with independent priors. The variance parameters and are assigned with inverse Gamma prior , which has a mean of one and undefined variance. For the spatial decay , a zero-truncated normal prior is assigned. We use nearest neighbors and sampling points randomly selected within each area. We run 4 chains of 2,000 iterations with 1,000 warm-up samples, without thinning for each model. Multiple chain convergence is checked with potential scale reduction factors (Brooks and Gelman, 1998). For INLA implementation, we use penalized complexity (PC) prior for Matérn GRF (Fuglstad et al., 2018) with fixed at 1.5, corresponding to the exponential covariance function. In addition, we choose the median practical spatial range to be 2 (corresponds to the median of

being 1) and the probability of

greater than 1.7 is 5%, such that the allocated probability mass is closely matched with the priors in Stan. The rest of the priors in INLA are default options. The same data was modeled using both Stan and INLA implementations for comparison. Additionally, the simulation is repeated 100 times for INLA. We leave out the Stan implementation in the repetition part due to its long computation time.

We chose an additional 1600 sites to evaluate the predictive performance of models in terms of root mean squared prediction errors (RMSPE) under each scenario. The prediction sites are located at the centers of a grid that uniformly covers the sampling domain. Their predictive performance is shown in Figure 2. The first two venn diagrams show the RMSPEs for the simulated scenario under different models with Stan and INLA implementation. The last Venn diagram shows the average RMSPEs over 100 simulations for INLA. For both Stan and INLA implementations, the RMSPEs are smaller in multivariate fusion models compared to univariate process-based models. The joint modeling of all three types of spatial data has the lowest RMSPE on the prediction of the latent process at unobserved locations. Stan and INLA implementations produced comparable results, with the RMSPEs of Stan fall inside the ranges of repeated INLA simulations.

Figure 2: Venn diagram of root mean squared prediction error for the unifying fusion framework fitted to each data type (and combinations thereof), using models implemented in Stan and INLA. Values in overlapping areas indicate results from models with multiple data types.

4.2 Simulation Study Two

In the second simulation study, we focus on comparing the parameter estimates of a multivariate fusion model with three response variables of different types and two latent processes. We firstly simulate two independent zero-mean unit-variance GRFs uniformly distributed on the spatial domain of square, then compute the sub-sampled and aggregated GRF for each response variable in the same way as in simulation one. Each response depends on the latent processes via the design matrix

(9)

The first geostatistical response variable only depends on the first latent process, the second lattice response variable depends on both latent processes, while the third point pattern response variable depends only on the second latent process. The response variables are generated as follows,

(10)

where consists of 500 geostatistical observations, has 100 lattice observations and represents the number of events observed at each of 400 cells on a grid. In addition, we set , and . Since we have two latent processes in the simulation, using any of the univariate model or fusion model with two response variables can lead to identifiability problem. Hence, we estimate the parameters only using the unifying spatial fusion model with three responses only. The model and their prior specifications for both Stan and INLA are the same as in simulation one, except for the spatial range parameter. The prior for both and is zero-truncated in Stan and PC prior with median practical spatial range in INLA (corresponds to the median of being 10).

The parameter estimates based on posterior medians and their 95% posterior credible intervals for both implementations are summarized in Table 

1. We obtained similar parameter estimates in both models. The PC prior in INLA penalizes complex structure in GRF hence tends to have a slightly over-estimated range. The posterior median of fitted latent processes at locations with geostatistical observations are shown in Fig 3. The root mean squared errors using Stan are 0.50 and 0.44 for the first and second latent process, compared to 0.54 and 0.48 using INLA. The computation time for the Stan implementation of the fusion model is 1.9 hours while it takes 11 minutes for INLA.

Stan INLA
True Median 95% CI Median 95% CI
3.00 2.78 (2.43, 3.11) 2.78 (2.41, 3.13)
5.00 5.03 (4.92, 5.13) 5.01 (4.91, 5.12)
0.50 0.56 (0.24, 0.84) 0.61 (0.3, 0.89)
2.00 1.94 (1.74, 2.16) 1.92 (1.77, 2.09)
1.20 1.33 (1.16, 1.53) 1.23 (1.08, 1.41)
0.50 0.51 (0.28, 0.75) 0.65 (0.46, 0.87)
1.20 1.03 (0.8, 1.37) 0.96 (0.69, 1.36)
1.00 0.86 (0.7, 1.11) 0.82 (0.61, 1.15)
5.00 5.44 (3.85, 8.67) 5.93 (4.12, 8.48)
25.00 17.36 (10.7, 28.88) 27.60 (17.86, 45.99)
0.50 0.47 (0.27, 0.73) 0.55 (0.37, 0.8)
Table 1: Parameter estimates and their 95% posterior credible intervals (95% CI) from the unifying spatial fusion model in simulation two with both Stan and INLA implementation.
Figure 3: True versus fitted latent process at locations with geostatistical observation. Pearson’s correlation coefficients are displayed.

4.3 Application to LuftiBus-SNC Dataset

In spatial epidemiology, the joint analysis of multiple diseases with similar etiology allows us to separate underlying risk factors into shared and disease-specific components. In this analysis, we examine the disease-specific spatial risk surface of lung cancer and shared spatial risk components between lung cancer and respiratory disease.

Chronic lung disease contributes substantially to morbidity and mortality worldwide, with chronic obstructive pulmonary disease (COPD) being the third leading cause of death (Lozano and et al, 2012). Forced expiratory volume in one second (FEV1) is a measure of the amount of air a person can exhale during a pulmonary test, it can be used to diagnose disease and predict respiratory-related mortality (Menezes et al., 2014). While respiratory disease and lung cancer share many common risk factors such as smoking and exposure to air pollution, it is of interest to examine the lung cancer specific spatial risk component. It may provide insights into identifying risk factors that are solely associated with lung cancer.

Initiated as a health promotion campaign by Lunge Zurich (Lunge Zürich, 2017) in Switzerland, the ‘LuftiBus’ project collected lung function measurements including FEV1 and demographic information from local residents. The data from LuftiBus observed between 2003 and 2012 were deterministically linked with a census-based Swiss National Cohort (SNC) study, to obtain 44,071 people with demographic, health and environmental variables in Switzerland. More importantly, the linkage provides us with the residential location of individual participants.

For lattice data, we compute the expected cause-specific (respiratory and lung cancer, respectively) mortalities in each municipality, adjusted by 5-year age-group and gender using the SNC data. We assume there are two latent spatial risk surfaces which are associated respiratory disease and lung cancer. The first risk surface is shared between FEV1, respiratory mortality and lung cancer mortality, while the second is lung cancer-specific risk surface. Typically with lattice data, multivariate conditional autoregressive models allow us to jointly analyze multiple responses and identify different latent components (Jin et al., 2005). However, municipal boundaries are artificial, we argue that a continuous spatial surface is a more natural modeling assumption. Therefore, we use our process-based unifying framework to conduct the analysis. Another advantage is that it allow us to incorporate the rich FEV1 data from Luftibus. The fusion model is structured as

where and are the expected cause-specific mortalities.

More than 60% of the FEV1 measurements in the linked dataset are located in Canton of Zurich, therefore we restrict our analysis to Canton of Zurich. In addition, we focus the analysis on people who are 40 years or older, which results in 16,160 geostatistical observations. Since we have a large number of observations in the FEV1 outcome, the number of locations required to model in the latent processes is large. Therefore it is not feasible to use the Stan implementation. We conduct the analysis using the INLA implementation only. We use PC prior for the latent components with corresponding to exponential covariance function, median practical range of 1km and median of 1. Figure 4 shows the locations of geostatistical observation and the standardized mortality ratio for respiratory disease and lung cancer. Figure 5 shows the transformed posterior estimates of the latent processes representing relative risk surfaces in Canton of Zurich. The shared risk components between FEV1, respiratory mortality, and lung cancer mortality is highest in urban areas, with an effective range of 3.1 (95% CI: 1.9, 5.2) km based on the exponential covariance function. The estimated relative risk is computed by exponentiating the latent process, which varies between 0.72 and 1.59. Meanwhile, the high-risk areas of lung cancer-specific components are scattered around Canton of Zurich, mainly in the north and west regions with an effective range of 1.2 (95% CI: 0.3, 4.0) km. The variability is smaller than the shared component with values between 0.87 and 1.32, indicating a smoother risk surface compared to the shared component. The lung cancer-specific risk component is modeled via lattice data only, hence appearing to have some block-wise structures compared to the shared component.

Parameter
Median 4.74 0.907 -0.0375 -0.0501
95% CI (4.79, 4.7) (0.923, 0.891) (-0.0368, -0.0382) (-0.101, -0.00127)
Parameter
Median -0.112 0.268 1020 389
95% CI (-0.171, -0.056) (0.274, 0.262) (628, 1720) (87.3, 1320)
Parameter
Median 0.0887 0.148 0.177 0.527
95% CI (0.0223, 0.233) (0.09, 0.257) (0.0597, 0.449) (0.242, 1.37)
Table 2: Parameter estimates and their 95% posterior credible intervals (95% CI) for the LuftiBus-SNC dataset. and are in meters.
Figure 4: Data used in the fusion model. Left: locations of geostatistical observation. Middle: respiratory standardized mortality ratio. Right: lung cancer standardized mortality ratio.
Figure 5: Estimated spatial relative risk surfaces. Left: shared component between FEV1, respiratory mortality and lung cancer mortality. Right: lung cancer mortality-specific component.

5 Summary and Discussions

We have proposed a unifying process-based statistical framework to handle spatial data fusion. The framework allows all three types of spatial data, namely geostatistical, lattice and point-pattern data, to be easily incorporated into a single multivariate spatial model. This framework contains theoretical and computational elements from several existing literatures: the basis for modeling latent processes in Stan is NNGP (Datta et al., 2016), the first Bayesian implementation is based on Stan (Stan Development Team, 2018), the alternative implementation uses INLA (Rue et al., 2009), the sampling point approximation approach for modeling lattice data is adopted from Fuentes and Raftery (2005), discretization (Møller et al., 1998) is used in modeling point-pattern data in Stan while data augmentation Simpson et al. (2016) is used in INLA. We have combined all of the individual elements and constructed this unifying framework. The framework extends upon existing flexible spatial fusion models (Wang et al., 2018; Wilson and Wakefield, 2018)

by making point-pattern data also compatible, hence completes all three spatial data types. We have benchmarked our INLA implementation against full Bayesian inference in Stan, and observed comparable performance with significantly decreased computation time. In addition, we have shown in the first simulation study that it is advantageous to conduct multivariate analysis using multiple spatial datasets if they are available.

Identifiability issues arise when there is more than one latent spatial process in the fusion model. Similar concern has been brought up in other multivariate spatial models (Ren and Banerjee, 2013; Knorr-Held and Best, 2001). Since the model becomes invariant under certain orthogonal transformations, the design matrix is not identifiable. Knorr-Held and Best (2001) proposed a specific constraint on the relationship among the individual elements of the design matrix. Ren and Banerjee (2013) proposed to constrain one element of each row in the design matrix to be strictly positive and having an ordered spatial range parameter. The same constraints allow identifiable parameters in the INLA implementation but not in Stan. A distinction between our proposed framework and existing multivariate models is, that we could potentially have at most one observation at any of the spatial locations even when we have three response variables. This makes identifying more than one latent process at each location problematic. Our INLA implementation does not directly model the latent variable parameters at the set of locations , but on the mesh vertices. One solution in the Stan implementation is to re-use the observed locations as sampling points and locations representing grid cells of LGCP whenever possible, such that the number of latent process parameters is reduced. Alternatively, certain elements of the design matrix can be constrained to zero based on expert knowledge, as done in our application to the LuftiBus-SNC dataset. When a model involves Matérn covariance function with smoothness parameter greater than 1, our implementation in Stan can be easily adapted by modifying the likelihood expressions, while INLA models can still be used as an approximation.

The usage of our proposed framework is multifaceted. The interest sometimes lies within latent spatial processes when analyzing spatial data, which represent residual spatial correlation in the response variables after taking existing covariates into consideration. The result can be used for detecting spatial clusters of unexplained risk or shared scientific drivers for response variables, which warrant further investigation in identifying those unknown drivers. When the interest is in predicting response variable for a newly observed spatial unit, fusion model improves the prediction of latent processes which in turn can improve the response variable prediction. Furthermore, the framework can be modified to use a one-dimensional Gaussian process in the latent components such that it applies beyond spatial data. For example, it can be used in time series modeling where all the observations are in

and some machine learning applications

(Rasmussen and Williams, 2005).

Further research could be done on checking the compatibility of different data sources for spatial fusion modeling, i.e. if overlapping information exists between different spatial datasets. Such information can help to inform the model structure, especially the design matrix . Needless to say, the framework can be extended to include temporal components.


SUPPLEMENTARY MATERIAL

All the R code used for the simulation studies is available in a separate file on the authors website.

References

  • Banerjee et al. (2014) Banerjee, S., B. P. Carlin, and A. E. Gelfand (2014). Hierarchical Modeling and Analysis for Spatial Data. CRC Press. p. 136-139.
  • Banerjee et al. (2008) Banerjee, S., A. E. Gelfand, A. O. Finley, and H. Sang (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society. Series B 70(4), 825–848.
  • Berrocal et al. (2010) Berrocal, V. J., A. E. Gelfand, and D. M. Holland (2010). A spatio-temporal downscaler for output from numerical models. Journal of Agricultural, Biological, and Environmental Statistics 15(2), 176–197.
  • Besag et al. (1991) Besag, J., J. York, and A. Mollié (1991). Bayesian Image Restoration, with Two Applications in Spatial Statistics. The Annals of the Institute of Statistical Mathematics 43(1), 1–20.
  • Brooks and Gelman (1998) Brooks, S. P. and A. Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7(4), 434–455.
  • Carpenter et al. (2017) Carpenter, B., A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76(1).
  • Chammartin et al. (2016) Chammartin, F., N. Probst-Hensch, J. Utzinger, and P. Vounatsou (2016). Mortality atlas of the main causes of death in Switzerland, 2008-2012. Swiss Medical Weekly 146.
  • Cressie (1991) Cressie, N. (1991). Statistics for Spatial Data. John Wiley & Sons.
  • Cressie and Johannesson (2007) Cressie, N. and G. Johannesson (2007). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B 70(1), 209–226.
  • Datta et al. (2016) Datta, A., S. Banerjee, A. O. Finley, and A. E. Gelfand (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association 111(514), 800–812.
  • Follestad and Rue (2003) Follestad, T. and H. Rue (2003). Modelling spatial variation in disease risk using Gaussian Markov random field proxies for Gaussian random fields. Technical report, Norwegian University of Science and Technology.
  • Fuentes and Raftery (2005) Fuentes, M. and A. E. Raftery (2005).

    Model evaluation and spatial interpolation by Bayesian combination of observations with outputs from numerical models.

    Biometrics 61(1), 36–45.
  • Fuglstad et al. (2018) Fuglstad, G.-A., D. Simpson, F. Lindgren, and H. Rue (2018). Constructing Priors that Penalize the Complexity of Gaussian Random Fields. Journal of the American Statistical Association 0(0), 1–8.
  • Furrer et al. (2006) Furrer, R., M. G. Genton, and D. Nychka (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15(3), 502–523.
  • Gatrell et al. (1996) Gatrell, A. C., T. C. Bailey, P. J. Diggle, and B. S. Rowlingson (1996). Spatial point pattern analysis and its application in geographical epidemiology. Transactions of the Institute of British Geographers 21(1), 256–274.
  • Gelfand et al. (2001) Gelfand, A. E., L. Zhu, and B. P. Carlin (2001). On the change of support problem for spatio-temporal data. Biostatistics 2(1), 31–45.
  • Gotway and Young (2002) Gotway, C. A. and L. J. Young (2002). Combining incompatible spatial data. Journal of the American Statistical Association 97(458), 632–648.
  • Greenland (1992) Greenland, S. (1992). Divergent biases in ecologic and individual-level studies. Statistics in Medicine 11(9), 1209–1223.
  • Homan and Gelman (2014) Homan, M. D. and A. Gelman (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(1), 1593–1623.
  • Jensen (1906) Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30, 175–193.
  • Jin et al. (2005) Jin, X., B. P. Carlin, and S. Banerjee (2005). Generalized hierarchical multivariate CAR models for areal data. Biometrics 61(4), 950–961.
  • Kelsall and Wakefield (2002) Kelsall, J. and J. Wakefield (2002). Modeling spatial variation in disease risk: A geostatistical approach. Journal of the American Statistical Association 97(459), 692–701.
  • Knorr-Held and Best (2001) Knorr-Held, L. and N. G. Best (2001). A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society. Series A 164(1), 73–85.
  • Krainski et al. (2018) Krainski, E., V. Gómez Rubio, H. Bakka, A. Lenzi, D. Castro-Camilo, D. Simpson, F. Lindgren, and H. Rue (2018). Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. Chapman and Hall/CRC.
  • Kyriakidis et al. (2001) Kyriakidis, P. C., J. Kim, and N. L. Miller (2001). Geostatistical mapping of precipitation from rain gauge data using atmospheric and terrain characteristics. Journal of Applied Meteorology 40(11), 1855–1877.
  • Lindgren and Rue (2015) Lindgren, F. and H. Rue (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical Software, Articles 63(19), 1–25.
  • Lindgren et al. (2011) Lindgren, F., H. Rue, and J. Lindström (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B 73(4), 423–498.
  • Liu et al. (2011) Liu, Z., N. D. Le, and J. V. Zidek (2011). An empirical assessment of Bayesian melding for mapping ozone pollution. Environmetrics 22(3), 340–353.
  • Lozano and et al (2012) Lozano, R. and et al (2012). Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: a systematic analysis for the Global Burden of Disease Study 2010. The Lancet 380(9859), 2095–2128.
  • Lunge Zürich (2017) Lunge Zürich (2017). The LuftiBus Project. Accessed: 2017–02–28.
  • MacNab (2016) MacNab, Y. C. (2016). Linear models of coregionalization for multivariate lattice data: a general framework for coregionalized multivariate CAR models. Statistics in Medicine 35(21), 3827–3850.
  • Martins et al. (2013) Martins, T. G., D. Simpson, F. Lindgren, and H. Rue (2013). Bayesian computing with INLA: New features. Computational Statistics & Data Analysis 67, 68–83.
  • McMillan et al. (2010) McMillan, N. J., D. M. Holland, M. Morara, and J. Feng (2010). Combining numerical model output and particulate data using Bayesian space–time modeling. Environmetrics 21(1), 48–65.
  • Menezes et al. (2014) Menezes, A. M. B., R. Pérez-Padilla, F. C. Wehrmeister, M. V. Lopez-Varela, A. Muiño, G. Valdivia, C. Lisboa, J. R. B. Jardim, M. de Oca Maria, C. Talamo, R. Bielemann, M. Gazzotti, R. Laurenti, B. Celli, C. G. Victora, and for the PLATINO team (2014). Lozano2012. PLOS ONE 9(10), 1–10.
  • Møller et al. (1998) Møller, J., A. R. Syversveen, and R. P. Waagepetersen (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics 25(3), 451–482.
  • Moraga et al. (2017) Moraga, P., S. M. Cramb, K. L. Mengersen, and M. Pagano (2017). A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE. Spatial Statistics 21, 27–41.
  • Ogata (1988) Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association 83(401), 9–27.
  • R Core Team (2018) R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Rasmussen and Williams (2005) Rasmussen, C. E. and C. K. I. Williams (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
  • Ren and Banerjee (2013) Ren, Q. and S. Banerjee (2013). Hierarchical factor models for large spatially misaligned data: a low-rank predictive process approach. Biometrics 69(1), 19–30.
  • Rue et al. (2009) Rue, H., S. Martino, and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B 71(2), 319–392.
  • Sahu et al. (2010) Sahu, S. K., A. E. Gelfand, and D. M. Holland (2010). Fusing point and areal level space-time data with application to wet deposition. Journal of the Royal Statistical Society Series C 59(1), 77–103.
  • Schmidt and Gelfand (2003) Schmidt, A. M. and A. E. Gelfand (2003). A Bayesian coregionalization approach for multivariate pollutant data. Journal of Geophysical Research: Atmospheres 108(D24).
  • Shi and Kang (2017) Shi, H. and E. L. Kang (2017). Spatial data fusion for large non-Gaussian remote sensing datasets. Stat 6(1), 390–404.
  • Simpson et al. (2016) Simpson, D., J. B. Illian, F. Lindgren, S. H. Sørbye, and H. Rue (2016). Going off grid: computationally efficient inference for log-Gaussian Cox processes. Biometrika 103(1), 49–70.
  • Stan Development Team (2018) Stan Development Team (2018). RStan: the R interface to Stan. R package version 2.18.0.
  • Stein (2008) Stein, M. L. (2008). A modeling approach for large spatial datasets. Journal of the Korean Statistical Society 37(1), 3–10.
  • Wackernagel (2003) Wackernagel, H. (2003). Multivariate Geostatistics: An Introduction with Applications. Berlin, Heidelberg: Springer Berlin Heidelberg.
  • Wang and Furrer (2019) Wang, C. and R. Furrer (2019). Efficient inference of generalized spatial fusion models with flexible specification. Stat 8(1), e216.
  • Wang et al. (2018) Wang, C., M. A. Puhan, and R. Furrer (2018). Generalized spatial fusion model framework for joint analysis of point and areal data. Spatial Statistics 23, 72–90.
  • Wang and Wall (2003) Wang, F. and M. M. Wall (2003). Generalized common spatial factor model. Biostatistics 4(4), 569–582.
  • Wilson and Wakefield (2018) Wilson, K. and J. Wakefield (2018). Pointless spatial modeling. Biostatistics, 1–16.
  • Zhang et al. (2019) Zhang, L., A. Datta, and S. Banerjee (2019). Practical bayesian modeling and inference for massive spatial data sets on modest computing environments†.

    Statistical Analysis and Data Mining: The ASA Data Science Journal

     12(3), 197–209.