Modeling Massive Spatial Datasets Using a Conjugate Bayesian Linear Regression Framework

09/09/2021
by   Sudipto Banerjee, et al.

Geographic Information Systems (GIS) and related technologies have generated substantial interest among statisticians with regard to scalable methodologies for analyzing large spatial datasets. A variety of scalable spatial process models have been proposed that can be easily embedded within a hierarchical modeling framework to carry out Bayesian inference. While the focus of statistical research has mostly been directed toward innovative and more complex model development, relatively limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article discusses how point-referenced spatial process models can be cast as a conjugate Bayesian linear regression that can rapidly deliver inference on spatial processes. The approach allows exact sampling directly (avoids iterative algorithms such as Markov chain Monte Carlo) from the joint posterior distribution of regression parameters, the latent process and the predictive random variables, and can be easily implemented on statistical programming environments such as R.



1 Introduction

Statistical modeling and analysis for spatial and spatial-temporal data continue to receive much attention due to enhancements in computerized Geographic Information Systems (GIS) and accompanying technologies. Bayesian hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers seeking to better understand the complex nature of spatial and temporal variability. See, for example, the books by Cressie (1993), Stein (1999), Møller and Waagepetersen (2003), Schabenberger and Gotway (2004), Gelfand et al. (2010), Cressie and Wikle (2011) and Banerjee et al. (2014) for a variety of statistical methods in diverse application domains.

Spatial data analysis is conveniently carried out by embedding a spatial process within the familiar hierarchical modeling paradigm,

[data | process, parameters] x [process | parameters] x [parameters].   (1)

Modeling for point-referenced data, which refers to data referenced by points with coordinates (latitude-longitude, Easting-Northing, etc.), proceeds from a random field that introduces dependence among any finite collection of random variables. Formally, the random field is a stochastic process defined as an uncountable set of random variables, say $\{w(\ell) : \ell \in \mathcal{L}\}$, over a domain of interest $\mathcal{L}$. This uncountable set is endowed with a probability law specifying the joint distribution for any finite subset of random variables. Spatial processes are usually constructed assuming $\mathcal{L} \subseteq \Re^d$ (usually $d = 2$ or $3$) or, perhaps, as a subset of points on a sphere or ellipsoid. In spatiotemporal settings $\mathcal{L} = \mathcal{S} \times \mathcal{T}$, where $\mathcal{S}$ and $\mathcal{T}$ are the space and time domains, respectively, and $\ell = (s, t)$ is a space-time coordinate with spatial location $s \in \mathcal{S}$ and time point $t \in \mathcal{T}$ (see, e.g., Gneiting and Guttorp, 2010, for details).

Gaussian random fields are specified with a covariance function $K_{\theta}(\ell, \ell')$ for any two points $\ell$ and $\ell'$ in $\mathcal{L}$. If $\mathcal{A}$ and $\mathcal{B}$ are finite sets comprising $n_{\mathcal{A}}$ and $n_{\mathcal{B}}$ points in $\mathcal{L}$, respectively, then $K_{\theta}(\mathcal{A}, \mathcal{B})$ denotes the $n_{\mathcal{A}} \times n_{\mathcal{B}}$ matrix whose $(i,j)$-th element is evaluated using the covariance function between the $i$-th point of $\mathcal{A}$ and the $j$-th point of $\mathcal{B}$. If $\mathcal{A}$ or $\mathcal{B}$ comprises a single point, $K_{\theta}(\mathcal{A}, \mathcal{B})$ is a row or column vector, respectively. A valid spatiotemporal covariance function ensures that $K_{\theta}(\mathcal{A}, \mathcal{A})$ is positive definite for any finite set $\mathcal{A}$, which we will denote simply as $K_{\theta}$ if the context is clear. A customary specification models $w(\ell)$ as a zero-centered Gaussian process, denoted $w(\ell) \sim GP(0, K_{\theta})$. For any finite collection $\mathcal{A} = \{\ell_1, \ldots, \ell_n\}$ in $\mathcal{L}$, the random vector $w = (w(\ell_1), \ldots, w(\ell_n))^{\top}$ is distributed as $N(0, K_{\theta})$, where $K_{\theta} = K_{\theta}(\mathcal{A}, \mathcal{A})$. Further details on valid spatial (and spatiotemporal) covariance functions can be found in Gneiting and Guttorp (2010), Cressie (1993), Stein (1999), Gelfand et al. (2010), Cressie and Wikle (2011) and Banerjee et al. (2014) and numerous references therein.

If $y(\ell)$ represents a variable of interest at point $\ell$, then a customary spatial regression model at $\ell$ is

$y(\ell) = x(\ell)^{\top}\beta + w(\ell) + \epsilon(\ell),$   (2)

where $x(\ell)$ is a $p \times 1$ vector of spatially referenced predictors, $\beta$ is the corresponding $p \times 1$ vector of slopes, $w(\ell)$ is the spatial or spatiotemporal process, and $\epsilon(\ell)$ is a white noise process modeling measurement error or fine-scale variation attributed to disturbances at distances smaller than the minimum observed separations in space and/or time. We now embed (2) and the spatial process within the Bayesian hierarchical model

(3)

where $y = (y(\ell_1), \ldots, y(\ell_n))^{\top}$ is the $n \times 1$ vector of observed outcomes, $X$ is the $n \times p$ matrix of regressors with $i$-th row $x(\ell_i)^{\top}$, and $D$ is the covariance matrix for the noise vector $\epsilon = (\epsilon(\ell_1), \ldots, \epsilon(\ell_n))^{\top}$ over the observed locations. A common specification is $D = \tau^2 I_n$, where $\tau^2$ is called the “nugget.” The hierarchy is completed by assigning prior distributions to $\beta$, $\theta$ and $\tau^2$.

For fitting (3) to large spatial datasets, a substantial computational expense is incurred from the size of the $n \times n$ spatial covariance matrix $K_{\theta}$. Since $\theta$ is unknown, each iteration of the model fitting algorithm will involve decomposing or factorizing $K_{\theta}$, which typically requires $O(n^3)$ floating point operations (flops) and $O(n^2)$ memory. In geostatistical settings, data are almost never observed on regular grids and the configuration of points is typically highly irregular. The covariance models that have been demonstrated to be most effective for inference do not, in general, result in any computationally exploitable structure for $K_{\theta}$, which makes the matrix computations prohibitive for large values of $n$. For Gaussian likelihoods, one can integrate out the random effects $w$ from (3) and work with the posterior

(4)

This reduces the parameter space to the regression and covariance parameters by excluding the high-dimensional vector $w$ of random effects, but one still needs to work with the marginal covariance matrix of $y$, which is $n \times n$. These are referred to as “big-n” or “high-dimensional” problems in geostatistics.

There is already a substantial literature on high-dimensional spatial and spatiotemporal modeling and we do not attempt a comprehensive review here; see, e.g., Banerjee (2017) for a focused review of some popular Bayesian approaches and Heaton et al. (2019) for a comparative evaluation of a variety of contemporary statistical methods. These papers, and the references therein, offer a variety of algorithmic and model-based approaches for large data. Some published methods have implementations that scale into the millions of locations (see, e.g., Katzfuss, 2017; Abdulah et al., 2018; Huang and Sun, 2018; Finley et al., 2019; Zhang et al., 2019), but they often require specialized high-performance computing architectures and libraries harnessing parallel processing or graphical processing units. Uncertainty quantification on the spatial process, while maintaining fidelity to the underlying probability model, may also be challenging. With the advent of a new generation of data products, there is a need for simpler implementations that can be run on modest computing architectures by practicing spatial analysts. This requires new directions in thinking about high-dimensional spatial problems. Here, we will show how some elementary conjugate Bayesian linear regression models can be exploited to conduct Bayesian analysis for massive spatial datasets. While a common underlying idea is to approximate the underlying spatial process with a scalable alternative, we will ensure that such approximations result in well-defined probability models. In this sense, these approaches can be described as model-based solutions for very large spatial datasets that can be executed on modest computing environments. One exception to the fully model-based approach is a divide-and-conquer approach that we briefly review, where an approximation to the full posterior distribution for the entire data is constructed from several posterior distributions of smaller subsets of the data.

The balance of the paper proceeds as follows. The next section briefly reviews dimension reduction and sparsity-inducing spatial models. Section 3 presents some standard distribution theory for Bayesian linear regression and outlines how scalable spatial process models can be cast into such frameworks. A synopsis of some simulation experiments and data analysis examples is provided in Section 4. Section 5 presents an alternative approach based upon dividing and conquering the data, known as meta-kriging. The paper concludes with some further discussion in Section 6.

2 Dimension reduction and sparsity

Dimension reduction (Wikle and Cressie, 1999) is among the most conspicuous of approaches for handling large spatial datasets. This customarily proceeds by representing or approximating the spatial process in terms of the realizations of a latent process over a smaller set of points, often referred to as knots. Thus,

$w(\ell) \approx \tilde{w}(\ell) = \sum_{j=1}^{r} \psi_j(\ell)\, z(\ell_j^*),$   (5)

where $z(\ell)$ is a well-defined (usually unobserved) process and $\{\psi_j(\cdot)\}$ is a family of basis functions or kernels, possibly depending upon some parameters. The collection of locations $\{\ell_1^*, \ldots, \ell_r^*\}$ are the knots, and $\tilde{w}$ and $z$ are vectors with components $\tilde{w}(\ell_i)$ and $z(\ell_j^*)$, respectively. Therefore, $\tilde{w} = \Psi z$, where $z = (z(\ell_1^*), \ldots, z(\ell_r^*))^{\top}$ and $\Psi$ is $n \times r$ with $(i,j)$-th element $\psi_j(\ell_i)$. We work with $r$ (instead of $n$) latent variables $z(\ell_j^*)$ and the $n \times r$ matrix $\Psi$. Choosing $r \ll n$ effectuates dimension reduction because $\tilde{w}(\ell)$, as defined in (5), spans only an $r$-dimensional space. When $r < n$, the joint distribution of $\tilde{w}$ is singular. Nevertheless, we construct a valid stochastic process with covariance function

$\tilde{K}(\ell, \ell') = \psi(\ell)^{\top} V_z\, \psi(\ell'),$   (6)

where $\psi(\ell) = (\psi_1(\ell), \ldots, \psi_r(\ell))^{\top}$ and $V_z$ is the $r \times r$ variance-covariance matrix (which also depends upon unknown parameters) for $z$. From (6), we see that, even if $z(\ell)$ is stationary, the induced covariance function is not. If the $z(\ell_j^*)$'s are Gaussian, then $\tilde{w}(\ell)$ is a Gaussian process. Every choice of basis functions yields a process and there are too many choices to enumerate here. Wikle (2010) offers an excellent overview of low rank models.

Some choices of basis functions can be more computationally efficient than others depending upon the specific application. For example, Cressie and Johannesson (2008) (also see Shi and Cressie (2007)) discuss “Fixed Rank Kriging” (FRK), which uses very flexible families of non-stationary covariance functions to carry out high-dimensional kriging; Cressie et al. (2010) extend FRK to spatiotemporal settings, calling the procedure “Fixed Rank Filtering” (FRF); Katzfuss and Cressie (2012) provide efficient constructions for massive spatiotemporal datasets; and Katzfuss (2013) uses spatial basis functions to capture medium to long range dependence and tapers the residual to capture fine scale dependence. Multiresolution basis functions (Nychka et al., 2002, 2015) have been shown to be effective in building computationally efficient nonstationary models. These papers amply demonstrate the versatility of low-rank approaches using different basis functions. An alternative approach specifies the latent process itself as a spatial process. This process is called the “parent process” and one can derive a low-rank process from the parent. One such derivation emerges from truncating the (infinite) Karhunen-Loève basis expansion of a Gaussian process to a finite number of terms to obtain a low-rank process (see, e.g., Rasmussen and Williams, 2005; Banerjee et al., 2014). This is equivalent to projecting the parent process onto a lower-dimensional subspace determined by a partial realization of the parent over the knots. This yields the predictive process and several variants aimed at improving the approximation (Banerjee et al., 2008, 2010; Sang et al., 2011; Sang and Huang, 2012; Katzfuss, 2017); also see Finley et al. (2015) and Banerjee (2017) for computational details on efficiently implementing Gaussian predictive processes, one variant of which is sketched below.
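As a concrete illustration of a knot-based low-rank construction, the following sketch builds a predictive-process-style approximation assuming a unit-variance exponential covariance. The coordinates, the knot locations, and the values of n, r and phi are all hypothetical choices for this example and are not taken from the cited papers.

```r
## Sketch: a predictive-process-style low-rank approximation (illustrative only).
## Assumes a unit-variance exponential covariance K(s, s') = exp(-phi * ||s - s'||);
## coords, knots, n, r and phi are hypothetical choices for this example.
set.seed(1)
n <- 500; r <- 25; phi <- 6
coords <- cbind(runif(n), runif(n))                 # observed locations
knots  <- cbind(runif(r), runif(r))                 # r << n knot locations

cross_dist <- function(A, B) {                      # pairwise Euclidean distances
  sqrt(pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B), 0))
}

C_star  <- exp(-phi * cross_dist(knots, knots))     # r x r covariance among knots
C_cross <- exp(-phi * cross_dist(coords, knots))    # n x r cross-covariance
Psi     <- C_cross %*% solve(C_star)                # n x r basis matrix
K_tilde <- Psi %*% C_star %*% t(Psi)                # induced rank-r covariance (singular when r < n)
```

The rank-r matrix K_tilde illustrates the dimension reduction: all linear algebra involving the latent effects can be carried out through the r x r matrix C_star and the n x r matrix Psi.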

While dimension reduction methods have been applied extensively and effectively to analyze sizable spatial data sets, their computational efficiency and inferential performance tend to struggle at even larger scales (Banerjee, 2017). More recently, there have been substantial developments in full rank models that exploit sparsity. We introduce sparsity either in the covariance matrix or in its inverse (the precision matrix). Covariance tapering (Furrer et al., 2006; Kaufman et al., 2008; Du et al., 2009) is in the spirit of the former, modeling the covariance as $K_{\theta} \odot K_{\text{tap}}(\nu)$, where $K_{\text{tap}}(\nu)$ is a sparse covariance matrix formed from a compactly supported, or tapered, covariance function with tapering parameter $\nu$ and $\odot$ denotes the element-wise (or Hadamard) product of two matrices. The Hadamard product of two positive definite matrices retains positive definiteness, so $K_{\theta} \odot K_{\text{tap}}(\nu)$ is positive definite. Furthermore, $K_{\text{tap}}(\nu)$ is sparse because a tapered covariance function equals $0$ for all pairs of locations separated by a distance beyond the threshold $\nu$; a small sketch of this construction follows this paragraph. Covariance tapering is undoubtedly an attractive approach for constructing sparse covariance matrices, but its practical implementation for full Bayesian inference will generally require efficient sparse Cholesky decompositions, numerically stable determinant computations and, perhaps most importantly, effective memory management. These issues are yet to be tested for truly massive spatiotemporal datasets.
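To make the Hadamard-product construction concrete, here is a minimal sketch using a spherical taper with an assumed threshold; the simulated coordinates, the exponential covariance and the values of phi and nu are illustrative, and real applications would rely on dedicated tapering and sparse-matrix software.

```r
## Sketch: covariance tapering by an elementwise (Hadamard) product (illustrative).
## The spherical taper is one standard compactly supported choice; nu, phi and the
## simulated coordinates are assumptions made for this example.
library(Matrix)
set.seed(1)
n <- 1000; phi <- 6; nu <- 0.1                     # nu = taper range (zero beyond it)
coords <- cbind(runif(n), runif(n))
D <- as.matrix(dist(coords))                       # inter-site distances
K <- exp(-phi * D)                                 # dense exponential covariance
taper <- ifelse(D < nu, 1 - 1.5 * (D / nu) + 0.5 * (D / nu)^3, 0)  # spherical taper
K_tap <- Matrix(K * taper, sparse = TRUE)          # sparse and still positive definite
nnzero(K_tap) / length(K_tap)                      # fraction of entries retained
```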

One could also devise models with sparse precision matrices. For finite-dimensional distributions, conditional and simultaneous autoregressive (CAR and SAR) models (see, e.g., Cressie, 1993; Banerjee et al., 2014, and references therein) adopt this approach for areally referenced datasets. The CAR models are special instances of Gaussian Markov random fields or GMRFs (Rue and Held, 2005) that have led to the popular quadrature-based Integrated Nested Laplace Approximation (INLA) algorithms of Rue et al. (2009) for Bayesian inference and to approximations of Gaussian processes (Lindgren et al., 2011). These approaches can be computationally efficient for certain classes of covariance functions with stochastic partial differential equation (SPDE) representations (including the versatile Matérn class), but their inferential performance on spatiotemporal or multivariate Gaussian processes (perhaps specified through more general covariance or cross-covariance functions) embedded within Bayesian hierarchical models is yet to be fully developed or assessed for massive datasets.

One could also construct massively scalable sparsity-inducing Gaussian processes using essentially the techniques of Gaussian graphical models, exploiting the relationship between the Cholesky decomposition of a positive definite matrix and conditional independence. For Gaussian processes in particular, recent developments on Nearest Neighbor Gaussian Processes (NNGP) (Datta et al., 2016a, b; Banerjee, 2017; Finley et al., 2019) have proceeded from GP likelihoods based upon directed acyclic graphs (DAGs), as used by Vecchia (1988) and Stein et al. (2004). The NNGP is a Gaussian process whose finite-dimensional realizations have sparse precision matrices. Other related papers using the approximation in Vecchia (1988) include Stroud et al. (2017), Guinness (2018), Katzfuss and Guinness (2017), and Katzfuss et al. (2018). Shi et al. (2017) recently used the NNGP for uncertainty quantification and Ma and Kang (2017) used it as part of a rich class of fused Gaussian process models.

Full Bayesian inference for low-rank and sparse Gaussian process models requires iterative algorithms such as Markov chain Monte Carlo (MCMC) or INLA. Details of these implementations can be found in the aforementioned references. In the following section, we will discuss how these spatial models can be embedded within a Bayesian linear regression framework and provide some practical strategies for inference based upon direct (exact) sampling from the posterior distribution.

3 Conjugate Bayesian models for massive datasets

3.1 Conjugate Bayesian linear geostatistical models

A conjugate Bayesian linear regression model is written as

$y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V_y); \qquad \beta \mid \sigma^2 \sim N(\mu_\beta, \sigma^2 V_\beta); \qquad \sigma^2 \sim IG(a, b),$   (7)

where $y$ is an $n \times 1$ vector of observations of the dependent variable, $X$ is an $n \times p$ matrix (assumed to be of rank $p$) of independent variables (covariates or predictors) whose first column is usually taken to be the intercept, $V_y$ is a fixed (i.e., known) positive definite matrix, and $\mu_\beta$, $V_\beta$, $a$ and $b$ are fixed hyper-parameters specifying the prior distributions on the regression slopes $\beta$ and the scale $\sigma^2$. This model is easily tractable and the posterior distribution is

$p(\beta, \sigma^2 \mid y) = IG(\sigma^2 \mid a^*, b^*) \times N(\beta \mid Mm, \sigma^2 M),$   (8)

where $M^{-1} = V_\beta^{-1} + X^{\top}V_y^{-1}X$, $m = V_\beta^{-1}\mu_\beta + X^{\top}V_y^{-1}y$, $a^* = a + n/2$ and $b^* = b + \frac{1}{2}\left(\mu_\beta^{\top}V_\beta^{-1}\mu_\beta + y^{\top}V_y^{-1}y - m^{\top}Mm\right)$. Sampling from the joint posterior distribution of $(\beta, \sigma^2)$ is achieved by first sampling $\sigma^2 \sim IG(a^*, b^*)$ and then sampling $\beta \sim N(Mm, \sigma^2 M)$ for each sampled $\sigma^2$. This yields marginal posterior samples from $p(\beta \mid y)$, which is a non-central multivariate $t$ distribution, but we do not need to work with its complicated density function. See Gelman et al. (2013) for further details on the conjugate Bayesian linear regression model and sampling from its posterior.

We will adapt (7) to accommodate (3) or (4). Let us first consider (4) with the customary specification $D = \tau^2 I_n$ and let $K_{\theta} = \sigma^2 R(\phi)$, where $R(\phi)$ is a correlation matrix whose entries are given by a correlation function $\rho(\ell, \ell'; \phi)$. Thus, $\theta = \{\sigma^2, \phi\}$, where $\sigma^2$ is the spatial variance component and $\phi$ is a spatial decay parameter controlling the rate at which the spatial correlation decays with increasing separation between points. A simple example is $\rho(\ell, \ell'; \phi) = \exp(-\phi\,\|\ell - \ell'\|)$, although much richer choices are available (see, e.g., Ch. 3 in Banerjee et al., 2014). Therefore, we can write the marginal covariance of $y$ as $\sigma^2 V_y$, where $V_y = R(\phi) + \delta^2 I_n$ and $\delta^2 = \tau^2/\sigma^2$ is the ratio between the “noise” variance and the “spatial” variance. If we assume that $\phi$ and $\delta^2$ are fixed, and that the priors on $\beta$ and $\sigma^2$ are as in (7), then we have reduced (4) to (7) and direct sampling from its posterior is easily achieved as described below (8); a small illustrative sketch follows. We will return to the issue of fixing $\{\phi, \delta^2\}$ shortly.
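The sketch below illustrates this direct sampling for the marginalized model with phi and delta2 held fixed. The simulated data, the prior settings and the chosen fixed values are assumptions made for this illustration; the posterior updates follow the standard conjugate normal/inverse-Gamma formulas stated below (8) and are not the authors' own code.

```r
## Sketch: exact posterior sampling for the conjugate spatial model with phi and
## delta2 = tau2 / sigma2 held fixed. Priors: beta ~ N(mu_b, sigma2 * V_b),
## sigma2 ~ IG(a, b). Data and fixed values of phi and delta2 are illustrative.
set.seed(1)
n <- 400; p <- 2
coords <- cbind(runif(n), runif(n))
X <- cbind(1, rnorm(n)); y <- rnorm(n)                 # placeholder data
phi <- 6; delta2 <- 0.25                               # fixed by EDA / cross-validation
Vy <- exp(-phi * as.matrix(dist(coords))) + delta2 * diag(n)   # V_y = R(phi) + delta2 * I

mu_b <- rep(0, p); V_b <- diag(1e4, p); a <- 2; b <- 1
Vy_inv <- chol2inv(chol(Vy))
M <- chol2inv(chol(solve(V_b) + t(X) %*% Vy_inv %*% X))
m <- solve(V_b, mu_b) + t(X) %*% Vy_inv %*% y
a_star <- a + n / 2
b_star <- b + 0.5 * as.numeric(t(mu_b) %*% solve(V_b, mu_b) +
                               t(y) %*% Vy_inv %*% y - t(m) %*% M %*% m)

L <- 1000                                              # exact, independent posterior draws
sigma2_draws <- 1 / rgamma(L, shape = a_star, rate = b_star)
beta_draws <- t(sapply(sigma2_draws, function(s2)
  as.numeric(M %*% m) + t(chol(s2 * M)) %*% rnorm(p)))
tau2_draws <- delta2 * sigma2_draws                    # recover tau2 from the fixed ratio
```

Because the draws are independent and exact, a few hundred to a thousand samples typically suffice for posterior summaries, with no burn-in or convergence diagnostics required.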

Let us turn to accommodating (3) within (7), which would include directly sampling the spatial random effects $w$ from their marginal posterior $p(w \mid y)$. Here, it is instructive to write the joint distribution of $y$ and $w$ in (3) as a linear model,

(9)

where the stacked error vector combines the noise $\epsilon$ and the prior on the spatial effects $w$. If we assume that $\phi$ and $\delta^2$ are fixed at known values, then the joint covariance structure of this augmented system is fixed. We then have a conjugate Bayesian linear regression in $\gamma = (\beta^{\top}, w^{\top})^{\top}$, where $\beta$ has a flat prior and $\sigma^2$ retains its inverse-Gamma prior. Thus,

(10)

where the posterior mean of $\gamma$ is the generalized least squares estimate obtained from the augmented linear system in (9). Sampling from the posterior proceeds analogously to that described below (8).

From the preceding account we see that fixing the spatial decay parameter $\phi$ and the noise-to-spatial variance ratio $\delta^2$ casts the Bayesian geostatistical model into a conjugate framework that allows inference on the regression slopes, the scale and the latent process. Note that multiplying the posterior samples of $\sigma^2$ by the fixed quantity $\delta^2$ yields the posterior samples of $\tau^2$. Therefore, we neglect uncertainty in $\phi$ and, partially, in one of the variance components by fixing their ratio. This, however, provides the computational advantage that inference can be carried out without resorting to expensive iterative algorithms such as MCMC, which require many iterations before sampling from the posterior distribution. This computational benefit becomes especially relevant when handling massive spatial data. Furthermore, fixing the values of $\phi$ and $\delta^2$ is not entirely unreasonable given that these parameters are weakly identified by the data (Zhang, 2004) and difficult to learn from the posterior. Nevertheless, the inference will depend upon these fixed values, so we discuss a practical approach for fixing $\phi$ and $\delta^2$ at reasonable values.

3.2 Choosing $\phi$ and $\delta^2$

We can set values for $\phi$ and $\delta^2$ by conducting some simple spatial exploratory data analysis using the “variogram”. Several practical algorithms exist for empirically calculating the variogram (or semivariogram) from observations using finite sample moments. Many of these methods are now offered in user-friendly R packages hosted by the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org). As one example, Finley et al. (2019) investigate the impact of tree cover and the occurrence of forest fires on forest height. They first fit an ordinary (non-spatial) linear regression of forest height on these predictors and then compute a variogram for the residuals from the ordinary linear regression.

Figure 1: Variogram of the residuals from non-spatial regression indicates strong spatial pattern

Figure 1 depicts the variogram, which informs about the process parameters. The lower horizontal line represents the “nugget”, the micro-scale variation captured by the measurement error variance component $\tau^2$. The top horizontal line represents the “sill” (or ceiling), which is the total variation captured by $\sigma^2 + \tau^2$. The difference between the two horizontal lines is therefore the “partial sill”, captured by $\sigma^2$. Finally, the vertical line represents the distance beyond which the variogram flattens, or the spatial covariance tends to zero. One can provide “eye-ball” estimates for these quantities and, in particular, fix the values of $\phi$ and $\delta^2 = \tau^2/\sigma^2$. Fixing these values from the variogram yields the desired, highly accessible conjugate framework, and the models can be estimated without resorting to Markov chain Monte Carlo (MCMC) as described earlier. Note that instead of $\delta^2$, we could also have fixed $\phi$ and any one of the variance components, $\sigma^2$ or $\tau^2$, which would also yield a conjugate model with exact distribution theory. The slight advantage of fixing $\delta^2$ is that we obtain posterior samples of both $\sigma^2$ and $\tau^2$, the latter obtained simply as $\tau^2 = \delta^2\sigma^2$.
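A minimal sketch of this exploratory step in R is shown below, assuming a data frame `dat` with an outcome `y`, predictors `x1` and `x2`, and coordinates `lon` and `lat`; the column names and the starting values passed to variofit are illustrative, while the geoR calls follow that package's documented interface.

```r
## Sketch: variogram-based exploratory analysis of residuals from a non-spatial fit.
## `dat`, the column names and the starting values below are illustrative assumptions.
library(geoR)
fit_ols <- lm(y ~ x1 + x2, data = dat)                      # ordinary (non-spatial) regression
vg <- variog(coords = as.matrix(dat[, c("lon", "lat")]),
             data = residuals(fit_ols))
plot(vg)                                                    # eyeball nugget, sill and range
vf <- variofit(vg, ini.cov.pars = c(1, 0.5), nugget = 0.1,
               cov.model = "exponential")
## vf provides rough values for the partial sill (sigma2), the nugget (tau2) and the
## range, from which the ratio delta2 = tau2 / sigma2 and the decay phi can be fixed.
```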

The above crude estimates can be improved using $K$-fold cross-validation. We split the data randomly into $K$ different folds. Let $S_k$ be the $k$-th fold of observed points and let $S_{-k}$ denote the observed points outside of $S_k$. For each candidate value of $\{\phi, \delta^2\}$, we compute the posterior predictive means for the locations in each $S_k$ using the model fitted to $S_{-k}$. We then compute the “Root Mean Square Predictive Error” (RMSPE) (Yeniay and Goktas, 2002) and choose the value of $\{\phi, \delta^2\}$ corresponding to the smallest RMSPE over a grid of candidate values. The range of the grid is based on the interpretation of the hyper-parameters. We suggest a reasonably wide range for $\delta^2$, which accommodates one variance component substantially dominating the other in either direction. For the spatial decay $\phi$, we suggest a lower bound of $3/d_{\max}$, where $d_{\max}$ is the maximum inter-site distance; based upon the exponential covariance function, this indicates that the spatial correlation drops below 0.05 at the maximum inter-site distance. An upper bound can initially be set to 100 times the lower bound. Functions like variofit in the R package geoR (Ribeiro Jr and Diggle, 2012) can provide empirical estimates of $\phi$ from an empirical variogram. After initial fitting, we can shrink the range and refine the grid of candidate values to obtain more precise estimators.
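The cross-validation step can be sketched as a simple grid search. In the sketch below, `predict_conjugate()` is a hypothetical helper (not from any package) that returns posterior predictive means for a held-out fold under the conjugate model with the supplied fixed values, and the grid values themselves are illustrative.

```r
## Sketch: K-fold cross-validation over a grid of candidate (phi, delta2) values,
## scored by RMSPE. predict_conjugate() is a hypothetical helper that fits the
## conjugate model to the training fold and returns predictive means for the test fold.
K <- 5
folds <- sample(rep(1:K, length.out = n))
grid <- expand.grid(phi = c(3, 6, 12, 24), delta2 = c(0.1, 0.5, 1, 2))

grid$rmspe <- apply(grid, 1, function(g) {
  errs <- unlist(lapply(1:K, function(k) {
    hold <- folds == k
    pred <- predict_conjugate(y[!hold], X[!hold, , drop = FALSE], coords[!hold, ],
                              X[hold, , drop = FALSE], coords[hold, ],
                              phi = g[["phi"]], delta2 = g[["delta2"]])
    y[hold] - pred
  }))
  sqrt(mean(errs^2))
})
best <- grid[which.min(grid$rmspe), ]   # fixed (phi, delta2) used for the final model fit
```

After this coarse search, the grid can be narrowed and refined around `best`, as suggested above.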

3.3 Conjugate Bayesian geostatistical models for massive spatial data

Conjugate models can be estimated by sampling directly from their joint posterior density and, therefore, completely obviate problems associated with MCMC convergence. This is a major computational benefit. However, the challenges in analyzing massive spatial data do not quite end here. When the number of spatial locations providing measurements is in the order of millions, as in Finley et al. (2019), then the $n \times n$ covariance and correlation matrices that we encountered earlier in different model parametrizations will be too massive to be efficiently loaded into the machine’s memory, let alone computed with. This precludes efficient likelihood computations and has led several researchers to propose models specifically adapted for such settings. We briefly present adaptations of (9) using two different classes of models for massive spatial data: (i) low-rank process models and (ii) NNGP models.

As discussed in Section 2, in low rank models the spatial effect $w$ in (3) is replaced by $\tilde{w} = \Psi z$, where $\Psi$ is the $n \times r$ matrix whose $i$-th row is $\psi(\ell_i)^{\top}$. Dimension reduction is achieved by fixing $r$ to be much smaller than $n$ so that we only deal with $r$ random effects instead of $n$. The framework in (9) can be easily adapted to this situation as below:

(11)

where the matrix associated with the latent effects is now the $n \times r$ matrix $\Psi$, fixed given the basis functions, instead of the $n \times n$ matrix in (9). Computations for (11) proceed analogously to those for (8), but benefits accrue in terms of storage and the number of floating point operations (flops) when conducting the exact conjugate Bayesian analysis for this model.

We now outline the construction of sparse NNGP models. These can be regarded as a special case of Gaussian Markov random fields (GMRFs) with a neighborhood structure specified using a directed acyclic graph (DAG). The computational benefits for NNGP models accrue from the ease of inverting sparse matrices. This is immediate from noting that the expense of obtaining the posterior quantities in (10) is dominated by the inverse of the $n \times n$ spatial covariance matrix $K_{\theta}$; if $K_{\theta}^{-1}$ is easily available, then the inference is inexpensive. Modeling a sparse $K_{\theta}^{-1}$ can be achieved as follows. Writing the zero-centered spatial vector $w \sim N(0, K_{\theta})$ through its sequence of conditional distributions is equivalent to the set of linear models $w(\ell_1) = \eta_1$ and $w(\ell_i) = \sum_{j=1}^{i-1} a_{ij}\,w(\ell_j) + \eta_i$ for $i = 2, \ldots, n$,

or, more compactly, simply $w = Aw + \eta$, where $A$ is strictly lower-triangular with elements $a_{ij} = 0$ whenever $j \geq i$, and $\eta \sim N(0, D)$ with $D$ diagonal with entries $d_1 = \mathrm{var}(w(\ell_1))$ and $d_i = \mathrm{var}(w(\ell_i) \mid w(\ell_j), j < i)$ for $i = 2, \ldots, n$. From the structure of $A$ it is evident that $I - A$ is unit lower-triangular, hence nonsingular, and $K_{\theta}^{-1} = (I - A)^{\top} D^{-1} (I - A)$.

We now introduce sparsity in $A$ by letting $a_{ij} = 0$ whenever $j \geq i$ (since $A$ is strictly lower-triangular) and also whenever $\ell_j$ is not among the $m$ nearest neighbors of $\ell_i$, where $m$ is fixed by the user to be a small number. It turns out that a very effective approximation emerges by recognizing that the lower-triangular elements of $A$ are precisely the coefficients of the linear combination of the $w(\ell_j)$'s equating to the conditional expectation $E[w(\ell_i) \mid w(\ell_1), \ldots, w(\ell_{i-1})]$. Thus, the vector $a_i$ of non-zero entries in the $i$-th row of $A$ is obtained by solving the linear system $K_{\theta}(N(\ell_i), N(\ell_i))\,a_i = K_{\theta}(N(\ell_i), \ell_i)$, where $K_{\theta}(N(\ell_i), N(\ell_i))$ is the principal submatrix extracted from $K_{\theta}$ corresponding to the neighbors of $\ell_i$ (indexed by the elements of the neighbor set $N(\ell_i)$) and $K_{\theta}(N(\ell_i), \ell_i)$ is the vector extracted by choosing the indices in $N(\ell_i)$ from the $i$-th column of $K_{\theta}$. Once $a_i$ is obtained, the $i$-th diagonal entry of $D$ is obtained as $d_i = K_{\theta}(\ell_i, \ell_i) - K_{\theta}(\ell_i, N(\ell_i))\,a_i$. These computations need to be carried out for each $i = 2, \ldots, n$ (note that for $i = 1$, $a_1 = 0$ and $d_1 = K_{\theta}(\ell_1, \ell_1)$), but $m$ can be kept very small even when $n$ is very large, so that the expense grows linearly in $n$ and remains feasible. The details can be found in Banerjee (2017). This notion is familiar in Gaussian graphical models and has been used in Vecchia (1988) and, more recently, in Datta et al. (2016a) and Finley et al. (2019) to tackle massive numbers of spatial locations.
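A bare-bones sketch of this construction is given below, assuming a unit-variance exponential correlation and a simple coordinate-based ordering of the locations. The brute-force neighbor search, the values of n, m and phi, and all object names are illustrative; packages such as spNNGP implement this construction far more efficiently.

```r
## Sketch: building the sparse NNGP factors described above (A strictly lower
## triangular, D diagonal) for a unit-variance exponential correlation exp(-phi * d).
## The ordering, the brute-force neighbor search and all names are illustrative.
library(Matrix)
set.seed(1)
n <- 500; m <- 10; phi <- 6
coords <- cbind(runif(n), runif(n))
coords <- coords[order(coords[, 1]), ]                 # fix an ordering of the locations

rows <- integer(0); cols <- integer(0); vals <- numeric(0)
Dvec <- numeric(n); Dvec[1] <- 1                       # d_1 = K(l_1, l_1) = 1

for (i in 2:n) {
  d_prev <- sqrt(colSums((t(coords[1:(i - 1), , drop = FALSE]) - coords[i, ])^2))
  nbrs <- order(d_prev)[1:min(m, i - 1)]               # m nearest "preceding" neighbors
  C_NN <- exp(-phi * as.matrix(dist(coords[nbrs, , drop = FALSE])))   # K(N(i), N(i))
  c_iN <- exp(-phi * d_prev[nbrs])                     # K(N(i), l_i)
  a_i  <- solve(C_NN, c_iN)                            # nonzero entries of row i of A
  rows <- c(rows, rep(i, length(nbrs))); cols <- c(cols, nbrs); vals <- c(vals, a_i)
  Dvec[i] <- 1 - sum(c_iN * a_i)                       # conditional variance d_i
}
A <- sparseMatrix(i = rows, j = cols, x = vals, dims = c(n, n))
Q <- t(Diagonal(n) - A) %*% Diagonal(x = 1 / Dvec) %*% (Diagonal(n) - A)  # sparse precision
```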

The framework in (9) now assumes the form

(12)

where the latent covariance structure is replaced by its sparsity-inducing NNGP counterpart. While this approach can also be subsumed into the framework of (9), its efficient implementation on standard computing architectures needs careful consideration and involves solving a large linear system. The coefficient matrix of that system is large, but it is sparse because of the NNGP construction: since each row of $A$ has at most $m$ nonzero entries, the number of nonzero entries in the resulting precision matrix $(I - A)^{\top}D^{-1}(I - A)$, and hence in the coefficient matrix, grows linearly in $n$ rather than quadratically. This sparsity can be exploited by sparse linear solvers such as conjugate gradient methods that can be implemented on modest computing environments.

Sampling from the joint posterior distribution is achieved in the following manner. First, the least-squares estimate is obtained using a sparse least-squares solver, for example a preconditioned conjugate gradient algorithm. Subsequently, $\sigma^2$ is sampled from its marginal posterior density, and we sample one value of the regression slopes and latent effects from their Gaussian conditional distribution using each sampled value of $\sigma^2$. In general, solving a dense $n \times n$ linear system requires $O(n^3)$ flops, but when the NNGP structure is exploited, the sparsity of $A$ and $D$ ensures memory requirements and computational costs that grow linearly in $n$. Details on such implementations on modest computing platforms can be found in Zhang et al. (2019).
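The kind of sparse solve this relies on can be illustrated with a plain (unpreconditioned) conjugate gradient routine; real implementations would add a preconditioner and exploit the NNGP-induced sparsity directly, and the tridiagonal test matrix below is only a stand-in for the sparse systems arising in (10) and (12).

```r
## Sketch: a bare-bones conjugate gradient solver for sparse symmetric positive
## definite systems; unpreconditioned and purely illustrative.
library(Matrix)
cg_solve <- function(Q, b, tol = 1e-8, maxit = 1000) {
  x <- numeric(length(b))
  r <- b - as.numeric(Q %*% x); p <- r; rs <- sum(r * r)
  for (it in seq_len(maxit)) {
    Qp <- as.numeric(Q %*% p)
    alpha <- rs / sum(p * Qp)
    x <- x + alpha * p
    r <- r - alpha * Qp
    rs_new <- sum(r * r)
    if (sqrt(rs_new) < tol) break
    p <- r + (rs_new / rs) * p
    rs <- rs_new
  }
  x
}

## Stand-in sparse SPD system (diagonally dominant tridiagonal matrix)
nn <- 5000
Q <- bandSparse(nn, k = c(0, 1), diagonals = list(rep(4, nn), rep(-1, nn - 1)),
                symmetric = TRUE)
x_hat <- cg_solve(Q, rnorm(nn))
```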

3.4 Spatial prediction

Let $\mathcal{U} = \{u_1, \ldots, u_{n'}\}$ be a set of locations where we wish to predict the outcome. Let $y(\mathcal{U})$ be the $n' \times 1$ vector with $i$-th element $y(u_i)$ and let $w(\mathcal{U})$ be the vector with elements $w(u_i)$. The predictive model augments the joint distribution to

(13)

The factorization in (13) also implies that $y(\mathcal{U})$ and $y$ are conditionally independent of each other given $w(\mathcal{U})$ and $w$. Predictive inference for spatial data evaluates the posterior predictive distribution $p(w(\mathcal{U}), y(\mathcal{U}) \mid y)$, the joint posterior distribution for the outcomes and the spatial effects at the locations in $\mathcal{U}$. This distribution is easily derived from (13) as

(14)

Sampling from (14) is achieved by first sampling the model parameters and the latent effects $w$ from their joint posterior distribution. For each drawn sample, we make one draw of the vector $w(\mathcal{U})$ from its conditional distribution given $w$ and then, using this sampled $w(\mathcal{U})$, we make one draw of $y(\mathcal{U})$ from its conditional distribution. The resulting samples of $w(\mathcal{U})$ and $y(\mathcal{U})$ will be draws from the desired posterior predictive distribution $p(w(\mathcal{U}), y(\mathcal{U}) \mid y)$. This delivers inference on both the latent spatial random effects and the outcome at arbitrary locations since $\mathcal{U}$ can be any finite collection of points. Summarizing these distributions by computing their sample means, standard errors, and the 2.5-th and 97.5-th quantiles (to produce a 95% credible interval) yields point estimates with associated uncertainty quantification.

It is instructive to see how the entire inference for Gaussian outcomes can be cast into an augmented linear regression model. The predictive model for $y(\mathcal{U})$ can be written as a spatial regression

(15)

where $X(\mathcal{U})$ is the $n' \times p$ matrix of predictors observed at the locations in $\mathcal{U}$ and $\epsilon(\mathcal{U})$ is the vector with elements $\epsilon(u_i)$. The second equation in (15) expresses the relationship between the spatial effects $w(\mathcal{U})$ across the unobserved locations in $\mathcal{U}$ and the spatial effects $w$ across the observed locations, say $\mathcal{S}$. Since there is one underlying random field over the entire domain, the covariance function for the random field specifies the coefficient matrix $J$ linking $w(\mathcal{U})$ to $w$. In particular, if $w(\ell) \sim GP(0, K_{\theta})$, then $J = K_{\theta}(\mathcal{U}, \mathcal{S})\,K_{\theta}(\mathcal{S}, \mathcal{S})^{-1}$ and the conditional covariance of $w(\mathcal{U})$ given $w$ is $K_{\theta}(\mathcal{U}, \mathcal{U}) - J\,K_{\theta}(\mathcal{S}, \mathcal{U})$. The model for the data and the predictions is combined into

(16)

where the error structure collects the measurement errors and the conditional distribution of $w(\mathcal{U})$ given $w$. If the locations where predictions are sought are fixed by study design, then fitting (16) using the Bayesian conjugate framework can be beneficial. On the other hand, one can first estimate the model parameters and store samples from their posterior distribution. Then, for any arbitrary set of points $\mathcal{U}$ in the domain, for each stored sample of the parameters we draw one sample of $w(\mathcal{U})$ followed by one draw of $y(\mathcal{U})$. The resulting draws will be the desired posterior predictive samples for the latent spatial process and the unobserved outcomes. Again, the advantage of this formulation is that an efficient least squares algorithm for solving (16) that can exploit the sparsity of the design matrix will immediately deliver inference on the regression slopes ($\beta$), the spatial process ($w$) at the observed points, the interpolated process ($w(\mathcal{U})$) at the unobserved points, and the predicted response ($y(\mathcal{U})$) all at once.
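As a sketch of the composite sampling strategy described above (written for the full GP specification rather than its low-rank or NNGP variants), the code below draws posterior predictive samples of w(U) and y(U) from stored posterior draws. The stored objects (beta_draws, sigma2_draws, w_draws), the new-location inputs (coords_new, X_new) and the fixed phi and delta2 are assumed to be available from the fitting stage, for instance from the earlier conjugate-sampling sketch.

```r
## Sketch: composite posterior predictive sampling at new locations coords_new with
## design X_new, given stored posterior draws beta_draws (L x p), sigma2_draws (L),
## w_draws (L x n) and fixed phi, delta2. All object names are illustrative.
cross_dist <- function(A, B) {
  sqrt(pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B), 0))
}
R_oo <- exp(-phi * as.matrix(dist(coords)))            # observed-observed correlations
R_no <- exp(-phi * cross_dist(coords_new, coords))     # new-observed correlations
R_nn <- exp(-phi * as.matrix(dist(coords_new)))        # new-new correlations
J <- R_no %*% chol2inv(chol(R_oo))                     # K(U, S) K(S, S)^{-1}
R_cond <- R_nn - J %*% t(R_no)                         # conditional correlation of w(U) | w
n_new <- nrow(coords_new)

pred <- lapply(seq_len(nrow(beta_draws)), function(l) {
  w_new <- as.numeric(J %*% w_draws[l, ]) +
    as.numeric(t(chol(sigma2_draws[l] * R_cond + 1e-8 * diag(n_new))) %*% rnorm(n_new))
  y_new <- as.numeric(X_new %*% beta_draws[l, ]) + w_new +
    rnorm(n_new, sd = sqrt(delta2 * sigma2_draws[l])) # tau2 = delta2 * sigma2
  list(w_new = w_new, y_new = y_new)
})
```

Summarizing the elements of `pred` (means, standard errors, quantiles) then yields the interpolated latent surface and predicted outcomes with uncertainty, as described above.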

4 Illustrative examples

We present a part of some simulation experiments conducted in Zhang et al. (2019), where we generated data using the spatial regression model in (2) over a set of spatial locations within a unit square, using an exponential covariance function to specify the spatial process. While the number of spatial locations may seem modest, this setting allows comparisons with a full GP model that would be too expensive for large datasets. The model included an intercept and a single predictor generated from a standard normal distribution.

We fit a full Gaussian process based model (labeled as Full GP in Table 1) using the spBayes package in R, a latent NNGP model using the sequential MCMC algorithm described in Datta et al. (2016a) (using the spNNGP package), and the conjugate latent NNGP model described in the preceding section, with the same small number of nearest neighbors for both NNGP specifications. We will refer to the latent NNGP model fitted using MCMC (with all process parameters unknown) as simply the NNGP or latent NNGP model, while we will explicitly use “conjugate” to describe the conjugate latent NNGP model.

These models were trained using most of the simulated observations, while 200 observations were withheld to assess predictive performance. The fixed parameters $\{\phi, \delta^2\}$ for the conjugate latent NNGP model were picked through the $K$-fold cross-validation algorithm described in Section 3.2. The intercept and slope parameters in $\beta$ were assigned improper flat priors and an inverse-Gamma prior was used for $\sigma^2$. For the latent NNGP and full GP models, the spatial decay $\phi$ was modeled using a fairly wide uniform prior, and inverse-Gamma priors were used for the nugget ($\tau^2$) and the partial sill ($\sigma^2$) in order to compare the conjugate Bayesian models with the other models. The shape parameter was fixed and the scale parameter was set from the empirical estimate provided by the variogram using the geoR package (Ribeiro Jr and Diggle, 2012). The parameter estimates and performance metrics are provided in Table 1.

Parameter/metric | True | Full GP | NNGP | Conjugate NNGP
$\beta_0$ | 1 | 1.07 (0.72, 1.42) | 1.10 (0.74, 1.43) | 1.06 (0.76, 1.46)
$\beta_1$ | -5 | -4.97 (-5.02, -4.91) | -4.97 (-5.02, -4.91) | -4.97 (-5.02, -4.91)
$\sigma^2$ | 2 | 1.94 (1.63, 2.42) | 1.95 (1.63, 2.41) | 1.94 (1.77, 2.12)
$\tau^2$ | 0.2 | 0.14 (0.07, 0.23) | 0.15 (0.06, 0.24) | 0.17 (0.16, 0.19)
$\phi$ | 16 | 19.00 (13.92, 23.66) | 18.53 (14.12, 24.17) | 17.65 (fixed)
KL-D | -- | 4.45 (1.16, 9.95) | 5.13 (1.66, 11.39) | 3.58 (1.27, 8.56)
MSE(w) | -- | 297.45 (231.62, 444.79) | 303.38 (228.18, 429.54) | 313.28 (258.96, 483.75)
RMSPE | -- | 0.94 | 0.94 | 0.94
time (s) | -- | 2499 + 23147 | 109.5 | 12 + 0.6
Table 1: Simulation study summary: posterior means (2.5%, 97.5% percentiles).

Table 1 presents parameter estimates and performance metrics for the candidate models. The inference for the regression coefficients $\beta$ is almost indistinguishable across the three models. The full GP and the NNGP fully estimate $\{\sigma^2, \tau^2, \phi\}$ using MCMC and yield very similar results. The conjugate NNGP does not estimate $\phi$ and estimates $\sigma^2$ and $\tau^2$ subject to the constraint that their ratio is fixed. This results, expectedly, in slightly narrower credible intervals for $\sigma^2$ and $\tau^2$. Overall, the parameter estimates are very comparable across the models.

Turning to model comparisons, Zhang et al. (2019) computed the posterior distribution of the Kullback-Leibler divergence (KL-D) between each candidate model and the true (data-generating) full GP model for each posterior sample. The KL-D values presented in Table 1 show no significant differences among the three models in their separation from the true model. The root mean-squared prediction error (RMSPE) values (computed from the hold-out set of 200 locations) are also similar across all three models, further corroborating the comparable predictive performance of the conjugate model and the full Gaussian process.

In terms of timing (presented in seconds in Table 1), the recorded time for the conjugate model includes the time for choosing the hyper-parameters through cross-validation plus (“+”) the time for sampling from the posterior distribution. The recorded time for the full GP model consists of the time for MCMC sampling plus (“+”) the time for recovering the regression coefficients and predictions. The full latent NNGP model is about 200 times faster than the full Gaussian process based model, while the conjugate latent NNGP model uses about one tenth of the time required by the latent NNGP model to obtain similar inference on the regression coefficients and the latent process. Further simulation experiments conducted by Zhang et al. (2019) also show that interpolation of the latent process is almost indistinguishable between the conjugate and full models.

Next, we present a second simulation example using exactly the same setup as in the preceding example, but with a much larger number of spatial locations. Here, we fit a latent NNGP model using the MCMC algorithm in Datta et al. (2016a) and the conjugate latent NNGP model. We used 10,000 locations for training the models while the remaining locations were used for predictive assessment. We summarized the results from the latent NNGP model using post burn-in posterior samples, deemed adequate based upon the customary convergence diagnostics available in the coda and mcse packages within the R computing environment (Plummer et al., 2006; Flegal and Jones, 2011). The inference from the conjugate latent NNGP model was based on 300 samples, which is sufficient since the conjugate model provides independent samples from the exact posterior distribution. The full MCMC-based NNGP model took substantially longer to deliver full Bayesian inference, while the conjugate model took about 112 seconds in total (99 seconds for the cross-validation to fix $\{\phi, \delta^2\}$ and 13 seconds for sampling from the posterior distribution). We found that the RMSPE values for the full latent NNGP and the conjugate model computed using the hold-out locations were almost identical (0.67 up to 2 decimal places).

The parameter estimates from the full NNGP and conjugate NNGP models in this larger simulation experiment reveal essentially the same story as in Table 1, so we do not repeat them here. Instead, we focus on the estimation of the latent process and the predictive performance of the two models. Figure 2 shows interpolated surfaces from the simulation example: 2(a) shows an interpolated map of the “true” spatial latent process $w$, while 2(b) and 2(c) present the posterior means of $w$ over the entire domain obtained from the full latent NNGP model and the conjugate latent NNGP model, respectively.

Figure 2: Interpolated maps of (a) the true generated surface $w$, (b) the posterior means of the spatial latent process $w$ for the NNGP and (c) the posterior means of $w$ for the conjugate latent NNGP, together with the 95% credible intervals for the spatial effects $w$ from (d) the NNGP and (e) the conjugate NNGP. The NNGP models were all fit using a fixed, small number of nearest neighbors.

The recovered spatial residual surfaces are almost indistinguishable, and both are comparable to the interpolated surface of the true $w$. Figures 2(d) and 2(e) present the 95% credible intervals for the spatial effects from the latent NNGP model and the conjugate latent NNGP model, plotted against the true values of $w$ from the generating model. We found that 9,567 out of 10,000 credible intervals successfully included the true value for the conjugate model, while the corresponding number was a very comparable 9,584 for the full NNGP model.

Turning to a real example, we present a synopsis of the analysis in Zhang et al. (2019) of a spatial dataset from NASA comprising sea surface temperature (in degrees Centigrade) observations over 2,827,252 spatial locations, of which approximately 90% (2,544,527) were used for model fitting and the rest were withheld for cross-validatory predictive assessment. Details of the dataset can be found at http://modis-atmos.gsfc.nasa.gov/index.html and details on the analysis can be found in Zhang et al. (2019). The salient feature of the analysis is that the conjugate Bayesian framework for the NNGP model in (12) was able to deliver full inference, including the estimation of the spatial latent effects, in about 2,387 seconds. Sampling from the posterior distribution was achieved using direct sampling as described below (12). Since this algorithm is fast, directly samples from the posterior and requires no burn-in period for convergence, it was run over a grid of values of $\{\phi, \delta^2\}$. For each such value, a posterior predictive assessment over the cross-validatory hold-out set was carried out, and the value producing the smallest RMSPE was selected as the optimal input for which the parameter estimates were presented.

Figure 3: Posterior predictive maps of (a) sea-surface temperature (in degrees Centigrade) and (b) the latent spatial effects. The land is colored in gray; locations in the ocean without observations are indicated in yellow.

5 Spatial Meta-Kriging

A different approach toward big data problems relies upon divide-and-conquer (or map-reduce) methods, which pool posterior inference across a partition of the data into subsets. Once again consider the Bayesian linear regression model

(17)

where $y$ is $n \times 1$, $X$ is $n \times p$, $\beta$ is $p \times 1$, $V_y$ is a fixed covariance matrix, $\mu_\beta$ is a fixed vector and $V_\beta$ is a fixed matrix. The joint posterior density is available in closed form as $p(\beta, \sigma^2 \mid y) = p(\sigma^2 \mid y)\,p(\beta \mid \sigma^2, y)$, where the marginal posterior density $p(\sigma^2 \mid y)$ is inverse-Gamma and the conditional posterior density $p(\beta \mid \sigma^2, y)$ is Gaussian, with updated parameters analogous to those below (8). Therefore, exact posterior inference can be carried out by first sampling from $p(\sigma^2 \mid y)$ and then sampling from $p(\beta \mid \sigma^2, y)$ for each sampled value of $\sigma^2$. This results in samples from $p(\beta, \sigma^2 \mid y)$. Besides the fixed hyperparameters in the prior distributions, this exercise requires computing $X^{\top}V_y^{-1}X$, $X^{\top}V_y^{-1}y$ and $y^{\top}V_y^{-1}y$.

Now consider a situation where $n$ is large enough that the memory requirements for computing with (17) are unfeasible. One possible resolution is to replace the likelihood in (17) with a composite likelihood that assumes independence across blocks formed by partitioning the data into subsets. We partition the vector $y$ into $K$ subvectors with $y_k$ as the $k$-th subvector, $k = 1, \ldots, K$. The sizes of the subsets need not be the same across $k$, but will be chosen so that each of the subsets can be fitted easily with the computational resources available. Also, let $X_k$ be the matrix of predictors corresponding to $y_k$ and let $V_k$ be the marginal correlation matrix for $y_k$. The conjugate Bayesian model with a block-independent composite likelihood assumes that the $y_k$'s are independent across $k$, each with its own marginal likelihood. The Bayesian specification is completed by assigning priors to $\beta$ and $\sigma^2$ as in (17). If we distribute the analysis across $K$ different computing cores, where the $k$-th core fits the above model but only with the likelihood for $y_k$, then the quantities needed for sampling from the full posterior can be computed entirely using quantities obtained from the individual subsets of the data. For each $k$ we independently compute the subset-level quantities (such as $X_k^{\top}V_k^{-1}X_k$ and $X_k^{\top}V_k^{-1}y_k$) based upon the $k$-th subset of the data. We then combine them to obtain