Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments

by Lu Zhang, et al.
Johns Hopkins University

With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial datasets. This has generated substantial interest over the last decade, already too vast to be summarized here, in scalable methodologies for analyzing large spatial datasets. Scalable spatial process models have been found especially attractive due to their richness and flexibility and, particularly so in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article is submitted to the Practice section of the journal with the aim of developing massively scalable Bayesian approaches that can rapidly deliver inference on spatial processes that is practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (e.g., a standard desktop or laptop) using easily available statistical software packages, without requiring message-passing interfaces or parallel programming paradigms. Key insights are offered regarding the assumptions and approximations that drive practical efficiency.



I Introduction

Rapidly increasing usage and growing capabilities of Geographic Information Systems (GIS) have spawned considerable research in modeling and analyzing spatial datasets in diverse disciplines including, but not limited to, environmental sciences, economics and biometry (see, e.g., Gelfand et al., 2010; Cressie and Wikle, 2015; Banerjee et al., 2014). Much of spatial modeling is carried out within the familiar hierarchical modeling paradigm,

[data | process, parameters] × [process | parameters] × [parameters].   (1)
For point-referenced data sets, where spatial locations are indexed by coordinates on a map, the “process” is modeled as a spatial random field over the domain of interest and the observations are treated as a finite realization of this random field. The Gaussian process (GP) is perhaps the most conspicuous of process specifications and offers flexibility and richness in modeling. The popularity of GPs as modeling tools is enhanced by their extensibility to multivariate and spatio-temporal geostatistical settings, although we do not pursue such generalizations in this article. They also provide comparatively greater theoretical tractability among spatial processes (Stein, 1999).

Fitting GPs incurs onerous computational costs that severely hinder their implementation for large datasets. The key bottleneck stems from the massive spatial covariance matrix appearing in the multivariate normal density for the finite realizations of the GP. For irregularly situated spatial locations, as is common in geostatistics, these matrices are typically dense and carry no exploitable structure to facilitate computations. Storing and factorizing the dense n × n covariance matrix for n observed locations requires O(n²) memory and O(n³) floating-point operations, so even for a modestly large number of points the computational demands become prohibitive for a modern computer and preclude inference from GP models.

A substantial literature exists on methodologies for massive spatial datasets; it is already too vast to be summarized here (see, e.g., Banerjee, 2017; Heaton et al., 2017, and references therein). Some are more amenable than others to the hierarchical setup in (1). Even within the hierarchical paradigm, there is already a burgeoning literature on massively scalable spatial process models. There are two pressing issues facing the practicing spatial analyst. The first is to analyze massive amounts of spatial data on “modest” computing environments such as standard desktop or laptop architectures. By “modest computing environments”, we mean environments such as R or Stan that do not require knowledge of lower-level languages or distributed programming paradigms. The second pressing issue is that of full inference

that subsumes parameter estimation, spatial prediction of the outcome, and estimation of the underlying latent process. Yet the size of such datasets easily exceeds the CPU memory available for computing, which means that we need to rely upon statistical models that enable analysis within the available memory.

Some scalable processes, such as the multi-resolution predictive process models proposed by Katzfuss (2017) or the nearest-neighbor Gaussian process (NNGP) models of Datta et al. (2016a), can be programmed in modest computing environments to estimate parameters and predict outcomes, but not necessarily to infer on the latent process efficiently. Katzfuss (2017) does not address this, while Datta et al. (2016a) and Datta et al. (2016b) implement high-dimensional Gibbs sampling algorithms that had to be run for many iterations on a high-performance computing environment to yield adequate convergence due to high autocorrelations. Other approaches, such as Gaussian Markov random field (GMRF) approximations to spatial processes (Rue et al., 2009; Lindgren et al., 2011), use Integrated Nested Laplace Approximations (INLA) for computing the marginal distribution of the process at given locations. These approximations can be implemented in standard environments for a variety of spatial models using the R-INLA software, but may be difficult to program for new, unconventional process models subsumed in (1).

This article outlines strategies for achieving fully model-based Bayesian inference subsuming parameter estimation, response surface prediction, and interpolation of the latent spatial process for massive spatial datasets on modest computing environments. This manuscript is designed for the Practice section of this journal. The focus is not on novel statistical methodologies, nor is it on proposing new, highly complex models. Instead, we seek to offer some effective strategies for the practicing spatial analyst to run robust and quick analyses of massive spatial datasets on very standard desktop or laptop architectures.

To achieve this goal, we need a massively scalable spatial process that will be able to estimate (1) while obviating the memory obstacles. Here, there are a few choices well suited for (1), all of which seem competitive based upon the recent “contest” paper by Heaton et al. (2017), but we opt for the sparsity-inducing nearest-neighbor Gaussian process (NNGP), primarily because of its ease of use and its accessibility through the spNNGP package (see Section II). In fact, Finley et al. (2017b) outline several strategies for estimating NNGP models, including a conjugate Bayesian NNGP response model for exact inference without requiring MCMC that was fitted to a dataset with approximately 5 million locations in a matter of seconds. However, that approach avoided inferring on the latent process. This article extends such models to subsume inference on the latent process, and to do so on standard computing environments.

II The nearest-neighbor Gaussian process

The computational burden in GP models arises from the n × n covariance matrix C_θ, where n is the number of observed locations and is large. One effective approach to achieve efficient computations is to replace C_θ with an approximation C̃_θ such that the inverse of C̃_θ is sparse. There are multiple options, but notable among them are approximations based upon Gaussian Markov random fields or GMRFs (see, e.g., Rue and Held, 2005; Rue et al., 2009) that yield computationally efficient sparse representations. An alternative approach exploits an idea familiar in graphical models or Bayesian networks (see, e.g., Lauritzen, 1996; Bishop, 2006; Murphy, 2012) that has also been exploited by Vecchia (1988), Stein et al. (2004) and Stroud et al. (2017) to construct composite likelihoods for inference. Datta et al. (2016a, b) extended this idea to construct a nearest-neighbor Gaussian process (NNGP) for modeling large spatial data. The NNGP is a well-defined Gaussian process that yields finite-dimensional Gaussian densities with sparse precision matrices. It delivers massive scalability both in terms of parameter estimation and spatial prediction or “kriging”.

I Response NNGP model

Consider modeling a point-referenced outcome as a partial realization of a Gaussian process over a spatial domain. The mean and covariance functions are assumed to be determined by one or more parameters in a set θ. The finite-dimensional distribution for the vector y = (y(s_1), ..., y(s_n))' over the n observed locations is multivariate normal with mean m_θ and covariance matrix C_θ. As a directed acyclic graph (DAG) (Bishop, 2006), the joint density is p(y) = ∏_{i=1}^{n} p(y(s_i) | y(Pa[i])), where Pa[1] is the empty set and Pa[i] for i = 2, ..., n is the set of parent nodes with directed edges to s_i. Vecchia (1988) suggested approximating the multivariate normal likelihood by shrinking Pa[i] from the set of all locations preceding s_i to a much smaller subset N(s_i) of locations preceding s_i that are among the m (a fixed small number) nearest neighbors of s_i based upon their Euclidean distance. Datta et al. (2016a) extended that notion to arbitrary points in the domain by defining such neighbor sets

for any arbitrary point s in the domain, where m is the fixed number of nearest neighbors. This results in another multivariate Gaussian density

N(y | m_θ, C̃_θ),   (2)

where C̃_θ⁻¹ = (I - A)' D⁻¹ (I - A) is sparse, A is sparse and strictly lower triangular with zeroes on the diagonal and at most m non-zero entries in each row, and D is diagonal whose elements are the conditional variances d_ii = var{y(s_i) | y(N(s_i))} based upon the full GP model, i.e., d_11 = C_θ(s_1, s_1) and d_ii = C_θ(s_i, s_i) - C_θ(s_i, N(s_i)) C_θ(N(s_i), N(s_i))⁻¹ C_θ(N(s_i), s_i) for i = 2, ..., n.

Turning to the structure of A, all its elements are completely determined from C_θ. Its first row has all zeroes. For the i-th row, the nonzero entries appear in the positions indexed by N(s_i) and are obtained as the row vectors

a(s_i) = C_θ(s_i, N(s_i)) C_θ(N(s_i), N(s_i))⁻¹.

The nonzero entries in each row of A are precisely the “kriging” weights of y(s_i) based upon the values of y at the neighboring locations N(s_i) (Chilès and Delfiner, 1999). The matrix C̃_θ, constructed as above, is called an NNGP approximation to C_θ.
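To make the construction concrete, the following sketch builds the sparse matrix of kriging weights and the diagonal of conditional variances (A and D in the code) from a covariance function and neighbor sets on a small one-dimensional example. The exponential covariance, the function names, and the brute-force neighbor search are illustrative choices, not the spNNGP implementation.

```python
import numpy as np

def exp_cov(s1, s2, sigma2=1.0, phi=3.0):
    """Exponential covariance between two sets of 1-D locations."""
    d = np.abs(s1[:, None] - s2[None, :])
    return sigma2 * np.exp(-phi * d)

def nngp_factors(s, m, cov=exp_cov):
    """Return (A, D) such that C_tilde^{-1} = (I - A)' D^{-1} (I - A).

    N(s_i) is taken as the m nearest among the locations preceding s_i."""
    n = len(s)
    A = np.zeros((n, n))
    D = np.zeros(n)
    D[0] = cov(s[:1], s[:1])[0, 0]
    for i in range(1, n):
        # m nearest previous locations by Euclidean distance
        prev = np.argsort(np.abs(s[:i] - s[i]))[:m]
        C_NN = cov(s[prev], s[prev])           # neighbor covariance block
        C_iN = cov(s[i:i+1], s[prev]).ravel()  # cross-covariance row
        a = np.linalg.solve(C_NN, C_iN)        # kriging weights
        A[i, prev] = a
        D[i] = cov(s[i:i+1], s[i:i+1])[0, 0] - C_iN @ a  # conditional variance
    return A, D
```

A quick sanity check: when every location conditions on all preceding locations (m = n - 1), the factorization (I - A)' D⁻¹ (I - A) reproduces the full GP precision matrix exactly; smaller m sparsifies A at the cost of an approximation.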

With the above definition of N(s), we can express the partial realizations of an NNGP as a linear model. Let S be the set of the n observed locations as defined earlier (with n assumed to be large) and let U be a set of arbitrary locations where we wish to predict y. Then,


where D_* is diagonal and A_* is sparse, formed by extending the definitions of A and D from S to S ∪ U. Each row of A_* has exactly m nonzero entries corresponding to the column indices in the respective neighbor set. The above structure implies that y(u_1) and y(u_2) are conditionally independent for any two points u_1 and u_2 outside of S, given y_S. The parameters θ will be estimated from the data and predictions will be carried out using the conditional distribution of y_U given y_S. In a Bayesian setting, θ will be sampled from its posterior distribution p(θ | y_S),


where y_S denotes the observed outcomes over S and p(θ) is the prior distribution for θ.

Consider a specific example with the covariance function C_θ(s_i, s_j) = σ² exp(-φ ||s_i - s_j||) + τ² δ(s_i = s_j), where δ(s_i = s_j) is equal to one if s_i = s_j and zero otherwise, and the mean m_θ(s) = x(s)'β is a linear regression with spatial predictors x(s) and corresponding slope vector β. Then θ = {β, σ², τ², φ}, and one choice of priors could be a flat prior for β, inverse-gamma priors for σ² and τ², and a uniform prior for φ, using standard notations for these distributions as, e.g., in Gelman et al. (2013). The parameter space for this model is not high-dimensional, and MCMC algorithms such as Gibbs sampling in conjunction with random-walk Metropolis (RWM) or Hamiltonian Monte Carlo (HMC) can be easily implemented. Other approximate algorithms such as variational Bayes or INLA can also be used, although subsequent results in this article use HMC.

Once the parameter estimates (i.e., posterior samples) are obtained from (4), we can carry out predictive inference for y_U from the posterior predictive distribution

p(y_U | y_S) = ∫ p(y_U | y_S, θ) p(θ | y_S) dθ,   (5)

where p(y_U | y_S, θ) is a multivariate normal distribution whose dimension is the number of locations in U, with mean determined by the kriging weights applied to y_S and with a diagonal conditional covariance matrix. Since the conditional covariance matrix is diagonal, it is easy to sample from p(y_U | y_S, θ). For each θ sampled from (4), we sample a vector y_U from p(y_U | y_S, θ). The resulting y_U's are samples from (5). The NNGP exploits the conditional independence between the elements of y_U, given y_S and θ, to achieve efficient posterior predictive sampling for y_U. This assumption of conditional independence is not restrictive, as the samples from (5) are not independent. In fact, the marginal covariance matrix of y_U, given y_S only, averages over the posterior distribution of θ and is clearly not diagonal.

II Latent NNGP model

Rather than model the outcome as an NNGP, as was done in the preceding subsection, one could use the NNGP as a prior for the latent process (Datta et al., 2016a). We envision a spatial regression model at any location s,

y(s) = x(s)'β + w(s) + ε(s),   (6)

where, usually, the ε(s) are independent zero-centered measurement errors with variance τ², and w(s) is a latent spatial process capturing spatial dependence. Using definitions analogous to Section I, we assume that w(s) follows an NNGP, which means that its realizations over the locations, with neighbor sets as constructed in (3), will have a zero-centered multivariate normal law with the sparse-precision covariance matrix C̃_θ. The posterior distribution to be sampled from is now given by


It is easier to sample from (4) than from (7), since the parameter space in the latter includes the high-dimensional random vector w in addition to the process parameters. One option is to integrate out w from (7), which yields the posterior


where the remaining quantities are as defined for (4). The parameter space has collapsed from the full set including w to the process parameters alone, so (8) is called the collapsed version of (7). However, the resulting covariance matrix is still high-dimensional and neither sparse nor otherwise computationally tractable as n grows. Efficient computations would require sophisticated sparse-Cholesky algorithms in conjunction with parallel programming paradigms such as message-passing interfaces (MPI) or some variant thereof. Since our definition of “modest” does not extend to such environments, we turn to conjugate models in the next section.

III Conjugate Bayesian model

The response NNGP and latent NNGP models outlined in Sections I and II, respectively, will still require iterative simulation methods such as MCMC for full Bayesian inference. Conjugate models, i.e., models using conjugate priors, can provide exact Bayesian inference by exploiting analytic forms for the posterior distributions. While some specific assumptions are needed, these models are much faster to implement even for massive datasets. Here we develop conjugate NNGP models using the tractable Normal Inverse-Gamma (NIG) family of conjugate priors. We formulate a conjugate response model (also formulated in Finley et al. (2017b) and available in the spNNGP package) and a new conjugate latent NNGP model. These are conjugate versions of the models described in Sections I and II. We especially focus on the conjugate latent NNGP model and show how it can exploit sparsity to sample the latent spatial process over massive numbers of locations efficiently, using a conjugate gradient algorithm for solving large sparse linear systems.

I The NIG conjugate prior family

Let the spatial linear regression model be specified as

y = Xβ + w + ε,   (9)

where y, w and ε are the realizations of the corresponding processes defined in (6) over the observed locations, and X is the matrix of regressors whose i-th row is the vector of regressors x(s_i)' at location s_i. Henceforth, we suppress the dependence of these realizations and their covariance matrices on the set of observed locations when this will not lead to confusion. Assume that w and ε are zero-centered Gaussian with covariance matrices proportional to σ², where the spatial decay φ and the noise-to-spatial variance ratio δ² = τ²/σ² are known. Let γ = (β', w')' collect the unknowns to be estimated jointly with σ². The Normal Inverse-Gamma (NIG) density yields a convenient conjugate prior,


The posterior distribution of the parameters, up to proportionality, is


The joint posterior distribution is of the NIG form with updated parameters (μ*, V*, a*, b*), where


The marginal posterior distribution of σ² follows an inverse-gamma distribution and the marginal posterior distribution of γ can be identified as a multivariate t with mean μ*, variance (b*/(a* - 1)) V* and degrees of freedom 2a*. Exact Bayesian inference is carried out by sampling directly from the joint posterior density: we sample σ² from IG(a*, b*) and then, for each sampled σ², we draw γ from its conditional posterior density N(μ*, σ² V*). This yields posterior samples from (11). Furthermore, note that once the posterior samples of γ are obtained, we can obtain samples of the fitted outcomes by simply multiplying the sampled γ's with the design matrix. Thus, posterior samples are obtained without recourse to MCMC or other iterative algorithms.

II Conjugate response NNGP model

Finley et al. (2017b) formulated a conjugate NNGP model for the response model described in Section I. This is formed by integrating out w from (9) and applying an NNGP approximation to the marginal covariance matrix of y. The model can be cast as a conjugate Bayesian linear regression model,

y = Xβ + e,   e ~ N(0, σ² C̃),   (13)

where C̃ is the NNGP approximation of the (scaled) marginal covariance matrix of y, and β and σ² carry the NIG prior described in Section I. We will refer to (13) as the conjugate response NNGP model. Note that this model can estimate {β, σ²} and also impute the outcome at unknown locations, but does not permit inference on the latent process w(s) (also see Section 4 of Datta et al., 2016a). We address this shortcoming with a new conjugate latent NNGP model in the next section.

III Conjugate latent NNGP model

The conjugate model in Section I works for any covariance matrix. Here, we derive a conjugate latent NNGP model that subsumes inference on w. We rewrite the covariance matrix of the latent process in Section II as σ² M(φ) with fixed parameter φ, and write the noise variance as τ² = δ² σ² with fixed δ². An NNGP approximation M̃(φ) replaces the dense matrix M(φ); specifically, M̃(φ)⁻¹ = (I - A)' D⁻¹ (I - A), where A and D depend only on φ. We recast the model as


where the square root of the NNGP precision matrix is obtained from its Cholesky-type factorization D^{-1/2}(I - A). The joint posterior distribution of γ and σ² is now easily seen as


where μ* and V* are the NIG posterior quantities of Section I computed from the augmented system in (14). Evaluating the posterior mean of γ involves solving a linear system in p + n unknowns, which would require O((p + n)³) flops by dense methods. However, the structure of the system matrix ensures low storage complexity. Also,


Since I - A has fewer than n(m + 1) nonzero elements, with at most m + 1 nonzero elements in each row, the storage of the system matrix grows only linearly in n, and the computational cost of applying it to a vector is likewise linear in n.

This sparsity can be exploited by a conjugate gradient solver (see, e.g., Golub and Van Loan, 2012), which transforms the problem of solving a linear system into minimizing a quadratic function. Instead of direct methods such as Cholesky decompositions, conjugate gradient methods produce successive iterations that converge to the exact solution in at most as many steps as the dimension of the system. The computational cost per iteration here is linear in n due to the sparsity of the system matrix. Since the algorithm converges super-linearly, a sufficiently good approximation is often obtained in a few iterations (Banerjee and Roy, 2014); hence the performance of the conjugate gradient algorithm will be competitive when n is large. This enables posterior sampling of the latent process in high-dimensional settings. The algorithm for sampling from (15) using the conjugate gradient method is given below.
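A minimal conjugate gradient solver for a symmetric positive definite system can be sketched as follows. It accesses the coefficient matrix only through matrix-vector products, which is what makes the sparse NNGP system attractive; this is a textbook version without preconditioning, not the RcppEigen solver used later.

```python
import numpy as np

def conj_grad(A_mv, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only the
    matrix-vector product A_mv; sparse A never needs to be densified."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b - A_mv(x)        # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A_mv(p)
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # residual small enough: done
            break
        p = r + (rs_new / rs) * p  # A-conjugate update of direction
        rs = rs_new
    return x
```

In exact arithmetic the loop terminates in at most n steps; in practice a good approximation arrives much earlier for well-conditioned systems, which is the behavior the text relies on.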

  Algorithm 1: Sample from the conjugate latent NNGP model

  1. Fixing φ and δ², obtain A and D:

    • Compute a Cholesky decomposition of each neighbor covariance matrix to obtain the kriging weights in A and the conditional variances in D

    • Compute D^{-1/2} and D^{-1/2}(I - A)

  2. Obtain the posterior mean μ* for γ:

    • Construct A and D as described, for example, in Finley et al. (2017b)

    • Construct the augmented design matrix and response vector from (14)

    • Calculate the coefficient matrix and right-hand side of the associated normal equations

    • Use conjugate gradient to solve the normal equations for μ*

  3. Obtain posterior samples of σ²:

    • Calculate a* and b* as given below (15)

    • Sample σ² from IG(a*, b*)

  4. Obtain posterior samples of γ:

    • Generate a standard normal perturbation of the augmented response

    • Calculate the corresponding perturbed solution by conjugate gradient

    • Obtain the posterior sample of γ from the perturbed solution

It is readily seen that the γ produced in step 4 follows a Gaussian distribution with the correct posterior mean and covariance. Note that Algorithm 1 implements the conjugate gradient method for a (p + n)-dimensional linear system in steps 2 and 4. Since A and D depend only on φ, the linear equation in step 2 only needs to be solved once for each choice of φ and δ².

IV Choice of φ and δ²

The implementation of Algorithm 1 requires specifying φ and δ². Here we propose two practical methods for choosing them. The first is to use the posterior samples of {φ, δ²} from the response model described in Section I. This model is much faster to run for high-dimensional data and can be regarded as a fast approximation of the latent process model in (6). Approximate posterior samples can then be obtained by sampling from the posterior predictive density

p̃(γ, σ² | y) = ∫ p(γ, σ² | φ, δ², y) p(φ, δ² | y) d(φ, δ²),

where p(φ, δ² | y) represents the density in (4) and p(γ, σ² | φ, δ², y) represents the density in (7). We will present detailed empirical investigations on this approximation later.

Our second approach is to obtain point estimates of {φ, δ²} using cross-validation. The practical advantage here is that the function spConjNNGP within the spNNGP package (Finley et al., 2017a) in R can be used to carry out the cross-validation with the conjugate response NNGP model in (13). The algorithm behind spConjNNGP is exactly linear in n and highly efficient. Fixing φ and δ² typically shrinks the 95% posterior credible intervals of the parameters, which may adversely affect predictive coverage, but it allows massive scalability and recovery of the latent process.
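The cross-validatory selection of φ and δ² can be sketched as a simple grid search minimizing held-out RMSPE; the plain kriging predictor and one-dimensional locations below are illustrative stand-ins for the spConjNNGP machinery.

```python
import numpy as np

def exp_cov(s1, s2, phi):
    """Exponential correlation between 1-D location sets (unit partial sill)."""
    return np.exp(-phi * np.abs(s1[:, None] - s2[None, :]))

def rmspe(s_tr, y_tr, s_te, y_te, phi, delta2):
    """Held-out RMSPE of the kriging predictor with noise ratio delta2 = tau2/sigma2."""
    K = exp_cov(s_tr, s_tr, phi) + delta2 * np.eye(len(s_tr))
    pred = exp_cov(s_te, s_tr, phi) @ np.linalg.solve(K, y_tr)
    return np.sqrt(np.mean((y_te - pred) ** 2))

def grid_search(s_tr, y_tr, s_te, y_te, phis, delta2s):
    """Return the (phi, delta2) pair minimizing held-out RMSPE over the grid."""
    scores = {(p, d): rmspe(s_tr, y_tr, s_te, y_te, p, d)
              for p in phis for d in delta2s}
    return min(scores, key=scores.get)
```

In practice one would replace the dense kriging solve with the conjugate response NNGP fit, and average the RMSPE over cross-validation folds rather than a single hold-out split.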

Both methods will enable learning of process parameters, including full inference for the latent process, on modest computing environments. The first method often results in a better approximation of the posterior distribution but is somewhat slower than the second approach as it requires full inference from the response model. However, since the recovery of the latent spatial process through MCMC samplers for the latent process model may be prohibitively slow due to the presence of high autocorrelations in the samples, the combination of the response NNGP model and conjugate latent NNGP model provides an efficient practical alternative for inferring on the latent spatial process.

IV Simulation Study

We discuss results from the aforementioned models in Sections II and III. MCMC samplers were programmed in the Rstan environment (Stan Development Team, 2016). For the approach in Section III, the conjugate gradient solver for sparse linear systems was implemented through RcppEigen (Bates and Eddelbuettel, 2013). The nearest-neighbor sets were built using the spConjNNGP function in the spNNGP package. All simulations were conducted on a LINUX system with 16 GB RAM and four 4.2 GHz Intel Core i7 processors.

I Univariate simulation study

We generated data using the spatial regression model in (6) over a set of spatial locations within a unit square. The true values of the parameters generating the data are supplied in Table 1. The size of the data set was kept moderate to permit comparisons with the expensive full GP models. The model had an intercept and a single predictor generated from a standard normal distribution. An exponential covariance function was used to generate the data.

Candidate models for fitting the data included a response NNGP model, a latent NNGP model, and their full Gaussian process counterparts (labeled full GP in Table 1). These models were trained on a subset of the observed locations using the HMC-NUTS algorithm (Hoffman and Gelman, 2014) implemented in Stan; the remaining observations were withheld to assess predictive performance. Section IV suggests two methods for sampling the latent spatial process using a conjugate latent NNGP model, and both employ the conjugate gradient algorithm of Section III. The model for the first approach is referred to as Model1; this uses the posterior samples of {φ, δ²} from the response NNGP model, and for each such sample we draw the remaining model parameters from (15). The model where φ and δ² are fixed is referred to as Model2; their values are fixed by finding the minimum RMSPE using the conjugate response NNGP model in (13) on the training data.

The intercept and slope parameters were assigned improper flat priors. The spatial decay φ was modeled using a fairly wide uniform prior. We used inverse-gamma priors for the nugget (τ²) and the partial sill (σ²) in order to compare the conjugate Bayesian models with the other models. The shape parameters were fixed and the scale parameters were set from the empirical estimates provided by the variogram using the geoR package (Ribeiro Jr and Diggle, 2012). The parameter estimates and performance metrics are provided in Table 1.

                        Response                                    Latent
Param   True   NNGP                  full GP                NNGP                  full GP
β0      1      1.06 (0.84, 1.26)     1.06 (0.84, 1.27)      1.05 (0.84, 1.27)     1.06 (0.86, 1.27)
β1      -5     -4.96 (-5.02, -4.89)  -4.96 (-5.02, -4.90)   -4.96 (-5.03, -4.90)  -4.96 (-5.03, -4.90)
σ²      2      1.98 (1.71, 2.30)     1.99 (1.72, 2.29)      1.97 (1.69, 2.29)     1.96 (1.68, 2.31)
τ²      0.2    0.16 (0.06, 0.27)     0.15 (0.07, 0.26)      0.17 (0.08, 0.30)     0.17 (0.09, 0.29)
φ       30     33.41 (26.59, 40.36)  33.61 (27.11, 40.08)   33.18 (26.46, 40.14)  32.93 (26.40, 40.01)
KL-D           7.74 (1.58, 18.47)    7.43 (1.28, 17.35)     7.31 (1.62, 18.5)     7.10 (1.52, 16.4)
RMSPE          1.135                 1.14                   1.146                 1.144
MSE(w)         --                    --                     316.11                320.93
time(s)        630                   9199                   2522                  27467

               Conjugate latent NNGP
Param   True   Model1                Model2
β0      1      1.06 (0.84, 1.27)     1.06 (0.81, 1.29)
β1      -5     -4.96 (-5.02, -4.89)  -4.96 (-5.02, -4.89)
σ²      2      1.98 (1.71, 2.29)     2.15 (1.97, 2.34)
τ²      0.2    0.16 (0.06, 0.28)     0.13 (0.12, 0.14)
φ       30     33.18 (26.46, 40.14)  35
KL-D           7.78 (1.8, 17.85)     9.32 (3.17, 20.27)
RMSPE          1.144                 1.13
MSE(w)         307.72                291.69
time(s)        32                    10

Table 1: Simulation study summary table: posterior mean (2.5%, 97.5%) percentiles

The summaries for all the candidate models were based on 3 MCMC chains. We took the first half of the simulations as burn-in (the default in Stan). The R-hat statistics (Stan Development Team, 2017) for all parameters were close to 1. All models were assessed by the Kullback-Leibler divergence (labeled KL-D; Gneiting and Raftery, 2007) and the out-of-sample root mean squared prediction error (RMSPE; Yeniay and Goktas, 2002). The KL-D between the true distribution P and a fitted distribution Q is measured by:

KL(P || Q) = (1/2) [ tr(Σ_Q⁻¹ Σ_P) + (μ_Q - μ_P)' Σ_Q⁻¹ (μ_Q - μ_P) - d + log(det Σ_Q / det Σ_P) ],

where P and Q define Gaussian distributions on R^d with mean vectors μ_P and μ_Q, respectively, and covariance matrices Σ_P and Σ_Q, respectively. The parameter set includes all process parameters except the latent process across all models. We estimated the KL-D by the empirical estimator:

KL(P || Q) ≈ (1/L) Σ_{l=1}^{L} KL(P || Q(θ_l)),
where θ_1, ..., θ_L are samples from the posterior distribution of θ. We also present the 95% credible intervals for the KL-D in Table 1. The predicted outcome at any withheld location was estimated as the mean of the posterior predictive distribution of the outcome under the respective model. These were used to calculate the RMSPE using the hold-out values. Both the KL-D and RMSPE metrics reveal that the NNGP provides a highly competitive alternative to the full GP. The posterior means and 95% credible intervals are very similar among all models.
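The KL-D between two Gaussian distributions used above has the closed form implemented below (an illustrative helper, not tied to any package):

```python
import numpy as np

def kl_gaussian(mu_p, S_p, mu_q, S_q):
    """KL(P || Q) for multivariate normals P = N(mu_p, S_p), Q = N(mu_q, S_q)."""
    d = len(mu_p)
    Sq_inv = np.linalg.inv(S_q)
    diff = mu_q - mu_p
    # slogdet avoids overflow/underflow of determinants in higher dimensions
    _, ld_q = np.linalg.slogdet(S_q)
    _, ld_p = np.linalg.slogdet(S_p)
    return 0.5 * (np.trace(Sq_inv @ S_p) + diff @ Sq_inv @ diff - d + ld_q - ld_p)
```

The empirical KL-D estimator then averages this quantity over posterior draws of the fitted distribution's parameters.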

Table 1 also lists the parameter estimates and performance metrics for the conjugate candidate models. The posterior estimates of the process parameters from Model2 are, somewhat expectedly, sensitive to the choice of the fixed parameters, while Model1 yields results closer to the latent process models. For the purpose of assessing the performance of recovering the latent spatial process, we also report the mean squared error with respect to the true values of the spatial latent process (MSE(w)) over the observed locations. Based on MSE(w) and RMSPE, the conjugate latent NNGP models emerge as highly competitive alternatives to latent NNGP models for prediction and inference on the latent process.

II Comparison of response and latent process models

We conducted simulations on 5 datasets for the response and latent NNGP models with two choices of the number of nearest neighbors to compare their posterior distributions. We generated datasets with varying numbers of locations, with the same set of models and parameter specifications described in Section I, and experimented with the two neighbor-set sizes. The posterior distributions from models with the same number of neighbors on the same dataset tended to produce very similar inference. Here we take the posterior distribution of one of the process parameters for illustration. The top row of Figure 1 shows the q-q plots for NNGP models using the smaller number of neighbors, while the bottom row shows the plots for the larger.

Figure 1: Quantile-quantile plots of posterior samples from response and latent NNGP models for the two choices of the number of nearest neighbors (top and bottom rows, respectively).

As soon as the sample sizes become moderately large, we see that the quantile-quantile plots closely approximate the 45-degree line, indicating excellent agreement between the response and latent NNGP models for posterior inference. For smaller sample sizes, we see some discrepancies in the extremes, which is not unusual in spatial process models because of weak identifiability of the process parameters from the data. Also, only marginal improvements are seen with the larger number of neighbors.

III Recovery of latent spatial process

Figure 2 shows interpolated surfaces from the simulation example: 2(a) shows an interpolated map of the “true” spatial latent process w(s); 2(b)-(e) are maps of the posterior medians of the latent process from a latent NNGP model, a full GP model, Model1 and Model2, respectively. Figure 2(f) presents an interpolated map of the cross-validatory RMSPE over the grid of candidate values of φ and δ².

Figure 2: Interpolated maps of (a) the true generated surface, and the posterior medians of the spatial latent process for (b) the latent NNGP, (c) the full Gaussian process (full GP), (d) Model1 and (e) Model2, together with (f) the RMSPE over a grid of (φ, δ²) values. The models in (b), (d) and (e) were all fit using the same number of nearest neighbors.

We generated 3000 samples from (15) for each of Model1 and Model2 by repeatedly using steps 3 and 4 of Algorithm 1. With a single processor, the computation times for generating these samples from the two models were 32 seconds and 10 seconds, respectively. The recovered spatial residual surfaces are almost indistinguishable from each other, and are comparable to the interpolated surface of the true w(s). Figure 3 provides quantile-quantile plots of the posterior distributions of w(s) from Model1 and Model2 against the latent NNGP model for three randomly selected locations.

Figure 3: Quantile-quantile plots of the posterior samples of w(s) at 3 locations from Model1 and Model2 against those from the latent NNGP model.

The quantile-quantile plots indicate that Model1 provides a slightly better approximation of the true posterior distribution than Model2, although Model2 performs exact inference without resorting to MCMC and can rapidly deliver inference even for very large datasets.

V Data Analysis

Global warming continues to be an ongoing concern among scientists. In order to develop conceptual and predictive global models, NASA regularly monitors temperature and other atmospheric properties of the Earth with two Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra platforms. There is an extensive global satellite-based database processed and maintained by NASA; details of the data can be found on NASA's website. In particular, inferring on the processes generating sea surface temperatures (SST) is of interest to atmospheric scientists studying the exchange of heat, momentum, and water vapor between the atmosphere and ocean. Our aforementioned development will enable scientists to analyze large spatially-indexed datasets using a Bayesian geostatistical model easily implementable on modest computing platforms.

Model-based inference is obtained rapidly using the conjugate latent NNGP model and, based on our simulation studies, will be practically indistinguishable from MCMC-based output from the more general NNGP specifications. The dataset we analyze here consists of 219,496 spatially indexed observations of sea surface temperature (SST) collected between June 18 and 26, 2017, on the eastern Pacific Ocean along the west coast of North America. Ninety percent of the 219,496 observations were used for model fitting and the rest were withheld to assess predictive performance of the candidate models. To capture trends across the domain, we used sinusoidally projected coordinates (scaled to 1,000 km units) as explanatory variables. An exponential spatial covariance function was used for all models. Further model specifications included noninformative flat priors for the intercept and regression coefficients, and inverse-gamma priors for τ² and σ² with fixed shape parameters and scale parameters equaling the respective estimates from an empirical variogram. The spatial decay φ was assigned a uniform prior implying a minimum spatial range relative to the maximum inter-site distance.

Since the memory of our workstation was insufficient to fit a full GP model, we fit the (non-conjugate) response NNGP model in (4) using Stan, and the two conjugate Bayesian models, Model1 and Model2 described earlier, using the algorithm in Section III with m = 10 nearest neighbors. For the response NNGP model, we generated 3 MCMC chains from Stan with 500 iterations each and discarded the first half of each chain as burn-in (the default in Stan). The R-hat statistics reported by Stan were approximately equal to 1 for all parameters, and subsequent inference was carried out using the posterior samples pooled across the three chains.
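The neighbor sets underlying the NNGP approximation can be built in a few lines. The sketch below is illustrative Python (not the spNNGP internals, which use optimized data structures): given a fixed ordering of the locations, it finds for each location its m nearest neighbors among the locations preceding it in the ordering.

```python
import numpy as np

def nngp_neighbor_sets(coords, m):
    """For locations ordered 1..n, return for each location i the indices
    of (up to) m nearest neighbors among the *preceding* locations,
    as used to define the NNGP's sparse conditioning sets."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    neighbors = [np.array([], dtype=int)]  # the first location has no neighbors
    for i in range(1, n):
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        neighbors.append(np.argsort(d)[: min(m, i)])
    return neighbors
```

Each conditioning set has at most m elements, which is what caps the cost of evaluating the NNGP density at O(n m^3) rather than the O(n^3) of a full GP.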

Figure 4(a) depicts an interpolated map of the observed SST records over the training locations. The temperatures are color-coded from shades of blue, indicating lower temperatures, primarily seen in the higher latitudes, to shades of red, indicating higher temperatures.

Figure 4: (a) Observed SST over training locations; (b) interpolated map of the cross-validated RMSPE used to select the fixed hyperparameters for Model2; (c) median of the recovered latent spatial process over the observed locations from Model1; (d) median of the recovered latent spatial process over the observed locations from Model2; (e) interpolated map of SST over the withheld locations; (f) interpolated map of the median of the SST posterior predictive samples from Model2 over the withheld locations.

Recall that Model2 uses cross-validatory RMSPE to fix the values of the spatial decay and the noise-to-spatial variance ratio. The interpolated map of the RMSPE, shown in Figure 4(b) and computed using the spConjNNGP function in the spNNGP package for R, indicated the values of these hyperparameters fixed for Model2. Figures 4(c) and 4(d) show the posterior medians of the latent process from Model1 and Model2, respectively. Very similar spatial patterns are evinced by both models; both suggest higher SST along the coast and far from the coast.
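The logic of Model2's tuning can be sketched as a grid search scoring each candidate (decay, variance-ratio) pair by holdout RMSPE. The Python below is purely illustrative (the paper uses spConjNNGP's built-in cross-validation, not this code; the simple kriging-style predictor and names are ours):

```python
import numpy as np

def kriging_predict(train_xy, train_y, test_xy, phi, alpha):
    """Posterior-mean predictor under a zero-mean exponential-correlation GP:
    E[y*|y] = c^T (R + alpha I)^{-1} y, with R_ij = exp(-phi d_ij) and
    alpha the noise-to-spatial variance ratio."""
    train_xy = np.asarray(train_xy, float)
    test_xy = np.asarray(test_xy, float)
    d_tr = np.linalg.norm(train_xy[:, None] - train_xy[None, :], axis=-1)
    d_te = np.linalg.norm(test_xy[:, None] - train_xy[None, :], axis=-1)
    R = np.exp(-phi * d_tr) + alpha * np.eye(len(train_y))
    return np.exp(-phi * d_te) @ np.linalg.solve(R, np.asarray(train_y, float))

def rmspe(y_true, y_pred):
    """Root mean squared prediction error over holdout observations."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def grid_search(train_xy, train_y, test_xy, test_y, phis, alphas):
    """Pick the (phi, alpha) pair minimizing holdout RMSPE, mimicking
    Model2's cross-validatory tuning on a single train/holdout split."""
    scores = {(p, a): rmspe(test_y, kriging_predict(train_xy, train_y, test_xy, p, a))
              for p in phis for a in alphas}
    return min(scores, key=scores.get), scores
```

With the winning pair fixed, the conjugate model then delivers closed-form posteriors, which is precisely why Model2 avoids MCMC altogether.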

Parameter estimates along with their estimated 95% credible intervals and performance metrics for candidate models are shown in Table 2.

Parameter | Non-spatial | Response NNGP (m = 10) | Model1 | Model2
Intercept | 31.08 (31.04, 31.12) | 29.92 (29.08, 30.76) | 29.98 (29.11, 30.86) | 30.05 (29.24, 30.86)
beta_1 | -1.20 (-1.21, -1.18) | -0.95 (-1.27, -0.62) | -0.95 (-1.27, -0.64) | -0.96 (-1.26, -0.69)
beta_2 | -3.45 (-3.46, -3.44) | -3.17 (-3.35, -2.98) | -3.18 (-3.38, -2.99) | -3.20 (-3.38, -3.03)
sigma^2 | — | 1.74 (1.59, 1.89) | 1.74 (1.60, 1.89) | 1.61 (1.60, 1.62)
phi | — | 13.25 (12.16, 14.41) | 13.25 (12.16, 14.41) | 14
tau^2 | 2.42 (2.41, 2.44) | 1.97 (1.33, 2.80) | 1.98 (1.31, 2.88) | 1.61 (1.60, 1.62)
RMSPE | 0.55 | 0.26 | 0.26 | 0.26
time (h) | — | 55.67 | 0.44 | 0.12
Table 2: Real data analysis summary. Entries are posterior means with (2.5, 97.5) percentiles; dashes indicate quantities not applicable to the model; m = 10 nearest neighbors for all NNGP models.

The RMSPE for a non-spatial linear regression model, the response NNGP model in (4), and the two conjugate Bayesian NNGP models (Model1 and Model2) were 0.55, 0.26, 0.26 and 0.26, respectively. Compared to the spatial models, the non-spatial model has a substantially higher RMSPE, which suggests that the coordinates alone do not adequately capture the spatial structure of SST. The fitted SST map from Model2 (Figure 4(f)) is almost indistinguishable from the interpolated map of the observed SST over the withheld locations (Figure 4(e)). The fitted SST map from Model1 is also indistinguishable from Figure 4(f) and is not shown. These results indicate strong consistency in practical inference between Model1 and Model2. Yet the latter takes around 7 minutes on a single processor, while the former takes 26 minutes (roughly a 73% saving in CPU time).

VI Conclusions and Future Work

This article has attempted to address some practical issues encountered by scientists and statisticians in the hierarchical modeling and analysis of very large geospatial datasets. Building upon recent work on nearest-neighbor Gaussian processes for massive spatial data, we built conjugate Bayesian spatial regression models and proposed strategies for rapidly deliverable inference on modest computing environments equipped with user-friendly, readily available software packages. In particular, we have demonstrated how judicious use of a conjugate latent NNGP model can be effective for estimation and uncertainty quantification of the latent (underlying) spatial process. This provides an easily implementable practical alternative to computationally onerous Bayesian computing approaches. All computations in this paper were implemented on a standard desktop using R and Stan. The article intends to contribute toward innovations in statistical practice rather than novel methodologies.

It is important to recognize that the conjugate Bayesian models outlined here are not restricted to the NNGP. Any spatial covariance structure that leads to efficient computations can, in principle, be used. A number of recently proposed approaches can be adopted here. These include, but are not limited to, multi-resolution approaches (e.g., Nychka et al., 2002, 2015; Katzfuss, 2017), covariance tapering and its use in full-scale approximations (e.g., Furrer et al., 2006; Sang and Huang, 2012; Katzfuss, 2013), and stochastic partial differential equation approximations (Lindgren et al., 2011), among several others (see, e.g., Banerjee, 2017, and references therein).

With regard to the NNGP specifically, our choice was partially dictated by its easy implementation in R using the spNNGP package and in Stan, as described earlier. The NNGP is built upon a very effective likelihood approximation (Vecchia, 1988; Stein et al., 2004), which has also been explored recently by several authors in a variety of contexts (Stroud et al., 2017; Guinness, 2016). Guinness (2016) provides empirical evidence of Vecchia's approximation outperforming alternative methods, and also points out methods for permuting the order of the spatial locations before constructing the model. His methods for choosing the order of the locations can certainly be executed prior to implementing the models proposed in this article. Finally, an even more recent article by Katzfuss and Guinness (2017) proposes further extensions of the Vecchia approximation, but its practicability for massive datasets on modest computing environments with easily available software packages is yet to be ascertained.
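To make Vecchia's likelihood approximation concrete, here is a minimal stand-alone sketch in Python for a zero-mean GP with exponential covariance (illustrative only, not the spNNGP or Stan code used in this article). Each observation is conditioned on its m nearest previously-ordered neighbors; conditioning on all preceding observations (m = n - 1) recovers the exact Gaussian log-likelihood, while small m gives the scalable approximation.

```python
import numpy as np

def vecchia_loglik(y, coords, sigma_sq, phi, m):
    """Vecchia (1988) approximation: log p(y) ~= sum_i log p(y_i | y_{N(i)}),
    where N(i) holds the (up to) m nearest previously-ordered locations and
    the GP has zero mean with covariance C(d) = sigma_sq * exp(-phi * d)."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    ll = 0.0
    for i in range(len(y)):
        if i == 0:
            mu, var = 0.0, sigma_sq  # marginal density of the first observation
        else:
            d_prev = np.linalg.norm(coords[:i] - coords[i], axis=1)
            nbrs = np.argsort(d_prev)[: min(m, i)]
            d_nn = np.linalg.norm(coords[nbrs][:, None] - coords[nbrs][None, :], axis=-1)
            C_nn = sigma_sq * np.exp(-phi * d_nn)          # neighbor covariance
            c = sigma_sq * np.exp(-phi * d_prev[nbrs])     # cross-covariance
            w = np.linalg.solve(C_nn, c)                   # kriging weights
            mu = w @ y[nbrs]                               # conditional mean
            var = sigma_sq - w @ c                         # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll
```

Each conditional involves only an m x m solve, so the whole evaluation costs O(n m^3); this is the mechanism that lets the models above run on a standard desktop.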


The authors wish to thank Dr. Michael Betancourt, Dr. Bob Carpenter and Dr. Aki Vehtari of the Stan Development Team for useful guidance regarding the implementation of non-conjugate NNGP models in Stan for full Bayesian inference. The work of the first and third authors was supported, in part, by federal grants NSF/DMS 1513654, NSF/IIS 1562303 and NIH/NIEHS 1R01ES027027.

Supplementary Material

All computer programs implementing the examples in this article can be found in the public domain and downloaded from


  • Banerjee (2017) Banerjee, S. (2017). “High-Dimensional Bayesian Geostatistics.” Bayesian Analysis, 12: 583–614.
  • Banerjee et al. (2014) Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. Crc Press.
  • Banerjee and Roy (2014) Banerjee, S. and Roy, A. (2014). Linear algebra and matrix analysis for statistics. CRC Press.
  • Bates and Eddelbuettel (2013) Bates, D. and Eddelbuettel, D. (2013). “Fast and Elegant Numerical Linear Algebra Using the RcppEigen Package.” Journal of Statistical Software, 52(5): 1–24.
  • Bishop (2006) Bishop, C. (2006). Pattern Recognition and Machine Learning. New York, NY: Springer-Verlag.
  • Chilés and Delfiner (1999) Chilés, J. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. John Wiley: New York.
  • Cressie and Wikle (2015) Cressie, N. and Wikle, C. K. (2015). Statistics for spatio-temporal data. John Wiley & Sons.
  • Datta et al. (2016a) Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016a). “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.” Journal of the American Statistical Association, 111: 800–812.
  • Datta et al. (2016b) Datta, A., Banerjee, S., Finley, A. O., Hamm, N. A. S., and Schaap, M. (2016b). “Non-separable Dynamic Nearest-Neighbor Gaussian Process Models for Large spatio-temporal Data With an Application to Particulate Matter Analysis.” Annals of Applied Statistics, 10: 1286–1316.
  • Finley et al. (2017a) Finley, A., Datta, A., and Banerjee, S. (2017a). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1.
  • Finley et al. (2017b) Finley, A. O., Datta, A., Cook, B. C., Morton, D. C., Andersen, H. E., and Banerjee, S. (2017b). “Applying Nearest Neighbor Gaussian Processes to Massive Spatial Data Sets: Forest Canopy Height Prediction Across Tanana Valley Alaska.” arXiv preprint arXiv:1702.00434.
  • Furrer et al. (2006) Furrer, R., Genton, M. G., and Nychka, D. (2006). “Covariance Tapering for Interpolation of Large Spatial Datasets.” Journal of Computational and Graphical Statistics, 15: 503–523.
  • Gelfand et al. (2010) Gelfand, A. E., Diggle, P., Guttorp, P., and Fuentes, M. (2010). Handbook of spatial statistics. CRC press.
  • Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis, 3rd Edition. Chapman & Hall/CRC Texts in Statistical Science. Chapman & Hall/CRC.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). “Strictly proper scoring rules, prediction, and estimation.” Journal of the American Statistical Association, 102(477): 359–378.
  • Golub and Van Loan (2012) Golub, G. H. and Van Loan, C. F. (2012). Matrix Computations, 4th Edition. Johns Hopkins University Press.
  • Guinness (2016) Guinness, J. (2016). “Permutation Methods for Sharpening Gaussian Process Approximations.” arXiv preprint arXiv:1609.05372.
  • Heaton et al. (2017) Heaton, M., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, F., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D., and Zammit-Mangion, A. (2017). “Methods for Analyzing Large Spatial Data: A Review and Comparison.” arXiv:1710.05013.
  • Hoffman and Gelman (2014) Hoffman, M. D. and Gelman, A. (2014). “The No U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 15: 1593–1623.
  • Katzfuss (2013) Katzfuss, M. (2013). “Bayesian nonstationary modeling for very large spatial datasets.” Environmetrics, 24: 189—200.
  • Katzfuss (2017) — (2017). “A multi-resolution approximation for massive spatial datasets.” Journal of the American Statistical Association, 112: 201–214.
  • Katzfuss and Guinness (2017) Katzfuss, M. and Guinness, J. (2017). “A General Framework for Vecchia Approximations of Gaussian Processes.” arXiv preprint arXiv:1708.06302.
  • Lauritzen (1996) Lauritzen, S. L. (1996). Graphical Models. Oxford, United Kingdom: Clarendon Press.
  • Lindgren et al. (2011) Lindgren, F., Rue, H., and Lindstrom, J. (2011). “An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4): 423–498.
  • Murphy (2012) Murphy, K. (2012). Machine Learning: A probabilistic perspective. Cambridge, MA: The MIT Press.
  • Nychka et al. (2015) Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). “A Multiresolution Gaussian Process Model for the Analysis of Large Spatial Datasets.” Journal of Computational and Graphical Statistics, 24(2): 579–599.
  • Nychka et al. (2002) Nychka, D., Wikle, C., and Royle, J. A. (2002). “Multiresolution models for nonstationary spatial covariance functions.” Statistical Modelling, 2(4): 315–331.
  • Ribeiro Jr and Diggle (2012) Ribeiro Jr, P. J. and Diggle, P. J. (2012). “geoR: a package for geostatistical analysis.” R package version 1.7-4.
  • Rue and Held (2005) Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman & Hall/CRC.

  • Rue et al. (2009) Rue, H., Martino, S., and Chopin, N. (2009). “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2): 319–392.
  • Sang and Huang (2012) Sang, H. and Huang, J. Z. (2012). “A Full Scale Approximation of Covariance Functions for Large Spatial Data Sets.” Journal of the Royal Statistical society, Series B, 74: 111–132.
  • Stan Development Team (2016) Stan Development Team (2016). “RStan: the R interface to Stan.” R package version 2.14.1.
  • Stein (1999) Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, first edition.
  • Stein et al. (2004) Stein, M. L., Chi, Z., and Welty, L. J. (2004). “Approximating Likelihoods for Large Spatial Data Sets.” Journal of the Royal Statistical society, Series B, 66: 275–296.
  • Stroud et al. (2017) Stroud, J. R., Stein, M. L., and Lysen, S. (2017). “Bayesian and Maximum Likelihood Estimation for Gaussian Processes on an Incomplete Lattice.” Journal of Computational and Graphical Statistics, 26: 108–120.
  • Stan Development Team (2017) Stan Development Team (2017). “Stan Modeling Language: User’s Guide and Reference Manual.”
  • Vecchia (1988) Vecchia, A. V. (1988). “Estimation and Model Identification for Continuous Spatial Processes.” Journal of the Royal Statistical society, Series B, 50: 297–312.
  • Yeniay and Goktas (2002) Yeniay, O. and Goktas, A. (2002). “A comparison of partial least squares regression with other prediction methods.” Hacettepe Journal of Mathematics and Statistics, 31(99): 99–101.