I Introduction
Rapidly increasing usage and growing capabilities of Geographic Information Systems (GIS) have spawned considerable research in modeling and analyzing spatial datasets in diverse disciplines including, but not limited to, environmental sciences, economics and biometry (see, e.g., Gelfand et al., 2010; Cressie and Wikle, 2015; Banerjee et al., 2014). Much of spatial modeling is carried out within the familiar hierarchical modeling paradigm,
(1)  $[\text{data} \mid \text{process}] \times [\text{process} \mid \text{parameters}] \times [\text{parameters}]\,.$
For point-referenced data sets, where spatial locations are indexed by coordinates on a map, the “process” is modeled as a spatial random field over the domain of interest and the observations are treated as a finite realization of this random field. The Gaussian process (GP) is, perhaps, the most conspicuous of process specifications and offers flexibility and richness in modeling. The GP’s popularity as a modeling tool is enhanced by its extensibility to multivariate and spatial-temporal geostatistical settings, although we do not pursue such generalizations in this article. GPs also provide comparatively greater theoretical tractability among spatial processes (Stein, 1999).
Fitting GPs incurs onerous computational costs that severely hinder their implementation for large datasets. The key bottleneck stems from the massive spatial covariance matrix present in the multivariate normal density for the finite realizations of the GP. For irregularly situated spatial locations, as is common in geostatistics, these matrices are typically dense and carry no exploitable structure to facilitate computations. Even for a modestly large number of points (in the tens of thousands or greater), the computational demands become prohibitive for a modern computer and preclude inference from GP models.
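To make the bottleneck concrete: evaluating a dense GP likelihood stores an $n \times n$ covariance matrix ($O(n^2)$ memory) and factors it ($O(n^3)$ flops). The sketch below is a minimal Python stand-in (the analyses in this article use R and Stan); the exponential covariance and all parameter values are purely illustrative.

```python
import numpy as np

def dense_exp_cov(coords, sigma2=1.0, phi=3.0, tau2=0.1):
    """Dense exponential covariance sigma2 * exp(-phi * ||s - s'||) + tau2 * 1(s = s').
    Storage is O(n^2); factoring it below costs O(n^3)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d) + tau2 * np.eye(len(coords))

def gp_loglik(y, coords, **cov_pars):
    """Multivariate normal log-likelihood of a zero-mean finite GP realization."""
    C = dense_exp_cov(coords, **cov_pars)
    L = np.linalg.cholesky(C)              # the O(n^3) bottleneck
    alpha = np.linalg.solve(L, y)          # triangular system
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + alpha @ alpha)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(400, 2))
y = rng.standard_normal(400)
ll = gp_loglik(y, coords)
```

Doubling $n$ here quadruples the memory for `C` and roughly multiplies the Cholesky cost by eight, which is exactly why dense GP likelihoods become infeasible on a laptop.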
A substantial literature exists on methodologies for massive spatial datasets and it is already too vast to be summarized here (see, e.g., Banerjee, 2017; Heaton et al., 2017, and references therein). Some are more amenable than others to the hierarchical setup in (1). Even within the hierarchical paradigm, there is already a burgeoning literature on massively scalable spatial process models. There are two pressing issues facing the practicing spatial analyst. The first is to analyze massive amounts of spatial data on “modest” computing environments such as standard desktop or laptop architectures. By “modest computing environments”, we mean environments such as R (http://cran.r-project.org) or Stan (http://mc-stan.org) that do not require knowledge of lower-level languages or distributed programming paradigms. The second pressing issue is that of full inference that subsumes parameter estimation, spatial prediction of the outcome, and estimation of the underlying latent process. Yet the size of such datasets easily exceeds the CPU memory available for computing, which means that we need to rely upon statistical models that will enable analysis with the available memory.
Some scalable processes, such as the multi-resolution predictive process models proposed by Katzfuss (2017) or the nearest-neighbor Gaussian process (NNGP) models by Datta et al. (2016a), can be programmed in modest computing environments to estimate parameters and predict outcomes, but not necessarily to infer on the latent process efficiently. Katzfuss (2017) does not address this, while Datta et al. (2016a) and Datta et al. (2016b) implement high-dimensional Gibbs sampling algorithms that had to be run for several iterations on a high-performance computing environment to yield adequate convergence due to high autocorrelations. Other approaches, such as Gaussian Markov random field (GMRF) approximations to spatial processes (Rue et al., 2009; Lindgren et al., 2011), use Integrated Nested Laplace Approximations (INLA) for computing the marginal distribution of the process at given locations. These approximations can be implemented on standard environments for a variety of spatial models using the R-INLA software (www.r-inla.org), but may be difficult to program for new or unconventional process models subsumed in (1).
This article outlines strategies for achieving fully model-based Bayesian inference subsuming parameter estimation, response surface predictions and interpolation of the latent spatial process for massive spatial datasets on modest computing environments. This manuscript is designed for the Practice section of this journal. The focus is not on novel statistical methodologies, nor is it on proposing new highly complex models. Instead, we seek to offer some effective strategies for the practicing spatial analyst to run robust and quick analyses of massive spatial datasets on very standard desktop or laptop architectures. To achieve this goal, we need a massively scalable spatial process that will be able to estimate (1) by obviating the memory obstacles. Here, there are a few choices that are well suited for (1), all of which seem to be competitive based upon the recent “contest” paper by Heaton et al. (2017), but we opt for the sparsity-inducing nearest-neighbor Gaussian process (NNGP), primarily because of its ease of use and also because of its easier accessibility through the spNNGP package available from cran.r-project.org/web/packages/spNNGP (see Section II). In fact, Finley et al. (2017b) outline several strategies for estimating NNGP models, including a conjugate Bayesian NNGP response model for exact inference without requiring MCMC that was fitted to a dataset with approximately 5 million locations in a matter of seconds. However, that approach avoided inferring on the latent process. This article will extend such models to subsume inference on the latent process, and to do so on standard computing environments.
II The nearest-neighbor Gaussian process
The computational burden in GP models arises from the $n \times n$ covariance matrix $C_\theta = C(\mathcal{S}, \mathcal{S}; \theta)$, where $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ is the set of observed locations and $n$ is large. One effective approach to achieve efficient computations is to replace $C_\theta$ with an approximate $\tilde{C}_\theta$ such that the inverse of $\tilde{C}_\theta$ is sparse. There are multiple options, but notable among them are approximations based upon Gaussian Markov random fields or GMRFs (see, e.g., Rue and Held, 2005; Rue et al., 2009)
that yield computationally efficient sparse representations. An alternative approach exploits an idea familiar in graphical models or Bayesian networks (see, e.g., Lauritzen, 1996; Bishop, 2006; Murphy, 2012) that has also been exploited by Vecchia (1988), Stein et al. (2004) and Stroud et al. (2017) to construct composite likelihoods for inference. Datta et al. (2016a, b) extended this idea to construct a nearest-neighbor Gaussian process (NNGP) for modeling large spatial data. The NNGP is a well-defined Gaussian process that yields finite-dimensional Gaussian densities with sparse precision matrices. It delivers massive scalability both in terms of parameter estimation and spatial prediction or “kriging”.

I Response NNGP model
Consider modeling a point-referenced outcome as a partial realization of a Gaussian process, $y(s) \sim GP(m_\theta(s), C_\theta(\cdot, \cdot))$, on a spatial domain $\mathcal{D}$. The mean and covariance functions are assumed to be determined by one or more parameters in a set $\theta$. The finite-dimensional distribution for the vector $y(\mathcal{S})$ with elements $y(s_i)$ is multivariate normal with mean $m_\theta(\mathcal{S})$ and covariance matrix $C_\theta$. As a directed acyclic graph (DAG) (Bishop, 2006), the joint density is $p(y(\mathcal{S})) = \prod_{i=1}^{n} p(y(s_i) \mid y(\mathrm{Pa}[i]))$, where $\mathrm{Pa}[1]$ is the empty set and $\mathrm{Pa}[i] = \{s_1, s_2, \ldots, s_{i-1}\}$ for $i = 2, 3, \ldots, n$ is the set of parent nodes with directed edges to $s_i$. Vecchia (1988) suggested approximating the multivariate normal likelihood by shrinking $\mathrm{Pa}[i]$ from the set of all nodes preceding $s_i$ to a much smaller subset of locations preceding $s_i$ that are among the $m$ (a fixed small number) nearest neighbors of $s_i$ based upon their Euclidean distance. Datta et al. (2016a) extended that notion to arbitrary points in the domain by defining $N(s)$, the set of $m$ nearest neighbors of $s$ in $\mathcal{S}$, for any arbitrary point $s$ in the domain, where $m$ is the fixed number of nearest neighbors. This results in another multivariate Gaussian density
(2)  $\tilde{p}(y(\mathcal{S})) = N(y(\mathcal{S}) \mid m_\theta, \tilde{C}_\theta)\,, \quad \tilde{C}_\theta^{-1} = (I - A_\theta)^\top D_\theta^{-1} (I - A_\theta)\,,$

where $\tilde{C}_\theta^{-1}$ is sparse, $A_\theta$ is sparse and strictly lower triangular with $A_\theta[i, j] = 0$ for $j \geq i$ and at most $m$ nonzero entries in each row, and $D_\theta$ is diagonal whose elements are the conditional variances $\mathrm{var}\{y(s_i) \mid y(N(s_i))\}$ based upon the full GP model, i.e., $D_\theta[1, 1] = C(s_1, s_1; \theta)$ and $D_\theta[i, i] = C(s_i, s_i; \theta) - C(s_i, N(s_i); \theta)\, C(N(s_i), N(s_i); \theta)^{-1}\, C(N(s_i), s_i; \theta)$ for $i = 2, 3, \ldots, n$. Turning to the structure of $A_\theta$, all its elements are completely determined from $C_\theta$. Its first row has all zeroes. For the $i$-th row, the nonzero entries appear in the positions indexed by $N(s_i)$ and are obtained as row vectors,

$A_\theta[i, N(s_i)] = C(s_i, N(s_i); \theta)\, C(N(s_i), N(s_i); \theta)^{-1}\,.$

The nonzero entries in each row of $A_\theta$ are precisely the “kriging” weights of $y(s_i)$ based upon the values of $y(s)$ at the neighboring locations $N(s_i)$ (Chilès and Delfiner, 1999). The $\tilde{C}_\theta$, constructed as above, is called an NNGP approximation to $C_\theta$.
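The row-by-row construction of $A_\theta$ and $D_\theta$ can be sketched directly. The following Python sketch is an illustrative stand-in (the article's own software is the R package spNNGP) and assumes a unit-variance exponential covariance without a nugget; all function names are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix

def exp_cov(a, b, sigma2=1.0, phi=3.0):
    """Exponential covariance between two sets of 2-d locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def nngp_factors(coords, m=10, sigma2=1.0, phi=3.0):
    """Sparse strictly lower-triangular A and diagonal D such that
    C_tilde^{-1} = (I - A)^T D^{-1} (I - A) is the NNGP precision."""
    n = len(coords)
    rows, cols, vals = [], [], []
    D = np.empty(n)
    D[0] = sigma2                                   # first node has no parents
    for i in range(1, n):
        # m nearest previously-ordered locations (Euclidean distance)
        dists = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nbr = np.argsort(dists)[:m]
        C_NN = exp_cov(coords[nbr], coords[nbr], sigma2, phi)
        c_iN = exp_cov(coords[i:i + 1], coords[nbr], sigma2, phi).ravel()
        a = np.linalg.solve(C_NN, c_iN)             # kriging weights
        D[i] = sigma2 - c_iN @ a                    # conditional variance
        rows += [i] * len(nbr)
        cols += nbr.tolist()
        vals += a.tolist()
    A = csr_matrix((vals, (rows, cols)), shape=(n, n))
    return A, D
```

With $m \ll n$, storing $A$ and $D$ costs $O(nm)$ rather than $O(n^2)$; when $m \geq n - 1$, every conditioning set is the full history and the NNGP precision coincides with the exact GP precision.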
With the above definition of $N(s)$, we can express the partial realizations of an NNGP as a linear model. Let $\mathcal{S}$ be the set of the $n$ observed locations as defined earlier (and $n$ is assumed to be large) and let $\mathcal{U} = \{u_1, u_2, \ldots, u_{n'}\}$ be a set of arbitrary locations where we wish to predict $y$. Then,

(3)  $\begin{pmatrix} y(\mathcal{S}) \\ y(\mathcal{U}) \end{pmatrix} = \begin{pmatrix} m_\theta(\mathcal{S}) \\ m_\theta(\mathcal{U}) \end{pmatrix} + A_*\big(y(\mathcal{S}) - m_\theta(\mathcal{S})\big) + \eta\,,$

where $\eta \sim N(0, D_*)$, $D_*$ is diagonal and $A_*$ is sparse, formed by extending the definitions of $A_\theta$ and $D_\theta$ as

$A_*[i, N(u_i)] = C(u_i, N(u_i); \theta)\, C(N(u_i), N(u_i); \theta)^{-1}\,, \quad D_*[i, i] = C(u_i, u_i; \theta) - A_*[i, N(u_i)]\, C(N(u_i), u_i; \theta)$

for the rows corresponding to locations $u_i \in \mathcal{U}$, where $N(u_i)$ is the set of $m$ nearest neighbors of $u_i$ in $\mathcal{S}$. Each such row of $A_*$ has exactly $m$ nonzero entries corresponding to the column indices in $N(u_i)$. The above structure implies that $y(u_1)$ and $y(u_2)$ are conditionally independent for any two points $u_1$ and $u_2$ outside of $\mathcal{S}$, given $y(\mathcal{S})$. The parameters $\theta$ will be estimated from the data $y(\mathcal{S})$ and predictions will be carried out using the conditional distribution of $y(\mathcal{U})$ given $y(\mathcal{S})$. In a Bayesian setting, $\theta$ will be sampled from its posterior distribution,

(4)  $p(\theta \mid y(\mathcal{S})) \propto p(\theta) \times N(y(\mathcal{S}) \mid m_\theta, \tilde{C}_\theta)\,,$

where $N(y(\mathcal{S}) \mid m_\theta, \tilde{C}_\theta)$ is the NNGP density in (2) and $p(\theta)$ is the prior distribution for $\theta$.
Consider a specific example with the covariance function $C_\theta(s, s') = \sigma^2 \exp(-\phi \|s - s'\|) + \tau^2 \delta_{s = s'}$, where $\delta_{s = s'}$ is equal to one if $s = s'$ and zero otherwise, and the mean $m_\theta(s) = x(s)^\top \beta$ is a linear regression with spatial predictors $x(s)$ and corresponding slope vector $\beta$. Then $\theta = \{\beta, \sigma^2, \tau^2, \phi\}$ and one choice of priors could be an improper flat prior for $\beta$, $\sigma^2 \sim IG(a_\sigma, b_\sigma)$, $\tau^2 \sim IG(a_\tau, b_\tau)$ and $\phi \sim Unif(a_\phi, b_\phi)$, where we are using standard notations for the above distributions as, e.g., in Gelman et al. (2013). The parameter space for this model is not high-dimensional and MCMC algorithms such as Gibbs sampling in conjunction with random-walk Metropolis (RWM) or HMC can be easily implemented. Other approximate algorithms such as variational Bayes or INLA can also be used, although subsequent results in this article use HMC.
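To illustrate how little machinery such a sampler needs, here is a bare-bones random-walk Metropolis sketch in Python targeting $(\sigma^2, \tau^2, \phi)$ on the log scale for a small zero-mean dense-GP response model. This is a hypothetical stand-in (the article itself uses Stan's HMC-NUTS); the flat log-scale prior, step size, starting values and data are all illustrative choices.

```python
import numpy as np

def exp_cov(coords, sigma2, phi, tau2):
    """Dense exponential covariance with nugget (small-n illustration only)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d) + tau2 * np.eye(len(coords))

def log_post(lp, y, coords):
    """Log posterior of (log sigma2, log tau2, log phi), flat prior on the log scale."""
    sigma2, tau2, phi = np.exp(lp)
    L = np.linalg.cholesky(exp_cov(coords, sigma2, phi, tau2))
    alpha = np.linalg.solve(L, y)
    return -np.sum(np.log(np.diag(L))) - 0.5 * alpha @ alpha

def rwm(y, coords, n_iter=200, step=0.15, seed=2):
    """Random-walk Metropolis with a spherical Gaussian proposal."""
    rng = np.random.default_rng(seed)
    cur = np.zeros(3)                    # start at sigma2 = tau2 = phi = 1
    cur_lp = log_post(cur, y, coords)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = cur + step * rng.standard_normal(3)
        prop_lp = log_post(prop, y, coords)
        if np.log(rng.uniform()) < prop_lp - cur_lp:   # Metropolis acceptance
            cur, cur_lp = prop, prop_lp
            accepted += 1
        chain.append(cur.copy())
    return np.exp(np.array(chain)), accepted / n_iter

rng = np.random.default_rng(3)
coords = rng.uniform(size=(60, 2))
y = rng.standard_normal(60)
chain, acc_rate = rwm(y, coords)
```

In practice one would replace the dense covariance with the NNGP density of (2) and tune the proposal scale; the mechanics stay the same.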
Once the parameter estimates (i.e., posterior samples) are obtained from (4), we can carry out predictive inference for $y(\mathcal{U})$ from the posterior predictive distribution

(5)  $p(y(\mathcal{U}) \mid y(\mathcal{S})) = \int N(y(\mathcal{U}) \mid \mu_{\mathcal{U}}, D_{\mathcal{U}})\, p(\theta \mid y(\mathcal{S}))\, d\theta\,,$

where $N(y(\mathcal{U}) \mid \mu_{\mathcal{U}}, D_{\mathcal{U}})$ is an $n'$-dimensional multivariate normal distribution with mean $\mu_{\mathcal{U}} = m_\theta(\mathcal{U}) + A_{\mathcal{U}}\big(y(\mathcal{S}) - m_\theta(\mathcal{S})\big)$ and conditional covariance matrix $D_{\mathcal{U}}$, with $A_{\mathcal{U}}$ and $D_{\mathcal{U}}$ denoting the blocks of $A_*$ and $D_*$ in (3) corresponding to $\mathcal{U}$. Since $D_{\mathcal{U}}$ is diagonal, it is easy to sample from $N(\mu_{\mathcal{U}}, D_{\mathcal{U}})$. For each $\theta$ sampled from (4), we sample an $n'$-dimensional vector $y(\mathcal{U})$ from $N(\mu_{\mathcal{U}}, D_{\mathcal{U}})$. The resulting $y(\mathcal{U})$’s are samples from (5). The NNGP exploits the conditional independence between the elements of $y(\mathcal{U})$, given $y(\mathcal{S})$ and $\theta$, to achieve efficient posterior predictive sampling for $y(\mathcal{U})$. This assumption of conditional independence is not restrictive as the samples from (5) are not independent. In fact, the marginal covariance matrix of $y(\mathcal{U})$, given $\theta$ only, is $A_{\mathcal{U}} \tilde{C}_\theta A_{\mathcal{U}}^\top + D_{\mathcal{U}}$, which is clearly not diagonal.

II Latent NNGP model
Rather than model the outcome as an NNGP, as was done in the preceding subsection, one could use the NNGP as a prior for the latent process (Datta et al., 2016a). We envision a spatial regression model at any location $s$,

(6)  $y(s) = x(s)^\top \beta + w(s) + \epsilon(s)\,,$

where, usually, $\epsilon(s) \stackrel{iid}{\sim} N(0, \tau^2)$ and $w(s)$ is a latent spatial process capturing spatial dependence. Using definitions analogous to Section I, we assume $w(s) \sim NNGP(0, \tilde{C}_\theta)$, which means that for any $\mathcal{S}$ and $\mathcal{U}$, the vector $(w(\mathcal{S})^\top, w(\mathcal{U})^\top)^\top$, as constructed in (3), will have a zero-centered multivariate normal law with covariance matrix determined by $A_*$ and $D_*$. The posterior distribution to be sampled from is now given by
(7)  $p(\theta, \beta, w(\mathcal{S}) \mid y(\mathcal{S})) \propto p(\theta)\, p(\beta)\, N(w(\mathcal{S}) \mid 0, \tilde{C}_\theta) \times N(y(\mathcal{S}) \mid X\beta + w(\mathcal{S}), \tau^2 I)\,.$
It is easier to sample from (4) than from (7) since the parameter space in the latter includes the high-dimensional random vector $w(\mathcal{S})$ in addition to $\{\theta, \beta\}$. One option is to integrate out $w(\mathcal{S})$ from (7), which yields the posterior

(8)  $p(\theta, \beta \mid y(\mathcal{S})) \propto p(\theta)\, p(\beta)\, N(y(\mathcal{S}) \mid X\beta, \tilde{C}_\theta + \tau^2 I)\,,$

where $p(\theta)$ and $p(\beta)$ are as defined for (4). The parameter space has collapsed from $\{\theta, \beta, w(\mathcal{S})\}$ to $\{\theta, \beta\}$, so (8) is called the collapsed version of (7). However, the matrix $\tilde{C}_\theta + \tau^2 I$ is still high-dimensional and is neither sparse nor computationally tractable in the way $\tilde{C}_\theta^{-1}$ is. Efficient computations will require sophisticated sparse-Cholesky algorithms in conjunction with parallel programming paradigms such as message-passing interfaces (MPI) or some variant thereof. Since our definition of “modest” does not extend to such environments, we will turn to conjugate models in the next section.
III Conjugate Bayesian model
The response NNGP and latent NNGP models outlined in Sections I and II, respectively, will still require iterative simulation methods such as MCMC for full Bayesian inference. Conjugate models, i.e., models using conjugate priors, can provide exact Bayesian inference by exploiting analytic forms for the posterior distributions. While some specific assumptions are needed, these models are much faster to implement even for massive datasets. Here we develop conjugate NNGP models using the tractable Normal-Inverse-Gamma (NIG) family of conjugate priors. We formulate a conjugate response model (also formulated in Finley et al. (2017b) and available in the spNNGP package from cran.r-project.org/web/packages/spNNGP) and a new conjugate latent NNGP model. These are conjugate versions of the models described in Sections I and II. We especially focus on the conjugate latent NNGP model and show how it can exploit sparsity by sampling the latent spatial process over massive numbers of locations efficiently using a conjugate gradient algorithm for solving large sparse systems.

I The NIG conjugate prior family
Let the spatial linear regression model be specified as

(9)  $y = X\beta + w + \epsilon\,,$

where $y$, $w$ and $\epsilon$ are the realizations of the corresponding processes defined in (6) over the observed locations $\mathcal{S}$, and $X$ is the $n \times p$ matrix of regressors with $i$-th row being the vector of regressors $x(s_i)^\top$ at location $s_i$. Henceforth, we suppress the dependence of $y$, $w$, $\epsilon$ and their covariance matrix on $\mathcal{S}$ when this will not lead to confusion. Assume that $w \sim N(0, \sigma^2 K)$ and $\epsilon \sim N(0, \delta^2 \sigma^2 I)$, where $K$ and $\delta^2$ are known. Let $\gamma = (\beta^\top, w^\top)^\top$, $\mu_\gamma = (\mu_\beta^\top, O^\top)^\top$ and $V_\gamma = \mathrm{blockdiag}(V_\beta, K)$. The Normal-Inverse-Gamma (NIG) density yields a convenient conjugate prior,

(10)  $p(\gamma, \sigma^2) = NIG(\gamma, \sigma^2 \mid \mu_\gamma, V_\gamma, a_\sigma, b_\sigma) = N(\gamma \mid \mu_\gamma, \sigma^2 V_\gamma) \times IG(\sigma^2 \mid a_\sigma, b_\sigma)\,.$
The posterior distribution of the parameters, up to proportionality, is

(11)  $p(\gamma, \sigma^2 \mid y) \propto NIG(\gamma, \sigma^2 \mid \mu_\gamma, V_\gamma, a_\sigma, b_\sigma) \times N(y \mid [X : I]\gamma, \delta^2 \sigma^2 I)\,.$

The joint posterior distribution is of the form $NIG(\gamma, \sigma^2 \mid \mu^*, V^*, a^*, b^*)$, where

(12)  $V^{*-1} = V_\gamma^{-1} + \tfrac{1}{\delta^2}[X : I]^\top [X : I]\,, \quad \mu^* = V^*\big(V_\gamma^{-1}\mu_\gamma + \tfrac{1}{\delta^2}[X : I]^\top y\big)\,, \quad a^* = a_\sigma + \tfrac{n}{2}\,, \quad b^* = b_\sigma + \tfrac{1}{2}\big(\mu_\gamma^\top V_\gamma^{-1}\mu_\gamma + \tfrac{1}{\delta^2} y^\top y - \mu^{*\top} V^{*-1}\mu^*\big)\,.$
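This is the standard Normal-Inverse-Gamma conjugate update for a linear model $y = F\gamma + e$, $e \sim N(0, \delta^2\sigma^2 I)$, here written with a generic design matrix $F$ standing in for the combined design of the regression and latent terms. The dense-algebra Python sketch below is an illustrative stand-in (the conjugate NNGP models replace these dense solves with sparse ones); all names are assumptions.

```python
import numpy as np

def nig_posterior(F, y, mu_g, V_g, a, b, delta2=1.0):
    """Posterior NIG(mu*, V*, a*, b*) for y = F @ gamma + e, e ~ N(0, delta2*sigma2*I),
    with prior gamma | sigma2 ~ N(mu_g, sigma2 * V_g) and sigma2 ~ IG(a, b)."""
    Vg_inv = np.linalg.inv(V_g)
    V_star_inv = Vg_inv + F.T @ F / delta2
    V_star = np.linalg.inv(V_star_inv)
    mu_star = V_star @ (Vg_inv @ mu_g + F.T @ y / delta2)
    a_star = a + len(y) / 2.0
    b_star = b + 0.5 * (mu_g @ Vg_inv @ mu_g + y @ y / delta2
                        - mu_star @ V_star_inv @ mu_star)
    return mu_star, V_star, a_star, b_star

def nig_sample(mu_star, V_star, a_star, b_star, n_draws, rng):
    """Exact draws: sigma2 ~ IG(a*, b*), then gamma | sigma2 ~ N(mu*, sigma2 * V*)."""
    sigma2 = 1.0 / rng.gamma(a_star, 1.0 / b_star, size=n_draws)  # inverse-gamma
    L = np.linalg.cholesky(V_star)
    z = rng.standard_normal((n_draws, len(mu_star)))
    gamma = mu_star + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return gamma, sigma2

rng = np.random.default_rng(5)
F = rng.standard_normal((200, 3))
gamma_true = np.array([1.0, -2.0, 0.5])
y = F @ gamma_true + 0.1 * rng.standard_normal(200)
mu_s, V_s, a_s, b_s = nig_posterior(F, y, np.zeros(3), 100.0 * np.eye(3), 2.0, 1.0)
gamma, sigma2 = nig_sample(mu_s, V_s, a_s, b_s, 2000, rng)
```

No Markov chain is involved: every $(\gamma, \sigma^2)$ pair is an independent exact posterior draw.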
The marginal posterior distribution of $\sigma^2$ follows an $IG(a^*, b^*)$ and the marginal posterior distribution of $\gamma$ can be identified as a multivariate t with mean $\mu^*$ and variance $\tfrac{b^*}{a^* - 1} V^*$. Exact Bayesian inference is carried out by sampling directly from the joint posterior density: we sample $\sigma^2$ from $IG(a^*, b^*)$ and then, for each sampled $\sigma^2$, we draw $\gamma$ from its conditional posterior density $N(\mu^*, \sigma^2 V^*)$. This yields posterior samples from (11). Furthermore, note that once the posterior samples of $\gamma$ are obtained, we can obtain samples of the fitted values $X\beta + w$ by simply multiplying the sampled $\gamma$’s with $[X : I]$. Thus, posterior samples are obtained without recourse to MCMC or other iterative algorithms.

II Conjugate response NNGP model
Finley et al. (2017b) formulated a conjugate NNGP model for the response model described in Section I. This is formed by integrating out $w$ from (9) and applying an NNGP approximation to the marginal covariance matrix of $y$. The model can be cast as a conjugate Bayesian linear regression model

(13)  $y \sim N(X\beta, \sigma^2 \tilde{K}_\phi)\,, \quad p(\beta, \sigma^2) = NIG(\beta, \sigma^2 \mid \mu_\beta, V_\beta, a_\sigma, b_\sigma)\,,$

where $\tilde{K}_\phi$ is the NNGP approximation of the marginal covariance (scaled by $\sigma^2$) described in Section I, and $\mu_\beta$, $V_\beta$, $a_\sigma$ and $b_\sigma$ are as described in Section I. We will refer to (13) as the conjugate response NNGP model. Note that this model can estimate $\{\beta, \sigma^2\}$ and also impute the outcome at unknown locations, but does not permit inference on the latent process $w$ (also see Section 4 of Datta et al., 2016a). We address this shortcoming with a new conjugate latent NNGP model in the next section.

III Conjugate latent NNGP model
The conjugate models in Section I work for any covariance matrix $\sigma^2 K$. Here, we derive a conjugate latent NNGP model that will subsume inference on $w$. We rewrite the covariance matrix of $w$ in Section II as $\sigma^2 \tilde{R}_\phi$ with fixed parameter $\phi$, where $\tilde{R}_\phi$ is the NNGP approximation of the dense matrix $R_\phi$ with $C_\theta = \sigma^2 R_\phi$. Specifically, $\tilde{R}_\phi^{-1} = (I - A_\phi)^\top D_\phi^{-1} (I - A_\phi)$, where $A_\phi$ and $D_\phi$ depend only on $\phi$. We recast the model as

(14)  $\underbrace{\begin{pmatrix} \tfrac{1}{\delta} y \\ 0 \\ L_\beta^{-1}\mu_\beta \end{pmatrix}}_{y_*} = \underbrace{\begin{pmatrix} \tfrac{1}{\delta} X & \tfrac{1}{\delta} I \\ O & D_\phi^{-1/2}(I - A_\phi) \\ L_\beta^{-1} & O \end{pmatrix}}_{X_*} \underbrace{\begin{pmatrix} \beta \\ w \end{pmatrix}}_{\gamma} + \eta\,,$

where $L_\beta$ is the Cholesky decomposition of the matrix $V_\beta$ (i.e., $V_\beta = L_\beta L_\beta^\top$), and $\eta \sim N(0, \sigma^2 I)$. The joint posterior distribution of $\gamma$ and $\sigma^2$ is now easily seen as

(15)  $p(\gamma, \sigma^2 \mid y) = IG(\sigma^2 \mid a^*, b^*) \times N\big(\gamma \mid \hat{\gamma}, \sigma^2 (X_*^\top X_*)^{-1}\big)\,,$

where $\hat{\gamma} = (X_*^\top X_*)^{-1} X_*^\top y_*$, $a^* = a_\sigma + \tfrac{n}{2}$ and $b^* = b_\sigma + \tfrac{1}{2}(y_* - X_*\hat{\gamma})^\top (y_* - X_*\hat{\gamma})$. Evaluating the posterior mean of $\gamma$ involves solving $X_*^\top X_*\,\hat{\gamma} = X_*^\top y_*$, which naively requires $O((n + p)^3)$ flops. However, when $p \ll n$, the structure of $X_*$ ensures low storage complexity. Also,

(16)  $X_*^\top X_* = \begin{pmatrix} \tfrac{1}{\delta^2} X^\top X + V_\beta^{-1} & \tfrac{1}{\delta^2} X^\top \\ \tfrac{1}{\delta^2} X & \tfrac{1}{\delta^2} I + \tilde{R}_\phi^{-1} \end{pmatrix}\,.$

Since $I - A_\phi$ has fewer than $n(m + 1)$ nonzero elements and each of its rows has at most $m + 1$ nonzero entries, the storage of the matrix $X_*^\top X_*$ grows linearly in $n$, and so does the cost of multiplying it with a vector.

This sparsity in $X_*^\top X_*$ can be exploited by a conjugate gradient solver (see, e.g., Golub and Van Loan, 2012), which transforms the problem of solving a linear system into minimizing a quadratic function. Instead of direct methods such as Cholesky decompositions, conjugate gradient methods produce successive iterations that converge to the exact solution in at most $n + p$ steps. The computational cost per iteration here is linear in $n$ due to the sparsity of $X_*^\top X_*$. Since the algorithm converges superlinearly, a sufficiently good approximation is often obtained in a few iterations (Banerjee and Roy, 2014), hence the performance of the conjugate gradient algorithm will be competitive when $n$ is large. This enables posterior sampling of the latent process in high-dimensional settings. The algorithm for sampling from (15) using the conjugate gradient method is given below.
Algorithm 1: Sample from conjugate latent NNGP model

1. Fixing $\phi$ and $\delta^2$, obtain $y_*$ and $X_*$:
   (a) Compute a Cholesky decomposition of $V_\beta$ to get $L_\beta$
   (b) Compute $D_\phi^{-1/2}(I - A_\phi)$ and $L_\beta^{-1}\mu_\beta$, and assemble $y_*$ and $X_*$ as in (14)
2. Calculate $\hat{\gamma}$ by solving $X_*^\top X_*\, \hat{\gamma} = X_*^\top y_*$ using conjugate gradient
3. Obtain posterior samples of $\sigma^2$:
   (a) Calculate $a^*$ and $b^*$ as given below (15)
   (b) Sample $\sigma^2$ from $IG(a^*, b^*)$
4. Obtain posterior samples of $\gamma$:
   (a) Generate $z \sim N(0, I)$ of the same dimension as $y_*$
   (b) Calculate $v$ by solving $X_*^\top X_*\, v = X_*^\top z$ using conjugate gradient
   (c) Obtain $\gamma = \hat{\gamma} + \sigma v$
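The key identity behind steps 2 and 4 is "perturb then solve": if $v$ solves $X_*^\top X_* v = X_*^\top z$ with $z \sim N(0, I)$, then $v \sim N(0, (X_*^\top X_*)^{-1})$, so $\hat{\gamma} + \sigma v$ is an exact draw from the Gaussian in (15). A small Python sketch of these two steps using SciPy's conjugate gradient routine (dense toy design for brevity; in the real algorithm $X_*^\top X_*$ is the sparse matrix in (16), so each CG iteration is cheap; names are illustrative):

```python
import numpy as np
from scipy.sparse.linalg import cg

def draw_gamma(X_star, y_star, sigma, rng):
    """One draw from N(gamma_hat, sigma^2 (X*'X*)^{-1}) via two CG solves."""
    B = X_star.T @ X_star                         # SPD normal-equations matrix
    gamma_hat, info1 = cg(B, X_star.T @ y_star)   # step 2: posterior mean
    z = rng.standard_normal(X_star.shape[0])      # step 4(a): perturbation
    v, info2 = cg(B, X_star.T @ z)                # step 4(b): N(0, (X*'X*)^{-1}) draw
    assert info1 == 0 and info2 == 0              # CG converged
    return gamma_hat + sigma * v                  # step 4(c)

rng = np.random.default_rng(8)
X_star = rng.standard_normal((300, 5))
y_star = rng.standard_normal(300)
draws = np.array([draw_gamma(X_star, y_star, 1.0, rng) for _ in range(4000)])
```

In a real implementation, step 2 would of course be done once and reused across the draws of step 4.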
It is readily seen that the $\gamma$ in step 4 follows a Gaussian distribution with variance $\sigma^2 (X_*^\top X_*)^{-1}$. Note that Algorithm 1 implements the conjugate gradient method for an $(n + p)$-dimensional linear system in steps 2 and 4. Since $X_*$ and $y_*$ depend only on $\{\phi, \delta^2\}$, the linear equation in step 2 only needs to be solved once for each choice of $\{\phi, \delta^2\}$.

IV Choice of $\phi$ and $\delta^2$
The implementation of Algorithm 1 requires specifying $\phi$ and $\delta^2$. Here we propose two practical methods for choosing them. The first is to use the posterior samples of $\{\phi, \delta^2\}$ from the response model described in Section I. That model is much faster to run for high-dimensional data and can be regarded as a fast approximation of the latent process model in (6). Approximate posterior samples can then be obtained by sampling from the posterior predictive density

$\tilde{p}(\gamma \mid y) = \int p(\gamma \mid \phi, \delta^2, y)\, p_r(\phi, \delta^2 \mid y)\, d\phi\, d\delta^2\,,$

where $p_r(\phi, \delta^2 \mid y)$ represents the density in (4) and $p(\gamma \mid \phi, \delta^2, y)$ represents the density in (7). We will present detailed empirical investigations on this approximation later.
Our second approach is to obtain point estimates of $\{\phi, \delta^2\}$ using cross-validation. The practical advantage here is that the function spConjNNGP within the spNNGP package (Finley et al., 2017a) in R can be used to carry out the cross-validation with the conjugate response NNGP model in (13). The algorithm behind spConjNNGP is exactly linear in $n$ and highly efficient for implementation. Fixing $\phi$ and $\delta^2$ typically shrinks the 95% posterior credible intervals of the parameters, which may adversely affect predictive coverage but will allow massive scalability and recovery of the latent process.
Both methods will enable learning of process parameters, including full inference for the latent process, on modest computing environments. The first method often results in a better approximation of the posterior distribution but is somewhat slower than the second approach as it requires full inference from the response model. However, since the recovery of the latent spatial process through MCMC samplers for the latent process model may be prohibitively slow due to the presence of high autocorrelations in the samples, the combination of the response NNGP model and conjugate latent NNGP model provides an efficient practical alternative for inferring on the latent spatial process.
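The second approach amounts to a small grid search: for each candidate $(\phi, \delta^2)$, compute a cross-validated RMSPE from a cheap kriging predictor and keep the minimizer. The article does this with spConjNNGP; the Python sketch below is a simplified, hypothetical analogue (simple kriging from the $m$ nearest neighbors, with an illustrative grid and synthetic data).

```python
import numpy as np

def exp_corr(a, b, phi):
    """Exponential correlation exp(-phi * distance) between 2-d location sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-phi * d)

def krige_predict(u, coords, y, m, phi, delta2):
    """Kriging mean of y(u) from its m nearest neighbors, with noise-to-signal
    ratio delta2 added to the neighbor correlation matrix."""
    nbr = np.argsort(np.linalg.norm(coords - u, axis=1))[:m]
    R = exp_corr(coords[nbr], coords[nbr], phi) + delta2 * np.eye(len(nbr))
    r = exp_corr(u[None, :], coords[nbr], phi).ravel()
    return r @ np.linalg.solve(R, y[nbr])

def cv_rmspe(coords, y, m, phi, delta2, n_folds=5, seed=0):
    """K-fold cross-validated RMSPE for one (phi, delta2) pair."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, len(y))
    se = []
    for k in range(n_folds):
        tr, te = fold != k, fold == k
        for u, yu in zip(coords[te], y[te]):
            se.append((krige_predict(u, coords[tr], y[tr], m, phi, delta2) - yu) ** 2)
    return float(np.sqrt(np.mean(se)))

# grid search over candidate (phi, delta2) values
rng = np.random.default_rng(9)
coords = rng.uniform(size=(150, 2))
y = np.sin(4 * coords[:, 0]) + 0.1 * rng.standard_normal(150)
grid = [(phi, d2) for phi in (1.0, 5.0, 25.0) for d2 in (0.01, 0.1, 1.0)]
scores = {pars: cv_rmspe(coords, y, 10, *pars) for pars in grid}
best = min(scores, key=scores.get)
```

Because each grid point is independent, this search is embarrassingly parallel, which is part of why the cross-validation route scales so well.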
IV Simulation Study
We discuss results from the aforementioned models in Sections II and III. MCMC samplers were programmed in the RStan environment (Stan Development Team, 2016). For the approach in Section III, the conjugate gradient solver for sparse linear systems was implemented through RcppEigen (Bates and Eddelbuettel, 2013). The nearest-neighbor sets were built using the spConjNNGP function in the spNNGP package. All simulations were conducted on a Linux system with 16 GB RAM and four 4.2 GHz Intel Core i7 processors.
I Univariate simulation study
We generated data using the spatial regression model in (6) over a set of spatial locations within a unit square. The true values of the parameters generating the data are supplied in Table 1. The size of the data set was kept moderate to permit comparisons with the expensive full GP models. The model had an intercept and a single predictor generated from a standard normal distribution. An exponential covariance function was used to generate the data.

Candidate models for fitting the data included a response NNGP model, a latent NNGP model with $m$ nearest neighbors, and a full Gaussian process (labeled as full GP in Table 1). These models were trained on a subset of the observed locations using the HMC-NUTS algorithm (Hoffman and Gelman, 2014) implemented in Stan; the remaining observations were withheld to assess predictive performance. Section IV suggests two methods for sampling the latent spatial process using a conjugate latent NNGP model. Both approaches employ the conjugate gradient algorithm in Section III. The model for the first approach is referred to as Model1; this uses the posterior samples of $\{\phi, \delta^2\}$ from the response NNGP model. For each such sample, we sample the remaining model parameters from (15). The model where $\phi$ and $\delta^2$ are fixed is referred to as Model2. These values are fixed by finding the minimum RMSPE using the conjugate response NNGP model in (13) based upon the training data.

The intercept and slope parameters were assigned improper flat priors. The spatial decay $\phi$ was modeled using a fairly wide uniform prior. We used inverse-gamma priors for the nugget ($\tau^2$) and the partial sill ($\sigma^2$) in order to compare the conjugate Bayesian models with the other models. The shape parameter was fixed and the scale parameter was set from the empirical estimate provided by the variogram using the geoR package (Ribeiro Jr and Diggle, 2012). The parameter estimates and performance metrics are provided in Table 1.
Table 1: Parameter estimates (posterior means and 95% credible intervals) and performance metrics for the univariate simulation study.

Parameter | True | Response NNGP | Response full GP | Latent NNGP | Latent full GP
$\beta_0$ | 1 | 1.06 (0.84, 1.26) | 1.06 (0.84, 1.27) | 1.05 (0.84, 1.27) | 1.06 (0.86, 1.27)
$\beta_1$ | −5 | −4.96 (−5.02, −4.89) | −4.96 (−5.02, −4.90) | −4.96 (−5.03, −4.90) | −4.96 (−5.03, −4.90)
$\sigma^2$ | 2 | 1.98 (1.71, 2.30) | 1.99 (1.72, 2.29) | 1.97 (1.69, 2.29) | 1.96 (1.68, 2.31)
$\tau^2$ | 0.2 | 0.16 (0.06, 0.27) | 0.15 (0.07, 0.26) | 0.17 (0.08, 0.30) | 0.17 (0.09, 0.29)
$\phi$ | 30 | 33.41 (26.59, 40.36) | 33.61 (27.11, 40.08) | 33.18 (26.46, 40.14) | 32.93 (26.40, 40.01)
KLD | – | 7.74 (1.58, 18.47) | 7.43 (1.28, 17.35) | 7.31 (1.62, 18.5) | 7.10 (1.52, 16.4)
RMSPE | – | 1.135 | 1.14 | 1.146 | 1.144
MSE(w) | – | – | – | 316.11 | 320.93
time(s) | – | 630 | 9199 | 2522 | 27467

Parameter | True | Conjugate latent NNGP, Model1 | Conjugate latent NNGP, Model2
$\beta_0$ | 1 | 1.06 (0.84, 1.27) | 1.06 (0.81, 1.29)
$\beta_1$ | −5 | −4.96 (−5.02, −4.89) | −4.96 (−5.02, −4.89)
$\sigma^2$ | 2 | 1.98 (1.71, 2.29) | 2.15 (1.97, 2.34)
$\tau^2$ | 0.2 | 0.16 (0.06, 0.28) | 0.13 (0.12, 0.14)
$\phi$ | 30 | 33.18 (26.46, 40.14) | 35 (fixed)
KLD | – | 7.78 (1.8, 17.85) | 9.32 (3.17, 20.27)
RMSPE | – | 1.144 | 1.13
MSE(w) | – | 307.72 | 291.69
time(s) | – | 32 | 10
The summaries for all the candidate models were based on 3 MCMC chains. We took the first half of the simulations as burn-in (default in Stan). The $\hat{R}$ statistics (Stan Development Team, 2017) for all parameters were close to 1. All models were assessed by the Kullback-Leibler divergence (labeled KLD; Gneiting and Raftery, 2007) and the out-of-sample root mean squared prediction error (RMSPE) (Yeniay and Goktas, 2002). The KLD between the true distribution and a fitted distribution is measured by:

(17)  $KLD(P_0 \,\|\, P_1) = \tfrac{1}{2}\Big[\mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1}(\mu_1 - \mu_0) - n + \ln\tfrac{\det\Sigma_1}{\det\Sigma_0}\Big]\,,$

where $P_0$ and $P_1$ define Gaussian distributions on $\mathbb{R}^n$ with mean vectors $\mu_0$ and $\mu_1$, respectively, and covariance matrices $\Sigma_0$ and $\Sigma_1$, respectively. The parameter set includes all process parameters except the latent process across all models. We estimated the KLD by the empirical estimator:

(18)  $\widehat{KLD} = \tfrac{1}{N}\sum_{i=1}^{N} KLD\big(P_{\theta_0} \,\|\, P_{\theta^{(i)}}\big)\,,$

where $\theta^{(i)}$, $i = 1, 2, \ldots, N$, are samples from the posterior distribution of $\theta$ and $\theta_0$ denotes the true parameter values. We also present the 95% credible intervals for the empirical KLD in Table 1. The predicted outcome at any withheld location $u$ was estimated as

(19)  $\hat{y}(u) = \mathrm{E}[y(u) \mid y(\mathcal{S})] \approx \tfrac{1}{N}\sum_{i=1}^{N} \mathrm{E}\big[y(u) \mid \theta^{(i)}, y(\mathcal{S})\big]\,,$

where the inner expectation is computed from the likelihood for the respective model. These were used to calculate the RMSPE using the holdout values. Both KLD and RMSPE metrics reveal that the NNGP provides a highly competitive alternative to the full GP. The posterior means and 95% credible intervals are very similar among all models.
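Both metrics are simple to compute. The closed-form Gaussian KLD in (17) and the averaging in (18) can be sketched as follows (a Python stand-in with assumed names):

```python
import numpy as np

def gaussian_kld(mu0, S0, mu1, S1):
    """KLD(N(mu0, S0) || N(mu1, S1)) in closed form, as in (17)."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - k + logdet1 - logdet0)

def empirical_kld(mu0, S0, draws):
    """Average KLD over posterior draws, as in (18); `draws` yields the
    (mean, covariance) pair implied by each posterior sample."""
    return float(np.mean([gaussian_kld(mu0, S0, mu, S) for mu, S in draws]))
```

The divergence is zero only when the fitted and true Gaussians coincide, so smaller averages in Table 1 indicate fitted models closer to the truth.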
Table 1 also lists the parameter estimates and performance metrics for the conjugate candidate models. The posterior estimates of the process parameters from Model2 are, somewhat expectedly, sensitive to the choice of the fixed parameters, while Model1 yields results closer to the latent process models. For the purpose of assessing how well the latent spatial process is recovered, we also report the mean squared error with respect to the true values of the latent process over the observed locations (labeled MSE(w) in Table 1). Based on MSE(w) and RMSPE, the conjugate latent NNGP models emerge as highly competitive alternatives to latent NNGP models for prediction and inference on the latent process.
II Comparison of response and latent process models
We conducted simulations on 5 datasets to compare the posterior distributions from response and latent NNGP models with two choices of the number of nearest neighbors. We generated datasets with varying numbers of locations, with the same set of models and parameter specifications described in Section I. The posterior distributions from models with the same number of neighbors on the same dataset tended to produce very similar inference; here we take one such posterior distribution for illustration. The top row of Figure 1 shows the quantile-quantile plots for NNGP models using the smaller number of neighbors, while the bottom row shows the plots for the larger.

As soon as the sample sizes become moderately large, we see that the quantile-quantile plots closely approximate the 45-degree line, indicating excellent agreement between the response and latent NNGP models for posterior inference on the process parameters. For smaller sample sizes, we see some discrepancies in the extremes, which is not unusual in spatial process models because of weak identifiability of the process parameters from the data. Also, only marginal improvements are seen with the larger number of neighbors over the smaller.
III Recovery of latent spatial process
Figure 2 shows interpolated surfaces from the simulation example: 2(a) shows an interpolated map of the “true” spatial latent process $w$; 2(b)–(e) are maps of the posterior medians of the latent process using a latent NNGP model, a full GP model, Model1 and Model2, respectively; and 2(f) presents an interpolated map of the cross-validated RMSPE over the grid of candidate values of $\{\phi, \delta^2\}$.

We generated 3000 samples from (15) by repeatedly using steps 3 and 4 of Algorithm 1. With a single processor, the computation times for generating these samples from the two models were 32 seconds and 10 seconds, respectively. The recovered spatial residual surfaces are almost indistinguishable, and are comparable to the true interpolated surface of $w$. Figure 3 provides quantile-quantile plots for the posterior distributions of the latent process from Model1 and Model2 against the latent NNGP model for three randomly selected locations. The quantile-quantile plots indicate that Model1 provides a slightly better approximation of the true posterior distribution than Model2, although Model2 performs exact inference without resorting to MCMC and can rapidly deliver inference even for very large datasets.
V Data Analysis
Global warming continues to be an ongoing concern among scientists. In order to develop conceptual and predictive global models, NASA regularly monitors temperature and other atmospheric properties of the Earth with two Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra platforms. There is an extensive global satellite-based database processed and maintained by NASA; details of the data can be found at http://modis-atmos.gsfc.nasa.gov/index.html. In particular, inferring on the processes generating sea surface temperatures (SST) is of interest to atmospheric scientists studying the exchange of heat, momentum, and water vapor between the atmosphere and ocean. Our aforementioned development will enable scientists to analyze large spatially-indexed datasets using a Bayesian geostatistical model easily implementable on modest computing platforms.
Model-based inference is obtained rapidly using the conjugate latent NNGP model and, based on the simulation studies, will be practically indistinguishable from MCMC-based output from the more general NNGP specification. The dataset we analyze here consists of 219,496 spatially indexed observations of sea surface temperature (SST) collected between June 18-26, 2017, on the eastern Pacific Ocean along the west coast of North America. Among the 219,496 observations, 90% were used for model fitting and the rest were withheld to assess the predictive performance of the candidate models. To understand trends across the coordinates, we used sinusoidally projected coordinates (scaled to 1,000 km units) as explanatory variables. An exponential spatial covariance function was used for all models. Further model specifications included noninformative flat priors for the intercept and regression coefficients, inverse-gamma priors for $\sigma^2$ and $\tau^2$ with fixed shape parameter and scale parameter equaling the respective estimates from an empirical variogram, and a uniform prior on the spatial decay $\phi$ corresponding to a minimum spatial range of a small fraction of the maximum inter-site distance.
Since the memory of our workstation was insufficient to fit a full GP model, we fit the (non-conjugate) response NNGP model in (4) using Stan, and the two conjugate Bayesian models, Model1 and Model2 described earlier, using the algorithm in Section III with $m = 10$ nearest neighbors. For the response NNGP model, we generated 3 MCMC chains from Stan with 500 iterations each and took the first half as burn-in (default in Stan). The $\hat{R}$ statistics provided by Stan for all parameters were approximately equal to 1, and subsequent inference was carried out using the posterior samples pooled across the three chains.
Figure 4(a) depicts an interpolated map of the observed SST records over the training locations. The temperatures are color-coded from shades of blue, indicating lower temperatures primarily seen in the higher latitudes, to shades of red, indicating high temperatures.

Recall that Model2 uses cross-validatory RMSPE to fix the values of $\{\phi, \delta^2\}$. The interpolated RMSPE surface obtained using the spConjNNGP function in the spNNGP package in R, shown in Figure 4(b), indicated the values of $\phi$ and $\delta^2$ reported for Model2 in Table 2. Figures 4(c) and 4(d) show the posterior means for the latent process from Model1 and Model2, respectively. Very similar spatial patterns are evinced from both these models; both suggest high SST along the coast and far from the coast.
Parameter estimates along with their estimated 95% credible intervals and performance metrics for candidate models are shown in Table 2.
Table 2: Parameter estimates (posterior means and 95% credible intervals) and performance metrics for the SST analysis; the NNGP-based models use m = 10.

Parameter | Nonspatial | Response NNGP
$\beta_0$ | 31.08 (31.04, 31.12) | 29.92 (29.08, 30.76)
$\beta_1$ | −1.20 (−1.21, −1.18) | −0.95 (−1.27, −0.62)
$\beta_2$ | −3.45 (−3.46, −3.44) | −3.17 (−3.35, −2.98)
$\sigma^2$ | – | 1.74 (1.59, 1.89)
$\phi$ | – | 13.25 (12.16, 14.41)
$\tau^2$ | 2.42 (2.41, 2.44) | 1.97 (1.33, 2.80)
RMSPE | 0.55 | 0.26
time(h) | – | 55.67

Parameter | Model1 | Model2
$\beta_0$ | 29.98 (29.11, 30.86) | 30.05 (29.24, 30.86)
$\beta_1$ | −0.95 (−1.27, −0.64) | −0.96 (−1.26, −0.69)
$\beta_2$ | −3.18 (−3.38, −2.99) | −3.20 (−3.38, −3.03)
$\sigma^2$ | 1.74 (1.60, 1.89) | 1.61 (1.60, 1.62)
$\phi$ | 13.25 (12.16, 14.41) | 14 (fixed)
$\tau^2$ | 1.98 (1.31, 2.88) | 1.61 (1.60, 1.62)
RMSPE | 0.26 | 0.26
time(h) | 0.44 | 0.12
The RMSPE for a nonspatial linear regression model, the response NNGP model in (4) and the two conjugate Bayesian NNGP models (Model1 and Model2) were 0.55, 0.26, 0.26 and 0.26, respectively. Compared to the spatial models, the nonspatial model has a substantially higher RMSPE, which suggests that the coordinates alone do not adequately capture the spatial structure of SST. The fitted SST map using Model2 (Figure 4(f)) is almost indistinguishable from the real SST map (Figure 4(e)). The fitted SST map using Model1 is also indistinguishable from Figure 4(f) and is not shown. These results indicate strong consistency in practical inference between Model1 and Model2. Yet the latter takes around 7 minutes on a single processor, while the former takes about 26 minutes (roughly a 73% savings in CPU time).
VI Conclusions and Future Work
This article has attempted to address some practical issues encountered by scientists and statisticians in hierarchical modeling and analysis for very large geospatial datasets. Building upon recent work on nearest-neighbor Gaussian processes for massive spatial data, we have built conjugate Bayesian spatial regression models and proposed strategies for rapidly deliverable inference on modest computing environments equipped with user-friendly, readily available software packages. In particular, we have demonstrated how judicious use of a conjugate latent NNGP model can be effective for estimation and uncertainty quantification of the latent (underlying) spatial process. This provides an easily implementable practical alternative to computationally onerous Bayesian computing approaches. All computations in the paper were implemented on a standard desktop using R and Stan. The article intends to contribute toward innovations in statistical practice rather than novel methodologies.
It is important to recognize that the conjugate Bayesian models outlined here are not restricted to the NNGP. Any spatial covariance structure that leads to efficient computations can, in principle, be used. A number of recently proposed approaches can be adopted here. These include, but are not limited to, multiresolution approaches (e.g., Nychka et al., 2002, 2015; Katzfuss, 2017), covariance tapering and its use in full-scale approximations (e.g., Furrer et al., 2006; Sang and Huang, 2012; Katzfuss, 2013), and stochastic partial differential equation approximations (Lindgren et al., 2011), among several others (see, e.g., Banerjee, 2017, and references therein). With regard to the NNGP specifically, our choice was partially dictated by its easy implementation in R using the spNNGP package and in Stan as described in http://mc-stan.org/users/documentation/case-studies/nngp.html. The NNGP is built upon a very effective likelihood approximation (Vecchia, 1988; Stein et al., 2004), which has also been explored recently by several authors in a variety of contexts (Stroud et al., 2017; Guinness, 2016). Guinness (2016) provides empirical evidence of Vecchia's approximation outperforming alternative methods, but also points out some optimal methods for permuting the order of the spatial locations before constructing the model. His methods for choosing the order of the locations can certainly be executed prior to implementing the models proposed in this article. Finally, an even more recent article by Katzfuss and Guinness (2017) proposes further extensions of the Vecchia approximation, but its practicability for massive datasets on modest computing environments with easily available software packages is yet to be ascertained.
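To make the Vecchia approximation underlying the NNGP concrete, the following sketch (in Python rather than the paper's R/Stan, with illustrative helper names) computes the approximate log-likelihood for a zero-mean GP by conditioning each observation on at most m nearest neighbors among the previously ordered points. The ordering of the locations is taken as given, which is exactly where Guinness's (2016) permutation strategies would enter:

```python
import numpy as np

def vecchia_loglik(y, coords, cov_fn, m):
    """Vecchia's approximate log-likelihood for a zero-mean GP:
    y_i is conditioned only on (at most) its m nearest neighbors
    among previously ordered points, giving O(n m^3) cost instead
    of the O(n^3) cost of the full multivariate normal density."""
    n = len(y)
    ll = 0.0
    for i in range(n):
        # neighbor set: nearest (at most m) points among y_1, ..., y_{i-1}
        if i == 0:
            nbr = np.empty(0, dtype=int)
        else:
            d = np.linalg.norm(coords[:i] - coords[i], axis=1)
            nbr = np.argsort(d)[:m]
        c_ii = cov_fn(coords[i:i + 1], coords[i:i + 1])[0, 0]
        if nbr.size:
            C_nn = cov_fn(coords[nbr], coords[nbr])
            c_in = cov_fn(coords[i:i + 1], coords[nbr])[0]
            w = np.linalg.solve(C_nn, c_in)
            mu = w @ y[nbr]            # conditional mean
            var = c_ii - w @ c_in      # conditional variance
        else:
            mu, var = 0.0, c_ii
        ll += -0.5 * (np.log(2.0 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll
```

With m = n − 1 the product of conditionals reproduces the exact joint Gaussian density, which provides a convenient correctness check; in practice m is small (the analysis above uses m = 10).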
Acknowledgements
The authors wish to thank Dr. Michael Betancourt, Dr. Bob Carpenter and Dr. Aki Vehtari of the Stan Development Team for useful guidance regarding the implementation of non-conjugate NNGP models in Stan for full Bayesian inference. The work of the first and third authors was supported, in part, by federal grants NSF/DMS 1513654, NSF/IIS 1562303 and NIH/NIEHS 1R01ES027027.
Supplementary Material
All computer programs implementing the examples in this article can be found in the public domain and downloaded from https://github.com/LuZhangstat/ConjugateNNGP.
References
 Banerjee (2017) Banerjee, S. (2017). “High-Dimensional Bayesian Geostatistics.” Bayesian Analysis, 12: 583–614.
 Banerjee et al. (2014) Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. CRC Press.
 Banerjee and Roy (2014) Banerjee, S. and Roy, A. (2014). Linear algebra and matrix analysis for statistics. CRC Press.

 Bates and Eddelbuettel (2013) Bates, D. and Eddelbuettel, D. (2013). “Fast and Elegant Numerical Linear Algebra Using the RcppEigen Package.” Journal of Statistical Software, 52(5): 1–24. URL http://www.jstatsoft.org/v52/i05/
 Bishop (2006) Bishop, C. (2006). Pattern Recognition and Machine Learning. New York, NY: Springer-Verlag.
 Chilés and Delfiner (1999) Chilés, J. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. John Wiley: New York.
 Cressie and Wikle (2015) Cressie, N. and Wikle, C. K. (2015). Statistics for spatiotemporal data. John Wiley & Sons.

Datta et al. (2016a)
Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E.
(2016a).
“Hierarchical NearestNeighbor Gaussian Process Models for
Large Geostatistical Datasets.”
Journal of the American Statistical Association, 111:
800–812.
URL http://dx.doi.org/10.1080/01621459.2015.1044091 
 Datta et al. (2016b) Datta, A., Banerjee, S., Finley, A. O., Hamm, N. A. S., and Schaap, M. (2016b). “Nonseparable Dynamic Nearest-Neighbor Gaussian Process Models for Large Spatiotemporal Data With an Application to Particulate Matter Analysis.” Annals of Applied Statistics, 10: 1286–1316. URL http://dx.doi.org/10.1214/16-AOAS931
 Finley et al. (2017a) Finley, A., Datta, A., and Banerjee, S. (2017a). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1. URL https://CRAN.R-project.org/package=spNNGP
 Finley et al. (2017b) Finley, A. O., Datta, A., Cook, B. C., Morton, D. C., Andersen, H. E., and Banerjee, S. (2017b). “Applying Nearest Neighbor Gaussian Processes to Massive Spatial Data Sets: Forest Canopy Height Prediction Across Tanana Valley Alaska.” arXiv preprint arXiv:1702.00434.
 Furrer et al. (2006) Furrer, R., Genton, M. G., and Nychka, D. (2006). “Covariance Tapering for Interpolation of Large Spatial Datasets.” Journal of Computational and Graphical Statistics, 15: 503–523.
 Gelfand et al. (2010) Gelfand, A. E., Diggle, P., Guttorp, P., and Fuentes, M. (2010). Handbook of spatial statistics. CRC press.
 Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis, 3rd Edition. Chapman & Hall/CRC Texts in Statistical Science. Chapman & Hall/CRC.
 Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). “Strictly proper scoring rules, prediction, and estimation.” Journal of the American Statistical Association, 102(477): 359–378.
 Golub and Van Loan (2012) Golub, G. H. and Van Loan, C. F. (2012). Matrix Computations, 4th Edition. Johns Hopkins University Press.
 Guinness (2016) Guinness, J. (2016). “Permutation Methods for Sharpening Gaussian Process Approximations.” arXiv preprint arXiv:1609.05372.

 Heaton et al. (2017) Heaton, M., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, F., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D., and Zammit-Mangion, A. (2017). “Methods for Analyzing Large Spatial Data: A Review and Comparison.” arXiv:1710.05013. URL https://arxiv.org/abs/1710.05013
 Hoffman and Gelman (2014) Hoffman, M. D. and Gelman, A. (2014). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 15: 1593–1623.
 Katzfuss (2013) Katzfuss, M. (2013). “Bayesian nonstationary modeling for very large spatial datasets.” Environmetrics, 24: 189–200.

 Katzfuss (2017) Katzfuss, M. (2017). “A multiresolution approximation for massive spatial datasets.” Journal of the American Statistical Association, 112: 201–214. URL http://dx.doi.org/10.1080/01621459.2015.1123632
 Katzfuss and Guinness (2017) Katzfuss, M. and Guinness, J. (2017). “A General Framework for Vecchia Approximations of Gaussian Processes.” arXiv preprint arXiv:1708.06302.
 Lauritzen (1996) Lauritzen, S. L. (1996). Graphical Models. Oxford, United Kingdom: Clarendon Press.

 Lindgren et al. (2011) Lindgren, F., Rue, H., and Lindstrom, J. (2011). “An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4): 423–498. URL http://dx.doi.org/10.1111/j.1467-9868.2011.00777.x
 Murphy (2012) Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, MA: The MIT Press.

 Nychka et al. (2015) Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). “A Multiresolution Gaussian Process Model for the Analysis of Large Spatial Datasets.” Journal of Computational and Graphical Statistics, 24(2): 579–599. URL http://dx.doi.org/10.1080/10618600.2014.914946
 Nychka et al. (2002) Nychka, D., Wikle, C., and Royle, J. A. (2002). “Multiresolution models for nonstationary spatial covariance functions.” Statistical Modelling, 2(4): 315–331.

 Ribeiro Jr and Diggle (2012) Ribeiro Jr, P. J. and Diggle, P. J. (2012). “geoR: a package for geostatistical analysis.” R package version 1.7-4. URL https://cran.r-project.org/web/packages/geoR
 Rue and Held (2005) Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman & Hall/CRC. URL http://opac.inria.fr/record=b1119989
 Rue et al. (2009) Rue, H., Martino, S., and Chopin, N. (2009). “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2): 319–392. URL http://dx.doi.org/10.1111/j.1467-9868.2008.00700.x
 Sang and Huang (2012) Sang, H. and Huang, J. Z. (2012). “A Full Scale Approximation of Covariance Functions for Large Spatial Data Sets.” Journal of the Royal Statistical Society, Series B, 74: 111–132.

 Stan Development Team (2016) Stan Development Team (2016). “RStan: the R interface to Stan.” R package version 2.14.1. URL http://mc-stan.org/
 Stein (1999) Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, first edition.
 Stein et al. (2004) Stein, M. L., Chi, Z., and Welty, L. J. (2004). “Approximating Likelihoods for Large Spatial Data Sets.” Journal of the Royal Statistical Society, Series B, 66: 275–296.

 Stroud et al. (2017) Stroud, J. R., Stein, M. L., and Lysen, S. (2017). “Bayesian and Maximum Likelihood Estimation for Gaussian Processes on an Incomplete Lattice.” Journal of Computational and Graphical Statistics, 26: 108–120. URL http://dx.doi.org/10.1080/10618600.2016.1152970
 Stan Development Team (2017) Stan Development Team (2017). “Stan Modeling Language: User’s Guide and Reference Manual.”
 Vecchia (1988) Vecchia, A. V. (1988). “Estimation and Model Identification for Continuous Spatial Processes.” Journal of the Royal Statistical Society, Series B, 50: 297–312.
 Yeniay and Goktas (2002) Yeniay, O. and Goktas, A. (2002). “A comparison of partial least squares regression with other prediction methods.” Hacettepe Journal of Mathematics and Statistics, 31(99): 99–101.