I Introduction
Analysis of environmental data sets often requires joint modeling of multiple spatially dependent variables, accounting for dependence among the variables as well as the spatial association for each variable. For point-referenced variables, multivariate Gaussian processes (GPs) serve as versatile tools for joint modeling of spatial variables (see, e.g., Schabenberger and Gotway, 2004; Cressie and Wikle, 2015; Banerjee et al., 2014, and references therein). However, for a dataset with $n$ observed locations, fitting a GP-based spatial model typically requires floating point operations (flops) and memory of the order $O(n^3)$ and $O(n^2)$, respectively. This is challenging when $n$ is large. This “Big Data” problem has received much attention in the literature and a comprehensive review is beyond the scope of this article; see, e.g., Banerjee (2017); Heaton et al. (2019); Sun et al. (2011) for a review and comparison of scalable modeling methods. Much of the aforementioned literature on scalable models has focused on univariate spatial processes, i.e., assuming only one response for each location.
Multivariate processes (see, e.g., Genton and Kleiber, 2015; Salvaña and Genton, 2020; Le and Zidek, 2006, and references therein) have received relatively limited development in the context of massive data. Bayesian models are attractive for inference on multivariate spatial processes because they can accommodate uncertainties in the process parameters more flexibly through their hierarchical structure. Multivariate spatial interpolation using conjugate Bayesian modeling can be found in Brown et al. (1994); Le et al. (1997); Sun et al. (1998); Le et al. (2001); Gamerman and Moreira (2004), but these methods do not address the challenges encountered in massive data sets. More flexible methods for joint modeling, including spatial factor models, have been investigated in Bayesian contexts (see, e.g., Schmidt and Gelfand, 2003; Ren and Banerjee, 2013; Taylor-Rodriguez et al., 2019), but these methods have focused upon delivering full Bayesian inference through iterative algorithms such as Markov chain Monte Carlo (MCMC).
Our current contribution is an augmented Bayesian multivariate linear model framework that accommodates conjugate distribution theory, similar to Gamerman and Moreira (2004), but that can scale up to massive data sets with locations numbering in the millions. More specifically, we embed the Nearest-Neighbor Gaussian process (NNGP) (Datta et al., 2016a) within our conjugate Bayesian framework. We will consider two classes of models. The first is obtained by modeling the spatially dependent variables jointly as a multivariate spatial process, while the second models a latent multivariate spatial process in a hierarchical setup. We refer to the former as the “response” model and the latter as the “latent” model, and we explore some properties of these models.
The remainder of our paper is arranged as follows. Section II develops a conjugate Bayesian multivariate spatial regression framework using Matrix-Normal and Inverse-Wishart prior distributions. We first develop two classes of models, response models and latent models, using Gaussian spatial processes, and subsequently develop scalable versions of these models using the Nearest-Neighbor Gaussian process (NNGP); we develop NNGP response models and NNGP latent models in this conjugate Bayesian framework. A cross-validation algorithm to fix certain hyperparameters in these models is presented next, along with some theoretical attributes of these models. Section III presents some simulation experiments, while Section IV analyzes a massive Normalized Difference Vegetation Index dataset with a few million locations. Finally, Section V concludes the manuscript with some discussion.

II Bayesian Multivariate Geostatistical Modeling
I Conjugate Multivariate Spatial Models
Conjugate Multivariate Response Model
Let $y(s)$ be a $q\times 1$ vector of outcomes at location $s$ and $x(s)$ be a corresponding $p\times 1$ vector of explanatory variables. Conditional on these explanatory variables, the response is assumed to follow a multivariate Gaussian process,

(1)  $y(s) \sim \mathrm{GP}\left(B^{\top}x(s),\; C_{\theta}(\cdot,\cdot)\right),$

where the mean of $y(s)$ is $B^{\top}x(s)$, $B$ is a $p\times q$ matrix of regression coefficients and $C_{\theta}(s,s')$ is a $q\times q$ cross-covariance matrix (Genton and Kleiber, 2015) whose $(i,j)$th element is the covariance between $y_i(s)$ and $y_j(s')$. The cross-covariance matrix is defined for each pair of locations and is further specified as a multiple of a non-spatial positive definite matrix $\Sigma$, i.e., $C_{\theta}(s,s') = q_{\theta}(s,s')\,\Sigma$. The multiplication factor $q_{\theta}(s,s')$ is a function of the two locations and is composed of two components: a spatial correlation function, $\rho(s,s';\phi)$, which introduces spatial dependence between the outcomes through hyperparameters $\phi$, and a microscale adjustment $\frac{1-\alpha}{\alpha}\,\delta(s,s')$, where $\delta(s,s')$ is $0$ if $s \neq s'$ and is $1$ if $s = s'$, and $\alpha$ is a scalar parameter representing the overall strength of the spatial variability as a proportion of the total variation.
The covariance among the elements of $y(s)$ within a location is given by the elements of $C_{\theta}(s,s)$, suggesting the interpretation of $\Sigma$ as the within-location (non-spatial) dependence among the outcomes adjusted by a scale of $1/\alpha$ to accommodate additional variation at local scales. The interpretation of $\alpha$ is equivalent to the ratio of the “partial sill” to the “sill” in classical geostatistics. For example, in the special case that $s = s'$, $C_{\theta}(s,s) = \frac{1}{\alpha}\Sigma = \Sigma + \frac{1-\alpha}{\alpha}\Sigma$, which shows that $\frac{1-\alpha}{\alpha}\Sigma$ is the covariance of microscale processes (or the “nugget”), so that $\alpha$ is the ratio of the spatial variance (partial sill) to the total variance (sill). A similar interpretation for $\alpha$ results in the univariate setting with $q = 1$.

Let $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ be a set of $n$ locations yielding observations on $y(s)$. Then $Y = [y(s_1), \ldots, y(s_n)]^{\top}$ is $n\times q$ and $X$ is the corresponding $n\times p$ matrix of explanatory variables observed over $\mathcal{S}$. We will assume that $X$ has full column rank ($n > p$). The likelihood emerging from (1) is $Y \sim \mathrm{MN}_{n\times q}(XB, K, \Sigma)$, where MN denotes the Matrix-Normal distribution defined in Ding and Cook (2014), i.e.,

(2)  $p(Y \mid B, \Sigma) = \dfrac{\exp\left\{-\frac{1}{2}\operatorname{tr}\left[\Sigma^{-1}(Y - XB)^{\top}K^{-1}(Y - XB)\right]\right\}}{(2\pi)^{nq/2}\,|K|^{q/2}\,|\Sigma|^{n/2}}\,,$
where $\operatorname{tr}$ denotes trace, and $K$ is the $n\times n$ spatial correlation matrix with elements $K_{ij} = q_{\theta}(s_i, s_j)$. A conjugate Bayesian model is obtained by a Matrix-Normal-Inverse-Wishart (MNIW) prior on $\{B, \Sigma\}$, which we denote as

(3)  $\{B, \Sigma\} \sim \mathrm{MNIW}(\mu_B, V_r, \Psi, \nu)$, i.e., $\Sigma \sim \mathrm{IW}(\Psi, \nu)$ and $B \mid \Sigma \sim \mathrm{MN}(\mu_B, V_r, \Sigma)$,

where $\mathrm{IW}(\Psi, \nu)$ is the inverted-Wishart distribution. The MNIW family is a conjugate prior with respect to the likelihood (2) and, for any fixed values of $\phi$, $\alpha$ and the hyperparameters in the prior density, we obtain the posterior density

(4)  $p(B, \Sigma \mid Y) = \mathrm{IW}(\Sigma \mid \Psi^{*}, \nu^{*}) \times \mathrm{MN}(B \mid B^{*}, V^{*}, \Sigma),$

where

(5)  $V^{*} = \left(V_r^{-1} + X^{\top}K^{-1}X\right)^{-1}, \quad B^{*} = V^{*}\left(V_r^{-1}\mu_B + X^{\top}K^{-1}Y\right),$
     $\nu^{*} = \nu + n, \quad \Psi^{*} = \Psi + \mu_B^{\top}V_r^{-1}\mu_B + Y^{\top}K^{-1}Y - B^{*\top}V^{*-1}B^{*}.$
Direct sampling from the MNIW posterior distribution in (4) is achieved by first sampling $\Sigma \sim \mathrm{IW}(\Psi^{*}, \nu^{*})$ and then sampling one draw of $B \sim \mathrm{MN}(B^{*}, V^{*}, \Sigma)$ for each draw of $\Sigma$. The resulting pairs $\{B, \Sigma\}$ will be samples from (4). Since this scheme draws directly from the posterior distribution, the samples are exact and do not require burn-in or convergence diagnostics.
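To make the update and the exact sampling scheme concrete, a minimal dense-matrix sketch in Python is given below. This is our own illustration, not the authors' software; the function names `mniw_posterior` and `sample_mniw`, and the use of scipy's `invwishart` and `matrix_normal`, are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def mniw_posterior(Y, X, K_inv, mu_B, V_r, Psi, nu):
    """Conjugate MNIW update for Y ~ MN(X B, K, Sigma) with prior
    Sigma ~ IW(Psi, nu) and B | Sigma ~ MN(mu_B, V_r, Sigma)."""
    n = Y.shape[0]
    Vr_inv = np.linalg.inv(V_r)
    V_star = np.linalg.inv(Vr_inv + X.T @ K_inv @ X)
    B_star = V_star @ (Vr_inv @ mu_B + X.T @ K_inv @ Y)
    Psi_star = (Psi + mu_B.T @ Vr_inv @ mu_B + Y.T @ K_inv @ Y
                - B_star.T @ np.linalg.inv(V_star) @ B_star)
    Psi_star = 0.5 * (Psi_star + Psi_star.T)  # symmetrize against round-off
    return B_star, V_star, Psi_star, nu + n

def sample_mniw(B_star, V_star, Psi_star, nu_star, n_samples=100, seed=0):
    """Exact draws: Sigma ~ IW(Psi*, nu*), then B | Sigma ~ MN(B*, V*, Sigma)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_samples):
        Sigma = np.atleast_2d(invwishart.rvs(df=nu_star, scale=Psi_star,
                                             random_state=rng))
        B = matrix_normal.rvs(mean=B_star, rowcov=V_star, colcov=Sigma,
                              random_state=rng)
        draws.append((np.atleast_2d(B), Sigma))
    return draws
```

Because the draws are independent, no burn-in is discarded; the accuracy of Monte Carlo summaries is governed only by the number of draws.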
Turning to predictions, let $\mathcal{U} = \{u_1, \ldots, u_{n'}\}$ be a finite set of locations where we intend to predict or impute the value of $y(\cdot)$ based upon an observed $n'\times p$ design matrix $X_{\mathcal{U}}$ for $\mathcal{U}$. If $Y_{\mathcal{U}}$ is the $n'\times q$ matrix of predictive random variables, then the conditional predictive distribution is

(6)  $Y_{\mathcal{U}} \mid Y, B, \Sigma \sim \mathrm{MN}\left(X_{\mathcal{U}}B + K_{\mathcal{U}}K^{-1}(Y - XB),\; K_{\mathcal{U},\mathcal{U}} - K_{\mathcal{U}}K^{-1}K_{\mathcal{U}}^{\top},\; \Sigma\right),$

where $K_{\mathcal{U}}$ is $n'\times n$ with elements $q_{\theta}(u_i, s_j)$ and $K_{\mathcal{U},\mathcal{U}}$ is $n'\times n'$ with elements $q_{\theta}(u_i, u_j)$.
Predictions can also be carried out directly in posterior predictive fashion, where we sample from

(7)  $p(Y_{\mathcal{U}} \mid Y) = \int p(Y_{\mathcal{U}} \mid Y, B, \Sigma)\, p(B, \Sigma \mid Y)\, \mathrm{d}B\, \mathrm{d}\Sigma.$

Sampling from (7) is achieved by drawing one $Y_{\mathcal{U}}$ from (6) for each posterior draw of $\{B, \Sigma\}$.
Conjugate Multivariate Latent Model
We now discuss a conjugate Bayesian model for a latent process. Consider the spatial regression model

(8)  $y(s) = B^{\top}x(s) + \omega(s) + \epsilon(s),$

where $\omega(s) \sim \mathrm{GP}\left(0, \rho(\cdot,\cdot;\phi)\Sigma\right)$ is a multivariate latent process with cross-covariance matrix $\rho(s,s';\phi)\Sigma$ and $\epsilon(s) \overset{iid}{\sim} N\left(0, \frac{1-\alpha}{\alpha}\Sigma\right)$ captures microscale variation. The “proportionality” assumption that the variance of $\epsilon(s)$ is proportional to $\Sigma$ will allow us to derive analytic posterior distributions using conjugate priors.
The latent process $\omega(s)$ captures the underlying spatial pattern and holds specific interest in many applications. Let $\Omega = [\omega(s_1), \ldots, \omega(s_n)]^{\top}$ be $n\times q$. The parameter space with the latent process is $\{B, \Omega, \Sigma\}$. Letting $\gamma = [B^{\top}, \Omega^{\top}]^{\top}$ be $(p+n)\times q$, we assume that $\{\gamma, \Sigma\} \sim \mathrm{MNIW}(\mu_{\gamma}, V_{\gamma}, \Psi, \nu)$, where $\mu_{\gamma} = [\mu_B^{\top}, O]^{\top}$ and $V_{\gamma} = \mathrm{blockdiag}\left(V_r, \rho(\mathcal{S};\phi)\right)$. The posterior density is

(9)  $p(\gamma, \Sigma \mid Y) = \mathrm{IW}(\Sigma \mid \Psi^{*}_{\gamma}, \nu^{*}_{\gamma}) \times \mathrm{MN}(\gamma \mid \gamma^{*}, V^{*}_{\gamma}, \Sigma),$

where, writing $X_{\gamma} = [X, I_n]$,

(10)  $V^{*}_{\gamma} = \left(V_{\gamma}^{-1} + \tfrac{\alpha}{1-\alpha}X_{\gamma}^{\top}X_{\gamma}\right)^{-1}, \quad \gamma^{*} = V^{*}_{\gamma}\left(V_{\gamma}^{-1}\mu_{\gamma} + \tfrac{\alpha}{1-\alpha}X_{\gamma}^{\top}Y\right),$
      $\nu^{*}_{\gamma} = \nu + n, \quad \Psi^{*}_{\gamma} = \Psi + \mu_{\gamma}^{\top}V_{\gamma}^{-1}\mu_{\gamma} + \tfrac{\alpha}{1-\alpha}Y^{\top}Y - \gamma^{*\top}V^{*-1}_{\gamma}\gamma^{*}.$
For prediction on a set of locations $\mathcal{U}$, we can estimate the unobserved latent process $\Omega_{\mathcal{U}} = [\omega(u_1), \ldots, \omega(u_{n'})]^{\top}$ and the response $Y_{\mathcal{U}}$ through

(11)  $\Omega_{\mathcal{U}} \mid \gamma, \Sigma \sim \mathrm{MN}\left(\rho_{\mathcal{U}}\,\rho(\mathcal{S};\phi)^{-1}\Omega,\; \rho_{\mathcal{U},\mathcal{U}} - \rho_{\mathcal{U}}\,\rho(\mathcal{S};\phi)^{-1}\rho_{\mathcal{U}}^{\top},\; \Sigma\right), \quad Y_{\mathcal{U}} \mid \Omega_{\mathcal{U}}, \gamma, \Sigma \sim \mathrm{MN}\left(X_{\mathcal{U}}B + \Omega_{\mathcal{U}},\; \tfrac{1-\alpha}{\alpha}I_{n'},\; \Sigma\right),$

where $\rho_{\mathcal{U}}$ is the $n'\times n$ matrix with elements $\rho(u_i, s_j;\phi)$ and $\rho_{\mathcal{U},\mathcal{U}}$ is $n'\times n'$ with elements $\rho(u_i, u_j;\phi)$. Posterior predictive inference proceeds by sampling one draw of $\Omega_{\mathcal{U}}$ for each posterior draw of $\{\gamma, \Sigma\}$ and then one draw of $Y_{\mathcal{U}}$ for each drawn $\{\Omega_{\mathcal{U}}, \gamma, \Sigma\}$.
II Scalable Conjugate Bayesian Multivariate Models
Conjugate multivariate response NNGP model
A conjugate Bayesian modeling framework is appealing for massive spatial data sets because the posterior distribution of the parameters is available in closed form, circumventing the need for MCMC algorithms. The key computational bottleneck for Bayesian estimation of spatial process models concerns the computation and storage involving $K^{-1}$ in (5). The required matrix computations require $O(n^3)$ flops and $O(n^2)$ storage when $K$ is $n\times n$ and dense. While conjugate models reduce computational expenses by enabling direct sampling from closed-form posterior and posterior predictive distributions, the computation and storage involving $K^{-1}$ is still substantial for massive datasets.
One approach to obviate the overwhelming computations is to develop a sparse alternative for $K^{-1}$ in (5). One such approximation that has generated substantial recent attention in the spatial literature is due to Vecchia (Vecchia, 1988). Consider the $n\times n$ spatial covariance matrix $K$ in (2). This is a dense matrix with no apparent exploitable structure. Instead, we specify a sparse Cholesky representation

(12)  $K^{-1} \approx \widetilde{K}^{-1} = (I - A)^{\top}D^{-1}(I - A),$

where $D$ is a diagonal matrix and $I - A$ is a sparse lower-triangular matrix with $1$ along the diagonal and with no more than a fixed small number $m$ of nonzero entries in each row of $A$. The diagonal entries of $D$ and the nonzero entries of $A$ are obtained from the conditional variances and conditional expectations for a Gaussian process with covariance function $q_{\theta}(\cdot,\cdot)$. To be precise, we consider a fixed order of locations in $\mathcal{S}$ and define $N(s_i)$ to be the set comprising at most $m$ neighbors of $s_i$ among locations $s_j \in \mathcal{S}$ such that $j < i$. The $(i,j)$th entry of $A$ can be nonzero only if $s_j \in N(s_i)$. If $j_1 < j_2 < \cdots < j_m$ are the column indices indicating the nonzero entries in the $i$th row of $A$, then the $(i, j_k)$th element of $A$ is set equal to the $k$th element of the vector $q_{\theta}(s_i, N(s_i))\,q_{\theta}(N(s_i), N(s_i))^{-1}$. The $i$th diagonal element of $D$ is given by $q_{\theta}(s_i, s_i) - q_{\theta}(s_i, N(s_i))\,q_{\theta}(N(s_i), N(s_i))^{-1}\,q_{\theta}(N(s_i), s_i)$. Repeating these calculations for each row completes the construction of $A$ and $D$ and yields a sparse $\widetilde{K}^{-1}$ in (12). This construction can be performed in parallel and requires storage or computation of at most $n$ matrices of size $m\times m$, where $m \ll n$, costing $O(nm^3)$ flops and $O(nm^2)$ storage. Further algorithmic details about this construction can be found in Finley et al. (2019).
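The row-by-row construction of $A$ and $D$ can be sketched as follows. This is a small-scale illustration, not the implementation of Finley et al. (2019); it accepts an arbitrary covariance function (an exponential correlation in the test below is our own choice) and uses a brute-force neighbor search rather than an efficient one.

```python
import numpy as np
from scipy.sparse import csr_matrix

def vecchia_factors(locs, m, cov):
    """Sparse Vecchia factors: approximates K^{-1} by (I - A)^T D^{-1} (I - A),
    where row i of A conditions on (at most) the m nearest previously
    ordered locations. `cov(a, b)` returns the covariance matrix between
    two sets of locations."""
    n = locs.shape[0]
    rows, cols, vals = [], [], []
    D = np.empty(n)
    D[0] = cov(locs[:1], locs[:1])[0, 0]        # first location: no neighbors
    for i in range(1, n):
        dist = np.linalg.norm(locs[:i] - locs[i], axis=1)
        nbr = np.argsort(dist)[:m]              # neighbors among earlier locations
        C_nn = cov(locs[nbr], locs[nbr])
        c_in = cov(locs[i:i + 1], locs[nbr]).ravel()
        a = np.linalg.solve(C_nn, c_in)         # conditional-expectation weights
        D[i] = cov(locs[i:i + 1], locs[i:i + 1])[0, 0] - c_in @ a
        rows += [i] * len(nbr); cols += nbr.tolist(); vals += a.tolist()
    A = csr_matrix((vals, (rows, cols)), shape=(n, n))
    return A, D
```

With $m = n - 1$, every row conditions on all previously ordered locations and the representation is exact, which provides a convenient correctness check.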
Based on Section I, the posterior distribution follows (4), where $\{B^{*}, V^{*}, \Psi^{*}, \nu^{*}\}$ are given in (5). With the sparse representation of $\widetilde{K}^{-1}$ in (12), the process of obtaining posterior inference for $\{B, \Sigma\}$ only involves steps with storage and computational requirements linear in $n$.
Prediction at unobserved locations is also simplified as follows. We extend the definition of the neighbor sets to arbitrary locations by defining $N(u)$ to be the set of $m$ nearest neighbors of $u$ from $\mathcal{S}$. Furthermore, we assume that $y(u_i)$ and $y(u_j)$ are conditionally independent of each other given $Y$ and the other model parameters. Thus, for any $u \in \mathcal{U}$, we have

(13)  $y(u) \mid Y, B, \Sigma \sim N\left(B^{\top}x(u) + (Y - XB)^{\top}\tilde{a}(u),\; \tilde{d}(u)\,\Sigma\right),$
where $\tilde{a}(u)$ is an $n\times 1$ vector with at most $m$ nonzero elements. If $u \notin \mathcal{S}$, then

(14)  $\tilde{a}(u)_{N(u)} = q_{\theta}(N(u), N(u))^{-1}\,q_{\theta}(N(u), u), \quad \tilde{d}(u) = q_{\theta}(u, u) - q_{\theta}(u, N(u))\,q_{\theta}(N(u), N(u))^{-1}\,q_{\theta}(N(u), u),$

with all other elements of $\tilde{a}(u)$ equal to zero. If $u = s_i \in \mathcal{S}$ and $N(s_i)$ is its neighbor set, then the conditional predictive density for $y(u)$ is

(15)  $y(s_i) \mid Y, B, \Sigma \sim N\left(B^{\top}x(s_i) + (Y - XB)^{\top}\tilde{a}(s_i),\; \tilde{d}(s_i)\,\Sigma\right),$

with $\tilde{a}(s_i)$ and $\tilde{d}(s_i)$ built from $N(s_i)$ exactly as in (14).
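For a new location the weights $\tilde{a}(u)$ and scalar $\tilde{d}(u)$ in (14) reduce to a single $m\times m$ solve. A sketch of this computation is given below, assuming an exponential correlation and the covariance factor $q_{\theta}(s,s') = \rho(s,s';\phi) + \frac{1-\alpha}{\alpha}\delta(s,s')$; the function names are our own.

```python
import numpy as np

def exp_corr(a, b, phi):
    # exponential spatial correlation rho(s, s'; phi) = exp(-phi * ||s - s'||)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-phi * d)

def nngp_pred_weights(u, locs, m, phi, alpha):
    """Weights a(u) (at most m nonzeros) and scalar d(u) for predicting at a
    new location u, conditioning only on its m nearest neighbors in locs."""
    nbr = np.argsort(np.linalg.norm(locs - u, axis=1))[:m]
    nug = (1.0 - alpha) / alpha
    Q_nn = exp_corr(locs[nbr], locs[nbr], phi) + nug * np.eye(len(nbr))
    q_un = exp_corr(u[None, :], locs[nbr], phi).ravel()  # u not in S: no nugget
    a_nbr = np.linalg.solve(Q_nn, q_un)
    a = np.zeros(len(locs))
    a[nbr] = a_nbr
    d_u = (1.0 + nug) - q_un @ a_nbr   # q(u, u) = 1 + (1 - alpha) / alpha
    return a, d_u
```

Since each location's weights depend only on its own neighbor set, this step parallelizes trivially over the prediction locations.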
Since the posterior distribution of $\{B, \Sigma\}$ and the conditional predictive distribution of $y(u)$ are both available in closed form, direct sampling from the posterior predictive distribution is straightforward. A detailed algorithm for obtaining posterior inference on the parameter set $\{B, \Sigma\}$ and posterior predictions over a new set of locations $\mathcal{U}$ is given below.
Algorithm 1:
Obtaining posterior inference of $\{B, \Sigma\}$ and predictions on $\mathcal{U}$ for the conjugate multivariate response NNGP

1. Construct $X^{\top}\widetilde{K}^{-1}X$, $X^{\top}\widetilde{K}^{-1}Y$ and $Y^{\top}\widetilde{K}^{-1}Y$:
   (a) Compute the Cholesky decomposition of $V_r$.
   (b) Compute $V_r^{-1}\mu_B$ and $\mu_B^{\top}V_r^{-1}\mu_B$.
   (c) Construct $A$ and $D$ as described, for example, in Finley et al. (2019).
   (d) Compute $(I - A)X$ and $(I - A)Y$, and from these obtain $X^{\top}\widetilde{K}^{-1}X$, $X^{\top}\widetilde{K}^{-1}Y$ and $Y^{\top}\widetilde{K}^{-1}Y$.

2. Obtain $B^{*}$, $V^{*}$, $\Psi^{*}$ and $\nu^{*}$:
   (a) Compute $V^{*-1} = V_r^{-1} + X^{\top}\widetilde{K}^{-1}X$ and its Cholesky decomposition.
   (b) Compute $B^{*} = V^{*}\left(V_r^{-1}\mu_B + X^{\top}\widetilde{K}^{-1}Y\right)$.
   (c) Compute $\Psi^{*} = \Psi + \mu_B^{\top}V_r^{-1}\mu_B + Y^{\top}\widetilde{K}^{-1}Y - B^{*\top}V^{*-1}B^{*}$.
   (d) Set $\nu^{*} = \nu + n$.

3. Generate posterior samples on a new set $\mathcal{U}$ given $\{B^{*}, V^{*}, \Psi^{*}, \nu^{*}\}$:
   (a) Construct $\tilde{a}(u_i)$ and $\tilde{d}(u_i)$ as described in (14).
   (b) For $l$ in $1:L$:
       i. Sample $\Sigma^{(l)} \sim \mathrm{IW}(\Psi^{*}, \nu^{*})$.
       ii. Calculate the Cholesky decomposition $\Sigma^{(l)} = L_{\Sigma}L_{\Sigma}^{\top}$.
       iii. Sample $B^{(l)} \sim \mathrm{MN}(B^{*}, V^{*}, \Sigma^{(l)})$ (i.e., generate a $p\times q$ matrix $Z$ of independent standard normal variables and set $B^{(l)} = B^{*} + L_{V^{*}}ZL_{\Sigma}^{\top}$, where $V^{*} = L_{V^{*}}L_{V^{*}}^{\top}$).
       iv. For each $u_i \in \mathcal{U}$, sample $y(u_i)^{(l)} \sim N\left(B^{(l)\top}x(u_i) + (Y - XB^{(l)})^{\top}\tilde{a}(u_i),\; \tilde{d}(u_i)\,\Sigma^{(l)}\right)$.

Conjugate multivariate latent NNGP model
Bayesian estimation for the conjugate multivariate latent model is more challenging because inference is usually sought on the (high-dimensional) latent process itself. In particular, the calculations involving $V^{*}_{\gamma}$ in (9) are often too expensive for large data sets even when the precision matrix is sparse. Here, the latent process $\omega(s)$ in (8) follows a multivariate Gaussian process so that its realizations over $\mathcal{S}$ follow $\Omega \sim \mathrm{MN}\left(O, \widetilde{\rho}(\mathcal{S};\phi), \Sigma\right)$, where $\widetilde{\rho}(\mathcal{S};\phi)$ is the Vecchia approximation of $\rho(\mathcal{S};\phi)$. Hence, $\widetilde{\rho}(\mathcal{S};\phi)^{-1} = (I - A_{\rho})^{\top}D_{\rho}^{-1}(I - A_{\rho})$, where $A_{\rho}$ and $D_{\rho}$ are constructed analogously to $A$ and $D$ in (12) with $q_{\theta}(\cdot,\cdot)$ replaced by $\rho(\cdot,\cdot;\phi)$. This corresponds to modeling $\omega(s)$ with a Nearest-Neighbor Gaussian Process (NNGP) (see, e.g., Datta et al., 2016a, b; Banerjee, 2017, for details).
The posterior distribution of $\{\gamma, \Sigma\}$ follows a Matrix-Normal-Inverse-Wishart form similar to (9), but with $\rho(\mathcal{S};\phi)$ in (10) replaced by its Vecchia approximation $\widetilde{\rho}(\mathcal{S};\phi)$. We will solve the linear system $V^{*-1}_{\gamma}\gamma^{*} = V_{\gamma}^{-1}\mu_{\gamma} + \frac{\alpha}{1-\alpha}X_{\gamma}^{\top}Y$ for $\gamma^{*}$, compute $\Psi^{*}_{\gamma}$, and generate posterior samples of $\Sigma$ from $\mathrm{IW}(\Psi^{*}_{\gamma}, \nu^{*}_{\gamma})$. Posterior samples of $\gamma$ are then obtained by generating matrices of standard normal variables, solving a sparse linear system to give them row covariance $V^{*}_{\gamma}$, and shifting by $\gamma^{*}$, which yields draws from $\mathrm{MN}(\gamma^{*}, V^{*}_{\gamma}, \Sigma)$ for each sampled $\Sigma$.
However, sampling $\gamma$ is still challenging for massive data sets, where we seek to minimize storage and operations with large matrices. Here we introduce a useful representation. Let $L_{\gamma}$ be a nonsingular square matrix such that $V_{\gamma} = L_{\gamma}L_{\gamma}^{\top}$. We treat the prior of $\gamma$ as additional “observations” and recast (8) into an augmented linear model

(16)  $\underbrace{\begin{bmatrix} \sqrt{\tfrac{\alpha}{1-\alpha}}\,Y \\ L_{\gamma}^{-1}\mu_{\gamma} \end{bmatrix}}_{Y_{*}} = \underbrace{\begin{bmatrix} \sqrt{\tfrac{\alpha}{1-\alpha}}\,X_{\gamma} \\ L_{\gamma}^{-1} \end{bmatrix}}_{X_{*}}\gamma + \begin{bmatrix} E_1 \\ E_2 \end{bmatrix},$

where $L_{\gamma}$ is the Cholesky decomposition of $V_{\gamma}$, $X_{\gamma} = [X, I_n]$, and the rows of $E_1$ and $E_2$ are independent $N(0, \Sigma)$. With a flat prior for $B$, the corresponding block of $L_{\gamma}^{-1}$ degenerates to $O$ and does not contribute to the linear system. The expression in (10) can now be simplified as follows:

(17)  $V^{*}_{\gamma} = \left(X_{*}^{\top}X_{*}\right)^{-1}, \quad \gamma^{*} = \left(X_{*}^{\top}X_{*}\right)^{-1}X_{*}^{\top}Y_{*}.$
Following developments in Zhang et al. (2019) for the univariate case, one can efficiently generate posterior samples through a conjugate gradient algorithm exploiting the sparsity of $X_{*}$. The sampling process for $\gamma$ will be scalable whenever $X_{*}$ has a sparse structure. It is also possible to construct the quantities in (17) using the precision matrix directly; we refer to Zhang et al. (2019) for further details of this construction. We provide a detailed algorithm for the conjugate multivariate latent NNGP model below, where we implement the LSMR algorithm (Fong and Saunders, 2011) to solve the linear systems needed to generate $\gamma$. LSMR is a conjugate-gradient type algorithm for solving sparse linear equations and sparse least-squares problems where the matrix may be square or rectangular. The matrix $X_{*}$ is a sparse tall matrix, and LSMR only requires matrix-vector products with $X_{*}$ and $X_{*}^{\top}$, avoiding the explicit formation of the normal equations. LSMR also tends to produce more stable estimates than the plain conjugate gradient algorithm. We have also tested a variety of conjugate gradient and preconditioning methods, and observed that their performance varied across different data sets. LSMR without preconditioning showed relatively good performance for the latent models; therefore, we choose LSMR without preconditioning for our current illustrations.
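The role of the augmented system can be illustrated with a small ridge-style analogue (a toy setup of our own; the dimensions, sparsity level and `lam` are illustrative and do not correspond to the paper's specification). Stacking “prior” rows beneath the data rows lets `scipy.sparse.linalg.lsmr` recover the solution of the normal equations without ever forming them:

```python
import numpy as np
from scipy.sparse import csr_matrix, identity, vstack
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
n, k = 200, 50
# sparse design standing in for the tall augmented matrix's data block
X = csr_matrix(rng.standard_normal((n, k)) * (rng.random((n, k)) < 0.1))
y = rng.standard_normal(n)

lam = 0.5                                  # stands in for the prior precision scale
A_aug = vstack([X, np.sqrt(lam) * identity(k, format="csr")], format="csr")
b_aug = np.concatenate([y, np.zeros(k)])   # zero "targets" for the prior rows

gamma_hat = lsmr(A_aug, b_aug, atol=1e-12, btol=1e-12, maxiter=500)[0]

# agrees with the normal-equations solution (X^T X + lam I) gamma = X^T y
direct = np.linalg.solve((X.T @ X).toarray() + lam * np.eye(k), X.T @ y)
```

Because LSMR touches the augmented matrix only through products with vectors, the memory footprint stays proportional to the number of nonzeros rather than to the square of the dimension.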
Posterior predictive inference will adapt (11) for scalable models. After sampling $\{\gamma, \Sigma\}$, we sample one draw of $\Omega_{\mathcal{U}}$ for each sampled $\{\gamma, \Sigma\}$, where each row $\omega(u_i)^{\top}$ is drawn independently with

(18)  $\omega(u_i) \mid \Omega, \Sigma \sim N\left(\Omega^{\top}\tilde{a}_{\rho}(u_i),\; \tilde{d}_{\rho}(u_i)\,\Sigma\right),$

where $\tilde{a}_{\rho}(u_i)$ and $\tilde{d}_{\rho}(u_i)$ are constructed as in (14) with $q_{\theta}(\cdot,\cdot)$ replaced by $\rho(\cdot,\cdot;\phi)$. Finally, for each sampled $\{\Omega_{\mathcal{U}}, \gamma, \Sigma\}$ we make one draw of $Y_{\mathcal{U}} \sim \mathrm{MN}\left(X_{\mathcal{U}}B + \Omega_{\mathcal{U}},\; \tfrac{1-\alpha}{\alpha}I_{n'},\; \Sigma\right)$.
The following provides details of the algorithm for predictive inference.
Algorithm 2:
Obtaining posterior inference of $\{\gamma, \Sigma\}$ and predictions on a set $\mathcal{U}$ for the conjugate multivariate latent NNGP

1. Obtain $\gamma^{*}$, $\Psi^{*}_{\gamma}$ and $\nu^{*}_{\gamma}$:
   (a) Construct $A_{\rho}$ and $D_{\rho}$, and from them the augmented matrices $X_{*}$ and $Y_{*}$ in (16).
   (b) Solve $\gamma^{*}$ from $X_{*}^{\top}X_{*}\gamma^{*} = X_{*}^{\top}Y_{*}$ by LSMR, one column at a time for $j = 1, \ldots, q$.
   (c) Compute $\Psi^{*}_{\gamma} = \Psi + (Y_{*} - X_{*}\gamma^{*})^{\top}(Y_{*} - X_{*}\gamma^{*})$ and set $\nu^{*}_{\gamma} = \nu + n$.

2. Generate posterior samples of $\{\gamma, \Sigma\}$. For $l$ in $1:L$:
   (a) Sample $\Sigma^{(l)} \sim \mathrm{IW}(\Psi^{*}_{\gamma}, \nu^{*}_{\gamma})$.
   (b) Calculate the Cholesky decomposition $\Sigma^{(l)} = L_{\Sigma}L_{\Sigma}^{\top}$.
   (c) Generate a matrix $\eta$ of independent standard normal variables with $q$ columns and as many rows as $X_{*}$.
   (d) Solve $u_j$ from the least-squares problem $\min_{u}\|X_{*}u - \eta_j\|$ by LSMR for $j = 1, \ldots, q$, so that $U = [u_1, \ldots, u_q] \sim \mathrm{MN}\left(O, (X_{*}^{\top}X_{*})^{-1}, I_q\right)$.
   (e) Generate $\gamma^{(l)} = \gamma^{*} + UL_{\Sigma}^{\top}$.

3. Generate posterior samples of $Y_{\mathcal{U}}$ on a new set $\mathcal{U}$ given $\{\gamma^{(l)}, \Sigma^{(l)}\}$:
   (a) Construct $\tilde{a}_{\rho}(u_i)$ and $\tilde{d}_{\rho}(u_i)$ using (18).
   (b) For $l$ in $1:L$:
       i. Sample $\omega(u_i)^{(l)} \sim N\left(\Omega^{(l)\top}\tilde{a}_{\rho}(u_i),\; \tilde{d}_{\rho}(u_i)\,\Sigma^{(l)}\right)$ for each $u_i \in \mathcal{U}$.
       ii. Generate $y(u_i)^{(l)} \sim N\left(B^{(l)\top}x(u_i) + \omega(u_i)^{(l)},\; \tfrac{1-\alpha}{\alpha}\Sigma^{(l)}\right)$.

III Cross-validation for Conjugate Multivariate NNGP Models
Conjugate Bayesian multivariate regression models depend upon fixing the hyperparameters $\{\phi, \alpha\}$. Here, we apply a $K$-fold cross-validation algorithm for choosing $\{\phi, \alpha\}$. This algorithm is a straightforward generalization of the univariate algorithm in Finley et al. (2019). We run the conjugate models for each point on a grid of candidate values of $\{\phi, \alpha\}$ and choose the value that produces the smallest root mean squared prediction error (RMSPE). The inference at that point is then presented. This is appealing for scalable Gaussian process models that, for any fixed $\{\phi, \alpha\}$, can deliver posterior inference at new locations with storage and flops linear in $n$.
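The grid search described above amounts to the following generic loop (a sketch; `fit_predict` is a hypothetical stand-in for refitting the conjugate NNGP model on the retained folds and predicting the held-out fold):

```python
import numpy as np

def kfold_rmspe_grid(y, locs, grid, fit_predict, K=5, seed=0):
    """Choose {phi, alpha} on a grid by K-fold cross-validated RMSPE.
    fit_predict(y_tr, locs_tr, locs_te, phi, alpha) must return
    predictions at the held-out locations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    best, best_rmspe = None, np.inf
    for phi, alpha in grid:
        sq_err = []
        for te in folds:
            tr = np.setdiff1d(idx, te)          # retained folds
            pred = fit_predict(y[tr], locs[tr], locs[te], phi, alpha)
            sq_err.append((pred - y[te]) ** 2)
        rmspe = np.sqrt(np.mean(np.concatenate(sq_err)))
        if rmspe < best_rmspe:                  # keep the smallest RMSPE
            best, best_rmspe = (phi, alpha), rmspe
    return best, best_rmspe
```

Since the folds and neighbor indices can be built once and reused for every grid point, the dominant cost is the refit at each candidate $\{\phi, \alpha\}$, which itself scales linearly in $n$ for the NNGP models.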
Algorithm 3: Cross-validation for tuning $\{\phi, \alpha\}$ for the conjugate multivariate response or latent NNGP model

1. Split the data into $K$ folds, and build the neighbor index:
   (a) Split $\mathcal{S}$ into $K$ folds $\mathcal{S}_1, \ldots, \mathcal{S}_K$. We use $\mathcal{S}_{-k}$ to denote the locations of $\mathcal{S}$ without $\mathcal{S}_k$.
   (b) Build nearest neighbors for each $\mathcal{S}_{-k}$.
   (c) Find the collection of nearest-neighbor sets for $\mathcal{S}_k$ among $\mathcal{S}_{-k}$ for $k = 1, \ldots, K$.

2. (For response NNGP) Fix $\phi$ and $\alpha$, and obtain the posterior mean of $Y_{\mathcal{S}_k}$ after removing the $k$th fold of the data:
   (a) Use steps 1-2 in Algorithm 1 to obtain $\{B^{*}, V^{*}, \Psi^{*}, \nu^{*}\}$ by taking $\mathcal{S}$ to be $\mathcal{S}_{-k}$ and $\mathcal{U}$ to be $\mathcal{S}_k$.

3. (For latent NNGP) Fix $\phi$ and $\alpha$, and obtain the posterior mean of $Y_{\mathcal{S}_k}$ after removing the $k$th fold of the data:
   (a) Use steps 1-2 in Algorithm 2 to obtain $\{\gamma^{*}, \Psi^{*}_{\gamma}, \nu^{*}_{\gamma}\}$ by taking $\mathcal{S}$ to be $\mathcal{S}_{-k}$ and $\mathcal{U}$ to be $\mathcal{S}_k$.
(For response NNG