# High-Dimensional Multivariate Geostatistics: A Bayesian Matrix-Normal Approach

Joint modeling of spatially-oriented dependent variables is commonplace in the environmental sciences, where scientists seek to estimate the relationships among a set of environmental outcomes while accounting for dependence among these outcomes and the spatial dependence for each outcome. Such modeling is now sought for very large data sets where the variables have been measured at a very large number of locations. Bayesian inference, while attractive for accommodating uncertainties through hierarchical structures, can become computationally onerous for modeling massive spatial data sets because of its reliance on iterative estimation algorithms. This manuscript develops a conjugate Bayesian framework for analyzing multivariate spatial data using analytically tractable posterior distributions that do not require iterative algorithms. We discuss differences between modeling the multivariate response itself as a spatial process and modeling a latent process in a hierarchical setup. We illustrate the computational and inferential benefits of these models using simulation studies and real data analyses for a vegetation index dataset with observed locations numbering in the millions.


## I Introduction

Analysis of environmental data sets often requires joint modeling of multiple spatially dependent variables, accounting for dependence among the variables and the spatial association for each variable. For point-referenced variables, multivariate Gaussian processes (GPs) serve as versatile tools for joint modeling of spatial variables (see, e.g., Schabenberger and Gotway, 2004; Cressie and Wikle, 2015; Banerjee et al., 2014, and references therein). However, for a dataset with n observed locations, fitting a GP-based spatial model typically requires floating point operations (flops) and memory of the order n³ and n², respectively. This is challenging when n is large. This “Big Data” problem has received much attention in the literature and a comprehensive review is beyond the scope of this article; see, e.g., Banerjee (2017); Heaton et al. (2019); Sun et al. (2011) for reviews and comparisons of scalable modeling methods. Much of the aforementioned literature on scalable models has focused on univariate spatial processes, i.e., assuming only one response for each location.

Multivariate processes (see, e.g., Genton and Kleiber, 2015; Salvaña and Genton, 2020; Le and Zidek, 2006, and references therein) have received relatively limited development in the context of massive data. Bayesian models are attractive for inference on multivariate spatial processes because they can accommodate uncertainties in the process parameters more flexibly through their hierarchical structure. Multivariate spatial interpolation using conjugate Bayesian modeling can be found in Brown et al. (1994); Le et al. (1997); Sun et al. (1998); Le et al. (2001); Gamerman and Moreira (2004), but these methods do not address the challenges encountered in massive data sets. More flexible methods for joint modeling, including spatial factor models, have been investigated in Bayesian contexts (see, e.g., Schmidt and Gelfand, 2003; Ren and Banerjee, 2013; Taylor-Rodriguez et al., 2019), but these methods have focused upon delivering full Bayesian inference through iterative algorithms such as Markov chain Monte Carlo (MCMC).

Our current contribution is an augmented Bayesian multivariate linear model framework that accommodates conjugate distribution theory, similar to Gamerman and Moreira (2004), but that can scale up to massive data sets with locations numbering in the millions. More specifically, we embed the Nearest-Neighbor Gaussian process (NNGP) (Datta et al., 2016a) within our conjugate Bayesian framework. We will consider two classes of models. The first is obtained by modeling the spatially dependent variables jointly as a multivariate spatial process, while the second models a latent multivariate spatial process in a hierarchical setup. We refer to the former as the “response” model and the latter as the “latent” model and we explore some properties of these models.

The remainder of our paper is arranged as follows. Section II develops a conjugate Bayesian multivariate spatial regression framework using Matrix-Normal and Inverse-Wishart prior distributions. We first develop two classes of models, response models and latent models, using Gaussian spatial processes. Subsequently, we develop scalable versions of these models using the Nearest-Neighbor Gaussian process (NNGP), constructing NNGP response models and NNGP latent models within this conjugate Bayesian framework. A cross-validation algorithm to fix certain hyperparameters in these models is then presented, along with some theoretical attributes of these models. Section III presents some simulation experiments, while Section IV analyzes a massive Normalized Difference Vegetation Index dataset with a few million locations. Finally, Section V concludes the manuscript with some discussion.

## II Bayesian Multivariate Geostatistical Modeling

### I Conjugate Multivariate Spatial Models

#### Conjugate Multivariate Response Model

Let y(s) be a q×1 vector of outcomes at location s ∈ D and x(s) be a corresponding p×1 vector of explanatory variables. Conditional on these explanatory variables, the response is assumed to follow a multivariate Gaussian process,

 y(s) ∼ GP(β⊤x(s), C(⋅,⋅));  C(s, s′) = [ρ_ψ(s, s′) + (α⁻¹ − 1)δ_{s = s′}]Σ,   (1)

where the mean of y(s) is β⊤x(s), β is a p×q matrix of regression coefficients and C(s, s′) is a q×q cross-covariance matrix (Genton and Kleiber, 2015) whose (i, j)-th element is the covariance between y_i(s) and y_j(s′). The cross-covariance matrix is defined for each pair of locations and is further specified as a multiple of a nonspatial positive definite matrix Σ. The multiplication factor is a function of the two locations and is composed of two components: a spatial correlation function, ρ_ψ(s, s′), which introduces spatial dependence between the outcomes through hyperparameters ψ, and a micro-scale adjustment (α⁻¹ − 1)δ_{s = s′}, where δ_{s = s′} is 1 if s = s′ and is 0 if s ≠ s′, and α ∈ (0, 1] is a scalar parameter representing the overall strength of the spatial variability as a proportion of the total variation.

The covariance among the elements of y(s) within a location is given by the elements of C(s, s) = α⁻¹Σ, suggesting the interpretation of Σ as the within-location (nonspatial) dependence among the outcomes, adjusted by a scale of α⁻¹ to accommodate additional variation at local scales. The interpretation of α is equivalent to the ratio of the “partial sill” to the “sill” in classical geostatistics. For example, in the special case that Σ = σ²I_q, C(s, s) = α⁻¹σ²I_q, which shows that (α⁻¹ − 1)σ² is the variance of the micro-scale processes (or the “nugget”), so that α = σ²/(α⁻¹σ²) is the ratio of the spatial variance (partial sill) to the total variance (sill). A similar interpretation for α results in the univariate setting with q = 1.
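To make the covariance specification in (1) concrete, the following sketch evaluates C(s, s′) for a single pair of locations. The exponential form ρ_ψ(s, s′) = exp(−φ‖s − s′‖) and the argument names phi and alpha are illustrative assumptions; the framework allows any valid spatial correlation function.

```python
import numpy as np

def cross_covariance(s1, s2, Sigma, phi, alpha):
    """Evaluate C(s, s') = [rho_psi(s, s') + (1/alpha - 1) * 1{s == s'}] * Sigma.

    rho_psi is taken to be an exponential correlation with decay phi; this
    choice is an illustrative assumption, not the paper's only option.
    """
    dist = np.linalg.norm(np.asarray(s1, float) - np.asarray(s2, float))
    rho = np.exp(-phi * dist)                      # spatial correlation rho_psi
    nugget = (1.0 / alpha - 1.0) if dist == 0 else 0.0  # micro-scale adjustment
    return (rho + nugget) * Sigma
```

At s = s′ this returns α⁻¹Σ, matching the within-location covariance discussed above.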

Let S = {s_1, …, s_n} be a set of n locations yielding observations on y(s). Then Y = [y(s_1) : ⋯ : y(s_n)]⊤ is n×q and X is the corresponding n×p matrix of explanatory variables observed over S. We will assume that X has full column rank (p < n). The likelihood emerging from (1) is Y ∼ MN_{n,q}(Xβ, K, Σ), where MN_{n,q} denotes the Matrix-Normal distribution defined in Ding and Cook (2014), i.e.,

 MN_{n,q}(Y | Xβ, K, Σ) = exp{−(1/2) tr[Σ⁻¹(Y − Xβ)⊤K⁻¹(Y − Xβ)]} / {(2π)^{nq/2}|K|^{q/2}|Σ|^{n/2}},   (2)

where tr denotes trace, and K = ρ_ψ(S, S) + (α⁻¹ − 1)I_n is the n×n spatial correlation matrix. A conjugate Bayesian model is obtained by a Matrix-Normal-Inverse-Wishart (MNIW) prior on {β, Σ}, which we denote as

 MNIW(β, Σ | μ_β, V_r, Ψ, ν) = IW(Σ | Ψ, ν) × MN_{p,q}(β | μ_β, V_r, Σ),   (3)

where IW(Σ | Ψ, ν) is the inverted-Wishart distribution. The MNIW family is a conjugate prior with respect to the likelihood (2) and, for any fixed values of {ψ, α} and the hyperparameters in the prior density, we obtain the posterior density

 p(β, Σ | Y) ∝ MNIW(β, Σ | μ_β, V_r, Ψ, ν) × MN_{n,q}(Y | Xβ, K, Σ) ∝ MNIW(β, Σ | μ*, V*, Ψ*, ν*),   (4)

where

 V* = (X⊤K⁻¹X + V_r⁻¹)⁻¹,  μ* = V*(X⊤K⁻¹Y + V_r⁻¹μ_β),   (5)
 Ψ* = Ψ + Y⊤K⁻¹Y + μ_β⊤V_r⁻¹μ_β − μ*⊤V*⁻¹μ*,  and  ν* = ν + n.

Direct sampling from the MNIW posterior distribution in (4) is achieved by first sampling Σ ∼ IW(Ψ*, ν*) and then sampling one draw of β ∼ MN_{p,q}(μ*, V*, Σ) for each draw of Σ. The resulting pairs {β, Σ} will be samples from (4). Since this scheme draws directly from the posterior distribution, the samples are exact and do not require burn-in or convergence diagnostics.
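This direct sampling scheme can be sketched as follows. It is a dense, small-n illustration of (5) and the composition sampler using scipy, not the scalable algorithm developed later for massive data.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def sample_mniw_posterior(Y, X, K, mu_beta, V_r, Psi, nu, n_samples=100, rng=None):
    """Draw exact samples from the MNIW posterior in (4)-(5):
    Sigma ~ IW(Psi*, nu*), then beta | Sigma ~ MN(mu*, V*, Sigma)."""
    rng = np.random.default_rng(rng)
    n = Y.shape[0]
    K_inv = np.linalg.inv(K)          # dense inverse: fine only for small n
    V_r_inv = np.linalg.inv(V_r)
    V_star = np.linalg.inv(X.T @ K_inv @ X + V_r_inv)
    mu_star = V_star @ (X.T @ K_inv @ Y + V_r_inv @ mu_beta)
    Psi_star = (Psi + Y.T @ K_inv @ Y + mu_beta.T @ V_r_inv @ mu_beta
                - mu_star.T @ np.linalg.inv(V_star) @ mu_star)
    Psi_star = 0.5 * (Psi_star + Psi_star.T)   # guard against round-off asymmetry
    nu_star = nu + n
    draws = []
    for _ in range(n_samples):
        Sigma = np.atleast_2d(invwishart.rvs(df=nu_star, scale=Psi_star,
                                             random_state=rng))
        beta = matrix_normal.rvs(mean=mu_star, rowcov=V_star, colcov=Sigma,
                                 random_state=rng)
        draws.append((beta, Sigma))
    return draws, (mu_star, V_star, Psi_star, nu_star)
```

Because each pair (β, Σ) is an independent exact draw, no burn-in is discarded.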

Turning to predictions, let U = {u_1, …, u_{n′}} be a finite set of locations where we intend to predict or impute the value of y(s) based upon an observed n′×p design matrix X_U for U. If Y_U is the n′×q matrix of predictive random variables, then the conditional predictive distribution is

 p(Y_U | Y, β, Σ) = MN_{n′,q}(Y_U | X_Uβ + ρ_ψ(U, S)K⁻¹[Y − Xβ],   (6)
  ρ_ψ(U, U) + (α⁻¹ − 1)I_{n′} − ρ_ψ(U, S)K⁻¹ρ_ψ(S, U), Σ),

where ρ_ψ(U, S) is n′×n and ρ_ψ(S, U) = ρ_ψ(U, S)⊤. Predictions can also be carried out directly in posterior predictive fashion, where we sample from

 p(Y_U | Y) = ∫ MN_{n′,q}(X_Uβ + ρ_ψ(U, S)K⁻¹[Y − Xβ], ρ_ψ(U, U) + (α⁻¹ − 1)I_{n′} − ρ_ψ(U, S)K⁻¹ρ_ψ(S, U), Σ) × MNIW(β, Σ | μ*, V*, Ψ*, ν*) dβ dΣ.   (7)

Sampling from (7) is achieved by drawing one Y_U from (6) for each posterior draw of {β, Σ}.
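A minimal sketch of this posterior predictive step: one matrix-normal draw of Y_U per posterior draw of (β, Σ). The dense correlation blocks ρ_ψ(U, S), ρ_ψ(U, U) and K⁻¹ are assumed precomputed by the caller; this is illustrative, not scalable.

```python
import numpy as np
from scipy.stats import matrix_normal

def sample_predictive(draws, XU, X, Y, rho_US, rho_UU, K_inv, alpha, rng=None):
    """Draw from (7): one Y_U from the conditional predictive (6) per
    posterior draw (beta, Sigma). rho_US = rho_psi(U, S), rho_UU = rho_psi(U, U)."""
    rng = np.random.default_rng(rng)
    n_prime = XU.shape[0]
    # Predictive row covariance: rho(U,U) + (1/alpha - 1) I - rho(U,S) K^{-1} rho(S,U)
    R = rho_UU + (1.0 / alpha - 1.0) * np.eye(n_prime) - rho_US @ K_inv @ rho_US.T
    samples = []
    for beta, Sigma in draws:
        mean = XU @ beta + rho_US @ K_inv @ (Y - X @ beta)
        samples.append(matrix_normal.rvs(mean=mean, rowcov=R, colcov=Sigma,
                                         random_state=rng))
    return samples
```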

#### Conjugate Multivariate Latent Model

We now discuss a conjugate Bayesian model for a latent process. Consider the spatial regression model

 y(s) = β⊤x(s) + ω(s) + ϵ(s),  s ∈ D,   (8)

where ω(s) is a q×1 multivariate latent process with cross-covariance matrix ρ_ψ(s, s′)Σ and ϵ(s) captures micro-scale variation with var(ϵ(s)) = (α⁻¹ − 1)Σ. This “proportionality” assumption for the variance of ϵ(s) will allow us to derive analytic posterior distributions using conjugate priors.

The latent process ω(s) captures the underlying spatial pattern and holds specific interest in many applications. Let Ω be the n×q matrix with i-th row ω(s_i)⊤. The parameter space with the latent process is {β, Ω, Σ}. Letting γ = [β⊤ : Ω⊤]⊤ be (p + n)×q, we assume that γ ∼ MN_{p+n,q}(μ_γ, V_γ, Σ), where μ_γ = [μ_β⊤ : O]⊤ and V_γ = blockdiag(V_r, ρ_ψ(S, S)). The posterior density is

 p(γ, Σ | Y) ∝ MNIW(γ, Σ | μ_γ, V_γ, Ψ, ν) × MN_{n,q}(Y | [X : I_n]γ, (α⁻¹ − 1)I_n, Σ)   (9)
  ∝ MNIW(γ, Σ | μ_γ*, V*, Ψ*, ν*),

where

 V* = [ (α/(1−α))X⊤X + V_r⁻¹   (α/(1−α))X⊤
        (α/(1−α))X              ρ_ψ⁻¹(S, S) + (α/(1−α))I_n ]⁻¹,
 μ_γ* = V* [ (α/(1−α))X⊤Y + V_r⁻¹μ_β
             (α/(1−α))Y ],   (10)
 Ψ* = Ψ + (α/(1−α))Y⊤Y + μ_β⊤V_r⁻¹μ_β − μ_γ*⊤V*⁻¹μ_γ*  and  ν* = ν + n.

For prediction on a set of locations U, we can estimate the unobserved latent process ω_U and the response Y_U through

 p(Y_U, ω_U | Y) = ∫ MN_{n′,q}(Y_U | X_Uβ + ω_U, (α⁻¹ − 1)I_{n′}, Σ) × MN_{n′,q}(ω_U | M_UΩ, V_{ωU}, Σ) × MNIW(γ, Σ | μ_γ*, V*, Ψ*, ν*) dγ dΣ,   (11)

where M_U = ρ_ψ(U, S)ρ_ψ⁻¹(S, S) and V_{ωU} = ρ_ψ(U, U) − ρ_ψ(U, S)ρ_ψ⁻¹(S, S)ρ_ψ(S, U). Posterior predictive inference proceeds by sampling one draw of ω_U for each posterior draw of {γ, Σ} and then one draw of Y_U for each drawn ω_U.

### II Scalable Conjugate Bayesian Multivariate Models

#### Conjugate multivariate response NNGP model

A conjugate Bayesian modeling framework is appealing for massive spatial data sets because the posterior distributions of the parameters are available in closed form, circumventing the need for MCMC algorithms. The key computational bottleneck for Bayesian estimation of spatial process models concerns the computation and storage involving K⁻¹ in (5). The required matrix computations need flops of the order n³ and storage of the order n² when K is n×n and dense. While conjugate models reduce computational expenses by enabling direct sampling from closed-form posterior and posterior predictive distributions, the computation and storage involving K⁻¹ is still substantial for massive datasets.

One approach to obviate the overwhelming computation is to develop a sparse alternative for K⁻¹ in (5). One such approximation that has generated substantial recent attention in the spatial literature is due to Vecchia (Vecchia, 1988). Consider the n×n spatial covariance matrix K in (2). This is a dense matrix with no apparent exploitable structure. Instead, we specify a sparse Cholesky representation

 K⁻¹ = (I − A_K)⊤D_K⁻¹(I − A_K),   (12)

where D_K is a diagonal matrix and A_K is a sparse, strictly lower-triangular matrix with 0 along the diagonal and with no more than a fixed small number m of nonzero entries in each row. The diagonal entries of D_K and the nonzero entries of A_K are obtained from the conditional variances and conditional expectations for a Gaussian process with covariance matrix K. To be precise, we consider a fixed order of the locations in S and define N_m(s_i) to be the set comprising at most m neighbors of s_i among the locations s_j with j < i. The (i, j)-th entry of A_K can be nonzero only if s_j ∈ N_m(s_i). If j_1 < j_2 < ⋯ < j_m are the column indices of the nonzero entries in the i-th row of A_K, then the (i, j_k)-th element of A_K is set equal to the k-th element of the 1×m vector K(s_i, N_m(s_i))K(N_m(s_i), N_m(s_i))⁻¹. The i-th diagonal element of D_K is given by K(s_i, s_i) − K(s_i, N_m(s_i))K(N_m(s_i), N_m(s_i))⁻¹K(N_m(s_i), s_i). Repeating these calculations for each row completes the construction of A_K and D_K and yields a sparse K⁻¹ in (12). This construction can be performed in parallel and requires storage or computation of at most n matrices of size m×m, where m ≪ n, costing flops and storage that grow linearly in n. Further algorithmic details about this construction can be found in Finley et al. (2019).
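The row-by-row construction above can be sketched as follows. Here cov_fn stands in for the covariance function that defines K, the brute-force neighbor search replaces an efficient spatial index for clarity, and conditioning on all predecessors (m ≥ n − 1) recovers the exact Cholesky factorization.

```python
import numpy as np

def vecchia_factors(coords, cov_fn, m=10):
    """Build A (rows of kriging weights on ordered neighbors) and the diagonal
    D such that K^{-1} is approximated by (I - A)^T diag(1/D) (I - A), as in (12).

    cov_fn(S1, S2) returns the covariance matrix between two location sets.
    Dense per-row arithmetic: an illustrative sketch, not production code.
    """
    n = coords.shape[0]
    A = np.zeros((n, n))
    D = np.zeros(n)
    D[0] = cov_fn(coords[:1], coords[:1])[0, 0]
    for i in range(1, n):
        # at most m nearest neighbors of s_i among the previously ordered s_1..s_{i-1}
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nbrs = np.argsort(d)[:m]
        C_NN = cov_fn(coords[nbrs], coords[nbrs])
        c_iN = cov_fn(coords[i:i + 1], coords[nbrs]).ravel()
        a = np.linalg.solve(C_NN, c_iN)            # conditional (kriging) weights
        A[i, nbrs] = a
        D[i] = cov_fn(coords[i:i + 1], coords[i:i + 1])[0, 0] - c_iN @ a
    return A, D
```

Each row solve involves only an m×m system, which is what makes the construction embarrassingly parallel.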

Based on Section I, the posterior distribution of {β, Σ} follows (4), where {μ*, V*, Ψ*, ν*} are given in (5). With the sparse representation of K⁻¹ in (12), obtaining posterior inference for {β, Σ} only involves storage and computational requirements that are linear in n.

Predictions at unobserved locations are also simplified as follows. We extend the definition of the neighbor sets N_m(·) to arbitrary locations by defining N_m(u) to be the set of m nearest neighbors of u from S. Furthermore, we assume that y(u_i) and y(u_j) are conditionally independent of each other given Y and the other model parameters. Thus, for any u_i ∈ U, we have

 y(u_i) | Y, β, Σ ∼ N(β⊤x(u_i) + [Y − Xβ]⊤ã_i, d̃_iΣ),  i = 1, …, n′,   (13)

where ã_i is an n×1 vector with at most m nonzero elements. If j_1 < j_2 < ⋯ < j_m index these nonzero elements, then

 ({ã_i}_{j_1}, …, {ã_i}_{j_m}) = ρ_ψ(u_i, N_m(u_i)){ρ_ψ(N_m(u_i), N_m(u_i)) + (α⁻¹ − 1)I_m}⁻¹,
 d̃_i = α⁻¹ − ρ_ψ(u_i, N_m(u_i))[ρ_ψ(N_m(u_i), N_m(u_i)) + (α⁻¹ − 1)I_m]⁻¹ρ_ψ(N_m(u_i), u_i).   (14)

If Ã = [ã_1 : ⋯ : ã_{n′}]⊤ and D̃ = diag(d̃_1, …, d̃_{n′}), then the conditional predictive density for Y_U is

 Y_U | Y, β, Σ ∼ MN_{n′,q}(X_Uβ + Ã[Y − Xβ], D̃, Σ).   (15)

Since the posterior distribution of {β, Σ} and the conditional predictive distribution of Y_U are both available in closed form, direct sampling from the posterior predictive distribution is straightforward. A detailed algorithm for obtaining posterior inference on the parameter set {β, Σ} and predictions over a new set of locations U is given below.

Algorithm 1: Obtaining posterior inference of {β, Σ} and predictions on U for the conjugate multivariate response NNGP

1. Construct V*, μ*, Ψ* and ν* in (5):

   1. Compute the Cholesky decomposition of V_r and obtain V_r⁻¹ and V_r⁻¹μ_β

   2. Compute K⁻¹X and K⁻¹Y

   • Construct A_K and D_K as described, for example, in Finley et al. (2019)

   • Compute K⁻¹X = (I − A_K)⊤D_K⁻¹(I − A_K)X and K⁻¹Y = (I − A_K)⊤D_K⁻¹(I − A_K)Y

   3. Obtain V*, μ*, Ψ* and ν*

   • Compute V*⁻¹ = X⊤K⁻¹X + V_r⁻¹ and its Cholesky decomposition

   • Compute μ* = V*(X⊤K⁻¹Y + V_r⁻¹μ_β)

   • Compute Ψ* = Ψ + Y⊤K⁻¹Y + μ_β⊤V_r⁻¹μ_β − μ*⊤V*⁻¹μ* and ν* = ν + n

2. Generate posterior samples on a new set U given {μ*, V*, Ψ*, ν*}

   1. Construct Ã and D̃ as described in (14)

   2. For l in 1, …, L

   1. Sample Σ_l ∼ IW(Ψ*, ν*)

   2. Sample β_l | Σ_l

   • Calculate the Cholesky decompositions V* = L_V L_V⊤ and Σ_l = L_Σ L_Σ⊤

   • Sample Z_l ∼ MN_{p,q}(O, I_p, I_q) (i.e., fill Z_l with i.i.d. N(0, 1) entries)

   • Generate β_l = μ* + L_V Z_l L_Σ⊤

   3. Sample Y_U^(l) | β_l, Σ_l

   • Sample Z_l′ ∼ MN_{n′,q}(O, I_{n′}, I_q)

   • Generate Y_U^(l) = X_Uβ_l + Ã[Y − Xβ_l] + D̃^{1/2} Z_l′ L_Σ⊤

#### Conjugate multivariate latent NNGP model

Bayesian estimation for the conjugate multivariate latent model is more challenging because inference is usually sought on the (high-dimensional) latent process itself. In particular, the calculations involving V* in (10) are often too expensive for large data sets even when the precision matrix is sparse. Here, the latent process ω(s) in (8) follows a multivariate Gaussian process so that its realizations over S follow MN_{n,q}(O, ~ρ_ψ(S, S), Σ), where ~ρ_ψ(S, S) is the Vecchia approximation of ρ_ψ(S, S). Hence, ~ρ_ψ⁻¹(S, S) = (I − A_ρ)⊤D_ρ⁻¹(I − A_ρ), where A_ρ and D_ρ are constructed analogously to A_K and D_K in (12) with K replaced by ρ_ψ(S, S). This corresponds to modeling ω(s) with a Nearest-Neighbor Gaussian Process (NNGP) (see, e.g., Datta et al., 2016a, b; Banerjee, 2017, for details).

The posterior distribution of {γ, Σ} follows a Matrix-Normal-Inverse-Wishart distribution similar to (9), but with ρ_ψ⁻¹(S, S) in (10) replaced by its Vecchia approximation ~ρ_ψ⁻¹(S, S). We will solve a linear system for μ_γ*, compute {Ψ*, ν*} and generate posterior samples of Σ from IW(Ψ*, ν*). Posterior samples of γ are then obtained by drawing from MN_{p+n,q}(μ_γ*, V*, Σ) for each sampled Σ.

However, sampling γ is still challenging for massive data sets, where we seek to minimize storage and operations with large matrices. Here we introduce a useful representation. Let L_γ be a non-singular square matrix such that L_γL_γ⊤ = V_γ, where V_γ = blockdiag(V_r, ~ρ_ψ(S, S)). We treat the prior of γ as additional “observations” and recast the model into an augmented linear system,

 Y* = X*γ + E*,  Y* = [ (α/(1−α))^{1/2}Y ; L_γ⁻¹μ_γ ],  X* = [ (α/(1−α))^{1/2}[X : I_n] ; L_γ⁻¹ ],   (16)

where L_γ is obtained from the Cholesky decomposition of V_γ, and the rows of E* are independent with covariance Σ. With a flat prior for β, V_r⁻¹ degenerates to O and μ_β does not contribute to the linear system. The expressions in (10) can now be simplified as follows:

 V* = (X*⊤X*)⁻¹,  μ_γ* = (X*⊤X*)⁻¹X*⊤Y*,   (17)
 Ψ* = Ψ + (Y* − X*μ_γ*)⊤(Y* − X*μ_γ*),  ν* = ν + n.

Following developments in Zhang et al. (2019) for the univariate case, one can efficiently generate posterior samples through a conjugate gradient algorithm exploiting the sparsity of X*. The sampling process for γ will be scalable whenever the precision matrix ~ρ_ψ⁻¹(S, S) yields a sparse L_γ⁻¹. It is also possible to construct X* and Y* in (17) using ~ρ_ψ⁻¹(S, S) instead of L_γ⁻¹; we refer to Zhang et al. (2019) for further details of this construction. We provide a detailed algorithm for the conjugate multivariate latent NNGP model below, where we implement the LSMR algorithm (Fong and Saunders, 2011) to solve the linear systems needed to generate μ_γ* and posterior samples of γ. LSMR is a conjugate-gradient type algorithm for solving sparse linear equations and sparse least-squares problems, where the matrix may be square or rectangular. The matrix X* is a sparse tall matrix. LSMR only requires storing X* and Y* and forming products with X* and X*⊤ and, unlike the conjugate gradient algorithm applied to the normal equations, avoids explicitly forming X*⊤X*. LSMR also tends to produce more stable estimates than conjugate gradient. We have tested a variety of conjugate gradient and preconditioning methods and observed that their performance varied across data sets. LSMR without preconditioning showed relatively good performance for the latent models, so we adopt LSMR without preconditioning for our current illustrations.
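A minimal sketch of the LSMR step for μ_γ* in (17), assuming X* and Y* have already been assembled as in (16). scipy's lsmr operates on one right-hand-side column at a time and only needs products with X* and X*⊤, so the normal-equations matrix X*⊤X* is never formed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsmr

def solve_mu_star(X_star, Y_star):
    """Compute mu* = (X*^T X*)^{-1} X*^T Y* of (17) column by column with LSMR.

    X_star: (n-plus-prior-rows) x (p + n) design, kept sparse;
    Y_star: matching stacked response with q columns.
    """
    X_star = csr_matrix(X_star)
    cols = [lsmr(X_star, Y_star[:, j], atol=1e-12, btol=1e-12)[0]
            for j in range(Y_star.shape[1])]
    return np.column_stack(cols)
```

The same least-squares solver, applied to a perturbed response, underlies the posterior draws of γ in Algorithm 2.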

Posterior predictive inference adapts (11) to the scalable models. After sampling {γ, Σ}, we sample one draw of ω_U for each sampled {γ, Σ}, where ω(u_i) | Ω, Σ ∼ N(Ω⊤ã_i, d̃_iΣ) with

 ({ã_i}_{j_1}, …, {ã_i}_{j_m}) = ρ_ψ(u_i, N_m(u_i))ρ_ψ⁻¹(N_m(u_i), N_m(u_i)),   (18)
 d̃_i = 1 − ρ_ψ(u_i, N_m(u_i))ρ_ψ⁻¹(N_m(u_i), N_m(u_i))ρ_ψ(N_m(u_i), u_i).

Finally, for each sampled ω_U we make one draw of Y_U ∼ MN_{n′,q}(X_Uβ + ω_U, (α⁻¹ − 1)I_{n′}, Σ). The following provides details of the algorithm for predictive inference.
Algorithm 2: Obtaining posterior inference of {γ, Σ} and predictions on a set U for the conjugate multivariate latent NNGP

1. Construct X* and Y* in (16)

   1. Compute L_γ⁻¹ and L_γ⁻¹μ_γ

   • Compute the Cholesky decomposition of V_r, V_r = L_rL_r⊤

   • Construct A_ρ and D_ρ as described, for example, in Finley et al. (2019)

   • Compute L_ρ⁻¹ = D_ρ^{−1/2}(I − A_ρ), so that L_γ⁻¹ = blockdiag(L_r⁻¹, L_ρ⁻¹)

   • Compute L_γ⁻¹μ_γ = [(L_r⁻¹μ_β)⊤ : O]⊤

   2. Construct X* and Y* by stacking as in (16)

2. Obtain μ_γ*, Ψ* and ν*

   1. Obtain μ_γ*

   • Solve μ_γ* from X*⊤X*μ_γ* = X*⊤Y* by LSMR, one column at a time

   2. Obtain Ψ* and ν*

   • Compute Ψ* = Ψ + (Y* − X*μ_γ*)⊤(Y* − X*μ_γ*)

   • Compute ν* = ν + n

3. Generate posterior samples of {γ, Σ}. For l in 1, …, L

   1. Sample Σ_l ∼ IW(Ψ*, ν*)

   2. Sample γ_l | Σ_l

   • Calculate the Cholesky decomposition Σ_l = L_ΣL_Σ⊤

   • Sample Z_l with i.i.d. N(0, 1) entries and generate E_l = Z_lL_Σ⊤

   • Solve γ_l from the least-squares problem with response Y* + E_l and design X* by LSMR

4. Generate posterior samples of {ω_U, Y_U} on a new set U given {γ_l, Σ_l}

   1. Construct Ã and D̃ using (18)

   2. For l in 1, …, L

   1. Sample ω_U^(l)

   • Sample Z_l′ ∼ MN_{n′,q}(O, I_{n′}, I_q)

   • Generate ω_U^(l) = ÃΩ_l + D̃^{1/2}Z_l′L_Σ⊤

   2. Sample Y_U^(l)

   • Sample Z_l″ ∼ MN_{n′,q}(O, I_{n′}, I_q)

   • Generate Y_U^(l) = X_Uβ_l + ω_U^(l) + (α⁻¹ − 1)^{1/2}Z_l″L_Σ⊤

### III Cross-validation for Conjugate Multivariate NNGP Models

Conjugate Bayesian multivariate regression models depend upon fixed hyperparameters {ψ, α}. Here, we apply a K-fold cross-validation algorithm for choosing {ψ, α}. This algorithm is a straightforward generalization of the univariate algorithm in Finley et al. (2019). We run the conjugate models for each point on a grid of candidate values for {ψ, α} and choose the point that produces the smallest root mean squared prediction error. The inference for that point is then presented. This is appealing for scalable Gaussian process models that, for any fixed {ψ, α}, can deliver posterior inference at new locations with storage and flops that are linear in n.

Algorithm 3: Cross-validation for tuning {ψ, α} for the conjugate multivariate response or latent NNGP model

1. Split S into K folds, and build neighbor indices.

• Split S into K folds S_1, …, S_K. We use S_{−k} to denote the locations of S without S_k.

• Build the nearest-neighbor sets for S_{−k}

• Find the collection of nearest-neighbor sets for each s ∈ S_k among S_{−k}, for k = 1, …, K.

2. (For response NNGP) Fix ψ and α, obtain the posterior mean of {β, Σ} after removing the k-th fold of the data:

• Use step 1 in Algorithm 1 to obtain {μ*, V*, Ψ*, ν*} by taking S to be S_{−k} and Y to be Y_{−k}.

3. (For latent NNGP) Fix ψ and α, obtain the posterior mean of {γ, Σ} after removing the k-th fold of the data:

• Use steps 1-2 in Algorithm 2 to obtain {μ_γ*, Ψ*, ν*} by taking S to be S_{−k} and Y to be Y_{−k}.

4. (For response NNG