A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

12/28/2017 ∙ by Rajarshi Guhaniyogi, et al. ∙ University of California Santa Cruz ∙ Bureau of Labor Statistics ∙ The University of Iowa

Flexible hierarchical Bayesian modeling of massive data is challenging due to poorly scaling computations in large sample size settings. This article is motivated by spatial process models for analyzing geostatistical data, which typically entail computations that become prohibitive as the number of spatial locations becomes large. We propose a three-step divide-and-conquer strategy within the Bayesian paradigm to achieve massive scalability for any spatial process model. We partition the data into a large number of subsets, apply a readily available Bayesian spatial process model on every subset in parallel, and optimally combine the posterior distributions estimated across all the subsets into a pseudo-posterior distribution that conditions on the entire data. The combined pseudo-posterior distribution is used for predicting the responses at arbitrary locations and for performing posterior inference on the model parameters and the residual spatial surface. We call this approach "Distributed Kriging" (DISK). It offers significant advantages in applications where the entire data are or can be stored on multiple machines. Under the standard theoretical setup, we show that if the number of subsets is not too large, then the Bayes risk of estimating the true residual spatial surface using the DISK posterior distribution decays to zero at a nearly optimal rate. While DISK is a general approach to distributed nonparametric regression, we focus on its applications in spatial statistics and demonstrate its empirical performance using a stationary full-rank and a nonstationary low-rank model based on Gaussian process (GP) priors. A variety of simulations and a geostatistical analysis of the Pacific Ocean sea surface temperature data validate our theoretical results.


1 Introduction

A fundamental challenge in geostatistics is the analysis of massive spatially-referenced data. Due to the recent influx of data with complex spatial associations, sophisticated spatial modeling has become an enormously active area of research; see, for example, [28, 13, 4]. Massive spatial data provide scientists with an unprecedented opportunity to hypothesize and test complex theories. This leads to the implementation of rather complex hierarchical GP-based models that are computationally intractable for large n, where n is the number of spatial locations, due to the O(n³) computational cost and the O(n²) storage cost. This article develops a general distributed Bayesian approach, called Distributed Kriging (DISK), for boosting the scalability of any state-of-the-art spatial process model based on a GP prior by several folds using the divide-and-conquer technique.

The literature on process-based modeling of massive spatial data is large, so we only provide a selective review. Briefly, these methods seek "dimension reduction" by endowing the spatial covariance matrix with either a low-rank or a sparse structure. Low-rank structures represent the spatial surface using r (≪ n) a priori chosen basis functions. They include fixed-rank kriging [12] and the predictive process and its variants [6, 24, 34, 5, 60]; see [78] and [4] for comprehensive reviews. The time complexity for fitting spatial models with a low-rank structure decreases from O(n³) to O(nr² + r³) floating point operations (flops); however, practical considerations show that when n is large, r must grow roughly as a power of n for accurate estimation, implying that O(nr² + r³) flops are also expensive in low-rank structures. On the other hand, sparse structures intuit that the spatial correlation between two distantly located observations is nearly zero, so little information is lost by assuming conditional independence given the intermediate locations. For example, covariance tapering [41, 19, 25, 63] uses compactly supported covariance functions to create sparse spatial covariance matrices that approximate the full covariance matrix. Alternately, one could introduce sparsity in the inverse covariance (precision) matrix using conditional independence assumptions or composite likelihoods [75, 59, 70, 20, 15, 71, 36]. In the related literature on computer experiments, localized approximations of GP models have been proposed; see, for example, [30, 82, 54]. GP-based modeling using low-rank or sparse structures has also received significant attention in machine learning; see [55, 10] for recent reviews.

Some variants of dimension-reduction methods partition the spatial domain into sub-regions containing fewer spatial locations. Each sub-region is modeled using a GP, and the sub-region models are then hierarchically combined by borrowing information across the sub-regions. Examples include non-stationary models [4], multi-level and multi-resolution models [27, 52, 40, 35], and Bayesian treed GP models [31]. These models usually achieve scalability by assuming block-independence at some level of the hierarchy, usually across sub-regions, but may lose scalability when they borrow information across sub-regions. In an unrelated thread, [45] propose parameter estimation in the GP-based geostatistical model using resampling based on stochastic approximation. Besides being fully frequentist in nature, it is unclear how such an idea would extend to the analysis of more general nonstationary models with massive data.

The proposed DISK framework is a three-step approach for distributed Bayesian inference in any model based on a spatial process. First, we divide the n spatial locations into k subsets such that each subset has representative samples from all regions of the spatial domain. Second, we choose any spatial model and estimate the posterior distributions for inference and prediction in parallel across the k subsets after raising the likelihood to a power of k in each subset. The pseudo-posterior distribution obtained after modifying the likelihood for each subset of data is referred to as the "subset posterior." Since each subset posterior distribution conditions on a (1/k)-fraction of the full data, the modification by raising the likelihood to the power k ensures that the variance of each subset posterior distribution is of the same order (as a function of n) as that of the full data posterior distribution. Third, the k subset posterior distributions are combined into a single pseudo probability distribution, called the DISK pseudo-posterior (henceforth, DISK posterior), that conditions on the full data and replaces the computationally expensive full data posterior distribution for the purpose of prediction and inference. Computationally, the main innovations are in the first and third steps, where general partitioning and combining schemes are unavailable in process-based modeling of spatial data. Theoretically, we provide guarantees on the rate of decay of the Bayes L₂-risk in estimating the true residual spatial surface using the DISK posterior as a function of n, k, and analytic properties of the true spatial surface. This involves two new upper bounds. First, an upper bound for the Bayes risk of the DISK posterior is developed assuming each subset size tends to infinity. Second, we provide an in-depth analysis of the bias-variance tradeoff in estimating the true spatial surface using the DISK posterior and develop upper bounds on k (as a function of n) that lead to near-optimal performance as n tends to infinity.

Motivated by large and complex data, there has been significant interest in adopting the divide-and-conquer technique for distributed Bayesian inference [50, 76, 77, 32, 62, 33, 48, 80]. DISK differs significantly from these approaches in that it combines the collection of subset posterior distributions through their barycenter, a notion of geometric center that generalizes the Euclidean mean to a space of probability measures. There are recent approaches based on combining subset inferences, either through the posterior means and covariance matrices of parameters across the subsets [62] or by employing a "product of experts" [51, 17]. These approaches are intuitively appealing for GP regression, but they lead to theoretically sub-optimal uncertainty quantification [72] and are less suited when a GP or its derivative is embedded in a more general hierarchical model; for example, GP-based classification [37]. Recent developments on distributed variational GP [26] are impractical for massive n, and no theoretical results are available on the quality of approximation of the full posterior distribution. There are recent works on combining subset posterior distributions through their geometric centers, such as the mean or the median, but they are restricted to parametric models [46, 67, 44, 61, 47, 68]. Extensions to general nonparametric models, including those based on stochastic processes, are missing, except for some empirical results on nonparametric regression using GPs [46, 67] and theoretical guarantees for both these methods in the Gaussian white noise model [72]. Combining subset posterior distributions of a stochastic process is challenging for two major reasons. First, it requires estimation of a function, an infinite dimensional parameter. Second, and most importantly, stochastic process models induce complex dependencies among observations that are challenging to capture with a combination of subset posterior distributions that are estimated independently without accounting for the inter-subset dependence.

Divide-and-conquer nonparametric regression, which includes kriging as a special case, has received significant attention lately in the optimization literature, though the Bayesian literature is relatively sparse. The bias-variance decomposition of the L₂-risk in the estimation of the true regression function in divide-and-conquer kernel ridge regression is known [84, 81]. Bayesian divide-and-conquer nonparametric regression has mostly been studied from the theoretical perspective [11, 64, 65, 72]. Filling the methodological gap, the DISK framework provides a general approach to enhance the scalability of any process-based model for Bayesian nonparametric regression, including models based on spatial processes. For example, if the application of spatial process models is feasible for a subset of size m, then one can run them on k subsets in parallel, and DISK allows prediction and inference using n = km spatial locations. The values of k and m depend mainly upon the available computational resources and the model, but our theoretical results provide guidance on choosing k depending on the analytic properties of the true spatial process. For clarity of exposition, we illustrate the empirical performance of the DISK framework with the stationary GP and the modified predictive process (MPP) [24] priors. The MPP is a low-rank nonstationary GP prior that allows accurate modeling of spatial surfaces whose variability or smoothness changes with the location. The MPP, like any low-rank model, faces computational bottlenecks when n is large, and either computational efficiency or accuracy worsens when the MPP is applied to even 10⁵ observations. Our numerical results establish that DISK with the MPP prior scales to 10⁶ observations without compromising on either computational efficiency or accuracy in inference and prediction. We expect this conclusion to hold for other popular structured GP priors.

The remainder of the manuscript proceeds as follows. In Section 2 we outline a Bayesian hierarchical mixed model framework that incorporates models based on both the full-rank and the low-rank GP priors. Our DISK approach works with posterior samples from such models. Section 3 develops the DISK framework, discusses how to compute the DISK posterior distribution, and offers theoretical insights into DISK for general GPs and their approximations. A detailed simulation study followed by an analysis of the Pacific Ocean sea surface temperature data is presented in Section 4 to justify the use of DISK for real data. Finally, Section 5 discusses what DISK achieves and proposes a number of future directions to explore.

2 Hierarchical Bayesian inference for GP-based spatial models

Consider the standard univariate spatial regression model for the data observed at location s in a compact spatial domain D,

y(s) = x(s)ᵀ β + w(s) + ε(s),   s ∈ D,   (1)

where y(s) is the univariate response at s, x(s) is a p × 1 predictor vector at s, β is the p × 1 vector of predictor coefficients, w(s) is the realization of an unknown spatial function w(·) at s, and ε(s) is the realization of a white-noise process ε(·) at s that is independent of w(s). The Bayesian implementation of the model in (1) customarily assumes (a) that a priori β follows a Gaussian distribution with mean μ_β and covariance matrix Σ_β and (b) that w(·) and ε(·) a priori follow mean-0 GPs with covariance functions C_α and D_α that model cov{w(s), w(s′)} and cov{ε(s), ε(s′)}, respectively, where α denotes the process parameters indexing the two families of covariance functions C_α and D_α; therefore, the model parameters are Ω = {β, α}. If β = 0 in (1), then we obtain the setup for Bayesian nonparametric regression using a GP prior, with the locations s as covariates and y(s) as the response. The training data consist of the n × p predictor matrix X = [x(s₁) : ⋯ : x(sₙ)]ᵀ and the response vector y = {y(s₁), …, y(sₙ)}ᵀ observed at the n spatial locations S = {s₁, …, sₙ}.
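For concreteness, the following minimal sketch simulates one data set from (1) under assumed, illustrative choices made only for this example: an isotropic exponential covariance C_α(s, s′) = σ² exp(-φ‖s − s′‖) for w(·) and D_α(s, s′) = τ² 1(s = s′) for the white noise.

```python
# A minimal sketch of simulating from model (1); all settings are illustrative.
import numpy as np

rng = np.random.default_rng(6)
n, tau2, sigma2, phi = 200, 0.1, 1.0, 3.0
S = rng.random((n, 2))                                # locations in D = [0, 1]^2
X = np.column_stack([np.ones(n), rng.normal(size=n)]) # n x p predictor matrix
beta = np.array([1.0, -2.0])

dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
C = sigma2 * np.exp(-phi * dists)                     # C_alpha with alpha = (sigma2, phi)
w = rng.multivariate_normal(np.zeros(n), C)           # GP realization at S
y = X @ beta + w + rng.normal(scale=np.sqrt(tau2), size=n)
```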

Standard Markov chain Monte Carlo (MCMC) algorithms exist for performing posterior inference on Ω and the values of w(·) at a given set of locations S₀ = {s₀₁, …, s₀ℓ}, where S₀ ⊂ D, and for predicting y(s₀) for any s₀ ∈ D [4]. Given α, the prior assumptions on w(·) and ε(·) imply that w = {w(s₁), …, w(sₙ)}ᵀ and ε = {ε(s₁), …, ε(sₙ)}ᵀ are independent and follow N(0, C_α) and N(0, D_α), respectively, where N(μ, Σ) denotes the density of a multivariate Gaussian distribution of appropriate dimension with mean μ and covariance matrix Σ, and the (i, j)th entries of C_α and D_α are C_α(sᵢ, sⱼ) and D_α(sᵢ, sⱼ), respectively. The hierarchy in (1) is completed by assuming that α a priori follows a distribution with density π(α). With y the n × 1 response vector and X the n × p matrix of predictors defined above, the MCMC algorithm for sampling β, α, and w cycles through the following three steps until sufficient samples are drawn post convergence:

  1. Integrate over w in (1) and

    (a) sample β given α, y, and X from N(μ̃_β, Σ̃_β), where

      Σ̃_β = {Σ_β⁻¹ + Xᵀ (C_α + D_α)⁻¹ X}⁻¹,   μ̃_β = Σ̃_β {Σ_β⁻¹ μ_β + Xᵀ (C_α + D_α)⁻¹ y};   (2)

    (b) sample α given β, y, and X using the Metropolis-Hastings algorithm with a normal random walk proposal.

  2. Sample w₀ = {w(s₀₁), …, w(s₀ℓ)}ᵀ given β, α, y, and X from N(μ̃_w, Σ̃_w), where

    μ̃_w = C₀α (C_α + D_α)⁻¹ (y − Xβ),   Σ̃_w = C̄₀α − C₀α (C_α + D_α)⁻¹ C₀αᵀ,   (3)

    and C₀α and C̄₀α are ℓ × n and ℓ × ℓ matrices whose (i, j)th entries are C_α(s₀ᵢ, sⱼ) and C_α(s₀ᵢ, s₀ⱼ), respectively.

  3. Sample y(s₀) for every s₀ ∈ S₀ given β, w₀, and α from N{x(s₀)ᵀ β + w(s₀), D_α(s₀, s₀)}.

Many Bayesian spatial models can be formulated in terms of (1) by assuming different forms of C_α and D_α; see [4] and the supplementary material for details on the MCMC algorithm.
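For concreteness, a minimal sketch of step 1(a) follows: it draws β from the Gaussian full conditional in (2), with Sigma_y standing in for C_α + D_α. All function and variable names are illustrative; the detailed samplers are in the supplementary material.

```python
# A minimal sketch of step 1(a): sample beta ~ N(mu_tilde, Sigma_tilde) as in
# (2) after w has been integrated out of (1). Names are illustrative.
import numpy as np

def sample_beta(y, X, Sigma_y, mu_beta, Sigma_beta, rng):
    """Sigma_y plays the role of C_alpha + D_alpha, the marginal covariance of y."""
    Sy_inv_X = np.linalg.solve(Sigma_y, X)        # (C_alpha + D_alpha)^{-1} X
    Sy_inv_y = np.linalg.solve(Sigma_y, y)        # (C_alpha + D_alpha)^{-1} y
    prec = np.linalg.inv(Sigma_beta) + X.T @ Sy_inv_X
    Sigma_tilde = np.linalg.inv(prec)
    mu_tilde = Sigma_tilde @ (np.linalg.solve(Sigma_beta, mu_beta) + X.T @ Sy_inv_y)
    return rng.multivariate_normal(mu_tilde, Sigma_tilde)

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
Sigma_y = np.eye(n)                               # placeholder for C_alpha + D_alpha
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)
beta_draw = sample_beta(y, X, Sigma_y, np.zeros(p), 10.0 * np.eye(p), rng)
```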

MCMC computations face a major computational bottleneck due to matrix inversions involving C_α + D_α. The steps 1(a), 1(b), and 2 for sampling β, α, and w all involve the inversion of the n × n matrix C_α + D_α. Irrespective of the form of D_α, if no additional assumptions are made on the structure of C_α, then the three steps require O(n³) flops in computation and O(n²) memory units in storage in every MCMC iteration. Spatial models with this form of posterior computations are based on a full-rank GP prior. In practice, if n is in the tens of thousands or larger, then posterior computations in a model based on a full-rank GP prior are infeasible due to numerical issues in matrix inversions involving an unstructured C_α.

This problem is solved by imposing additional structure on C_α. Our focus is on those methods that impose a sparse or low-rank structure on the covariance function of a GP prior [55, 4]. Every method in this class expresses the covariance function in terms of r basis functions, in turn inducing a low-rank GP prior. We use the MPP as a representative example of this class of methods. Let S* = {s*₁, …, s*ᵣ} be a set of r locations, known as the "knots," which may or may not intersect with S. Let c_α(s) be an r × 1 vector with ith entry C_α(s, s*ᵢ) and C*_α be an r × r matrix whose (i, j)th entry is C_α(s*ᵢ, s*ⱼ). Using c_α(·) and C*_α, define the diagonal matrix whose ith diagonal entry is

δ(sᵢ) = C_α(sᵢ, sᵢ) − c_α(sᵢ)ᵀ (C*_α)⁻¹ c_α(sᵢ).   (4)

Let 1(s = s′) = 1 if s = s′ and 0 otherwise. Then, the MPP is a GP with the covariance function

C̃_α(s, s′) = c_α(s)ᵀ (C*_α)⁻¹ c_α(s′) + δ(s) 1(s = s′),   (5)

where C̃_α depends on the covariance function of the parent GP and the selected knots, which define c_α(·), C*_α, and δ(·). We have used a tilde in (5) to distinguish the covariance function of a low-rank GP prior from that of its parent full-rank GP. If C̃_α is the n × n matrix with (i, j)th entry C̃_α(sᵢ, sⱼ), then the posterior computations using the MPP, a low-rank GP prior, replace C_α by C̃_α in the steps 1(a), 1(b), and 2. The (low) rank structure imposed by C̃_α implies that computation requires O(nr² + r³) flops using the Woodbury formula [38].
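To make the O(nr² + r³) count concrete, the sketch below solves a linear system in a matrix of the form diag(d) + B (C*)⁻¹ Bᵀ, the "low rank plus diagonal" structure arising from (5) plus diagonal noise, by reducing it to an r × r system via the Woodbury identity. All names are illustrative.

```python
# A minimal sketch of the Woodbury identity behind low-rank GP computations:
# solve (diag(d) + B Cstar^{-1} B^T) x = v without forming an n x n matrix.
import numpy as np

def woodbury_solve(d, B, Cstar, v):
    Dinv_v = v / d                                   # diag(d)^{-1} v
    Dinv_B = B / d[:, None]                          # diag(d)^{-1} B
    small = Cstar + B.T @ Dinv_B                     # r x r system only
    return Dinv_v - Dinv_B @ np.linalg.solve(small, B.T @ Dinv_v)

rng = np.random.default_rng(1)
n, r = 500, 20
B = rng.normal(size=(n, r))
A = rng.normal(size=(r, r))
Cstar = A @ A.T + r * np.eye(r)                      # positive definite r x r
d = 1.0 + rng.random(n)                              # delta(s_i) + tau^2 terms
v = rng.normal(size=n)

x = woodbury_solve(d, B, Cstar, v)
full = np.diag(d) + B @ np.linalg.solve(Cstar, B.T)  # dense check, O(n^3)
assert np.allclose(full @ x, v)
```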

Spatial models based on a low-rank GP prior, including the MPP, still suffer from computational bottlenecks in massive data settings. The computational complexity of posterior computations for a rank-r GP prior is O(nr² + r³), which is linear in n; however, practical considerations often necessitate a large r for accurate inference and prediction. This severely limits the computational advantages of low-rank GP priors, including the MPP, especially in applications with large n [15]. The next few sections develop our DISK framework, which is key in extending any GP-based spatial model to massive data using the divide-and-conquer technique without compromising either on the quality of inference and prediction or on the computational cost.

3 Distributed Kriging

3.1 First step: partitioning of spatial locations

We partition the n spatial locations into k subsets. The value of k depends on the chosen spatial model and is large enough to ensure efficient posterior computations on any subset. The default partitioning scheme is to randomly allocate the n locations into k subsets, but we specify a technical condition later which ensures that every subset has locations from all regions of the spatial domain. For theoretical and notational simplicity, we also assume that the subsets are non-overlapping and that every subset has m spatial locations, so that n = km. Let S_j = {s_{j1}, …, s_{jm}} be the set of spatial locations in subset j (j = 1, …, k). Our simplifying assumptions imply that S_i ∩ S_j = ∅ for i ≠ j and S_1 ∪ ⋯ ∪ S_k = S, where each s_{ji} equals s_{i′} for some i′ ∈ {1, …, n} for every j and i. Denote the data in the jth partition as {S_j, X_j, y_j} (j = 1, …, k), where y_j is an m × 1 response vector and X_j is an m × p matrix of predictors corresponding to the spatial locations in S_j.
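A minimal sketch of the default scheme follows, assuming simple random allocation of the location indices; the technical condition on representative coverage is not enforced here, and all names are illustrative.

```python
# A minimal sketch of the first DISK step: randomly allocate the n location
# indices into k non-overlapping subsets of (roughly) equal size m.
import numpy as np

def partition_locations(n, k, rng):
    """Return a list of k index arrays forming a random partition of {0, ..., n-1}."""
    return np.array_split(rng.permutation(n), k)

rng = np.random.default_rng(2)
subsets = partition_locations(n=10_000, k=20, rng=rng)
assert sum(len(s) for s in subsets) == 10_000
```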

The univariate spatial regression model using either a full-rank or a low-rank GP prior for the data observed at any location s ∈ S_j is given by

y(s) = x(s)ᵀ β + w(s) + ε(s),   s ∈ S_j.   (6)

Let w_j = {w(s_{j1}), …, w(s_{jm})}ᵀ and ε_j = {ε(s_{j1}), …, ε(s_{jm})}ᵀ be the realizations of the GP w(·) and the white-noise process ε(·), respectively, in the jth subset. After marginalizing over w_j in the GP-based model for the jth subset, the likelihood of (β, α) is given by

ℓ_j(β, α) = N(y_j ∣ X_j β, Σ_{j,α}),   (7)

where N(y_j ∣ μ, Σ) represents the multivariate normal density of y_j with mean μ and covariance matrix Σ, Σ_{j,α} equals C_{j,α} + D_{j,α} and C̃_{j,α} + D_{j,α} for full-rank and low-rank GP priors, respectively, and C_{j,α}, C̃_{j,α}, and D_{j,α} are obtained by extending the definitions of C_α, C̃_α, and D_α to the jth subset. In a model based on a full-rank or a low-rank GP prior, the likelihood of y_j given β, w_j, and α is

N(y_j ∣ X_j β + w_j, D_{j,α}).   (8)

The likelihoods in (7) and (8) are used for defining the posterior distributions for Ω = {β, α} and w_j, called the jth subset posterior distributions, using a full-rank or a low-rank GP prior in subset j.

3.2 Second step: sampling from subset posterior distributions

We define subset posterior distributions by modifying the likelihoods in (7) and (8). More precisely, the density of the jth subset posterior distribution of (β, α) is given by

π_j(β, α ∣ y_j) ∝ {N(y_j ∣ X_j β, Σ_{j,α})}^k π(β) π(α),   (9)

where we assume that the normalizing constant of the right-hand side is finite, and the subscript 'j' denotes that the density conditions on samples in the jth subset. The modification of the likelihood to yield the subset posterior density in (9) is called stochastic approximation [46]. Raising the likelihood to the power of k is equivalent to replicating every y(s), s ∈ S_j, k times. Thus, stochastic approximation accounts for the fact that the jth subset posterior distribution conditions on a (1/k)-fraction of the full data and ensures that its variance is of the same order (as a function of n) as that of the full data posterior distribution. This is a common strategy adopted in divide-and-conquer Bayesian inference in parametric models; see, for example, [44, 80] for recent applications.
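The following sketch, with illustrative names, evaluates the unnormalized log density in (9) for one subset; the only change relative to an ordinary posterior is the factor k multiplying the Gaussian log-likelihood, which, as a function of the parameters, is equivalent to shrinking the covariance from Σ_{j,α} to Σ_{j,α}/k.

```python
# A minimal sketch of the stochastic approximation in (9): temper the subset
# Gaussian log-likelihood by k before adding the log-prior.
import numpy as np
from scipy.stats import multivariate_normal

def subset_log_posterior(beta, y_j, X_j, Sigma_j, k, log_prior):
    """Unnormalized log of {N(y_j | X_j beta, Sigma_j)}^k * pi(beta)."""
    loglik = multivariate_normal.logpdf(y_j, mean=X_j @ beta, cov=Sigma_j)
    return k * loglik + log_prior(beta)
```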

With the proposed stochastic approximation in (9), the full conditional densities of the jth subset posterior distribution for prediction and inference follow from their full data counterparts. The jth subset full conditional densities of β and α in the GP-based models are given by

π_j(β ∣ α, y_j) ∝ {N(y_j ∣ X_j β, Σ_{j,α})}^k N(β ∣ μ_β, Σ_β),   π_j(α ∣ β, y_j) ∝ {N(y_j ∣ X_j β, Σ_{j,α})}^k π(α),   (10)

where π(α) is the prior density of α, and we assume that the normalizing constants of both densities in (10) are finite. The jth subset full conditional densities of w_j and w(s₀) are calculated after modifying the likelihood of y_j in (8) using stochastic approximation. Given β, α, and y_j, a straightforward calculation yields that the jth subset posterior predictive density of w(s₀) is N{μ_j(s₀), σ²_j(s₀)}, where

μ_j(s₀) = c_{j,α}(s₀)ᵀ {C_{j,α} + (1/k) D_{j,α}}⁻¹ (y_j − X_j β),   σ²_j(s₀) = C_α(s₀, s₀) − c_{j,α}(s₀)ᵀ {C_{j,α} + (1/k) D_{j,α}}⁻¹ c_{j,α}(s₀),   (11)

where C_α and C_{j,α} are replaced by C̃_α and C̃_{j,α} for low-rank GP priors, and c_{j,α}(s₀) is the m × 1 vector obtained by extending the definition of C₀α in (3) to subset j for full-rank and low-rank GP priors with covariance functions C_α and C̃_α, respectively. We note that the stochastic approximation exponent scales D_{j,α} to (1/k) D_{j,α} so that the uncertainty in the subset posterior distributions is scaled to that of the full data posterior. The jth subset posterior predictive density of y(s₀) given the samples of β and w(s₀) in the jth subset is N{x(s₀)ᵀ β + w(s₀), D_α(s₀, s₀)}. We employ the same three-step sampling algorithm introduced earlier, specialized to subset j, sampling in each subset across multiple MCMC iterations; see the supplementary material for detailed derivations of the subset posterior sampling algorithms for the full-rank and low-rank GP priors.
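For intuition, the sketch below computes the subset kriging mean and variance of w(s₀) as in (11), assuming D_{j,α} = τ² I_m; setting k = 1 recovers the full data conditional in (3). Names are illustrative.

```python
# A minimal sketch of the subset predictive (11) under white noise D = tau2 I.
import numpy as np

def subset_krige(resid_j, C_j, c_0j, c_00, tau2, k):
    """Return the mean mu_j(s0) and variance sigma2_j(s0) of w(s0) given y_j.

    resid_j : y_j - X_j beta at the m subset locations
    C_j     : m x m prior covariance of w at the subset locations
    c_0j    : m-vector of prior covariances between w(s0) and w at S_j
    c_00    : prior variance of w(s0)
    tau2, k : white-noise variance and the stochastic-approximation factor
    """
    K = C_j + (tau2 / k) * np.eye(len(resid_j))
    mean = c_0j @ np.linalg.solve(K, resid_j)
    var = c_00 - c_0j @ np.linalg.solve(K, c_0j)
    return mean, var
```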

The computational complexity of the jth subset posterior computations follows from the full data counterparts by replacing n with m. Specifically, the computational complexities for sampling a subset posterior distribution are O(m³) and O(mr² + r³) flops per iteration if the model in (6) uses a full-rank or a low-rank GP prior, respectively. Since the subset posterior computations are performed in parallel across the k subsets, the computational complexities for obtaining T post burn-in subset posterior samples from the k subsets are O(Tm³) and O{T(mr² + r³)} flops in models based on full-rank and low-rank GP priors, respectively.
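Because the k subset samplers never communicate, they parallelize trivially across processes or machines. A minimal single-machine sketch follows, with fit_subset a hypothetical placeholder for any subset MCMC routine of this section.

```python
# A minimal sketch of embarrassingly parallel subset sampling.
from concurrent.futures import ProcessPoolExecutor

def fit_subset(subset_data):
    """Run a subset sampler on one (S_j, X_j, y_j) triple; placeholder body."""
    raise NotImplementedError  # plug in a full-rank or low-rank GP sampler

def fit_all_subsets(subset_datasets, max_workers=None):
    """Map fit_subset over the k subsets in parallel processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_subset, subset_datasets))
```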

The combination step of subset posteriors using the DISK framework outlined below is more widely applicable than other divide-and-conquer type approaches because it does not rely on any model- or data-specific assumptions, such as independence, except that every subset posterior distribution has a density with respect to the Lebesgue measure and has finite second moments.

3.3 Third step: combination of subset posterior distributions

3.3.1 Wasserstein distance and Wasserstein barycenter

The combination step relies on the Wasserstein barycenter, so we provide some background on this topic. Let (Θ, d) be a complete separable metric space and P(Θ) be the space of all probability measures on Θ. The Wasserstein space of order 2 is the set of probability distributions defined as

P₂(Θ) = {μ ∈ P(Θ) : ∫_Θ d²(θ, θ₀) dμ(θ) < ∞},   (12)

where θ₀ ∈ Θ is arbitrary and P₂(Θ) does not depend on the choice of θ₀. The Wasserstein distance of order 2, denoted as W₂, metrizes P₂(Θ). Let μ, ν be two probability measures in P₂(Θ) and Γ(μ, ν) be the set of all probability measures on Θ × Θ with marginals μ and ν; then the W₂ distance between μ and ν is defined as

W₂(μ, ν) = { inf_{γ ∈ Γ(μ, ν)} ∫_{Θ×Θ} d²(x, y) dγ(x, y) }^{1/2}.   (13)
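In one dimension, the infimum in (13) is attained by the monotone coupling, so W₂ equals the L₂ distance between quantile functions; for two samples of equal size it reduces to the root-mean-square difference of order statistics. This fact is what makes the quantile-based combination of Section 3.3.2 cheap. A small illustrative check, with assumed example values:

```python
# A minimal sketch of (13) in one dimension: for equal-size samples, the
# empirical W2 distance is the L2 distance between sorted samples.
import numpy as np

def w2_1d(x, y):
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=10_000)
b = rng.normal(2.0, 1.0, size=10_000)
print(w2_1d(a, b))   # approx 2.0, the W2 distance between N(0,1) and N(2,1)
```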

Let ν₁, …, ν_k ∈ P₂(Θ); then the Wasserstein barycenter of ν₁, …, ν_k is defined as

ν̄ = argmin_{ν ∈ P₂(Θ)} (1/k) Σ_{j=1}^{k} W₂²(νⱼ, ν).   (14)

It is known that ν̄ is the unique solution of a linear program [1].

If ν₁, …, ν_k in (14) represent the subset posterior distributions, then the Wasserstein barycenter provides a general notion of the mean of the subset posterior distributions. If the subset posteriors are combined using (14), then ν̄ has finite second moments, conditions on the full data, and does not rely on model-specific or data-specific assumptions. If ν₁, …, ν_k are analytically intractable but MCMC samples are available from them, then an empirical approximation of the Wasserstein barycenter can be estimated by solving a sparse linear program or by averaging empirical subset posterior quantiles [14, 9, 67, 44, 69].

3.3.2 Combination scheme

In the DISK framework, we combine the collections of posterior samples from the k subset posterior distributions for β, α, w, and y(s₀) through their respective Wasserstein barycenters. Our combination scheme relies on the following key result. If θ is a one-dimensional functional of the full data posterior distribution and the jth subset posterior distribution for θ is denoted by ν_j, then the ξth quantile of the Wasserstein barycenter ν̄ for θ, denoted as q̄_ξ, is estimated from the collection of subset posterior samples as

q̄_ξ = (1/k) Σ_{j=1}^{k} q̂_{jξ},   (15)

where ξ ranges over a grid of quantile levels with grid-size h, q̂_{jξ} is the estimate of the ξth quantile of ν_j obtained using MCMC samples from ν_j, and q̄_ξ is the resulting estimate of the ξth quantile of ν̄ [44].
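A minimal sketch of the estimator (15) follows, assuming the k arrays of MCMC draws for a one-dimensional functional are already available; the grid and all names are illustrative.

```python
# A minimal sketch of the combination rule (15): average the xi-th empirical
# quantiles across the k subset posterior sample sets, for each xi on a grid.
import numpy as np

def disk_quantiles(subset_samples, xi_grid):
    """subset_samples: list of k 1-d arrays of MCMC draws for one functional."""
    q = np.stack([np.quantile(s, xi_grid) for s in subset_samples])
    return q.mean(axis=0)                       # one averaged quantile per xi

rng = np.random.default_rng(4)
draws = [rng.normal(loc=rng.normal(0, 0.05), scale=1.0, size=5_000) for _ in range(10)]
xi_grid = np.linspace(0.005, 0.995, 199)
qbar = disk_quantiles(draws, xi_grid)           # empirical DISK quantile function
```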

The post burn-in samples from the k subset posterior distributions are combined using (15). Let {β_j^{(t)}, α_j^{(t)}, w_j^{(t)}, y_j^{(t)}(s₀)}, t = 1, …, T, j = 1, …, k, be the collection of T post burn-in MCMC samples from the k subsets, and let β_{ja}^{(t)} and α_{jb}^{(t)} be the tth post burn-in MCMC samples for the ath and bth marginals of β and α from their jth subset posteriors, where a = 1, …, p, b = 1, …, q, p is the dimension of β, and q is the dimension of α. If β̄, ᾱ, w̄, and ȳ(s₀) represent the random variables that follow the DISK posterior distributions for β, α, w, and y(s₀), then MCMC-based estimates of the DISK posterior are obtained through their quantile estimates as follows:

  1. Use (15) to estimate the ξth quantiles of the ath and bth marginals of the DISK posteriors for β̄ and ᾱ, respectively, as

    q̄_{aξ}(β) = (1/k) Σ_{j=1}^{k} q̂_{jaξ}(β),   q̄_{bξ}(α) = (1/k) Σ_{j=1}^{k} q̂_{jbξ}(α),   (16)

    where q̂_{jaξ}(β) and q̂_{jbξ}(α) are the estimates of the ξth quantiles of the ath and bth marginals of the jth subset posterior distributions for β and α obtained from the samples {β_{ja}^{(t)}} and {α_{jb}^{(t)}}, t = 1, …, T.

  2. Use (15) to estimate the pointwise ξth quantiles of w̄ and ȳ(s₀) as

    q̄_ξ(w) = (1/k) Σ_{j=1}^{k} q̂_{jξ}(w),   q̄_ξ{y(s₀)} = (1/k) Σ_{j=1}^{k} q̂_{jξ}{y(s₀)},   (17)

    where q̂_{jξ}(w) and q̂_{jξ}{y(s₀)} are the estimates of the ξth quantiles of the jth subset posterior distributions for w and y(s₀) obtained from the samples {w_j^{(t)}} and {y_j^{(t)}(s₀)}, t = 1, …, T.

A key feature of the DISK combination scheme is that, given the subset posterior samples, the combination step is agnostic to the choice of the model. Specifically, (15) remains the same for models based on a full-rank GP prior or a low-rank GP prior, such as the MPP, given MCMC samples from the subset posterior distributions. Since the averaging over k subsets takes O(k) flops per quantile, the total time for computing the empirical quantile estimates of the DISK posterior in inference or prediction is dominated by the subset sampling, requiring O(Tm³) and O{T(mr² + r³)} flops in models based on full-rank and low-rank GP priors, respectively. Assuming that we have abundant computational resources, k is chosen large enough so that the subset computations are feasible. This enables applications of the DISK framework in models based on both full-rank and low-rank GP priors in massive data settings.

3.4 Illustrative example: linear regression

We develop intuition behind the DISK posterior in this section by studying its theoretical properties in multivariate linear regression. The spatial linear model in (1) reduces to a linear regression model with error variance 1 if ε(s) follows N(0, 1) and there is no spatial effect; that is, w(s) = 0 for every s ∈ D. A flat prior on β implies that the "full data" posterior distribution of β given y, denoted as Π_n(β ∣ y), has density N{β ∣ β̂, (XᵀX)⁻¹}, where β̂ = (XᵀX)⁻¹ Xᵀ y. Using notations similar to the earlier sections, the jth subset posterior distribution has density N(β ∣ β̂_j, Σ_j), where β̂_j = (X_jᵀX_j)⁻¹ X_jᵀ y_j and Σ_j = (X_jᵀX_j)⁻¹/k (j = 1, …, k), and the factor of 1/k in the posterior covariance matrix comes from stochastic approximation. The Wasserstein barycenter of the k subset posterior distributions, denoted as Π̄(β ∣ y), is Gaussian with density N(β ∣ β̄, Σ̄), where

β̄ = (1/k) Σ_{j=1}^{k} β̂_j,   Σ̄ = (1/k) Σ_{j=1}^{k} (Σ̄^{1/2} Σ_j Σ̄^{1/2})^{1/2},   (18)

and Σ̄ is the unique positive definite solution of this fixed-point equation; see Section 6.3 in [1]. The DISK framework replaces Π_n(β ∣ y) with Π̄(β ∣ y) for inference and predictions in massive data settings.

The covariance matrix Σ̄ in (18) is generally estimated via fixed-point iterations; however, Σ̄ is analytically tractable when Σ₁, …, Σ_k have identical left and right singular vectors [2]. If U and V are p × p orthogonal matrices and D_j is the p × p diagonal matrix containing the singular values of Σ_j, so that Σ_j = U D_j Vᵀ for j = 1, …, k, then the mean vector and covariance matrix of Π̄(β ∣ y) reduce to β̄ = (1/k) Σ_{j=1}^{k} β̂_j and Σ̄ = U {(1/k) Σ_{j=1}^{k} D_j^{1/2}}² Vᵀ. Let β₀ be the true value of β, and let β̄ and β_f be the random variables with distributions Π̄(β ∣ y) and Π_n(β ∣ y), respectively. Assume that p is fixed and that the singular values in D₁, …, D_k are bounded above and below by universal positive constants; then the Bayes L₂-risk of the DISK posterior in estimating β₀, namely E_{β₀} E‖β̄ − β₀‖², where ‖·‖ is the Euclidean norm and E_{β₀} represents expectation with respect to the density of y obtained by fixing β = β₀ and w = 0 in (1), decays at the same rate as that of the full data posterior. This shows that the Bayes L₂-risk of the DISK posterior in estimating β₀ is upper bounded, up to constants, by the Bayes L₂-risk of the full data posterior distribution.
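As a numerical check on (18), the snippet below, with illustrative values only, verifies the analytically tractable case: when the subset covariances share eigenvectors, so that Σ_j = U D_j Uᵀ (a symmetric special case of shared singular vectors), the matrix Σ̄ = U {(1/k) Σ_j D_j^{1/2}}² Uᵀ satisfies the Gaussian barycenter fixed-point equation.

```python
# A minimal check of the Gaussian barycenter fixed point in (18) when the
# subset posterior covariances share eigenvectors. Values are illustrative.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(5)
k, p = 8, 3
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))       # shared eigenvectors U
Ds = [np.diag(rng.random(p) + 0.1) for _ in range(k)]
Sigmas = [Q @ D @ Q.T for D in Ds]

Dbar = (sum(np.sqrt(D) for D in Ds) / k) ** 2      # ((1/k) sum_j D_j^{1/2})^2
Sigma_bar = Q @ Dbar @ Q.T

root = sqrtm(Sigma_bar)
rhs = sum(sqrtm(root @ S @ root) for S in Sigmas) / k
assert np.allclose(Sigma_bar, rhs.real, atol=1e-6)
```

In one dimension this reduces to averaging the subset posterior standard deviations, which is exactly what the quantile-averaging rule (15) produces for Gaussian subset posteriors.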

A result like the one above is crucial in developing intuition about the DISK posterior as an alternative to the full data posterior, and such results are generally difficult to come by in the literature for the models presented in Section 2. In the next two sections, we develop theoretical guarantees for the DISK posterior as an alternative to the full data posterior in estimating the residual spatial surface.

3.5 Bayes L₂-risk of DISK: General convergence rates

Consider the special case where w(s) = 0 for every s in (1). The regression model reduces to a finite dimensional model with parameters Ω, and the DISK posterior reduces to the recently developed WASP [68]. If Ω₀ is the true value of Ω, then it is known that, when the data are independent, the WASP converges in probability to a Dirac measure centered at Ω₀ or to the full data posterior distribution of Ω at a near optimal rate, under certain assumptions, as n → ∞ [68, 44]; however, in models based on a spatial process, inference on the infinite dimensional true residual spatial surface, denoted as w₀(·), is of primary importance, and no formal results are available in this regard. A notable exception is [72], which shows that combination using the Wasserstein barycenter has optimal Bayes risk and adapts to the smoothness of the true surface in the Gaussian white noise model. This model is a special case of (1) with additional smoothness assumptions on w₀(·).

We focus on providing theoretical guarantees for the DISK posterior distribution in estimating w₀(·) using the spatial regression model in (1) with a GP prior on w(·), including low-rank GP priors such as the MPP, as n → ∞. Recall that S = {s₁, …, sₙ} is the set of observed locations in D and that n = km. Let w₀ = {w₀(s₁), …, w₀(sₙ)}ᵀ be the true residual spatial surface generating the data at the locations in S and w = {w(s₁), …, w(sₙ)}ᵀ be the realization of the GP at the locations in S. Following the standard theoretical setup in [73], we assume that β = 0, that α is known, and that D_α(s, s′) = τ² 1(s = s′), where τ² is the known non-spatial error variance, so that (1) reduces to

y(sᵢ) = w(sᵢ) + ε(sᵢ),   ε(sᵢ) ~ N(0, τ²),   i = 1, …, n.   (19)

This setup subsumes the low-rank GP priors, hence the MPP, because the MPP is a GP with the covariance function C̃_α in (5). We also assume that the data are generated with w = w₀; that is, y(sᵢ) = w₀(sᵢ) + ε(sᵢ) for i = 1, …, n.

Adapting our discussion in Section 3.2 for the models in (6) to the one in (19), we have that y_j given w_j follows a Gaussian distribution with density proportional to {N(y_j ∣ w_j, τ² I_m)}^k after stochastic approximation, and the GP prior on w(·) implies that, after integrating over w_j,

y_j ~ N{0, C_{j,α} + (τ²/k) I_m},   (20)

where C_{j,α}, I_m, and the tempering factor k are as in (11). Let C_α and I_n represent the full data versions of C_{j,α} and I_m in (20). For any