Multi-resolution approximations of Gaussian processes for large spatial datasets

10/24/2017
by Matthias Katzfuss et al.

Gaussian processes are popular and flexible models for spatial, temporal, and functional data, but they are computationally infeasible for large datasets. We discuss Gaussian-process approximations that use basis functions at multiple resolutions to achieve fast inference and that can (approximately) represent any covariance structure. We consider two special cases of this multi-resolution-approximation framework, a taper version and a domain-partitioning (block) version. We describe theoretical properties and inference procedures, and study the computational complexity of the methods. Numerical comparisons and an application to satellite data are also provided.

1 Introduction

Gaussian processes (GPs) are highly popular models for spatial data, time series, and functions. They are flexible and allow natural uncertainty quantification, but their computational complexity is cubic in the data size. This prohibits GPs from being used directly for the analysis of many modern datasets consisting of a large number of observations, such as satellite remote-sensing data.

Consequently, many approximations or assumptions have been proposed that allow the application of GPs to large spatial datasets. Some of these approaches are most appropriate for capturing fine-scale structure (e.g., Furrer et al., 2006; Kaufman et al., 2008), while others are better suited to capturing large-scale structure (e.g., Higdon, 1998; Mardia et al., 1998; Wikle and Cressie, 1999; Cressie and Johannesson, 2008; Katzfuss and Cressie, 2009, 2011, 2012). Lindgren et al. (2011) proposed an approximation based on viewing a GP with Matérn covariance as the solution to the corresponding stochastic partial differential equation. Vecchia's method and its extensions (e.g., Vecchia, 1988; Stein et al., 2004; Datta et al., 2016; Katzfuss and Guinness, 2017) are discontinuous and assume the so-called screening effect to hold, meaning that any given observation is conditionally independent of other observations given a small subset of (typically nearby) observations.

We propose here a class of multi-resolution approximations (M-RAs) of GPs, which allows capturing spatial structure at all scales. The M-RA framework is based on an orthogonal decomposition of the GP of interest into processes at multiple resolutions, obtained by iteratively applying the predictive process (Quiñonero-Candela and Rasmussen, 2005; Banerjee et al., 2008). The process at each resolution has an equivalent representation as a weighted sum of spatial basis functions. As the resolution increases, the number of basis functions increases while their scale decreases. Unlike other multi-resolution models or wavelets (e.g., Chui, 1992; Johannesson et al., 2007; Cressie and Johannesson, 2008; Nychka et al., 2015), our M-RA automatically specifies the basis functions and the prior distributions of their weights to adapt to the given covariance function of interest, without requiring any restrictions on this covariance function. Thus, in contrast to other approaches, it is clear which covariance structure is approximated by the sum of basis functions in the M-RA.

To achieve computational feasibility within the M-RA framework, an approximation of the "remainder process" at each resolution using so-called modulating functions is necessary. We consider two special cases: For the M-RA-taper, the modulating functions are taken to be tapering functions (i.e., compactly supported correlation functions). For increasing resolution, the remainder process is approximated with increasingly restrictive tapering functions, leading to increasingly sparse matrices. In contrast, the M-RA-block iteratively splits each region at each resolution into a set of subregions, with the remainder process assumed to be independent between these subregions. This can lead to discontinuities at the region boundaries. A special case of the M-RA-block (Katzfuss, 2017) performed very well in a recent comparison of different methods for large spatial data (Heaton et al., 2017). A further special case of the M-RA with only one resolution is given by the full-scale approximation (Snelson and Ghahramani, 2007; Sang et al., 2011; Sang and Huang, 2012).

The M-RA is suitable for inference based on large numbers of observations from a GP, which may be irregularly spaced. We will describe inference procedures that rely on operations on sparse matrices for computational feasibility. The M-RA-block can deal with massive datasets with tens of millions of observations or more, as it is amenable to parallel computations on modern distributed computing systems. It can be viewed as a Vecchia-type approximation (Katzfuss and Guinness, 2017), and the approximated covariance matrix is a so-called hierarchical off-diagonal low-rank matrix (e.g., Ambikasaran et al., 2016). The M-RA-taper leads to more general sparse matrices, and thus requires more careful algorithms to fully exploit the sparsity structure, but it has the advantage of not introducing artificial discontinuities.

Relative to the M-RA-block in Katzfuss (2017), the contributions of our paper are the following: We introduce a general framework for M-RAs that provides a new, intuitive perspective on this approach. This allows an extension of the M-RA-block of Katzfuss (2017) that removes the requirement that knots at the finest resolution correspond to the observed locations, and it enables us to introduce a novel M-RA-taper approach that extends the ideas of Sang and Huang (2012) to more than one resolution. We provide further insight into the theoretical and computational properties of both versions of the M-RA. We also include further implementation details and numerical comparisons.

This article is organized as follows. In Section 2, we first describe an exact orthogonal multi-resolution decomposition of a GP, which then leads to the M-RA framework and the two special cases described above by applying the appropriate modulating functions. We also study their theoretical properties. In Section 3, we discuss the algorithms necessary for statistical inference using the M-RA and provide details of the computational complexity. Numerical comparisons on simulated and real data are given in Sections 4 and 5, respectively. We conclude in Section 6. All proofs can be found in Appendix A. Additional simulation results can be found in a separate Supplementary Material document. All code will be provided upon publication.

2 Multi-resolution approximations

2.1 The true Gaussian process

Let y_0(·) be the true spatial field or process of interest on a continuous (non-gridded) domain D ⊂ R^d. We assume that y_0(·) is a zero-mean Gaussian process with covariance function C_0. We place no restrictions on C_0, other than assuming that it is a positive-definite function that is known up to a vector of parameters, θ. In real applications, y_0(·) will often not have mean zero, but estimating and subtracting the mean is usually not a computational problem. Once y_0(·) has been observed at a set of n spatial locations, the basic goal of spatial statistics is to make (likelihood-based) inference on the parameters θ and to obtain spatial predictions of y_0(·) at a set of unobserved locations (i.e., to obtain the corresponding posterior distribution). Direct computation based on the Cholesky decomposition of the resulting n × n covariance matrix requires O(n^3) time and O(n^2) memory, which is computationally infeasible for large n.
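To make this cost concrete, the following minimal sketch (in Julia, the language of the implementation mentioned in Section 4) evaluates an exact GP likelihood via a dense Cholesky factorization; the O(n^3) factorization of the n × n covariance matrix is the bottleneck referred to above. All names, parameter values, and the exponential covariance used here are illustrative choices, not the authors'.

```julia
using LinearAlgebra

# Exponential covariance as a stand-in for a generic covariance function C0.
expcov(s1, s2; sigma2=1.0, rho=0.1) = sigma2 * exp(-norm(s1 - s2) / rho)

# Direct (exact) GP inference: O(n^2) memory for the covariance matrix and
# O(n^3) time for its dense Cholesky factorization -- the bottleneck.
function direct_gp_loglik(locs, z, covfun, nugget)
    n = length(z)
    C = [covfun(locs[i], locs[j]) for i in 1:n, j in 1:n] + nugget * I
    F = cholesky(Symmetric(C))                  # O(n^3)
    alpha = F \ z
    return -0.5 * (dot(z, alpha) + logdet(F) + n * log(2pi))
end

n = 2_000                                       # already slow; hopeless for n >> 10^5
locs = [rand(2) for _ in 1:n]                   # random locations in the unit square
z = randn(n)                                    # placeholder data
println(direct_gp_loglik(locs, z, expcov, 0.05))
```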

2.2 Preliminaries

A multi-resolution approximation with M resolutions (M-RA) requires two main "ingredients": knots and modulating functions. The multi-resolution set of knots, Q = Q_0 ∪ … ∪ Q_M, is chosen such that, for each resolution m = 0, …, M, Q_m is a set of r_m knots in D. We assume that the number of knots increases with resolution (i.e., r_0 ≤ r_1 ≤ … ≤ r_M). An illustration of such a set of knots in a simple toy example is given in Figure 1.

The second ingredient is a set of "modulating functions" (Sang et al., 2011), T_0, …, T_M, where each T_m is a symmetric, nonnegative-definite function on D × D. In Section 2.5 we will consider two specific examples, but for now we merely require that T_m(s_1, s_2) is equal to 1 when s_1 = s_2, and (exactly) equal to 0 when s_1 and s_2 are far apart. Here, the meaning of "far" depends on the resolution m, in that with increasing m, the modulating function should be equal to zero for increasingly large sets of pairs of locations in D.

Based on these ingredients, we make two definitions:

Definition 1 (Predictive process).

For a generic Gaussian process y(·) with covariance function C and a set of knots Q ⊂ D, define the predictive-process approximation (Quiñonero-Candela and Rasmussen, 2005; Banerjee et al., 2008) of y(·) based on the knots Q as its conditional expectation given the process values at the knots,

τ(s) := E[ y(s) | y(Q) ] = b(s)' y(Q),  s ∈ D,

where b(s)' = C(s, Q) C(Q, Q)^{-1}, and y(Q) denotes the vector of process values at the knots, with y(Q) ~ N(0, C(Q, Q)).

That is, the predictive process τ(·) is a conditional expectation, and hence a smooth, low-rank approximation of y(·), which can also be written as a linear combination of basis functions b(·) (cf. Katzfuss, 2013). Further, the remainder y(·) − τ(·) is independent of τ(·), with positive-definite covariance function C(s_1, s_2) − b(s_1)' C(Q, Q) b(s_2) (Sang and Huang, 2012).
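As a concrete numerical illustration of Definition 1, the sketch below computes the predictive-process basis vector, the implied low-rank covariance, and the remainder covariance for a generic covariance function and knot set. The names (pp_basis, Q, etc.) and the exponential covariance are our illustrative choices.

```julia
using LinearAlgebra

expcov(s1, s2; rho=0.1) = exp(-abs(s1 - s2) / rho)

# Predictive process of a zero-mean GP with covariance `cov` based on knots Q:
# tau(s) = E[y(s) | y(Q)] = b(s)' y(Q), with b(s)' = cov(s, Q) cov(Q, Q)^{-1}.
function pp_basis(s, Q, cov)
    CQ = [cov(q1, q2) for q1 in Q, q2 in Q]
    cs = [cov(s, q) for q in Q]
    return cholesky(Symmetric(CQ)) \ cs          # the vector b(s)
end

# Implied predictive-process covariance and remainder covariance:
pp_cov(s1, s2, Q, cov)  = dot([cov(s1, q) for q in Q], pp_basis(s2, Q, cov))
rem_cov(s1, s2, Q, cov) = cov(s1, s2) - pp_cov(s1, s2, Q, cov)

Q = collect(range(0, 1, length=10))              # ten knots on [0, 1]
println(pp_cov(0.3, 0.35, Q, expcov), "  ", rem_cov(0.3, 0.35, Q, expcov))
```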

Definition 2 (Modulated process).

For a Gaussian process y(·) with covariance function C and a modulating function T, define the "modulated" process corresponding to y(·) to be a Gaussian process y_T(·) with covariance function

C_T(s_1, s_2) := C(s_1, s_2) T(s_1, s_2).

We see that y_T(·) and y(·) have the same variance structure (because T(s, s) = 1), but y_T(·) has a compactly supported covariance function that becomes an increasingly poor approximation of C(s_1, s_2) as the distance between s_1 and s_2 increases.

(a) M-RA-taper
(b) M-RA-block
Figure 1: For a process with exponential covariance function on a one-dimensional domain, a set of multi-resolution knots (black dots) and the corresponding basis functions using the orthogonal decomposition in (1) (black lines) and using two versions of the M-RA (red lines). The M-RA-block is exact in this setting (see Proposition 6), and hence the red and black lines overlap.

2.3 Exact multi-resolution decompositions of Gaussian processes

For any Gaussian process y_0(·) (as specified in Section 2.1), using Definition 1, we can write y_0(·) = τ_0(·) + δ_1(·), where τ_0(·) is the predictive process of y_0(·) based on the knots Q_0, and the remainder δ_1(·) = y_0(·) − τ_0(·) is independent of τ_0(·) and is itself a Gaussian process with a (positive-definite) covariance function. This allows us to apply the predictive process again, this time to δ_1(·) based on the knots Q_1, to obtain the decomposition y_0(·) = τ_0(·) + τ_1(·) + δ_2(·), and so forth, up to some resolution M.

This idea enables us to exactly decompose any such y_0(·) into orthogonal (i.e., independent) components at multiple resolutions:

(1)

where τ_m(·) denotes the predictive process of the resolution-m remainder δ_m(·) based on the knots Q_m, with δ_0(·) := y_0(·) and δ_{m+1}(·) := δ_m(·) − τ_m(·). Further, using the basis-function representation from Definition 1, we can write each component of the decomposition as a linear combination of basis functions with jointly Gaussian weights; starting with the basis functions at resolution 0, the basis functions at the higher resolutions are obtained recursively:

(2)

An important feature of this decomposition is that components with low resolution capture mostly smooth, long-range dependence, whereas high-resolution components capture mostly fine-scale, local structure. This is because the predictive process at each resolution is an approximation to the leading terms in the Karhunen-Loève expansion of the corresponding remainder process (Sang and Huang, 2012). Figure 1 illustrates the resulting basis functions in our toy example.

It is straightforward to show that the decomposition of the process in (1) also implies an equivalent decomposition of the covariance function C_0:

(3)
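To make the recursive construction in this subsection concrete, the following minimal Julia sketch (knot choices and names are ours, purely for illustration) carries out the decomposition numerically for an exponential covariance on a grid: at each resolution, the predictive-process covariance of the current remainder is computed from a set of knots and subtracted, and the sum of the resulting components plus the final remainder reproduces the original covariance matrix, in the spirit of (3).

```julia
using LinearAlgebra

expcov(s1, s2) = exp(-abs(s1 - s2) / 0.3)

# Exact multi-resolution decomposition of a covariance matrix: at each
# resolution, project the current remainder onto a set of knots (predictive
# process) and carry the new remainder forward. For simplicity, the knots are
# specified as indices into the evaluation grid `locs` and kept distinct
# across resolutions (cf. Proposition 2).
function decompose(locs, knot_idx_by_res, cov)
    R = [cov(s1, s2) for s1 in locs, s2 in locs]   # remainder covariance; starts at C0
    components = Matrix{Float64}[]
    for idx in knot_idx_by_res
        B  = R[:, idx] / Symmetric(R[idx, idx])    # R(., Q) R(Q, Q)^{-1}
        pp = B * R[idx, :]                         # predictive-process covariance
        push!(components, pp)
        R -= pp                                    # remainder for the next resolution
    end
    return components, R
end

locs  = collect(0:0.01:1)
knots = [1:50:101, 6:10:96]                        # coarse knots, then finer knots
comps, rem = decompose(locs, knots, expcov)
C0 = [expcov(s1, s2) for s1 in locs, s2 in locs]
println(maximum(abs.(sum(comps) + rem - C0)))      # ~0: components (+ remainder) recover C0
```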

2.4 The multi-resolution approximation

The multi-resolution approximation (M-RA) is a "modulated" version of the exact decomposition in (1), which at each resolution m modulates the remainder using the modulating function T_m from Section 2.2. The key idea is that the predictive processes at low resolutions pick up the low-frequency variation in y_0(·), resulting in remainder terms that exhibit variability on smaller and smaller scales as m increases, and so approximating the remainder using more and more restrictive modulating functions causes little approximation error.

Definition 3 (Multi-resolution approximation (M-RA)).

For a given number of resolutions M, the M-RA of a process y_0(·) based on a set of knots Q and a set of modulating functions T_0, …, T_M is given by

(4)

where the components, basis functions, and weight distributions are defined recursively, analogously to (2), except that at each resolution m the remainder covariance is modulated by T_m before the predictive process is applied; specifically,

(5)

Figure 1 shows the M-RA basis functions in our toy example. As can be seen, the M-RA is similar to a wavelet model, in that for increasing resolution m, we have an increasing number of basis functions with increasingly compact support. However, in contrast to wavelets, the basis functions and the precision matrix of the corresponding weights in the M-RA adapt to the covariance function C_0. Defining the basis functions recursively allows the M-RA to approximate C_0, while in other approaches (e.g., wavelets, or Nychka et al., 2015) with explicit expressions for the basis functions, the resulting covariance is less clear.

For ease of notation, we often stack the basis functions and the corresponding coefficients across all resolutions, so that

(6)

where the stacked vector of basis functions and the stacked vector of weights collect the corresponding quantities from all M + 1 resolutions.

2.5 Specific examples

As described in Section 2.2, the M-RA requires the choice of two ingredients: knots and modulating functions. In light of the computational complexities discussed in Sections 3.2 and 3.3 below, we introduce a factor J, often chosen to be equal to 2 or 4. Then, starting with some (small) number of knots r_0 at resolution 0, we henceforth assume that the number of knots increases by the factor J at each resolution (i.e., r_m = J^m r_0) for m = 1, …, M.
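As an illustration of this knot-growth scheme (under the assumption stated above that r_m = J^m r_0), the following sketch generates simple equally spaced knot sets on a one-dimensional domain; the cell-midpoint placement is our choice and, for J = 2, also keeps the knots distinct across resolutions (cf. Proposition 2).

```julia
# Equally spaced "cell midpoint" knots on D = [0, 1], with r_m = r0 * J^m knots
# at resolution m; for J = 2 the midpoints are automatically distinct across
# resolutions. Placement and counts are illustrative only.
knots_at(k) = collect(range(0.5 / k, 1 - 0.5 / k, length=k))
mr_knots(r0, J, M) = [knots_at(r0 * J^m) for m in 0:M]

knots = mr_knots(3, 2, 3)
println(length.(knots))        # [3, 6, 12, 24]
```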

Regarding the modulating functions, we will now discuss two choices that lead to two important versions of the M-RA.

2.5.1 M-RA-block

To define the M-RA-block, we need a recursive partitioning of the spatial domain D, in which the domain is divided into J regions, each of which is again divided into J smaller subregions, and so forth, up to level M:

We then assume for each resolution m that the modulated remainder is independent across the subregions at the mth resolution. That is, the modulating function T_m is defined as

(7)

Simply speaking, we have T_m(s_1, s_2) = 1 if s_1 and s_2 lie in the same region at resolution m, and T_m(s_1, s_2) = 0 otherwise. At resolution m, the domain is split into J^m subregions. Typically, we assume that the knots at each resolution are roughly equally spread throughout the domain, so that there are roughly the same number of knots in every such region.
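A minimal sketch of such a block modulating function on a one-dimensional domain, recursively split into J equal subregions per level; the region indexing is ours and purely illustrative.

```julia
# Block modulating function on D = [0, 1), with the domain split into J^m equal
# subregions at resolution m: T_m(s1, s2) = 1 if s1 and s2 lie in the same
# resolution-m region and 0 otherwise.
region(s, m, J) = floor(Int, s * J^m)           # index of the region containing s

T_block(s1, s2, m, J) = region(s1, m, J) == region(s2, m, J) ? 1.0 : 0.0

println(T_block(0.30, 0.40, 0, 2))   # 1.0: resolution 0 = whole domain (no split)
println(T_block(0.30, 0.40, 1, 2))   # 1.0: same half of the domain
println(T_block(0.30, 0.40, 3, 2))   # 0.0: different subregions at resolution 3
```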

The M-RA-block and the corresponding domain partitioning are illustrated in a toy example in Figure 1(b). The M-RA-block was first proposed in Katzfuss (2017) with the restriction that the knots at the finest resolution coincide with the observed locations. Another special case, with only one resolution, is the block-full-scale approximation (Snelson and Ghahramani, 2007; Sang et al., 2011).

2.5.2 M-RA-taper

We can also specify the modulating functions to be compactly supported correlation functions, often referred to as tapering functions. For simplicity, we assume here that each modulating function is of the form T_m(s_1, s_2) = K(‖s_1 − s_2‖ / γ_m), with taper ranges γ_m that decrease with the resolution m, where ‖·‖ is some norm on R^d (d being the dimension of D), and K is a compactly supported correlation function that is scaled such that K(x) = 0 for all x ≥ 1. For simplicity, we will use Kanter's function (Kanter, 1997) in all data examples.

For other possible choices of tapering functions, see Gneiting (2002). The M-RA-taper is illustrated in Figure 1(a). A special case of the M-RA-taper with only one resolution is the taper-full-scale approximation (Sang and Huang, 2012; Katzfuss, 2013).
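The sketch below illustrates a taper-type modulating function. A Wendland-type compactly supported correlation is used here as a simple stand-in for Kanter's function, and the geometric decay of the taper ranges is an assumed, illustrative choice rather than the paper's specification.

```julia
# Taper-type modulating function: a compactly supported correlation applied to
# scaled distance. Wendland taper and taper-range decay are illustrative only.
wendland(x) = x < 1 ? (1 - x)^4 * (4x + 1) : 0.0       # equals 1 at x = 0, zero for x >= 1

gamma_m(m; gamma0=0.5, J=2, d=1) = gamma0 / J^(m / d)   # shrinking taper ranges

T_taper(s1, s2, m) = wendland(abs(s1 - s2) / gamma_m(m))

println(T_taper(0.30, 0.40, 0))   # > 0: within the resolution-0 taper range
println(T_taper(0.30, 0.40, 3))   # 0.0: beyond the much shorter resolution-3 range
```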

2.6 Properties of the M-RA process

Throughout this subsection, we consider the M-RA (as described in Definition 3) of y_0(·) on the domain D, based on the knots Q and the modulating functions T_0, …, T_M. All proofs are given in Appendix A.

Proposition 1 (Distribution of the M-RA).

The M-RA is a Gaussian process with a covariance function of the form

where the quantities involved are defined in (5). We call this covariance function the M-RA of the covariance function C_0.

Proposition 2 (Duplication of knots).

If a knot at resolution m already appears as a knot at a lower resolution l < m, then the corresponding basis function at resolution m is identically zero.

This proposition implies that there is no benefit in designating the same locations as knots at multiple resolutions; that is, all knot locations in Q should be unique.

Proposition 3 (Exact variance).

If a location s is a knot (i.e., s ∈ Q), then the M-RA variance at s is exact; that is, it equals the true variance C_0(s, s).

This proposition implies that, in contrast to other recent basis-function approaches (e.g., Lindgren et al., 2011; Nychka et al., 2015), no variance or "edge" correction is needed for the M-RA if we place a knot at each observed and prediction location.

Smoothness (i.e., differentiability) is a very important concept in spatial statistics, which has led to the popularity of the Matérn covariance class with a parameter that flexibly regulates differentiability (e.g., Stein, 1999). The following proposition shows that any desired smoothness can be preserved when applying the M-RA:

Proposition 4 (Smoothness).

If y_0(·) is exactly k times (mean-square) differentiable at s ∈ D, then the M-RA process is also exactly k times differentiable at s, provided that the modulating functions T_0, …, T_M are all at least k times differentiable there.

Many commonly used covariance functions (e.g., Matérn) are infinitely differentiable away from the origin. If C_0 is such a covariance function, the M-RA-block thus has the same smoothness as the original process at any location s that is not located on the boundary between subregions at any resolution (cf. Katzfuss, 2017). Tapering functions are often smooth away from the origin, except at the distance at which they become exactly zero. Thus, the M-RA-taper will typically have the same smoothness at s as y_0(·) if the taper function is sufficiently differentiable at the origin and s is not exactly at the taper distance γ_m from any knot, for all m. Note that this result does not require the smoothness of y_0(·) to be the same at all locations; if the smoothness (or other local characteristics) of the covariance function varies over space, the M-RA will automatically adapt to this nonstationarity and vary over space accordingly.

There is, however, an issue with the continuity of the M-RA-block process at the region boundaries, which can be highly undesirable in prediction maps:

Proposition 5 (Continuity).

Assume that C_0 is a continuous function. Then, for the M-RA-taper, realizations of the corresponding process and the posterior mean (i.e., kriging prediction) surface based on observations as in (8) are both continuous, assuming that T_m is continuous for all m. In contrast, for the M-RA-block, both are in general discontinuous at any s on the boundary between any two subregions.

Proposition 6 (Exactness of M-RA-block).

Let C_0 be a (stationary) exponential covariance function on the real line (D ⊂ R). Further, consider the corresponding M-RA-block (see Section 2.5.1), with the knots placed such that at each resolution m, a knot is located on each boundary between two subregions at resolution m + 1. Then, the M-RA is exact at every knot location; that is, the M-RA covariance function agrees with C_0 whenever both arguments are knot locations.

This proposition is illustrated in Figure 1(b). As we will see in Section 3.2, this result allows us to exactly decompose an exponential covariance matrix in terms of a sparse matrix with n rows but only a small number of nonzero elements per row, which leads to tremendous computational savings for very large n (e.g., n in the billions).

While the exact result in Proposition 6 relies on the Markov property and the exact screening effect of the exponential covariance function (which is a Matérn covariance with smoothness parameter 1/2), similar but approximate results are expected to hold for larger smoothness parameters in one dimension. Specifically, Stein (2011) shows that an asymptotic screening effect holds for a certain larger smoothness parameter when using conditioning sets of size 2, and he conjectures that an asymptotic screening effect holds more generally when using sufficiently large conditioning sets. This conjecture is also explored numerically in Katzfuss and Guinness (2017). To exploit this screening effect using the M-RA-block, we can simply place knots near every subregion boundary.

3 Inference

In this section, we describe inference for the M-RA, based on a vector of n measurements taken at spatial locations in D. We assume additive, independent measurement error, such that

(8)

where the measurement-error covariance matrix is diagonal. We assume that the covariance function C_0 and the measurement-error variances are fully determined by the parameter vector θ, which will be assumed fixed at a particular value, unless noted otherwise. For the sparsity and complexity calculations, we additionally assume that the number of knots at the finest resolution is comparable to n.

3.1 General inference results

3.1.1 Prior matrices

For a given set of parameters θ, the covariance function C_0, and hence the basis functions and the precision matrix in (6), are fixed. The prerequisite for inference is to calculate the corresponding prior matrices: the basis functions evaluated at the observed locations and the prior precision (or covariance) matrices of the weights at each resolution. For m = 1, …, M, starting with the quantities at resolution 0, it is straightforward to verify the recursions

(9)

and

(10)

Here, ∘ denotes the Hadamard or element-wise product. Note that these prior matrices both grow in dimension and become increasingly sparse with increasing resolution m; an entry is exactly zero whenever the modulating function at the corresponding resolution is zero for the associated pair of locations.

3.1.2 Posterior inference

Once the prior matrices have been obtained, the posterior distribution of the unknown vector of basis-function weights is given by well-known formulas for conjugate normal-normal Bayesian models:

(11)

where the posterior precision matrix is the sum of the prior precision matrix and a data-information term formed from the basis functions at the observed locations, scaled by the inverse noise variances.

Based on this posterior distribution of the weights, the likelihood can be written as (e.g., Katzfuss and Hammerling, 2017):

(12)

Using this expression, the likelihood can be evaluated quickly for any given value of the parameter vector θ. This allows us to carry out likelihood-based inference (e.g., maximum likelihood or Metropolis-Hastings) on the parameters in the covariance function and the measurement-error variances, by computing the quantities in (9)–(12) for each parameter value.
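For concreteness, here is a generic sketch of these conjugate normal-normal computations for a basis-function model z = Bη + ε with η ~ N(0, Λ^{-1}) and diagonal error covariance. The names (B, Lambda, ve) are ours, and the formulas are the standard posterior and marginal-likelihood identities (via the Woodbury and matrix-determinant lemmas); they are meant to mirror, not reproduce, the paper's expressions (11)–(12).

```julia
using LinearAlgebra

# Conjugate normal-normal computations for z = B * eta + eps,
# eta ~ N(0, inv(Lambda)), eps ~ N(0, Diagonal(ve)). Names are illustrative.
function posterior_and_loglik(B, Lambda, ve, z)
    n = length(z)
    Vinv = Diagonal(1 ./ ve)
    post_prec = Symmetric(Lambda + B' * Vinv * B)        # posterior precision of eta
    F = cholesky(post_prec)                               # sparse Cholesky if inputs are sparse
    rhs = B' * (Vinv * z)
    post_mean = F \ rhs                                   # posterior mean of eta
    quad = dot(z, Vinv * z) - dot(rhs, post_mean)         # z' * inv(C) * z via Woodbury
    logdetC = logdet(F) - logdet(cholesky(Symmetric(Matrix(Lambda)))) + sum(log.(ve))
    return post_mean, post_prec, -0.5 * (quad + logdetC + n * log(2pi))
end

# Tiny dense example with placeholder numbers:
B = randn(50, 8); Lambda = Matrix(1.0I, 8, 8); ve = fill(0.1, 50); z = randn(50)
println(posterior_and_loglik(B, Lambda, ve, z)[3])
```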

To obtain spatial predictions at a set of prediction locations for fixed parameters θ, note that the M-RA process at any location is a linear combination of the basis functions evaluated there, so the required basis functions at the prediction locations can be computed using the same recursions as in Section 3.1.1. The posterior predictive distribution is then given by

(13)

Hence, the main computational effort required for inference is the Cholesky decomposition of the posterior precision matrix of the basis-function weights in (11). As the prior precision matrix and the basis-function matrices are both sparse, the posterior precision matrix is sparse and can be decomposed quickly. Specifically, it has a block structure across resolutions, in which an element of the block for resolutions l and m is zero if there is no observed location that lies within the support of the modulating functions of both of the corresponding knots at resolutions l and m. Figure 2 shows the sparsity structures of the prior and posterior matrices corresponding to the toy example in Figure 1.

(a)–(c) Prior and posterior matrices for the M-RA-taper
(d)–(f) Prior and posterior matrices for the M-RA-block
Figure 2: Illustration of the sparsity of the prior and posterior matrices from Section 3.1 for the toy example in Figure 1. Resolutions are separated by solid black lines. Top row: M-RA-taper. Bottom row: M-RA-block.

3.1.3 Inference in the absence of measurement error

If there is no measurement error (i.e., the error variances are zero), the observations are exact realizations of the M-RA process at the observed locations, and the formulas above no longer apply directly. To ensure that the relevant matrices have full rank in this case, we assume that the total number of basis functions (knots) is at least n and (in light of Proposition 2) that the knots are unique. The likelihood can then still be calculated based on sparse decompositions of the quantities from Section 3.1.1.

3.2 Inference details for the M-RA-block

For the M-RA-block from Section 2.5.1, the prior and posterior matrices from Section 3.1 are block-sparse, with each block corresponding to (the knots in) a pair of regions and containing roughly r_0 × r_0 elements.

As noted in Section 3.1.1, entries corresponding to pairs of locations in different regions at resolution m are zero, and so the prior matrix at resolution m is block-diagonal, with one diagonal block for the (roughly r_0) knots at resolution m that lie in each region. It is well known that the inverse of a block-diagonal matrix is block-diagonal with the same structure, and so the prior calculations in Section 3.1.1 involving this inverse can be carried out at low computational cost.
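A small sketch of the point made above: a block-diagonal matrix can be factored or inverted block by block, and its inverse retains the block-diagonal structure, so the cost is the sum of the small per-block costs rather than one large factorization. Block sizes and names are illustrative.

```julia
using LinearAlgebra

# Invert a block-diagonal matrix block by block; the inverse has the same
# block-diagonal structure.
blockdiag_inverse(blocks) = [inv(cholesky(Symmetric(Bk))) for Bk in blocks]

blocks = [Matrix(Symmetric(A * A' + I)) for A in (randn(3, 3), randn(4, 4))]
println(size.(blockdiag_inverse(blocks)))      # [(3, 3), (4, 4)]
```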

For the posterior precision matrix, we have from Section 3.1.2 that the block corresponding to a region at resolution l and a region at resolution m is zero if no observed location lies in both regions; for the nested partitioning, this means the block is zero unless one region contains the other. The Cholesky factor of an (appropriately reordered) matrix with this particular block-sparse structure has zero fill-in, and so the decomposition can be computed very rapidly.

Katzfuss (2017) describes an algorithm for inference in a special case of the M-RA-block that can be extended to the more general M-RA-block considered here. This algorithm is well suited for parallel and distributed computations for massive datasets, and it leads to efficient storage of the full posterior predictive distribution in (13). Expressions for its time and memory complexity are also derived there.

3.3 Inference details for the M-RA-taper

The M-RA-taper from Section 2.5.2 also results in sparse matrices, but care must be taken to ensure computational feasibility. A crucial observation for the computational results below is that for any location s and any resolution m, only a small, bounded number of knots from Q_m lie within the taper range γ_m of s, because we assumed that the knots at resolution m are roughly equally spread over the domain and that the taper ranges shrink as the number of knots grows.

First, consider calculation of the prior matrices as described in Section 3.1.1. These matrices are sparse, because an entry is zero whenever the corresponding modulating function vanishes for the associated pair of locations. Before carrying out the actual inference procedures, it is helpful to pre-calculate the sets of nonzero indices of these matrices; finding all pairs of locations within a given distance of each other can typically be done efficiently (e.g., Vaidya, 1989). In the actual inference procedure, we then only need to calculate the elements of the matrices in (9) corresponding to these index sets. The main difficulty herein is that while the prior covariance matrix of the weights at each resolution is sparse, its inverse is not. However, we only need to compute certain elements of this inverse:

Proposition 7.

For and , the matrix can be obtained by computing

(14)

where the index sets are as defined above. Thus, an element of the inverse is not required for calculating the prior matrices in (9) if it lies outside these index sets.
The total time complexity for computing all prior matrices in (9) then scales with the number of required elements, ignoring the cost of computing the selected elements of the inverses.

To calculate the required elements of the inverse, we use a selected inversion algorithm (Erisman and Tinney, 1975; Li et al., 2008; Lin et al., 2011) in which an element is regarded as a structural zero only if it lies outside the pre-computed index sets. This algorithm has the same computational complexity as the Cholesky decomposition of the same matrix. For one-dimensional domains (d = 1), the matrix to be decomposed is banded, and so the time complexity to compute its Cholesky decomposition (and selected inverse) is linear in its dimension (e.g., Gelfand et al., 2010, p. 187). For d > 1, the rows and columns should be ordered such that the Cholesky decomposition leads to (near) minimal fill-in and hence fast computations. Functions for this reordering are readily available in most statistical or linear-algebra software. The discussion in Furrer et al. (2006) indicates that the resulting time complexity for the Cholesky decomposition is roughly linear in the matrix dimension for d = 2. Moreover, our numerical experiments showed that the selected inversions only account for a small fraction of the total time required to compute the prior matrices, and so they do not drive the overall cost of computing these matrices.
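As an illustration of the reordering remark, in Julia a sparse symmetric positive-definite matrix can be factorized with `cholesky`, which chooses a fill-reducing permutation internally (CHOLMOD); the tridiagonal test matrix below is only a stand-in for the sparse prior and posterior matrices arising in the M-RA-taper.

```julia
using LinearAlgebra, SparseArrays

# Sparse Cholesky with an internal fill-reducing reordering (CHOLMOD).
n = 2_000
A = spdiagm(-1 => fill(-0.3, n - 1), 0 => fill(1.0, n), 1 => fill(-0.3, n - 1))
F = cholesky(Symmetric(A))          # factorization with internal reordering
b = randn(n)
x = F \ b                           # solve using the sparse factor
println(norm(A * x - b))            # residual check, ~0
```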

Once the prior matrices have been obtained, posterior inference requires computing and decomposing the posterior precision matrix in (11), whose blocks correspond to pairs of resolutions. Each element of such a block is a sum, over the observed locations, of products of basis functions at the two resolutions. As each observed location lies within the taper range of only a bounded number of knots at each resolution, each block, and hence the entire posterior precision matrix, can be computed efficiently.

Proposition 8.

The number of nonzero elements in is .

The time complexity for obtaining the Cholesky decomposition of the posterior precision matrix is difficult to quantify, as it depends on the sparsity structure and the chosen ordering, but again our numerical experiments showed that the contribution of the Cholesky decomposition to the overall computation time is relatively small when appropriate reordering algorithms are used.

For prediction, the posterior covariance matrix in (13) is dense and hence cannot be obtained explicitly for a large number of prediction locations. However, the posterior covariance matrix of a moderate number of linear combinations of predictions can be obtained, again based on a Cholesky decomposition of the posterior precision matrix.

In summary, the time and memory complexity of the M-RA-taper are driven by the numbers of knots within the taper ranges, plus the cost of computing the Cholesky decompositions of the prior and posterior precision matrices; these decompositions only accounted for a relatively small amount of the overall computation time in our numerical experiments. Thus, the time complexity of the M-RA-taper is roughly cubic in the number of knots within a taper range, while the corresponding dependence is only quadratic for the M-RA-block. Note that the computational cost for the M-RA-taper can be further reduced if the covariance function has a small effective range relative to the size of the domain, because then the covariance can already be tapered at resolution 0 without causing a large approximation error; in contrast, for the M-RA-block, the modulating function at resolution 0 is always identically equal to 1. As explained in Katzfuss (2017), it is often appropriate to expect a good approximation when the number of resolutions grows like log n (with a small, fixed number of knots per region), which results in quasilinear complexity as a function of n for the M-RA.

4 Simulation study

For this section, we used data simulated from a true Gaussian process to compare the M-RA-block and M-RA-taper to the corresponding full-scale approximations, FSA-block (Sang et al., 2011) and FSA-taper (Sang and Huang, 2012), which are the one-resolution special cases of the M-RA-block and M-RA-taper, respectively. An implementation of the methods in Julia (http://julialang.org) version 0.4.5 was run on a 16-core machine with 64GB of RAM.

The true Gaussian process was assumed to have mean zero and an exponential covariance function,

(15)

with fixed variance and range parameters, on a one-dimensional (d = 1) or two-dimensional (d = 2) domain. We assumed a nonzero nugget or measurement-error variance. Results for Matérn covariances with different range, smoothness, and variance parameters showed patterns similar to those presented below and can be found in the Supplementary Material.
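For reference, here is a minimal sketch of simulating such data (a zero-mean GP with exponential covariance plus a nugget) on a one-dimensional domain; the parameter values are placeholders, not necessarily those used in the study.

```julia
using LinearAlgebra, Random

# Simulate from a zero-mean GP with exponential covariance plus a nugget.
Random.seed!(1)
n, sigma2, rho, nugget = 500, 1.0, 0.2, 0.05
locs = sort(rand(n))                                       # locations on [0, 1]
C = [sigma2 * exp(-abs(s1 - s2) / rho) for s1 in locs, s2 in locs]
L = cholesky(Symmetric(C + 1e-10 * I)).L                   # jitter for numerical stability
y = L * randn(n)                                           # latent GP realization
z = y + sqrt(nugget) * randn(n)                            # noisy observations
```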