# Sparse Pseudo-input Local Kriging for Large Non-stationary Spatial Datasets with Exogenous Variables

Gaussian process (GP) regression is a powerful tool for building predictive models for spatial systems. However, it does not scale efficiently for large datasets. Particularly, for high-dimensional spatial datasets, i.e., spatial datasets that contain exogenous variables, the performance of GP regression further deteriorates. This paper presents the Sparse Pseudo-input Local Kriging (SPLK) which approximates the full GP for spatial datasets with exogenous variables. SPLK employs orthogonal cuts which decompose the domain into smaller subdomains and then applies a sparse approximation of the full GP in each subdomain. We obtain the continuity of the global predictor by imposing continuity constraints on the boundaries of the neighboring subdomains. The domain decomposition scheme applies independent covariance structures in each region, and as a result, SPLK captures heterogeneous covariance structures. SPLK achieves computational efficiency by utilizing sparse approximation in each subdomain which enables SPLK to accommodate large subdomains that contain many data points and possess a homogenous covariance structure. We Apply the proposed method to real and simulated datasets. We conclude that the combination of orthogonal cuts and sparse approximation makes the proposed method an efficient algorithm for high-dimensional large spatial datasets.

## Authors

• 1 publication
• 4 publications
• ### Patchwork Kriging for Large-scale Gaussian Process Regression

This paper presents a new approach for Gaussian process (GP) regression ...
01/23/2017 ∙ by Chiwoo Park, et al. ∙ 0

• ### Prediction Using a Bayesian Heteroscedastic Composite Gaussian Process

This research proposes a flexible Bayesian extension of the composite Ga...
06/25/2019 ∙ by Casey B. Davis, et al. ∙ 0

• ### Variable noise and dimensionality reduction for sparse Gaussian processes

The sparse pseudo-input Gaussian process (SPGP) is a new approximation m...
06/27/2012 ∙ by Edward Snelson, et al. ∙ 0

• ### Sparse Additive Gaussian Process Regression

In this paper we introduce a novel model for Gaussian process (GP) regre...
08/23/2019 ∙ by Hengrui Luo, et al. ∙ 0

• ### Efficient Multiscale Gaussian Process Regression using Hierarchical Clustering

Standard Gaussian Process (GP) regression, a powerful machine learning t...
11/06/2015 ∙ by Z. Zhang, et al. ∙ 0

• ### Gaussian Process Model for Estimating Piecewise Continuous Regression Functions

This paper presents a Gaussian process (GP) model for estimating piecewi...
04/13/2021 ∙ by Chiwoo Park, et al. ∙ 0

• ### Standing Wave Decomposition Gaussian Process

We propose a Standing Wave Decomposition (SWD) approximation to Gaussian...
03/09/2018 ∙ by Chi-Ken Lu, et al. ∙ 1

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Recent advances in data collection technologies in Geostatistics, such as ubiquitous satellites around the planet, have created an unprecedented opportunity for utilizing data-driven predictive models. Particularly, Kriging (Cressie, 1990), also known as Gaussian Process (GP) regression (Rasmussen and Williams, 2006)

, can theoretically benefit from very large datasets since it is a non-parametric model whose flexibility and performance generally increase by having more data points

(Hastie et al., 2009). However, the computational complexity hinders efficient application of GP regression to large datasets. Specifically, GP regression is dominated by the inversion of covariance matrices which is of , where is the number of data points. A significant body of research in GP regression deals with constructing an approximated covariance matrix, which is less costly to invert. These approximation methods either result in sparse or low-ranked covariance matrices, which reduce the cost of matrix inversion (see Silverman (1985); Snelson and Ghahramani (2006); Tresp (2000); Du et al. (2009)).

The fact that most spatial data are non-stationary presents still another challenge, i.e., the covariance structure changes for different locations in the space, primarily when the behavior of the response variable strongly depends on the underlying geology

(Kim et al., 2005). Thus, in the context of spatial data, GP regression needs to efficiently address the large number of data points while effectively capturing the inhomogeneous covariance structure.

One idea that empowers GP to tackle inhomogeneous data structures is through the class of local Kriging that assumes distinct covariance functions for each region of the domain (Haas, 1990). Based on this idea, we can decompose the domain into smaller subdomains and apply local GP in each subdomain. As such, local Kriging breaks down the global covariance matrix to smaller local matrices, and reduces the total computational complexity to , where is the number of local data points. This idea, however, also presents a challenge: the discontinuities in prediction on boundaries of the subdomains. The Domain Decomposition Method (DDM) is one method in the class Local Kriging that guarantees the continuity of GP on boundaries by defining a few designated control points on each boundary and imposing equality of neighboring local predictors at these points (Park et al., 2011). The superiority of DDM over similar methods such as weighted average techniques (Urtasun and Darrell, 2008; Rasmussen and Ghahramani, 2002; Gramacy and Lee, 2008) emerges as a result of effectively handling the boundary conditions, although the exponential growth in the number of control points as the dimension of the domain increases hinders DDM’s ability to handle dimensional domains greater than two. This is a major drawback since some spatial datasets contain not only the location information as inputs, but also other geographical/meteorological information which results in a dataset with higher dimensional inputs.

Motivated by these challenges, we propose a new method for local Kriging which preserves the continuity of GP in higher dimensional spaces. The new method employs a flexible policy to partition domains, thus minimizing the number of boundaries for the subdomains. Such partitioning may end in large subdomains that contain a large number of data points, which makes the application of a full GP impractical in each subdomain. In order to obviate this obstacle, we utilize methods of sparse approximation for each region. The computational complexity of this algorithm is , where is the number of pseudo-data points in each subdomain, i.e., the data points that help to achieve sparse approximation. The major contribution of our method is that it preserves the continuity of GP while it handles non-stationarity for high dimensional spatial datasets.

The remainder of the paper is organized as follows. Section 2 briefly introduces GP regression, along with SPGP and DDM. Section 3 presents the proposed method. Section 4 compares the proposed method with competing algorithms. Section 5 concludes the paper and suggests future research.

## 2 Gaussian Process Regression

Given a probability space

, and an index Set , where is the sample space, is a sigma-algebra, and is a probability measure, a Gaussian stochastic process, or simply a Gaussian process (GP), is a stochastic process where for any as a finite subset of

, the random vector

follows a multivariate normal distribution

(Rasmussen and Williams, 2006), where is a measurable function from to for a given . Here, we are not interested in GPs that model the time evolution of a process, i.e., when the index set is simply time (Wiener, 1949), but instead we focus on statistical learning (Hastie et al., 2009). Specifically, we consider the index set to be a subset of . Then, for a given , the random vector follows a multivariate normal distribution, where , for .

We call a training dataset, and the goal is to infer for a given , based on the information in and the GP assumption. That is, assuming , where , and and are the mean vector and covariance matrix, respectively, we seek , i.e., the conditional distribution of , given . We note that the mean vector and the covariance matrix , where denotes the expectation operator. If the covariance between and is determined solely by the values of their inputs, and , we can denote the covariance by , where , and , the th and th elements of , respectively. For to be a valid covariance function, it must be positive definite, i.e.,

 ∫k(x,x′)f(x)f(x′)ν(dx)ν(dx′)>0, (1)

for any where is a measure. In fact, a GP is uniquely determined by its mean function and covariance function  (Rasmussen and Williams, 2006). The covariance function plays an important role since it defines the notion of similarity between different points on the domain and determines the sample path properties of the GP. Furthermore, the form of the covariance function specifies the computational complexity of the GP regression, which we discuss next.

In most practical settings, the training dataset consists of noise-contaminated observations , i.e., , where . Specifically, we have and seek , the predictive distribution at , where . When , for

, we call the problem of estimating the predictive distribution

, given the noisy observations , the GP regression.

There are equivalent ways to present the inference based on the GP regression, for example through regularized optimization (Poggio and Girosi, 1990), optimal spatial prediction (Cressie and Wikle, 2011)

, or an extension of Bayesian linear regression. In this paper, we adopt a Bayesian treatment from

Rasmussen and Williams (2006), where we specify a prior for and then calculate the predictive distribution at . Given a training dataset , assume the prior distribution , where for . This implies , where is the identity matrix. To find the marginal distribution of , we can integrate out , i.e.,

 (2)

which results in another normal distribution, namely . In other words, we have a new GP whose covariance function is , where is the Kronecker delta.

Therefore, based on the joint normality assumption,

 (y,f∗)∼N(0,[K+σ2IKD∗KTD∗k∗∗]),

where , is an vector of covariances between the inputs in and , and

is the variance at

. Hence,

 (f∗|y)∼N(KTD∗(K+σ2I)−1y,k∗∗−KTD∗(K+σ2I)−1KD∗), (3)

i.e., the predictive distribution of at is normal whose mean is a linear combination of the observations. The predictive variance is smaller than the prior variance by a factor determined by the covariance between the test point and the training points. We are interested in the fact that the mean of the predictive distribution is a linear combination of the observation in the training dataset. Among all the unbiased linear predictors for the mean of the predictive distribution, i.e., , the mean in (3) has the smallest expected prediction error, .

Note that implementing the GP regression entails inverting matrices of size , where is the number of training data points. Hence, the computational complexity is of , which can be too slow for many practical applications, in particular, spatial statistics. Approximating the covariance matrix as a method to reduce the computational cost has been studied by (Snelson and Ghahramani, 2006; Smola and Bartlett, 2001; Williams and Seeger, 2001; Lawrence et al., 2003; Pourhabib et al., 2014; Park et al., 2011). In this paper, we present a method that utilizes a covariance structure, which effectively captures the non-stationarity in the data and, at the same time, is computationally efficient. Before presenting the proposed method, we discuss two existing methods in more detail.

### 2.1 Sparse Pseudo-input Gaussian Process

Sparse Pseudo-input Gaussian Process (SPGP) is a low-ranked covariance approximation method that uses a set of unobserved pseudo-data points , where . The set denotes realization of another GP with the same covariance function as the underlying GP, . Unlike the training dataset , the pseudo-data points in are not actual observations, but they are a new set of parameters which SPGP utilizes to approximate the likelihood of GP (Snelson and Ghahramani, 2006). Specifically, instead of using the existing data points in to build a Nyström approximation of the covariance matrix (Williams and Seeger, 2001), SPGP uses to build

 QD,¯D=KD¯DK−1¯DK¯DD, (4)

where is the covariance matrix at the pseudo-data locations and is the covariance matrix between the data points in and . To achieve a computationally efficient GP, assume that the observations are conditionally independent, given the pseudo-outputs , i.e.,

 p(y|¯f)=N∏i=1p(yi|¯f)=N∏i=1N(Kxi¯DK−1¯D¯f,Kxixi−Kxi¯DK−1¯DK¯Dxi+σ2) (5) =N(KD¯DK−1¯D¯f,diag(KD−QD¯D)+σ2I).

Then, putting a prior distribution over as and integrating it out yields

 p(y)=∫p(y|¯f)p(¯f)d¯f∼N(0,QN+diag(KD−QD)+σ2I). (6)

Therefore, we obtain the predictive distribution by conditioning the prior distribution on as follows:

 μ(f∗|y)=QtD∗(QD+diag(KD−QD,¯D)σ2I)−1y, (7) σ2(f∗|y)=k∗∗−(QD¯D+diag(KD−QD,¯D)σ2I)−1QD∗, (8)

where is the low-ranked covariance matrix (4) in which we use the test point instead of .

Calculating (7) and (8) still requires inverting the matrices of size ; however, by taking advantage of the matrix inversion lemma (Rasmussen and Williams, 2006) the computational complexity reduces to

. Moreover, when pseudo-data points are considered as parameters of the model, their locations are learnt simultaneously along with the hyperparameters of the covariance function through maximizing the marginal likelihood, which is constructed based on the low-ranked covariance

.

SPGP can be considered as a GP with a particular covariance function

 KSPGP(x,x′)=Q(x,x′)+δ(x,x)(K(x,x′)−Q(x,x)), (9)

where is the Kronecker delta, and  (Snelson, 2007). In addition, we can obtain SPGP as an extension of the sparse approximation framework discussed in Quiñonero-Candela and Rasmussen (2005). Despite improving the computational complexity of the full GP, SPGP’s performance depends on the number of pseudo-data points used, which in general should be large enough to obtain good prediction results. Moreover, SPGP cannot effectively handle the non-stationarity in the data. The latter issue can be dealt by selecting a non-stationary covariance matrix ; however, learning hyperparameters of non-stationary covariance functions is not as straightforward as that for stationary ones.

### 2.2 Domain Decomposition Method

Local Kriging refers to a class of methods that treats the challenges of GP application to spatial data, i.e., handling non-stationarity and dealing with large datasets, in a unifying framework. The idea that pairs of data points correlate less as the distance between the points increases, suggests accounting only for the neighboring data points. In other words, we decompose the domain into smaller pieces, or subdomains, in order to use local predictors, i.e., a GP trained only on the data in a specific subdomain. We can train these local predictors independently with respect to the local data points, and as a result handle the non-stationarity, because we assume differnet covariance functions for each subdomain. Moreover, dealing with local covariance matrices instead of a global covariance matrix, which we can invert more efficiently, reduces the computational complexity to , where is the number of data points in each subdomain. Note that this reduction in training time by considering neighboring points is advantageous in real time learning where decision making in short intervals is crucial (Nguyen-Tuong et al., 2009). We call these methods, which do not impose any continuity constraint on the boundaries, naïve local Kriging. Note that using naïve local Kriging methods deteriorates the prediction accuracy for test points close to boundaries, since neighboring local models are not continuous on their joint boundary as a result of independent training on their respective subdomains. Various weighted average techniques have tried to alleviate the discontinuity problem and improve Local Kriging methods by adhering together the local predictors on boundaries, such as Bayesian committee machine (Tresp, 2000), local probabilistic regression (Urtasun and Darrell, 2008), mixture of Gaussian process experts (Rasmussen and Ghahramani, 2002), and the treed Gaussian process model (Gramacy and Lee, 2008). Although weighted average techniques mitigate the boundary condition problem, they cannot speed up the original algorithm as efficiently as naïve local Kriging methods, due to the computational burden exerted on the algorithm.

Domain Decomposition Method (DDM) (Park et al., 2011), a new method in local Kriging, treats the discontinuities in prediction on the boundaries of subdomains with less computational burden. Noting that the mean of the GP posterior function is the optimal (unbiased) linear predictor, consider as an unbiased linear predictor for ,

 μ(x∗)=u(x∗)ty, (10)

where , i.e., unbiasedness, and . Then define the optimization problem which finds a vector that minimizes the mean squared prediction error,

 E{μ(x∗)−f∗}2=u(x∗)t(σ2I+KD)u(x∗)−2u(x∗)tKD∗+K∗, (11)

where is the identity matrix. We can show that the solution for has the form of , where is a matrix. Therefore we can obtain the best linear predictor by solving the following minimization problem with respect to :

 minAKtD∗At(σ2I+KD)AKD∗−2KtD∗AtKD∗. (12)

Solving optimization (12) is still in the order of . To overcome this problem, DDM decomposes the domain of the data into smaller subdomains for , and tries to solve the local optimization problems as follows:

 minAjKtDj∗Atj(σ2jI+KDj)AjKDj∗−2KtDj∗AtjKDj∗∀Ωj&x∗∈Ωj. (13)

To address the discontinuity problem at the domain boundaries, DDM imposes continuity constraints on the optimization problem. Specifically, define as the boundary of two neighboring subdomains and and let denote the data points on this boundary. DDM imposes that the two local predictors defined on and must be equal at these points, specifically

 KtDjbjkAtjyj=KtDkbjkAtkyk. (14)

In order to systematically insert constraint (14) into the local optimizers, DDM locates control points on each boundary and creates a polynomial function of a certain degree where

, which interpolates these

control points. To obtain such a polynomial function, we utilize Lagrange basis polynomials. Therefore, the local problems will be

 minAjKtDj∗Atj(σ2jI+KDj)AjKDj∗−2KtDj∗AtjKDj∗x∗∈Ωj∀Ωj&x∗∈Ωjsubject toKtDjbjkAtjyj=Ttjkrjk k∈neighborhood(j), (15)

where is the matrix of Lagrange basis polynomials at and is the evaluation of the function in control points on the boundary. By solving (15), the local mean predictor becomes

 ^μj(x∗)=KtDj∗(σ2jI+KDj)−1yj+¯cj, (16)

where is a correction term. Intuitively, we interpret the local mean predictor as the sum of two terms. The first term is the standard local GP predictor and the second term appears because of the continuity constraint, which approaches to zero as the test data points move away from boundaries. See Park et al. (2011) for the derivation of and specific implementation details.

DDM uses rectangular meshing, i.e., it partitions the domain into smaller rectangular subdomains, and assumes each boundary is shared by at most two subdomains. As such, it can effectively be applied only to two-dimensional spatial datasets. Furthermore, since DDM utilizes a full GP in each subdomain, the size of subdomians should be small enough so that the overall procedure runs efficiently.

## 3 Sparse Pseudo-input Local Kriging

We propose a GP approximation method which can effectively address the non-stationarity in large spatial datasets. The task is contingent upon an efficient partitioning of the domain such that in each region the data is stationary and fast GP predictors can be applied to each subdomain. In addition, the local predictors need to seamlessly connect on the boundaries. Section 3.1 presents the partitioning scheme which enables us to partition domains in larger dimensional space, i.e., for , and Section 3.2 presents the formulas for mean and variance predictions. We discuss specific details about implementation in Sections 3.3 and 3.4.

### 3.1 Orthogonal Cuts in the Domain

Recall denotes the training dataset. Assume is a -dimensional box whose boundaries are either orthogonal or parallel to each other, i.e., is a -orthope or hyper-rectangle (Coxeter, 1973), which is a generalization of the notion of a rectangle to higher dimensional spaces. Köppen (2000) shows that the boundaries of a hyper-rectangle that resides in a -dimensional space are lower dimensional orthopes, and the number of such -orthopes are

 Ndm=2d−m(dm), (17)

for , where each of -orthopes are parallel to each other. We denote as vertex, as edge, and as the face of the -dimensional orthopes.

Also, we define an orthogonal cut on a

orthope domain as a hyperplane, i.e., a (

)-dimensional linear space embedded in a -dimensional space, which is either orthogonal or parallel to each boundary; consequently, each -orthope can be cut at most from directions. The proof of the recent claim is straightforward, since if we move the domain to the origin of the space and rotate it so that all the boundaries become orthogonal to one of the axis of the space, the above statement is equivalent to cutting the -orthope with respect to each axis, which is possible in ways. Figure 1 illustrates an example of orthogonal and non-orthogonal hyperplanes cutting a three-orthope domain.

Recall that after cutting the domain of the training data and applying a GP in each subdomain, we need to connect two neighboring local predictors on their joint boundary, i.e., the intersection of the cut and the domain. Define a few control points on the boundaries and stitching the local predictors in these points together. Note that the number of control points cannot be arbitrarily large, since the additional computations reduce the efficiency of the algorithm; consequently, we need to control the growth of the number of boundary points. Denote the total number of control points in the whole domain as , number of subdomains as , number of boundaries of each subdomain as and number of control points on each boundary as . is proportional to these three factors, i.e.,

 T(q)∝qSΓ. (18)

The growth of is dominated by the exponential growth of , since the boundary spaces expand with an exponential rate and more control points are required for higher dimensional spaces. Note that this is not a critical issue for most spatial datasets, since the dimension of the dataset is generally smaller than . Therefore, controlling the other two factors curbs the growth of .

Simple orthogonal cutting may result in very large subdomain, where a full GP is inefficient. However, as Section 3.2 discusses, we fit a sparse GP for each subdomains. This enables us to expand the size of subdomains, and reduce to as little as two by halving the domain. See section 4.3 for several considerations to tune to an appropriate value.

is another parameter we can control by introducing the following meshing policy: Under the stated cutting condition, i.e., orthogonal cuts, a -orthope domain can be decomposed into smaller components, each of which is a smaller d-orthope and adjacent to other -orthopes in its faces. In fact, each face represents a neighbor in this terminology, and according to (17) a -orthope has at most faces and subsequently, neighbors. Consider that each of these faces are parallel and will appear because of the cuts which are parallel to them, or in other words, cutting from each direction creates at most 2 neighbors for a subdomain. If all the potential neighbors of subdomians are desired, cut the domain from all directions; consequently, is a function of the directions of the cuts, which is minimized when all cuts are parallel. Note that this policy is not efficient for the naïve local Kriging, since it causes the width of subdomains to narrow in order to keep them small. On the contrary, using a sparse GP in each subdomain will accommodate the large size subdomains. In this manner, regardless of the dimension of the space, if we slice the domain with parallel cuts we will encounter subdomains, each of which has at most two neighbors and the total number of boundaries is . Figure 1(a) shows a three-orthope which is cut from all three directions, and as a result some sub-three-orthopes have six neighbors, whereas Figure 1(b) shows the current meshing policy with at most two neighbors for each sub-three-orthope. As such, the total number of control points is equal to

 T(q)=cq. (19)

### 3.2 Mean and Variance Prediction

The partitioning scheme presented in Section 3.1

may lead to subdomains containing a large number of training data points, which makes the application of a full GP inefficient. We propose to utilize a sparse covariance function which will speed up model fitting in each subdomain and make the overall algorithm efficient. Recall that the joint distribution of observed variables and the prediction function under the SPGP assumptions is

 (20)

Based on equation (11

), the mean of GP is the best unbiased estimator

. Under the covariance structure imposed in (20), we have

 E{μ(x∗)−f∗}2=u(x∗)t(QD+diag(KD−QD)+σ2I)u(x∗)−2u(x∗)tQD∗+K∗. (21)

Note that by replacing , equation (21) is equivalent to the variance of SPGP predictive distribution, which is indeed the minimum value of (21); consequently, is restricted to the form of , where , and the global minimizer emerges as

 (22)

In order to take advantage of Local Kriging, we decompose the domain into subdomains for , where each has local pseudo-data points. Unlike SPGP the number of pseudo-data points is no longer a global parameter, but it differs from one subdomain to another (see Section (3.4)), and as a result, the local low-ranked covariance matrix with respect to is

 Qjxx′=KxmjK−1mjKmjx′. (23)

By decomposing the domain, the global optimizer breaks down to smaller local optimizers:

 minAjQjtDj∗Atj(QjDj+diag(KDj−QjDj)+σ2jI)AjQjDj∗−2QjtDj∗AtjQjDj∗∀Ωj&x∗∈Ωj, (24)

where and therefore the local predictors have the form

 uj(x∗)=Qjtxj∗Atjyj∀Ωj. (25)

Moreover, we define as the joint boundary of two neighboring subdomains and , and locate control points on each , where is a vector containing the predicted mean of the locations . To obviate the discontinuity problem, the model forces mean predictors of the two subdomains and to be equal to the predicted value of control points on , i.e.,

 QjtDjbjkAtjyj=rjkbjk∈Γjk. (26)

Denote as the collection of all boundaries of and integrate all boundary points for this subdomain into . The boundary conditions for the subdomain becomes

 QjtDjbjAtjyj=rj(bj,rj)∈Γj. (27)

By adding the above constraint to local optimizers, the local minimization model for each appears as

 minAjQjtDj∗Atj(QjDj+diag(KDj−QjDj)+σ2jI)AjQjDj∗−2QjtDj∗AtjQjDj∗subject toQjtDjbjAtjyj=rj(bj,rj)∈Γj. (28)

Not that since the acquired model has a nonlinear objective, we need to take advantage of nonlinear optimization techniques to obtain the solution. Since our model contains only equality constrains, we use Lagrange multipliers to transform the constrained optimization into a Lagrangian function,

 (29) −λtj(QjtDjbjAtjyj−rj),

where is a vector of the Lagrange multipliers having a size equal to the number of all boundary points of subdomain . The tuple minimizes such a function if it satisfies the following system of equations of partial derivatives,

 ddAjΛ(Aj,λj)=0, (30) ddλjΛ(Aj,λj)=0. (31)

We do not present the procedure for solving this system of equations since it is tedious and similar to the solution of DDM (see Park et al. (2011) for details), instead giving the solutions,

 λj=2Gjrj−QjtDjbjHjyjytjHjyj, (32) AjQjDj∗=Hj(QjDj∗+0.5yjλtj¯kbj∗), (33)

where

 Hj=(QjDj+diag(KDj−QjDj)+σ2jI)−1 (34) Gj=({diag1/2[(KtbjbjKbjbj)−1]Kbjbj}∘{QjtDjbjQjDjbjdiag[(QjtDjbjQjDjbj)−1]})−1 (35) ¯kbj∗=[(Qjtbj∗Qjbj∗)−1/2Kbj∗]∘[QjtDjbjQjDj∗(QjtDj∗QjDj∗)−1], (36)

where denotes the entrywise product. Consequently, we obtain the local mean predictor by plugging in (33) into (25), which results in

 ^μj(x∗)=QjtDj∗Hjyj+¯Ktbj∗Gj(rj−QjtDjbjHjyj). (37)

In fact, the objective function of the minimization model is the local variance predictor and therefore by plugging (33) into (24) the predictive variance becomes

 ^σj(x∗)=K∗−QjtDj∗HjQjDj∗ (38) +¯Ktbj∗Gj(rj−QjtDjbjHjyj)(rj−QjtDjbjHjyj)tGtj¯Kbj∗/(ytjHjyj),

where is a constant matrix that did not appear in the optimization procedure. Note that in both (37) and (38) the first term is exactly the predictive mean and variance of local SPGP, and the terms coming after it, which are amplified for local points close to boundaries, appear to maintain the continuity of the global predictive function.

Calculating the local mean and variance predictors requires estimating the values of the control points, i.e., vector . Following Park et al. (2011), we estimate the values so that minimize the summation of all local variances with respect to . Specifically,

 r=argmin{m∑rj=1^σj(rj)}, (39)

which leads to the following estimator for ,

 rjk=ytjHjyjQjtDjbjkHjyj+ytkHkykQjtDkbjkHkykytjHjyj+ytkHkyk∀(bjk,rjk)∈Γj∩Γk, (40)

where we omit the details of the derivation and refer the reader to Park et al. (2011).

Intuitively, we can interpret the above estimation for as the weighted average of two neighboring subdomains and , where each has a weight equal to for . Note that in the current method each boundary is exactly the intersection of two subdomains, whereas for any other meshing policies with more subdomains associated with a boundary we need to use the more complicated version of (40), but by doing so, the algorithm will become computationally inefficient.

Writing the above equations means that we must invert the matrices of size , and obviously it is time consuming for large ; However, by using matrix inversion lemma (Snelson, 2007), known as Woodbury, Sherman and Morrison formula, the computational complexity reduces from to , where . Consequently, the total computational complexity becomes , or equivalently , which means that the current algorithm can handle the non-stationary data in a time asymptotically equivalent to that of SPGP. We name the proposed method that decomposes the domain into smaller subdomains and then utilizes a sparse approximation in each subdomain the Sparse Pseudo-input Local Kriging (SPLK).

### 3.3 Hyperparameter learning

Maximizing the marginal likelihood of the training data, , is a popular for learning the hyperparameters of a model (Rasmussen and Williams, 2006). In the current method, instead of one global marginal likelihood function, we have local functions , each of which can be trained independently. This is the key factor in handling non-stationarity of the data, since SPLK is able to use appropriate hyperparameters for each subdomain depending on the behavior of local data points. Also, recall that our local predictors are in fact SPGP predictors that consider pseudo-data points as parameters of the . Therefore, we have two types of parameters: one is the location of local pseudo-inputs and the other is the hyperparameters of the underlying covariance function. Maximizing the logarithm of local SPGP marginal likelihood functions using gradient descent with respect to local pseudo-inputs and hyperparameters provides local optimal locations for them. Specifically, the log marginal likelihood of each SPLK’s local model j is

 log(p(yj))=−12log|Δj|−12yTjΔ−1jyj−nj2log2π, (41)

where . Snelson and Ghahramani (2006) show that the cost of finding the derivatives and maximizing (41) is in order of that results in as the total computational complexity for training step, where is the number of hyperparameters of the covariance function and is the number of local pseudo-inputs. Note that the training cost is not a function of ; consequently, the SPLK is flexible to use large size subdomains without loss of efficiency. This is not the case for naïve local Kriging, which maximizes the local marginal likelihood,

 log(p(yj))=−12log|Kj+σ2jI|−12yTj(Kj+σ2jI)−1yj−nj2log2π, (42)

since maximizing (42) requires inverting the local covariance matrix which has a cost of . Due to the dependence of training cost to naïve local Kriging is restricted to use small size subdomains, since large covariance matrices are costly to invert.

### 3.4 Number of pseudo-inputs and the location of control points

One important parameter that affects the performance of SPLK is the number of pseudo-inputs, , located in each subdomain. One factor that determines the value of is the stationarity of the data, since locating numerous pseudo-inputs does not improve the predictive power of SPLK for stationary data structures. In addition, there is always a positive correlation between the number of pseudo-inputs and the training time. Consequently, the model is constrained to tune this parameter in a way that maximizes the precision while controlling the training time. Based on empirical results, setting to a proportion of the square root of the number of local training data points, , provides good prediction in a relatively short time. As such, we propose the formula

 mj=k√nj, (43)

where is a real number between 0.1 and 4. In fact, we can think of as a measure of stationarity: we can set to a small value if the data is stationary, while we need to choose a larger value as the non-stationarity increases. See Section 4.3.1 for a sensitivity analysis regarding the value of .

Another important issue is the location of the control points on each -dimensional boundaries. Since we stitch two neighboring subdomains in a limited number of control points, these points should effectively cover the whole boundary. Therefore, we introduce a practical procedure which distributes given control points over the boundary.

Consider the way we construct the -orthope domain, where each edge of the -orthope is orthogonal to one of the axes of the space, . Therefore, all the cuts under the stated meshing policy are parallel to each other and orthogonal to a specific . Let the set be all the parallel cuts, which are orthogonal to for each . Also, let the set denote the faces that appear because of those cuts. Note that there is a one to one relationship between and , and that all vertices in a have the coordinate in common, since the corresponding is orthogonal to . Denote this common coordinate for as . In order to determine the rest of coordinates of vertices, define a tuple, , which contains the maximum and minimum values of the dimension of the training data, i.e., , where , the minimum value of the coordinate in the dataset , and , the maximum value of the coordinate in . Specifically, , where , and similarly for . Therefore, denote each vertex of as

 vℓ=[τℓ11,…,dij,…,τℓdd]Tj=1…c,ℓk=0,1, for k=1,…,d, (44)

where , and the total number of vertices becomes for each subdomain. Note that we obtain the number of vertices in each boundary by setting in (17). Next, determine the location of control points on each . Along each axis, partition the boundary into parts. Specifically, define the arrays for , where . Then, each control point has a representation

 CPℓ={αℓ11,αℓ22,…,dij,…,αℓdd}∀CPℓ∈Bij,j=1…c,ℓk=1,…,λ for k=1,…,d, (45)

where , denotes the element of the array . The total number of control points in each subdomain is . This is where the exponential growth in the number of control points appears. However, experimental results show that the model works well even with . See Section (4.4) for details. Finally, the total number of control points will be

 T(q)=c(λ+1)d−1. (46)

We can choose the width of each subdomain in two different ways. Using a fixed width for each subdomain results in a different number of data points for the subdomains, depending on the density of the data points in different parts of the domain, while using a different width for each subdomain results in the same number of training data points in each subdomain. Note that in the later case, the number of pseudo-inputs can be considered as a global parameter. Note, also, that the selection of control points according to (45

) will benefit from having axes along the direction of maximum variance, because there is a better overlap of the control points with the training data. In order to benefit from maximum variability condition, we can simply rotate the original axes by using, for example, Principal Component Analysis

(Jolliffe, 2002), such that the new axes are along the direction of maximum variability.

## 4 Experimental Results

In this section we apply SPLK to simulated and real datasets and compare its performance with the competing methods, DDM (Park et al., 2011) and SPGP (Snelson, 2007). We selected the former to evaluate the SPLK’s ability to handle non-stationarity and the latter to evaluate SPLK’s ability to speed up GP compared with a state-of-the-art sparse approximation method. We also present sensitivity analysis regarding the parameters in SPLK and propose some guidelines for their selection.

### 4.1 Datasets and evaluation criteria

We implement SPLK in MATLAB and test it on eight different datasets that contain both non-stationary and stationary data structures (see Table 1). The first dataset, TCO contains data collected by NIMBUS7/TOMS satellite to measure the total column of ozone over the globe on October 1, 1988. This dataset is highly non-stationary and constructed on a two-dimensional input space; thus, it is an appropriate dataset for comparing SPLK and DDM.

Recall that one advantage of SPLK over DDM is that SPLK can handle higher dimensional data. Since we could not obtain publicly available high-dimensional large spatial datasets, we appeal to both simulated data and modified real data. Specifically, TCO-3D is the upgraded version of TCO in a three-dimensional space, we added a new dimension to TCO dataset in order to use such a non-stationary and real structure in a higher dimensional domain. To add a new dimension to the domain we define a vector with a size equal to the number of training data points, where each element of

is generated from a uniform distribution

. By attaching this vector to the original input space , we upgrade the domain to a higher dimensional space.

We also need to update the TCO’s response values, so that each observation is a function of both and , while the current noisy observations are only a function of the original input space, i.e., . Consequently, we need to manipulate as well. Note that we do not intend to change the non-stationary structure of ; thus, we add a new term to which can be explained by the new variable ,

 y′i=f(xi)+g(zi). (47)

Since the form of is unknown, we use the original noisy observations to obtain , i.e.,

 y′i=yi+g(zi). (48)

Note that the function should be defined so that it does not cancel out the nonstationary structure of the underlying function . We satisfy this intention by defining under the restriction that . Consider that the new labels have the same amount of noise as the original labels, since we do not manipulate , and is a noiseless function. We generate the vector from the uniform distribution between , and define as: . Consequently, the new dataset turns to . Note that the original labels range between [167,540], whereas the range of [0,2.5] satisfies the stated restriction.

The second dataset, GPML, is a synthetic dataset generated by a stationary GP generator using the GPLM-RND function implemented by (Rasmussen and Williams, 2006). The third dataset, OzoneL2, is another satellite dataset from NASA, and has a two-dimensional input space. We upgrade OzoneL2 by adding two synthetic dimensions to it along with preserving the non-stationary structure to construct OzoneL2-4D. The fifth dataset, Power Plant, is a four-dimensional dataset of data collected over six years (2006-2011) from a Combined Cycle Power Plant. Its features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the plant’s net hourly electrical energy output (EP).

The Syn-3D and Syn-5D datasets are synthetic three-dimensional and five-dimensional datasets, respectively, generated from uniform distribution and labeled by a fluctuated function with the form of,

 f(x)=exp(|Σpixi|c)cos(Σqixi)+ϵ,ϵ∼U[a,b], (49)

where and are the weights of dimension in exponential and cosine functions, respectively, and is the label noise. Function (49) is capable of generating a fluctuated data structure with close to non-stationary behavior. We use for Syn3D dataset, and for Syn5D.

We use two measures to evaluate the performance of each method. The first one is the measure of prediction accuracy, which is assessed by the Mean Squared Error (MSE),

 MSE=1TT∑i=1(yi∗−μi∗)2, (50)

where is the noisy observation of the test location and is the value of the mean predictor of this test location. We randomly partition each dataset into for training and the remaining for testing. The second measure is the training time, which is used to evaluate the SPLK’s success in speeding up the GP. Note that the training time is not an appropriate measure on its own and the corresponding MSE should be taken into account simultaneously, since a reduction in training time without an accurate prediction is not useful.

### 4.2 Computation time and prediction accuracy

Here, we compare the training time and the prediction accuracy of the proposed algorithm with those of DDM and SPGP. Specifically, we consider the MSE as a function of the training time and plot the set of best results for each algorithm. Under this criterion the algorithm associated with the curve closest to the origin will be superior. The parameter selection for each model is as follows.

SPLK has three parameters: