Gaussian sampling; Kriging; Maximum likelihood estimation; Hierarchical matrix; Climate data
A Gaussian random field (GRF) is a random field where all of its finite-dimensional distributions are Gaussian. Often termed as Gaussian processes, GRFs are widely adopted as a practical model in areas ranging from spatial statistics (Stein, 1999), geology (Chilès and Delfiner, 2012), computer experiments (Koehler and Owen, 1996), uncertainty quantification (Smith, 2013)
, to machine learning(Rasmussen and Williams, 2006). Among the many reasons for its popularity, a computational advantage is that the Gaussian assumption enables many computations to be done with basic numerical linear algebra.
Although numerical linear algebra (Golub and Van Loan, 1996) is a mature discipline and decades of research efforts result in highly efficient and reliable software libraries (e.g., BLAS (Goto and Geijn, 2008) and LAPACK (Anderson et al., 1999))111These libraries are the elementary components of commonly used software such as R, Matlab, and python., the computation of GRF models cannot overcome a fundamental scalability barrier. For a collection of scattered sites , , …, , the computation typically requires storage and to arithmetic operations, which easily hit the capacity of modern computers when is large. In what follows, we review the basic notation and a few computational components that underlie this challenge.
Denote by the mean function and the covariance function, which is (strictly) positive definite. Let be a set of sampling sites and let
(column vector) be a realization of the random field at. Additionally, denote by the mean vector with elements and by the covariance matrix with elements .
Realizing a GRF amounts to sampling the multivariate normal distribution. To this end, one performs a matrix factorization (e.g., Cholesky), samples a vector from the standard normal, and computes
Kriging is the estimation of at a new site . Other terminology includes interpolation, regression222Regression often assumes a noise term that we omit here for simplicity. An alternative way to view the noise term is that the covariance function has a nugget., and prediction
. The random variableadmits a conditional distribution with
where is the column vector .
The log-likelihood function of a Gaussian distribution is
The log-likelihood is a function of that parameterizes the mean function and the covariance function . The evaluation of
is an essential ingredient in maximum likelihood estimation and Bayesian inference.
A common characteristic of these examples is the expensive numerical linear algebra computations: Cholesky-like factorization in (1), linear system solutions in (2) and (3), and determinant computation in (3). In general, the covariance matrix is dense and thus these computations have memory cost and arithmetic cost. Moreover, a subtlety occurs in the kriging of more than a few sites. In dense linear algebra, a preferred approach for solving linear systems is not to form the matrix inverse explicitly; rather, one factorizes the matrix as a product of two triangular matrices with cost, followed by triangular solves whose costs are only . Then, if one wants to krige sites, the formulas in (2
), particularly the variance calculation, have a total cost of. This cost indicates that speeding up matrix factorization alone is insufficient for kriging, because vectors create another computational bottleneck.
1.1 Existing Approaches
Scaling up the computations for GRF models has been a topic of great interest in the statistics community for many years and has recently attracted the attention of the numerical linear algebra community. Whereas it is not the focus of this work to extensively survey the literature, we discuss a few representative approaches and their pros and cons.
A general idea for reducing the computations is to restrict oneself to covariance matrices that have an exploitable structure, e.g., sparse, low-rank, or block-diagonal. Covariance tapering (Furrer et al., 2006; Kaufman et al., 2008; Wang and Loh, 2011; Stein, 2013) approximates a covariance function by multiplying it with another one that has a compact support. The resulting compactly supported function potentially introduces sparsity to the matrix. However, often the appropriate support for statistical purposes is not narrow, which undermines the use of sparse linear algebra to speed up computation. Low-rank approximations (Cressie and Johannesson, 2008; Eidsvik et al., 2012) generally approximate by using a low-rank matrix plus a diagonal matrix. In many applications, such an approximation is quite limited, especially when the diagonal component of does not dominate the small-scale variation of the random field (Stein, 2008, 2014). In machine learning under the context of kernel methods, a number of randomized low-rank approximation techniques were proposed (e.g., Nyström approximation (Drineas and Mahoney, 2005) and random Fourier features (Rahimi and Recht, 2007)). In these methods, often the rank may need to be fairly large relative to for a good approximation, particularly in high dimensions (Huang et al., 2014). Moreover, not every low-rank approximation can krige sites efficiently. The block-diagonal approximation casts an artificial independence assumption across blocks, which is unappealing, although this simple approach can outperform covariance tapering and low-rank methods in many circumstances (Stein, 2008, 2014).
There also exists a rich literature focusing on only the parameter estimation of . Among them, spectral methods (Whittle, 1954; Guyon, 1982; Dahlhaus and Künsch, 1987) deal with the data in the Fourier domain. These methods work less well for high dimensions (Stein, 1995) or when the data are ungridded (Fuentes, 2007). Several methods focus on the approximation of the likelihood, wherein the log-determinant term (3) may be approximated by using Taylor expansions (Zhang, 2006) or Hutchinson approximations (Aune et al., 2014; Han et al., 2017; Dong et al., 2017; Ubaru et al., 2017). The composite-likelihood approach (Vecchia, 1988; Stein et al., 2004; Caragea and Smith, 2007; Varin et al., 2011) partitions
into subsets and expands the likelihood by using the law of successive conditioning. Then, the conditional likelihoods in the product chain are approximated by dropping the conditional dependence on faraway subsets. This approach is often competitive. Yet another approach is to solve unbiased estimating equations(Anitescu et al., 2012; Stein et al., 2013; Anitescu et al., 2017) instead of maximizing the log-likelihood . This approach rids the computation of the determinant term, but its effectiveness relies on fast matrix-vector multiplications (Chen et al., 2014) and effective preconditioning of the covariance matrix (Stein et al., 2012; Chen, 2013).
Recently, a multi-resolution approach (Katzfuss, 2017) based on successive conditioning was proposed, wherein the covariance structure is approximated in a hierarchical manner. The remainder of the approximation at the coarse level is filled by the finer level. This approach shares quite a few characteristics with our approach, which falls under the umbrella of “hierarchical matrices” in numerical linear algebra. Detailed connections and distinctions are drawn in the next subsection.
1.2 Proposed Approach
In this work, we take a holistic view and propose an approach applicable to the various computational components of GRF. The idea is to construct covariance functions that render a linear storage and arithmetic cost for (at least) the computations occurring in (1) to (3). Specifically, for any (strictly) positive definite function , which we call the “base function,” we propose a recipe to construct (strictly) positive definite functions as alternatives. The base function is not necessarily stationary. The subscript “h” standards for “hierarchical,” because the first step of the construction is a hierarchical partitioning of the computation domain. With the subscript “h”, the storage of the corresponding covariance matrix , as well as the additional storage requirement incurred in matrix computations, is . Additionally,
the arithmetic costs of matrix construction , factorization , explicit inversion , and determinant calculation are ;
for any dense vector of matching dimension, the arithmetic costs of matrix-vector multiplications and are ; and
for any dense vector of matching dimension, the arithmetic costs of the inner product and the quadratic form are , provided that an preprocessing is done independently of the new site .
The last property indicates that the overall cost of kriging sites and estimating the uncertainties is , which dominates the preprocessing .
The essence of this computationally attractive approach is a special covariance structure that we coin “recursively low-rank.” Informally speaking, a matrix is recursively low-rank if it is a block-diagonal matrix plus a low-rank matrix, with such a structure recursive in the main diagonal blocks. The “recursive” part mandates that the low-rank factors share bases across levels. The matrix resulting from the proposed covariance function is a symmetric positive definite version of recursively low-rank matrices. Interesting properties of the recursively low-rank structure of include that admits exactly the same structure, and that if is symmetric positive definite, it may be factorized as where also admits the same structure, albeit not being symmetric. These are the essential properties that allow for the development of algorithms throughout. Moreover, the recursively low-rank structure is carried out to the out-of-sample vector , which makes it possible to compute inner products and quadratic forms in an cost, asymptotically lower than .
This matrix structure is not a completely new invention without prior inspiration. For decades, researchers in scientific computing have been keenly developing fast methods for multiplying a dense matrix with a vector, , where the matrix is defined based on a kernel function (e.g., Green’s function) that resembles a covariance function. Notable methods include the tree code (Barnes and Hut, 1986), the fast multipole method (FMM) (Greengard and Rokhlin, 1987; Sun and Pitsianis, 2001), hierarchical matrices (Hackbusch, 1999; Hackbusch and Börm, 2002; Börm et al., 2003), and various extensions (Gimbutas and Rokhlin, 2002; Ying et al., 2004; Chandrasekaran et al., 2006a; Martinsson and Rokhlin, 2007; Fong and Darve, 2009; Ho and Ying, 2013; Ambikasaran and O’Neil, 2014; March et al., 2015). These methods were either co-designed, or later generalized, for solving linear systems . They are all based on a hierarchical partitioning of the computation domain, or equivalently, a hierarchical block partitioning of the matrix. The diagonal blocks at the bottom level remain unchanged but (some of) the off-diagonal blocks are low-rank approximated. The differences, however, lie in the fine details, including whether all off-diagonal blocks are low-rank approximated or the ones immediately next to the diagonal blocks should remain unchanged; whether the low-rank factors across levels share bases; and how the low-rank approximations are computed. Our work distinguishes from these methods in the following aspects.
We explicitly define the covariance function on , which is shown to be (strictly) positive definite. Whereas the related methods are all understood as matrix approximations, to the best of our knowledge, none of these works considers the underlying kernel function that corresponds to the approximate matrix. The knowledge of the underlying function is important for out-of-sample extensions, because, for example in kriging (2), one should approximate also the vector in addition to the matrix .
One may argue that if is well approximated (e.g., accurate to many digits), then it suffices to use the nonapproximate for computation. It is important to note, however, that the matrix approximations are elementwise, which does not guarantee good spectral approximations. As a consequence, numerical error may be disastrously amplified through inversion, especially when there is no or a small nugget effect. Moreover, using the nonapproximate for computation will incur a computational bottleneck if one needs to krige a large number of sites, because constructing the vector alone incurs an cost.
On the other hand, we start from the covariance function and hence one needs not interpret the proposed approach as an approximation. All the linear algebra computations are exact in infinite precision, including inversion and factorization. Additionally, positive definiteness is proved. Few methods under comparison hold such a guarantee.
A substantial flexibility in the design of methods under comparison is the low-rank approximation of the off-diagonal blocks. If the approximation is algebraic, the common objective is to minimize the approximation error balanced with computational efficiency (otherwise the standard truncated singular value decomposition suffices). Unfortunately, rarely does such a method maintain the positive definiteness of the matrix, which poses difficulty for Cholesky-like factorization and log-determinant computation. A common workaround is some form of compensation, either to the original blocks of the matrix(Bebendorf and Hackbusch, 2007) or to the Schur complements (Xia and Gu, 2010). Our approach requires no compensation because of the guaranteed positive definiteness.
The fine distinctions in matrix structures lead to substantially different algorithms for matrix operations, if even possible. Our structure is the most similar to that of HSS matrices (Chandrasekaran et al., 2006a; Xia et al., 2010) and of H matrices with weak admissiblity (Hackbusch and Börm, 2002), but distant from that of tree code (Barnes and Hut, 1986), FMM (Greengard and Rokhlin, 1987), H matrices (Hackbusch, 1999), and HODLR matrices (Ambikasaran and O’Neil, 2014). Whereas fast matrix-vector multiplications are a common capability of different matrix structures, the picture starts to diverge for solving linear systems: some structures (e.g., HSS) are amenable for direct factorizations (Chandrasekaran et al., 2006b; Xia and Gu, 2010; Li et al., 2012; Wang et al., 2013), while the others must apply preconditioned iterative methods. An additional complication is that direct factorizations may only be approximate, and thus if the approximation is not sufficiently accurate, it can serve only as a preconditioner but cannot be used in a direct method (Iske et al., 2017). Then, it will be nearly impossible for these matrix structures to perform Cholesky-like factorizations accurately.
In this regard, our matrix structure is the most clean. Thanks to the property that the matrix inverse and the Cholesky-like factor admit the same structure as that of the original matrix, all the matrix operations considered in this work are exact. Moreover, the explicit covariance function also allows for the development of algorithms for computing inner products and quadratic forms, which, to the best of our knowledge, has not been discussed in the literature for other matrix structures.
Although most of the methods under this category enjoy an or (for some small ) arithmetic cost, not every one does so. For example, the cost of skeletonization (Ho and Ying, 2013; Minden et al., 2016) is dimension dependent; in two dimensions it is approximately and in higher dimensions it will be even higher. In general, all these methods are considered matrix approximation methods, and hence there exists a likely tradeoff between approximation accuracy and computational cost. What confounds the approximation is that the low-rank phenomenon exhibited in the off-diagonal blocks fades as the dimension increases (Ambikasaran et al., 2016). In this regard, it is beneficial to shift the focus from covariance matrices to covariance functions where approximation holds in a more meaningful sense. We conduct experiments to show that predictions and likelihoods are well preserved with the proposed approach.
2 Recursively Low-Rank Covariance Function
Let be positive definite for some domain ; that is, for any set of points and any set of coefficients , the quadratic form . We say that is strictly positive definite if the quadratic form is strictly greater than whenever the ’s are distinct and not all of the ’s are 0. Given any and , in this section we propose a recipe for constructing functions that are (strictly) positive definite if is so. We note the often confusing terminology that a strictly positive definite function always yields a positive definite covariance matrix for distinct observations, whereas, for a positive definite function, this matrix is only required to be positive semi-definite.
Some notations are necessary. Let be an ordered list of points in . We will use to denote the matrix with elements for all pairs . Similarly, we use and to denote a column and a row vector, respectively, when one of the arguments passed to contains a singleton .
The construction of is based on a hierarchical partitioning of . For simplicity, let us first consider a partitioning with only one level. Let be partitioned into disjoint subdomains such that . Let be a set of distinct points in . If is invertible, define
In words, (4) states that the covariance for a pair of sites is equal to if they are located in the same subdomain; otherwise, it is replaced by the Nyström approximation . The Nyström approximation is always no greater than and when is strictly positive definite, it attains only when either or belongs to . Following convention, we call the points in landmark points. Throughout this work, we will reserve underscores to indicate a list of landmark points. The term “low-rank” comes from the fact that a matrix generated from Nyström approximation generically has rank (when ), regardless of how large is.
The positive definiteness of follows a simple Schur-complement split. Furthermore, we have a stronger result when is assumed to be strictly positive definite; in this case, carries over the strictness. We summarize this property in the following theorem, whose proof is given in the appendix.
The function defined in (4) is positive definite if is positive definite and is invertible. Moreover, is strictly positive definite if is so.
We now proceed to hierarchical partitioning. Such a partitioning of the domain may be represented by a partitioning tree . We name the tree nodes by using lower case letters such as and let the subdomain it corresponds to be . The root is always and hence . We write to denote the set of all child nodes of . Equivalently, this means that a (sub)domain is partitioned into disjoint subdomains for all . An example is illustrated in Figure 1, where , , and .
We now define a covariance function based on hierarchical partitioning. For each nonleaf node , let be a set of landmark points in and assume that is invertible. The main idea is to cascade the definition of covariance to those of the child subdomains. Thus, we recursively define a function such that if and belong to the same child subdomain of , then ; otherwise, resembles a Nyström approximation. Formally, our covariance function
where for any tree node ,
The auxiliary function cannot be the same as , because positive definiteness will be lost. Instead, we make the following recursive definition when :
To understand the definition, we expand the recursive formulas (5)–(7) for a pair of points and , where and are two leaf nodes. If , it is trivial that . Otherwise, they have a unique least common ancestor . Then,
where is the path in the tree connecting and and similarly is the path connecting and . The vectors and on the two sides of come from recursively applying (7).
Similar to Theorem 1, the positive definiteness of follows from recursive Schur-complement splits across the hierarchy tree. Furthermore, we have that is strictly positive definite if is so. We summarize the overall result in the following theorem, whose proof is given in the appendix.
3 Recursively Low-Rank Matrix
An advantage of the proposed covariance function is that when the number of landmark points in each subdomain is considered fixed, the covariance matrix for a set of points admits computational costs only linear in . Such a desirable scaling comes from the fact that is a special case of recursively low-rank matrices whose computational costs are linear in the matrix dimension. In this section, we discuss these matrices and their operations (such as factorization and inversion). Then, in the section that follows, we will show the specialization of and discuss additional vector operations tied to .
Let us first introduce some notation. Let . The index set may be recursively (permuted and) partitioned, resulting in a hierarchical formation that resembles the second panel of Figure 1. Then, corresponding to a node is a subset . Moreover, we have where the ’s under union are disjoint. For an real matrix , we use to denote a submatrix whose rows correspond to the index set and columns to . We also follow the Matlab convention and use to mean all rows/columns when extracting submatrices. Further, we use to denote the cardinality of an index set . We now define a recursively low-rank matrix.
A matrix is said to be recursively low-rank with a partitioning tree and a positive integer if
for every pair of sibling nodes and with parent , the block admits a factorization
for some , , and ; and
for every pair of child node and parent node not being the root, the factors
for some .
In Definition 1, the first item states that each off-diagonal block of is a rank- matrix. The middle factor is shared by all children of the same parent , whereas the left factor and the right factor may be obtained through a change of basis from the corresponding factors in the child level, as detailed by the second item of the definition. As a consequence, if and , then
From now on, we use the shorthand notation to denote a diagonal block and to denote an off-diagonal block . A pictorial illustration of , which corresponds to the tree in Figure 1, is given in Figure 2. Then, is completely represented by the factors
In computer implementation, we store these factors in the corresponding nodes of the tree. See Figure 3 for an extended example of Figure 1. Clearly, is symmetric when and are symmetric, , and for all appropriate nodes , , and . In this case, the computer storage can be reduced by approximately a factor of through omitting the ’s and ’s; meanwhile, matrix operations with often have a reduced cost, too.
It is useful to note that not all matrix computations concerned in this paper are done with a symmetric matrix, although the covariance matrix is always so. One instance with unsymmetric matrices is sampling, where the matrix is a Cholesky-like factor of the covariance matrix. Hence, in this section, general algorithms are derived whenever may be unsymmetric, but we note the simplification for the symmetric case as appropriate.
The four matrix operations under consideration are:
matrix-vector multiplication ;
matrix inversion ;
determinant ; and
Cholesky-like factorization (when is symmetric positive definite).
Because of space limitation, the detailed algorithms are presented in the appendix. Suffice it to mention here that interestingly, all algorithms are in the form of tree walks (e.g., preorder or postorder traversals) that heavily use the tree data structure illustrated in Figure 3. The inversion and Cholesky-like factorization rely on existence results summarized in the following. The proofs of these theorems are constructive, which simultaneously produce the algorithms. Hence, one may find the proofs inside the algorithms given in the appendix.
Let be recursively low-rank with a partitioning tree and a positive integer . If is invertible and additionally, is also invertible for all pairs of nonroot node and parent , then there exists a recursively low-rank matrix with the same partitioning tree and integer , such that . Following (8), we denote the corresponding factors of to be
Let be recursively low-rank with a partitioning tree and a positive integer . If is symmetric, by convention let be represented by the factors
Furthermore, if is positive definite and additionally, is also positive definite for all pairs of nonroot node and parent , then there exists a recursively low-rank matrix with the same partitioning tree and integer , and with factors
such that .
4 Covariance Matrix as a Special Case of and Out-Of-Sample Extension
As noted at the beginning of the preceding section, the covariance matrix is a special case of recursively low-rank matrices. This fact may be easily verified through populating the factors of defined in Definition 1. Specifically, let be a set of distinct points in and let for all (sub)domains . To avoid degeneracy assume for all . Assign a recursively low-rank matrix in the following manner:
for every leaf node , let ;
for every nonleaf node , let ;
for every leaf node , let where is the parent of ; and
for every nonleaf node not being the root, let where is the parent of .
In this section, we consider two operations with the vector , where is an out-of-sample (i.e., unobserved site). The quantities of interest are
the inner product for a general length- vector ; and
the quadratic form , where is a symmetric recursively low-rank matrix with the same partitioning tree and integer as that used for constructing .
For the quadratic form, in practical use , but the algorithm we develop here applies to a general symmetric . The inner product is used to compute prediction (first equation of (2
)) whereas the quadratic form is used to estimate standard error (second equation of (2)).
Because of space limitation, the detailed algorithms are presented in the appendix. Similar to those in the preceding section, they are organized as tree algorithms. The difference is that both algorithms in this section are split into a preprocessing computation independent of and a separate -dependent computation. The preprocessing still consists of tree traversals that visit all nodes of the hierarchy tree, but the -dependent computation visits only one path that connects the root and the leaf node that lies in. In all cases, one needs not explicitly construct the vector , which otherwise costs storage.
5 Cost Analysis
All the recipes and algorithms developed in this work apply to a general partitioning of the domain . As is usual, if the tree is arbitrary, cost analysis of many tree-based algorithms is unnecessarily complex. To convey informative results, here we assume that the partitioning tree is binary and perfect and the associated partitioning of the point set is balanced. That is, with some positive integer , for all leaf nodes . Then, with a partitioning tree of height , the number of points is . We assume that the number of landmark points, , is equal to for simplicity.
Since the factors , and are stored in the leaf nodes and , , and are stored in the nonleaf nodes (in fact, at the root there is no or ), the storage is clearly
An alternative way to conclude this result is that the tree has nodes, each of which contains an number of matrices of size . Therefore, the storage is . This viewpoint also applies to the additional storage needed when executing all the matrix algorithms, wherein temporary vectors and matrices are allocated. This additional storage is or per node, hence it does not affect the overall assessment .
The analysis of the arithmetic cost of each matrix operation is presented in the appendix. In brief summary, matrix construction is , matrix-vector multiplication is , matrix inversion and Cholesky-like factorization are , determinant computation is , inner product is with preprocessing, and quadratic form is with preprocessing.
6 Numerical Experiments
In this section, we show a comprehensive set of experiments to demonstrate the practical use of the proposed covariance function for various GRF computations. These computations are centered around simulated data and data from test functions, based on a simple stationary covariance model . In the next section we will demonstrate an application with real-life data and a more realistic nonstationary covariance model.
The base covariance function in this section is the Matérn model
where is the sill, is the range, is the smoothness, and is the nugget. In each experiment, the vector of parameters include some of them depending on appropriate setting. We have reparameterized the sill and the nugget through a power of ten, because often the plausible search range is rather wide or narrow. Note that for the extremely smooth case (i.e., ), (9) becomes equivalently the squared-exponential model
We will use this covariance function in one of the experiments. Throughout we assume zero mean for simplicity.
6.1 Small-Scale Example
We first conduct a closed-loop experiment whereby data are simulated on a two-dimensional grid from some prescribed parameter vector . We discard (uniformly randomly) half the data and perform maximum likelihood estimation to verify that the estimated is indeed close to . Afterward, we perform kriging by using the estimated to recover the discarded data. Because it is a closed-loop setting and there is no model misspecification, the kriging errors should align well with the square root of the variance of the conditional distribution (see (2)). We do not use a large , since we will compare the results of the proposed method with those from the standard method that requires expensive linear algebra computations.
The prescribed parameter vector consists of three elements: , , and . We choose to use a zero nugget because in some real-life settings, measurements can be quite precise and it is unclear one always needs a nugget effect. This experiment covers such a scenario. Further, note that numerically accurate codes for evaluating the derivatives with respect to are unknown. Such a limitation poses constraints when choosing optimization methods.
Further details are as follows. We simulate data on a grid of size occupying a physical domain , by using prescribed parameters , , and . Half of the data are discarded, which results in sites for estimation and sites for kriging.
For the proposed method, we build the partitioning tree through recursively bisecting the observed sites, where in each bisection, the partitioning line is drawn perpendicular to the longest dimension of the bounding box of the current point set. We specify the number of landmark points, , to be , and make the height of the partitioning tree such that the number of points in each leaf node is approximately . The landmark points for each subdomain in the hierarchy are placed on a regular grid, whose dimension is approximately proportional to the size of the bounding box of the points.
Figure 4(a) illustrates the random field simulated by using . With this data, maximum likelihood estimation is performed, by using separately and . The parameter estimates and their standard errors are given in Table 1. The numbers between the two methods are both quite close to the truth. With the estimated parameters, kriging is performed, with the results shown in Figure 4(b) and (c). The kriging errors are sorted in the increasing order of the prediction variance. The red curves in the plots are three times the square root of the variance; not surprisingly almost all the errors are below this curve.
6.2 Comparison of Log-Likelihoods and Estimates
One should note that the base covariance function and the proposed are not particularly close, because the number of landmarks for defining is only (compare this number with the number of observed sites, ). Hence, if one compares the covariance matrix with , they agree in only a limited number of digits. However, the reason why is a good alternative of is that the shapes of the likelihoods are similar, as well as the locations of the optimum.
We graph in Figure 5 the cross sections of the log-likelihood centered at the truth . The top row corresponds to and the bottom row to . One sees that in both cases, the center (truth ) is located within a reasonably concave neighborhood, whose contours are similar to each other.
. The unparenthesized number is the mean and the number with parenthesis is the standard deviation. For reference, the uncertainties (denoted as stderr) of the estimates are listed in the second part of the table.
In fact, the maxima of the log-likelihoods are rather close. We repeat the simulation ten times and report the statistics in Table 2. The quantities with a subscript “h” correspond to the proposed covariance function . One sees that for each parameter, the differences of the estimates are generally about of the standard errors of the estimates. Furthermore, the difference of the true log-likelihoods at the two estimates is always substantially less than one unit. These results indicate that the proposed produces highly comparable parameter estimates with the base covariance function .
In this subsection, we verify that the linear algebra costs for the proposed method indeed agree with the theoretical analysis. Namely, random field simulation and log-likelihood evaluation are both , and the kriging of sites is . Note that all these computations require the construction of the covariance matrix, which is .
The experiment setting is the same as that of the preceding subsections, except that we restrict the number of log-likelihood evaluations to to avoid excessive computation. We vary the grid size from to to observe the scaling. The random removal of sites has a minimal effect on the partitioning and hence on the overall time. The computation is carried out on a laptop with eight Intel cores (CPU frequency 2.8GHz) and 32GB memory. Six parallel threads are used.
Figure 6 plots the computation times, which indeed well agree with the theoretical scaling. As expected, log-likelihood evaluations are the most expensive, particularly when many evaluations are needed for optimization. The simulation of a random field follows, with kriging being the least expensive, even when a large number of sites are kriged.
6.4 Large-Scale Example Using Test Function
The above scaling results confirm that handling a large is feasible on a laptop. In this subsection, we perform an experiment with up to one million data sites. Different from the closed-loop setting that uses a known covariance model, here we generate data by using a test function. We estimate the covariance parameters and krige with the estimated model.
The test function is
) for estimation. The high smoothness results in a too ill-conditioned matrix; therefore, a nugget is necessary. The vector of parameters is. We inject independent Gaussian noise to the data so that the nugget will not vanish. As before, we randomly select half of the sites for parameter estimation and the other half for kriging. The number of landmark points, , remains .
Our strategy for large-scale estimation is to first perform a small-scale experiment with the base covariance function that quickly locates the optimum. The result serves as a reference for later use of the proposed in the larger-scale setting. The results are shown in Figure 7 (for the largest grid) and Table 3.
Each of the cross sections of the log-likelihood on the bottom row of Figure 7 is plotted by setting the unseen parameter at the estimated value. For example, the - plane is located at . From these contour plots, we see that the estimated parameters are located at a local maximum with nicely concave contours in a neighborhood of this maximum. The estimated nugget () well agrees with of the actual noise variance. The kriged field (plot (b)) is visually as smooth as the test function. The kriging errors for predicting the test function , again sorted by their estimated standard errors, are plotted in (c). As one would expect, nearly all of the errors are less than three times their estimated standard errors. Note that the kriging errors are counted without the perturbed noise; they are substantially lower than the noise level.
7 Analysis of Climate Data
In this section, we apply the proposed method to analyze a climate data product developed by the National Centers for Environmental Prediction (NCEP) of the National Oceanic and Atmospheric Administration (NOAA).333https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/climate-forecast-system-version2-cfsv2 The Climate Forecast System Reanalysis (CFSR) data product (Saha et al., 2010) offers hourly time series as well as monthly means data with a resolution down to one-half of a degree (approximately 56 km) around the Earth, over a period of 32 years from 1979 to 2011. For illustration purpose, we extract the temperature variable at 500 mb height from the monthly means data and show a snapshot on the top of Figure 8. Temperatures at this pressure (generally around a height of 5 km) provide a good summary of large-scale weather patterns and should be more nearly stationary than surface temperatures. We will estimate a covariance model for every July over the 32-year period.
Through preliminary investigations, we find that the data appears fairly Gaussian after a subtraction of pixelwise mean across time. An illustration of the demeaned data for the same snapshot is given at the bottom of Figure 8. Moreover, the correlation between the different snapshots are so weak that we shall treat them as independent anomalies. Although temperatures have warmed during this period, the warming is modest compared to the interannual variation in temperatures at this spatial resolution, so we assume the anomalies have mean 0. We use to denote the anomaly at time . Then, the log-likelihood with zero-mean independent anomalies is
For random fields on a sphere, a reasonable covariance function for a pair of sites and may be based on their great-circle distance, or equivalently the chordal distance, because of their monotone relationship. Specifically, let a site be represented by latitude and longitude . Then, the chordal distance between two sites and is
Here, we assume that the radius of the sphere is for simplicity, because it can always be absorbed into a range parameter later. We still use the Matérn model
to define the covariance function, where is the chordal distance (11), so that the model is isotropic on the sphere. More sophisticated models based on the same Matérn function and the chordal distance are proposed in (Jun and Stein, 2008). Note that this model depends on the longitudes for and only through their differences modulo . Such a model is called axially symmetric (Jones, 1963).
A computational benefit of an axially symmetric model and gridded observations is that one may afford computations with
even when the latitude-longitude grid is dense. The reason is that for any two fixed latitudes, the cross-covariance matrix between the observations is circulant and diagonalizing it requires only one discrete Fourier transform (DFT), which is efficient. Thus, diagonalizing the whole covariance matrix amounts to diagonalizing only the blocks with respect to each longitude, apart from the DFT’s for each latitude.
Hence, we will perform computations with both the base covariance function and the proposed function and compare the results. We subsample the grid with every other latitude and longitude for parameter estimation. We also remove the two grid lines 90N and 90S due to their degeneracy at the pole. Because of the half-degree resolution, this results in a coarse grid of size for parameter estimation, for a total of observations. The rest of the grid points are used for kriging. As before, we set the number of landmark points to be .
|Initial guess||Terminate at||Log-likelihood|
We set the parameter vector , considering only several values for the smoothness parameter because of the difficulties of numerical optimization of the loglikelihood over . To our experience, blackbox optimization solvers do not always find accurate optima. We show in Table 4 several results of the Matlab solver fminunc when one varies . For each , we start the solver at some initial guess until it claims a local optimum . Then, we use this optimum as the initial guess to run the solver again. Ideally, the solver should terminate at if it indeed is an optimum. However, reading Table 4, one finds that this is not always the case.
When , the second search diverges from the initial . The cross-section plots of the log-likelihood (not shown) indicate that is far from the center of the contours. The solver terminates merely because the gradient is smaller than a threshold and the Hessian is positive-definite (recall that we minimize the negative log-likelihood). The diverging search starting from (with and continuously increasing) implies that the infimum of the negative log-likelihood may occur at infinity, as can sometimes happen in our experience.
When , although the search starting at does not diverge, it terminates at a location quite different from , with the log-likelihood increased by about 4000, which is arguably a small amount given the number of observations. Such a phenomenon is often caused by the fact that the peak of the log-likelihood is flat (at least along some directions); hence, the exact optimizer is hard to locate. This phenomenon similarly occurs in the case . Only when does restarting the optimization yield that is essentially the same as the initial estimate. Incidentally, the log-likelihood in this case is also the largest. Hence, all subsequent results are produced for only .
Near , we further perform a local grid search and obtain finer estimates, as shown in Table 5. One sees that the estimated parameters produced by and are qualitatively similar, although their differences exceed the tiny standard errors. To distinguish the two estimates, we use to denote the one resulting from and from . In Table 6, we list the log-likelihood values and the kriging errors when the covariance function is evaluated at both locations. One sees that the estimate is quite close to in two important regards: first, the root mean squared prediction errors using are the same to four significant figures, and the log-likelihood under differs by 1000, which we would argue is a very small difference for more than 2 million observations. On the other hand, does not provide a great approximation to the loglikelihood itself and the predictions using are slightly inferior to those using no matter which estimate is used. Figure 9 plots the log-likelihoods centered around the respectively optimal estimates. The shapes are visually identical, which supports the use of for parameter estimation. We see that the statistical efficacy of the proposed covariance function depends on the purpose to which it is put.
We have presented a computationally friendly approach that addresses the challenge of formidably expensive computations of Gaussian random fields in the large scale. Unlike many methods that focus on the approximation of the covariance matrix or of the likelihood, the proposed approach operates on the covariance function such that positive definiteness is maintained. The hierarchical structure and the nested bases in the proposed construction allow for organizing various computations in a tree format, achieving costs proportional to the tree size and hence to the data size . These computations range from the simulation of random fields to kriging and likelihood evaluations. More desirably, kriging has an amortized cost of and hence one may perform predictions for as many as
sites easily. Moreover, the efficient evaluation of the log-likelihoods paves the way for maximum likelihood estimation as well as Markov Chain Monte Carlo. Numerical experiments show that the proposed construction yields comparable prediction results and likelihood surfaces with those of the base covariance function, while being scalable to data of ever increasing size.
- Ambikasaran and O’Neil  S. Ambikasaran and M. O’Neil. Fast symmetric factorization of hierarchical matrices with applications. arXiv preprint arXiv:1405.0223, 2014.
- Ambikasaran et al.  S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O’Neil. Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:252–265, 2016.
- Anderson et al.  E. Anderson, Z. Bai,