1 Introduction
Modeling spatial data typically involves specifying spatial dependence in the form of a covariance function or matrix, under an implicit or explicit assumption of joint Gaussianity. A motivating example for this paper is statistical climate-model emulation (e.g., Castruccio and Stein, 2013; Castruccio et al., 2014; Nychka et al., 2018; Haugen et al., 2019): based on an ensemble of spatial fields generated by an expensive computer model (Figure 1), the goal is to learn the underlying joint distribution and then, for instance, to draw additional samples from it. This involves many challenges, including small ensemble sizes, high-dimensional distributions, and complex, nonstationary dependence. Thus, there is a need for flexible and scalable methods for inferring high-dimensional spatial covariances.
Countless approximations have been proposed to address computational challenges in spatial statistics (see Heaton et al., 2019, for a recent review and comparison). In recent years, there has been increasing interest in the idea of Vecchia (1988), which effectively approximates the Cholesky factor of the precision (i.e., inverse covariance) matrix as sparse. In certain settings, the Vecchia approximation provably achieves accurate approximations at near-linear computational complexity in the number of spatial locations (Schäfer et al., 2020). A generalization of the Vecchia approach includes many popular spatial approximations as special cases (Katzfuss and Guinness, 2019). However, Vecchia approaches have mostly been used for approximating parametric and often isotropic covariance functions.
Isotropic, parametric covariance functions (e.g., Matérn) depend only on spatial distance and on a small number of unknown parameters. Despite being highly restrictive, this is the standard assumption in spatial statistics, especially in the absence of replicates. Approaches to relaxing these assumptions include parametric nonstationary covariances (e.g., as reviewed by Risser, 2016), stationary nonparametric covariances (e.g., Huang et al., 2011; Choi et al., 2013; Porcu et al., 2019), nonparametric and nonstationary covariances (e.g., Fuentes, 2002), and domain transformations (e.g., Sampson and Guttorp, 1992; Damian et al., 2001; Qadir et al., 2019). In the context of local kriging, covariance functions are typically estimated locally from a parametric (e.g., Anderes and Stein, 2011; Nychka et al., 2018) or nonparametric (e.g., Hsing et al., 2016) perspective, but this generally does not imply a valid joint model or a positive-definite covariance matrix.
Outside of spatial statistics, covariance estimation is often performed based on (modified) Cholesky decompositions of the precision matrix. This approach is attractive because it automatically ensures positive-definiteness, because sparsity in the Cholesky factor directly corresponds to ordered conditional independence and hence to directed acyclic graphs, and because it allows covariance estimation to be reformulated as a series of regressions. Regularization can be achieved as in other regression settings, for example by enforcing sparsity using a Lasso-like penalty or a thresholding procedure (e.g., Huang et al., 2006; Levina et al., 2008) or via Bayesian prior distributions (e.g., Smith and Kohn, 2002). Motivated by a Gaussian Markov random field assumption for spatial data, Zhu and Liu (2009) estimate the Cholesky factor based on an ordering of the spatial locations intended to minimize the bandwidth, which amounts to coordinate ordering on a regular grid, and they regularize the entries of the Cholesky factor using a weighted Lasso penalty depending on spatial distance; this approach scales cubically in the number of spatial locations.
Here, we propose scalable nonparametric Bayesian inference on a high-dimensional spatial covariance matrix. The basic idea is to infer a near-linear number of nonzero entries in a sparse Cholesky factor of the inverse covariance matrix. Our model can be viewed as a nonparametric extension of the Vecchia approach, as regularized inference on a sparse Cholesky factor of the precision matrix, or as a series of Bayesian linear regression or spatial prediction problems. We specify prior distributions that are motivated by recent results (Schäfer et al., 2017, 2020) on the exponential decay of the entries of the inverse Cholesky factor for Matérn-type covariances under a maximum-minimum-distance ordering of the spatial locations (Guinness, 2018; Schäfer et al., 2017). Our method scales well to very large datasets, as the number of nonzero entries in the Cholesky factor and the computational cost both scale near-linearly in the number of spatial locations; in effect, we infer a near-linear number of parameters in the sparse inverse Cholesky factor instead of a quadratic number of parameters in the dense covariance matrix. Further speed-ups are possible, as the main computational efforts are perfectly parallel. Our approach is applicable to a single realization of the spatial field, but the inference will be most useful and accurate if replicate observations are available.
2 Methodology
2.1 Sparse inverse Cholesky approximation for spatial data
Consider an $n \times N$ matrix of spatial data,
$\mathbf{Y} = (y_{ji})_{j=1,\ldots,n;\; i=1,\ldots,N}$, (1)
where $y_{ji}$ is the $j$th observation at spatial location $\mathbf{s}_i$. We assume that the locations $\mathbf{s}_1, \ldots, \mathbf{s}_N$, and hence the columns of $\mathbf{Y}$, are ordered according to a maximin ordering (Guinness, 2018; Schäfer et al., 2017), which sequentially selects each location in the ordering to maximize the minimum distance from the locations already selected (see Figure 2).
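To make the ordering concrete, here is a minimal Python sketch of the greedy maximin selection just described. This is a naive quadratic-time version for illustration only (Schäfer et al., 2020, describe quasilinear algorithms); the function name and the choice of starting point are our own assumptions, not part of the paper.

```python
import math

def maximin_ordering(locs, first=0):
    """Greedy maximin ordering: after an arbitrary starting location,
    repeatedly select the location whose minimum distance to the
    already-selected set is largest. Naive O(N^2) illustration."""
    N = len(locs)
    order = [first]
    # minimum distance from each location to the selected set
    mindist = [math.dist(p, locs[first]) for p in locs]
    mindist[first] = -1.0  # mark as selected
    for _ in range(N - 1):
        nxt = max(range(N), key=lambda j: mindist[j])
        order.append(nxt)
        for j in range(N):
            mindist[j] = min(mindist[j], math.dist(locs[j], locs[nxt]))
        mindist[nxt] = -1.0
    return order
```

Each successively ordered location thus tends to fill the largest remaining gap, so early locations form a coarse cover of the domain and later locations refine it.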
We model the rows of $\mathbf{Y}$ as independent $N$-variate Gaussians:
$\mathbf{y}_j = (y_{j1}, \ldots, y_{jN})^\top \overset{iid}{\sim} \mathcal{N}_N(\mathbf{0}, \boldsymbol{\Sigma}), \quad j = 1, \ldots, n.$ (2)
We assume that the data are centered, either using an ad-hoc preprocessing step (e.g., by subtracting location-wise means) or using a more elaborate procedure (see Section 2.8).
Our goal is to make inference on the spatial covariance matrix $\boldsymbol{\Sigma}$ based on the observations $\mathbf{Y}$, in the case where $N$ is large (at least in the thousands) and $n$ is relatively small. Typically, a parametric, and often isotropic, covariance function is assumed, such that $\boldsymbol{\Sigma}$ is a function of only a small number of parameters, which can then be estimated relatively easily. Here, we avoid explicit assumptions of stationarity and isotropy.
We assume a form of ordered conditional independence,
$p(y_{ji} \mid y_{j1}, \ldots, y_{j,i-1}) = p(y_{ji} \mid \mathbf{y}_{j,g(i)}), \quad i = 2, \ldots, N,$ (3)
where $g(i) \subseteq (1, \ldots, i-1)$ is an index vector consisting of the indices of the $m_i = \min(m, i-1)$ nearest neighbors to $\mathbf{s}_i$ among those ordered previously; that is, $\mathbf{s}_{g_k(i)}$ is the $k$th nearest neighbor of $\mathbf{s}_i$ among $\mathbf{s}_1, \ldots, \mathbf{s}_{i-1}$ (see Figure 2). While (3) holds trivially for $m_i = i-1$, for many covariance structures it even holds (at least approximately) for $m \ll N$, as has been demonstrated numerically (e.g., Vecchia, 1988; Stein et al., 2004; Datta et al., 2016; Guinness, 2018; Katzfuss and Guinness, 2019; Katzfuss et al., 2020a, b) and theoretically (Schäfer et al., 2020) in the context of Vecchia approximations of parametric covariance functions. Assume for now that $m$ is known. Consider the modified Cholesky decomposition of the precision matrix:
$\boldsymbol{\Sigma}^{-1} = \mathbf{U} \mathbf{D}^{-1} \mathbf{U}^\top,$ (4)
where $\mathbf{D} = \operatorname{diag}(d_1, \ldots, d_N)$ is a diagonal matrix with positive entries $d_i$, and $\mathbf{U}$ is an upper triangular matrix with unit diagonal (i.e., $U_{ii} = 1$). (To be precise, (4) is the reverse-ordered Cholesky factorization of the reverse-ordered $\boldsymbol{\Sigma}^{-1}$, which simplifies our notation later.) The ordered conditional independence assumed in (3) implies that $\mathbf{U}$ is sparse, with at most $m$ nonzero off-diagonal elements per column (e.g., Katzfuss and Guinness, 2019, Prop. 3.1). We define $\mathbf{u}_i = (U_{g_1(i),i}, \ldots, U_{g_{m_i}(i),i})^\top$ as the vector of nonzero off-diagonal entries in the $i$th column.
2.2 Covariance estimation via Bayesian regressions
From (4), we see that we can estimate the unknown entries of $\boldsymbol{\Sigma}$ by inferring the variables $\mathbf{u}_i$ and $d_i$, $i = 1, \ldots, N$. To do so, our data model (2) can be written as a series of linear regression models (Huang et al., 2006):
$\mathbf{y}_{(i)} \mid \mathbf{u}_i, d_i \sim \mathcal{N}_n(-\mathbf{X}_i \mathbf{u}_i,\; d_i \mathbf{I}_n), \quad i = 1, \ldots, N,$ (5)
where the “response vector” $\mathbf{y}_{(i)}$ is the $i$th column of $\mathbf{Y}$ in (1), consisting of the $n$ observations at the $i$th spatial location, and the “design matrix” $\mathbf{X}_i$ consists of the observations at the neighbor locations of $\mathbf{s}_i$, stored in the columns of $\mathbf{Y}$ with indices $g(i)$; specifically, $\mathbf{X}_i$ is an $n \times m_i$ matrix with $j$th row $\mathbf{y}_{j,g(i)}^\top$.
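To illustrate how these regressions are assembled, the following Python sketch extracts the response vector and design matrix for one location from a data matrix stored as a list of rows (replicates). The function name is ours, and the conditioning sets are assumed to have been computed already.

```python
def regression_data(Y, g, i):
    """For the i-th ordered location (0-based), return the response
    vector (column i of Y, i.e., the n observations at that location)
    and the n x |g[i]| design matrix of observations at the
    previously ordered nearest-neighbor locations g[i]."""
    y_i = [row[i] for row in Y]                     # response vector
    X_i = [[row[j] for j in g[i]] for row in Y]     # design matrix
    return y_i, X_i
```

Because each location yields its own small regression, the N problems can be solved completely independently of one another.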
The Bayesian regression models in (5) are completed by independent conjugate normal-inverse-gamma (NIG) priors:
$\mathbf{u}_i \mid d_i \sim \mathcal{N}_{m_i}(\mathbf{0},\; d_i \mathbf{V}_i), \qquad d_i \sim \mathcal{IG}(\alpha_i, \beta_i),$ (6)
where $\boldsymbol{\theta}$ is a vector of hyperparameters determining $\mathbf{V}_i$, $\alpha_i$, $\beta_i$, and $m$ (see Section 2.3 below). Due to conjugacy, the posterior distributions (conditional on $\boldsymbol{\theta}$) are also NIG:
$d_i \mid \mathbf{y}_{(i)}, \boldsymbol{\theta} \sim \mathcal{IG}(\tilde{\alpha}_i, \tilde{\beta}_i),$ (7)
$\mathbf{u}_i \mid d_i, \mathbf{y}_{(i)}, \boldsymbol{\theta} \sim \mathcal{N}_{m_i}(\tilde{\boldsymbol{\mu}}_i,\; d_i \tilde{\mathbf{V}}_i),$ (8)
where $\tilde{\mathbf{V}}_i = (\mathbf{V}_i^{-1} + \mathbf{X}_i^\top \mathbf{X}_i)^{-1}$, $\tilde{\boldsymbol{\mu}}_i = -\tilde{\mathbf{V}}_i \mathbf{X}_i^\top \mathbf{y}_{(i)}$, $\tilde{\alpha}_i = \alpha_i + n/2$, and $\tilde{\beta}_i = \beta_i + \big(\mathbf{y}_{(i)}^\top \mathbf{y}_{(i)} - \tilde{\boldsymbol{\mu}}_i^\top \tilde{\mathbf{V}}_i^{-1} \tilde{\boldsymbol{\mu}}_i\big)/2$.
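For intuition, here is the standard conjugate NIG update in the simplest case of a single predictor, so that all quantities are scalar. This is a generic textbook computation consistent with the structure of the NIG posterior above (zero prior mean assumed), not the paper's multivariate implementation; all names are illustrative.

```python
def nig_posterior(y, x, v, alpha, beta):
    """Conjugate NIG update for the single-predictor regression
    y_k = u*x_k + eps_k, eps_k ~ N(0, d), with priors
    u | d ~ N(0, d*v) and d ~ InvGamma(alpha, beta).
    Returns posterior (v_tilde, u_hat, alpha_tilde, beta_tilde)."""
    n = len(y)
    # posterior precision of u (scaled by 1/d) and posterior mean
    v_tilde = 1.0 / (1.0 / v + sum(xk * xk for xk in x))
    u_hat = v_tilde * sum(xk * yk for xk, yk in zip(x, y))
    # inverse-gamma parameters absorb the residual sum of squares
    alpha_tilde = alpha + n / 2.0
    beta_tilde = beta + 0.5 * (sum(yk * yk for yk in y)
                               - u_hat * u_hat / v_tilde)
    return v_tilde, u_hat, alpha_tilde, beta_tilde
```

With a nearly flat prior (large v), the posterior mean approaches the least-squares estimate, while a tight prior shrinks it toward zero.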
Using (8), we can easily obtain samples or posterior summaries of the entries of $\mathbf{U}$ and $\mathbf{D}$ conditional on $\boldsymbol{\theta}$. However, in many applications, primary interest will be in computing posterior summaries of $\boldsymbol{\Sigma}$ and other derived quantities. If $N$ is not too large, we can simply compute $\boldsymbol{\Sigma}^{-1}$ (and hence $\boldsymbol{\Sigma}$) from $\mathbf{U}$ and $\mathbf{D}$ via (4). For large $N$, it is often not possible to even hold the entire dense matrix $\boldsymbol{\Sigma}$ in memory, but we can quickly compute useful summaries of it based on the sparse matrices $\mathbf{U}$ and $\mathbf{D}$ (e.g., Katzfuss et al., 2020a). For example, a selected inversion algorithm can compute the variances $\Sigma_{ii}$ and the covariances $\Sigma_{ik}$ corresponding to nonzero entries of the Cholesky factor. We can also compute the covariance matrix of any set of linear combinations $\mathbf{H}\mathbf{y}$ as $\mathbf{W}\mathbf{D}\mathbf{W}^\top$, where $\mathbf{W} = \mathbf{H}\mathbf{U}^{-\top}$. In many applications, including climate-model emulation, it is of interest to sample new spatial fields from the model, which we can do by sampling $\mathbf{z} \sim \mathcal{N}_N(\mathbf{0}, \mathbf{I}_N)$ and then setting $\mathbf{y}^\star = \mathbf{U}^{-\top} \mathbf{D}^{1/2} \mathbf{z}$; if $\mathbf{U}$ and $\mathbf{D}$ are sampled from their posterior distribution given $\mathbf{Y}$, then we have obtained a sample from the posterior predictive distribution.
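The sampling step just described can be sketched in Python as follows: given the Cholesky factor (stored densely here purely for clarity; in practice it is sparse) and a vector of standard-normal draws, one forward substitution produces a draw with the desired covariance. The dense storage and function names are illustrative simplifications.

```python
import math

def sample_field(U, d, z):
    """Draw y ~ N(0, Sigma), where Sigma^{-1} = U diag(d)^{-1} U^T with
    U upper triangular and unit diagonal: solve U^T y = diag(d)^{1/2} z
    by forward substitution (U^T is lower triangular, unit diagonal)."""
    N = len(d)
    w = [z[i] * math.sqrt(d[i]) for i in range(N)]
    y = [0.0] * N
    for i in range(N):
        # row i of U^T is column i of U; its diagonal entry is 1
        y[i] = w[i] - sum(U[k][i] * y[k] for k in range(i))
    return y
```

Because the triangular solve touches only the nonzero entries of the factor, a sparse implementation of this step scales with the number of nonzeros rather than with the full matrix size.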
2.3 Parameterization of the prior distributions
We now discuss parameterizing the NIG priors for $\mathbf{u}_i$ and $d_i$ in (6) as a function of a small number of hyperparameters, $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)$, inspired by the behavior of Matérn-type covariance functions. The parameter $\theta_1$ is related to the marginal variance, while $\theta_2$ and $\theta_3$ are related to the range and smoothness. In general, our prior parameterizations are motivated by interpreting $-\mathbf{u}_i$ and $d_i$ as the kriging weights and kriging variance, respectively, for the spatial prediction problem implied by (3), consisting of predicting $y_{ji}$ from $\mathbf{y}_{j,g(i)}$; due to the maximin ordering, the locations indexed by $g(i)$ all have roughly similar distance to $\mathbf{s}_i$ (see Figure 2), and this distance decreases systematically with $i$.
First, consider the prior for $d_i$ in (6). For an exponential covariance with variance $\sigma^2$ and range $\lambda$, the conditional (kriging) variance given a single neighbor at distance $\ell_i$ is $\sigma^2(1 - e^{-2\ell_i/\lambda})$; assuming $\ell_i \ll \lambda$, we obtain
$d_i \approx 2\sigma^2 \ell_i/\lambda,$ (9)
where $\ell_i$, the distance between location $\mathbf{s}_i$ and its nearest previously ordered neighbor, decreases roughly as $i^{-1/d}$ for a regular grid on a unit hypercube in $d$ dimensions. (Throughout, $i$ is an index and not the imaginary number.) This motivates a prior for $d_i$ that shrinks toward this power-law decay in $i$. While (9) only holds exactly for an exponential covariance under this approximation, Figure 3 illustrates that this functional form approximately holds for other Matérn covariance functions in two dimensions as well. Thus, we set the prior mean of $d_i$ to a decaying function of $i$ governed by the hyperparameters $\theta_1$ and $\theta_2$. In Figure 3, the empirically observed variance of the elements around the fit line decreases with $i$ as well, and so we set the prior standard deviation of $d_i$ to be half of the mean. Solving for $\alpha_i$ and $\beta_i$, we obtain $\alpha_i = 6$ and $\beta_i$ equal to five times the prior mean, because the mean and standard deviation of an $\mathcal{IG}(\alpha_i, \beta_i)$ distribution are $\beta_i/(\alpha_i - 1)$ and $\beta_i/\big((\alpha_i - 1)\sqrt{\alpha_i - 2}\big)$, respectively. Recent results based on elliptic boundary-value problems (Schäfer et al., 2017, Sect. 4.1.2) imply that the Cholesky entry corresponding to the $k$th nearest neighbor decays exponentially as a function of $k$, for Matérn covariance functions whose spectral densities are the reciprocal of a polynomial (ignoring edge effects). Thus, we assume that the $k$th diagonal entry of $\mathbf{V}_i$ in (6) decays exponentially in $k$ at a rate governed by $\theta_3$, divided by the prior mean of $d_i$; we divide by the prior mean of $d_i$ because the prior variance of $\mathbf{u}_i$ in (6) is multiplied by $d_i$. Figure 4 demonstrates this exponential decay as the neighbor number increases.
Finally, consider the choice of the conditioning-set size $m$. Simply setting $m$ to a fixed, reasonable value (e.g., around ten, depending on computational constraints) works well in many settings, but the results can be highly inaccurate if $m$ is chosen too small, and the computational cost is unnecessarily high if $m$ is chosen too large. Hence, we prefer to allow the data to choose $m$ by tying it to the prior decay of the elements of $\mathbf{u}_i$; for all of our numerical experiments, we set $m$ as the largest neighbor number $k$ for which the prior standard deviation of the corresponding Cholesky entries, as implied by $\boldsymbol{\theta}$, still exceeds a small threshold. This coincides with the amount of variation expected to be learnable from the data. Thus, entries of $\mathbf{U}$ with sufficiently small prior variance, as implied by a specific $\boldsymbol{\theta}$, are set to zero, which ensures computational feasibility of our method.
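The data-driven truncation can be sketched as follows. The exponential decay form and the threshold value are illustrative assumptions standing in for the exact rule, which in the paper is tied to the fitted hyperparameters.

```python
import math

def choose_m(log_sd0, decay, tol, m_max=50):
    """Return the largest neighbor number k (up to m_max) whose assumed
    prior standard deviation exp(log_sd0 - decay*k) still exceeds tol;
    Cholesky entries beyond this k are fixed to zero."""
    m = 0
    for k in range(1, m_max + 1):
        if math.exp(log_sd0 - decay * k) >= tol:
            m = k
        else:
            break  # decay is monotone, so we can stop at the first failure
    return m
```

A faster prior decay (larger `decay`) thus automatically yields a smaller conditioning set and a cheaper fit.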
2.4 Inference on the hyperparameters
The hyperparameters $\boldsymbol{\theta}$ determine $\mathbf{V}_i$, $\alpha_i$, $\beta_i$, and $m$ as described in Section 2.3. We now discuss how $\boldsymbol{\theta}$ can be inferred based on the data $\mathbf{Y}$. All elements of $\boldsymbol{\theta}$ are assumed to be positive due to the decay previously discussed, and so we perform all inference on the logarithmic scale.
The crucial ingredient for inference on $\boldsymbol{\theta}$ is the marginal or integrated likelihood, which can be obtained by combining (5) and (6), moving the product over locations outside of the integral over the entries of $\mathbf{U}$ and $\mathbf{D}$, and simplifying:
$p(\mathbf{Y} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} \int p(\mathbf{y}_{(i)} \mid \mathbf{u}_i, d_i)\, p(\mathbf{u}_i \mid d_i, \boldsymbol{\theta})\, p(d_i \mid \boldsymbol{\theta})\, d\mathbf{u}_i\, dd_i$ (10)
$= \prod_{i=1}^{N} (2\pi)^{-n/2} \left(\frac{\det \tilde{\mathbf{V}}_i}{\det \mathbf{V}_i}\right)^{1/2} \frac{\beta_i^{\alpha_i}}{\tilde{\beta}_i^{\tilde{\alpha}_i}} \frac{\Gamma(\tilde{\alpha}_i)}{\Gamma(\alpha_i)},$ (11)
where $\Gamma(\cdot)$ denotes the gamma function, the prior parameters $\mathbf{V}_i$, $\alpha_i$, $\beta_i$ are given in (6), and the posterior parameters $\tilde{\mathbf{V}}_i$, $\tilde{\alpha}_i$, $\tilde{\beta}_i$ are given in (8).
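A scalar (single-predictor) analogue of this integrated likelihood can be written directly from the NIG posterior quantities. The following self-contained Python sketch uses the generic NIG marginal-likelihood formula; it illustrates the structure of the computation rather than reproducing the paper's implementation.

```python
import math

def log_marginal(y, x, v, alpha, beta):
    """Log integrated likelihood of the single-predictor NIG regression
    y_k = u*x_k + eps_k, eps_k ~ N(0, d), with u | d ~ N(0, d*v) and
    d ~ InvGamma(alpha, beta), integrating u and d out analytically."""
    n = len(y)
    # conjugate posterior quantities (scalar analogue of the update above)
    v_t = 1.0 / (1.0 / v + sum(xk * xk for xk in x))
    u_hat = v_t * sum(xk * yk for xk, yk in zip(x, y))
    a_t = alpha + n / 2.0
    b_t = beta + 0.5 * (sum(yk * yk for yk in y) - u_hat * u_hat / v_t)
    # standard NIG marginal-likelihood formula
    return (0.5 * math.log(v_t / v)
            + alpha * math.log(beta) - a_t * math.log(b_t)
            + math.lgamma(a_t) - math.lgamma(alpha)
            - 0.5 * n * math.log(2.0 * math.pi))
```

Because the full integrated likelihood factors over locations, the log-likelihood of the whole dataset is simply a sum of such per-location terms, which is what makes parallelization trivial.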
Based on this integrated likelihood, both empirical and fully Bayesian inference are straightforward. Empirical Bayesian inference is based on a point estimate of $\boldsymbol{\theta}$ obtained by numerically maximizing the log integrated likelihood. Fully Bayesian inference requires the specification of a hyperprior for $\boldsymbol{\theta}$, which we simply assume to be flat (on the log scale). As a result, the posterior distribution of $\boldsymbol{\theta}$ is proportional to the integrated likelihood in (11). While this distribution cannot be obtained analytically, we can sample from the posterior using the Metropolis–Hastings (MH) algorithm. To avoid slow mixing due to large negative correlation between elements of $\boldsymbol{\theta}$, we employ an adaptive MH algorithm that proposes the elements of $\boldsymbol{\theta}$ jointly and learns its proposal covariance matrix online; specifically, we use the implementation in R by Scheidegger (2012).

2.5 Computational complexity
The cost of inference, including computing the posteriors in (8), sampling $\mathbf{u}_i$ and $d_i$, or evaluating the integrated likelihood in (11), is dominated by computing the matrix $\mathbf{X}_i^\top \mathbf{X}_i$, which requires $O(n m_i^2)$ time, and decomposing the resulting $m_i \times m_i$ matrix, which requires $O(m_i^3)$ time, for each $i$. Hence, the time complexity is near-linear in $N$ for each unique value of $\boldsymbol{\theta}$, where $m$ is often very small (e.g., around ten in most of our numerical experiments). In addition, the most expensive computations can be carried out in parallel over $i$.
For very small numbers of replicates, with $n < m_i$, we can use alternative expressions (see below (8)) relying on computing and decomposing an $n \times n$ matrix (instead of the $m_i \times m_i$ matrix above), which further reduces the computation time.
2.6 Asymptotics
Assume temporarily that (2) holds for some true positivedefinite covariance matrix , with fixed and . Then, the data model with the true can be written in the regression form (5) with . Under these assumptions, there are a fixed number (depending only on , not on ) of variables in the regression models, and our prior distributions on the , , and
place nonzero mass on the true model. Hence, using wellknown asymptotic results based on the Bernstein–von Mises theorem
(e.g., Van der Vaart, 2000), the posterior distributions will be asymptotically normal and our posterior of will contract around the true covariance as the number of independent replicates approaches infinity. However, results of this nature are of limited use here, as we are most interested in the case , which we will examine numerically in Sections 3 and 4.2.7 Correlationbased ordering
For our methods, as discussed in Section 2.1, we recommend a maximin ordering of the variables, with conditioning sets selected based on the nearest previously ordered variables and $m$ determined as described at the end of Section 2.3. So far, these tasks were assumed to be based on the Euclidean distance between the corresponding locations (see Figure 2), which implies that our priors shrink toward isotropy (i.e., toward distributions for which dependence is only a function of distance). This shrinkage is not appropriate in some real-data applications. However, it is relatively straightforward to adapt our methods to processes (e.g., anisotropic or nonstationary ones) for which Euclidean distance is not meaningful. We merely require some prior guess of the correlation structure, based on expert knowledge, historical data, or (a regularized version of) the sample correlation of the data $\mathbf{Y}$; a simple choice used here is the elementwise product of the sample correlation and an (isotropic) exponential correlation with a large range parameter (e.g., half the maximum distance between any pair of locations in the dataset). Then, our procedures can be carried out as before, except that the ordering and nearest-neighbor conditioning are based on a correlation distance, a decreasing function of the guessed correlation between two variables. This implicitly scales the space, so that the process is approximately isotropic in the transformed space. This approach can increase accuracy in the context of Vecchia approximations of parametric covariances (Kang and Katzfuss, in prep.); we propose it here for our nonparametric procedures. Schäfer et al. (2020, Alg. 7) allows us to compute the correlation-based ordering and conditioning sets in quasilinear time in $N$.
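For illustration, one natural choice of correlation distance is the Euclidean distance between two standardized variables with a given correlation. The exact functional form used by the method may differ, so treat this as an assumption.

```python
import math

def correlation_distance(rho):
    """Distance implied by a correlation rho in [-1, 1]: the Euclidean
    distance between two standardized variables with that correlation,
    sqrt(2*(1 - rho)). (Illustrative assumption for the exact form.)
    Decreases to 0 as rho increases to 1."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - rho)))
```

Any such monotone transformation preserves the nearest-neighbor structure implied by the correlation guess, which is all the ordering and conditioning steps require.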
2.8 Noise or spatial trend
Our methodology described so far is most appropriate if the data are observed without any noise or nugget, meaning that realizations of the underlying spatial field are continuous over space; in this setting, approximations based on sparse inverse Cholesky factors of many popular covariance functions can be highly accurate (e.g., Katzfuss and Guinness, 2019; Schäfer et al., 2020).
Now consider noisy observations $z_{ji} = y_{ji} + \epsilon_{ji}$, with independent noise $\epsilon_{ji} \sim \mathcal{N}(0, \tau^2)$ and $\mathbf{y}_j$ as in (2). One option is to simply apply our methodology directly to the data $\mathbf{Z} = (z_{ji})$ as before; this will likely work well if the noise variance $\tau^2$ is small, but the conditional-independence assumption in (3) is less appropriate if $\tau^2$ is large (e.g., Katzfuss and Guinness, 2019), meaning that a much larger $m$ might be necessary. A larger $m$ results in higher computational cost and potentially lower accuracy due to the higher number of Cholesky entries that must be estimated.
Hence, for large noise levels, we instead propose a Gibbs sampler that iterates between sampling the latent field $\mathbf{Y}$ conditional on $\mathbf{Z}$, $\mathbf{U}$, $\mathbf{D}$, and $\tau^2$, and sampling $\boldsymbol{\theta}$ and the entries of $\mathbf{U}$ and $\mathbf{D}$ conditional on the sampled $\mathbf{Y}$ as in Sections 2.2 and 2.4. The former task can be accomplished without increasing the computational complexity of each Gibbs iteration, by exploiting the sparsity of the Cholesky factor of the prior precision and approximating the Cholesky factor of the posterior precision using an incomplete Cholesky factorization to avoid fill-in, as described in Schäfer et al. (2020, Sect. 4.1). (If $\tau^2$ is unknown, it is straightforward to sample from its full-conditional distribution as well.)
A similar Gibbs-sampling strategy can be employed to make inference on a spatial trend. For example, if the observations are given by $\mathbf{y}_j$ plus a linear spatial trend with a Gaussian prior on the trend coefficients, the coefficients can be sampled in closed form conditional on all other quantities, and all other unknown quantities can be sampled given the trend coefficients as before, based on the $\mathbf{Y}$ obtained by subtracting the trend from the observations.
3 Simulation study
We compared the following methods:

- SCOV: basic sample covariance.
- OURS: our method described in the previous sections.
- MLE: estimate based on the MLEs of $\mathbf{u}_i$ and $d_i$ for the regressions in (5) (i.e., no prior shrinkage), with the conditioning sets implied by the OURS estimate of $\boldsymbol{\theta}$.
- LASSO: Lasso for each regression in (5), with all previously ordered points included as possible predictors (i.e., $g(i) = (1, \ldots, i-1)$).
- SLASSO: spatial Lasso with penalty scaled by the spatial distance, to favor inclusion of nearer points as predictors, intended to be similar to Zhu and Liu (2009).
The spatial domain for all comparisons was the unit square.
3.1 Uncertainty quantification
First, we fit a fully Bayesian version of OURS to simulated data, to demonstrate the uncertainty quantification in the covariance estimation. Specifically, we considered realizations of a Gaussian process with a Matérn covariance function with variance 3, smoothness 1, and range parameter 0.25, observed at randomly sampled locations. We obtained 50,000 samples of $\boldsymbol{\theta}$ using an adaptive MCMC algorithm (Scheidegger, 2012). The trace plots showed good mixing and convergence, and the effective sample sizes for the three hyperparameters were all larger than 1,000. After conservatively discarding the first half of the samples as burn-in and thinning by a factor of 50, a covariance matrix was calculated from a sample of the $\mathbf{u}_i$ and $d_i$ from (7)–(8) for each remaining draw of $\boldsymbol{\theta}$.
[Figure: (a) Sample estimates (SCOV) and posterior 80% credible intervals using our fully Bayesian method (OURS) for 20 entries of the covariance matrix. (b) 50%, 80%, and 95% credible intervals using OURS for one randomly sampled entry of the covariance matrix corresponding to each unique distance.]
Figure 4(a) shows the resulting 80% posterior credible intervals (CIs), along with the SCOV estimates, for 20 randomly sampled matrix entries as a function of the distance between the corresponding spatial locations. Most of the OURS CIs contained the true value and tracked the decay of the covariance as a function of distance. The same general trend holds for the CIs at all distances shown in Figure 4(b).
3.2 Comparison to LASSO for small $N$
We compared estimation accuracy using the Kullback–Leibler (KL) divergence between the estimated distribution $\mathcal{N}_N(\mathbf{0}, \hat{\boldsymbol{\Sigma}})$ and the true distribution $\mathcal{N}_N(\mathbf{0}, \boldsymbol{\Sigma})$:
$\mathrm{KL}\big(\mathcal{N}_N(\mathbf{0}, \hat{\boldsymbol{\Sigma}}) \,\|\, \mathcal{N}_N(\mathbf{0}, \boldsymbol{\Sigma})\big) = \tfrac{1}{2}\big(\operatorname{tr}(\boldsymbol{\Sigma}^{-1} \hat{\boldsymbol{\Sigma}}) - N + \log\det\boldsymbol{\Sigma} - \log\det\hat{\boldsymbol{\Sigma}}\big),$
where $\operatorname{tr}(\cdot)$ denotes the trace and $\det(\cdot)$ denotes the determinant. This exclusive KL divergence does not require inverting the estimate $\hat{\boldsymbol{\Sigma}}$ and thus avoids issues with SCOV for $n < N$. To obtain a point estimate for OURS, we computed $\hat{\boldsymbol{\Sigma}}^{-1} = \hat{\mathbf{U}} \hat{\mathbf{D}}^{-1} \hat{\mathbf{U}}^\top$, where $\hat{\mathbf{U}}$ and $\hat{\mathbf{D}}$ were obtained from the maximum a posteriori (MAP) estimates in (8), based on the value of $\boldsymbol{\theta}$ that maximized the integrated likelihood (11).
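The KL divergence above can be computed directly from the two covariance matrices. The following Python sketch does so for the 2x2 case (with the inverse and determinant written out explicitly) purely for illustration; in higher dimensions one would use a Cholesky-based log-determinant instead.

```python
import math

def kl_gauss_2d(S_hat, S):
    """KL( N(0, S_hat) || N(0, S) ) for 2x2 covariance matrices:
    0.5 * ( tr(S^{-1} S_hat) - 2 + log det S - log det S_hat )."""
    det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    det_Sh = S_hat[0][0] * S_hat[1][1] - S_hat[0][1] * S_hat[1][0]
    # explicit 2x2 inverse of the true covariance
    Sinv = [[S[1][1] / det_S, -S[0][1] / det_S],
            [-S[1][0] / det_S, S[0][0] / det_S]]
    tr = sum(Sinv[i][k] * S_hat[k][i] for i in range(2) for k in range(2))
    return 0.5 * (tr - 2 + math.log(det_S) - math.log(det_Sh))
```

The divergence is zero exactly when the two covariance matrices coincide, and grows as the estimate departs from the truth.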
Figure 6 shows the results, using the same setup with as in Section 3.1, for various numbers of replicates . MLE was similarly accurate as OURS for large , as expected, but it performed worse for small due to the lack of prior shrinkage. Similarly, the inclusion of spatial information in SLASSO resulted in higher accuracy than LASSO for small . LASSO and SLASSO were not competitive with OURS and MLE, despite increased flexibility in selecting predictors (i.e., conditioning sets) in the regressions (5), and despite much higher computational cost due to calculations involving all possible predictors. Hence, we did not consider (S)LASSO further.
3.3 Comparison for larger $N$
Figure 7 shows further comparisons at a larger number of spatial locations using the KL divergence in four different settings (counterclockwise from top right), all with a marginal variance of 5: a Matérn covariance with smoothness 1 and range parameter 0.5 on a regular spatial grid (corresponding to the middle panel in the bottom row of Figures 3 and 4); a Cauchy covariance with range 0.25 and memory parameters 1 and 0.5 on a regular grid; a Matérn covariance with varying anisotropy (Paciorek and Schervish, 2006), for which the range parameter is constant at 0.05 in one coordinate direction but varies as a function of that coordinate in the other direction, on a regular grid; and a Matérn covariance with smoothness 1 and range 0.25 at randomly spaced locations sampled uniformly.
For all scenarios, MLE was roughly as accurate as OURS for very large $n$, but performed poorly for small $n$, indicating that the added shrinkage from our prior improved the accuracy. OURS strongly outperformed SCOV in all settings. For the nonstationary covariance, we also considered the correlation-based ordering (COR) described in Section 2.7. While we used the true correlation for the comparison here, the elementwise product of the sample covariance and an exponential correlation proposed in Section 2.7 resulted in comparable accuracy (not shown). As expected, OURS-COR performed better than OURS-MM in this nonstationary setting. We also conducted some experiments (not shown) using a natural ordering by one of the spatial coordinates, which performed comparably to the maximin ordering for isotropic covariances on a regular grid, but was much less accurate for randomly sampled locations.
Overall, our method performed well across all simulations, even though our prior distributions were motivated by isotropic Matérn-like covariances. In addition, the computational burden of OURS was relatively low, with the estimated $m$ often around ten and always below 30. While we only considered moderate $N$ here, in order to be able to carry out many comparisons using the KL divergence, it is also possible to run our method on much larger datasets. For example, using a C++ implementation, evaluating the integrated likelihood (11) once took only about 6 seconds on a 4-core laptop (Intel i7-7560U), even for much larger values of $N$ and $n$.
4 Climate-model emulation
We analyzed climate-model output from the Community Earth System Model (CESM) Large Ensemble Project (Kay et al., 2015). Specifically, we considered daily mean surface temperature (in Kelvin) on July 1 in consecutive years starting in model year 402, on a regular longitude-latitude grid containing much of the Americas (see Figure 1). The chosen region features ocean, land, islands, and mountain ranges, leading to a complicated, nonstationary dependence structure. The data were defined as the temperature anomalies obtained by standardizing the climate-model output at each grid point to zero mean and unit variance. We found no evidence of temporal correlation in the data, and so the assumption of independent replicates in (2) was at least approximately satisfied.
First, we compared three covariance estimates: an exponential covariance with a range parameter estimated from the data (EXP); a tapered sample covariance given by the elementwise product of the sample covariance and an exponential correlation with a range of 6,000 km, with a small nugget added for numerical stability (SCOVT); and the MAP estimate (as in Section 3.2) using our method with correlation-based ordering (Section 2.7) derived from the SCOVT matrix (OURS). Of the 98 replicates (i.e., years), we randomly selected and withheld 18 as test data, and fit the models on subsets of various sizes between 6 and 80. As the true distribution was unknown, it was not possible to compute the KL divergence. Instead, we used the strictly proper log score (e.g., Gneiting and Katzfuss, 2014), given by the average negative log posterior predictive density of the test data based on (2), with $\boldsymbol{\Sigma}$ replaced by each of the three estimates.
Figure 8 shows the resulting scores, averaged over five random training/test splits. OURS was more accurate than SCOVT for all training-set sizes, and more accurate than EXP for all sufficiently large training-set sizes. We also tried OURS with Euclidean (instead of correlation-based) ordering, which resulted in similar scores for large training sets but required almost twice as many replicates to surpass EXP (not shown). While it may be possible to find other (e.g., parametric nonstationary) methods that result in even lower scores than OURS for this dataset, such methods would likely require substantial manual tuning (e.g., specifying the parametric form of the nonstationarity).
We created a stochastic simulator emulating the climate model by fitting a fully Bayesian version of OURS to the full dataset and sampling from the posterior predictive distribution as described at the end of Section 2.2. Four such samples are shown in Figure 9; they look qualitatively similar to the actual samples from the climate model in Figure 1, including reproducing features corresponding to land/ocean effects, despite using no explicit information on land boundaries. These results were based on 50,000 Metropolis–Hastings (MH) samples of $\boldsymbol{\theta}$ (after a burn-in of 50,000), with trace plots showing good mixing and effective sample sizes all larger than 1,000; the samples were then thinned by a factor of 50. It took about 200 minutes to train the emulator, and 2.5 seconds to obtain a sample for a given value of $\boldsymbol{\theta}$, on a 4-core laptop (Intel i7-7560U) without parallelization.
5 Conclusions
We have developed a scalable, flexible Bayesian model for spatial covariance estimation and emulation. We regularize our method by taking advantage of a form of ordered conditional independence often assumed for spatial data. This motivates the assumption of sparsity in the Cholesky factor of the precision matrix, which greatly improves scalability and reduces the number of unknown parameters from quadratic to near-linear in the number of spatial locations. We describe three hyperparameters related to the marginal variance and the decay of the Cholesky entries; these hyperparameters can be quickly optimized or sampled, resulting in an automatic data-based selection of the sparsity structure. Hence, our method requires no manual tuning or cross-validation. While our approach was motivated by the behavior of isotropic covariances on regular grids, our numerical comparisons demonstrated its generality for more complex covariances and irregularly spaced locations. We also applied the method to climate-model emulation, where it captured the nonstationary behavior better than standard methods. Template R code for our implementation is provided with this article.
There are several interesting extensions of our spatial covariance estimation procedure. Our method can be extended to handle missing values by imputation, using a Gibbs sampler similar to the ones described in Section 2.8; however, if the number of observations at a particular location is very small or even zero, the posterior distribution at that location will be very vague and thus generally not particularly useful, unless additional assumptions about the covariance between the unobserved and observed locations are made. For example, more explicit shrinkage toward a specific parametric covariance could be achieved by setting the prior means of the nonzero entries of $\mathbf{U}$ and $\mathbf{D}$ in (6) to the values implied by a parametric Vecchia approximation (e.g., Katzfuss and Guinness, 2019, Sec. 4.1). Another potential extension is to estimate the covariance as a function of external variables by including them as additional covariates in the regressions in (5); for instance, for climate-model emulation, the covariance could depend on season, year, elevation, or land versus ocean. Finally, our approach could be extended to data assimilation, by using it to infer the forecast covariance matrices in an ensemble Kalman filter.
Acknowledgments
Katzfuss’s research was partially supported by National Science Foundation (NSF) Grants DMS–1654083, DMS–1953005, and CCF–1934904. We would like to thank Mohsen Pourahmadi, Florian Schäfer, Will Boyles, and Joe Guinness for helpful comments and suggestions.
References

Anderes and Stein (2011)
Anderes, E. B. and Stein, M. L. (2011).
Local likelihood estimation for nonstationary random fields.
Journal of Multivariate Analysis
, 102(3):506–520.  Castruccio et al. (2014) Castruccio, S., McInerney, D. J., Stein, M. L., Crouch, F. L., Jacob, R. L., and Moyer, E. J. (2014). Statistical emulation of climate model projections based on precomputed GCM runs. Journal of Climate, 27(5):1829–1844.
 Castruccio and Stein (2013) Castruccio, S. and Stein, M. L. (2013). Global spacetime models for climate ensembles. Annals of Applied Statistics, 7(3):1593–1611.
 Choi et al. (2013) Choi, I. K., Li, B., and Wang, X. (2013). Nonparametric estimation of spatial and spacetime covariance function. Journal of Agricultural, Biological, and Environmental Statistics, 18(4):611–630.
 Damian et al. (2001) Damian, D., Sampson, P. D., and Guttorp, P. (2001). Bayesian estimation of semiparametric nonstationary spatial covariance structures. Environmetrics, 12(2):161–178.
 Datta et al. (2016) Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearestneighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.
 Fuentes (2002) Fuentes, M. (2002). Spectral methods for nonstationary spatial processes. Biometrika, 89(1):197–210.
 Gneiting and Katzfuss (2014) Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1(1):125–151.
 Guinness (2018) Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.

Haugen, M. A., Stein, M. L., Sriver, R. L., and Moyer, E. J. (2019). Future climate emulations using quantile regressions on large ensembles. Advances in Statistical Climatology, Meteorology, and Oceanography, 5:37–55.
Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological, and Environmental Statistics, 24(3):398–425.
Hsing, T., Brown, T., and Thelen, B. (2016). Local intrinsic stationarity and its inference. Annals of Statistics, 44(5):2058–2088.
Huang, C., Hsing, T., and Cressie, N. (2011). Nonparametric estimation of the variogram and its spectrum. Biometrika, 98(4):775–789.
Huang, J. Z., Liu, N., Pourahmadi, M., and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1):85–98.
Katzfuss, M. and Guinness, J. (2019). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, accepted.
Katzfuss, M., Guinness, J., Gong, W., and Zilber, D. (2020a). Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological, and Environmental Statistics, 25(3):383–414.
Katzfuss, M., Guinness, J., and Lawrence, E. (2020b). Scaled Vecchia approximation for fast computer-model emulation. arXiv:2005.00386.
Kay, J. E., Deser, C., Phillips, A., Mai, A., Hannay, C., Strand, G., Arblaster, J. M., Bates, S. C., Danabasoglu, G., Edwards, J., Holland, M., Kushner, P., Lamarque, J. F., Lawrence, D., Lindsay, K., Middleton, A., Munoz, E., Neale, R., Oleson, K., Polvani, L., and Vertenstein, M. (2015). The Community Earth System Model (CESM) Large Ensemble Project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society, 96(8):1333–1349.
Levina, E., Rothman, A., and Zhu, J. (2008). Sparse estimation of large covariance matrices via a nested Lasso penalty. Annals of Applied Statistics, 2(1):245–263.
Nychka, D., Hammerling, D., Krock, M., and Wiens, A. (2018). Modeling and emulation of nonstationary Gaussian fields. Spatial Statistics, 28:21–38.
Paciorek, C. and Schervish, M. (2006). Spatial modelling using a new class of nonstationary covariance functions. Environmetrics, 17(5):483–506.
Porcu, E., Bissiri, P. G., Tagle, F., and Quintana, F. (2019). Nonparametric Bayesian modeling and estimation of spatial correlation functions for global data. Technical report.
Qadir, G. A., Sun, Y., and Kurtek, S. (2019). Estimation of spatial deformation for nonstationary processes via variogram alignment. arXiv:1911.02249.
Risser, M. D. (2016). Review: Nonstationary spatial modeling, with emphasis on process convolution and covariate-driven approaches. arXiv:1610.02447.
Sampson, P. D. and Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119.
Schäfer, F., Katzfuss, M., and Owhadi, H. (2020). Sparse Cholesky factorization by Kullback-Leibler minimization. arXiv:2004.14455.
Schäfer, F., Sullivan, T. J., and Owhadi, H. (2017). Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. arXiv:1706.02205.

Scheidegger, A. (2012). adaptMCMC: Implementation of a generic adaptive Monte Carlo Markov Chain sampler. R package version 1.0.3.
Smith, M. and Kohn, R. (2002). Parsimonious covariance matrix estimation for longitudinal data. Journal of the American Statistical Association, 97(460):1141–1153.
Stein, M. L., Chi, Z., and Welty, L. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66(2):275–296.
Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.
Vecchia, A. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50(2):297–312.
Zhu, Z. and Liu, Y. (2009). Estimating spatial covariance using penalised likelihood with weighted L1 penalty. Journal of Nonparametric Statistics, 21(7):925–942.