1 Introduction
Providing valid predictions of the response at an unobserved location is a fundamental problem in spatial statistics. For example, epidemiologists may wish to extrapolate air pollution concentrations from a network of stationary monitors to the residential locations of the study participants. There are a number of challenges one faces in carrying out valid prediction at a new spatial location, but one of the most pressing is that existing methods are model-based, so the reliability of the predictions depends crucially on the soundness of the posited model. For example, prediction intervals based on Kriging—see Cressie (1992) and Section 2.1
—often rely on normality, and stationarity is often assumed to facilitate estimating the spatial covariance function required for Kriging. It is now common to perform geostatistical analysis for massive datasets collected over a vast and diverse spatial domain
(e.g., Heaton et al., 2019). For complex processes observed over a large domain, the normality and stationarity assumptions can be questionable. Failing to account for nonstationarity can affect prediction accuracy, but typically has a larger effect on uncertainty quantification such as prediction intervals (Fuglstad et al., 2015). While there are now many methods available for dealing with nonstationarity (see Risser, 2016, for a recent review) and non-Gaussianity (e.g., Gelfand et al., 2005; Duan et al., 2007; Reich and Fuentes, 2007; Rodriguez and Dunson, 2011), these typically involve heavy computations. This exacerbates the already imposing computational challenges posed by massive datasets. Further, fitting the entire stochastic process may be unnecessary if only prediction intervals are desired. Therefore, having a method that provides provably valid prediction intervals without requiring specification of a statistical model—and, hence, not inheriting the risk of model misspecification biases—would be very useful to practitioners. Developing such a method is the goal of this paper.

In recent years, the use of machine learning techniques in statistics has become increasingly common. While there are numerous examples of this phenomenon, the one most relevant here is
conformal prediction. This method originated in Vovk et al. (2005) and the references therein (see, also, Shafer and Vovk, 2008), but has appeared frequently in the recent statistics literature (e.g., Lei and Wasserman, 2014; Lei et al., 2018; Guan, 2019; Romano et al., 2019; Tibshirani et al., 2019). What makes this method especially attractive is that it provides provably valid prediction intervals without specification of a statistical model. More precisely, the conformal prediction intervals achieve the nominal frequentist prediction coverage probability, uniformly over all data distributions; see Section
2.2. The crucial assumption behind the validity of conformal prediction is that the data are exchangeable.

Whether it is reasonable to assume exchangeability in a spatial application depends on how the data are sampled. On the one hand, if the locations are randomly sampled in the spatial domain, then exchangeability holds automatically; see Lemma 1
. Therefore, the standard conformal prediction framework can be used basically off the shelf. On the other hand, if the locations are fixed in the spatial domain, then exchangeability does not hold in general. But we show that, for a wide range of spatial processes, the response variables at tightly concentrated locations are approximately exchangeable; see Theorem
2. Therefore, a version of the basic conformal prediction method applied to these tightly concentrated observations ought to be approximately valid.

Using this insight about the connection between exchangeability and the sampling design, we propose two related spatial conformal prediction methods. The first, a so-called global spatial conformal prediction (GSCP) method, described in Section 3, is designed specifically for cases where the spatial locations are sampled at random. In particular, this global method produces a prediction interval which is marginally valid, i.e., valid on average with respect to the distribution of the target location at which prediction is desired. The second, a local spatial conformal prediction (LSCP) method, described in Section 4, is designed specifically for the case when the spatial locations are fixed. Since our goal is to proceed without strong assumptions about the spatial dependence structure, it is only possible to establish approximate or local exchangeability. Therefore, the proposed local spatial conformal prediction method can only provide approximately valid predictions; see Theorem 3. But our goal in the fixed-location case resembles the “conditional validity” target in the conformal prediction literature (e.g., Barber et al., 2019) so, given the impossibility theorems in the latter context, approximate validity is all that can be expected.
For both the global and local formulations, our proposed method is model-free, robust, and computationally feasible for large datasets. In Sections 5 and 6, we show using real and simulated data that the proposed methods outperform both standard global Kriging and local approximate Gaussian process regression (laGP; Gramacy and Apley, 2015) for nonstationary and non-Gaussian data. In addition to being useful for spatial applications, this work also represents a major advance in extending conformal prediction to the case of dependent data, and establishes the conditions on the spatial sampling design and data-generating mechanism that ensure validity—or at least approximate validity—of the conformal prediction intervals.
2 Background
2.1 Spatial prediction methods
Let Y(s) be the observation at spatial location s. The data consist of n observations Y(s₁), …, Y(sₙ). Geostatistical analysis often assumes that the data follow a Gaussian process with stationary and isotropic covariance function. That is, Y(s) = Z(s) + ε(s), where Z is a mean-zero Gaussian process with covariance Cov{Z(s), Z(s′)} = σ² ρ(‖s − s′‖), a function of the distance between locations s and s′, and ε(s) is error with variance τ². A common example is the Matérn correlation function
    ρ(d) = {2^(1−ν) / Γ(ν)} (d/φ)^ν K_ν(d/φ),

where φ is the correlation range, ν is the smoothness parameter, and K_ν is the modified Bessel function of the second kind. Covariates can easily be added to the mean of Y(s), but for simplicity we assume the process has mean zero. The main assumptions of this model are that the data are Gaussian and the covariance function is stationary and isotropic, i.e., it is a function only of the distance between spatial locations and is thus the same across the spatial domain.
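As a quick numerical reference, the sketch below (our own illustration, not code from the paper) evaluates the Matérn correlation for the half-integer smoothness values that admit closed forms, using the u = d/φ scaling above; note that some texts instead rescale the distance by √(2ν):

```python
import math

def matern(d, phi=0.1, nu=0.5):
    """Matern correlation at distance d, half-integer smoothness nu, u = d / phi.

    Closed forms of the general Bessel-function expression:
      nu = 0.5: exp(-u)                   (exponential correlation)
      nu = 1.5: (1 + u) exp(-u)
      nu = 2.5: (1 + u + u^2 / 3) exp(-u)
    """
    u = d / phi
    if nu == 0.5:
        return math.exp(-u)
    if nu == 1.5:
        return (1.0 + u) * math.exp(-u)
    if nu == 2.5:
        return (1.0 + u + u * u / 3.0) * math.exp(-u)
    raise ValueError("only nu in {0.5, 1.5, 2.5} has a closed form here")
```

Larger ν gives a smoother process; in all cases the correlation decays monotonically from 1 at distance zero.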
To make prediction inference at the new location s₀, we study the random variable Y(s₀) together with the other Y(sᵢ)’s. The full data set, including the response to be predicted, Y(s₀), is denoted D₊, and the collection of data excluding Y(s₀) (but including the location s₀) is denoted D; that is, D represents the full observed data set together with the target location. The joint distribution of {Y(s₀), Y(s₁), …, Y(sₙ)}, given the locations and the spatial covariance parameters θ = (σ², τ², φ, ν), is multivariate normal with mean 0, a vector of zeros of length n + 1, and covariance matrix whose (i, j) element is Cov{Y(sᵢ), Y(sⱼ)}. Let θ̂ be an estimate of θ. The Kriging prediction and variance of Y(s₀), given D and θ̂, are

    Ŷ(s₀) = Σ̂₀ᵀ Σ̂ₙ⁻¹ Y  and  σ̂²(s₀) = Ĉ(s₀, s₀) − Σ̂₀ᵀ Σ̂ₙ⁻¹ Σ̂₀, (1)

where Y = {Y(s₁), …, Y(sₙ)}ᵀ, Σ̂ₙ is the n × n covariance matrix of Y, Σ̂₀ is the n-vector of covariances between Y(s₀) and Y, and Ĉ(s₀, s₀) = σ̂² + τ̂², all evaluated at θ̂.
Assuming normality and stationarity gives the predictive distribution Y(s₀) | D ~ Normal{Ŷ(s₀), σ̂²(s₀)} and standardized residuals

    e(s₀) = {Y(s₀) − Ŷ(s₀)} / σ̂(s₀). (2)
In particular, a 100(1 − α)% prediction interval for Y(s₀) is Ŷ(s₀) ± z_{α/2} σ̂(s₀), where z_{α/2} is the upper α/2 quantile of a standard normal distribution.
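The Kriging recipe above can be sketched in a few lines. This is a hypothetical mean-zero simple-Kriging helper, not the paper's implementation; the covariance function `cov(s, t)` is a user-supplied assumption, with the nugget included on the diagonal:

```python
import numpy as np
from statistics import NormalDist

def kriging_interval(y, locs, s0, cov, alpha=0.10):
    """Mean-zero simple-Kriging prediction and 100(1 - alpha)% interval at s0.

    cov(s, t) is an assumed user-supplied covariance function, with the nugget
    included on the diagonal (i.e., when s == t).
    """
    n = len(y)
    Sigma = np.array([[cov(locs[i], locs[j]) for j in range(n)] for i in range(n)])
    sigma0 = np.array([cov(s0, locs[i]) for i in range(n)])
    w = np.linalg.solve(Sigma, sigma0)         # Kriging weights
    yhat = float(w @ y)                        # point prediction, as in (1)
    var = float(cov(s0, s0) - sigma0 @ w)      # Kriging variance, as in (1)
    z = NormalDist().inv_cdf(1 - alpha / 2)    # upper alpha/2 normal quantile
    half = z * max(var, 0.0) ** 0.5
    return yhat, yhat - half, yhat + half
```

For instance, with an exponential covariance plus nugget, the interval at the midpoint of two observed locations is centered at a weighted average of the two observations.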
2.2 Conformal prediction
In this section we take a step back from spatial data analysis and review conformal prediction for non-spatial problems. Suppose we have a data sequence Z₁, Z₂, …, assumed to be exchangeable with (joint) distribution P. No assumptions about P are made here, beyond that it is exchangeable. We observe values of Z₁, …, Zₙ and the goal is to predict Z_{n+1}. More specifically, we seek a procedure that returns, for any α ∈ (0, 1), a prediction interval C_α(Z₁, …, Zₙ) that is valid in the sense that

    inf_P P{Z_{n+1} ∈ C_α(Z₁, …, Zₙ)} ≥ 1 − α, (3)

where P{·} represents the distribution of (Z₁, …, Z_{n+1}) under P, and the infimum is over all exchangeable distributions P. That we require attaining the nominal frequentist coverage for all P rules out the use of model-based procedures, such as likelihood or Bayesian methods.
The original conformal prediction method proceeds as follows. Define a nonconformity measure ψ, a function that takes two arguments: the first is a “bag” B that consists of a finite collection of data points; the second is a single data point z. Then ψ(B, z) measures how closely z represents the data points in bag B. For example, if ŷ is a prediction rule and d is some measure of distance, then we might take ψ(B, z) = d{z, ŷ(B)}, the distance between z and the value returned by the prediction rule applied to B. The choice of nonconformity measure depends on the context and is at the discretion of the data analyst; often there is a natural choice, as in our spatial prediction setting described in Section 3. The only technical requirement is symmetry in the first argument; that is, shuffling the data in bag B does not change the value of ψ(B, z).
Given the nonconformity measure ψ, the next step is to transform the data in a suitable way via ψ. Specifically, augment the observed data Z₁, …, Zₙ with a provisional value z̃ of Z_{n+1}; this value is generic and free to vary. Define

    Tᵢ = ψ({Z₁, …, Zₙ, z̃} \ {Zᵢ}, Zᵢ), i = 1, …, n, and T_{n+1} = ψ({Z₁, …, Zₙ}, z̃).

Note that T_{n+1} is special because it compares the actual observed data with this provisional value of the unobserved future observation. Next, compute the plausibility of z̃ as a value for Z_{n+1} according to the formula
    π(z̃) = (n + 1)⁻¹ Σᵢ₌₁ⁿ⁺¹ 1{Tᵢ ≥ T_{n+1}}, (4)

where 1{E} denotes the indicator of event E. Note that this process can be carried out for any provisional value z̃, so the result is actually a mapping z̃ ↦ π(z̃), which we will refer to as the plausibility contour returned by the conformal algorithm. This function can be plotted (see Figure 1) to visualize the uncertainty about Z_{n+1} based on the given data, the choice of nonconformity measure, etc. Moreover, a prediction set can be obtained as the upper level set of the plausibility contour,

    C_α(Z₁, …, Zₙ) = {z̃ : π(z̃) > t_{α,n}},

with threshold t_{α,n}, which approximately equals α when n is large.
For a quick explanation of why this works, i.e., why (3) holds, first note that the event that Z_{n+1} lands in the prediction set corresponds to its plausibility π(Z_{n+1}) exceeding the threshold. So to calculate the probability in (3), we need to know something about the distribution of the random variable π(Z_{n+1}). Towards this, the key point is that the provisional value being used in this latter expression is exactly the Z_{n+1} being predicted. Then exchangeability of the original data sequence and the permutation invariance of ψ imply that T₁, …, T_{n+1} are exchangeable too. And since π(Z_{n+1}) is proportional to the rank of T_{n+1}, the claim in (3) follows from the elementary result that the rank of any individual in a set of exchangeable random variables has a discrete uniform distribution.
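The algorithm above is short enough to state in code. The following is our own minimal sketch of full conformal prediction for exchangeable scalars; the default nonconformity measure, distance from the bag mean, is just one convenient symmetric choice of ψ:

```python
import numpy as np

def plausibility(y_obs, y_prov, psi=None):
    """Plausibility contour value pi(y_prov) from the full conformal algorithm.

    psi(bag, z) is the nonconformity measure; the default, distance from the
    bag mean, is one convenient symmetric choice.
    """
    if psi is None:
        psi = lambda bag, z: abs(z - np.mean(bag))
    aug = np.append(y_obs, y_prov)                  # augment data with provisional value
    n = len(aug)
    scores = np.array([psi(np.delete(aug, i), aug[i]) for i in range(n)])
    return float(np.mean(scores >= scores[-1]))     # fraction at least as nonconforming

def conformal_interval(y_obs, grid, alpha=0.10):
    """Prediction set: provisional values on the grid with plausibility above alpha."""
    return [z for z in grid if plausibility(y_obs, z) > alpha]
```

A provisional value near the center of the observed data receives plausibility close to 1, while an extreme value receives plausibility near 1/(n + 1) and is excluded from the prediction set.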
Conformal prediction is somewhat mysterious—it is not obviously part of classical or Bayesian statistics. Some authors
(e.g., Lei et al., 2018) refer to π(z̃) as a p-value corresponding to a “test” of the hypothesis that Z_{n+1} = z̃. This interpretation is potentially confusing because we are accustomed to testing hypotheses about fixed unknowns, not about random variables. The plausibility notion we are adopting here dates back to Shafer (1976), but has re-emerged recently in a new framework for inference and prediction, namely, inferential models (Martin and Liu, 2013, 2016; Martin and Lingham, 2016; Martin, 2019). More recently, Cella and Martin (2020) have shown that the mapping in (4) is precisely the plausibility contour resulting from a suitable inferential model construction. Therefore, the validity property (3) achieved by conformal prediction is a special case of the general inferential model validity theory. For our purposes, we will not need any details from the inferential model framework, but we believe that this characterization is helpful for understanding what conformal prediction is and why it works.

3 Global spatial conformal prediction
3.1 Algorithm GSCP
The first approach we consider is a direct application of the original conformal algorithm to spatial prediction but with spatial dependence encoded in the nonconformity measure. In contrast to the local algorithm presented in Section 4, this method equally weights the nonconformity across all spatial locations in the plausibility contour evaluation. Therefore, we refer to this as global spatial conformal prediction, or GSCP for short.
From Section 2.1, recall that Y(s₀) is the response to be predicted at the new location s₀ and that the observed data are Y(s₁), …, Y(sₙ). Now define the nonconformity measure for the GSCP algorithm as

    ψ(B, y) = d{y, Ŷ_B(s)}, (5)

where Ŷ_B(s) is the point prediction at spatial location s based on the data in bag B, and d measures the distance between the observation and its prediction. Then we define a plausibility contour exactly like in Section 2.2, with obvious notational changes. That is, for a provisional value ỹ of Y(s₀), we have

    π(s₀, ỹ) = (n + 1)⁻¹ Σᵢ₌₀ⁿ 1{ψᵢ ≥ ψ₀}, (6)

where ψᵢ is the nonconformity score for Y(sᵢ) computed from the augmented data set and ψ₀ is the score for the provisional value ỹ at s₀.
And the 100(1 − α)% conformal prediction interval for Y(s₀), denoted by C_α(s₀), is just an upper level set of the plausibility contour, consisting of all those provisional values with plausibility exceeding the threshold. Algorithm 1 states the specific steps of Algorithm GSCP.
Any reasonable choice of point prediction procedure can serve the purpose here, including inverse distance weighting (IDW) predictions (Henley, 2012), Kriging predictions, etc. There is also flexibility in choosing the distance measure. The results are invariant to monotone transformations of the distance measure so that, e.g., absolute and squared prediction errors will give the same plausibility contour; the role of the distance measure is therefore merely to rank prediction errors. In our numerical results presented below, we use the standardized Kriging residuals from (2) as the nonconformity measure, but the results are similar for other measures, including unstandardized squared residuals. Appendix A shows how to quickly compute the plausibility contour and prediction interval by exploiting the inherent quadratic structure of the Kriging-based nonconformity measure.
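To make the GSCP recipe concrete, here is a small sketch of ours using IDW as the point predictor and absolute prediction error as the distance; both choices, and the helper names, are illustrative rather than the paper's own implementation:

```python
import numpy as np

def idw_predict(locs, y, s, p=2.0):
    """Inverse-distance-weighted prediction at location s (hypothetical helper)."""
    d = np.linalg.norm(locs - s, axis=1)
    if np.any(d == 0):
        return y[np.argmin(d)]          # exact hit: return the observed value
    w = d ** (-p)
    return w @ y / w.sum()

def gscp_plausibility(locs, y, s0, y_prov):
    """GSCP plausibility of provisional value y_prov at new location s0, using
    nonconformity |z_i - leave-one-out IDW prediction| over the augmented data."""
    L = np.vstack([locs, s0])           # augment locations with s0
    z = np.append(y, y_prov)            # augment responses with provisional value
    n = len(z)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        scores[i] = abs(z[i] - idw_predict(L[keep], z[keep], L[i]))
    return float(np.mean(scores >= scores[-1]))
```

Scanning y_prov over a grid and keeping the values whose plausibility exceeds the threshold yields the GSCP interval.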
3.2 Theoretical validity of GSCP
To start, consider randomly sampled spatial locations, i.e., s₁, s₂, … independent and identically distributed (iid) in the spatial domain. Then write Zᵢ = (sᵢ, Y(sᵢ)) as before. Note that the distribution, say P, of the process {Zᵢ} is determined by that of the sᵢ’s and of the spatial process Y. Given that exchangeability is crucial to the validity of the conformal prediction method, and the non-exchangeability of the distribution of the responses given the sᵢ’s, it is natural to ask if the process {Zᵢ} defined above is exchangeable. The following lemma answers this question in the affirmative.
Lemma 1.
If the spatial locations s₁, s₂, … are iid, then {Zᵢ}, with Zᵢ = (sᵢ, Y(sᵢ)), is an exchangeable sequence.
Proof.
Conditioning on the process Y, it is easy to see that Z₁, Z₂, … are iid. Then the claim follows since conditionally iid is equivalent to exchangeability (e.g., Hewitt and Savage, 1955). ∎
Having established that randomly sampled spatial locations imply exchangeability of the data, a validity property for GSCP follows immediately from the general theory in, e.g., Shafer and Vovk (2008). The following theorem gives a precise statement. To highlight the prediction interval’s dependence on the target location s₀, we use the notation C_α(s₀).
Theorem 1.
Let Y be a stochastic process over the spatial domain and suppose that the locations s₁, s₂, … are iid. Define Zᵢ = (sᵢ, Y(sᵢ)) for i ≥ 1. Then the prediction interval C_α(s₀) for Y(s₀) returned by Algorithm GSCP is valid in the sense that

    inf_P P{Y(s₀) ∈ C_α(s₀)} ≥ 1 − α, (7)

where P is the joint distribution of {Zᵢ} under iid sampling of the locations, and the infimum is over all such exchangeable laws.
Note that Theorem 1 makes no assumptions about the spatial process Y, so the conclusions hold for non-Gaussian and nonstationary processes. However, Theorem 1 only gives a validity result which is marginal in the sense that it accurately predicts the response at a randomly sampled spatial location s₀. It would also be interesting to ask if, still under the randomly sampled spatial location regime, the conformal prediction is conditionally valid, i.e., valid given s₀ = s for a fixed spatial location s. There are negative results in the literature (e.g., Lei and Wasserman, 2014) which state that strong conditional validity—for all distributions and almost all target locations—is impossible with conformal prediction. A considerable amount of effort has been expended recently trying to achieve “approximate” conditional validity in some sense; see, e.g., Lei and Wasserman (2014), Tibshirani et al. (2019), and Barber et al. (2019).
Surprisingly, there is at least one scenario in which a strong conditional validity result can be achieved in our context. It turns out that one of the obstacles that prevents a conditional validity result in the existing literature is an “edge effect.” That is, conditional validity is typically achieved at targets in the middle of the domain, but fails at targets in the extremes; see Figure 1(b) in Lei and Wasserman (2014). So if it were possible to eliminate the edge effect—even if in a trivial way, by eliminating the edge itself—then there is hope for establishing a conditional validity result. In our spatial context, but perhaps not in other cases, it may not be unreasonable to assume that the spatial locations are sampled iid from a uniform distribution on a sphere. Since the sphere has no edges and a uniform distribution has no extremes, there is no “edge effect,” and, hence, conditional validity can be established. Some additional structure in the process is also needed here; in particular, the process should be isotropic in the sense that the correlation structure only depends on the distance between spatial locations.
Proposition 1.
Let Y be an isotropic stationary process over the sphere, and suppose that the spatial locations s₁, s₂, … are independent and uniformly distributed on the sphere. Define Zᵢ = (sᵢ, Y(sᵢ)) for i ≥ 1. Then the prediction interval C_α(s₀) for Y(s₀) returned by the GSCP algorithm is conditionally valid in the sense that

    inf_P P{Y(s₀) ∈ C_α(s₀) | s₀ = s} ≥ 1 − α for every fixed s,

where P is the joint distribution of {Zᵢ}, and the infimum is over all such exchangeable laws with uniform marginals for the spatial locations.
Proof.
See Appendix B.1. ∎
To our knowledge, the above proposition gives the first conditional validity result for conformal prediction in the literature. We were able to sidestep the usual difficulties by taking advantage of the special structure of the sphere and the uniform distribution thereon. However, these structures, along with the assumed stationarity of the process, limit our flexibility. Fortunately, we can regain that flexibility with the local algorithm described in Section 4.
4 Local spatial conformal prediction
4.1 Local exchangeability
Let D be the spatial domain, which we take to be the unit square for our analysis here. Suppose that the process Y defined over D can be decomposed as

    Y(s) = g{Z(s), ε(s)}, (8)

where Z and ε are independent, Z is a continuous spatial process, ε is a nonspatial process, and g is a deterministic, continuous, and real-valued function, such as g(z, e) = z + e or g(z, e) = z e.
Define the sequence of localized processes around s₀ as

    Y_δ(u) = Y(s₀ + δu), (9)

indexed by u in the unit disk and the radius δ > 0. Similarly, write Z_δ and ε_δ for the localized versions of the Z and ε processes. We will require the following:

1. Z is continuous at s₀, i.e., Z_δ(u) → Z(s₀) as δ → 0; and

2. ε is locally iid at s₀, i.e., ε_δ converges in distribution to an iid process as δ → 0.
This general formulation covers a wide range of common models for a continuous response, including the additive model in Section 2.1, certain generalized spatial linear models (Diggle et al., 1998), spatial copula models (Krupskii and Genton, 2019), and max-stable processes (Reich and Shaby, 2012). For example, in a generalized spatial linear model, with suitable spatial process Z and Gaussian white noise ε, take

    Y(s) = F_{g⁻¹(Z(s))}⁻¹{Φ(ε(s))}, (10)

where F_θ is the distribution function for the exponential family with natural parameter θ, g is the chosen link function, and Φ is the standard normal distribution function. Of course, to achieve the continuity requirement for Y, which is essential, the exponential family must have a density with respect to Lebesgue measure.
These assumptions yield a certain kind of local exchangeability which can then be used to show that our local spatial conformal prediction achieves a desired validity property. We first establish this local exchangeability result, which may be of independent interest.
Theorem 2.
Proof.
See Appendix B.2. ∎
Theorem 2 says that, if the spatial locations are situated sufficiently close to s₀, then the joint distribution of the corresponding response values is approximately exchangeable and, therefore, a local spatial conformal prediction approach—LSCP for short—that is based on only those data points sufficiently close to s₀ ought to be approximately valid. The next section formally states the LSCP algorithm and establishes its theoretical validity properties.
4.2 LSCP algorithm and its theoretical validity
Recall that the domain D is the unit square. Take a uniform grid of N spatial locations and denote these as sᵢ for i = 1, …, N. We assume that the process Y can be observed on this deterministic grid of spatial locations. As is common in the spatial statistics literature (e.g., Stein, 1990; Cressie, 1992), we adopt an infill asymptotic regime in which N → ∞. That is, the region remains fixed while the number of observations in that region increases, filling the space. The relevant point for our analysis is that under this regime the number of locations at which Y is observed in any neighborhood of s₀ will go to infinity. Such a regime is natural in cases like ours where, without structural assumptions about Y, e.g., stationarity, it is simply not possible to learn the local features of the process at s₀ if data are not concentrated in a neighborhood around s₀.
Take a neighborhood around s₀ of radius δ_N. The number of spatial grid points in that neighborhood is roughly π δ_N² N, so we need to choose δ_N so that this number is sufficiently large. In particular, to ensure that the neighborhood contains at least m grid points, where m is some fixed integer, we roughly need

    δ_N ≥ {m / (π N)}^{1/2}. (11)
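The area heuristic behind (11) is easy to code; this helper (ours, not from the paper) returns the smallest radius expected to capture about m of the N grid points, ignoring edge effects:

```python
import math

def min_radius(m, N):
    """Smallest radius delta such that a disk in the unit square is expected to
    contain about m of the N uniformly gridded points: pi * delta^2 * N ~ m,
    edge effects ignored, matching the heuristic behind (11)."""
    return math.sqrt(m / (math.pi * N))
```

For example, with N = 400² grid points, capturing m = 50 of them requires a radius of only about 0.01, so the neighborhood is indeed tightly concentrated.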
It follows from Theorem 2 that the joint distribution of the response at these m spatial locations around s₀ would be approximately exchangeable and, consequently, a conformal prediction algorithm that creates nonconformity scores using only these observations would be valid for predicting Y(s₀).
More precisely, let s₍₁₎, …, s₍ₘ₎ denote the m grid locations closest to s₀, and take ỹ as a provisional value for Y(s₀). Using a suitable prediction procedure, like that based on Kriging discussed above, define nonconformity scores ψ₀, ψ₁, …, ψ_m from these m + 1 locations, with ψ₀ based on the provisional value ỹ, and the corresponding plausibility contour

    π(s₀, ỹ) = (m + 1)⁻¹ Σᵢ₌₀ᵐ 1{ψᵢ ≥ ψ₀}. (12)
Specific details are presented in Algorithm 2.
With the help of the local exchangeability result in Theorem 2, we can establish the (asymptotically approximate) theoretical validity of the local spatial conformal prediction method. As before, to highlight the conformal prediction interval’s dependence on the target location s₀, we write it as C_α(s₀).
Theorem 3.
Consider an infill asymptotic regime with N spatial grid points, N → ∞. Fix an integer m and let the radius δ_N satisfy (11). Then under the assumptions of Theorem 2, the prediction intervals C_α(s₀) returned by Algorithm LSCP are asymptotically valid at s₀ in the sense that

    lim inf_{N→∞} inf_P P{Y(s₀) ∈ C_α(s₀)} ≳ 1 − α,

where P is the joint distribution of Y at the spatial grid points, as well as at s₀, and the outermost infimum is over all such distributions for Y. The inequality above can be made arbitrarily close to equality by taking m sufficiently large.
Proof.
See Appendix B.3. ∎
4.3 Algorithm sLSCP: smoothed LSCP
Theorem 3 implies that the local spatial conformal prediction with m nearest neighbors is approximately valid under the infill asymptotic regime. However, in practice, completely disregarding the contribution of the observations outside the m nearest neighbors may be unsatisfactory, so we propose a smoothed version of the LSCP algorithm (sLSCP).
The GSCP algorithm weights all nonconformity measures equally in the plausibility computation in (4), but this is questionable for nonstationary processes with stochastic properties that vary throughout the spatial domain. To allow for nonstationarity, the sLSCP algorithm weights the nonconformity measures by how far the corresponding location sᵢ is from the prediction location s₀. Let K be a nonincreasing function, and define weights

    wᵢ ∝ K(‖sᵢ − s₀‖), i = 0, 1, …, n,

where the proportionality constant is such that the weights sum to one. Different functions K can be applied, but we recommend the normalized Gaussian kernel function with bandwidth h,

    K(d) ∝ exp{−d² / (2h²)}. (13)

Note that, if h → ∞, then the weights become uniform, which corresponds to the GSCP algorithm. Finally, with these new weights, the plausibility contour at a provisional value ỹ of Y(s₀) is given by

    π(s₀, ỹ) = Σᵢ wᵢ 1{ψᵢ ≥ ψ₀}. (14)
As before, while many nonconformity measures are possible, we recommend the Kriging-based strategy with the standardized residuals defined in (2).
Since we are interested only in the local structure of the process around s₀, it is natural that locations far from s₀ have negligible weight, as in (13). But including all observations requires some nontrivial calculations, e.g., inverting a large covariance matrix. Therefore, to avoid cumbersome and irrelevant computation, we recommend using only the observations closest to s₀, both for the Kriging predictions and for the plausibility scores in (14). The resulting method is both locally adaptive and computationally efficient for large data sets.
The bandwidth tuning parameter can be selected using cross validation, as illustrated in Sections 5 and 6. The number of observations retained is determined by the bandwidth, so that all observations with substantial weight are included, as are the observations required for the Kriging prediction of these observations. Typically the number of nearby observations required to approximate the Kriging prediction is a small subset of the total number of observations (Stein, 2002). As a rule of thumb, one can retain all observations within a radius of s₀ large enough to capture the observations with substantial weights, together with the nearest 15 neighbors of each such observation. We summarize the details of Algorithm sLSCP in Algorithm 3. For simplicity, we use the names sLSCP and LSCP interchangeably.
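The weighting scheme of sLSCP can be sketched as follows (our own illustration; `gaussian_weights` implements the normalized Gaussian kernel of (13) and `weighted_plausibility` the weighted contour of (14), with the score for the provisional value stored last):

```python
import numpy as np

def gaussian_weights(locs, s0, bandwidth):
    """Normalized Gaussian-kernel weights, as in (13); the weights sum to one."""
    d = np.linalg.norm(locs - s0, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return w / w.sum()

def weighted_plausibility(scores, weights):
    """Smoothed plausibility, as in (14): total weight of the scores at least as
    large as the last one, which corresponds to the provisional value at s0."""
    return float(weights[scores >= scores[-1]].sum())
```

With a very large bandwidth the weights become uniform and the contour reduces to the unweighted GSCP computation, as noted above.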
5 Simulation study
5.1 Data generation
We consider one mean-zero Gaussian stationary process (Scenario 1) and seven non-Gaussian and/or nonstationary data-generating scenarios (Scenarios 2–8). The seven scenarios are generated based on transformations of a mean-zero stationary Gaussian process Z(s) and a white noise process ε(s) with standard normal distribution. The stationary Gaussian process Z has a Matérn covariance function with variance σ², range φ and smoothness ν. Data are sampled on an equally spaced grid of points in the unit square, for two grid sizes.
Specifically, the eight scenarios under consideration are:

1. ;
2. ;
3. , where Φ is the standard normal distribution function and Φ⁻¹ is the quantile function;
4. ;
5. ;
6. , where ;
7. ; and
8. , where .
Scenario 1 is Gaussian and stationary, Scenarios 2–4 are stationary but non-Gaussian, and Scenarios 5–8 are nonstationary either in the spatial variance (Scenarios 5 and 6), error variance (Scenario 7), or mean (Scenario 8). Figure 2 shows one realization of all the scenarios.
5.2 Prediction methods and metrics
For each dataset we apply the global and local conformal spatial prediction algorithms. For the parametric Kriging method and the initial Kriging predictions of our proposed conformal prediction, we estimate the spatial covariance parameters (Section 2.1) using empirical variogram methods (Cressie, 1992). The empirical variograms are calculated using the variog function in the R package geoR, and the covariance parameters are chosen to minimize the weighted (by number of observations) squared error between the empirical and model-based variograms. We compare the conformal methods with standard global Kriging prediction using (1) to construct prediction intervals, and the local Kriging (laGP) method of Gramacy and Apley (2015), which dynamically defines the support of a Gaussian process predictor based on a local subset of the data. For laGP, we use the function provided by the laGP package in R, with the local sequential design scheme growing from 6 to 50 points via an empirical Bayes mean-square prediction error criterion.
Methods are compared using leave-one-out cross validation, i.e., we generate a prediction interval for each observation given all other observations. Each scenario is repeated 100 times, and performance is evaluated using average coverage of 100(1 − α)% prediction intervals, average interval width, and average interval score (Gneiting and Raftery, 2007), defined as

    S_α(l, u; y) = (u − l) + (2/α)(l − y)1{y < l} + (2/α)(y − u)1{y > u},

where [l, u] represents the prediction interval and y is the observation. A smaller interval score is desirable as this rewards both high coverage and narrow intervals. We use α = 0.10 in this simulation study.
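The interval score is a direct transcription (our own sketch, with α = 0.10 as the default): the penalty multiplies the amount by which the observation escapes the interval by 2/α.

```python
def interval_score(lo, hi, y, alpha=0.10):
    """Interval score of Gneiting and Raftery (2007): interval width plus a
    penalty of (2 / alpha) times the amount by which y falls outside [lo, hi]."""
    score = hi - lo
    if y < lo:
        score += (2.0 / alpha) * (lo - y)
    elif y > hi:
        score += (2.0 / alpha) * (y - hi)
    return score
```

At α = 0.10 the multiplier is 20, so missing the observation by even a small margin quickly dominates the width term.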
5.3 Results
We present the results for Scenarios 1–6 in this section; the others are in Appendix C. For the stationary Scenarios 1–4 we present results averaged over data sets and all spatial locations in Table 1. For the nonstationary scenarios varying across the first spatial coordinate (Scenarios 5–7), we present the results by the first coordinate, averaged over the data sets and the second coordinate, in Figure 4; e.g., the value of coverage plotted at a given first coordinate is the average of the coverage over all grid points sharing that first coordinate.
Scen  Method  Cov90  Width  IntScore  Cov90  Width  IntScore 

1  GSCP  89.9%  4.58  5.77  90.0%  3.99  5.01 
LSCP  88.5%  4.55  5.96  89.2%  3.94  5.07  
Kriging  90.7%  4.68  5.75  88.5%  3.82  5.01  
laGP  86.9%  4.74  6.52  88.7%  4.26  5.52  
2  GSCP  89.9%  31.84  63.07  90.0%  21.65  41.43 
LSCP  90.7%  30.72  55.32  90.9%  21.06  33.77  
Kriging  93.7%  43.70  65.54  92.4%  26.77  42.47  
laGP  91.7%  40.07  67.49  93.4%  32.07  46.27  
3  GSCP  89.9%  4.64  6.22  90.0%  4.00  5.14 
LSCP  88.7%  4.58  6.22  89.2%  3.94  5.14  
Kriging  90.9%  4.78  6.19  88.9%  3.87  5.15  
laGP  87.8%  4.73  6.71  89.0%  4.27  5.61  
4  GSCP  89.9%  6.93  11.03  90.0%  6.21  9.84 
LSCP  88.9%  6.82  11.21  90.0%  5.96  9.26  
Kriging  91.7%  7.65  11.02  91.2%  6.61  9.85  
laGP  89.2%  7.37  11.93  91.0%  6.68  10.08 
Table 1 shows the results for data generated from stationary processes (Scenarios 1–4). In Scenario 1
, the Gaussian and stationary process, the performance of GSCP, LSCP, and Kriging are comparable. Kriging performs well in this case since the data generating mechanism aligns with its underlying assumption, but the conformal methods are competitive with the parametric model.
In Scenarios 2, 3, and 4, the nonGaussian but stationary processes, GSCP, LSCP, and Kriging perform more or less the same in terms of interval score and outperform laGP. However, the coverage of the conformal methods, especially the GSCP algorithm, is closer to the nominal level than Kriging and the Kriging intervals are generally wider than the conformal intervals. Figure 3 examines the results in more detail for Scenario 3. Since this is a stationary process, the width of the prediction intervals should be similar across different locations, which is the case for the global methods GSCP and Kriging but not the local methods LSCP and laGP. This extra variation is reflected in the coverage proportion across datasets. The coverage of GSCP stabilizes at 90% for each simulation while the coverage for other methods varies considerably and is often below the desired level. This detailed analysis focused on Scenario 3, but the same conclusions can be drawn in Scenarios 2 and 4.
Figure 4 shows the results for nonstationary Scenarios 5 and 6. The results for the other grid size and for the other nonstationary scenarios are in Appendix C. LSCP performs the best among the four methods for these nonstationary scenarios. For Scenario 5, the process is Gaussian to the west and non-Gaussian to the east. The global prediction methods GSCP and Kriging generate prediction intervals with similar width everywhere (ignoring edge effects), while LSCP and laGP provide wider intervals in the east (first coordinate near 1) than in the west (first coordinate near 0). LSCP has coverage around 90% across the domain, and the lowest interval scores, especially in the east where the process is more non-Gaussian. Similarly, in Scenario 6, the correlation is stronger in the east than the west, and LSCP performs the best by providing adaptive prediction interval width and valid coverage across space.
We also conducted a simulation study in which observations are sampled uniformly on the unit square. The performance is very similar to that of sampling on equally spaced grid points, so we only show the results for grid points in this paper.
5.4 Sensitivity Analysis
Finally, we study sensitivity of the proposed method to tuning parameter selection and to the initial estimates of the Matérn covariance parameters. Figure 5 gives the average interval score of LSCP over a range of the bandwidth tuning parameter for Scenarios 5 and 6, along with the average interval scores of GSCP, Kriging, and laGP. We find a V-shaped relationship between the resulting interval score and the tuning parameter: the average performance is best at an intermediate bandwidth, and LSCP outperforms the other methods regardless of the selection of the tuning parameter. To achieve the best performance in practice, we would select the bandwidth using cross validation.
The proposed spatial conformal prediction methods rely on estimation of the Matérn covariance parameters, . To test sensitivity to this step, we repeat simulation Scenario 1, since in this case the optimal is known: the true Matérn covariance parameters used to generate the data. When generating prediction intervals, rather than estimating the Matérn covariance parameters from the simulated data, we use the values of listed in the first four columns of Table 2 for conformal prediction and Kriging, where each of the four parameters is modified by plus or minus 50% of its true value. We use the GSCP method, and for this sensitivity analysis. The corresponding average coverage and average width of the conformal prediction intervals and the Kriging intervals are listed in the last four columns.
Table 2: Sensitivity analysis settings and results. The first four columns give the Matérn covariance parameters used for prediction (row 1 is the truth; each subsequent row perturbs one parameter by plus or minus 50%); the last four columns give the empirical coverage and average width of the GSCP and Kriging intervals.

Nugget  Partial sill  Range  Smoothness  GSCP Cov90  GSCP Width  Kriging Cov90  Kriging Width
1.0     3.0           0.10   0.70        89.69%      4.57        90.05%         4.60
1.5     3.0           0.10   0.70        89.69%      4.61        93.73%         5.24
0.5     3.0           0.10   0.70        89.70%      4.59        82.76%         3.84
1.0     4.5           0.10   0.70        89.68%      4.63        92.64%         5.04
1.0     1.5           0.10   0.70        89.68%      4.63        85.25%         4.09
1.0     3.0           0.15   0.70        89.68%      4.60        86.71%         4.21
1.0     3.0           0.05   0.70        89.67%      4.61        95.13%         5.54
1.0     3.0           0.10   1.05        89.69%      4.58        85.56%         4.08
1.0     3.0           0.10   0.35        89.70%      4.63        94.98%         5.53
While using the true leads to slightly narrower intervals, the average coverage of the conformal prediction intervals remains between 89.67% and 89.70% across the settings in Table 2. In contrast, the coverage of the Kriging intervals ranges from 82.76% to 95.13%. Therefore, the conformal methods are more robust to estimation of the Matérn covariance parameters than the parametric Kriging model.
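For reference, the perturbation grid in the first four columns of Table 2 (the true parameters, then each parameter in turn moved by plus or minus 50% of its true value) can be generated with a short sketch; the parameter names below are shorthand of ours, not notation from the paper.

```python
# True Matérn parameters for Scenario 1, with the values shown in Table 2.
true = {"nugget": 1.0, "psill": 3.0, "range": 0.10, "smooth": 0.70}

# Row 1 of the grid is the truth; each later row perturbs exactly one
# parameter by +50% or -50% while holding the others at their true values.
settings = [dict(true)]
for name in true:
    for mult in (1.5, 0.5):
        s = dict(true)
        s[name] = round(true[name] * mult, 3)
        settings.append(s)
# `settings` now has 9 entries, matching the 9 rows of Table 2.
```

Each entry of `settings` corresponds to one row of Table 2, in the same order.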
6 Real data analysis
This section demonstrates the performance of the conformal prediction method using the canopy height data in Figure 5(a). The data were originally presented in Cook et al. (2013) and were analyzed using a stationary Kriging model in Datta et al. (2016); they are available in the spNNGP package (Finley et al., 2017) in R. There are observations, and the data exhibit clear nonstationarity and non-normality. For example, there are several heterogeneous areas with low canopy heights around the location with longitude 729,000 and latitude 470,000.
We compare methods using 90% prediction intervals at 10,000 randomly chosen test locations. We use LSCP and choose the tuning parameters as because this choice minimized the interval score averaged over 1,000 randomly chosen validation locations. The relationship between the average interval score and the tuning parameter shows a V-shape similar to that in Figure 5. Table 3 compares the performance of LSCP at the 10,000 test locations with Kriging and laGP. LSCP outperforms the other methods: its empirical coverage is the closest to the desired 90%, and it minimizes the interval score.
Table 3: Performance of each method at the 10,000 test locations for the canopy height data.

Method   Width  Cov90  IntScore
LSCP     3.00   89.2%  4.31
Kriging  5.54   96.7%  6.89
laGP     5.04   91.3%  6.61
Figure 6 plots the interval widths for each method. Unlike Kriging, the LSCP and laGP interval widths are locally adaptive, with wider intervals in heterogeneous areas and narrower intervals in homogeneous areas. Comparing the two local methods, LSCP generally provides narrower intervals than laGP, indicating that the proposed method is more efficient. In addition, the locations of observations that fall outside the prediction intervals are uniformly distributed for LSCP and laGP, but clustered in the heterogeneous areas for Kriging. In short, the proposed spatial conformal prediction algorithm performs best in this real data analysis.
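The three summaries compared above (average width, empirical coverage, and average interval score) can be computed from arrays of interval endpoints and held-out observations with a minimal sketch; the dictionary keys mirror the column names in Table 3.

```python
import numpy as np

def summarize_intervals(lower, upper, y, alpha=0.10):
    """Average width, empirical coverage, and average interval
    score (Gneiting and Raftery, 2007) of (1 - alpha) prediction
    intervals [lower_i, upper_i] evaluated at observations y_i."""
    lower, upper, y = map(np.asarray, (lower, upper, y))
    covered = (y >= lower) & (y <= upper)
    score = (upper - lower) \
        + (2.0 / alpha) * (lower - y) * (y < lower) \
        + (2.0 / alpha) * (y - upper) * (y > upper)
    return {"Width": float(np.mean(upper - lower)),
            "Cov90": float(np.mean(covered)),
            "IntScore": float(np.mean(score))}
```

For the 90% intervals used here, an ideal method has Cov90 near 0.90 and the smallest IntScore.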
7 Discussion
In this paper, we proposed a spatial conformal prediction algorithm that provides valid, robust, and model-free prediction intervals by combining spatial methods with classical conformal prediction. We provided both global and local versions to accommodate different stationarity assumptions and sampling designs, and proved their validity under various sampling designs and data-generating mechanisms. To the authors' knowledge, this work is the first to make the classical conformal algorithm work for non-exchangeable data. Our simulation studies and real data analysis demonstrate the advantages of the proposed spatial conformal prediction algorithms. We also developed an R package entitled scp (https://github.com/mhuiying/scp) to compute plausibility and generate spatial prediction intervals using either the Kriging residual or any other user-defined nonconformity measure.
An attractive feature of the proposed algorithms is that they are model-free in the sense that their theoretical validity does not depend on specifying a correct model. In our implementations we use the squared residuals from a simple parametric model to define the nonconformity terms. However, as anticipated by our theoretical results, our simulation study shows that the methods work well even if the parametric family or the mean and covariance functions are misspecified, and that the results are insensitive to inaccurate estimation of the parameters of the parametric model. This robustness allows the methods to be applied broadly and with confidence.
The local conformal prediction method relies on a dense grid of points around each prediction location and thus requires a large dataset. Computation time often prohibits the application of spatial methods to large datasets. Fortunately, we are able to apply the conformal prediction method to large datasets by exploiting local algorithms and an explicit solution of the plausibility contour when the nonconformity measure is the squared Kriging residual. Similar derivations would be needed to maintain computational efficiency for prediction procedures other than Kriging.
Future research directions include extending this work to spatial processes with discrete observations, e.g., when is binary or a count. Generalized spatial linear models are compatible with our current framework, see (10), but continuity of the distribution function is required; therefore, further study is needed to establish the validity of conformal prediction for discrete data. It is also of interest to extend the proposed spatial conformal prediction methods to temporal or multivariate spatial data.
Acknowledgments
HM (DMS–1638521) and RM (DMS–1811802) are supported by the National Science Foundation. BJR is supported by the National Institutes of Health (R01ES031651 and R01ES027892) and King Abdullah University of Science and Technology (3800.2).
References
 Barber et al. (2019) Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2019). The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684.
 Cella and Martin (2020) Cella, L. and Martin, R. (2020). Valid distribution-free inferential models for prediction. arXiv preprint arXiv:2001.09225.
 Cook et al. (2013) Cook, B., Nelson, R., Middleton, E., Morton, D., McCorkel, J., Masek, J., Ranson, K., Ly, V., Montesano, P., et al. (2013). NASA Goddard's LiDAR, hyperspectral and thermal (G-LiHT) airborne imager. Remote Sensing, 5(8):4045–4066.
 Cressie (1992) Cressie, N. (1992). Statistics for spatial data. Terra Nova, 4(5):613–617.
 Datta et al. (2016) Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.
 Diggle et al. (1998) Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.
 Duan et al. (2007) Duan, J. A., Guindani, M., and Gelfand, A. E. (2007). Generalized spatial Dirichlet process models. Biometrika, 94(4):809–825.
 Finley et al. (2017) Finley, A., Datta, A., and Banerjee, S. (2017). spNNGP: Spatial regression models for large datasets using nearest neighbor Gaussian processes. R package version 0.1.0, 1.
 Fuglstad et al. (2015) Fuglstad, G. A., Simpson, D., Lindgren, F., and Rue, H. (2015). Does nonstationary spatial data always require nonstationary random fields? Spatial Statistics, 14:505–531.
 Gelfand et al. (2005) Gelfand, A. E., Kottas, A., and MacEachern, S. N. (2005). Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association, 100(471):1021–1035.
 Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
 Gramacy and Apley (2015) Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.
 Guan (2019) Guan, L. (2019). Conformal prediction with localization. arXiv preprint arXiv:1908.08558.
 Heaton et al. (2019) Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425.
 Henley (2012) Henley, S. (2012). Nonparametric Geostatistics. Springer Science & Business Media.
 Hewitt and Savage (1955) Hewitt, E. and Savage, L. J. (1955). Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80(2):470–501.

 Krupskii and Genton (2019) Krupskii, P. and Genton, M. G. (2019). A copula model for non-Gaussian multivariate spatial data. Journal of Multivariate Analysis, 169:264–277.
 Lei et al. (2018) Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111.
 Lei and Wasserman (2014) Lei, J. and Wasserman, L. (2014). Distribution-free prediction bands for nonparametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96.
 Martin (2019) Martin, R. (2019). False confidence, nonadditive beliefs, and valid statistical inference. International Journal of Approximate Reasoning, 113:39–73.
 Martin and Lingham (2016) Martin, R. and Lingham, R. T. (2016). Priorfree probabilistic prediction of future observations. Technometrics, 58(2):225–235.
 Martin and Liu (2013) Martin, R. and Liu, C. (2013). Inferential models: A framework for priorfree posterior probabilistic inference. Journal of the American Statistical Association, 108(501):301–313.
 Martin and Liu (2016) Martin, R. and Liu, C. (2016). Inferential Models, volume 147 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL.
 Reich and Fuentes (2007) Reich, B. J. and Fuentes, M. (2007). A multivariate semiparametric Bayesian spatial modeling framework for hurricane surface wind fields. The Annals of Applied Statistics, 1(1):249–264.
 Reich and Shaby (2012) Reich, B. J. and Shaby, B. A. (2012). A hierarchical max-stable spatial model for extreme precipitation. The Annals of Applied Statistics, 6(4):1430.
 Risser (2016) Risser, M. D. (2016). Nonstationary spatial modeling, with emphasis on process convolution and covariate–driven approaches. arXiv preprint arXiv:1610.02447.
 Rodriguez and Dunson (2011) Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1).
 Romano et al. (2019) Romano, Y., Patterson, E., and Candes, E. (2019). Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3538–3548.
 Shafer (1976) Shafer, G. (1976). A Mathematical Theory of Evidence, volume 42. Princeton university press.
 Shafer and Vovk (2008) Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421.
 Stein (1990) Stein, M. L. (1990). A comparison of generalized cross validation and modified maximum likelihood for estimating the parameters of a stochastic process. The Annals of Statistics, 18(3):1139–1157.
 Stein (2002) Stein, M. L. (2002). The screening effect in kriging. The Annals of Statistics, 30(1):298–323.
 Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under covariate shift. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 2530–2540. Curran Associates, Inc.
 van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
 Vovk et al. (2005) Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer Science & Business Media.
Appendix A Computational details
Applying the predictive distributions in (1) and provisionally setting , for we can write
(15) 
where , , and .
Since the inverse covariance matrix is positive definite, and , so and . Therefore, the values of satisfying lie between the two roots, denoted , of the quadratic equation . The plausibility calculation (6) then becomes , which is a step function with jump points at the 's and 's, and the plausibility function for sLSCP in (14) simply becomes .
With this step-function representation, we can solve for the prediction interval directly, with no need to enumerate candidate values of satisfying .
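The closed form above avoids enumeration entirely. As a generic illustration of what the plausibility contour computes, a brute-force version for an arbitrary user-supplied nonconformity measure can be sketched as follows; the `nonconformity` interface is a hypothetical stand-in, whereas the paper's derivation specializes to squared Kriging residuals and needs no grid.

```python
import numpy as np

def plausibility(y_grid, y_obs, nonconformity):
    """Brute-force conformal plausibility: for each candidate value y,
    augment the observed responses with y, compute nonconformity
    scores for all n+1 points, and return the fraction of scores at
    least as large as the candidate's own score. `nonconformity`
    maps an augmented response vector (candidate last) to a score
    vector of the same length (hypothetical interface)."""
    y_obs = np.asarray(y_obs, dtype=float)
    pl = np.empty(len(y_grid))
    for j, y in enumerate(y_grid):
        aug = np.append(y_obs, y)
        delta = nonconformity(aug)
        pl[j] = np.mean(delta >= delta[-1])
    return pl

def conformal_interval(y_grid, pl, alpha=0.10):
    """Level (1 - alpha) prediction set: candidate values whose
    plausibility exceeds alpha, reported as an interval via min/max."""
    keep = np.asarray(y_grid)[np.asarray(pl) > alpha]
    return float(keep.min()), float(keep.max())
```

Candidates that conform well to the observed data receive high plausibility and survive into the prediction set; the explicit quadratic-root solution replaces the loop over `y_grid` with a direct computation of the jump points.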
Appendix B Proofs
b.1 Proof of Proposition 1
To be even more explicit about notation, for a given set of data and a new location , write . Now define the conditional coverage probability function