Valid model-free spatial prediction

06/28/2020, by Huiying Mao, et al.

Predicting the response at an unobserved location is a fundamental problem in spatial statistics. Given the difficulty in modeling spatial dependence, especially in non-stationary cases, model-based prediction intervals are at risk of misspecification bias that can negatively affect their validity. Here we present a new approach for model-free spatial prediction based on the conformal prediction machinery. Our key observation is that spatial data can be treated as exactly or approximately exchangeable in a wide range of settings. For example, when the spatial locations are deterministic, we prove that the response values are, in a certain sense, locally approximately exchangeable for a broad class of spatial processes, and we develop a local spatial conformal prediction algorithm that yields valid prediction intervals without model assumptions. Numerical examples with both real and simulated data confirm that the proposed conformal prediction intervals are valid and generally more efficient than existing model-based procedures across a range of non-stationary and non-Gaussian settings.


1 Introduction

Providing valid predictions of the response at an unobserved location is a fundamental problem in spatial statistics. For example, epidemiologists may wish to extrapolate air pollution concentrations from a network of stationary monitors to the residential locations of the study participants. There are a number of challenges one faces in carrying out valid prediction at a new spatial location, but one of the most pressing is that existing methods are model-based, so the reliability of the predictions depends crucially on the soundness of the posited model. For example, prediction intervals based on Kriging—see Cressie (1992) and Section 2.1—often rely on normality, and stationarity is often assumed to facilitate estimating the spatial covariance function required for Kriging. It is now common to perform geostatistical analysis for massive datasets collected over a vast and diverse spatial domain (e.g., Heaton et al., 2019). For complex processes observed over a large domain, the normality and stationarity assumptions can be questionable. Failing to account for nonstationarity can affect prediction accuracy, but typically has a larger effect on uncertainty quantification such as prediction intervals (Fuglstad et al., 2015). While there are now many methods available for dealing with nonstationarity (see Risser, 2016, for a recent review) and non-Gaussianity (e.g., Gelfand et al., 2005; Duan et al., 2007; Reich and Fuentes, 2007; Rodriguez and Dunson, 2011), these typically involve heavy computations. This exacerbates the already imposing computational challenges posed by massive datasets. Further, fitting the entire stochastic process may be unnecessary if only prediction intervals are desired. Therefore, having a method that provides provably valid prediction intervals without requiring specification of a statistical model—and, hence, not inheriting the risk of model misspecification biases—would be very useful to practitioners. Developing such a method is the goal of this paper.

In recent years, the use of machine learning techniques in statistics has become increasingly common. While there are numerous examples of this phenomenon, the one most relevant here is conformal prediction. This method originated in Vovk et al. (2005) and the references therein (see, also, Shafer and Vovk, 2008), but has appeared frequently in the recent statistics literature (e.g., Lei and Wasserman, 2014; Lei et al., 2018; Guan, 2019; Romano et al., 2019; Tibshirani et al., 2019). What makes this method especially attractive is that it provides provably valid prediction intervals without specification of a statistical model. More precisely, the conformal prediction intervals achieve the nominal frequentist prediction coverage probability, uniformly over all data distributions; see Section 2.2. The crucial assumption behind the validity of conformal prediction is that the data are exchangeable.

Whether it is reasonable to assume exchangeability in a spatial application depends on how the data are sampled. On the one hand, if the locations are randomly sampled in the spatial domain, then exchangeability holds automatically; see Lemma 1. Therefore, the standard conformal prediction framework can be used basically off the shelf. On the other hand, if the locations are fixed in the spatial domain, then exchangeability does not hold in general. But we show that, for a wide range of spatial processes, the response values at tightly concentrated locations are approximately exchangeable; see Theorem 2. Therefore, a version of the basic conformal prediction method applied to these tightly concentrated observations ought to be approximately valid.

Using this insight about the connection between exchangeability and the sampling design, we propose two related spatial conformal prediction methods. The first, a so-called global spatial conformal prediction (GSCP) method, described in Section 3, is designed specifically for cases where the spatial locations are sampled at random. In particular, this global method produces a prediction interval which is marginally valid, i.e., valid on average with respect to the distribution of the target location at which prediction is desired. The second, a local spatial conformal prediction (LSCP) method, described in Section 4, is designed specifically for the case when the spatial locations are fixed. Since our goal is to proceed without strong assumptions about the spatial dependence structure, it is only possible to establish approximate or local exchangeability. Therefore, the proposed local spatial conformal prediction method can only provide approximately valid predictions; see Theorem 3. But our goal in the fixed-location case resembles the “conditional validity” target in the conformal prediction literature (e.g., Barber et al., 2019) so, given the impossibility theorems in the latter context, approximate validity is all that can be expected.

For both the global and local formulations, our proposed method is model-free, robust, and computationally feasible for large datasets. In Sections 5 and 6, we show using real and simulated data that the proposed methods outperform both standard global Kriging and local approximate Gaussian process regression (laGP; Gramacy and Apley, 2015) for non-stationary and non-Gaussian data. In addition to being useful for spatial applications, this work is also a major advance in extending conformal prediction to the case of dependent data, and it establishes conditions on the spatial sampling design and data-generating mechanism that ensure validity—or at least approximate validity—of the conformal prediction intervals.

2 Background

2.1 Spatial prediction methods

Let $Y(s)$ be the observation at spatial location $s$. The data consist of $n$ observations $Y_i = Y(s_i)$, $i = 1, \ldots, n$. Geostatistical analysis often assumes that the data follow a Gaussian process with a stationary and isotropic covariance function. That is, $Y(s) = Z(s) + \varepsilon(s)$, where $Z(s)$ is a mean-zero Gaussian process with covariance $\mathrm{Cov}\{Z(s), Z(s')\} = \sigma^2 \rho(\|s - s'\|)$, a function of the distance between locations $s$ and $s'$, and $\varepsilon(s) \stackrel{iid}{\sim} \mathrm{N}(0, \tau^2)$ is error. A common example is the Matérn correlation function

$\rho(d) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{d}{\phi}\right)^{\nu} K_{\nu}\!\left(\frac{d}{\phi}\right),$

where $\phi$ is the correlation range, $\nu$ is the smoothness parameter, and $K_\nu$ is the modified Bessel function of the second kind. Covariates can easily be added to the mean of $Y(s)$, but for simplicity we assume the process has mean zero. The main assumptions of this model are that the data are Gaussian and the covariance function is stationary and isotropic, i.e., it is a function only of the distance between spatial locations and is thus the same across the spatial domain.
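To make the covariance model concrete, the following R sketch evaluates a Matérn correlation of the above form; the function name matern_cor and the exact parameterization (correlation range and smoothness, in the geoR-style form) are illustrative choices, not taken from the paper's code.

```r
# Matern correlation under one common (geoR-style) parameterization;
# `range` and `nu` are the correlation range and smoothness parameters.
matern_cor <- function(d, range, nu) {
  r <- d * 0 + 1                                   # correlation is 1 at distance zero
  pos <- d > 0
  u <- d[pos] / range
  r[pos] <- (2^(1 - nu) / gamma(nu)) * u^nu * besselK(u, nu)
  r
}

# Example: correlation at a few distances with range 0.1 and smoothness 0.7
matern_cor(c(0, 0.05, 0.1, 0.5), range = 0.1, nu = 0.7)
```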

To make prediction inference at the new location $s_{n+1}$, we study the random variable $Y_{n+1} = Y(s_{n+1})$ together with the other $Y_i$'s. The full data set, including the response to be predicted, is denoted $D = \{Y_1, \ldots, Y_{n+1}\}$, and the collection of data excluding $Y_i$ (but including $Y_{n+1}$) is denoted $D_{-i}$. Note that $D_{-(n+1)} = \{Y_1, \ldots, Y_n\}$ represents the full observed data set and $D = D_{-(n+1)} \cup \{Y_{n+1}\}$.

The joint distribution of $(Y_1, \ldots, Y_{n+1})$, given the locations $s_1, \ldots, s_{n+1}$ and the spatial covariance parameters $\theta = (\tau^2, \sigma^2, \phi, \nu)$, is $\mathrm{N}_{n+1}(0_{n+1}, \Sigma)$, where $0_{n+1}$ is a vector of zeros with length $n+1$ and $\Sigma$ is a covariance matrix with $(i,j)$ element $\sigma^2 \rho(\|s_i - s_j\|) + \tau^2\, 1\{i = j\}$. Let $\hat\theta$ be an estimate of $\theta$ and let $\hat\Sigma$ be the corresponding covariance matrix, partitioned so that $\hat\Sigma_{11}$ is the variance of $Y_{n+1}$, $\hat\Sigma_{12}$ is the vector of covariances between $(Y_1, \ldots, Y_n)$ and $Y_{n+1}$, and $\hat\Sigma_{22}$ is the covariance matrix of $(Y_1, \ldots, Y_n)$. The Kriging prediction and variance of $Y_{n+1}$, given $Y_1, \ldots, Y_n$ and $\hat\theta$, are

$\hat{Y}_{n+1} = \hat\Sigma_{12}^\top \hat\Sigma_{22}^{-1} (Y_1, \ldots, Y_n)^\top \quad\text{and}\quad \hat\sigma_{n+1}^2 = \hat\Sigma_{11} - \hat\Sigma_{12}^\top \hat\Sigma_{22}^{-1} \hat\Sigma_{12}.$   (1)

Assuming normality and stationarity gives the predictive distribution $Y_{n+1} \mid Y_1, \ldots, Y_n \sim \mathrm{N}(\hat{Y}_{n+1}, \hat\sigma_{n+1}^2)$ and standardized residuals

$e_{n+1} = (Y_{n+1} - \hat{Y}_{n+1}) / \hat\sigma_{n+1}.$   (2)

In particular, a $100(1-\alpha)\%$ prediction interval for $Y_{n+1}$ is $\hat{Y}_{n+1} \pm z_{\alpha/2}\, \hat\sigma_{n+1}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.
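As a concrete illustration of (1) and the normal prediction interval, here is a minimal R sketch of simple Kriging at a new site, reusing matern_cor from the sketch above; the function name krige_pred, the ordering of the parameters in theta, and the example inputs are assumptions for illustration rather than the paper's implementation.

```r
# Simple (mean-zero) Kriging prediction and a normal 100(1 - alpha)% interval
# at a new location s0; theta = (nugget, partial sill, range, smoothness).
krige_pred <- function(y, S, s0, theta, alpha = 0.10) {
  tau2 <- theta[1]; sig2 <- theta[2]; rng <- theta[3]; nu <- theta[4]
  D    <- as.matrix(dist(rbind(s0, S)))        # pairwise distances, s0 in row 1
  K    <- sig2 * matern_cor(D, rng, nu)        # spatial covariance
  Sig  <- K[-1, -1] + tau2 * diag(nrow(S))     # covariance of the observations
  k0   <- K[-1, 1]                             # covariance with Y(s0)
  w    <- solve(Sig, k0)
  pred <- sum(w * y)                           # Kriging prediction
  pvar <- sig2 + tau2 - sum(w * k0)            # Kriging prediction variance
  z    <- qnorm(1 - alpha / 2)
  c(pred = pred, lower = pred - z * sqrt(pvar), upper = pred + z * sqrt(pvar))
}

# Example with hypothetical data and covariance parameters
set.seed(1)
S <- matrix(runif(40), ncol = 2)               # 20 locations in the unit square
y <- rnorm(20)
krige_pred(y, S, s0 = matrix(c(0.5, 0.5), 1, 2), theta = c(1, 3, 0.1, 0.7))
```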

2.2 Conformal prediction

In this section we take a step back from spatial data analysis and review conformal prediction for non-spatial problems. Suppose we have a data sequence $Z_1, Z_2, \ldots$ assumed to be exchangeable with (joint) distribution $P$. No assumptions about $P$ are made here, beyond that it is exchangeable. We observe the values of $Z_1, \ldots, Z_n$ and the goal is to predict $Z_{n+1}$. More specifically, we seek a procedure that returns, for any $\alpha \in (0,1)$, a prediction interval $C_\alpha(Z_1, \ldots, Z_n)$ that is valid in the sense that

$\inf_P P\{Z_{n+1} \in C_\alpha(Z_1, \ldots, Z_n)\} \geq 1 - \alpha,$   (3)

where the probability is computed under $P$, and the infimum is over all exchangeable distributions $P$. That we require attaining the nominal frequentist coverage for all $P$ rules out the use of model-based procedures, such as likelihood or Bayesian methods.

The original conformal prediction method proceeds as follows. Define a non-conformity measure $A(B, z)$, a function that takes two arguments: the first is a "bag" $B$ that consists of a finite collection of data points; the second is a single data point $z$. Then $A(B, z)$ measures how closely $z$ represents the data points in bag $B$. For example, if $\hat{z}(B)$ is a prediction rule and $d$ is some measure of distance, then we might take $A(B, z) = d\{z, \hat{z}(B)\}$, the distance between $z$ and the value returned by the prediction rule applied to $B$. The choice of non-conformity measure depends on the context and is at the discretion of the data analyst; often there is a natural choice, as in our spatial prediction setting described in Section 3. The only technical requirement is symmetry in the first argument; that is, shuffling the data in bag $B$ does not change the value of $A(B, z)$.

Given the non-conformity measure $A$, the next step is to transform the data in a suitable way via $A$. Specifically, augment the observed data $Z_1, \ldots, Z_n$ with a provisional value $z$ of $Z_{n+1}$; this value is generic and free to vary. Write $D = \{Z_1, \ldots, Z_n, z\}$, let $D_{-i}$ denote $D$ with its $i$th element removed, and define

$T_i = A(D_{-i}, Z_i), \quad i = 1, \ldots, n, \qquad T_{n+1} = A(D_{-(n+1)}, z).$

Note that $T_{n+1}$ is special because it compares the actual observed data with this provisional value of the unobserved future observation. Next, compute the plausibility of $z$ as a value for $Z_{n+1}$ according to the formula

$\mathrm{pl}_n(z) = \frac{1}{n+1} \sum_{i=1}^{n+1} 1\{T_i \geq T_{n+1}\},$   (4)

where $1\{E\}$ denotes the indicator of event $E$. Note that this process can be carried out for any provisional value $z$, so the result is actually a mapping $z \mapsto \mathrm{pl}_n(z)$, which we will refer to as the plausibility contour returned by the conformal algorithm. This function can be plotted (see Figure 1) to visualize the uncertainty about $Z_{n+1}$ based on the given data $Z_1, \ldots, Z_n$, the choice of non-conformity measure, etc. Moreover, a prediction set can be obtained as the upper level set of the plausibility contour,

$C_\alpha(Z_1, \ldots, Z_n) = \{z : \mathrm{pl}_n(z) > \alpha\},$

with a threshold approximately equal to $\alpha$ when $n$ is large.

For a quick explanation of why this works, i.e., why (3) holds, first note that the event $\{Z_{n+1} \in C_\alpha(Z_1, \ldots, Z_n)\}$ corresponds to $\{\mathrm{pl}_n(Z_{n+1}) > \alpha\}$. So to calculate the probability in (3), we need to know something about the distribution of the random variable $\mathrm{pl}_n(Z_{n+1})$. Towards this, the key point is that the provisional value being used in this latter expression is exactly the $Z_{n+1}$ being predicted. Then exchangeability of the original data sequence and the permutation invariance of $A$ imply that $T_1, \ldots, T_{n+1}$ are exchangeable too. And since $\mathrm{pl}_n(Z_{n+1})$ is proportional to the rank of $T_{n+1}$, the claim in (3) follows from the elementary result that the rank of any individual in a set of exchangeable random variables has a discrete uniform distribution.
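The following R sketch implements the plausibility contour (4) and the resulting level-set prediction interval for a generic exchangeable scalar sequence, using the simple non-conformity measure $A(B, z) = |z - \mathrm{mean}(B)|$; the function names and the candidate grid are illustrative assumptions.

```r
# Plausibility of a provisional value z for the next observation, as in (4)
plaus <- function(z, y) {
  aug <- c(y, z)                                # augmented data, z in last position
  n1  <- length(aug)
  scores <- sapply(seq_len(n1), function(i) abs(aug[i] - mean(aug[-i])))
  mean(scores >= scores[n1])                    # proportion of scores at least as large
}

# Prediction interval as the upper level set {z : plaus(z) > alpha}
conformal_interval <- function(y, alpha = 0.10,
                               grid = seq(min(y) - 3 * sd(y), max(y) + 3 * sd(y),
                                          length.out = 400)) {
  keep <- sapply(grid, function(z) plaus(z, y) > alpha)
  range(grid[keep])
}

# Example: roughly a 90% prediction interval for the next draw
set.seed(1)
y <- rnorm(50)
conformal_interval(y)
```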

Figure 1: Plot of a plausibility contour and a visualization of two level sets corresponding to two different thresholds (red dashed line and blue dotted line).

Conformal prediction is somewhat mysterious—it is not obviously part of classical or Bayesian statistics. Some authors (e.g., Lei et al., 2018) refer to $\mathrm{pl}_n(z)$ as a p-value corresponding to a "test" of the hypothesis that $Z_{n+1} = z$. This interpretation is potentially confusing because we are accustomed to testing hypotheses about fixed unknowns, not about random variables. The plausibility notion we are adopting here dates back to Shafer (1976), but has reemerged recently in a new framework for inference and prediction, namely, inferential models (Martin and Liu, 2013, 2016; Martin and Lingham, 2016; Martin, 2019). More recently, Cella and Martin (2020) have shown that the mapping in (4) is precisely the plausibility contour resulting from a suitable inferential model construction. Therefore, the validity property (3) achieved by conformal prediction is a special case of the general inferential model validity theory. For our purposes, we will not need any details from the inferential model framework, but we believe that this characterization is helpful for understanding what conformal prediction is and why it works.

3 Global spatial conformal prediction

3.1 Algorithm GSCP

The first approach we consider is a direct application of the original conformal algorithm to spatial prediction but with spatial dependence encoded in the non-conformity measure. In contrast to the local algorithm presented in Section 4, this method equally weights the non-conformity across all spatial locations in the plausibility contour evaluation. Therefore, we refer to this as global spatial conformal prediction, or GSCP for short.

From Section 2.1, recall the notation $D = \{Y_1, \ldots, Y_{n+1}\}$ and $D_{-(n+1)} = \{Y_1, \ldots, Y_n\}$; also, $D_{-i}$ denotes $D$ with $Y_i$ excluded but $Y_{n+1}$ included. Now define the non-conformity measure for the GSCP algorithm as

$T_i = A(D_{-i}, Y_i) = d\{Y_i, \hat{Y}_i(D_{-i})\}, \quad i = 1, \ldots, n+1,$   (5)

where $\hat{Y}_i(D_{-i})$ is the point prediction at spatial location $s_i$ based on $D_{-i}$, and $d$ measures the distance between $Y_i$ and its prediction. Then we define a plausibility contour exactly like in Section 2.2, with obvious notational changes. That is, for a provisional value $y$ of $Y_{n+1}$, we have

$\mathrm{pl}_n(y) = \frac{1}{n+1} \sum_{i=1}^{n+1} 1\{T_i \geq T_{n+1}\}.$   (6)

And the $100(1-\alpha)\%$ conformal prediction interval for $Y_{n+1}$, denoted by $C_\alpha(s_{n+1})$, is just an upper level set of the plausibility contour, consisting of all those provisional values with plausibility exceeding the threshold. Algorithm 1 states the specific steps of Algorithm GSCP.

Any reasonable choice of point prediction procedure can serve the purpose here, including inverse distance weighting (IDW) predictions (Henley, 2012), Kriging predictions, etc. There is also flexibility in choosing the distance measure. The results are invariant to monotone transformations of the distance measure, so that, e.g., absolute and squared residuals give the same plausibility contour; the role of the distance measure is therefore merely to rank the prediction errors. In our numerical results presented below, we use the standardized Kriging residuals from (2) as the non-conformity measure, but the results are similar for other measures, including unstandardized squared residuals. Appendix A shows how to quickly compute the plausibility contour and prediction interval by exploiting the inherent quadratic structure of the Kriging-based non-conformity measure.
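A quick R check of the invariance claim above: any monotone transformation of the distance measure leaves the ranks of the non-conformity scores, and hence the plausibility contour, unchanged.

```r
# Absolute and squared residuals induce identical rankings of the scores
set.seed(1)
e <- rnorm(10)                      # hypothetical prediction errors
identical(rank(abs(e)), rank(e^2))  # TRUE
```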

Input: observations $Y_1, \ldots, Y_n$ at locations $s_1, \ldots, s_n$; prediction location $s_{n+1}$; non-conformity measure $A$; significance level $\alpha$; and a fine grid of candidate values $y$
Output: prediction interval $C_\alpha(s_{n+1})$ for $Y_{n+1}$
1 for $y$ in the specified grid do
2       set $Y_{n+1} = y$ and $D = \{Y_1, \ldots, Y_{n+1}\}$;
3       for $i = 1$ to $n+1$ do
4             define $D_{-i}$ by removing $Y_i$ from $D$;
5             $T_i = A(D_{-i}, Y_i)$;
6       end for
7       compute plausibility $\mathrm{pl}_n(y)$ for $y$ as in (6);
8       include $y$ in $C_\alpha(s_{n+1})$ if $\mathrm{pl}_n(y) > \alpha$;
9 end for
return $C_\alpha(s_{n+1})$.
Algorithm 1 Global spatial conformal prediction (GSCP).
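Below is a minimal R sketch of Algorithm GSCP, reusing krige_pred from the Section 2.1 sketch; for brevity it uses the absolute (rather than the paper's standardized) Kriging residual as the non-conformity measure, and the candidate grid, the parameter vector theta, and the function name gscp_interval are illustrative assumptions.

```r
# Global spatial conformal prediction (GSCP) sketch: full conformal over a
# grid of provisional values, with absolute Kriging residual non-conformity.
gscp_interval <- function(y, S, s0, theta, alpha = 0.10,
                          grid = seq(min(y) - 2 * sd(y), max(y) + 2 * sd(y),
                                     length.out = 200)) {
  plaus_one <- function(z) {
    aug_y <- c(y, z)                             # provisional value in last position
    aug_S <- rbind(S, s0)
    n1 <- length(aug_y)
    scores <- sapply(seq_len(n1), function(i) {
      pred <- krige_pred(aug_y[-i], aug_S[-i, , drop = FALSE],
                         aug_S[i, , drop = FALSE], theta)[["pred"]]
      abs(aug_y[i] - pred)                       # non-conformity of the i-th point
    })
    mean(scores >= scores[n1])                   # plausibility as in (6)
  }
  keep <- sapply(grid, function(z) plaus_one(z) > alpha)
  range(grid[keep])                              # interval from the upper level set
}

# Example (slow: the full conformal loop re-fits for every grid value)
# gscp_interval(y, S, s0 = matrix(c(0.5, 0.5), 1, 2), theta = c(1, 3, 0.1, 0.7))
```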

3.2 Theoretical validity of GSCP

To start, consider randomly sampled spatial locations, i.e., $s_1, s_2, \ldots$ independent and identically distributed (iid) in the spatial domain. Then write $Y_i = Y(s_i)$ as before. Note that the distribution, say $P$, of the process $(s_i, Y_i)_{i \geq 1}$ is determined by that of the $s_i$'s and of the spatial process $Y(\cdot)$. Given that exchangeability is crucial to the validity of the conformal prediction method, and the non-exchangeability of the distribution of $(Y_1, Y_2, \ldots)$ given the $s_i$'s, it is natural to ask if the process defined above is exchangeable. The following lemma answers this question in the affirmative.

Lemma 1.

If the spatial locations $s_1, s_2, \ldots$ are iid, then $(Y_i)_{i \geq 1}$, with $Y_i = Y(s_i)$, is an exchangeable sequence.

Proof.

Conditioning on the process $Y(\cdot)$, it is easy to see that $Y_1, Y_2, \ldots$ are iid. Then the claim follows since conditionally iid is equivalent to exchangeability (e.g., Hewitt and Savage, 1955). ∎

Having established that randomly sampled spatial locations imply that the data are exchangeable, a validity property for GSCP follows immediately from the general theory in, e.g., Shafer and Vovk (2008). The following theorem gives a precise statement. To highlight the prediction interval's dependence on the location $s_{n+1}$, we use the notation $C_\alpha(s_{n+1})$.

Theorem 1.

Let $Y(\cdot)$ be a stochastic process over the spatial domain and suppose that the locations $s_1, \ldots, s_{n+1}$ are iid. Define $Y_i = Y(s_i)$ for $i = 1, \ldots, n+1$. Then the prediction interval $C_\alpha(s_{n+1})$ for $Y_{n+1}$ returned by Algorithm GSCP is valid in the sense that

$\inf_P P\{Y_{n+1} \in C_\alpha(s_{n+1})\} \geq 1 - \alpha,$   (7)

where $P$ is the joint distribution of $(s_i, Y_i)_{i=1}^{n+1}$ under the above sampling scheme, and the infimum is over all such exchangeable laws.

Note that Theorem 1 makes no assumptions about the spatial process $Y(\cdot)$, so the conclusions hold for non-Gaussian and non-stationary processes. However, Theorem 1 only gives a validity result which is marginal, in the sense that it concerns prediction of the response at a randomly sampled spatial location $s_{n+1}$. It would also be interesting to ask if, still under the randomly sampled spatial location regime, the conformal prediction is conditionally valid, i.e., valid given $s_{n+1} = s$, where $s$ is a fixed spatial location. There are negative results in the literature (e.g., Lei and Wasserman, 2014) which state that strong conditional validity—for all $P$ and almost all targets $s$—is impossible with conformal prediction. A considerable amount of effort has been expended recently trying to achieve "approximate" conditional validity in some sense; see, e.g., Lei and Wasserman (2014), Tibshirani et al. (2019), and Barber et al. (2019).

Surprisingly, there is at least one scenario in which a strong conditional validity result can be achieved in our context. It turns out that one of the obstacles that prevents a conditional validity result in the existing literature is an “edge effect.” That is, conditional validity is typically achieved at targets in the middle of the domain, but fails at targets in the extremes; see Figure 1(b) in Lei and Wasserman (2014). So if it were possible to eliminate the edge effect—even if in a trivial way, by eliminating the edge itself—then there is hope for establishing a conditional validity result. In our spatial context, but perhaps not in other cases, it may not be unreasonable to assume that the spatial locations are sampled iid from a uniform distribution on a sphere. Since the sphere has no edges and a uniform distribution has no extremes, there is no “edge effect,” and, hence, conditional validity can be established. Some additional structure in the process is also needed here, in particular, it should be isotropic in the sense that the correlation structure only depends on the distance between spatial locations.

Proposition 1.

Let $Y(\cdot)$ be an isotropic stationary process over the sphere, and suppose that the spatial locations $s_1, \ldots, s_{n+1}$ are independent and uniformly distributed on the sphere. Define $Y_i = Y(s_i)$ for $i = 1, \ldots, n+1$. Then the prediction interval $C_\alpha(s_{n+1})$ for $Y_{n+1}$ returned by the GSCP algorithm is conditionally valid in the sense that

$\inf_P P\{Y_{n+1} \in C_\alpha(s_{n+1}) \mid s_{n+1} = s\} \geq 1 - \alpha \quad \text{for all locations } s \text{ on the sphere},$

where $P$ is the joint distribution of $(s_i, Y_i)_{i=1}^{n+1}$, and the infimum is over all such exchangeable laws with uniform marginals for the spatial locations.

Proof.

See Appendix B.1. ∎

To our knowledge, the above proposition gives the first conditional validity result for conformal prediction in the literature. We were able to side-step the usual difficulties by taking advantage of the special structure of the sphere and the uniform distribution thereon. However, these structures, along with the assumed stationarity of , limit our flexibility. Fortunately, we can regain that flexibility with the local algorithm to be described in Section 4.
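The uniform-on-the-sphere sampling design assumed in Proposition 1 is easy to simulate; the R sketch below normalizes standard Gaussian vectors to obtain uniform draws on the unit sphere in three dimensions (an illustrative choice of sphere).

```r
# Uniformly distributed locations on the unit sphere in R^3: a standard
# Gaussian vector divided by its norm is uniform on the sphere.
runif_sphere <- function(n) {
  x <- matrix(rnorm(3 * n), ncol = 3)
  x / sqrt(rowSums(x^2))             # each row divided by its Euclidean norm
}

set.seed(1)
runif_sphere(3)                      # three uniformly sampled locations
```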

4 Local spatial conformal prediction

4.1 Local exchangeability

Let $\mathcal{S}$ be the spatial domain, which we take to be the unit square $[0,1]^2$ for our analysis here. Suppose that the process $Y(\cdot)$ defined over $\mathcal{S}$ can be decomposed as

$Y(s) = g\{Z(s), \varepsilon(s)\},$   (8)

where $Z$ and $\varepsilon$ are independent, $Z$ is a continuous spatial process, $\varepsilon$ is a non-spatial process, and $g$ is a deterministic, continuous, and real-valued function, such as addition or multiplication.

Define the sequence of localized processes around $s_0$ as

$Y_r(u) = Y(s_0 + r u), \quad u \in \mathbb{U},$   (9)

indexed by the unit disk $\mathbb{U} = \{u \in \mathbb{R}^2 : \|u\| \leq 1\}$ and the radius $r > 0$. Similarly, write $Z_r$ and $\varepsilon_r$ for the localized versions of the $Z$ and $\varepsilon$ processes. We will require the following:

  • $Z$ is continuous at $s_0$, i.e., $Z_r(u) \to Z(s_0)$ as $r \to 0$; and

  • $\varepsilon$ is locally iid at $s_0$, i.e., $\varepsilon_r$ converges in distribution to an iid process as $r \to 0$.

This general formulation covers a wide range of common models for a continuous response, including the additive model in Section 2.1, certain generalized spatial linear models (Diggle et al., 1998), spatial copula models (Krupskii and Genton, 2019), and max-stable processes (Reich and Shaby, 2012). For example, in a generalized spatial linear model, with a suitable spatial process $Z$ and Gaussian white noise $\varepsilon$, take

$Y(s) = g\{Z(s), \varepsilon(s)\} = F^{-1}_{\eta(s)}\big[\Phi\{\varepsilon(s)\}\big], \quad \eta(s) = \ell^{-1}\{Z(s)\},$   (10)

where $F_\eta$ is the distribution function for the exponential family with natural parameter $\eta$, $\ell$ is the chosen link function, and $\Phi$ is the standard normal distribution function. Of course, to achieve the continuity requirement for $g$, which is essential, the exponential family must have a density with respect to Lebesgue measure.

These assumptions yield a certain kind of local exchangeability which can then be used to show that our local spatial conformal prediction achieves a desired validity property. We first establish this local exchangeability result, which may be of independent interest.

Theorem 2.

Suppose that $Y(\cdot)$ can be decomposed as in (8), where $Z$ is continuous at $s_0$, $\varepsilon$ is locally iid at $s_0$, and $Z$ and $\varepsilon$ are independent. Then the localized process $Y_r$ in (9) converges in distribution as $r \to 0$, and the limit is an exchangeable process in the sense that its finite-dimensional distributions are exchangeable.

Proof.

See Appendix B.2. ∎

Theorem 2 says that, if the spatial locations are situated sufficiently close to $s_0$, then the joint distribution of the corresponding response values is approximately exchangeable and, therefore, a local spatial conformal prediction approach—LSCP for short—that is based on only those data points sufficiently close to $s_0$ ought to be valid. The next section formally states the LSCP algorithm and establishes its theoretical validity properties.

4.2 LSCP algorithm and its theoretical validity

Recall that $\mathcal{S} = [0,1]^2$ is the unit square. Take a uniform grid of $n$ spatial locations and denote these as $s_i$ for $i = 1, \ldots, n$. We assume that the process can be observed on this deterministic grid of spatial locations. As is common in the spatial statistics literature (e.g., Stein, 1990; Cressie, 1992), we adopt an infill asymptotic regime in which $n \to \infty$. That is, the region remains fixed while the number of observations in that region increases, filling the space. The relevant point for our analysis is that, under this regime, the number of locations at which $Y$ is observed in any neighborhood of $s_0$ will go to infinity. Such a regime is natural in cases like ours where, without structural assumptions about $Y$, e.g., stationarity, it is simply not possible to learn the local features of the process at $s_0$ if data are not concentrated in a neighborhood around $s_0$.

Take a neighborhood around $s_0$ of radius $r_n$. The number of spatial grid points in that neighborhood is roughly $\pi r_n^2 n$, so we need to choose $r_n$ so that this number is sufficiently large. In particular, to ensure that the neighborhood contains at least $m$ grid points, where $m$ is some fixed integer, we roughly need

$r_n \geq \left(\frac{m}{\pi n}\right)^{1/2}.$   (11)

It follows from Theorem 2 that the joint distribution of the response at these $m$-many spatial locations around $s_0$ (corresponding to $m$-many vectors in the unit disk $\mathbb{U}$) would be approximately exchangeable and, consequently, a conformal prediction algorithm that creates non-conformity scores using only these observations would be valid for predicting $Y(s_0)$.
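The requirement (11) is easy to check numerically; the R sketch below builds a regular grid on the unit square and counts the grid points within the suggested radius of a target location (the grid size and m = 50 are illustrative choices).

```r
# Regular k x k grid of cell midpoints on the unit square
grid_pts <- function(k) {
  g <- seq(0.5 / k, 1 - 0.5 / k, length.out = k)
  as.matrix(expand.grid(x = g, y = g))
}

S  <- grid_pts(30)                            # n = 900 grid points
s0 <- c(0.5, 0.5)
m  <- 50
r  <- sqrt(m / (pi * nrow(S)))                # radius suggested by (11)
sum(sqrt(rowSums(sweep(S, 2, s0)^2)) <= r)    # roughly m points fall inside
```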

More precisely, form $D$ and $D_{-i}$, for $i = 1, \ldots, m+1$, based on only the $m$ locations closest to $s_0$. Set $s_{m+1} = s_0$ and take $y$ as a provisional value for $Y_{m+1} = Y(s_0)$. Using a suitable prediction procedure, like that based on Kriging discussed above, define

$T_i = A(D_{-i}, Y_i), \quad i = 1, \ldots, m+1,$

and the corresponding plausibility contour

$\mathrm{pl}_m(y) = \frac{1}{m+1} \sum_{i=1}^{m+1} 1\{T_i \geq T_{m+1}\}.$   (12)

Specific details are presented in Algorithm 2.

Input: observations $Y_1, \ldots, Y_n$ at locations $s_1, \ldots, s_n$; prediction location $s_0$; non-conformity measure $A$; significance level $\alpha$; a fine grid of candidate values $y$; and the number of neighbors $m$ to consider
Output: prediction interval $C_\alpha(s_0)$ for $Y(s_0)$
1 form $D$ based on the $m$ locations closest to $s_0$;
2 set $s_{m+1} = s_0$;
3 for $y$ in the specified grid do
4       set $Y_{m+1} = y$;
5       for $i = 1$ to $m+1$ do
6             define $D_{-i}$ by removing $Y_i$ from $D$;
7             $T_i = A(D_{-i}, Y_i)$;
8       end for
9       compute plausibility $\mathrm{pl}_m(y)$ for $y$ as in (12);
10      include $y$ in $C_\alpha(s_0)$ if $\mathrm{pl}_m(y) > \alpha$;
11 end for
return $C_\alpha(s_0)$.
Algorithm 2 Local spatial conformal prediction (LSCP).
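A minimal R sketch of the local step in Algorithm LSCP: select the m observations closest to the prediction location and then run the conformal step on that subset, reusing gscp_interval from the Section 3 sketch (the helper names nearest_m and lscp_interval are illustrative).

```r
# Indices of the m locations closest to s0
nearest_m <- function(S, s0, m) {
  d <- sqrt(rowSums(sweep(S, 2, as.numeric(s0))^2))
  order(d)[seq_len(m)]
}

# LSCP sketch: conformal prediction restricted to the m nearest neighbors
lscp_interval <- function(y, S, s0, theta, m = 50, ...) {
  idx <- nearest_m(S, s0, m)
  gscp_interval(y[idx], S[idx, , drop = FALSE], s0, theta, ...)
}
```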

With the help of the local exchangeability result in Theorem 2, we can establish the (asymptotically approximate) theoretical validity of the local spatial conformal prediction method. As before, to highlight the conformal prediction interval's dependence on $s_0$, we write it as $C_\alpha(s_0)$.

Theorem 3.

Consider an infill asymptotic regime with $n$ spatial grid points, with $n \to \infty$. Fix an integer $m$ and let the radius $r_n$ satisfy (11). Then, under the assumptions of Theorem 2, the prediction intervals returned by Algorithm LSCP are asymptotically valid at $s_0$ in the sense that

$\inf_P \liminf_{n \to \infty} P\{Y(s_0) \in C_\alpha(s_0)\} \geq 1 - \alpha - \epsilon_m,$

where $P$ is the joint distribution of $Y$ at the spatial grid points, as well as at $s_0$, the outermost infimum is over all such distributions for $Y$, and $\epsilon_m \geq 0$ vanishes as $m \to \infty$. The inequality above can be made arbitrarily close to equality by taking $m$ sufficiently large.

Proof.

See Appendix B.3. ∎

4.3 Algorithm sLSCP: smoothed LSCP

Theorem 3 implies that the local spatial conformal prediction based on the $m$ nearest neighbors is approximately valid under the infill asymptotic regime. However, in practice, completely disregarding the contribution of the observations outside the $m$ nearest neighbors may be unsatisfactory, so we propose a smoothed version of the LSCP algorithm (sLSCP).

The GSCP algorithm weights all non-conformity measures equally in the plausibility computation in (4), but this is questionable for non-stationary processes with stochastic properties that vary throughout the spatial domain. To allow for non-stationarity, the sLSCP algorithm weights the non-conformity measures by how far the corresponding location $s_i$ is from the prediction location. Let $\kappa$ be a non-increasing function, and define weights

$w_i \propto \kappa(d_i), \quad i = 1, \ldots, n+1,$

where $d_i = \|s_i - s_0\|$ is the distance from $s_i$ to the prediction location (with the convention $s_{n+1} = s_0$, so $d_{n+1} = 0$), and the proportionality constant is such that $\sum_{i=1}^{n+1} w_i = 1$. Different $\kappa$ functions can be applied, but we recommend the normalized Gaussian kernel function with bandwidth $\gamma$,

$w_i \propto \exp\!\left\{-\frac{d_i^2}{2\gamma^2}\right\}.$   (13)

Note that, if $\gamma \to \infty$, then $w_i \to 1/(n+1)$ for each $i$, which corresponds to the GSCP algorithm. Finally, with these new weights, the plausibility contour at a provisional value $y$ of $Y(s_0)$ is given by

$\mathrm{pl}(y) = \sum_{i=1}^{n+1} w_i\, 1\{T_i \geq T_{n+1}\}.$   (14)

As before, while many non-conformity measures are possible, we recommend the Kriging-based strategy, with the non-conformity measure $A$ based on the standardized residuals defined in (2).

Since we are interested only in the local structure of $Y$, it is natural that locations far from $s_0$ have negligible weight, as in (13). But including all observations requires some non-trivial calculations, e.g., inverting a large covariance matrix. Therefore, to avoid cumbersome and irrelevant computation, we recommend using only the $M$ observations closest to $s_0$, both for the Kriging predictions that determine the $T_i$ and in the plausibility scores in (14). The resulting method is both locally adaptive and computationally efficient for large data sets.
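The R sketch below computes normalized Gaussian kernel weights as in (13) and the weighted plausibility as in (14) from a vector of non-conformity scores; it assumes that the last row of the location matrix and the last score correspond to the prediction location, a convention chosen here for illustration.

```r
# Normalized Gaussian kernel weights with bandwidth `bw`; `S` holds the M
# retained locations plus the prediction location s0 as its last row.
gauss_weights <- function(S, s0, bw) {
  d <- sqrt(rowSums(sweep(S, 2, as.numeric(s0))^2))
  w <- exp(-d^2 / (2 * bw^2))
  w / sum(w)                                   # weights sum to one
}

# Weighted plausibility as in (14); the last score belongs to the provisional
# value at s0.
weighted_plaus <- function(scores, w) {
  sum(w * (scores >= scores[length(scores)]))
}
```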

The tuning parameter $\gamma$ can be selected using cross validation, as illustrated in Sections 5 and 6. The value of $M$ is determined by the bandwidth $\gamma$, so that all observations with substantial weight $w_i$ are included, as are the observations required for the Kriging prediction of these observations. Typically the number of nearby observations required to approximate the Kriging prediction is a small subset of the total number of observations (Stein, 2002). As a rule of thumb, $M$ could be selected to roughly include all observations within a radius of $s_0$ that captures the observations with substantial weights, along with the nearest 15 neighbors of each observation within that radius of $s_0$. We summarize the details of Algorithm sLSCP in Algorithm 3. For simplicity, we use sLSCP and LSCP interchangeably.

Input: observations $Y_1, \ldots, Y_n$ at locations $s_1, \ldots, s_n$; prediction location $s_0$; non-conformity measure $A$; significance level $\alpha$; a fine grid of candidate values $y$; weight (bandwidth) parameter $\gamma$; and the number of neighbors $M$ to consider
Output: prediction interval $C_\alpha(s_0)$ for $Y(s_0)$
1 determine $M$ through $\gamma$ if not given;
2 form $D$ based on the $M$ locations closest to $s_0$;
3 set $s_{M+1} = s_0$;
4 calculate weights $w_1, \ldots, w_{M+1}$ as in (13);
5 for $y$ in the specified grid do
6       set $Y_{M+1} = y$;
7       for $i = 1$ to $M+1$ do
8             define $D_{-i}$ by removing $Y_i$ from $D$;
9             $T_i = A(D_{-i}, Y_i)$;
10      end for
11      compute plausibility $\mathrm{pl}(y)$ for $y$ as in (14);
12      include $y$ in $C_\alpha(s_0)$ if $\mathrm{pl}(y) > \alpha$;
13 end for
return $C_\alpha(s_0)$.
Algorithm 3 Smoothed local spatial conformal prediction (sLSCP).

5 Simulation study

5.1 Data generation

We consider one mean-zero Gaussian stationary process (Scenario 1) and seven non-Gaussian and/or nonstationary data-generating scenarios (Scenarios 2–8). The seven scenarios are generated based on transformations of a mean-zero stationary Gaussian process $Z(s)$ and a white noise process $\varepsilon(s)$ with a standard normal distribution. The stationary Gaussian process $Z(s)$ has a Matérn covariance function with variance $\sigma^2 = 3$, range $\phi = 0.1$, and smoothness $\nu = 0.7$. Data are sampled on a regular grid of $n$ points in the unit square $[0,1]^2$, with two grid sizes considered.

Specifically, the eight scenarios under consideration are:

  1. ;

  2. ;

  3. where is the standard normal distribution function and is the quantile function;

  4. ;

  5. ;

  6. where ;

  7. ; and

  8. where .

Scenario 1 is Gaussian and stationary, Scenarios 2–4 are stationary but non-Gaussian, and Scenarios 5–8 are nonstationary either in the spatial variance (Scenarios 5 and 6), error variance (Scenario 7), or mean (Scenario 8). Figure 2 shows one realization of each of the scenarios.

Figure 2: One realization from each of the simulation scenarios.

5.2 Prediction methods and metrics

For each dataset we apply the global and local conformal spatial prediction algorithms. For the parametric Kriging method and for the initial Kriging predictions used by our proposed conformal methods, we estimate the spatial covariance parameters (Section 2.1) using empirical variogram methods (Cressie, 1992). The empirical variograms are calculated using the variog function in the R package geoR, and the covariance parameters are chosen to minimize the weighted (by number of observations) squared error between the empirical and model-based variograms. We compare the conformal methods with standard global Kriging prediction, using (1) to construct the prediction interval, and with the local Kriging (laGP) method of Gramacy and Apley (2015), which dynamically defines the support of a Gaussian process predictor based on a local subset of the data. For laGP, we use the function provided by the laGP package in R, with the local sequential design scheme growing from 6 to 50 points through an empirical Bayes mean-square prediction error criterion.

Methods are compared using leave-one-out cross validation, i.e., we generate a prediction interval for each observation given all other observations. Each scenario is repeated 100 times, and performance is evaluated using the average coverage of 90% prediction intervals, the average interval width, and the average interval score (Gneiting and Raftery, 2007), defined as

$\mathrm{IS}(l, u; y) = (u - l) + \frac{2}{\alpha}(l - y)\, 1\{y < l\} + \frac{2}{\alpha}(y - u)\, 1\{y > u\},$

where $[l, u]$ represents the prediction interval and $y$ is the observation. A smaller interval score is desirable as this rewards both high coverage and narrow intervals. We use $\alpha = 0.1$ in this simulation study.
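For reference, the interval score just defined is one line of R; the example shows how a narrow interval that misses the observation is penalized (the inputs are hypothetical).

```r
# Interval score for a (1 - alpha) prediction interval [l, u] and observation y
interval_score <- function(y, l, u, alpha = 0.10) {
  (u - l) + (2 / alpha) * (l - y) * (y < l) + (2 / alpha) * (y - u) * (y > u)
}

interval_score(y = 3.2, l = -1, u = 1)   # width 2 plus penalty (2 / 0.1) * 2.2 = 46
```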

5.3 Results

We present the results for Scenarios 1–6 in this section; the others are in Appendix C. For the stationary Scenarios 1–4 we present results averaged over data sets and all spatial locations in Table 1. For the non-stationary scenarios that vary across the first spatial coordinate (Scenarios 5–7), we present the results by the first coordinate, averaged over the data sets and the second coordinate, in Figure 4; e.g., the value of coverage plotted at a given value of the first coordinate is the average of the coverage over all grid points sharing that first coordinate.

Scen  Method   Cov90   Width  IntScore   Cov90   Width  IntScore
1     GSCP     89.9%   4.58   5.77       90.0%   3.99   5.01
      LSCP     88.5%   4.55   5.96       89.2%   3.94   5.07
      Kriging  90.7%   4.68   5.75       88.5%   3.82   5.01
      laGP     86.9%   4.74   6.52       88.7%   4.26   5.52
2     GSCP     89.9%  31.84  63.07       90.0%  21.65  41.43
      LSCP     90.7%  30.72  55.32       90.9%  21.06  33.77
      Kriging  93.7%  43.70  65.54       92.4%  26.77  42.47
      laGP     91.7%  40.07  67.49       93.4%  32.07  46.27
3     GSCP     89.9%   4.64   6.22       90.0%   4.00   5.14
      LSCP     88.7%   4.58   6.22       89.2%   3.94   5.14
      Kriging  90.9%   4.78   6.19       88.9%   3.87   5.15
      laGP     87.8%   4.73   6.71       89.0%   4.27   5.61
4     GSCP     89.9%   6.93  11.03       90.0%   6.21   9.84
      LSCP     88.9%   6.82  11.21       90.0%   5.96   9.26
      Kriging  91.7%   7.65  11.02       91.2%   6.61   9.85
      laGP     89.2%   7.37  11.93       91.0%   6.68  10.08
Table 1: Performance comparison for simulation Scenarios ("Scen") 1–4; the two blocks of columns correspond to the two grid sizes considered. The metrics are the empirical coverage of 90% prediction intervals ("Cov90"), the width of the prediction intervals ("Width"), and the interval score ("IntScore"), each averaged over location and dataset. The methods are global (GSCP) and local (LSCP) conformal prediction, stationary and Gaussian Kriging ("Kriging"), and local approximate Gaussian process ("laGP") regression.

Table 1 shows the results for data generated from stationary processes (Scenarios 1–4). In Scenario 1, the Gaussian and stationary process, the performance of GSCP, LSCP, and Kriging is comparable. Kriging performs well in this case since the data generating mechanism aligns with its underlying assumptions, but the conformal methods are competitive with the parametric model.

In Scenarios 2, 3, and 4, the non-Gaussian but stationary processes, GSCP, LSCP, and Kriging perform more or less the same in terms of interval score and outperform laGP. However, the coverage of the conformal methods, especially the GSCP algorithm, is closer to the nominal level than Kriging and the Kriging intervals are generally wider than the conformal intervals. Figure 3 examines the results in more detail for Scenario 3. Since this is a stationary process, the width of the prediction intervals should be similar across different locations, which is the case for the global methods GSCP and Kriging but not the local methods LSCP and laGP. This extra variation is reflected in the coverage proportion across datasets. The coverage of GSCP stabilizes at 90% for each simulation while the coverage for other methods varies considerably and is often below the desired level. This detailed analysis focused on Scenario 3, but the same conclusions can be drawn in Scenarios 2 and 4.

Figure 3: Detailed performance comparison for Scenario 3. For each dataset and method we compute the standard deviation of interval width across space and the average coverage across space, and the boxplots summarize the sampling distribution of these statistics across the 100 simulated datasets.

Figure 4 shows the results for nonstationary Scenarios 5 and 6 for one of the two grid sizes; the results for the other grid size and for the other nonstationary scenarios are in Appendix C. LSCP performs the best among the four methods for these nonstationary scenarios. For Scenario 5, the process is Gaussian to the west and non-Gaussian to the east. The global prediction methods GSCP and Kriging generate prediction intervals with similar width across the domain (ignoring edge effects), while LSCP and laGP provide wider intervals in the east (first coordinate near 1) than in the west (first coordinate near 0). LSCP has coverage around 90% throughout and the lowest interval scores, especially in the east where the process is more non-Gaussian. Similarly, in Scenario 6, the correlation is stronger in the east than the west, and LSCP performs the best by providing adaptive prediction interval widths and valid coverage across space.

Figure 4: Performance comparison for Scenarios 5 and 6.

We also conducted a simulation study with observations sampled uniformly on the unit square. The performance is very similar to that obtained with equally-spaced grid points, so we only show the results for grid points in this paper.

5.4 Sensitivity Analysis

Finally, we study the sensitivity of the proposed method to tuning parameter selection and to the initial estimates of the Matérn covariance parameters. Figure 5 gives the average interval score of LSCP with different values of the tuning parameter $\gamma$ for Scenarios 5 and 6, along with the average interval scores of GSCP, Kriging, and laGP. We find a V-shaped relationship between the resulting interval score and the tuning parameter $\gamma$: the average performance is best at an intermediate value of $\gamma$, and LSCP outperforms the other methods regardless of the selection of $\gamma$. To achieve the best performance in practice, we would select $\gamma$ using cross validation.

Figure 5: Sensitivity analysis for simulation Scenarios 5 and 6. The interval score for the LSCP method is computed for several values of the kernel bandwidth parameter $\gamma$ and compared to the other methods.

The proposed spatial conformal prediction methods rely on estimation of the Matérn covariance parameters $\theta$. To test for sensitivity to this step, we repeat simulation Scenario 1, since in this case the optimal $\theta$ is known to be the true Matérn covariance parameters used to generate the data. When generating prediction intervals, rather than estimating the Matérn covariance parameters from the simulated data, we use the values of $\theta$ listed in the first four columns of Table 2 for conformal prediction and Kriging, where each of the four parameters is modified by plus or minus 50% of its true value. We use the GSCP method for this sensitivity analysis. The corresponding average coverage and average width of the conformal prediction intervals and the Kriging intervals are listed in the last four columns.

Nugget ($\tau^2$)  Partial sill ($\sigma^2$)  Range ($\phi$)  Smoothness ($\nu$)   GSCP Cov90  GSCP Width   Kriging Cov90  Kriging Width
1.0  3.0  0.10  0.70   89.69%  4.57   90.05%  4.60
1.5  3.0  0.10  0.70   89.69%  4.61   93.73%  5.24
0.5  3.0  0.10  0.70   89.70%  4.59   82.76%  3.84
1.0  4.5  0.10  0.70   89.68%  4.63   92.64%  5.04
1.0  1.5  0.10  0.70   89.68%  4.63   85.25%  4.09
1.0  3.0  0.15  0.70   89.68%  4.60   86.71%  4.21
1.0  3.0  0.05  0.70   89.67%  4.61   95.13%  5.54
1.0  3.0  0.10  1.05   89.69%  4.58   85.56%  4.08
1.0  3.0  0.10  0.35   89.70%  4.63   94.98%  5.53
Table 2: Sensitivity analysis for the simulation study. The first four columns give the Matérn covariance parameters $\theta = (\tau^2, \sigma^2, \phi, \nu)$ used to generate the conformal and Kriging prediction intervals; the first row gives the true values. The last four columns give the average coverage and interval width of the 90% intervals produced by the conformal and Kriging methods when using the corresponding $\theta$.

While using the true $\theta$ leads to slightly narrower intervals, the average coverage of the conformal prediction intervals is between 89.67% and 89.70% across all values of $\theta$ considered. In contrast, the coverage of the Kriging intervals ranges from 82.76% to 95.13%. Therefore, the conformal methods are more robust to estimation of the Matérn covariance parameters than the parametric Kriging model.

6 Real data analysis

This section demonstrates the performance of the conformal prediction method using the canopy height data in Figure 6(a). The data were originally presented in Cook et al. (2013) and were analyzed using a stationary Kriging model in Datta et al. (2016); they are available in the spNNGP package (Finley et al., 2017) in R. The data exhibit clear nonstationarity and non-normality. For example, there are several heterogeneous areas with low canopy heights around the location with longitude and latitude (UTM coordinates) of 729,000 and 470,000, respectively.

We compare methods using 90% prediction intervals for 10,000 randomly chosen test locations. We use LSCP and choose the tuning parameters to minimize the interval score averaged over 1,000 randomly chosen validation locations. The average interval score and the tuning parameter $\gamma$ show a V-shaped relationship similar to that in Figure 5. Table 3 compares the performance of LSCP on the 10,000 test locations with Kriging and laGP. LSCP outperforms the other methods: its empirical coverage is the closest to the desired 90% and it has the smallest interval score.

Method   Width  Cov90  IntScore
LSCP 3.00 89.2% 4.31
Kriging 5.54 96.7% 6.89
laGP 5.04 91.3% 6.61
Table 3: Performance comparison for the canopy height data. The metrics are the width, coverage (“Cov90”) and interval score (“IntScore”) of 90% prediction intervals, each averaged over 10,000 randomly chosen test locations. The methods are local conformal prediction (LSCP), stationary and Gaussian Kriging (“Kriging”) and local approximate Gaussian process (“laGP”) regression.

Figure 6 plots the interval widths for each method. Unlike Kriging, the LSCP and laGP interval widths are locally adaptive, with wider intervals in heterogeneous areas and narrower intervals in homogeneous areas. Comparing LSCP and laGP, LSCP generally provides narrower intervals, meaning the proposed method is more efficient than laGP. In addition, the locations of the observations that fall outside the prediction intervals are uniformly distributed for LSCP and laGP, but clustered in the heterogeneous areas for Kriging. In short, the proposed spatial conformal prediction algorithm performs best in this real data analysis.

(a) Original data
(b) LSCP result
(c) Kriging result
(d) laGP result
Figure 6: (a) Heatmap of the canopy height data; (b)–(d) Prediction interval width (color) and locations not covered (points) for LSCP, Kriging, and laGP. Longitude and latitude are in UTM Zone 18.

7 Discussion

In this paper, we proposed a spatial conformal prediction algorithm that provides valid, robust, and model-free prediction intervals by combining spatial methods with classical conformal prediction. We provided both global and local versions to accommodate different stationarity cases and sampling designs, and we proved their validity under various sampling designs and data-generating mechanisms. To the authors' knowledge, this work is the first to make the classical conformal algorithm work for non-exchangeable data. Our simulation studies and real data analysis demonstrate the advantages of the proposed spatial conformal prediction algorithms. We also developed an R package entitled scp (https://github.com/mhuiying/scp) to compute plausibility contours and generate spatial prediction intervals using either the Kriging residual or any other user-defined non-conformity measure.

An attractive feature of the proposed algorithms is that they are model-free in the sense that their theoretical validity does not depend on specifying the correct model. In our implementations we use the squared residuals from a simple parametric model to define the non-conformity terms. However, as anticipated by our theoretical results, our simulation study shows that the methods work well even if the parametric family or the mean and covariance functions are misspecified, and that the results are insensitive to inaccurate estimation of the parameters in the parametric model. This robustness allows the methods to be applied broadly and with confidence.

The local conformal prediction method relies on a dense grid of points around each prediction location and thus a large dataset. Computation time often prohibits application of spatial methods to large datasets. Fortunately, we are able to apply the conformal prediction method to large datasets by exploiting local algorithms and an explicit solution of the plausibility contour assuming the non-conformity measure is the squared Kriging residual. Similar derivations are needed for prediction procedures other than Kriging in order to maintain computational efficiency.

Future research directions include extending the work to spatial processes with discrete observations, e.g., when the response is binary or a count. Generalized spatial linear models are compatible with our current framework, see (10), but continuity in the distribution function is required. Therefore, further study would be needed to establish the validity of conformal prediction for discrete data. It is also of interest to extend the proposed spatial conformal prediction methods to temporal or multivariate spatial data.

Acknowledgments

HM (DMS–1638521) and RM (DMS–1811802) are supported by the National Science Foundation. BJR is supported by the National Institutes of Health (R01ES031651 and R01ES027892) and King Abdullah University of Science and Technology (3800.2).

References

  • Barber et al. (2019) Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2019). The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684.
  • Cella and Martin (2020) Cella, L. and Martin, R. (2020). Valid distribution-free inferential models for prediction. arXiv preprint arXiv:2001.09225.
  • Cook et al. (2013) Cook, B., Nelson, R., Middleton, E., Morton, D., McCorkel, J., Masek, J., Ranson, K., Ly, V., Montesano, P., et al. (2013). NASA Goddard's LiDAR, Hyperspectral and Thermal (G-LiHT) airborne imager. Remote Sensing, 5(8):4045–4066.
  • Cressie (1992) Cressie, N. (1992). Statistics for spatial data. Terra Nova, 4(5):613–617.
  • Datta et al. (2016) Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.
  • Diggle et al. (1998) Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.
  • Duan et al. (2007) Duan, J. A., Guindani, M., and Gelfand, A. E. (2007). Generalized spatial Dirichlet process models. Biometrika, 94(4):809–825.
  • Finley et al. (2017) Finley, A., Datta, A., and Banerjee, S. (2017). spNNGP: Spatial regression models for large datasets using nearest neighbor Gaussian processes. R package version 0.1.0, 1.
  • Fuglstad et al. (2015) Fuglstad, G. A., Simpson, D., Lindgren, F., and Rue, H. (2015). Does non-stationary spatial data always require non-stationary random fields? Spatial Statistics, 14:505–531.
  • Gelfand et al. (2005) Gelfand, A. E., Kottas, A., and MacEachern, S. N. (2005). Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association, 100(471):1021–1035.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
  • Gramacy and Apley (2015) Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.
  • Guan (2019) Guan, L. (2019). Conformal prediction with localization. arXiv preprint arXiv:1908.08558.
  • Heaton et al. (2019) Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425.
  • Henley (2012) Henley, S. (2012). Nonparametric Geostatistics. Springer Science & Business Media.
  • Hewitt and Savage (1955) Hewitt, E. and Savage, L. J. (1955). Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80(2):470–501.
  • Krupskii and Genton (2019) Krupskii, P. and Genton, M. G. (2019). A copula model for non-Gaussian multivariate spatial data. Journal of Multivariate Analysis, 169:264–277.
  • Lei et al. (2018) Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111.
  • Lei and Wasserman (2014) Lei, J. and Wasserman, L. (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96.
  • Martin (2019) Martin, R. (2019). False confidence, non-additive beliefs, and valid statistical inference. International Journal of Approximate Reasoning, 113:39–73.
  • Martin and Lingham (2016) Martin, R. and Lingham, R. T. (2016). Prior-free probabilistic prediction of future observations. Technometrics, 58(2):225–235.
  • Martin and Liu (2013) Martin, R. and Liu, C. (2013). Inferential models: A framework for prior-free posterior probabilistic inference. Journal of the American Statistical Association, 108(501):301–313.
  • Martin and Liu (2016) Martin, R. and Liu, C. (2016). Inferential Models, volume 147 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL.
  • Reich and Fuentes (2007) Reich, B. J. and Fuentes, M. (2007). A multivariate semiparametric Bayesian spatial modeling framework for hurricane surface wind fields. The Annals of Applied Statistics, 1(1):249–264.
  • Reich and Shaby (2012) Reich, B. J. and Shaby, B. A. (2012). A hierarchical max-stable spatial model for extreme precipitation. The Annals of Applied Statistics, 6(4):1430.
  • Risser (2016) Risser, M. D. (2016). Nonstationary spatial modeling, with emphasis on process convolution and covariate–driven approaches. arXiv preprint arXiv:1610.02447.
  • Rodriguez and Dunson (2011) Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1).
  • Romano et al. (2019) Romano, Y., Patterson, E., and Candes, E. (2019). Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3538–3548.
  • Shafer (1976) Shafer, G. (1976). A Mathematical Theory of Evidence, volume 42. Princeton university press.
  • Shafer and Vovk (2008) Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421.
  • Stein (1990) Stein, M. L. (1990). A comparison of generalized cross validation and modified maximum likelihood for estimating the parameters of a stochastic process. The Annals of Statistics, 18(3):1139–1157.
  • Stein (2002) Stein, M. L. (2002). The screening effect in kriging. The Annals of Statistics, 30(1):298–323.
  • Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under covariate shift. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 2530–2540. Curran Associates, Inc.
  • van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
  • Vovk et al. (2005) Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer Science & Business Media.

Appendix A Computational details

Applying the predictive distributions in (1) and provisionally setting $Y_{n+1} = y$, each non-conformity score can be written as a quadratic function of $y$,

$T_i(y) = a_i y^2 + b_i y + c_i, \quad i = 1, \ldots, n+1,$   (15)

where the coefficients $a_i$, $b_i$, and $c_i$ are determined by the Kriging weights and the observed data.

Since the inverse covariance matrix is positive definite, for each $i$ the set of $y$ values satisfying $T_i(y) \geq T_{n+1}(y)$ is determined by the two roots, denoted $l_i$ and $u_i$, of a quadratic equation. The plausibility calculation in (6) is then a step function of $y$ with jump points at the $l_i$'s and $u_i$'s, and the plausibility function for sLSCP in (14) is likewise a step function.

With the help of this step-function representation, we can solve for the prediction interval directly, with no need to enumerate over a grid of candidate values $y$.
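The step-function computation above ultimately reduces to finding the roots of quadratics; a small R helper (with hypothetical coefficients a, b, c) is sketched below.

```r
# Real roots of a*y^2 + b*y + c = 0; for a > 0 the set {y : quadratic <= 0}
# is the interval between the two roots (empty if no real roots exist).
quad_roots <- function(a, b, c) {
  disc <- b^2 - 4 * a * c
  if (disc < 0) return(c(NA_real_, NA_real_))
  sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a))
}

quad_roots(1, -3, 2)   # roots at 1 and 2
```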

Appendix B Proofs

B.1 Proof of Proposition 1

To be even more explicit about the notation, for a given set of data and a new location $s$, write the GSCP prediction interval as $C_\alpha(s)$. Now define the conditional coverage probability function