Since their first high-profile application two decades ago (Mann et al., 1998), multi-proxy spatiotemporal climate field reconstructions (CFRs) have become increasingly popular in the climate science community for their ability to reconstruct global climate variability on seasonal and annual timescales over many hundreds of years into the past (Jones et al., 2009; Smerdon and Pollack, 2016; Christiansen and Ljungqvist, 2017). Such reconstructions are critical because instrumental observations are only available for the past 100-150 years. CFRs therefore provide estimates of past climate variability and extreme events that may not be well represented within the instrumental interval, which helps to better characterize the physical dynamics of the climate system and how climate may change in the future.
The basic approach of CFRs is to statistically relate a collection of climate proxies, such as isotopic information in ice cores, the width of tree rings, or coral isotope data, to observed climate variables like temperature and soil moisture during their periods of overlap (Masson-Delmotte et al., 2013). Once the relationship between the proxies and the climate variables is established, the proxies are used to estimate climate variability during past periods when observations are unavailable. CFRs thus depend critically on the imperfect proxy information and on the robustness with which its relationship to observed climate variables can be defined. A central approach to this problem has been through regularized versions of multivariate regression techniques (e.g. Lee et al., 2008; Jones et al., 2009; Tingley et al., 2012; Smerdon, 2012; Guillot et al., 2015; Smerdon and Pollack, 2016; Li et al., 2016; Christiansen and Ljungqvist, 2017). More advanced techniques have been emerging, however, each associated with advantages and challenges that require further evaluation and assessment.
A recent CFR innovation is the class of paleoclimatic data assimilation (DA) algorithms, reconstruction methods that optimally combine general circulation models (GCMs) with proxy information to create paleoclimate reconstructions (Goosse et al., 2012; Steiger et al., 2014; Hakim et al., 2016; Steiger et al., 2018; Tardif et al., 2019). The primary advantage of DA approaches is their ability to jointly reconstruct multiple atmosphere-ocean variables in a manner that is physically consistent within the framework of a climate model. An additional advantage is that DA algorithms naturally provide probabilistic, ensemble estimates of past climate. Such ensemble reconstructions begin with a background ensemble of states from a climate model. These states are then updated through the equations of DA (Steiger et al., 2014), based on the available proxy information and the uncertainties involved, to arrive at an analysis ensemble state estimate. This probabilistic analysis state provides an uncertainty quantification that is critical given the noisy relationship between paleoclimate proxies and climate variables.
Despite the rapid development of DA-based reconstruction methods, much remains to be characterized about the influence of each of their two components: climate models and paleoclimate proxies. In currently published DA-based CFRs (e.g., Steiger et al. (2018)), it is hard to quantify how much information the models and the proxies each contribute to the end product. Furthermore, it is unclear whether or not the climate model-based background is fundamentally distinct from the analysis. If the background and analysis are not in fact distinct, then this would imply that DA-based CFRs are essentially dominated by the underlying climate model and fail to glean information from the historical proxy data. A lack of proxy influence would therefore indicate a need to fundamentally re-evaluate DA methodologies.
In this paper, we quantify the level of proxy influence in the analysis states of a DA product by introducing a robust and computationally efficient method for evaluating the exchangeability of two ensembles of random fields. The purpose of this study is therefore twofold: to answer an important climatological question by quantifying and assessing the influence of proxies in a new DA-based CFR product, and to develop a new statistical test for comparing the distributions of two sets of random fields. In the following two subsections, we provide background on the methodological development embodied in this paper and on the characteristics of the DA-based CFR that we analyze.
1.1 Previous work on comparing random fields
Comparing two spatial processes has been addressed in both the geostatistics and functional data analysis literature. The general strategy in both frameworks is to reduce the dimension of the random process either by a low-rank decomposition or by parameterization, and then to develop a test for evaluating differences in the reduced dimension.
The wavelet decomposition has been widely used to reduce a stochastic process to a finite number of wavelet coefficients, so that the comparison between two processes can be transformed into a comparison between two sets of wavelet coefficients (Briggs and Levine, 1997; Shen et al., 2002; Pavlicova et al., 2008). Snell et al. (2000) and Wang et al. (2007) introduced methods for comparing random fields based on their spatial interpolation root-mean-square error and R coefficient. Their methods were later extended by Hering and Genton (2011) to include more arbitrary loss functions. Motivated by Lund and Li (2009), which compared two time series, Li and Smerdon (2012) proposed a parametric method to jointly assess the first two moments of two random fields.
Functional data analysis approaches assume that the spatial random fields are noisy realizations of an underlying continuous function. The majority of existing functional approaches have focused on testing the equality of the mean functions arising from two functional data sets (Ramsay and Silverman, 2005; Zhang and Chen, 2007; Horvath et al., 2013; Staicu et al., 2014), although more recently the second order structure of functional data has also been considered (Zhang and Shao, 2015). Li et al. (2016) extended Zhang and Shao (2015) to evaluate the joint difference in mean and covariance structure as well as in the trend surface between two spatiotemporal random fields. A nice feature of functional data analysis methods, as opposed to geostatistical methods, is that assumptions about distribution and model specification can be relaxed if there are replicate observations in the data.
All the above procedures are nevertheless inadequate for our problem because the proxies can simultaneously affect the mean, covariance, and higher order structures of the reconstructed climate field. The most comprehensive way to identify proxy influence is therefore to compare the distributions of the background and analysis states. The rich ensemble structure of the background and analysis states also allows us to examine more information than differences in the mean and covariance parameters. We take advantage of the ensembles by employing a functional data approach that is both distribution and parameter free.
The problem of comparing the distributions of functions has remained relatively unexplored. Hall and Van Keilegom (2007) proposed a Cramer-von Mises-like test by constructing an empirical distribution over each of the samples and measuring the distance between the empirical distributions. Benko et al. (2009) introduced a permutation test on the leading coefficients of the common functional principal components (FPCs) and Corain et al. (2014) introduced three omnibus tests for combining pointwise tests on the observations of the functions. Each of these methods depends on a resampling procedure that renders them computationally prohibitive for large ensembles like the DA ensemble output that we consider.
Pomann et al. (2016) proposed a method based on marginal FPCs that does not require resampling, called the Functional Anderson-Darling (FAD) test. The test compares the distributions of the marginal FPCs using the two-sample Anderson-Darling test with a Bonferroni correction. Lopez-Pintado and Romo (2009) proposed a rank-based band depth test. Their test is closely related to the multivariate distribution test based on the Quality Index (Liu and Singh, 1993), but it replaces the multivariate simplicial depth (Liu, 1990) with the functional band depth (Lopez-Pintado and Romo, 2009). Both of these tests are inadequate for our data because, as we show in the supplement, they are incapable of detecting heterogeneous variance changes across the domain of the generating process. As a consequence, these tests suffer a severe loss of power and miss an important trend (Section 4.1) in the PHYDA dataset.
In this paper, we propose a new non-parametric statistic, based on the concept of data depth, for assessing the equality of distributions between two spatial data sets. Our test falls into the general category of functional data analysis methods for comparing spatial random fields, but is conceptually different from previous efforts in this area. The use of data depth for comparing two multivariate distributions was first explored by Liu and Singh (1993), who introduced the Quality Index for this purpose. The Quality Index essentially measures the mean outlyingness of one sample from a reference sample using data depth. We extend their ideas to the functional setting and propose a modification that makes our test statistic invariant to the choice of reference distribution. The use of depth, and particularly the integrated Tukey depth (Cuevas and Fraiman, 2009), ensures that our test is computationally efficient, distribution free, and invariant to location, scale, warping, and other nuisance properties that could influence the testing (Nagy et al., 2016).
1.2 Reconstruction data
The DA-based CFR that we analyze comes from the Paleo Hydrodynamics Data Assimilation product (PHYDA), which is a global paleoclimate reconstruction of both temperature and moisture variables (Steiger et al., 2018). PHYDA incorporates a simulation from the Community Earth System Model (CESM) last millennium ensemble experiment, run over the historical years 850 C.E. to 1850 C.E. (Otto-Bliesner et al., 2016).
A collection of modeled climate fields from the CESM simulation is used to form the background state in the DA scheme. For our analysis herein, we specifically use the modeled and reconstructed 2-meter surface temperature fields. The temperature fields are processed from the native model output by computing annual averages and spatially discretizing onto a regular latitude-longitude grid. Annual in this context is defined as the interval between April and March of the following year, yielding 998 such climatological years for the background ensemble. Because of the large data files produced by PHYDA, we use only a 100-member sub-ensemble, randomly drawn from the original 998-member ensemble, for our analyses. The final processed background state therefore consists of 100 spatial fields, each observed on the same grid points.
The 998 analysis states are derived from the background state by using DA to incorporate the temporally available proxy information during each year of the reconstruction. Each analysis state is also a 100-member ensemble of 2m surface temperature fields discretized to the same latitude-longitude grid as the background ensemble. Quantifying the influence that proxies have on the analysis states is quite challenging due to their small individual effect sizes and the fact that they can affect higher-order structures of the data beyond the mean and variance. Identifying the full effect of the proxies therefore requires testing for distributional changes.
2 Statistical Solution
We first formulate our scientific problem as a hypothesis testing problem, then introduce the integrated Tukey depth and propose our test statistic, followed by a discussion of the asymptotic distribution of our test statistic under the null hypothesis.
2.1 Formulation of evaluating proxy influence
Consider the ensemble in the background state and the ensemble in the analysis state at a given time t in PHYDA. Under the assimilation design, the proxies at time t are the only contributors to the differences between the two sets of ensembles. Our goal is to define and quantify the differences between the background and analysis ensembles at each year in order to assess the proxy influence.
The extent to which proxies impact the analysis states depends on many factors, including the proxy type (e.g., tree ring, ice core, coral), where the proxies were collected, and the interval over which the proxies were observed (Steiger et al., 2018). As shown in Fig. 1, the effects from proxies may be small and thinly diffused over a non-contiguous area due to spatial correlations and teleconnections. In fact, most of the induced mean differences generally fall within the natural variation of the background fields. The most comprehensive approach to testing for the proxies' cumulative influence is therefore to test for changes in the distributions of the background and analysis ensembles. We thus formulate our problem into the following hypotheses:
that is, the null hypothesis states that the background and analysis ensembles at a given time are equal in distribution, against the alternative that they are not. In addition to the outcome of these hypothesis tests at each time, we are equally interested in the pattern of those outcomes over the reconstruction period. Because proxies are progressively added into the background states, we may expect the differences between the two distributions to increase over time.
Under the functional data analysis regime, we assume that the observed data are generated from continuous functions combined with additive noise, rather than from a spatially correlated stochastic process. In this framework, each ensemble member represents a single observation over a spatial domain in which the grid points are embedded. This distinction allows us to consider each ensemble member as an i.i.d. realization of a stochastic process in a functional space.
We develop our test statistic for the testing problem (1) at any given time in a general context; for ease of notation, we suppress the time index hereafter. Suppose the two samples contain n and m functions, respectively, each observed at the same grid locations within a compact spatial domain, which we take without loss of generality to be the unit square. We assume that each function in either sample is a univariate continuous function on this domain. Specific to our data, the two samples of functions respectively represent the background-state and the analysis-state ensemble members evaluated at the grid locations.
Consider two absolutely continuous distributions on the space of continuous functions, and suppose the functions in each of the two samples are i.i.d. draws from their respective distributions. We are interested in testing whether the two samples of functional data follow the same distribution, so (1) is equivalent to the hypotheses,
for any given time. We will use functional data depth to construct a two-sample Kolmogorov-Smirnov-type test. Other distribution-free tests, such as the Anderson-Darling or Cramér-von Mises tests, could equally have been applied; we chose Kolmogorov-Smirnov for its convenient asymptotic form and its ubiquity in testing distributions.
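For readers unfamiliar with the building block, the classical univariate two-sample Kolmogorov-Smirnov test illustrates the distribution-free construction we extend. A minimal SciPy example (the data and parameter values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=200)   # sample one: N(0, 1)
y = rng.normal(0.5, 1.0, size=200)   # sample two: N(0.5, 1), a mean shift

# The KS statistic is the sup-distance between the two empirical CDFs; its
# null distribution is asymptotically Kolmogorov for any continuous law.
result = stats.ks_2samp(x, y)
print(result.statistic, result.pvalue)
```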
2.2 Integrated Tukey depth
Data depth is a statistical concept for quantifying the centrality or “depth” of the observed data points with respect to a reference distribution. The closer an observation is to the center of the distribution, the higher its depth value should be to indicate its centrality. As the reference distribution is typically unknown, the depth of an observation has to be estimated via an empirical notion of data depth. Many notions of data depth for functional data have been developed, including the integrated band depth (Lopez-Pintado and Romo, 2009), extremal depth (Narisetty and Nair, 2017), and various integrated univariate depths (Fraiman and Muniz, 2001). Each of these depth functions has its own strengths and weaknesses but none dominates the others in all aspects; see Cuevas and Fraiman (2009) and Nagy et al. (2016) for reviews. We chose the integrated Tukey depth as the basis of our test for its simplicity, robustness, computational tractability, and highly desirable theoretical properties.
Integrated depths are a well studied class of functional data depth measures that were first introduced by Fraiman and Muniz (2001) and then studied extensively by Cuevas and Fraiman (2009) and Nagy et al. (2016). To define an integrated depth function, a univariate depth function is first defined over a collection of one dimensional “projections” of the data which often refers to the observed values of the functions at each location . The univariate depth is then integrated over these projections to yield the integrated depth. Among all the univariate depths, the Tukey depth and the simplicial depth are perhaps the two most popular ones. We opted to use the Tukey depth but the simplicial depth would have been equally effective because the orderings they induce are nearly identical.
The integrated Tukey depth is defined as follows. Let F be the distribution of a random function X on the domain S, and let F_s denote the marginal distribution of X(s) at location s. The univariate Tukey depth of a point x with respect to F_s is

D(x; F_s) = min{ F_s(x), 1 - F_s(x-) },

where F_s(x-) is the left limit of F_s at x, and the integrated Tukey depth of a function f with respect to F is

ID(f; F) = ∫_S D(f(s); F_s) ds.
To ensure that this depth function is proper, we refer to the criteria proposed by Zuo and Serfling (2000) and Mosler and Polyakova (2012). In Nagy et al. (2016) it was shown that the integrated Tukey depth satisfies translation invariance, function scale invariance, measure-preserving rearrangement invariance, maximality at the center, continuity, and quasi-concavity of the induced level sets. They also demonstrated strong universal consistency and weak uniform consistency, which assure that the integrated Tukey depth behaves well and asymptotically converges to its population counterpart under regularity conditions.
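As a concrete illustration, the empirical version of this depth can be computed directly when the functions are observed on a common grid. The following sketch is our own illustration, not code from the paper: it approximates the integral by the average of the marginal empirical Tukey depths over the grid points.

```python
import numpy as np

def integrated_tukey_depth(f, sample):
    """Empirical integrated Tukey depth of the curve `f` (length-d array)
    with respect to `sample` (an n x d array of curves on the same grid).

    At each grid point the univariate Tukey depth is the smaller of the
    fractions of the sample lying at or below / at or above f(s); the
    integral over the domain is approximated by the grid average."""
    below = (sample <= f).mean(axis=0)   # marginal ECDF evaluated at f(s)
    above = (sample >= f).mean(axis=0)   # marginal survival evaluated at f(s)
    return float(np.minimum(below, above).mean())
```

A curve running through the center of the sample receives a depth near 0.5, while a curve far outside the sample receives a depth near 0.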
2.3 Test statistic
We propose a test statistic, called the Kolmogorov Depth (KD) statistic, for our hypothesis testing problem (2) based on the integrated Tukey depth. The KD statistic measures the outlyingness of each sample from the distribution of the other, and takes the maximum of the two outlyingness measures as its value. This way we can correctly detect differences between the two distributions even when they do not appear mutually outlying from each other under data depth. For example, if one of the distributions is nested inside the other, then the nested distribution will not appear outlying to the other distribution.
Denote by the empirical depths the integrated Tukey depths computed with respect to the empirical distribution of a sample. We start by treating the first sample as the reference and aim to measure the outlyingness of the second sample from it. To do this, we first define two empirical measures of the depth values. Essentially, the first measure rescales the depth values of the reference sample to their standardized ranks, i.e. 1/n for the smallest depth value, 2/n for the second smallest, and so on, where n is the reference sample size. It acts as the empirical cumulative distribution function of the reference depth values evaluated at themselves and thus follows a discrete uniform distribution. The second measure can be considered the empirical cumulative distribution function of the second sample's depth values, evaluated at the reference depth values. Under the null hypothesis in (2), it should also be approximately uniform, so a deviation from uniformity indicates outlyingness of the second sample from the reference. The introduction of these two measures allows us to reduce the problem of comparing two sets of random fields to assessing the difference in distribution between two sets of random variables. The latter can be naturally quantified using the Kolmogorov distance over the observed depth values:
To measure the outlyingness in the opposite direction, we now treat the second sample as the reference. Following the same scheme, we define two analogous empirical measures: these exactly mirror the previous pair, except that the discrete uniform distribution is now supported on the depth values of the second sample and the indicator records the outlyingness of the first sample from the second. We again take the Kolmogorov distance, now over the depth values of the second sample, as the measure of outlyingness.
We define the overall test statistic by taking the maximum of the two distances:
The test statistic attains a level of symmetry that makes the test invariant to the choice of reference distribution. It is strictly non-negative, and it equals 0 only under the null hypothesis in (2). Thus the originally stated hypothesis (1) can be tested by evaluating whether the statistic is significantly greater than 0.
One major difference between our test statistic and the Quality Index in Liu and Singh (1993) is that our test does not depend on a single reference distribution, while the Quality Index requires one of the samples to be used as the reference. Our test computes the outlyingness of the two samples from each other and aggregates the results into one single test. This makes more efficient use of the two samples and enables the test to detect a larger range of alternative hypotheses, such as the nesting situation mentioned above. We discuss the critical values of our statistic in the following section.
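To make the construction concrete, the following sketch implements one plausible version of the two-sided statistic described above: the depth values of both samples are computed with respect to each reference sample in turn, their empirical distribution functions are compared by the Kolmogorov distance, and the maximum of the two distances is returned. The function names and the exact ECDF evaluation points are our own choices, not necessarily those of the paper.

```python
import numpy as np

def _depths(curves, reference):
    """Integrated Tukey depth of each row of `curves` with respect to the
    empirical distribution of `reference` (both are n_curves x d arrays)."""
    out = []
    for f in curves:
        below = (reference <= f).mean(axis=0)
        above = (reference >= f).mean(axis=0)
        out.append(np.minimum(below, above).mean())
    return np.array(out)

def kd_statistic(x, y):
    """Two-sided Kolmogorov-depth statistic: the larger of the two
    Kolmogorov distances between depth-value ECDFs, one per choice of
    reference sample."""
    def one_sided(ref, other):
        d_ref = np.sort(_depths(ref, ref))      # depths of reference in itself
        d_oth = np.sort(_depths(other, ref))    # depths of other sample in reference
        # Compare the two ECDFs at the reference depth values.
        ecdf_ref = np.searchsorted(d_ref, d_ref, side="right") / len(d_ref)
        ecdf_oth = np.searchsorted(d_oth, d_ref, side="right") / len(d_oth)
        return np.max(np.abs(ecdf_ref - ecdf_oth))
    return max(one_sided(x, y), one_sided(y, x))
```

Two samples drawn from the same process yield a small value, while a clear distributional difference pushes the statistic toward 1.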
2.4 Computing critical values
Deriving the asymptotic distribution of our test statistic is nontrivial because the statistic explicitly depends on two sets of non-i.i.d. depth values, which renders standard results on the Kolmogorov-Smirnov test inapplicable. Nevertheless, with n and m denoting the two sample sizes, we conjecture without formal proof that the statistic, rescaled by sqrt(nm/(n+m)), either follows the same limiting distribution as the regular two-sample Kolmogorov-Smirnov statistic, namely the Kolmogorov distribution, or converges to a distribution that can be closely approximated by it. Although we are unable to prove this result in its full generality, we consider two special cases below and show that both conform to the conjecture. Our extensive simulation studies in Section 3 demonstrate convergence in the general case.
We first consider a special case where the reference distribution is fully known and we test whether the second sample is drawn from it. In this case, the first empirical measure in (3) becomes exactly the uniform distribution, and the distance in (5), which serves as the test statistic in this special case, reduces to a one-sample form. Because the second empirical measure is the empirical distribution function of the second sample's depth values, this reduced statistic is exactly the one-sample Kolmogorov-Smirnov statistic for testing their uniformity. Therefore, the conjectured Kolmogorov limit holds in this case.
We further consider another special case where both distributions are unknown but one sample size grows much faster than the other. We can show that the corresponding one-sided Kolmogorov distance converges to the Kolmogorov distribution under the null hypothesis. We encapsulate this result in the following proposition.
Suppose that the above sample-size condition holds; then, under the null hypothesis,
where the limit follows the Kolmogorov distribution.
The proof is deferred to the Appendix.
Generalizing the results of these special cases is challenging. This issue was also noted in Liu and Singh (1993), where the authors conjectured that their two-sample Quality Index asymptotically followed a normal distribution, as its one-sample version does. Their conjecture was only later proven by Zuo and He (2006) after substantial theoretical development. The techniques that emerged from that proof relied heavily on the Quality Index being an expectation, making them largely inapplicable to our context involving suprema. Proving our conjecture would require the development of advanced theoretical machinery that can accommodate the complex dependence structure of the distribution functions of the depth measures. We leave this problem open for independent theoretical research in the future.
In lieu of the conjectured asymptotic distribution, we may consider using permutations to find critical values (Good, 2013). Permutation works well for small samples or sparsely observed functions, but it quickly becomes computationally infeasible for large volumes of data, such as our reconstruction data. For this reason, the conjectured Kolmogorov distribution is more appealing in practice.
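In practice, the two routes to a p-value can be sketched as follows: the conjectured asymptotic route evaluates the Kolmogorov survival function at the rescaled statistic, while the permutation route pools the samples and resplits them at random. The helper names are our own; `scipy.special.kolmogorov` is the survival function of the Kolmogorov distribution.

```python
import numpy as np
from scipy.special import kolmogorov  # survival function of the Kolmogorov law

def asymptotic_pvalue(stat, n, m):
    """Conjectured asymptotic p-value: sqrt(nm/(n+m)) * stat ~ Kolmogorov."""
    return kolmogorov(np.sqrt(n * m / (n + m)) * stat)

def permutation_pvalue(x, y, statistic, n_perm=500, seed=0):
    """Permutation p-value: pool the samples (by rows), resplit at random,
    and compare the observed statistic with the permutation distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    obs = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if statistic(pooled[idx[:n]], pooled[idx[n:]]) >= obs:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)
```

The `(1 + exceed) / (1 + n_perm)` form avoids returning an exact zero p-value from a finite number of permutations.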
3 Simulation Study
Simulation studies are conducted to assess the convergence of the test statistic to its conjectured Kolmogorov limit, as well as the size and power of the test. Each of these properties is evaluated using two-dimensional functional data because our main application considers ensembles of spatial fields. All functional data in the simulations are generated from Gaussian random processes with the Matérn covariance function (Stein, 2012),

C(h) = sigma^2 * (2^(1-nu) / Gamma(nu)) * (h/r)^nu * K_nu(h/r),

where Gamma is the gamma function, K_nu is a modified Bessel function of the second kind, sigma^2 is the marginal variance of the random process, and r and nu are two nonnegative parameters called the range and smoothness. The range parameter, r, governs how quickly the correlation decays between points. The smoothness parameter, nu, determines how smooth sampled functions are in terms of their differentiability.
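Under this setup, fields can be simulated by building the Matérn covariance matrix over the grid and multiplying i.i.d. normals by its Cholesky factor. The sketch below uses the parameterization written above (other conventions rescale the distance by sqrt(2*nu)), adds a small jitter for numerical stability, and uses function names of our own choosing:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern_cov(dists, sigma2=1.0, r=0.3, nu=1.0):
    """Matern covariance: sigma2 * 2^(1-nu)/Gamma(nu) * (h/r)^nu * K_nu(h/r)."""
    h = np.asarray(dists, dtype=float)
    scaled = np.where(h == 0, 1e-12, h) / r        # avoid 0 * inf at h = 0
    cov = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    cov[h == 0] = sigma2                           # exact limit at distance zero
    return cov

def sample_fields(n, grid, sigma2=1.0, r=0.3, nu=1.0, seed=0):
    """Draw n mean-zero Gaussian random fields at the (d x 2) grid locations."""
    rng = np.random.default_rng(seed)
    cov = matern_cov(cdist(grid, grid), sigma2, r, nu)
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(grid)))  # jitter for stability
    return rng.standard_normal((n, len(grid))) @ L.T
```

Larger `r` or `nu` values produce the smoother, more strongly correlated fields discussed in the simulations below.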
In each simulation, we treat the first sample as the baseline and the second as the sample to be varied. For the size and convergence simulations the marginal variance is always set to 1, while the range and smoothness are allowed to vary. For the power simulations, the mean, standard deviation, range, and smoothness are all allowed to vary.
3.1 Convergence to the Kolmogorov distribution

We use simulations to validate the conjectured asymptotic Kolmogorov distribution of our test statistic (6) under the null hypothesis. The main idea is to evaluate how well the permutation distribution of the test statistic is approximated by the Kolmogorov distribution, even at moderate sample sizes. The two samples of functional data are generated with a common mean and standard deviation on the spatial domain. Because the integrated Tukey depth is invariant to the location and scale of functional data, we only vary the range and smoothness of the covariance function: 0.2, 0.3, 0.4, 0.5 and 0.5, 1.0, 1.5, respectively. The number of replicates in each sample is also varied between 25, 50, 75, and 100. We also considered the unbalanced case by fixing the number of replicates in one sample at 75 and allowing the other to vary between 25, 50, 75, and 100. The results were nearly identical to those with balanced sample sizes, so only the balanced case is presented here.
The permutation distribution was constructed by recomputing the test statistic on 500 permutations of the two generated samples. We then calculated the distance between the permutation distribution and the Kolmogorov distribution, as well as the difference between critical values derived from the two distributions at three common significance levels: 0.01, 0.05, and 0.10. Due to the computational cost of constructing permutation distributions, we ran 100 simulations for each combination of sample size, range, and smoothness to obtain the boxplots in Figures 2 and 3.
Figure 2 demonstrates the convergence of the permutation distribution to the Kolmogorov distribution. Even for small sample sizes, the distance between the two distributions is already vanishingly small for smooth data. The largest deviations are observed only when both the range and smoothness are small. This is typically not an issue in practice because functional data are generally preprocessed with a smoothing step, effectively increasing the range and smoothness. In all cases the distance decreases rapidly with increasing sample size, so the convergence applies even to unprocessed noisy data if the sample sizes are large enough.
Figure 3 evaluates the convergence of the two sets of critical values at the significance levels 0.01, 0.05, and 0.10. The figure shows that testing decisions reached under the asymptotic Kolmogorov distribution are generally not biased away from decisions reached under the permutation distribution. Again, a sufficient amount of smoothness or range is required to obtain well-behaved critical values. If the data are not sufficiently smooth, the Kolmogorov distribution tends to have smaller critical values than the corresponding permutation distribution; the size will therefore be slightly inflated by using the Kolmogorov distribution, and so the permutation distribution should be preferred when computationally feasible. Once a sufficient level of smoothness has been reached, the critical values of the permutation distribution become highly comparable with those of the Kolmogorov distribution. The observed differences are minuscule, so any decision reached using the Kolmogorov distribution is likely to be the same as if the permutation distribution were used. With noisy raw data, a sufficiently large sample can begin to compensate for a lack of smoothness or correlation.
3.2 Size and power
Using the same data generating process as in Section 3.1, we evaluate the size of our test using critical values from the asymptotic Kolmogorov distribution and compare it to the size of the Quality Index test. Again only the range, the smoothness, and the two sample sizes are varied. The size under each parameter combination was estimated using 2000 simulations; the results are presented in Table 1.
Our simulations show that even for small samples our test can control the size near the prescribed level if the range or smoothness is sufficiently high. Smoothness and range are in fact more important for controlling size than the number of replicates. Under the noisiest setting, the lowest attained size (0.07) occurs at the largest sample sizes; this is only a moderate improvement over the size (0.15) at the smallest sample sizes, and it is still above the nominal level. If instead the number of replicates is held fixed while the range and smoothness are increased, the size is controlled at the nominal level. As with the convergence simulations, these minimal smoothness conditions are not especially restrictive in practice because functions are typically smoothed before analysis. Moreover, the sizes are stable at the nominal level once the range exceeds a threshold between 0.3 and 0.4 on the spatial domain considered. The Quality Index test appears to inflate the size in nearly all cases compared to our test.
We compare the power of our test and the Quality Index test in detecting changes in the four parameters that govern the underlying Gaussian process in our data generation: the mean, the standard deviation, the range, and the smoothness. We fix the number of replicates in the two samples and sample the functional observations in the baseline from a Gaussian process with fixed parameter values. This setup ensures that the sizes of the two tests are similar (see Table 1) so that their power functions are comparable. To generate the second sample, we let each parameter vary around the baseline values: the mean from -1 to 1 in 0.1 increments, the standard deviation between 0.1 and 2 in 0.05 increments, the range from 0.05 to 1 in 0.05 increments, and the smoothness from 0.1 to 2 in 0.1 increments. This gives a total of 96 alternative models because the parameters are varied individually. The power of each test was then calculated under each of these alternative models using 2000 simulations.
Figure 4 shows the empirical power functions of both tests for each parameter. The two tests are almost equally powerful in detecting mean changes and increases in standard deviation. However, our test shows a strong improvement over the Quality Index test in detecting changes in range, smoothness, and decreases in standard deviation. Two caveats about the power functions should be noted. First, there is a slight advantage to the Quality Index test in the testing of mean, range, and smoothness because it still appears to slightly inflate the size, which can be seen from its power function at the null value of these three parameters. Second, the Quality Index test was not designed to detect decreases in standard deviation because the application for which it was designed considered a drop in standard deviation desirable.
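The empirical power computations described above follow the generic Monte Carlo recipe: simulate many pairs of samples under a fixed alternative and record the rejection rate. The sketch below illustrates the recipe with the classical univariate two-sample KS test standing in for the proposed depth-based test, purely for brevity; the function name and samplers are our own.

```python
import numpy as np
from scipy import stats

def empirical_power(sampler_x, sampler_y, alpha=0.05, n_sim=500, seed=0):
    """Monte Carlo power estimate: the fraction of simulated sample pairs
    for which the two-sample KS test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x, y = sampler_x(rng), sampler_y(rng)
        if stats.ks_2samp(x, y).pvalue < alpha:
            rejections += 1
    return rejections / n_sim
```

Evaluating the function on a grid of alternatives traces out an empirical power curve; at the null value itself the estimate is the empirical size.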
Further simulation results comparing our test against the Functional Anderson-Darling (FAD) test and the rank-based band depth test are available in the supplement. While the band depth test shows considerable power in detecting mean changes, it falls short of the other methods in detecting variance changes. On average, our method maintained the highest power across the different parameters.
The situation where the mean or variance is shifted uniformly over the entire domain of the function may be too simplistic. A more realistic scenario is that the mean, variance, and other aspects of the distribution differ heterogeneously: higher in some regions and lower in others. To study this situation we conduct another set of simulations in which the mean and variance are both allowed to vary non-uniformly over the domain, while the range and smoothness are kept constant. More specifically, we generate the mean and standard deviation of the second sample as two-dimensional sine waves centered about 0 and 1, respectively, and then slowly increase the amplitude of the sine waves to make the two samples deviate more in their parameters. The two sine waves are as follows:
where the amplitude is set to vary from 0.05 to 1 in increments of 0.05 and ⊗ denotes the Kronecker product. We fixed the number of replicates for both samples and again used 2000 simulations per amplitude value to estimate the power.
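A minimal sketch of this construction (our own illustrative form, since the paper's exact sine-wave expressions are not reproduced above) uses a Kronecker product of one-dimensional sine waves to build the two-dimensional mean and standard deviation surfaces:

```python
import numpy as np

def sine_surfaces(amplitude, n_grid=20, freq=2):
    """Two-dimensional sine-wave mean and sd surfaces on an n_grid x n_grid domain.
    The mean surface is centered about 0 and the sd surface about 1; the grid size
    and frequency here are illustrative assumptions."""
    t = np.linspace(0.0, 1.0, n_grid)
    s = np.sin(2.0 * np.pi * freq * t)
    surf = np.kron(s, s).reshape(n_grid, n_grid)      # equivalent to np.outer(s, s)
    mean = amplitude * surf                           # centered about 0
    sd = np.clip(1.0 + amplitude * surf, 0.05, None)  # centered about 1, kept positive
    return mean, sd

mean, sd = sine_surfaces(amplitude=0.5)
```

Increasing `amplitude` makes the second sample's mean and standard deviation deviate more strongly, and heterogeneously, from the null values of 0 and 1.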
Figure 5 shows the power functions of the two tests under heterogeneous mean and standard deviation changes. For detecting mean changes, both tests maintain comparable power, although our test carries more power than the quality index test over certain ranges of mean change. It is worth noting that the power curves in this setting are similar to those under the homogeneous mean change, which indicates no serious power loss when the mean change is heterogeneous. A large difference between the two tests is observed, however, when the standard deviation change is heterogeneous: our test maintains its power while the quality index test loses it.
Further simulations comparing our test, FAD, and RBD in the heterogeneous case are again available in the supplement. FAD is extremely powerful in detecting mean changes, but like the quality index test and RBD it loses all power in detecting heterogeneous variance changes. Our method again shows the highest average power of the four approaches.
4 Evaluating Proxy Influence in Assimilated CFRs
We now apply the proposed statistic to evaluate the influence of proxies on the 2m surface temperature reconstruction by examining the differences between the background and analysis states in PHYDA. In our experiment, the background state consists of a single 100-member ensemble of 2m surface temperature fields that are randomly sampled from a single climate model simulation. For every year of the reconstruction, the analysis state consists of a 100-member ensemble of 2m surface temperatures. We use our statistic to test for distributional differences between the background ensemble and each year's analysis ensemble, thereby testing for proxy influence during each reconstruction year. Formally, this amounts to testing the hypothesis (1) formulated in Section 2.1. We then further subdivide the background and analysis states into 12 regions, corresponding to the five oceans and seven continents, and repeat our analysis on each region separately. Finally, we investigate how correlation between regions may affect the influence of proxies at the regional level.
4.1 Global reconstructions
Figure 8 shows the values of the test statistic over time along with their associated p-values. A larger value of the statistic corresponds to a smaller p-value and greater separation between the distributions of the background and analysis states. The p-values were adjusted by the Benjamini-Yekutieli procedure (Benjamini et al., 2001) to control the false discovery rate at 0.05. The uniformly near-zero p-values strongly indicate that the background and analysis differ significantly in distribution each year, suggesting that the proxies indeed change the distribution of the background and have a material influence on the PHYDA reconstructions. Despite the uniformly small p-values, the magnitudes of the statistic indicate a relatively weak separation between the background and analysis at the beginning, followed by a steadily increasing separation over time until the end of the reconstruction period. The apparent rise in separation occurs because proxy information is introduced sequentially over time: as the reconstruction approaches the present day, more proxies become available for assimilation, and the assimilated fields therefore diverge further from the background.
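The per-year testing pipeline can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a simple Kolmogorov-Smirnov test on ensemble means substitutes for the depth-based statistic, the growing mean shift mimics growing proxy influence, and the Benjamini-Yekutieli step-up rule is implemented directly.

```python
import numpy as np
from scipy import stats

def benjamini_yekutieli(pvals, q=0.05):
    """BY step-up procedure: FDR control at level q under arbitrary dependence."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))            # BY correction factor
    thresh = q * np.arange(1, m + 1) / (m * c_m)
    passed = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])              # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject

rng = np.random.default_rng(1)
background = rng.normal(0.0, 1.0, size=(100, 30))      # 100-member background ensemble
pvals = []
for year in range(20):
    shift = 0.05 * year                                # proxy influence grows over time
    analysis = rng.normal(shift, 1.0, size=(100, 30))  # that year's analysis ensemble
    # stand-in statistic: KS test on the members' spatial means
    pvals.append(stats.ks_2samp(background.mean(axis=1), analysis.mean(axis=1)).pvalue)
reject = benjamini_yekutieli(pvals, q=0.05)
```

The rejection mask then flags the reconstruction years in which the analysis ensemble differs significantly from the background at the chosen false discovery rate.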
4.2 Regional variation of proxy influence
Analysis of the global reconstruction is important for establishing the strength of proxy influence at the global level and for confirming the upward trend in proxy influence. A natural next step is to investigate how these effects propagate down to the regional level, namely how proxies affect the temperature reconstruction at the continental and oceanic scale. As shown in Figure 11, proxies are not collected uniformly across regions, so a weaker proxy influence might be expected in poorly sampled regions than in densely sampled ones. We therefore use our method to investigate the local influence of proxies.
We divide the globe into 12 regions corresponding to the five oceans and seven continents, as in Figure 11. Within each of the twelve regions we apply our test to evaluate the difference between the background and analysis states over the full reconstruction period. Analogous to the global study in Section 4.1, the progression of the test statistic over time for all regions is summarized in Figure 12. It is surprising to see that the statistic shares a consistent increasing trend over all regions, even those with scarce proxies. Intuitively, we would expect the increasing trend to hold only for regions with abundant proxies, because gradually introduced proxy information in those regions should make the analysis states increasingly distinct from the background. The complex dependence structure of the climate system, however, may be mitigating these regional deficits. This counter-intuitive result motivates us to study whether long-range dependence in the background climate states helps to stabilize the reconstruction in data-sparse regions.
We investigate this conjecture using the correlation maps in Figure 14, in which the value at each location is the strongest correlation between the temperature time series at that location and the series at every temporally available proxy location, shown for the representative years 1000, 1400, and 1800 CE. These maps indicate the strength of the spatial diffusion of proxy information over time. As more proxies are added towards the present, more of the globe becomes highly correlated with the proxy locations.
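A sketch of how such a map can be computed is shown below. This is our own reading of the construction (taking "strongest correlation" to mean largest absolute correlation); the function and variable names, and the synthetic data, are hypothetical.

```python
import numpy as np

def max_proxy_correlation(temp, proxy_idx):
    """For every grid cell, the strongest absolute correlation between its
    temperature time series and the series at any proxy-bearing cell.
    temp: array of shape (n_years, n_cells); proxy_idx: proxy cell indices."""
    C = np.corrcoef(temp.T)                  # cell-by-cell correlation matrix
    return np.abs(C[:, proxy_idx]).max(axis=1)

rng = np.random.default_rng(2)
temp = rng.normal(size=(200, 40))            # 200 years, 40 synthetic grid cells
proxy_idx = np.array([3, 17, 25])            # hypothetical proxy locations
corr_map = max_proxy_correlation(temp, proxy_idx)
```

Note that proxy cells correlate perfectly with themselves, so densely sampled regions show high values by construction, consistent with the discussion of densely sampled regions above.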
The maps in Figure 14 are helpful in understanding why some regions such as the Pacific have few proxies but show a high degree of divergence between their background and analysis states. The Pacific is strongly correlated with nearby continental regions such as North America, which has many proxies, due to the El Niño-Southern Oscillation phenomenon. Conversely, Australia, which has few proxies and weak proxy correlations, shows a correspondingly low degree of divergence between its background and analysis states. Regions with densely sampled proxies tend to show high proxy correlation due to proximity with their own proxies, and also a high degree of background-analysis separation, as expected.
The Arctic and Southern oceans represent two anomalies with regard to their apparent proxy information. Most notably, around 1600 CE the Arctic ocean experiences a strong trend reversal in the test statistic just as other regions are experiencing trend increases. This runs counter to the fact that both the number of proxies and the proxy-point correlations shown in Figure 14 are increasing in the Arctic over this period. One possible explanation is that Arctic tree rings drop out of the reconstruction prior to 1600 CE; that is, tree rings were used as proxies only for the reconstruction period after 1600 CE, and before this only ice core proxies were used. It may be that high northern latitude tree rings are informative for local land areas but not for the adjacent Arctic ocean. Because the δ18O of ice cores has its source regions over the ocean, ice cores could potentially be more informative than tree rings for reconstructing climate there (Steiger et al., 2017).
Conversely, the Southern ocean has relatively large values of the test statistic when the correlation information in Figure 14 would lead us to believe they should be much smaller. The Southern ocean has no local proxies and perhaps the weakest overall proxy-point correlation strength, yet it experiences a strong and significant divergence between its background and analysis states. This unexpectedly strong divergence may be due to the large number of moderate correlations observed near the Pacific, along the coast of Antarctica, and off the southern tip of South America. The cumulative effect of these moderate correlations may lead to the results in Figure 12.
5 Discussion and Conclusions
Motivated by the newly available PHYDA reconstruction product, we developed a non-parametric statistical test to compare the distributions of the ensembles in the background and analysis states. The PHYDA data product was derived using a DA scheme that merges information from climate model simulations and climate proxies, the latter of which is expected to exert its due influence on the derived analysis fields. However, the nature of the DA approach and the variation of proxy information back in time make it difficult to assess the degree to which the proxies influence the final analysis product. By treating each ensemble member in the background and analysis states as a continuous two-dimensional surface, we constructed a test statistic based on functional data depth to measure the difference in distribution between the ensembles in the two states.
Due to the nonparametric nature of functional data depth, our method requires no distributional or model assumptions on the observations. Nor does it require that the curves be square integrable, second-order stationary, or even strictly continuous. We showed numerically that the distribution of our test statistic converges to, or at least is well approximated by, the Kolmogorov distribution. Additional simulations in the supplement support the same conclusion even when the spatial data come from a non-Gaussian process (see Figures 3 and 4 in the supplement). We also demonstrated that the size of our test is well controlled near the nominal level, even under moderate sample sizes, and that its power is highly competitive with the quality index test of Liu and Singh (1993).
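One plausible construction of such a depth-based statistic, a sketch under our own assumptions rather than the paper's exact definition, computes a modified band depth of every surface (flattened to a vector over the grid) with respect to the pooled sample and then takes a Kolmogorov-Smirnov distance between the two samples' depth distributions:

```python
import numpy as np
from scipy import stats

def mbd(curves):
    """Modified band depth (bands from pairs of curves), direct O(n^2) version."""
    n = curves.shape[0]
    depth = np.zeros(n)
    n_pairs = n * (n - 1) // 2
    for j in range(n):
        for k in range(j + 1, n):
            lo = np.minimum(curves[j], curves[k])
            hi = np.maximum(curves[j], curves[k])
            inside = (curves >= lo) & (curves <= hi)   # all curves vs this band
            depth += inside.mean(axis=1)               # proportion of domain inside
    return depth / n_pairs

def depth_ks_statistic(X, Y):
    """Depths w.r.t. the pooled sample, then a two-sample KS distance."""
    pooled = np.vstack([X, Y])
    d = mbd(pooled)
    return stats.ks_2samp(d[:len(X)], d[len(X):]).statistic

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(30, 25))                # 30 curves on 25 grid points
Y = rng.normal(0.0, 1.0, size=(30, 25))
stat_null = depth_ks_statistic(X, Y)
```

Because extreme curves receive low depth, changes in spread as well as in location shift the depth distribution, which is consistent with the test's sensitivity to variance changes described above.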
Our results provide strong evidence of a clear divergence between the background and analysis states in PHYDA. The degree of separation, however, depends greatly on geographic location and time period. An overall upward trend in proxy influence is seen, and it is generally maintained even when the globe is subdivided into oceanic and continental sub-regions. With the notable exception of the Arctic, these findings are consistent with the fact that proxy information steadily increases as the reconstruction period approaches the present day. This confirms that increasing proxy information exerts a commensurate influence on the assimilated reconstructions, and suggests that the influence of the model prior is minimized as proxy networks become considerably denser, thereby placing less emphasis on which model is used to form the prior. It also suggests that collecting more proxies further back in time would improve reconstruction skill over all parts of the Common Era.
We have also found that, despite the stark imbalance in proxy density across geographic regions, most regions exhibit an increasing separation between the background and analysis states. The mitigating effect in proxy-deficient regions is mostly attributable to the long-range dependence structure that proxies and temperatures often display. Some regions, such as the Pacific Ocean and South America, have very few local proxies, but because of their strong overall correlation with other regions they still benefit from proxies collected remotely. These results suggest that the desirable addition of proxy information to data-assimilated reconstructions extends beyond the immediate regions where proxies are densely sampled. This lends credence to the idea that geographic regions outside of dense proxy sampling may still attain some reconstruction skill, particularly in the last several centuries before the present.
Until now, testing for significant proxy influence in DA reconstructions has not been conducted. Optimally adding proxy information is one of the principal qualities of a DA-based reconstruction, so knowing the cumulative effect of adding proxies is of fundamental importance, particularly as DA becomes increasingly popular (Franke et al., 2017; Steiger et al., 2018; Tardif et al., 2019). Our test provides a direct and powerful way to quantify the information added by proxies to the analysis states, based on changes in their distribution from the background.
In addition to these important results for assessing assimilated reconstruction products, our test is much more broadly applicable. Its generic formulation allows it to be applied to any functional data that the depth function can handle, including curves and higher-dimensional functions. Each observation can also be multivariate valued so long as both samples map to the same subspace; this only changes the integration to be over a multivariate depth instead of a univariate depth. Our method can also be useful for comparing image data in medical studies, and for meeting the increasing demand for comparing simulated climate from different climate models and comparing simulated climate to observations.
PHYDA is publicly available at the Zenodo data repository as NetCDF4 files: https://doi.org/10.5281/zenodo.1154913. The 100 member ensembles of PHYDA used herein are available at: http://clifford.ldeo.columbia.edu/nsteiger/recon_output/phyda_ens/.
Appendix A
Let be a distribution on and suppose and are two i.i.d. samples from . Let and be defined as before, with each converging in distribution to , the distribution over . Letting , we then have
because . By the enforced uniformity of we get that , and so we obtain the following upper bound:
The second term is simply a one-sample Kolmogorov-Smirnov statistic, so the whole quantity converges to the Kolmogorov distribution. ∎
A.2 Power comparisons with other tests
While the quality index is the closest method to ours, it is not the only alternative for comparing the distributions of functions. One particularly powerful method is the functional Anderson-Darling (FAD) test of Pomann et al. (2016). In their paper, the authors demonstrated superior power over all other functional distribution tests, with the exception of the rank-based band depth test (RBD) of López-Pintado and Romo (2009), which was not included in their comparison. We compare our method against the quality index (QI), FAD, and RBD under the same simulation settings as in the main paper.
The FAD method is extremely powerful against changes in the mean of the data; however, compared with the depth-based methods, it is noticeably less powerful against variance changes (Figure 15). Under heterogeneous changes (Figure 16), our test is still the only one to maintain its power in detecting heterogeneous variance changes.
A.3 Convergence under a non-Gaussian process
Because our test does not depend on any parametric assumptions about the data, we examined how convergence, size, and power are maintained when the data come from a markedly non-Gaussian process. For these simulations we used the same settings as in the main paper except that the functions were generated from a multivariate t distribution instead of a multivariate Gaussian distribution. We refer to these functions as coming from a t-process.
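A multivariate t draw can be generated by dividing a Gaussian draw by the square root of an independent chi-squared mixing variable. The sketch below is a minimal illustration; the covariance model, grid, and function names are our own assumptions, with the 3 degrees of freedom taken from the power simulations described later in this appendix.

```python
import numpy as np

def sample_t_process(n, mean, cov, df, rng):
    """n multivariate t draws: Gaussian draws scaled by sqrt(df / chi2_df)."""
    z = rng.multivariate_normal(np.zeros(len(mean)), cov, size=n)
    w = rng.chisquare(df, size=n) / df          # chi2 mixing variable
    return mean + z / np.sqrt(w)[:, None]

rng = np.random.default_rng(5)
p = 30
t = np.linspace(0.0, 1.0, p)
# simple exponential covariance on a 1-D grid as a stand-in
cov = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.3)
curves = sample_t_process(200, np.zeros(p), cov, df=3, rng=rng)
```

With only 3 degrees of freedom, the mixing variable occasionally becomes very small and produces strongly outlying curves, the heavy-tailed behavior this appendix studies.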
Under a t-process, convergence is observed to be slower than for the corresponding Gaussian process. The critical values, however, are almost immediately unbiased versus their asymptotic counterparts. Together these indicate that the distribution of the statistic is harder to approximate when the data are heavy tailed, but that this has little practical impact because decisions regarding significance are unaffected by using the asymptotic distribution.
A.4 Size under a non-Gaussian process
We next examined the size of the test under t-process data. Size is controlled at roughly the same levels as when Gaussian process data were used. This is because the critical values of the permutation distribution and the asymptotic distribution are in near agreement, even at small sample sizes. The same pattern of needing sufficient range or smoothness to achieve the nominal level is still observed.
A.5 Power comparisons under a non-Gaussian process
Finally, we considered power under homogeneous and heterogeneous parameter changes with non-Gaussian data (a t-process). The same settings used to test power in the main paper's simulations were again used to generate data. As in the convergence and size simulations, the sampled functions were generated from a t-process with 3 degrees of freedom. The power curves (Figures 19 and 20) are generally flatter than the corresponding power curves under a Gaussian process, but the relationships between methods remain the same. FAD still dominates in detecting changes in the mean, and KD, MBD, and QI dominate in detecting changes in the standard deviation. All methods lose considerable power in detecting range and smoothness changes. Notably, the FAD test ran into computational issues when estimating the functional principal components because the t-process frequently generates very outlying curves.
A.6 Comparing our test with the quality index on PHYDA
The preceding power plots show that there is no clearly dominating method between our test and the quality index across all of the parameters in the Gaussian and non-Gaussian simulations. The quality index clearly detects mean differences better, and our test clearly detects standard deviation differences better. This is particularly true for heterogeneous mean and variance changes under a t-process (Figure 20), i.e., the more realistic setting. We argue that because the quality index fails to detect heterogeneous changes in the variance, it misses the crucial finding in our data analysis, namely that the analysis ensembles become more distinct from the background over time. These changes appear to be primarily driven by a downward trend in the variance of the analysis state (see Figure 23).
As can be seen in Figure 23, the average difference between the background and analysis remains relatively constant over time. Because the average differences are, even if only slightly, different from 0, the quality index test has no trouble detecting a significant difference. The real differentiator is how the ratio of the variances changes over time. With the exception of the very end of the reconstruction, the average variance ratio increases almost monotonically. This pattern reveals that one of the primary effects of including additional proxies is a reduction in uncertainty. This near-monotonic increase in uncertainty reduction is largely reflected in the associated time series of K values (Figure 26). If we compare this against the values of the quality index over time (Figure 26), we can see that the quality index does not register this aspect of the distribution change: it generally only follows the trend of the mean differences, while our statistic follows both.
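These two diagnostics can be computed directly from the ensembles. The sketch below (illustrative sizes and names, not the paper's code) returns the grid-averaged mean difference and background-to-analysis variance ratio for one reconstruction year:

```python
import numpy as np

def ensemble_diagnostics(background, analysis):
    """Grid-averaged mean difference and variance ratio between two
    (members x grid cells) ensembles; a simple diagnostic, not the test statistic."""
    mean_diff = (analysis.mean(axis=0) - background.mean(axis=0)).mean()
    var_ratio = (background.var(axis=0, ddof=1) / analysis.var(axis=0, ddof=1)).mean()
    return mean_diff, var_ratio

rng = np.random.default_rng(4)
background = rng.normal(0.0, 1.0, size=(100, 50))
analysis = rng.normal(0.0, 0.5, size=(100, 50))        # proxies shrink the spread
mean_diff, var_ratio = ensemble_diagnostics(background, analysis)
```

A variance ratio above 1 together with a near-zero mean difference reproduces the pattern in Figure 23: the dominant effect of assimilating proxies is a reduction in ensemble spread.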
- The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29 (4), pp. 1165–1188.
- Common functional principal components. Annals of Statistics 37 (1), pp. 1–34.
- Wavelets and field forecast verification. Monthly Weather Review 125, pp. 1329–1341.
- Challenges and perspectives for large-scale temperature reconstructions of the past two millennia. Reviews of Geophysics 55 (1), pp. 40–96.
- New insights on permutation approach for hypothesis testing on functional data. Advances in Data Analysis and Classification 8 (3), pp. 339–356.
- On depth measures and dual statistics: a methodology for dealing with general data. Journal of Multivariate Analysis 100 (4), pp. 753–766.
- Trimmed means for functional data. Test 10 (2), pp. 419–440.
- Reconstructing late Holocene North Atlantic atmospheric circulation changes using functional paleoclimate networks. Climate of the Past 13 (11), pp. 1593–1608.
- Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media.
- The role of forcing and internal dynamics in explaining the Medieval Climate Anomaly. Climate Dynamics 39 (12), pp. 2847–2866.
- Statistical paleoclimate reconstructions via Markov random fields. The Annals of Applied Statistics 9 (1), pp. 324–352.
- The Last Millennium Climate Reanalysis Project: framework and first results. Journal of Geophysical Research: Atmospheres 121, pp. 6745–6764.
- Two sample tests in functional data analysis starting from discrete data. Statistica Sinica 17, pp. 1511–1531.
- Comparing spatial predictions. Technometrics 53, pp. 414–425.
- Estimation of the mean of functional time series and a two sample problem. Journal of the Royal Statistical Society: Series B 75, pp. 103–122.
- High-resolution palaeoclimatology of the last millennium: a review of current status and future prospects. The Holocene 19 (1), pp. 3–49.
- Evaluation of proxy-based millennial reconstruction methods. Climate Dynamics 31 (2–3), pp. 263–281.
- Defining spatial comparison metrics for evaluation of paleoclimatic field reconstructions of the Common Era. Environmetrics 23, pp. 394–406.
- Comparison between spatiotemporal random processes and application to climate model data. Environmetrics 27, pp. 267–279.
- A quality index based on data depth and multivariate rank test. Journal of the American Statistical Association 88, pp. 252–260.
- On a notion of data depth based on random simplices. Annals of Statistics 18 (1), pp. 405–414.
- On the concept of depth for functional data. Journal of the American Statistical Association 104, pp. 718–734.
- Revisiting climate region definitions via clustering. Journal of Climate 22, pp. 1787–1800.
- Global-scale temperature patterns and climate forcing over the past six centuries. Nature 392, pp. 779–787.
- Information from paleoclimate archives. In Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change, T. F. Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen, J. Doschung, A. Nauels, Y. Xia, V. Bex, and P. M. Midgley (Eds.), pp. 383–464.
- General notions of depth for functional data. arXiv preprint arXiv:1208.1981.
- Integrated depth for functional data: statistical properties and consistency. European Series in Applied and Industrial Mathematics: Probability and Statistics 20, pp. 95–130.
- Extremal depth for functional data and applications. Journal of the American Statistical Association 111, pp. 1705–1714.
- Climate variability and change since 850 CE: an ensemble approach with the Community Earth System Model. Bulletin of the American Meteorological Society 97 (5), pp. 735–754.
- Detecting signals in fMRI data using powerful FDR procedures. Statistics and Its Interface 1, pp. 23–32.
- A two sample distribution-free test for functional data with application to a diffusion tensor imaging study of multiple sclerosis. Journal of the Royal Statistical Society: Series C (Applied Statistics) 65 (3), pp. 395–414.
- Functional data analysis. Springer.
- Nonparametric hypothesis testing for a spatial signal. Journal of the American Statistical Association 97, pp. 1122–1140.
- Reconstructing Earth's surface temperature over the past 2000 years: the science behind the headlines. Wiley Interdisciplinary Reviews: Climate Change 7 (5), pp. 746–771.
- Climate models as a test bed for climate reconstruction methods: pseudoproxy experiments. Wiley Interdisciplinary Reviews: Climate Change 3 (1), pp. 63–77.
- Spatial interpolation of surface air temperatures using artificial neural networks: evaluating their use for downscaling GCMs. Journal of Climate 13, pp. 886–895.
- Likelihood ratio tests for dependent data with applications to longitudinal and functional data analysis. Scandinavian Journal of Statistics 41, pp. 932–949.
- Assimilation of time-averaged pseudoproxies for climate reconstruction. Journal of Climate 27, pp. 426–441.
- A reconstruction of global hydroclimate and dynamical variables over the Common Era. Scientific Data 5, 180086.
- Climate reconstruction using data assimilation of water isotope ratios from ice cores. Journal of Geophysical Research: Atmospheres 122 (3), pp. 1545–1568.
- Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.
- Last Millennium Reanalysis with an expanded proxy database and seasonal proxy modeling. Climate of the Past 15 (4), pp. 1251–1273.
- Piecing together the past: statistical insights into paleoclimatic reconstructions. Quaternary Science Reviews 35, pp. 1–22.
- Intraseasonal interactions between temperature and vegetation over the boreal forests. Earth Interactions 11, pp. 1–30.
- Statistical inferences for functional data. Annals of Statistics 35, pp. 1052–1079.
- Two sample inference for the second-order property of temporally dependent functional data. Bernoulli 21, pp. 909–929.
- On the limiting distributions of multivariate depth-based rank sum statistics and related tests. Annals of Statistics 34 (6), pp. 2879–2896.
- General notions of statistical depth function. Annals of Statistics 28 (2), pp. 461–482.