Universal Convergence of Kriging

10/18/2017 ∙ by Wenjia Wang, et al. ∙ Georgia Institute of Technology 0

Kriging based on Gaussian random fields is widely used in reconstructing unknown functions. The kriging method has pointwise predictive distributions which are computationally simple. However, in many applications one would like to predict for a range of untried points simultaneously. In this work we obtain some error bounds for the (simple) kriging predictor under the uniform metric. It works for a scattered set of input points in an arbitrary dimension, and also covers the case where the covariance function of the Gaussian process is misspecified. These results lead to a better understanding of the rate of convergence of kriging under the Gaussian or the Matérn correlation functions, the relationship between space-filling designs and kriging models, and the robustness of the Matérn correlation functions.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Kriging is a widely used methodology to reconstruct functions based on their scattered evaluations. Originally, kriging was introduced to geostatistics by Matheron (1963). Later, it has been applied to computer experiments (Sacks et al., 1989)

, machine learning

(Rasmussen, 2006)

, small area estimation from survey data

(Rao and Molina, 2015)

, and other areas. With kriging, one can obtain an interpolant of the observed data, that is, the predictive curve or surface goes through all data points. Conventional regression methods, like the linear regression, the local polynomial regression

(Fan and Gijbels, 1996) and the smoothing splines (Wahba, 1990), do not have this property. It is suitable to use interpolation in spatial statistics and machine learning when the random noise of the data is negligible. The interpolation property is particularly helpful in computer experiments, in which the aim is to construct a surrogate model for a deterministic computer code, such as a finite element solver.

A key element of kriging prediction is the use of conditional inference based on Gaussian processes. At each untried point of the design region (i.e., domain for the input variables), the conditional distribution of a Gaussian process is normal with explicit mean and variance. The pointwise confidence interval of the kriging predictor is then constructed using this conditional distribution. In many applications, it is desirable to have a joint confidence region of the kriging predictor over a continuous set of the input variables such as an interval or rectangular region. The pointwise confidence interval for each design point cannot be amalgamated over the points in the design region to give a confidence region/limit with guaranteed coverage probability, even asymptotically. To address this question, it would be desirable to have a theory that gives good bounds on the worst (i.e., maximum) error of the kriging predictor over the design region. This bound can be useful in the construction of confidence regions with guaranteed coverage property, albeit somewhat conservatively.

In this work, we derive error bounds of the (simple) kriging predictor under a uniform metric. The predictive error is bounded in terms of the maximum pointwise predictive variance of kriging. A key implication of our work is to show that the overall predictive performance of a Gaussian process model is tied to the smoothness of the underlying correlation function as well as the space-filling property of the design (i.e., collection of the design points). This has two major consequences. First, we show that a less smooth correlation function is more robust in prediction, in the sense that prediction consistency can be achieved for a broader range of true correlation functions, while a smoother correlation function can achieve a higher rate of convergence provided that it is no smoother than the true correlation. Second, these error bounds are closely related to the fill distance, which is a space-filling property of the design. This suggests that it makes a good design by minimizing its fill distance. The theory of radial basis function approximation (Wendland, 2004) and a maximum inequality for Gaussian processes (Adler and Taylor, 2009) are employed as key tools in our technical development.

This paper is organized as follows. In Section 2, we review the mathematical foundation of simple kriging and state the objectives of this paper. In Section 3, we present our main results on the uniform error bounds for kriging predictors. Comparison with existing results in the literature is given in Section 3.3. Some simulation studies are presented in Section 4, which confirm our theoretical analysis. Concluding remarks and discussion are given in Section 5. Appendix A includes some necessary mathematical tools. Appendix B contains the proof of Theorem 1, the main theorem of this work.

2 Preliminaries and motivation

In Sections 2.1 and 2.2, we review the kriging method and introduce some proper notation. In Section 2.3, we state the primary goal of our work.

2.1 Review on the simple kriging method

Let be a Gaussian process on . In this work, we suppose that has mean zero and is stationary, i.e., the covariance function of depends only on the difference between the two input variables. Specifically, we denote

for any , where is the variance and is the correlation function. The correlation function should be positive definite and satisfy . In particular, we consider two important families of correlation functions. The isotropic Gaussian correlation function is defined as


with some , where denotes the Euclidean norm. The isotropic Matérn correlation function (Santner et al., 2003; Stein, 1999) is defined as


where and is the modified Bessel function of the second kind. The parameter is often called the smoothness parameter, because it determines the smoothness of the Gaussian process (Cramér and Leadbetter, 1967).

Suppose that we have observed , in which are distinct points. We shall use the terminology in design of experiments (Wu and Hamada, 2011) and call the design points, although in some situations (e.g., in spatial statistics and machine learning) these points are observed without the use of design. In this paper, we do not assume any (algebraic or geometric) structure for the design points . They are called scattered points in the applied mathematics literature.

The aim of (simple) kriging is to predict at an untried based on the observed data , which is done by calculating the conditional distribution. It follows from standard arguments (Santner et al., 2003; Banerjee et al., 2004) that, conditional on ,

is normally distributed, with


where and .

The conditional expectation is a natural predictor of using , because it is the best linear predictor (Stein, 1999; Santner et al., 2003). It is worth noting that a nice property of the Gaussian process models is that the predictor (2.3) has an explicit expression, which explains why kriging is so popular and useful.

The above simple kriging method can be extended. Instead of using a mean zero Gaussian process, one may introduce extra degrees of freedom by assuming that the Gaussian process has an unknown constant mean. More generally, one may assume the mean function is given by a linear combination of known functions. The corresponding methods are referred to as ordinary kriging and universal kriging, respectively. A standard prediction scheme is the best linear unbiased prediction

(Santner et al., 2003; Stein, 1999). For the ease of mathematical treatment, we only consider simple kriging in this work. The convergence theory for ordinary and universal kriging requires separate developments. Further discussions are deferred to Section 5.

2.2 Kriging interpolant

The conditional expectation in (2.3) defines an interpolation scheme. To see this, let us suppress the randomness in the probability space and then becomes a deterministic function, often called a sample path. It can be verified that, as a function of , in (2.3) goes through each .

The above interpolation scheme can be applied to an arbitrary function . Specifically, given design points and observations , we define the kriging interpolant by


where and . This interpolation scheme is also refered to as the radial basis function interpolation (Wendland, 2004). The only difference between (2.5) and (2.3) is that we replace the Gaussian process by a function here. In other words,


As mentioned in Section 2.1, the conditional expectation is a natural predictor of . A key objective of this work is to derive a uniform bound of the predictive error of the kriging method, given by , which is equal to almost surely.

In practice, is usually unknown. Thus it is desirable to develop a theory that also covers the cases with misspecified correlations. In this work, we suppose that we use another correlation function for prediction. We call the true correlation function and the imposed correlation function. Under the imposed correlation function, the kriging interplant of the underlying Gaussian process becomes . In this situation, the interpolant cannot be interpreted as the conditional expectation. With an abuse of terminology, we still call it a kriging predictor.

2.3 Goal of this work

Our aim is to study the approximation power of the kriging predictor. Specifically, we are interested in bounding the maximum prediction error over a region ,


in a probabilistic manner, where is the region of interest, also called the experimental region, and .

Our obtained results on the error bound in (2.7) can be used to address or answer the following three questions.

First, the quantity (2.7) captures the worst case prediction error of kriging. In many practical problems, we are interested in recovering a whole function rather than predicting for just one point. Therefore, obtaining uniform error bounds are of interest because they provide some insight on how we can modify the pointwise error bound to achieve a uniform coverage.

Second, we study the case of misspecified correlation functions. This address a common question in kriging when the true correlation function is unknown: how to gain model robustness under a misspecified correlation function and how much efficiency loss is incurred.

Third, our framework allows the study of an arbitrary set of design points (also called scattered points). Thus our results can facilitate the study of kriging with fixed or random designs. In addition, our theory can be used to justify the use of space-filling designs (Santner et al., 2003), in which the design points spread (approximatedly) evenly in the design region.

3 Uniform error bounds for kriging

This section contains our main theoretical results on the prediction error of kriging.

3.1 Error bound in terms of the power function

It will be shown that, the predictive variance (2.4) plays a curial role on the prediction error, when the true correlation function is known, i.e., . To incorporate the case of misspecified correlation functions, we define the power function as


where , and .

The statistical interpretation of the power function is evident. From (2.4) it can be seen that, if , the power function is the kriging predictive variance for a Gaussian process with . Clearly, we have .

To pursue a convergence result under the uniform metric, we define


We now state the main results on the error bounds for kriging predictors. Recall that the prediction error under the uniform metric is given by (2.7).

The results depend on some smoothness conditions on the imposed kernel. Given any function , let

be its Fourier transform. According to the inversion formula in Fourier analysis,

is the spectral density of the stationary process if is continuous and integrable on .

Condition 1.

The kernels and are continuous and integrable on , satisfying


In addition, there exists , such that


Now we are able to state the main theorem of this paper. Recall that is the variance of .

Theorem 1.

Suppose Condition 1 holds, and the design set is dense enough in the sense that defined in (3.2) is no more than some given constant . Then for any , with probability at least , the kriging predictive error has the upper bound


Here the constants depend only on and .

Theorem 1 presents an upper bound on the maximum prediction error of kriging. This answers the first question posed in Section 2.3. We will give more explicit error bounds in terms of the design and the kernel in Section 3.2. Theorem 1 can also be used to study the case of misspecified correlation functions, provided that condition (3.3) is fulfilled. Condition (3.3) essentially requires that the imposed correlation is no smoother than the true correlation function . Theorem 1 can also be used to address the third question posed in Section 2.3. Note that the right side of (3.5) is a deterministic function depending on the design, and is decreasing in if is large enough. Therefore, it is reasonable to consider designs which minimize . Such a construction depends on the specific form of . In Section 3.2, we will further show that, by maximizing certain space-filling measure, one can arrive at the optimal rate of convergence for a broad class of correlation functions.

From Theorem 1, we also observe that the constant in (3.3) determines the decay rate of the maximum prediction error. In other words, the maximum prediction error appears more concentrated around its mean when the imposed kernel is closer to the true correlation function. Note that condition (3.4

) requires a moment condition on the spectral density, which is fulfilled for any Matérn or Gaussian kernel.

The non-asymptotic upper bound in Theorem 1 implies some asymptotic results which are of traditional interests in spatial statistics and related areas. For instance, suppose we adopt a classic setting of fixed-domain asymptotics (Stein, 1999) in which the probabilistic structure of and the kernel function are fixed, and the number of design points increases so that tends to zero. Corollary 1 is an immediate consequence of Theorem 1, which shows the weak convergence and convergence of the maximum prediction error.

Corollary 1.

For fixed and , we have the following asymptotic results


for any , as .


Theorem 1 implies (3.6) directly. For (3.7), it follows from

where the inequality follows from Theorem 1. ∎

We believe that (3.6) and (3.7) are the full convergence rate because from (A.1) in Appendix A.1 we can see that the convergence rate of the radial basis approximation for deterministic functions in the reproducing kernel Hilbert space is and these two rates are nearly at the same order of magnitude, expect for a logarithmic factor. This is reasonable because the support of a Gaussian process is typically larger than the corresponding reproducing kernel Hilbert space (van der Vaart and van Zanten, 2008). As said earlier in this section, if ,

is the supremum of the pointwise predictive standard deviation. Thus Corollary

1 implies that, if is known, the predictive error of kriging under the uniform metric is not much larger than its pointwise error.

3.2 Error bounds in terms of the fill distance

Our next step is to find error bounds which are easier to interpret and compute than that in Theorem 1. To this end, we wish to find an upper bound of , in which the effects of the design and the kernel can be made explicit and separately. This step is generally more complicated, but fortunately some upper bounds are available in the literature, especially for the Gaussian and the Matérn kernels. These bounds are given in terms of the fill distance, which is a quantity depending only on the design . Given the experimental region , the fill distance of a design is defined as


Clearly, the fill distance quantifies the space-filling property (Santner et al., 2003) of a design. A design having the minimum fill distance among all possible designs with the same number of points is known as a minimax distance design (Johnson et al., 1990).

The upper bounds of in terms of the fill distance for Gaussian and Matérn kernels are given in Lemmas 1 and 2, respectively.

Lemma 1 (Wendland, 2004, Theorem 11.22).

Let ; be a Gaussian kernel given by (2.1). Then there exist constants depending only on and the scale parameter in (2.1), such that provided that .

Lemma 2 (Wu and Schaback, 1993, Theorem 5.14).

Let be compact and convex with a positive Lebesgue measure; be a Matérn kernel given by (2.2) with the smoothness parameter . Then there exist constants depending only on and the scale parameter in (2.2), such that provided that .

Using the upper bounds of given in Lemmas 1 and 2, we can further deduce error bounds of the kriging predictor in terms of the fill distance defined in (3.8). We demonstrate these results in Examples 1-3.

Example 1.

Here we assume is a Matérn kernel in (2.2) with smoothness parameter . It is known that

where is the scale parameter in (2.2). See, for instance, Wendland (2004); Tuo and Wu (2015). Suppose is a Matérn correlation function with smoothness . It can be verified that Condition 1 holds if and only if . Therefore, if , we can invoke Lemma 2 and Theorem 1 to obtain that the kriging predictor converges to the true Gaussian process with a rate at least as tends to zero. It can be seen that the rate of convergence is maximized at . In other words, if the true smoothness is known a priori, one can obtain the greatest rate of convergence.

Example 2.

Suppose is the same as in Example 1, and is a Gaussian correlation function in (2.1), with spectral density (Santner et al., 2003) where is the scale parameter in (2.1). Then Condition 1 holds for any choice of . Then we can invoke Lemma 2 and Theorem 1 to obtain the same rate of convergence as in Example 1.

Example 3.

Suppose , and is a Gaussian kernel in (2.1). Then we can invoke Lemmas 1 and Theorem 1 to obtain the rate of convergence for some constant . Note that this rate is faster than the rates obtained in Examples 1-3, because it decays faster than any polynomial of . Such a rate is known as a spectral convergence order (Xiu, 2010; Wendland, 2004).

The upper bounds in Lemmas 1 and 2 explain more explicitly how the choice of designs can affect the prediction performance. Note that in Examples 1-3, the upper bounds are increasing in . This suggests that we should consider the designs with a minimum value, which are known as the maximin distance designs (Johnson et al., 1990). Therefore our theory shows that the maximin distance designs enjoy nice theoretical guarantees for all Gaussian and Matérn kernels. In contrast with the designs minimizing as discussed after Theorem 1, it would be practically beneficial to use the maximin distance designs because they can be constructed without knowing which specific kriging model is to be used.

3.3 Comparison with some existing results

We make some remarks on the relationship between our results and some existing results. In Table 1, we list some related results in the literature concerning the prediction of some underlying function, which is either a realization of a Gaussian process or a determinisitc function in a reproducing kernel Hilbert spaces. It can be seen from Table 1 that only Buslaev and Seleznjev (1999) and Wu and Schaback (1993) address the uniform convergence problem.

Article/Book Model assumption Predictor Design Type of convergence Rate of convergence (Matérn kernels)
present work Gaussian process with misspecification kriging scattered points conv., uniform in
Yakowitz and Szidarovszky (1985) stochastic process with misspecification kriging scattered points Mean square conv., pointwise in NA
Stein (1990b) Stochastic process with misspecification kriging regular grid points Mean square conv., pointwise in
Buslaev and Seleznjev (1999) Gaussian process best linear approximation optimally chosen points in an interval conv., uniform in
Ritter (2000) Gaussian process kriging optimally chosen points Mean square conv., in
Wu and Schaback (1993) Deterministic function kriging scattered points uniform in
Table 1: Comparison between our work and some existing results

Buslaev and Seleznjev (1999) study the rate of convergence of the best linear approximation under an optimally chosen points in an interval. In other words, the predictor is constructed using the best linear combination of the observed data. Thus this predictor is in general different from the kriging predictor. Also, note that their work is limited to the one dimensional case where the points are chosen in a specific way. Therefore, this theory does not directly address the question raised in this paper. However, their result, together with our findings in Example 1, does imply an interesting property of kriging. Recall that in Example 1, the rate of convergence for a (known) Matérn correlation is at least . If a space-filling design is used in an interval, then and the convergence rate is , which coincides with the best possible rate of convergence given by a linear predictor. Because the kriging predictor is a linear predictor, we can conclude that our uniform upper bound for kriging is sharp in the sense that it captures the acctual rate of convergence.

Among the papers listed in Table 1, Wu and Schaback (1993) is the only one comparable to ours, in the sense that they consider a uniform prediction error under a scatter set of design points. They obtain error estimates of the kriging-type interpolants for a deterministic function, known as the radial basis function approximation. Although the mathematical formulations of the interpolants given by kriging and radial basis functions are similar, the two methods are different in their mathematical settings and assumptions. In radial basis function approximation, the underlying function is assumed fixed, while kriging utilizes a probabilistic model, driven by a Gaussian random field.

Kriging with misspecified correlation functions is discussed in Yakowitz and Szidarovszky (1985); Stein (1988, 1990a, 1990b). It has been proven in these papers that some correlation functions, especially the Matérn correlation family, are robust against model misspecification. However, they do not consider convergence under a uniform metric. More discussions on this point are given in Section 5.

4 Simulation studies

In Example 1, we have shown that if and are Matérn kernels with smoothness parameters and , respectively, and , then the kriging predictor converges with a rate at least . In this section we report simulation studies that verify that this rate is sharp, i.e., the true convergence rate coincides with that given by the theoretical upper bound.

We denote the expectation of the left-hand side of (3.5) by . If the error bound (3.5) is sharp, we have the approximation

for some constant independent of . Taking logarithm on both sides of the above formula yields


Since is much smaller than , the effect of is negligible in (4.1). Consequently, we get our second approximation


As shown in (4.2), is approximately a linear function in with slope . Therefore, to assess whether (3.5) is sharp, we should verify if the regression coefficient (slope) of with respect to is close to .

In our simulation studies, the experimental region is chosen to be . To estimate the regression coefficient in (4.2), we choose 50 different maximin Latin hypercube designs (Santner et al., 2003) with sample sizes , for . Note that each design corresponds to a specific value of the fill distance . For each , we simulate the Gaussian processes 100 times to reduce the simulation error. For each simulated Gaussian process, we compute to approximate the sup-error , where is the set of grid points with grid length . This should give a good approximation since the grid is dense enough. Next, we calculate the average of over the 100 simulations to approximate . Then the regression coefficient is estimated using the least squares method.

We conduct four simulation studies with different choices of the true and imposed smoothness of the Matérn kernels, denoted by and , respectively. We summarize the simulation results in Table 2.

Regression coefficient Theoretical assertion Relative difference
3 2.5 2.697 2.5 0.0788
5 3.5 3.544 3.5 0.0126
3.5 3.5 3.582 3.5 0.0234
5 5 4.846 5 0.0308
Table 2: Numerical studies on the convergence rates of kriging prediction. (The first two columns show the true and imposed smoothness parameters of the Matérn kernels. The third column shows the convergence rate obtained from the simulation. The fourth column shows the convergence rate given by Theorem 1. The last column shows the relative difference between the third and the fourth columns, given by |regression coefficient-theoretical assertion|/(theoretical assertion).)

Figure 1 shows the relationship between the logarithm of the fill distance (i.e., ) and the logarithm of the average prediction error (i.e., ) in scatter plots for the four cases given in Table 2. The solid line in each panel shows the linear regression fit calculated from the data. Each of the regression lines in Figure 1 fits the data very well, which gives an empirical confirmation of the approximation in (4.2). It is also observed from Figure 1 that, as the fill distance decreases, the maximum prediction error also decreases.

Figure 1: The regression line of on . Each point denotes one average prediction error for each . Panel 1: , . Panel 2: , . Panel 3: . Panel 4: .

Returning to Table 2, it is seen that the regression coefficients are close to the values given by our theoretical analysis, with relative error no more than 0.08. By comparing the third and the fourth rows of Table 2, we find that the regression coefficient does not have a significant change when remains the same, even if changes. On the other hand, the third and the fifth rows show that, the regression coefficient changes significantly as changes, even if keeps unchanged. This shows convincingly that the convergence rate is independent of the true smoothness of the Gaussian process, and the rate given by Theorem 1 is sharp. Note that our simulation studies justify the use of the leading term in (4.1) to assess the convergence rate but they do not cover the second term , which is of lower order.

From the simulation studies, we can see that if the smoothness of the imposed kernel is lower, the kriging predictor converges slower. Therefore, to maximize the prediction efficiency, it is beneficial to set the smoothness parameter of the imposed kernel the same as the true correlation function.

5 Conclusions and further discussion

We first summarize the statistical implications of this work. We prove that the kriging predictive error converges to zero under a uniform metric, which justifies the use of kriging as a function reconstruction tool. Kriging with a misspecified correlation function is also studied. Theorem 1 shows that there is a tradeoff between the predictive efficiency and the robustness. Roughly speaking, a less smooth correlation function is more robust against model misspecification. However, the price for robustness is to incur a small loss in prediction efficiency. With the help of the classic results in radial basis function approximation (in Lemmas 1 and 2), we find that the predictive error of kriging is associated with the fill distance, which is a space-filling measure of the design. This justifies the use of space-filling designs for (stationary) kriging models.

In this paper, we only consider Gaussian process models with mean zero, which is referred to as the simple kriging. A natural extension of this work is to include the Gaussian process models with a mean function modeled as a linear combination of a finite set of functions, known as the universal kriging. In this situation, one would consider the best linear unbiased predictor (BLUP) instead of the conditional expectation. The mathematical treatments should follow the same lines but the details are more cumbersome. Another extension is to consider the impact of a misspecified mean function. See, for example, Jiang et al. (2011) who addresses this problem in the context of small area estimation. We can consider the study of the effect of joint misspecification of mean and correlation functions in a future work.

We have proved in Theorem 1 that the kriging predictor is consistent if the imposed correlation function is undersmoothed, i.e., the imposed correlation function is no smoother than the true correlation function. One would ask whether a similar result can be proven for the case of oversmoothed correlation functions. Yakowitz and Szidarovszky (1985) proved that kriging with an oversmoothed Matérn correlation function also achieves (pointwisely) predictive consistency. In light of this result, we may consider extensions of Theorem 1 to the oversmoothed case in a future work.

In a series of papers, Stein (1988, 1990a, 1990b, 1993) investigated the asymptotic efficiency of the kriging predictor. The theory in our work does not give assertions about prediction efficiency, although we provide explicit error bounds for kriging predictors with scattered design points in general dimensions.

Another important topic is the kriging predictive performance when the correlation function is estimated. In this paper, we consider kriging with a misspecified but fixed kernel function. It is shown that the prediction error can be minimized if the imposed kernel has the same smoothness as the true correlation function. A natural question is whether the optimal rate of convergence can be achieved with an estimated correlation function.


Appendix A Auxiliary tools

In this section, we review some mathematical tools which are used in the proof of Theorem 1 in Appendix B.

a.1 Reproducing kernel Hilbert spaces

There are several equivalent ways to define the reproducing kernel Hilbert spaces. See, for example, Wahba (1990); Gu (2013); Wendland (2004); van der Vaart and van Zanten (2008). Here we adopt the one using the Fourier transform. See Girosi et al. (1995) and Theorem 10.12 of Wendland (2004). Let be the space of complex-valued square integrable functions on , and be the space of continuous real-valued functions on .

Definition 1.

Let be a positive definite kernel function which is continuous and integrable in . Define the reproducing kernel Hilbert space as

with the inner product

Denote the norm of by . For , a pointwise error bound for the radial basis function interpolation is given by (Wu and Schaback (1993), Theorem 4.5):


for each , where is defined in (3.1).

a.2 A Maximum inequality for Gaussian processes

It is worth noting that is a linear map between two functions, and therefore is also a Gaussian process. Therefore, the problem in (2.7) is to bound the maximum value of a Gaussian process.

The theory of bounding the maximum value of a Gaussian process is well-established in the literature. The main step of finding an upper bound is to calculate the covering number of the index space. Here we review the main results. Detailed discussions can be found in Adler and Taylor (2009).

Let be a Gaussian process indexed by . Here can be an arbitrary set. The Gaussian process induces a metric on , defined by


The -covering number of the metric space , denoted as , is the minimum integer so that there exist distinct balls in with radius , and the union of these balls covers . Let be the diameter of . The supremum of a Gaussian process is closely tied to a quantity called the entropy integral, defined as


Lemma 3 gives a maximum inequality for Gaussian processes, which is a direct consequence of Theorems 1.3.3 and 2.1.1 of Adler and Taylor (2009).

Lemma 3.

Let be a centered separable Gaussian process on a -compact , the metric, and the -covering number. Then there exists a universal constant such that for all ,


where .

Appendix B Proof of Theorem 1

Without loss of generality, assume , because otherwise we can consider the upper bound of instead. Let . For any ,

where , , , and .

The rest of our proof consists of the following steps. In step 1, we bound the covering number . Next we bound the diameter . In step 3, we invoke Theorem 3 to obtain a bound for the entropy integral. In the last step, we use (A.4) to obtain the desired results.

Step 1: Bounding the covering number

Let and . It can verified that

By Condition 1, , since for any . Thus, by (A.1),


By Definition 1,


Using Condition 1 and (B.2), we obtain


We need the following inequality to bound . For any and , we have


This inequality is trivial when because ; and for the case that , (B.4) can be proven using the mean value theorem and the fact that . Note that the definition of implies that . Thus, by the Fourier inversion theorem and (B.4), we have


for any . In particular, we now choose . Now we consider . It follows from a similar argument that . The definition of implies .

For any , the Fourier inversion theorem and Condition 1 yield


Then we choose to get


Let . Then . By (A.1) and the fact that , we have


Thus, if , by a similar argument in (B.5) (with the choice ), and together with (B.8) and (B), we have


for a constant .

In view of (B.1), (B.6) and (B.10), there exists a constant such that


Therefore, the covering number is bounded above by


The right side of (B.12) involves the covering number of a Euclidean ball, which is well understood in the literature. See Lemma 4.1 of Pollard (1990). This result leads to the bound