On Integral Theorems: Monte Carlo Estimators and Optimal Functions

07/22/2021
by   Nhat Ho, et al.

We introduce a class of integral theorems based on cyclic functions and the Riemann sums approximating integrals theorem. The Fourier integral theorem, derived as a combination of a transform and inverse transform, arises as a special case. The integral theorems provide natural estimators of density functions via Monte Carlo integration. Assessments of the quality of the density estimators can be used to obtain optimal cyclic functions which minimize square integrals. Our proof techniques rely on a variational approach in ordinary differential equations and the Cauchy residue theorem in complex analysis.


1 Introduction

The Fourier integral theorem, see for example (Wiener, 1933) and (Bochner, 1959), is a remarkable result. For every real-valued, integrable, and continuous function $m : \mathbb{R}^{d} \to \mathbb{R}$ and all $y \in \mathbb{R}^{d}$, it yields

$$m(y) = \frac{1}{(2\pi)^{d}} \lim_{R \to \infty} \int_{[-R, R]^{d}} \int_{\mathbb{R}^{d}} \cos\bigl(s^{\top}(y - x)\bigr)\, m(x)\, dx\, ds. \qquad (1)$$

The derivation of this result comes from a combination of the Fourier and inverse Fourier transforms. As far as we are aware, no other approach to obtaining such integral theorems exists in the literature, a point supported by the paper (Fowler, 1921).

The Fourier integral theorem has been employed to construct Monte Carlo estimators in several statistics and machine learning applications (Parzen, 1962; Davis, 1975; Ho and Walker, 2021), such as multivariate density estimation, nonparametric mode clustering and modal regression, quantile regression, and generative models. The methodological benefits of the Fourier integral theorem come from rewriting equation (1) as

$$m(y) = \lim_{R \to \infty} \frac{1}{\pi^{d}} \int_{\mathbb{R}^{d}} \prod_{j=1}^{d} \frac{\sin\bigl(R(y_{j} - x_{j})\bigr)}{y_{j} - x_{j}}\, m(x)\, dx, \qquad (2)$$

where $y = (y_{1}, \ldots, y_{d})$ and $x = (x_{1}, \ldots, x_{d})$. Equation (2) contains an important insight: even though we have certain dependence structures in $m(\cdot)$, by taking products of sinc functions our Monte Carlo estimators based on that equation are still able to preserve these dependence structures. That eliminates the cumbersome and delicate procedure of choosing a covariance matrix to guarantee the good practical performance of previous Monte Carlo estimators based on multivariate Gaussian kernels (Wand, 1992; Staniswalis et al., 1993; Chacon and Duong, 2018).
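To see concretely how (2) follows from (1), note that in one dimension the inner integral over $s$ can be evaluated in closed form; the following display records this standard step (stated here for completeness):

$$\frac{1}{2\pi} \int_{-R}^{R} \cos\bigl(s(y - x)\bigr)\, ds = \frac{\sin\bigl(R(y - x)\bigr)}{\pi (y - x)},$$

and applying this coordinate-wise in $d$ dimensions yields the product of sinc kernels in equation (2).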

Contribution.

The aim of this paper is to highlight a general class of integral theorems that we can use as Monte Carlo estimators for applications in statistics and machine learning, while having some benefits over the Fourier integral theorem in certain settings. We do not arrive at this class using transforms and inverse transforms, as is done for the Fourier integral theorem, but rather use two novel ideas: a cyclic function which integrates to 0 over each cyclic interval, and a Riemann sum approximation to an integral.

In particular, we define almost everywhere differentiable cyclic functions $\phi$, with period $2\pi$, such that

$$\int_{0}^{2\pi} \phi(s)\, ds = 0 \quad \text{and} \quad \int_{\mathbb{R}} \frac{\phi(s)}{s}\, ds = \pi. \qquad (3)$$

The Fourier integral theorem corresponds to the kernel

$$\phi(y) = \sin(y) \qquad (4)$$

for all $y \in \mathbb{R}$. Then, via the Riemann sums approximating integrals theorem, we demonstrate that

$$m(y) = \lim_{R \to \infty} \frac{1}{\pi^{d}} \int_{\mathbb{R}^{d}} \prod_{j=1}^{d} \frac{\phi\bigl(R(y_{j} - x_{j})\bigr)}{y_{j} - x_{j}}\, m(x)\, dx. \qquad (5)$$

Similar to the Fourier integral theorem (2), the general integral theorems in equation (5) are also able to automatically preserve the dependence structures in the function $m$. Now, with our finding of large classes of integral theorems, the question posed is which kernel $\phi$, if any, has optimal properties in terms of estimation.

In this work, we specifically answer this question in the context of the multivariate kernel density estimation problem (Rosenblatt, 1956; Parzen, 1962; Yakowitz, 1985; Györfi et al., 1985; Terrell and Scott, 1992; Wand and Jones, 1993; Wasserman, 2006; Botev et al., 2010; Giné and Nickl, 2010; Jiang, 2017). Indeed, the kernel density estimator based on equation (5) is given by:

$$\widehat{m}_{R}(y) = \frac{1}{n \pi^{d}} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{\phi\bigl(R(y_{j} - X_{ij})\bigr)}{y_{j} - X_{ij}}, \qquad (6)$$

where $X_{1}, \ldots, X_{n}$ represent the sample from the density function $m$ and a finite $R$ is required for the smoothing. We study upper bounds for the bias of the estimator based on $R$ and the sample size $n$ in Theorem 1 and Corollary 1.
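As an illustration, here is a minimal Python sketch of the estimator (6) with a pluggable cyclic function $\phi$; the function names and the test density below are our own illustrative choices, not part of the paper.

```python
import numpy as np

def phi_sin(u):
    # Fourier case: the cyclic function in equation (4).
    return np.sin(u)

def density_estimate(y, X, R, phi=phi_sin):
    """Evaluate the estimator (6) at a point y in R^d.

    y   : array of shape (d,)    evaluation point
    X   : array of shape (n, d)  i.i.d. sample from the unknown density
    R   : float                  smoothing parameter (kept finite)
    phi : cyclic function satisfying the constraints (3)
    """
    Z = y[None, :] - X                   # (n, d) differences y_j - X_ij
    K = phi(R * Z) / (np.pi * Z)         # one kernel factor per coordinate
    return float(np.mean(np.prod(K, axis=1)))

# Usage on a correlated 2-d Gaussian sample (illustrative only).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
print(density_estimate(np.array([0.0, 0.0]), X, R=8.0))
```

Note that no covariance matrix is chosen anywhere: the product of one-dimensional kernels handles the dependence automatically, which is the point made after equation (2).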

In order to find the optimal kernel $\phi$, we use the asymptotic mean integrated square error, which is a key property determining the quality of an estimator; see for example (Wand, 1992, 1994). To ease the findings, we specifically consider the univariate setting, i.e., $d = 1$, and use the following two terms in that error to determine the optimal kernel: (1) the first term

$$\int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} dy, \qquad (7)$$

which provides an upper bound on the variance of the density estimator $\widehat{m}_{R}$; (2) the second term

$$\int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} p(y)\, dy, \qquad (8)$$

where $p$ is the sampling density, which yields more precise asymptotic behavior of the variance of the density estimator than that from equation (7). We demonstrate that, by minimizing the first term (7) subject to the constraints (3), the optimal kernel is the sin function (4) from the Fourier integral theorem. This is achieved via a variational approach in ordinary differential equations. On the other hand, by using the Cauchy residue theorem in complex analysis, we prove that the kernel minimizing the second term (8) subject to the constraints (3) is in general not the sin kernel. This also demonstrates the usefulness of finding integral theorems other than the Fourier integral theorem.

The organization of the paper is as follows. In Section 2, we first revisit the Fourier integral theorem and establish the bias of its density estimator via Riemann sums approximating integrals theorem. Then, using the insight from that theorem, we introduce a general class of integral theorems that possess similar approximation errors. After deriving our class of integral theorems, in Section 3 we study optimal kernels that minimize either the problem (7) or the problem (8) subject to the constraints (3). Finally, we conclude the paper with a few discussions in Section 4 while deferring the proofs of the remaining results in the paper to the Appendix.

2 Integral Theorems

We first study the bias of the kernel density estimator (6), or equivalently the approximation property of the Fourier integral theorem, via the Riemann sums approximating integrals theorem in Section 2.1. Then, using the insight from that result, we introduce in Section 2.2 a general class of integral theorems that possesses approximation behavior similar to that of the Fourier integral theorem.

2.1 The Fourier integral theorem revisited

Before going into the details of the general integral theorem, we reconsider the approximation property of the Fourier integral theorem. In (Ho and Walker, 2021), the authors utilize the tail behavior of the Fourier transform of the function $m$ to characterize the approximation error of the Fourier integral theorem when truncating one of the integrals. However, the technique in that proof is inherently based on properties of the sin kernel and is non-trivial to extend to other choices of useful cyclic functions; examples of such functions are given in Section 2.2.

In this paper, we provide insight into the approximation error of the Fourier integral theorem via the Riemann sums approximating integrals theorem. This insight generalizes to any cyclic function which integrates to 0 over the cyclic interval, thereby enriching the family of integral theorems beyond Fourier's. To simplify the presentation, we define

$$m_{R}(y) = \frac{1}{\pi^{d}} \int_{\mathbb{R}^{d}} \prod_{j=1}^{d} \frac{\sin\bigl(R(y_{j} - x_{j})\bigr)}{y_{j} - x_{j}}\, m(x)\, dx. \qquad (9)$$

By simple calculation, $\mathbb{E}[\widehat{m}_{R}(y)] = m_{R}(y)$, where the outer expectation is taken with respect to i.i.d. samples $X_{1}, \ldots, X_{n}$ from $m$ and $\widehat{m}_{R}$ is the density estimator (6) when $\phi$ is the sin kernel. Therefore, to study the bias of the kernel density estimator in equation (6), it is sufficient to consider the approximation error of the Fourier integral theorem; namely, we aim to upper bound $|m_{R}(y) - m(y)|$ for all $y \in \mathbb{R}^{d}$. To obtain the bound, we start with the following definition of the class of univariate functions that we use throughout our study.

Definition 1.

The univariate function $f$ is said to belong to the smoothness class of order $K$, for a given positive integer $K$, if it satisfies the following conditions:

  1. The function $f$ is differentiable and uniformly continuous up to the $K$-th order, and $\lim_{|x| \to \infty} f^{(k)}(x) = 0$ for any $0 \le k \le K$, where $f^{(k)}$ denotes the $k$-th order derivative of $f$;

  2. The integrals $\int_{\mathbb{R}} |f^{(k)}(x)|\, dx$ are finite for all $0 \le k \le K$.

Note that, for the function in Definition 1, for any when , we choose . Based on Definition 1, we now state the following result.
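As a concrete instance (our own illustrative example, not taken from the paper), the standard Gaussian density $m(x) = e^{-x^{2}/2}/\sqrt{2\pi}$ plausibly belongs to such a class for every order $K$: all of its derivatives are uniformly continuous, vanish as $|x| \to \infty$, and are absolutely integrable on $\mathbb{R}$.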

Theorem 1.

Assume that the univariate component functions of $m$ belong to the smoothness classes of Definition 1 for given positive integer orders. Then, there exist universal constants $C$ and $\bar{C}$, depending on those orders, such that for $R$ sufficiently large we obtain

where .

The proof is presented in the Appendix. To appreciate the proof we demonstrate the key idea in the one dimensional case. Here

which we write as

where . Without loss of generality, we set to get

where . Now, due to the cyclic behavior of the sin function, we can write this as

The term is a Riemann sum approximation to an integral which converges to a constant, for all $y$, as $R \to \infty$. The overall convergence to 0 is then a consequence of this. Hence, it is how the Riemann sum converges to a constant that determines the speed at which $m_{R}(y) \to m(y)$.
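As a quick numerical illustration of this convergence (our own check, not from the paper), one can approximate the one-dimensional integral in (2) for a standard Gaussian $m$ and watch the error at a fixed point shrink as $R$ grows:

```python
import numpy as np

def m(x):
    # Test density: standard Gaussian (an illustrative choice).
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def m_R(y, R, half_width=60.0, n_grid=200_001):
    # Trapezoidal approximation of (1/pi) int sin(R(y-x))/(y-x) m(x) dx.
    # (Use np.trapezoid instead of np.trapz on NumPy >= 2.0.)
    x = np.linspace(y - half_width, y + half_width, n_grid)
    z = y - x
    z[np.abs(z) < 1e-12] = 1e-12     # sinc limit at x = y is R/pi
    return np.trapz(np.sin(R * z) / (np.pi * z) * m(x), x)

y = 0.5
for R in (2.0, 8.0, 32.0):
    print(R, abs(m_R(y, R) - m(y)))  # the error decreases as R grows
```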

2.2 General integral theorem

It is interesting to note that the sin function in the Fourier integral theorem can be replaced by any cyclic function which integrates to 0 over the cyclic interval. In particular, we consider the following general form of integral theorem:

$$m(y) = \lim_{R \to \infty} \frac{1}{\pi^{d}} \int_{\mathbb{R}^{d}} \prod_{j=1}^{d} \frac{\phi\bigl(R(y_{j} - x_{j})\bigr)}{y_{j} - x_{j}}\, m(x)\, dx, \qquad (10)$$

where the univariate function $\phi$ is a cyclic function on $[0, 2\pi]$. A direct example is $\phi(y) = \sin(y)$ for all $y \in \mathbb{R}$, which corresponds to the Fourier integral theorem. Using the proof technique of Theorem 1 and the same assumptions on the function $m$ as in that theorem, we also obtain the following approximation error for the general integral theorem:

Corollary 1.

Assume that the kernel satisfies the constraints (3) and the function satisfies the assumptions in Theorem 1. Then, there exist universal constants and depending on and the function such that when we have

where is defined as in Theorem 1.

Therefore, we have a general class of integral theorems that possesses approximation errors similar to those of the Fourier integral theorem. Furthermore, the general integral theorems are also able to automatically maintain the dependence structures in the function $m$.

We now discuss some examples of the function $\phi$ that have connections to Haar wavelets and splines.

Example 1.

(Haar wavelet integral theorem) We consider the piece-wise linear function

where

and

This demonstrates that the derivative of the function $\phi$ is the Haar wavelet function, which justifies the name "Haar wavelet integral theorem" for this choice of $\phi$. A minimal candidate of this type is sketched below.
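The exact piecewise form did not survive extraction above, so the following Python sketch uses our own candidate (not necessarily the paper's exact choice): a piecewise-linear cyclic function whose a.e. derivative is a Haar-type square wave, with the second constraint in (3) enforced numerically by rescaling.

```python
import numpy as np

def tri(u):
    """Piecewise-linear cyclic function with period 2*pi and tri(0) = 0.

    The slope is +1 on [0, pi/2), -1 on [pi/2, 3*pi/2), +1 on [3*pi/2, 2*pi),
    so the a.e. derivative is a Haar-type square wave and the integral over
    one period is 0 (the first constraint in (3)). This is a hypothetical
    stand-in for the paper's function, not its exact form.
    """
    v = np.mod(u, 2 * np.pi)
    return np.where(v < np.pi / 2, v,
                    np.where(v < 3 * np.pi / 2, np.pi - v, v - 2 * np.pi))

# Second constraint in (3), enforced numerically: int phi(u)/u du = pi.
u = np.linspace(-2e4, 2e4, 4_000_000)   # even point count: grid never hits 0
c = np.pi / np.trapz(tri(u) / u, u)     # np.trapezoid on NumPy >= 2.0
phi_haar = lambda z: c * tri(z)         # normalized cyclic kernel
```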

Example 2.

(Spline integral theorem) Here we consider the piece-wise quadratic function

where

A direct calculation shows that

Therefore, the first derivative of $\phi$ is a piece-wise linear function. The particular form of $\phi$ justifies the name "spline integral theorem" for this choice of kernel function $\phi$.

Finally, we would like to highlight that the Haar wavelet and spline integral theorems are just two instances of the general class of integral theorems. In general, the class of cyclic functions satisfying an integral theorem is vast.

3 Optimal Functions

In this section, we discuss optimal functions for the integral theorem with respect to the kernel density estimation problem. To ease the findings, we specifically consider the univariate setting, namely $d = 1$. Subject to the constraints in equation (3), as mentioned in the introduction, we consider minimizing either the objective (7) or the objective (8). We show that the sin function minimizes

$$\int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} dy$$

subject to the constraints in equation (3), whereas this is not the case when we introduce a density function, i.e., the aim now being to minimize

$$\int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} p(y)\, dy$$

for some density function $p$, which yields more precise asymptotic behavior of the variance of the density estimator in equation (6).

3.1 The sin function

As we have mentioned, a direct application of the general integral theorem is a Monte Carlo estimator of density functions; see (Ho and Walker, 2021). The bias of the estimator $\widehat{m}_{R}$, namely $|\mathbb{E}[\widehat{m}_{R}(y)] - m(y)|$, has been established in Corollary 1. A natural question to ask is the form of the optimal kernel $\phi$ that leads to a good quality of the density estimator $\widehat{m}_{R}$. To answer that question, we use the asymptotic mean integrated square error (Wand, 1992, 1994), which is equivalent to finding the optimal kernel that leads to a good variance for the estimator $\widehat{m}_{R}$. A simple calculation shows that

$$\operatorname{Var}\bigl(\widehat{m}_{R}(y)\bigr) = \frac{1}{n}\,\operatorname{Var}\!\left(\frac{\phi\bigl(R(y - X)\bigr)}{\pi\,(y - X)}\right) \qquad (11)$$

for any $y \in \mathbb{R}$, where the outer variance is taken with respect to the random variable $X$ following the density function $m$. With the assumption that $m$ is bounded, we can upper bound the variance of $\widehat{m}_{R}(y)$ as follows:

$$\operatorname{Var}\bigl(\widehat{m}_{R}(y)\bigr) \le \frac{R\,\|m\|_{\infty}}{n \pi^{2}} \int_{\mathbb{R}} \left(\frac{\phi(z)}{z}\right)^{2} dz. \qquad (12)$$

The integral in the upper bound in equation (12) is convenient as it does not involve the function $m$. It indicates that the optimal kernel minimizing that integral will be independent of $m$, which also yields good insight into the behavior of the variance of the density estimator for all $R$ and $n$. Therefore, we consider minimizing the upper bound (12) with respect to the constraints that $\phi$ is an almost everywhere differentiable cyclic function on $[0, 2\pi]$ that satisfies the constraints (3). This is equivalent to solving the following objective:

$$\min_{\phi} \int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} dy \quad \text{such that } \phi \text{ satisfies (3)}.$$

That is the objective function (7) that we mentioned in the introduction.

To study the optimal function $\phi$ that satisfies these constraints, we define the following functions:

$$A(y) = \sum_{k \in \mathbb{Z}} \frac{1}{(y + 2\pi k)^{2}}, \qquad B(y) = \sum_{k \in \mathbb{Z}} \frac{1}{y + 2\pi k},$$

for any $y \in (0, 2\pi)$. Now, we would like to prove that

$$A(y) = \frac{1}{4 \sin^{2}(y/2)}, \qquad B(y) = \frac{1}{2} \cot(y/2). \qquad (13)$$

In fact, from the infinite product representation of the sinc function, we have

$$\frac{\sin(y)}{y} = \prod_{k=1}^{\infty} \left(1 - \frac{y^{2}}{k^{2}\pi^{2}}\right) \qquad (14)$$

for any $y \in \mathbb{R}$. By taking the logarithm of both sides of the equation and then differentiating with respect to $y$, we obtain

$$\cot(y) - \frac{1}{y} = \sum_{k=1}^{\infty} \frac{2y}{y^{2} - k^{2}\pi^{2}}, \quad \text{i.e.,} \quad \sum_{k \in \mathbb{Z}} \frac{1}{y + \pi k} = \cot(y).$$

By the change of variable $y \mapsto y/2$, we obtain the conclusion that $B(y) = \tfrac{1}{2}\cot(y/2)$. The form of $A$ can be obtained directly by taking the derivative of $B$, since $A(y) = -B'(y)$. Therefore, we obtain the conclusion of claim (13).
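A quick numerical sanity check of (13) (our own, using truncated lattice sums):

```python
import numpy as np

def A_partial(y, K=100_000):
    k = np.arange(-K, K + 1)
    return np.sum(1.0 / (y + 2 * np.pi * k) ** 2)

def B_partial(y, K=100_000):
    k = np.arange(-K, K + 1)
    return np.sum(1.0 / (y + 2 * np.pi * k))   # symmetric partial sum

y = 1.3
print(A_partial(y), 1 / (4 * np.sin(y / 2) ** 2))  # should agree closely
print(B_partial(y), 0.5 / np.tan(y / 2))           # should agree closely
```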

Now, we state our main result for the optimal kernel solving the objective function (7).

Theorem 2.

The optimal cyclic and almost everywhere differentiable function that solves the objective function (7) subject to the constraints (3) is $\phi(y) = \sin(y)$ for all $y \in \mathbb{R}$.

Interestingly, if we truncate the infinite product representation (14) of the sinc function, which corresponds to the optimal kernel solving objective function (7), at $k = 1$, we obtain the Epanechnikov kernel (Epanechnikov, 1967; Mueller, 1984), given by $K(x) = \tfrac{3}{4}(1 - x^{2})$ for $|x| \le 1$ and $0$ otherwise. This kernel has been shown to have optimal efficiency among non-negative kernels that are differentiable up to the second order (Tsybakov, 2009). Direct calculation shows that the value of the objective (7) for the Epanechnikov kernel exceeds that for the sin kernel.

Therefore, if we use the term (7) as an indication of the quality of our variance, the Epanechnikov kernel is not better than the sin kernel from the Fourier integral theorem. This also aligns with an observation from Tsybakov (2009) that, without restricting to non-negative kernels, we can construct kernels, which can take negative values, that are better than the Epanechnikov kernel.

Proof.

Since the function $\phi$ is cyclic on $[0, 2\pi]$, we have

$$\int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} dy = \sum_{k \in \mathbb{Z}} \int_{0}^{2\pi} \frac{\phi^{2}(y)}{(y + 2\pi k)^{2}}\, dy = \int_{0}^{2\pi} \phi^{2}(y) A(y)\, dy.$$

Similarly, we also obtain that

$$\int_{\mathbb{R}} \frac{\phi(y)}{y}\, dy = \int_{0}^{2\pi} \phi(y) B(y)\, dy.$$

Given the above equations, the original problem can be rewritten as follows:

$$\min_{\phi} \int_{0}^{2\pi} \phi^{2}(y) A(y)\, dy \qquad (15)$$

such that

$$\int_{0}^{2\pi} \phi(y) B(y)\, dy = \pi \quad \text{and} \quad \int_{0}^{2\pi} \phi(y)\, dy = 0.$$

The Lagrangian function corresponding to the objective function (15) takes the form:

$$\mathcal{L}(\phi, \lambda_{1}, \lambda_{2}) = \int_{0}^{2\pi} \Bigl[\phi^{2}(y) A(y) + \lambda_{1}\, \phi(y) B(y) + \lambda_{2}\, \phi(y)\Bigr]\, dy - \lambda_{1} \pi.$$

To find the $\phi$ that minimizes the function $\mathcal{L}$, we use the Euler–Lagrange equation, see for example (Young, 1969), which entails that

$$2\,\phi(y) A(y) + \lambda_{1} B(y) + \lambda_{2} = 0.$$

This equation leads to

$$\phi(y) = -\frac{\lambda_{1} B(y) + \lambda_{2}}{2 A(y)} = c_{1} \sin(y) + c_{2}\bigl(1 - \cos(y)\bigr),$$

where $c_{1}$ and $c_{2}$ can be determined by solving the conditions (3).

Given the form of the optimal $\phi$ and the forms of $A$ and $B$ in equation (13), the first condition in equation (3) leads to

$$\int_{0}^{2\pi} \Bigl[c_{1} \sin(y) + c_{2}\bigl(1 - \cos(y)\bigr)\Bigr]\, dy = 2\pi c_{2} = 0.$$

That equation demonstrates that $c_{2} = 0$. Given that, $\phi(y) = c_{1} \sin(y)$. Now, the second condition in equation (3) indicates that

$$c_{1} \int_{\mathbb{R}} \frac{\sin(y)}{y}\, dy = c_{1} \pi = \pi.$$

That equation leads to $c_{1} = 1$. Therefore, we have the optimal kernel $\phi(y) = \sin(y)$ for all $y \in \mathbb{R}$. As a consequence, we obtain the conclusion of the theorem. ∎
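As a sanity check of Theorem 2 (our own numerical illustration, not from the paper), one can verify that perturbing away from $\sin$ within the admissible class increases the objective (7):

```python
import numpy as np

# Even number of grid points: the grid straddles, but never hits, y = 0.
y = np.linspace(-2e3, 2e3, 4_000_000)

def normalize(v):
    # Rescale so that int phi(y)/y dy = pi (second constraint in (3)).
    # (Use np.trapezoid instead of np.trapz on NumPy >= 2.0.)
    return v * (np.pi / np.trapz(v / y, y))

def objective(v):
    # Objective (7): int (phi(y)/y)^2 dy.
    return np.trapz((v / y) ** 2, y)

phi0 = normalize(np.sin(y))                        # the sin kernel
phi1 = normalize(np.sin(y) + 0.3 * np.sin(2 * y))  # admissible perturbation
print(objective(phi0), objective(phi1))            # ~pi vs a larger value
```

The perturbation $\sin(y) + 0.3\sin(2y)$ still integrates to 0 over $[0, 2\pi]$, and after rescaling it meets both constraints in (3), yet its objective value is strictly larger than $\pi$.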

3.2 Alternative optimal functions

In the previous section we saw that the sin function is optimal for minimizing $\int_{\mathbb{R}} (\phi(y)/y)^{2}\, dy$ subject to the constraints (3), namely $\int_{0}^{2\pi} \phi(y)\, dy = 0$ and $\int_{\mathbb{R}} (\phi(y)/y)\, dy = \pi$, with $\phi$ being a cyclic function on $[0, 2\pi]$. That objective function stems from the upper bound (12) on the variance of the density estimator $\widehat{m}_{R}$.

In this section we demonstrate the usefulness of finding integral theorems other than Fourier's by finding the optimal function $\phi$, also satisfying the constraints (3), which minimizes the leading term of the variance of $\widehat{m}_{R}$, given by:

$$\int_{\mathbb{R}} \left(\frac{\phi(y)}{y}\right)^{2} p(y)\, dy$$

for some density function $p$ on $\mathbb{R}$. Since the above integral captures the leading term of the variance of $\widehat{m}_{R}$, it gives a more precise asymptotic behavior of the variance than the term (7). As that integral involves the density function $p$, the optimal kernel also depends on $p$. To illustrate our findings, we specifically consider the setting where $p$ is the Cauchy density, i.e., $p(y) = 1/(\pi(1 + y^{2}))$ for all $y \in \mathbb{R}$, which is useful when modeling heavy-tailed distributions, and for which we are able to solve the relevant equations.

Figure 1: Optimal function for solving objective function (8) with respect to the Cauchy density function. As we can observe, it is different from the optimal sin function from the Fourier integral theorem for solving objective function (7).

The proof idea for obtaining the optimal kernel is the same as that of Theorem 2. The only difference is that we need to find a new function $A$, which we refer to as $A_{p}$, given now by:

$$A_{p}(y) = \sum_{k \in \mathbb{Z}} \frac{p(y + 2\pi k)}{(y + 2\pi k)^{2}}.$$

To find the closed-form expression of $A_{p}$, we utilize contour integration from complex analysis (see for example (Priestley, 1985)); so consider

$$\oint_{\gamma} \frac{\cot\bigl((z - y)/2\bigr)}{\pi z^{2}(1 + z^{2})}\, dz,$$

where $\gamma$ is a circle in the complex plane of radius $r$ around the origin. The simple poles occur at $z = y + 2\pi k$ for all integers $k$, giving a total residue of $2 A_{p}(y)$, since the relevant coefficient in the Laurent expansion of $\cot\bigl((z - y)/2\bigr)$ is 2; also at $z = \pm i$, for which the residues come from the Cauchy factor $1/(1 + z^{2})$. There is a double pole at $z = 0$, for which the residue is the first derivative of

$$\frac{\cot\bigl((z - y)/2\bigr)}{\pi(1 + z^{2})}$$

evaluated at $z = 0$. From direct calculation, this term is $-\csc^{2}(y/2)/(2\pi)$.

Now using the Cauchy residue theorem, noting that the contour integral vanishes as $r \to \infty$, and expanding the residues at $z = \pm i$, we obtain a closed-form expression for $A_{p}(y)$.
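For completeness, the device used here is the standard residue-summation identity (a textbook fact, stated under our assumed form of the integrand above): for a meromorphic function $g$ decaying like $|z|^{-2}$ with no poles on the lattice $\{y + 2\pi k\}$,

$$\sum_{k \in \mathbb{Z}} g(y + 2\pi k) = -\sum_{z_{0} \in \text{poles of } g} \operatorname{Res}_{z = z_{0}} \left[\tfrac{1}{2} \cot\!\left(\frac{z - y}{2}\right) g(z)\right],$$

which follows because $\tfrac{1}{2}\cot((z - y)/2)$ has a simple pole with residue 1 at each lattice point and the large-circle contour integral vanishes. It is applied here with $g(z) = p(z)/z^{2}$ for the Cauchy density $p$.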

As shown in the proof of Theorem 2, the optimal function is of the form

$$\phi_{p}(y) = -\frac{c_{1} B(y) + c_{2}}{2 A_{p}(y)}, \qquad (16)$$

where recall that $B(y) = \tfrac{1}{2}\cot(y/2)$, with the $c_{1}$, $c_{2}$ values now playing the role of the Lagrange multipliers. Different from Theorem 2, we do not have closed-form expressions for $c_{1}$ and $c_{2}$. However, numerical integration can be used to determine the values that meet the constraints (3). A picture of the optimal function is given in Figure 1.

To show the benefit of the new optimal kernel in equation (16) over the sin kernel in the Fourier integral theorem, in Figure 2 we plot the kernel density estimators using both the optimal kernel and the sin kernel, alongside the difference between the two estimators. This is based on a sample from a standard Cauchy density. There is apparently little difference between the two. However, when computing the square of the integral in equation (8), we note that for the optimal kernel we obtain a value of 0.0416 and for the sin kernel a value of 0.0774.
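A sketch of how such a comparison could be run (our own scaffolding; the criterion's exact normalization and the fitted $c_{1}$, $c_{2}$ are the paper's, so the numbers below are not expected to reproduce 0.0416 and 0.0774):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_cauchy(size=5_000)     # sample from the standard Cauchy
R = 20.0                                # smoothing parameter

def kde(y, X, R, phi):
    # One-dimensional instance of the estimator (6).
    z = y - X
    return np.mean(phi(R * z) / (np.pi * z))

def criterion(phi, p, L=2e3, n=4_000_000):
    # Assumed form of (8), up to the paper's normalization:
    # int (phi(y)/y)^2 p(y) dy, by trapezoidal quadrature.
    y = np.linspace(-L, L, n)           # even count: the grid never hits 0
    return np.trapz((phi(y) / y) ** 2 * p(y), y)

cauchy = lambda y: 1.0 / (np.pi * (1.0 + y ** 2))
print(criterion(np.sin, cauchy))        # criterion value for the sin kernel
print(kde(0.0, X, R, np.sin))           # sinc-kernel density estimate at 0
```

Swapping `np.sin` for a numerically normalized version of (16) would complete the comparison.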

We note in passing that such a result is not restricted to the setting where the samples are generated from the Cauchy distribution; indeed, when we have samples from a Gaussian distribution, using the corresponding optimal kernel also yields a slightly better variance than the sin kernel from the Fourier integral theorem. We leave a detailed investigation of the benefit of such optimal kernels over the sin kernel in general settings of the density estimation problem for future work.

Figure 2: Density estimators from a Cauchy sample with both optimal and sin kernels (left panel); difference between the two estimators (right panel).

4 Discussion

In this paper we have introduced a general class of integral theorems. These integral theorems provide natural Monte Carlo density estimators that can automatically preserve the dependence structure of a dataset. In the univariate density estimation setting, we demonstrate that the Fourier integral theorem is the optimal integral theorem when we minimize the square integral in equation (7), a term that indicates a good variance of the density estimator and does not involve the density function in the upper bound of the variance. To show the benefit of the general class of integral theorems, we also consider the optimal kernel that minimizes the term (8), which captures the asymptotic variance of the kernel density estimator more precisely. Our study proves that the optimal kernels for that objective function are generally not the sin kernel from the Fourier integral theorem.

Now, we discuss a few future directions from this work. First, in this work, we only obtain the optimal kernels in our general class of integral theorems for the density estimation task. It is important to study the optimal kernels for other statistical estimation tasks, such as nonparametric (modal) regression and mode clustering. Second, the recent work of Lee-Thorp et al. (2021) proposes using double Fourier transforms to approximate a nonparametric function that can capture well both the correlation of words in each sequence and the correlation of sequences in natural language processing tasks. Given our study of general integral theorems, it is of interest to investigate whether we can develop a general notion of double Fourier transforms in a similar way as we do for the integral theorems, and whether the choice of double Fourier transforms is optimal for estimating the nonparametric functions arising in natural language processing tasks.

5 Appendix

In this appendix, we give the proof of Theorem 1. To ease the presentation, the values of universal constants can change from line to line. For any , we denote .

5.1 Proof of Theorem 1

We first prove the result of Theorem 1 in the one-dimensional case. In particular, we would like to show that when the function $m$ belongs to the class of Definition 1, there exists a universal constant $C$ such that we have

In fact, from the definition of in equation (9), we have

For simplicity of the presentation, for any we write for all . Then, we can rewrite the above equality as

Invoking the change of variables , the above equation becomes

(17)

Since $m$ satisfies the conditions of Definition 1, the function is differentiable up to the $K$-th order. Therefore, a Taylor expansion up to the $K$-th order leads to

Plugging the above Taylor expansion into equation (17), we have

(18)

where, for , we define

We now find bounds for these terms; we will demonstrate that

(19)

where $C$ is some universal constant. To obtain these bounds, we will use an inductive argument. We start with the base case. In fact, we have

An application of Taylor expansion leads to

Now for any and , we have

Collecting the above results, we find that

Using the Riemann sums approximating integrals theorem, we have

where the finite value of the integral is due to the assumption that . Furthermore, the above limit is uniform in terms of as is uniformly continuous. Collecting the above results, there exists a universal constant such that as long as , the following inequality holds:

where is some universal constant. Combining all of the previous results, we obtain

Since , using integration by parts, we get

Therefore, we obtain the conclusion of equation (19) when .

Now assume that the conclusion of equation (19) holds for . We will prove that the conclusion also holds for . With a similar argument to the setting , we obtain

(20)

Using a Taylor expansion, we have