1 Introduction
The Fourier integral theorem, see for example (Wiener, 1933) and (Bochner, 1959), is a remarkable result. For any real-valued, integrable, and continuous function m on \mathbb{R}^d, it yields

(1) m(y) = \frac{1}{(2\pi)^d} \lim_{R\to\infty} \int_{[-R,R]^d} \int_{\mathbb{R}^d} \cos\bigl(s^\top (y - x)\bigr)\, m(x)\, dx\, ds.
This result follows from a combination of the Fourier and inverse Fourier transforms. As far as we are aware, no other approach to obtaining such integral theorems exists in the literature, a point supported by the paper (Fowler, 1921). The Fourier integral theorem has been employed to construct Monte Carlo estimators in several statistics and machine learning applications
(Parzen, 1962; Davis, 1975; Ho and Walker, 2021), such as multivariate density estimation, nonparametric mode clustering and modal regression, quantile regression, and generative models. The methodological benefits of the Fourier integral theorem come from rewriting equation (1) as

(2) m(y) = \lim_{R\to\infty} \frac{1}{\pi^d} \int_{\mathbb{R}^d} \prod_{j=1}^{d} \frac{\sin\bigl(R(y_j - x_j)\bigr)}{y_j - x_j}\, m(x)\, dx,

where y = (y_1, \ldots, y_d) and x = (x_1, \ldots, x_d). Equation (2) contains an important insight: even though the function m may carry dependence structures across its coordinates, by taking products of sinc functions our Monte Carlo estimators based on that equation are still able to preserve these dependence structures. That eliminates the cumbersome and delicate procedure of choosing a covariance matrix to guarantee the good practical performance of previous Monte Carlo estimators based on multivariate Gaussian kernels (Wand, 1992; Staniswalis et al., 1993; Chacon and Duong, 2018).
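To illustrate this point, here is a minimal sketch (our own construction, not from the paper; the sample size, correlation, and choice of R are illustrative) of the Monte Carlo estimator based on equation (2) for a correlated bivariate Gaussian density, using a single scalar parameter R in place of a bandwidth matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, R = 50_000, 0.6, 5.0

# Draw a sample from a correlated bivariate Gaussian.
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)

def sinc_kernel(u, R):
    # sin(R*u) / (pi*u); np.sinc handles the removable singularity at u = 0.
    return (R / np.pi) * np.sinc(R * u / np.pi)

# Monte Carlo estimator from equation (2): an average of products of sinc kernels.
y = np.array([0.0, 0.0])
estimate = float(np.mean(np.prod(sinc_kernel(y - X, R), axis=1)))
true_density = 1.0 / (2 * np.pi * np.sqrt(1.0 - rho**2))
```

Even with correlation 0.6 between the coordinates, the product of one-dimensional sinc kernels recovers the joint density at the origin without tuning any covariance matrix.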
Contribution.
The aim in this paper is to highlight a general class of integral theorems that we can use as Monte Carlo estimators for applications in statistics and machine learning, while having some benefits over the Fourier integral theorem in certain settings. We do not arrive at this class using transforms and inverse transforms as with the Fourier integral theorem, but rather use two novel ideas: a cyclic function which integrates to 0 over each cyclic interval, and a Riemann sum approximation to an integral.
In particular, we define almost everywhere differentiable cyclic functions \varphi, with cyclic interval [0, 2\pi], such that

(3) \int_0^{2\pi} \varphi(s)\, ds = 0 \quad \text{and} \quad \int_{-\infty}^{\infty} \frac{\varphi(s)}{s}\, ds = \pi.

The Fourier integral theorem corresponds to the kernel

(4) \varphi(s) = \sin(s)

for all s \in \mathbb{R}. Then, via the Riemann sum approximation theorem for integrals, we demonstrate that

(5) m(y) = \lim_{R\to\infty} \frac{1}{\pi^d} \int_{\mathbb{R}^d} \prod_{j=1}^{d} \frac{\varphi\bigl(R(y_j - x_j)\bigr)}{y_j - x_j}\, m(x)\, dx.
Similar to the Fourier integral theorem (2), the general integral theorems in equation (5) are also able to automatically preserve the dependence structures in the function m. Now, with our finding of large classes of integral theorems, the question posed is which kernel has optimal properties in terms of estimation.
In this work, we specifically answer this question in the context of the multivariate kernel density estimation problem
(Rosenblatt, 1956; Parzen, 1962; Yakowitz, 1985; Györfi et al., 1985; Terrell and Scott, 1992; Wand and Jones, 1993; Wasserman, 2006; Botev et al., 2010; Giné and Nickl, 2010; Jiang, 2017). Indeed, the kernel density estimator based on equation (5) is given by:

(6) \hat{f}(y) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{\varphi\bigl(R(y_j - X_{ij})\bigr)}{\pi (y_j - X_{ij})},

where X_1, \ldots, X_n represent the sample from the density function f and a finite R is required for the smoothing. We study upper bounds for the bias of the estimator based on R and the sample size n in Theorem 1 and Corollary 1.
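As a concrete one-dimensional sketch of the estimator in equation (6) with the sin kernel (our own illustration; the sample size and R are arbitrary choices), applied to standard Cauchy data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, R = 50_000, 5.0
X = rng.standard_cauchy(n)  # sample from the standard Cauchy density

def f_hat(y, X, R):
    # Estimator (6) in one dimension with phi = sin:
    # average of sin(R*(y - X_i)) / (pi*(y - X_i)) over the sample.
    u = y - X
    return float(np.mean((R / np.pi) * np.sinc(R * u / np.pi)))

est = f_hat(0.0, X, R)
true = 1.0 / np.pi  # standard Cauchy density at 0
```

For the Cauchy density, the expectation of this estimator at 0 works out to (1 - e^{-R})/\pi, so the finite R smooths the estimate while the bias vanishes as R grows.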
In order to find the optimal kernel \varphi, we use the asymptotic mean integrated squared error, which is a key property determining the quality of an estimator; see for example (Wand, 1992, 1994). To ease the findings, we specifically consider the univariate setting, i.e., d = 1, and use the following two terms in that error to determine the optimal kernel: (1) the first term

(7) \int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, ds,

which provides an upper bound on the variance of the density estimator \hat{f}; (2) the second term

(8) \int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, f(s)\, ds,

which yields more precise asymptotic behavior of the variance of the density estimator than that from equation (7). We demonstrate that, by minimizing the first term (7) subject to the constraints (3), the optimal kernel is the sin function (4) in the Fourier integral theorem. This is achieved via a variational approach in ordinary differential equations. On the other hand, by using the Cauchy residue theorem in complex analysis, we prove that the optimal kernel minimizing the second term (8) subject to the constraints (3) is not the sin kernel. This also demonstrates the usefulness of finding integral theorems other than the Fourier integral theorem.
The organization of the paper is as follows. In Section 2, we first revisit the Fourier integral theorem and establish the bias of its density estimator via Riemann sums approximating integrals theorem. Then, using the insight from that theorem, we introduce a general class of integral theorems that possess similar approximation errors. After deriving our class of integral theorems, in Section 3 we study optimal kernels that minimize either the problem (7) or the problem (8) subject to the constraints (3). Finally, we conclude the paper with a few discussions in Section 4 while deferring the proofs of the remaining results in the paper to the Appendix.
2 Integral Theorems
We first study the bias of the kernel density estimator (6), or equivalently the approximation property of the Fourier integral theorem, via the Riemann sum approximation theorem for integrals in Section 2.1. Then, using the insight from that result, we introduce a general class of integral theorems that possesses approximation behavior similar to that of the Fourier integral theorem in Section 2.2.
2.1 The Fourier integral theorem revisited
Before going into the details of the general integral theorem, we reconsider the approximation property of the Fourier integral theorem. In (Ho and Walker, 2021) the authors utilize the tail behavior of the Fourier transform of the underlying function to characterize an approximation error of the Fourier integral theorem when truncating one of the integrals. However, the technique in that proof is inherently based on properties of the sin kernel and is nontrivial to extend to other choices of useful cyclic functions; examples of such functions are given in Section 2.2.
In this paper, we provide insight into the approximation error of the Fourier integral theorem via the Riemann sum approximation theorem for integrals. This insight can be generalized to any cyclic function which integrates to 0 over the cyclic interval, thereby enriching the family of integral theorems beyond Fourier's. To simplify the presentation, we define

(9) f_R(y) = \frac{1}{\pi^d} \int_{\mathbb{R}^d} \prod_{j=1}^{d} \frac{\sin\bigl(R(y_j - x_j)\bigr)}{y_j - x_j}\, f(x)\, dx.
By simple calculation, f_R(y) = \mathbb{E}[\hat{f}(y)], where the expectation is taken with respect to i.i.d. samples X_1, \ldots, X_n from f and \hat{f} is the density estimator (6) when \varphi is the sin kernel. Therefore, to study the bias of the kernel density estimator in equation (6), it is sufficient to consider the approximation error of the Fourier integral theorem; namely, we aim to upper bound |f_R(y) - f(y)| for all y \in \mathbb{R}^d. To obtain the bound, we start with the following definition of the class of univariate functions that we use throughout our study.
Definition 1.
The univariate function f is said to belong to the class T^{K}(\mathbb{R}) for a given positive integer K if the function satisfies the following conditions:

(i) The function f is differentiable and uniformly continuous up to the Kth order, and \lim_{|x| \to \infty} |f^{(k)}(x)| = 0 for any 0 \le k \le K, where f^{(k)} denotes the kth-order derivative of f;

(ii) The integrals \int_{\mathbb{R}} |f^{(k)}(x)|\, dx are finite for all 0 \le k \le K.
Note that a function satisfying the conditions of Definition 1 for a given order also satisfies them for any lower order. Based on Definition 1, we now state the following result.
Theorem 1.
Assume that, for each 1 \le j \le d, the function f belongs to the class T^{K_j}(\mathbb{R}) in its jth coordinate, where K_1, \ldots, K_d are given positive integers. Then there exist universal constants C and \bar{C} depending on d such that, as long as R \ge \bar{C}, we obtain

|f_R(y) - f(y)| \le \frac{C}{R^K} \quad \text{for all } y \in \mathbb{R}^d,

where K = \min_{1 \le j \le d} K_j.
The proof is presented in the Appendix. To appreciate the proof we demonstrate the key idea in the one-dimensional case. Here

f_R(y) - f(y) = \frac{1}{\pi} \int_{\mathbb{R}} \frac{\sin\bigl(R(y - x)\bigr)}{y - x}\, \bigl(f(x) - f(y)\bigr)\, dx,

which we write as

\frac{1}{\pi} \int_{\mathbb{R}} \frac{\sin(s)}{s}\, \bigl(f(y - s/R) - f(y)\bigr)\, ds,

where s = R(y - x). Without loss of generality, we set y = 0 to get

\frac{1}{\pi} \int_{\mathbb{R}} \frac{\sin(s)}{s}\, g(s/R)\, ds,

where g(z) = f(-z) - f(0). Now, due to the cyclic behaviour of the sin function, we can write this as

\frac{1}{\pi} \int_0^{2\pi} \sin(s) \sum_{k=-\infty}^{\infty} \frac{g\bigl((s + 2\pi k)/R\bigr)}{s + 2\pi k}\, ds.

The term \sum_k g\bigl((s + 2\pi k)/R\bigr) / (s + 2\pi k) is a Riemann sum approximation to an integral which converges to a constant, for all s \in (0, 2\pi), as R \to \infty. The overall convergence to 0 is then a consequence of \int_0^{2\pi} \sin(s)\, ds = 0. Hence, it is how the Riemann sum converges to a constant which determines the speed at which f_R(y) \to f(y).
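The speed of this convergence can be observed numerically. The sketch below (our own construction; the Gaussian target, window, and grid sizes are illustrative choices) computes f_R(0) for the standard Gaussian density by quadrature and shows the error shrinking as R grows:

```python
import numpy as np

def trapz(v, x):
    # Simple trapezoidal rule on a grid (kept explicit for portability).
    return float(np.sum((v[1:] + v[:-1]) * np.diff(x)) / 2.0)

def f_R_at_zero(R, half_width=40.0, n_grid=2_000_001):
    # Quadrature for f_R(0) = (1/pi) * int sin(R*x)/x * f(x) dx, f standard Gaussian.
    x = np.linspace(-half_width, half_width, n_grid)
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    kernel = (R / np.pi) * np.sinc(R * x / np.pi)  # sin(R*x)/(pi*x)
    return trapz(kernel * f, x)

true = 1.0 / np.sqrt(2 * np.pi)
err_small_R = abs(f_R_at_zero(2.0) - true)
err_large_R = abs(f_R_at_zero(8.0) - true)
```

For the Gaussian, whose derivatives of every order are integrable, the bias decays very quickly in R, consistent with the bound of Theorem 1.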
2.2 General integral theorem
It is interesting to note that the sin function in the Fourier integral theorem can be replaced by any cyclic function which integrates to 0 over the cyclic interval. In particular, we consider the following general form of integral theorem:

(10) m(y) = \lim_{R\to\infty} \frac{1}{\pi^d} \int_{\mathbb{R}^d} \prod_{j=1}^{d} \frac{\varphi\bigl(R(y_j - x_j)\bigr)}{y_j - x_j}\, m(x)\, dx,

where the univariate function \varphi is a cyclic function on [0, 2\pi] satisfying the constraints (3). A direct example of the function \varphi is \varphi(s) = \sin(s) for all s \in \mathbb{R}, which corresponds to the Fourier integral theorem. Using the proof technique of Theorem 1 and the assumptions on the function in that theorem, we also obtain the following approximation error for the general integral theorem:
Corollary 1.
Assume that the conditions of Theorem 1 hold and the cyclic function \varphi satisfies the constraints (3). Then the error bound of Theorem 1 also holds for the approximation in the general integral theorem (10).
Therefore, we have a general class of integral theorems that possesses approximation errors similar to those of the Fourier integral theorem. Furthermore, the general integral theorems are also able to automatically maintain the dependence structures in the function m.
We now discuss some examples of the function \varphi that have connections to Haar wavelets and splines.
Example 1.
(Haar wavelet integral theorem) We consider the piecewise linear function
where
and
This demonstrates that the derivative of the function \varphi is the Haar wavelet function, which justifies the name “Haar wavelet integral theorem” for this choice of \varphi.
Example 2.
(Spline integral theorem) Here we take into account the piecewise quadratic function
where
A direct calculation shows that
Therefore, the first derivative of \varphi is a piecewise linear function. The particular form of \varphi justifies the name “spline integral theorem” for this choice of kernel function \varphi.
Finally, we would like to highlight that the Haar wavelet and spline integral theorems are just two instances of the general integral theorem. In general, the class of cyclic functions satisfying an integral theorem is vast.
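To make this concrete, the sketch below builds two such cyclic functions (our own illustrative constructions; the paper's exact piecewise forms may differ, and the scaling required by the second constraint in (3) is omitted): a piecewise-linear wave whose derivative is a Haar-type square wave, and a piecewise-quadratic wave whose derivative is a continuous triangle wave. It then checks that each integrates to 0 over one cycle:

```python
import numpy as np

TWO_PI = 2 * np.pi

def phi_haar(s):
    # Piecewise-linear wave on [0, 2*pi): its derivative is a Haar-type
    # square wave (+1 on the first half-cycle, -1 on the second).
    s = np.mod(s, TWO_PI)
    return np.where(s < np.pi, s - np.pi / 2, 3 * np.pi / 2 - s)

def phi_spline(s):
    # Piecewise-quadratic wave: its derivative is a continuous triangle wave.
    s = np.mod(s, TWO_PI)
    out = np.where(
        s < np.pi / 2,
        s**2 / 2,
        np.where(
            s < 3 * np.pi / 2,
            np.pi * s - s**2 / 2 - np.pi**2 / 4,
            s**2 / 2 - 2 * np.pi * s + 2 * np.pi**2,
        ),
    )
    return out - np.pi**2 / 8  # centre so the integral over one cycle is 0

def trapz(v, x):
    return float(np.sum((v[1:] + v[:-1]) * np.diff(x)) / 2.0)

grid = np.linspace(0.0, TWO_PI, 200_001)
i_haar = trapz(phi_haar(grid), grid)
i_spline = trapz(phi_spline(grid), grid)
```

Both waves are continuous across cycle boundaries, and their zero integral over [0, 2\pi] is the key property behind the Riemann-sum argument of Section 2.1.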
3 Optimal Functions
In this section, we discuss optimal functions from the integral theorem with respect to the kernel density estimation problem. To ease the findings, we specifically consider the univariate setting, namely, d = 1. Subject to the constraints in equation (3), as mentioned in the introduction, we consider minimizing either objective (7) or objective (8). We show that the sin function minimizes

\int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, ds

subject to the constraints in equation (3), whereas this is not the case when we introduce a density function, i.e., the aim now being to minimize

\int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, f(s)\, ds

for some density function f, which yields more precise asymptotic behavior of the variance of the density estimator in equation (6).
3.1 The sin function
As we have mentioned, a direct application of the general integral theorem is a Monte Carlo estimator of density functions; (Ho and Walker, 2021). The bias of the estimator \hat{f}, namely \mathbb{E}[\hat{f}(y)] - f(y), has been established in Corollary 1. A natural question to ask concerns the form of the optimal kernel \varphi that leads to a good quality density estimator \hat{f}. To answer that question, we use the asymptotic mean integrated squared error (Wand, 1992, 1994), which is equivalent to finding the optimal kernel \varphi that leads to a good variance for the estimator \hat{f}. A simple calculation shows that
(11) \operatorname{Var}\bigl(\hat{f}(y)\bigr) = \frac{1}{n} \operatorname{Var}\!\left( \frac{\varphi\bigl(R(y - X)\bigr)}{\pi (y - X)} \right)

for any y, where the variance on the right-hand side is taken with respect to the random variable X following the density function f. With the assumption that f(x) \le \bar{f} for all x, we can upper bound the variance of \hat{f}(y) as follows:

(12) \operatorname{Var}\bigl(\hat{f}(y)\bigr) \le \frac{1}{n} \int_{-\infty}^{\infty} \frac{\varphi^2\bigl(R(y - x)\bigr)}{\pi^2 (y - x)^2}\, f(x)\, dx \le \frac{R \bar{f}}{n \pi^2} \int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, ds.

The integral in the upper bound in equation (12) is convenient as it does not involve the function f. It indicates that the optimal kernel minimizing that integral is independent of f, which also yields good insight into the behavior of the variance of the density estimator for all R and n. Therefore, we consider minimizing the upper bound (12) subject to the constraints that \varphi is an almost everywhere differentiable cyclic function on [0, 2\pi] satisfying (3). It is equivalent to solving the objective

\min_{\varphi} \int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, ds

such that \varphi satisfies (3). This is the objective function (7) that we mentioned in the introduction.
To study the optimal function \varphi that satisfies these constraints, we define the following functions:

A(y) = \sum_{k=-\infty}^{\infty} \frac{1}{y + 2\pi k}, \qquad B(y) = \sum_{k=-\infty}^{\infty} \frac{1}{(y + 2\pi k)^2},

for any y \in (0, 2\pi). Now, we would like to prove that

(13) A(y) = \frac{1}{2} \cot\!\left(\frac{y}{2}\right), \qquad B(y) = \frac{1}{4 \sin^2(y/2)}.

In fact, from the infinite product representation of the sinc function, we have

(14) \frac{\sin(y)}{y} = \prod_{k=1}^{\infty} \left( 1 - \frac{y^2}{\pi^2 k^2} \right)

for any y \in \mathbb{R}. By taking the logarithm of both sides of the equation and taking the derivative with respect to y, we obtain

\cot(y) = \frac{1}{y} + \sum_{k=1}^{\infty} \frac{2y}{y^2 - \pi^2 k^2} = \sum_{k=-\infty}^{\infty} \frac{1}{y + \pi k}.

By the change of variable y \to y/2, we obtain the conclusion that A(y) = \frac{1}{2}\cot(y/2). The form of B can be obtained directly by taking the derivative of A, since B(y) = -A'(y). Therefore, we obtain the conclusion of claim (13).
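Both identities in (13) are classical and easy to check numerically by comparing truncated symmetric partial sums against the closed forms (the truncation level and evaluation point are arbitrary choices of ours):

```python
import math

y = 1.3
K = 20_000

# Truncated symmetric partial sums of
#   A(y) = sum_k 1/(y + 2*pi*k)   (terms paired so the sum converges)
#   B(y) = sum_k 1/(y + 2*pi*k)^2
a_sum = 1.0 / y
b_sum = 1.0 / y**2
for k in range(1, K + 1):
    a_sum += 1.0 / (y + 2 * math.pi * k) + 1.0 / (y - 2 * math.pi * k)
    b_sum += 1.0 / (y + 2 * math.pi * k) ** 2 + 1.0 / (y - 2 * math.pi * k) ** 2

a_closed = 0.5 / math.tan(y / 2)               # (1/2) * cot(y/2)
b_closed = 1.0 / (4 * math.sin(y / 2) ** 2)    # 1 / (4 * sin^2(y/2))
```

Note that the sum for A is only conditionally convergent, which is why the positive-k and negative-k terms are added in pairs.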
Now, we state our main result for the optimal kernel solving the objective function (7).
Theorem 2.
The optimal function minimizing the objective function (7) subject to the constraints (3) is \varphi(s) = \sin(s).

Interestingly, if we consider a truncation at k = 1 of the infinite product representation (14) of the sinc function, which corresponds to the optimal kernel solving objective function (7), we obtain (after rescaling) the Epanechnikov kernel (Epanechnikov, 1967; Mueller, 1984), which is given by K(y) = \frac{3}{4}(1 - y^2) for |y| \le 1 and 0 otherwise. This kernel has been shown to have optimal efficiency among nonnegative kernels that are differentiable up to the second order (Tsybakov, 2009). Direct calculation shows that the Epanechnikov kernel leads to a larger value of the variance term than the sinc kernel does.
Therefore, if we use the term (7) as an indication of the quality of our variance, the Epanechnikov kernel is not better than the sin kernel from the Fourier integral theorem. This also aligns with an observation from Tsybakov (2009) that, without restricting to nonnegative kernels, we can construct better kernels, which can take negative values, than the Epanechnikov kernel.
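One way to see this numerically (the use of the roughness integral \int K^2(y)\, dy as the comparison quantity is our illustrative choice) is to compare the normalized sinc kernel \sin(y)/(\pi y), whose roughness is 1/\pi, with the Epanechnikov kernel, whose roughness is 3/5:

```python
import numpy as np

def trapz(v, x):
    return float(np.sum((v[1:] + v[:-1]) * np.diff(x)) / 2.0)

# Roughness integral of K^2 for the normalized sinc kernel sin(y)/(pi*y).
y = np.linspace(-200.0, 200.0, 2_000_001)
sinc_kernel = np.sinc(y / np.pi) / np.pi
r_sinc = trapz(sinc_kernel**2, y)   # -> 1/pi, up to the truncated tail

# Roughness integral of K^2 for the Epanechnikov kernel on [-1, 1].
u = np.linspace(-1.0, 1.0, 200_001)
epanechnikov = 0.75 * (1.0 - u**2)
r_epan = trapz(epanechnikov**2, u)  # -> 3/5
```

Since 1/\pi \approx 0.318 < 3/5, the sinc kernel has the smaller roughness, matching the comparison in the text.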
Proof.
Since the function \varphi is cyclic on [0, 2\pi], we have

\int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, ds = \int_0^{2\pi} \varphi^2(y)\, B(y)\, dy.

Similarly, we also obtain that

\int_{-\infty}^{\infty} \frac{\varphi(s)}{s}\, ds = \int_0^{2\pi} \varphi(y)\, A(y)\, dy.

Given the above equations, the original problem can be rewritten as follows:

(15) \min_{\varphi} \int_0^{2\pi} \varphi^2(y)\, B(y)\, dy

such that \int_0^{2\pi} \varphi(y)\, dy = 0 and \int_0^{2\pi} \varphi(y)\, A(y)\, dy = \pi.

The Lagrangian function corresponding to the objective function (15) takes the form:

L(\varphi, \lambda_1, \lambda_2) = \int_0^{2\pi} \left[ \varphi^2(y) B(y) + \lambda_1 \varphi(y) + \lambda_2 \varphi(y) A(y) \right] dy.

Since \varphi is almost everywhere differentiable, to find the \varphi that minimizes the function L, we use the Euler–Lagrange equation, see for example (Young, 1969), which entails that

2 \varphi(y) B(y) + \lambda_1 + \lambda_2 A(y) = 0.

This equation leads to

\varphi(y) = -\frac{\lambda_1 + \lambda_2 A(y)}{2 B(y)} = c_1 (1 - \cos y) + c_2 \sin y,

where c_1 and c_2 can be determined by solving the conditions (3).
Given the form of the optimal \varphi and the forms of A, B in equation (13), the first condition in equation (3) leads to

\int_0^{2\pi} \varphi(y)\, dy = 2\pi c_1 = 0.

That equation demonstrates that c_1 = 0. Given that, \varphi(y) = c_2 \sin(y). Now, the second condition in equation (3) indicates that

\int_0^{2\pi} c_2 \sin(y)\, A(y)\, dy = \pi c_2 = \pi.

That equation leads to c_2 = 1. Therefore, we have the optimal kernel \varphi(y) = \sin(y) for all y. As a consequence, we obtain the conclusion of the theorem. ∎
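A quick numerical check (our own; the integration windows are truncations of slowly decaying tails) confirms that \varphi(s) = \sin(s) satisfies both constraints in (3) and attains the value \pi for objective (7):

```python
import numpy as np

def trapz(v, x):
    return float(np.sum((v[1:] + v[:-1]) * np.diff(x)) / 2.0)

# First constraint in (3): the integral of sin over one cycle is 0.
s = np.linspace(0.0, 2 * np.pi, 100_001)
c_cycle = trapz(np.sin(s), s)

# Second constraint and objective (7) for phi = sin, on a large truncated window.
t = np.linspace(-400.0, 400.0, 4_000_001)
sinc_t = np.sinc(t / np.pi)      # sin(t)/t
c_norm = trapz(sinc_t, t)        # -> pi, up to a slowly decaying tail
objective = trapz(sinc_t**2, t)  # -> pi as well
```

The slow tail decay of \sin(t)/t explains the modest accuracy of the second check; the squared integrand converges faster.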
3.2 Alternative optimal functions
In the previous section we saw that the sin function is optimal for minimizing (7) subject to the constraints (3), namely, \int_0^{2\pi} \varphi(s)\, ds = 0 and \int_{-\infty}^{\infty} \varphi(s)/s\, ds = \pi, with \varphi being a cyclic function on [0, 2\pi]. That objective function stems from the upper bound (12) on the variance of the density estimator \hat{f}.
In this section we demonstrate the usefulness of finding integral theorems other than Fourier's by finding the optimal function \varphi, also satisfying the constraints (3), which minimizes the leading term of the variance of \hat{f}, given by:

\int_{-\infty}^{\infty} \frac{\varphi^2(s)}{s^2}\, f(s)\, ds

for some density function f on \mathbb{R}. Since the above integral captures the leading term of the variance of \hat{f}, it gives more precise asymptotic behavior of the variance than the term (7). As that integral involves the density function f, the optimal kernel will also depend on f. To illustrate our findings, we specifically consider the setting where f is the Cauchy density, i.e., f(s) = 1/\bigl(\pi(1 + s^2)\bigr) for all s \in \mathbb{R}, which is useful when modeling heavy-tailed distributions and, moreover, allows us to solve the relevant equations.
The proof idea for obtaining the optimal kernel is the same as that of Theorem 2. The only thing that is different is that we need to find a new function playing the role of B, which we refer to as B_f; for the Cauchy density it is given (up to the factor 1/\pi) by:

B_f(y) = \sum_{k=-\infty}^{\infty} \frac{1}{(y + 2\pi k)^2 \bigl(1 + (y + 2\pi k)^2\bigr)}.

To find the closed-form expression of B_f, we utilize contour integration from complex analysis (see for example (Priestley, 1985)); so consider

\oint_{C_N} \frac{\cot\bigl((z + y)/2\bigr)}{z^2 (1 + z^2)}\, dz,

where C_N is a circle in the complex plane of radius (2N + 1)\pi around the origin. The simple poles occur at z = 2\pi k - y for all integers k, giving a total residue of 2 B_f(y), since the relevant coefficient in the Laurent expansion of \cot\bigl((z + y)/2\bigr) is 2; also at z = \pm i, for which the residues are -\cot\bigl((y + i)/2\bigr)/(2i) and \cot\bigl((y - i)/2\bigr)/(2i). There is a double pole at z = 0, for which the residue is the first derivative of

\frac{\cot\bigl((z + y)/2\bigr)}{1 + z^2}

evaluated at z = 0. From direct calculation, this term is -\frac{1}{2} \csc^2(y/2).
Now using the Cauchy residue theorem and noting that the contour integral vanishes as N \to \infty, and expanding \cot\bigl((y \pm i)/2\bigr), we obtain

B_f(y) = \frac{1}{4 \sin^2(y/2)} - \frac{\sinh(1)}{2 \bigl(\cosh(1) - \cos y\bigr)}.
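Assuming the sum in question is \sum_k 1/\bigl((y + 2\pi k)^2 (1 + (y + 2\pi k)^2)\bigr) (the Cauchy density contributes an additional factor 1/\pi, which scales both sides equally), the closed form produced by the residue technique can be verified numerically:

```python
import math

def partial_sum(y, K=20_000):
    # Truncated version of sum_k 1/((y + 2*pi*k)^2 * (1 + (y + 2*pi*k)^2)).
    total = 0.0
    for k in range(-K, K + 1):
        u = y + 2 * math.pi * k
        total += 1.0 / (u**2 * (1.0 + u**2))
    return total

def closed_form(y):
    # Closed form delivered by the residue computation.
    return (1.0 / (4 * math.sin(y / 2) ** 2)
            - math.sinh(1.0) / (2 * (math.cosh(1.0) - math.cos(y))))

y = 1.3
lhs_val = partial_sum(y)
rhs_val = closed_form(y)
```

The summand decays like 1/k^4, so a modest truncation already matches the closed form to high precision.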
As shown in the proof of Theorem 2, the optimal function is of the form

(16) \varphi(y) = -\frac{\lambda_1 + \lambda_2 A(y)}{2 B_f(y)},

where recall that A(y) = \frac{1}{2}\cot(y/2), with the \lambda_1, \lambda_2 values now determined by the constraints (3). Different from Theorem 2, we do not have closed-form expressions for \lambda_1 and \lambda_2. However, numerical integration can be used to determine the values that meet the constraints (3). A picture of the optimal function is given in Figure 1.
To show the benefit of the new optimal kernel in equation (16) over the sin kernel in the Fourier integral theorem, in Figure 2 we plot the kernel density estimators using both the optimal kernel and the sin kernel, alongside the difference between the two estimators. This is based on a sample from a standard Cauchy density. There is apparently little difference between the two. However, when computing the square of the integral in equation (8) we note that for the optimal kernel we obtain a value of 0.0416 and for the sin kernel a value of 0.0774.
We note in passing that such a result is not restricted to the setting where the samples are generated from the Cauchy distribution; indeed, when we have samples from a Gaussian distribution, using the corresponding optimal kernel also yields a slightly better variance than the sin kernel from the Fourier integral theorem. We leave a detailed investigation of the benefit of the optimal kernels over the sin kernel for general settings of the density estimation problem for future work.

4 Discussion
In this paper we have introduced a general class of integral theorems. These integral theorems provide natural Monte Carlo density estimators that can automatically preserve the dependence structure of a dataset. In the univariate density estimation setting, we demonstrate that the Fourier integral theorem is the optimal integral theorem when we minimize the square integral in equation (7), a term that indicates a good variance of the density estimator when we do not want the density function to appear in the upper bound of the variance. To show the benefit of the general class of integral theorems, we also consider the optimal kernel minimizing the term (8), which captures the variance of the kernel density estimator more precisely. Our study proves that the optimal kernels for that objective function are generally not the sin kernel from the Fourier integral theorem.
Now, we discuss a few future directions from this work. First, in this work, we only obtain the optimal kernels in our general class of integral theorems for the density estimation task. It is important to study the optimal kernels for other statistical estimation tasks, such as nonparametric (modal) regression and mode clustering. Second, the recent work of Lee-Thorp et al. (2021)
proposes using double Fourier transforms to approximate a nonparametric function that can capture well both the correlation of words in each sequence and the correlation of sequences in natural language processing tasks. Given our study of general integral theorems, it is of interest to investigate whether we can develop a general notion of double Fourier transforms in a similar way to the integral theorems, and whether the choice of double Fourier transforms is optimal for estimating the nonparametric functions arising in natural language processing tasks.
5 Appendix
In this appendix, we give the proof of Theorem 1. To ease the presentation, the values of universal constants (e.g., C, \bar{C}, etc.) can change from line to line.
5.1 Proof of Theorem 1
We first prove the result of Theorem 1 when d = 1. In particular, we would like to show that when the function f belongs to the class T^{K}(\mathbb{R}) of Definition 1, there exists a universal constant C such that we have

|f_R(y) - f(y)| \le \frac{C}{R^K} \quad \text{for all } y \in \mathbb{R}.
In fact, from the definition of f_R in equation (9) with d = 1, we have

f_R(y) = \frac{1}{\pi} \int_{\mathbb{R}} \frac{\sin\bigl(R(y - x)\bigr)}{y - x}\, f(x)\, dx.

Invoking the change of variables s = R(y - x), the above equation becomes

(17) f_R(y) = \frac{1}{\pi} \int_{\mathbb{R}} \frac{\sin(s)}{s}\, f\!\left(y - \frac{s}{R}\right) ds.
Since f \in T^{K}(\mathbb{R}), the function f is differentiable up to the Kth order. Therefore, using a Taylor expansion up to the Kth order leads to

f\!\left(y - \frac{s}{R}\right) = \sum_{k=0}^{K-1} \frac{(-s/R)^k}{k!}\, f^{(k)}(y) + \frac{(-s/R)^K}{K!}\, f^{(K)}\!\left(y - \frac{\theta s}{R}\right)

for some \theta \in (0, 1). Plugging the above Taylor expansion into equation (17), we have
(18) 
where, for , we define
We now find a bound for each of these terms; we will demonstrate that
(19) 
where C is some universal constant. To obtain these bounds, we will use an inductive argument on the order k. We first start with k = 1. In fact, we have
An application of Taylor expansion leads to
Now for any and , we have
Collecting the above results, we find that
Using the Riemann sum approximation theorem for integrals, we have
where the finiteness of the integral is due to the assumption that f \in T^{K}(\mathbb{R}). Furthermore, the above limit is uniform in its argument as the relevant derivative of f is uniformly continuous. Collecting the above results, there exists a universal constant \bar{C} such that as long as R \ge \bar{C}, the following inequality holds:
where is some universal constant. Combining all of the previous results, we obtain
Since f \in T^{K}(\mathbb{R}), using integration by parts, we get
Therefore, we obtain the conclusion of equation (19) when k = 1.
Now assume that the conclusion of equation (19) holds up to order k - 1. We will prove that the conclusion also holds for order k. With a similar argument to the setting k = 1, we obtain
(20) 
Using a Taylor expansion, we have