1 Introduction
Random Fourier features (RFF) rahimi2007random
are a powerful technique in kernel-based learning that samples random features from a distribution (obtained via the Fourier transform) to approximate the kernel function. They bring promising performance and solid theoretical guarantees for scaling up kernel methods in classification
sun2018but; li2019towards, nonlinear component analysis xie2015scale; lopez2014randomized, and the neural tangent kernel (NTK) jacot2018neural; arora2019exact. It is noteworthy that random features can be regarded as a class of two-layer neural networks
belkin2019reconciling in the lazy regime chizat2019lazy, and thus can be utilized to analyze over-parameterized neural networks belkin2019reconciling; arora2019exact. Due to the great success of RFF in the machine learning community, Rahimi and Recht won the Test-of-Time Award at NeurIPS 2017 for their seminal work on RFF
rahimi2007random, and Li et al. li2019towards won an Honorable Mention (best paper finalist) at ICML 2019 for their unified theoretical analysis of RFF. The theoretical foundation behind RFF is intuitive: a positive definite (PD) function corresponds to a nonnegative and finite Borel measure, i.e., a probability distribution, via the Fourier transform. We can then sample random features from this distribution so as to approximate the PD function.
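For the PD case, this sampling recipe is the standard RFF construction. The following sketch (our illustration, not code from the original work; the Gaussian kernel, bandwidth, and function name are chosen only for demonstration) samples frequencies from the kernel's spectral measure and builds cosine features:

```python
import numpy as np

def rff_features(X, D, sigma=1.0, seed=0):
    """Random Fourier features for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    By Bochner's theorem its spectral measure is N(0, sigma^{-2} I),
    so we sample frequencies w from it and use random-phase cosines."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))  # w ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# sanity check: Z Z^T converges to the exact Gaussian kernel matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = rff_features(X, D=20000)
K_approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / 2.0)
```

With enough features the Monte Carlo estimate concentrates around the exact kernel at the usual O(1/sqrt(D)) rate.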
Theorem 1 (Bochner’s Theorem bochner2005harmonic).
Let $k(\cdot,\cdot)$ be a bounded continuous function satisfying the shift-invariant property, i.e., $k(x, y) = k(x - y)$. Then $k$ is positive definite if and only if it is the (conjugate) Fourier transform of a nonnegative and finite Borel measure $\mu$, i.e., $k(x - y) = \int_{\mathbb{R}^d} \exp\big(\mathrm{i}\, \omega^\top (x - y)\big)\, \mathrm{d}\mu(\omega)$.
Here, $k$ denotes the kernel function throughout this paper. Typically, the kernel in practical use is real-valued and thus the imaginary part can be discarded, i.e., $k(x - y) = \int_{\mathbb{R}^d} \cos\big(\omega^\top (x - y)\big)\, \mathrm{d}\mu(\omega)$. It is clear that Bochner’s theorem requires the kernel function to be (i) shift-invariant (also called “stationary”) and (ii) positive definite. These two conditions together exclude a series of commonly used kernels, including

Dot-product kernels (in this case, the Fourier basis functions are spherical harmonics smola2001regularization): polynomial kernels smola2001regularization, arc-cosine kernels cho2009kernel, and the NTK jacot2018neural.

Indefinite kernels in a reproducing kernel Kreĭn space (RKKS) Cheng2004Learning: (i) a linear combination of positive definite kernels, e.g., the Delta-Gaussian kernel oglic18a; (ii) conditionally positive definite kernels, e.g., the log kernel and the power kernel boughorbel2005conditionally.

Specifically designed but indefinite kernels: the Gaussian kernel on manifolds Feragen2015Geodesic, polynomial kernels on spheres pennington2015spherical, the TL1 kernel huang2017classification, the kernel of smola2001regularization, and indefinite kernels of arbitrary types.
Pennington et al. pennington2015spherical point out that dot-product kernels with normalized data, i.e., data on the unit sphere, can be shift-invariant but not always PD. Therefore, in this paper we consider stationary kernels, either positive definite or indefinite, that admit the following integral representation:
(1) 
where the integral region is chosen as one of two cases in this paper and the involved coefficients are nonnegative. Here, $\mu$ is a signed measure athreya2006measure, i.e., a generalized measure that is allowed to take negative values. It admits the Jordan decomposition kubrusly2015essentials $\mu = \mu_+ - \mu_-$, where $\mu_+$ and $\mu_-$ are two nonnegative measures, at least one of which is finite. The total mass is defined as $\|\mu\| := \mu_+(\Omega) + \mu_-(\Omega)$. To make the sampling process feasible in an algorithmic implementation, the total mass is required to be finite, i.e., $\|\mu\| < \infty$. The basis functions are not limited to complex exponentials as in the Fourier transform (or in the sense of tempered distributions) and can be extended to other formulations. The integral representation in Eq. (1) is general enough to cover various kernels, e.g., PD kernels, dot-product kernels on the unit sphere, and indefinite kernels. Note that the Fourier transform of a non-PD kernel, i.e., the measure $\mu$, cannot be regarded as a nonnegative Borel measure, or even a measure. Analysis of non-PD kernels is often based on RKKS, but we do not know whether a non-PD kernel can be identified with a reproducing kernel in an RKKS, which in fact corresponds to a long-standing open question Cheng2004Learning; Guo2017Optimal; Huang2016Indefinite: does every indefinite kernel have a positive decomposition? The introduced signed measure decomposition offers an accessible way to answer this question, and accordingly we make the following contributions:

In Section 3, by introducing the signed measure, we generalize RFF to a series of kernels that are not positive definite. We provide a sufficient and necessary condition to answer the above open question in RKKS via the measure decomposition technique. Moreover, this condition guides us in finding a specific positive decomposition in practice, so that we can devise a sampling strategy to obtain randomized feature maps. To the best of our knowledge, this is the first work to provide unbiased estimators for non-PD kernel approximation by random features.

In Section 4, we demonstrate the feasibility of our random features algorithm on several indefinite kernels. We begin with an intuitive example, a linear combination of positive definite kernels, and then consider dot-product kernels on the unit sphere for kernel approximation. Moreover, we prove that the popular NTK of a two-layer ReLU network is shift-invariant but not positive definite when the data are normalized, which motivates us to reconsider its induced functional spaces and related properties.

In Section 5, we evaluate various non-PD kernels on several typical large-scale datasets in terms of kernel approximation and the subsequent classification task. Our experimental results validate the theoretical claims and demonstrate the effectiveness of the proposed kernel approximation algorithm.
2 Related Work
A line of research focuses on non-stationary kernel approximation, e.g., approximating polynomial kernels by the Maclaurin expansion kar2012random or the tensor sketch technique Pham2013Fast; approximating additive kernels li2010random; Vedaldi2012Efficient; and approximating the Gaussian kernel on the unit sphere by the discrete cosine transform kafai2018croification. However, the kernels considered in these works are still positive definite, and the designed approximation algorithms are infeasible for indefinite kernels. Regarding non-PD kernel approximation, we are aware of four papers on this task: (i) Pennington et al. pennington2015spherical find that the polynomial kernel on the unit sphere is not PD, and they use a PD kernel (associated with a positive sum of Gaussian distributions) to approximate it;
(ii) Liu et al. liu2019double decompose (a subset of) the kernel matrix into two PD kernel matrices, and then learn their respective randomized feature maps by infinite Gaussian mixtures. However, this approach in fact focuses on approximating kernel matrices rather than kernel functions; (iii) Mehrkanoon et al. mehrkanoon2018indefinite investigate Nyström approximation for indefinite kernels in spectral learning; (iv) Oglic and Gärtner oglic2019scalable propose Nyström methods for low-rank approximation of indefinite kernels in RKKS. The first two works are based on RFF, and the last two focus on Nyström approximation. To date, approximating non-PD kernels by random features cannot ensure unbiasedness and has not yet been fully investigated. Instead, based on the measure decomposition technique, our work achieves both simplicity and effectiveness by (i) providing an unbiased estimator and (ii) incurring no extra parameters.
3 Randomized Feature Map via Signed Measure
In this section, we begin with the concept of signed measures, answer the open question, and then devise the sampling strategy for random features. A function is called radial if its value depends only on the distance $\|x - y\|$, where the distance is usually defined by the Euclidean norm. Note that the stationary kernels considered in this paper are all radial; accordingly, their Fourier transforms are also radial.
3.1 Signed measure
Let $\mu$ be a measure on a set $\Omega$ satisfying $\mu(\emptyset) = 0$ and countable additivity. We call $\mu$ a finite measure if $\mu(\Omega) < \infty$. Specifically, $\mu$ is a probability measure if $\mu(\Omega) = 1$, and the triple $(\Omega, \mathcal{A}, \mu)$ is referred to as the corresponding probability space. Here we consider the signed measure, a generalized version of a measure that allows for negative values.
Definition 1.
(Signed measure athreya2006measure) Let $\Omega$ be some set and $\mathcal{A}$ be a $\sigma$-algebra of subsets of $\Omega$. A signed measure is a function $\mu: \mathcal{A} \to [-\infty, +\infty]$ satisfying $\mu(\emptyset) = 0$ and countable additivity.
Based on the definition of signed measures, the following theorem shows that any signed measure can be represented as the difference of two nonnegative measures.
Theorem 2.
(Jordan decomposition kubrusly2015essentials) Let $\mu$ be a signed measure defined on the $\sigma$-algebra $\mathcal{A}$ as given in Definition 1. There exist two (nonnegative) measures $\mu_+$ and $\mu_-$, at least one of them finite, such that $\mu = \mu_+ - \mu_-$.
The total mass of $\mu$ on $\Omega$ is defined as $\|\mu\| := \mu_+(\Omega) + \mu_-(\Omega)$. Note that this decomposition is not unique. It can be characterized by the Hahn decomposition theorem doss1980hahn: the space $\Omega$ can be decomposed as $\Omega = \Omega^+ \cup \Omega^-$ with $\Omega^+ \cap \Omega^- = \emptyset$. Here, $\Omega^+$ is a positive set for $\mu$, i.e., $\mu(A) \geq 0$ for all measurable subsets $A \subseteq \Omega^+$, while $\Omega^-$ is a negative set, i.e., $\mu(A) \leq 0$ for all measurable subsets $A \subseteq \Omega^-$.
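For a discrete signed measure supported on finitely many atoms, the Jordan decomposition and the total mass can be computed directly from the Hahn sets. The sketch below is our own illustration (the weights are arbitrary), not part of the paper's algorithm:

```python
import numpy as np

def jordan_decomposition(weights):
    """Jordan decomposition of a discrete signed measure given by atom
    weights: mu = mu_plus - mu_minus, obtained by restricting mu to the
    Hahn sets {w > 0} and {w < 0}; also returns the total mass |mu|."""
    w = np.asarray(weights, dtype=float)
    mu_plus = np.where(w > 0, w, 0.0)    # restriction to the positive set
    mu_minus = np.where(w < 0, -w, 0.0)  # restriction to the negative set
    total_mass = mu_plus.sum() + mu_minus.sum()
    return mu_plus, mu_minus, total_mass

mu_plus, mu_minus, mass = jordan_decomposition([0.5, -0.2, 0.3, -0.1])
# mu_plus - mu_minus recovers the original weights; mass = 1.1
```

This minimal decomposition (supported on disjoint sets) is the one used implicitly when splitting a tabulated signed density for sampling.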
3.2 Answering the open question in RKKS
A reproducing kernel Kreĭn space (RKKS) bognar1974indefinite; Ga2016Learning is an inner-product space, analogous to a reproducing kernel Hilbert space (RKHS), that can be decomposed into a direct sum $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ of two RKHSs $\mathcal{H}_+$ and $\mathcal{H}_-$. The key difference from an RKHS is that inner products may be negative in an RKKS, i.e., there exists $f \in \mathcal{K}$ such that $\langle f, f \rangle_{\mathcal{K}} < 0$. RKKS provides a justification for analyzing indefinite kernels, as it admits a positive decomposition bognar1974indefinite such that $k = k_+ - k_-$, where $k$ is the reproducing kernel associated with the RKKS, and the two PD kernels $k_+$ and $k_-$ are the reproducing kernels associated with $\mathcal{H}_+$ and $\mathcal{H}_-$, respectively. Apparently, this decomposition is not necessarily unique. Preliminaries on RKKS can be found in Supplementary Materials A.
It is important to note that not every indefinite kernel admits a representation as a difference between two positive definite kernels. In other words, apart from some intuitive examples, e.g., a linear combination of PD kernels, we do not know how to verify that an indefinite kernel can be associated with an RKKS. In the past, one usually assumed in practice that a (reproducing) indefinite kernel lies in an RKKS, even though this theoretical gap cannot be ignored. By virtue of the Jordan decomposition of signed measures, we provide a sufficient and necessary condition in Theorem 3 to answer the long-standing open question in RKKS: does every indefinite kernel have a positive decomposition? Moreover, this condition serves as guidance for finding a specific positive decomposition in practice.
Theorem 3.
Assume that an indefinite kernel $k$ is stationary, i.e., $k(x, y) = k(x - y)$, and denote its (generalized) Fourier transform by the measure $\mu$. Then $k$ can be identified with a reproducing kernel in an RKKS if and only if the total mass of the measure, excluding the origin, is finite.
Proof.
The proof can be found in Supplementary Materials B. ∎
Remark: We provide an explicit sufficient and necessary condition linking the Jordan decomposition of signed measures to the positive decomposition in RKKS.
We make the following remarks.
(i) Theorem 3 provides a way, via the Fourier transform, to verify whether a (reproducing) indefinite kernel belongs to an RKKS or not. The measure decomposition is much easier to find than the positive decomposition in RKKS, which cannot be verified in practice.
In the next section, we give some examples, including a linear combination of PD kernels and dot-product kernels on the unit sphere, to illustrate our condition in practice.
(ii) Theorem 3 also covers some non-square-integrable kernel functions, e.g., conditionally positive definite kernels wendland2004scattered, for which the standard Fourier transform does not exist.
In this case, we need to consider the Fourier transform in Schwartz space donoghue2014distributions.
For example, Theorem 2.3 in sun1993conditionally demonstrates that conditionally positive definite kernels correspond to a positive Borel measure together with an analytic function in Schwartz space.
(iii) In addition to the above-mentioned indefinite kernels, based on Theorem 3 we can also verify various other indefinite kernels, e.g., the TL1 kernel huang2017classification, the kernel of smola2001regularization, and any distance-based kernel function, in a similar way.
Specifically, their Fourier transforms (the $d$-dimensional integrations) are still radial, which makes the Fourier transform easier to compute.
3.3 Randomized feature map
Based on Theorem 3, we are ready to develop our random features algorithm for non-PD kernels. Eq. (1) can be reformulated as follows:
(2) 
where $\mu_+$ and $\mu_-$ are two finite nonnegative measures, with corresponding normalized (probability) Borel measures obtained by dividing by their total masses. According to Bochner’s theorem, these two Borel measures can be associated with two PD kernels $k_+$ and $k_-$, respectively. Therefore, Monte Carlo sampling can be used to approximate $k$ via an explicit feature mapping given by
(3) 
where the random features are sampled from the two normalized measures, respectively. It can easily be seen from Eqs. (2) and (3) that this estimator is unbiased. The real and imaginary parts of the feature map correspond to $\mu_+$ and $\mu_-$, respectively. Moreover, in Eq. (3), the imaginary part can be interpreted as rotating the corresponding feature vector by 90 degrees. This is consistent with the RKKS associated with $k$ being an orthogonal direct sum $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$, where $\mathcal{H}_+$ and $\mathcal{H}_-$ are the two RKHSs associated with $k_+$ and $k_-$. Although we introduce the imaginary unit into the feature mapping, the computed kernel approximation remains real-valued. The complete random features process is summarized in Algorithm 1. For any given kernel, the required measures can be precomputed independently of the training data. In this way, our algorithm achieves the same time and memory complexity as standard RFF. The formulation in Eq. (2), as well as Algorithm 1, is general enough to cover various PD and non-PD kernels. Stationary PD kernels admit Eq. (2) by choosing $\mu_- = 0$, in which case $\mu = \mu_+$ is associated with a probability measure after normalization. Hence, Bochner’s theorem can be regarded as a special case of the integral representation (2) considered in this paper. In the next section, we demonstrate the feasibility of Algorithm 1 on several typical indefinite kernels, including a linear combination of PD kernels, dot-product kernels on the unit sphere, etc.
4 Examples
In this section, we investigate a series of indefinite kernels for a better understanding of our random features algorithm. We begin with an intuitive example, an indefinite linear combination of PD kernels. Then we employ several dot-product kernels on the unit sphere, including the polynomial kernel pennington2015spherical, the arc-cosine kernel cho2009kernel, and the NTK of a two-layer ReLU network bietti2019inductive. Note that by considering these dot-product kernels on the unit sphere, we in fact conduct an equivalent preprocessing step of normalizing the data to unit norm. This operation is common to avoid the unboundedness of dot-product kernels pennington2015spherical; Hamid2014Compact, and it is beneficial for theoretical analysis bietti2019inductive; ghorbani2019linearized; gao2019convergence.
A linear combination of positive definite kernels: Kernels in this class have the formulation $k = \sum_i a_i k_i$, i.e., a linear combination of PD kernels $k_i$ with real coefficients $a_i$. This is a typical example of indefinite kernels in RKKS, which admits a positive decomposition $k = k_+ - k_-$ with two PD kernels $k_+$ and $k_-$. Theorem 3 guides us to find this decomposition based on the signs of the coefficients $a_i$: we explicitly decompose an indefinite kernel in this class into the difference of two PD kernels. The corresponding nonnegative measures can then be obtained thanks to the additivity of the Fourier transform.
We take the Delta-Gaussian kernel oglic18a, a difference of two Gaussian kernels with different bandwidths, as an example. The two nonnegative measures in Eq. (2) are the (scaled) Gaussian spectral measures of the two components, and the random feature mapping is given by Eq. (3) by sampling from each of them.
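The indefiniteness of such a difference kernel can be seen directly from its Gram matrix, which has eigenvalues of both signs. The bandwidths and weight below are hypothetical, chosen only for illustration:

```python
import numpy as np

# A Delta-Gaussian-style kernel: difference of a narrow and a wide
# Gaussian kernel (parameters are illustrative, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
r2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-r2 / (2 * 0.5**2)) - 0.8 * np.exp(-r2 / (2 * 2.0**2))

# an indefinite kernel yields a Gram matrix with eigenvalues of both signs
eigs = np.linalg.eigvalsh(K)
```

A PD kernel would produce only nonnegative eigenvalues here; the negative spectrum is exactly what rules out the classical Bochner construction and motivates the signed-measure treatment.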
After this simple and intuitive warm-up example, we now discuss some more sophisticated indefinite kernels, e.g., dot-product kernels on the unit sphere, and demonstrate the feasibility of our random features algorithm.
Polynomial kernels on the sphere: Pennington et al. pennington2015spherical point out that a polynomial kernel restricted to data on the unit sphere becomes shift-invariant. This kernel is indefinite, since its Fourier transform is not a nonnegative measure, as shown in pennington2015spherical:
(4) 
which results from the oscillatory behavior of the Bessel function of the first kind. We demonstrate that the total mass is finite (see Supplementary Materials C), which makes the integral representation in Eq. (2) and our random features algorithm feasible. Since the measure is signed, it can be decomposed into two nonnegative measures by Eq. (4), that is, $\mu = \mu_+ - \mu_-$. The random feature map for this kernel is then also given by Eq. (3), so Algorithm 1 is suitable for this kernel; the two distributions can be numerically acquired from a set of 10,000 uniformly generated samples over a bounded range. Based on the above result, we conclude that polynomial kernels on the unit sphere can be associated with an RKKS, as shown in Figure 1 with the decomposition of the measure. Compared to pennington2015spherical, which uses a positive sum of Gaussians to approximate the kernel and requires the Gaussian parameters to be optimized beforehand, our algorithm achieves both simplicity and effectiveness by (i) providing an unbiased estimator and (ii) incurring no extra parameters.
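The numerical acquisition step described above can be sketched as follows: tabulate the signed density on a grid, split it into its positive and negative parts, and sample from each normalized part. The oscillating density below is illustrative (Bessel-like sign changes and decay), not the exact measure of Eq. (4):

```python
import numpy as np

def sample_signed_density(grid, density, n, seed=0):
    """Sample frequencies from the positive and negative parts of a
    tabulated signed density (a discrete approximation on a grid)."""
    rng = np.random.default_rng(seed)
    d = np.asarray(density, dtype=float)
    pos, neg = np.clip(d, 0.0, None), np.clip(-d, 0.0, None)
    mass_plus, mass_minus = pos.sum(), neg.sum()
    w_plus = rng.choice(grid, size=n, p=pos / mass_plus)
    w_minus = rng.choice(grid, size=n, p=neg / mass_minus)
    return w_plus, w_minus, mass_plus, mass_minus

# illustrative oscillating signed density with a decaying tail
grid = np.linspace(0.1, 30.0, 10_000)
density = np.sin(grid) / grid**2
wp, wm, mp_, mm_ = sample_signed_density(grid, density, n=1000)
```

The two returned masses play the roles of the total masses of $\mu_+$ and $\mu_-$ when plugging the samples into the feature map of Eq. (3).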
Next, we consider the NTK of two-layer ReLU networks on the unit sphere bietti2019inductive. Since this kernel in fact consists of arc-cosine kernels cho2009kernel, we discuss them together.
NTK of two-layer ReLU networks on the unit sphere: Bietti and Mairal bietti2019inductive consider a two-layer ReLU network with randomly initialized parameters. By writing the ReLU activation as $\sigma(t) = \max(0, t)$, we have the following corresponding NTK bietti2019inductive; chizat2019lazy:
(5) 
Moreover, this kernel can be further represented in terms of two components $\kappa_0$ and $\kappa_1$, where $\kappa_0$ corresponds to the zero-order arc-cosine kernel and $\kappa_1$ to the first-order arc-cosine kernel. Furthermore, this kernel is proved to be stationary but indefinite by the following theorem.
Theorem 4.
For any $x, y$ on the unit sphere, the NTK of a two-layer ReLU network of the above form is shift-invariant, i.e., it is a function of $\|x - y\|$ alone. Specifically, this function is not positive definite.^{2}^{2}2The behavior of the kernel for $\|x - y\| > 2$ is undefined; following pennington2015spherical, we adopt the same convention there.
Proof.
The proof can be found in Supplementary Materials D. ∎
Remark: The indefiniteness of the NTK on the unit sphere motivates us to scrutinize the approximation performance, functional spaces, and generalization properties of over-parameterized networks in the future, which in return expands the usage scope of indefinite kernels.
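The identity underlying the shift-invariance claim can be checked numerically: on the unit sphere, $x^\top y = 1 - \|x - y\|^2 / 2$, so the NTK value is determined by the distance alone. The arc-cosine formulas below follow cho2009kernel; the NTK normalization is one common convention and may differ from the paper's by constant factors:

```python
import numpy as np

def kappa0(u):
    # zero-order arc-cosine kernel on unit vectors: 1 - arccos(u)/pi
    return 1.0 - np.arccos(np.clip(u, -1.0, 1.0)) / np.pi

def kappa1(u):
    # first-order arc-cosine kernel on unit vectors:
    # (sin(theta) + (pi - theta) * cos(theta)) / pi with theta = arccos(u)
    th = np.arccos(np.clip(u, -1.0, 1.0))
    return (np.sin(th) + (np.pi - th) * np.cos(th)) / np.pi

def ntk(u):
    # two-layer ReLU NTK on the sphere under one common normalization
    return u * kappa0(u) + kappa1(u)

# On the sphere, u = <x, y> = 1 - ||x - y||^2 / 2: shift-invariance.
rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
u = x @ y
r2 = ((x - y) ** 2).sum()
```

Composing `ntk` with $u = 1 - r^2/2$ gives the radial function whose positive definiteness Theorem 4 rules out.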
Since the above NTK on the unit sphere can be formulated in terms of arc-cosine kernels, we have the following direct corollary:
Corollary 4.1.
For any $x, y$ on the unit sphere, the zero-order arc-cosine kernel and the first-order arc-cosine kernel are both shift-invariant but indefinite.
Remark:
(i) Obtaining the measure via the Fourier transform of arc-cosine kernels and the NTK on the unit sphere appears nontrivial due to a $d$-dimensional integration of Bessel functions.
However, we still manage to obtain the measure for the zero-order arc-cosine kernel according to Corollary 4.1; refer to Supplementary Materials E for details.
Hence, Algorithm 1 is still suitable for this kernel.
(ii) If we relax the shift-invariant constraint in Eq. (1), its integral representation also covers some dot-product kernels, e.g., arc-cosine kernels and NTK kernels.
For example, an arc-cosine kernel admits such a representation, with different basis functions in the zero-order and first-order cases.
That is, in Eq. (1) the basis function is substituted accordingly for arc-cosine kernels. Likewise, we can devise the representation for the NTK according to Eq. (5).
In these cases, the coefficients are nonnegative and, accordingly, we can conduct the sampling strategy for kernel approximation.
5 Experiments
We evaluate the proposed method against SRF (Spherical Random Features) pennington2015spherical
and DIGMM (Double-Infinite Gaussian Mixture Model)
liu2019double on four representative datasets: letter (https://archive.ics.uci.edu/ml/datasets.html), ijcnn1 and covtype (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and the MNIST dataset L1998Gradient; see Table 1. The datasets are normalized to [0, 1] by a min-max scaling scheme and come with a pre-given training/test partition, except for covtype, which we randomly split in half into training and test sets. In our experiments, the indefinite kernels used include the polynomial kernel on the sphere with the parameter settings in pennington2015spherical, and the Delta-Gaussian kernel with the parameter settings in oglic18a. We also include Random Maclaurin (RM) kar2012random and Tensor Sketch (TS) Pham2013Fast for polynomial kernel approximation. Note that the reported error bars and standard deviations are obtained over repeated runs of the experiments. All experiments are implemented in MATLAB and carried out on a PC with an Intel i7-8700K CPU (3.70 GHz) and 64 GB RAM. The source code of our implementation will be made public.

Datasets   d    #training   #test
letter     16   12,000      6,000
ijcnn1     22   49,990      91,701
covtype    54   290,506     290,506
MNIST      784  60,000      10,000
Kernel approximation: The relative error $\|K - \tilde{K}\|_F / \|K\|_F$ is chosen to measure the approximation quality, where $K$ and $\tilde{K}$ denote the exact kernel matrix on 1,000 randomly selected samples and its approximation, respectively. Figure 2 shows the approximation error for the two indefinite kernels as a function of the number of random features. Our method always achieves lower approximation error than the other algorithms on these datasets. A closer look at Delta-Gaussian kernel approximation shows that our approach significantly improves the approximation quality compared to SRF and DIGMM. The larger approximation error of SRF results from its two steps: (i) approximating the indefinite kernel by a PD kernel, and (ii) approximating this PD kernel by a sum of Gaussians; DIGMM only focuses on approximating a subset of the kernel matrix. Different from these two, our method directly approximates the indefinite kernel function with an unbiased estimator, which incurs no extra loss for kernel approximation.
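The relative error metric above can be computed in a few lines (assuming the Frobenius norm, the standard choice for this metric):

```python
import numpy as np

def relative_error(K_exact, K_approx):
    """Relative kernel approximation error ||K - K~||_F / ||K||_F."""
    return np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)

# np.linalg.norm on a 2-D array defaults to the Frobenius norm
K = np.array([[1.0, 0.5], [0.5, 1.0]])
```

An error of 0 means a perfect reconstruction; predicting an all-zero matrix gives an error of exactly 1.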
Classification with linear SVM:
The obtained randomized feature maps are used for classification by training a linear classifier with liblinear
fan2008liblinear.^{5}^{5}5Though learning with non-PD kernels is non-convex, the optimization algorithm in liblinear still converges. The regularization parameter of the linear SVM is tuned by five-fold cross-validation over a grid of candidate values. The test accuracies of the various algorithms are shown in Figure 3. As expected, a higher-dimensional randomized feature map yields higher classification accuracy. Our method achieves the best performance in most cases.
Computational time: Figure 4 shows the time spent on generating randomized feature maps. Admittedly, our method takes slightly more time to generate randomized feature maps than SRF pennington2015spherical, as our feature map introduces an extra imaginary part. However, on each dataset, SRF requires the parameters of a sum of Gaussians to be obtained in advance by an offline grid search. This extra cost often takes tens of seconds and is not included in the reported results.
6 Conclusion
We answer the long-standing open question on indefinite kernels via the introduced measure decomposition technique. Accordingly, we develop a general random features algorithm with unbiased estimation for various stationary kernels, whether positive definite or not. Besides, our finding on the indefiniteness of the NTK on the unit sphere encourages better scrutiny of the approximation performance, functional spaces, and generalization properties of over-parameterized networks in the future. Moreover, the mathematical technique in this paper, Fourier analysis in Schwartz spaces, can also be used to study ReLU networks in the Fourier domain, where the ReLU activation function is likewise not square-integrable; refer to
rahaman2019spectral for details. This expands the usage of indefinite kernels and Fourier analysis to neural networks.
Broader Impact
This is a theoretical paper that investigates the decomposition of signed measures for random features. This work gives further insight into kernel approximation, and hence might inspire new practical ideas for kernel-based learning tasks. Our work can be used to speed up kernel methods in large-scale settings, which is beneficial in contemporary data analysis. The technique developed in this paper contributes to fair and non-offensive societal consequences.
Acknowledgement
This work was supported in part by the European Research Council under the European Union’s Horizon 2020 research and innovation programme / ERC Advanced Grant E-DUALITY (787960), in part by the National Natural Science Foundation of China 61977046, and in part by the National Key Research and Development Project (No. 2018AAA0100702). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information; Research Council KUL C14/18/068; Flemish Government FWO project GOA4917N; Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen programme.
References
Appendix A Preliminaries of RKKS
Here we briefly review Kreĭn spaces and the reproducing kernel Kreĭn space (RKKS). Detailed expositions can be found in the book [bognar1974indefinite]. Most readers will be familiar with Hilbert spaces. Kreĭn spaces share some properties of Hilbert spaces but differ in some key aspects, which we emphasize as follows.
Kreĭn spaces are indefinite inner product spaces endowed with a Hilbertian topology.
Definition 2.
(Kreĭn space [bognar1974indefinite]) An inner product space $(\mathcal{K}, \langle \cdot, \cdot \rangle_{\mathcal{K}})$ is a Kreĭn space if there exist two Hilbert spaces $\mathcal{H}_+$ and $\mathcal{H}_-$ such that
i) every $f \in \mathcal{K}$ can be decomposed into $f = f_+ + f_-$, where $f_+ \in \mathcal{H}_+$ and $f_- \in \mathcal{H}_-$, respectively;
ii) $\langle f, g \rangle_{\mathcal{K}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} - \langle f_-, g_- \rangle_{\mathcal{H}_-}$ for all $f, g \in \mathcal{K}$.
Accordingly, the Kreĭn space can be decomposed into the direct sum $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$. Besides, the inner product on $\mathcal{K}$ is nondegenerate, i.e., for $f \in \mathcal{K}$, if $\langle f, g \rangle_{\mathcal{K}} = 0$ for all $g \in \mathcal{K}$, then $f = 0$. From the definition, the decomposition is not necessarily unique; for a fixed decomposition, the inner product is given accordingly [Ga2016Learning, oglic18a]. The key difference from Hilbert spaces is that inner products may be negative in Kreĭn spaces, i.e., there exists $f \in \mathcal{K}$ such that $\langle f, f \rangle_{\mathcal{K}} < 0$. If $\mathcal{H}_+$ and $\mathcal{H}_-$ are two RKHSs, the Kreĭn space $\mathcal{K}$ is an RKKS associated with a unique indefinite reproducing kernel $k$ such that the reproducing property holds.
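A toy finite-dimensional analogue makes the indefinite inner product tangible (this construction is our illustration, not from the book): take $\mathbb{R}^{p+q}$ with the first $p$ coordinates contributing positively and the rest negatively.

```python
import numpy as np

def krein_inner(f, g, p):
    """Indefinite inner product on R^{p+q} (a toy finite-dimensional
    Krein space): <f, g> = sum_{i < p} f_i g_i - sum_{i >= p} f_i g_i."""
    signs = np.concatenate([np.ones(p), -np.ones(len(f) - p)])
    return float(np.sum(signs * f * g))

f = np.array([0.0, 1.0])   # lies entirely in the 'negative' part H_-
g = np.array([1.0, 1.0])   # a neutral (isotropic) vector
```

Here `f` has a negative self inner product, and `g` is a nonzero vector with zero self inner product; nondegeneracy is still satisfied because no nonzero vector is orthogonal to the whole space.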
Proposition 1.
(Positive decomposition [bognar1974indefinite]) Let $k$ be a real-valued kernel function. Then there exists an associated reproducing kernel Kreĭn space identified with the reproducing kernel $k$ if and only if $k$ admits a positive decomposition $k = k_+ - k_-$, where $k_+$ and $k_-$ are two positive definite kernels.
From the definition, this decomposition is not necessarily unique. Typical examples include a wide range of commonly used indefinite kernels, such as a linear combination of PD kernels [Cheng2005Learning] and conditionally PD kernels [schaback1999native, wendland2004scattered]. It is important to note that not every indefinite kernel function admits a representation as a difference between two positive definite kernels.
Appendix B Proof of Theorem 3
Proof.
(i) Necessity.
A stationary indefinite kernel associated with an RKKS admits the positive decomposition $k = k_+ - k_-$, where $k_+$ and $k_-$ are two positive definite kernels. According to Bochner’s theorem, there exist two finite nonnegative measures $\mu_+$ and $\mu_-$ (probability measures up to scaling) corresponding to $k_+$ and $k_-$, respectively.
Denote $\mu := \mu_+ - \mu_-$; it is clear that $\mu$ is a signed measure, and its total mass is finite since both $\mu_+$ and $\mu_-$ are finite.
(ii) Sufficiency.
Let $\mathcal{B}$ be the smallest $\sigma$-algebra containing all open subsets of $\mathbb{R}^d$.
Since we assume that $\mu$ has finite total mass except at the origin, $\mu$ is a finite signed measure away from the origin. By virtue of the Jordan decomposition, there exist two nonnegative finite measures $\mu_+$ and $\mu_-$ such that $\mu = \mu_+ - \mu_-$. Using the inverse Fourier transform and Plancherel’s theorem [donoghue2014distributions], these two nonnegative Borel measures correspond to two positive definite kernels $k_+$ and $k_-$, respectively, yielding the positive decomposition $k = k_+ - k_-$.
This completes the proof. ∎
Appendix C Polynomial kernels on the unit sphere with finite total mass
We consider the asymptotic properties of the Bessel function of the first kind in the small- and large-argument regimes to study the total mass of the measure.
C.1 The small-argument regime
Consider the asymptotic behavior for small arguments. The Bessel function of the first kind $J_\nu(x)$ is asymptotically equivalent to $\frac{1}{\Gamma(\nu + 1)} \left( \frac{x}{2} \right)^{\nu}$ as $x \to 0$.
In this case, the measure is formulated as
(6) 
which can be regarded as a generalized version of a uniform distribution. Therefore, the measure
is absolutely integrable over a finite range near the origin.
C.2 The large-argument regime
Consider the asymptotic behavior for large arguments. The Bessel function of the first kind $J_\nu(x)$ is asymptotically equivalent to $\sqrt{\frac{2}{\pi x}} \cos\left( x - \frac{\nu \pi}{2} - \frac{\pi}{4} \right)$ as $x \to \infty$.
The Fourier transform of the polynomial kernel on the sphere, i.e., the measure, is hence given by [pennington2015spherical]
(7) 
In this way, the measure has finite total mass over the large-argument range.
Appendix D Proof of Theorem 4
To prove Theorem 4, we first derive its formulation on the unit sphere and then demonstrate that it is shift-invariant but not positive definite via completely monotone functions.
Definition 3.
(Completely monotone [schoenberg1938metric]) A function $f$ is called completely monotone on $(0, \infty)$ if it satisfies $f \in C^\infty(0, \infty)$ and $(-1)^n f^{(n)}(t) \geq 0$
for all $t > 0$ and all $n = 0, 1, 2, \dots$. Moreover, $f$ is called completely monotone on $[0, \infty)$ if it is additionally defined and continuous at the origin.
Note that the definition of completely monotone functions can also be restricted to a finite interval; see [pennington2015spherical].
Besides, we need the following lemma that demonstrates the connection between positive definite and completely monotone functions for the proof.
Lemma 1.
(Schoenberg’s theorem [schoenberg1938metric]) A function $f$ is completely monotone on $[0, \infty)$ if and only if $k(x, y) := f(\|x - y\|^2)$ is a radial positive definite function on $\mathbb{R}^d$ for every $d$.
Now let us prove Theorem 4.
Proof.
By virtue of the unit-norm constraint, the inner product $x^\top y$ can be written as a function of $\|x - y\|$. Therefore, the standard NTK of a two-layer ReLU network can be formulated as a function of $\|x - y\|$ alone,
which is shift-invariant.
Next, we prove that this kernel is not positive definite, i.e., the corresponding function $f$ is not completely monotone on $[0, \infty)$ by Lemma 1. In other words, the required sign conditions fail at some points. To this end, we examine the function $f$
and its first-order derivative $f'$.
Since $f'$ is continuous and changes sign, there exists a constant $c$ such that $f'$ is negative on one side of $c$ and positive on the other. That is to say, $-f'(t) < 0$ holds for some $t$, which violates the definition of completely monotone functions. In this regard, $f$ is not completely monotone on $[0, \infty)$ and thus the kernel is not positive definite. ∎
Appendix E The measure of the zero-order arc-cosine kernel
In this section, we derive the measure of the zero-order arc-cosine kernel. Accordingly, we have
(8)  
where the kernel is a radial function, and thus its Fourier transform is also a radial function. Obviously, the integrand in Eq. (8) and the integration region are both bounded, and thus the measure is finite there. Following the proof of finite total mass for polynomial kernels on the unit sphere in Section C, we can also demonstrate the finiteness of the total mass for the zero-order arc-cosine kernel on the unit sphere.