Generalizing Random Fourier Features via Generalized Measures

05/30/2020 · by Fanghui Liu, et al. · Shanghai Jiao Tong University

We generalize random Fourier features, which usually require the kernel function to be both stationary and positive definite (PD), to a broader range of non-stationary and/or non-PD kernels, e.g., dot-product kernels on the unit sphere and linear combinations of positive definite kernels. Specifically, we find that the popular neural tangent kernel of a two-layer ReLU network, a typical dot-product kernel, is shift-invariant but not positive definite when the data are ℓ_2-normalized. By introducing signed measures, we propose a general framework that covers the above kernels by associating them with specific finite Borel measures, i.e., probability distributions. In this manner, we provide the first random features algorithm that yields unbiased estimates of these kernels. Experiments on several benchmark datasets verify the effectiveness of our algorithm over existing methods. Last but not least, our work provides a sufficient and necessary condition, which is also computationally implementable, for a long-standing open question: does any indefinite kernel have a positive decomposition?


1 Introduction

Random Fourier features (RFF) [rahimi2007random] is a powerful technique in kernel-based learning that samples a series of random features from a distribution (obtained via the Fourier transform) to approximate the kernel function. It brings promising performance and solid theoretical guarantees for scaling up kernel methods in classification [sun2018but, li2019towards], nonlinear component analysis [xie2015scale, lopez2014randomized], and the neural tangent kernel (NTK) [jacot2018neural, arora2019exact]. It is noteworthy that random features can be regarded as a class of two-layer neural networks [belkin2019reconciling] in the lazy regime [chizat2019lazy], and thus can be utilized to analyze over-parameterized neural networks [belkin2019reconciling, arora2019exact]. Reflecting the great success of RFF in the machine learning community, Rahimi and Recht won the Test-of-Time Award at NeurIPS 2017 for their seminal work on RFF [rahimi2007random], and Li et al. [li2019towards] won an Honorable Mention (best paper finalist) at ICML 2019 for their unified theoretical analysis of RFF.

The theoretical foundation behind RFF is intuitive: a positive definite (PD) function corresponds to a nonnegative and finite Borel measure, i.e., a probability distribution (after normalization), via the Fourier transform. We can then sample random features from this distribution so as to approximate the PD function.
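As a concrete illustration of this pipeline, here is a minimal sketch of standard RFF for the Gaussian kernel $k(\boldsymbol{x}-\boldsymbol{y}) = \exp(-\|\boldsymbol{x}-\boldsymbol{y}\|_2^2/2)$, whose spectral distribution is the standard normal; this is our own example, not code from the paper.

```python
import numpy as np

def rff_gaussian(X, s, rng=None):
    """Standard RFF for k(x, y) = exp(-||x - y||^2 / 2).

    This kernel is the Fourier transform of N(0, I), so we sample
    frequencies omega_i ~ N(0, I) and use paired cos/sin features.
    """
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((s, X.shape[1]))          # omega_i ~ N(0, I)
    Z = X @ W.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(s)

# Unbiasedness: E[phi(x)^T phi(y)] = E[cos(omega^T (x - y))] = exp(-||x - y||^2 / 2).
```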

Theorem 1 (Bochner’s Theorem [bochner2005harmonic]).

Let $k(\boldsymbol{x}, \boldsymbol{y}) = k(\boldsymbol{x} - \boldsymbol{y})$ be a bounded continuous function satisfying the shift-invariant property. Then $k$ is positive definite if and only if it is the (conjugate) Fourier transform of a nonnegative and finite Borel measure $\mu$, i.e., $k(\boldsymbol{x} - \boldsymbol{y}) = \int_{\mathbb{R}^d} e^{\mathrm{i}\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})}\,\mathrm{d}\mu(\boldsymbol{\omega})$.

Here $k$ denotes the kernel function throughout this paper. Typically, the kernel in practical use is real-valued, and thus the imaginary part can be discarded, i.e., $k(\boldsymbol{x} - \boldsymbol{y}) = \int_{\mathbb{R}^d} \cos\big(\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})\big)\,\mathrm{d}\mu(\boldsymbol{\omega})$. It is clear that Bochner’s theorem requires the kernel function to be (i) shift-invariant (also called “stationary”) and (ii) positive definite. These two conditions together exclude a series of commonly used kernels, including

  • Dot-product kernels¹: polynomial kernels [smola2001regularization], arc-cosine kernels [cho2009kernel], NTK [jacot2018neural]. (¹In this case, the Fourier basis functions are spherical harmonics [smola2001regularization].)

  • Indefinite kernels in a reproducing kernel Kreĭn space (RKKS) [Cheng2004Learning]: (i) linear combinations of positive definite kernels, e.g., the Delta-Gaussian kernel [oglic18a]; (ii) conditionally positive definite kernels, e.g., the log kernel and the power kernel [boughorbel2005conditionally].

  • Specifically designed but indefinite kernels: the Gaussian kernel on manifolds [Feragen2015Geodesic], polynomial kernels on spheres [pennington2015spherical], the TL1 kernel [huang2017classification], the kernel of [smola2001regularization], and indefinite kernels of arbitrary types.

Pennington et al. [pennington2015spherical] point out that dot-product kernels with $\ell_2$-normalized data, i.e., data on the unit sphere, can be shift-invariant but not always PD. Therefore, in this paper, we consider stationary kernels, either positive definite or indefinite, admitting the following integration representation, by denoting $\boldsymbol{z} := \boldsymbol{x} - \boldsymbol{y}$:

$$k(\boldsymbol{x}, \boldsymbol{y}) = k(\boldsymbol{z}) = a + b \int_{\Omega} \phi(\boldsymbol{\omega}, \boldsymbol{x})\,\overline{\phi(\boldsymbol{\omega}, \boldsymbol{y})}\,\mathrm{d}\mu(\boldsymbol{\omega}), \qquad (1)$$

where the integral region $\Omega$ is chosen as either $\mathbb{R}^d$ or $[0, \infty)$ in this paper, and $a, b$ are some nonnegative coefficients. Here $\mu$ is a signed measure [athreya2006measure], a generalized measure in the sense that it is allowed to take negative values. It admits the Jordan decomposition [kubrusly2015essentials], i.e., $\mu = \mu_+ - \mu_-$, where $\mu_+$ and $\mu_-$ are two nonnegative measures with at least one of them being finite. The total mass is defined as $|\mu|(\Omega) := \mu_+(\Omega) + \mu_-(\Omega)$. Specifically, to make the sampling process feasible in the algorithm implementation, the total mass is required to be finite, i.e., $|\mu|(\Omega) < \infty$. The function $\phi$ is not limited to the Fourier basis $e^{\mathrm{i}\boldsymbol{\omega}^\top \boldsymbol{x}}$ of the Fourier transform (or of its counterpart in the sense of tempered distributions) and can be extended to other formulations. The integration representation in Eq. (1) is general enough to cover various kernels, e.g., PD kernels, dot-product kernels on the unit sphere, and indefinite kernels. Note that for non-PD kernels, the Fourier transform $\mu$ cannot be regarded as a nonnegative Borel measure. Analysis of non-PD kernels is often based on RKKS, but it is not known whether a given non-PD kernel can be identified with a reproducing kernel in RKKS; this in fact corresponds to a long-standing open question [Cheng2004Learning, Guo2017Optimal, Huang2016Indefinite]: does any indefinite kernel have a positive decomposition? The introduced signed-measure decomposition provides an accessible way to answer this question, and accordingly we make the following contributions:

  • In Section 3, by introducing signed measures, we generalize RFF to a series of kernels that are not positive definite. We provide a sufficient and necessary condition to answer the above open question in RKKS via the measure decomposition technique. Moreover, this condition guides us in finding a specific positive decomposition in practice, so we can devise a sampling strategy to obtain randomized feature maps. To the best of our knowledge, this is the first work to provide unbiased estimates for non-PD kernel approximation by random features.

  • In Section 4, we demonstrate the feasibility of our random features algorithm on several indefinite kernels. We begin with an intuitive example, a linear combination of positive definite kernels, and then consider dot-product kernels on the unit sphere for kernel approximation. Moreover, we prove that the popular NTK of a two-layer ReLU network is shift-invariant but not positive definite for $\ell_2$-normalized data, which motivates us to reconsider its induced functional spaces and related properties.

  • In Section 5, we evaluate various non-PD kernels on several typical large-scale datasets in terms of kernel approximation and the subsequent classification task. Our experimental results validate the theoretical claims and demonstrate the effectiveness of the proposed kernel approximation algorithm.

2 Related Work

A line of research focuses on non-stationary kernel approximation, e.g., approximating polynomial kernels by the Maclaurin expansion [kar2012random] or the tensor sketch technique [Pham2013Fast]; approximating additive kernels [li2010random, Vedaldi2012Efficient]; and approximating the Gaussian kernel on the unit sphere by the discrete cosine transform [kafai2018croification]. However, the kernels considered in the above works are still positive definite, and the designed approximation algorithms are infeasible for indefinite kernels. Regarding non-PD kernel approximation, we are aware of four papers on this task: (i) Pennington et al. [pennington2015spherical] find that the polynomial kernel on the unit sphere is not PD, and they use a PD kernel (associated with a positive sum of Gaussian distributions) to approximate it; (ii) Liu et al. [liu2019double] decompose (a subset of) the kernel matrix into two PD kernel matrices, and then learn their respective randomized feature maps by infinite Gaussian mixtures; however, this approach in fact focuses on approximating kernel matrices rather than kernel functions; (iii) Mehrkanoon et al. [mehrkanoon2018indefinite] investigate Nyström approximation for indefinite kernels in spectral learning; (iv) Oglic and Gärtner [oglic2019scalable] propose Nyström methods for low-rank approximation of indefinite kernels in RKKS. The first two works are based on RFF and the last two focus on Nyström approximation. To date, random feature approximations of non-PD kernels are not guaranteed to be unbiased, and this problem has not yet been fully investigated. Instead, based on the measure decomposition technique, our work achieves both simplicity and effectiveness by having (i) an unbiased estimator and (ii) no extra parameters.

3 Randomized Feature Map via Signed Measure

In this section, we begin with the concept of signed measures, answer the open question, and then devise the sampling strategy for random features. For simplicity of notation, we denote $\boldsymbol{z} := \boldsymbol{x} - \boldsymbol{y}$ and write $k(\boldsymbol{x}, \boldsymbol{y}) = k(\boldsymbol{z})$ for stationary kernels. Moreover, a function is called radial if it depends on $\boldsymbol{z}$ only through the distance $\|\boldsymbol{z}\|$, where the distance is usually defined by the $\ell_2$-norm. Note that the stationary kernels considered in this paper are all radial, and accordingly their Fourier transforms are also radial, i.e., $\mu(\boldsymbol{\omega}) = \mu(\|\boldsymbol{\omega}\|_2)$.

3.1 Signed measure

Let $\mu$ be a measure on a set $\Omega$, i.e., a nonnegative set function satisfying $\mu(\emptyset) = 0$ and $\sigma$-additivity (countable additivity). We call $\mu$ a finite measure if $\mu(\Omega) < \infty$. Specifically, $\mu$ is a probability measure if $\mu(\Omega) = 1$, and the triple $(\Omega, \mathcal{A}, \mu)$ is referred to as the corresponding probability space. Here we consider the signed measure, a generalized version of a measure that allows negative values.

Definition 1.

(Signed measure [athreya2006measure]) Let $\Omega$ be some set and $\mathcal{A}$ be a $\sigma$-algebra of subsets of $\Omega$. A signed measure is a function $\mu: \mathcal{A} \to [-\infty, +\infty]$ with $\mu(\emptyset) = 0$ that attains at most one of the values $\pm\infty$ and satisfies $\sigma$-additivity.

Based on the definition of signed measures, the following theorem shows that any signed measure can be represented as the difference of two nonnegative measures.

Theorem 2.

(Jordan decomposition [kubrusly2015essentials]) Let $\mu$ be a signed measure defined on the $\sigma$-algebra $\mathcal{A}$ as given in Definition 1. There exist two (nonnegative) measures $\mu_+$ and $\mu_-$ (at least one of them finite) such that $\mu = \mu_+ - \mu_-$.

The total mass of $\mu$ on $\Omega$ is defined as $|\mu|(\Omega) := \mu_+(\Omega) + \mu_-(\Omega)$. Note that this decomposition is not unique. It can be characterized by the Hahn decomposition theorem [doss1980hahn]: the space $\Omega$ can be decomposed as $\Omega = P \cup N$ with $P \cap N = \emptyset$. Here, $P$ is a positive set for $\mu$, i.e., $\mu(A) \geq 0$ for all measurable subsets $A \subseteq P$, while $N$ is a negative set, i.e., $\mu(A) \leq 0$ for all measurable subsets $A \subseteq N$.
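To make the decomposition concrete, the following toy sketch (our illustration, not from the paper) discretizes a signed density on a grid; the Hahn sets are simply the regions where the density is positive or negative, and the Jordan parts follow by clipping.

```python
import numpy as np

# Toy signed density on a 1-D grid: a difference of two Gaussian bumps.
w = np.linspace(-5.0, 5.0, 1001)
dw = w[1] - w[0]
density = np.exp(-w**2 / 2.0) - 0.8 * np.exp(-2.0 * w**2)   # can be negative

pos = np.clip(density, 0.0, None)    # density of mu_+ (supported on the Hahn set P)
neg = np.clip(-density, 0.0, None)   # density of mu_- (supported on N)
total_mass = (pos + neg).sum() * dw  # |mu|(Omega) = mu_+(Omega) + mu_-(Omega)

assert np.allclose(density, pos - neg)   # Jordan decomposition: mu = mu_+ - mu_-
```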

3.2 Answering the open question in RKKS

A reproducing kernel Kreĭn space (RKKS) [bognar1974indefinite, Ga2016Learning] is an inner-product space that is analogous to a reproducing kernel Hilbert space (RKHS) and can be decomposed into a direct sum $\mathcal{H}_{\mathcal{K}} = \mathcal{H}_+ \oplus \mathcal{H}_-$ of two RKHSs $\mathcal{H}_+$ and $\mathcal{H}_-$. The key difference from RKHSs is that inner products may be negative in an RKKS, i.e., there may exist $f \in \mathcal{H}_{\mathcal{K}}$ such that $\langle f, f \rangle_{\mathcal{H}_{\mathcal{K}}} < 0$. RKKS provides a justification for analyzing indefinite kernels, as it admits a positive decomposition [bognar1974indefinite] such that $k = k_+ - k_-$, where $k$ is a reproducing kernel associated with the RKKS, and the two PD kernels $k_+$ and $k_-$ are reproducing kernels associated with $\mathcal{H}_+$ and $\mathcal{H}_-$, respectively. Apparently, this decomposition is not necessarily unique. Preliminaries on RKKS can be found in Supplementary Materials A.

It is important to note that not every indefinite kernel admits a representation as a difference of two positive definite kernels. In other words, we do not know how to verify that an indefinite kernel can be associated with an RKKS, except for some intuitive examples, e.g., a linear combination of PD kernels. In the past, one usually assumed in practice that a (reproducing) indefinite kernel lies in an RKKS, while this theoretical gap cannot be ignored. By virtue of the measure decomposition of signed measures, we provide a sufficient and necessary condition in Theorem 3 to answer the long-standing open question in RKKS: does any indefinite kernel have a positive decomposition? Moreover, this condition serves as guidance for finding a specific positive decomposition in practice.

Theorem 3.

Assume that an indefinite kernel $k$ is stationary, i.e., $k(\boldsymbol{x}, \boldsymbol{y}) = k(\boldsymbol{z})$, and denote its (generalized) Fourier transform by the measure $\mu$. Then $k$ can be identified with a reproducing kernel in an RKKS if and only if the total mass of the measure $\mu$ away from the origin is finite, i.e., $|\mu|\big(\mathbb{R}^d \setminus \{\boldsymbol{0}\}\big) < \infty$.

Proof.

The proof can be found in Supplementary Materials B. ∎

Remark: We provide an explicit sufficient and necessary condition that links the Jordan decomposition of signed measures to the positive decomposition in RKKS. We make the following remarks.
(i) Theorem 3 provides a route, via the Fourier transform, to verify whether a (reproducing) indefinite kernel belongs to an RKKS or not. The measure decomposition is much easier to find than the positive decomposition in RKKS, which cannot be verified in practice. In the next section, we give some examples, including a linear combination of PD kernels and dot-product kernels on the unit sphere, to illustrate our condition in practice.
(ii) Theorem 3 also covers some non-square-integrable kernel functions, e.g., conditionally positive definite kernels [wendland2004scattered], for which the standard Fourier transform does not exist. In this case, we need to consider the Fourier transform in the Schwartz space [donoghue2014distributions]. For example, Theorem 2.3 in [sun1993conditionally] demonstrates that conditionally positive definite kernels correspond to a positive Borel measure together with an analytic function in the Schwartz space.
(iii) In addition to the above-mentioned indefinite kernels, based on Theorem 3, we can also verify various indefinite kernels, e.g., the TL1 kernel [huang2017classification], the kernel of [smola2001regularization], and any distance-based kernel function, in a similar way. Specifically, their Fourier transforms (the $d$-dimensional integrals) are still radial, which makes the Fourier transform easier to compute.

3.3 Randomized feature map

Based on Theorem 3, we are ready to develop our random features algorithm for non-PD kernels. By taking $\phi(\boldsymbol{\omega}, \boldsymbol{x})\,\overline{\phi(\boldsymbol{\omega}, \boldsymbol{y})}$ to be $e^{\mathrm{i}\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})}$, Eq. (1) can be reformulated as follows:

$$k(\boldsymbol{x} - \boldsymbol{y}) = \int_{\mathbb{R}^d} e^{\mathrm{i}\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})}\,\mathrm{d}\mu(\boldsymbol{\omega}) = \mu_+(\mathbb{R}^d) \int_{\mathbb{R}^d} e^{\mathrm{i}\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})}\,\mathrm{d}\tilde{\mu}_+(\boldsymbol{\omega}) - \mu_-(\mathbb{R}^d) \int_{\mathbb{R}^d} e^{\mathrm{i}\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})}\,\mathrm{d}\tilde{\mu}_-(\boldsymbol{\omega}), \qquad (2)$$

where $\mu_+$ and $\mu_-$ are two finite nonnegative measures and $\tilde{\mu}_\pm := \mu_\pm / \mu_\pm(\mathbb{R}^d)$ are their corresponding normalized Borel (probability) measures. According to Bochner’s theorem, these two Borel measures can be associated with two PD kernels $k_+$ and $k_-$, respectively. Therefore, Monte Carlo sampling is feasible to approximate $k(\boldsymbol{x}, \boldsymbol{y})$ by $\Phi(\boldsymbol{x})^\top \Phi(\boldsymbol{y})$, where $\Phi(\boldsymbol{x}) := \big[\varphi_+(\boldsymbol{x});\ \mathrm{i}\,\varphi_-(\boldsymbol{x})\big]$ is the explicit feature mapping with $\varphi_\pm$ being

$$\varphi_\pm(\boldsymbol{x}) = \sqrt{\frac{\mu_\pm(\mathbb{R}^d)}{s}}\,\big[\cos(\boldsymbol{\omega}_1^\top \boldsymbol{x}), \sin(\boldsymbol{\omega}_1^\top \boldsymbol{x}), \ldots, \cos(\boldsymbol{\omega}_s^\top \boldsymbol{x}), \sin(\boldsymbol{\omega}_s^\top \boldsymbol{x})\big]^\top, \qquad (3)$$

where the random features in $\varphi_+$ and $\varphi_-$ are obtained by $\{\boldsymbol{\omega}_i\}_{i=1}^s \sim \tilde{\mu}_+$ and $\{\boldsymbol{\omega}_i\}_{i=1}^s \sim \tilde{\mu}_-$, respectively. It can be easily seen from Eqs. (2) and (3) that this estimator is unbiased. The real and imaginary parts of $\Phi$ correspond to $\mu_+$ and $\mu_-$, i.e., to the PD kernels $k_+$ and $k_-$, respectively. Moreover, in Eq. (3), the imaginary part $\mathrm{i}\,\varphi_-(\boldsymbol{x})$ can be interpreted as rotating the vector $\varphi_-(\boldsymbol{x})$ by 90 degrees in the complex plane; since $\mathrm{i}^2 = -1$, the inner product $\Phi(\boldsymbol{x})^\top \Phi(\boldsymbol{y})$ recovers the subtraction $k_+ - k_-$. This is consistent with the RKKS associated with $k$ being an orthogonal direct sum $\mathcal{H}_{\mathcal{K}} = \mathcal{H}_+ \oplus \mathcal{H}_-$, where $\mathcal{H}_+$ and $\mathcal{H}_-$ are the two RKHSs associated with $k_+$ and $k_-$. Though we introduce the imaginary unit into the feature mapping, the computed kernel approximation remains real-valued. The complete random features process is summarized in Algorithm 1. For any given kernel, the required measures $\tilde{\mu}_\pm$ and total masses $\mu_\pm(\mathbb{R}^d)$ can be pre-computed, independent of the training data. In this way, our algorithm achieves the same time and memory complexity as standard RFF.

The formulation in Eq. (2), as well as Algorithm 1, is general enough to cover various PD and non-PD kernels. Stationary PD kernels admit Eq. (2) by choosing $\mu_- \equiv 0$, in which case $\mu = \mu_+$ can be normalized to a probability measure. Hence, Bochner’s theorem can be regarded as a special case of the integration representation (2) considered in this paper. In the next section, we demonstrate the feasibility of Algorithm 1 on several typical indefinite kernels, including linear combinations of PD kernels, dot-product kernels on the unit sphere, etc.

Input: A kernel function $k$ with $k(\boldsymbol{x}, \boldsymbol{y}) = k(\boldsymbol{z})$, the training data $\{\boldsymbol{x}_i\}_{i=1}^n$, and the number of random features $s$.
Output: A random feature map $\Phi(\cdot)$ such that $k(\boldsymbol{x}, \boldsymbol{y}) \approx \Phi(\boldsymbol{x})^\top \Phi(\boldsymbol{y})$.
1. Obtain the measure $\mu$ of the kernel $k$ via the (generalized) Fourier transform;
2. Given $\mu$, let $\mu = \mu_+ - \mu_-$ be the Jordan decomposition with two nonnegative measures, and compute the total masses $\mu_+(\mathbb{R}^d)$ and $\mu_-(\mathbb{R}^d)$;
3. Sample $\{\boldsymbol{\omega}_i\}_{i=1}^s \sim \tilde{\mu}_+$ and $\{\boldsymbol{\omega}_i\}_{i=1}^s \sim \tilde{\mu}_-$;
4. Output the explicit feature mapping $\Phi(\cdot)$ with $\varphi_\pm$ given in Eq. (3).
Algorithm 1: Random features for various indefinite kernels via generalized measures.
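For concreteness, the following Python sketch implements Algorithm 1 given user-supplied samplers for $\tilde{\mu}_\pm$ and the total masses. The function `signed_rff` and its interface are our own illustration (the paper's experiments use MATLAB), not the authors' released code.

```python
import numpy as np

def signed_rff(X, sample_pos, sample_neg, mass_pos, mass_neg, s, rng=None):
    """Algorithm 1 (sketch): random features for a stationary indefinite kernel
    whose signed spectral measure mu = mu_+ - mu_- has a known Jordan decomposition.

    X          : (n, d) data matrix
    sample_pos : callable (s, d, rng) -> (s, d) samples from normalized mu_+
    sample_neg : callable (s, d, rng) -> (s, d) samples from normalized mu_-
    mass_pos   : total mass mu_+(R^d);  mass_neg : total mass mu_-(R^d)
    Returns Phi such that k(x, y) ~ Phi(x) @ Phi(y) (plain transpose, no conjugate).
    """
    rng = np.random.default_rng(rng)
    W = sample_pos(s, X.shape[1], rng)
    V = sample_neg(s, X.shape[1], rng)
    phi_p = np.sqrt(mass_pos / s) * np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)])
    phi_n = np.sqrt(mass_neg / s) * np.hstack([np.cos(X @ V.T), np.sin(X @ V.T)])
    return np.hstack([phi_p, 1j * phi_n])   # the imaginary block encodes the "minus"
```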

4 Examples

In this section, we investigate a series of indefinite kernels for a better understanding of our random features algorithm. We begin with an intuitive example, an indefinite linear combination of PD kernels. Then we employ several dot-product kernels on the unit sphere, including the polynomial kernel [pennington2015spherical], the arc-cosine kernel [cho2009kernel], and the NTK of a two-layer ReLU network [bietti2019inductive]. Note that by considering these dot-product kernels on the unit sphere, we in fact conduct an equivalent pre-processing step of normalizing the data to unit $\ell_2$-norm. This operation is common to avoid the unboundedness of dot-product kernels [pennington2015spherical, Hamid2014Compact], and it is beneficial to theoretical analysis [bietti2019inductive, ghorbani2019linearized, gao2019convergence].

A linear combination of positive definite kernels: Kernels in this class have the formulation $k = \sum_{t=1}^m \beta_t k_t$, i.e., a linear combination of PD kernels $k_t$ with corresponding coefficients $\beta_t \in \mathbb{R}$ (possibly negative). This is a typical example of an indefinite kernel in RKKS, which admits a positive decomposition $k = k_+ - k_-$ with two PD kernels $k_+, k_-$. Theorem 3 guides us to find the decomposition based on the signs of the $\beta_t$: we explicitly decompose an indefinite kernel in this class into the difference of two PD kernels, i.e., $k_+ = \sum_{t: \beta_t > 0} \beta_t k_t$ and $k_- = -\sum_{t: \beta_t < 0} \beta_t k_t$. The corresponding nonnegative measures can then be obtained thanks to the linearity of the Fourier transform.

We take the Delta-Gaussian kernel [oglic18a], a difference of two Gaussian kernels with bandwidths $\sigma_1$ and $\sigma_2$, as an example. This kernel admits the Gaussian spectral measures $\mu_+ = \mathcal{N}(\boldsymbol{0}, \sigma_1^{-2}\boldsymbol{I})$ and $\mu_- = \mathcal{N}(\boldsymbol{0}, \sigma_2^{-2}\boldsymbol{I})$ in Eq. (2), and its random feature mapping is given by Eq. (3) with $\boldsymbol{\omega}_i \sim \mathcal{N}(\boldsymbol{0}, \sigma_1^{-2}\boldsymbol{I})$ for $\varphi_+$ and $\boldsymbol{\omega}_i \sim \mathcal{N}(\boldsymbol{0}, \sigma_2^{-2}\boldsymbol{I})$ for $\varphi_-$.
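Using the `signed_rff` sketch from Section 3.3, the Delta-Gaussian case reduces to sampling from two Gaussians. This is a minimal sketch; the bandwidths and unit weights below are chosen for illustration only, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))           # toy data: n = 200, d = 10
sigma1, sigma2 = 1.0, 0.5                    # illustrative bandwidths

Phi = signed_rff(
    X,
    sample_pos=lambda s, d, r: r.normal(scale=1.0 / sigma1, size=(s, d)),
    sample_neg=lambda s, d, r: r.normal(scale=1.0 / sigma2, size=(s, d)),
    mass_pos=1.0, mass_neg=1.0, s=512)       # each Gaussian kernel has unit mass

# Unbiased estimate of k(z) = exp(-||z||^2/(2 sigma1^2)) - exp(-||z||^2/(2 sigma2^2));
# the imaginary parts cancel, so the Gram matrix is real up to round-off.
K_hat = (Phi @ Phi.T).real
```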

Having given this simple and intuitive warm-up example, we now discuss some more sophisticated indefinite kernels, namely dot-product kernels on the unit sphere, and demonstrate the feasibility of our random features algorithm.

Polynomial kernels on the sphere: Pennington et al. [pennington2015spherical] point out that a polynomial kernel restricted to the unit sphere can be written as a shift-invariant (radial) function of $\boldsymbol{z} = \boldsymbol{x} - \boldsymbol{y}$ for suitable bias and degree parameters. This kernel is indefinite, since its Fourier transform is not a nonnegative measure, as shown in [pennington2015spherical] via the radial Fourier (Hankel-type) transform

$$\mu(\omega) = (2\pi)^{d/2}\, \omega^{1 - d/2} \int_0^\infty k(r)\, J_{d/2 - 1}(\omega r)\, r^{d/2}\,\mathrm{d}r, \qquad (4)$$

where $\omega = \|\boldsymbol{\omega}\|_2$ and $r = \|\boldsymbol{z}\|_2$; the sign changes result from the oscillatory behavior of the Bessel function of the first kind $J_\nu$. We demonstrate that $|\mu|(\Omega) < \infty$ (see Supplementary Materials C), which makes the integration representation in Eq. (2) and our random features algorithm feasible. Since $\mu$ is a signed measure, it can be decomposed into two nonnegative measures by Eq. (4), that is, $\mu = \mu_+ - \mu_-$ with $\mu_+ = \max(\mu, 0)$ and $\mu_- = \max(-\mu, 0)$. The random feature map for this kernel is then given by Eq. (3) with samples from $\tilde{\mu}_+$ and $\tilde{\mu}_-$. Therefore, Algorithm 1 is suitable for this kernel, and the distributions $\tilde{\mu}_+$ and $\tilde{\mu}_-$ can be acquired numerically from a set of 10,000 uniformly generated samples over a bounded frequency range. Based on the above result, we conclude that polynomial kernels on the unit sphere can be associated with an RKKS; Figure 1 shows the resulting decomposition of $\mu$. Compared to [pennington2015spherical], which uses a positive sum of Gaussians to approximate $\mu$, with the Gaussian parameters optimized beforehand, our algorithm achieves both simplicity and effectiveness by having (i) an unbiased estimator and (ii) no extra parameters.
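Since $\tilde{\mu}_\pm$ are only available numerically here, one practical recipe for the grid-based step above (our sketch; `signed_spectral_density` is a hypothetical stand-in for the tabulated transform of Eq. (4)) is to clip the tabulated density into its positive and negative parts and sample radii by inverse-CDF sampling; directions can be drawn uniformly on the sphere because $\mu$ is radial.

```python
import numpy as np

def sample_radii(density, grid, size, rng):
    """Inverse-CDF sampling from a nonnegative tabulated 1-D (radial) density."""
    cdf = np.cumsum(density)
    cdf = cdf / cdf[-1]
    return np.interp(rng.random(size), cdf, grid)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 50.0, 10_000)          # radial frequencies omega
mu = signed_spectral_density(grid)             # hypothetical tabulated signed mu
mu_pos, mu_neg = np.clip(mu, 0.0, None), np.clip(-mu, 0.0, None)
mass_pos = mu_pos.sum() * (grid[1] - grid[0])  # numerical total mass of mu_+

radii = sample_radii(mu_pos, grid, size=512, rng=rng)
U = rng.standard_normal((512, 16))             # d = 16, e.g., the letter dataset
U /= np.linalg.norm(U, axis=1, keepdims=True)  # uniform directions on the sphere
omegas_pos = radii[:, None] * U                # omega_i = r_i * u_i (mu is radial)
# Note: whether an omega^(d-1) surface factor is needed depends on how the
# radial density was tabulated; we assume it is already folded into `mu`.
```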

Next we consider the NTK of a two-layer ReLU network on the unit sphere [bietti2019inductive]. Since this kernel in fact consists of arc-cosine kernels [cho2009kernel], we discuss them together.

NTK of two-layer ReLU networks on the unit sphere: Bietti and Mairal [bietti2019inductive] consider a two-layer ReLU network of the form $f(\boldsymbol{x}) = \frac{1}{\sqrt{m}} \sum_{j=1}^m a_j\, \sigma(\boldsymbol{w}_j^\top \boldsymbol{x})$, with the parameters initialized according to the standard Gaussian distribution. By formulating ReLU as $\sigma(t) = \max(0, t)$, for $\boldsymbol{x}, \boldsymbol{y}$ on the unit sphere the corresponding NTK is [bietti2019inductive, chizat2019lazy]:

$$k_{\mathrm{NTK}}(\boldsymbol{x}, \boldsymbol{y}) = u\,\kappa_0(u) + \kappa_1(u), \qquad u := \boldsymbol{x}^\top \boldsymbol{y}, \qquad (5)$$

where $\kappa_0(u) = \frac{1}{\pi}\big(\pi - \arccos(u)\big)$ corresponds to the zero-order arc-cosine kernel and $\kappa_1(u) = \frac{1}{\pi}\big(u\,(\pi - \arccos(u)) + \sqrt{1 - u^2}\big)$ is the first-order arc-cosine kernel. Furthermore, this kernel is proved to be stationary but indefinite by the following theorem.

Theorem 4.

For any $\boldsymbol{x}, \boldsymbol{y}$ on the unit sphere $\mathbb{S}^{d-1}$, the NTK of a two-layer ReLU network of the form in Eq. (5) is shift-invariant, that is,

$$k_{\mathrm{NTK}}(\boldsymbol{x}, \boldsymbol{y}) = g\big(\|\boldsymbol{x} - \boldsymbol{y}\|_2^2\big),$$

where $g(t) := \kappa(1 - t/2)$ with $\kappa(u) := u\,\kappa_0(u) + \kappa_1(u)$. Specifically, the function $g$ is not positive definite.² (²The behavior of $g(t)$ for $t$ outside $[0, 4]$, i.e., off the sphere, is undefined; following [pennington2015spherical], we fix it there by convention.)

Proof.

The proof can be found in Supplementary Materials D. ∎

Remark: The indefiniteness of the NTK on the unit sphere motivates us to scrutinize the approximation performance, functional spaces, and generalization properties of over-parameterized networks in the future, which in turn expands the usage scope of indefinite kernels.

Since the above NTK on the unit sphere can be formulated in terms of arc-cosine kernels, we have the following direct corollary for arc-cosine kernels:

Corollary 4.1.

For any $\boldsymbol{x}, \boldsymbol{y}$ on the unit sphere $\mathbb{S}^{d-1}$, denote $\theta := \arccos(\boldsymbol{x}^\top \boldsymbol{y})$; then the zero-order arc-cosine kernel and the first-order arc-cosine kernel are both shift-invariant but indefinite.

Remark: (i) Obtaining $\mu$ via the Fourier transform of arc-cosine kernels and the NTK on the unit sphere appears non-trivial, due to a $d$-dimensional integration of Bessel functions. However, we still manage to obtain $|\mu|(\Omega) < \infty$ for the zero-order arc-cosine kernel according to Corollary 4.1; refer to Supplementary Materials E for details. Hence, Algorithm 1 is still suitable for this kernel.
(ii) If we relax the shift-invariant constraint in Eq. (1), its integration representation also covers some dot-product kernels, e.g., arc-cosine kernels and NTK kernels. For example, an arc-cosine kernel admits $k(\boldsymbol{x}, \boldsymbol{y}) = 2 \int \sigma(\boldsymbol{\omega}^\top \boldsymbol{x})\,\sigma(\boldsymbol{\omega}^\top \boldsymbol{y})\,\mathrm{d}\mu(\boldsymbol{\omega})$ with the standard Gaussian measure $\mu$, where $\sigma$ is the Heaviside step function in the zero-order case and the ReLU function in the first-order case. That is, in Eq. (1), the function $e^{\mathrm{i}\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{y})}$ is substituted by $\sigma(\boldsymbol{\omega}^\top \boldsymbol{x})\,\sigma(\boldsymbol{\omega}^\top \boldsymbol{y})$ for arc-cosine kernels. Likewise, we can devise the corresponding representation for the NTK according to Eq. (5). In the above cases, the coefficients are $a = 0$ and $b = 2$, and accordingly we can conduct the sampling strategy on the Gaussian measure for kernel approximation, as sketched below.
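The following minimal sketch illustrates this substituted representation using the classical results of [cho2009kernel]; the code and its normalization are our own example.

```python
import numpy as np

def arc_cosine_features(X, s, order, rng=None):
    """Monte Carlo features for arc-cosine kernels (Cho & Saul):
    k_n(x, y) = 2 * E_{w ~ N(0, I)}[sigma_n(w^T x) * sigma_n(w^T y)],
    where sigma_0 is the Heaviside step and sigma_1 is ReLU."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((s, X.shape[1]))
    Z = X @ W.T
    act = (Z > 0).astype(float) if order == 0 else np.maximum(Z, 0.0)
    return np.sqrt(2.0 / s) * act

# For unit vectors x, y with angle theta, phi(x) @ phi(y) concentrates around
#   order 0:  1 - theta / pi
#   order 1:  (sin(theta) + (pi - theta) * cos(theta)) / pi
```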

5 Experiments

We compare the proposed method with SRF (Spherical Random Features) [pennington2015spherical] and DIGMM (Double-Infinite Gaussian Mixture Model) [liu2019double] on four representative datasets: letter (https://archive.ics.uci.edu/ml/datasets.html), ijcnn1 and covtype (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and MNIST [L1998Gradient]; see Table 1. The datasets are normalized by a min-max scaling scheme and come with a pre-given training/test partition, except for covtype, which we randomly split in half into training and test sets. In our experiments, the indefinite kernels used are the polynomial kernel on the sphere with the parameter setting of [pennington2015spherical], and the Delta-Gaussian kernel with the parameter setting of [oglic18a]. We also include Random Maclaurin (RM) [kar2012random] and Tensor Sketch (TS) [Pham2013Fast] for polynomial kernel approximation. The reported error bars and standard deviations are obtained by repeating each experiment multiple times. All experiments are implemented in MATLAB and carried out on a PC with an Intel i7-8700K CPU (3.70 GHz) and 64 GB RAM. The source code of our implementation will be made public.

Figure 1: The nonnegative Borel measures (panels (a) and (b)) of the spherical polynomial kernel on the letter dataset.
Table 1: Benchmark datasets.
Datasets   d     #training   #test
letter     16    12,000      6,000
ijcnn1     22    49,990      91,701
covtype    54    290,506     290,506
MNIST      784   60,000      10,000
Figure 2: Comparisons of various algorithms in terms of approximation error for the polynomial kernel on the unit sphere (top) and the Delta-Gaussian kernel (bottom) on four datasets: (a) letter, (b) ijcnn1, (c) covtype, (d) MNIST.

Kernel approximation: The relative error $\|\mathbf{K} - \widetilde{\mathbf{K}}\|_{\mathrm{F}} / \|\mathbf{K}\|_{\mathrm{F}}$ is chosen to measure approximation quality, where $\mathbf{K}$ and $\widetilde{\mathbf{K}}$ denote the exact kernel matrix on 1,000 randomly selected samples and its approximation, respectively. Figure 2 shows the approximation error for the two indefinite kernels as a function of the number of random features $s$. Our method always achieves lower approximation error than the other algorithms on these datasets. A closer look at Delta-Gaussian kernel approximation shows that our approach significantly improves the approximation quality compared to SRF and DIGMM. The larger approximation error of SRF stems from its two approximation steps: (i) approximating the indefinite kernel by a PD kernel and (ii) approximating this PD kernel by a sum of Gaussians; DIGMM only focuses on approximating a subset of the kernel matrix. Different from these two, our method directly approximates the indefinite kernel function by an unbiased estimator, which incurs no extra loss for kernel approximation.
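In code, this metric is simply the following one-line sketch, with `K` the exact Gram matrix and `K_hat` its random-feature approximation (e.g., from the `signed_rff` sketch above):

```python
import numpy as np

rel_err = np.linalg.norm(K - K_hat, 'fro') / np.linalg.norm(K, 'fro')
```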

Figure 3: Comparisons of various algorithms in terms of classification accuracy with libSVM for the polynomial kernel on the unit sphere (top) and the Delta-Gaussian kernel (bottom) on four datasets: (a) letter, (b) ijcnn1, (c) covtype, (d) MNIST.

Classification with a linear SVM: The obtained randomized feature maps are used for classification by training a linear classifier with liblinear [fan2008liblinear].⁵ (⁵Though learning with non-PD kernels is non-convex, the optimization algorithm in liblinear still converges.) The balance parameter of the linear SVM is tuned by five-fold cross validation over a grid of candidate values. The test accuracies of the various algorithms are shown in Figure 3. As expected, higher-dimensional randomized feature maps yield higher classification accuracy. Our method achieves the best performance in most cases.

Figure 4: Comparisons of computational time for generating the randomized feature maps for the polynomial kernel on the unit sphere (top) and the Delta-Gaussian kernel (bottom) on four datasets: (a) letter, (b) ijcnn1, (c) covtype, (d) MNIST.

Computational time: Figure 4 shows the time spent on generating the randomized feature maps. Admittedly, our method takes slightly more time than SRF [pennington2015spherical], as our feature map introduces the extra imaginary part. However, on each dataset, SRF first needs to obtain the parameters of a sum of Gaussians by an off-line grid search scheme. This extra cost often amounts to tens of seconds and is not included in the reported results.

6 Conclusion

We answer the long-standing open question on indefinite kernels via the introduced measure decomposition technique. Accordingly, we develop a general random features algorithm with unbiased estimation for various kernels that are indefinite and/or non-stationary. Besides, our finding on the indefiniteness of the NTK on the unit sphere encourages closer scrutiny of the approximation performance, functional spaces, and generalization properties of over-parameterized networks in the future. Moreover, the mathematical technique in this paper, Fourier analysis in Schwartz spaces, can also be used to study ReLU networks in the Fourier domain, where the ReLU activation function is likewise non-square-integrable; refer to [rahaman2019spectral] for details. This expands the usage of indefinite kernels and Fourier analysis to neural networks.

Broader Impact

This is a theoretical paper that investigates the decomposition of signed measures for random features. This work gives further insight into kernel approximation, and hence might inspire new practical ideas for kernel-based learning tasks. Our work can be used to speed up kernel methods in large-scale settings, which benefits contemporary data analysis. We foresee no unfair or offensive societal consequences of the developed technique.

Acknowledgement

This work was supported in part by the European Research Council under the European Union’s Horizon 2020 research and innovation program / ERC Advanced Grant E-DUALITY (787960), in part by the National Natural Science Foundation of China 61977046, in part by the National Key Research and Development Project (No. 2018AAA0100702). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information; Research Council KUL C14/18/068; Flemish Government FWO project GOA4917N; Onderzoeksprogramma Artificiele Intelligentie (AI) Vlaanderen programme.

References

Appendix A Preliminaries of RKKS

Here we briefly review Kreĭn spaces and the reproducing kernel Kreĭn space (RKKS). Detailed expositions can be found in the book [bognar1974indefinite]. Most readers will be familiar with Hilbert spaces; Kreĭn spaces share some properties of Hilbert spaces but differ in some key aspects, which we emphasize as follows.

Kreĭn spaces are indefinite inner product spaces endowed with a Hilbertian topology.

Definition 2.

(Kreĭn space [bognar1974indefinite]) An inner-product space $(\mathcal{K}, \langle \cdot, \cdot \rangle_{\mathcal{K}})$ is a Kreĭn space if there exist two Hilbert spaces $\mathcal{H}_+$ and $\mathcal{H}_-$ such that
i) every $f \in \mathcal{K}$ can be decomposed into $f = f_+ + f_-$, where $f_+ \in \mathcal{H}_+$ and $f_- \in \mathcal{H}_-$, respectively;
ii) $\langle f, g \rangle_{\mathcal{K}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} - \langle f_-, g_- \rangle_{\mathcal{H}_-}$ for all $f, g \in \mathcal{K}$.

Accordingly, the Kreĭn space can be decomposed into a direct sum $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$. Besides, the inner product on $\mathcal{K}$ is non-degenerate, i.e., for $f \in \mathcal{K}$, if $\langle f, g \rangle_{\mathcal{K}} = 0$ for all $g \in \mathcal{K}$, then $f = 0$. From the definition, the decomposition is not necessarily unique; for a fixed decomposition, the inner product is given accordingly [Ga2016Learning, oglic18a]. The key difference from Hilbert spaces is that inner products may be negative in Kreĭn spaces, i.e., there may exist $f \in \mathcal{K}$ such that $\langle f, f \rangle_{\mathcal{K}} < 0$. If $\mathcal{H}_+$ and $\mathcal{H}_-$ are two RKHSs, the Kreĭn space $\mathcal{K}$ is an RKKS associated with a unique indefinite reproducing kernel $k$ such that the reproducing property holds, i.e., $f(\boldsymbol{x}) = \langle f, k(\boldsymbol{x}, \cdot) \rangle_{\mathcal{K}}$.

Proposition 1.

(Positive decomposition [bognar1974indefinite]) Let $k$ be a real-valued kernel function. Then there exists an associated reproducing kernel Kreĭn space identified with the reproducing kernel $k$ if and only if $k$ admits a positive decomposition $k = k_+ - k_-$, where $k_+$ and $k_-$ are two positive definite kernels.

From the definition, this decomposition is not necessarily unique. Typical examples include a wide range of commonly used indefinite kernels, such as linear combinations of PD kernels [Cheng2005Learning] and conditionally PD kernels [schaback1999native, wendland2004scattered]. It is important to note that not every indefinite kernel function admits a representation as a difference of two positive definite kernels.

Appendix B Proof of Theorem 3

Proof.

(i) Necessity.
A stationary indefinite kernel $k$ associated with an RKKS admits the positive decomposition

$$k = k_+ - k_-,$$

where $k_+$ and $k_-$ are two positive definite kernels. According to Bochner’s theorem, there exist two finite nonnegative Borel measures $\mu_+$ and $\mu_-$ such that, by denoting $\boldsymbol{z} := \boldsymbol{x} - \boldsymbol{y}$,

$$k(\boldsymbol{z}) = \int_{\mathbb{R}^d} e^{\mathrm{i}\boldsymbol{\omega}^\top \boldsymbol{z}}\,\mathrm{d}\mu_+(\boldsymbol{\omega}) - \int_{\mathbb{R}^d} e^{\mathrm{i}\boldsymbol{\omega}^\top \boldsymbol{z}}\,\mathrm{d}\mu_-(\boldsymbol{\omega}).$$

Denote $\mu := \mu_+ - \mu_-$; it is clear that $\mu$ is a signed measure, and its total mass is finite since $|\mu|(\mathbb{R}^d) \leq \mu_+(\mathbb{R}^d) + \mu_-(\mathbb{R}^d) < \infty$.

(ii) Sufficiency.
Let $\Omega := \mathbb{R}^d \setminus \{\boldsymbol{0}\}$, let $\mathcal{A}$ be the smallest $\sigma$-algebra containing all open subsets of $\Omega$, and let $\mu$ be the (generalized) Fourier transform of $k$ restricted to $\Omega$. Since we assume that $\mu$ has finite total mass except at the origin, $|\mu|(\Omega) < \infty$ indicates that $\mu$ is a finite signed measure (except at the origin). By virtue of the Jordan decomposition, there exist two nonnegative finite measures $\mu_+$ and $\mu_-$ such that $\mu = \mu_+ - \mu_-$. By using the inverse Fourier transform and Plancherel’s theorem [donoghue2014distributions], we have

$$k(\boldsymbol{z}) = \int_{\Omega} e^{\mathrm{i}\boldsymbol{\omega}^\top \boldsymbol{z}}\,\mathrm{d}\mu_+(\boldsymbol{\omega}) - \int_{\Omega} e^{\mathrm{i}\boldsymbol{\omega}^\top \boldsymbol{z}}\,\mathrm{d}\mu_-(\boldsymbol{\omega}),$$

where $\mu_+$ and $\mu_-$ are two nonnegative Borel measures corresponding, by Bochner’s theorem, to two positive definite kernels $k_+$ and $k_-$, respectively. By defining $k_+$ and $k_-$ as these two integrals, we obtain the positive decomposition $k = k_+ - k_-$.

This completes the proof. ∎

Appendix C Polynomial kernels on the unit sphere with finite total mass

We consider the asymptotic properties of the Bessel function of the first kind $J_\nu(\omega)$ in the small-$\omega$ and large-$\omega$ regimes to study the total mass $|\mu|(\Omega)$.

C.1 Small $\omega$

Consider the asymptotic behavior for small $\omega$. The Bessel function of the first kind is asymptotically equivalent to

$$J_\nu(\omega) \sim \frac{1}{\Gamma(\nu + 1)}\left(\frac{\omega}{2}\right)^{\nu} \quad \text{as } \omega \to 0.$$

In this case, the measure $\mu(\omega)$ is formulated as

(6)

which can be regarded as a generalized version of a uniform distribution. Therefore, $\mu(\omega)$ is absolutely integrable over a finite range $[0, C]$, where $C$ is some constant satisfying $0 < C < \infty$.

C.2 Large $\omega$

Consider the asymptotic behavior for large $\omega$. The Bessel function of the first kind is asymptotically equivalent to

$$J_\nu(\omega) \sim \sqrt{\frac{2}{\pi \omega}}\,\cos\!\left(\omega - \frac{\nu \pi}{2} - \frac{\pi}{4}\right) \quad \text{as } \omega \to \infty.$$

The Fourier transform of the polynomial kernel on the sphere, i.e., the measure $\mu(\omega)$, is hence given by [pennington2015spherical]

(7)

In this way, the tail of $|\mu(\omega)|$ is absolutely integrable for $\omega > C$, with $C$ the constant above satisfying $0 < C < \infty$.

Accordingly, combining Eq. (7) with Eq. (6), we conclude that

$$|\mu|(\Omega) = \int_0^{C} |\mu(\omega)|\,\mathrm{d}\omega + \int_{C}^{\infty} |\mu(\omega)|\,\mathrm{d}\omega < \infty,$$

where the first integral is finite because the Bessel function is continuous and bounded on the finite region $[0, C]$.

Appendix D Proof of Theorem 4

To prove Theorem 4, we first derive the kernel’s formulation on the unit sphere and then demonstrate that it is shift-invariant but not positive definite via completely monotone functions.

Definition 3.

(Completely monotone [schoenberg1938metric]) A function $g$ is called completely monotone on $(0, \infty)$ if it satisfies $g \in C^{\infty}(0, \infty)$ and

$$(-1)^{n} g^{(n)}(r) \geq 0$$

for all $r > 0$ and all $n = 0, 1, 2, \ldots$. Moreover, $g$ is called completely monotone on $[0, \infty)$ if it is additionally defined and continuous at $0$.

Note that the definition of completely monotone functions can also be restricted to a finite interval, i.e., $g$ is completely monotone on $[0, c]$; see [pennington2015spherical].

Besides, for the proof we need the following lemma, which connects positive definite functions with completely monotone functions.

Lemma 1.

(Schoenberg’s theorem [schoenberg1938metric]) A function $g$ is completely monotone on $[0, \infty)$ if and only if $g(\|\cdot\|_2^2)$ is a radial positive definite function on $\mathbb{R}^d$ for every $d$.

Now let us prove Theorem 4.

Proof.

By virtue of $\|\boldsymbol{x}\|_2 = \|\boldsymbol{y}\|_2 = 1$, we have $\boldsymbol{x}^\top \boldsymbol{y} = 1 - \|\boldsymbol{x} - \boldsymbol{y}\|_2^2 / 2$. Therefore, the standard NTK of a two-layer ReLU network can be formulated as

$$k_{\mathrm{NTK}}(\boldsymbol{x}, \boldsymbol{y}) = \kappa\big(1 - \|\boldsymbol{x} - \boldsymbol{y}\|_2^2/2\big) =: g\big(\|\boldsymbol{x} - \boldsymbol{y}\|_2^2\big),$$

which is shift-invariant.

Next, we prove that $g$ is not a positive definite kernel, i.e., $g$ is not a completely monotone function over $[0, 4]$ by Lemma 1. In other words, there exists some value $t \in [0, 4]$ such that $g'(t) > 0$. To this end, the function $g$ is given by

$$g(t) = \kappa(u), \qquad u = 1 - t/2, \qquad \kappa(u) = u\,\kappa_0(u) + \kappa_1(u),$$

and its first-order derivative is

$$g'(t) = -\frac{1}{2}\,\kappa'(u) = -\frac{1}{2}\left(2\kappa_0(u) + \frac{u}{\pi\sqrt{1 - u^2}}\right).$$

Since $g'$ is continuous on $(0, 4)$, with $g'(t) < 0$ as $t \to 0^+$ (where $u \to 1$) and $g'(t) > 0$ as $t \to 4^-$ (where $u \to -1$), there exists a constant $c \in (0, 4)$ such that $g' < 0$ over $(0, c)$ and $g' > 0$ over a subinterval of $(c, 4)$. That is to say, $g'(t) > 0$ holds for some $t$, which violates the definition of completely monotone functions. In this regard, $g$ is not a completely monotone function over $[0, 4]$ and thus is not positive definite. ∎
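A quick numerical check of this sign change (our sketch; $\kappa_0$ and $\kappa_1$ use the standard arc-cosine normalization, which may differ from the paper's constants by a positive factor that does not affect the sign argument):

```python
import numpy as np

def kappa(u):
    """NTK angular function for a two-layer ReLU network on the sphere."""
    k0 = (np.pi - np.arccos(u)) / np.pi                              # zero-order
    k1 = (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi  # first-order
    return u * k0 + k1

def g(t):
    # t = ||x - y||^2 in [0, 4]; on the unit sphere u = x^T y = 1 - t/2.
    return kappa(1.0 - t / 2.0)

def g_prime(t, h=1e-6):
    return (g(t + h) - g(t - h)) / (2.0 * h)   # central finite difference

print(g_prime(1.0))   # negative: g decreases near the origin
print(g_prime(3.8))   # positive: g increases near t = 4, so g is not
                      # completely monotone and the NTK is not PD
```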

Appendix E The measure of the zero-order arc-cosine kernel

In this section, we derive the measure $\mu$ of the zero-order arc-cosine kernel, which admits $k_0(\boldsymbol{x}, \boldsymbol{y}) = 1 - \theta/\pi$ with $\theta = \arccos(\boldsymbol{x}^\top \boldsymbol{y})$ on the unit sphere. Accordingly, we have

(8)

where $k_0(\boldsymbol{z})$ is a radial function, i.e., $k_0(\boldsymbol{z}) = k_0(r)$ with $r = \|\boldsymbol{z}\|_2$, and thus its Fourier transform is also a radial function, i.e., $\mu(\boldsymbol{\omega}) = \mu(\omega)$ with $\omega = \|\boldsymbol{\omega}\|_2$. Obviously, the integrand in Eq. (8) and the integration region are both bounded, and thus $\mu(\omega)$ is well-defined and finite for every $\omega$. Following the proof of $|\mu|(\Omega) < \infty$ for polynomial kernels on the unit sphere in Section C, we can likewise demonstrate that $|\mu|(\Omega) < \infty$ for the zero-order arc-cosine kernel on the unit sphere.

To compute the integration in Eq. (8), we take a Taylor expansion of the integrand with finitely many terms,

so that the integration in Eq. (8) can be computed term by term with respect to Bessel functions. Moreover, by virtue of standard Bessel-function identities, the above integral can be computed by parts:

(9)

where the first term equals