On the numerical rank of radial basis function kernel matrices in high dimension

06/23/2017 · Ruoxi Wang et al. · Stanford University · Duke University

Low-rank approximations are popular techniques for reducing the high computational cost of large-scale kernel matrices, which arise in many applications. The success of low-rank methods hinges on the rank of the matrix, and in practice these methods are effective even for high-dimensional datasets. This practical success motivates the theoretical analysis of the rank carried out in this paper. We consider radial basis function (RBF) kernels and present theorems on the rank and on approximation error bounds. Our three main results are as follows. First, the rank of RBF kernels grows at most polynomially with the data dimension in the worst case; second, precise approximation error bounds are derived in terms of function properties and the structure of the data; and last, a group pattern in the decay of singular values of RBF kernel matrices is analyzed and explained by a grouping of the expansion terms. Empirical results verify and illustrate the theoretical results.


1 Introduction

Kernel matrices [38, 7, 6] are widely used across fields including machine learning, inverse problems, graph theory, and PDEs [39, 29, 20, 22, 21, 34]. The ability to generate data at the scale of millions and even billions of points has increased rapidly, posing computational challenges for systems involving large-scale matrices. The lack of scalability has made algorithms that accelerate matrix computations particularly important.

Algebraic algorithms have been proposed to reduce the computational burden, mostly based on low-rank approximations of the matrix or of certain submatrices [39]. The singular value decomposition (SVD) [18] is optimal but has an undesirable cubic complexity. Many methods [25, 19, 23, 30, 9, 17, 8, 43] have been proposed to accelerate the low-rank constructions with an acceptable loss of accuracy. The success of these low-rank algorithms hinges on a large spectral gap or a fast decay of the spectrum of the matrix itself or of its submatrices. However, to our knowledge, there is no theoretical guarantee that these conditions always hold.

Nonetheless, algebraic low-rank techniques are effective in many cases where the data dimension ranges from moderate to high, motivating us to study the growth rate of matrix ranks in high dimensions. A precise analysis of the matrix rank is nontrivial, so we turn to analyzing an upper bound on it, namely the function rank of the kernel, to be defined in what follows. The function rank is the number of terms in the minimal separable form of the kernel K(x, y), that is, when K(x, y) is approximated by a finite sum of separable products g_i(x) h_i(y), where the g_i and h_i are real-valued functions. If the function rank does not grow exponentially with the data dimension, neither will the matrix rank.

If, however, we expand the multivariate kernel naively by expanding each variable in turn using a basis of r functions per dimension, the resulting representation has on the order of r^d terms, which grows exponentially with the data dimension d. The exponential growth is striking in the sense that even for a moderate dimension, a reasonable accuracy would be difficult to achieve. However, in practice, people have observed much lower matrix ranks. A plausible reason is that both the functions and the data of practical interest enjoy special properties, which should be taken into account when carrying out the analysis.
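To make the contrast concrete, the following sketch (with hypothetical parameter choices, not an experiment from the paper) compares the tensor-product term count r^d with the numerical rank of a Gaussian kernel matrix at a fixed tolerance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r, tol = 6, 1000, 5, 1e-6      # dimension, points, basis size per dim, tolerance

X = rng.uniform(0.0, 1.0, size=(n, d)) / np.sqrt(d)     # keep ||x - y||^2 = O(1)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)                                    # Gaussian kernel, f(r) = exp(-r)

s = np.linalg.svd(K, compute_uv=False)
num_rank = int(np.sum(s / s[0] > tol))

print("tensor-product terms r^d:", r ** d)               # 15625 for r = 5, d = 6
print("numerical rank of K at tol", tol, ":", num_rank)
```

The printed numerical rank is typically far smaller than r^d; the precise value depends on the tolerance and the point distribution.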

The aim of this paper, therefore, is to analytically describe the relationship between the function rank and the properties of the function and the data, including measures of function smoothness, the data dimension, and the domain diameter. Such a relation has not been described before. We hope that the conclusions of this paper about functions can provide some theoretical foundation for the practical success of low-rank matrix algorithms.

In this paper, we present three main results. First, we show that under common smoothness assumptions and up to some precision, the function rank of RBF kernels grows polynomially with increasing dimension in the worst case. Second, we provide explicit error bounds for the low-rank approximations of RBF kernel functions. And last, we explain the observed “decay-plateau” behavior of the singular values of smooth RBF kernel matrices.

1.1 Related Work

There has been extensive interest in kernel properties in a high-dimensional setting.

One line of research focuses on the spectrum of kernel matrices. There is a rich literature on the smallest eigenvalues, mainly concerning matrix conditioning. Several papers [1, 27, 31, 32] provided lower bounds for the smallest (in magnitude) eigenvalues. Some work further studied the eigenvalue distributions. Karoui [11] obtained the limiting spectral distribution by applying a second-order Taylor expansion to the kernel function; in particular, for kernel matrices whose entries are functions of inner products or squared distances of the data points, he showed that as the data dimension grows, the spectral properties approach those of the data covariance matrix. Wathen and Zhu [40] described the eigenvalue distribution of RBF kernel matrices more explicitly; specifically, they provided formulas for the number of eigenvalues that decay at a given rate as the shape parameter tends to zero. This group pattern in the eigenvalues was observed earlier in [14] but without explanation. The same pattern also occurs in the coefficients of the orthogonal expansion in the RBF-QR method proposed in [12]. There have also been studies focusing on the "flat-limit" regime in which the shape parameter tends to zero [10, 33, 13].

Another line of research focuses on developing efficient methods for function expansion and interpolation. The goal is to diminish the exponential dependence on the data dimension introduced by tensor-product based approaches. Barthelmann et al. [2] considered polynomial interpolation on a sparse grid [15]. Sparse grids are based on a high-dimensional multiscale basis and involve only O(N (log N)^(d-1)) degrees of freedom, where N is the number of grid points in one coordinate direction at the boundary, in contrast with the O(N^d) degrees of freedom of tensor-product grids. Barthelmann et al. further showed that, for a prescribed accuracy, the number of selected points grows at a rate governed by a parameter related to the smoothness of the function.

Trefethen [37] commented that to ensure a uniform resolution in all directions, the Euclidean degree of a polynomial (defined as the 2-norm of the multi-index of exponents) may be the most useful notion of degree. He investigated the complexity of polynomials whose degree is defined by the 1-, 2-, and ∞-norms, and concluded that using the 2-norm achieves accuracy similar to the ∞-norm but with far fewer points.

1.2 Main Results

In this paper, we study radial basis functions (RBFs). RBFs are functions whose value depends only on the distance to the origin. In our manuscript, for convenience, we consider the form K(x, y) = f(‖x − y‖²). The square inside the argument of f ensures that if f is smooth then the RBF kernel is smooth as well. If we instead used f(‖x − y‖) and picked, for example, f(r) = e^{−r}, the kernel would not be differentiable at x = y.
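As a concrete illustration (a minimal sketch, not code from the paper), a kernel matrix with entries f(‖x_i − y_j‖²) can be assembled as follows; here f(r) = exp(−r) gives the Gaussian kernel and f(r) = 1/(1 + r) gives the Cauchy kernel that appears in Figure 1:

```python
import numpy as np

def rbf_kernel_matrix(X, Y, f):
    """K[i, j] = f(||x_i - y_j||^2) for rows of X and Y."""
    sq_dists = (
        (X ** 2).sum(1)[:, None]         # ||x_i||^2
        - 2.0 * X @ Y.T                  # -2 x_i . y_j
        + (Y ** 2).sum(1)[None, :]       # ||y_j||^2
    )
    return f(np.maximum(sq_dists, 0.0))  # clip tiny negatives from round-off

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 6)) / np.sqrt(6)
K_gauss  = rbf_kernel_matrix(X, X, lambda r: np.exp(-r))       # Gaussian
K_cauchy = rbf_kernel_matrix(X, X, lambda r: 1.0 / (1.0 + r))  # Cauchy
```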

We define the numerical function rank of a kernel K(x, y) relative to an error ε, to which we will frequently refer, as the smallest integer r such that

‖K(x, y) − Σ_{i=1}^{r} g_i(x) h_i(y)‖_∞ ≤ ε,

where the g_i and h_i are real-valued functions on the data domain. The separable form Σ_{i=1}^{r} g_i(x) h_i(y) will be referred to as a low-rank kernel, or a low-rank representation of rank at most r. Note that this rank definition concerns the function rank rather than the matrix rank.

Our two main results are as follows. First, we show that under common smoothness assumptions on RBFs and for a fixed precision, the function rank of RBF kernels is a polynomial function of the data dimension d; specifically, the function rank grows as O(d^k), where k is related to the low-rank approximation error. Furthermore, precise and detailed error bounds will be proved.

Figure 1: Group patterns in singular values. The singular values are normalized and ordered such that σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_n. The data were randomly generated in dimension 6 with the default random seed in MATLAB. The legend shows the data size and the kernel functions: Cauchy (f(r) = 1/(1 + r)) and Gaussian (f(r) = e^{−r}), applied to r = ‖x − y‖².

Second, we observe that the singular values of RBF kernel matrices form groups with plateaus. A pictorial example is shown in Figure 1: there are five groups (plateaus) of singular values with a sharp drop in magnitude between consecutive groups; the group cardinalities depend on the data dimension but are independent of the data size. We explain this phenomenon by applying an appropriate analytic expansion of the kernel function and grouping the expansion terms appropriately.
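The pattern can be reproduced with a few lines of code (a sketch with arbitrary parameters, not the paper's exact experiment). The reference group sizes printed at the end follow the interpretation, developed in Section 6, that the k-th group corresponds to the expansion terms of total degree k in d variables; they are included here only as a point of comparison.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, n = 6, 1000
X = rng.random((n, d))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
for name, K in [("Gaussian", np.exp(-sq)), ("Cauchy", 1.0 / (1.0 + sq))]:
    s = np.linalg.svd(K, compute_uv=False)
    print(name, "normalized singular values 1..30:")
    print(np.array2string(s[:30] / s[0], precision=2))

# candidate group sizes: number of monomials of degree k in d variables, C(k+d-1, k)
print("group sizes for d = 6:", [comb(k + d - 1, k) for k in range(5)])  # [1, 6, 21, 56, 126]
```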

1.3 Organization

This paper is organized as follows. Section 2 presents our theorems concerning the function rank of the approximation of the RBF kernel function, and the error bound of the approximations. Section 3 provides the theorem proofs. Section 4 shows that for a fixed precision, the polynomial growth rate of the derived rank cannot be improved. Section 5 verifies our theorems experimentally. Finally, in Section 6, we investigate and discuss the group pattern in the singular values of RBF kernel matrices.

2 Main theorems

In this section, we present theorems relating the function rank to properties of the function and of the data. Each theorem approximates the RBF kernel in the L∞ norm by a low-rank kernel, with the function rank and the error bound given by explicit formulas. We briefly describe the theorems and then delve into the details.

The first four theorems consider kernels with two types of smoothness assumptions, and for each type, we present the deterministic result and the probabilistic result in two theorems, respectively. The probabilistic results take into account the concentration of measure for large data dimensions. The separable form is obtained by applying a Chebyshev expansion to f and then further expanding the resulting powers of ‖x − y‖².

The key advantage of this approach is that the accuracy of the expansion depends only on the scalar variable r = ‖x − y‖² rather than on the pair (x, y), which lies in a 2d-dimensional space. Assume we have expanded f to order n with a certain error. We then substitute r = ‖x − y‖², expand the result, and rearrange the terms to identify the number of distinct separable products of the form g(x) h(y) in the final representation. This number is our upper bound on the function rank.
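The counting step can be mimicked symbolically (a small sketch with arbitrary n and d, not the paper's derivation): expand a degree-n polynomial in ‖x − y‖², collect it with respect to the x-variables, and count how many distinct x-factors appear. Each distinct x-factor can be paired with a single y-dependent factor, so the count is a valid upper bound on the function rank, although it is looser than the bound derived in the proofs.

```python
import sympy as sp

d, n = 3, 2                                     # dimension and expansion order (arbitrary)
x = sp.symbols(f"x0:{d}")
y = sp.symbols(f"y0:{d}")

r = sum((xi - yi) ** 2 for xi, yi in zip(x, y))       # ||x - y||^2
poly = sp.expand(sum(r ** k for k in range(n + 1)))   # stand-in for a degree-n expansion of f

# collect with respect to monomials in x; each distinct x-monomial gives one
# separable term g(x) h(y), so their count upper-bounds the function rank
x_monomials = set()
for term in sp.Add.make_args(poly):
    x_part = sp.Mul(*[fac for fac in sp.Mul.make_args(term) if fac.free_symbols & set(x)])
    x_monomials.add(x_part)
print("distinct separable terms:", len(x_monomials))
```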

The theorems show that for a fixed precision, the function rank grows polynomially with the data dimension d, and that the error of the low-rank approximations decreases as the diameter of the domain containing x and y decreases.

The last theorem considers kernels under a finite-smoothness assumption. The separable form is obtained by applying a Fourier expansion to f followed by a Taylor expansion of each Fourier term. In addition to what the previous theorems suggest, the formulas for the error and the function rank capture subtler relations between the parameters, and the theorem shows that the error decreases when either the diameter of the domain containing x or the diameter of the domain containing y decreases. Before presenting our theorems, we introduce some notation.

Notation. Let E[·] and Var[·] denote the expectation and variance, respectively. Let E_ρ be the Bernstein ellipse defined on [−1, 1] with parameter ρ, an open region bounded by an ellipse with foci ±1. For an arbitrary interval, the ellipse is scaled and shifted and is referred to as the transformed Bernstein ellipse: given an interval [a, b], let T be the linear mapping from [−1, 1] to [a, b]; the transformed Bernstein ellipse for [a, b] is then defined to be T(E_ρ). In this case, the parameter ρ still characterizes the shape of the transformed ellipse. Therefore, throughout this paper, when we say a transformed Bernstein ellipse with parameter ρ, we refer to the parameter of the Bernstein ellipse defined on [−1, 1]. Let the function domain be the product of a target domain (containing the points x) and a source domain (containing the points y). We assume the domain is not a lower-dimensional manifold, on which lower ranks could be expected. We will also consider sub-domains containing the data of interest, whose diameters enter the error bounds below.
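For reference, the standard definition of the Bernstein ellipse in the convention of [36] is reproduced below (added for the reader's convenience; the symbols E_ρ and u are not necessarily the paper's original notation):

```latex
% Bernstein ellipse with parameter rho > 1: the open region with foci at +-1
% obtained as the image of the annulus 1 <= |u| < rho under the Joukowski map.
\[
  E_\rho \;=\; \left\{ \tfrac{1}{2}\bigl(u + u^{-1}\bigr) \;:\; u \in \mathbb{C},\ 1 \le |u| < \rho \right\}.
\]
% Its semi-major and semi-minor axis lengths are (rho + rho^{-1})/2 and (rho - rho^{-1})/2.
```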

The following theorems assume the bandwidth parameter of the kernel to be fixed at 1. A scaled kernel is not considered separately because it can be handled by rescaling the data points instead. We start with some assumptions on the kernel type, the function domain, and the probability distribution that will be used in the theorems, and then we present our theorems.

RBF Kernel Assumption.

Consider a function f and the kernel function K(x, y) = f(‖x − y‖²) with x in the target domain and y in the source domain, both subsets of ℝ^d. We assume that the data are scaled so that each coordinate is bounded by a constant multiple of 1/√d, with the constant independent of d; this implies that ‖x − y‖² lies in a bounded interval [0, D] with D independent of d.

Analytic Assumption.

The function f is analytic in [0, D], is analytically continuable to a transformed Bernstein ellipse with parameter ρ, and satisfies |f| ≤ M inside the ellipse.

Finite Smoothness Assumption.

The function f and its derivatives through order ν − 1 are absolutely continuous on [0, D], and the ν-th derivative f^(ν) has total variation bounded by V on [0, D].

Probability Distribution Assumption.

The coordinates of x and of y are i.i.d. random variables with finite means, and their second moments exist. Let μ = E[‖x − y‖²] and σ² = Var[‖x − y‖²]. Then μ = Θ(1) with respect to d, i.e., the mean squared distance between pairs of points neither goes to 0 nor to infinity as d grows, and σ² tends to 0 as d → ∞ (a concentration of measure).

Theorem 2.1.

Suppose the RBF Kernel Assumption and the Analytic Assumption hold. Then, for any expansion order n, the kernel can be approximated in the L∞ norm by a low-rank kernel of function rank at most

(1)

where {g_i} and {h_i} are two sequences of d-variable polynomials, and the error term is bounded as

(2)

Remark. If an approximation with a given maximal rank is requested, we need to select an expansion order n such that the rank bound in Eq. 1 does not exceed the requested rank. We then obtain an approximation with the error given by Eq. 2 and function rank at most the requested value. The low-rank kernel is a polynomial whose order can be read off from its explicit form in the proof (see Section 3.2). For the space of d-variate polynomials with maximum total degree n, the dimension is C(n + d, d). In contrast, our upper bound in Eq. 1 becomes favorable for a large range of n when d is large.

Corollary 2.2.

Under the same assumptions as in Theorem 2.1 and with the kernel fixed, the low-rank kernel approximation, for a fixed precision ε, is achievable with a rank proportional to a power of d, where the exponent and the proportionality constant are positive constants depending on ε and the kernel.

The proofs of Theorem 2.1 and Corollary 2.2 can be found in Section 3.1 and Section 3.2, respectively.

Theorem 2.1 suggests that for a given precision ε, the function rank grows polynomially with increasing data dimension d, i.e., as O(d^k), where k is determined by the desired precision ε. This can be seen from the rank bound in Eq. 1 with the expansion order n fixed and d → ∞.

For a fixed kernel and for a sub-domain with a smaller diameter, the error bound decreases, in the following sense: the same function restricted to the sub-domain can be analytically extended to a transformed Bernstein ellipse whose parameter is larger than ρ, which reduces the error bound in Eq. 2. Therefore, when the diameter of the domain that contains our data decreases, we will observe a lower approximation error for low-rank approximations with a fixed function rank, and similarly, we will observe a lower function rank for low-rank approximations with a fixed accuracy.

Along the same line of reasoning, for a fixed kernel on a fixed domain, when the point sets become denser, we should expect the function rank to remain unchanged for a fixed precision. The result for function ranks turns out to be in perfect agreement with the observations in practical situations on matrix ranks, assuming there are sufficiently many points to make the matrix rank visible before reaching a given precision.

We now turn to the case when d is large. Because of the assumptions on x and y, by concentration of measure the values of ‖x − y‖² fall into a small subinterval of [0, D] with high probability. Therefore, we are interested in quantifying this probabilistic error bound.
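A quick simulation (a sketch with an arbitrary coordinate distribution, intended only to illustrate the concentration effect) shows how the spread of ‖x − y‖² shrinks relative to its mean as d grows when the coordinates are scaled by 1/√d:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    # coordinates i.i.d. uniform on [0, 1/sqrt(d)] so that E[||x - y||^2] stays O(1)
    x = rng.random((5000, d)) / np.sqrt(d)
    y = rng.random((5000, d)) / np.sqrt(d)
    r = ((x - y) ** 2).sum(1)
    print(f"d={d:5d}  mean={r.mean():.3f}  std={r.std():.4f}  std/mean={r.std()/r.mean():.3f}")
```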

Theorem 2.3.

Suppose the RBF Kernel Assumption and the Analytic Assumption hold, and the points x and y are sampled under the probability distribution involving the mean μ and variance σ² of the Probability Distribution Assumption. Define the shifted function g(r) = f(r + μ). Then g is analytic in an interval centered at 0, with the parameter of its transformed Bernstein ellipse denoted ρ′, and |g| ≤ M′ inside the ellipse. Defining the same error as in Theorem 2.1,

(3)

we obtain that for t > 0, with probability at least

(4)

the error can be bounded by

With the same probability, the squared distance ‖x − y‖² of a sampled pair falls into the following interval:

The proof of Theorem 2.3 can be found in Section 3.3.

In Theorem 2.3, as d → ∞, the half-width t needs to decrease with d to maintain the same probability. If we choose t to decrease sufficiently slowly relative to the variance, then the probability remains close to 1. Moreover, we can keep the error small while reducing the expansion order n, because the effective domain of g shrinks. This means that for sufficiently large d and for a given error, the required expansion order goes down as d increases. Asymptotically, the required order reaches 0, and the function rank reaches 1. On the other hand, for a fixed order n, the error bound decreases when d increases.

Note that 2t is the size of the subinterval into which the values of ‖x − y‖² fall with the probability given by Eq. 4, and, by concentration of measure, with the same probability the interval size shrinks with increasing d. This is consistent with the earlier discussion that t needs to decrease with d to maintain the same probability.

The analytic assumption in Theorem 2.1 and Theorem 2.3 is very strong, because many RBFs are not infinitely differentiable when the domain contains zero. However, most RBFs of practical interest are ν-times differentiable for some finite ν. In the following theorem, we weaken the analytic assumption to a finite-smoothness assumption and compute the corresponding error bound.

Theorem 2.4.

Suppose the RBF Kernel Assumption and the Finite Smoothness Assumption hold. Then, for n > ν, the kernel can be approximated in the L∞ norm by a low-rank kernel of function rank at most

where {g_i} and {h_i} are two sequences of d-variable polynomials, and the error term is bounded as

(5)

Remark. We can weaken the assumption that f^(ν) has bounded total variation to f^(ν) being Lipschitz continuous, which imposes no assumption on the next derivative. With this weaker assumption, we obtain the same error rate; however, the trade-off is the absence of explicit constants in the upper bound of Eq. 5.

The proof of Theorem 2.4 can be found in Section 3.4.

Compared to Theorem 2.1, the convergence rate slows down from a geometric rate O(ρ^{−n}) to an algebraic rate O(n^{−ν}). Each time the function becomes one derivative smoother (ν increased by 1), the convergence rate becomes one order faster. The domain diameter D affects the error bound through a factor that grows with D^ν, where ν represents the smoothness of the function. For a sub-domain with diameter δ < D, it is straightforward to obtain that the error is bounded by the same expression with D replaced by δ, and for a fixed expansion order, a decrease in δ reduces the error.
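The contrast between geometric and algebraic convergence is easy to reproduce numerically (a sketch with arbitrarily chosen test functions, not the paper's kernels): an analytic function such as exp(−r) shows geometric decay of the Chebyshev approximation error, while a function with only a few derivatives, such as |r − 1/2|³, decays algebraically.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rr = np.linspace(0.0, 1.0, 3001)          # evaluate errors on [0, 1]
t = 2.0 * rr - 1.0                        # map to [-1, 1] for the Chebyshev basis

for name, vals in [("analytic exp(-r)", np.exp(-rr)),
                   ("finitely smooth |r-0.5|^3", np.abs(rr - 0.5) ** 3)]:
    for n in (4, 8, 16, 32, 64):
        # least-squares Chebyshev fit, used as a stand-in for the truncated series
        coeffs = C.chebfit(t, vals, n)
        err = np.max(np.abs(C.chebval(t, coeffs) - vals))
        print(f"{name:26s} n={n:3d}  max error = {err:.2e}")
```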

We also consider the phenomenon of concentration of measure and present the probabilistic result in the following theorem, again assuming that the coordinates of x and y are i.i.d. random variables whose second moments exist.

Theorem 2.5.

Suppose the RBF Kernel Assumption and the Finite Smoothness Assumption hold. We further assume that x and y are sampled under the probability distribution involving the mean μ and variance σ² of the Probability Distribution Assumption. Defining the same error as in Theorem 2.4,

then, for n > ν, we obtain the following bound

with probability at least

The proof of Theorem 2.5 can be found in Section 3.4.

Up to now, we have only considered a single parameter that characterizes the domain. To make the error bound more informative in response to subtler changes of the domain, we also consider the diameter D_X of the target domain and the diameter D_Y of the source domain. The following theorem quantifies the influence of D_X and D_Y on the error. Our result offers theoretical insight and motivation for the many algorithms that take advantage of the low-rank property of sub-matrices, where these sub-matrices usually correspond to data clusters of small diameters.

Theorem 2.6.

Suppose the RBF Kernel Assumption holds, and there are constants D_X and D_Y such that the target domain has diameter at most D_X and the source domain has diameter at most D_Y.

Let f̃ be a periodic extension of f·χ, where the cutoff function χ (see details in [5]) is 1 on [0, D] and smoothly decays to 0 outside of the interval. We assume that f̃ and its derivatives through order ν − 1 are continuous, and that the ν-th derivative is piecewise continuous with total variation over one period bounded by Ṽ.

Then, for a Fourier expansion of order m with Taylor expansions of order p, the kernel can be approximated by a low-rank kernel of rank at most

And the error is bounded by

The proof of Theorem 2.6 can be found in Section 3.5.

In contrast to the previous theorems, where the domain information enters the error only through the single diameter D, in Theorem 2.6 the diameters of the source domain and of the target domain also appear in the error. The form of the bound suggests that a decrease in the product D_X D_Y will reduce the error, which can be achieved when either the source or the target domain has a smaller diameter. This property has motivated people to approach matrix approximation problems by identifying low-rank blocks in a matrix, which is partially achieved by partitioning the data into clusters of small diameters.
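This observation is straightforward to check numerically (a sketch with synthetic clusters, not an experiment from the paper): the off-diagonal block of a Gaussian kernel matrix coupling two small, well-separated clusters has a much lower numerical rank than the full matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 400

def gauss_kernel(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

def num_rank(M, tol=1e-8):
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s / s[0] > tol))

cluster1 = 0.3 * rng.random((n, d))          # small-diameter target cluster
cluster2 = 0.3 * rng.random((n, d)) + 2.0    # small-diameter source cluster, well separated
all_pts = np.vstack([cluster1, cluster2])

print("numerical rank of full matrix:       ", num_rank(gauss_kernel(all_pts, all_pts)))
print("numerical rank of off-diagonal block:", num_rank(gauss_kernel(cluster1, cluster2)))
```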

The function rank still grows only polynomially, as O(d^p) when the Fourier order m and the Taylor order p are fixed and d → ∞. Here m represents the Fourier expansion order of f̃, and each term in the Fourier expansion is further expanded into Taylor terms up to order p; we assumed p to be the same across all Fourier terms for simplicity. If we decrease the Taylor order with increasing Fourier order, so as to preserve more information in the low-order Fourier terms, then a lower error bound can be attained for the same function rank.

Remark. We summarize the assumptions, error bounds, and function ranks of the theorems in Table 1, and discuss the similarities and differences in the function rank and the error bound. We refer to Theorem 2.1 and Theorem 2.4 as the Chebyshev approach and Theorem 2.6 as the Fourier-Taylor approach, based on their proof techniques. The function rank is determined by the data dimension and the expansion order: it is a power of the dimension, where the power is the expansion order and differs between the Chebyshev approach and the Fourier-Taylor approach. The error bounds quantify the influence of the expansion order and the domain diameter: a higher expansion order reduces the error bound, and so does a smaller domain diameter. The domain diameter occurs as a single parameter D in the Chebyshev approach, but as D, D_X, and D_Y in the Fourier-Taylor approach.

From the practical viewpoint, the absence of exponential growth of the function rank agrees with the practical observation of low matrix ranks for high-dimensional data. The fact that decreasing D_X or D_Y reduces the error is also in agreement with practice; moreover, it provides insight into why clustering the points and then interpolating locally often leads to a more memory-efficient approximation.

Columns (one per theorem): Deterministic (analytic), Theorem 2.1; Probabilistic (analytic), Theorem 2.3; Deterministic (finite smoothness), Theorem 2.4; Probabilistic (finite smoothness), Theorem 2.5; Deterministic (finite smoothness, Fourier-Taylor), Theorem 2.6.
Approach: Chebyshev expansion of f + exact expansion of the powers of ‖x − y‖² (Theorems 2.1, 2.3, 2.4, 2.5); Fourier expansion of the periodic extension + Taylor expansion of the cross term (Theorem 2.6).
Condition: the RBF Kernel Assumption in all cases; additionally, the Analytic Assumption (Theorems 2.1 and 2.3, the latter with the transformed Bernstein ellipse parameter of the shifted function), the Finite Smoothness Assumption (Theorems 2.4 and 2.5), and the Probability Distribution Assumption (Theorems 2.3 and 2.5); for Theorem 2.6, f̃ is the periodic extension of f·χ whose first ν − 1 derivatives are continuous and whose ν-th derivative is piecewise continuous with bounded total variation over one period.
Error and Rank: the explicit error and rank bounds stated in the corresponding theorems; the probabilistic bounds hold with the probabilities stated in Theorems 2.3 and 2.5.
Notation: n, Chebyshev expansion order; ν, smoothness order; ρ and M, parameter of the transformed Bernstein ellipse and the bound inside it; D, domain diameter; m, Fourier expansion order; p, Taylor expansion order.
Table 1: Theorem Summary

3 Theorem Proofs

In this section, we prove the theorems of Section 2. All the proofs consist of three components: separating the kernel into a finite sum of products of real-valued functions of x and of y, counting the terms to obtain an upper bound on the function rank, and calculating the error bound. Similar techniques can be found in [26, 33, 40, 44]. We describe the high-level procedure of the separation step; the remaining steps are straightforward.

In the proofs of Theorem 2.1 and Theorem 2.4, the separable form is obtained by first expanding the kernel into polynomials of ‖x − y‖² of a certain order, which settles the error bound, and then expanding the resulting powers of ‖x − y‖². The key advantage of this approach was discussed at the beginning of Section 2. We seek approximation theorems in 1D that provide optimal convergence rates and explicit error bounds; the Chebyshev theorems (Theorem 8.2 and Theorem 7.2 in [36]) are ideal choices. Analogous results exist, e.g., the classic Bernstein and Jackson approximation theorems (page 257 in [3]), but the downside is that they only provide an error rate rather than an explicit formula, and moreover they would not improve our results or simplify the proofs.
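For the reader's convenience, the two bounds from [36] invoked here can be stated as follows, for a function on [−1, 1] and its degree-n Chebyshev truncation f_n (reproduced in the notation of that book, not the paper's):

```latex
% Theorem 8.2 in [36] (analytic case): if f is analytic in the open Bernstein
% ellipse E_rho and |f| <= M there, then for each n >= 0
\[
  \|f - f_n\|_\infty \;\le\; \frac{2 M \rho^{-n}}{\rho - 1}.
\]
% Theorem 7.2 in [36] (finite smoothness): if f, f', ..., f^{(\nu-1)} are absolutely
% continuous and f^{(\nu)} has total variation V, then for n > \nu
\[
  \|f - f_n\|_\infty \;\le\; \frac{2 V}{\pi \nu \, (n - \nu)^{\nu}}.
\]
```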

In the proof of Theorem 2.6, the separable form is obtained by first applying a Fourier expansion to the periodic extension of f, in which each term factors except for a cross term involving x · y, and then applying a Taylor expansion to that cross term.

Before stating the detailed proofs, we introduce some notation that will be used.
Notation. For a multi-index α = (α_1, …, α_d) and a vector x = (x_1, …, x_d), we define x^α = x_1^{α_1} ⋯ x_d^{α_d} and |α| = α_1 + ⋯ + α_d, and the multinomial coefficient with |α| = k is defined to be k!/(α_1! ⋯ α_d!).

3.1 Proof of Theorem 2.1

We first introduce a lemma on the identity of binomial coefficients.

Lemma 3.1.

For nonnegative integers n and d, the following identity holds:

Proof.

The proof is by induction and follows that of Lemma 2.4 in [41]. ∎

Proof.

The proof consists of two components. First, we map the domain of f to [−1, 1] (for convenience) and approximate f by a Chebyshev polynomial; this settles the error bound. Second, we further separate the terms in the polynomial and count the number of distinct separable terms, which gives an upper bound on the function rank.


  • Approximation by Chebyshev polynomials. We first linearly map the domain of f to [−1, 1] and denote the new function accordingly:

    (6)

    Because the map is linear and invertible, the new function inherits the properties of f. From our assumptions, it is analytic in [−1, 1] and is analytically continuable to the open Bernstein ellipse with parameter ρ (consider a scaled and shifted ellipse).

    According to Theorem 8.2 in [36], which follows from [24], for n ≥ 0 we can approximate the function by its degree-n Chebyshev truncation in the L∞ norm with error

    (7)

    and

    (8)

    where the a_k are the Chebyshev coefficients and T_k is the Chebyshev polynomial of the first kind of degree k, defined by the relation:

    (9)

    Rearranging the terms in Eq. 8, we obtain a polynomial in ‖x − y‖²:

    (10)

    where the coefficients depend on f but are independent of x and y.

  • Separable form. We separate each term in Eq. 10 into a finite sum of separable products:

    (11)

    where the sum ranges over the appropriate multi-indices. Substituting Eq. 11 into Eq. 10, we obtain a separable form of the kernel:

    (12)

    where the coefficients are constants independent of x and y. Therefore, the function rank of the kernel can be upper bounded by the total number of separable terms:

    where the equality follows from Lemma 3.1. To summarize, we have proved that the kernel can be approximated by the separable form in Eq. 12 in the L∞ norm with rank at most

    (13)

    and approximation error

    (14)

3.2 Proof of Corollary 2.2

Proof.

For a fixed kernel function and fixed ρ and M, we define two constants c₁ and c₂ depending only on these quantities. Then, the truncation error can be rewritten as

and equivalently,

(15)

We now relate the function rank to the error ε and the dimension d. When d is large compared to the expansion order, we obtain

(16)

where the constant is fixed once ε is fixed. Therefore, an error ε is achievable with a function rank proportional to the stated power of d. ∎

3.3 Proof of Theorem 2.3

Proof.

We consider the concentration of measure phenomenon and apply concentration inequalities to obtain a probabilistic error bound. The proof mostly follows the proof of Theorem 2.1, and we will focus on computing the error bound for a smaller domain.

To simplify the proof, we consider a function shifted by the mean μ = E[‖x − y‖²], namely g(r) = f(r + μ), and we will see later that this shift ensures that the inputs to g fall into an interval centered around 0 with some probability. The shifted function g inherits the analyticity of f; therefore, it is analytic on the shifted interval and can be analytically extended to a transformed Bernstein ellipse with parameter ρ′.

Let us denote by Z_i the centered contribution of the i-th coordinate to ‖x − y‖², so that ‖x − y‖² − μ = Σ_{i=1}^d Z_i; we will shortly apply a concentration inequality to this sum. With the assumption that the coordinates of x and y are i.i.d. random variables with the stated bounds, it follows that the Z_i are statistically independent with mean zero and are bounded. By applying Bernstein's inequality [4] to the sum of the Z_i, we conclude that for t > 0,

(17)

where c is a constant. In other words, with probability at least

This also means that, with the same probability, the inputs to g will fall into the interval [−t, t].
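For the reader's convenience, one classical form of Bernstein's inequality for independent, mean-zero random variables Z_1, …, Z_d with |Z_i| ≤ b is recalled below; the constants in Eq. 17 of the paper may differ from this generic statement.

```latex
\[
  \mathbb{P}\!\left( \Bigl|\sum_{i=1}^{d} Z_i\Bigr| \ge t \right)
  \;\le\; 2 \exp\!\left( - \frac{t^2 / 2}{\sum_{i=1}^{d} \mathbb{E}[Z_i^2] + b\,t/3} \right).
\]
```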

Therefore, for the probability associated with t, we can restrict attention to g on the domain [−t, t]. We assume that g is analytically extended to a transformed Bernstein ellipse with parameter ρ′, with the value of g inside the ellipse bounded by M′. Following the same argument as in the proof of Theorem 2.1, we obtain that, for the same expansion order n and with the probability in Eq. 17, the approximation error for x and y sampled from the above distribution is bounded by

(18)

This sharper bound can be achieved with the same function rank as in Eq. 13 and with the same low-rank representation as in Eq. 12, up to different coefficients.

Next, we rewrite the upper bound in Eq. 18 in terms of the parameters ρ, t, and D. If we linearly map the domain of g from [−t, t] to [−1, 1], then the Bernstein ellipse with parameter ρ is scaled accordingly. We seek the largest ρ′ such that the Bernstein ellipse with parameter ρ′ is contained in the transformed Bernstein ellipse with parameter ρ; in that case, the lengths of their semi-minor axes match, and the largest ρ′ satisfies

(19)

and we obtain ρ′ explicitly. In the special case where the subinterval coincides with the full domain, we recover the error bound of Theorem 2.1. To simplify the bound, we use an elementary relation between ρ′, ρ, and t. Substituting this into Eq. 18, along with the corresponding bound on M′, we obtain