# On the Complexity of Learning with Kernels

A well-recognized limitation of kernel learning is the requirement to handle a kernel matrix, whose size is quadratic in the number of training examples. Many methods have been proposed to reduce this computational cost, mostly by using a subset of the kernel matrix entries, or some form of low-rank matrix approximation, or a random projection method. In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix. We show that there are kernel learning problems where no such method will lead to non-trivial computational savings. Our results also quantify how the problem difficulty depends on parameters such as the nature of the loss function, the regularization parameter, the norm of the desired predictor, and the kernel matrix rank. Our results also suggest cases where more efficient kernel learning might be possible.

## 1 Introduction

We consider the well-known problem of kernel learning (see, e.g., [21]), where given a training set $\{(x_t, y_t)\}_{t=1}^m$ of labeled examples from a product domain $\mathcal{X} \times \mathcal{Y}$, our goal is to find a linear predictor in a reproducing kernel Hilbert space which minimizes the average loss, possibly with some regularization. Formally, our goal is to solve

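$$\min_{w \in \mathcal{W}} \frac{1}{m} \sum_{t=1}^{m} \ell\big(\langle w, \psi(x_t) \rangle, y_t\big) + \frac{\lambda}{2} \|w\|^2, \tag{1}$$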

where $\mathcal{W}$ is a convex subset of some reproducing kernel Hilbert space $\mathcal{H}$, $\psi : \mathcal{X} \to \mathcal{H}$ is a feature mapping to the Hilbert space, $\ell$ is a loss function convex in its first argument, and $\lambda \ge 0$ is a regularization parameter. For example, in the standard formulation of Support Vector Machines, we take $\ell$ to be the hinge loss, pick some $\lambda > 0$, and let $\mathcal{W}$ be the entire Hilbert space. Alternatively, one can also employ hard regularization, e.g., setting $\lambda = 0$ and taking $\mathcal{W}$ to be a norm-bounded ball in $\mathcal{H}$.

It is well-known that even if $\mathcal{H}$ is high or infinite dimensional, we can solve (1) in polynomial time, provided there is an efficiently computable kernel function $k$ such that $k(x, x') = \langle \psi(x), \psi(x') \rangle$. The key insight is provided by the representer theorem, which implies that an optimum of (1) exists in the span of $\psi(x_1), \ldots, \psi(x_m)$. Therefore, instead of optimizing over $w$, we can optimize over a coefficient vector $\alpha \in \mathbb{R}^m$, which implicitly specifies a predictor via $w(\alpha) = \sum_{j=1}^{m} \alpha_j \psi(x_j)$. In this case, (1) reduces to

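$$\min_{\alpha : w(\alpha) \in \mathcal{W}} \frac{1}{m} \sum_{t=1}^{m} \ell\Big(\sum_{j=1}^{m} \alpha_j \langle \psi(x_j), \psi(x_t) \rangle, y_t\Big) + \frac{\lambda}{2} \|w(\alpha)\|^2 .$$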

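Defining the kernel matrix $K \in \mathbb{R}^{m \times m}$ by $K_{i,j} = k(x_i, x_j)$, and letting $e_t$ denote the $t$-th standard basis vector, we can re-write the above as

$$\min_{\alpha : w(\alpha) \in \mathcal{W}} \frac{1}{m} \sum_{t=1}^{m} \ell\big(\alpha^\top K e_t, y_t\big) + \frac{\lambda}{2} \alpha^\top K \alpha . \tag{2}$$

This is a convex problem, which can generally be solved in polynomial time. The resulting $\alpha$ implicitly defines the linear predictor $w(\alpha)$ in the Hilbert space: given a new point $x$ to predict on, the prediction can be computed efficiently as $\langle w(\alpha), \psi(x) \rangle = \sum_{j=1}^{m} \alpha_j k(x_j, x)$.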
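To make the reduction from (1) to (2) concrete, here is a minimal Python sketch of how a coefficient vector yields predictions through kernel evaluations only. The Gaussian kernel form, the helper names (`gaussian_kernel`, `kernel_matrix`, `predict`) and the toy data are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Assumed kernel form: k(x, z) = exp(-||x - z||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def kernel_matrix(xs, kernel):
    """Full m x m kernel matrix; building it costs m^2 kernel evaluations."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]

def predict(alpha, xs, kernel, x_new):
    """<w(alpha), psi(x_new)> = sum_j alpha_j * k(x_j, x_new)."""
    return sum(a * kernel(xj, x_new) for a, xj in zip(alpha, xs))

# Toy data and an arbitrary coefficient vector (hypothetical values).
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
alpha = [0.5, -1.0, 0.25]

K = kernel_matrix(xs, gaussian_kernel)
# On a training point x_t, the prediction coincides with alpha^T K e_t.
train_preds = [predict(alpha, xs, gaussian_kernel, x) for x in xs]
```

The quadratic cost of `kernel_matrix` is exactly the bottleneck that the rest of the paper is concerned with.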

Unfortunately, a major handicap of kernel learning is that it requires computing and handling an $m \times m$ matrix, where $m$ is the size of the training data, and this can be prohibitive in large-data applications. This has led to a large literature on efficient kernel learning, which attempts to reduce its computational complexity. As far as we know, the algorithms proposed so far fall into one or more of the following categories (see below for specific references):

• Limiting the number of kernel evaluations: A dominant computational bottleneck in kernel learning is computing all entries of the kernel matrix. Thus, several algorithms attempt to learn using a much smaller number of kernel evaluations – either by sampling them or using other schemes which require “reading” only a small part of the kernel matrix.

• Low-Rank Kernel Approximation: Instead of using the full kernel matrix, one can use instead a low-rank approximation of it. Learning with a low-rank matrix can be done in a computationally much more efficient manner than with a general kernel matrix (e.g., [21, 1]).

• Projection to a low-dimensional space: Each instance $x$ is mapped to a finite-dimensional vector $\phi(x) \in \mathbb{R}^d$ where $d \ll m$, so that $\langle \phi(x), \phi(x') \rangle \approx k(x, x')$. Note that this is equivalent to a kernel problem where the rank of the kernel matrix is at most $d$, so it can be seen as a different kind of low-rank kernel approximation technique.

Existing theoretical results focus on performance guarantees for various algorithms. In this work, we consider a complementary question, which surprisingly has not been thoroughly explored (to the best of our knowledge): What are the inherent obstacles to efficient kernel learning? For example, is it possible to reduce the number of kernel evaluations while maintaining the same learning performance? Is there always a price to pay for low-rank matrix approximation? Can finite-dimensional projection methods match the performance of algorithms working on the original kernel matrix?

Specifically, we study information-theoretic lower bounds on the attainable performance, measured in terms of optimization error on a given training set. We consider two distinct types of constraints:

• The number of kernel evaluations (or equivalently, the number of entries of the kernel matrix observed) is bounded by $B$, where $B$ is generally assumed to be much smaller than $m^2$ (the number of entries in the kernel matrix).

• The algorithm solves (2), but using some low-rank matrix in place of $K$. This can be seen as using a low-rank kernel matrix approximation.

We make no assumptions whatsoever on which kernel evaluations are used, or the type of low-rank approximation, so our results apply to all the methods mentioned previously, and any future potential method which uses these types of approaches. We note that although we focus on optimization error on a given training set, our lower bounds can also be potentially extended to generalization error, where the data is assumed to be sampled i.i.d. from an underlying distribution. We discuss this point further in Section 5.

Our first conclusion, informally stated, is that it is generally impossible to make kernel learning more efficient in a non-trivial manner. For example, suppose we have a budget $B$ on the number of kernel evaluations, where $B \ll m^2$. Then the following “trivial” sub-sampling method turns out to be optimal in general: Sub-sample $\lfloor \sqrt{B} \rfloor$ examples from the training data uniformly at random (throwing away all other examples), compute the full kernel matrix based on the sub-sample, and train a predictor using this matrix. This is an extremely naive algorithm, throwing away almost all of the data, yet we show that there are cases where no algorithm can be substantially better. Another pessimistic result can be shown for the low-rank matrix approximation approach: There are cases where any low-rank approximation will impact the attainable performance.

Our formal results go beyond these observations, and quantify the attainable performance as a function of several important problem parameters, such as the kernel matrix rank, regularization parameter, norm of the desired predictor, and the nature of the loss function. In particular:

• Given a kernel evaluation budget constraint $B$:

• For the absolute loss, no regularization ($\lambda = 0$), and a constant norm constraint on the domain, we have an error lower bound of $\Omega(B^{-1/4})$. A matching upper bound is obtained by the sub-sampling algorithm discussed earlier.

• For soft regularization (with regularization parameter $\lambda$ and no norm constraint), we attain error lower bounds which depend on the structure of the loss function. Some particular corollaries include:

• For the absolute loss, $\Omega\big(1/(\lambda \sqrt{B})\big)$. Again, a matching upper bound is attained by a sub-sampling algorithm.

• For the hinge loss, $\Omega(1)$ as long as $B = O(1/\lambda^2)$. Although it only applies in a certain budget regime, it is tight in terms of identifying the kernel evaluation budget required to make the error sub-constant. Moreover, it sheds some light on previous work (e.g., [6]) which considered efficient kernel learning methods for the hinge loss.

• For the squared loss, $\Omega\big(1/(\lambda^3 B^{3/2})\big)$, as long as $B = \Omega(1/\lambda^2)$. Like the result for the other losses, it implies that no sub-constant error is possible unless $B = \Omega(1/\lambda^2)$.

• For learning with low-rank approximation, with rank parameter $d$, in the case of Ridge Regression (squared loss and soft regularization), we attain an error lower bound (Theorem 3) which implies that, to get sub-constant error, we need the rank to scale at least like $1/\lambda$.

The role of the loss function is particularly interesting, since it has not been well-recognized in previous literature, yet our results indicate that it may play a key role in the complexity of kernel learning. For example, as we discuss in Section 3, efficient kernel learning is trivial with the linear loss, harder for smooth and non-linear losses, and appears to be especially hard for non-smooth losses. Our results also highlight the importance of the kernel matrix rank in determining the difficulty of kernel learning. While it has been recognized that low rank can make kernel learning easy (see references below), our results formally establish the reverse direction, namely that (some) high-rank matrices are indeed hard to learn with any algorithm.

### Related Work

The literature on efficient kernel methods is vast and we cannot do it full justice. A few representative examples include sparse greedy kernel approximations [21], Nyström-based methods, which sample a few rows and columns and use them to construct a low-rank approximation [10, 15], random finite-dimensional kernel approximations such as random kitchen sinks [18, 19, 8], the kernelized stochastic batch Perceptron for learning with few kernel evaluations [6], the random budget Perceptron and the Forgetron [4, 9], divide-and-conquer approaches [26, 13], sequential algorithms with early stopping [25, 20], other numerical-algebraic methods for low-rank approximation, e.g., [11, 22, 2, 17, 15], combinations of the above [8], and more. Several works provide a theoretical analysis of the performance of such methods, as a function of the rank, number of kernel evaluations, dimensionality of the finite-dimensional space, and so on. Beyond the works mentioned above, a few other examples include [5, 24, 1, 16].

In terms of lower bounds, we note that there are existing results on the error of matrix approximation, based on partial access to the matrix (see [3, 12]). However, the way the error is measured is not suitable to our setting, since they focus on the Frobenius norm of $K - \tilde{K}$, where $K$ is the original matrix and $\tilde{K}$ is the approximation. In contrast, in our setting, we are interested in the error of a resulting predictor rather than the quality of matrix approximation. Therefore, even if the norm of $K - \tilde{K}$ is large, it could be that $\tilde{K}$ can still be used to learn an excellent predictor. Another distinct line of work studies how to reduce the complexity of a kernel predictor at test time, e.g., by making it supported on a few support vectors (see for instance [7] and references therein). This differs from our work, which focuses on efficiency at training time.

### Paper Organization

Our paper is organized as follows. In Section 2, we introduce the class of kernel matrices which shall be used to prove our results, and discuss how they can be generated by standard kernels. In Section 3, we provide lower bounds in a model where the algorithm is constrained in terms of the number of kernel evaluations used. We consider this model in two flavors, one where there is a norm constraint and no regularization (Subsection 3.1), and one where there is regularization without norm constraint (Subsection 3.2). In the former case, we focus on a particular loss, while in the latter case, we provide a more general result and discuss how different types of losses lead to different types of lower bounds. In Section 4, we consider the model where the algorithm is constrained to use a low-rank kernel matrix approximation. We conclude and discuss open questions in Section 5. Proofs appear in Appendix A.

## 2 Hard Kernel Matrices

For our results, we utilize a set of “hard” kernel matrices, which are essentially permutations of block-diagonal matrices with at most $d$ blocks. More formally:

###### Definition 1.

Let $\mathcal{K}'_{d,m}$ be the class of all block-diagonal $m \times m$ matrices, composed of at most $d$ blocks, with entry values of $1$ within each block (and $0$ elsewhere). We define $\mathcal{K}_{d,m}$ to be all matrices which belong to $\mathcal{K}'_{d,m}$ under some permutation of their rows and columns:

$$\mathcal{K}_{d,m} = \Big\{ K \in \{0,1\}^{m \times m} : \exists\, \pi,\ K' \in \mathcal{K}'_{d,m} \ \text{s.t.}\ \forall i,j \in \{1 \ldots m\},\ K_{i,j} = K'_{\pi(i),\pi(j)} \Big\} .$$

From the definition, it is immediate that any $K \in \mathcal{K}_{d,m}$ is positive semidefinite (and hence is a valid kernel matrix), with rank at most $d$. Moreover, the magnitude of the diagonal elements is at most $1$, which means that our data lies in the unit ball in the Hilbert space.
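As an illustration of Definition 1, the following Python sketch builds a member of the class by assigning each of the $m$ indices to one of $d$ blocks, and checks the positive semidefiniteness and rank claims through an explicit factorization $K = ZZ^\top$, where $Z$ is the $m \times d$ block-indicator matrix. The helper name `hard_kernel_matrix` and the random block assignment are assumptions made for this example only.

```python
import random

def hard_kernel_matrix(block_of):
    """K[i][j] = 1 if indices i and j are assigned to the same block, else 0."""
    m = len(block_of)
    return [[1 if block_of[i] == block_of[j] else 0 for j in range(m)]
            for i in range(m)]

d, m = 3, 8
rng = random.Random(0)
# Random assignment of indices to blocks: a permuted block-diagonal structure.
block_of = [rng.randrange(d) for _ in range(m)]
K = hard_kernel_matrix(block_of)

# Indicator matrix Z (m x d): Z[i][b] = 1 iff index i lies in block b.
Z = [[1 if block_of[i] == b else 0 for b in range(d)] for i in range(m)]
# K equals Z Z^T, hence K is positive semidefinite with rank at most d.
ZZt = [[sum(Z[i][b] * Z[j][b] for b in range(d)) for j in range(m)]
       for i in range(m)]

# The number of distinct rows also bounds the rank from above.
distinct_rows = len({tuple(row) for row in K})
```

Sorting the indices by block label recovers the block-diagonal form, which is the permutation $\pi$ in the definition.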

Since our focus is on generic kernel learning, it is sufficient to consider this class in order to establish hardness results. However, it is still worthwhile to consider what kernels can induce this class of kernel matrices. A sufficient condition can be quantified via the following lemma.

###### Lemma 1.

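Suppose there exist $x_1, \ldots, x_d \in \mathcal{X}$ such that $k(x_i, x_j) = \mathbb{1}_{i=j}$. Then any $K \in \mathcal{K}_{d,m}$ is induced by some $m$ instances.

The proof is immediate: Given any $K \in \mathcal{K}_{d,m}$, for any block $i$ of size $m_i$, create $m_i$ copies of $x_i$, and order the instances according to the relevant permutation.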

It is straightforward to see that Lemma 1 holds for linear kernels $k(x, x') = \langle x, x' \rangle$ and for homogeneous polynomial kernels (powers of the inner product). It also holds (approximately) for Gaussian kernels if there exist $d$ equidistant points in $\mathcal{X}$, where the squared distance is much larger than the squared kernel width. In that case, instead of $0$ outside the blocks, we will have $\epsilon$, where $\epsilon$ is exponentially small and can be shown to be negligible for our purposes.

However, a close inspection of our results reveals that they are in fact applicable to a much larger class of matrices: All we truly require is to have such that and for some distinct constants for all . This condition holds for most kernels we are aware of. For example, if there are equi-distant points , then this condition is fulfilled for any shift-invariant kernel (where is some function of ). Similarly, if there are points which have the same inner product, then the condition is fulfilled for any inner product kernel (where is some function of ). In order to keep a more coherent presentation we will concentrate here on the Boolean case defined previously, where and .

Although our formal results and proofs contain many technical details, their basic intuition is quite simple: When $d$ is sufficiently large, any matrix in $\mathcal{K}_{d,m}$ is of high rank, and cannot be approximated well by any low-rank matrix. Therefore, under suitable conditions, no low-rank matrix approximation approach can work well. Moreover, when $d$ is large, the kernel matrix is quite sparse, and contains a large number of relatively small blocks. Thus, for an appropriate randomized choice of a matrix in $\mathcal{K}_{d,m}$, any algorithm with a limited budget of kernel evaluations will find it difficult to detect these blocks. With a suitable construction, we can reduce the kernel optimization problem to that of detecting these blocks, from which our results follow.

## 3 Budget Constraints

We now turn to present our results for the case of budget constraints. In this setting, the learning algorithm is given the target values $y_1, \ldots, y_m$, but not the kernel matrix $K$. Instead, the algorithm may query at most $B$ entries in the kernel matrix (where $B$ is a user-defined positive integer), and then return a coefficient vector $\alpha$ based on this information. This model represents approaches which attempt to reduce the computational complexity of kernel learning by reducing the number of kernel evaluations needed. Standard learning algorithms essentially require $B = m^2$, and the goal is to learn to similar accuracy with a budget $B \ll m^2$. In this section, we discuss the inherent limitations of this approach.

### 3.1 Norm Constraint, Absolute Loss

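We begin by demonstrating a lower bound using the absolute loss $\ell(u, y) = |u - y|$ on the domain $\mathcal{W} = \{w : \|w\|^2 \le 2\}$ (or equivalently, coefficient vectors $\alpha$ satisfying $\alpha^\top K \alpha \le 2$). Our goal is to minimize the average loss, which equals

$$\min_{\alpha : \alpha^\top K \alpha \le 2} \frac{1}{m} \sum_{t=1}^{m} \big| \alpha^\top K e_t - y_t \big| .$$

###### Theorem 1.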

For any rank parameter , any sample size , any budget size , and for any budgeted algorithm, there exists a kernel matrix and target values , such that the returned coefficient vector satisfies

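$$\Big( \frac{1}{m} \sum_{t=1}^{m} \big| \alpha^\top K e_t - y_t \big| \Big) - \min_{\alpha : \alpha^\top K \alpha \le 2} \frac{1}{m} \sum_{t=1}^{m} \big| \alpha^\top K e_t - y_t \big| \ \ge\ \frac{1}{70 \sqrt{d}} . \tag{3}$$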

The proof and the required construction appear in Subsection A.1. Note that the algorithm is allowed to return any coefficient vector (not necessarily one satisfying the domain constraint $\alpha^\top K \alpha \le 2$).

The theorem provides a lower bound on the attainable error, for any rank parameter $d$ and assuming the sample size $m$ and budget $B$ are in an appropriate regime. A different way to phrase this is that if $B$ is sufficiently smaller than $m^2$, then we can find some $d$ on the order of $\sqrt{B}$, such that Theorem 1 holds. More formally:

###### Corollary 1.

There exist universal constants such that if , there is an kernel matrix (belonging to for some appropriate ) and target values in such that the returned coefficient vector satisfies

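$$\Big( \frac{1}{m} \sum_{t=1}^{m} \big| \alpha^\top K e_t - y_t \big| \Big) - \min_{\alpha : \alpha^\top K \alpha \le 2} \frac{1}{m} \sum_{t=1}^{m} \big| \alpha^\top K e_t - y_t \big| \ \ge\ \frac{c'}{\sqrt[4]{B}} .$$

In words, the attainable error given a budget of $B$ cannot go down faster than $1/\sqrt[4]{B}$. Next, we show that this is in fact the optimal rate, and is achieved by the following simple strategy: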

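1. Given a training set of size $m$, sample $\lfloor \sqrt{B} \rfloor$ training examples uniformly at random (with replacement), getting $(x_{t_1}, y_{t_1}), \ldots, (x_{t_{\lfloor \sqrt{B} \rfloor}}, y_{t_{\lfloor \sqrt{B} \rfloor}})$.

2. Compute the kernel matrix $\hat{K} \in \mathbb{R}^{\lfloor \sqrt{B} \rfloor \times \lfloor \sqrt{B} \rfloor}$ defined as $\hat{K}_{i,j} = k(x_{t_i}, x_{t_j})$, using at most $B$ queries.

3. Solve the kernel learning problem on the sampled set, getting a coefficient vector $\hat{\alpha}$:

$$\min_{\hat{\alpha} : \hat{\alpha}^\top \hat{K} \hat{\alpha} \le 2} \Bigg( \frac{1}{\lfloor \sqrt{B} \rfloor} \sum_{j=1}^{\lfloor \sqrt{B} \rfloor} \big| \hat{\alpha}^\top \hat{K} e_j - y_{t_j} \big| \Bigg) .$$

4. Return the coefficient vector $\alpha$ such that $\alpha_{t_j} = \hat{\alpha}_j$ for $j = 1, \ldots, \lfloor \sqrt{B} \rfloor$, and $\alpha_t = 0$ otherwise.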

Essentially, this strategy approximately solves the original problem by drawing a subset of the training data (small enough so that we can compute its kernel matrix in full) and solving the learning problem on that data. Since we use a sample of size $\lfloor \sqrt{B} \rfloor$, then by standard generalization guarantees for learning bounded-norm predictors using Lipschitz loss functions (e.g., [14]), we get a generalization error upper bound of

$$O\Bigg( \frac{1}{\sqrt{\lfloor \sqrt{B} \rfloor}} \Bigg) = O\Bigg( \frac{1}{\sqrt[4]{B}} \Bigg)$$

which matches the lower bound in Corollary 1 up to constants.
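The sub-sampling strategy can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions: `kernel_entry(i, j)` stands for one query to entry $(i,j)$ of the kernel matrix, and `solve` is a caller-supplied routine for step 3 (both names are hypothetical). The sketch evaluates only the upper triangle of the sampled matrix and fills the rest by symmetry, and it accumulates coefficients when an index is sampled more than once, which is one reasonable reading of step 4 under sampling with replacement.

```python
import math
import random

def subsample_strategy(m, B, kernel_entry, solve, rng):
    """Sub-sample floor(sqrt(B)) examples, learn on them, embed the result."""
    n = math.isqrt(B)                            # floor(sqrt(B)) sampled examples
    idx = [rng.randrange(m) for _ in range(n)]   # step 1: with replacement

    # Step 2: full kernel matrix of the sub-sample, counting queries.
    queries = 0
    K_hat = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):                    # symmetry: upper triangle only
            K_hat[a][b] = K_hat[b][a] = kernel_entry(idx[a], idx[b])
            queries += 1

    # Step 3: delegate to a caller-supplied solver (hypothetical interface).
    alpha_hat = solve(K_hat, idx)

    # Step 4: embed into an m-dimensional coefficient vector.
    alpha = [0.0] * m
    for j, t in enumerate(idx):
        alpha[t] += alpha_hat[j]                 # duplicates are accumulated
    return alpha, queries

# Usage with a toy block-structured kernel and a placeholder solver.
toy_kernel = lambda i, j: 1.0 if i % 5 == j % 5 else 0.0
dummy_solve = lambda K_hat, idx: [1.0] * len(idx)
alpha, queries = subsample_strategy(100, 50, toy_kernel, dummy_solve,
                                    random.Random(0))
```

With $n = \lfloor \sqrt{B} \rfloor$ the sketch uses $n(n+1)/2 \le B$ queries, so the budget is respected regardless of which examples are drawn.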

To summarize, we see that with the absolute loss, given a constraint on the number of kernel evaluations, there exists no better method than throwing away most of the data, and learning on a sufficiently small subset. Moreover, any method using a non-trivial budget (significantly smaller than $m^2$) must suffer a performance degradation.

### 3.2 Soft Regularization, General Losses

Having obtained an essentially tight result for the absolute loss, it is natural to ask what can be obtained for more generic losses. To study this question, it will be convenient to shift to the setting where the domain is the entire Hilbert space, and we use a regularization term. Following (2), this reduces to solving

$$\min_{\alpha} \frac{1}{m} \sum_{t=1}^{m} \ell\big(\alpha^\top K e_t, y_t\big) + \frac{\lambda}{2} \alpha^\top K \alpha .$$

We start by defining the main quantity we are interested in,

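$$\Delta_\ell(m, \alpha, K, \lambda, y) = \Big( \frac{1}{m} \sum_{t=1}^{m} \ell\big(\alpha^\top K e_t, y_t\big) + \frac{\lambda}{2} \alpha^\top K \alpha \Big) - \min_{\alpha} \Big( \frac{1}{m} \sum_{t=1}^{m} \ell\big(\alpha^\top K e_t, y_t\big) + \frac{\lambda}{2} \alpha^\top K \alpha \Big),$$

where $\ell$ is a loss function.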

First, we provide a general result, which applies to any non-negative loss function, and then draw from it corollaries for specific losses:

###### Theorem 2.

Suppose the loss function is non-negative. For any rank parameter , any sample size , any budget , and for any budgeted algorithm, there exists a kernel matrix and target values in , such that the returned coefficient vector satisfies

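$$\Delta_\ell(m, \alpha, K, \lambda, y) \ \ge\ \frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \max_{y \in \mathcal{Y}} \big( 2 u_1^* - u_2^* \big)^2 \tag{4}$$

where

$$u_1^* = \arg\min_u\ \ell(u, y) + p \lambda d\, u^2 \quad \text{and} \quad u_2^* = \arg\min_u\ \ell(u, y) + \frac{p \lambda d}{2} u^2 .$$

The proof and the required construction appear in Subsection A.1.2.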

Roughly speaking, to get a non-trivial bound, we need the loss to be such that when the regularization parameter is order of , then scaling it by a factor of changes the location of the optimum by a factor different than . For instance, this rules out linear losses of the form . For such a loss, we have

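$$u_1^* = \arg\min_u\ \frac{n}{m} y u + \lambda u^2 = -\frac{n y}{2 \lambda m} \quad \text{and} \quad u_2^* = \arg\min_u\ \frac{n}{m} y u + \frac{\lambda}{2} u^2 = -\frac{n y}{\lambda m} .$$

Thus we get that $2 u_1^* - u_2^* = 0$ and the lower bound is trivially $0$. While this may seem at first like an unsatisfactory bound, in fact this should be expected: For linear loss and no domain constraints, we don’t need to observe the kernel matrix at all in order to find the optimal solution! To see this, note that the optimization problem in (2) reduces to

$$\min_{\alpha} \frac{1}{m} \sum_{t=1}^{m} y_t\, \alpha^\top K e_t + \frac{\lambda}{2} \alpha^\top K \alpha$$

or, equivalently,

$$\min_{\alpha}\ \alpha^\top K v + \frac{\lambda}{2} \alpha^\top K \alpha$$

where $v = \frac{1}{m} \sum_{t=1}^{m} y_t e_t$ is a known vector and $K$ is the partially-unknown kernel matrix. Differentiating the expression with respect to $\alpha$ and equating to $0$, we get

$$K v + \lambda K \alpha = 0 .$$

Thus, an optimum of this problem is simply $\alpha = -v / \lambda$, regardless of what $K$ is. This shows that for linear losses, we can find the optimal predictor with zero queries of the kernel matrix.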
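The claim that the linear-loss optimum does not depend on the kernel matrix is easy to check numerically. The sketch below takes $v$ to be the label vector divided by $m$ (the form implied by matching the two displayed objectives), sets $\alpha = -v/\lambda$ without looking at any kernel entry, and evaluates the stationarity condition $Kv + \lambda K\alpha$ for two different block-structured matrices. The specific labels, matrices and value of $\lambda$ are made up for illustration.

```python
def matvec(K, x):
    """Plain matrix-vector product for a list-of-lists matrix."""
    return [sum(k_ij * x_j for k_ij, x_j in zip(row, x)) for row in K]

lam = 0.1
y = [1.0, -1.0, 1.0, 1.0]
m = len(y)

v = [y_t / m for y_t in y]           # v = (1/m) * sum_t y_t e_t
alpha = [-v_t / lam for v_t in v]    # candidate optimum, computed without K

def gradient(K, alpha):
    """Gradient of alpha^T K v + (lam/2) alpha^T K alpha: K v + lam K alpha."""
    Kv = matvec(K, v)
    Ka = matvec(K, alpha)
    return [a + lam * b for a, b in zip(Kv, Ka)]

# Two different members of the hard class (same size, different blocks).
K1 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
K2 = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]]
grads = [gradient(K, alpha) for K in (K1, K2)]
```

Both gradients vanish up to floating-point rounding, since $Kv + \lambda K\alpha = K(v + \lambda\alpha) = 0$ for any $K$.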

Thus, the kernel learning problem is non-trivial only for non-linear losses, which we now turn to examine in more detail.

#### 3.2.1 Absolute Loss

First, let us consider again the absolute loss in this setting. We easily get the following corollary of Theorem 2:

###### Corollary 2.

Let be the absolute loss. There exist universal constants , such that if , then for any budgeted algorithm there exists an kernel matrix and target values such that is lower bounded by .

###### Proof.

To apply Theorem 2, let us compute , where we use the particular choice . It is readily verified that , leading to the lower bound

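$$\frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \big( u_1^* \big)^2 = \frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \Big( \frac{1}{2 p \lambda d} \Big)^2 = \frac{\lambda d}{60} \Big( \frac{1}{4 \lambda d} \Big)^2 = \frac{1}{960 \lambda d} .$$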

In particular, suppose we choose . Then we get a lower bound of for . The conditions of Theorem 2 are satisfied if and . The latter always holds, whereas the former is indeed true if is smaller than for . ∎

As in the setting of Theorem 1, this lower bound is tight, and we can get a matching upper bound by learning with a random sub-sample of $\sqrt{B}$ training examples, using generalization bounds for minimizers of strongly-convex and Lipschitz stochastic optimization problems [23].

Note that, unlike our other lower bounds, Corollary 2 is proven using a different choice of for each . It is not clear whether this requirement is real, or is simply an artifact of our proof technique.

#### 3.2.2 Hinge Loss

Intuitively, the proof of Corollary 2 relied on the absolute loss having a non-smooth “kink” at $u = y$, which prevented the optimal $u$ from moving as a result of the changed regularization parameter. Results of similar flavor can be obtained with any other loss which has an optimum at a non-smooth point. However, when we do not control the location of the “kink”, the results may be weaker. A good example is the hinge loss, $\ell(u, y) = \max\{0, 1 - uy\}$, which is non-differentiable at the fixed location $uy = 1$:

###### Corollary 3.

Let be the hinge loss. There exist universal constants , such that if and , then for any budgeted algorithm, there exist an kernel matrix and target values in such that is lower bounded by .

###### Proof.

To apply Theorem 2, let us compute , where we use the particular choice . It is readily verified that , as long as , and is certainly satisfied for any by assuming . Therefore, if , then in Theorem 2, we get , and thus a lower bound of .

In particular, suppose we pick . Since we assume , this means that the lower bound above is . The conditions of Theorem 2 are satisfied if and , which are indeed implied by the corollary’s conditions. ∎

Unlike the bound for the absolute loss, here the result is weaker, and only quantifies a regime under which sub-constant error is impossible. In particular, the condition is not interesting for constant . However, in learning problems usually scales down as where and often . In that case, we get constant error as long as , which establishes that learning is impossible for a budget smaller than a quantity in the range from to , depending on the value of . For , that is , learning is impossible without querying a constant fraction of the kernel matrix.

Moreover, it is possible to show that our lower bound is tight, in terms of identifying the threshold for making the error sub-constant. As before, we consider the strategy of sub-sampling training examples and learning with respect to the induced kernel matrix. Since we use examples and -strongly convex regularization, the expected error scales as [23]. This is sub-constant in the regime , and matches our lower bound. We emphasize that when is , we do not have a non-trivial lower bound, and it remains an open problem to understand what can be attained for the hinge loss in this regime.

Another interesting consequence of the corollary is the required budget as a function of the norm of a “good predictor” we want to compete with. In [6], several algorithmic approaches have been studied, which were all shown to require kernel evaluations to be competitive with a given predictor , even in the “realizable” case where the predictor attains zero average hinge loss. An examination of the proof of Theorem 2 reveals that the construction is such that there exists a predictor which attains zero hinge loss on all the examples, and whose norm is . Corollary 3 shows that the budget must be at least to get sub-constant error in the worst case. Although our setting is slightly different than [6], this provides evidence that the bounds in [6] are tight in terms of the norm dependence.

(Footnote 1, on the norm of the predictor: To see this, recall that we use a block-diagonal kernel matrix composed of at most all-ones blocks, and where always. So by picking for any index in block (where is the size of the block), we get zero hinge loss, and the norm is . Moreover, in the proof of Corollary 3 we pick , so the norm is .)

#### 3.2.3 Squared Loss

In the case of absolute loss and hinge loss, the results depend on a non-differentiable point in the loss function. It is thus natural to conclude by considering a smooth differentiable loss, such as the squared loss:

###### Corollary 4.

Let be the squared loss. There exist universal constants , such that

• If , then for any budgeted algorithm there exists an kernel matrix and target values in such that is lower bounded by .

• If , then for any budgeted algorithm there exists an kernel matrix and target values in such that is lower bounded by .

This lower bound is weaker than the lower bound attained for the absolute loss. This is essentially due to the smoothness of the squared loss, and we do not know if it is tight. In any case, it proves that even for the squared loss, at least kernel evaluations are required to get sub-constant error. In learning problems, where often scales down as (where and often ), we get a required budget size of . This is super-linear when , and becomes when – in other words, we need to compute a constant portion of the entire kernel matrix.

###### Proof.

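To apply Theorem 2, let us compute $2 u_1^* - u_2^*$. It is readily verified that $u_1^* = \frac{y}{1 + p \lambda d}$ and $u_2^* = \frac{y}{1 + p \lambda d / 2}$, leading to the lower bound

$$\frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \max_{y \in \mathcal{Y}} \big( 2 u_1^* - u_2^* \big)^2 = \frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \max_{y \in \mathcal{Y}} \Bigg( \frac{y}{(1 + p \lambda d)\big(1 + \frac{p \lambda d}{2}\big)} \Bigg)^2 \ \ge\ \frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \max_{y \in \mathcal{Y}} \Bigg( \frac{y}{(1 + p \lambda d)^2} \Bigg)^2 .$$

Taking in particular $y = 1$, we get

$$\frac{\lambda d}{60} \min_{p \in [\frac{1}{2}, 2]} \Bigg( \frac{1}{(1 + p \lambda d)^2} \Bigg)^2 \ \ge\ \frac{\lambda d}{60 (1 + 2 \lambda d)^4} . \tag{5}$$

We now consider two ways to pick $d$, corresponding to the two cases considered in the corollary: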

• If , we pick . Since , we have , and this means that (5) is bounded below by . The conditions of Theorem 2 are satisfied if and . These are satisfied by assuming for .

• If , we pick . Plugging this into (5) and using the assumption (or equivalently, ), we get a lower bound of for an appropriate constant . Moreover, the conditions of Theorem 2 are satisfied if and . The latter always holds, whereas the former indeed holds if is less than for .

This completes the proof. ∎
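The algebra in this proof can be sanity-checked numerically. The sketch below assumes the squared loss has the form $\ell(u,y) = (u-y)^2$ (the form consistent with the closed-form minimizers appearing in the displayed bound), evaluates the right-hand side of (4) on a grid over $p \in [\frac{1}{2}, 2]$ with targets in $\{-1, +1\}$, and compares it against the simplified bound (5). The function names, the grid resolution and the target set are illustrative choices.

```python
def u_star(y, c):
    """argmin_u (u - y)^2 + c * u^2, which equals y / (1 + c)."""
    return y / (1.0 + c)

def gap(y, lam_d, p):
    """2*u1 - u2 with regularization weights p*lam_d and p*lam_d/2."""
    c = p * lam_d
    return 2.0 * u_star(y, c) - u_star(y, c / 2.0)

def theorem2_bound(lam_d, ys=(-1.0, 1.0), grid=200):
    """Right-hand side of (4), with the min over p taken on a finite grid."""
    ps = [0.5 + 1.5 * i / grid for i in range(grid + 1)]   # covers [1/2, 2]
    inner = min(max(gap(y, lam_d, p) ** 2 for y in ys) for p in ps)
    return lam_d / 60.0 * inner

def simplified_bound(lam_d):
    """Right-hand side of (5)."""
    return lam_d / (60.0 * (1.0 + 2.0 * lam_d) ** 4)

values = [(ld, theorem2_bound(ld), simplified_bound(ld))
          for ld in (0.1, 1.0, 5.0)]
```

For each tested value of $\lambda d$ the grid evaluation of (4) is at least the simplified bound (5), as the proof asserts; the gap is decreasing in $p$, so the minimum is attained at the grid endpoint $p = 2$.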

## 4 Low-Rank Constraints

In this section, we turn to discuss the second broad class of approaches, which replace the original kernel matrix $K$ by a low-rank approximation $\tilde{K}$. As explained earlier, many rank-reduction approaches, including the Nyström method and random features, use a low-rank approximation $\tilde{K}$ with entries defined by $\langle \phi(x_i), \phi(x_j) \rangle$, where $x_1, \ldots, x_m$ is the training set and $\phi$ is a given feature mapping, typically depending on the data.

The next result shows a lower bound on the error for any such low-rank approximation method when the algorithm used for learning is kernel Ridge Regression (i.e., when we use the squared loss and employ soft regularization).

###### Theorem 3.

Suppose there exist a kernel function on and points such that . Then there exists a training set , with corresponding kernel matrix , such that for any feature mapping (possibly depending on the training set), the coefficient vector returned by the Ridge Regression algorithm operating on the matrix with entries satisfies

$$\Delta_{\ell_2}(m, \alpha, K, \lambda, y) \ \ge\ \frac{1}{2 (\lambda d)^2 (1 + \lambda d)}$$

where $d$ is any upper bound on the rank of $\tilde{K}$ such that $d$ divides $m$.

When , we get a bound. This bears similarities to the bound in Corollary 4, which considered the squared loss in the budgeted setting, where is replaced by (i.e., when ). The bound implies that to get sub-constant error, the rank required must be larger than . When itself scales down with the sample size , we get that the required rank grows with the sample size. When , the required rank is , which means that any low-rank approximation scheme (where ) will lead to constant error. As in the case of Corollary 4, we do not know whether our lower bound is tight.

## 5 Discussion and Open Questions

In this paper, we studied fundamental information-theoretic barriers to efficient kernel learning, focusing on algorithms which either limit the number of kernel evaluations, or use a low-rank kernel matrix approximation. We provided several results under various settings, highlighting the influence of the kernel matrix rank, regularization parameter, norm constraint and nature of the loss function on the attainable performance.

For general losses and kernel matrices, our conclusion is generally pessimistic. In particular, when the number of kernel evaluations is bounded, there are cases where no algorithm attains performance better than a trivial sub-sampling strategy, where most of the data is thrown away. Also, no algorithm can work well when the regularization parameter is sufficiently small or the norm constraint is sufficiently large. On a more optimistic note, our lower bounds are substantially weaker when dealing with smooth losses. Although we do not know if these weaker lower bounds are tight, they may indicate that better kernel learning algorithms are possible by exploiting smoothness of the loss. Smoothness of the squared loss has been used in [26], but perhaps this property can be utilized more generally.

In our results, we focused on the problem of minimizing regularized training error on a given training set. This is a different goal than minimizing generalization error in a stochastic setting, where the data is assumed to be drawn i.i.d. from some underlying distribution. However, we believe that our lower bounds should also be applicable in terms of optimizing the risk (or expected error with respect to the underlying distribution). The main obstacle is that our lower bounds are proven for a given class of kernel matrices, which are not induced by an explicit i.i.d. sampling process of training instances. However, inspecting our basic construction in Subsection A.1, it can be seen that it is very close to such a process: The kernel is constructed by pairs of instances sampled i.i.d. from a finite set . We believe that all our results would hold if the instances were to be sampled i.i.d. from . The reason that we sample pairs is purely technical, since it ensures that for every , there is an equal number of and in the training set, making the calculations more tractable. Morally, the same techniques should work with i.i.d. sampling, as long as the probability of sampling and are the same for all .

Our work leaves several questions open. First, while the results for the absolute loss are tight, we do not know if this is the case for our other results. Second, the low-rank result in Section 4 applies only to squared loss (Ridge Regression), and it would be interesting to extend it to other losses. Third, it should be possible to extend our results also to randomized algorithms that query the kernel matrix a number of times bounded by $B$ only in expectation (with respect to the algorithm’s internal randomization), rather than deterministically. Finally, our results may indicate that at least for smooth losses, better kernel learning algorithms are possible, and remain to be discovered.

### Acknowledgements

This research was carried out in part while the authors were attending the research program on the Mathematics of Machine Learning, at the Centre de Recerca Matemàtica of the Universitat Autònoma de Barcelona (Spain). Partial support is gratefully acknowledged.

## References

• [1] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In COLT, 2013.
• [2] F. Bach and M. Jordan. Predictive low-rank decomposition for kernel methods. In ICML, 2005.
• [3] Z. Bar-Yossef. Sampling lower bounds via information theory. In STOC, 2003.
• [4] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Tracking the best hyperplane with a simple budget Perceptron. Machine Learning, 69(2-3):143–167, 2007.
• [5] C. Cortes, M. Mohri, and A. Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
• [6] A. Cotter, S. Shalev-Shwartz, and N. Srebro. The kernelized stochastic batch Perceptron. In ICML, 2012.
• [7] A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse Support Vector Machines. In ICML, 2013.
• [8] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. arXiv preprint arXiv:1407.5599, 2014.
• [9] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The forgetron: A kernel-based Perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.
• [10] P. Drineas and M. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
• [11] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.
• [12] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004.
• [13] C.-J. Hsieh, S. Si, and I. Dhillon. A divide-and-conquer solver for kernel Support Vector Machines. In ICML, 2014.
• [14] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.
• [15] S. Kumar, M. Mohri, and A. Talwalkar. Sampling techniques for the Nyström method. In AISTATS, 2009.
• [16] M. Lin, S. Weng, and C. Zhang. On the sample complexity of random Fourier features for online learning: How many random Fourier features do we need? ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):13, 2014.
• [17] M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
• [18] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
• [19] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2008.
• [20] G. Raskutti, M. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335–366, 2014.
• [21] B. Schölkopf and A. Smola. Learning with kernels: Support Vector Machines, regularization, optimization, and beyond. MIT Press, 2001.
• [22] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
• [23] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In NIPS, 2009.
• [24] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, 2012.
• [25] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
• [26] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel Ridge Regression. In COLT, 2013.

## Appendix A Proofs

### A.1 Construction properties from Section 3

We consider a randomized strategy, where the kernel matrix is sampled randomly from (according to a distribution to be defined shortly), and are fixed deterministically in a certain way. We will analyze what is the best possible performance using any budgeted algorithm, in expectation over this strategy.

To define the distribution , we let be the standard basis vectors in , and sample a kernel matrix from as follows:

• Pick uniformly at random.

• For all , define as

• if ,

• and if .

• For , choose uniformly at random from .

• Choose a permutation uniformly at random.

• Return the kernel matrix defined as for all .

To understand the construction, we begin by noting that represents the inner product of a set of vectors, and hence is always positive semidefinite and a valid kernel matrix. Moreover, are all in the set , and therefore the resulting kernel matrix equals (up to permutation of rows and columns) a block-diagonal matrix of the following form:

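$$
\begin{pmatrix}
\mathbf{1} & S_1 & 0 & 0 & \cdots & 0 & 0 \\
S_1^\top & \mathbf{1} & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \mathbf{1} & S_2 & \cdots & 0 & 0 \\
0 & 0 & S_2^\top & \mathbf{1} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \mathbf{1} & S_d \\
0 & 0 & 0 & 0 & \cdots & S_d^\top & \mathbf{1}
\end{pmatrix}
$$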

Here, $S_i$ is an all-zero block if $\sigma_i = 0$, and an all-ones block if $\sigma_i = 1$. In other words, the matrix is composed of $d$ blocks, one for each value of $i$. If $\sigma_i = 1$, then block $i$ is a monolithic all-ones block (corresponding to $v_i^1 = v_i^2$), and if $\sigma_i = 0$, then block $i$ is composed of two equal-sized sub-blocks (corresponding to $v_i^1$ and to $v_i^2$). This implies that the kernel matrix indeed belongs to the class of kernel matrices considered here.
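The block structure can be made concrete with a small sketch. The code below is illustrative only: the function names, the particular assignment of standard basis vectors to the two members of each pair, and the sizes `d` and `m` are assumptions chosen for the example rather than the exact definitions used in the construction. Each instance is stored as the index of the basis vector it equals, so an inner product is 1 when two indices agree and 0 otherwise.

```python
import random


def sample_block_kernel(d, m, rng):
    """Sample an m x m 0/1 kernel matrix made of d randomly permuted blocks.

    Assumption (hypothetical embedding): block i uses basis vectors with
    indices 2i and 2i+1. If sigma[i] == 1 both members of a pair equal the
    vector 2i; otherwise they are the orthogonal vectors 2i and 2i+1.
    """
    if m % 2 != 0:
        raise ValueError("m must be even, since instances are drawn in pairs")
    sigma = [rng.randint(0, 1) for _ in range(d)]  # one hidden bit per block
    z = []
    for _ in range(m // 2):
        i = rng.randrange(d)                       # block of this pair
        first = 2 * i
        second = 2 * i if sigma[i] == 1 else 2 * i + 1
        z.extend([first, second])
    perm = list(range(m))
    rng.shuffle(perm)                              # random row/column order
    K = [[0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            # Inner product of two standard basis vectors.
            K[perm[a]][perm[b]] = 1 if z[a] == z[b] else 0
    return K, sigma


def quadratic_form(K, x):
    """Compute x^T K x; it is non-negative for a positive semidefinite K."""
    n = len(x)
    return sum(x[a] * K[a][b] * x[b] for a in range(n) for b in range(n))


rng = random.Random(0)
K, sigma = sample_block_kernel(d=5, m=40, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(40)]
print("sigma:", sigma)
print("x^T K x =", quadratic_form(K, x))
```

Because every entry is an inner product of the same set of vectors, the quadratic form is a sum of squares and cannot be negative, which matches the positive semidefiniteness noted above.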

Our proofs rely on the following intuition: To achieve small error, the learning algorithm must know the values of the entries in $S_1, \dots, S_d$ (i.e., the values of $\sigma_1, \dots, \sigma_d$). However, when $d$ is large, these blocks are rather small, and their entries are randomly permuted in the matrix. Thus, any algorithm with a constrained query budget is likely to “miss” many of these blocks.

To simplify the presentation, we will require a few auxiliary definitions. First, given a kernel matrix constructed as above, let

$$
T_{i,1} = \{\pi(t) : z_t = v_i^1\} \quad\text{and}\quad T_{i,2} = \{\pi(t) : z_t = v_i^2\}
$$

denote the set of row/column indices in the kernel matrix, corresponding to instances which were chosen to be $v_i^1$ (respectively $v_i^2$). Note that $\{T_{i,1}, T_{i,2}\}_{i=1}^{d}$ is a disjoint partition of all indices $\{1, \dots, m\}$, and $|T_{i,1}| = |T_{i,2}|$. We then define,

$$
T_i = T_{i,1} \cup T_{i,2} \quad\text{and}\quad N_i = |T_i|, \tag{6}
$$

and also define,

$$
\beta_{i,1} = \sum_{t \in T_{i,1}} \alpha_t \quad\text{and}\quad \beta_{i,2} = \sum_{t \in T_{i,2}} \alpha_t, \tag{7}
$$

to be the sum of the corresponding coefficients in the solution returned by the algorithm. With these definitions, we can re-write the average loss and the regularization term as follows.

###### Lemma 2.

For any coefficient vector $\alpha \in \mathbb{R}^m$,

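$$
\begin{aligned}
\frac{1}{m}\sum_{t=1}^{m} \ell(\alpha^\top K e_t, y) &= \sum_{i=1}^{d} \frac{N_i}{2m}\Big(\ell(\beta_{i,1} + \sigma_i \beta_{i,2}, y) + \ell(\sigma_i \beta_{i,1} + \beta_{i,2}, y)\Big) \\
\text{and}\quad \alpha^\top K \alpha &= \sum_{i=1}^{d} \big(\beta_{i,1}^2 + \beta_{i,2}^2 + 2\sigma_i \beta_{i,1}\beta_{i,2}\big),
\end{aligned}
$$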

where $\beta_{i,1}, \beta_{i,2}$ are defined in (7).
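As a numerical sanity check of this identity, the sketch below evaluates both sides on a randomly generated instance. It is not part of the proof. The function name `check_lemma2`, the use of the absolute loss, the single target value `y`, and the way rows are labelled by a (block, half) pair are all assumptions made for the example.

```python
import random


def check_lemma2(d, m, y, seed=0):
    """Return (avg_loss_direct, avg_loss_blocks, quad_direct, quad_blocks)."""
    rng = random.Random(seed)
    sigma = [rng.randint(0, 1) for _ in range(d)]
    # labels[t] = (i, h): row t holds v_i^1 if h == 0 and v_i^2 if h == 1.
    labels = []
    for _ in range(m // 2):
        i = rng.randrange(d)
        labels += [(i, 0), (i, 1)]
    rng.shuffle(labels)  # random row/column order

    def kernel(a, b):
        (i, h), (j, g) = labels[a], labels[b]
        if i != j:
            return 0.0
        return 1.0 if h == g else float(sigma[i])

    def loss(u, target):
        return abs(u - target)  # absolute loss, used here only as an example

    alpha = [rng.gauss(0.0, 1.0) for _ in range(m)]

    # Direct evaluation: alpha^T K e_t equals (K alpha)_t since K is symmetric.
    K_alpha = [sum(kernel(t, s) * alpha[s] for s in range(m)) for t in range(m)]
    avg_direct = sum(loss(u, y) for u in K_alpha) / m
    quad_direct = sum(alpha[t] * K_alpha[t] for t in range(m))

    # Block-wise evaluation through beta_{i,1}, beta_{i,2} and N_i.
    beta = [[0.0, 0.0] for _ in range(d)]
    N = [0] * d
    for t, (i, h) in enumerate(labels):
        beta[i][h] += alpha[t]
        N[i] += 1
    avg_blocks = sum(
        N[i] / (2 * m) * (loss(beta[i][0] + sigma[i] * beta[i][1], y)
                          + loss(sigma[i] * beta[i][0] + beta[i][1], y))
        for i in range(d))
    quad_blocks = sum(b1 * b1 + b2 * b2 + 2 * s * b1 * b2
                      for (b1, b2), s in zip(beta, sigma))
    return avg_direct, avg_blocks, quad_direct, quad_blocks


print(check_lemma2(d=4, m=40, y=1.0))
```

The two loss values and the two quadratic-form values should agree up to floating-point rounding, since every row in the same half of a block sees the same prediction.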

The proof is a straightforward exercise based on the definition of $K$. Finally, we define $E_i$ to be the event that the algorithm never queries a pair of inputs in $T_i$, i.e., the algorithm’s queries $(s_1, r_1), \dots, (s_B, r_B)$ on the kernel matrix satisfy

$$
s_t \notin T_i \;\vee\; r_t \notin T_i \qquad t = 1, \dots, B.
$$

To prove our results, we will require two key lemmas, presented below, which quantify how any budgeted algorithm is likely to “miss” many blocks, and hence have its output relatively insensitive to $\sigma$.

###### Lemma 3.

Suppose and . Then for any deterministic learning algorithm,

$$
\sum_{i=1}^{d} P(E_i) > \frac{d}{2}.
$$

The formal proof is provided below. Although it is quite technical, the lemma’s intuition is very simple: Recall that the kernel matrix is composed of blocks, each of size in expectation. Thus, if we choose an entry uniformly at random, the chance of “hitting” some block is approximately . Thus, if we sample points uniformly at random, where , then the number of “missed” blocks is likely to be . The lemma above simply quantifies this, and shows that this holds not just for uniform sampling, but for any algorithm with a budgeted number of queries.
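The uniform-sampling intuition can be illustrated with a small simulation. This is a sketch under stated assumptions, not a verification of the lemma: it covers only uniformly random queries (the lemma concerns any budgeted algorithm), and the function name and the values of `d`, `m`, and the budgets `B` are hypothetical choices that do not reproduce the lemma's conditions. When each block spans roughly $m/d$ rows, a uniformly random entry lands inside one particular block with probability about $1/d^2$, so budgets well below $d^2$ should leave most blocks untouched.

```python
import random


def fraction_of_missed_blocks(d, m, B, rng):
    """Fraction of the d blocks in which no uniformly random query lands."""
    # block[t] = index i of the block T_i that contains row/column t.
    block = []
    for _ in range(m // 2):
        i = rng.randrange(d)
        block += [i, i]            # instances are drawn in pairs
    rng.shuffle(block)             # random row/column order
    discovered = set()
    for _ in range(B):
        s, r = rng.randrange(m), rng.randrange(m)
        if block[s] == block[r]:   # both indices fall in the same T_i
            discovered.add(block[s])
    return 1.0 - len(discovered) / d


rng = random.Random(0)
d, m, trials = 50, 2000, 20
for B in (d * d // 20, d * d, 20 * d * d):
    avg = sum(fraction_of_missed_blocks(d, m, B, rng) for _ in range(trials)) / trials
    print(f"B = {B:6d}: average fraction of missed blocks = {avg:.2f}")
```

A block counts as missed here exactly when no query has both indices in $T_i$, mirroring the definition of the event $E_i$.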

###### Proof.

Recall that each corresponds to one of blocks in the kernel matrix (possibly composed of two sub-blocks). The algorithm queries . For each possible query at time we define the set of blocks such that was queried with a member of that block and we obtained a value zero in the kernel matrix. Namely,

 Qs,t= {i=1,…,d:(∃τ

Given the query we define to be the blocks in which some member was queried with , and the blocks in which some member was queried with .

We introduce a quantity defined as follows: if there is a query such that and, moreover, or (that is, the block of or the block of was already discovered). Otherwise, let .

Let be the event that the -th query discovers a new block. That is, is true if and only if and