# The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning

We derive an upper bound on the local Rademacher complexity of ℓ_p-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p=1 only, while our analysis covers all cases 1≤p≤∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound showing that the upper bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{-α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels.


## 1 Introduction

Propelled by the increasing “industrialization” of modern application domains such as bioinformatics or computer vision leading to the accumulation of vast amounts of data, the past decade experienced a rapid professionalization of machine learning methods. Sophisticated machine learning solutions such as the support vector machine can nowadays almost completely be applied out-of-the-box (Bouckaert et al., 2010). Nevertheless, a displeasing stumbling block towards the complete automatization of machine learning remains that of finding the best abstraction or kernel for a problem at hand.

In the current state of research, there is little hope that a machine will be able to find automatically—or even engineer—the best kernel for a particular problem (Searle, 1980). However, by restricting to a less general problem, namely to a finite set of base kernels the algorithm can pick from, one might hope to achieve automatic kernel selection: clearly, cross-validation based model selection (Stone, 1974) can be applied if the number of base kernels is decent. Still, the performance of such an algorithm is limited by the performance of the best kernel in the set.

In the seminal work of Lanckriet et al. (2004) it was shown that it is computationally feasible to simultaneously learn a support vector machine and a linear combination of kernels, if we require the so-formed kernel combinations to be positive definite and trace-norm normalized. Though feasible for small sample sizes, the computational burden of this so-called multiple kernel learning (MKL) approach is still high. By further restricting the multi-kernel class to only contain convex combinations of kernels, the efficiency can be considerably improved, so that tens of thousands of training points and thousands of kernels can be processed (Sonnenburg et al., 2006).

However, these computational advances come at a price. Empirical evidence has accumulated showing that sparse-MKL optimized kernel combinations rarely help in practice and are frequently outperformed by a regular SVM using an unweighted-sum kernel (Cortes et al., 2008; Gehler and Nowozin, 2009), leading for instance to the provocative question “Can learning kernels help performance?” (Cortes, 2009).

By imposing an $\ell_q$-norm, $q \ge 1$, rather than an $\ell_1$ penalty on the kernel combination coefficients, MKL was finally made useful for practical applications and profitable (Kloft et al., 2009, 2011). The $\ell_q$-norm MKL is an empirical minimization algorithm that operates on the multi-kernel class consisting of functions $f : x \mapsto \langle w, \phi_k(x) \rangle$ with $\|w\|_k \le D$, where $\phi_k$ is the kernel mapping into the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ with kernel $k$ and norm $\|\cdot\|_k$, while the kernel $k$ itself ranges over the set of possible kernels $\{ k = \sum_{m=1}^M \theta_m k_m \mid \|\theta\|_q \le 1,\ \theta \ge 0 \}$.

In Figure 1, we reproduce exemplary results taken from Kloft et al. (2009, 2011) (see also references therein for further evidence pointing in the same direction). We first observe that, as expected, $\ell_q$-norm MKL enforces strong sparsity in the coefficients $\theta_m$ when $q = 1$, and no sparsity at all for $q = \infty$, which corresponds to the SVM with an unweighted-sum kernel, while intermediate values of $q$ enforce different degrees of soft sparsity (understood as the steepness of the decrease of the ordered coefficients $\theta_m$). Crucially, the performance (as measured by the AUC criterion) is not monotonic as a function of $q$; $q = 1$ (sparse MKL) yields significantly worse performance than $q = \infty$ (regular SVM with sum kernel), but optimal performance is attained for some intermediate value of $q$. This is a strong empirical motivation to study theoretically the performance of $\ell_q$-MKL beyond the limiting cases $q = 1$ or $q = \infty$.

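A conceptual milestone going back to the work of Bach et al. (2004) and Micchelli and Pontil (2005) is that the above multi-kernel class can equivalently be represented as a block-norm regularized linear class in the product Hilbert space $\mathcal{H} := \mathcal{H}_1 \times \cdots \times \mathcal{H}_M$, where $\mathcal{H}_m$ denotes the RKHS associated to kernel $k_m$, $1 \le m \le M$. More precisely, denoting by $\phi_m$ the kernel feature mapping associated to kernel $k_m$ over input space $\mathcal{X}$, and $\phi : x \in \mathcal{X} \mapsto (\phi_1(x), \ldots, \phi_M(x)) \in \mathcal{H}$, the class of functions defined above coincides with

$$
H_{p,D,M} = \big\{ f_w : x \mapsto \langle w, \phi(x) \rangle \;\big|\; w = (w^{(1)}, \ldots, w^{(M)}),\ \|w\|_{2,p} \le D \big\}, \tag{1}
$$

where there is a one-to-one mapping of $q \in [1, \infty]$ to $p \in [1, 2]$ given by $p = \frac{2q}{q+1}$. The $\ell_{2,p}$-norm is defined here as $\|w\|_{2,p} := \big\| \big( \|w^{(1)}\|_{k_1}, \ldots, \|w^{(M)}\|_{k_M} \big) \big\|_p = \big( \sum_{m=1}^M \|w^{(m)}\|_{k_m}^p \big)^{1/p}$; for simplicity, we will frequently write $\|w^{(m)}\|_2 = \|w^{(m)}\|_{k_m}$.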
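As a small illustration of the block norm and of the correspondence between $q$ and $p$, the following sketch evaluates both in finite dimensions. It assumes each block $w^{(m)}$ is a plain list of coordinates and that the mapping is $p = 2q/(q+1)$; the function names `q_to_p` and `block_norm` are hypothetical and not part of the paper.

```python
import math

def q_to_p(q):
    """Map q in [1, inf] (norm on kernel weights) to p = 2q/(q+1) in [1, 2]."""
    if math.isinf(q):
        return 2.0
    return 2.0 * q / (q + 1.0)

def block_norm(blocks, p):
    """l_{2,p} norm: the l_p norm of the vector of Euclidean block norms."""
    norms = [math.sqrt(sum(x * x for x in b)) for b in blocks]
    if math.isinf(p):
        return max(norms)
    return sum(v ** p for v in norms) ** (1.0 / p)

# Three blocks with Euclidean norms 5, 0 and 3 (illustrative values).
w = [[3.0, 4.0], [0.0], [1.0, 2.0, 2.0]]
print(block_norm(w, 1.0))                 # sum of block norms
print(block_norm(w, 2.0))                 # Euclidean norm of the whole vector
print(q_to_p(1.0), q_to_p(2.0), q_to_p(float("inf")))
```

For $p = 1$ the block norm sums the block norms (favoring sparsity across kernels), while $p = 2$ gives the ordinary Euclidean norm of the concatenated vector.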

Clearly, the complexity of the class (1) will be greater than that of a class based on a single kernel only. However, it is unclear whether the increase is moderate or considerable and, since there is a free parameter $p$, how this relates to the choice of $p$. To this end, the main aim of this paper is to analyze the sample complexity of the above hypothesis class (1). An analysis of this model, based on global Rademacher complexities, was developed by Cortes et al. (2010). In the present work, we base our main analysis on the theory of local Rademacher complexities, which allows us to derive improved and more precise rates of convergence.

#### Outline of the contributions.

This paper makes the following contributions:

• Upper bounds on the local Rademacher complexity of $\ell_p$-norm MKL are shown, from which we derive an excess risk bound that achieves a fast convergence rate of the order $O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels (previous bounds for $\ell_p$-norm MKL only achieved $O(n^{-\frac{1}{2}})$).

• A lower bound is shown that matches the upper bounds up to absolute constants, showing that our results are tight.

• The generalization performance of $\ell_p$-norm MKL as guaranteed by the excess risk bound is studied for varying values of $p$, shedding light on the appropriateness of a small/large $p$ in various learning scenarios.

Furthermore, we also present a simpler proof of the global Rademacher bound shown in Cortes et al. (2010). A comparison of the rates obtained with local and global Rademacher analysis, respectively, can be found in Section 6.1.

#### Notation.

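For notational simplicity we will omit feature maps and directly view $\phi(x)$ and $\phi_m(x)$ as random variables $x$ and $x^{(m)}$ taking values in the Hilbert spaces $\mathcal{H}$ and $\mathcal{H}_m$, respectively, where $x = (x^{(1)}, \ldots, x^{(M)})$. Correspondingly, the hypothesis class we are interested in reads $H_{p,D,M} = \{ f_w : x \mapsto \langle w, x \rangle \mid \|w\|_{2,p} \le D \}$. If $D$ or $M$ are clear from the context, we sometimes synonymously denote $H_p = H_{p,D} = H_{p,D,M}$. We will frequently use the notation $(u^{(m)})_{m=1}^M$ for the element $u = (u^{(1)}, \ldots, u^{(M)}) \in \mathcal{H} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_M$.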

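We denote the kernel matrices corresponding to $k$ and $k_m$ by $K$ and $K_m$, respectively. Note that we are considering normalized kernel Gram matrices, i.e., the $ij$th entry of $K$ is $\frac{1}{n} k(x_i, x_j)$. We will also work with covariance operators in Hilbert spaces. In a finite dimensional vector space, the (uncentered) covariance operator can be defined in usual vector/matrix notation as $\mathbb{E}\, x x^\top$. Since we are working with potentially infinite-dimensional vector spaces, we will use instead of $x x^\top$ the tensor notation $x \otimes x \in \mathrm{HS}(\mathcal{H})$, which is a Hilbert-Schmidt operator $\mathcal{H} \to \mathcal{H}$ defined as $(x \otimes x) u = \langle x, u \rangle x$. The space $\mathrm{HS}(\mathcal{H})$ of Hilbert-Schmidt operators on $\mathcal{H}$ is itself a Hilbert space, and the expectation $\mathbb{E}\, x \otimes x$ is well-defined and belongs to $\mathrm{HS}(\mathcal{H})$ as soon as $\mathbb{E} \|x\|^2$ is finite, which will always be assumed (as a matter of fact, we will often assume that $\|x\|$ is bounded a.s.). We denote by $J = \mathbb{E}\, x \otimes x$, $J_m = \mathbb{E}\, x^{(m)} \otimes x^{(m)}$ the uncentered covariance operators corresponding to variables $x$, $x^{(m)}$; it holds that $\mathrm{tr}(J) = \mathbb{E} \|x\|_2^2$ and $\mathrm{tr}(J_m) = \mathbb{E} \|x^{(m)}\|_2^2$.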

Finally, for $p \in [1, \infty]$ we use the standard notation $p^*$ to denote the conjugate of $p$, that is, $p^* \in [1, \infty]$ and $\frac{1}{p} + \frac{1}{p^*} = 1$.

## 2 Global Rademacher Complexities in Multiple Kernel Learning

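We first review global Rademacher complexities (GRC) in multiple kernel learning. Let $x_1, \ldots, x_n$ be an i.i.d. sample drawn from $P$. The global Rademacher complexity is defined as $R(H_p) = \mathbb{E} \sup_{f_w \in H_p} \langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \rangle$, where $(\sigma_i)_{1 \le i \le n}$ is an i.i.d. family (independent of $(x_i)$) of Rademacher variables (random signs). Its empirical counterpart is denoted by $\hat{R}(H_p) = \mathbb{E}_\sigma \sup_{f_w \in H_p} \langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \rangle$. The interest in the global Rademacher complexity stems from the fact that, once known, it can be used to bound the generalization error (Koltchinskii, 2001; Bartlett and Mendelson, 2002).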

In the recent paper of Cortes et al. (2010) it was shown using a combinatorial argument that the empirical version of the global Rademacher complexity can be bounded as

$$
\hat{R}(H_p) \le D \sqrt{\frac{c\, p^*}{n} \Big\| \big( \mathrm{tr}(K_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}}},
$$

where $c = \frac{23}{22}$ and $\mathrm{tr}(K)$ denotes the trace of the kernel matrix $K$. We will now give a short proof of this result and then present a bound on the population version of the GRC. The proof presented here is based on the Khintchine-Kahane inequality (Kahane, 1985) using the constants taken from Lemma 3.3.1 and Proposition 3.4.1 in Kwapién and Woyczyński (1992).

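[Khintchine-Kahane inequality] Let $\mathcal{H}$ be a Hilbert space and $v_1, \ldots, v_n \in \mathcal{H}$. Then, for any $q \ge 1$, it holds

$$
\mathbb{E}_\sigma \Big\| \sum_{i=1}^n \sigma_i v_i \Big\|_2^q \le \Big( c \sum_{i=1}^n \|v_i\|_2^2 \Big)^{\frac{q}{2}},
$$

where $c = \max(1, q - 1)$. In particular the result holds for $c = q$.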

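[Global Rademacher complexity, empirical version] For any $p \ge 1$ the empirical version of global Rademacher complexity of the multi-kernel class $H_p$ can be bounded as

$$
\forall t \ge p: \quad \hat{R}(H_p) \le D \sqrt{\frac{t^*}{n} \Big\| \big( \mathrm{tr}(K_m) \big)_{m=1}^M \Big\|_{\frac{t^*}{2}}}.
$$
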
###### Proof.

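First note that it suffices to prove the result for $t = p$, as trivially $\|w\|_{2,t} \le \|w\|_{2,p}$ holds for all $t \ge p$ and therefore $\hat{R}(H_p) \le \hat{R}(H_t)$. We can use a block-structured version of Hölder’s inequality (cf. Lemma A) and the Khintchine-Kahane (K.-K.) inequality (cf. Lemma 2) to bound the empirical version of the global Rademacher complexity as follows:

$$
\begin{aligned}
\hat{R}(H_p) &\overset{\text{def.}}{=} \mathbb{E}_\sigma \sup_{f_w \in H_p} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \Big\rangle \\
&\overset{\text{Hölder}}{\le} D\, \mathbb{E}_\sigma \Big\| \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \Big\|_{2,p^*} \\
&\overset{\text{Jensen}}{\le} D \Big( \mathbb{E}_\sigma \sum_{m=1}^M \Big\| \frac{1}{n} \sum_{i=1}^n \sigma_i x_i^{(m)} \Big\|_2^{p^*} \Big)^{\frac{1}{p^*}} \\
&\overset{\text{K.-K.}}{\le} D \sqrt{\frac{p^*}{n}} \Big( \sum_{m=1}^M \Big( \underbrace{\frac{1}{n} \sum_{i=1}^n \big\| x_i^{(m)} \big\|_2^2}_{=\, \mathrm{tr}(K_m)} \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}} \\
&= D \sqrt{\frac{p^*}{n} \Big\| \big( \mathrm{tr}(K_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}}},
\end{aligned}
$$

which was to be shown. ∎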
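The bound can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' experiment: it uses explicit finite-dimensional Gaussian features in place of kernel feature maps, and relies on the fact that the supremum of $\langle w, v \rangle$ over $\|w\|_{2,p} \le D$ equals $D \|v\|_{2,p^*}$, so that the empirical complexity can be estimated by Monte Carlo over the random signs. All names (`empirical_grc_mc`, `grc_bound`) and parameter values are hypothetical.

```python
import math
import random

def block_norm(blocks, p):
    """l_{2,p} norm of a list of blocks (finite p)."""
    norms = [math.sqrt(sum(x * x for x in b)) for b in blocks]
    return sum(v ** p for v in norms) ** (1.0 / p)

def empirical_grc_mc(sample, p_star, D, n_draws, rng):
    """Monte Carlo estimate of D * E_sigma ||(1/n) sum_i sigma_i x_i||_{2,p*}."""
    n, M = len(sample), len(sample[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        avg = [[sum(s * x[m][k] for s, x in zip(sigma, sample)) / n
                for k in range(len(sample[0][m]))]
               for m in range(M)]
        total += block_norm(avg, p_star)
    return D * total / n_draws

def grc_bound(sample, p_star, D):
    """D * sqrt( (p*/n) * ||(tr(K_m))_m||_{p*/2} ) for normalized Gram matrices."""
    n, M = len(sample), len(sample[0])
    # tr(K_m) = (1/n) sum_i ||x_i^(m)||^2 when K_m is the normalized Gram matrix
    traces = [sum(sum(v * v for v in x[m]) for x in sample) / n for m in range(M)]
    norm = sum(t ** (p_star / 2.0) for t in traces) ** (2.0 / p_star)
    return D * math.sqrt(p_star / n * norm)

rng = random.Random(0)
n, M, d = 30, 4, 3          # sample size, number of blocks, block dimension
sample = [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(M)]
          for _ in range(n)]
p = 4.0 / 3.0
p_star = p / (p - 1.0)      # conjugate exponent, here 4
est = empirical_grc_mc(sample, p_star, D=1.0, n_draws=2000, rng=rng)
bound = grc_bound(sample, p_star, D=1.0)
print(est, bound)           # the estimate should not exceed the bound
```

The gap between the two numbers reflects the slack introduced by the Jensen and Khintchine-Kahane steps of the proof.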

#### Remark.

Note that there is a very good reason to state the above bound in terms of $t \ge p$ instead of solely in terms of $p$: the bound on the Rademacher complexity $\hat{R}(H_p)$ is not monotonic in $p$, and thus it is not always the best choice to take $t := p$ in the above bound. This can be readily seen, for example, in the easy case where all kernels have the same trace; in that case the bound translates into $\hat{R}(H_p) \le D \sqrt{t^* M^{\frac{2}{t^*}} \frac{\mathrm{tr}(K_1)}{n}}$. Interestingly, the function $x \mapsto x M^{2/x}$ is not monotone and attains its minimum for $x = 2 \log M$, where $\log$ denotes the natural logarithm with respect to the base $e$. This has interesting consequences: for any $p \le (2 \log M)^*$ we can take the bound $\hat{R}(H_p) \le D \sqrt{\frac{2 e \log(M)\, \mathrm{tr}(K_1)}{n}}$, which has only a mild dependency on the number of kernels; note that in particular we can take this bound for the $\ell_1$-norm class $\hat{R}(H_1)$ for all $M > 1$.
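The location and value of the minimum of $x \mapsto x M^{2/x}$ can be confirmed with a short grid search. This is only a numerical sanity check; the choice $M = 1000$, the grid, and the helper name `g` are arbitrary assumptions made for illustration.

```python
import math

def g(x, M):
    """The function x -> x * M^(2/x) appearing in the equal-trace bound."""
    return x * M ** (2.0 / x)

M = 1000
x_star = 2.0 * math.log(M)                     # claimed minimizer
grid = [2.0 + 0.01 * k for k in range(5000)]   # x = t* ranges over [2, 52)
x_best = min(grid, key=lambda x: g(x, M))      # grid minimizer

print(x_star, x_best)
print(g(x_star, M), 2.0 * math.e * math.log(M))   # minimum value vs. 2e log M
print(g(2.0, M), g(50.0, M))                   # larger values on both sides
```

Setting the derivative $M^{2/x}(1 - 2\log(M)/x)$ to zero gives the same minimizer, and substituting $x = 2 \log M$ yields the value $2e \log M$ used in the remark.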

Despite the simplicity of the above proof, the constants are slightly better than the ones achieved in Cortes et al. (2010). However, computing the population version of the global Rademacher complexity of MKL is somewhat more involved and, to the best of our knowledge, has not yet been addressed in the literature. To this end, note that from the previous proof we obtain $R(H_p) \le \mathbb{E}\, D \sqrt{p^*/n} \big( \sum_{m=1}^M \big( \frac{1}{n} \sum_{i=1}^n \|x_i^{(m)}\|_2^2 \big)^{\frac{p^*}{2}} \big)^{\frac{1}{p^*}}$. We can thus use Jensen’s inequality to move the expectation operator inside the root,

$$
R(H_p) \le D \sqrt{p^*/n}\, \Big( \sum_{m=1}^M \mathbb{E} \Big( \frac{1}{n} \sum_{i=1}^n \big\| x_i^{(m)} \big\|_2^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}}, \tag{2}
$$

but now need a handle on the $\frac{p^*}{2}$-th moments. To this aim we use the inequalities of Rosenthal (1970) and Young (e.g., Steele, 2004) to show the following lemma.

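[Rosenthal + Young] Let $X_1, \ldots, X_n$ be independent nonnegative random variables satisfying $X_i \le B < \infty$ almost surely. Then, denoting $C_q = (2qe)^q$, for any $q \ge \frac{1}{2}$ it holds

$$
\mathbb{E} \Big( \frac{1}{n} \sum_{i=1}^n X_i \Big)^q \le C_q \Big( \Big( \frac{B}{n} \Big)^q + \Big( \frac{1}{n} \sum_{i=1}^n \mathbb{E} X_i \Big)^q \Big).
$$

The proof is deferred to Appendix A. It is now easy to show: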

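[Global Rademacher complexity, population version] Assume the kernels are uniformly bounded, that is, $\|k\|_\infty \le B < \infty$, almost surely. Then for any $p \ge 1$ the population version of global Rademacher complexity of the multi-kernel class $H_p$ can be bounded as

$$
\forall t \ge p: \quad R(H_{p,D,M}) \le D\, t^* \sqrt{\frac{e}{n} \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{t^*}{2}}} + \frac{\sqrt{Be}\, D M^{\frac{1}{t^*}} t^*}{n}.
$$

For $t \ge 2$ the right-hand term can be discarded and the result also holds for unbounded kernels.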

###### Proof.

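As in the previous proof, it suffices to prove the result for $t = p$. From (2) we conclude by the previous Lemma

$$
\begin{aligned}
R(H_p) &\le D \sqrt{\frac{p^*}{n}} \Big( \sum_{m=1}^M (e p^*)^{\frac{p^*}{2}} \Big( \Big( \frac{B}{n} \Big)^{\frac{p^*}{2}} + \Big( \underbrace{\mathbb{E} \frac{1}{n} \sum_{i=1}^n \big\| x_i^{(m)} \big\|_2^2}_{=\, \mathrm{tr}(J_m)} \Big)^{\frac{p^*}{2}} \Big) \Big)^{\frac{1}{p^*}} \\
&\le D p^* \sqrt{\frac{e}{n} \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}}} + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n},
\end{aligned}
$$

where for the last inequality we use the subadditivity of the root function. Note that for $p \ge 2$ we have $p^*/2 \le 1$, and thus it suffices to employ Jensen’s inequality instead of the previous lemma, so that the last term on the right-hand side is not needed. ∎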

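For example, when the traces of the kernels are bounded, the above bound is essentially determined by $O\big( \frac{p^* M^{1/p^*}}{\sqrt{n}} \big)$. We can also remark that by setting $t = (\log(M))^*$ we obtain the bound $R(H_1) = O\big( \frac{\log M}{\sqrt{n}} \big)$.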

## 3 The Local Rademacher Complexity of Multiple Kernel Learning

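Let $x_1, \ldots, x_n$ be an i.i.d. sample drawn from $P$. We define the local Rademacher complexity of $H_p$ as $R_r(H_p) = \mathbb{E} \sup_{f_w \in H_p :\, P f_w^2 \le r} \langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \rangle$, where $P f_w^2 := \mathbb{E} (f_w(x))^2$. Note that it subsumes the global RC as a special case for $r = \infty$. As self-adjoint, positive Hilbert-Schmidt operators, covariance operators enjoy discrete eigenvalue-eigenvector decompositions $J = \mathbb{E}\, x \otimes x = \sum_{j=1}^\infty \lambda_j u_j \otimes u_j$ and $J_m = \mathbb{E}\, x^{(m)} \otimes x^{(m)} = \sum_{j=1}^\infty \lambda_j^{(m)} u_j^{(m)} \otimes u_j^{(m)}$, where $(u_j)_{j \ge 1}$ and $(u_j^{(m)})_{j \ge 1}$ form orthonormal bases of $\mathcal{H}$ and $\mathcal{H}_m$, respectively.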

We will need the following assumption for the case $1 \le p \le 2$:

###### Assumption (U) (no-correlation).

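The Hilbert space valued variables $x^{(1)}, \ldots, x^{(M)}$ are said to be (pairwise) uncorrelated if for any $m \ne m'$ and $w \in \mathcal{H}_m$, $w' \in \mathcal{H}_{m'}$, the real variables $\langle w, x^{(m)} \rangle$ and $\langle w', x^{(m')} \rangle$ are uncorrelated.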

Since $\mathcal{H}_m$, $\mathcal{H}_{m'}$ are RKHSs with kernels $k_m$, $k_{m'}$, if we go back to the input random variable $X$ in the original space $\mathcal{X}$, the above property is equivalent to saying that for any fixed $t, t' \in \mathcal{X}$, the variables $k_m(X, t)$ and $k_{m'}(X, t')$ are uncorrelated. This is the case, for example, if the original input space $\mathcal{X}$ is $\mathbb{R}^M$, the original input variable $X \in \mathcal{X}$ has independent coordinates, and the kernels $k_1, \ldots, k_M$ each act on a different coordinate. Such a setting was considered in particular by Raskutti et al. (2010) in the setting of $\ell_1$-penalized MKL. We discuss this assumption in more detail in Section 6.2.
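The independent-coordinates example can be illustrated by simulation. The following sketch assumes a two-dimensional standard Gaussian input and Gaussian kernels of unit width, each acting on a single coordinate; the evaluation points, sample size, and helper names are hypothetical choices. It estimates the correlation between $k_1(X, t)$ and $k_2(X, t')$ and contrasts it with two kernel evaluations that act on the same coordinate.

```python
import math
import random

def gauss_kernel(a, b, width=1.0):
    """One-dimensional Gaussian kernel exp(-(a-b)^2 / (2 width^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def correlation(u, v):
    """Sample Pearson correlation of two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

rng = random.Random(1)
n = 20000
# Input variable with independent coordinates
X = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

t, t_prime = 0.3, -1.2
f1 = [gauss_kernel(x[0], t) for x in X]        # k_1 acts on coordinate 1
f2 = [gauss_kernel(x[1], t_prime) for x in X]  # k_2 acts on coordinate 2
f3 = [gauss_kernel(x[0], 0.5) for x in X]      # also acts on coordinate 1

corr_different = correlation(f1, f2)   # near zero, up to sampling noise
corr_same = correlation(f1, f3)        # clearly positive
print(corr_different, corr_same)
```

Kernels sharing a coordinate produce strongly correlated features, which is the situation that Assumption (U) excludes.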

We are now equipped to state our main results:

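[Local Rademacher complexity, $p \in [1, 2]$] Assume that the kernels are uniformly bounded ($\|k\|_\infty \le B < \infty$) and that Assumption (U) holds. The local Rademacher complexity of the multi-kernel class $H_p$ can be bounded for any $1 \le p \le 2$ as

$$
\forall t \in [p, 2]: \quad R_r(H_p) \le \sqrt{\frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\Big( r M^{1 - \frac{2}{t^*}},\ c e D^2 {t^*}^2 \lambda_j^{(m)} \Big) \Big)_{m=1}^M \Big\|_{\frac{t^*}{2}}} + \frac{\sqrt{Be}\, D M^{\frac{1}{t^*}} t^*}{n}.
$$

[Local Rademacher complexity, $p \ge 2$] The local Rademacher complexity of the multi-kernel class $H_p$ can be bounded for any $p \ge 2$ as

$$
R_r(H_p) \le \sqrt{\frac{2}{n} \sum_{j=1}^\infty \min\Big( r,\ D^2 M^{\frac{2}{p^*} - 1} \lambda_j \Big)}.
$$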
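To get a feeling for how the truncation at level $r$ acts in the bound for $p \ge 2$, the sketch below evaluates its right-hand side for a hypothetical polynomially decaying spectrum $\lambda_j = j^{-\alpha}$. The spectrum, the truncation of the infinite sum to finitely many eigenvalues, and all parameter values are assumptions made for illustration only.

```python
import math

def local_rc_bound(eigs, r, n, D, M, p):
    """sqrt((2/n) * sum_j min(r, D^2 M^(2/p* - 1) lambda_j)) for p >= 2.

    The infinite sum is truncated to the eigenvalues passed in `eigs`.
    """
    p_star = 1.0 if math.isinf(p) else p / (p - 1.0)
    scale = D ** 2 * M ** (2.0 / p_star - 1.0)
    return math.sqrt(2.0 / n * sum(min(r, scale * lam) for lam in eigs))

alpha = 2.0                                        # assumed decay rate
eigs = [j ** (-alpha) for j in range(1, 10001)]    # truncated spectrum
n, D, M, p = 1000, 1.0, 10, 2.0

values = {r: local_rc_bound(eigs, r, n, D, M, p) for r in (1.0, 1e-1, 1e-2, 1e-3)}
global_value = local_rc_bound(eigs, float("inf"), n, D, M, p)

for r, v in values.items():
    print(r, v)
print("r = inf:", global_value)
```

Small radii $r$ cut off the leading eigenvalues, so the bound decreases with $r$; for $r = \infty$ it reduces to a bound of global type that depends only on the trace.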

#### Remark 1.

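Note that for the case $p = 1$, by using $t = (\log(M))^*$ in Theorem 3, we obtain the bound

$$
R_r(H_1) \le \sqrt{\frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\Big( r M,\ c e^3 D^2 (\log M)^2 \lambda_j^{(m)} \Big) \Big)_{m=1}^M \Big\|_\infty} + \frac{\sqrt{B}\, e^{\frac{3}{2}} D \log(M)}{n},
$$

for all $M \ge e^2$. (See below after the proof of Theorem 3 for a detailed justification.)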

#### Remark 2.

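The result of Theorem 3 for $p \ge 2$ can be proved using considerably simpler techniques and without imposing assumptions on boundedness or on uncorrelation of the kernels. If in addition the variables $(x^{(m)})$ are centered and uncorrelated, then the spectra are related as follows: $\mathrm{spec}(J) = \bigcup_{m=1}^M \mathrm{spec}(J_m)$; that is, $\{\lambda_i, i \ge 1\} = \bigcup_{m=1}^M \{\lambda_i^{(m)}, i \ge 1\}$. Then one can write equivalently the bound of Theorem 3 as $R_r(H_p) \le \sqrt{\frac{2}{n} \sum_{m=1}^M \sum_{j=1}^\infty \min\big( r, D^2 M^{\frac{2}{p^*} - 1} \lambda_j^{(m)} \big)} = \sqrt{\frac{2}{n} \big\| \big( \sum_{j=1}^\infty \min\big( r, D^2 M^{\frac{2}{p^*} - 1} \lambda_j^{(m)} \big) \big)_{m=1}^M \big\|_1}$. However, the main intended focus of this paper is on the more challenging case $1 \le p \le 2$, which is usually studied in multiple kernel learning and relevant in practice.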

#### Remark 3.

It is interesting to compare the above bounds for the special case $p = 2$ with the ones of Bartlett et al. (2005). The main term of the bound of Theorem 3 (taking $t = p = 2$) is then essentially determined by $O\Big( \sqrt{\frac{1}{n} \sum_{m=1}^M \sum_{j=1}^\infty \min\big( r, \lambda_j^{(m)} \big)} \Big)$. If the variables $(x^{(m)})$ are centered and uncorrelated, by the relation between the spectra stated in Remark 2, this is equivalently of order $O\Big( \sqrt{\frac{1}{n} \sum_{j=1}^\infty \min( r, \lambda_j )} \Big)$, which is also what we obtain through Theorem 3, and coincides with the rate shown in Bartlett et al. (2005).

###### Proof.

of Theorem 3 and Remark 1.  The proof is based on first relating the complexity of the class $H_p$ with its centered counterpart, i.e., where all functions $f_w \in H_p$ are centered around their expected value. Then we compute the complexity of the centered class by decomposing the complexity into blocks, applying the no-correlation assumption, and using the inequalities of Hölder and Rosenthal. Then we relate it back to the original class, which we in the final step relate to a bound involving the truncation of the particular spectra of the kernels. Note that it suffices to prove the result for $t = p$, as trivially $R_r(H_p) \le R_r(H_t)$ for all $p \le t$.

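Step 1: Relating the original class with the centered class.  In order to exploit the no-correlation assumption, we will work in large parts of the proof with the centered class $\tilde{H}_p = \{ \tilde{f}_w \mid \|w\|_{2,p} \le D \}$, wherein $\tilde{f}_w : x \mapsto \langle w, \tilde{x} \rangle$, and $\tilde{x} := x - \mathbb{E} x$. We start the proof by noting that $\tilde{f}_w(x) = f_w(x) - \langle w, \mathbb{E} x \rangle = f_w(x) - \mathbb{E} f_w(x)$, so that, by the bias-variance decomposition, it holds that

$$
P f_w^2 = \mathbb{E} f_w(x)^2 = \mathbb{E} \big( f_w(x) - \mathbb{E} f_w(x) \big)^2 + \big( \mathbb{E} f_w(x) \big)^2 = P \tilde{f}_w^2 + (P f_w)^2. \tag{3}
$$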

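Furthermore we note that by Jensen’s inequality

$$
\big\| \mathbb{E} x \big\|_{2,p^*} = \Big( \sum_{m=1}^M \big\| \mathbb{E} x^{(m)} \big\|_2^{p^*} \Big)^{\frac{1}{p^*}} \overset{\text{Jensen}}{\le} \Big( \sum_{m=1}^M \Big( \mathbb{E} \big\| x^{(m)} \big\|_2^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}} = \sqrt{\Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}}} \tag{4}
$$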

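so that we can bound the complexity of the original class in terms of the centered one as follows:

$$
\begin{aligned}
R_r(H_p) &= \mathbb{E} \sup_{f_w \in H_p,\, P f_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \Big\rangle \\
&\le \mathbb{E} \sup_{f_w \in H_p,\, P f_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle + \mathbb{E} \sup_{f_w \in H_p,\, P f_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i\, \mathbb{E} x \Big\rangle.
\end{aligned}
$$

Concerning the first term of the above upper bound, using (3) we have $P \tilde{f}_w^2 \le P f_w^2$, and thus

$$
\mathbb{E} \sup_{f_w \in H_p,\, P f_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle \le \mathbb{E} \sup_{f_w \in H_p,\, P \tilde{f}_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle = R_r(\tilde{H}_p).
$$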

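Now to bound the second term, we write

$$
\begin{aligned}
\mathbb{E} \sup_{f_w \in H_p,\, P f_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i\, \mathbb{E} x \Big\rangle &= \mathbb{E} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \Big| \sup_{f_w \in H_p,\, P f_w^2 \le r} \langle w, \mathbb{E} x \rangle \\
&\le \sup_{f_w \in H_p,\, P f_w^2 \le r} \langle w, \mathbb{E} x \rangle \Big( \mathbb{E} \Big( \frac{1}{n} \sum_{i=1}^n \sigma_i \Big)^2 \Big)^{\frac{1}{2}} \\
&= n^{-\frac{1}{2}} \sup_{f_w \in H_p,\, P f_w^2 \le r} \langle w, \mathbb{E} x \rangle.
\end{aligned}
$$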

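Now observe finally that we have

$$
\langle w, \mathbb{E} x \rangle \overset{\text{Hölder}}{\le} \|w\|_{2,p} \|\mathbb{E} x\|_{2,p^*} \overset{(4)}{\le} \|w\|_{2,p} \sqrt{\Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}}}
$$

as well as

$$
\langle w, \mathbb{E} x \rangle = \mathbb{E} f_w(x) \le \sqrt{P f_w^2}.
$$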

We finally obtain, putting together the steps above,

$$
R_r(H_p) \le R_r(\tilde{H}_p) + n^{-\frac{1}{2}} \min\Big( \sqrt{r},\ D \sqrt{\Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}}} \Big). \tag{5}
$$

This shows that, at the expense of the additional summand on the right-hand side, we can work with the centered class instead of the uncentered one.
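Two elementary facts used in Step 1 can be verified exactly on small examples: the second moment of a Rademacher average, $\mathbb{E} (\frac{1}{n} \sum_i \sigma_i)^2 = \frac{1}{n}$, which produces the factor $n^{-1/2}$ in (5), and the bias-variance decomposition (3). The sketch below enumerates all sign vectors for small $n$ and checks (3) for an empirical distribution over a few arbitrary function values; the helper names and the numbers are hypothetical.

```python
import itertools

def rademacher_mean_second_moment(n):
    """Exact E[((1/n) sum_i sigma_i)^2], enumerating all 2^n sign vectors."""
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += (sum(sigma) / n) ** 2
    return total / 2 ** n

def decomposition_gap(values):
    """E f^2 - (E (f - E f)^2 + (E f)^2) under the uniform distribution on `values`."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum(v * v for v in values) / n
    centered = sum((v - mean) ** 2 for v in values) / n
    return second_moment - (centered + mean ** 2)

for n in (1, 2, 5, 10):
    print(n, rademacher_mean_second_moment(n), 1.0 / n)

f_values = [0.7, -1.3, 2.1, 0.4, 0.0]   # arbitrary values of f_w(x)
print(decomposition_gap(f_values))      # zero up to rounding
```

The cross terms $\sigma_i \sigma_l$ with $i \ne l$ average to zero, which is why only the $n$ diagonal terms of size $1/n^2$ survive.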

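Step 2: Bounding the complexity of the centered class.  Since the (centered) covariance operator $\mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)}$ is also a self-adjoint Hilbert-Schmidt operator on $\mathcal{H}_m$, there exists an eigendecomposition

$$
\mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)} = \sum_{j=1}^\infty \tilde{\lambda}_j^{(m)}\, \tilde{u}_j^{(m)} \otimes \tilde{u}_j^{(m)}, \tag{6}
$$

wherein $(\tilde{u}_j^{(m)})_{j \ge 1}$ is an orthonormal basis of $\mathcal{H}_m$. Furthermore, the no-correlation assumption (U) entails $\mathbb{E}\, \tilde{x}^{(l)} \otimes \tilde{x}^{(m)} = 0$ for all $l \ne m$. As a consequence,

$$
\begin{aligned}
P \tilde{f}_w^2 = \mathbb{E} \big( \tilde{f}_w(x) \big)^2 &= \mathbb{E} \Big( \sum_{m=1}^M \big\langle w^{(m)}, \tilde{x}^{(m)} \big\rangle \Big)^2 = \sum_{l,m=1}^M \Big\langle w^{(l)}, \big( \mathbb{E}\, \tilde{x}^{(l)} \otimes \tilde{x}^{(m)} \big) w^{(m)} \Big\rangle \\
&\overset{\text{(U)}}{=} \sum_{m=1}^M \Big\langle w^{(m)}, \big( \mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)} \big) w^{(m)} \Big\rangle = \sum_{m=1}^M \sum_{j=1}^\infty \tilde{\lambda}_j^{(m)} \big\langle w^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2
\end{aligned}
\tag{7}
$$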

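and, for all $j$ and $m$,

$$
\begin{aligned}
\mathbb{E} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle^2 &= \mathbb{E} \frac{1}{n^2} \sum_{i,l=1}^n \sigma_i \sigma_l \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle \big\langle \tilde{x}_l^{(m)}, \tilde{u}_j^{(m)} \big\rangle \overset{\sigma \text{ i.i.d.}}{=} \mathbb{E} \frac{1}{n^2} \sum_{i=1}^n \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \\
&= \frac{1}{n} \Big\langle \tilde{u}_j^{(m)}, \Big( \underbrace{\frac{1}{n} \sum_{i=1}^n \mathbb{E}\, \tilde{x}_i^{(m)} \otimes \tilde{x}_i^{(m)}}_{=\, \mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)}} \Big) \tilde{u}_j^{(m)} \Big\rangle = \frac{\tilde{\lambda}_j^{(m)}}{n}.
\end{aligned}
\tag{8}
$$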
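The first equality chain in (8) holds already conditionally on the sample: averaging over the signs alone gives $\frac{1}{n^2} \sum_i \langle \tilde{x}_i, u \rangle^2 = \frac{1}{n} \langle u, \hat{C} u \rangle$, where $\hat{C}$ is the empirical second-moment matrix. The sketch below checks this identity exactly by enumerating all sign vectors for a small synthetic sample; it is a finite-dimensional illustration with hypothetical names, and it uses an arbitrary unit vector $u$ (for an eigenvector of $\hat{C}$ the quadratic form would equal the corresponding eigenvalue).

```python
import itertools
import math
import random

def sign_averaged_projection(xs, u):
    """Exact E_sigma <(1/n) sum_i sigma_i x_i, u>^2 over all 2^n sign vectors."""
    n = len(xs)
    proj = [sum(a * b for a, b in zip(x, u)) for x in xs]   # <x_i, u>
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += (sum(s * p for s, p in zip(sigma, proj)) / n) ** 2
    return total / 2 ** n

def scaled_quadratic_form(xs, u):
    """(1/n) <u, C u> with C = (1/n) sum_i x_i x_i^T."""
    n = len(xs)
    return sum(sum(a * b for a, b in zip(x, u)) ** 2 for x in xs) / n ** 2

rng = random.Random(2)
n, d = 10, 3
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
raw = [1.0, -2.0, 0.5]
length = math.sqrt(sum(v * v for v in raw))
u = [v / length for v in raw]            # arbitrary unit vector

lhs = sign_averaged_projection(xs, u)
rhs = scaled_quadratic_form(xs, u)
print(lhs, rhs)                          # equal up to rounding
```

Taking the expectation over the sample then replaces $\hat{C}$ by the covariance operator, which yields the value $\tilde{\lambda}_j^{(m)} / n$ in (8).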

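Let now $h_1, \ldots, h_M$ be arbitrary nonnegative integers. We can express the local Rademacher complexity in terms of the eigendecomposition (6) as follows

$$
\begin{aligned}
R_r(\tilde{H}_p) &= \mathbb{E} \sup_{f_w \in \tilde{H}_p :\, P \tilde{f}_w^2 \le r} \Big\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle \\
&= \mathbb{E} \sup_{f_w \in \tilde{H}_p :\, P \tilde{f}_w^2 \le r} \Big\langle \big( w^{(m)} \big)_{m=1}^M, \Big( \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\le \mathbb{E} \sup_{P \tilde{f}_w^2 \le r} \Big\langle \Big( \sum_{j=1}^{h_m} \sqrt{\tilde{\lambda}_j^{(m)}} \big\langle w^{(m)}, \tilde{u}_j^{(m)} \big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M, \Big( \sum_{j=1}^{h_m} \sqrt{\tilde{\lambda}_j^{(m)}}^{\,-1} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\quad + \mathbb{E} \sup_{f_w \in \tilde{H}_p} \Big\langle w, \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\overset{\text{C.-S., Jensen}}{\le} \sup_{P \tilde{f}_w^2 \le r} \Big[ \Big( \sum_{m=1}^M \sum_{j=1}^{h_m} \tilde{\lambda}_j^{(m)} \big\langle w^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \Big)^{\frac{1}{2}} \times \Big( \sum_{m=1}^M \sum_{j=1}^{h_m} \big( \tilde{\lambda}_j^{(m)} \big)^{-1} \mathbb{E} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle^2 \Big)^{\frac{1}{2}} \Big] \\
&\quad + \mathbb{E} \sup_{f_w \in \tilde{H}_p} \Big\langle w, \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle
\end{aligned}
$$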

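so that (7) and (8) yield

$$
\begin{aligned}
R_r(\tilde{H}_p) &\overset{(7),\,(8)}{\le} \sqrt{\frac{r \sum_{m=1}^M h_m}{n}} + \mathbb{E} \sup_{f_w \in \tilde{H}_p} \Big\langle w, \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\overset{\text{Hölder}}{\le} \sqrt{\frac{r \sum_{m=1}^M h_m}{n}} + D\, \mathbb{E} \Big\| \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\|_{2,p^*}.
\end{aligned}
$$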

Step 3: Khintchine-Kahane’s and Rosenthal’s inequalities.  We can now use the Khintchine-Kahane (K.-K.) inequality (see Lemma 2 in Appendix A) to further bound the right term in the above expression as follows

 E∥∥∥(∞∑j=hm+1⟨1nn∑i=1σi~x(m)i,~u(m)j⟩~u(m)j)Mm=1∥∥∥2,p∗Jensen% ≤E(M∑m=1Eσ∥∥∥∞∑j=hm+1⟨1nn∑i=1σi~x(m)i,~u(m)j⟩~u(m)j∥∥∥p∗Hm)1p∗K.-K.≤√p∗n E(M∑m=1(∞∑j=hm+11nn∑i=1⟨~x(m)i,~u(m)j⟩2)p∗2)1p∗Jensen≤√p∗n (M∑m=1E(∞∑j=hm+11nn∑i=1⟨~x(m)i