# On the Rademacher Complexity of Linear Hypothesis Sets

Linear predictors form a rich class of hypotheses used in a variety of learning algorithms. We present a tight analysis of the empirical Rademacher complexity of the family of linear hypothesis classes with weight vectors bounded in ℓ_p-norm for any p ≥ 1. This provides a tight analysis of generalization using these hypothesis sets and helps derive sharp data-dependent learning guarantees. We give both upper and lower bounds on the Rademacher complexity of these families and show that our bounds improve upon or match existing bounds, which are known only for 1 ≤ p ≤ 2.

## 1 Introduction

Linear predictors form a rich class of hypotheses used in a variety of learning algorithms, including SVM (Cortes and Vapnik, 1995), logistic regression or conditional maximum entropy models (Berger et al., 1996), ridge regression (Hoerl and Kennard, 1970), and Lasso (Tibshirani, 1996).

Different regularizations or $\ell_p$-norm conditions are used to constrain the family of linear predictors. This short note gives a sharp analysis of the generalization properties of linear predictors for arbitrary $\ell_p$-norm upper bound constraints. To do so, we give tight upper bounds on the empirical Rademacher complexity of these hypothesis sets, which we show are matched by lower bounds, modulo some constants.

The notion of Rademacher complexity is a general complexity measure used to derive sharp data-dependent learning guarantees for different hypothesis sets, including margin bounds, which are key in the analysis of generalization for classification (Koltchinskii and Panchenko, 2002; Bartlett and Mendelson, 2002; Mohri et al., 2018). There are known upper bounds on the Rademacher complexity of linear hypothesis sets for some values of $p$, including $p = 1$ or $p = 2$ (Bartlett and Mendelson, 2002; Mohri et al., 2018), as well as $1 < p \le 2$ (Kakade et al., 2008). Our upper bounds on the empirical Rademacher complexity are tighter than those known for $1 \le p < 2$ and match the existing one for $p = 2$. We further give upper bounds on the Rademacher complexity for other values of $p$ ($p > 2$). Our upper bounds are expressed in terms of $\|\mathbf{X}^\top\|_{2,p^*}$, where $\mathbf{X}$ is the matrix whose columns are the sample points and where $p^*$ is the conjugate number associated to $p$. We give matching lower bounds in terms of the same quantity for all values of $p$, which suggests the key role played by this quantity in the analysis of complexity.

Many of the results presented here already appeared in (Awasthi et al., 2020), in the context of the analysis of adversarial Rademacher complexity. Here, we present a more self-contained and detailed analysis, including the statement and proof of lower bounds. In Section 2, we introduce some preliminary definitions and notation. We present our new upper and lower bounds on the Rademacher complexity of linear hypothesis sets in Section 3 (Theorem 3). The proof of the upper bounds is given in Appendix A and that of the lower bounds in Appendix B. Lastly, in Appendix D we give a detailed analysis of how our bounds improve upon existing ones.

## 2 Preliminaries

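We will denote vectors as lowercase bold letters (e.g., $\mathbf{x}$) and matrices as uppercase bold (e.g., $\mathbf{X}$). The all-ones vector is denoted by $\mathbf{1}$. The Hölder conjugate of $p$ is denoted by $p^*$. For a matrix $\mathbf{M}$, the $(p, q)$-group norm is defined as the $q$-norm of the $p$-norm of the columns of $\mathbf{M}$, that is $\|\mathbf{M}\|_{p,q} = \big\|\big(\|\mathbf{M}_1\|_p, \ldots, \|\mathbf{M}_d\|_p\big)\big\|_q$, where the $\mathbf{M}_i$s are the columns of $\mathbf{M}$.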
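The group norm defined above can be sketched in a few lines of Python. This is only an illustration: the list-of-rows representation and the function names `vector_norm`, `group_norm` and `transpose` are assumptions made for this sketch and do not come from the paper.

```python
def vector_norm(v, p):
    """l_p norm of a sequence of numbers; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def group_norm(M, p, q):
    """(p, q)-group norm: the q-norm of the vector of p-norms of the columns of M.

    M is given as a list of rows (a representation assumed for this sketch).
    """
    columns = list(zip(*M))
    return vector_norm([vector_norm(col, p) for col in columns], q)


def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


M = [[3.0, 0.0],
     [4.0, 1.0]]
print(group_norm(M, 2, 1))             # column 2-norms are 5 and 1, so the result is 6.0
print(group_norm(transpose(M), 2, 1))  # columns of the transpose are the rows of M: 3 + sqrt(17)
```

In general the two printed values differ, which is why the distinction between the group norm of a matrix and that of its transpose matters in the bounds discussed in this note.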

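Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Then, the empirical Rademacher complexity of $\mathcal{F}$ for a sample $S = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ is defined by

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(\mathbf{x}_i)\right], \tag{1}$$

where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_m)$ is a vector of i.i.d. Rademacher variables, that is independent uniform random variables taking values in $\{-1, +1\}$. The Rademacher complexity of $\mathcal{F}$, $\mathfrak{R}_m(\mathcal{F})$, is defined as the expectation of this quantity: $\mathfrak{R}_m(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^m}[\widehat{\mathfrak{R}}_S(\mathcal{F})]$, where $\mathcal{D}$ is a distribution over the input space $\mathcal{X}$. The empirical Rademacher complexity is a key data-dependent complexity measure. For a family of functions $\mathcal{F}$ taking values in $[0, 1]$, the following learning guarantee holds: for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S \sim \mathcal{D}^m$, the following inequality holds for all $f \in \mathcal{F}$ (Mohri et al., 2018):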

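$$\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x})] \le \mathbb{E}_{\mathbf{x} \sim S}[f(\mathbf{x})] + 2\widehat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$

where we denote by $\mathbb{E}_{\mathbf{x} \sim S}[f(\mathbf{x})]$ the empirical average of $f$, that is $\mathbb{E}_{\mathbf{x} \sim S}[f(\mathbf{x})] = \frac{1}{m}\sum_{i=1}^m f(\mathbf{x}_i)$. A similar inequality holds for the average Rademacher complexity $\mathfrak{R}_m(\mathcal{F})$:

$$\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x})] \le \mathbb{E}_{\mathbf{x} \sim S}[f(\mathbf{x})] + 2\mathfrak{R}_m(\mathcal{F}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$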

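An important application of these bounds is the derivation of margin bounds, which are crucial in the analysis of classification. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$, the following inequality holds for all $f \in \mathcal{F}$ (Koltchinskii and Panchenko, 2002; Mohri et al., 2018):

$$\begin{aligned}
\mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\big[1_{y f(\mathbf{x}) \le 0}\big]
&\le \mathbb{E}_{(\mathbf{x}, y) \sim S}\left[\min\left(1, \Big(1 - \frac{y f(\mathbf{x})}{\rho}\Big)_+\right)\right] + \frac{2}{\rho}\widehat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}} \\
&\le \frac{1}{m}\sum_{i=1}^m 1_{y_i f(\mathbf{x}_i) \le \rho} + \frac{2}{\rho}\widehat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.
\end{aligned}$$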

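Finer margin guarantees were recently presented by Cortes et al. (2020) in terms of Rademacher complexity and other complexity measures. Furthermore, the Rademacher complexity of a hypothesis set also appears as a lower bound in generalization. As an example, for a symmetric family of functions $\mathcal{G}$ taking values in $[-1, +1]$, the following holds (van der Vaart and Wellner, 1996):

$$\frac{1}{2}\left[\mathfrak{R}_m(\mathcal{G}) - \frac{1}{\sqrt{m}}\right] \le \mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{f \in \mathcal{G}} \Big|\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim S}[f(\mathbf{x})]\Big|\right] \le 2\mathfrak{R}_m(\mathcal{G}).$$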

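The hypothesis set we will analyze in this paper is that of linear predictors whose weight vector is bounded in $\ell_p$-norm:

$$\mathcal{F}_p = \big\{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} \colon \|\mathbf{w}\|_p \le W\big\}. \tag{2}$$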
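For very small samples, the empirical Rademacher complexity of this hypothesis set can be computed exactly by enumerating all sign vectors. The sketch below relies on the dual norm property used in the appendices, namely that the supremum of $\mathbf{w} \cdot \mathbf{u}$ over $\|\mathbf{w}\|_p \le W$ equals $W \|\mathbf{u}\|_{p^*}$. The function names and the list-of-points representation are assumptions of this sketch, not part of the paper.

```python
from itertools import product


def lp_norm(v, p):
    """l_p norm of a sequence of numbers; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def conjugate(p):
    """Hoelder conjugate p* satisfying 1/p + 1/p* = 1."""
    if p == 1:
        return float("inf")
    if p == float("inf"):
        return 1.0
    return p / (p - 1.0)


def empirical_rademacher(points, p, W=1.0):
    """Exact empirical Rademacher complexity of F_p by enumerating all sign vectors.

    Uses sup_{||w||_p <= W} w . u = W * ||u||_{p*}. The cost grows as 2^m,
    so this is only meant for tiny samples.
    """
    m, d = len(points), len(points[0])
    p_star = conjugate(p)
    total = 0.0
    for sigma in product((-1.0, 1.0), repeat=m):
        # u = sum_i sigma_i x_i, computed coordinate by coordinate.
        u = [sum(s * x[j] for s, x in zip(sigma, points)) for j in range(d)]
        total += lp_norm(u, p_star)
    return W * total / (m * 2 ** m)


sample = [[1.0, 0.0], [0.0, 1.0]]          # m = 2 points in R^2
print(empirical_rademacher(sample, p=2))   # sqrt(2)/2, since ||u||_2 = sqrt(2) for every sigma
```

Such a brute-force value is useful as a reference point when checking upper and lower bounds numerically on toy data.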

## 3 Empirical Rademacher Complexity of Linear Hypothesis Sets

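The main results of this note are the following upper and lower bounds on the empirical Rademacher complexity of linear hypothesis sets.

Let $\mathcal{F}_p = \{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} \colon \|\mathbf{w}\|_p \le W\}$ be a family of linear functions defined over $\mathbb{R}^d$ with bounded weight in $\ell_p$-norm. Then, the empirical Rademacher complexity of $\mathcal{F}_p$ for a sample $S = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ admits the following upper bounds:

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \le
\begin{cases}
\dfrac{W}{m}\sqrt{2\log(2d)}\,\|\mathbf{X}^\top\|_{2,p^*} & \text{if } p = 1, \\[2ex]
\dfrac{\sqrt{2}\,W}{m}\left[\dfrac{\Gamma\big(\frac{p^*+1}{2}\big)}{\sqrt{\pi}}\right]^{\frac{1}{p^*}}\|\mathbf{X}^\top\|_{2,p^*} & \text{if } 1 < p \le 2, \\[2ex]
\dfrac{W}{m}\,\|\mathbf{X}^\top\|_{2,p^*} & \text{if } p \ge 2,
\end{cases}$$

where $\mathbf{X}$ is the $d \times m$-matrix with the $\mathbf{x}_i$s as columns: $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_m]$. Furthermore, the constant factor in the inequality for the case $1 < p \le 2$ can be bounded as follows:

$$e^{-\frac{1}{2}}\sqrt{p^*} \le \sqrt{2}\left[\frac{\Gamma\big(\frac{p^*+1}{2}\big)}{\sqrt{\pi}}\right]^{\frac{1}{p^*}} \le e^{-\frac{1}{2}}\sqrt{p^*+1}.$$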
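The two-sided bound on the constant can be checked numerically with the standard library. The sketch below is only a sanity check for a few values of $p^*$; the function name `c2` mirrors the shorthand used later in the note, and the chosen grid of values is an arbitrary assumption.

```python
import math


def c2(p_star):
    """Constant sqrt(2) * (Gamma((p*+1)/2) / sqrt(pi))^(1/p*) for the case 1 < p <= 2."""
    ratio = math.gamma((p_star + 1.0) / 2.0) / math.sqrt(math.pi)
    return math.sqrt(2.0) * ratio ** (1.0 / p_star)


for p_star in (2.0, 3.0, 5.0, 10.0, 50.0):
    lower = math.exp(-0.5) * math.sqrt(p_star)
    upper = math.exp(-0.5) * math.sqrt(p_star + 1.0)
    # Each line shows: lower bound <= constant <= upper bound.
    print(f"p*={p_star:5.1f}  {lower:.4f} <= {c2(p_star):.4f} <= {upper:.4f}")
```

At $p^* = 2$ the constant equals 1, and for large $p^*$ it is squeezed between two quantities that both grow like $e^{-1/2}\sqrt{p^*}$.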

The proof is given in Appendix A. Both the statement of the theorem and its proof first appeared in (Awasthi et al., 2020) in the context of the analysis of adversarial Rademacher complexity. We present a self-contained analysis in this note to make the results more easily accessible, as we believe these results are of wider interest. The next theorem is new and provides a lower bound for $\widehat{\mathfrak{R}}_S(\mathcal{F}_p)$ which, modulo a constant factor, matches the upper bounds stated above.

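Let $\mathcal{F}_p = \{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} \colon \|\mathbf{w}\|_p \le W\}$ be a family of linear functions defined over $\mathbb{R}^d$ with bounded weight in $\ell_p$-norm. Then, the empirical Rademacher complexity of $\mathcal{F}_p$ for a sample $S = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ admits the following lower bound, where $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_m]$:

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \ge \frac{W}{\sqrt{2}\,m}\,\|\mathbf{X}^\top\|_{2,p^*}. \tag{3}$$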

This lower bound is tight in terms of its dependence on the sample size and the dimension $d$. The proof is given in Appendix B. The following corollary presents somewhat looser upper bounds that may be more convenient in various contexts, such as that of kernel-based hypothesis sets. The corollary can be derived directly by combining Theorem 3 and Proposition 1 (see Section 3.2).

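Let $\mathcal{F}_p = \{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} \colon \|\mathbf{w}\|_p \le W\}$ be a family of linear functions defined over $\mathbb{R}^d$ with bounded weight in $\ell_p$-norm. Then, the empirical Rademacher complexity of $\mathcal{F}_p$ for a sample $S = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ admits the following upper bounds, where $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_m]$:

for $p = 1$, $\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \le \frac{W}{m}\sqrt{2\log(2d)}\,\|\mathbf{X}\|_{p^*,2}$; for 1 …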

### 3.1 Discussion

We now make a few remarks about Theorem 3 and present the proof in Appendix A. The theorem states that for any data set, $\widehat{\mathfrak{R}}_S(\mathcal{F}_p)$ is a constant times $\|\mathbf{X}^\top\|_{2,p^*}$. This is in contrast to the quantity $\|\mathbf{X}\|_{p^*,2}$ that appears in the existing analysis available in the literature for linear hypothesis sets (Kakade et al., 2008). However, as we will soon see in Theorem 2, using $\|\mathbf{X}^\top\|_{2,p^*}$ always leads to a better upper bound.

Another interesting aspect of the upper bound is the dimension dependence of the constant in front of $\|\mathbf{X}^\top\|_{2,p^*}$. This constant is independent of dimension only for $p > 1$. For $p = 1$, the dependence on dimension is tight, which can be seen from the corresponding tightness of the maximal inequality and thus that of Massart's inequality (Boucheron et al., 2013). We also provide a simple example further illustrating this dependence in Appendix E. This observation also explains why the constant for $1 < p \le 2$ approaches infinity as $p \to 1$: if we had that

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \le c(p)\,\|\mathbf{X}^\top\|_{2,p^*}$$

for $p > 1$, then by continuity

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_1) \le \lim_{p \to 1} c(p)\,\|\mathbf{X}^\top\|_{2,\infty}.$$

If $c(p)$ were dimension independent and $\lim_{p \to 1} c(p)$ were finite, then the constant for $p = 1$ would be finite and dimension independent as well. Since we just showed that the constant for $p = 1$ must have dimension dependence, we must have that $\lim_{p \to 1} c(p) = \infty$. This observation suggests that finding a dimension-dependent constant for $1 < p \le 2$ could greatly improve the upper bound of Theorem 3. However, our example where the dimension dependence was tight for $p = 1$ had $d = 2^m$, which is unrealistic for most applications. It is possible that with some reasonable assumption on the relationship between $m$ and $d$, one could find a far better constant for $1 < p \le 2$.

### 3.2 Comparison with Previous Work

We are not aware of any existing bound for the empirical Rademacher complexity of linear hypothesis sets for $p > 2$ before this work. For other values of $p$, the best existing upper bounds were given by Kakade et al. (2008) for $1 < p \le 2$ and by Bartlett and Mendelson (2001) (see also (Mohri et al., 2018)) for $p = 1$:

 (4)

Our new upper bound coincides with (4) when $p = 2$ and is strictly tighter otherwise. Readers familiar with Rademacher complexity bounds for linear hypothesis sets will notice that our bound in this case depends on the norm $\|\mathbf{X}^\top\|_{2,p^*}$. In contrast, the previously known bounds depend on $\|\mathbf{X}\|_{p^*,2}$. In fact, one can show that $\|\mathbf{X}^\top\|_{2,p^*}$ is never larger than $\|\mathbf{X}\|_{p^*,2}$ for $p^* \ge 2$, that is $\|\mathbf{X}^\top\|_{2,p^*} \le \|\mathbf{X}\|_{p^*,2}$, as shown by the last inequality of (5) in the following proposition.

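Let $\mathbf{M}$ be a $d \times m$ matrix. If $q \le p$, then

$$\min(m, d)^{\frac{1}{p} - \frac{1}{q}}\,\|\mathbf{M}\|_{p,q} \le \|\mathbf{M}^\top\|_{q,p} \le \|\mathbf{M}\|_{p,q}. \tag{5}$$

If $p \le q$, then

$$\|\mathbf{M}\|_{p,q} \le \|\mathbf{M}^\top\|_{q,p} \le \min(m, d)^{\frac{1}{p} - \frac{1}{q}}\,\|\mathbf{M}\|_{p,q}. \tag{6}$$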

These bounds are tight. The proof is presented in Appendix C. To visualize the ratio between these two norms, we plot the two norms for various values of $p$ in Figure 1.

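For convenience, in the discussion below, we set $c_1(p) = \sqrt{p^* - 1}$ and $c_2(p) = \sqrt{2}\big[\Gamma\big(\frac{p^*+1}{2}\big)/\sqrt{\pi}\big]^{\frac{1}{p^*}}$. Regarding the growth of the constant in our bound, Theorem 3 implies that as $p \to 1$, $c_2(p)$ grows asymptotically like $e^{-\frac{1}{2}}\sqrt{p^*}$. Furthermore, $c_2(p) \le c_1(p)$ in the relevant region (see Appendix A.3). In Figure 2 we plot $c_1(p)$, $c_2(p)$ and the bounds on $c_2(p)$ to illustrate the growth rate of these constants with $p$.

Proposition 1 and the inequality $c_2(p) \le c_1(p)$ imply the following result.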

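For $1 < p \le 2$, the following inequality holds:

$$\sqrt{2}\left[\frac{\Gamma\big(\frac{p^*+1}{2}\big)}{\sqrt{\pi}}\right]^{\frac{1}{p^*}}\|\mathbf{X}^\top\|_{2,p^*} \le \sqrt{p^* - 1}\,\|\mathbf{X}\|_{p^*,2}.$$

Thus, for $1 < p \le 2$, the bound of Theorem 3 is tighter than (4).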

## 4 Conclusion

We presented tight bounds on the empirical Rademacher complexity of linear hypothesis sets constrained by an $\ell_p$-norm bound on the weight vector. These bounds can be used to derive sharp generalization guarantees for these hypothesis sets in a variety of different contexts, by plugging them into existing Rademacher complexity learning bounds. Our proofs and guarantees suggest an extension beyond $\ell_p$-norm constrained hypothesis sets that we will discuss elsewhere.

## References

• Alzer (1997) Horst Alzer. On some inequalities for the Gamma and Psi functions. Math. Comput., 66(217):373–389, 1997.
• Awasthi et al. (2020) Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. In Proceedings of ICML, 2020.
• Bartlett and Mendelson (2001) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In Proceedings of COLT, 2001.
• Bartlett and Mendelson (2002) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.
• Berger et al. (1996) Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Comp. Linguistics, 22(1), 1996.
• Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
• Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.
• Cortes et al. (2020) Corinna Cortes, Mehryar Mohri, and Ananda Theertha Suresh. Relative deviation margin bounds. CoRR, abs/2006.14950, 2020.
• Haagerup (1981) Uffe Haagerup. The best constants in the Khintchine inequality. Studia Mathematica, 70:231–283, 1981.
• Hoerl and Kennard (1970) Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
• Kakade et al. (2008) Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of NIPS, pages 793–800, 2008.
• Koltchinskii and Panchenko (2002) Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
• Massart (2000) Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245–303, 2000.
• Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, second edition, 2018.
• Olver et al. (2010) Frank W. J. Olver, Daniel W. Lozier, Ronald F. Boisvert, and Charles W. Clark. The NIST Handbook of Mathematical Functions. Cambridge Univ. Press, 2010.
• Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58(1):267–288, 1996.
• van der Vaart and Wellner (1996) Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

## Appendix A Proof of Theorem 3

In this section, we present the proof of Theorem 3.

The proof proceeds in several steps. First, in Appendix A.1 we upper bound the Rademacher complexity of $\mathcal{F}_1$. Next, in Appendix A.2, we establish the upper bound for $p > 1$. Lastly, in Appendix A.3, we prove the inequalities for the constant terms in the case $1 < p \le 2$.

### A.1 Proof of the upper bound, case p=1

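The bound on the Rademacher complexity for $p = 1$ was previously known, but we reproduce the proof for completeness. We closely follow the proof given in (Mohri et al., 2018). For any $i \in [m]$, $x_{ij}$ denotes the $j$th component of $\mathbf{x}_i$.

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\mathcal{F}_1)
&= \frac{1}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{\|\mathbf{w}\|_1 \le W} \mathbf{w} \cdot \sum_{i=1}^m \sigma_i \mathbf{x}_i\right] \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|_\infty\right] && \text{(by definition of the dual norm)} \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\max_{j \in [d]} \Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|\right] && \text{(by definition of } \|\cdot\|_\infty\text{)} \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\max_{j \in [d]} \max_{s \in \{-1, +1\}} s \sum_{i=1}^m \sigma_i x_{ij}\right] && \text{(by definition of } |\cdot|\text{)} \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{\mathbf{z} \in A} \sum_{i=1}^m \sigma_i z_i\right],
\end{aligned}$$

where $A$ denotes the set of vectors $\{s\,(x_{1j}, \ldots, x_{mj})^\top \colon j \in [d], s \in \{-1, +1\}\}$. For any $\mathbf{z} \in A$, we have $\|\mathbf{z}\|_2 \le \|\mathbf{X}^\top\|_{2,\infty}$. Further, $A$ contains at most $2d$ elements. Thus, by Massart's Lemma (Massart, 2000; Mohri et al., 2018),

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_1) \le W\,\|\mathbf{X}^\top\|_{2,\infty}\,\frac{\sqrt{2\log(2d)}}{m},$$

which concludes the proof.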

### A.2 Proof of upper bound, case p>1

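Here again, we use the shorthand $\mathbf{u}_{\boldsymbol{\sigma}} = \sum_{i=1}^m \sigma_i \mathbf{x}_i$. By definition of the dual norm, we can write:

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\mathcal{F}_p)
&= \frac{1}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{\|\mathbf{w}\|_p \le W} \mathbf{w} \cdot \mathbf{u}_{\boldsymbol{\sigma}}\right] \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\big[\|\mathbf{u}_{\boldsymbol{\sigma}}\|_{p^*}\big] && \text{(dual norm property)} \\
&\le \frac{W}{m}\Big[\mathbb{E}_{\boldsymbol{\sigma}}\big[\|\mathbf{u}_{\boldsymbol{\sigma}}\|_{p^*}^{p^*}\big]\Big]^{\frac{1}{p^*}} && \text{(Jensen's inequality, } p^* \in [1, +\infty)\text{)} \\
&= \frac{W}{m}\left[\sum_{j=1}^d \mathbb{E}_{\boldsymbol{\sigma}}\big[|u_{\boldsymbol{\sigma},j}|^{p^*}\big]\right]^{\frac{1}{p^*}}.
\end{aligned}$$

Next, by Khintchine's inequality (Haagerup, 1981), the following holds:

$$\mathbb{E}_{\boldsymbol{\sigma}}\big[|u_{\boldsymbol{\sigma},j}|^{p^*}\big] \le B_{p^*}\left[\sum_{i=1}^m x_{i,j}^2\right]^{\frac{p^*}{2}},$$

where $B_{p^*} = 1$ for $p^* \in [1, 2]$ and

$$B_{p^*} = 2^{\frac{p^*}{2}}\,\frac{\Gamma\big(\frac{p^*+1}{2}\big)}{\sqrt{\pi}},$$

for $p^* \in [2, +\infty)$. This yields the following bound on the Rademacher complexity:

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \le
\begin{cases}
\dfrac{W}{m}\,\|\mathbf{X}^\top\|_{2,p^*} & \text{if } p^* \in [1, 2], \\[2ex]
\dfrac{\sqrt{2}\,W}{m}\left[\dfrac{\Gamma\big(\frac{p^*+1}{2}\big)}{\sqrt{\pi}}\right]^{\frac{1}{p^*}}\|\mathbf{X}^\top\|_{2,p^*} & \text{if } p^* \in [2, +\infty).
\end{cases}$$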

### A.3 Bounding the Constant

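For convenience, set $c_2(p) = \sqrt{2}\big[\Gamma\big(\frac{p^*+1}{2}\big)/\sqrt{\pi}\big]^{\frac{1}{p^*}}$. We establish upper and lower bounds on $c_2(p)$. Let $c_2(p)$ be as above. Then the following inequalities hold:

$$e^{-\frac{1}{2}}\sqrt{p^*} \le c_2(p) \le e^{-\frac{1}{2}}\sqrt{p^*+1}.$$

For convenience, we set $q = p^*$ and $f_2(q) = c_2(p) = \sqrt{2}\big[\Gamma\big(\frac{q+1}{2}\big)/\sqrt{\pi}\big]^{\frac{1}{q}}$. Next, we recall a useful inequality (Olver et al., 2010) bounding the gamma function:

$$1 < (2\pi)^{-\frac{1}{2}}\,x^{\frac{1}{2} - x}\,e^{x}\,\Gamma(x) < e^{\frac{1}{12x}}. \tag{7}$$

We start with the upper bound. If we apply the right-hand side inequality of (7) to $\Gamma\big(\frac{q+1}{2}\big)$ we get the following bound on $f_2(q)$:

$$f_2(q) \le 2^{\frac{1}{2q}}\,e^{-\frac{1}{2}}\sqrt{q+1}\;e^{-\frac{1}{2q} + \frac{1}{6(q+1)q}}. \tag{8}$$

It is easy to verify that

$$2^{\frac{1}{2q}}\,e^{-\frac{1}{2q} + \frac{1}{6q(q+1)}} = e^{\frac{1}{q}\left(\frac{\ln 2 - 1}{2} + \frac{1}{6(q+1)}\right)}. \tag{9}$$

Furthermore, the expression $\frac{\ln 2 - 1}{2} + \frac{1}{6(q+1)}$ decreases with increasing $q$. At $q = 2$, it is negative, which implies that (9) is less than 1 for $q \ge 2$. Hence

$$f_2(q) \le e^{-\frac{1}{2}}\sqrt{q+1}.$$

Next, we prove the lower bound. Applying the lower bound of (7) to $\Gamma\big(\frac{q+1}{2}\big)$ results in

$$f_2(q) \ge e^{-\frac{1}{2}}\sqrt{q}\,\left(e^{\frac{1}{2q}(\log 2 - 1)}\sqrt{1 + \frac{1}{q}}\right).$$

We will establish that $e^{\frac{1}{2q}(\log 2 - 1)}\sqrt{1 + \frac{1}{q}} \ge 1$, which will complete the proof of the lower bound. We prove this statement by showing that

$$\left(e^{\frac{1}{2q}(\log 2 - 1)}\sqrt{1 + \frac{1}{q}}\right)^2 = e^{\frac{1}{q}(\log 2 - 1)}\left(1 + \frac{1}{q}\right) \ge 1.$$

By applying some elementary inequalities,

$$\begin{aligned}
e^{\frac{1}{q}(\log 2 - 1)}\left(1 + \frac{1}{q}\right)
&\ge \left(\frac{1}{q}(\log 2 - 1) + 1\right)\left(1 + \frac{1}{q}\right) && \text{(using } e^x \ge 1 + x\text{)} \\
&= 1 + \frac{1}{q}\left(\log 2 - \frac{1 - \log 2}{q}\right) \\
&\ge 1.
\end{aligned}$$

The last inequality follows since $\log 2 - \frac{1 - \log 2}{q}$ increases with $q$ and is positive at $q = 2$.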

## Appendix B Proof of Theorem 3

In this section, we prove the lower bound of Theorem 3.

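For any vector $\mathbf{u}$, let $|\mathbf{u}|$ denote the vector derived from $\mathbf{u}$ by taking the absolute value of each of its components. Starting as in the proof of Theorem 3, using the dual norm property, we can write:

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\mathcal{F}_p)
&= \frac{1}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{\|\mathbf{w}\|_p \le W} \mathbf{w} \cdot \sum_{i=1}^m \sigma_i \mathbf{x}_i\right] \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|_{p^*}\right] && \text{(dual norm property)} \\
&\ge \frac{W}{m}\,\left\|\mathbb{E}_{\boldsymbol{\sigma}}\left[\Big|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big|\right]\right\|_{p^*} && \text{(norm sub-additivity)} \\
&= \frac{W}{m}\left[\sum_{j=1}^d \left(\mathbb{E}_{\boldsymbol{\sigma}}\left[\Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|\right]\right)^{p^*}\right]^{\frac{1}{p^*}} \\
&\ge \frac{W}{m}\left[\sum_{j=1}^d \left(\frac{1}{\sqrt{2}}\Big|\sum_{i=1}^m x_{ij}^2\Big|^{\frac{1}{2}}\right)^{p^*}\right]^{\frac{1}{p^*}} && \text{(Khintchine's inequality (Haagerup, 1981))} \\
&= \frac{W}{\sqrt{2}\,m}\left[\sum_{j=1}^d \left[\Big|\sum_{i=1}^m x_{ij}^2\Big|\right]^{\frac{p^*}{2}}\right]^{\frac{1}{p^*}} \\
&= \frac{W}{\sqrt{2}\,m}\,\|\mathbf{X}^\top\|_{2,p^*}.
\end{aligned}$$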

## Appendix C Proof of Proposition 1

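In this section, we prove Proposition 1. This result implies that for $p \in (1, 2]$, the group norm $\|\mathbf{X}^\top\|_{2,p^*}$ is always a lower bound on the term $\|\mathbf{X}\|_{p^*,2}$ that appears in existing upper bounds. We first present a simple lemma helpful for the proof.

Let $p, r \ge 1$ and let $d$ be the dimension. Then

$$\sup_{\|\mathbf{w}\|_p \le 1} \|\mathbf{w}\|_{r^*} = \max\big(1, d^{\,1 - \frac{1}{r} - \frac{1}{p}}\big).$$

We prove that, if $p \ge r^*$, then the following equality holds:

$$\sup_{\|\mathbf{w}\|_p \le 1} \|\mathbf{w}\|_{r^*} = d^{\,1 - \frac{1}{r} - \frac{1}{p}},$$

and otherwise that the following holds:

$$\sup_{\|\mathbf{w}\|_p \le 1} \|\mathbf{w}\|_{r^*} = 1.$$

If $p \ge r^*$, by Hölder's generalized inequality with $\frac{1}{r^*} = \frac{1}{p} + \frac{1}{s}$,

$$\sup_{\|\mathbf{w}\|_p \le 1} \|\mathbf{w}\|_{r^*} \le \sup_{\|\mathbf{w}\|_p \le 1} \|\mathbf{1}\|_s\,\|\mathbf{w}\|_p = \|\mathbf{1}\|_s = d^{\frac{1}{s}} = d^{\frac{1}{r^*} - \frac{1}{p}} = d^{\,1 - \frac{1}{r} - \frac{1}{p}}.$$

Note that equality holds at the vector $d^{-\frac{1}{p}}\,\mathbf{1}$, and this implies that the inequality in the line above is an equality. Now for $p \le r^*$, $\|\mathbf{w}\|_{r^*} \le \|\mathbf{w}\|_p$, implying that $\sup_{\|\mathbf{w}\|_p \le 1} \|\mathbf{w}\|_{r^*} \le 1$. Here, equality is achieved at a unit vector $\mathbf{e}_1$.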

We now present the proof of Proposition 1.

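First, (6) follows from (5) by substituting $\mathbf{A} = \mathbf{M}^\top$ for a matrix $\mathbf{A}$: for $q \le p$,

$$\min(m, d)^{\frac{1}{p} - \frac{1}{q}}\,\|\mathbf{A}\|_{p,q} \le \|\mathbf{A}^\top\|_{q,p} \le \|\mathbf{A}\|_{p,q},$$

which implies that

$$\|\mathbf{A}^\top\|_{q,p} \le \|\mathbf{A}\|_{p,q} \le \min(m, d)^{\frac{1}{q} - \frac{1}{p}}\,\|\mathbf{A}^\top\|_{q,p}.$$

However, now $p$ and $q$ are swapped in comparison to (6). Now after swapping them again, for $p \le q$,

$$\|\mathbf{A}^\top\|_{p,q} \le \|\mathbf{A}\|_{q,p} \le \min(m, d)^{\frac{1}{p} - \frac{1}{q}}\,\|\mathbf{A}^\top\|_{p,q}.$$

The rest of this proof will be devoted to showing (5).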

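Next, if $p = q$, then $\|\mathbf{M}\|_{p,q} = \|\mathbf{M}^\top\|_{q,p}$. For the rest of the proof, we will assume that $q < p$. Specifically, $q < \infty$, which allows us to consider fractions like $p/q$.

We will show that for $q \le p$, the following inequality holds: $\|\mathbf{M}\|_{q,p} \le \|\mathbf{M}^\top\|_{p,q}$, or equivalently, $\|\mathbf{M}\|_{q,p}^q \le \|\mathbf{M}^\top\|_{p,q}^q$.

We will use the shorthand $r = p/q$. By definition of the group norm and using the notation $U_{ij} = |M_{ij}|^q$, we can write

$$\|\mathbf{M}\|_{q,p}^q
= \left\|\begin{bmatrix} \sum_{j=1}^d U_{1j} \\ \vdots \\ \sum_{j=1}^d U_{mj} \end{bmatrix}\right\|_r
\le \sum_{j=1}^d \left\|\begin{bmatrix} U_{1j} \\ \vdots \\ U_{mj} \end{bmatrix}\right\|_r
= \sum_{j=1}^d \left[\sum_{i=1}^m |M_{ij}|^p\right]^{\frac{q}{p}}
= \|\mathbf{M}^\top\|_{p,q}^q.$$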

To show that this inequality is tight, note that equality holds for an all-ones matrix. Next, we prove the inequality

for . Applying Lemma C twice gives

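$$\|\mathbf{M}^\top\|_{p,q} \le \|\mathbf{M}^\top\|_{q,q} = \|\mathbf{M}\|_{q,q} \le d^{\frac{1}{q} - \frac{1}{p}}\,\|\mathbf{M}\|_{p,q}. \tag{10}$$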

Again applying Lemma C twice gives

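$$\|\mathbf{M}^\top\|_{p,q} \le m^{\frac{1}{q} - \frac{1}{p}}\,\|\mathbf{M}^\top\|_{p,p} = m^{\frac{1}{q} - \frac{1}{p}}\,\|\mathbf{M}\|_{p,p} \le m^{\frac{1}{q} - \frac{1}{p}}\,\|\mathbf{M}\|_{p,q}. \tag{11}$$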

Next, we show that (10) is tight if and that (11) is tight if . If , the bound is tight for the block matrix , and, if , then the bound is tight for the block matrix

## Appendix D Proof of Theorem 2

Both Theorem 3 and equation (4) present upper bounds on $\widehat{\mathfrak{R}}_S(\mathcal{F}_p)$ for $1 < p \le 2$. Both of these bounds are of the form of a constant times a matrix norm of $\mathbf{X}$. In Appendix C, we compared the two matrix norms and proved the inequality $\|\mathbf{X}^\top\|_{2,p^*} \le \|\mathbf{X}\|_{p^*,2}$ in the relevant region (Lemma C). Here, we compare the two constants and show that the constant associated with Theorem 3 is smaller than the one appearing in (4) (Lemma D). These lemmas combined directly prove Theorem 2.

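In this section, we study the constants in the two known bounds on the Rademacher complexity of linear classes for $1 < p \le 2$. Specifically,

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \le \frac{W}{m}\sqrt{p^* - 1}\,\|\mathbf{X}\|_{p^*,2}, \tag{12}$$

$$\widehat{\mathfrak{R}}_S(\mathcal{F}_p) \le \frac{\sqrt{2}\,W}{m}\left[\frac{\Gamma\big(\frac{p^*+1}{2}\big)}{\sqrt{\pi}}\right]^{\frac{1}{p^*}}\|\mathbf{X}^\top\|_{2,p^*}. \tag{13}$$

We will compare the constants in equations (12) and (13), namely $\frac{W}{m}\sqrt{p^* - 1}$ and $\frac{\sqrt{2}\,W}{m}\big[\Gamma\big(\frac{p^*+1}{2}\big)/\sqrt{\pi}\big]^{\frac{1}{p^*}}$. Since $\frac{W}{m}$ divides both of these constants, we drop this factor and work with the expressions $c_1(p) = \sqrt{p^* - 1}$ and $c_2(p) = \sqrt{2}\big[\Gamma\big(\frac{p^*+1}{2}\big)/\sqrt{\pi}\big]^{\frac{1}{p^*}}$.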

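Here we establish our main claim that $c_2(p) \le c_1(p)$. Let $c_1(p) = \sqrt{p^* - 1}$ and $c_2(p) = \sqrt{2}\big[\Gamma\big(\frac{p^*+1}{2}\big)/\sqrt{\pi}\big]^{\frac{1}{p^*}}$. Then

$$c_2(p) \le c_1(p),$$

for all $1 < p \le 2$. First note that $c_1(2) = c_2(2) = 1$. For convenience, set $q = p^*$, $f_1(q) = c_1(p) = \sqrt{q - 1}$, and $f_2(q) = c_2(p)$. We claim $f_2'(q) \le f_1'(q)$ for $q \ge 2$, and this implies that $f_2(q) \le f_1(q)$ for $q \ge 2$.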

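The rest of this proof is devoted to showing that $f_2'(q) \le f_1'(q)$. Upon differentiating we get that $f_1'(q) = \frac{1}{2\sqrt{q - 1}}$. Next, we will differentiate $f_2$. To start, we state a useful inequality (see Alzer (1997)) bounding the digamma function, $\psi$.

$$\psi(x) \le \log(x) - \frac{1}{2x} \tag{14}$$

Recall that the digamma function is the logarithmic derivative of the gamma function, $\psi(x) = \frac{d}{dx}\ln\Gamma(x)$. Now we differentiate $\ln f_2(q)$:

$$\begin{aligned}
\frac{d}{dq}\big(\ln f_2(q)\big)
&= \frac{\frac{q}{2}\,\psi\big(\frac{q+1}{2}\big) - \Big(\ln\Gamma\big(\frac{q+1}{2}\big) - \ln\sqrt{\pi}\Big)}{q^2} \\
&\le \frac{\frac{q}{2}\Big(\log\frac{q+1}{2} - \frac{1}{q+1}\Big) - \Big(\ln\Gamma\big(\frac{q+1}{2}\big) - \ln\sqrt{\pi}\Big)}{q^2} && \text{(by (14))} \\
&\le \frac{\frac{q}{2}\Big(\log\frac{q+1}{2} - \frac{1}{q+1}\Big) - \Big(\frac{1}{2}\ln 2 + \frac{q}{2}\log\frac{q+1}{2} - \frac{q+1}{2}\Big)}{q^2} && \text{(by the left-hand inequality in (7))} \\
&= \frac{1}{2q} + \frac{1}{q^2}\left(\frac{1}{2(q+1)} - \frac{1}{2}\log 2\right) \\
&\le \frac{1}{2q}.
\end{aligned}$$

The last line follows since we only consider $q \ge 2$ and $\frac{1}{2(q+1)} - \frac{1}{2}\log 2 < 0$ in this range. Finally, the fact that $f_2'(q) = f_2(q)\,\frac{d}{dq}\big(\ln f_2(q)\big)$ implies

$$\begin{aligned}
f_2'(q) &= f_2(q)\,\frac{d}{dq}\big(\ln f_2(q)\big) \\
&\le \frac{1}{2q}\,f_2(q) && \text{(by } \tfrac{d}{dq}(\ln f_2(q)) \le \tfrac{1}{2q}\text{)} \\
&\le \frac{e^{-\frac{1}{2}}\sqrt{q+1}}{2q} && \text{(by the upper bound established in Appendix A.3)} \\
&= \frac{1}{2\sqrt{q-1}}\;e^{-\frac{1}{2}}\,\frac{\sqrt{(q+1)(q-1)}}{q} \\
&\le e^{-\frac{1}{2}}\,\frac{1}{2\sqrt{q-1}} && \text{(using } q^2 - 1 \le q^2\text{)} \\
&\le \frac{1}{2\sqrt{q-1}} = f_1'(q) && \text{(using } e^{-\frac{1}{2}} < 1\text{)}.
\end{aligned}$$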
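The ordering of the bounds discussed in this appendix can be checked numerically on a toy sample. The sketch below computes the lower bound (3), the exact empirical Rademacher complexity (by brute-force enumeration of sign vectors, using the dual norm property), the bound (13) and the bound (12) for one value of $p$ in $(1, 2]$. The function names, the Gaussian toy data, the seed and the choice $p = 1.5$ are all assumptions made for this illustration.

```python
import math
import random
from itertools import product


def lp_norm(v, p):
    """l_p norm of a sequence of numbers; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def group_norm_of_columns(columns, p, q):
    """q-norm of the p-norms of the given columns."""
    return lp_norm([lp_norm(col, p) for col in columns], q)


def bounds(points, p, W=1.0):
    """Return (lower bound (3), exact value, bound (13), bound (12)) for 1 < p <= 2."""
    m, d = len(points), len(points[0])
    q = p / (p - 1.0)                                      # q = p*
    coords = [[x[j] for x in points] for j in range(d)]    # columns of X^T
    norm_new = group_norm_of_columns(coords, 2, q)         # ||X^T||_{2,p*}
    norm_old = group_norm_of_columns(points, q, 2)         # ||X||_{p*,2}
    c1 = math.sqrt(q - 1.0)
    c2 = math.sqrt(2.0) * (math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)) ** (1.0 / q)
    exact = 0.0
    for sigma in product((-1.0, 1.0), repeat=m):
        u = [sum(s * v for s, v in zip(sigma, col)) for col in coords]
        exact += lp_norm(u, q)                             # dual norm of sum_i sigma_i x_i
    exact *= W / (m * 2 ** m)
    lower = W / (math.sqrt(2.0) * m) * norm_new
    return lower, exact, W / m * c2 * norm_new, W / m * c1 * norm_old


random.seed(0)
sample = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]   # m = 8, d = 3
low, exact, new_upper, old_upper = bounds(sample, p=1.5)
print(low, exact, new_upper, old_upper)
```

According to the results of this note, the four printed numbers should appear in non-decreasing order for any sample and any $p$ in $(1, 2]$; the enumeration is exponential in $m$, so the sample must stay small.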

## Appendix E The Tightness of the √log(d) factor for p=1

Here, we provide an example showing that the dimension dependence of $\sqrt{\log(d)}$ in our upper bound on the Rademacher complexity of linear functions bounded in $\ell_1$ norm is tight.

Consider a data set with $d = 2^m$. Then the data matrix $\mathbf{X}$ has $2^m$ rows. We pick the data so that the rows of $\mathbf{X}$ are the set $\{-1, +1\}^m$. This means that $\|\mathbf{X}^\top\|_{2,\infty} = \sqrt{m}$ and we can compute the Rademacher complexity as

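$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\mathcal{F}_1)
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\big[\|\mathbf{X}\boldsymbol{\sigma}\|_\infty\big] && \text{(definition of dual norm)} \\
&= \frac{W}{m}\,\mathbb{E}_{\boldsymbol{\sigma}}\big[\|\boldsymbol{\sigma}\|_2^2\big] && \text{(tightness of Cauchy-Schwarz)} \\
&= \frac{W}{m} \cdot m = W.
\end{aligned}$$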
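The example can be verified by brute force for small $m$. The sketch below assumes, as the text above indicates, a data matrix with $d = 2^m$ rows equal to all sign vectors in $\{-1, +1\}^m$, and compares the exact complexity for $p = 1$ with the upper bound $\frac{W}{m}\sqrt{2\log(2d)}\,\|\mathbf{X}^\top\|_{2,\infty}$. The function name and the chosen values of $m$ are assumptions of this sketch.

```python
import math
from itertools import product


def tightness_example(m, W=1.0):
    """Exact complexity and upper bound for X whose d = 2^m rows are all sign vectors."""
    rows = list(product((-1.0, 1.0), repeat=m))     # the d = 2^m rows of X
    d = len(rows)
    # Exact value for p = 1: (W/m) * E_sigma[ max_j |<row_j, sigma>| ].
    total = 0.0
    for sigma in product((-1.0, 1.0), repeat=m):
        total += max(abs(sum(r * s for r, s in zip(row, sigma))) for row in rows)
    exact = W * total / (m * 2 ** m)
    # Upper bound for p = 1, using ||X^T||_{2,inf} = sqrt(m) for this data set.
    upper = W / m * math.sqrt(2.0 * math.log(2.0 * d)) * math.sqrt(m)
    return exact, upper


for m in (2, 4, 6):
    exact, upper = tightness_example(m)
    print(m, exact, round(upper / exact, 3))
```

For every sign vector there is a row equal to it, so the inner maximum is always $m$ and the exact value equals $W$. The ratio of the upper bound to the exact value is $\sqrt{2(m+1)\log(2)/m}$, which stays bounded as $m$ grows, illustrating that the $\sqrt{\log(d)}$ factor cannot be removed in general.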