Margin-based learning bounds provide a fundamental tool for the analysis of generalization in classification (Vapnik1998; Vapnik2006; SchapireFreundBartlettLee1997; KoltchinskiiPanchenko2002; TaskarGuestrinKoller2003; BartlettShaweTaylor1998). These are guarantees that hold for real-valued functions based on the notion of confidence margin. Unlike worst-case bounds based on standard complexity measures such as the VC-dimension, margin bounds provide optimistic guarantees: a strong guarantee holds for predictors that achieve a relatively small empirical margin loss, for a relatively large value of the confidence margin. More generally, guarantees similar to margin bounds can be derived based on the notion of luckiness (ShaweTaylorBartlettWilliamsonAnthony1998; KoltchinskiiPanchenko2002).
Notably, margin bounds do not have an explicit dependency on the dimension of the feature space for linear or kernel-based hypotheses. They provide strong guarantees for large-margin maximization algorithms such as Support Vector Machines (SVM) (CortesVapnik1995), including when used with positive definite kernels such as Gaussian kernels, for which the dimension of the feature space is infinite. Similarly, margin-based learning bounds have helped derive significant guarantees for AdaBoost (FreundSchapire1997; SchapireFreundBartlettLee1997). More recently, margin-based learning bounds have been derived for neural networks (NNs) (NeyshaburTomiokaSrebro2015; BartlettFosterTelgarsky2017) and convolutional neural networks (CNNs) (LongSedghi2020).
An alternative family of tighter learning guarantees is that of relative deviation bounds (Vapnik1998; Vapnik2006; AnthonyShaweTaylor1993; CortesGreenbergMohri2019). These are bounds on the difference of the generalization and empirical error scaled by the square root of the generalization error or empirical error, or some other power of the error. The scaling is similar to dividing by the standard deviation since, for smaller values of the error, the variance of the error of a predictor roughly coincides with its error. These guarantees translate into very useful bounds on the difference of the generalization error and empirical error whose complexity terms admit the empirical error as a factor.
This paper presents general relative deviation margin bounds. These bounds combine the benefit of standard margin bounds and that of standard relative deviation bounds, thereby resulting in tighter margin bounds (Section 5). As an example, our learning bounds provide tighter guarantees for margin-based algorithms such as SVM and boosting than existing ones. We give two families of relative deviation bounds, both data-dependent and valid for general hypothesis families. Additionally, both families of guarantees hold for an arbitrary $\alpha$-moment, with $\alpha \in (1, 2]$. In Section 5, we also briefly highlight several applications of our bounds and discuss their connection with existing results.
Our first family of margin bounds is expressed in terms of the empirical $\ell_\infty$-covering number of the hypothesis set (Section 3). We show how these empirical covering numbers can be upper bounded to derive empirical fat-shattering guarantees. One benefit of the resulting guarantees is that there are known upper bounds for various standard hypothesis sets, which can be leveraged to derive explicit bounds (see Section 5).
Our second family of margin bounds is expressed in terms of the Rademacher complexity of the hypothesis set used (Section 4). Here, our learning bounds are first expressed in terms of a peeling-based Rademacher complexity term we introduce. Next, we give a series of upper bounds on this complexity measure: first simpler ones in terms of Rademacher complexity, next in terms of empirical covering numbers, and finally in terms of the so-called maximum Rademacher complexity. In particular, we show that a simplified version of our bounds yields a guarantee similar to the maximum Rademacher margin bound of SrebroSridharanTewari2010, but with more favorable constants and for a general $\alpha$-moment.
Novelty and proof techniques. A version of our main result for empirical $\ell_\infty$-covering number bounds for the special case $\alpha = 2$ was postulated by Bartlett1998 without a proof. The author suggested that the proof could be given by combining various techniques with the results of AnthonyShaweTaylor1993 and Vapnik1998; Vapnik2006. However, as pointed out by CortesGreenbergMohri2019, the proofs given by AnthonyShaweTaylor1993 and Vapnik1998; Vapnik2006 are incomplete and rely on a key lemma that is not proven. Our proof and presentation follow CortesGreenbergMohri2019 but also partly benefit from the analysis of Bartlett1998, in particular the bound on the covering number (Corollary 3). To the best of our knowledge, our Rademacher complexity learning bounds of Section 4 are new. The proof consists of using a peeling technique combined with an application of a bounded difference inequality finer than McDiarmid’s inequality. For both families of bounds, the proof relies on a margin-based symmetrization result (Lemma 1) proven in the next section.
In this section, we prove two key symmetrization-type lemmas for a relative deviation between the expected binary loss and empirical margin loss.
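We consider an input space $\mathcal{X}$ and a binary output space $\mathcal{Y} = \{-1, +1\}$ and a hypothesis set $\mathcal{H}$ of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. We denote by $\mathcal{D}$ a distribution over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and denote by $R(h)$ the generalization error and by $\widehat{R}_S(h)$ the empirical error of a hypothesis $h \in \mathcal{H}$:
$$R(h) = \mathbb{E}_{z = (x, y) \sim \mathcal{D}}\big[1_{y h(x) \leq 0}\big], \qquad \widehat{R}_S(h) = \mathbb{E}_{z = (x, y) \sim S}\big[1_{y h(x) \leq 0}\big],$$
where we write $z \sim S$ to indicate that $z$ is randomly drawn from the empirical distribution defined by $S$. Given $\rho \geq 0$, we similarly define the $\rho$-margin loss and empirical $\rho$-margin loss of $h$:
$$R_\rho(h) = \mathbb{E}_{z = (x, y) \sim \mathcal{D}}\big[1_{y h(x) < \rho}\big], \qquad \widehat{R}^\rho_S(h) = \mathbb{E}_{z = (x, y) \sim S}\big[1_{y h(x) < \rho}\big].$$
We will sometimes use the shorthand $x_1^m$ to denote a sample of $m$ points $(x_1, \ldots, x_m) \in \mathcal{X}^m$.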
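To make the empirical quantities concrete, here is a minimal Python sketch computing an empirical error and an empirical margin loss on a toy sample. It assumes the usual conventions: a point counts as an error when y h(x) <= 0, and as a margin error when y h(x) < rho. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def empirical_error(scores, labels):
    """Fraction of sample points with y * h(x) <= 0 (assumed zero-one convention)."""
    return sum(1 for s, y in zip(scores, labels) if y * s <= 0) / len(scores)


def empirical_margin_loss(scores, labels, rho):
    """Fraction of sample points with y * h(x) < rho (assumed margin convention)."""
    return sum(1 for s, y in zip(scores, labels) if y * s < rho) / len(scores)


# Toy sample: real-valued scores h(x_i) and labels y_i in {-1, +1}.
scores = [2.0, 0.3, -0.5, 1.2, -1.5]
labels = [1, 1, 1, -1, -1]

err = empirical_error(scores, labels)                        # margins <= 0
margin_err = empirical_margin_loss(scores, labels, rho=1.0)  # margins < 1
```

On this sample the margin loss at rho = 1 also counts the correctly classified point with score 0.3, which illustrates why the empirical margin loss can only grow as rho increases.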
The following is our first symmetrization lemma in terms of empirical margin loss. Fix and and assume that . Then, for any , the following inequality holds:
The proof is presented in Appendix A. It consists of extending the proof technique of CortesGreenbergMohri2019 for standard empirical error to the empirical margin case and of using the binomial inequality (GreenbergMohri2013, Lemma A). The lemma helps us bound the relative deviation in terms of the empirical margin loss on a sample and the empirical error on an independent sample , both of size .
We now introduce some notation needed for the presentation and discussion of our relative deviation margin bound. Let be a function such that the following inequality holds for all :
As an example, we can choose as in the previous sections. For a sample , let . Then,
Let the family be defined as follows: and let denote the expectation of and its empirical expectation for a sample . There are several choices for function , as illustrated by Figure 1. For example, can be chosen to be or (Bartlett1998). can also be chosen to be the so-called ramp loss:
or the smoothed margin loss chosen by (SrebroSridharanTewari2010):
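As an illustration of these loss choices, the following Python sketch implements a ramp loss and a cosine-smoothed margin loss. Both functional forms are assumptions made for this sketch: the ramp is taken to decrease linearly from 1 to 0 on [0, rho], and the smoothed loss is taken to be (1 + cos(pi u / rho)) / 2 on the same interval. The sandwich condition checked at the end, 1[u < 0] <= phi(u) <= 1[u < rho], is likewise an assumed form of the requirement on the loss function.

```python
import math


def ramp_loss(u, rho):
    """Assumed ramp loss: 1 for u <= 0, linear on [0, rho], 0 for u >= rho."""
    if u <= 0:
        return 1.0
    if u >= rho:
        return 0.0
    return 1.0 - u / rho


def smoothed_margin_loss(u, rho):
    """Assumed cosine-smoothed loss: 1 for u <= 0, 0 for u >= rho, smooth in between."""
    if u <= 0:
        return 1.0
    if u >= rho:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * u / rho))


def indicator(condition):
    return 1.0 if condition else 0.0


# Check the assumed sandwich 1[u < 0] <= phi(u) <= 1[u < rho] on a grid of margins.
rho = 1.0
grid = [k / 10 - 1.0 for k in range(31)]  # margins from -1.0 to 2.0
sandwich_ok = all(
    indicator(u < 0) <= phi(u, rho) <= indicator(u < rho)
    for phi in (ramp_loss, smoothed_margin_loss)
    for u in grid
)
```

Both losses agree with the step functions outside [0, rho] and differ only in how they interpolate inside it; the smoothed variant additionally has a continuous derivative.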
Fix . Define the -truncation function by , for all . For any , we denote by the -truncation of , , and define .
For any family of functions , we also denote by the empirical covering number of over the sample and by a minimum empirical cover. Then, the following symmetrization lemma holds. Fix and . Then, the following inequality holds:
Further, for , using the shorthand , the following holds:
The proof, which uses inequality 3, is given in Appendix A. The first result of the lemma gives an upper bound for a general choice of functions , that is, for an arbitrary choice of the loss function. This inequality will be used in Section 4 to derive our Rademacher complexity bounds. The second inequality is for the specific choice of that corresponds to the -step function. We will use this inequality in the next section to derive covering number bounds.
3 Relative deviation margin bounds – Covering numbers
In this section, we present a general relative deviation margin-based learning bound, expressed in terms of the expected empirical covering number of . The learning guarantee is thus data-dependent. It is also very general since it is given for any and an arbitrary hypothesis set.
[General relative deviation margin bound] Fix and . Then, for any hypothesis set of functions mapping from to and any , the following inequality holds:
The proof is given in Appendix B. As mentioned earlier, a version of this result for was postulated by Bartlett1998. The result can be alternatively expressed as follows, taking the limit .
Fix and . Then, for any hypothesis set of functions mapping from to , with probability at least , the following inequality holds for all :
Note that a smaller value of ( closer to ) might be advantageous for some values of , at the price of a worse complexity in terms of the sample size. For , the result can be rewritten as follows. Fix . Then, for any hypothesis set of functions mapping from to , with probability at least , the following inequality holds for all :
Let , , and . Then, for , the inequality of Corollary 3 can be rewritten as
This implies that and hence . Therefore, . Substituting the values of and yields the bound. The guarantee just presented provides a tighter margin-based learning bound than standard margin bounds since the dominating term admits the empirical margin loss as a factor. Standard margin bounds are subject to a trade-off: a large value of reduces the complexity term while leading to a larger empirical margin loss term. Here, the presence of the empirical loss factor favors this trade-off by allowing a smaller choice of . The bound is data-dependent since it is expressed in terms of the expected covering number and it holds for an arbitrary hypothesis set .
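The algebra in this argument can be sanity-checked numerically. The sketch below assumes that the inequality being solved has the implicit form a <= b + 2 sqrt(c a), where a plays the role of the generalization error, b that of the empirical margin loss, and c that of the complexity term; these roles and the constant 2 are assumptions made for illustration. Completing the square gives sqrt(a) <= sqrt(c) + sqrt(b + c), hence a <= b + 2c + 2 sqrt(c (b + c)) <= b + 4c + 2 sqrt(b c), using sqrt(b + c) <= sqrt(b) + sqrt(c).

```python
import math


def largest_a(b, c):
    """Largest a >= 0 with a <= b + 2*sqrt(c*a), i.e. sqrt(a) = sqrt(c) + sqrt(b + c)."""
    return (math.sqrt(c) + math.sqrt(b + c)) ** 2


def relaxed_bound(b, c):
    """Explicit bound b + 4c + 2*sqrt(b*c) obtained after relaxing sqrt(b + c)."""
    return b + 4 * c + 2 * math.sqrt(b * c)


checks = []
for b in [0.0, 0.01, 0.1, 0.5]:
    for c in [1e-4, 1e-3, 1e-2, 1e-1]:
        a = largest_a(b, c)
        saturates = abs(a - (b + 2 * math.sqrt(c * a))) < 1e-12  # a is the extreme case
        dominated = a <= relaxed_bound(b, c) + 1e-12             # explicit bound holds
        checks.append(saturates and dominated)

all_ok = all(checks)
```

When b = 0 the two expressions coincide at 4c, so the relaxation loses nothing in the regime where the empirical margin loss vanishes.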
The learning bounds just presented hold for a fixed value of . They can be extended to hold uniformly for all values of , at the price of an additional -term. We illustrate that extension for Corollary 3. Fix . Then, for any hypothesis set of functions mapping from to and any , with probability , the following inequality holds for all :
For , let and . For all such , by Corollary 3 and the union bound,
By the union bound, the error probability is at most . For any , there exists a such that . For this , . Hence, . By the definition of margin, for all , . Furthermore, as , . Hence, for all ,
Our previous bounds can be expressed in terms of the fat-shattering dimension, as illustrated below. Recall that, given , a set of points is said to be -shattered by a family of real-valued functions if there exist real numbers (witnesses) such that for all binary vectors , there exists such that:
The fat-shattering dimension of the family is the cardinality of the largest set -shattered by (AnthonyBartlett99). Fix . Then, for any hypothesis set of functions mapping from to with , with probability at least , the following holds for all :
where . By (Bartlett1998, Proof of Theorem 2), we have
where . Upper bounding the expectation by the maximum completes the proof. We will use this bound in Section 5 to derive explicit guarantees for several standard hypothesis sets.
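The shattering notion recalled in this section can be checked by brute force for a finite family. The Python sketch below assumes the standard condition: points x_1, ..., x_d are gamma-shattered with witnesses r_1, ..., r_d if, for every sign vector y in {-1, +1}^d, some function f in the family satisfies y_i (f(x_i) - r_i) >= gamma for all i. The function name and the toy family are hypothetical. The check is for given witnesses only, whereas the definition asks for the existence of some witnesses.

```python
from itertools import product


def is_gamma_shattered(values, witnesses, gamma):
    """values[j][i] = f_j(x_i) for a finite family; witnesses[i] = r_i.

    Returns True if every sign pattern is realized with margin gamma
    around the given witnesses.
    """
    d = len(witnesses)
    for signs in product((-1, 1), repeat=d):
        realized = any(
            all(y * (f[i] - witnesses[i]) >= gamma for i, y in enumerate(signs))
            for f in values
        )
        if not realized:
            return False
    return True


# Two points and four functions taking values +1 or -1 at each point.
values = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
shattered_at_1 = is_gamma_shattered(values, witnesses=(0.0, 0.0), gamma=1.0)
shattered_at_1_5 = is_gamma_shattered(values, witnesses=(0.0, 0.0), gamma=1.5)
```

The same two points are shattered at scale 1 but not at scale 1.5, which reflects the fact that the fat-shattering dimension is non-increasing in the scale parameter.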
4 Relative deviation margin bounds – Rademacher complexity
In this section, we present relative deviation margin bounds expressed in terms of the Rademacher complexity of the hypothesis sets. As with the previous section, these bounds are general: they hold for any and arbitrary hypothesis sets.
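For readers who want a computational handle on the complexity measure, the sketch below estimates a standard empirical Rademacher complexity by Monte Carlo for a finite family of functions evaluated on a fixed sample. It assumes the usual definition, the expectation over uniform signs sigma of the supremum over the family of (1/m) sum_i sigma_i g(z_i). It is not the peeling-based quantity introduced later in this section, and the function name and parameters are hypothetical.

```python
import random
from itertools import product


def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[max_j (1/m) * sum_i sigma_i * g_j(z_i)].

    values[j][i] = g_j(z_i) for a finite family evaluated on a sample of size m.
    """
    rng = random.Random(seed)
    m = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(sigma, g)) for g in values) / m
    return total / num_draws


# A family realizing every sign pattern on 3 points has complexity exactly 1.
full_family = list(product((-1, 1), repeat=3))
r_full = empirical_rademacher(full_family)

# A single fixed function has complexity 0; the estimate should be close to 0.
r_single = empirical_rademacher([(1, 1, 1)])
```

The two extremes bracket the behavior of the measure: a family that can match any sign pattern attains the maximum value, while a single function cannot correlate with random signs on average.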
As in the previous section, we will define the family by , where is a function such that
4.1 Rademacher complexity-based margin bounds
We first relate the symmetric relative deviation bound to a quantity similar to the Rademacher average, modulo a rescaling.
Fix . Then, the following inequality holds:
The proof is given in Appendix C. It consists of introducing Rademacher variables and deriving an upper bound in terms of the first points only.
Now, to bound the right-hand side of Lemma 4.1, we use a peeling argument, that is, we partition into subsets , give a learning bound for each , and then take a weighted union bound. For any non-negative integer with , let denote the family of hypotheses defined by
Using the above inequality and a peeling argument, we show the following upper bound expressed in terms of Rademacher complexities. Fix and . Then, the following inequality holds:
The proof is given in Appendix C. Instead of applying Hoeffding’s bound to each term of the left-hand side for a fixed and then using covering and the union bound to bound the supremum, here, we seek to bound the supremum over directly. To do so, we use a bounded difference inequality that leads to a finer result than McDiarmid’s inequality.
Let be defined as the following peeling-based Rademacher complexity of :
Then, the following is a margin-based relative deviation bound expressed in terms of , that is in terms of Rademacher complexities.
Fix . Then, with probability at least , for all hypotheses , the following inequality holds:
Combining the above lemma with Theorem 4.1 yields the following.
Fix and let be defined as above. Then, with probability at least , for all hypotheses ,
The above result can be extended to hold for all simultaneously.
Let be defined as above. Then, with probability at least , for all hypotheses and ,
4.2 Upper bounds on peeling-based Rademacher complexity
We now present several upper bounds on . We provide proofs for all the results in Appendix D. For any hypothesis set , we denote by the number of distinct dichotomies generated by over that sample:
We note that we do not make any assumptions on the range of . If the range of is in , then the following upper bounds hold on the peeling-based Rademacher complexity of :
Combining the above result with Corollary 4.1 improves the relative deviation bounds of (CortesGreenbergMohri2019, Corollary 2) for . In particular, we improve the term in their bounds to , which is an improvement for .
We next upper bound the peeling-based Rademacher complexity in terms of the covering number. For a set of hypotheses ,
One can further simplify the above bound using the smoothed margin loss from SrebroSridharanTewari2010. Let the worst case Rademacher complexity be defined as follows.
Let be the smoothed margin loss from (SrebroSridharanTewari2010, Section 5.1), with its second moment bounded by . Then, the following holds:
For any , with probability at least , the following inequality holds for all and all :
In this section, we briefly highlight some applications of our learning bounds: both our covering number and Rademacher complexity margin bounds can be used to derive finer margin-based guarantees for several commonly used hypothesis sets. Below we briefly illustrate these applications.
Linear hypothesis sets: let be the family of linear hypotheses defined by
Then, the following upper bound holds for the fat-shattering dimension of (BartlettShaweTaylor1998): . Plugging in this upper bound in the bound of Corollary 3 yields the following:
with . In comparison, the best existing margin bound for SVM by (BartlettShaweTaylor1998, Theorem 1.7) is
Ensembles of predictors in base hypothesis set : let be the VC-dimension of and consider the family of ensembles . Then, the following upper bound on the fat-shattering dimension holds (BartlettShaweTaylor1998): , for some universal constant . Plugging in this upper bound in the bound of Corollary 3 yields the following:
with . In comparison, the best existing margin bound for ensembles such as AdaBoost in terms of the VC-dimension of the base hypothesis given by SchapireFreundBartlettLee1997 is:
Feed-forward neural networks of depth : let and for , where is a -Lipschitz activation function. Then, the following upper bound holds for the fat-shattering dimension of (BartlettShaweTaylor1998): . Plugging this upper bound into the bound of Corollary 3 gives the following:
with . In comparison, the best existing margin bound for neural networks by (BartlettShaweTaylor1998, Theorem 1.5, Theorem 1.11) is
where is some universal constant and where . The margin bound in (8) is thus more favorable than (9). The Rademacher complexity bounds of Corollary 4.2 can also be used to provide generalization bounds for neural networks. For a matrix , let denote the matrix norm and denote the spectral norm. Let and . Then, by (BartlettFosterTelgarsky2017), the following upper bound holds:
Plugging in this upper bound in the bound of Corollary 4.2 leads to the following:
where . In comparison, the best existing neural network bound by BartlettFosterTelgarsky2017 is
where is a universal constant and is the empirical Rademacher complexity. The margin bound (10) has the benefit of a more favorable dependency on the empirical margin loss than (11), which can be significant when that empirical term is small. On the other hand, the empirical Rademacher complexity of (11) is more favorable than its counterpart in (10).
In Appendix E, we further discuss other potential applications of our learning guarantees.
We presented a series of general relative deviation margin bounds. These are tighter margin bounds that can serve as useful tools to derive guarantees for a variety of hypothesis sets and in a variety of applications. In particular, these bounds could help derive better margin-based learning bounds for different families of neural networks, which has been the topic of several recent research publications.
The work of Mehryar Mohri was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award.
Appendix A Symmetrization
We use the following lemmas from CortesGreenbergMohri2019 in our proofs. [CortesGreenbergMohri2019] Fix and with . Let be the function defined by . Then, is a strictly increasing function of and a strictly decreasing function of .
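[GreenbergMohri2013] Let $X$ be a random variable distributed according to the binomial distribution $B(m, p)$, with $m$ a positive integer (the number of trials) and $p > \frac{1}{m}$ (the probability of success of each trial). Then, the following inequality holds:
$$\mathbb{P}\big[X \geq \mathbb{E}[X]\big] > \frac{1}{4},$$
and, if instead of requiring $p > \frac{1}{m}$ we require $p < 1 - \frac{1}{m}$, then
$$\mathbb{P}\big[X \leq \mathbb{E}[X]\big] > \frac{1}{4},$$
where in both cases $\mathbb{E}[X] = mp$.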
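The binomial inequality can be checked numerically on a grid. The sketch below assumes the lemma takes the form P[X >= mp] > 1/4 for X distributed as Binomial(m, p) with p > 1/m; this form, the grid, and the function name are assumptions made for illustration. Success probabilities are kept as exact fractions so that the threshold ceil(mp) is not affected by floating-point rounding.

```python
import math
from fractions import Fraction


def prob_at_least_mean(m, p):
    """P[X >= m*p] for X ~ Binomial(m, p), with p given as a Fraction."""
    threshold = math.ceil(m * p)  # exact, since m * p is a Fraction
    q = float(p)
    return sum(
        math.comb(m, k) * q**k * (1 - q) ** (m - k)
        for k in range(threshold, m + 1)
    )


# Smallest tail probability over a grid of (m, p) pairs with p > 1/m.
worst = min(
    prob_at_least_mean(m, Fraction(j, 200))
    for m in range(2, 41)
    for j in range(1, 201)
    if Fraction(j, 200) > Fraction(1, m)
)
```

On this grid the smallest value stays above 1/4, consistent with the assumed statement. The hardest cases are those with p just above 1/m, where the event requires at least two successes.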
The following symmetrization lemma in terms of empirical margin loss is proven using the previous lemmas.
Fix and and assume that . Then, for any , the following inequality holds:
We will use the function defined over by .
Fix . We first show that the following implication holds for any :
The first condition can be equivalently rewritten as , which implies