Margin-Based Generalization Lower Bounds for Boosted Classifiers

27 September 2019 · Allan Grønlund et al., Aarhus University

Boosting is one of the most successful ideas in machine learning. The most well-accepted explanations for the low generalization error of boosting algorithms such as AdaBoost stem from margin theory. The study of margins in the context of boosting algorithms was initiated by Schapire, Freund, Bartlett and Lee (1998) and has inspired numerous boosting algorithms and generalization bounds. To date, the strongest known generalization upper bound is the kth margin bound of Gao and Zhou (2013). Despite the numerous generalization upper bounds that have been proved over the last two decades, nothing is known about the tightness of these bounds. In this paper, we give the first margin-based lower bounds on the generalization error of boosted classifiers. Our lower bounds nearly match the kth margin bound and thus almost settle the generalization performance of boosted classifiers in terms of margins.


1 Introduction

Boosting algorithms produce highly accurate classifiers by combining several less accurate classifiers and are amongst the most popular learning algorithms, obtaining state-of-the-art performance on several benchmark machine learning tasks [KMF17, CG16]. The most famous of these boosting algorithms is arguably AdaBoost [FS97]. For binary classification, AdaBoost takes a training set $S$ of $m$ labeled samples $(x_1, y_1), \dots, (x_m, y_m)$ as input, with $x_i \in \mathcal{X}$ and labels $y_i \in \{-1, 1\}$. It then produces a classifier $f = \sum_t \alpha_t h_t$ in iterations: in the $t$th iteration, a base classifier $h_t$ is trained on a reweighted version of $S$ that emphasizes the data points that $f$ (as constructed so far) struggles with, and this classifier is then added to $f$. The final classifier is obtained by taking the sign of $f$, where the $\alpha_t$'s are non-negative coefficients carefully chosen by AdaBoost. The base classifiers all come from a hypothesis set $\mathcal{H}$; e.g., $\mathcal{H}$ could be a set of small decision trees or similar. As AdaBoost's training progresses, more and more base classifiers are added to $f$, which in turn causes the training error of $f$ to decrease. If $\mathcal{H}$ is rich enough, AdaBoost will eventually classify all the data points in the training set correctly [FS97].
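To make the procedure concrete, here is a minimal AdaBoost sketch in Python. It is an illustration under our own simplifications, not the paper's pseudocode: the base classifiers are assumed to be given as a finite list of callables, and the weak-learner search is a brute-force scan over that list.

```python
import numpy as np

def adaboost(X, y, hypotheses, T):
    """Illustrative AdaBoost sketch (not the paper's pseudocode).
    X: (m, d) array, y: array-like of labels in {-1, +1},
    hypotheses: finite list of callables mapping X to {-1, +1} predictions."""
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                    # sample weights, updated each round
    alphas, chosen = [], []
    for _ in range(T):
        errors = [np.sum(D[h(X) != y]) for h in hypotheses]
        best = int(np.argmin(errors))          # weak learner: best weighted error
        eps = max(errors[best], 1e-12)
        if eps >= 0.5:                         # nothing beats random guessing
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # non-negative since eps < 1/2
        pred = hypotheses[best](X)
        D *= np.exp(-alpha * y * pred)         # emphasize misclassified points
        D /= D.sum()
        alphas.append(alpha)
        chosen.append(hypotheses[best])

    def f(X_new):                              # the (unnormalized) voting classifier
        return sum(a * h(X_new) for a, h in zip(alphas, chosen))
    return f, np.array(alphas), chosen
```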

Early experiments with AdaBoost reported a surprising generalization phenomenon [SFBL98]: even after perfectly classifying the entire training set, further iterations keep improving the test accuracy. This is contrary to what one would expect, as $f$ gets more complicated with more iterations, and thus more prone to overfitting. The most prominent explanation for this phenomenon is margin theory, introduced by Schapire et al. [SFBL98]. The margin of a training point is a number in $[-1, 1]$ which can be interpreted, loosely speaking, as the classifier's confidence on that point. Formally, we say that $f$ is a voting classifier if $f = \sum_t \alpha_t h_t$ with $h_t \in \mathcal{H}$ and $\alpha_t \ge 0$ for all $t$. Note that one can additionally assume without loss of generality that $\sum_t \alpha_t = 1$, since normalizing each $\alpha_t$ by $\sum_j \alpha_j$ leaves the sign of $f$ unchanged. The margin of a point $(x, y)$ with respect to a voting classifier $f$ is then defined as

$$\mathrm{margin}(x, y) := y f(x) = y \sum_t \alpha_t h_t(x).$$

Thus $\mathrm{margin}(x, y) \in [-1, 1]$, and if $\mathrm{margin}(x, y) > 0$, then taking the sign of $f(x)$ correctly classifies $x$. Informally speaking, margin theory guarantees that voting classifiers with large (positive) margins have a smaller generalization error. Experimentally, AdaBoost has been found to continue to improve the margins even when training past the point of perfectly classifying the training set. Margin theory may therefore explain the surprising generalization phenomenon of AdaBoost. Indeed, the original paper by Schapire et al. [SFBL98] that introduced margin theory proved the following margin-based generalization bound. Let $\mathcal{D}$ be an unknown distribution over $\mathcal{X} \times \{-1, 1\}$ and assume that the training data $S$ is obtained by drawing $m$ i.i.d. samples from $\mathcal{D}$. Then with probability at least $1 - \delta$ over $S$, it holds that for every margin $\theta > 0$, every voting classifier $f$ satisfies

$$\Pr_{(x, y) \sim \mathcal{D}}[y f(x) \le 0] \;\le\; \Pr_{(x, y) \sim S}[y f(x) < \theta] + O\!\left(\sqrt{\frac{\ln |\mathcal{H}| \ln m}{m \theta^2} + \frac{\ln(1/\delta)}{m}}\right). \qquad (1)$$

The left-hand side of the equation is the out-of-sample error of $f$ (since $\mathrm{sign}(f(x)) \neq y$ precisely when $y f(x) \le 0$). On the right-hand side, we use $(x, y) \sim S$ to denote a uniform random point from $S$. Hence $\Pr_{(x, y) \sim S}[y f(x) < \theta]$ is the fraction of training points with margin less than $\theta$. The last term is increasing in $|\mathcal{H}|$ and decreasing in $m$ and $\theta$. Here it is assumed that $\mathcal{H}$ is finite. A similar bound can be proved for infinite $\mathcal{H}$ by replacing $\ln|\mathcal{H}|$ by $d \ln(m/d)$, where $d$ is the VC-dimension of $\mathcal{H}$. This holds for all the generalization bounds below as well. The generalization bound thus shows that $f$ has low out-of-sample error if it attains large margins on most training points. This fits well with the observed behaviour of AdaBoost in practice.
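As a small, self-contained illustration of the quantities appearing in (1) (our own sketch, not code from the paper), the following computes the margins $y f(x)$ of a normalized voting classifier and the empirical term $\Pr_{(x,y)\sim S}[y f(x) < \theta]$:

```python
import numpy as np

def margins(alphas, base_preds, y):
    """Margins y*f(x) of the voting classifier f = sum_t alpha_t h_t.
    base_preds: (T, m) array with base_preds[t, i] = h_t(x_i) in {-1, +1}.
    alphas: length-T non-negative weights (normalized here so they sum to 1)."""
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()                  # WLOG sum_t alpha_t = 1
    f_values = alphas @ np.asarray(base_preds, dtype=float)
    return np.asarray(y, dtype=float) * f_values    # each margin lies in [-1, 1]

def empirical_margin_term(margin_values, theta):
    """Fraction of training points with margin strictly less than theta."""
    return float(np.mean(np.asarray(margin_values) < theta))
```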

The generalization bound above holds for every voting classifier $f$, i.e. regardless of how $f$ was obtained. Hence a natural goal is to design boosting algorithms that produce voting classifiers with large margins on many points. This has been the focus of a long line of research and has resulted in numerous algorithms with various margin guarantees, see e.g. [GS98, Bre99, BDST00, RW02, RW05, GLM19]. One of the most well-known of these is Breiman's ArcGV [Bre99]. ArcGV produces a voting classifier maximizing the minimal margin, i.e. it produces a classifier $f$ for which the minimal margin $\theta_{\min} = \min_i y_i f(x_i)$ is as large as possible. Breiman complemented the algorithm with a generalization bound stating that with high probability over the sample $S$, it holds that every voting classifier $f$ satisfies:

$$\Pr_{(x, y) \sim \mathcal{D}}[y f(x) \le 0] \;\le\; O\!\left(\frac{\ln |\mathcal{H}| \ln m}{m \theta_{\min}^2} + \frac{\ln(1/\delta)}{m}\right), \qquad (2)$$

where $\theta_{\min}$ is the minimal margin over all training examples. Notice that if one chooses $\theta$ as the minimal margin $\theta_{\min}$ in the generalization bound (1) of Schapire et al. [SFBL98], then the term $\Pr_{(x, y) \sim S}[y f(x) < \theta]$ becomes $0$ and one obtains the bound

$$\Pr_{(x, y) \sim \mathcal{D}}[y f(x) \le 0] \;\le\; O\!\left(\sqrt{\frac{\ln |\mathcal{H}| \ln m}{m \theta_{\min}^2} + \frac{\ln(1/\delta)}{m}}\right),$$

which is weaker than Breiman's bound and motivated his focus on maximizing the minimal margin. Minimal margin is, however, quite sensitive to outliers, and work by Gao and Zhou [GZ13] proved a generalization bound which provides an interpolation between (1) and (2). Their bound is known as the kth margin bound, and states that with high probability over the sample $S$, it holds for every margin $\theta > 0$ and every voting classifier $f$ that:

$$\Pr_{(x, y) \sim \mathcal{D}}[y f(x) \le 0] \;\le\; \Pr_{(x, y) \sim S}[y f(x) < \theta] + O\!\left(\frac{\ln |\mathcal{H}| \ln m}{m \theta^2} + \sqrt{\Pr_{(x, y) \sim S}[y f(x) < \theta] \cdot \frac{\ln |\mathcal{H}| \ln m}{m \theta^2}}\right).$$

The kth margin bound remains the strongest margin-based generalization bound to date (see Section 1.2 for further details). The kth margin bound recovers Breiman's minimal margin bound by choosing $\theta$ as the minimal margin (making $\Pr_{(x, y) \sim S}[y f(x) < \theta] = 0$), and it is always at most the same as the bound (1) by Schapire et al. As with previous generalization bounds, it suggests that boosting algorithms should focus on obtaining a large margin on as large a fraction of training points as possible.
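Continuing the illustrative sketch from above (again our own code, with the unspecified constant of the asymptotic notation exposed as a parameter), one can evaluate the kth margin bound at $\theta$ equal to the kth smallest margin:

```python
import numpy as np

def kth_margin_bound(margin_values, H_size, k, c=1.0):
    """Evaluate the informal kth margin bound stated above at theta equal to
    the k'th smallest margin. c stands in for the unspecified constant."""
    margin_values = np.asarray(margin_values, dtype=float)
    m = len(margin_values)
    theta = np.sort(margin_values)[k - 1]          # the k'th smallest margin
    if theta <= 0:
        return 1.0                                 # bound carries no information here
    emp = np.mean(margin_values < theta)           # at most (k - 1) / m
    complexity = np.log(H_size) * np.log(m) / (m * theta ** 2)
    return emp + c * (complexity + np.sqrt(emp * complexity))
```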

Despite the decades of progress on generalization upper bounds, we still do not know how tight these bounds are. That is, we do not have any margin-based generalization lower bounds. Generalization lower bounds are not only interesting from a theoretical point of view, but also from an algorithmic point of view: if one has a provably tight generalization bound, then a natural goal is to design a boosting algorithm minimizing a loss function that is equal to this generalization bound. This approach makes most sense with a matching lower bound, as the algorithm might otherwise minimize a sub-optimal loss function. Furthermore, a lower bound may also inspire researchers to look for parameters other than margins when explaining the generalization performance of voting classifiers. Such new parameters may even prove useful in designing new algorithms with even better generalization performance in practice.

1.1 Our Results

In this paper we prove the first margin-based generalization lower bounds for voting classifiers. Our lower bounds almost match the kth margin bound and thus essentially settle the generalization performance of voting classifiers in terms of margins.

To present our main theorems, we first introduce some notation. For a ground set $\mathcal{X}$ and hypothesis set $\mathcal{H}$, we consider the family of all voting classifiers over $\mathcal{H}$, i.e. all functions that can be written as $f = \sum_t \alpha_t h_t$ such that $h_t \in \mathcal{H}$ and $\alpha_t \ge 0$ for all $t$ and $\sum_t \alpha_t = 1$. For a (randomized) learning algorithm and a sample $S$ of $m$ points, we consider the (possibly random) voting classifier produced by the algorithm when given the sample $S$ as input. With this notation, our first main theorem is the following:

Theorem 1.

For every large enough integer , every and every there exist a set and a hypothesis set over , such that and for every and for every (randomized) learning algorithm , there exist a distribution over and a voting classifier such that with probability at least over the choice of samples and the random choices of

  1. ; and

  2. .

Theorem 1 states that for any algorithm , there is a distribution for which the out-of-sample error of the voting classifier produced by is at least that stated in the second point of the theorem. At the same time, one can find a voting classifier obtaining a margin of at least on at least a fraction of the sample points. Our proof of Theorem 1 not only shows that such a classifier exists, but also provides an algorithm that constructs such a classifier. Loosely speaking, the first part of the theorem reflects on the nature of the distribution and the hypothesis set . Intuitively, it means that the distribution is not too hard and the hypothesis set is rich enough, so that it is possible to construct a voting classifier with good empirical margins. Clearly, we cannot hope to prove that the algorithm constructs a voting classifier that has a margin of at least on a fraction of the sample set, since we make no assumptions on the algorithm. For example, if the constant hypothesis that always outputs is in , then could be the algorithm that simply outputs . The interpretation is thus: and allow for an algorithm to produce a voting classifier with margin at least on a fraction of samples. The second part of the theorem thus guarantees that regardless of which voting classifier produces, it still has large out-of-sample error. This implies that every algorithm that constructs a voting classifier by minimizing the empirical risk must have a large error. Formally, Theorem 1 implies that if then

The first part of the theorem ensures that the condition is not void. That is, there exists an algorithm for which . Comparing Theorem 1 to the kth margin bound, we see that the parameter corresponds to . The magnitude of the out-of-sample error in the second point of the theorem thus matches that of the kth margin bound, except for a factor in the first term inside the asymptotic notation and a factor in the second term. If we consider the ranges of parameters and for which the lower bound applies, then these ranges are almost as tight as possible. For , note that the theorem cannot generally be true for , as the algorithm that outputs a uniform random choice of hypothesis among and (the constant hypothesis outputting ) gives a (random) voting classifier with an expected out-of-sample error of . This is less than what the second point of the theorem would state if it were true for . For , observe that our theorem holds for arbitrarily large values of . That is, the integer can be as large as desired, making as large as desired. Finally, for the constraint on , notice again that the theorem simply cannot be true for smaller values of , as then the term exceeds .

Our second main result gets even closer to the kth margin bound:

Theorem 2.

For every large enough integer , every , and every , there exist a set , a hypothesis set over and a distribution over such that and with probability at least over the choice of samples there exists a voting classifier such that

  1. ; and

  2. .

Observe that the second point of Theorem 2 has an additional factor on the first term compared to Theorem 1. It is thus only off from the kth margin bound by a factor in the second term, and hence completely matches the kth margin bound for small values of . To obtain this strengthening, we give up the guarantee from Theorem 1 that all algorithms have such a large out-of-sample error. Instead, Theorem 2 demonstrates only the existence of a voting classifier (that is chosen as a function of the sample ) which simultaneously achieves a margin of at least on a fraction of the sample points, and yet has out-of-sample error at least that in point 2. Since the kth margin bound holds with high probability for all voting classifiers, Theorem 2 rules out any strengthening of the kth margin bound, except possibly for a factor on the second additive term. Again, our lower bound holds for almost the full range of parameters of interest. As for the bound on , our proof assumes ; however, the theorem holds for any constant greater than in the exponent.

Finally, we mention that both our lower bounds are proved for a finite hypothesis set $\mathcal{H}$. This only makes the lower bounds stronger than if we proved them for an infinite $\mathcal{H}$ with bounded VC-dimension, since the VC-dimension of a finite $\mathcal{H}$ is no more than $\log_2 |\mathcal{H}|$.

1.2 Related Work

We mentioned above that the kth margin bound is the strongest margin-based generalization bound to date. Technically speaking, it is incomparable to the so-called Emargin bound by Wang et al. [WSJ11]. The kth margin bound by Gao and Zhou [GZ13], the minimum margin bound by Breiman [Bre99] and the bound by Schapire et al. [SFBL98] all have the form for some function . The Emargin bound has a different (and quite involved) form, making it harder to interpret and compute. We will not discuss it in further detail here and just remark that our results show that for generalization bounds of the form studied in most previous work [SFBL98, Bre99, GZ13], one cannot hope for much stronger upper bounds than the kth margin bound.

2 Proof Overview

The main argument at the heart of both proofs is an application of the probabilistic method. With every labeling we associate a distribution over . We then show that if we sample a labeling at random, then with some positive probability the associated distribution satisfies the requirements of Theorem 1 (respectively, Theorem 2). We thus conclude the existence of a suitable distribution. We next give a more detailed high-level description of the proof of Theorem 1. The proof of Theorem 2 follows similar lines.

Constructing a Family of Distributions.

We start by describing the construction of for . Our construction combines previously studied distribution patterns in a subtle manner.

Ehrenfeucht et al. [EHKV89] observed that if a distribution assigns each point in a fixed (yet unknown) label, then, loosely speaking, every classifier that is constructed using only information supplied by a sample cannot do better than randomly guessing the labels of the points in . Intuitively, consider a uniform distribution over . If we assume, for example, that , then with very high probability over a sample of points, many elements of are not in . Moreover, assume that associates every with a unique “correct” label . Consider some (perhaps random) learning algorithm , and let be the classifier it produces given a sample as input. If is chosen randomly, then, loosely speaking, for every point not in the sample, and are independent, and thus returns the wrong label with probability . In turn, this implies that there exists a labeling such that is wrong on a constant fraction of when receiving a sample . While the argument above can in fact be used to prove an arbitrarily large generalization error, it requires to be large, and specifically to increase with . This conflicts with the first point in Theorem 1; that is, we have to argue that a voting classifier with good margins exists for the sample . If consists of distinct points, and each point in can have an arbitrary label, then intuitively needs to be very large to ensure the existence of . In order to overcome this difficulty, we set to assign very high probability to one designated point in , while the rest of the probability mass is distributed equally between all other points. The argument above still applies to the subset of small-probability points. More precisely, if assigns all but one point in probability , then the expected generalization error (over the choice of ) is still . It remains to determine how large we can set . In the notation of the theorem, in order for a hypothesis set to satisfy and, at the same time, have a voting classifier obtaining margins of on most points in a sample, our proof (and specifically Lemma 3, described hereafter) requires to be not significantly larger than , and therefore the generalization error we get is . This accounts for the first term inside the asymptotic notation in the second point of Theorem 1.

Anthony and Bartlett [AB09, Chapter 5] additionally observed that for a distribution that assigns each point in a random label, if does not sample a point enough times, any classifier that is constructed using only information supplied by cannot determine with good probability the Bayes label of , that is, the label of that minimizes the error probability. Intuitively, consider once more a distribution that is uniform over . However, instead of associating every point with one correct label , is now only slightly biased towards . That is, given that is sampled, the label of the sampled point is with probability a little larger than , say for some small . Note that every classifier has an error probability of at least on every given point in . Consider once again a learning algorithm and the voting classifier it constructs. Loosely speaking, if does not sample a point enough times, then with good probability . More formally, in order to correctly assign the Bayes label of , an algorithm must see samples of . Therefore, if we set the bias to be , then with high probability the algorithm does not see a constant fraction of enough times to correctly assign their labels. In turn, this implies an expected generalization error of , where the expectation is over the choice of . By once again letting , we conclude that there exists a labeling such that for , the expected generalization error of is . This expression is almost the second term inside the asymptotic notation in the theorem statement, though slightly larger. We note, however, that for large values of , the in-sample error is arbitrarily close to . One challenge is therefore to reduce the in-sample error, and moreover to guarantee that we can find a voting classifier where the ’th smallest margin for is at least , where are the parameters provided by the theorem statement.
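To see the sample-size requirement for detecting such a small bias, here is a small numeric sketch (our own illustration with arbitrary numbers, not the paper's construction): a majority vote over few noisy observations of a point frequently misidentifies its Bayes label, while on the order of one over the squared bias observations suffice.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayes_label_error(gamma, n_obs, trials=100_000):
    """Probability that a majority vote over n_obs observed labels, each equal
    to the Bayes label with probability 1/2 + gamma, misses the Bayes label
    (ties counted as errors). Estimated by simulation."""
    hits = rng.random((trials, n_obs)) < 0.5 + gamma   # True = Bayes label observed
    return float(np.mean(hits.sum(axis=1) <= n_obs / 2))

# With bias gamma = 0.05, seeing a point 20 times leaves a large error
# probability, while ~1/gamma^2 = 400 observations drive it down.
print(bayes_label_error(0.05, 20), bayes_label_error(0.05, 400))
```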

To this end, our proof subtly weaves the two ideas described above and constructs a family of distributions . Informally, we partition into two disjoint sets, and conditioned on the sample point belonging to each of the subsets, is defined to follow one of the two distribution patterns described above. The main difficulty lies in delicately balancing all ingredients and ensuring that we can find a voting classifier with margins of at least on all but of the sample points, while still enforcing a large generalization error. Our proof refines the proofs given by Ehrenfeucht et al. and Anthony and Bartlett and shows that not only does there exist a labeling such that has large generalization error with respect to (with probability at least over the randomness of ), but also that a large (constant) fraction of labelings share this property. This distinction becomes crucial in the proof.

Small yet Rich Hypothesis Sets.

The technical crux of our proofs is the construction of an appropriate hypothesis set. Loosely speaking, the size of has to be small and, most importantly, independent of the size of the sample set. On the other hand, the set of voting classifiers is required to be rich enough to, intuitively, contain a classifier that, with good probability, has good in-sample margins for a sample with a large fraction of labelings . Our main technical lemma presents a distribution over small hypothesis sets such that for every sparse labeling , that is, one with only a small number of entries , with high probability over , there exists some voting classifier that has minimum margin with over the entire set . In fact, the size of the hypothesis set does not depend on the size of , but only on the sparsity parameter . More formally, we show the following.

Lemma 3.

For every , and integers , there exists a distribution over hypothesis sets , where is a set of size , such that the following holds.

  1. For all , we have ; and

  2. For every labeling , if no more than points satisfy , then

where

In fact, we prove that if is a random hypothesis set that also contains the hypothesis mapping all points to , then with good probability satisfies the second requirement in the lemma.

To show the existence of a good voting classifier, our proof actually employs a slight variant of the celebrated AdaBoost algorithm, and shows that with high probability (over the choice of the random hypothesis set ), the voting classifier constructed by this algorithm attains minimum margin at least over the entire set .

Note that Lemma 3 speaks of a distribution over hypothesis sets. When using Lemma 3 in our proofs, we will invoke Yao’s principle to conclude the existence of a suitable fixed hypothesis set .

Existential Lower Bound.

Our proof of Theorem 2 uses many of the same ideas as the proof of Theorem 1. The difference between the generalization lower bounds (second points) in Theorems 1 and 2 is a factor in the first term inside the asymptotic notation. That is, Theorem 2 has an where Theorem 1 has an . This term originated from having points with a probability mass of in and one point having the remaining probability mass. In the proof of Theorem 2, we first exploit that we are proving an existential lower bound by assigning all points the same label . That is, our hard distribution assigns all points the label (ignoring the second half of the distribution with the random and slightly biased labels). Since we are not proving a lower bound for every algorithm, this will not cause problems. We then change to about and assign each point the same probability mass in the distribution . The key observation is that on a random sample of points, by a coupon-collector argument, there will still be points from that were not sampled. From Lemma 3, we can now find a voting classifier such that is on all points in , and on a set of points in . This means that has out-of-sample error under the distribution and obtains a margin of on all points in the sample .
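The coupon-collector observation is easy to check numerically; the following sketch uses an arbitrary illustrative ratio between the set size and the sample size (not the paper's exact setting) and counts how many equally likely points a sample of m draws misses.

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_unsampled(N, m, trials=2_000):
    """Average number of the N equally likely points that never appear in a
    sample of m i.i.d. uniform draws; roughly N * exp(-m / N)."""
    return float(np.mean([N - len(np.unique(rng.integers(0, N, size=m)))
                          for _ in range(trials)]))

m = 10_000
N = m // 4                      # arbitrary illustrative choice of set size
print(avg_unsampled(N, m), N * np.exp(-m / N))   # both are about 46 here
```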

As in the proof of Theorem 1, we can combine the above distribution with the ideas of Anthony and Bartlett to add the terms depending on to the lower bound.

3 Margin-Based Generalization Lower Bounds

In this section we prove Theorems 1 and 2 assuming Lemma 3, whose proof is deferred to Section 4, and we start by describing the outlines of the proofs. To this end, fix some integer , and fix . Let be an integer, and let be some set with elements. With every we associate a distribution over , and show that with some constant probability over a random choice of , a voting classifier of interest has a high generalization error with respect to . By a voting classifier of interest we mean one constructed by a learning algorithm in the proof of Theorem 1 and an adversarial classifier in the proof of Theorem 2. We additionally show the existence of a hypothesis set such that with very high (constant) probability over a random choice of , contains a voting classifier that attains high margins with over the entire set . Finally, we conclude that with positive probability over a random choice of , both properties are satisfied, and therefore there exists at least one labeling that satisfies both properties.

We start by constructing the family of distributions over . To this end, let be some constant to be fixed later, and let . We define separately for the first points and the last points of . Intuitively, every point in has a fixed label determined by ; however, all points but one have a very small probability of being sampled according to . Every point in , on the other hand, has an equal probability of being sampled; however, its label is not fixed by but rather only slightly biased towards . Formally, let be constants to be fixed later. We construct using the ideas described earlier in Section 2, by sewing them together over two parts of the set . We assign probability to and to . That is, for , the probability that is . Next, conditioned on , is assigned high probability and the rest of the measure is distributed uniformly over . That is

Finally, conditioned on , is uniform over , and conditioned on , we have with probability . That is
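The following sketch of a sampler captures the shape of this two-part construction; since the exact probabilities and constants are not recoverable from the text above, all numeric parameters (and the helper name) are placeholders of our own, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_point(v, n1, n2, q=0.5, p_heavy=0.9, gamma=0.05):
    """Draw one labeled point (x, y) from a distribution shaped like the
    construction above. Points 0..n1-1 form the first part: point 0 is the
    designated heavy point, the rest share the remaining mass uniformly, and
    labels are fixed to v[x]. Points n1..n1+n2-1 form the second part: uniform
    mass, label equal to v[x] only with probability 1/2 + gamma.
    q, p_heavy and gamma are illustrative placeholders."""
    if rng.random() < q:                          # first part of the ground set
        x = 0 if rng.random() < p_heavy else int(rng.integers(1, n1))
        y = v[x]                                  # label fixed by the labeling v
    else:                                         # second part: biased labels
        x = n1 + int(rng.integers(0, n2))
        y = v[x] if rng.random() < 0.5 + gamma else -v[x]
    return x, y

# Example labeling: v = rng.choice([-1, 1], size=n1 + n2) for chosen n1, n2.
```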

In order to give a lower bound on the generalization error for some classifier of interest, we define new random variables such that their sum is upper bounded by , and give a lower bound on that sum. To this end, for every and , denote

(3)

When are clear from the context, we shall simply write . We show next that proving a lower bound on indeed implies a lower bound on the generalization error.

Claim 4.

For every we have .

Before proving the claim, we explain why focusing on , rather than bounding the generalization error directly, is essential for the proof. The reason lies in the fact that we need a lower bound to hold with constant probability over the choice of and (and, in the case of Theorem 1, also the random choices made by the algorithm) and not only in expectation. While lower bounding is clearly not harder than lower bounding , showing that a lower bound holds with some constant probability is slightly more delicate. Our proof uses the fact that, with probability , is not larger than a constant factor times its expectation, and therefore we can use Markov’s inequality to lower bound with constant probability. We next turn to prove the claim.
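For reference, one standard form of the Markov-type step alluded to above (stated here as general background, not as the paper's exact argument) is the following reverse Markov inequality: for a random variable $0 \le X \le B$ and any $t < \mathbb{E}[X]$,

$$\Pr[X > t] \;\ge\; \frac{\mathbb{E}[X] - t}{B - t},$$

which follows by applying Markov's inequality to the non-negative variable $B - X$.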

Proof.

We first observe that

(4)

For every and , if then . Moreover, if and then . Therefore

(5)

Next, for every we have that

and therefore

(6)

Plugging (5) and (6) into (4) we conclude the claim. ∎

To prove the existence of a “rich” yet small enough hypothesis set we apply Lemma 3 together with Yao’s minimax principle. In order to ensure that the hypothesis set constructed using Lemma 3 is small enough, and specifically has size , we need to focus our attention on sparse labelings only. That is, the labelings cannot contain more than . To this end we will focus on -sparse vectors, and more specifically, a designated set of -sparse labelings. More formally, we define a set of labelings of interest as the set of all labelings such that the restriction to the first entries is -sparse. That is

(7)

We next show that there exists a small enough (with respect to ) hypothesis set that is rich enough. That is, with high probability over , there exists a voting classifier that attains a high minimum margin with over the entire set . Note that the following result, similarly to Lemma 3, does not depend on the size of , but only on the sparsity of the labelings in question.

Claim 5.

If then there exists a hypothesis set such that and

Proof.

Let , be the distribution whose existence is guaranteed in Lemma 3. Then for every labeling , with probability at least over , there exists a voting classifier that has minimal margin of . That is, for every , . By Yao’s minimax principle, there exists a hypothesis set such that

Moreover, since , we have . Since and , we get that there exists some universal constant such that , and thus . ∎
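The appeal to Yao’s minimax principle here is, in essence, an averaging argument. A generic sketch of the step, with the success event written abstractly since the exact quantifiers are not recoverable from the text above, and with $\mathcal{D}_{\mathcal{H}}$ and $1 - \delta$ denoting the distribution over hypothesis sets from Lemma 3 and its success probability (both symbols ours), is

$$\mathbb{E}_{\mathcal{H} \sim \mathcal{D}_{\mathcal{H}}}\Bigl[\Pr_{v}\bigl[\mathcal{H} \text{ is good for } v\bigr]\Bigr] \;=\; \mathbb{E}_{v}\Bigl[\Pr_{\mathcal{H} \sim \mathcal{D}_{\mathcal{H}}}\bigl[\mathcal{H} \text{ is good for } v\bigr]\Bigr] \;\ge\; 1 - \delta$$

for $v$ drawn uniformly from the labelings of interest; hence some fixed hypothesis set $\mathcal{H}$ must itself satisfy $\Pr_{v}[\mathcal{H} \text{ is good for } v] \ge 1 - \delta$.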

3.1 Proof of the Algorithmic Lower Bound

This section is devoted to the proof of Theorem 1. That is, we show that for every algorithm , there exist some distribution and some classifier such that, with constant probability over , has large margins on points in , yet has large generalization error. To this end we now fix to be and . For these values of we get that is, in fact, the set of all possible labelings, i.e. . Next, fix to be a (perhaps randomized) learning algorithm. For every -point sample , recall that denotes the classifier returned by when running on sample .

The main challenge is to show that there exists a labeling such that contains a good voting classifier for and, in addition, has a large generalization error with respect to . We will show that if is small enough, then indeed such a labeling exists. Formally, we show the following.

Lemma 6.

If , then there exists such that

  1. There exists such that for every , ; and

  2. with probability at least over and the randomness of we have

Before proving the lemma, we first show how it implies Theorem 1.

Proof of Theorem 1.

Fix some . Assume first that , and let and . Let be as in Lemma 6; then for every sample , , and moreover with probability at least over and the randomness of

where the last transition is due to the fact that and .

Otherwise, assume , and let , and . Since , we have . Moreover, if for a large enough universal constant , then , and hence . Moreover, since , we have , and therefore . Let therefore be a labeling and a classifier in whose existence is guaranteed by Lemma 6. Let be a sample of points drawn independently according to . For every , we have . Therefore, by a Chernoff bound, we get that for large enough ,

where the second-to-last inequality is due to the fact that , since . Moreover, by Lemma 6, with probability at least over and , we get that

where the last transition is due to the fact that . This completes the proof of Theorem 1. ∎
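For reference, the Chernoff bounds invoked in this and the following proofs are of the standard multiplicative form (stated as general background, not as the paper's precise inequality): for independent random variables $X_1, \dots, X_n \in \{0, 1\}$ with $X = \sum_i X_i$, $\mu = \mathbb{E}[X]$, and any $0 < \varepsilon < 1$,

$$\Pr[X \le (1 - \varepsilon)\mu] \;\le\; e^{-\varepsilon^2 \mu / 2} \qquad\text{and}\qquad \Pr[X \ge (1 + \varepsilon)\mu] \;\le\; e^{-\varepsilon^2 \mu / 3}.$$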

For the rest of the section we therefore prove Lemma 6. We start by lower bounding the expected value of , where the expectation is over the choice of labeling , and the random choices made by . Intuitively, as points in are sampled with very small probability, it is very likely that the sample does not contain many of them, and therefore the algorithm cannot do better than randomly guessing many of the labels. Moreover, if is small enough, and does not sample a point in enough times, there is a larger probability that does not determine the bias correctly.

Claim 7.

If , then .

Proof.

To lower bound the expectation, we lower bound the expectations of and separately. For every , if then and are independent, and therefore . Let be the set of all samples for which ; then for every ,

As this holds for every , we conclude that

Next, for large enough a Chernoff bound gives , and therefore , and by Fubini’s theorem