Boosting is a fundamental methodology in machine learning that allows us to automatically convert (“boost”) a number of weak learning rules into a strong one. Boosting was first studied in the context of (realizable) PAC learning in a line of seminal works which include the celebrated Adaboost algorithm as well as many other algorithms with various applications (see e.g. [29, 34, 20, 17]). It was later adapted to the agnostic PAC setting and was extensively studied in this context as well [7, 31, 21, 27, 30, 26, 28, 16, 13, 18]. More recently, [14] and [9] studied boosting in the context of online prediction and derived boosting algorithms in the realizable setting (a.k.a. the mistake-bound model).
In this work we study agnostic boosting in the online setting: let be a class of experts and assume we have oracle access to a weak online learner for with a non-trivial (yet far from desired) regret guarantee. The goal is to use it to obtain a strong online learner for , i.e. one which exhibits a vanishing regret.
Why Online Agnostic Boosting?
The setting of realizable boosting poses a restriction on the possible input sequences: there must be an expert that attains near-zero mistake-bound on the input sequence. This is a non-standard assumption in online learning. In contrast, in the (agnostic) setting we consider, there is no restriction on the input sequence and it can be chosen adversarially.
Applications of Online Agnostic Boosting.
Apart from being a fundamental question in any machine learning setting, let us mention a couple of more concrete incentives to study online agnostic boosting:
Differential Privacy and Online Learning: A recent line of work revealed deep connections between online learning and differentially private learning [5, 1, 6, 10, 32, 25, 22, 11]. In fact, these two notions are equivalent in the sense that a class can be PAC learned by a differentially private algorithm if and only if it can be learned in the online setting with vanishing regret [6, 11]. However, the above equivalence is only known to hold from an information theoretic perspective, and deriving efficient reductions between online and private learning is an open problem . The only case where an efficient reduction is known to exist is in converting a pure private learner to an online learner in the realizable setting . This reduction heavily relies on the realizable-case online boosting algorithm by . Moreover, the derivation of an agnostic online boosting algorithm is posed by  as an open problem towards extending their reduction to the agnostic setting.
Time Series Prediction and Online Control: Recent machine learning literature considered the problem of controlling a dynamical system through the lens of online learning and regret minimization; see e.g. [3, 4, 23] and references therein. The online learning approach also gave rise to the first boosting methods in this context [2], which demonstrates the potential impact of boosting in the online setting. Thus, the current work aims at continuing the development of the boosting methodology in online machine learning, starting from the basic setting of expert advice.
1.1 Main Results
The Weak Learning Assumption.
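In this paper we use the same formulation as [28] used in the statistical setting. Towards this end, it is convenient to measure the performance of online learners using gain rather than loss: let $(x_1, y_1), \ldots, (x_T, y_T)$, with labels $y_t \in \{\pm 1\}$, be an (adversarial and adaptive) input sequence of examples presented to an online learning algorithm; that is, in each iteration $t = 1, \ldots, T$, the adversary picks an example $(x_t, y_t)$, then the learner first gets to observe $x_t$, and predicts (possibly in a randomized fashion) $\hat{y}_t \in \{\pm 1\}$, and lastly it observes $y_t$ and gains a reward of $y_t \cdot \hat{y}_t$. The goal of the learner is to maximize the total gain (or correlation), given by $\sum_{t=1}^{T} y_t \cdot \hat{y}_t$. Note that this is equivalent to the often used notion of loss where in each iteration the learner suffers a loss of $1[y_t \neq \hat{y}_t]$ and its goal is to minimize the accumulated loss $\sum_{t=1}^{T} 1[y_t \neq \hat{y}_t]$. (Footnote 1: Indeed, since $y_t, \hat{y}_t \in \{\pm 1\}$, we have $y_t \cdot \hat{y}_t = 1 - 2 \cdot 1[y_t \neq \hat{y}_t]$. Therefore, the accumulated loss and correlation are affinely related by $\sum_{t=1}^{T} 1[y_t \neq \hat{y}_t] = \frac{1}{2}\left(T - \sum_{t=1}^{T} y_t \cdot \hat{y}_t\right)$.)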
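The equivalence between gain and loss noted above can be checked numerically. The following is a minimal sketch, assuming labels and predictions take values in {-1, +1} and that the loss is the 0-1 mistake indicator; the helper names `correlation` and `mistakes` are illustrative and do not come from the paper.

```python
def correlation(preds, labels):
    """Total gain: sum of prediction * label, with both in {-1, +1}."""
    return sum(p * y for p, y in zip(preds, labels))


def mistakes(preds, labels):
    """Accumulated 0-1 loss: number of rounds where the prediction is wrong."""
    return sum(1 for p, y in zip(preds, labels) if p != y)


# Illustrative sequence (assumed data, not from the paper).
preds = [1, -1, 1, 1, -1]
labels = [1, 1, 1, -1, -1]
T = len(labels)

# Each round contributes +1 to the gain if correct and -1 if wrong,
# so mistakes = (T - correlation) / 2.
assert 2 * mistakes(preds, labels) == T - correlation(preds, labels)
```

Because the two quantities are affinely related, maximizing the correlation and minimizing the number of mistakes are the same objective.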
Definition 1 (Agnostic Weak Online Learning).
Let be a class of experts, let denote the horizon length, and let denote the advantage. An online learning algorithm is a -agnostic weak online learner (AWOL) for if for any sequence , at every iteration , the algorithm outputs such that,
where the expectation is taken w.r.t the randomness of the weak learner and that of the possibly adaptive adversary, is the additive regret: a non-decreasing, sub-linear function of .
Note the slight abuse of notation in the last definition: an online learner is not an “” function; rather it is an algorithm with an internal state that is updated as it is fed training examples. Thus, the prediction depends on the internal state of , and for notational convenience we avoid reference to the internal state.
Our agnostic online boosting algorithm has an oracle access to weak learners and predicts each task by combining their predictions. The number of weak learners is a meta-parameter which can be tuned by the user according to the following trade-off: on the one hand, the regret bound improves as increases, and on the other hand, a larger number of weak learners is more costly in terms of computational resources.
Theorem 2 (Agnostic Online Boosting).
Let be a class of experts, let denote the horizon length, and let be -AWOL for with advantage and regret (see Definition 1). Then, there exists an online learning algorithm, which has oracle access to each , and has expected regret of at most
To exemplify the interplay between and , imagine a scenario where (as is often the case for regret bounds). Then, setting the number of weak learners to be gives that the overall regret remains .
An Abstract Framework for Boosting.
Boosting and Regret Minimization algorithms are intimately related. This tight connection is exhibited both in statistical boosting (see [19, 17, 33]) and in online boosting [14]. Our algorithm is inspired by this fruitful connection and utilizes it: in particular, Theorem 2 is an instantiation of a more abstract meta-algorithm which takes an arbitrary online convex optimizer and uses it in a black-box manner to obtain an agnostic online boosting algorithm. Thus, in fact we obtain a family of boosting algorithms, one for each choice of an online convex optimizer. Specifically, Theorem 2 follows by picking Online Gradient Descent for the meta-algorithm. We present this in detail in Section 2.
The same type of reasoning carries to realizable online boosting, and even to statistical boosting (both realizable and agnostic setting). In Section 3 we demonstrate a general reduction from each of these boosting settings to online convex optimization.
1.2 Related Work
[8] studies online boosting under real-valued loss functions. The main difference from our work is in the weak learning assumption: [8] considers weak learners that are in fact strong online learners for a base class of regression functions. The boosting process produces an online learner for a bigger class which consists of the linear span of the base class. This is different from the setting considered here where the class is fixed, but the regret bound is being boosted.
A main motivation in this work is the connection between boosting and regret minimization. This work builds on, and is inspired by, previous works that demonstrated this fruitful relationship. We refer the reader to the book [33] (Chapter 6) for an excellent presentation of this relationship in the context of Adaboost.
The main result of our agnostic online boosting algorithm, and the proof of Theorem 2, are given in Section 2. In Section 3, we first give a game-theoretic perspective of our method when applied to the statistical setting (Subsection 3.1). We then demonstrate a general reduction, in the statistical setting, from both the agnostic (Subsection 3.2) and realizable (Subsection 3.3) boosting settings to online convex optimization. Lastly, we give a similar result for the online realizable boosting setting in Section 4.
2 Agnostic Online Boosting
In this section we prove Theorem 2, which establishes an efficient online agnostic boosting algorithm. We begin in Subsection 2.1 by formally presenting our framework, which enables converting an online convex optimizer to an online booster. Then, in Subsection 2.2 we show how Theorem 2 follows directly by picking the online convex optimizer to be Online Gradient Descent.
2.1 Online Agnostic Boosting with OCO
We begin with describing our boosting algorithm (see Algorithm 1 for the pseudo-code). The booster has black-box oracle access to two types of auxiliary algorithms: a weak learner, and an online-convex optimizer. The booster maintains instances of a weak learning algorithm. Specifically, each weak learner is a -AWOL (see Definition 1). The online-convex optimizer is a -OCO algorithm (see Equation 1 below).
Online Convex Optimization
(see e.g. [24]). Recall that in the Online Convex Optimization (OCO) framework, an online player iteratively makes decisions from a compact convex set . At iteration , the online player chooses , and the adversary reveals the cost , chosen from a family of bounded convex functions over . We will refer to an algorithm in this setting as a -OCO. Let be a -OCO. The regret of is defined by:
The last component needed to describe our boosting algorithm is the randomized projection “” which is used to predict in Line 2. For any , denote by the following random label:
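One rounding rule of this kind can be sketched as follows. This is a minimal sketch under an explicit assumption: a score of magnitude at least 1 is mapped to its sign, and a score z strictly inside (-1, 1) is mapped to +1 with probability (1 + z)/2 and to -1 otherwise, so that the expected label equals z clipped to [-1, 1]. The function names are hypothetical, and the rule is an assumption chosen to be consistent with the case analysis in the lemmas of this section rather than a quotation of the paper's definition.

```python
import random


def randomized_projection(z, rng=random):
    """Round a real-valued score z to a label in {-1, +1}.

    Assumed rule: return sign(z) when |z| >= 1; otherwise return +1
    with probability (1 + z) / 2 and -1 with probability (1 - z) / 2.
    """
    if z >= 1:
        return 1
    if z <= -1:
        return -1
    return 1 if rng.random() < (1 + z) / 2 else -1


def expected_projection(z):
    """Expectation of the random label under the assumed rule: clip(z, -1, 1)."""
    return max(-1.0, min(1.0, z))


# Empirical check that the rounding is unbiased inside (-1, 1).
rng = random.Random(0)
z = 0.4
samples = [randomized_projection(z, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

Under this rule the randomness matters only for scores inside (-1, 1); outside that range the prediction is deterministic.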
We now state and prove the regret bound for Algorithm 1.
Proposition 3 (Regret Bound).
The accumulated gain of Algorithm 1 satisfies:
where ’s are the observed examples, ’s are the predictions, the expectation is with respect to the algorithm and learners’ randomness, and and are the regret terms of the weak learner and the OCO, respectively.
The proof follows by combining upper and lower bounds on the expected sum of losses incurred by the OCO algorithm. The bounds follow directly from the weak learning assumption (lower bound) and the OCO guarantee (upper bound). These bounds involve some simple algebraic manipulations.
It is convenient to abstract out some of these calculations into lemmas, which are described later in this section.
Before delving into the analysis, we first clarify several assumptions used below. For simplicity of presentation we assume an oblivious adversary; however, using a standard reduction, our results can be generalized to an adaptive one (see the discussion in [12], p. 69, as well as Exercise 4.1, which formulates the reduction). Let be any sequence of observed examples. Observe that there are several sources of randomness at play: the weak learning algorithm ’s internal randomness, the random re-labeling (line 6, Algorithm 1), and the randomized prediction (line 2, Algorithm 1). The analysis below is given in expectation with respect to all these random variables.
Note the following fact used in the analysis; for all , the random variables and are conditionally independent given and . Since , using the conditional independence, it follows that (see Lemma 13 in the Appendix). We can now begin the analysis, starting with lower bounding the expected sum of losses, using the weak learning guarantee,
where is an optimal expert in hindsight for the observed sequence of examples ’s. Thus, we obtain the lower bound on the expected sum of losses (see Line 5 in Algorithm 1 for the definition of the ’s), given by,
(See Lemma 4 below.)
For the upper bound, observe that the OCO regret guarantee implies that for any , and any ,
Thus, by setting according to Lemma 5 (see below, with ), and summing over , we get,
By combining the lower and upper bounds for , we get,
It remains to prove two Lemmas that are used in the proof of the theorem above, as well as in the more general settings in the following sections.
Lemma 4.
For any , an example pair , and , we have:
Let . Observe that . Thus, since , . ∎
Lemma 5.
Given an example pair , and , there exists , such that,
where , with expectation taken only w.r.t. the randomness of (see Definition (2)).
If , and by setting , the equality follows. Thus, assume , and consider the following cases:
If , then . Hence, by setting , the equality follows.
If , then since it must be that , and . Since , we have . Hence, by setting the inequality holds.
2.2 Proof of Theorem 2
Theorem 2 is a direct corollary of Proposition 3, obtained by taking Online Gradient Descent (OGD) as the OCO algorithm (see, e.g., [24], Chapter 3.1): the OGD regret is , where is the number of iterations, is an upper bound on the gradient of the losses, and is the diameter of the set . In our setting, , and . Hence, , and the overall bound on the regret follows.
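To make the role of OGD concrete, here is a minimal sketch of projected Online Gradient Descent on an interval with linear losses. It assumes the standard step size D / (G * sqrt(t)), for which the textbook analysis gives regret at most (3/2) * G * D * sqrt(T), where G bounds the gradients and D is the diameter of the decision set. The interval [-1, 1], the random cost sequence, and the function names are illustrative assumptions and are not the specific instantiation used in the proof.

```python
import math
import random


def online_gradient_descent(costs, lo=-1.0, hi=1.0, G=1.0):
    """Projected OGD on [lo, hi] against linear losses f_t(p) = c_t * p.

    Assumes |c_t| <= G for all t. Returns the list of played points.
    """
    D = hi - lo                  # diameter of the decision set
    p = 0.5 * (lo + hi)          # arbitrary starting point inside the set
    played = []
    for t, c in enumerate(costs, start=1):
        played.append(p)
        eta = D / (G * math.sqrt(t))        # decreasing step size
        p = min(hi, max(lo, p - eta * c))   # gradient step, then projection
    return played


def regret(costs, played, lo=-1.0, hi=1.0):
    """Incurred loss minus the loss of the best fixed point in hindsight."""
    total = sum(costs)
    best_fixed = min(lo * total, hi * total)  # a linear loss is minimized at an endpoint
    incurred = sum(c * p for c, p in zip(costs, played))
    return incurred - best_fixed


rng = random.Random(0)
T = 2000
costs = [rng.choice([-1.0, 1.0]) for _ in range(T)]
played = online_gradient_descent(costs)
bound = 1.5 * 1.0 * 2.0 * math.sqrt(T)  # (3/2) * G * D * sqrt(T)
assert regret(costs, played) <= bound
```

The regret grows like the square root of the horizon, which is the property of the OCO algorithm that the boosting analysis relies on.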
3 Statistical Boosting via Improper Game Playing
In this section we first give a game-theoretic perspective of our method when applied to the statistical setting (Subsection 3.1). We then demonstrate a general reduction from both the agnostic (Subsection 3.2), and realizable (Subsection 3.3) boosting settings, to online convex optimization. The following algorithm is given as input a sample , and has a black-box access to two auxiliary algorithms: a weak learner, and an online-convex optimizer. Note that this in fact defines a family of boosting algorithms, depending on the choice of the online-convex optimizer.
3.1 Solving Zero Sum Games Improperly Using an Approximate Optimization Oracle
Our framework uses as a main building block a procedure for approximately solving zero sum games using an approximate optimization oracle. It is described in this section.
In the zero sum games setting, there are two players A and B, and a payoff function that depends on the players’ strategies. Player A’s goal is to minimize the payoff, while player B’s goal is to maximize it. Let and be the convex, compact decision sets of players A and B, respectively, and assume that is convex-concave. By Sion’s minimax theorem [35], the value of the game is well-defined, and we denote it by :
Let be a convex, compact set such that . We refer to strategies in as proper strategies, while those in are improper strategies. We consider a modified zero sum games setting where the payoff function is defined on , the set of improper strategies. Note that is defined with respect to the set of proper strategies, and it is still a well-defined quantity in this game.
Assumption 1: Player B has access to a randomized approximate optimization oracle . Given any , outputs an improper best response: a strategy such that , where the expectation is taken over the randomness of .
Assumption 2: Player B is allowed to play strategies in .
Assumption 3: Player A has access to a possibly randomized -OCO algorithm with regret (see Equation 1).
Proposition 6.
If players A and B play according to Algorithm 3, then player B’s average strategy , , satisfies for any ,
where the expectation is taken over the randomness of .
Since the game is well-defined over and , there exists a max-min strategy for player B such that for all , . Let , and observe that since the ’s depend on the sequence of ’s, they are also random variables, as well as . We have,
The first inequality is due to Assumption 1, where The second inequality holds because is convex in .
Now, let ; note that since is convex. For the upper bound, observe that the OCO regret guarantee implies that for any we have,
where the second inequality holds because is concave in . Combining the lower and upper bounds yields the theorem. ∎
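The argument above can be illustrated in its simplest form with a finite matrix game. The sketch below is not Algorithm 3: it assumes payoffs in [-1, 1], uses Hedge (multiplicative weights) over player A's pure strategies as the no-regret algorithm, and gives player B an exact and proper best-response oracle rather than an approximate, improper one. All function names are hypothetical.

```python
import math


def solve_matrix_game(M, T):
    """Return player B's average strategy after T rounds of play.

    M[i][j] is the payoff when A plays row i and B plays column j.
    A minimizes using Hedge; B maximizes by best-responding to A's mixture.
    """
    n, m = len(M), len(M[0])
    eta = math.sqrt(2.0 * math.log(n) / T)  # Hedge rate for payoffs in [-1, 1]
    cum_loss = [0.0] * n
    counts = [0] * m
    for _ in range(T):
        # Player A: exponential weights on cumulative losses (shifted for stability).
        base = min(cum_loss)
        weights = [math.exp(-eta * (c - base)) for c in cum_loss]
        total = sum(weights)
        p = [w / total for w in weights]
        # Player B: exact best response to the mixed strategy p.
        payoff = [sum(p[i] * M[i][j] for i in range(n)) for j in range(m)]
        j_star = max(range(m), key=lambda j: payoff[j])
        counts[j_star] += 1
        # Player A observes the loss of each of its pure strategies.
        for i in range(n):
            cum_loss[i] += M[i][j_star]
    return [c / T for c in counts]


def worst_case_payoff(M, q):
    """Payoff that strategy q guarantees player B against any strategy of A."""
    return min(sum(M[i][j] * q[j] for j in range(len(q))) for i in range(len(M)))


# Rock-paper-scissors: the value of the game is 0.
M = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
q_bar = solve_matrix_game(M, 5000)
guarantee = worst_case_payoff(M, q_bar)
```

In this special case, player B's average strategy guarantees a payoff within (regret of A) / T of the value of the game, which mirrors the structure of the bound in the proposition.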
3.2 Statistical Agnostic Boosting
We will use the following notation. Let be a distribution over and let be an hypothesis. Define the correlation of with respect to by:
Definition 7 (Empirical Agnostic Weak Learning Assumption).
Let be a hypothesis class and let denote an unlabeled sample. A learning algorithm is a -agnostic weak learner (AWL) for with respect to if for any labels ,
where is the distribution which uniformly assigns to each example probability , and is an independent sample of size drawn from .
In accordance with previous works, we focus on the setting where is a small constant (say ) and , where is the VC-dimension of (see [28] for a detailed discussion). We stress, however, that our results apply for any setting of .
The above weak learning assumption can be seen as an empirical variant of the assumption in [28], where is replaced with the population distribution over and the labels ’s are replaced with an arbitrary classifier. Both of these assumptions are weaker than the standard agnostic weak learning assumption, for which the guarantee holds with respect to every distribution over . It would be interesting to investigate the relationship between the assumption of [28] and our empirical variant; however, this is beyond the scope of this work.
We now state and prove the regret bound for Algorithm 2.
Theorem 8 (Empirical Agnostic Boosting).
The correlation of the output of Algorithm 2, which is denoted , satisfies:
The above theorem asserts that the correlation of the output hypothesis is competitive with the best hypothesis in with respect to the empirical distribution. A similar guarantee with respect to the population distribution can be obtained using standard arguments. One way of deriving it is via a sample compression argument (which is natural in boosting; see, e.g., [33, 15]): indeed, the final hypothesis is obtained by aggregating the weak hypotheses ’s, each of which is determined by the examples fed to the weak learner. Thus, can be encoded by input examples and hence the entire algorithm forms a sample compression scheme of this size. Consequently, by setting the input sample we get the same guarantee as in Equation 3 up to an additive error of .
The proof has two parts. The first part is a straightforward reduction to the game-theoretic setup of Proposition 6, and the second part shows how to project the “improper” strategy obtained by Proposition 6 to the desired output hypothesis.
Reduction to Proposition 6. The agnostic version of Algorithm 2 can be presented as an instance of Algorithm 3, where Players A and B are the OCO algorithm and the weak learner, respectively. The decision sets are , , and , and the payoff function is given by
is a vector in thedimensional continuous cube, and is a non-negative combination of hypotheses in (and so corresponds to the mapping ). We leave it to the reader to verify that the agnostic weak learner corresponds to an approximate optimization oracle . Namely, for any the output satisfies and
Furthermore, it can be shown that the value of the above game is
This can be done by (i) observing that the strategy is dominant for Player and (ii) computing which is equal to (since is dominating).
Now, Proposition 6 implies that for any , we have
Projection. Recall that the output hypothesis is defined using the projection (see Definition 2):
Now, by Lemma 5 there exists such that
where the expectation is taken over the randomness of the projection, the weak learner, and the random samples given to the weak learner. Simple manipulation on the above inequality directly yields
If we use OGD as the OCO algorithm, we have , where and . We arrive at the theorem by plugging in
3.3 Statistical Realizable Boosting
Definition 9 (Empirical Weak Learning Assumption [33]).
Let be a hypothesis class, and let be a sample. A learning algorithm is a -weak learner (WL) for with respect to if for any distribution which assigns each example with probability ,
where is an independent sample of size drawn from .
Theorem 10.
The correlation of the output of Algorithm 2, denoted , satisfies
The proof follows the same structure as that of Theorem 8, and is deferred to the Appendix.
4 Online Realizable Boosting
In this section, we give an online realizable boosting algorithm, and state the regret bound. The result is along similar lines as our main result given in Section 2. We first state the weak learning assumption for the online realizable setting.
Definition 11 (Online Weak Learning).
Let be a class of experts, let denote the horizon length, and let denote the advantage. An online learning algorithm is a -weak online learner (WOL) for if for any sequence that is realizable by , at every iteration , the algorithm outputs such that,
where the expectation is taken over the randomness of the weak learner and is the additive regret: a non-decreasing, sub-linear function of .
Similar to the online agnostic case, the boosting algorithm is given access to instances of a -WOL algorithm (see Definition 11) and a -OCO algorithm (see Equation 1). Instead of setting as in the agnostic case, we set . The algorithm for online boosting is exactly the same as in the agnostic online case (see Algorithm 1), except for line 6. In the online agnostic case, we pass a relabeled data point to , while the algorithm below does not relabel the data points.
The following theorem proves the realizable online boosting result. Observe that in the realizable case, . Let . Note that the error can be made arbitrarily small, by setting the number of weak learners to and the number of iterations of Algorithm 4 to , for any . Thus, for an OCO algorithm (1) with regret bound , and a weak learner with regret bound , by the following theorem, we get that the online correlation of the booster is at least .
Theorem 12.
The accumulated gain of Algorithm 4 satisfies:
where ’s are the observed examples, ’s are the predictions, the expectation is with respect to the algorithm and learners’ randomness, , and and are the regret terms of the weak learner and the OCO, respectively.
The proof follows similarly to the proof of Proposition 3, and is deferred to the Appendix.
We have presented the first boosting algorithm for agnostic online learning. In contrast to the realizable setting, we do not place any restrictions on the online sequence of examples. It remains open to prove lower bounds on online agnostic boosting as a function of the natural parameters of the problem and/or improve our upper bounds.
[1] (2017) Online learning via differential privacy. CoRR abs/1711.10019.
[2] (2019) Boosting for dynamical systems. arXiv preprint arXiv:1906.08720.
[3] (2019) Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111–119.
[4] (2019) Logarithmic regret for online control. In Advances in Neural Information Processing Systems 32, pp. 10175–10184.
[5] (2017) The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 32–40.
[6] Private PAC learning implies finite Littlestone dimension. Proceedings of the 51st Annual ACM Symposium on the Theory of Computing, STOC ’19, New York, NY, USA.
[7] International Conference on Computational Learning Theory, pp. 507–516.
[8] Online gradient boosting. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2458–2466.
[9] (2015) Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pp. 2323–2331.
[10] (2019) Passing tests without memorizing: two models for fooling discriminators.
[11] An Equivalence Between Private Classification and Online Prediction.
[12] (2006) Prediction, learning, and games. Cambridge university press.
[13] (2016) Communication efficient distributed agnostic boosting. In Artificial Intelligence and Statistics, pp. 1299–1307.
[14] (2012) An online boosting algorithm with theoretical justifications.
[15] (2016) Supervised learning through the lens of compression. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2784–2792.
[16] (2009) Distribution-specific agnostic boosting. arXiv preprint arXiv:0909.2927.
[17] (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55 (1), pp. 119–139.
[18] (1996) Game theory, on-line prediction and boosting. In Proceedings of the ninth annual conference on Computational learning theory, pp. 325–332.
[19] (1999) Adaptive game playing using multiplicative weights. Games and Economic Behavior 29 (1-2), pp. 79–103.
[20] (1990) Boosting a weak learning algorithm by majority. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT 1990, University of Rochester, Rochester, NY, USA, August 6-8, 1990, M. A. Fulk and J. Case (Eds.), pp. 202–216.
[21] (2003) Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research 4 (May), pp. 101–117.
[22] (2019) Private learning implies online learning: an efficient reduction. CoRR abs/1905.11311.
[23] (2019) The nonstochastic control problem. arXiv preprint arXiv:1911.12178.
[24] (2016) Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325.
[25] (2019) The role of interactivity in local differential privacy. In FOCS.
[26] (2008) On agnostic boosting and parity learning. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 629–638.
[27] (2005) Boosting in the presence of noise. J. Comput. Syst. Sci. 71 (3), pp. 266–290.
[28] (2009) Potential-based agnostic boosting. In Advances in neural information processing systems.
[29] (1988-12) Thoughts on hypothesis boosting. Note: Unpublished.
[30] (2008) Adaptive martingale boosting. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), pp. 977–984.
[31] (2002) Boosting using branching programs. J. Comput. Syst. Sci. 64 (1), pp. 103–112.
[32] How to use heuristics for differential privacy. In FOCS.
[33] (2012) Boosting: Foundations and Algorithms. Cambridge university press.
[34] (1990) The strength of weak learnability. Machine Learning 5 (2), pp. 197–227.
[35] (1958) On general minimax theorems. Pacific J. Math. 8 (1), pp. 171–176.
Let be random variables, such that