# Online Boosting with Bandit Feedback

We consider the problem of online boosting for regression tasks, when only limited information is available to the learner. We give an efficient regret minimization method that has two implications: an online boosting algorithm with noisy multi-point bandit feedback, and a new projection-free online convex optimization algorithm with stochastic gradient, that improves state-of-the-art guarantees in terms of efficiency.


## 1 Introduction

Boosting is a fundamental methodology in machine learning which allows us to efficiently convert a number of weak learning rules into a strong one. The theory of boosting in the batch setting has been studied extensively, leading to tremendous practical success; see Schapire and Freund (2012) for a thorough discussion.

In contrast to the batch setting, online learning algorithms typically make no stochastic assumptions about the data. They are often faster, more memory-efficient, and can adapt to the best changing predictor over time. A line of previous work has explored extensions of boosting methods to the online learning setting (Leistner et al. (2009); Chen et al. (2012, 2014); Beygelzimer et al. (2015b, a); Agarwal et al. (2019); Brukhim et al. (2020)). Of these, several works (Beygelzimer et al. (2015a); Agarwal et al. (2019)) formally address the setting of online boosting for regression, providing theoretical guarantees on variants of the gradient boosting method (Friedman (2001); Mason et al. (2000)) widely used in practice. However, such guarantees are only provided under the assumption that full information is available to the learner, i.e., that the entire loss function is revealed after each prediction is made.

On the other hand, in many online learning problems the feedback available to the learner is limited. Such problems naturally occur in practical applications in which interactions with the environment are costly, and the learner has to operate under bandit feedback. This is often the case, for example, in reinforcement learning in a Markov decision process (Jin and Luo (2019); Rosenberg and Mansour (2019)). In the bandit feedback model, the learner only observes the loss values of the predictions she chose. In particular, the loss function is not revealed to the learner and, unless the prediction was correct, the true label remains unknown. In this paper we propose the first online boosting algorithm with theoretical guarantees in the bandit feedback setting.

The underlying ideas used in our approach are based on the fact that boosting can be seen as an optimization procedure. It can be interpreted as cost minimization over the set of linear combinations of weak learners. That is, boosting can be thought of as applying a gradient-descent-type algorithm in a function space Schapire and Freund (2012); Friedman (2001); Mason et al. (2000). This functional view of boosting has also inspired a few studies of boosting methods Friedman (2001); Wang et al. (2015); Beygelzimer et al. (2015a) that are based on the classical Frank-Wolfe algorithm Frank and Wolfe (1956), a projection-free convex optimization method.

In this work we leverage these ideas to yield a new online boosting algorithm based on a Frank-Wolfe-type technique. Namely, our online boosting algorithm is based on a projection-free Online Convex Optimization (OCO) method with stochastic gradients. The stochastic gradient assumption can capture, in particular, bandit feedback, since stochastic gradient estimates can be obtained using random function evaluation (Flaxman et al. (2005)).

However, such existing projection-free OCO methods either achieve suboptimal regret bounds Hazan and Kale (2012) or have high per-iteration computational costs Mokhtari et al. (2018); Chen et al. (2018); Xie et al. (2019). To fill this gap, we derive a new method and analysis of a projection-free OCO algorithm with stochastic gradients. As summarized in Table 1, our projection-free OCO algorithm is the fastest known method compared to previous work, while achieving an optimal regret bound. Furthermore, our Frank-Wolfe-type algorithm gives rise to an efficient online boosting method in the bandit setting.

#### Our results

We propose new online learning methods using only limited feedback. Specifically:

• Online Boosting with Bandit Feedback: we propose the first online boosting algorithm with theoretical regret bounds in the bandit feedback setting. The formal description of our method is given in Algorithm 2, and its theoretical guarantees are stated in Theorem 9. In addition, Section 4 presents encouraging experiments on benchmark datasets.

• Projection-Free OCO with Stochastic Gradients: an efficient projection-free OCO algorithm with stochastic gradients, which improves state-of-the-art guarantees in terms of computational efficiency. Table 1 compares these results to previous work. Our method is given in Algorithm 1, and its theoretical guarantees are stated in Theorems 2 and 3.

#### Paper outline

In the next subsection we discuss related work. Section 2 deals with the setting of projection-free online convex optimization, with stochastic gradient oracle. We describe the OCO algorithm and formally state its theoretical guarantees. In Section 3 we describe a generalization of these techniques, and give our main algorithm of online boosting in the bandit feedback model, along with the main theorem. In Section 4 we empirically evaluate the performance of our algorithms. The complete analysis and proofs of all our methods are given in the supplementary material.

### 1.1 Related work

#### Projection-free OCO.

The classical Frank-Wolfe (FW) method was introduced in Frank and Wolfe (1956) for efficiently solving quadratic programming. The framework of Online Convex Optimization (OCO) was introduced by Zinkevich (2003), along with the online projected gradient descent method, achieving an $O(\sqrt{T})$ regret bound. However, the projections required by such an algorithm are too expensive for many large-scale online problems. The online variant of the FW algorithm that applies to general OCO was given in Hazan and Kale (2012). It attains $O(T^{3/4})$ regret for the general OCO setting, with only one linear optimization step per iteration. A more general setting considers the use of stochastic gradient estimates instead of exact gradients (Mokhtari et al. (2018); Chen et al. (2018); Xie et al. (2019)). Although this removes the assumption that exact gradient computation is tractable, it often incurs larger per-iteration computational costs. In this work, we give a projection-free OCO method that improves state-of-the-art guarantees, with an $O(\sqrt{T})$ regret bound and $O(\sqrt{T})$ per-round cost.

#### Online Boosting

Previous works on online boosting have mostly focused on classification tasks (Leistner et al. (2009); Chen et al. (2012, 2014); Beygelzimer et al. (2015b); Jung et al. (2017); Jung and Tewari (2018)). The main result in this paper is a generalization of the online boosting for regression problems of Beygelzimer et al. (2015a) to the bandit feedback model. We combine these ideas with zeroth-order convex optimization techniques (Flaxman et al. (2005)), and with our novel projection-free OCO algorithm and analysis. Recent works have also considered online boosting in the bandit setting for classification tasks (Chen et al. (2014); Zhang et al. (2018)). These works give convergence guarantees in the more restricted mistake-bound model, whereas in this work we provide regret bounds with respect to a reference function class. The related works of Garber (2017); Hazan et al. (2018) consider the metric of $\alpha$-regret, which is applicable to computationally hard problems.

#### Multi-Point Bandit Feedback

In this work we consider a relaxation of the standard bandit setting: noisy multi-point bandit feedback. In this model, the learner can query each loss function at multiple points, and obtains noisy feedback values. This model is motivated by reinforcement learning in Markov decision processes. Previous work on the multi-point bandit model allows multi-point noiseless feedback (Agarwal et al. (2010); Duchi et al. (2015); Shamir (2017)). Noiseless feedback is significantly less challenging, since with only two points one can obtain an arbitrarily good approximation of the gradient. In addition, other works have considered a single-point projection-free noiseless bandit model (Garber and Kretzu (2019); Chen et al. (2019)).

## 2 Projection-Free OCO with Limited Feedback

Consider the setting of Online Convex Optimization (OCO) when only limited feedback is available to the learner, rather than full information. Recall that in the OCO framework (see e.g. Hazan (2016)), an online player iteratively makes decisions from a compact convex set $\mathcal{K}$. At iteration $t$, the online player chooses $x_t \in \mathcal{K}$, and the adversary reveals the cost $\ell_t$, chosen from a family $\mathcal{L}$ of bounded convex functions over $\mathcal{K}$. The metric of performance in this setting is regret: the difference between the total loss of the learner and that of the best fixed decision in hindsight. Formally, the regret of the OCO algorithm $\mathcal{A}$ is defined by:

$$R_{\mathcal{A}}(T) = \sum_{t=1}^{T} \ell_t(x_t) - \inf_{x^* \in \mathcal{K}} \sum_{t=1}^{T} \ell_t(x^*). \tag{1}$$

In this work we restrict the information that the learner has with respect to the loss function $\ell_t$. Specifically, we focus on two such types of limited feedback:

1. Stochastic Gradients: the learner is only provided with stochastic gradient estimates.

2. Bandit Feedback: the learner only observes the loss values of predictions she made.

Our goal is to design an algorithm which has both low regret and low per-iteration cost. We begin with the more restricted setting, which assumes access to a stochastic gradient oracle. In Section 3.2 we describe a reduction for the more general bandit setting, in the context of online boosting.

As in previous methods of projection-free OCO Mokhtari et al. (2018); Chen et al. (2018); Xie et al. (2019), we assume oracle access to an Online Linear Optimizer (OLO). The OLO algorithm optimizes linear objectives in a sequential manner, and has sublinear regret guarantees. A formal definition is given below.

###### Definition 1.

Let $\mathcal{G}$ denote a class of linear loss functions $g(x) = \langle g, x \rangle$ with $\sigma$-bounded gradient norm (i.e., $\|g\| \le \sigma$). An algorithm $\mathcal{A}$ is an Online Linear Optimizer (OLO) for $\mathcal{K}$ w.r.t. $\mathcal{G}$ if, for any sequence $g_1, \ldots, g_T \in \mathcal{G}$, the algorithm has expected regret w.r.t. $\mathcal{G}$,111For ease of presentation we denote this expected regret bound by $R_{\mathcal{A}}(T, \sigma)$. that is sublinear in $T$, where the expectation is taken w.r.t. the internal randomness of $\mathcal{A}$.

Suitable choices for the OLO algorithm include Follow the Perturbed Leader (FPL) Kalai and Vempala (2005), Online Gradient Descent Zinkevich (2003), Regularized Follow The Leader Hazan (2016), etc.
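As a minimal illustration of what such an OLO looks like in practice, the sketch below implements online gradient descent on linear losses over a Euclidean ball; the class name and parameter choices are our own, not the paper's.

```python
import numpy as np

class BallOGD:
    """Online gradient descent on linear losses over the ball {x : ||x|| <= radius}.

    For linear losses g_1, ..., g_T with ||g_t|| <= sigma, a step size of order
    D / (sigma * sqrt(T)) yields the familiar O(sigma * D * sqrt(T)) regret.
    """

    def __init__(self, dim, radius=1.0, eta=0.1):
        self.x = np.zeros(dim)
        self.radius = radius
        self.eta = eta

    def predict(self):
        return self.x.copy()

    def update(self, g):
        # Gradient step on the linear loss x -> <g, x>.
        self.x -= self.eta * g
        # Projection onto the ball is just a rescaling.
        norm = np.linalg.norm(self.x)
        if norm > self.radius:
            self.x *= self.radius / norm
```

Against a fixed linear loss, the iterate drifts to the best point on the boundary of the ball, as expected of any low-regret OLO.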

Denote the diameter of the set $\mathcal{K}$ by $D$ (i.e., $\|x - x'\| \le D$ for all $x, x' \in \mathcal{K}$), denote by $G$ an upper bound on the norm of the gradients of $\ell \in \mathcal{L}$ over $\mathcal{K}$ (i.e., $\|\nabla \ell(x)\| \le G$), and denote by $M$ an upper bound on the loss (i.e., $|\ell(x)| \le M$). We also make the following common assumptions:

###### Assumption 1.

The loss functions $\ell \in \mathcal{L}$ are $\beta$-smooth, i.e., for any $x, x' \in \mathcal{K}$,

$$\|\nabla \ell(x) - \nabla \ell(x')\| \le \beta \|x - x'\|.$$
###### Assumption 2.

The stochastic gradient oracle returns an unbiased estimate $g_t$ of $\nabla \ell_t(x)$, for any $x \in \mathcal{K}$, with bounded norm, i.e.,

$$\mathbb{E}[g_t] = \nabla \ell_t(x), \qquad \|g_t\|^2 \le \sigma^2.$$

### 2.1 Algorithm and Analysis

At a high level, our algorithm maintains oracle access to $N$ copies of an OLO algorithm, and iteratively produces points by running a subroutine of an $N$-step Frank-Wolfe procedure. It uses the previous OLOs' predictions, and a gradient-estimate oracle in place of exact optimization with true gradients. To update parameters, at each iteration $t$ the algorithm queries the gradient oracle at $N$ points. The gradient estimates are then fed to the OLO oracles as linear loss functions. Intuitively, this guides each OLO algorithm to correct for the mistakes of the preceding OLOs. A formal description is provided in Algorithm 1.
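To make the structure concrete, here is a minimal sketch of one such round, under our own simplifying choices: a unit Euclidean ball as the decision set, online gradient descent as the OLO, and the classical $2/(i+1)$ Frank-Wolfe step sizes. It illustrates the scheme described above, not the paper's Algorithm 1 verbatim.

```python
import numpy as np

class SimpleOLO:
    """Minimal online linear optimizer: OGD over the unit Euclidean ball."""
    def __init__(self, dim, eta=0.3):
        self.x = np.zeros(dim)
        self.eta = eta
    def predict(self):
        return self.x.copy()
    def update(self, g):
        self.x -= self.eta * g
        n = np.linalg.norm(self.x)
        if n > 1.0:
            self.x /= n

def fw_round(olos, grad_oracle):
    """One round of the meta-procedure: blend the N OLO predictions with
    Frank-Wolfe step sizes, feeding each OLO the stochastic gradient taken
    at the running iterate as a linear loss."""
    x = olos[0].predict()                   # start inside the decision set
    for i, olo in enumerate(olos, start=1):
        v = olo.predict()                   # OLO plays the linear-optimization step
        olo.update(grad_oracle(x))          # stochastic gradient fed back as a linear loss
        step = 2.0 / (i + 1)                # classical Frank-Wolfe schedule
        x = (1.0 - step) * x + step * v     # convex combination stays feasible
    return x
```

Because the played point is always a convex combination of OLO outputs, no projection is ever needed, which is the point of a projection-free method.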

The following Theorem states the regret guarantees of Algorithm 1. In this paper, all bounds are given with respect to the dependence on the different parameters, and omit all constants.

###### Theorem 2.

Given that Assumptions 1 and 2 hold, Algorithm 1 is a projection-free OCO algorithm which requires only $O(\sqrt{T})$ stochastic gradient oracle calls per iteration, such that for any sequence of convex losses $\ell_1, \ldots, \ell_T \in \mathcal{L}$ and any $x^* \in \mathcal{K}$, its expected regret is

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t) - \sum_{t=1}^{T} \ell_t(x^*)\right] \le O(\sigma D \sqrt{T}).$$

The theoretical guarantees given in Theorem 2 use expected regret as the performance metric. Even though expected regret is a widely accepted metric for online randomized algorithms, one might want to rule out the possibility that the regret has high variance, and verify that the given result actually holds with high probability. By observing that the excess loss can be formulated as a martingale difference sequence, and by applying the Azuma-Hoeffding inequality, we obtain regret guarantees which hold with high probability. The main result is stated below.

###### Theorem 3.

Given that Assumptions 1 and 2 hold, Algorithm 1 is a projection-free OCO algorithm which requires only $O(\sqrt{T})$ stochastic gradient oracle calls per iteration, such that for any $\rho > 0$ and any sequence of convex losses $\ell_1, \ldots, \ell_T \in \mathcal{L}$ over the convex set $\mathcal{K}$, with probability at least $1 - \rho$,

$$\sum_{t=1}^{T} \ell_t(x_t) - \inf_{x^* \in \mathcal{K}} \sum_{t=1}^{T} \ell_t(x^*) \le O\left(\sigma D \sqrt{T \log \frac{\beta D T}{\sigma \rho}}\right).$$

The complete analysis and proofs of both theorems are deferred to the Appendix. Below we give an overview of the main ideas used in the proof of Theorem 2. For simplicity, assume an oblivious adversary (although, using a standard reduction, our results can be generalized to an adaptive one).222See the discussion in Cesa-Bianchi and Lugosi (2006), pg. 69, as well as Exercise 4.1 formulating the reduction.

Let $\ell_1, \ldots, \ell_T$ be any sequence of losses in $\mathcal{L}$. Observe that the only sources of randomness at play are the OLOs' internal randomness and the stochasticity of the gradients. The analysis below is given in expectation with respect to all these random variables. Note the following fact used in the analysis: for any $i \in [N]$, the random variables $g_t^i$ and $v_t^i$ (i.e., the output of the $i$-th OLO at time $t$) are conditionally independent, given all history up to time $t$ and step $i$. This fact allows us to derive the following lemma:

###### Lemma 4.

For any $t \in [T]$ and $i \in [N]$, let $g_t^i$ be the unbiased stochastic gradient estimate used in Algorithm 1, and denote the output of the $i$-th OLO at time $t$ by $v_t^i$. Then, we have

$$\mathbb{E}\left[\langle g_t^i, v_t^i \rangle\right] = \mathbb{E}\left[\langle \nabla \ell_t(x_t^i), v_t^i \rangle\right].$$

Using Lemma 4, the algorithm is analyzed along the lines of the Frank-Wolfe algorithm, obtaining the expected regret bound of Algorithm 1.

###### Proposition 5.

Given that Assumptions 1 and 2 hold, and given oracle access to $N$ copies of an OLO algorithm for linear losses with regret $R_{\mathcal{A}}(T, \sigma)$ (see Definition 1), Algorithm 1 is an online learning algorithm such that for any sequence of convex losses $\ell_1, \ldots, \ell_T \in \mathcal{L}$ and any $x^* \in \mathcal{K}$, its expected regret is

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t) - \sum_{t=1}^{T} \ell_t(x^*)\right] \le \frac{2 \beta D^2 T}{N} + R_{\mathcal{A}}(T, \sigma).$$

### 2.2 Proof of Theorem 2

###### Proof.

The proof of Theorem 2 is a direct corollary of Proposition 5, obtained by plugging in Follow the Perturbed Leader (Kalai and Vempala (2005)) as the OLO algorithm required by Algorithm 1. We get that the regret of the base algorithms is $O(\sigma D \sqrt{T})$ w.r.t. the sequence of linear losses, where $D$ is the diameter of the set $\mathcal{K}$, and $\sigma$ is the stochastic gradient norm bound (Assumption 2). Thus, by setting $N = \Theta(\sqrt{T})$, we get expected regret of $O(\sigma D \sqrt{T})$ w.r.t. the convex loss sequence. ∎

## 3 Online Boosting with Bandit Feedback

The projection-free OCO method given in Section 2 assumes oracle access to an online linear optimizer (OLO), and utilizes it by iteratively making oracle calls with modified objectives in order to solve the harder task of convex optimization. Analogously, boosting algorithms typically assume oracle access to a "weak" learner, which is utilized by iteratively making oracle calls with modified objectives in order to obtain a "strong" learner with boosted performance. In this section, we derive an online boosting method in the bandit setting, based on an adaptation of Algorithm 1.

In the online learning setting, we assume that in each round $t$, for $t = 1, \ldots, T$, an adversary selects an example $x_t \in \mathcal{X}$ and a loss function $\ell_t$, chosen from a class of bounded convex losses $\mathcal{L}$. The adversary then presents $x_t$ to the online learning algorithm $\mathcal{A}$, which makes a prediction with the goal of minimizing the sum of losses over time, when compared against a function class $\mathcal{F}$. Specifically, the metric of performance in this setting is policy regret: the difference between the total loss of the learner's predictions and that of the best fixed policy/function $f \in \mathcal{F}$, in hindsight:

$$R_{\mathcal{A}}(T) = \sum_{t=1}^{T} \ell_t(\mathcal{A}(x_t)) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(f(x_t)). \tag{2}$$

To compare this setting with the OCO setting detailed in Section 2, observe that in the OCO setting, at every time step the adversary only picks the loss function, and the online player picks a point in the decision set $\mathcal{K}$, towards minimizing the loss and competing with the best fixed point in hindsight. In this online learning setting, on the other hand, at every time step the adversary picks both an example and a loss function, and the online player picks a point in $\mathcal{K}$, towards minimizing the loss and competing with the best fixed mapping in hindsight from examples in $\mathcal{X}$ to labels in $\mathcal{K}$. With these observations in mind, we describe the online boosting methodology next.

Generalizing from the offline setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm for linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm for convex loss functions that competes with a larger class of regression functions. We follow a setting similar to that of the full-information Online Gradient Boosting method (Beygelzimer et al. (2015a)), in the more general case of noisy bandit feedback and a weaker notion of weak learner.

###### Definition 6.

Let $\mathcal{F}$ denote a reference class of regression functions $f : \mathcal{X} \to \mathcal{K}$, let $T$ denote the horizon length, and let $\gamma \in (0, 1]$ denote the advantage. Let $\mathcal{G}$ denote a class of linear loss functions. An online learning algorithm $\mathcal{A}$ is a $\gamma$-agnostic weak online learner (AWOL) for $\mathcal{F}$ w.r.t. $\mathcal{G}$ if, for any sequence $(x_1, \ell'_1), \ldots, (x_T, \ell'_T)$, at every iteration $t$ the algorithm outputs $\mathcal{A}(x_t)$ such that for any $f \in \mathcal{F}$,

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell'_t(\mathcal{A}(x_t)) - \gamma \sum_{t=1}^{T} \ell'_t(f(x_t))\right] \le R_{\mathcal{A}}(T, \sigma),$$

where the expectation is taken w.r.t. the randomness of the weak learner and that of the adversary, and the regret $R_{\mathcal{A}}(T, \sigma)$ is sublinear in $T$.

Note the slight abuse of notation here: $\mathcal{A}(x_t)$ is not a function but rather the output of the online learning algorithm computed on the given example using its internal state. Observe that the above definition is the natural extension of the $\gamma$-approximation guarantee of a standard classification weak learner in the statistical setting (Schapire and Freund (2012)) to regression problems in the online learning setting.

The weak learning algorithm is "weak" in the sense that it is only required to (a) learn linear loss functions, (b) operate under full-information feedback, and (c) $\gamma$-approximate the best predictor in its reference class $\mathcal{F}$, up to an additive regret. Our main result is an online boosting algorithm (Algorithm 2) that converts a weak online learning algorithm, as defined above, into a strong online learning algorithm. The resulting algorithm is "strong" in the sense that it (a) learns convex loss functions, (b) relies on bandit feedback only, and (c) approximates the best predictor in a larger class of functions, the convex hull of the base class $\mathcal{F}$, up to an additive regret.

### 3.1 Setting

At every round $t$, the learner predicts $y_t \in \mathcal{K}$, and receives the noisy bandit feedback $\ell_t(y_t) + \xi_t$, where the noise $\xi_t$ is drawn i.i.d. from a distribution $\mathcal{D}$. We make no distributional assumptions on the noise apart from it being zero-mean and bounded; w.l.o.g. we take the loss bound $M$ below to bound the noise as well (i.e., $|\xi_t| \le M$ for all $t$). Denote the diameter of the set $\mathcal{K}$ by $D$, denote by $G$ an upper bound on the norm of the gradients of $\ell \in \mathcal{L}$ over $\mathcal{K}$, and denote by $M$ an upper bound on the loss. Additionally, assume that the set $\mathcal{K}$ is endowed with a projection operation, which we denote by $\Pi_{\mathcal{K}}$, satisfying the following property:

###### Assumption 3.

The function $\Pi_{\mathcal{K}}$ satisfies, for any $\ell \in \mathcal{L}$ and any $y$, $\ell(\Pi_{\mathcal{K}}(y)) \le \ell(y)$.

Consider the following example, which demonstrates that Assumption 3 is in fact realistic: let the class of loss functions contain losses of the form $\ell(x) = \|x - y^*\|^2$ for some $y^* \in \mathcal{K}$, and let $\Pi_{\mathcal{K}}$ be the Euclidean projection onto $\mathcal{K}$. Indeed, it can be shown that for any $y$, $\ell(\Pi_{\mathcal{K}}(y)) \le \ell(y)$, simply by a generalization of the Pythagorean theorem.333Moreover, projections according to distances other than the Euclidean distance can be defined, in particular with respect to Bregman divergences, and an analogue of the generalized Pythagorean theorem remains valid (see e.g., Lemma 11.3 in Cesa-Bianchi and Lugosi (2006)). Thus, any class of loss functions measuring a Bregman divergence to some $y^* \in \mathcal{K}$ corresponds to a suitable projection operation, namely the corresponding Bregman projection.
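For intuition, the sketch below instantiates this example numerically with the unit ball as the decision set (our own toy setup), checking that projecting can only decrease a squared-distance loss:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}.

    For losses of the form l(x) = ||x - y_star||^2 with y_star in the ball,
    the generalized Pythagorean theorem gives l(project(y)) <= l(y).
    """
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)
```

More generally, Euclidean projection onto any convex set is a contraction toward every point of the set, which is exactly what Assumption 3 asks for squared-distance losses.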

### 3.2 Stochastic Gradients to Bandit Feedback

We build on the techniques shown in Section 2, and describe an implementation of the unbiased stochastic gradient oracle in the bandit setting. Recall that in the bandit feedback model, the only information revealed to the learner at iteration $t$ is the loss value at the point she has chosen. In particular, the learner does not know the loss she would have incurred had she chosen a different point.

We consider a more relaxed noisy multi-point bandit setting, in which the learner can choose several points for which the loss value will be observed. We remark that, unlike previous work on the multi-point bandit model (Agarwal et al. (2010); Duchi et al. (2015); Shamir (2017)), we consider noisy feedback, and do not require additional assumptions on the loss function, as we show next.

The idea is to combine the method in Algorithm 1 with gradient estimation techniques for the bandit setting, due to Flaxman et al. (2005). The approach of Flaxman et al. (2005) is based on constructing a simple estimate of the gradient, computed by evaluating the loss at a random point; in effect, one obtains the gradient of a smoothed approximation of the loss function. Note that since we construct a smoothed approximation of the loss, the smoothness assumption (Assumption 1) becomes redundant, as does the stochastic gradient oracle (Assumption 2). The following lemmas introduce the smoothed loss function and its properties:

###### Lemma 7 (Flaxman et al. (2005), Lemma 2.1).

Let $\mathcal{L}$ be a set of convex loss functions that are $L$-Lipschitz. For any $\ell \in \mathcal{L}$ and $\delta > 0$, define the smoothed function $\hat{\ell}(y) = \mathbb{E}_u[\ell(y + \delta u)]$, where $u$ is drawn uniformly at random from the unit ball. Then, $\hat{\ell}$ is differentiable with gradient

$$\nabla \hat{\ell}(y) = \mathbb{E}_v\left[\frac{d}{\delta}\, \ell(y + \delta v)\, v\right],$$

where $v$ is a unit vector drawn uniformly at random.
###### Lemma 8.

Let $\hat{\ell}$ be a smoothed function as defined in Lemma 7. Then, the following holds:

1. $\hat{\ell}$ is convex, $L$-Lipschitz, and for any $y$, $|\hat{\ell}(y) - \ell(y)| \le \delta L$.

2. For any $y, y'$, $\|\nabla \hat{\ell}(y) - \nabla \hat{\ell}(y')\| \le \frac{dL}{\delta} \|y - y'\|$. Thus, $\hat{\ell}$ is $\frac{dL}{\delta}$-smooth.

3. For any $y$ and unit vector $v$, $\left\|\frac{d}{\delta}\, \ell(y + \delta v)\, v\right\| \le \frac{dM}{\delta}$.
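The one-point estimator inside Lemma 7's gradient formula is exactly what the algorithm samples: a single loss query at a randomly perturbed point. A minimal sketch, with the function and parameter names our own:

```python
import numpy as np

def one_point_grad(loss, y, delta, rng):
    """One-point spherical gradient estimator (Flaxman et al., 2005):
    (d / delta) * loss(y + delta * v) * v, with v uniform on the unit sphere.
    Its expectation is the gradient of the smoothed loss, and its norm is at
    most d * M / delta for an M-bounded loss."""
    d = y.shape[0]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)          # uniform direction on the unit sphere
    return (d / delta) * loss(y + delta * v) * v
```

Averaging many draws for a linear loss recovers its constant gradient, which illustrates the unbiasedness claim.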

### 3.3 Algorithm and Analysis

At a high level, our boosting algorithm maintains oracle access to $N$ copies of a weak learning algorithm (see Definition 6), and iteratively produces predictions $y_t$, upon receiving an example $x_t$, by running a subroutine of an $N$-step optimization procedure. It generates a randomized gradient estimator of the function $\hat{\ell}_t$, a smoothed approximation of the loss function $\ell_t$,444We assume that one can indeed query $\ell_t$ at any perturbed point. This is w.l.o.g., since a standard technique (see Agarwal et al. (2010); Hazan (2016)) is to simply run the learners on a slightly smaller set, scaled such that the perturbed query point must lie in $\mathcal{K}$. Since $\delta$ can be arbitrarily small, the additional regret/error incurred is arbitrarily small. as shown in Lemmas 7 and 8. The estimator is used in place of exact optimization with true gradients.

To update parameters, the gradient estimates are fed to the weak learners as linear loss functions. Recall that a weak learner's prediction is not a function but rather the output of the algorithm computed on the given example using its internal state, after having observed $x_t$. Intuitively, boosting guides each weak learner to correct for the mistakes of the preceding learners. The output prediction of the boosting algorithm (Line 13) relies on the projection operation described in Assumption 3. A formal description is provided in Algorithm 2.
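Putting the pieces together, here is a much-simplified sketch of one such boosting round: stand-in weak learners, one-dimensional predictions, and no advantage rescaling or projection step. It is illustrative only, and not the paper's Algorithm 2 verbatim.

```python
import numpy as np

class ConstWeak:
    """Stand-in weak learner: predicts a fixed value, records the linear
    losses (gradient estimates) it is fed."""
    def __init__(self, value):
        self.value = value
        self.grads = []
    def predict(self, x):
        return self.value
    def update(self, x, g):
        self.grads.append(g)

def boosted_predict(weak_learners, x_t, bandit_loss, delta, rng):
    """One bandit online-boosting round, as described in the text: blend the
    weak learners' outputs with Frank-Wolfe steps, feeding each learner a
    one-point gradient estimate built from a single noisy loss query."""
    y = weak_learners[0].predict(x_t)
    for i, wl in enumerate(weak_learners, start=1):
        v = wl.predict(x_t)
        u = rng.choice([-1.0, 1.0])                         # 1-D "unit sphere"
        g = (1.0 / delta) * bandit_loss(y + delta * u) * u  # one-point estimate
        wl.update(x_t, g)                                   # fed as a linear loss
        step = 2.0 / (i + 1)
        y = (1.0 - step) * y + step * v
    return y
```

Note that each weak learner receives exactly one bounded gradient estimate per round, matching the $dM/\delta$ norm bound of Lemma 8, and the final prediction is a convex combination of the weak learners' outputs.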

The following theorem states the regret guarantees of Algorithm 2. We remark that although it uses expected regret as the performance metric, it can be converted to a guarantee that holds with high probability, using techniques similar to those used to obtain Theorem 3.

###### Theorem 9.

Given that the setting in Section 3.1 and Assumption 3 hold, and given oracle access to $N$ copies of an online weak learning algorithm (Definition 6) w.r.t. reference class $\mathcal{F}$ for linear losses, with regret $R_{\mathcal{A}}(T, dM/\delta)$, Algorithm 2 is an online learning algorithm w.r.t. the reference class $\mathrm{CH}(\mathcal{F})$, the convex hull of $\mathcal{F}$, for convex losses, such that for any $f \in \mathrm{CH}(\mathcal{F})$,

$$\mathbb{E}[R_{\mathcal{B}}(T)] = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(y_t) - \sum_{t=1}^{T} \ell_t(f(x_t))\right] \le \frac{2 d L D^2 T}{\delta \gamma^2 N} + \frac{R_{\mathcal{A}}(T, dM/\delta)}{\gamma} + 2 T \delta L.$$

Lastly, observe that the average regret converges to zero as $T \to \infty$, $N \to \infty$, and $\delta \to 0$ at suitable rates. While the requirement that $N \to \infty$ may raise concerns about computational efficiency, this is in fact analogous to the guarantee in the batch setting: the algorithms converge only when the number of boosting stages goes to infinity. Moreover, previous work on online boosting in the full-information setting gives a lower bound (Beygelzimer et al. (2015a), Theorem 4) which shows that this is indeed necessary.

## 4 Experiments

While the focus of this paper is a theoretical investigation of online boosting and projection-free algorithms with limited information, we have also performed experiments to evaluate our algorithms. We focused our empirical investigation on the more challenging task of online boosting with bandit feedback, proposed in Section 3. Algorithm 2 was implemented in NumPy, and the weak online learner was a linear model updated with FKM (Flaxman et al. (2005)), i.e., online projected gradient descent with spherical gradient estimators. To facilitate a fair comparison, we provided the baseline FKM model with $N$-point noisy bandit feedback, where $N$ is the number of weak learners of the corresponding boosting method; we denote this baseline by N-FKM. We also compare against the full-information setting, which amounts to the method used in previous work (Beygelzimer et al. (2015a), Algorithm 2), and against a linear model baseline updated with online gradient descent (OGD). Table 2 summarizes the average squared loss and the standard deviation; the last column reports the average relative loss decrease of boosting in the bandit setting compared to the N-FKM baseline.

The experiments we carried out were proposed by Beygelzimer et al. (2015a) for evaluating online boosting; they comprise several data sets for regression and classification tasks, obtained from the UCI machine learning repository (and further described in the supplementary material). For each experiment, we report average results over 20 different runs. In the bandit setting, each loss function evaluation was obtained with additive noise, drawn uniformly from a bounded interval, and gradients were estimated as in Algorithm 2. The only hyper-parameters tuned were the learning rate, the number of weak learners $N$, and the smoothing parameter $\delta$; we remark that a small number of weak learners is sufficient. Parameters were tuned based on progressive validation loss on half of the dataset; we report the progressive validation loss on the remaining half. Progressive validation is a standard online validation technique, where each training example is used for testing before it is used for updating the model (Blum et al. (1999)).
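Progressive validation is easy to state in code; the toy linear learner and names below are our own, for illustration only:

```python
import numpy as np

class LinOGD:
    """Toy linear learner trained by online gradient descent on squared loss."""
    def __init__(self, dim, eta=0.05):
        self.w = np.zeros(dim)
        self.eta = eta
    def predict(self, x):
        return float(self.w @ x)
    def update(self, x, y):
        self.w -= self.eta * 2.0 * (self.predict(x) - y) * x

def progressive_validation_loss(model, stream):
    """Progressive validation (Blum et al., 1999): each example is evaluated
    on *before* the model trains on it, so the average loss is an honest
    online estimate of performance."""
    losses = []
    for x, y in stream:
        pred = model.predict(x)   # test first ...
        losses.append((pred - y) ** 2)
        model.update(x, y)        # ... then train on the same example
    return float(np.mean(losses))
```

Because every prediction is made before the corresponding update, no example is ever scored by a model that has already seen it.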

## References

• A. Agarwal, O. Dekel, and L. Xiao (2010) Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pp. 28–40.
• N. Agarwal, N. Brukhim, E. Hazan, and Z. Lu (2019) Boosting for dynamical systems. arXiv preprint arXiv:1906.08720.
• A. Beygelzimer, E. Hazan, S. Kale, and H. Luo (2015a) Online gradient boosting. In Advances in Neural Information Processing Systems, pp. 2458–2466.
• A. Beygelzimer, S. Kale, and H. Luo (2015b) Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pp. 2323–2331.
• A. Blum, A. Kalai, and J. Langford (1999) Beating the hold-out: bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pp. 203–208.
• N. Brukhim, X. Chen, E. Hazan, and S. Moran (2020) Online agnostic boosting via regret minimization. arXiv preprint arXiv:2003.01150.
• N. Cesa-Bianchi and G. Lugosi (2006) Prediction, Learning, and Games. Cambridge University Press.
• L. Chen, C. Harshaw, H. Hassani, and A. Karbasi (2018) Projection-free online optimization with stochastic gradient: from convexity to submodularity. In International Conference on Machine Learning, pp. 814–823.
• L. Chen, M. Zhang, and A. Karbasi (2019) Projection-free bandit convex optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2047–2056.
• S. Chen, H. Lin, and C. Lu (2012) An online boosting algorithm with theoretical justifications. arXiv preprint arXiv:1206.6422.
• S. Chen, H. Lin, and C. Lu (2014) Boosting with online binary learners for the multiclass bandit problem. In International Conference on Machine Learning, pp. 342–350.
• J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono (2015) Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Transactions on Information Theory 61 (5), pp. 2788–2806.
• A. D. Flaxman, A. T. Kalai, and H. B. McMahan (2005) Online convex optimization in the bandit setting: gradient descent without a gradient. In ACM-SIAM Symposium on Discrete Algorithms (SODA).
• M. Frank and P. Wolfe (1956) An algorithm for quadratic programming. Naval Research Logistics Quarterly 3 (1-2), pp. 95–110.
• J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
• D. Garber and B. Kretzu (2019) Improved regret bounds for projection-free bandit convex optimization. arXiv preprint arXiv:1910.03374.
• D. Garber (2017) Efficient online linear optimization with approximation algorithms. In Advances in Neural Information Processing Systems, pp. 627–635.
• E. Hazan, W. Hu, Y. Li, and Z. Li (2018) Online improper learning with an approximation oracle. In Advances in Neural Information Processing Systems, pp. 5652–5660.
• E. Hazan and S. Kale (2012) Projection-free online learning. In 29th International Conference on Machine Learning (ICML), pp. 521–528.
• E. Hazan (2016) Introduction to online convex optimization. Foundations and Trends in Optimization 2 (3-4), pp. 157–325.
• T. Jin and H. Luo (2019) Learning adversarial MDPs with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192.
• Y. H. Jung, J. Goetz, and A. Tewari (2017) Online multiclass boosting. In Advances in Neural Information Processing Systems, pp. 919–928.
• Y. H. Jung and A. Tewari (2018) Online boosting algorithms for multi-label ranking. In International Conference on Artificial Intelligence and Statistics, pp. 279–287.
• A. Kalai and S. Vempala (2005) Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71 (3), pp. 291–307.
• C. Leistner, A. Saffari, P. M. Roth, and H. Bischof (2009) On robustness of on-line boosting: a competitive study. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1362–1369.
• L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean (2000) Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pp. 512–518.
• A. Mokhtari, H. Hassani, and A. Karbasi (2018) Stochastic conditional gradient methods: from convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554.
• G. Neu and G. Bartók (2016) Importance weighting without importance weights: an efficient algorithm for combinatorial semi-bandits. Journal of Machine Learning Research 17 (1), pp. 5355–5375.
• A. Rosenberg and Y. Mansour (2019) Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, pp. 2209–2218. Cited by: §1.
• R. E. Schapire and Y. Freund (2012) Boosting: Foundations and Algorithms. Cambridge university press. External Links: Document, ISBN 9780262017183, ISSN 1098-6596 Cited by: §1, §1, §3.
• O. Shamir (2017) An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research 18 (1), pp. 1703–1713. Cited by: §1.1, §3.2.
• C. Wang, Y. Wang, R. Schapire, et al. (2015) Functional frank-wolfe boosting for general loss functions. arXiv preprint arXiv:1510.02558. Cited by: §1.
• J. Xie, Z. Shen, C. Zhang, H. Qian, and B. Wang (2019) Stochastic recursive gradient-based methods for projection-free online learning. arXiv preprint arXiv:1910.09396. Cited by: §1.1, Table 1, §1, §2.
• D. T. Zhang, Y. H. Jung, and A. Tewari (2018) Online multiclass boosting with bandit feedback. arXiv preprint arXiv:1810.05290. Cited by: §1.1.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928–936. Cited by: §1.1, §2.

## Appendix A Technical Lemmas

In this section we give several useful claims and lemmas that are used in the main analysis.

###### Lemma 10.

Let $\ell$ be any convex, $\beta$-smooth function. Let $\mathcal{K} \subseteq \mathbb{R}^n$ be a set of points with bounded diameter $D$. Let $z, z_0 \in \mathcal{K}$, and let $\tilde{z}_1, \dots, \tilde{z}_N$ be such that $\frac{1}{\gamma}\tilde{z}_i \in \mathcal{K}$. Let $\eta_i \in [0,1]$, and $\gamma \in (0,1]$. Define,

$$z_i = (1-\eta_i)\,z_{i-1} + \frac{\eta_i}{\gamma}\tilde{z}_i,$$

and a random variable $g_i$, such that $\mathbb{E}[g_i] = \nabla\ell(z_{i-1})$. Denote $\zeta_i = (\nabla\ell(z_{i-1}) - g_i)^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big)$. Then, for any $z \in \mathcal{K}$,

$$\ell(z_i) - \ell(z) \;\le\; (1-\eta_i)\big(\ell(z_{i-1}) - \ell(z)\big) + \eta_i\Big(g_i^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big) + \frac{\eta_i \beta D^2}{2\gamma^2} + \zeta_i\Big).$$
###### Proof.

We have,

$$\begin{aligned}
\ell(z_i) &= \ell\big(z_{i-1} + \eta_i(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1})\big) \\
&\le \ell(z_{i-1}) + \eta_i \nabla\ell(z_{i-1})^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\big) + \frac{\eta_i^2\beta}{2}\Big\|\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\Big\|^2 \\
&\le \ell(z_{i-1}) + \eta_i \nabla\ell(z_{i-1})^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\big) + \frac{\eta_i^2\beta D^2}{2\gamma^2},
\end{aligned} \tag{3}$$

where the inequalities follow from the $\beta$-smoothness of $\ell$, and the bound $D$ on the diameter of $\mathcal{K}$, respectively. Observe that,

$$\begin{aligned}
\nabla\ell(z_{i-1})^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\big)
&= g_i^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\big) + (\nabla\ell(z_{i-1}) - g_i)^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\big) \\
&= g_i^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big) + g_i^\top(z - z_{i-1}) + (\nabla\ell(z_{i-1}) - g_i)^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1}\big) \\
&= g_i^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big) + \nabla\ell(z_{i-1})^\top(z - z_{i-1}) + (\nabla\ell(z_{i-1}) - g_i)^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big) \\
&\le g_i^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big) + \ell(z) - \ell(z_{i-1}) + (\nabla\ell(z_{i-1}) - g_i)^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big),
\end{aligned} \tag{4}$$

where the first equality adds and subtracts the term $g_i^\top(\tfrac{1}{\gamma}\tilde{z}_i - z_{i-1})$, the second adds and subtracts the term $g_i^\top z$, the third adds and subtracts the term $\nabla\ell(z_{i-1})^\top z$, and the final inequality uses convexity: $\nabla\ell(z_{i-1})^\top(z - z_{i-1}) \le \ell(z) - \ell(z_{i-1})$.

Combining (3) and (4), and the definition of $\zeta_i$, we have that,

$$\ell(z_i) - \ell(z) \;\le\; (1-\eta_i)\big(\ell(z_{i-1}) - \ell(z)\big) + \frac{\eta_i^2\beta D^2}{2\gamma^2} + \eta_i\Big(g_i^\top\big(\tfrac{1}{\gamma}\tilde{z}_i - z\big) + \zeta_i\Big). \qquad ∎$$
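As a sanity check (not part of the original analysis), the per-step inequality can be verified numerically: since $g_i^\top(\tfrac{1}{\gamma}\tilde{z}_i - z) + \zeta_i = \nabla\ell(z_{i-1})^\top(\tfrac{1}{\gamma}\tilde{z}_i - z)$, the noise cancels and the bound holds for every realization of $g_i$, not only in expectation. A minimal sketch, assuming $\mathcal{K}$ is the unit Euclidean ball (so $D = 2$) and $\ell(z) = \|z\|^2$ (so $\beta = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, gamma, beta, D = 5, 200, 0.5, 2.0, 2.0  # K = unit ball, diameter D = 2

def ell(z):   # convex and beta-smooth with beta = 2
    return float(z @ z)

def grad(z):
    return 2.0 * z

def sample_K():  # random point in the unit ball
    v = rng.normal(size=d)
    return v / np.linalg.norm(v) * rng.uniform(0, 1)

z_prev, z_star = sample_K(), sample_K()  # z_0 and the comparator z
ok = True
for i in range(1, N + 1):
    eta = 2.0 / (i + 1)
    z_tilde = gamma * sample_K()                       # tilde z_i in gamma*K
    g = grad(z_prev) + rng.normal(scale=0.1, size=d)   # noisy gradient, E[g] = grad
    w = z_tilde / gamma
    z_new = z_prev + eta * (w - z_prev)                # the update of Lemma 10
    zeta = (grad(z_prev) - g) @ (w - z_star)
    lhs = ell(z_new) - ell(z_star)
    rhs = ((1 - eta) * (ell(z_prev) - ell(z_star))
           + eta**2 * beta * D**2 / (2 * gamma**2)
           + eta * (g @ (w - z_star) + zeta))
    ok &= lhs <= rhs + 1e-9                            # inequality holds pointwise
    z_prev = z_new

print(ok)  # → True
```

The check passes for every step and every noise draw, as the cancellation argument predicts.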

###### Claim 11.

Define $\eta_i = \frac{2}{i+1}$, for some $N \in \mathbb{N}$ and all $i \in [N]$. Let $C_1, C_2 \ge 0$ be some constants, and define $\phi_0, \phi_1, \dots, \phi_N \in \mathbb{R}$, such that,

$$\phi_i \;\le\; (1-\eta_i)\phi_{i-1} + \frac{\eta_i^2 C_1}{2} + \eta_i C_2.$$

Then, it holds that $\phi_i \le \frac{2C_1}{i+1} + C_2$ for all $i \in [N]$.

###### Proof.

We prove by induction over $i$. For $i = 1$, since $\eta_1 = 1$, the assumption implies that $\phi_1 \le \frac{C_1}{2} + C_2 \le \frac{2C_1}{2} + C_2$. Thus, the base case of the induction holds. Now assume the claim holds for $i = k$, and we will prove it holds for $i = k+1$. By the induction hypothesis,

$$\begin{aligned}
\phi_{k+1} &\le \Big(1 - \frac{2}{k+2}\Big)\phi_k + \frac{2C_1}{(k+2)^2} + \frac{2C_2}{k+2} \\
&\le \frac{k}{k+2}\Big(\frac{2C_1}{k+1} + C_2\Big) + \frac{2C_1}{(k+2)^2} + \frac{2C_2}{k+2} \\
&= \frac{2C_1}{k+2}\Big(\frac{k}{k+1} + \frac{1}{k+2}\Big) + C_2 \;\le\; \frac{2C_1}{k+2} + C_2. \qquad ∎
\end{aligned}$$
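The recursion in Claim 11 is easy to probe numerically: running it with equality (the worst case allowed by the assumption) stays within the claimed bound $\frac{2C_1}{i+1} + C_2$. A small sketch with arbitrary constants:

```python
# Numerical check of Claim 11: iterate the recursion with equality and
# verify phi_i <= 2*C1/(i+1) + C2 at every step. C1, C2 are arbitrary.
C1, C2, N = 3.7, 1.3, 1000
phi = 0.0            # phi_0; any finite start works since eta_1 = 1 erases it
bounds_hold = True
for i in range(1, N + 1):
    eta = 2.0 / (i + 1)
    phi = (1 - eta) * phi + eta**2 * C1 / 2 + eta * C2
    bounds_hold &= phi <= 2 * C1 / (i + 1) + C2 + 1e-12

print(bounds_hold)  # → True
```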

## Appendix B Projection-free OCO with Stochastic Gradients: Proofs

### B.1 Proof of Lemma 4

###### Proof.
$$\begin{aligned}
\mathbb{E}\big[\ell^i_t(x_{t,i})\big] &= \mathbb{E}\big[g_{t,i}^\top x_{t,i}\big] && \text{(definition of } \ell^i_t(\cdot)\text{)} \\
&= \mathbb{E}_{\mathcal{I}^{i-1}_t}\Big[\mathbb{E}\big[g_{t,i}^\top x_{t,i} \,\big|\, \mathcal{I}^{i-1}_t\big]\Big] && \text{(law of total expectation)} \\
&= \mathbb{E}_{\mathcal{I}^{i-1}_t}\Big[\mathbb{E}_{g_{t,i}}\big[g_{t,i} \,\big|\, \mathcal{I}^{i-1}_t\big]^\top \mathbb{E}_{\mathcal{A}_i}\big[x_{t,i} \,\big|\, \mathcal{I}^{i-1}_t\big]\Big] && \text{(conditional independence)} \\
&= \mathbb{E}\big[\nabla\ell_t(x^{i-1}_t)^\top x_{t,i}\big] && \text{(since } \mathbb{E}[g_{t,i}] = \nabla\ell_t(x^{i-1}_t)\text{)}
\end{aligned}$$

Here $\mathcal{I}^{i-1}_t$ denotes the $\sigma$-algebra measuring all sources of randomness up to time $(t, i-1)$, and the inner expectations in the third line are taken w.r.t. the gradient stochasticity and $\mathcal{A}_i$'s internal randomness, respectively. $∎$
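The argument hinges on $g_{t,i}$ being conditionally independent of $x_{t,i}$ given the history. A quick Monte-Carlo illustration of this step (with stand-in distributions, not the paper's actual learner): when the gradient noise is independent of the learner's randomness, $\mathbb{E}[g^\top x]$ matches $\mathbb{E}[\nabla\ell^\top x]$ up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M = 3, 200_000
grad_true = np.array([1.0, -2.0, 0.5])   # stands in for the true gradient

# x is produced by the learner's own randomness; g is an independent,
# unbiased estimate of the gradient (hypothetical distributions).
x = rng.normal(size=(M, d))
g = grad_true + rng.normal(size=(M, d))

lhs = float(np.mean(np.sum(g * x, axis=1)))  # estimates E[g^T x]
rhs = float(np.mean(x @ grad_true))          # estimates E[grad^T x]
print(abs(lhs - rhs) < 0.05)  # → True
```

With correlated noise (e.g. `g` built from `x` itself), the two averages would no longer agree, which is exactly why the proof conditions on $\mathcal{I}^{i-1}_t$.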

### B.2 Proof of Proposition 5

###### Proof.

Let $x_{t,i}$ be the output of the OLO algorithm $\mathcal{A}_i$ at time $t$, and let $x^*$ be any point in $\mathcal{K}$. The regret definition of $\mathcal{A}_i$ (Definition 1), and the definition of $\ell^i_t$ in Algorithm 1, imply that:

$$\mathbb{E}\Big[\sum_{t=1}^T g_{t,i}^\top x_{t,i} - \sum_{t=1}^T g_{t,i}^\top x^*\Big] \;\le\; R_{\mathcal{A}}(T). \tag{5}$$

By applying Lemma 10, we have,

$$\Delta_i \;\le\; (1-\eta_i)\Delta_{i-1} + \frac{\eta_i^2\beta D^2 T}{2} + \eta_i \sum_{t=1}^T \Big(g_{t,i}^\top(x_{t,i} - x^*) + \zeta_{t,i}\Big),$$

where $\Delta_i = \sum_{t=1}^T \big(\ell_t(x^i_t) - \ell_t(x^*)\big)$, and $\zeta_{t,i} = (\nabla\ell_t(x^{i-1}_t) - g_{t,i})^\top(x_{t,i} - x^*)$, for all $t \in [T]$, $i \in [N]$. Take expectations on both sides. By Lemma 4, we have $\mathbb{E}[\zeta_{t,i}] = 0$, and by the OLO guarantee (5), we get that,

$$\mathbb{E}[\Delta_i] \;\le\; (1-\eta_i)\mathbb{E}[\Delta_{i-1}] + \frac{\eta_i^2\beta D^2 T}{2} + \eta_i R_{\mathcal{A}}(T).$$

By Claim 11, we get for all $i \in [N]$ that,

$$\mathbb{E}[\Delta_i] \;\le\; \frac{2\beta D^2 T}{i+1} + R_{\mathcal{A}}(T). \tag{6}$$

Applying the bound in Equation (6) for $i = N$ concludes the proof. ∎
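Algorithm 1 itself is not reproduced in this appendix, but the proof above suggests its structure: $N$ online linear optimizers $\mathcal{A}_1, \dots, \mathcal{A}_N$ are combined by the convex-combination update $x^i_t = (1-\eta_i)x^{i-1}_t + \eta_i x_{t,i}$, the point $x_t = x^N_t$ is played, and each $\mathcal{A}_i$ is fed the linear loss $g_{t,i}^\top x$ built from a stochastic gradient at $x^{i-1}_t$. The following is a schematic sketch of that structure only; the quadratic losses, the initialization $x^0_t = 0$, and the stand-in projected-OGD oracle are all assumptions for illustration (the paper's contribution is precisely to plug in projection-free OLOs instead).

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, T = 4, 20, 300

target = np.array([0.3, -0.2, 0.1, 0.0])     # hypothetical loss minimizer
def loss(x): return float((x - target) @ (x - target))
def grad(x): return 2.0 * (x - target)

def proj_ball(x, r=1.0):                     # K = unit Euclidean ball
    n = np.linalg.norm(x)
    return x if n <= r else x * (r / n)

# Stand-in OLO copies: projected OGD on the linear losses g_{t,i}^T x.
olo_x = np.zeros((N, d))                     # current iterates x_{t,i}
lr = 2.0 / np.sqrt(T)

total = 0.0
for t in range(T):
    x_i = np.zeros(d)                        # x_t^0 (assumed initialization)
    gs = []
    for i in range(1, N + 1):
        eta = 2.0 / (i + 1)
        g = grad(x_i) + rng.normal(scale=0.1, size=d)  # stochastic gradient
        gs.append(g)
        x_i = (1 - eta) * x_i + eta * olo_x[i - 1]     # x_t^i
    total += loss(x_i)                       # play x_t = x_t^N
    for i in range(N):                       # feed each OLO its linear loss
        olo_x[i] = proj_ball(olo_x[i] - lr * gs[i])

avg_loss = total / T
```

The played point always stays in $\mathcal{K}$, since it is a convex combination of points in $\mathcal{K}$; the average loss then tracks the guarantee of Proposition 5 up to the stand-in oracle's regret.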

## Appendix C High probability bounds for Projection-Free OCO with Stochastic Gradients

In this section we give a high-probability regret bound for Algorithm 1. Observe that when the variance of the base OLO algorithm is unbounded, the regret guarantees cannot hold with high probability. Thus, we slightly modify the OLO definition so that the regret bound itself holds w.h.p. This is w.l.o.g., as there are projection-free OLO algorithms for which such guarantees hold, as described in Theorem 3.

###### Definition 12.

Let $\mathcal{C}$ denote a class of linear loss functions, $\ell(x) = g^\top x$. An online learning algorithm $\mathcal{A}$ is an Online Linear Optimizer (OLO) for $\mathcal{C}$ w.r.t. $\rho > 0$, if for any $T$, and any sequence of losses $\ell_1, \dots, \ell_T \in \mathcal{C}$, w.p. at least $1-\rho$, the algorithm has regret $R_{\mathcal{A}}(T)$ w.r.t. any fixed $x^* \in \mathcal{K}$, that is sublinear in $T$.

We can now derive the following proposition (corresponding to Proposition 5 of the expected case):

###### Proposition 13.

Given that Assumptions 1–2 hold, and given oracle access to $N$ copies of an OLO algorithm for linear losses with regret $R_{\mathcal{A}}(T)$, Algorithm 1 is an OCO algorithm which only requires $N$ stochastic gradient oracle calls per iteration, such that for any $\rho > 0$, and any sequence of convex losses $\ell_1, \dots, \ell_T$ over a convex set $\mathcal{K}$, w.p. at least $1-\rho$,

$$\sum_{t=1}^T \ell_t(x_t) - \inf_{x^* \in \mathcal{K}} \sum_{t=1}^T \ell_t(x^*) \;\le\; \frac{2\beta D^2 T}{N} + R_{\mathcal{A}}(T) + (\sigma + G)D\sqrt{2T\log(4N/\rho)}.$$
###### Proof.

Let $x_{t,i}$ be the output of the OLO algorithm $\mathcal{A}_i$ at time $t$, and let $x^*$ be any point in $\mathcal{K}$. The regret definition of $\mathcal{A}_i$ (Definition 12), and the definition of $\ell^i_t$ in Algorithm 1, imply that for each $i \in [N]$ we have that, w.p. at least $1 - \rho/2N$,

$$\sum_{t=1}^T g_{t,i}^\top x_{t,i} - \sum_{t=1}^T g_{t,i}^\top x^* \;\le\; R_{\mathcal{A}}(T). \tag{7}$$

By applying Lemma 10, and by the OLO guarantee (7), we get that,

$$\Delta_i \;\le\; (1-\eta_i)\Delta_{i-1} + \frac{\eta_i^2\beta D^2 T}{2} + \eta_i\Big(R_{\mathcal{A}}(T) + \sum_{t=1}^T \zeta_{t,i}\Big), \tag{8}$$

where $\Delta_i = \sum_{t=1}^T \big(\ell_t(x^i_t) - \ell_t(x^*)\big)$, and $\zeta_{t,i} = (\nabla\ell_t(x^{i-1}_t) - g_{t,i})^\top(x_{t,i} - x^*)$, for all $t \in [T]$, $i \in [N]$. By applying the union bound, the above inequality holds for all $i \in [N]$, with probability at least $1 - \rho/2$.

For any fixed $i \in [N]$, observe that $\mathbb{E}\big[\zeta_{t,i} \,\big|\, \mathcal{I}^{i-1}_t\big] = 0$ by Lemma 4. Therefore, $\{\zeta_{t,i}\}_{t=1}^T$ is a martingale difference sequence. Moreover, by the Cauchy–Schwarz inequality, we have,

$$|\zeta_{t,i}| \;\le\; \big\|\nabla\ell_t(x^{i-1}_t) - g_{t,i}\big\| \cdot \big\|x_{t,i} - x^*\big\| \;\le\; \Big(\big\|\nabla\ell_t(x^{i-1}_t)\big\| + \|g_{t,i}\|\Big) \cdot D \;\le\; (\sigma + G)D,$$

where the second inequality follows from the triangle inequality, and the last inequality follows from the diameter bound $D$ on the set $\mathcal{K}$, the bound $G$ on the gradient norm, and the bound $\sigma$ on the stochastic gradient estimate (Assumption 2). Let $c_t = (\sigma+G)D$ and $\lambda = (\sigma+G)D\sqrt{2T\log(4N/\rho)}$; by the Azuma–Hoeffding inequality,

$$\mathbb{P}\Big[\Big|\sum_{t=1}^T \zeta_{t,i}\Big| \ge \lambda\Big] \;\le\; 2\exp\Big(-\frac{\lambda^2}{2\sum_{t=1}^T c_t^2}\Big) \;=\; \rho/2N.$$

Observe that, by applying the union bound, the above inequality holds for all $i \in [N]$, with probability at least $1 - \rho/2$. Therefore, by combining the above with (8), and applying the union bound once more, we get that w.p. at least