# Parameter-free online learning via model selection

We introduce an efficient algorithmic framework for model selection in online learning, also known as parameter-free online learning. Departing from previous work, which has focused on highly structured function classes such as nested balls in Hilbert space, we propose a generic meta-algorithm framework that achieves online model selection oracle inequalities under minimal structural assumptions. We give the first computationally efficient parameter-free algorithms that work in arbitrary Banach spaces under mild smoothness assumptions; previous results applied only to Hilbert spaces. We further derive new oracle inequalities for matrix classes, non-nested convex sets, and R^d with generic regularizers. Finally, we generalize these results by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. These results are all derived through a unified meta-algorithm scheme using a novel "multi-scale" algorithm for prediction with expert advice based on random playout, which may be of independent interest.


## 1 Introduction

A key problem in the design of learning algorithms is the choice of the hypothesis set. This is known as the model selection problem. The choice of hypothesis set is driven by inherent trade-offs. In the statistical learning setting, this can be analyzed in terms of the estimation and approximation errors. A richer or more complex hypothesis set helps better approximate the Bayes predictor (smaller approximation error). On the other hand, a hypothesis set that is too complex may have too large a VC-dimension or have unfavorable Rademacher complexity, thereby resulting in looser guarantees on the difference between the loss of a hypothesis and that of the best-in-class (large estimation error).

In the batch setting, this problem has been extensively studied with the main ideas originating in the seminal work of Vapnik and Chervonenkis (1971) and Vapnik (1982) and the principle of Structural Risk Minimization (SRM). This is typically formulated as follows: given an infinite sequence of hypothesis sets (or models), the problem consists of using the training sample to select a hypothesis set with a favorable estimation-approximation trade-off and choosing the best hypothesis in that set.

If we had access to a hypothetical oracle informing us of the best choice of hypothesis set for a given instance, the problem would reduce to the standard one of learning with a fixed hypothesis set. Remarkably though, techniques such as SRM or similar penalty-based model selection methods return a hypothesis that enjoys finite-sample learning guarantees that are almost as favorable as those that would be obtained had an oracle informed us of the index of the best-in-class classifier’s hypothesis set (Vapnik, 1982; Devroye et al., 1996; Shawe-Taylor et al., 1998; Koltchinskii, 2001; Bartlett et al., 2002; Massart, 2007). Such guarantees are sometimes referred to as oracle inequalities. They can be derived even for data-dependent penalties (Koltchinskii, 2001; Bartlett et al., 2002; Bartlett and Mendelson, 2003).

Such results naturally raise the following questions in the online setting: can we develop an analogous theory of model selection in online learning? Can we design online algorithms for model selection with solutions benefiting from strong guarantees, analogous to the batch ones? Unlike the statistical setting, in online learning one cannot split samples to first learn the optimal predictor within each subclass and then later learn the optimal subclass choice.

A series of recent works on online learning provide some positive results along that direction. On the algorithmic side, McMahan and Abernethy (2013); McMahan and Orabona (2014); Orabona (2014); Orabona and Pál (2016) present solutions that efficiently achieve model selection oracle inequalities for the important special case where the hypothesis sets form a sequence of nested balls in a Hilbert space. On the theoretical side, a different line of work focusing on general hypothesis classes (Foster et al., 2015) uses martingale-based sequential complexity measures to show that, information-theoretically, one can obtain oracle inequalities in the online setting at a level of generality comparable to that of batch statistical learning. However, this last result is not algorithmic.

The first approach that a familiar reader might think of for tackling the online model selection problem is to run, for each hypothesis set in the sequence, an online learning algorithm that minimizes regret against that set, and then aggregate over these algorithms using the multiplicative weights algorithm for prediction with expert advice. This would work if all the losses or “experts” considered were uniformly bounded by a reasonably small quantity. However, in many reasonable problems, particularly those arising in the context of online convex optimization, the losses of predictors or experts for each hypothesis set may grow with the index $k$ of the set. Using simple aggregation would scale our regret with the magnitude of the largest hypothesis set and not that of the comparator we want to compete against. This is the main technical challenge faced in this context, and one that we fully address in this paper.
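To make the scaling issue concrete, the following is a minimal sketch of standard multiplicative weights (Hedge) over experts whose losses live on very different scales. It is not the algorithm proposed in this paper. The function names, the fixed tuning $\eta = \sqrt{2 \ln N / n} / c_{\max}$, and the toy loss sequence are illustrative assumptions. The point is only that the classical guarantee, $c_{\max} \sqrt{2 n \ln N}$, is governed by the largest range, even when we compare against an expert with small losses.

```python
import math


def hedge_expected_regret(loss_rounds, c_max):
    """Run Hedge on a list of loss vectors; return expected regret to each expert.

    Assumes every loss lies in [-c_max, c_max]. The learning rate is the
    classical fixed tuning for that range (an illustrative choice).
    """
    n, num_experts = len(loss_rounds), len(loss_rounds[0])
    eta = math.sqrt(2.0 * math.log(num_experts) / n) / c_max
    cumulative = [0.0] * num_experts
    learner_loss = 0.0
    for losses in loss_rounds:
        shift = min(cumulative)  # subtract the minimum for numerical stability
        weights = [math.exp(-eta * (c - shift)) for c in cumulative]
        total = sum(weights)
        # Expected loss when an expert is sampled from the normalized weights.
        learner_loss += sum(w / total * g for w, g in zip(weights, losses))
        cumulative = [c + g for c, g in zip(cumulative, losses)]
    return [learner_loss - c for c in cumulative]


def hedge_bound(n, num_experts, c_max):
    """Classical Hedge guarantee for losses in [-c_max, c_max]."""
    return c_max * math.sqrt(2.0 * n * math.log(num_experts))


# Toy instance: expert 0 has losses in [-1, 1], expert 1 in [-100, 100].
rounds = [[(-1.0) ** t, 100.0 * (-1.0) ** (t // 2)] for t in range(400)]
regrets = hedge_expected_regret(rounds, c_max=100.0)
print(regrets, hedge_bound(400, 2, 100.0))
```

Under these assumptions the guarantee against the small-scale expert is a hundred times larger than the one obtained if only that expert's range mattered, which is the gap a multi-scale method is meant to close.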

Our results are based on a novel multi-scale algorithm for prediction with expert advice. This algorithm works in a situation where the different experts’ losses lie in different ranges, and guarantees that the regret to each individual expert is adapted to the range of its losses. The algorithm can also take advantage of a given prior over the experts reflecting their importance. This general, abstract setting of prediction with expert advice yields online model selection algorithms for a host of applications detailed below in a straightforward manner.

First, we give efficient algorithms for model selection for nested linear classes that provide oracle inequalities in terms of the norm of the benchmark to which the algorithm’s performance is compared. Our algorithm works for any norm, which considerably generalizes previous work (McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona, 2014; Orabona and Pál, 2016) and gives the first polynomial time online model selection for a number of online linear optimization settings. This includes online oracle inequalities for high-dimensional learning tasks such as online PCA and online matrix prediction. We then generalize these results even further by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. This yields algorithms for applications such as online penalized risk minimization and multiple kernel learning.

### 1.1 Preliminaries

#### Notation.

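For a given norm $\|\cdot\|$, let $\|\cdot\|_\star$ denote the dual norm. Likewise, for any function $F$, $F^\star$ will denote its Fenchel conjugate. For a Banach space $(\mathfrak{B}, \|\cdot\|)$, the dual is $(\mathfrak{B}^\star, \|\cdot\|_\star)$. We use $x_{1:n}$ as shorthand for a sequence of vectors $(x_1, \ldots, x_n)$. For such sequences, we will use $x_t[i]$ to denote the $t$th vector’s $i$th coordinate. We let $e_i$ denote the $i$th standard basis vector. $\|\cdot\|_p$ denotes the $\ell_p$ norm, $\|\cdot\|_\sigma$ denotes the spectral norm, and $\|\cdot\|_\Sigma$ denotes the trace norm. For any $p \in [1, \infty]$, let $p'$ be such that $\frac{1}{p} + \frac{1}{p'} = 1$.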

#### Setup and goals.

We work in two closely related settings: online convex optimization and online supervised learning. In online convex optimization, the learner selects decisions $w_t$ from a convex subset $\mathcal{W}$ of some Banach space $\mathfrak{B}$. Regret to a comparator $w \in \mathcal{W}$ in this setting is defined as $\sum_{t=1}^{n} f_t(w_t) - \sum_{t=1}^{n} f_t(w)$, where $f_t$ is the loss function revealed in round $t$.

Suppose $\mathcal{W}$ can be decomposed into sets $\mathcal{W}_1, \mathcal{W}_2, \ldots$. For a fixed set $\mathcal{W}_k$, the optimal regret, if one tailors the algorithm to compete with $\mathcal{W}_k$, is typically characterized by some measure of intrinsic complexity of the class (such as Littlestone’s dimension (Ben-David et al., 2009) and sequential Rademacher complexity (Rakhlin et al., 2010)), denoted $\mathrm{Comp}_n(\mathcal{W}_k)$. We would like to develop algorithms that predict a sequence $(w_t)_{t \le n}$ such that

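$$\sum_{t=1}^{n} f_t(w_t) - \min_{w \in \mathcal{W}_k} \sum_{t=1}^{n} f_t(w) \le \mathrm{Comp}_n(\mathcal{W}_k) + \mathrm{Pen}_n(k) \quad \forall k. \tag{1}$$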

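This equation is called an oracle inequality and states that the performance of the sequence $(w_t)_{t \le n}$ matches that of a comparator that minimizes the bias-variance tradeoff $\min_k \left\{ \min_{w \in \mathcal{W}_k} \sum_{t=1}^{n} f_t(w) + \mathrm{Comp}_n(\mathcal{W}_k) \right\}$, up to a penalty $\mathrm{Pen}_n(k)$ whose scale ideally matches that of $\mathrm{Comp}_n(\mathcal{W}_k)$. We shall see shortly that ensuring that the scale of $\mathrm{Pen}_n(k)$ does indeed match $\mathrm{Comp}_n(\mathcal{W}_k)$ is the core technical challenge in developing online oracle inequalities for commonly used classes.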

In the supervised learning setting we measure regret against a benchmark class $\mathcal{F}$ of functions defined on $\mathcal{X}$, where $\mathcal{X}$ is some abstract context space, also called feature space. In this case, the desired oracle inequality has the form:

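$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}_k} \sum_{t=1}^{n} \ell(f(x_t), y_t) \le \mathrm{Comp}_n(\mathcal{F}_k) + \mathrm{Pen}_n(k) \quad \forall k. \tag{2}$$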

## 2 Online Model Selection

### 2.1 The need for multi-scale aggregation

Let us briefly motivate the main technical challenge overcome by the model selection approach we consider. The most widely studied oracle inequality in online learning has the following form

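$$\sum_{t=1}^{n} f_t(w_t) - \sum_{t=1}^{n} f_t(w) \le O\left( (\|w\|_2 + 1) \sqrt{n \cdot \log\left( (\|w\|_2 + 1) n \right)} \right) \quad \forall w \in \mathbb{R}^d. \tag{3}$$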

In light of (1), a model selection approach to obtaining this inequality would be to split the set $\mathbb{R}^d$ into $\ell_2$ norm balls of doubling radius, i.e. $\mathcal{W}_k = \{ w : \|w\|_2 \le 2^k \}$. A standard fact (Hazan, 2016) is that such a set has $\mathrm{Comp}_n(\mathcal{W}_k) = O(2^k \sqrt{n})$ if one optimizes over it using Mirror Descent, and so obtaining the oracle inequality (1) is sufficient to recover (3), so long as $\mathrm{Pen}_n(k)$ is not too large relative to $\mathrm{Comp}_n(\mathcal{W}_k)$.
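The doubling-radius decomposition can be sketched in a few lines. This is a minimal illustration, assuming the balls are indexed from $k = 0$ with radius $2^k$; the helper names `l2_norm` and `ball_index` are hypothetical. It shows why competing with the smallest ball that contains a comparator $w$ costs at most a factor of two in the radius.

```python
import math


def l2_norm(w):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in w))


def ball_index(w):
    """Smallest k >= 0 such that ||w||_2 <= 2**k (indexing from 0 is an assumption)."""
    radius = l2_norm(w)
    k = 0
    while 2 ** k < radius:
        k += 1
    return k


w = [3.0, 4.0]       # ||w||_2 = 5
k = ball_index(w)    # the smallest ball containing w has radius 2**3 = 8
print(k, 2 ** k)
```

Since $2^k \le 2 \max\{\|w\|_2, 1\}$ for this choice of $k$, a regret bound that scales with the radius of ball $k$ scales with $\|w\|_2 + 1$ up to a constant.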

Online model selection is fundamentally a problem of prediction with expert advice (Cesa-Bianchi and Lugosi, 2006), where the experts correspond to the different model classes one is choosing from. Our basic meta-algorithm, MultiScaleFTPL, operates in the following setup. The algorithm has access to a finite number, $N$, of experts. In each round, the algorithm is required to choose one of the $N$ experts. Then the losses of all experts are revealed, and the algorithm incurs the loss of the chosen expert.

The twist from the standard setup is that the losses of all the experts are not uniformly bounded in the same range. Indeed, for the setup described for the oracle inequality (3), class $\mathcal{W}_k$ will produce predictions with norm as large as $2^k$. Therefore, here, we assume that expert $i$ incurs losses in the range $[-c_i, c_i]$, for some known parameter $c_i \ge 0$. The goal is to design an online learning algorithm whose regret to expert $i$ scales with $c_i$, rather than $\max_i c_i$, which is what previous algorithms for learning from expert advice (such as the standard multiplicative weights strategy or AdaHedge (De Rooij et al., 2014)) would achieve. Indeed, any regret bound scaling in $\max_i c_i$ will be far too large to achieve (3), as that term will dominate. This new type of scale-sensitive regret bound, achieved by our algorithm MultiScaleFTPL, is stated below.

###### Theorem 1.

Suppose the loss sequence $(g_t)_{t \le n}$ satisfies $|g_t[i]| \le c_i$ for a sequence $(c_i)_{i \in [N]}$ with each $c_i \ge 1$. Let $\pi$ be a given prior distribution on the experts. Then, playing the strategy given by MultiScaleFTPL yields the following regret bound:

$$\mathbb{E}\left[ \sum_{t=1}^{n} \langle e_{i_t}, g_t \rangle - \sum_{t=1}^{n} \langle e_i, g_t \rangle \right] \le O\left( c_i \sqrt{n \log\left( n c_i / \pi_i \right)} \right) \quad \forall i \in [N]. \tag{4}$$

Footnote 1: This regret bound holds in expectation over the player’s randomization. It is assumed that each $g_t$ is selected before the randomized strategy $i_t$ is revealed, but may adapt to the distribution over $i_t$. In fact, a slightly stronger version of this bound holds, and a similar strengthening applies to all subsequent bounds.
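As a rough numerical illustration of what the scale-sensitive bound buys, the sketch below evaluates the right-hand side of (4) with the constant hidden in $O(\cdot)$ set to 1, and compares it with the classical experts rate driven by the largest range. The function names, the uniform prior, and the choice of scales $c_i = 2^k$ are assumptions made only for this example.

```python
import math


def multiscale_term(c_i, prior_i, n):
    """Right-hand side of (4) for one expert, with the O(.) constant set to 1."""
    return c_i * math.sqrt(n * math.log(n * c_i / prior_i))


def uniform_scale_term(c_max, num_experts, n):
    """Classical experts rate, which is driven by the largest range c_max."""
    return c_max * math.sqrt(n * math.log(num_experts))


n = 10_000
scales = [2.0 ** k for k in range(11)]        # c_i = 1, 2, ..., 1024
prior = [1.0 / len(scales)] * len(scales)     # uniform prior (an assumption)
for c_i, p_i in zip(scales, prior):
    print(c_i, multiscale_term(c_i, p_i, n),
          uniform_scale_term(max(scales), len(scales), n))
```

For the smallest expert the multi-scale term is smaller by orders of magnitude, while for the largest expert the two terms differ only through the logarithmic factor.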

The proof of the theorem is deferred to the supplementary material due to space constraints. Briefly, the proof follows the technique of adaptive relaxations from Foster et al. (2015). It relies on showing that the following function of the first $t$ loss vectors $g_{1:t}$ is an admissible relaxation (see Foster et al. (2015) for definitions):

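$$\mathrm{Rel}(g_{1:t}) := \mathbb{E}_{\sigma_{t+1}, \ldots, \sigma_T \in \{\pm 1\}^N} \sup_i \left[ -\sum_{s=1}^{t} \langle e_i, g_s \rangle + 4 \sum_{s=t+1}^{T} \sigma_s[i] c_i - B(i) \right].$$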

This implies that if we play the strategy given by MultiScaleFTPL, the regret to the $i$th expert is bounded by $B(i) + \mathrm{Rel}(\cdot)$, where $\mathrm{Rel}(\cdot)$ indicates the function $\mathrm{Rel}$ applied to an empty sequence of loss vectors. As a final step, we bound $\mathrm{Rel}(\cdot)$ using a probabilistic maximal inequality (given in the supplementary material), yielding the bound (4). Compared to related FTPL algorithms (Rakhlin et al., 2012), the analysis is surprisingly delicate, as additive factors can spoil the desired regret bound (4) if the $c_i$s differ by orders of magnitude.
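The relaxation above is an expectation over random signs, so it can be approximated by sampling. The following is a minimal Monte Carlo sketch, assuming the signs are independent and uniform on $\{\pm 1\}$ and treating the penalty $B(i)$ as a user-supplied list, since its exact form is given in the supplementary material. The function name `estimate_rel` and its parameters are hypothetical.

```python
import random


def estimate_rel(past_losses, scales, penalty, horizon, num_samples=2000, seed=0):
    """Monte Carlo estimate of Rel(g_{1:t}).

    past_losses: list of t loss vectors g_1, ..., g_t (each of length N)
    scales:      list of the N scales c_i
    penalty:     list of the N penalties B(i) (supplied by the caller)
    horizon:     total number of rounds T
    """
    rng = random.Random(seed)
    num_experts = len(scales)
    t = len(past_losses)
    cumulative = [sum(g[i] for g in past_losses) for i in range(num_experts)]
    total = 0.0
    for _ in range(num_samples):
        best = float("-inf")
        for i in range(num_experts):
            # Sum of the future random signs sigma_s[i] for s = t+1, ..., T.
            signs = sum(rng.choice((-1, 1)) for _ in range(horizon - t))
            value = -cumulative[i] + 4.0 * scales[i] * signs - penalty[i]
            best = max(best, value)
        total += best
    return total / num_samples


# With no future rounds the relaxation is the deterministic supremum.
print(estimate_rel([[1.0, -2.0], [0.5, 0.0]], [1.0, 2.0], [0.0, 1.0], horizon=2))
```

When $t$ equals the horizon there are no random signs left, and the estimate reduces exactly to $\sup_i [-\sum_s \langle e_i, g_s \rangle - B(i)]$, which is a convenient sanity check.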

The min-max optimization problem in MultiScaleFTPL can be solved in polynomial time using linear programming; see the supplementary material for a full discussion.

In related work, Bubeck et al. (2017) simultaneously developed a multi-scale experts algorithm which could also be used in our framework. Their regret bound has sub-optimal dependence on the prior distribution over experts, but their algorithm is more efficient and is able to obtain multiplicative regret guarantees.

### 2.2 Online convex optimization

MultiScaleFTPL can be applied to online optimization problems whenever it is possible to bound the losses of the different experts a priori. One such application is to online convex optimization, where each “expert” is a particular OCO algorithm, and for which such a bound can be obtained via appropriate bounds on the relevant norms of the parameter vectors and the gradients of the loss functions. We detail this application, which yields algorithms for parameter-free online learning and more, below. All of the algorithms in this section are derived using a unified meta-algorithm strategy, MultiScaleOCO.

The setup is as follows. We have access to $N$ sub-algorithms, indexed by $i \in [N]$. In round $t$, each sub-algorithm $i$ produces a prediction $w_t^i \in \mathcal{W}_i$, where $\mathcal{W}_i$ is a set in a vector space $V$ over $\mathbb{R}$ containing $0$. Our meta-algorithm is then required to choose one of the predictions $w_t^i$. Then, a loss function $f_t : V \to \mathbb{R}$ is revealed, whereupon sub-algorithm $i$ incurs loss $f_t(w_t^i)$, and the meta-algorithm suffers the loss of the chosen prediction. We make the following assumption on the sub-algorithms:

###### Assumption 1.

The sub-algorithms satisfy the following conditions:

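• For each $i \in [N]$, there is an associated norm $\|\cdot\|_{(i)}$ such that $\sup_{w \in \mathcal{W}_i} \|w\|_{(i)} \le R_i$.

• For each $i \in [N]$, the sequence of functions $f_t$ are $L_i$-Lipschitz on $\mathcal{W}_i$ with respect to $\|\cdot\|_{(i)}$.

• For each sub-algorithm $i$, the iterates $(w_t^i)_{t \le n}$ enjoy a regret bound $\sum_{t=1}^{n} f_t(w_t^i) - \inf_{w \in \mathcal{W}_i} \sum_{t=1}^{n} f_t(w) \le \mathrm{Reg}_n(i)$, where $\mathrm{Reg}_n(i)$ may be data- or algorithm-dependent.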

In most applications, $\mathcal{W}_i$ will be a convex set and $f_t$ a convex function; this convexity is not necessary to prove a regret bound for the meta-algorithm. We simply need boundedness of the set $\mathcal{W}_i$ and Lipschitzness of the functions $f_t$, as specified in Assumption 1. This assumption implies that for any $i$, we have $|f_t(w) - f_t(0)| \le R_i L_i$ for any $w \in \mathcal{W}_i$. Thus, we can design a meta-algorithm for this setup by using MultiScaleFTPL with $c_i = R_i L_i$, which is precisely what MultiScaleOCO does. The following theorem provides a bound on the regret of MultiScaleOCO; it is a direct consequence of Theorem 1.

###### Theorem 2.

Without loss of generality, assume that $R_i L_i \ge 1$. Suppose that the inputs to MultiScaleOCO satisfy Assumption 1. Then the iterates $(w_t)_{t \le n}$ returned by MultiScaleOCO satisfy the regret bound

$$\mathbb{E}\left[ \sum_{t=1}^{n} f_t(w_t) - \inf_{w \in \mathcal{W}_i} \sum_{t=1}^{n} f_t(w) \right] \le \mathbb{E}\left[ \mathrm{Reg}_n(i) \right] + O\left( R_i L_i \sqrt{n \log\left( R_i L_i n / \pi_i \right)} \right) \quad \forall i \in [N]. \tag{5}$$

Footnote 2: For notational convenience, all Lipschitz bounds are assumed to be at least $1$ without loss of generality for the remainder of the paper.

Theorem 2 shows that if we use MultiScaleOCO to aggregate the iterates produced by a collection of sub-algorithms, the regret against any sub-algorithm will only depend on that algorithm’s scale, not the regret of the worst sub-algorithm.

#### Application 1: Parameter-free online learning in uniformly convex Banach spaces.

As the first application of our framework, we give a generalization of the parameter-free online learning bounds found in McMahan and Abernethy (2013); McMahan and Orabona (2014); Orabona (2014); Orabona and Pál (2016); Cutkosky and Boahen (2016) from Hilbert spaces to arbitrary uniformly convex Banach spaces. Recall that a Banach space $(\mathfrak{B}, \|\cdot\|)$ is $(2, \lambda)$-uniformly convex if $\frac{1}{2}\|\cdot\|^2$ is $\lambda$-strongly convex with respect to itself (Pisier, 2011). Our algorithm obtains a generalization of the oracle inequality (3) for any uniformly convex $(\mathfrak{B}, \|\cdot\|)$ by running multiple instances of Mirror Descent, the workhorse of online convex optimization, and aggregating their iterates using MultiScaleOCO. This strategy is thus efficient whenever Mirror Descent can be implemented efficiently. The collection of sub-algorithms used by MultiScaleOCO, which was alluded to at the beginning of this section, is as follows: For each , set , , , , and . Finally, set .

Mirror Descent is reviewed in detail in the supplementary material, but the only feature of its performance of importance to our analysis is that, when configured as described above, the iterates produced by each sub-algorithm $i$ will satisfy $\mathrm{Reg}_n(i) \le O(R_i L \sqrt{n / \lambda})$ on any sequence of losses that are $L$-Lipschitz with respect to $\|\cdot\|$. Using just this simple fact, combined with the regret bound for MultiScaleOCO and a few technical details given in the supplementary material, we can deduce the following parameter-free learning oracle inequality:

###### Theorem 3 (Oracle inequality for uniformly convex Banach spaces).

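The iterates $(w_t)_{t \le n}$ produced by MultiScaleOCO on any $L$-Lipschitz (w.r.t. $\|\cdot\|$) sequence of losses $(f_t)_{t \le n}$ satisfy

$$\mathbb{E}\left[ \sum_{t=1}^{n} f_t(w_t) - \sum_{t=1}^{n} f_t(w) \right] \le O\left( L \cdot (\|w\| + 1) \sqrt{n \cdot \log\left( (\|w\| + 1) L n \right) / \lambda} \right) \quad \forall w \in \mathfrak{B}. \tag{6}$$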

Note that the above oracle inequality applies for any uniformly convex norm $\|\cdot\|$. Previous results only obtain bounds of this form efficiently when $\|\cdot\|$ is a Hilbert space norm or $\ell_1$. As is standard for such oracle inequality results, the bound is weaker than the optimal bound if $\|w\|$ were selected in advance, but only by a mild $\sqrt{\log((\|w\| + 1) L n)}$ factor.
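For intuition about the sub-algorithms, recall that in the Euclidean case Mirror Descent with the squared-norm regularizer reduces to projected online gradient descent. The sketch below is a minimal illustration of one such sub-algorithm on an $\ell_2$ ball, not the general Banach-space procedure. The step size $\eta = R / (L \sqrt{n})$, the restriction to linear losses $f_t(w) = \langle g_t, w \rangle$, and all function names are assumptions made for this example. With that tuning, the standard analysis gives regret at most $R L \sqrt{n}$ against the ball of radius $R$.

```python
import math


def project_l2_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [radius * x / norm for x in w]


def ogd_regret_linear(gradients, radius, lipschitz):
    """Projected online gradient descent on linear losses f_t(w) = <g_t, w>.

    Returns the regret against the best fixed point in the ball.
    """
    n, dim = len(gradients), len(gradients[0])
    eta = radius / (lipschitz * math.sqrt(n))
    w = [0.0] * dim
    learner_loss = 0.0
    grad_sum = [0.0] * dim
    for g in gradients:
        learner_loss += sum(wi * gi for wi, gi in zip(w, g))
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
        w = project_l2_ball([wi - eta * gi for wi, gi in zip(w, g)], radius)
    # For linear losses the best comparator is -R * grad_sum / ||grad_sum||.
    best_loss = -radius * math.sqrt(sum(s * s for s in grad_sum))
    return learner_loss - best_loss


grads = [[1.0, 0.0] if t % 3 else [0.0, -1.0] for t in range(100)]
print(ogd_regret_linear(grads, radius=2.0, lipschitz=1.0))
```

Running one copy of this routine per radius and aggregating the copies is the pattern that MultiScaleOCO follows, with the aggregation step handled by the multi-scale experts algorithm rather than by anything shown here.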

###### Proposition 1.

The algorithm can be implemented in time per iteration, where is the time complexity of a single Mirror Descent update.

In the example above, the -uniform convexity condition was mainly chosen for familiarity. The result can easily be generalized to related notions such as -uniform convexity (see Srebro et al. (2011)). More generally, the approach can be used to derive oracle inequalities with respect to general strongly convex regularizer defined over the space . Such a bound would have the form for typical choices of .

This example captures well-known quantile bounds (Koolen and Van Erven, 2015) when one takes to be the KL-divergence and to be the simplex, or, in the matrix case, takes to be the quantum relative entropy and to be the set of density matrices, as in Hazan et al. (2017).

#### Application 2: Oracle inequality for many ℓp norms.

It is instructive to think of MultiScaleOCO as executing a (scale-sensitive) online analogue of the structural risk minimization principle. We simply specify a set of subclasses and a prior specifying the importance of each subclass, and we are guaranteed that the algorithm’s performance matches that of each subclass, plus a penalty depending on the prior weight placed on that subclass. The advantage of this approach is that the nested structure used in Theorem 3 is completely inessential. This leads to the exciting prospect of developing parameter-free algorithms over new and exotic set systems. One such example is given now: The MultiScaleOCO framework allows us to obtain an oracle inequality with respect to many $\ell_p$ norms in $\mathbb{R}^d$ simultaneously. To the best of our knowledge, all previous works on parameter-free online learning have only provided oracle inequalities for a single norm.

###### Theorem 4.

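Fix $\delta > 0$. Suppose that the loss functions $(f_t)_{t \le n}$ are $L_p$-Lipschitz w.r.t. $\|\cdot\|_p$ for each $p \in [1 + \delta, 2]$. Then there is a computationally efficient algorithm that guarantees regret

$$\mathbb{E}\left[ \sum_{t=1}^{n} f_t(w_t) - \sum_{t=1}^{n} f_t(w) \right] \le O\left( (\|w\|_p + 1) L_p \sqrt{n \log\left( (\|w\|_p + 1) L_p \log(d) n \right) / (p - 1)} \right) \quad \forall w \in \mathbb{R}^d, \ \forall p \in [1 + \delta, 2]. \tag{7}$$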

The configuration in the above theorem is described in full in the supplementary material. This strategy can be trivially extended to handle $p$ in the range $(2, \infty)$. The inequality holds for $p \ge 1 + \delta$ rather than for $p \ge 1$ because the $\ell_1$ norm is not uniformly convex, but this is easily rectified by changing the regularizer at $p = 1$; we omit this for simplicity of presentation.
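To see why holding for many values of $p$ at once is useful, the sketch below evaluates the right-hand side of (7), with the constant in $O(\cdot)$ set to 1, for a sparse comparator and a dense gradient. It assumes linear losses $f_t(w) = \langle g, w \rangle$ with a fixed gradient $g$, so that the Lipschitz constant with respect to $\|\cdot\|_p$ is the dual norm $\|g\|_{p'}$ (floored at 1, following the convention on Lipschitz bounds). The function names and the example vectors are hypothetical.

```python
import math


def p_norm(v, p):
    """The l_p norm of a vector given as a list of floats."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def conjugate_exponent(p):
    """Return p' with 1/p + 1/p' = 1, for 1 < p < infinity."""
    return p / (p - 1.0)


def lp_oracle_term(w, grad, p, n):
    """Right-hand side of (7) with the O(.) constant set to 1.

    Assumption: losses are linear with gradient `grad`, so L_p is the
    l_{p'} norm of `grad`, floored at 1.
    """
    d = len(w)
    lip = max(1.0, p_norm(grad, conjugate_exponent(p)))
    scale = (p_norm(w, p) + 1.0) * lip
    return scale * math.sqrt(n * math.log(scale * math.log(d) * n) / (p - 1.0))


d, n = 1000, 10_000
w_sparse = [1.0] + [0.0] * (d - 1)   # comparator supported on one coordinate
g_dense = [1.0] * d                  # gradient with all coordinates equal to 1
for p in (1.1, 1.5, 2.0):
    print(p, lp_oracle_term(w_sparse, g_dense, p, n))
```

In this instance the value of $p$ close to 1 gives a much smaller term than $p = 2$, whereas a dense comparator with a sparse gradient would favor $p = 2$; an algorithm satisfying (7) for all $p$ simultaneously does not need to know which case it is in.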

We emphasize that the choice of $\ell_p$ norms for the result above was somewhat arbitrary; any finite collection of norms will also work. For example, the strategy can also be applied to matrix optimization over $\mathbb{R}^{d \times d}$ by replacing the $\ell_p$ norm with the Schatten $S_p$ norm. The Schatten $S_p$ norm has strong convexity parameter on the order of $p - 1$ (which matches the $\ell_p$ norm up to absolute constants (Ball et al., 1994)), so the only practical change to the setup in Theorem 4 will be the running time. Likewise, the approach applies to group norms as used in multi-task learning (Kakade et al., 2012).

#### Application 3: Adapting to rank for online PCA

For the online PCA task, the learner predicts from a class $\mathcal{W}_k$ of $d \times d$ matrices indexed by the rank $k$. For a fixed value of $k$, such a class is a convex relaxation of the set of all rank-$k$ projection matrices. After producing a prediction $W_t$, we experience affine loss functions $f_t(W_t) = \langle I - W_t, Y_t \rangle$.
We leverage an analysis of online PCA due to Nie et al. (2013) together with MultiScaleOCO to derive an algorithm that competes with many values of the rank simultaneously. This gives the following result:

###### Theorem 5.

There is an efficient algorithm for Online PCA with regret bound

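$$\mathbb{E}\left[ \sum_{t=1}^{n} \langle I - W_t, Y_t \rangle - \min_{\substack{W \text{ projection} \\ \mathrm{rank}(W) = k}} \sum_{t=1}^{n} \langle I - W, Y_t \rangle \right] \le \tilde{O}\left( k \sqrt{n} \right) \quad \forall k \in [d/2].$$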

For a fixed value of $k$, the above bound is already optimal up to log factors, but it holds for all $k$ simultaneously.
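The online PCA loss $\langle I - W, Y \rangle$ is easy to compute directly. The sketch below is a small worked instance, assuming the loss matrix has the rank-one form $Y = y y^\top$ (a common choice that is not stated in the text above); the helper names are hypothetical. In that case the loss of a projection $W$ is the squared norm of the part of $y$ that $W$ fails to capture.

```python
def outer(y):
    """Rank-one matrix y y^T as a list of rows."""
    return [[a * b for b in y] for a in y]


def inner(A, B):
    """Trace inner product <A, B> = sum over i, j of A[i][j] * B[i][j]."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def pca_loss(W, Y):
    """Affine loss <I - W, Y> for a prediction W and a loss matrix Y."""
    d = len(W)
    residual = [[(1.0 if i == j else 0.0) - W[i][j] for j in range(d)]
                for i in range(d)]
    return inner(residual, Y)


# Rank-1 projection onto the first coordinate axis.
W = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
Y = outer([3.0, 4.0, 0.0])   # assumed instance Y = y y^T with ||y||^2 = 25
print(pca_loss(W, Y))        # 16.0: the energy of y outside the first axis
```

A projection of higher rank can only capture more of $y$, so its loss on any single round is no larger, at the price of a larger class and hence a larger regret scale.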

#### Application 4: Adapting to norm for Matrix Multiplicative Weights

In the Matrix Multiplicative Weights setting (Arora et al., 2012) we consider hypothesis classes of the form $\mathcal{W}_r = \{ W \succeq 0 : \|W\|_\Sigma \le r \}$. Losses are given by $f_t(W) = \langle W, Y_t \rangle$, where $\|Y_t\|_\sigma \le 1$. For a fixed value of $r$, the well-known Matrix Multiplicative Weights strategy has regret against $\mathcal{W}_r$ bounded by $O(r \sqrt{n \log d})$. Using this strategy for fixed $r$ as a sub-algorithm for MultiScaleOCO, we achieve the following oracle inequality efficiently:

###### Theorem 6.

There is an efficient matrix prediction strategy with regret bound

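$$\mathbb{E}\left[ \sum_{t=1}^{n} \langle W_t, Y_t \rangle - \sum_{t=1}^{n} \langle W, Y_t \rangle \right] \le (\|W\|_\Sigma + 1) \sqrt{n \log d \, \log\left( (\|W\|_\Sigma + 1) n \right)} \quad \forall W \succeq 0. \tag{8}$$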

#### A remark on efficiency

All of our algorithms that provide bounds of the form eq:oco_2smooth_general instantiate experts with MultiScaleFTPL because, in general, the worst case for achieving can have norm as large as . If one has an a priori bound — say — on the range at which each attains its minimum, then the number of experts be reduced to .

### 2.3 Supervised learning

We now consider the online supervised learning setting, with the goal being to compete with a sequence of hypothesis classes $(\mathcal{F}_k)$ simultaneously. Working in this setting makes clear a key feature of the meta-algorithm approach we have adopted: We can efficiently obtain online oracle inequalities for arbitrary nonlinear function classes, so long as we have an efficient algorithm for each $\mathcal{F}_k$.

We obtain a supervised learning meta-algorithm by simply feeding the observed losses (which may even be non-convex) to the meta-algorithm MultiScaleFTPL in the same fashion as MultiScaleOCO.
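The step of "feeding the observed losses" to the experts algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the supplementary material: the loss is the absolute loss (which is 1-Lipschitz in the prediction), sub-algorithm $i$ has its predictions clipped to $[-R_i, R_i]$, and losses are centered by the loss of the zero prediction so that the $i$th coordinate of the loss vector lies in $[-R_i, R_i]$. All names are hypothetical.

```python
def clip(value, bound):
    """Clip a prediction to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def absolute_loss(prediction, label):
    """Absolute loss, which is 1-Lipschitz in the prediction."""
    return abs(prediction - label)


def centered_loss_vector(predictions, label, ranges):
    """Loss vector g_t with g_t[i] = loss(yhat_i, y) - loss(0, y).

    Assumption: sub-algorithm i predicts in [-R_i, R_i] and the loss is
    1-Lipschitz in the prediction, so |g_t[i]| <= R_i.
    """
    baseline = absolute_loss(0.0, label)
    return [absolute_loss(clip(p, r), label) - baseline
            for p, r in zip(predictions, ranges)]


ranges = [1.0, 10.0, 100.0]          # R_i for three sub-algorithms
predictions = [5.0, -7.0, 250.0]     # raw predictions before clipping
g = centered_loss_vector(predictions, label=3.0, ranges=ranges)
print(g)
```

Each coordinate of the resulting vector respects its own scale, which is exactly the multi-scale structure that the experts algorithm is designed to exploit; nothing in this construction requires the loss to be convex.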

The resulting strategy, which is described in detail in the supplementary material for completeness, is called MultiScaleLearning. We make the following assumptions analogous to Assumption 1, which lead to the performance guarantee for MultiScaleLearning given in Theorem 7 below.

###### Assumption 2.

The sub-algorithms used by MultiScaleLearning satisfy the following conditions:

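• For each $i \in [N]$, the iterates $(\hat{y}_t^i)_{t \le n}$ produced by sub-algorithm $i$ satisfy $|\hat{y}_t^i| \le R_i$.

• For each $i \in [N]$, the function $\ell(\cdot, y_t)$ is $L_i$-Lipschitz on $[-R_i, R_i]$.

• For each sub-algorithm $i$, the iterates $(\hat{y}_t^i)_{t \le n}$ enjoy a regret bound $\sum_{t=1}^{n} \ell(\hat{y}_t^i, y_t) - \inf_{f \in \mathcal{F}_i} \sum_{t=1}^{n} \ell(f(x_t), y_t) \le \mathrm{Reg}_n(i)$, where $\mathrm{Reg}_n(i)$ may be data- or algorithm-dependent.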

###### Theorem 7.

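Suppose that the inputs to MultiScaleLearning satisfy Assumption 2. Then the iterates produced by the algorithm enjoy the regret bound

$$\mathbb{E}\left[ \sum_{t=1}^{n} \ell(\hat{y}_t^{i_t}, y_t) - \inf_{f \in \mathcal{F}_i} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \le \mathbb{E}\left[ \mathrm{Reg}_n(i) \right] + O\left( R_i L_i \sqrt{n \log\left( R_i L_i n / \pi_i \right)} \right) \quad \forall i \in [N]. \tag{9}$$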

#### Online penalized risk minimization

In the statistical learning setting, oracle inequalities for arbitrary sequences of hypothesis classes are readily available. Such inequalities are typically stated in terms of complexity parameters for the classes such as VC dimension or Rademacher complexity. For the online learning setting, it is well-known that sequential Rademacher complexity provides a sequential counterpart to these complexity measures (Rakhlin et al., 2010), meaning that it generically characterizes the minimax optimal regret for Lipschitz losses. We will obtain an oracle inequality in terms of this parameter.

###### Assumption 3.

The sequence of hypothesis classes are such that

1. There is an efficient algorithm producing iterates satisfying for any -Lipschitz loss, where is some constant. (an algorithm with this regret is always guaranteed to exist, but may not be efficient).

2. Each has output range , where without loss of generality.

3. — this is obtained by most non-trivial classes.

###### Theorem 8 (Online penalized risk minimization).

Under Assumption 3 there is an efficient (in $N$) algorithm that achieves the following regret bound for any $L$-Lipschitz loss:

As in the previous section, one can derive tighter regret bounds and more efficient (e.g. sublinear in ) algorithms if are nested.

#### Application: Multiple kernel learning

###### Theorem 9.

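Let $(\mathcal{H}_k)_k$ be reproducing kernel Hilbert spaces for which each $\mathcal{H}_k$ has a kernel $K_k$ such that $\sup_{x \in \mathcal{X}} \sqrt{K_k(x, x)} \le B_k$. Then there is an efficient learning algorithm that guarantees

$$\mathbb{E}\left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \le O\left( L B_k (\|f\|_{\mathcal{H}_k} + 1) \sqrt{n \log\left( L B_k k n (\|f\|_{\mathcal{H}_k} + 1) \right)} \right) \quad \forall k, \ \forall f \in \mathcal{H}_k$$

for any $L$-Lipschitz loss, whenever an efficient algorithm is available for the norm ball in each $\mathcal{H}_k$.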

## 3 Discussion and Further Directions

#### Related work

There are two directions in parameter-free online learning that have been explored extensively. The first considers bounds of the form (3); namely, the Hilbert space version of the more general setting explored in Section 2.2. Beginning with McMahan and Streeter (2012), which obtained a slightly looser rate than (3), research has focused on obtaining tighter dependence on $\|w\|_2$ and $n$ in this type of bound (McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona, 2014; Orabona and Pál, 2016); all of these algorithms run in linear time per update step. Recent work (Cutkosky and Boahen, 2016, 2017) has extended these results to the case where the Lipschitz constant is not known in advance. These works give lower bounds for general norms, but only give efficient algorithms for Hilbert spaces. Extending MultiScaleOCO to reach the Pareto frontier of regret in the unknown Lipschitz setting as described in Cutkosky and Boahen (2017) may be an interesting direction for future research.

The second direction concerns so-called “quantile bounds” (Chaudhuri et al., 2009; Koolen and Van Erven, 2015; Luo and Schapire, 2015; Orabona and Pál, 2016) for the experts setting, where the learner’s decision set is the simplex and losses are bounded in $[0, 1]$. The multi-scale machinery developed in the present work is not needed to obtain bounds for this setting because the losses are uniformly bounded across all model classes. Indeed, Foster et al. (2015) recovered a basic form of quantile bound using the vanilla multiplicative weights strategy as a meta-algorithm. It is not known whether the more sophisticated data-dependent quantile bounds given in Koolen and Van Erven (2015); Luo and Schapire (2015) can be recovered in the same fashion.

#### Losses with curvature.

The $O(\sqrt{n})$-type regret bounds provided by MultiScaleFTPL are appropriate when the sub-algorithms themselves incur $O(\sqrt{n})$ regret bounds. However, assuming certain curvature properties (such as strong convexity, exp-concavity, stochastic mixability, etc. (Hazan et al., 2007; van Erven et al., 2015)) of the loss functions, it is possible to construct sub-algorithms that admit significantly more favorable regret bounds ($O(\log n)$ or even $O(1)$). These are also referred to as “fast rates” in online learning. A natural direction for further study is to design a meta-algorithm that admits logarithmic or constant regret to each sub-algorithm, assuming that the loss functions of interest satisfy similar curvature properties, with the regret to each individual sub-algorithm adapted to the curvature parameters for that sub-algorithm. Perhaps surprisingly, for the special case of the logistic loss, improper prediction and aggregation strategies similar to those proposed in this paper offer a way to circumvent known proper learning lower bounds (Hazan et al., 2014). This approach will be explored in detail in a forthcoming companion paper.

#### Computational efficiency.

We suspect that a running-time of to obtain inequalities like eq:oco_2smooth_general may be unavoidable through our approach, since we do not make use of the relationship between sub-algorithms beyond using the nested class structure. Whether the runtime of MultiScaleFTPL can be brought down to match is an open question. This boils down to whether or not the min-max optimization problem in the algorithm description can simultaneously be solved in 1) Linear time in the number of experts 2) strongly polynomial time in the scales .

## Acknowledgements

We thank Francesco Orabona and Dávid Pál for inspiring initial discussions. Part of this work was done while DF was an intern at Google Research and while DF and KS were visiting the Simons Institute for the Theory of Computing. DF is supported by the NDSEG fellowship.

## References

• Arora et al. (2012) Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
• Ball et al. (1994) Keith Ball, Eric A Carlen, and Elliott H Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones mathematicae, 115(1):463–482, 1994.
• Bartlett and Mendelson (2003) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003. ISSN 1532-4435.
• Bartlett et al. (2002) Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002.
• Ben-David et al. (2009) Shai Ben-David, David Pal, and Shai Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
• Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
• Bubeck et al. (2017) Sebastien Bubeck, Nikhil Devanur, Zhiyi Huang, and Rad Niazadeh. Online auctions and multi-scale online learning. Accepted to The 18th ACM conference on Economics and Computation (EC 17), 2017.
• Cesa-Bianchi and Lugosi (2006) Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
• Chaudhuri et al. (2009) Kamalika Chaudhuri, Yoav Freund, and Daniel J Hsu. A parameter-free hedging algorithm. In Advances in neural information processing systems, pages 297–305, 2009.
• Cutkosky and Boahen (2016) Ashok Cutkosky and Kwabena A Boahen. Online convex optimization with unconstrained domains and losses. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 748–756. 2016.
• Cutkosky and Boahen (2017) Ashok Cutkosky and Kwabena A. Boahen. Online learning without prior information. The 30th Annual Conference on Learning Theory, 2017.
• De Rooij et al. (2014) Steven De Rooij, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(1):1281–1316, 2014.
• Devroye et al. (1996) Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
• Foster et al. (2015) Dylan J Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, pages 3375–3383, 2015.
• Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
• Hazan et al. (2007) Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
• Hazan et al. (2014) Elad Hazan, Tomer Koren, and Kfir Y Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Proceedings of The 27th Conference on Learning Theory, pages 197–209, 2014.
• Hazan et al. (2017) Elad Hazan, Satyen Kale, and Shai Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. SIAM J. Comput., 46(2):744–773, 2017. doi: 10.1137/120895731.
• Kakade et al. (2009) Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793–800. MIT Press, 2009.
• Kakade et al. (2012) Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.
• Koltchinskii (2001) Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Information Theory, 47(5):1902–1914, 2001.
• Koolen and Van Erven (2015) Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Proceedings of The 28th Conference on Learning Theory, pages 1155–1175, 2015.
• Luo and Schapire (2015) Haipeng Luo and Robert E Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304, 2015.
• Massart (2007) Pascal Massart. Concentration inequalities and model selection. Lecture Notes in Mathematics, 1896, 2007.
• McMahan and Abernethy (2013) Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724–2732, 2013.
• McMahan and Streeter (2012) Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pages 2402–2410, 2012.
• McMahan and Orabona (2014) H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, pages 1020–1039, 2014.
• Nemirovski (2004) Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
• Nie et al. (2013) Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online PCA with optimal regrets. In International Conference on Algorithmic Learning Theory, pages 98–112. Springer, 2013.
• Orabona (2014) Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
• Orabona and Pál (2016) Francesco Orabona and Dávid Pál. From coin betting to parameter-free online learning. arXiv preprint arXiv:1602.04128, 2016.
• Pisier (2011) Gilles Pisier. Martingales in Banach spaces (in connection with type and cotype). Course IHP, Feb. 2–8, 2011.
• Rakhlin et al. (2010) Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, pages 1984–1992, 2010.
• Rakhlin et al. (2012) Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems 25, pages 2150–2158, 2012.
• Renegar (1988) James Renegar. A polynomial-time algorithm, based on Newton’s method, for linear programming. Mathematical Programming, 40(1):59–93, 1988.
• Shawe-Taylor et al. (1998) John Shawe-Taylor, Peter L Bartlett, Robert C Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE transactions on Information Theory, 44(5):1926–1940, 1998.
• Srebro et al. (2011) Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in neural information processing systems, pages 2645–2653, 2011.
• van Erven et al. (2015) Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.
• Vapnik (1982) Vladimir Vapnik. Estimation of dependences based on empirical data, volume 40. Springer-Verlag New York, 1982.
• Vapnik and Chervonenkis (1971) Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

## Appendix A Proofs

### A.1 Multi-scale FTPL algorithm

###### Proof of theorem:ftpl_alg.

Recall that . Let . For a regret bound of the form to be achievable by a randomized algorithm such as alg:general we need

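$$
\mathcal{V}_n \triangleq \left\langle\!\!\left\langle \inf_{P_t \in \Delta(\Delta_N)} \sup_{g_t \in \mathcal{C}} \mathbb{E}_{p_t \sim P_t} \mathbb{E}_{i_t \sim p_t} \right\rangle\!\!\right\rangle_{t=1}^{n} \sup_{i \in [N]} \left[ \sum_{t=1}^{n} \langle e_{i_t}, g_t \rangle - \sum_{t=1}^{n} \langle e_i, g_t \rangle - B(i) \right] \le K,
$$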

where $\langle\!\langle \cdot \rangle\!\rangle_{t=1}^{n}$ denotes interleaving of the operator from $t=1$ to $n$. In the context of alg:general, the distributions $p_t$ above refer to the strategy selected by the algorithm and $P_t$ refers to the distribution over this strategy induced by sampling the random variables $\sigma_{t+1:n}$. See Foster et al. (2015) for a more extensive introduction to this type of minimax analysis for comparator-dependent regret bounds.

We will develop an algorithm to certify this bound for using the framework of adaptive relaxations proposed by Foster et al. (2015). Define a relaxation via

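$$
\mathrm{Rel}(g_{1:t}) \triangleq \mathbb{E}_{\sigma_{t+1:n} \in \{\pm 1\}^N} \sup_{i \in [N]} \left[ -\sum_{s=1}^{t} \langle e_i, g_s \rangle + 4 \sum_{s=t+1}^{n} \sigma_s[i]\, c_i - B(i) \right].
$$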
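To make the definition concrete, here is a minimal Monte Carlo sketch of the relaxation; it is an illustration under stated assumptions, not the paper's implementation. It assumes the observed loss vectors $g_1, \ldots, g_t$ are stored as rows of a NumPy array, that `c` holds the scales $c_i$ and `B` the comparator-dependent bounds $B(i)$, and that each $\sigma_s$ is drawn uniformly from $\{\pm 1\}^N$. The function name `relaxation_estimate`, the sample count, and the seed are hypothetical choices.

```python
import numpy as np

def relaxation_estimate(g, t, n, c, B, num_samples=2000, seed=0):
    """Monte Carlo estimate of Rel(g_{1:t}).

    g : array of shape (>= t, N); only the first t rows (g_1..g_t) are used
    t : number of rounds already observed
    n : total number of rounds
    c : array of shape (N,), per-expert scales c_i (assumed input)
    B : array of shape (N,), comparator-dependent bounds B(i) (assumed input)
    """
    rng = np.random.default_rng(seed)
    g = np.asarray(g, dtype=float)
    c = np.asarray(c, dtype=float)
    B = np.asarray(B, dtype=float)
    N = c.shape[0]

    # -sum_{s<=t} <e_i, g_s>: negative cumulative loss of each expert i.
    neg_cum_loss = -g[:t].sum(axis=0) if t > 0 else np.zeros(N)

    remaining = n - t
    if remaining == 0:
        # No future rounds: the expectation over sigma is vacuous.
        return float(np.max(neg_cum_loss - B))

    # Draw sigma_{t+1:n}; each sigma_s is a uniform sign vector in {-1, +1}^N.
    sigma = rng.choice([-1.0, 1.0], size=(num_samples, remaining, N))
    # 4 * sum_{s>t} sigma_s[i] * c_i for every sample and every expert.
    playout = 4.0 * sigma.sum(axis=1) * c
    # Supremum over experts inside the expectation, then average over samples.
    values = np.max(neg_cum_loss + playout - B, axis=1)
    return float(values.mean())
```

Calling `relaxation_estimate(g, t, n, c, B)` with `t == n` returns the exact final value, since no random signs remain; for `t < n` the result is a sample average whose accuracy depends on `num_samples`.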

The proof structure is as follows: We show that playing as suggested by alg:general with satisfies the initial condition and admissibility condition for adaptive relaxations from Foster et al. (2015), which implies that if we play we will have . Then as a final step we bound using a probabilistic maximal inequality, lem:maximal.

#### Initial condition

This condition asks that the initial value of the relaxation upper bound the worst-case value of the negative benchmark minus the bound (in other words, the inner part of $\mathcal{V}_n$ with the learner’s loss removed). This holds by definition and is trivial to verify:

$$
\mathrm{Rel}(g_{1:n}) = \sup_{i \in [N]} \left[ -\sum_{t=1}^{n} \langle e_i, g_t \rangle - B(i) \right].
$$
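As a small sanity check of this identity, the sketch below uses made-up numbers (not taken from the paper) to evaluate the final relaxation and confirm that adding the learner's cumulative loss recovers the worst-case excess regret $\sup_{i}[\sum_t \langle e_{i_t}, g_t\rangle - \sum_t \langle e_i, g_t\rangle - B(i)]$. The arrays `g` and `B`, the played indices `plays`, and both function names are hypothetical.

```python
import numpy as np

def final_relaxation(g, B):
    """Rel(g_{1:n}) = max_i [ -sum_t g_t[i] - B[i] ]; no randomness remains."""
    g = np.asarray(g, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.max(-g.sum(axis=0) - B))

def excess_regret(g, plays, B):
    """max_i [ sum_t g_t[i_t] - sum_t g_t[i] - B[i] ] for played experts i_t."""
    g = np.asarray(g, dtype=float)
    B = np.asarray(B, dtype=float)
    plays = np.asarray(plays)
    # Learner's cumulative loss: entry i_t of g_t, summed over rounds.
    learner_loss = g[np.arange(len(plays)), plays].sum()
    return float(np.max(learner_loss - g.sum(axis=0) - B))

# Hypothetical data: n = 3 rounds, N = 2 experts.
g = np.array([[1.0, -2.0], [0.5, 1.0], [-1.0, 0.0]])
B = np.array([0.0, 1.0])
plays = np.array([0, 1, 1])

learner_loss = float(g[np.arange(3), plays].sum())
# The learner's loss does not depend on i, so it factors out of the supremum.
assert learner_loss + final_relaxation(g, B) == excess_regret(g, plays, B)
```

Because the learner's loss is constant in $i$, it can be pulled out of the supremum, which is why the final relaxation only needs to control the negative benchmark minus $B(i)$.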