# Best of many worlds: Robust model selection for online supervised learning

We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We show that by incorporating a probabilistic framework of structural risk minimization into existing adaptive algorithms, we can robustly learn not only the presence of stochastic structure when it exists (leading to constant as opposed to O(√(T)) regret), but also the correct model order. We thus obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the complexity of the optimal contextual tree expert and whether the process generating the data is stochastic or adversarial. These are the first constructive guarantees on simultaneous adaptivity to the model and the presence of stochasticity.


## 1 Introduction

In full-information online learning, there are no generative assumptions on the data. We consider online supervised learning where we observe pairs of covariates and responses, and need to minimize regret with respect to the best function in hindsight from a fixed model class. In the case where covariates and responses are discrete, we can consider the 0-1 loss function, and characterize the performance of tree experts (also called contextual experts) that map a covariate to an appropriate response. A natural goal is to minimize minimax cumulative regret as a function of the number of rounds T. This is well known to scale [CBFH97] as O(√T). Once this is guaranteed, we are especially interested in adaptive algorithms that preserve this guarantee and also adapt to "easier" stochastic structure. Again, it is well known that we can get much faster rates in this case; essentially, constant regret. Recent work [CBMS07, EKRG11, DRVEGK14, LS15, KVE15, KGvE16] constructs algorithms that adapt to these faster rates while preserving the minimax rate, thus obtaining the best of both worlds.

A more classical goal of adaptivity is adapting to the complexity of the true model class. The traditional offline model selection framework [Mas07] studies a hierarchy of models, and shows that the right model for the problem can be chosen in a data-adaptive fashion when the data is generated according to a stochastic iid process. It is clear that model adaptivity is a natural goal in online learning – after all, while low regret is important, so is the right choice of benchmark with respect to which to minimize regret. And the importance of model selection is reflected very naturally in regret: either our data is not well-expressed by the used model class, leading us to question what a good regret rate really means, or our data is well-approximated by simple models and we spend more time than needed looking for the right predictor, building up unnecessary regret.

In this context, we have a natural goal. Starting with absolutely no assumptions, we still wish to protect ourselves from adversaries with the minimax regret rates (up to constants). However, we also want to adapt simultaneously to the existence and statistical complexity of stochastic structure, and perform almost as well as an algorithm with oracle knowledge of that structure would.

Typically, we use adaptive entropy regularization with a changing learning rate to interpolate between the stochastic and adversarial regimes. Structural risk minimization has been considered in purely stochastic, or purely adversarial environments, and uses a very different kind of model complexity regularization. Even in the simplest discrete problems, it was never clear whether these objectives could be achieved simultaneously. In this paper, we answer the question in the affirmative. We adaptively recover the stochastic model selection framework in the discrete “contextual experts" setting and obtain near-optimal, theoretical guarantees on regret in expectation and with high probability. We also provide simulations to illustrate the value of achieving this kind of two-fold adaptivity.

##### Our contributions

We show that an adaptive variant of the tree expert forecaster adapts not only to stochastic structure but also to the order of the stochastic model that best describes the mapping between covariates and responses. Our main result is stated informally below. (For a formal statement of the theorem, see Theorem 1.)

Main theorem (informal):

Let D be the maximum model order of tree experts. The regret of our algorithm with respect to the best d-order tree expert is O(2^d √T) in an adversarial setting, and constant (independent of T) with high probability when the data is actually generated by a d-order tree expert, for any d ≤ D.

Thus, we can recover stochastic online model selection in an adversarial framework – our regret rate for d-order processes is achieved without knowing the value of d in advance, or even that the process is stochastic. This rate is competitive with the optimal regret rate that would be achieved by a greedy algorithm possessing side information about both the existence of stochastic structure and the true model order. We will see the empirical benefit of this two-fold adaptivity in the simulations in Section 5, where we compare directly to existing algorithms that only achieve one kind of adaptivity.

Interestingly, we are able to obtain these guarantees for an algorithm that is a natural adaptation of the standard exponential weights framework, and our results have an intuitive interpretation. We combine the adaptivity to stochasticity of an existing "best-of-both-worlds" algorithm (called AdaHedge [EKRG11, DRVEGK14]) with the prior weighting on tree experts that is used in tree forecasters [HS97]. (Most interestingly, this prior distribution was designed for the original tree expert forecaster [HS97], but that algorithm could not effectively utilize the prior because of its fixed learning rate.) As is intuitive, the prior is inversely proportional to the complexity of the tree expert.

Our analysis recovers the stochastic structural risk minimization framework in a probabilistic sense. There are two penalties involved: the complexity of the model selected (to achieve model selection) as well as determinism (to ensure protection against adversaries). Remarkably, our algorithm uses a common time-varying, data-dependent learning rate, defined in the elegant AdaHedge style, to learn the correct proportion with which to apply both regularizers.

##### Related work

The framework for offline structural risk minimization in purely stochastic environments was laid out in seminal work (for a review, see [Mas07]). Generalization bounds are used to characterize model order complexity, and empirical process theory is used to show that data-adaptive model selection can be performed with high probability. Online bandit approaches for stochastic model selection have also been considered more recently [ADBL11].

On the other side, the paradigm for adversarial regret minimization was laid out in the discrete "experts" setting in seminal work (for a review, see [CBFH97]), and subsequently lifted up to the more general online convex optimization framework (for a review, see [SS12]). The next natural goal was adaptivity to several types of "easier" instances while preserving the worst-case guarantees. Most pertinent to our work are the easier stochastic losses [DRVEGK14], under which the greedy Follow-the-Leader algorithm achieves constant regret. In the experts setting, multiple algorithms have been proposed [CBMS07, EKRG11, DRVEGK14, LS15, KVE15, KGvE16] that adaptively achieve this constant regret. Some of these guarantees have been extended to online optimization [vEK16]. As we will see, naively extending these analyses to the tree expert forecaster problem gives a pessimistic regret bound. In our work, we show that we can get the best of many worlds and greatly improve this bound, reducing the dependence on the maximum model order from exponential to linear.

Recent guarantees on adapting to a simpler model class, but not to stochasticity, have also been developed [RS13, Ora14, LS15, KVE15, OP16, FKMS17]. Many of these approaches [RS13, Ora14, OP16, FKMS17] do not improve the rate for stochastic data. Others [LS15, KVE15] obtain second-order quantile regret bounds in the worst case, in terms of a data-dependent term and the correct model complexity – but the subsequent analysis in the stochastic regime [KGvE16] avoids the model selection issue, and again yields a pessimistic regret bound. (We do not believe this to be a shortcoming of the algorithms, to be clear: sharper analysis of their updates would likely yield similar probabilistic model selection guarantees.) In our work, we adaptively recover the stochastic model selection framework from the adversarial setting and obtain sharp, closed-form regret bounds for data generated from a hierarchy of stochastic models.

And so, while the notions of adapting to stochasticity and simpler models have been considered separately in online learning, no previous analysis shows that we can provably simultaneously adapt to both. This has been proposed as an important objective in recent work [vEK16, FKMS17].

## 2 Problem statement

We consider an online supervised learning setting over T rounds, in which we receive context-output pairs (X_t, Y_t). We consider X_t ∈ X^D and Y_t ∈ X, where X = {0, 1} is the binary alphabet. (As a general note, all our analysis can easily be extended to the general, non-binary case; we present the binary case for simplicity.) It will also be natural to consider the truncated version of X_t that only represents the last h coordinates – we denote this by X_t^(h), with the convention that X_t^(D) = X_t.

We follow the online supervised learning paradigm: before round t, we are given access to X_t, but not Y_t. Let F_D denote the set of all tree experts, expressed as Boolean functions from X^D to X. We will also be considering tree experts that map from the subcontexts x^(h) to outputs, denoted by F_h for all values of h in {0, 1, …, D}. We use the obvious shorthand notation where convenient. We define the order of a tree expert, denoted by order(f_h), as the minimum value of d for which its functionality can be expressed equivalently in terms of a function from X^d to X. That is,

 order(f_h) := min{d ≤ h : there exists f′_d ∈ F_d s.t. f_h(x^(h)) = f′_d(x^(d)) for all x^(h) ∈ X^h}.   (1)
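
For concreteness, the order of Equation (1) can be computed by brute force. The snippet below is our own illustration (the truth-table representation of a tree expert as a dictionary is an assumption of ours, not the paper's): it checks successively longer suffix lengths until the expert's predictions are determined by the suffix alone.

```python
from itertools import product

def order(f, h):
    """Order of a tree expert f: {0,1}^h -> {0,1}, as in Eq. (1): the
    minimum d <= h such that f(x) depends only on the last d coordinates
    of x. `f` is a dict mapping h-bit tuples to predictions in {0, 1}."""
    for d in range(h + 1):
        consistent = True
        table = {}
        for x in product((0, 1), repeat=h):
            suffix = x[h - d:]  # the truncated context x^(d)
            # f has order <= d iff all contexts sharing a suffix agree.
            if table.setdefault(suffix, f[x]) != f[x]:
                consistent = False
                break
        if consistent:
            return d
    return h

# The XOR of the last two bits on 3-bit contexts has order 2:
xor_last_two = {x: x[1] ^ x[2] for x in product((0, 1), repeat=3)}
```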

We define our randomized online algorithm for prediction using tree experts in terms of a sequence of probability distributions {w_t^(tree)} over the set of all tree experts. Note that w_t^(tree) cannot depend on Y_t or any future data. We denote the realization of the prediction at time t by Ŷ_t, and the distribution on Ŷ_t by w_t (clearly induced by w_t^(tree)). After prediction, the actual value Y_t is revealed, and the loss is the 0-1 loss depending on whether we get the prediction right. Formally, we have ℓ_t(y) := I[Y_t ≠ y], and the expected loss of the algorithm in round t is given by ⟨w_t, ℓ_t⟩. We denote as shorthand

 L_{t,f} := Σ_{s=1}^t I[Y_s ≠ f(X_s^(h))]   for all f ∈ F_h, h ≤ D,
 L_{X,t,y} := Σ_{s=1}^t I[X_s^(h) = X; Y_s ≠ y]   for all X ∈ X^h, h ≤ D, y ∈ X,
 L_{X,t} := [L_{X,t,0}, L_{X,t,1}]   for all X ∈ X^h, h ≤ D.

The traditional quantity of regret measures the loss of an algorithm with respect to the loss of the algorithm that possessed oracle knowledge of the best single "action" to take in hindsight, after seeing the entire sequence offline. In the context of online supervised learning, this "action" represents the best d-order Boolean function f*_{T,d} ∈ argmin_{f ∈ F_d} L_{T,f}. The expected regret with respect to the best d-order tree expert is defined as R_{T,d} := E[Σ_{t=1}^T ⟨w_t, ℓ_t⟩] − min_{f ∈ F_d} L_{T,f}.

Our algorithm is effectively an exponential-weights update on tree experts equipped with a time-varying, data-dependent learning rate and a suitable prior distribution on tree experts. We start by describing the structure of the prior distribution.

###### Definition 1.

For any non-negative-valued function g on {0, 1, …, D}, we define the prior distribution on all tree experts in F_D by w_{1,f}^(tree)(g) := (Σ_{h=order(f)}^D g(h))/Z(g), where Z(g) is the normalizing factor.

We select a function g and use the prior defined above to effectively downweight more complex experts. We will see that the choice of prior is crucial to recovering stochastic model selection.

A good data-adaptive choice of the learning rate η_t has been an intriguing question of significant recent interest. The idea is that we want to learn the correct learning rate for the problem. We consider a particularly elegant choice based on the algorithm AdaHedge, which was defined for the simpler experts setting. We denote η_1^t := (η_1, …, η_t) for shorthand.

###### Definition 2 ([Drvegk14]).

The AdaHedge learning rate process is described as

 η_t = ln 2 / Δ_{t−1}(η_1^{t−1}),   (2)

where Δ_t(η_1^t) is called the "cumulative mixability gap" at time t and is given by

 Δ_t(η_1^t) := Σ_{s=1}^t δ_s(η_s),  where   (3)
 δ_s(η_s) := ⟨w_s(η_s), ℓ_s⟩ + (1/η_s) ln⟨w_s(η_s), e^{−η_s ℓ_s}⟩.   (4)
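
For intuition, the following is a minimal sketch (ours, not the paper's pseudocode) of the AdaHedge update of Definition 2 over a flat set of K experts with a prior; ContextTreeAdaHedge instantiates this scheme on the set of tree experts with the prior of Definition 1. The function name and array-based interface are our own assumptions.

```python
import numpy as np

def adahedge(losses, prior):
    """Sketch of AdaHedge (Definition 2) over a finite set of K experts.
    losses: (T, K) array with entries in [0, 1]; prior: length-K
    probability vector playing the role of the prior w_1(g). Returns the
    algorithm's cumulative expected loss and the best expert's loss."""
    prior = np.asarray(prior, dtype=float)
    L = np.zeros(len(prior))  # cumulative expert losses
    Delta = 0.0               # cumulative mixability gap, Eq. (3)
    total = 0.0
    for l in np.asarray(losses, dtype=float):
        if Delta == 0.0:
            # eta = infinity: follow the (prior-weighted) leaders
            w = np.where(L == L.min(), prior, 0.0)
            w /= w.sum()
            h = float(w @ l)                  # Hedge (expected) loss
            m = float(l[L == L.min()].min())  # limiting mix loss
        else:
            eta = np.log(2) / Delta           # Eq. (2); ln 2 for two predictions
            w = prior * np.exp(-eta * (L - L.min()))
            w /= w.sum()
            h = float(w @ l)
            m = -np.log(float(w @ np.exp(-eta * l))) / eta  # mix loss
        Delta += h - m                        # mixability gap delta_t, Eq. (4)
        L += l
        total += h
    return total, float(L.min())
```

Note that the first round uses the η → ∞ (follow-the-leader) limit; once cumulative gaps appear, Δ_t > 0 and the learning rate becomes finite.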

We are now ready to describe our main algorithm.

###### Definition 3.

The algorithm ContextTreeAdaHedge, whose prior is derived from the function g, updates its probability distribution on tree experts as follows:

 w_{t,f}^(tree)(η_t; g) = (Σ_{h=order(f)}^D g(h)) e^{−η_t L_{t,f}} / [Σ_{f′∈F_D} (Σ_{h=order(f′)}^D g(h)) e^{−η_t L_{t,f′}}],   (5)

with the learning rate updated according to Equations (2)–(4).

The algorithm ContextTreeAdaHedge appears to have a prohibitive computational complexity, since the number of tree experts in F_D is doubly exponential in D. However, the distributive law enables a clever reduction to computational complexity linear in D per round. The main idea is that instead of keeping track of the cumulative losses of all the functions in F_D, represented by L_{t,f}, we only need to keep track of the cumulative losses of making certain predictions as a function of certain subcontexts, represented by L_{X,t,y}. This reduction was first considered for tree expert prediction in the worst case [HS97], with a fixed learning rate, and can easily be extended to the broader class of exponential-weights updates. Proposition 3, which is stated and proved in Appendix B for completeness, shows that the update on the probability distribution on tree experts, described in Equation (5), can be equivalently written as a computationally faster update on the probability distribution on predictions:

 w_{t,y}(η_t; g) = [Σ_{h=0}^D g′(h; η_t) e^{−η_t L_{X_t^(h),t,y}}] / [Σ_{h=0}^D g′(h; η_t) (Σ_{y′∈X} e^{−η_t L_{X_t^(h),t,y′}})],  where   (6a)
 g′(h; η_t) = g(h) ∏_{x^(h) ≠ X_t^(h)} (Σ_{y∈X} e^{−η_t L_{x^(h),t,y}}).   (6b)

The equivalence is in the sense that the expected loss incurred by updates (5) and (6a) is the same.
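
To make Equations (6a)–(6b) concrete, here is a naive sketch of the predictor-space update; the function name and dictionary-of-losses representation are our own assumptions. This version recomputes g′(h; η_t) from scratch, which is exponential in D; the efficient algorithm instead maintains these products incrementally across rounds.

```python
import numpy as np
from itertools import product
from collections import defaultdict

def predict_dist(ctx, ctx_losses, g, eta, D):
    """Sketch of the predictor-space update, Eqs. (6a)-(6b).
    ctx: current context X_t (tuple of D bits, most recent bit last);
    ctx_losses[(x, y)]: cumulative loss L_{x,t-1,y} of predicting y after
    subcontext x; g: prior function on orders 0..D. Illustration only."""
    num = np.zeros(2)
    for h in range(D + 1):
        suffix = ctx[D - h:]  # the subcontext X_t^(h)
        # g'(h; eta) = g(h) * product over the other length-h contexts
        gp = g(h)
        for x in product((0, 1), repeat=h):
            if x == suffix:
                continue
            gp *= sum(np.exp(-eta * ctx_losses[(x, y)]) for y in (0, 1))
        for y in (0, 1):
            num[y] += gp * np.exp(-eta * ctx_losses[(suffix, y)])
    return num / num.sum()  # distribution on the two predictions
```

With no accumulated losses the distribution is uniform; penalizing one prediction after a given subcontext tilts the distribution away from it.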

### 2.2 Potential generative assumptions on data

As we have mentioned informally, we would like to get greatly improved regret rates for data generated in a certain way (without a priori knowledge of such generation). We work with the following standard stochastic condition on our data.

###### Definition 4.

We say that our data satisfies the d-order stochastic condition if the following conditions hold:

1. The random vectors (X_t, Y_t) are independent and identically distributed across t.

2. P[Y_t = y | X_t] = P[Y_t = y | X_t^(d)] for all y ∈ X; we denote this conditional distribution by P*.

We denote the marginal distribution on X_t^(h) by Q*_h. For this setting, it is natural to define the best "external predictor" for any h ≤ D:

 f*(x^(h)) ∈ argmax_{y∈X} P*(y | x^(h))   for all x^(h) ∈ X^h.   (7)

For the special case of h = d, we assume that the best predictor is unique (this is the fundamental Tsybakov margin condition [T04] that is essential for eventual learnability of the best predictor), i.e.

 P*(f*(x^(d)) | x^(d)) > P*(y | x^(d))   for all y ≠ f*(x^(d)) and for all x^(d) ∈ X^d,

and denote the parameters

 β(x^(d)) := P*(f*(x^(d)) | x^(d)),   (8)
 β* := min_{x^(d) ∈ X^d} β(x^(d)).   (9)

Note that the uniqueness-of-best-predictor assumption directly implies that β* > 1/2, since we are working with a binary alphabet.

Based on this, we also define the important notion of asymptotic unpredictability for all model orders h ≤ D. The definitions and notation are directly inspired by information-theoretic limits on sequence compression and prediction [FMG92].

###### Definition 5 ( [Fmg92]).

For data satisfying the d-order stochastic condition, we define its asymptotic unpredictability under the h-order predictive model by

 π*_h := Σ_{x^(h) ∈ X^h} Q*_h(x^(h)) [1 − max_{y∈X} P*(y | x^(h))].   (10)

For h ≥ d, we have π*_h = π*_d. For h < d, we have π*_h ≥ π*_d.
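
As a concrete check of Definition 5, the snippet below (a toy example of our own: contexts uniform on {0,1}^D, with P*(1 | x) depending only on the last bit, so d = 1) computes π*_h by brute force and exhibits π*_0 > π*_1 = π*_2.

```python
from itertools import product

D = 3  # maximum context length in this toy example

def cond_p1(x):  # P*(1 | x): depends only on the last bit, so d = 1
    return 0.9 if x[-1] == 1 else 0.2

def pi_star(h):
    """Asymptotic unpredictability under the h-order model, Eq. (10),
    for contexts uniform on {0,1}^D."""
    total = 0.0
    for s in product((0, 1), repeat=h):  # each subcontext x^(h)
        # full contexts whose last h coordinates equal s
        full = [x for x in product((0, 1), repeat=D) if x[D - h:] == s]
        p1 = sum(cond_p1(x) for x in full) / len(full)  # P*(1 | x^(h))
        total += (len(full) / 2 ** D) * (1 - max(p1, 1 - p1))
    return total
```

Here π*_0 = 0.45 (predicting from the marginal alone), while π*_1 = π*_2 = 0.15, as the conditional distribution depends only on x^(1).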

## 3 Main results

Different choices of the function g used to describe the prior distribution on tree experts yield vastly different results. Consider the choice corresponding to the typical prior-free implementation of exponential weights (i.e., Equation (5) with a uniform prior). With this choice, Proposition 1 in Appendix A.2.3 describes the "best-of-both-worlds" bound that we obtain: worst-case regret O(2^D √T), and constant regret in the stochastic case. Note that the stochastic regret bound, while constant and thus independent of the horizon T, is highly suboptimal in its dependence on the maximum model order D. The bound does not improve for drastically simpler cases – for example, i.i.d. data – and is independent of the true model order d.

If we knew the true model order d, we would want to use ContextTreeAdaHedge with maximum order d. We now show that a suitable choice of prior helps us effectively learn the model order, as well as stay worst-case robust. We study the algorithm with the following choice of model-order-proportional prior function:

 g_prop(h) = 2^{−2^{h+1}}.   (11)
###### Theorem 1.
1. For any sequence {(X_t, Y_t)}_{t=1}^T, the algorithm ContextTreeAdaHedge with prior defined according to the function g_prop gives us regret rate

 R_{T,d} = O(2^d √T)   (12)

with respect to the best d-order tree expert in hindsight, for every d ≤ D.

2. Consider any ϵ ∈ (0, 1). Let the sequence satisfy the d-order stochastic condition with parameter β*, and denote α* := 2β* − 1. Then, ContextTreeAdaHedge with prior function g_prop incurs regret, with probability greater than or equal to 1 − ϵ:

 R_{T,d} = O( 2^{2d} [ (d²/α_{2^d−1,d}) ln(d/(α_{2^d−1,d} ϵ)) + (D·d/(α*)²) ln(D/(α* ϵ)) ] ),   (13)

where the parameter α_{2^d−1,d} is defined in the appendix.

The proof of Theorem 1 involves several moving parts to combine adversarial-stochastic interpolation and structural risk minimization, and we defer this proof to the appendix. We provide an intuitive sketch of the proof in Section 4.

Theorem 1 is the first result of its kind to obtain comparable regret rates as would be achieved by an algorithm that had oracle knowledge about the presence of stochasticity and the model order. This is the strongest possible side information that an algorithm could conceivably possess keeping the online learning problem non-trivial. In simulation, we also demonstrate the significant empirical advantage of algorithms that achieve two-fold adaptivity over “best-of-both-worlds" algorithms that do not adapt to model complexity. The advantage of offline data-driven model selection is well established, and we see this advantage even more naturally while measuring regret in online learning.

## 4 Proof sketch of Theorem 1

Initially, we mirror the established style of "best-of-both-worlds" results. The first step is always to prove a regret bound that is dependent on the data; in particular, a bound of the form R_{T,d} = O(√(V_T) · (1 + ln(Z(g)/g(d)))), where V_T represents the cumulative variance of the loss incurred by the algorithm. Curiously, we are easily able to get such a bound (commonly called a second-order bound) that is adaptive to the model order using exponential weights with a prior! (The careful reader will notice that there is nevertheless a suboptimality in the exponent as compared to the second-order bounds obtained by algorithms like Squint [KVE15] and AdaNormalHedge [LS15]. However, the "variance"-like terms in those results are different, as is their more complicated analysis for the iid case. Until similar analysis is done for these algorithms, they are not immediately comparable.)

The cumulative variance term tells us something about how random the randomized updates of the algorithm are. In the worst case, V_T = O(T) and we automatically recover the adversarial result – but often, this term can be significantly smaller. It is easy to see that this randomness will greatly reduce when the losses are stochastic in the sense that one tree expert looks consistently better than the others. It will also reduce in the presence of a favorable prior if that best expert possesses simpler structure. However, all existing analysis [CBMS07, EKRG11, DRVEGK14, LS15, KVE15, KGvE16] only exploits the former property, and not the latter – thus giving a pessimistic scaling, exponential in D, for our problem.

Our main technical contribution is tackling the more difficult problem of finely controlling the cumulative variance under a favorable prior – showing that it in fact scales with the significantly smaller quantity 2^d rather than 2^D. We achieve this by making an explicit connection to probabilistic model selection by complexity regularization. To see this, consider Equation (5) written equivalently as the optimization problem in the Follow-the-Regularized-Leader [SS12] update:

 w_t^(tree) := argmin_{w^(tree)} [ ⟨w^(tree), L_t^(tree)⟩ + (1/η_t)( −H(w^(tree))  [entropy regularization] + ⟨w^(tree), C^(tree)⟩  [complexity regularization] ) ],   (14)

where C_f^(tree) := ln(1/w_{1,f}^(tree)(g)) and H(·) denotes the entropy functional on a probability distribution over a discrete-valued random variable. Viewed this way, the algorithm performs complexity-regularized loss minimization in every round.

Figure 1 illustrates the classical tradeoff in stochastic model selection in an example where the true model order is d – the estimation error increases with model order, while the approximation error decreases with model order and plateaus out at the true model order (note that this is the minimum average prediction error that any online learning algorithm should be expected to pay). Clearly, the true model order minimizes the appropriate combination of estimation error and approximation error. We show a probabilistic model selection guarantee, i.e. we can pick the true model with high probability. We do this by ruling out lower- and higher-order models alike. On one hand, the more (superfluously) complex a model is, the more it is going to overfit, contributing to unnecessary accumulated regret – but also the more its unfavorable prior drags it down to rule it out. On the other hand, the more (unnecessarily) simple a model is, the worse it is going to approximate – and since this approximation error is directly penalized in Equation (14), the less likely it is to be picked.
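
A quick numerical sanity check of the FTRL view (our own, not from the paper): with the complexity penalty C_f = ln(1/w_{1,f}), the closed-form minimizer of the regularized objective in Equation (14) is the softmax of −η_t L_t − C, which coincides exactly with the prior-weighted exponential-weights update of Equation (5).

```python
import numpy as np

rng = np.random.default_rng(0)
K, eta = 5, 0.7
L = rng.uniform(0, 10, K)        # cumulative losses (arbitrary toy values)
w1 = rng.dirichlet(np.ones(K))   # a prior distribution
C = -np.log(w1)                  # complexity penalty C_f = ln(1/w_1,f)

# closed-form FTRL solution: softmax(-eta * L - C)
z = -eta * L - C
w_ftrl = np.exp(z - z.max())
w_ftrl /= w_ftrl.sum()

# prior-weighted exponential weights, as in Eq. (5)
w_ew = w1 * np.exp(-eta * L)
w_ew /= w_ew.sum()
```

The identity holds because exp(−ηL − C) = exp(−ηL + ln w_1) = w_1 · exp(−ηL).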

## 5 Simulations

We now provide a brief empirical illustration of the power of two-fold adaptivity to stochasticity and model complexity with ContextTreeAdaHedge equipped with the prior function g_prop.

We consider a process satisfying the d-order stochastic condition. Figure 2 compares three algorithms: the optimal online algorithm with oracle knowledge of this structure (the greedy Follow-the-Context-Leader); uniform-prior ContextTreeAdaHedge, which adapts to stochasticity but not model order; and our two-fold adaptive algorithm, ContextTreeAdaHedge with the prior function g_prop.

Figure 2 shows the expected normalized regret and expected normalized cumulative loss of the algorithms. We draw two natural conclusions from Figure 2. One, model adaptivity makes a tremendous difference to regret and overall loss: ContextTreeAdaHedge equipped with the uniform prior does not adapt to model order, and pays for it with loss (regret) accumulated due to overfitting. Two, our main adaptive algorithm, which is effectively learning the presence of stochasticity and the right model order, is remarkably competitive with the optimal Follow-the-Leader algorithm, which possesses oracle knowledge of both. Viewed another way, this competitiveness of adaptive algorithms suggests that there is only a small price to pay to incorporate adversarial robustness into existing stochastic model selection frameworks. Appendix D provides an additional example of an i.i.d. process, the simplest possible model, which further illuminates both the benefits of adaptivity and the costs of its absence.

## 6 Discussion

##### Summarization of contributions

We study the problem of binary contextual prediction (easily generalizable to larger alphabets) with the 0-1 loss. We design an algorithm that incorporates recent advances in adaptivity with contextual pre-weighting, and show that we can simultaneously adapt to the model order complexity and the existence of stochasticity. By adaptively recovering the stochastic structural risk minimization framework, we are able to select the right d-order model for the stochastic process, and obtain regret rates that are competitive with those of the optimal greedy algorithm which knows not only the presence of stochastic structure, but the exact value of d. As far as we know, our work provides the first perspective on online stochastic model selection in a more challenging environment where we need to distinguish between actual stochasticity and adversarial behavior: the case where the data is not, in fact, coming from any of these models.

##### Future directions

Many future directions arise from this work. First, we acknowledge that the regret rate we obtain is not exactly optimal, particularly in terms of the multiplicative factor in the exponent. It would be interesting to understand whether we can further improve on this factor in our bound, either by analyzing other existing algorithms that learn the learning rate [KVE15, LS15], or by devising a new approach altogether. The simple experts setting was the first natural choice to study this question, and we are hopeful that the positive results obtained here can be generalized to online optimization to develop a universal theory for simultaneous model selection and stochastic adaptivity; recent advances have been made, separately, in both of these areas [Ora14, vEK16, OP16, FKMS17]. We are also interested in studying these problems under limited-information feedback, which would lead to the contextual bandits setting.

#### Acknowledgments

We would like to thank Sebastien Gerchinovitz for useful discussions. We gratefully acknowledge the support of the NSF through grants AST-1444078, ECCS-1343398, CNS-1321155 and IIS-1619362. We also credit the DARPA Spectrum Challenge for inspiring some of the ideas in this work, and generous gifts from Futurewei.

## References

• [ADBL11] Alekh Agarwal, John C Duchi, Peter L Bartlett, and Clement Levrard. Oracle inequalities for computationally budgeted model selection. In Proceedings of the 24th Annual Conference on Learning Theory, pages 69–86, 2011.
• [CBFH97] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.
• [CBMS07] Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.
• [DRVEGK14] Steven De Rooij, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(1):1281–1316, 2014.
• [EKRG11] Tim V Erven, Wouter M Koolen, Steven D Rooij, and Peter Grünwald. Adaptive hedge. In Advances in Neural Information Processing Systems, pages 1656–1664, 2011.
• [FKMS17] Dylan J Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6022–6032, 2017.
• [FMG92] Meir Feder, Neri Merhav, and Michael Gutman. Universal prediction of individual sequences. IEEE transactions on Information Theory, 38(4):1258–1270, 1992.
• [HS97] David P Helmbold and Robert E Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, 1997.
• [KGvE16] Wouter M Koolen, Peter Grünwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, pages 4457–4465, 2016.
• [KVE15] Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175, 2015.
• [LS15] Haipeng Luo and Robert E Schapire. Achieving all with no parameters: Adanormalhedge. In Conference on Learning Theory, pages 1286–1304, 2015.
• [Mas07] Pascal Massart. Concentration inequalities and model selection, volume 6. Springer, 2007.
• [OP16] Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pages 577–585, 2016.
• [Ora14] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
• [RS13] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, 2013.
• [SS12] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
• [T04] Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
• [vEK16] Tim van Erven and Wouter M Koolen. Metagrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems, pages 3666–3674, 2016.

## Appendix A Main proofs of ContextTreeAdaHedge(D)

### A.1 Second-order regret bound and adversarial result

We first obtain our second-order regret bound, stated generally for a prior function g. Tables 1 and 2 recap the basic notation for regret minimization and important algorithmic notation, and are useful to consult while reading the proof of the second-order bound.

Recall the expression for the computationally naive update in Equation (5):

 w_{t,f}^(tree)(η_t; g) = (Σ_{h=order(f)}^D g(h)) e^{−η_t L_{t,f}} / [Σ_{f′∈F_D} (Σ_{h=order(f′)}^D g(h)) e^{−η_t L_{t,f′}}],

and the expression for the initial distribution on tree experts based on Definition 1:

 w_{1,f}^(tree)(g) = (Σ_{h=order(f)}^D g(h)) / Z(g),

where Z(g) is the initial normalizing factor. The explicit expression for the normalizing factor is Z(g) = Σ_{f∈F_D} Σ_{h=order(f)}^D g(h) = Σ_{h=0}^D 2^{2^h} g(h).

###### Lemma 1.

ContextTreeAdaHedge with prior function g obtains regret

 R_{T,d} ≤ (√(V_T ln 2) + (2/3) ln 2 + 1) (1 + ln(Z(g)/g(d)) / ln 2)

for every d ≤ D.

###### Proof.

Recall that f*_{T,d} denotes the best d-order tree expert at round T for the given loss sequence. We denote L*_{T,d} as the actual loss incurred by this expert. We start with the computationally naive update on the probability distribution over tree experts as in Equation (5), and the proof proceeds in a very similar manner to the variance-based regret bound for vanilla AdaHedge [DRVEGK14]. We denote

 h_t(η_t; g) := ⟨w_t(η_t; g), ℓ_t⟩ = ⟨w_t^(tree)(η_t; g), ℓ_t^(tree)⟩,
 H_T(η_1^T; g) := Σ_{t=1}^T h_t(η_t; g),
 m_t(η_t; g) := −(1/η_t) ln⟨w_t(η_t; g), e^{−η_t ℓ_t}⟩ = −(1/η_t) ln⟨w_t^(tree)(η_t; g), e^{−η_t ℓ_t^(tree)}⟩,
 M_T(η_1^T; g) := Σ_{t=1}^T m_t(η_t; g).

Recall that the mixability gap δ_t(η_t; g) := h_t(η_t; g) − m_t(η_t; g), and Δ_T(η_1^T; g) = Σ_{t=1}^T δ_t(η_t; g). Since the instantaneous losses are bounded between 0 and 1, it is easy to show that 0 ≤ δ_t(η_t; g) ≤ 1.

A standard argument tells us that

 R_{T,d} = H_T(η_1^T; g) − L*_{T,d}
  = H_T(η_1^T; g) − M_T(η_1^T; g) + M_T(η_1^T; g) − L*_{T,d}
  = M_T(η_1^T; g) − L*_{T,d} + Δ_T(η_1^T; g).

Recall that the sequence {η_t} is decreasing, as an automatic consequence of the update in Equation (2) and the non-negativity of δ_t. Handling a time-varying, data-dependent learning rate is well known to be challenging [EKRG11, DRVEGK14]. We invoke a simple lemma from the original proof of AdaHedge [DRVEGK14] that helps us effectively substitute the final learning rate.

###### Lemma 2 ([Drvegk14]).

For any exponential-weights update with a decreasing learning rate and prior function g, we have M_T(η_1^T; g) ≤ M_T({η_T}_{t=1}^T; g).

Thus, we get

 R_{T,d} ≤ M_T({η_T}_{t=1}^T; g) − L*_{T,d} + Δ_T(η_1^T; g).   (15)

We also have the following simple intermediate result for M_T({η_T}_{t=1}^T; g), which is simply a slightly more general version of the corresponding lemma in [DRVEGK14] that applies to non-uniform priors.

###### Lemma 3.
 M_T({η_T}_{t=1}^T; g) ≤ L*_{T,d} + (1/η_T) ln(Z(g)/g(d)).
###### Proof.

We note that

 ⟨w_1^(tree)(g), e^{−η_T L_T^(tree)}⟩ ≥ w_{1,f*_{T,d}}^(tree)(g) e^{−η_T L*_{T,d}}.

Because the initial distribution is normalized to sum to 1, a simple telescoping argument can be used to give M_T({η_T}_{t=1}^T; g) = −(1/η_T) ln⟨w_1^(tree)(g), e^{−η_T L_T^(tree)}⟩.

This automatically tells us that

 M_T({η_T}_{t=1}^T; g) = −(1/η_T) ln(⟨w_1^(tree)(g), e^{−η_T L_T^(tree)}⟩)
  ≤ −(1/η_T) ln(w_{1,f*_{T,d}}^(tree)(g)) + L*_{T,d}
  = L*_{T,d} + (1/η_T) ln(1 / w_{1,f*_{T,d}}^(tree)(g))
  = L*_{T,d} + (1/η_T) ln(Z(g) / Σ_{h=d}^D g(h))
  ≤ L*_{T,d} + (1/η_T) ln(Z(g)/g(d)),

thus proving the lemma. ∎

Now, Equation (15) and Lemma 3, together with the definition of η_T in Equation (2), give us

 R_{T,d} ≤ (1/η_T) ln(Z(g)/g(d)) + Δ_T(η_1^T; g)
  = (ln(Z(g)/g(d)) / ln 2) Δ_{T−1}(η_1^{T−1}; g) + Δ_T(η_1^T; g).

From the non-negativity of δ_t, we have Δ_{T−1}(η_1^{T−1}; g) ≤ Δ_T(η_1^T; g), and so

 R_{T,d} ≤ Δ_T(η_1^T; g) (1 + ln(Z(g)/g(d)) / ln 2).   (16)

It now remains to bound the quantity Δ_T(η_1^T; g) in terms of the variance. In fact, it will be useful to define the slightly more generic quantities

 Δ_{T_0}^T(η_{T_0}^T; g) := Σ_{t=T_0}^T δ_t(η_t; g),
 V_{T_0}^T(η_{T_0}^T; g) := Σ_{t=T_0}^T v_t(η_t; g),  where
 v_t(η_t; g) := Var_{K_t ∼ w_t(η_t; g)}[ℓ_{t,K_t}].

The bound is described below.

###### Lemma 4.

We have

 Δ_{T_0}^T(η_{T_0}^T; g) ≤ √(V_{T_0}^T(η_{T_0}^T; g) ln 2) + (2/3) ln 2 + 1.
###### Proof.

The argument is similar to the original AdaHedge proof [DRVEGK14] and proceeds below. We use a telescoping sum to get

 (Δ_{T_0}^T(η_{T_0}^T; g))² = Σ_{t=T_0}^T [(Δ_{T_0}^t(η_{T_0}^t; g))² − (Δ_{T_0}^{t−1}(η_{T_0}^{t−1}; g))²]
  = Σ_{t=T_0}^T [(Δ_{T_0}^{t−1}(η_{T_0}^{t−1}; g) + δ_t(η_t; g))² − (Δ_{T_0}^{t−1}(η_{T_0}^{t−1}; g))²]
  ≤ Σ_{t=T_0}^T [2 δ_t(η_t; g) Δ_{t−1}(η_1^{t−1}; g) + (δ_t(η_t; g))²]
  = Σ_{t=T_0}^T [2 δ_t(η_t; g) (ln 2 / η_t) + (δ_t(η_t; g))²]
  ≤ Σ_{t=T_0}^T [2 δ_t(η_t; g) (ln 2 / η_t) + δ_t(η_t; g)]   (since δ_t(η_t; g) ≤ 1)
  = (2 ln 2) Σ_{t=T_0}^T δ_t(η_t; g)/η_t + Δ_{T_0}^T(η_{T_0}^T; g).

We also recall the following lemma from the original proof of AdaHedge [DRVEGK14]. The proof of this lemma involves a Bernstein tail bounding argument.

###### Lemma 5 ([Drvegk14]).

We have

 δ_t(η_t; g)/η_t ≤ (1/2) v_t(η_t; g) + (1/3) δ_t(η_t; g).

Using Lemma 5, we then get

 (Δ_{T_0}^T(η_{T_0}^T; g))² ≤ V_{T_0}^T(η_{T_0}^T; g) ln 2 + ((2/3) ln 2 + 1) Δ_{T_0}^T(η_{T_0}^T; g),   (17)

which is an inequality in Δ_{T_0}^T(η_{T_0}^T; g) of quadratic form. We now solve Equation (17), and use Fact 2 from Appendix C to get

 Δ_{T_0}^T(η_{T_0}^T; g) ≤ √(V_{T_0}^T(η_{T_0}^T; g) ln 2) + (2/3) ln 2 + 1.   (18)

Now we complete the proof of Lemma 1 by combining Equations (16) and (18) for the special case of T_0 = 1. ∎

Now, noting that V_T ≤ T/4 and substituting the expression for g_prop from Equation (11) directly proves Equation (12) from Lemma 1. To see this, we substitute g = g_prop into the statement of Lemma 1 to get

 RT,d ≤(√VT(