1 Introduction
A key problem in the design of learning algorithms is the choice of the hypothesis set. This is known as the model selection problem. The choice of hypothesis set is driven by inherent tradeoffs. In the statistical learning setting, this can be analyzed in terms of the estimation and approximation errors. A richer or more complex hypothesis set helps better approximate the Bayes predictor (smaller approximation error). On the other hand, a hypothesis set that is too complex may have too large a VC-dimension or have unfavorable Rademacher complexity, thereby resulting in looser guarantees on the difference between the loss of a hypothesis and that of the best-in-class (large estimation error).
In the batch setting, this problem has been extensively studied with the main ideas originating in the seminal work of Vapnik and Chervonenkis (1971) and Vapnik (1982) and the principle of Structural Risk Minimization (SRM). This is typically formulated as follows: given an infinite sequence of hypothesis sets (or models), the problem consists of using the training sample to select a hypothesis set with a favorable estimation-approximation tradeoff and choosing the best hypothesis in that set.
If we had access to a hypothetical oracle informing us of the best choice of hypothesis set for a given instance, the problem would reduce to the standard one of learning with a fixed hypothesis set. Remarkably though, techniques such as SRM or similar penalty-based model selection methods return a hypothesis that enjoys finite-sample learning guarantees that are almost as favorable as those that would be obtained had an oracle informed us of the index of the best-in-class classifier’s hypothesis set (Vapnik, 1982; Devroye et al., 1996; Shawe-Taylor et al., 1998; Koltchinskii, 2001; Bartlett et al., 2002; Massart, 2007). Such guarantees are sometimes referred to as oracle inequalities. They can be derived even for data-dependent penalties (Koltchinskii, 2001; Bartlett et al., 2002; Bartlett and Mendelson, 2003).
Such results naturally raise the following questions in the online setting: can we develop an analogous theory of model selection in online learning? Can we design online algorithms for model selection with solutions benefiting from strong guarantees, analogous to the batch ones? Unlike the statistical setting, in online learning one cannot split samples to first learn the optimal predictor within each subclass and then later learn the optimal subclass choice.
A series of recent works on online learning provide some positive results in that direction. On the algorithmic side, McMahan and Abernethy (2013); McMahan and Orabona (2014); Orabona (2014); Orabona and Pál (2016) present solutions that efficiently achieve model selection oracle inequalities for the important special case where the hypothesis sets form a sequence of nested balls in a Hilbert space. On the theoretical side, a different line of work focusing on general hypothesis classes (Foster et al., 2015) uses martingale-based sequential complexity measures to show that, information-theoretically, one can obtain oracle inequalities in the online setting at a level of generality comparable to that of batch statistical learning. However, this last result is not algorithmic.
The first approach that a familiar reader might think of for tackling the online model selection problem is to run, for each hypothesis set in the sequence, an online learning algorithm that minimizes regret against that set, and then aggregate over these algorithms using the multiplicative weights algorithm for prediction with expert advice. This would work if all the losses or “experts” considered were uniformly bounded by a reasonably small quantity. However, in many reasonable problems, particularly those arising in the context of online convex optimization, the losses of the predictors or experts for each hypothesis set may grow with the index of that set. Using simple aggregation would make our regret scale with the magnitude of the largest hypothesis set rather than with that of the comparator we want to compete against. This is the main technical challenge faced in this context, and one that we fully address in this paper.
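To make the scaling issue concrete, the following is a minimal sketch of the naive aggregation strategy, not the algorithm proposed in this paper. It runs standard multiplicative weights (Hedge) with the textbook learning rate for losses in a range of width `2 * c_max`. The function name `hedge_regret`, the two loss scales, and the random loss sequence are assumptions made purely for illustration. With this tuning the expected regret against every expert is at most `c_max * sqrt(2 n ln N)`, so the guarantee against the small-scale expert is governed by the largest scale.

```python
import math
import random


def hedge_regret(losses, c_max):
    """Run multiplicative weights on losses[t][i], each in [-c_max, c_max].

    Returns the learner's expected regret against each expert.
    """
    n, num_experts = len(losses), len(losses[0])
    # Textbook tuning for a loss range of width 2 * c_max.
    eta = math.sqrt(8.0 * math.log(num_experts) / n) / (2.0 * c_max)
    cum = [0.0] * num_experts  # cumulative loss of each expert
    alg_loss = 0.0             # expected cumulative loss of the learner
    for g in losses:
        shift = min(cum)  # subtracting the minimum keeps exp() from overflowing
        weights = [math.exp(-eta * (c - shift)) for c in cum]
        z = sum(weights)
        alg_loss += sum(w / z * gi for w, gi in zip(weights, g))
        cum = [c + gi for c, gi in zip(cum, g)]
    return [alg_loss - c for c in cum]


# Illustrative data: one expert with losses of scale 1, one of scale 1000.
random.seed(0)
scales = [1.0, 1000.0]
n = 2000
losses = [[random.uniform(-c, c) for c in scales] for _ in range(n)]
regrets = hedge_regret(losses, max(scales))
uniform_bound = max(scales) * math.sqrt(2.0 * n * math.log(len(scales)))
print(regrets, uniform_bound)
```

The guarantee `uniform_bound` is the same for both experts, even though the first expert's losses are a thousand times smaller; a scale-sensitive bound would instead depend on each expert's own range.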
Our results are based on a novel multiscale algorithm for prediction with expert advice. This algorithm works in a situation where the different experts’ losses lie in different ranges, and guarantees that the regret to each individual expert is adapted to the range of its losses. The algorithm can also take advantage of a given prior over the experts reflecting their importance. This general, abstract setting of prediction with expert advice yields online model selection algorithms for a host of applications detailed below in a straightforward manner.
First, we give efficient algorithms for model selection for nested linear classes that provide oracle inequalities in terms of the norm of the benchmark to which the algorithm’s performance is compared. Our algorithm works for any norm, which considerably generalizes previous work (McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona, 2014; Orabona and Pál, 2016) and gives the first polynomial-time online model selection for a number of online linear optimization settings. This includes online oracle inequalities for high-dimensional learning tasks such as online PCA and online matrix prediction. We then generalize these results even further by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. This yields algorithms for applications such as online penalized risk minimization and multiple kernel learning.
1.1 Preliminaries
Notation.
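For a given norm $\|\cdot\|$, let $\|\cdot\|_{\star}$ denote the dual norm. Likewise, for any function $F$, $F^{\star}$ will denote its Fenchel conjugate. For a Banach space $(\mathfrak{B}, \|\cdot\|)$, the dual is $(\mathfrak{B}^{\star}, \|\cdot\|_{\star})$. We use $x_{1:n}$ as shorthand for a sequence of vectors $(x_1, \ldots, x_n)$. For such sequences, we will use $x_t[i]$ to denote the $t$th vector’s $i$th coordinate. We let $e_i$ denote the $i$th standard basis vector. $\|\cdot\|_p$ denotes the $\ell_p$ norm, $\|\cdot\|_{\sigma}$ denotes the spectral norm, and $\|\cdot\|_{\Sigma}$ denotes the trace norm. For any $p \in [1, \infty]$, let $p'$ be such that $1/p + 1/p' = 1$.
Setup and goals.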
We work in two closely related settings: online convex optimization and online supervised learning. In online convex optimization, the learner selects decisions $w_t$ from a convex subset $\mathcal{W}$ of some Banach space $\mathfrak{B}$. Regret to a comparator $w \in \mathcal{W}$ in this setting is defined as $\sum_{t=1}^{n} f_t(w_t) - \sum_{t=1}^{n} f_t(w)$.
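Suppose $\mathcal{W}$ can be decomposed into sets $\mathcal{W}_1, \ldots, \mathcal{W}_N$. For a fixed set $\mathcal{W}_k$, the optimal regret, if one tailors the algorithm to compete with $\mathcal{W}_k$, is typically characterized by some measure of intrinsic complexity of the class (such as Littlestone’s dimension (Ben-David et al., 2009) and sequential Rademacher complexity (Rakhlin et al., 2010)), denoted $\mathbf{Comp}_n(\mathcal{W}_k)$. We would like to develop algorithms that predict a sequence $(w_t)$ such that
$$\sum_{t=1}^{n} f_t(w_t) \le \min_{k \in [N]} \left\{ \min_{w \in \mathcal{W}_k} \sum_{t=1}^{n} f_t(w) + \mathbf{Comp}_n(\mathcal{W}_k) + \mathbf{Pen}_n(k) \right\}. \tag{1}$$
This equation is called an oracle inequality and states that the performance of the sequence $(w_t)$ matches that of a comparator that minimizes the bias-variance tradeoff $\min_{w \in \mathcal{W}_k} \sum_{t=1}^{n} f_t(w) + \mathbf{Comp}_n(\mathcal{W}_k)$, up to a penalty $\mathbf{Pen}_n(k)$ whose scale ideally matches that of $\mathbf{Comp}_n(\mathcal{W}_k)$. We shall see shortly that ensuring that the scale of $\mathbf{Pen}_n(k)$ does indeed match $\mathbf{Comp}_n(\mathcal{W}_k)$ is the core technical challenge in developing online oracle inequalities for commonly used classes.
In the supervised learning setting we measure regret against a benchmark class of functions defined on $\mathcal{X}$, where $\mathcal{X}$ is some abstract context space, also called feature space. In this case, the desired oracle inequality has the form: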
(2) 
2 Online Model Selection
2.1 The need for multiscale aggregation
Let us briefly motivate the main technical challenge overcome by the model selection approach we consider. The most widely studied oracle inequality in online learning has the following form
(3) 
In light of (1), a model selection approach to obtaining this inequality would be to split the decision set into norm balls of doubling radius, i.e. $\mathcal{W}_k = \{w : \|w\|_2 \le 2^k\}$. A standard fact (Hazan, 2016) is that the regret of Mirror Descent over such a ball scales linearly with its radius and as $\sqrt{n}$ in the number of rounds, and so obtaining the oracle inequality (1) is sufficient to recover (3), so long as the penalty term is not too large relative to the complexity term.
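The doubling construction can be illustrated with a short sketch. The indexing convention (radius $2^k$ for $k = 0, 1, 2, \ldots$) and the helper name `smallest_ball_index` are assumptions made for this example. It shows why doubling is enough: the smallest ball containing a comparator has radius at most twice the comparator's norm (or radius 1 for very small comparators), so competing with that ball loses only a constant factor in the radius.

```python
def smallest_ball_index(norm_w):
    """Return the smallest k >= 0 such that a ball of radius 2**k contains w."""
    k = 0
    while 2.0 ** k < norm_w:
        k += 1
    return k


# Comparator norms spanning several orders of magnitude (illustrative values).
norms = [0.3, 1.0, 5.0, 37.5, 1000.0]
radii = [2.0 ** smallest_ball_index(x) for x in norms]
for x, r in zip(norms, radii):
    # The selected radius never exceeds twice max(norm, 1).
    print(x, r, r <= 2.0 * max(x, 1.0))
```

Only about $\log_2 R$ balls are needed to cover comparators of norm up to $R$, which is what keeps the number of experts manageable.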
Online model selection is fundamentally a problem of prediction with expert advice (Cesa-Bianchi and Lugosi, 2006), where the experts correspond to the different model classes one is choosing from. Our basic meta-algorithm, MultiScaleFTPL, operates in the following setup. The algorithm has access to a finite number, $N$, of experts. In each round, the algorithm is required to choose one of the experts. Then the losses of all experts are revealed, and the algorithm incurs the loss of the chosen expert.
The twist from the standard setup is that the losses of all the experts are not uniformly bounded in the same range. Indeed, for the setup described for the oracle inequality (3), the class of radius $2^k$ will produce predictions with norm as large as $2^k$. Therefore, here, we assume that expert $i$ incurs losses in the range $[-c_i, c_i]$, for some known parameter $c_i$. The goal is to design an online learning algorithm whose regret to expert $i$ scales with $c_i$, rather than $\max_j c_j$, which is what previous algorithms for learning from expert advice (such as the standard multiplicative weights strategy or AdaHedge (De Rooij et al., 2014)) would achieve. Indeed, any regret bound scaling in $\max_j c_j$ will be far too large to achieve (3), as the largest scale will dominate. This new type of scale-sensitive regret bound, achieved by our algorithm MultiScaleFTPL, is stated below.
Theorem 1.
Suppose the loss sequence $(g_t)_{t \le n}$ satisfies $|g_t[i]| \le c_i$ for a sequence $(c_i)_{i \in [N]}$ with each $c_i \ge 1$. Let $\pi$ be a given prior distribution on the experts. Then playing the MultiScaleFTPL strategy yields the regret bound below. (Footnote 1: This regret bound holds in expectation over the player’s randomization. It is assumed that each loss vector $g_t$ is selected before the player’s randomized choice is revealed, but may adapt to the distribution over that choice. A slightly stronger version of this bound also holds, and a similar strengthening applies to all subsequent bounds.)
(4) 
The proof of the theorem is deferred to the supplementary material due to space constraints. Briefly, the proof follows the technique of adaptive relaxations from Foster et al. (2015). It relies on showing that a suitable function of the first $t$ loss vectors is an admissible relaxation (see Foster et al. (2015) for definitions).
This implies that if we play the MultiScaleFTPL strategy, the regret to the $i$th expert is bounded in terms of the relaxation applied to an empty sequence of loss vectors. As a final step, we bound this quantity using a probabilistic maximal inequality (given in the supplementary material), yielding the bound (4). Compared to related FTPL algorithms (Rakhlin et al., 2012), the analysis is surprisingly delicate, as additive factors can spoil the desired regret bound (4) if the scales $c_i$ differ by orders of magnitude.
The min-max optimization problem in MultiScaleFTPL can be solved in polynomial time using linear programming; see the supplementary material for a full discussion.
In related work, Bubeck et al. (2017) simultaneously developed a multiscale experts algorithm which could also be used in our framework. Their regret bound has suboptimal dependence on the prior distribution over experts, but their algorithm is more efficient and is able to obtain multiplicative regret guarantees.
2.2 Online convex optimization
One can readily apply MultiScaleFTPL for online optimization problems whenever it is possible to bound the losses of the different experts a priori. One such application is to online convex optimization, where each “expert” is a particular OCO algorithm, and for which such a bound can be obtained via appropriate bounds on the relevant norms of the parameter vectors and the gradients of the loss functions. We detail this application, which yields algorithms for parameter-free online learning and more, below. All of the algorithms in this section are derived using a unified meta-algorithm strategy, MultiScaleOCO.
The setup is as follows. We have access to $N$ sub-algorithms, denoted $\mathrm{Alg}_i$ for $i \in [N]$. In round $t$, each sub-algorithm $\mathrm{Alg}_i$ produces a prediction $w_t^i \in \mathcal{W}_i$, where $\mathcal{W}_i$ is a set in a vector space $V$ over $\mathbb{R}$ containing $0$. Our meta-algorithm is then required to choose one of the predictions $w_t^i$. Then, a loss function $f_t : V \to \mathbb{R}$ is revealed, whereupon $\mathrm{Alg}_i$ incurs loss $f_t(w_t^i)$, and the meta-algorithm suffers the loss of the chosen prediction. We make the following assumption on the sub-algorithms:
Assumption 1.
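The sub-algorithms satisfy the following conditions:
• For each $i \in [N]$, there is an associated norm $\|\cdot\|_{(i)}$ such that $\sup_{w \in \mathcal{W}_i} \|w\|_{(i)} \le R_i$.
• For each $i \in [N]$, the sequence of functions $f_t$ are $L_i$-Lipschitz on $\mathcal{W}_i$ with respect to $\|\cdot\|_{(i)}$.
• For each sub-algorithm $\mathrm{Alg}_i$, the iterates $(w_t^i)_{t \le n}$ enjoy a regret bound $\sum_{t=1}^{n} f_t(w_t^i) - \inf_{w \in \mathcal{W}_i} \sum_{t=1}^{n} f_t(w) \le \mathbf{Reg}_n(i)$, where $\mathbf{Reg}_n(i)$ may be data- or algorithm-dependent.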
In most applications, each decision set will be a convex set and each loss a convex function; this convexity is not necessary to prove a regret bound for the meta-algorithm. We simply need boundedness of the decision sets and Lipschitzness of the loss functions, as specified in Assumption 1. This assumption implies that for any $i$, we have $|f_t(w) - f_t(0)| \le R_i L_i$ for any $w \in \mathcal{W}_i$. Thus, we can design a meta-algorithm for this setup by using MultiScaleFTPL with $c_i = R_i L_i$, which is precisely what MultiScaleOCO does. The following theorem provides a bound on the regret of MultiScaleOCO; it is a direct consequence of Theorem 1.
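The aggregation protocol and the loss-range claim can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the `choose` callback is a placeholder for the MultiScaleFTPL strategy (which is not reproduced here), the sub-algorithms are replaced by fixed stand-in predictors, and the names `centered_loss` and `run_protocol` are hypothetical. The sketch works in one dimension with losses $f_t(w) = |w - a_t|$, which are 1-Lipschitz, and records the centered losses $f_t(w) - f_t(0)$ that would be fed to the experts algorithm.

```python
import random


def centered_loss(f, w):
    """Loss passed to the experts algorithm: f(w) - f(0), within [-R*L, R*L]."""
    return f(w) - f(0.0)


def run_protocol(predictors, loss_fns, choose):
    """One pass of the aggregation protocol.

    predictors: callables t -> prediction of sub-algorithm k in round t
                (stand-ins for real sub-algorithm iterates)
    loss_fns:   callables w -> f_t(w), one per round
    choose:     callable (t, history) -> expert index; a placeholder for
                the MultiScaleFTPL strategy
    """
    history = []  # one vector of centered expert losses per round
    total = 0.0   # loss suffered by the meta-algorithm
    for t, f in enumerate(loss_fns):
        preds = [p(t) for p in predictors]
        k = choose(t, history)
        total += f(preds[k])
        history.append([centered_loss(f, w) for w in preds])
    return total, history


random.seed(1)
radii = [1.0, 4.0, 16.0]  # assumed radii R_k of the decision sets
lipschitz = 1.0           # assumed Lipschitz constant L
# Stand-in sub-algorithms: expert k always plays a point of norm R_k.
predictors = [lambda t, r=r: r for r in radii]
targets = [random.uniform(-20.0, 20.0) for _ in range(50)]
loss_fns = [lambda w, a=a: lipschitz * abs(w - a) for a in targets]
total, history = run_protocol(
    predictors, loss_fns, lambda t, h: random.randrange(len(radii))
)
scales = [r * lipschitz for r in radii]  # c_k = R_k * L_k
```

Every recorded loss for expert `k` lies in `[-scales[k], scales[k]]`, which is the per-expert range the multi-scale experts algorithm needs as input.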
Theorem 2.
Without loss of generality, assume that $R_i L_i \ge 1$. (Footnote 2: For notational convenience, all Lipschitz bounds are assumed to be at least 1, without loss of generality, for the remainder of the paper.) Suppose that the inputs to MultiScaleOCO satisfy Assumption 1. Then the iterates returned by MultiScaleOCO satisfy the regret bound
(5) 
Theorem 2 shows that if we use MultiScaleOCO to aggregate the iterates produced by a collection of sub-algorithms, the regret against any sub-algorithm will depend only on that algorithm’s scale, not on the regret of the worst sub-algorithm.
Application 1: Parameter-free online learning in uniformly convex Banach spaces.
As the first application of our framework, we give a generalization of the parameter-free online learning bounds found in McMahan and Abernethy (2013); McMahan and Orabona (2014); Orabona (2014); Orabona and Pál (2016); Cutkosky and Boahen (2016) from Hilbert spaces to arbitrary uniformly convex Banach spaces. Recall that a Banach space $(\mathfrak{B}, \|\cdot\|)$ is $(2, \lambda)$-uniformly convex if $\frac{1}{2}\|\cdot\|^2$ is $\lambda$-strongly convex with respect to itself (Pisier, 2011). Our algorithm obtains a generalization of the oracle inequality (3) for any uniformly convex $(\mathfrak{B}, \|\cdot\|)$ by running multiple instances of Mirror Descent, the workhorse of online convex optimization, and aggregating their iterates using MultiScaleOCO. This strategy is thus efficient whenever Mirror Descent can be implemented efficiently. The collection of sub-algorithms used by MultiScaleOCO, which was alluded to at the beginning of this section, is as follows: For each , set , , , , and . Finally, set .
Mirror Descent is reviewed in detail in the supplementary material, but the only feature of its performance of importance to our analysis is that, when configured as described above, the iterates produced by each sub-algorithm specified above will satisfy the standard Mirror Descent regret bound against the corresponding ball on any sequence of losses that are Lipschitz with respect to the norm. Using just this simple fact, combined with the regret bound for MultiScaleOCO and a few technical details given in the supplementary material, we can deduce the following parameter-free learning oracle inequality:
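As a concrete instance of a sub-algorithm, the following sketch implements projected online gradient descent on a Euclidean ball, which is Mirror Descent in the Hilbert-space special case. It is not the general Banach-space configuration of the paper; the function names, the fixed step size $\eta = R / (L\sqrt{n})$, and the random linear losses are assumptions chosen for illustration. For linear losses with gradient norm at most $L$, this tuning gives regret at most $R L \sqrt{n}$ against any comparator in the ball of radius $R$, which is the kind of per-ball guarantee the aggregation step relies on.

```python
import math
import random


def project_l2(w, radius):
    """Project w onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return w
    return [x * radius / norm for x in w]


def ogd_linear_regret(grads, radius, lipschitz):
    """Regret of projected online gradient descent on losses <g_t, w>."""
    n, dim = len(grads), len(grads[0])
    eta = radius / (lipschitz * math.sqrt(n))  # fixed step size
    w = [0.0] * dim
    alg_loss = 0.0
    total_grad = [0.0] * dim
    for g in grads:
        alg_loss += sum(gi * wi for gi, wi in zip(g, w))
        total_grad = [s + gi for s, gi in zip(total_grad, g)]
        w = project_l2([wi - eta * gi for wi, gi in zip(w, g)], radius)
    # The best fixed point in the ball for a linear loss is -R * G / ||G||.
    best = -radius * math.sqrt(sum(s * s for s in total_grad))
    return alg_loss - best


random.seed(0)
n, dim, radius, lipschitz = 500, 3, 2.0, 1.0
grads = []
for _ in range(n):
    g = [random.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = max(math.sqrt(sum(x * x for x in g)), 1e-12)
    grads.append([lipschitz * x / norm for x in g])  # gradient norm equals L
regret = ogd_linear_regret(grads, radius, lipschitz)
bound = radius * lipschitz * math.sqrt(n)
print(regret, bound)
```

Running one such instance per radius and aggregating them is what turns the per-ball guarantee into a bound that adapts to the comparator's norm.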
Theorem 3 (Oracle inequality for uniformly convex Banach spaces).
The iterates produced by MultiScaleOCO on any Lipschitz (w.r.t. ) sequence of losses satisfy
(6) 
Note that the above oracle inequality applies for any uniformly convex norm . Previous results only obtain bounds of this form efficiently when is a Hilbert space norm or . As is standard for such oracle inequality results, the bound is weaker than the optimal bound if were selected in advance, but only by a mild factor.
Proposition 1.
The algorithm can be implemented in time per iteration, where is the time complexity of a single Mirror Descent update.
In the example above, the $(2, \lambda)$-uniform convexity condition was mainly chosen for familiarity. The result can easily be generalized to related notions such as $q$-uniform convexity (see Srebro et al. (2011)). More generally, the approach can be used to derive oracle inequalities with respect to a general strongly convex regularizer defined over the space . Such a bound would have the form for typical choices of .
Application 2: Oracle inequality for many norms.
It is instructive to think of MultiScaleOCO as executing a (scale-sensitive) online analogue of the structural risk minimization principle. We simply specify a set of subclasses and a prior specifying the importance of each subclass, and we are guaranteed that the algorithm’s performance matches that of each subclass, plus a penalty depending on the prior weight placed on that subclass. The advantage of this approach is that the nested structure used in Theorem 3 is completely inessential. This leads to the exciting prospect of developing parameter-free algorithms over new and exotic set systems. One such example is given now: the MultiScaleOCO framework allows us to obtain an oracle inequality with respect to many norms in $\mathbb{R}^d$ simultaneously. To the best of our knowledge, all previous works on parameter-free online learning have only provided oracle inequalities for a single norm.
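The structural-risk-minimization reading can be made concrete with a small numerical sketch. The penalty form used here, $c_k \sqrt{n \log(1/\pi_k)}$, is an assumption chosen only to illustrate a scale-sensitive penalty that depends on the prior weight; the exact penalty (including logarithmic factors) is the one given by the theorems in this paper. The function name `oracle_benchmark` and all numbers are hypothetical.

```python
import math


def oracle_benchmark(best_losses, scales, prior, n):
    """Pick the subclass minimizing (best loss in class + assumed penalty).

    best_losses[k]: cumulative loss of the best comparator in subclass k
    scales[k]:      loss range c_k of subclass k
    prior[k]:       prior weight pi_k placed on subclass k
    """
    scores = []
    for loss_k, c_k, pi_k in zip(best_losses, scales, prior):
        penalty = c_k * math.sqrt(n * math.log(1.0 / pi_k))  # assumed form
        scores.append(loss_k + penalty)
    k_star = min(range(len(scores)), key=scores.__getitem__)
    return k_star, scores


# Illustrative numbers: larger subclasses fit better but cost more.
n = 10000
best_losses = [-50.0, -600.0, -700.0]
scales = [1.0, 4.0, 16.0]
prior = [0.5, 0.25, 0.25]
k_star, scores = oracle_benchmark(best_losses, scales, prior, n)
print(k_star, scores)
```

In this example the middle subclass gives the best tradeoff: the largest subclass has the lowest comparator loss, but its penalty grows with its own scale and outweighs the gain. With a penalty tied to the largest scale for every subclass, no such tradeoff would be visible.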
Theorem 4.
Fix . Suppose that the loss functions are Lipschitz w.r.t. for each . Then there is a computationally efficient algorithm that guarantees regret
(7) 
The configuration in the above theorem is described in full in the supplementary material. This strategy can be trivially extended to handle $p$ in the range $(2, \infty)$. The inequality holds for $p \ge 1 + \delta$ rather than for $p \ge 1$ because the $\ell_1$ norm is not uniformly convex, but this is easily rectified by changing the regularizer at $p = 1$; we omit this for simplicity of presentation.
We emphasize that the choice of $\ell_p$ norms for the result above was somewhat arbitrary; any finite collection of norms will also work. For example, the strategy can also be applied to matrix optimization by replacing the $\ell_p$ norm with the Schatten $S_p$ norm. The Schatten $S_p$ norm has strong convexity parameter on the order of $p - 1$ (which matches the $\ell_p$ norm up to absolute constants (Ball et al., 1994)), so the only practical change to the setup in Theorem 4 will be the running time of the Mirror Descent update. Likewise, the approach applies to group norms as used in multi-task learning (Kakade et al., 2012).
Application 3: Adapting to rank for online PCA
For the online PCA task, the learner predicts from a class . For a fixed value of , such a class is a convex relaxation of the set of all rank projection matrices. After producing a prediction , we experience affine loss functions , where .
We leverage an analysis of online PCA due to (Nie et al., 2013) together with MultiScaleOCO to derive an algorithm that competes with many values of the rank simultaneously. This gives the following result:
Theorem 5.
There is an efficient algorithm for Online PCA with regret bound
For a fixed value of , the above bound is already optimal up to log factors, but it holds for all simultaneously.
Application 4: Adapting to norm for Matrix Multiplicative Weights
In the Matrix Multiplicative Weights setting (Arora et al., 2012) we consider hypothesis classes of the form . Losses are given by , where . For a fixed value of , the well-known Matrix Multiplicative Weights strategy has regret against bounded by . Using this strategy for fixed as a sub-algorithm for MultiScaleOCO, we achieve the following oracle inequality efficiently:
Theorem 6.
There is an efficient matrix prediction strategy with regret bound
(8) 
A remark on efficiency
All of our algorithms that provide bounds of the form (6) instantiate experts with MultiScaleFTPL because, in general, the worst case for achieving can have norm as large as . If one has an a priori bound, say , on the range at which each attains its minimum, then the number of experts can be reduced to .
2.3 Supervised learning
We now consider the online supervised learning setting, with the goal being to compete with a sequence of hypothesis classes simultaneously. Working in this setting makes clear a key feature of the meta-algorithm approach we have adopted: we can efficiently obtain online oracle inequalities for arbitrary non-linear function classes, so long as we have an efficient algorithm for each class.
We obtain a supervised learning meta-algorithm by simply feeding the observed losses (which may even be non-convex) to the meta-algorithm MultiScaleFTPL in the same fashion as MultiScaleOCO.
The resulting strategy, which is described in detail in the supplementary material for completeness, is called MultiScaleLearning. We make the following assumptions, analogous to Assumption 1, which lead to the performance guarantee for MultiScaleLearning given in Theorem 7 below.
Assumption 2.
The subalgorithms used by MultiScaleLearning satisfy the following conditions:

For each , the iterates produced by subalgorithm satisfy .

For each , the function is Lipschitz on .

For each subalgorithm , the iterates enjoy a regret bound , where may be data or algorithmdependent.
Theorem 7.
Suppose that the inputs to MultiScaleLearning satisfy Assumption 2. Then the iterates produced by the algorithm enjoy the regret bound
(9) 
Online penalized risk minimization
In the statistical learning setting, oracle inequalities for arbitrary sequences of hypothesis classes are readily available. Such inequalities are typically stated in terms of complexity parameters for the classes such as VC dimension or Rademacher complexity. For the online learning setting, it is well-known that sequential Rademacher complexity provides a sequential counterpart to these complexity measures (Rakhlin et al., 2010), meaning that it generically characterizes the minimax optimal regret for Lipschitz losses. We will obtain an oracle inequality in terms of this parameter.
Assumption 3.
The sequence of hypothesis classes are such that

There is an efficient algorithm producing iterates satisfying for any Lipschitz loss, where is some constant. (an algorithm with this regret is always guaranteed to exist, but may not be efficient).

Each has output range , where without loss of generality.

— this is obtained by most nontrivial classes.
Theorem 8 (Online penalized risk minimization).
Under Assumption 3 there is an efficient (in ) algorithm that achieves the following regret bound for any Lipschitz loss:
(10) 
As in the previous section, one can derive tighter regret bounds and more efficient (e.g. sublinear in ) algorithms if the hypothesis classes are nested.
Application: Multiple kernel learning
Theorem 9.
Let be reproducing kernel Hilbert spaces for which each has a kernel such that . Then there is an efficient learning algorithm that guarantees
for any Lipschitz loss, whenever an efficient algorithm is available for the norm ball in each .
3 Discussion and Further Directions
Related work
There are two directions in parameter-free online learning that have been explored extensively. The first considers bounds of the form (3); namely, the Hilbert space version of the more general setting explored in Section 2.2. Beginning with Mcmahan and Streeter (2012), which obtained a slightly looser rate than (3), research has focused on obtaining tighter bounds of this type (McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona, 2014; Orabona and Pál, 2016); all of these algorithms run in linear time per update step. Recent work (Cutkosky and Boahen, 2016, 2017) has extended these results to the case where the Lipschitz constant is not known in advance. These works give lower bounds for general norms, but only give efficient algorithms for Hilbert spaces. Extending MultiScaleOCO to reach the Pareto frontier of regret in the unknown Lipschitz setting as described in Cutkosky and Boahen (2017) may be an interesting direction for future research.
The second direction concerns so-called “quantile bounds” (Chaudhuri et al., 2009; Koolen and Van Erven, 2015; Luo and Schapire, 2015; Orabona and Pál, 2016) for the experts setting, where the learner’s decision set is the simplex and losses are bounded in a fixed range. The multi-scale machinery developed in the present work is not needed to obtain bounds for this setting because the losses are uniformly bounded across all model classes. Indeed, Foster et al. (2015) recovered a basic form of quantile bound using the vanilla multiplicative weights strategy as a meta-algorithm. It is not known whether the more sophisticated data-dependent quantile bounds given in Koolen and Van Erven (2015); Luo and Schapire (2015) can be recovered in the same fashion.
Losses with curvature.
The $\sqrt{n}$-type regret bounds provided by MultiScaleFTPL are appropriate when the sub-algorithms themselves incur $O(\sqrt{n})$ regret bounds. However, assuming certain curvature properties (such as strong convexity, exp-concavity, stochastic mixability, etc. (Hazan et al., 2007; van Erven et al., 2015)) of the loss functions, it is possible to construct sub-algorithms that admit significantly more favorable regret bounds ($O(\log n)$ or even $O(1)$). These are also referred to as “fast rates” in online learning. A natural direction for further study is to design a meta-algorithm that admits logarithmic or constant regret to each sub-algorithm, assuming that the loss functions of interest satisfy similar curvature properties, with the regret to each individual sub-algorithm adapted to the curvature parameters for that sub-algorithm. Perhaps surprisingly, for the special case of the logistic loss, improper prediction and aggregation strategies similar to those proposed in this paper offer a way to circumvent known proper learning lower bounds (Hazan et al., 2014). This approach will be explored in detail in a forthcoming companion paper.
Computational efficiency.
We suspect that the running time our approach needs to obtain inequalities like (6) may be unavoidable, since we do not make use of the relationship between sub-algorithms beyond using the nested class structure. Whether the runtime of MultiScaleFTPL can be improved is an open question. This boils down to whether or not the min-max optimization problem in the algorithm description can simultaneously be solved in (1) linear time in the number of experts and (2) strongly polynomial time in the scales $c_i$.
Acknowledgements
We thank Francesco Orabona and Dávid Pál for inspiring initial discussions. Part of this work was done while DF was an intern at Google Research and while DF and KS were visiting the Simons Institute for the Theory of Computing. DF is supported by the NDSEG fellowship.
References
 Arora et al. (2012) Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
 Ball et al. (1994) Keith Ball, Eric A Carlen, and Elliott H Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones mathematicae, 115(1):463–482, 1994.

 Bartlett and Mendelson (2003) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003. ISSN 1532-4435.
 Bartlett et al. (2002) Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002.
 Ben-David et al. (2009) Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
 Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 Bubeck et al. (2017) Sebastien Bubeck, Nikhil Devanur, Zhiyi Huang, and Rad Niazadeh. Online auctions and multiscale online learning. Accepted to The 18th ACM conference on Economics and Computation (EC 17), 2017.
 Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 Chaudhuri et al. (2009) Kamalika Chaudhuri, Yoav Freund, and Daniel J Hsu. A parameter-free hedging algorithm. In Advances in neural information processing systems, pages 297–305, 2009.
 Cutkosky and Boahen (2016) Ashok Cutkosky and Kwabena A Boahen. Online convex optimization with unconstrained domains and losses. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 748–756. 2016.
 Cutkosky and Boahen (2017) Ashok Cutkosky and Kwabena A. Boahen. Online learning without prior information. The 30th Annual Conference on Learning Theory, 2017.
 De Rooij et al. (2014) Steven De Rooij, Tim Van Erven, Peter D Grünwald, and Wouter M Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(1):1281–1316, 2014.

 Devroye et al. (1996) Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
 Foster et al. (2015) Dylan J Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, pages 3375–3383, 2015.
 Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
 Hazan et al. (2007) Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
 Hazan et al. (2014) Elad Hazan, Tomer Koren, and Kfir Y Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Proceedings of The 27th Conference on Learning Theory, pages 197–209, 2014.
 Hazan et al. (2017) Elad Hazan, Satyen Kale, and Shai Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. SIAM J. Comput., 46(2):744–773, 2017. doi: 10.1137/120895731.
 Kakade et al. (2009) Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793–800. MIT Press, 2009.
 Kakade et al. (2012) Sham M Kakade, Shai ShalevShwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.
 Koltchinskii (2001) Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Information Theory, 47(5):1902–1914, 2001.
 Koolen and Van Erven (2015) Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Proceedings of The 28th Conference on Learning Theory, pages 1155–1175, 2015.
 Luo and Schapire (2015) Haipeng Luo and Robert E Schapire. Achieving all with no parameters: Adanormalhedge. In Conference on Learning Theory, pages 1286–1304, 2015.
 Massart (2007) Pascal Massart. Concentration inequalities and model selection. Lecture Notes in Mathematics, 1896, 2007.
 McMahan and Abernethy (2013) Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724–2732, 2013.
 Mcmahan and Streeter (2012) Brendan Mcmahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in neural information processing systems, pages 2402–2410, 2012.
 McMahan and Orabona (2014) H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, pages 1020–1039, 2014.
 Nemirovski (2004) Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
 Nie et al. (2013) Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online pca with optimal regrets. In International Conference on Algorithmic Learning Theory, pages 98–112. Springer, 2013.
 Orabona (2014) Francesco Orabona. Simultaneous model selection and optimization through parameterfree stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
 Orabona and Pál (2016) Francesco Orabona and Dávid Pál. From coin betting to parameterfree online learning. arXiv preprint arXiv:1602.04128, 2016.
 Pisier (2011) Gilles Pisier. Martingales in banach spaces (in connection with type and cotype). course ihp, feb. 2–8, 2011. 2011.
 Rakhlin et al. (2010) Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, pages 1984–1992, 2010.
 Rakhlin et al. (2012) Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems 25, pages 2150–2158, 2012.
 Renegar (1988) James Renegar. A polynomial-time algorithm, based on Newton’s method, for linear programming. Mathematical Programming, 40(1):59–93, 1988.
 Shawe-Taylor et al. (1998) John Shawe-Taylor, Peter L Bartlett, Robert C Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
 Srebro et al. (2011) Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in neural information processing systems, pages 2645–2653, 2011.
 van Erven et al. (2015) Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.
 Vapnik (1982) Vladimir Vapnik. Estimation of dependences based on empirical data, volume 40. Springer-Verlag New York, 1982.

 Vapnik and Chervonenkis (1971) Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
Appendix A Proofs
A.1 Multiscale FTPL algorithm
Proof of theorem:ftpl_alg.
Recall that . Let . For a regret bound of the form to be achievable by a randomized algorithm such as alg:general we need
where denotes interleaving of the operator from to . In the context of alg:general, the distributions above refer to the strategy selected by the algorithm and
refers to the distribution over this strategy induced by sampling the random variables
. See Foster et al. (2015) for a more extensive introduction to this type of minimax analysis for comparator-dependent regret bounds.
We will develop an algorithm to certify this bound for using the framework of adaptive relaxations proposed by Foster et al. (2015). Define a relaxation via
The proof structure is as follows: We show that playing as suggested by alg:general with satisfies the initial condition and admissibility condition for adaptive relaxations from Foster et al. (2015), which implies that if we play we will have . Then as a final step we bound using a probabilistic maximal inequality, lem:maximal.
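For orientation, the two conditions named in this outline can be written in a generic form. The following is only a sketch in illustrative notation, not the paper's own: \hat{y}_t \sim q_t stands for the learner's randomized play, z_t for the adversary's move, \ell for the loss, f for a comparator, and \mathcal{B}(f) for the target comparator-dependent bound, all of which are assumptions introduced here.

```latex
% Generic adaptive-relaxation conditions (illustrative notation, not the paper's):
%   \hat{y}_t \sim q_t : learner's randomized play,   z_t : adversary's move,
%   \ell : loss,   f \in \mathcal{F} : comparator,   \mathcal{B}(f) : target regret bound.
\begin{align*}
\text{(initial condition)}\quad
  & \mathbf{Rel}(z_{1:n}) \;\ge\;
    -\inf_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\ell(f,z_t) + \mathcal{B}(f)\Big\}, \\
\text{(admissibility)}\quad
  & \inf_{q_t}\,\sup_{z_t}\Big\{\mathbb{E}_{\hat{y}_t\sim q_t}\big[\ell(\hat{y}_t,z_t)\big]
    + \mathbf{Rel}(z_{1:t})\Big\} \;\le\; \mathbf{Rel}(z_{1:t-1}).
\end{align*}
% Chaining admissibility from t = n down to t = 1, then applying the
% initial condition, bounds the regret in excess of \mathcal{B} by the
% value of the relaxation on the empty sequence:
\[
  \mathbb{E}\Big[\sum_{t=1}^{n}\ell(\hat{y}_t,z_t)\Big]
  - \inf_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\ell(f,z_t) + \mathcal{B}(f)\Big\}
  \;\le\; \mathbf{Rel}(\cdot).
\]
```

Under this reading, the last step of the outline amounts to showing that the relaxation evaluated on the empty sequence is small.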
Initial condition
This condition asks that the initial value of the relaxation upper bound the worst-case value of the negative benchmark minus the bound (in other words, the inner part of with the learner’s loss removed). This holds by definition and is trivial to verify:
Admissibility
For this step we must show that the inequality
holds for each timestep , and further that the inequality is certified by the strategy of alg:general. We begin by expanding the definition of :
Now plug in the randomized strategy given by alg:general, with taking the place of :  
Grouping expectations and applying Jensen’s inequality:  
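In generic form, the Jensen step referred to here is the fact that a supremum of expectations is at most the expectation of the supremum. The sketch below uses illustrative symbols that are not the paper's notation: \xi stands for the algorithm's internal randomness, a for the adversary's choice, and \phi for any real-valued function whose expectations exist.

```latex
% Illustrative symbols (assumptions, not the paper's notation):
%   \xi : the algorithm's internal randomness,
%   a   : ranges over the adversary's choices,
%   \phi: any real-valued function for which the expectations exist.
\[
  \sup_{a}\; \mathbb{E}_{\xi}\big[\phi(a,\xi)\big]
  \;\le\;
  \mathbb{E}_{\xi}\Big[\sup_{a}\,\phi(a,\xi)\Big].
\]
% Reason: \phi(a,\xi) \le \sup_{a'} \phi(a',\xi) pointwise for every a;
% take expectations on both sides, then the supremum over a on the left.
```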
Expanding the definition of (using its optimality in particular):  
Now apply a somewhat standard sequential symmetrization procedure. Begin by using the minimax theorem to swap the order of and . To do so, we allow the player to randomize, and denote their distribution by .  
Since the supremum over does not directly depend on , we can rewrite this expression by introducing a (conditionally) IID copy of which we will denote as :  