This paper studies the online learning framework, where the goal of the player is to incur small regret while observing a sequence of data on which we place no distributional assumptions. Within this framework, many algorithms have been developed over the past two decades, and we refer to the book of Cesa-Bianchi and Lugosi for a comprehensive treatment of the subject. More recently, a non-algorithmic minimax approach has been developed to study the inherent complexities of sequential problems [2, 1, 14, 19]. In particular, it was shown that a theory parallel to Statistical Learning can be developed, with random averages, combinatorial parameters, covering numbers, and other measures of complexity. Just as classical learning theory is concerned with the study of the supremum of an empirical or Rademacher process, online learning is concerned with the study of the supremum of a martingale or a certain dyadic process. Even though the complexity tools introduced in [14, 16, 15] provide ways of studying the minimax value, no algorithms have been exhibited to achieve these non-constructive bounds in general.
In this paper, we show that algorithms can, in fact, be extracted from the minimax analysis. This observation leads to a unifying view of many of the methods known in the literature, and also gives a general recipe for developing new algorithms. We show that the potential method, which has been studied in various forms, naturally arises from the study of the minimax value as a certain relaxation. We further show that the sequential complexity tools introduced in  are, in fact, relaxations and can be used for constructing algorithms that enjoy the corresponding bounds. By choosing appropriate relaxations, we recover many known methods, improved variants of some known methods, and new algorithms. One can view our framework as one for converting a non-constructive proof of an upper bound on the value of the game into an algorithm. Surprisingly, this allows us to also study such “unorthodox” methods as Follow the Perturbed Leader , and the recent method of  under the same umbrella with others. We show that the idea of a random playout has a solid theoretical basis, and that Follow the Perturbed Leader algorithm is an example of such a method. It turns out that whenever the sequential Rademacher complexity is of the same order as its i.i.d. cousin, there is a family of randomized methods that avoid certain computational hurdles. Based on these developments, we exhibit an efficient method for the trace norm matrix completion problem, novel Follow the Perturbed Leader algorithms, and efficient methods for the problems of transductive learning and prediction with static experts.
The framework of this paper gives a recipe for developing algorithms. Throughout the paper, we stress that the notion of a relaxation, introduced below, does not appear out of thin air but arises as an upper bound on the sequential Rademacher complexity. The understanding of inherent complexity thus leads to the development of algorithms.
One unsatisfying aspect of the minimax developments so far has been the lack of a localized analysis. Local Rademacher averages have been shown to play a key role in Statistical Learning for obtaining fast rates. It is also well known that fast rates are possible in online learning on a case-by-case basis, such as for online optimization of strongly convex functions. We show that, in fact, a localized analysis can be performed at an abstract level, and it goes hand-in-hand with the idea of relaxations. Using such localized analysis, we arrive at local sequential Rademacher and other local complexities. These complexities upper-bound the value of the online learning game and can lead to fast rates. Equally important, we provide an associated generic algorithm to achieve the localized bounds. We further develop the ideas of localization, presenting a general adaptive (data-dependent) procedure that takes advantage of the actual moves of the adversary that might have been suboptimal. We illustrate the procedure on a few examples. Our study of localized complexities and adaptive methods follows from a general agenda of developing universal methods that can adapt to the actual sequence of data played by Nature, thus automatically interpolating between benign and minimax optimal sequences.
This paper is organized as follows. In Section 2 we formulate the value of the online learning problem and present the (possibly computationally inefficient) minimax algorithm. In Section 3 we develop the idea of relaxations and the meta algorithm based on relaxations, and present a few examples. Section 4 is devoted to a new formalism of localized complexities, and we present a basic localized meta algorithm. We show, in particular, that for strongly convex objectives, the regret is easily bounded through localization. Next, in Section 5, we present a fully adaptive method that constantly checks whether the sequence being played by the adversary is in fact minimax optimal. We show that, in particular, we recover some of the known adaptive results. We also demonstrate how local data-dependent norms arise as a natural adaptive method. The remaining sections present a number of new algorithms, often with better computational properties and regret guarantees than those known in the literature.
A set is often denoted by . A -fold product of is denoted by . Expectation with respect to a random variable with distribution is denoted by or . The set is denoted by , and the set of all distributions on some set by . The inner product between two vectors is written as or as . The set of all functions from to is denoted by . Unless specified otherwise, denotes a vector of i.i.d. Rademacher random variables. An -valued tree of depth is defined as a sequence of mappings (see ). We often write instead of .
2 Value and The Minimax Algorithm
Let be the set of learner’s moves and the set of moves of Nature. The online protocol dictates that on every round the learner and Nature simultaneously choose , , and observe each other’s actions. The learner aims to minimize regret
is a known loss function. Our aim is to study this online learning problem at an abstract level without assuming convexity or other properties of the loss function and the sets and . We do assume, however, that , , and are such that the minimax theorem in the space of distributions over and holds. By studying the abstract setting, we are able to develop general algorithmic and non-algorithmic ideas that are common across various application areas.
The starting point of our development is the minimax value of the associated online learning game:
where is the set of distributions on . The minimax formulation immediately gives rise to the optimal algorithm that solves the minimax expression at every round . That is, after witnessing and , the algorithm returns
Henceforth, if the quantification in and is omitted, it will be understood that , , , range over , , , , respectively. Moreover, is with respect to while is with respect to . The first sum in (2) starts at since the partial loss has been fixed. We now notice a recursive form for defining the value of the game. Define for any and any given prefix the conditional value
The minimax optimal algorithm specifying the mixed strategy of the player can be written succinctly as
This recursive formulation has appeared in the literature, but now we have tools to study the conditional value of the game. We will show that various upper bounds on yield an array of algorithms, some with better computational properties than others. In this way, the non-constructive approach of [14, 15, 16] to upper bound the value of the game directly translates into algorithms.
The minimax algorithm in (3) can be interpreted as choosing the best decision that takes into account the present loss and the worst-case future. We then realize that the conditional value of the game serves as a “regularizer”, and thus well-known online learning algorithms such as Exponential Weights, Mirror Descent and Follow-the-Regularized-Leader arise as relaxations rather than as methods that “just work”.
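To make the recursion concrete, the following sketch computes the conditional value by brute force for a toy game. Everything specific here is an assumption made for illustration: the learner has only two actions (so a mixed strategy is a single number p, the probability of playing action 1), Nature's move set is finite, the terminal value is taken to be minus the smallest cumulative loss, and the names `min_of_max_lines`, `conditional_value` and `mismatch_loss` are hypothetical. The cost is exponential in the horizon, which is precisely the computational hurdle that relaxations are meant to avoid.

```python
from fractions import Fraction
from itertools import combinations


def min_of_max_lines(lines):
    """Minimize g(p) = max_k (a_k + b_k * p) over p in [0, 1].

    g is convex and piecewise linear, so its minimum is attained at an
    endpoint or where two lines cross. Returns (minimum value, minimizer p).
    """
    candidates = {Fraction(0), Fraction(1)}
    for (a1, b1), (a2, b2) in combinations(lines, 2):
        if b1 != b2:
            p = (a2 - a1) / (b1 - b2)  # crossing point of the two lines
            if 0 <= p <= 1:
                candidates.add(p)
    return min((max(a + b * p for a, b in lines), p) for p in candidates)


def conditional_value(loss, history, T, outcomes):
    """Conditional value of a T-round game after Nature has played `history`.

    The learner chooses between actions 0 and 1. Returns (value, p), where p
    is an optimal probability of playing action 1 on the next round
    (None once the game is over).
    """
    if len(history) == T:
        # End of the game: minus the loss of the best action in hindsight.
        return -min(sum(loss(f, x) for x in history) for f in (0, 1)), None
    lines = []
    for x in outcomes:
        future, _ = conditional_value(loss, history + (x,), T, outcomes)
        l0, l1 = Fraction(loss(0, x)), Fraction(loss(1, x))
        # Expected present loss is l0 + (l1 - l0) * p; add the future value.
        lines.append((l0 + future, l1 - l0))
    return min_of_max_lines(lines)


def mismatch_loss(f, x):
    """Toy loss: 1 if the learner's bit differs from Nature's bit, else 0."""
    return int(f != x)


value, p = conditional_value(mismatch_loss, (), 2, (0, 1))
print("value of the 2-round game:", value, "first-round P(action 1):", p)
```

Exact rational arithmetic is used so that ties between candidate strategies are resolved without floating-point noise.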
3 Relaxations and the Basic Meta-Algorithm
A relaxation is a sequence of functions for each . We shall use the notation for . A relaxation will be called admissible if for any ,
for all , and
A strategy that minimizes the expression in (4) defines an optimal algorithm for the relaxation . This algorithm is given below under the name “Meta-Algorithm”. However, minimization need not be exact: any that satisfies the admissibility condition (4) is a valid method, and we will say that such an algorithm is admissible with respect to the relaxation .
Let be an admissible relaxation. For any admissible algorithm with respect to , including the Meta-Algorithm, irrespective of the strategy of the adversary,
We also have that
If for all , the Hoeffding-Azuma inequality yields, with probability at least
Further, if for all , the admissible strategies are deterministic,
The reader might recognize as a potential function. It is known that one can derive regret bounds by coming up with a potential such that the current loss of the player is related to the difference in the potentials at successive steps, and that the loss of the best decision in hindsight can be extracted from the final potential. The origin of “good” potential functions has always been a mystery (at least to the authors). One of the conceptual contributions of this paper is to show that they naturally arise as relaxations on the conditional value. The conditional value itself can be characterized as the tightest possible relaxation.
In particular, for many problems a tight relaxation (sometimes within a factor of ) is achieved through symmetrization. Define the conditional Sequential Rademacher complexity
Here the supremum is over all -valued binary trees of depth . One may view this complexity as a partially symmetrized version of the sequential Rademacher complexity
defined in . We shall refer to the term involving the tree as the “future” and to the term being subtracted off as the “past”. This indeed corresponds to the fact that the quantity is conditioned on the already observed , while for the future we have the worst possible binary tree. (It is somewhat cumbersome to write out the indices on in (6), so we will instead use for , whenever this does not cause confusion.)
The conditional Sequential Rademacher complexity is admissible.
The proof of this proposition is given in the Appendix and it corresponds to one step of the sequential symmetrization proof in . We note that the factor appearing in (6) is not necessary in certain cases (e.g. binary prediction with absolute loss).
We now show that several well-known methods arise as further relaxations on the conditional sequential Rademacher complexity .
Suppose is a finite class and . In this case, a (tight) upper bound on sequential Rademacher complexity leads to the following relaxation:
The Chernoff-Cramér inequality tells us that (8) is the tightest possible relaxation. The proof of Proposition 3 reveals that the only inequality is the softmax, which is also present in the proof of the maximal inequality for a finite collection of random variables. In this way, exponential weights is an algorithmic realization of a maximal inequality for a finite collection of random variables. The connection between probabilistic (or concentration) inequalities and algorithms runs much deeper.
We point out that the exponential-weights algorithm arising from the relaxation (8) is a parameter-free algorithm. The learning rate can be optimized (via one-dimensional line search) at each iteration with almost no cost. This can lead to improved performance as compared to the classical methods that set a particular schedule for the learning rate.
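The per-round tuning of the learning rate can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm verbatim: the relaxation is assumed to have the log-sum-exp form (1/λ) log Σ_f exp(−λ L_f) + 2λ · (rounds left), where L_f is the cumulative loss of expert f so far, and the one-dimensional search is replaced by a crude search over a logarithmic grid. The function names and the grid range are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def relaxation_objective(lam, cum_losses, rounds_left):
    """Assumed relaxation at learning rate lam (see the hedged lead-in)."""
    soft_min = log_sum_exp([-lam * L for L in cum_losses]) / lam
    return soft_min + 2.0 * lam * rounds_left


def tuned_exponential_weights(cum_losses, rounds_left, grid_size=200):
    """Pick lam by grid search, then return (lam, exponential-weights distribution)."""
    # Log-spaced grid over [1e-4, 1e2]; the range is an arbitrary choice.
    grid = [10.0 ** (-4 + 6 * i / (grid_size - 1)) for i in range(grid_size)]
    lam = min(grid, key=lambda l: relaxation_objective(l, cum_losses, rounds_left))
    shift = min(cum_losses)  # subtract the minimum to avoid underflow of all weights
    weights = [math.exp(-lam * (L - shift)) for L in cum_losses]
    total = sum(weights)
    return lam, [w / total for w in weights]


lam, q = tuned_exponential_weights([3.0, 1.0, 2.0], rounds_left=7)
print("chosen learning rate:", lam)
print("distribution over experts:", q)
```

Because the search is repeated every round with the current cumulative losses and the current number of remaining rounds, no learning-rate schedule has to be fixed in advance.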
In the setting of online linear optimization, the loss is . Suppose is a unit ball in some Banach space and is the dual. Let be some -smooth norm on (in the Euclidean case, ). Using the notation , a straightforward upper bound on sequential Rademacher complexity is the following relaxation:
The relaxation (9) is admissible and
Furthermore, it leads to the Mirror Descent algorithm with regret at most .
An important feature of the algorithms we just proposed is the absence of any parameters, as the step size is tuned automatically. We chose Exponential Weights and Mirror Descent for illustration because these methods are well known. Our aim at this point was to show that the associated relaxations arise naturally (typically with a few steps of algebra) from the sequential Rademacher complexity. More examples are included later in the paper. It should now be clear that upper bounds, such as the Dudley Entropy integral, can be turned into a relaxation, provided that admissibility is proved. Our ideas resemble those in Statistics, where an information-theoretic complexity can be used for defining penalization methods.
4 Localized Complexities and the Localized-Meta Algorithm
The localized analysis plays an important role in Statistical Learning Theory. The basic idea is that better rates can be proved for empirical risk minimization when one considers the empirical process in the vicinity of the target hypothesis [11, 4]. Through this, localization gives extra information by shrinking the size of the set which needs to be analyzed. What does it mean to localize in online learning? As we obtain more data, we can rule out parts of as those that are unlikely to become the leaders. This observation indeed gives rise to faster rates. Let us develop a general framework of localization and then illustrate it on examples. We emphasize that the localization ideas will be developed at an abstract level where no assumptions are placed on the loss function or the sets and .
Given any , for any define
That is, given the instances , the set is the set of elements that could be the minimizers of cumulative loss on instances, the first of which are and the remaining arbitrary. We shall refer to minimizers of cumulative loss as empirical risk minimizers (or, ERM).
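For a finite class and a finite set of outcomes, the localized set can be computed by enumerating every possible future. The sketch below does exactly that; it is exponential in the number of remaining rounds and is meant only to illustrate the definition. The function names and the toy loss are assumptions introduced for the example.

```python
from itertools import product


def possible_minimizers(loss, actions, outcomes, history, T):
    """Actions that minimize cumulative loss for at least one completion.

    `history` holds the instances observed so far; the remaining
    T - len(history) instances range over all sequences of `outcomes`.
    """
    candidates = set()
    for future in product(outcomes, repeat=T - len(history)):
        sequence = tuple(history) + future
        totals = {f: sum(loss(f, x) for x in sequence) for f in actions}
        best = min(totals.values())
        # Keep every action tied for the smallest cumulative loss.
        candidates.update(f for f, total in totals.items() if total == best)
    return candidates


def mismatch_loss(f, x):
    """Toy loss: 1 if the action differs from the outcome, else 0."""
    return int(f != x)


# After three zeros, action 1 can still catch up only if enough rounds remain.
print(possible_minimizers(mismatch_loss, (0, 1), (0, 1), (0, 0, 0), T=4))
print(possible_minimizers(mismatch_loss, (0, 1), (0, 1), (0, 0, 0), T=7))
```

With one round left the trailing action can no longer become the leader and drops out of the localized set, while with four rounds left it is still a potential minimizer.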
Henceforth, we shall use the notation . We now consider subdividing into blocks of time such that . With this notation, is the last time in the th block. We then have regret upper bounded as
Hence, one can decompose the online learning game into blocks of successive games. The crucial point to notice is that at the block, we do not compete with the best hypothesis in all of but rather only . It is this localization based on history that could lead to possibly faster rates. While the “blocking” idea often appears in the literature (for instance, in the form of a doubling trick, as described below), the process is usually “restarted” from scratch by considering all of . Notice further that one need not choose all in advance. The player can choose based on history and then use, for instance, the Meta-Algorithm introduced in the previous section to play the game within the block using the localized class . Such adaptive procedures will be considered in Section 5, but presently we assume that the block sizes are fixed.
While the successive localizations using subsets can provide an algorithm with possibly better performance, specifying and analyzing the localized subset exactly might not be possible. In such a case, one can instead use
where is some “property” of given data. This definition echoes the definition of the set of -minimizers of empirical or expected risk in Statistical Learning. Further, for a given define
the smallest “radius” such that includes the set of potential minimizers over the next time steps. Of course, if the property does not enforce localization, the bounds are not going to exhibit any improvement, so needs to be chosen carefully for a particular problem of interest.
We have the following algorithm:
The regret of the Localized Meta-Algorithm is bounded as
Note that the above lemma points to local sequential complexities for online learning problems that can lead to possibly fast rates. In particular, if sequential Rademacher complexity is used as the relaxation in the Localized Meta-Algorithm, we get a bound in terms of local sequential Rademacher complexities.
4.1 Local Sequential Complexities
The following corollary is a direct consequence of Lemma 5.
Corollary 6 (Local Sequential Rademacher Complexity).
For any property and any such that , we have that :
Clearly, the sequential Rademacher complexities in the above bound can be replaced with other sequential complexity measures of the localized classes that are upper bounds on the sequential Rademacher complexities. For instance, one can replace each Rademacher complexity by covering-number-based bounds of the local classes, such as the analogues of the Dudley Entropy Integral bounds developed in the sequential setting in . One can also use, for instance, fat-shattering-dimension-based complexity measures for these local classes.
4.2.1 Example: Doubling trick
The doubling trick can be seen as a particular blocking strategy with so that
for defined with respect to some property . The latter inequality is potentially loose, as the algorithm is “restarted” after the previous block is completed. Now if is such that for any , for some then the regret is upper bounded by . The main advantage of the doubling trick is of course that we do not need to know in advance.
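The blocking schedule can be sketched in a few lines. The code assumes block lengths 1, 2, 4, ... with the last block truncated so that the lengths sum to the horizon, and it uses a per-block bound proportional to the square root of the block length purely as an example of a rate; both choices, and the function names, are assumptions for illustration. Note that the horizon is passed in only to stop the loop: each block length is determined without knowing it.

```python
import math


def doubling_blocks(T):
    """Block lengths 1, 2, 4, ... covering T rounds (last block truncated)."""
    lengths, size, used = [], 1, 0
    while used < T:
        length = min(size, T - used)
        lengths.append(length)
        used += length
        size *= 2
    return lengths


def blockwise_bound(T, per_block_bound):
    """Sum of a per-block regret bound over the doubling blocks."""
    return sum(per_block_bound(k) for k in doubling_blocks(T))


T = 1000
total = blockwise_bound(T, math.sqrt)
constant = math.sqrt(2) / (math.sqrt(2) - 1)
print(doubling_blocks(T))
print(total, "<=", constant * math.sqrt(T))
```

Summing the geometric series shows that restarting on doubling blocks costs only a constant factor (here √2/(√2 − 1)) over a square-root bound with the horizon known in advance.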
4.2.2 Example: Strongly Convex Loss
To illustrate the idea of localization, consider online convex optimization with -strongly convex functions (that is, ). Define
Lemma 27 in the Appendix shows that this relaxation is admissible. Notice that this relaxation grows linearly with the block size and is by itself quite bad. However, with blocking and localization, the relaxation gives an optimal bound for strongly convex objectives. To see this, note that for , any minimizer of has to be close to the minimizer of , due to strong convexity of the functions. In other words, the property
The relaxation for the block of size is
the radius of the smallest ball containing the localized set , and we immediately get
We remark that this proof is different in spirit from the usual proofs of fast rates for strongly convex functions, and it demonstrates the power of localization.
5 Adaptive Procedures
There is a strong interest in developing methods that enjoy worst-case regret guarantees but also take advantage of the suboptimality of the sequence being played by Nature. An algorithm that is able to do so without knowing in advance that the sequence will have a certain property will be called adaptive. Imagine, for instance, running an experts algorithm, and one of the experts has gained such a lead that she is clearly the winner (that is, the empirical risk minimizer) at the end of the game. In this case, since we are to be compared with the leader at the end, we need not focus on anyone else, and regret for the remainder of the game is zero.
There has been previous work on exploiting particular ways in which sequences can be suboptimal. Examples include the Adaptive Gradient Descent of  and Adaptive Hedge of . We now give a generic method which incorporates the idea of localization in order to adaptively (and constantly) check whether the sequence being played is of optimal or suboptimal nature. Notice that, as before, we present the algorithm at the abstract level of the online game with some decision sets , , and some loss function .
The adaptive procedure below uses a subroutine which, given the history , returns a subdivision of the next rounds into sub-blocks. The choice of the blocking strategy has to be made for the particular problem at hand, but, as we show in examples, one can often use very simple strategies.
Let us describe the adaptive procedure. First, for simplicity of exposition, we start with the doubling-size blocks. Here is what happens within each of these blocks. During each round the learner decides whether to stay in the same sub-block or to start a new one, as given by the blocking procedure . If started, the new sub-block uses the localized subset given the history of the adversary’s moves up to the last round. Choosing to start a new sub-block corresponds to the realization of the learner that the sequence being presented so far is in fact suboptimal. The learner then incorporates this suboptimality into the localized procedure.
Given some admissible relaxation , the regret of the adaptive localized meta-algorithm (Algorithm 3) is bounded as
where is the number of blocks actually played and ’s are adaptive block lengths defined within the algorithm. Further, irrespective of the blocking strategy used, if the relaxation is such that for any , for some , then the worst case regret is always bounded as
We now demonstrate that the adaptive algorithm in fact takes advantage of sub-optimality in several situations that have been previously studied in the literature. On the conceptual level, adaptive localization allows us to view several fast rate results under the same umbrella.
Example: Adaptive Gradient Descent
Consider the online convex optimization scenario. Following the setup of , suppose the learner encounters a sequence of convex functions with the strong convexity parameter , potentially zero, with respect to a -smooth norm . The goal is to adapt to the actual sequence of functions presented by the adversary. Let us invoke the Adaptive Localized Meta-Algorithm with a rather simple blocking strategy
This blocking strategy either says “use all of the next rounds as one block”, or “make each of the next time steps into a separate block”. Let be the empirical minimizer at the start of the block (that is after rounds), and let . Then we can use the localization
where . For the above relaxation we can show that the corresponding update at round is given by
where is the length of the current block. The next lemma shows that the proposed adaptive gradient descent recovers the results of . The method is a mixture of a Follow the Leader-style algorithm and a Gradient Descent-style algorithm.
The relaxation specified above is admissible. Suppose the adversary plays -Lipschitz convex functions such that for any , is -strongly convex, and further suppose that for some , we have that . Then, for the blocking strategy specified above,
Example: Adaptive Experts
We now turn to the setting of the Adaptive Hedge or Exponential Weights algorithm, similar to the one studied in . Consider the following situation: for all time steps after some , there is an element (or, expert) that is the best by a margin over the next-best choice in in terms of the (unnormalized) cumulative loss, and it remains the winner until the end. Let us use the localization
the set of functions closer than the margin to the ERM. Let
be the set of empirical minimizers at time . We use the blocking strategy
which says that the size of the next block is given by the gap between empirical minimizer(s) and non-minimizers. The idea behind the proof and the blocking strategy is simple. If it happens at the start of a new block that there is a large gap between the current leader and the next expert, then for the number of rounds approximately equal to this gap we can play a new block and not suffer any extra regret.
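The gap-based choice of block length can be sketched as follows. This is a hedged illustration rather than the exact blocking strategy: it assumes per-round losses lie in [0, 1], so that the gap between the leaders and everyone else can shrink by at most one per round, and it takes the block length to be the integer part of that gap, capped by the number of remaining rounds. The function name and the tie-handling rule are assumptions.

```python
import math


def next_block_length(cumulative_losses, remaining_rounds):
    """Length of the next block from the leader gap (losses assumed in [0, 1]).

    The gap is measured between the empirical minimizer(s) and the best
    non-minimizer. While fewer rounds than the gap have elapsed, no
    non-minimizer can overtake the current leaders.
    """
    best = min(cumulative_losses)
    others = [L for L in cumulative_losses if L > best]
    if not others:
        # Every expert is tied for the lead: no gap to exploit.
        return 1
    gap = min(others) - best
    return max(1, min(remaining_rounds, math.floor(gap)))


print(next_block_length([2.0, 7.5, 9.0], remaining_rounds=100))
print(next_block_length([2.0, 7.5, 9.0], remaining_rounds=3))
```

When the leader is far ahead the block is long and the localized class contains only the leaders; when the gap is small the procedure falls back to short blocks.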
Consider the relaxation (8) used for the Exponential Weights algorithm.
While we demonstrated a very simple example, the algorithm is adaptive more generally. Lemma 9 considers the assumption that a single expert becomes a clear winner after rounds, with margin of . Even when there is no clear winner throughout the game, we can still achieve low regret. For instance, this happens if only a few elements of have low cumulative loss throughout the game and the rest of suffers heavy loss. Then the algorithm adapts to the suboptimality and gives a regret bound whose dominating term depends logarithmically only on the cardinality of the “good” choices in the set . Similar ideas appear in , and will be investigated in more generality in the full version of the paper.
Example: Adapting to the Data Norm
Recall that the set is the subset of functions in that are possible empirical risk minimizers when we consider for some that can occur in the future. Now, given history and a possible future sequence , if is an ERM for and is an ERM for then
Hence, we see that it suffices to consider localizations
If we consider online convex Lipschitz learning problems where and loss is convex in and is such that in the dual norm