We continue the line of work on the minimax analysis of online learning, initiated in [1, 11, 10]. In these papers, an array of tools was developed to study the minimax value of diverse sequential problems under the worst-case assumption on Nature. Many analogues of the classical notions from statistical learning theory were developed there and subsequently extended to performance measures well beyond the additive regret. The process of sequential symmetrization emerged as a key technique for dealing with complicated nested minimax expressions. In the worst-case model, the developed tools appear to give a unified treatment of such sequential problems as regret minimization, calibration of forecasters, Blackwell’s approachability, Phi-regret, and more.
Learning theory has so far focused predominantly on the i.i.d. and the worst-case learning scenarios. Much less is known about learnability in between these two extremes. In the present paper, we make progress towards filling this gap. Instead of examining various performance measures, as in the earlier work, we focus on external regret and make assumptions on the behavior of Nature. By restricting Nature to play i.i.d. sequences, the results boil down to the classical notions of statistical learning in the supervised learning scenario. By not placing any restrictions on Nature, we recover the known worst-case results. Between these two endpoints of the spectrum, particular assumptions on the adversary yield interesting bounds on the minimax value of the associated problem.
By inertia, we continue to use the name “online learning” to describe the sequential interaction between the player (learner) and Nature (adversary). We realize that the name can be misleading for a number of reasons. First, the techniques developed in [11, 10] apply far beyond the problems that would traditionally be called “learning”. Second, in this paper we deal with non-worst-case adversaries, while the word “online” often (though not always) refers to the worst case. Still, we decided to keep the misnomer “online learning” whenever the problem is sequential.
Adopting the game-theoretic language, we will think of the learner and the adversary as the two players of a zero-sum repeated game. The adversary’s moves will be associated with “data”, while the moves of the learner will be associated with a function or a parameter. This point of view is not new: game-theoretic minimax analysis has been at the heart of statistical decision theory for more than half a century. In fact, there is a well-developed theory of minimax estimation when restrictions are put on either the choice of the adversary or the allowed estimators by the player. We are not aware of a similar theory for sequential problems with non-i.i.d. data.
In particular, minimax analysis is central to nonparametric estimation, where one aims to prove optimal rates of convergence of the proposed estimator. Lower bounds are proved by exhibiting a “bad enough” distribution of the data that can be chosen by the adversary. The form of the minimax value is often
\[ \inf_{\hat f}\ \sup_{f\in\mathcal F}\ \mathbb E\,\|\hat f - f\|^2 \tag{1} \]
where the infimum is over all estimators $\hat f$ and the supremum is over all functions $f$ from some class $\mathcal F$. It is often assumed that $Y_t = f(X_t) + \epsilon_t$, with $\epsilon_t$ being zero-mean noise. An estimator can be thought of as a strategy, mapping the data $\{(X_t, Y_t)\}_{t=1}^T$ to the space of functions on $\mathcal X$. This description is, of course, only a rough sketch that does not capture the vast array of problems considered in nonparametric estimation.
In statistical learning theory, the data are i.i.d. from an unknown distribution $P_{X\times Y}$ and the associated minimax problem in the supervised setting with square loss is
\[ \inf_{\hat f}\ \sup_{P_{X\times Y}} \left\{ \mathbb E\,(Y-\hat f(X))^2 - \inf_{f\in\mathcal F}\mathbb E\,(Y-f(X))^2 \right\} \tag{2} \]
where the infimum is over all estimators (or learning algorithms) $\hat f$ and the supremum is over all distributions $P_{X\times Y}$. Unlike nonparametric regression, which makes an assumption on the “regression function” $f\in\mathcal F$, statistical learning theory often aims at distribution-free results. Because of this, the goal is more modest: to predict as well as the best function in $\mathcal F$ rather than recover the true model. In particular, (2) sidesteps the issue of approximation error (model misspecification).
What is known about the asymptotic behavior of (2)? The well-developed statistical learning theory tells us that (2) converges to zero if and only if the combinatorial dimensions of $\mathcal F$ (that is, the VC dimension for binary-valued functions, or the scale-sensitive dimensions for real-valued functions) are finite. The convergence is intimately related to the uniform Glivenko-Cantelli property. If indeed the value in (2) converges to zero, an algorithm that achieves this is Empirical Risk Minimization (ERM). For unsupervised learning problems, however, ERM does not necessarily drive the corresponding quantity to zero.
The formulation (2) no longer makes sense if the data-generating process is non-stationary. Consider the opposite end of the spectrum from i.i.d.: the data are chosen in a worst-case manner. First, consider an oblivious adversary who fixes the individual sequence $x_1,\ldots,x_T$ ahead of the game and reveals it one element at a time. A frequently studied notion of performance is regret, and the minimax value can be written as
\[ \inf_{\{q_t\}_{t=1}^T}\ \sup_{x_1,\ldots,x_T}\ \mathbb E_{f_1\sim q_1,\ldots,f_T\sim q_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f\in\mathcal F}\sum_{t=1}^T f(x_t) \right] \tag{3} \]
where the randomized strategy for round $t$ is $q_t:\mathcal X^{t-1}\to\mathcal Q$, with $\mathcal Q$ being the set of all distributions on $\mathcal F$. That is, the player furnishes his best randomized strategy for each round, and the adversary picks the worst sequence.
A non-oblivious (adaptive) adversary is, of course, more interesting. The protocol for the online interaction is the following: on round $t$ the player chooses a distribution $q_t$ on $\mathcal F$, the adversary chooses the next move $x_t\in\mathcal X$, the player draws $f_t$ from $q_t$, and the game proceeds to the next round. All the moves are observed by both players. Instead of writing the value in terms of strategies, we can write it in an extended form as
\[ \inf_{q_1\in\mathcal Q}\sup_{x_1\in\mathcal X}\mathbb E_{f_1\sim q_1}\ \cdots\ \inf_{q_T\in\mathcal Q}\sup_{x_T\in\mathcal X}\mathbb E_{f_T\sim q_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f\in\mathcal F}\sum_{t=1}^T f(x_t) \right] \tag{4} \]
This is precisely the quantity considered in the earlier work on worst-case adversaries, where the minimax value for notions other than regret has also been studied. In this paper, we are interested in restricting the ways in which the sequences $(x_1,\ldots,x_T)$ are produced. These restrictions can be imposed through a smaller set of mixed strategies that is available to the adversary at each round, or as a non-stochastic constraint at each round. The formulation we propose captures both types of assumptions.
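The round-by-round protocol described above can be made concrete with a small simulation. The following is a minimal Python sketch under stated assumptions, not the paper's construction: the names `play_game`, `learner`, and `adversary` are hypothetical, the function class is assumed finite so that a mixed strategy is simply a weight vector, and regret is left unnormalized.

```python
import random

def play_game(T, learner, adversary, functions, seed=0):
    """Simulate T rounds of the protocol and return the realized regret.

    learner(fs, xs)       -> weights q_t over `functions` (assumed finite class)
    adversary(q, fs, xs)  -> move x_t; it sees q_t and the history, but not f_t
    """
    rng = random.Random(seed)
    fs, xs = [], []          # indices of the player's draws, adversary's moves
    total_loss = 0.0
    for _ in range(T):
        q_t = learner(fs, xs)                    # player commits to a distribution
        x_t = adversary(q_t, fs, xs)             # adversary picks the next move
        f_idx = rng.choices(range(len(functions)), weights=q_t)[0]  # f_t ~ q_t
        total_loss += functions[f_idx](x_t)
        fs.append(f_idx)
        xs.append(x_t)
    # Cumulative loss of the best fixed function in hindsight.
    best_in_hindsight = min(sum(f(x) for x in xs) for f in functions)
    return total_loss - best_in_hindsight

# Toy instance: two functions on X = {0, 1}.
FUNCS = [lambda x: float(x), lambda x: 1.0 - float(x)]
```

For example, a learner that always puts all mass on the second function suffers zero regret against an adversary that always plays `1`, while a learner that always plays the first function suffers regret `T`.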
The main contribution of this paper is the development of tools for the analysis of online scenarios where the adversary’s moves are restricted in various ways. Further, we consider a number of interesting scenarios (such as smoothed learning) which can be captured by our framework. The present paper only scratches the surface of what is possible with sequential minimax analysis. Many questions remain to be answered. For instance, one can ask whether a certain adversary is more powerful than another adversary by studying the value of the associated game.
The paper is organized as follows. In Section 2 we define the value of the game and appeal to minimax duality. Distribution-dependent sequential Rademacher complexity is defined in Section 3 and can be seen to generalize the classical notion as well as the worst-case notion from the earlier work. This section contains the main symmetrization result, which relies on a careful consideration of original and tangent sequences. Section 4 is devoted to the analysis of the distribution-dependent Rademacher complexity. In Section 5 we consider non-stochastic constraints on the behavior of the adversary. From these results, variation-type results are seamlessly deduced. Section 6 is devoted to the i.i.d. adversary. We show equivalence between batch and online learnability. Hybrid adversarial-stochastic supervised learning is considered in Section 7. We show that it is the way in which the $x$ variable is chosen that governs the complexity of the problem, irrespective of the way the $y$ variable is picked. In Section 8 we introduce the notion of smoothed analysis in the online learning scenario and show that a simple problem with infinite Littlestone dimension becomes learnable once a small amount of noise is added to the adversary’s moves. Throughout the paper, we use the notation introduced in [11, 10], and, in particular, we extensively use the “tree” notation.
2 Value of the Game
Consider sets $\mathcal F$ and $\mathcal X$, where $\mathcal F$ is a closed subset of a complete separable metric space. Let $\mathcal Q$ be the set of probability distributions on $\mathcal F$ and assume that $\mathcal Q$ is weakly compact. We consider randomized learners who predict a distribution $q_t\in\mathcal Q$ on every round.
Let $\mathcal P$ be the set of probability distributions on $\mathcal X$. We would like to capture the fact that sequences $(x_1,\ldots,x_T)$ cannot be arbitrary. This is achieved by defining restrictions on the adversary, that is, subsets of “allowed” distributions for each round. These restrictions limit the scope of available mixed strategies for the adversary.
A restriction $\mathcal P_{1:T}$ on the adversary is a sequence $\mathcal P_1,\ldots,\mathcal P_T$ of mappings $\mathcal P_t:\mathcal X^{t-1}\to 2^{\mathcal P}$ such that $\mathcal P_t(x_{1:t-1})$ is a convex subset of $\mathcal P$ for any $x_{1:t-1}\in\mathcal X^{t-1}$.
Note that the restrictions depend on the past moves of the adversary, but not on those of the player. We will write $\mathcal P_t$ instead of $\mathcal P_t(x_{1:t-1})$ when $x_{1:t-1}$ is clearly defined.
Using the notion of restrictions, we can give names to several types of adversaries that we will study in this paper.
A worst-case adversary is defined by vacuous restrictions $\mathcal P_t(x_{1:t-1}) = \mathcal P$. That is, any mixed strategy is available to the adversary, including any deterministic point distributions.
A constrained adversary is defined by $\mathcal P_t(x_{1:t-1})$ being the set of all distributions supported on the set $\{x\in\mathcal X : C_t(x_1,\ldots,x_{t-1},x)=1\}$ for some deterministic binary-valued constraint $C_t$. The deterministic constraint can, for instance, ensure that the length of the path determined by the moves $x_1,\ldots,x_t$ stays below the allowed budget.
A smoothed adversary picks the worst-case sequence, which gets corrupted by i.i.d. noise. Equivalently, we can view this as restrictions on the adversary who chooses the “center” (or a parameter) of the noise distribution. For a given family $\mathcal G$ of noise distributions (e.g. zero-mean Gaussian noise), the restrictions $\mathcal P_t$ are obtained by all possible shifts of the distributions in $\mathcal G$.
A hybrid adversary in the supervised learning game picks the worst-case label $y_t$, but is forced to draw the $x_t$-variable from a fixed distribution.
Finally, an i.i.d. adversary is defined by a time-invariant restriction $\mathcal P_t(x_{1:t-1}) = \{p\}$ for every $t$ and some $p\in\mathcal P$.
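To illustrate how these adversary types differ operationally, here is a minimal Python sketch for real-valued moves. Everything in it is an illustrative assumption rather than part of the paper: the helper names are hypothetical, the constraint is a path-length budget (one of the examples mentioned above), the fallback of repeating the previous move is an arbitrary choice, and the i.i.d. adversary is taken to be uniform on [0, 1).

```python
import random

def path_length_constraint(budget):
    """Return C(history, x) -> bool: True iff the path through history + [x]
    has total length at most `budget` (assumed non-negative)."""
    def C(history, x):
        points = list(history) + [x]
        length = sum(abs(b - a) for a, b in zip(points, points[1:]))
        return length <= budget
    return C

def constrained_move(history, proposal, C):
    """Constrained adversary: play the proposal if the constraint allows it;
    otherwise repeat the previous move (a hypothetical fallback that adds
    zero path length)."""
    if C(history, proposal):
        return proposal
    return history[-1]

def smoothed_move(center, sigma, rng):
    """Smoothed adversary: the adversary picks the center, noise is added."""
    return center + rng.gauss(0.0, sigma)

def iid_move(rng):
    """i.i.d. adversary: a single fixed distribution (here uniform on [0, 1))."""
    return rng.random()
```

A worst-case adversary corresponds to dropping the constraint check altogether, so that any proposal is played as is.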
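For the given restrictions $\mathcal P_{1:T}$, we define the value of the game as
\[ \mathcal V_T(\mathcal P_{1:T}) = \inf_{q_1\in\mathcal Q}\sup_{p_1\in\mathcal P_1}\mathbb E_{f_1\sim q_1,\,x_1\sim p_1}\ \cdots\ \inf_{q_T\in\mathcal Q}\sup_{p_T\in\mathcal P_T}\mathbb E_{f_T\sim q_T,\,x_T\sim p_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f\in\mathcal F}\sum_{t=1}^T f(x_t) \right] \tag{5} \]
where $f_t$ has distribution $q_t$ and $x_t$ has distribution $p_t$. As in the worst-case setting, the adversary is adaptive, that is, chooses $p_t$ based on the history of moves $f_{1:t-1}$ and $x_{1:t-1}$.
At this point, the only difference from the worst-case setup is in the restrictions $\mathcal P_t$ on the adversary. Because these restrictions might not allow point distributions, the suprema over $p_t$’s in (5) cannot be equivalently written as the suprema over $x_t$’s.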
The value of the game can also be written in terms of strategies and for the player and the adversary, respectively, where and . Crucially, the strategies also depend on the mappings . The value of the game can equivalently be written in the strategic form as
A word about the notation. In , the value of the game is written as , signifying that the main object of study is . In , it is written as since the focus is on the complexity of the set of transformations and the payoff mapping . In the present paper, the main focus is indeed on the restrictions on the adversary, justifying our choice for the notation.
The first step is to apply the minimax theorem. To this end, we verify the necessary conditions. Our assumption that $\mathcal F$ is a closed subset of a complete separable metric space implies that $\mathcal Q$ is tight, and Prokhorov’s theorem states that compactness of $\mathcal Q$ under the weak topology is equivalent to tightness. Compactness under the weak topology allows us to proceed as in the worst-case analysis. Additionally, we require that the restriction sets $\mathcal P_t(x_{1:t-1})$ are compact and convex.
Theorem 1.
Let $\mathcal F$ and $\mathcal X$ be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let $\mathcal P_{1:T}$ be the restrictions, and assume that for any $x_{1:t-1}$, $\mathcal P_t(x_{1:t-1})$ satisfies the necessary conditions for the minimax theorem to hold. Then
The nested sequence of suprema and expected values in Theorem 1 can be re-written succinctly as
where the supremum is over all joint distributions $\mathbf p$ over sequences, such that $\mathbf p$ satisfies the restrictions as described below. Given a joint distribution $\mathbf p$ on sequences $(x_1,\ldots,x_T)\in\mathcal X^T$, we denote the associated conditional distributions by $p_t(\cdot\,|\,x_{1:t-1})$. We can think of the choice $\mathbf p$ as a sequence of oblivious strategies $\{p_t:\mathcal X^{t-1}\to\mathcal P\}_{t=1}^T$, mapping the prefix $x_{1:t-1}$ to a conditional distribution $p_t(\cdot\,|\,x_{1:t-1})$. We will indeed call $\mathbf p$ a “joint distribution” or an “oblivious strategy” interchangeably. We say that a joint distribution $\mathbf p$ satisfies restrictions if for any $t$ and any $x_{1:t-1}\in\mathcal X^{t-1}$, $p_t(\cdot\,|\,x_{1:t-1})\in\mathcal P_t(x_{1:t-1})$. The set of all joint distributions satisfying the restrictions is denoted by $\mathfrak P$. We note that Theorem 1 cannot be deduced immediately from the analogous worst-case result, as it is not clear how the per-round restrictions on the adversary come into play after applying the minimax theorem. Nevertheless, it is comforting that the restrictions directly translate into the set $\mathfrak P$ of oblivious strategies satisfying the restrictions.
Before continuing with our goal of upper-bounding the value of the game, let us answer the following question: Is there an oblivious minimax strategy for the adversary? Even though Theorem 1 shows equality to some quantity with a supremum over oblivious strategies $\mathbf p$, it is not immediate that the answer to our question is affirmative, and a proof is required. To this end, for any oblivious strategy $\mathbf p$, define the regret the player would get playing optimally against $\mathbf p$:
The next proposition shows that there is an oblivious minimax strategy for the adversary and a minimax optimal strategy for the player that does not depend on its own randomizations. The latter statement for worst-case learning is folklore, yet we have not seen a proof of it in the literature.
For any oblivious strategy ,
with equality holding for $\mathbf p^*$ which achieves the supremum in (8). (Here, and in the rest of the paper, if a supremum is not achieved, a slightly modified analysis can be carried out.) Importantly, the infimum is over strategies of the player that do not depend on the player’s previous moves, that is, $\pi_t:\mathcal X^{t-1}\to\mathcal Q$. Hence, there is an oblivious minimax optimal strategy for the adversary, and there is a corresponding minimax optimal strategy for the player that does not depend on its own moves.
3 Symmetrization and Random Averages
Theorem 1 gives a useful representation of the value of the game. As the next step, we upper bound it with an expression which is easier to study. Such an expression is obtained by introducing Rademacher random variables. This process can be termed sequential symmetrization and has been exploited in [1, 11, 10]. The restrictions $\mathcal P_t$, however, make sequential symmetrization a bit more involved than in the previous papers. The main difficulty arises from the fact that the set $\mathcal P_t(x_{1:t-1})$ depends on the sequence $x_{1:t-1}$, and symmetrization (that is, replacement of $x_s$ with $x'_s$) has to be done with care as it affects this dependence. Roughly speaking, in the process of symmetrization, a tangent sequence $x'_1,x'_2,\ldots$ is introduced such that $x_t$ and $x'_t$ are independent and identically distributed given “the past”. However, “the past” is itself an interleaving choice of the original sequence and the tangent sequence.
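Define the “selector function” $\chi:\mathcal X\times\mathcal X\times\{\pm 1\}\to\mathcal X$ by
\[ \chi(x,x',\epsilon) = \begin{cases} x' & \text{if } \epsilon = 1, \\ x & \text{if } \epsilon = -1. \end{cases} \]
When $x_t$ and $x'_t$ are understood from the context, we will use the shorthand $\chi_t(\epsilon) := \chi(x_t,x'_t,\epsilon)$. In other words, $\chi_t$ selects between $x_t$ and $x'_t$ depending on the sign of $\epsilon$.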
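The selector and the interleaved “past” it produces can be sketched in a few lines of Python. This is only an illustration under an assumed sign convention, consistent with the left-most path example later in this section: a sign of +1 picks the tangent element and a sign of -1 picks the original one. The function names are hypothetical.

```python
def selector(x, x_prime, eps):
    """Pick between the original element x and the tangent element x_prime.

    Assumed convention: eps = +1 selects x_prime, eps = -1 selects x.
    """
    if eps not in (-1, 1):
        raise ValueError("eps must be +1 or -1")
    return x_prime if eps == 1 else x

def interleaved_past(xs, x_primes, signs):
    """Build the 'past' as an interleaving of the two sequences, one
    selector call per round."""
    return [selector(x, xp, e) for x, xp, e in zip(xs, x_primes, signs)]
```

With all signs equal to -1 the interleaved past is the original sequence, and with all signs equal to +1 it is the tangent sequence.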
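Throughout the paper, we deal with binary trees, which arise from symmetrization. Given some set $\mathcal Z$, a $\mathcal Z$-valued tree of depth $T$ is a sequence $(\mathbf z_1,\ldots,\mathbf z_T)$ of $T$ mappings $\mathbf z_t:\{\pm 1\}^{t-1}\to\mathcal Z$. The $T$-tuple $\epsilon=(\epsilon_1,\ldots,\epsilon_T)\in\{\pm 1\}^T$ defines a path. For brevity, we write $\mathbf z_t(\epsilon)$ instead of $\mathbf z_t(\epsilon_{1:t-1})$.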
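As an illustration of the tree notation, a tree of depth T can be stored as one lookup table per level, keyed by the sign prefix that leads to the node. The sketch below is one possible encoding, not the paper's; `make_tree` and `values_on_path` are hypothetical helper names.

```python
from itertools import product

def make_tree(depth, label):
    """Build a tree of the given depth as a list of dicts.

    Level t (0-based) maps every sign prefix of length t to label(prefix),
    so the root is stored under the empty prefix ().
    """
    return [
        {prefix: label(prefix) for prefix in product((-1, 1), repeat=t)}
        for t in range(depth)
    ]

def values_on_path(tree, eps):
    """Read off the node values along the path eps (a tuple of signs).

    The value at level t depends only on the first t signs of the path.
    """
    return [level[tuple(eps[:t])] for t, level in enumerate(tree)]

# Example: label each node with the sum of the signs leading to it.
tree = make_tree(3, lambda prefix: sum(prefix))
```

Note that the last sign of a path never affects the values read along it, which matches the convention that the mapping at time t takes only the first t-1 signs as input.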
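Given a joint distribution $\mathbf p$, consider the “$(\mathcal P\times\mathcal P)$”-valued probability tree $\boldsymbol\rho=(\boldsymbol\rho_1,\ldots,\boldsymbol\rho_T)$ defined by
\[ \boldsymbol\rho_t(\epsilon_{1:t-1})\big((x_1,x'_1),\ldots,(x_{T-1},x'_{T-1})\big) = \Big( p_t\big(\cdot\,|\,\chi_1(\epsilon_1),\ldots,\chi_{t-1}(\epsilon_{t-1})\big),\ p_t\big(\cdot\,|\,\chi_1(\epsilon_1),\ldots,\chi_{t-1}(\epsilon_{t-1})\big) \Big). \tag{11} \]
In other words, the values of the mappings $\boldsymbol\rho_t(\epsilon)$ are products of conditional distributions, where conditioning is done with respect to a sequence made from $x_s$ and $x'_s$ depending on the sign of $\epsilon_s$. We note that the difficulty in intermixing the $x$ and $x'$ sequences does not arise in i.i.d. or worst-case symmetrization. However, in between these extremes the notational complexity seems to be unavoidable if we are to employ symmetrization and obtain a version of Rademacher complexity.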
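As an example, consider the “left-most” path $\epsilon=-\mathbf 1$ in a binary tree of depth $T$, where $\mathbf 1$ is a $T$-dimensional vector of ones. Then all the selectors $\chi(x_t,x'_t,\epsilon_t)$ in the definition (11) select the sequence $x_1,\ldots,x_T$. The probability tree $\boldsymbol\rho$ on the “left-most” path is, therefore, defined by the conditional distributions $p_t(\cdot\,|\,x_{1:t-1})$. Analogously, on the path $\epsilon=\mathbf 1$, the conditional distributions are $p_t(\cdot\,|\,x'_{1:t-1})$.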
Slightly abusing the notation, we will write $\boldsymbol\rho_t(\epsilon)\big((x_1,x'_1),\ldots,(x_{t-1},x'_{t-1})\big)$ for the probability tree since $\boldsymbol\rho_t$ clearly depends only on the prefix up to time $t-1$. Throughout the paper, it will be understood that the tree $\boldsymbol\rho$ is obtained from $\mathbf p$ as described above. Since all the conditional distributions of $\mathbf p$ satisfy the restrictions, so do the corresponding distributions of the probability tree $\boldsymbol\rho$. By saying that $\boldsymbol\rho$ satisfies restrictions we then mean that $\mathbf p\in\mathfrak P$.
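Sampling of a pair of $\mathcal X$-valued trees from $\boldsymbol\rho$, written as $(\mathbf x,\mathbf x')\sim\boldsymbol\rho$, is defined as the following recursive process: for any $\epsilon\in\{\pm 1\}^T$,
\[ (\mathbf x_1(\epsilon),\mathbf x'_1(\epsilon))\sim\boldsymbol\rho_1(\epsilon), \qquad (\mathbf x_t(\epsilon),\mathbf x'_t(\epsilon))\sim\boldsymbol\rho_t(\epsilon)\big((\mathbf x_1(\epsilon),\mathbf x'_1(\epsilon)),\ldots,(\mathbf x_{t-1}(\epsilon),\mathbf x'_{t-1}(\epsilon))\big) \quad\text{for } 2\le t\le T. \tag{12} \]
To gain a better understanding of the sampling process, consider the first few levels of the tree. The roots $\mathbf x_1,\mathbf x'_1$ of the trees $\mathbf x,\mathbf x'$ are sampled from $p_1$, the conditional distribution for $t=1$ given by $\mathbf p$. Next, say, $\epsilon_1=+1$. Then the “right” children of $\mathbf x_1$ and $\mathbf x'_1$ are sampled via $\mathbf x_2(+1),\mathbf x'_2(+1)\sim p_2(\cdot\,|\,\mathbf x'_1)$ since $\chi_1(+1)$ selects $\mathbf x'_1$. On the other hand, the “left” children $\mathbf x_2(-1),\mathbf x'_2(-1)$ are both distributed according to $p_2(\cdot\,|\,\mathbf x_1)$. Now, suppose $\epsilon_1=+1$ and $\epsilon_2=-1$. Then, $\mathbf x_3(+1,-1),\mathbf x'_3(+1,-1)$ are both sampled from $p_3(\cdot\,|\,\mathbf x'_1,\mathbf x_2(+1))$.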
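The recursive sampling can be sketched along a single path as follows. This is a simplified illustration and not the full tree construction: it draws the pair of sequences for one fixed sign vector only, whereas in a full tree nodes are shared between paths with a common prefix. The name `sample_pair_along_path` and the interface `cond_sampler(past, rng)`, standing in for a draw from the conditional distribution given the selected past, are assumptions; the sign convention (+1 selects the tangent element) is likewise assumed.

```python
import random

def sample_pair_along_path(signs, cond_sampler, rng):
    """Sample original and tangent sequences along one path.

    At each round both elements are drawn independently from the same
    conditional distribution, given a past that interleaves the two
    sequences according to the earlier signs.
    """
    xs, x_primes, past = [], [], []
    for eps in signs:
        x = cond_sampler(past, rng)        # original element
        x_prime = cond_sampler(past, rng)  # tangent element, same conditional law
        xs.append(x)
        x_primes.append(x_prime)
        past.append(x_prime if eps == 1 else x)  # selector chooses what enters the past
    return xs, x_primes

def random_walk_step(past, rng):
    """Example conditional sampler: a +/-1 random walk started at 0."""
    last = past[-1] if past else 0
    return last + rng.choice((-1, 1))
```

For instance, `sample_pair_along_path((1, -1, -1), random_walk_step, random.Random(0))` returns two sequences of length 3 whose second elements are both one step away from the first tangent element.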
The proof of Theorem 3 reveals why such intricate conditional structure arises, and Section 4 shows that this structure greatly simplifies for i.i.d. and worst-case situations. Nevertheless, the process described above allows us to define a unified notion of Rademacher complexity for the spectrum of assumptions between the two extremes.
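The distribution-dependent sequential Rademacher complexity of a function class $\mathcal F\subseteq\mathbb R^{\mathcal X}$ is defined as
\[ \mathfrak R_T(\mathcal F,\mathbf p) = \mathbb E_{(\mathbf x,\mathbf x')\sim\boldsymbol\rho}\ \mathbb E_{\epsilon}\left[ \sup_{f\in\mathcal F}\sum_{t=1}^T \epsilon_t f(\mathbf x_t(\epsilon)) \right] \]
where $\epsilon=(\epsilon_1,\ldots,\epsilon_T)$ is a sequence of i.i.d. Rademacher random variables and $\boldsymbol\rho$ is the probability tree associated with $\mathbf p$.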
We now prove an upper bound on the value of the game in terms of this distribution-dependent sequential Rademacher complexity. This provides an extension of the analogous worst-case result to adversaries more benign than worst-case.
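Theorem 3.
The minimax value is bounded as
\[ \mathcal V_T(\mathcal P_{1:T}) \le 2\sup_{\mathbf p\in\mathfrak P}\mathfrak R_T(\mathcal F,\mathbf p). \tag{13} \]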
A more general statement also holds:
for any measurable function with the property . In particular, (13) is obtained by choosing .
The following corollary provides a natural “centered” version of the distribution-dependent Rademacher complexity. That is, the complexity can be measured by relative shifts in the adversarial moves.
For the game with restrictions ,
where denotes the conditional expectation of .
Suppose is a unit ball in a Banach space and . Then
Suppose the adversary plays a simple random walk (e.g., is uniform on a unit sphere). For simplicity, suppose this is the only strategy allowed by the set . Then are independent increments when conditioned on the history. Further, the increments do not depend on . Thus,
where is the corresponding random walk.
4 Analyzing Rademacher Complexity
The aim of this section is to provide a better understanding of the distribution-dependent sequential Rademacher complexity, as well as ways of upper-bounding it. We first show that the classical Rademacher complexity is equal to the distribution-dependent sequential Rademacher complexity for i.i.d. data. We further show that the distribution-dependent sequential Rademacher complexity is always upper bounded by the worst-case sequential Rademacher complexity.
It is already apparent to the reader that the sequential nature of the minimax formulation yields long mathematical expressions, which are unwieldy though not necessarily complicated. The functional notation and the tree notation alleviate many of these difficulties. However, it takes some time to become familiar and comfortable with these representations. The next few results hopefully provide the reader with a better feel for the distribution-dependent sequential Rademacher complexity.
Consider the i.i.d. restrictions for all , where is some fixed distribution on . Let be the process associated with the joint distribution . Then
is the classical Rademacher complexity.
By definition, we have,
In the i.i.d. case, however, the tree generation according to the process simplifies: for any ,
Thus, the random variables are all i.i.d. drawn from . Writing the expectation (15) explicitly as an average over paths, we get
The second equality holds because, for any fixed path , the random variables have joint distribution . ∎
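For intuition about the i.i.d. case, the classical Rademacher complexity of a small finite class can be estimated by Monte Carlo. The sketch below is an illustration under assumptions that are not taken from the paper: the class is finite, the complexity is left unnormalized (a sum rather than an average over rounds), and `rademacher_complexity` and its parameters are hypothetical names.

```python
import random

def rademacher_complexity(functions, sample_x, T, n_trials=2000, seed=0):
    """Monte Carlo estimate of E sup_f sum_t eps_t f(x_t) for i.i.d. x_t.

    functions : finite list of callables (the function class)
    sample_x  : callable rng -> x, one draw from the fixed distribution
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [sample_x(rng) for _ in range(T)]           # i.i.d. data
        signs = [rng.choice((-1, 1)) for _ in range(T)]  # Rademacher signs
        # Supremum over the (finite) class of the signed sum.
        total += max(sum(e * f(x) for e, x in zip(signs, xs)) for f in functions)
    return total / n_trials

# Example class: threshold functions on [0, 1).
thresholds = [lambda x, c=c: 1.0 if x >= c else 0.0 for c in (0.25, 0.5, 0.75)]
estimate = rademacher_complexity(thresholds, lambda rng: rng.random(), T=50)
```

Because the data are i.i.d., no tree structure is needed here: every path through the tree sees the same joint distribution, which is the simplification used in the proof above.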
For any joint distribution ,
is the sequential Rademacher complexity defined in .
To make the process associated with more explicit, we use the expanded definition:
The inequality holds by replacing expectation over by a supremum over the same. We then get rid of ’s since they do not appear anywhere. ∎
An interesting case of hybrid i.i.d.-adversarial data is considered in Lemma 17, and we refer to its proof as another example of an analysis of the distribution-dependent sequential Rademacher complexity.
We now turn to general properties of Rademacher complexity. The proof of the next proposition follows along the lines of the analogous worst-case result.
Distribution-dependent sequential Rademacher complexity satisfies the following properties.
If , then .
for all .
For any , where
Next, we consider upper bounds on $\mathfrak R_T(\mathcal F,\mathbf p)$ via covering numbers. Recall the definition of a (sequential) cover from the worst-case theory. This notion captures sequential complexity of a function class on a given $\mathcal X$-valued tree $\mathbf x$.
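A set $V$ of $\mathbb R$-valued trees of depth $T$ is an $\alpha$-cover (with respect to $\ell_p$-norm) of $\mathcal F\subseteq\mathbb R^{\mathcal X}$ on a tree $\mathbf x$ of depth $T$ if
\[ \forall f\in\mathcal F,\ \forall\epsilon\in\{\pm 1\}^T\ \ \exists\,\mathbf v\in V\ \ \text{s.t.}\ \ \left( \frac{1}{T}\sum_{t=1}^T \big|\mathbf v_t(\epsilon)-f(\mathbf x_t(\epsilon))\big|^p \right)^{1/p}\le\alpha. \]
The covering number of a function class $\mathcal F$ on a given tree $\mathbf x$ is defined as
\[ \mathcal N_p(\alpha,\mathcal F,\mathbf x) = \min\big\{ |V| : V \text{ is an } \alpha\text{-cover w.r.t. the } \ell_p\text{-norm of } \mathcal F \text{ on } \mathbf x \big\}. \]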
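The covering condition can be checked by brute force on small examples. The following Python sketch assumes the standard sequential-cover condition (for every function and every path, some tree in the candidate set is within the given radius in the normalized p-norm along that path), a finite function class, and trees stored as per-level dictionaries keyed by sign prefixes. The names `make_tree` and `is_cover` are hypothetical.

```python
from itertools import product

def make_tree(depth, label):
    """Tree as a list of dicts: level t maps sign prefixes of length t to values."""
    return [
        {prefix: label(prefix) for prefix in product((-1, 1), repeat=t)}
        for t in range(depth)
    ]

def is_cover(V, functions, x_tree, alpha, p=2):
    """Return True iff V is an alpha-cover of `functions` on `x_tree`.

    The covering tree may depend on the path, which is what distinguishes
    a sequential cover from a classical one.
    """
    T = len(x_tree)
    for f in functions:
        for eps in product((-1, 1), repeat=T):
            target = [f(x_tree[t][eps[:t]]) for t in range(T)]
            covered = False
            for v in V:
                err = sum(abs(v[t][eps[:t]] - target[t]) ** p for t in range(T))
                if (err / T) ** (1.0 / p) <= alpha:
                    covered = True
                    break
            if not covered:
                return False
    return True

# Depth-2 tree whose nodes are labelled by the sum of the signs so far.
x_tree = make_tree(2, lambda prefix: float(sum(prefix)))
zero_tree = make_tree(2, lambda prefix: 0.0)
```

In this example the single all-zero tree covers the class consisting of the identity and its negation at radius 0.8 but not at radius 0.5, since the distance along every path is the square root of 1/2.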
Using the notion of the covering number, the following result holds.
For any function class ,
The analogous worst-case result is phrased in terms of the maximal covering number over all trees $\mathbf x$. The proof, however, holds for any fixed tree $\mathbf x$, and thus immediately implies Theorem 8. If the expectation over $(\mathbf x,\mathbf x')$ in Theorem 8 can be exchanged with the integral, we pass to an upper bound in terms of the expected covering number.
The following simple corollary of the above theorem shows that the distribution-dependent Rademacher complexity of a function class composed with a Lipschitz mapping can be controlled in terms of the Dudley integral for the function class itself.
Fix a class and a function . Assume, for all , is a Lipschitz function with a constant . Then,
The statement can be seen as a covering-number version of the Lipschitz composition lemma.
5 Constrained Adversaries
In this section we consider adversaries who are constrained in the sequences of actions they can play. It is often useful to consider scenarios where the adversary is worst case, yet has some budget or constraint to satisfy while picking the actions. Examples of such scenarios include, for instance, games where the adversary is constrained to make moves that are close in some fashion to the previous move, linear games with bounded variance, and so on. Below we formulate such games quite generally through arbitrary constraints that the adversary has to satisfy on each round.
Specifically, for a $T$ round game consider an adversary who is only allowed to play sequences $x_1,\ldots,x_T$ such that at round $t$ the constraint $C_t(x_1,\ldots,x_t)=1$ is satisfied, where $C_t:\mathcal X^t\to\{0,1\}$ represents the constraint on the sequence played so far. The constrained adversary can be viewed as a stochastic adversary with restrictions on the conditional distribution at time $t$ given by the set of all Borel distributions on the set
\[ \mathcal X_t(x_{1:t-1}) = \{ x\in\mathcal X : C_t(x_1,\ldots,x_{t-1},x)=1 \}. \]
Since the set $\mathcal P_t(x_{1:t-1})$ includes all point distributions on each $x\in\mathcal X_t(x_{1:t-1})$, the sequential complexity simplifies in a way similar to worst-case adversaries. We write $\mathcal V_T(C_{1:T})$ for the value of the game with the given constraints. Now, assume that for any $x_{1:t-1}$, the set of all distributions on $\mathcal X_t(x_{1:t-1})$ is weakly compact in a way similar to compactness of $\mathcal P$. That is, $\mathcal P_t(x_{1:t-1})$ satisfy the necessary conditions for the minimax theorem to hold. We have the following corollaries of Theorems 1 and 3.
Let and be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let be the constraints. Then
where ranges over all distributions over sequences such that for all .
Let the set be a set of pairs of -valued trees with the property that for any and any
The minimax value is bounded as
for any measurable function with the property .
Armed with these results, we can recover and extend some known results on online learning against budgeted adversaries. The first result says that if the adversary is not allowed to move by more than $\sigma_t$ away from its previous average of decisions, the player has a strategy to exploit this fact and obtain lower regret. For the $\ell_2$-norm, such “total variation” bounds have been achieved in prior work up to a $\log T$ factor. We note that in the present formulation the budget is known to the learner, whereas the earlier results are adaptive. Such adaptation is beyond the scope of this paper.
Proposition 12 (Variance Bound).
Consider the online linear optimization setting with for a -strongly convex function on , and . Let for any and . Consider the sequence of constraints given by
In particular, we obtain the following variance bound. Consider the case when is given by , and . Consider the constrained game where the move played by adversary at time satisfies
In this case we can conclude that
We can also derive a variance bound over the simplex. Let