We study the problem of predicting a real-valued sequence in an online manner. At time $t$, the forecaster receives side information $x_t$ in the form of an element of an abstract set $\mathcal{X}$. The forecaster then makes a prediction $\hat{y}_t$ on the basis of the current observation $x_t$ and the data encountered thus far, and then observes the response $y_t$.
Such a problem of sequence prediction is studied in the literature under two distinct settings: probabilistic and deterministic. In the former setting, which falls within the purview of time series analysis, one posits a parametric form for the data-generating mechanism and estimates the model parameters based on past instances and input information in order to make the next prediction. In contrast, in the deterministic setting one assumes no such probabilistic mechanism. Instead, the goal is phrased as that of predicting as well as the best forecaster from a benchmark set of strategies. This latter setting—often termed prediction of individual sequences, or online learning—is the focus of the present paper.
We let the outcome $y_t$ and the prediction $\hat{y}_t$ take values in $\mathcal{Y}$ and $\mathcal{D}$, respectively. Formally, a deterministic prediction strategy is, for each round $t$, a mapping $(\mathcal{X} \times \mathcal{Y})^{t-1} \times \mathcal{X} \to \mathcal{D}$. We let the loss function $\ell(\hat{y}_t, y_t)$ score the quality of the prediction on a single round.
Assume that the time horizon $n$ is known to the forecaster. The overall quality of the forecaster is then evaluated against the benchmark set of predictors, denoted as a class of functions $\mathcal{F} \subseteq \mathcal{D}^{\mathcal{X}}$. The cumulative regret of the forecaster on the sequence $(x_1, y_1), \ldots, (x_n, y_n)$ is defined as
$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t). \qquad (1)$$
The forecaster aims to keep the difference in (1) small for all sequences $(x_1, y_1), \ldots, (x_n, y_n)$.
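As a concrete illustration of definition (1), the following sketch computes the cumulative regret of a forecaster against a finite comparison class; the forecaster, the two constant experts, and the absolute loss below are toy choices, not the paper's.

```python
# Cumulative regret (1): forecaster's total loss minus that of the best
# predictor in a finite comparison class F. All concrete choices below
# (constant experts, absolute loss) are illustrative.

def cumulative_regret(predictions, xs, ys, F, loss):
    forecaster_loss = sum(loss(p, y) for p, y in zip(predictions, ys))
    best_in_class = min(
        sum(loss(f(x), y) for x, y in zip(xs, ys)) for f in F
    )
    return forecaster_loss - best_in_class

loss = lambda yhat, y: abs(yhat - y)
F = [lambda x: 0.0, lambda x: 1.0]          # two constant predictors
xs = [None] * 4                             # side information unused here
ys = [1.0, 1.0, 0.0, 1.0]
preds = [0.5, 0.5, 0.5, 0.5]                # forecaster always plays 1/2
print(cumulative_regret(preds, xs, ys, F, loss))  # 2.0 - 1.0 = 1.0
```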
The comparison class $\mathcal{F}$ encodes the prior belief about the family of predictors one expects to perform well. If a forecasting strategy guarantees small regret for all sequences, and if $\mathcal{F}$ is a good model for the sequences observed in reality, then the forecasting strategy will also perform well in terms of its cumulative error. In fact, we can take $\mathcal{F}$ to be a class of solutions (that is, forecasting strategies) to a set of probabilistic sources one would obtain by positing a generative model of data. By doing so, we are modeling solutions to the prediction problem rather than modeling the data-generating mechanism. We refer to [18, 21] for further discussions on this “duality” between the probabilistic and deterministic approaches.
To ensure that $\mathcal{F}$ captures the phenomenon of interest, we would like $\mathcal{F}$ to be large. However, increasing the “size” of $\mathcal{F}$ likely leads to larger regret, as the comparison term in (1) becomes smaller. On the other hand, decreasing the “size” of $\mathcal{F}$ makes the regret minimization task easier, yet the prediction method is less likely to be successful in practice. This dichotomy is an analogue of the bias-variance tradeoff commonly studied in statistics. A contribution of this paper is an analysis of the growth of regret (with $n$) in terms of various notions of complexity of $\mathcal{F}$. The task was already accomplished in  for the case of the absolute loss $\ell(\hat{y}, y) = |\hat{y} - y|$. In the present paper we obtain optimal guarantees for convex Lipschitz losses under very general assumptions.
To give the reader a sense of the results of this paper, we state the following informal corollary. Let the complexity of $\mathcal{F}$ be measured via sequential entropy at scale $\beta$, to be defined below. (For the reader familiar with covering numbers, this is a sequential analogue—introduced in —of the classical Koltchinskii-Pollard entropy.)
Corollary 1 (Informal).
Suppose sequential entropy at scale $\beta$ behaves as $\beta^{-p}$, $p > 0$. Then the optimal regret
for prediction with absolute loss grows as $n^{1/2}$ if $p \in (0, 2)$, and as $n^{1 - 1/p}$ for $p > 2$;
for prediction with square loss grows as $n^{1 - \frac{2}{2+p}}$ if $p \in (0, 2)$, and as $n^{1 - 1/p}$ for $p > 2$.
Moreover, these rates admit matching lower bounds, sometimes modulo a logarithmic factor.
The first part of this corollary is established in . The second part requires new techniques that take advantage of the curvature of the loss function.
In an attempt to entice the reader, let us discuss two conclusions that can be drawn from Corollary 1. First, the rates of convergence match the optimal rates for excess square loss in the realm of distribution-free Statistical Learning Theory with i.i.d. data, under the same assumption on the behavior of empirical covering numbers. Hence, in the absence of a gap between classical and sequential complexities (introduced later), the regression problems in the two seemingly different frameworks enjoy the same rates of convergence. A deeper understanding of this phenomenon is of great interest.
The second conclusion concerns the coincidence of the optimal rates for the square and absolute losses for “rich” classes ($p > 2$). Informally, strong convexity of the loss does not affect the rate of convergence for such massive classes. A geometric explanation of this interesting phenomenon requires further investigation.
We finish this introduction with a note about the generality of the setting proposed so far. Suppose $\mathcal{X} = \bigcup_{t \geq 0} \mathcal{Y}^{t}$, the space of all histories of $\mathcal{Y}$-valued outcomes. Denoting $x_t = (y_1, \ldots, y_{t-1})$, we may view each $f \in \mathcal{F}$ itself as a strategy that maps a history to a prediction. Ensuring that $x_t$ is not arbitrary but consistent with history only makes the task of regret minimization easier; the analysis of this paper for this case follows along the same lines, but we omit the extra overhead of restrictions on the $x_t$'s and instead refer the reader to [14, 21].
The paper is organized as follows. Section 2 introduces the notation and presents a brief overview of sequential complexities. Upper and lower bounds on minimax regret are established in Sections 3 and 4, respectively. We calculate minimax rates for various examples in Section 5. We then turn to the question of developing algorithms in Section 6. We first show that an algorithm based on the Rademacher relaxation is admissible (see ) and yields the rates derived in a non-constructive manner in the first part of the paper. We show that further relaxations in the finite-dimensional case lead to the famous Vovk-Azoury-Warmuth forecaster. We also derive a prediction method for a finite class $\mathcal{F}$.
2.1 Assumptions and Definitions
We assume that the set $\mathcal{Y}$ of outcomes is a bounded set, a restriction that can be removed by standard truncation arguments (see e.g. ). Let $\mathcal{X}$ be some set of covariates, and let $\mathcal{F}$ be a class of functions $\mathcal{X} \to [-B, B]$ for some $B > 0$. Recall the protocol of the online prediction problem: on each round $t$, $x_t$ is revealed to the learner, who subsequently makes a prediction $\hat{y}_t$. The response $y_t$ is revealed after the prediction is made.
The loss function $\ell(\cdot, y)$ is assumed to be convex in its first argument. Let $\ell'(\hat{y}, y)$ denote any element of the subdifferential set (with respect to the first argument), and assume that $|\ell'(\hat{y}, y)| \leq L$ for all relevant $\hat{y}$ and $y$.
We assume that, for any distribution of $y$ supported on $\mathcal{Y}$, there is a minimizer of the expected loss $\mathbb{E}\, \ell(a, y)$ over $a$ that is finite and belongs to $[-B, B]$.
Given $y$, the error of a linear expansion at $b$ used to approximate the function value at $a$ is denoted by
$$\delta(a, b; y) \triangleq \ell(a, y) - \ell(b, y) - \ell'(b, y)(a - b).$$
Let $\Delta$ be a function defined pointwise as
$$\Delta(\epsilon) \triangleq \inf_{y \in \mathcal{Y}} \ \inf_{|a - b| \geq \epsilon} \delta(a, b; y),$$
a lower bound on the residual for any two values separated by $\epsilon$. For instance, an easy calculation shows that $\Delta(\epsilon) = \epsilon^2$ for the square loss $\ell(a, y) = (a - y)^2$.
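The stated fact about the square loss can be verified directly; a short derivation, assuming the standard form $\ell(a, y) = (a - y)^2$:

```latex
% Linearization error of the square loss \ell(a,y) = (a-y)^2 at b:
\ell(a,y) - \ell(b,y) - \ell'(b,y)(a-b)
  = (a-y)^2 - (b-y)^2 - 2(b-y)(a-b)
  = (a-b)^2 .
% Any two predictions separated by \epsilon thus incur residual exactly
% \epsilon^2, independently of y.
```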
2.2 Minimax Formulation
Unlike most previous approaches to the study of online regression, we do not start from an algorithm, but instead work directly with minimax regret. We will be able to extract a (not necessarily efficient) algorithm after obtaining upper bounds on the minimax value. Let us introduce notation that makes the minimax regret definition more concise. We use $\langle\!\langle \,\cdot\, \rangle\!\rangle_{t=1}^{n}$ to denote an interleaved application of the operators inside, repeated over rounds $t = 1, \ldots, n$. With this notation, the minimax regret of the online regression problem described earlier can be written as
$$\Big\langle\!\!\Big\langle \sup_{x_t} \inf_{\hat{y}_t} \sup_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right]$$
where each $\hat{y}_t$ ranges over $\mathcal{D}$, each $x_t$ ranges over $\mathcal{X}$, and each $y_t$ ranges over $\mathcal{Y}$. An upper bound on the minimax regret guarantees the existence of an algorithm (that is, a way to choose the $\hat{y}_t$'s) with at most that much regret against any sequence. A lower bound, in turn, guarantees the existence of a sequence on which no method can perform better than the given lower bound.
2.3 Sequential Complexities
One of the key tools in the study of estimators based on i.i.d. data is the symmetrization technique. By introducing Rademacher random variables, one can study the supremum of an empirical process conditionally on the data. Conditioning facilitates the introduction of sample-based complexities of a function class, such as an empirical covering number. For a class of bounded functions, the covering number with respect to the empirical metric is necessarily finite, and it leads to correct control of the empirical process even when discretization of the function class in a data-independent manner is impossible. We will return to this point when comparing our approach with discretization-based methods.
In the online prediction scenario, symmetrization is more subtle and involves the notion of a binary tree. The binary tree is, in some sense, the smallest entity that captures the sequential nature of the problem. More precisely, a $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ is a complete rooted binary tree with nodes labeled by elements of a set $\mathcal{Z}$. Equivalently, we think of $\mathbf{z}$ as a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_n)$ of labeling functions $\mathbf{z}_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$, where $\mathbf{z}_1$ is a constant label for the root, $\mathbf{z}_2(-1)$ and $\mathbf{z}_2(+1)$ are the labels for the left and right children of the root, and so forth. Hence, for $\epsilon = (\epsilon_1, \ldots, \epsilon_n) \in \{\pm 1\}^n$, $\mathbf{z}_t(\epsilon_1, \ldots, \epsilon_{t-1})$ is the label of the node on the $t$-th level of the tree obtained by following the path $\epsilon$. For a function $f : \mathcal{Z} \to \mathbb{R}$, $f(\mathbf{z})$ is an $\mathbb{R}$-valued tree with labeling functions $f \circ \mathbf{z}_t$ for level $t$ (or, in plain words, the evaluation of $f$ on $\mathbf{z}$).
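The tree formalism above can be made concrete in a few lines; the representation below (a list of labeling functions indexed by sign-path prefixes) is an illustrative choice.

```python
# A tree of depth n as a list of labeling functions: entry t maps a sign
# prefix (e_1, ..., e_t) to the label of the node at level t + 1. The
# depth-2 example below is illustrative.

def evaluate_path(tree, path):
    """Labels encountered along the path (e_1, ..., e_n) in {-1,+1}^n."""
    return [tree[t](path[:t]) for t in range(len(path))]

tree = [
    lambda prefix: 1,                            # root label
    lambda prefix: 0 if prefix[0] == -1 else 2,  # children of the root
]
print(evaluate_path(tree, (-1, +1)))  # [1, 0]
print(evaluate_path(tree, (+1, -1)))  # [1, 2]
```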
We now define two tree-based complexity notions of a class of functions.
Definition 1 ().
Sequential Rademacher complexity of a class $\mathcal{F}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, as well as its supremum over all such trees, are defined as
$$\mathfrak{R}_n(\mathcal{F}, \mathbf{x}) = \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t(\epsilon)), \qquad \mathfrak{R}_n(\mathcal{F}) = \sup_{\mathbf{x}} \mathfrak{R}_n(\mathcal{F}, \mathbf{x}),$$
where the expectation is over a sequence $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ of independent Rademacher random variables.
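Definition 1 can be approximated numerically for small finite classes; the following Monte Carlo sketch uses the tree-as-labeling-functions representation and is purely illustrative.

```python
import random

# Sketch: Monte Carlo estimate of the (unnormalized) sequential Rademacher
# complexity of a finite class F on a given tree, per Definition 1. The
# tree is a list of labeling functions x_t(e_1, ..., e_{t-1}); all names
# here are illustrative.

def seq_rademacher(F, tree, n, num_samples=5000, rng=random):
    total = 0.0
    for _ in range(num_samples):
        eps = [rng.choice([-1, 1]) for _ in range(n)]
        # sup over f of sum_t eps_t * f(x_t(eps_1, ..., eps_{t-1}))
        total += max(
            sum(e * f(tree[t](tuple(eps[:t]))) for t, e in enumerate(eps))
            for f in F
        )
    return total / num_samples

# Two constant functions +1 and -1 on a trivial tree: the supremum equals
# |eps_1 + ... + eps_n|, so the complexity is E|sum of n random signs|.
F = [lambda x: 1.0, lambda x: -1.0]
tree = [lambda prefix: None, lambda prefix: None]
est = seq_rademacher(F, tree, n=2)  # expected value is 1.0
```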
One may think of the functions $\mathbf{x}_t$ as a predictable process with respect to the dyadic filtration $\mathcal{A}_t = \sigma(\epsilon_1, \ldots, \epsilon_t)$. The following notion of a $\beta$-cover quantifies the complexity of the class evaluated on the predictable process.
Definition 2 ().
A set $V$ of $\mathbb{R}$-valued trees of depth $n$ forms a $\beta$-cover (with respect to the $\ell_q$ norm) of a function class $\mathcal{F}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ if
$$\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists \mathbf{v} \in V \ \text{s.t.}\ \left( \frac{1}{n} \sum_{t=1}^{n} \big| f(\mathbf{x}_t(\epsilon)) - \mathbf{v}_t(\epsilon) \big|^{q} \right)^{1/q} \leq \beta.$$
A $\beta$-cover in the $\ell_\infty$ sense requires that $| f(\mathbf{x}_t(\epsilon)) - \mathbf{v}_t(\epsilon) | \leq \beta$ for all $t$. The size of the smallest $\beta$-cover is denoted by $\mathcal{N}_q(\beta, \mathcal{F}, \mathbf{x})$, and $\mathcal{N}_q(\beta, \mathcal{F}) = \sup_{\mathbf{x}} \mathcal{N}_q(\beta, \mathcal{F}, \mathbf{x})$.
We will refer to $\log \mathcal{N}_2(\beta, \mathcal{F})$ as the sequential entropy of $\mathcal{F}$. In particular, we will study the behavior of minimax regret when sequential entropy grows polynomially (it is straightforward to allow multiplicative constants in this definition; we leave these details out for the sake of simplicity) as the scale $\beta$ decreases:
$$\log \mathcal{N}_2(\beta, \mathcal{F}) \leq \beta^{-p}, \qquad p > 0.$$
We also consider the parametric “$p = 0$” case, when the sequential covering number itself behaves as
$$\mathcal{N}_2(\beta, \mathcal{F}) \leq \left( \frac{c}{\beta} \right)^{d} \qquad (6)$$
(e.g. linear regression in a bounded set in $\mathbb{R}^d$). We remark that the cover is necessarily $n$-dependent, so the forms we assume for the nonparametric and parametric cases are, respectively, $\log \mathcal{N}_2(\beta, \mathcal{F}) \leq \beta^{-p}$ and $\mathcal{N}_2(\beta, \mathcal{F}) \leq (c/\beta)^{d}$.
3 Upper Bounds
The following theorem from  shows the importance of sequential Rademacher complexity for prediction with absolute loss.
Theorem 2 ().
Let $\ell(\hat{y}, y) = |\hat{y} - y|$, $\mathcal{D} = \mathcal{Y} = [-1, 1]$, and $\mathcal{F} \subseteq \mathcal{D}^{\mathcal{X}}$. It then holds that
$$\tfrac{1}{2}\, \mathfrak{R}_n(\mathcal{F}) \;\leq\; \sup_{x_{1:n}, y_{1:n}} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \;\leq\; 2\, \mathfrak{R}_n(\mathcal{F}).$$
Furthermore, an upper bound of $2 L\, \mathfrak{R}_n(\mathcal{F})$ holds for any $L$-Lipschitz loss. We observe, however, that as soon as $\mathcal{F}$ contains two distinct functions, the sequential Rademacher complexity of $\mathcal{F}$ scales as $\Omega(\sqrt{n})$. Yet, it is known that minimax regret for prediction with square loss can grow slower than this rate. Therefore, the direct analysis based on sequential Rademacher complexity (and a contraction lemma) gives loose upper bounds on minimax regret. The key contribution of this paper is the introduction of an offset Rademacher complexity that captures the correct behavior.
In the next lemma, we show that the minimax value of the sequential prediction problem with any convex Lipschitz loss function can be controlled via an offset sequential Rademacher complexity. As before, let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$, where each $\epsilon_t$ is an independent Rademacher random variable.
Under the assumptions and definitions in Section 2.1, the minimax rate is bounded by
$$\sup_{\mathbf{x}, \boldsymbol{\mu}} \ \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \Big[ 2 L\, \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - \boldsymbol{\mu}_t(\epsilon) \big) \;-\; \Delta\big( | f(\mathbf{x}_t(\epsilon)) - \boldsymbol{\mu}_t(\epsilon) | \big) \Big] \qquad (8)$$
where $\mathbf{x}$ and $\boldsymbol{\mu}$ range over all $\mathcal{X}$-valued and $\mathbb{R}$-valued trees of depth $n$, respectively.
The right-hand side of (8) will be termed the offset Rademacher complexity of a function class with respect to a convex even offset function and a mean $\mathbb{R}$-valued tree $\boldsymbol{\mu}$. If the offset function is identically zero, we recover the notion of sequential Rademacher complexity, since $\mathbb{E}_{\epsilon} \sum_{t} \epsilon_t \boldsymbol{\mu}_t(\epsilon) = 0$ for any predictable $\boldsymbol{\mu}$.
A matching lower bound on the minimax value will be presented in Section 4, and the two results warrant further study of offset Rademacher complexity. To this end, a natural next question is whether the chaining technique can be employed to control the supremum of this modified stochastic process. As a point of comparison, we first recall that sequential Rademacher complexity of a class of $[-1, 1]$-valued functions on $\mathcal{X}$ can be upper bounded via the Dudley-type integral bound
$$\mathfrak{R}_n(\mathcal{F}, \mathbf{x}) \;\leq\; \inf_{\alpha > 0} \left\{ 4 \alpha n + 12 \sqrt{n} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x})}\; d\delta \right\}$$
for any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, as shown in . We aim to obtain tighter upper bounds on the offset Rademacher complexity by taking advantage of the negative offset term.
To initiate the study of offset Rademacher complexity with functions other than quadratic, we recall the notion of a convex conjugate.
For a convex function $\phi$ with domain $\mathbb{R}$, the convex conjugate $\phi^*$ is defined as
$$\phi^*(u) = \sup_{s \in \mathbb{R}} \left\{ u s - \phi(s) \right\}.$$
The chaining technique for controlling a supremum of a stochastic process requires a statement about the behavior of the process over a finite collection. The next lemma provides such a statement for the offset Rademacher process.
Let $\phi$ be a convex, nonnegative, even function on $\mathbb{R}$ and let $\phi^*$ denote the convex conjugate of $\phi$. Assume $\phi^*$ is nondecreasing on $[0, \infty)$. For any finite set $V$ of $\mathbb{R}$-valued trees of depth $n$ and any constant $\lambda > 0$,
Further, for any $\mathcal{X}$-valued tree $\mathbf{x}$,
As an example, if $\phi(s) = c s^2$, an easy calculation shows that $\phi^*(u) = u^2 / (4c)$ for any $u$. Hence, the infimum in (10) can be evaluated in closed form, and the upper bound becomes proportional to $\log |V|$.
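The conjugate computation behind this example is elementary; a sketch, writing $\phi(s) = c s^2$ with $c > 0$:

```latex
\phi^*(u) \;=\; \sup_{s \in \mathbb{R}} \bigl\{ u s - c s^2 \bigr\}
         \;=\; \frac{u^2}{4c},
% attained at s = u/(2c), obtained by setting the derivative u - 2cs
% of the concave objective to zero.
```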
We can now employ the chaining technique to extend the control of the stochastic process beyond the finite collection.
Let $\phi$ and $\phi^*$ be as in Lemma 4. For any $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, any class of functions $\mathcal{F}$, and any constant $\lambda > 0$,
4 Lower Bounds
The function $\Delta$, arising from uniform (or strong) convexity of the loss function, enters the upper bounds on minimax regret. For proving lower bounds, we consider the dual property, that of (restricted) smoothness. To this end, let $\mathcal{F}' \subseteq \mathcal{F}$ be a subset satisfying the following condition:
For any such subset $\mathcal{F}'$, let $\bar{\Delta}$ be defined as
We write $\bar{\Delta}_f$ for the singleton set $\mathcal{F}' = \{f\}$.
The lower bounds in this section will be constructed from symmetric distributions supported on two carefully chosen points. Crucially, we do not require a uniform notion of smoothness, but rather a condition on the loss that holds for a restricted subset and a two-point distribution.
As an example, consider square loss and . For any , we may choose the two points as , for small enough , with the desired property. Then and .
Fix . Suppose satisfies condition (12), and suppose that for any ,
Then for any -valued tree of depth ,
The lower bound in (14) is an offset Rademacher complexity that matches the upper bound of Lemma 3 up to constants, as long as the functions $\Delta$ and $\bar{\Delta}$ exhibit the same behavior. In particular, the upper and lower bounds match up to a constant for the case of the square loss.
Our next step is to quantify the lower bound in terms of $n$ according to the “size” of $\mathcal{F}$. In contrast to the more common statistical approaches based on covering numbers and the Fano inequality, we turn to a notion of combinatorial dimension as the main tool.
An $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ is said to be $\alpha$-shattered by $\mathcal{F}$ if there exists an $\mathbb{R}$-valued tree $\mathbf{s}$ of depth $n$ such that
$$\forall \epsilon \in \{\pm 1\}^n,\ \exists f \in \mathcal{F} \ \text{s.t.}\ \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - \mathbf{s}_t(\epsilon) \big) \geq \alpha / 2$$
for all $t \in \{1, \ldots, n\}$. The tree $\mathbf{s}$ is called a witness. The largest $n$ for which there exists an $\alpha$-shattered $\mathcal{X}$-valued tree is called the (sequential) fat-shattering dimension, denoted by $\mathrm{fat}_{\alpha}(\mathcal{F})$.
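For a depth-one tree, the shattering condition can be checked by brute force; the helper below is a toy illustration of Definition 6 (the witness grid and the class are hypothetical choices).

```python
# Brute-force check of alpha-shattering for a depth-1 tree (a single root
# point x1): a witness s1 must exist so that, for each sign e, some f in
# the class satisfies e * (f(x1) - s1) >= alpha / 2.

def is_shattered_depth1(F, x1, alpha, witness_grid):
    return any(
        all(any(e * (f(x1) - s1) >= alpha / 2 for f in F) for e in (-1, +1))
        for s1 in witness_grid
    )

# Two constant functions 0 and 1: the witness s1 = 1/2 works at alpha = 1,
# but no witness works at alpha = 1.5 (the functions are only 1 apart).
F = [lambda x: 0.0, lambda x: 1.0]
grid = [k / 10 for k in range(11)]
print(is_shattered_depth1(F, None, 1.0, grid))  # True
print(is_shattered_depth1(F, None, 1.5, grid))  # False
```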
The reader will notice that the upper bound of Lemma 5 is in terms of sequential entropies rather than combinatorial dimensions. The two notions, however, are closely related.
Theorem 7 ().
Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to [-1, 1]$. For any $\alpha > 0$,
$$\mathcal{N}_{\infty}(\alpha, \mathcal{F}, n) \;\leq\; \left( \frac{2 e n}{\alpha} \right)^{\mathrm{fat}_{\alpha}(\mathcal{F})}.$$
As a consequence of the above theorem, if $\mathrm{fat}_{\alpha}(\mathcal{F}) \leq \alpha^{-p}$, then $\log \mathcal{N}_{\infty}(\alpha, \mathcal{F}, n) \leq C\, \alpha^{-p} \log\!\big( \tfrac{n}{\alpha} \big)$, where $C$ may depend on the range of functions in $\mathcal{F}$.
The lower bounds will now be obtained assuming a given behavior of the fat-shattering dimension, and the corresponding statements in terms of sequential entropy growth will involve extra logarithmic factors, hidden in the notation.
As an example, consider the case of square loss with . Then we may take , , , and hence . We verify that (15) holds for .
Suppose the statement of Lemma 6 holds for some . For any class and , there exists a modified class such that for all , and for ,
5 Minimax Rates
Let and suppose the loss function and the function class are such that
Then for ,
Here, the constant depends on the problem parameters. At $p = 2$, the bound (17) gains an extra logarithmic factor.
We match the above upper bounds with lower bounds under the assumption on the growth of the combinatorial dimension.
The lower bound of Theorem 11 matches the upper bound of Theorem 10 (up to factors polylogarithmic in $n$) in its dependence on $n$, on the relevant constants, and on the size of the gradients. The rest of this section is devoted to a discussion of the derived upper and lower bounds for particular loss functions or particular classes of functions.
5.1 Absolute loss
We verify that the general statements recover the correct rates for the case of the absolute loss $\ell(\hat{y}, y) = |\hat{y} - y|$. Since the absolute loss is not strongly convex, we take $\Delta = 0$ (and $\bar{\Delta} = 0$). Theorem 10 then yields the rate $n^{1/2}$ for $p \in (0, 2)$ and $n^{1 - 1/p}$ for $p > 2$, up to logarithmic factors. These rates are matched, again up to logarithmic factors, in Theorem 11. Of course, the result already follows from Theorem 2.
It is also instructive to check the limiting behavior. If the loss is scaled properly by the range of function values, the function $\Delta$ approaches the zero function, indicating absence of strong convexity of the loss. Examining the power of $n$ in Theorem 10, we see that it approaches the rate for the absolute loss, matching the discussion of the preceding paragraph.
5.2 Square loss
For a class $\mathcal{F}$ with sequential entropy growth $\log \mathcal{N}_2(\beta, \mathcal{F}) \leq \beta^{-p}$:
For $p \in (0, 2)$, the minimax regret is bounded as $C\, n^{1 - \frac{2}{2+p}}$ (for $p = 2$, an extra logarithmic factor appears).
For $p > 2$, the minimax regret is bounded as $C\, n^{1 - \frac{1}{p}}$.
For the parametric case (6), the minimax regret is bounded as $C\, d \log n$.
For a finite set $\mathcal{F}$, the minimax regret is bounded as $C \log |\mathcal{F}|$.
The upper bounds of Corollary 12 are tight (here the notation suppresses logarithmic factors):
For $p \in (0, 2)$: for any class of uniformly bounded functions with a lower bound of $\beta^{-p}$ on the sequential entropy growth, the minimax regret is at least of order $n^{1 - \frac{2}{2+p}}$.
For $p > 2$: for any class of uniformly bounded functions, there exists a slightly modified class with the same sequential entropy growth such that the minimax regret is at least of order $n^{1 - \frac{1}{p}}$.
There exists a class with the covering number as in (6), such that the minimax regret is at least of order $d \log n$.
5.3 $q$-loss for $q \in (1, 2)$
Consider the case of $\ell(\hat{y}, y) = |\hat{y} - y|^{q}$ for $q \in (1, 2)$, which interpolates between the absolute and square losses.
5.4 $q$-loss for $q \geq 2$
It is easy to check that for $q \geq 2$, the loss $|\hat{y} - y|^{q}$ is $q$-uniformly convex, and thus $\Delta(\epsilon)$ is of order $\epsilon^{q}$.
The upper bound of
then follows from Theorem 10.
5.5 Logistic loss
The loss function is strongly convex and smooth if the sets $\mathcal{D}$ and $\mathcal{Y}$ are bounded. This can be seen by computing the second derivative with respect to the first argument:
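Assuming the standard parametrization $\ell(\hat{y}, y) = \log(1 + e^{-y \hat{y}})$ (the precise form is not fixed in the text above), the computation reads:

```latex
\frac{\partial^2}{\partial \hat{y}^2}\, \log\bigl(1 + e^{-y \hat{y}}\bigr)
  \;=\; \frac{y^2\, e^{-y \hat{y}}}{\bigl(1 + e^{-y \hat{y}}\bigr)^{2}},
% which is bounded above, and bounded below away from zero, whenever
% \hat{y} and y range over bounded sets.
```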
We conclude that
Logistic loss is an example of a function whose third derivative is bounded by a multiple of its second derivative. Control of the remainder term in the Taylor approximation for such functions is given in [5, Lemma 1]. Other examples of strongly convex and smooth losses are the exponential loss and the truncated quadratic loss. These enjoy the same minimax rates as given above.
5.6 Logarithmic loss
The technique developed in this paper is not universal. In particular, it does not yield correct rates for rich classes of functions under the loss
$$\ell(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$$
for the problem of probability assignment over a binary alphabet. The suboptimality of Lemma 3 is due to the exploding Lipschitz constant. However, a modified approach is possible, and will be carried out in a separate paper.
5.7 Sparse linear predictors and square loss
We now focus on the quadratic loss and instead detail minimax rates for specific classes of functions. Consider the following parametric class. Let $\{g_1, \ldots, g_M\}$ be a set of functions such that each $\|g_j\|_{\infty} \leq 1$. Define $\mathcal{F}$ to be the set of convex combinations of at most $s$ out of these $M$ functions.
For this example, note that the sequential covering number can be easily upper bounded: we can choose $s$ out of the $M$ functions in $\binom{M}{s}$ ways, and observe that the pointwise metric entropy for convex combinations of $s$ bounded functions at scale $\beta$ is bounded as $c\, s \log(1/\beta)$. We conclude that
$$\log \mathcal{N}(\beta, \mathcal{F}) \;\leq\; \log \binom{M}{s} + c\, s \log(1/\beta).$$
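The counting argument above can be sketched numerically; the per-support cover size $\exp(s \log(1/\beta))$ used below is an illustrative stand-in for the actual entropy of convex combinations of $s$ bounded functions.

```python
from math import comb, log

# Counting sketch: choose the s-element support out of M base functions,
# then cover the convex combinations over each support; the (1/beta)^s
# cover size per support is an illustrative stand-in.

def log_cover_bound(M, s, beta):
    """log( C(M, s) * (1/beta)^s ): choose the support, then cover it."""
    return log(comb(M, s)) + s * log(1.0 / beta)

# The bound grows linearly in s but only logarithmically in M.
print(round(log_cover_bound(M=1000, s=5, beta=0.1), 1))
```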
From the main theorem, for the case of the square loss, the upper bound on regret is of order $s \log M + s \log n$.
The extension to other loss functions follows immediately from the general statements.
5.8 Besov spaces and square loss
Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Let $\mathcal{F}$ be a ball in the Besov space $B^{s}_{p,q}(\mathcal{X})$. When $s > d/p$, the pointwise metric entropy at scale $\epsilon$ scales as $\epsilon^{-d/s}$ [31, p. 20]. On the other hand, when $p, q \geq 2$, one can show that the space is a uniformly convex Banach space. From , it can be shown that the sequential Rademacher complexity can be upper bounded accordingly, yielding a bound on the minimax rate. These two controls together give the bound on the minimax rate. The generic forecaster with the Rademacher complexity as relaxation (see Section 6) enjoys the best of both of these rates. More specifically, we may identify the following regimes:
If , the minimax rate is .
If , the minimax rate depends on the interaction of and :
if , the minimax rate , otherwise, the rate is
5.9 Remarks: Experts, Mixability, and Discretization
The problem of prediction with expert advice has been central in the online learning literature . One can phrase the experts problem in our setting by taking a finite class $\mathcal{F}$ of functions. It is possible to ensure sublinear regret by following the “advice” of a randomly chosen “expert” from an appropriate distribution over experts. The randomized approach, however, effectively linearizes the problem and does not take advantage of the curvature of the loss. The precise way in which the loss enters the picture has been investigated thoroughly by Vovk  (see also ). Vovk defines a mixability curve that parametrizes achievable regret of a form slightly different from (1). Specifically, Vovk allows a constant other than $1$ in front of the infimum in the regret definition. Such regret bounds are called “inexact oracle inequalities” in statistics. Audibert  shows that the mixability condition on the loss function leads to a variance-type bound in his general PAC-based formulation, yet the analysis is restricted to the case of finitely many experts. While it is possible to repeat the analysis of the present paper with a constant other than $1$ in front of the comparator, this goes beyond the scope of the paper. Importantly, our techniques go beyond the finite case and can give correct regret bounds even when discretization to a finite set of experts yields vacuous bounds.
Let us emphasize the above point again by comparing the upper bound of Lemma 5 to the bound we may obtain via a metric entropy approach, as in the work of . Assume that $\mathcal{F}$ is a compact set of functions equipped with the supremum norm. The metric entropy, denoted by $\mathcal{H}(\epsilon)$, is the logarithm of the size of the smallest $\epsilon$-net with respect to the sup norm on $\mathcal{F}$. An aggregating procedure over the elements of the net gives an upper bound (omitting constants and logarithmic factors)
$$\inf_{\epsilon > 0} \left\{ \epsilon n + \mathcal{H}(\epsilon) \right\} \qquad (18)$$
on the regret (1). Here, $\epsilon n$ is the amount we lose from restricting attention to the $\epsilon$-net, and the second term appears from aggregation over a finite set. The balance (18) fails to capture the optimal behavior for large nonparametric sets of functions. Indeed, for an $\epsilon^{-p}$ behavior of the metric entropy, Vovk concludes the rate of $n^{1 - \frac{1}{p+1}}$. For $p > 2$, this is slower than the rate $n^{1 - \frac{1}{p}}$ one obtains from Lemma 5 by trivially upper bounding the sequential entropy by metric entropy. The gain is due to the chaining technique, a phenomenon well known in statistical learning theory. Our contribution is to introduce the same concepts in the domain of online learning.
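The balance behind (18) can be worked out explicitly; the sketch below assumes metric entropy of order $\epsilon^{-p}$ and takes the aggregation term to be $\mathcal{H}(\epsilon)$ itself, as for mixable losses:

```latex
% Equating the two terms of (18) with H(eps) = eps^{-p}:
\epsilon n \;=\; \epsilon^{-p}
\;\;\Longrightarrow\;\;
\epsilon = n^{-\frac{1}{p+1}},
\qquad
\text{rate} \;=\; \epsilon n \;=\; n^{1 - \frac{1}{p+1}},
% which is slower than the chaining-based n^{1 - 1/p}, since
% 1 - 1/(p+1) > 1 - 1/p for every p > 0.
```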
6 Relaxations and Algorithms
To design generic forecasters for the problem of online non-parametric regression, we follow the recipe provided in . It was shown in that paper that if one can find a relaxation (a sequence of mappings from observed data to reals) that satisfies certain conditions, then one can define prediction strategies based on such relaxations. Specifically, we look for relaxations $\mathrm{Rel}(\cdot)$ that satisfy the initial condition
$$\mathrm{Rel}\big( (x_1, y_1), \ldots, (x_n, y_n) \big) \;\geq\; -\inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t)$$
and the recursive admissibility condition that requires
$$\sup_{x_t} \inf_{\hat{y}_t} \sup_{y_t} \Big[ \ell(\hat{y}_t, y_t) + \mathrm{Rel}\big( (x_1, y_1), \ldots, (x_t, y_t) \big) \Big] \;\leq\; \mathrm{Rel}\big( (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}) \big)$$
for any $t$ and any prefix of data. A relaxation satisfying these two conditions is said to be admissible, and it leads to an algorithm
$$\hat{y}_t = \arg\min_{\hat{y}} \sup_{y_t} \Big[ \ell(\hat{y}, y_t) + \mathrm{Rel}\big( (x_1, y_1), \ldots, (x_t, y_t) \big) \Big]. \qquad (20)$$
For this forecast the associated bound on regret is
$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \;\leq\; \mathrm{Rel}(\emptyset)$$
(see  for details). We now claim that the following conditional version of (8) gives an admissible relaxation and leads to a method that enjoys the regret bounds shown in the first part of the paper.
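One step of the relaxation-based forecaster (20) can be sketched generically; the grid search over predictions and outcomes, and the terminal relaxation below (which only enforces the initial condition), are illustrative choices, not the paper's offset-Rademacher relaxation.

```python
# Sketch of one step of the relaxation-based forecaster (20): play the
# prediction minimizing the worst case of instantaneous loss plus the
# relaxation evaluated on the extended data.

def relaxation_forecast(hist, x_t, relaxation, loss, y_grid, pred_grid):
    """argmin over predictions of the sup over outcomes, as in (20)."""
    return min(
        pred_grid,
        key=lambda p: max(loss(p, y) + relaxation(hist + [(x_t, y)])
                          for y in y_grid),
    )

# Toy instance: absolute loss, two constant experts, and the terminal
# relaxation Rel(data) = -inf_f cumulative loss (the initial condition).
loss = lambda p, y: abs(p - y)
F = [lambda x: 0.0, lambda x: 1.0]
rel = lambda data: -min(sum(loss(f(x), y) for x, y in data) for f in F)

hist = [(None, 1.0), (None, 1.0)]            # both past outcomes were 1
p = relaxation_forecast(hist, None, rel, loss,
                        y_grid=[0.0, 1.0], pred_grid=[0.0, 0.5, 1.0])
print(p)  # 1.0: the method sides with the currently better expert
```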
The following relaxation is admissible:
The algorithm (20) with this relaxation enjoys a regret bound given by the offset Rademacher complexity
The proof of Lemma 15 follows closely the proof of Lemma 3, and we omit it (see [19, 20]). Since the regret bound for the above forecaster is exactly the one given in (8), the upper bounds in Corollary 12 hold for the above algorithm. Therefore, the algorithm based on this relaxation is optimal up to the tightness of the upper and lower bounds in Sections 3 and 4.
For the rest of this section, we restrict our attention to the case $\mathcal{Y} = [-B, B]$. We further assume that the expression under the supremum in (20) is a convex function of $y_t$. In this case, the prediction takes a simple form, as the supremum over $y_t$ is attained either at $-B$ or $B$. More precisely, the prediction can be written as
6.1 Recipe for designing online regression algorithms for general loss functions
We now provide a schema for deriving forecasters for general online non-parametric regression:
Find relaxation s.t.