Online Nonparametric Regression with General Loss Functions

01/26/2015 ∙ by Alexander Rakhlin, et al. ∙ 0

This paper establishes minimax rates for online regression with arbitrary classes of functions and general losses. We show that below a certain threshold for the complexity of the function class, the minimax rates depend on both the curvature of the loss function and the sequential complexities of the class. Above this threshold, the curvature of the loss does not affect the rates. Furthermore, for the case of square loss, our results point to the interesting phenomenon: whenever sequential and i.i.d. empirical entropies match, the rates for statistical and online learning are the same. In addition to the study of minimax regret, we derive a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online prediction algorithms that can be computationally efficient for certain problems. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.


1 Introduction

We study the problem of predicting a real-valued sequence in an on-line manner. At time , the forecaster receives side information in the form of an element of an abstract set . The forecaster then makes a prediction on the basis of the current observation and the data encountered thus far, and then observes the response .

Such a problem of sequence prediction is studied in the literature under two distinct settings: probabilistic and deterministic [18]. In the former setting, which falls within the purview of time series analysis, one posits a parametric form for the data-generating mechanism and estimates the model parameters based on past instances and input information in order to make the next prediction. In contrast, in the deterministic setting one assumes no such probabilistic mechanism. Instead, the goal is phrased as that of predicting as well as the best forecaster from a benchmark set of strategies. This latter setting, often termed prediction of individual sequences or online learning, is the focus of the present paper.

We let the outcome and the prediction take values in and , respectively. Formally, a deterministic prediction strategy is a mapping . We let the loss function score the quality of the prediction on a single round.

Assume that the time horizon , is known to the forecaster. The overall quality of the forecaster is then evaluated against the benchmark set of predictors, denoted as a class of functions . The cumulative regret of the forecaster on the sequence is defined as

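\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \qquad (1)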

The forecaster aims to keep the difference in (1) small for all sequences .
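To make the regret in (1) concrete, the following minimal sketch computes it for a finite benchmark class, so that the infimum becomes a minimum. The function names, the toy data, and the restriction to a finite class are assumptions made for illustration only.

```python
def absolute_loss(prediction, outcome):
    """Absolute loss on a single round."""
    return abs(prediction - outcome)


def cumulative_regret(predictions, outcomes, covariates, comparators, loss):
    """Forecaster's cumulative loss minus that of the best comparator in hindsight.

    `comparators` is a finite list of functions from covariates to predictions
    (an assumption; the text allows arbitrary classes).
    """
    forecaster_loss = sum(loss(p, y) for p, y in zip(predictions, outcomes))
    best_comparator_loss = min(
        sum(loss(f(x), y) for x, y in zip(covariates, outcomes))
        for f in comparators
    )
    return forecaster_loss - best_comparator_loss


# Toy example with three rounds and three hypothetical benchmark predictors.
covariates = [0.0, 1.0, 2.0]
outcomes = [0.0, 1.0, 1.0]
predictions = [0.5, 0.5, 0.5]
comparators = [lambda x: 0.0, lambda x: 1.0, lambda x: x]

regret = cumulative_regret(predictions, outcomes, covariates, comparators, absolute_loss)
```

In this toy run the forecaster accumulates loss 1.5 while the best comparator accumulates loss 1, so the regret is 0.5.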

The comparison class encodes the prior belief about the family of predictors one expects to perform well. If a forecasting strategy guarantees small regret for all sequences, and if is a good model for the sequences observed in reality, then the forecasting strategy will also perform well in terms of its cumulative error. In fact, we can take to be a class of solutions (that is, forecasting strategies) to a set of probabilistic sources one would obtain by positing a generative model of data. By doing so, we are modeling solutions to the prediction problem rather than modeling the data-generating mechanism. We refer to [18, 21] for further discussions on this “duality” between the probabilistic and deterministic approaches.

To ensure that the comparison class captures the phenomenon of interest, we would like it to be large. However, increasing the “size” of the class likely leads to larger regret, as the comparison term in (1) becomes smaller. On the other hand, decreasing the “size” of the class makes the regret minimization task easier, yet the prediction method is less likely to be successful in practice. This dichotomy is an analogue of the bias-variance tradeoff commonly studied in statistics. A contribution of this paper is an analysis of the growth of regret (with the time horizon) in terms of various notions of complexity of the class. The task was already accomplished in [24] for the case of absolute loss. In the present paper we obtain optimal guarantees for convex Lipschitz losses under very general assumptions.

To give the reader a sense of the results of this paper, we state the following informal corollary. Let complexity of be measured via sequential entropy at scale , to be defined below. (For the reader familiar with covering numbers, this is a sequential analogue—introduced in [24]—of the classical Koltchinskii-Pollard entropy).

Corollary 1 (Informal).

Suppose sequential entropy at scale behaves as , . Then optimal regret

  • for prediction with absolute loss grows as if , and as for ;

  • for prediction with square loss grows as if , and as for .

Moreover, these rates have matching, sometimes modulo a logarithmic factor, lower bounds.

The first part of this corollary is established in [24]. The second part requires new techniques that take advantage of the curvature of the loss function.
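The regime structure of the informal corollary can be sketched as a small lookup of the polynomial exponent of regret as a function of the entropy exponent. The specific exponents coded below (1/2 and 1 - 1/p for absolute loss, p/(p+2) and 1 - 1/p for square loss, with the switch at p = 2) are stated here as assumptions for illustration, and logarithmic factors are ignored. The function name and the string labels are hypothetical.

```python
def regret_exponent(p, loss):
    """Exponent a such that regret grows as n**a, ignoring logarithmic factors.

    Assumed rates: the sequential entropy at scale beta grows as beta**(-p), p > 0.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if loss == "absolute":
        # No curvature: the low-complexity regime gives the sqrt(n) rate.
        return 0.5 if p < 2 else 1.0 - 1.0 / p
    if loss == "square":
        # Curvature helps only below the threshold p = 2.
        return p / (p + 2.0) if p < 2 else 1.0 - 1.0 / p
    raise ValueError("loss must be 'absolute' or 'square'")


exponents = {p: (regret_exponent(p, "absolute"), regret_exponent(p, "square"))
             for p in (0.5, 1.0, 2.0, 4.0)}
```

Under these assumed exponents the two losses agree for every p at or above the threshold, which is the second conclusion discussed in the text.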

In an attempt to entice the reader, let us discuss two conclusions that can be drawn from Corollary 1. First, the rates of convergence match optimal rates for excess square loss in the realm of distribution-free Statistical Learning Theory with i.i.d. data, under the assumption on the behavior of empirical covering numbers [27]. Hence, in the absence of a gap between classical and sequential complexities (introduced later), the regression problems in the two seemingly different frameworks enjoy the same rates of convergence. A deeper understanding of this phenomenon is of great interest.

The second conclusion concerns the same optimal rate for both square and absolute loss for “rich” classes (). Informally, strong convexity of the loss does not affect the rate of convergence for such massive classes. A geometric explanation of this interesting phenomenon requires further investigation.

We finish this introduction with a note about the generality of the setting proposed so far. Suppose , the space of all histories of -valued outcomes. Denoting , we may view each itself as a strategy that maps history to a prediction. Ensuring that is not arbitrary but consistent with history only makes the task of regret minimization easier; the analysis of this paper for this case follows along the same lines, but we omit the extra overhead of restrictions on ’s and instead refer the reader to [14, 21].

The paper is organized as follows. Section 2 introduces the notation and then presents a brief overview of sequential complexities. Upper and lower bounds on minimax regret are established in Sections 3 and 4. We calculate minimax rates for various examples in Section 5. We then turn to the question of developing algorithms in Section 6. We first show that an algorithm based on the Rademacher relaxation is admissible (see [19]) and yields the rates derived in a non-constructive manner in the first part of the paper. We show that further relaxations in finite dimensional space lead to the famous Vovk-Azoury-Warmuth forecaster. We also derive a prediction method for finite class .

2 Preliminaries

2.1 Assumptions and Definitions

We assume that the set of outcomes is a bounded set, a restriction that can be removed by standard truncation arguments (see e.g. [12]). Let be some set of covariates, and let be a class of functions for some . Recall the protocol of the online prediction problem: On each round , is revealed to the learner who subsequently makes a prediction . The response is revealed after the prediction is made.

The loss function is assumed to be convex. Let denote any element of the subdifferential set (with respect to first argument), and assume that

We assume that for any distribution of supported on , there is a minimizer of expected loss that is finite and belongs to :

Given a , the error of a linear expansion at to approximate function value at is denoted by

Let be a function defined pointwise as

(2)

a lower bound on the residual for any two values separated by . For instance, an easy calculation shows that for .

2.2 Minimax Formulation

Unlike most previous approaches to the study of online regression, we do not start from an algorithm, but instead work directly with minimax regret. We will be able to extract a (not necessarily efficient) algorithm after obtaining upper bounds on the minimax value. Let us introduce the notation that makes the minimax regret definition more concise. We use to denote an interleaved application of the operators, repeated over rounds. With this notation, the minimax regret of the online regression problem described earlier can be written as

(3)

where each ranges over , ranges over , and ranges over . An upper bound on guarantees the existence of an algorithm (that is, a way to choose ’s) with at most that much regret against any sequence. A lower bound on , in turn, guarantees the existence of a sequence on which no method can perform better than the given lower bound.

2.3 Sequential Complexities

One of the key tools in the study of estimators based on i.i.d. data is the symmetrization technique [13]. By introducing Rademacher random variables, one can study the supremum of an empirical process conditionally on the data. Conditioning facilitates the introduction of sample-based complexities of a function class, such as an empirical covering number. For a class of bounded functions, the covering number with respect to the empirical metric is necessarily finite and leads to a correct control of the empirical process even if discretization of the function class in a data-independent manner is impossible. We will return to this point when comparing our approach with discretization-based methods.

In the online prediction scenario, symmetrization is more subtle and involves the notion of a binary tree. The binary tree is, in some sense, the smallest entity that captures the sequential nature of the problem. More precisely, a -valued tree of depth is a complete rooted binary tree with nodes labeled by elements of a set . Equivalently, we think of as labeling functions, where is a constant label for the root, are the labels for the left and right children of the root, and so forth. Hence, for , is the label of the node on the -th level of the tree obtained by following the path . For a function , is an -valued tree with labeling functions for level (or, in plain words, evaluation of on ).
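The labeling-function view of a tree can be sketched in a few lines. In this minimal sketch a tree of a given depth is a list of functions, where the function for level t takes the first t-1 signs of a path and returns the label of the node reached. The helper names and the toy labeling rule are hypothetical.

```python
def make_tree(depth, label):
    """Build labeling functions for levels 1..depth.

    `label(level, prefix)` returns the node label at `level` reached by the
    sign prefix `prefix`, a tuple of level-1 values in {-1, +1}.
    """
    return [lambda prefix, level=level: label(level, tuple(prefix))
            for level in range(1, depth + 1)]


def labels_along_path(tree, signs):
    """Labels met when following the path `signs` from the root."""
    return [tree[t](signs[:t]) for t in range(len(tree))]


def apply_function(f, tree):
    """Evaluation of f on the tree: a real-valued tree of the same depth."""
    return [lambda prefix, node=node: f(node(prefix)) for node in tree]


# Toy tree: each node is labeled by the number of +1 signs on the way to it.
tree = make_tree(3, lambda level, prefix: sum(1 for s in prefix if s == 1))
path_labels = labels_along_path(tree, (1, -1, 1))
doubled = labels_along_path(apply_function(lambda x: 2 * x, tree), (1, -1, 1))
```

The root label depends on no signs, and the label at level t depends only on the first t-1 signs, which is what makes the tree a predictable object.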

We now define two tree-based complexity notions of a class of functions.

Definition 1 ([24]).

Sequential Rademacher complexity of a class on a given -valued tree of depth , as well as its supremum, are defined as

(4)

where the expectation is over a sequence of independent Rademacher random variables .
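For a finite class and a small depth, the expectation in this definition can be evaluated exactly by enumerating all sign sequences. The sketch below does so under two assumptions that are not taken from the text: the class is a finite list of functions, and the sum over rounds is left unnormalized (no division by the depth). The tree is passed as a hypothetical callable `tree(t, prefix)` with a zero-based level index `t`.

```python
from itertools import product


def sequential_rademacher(functions, tree, depth):
    """Exact value of E sup_f sum_t eps_t * f(x_t(eps)) on one fixed tree.

    `tree(t, prefix)` returns the label at zero-based level t after the sign
    prefix `prefix` (a tuple of length t). Cost grows as 2**depth.
    """
    total = 0.0
    for signs in product((-1, 1), repeat=depth):
        total += max(
            sum(signs[t] * f(tree(t, signs[:t])) for t in range(depth))
            for f in functions
        )
    return total / 2 ** depth


# Two constant functions on a trivial tree: the supremum equals |eps_1 + eps_2|.
constant_class = [lambda x: 1.0, lambda x: -1.0]
trivial_tree = lambda t, prefix: 0.0
value = sequential_rademacher(constant_class, trivial_tree, depth=2)
```

The supremum over trees in the definition is not attempted here; the sketch evaluates the complexity on a single given tree only.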

One may think of the functions as a predictable process with respect to the dyadic filtration . The following notion of a -cover quantifies complexity of the class evaluated on the predictable process.

Definition 2 ([24]).

A set of -valued trees of depth forms a -cover (with respect to the norm) of a function class on a given -valued tree of depth if

A -cover in the sense requires that for all . The size of the smallest -cover is denoted by , and .
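The cover condition can be checked by brute force for small examples. The sketch below tests the sup-norm version: for every function in the class and every path, some tree in the candidate set must stay within the given scale at every level, and the covering tree is allowed to depend on the path. The finite class, the callable representation `tree(t, prefix)` with zero-based level `t`, and all names are assumptions for illustration.

```python
from itertools import product


def is_linf_cover(cover, functions, tree, depth, beta):
    """Check the sup-norm sequential cover condition by enumeration.

    `cover` is a list of real-valued trees v(t, prefix); `tree(t, prefix)`
    gives the covariate label at zero-based level t.
    """
    for signs in product((-1, 1), repeat=depth):
        for f in functions:
            values = [f(tree(t, signs[:t])) for t in range(depth)]
            # The covering tree may be chosen after seeing the path `signs`.
            covered = any(
                all(abs(values[t] - v(t, signs[:t])) <= beta for t in range(depth))
                for v in cover
            )
            if not covered:
                return False
    return True


functions = [lambda x: 0.0, lambda x: 1.0]
trivial_tree = lambda t, prefix: 0.0
midpoint_cover = [lambda t, prefix: 0.5]

covers_at_half = is_linf_cover(midpoint_cover, functions, trivial_tree, 2, 0.5)
covers_at_quarter = is_linf_cover(midpoint_cover, functions, trivial_tree, 2, 0.25)
```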

We will refer to as sequential entropy of . In particular, we will study the behavior of the minimax regret when sequential entropy grows polynomially as the scale decreases (it is straightforward to allow constants in this definition, and we leave these details out for the sake of simplicity):

(5)

We also consider the parametric “” case when sequential covering itself behaves as

(6)

(e.g. linear regression in a bounded set in ). We remark that the cover is necessarily -dependent, so the forms we assume for nonparametric and parametric cases, respectively, are

(7)

3 Upper Bounds

The following theorem from [24] shows the importance of sequential Rademacher complexity for prediction with absolute loss.

Theorem 2 ([24]).

Let , , and . It then holds that

Furthermore, an upper bound of holds for any -Lipschitz loss. We observe, however, that as soon as contains two distinct functions, sequential Rademacher complexity of scales as . Yet, it is known that minimax regret for prediction with square loss grows slower than this rate. Therefore, the direct analysis based on sequential Rademacher complexity (and a contraction lemma) gives loose upper bounds on minimax regret. The key contribution of this paper is an introduction of an offset Rademacher complexity that captures the correct behavior.

In the next lemma, we show that minimax value of the sequential prediction problem with any convex Lipschitz loss function can be controlled via offset sequential Rademacher complexity. As before, let where each is an independent Rademacher random variable.

Lemma 3.

Under the assumptions and definitions in Section 2.1, the minimax rate is bounded by

(8)

where and range over all -valued and -valued trees of depth , respectively.

The right-hand side of (8) will be termed offset Rademacher complexity of a function class with respect to a convex even offset function and a mean -valued tree . If , we recover the notion of sequential Rademacher complexity since .
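The effect of the negative offset term can be seen numerically on a toy class. The sketch below evaluates, on one fixed pair of trees, the expected supremum of a signed sum of deviations from the mean tree minus an offset function of those deviations. The exact constants of (8) are not reproduced: `scale` is a hypothetical multiplier, and the finite class, the callable tree representation with zero-based level `t`, and the quadratic offset in the example are assumptions.

```python
from itertools import product


def offset_rademacher(functions, tree, mean_tree, depth, offset, scale=1.0):
    """E sup_f sum_t [scale * eps_t * gap_t - offset(gap_t)] on fixed trees,
    where gap_t = f(x_t(eps)) - mu_t(eps)."""
    total = 0.0
    for signs in product((-1, 1), repeat=depth):
        best = float("-inf")
        for f in functions:
            value = 0.0
            for t in range(depth):
                gap = f(tree(t, signs[:t])) - mean_tree(t, signs[:t])
                value += scale * signs[t] * gap - offset(gap)
            best = max(best, value)
        total += best
    return total / 2 ** depth


constant_class = [lambda x: 1.0, lambda x: -1.0]
trivial_tree = lambda t, prefix: 0.0

# Zero offset: the mean tree drops out in expectation.
no_offset = offset_rademacher(constant_class, trivial_tree,
                              lambda t, prefix: 0.5, 2, lambda g: 0.0)
# Quadratic offset around a zero mean tree.
quadratic = offset_rademacher(constant_class, trivial_tree,
                              lambda t, prefix: 0.0, 2, lambda g: g * g)
```

With a zero offset the value coincides with the sequential Rademacher complexity of the same toy class, while the quadratic offset pulls the value down, which is the mechanism that the chaining argument exploits.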

A matching lower bound on the minimax value will be presented in Section 4, and the two results warrant a further study of offset Rademacher complexity. To this end, a natural next question is whether the chaining technique can be employed to control the supremum of this modified stochastic process. As a point of comparison, we first recall that sequential Rademacher complexity of a class of -valued functions on can be upper bounded via the Dudley integral-type bound

(9)

for any -valued tree of depth , as shown in [26]. We aim to obtain tighter upper bounds on the offset Rademacher by taking advantage of the negative offset term.

To initiate the study of offset Rademacher complexity with functions other than quadratic, we recall the notion of a convex conjugate.

Definition 3.

For a convex function with domain , the convex conjugate is defined as
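As a worked instance of this definition, consider a quadratic function with a positive constant. The symbols below (the function name, the constant c, and the variables x and y) are chosen for illustration and are assumptions about notation. Writing the conjugate as the supremum of xy minus the function value, the first-order condition y - 2cx = 0 gives the maximizer, and substituting it back gives the conjugate:

```latex
\Delta(x) = c\,x^{2}, \quad c > 0
\qquad\Longrightarrow\qquad
\Delta^{*}(y) \;=\; \sup_{x \in \mathbb{R}} \bigl\{ x y - c\,x^{2} \bigr\}
\;=\; \frac{y^{2}}{2c} - \frac{y^{2}}{4c}
\;=\; \frac{y^{2}}{4c},
\qquad \text{attained at } x = \frac{y}{2c}.
```

The conjugate of a quadratic is again a quadratic, with the constant inverted up to a factor of four.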

The chaining technique for controlling a supremum of a stochastic process requires a statement about the behavior of the process over a finite collection. The next lemma provides such a statement for the offset Rademacher process.

Lemma 4.

Let be a convex, nonnegative, even function on and let denote the convex conjugate of the function . Assume is nondecreasing. For any finite set of -valued trees of depth and any constant ,

(10)

Further, for any -valued tree ,

(11)

As an example, if , an easy calculation shows that and for any . Hence, the infimum in (10) is achieved at , and the upper bound becomes .

We can now employ the chaining technique to extend the control of the stochastic process beyond the finite collection.

Lemma 5.

Let and be as in Lemma 4. For any -valued tree of depth and a class of functions and any constant ,

Remark 1.

For the case of , it is possible to prove the upper bound of Lemma 5 in terms of sequential covering numbers rather than (see [22]).

Lemma 5, together with Lemma 3, yields upper bounds on minimax regret under assumptions on the growth of sequential entropy. Before detailing the rates, we present lower bounds on the minimax value in terms of the offset Rademacher complexity and combinatorial dimensions.

4 Lower Bounds

The function , arising from uniform (or strong) convexity of the loss function, enters the upper bounds on minimax regret. For proving lower bounds, we consider the dual property, that of (restricted) smoothness. To this end, let be a subset satisfying the following condition:

(12)

For any such subset , let be defined as

(13)

We write for the singleton set .

The lower bounds in this section will be constructed from symmetric distributions supported on two carefully chosen points. Crucially, we do not require a uniform notion of smoothness, but rather a condition on the loss that holds for a restricted subset and a two-point distribution.

As an example, consider square loss and . For any , we may choose the two points as , for small enough , with the desired property. Then and .

Lemma 6.

Fix . Suppose satisfies condition (12), and suppose that for any ,

Then for any -valued tree of depth ,

(14)

The lower bound in (14) is an offset Rademacher complexity that matches the upper bound of Lemma 3 up to constants, as long as functions and exhibit the same behavior. In particular, the upper and lower bounds match up to a constant for the case of square loss.

Our next step is to quantify the lower bound in terms of according to “size” of . In contrast to the more common statistical approaches based on covering numbers and Fano inequality, we turn to a notion of a combinatorial dimension as the main tool.

Definition 4.

An -valued tree of depth is said to be -shattered by if there exists an -valued tree of depth such that

for all . The tree is called a witness. The largest for which there exists a -shattered -valued tree is called the (sequential) fat-shattering dimension, denoted by .

The reader will notice that the upper bound of Lemma 5 is in terms of sequential entropies rather than combinatorial dimensions. The two notions, however, are closely related.

Theorem 7 ([26]).

Let be a class of functions . For any ,

As a consequence of the above theorem, if and , then where may depend on the range of functions in .

The lower bounds will now be obtained assuming behavior of the fat-shattering dimension, and the corresponding statements in terms of the sequential entropy growth will involve extra logarithmic factors, hidden in the notation.

Lemma 8.

Suppose the statement of Lemma 6 holds for some , and suppose

(15)

for any and in the statement of Lemma 6. Then it holds that for any and ,

In particular, if for , we have

As an example, consider the case of square loss with . Then we may take , , , and hence . We verify that (15) holds for .

Lemma 9.

Suppose the statement of Lemma 6 holds for some . For any class and , there exists a modified class such that for all , and for ,

Armed with the upper bounds of Section 3 and the lower bounds of Section 4, we are ready to detail specific minimax rates of convergence for various classes of regression functions and a range of loss functions .

5 Minimax Rates

Combining Lemma 3 and Lemma 5, we can detail the behavior of minimax regret under an assumption about the growth rate of sequential entropy.

Theorem 10.

Let and suppose the loss function and the function class are such that

Then for ,

(16)

and for

(17)

Here, depends on . At , the bound (17) gains an extra factor.

We match the above upper bounds with lower bounds under the assumption on the growth of the combinatorial dimension.

Theorem 11.

Suppose the statement of Lemma 6 holds for some and . Let , , and assume

Then there exists a function class such that for some constant ,

for . Furthermore, for , for any with ,

under the assumption (15).

The lower bound of Theorem 11 matches (up to polylogarithmic in factors) the upper bound of Theorem 10 in its dependence on , the dependence on the constant , and in dependence on the size of the gradients (respectively, ). The rest of this section is devoted to the discussion of the derived upper and lower bounds for particular loss functions or particular classes of functions.

5.1 Absolute loss

We verify that the general statements recover the correct rates for the case of . Since the absolute loss is not strongly convex, we take (and ). Theorem 10 then yields the rate for and for , up to logarithmic factors. These rates are matched, again up to logarithmic factors, in Theorem 11. Of course, the result already follows from Theorem 2.

It is also instructive to check the case of . In this case, if is scaled properly by the range of function values, the function approaches the zero function, indicating absence of strong convexity of the loss. Examining the power in Theorem 10, we see that it approaches , matching the discussion of the preceding paragraph.

5.2 Square loss

The case of square loss has been studied in [22]. In view of Remark 1, we state the corollary below in terms of covering numbers, thus removing some logarithmic terms of Theorem 10.

Corollary 12.

For a class with sequential entropy growth ,

  • For , the minimax regret (footnote 3: For , .) is bounded as   

  • For , the minimax regret is bounded as   

  • For the parametric case (6),   

  • For finite set ,   

Corollary 13.

The upper bounds of Corollary 12 are tight (the notation suppresses logarithmic factors):

  • For , for any class of uniformly bounded functions with a lower bound of on sequential entropy growth,

  • For , for any class of uniformly bounded functions, there exists a slightly modified class with the same sequential entropy growth such that

  • There exists a class with the covering number as in (6), such that

5.3 -loss for

Consider the case of , for , which interpolates between the absolute value and square losses.

Corollary 14.

Suppose and for . Assume complexity of as in Theorems 10 and 11 for some . Then

5.4 -loss for

It is easy to check that for , is -uniformly convex, and thus

The upper bound of

then follows from Theorem 10.

5.5 Logistic loss

The loss function is strongly convex and smooth if the sets are bounded. This can be seen by computing the second derivative with respect to the first argument:

We conclude that

Logistic loss is an example of a function with third derivative bounded by a multiple of the second derivative. Control of the remainder term in Taylor approximation for such functions is given in [5, Lemma 1]. Other examples of strongly convex and smooth losses are the exponential loss and truncated quadratic loss. These enjoy the same minimax rate as given above.

5.6 Logarithmic loss

The technique developed in this paper is not universal. In particular, it does not yield correct rates for rich classes of functions under the loss

for the problem of probability assignment and a binary alphabet. The suboptimality of Lemma 3 is due to the exploding Lipschitz constant. However, a modified approach is possible, and will be carried out in a separate paper.

5.7 Sparse linear predictors and square loss

We now focus on quadratic loss and instead detail minimax rates for specific classes of functions. Consider the following parametric class. Let be a set of functions such that each . Define to be the convex combination of at most out of these functions. That is

For this example note that the sequential covering number can be easily upper bounded: we can choose out of functions in ways and observe that pointwise metric entropy for convex combination of bounded functions at scale is bounded as . We conclude that

From the main theorem, for the case of square loss, the upper bound is

The extension to other loss functions follows immediately from the general statements.

5.8 Besov spaces and square loss

Let be a compact subset of . Let be a ball in Besov space . When , the pointwise metric entropy at scale scales as [31, p. 20]. On the other hand, when , and , one can show that the space is a -uniformly convex Banach space. From [26], it can be shown that sequential Rademacher complexity can be upper bounded by , yielding a bound on the minimax rate. These two controls together give the bound on the minimax rate. The generic forecaster with Rademacher complexity as relaxation (see Section 6) enjoys the best of both of these rates. More specifically, we may identify the following regimes:

  • If , the minimax rate is .

  • If , the minimax rate depends on the interaction of and :

    • if , the minimax rate is ; otherwise, the rate is

5.9 Remarks: Experts, Mixability, and Discretization

The problem of prediction with expert advice has been central in the online learning literature [9]. One can phrase the experts problem in our setting by taking a finite class of functions. It is possible to ensure sublinear regret by following the “advice” of a randomly chosen “expert” from an appropriate distribution over experts. The randomized approach, however, effectively linearizes the problem and does not take advantage of the curvature of the loss. The precise way in which the loss enters the picture has been investigated thoroughly by Vovk [28] (see also [15]). Vovk defines a mixability curve that parametrizes achievable regret of a form slightly different than (1). Specifically, Vovk allows a constant other than 1 in front of the infimum in the regret definition. Such regret bounds are called “inexact oracle inequalities” in statistics. Audibert [2] shows that the mixability condition on the loss function leads to a variance-type bound in his general PAC-based formulation, yet the analysis is restricted to the case of finite experts. While it is possible to repeat the analysis in the present paper with a constant other than 1 in front of the comparator, this goes beyond the scope of the paper. Importantly, our techniques go beyond the finite case and can give correct regret bounds even if discretization to a finite set of experts yields vacuous bounds.

Let us emphasize the above point again by comparing the upper bound of Lemma 5 to the bound we may obtain via a metric entropy approach, as in the work of [31]. Assume that is a compact subset of equipped with supremum norm. The metric entropy, denoted by , is the logarithm of the smallest -net with respect to the sup norm on . An aggregating procedure over the elements of the net gives an upper bound (omitting constants and logarithmic factors)

(18)

on regret (1). Here, is the amount we lose from restricting the attention to the -net, and the second term appears from aggregation over a finite set. The balance (18) fails to capture the optimal behavior for large nonparametric sets of functions. Indeed, for an behavior of metric entropy, Vovk concludes the rate of . For , this is slower than the rate one obtains from Lemma 5 by trivially upper bounding the sequential entropy by metric entropy. The gain is due to the chaining technique, a phenomenon well-known in statistical learning theory. Our contribution is to introduce the same concepts to the domain of online learning.
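The discretization balance can be worked through numerically. The sketch below assumes a bound of the form n times the scale plus the metric entropy at that scale, with the entropy growing as the scale to the power -p. Both this form and the omission of constants and logarithmic factors are assumptions made for illustration, as are all names. Setting the derivative to zero gives the optimal scale (p/n)^(1/(p+1)) and a bound growing as n^(p/(p+1)).

```python
def balance(n, beta, p):
    """Assumed discretization bound: approximation cost plus entropy term."""
    return n * beta + beta ** (-p)


def optimal_scale(n, p):
    """Minimizer of the balance: solve n - p * beta**(-p - 1) = 0."""
    return (p / n) ** (1.0 / (p + 1.0))


def discretization_exponent(p):
    """Polynomial growth exponent of the optimized balance in n."""
    return p / (p + 1.0)


p = 1.0
ratios = []
for n in (10 ** 3, 10 ** 6):
    best = balance(n, optimal_scale(n, p), p)
    # Dividing by n**exponent should leave a constant that does not depend on n.
    ratios.append(best / n ** discretization_exponent(p))
```

For any positive p, the exponent p/(p+1) exceeds both p/(p+2) and 1 - 1/p, so a bound obtained by this balance grows faster than a bound with either of those exponents.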

6 Relaxations and Algorithms

To design generic forecasters for the problem of online non-parametric regression we follow the recipe provided in [19]. It was shown in that paper that if one can find a relaxation (a sequence of mappings from observed data to reals) that satisfies certain conditions, then one can define prediction strategies based on such relaxations. Specifically, we look for relaxations that satisfy the initial condition

and the recursive admissibility condition that requires

(19)

for any and any . A relaxation satisfying these two conditions is said to be admissible, and it leads to an algorithm

(20)

For this forecast the associated bound on regret is

(21)

(see [19] for details). We now claim that the following conditional version of (8) gives an admissible relaxation and leads to a method that enjoys the regret bounds shown in the first part of the paper.

Lemma 15.

The following relaxation is admissible:

The algorithm (20) with this relaxation enjoys the regret bound of offset Rademacher complexity

The proof of Lemma 15 follows closely the proof of Lemma 3 and we omit it (see [19, 20]). Since the regret bound for the above forecaster is exactly the one given in (8), the upper bounds in Corollary 12 hold for the above algorithm. Therefore, the algorithm based on is optimal up to the tightness of the upper and lower bounds in Sections 3 and 4.

For the rest of this section, we restrict our attention to the case when . We further assume that is a convex function of . In this case, the prediction takes a simple form, as the supremum over is attained either at or . More precisely, the prediction can be written as

(22)
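The min-max step behind the relaxation-based forecaster, specialized to a bounded interval of outcomes where the worst case is attained at an endpoint, can be sketched as follows. This is a minimal illustration and not the paper's closed-form prediction: the relaxation is a user-supplied hypothetical callable on the history of (covariate, outcome) pairs, the minimization is done by grid search, and checking only the two endpoints relies on the convexity assumption stated above.

```python
def relaxation_forecast(loss, relaxation, history, x_t, bound, grid_size=201):
    """Pick the prediction minimizing the worst case of loss plus relaxation.

    Outcomes are assumed to lie in [-bound, bound]; only the two endpoints are
    checked. `relaxation(history)` is a hypothetical callable returning a real
    number for a list of (covariate, outcome) pairs.
    """
    candidates = [-bound + 2.0 * bound * i / (grid_size - 1) for i in range(grid_size)]

    def worst_case(prediction):
        return max(
            loss(prediction, y) + relaxation(history + [(x_t, y)])
            for y in (-bound, bound)
        )

    return min(candidates, key=worst_case)


square_loss = lambda prediction, outcome: (prediction - outcome) ** 2

# A constant relaxation makes the problem symmetric, so the forecast is 0.
symmetric = relaxation_forecast(square_loss, lambda h: 0.0, [], x_t=None, bound=1.0)

# A toy relaxation that is lower after a positive last outcome shifts the forecast.
tilted = relaxation_forecast(square_loss, lambda h: -0.5 * h[-1][1], [], x_t=None, bound=1.0)
```

In the tilted example the two endpoint objectives balance at a prediction of -0.25, which is where the grid search lands.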

6.1 Recipe for designing online regression algorithms for general loss functions

We now provide a schema for deriving forecasters for general online non-parametric regression:

  1. Find relaxation s.t.