Online Nonparametric Regression

02/11/2014 ∙ by Alexander Rakhlin, et al.

We establish optimal rates for online regression for arbitrary classes of regression functions in terms of the sequential entropy introduced in (Rakhlin, Sridharan, Tewari, 2010). The optimal rates are shown to exhibit a phase transition analogous to the i.i.d./statistical learning case, studied in (Rakhlin, Sridharan, Tsybakov 2013). In the frequently encountered situation when sequential entropy and i.i.d. empirical entropy match, our results point to the interesting phenomenon that the rates for statistical learning with squared loss and online nonparametric regression are the same. In addition to a non-algorithmic study of minimax regret, we exhibit a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online regression algorithms that can be computationally efficient. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

1 Introduction

Within the online regression framework, data $(x_1,y_1),\ldots,(x_n,y_n),\ldots$ arrive in a stream, and we are tasked with sequentially predicting each next response $y_t$ given the current $x_t$ and the data $\{(x_i,y_i)\}_{i=1}^{t-1}$ observed thus far. Let $\hat{y}_t$ denote our prediction, and let the quality of this forecast be evaluated via square loss $(\hat{y}_t-y_t)^2$. Within the field of time series analysis, it is assumed that data are generated according to some model. The parameters of the model can then be estimated from data, leveraging the laws of probability. Alternatively, in the competitive approach, studied within the field of online learning, the aim is to develop a prediction method that does not assume a generative process of the data [7]. The problem is then formulated as that of minimizing regret

$$\sum_{t=1}^{n}(\hat{y}_t-y_t)^2-\inf_{f\in\mathcal{F}}\sum_{t=1}^{n}(f(x_t)-y_t)^2 \qquad (1)$$

with respect to some benchmark class of functions $\mathcal{F}$. This class encodes our prior belief about the family of regression functions that we expect to perform well on the sequence. Notably, an upper bound on regret is required to hold for all sequences.
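To make the regret notion (1) concrete, here is a minimal sketch that computes the cumulative square loss of a forecaster minus that of the best function in a benchmark class. The function name `regret` and the toy data are hypothetical, and a finite class is assumed only so that the infimum can be evaluated by enumeration.

```python
from typing import Callable, Sequence


def regret(preds: Sequence[float],
           xs: Sequence[float],
           ys: Sequence[float],
           benchmark: Sequence[Callable[[float], float]]) -> float:
    """Cumulative square loss of the forecaster minus that of the best benchmark function."""
    learner_loss = sum((p - y) ** 2 for p, y in zip(preds, ys))
    best_loss = min(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) for f in benchmark)
    return learner_loss - best_loss


# Toy sequence on which the identity function predicts perfectly.
xs = [0.0, 0.5, 1.0]
ys = [0.0, 0.5, 1.0]
benchmark = [lambda x: 0.0, lambda x: x]
always_zero = [0.0, 0.0, 0.0]
print(regret(always_zero, xs, ys, benchmark))  # 1.25
```

A forecaster that always predicts zero pays 0.25 + 1.0 on this sequence while the best benchmark function pays nothing, which gives the printed value.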

In the past twenty years, progress in online regression for arbitrary sequences, starting with the paper of Foster [8], has been almost exclusively on finite-dimensional linear regression (an incomplete list includes [19, 11, 20, 4, 2, 3, 9]). This is to be contrasted with Statistics, where regression has been studied for rich (nonparametric) classes of functions. Important exceptions to this limitation in the online regression framework – and works that partly motivated the present findings – are the papers of Vovk [23, 21, 22]. Vovk considers regression with large classes, such as subsets of a Besov or Sobolev space, and remarks that there appear to be two distinct approaches to obtaining the upper bounds in online competitive regression. The first approach, which Vovk terms Defensive Forecasting, exploits uniform convexity of the space, while the second – an aggregating technique (such as the Exponential Weights Algorithm) – is based on the metric entropy of the space. Interestingly, the two seemingly different approaches yield distinct upper bounds, based on the respective properties of the space. In particular, Vovk asks whether there is a unified view of these techniques. The present paper addresses these questions and establishes optimal performance for online regression.

Since most work in online learning is algorithmic, the boundaries of what can be proved are defined by the regret minimization algorithms one can find. One of the main algorithmic workhorses is the aggregating procedure mentioned above. However, the difficulty in using an aggregating procedure beyond simple parametric classes (e.g. subsets of $\mathbb{R}^d$) lies in the need for a “pointwise” cover of the set of functions – that is, a cover in the supremum norm on the underlying space of covariates (see Remark 3). The same difficulty arises when one uses PAC-Bayesian bounds [1] that, at the end of the day, require a volumetric argument. Notably, this difficulty has been overcome in statistical learning, where it has long been recognized (since the work of Vapnik and Chervonenkis) that it is sufficient to consider an empirical cover of the class – a potentially much smaller quantity. Such an empirical entropy is necessarily finite, and its growth with $n$ is one of the key complexity measures for i.i.d. learning. In particular, the recent work of [16] shows that the behavior of empirical entropy characterizes the optimal rates for i.i.d. learning with square loss. To mimic this development, it appears that we need to understand empirical covering numbers in the sequential prediction framework.

Sequential analogues of covering numbers, combinatorial parameters, and the Rademacher complexity have been recently introduced in [15]. These complexity measures were shown to both upper and lower bound minimax regret of online learning with absolute loss for arbitrary classes of functions. These rates, however, are not correct for the square loss case. Consider, for instance, finite-dimensional regression, where the behavior of minimax regret is known to be logarithmic in $n$; the Rademacher rate, however, cannot yield rates faster than $\sqrt{n}$. A hint as to how to modify the analysis for “curved” losses appears in the paper of [6] where the authors derived rates for log-loss via a two-level procedure: the set of densities is first partitioned into small balls of a critical radius; a minimax algorithm is employed on each of these small balls; and an overarching aggregating procedure combines these algorithms. Regret within each small ball is upper bounded by the classical Dudley entropy integral (with respect to a pointwise metric) defined up to the radius. The main technical difficulty in this paper is to prove a similar statement using “empirical” sequential covering numbers. (Footnote 1: While we develop our results for square loss, similar statements hold for much more general losses, as will be shown in the full version of this paper.)

Interestingly, our results imply the same phase transition as the one exhibited in [16] for i.i.d. learning with square loss. More precisely, under the assumption of the $\beta^{-p}$ behavior of sequential entropy at scale $\beta$, the minimax regret normalized by time horizon $n$ decays as $n^{-2/(2+p)}$ if $p\in(0,2)$, and as $n^{-1/p}$ for $p>2$. We prove lower bounds that match up to a logarithmic factor, establishing that the phase transition is real. Even more surprisingly, it follows that, under a mild assumption that sequential Rademacher complexity of $\mathcal{F}$ behaves similarly to its i.i.d. cousin, the rates of minimax regret in online regression with arbitrary sequences match, up to a logarithmic factor, those in the i.i.d. setting of Statistical Learning. This phenomenon has been noticed for some parametric classes by various authors (e.g. [5]). The phenomenon is even more striking given the simple fact that one may convert the regret statement, that holds for all sequences, into an i.i.d. guarantee. Thus, in particular, we recover the result of [16] through completely different techniques. Since in many situations, one obtains optimal rates for i.i.d. learning from a regret statement, the relaxation framework of [13] provides a toolkit for developing improper learning algorithms in the i.i.d. scenario.

After characterizing minimax rates for online regression, we turn to the question of developing algorithms. We first show that an algorithm based on the Rademacher relaxation is admissible (see [13]) and yields the rates derived in a non-constructive manner in the first part of the paper. This algorithm is not generally computationally feasible, but, in particular, does achieve optimal rates, improving on those exhibited by Vovk [21] for Besov spaces. We show that further relaxations in finite dimensional space lead to the famous Vovk-Azoury-Warmuth forecaster. For illustration purposes, we also derive a prediction method for a finite class $\mathcal{F}$.

2 Background

Let $\mathcal{X}$ be some set of covariates, and let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{X}$. We study the online regression scenario where on round $t\in\{1,\ldots,n\}$, $x_t\in\mathcal{X}$ is revealed to the learner who subsequently makes a prediction $\hat{y}_t$; Nature then reveals the bounded response $y_t$. (Footnote 2: The assumption of bounded responses can be removed by standard truncation arguments (see e.g. [10]).) Instead of (1), we consider a slightly modified notion of regret

(2)

for some . It is well-known that an upper bound on such a regret notion leads to the so-called optimistic rates which scale favorably with the cumulative loss   [2, 18]. More precisely, suppose we show an upper bound of on regret in (2). Then regret in (1) is upper bounded by

(3)

by considering the case and its converse.

Unlike most previous approaches to the study of online regression, we do not start from an algorithm, but instead directly work with minimax regret. We will be able to extract a (not necessarily efficient) algorithm after getting a handle on the minimax value. Let us introduce the notation that makes the minimax regret definition more concise. We use to denote an interleaved application of the operators inside repeated over rounds. With this notation, the minimax regret of the online regression problem described earlier can be written as

(4)

where each ranges over and range over . The usual minimax regret notion is simply given when as .

As mentioned above, in the i.i.d. scenario it is possible to employ a notion of a cover based on a sample, thanks to the symmetrization technique. In the online prediction scenario, symmetrization is more subtle, and involves the notion of a binary tree, the smallest entity that captures the sequential nature of the problem. To this end, let us state a few definitions. A $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ is a complete rooted binary tree with nodes labeled by elements of $\mathcal{Z}$. Equivalently, we think of $\mathbf{z}$ as $n$ labeling functions, where $\mathbf{z}_1$ is a constant label for the root, $\mathbf{z}_2(-1),\mathbf{z}_2(+1)\in\mathcal{Z}$ are the labels for the left and right children of the root, and so forth. Hence, for $\epsilon=(\epsilon_1,\ldots,\epsilon_n)\in\{\pm1\}^n$, $\mathbf{z}_t(\epsilon)=\mathbf{z}_t(\epsilon_1,\ldots,\epsilon_{t-1})\in\mathcal{Z}$ is the label of the node on the $t$-th level of the tree obtained by following the path $\epsilon$. For a function $g:\mathcal{Z}\to\mathbb{R}$, $g(\mathbf{z})$ is an $\mathbb{R}$-valued tree with labeling functions $g\circ\mathbf{z}_t$ for level $t$ (or, in plain words, evaluation of $g$ on $\mathbf{z}$).
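The tree-as-labeling-functions view can be sketched in a few lines. The sketch below assumes that the label on level t depends only on the first t-1 signs of the path; the helper names `make_tree`, `labels_along` and `apply_to_tree` are hypothetical and chosen for illustration only.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Path = Tuple[int, ...]  # a prefix of signs, each +1 or -1


def make_tree(labels: Dict[Path, float], depth: int) -> List[Callable[[Sequence[int]], float]]:
    """Return labeling functions for levels 1..depth; level t reads only the first t-1 signs."""
    def level(t: int) -> Callable[[Sequence[int]], float]:
        return lambda eps: labels[tuple(eps[: t - 1])]
    return [level(t) for t in range(1, depth + 1)]


def labels_along(tree, eps: Sequence[int]) -> List[float]:
    """Labels met on levels 1..depth when following the path eps from the root."""
    return [z_t(eps) for z_t in tree]


def apply_to_tree(g, tree):
    """Evaluate g on the tree: level t is labeled by the composition of g with the level-t labeling."""
    return [lambda eps, z_t=z_t: g(z_t(eps)) for z_t in tree]


# Depth-2 tree: root label 0.0, left child (sign -1) label -1.0, right child (sign +1) label 2.0.
labels = {(): 0.0, (-1,): -1.0, (1,): 2.0}
tree = make_tree(labels, depth=2)
for eps in product((-1, 1), repeat=2):
    print(eps, labels_along(tree, eps))
```

Note that the last sign of a path never influences a label, since the label on the final level is fixed before the final sign is drawn.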

Next, let us define sequential covering numbers – one of the key complexity measures of .

Definition 1 ([15]).

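A set $V$ of $\mathbb{R}$-valued trees of depth $n$ forms a $\beta$-cover (with respect to the $\ell_q$ norm) of a function class $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ if

$$\forall f\in\mathcal{F},\ \forall\epsilon\in\{\pm1\}^n,\ \exists\,\mathbf{v}\in V\ \text{s.t.}\ \Big(\frac{1}{n}\sum_{t=1}^{n}|f(\mathbf{x}_t(\epsilon))-\mathbf{v}_t(\epsilon)|^q\Big)^{1/q}\le\beta.$$

A $\beta$-cover in the $\ell_\infty$ sense requires that $|f(\mathbf{x}_t(\epsilon))-\mathbf{v}_t(\epsilon)|\le\beta$ for all $t\in\{1,\ldots,n\}$. The size of the smallest $\beta$-cover is denoted by $\mathcal{N}_q(\beta,\mathcal{F},\mathbf{x})$, and $\mathcal{N}_q(\beta,\mathcal{F},n)=\sup_{\mathbf{x}}\mathcal{N}_q(\beta,\mathcal{F},\mathbf{x})$.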
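A brute-force check of the sequential cover property can help build intuition. The sketch below assumes the standard form of the definition from [15] as we understand it: for every function in the class and every path, some tree in the cover must be within the given scale of the function's values along that path in the normalized q-norm. Trees are stored as dictionaries from sign prefixes to labels; the names `path_values` and `is_sequential_cover` are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Tree = Dict[Tuple[int, ...], float]  # maps a sign prefix of length t-1 to the level-t label


def path_values(tree: Tree, eps: Sequence[int]) -> List[float]:
    """Labels of the tree on levels 1..n along the path eps."""
    return [tree[tuple(eps[:t])] for t in range(len(eps))]


def is_sequential_cover(cover: Sequence[Tree],
                        funcs: Sequence[Callable[[float], float]],
                        x_tree: Tree, n: int, beta: float, q: float = 2.0) -> bool:
    """Check the cover property by enumerating all 2^n paths (feasible only for tiny n)."""
    for eps in product((-1, 1), repeat=n):
        xs = path_values(x_tree, eps)
        for f in funcs:
            fx = [f(x) for x in xs]

            def dist(v: Tree) -> float:
                vs = path_values(v, eps)
                return (sum(abs(a - b) ** q for a, b in zip(fx, vs)) / n) ** (1.0 / q)

            # The covering tree may be chosen separately for each path.
            if not any(dist(v) <= beta for v in cover):
                return False
    return True


x_tree = {(): 0.0, (-1,): -1.0, (1,): 1.0}
funcs = [lambda x: x, lambda x: -x]
zero_tree = {(): 0.0, (-1,): 0.0, (1,): 0.0}
print(is_sequential_cover([zero_tree], funcs, x_tree, n=2, beta=0.75))
print(is_sequential_cover([zero_tree], funcs, x_tree, n=2, beta=0.5))
```

On this toy instance a single zero tree is at distance sqrt(1/2) from both functions on every path, so it covers at scale 0.75 but not at scale 0.5.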

We will refer to the logarithm of the sequential covering number as sequential entropy of $\mathcal{F}$. In particular, we will study the behavior of minimax regret when sequential entropy grows polynomially as the scale decreases (footnote 3: it is straightforward to allow constants in this definition, and we leave these details out for the sake of simplicity):

(5)

We also consider the parametric “” case when sequential covering itself behaves as

(6)

(e.g. linear regression in a bounded set in ). We remark that the cover is necessarily -dependent, so the form we assume there is

(7)

3 Main Results

We now state the main results of this paper. They follow from the more general technical statements of Lemmas 4, 5, 6 and 7. We normalize by in order to make the rates comparable to those in statistical learning. Further, throughout the paper refer to constants that may depend on . Their values can be found in the proofs.

Theorem 1.

For a class with sequential entropy growth ,

  • For , the minimax regret444For , . is bounded as   

  • For , the minimax regret is bounded as   

  • For the parametric case (6),   

  • For finite set ,   

Theorem 2.

The upper bounds of Theorem 1 are tight (footnote 5: the notation used in the bounds below suppresses logarithmic factors):

  • For , for any class of uniformly bounded functions with a lower bound of on sequential entropy growth,

  • For , for any class of uniformly bounded functions, there exists a slightly modified class with the same sequential entropy growth such that

  • There exists a class with the covering number as in (6), such that

For the following theorem, we assume that is known a priori. Adaptivity to can be obtained through a doubling-type argument [17].

Theorem 3.

Additionally, the following optimistic rates hold for regret (1):

  • For , regret is upper bounded by

  • For , regret is upper bounded by . The bound gains an extra factor for

  • For the parametric case (7), regret is upper bounded by

where .

Remark 1.

The optimistic rate for appears to be slower than the hypothesized rate, and we leave the question of obtaining this rate as future work.

Remark 2.

If we assume that ’s are drawn from distributions with bounded mean and subgaussian tails, the same upper bounds can be shown with an extra factor.

Next, we prove the three theorems stated above. The proofs are of the “plug-and-play style”: the overarching idea is that the optimal rates can be derived simply by assuming an appropriate control of sequential entropy, be it a parametric or a nonparametric class.

Proof of Theorem 1.

We appeal to Eq. (13) in Lemma 4 below. Fix and let denote the -valued tree . Define the class . Observe that the values of outside of range of are immaterial. Also note that the covering number of on coincides with the covering number of on . Now, Lemma 5 applied to this class , together with , yields

(8)

We now evaluate the above upper bound for the growth of sequential entropy at scale . In particular, for the case , we may choose (maximum of the function) and . Then and the first term disappears. We are left with

For the case , Eq. (8) gives an upper bound

(9)

We choose and :

For the case , we gain an extra factor of since the integral of is the logarithm. For the parametric case (6), we choose and . Then Eq. (8) yields (for ),

In the finite case, for any . We then take (one can see that this value is allowed for the particular case of a finite class; or, use a small enough value). Then,

Normalizing by yields the desired rates in the statement of the theorem. ∎

Proof of Theorem 2.

The first two lower bounds are proved in Lemmas 9 and 10. The lower bound for the parametric case follows from the i.i.d. lower bound in [16]. ∎

Proof of Theorem 3.

For optimistic rates, we start with the upper bound in (12) and define as above. We then appeal to Lemma 6 and obtain

(10)

For decay of entropy for , we take , . The first term in (10) can be taken to be zero, as we may take one function at scale . The infimum in (10) evaluates to

For , we gain an extra factor: .

For , we take and . Then infimum in (10) evaluates to

For the parametric case (7), we take and . Then (10) is upper bounded by

The final optimistic rates are obtained by following the bound in (3). ∎

3.1 Offset Rademacher Complexity and the Chaining Technique

Let us recall the definition of sequential Rademacher complexity of a class

(11)

introduced in [14], where the expectation is over a sequence of independent Rademacher random variables $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$ and the supremum is over all $\mathcal{X}$-valued trees of depth $n$. While this complexity both upper- and lower-bounds minimax regret for absolute loss, it fails to capture the possibly faster rates one can obtain for regression. We show below that modified, or offset, versions of this complexity do in fact give optimal rates. These complexities have an extra quadratic term being subtracted off. Intuitively, this variance term “extinguishes” the $\sqrt{n}$-type fluctuations above a certain scale. Below this scale, complexity is given by the Dudley-type integral. The optimal balance of the scale gives the correct rates. As can be seen from the proof of Theorem 1, the critical scale is trivial (zero) for a finite case, then for a parametric class, for , and then becomes irrelevant (e.g. constant) at . Indeed, for , the rate is given purely by sequential Rademacher complexity, as curvature of the loss does not help. In particular, one can achieve these rates for by simply linearizing the square loss. The same phenomenon occurs in statistical learning with i.i.d. data [16].
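The effect of subtracting a quadratic term can be seen numerically on a tiny instance. The sketch below is an illustration under stated assumptions, not the exact quantity used in the lemmas: it fixes one tree rather than taking a supremum over trees, uses the unnormalized sum of signed function values, centers the offset at zero, and introduces a hypothetical constant `c` in front of the quadratic term. Setting `c=0` gives the plain (per-tree) sequential Rademacher average.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Tree = Dict[Tuple[int, ...], float]  # maps a sign prefix of length t-1 to the level-t label


def offset_rademacher(funcs: Sequence[Callable[[float], float]],
                      x_tree: Tree, n: int, c: float = 0.0) -> float:
    """Average over all 2^n sign paths of max_f sum_t [eps_t * f(x_t) - c * f(x_t)^2]."""
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        xs = [x_tree[tuple(eps[:t])] for t in range(n)]
        total += max(sum(e * f(x) - c * f(x) ** 2 for e, x in zip(eps, xs)) for f in funcs)
    return total / 2 ** n


# Depth-2 tree with every node labeled 1.0, and the three functions x -> a * x for a in {-1, 0, 1}.
x_tree = {(): 1.0, (-1,): 1.0, (1,): 1.0}
funcs = [lambda x, a=a: a * x for a in (-1.0, 0.0, 1.0)]
print(offset_rademacher(funcs, x_tree, n=2, c=0.0))  # 1.0
print(offset_rademacher(funcs, x_tree, n=2, c=1.0))  # 0.0
```

With the offset switched on, the functions with large values are penalized on every path, so the zero function becomes optimal and the complexity collapses from 1.0 to 0.0 in this example.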

We remark that [12] studies bounds for estimation with squared loss for the empirical risk minimization procedure and observes that it is enough to only consider one-sided estimates rather than concentration statements. The offset sequential Rademacher complexities are of this one-sided nature.

In Lemma 4 below, we provide a bound on minimax regret via offset sequential Rademacher complexities.

Lemma 4.

The minimax value of online regression with responses in a bounded interval is upper bounded by

(12)

and

(13)

where ranges over all -valued trees, and over all -valued trees of depth . Furthermore,

(14)

where ranges over -valued trees.

We now show that offset Rademacher complexities can be upper bounded by sequential entropies via the chaining technique. Lemma 5 below is an analogue of the Dudley-type integral bound

(15)

for sequential Rademacher proved in [15]. Crucially, the upper bound of Lemma 5 allows us to choose a critical scale .

Lemma 5.

Let be a -valued tree of depth . For any -valued tree and a class of functions and any ,

For optimistic rates, we can take advantage of an additional offset. This offset arises from the quadratic term due to the multiple of the loss of the algorithm.

Lemma 6.

Let be a -valued tree of depth . For any -valued tree and a class of functions , for any ,

(16)

The chaining arguments of Lemmas 5 and 6 are based on the following key finite-class lemma:

Lemma 7.

Let be a -valued tree of depth . For a finite set of -valued trees of depth , it holds that

(17)

for any , . It also holds that

(18)

Remark 3.

Let us compare the upper bound of Lemma 5 to the bound we may obtain via a metric entropy approach, as in the work of Vovk [21]. Assume that is a compact subset of equipped with supremum norm. The metric entropy, denoted by , is the logarithm of the smallest -net with respect to the sup norm on . An aggregating procedure over the elements of the net gives an upper bound (omitting constants and logarithmic factors)

(19)

on regret (1). Here, is the amount we lose from restricting attention to the -net, and the second term appears from aggregation over a finite set. While the balance (19) can yield correct rates for small classes, it fails to capture the optimal behavior for large nonparametric sets of functions. Indeed, for an behavior of metric entropy, Vovk concludes the rate of . For , this is slower than the rate one obtains from Lemma 5 by trivially upper bounding the sequential entropy by metric entropy. The gain is due to the chaining technique, a phenomenon well-known in statistical learning theory. Our contribution is to introduce the same concepts to the domain of online learning. Let us also mention that sequential covering number of is an “empirical” quantity and is finite even if we cannot upper bound metric entropy.

4 Further Examples

For the sake of illustration we show bounds on minimax rates for a couple of examples.

Example 1 (Sparse linear predictors).

Let be a set of functions such that each . Define to be the convex combination of at most out of these functions. That is

For this example note that the sequential covering number can be easily upper bounded: we can choose out of functions in ways and further the metric entropy for convex combination of bounded functions at scale is bounded as . We conclude that

From the main theorem, the upper bound is

Example 2 (Besov Spaces).

Let be a compact subset of . Let be a ball in Besov space . When , pointwise metric entropy bounds at scale scale as [21, p. 20]. On the other hand, when , one can show that the space is a Banach space that is -uniformly convex. From [15], it can be shown that sequential Rademacher can be upper bounded by , yielding an bound on sequential entropy at scale as . These two controls together give the bound on the minimax rate. The generic forecaster with Rademacher complexity as relaxation (see Section 6), enjoys the best of both of these rates. More specifically, we may identify the following regimes:

  • If , the minimax rate is .

  • If , the minimax rate depends on the interaction of and :

    • if , the minimax rate is , as above.

    • otherwise, the minimax rate is

5 Lower Bounds

The lower bounds will involve a notion of a “dimension” of called the sequential fat-shattering dimension. Let us introduce this notion.

Definition 2.

An -valued tree of depth is said to be -shattered by if there exists an -valued tree of depth such that

for all . The tree is called a witness. The largest for which there exists a -shattered -valued tree is called the (sequential) fat-shattering dimension, denoted by .
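For small instances the shattering condition can be verified by enumeration once a witness tree is supplied. The displayed condition is missing from the text above, so the sketch assumes the standard form from [15] as we understand it: for every sign path there must be a function in the class whose value on each level lies at least half the scale above the witness when the sign is +1 and at least half the scale below it when the sign is -1. The name `is_shattered` is hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Tree = Dict[Tuple[int, ...], float]  # maps a sign prefix of length t-1 to the level-t label


def is_shattered(funcs: Sequence[Callable[[float], float]],
                 x_tree: Tree, witness: Tree, depth: int, beta: float) -> bool:
    """Check the assumed shattering condition for the given witness tree over all 2^depth paths."""
    for eps in product((-1, 1), repeat=depth):
        realized = any(
            all(eps[t] * (f(x_tree[tuple(eps[:t])]) - witness[tuple(eps[:t])]) >= beta / 2
                for t in range(depth))
            for f in funcs)
        if not realized:
            return False
    return True


# Depth-1 tree: the two constant functions +1 and -1 realize both signs around the witness 0.
x_tree = {(): 0.0}
witness = {(): 0.0}
funcs = [lambda x: 1.0, lambda x: -1.0]
print(is_shattered(funcs, x_tree, witness, depth=1, beta=2.0))
print(is_shattered(funcs, x_tree, witness, depth=1, beta=2.5))
```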

The sequential fat-shattering dimension is related to sequential covering numbers as follows:

Theorem 8 ([15]).

Let be a class of functions . For any ,

Therefore, if , then

The lower bounds will now be obtained assuming behavior of the fat-shattering dimension, and the resulting statement of Theorem 2 in terms of the sequential entropy growth will involve extra logarithmic factors, hidden in the notation.

Lemma 9.

Consider the problem of online regression with responses bounded by . For any class of functions and any and ,

In particular, if for , we have

Lemma 10.

For any class and , there exists a modified class such that and for ,

In particular, when and ,

6 Relaxations and Algorithms

To design generic forecasters for the problem of online non-parametric regression we follow the recipe provided in [13]. It was shown in that paper that if one can find a relaxation (a sequence of mappings from observed data to reals) that satisfies initial and admissibility conditions then one can build estimators based on such relaxations. Specifically, we look for relaxations that satisfy the following initial condition

and the recursive admissibility condition that for any and any

(20)

If a relaxation satisfies these two conditions then one can define an algorithm via

and the regret of this forecast is automatically bounded as (see [13] for details):

Now further note that if is a convex function of then the prediction takes a very simple form, as the supremum over is attained either at or . The prediction can be written as

Observe that the first term decreases as increases to and likewise the second term monotonically decreases as decreases to . Hence the solution to the above is given when both terms are equal (if this doesn’t happen within the range then we clip). In other words,

Hence, for any admissible relaxation such that is a convex function of , the above prediction based on the relaxation enjoys the bound on regret .
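The equalization argument above can be turned into a short routine. The sketch assumes responses lie in an interval [-B, B] and that the loss is the square loss, so the two endpoint terms are (p - B)^2 + Rel(B) and (p + B)^2 + Rel(-B) for a candidate prediction p; setting them equal gives p = (Rel(B) - Rel(-B)) / (4B), which is then clipped. The name `relaxation_forecast` and the callable `rel` (the relaxation viewed as a function of the current response only) are hypothetical.

```python
from typing import Callable


def relaxation_forecast(rel: Callable[[float], float], B: float = 1.0) -> float:
    """Prediction that equalizes the two endpoint terms, clipped to [-B, B].

    rel(y) is assumed to be the relaxation value after appending the candidate
    response y on the current round, and is assumed convex in y.
    """
    y_hat = (rel(B) - rel(-B)) / (4.0 * B)
    return max(-B, min(B, y_hat))


def worst_case_objective(p: float, rel: Callable[[float], float], B: float = 1.0) -> float:
    """Maximum of the two endpoint terms that the prediction is meant to minimize."""
    return max((p - B) ** 2 + rel(B), (p + B) ** 2 + rel(-B))


rel = lambda y: 0.5 * y  # a toy convex (linear) relaxation
p = relaxation_forecast(rel)
print(p, worst_case_objective(p, rel))
```

For the toy relaxation the two endpoint terms coincide at the returned prediction, which is what the argument in the text predicts; when the unclipped value falls outside the interval, the routine returns the nearest endpoint.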

We now claim that the following conditional version of Equation (13) gives an admissible relaxation and leads to a method that enjoys the regret bounds shown in the first part of this paper.

Lemma 11.

The following relaxation is admissible :

The forecast corresponding to this relaxation is given by

The above algorithm enjoys the regret bound of an offset Rademacher complexity:

Notice that since the regret bound for the above prediction based on the sequential Rademacher relaxation is exactly the one given in Equation (13), the upper bounds provided for in Theorem 1 also hold for the above algorithm.

6.1 Recipe for designing online regression algorithms

We now provide a schema for deriving forecasters for general online non-parametric regression problems:

  1. Find relaxation such that

    and s.t. is a convex function of

  2. Check the condition

  3. Given on round , the prediction is given by

Proposition 12.

Any algorithm derived from the above schema using relaxation enjoys a bound

on regret.

Example: Finite class of experts
As an example of an estimator derived from the schema, we first consider the simple case of a finite class $\mathcal{F}$.

Corollary 13.

The following is an admissible relaxation :

It leads to the following algorithm

and enjoys a regret bound
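For comparison, the classical aggregating procedure for a finite class is easy to write down. The sketch below is the standard exponential weights forecaster with a weighted-mean prediction, not necessarily the forecaster of Corollary 13. It assumes expert predictions and responses lie in [-B, B], in which case the square loss is exp-concave with parameter 1/(8B^2) and the regret is at most 8 B^2 log N for N experts. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def exp_weights_forecaster(expert_preds: Sequence[Sequence[float]],
                           ys: Sequence[float], B: float = 1.0) -> List[float]:
    """Exponential weights with weighted-mean predictions for square loss on [-B, B]."""
    eta = 1.0 / (8.0 * B * B)  # exp-concavity parameter of the square loss on [-B, B]
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts
    forecasts = []
    for preds, y in zip(expert_preds, ys):
        shift = min(cum_loss)  # subtract the minimum for numerical stability
        weights = [math.exp(-eta * (loss - shift)) for loss in cum_loss]
        total = sum(weights)
        forecasts.append(sum(w * p for w, p in zip(weights, preds)) / total)
        # The response is revealed only after the forecast has been made.
        cum_loss = [loss + (p - y) ** 2 for loss, p in zip(cum_loss, preds)]
    return forecasts


# Three constant experts and a deterministic bounded response sequence.
ys = [math.sin(t) for t in range(50)]
expert_preds = [[-1.0, 0.0, 1.0]] * 50
forecasts = exp_weights_forecaster(expert_preds, ys)
learner_loss = sum((p - y) ** 2 for p, y in zip(forecasts, ys))
best_loss = min(sum((expert_preds[t][k] - ys[t]) ** 2 for t in range(50)) for k in range(3))
print(learner_loss - best_loss, 8.0 * math.log(3))
```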

Example: Linear regression
Next, consider the problem of online linear regression in $\mathbb{R}^d$. Here $\mathcal{F}$ is the class of linear functions. For this problem we consider a slightly modified notion of regret:

This regret can be seen alternatively as regret if we assume that on rounds to Nature plays ,…, , where are the standard basis vectors, and that on these rounds the learner (knowing this) predicts , thus incurring zero loss over these initial rounds. Hence we can readily apply the schema for designing an algorithm for this problem.
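The Vovk-Azoury-Warmuth forecaster mentioned in the introduction can be sketched in pure Python. This is the textbook form of that forecaster, not a transcription of Corollary 14: the prediction on round t is the inner product of the covariate with the solution of a ridge-type system whose matrix already includes the current covariate. The sketch assumes unit regularization, which matches initial rounds on which the standard basis vectors are played (their outer products sum to the identity) with zero responses. The function name is hypothetical.

```python
from typing import List, Sequence


def vaw_forecasts(xs: Sequence[Sequence[float]], ys: Sequence[float]) -> List[float]:
    """Vovk-Azoury-Warmuth predictions with identity regularization (assumed)."""
    d = len(xs[0])
    # Inverse of A = I + sum of outer products x x^T, maintained incrementally.
    a_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d  # running sum of y_s * x_s over past rounds
    preds = []
    for x, y in zip(xs, ys):
        # Sherman-Morrison update: add the current covariate to A before predicting.
        ax = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * ax[i] for i in range(d))
        a_inv = [[a_inv[i][j] - ax[i] * ax[j] / denom for j in range(d)] for i in range(d)]
        w = [sum(a_inv[i][j] * b[j] for j in range(d)) for i in range(d)]
        preds.append(sum(w[i] * x[i] for i in range(d)))
        # The response is used only after the prediction has been recorded.
        b = [b[i] + y * x[i] for i in range(d)]
    return preds


print(vaw_forecasts([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0]))
```

Including the current covariate in the matrix before predicting is what distinguishes this forecaster from plain online ridge regression, which would use only the covariates of past rounds.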

Corollary 14.

For any , the following is an admissible relaxation