Mirror Descent Meets Fixed Share (and feels no regret)

02/15/2012 ∙ by Nicolò Cesa-Bianchi, et al. ∙ École Normale Supérieure ∙ Universitat Pompeu Fabra ∙ Università degli Studi di Milano

Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is achieved either by a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters.


1 Introduction

Online convex optimization is a sequential prediction paradigm in which, at each time step, the learner chooses an element from a fixed convex set and is then given access to a convex loss function defined on the same set. The value of the function at the chosen element is the learner's loss. Many problems, such as prediction with expert advice, sequential investment, and online regression/classification, can be viewed as special cases of this general framework. Online learning algorithms are designed to minimize the regret. The standard notion of regret is the difference between the learner's cumulative loss and the cumulative loss of the single best element in the set.

A much harder criterion to minimize is shifting regret, which is defined as the difference between the learner's cumulative loss and the cumulative loss of an arbitrary sequence of elements in the set. Shifting regret bounds are typically expressed in terms of the shift, a notion of regularity measuring the length of the trajectory described in the set by the comparison sequence (i.e., the sequence of elements against which the regret is evaluated). In online convex optimization, shifting regret bounds for convex subsets are obtained for the projected online mirror descent (or follow-the-regularized-leader) algorithm. In this case the shift is typically computed in terms of a norm of the difference of consecutive elements in the comparison sequence —see [1, 2] and [3].

We focus on the important special case when the convex set is the simplex. In [1], shifting bounds are shown for projected mirror descent with entropic regularizers, using the 1-norm to measure the shift. (Similar 1-norm shifting bounds can also be proven using the analysis of [2].) However, without using entropic regularizers it is not clear how to achieve a logarithmic dependence on the dimension, which is one of the advantages of working in the simplex. When the comparison sequence is restricted to the corners of the simplex (which is the setting of prediction with expert advice), the shift is naturally defined as the number of times the trajectory moves to a different corner. This problem is often called “tracking the best expert” —see, e.g., [4, 5, 1, 6, 7]— and it is well known that exponential weights with weight sharing, which corresponds to the fixed-share algorithm of [4], achieves a good shifting bound in this setting. In [6], the authors introduce a generalization of the fixed-share algorithm and prove various shifting bounds for any trajectory in the simplex. However, their bounds are expressed using a quantity that corresponds to a proper shift only for trajectories on the simplex corners.

In this paper we offer a unified analysis of mirror descent, fixed share, and the generalized fixed share of [6] for the setting of online convex optimization in the simplex. Our bounds are expressed in terms of a notion of shift based on the total variation distance. Our analysis relies on a generalized notion of shifting regret which includes, as special cases, related notions of regret such as adaptive regret, discounted regret, and regret with time-selection functions. Perhaps surprisingly, we show that projected mirror descent and fixed share achieve essentially the same generalized regret bound. Finally, we show that widespread techniques in online learning, such as improvements for small losses and adaptive tuning of parameters, are all easily captured by our analysis.

2 Preliminaries

For simplicity, we derive our results in the setting of online linear optimization. As we show in the supplementary material, these results can be easily extended to the more general setting of online convex optimization through a standard linearization step.

Online linear optimization may be cast as a repeated game between the forecaster and the environment as follows. We use $\Delta_d$ to denote the simplex $\bigl\{\boldsymbol{q} \in [0,1]^d : \|\boldsymbol{q}\|_1 = 1\bigr\}$.

Online linear optimization in the simplex. For each round $t = 1, 2, \ldots$:
1. The forecaster chooses $\boldsymbol{p}_t = (p_{t,1}, \ldots, p_{t,d}) \in \Delta_d$;
2. The environment chooses a loss vector $\boldsymbol{\ell}_t = (\ell_{t,1}, \ldots, \ell_{t,d}) \in [0,1]^d$;
3. The forecaster suffers loss $\boldsymbol{p}_t^\top \boldsymbol{\ell}_t$.

The goal of the forecaster is to minimize the accumulated loss, e.g., $\widehat{L}_T = \sum_{t=1}^T \boldsymbol{p}_t^\top \boldsymbol{\ell}_t$. In the now classical problem of prediction with expert advice, the goal of the forecaster is to compete with the best fixed component (often called “expert”) chosen in hindsight, that is, with $\min_{i=1,\ldots,d} \sum_{t=1}^T \ell_{t,i}$; or even to compete with a richer class of sequences of components. In Section 3 we state more specifically the goals considered in this paper.
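To make the protocol concrete, here is a minimal simulation sketch in Python; the uniform forecaster and the random environment are illustrative placeholders, not the algorithms studied below.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 100

def forecaster(t, history):
    # placeholder strategy: always play the uniform element of the simplex
    return np.full(d, 1.0 / d)

history, total_loss = [], 0.0
for t in range(T):
    p_t = forecaster(t, history)              # 1. forecaster chooses p_t in the simplex
    ell_t = rng.uniform(0.0, 1.0, size=d)     # 2. environment chooses a loss vector in [0,1]^d
    total_loss += p_t @ ell_t                 # 3. forecaster suffers loss p_t . ell_t
    history.append((p_t, ell_t))

cum_losses = sum(ell for _, ell in history)   # cumulative loss of each fixed component
print(total_loss - cum_losses.min())          # regret against the best fixed component
```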

We start by introducing our main algorithmic tool, described in Algorithm 1 below: a share algorithm whose formulation generalizes the seemingly unrelated formulations of the algorithms studied in [4, 1, 6]. It is parameterized by “mixing functions” $\psi_{t+1}$, for $t \geq 1$, that assign probabilities to past “pre-weights” as defined below. In all examples discussed in this paper, these mixing functions are quite simple, but working with such a general model makes the main ideas more transparent. We then provide a simple lemma that serves as the starting point for analyzing different instances of this generalized share algorithm. (We only deal with linear losses in this paper. However, it is straightforward that for sequences of $\eta$-exp-concave loss functions, the additional term $\eta/8$ in the bound is no longer needed.)

Parameters: learning rate $\eta > 0$ and mixing functions $\psi_{t+1}$ for $t \geq 1$
Initialization: $\boldsymbol{p}_1 = \boldsymbol{v}_1 = (1/d, \ldots, 1/d)$
For each round $t = 1, 2, \ldots$:
1. Predict $\boldsymbol{p}_t$;
2. Observe loss $\boldsymbol{\ell}_t \in [0,1]^d$;
3. [loss update] For each $i = 1, \ldots, d$ define the current pre-weights $v_{t+1,i} = \dfrac{p_{t,i}\,e^{-\eta\ell_{t,i}}}{\sum_{j=1}^d p_{t,j}\,e^{-\eta\ell_{t,j}}}$, and let $\boldsymbol{v}_1^{t+1} = (\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{t+1})$ denote the matrix of all past and current pre-weights;
4. [shared update] Define $\boldsymbol{p}_{t+1} = \psi_{t+1}\bigl(\boldsymbol{v}_1^{t+1}\bigr)$.
Algorithm 1 The generalized share algorithm.
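The following Python sketch implements Algorithm 1 under the notation reconstructed above; the exact signature of the mixing functions is our modeling choice (each one receives the full array of pre-weights, as in the description of the algorithm).

```python
import numpy as np

def generalized_share(losses, eta, mixing):
    """Run the generalized share algorithm on a (T, d) array of losses in [0, 1].

    mixing : callable psi mapping the array of all pre-weights v_1, ..., v_{t+1}
             (shape (t + 2, d)) to the next weight vector p_{t+1}.
    """
    T, d = losses.shape
    p = np.full(d, 1.0 / d)                 # initialization: uniform weights
    pre_weights = [p.copy()]                # v_1 = p_1 by convention
    predictions = []
    for t in range(T):
        predictions.append(p.copy())        # 1. predict p_t
        w = p * np.exp(-eta * losses[t])    # 2.-3. observe loss, loss update
        pre_weights.append(w / w.sum())     #      current pre-weights v_{t+1}
        p = mixing(np.array(pre_weights))   # 4. shared update p_{t+1} = psi(v_1^{t+1})
    return np.array(predictions)
```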
Lemma 1.

For all $t \geq 1$ and for all $\boldsymbol{q}_t \in \Delta_d$, Algorithm 1 satisfies

$$\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \boldsymbol{q}_t^\top\boldsymbol{\ell}_t \leq \frac{1}{\eta}\sum_{i=1}^d q_{t,i}\ln\frac{v_{t+1,i}}{p_{t,i}} + \frac{\eta}{8}.$$

Proof.

By Hoeffding’s inequality (see, e.g., [3, Section A.1.1]),

$$\frac{1}{\eta}\ln\sum_{i=1}^d p_{t,i}\,e^{-\eta\ell_{t,i}} \leq -\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t + \frac{\eta}{8}. \quad (1)$$

By definition of $v_{t+1,i}$, for all $i = 1, \ldots, d$ we then have $\frac{1}{\eta}\ln\frac{v_{t+1,i}}{p_{t,i}} = -\,\ell_{t,i} - \frac{1}{\eta}\ln\sum_{j=1}^d p_{t,j}\,e^{-\eta\ell_{t,j}}$, which implies $\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \ell_{t,i} \leq \frac{1}{\eta}\ln\frac{v_{t+1,i}}{p_{t,i}} + \frac{\eta}{8}$. The proof is concluded by taking a convex aggregation with respect to $\boldsymbol{q}_t$. ∎
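Since Lemma 1 (as reconstructed above) holds pointwise at every round, it can also be checked numerically; this sketch draws random weights, losses, and competitors and verifies the inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
d, eta = 6, 0.5
p = rng.dirichlet(np.ones(d))                # current weights p_t
ell = rng.uniform(size=d)                    # loss vector in [0, 1]^d
v = p * np.exp(-eta * ell)
v /= v.sum()                                 # pre-weights after the loss update
for _ in range(1000):
    q = rng.dirichlet(np.ones(d))            # arbitrary competitor in the simplex
    lhs = p @ ell - q @ ell
    rhs = (q @ np.log(v / p)) / eta + eta / 8
    assert lhs <= rhs + 1e-12                # Lemma 1
```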

3 A generalized shifting regret for the simplex

We now introduce a generalized notion of shifting regret which unifies and generalizes the notions of discounted regret (see [3, Section 2.11]), adaptive regret (see [8]), and shifting regret (see [2]). For a fixed horizon $T$, a sequence of discount factors $\beta_{t,T} \geq 0$ for $t = 1, \ldots, T$ assigns varying weights to the instantaneous losses suffered at each round. We compare the total loss of the forecaster with the loss of an arbitrary sequence of vectors $\boldsymbol{q}_1, \ldots, \boldsymbol{q}_T$ in the simplex $\Delta_d$. Our goal is to bound the regret

$$\sum_{t=1}^T \beta_{t,T}\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \sum_{t=1}^T \beta_{t,T}\,\boldsymbol{q}_t^\top\boldsymbol{\ell}_t$$

in terms of the “regularity” of the comparison sequence $\boldsymbol{q}_1, \ldots, \boldsymbol{q}_T$ and of the variations of the discounting weights $\beta_{t,T}$. By setting $\boldsymbol{u}_t = \beta_{t,T}\,\boldsymbol{q}_t$, we can rephrase the above regret as

$$\sum_{t=1}^T \|\boldsymbol{u}_t\|_1\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \sum_{t=1}^T \boldsymbol{u}_t^\top\boldsymbol{\ell}_t. \quad (2)$$

In the literature on tracking the best expert [4, 5, 1, 6], the regularity of the sequence $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T$ is measured as the number of times $\boldsymbol{u}_{t+1} \neq \boldsymbol{u}_t$. We introduce the following regularity measure

$$m(\boldsymbol{u}_1^T) = \sum_{t=2}^{T} D_{TV}(\boldsymbol{u}_t, \boldsymbol{u}_{t-1}), \quad (3)$$

where for $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}_+^d$, we define $D_{TV}(\boldsymbol{x}, \boldsymbol{y}) = \sum_{i \,:\, x_i > y_i} (x_i - y_i)$. Note that when $\boldsymbol{x}, \boldsymbol{y} \in \Delta_d$, we recover the total variation distance $D_{TV}(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|\boldsymbol{x} - \boldsymbol{y}\|_1$, while for general $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}_+^d$, the quantity $D_{TV}(\boldsymbol{x}, \boldsymbol{y})$ is not necessarily symmetric and is always bounded by $\|\boldsymbol{x} - \boldsymbol{y}\|_1$. The traditional shifting regret of [4, 5, 1, 6] is obtained from (2) when all $\boldsymbol{u}_t$ are such that $\|\boldsymbol{u}_t\|_1 = 1$.
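In code, the regularity measure (3) is a direct transcription of the reconstructed definitions; this sketch is reused in later checks.

```python
import numpy as np

def d_tv(x, y):
    """Generalized total variation D_TV(x, y): sum over i with x_i > y_i of (x_i - y_i),
    for nonnegative vectors; on the simplex it equals the usual total variation."""
    return np.maximum(x - y, 0.0).sum()

def regularity(us):
    """Shift measure m(u_1^T) = sum_{t=2}^T D_TV(u_t, u_{t-1}) of a comparison sequence."""
    return sum(d_tv(us[t], us[t - 1]) for t in range(1, len(us)))
```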

4 Projected update

The shifting variant of the EG algorithm analyzed in [1] is a special case of the generalized share algorithm, in which the function $\psi_{t+1}$ performs a projection of the pre-weights $\boldsymbol{v}_{t+1}$ onto the convex set $\tilde{\Delta}_{d,\gamma} = \bigl\{\boldsymbol{q} \in \Delta_d : q_i \geq \gamma/d \text{ for all } i\bigr\}$. Here $\gamma \in (0,1)$ is a fixed parameter. We can prove (using techniques similar to the ones shown in the next section—see the supplementary material) the following bound, which generalizes [1, Theorem 16].

Theorem 1.

For all $T \geq 1$, for all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors $\boldsymbol{\ell}_t \in [0,1]^d$, and for all $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T \in \mathbb{R}_+^d$, if Algorithm 1 is run with the above update, then

$$\sum_{t=1}^T \|\boldsymbol{u}_t\|_1\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \sum_{t=1}^T \boldsymbol{u}_t^\top\boldsymbol{\ell}_t \leq \frac{\|\boldsymbol{u}_1\|_1\ln d}{\eta} + \frac{m(\boldsymbol{u}_1^T)}{\eta}\ln\frac{d}{\gamma} + \frac{\eta}{8}\sum_{t=1}^T\|\boldsymbol{u}_t\|_1 + \gamma\sum_{t=1}^T\|\boldsymbol{u}_t\|_1. \quad (4)$$

This bound can be optimized by a proper tuning of the $\eta$ and $\gamma$ parameters. We show a similarly tuned (and slightly better) bound in Corollary 1.

5 Fixed-share update

Next, we consider a different instance of the generalized share algorithm corresponding to the update

$$p_{t+1,i} = \frac{\gamma}{d} + (1-\gamma)\,v_{t+1,i}, \qquad i = 1, \ldots, d. \quad (5)$$

Despite the seemingly different statements, this update in Algorithm 1 can be seen to lead exactly to the fixed-share algorithm of [4] for prediction with expert advice. We now show that this update delivers a bound on the regret almost equivalent to (though slightly better than) the one achieved by projection onto the subset $\tilde{\Delta}_{d,\gamma}$ of the simplex.
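In terms of the `generalized_share` sketch given in Section 2, the update (5) corresponds to the following mixing function; only the current pre-weights are used.

```python
def fixed_share_mixing(gamma, d):
    """Mixing function for update (5): p_{t+1,i} = gamma / d + (1 - gamma) v_{t+1,i}."""
    def psi(pre_weights):
        v = pre_weights[-1]                  # the current pre-weights v_{t+1}
        return gamma / d + (1.0 - gamma) * v
    return psi

# usage: generalized_share(losses, eta=0.5, mixing=fixed_share_mixing(0.05, d))
```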

Theorem 2.

With the above update, for all $T \geq 1$, for all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors $\boldsymbol{\ell}_t \in [0,1]^d$, and for all $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T \in \mathbb{R}_+^d$,

$$\sum_{t=1}^T \|\boldsymbol{u}_t\|_1\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \sum_{t=1}^T \boldsymbol{u}_t^\top\boldsymbol{\ell}_t \leq \frac{\|\boldsymbol{u}_1\|_1\ln d}{\eta} + \frac{m(\boldsymbol{u}_1^T)}{\eta}\ln\frac{d}{\gamma} + \frac{1}{\eta}\ln\frac{1}{1-\gamma}\sum_{t=1}^{T-1}\|\boldsymbol{u}_t\|_1 + \frac{\eta}{8}\sum_{t=1}^T\|\boldsymbol{u}_t\|_1.$$

Note that if we only consider vectors of the form $\boldsymbol{u}_t = \boldsymbol{e}_{i_t}$ (i.e., corners of the simplex), then $m(\boldsymbol{u}_1^T)$ corresponds to the number of times $i_{t+1} \neq i_t$ in the sequence $i_1, \ldots, i_T$. We thus recover [4, Theorem 1] and [6, Lemma 6] from the much more general Theorem 2.

The fixed-share forecaster does not need to “know” anything in advance about the norms $\|\boldsymbol{u}_t\|_1$ of the comparison sequence for the bound above to be valid. Of course, in order to minimize the obtained upper bound, the tuning parameters $\eta$ and $\gamma$ need to be optimized, and their values will then depend on the maximal values of $\|\boldsymbol{u}_1\|_1 + m(\boldsymbol{u}_1^T)$ and of $\sum_t \|\boldsymbol{u}_t\|_1$ for the sequences one wishes to compete against. This is illustrated in the following corollary, whose proof is omitted. Therein, $h(x) = -x\ln x - (1-x)\ln(1-x)$ denotes the binary entropy function for $x \in [0,1]$. We recall that $h(x) \leq x\ln(e/x)$ for $x \in (0,1]$, as can be seen by noting that $-(1-x)\ln(1-x) \leq x$.

Corollary 1.

Suppose Algorithm 1 is run with the update (5). Let $T \geq 2$ and $0 < m_0 \leq T-1$. For all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors $\boldsymbol{\ell}_t \in [0,1]^d$, and for all sequences $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T \in \mathbb{R}_+^d$ with $\|\boldsymbol{u}_t\|_1 \leq 1$ and $\|\boldsymbol{u}_1\|_1 + m(\boldsymbol{u}_1^T) \leq m_0$,

$$\sum_{t=1}^T \|\boldsymbol{u}_t\|_1\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \sum_{t=1}^T \boldsymbol{u}_t^\top\boldsymbol{\ell}_t \leq \sqrt{\frac{T}{2}\left(m_0\ln d + (T-1)\,h\!\left(\frac{m_0}{T-1}\right)\right)}$$

whenever $\eta$ and $\gamma$ are optimally chosen in terms of $T$ and $m_0$.

Proof of Theorem 2.

Applying Lemma 1 with $\boldsymbol{q}_t = \boldsymbol{u}_t/\|\boldsymbol{u}_t\|_1$, and multiplying by $\|\boldsymbol{u}_t\|_1$, we get for all $t \geq 1$ and all $\boldsymbol{u}_t \in \mathbb{R}_+^d$

$$\|\boldsymbol{u}_t\|_1\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \boldsymbol{u}_t^\top\boldsymbol{\ell}_t \leq \frac{1}{\eta}\sum_{i=1}^d u_{t,i}\ln\frac{v_{t+1,i}}{p_{t,i}} + \frac{\eta}{8}\,\|\boldsymbol{u}_t\|_1. \quad (6)$$

We now examine

$$\sum_{i=1}^d u_{t,i}\ln\frac{v_{t+1,i}}{p_{t,i}} = \sum_{i=1}^d u_{t,i}\ln\frac{1}{p_{t,i}} - \sum_{i=1}^d u_{t,i}\ln\frac{1}{v_{t+1,i}}. \quad (7)$$

For the first term on the right-hand side, we have, for $t \geq 2$,

$$\sum_{i=1}^d u_{t,i}\ln\frac{1}{p_{t,i}} - \sum_{i=1}^d u_{t-1,i}\ln\frac{1}{v_{t,i}} = \sum_{i=1}^d \bigl(u_{t,i} - u_{t-1,i}\bigr)\ln\frac{1}{p_{t,i}} + \sum_{i=1}^d u_{t-1,i}\ln\frac{v_{t,i}}{p_{t,i}}. \quad (8)$$

In view of the update (5), we have $1/p_{t,i} \leq d/\gamma$ and $v_{t,i}/p_{t,i} \leq 1/(1-\gamma)$. Substituting in (8), we get

$$\sum_{i=1}^d u_{t,i}\ln\frac{1}{p_{t,i}} - \sum_{i=1}^d u_{t-1,i}\ln\frac{1}{v_{t,i}} \leq D_{TV}(\boldsymbol{u}_t, \boldsymbol{u}_{t-1})\ln\frac{d}{\gamma} + \|\boldsymbol{u}_{t-1}\|_1\ln\frac{1}{1-\gamma}.$$

The sum of the second term in (7) telescopes. Substituting the obtained bounds in the first sum of the right-hand side in (7), and summing over $t = 2, \ldots, T$ (the term for $t = 1$ equals $\|\boldsymbol{u}_1\|_1\ln d$ since $\boldsymbol{p}_1$ is uniform, and the last telescoped term is nonpositive and can be dropped), leads to

$$\sum_{t=1}^T \sum_{i=1}^d u_{t,i}\ln\frac{v_{t+1,i}}{p_{t,i}} \leq \|\boldsymbol{u}_1\|_1\ln d + m(\boldsymbol{u}_1^T)\ln\frac{d}{\gamma} + \ln\frac{1}{1-\gamma}\sum_{t=1}^{T-1}\|\boldsymbol{u}_t\|_1. \quad (9)$$

We hence get the claimed bound from (6), which we use in particular for those $t$ with $\boldsymbol{u}_t \neq \boldsymbol{0}$ (for $\boldsymbol{u}_t = \boldsymbol{0}$ the inequality is trivial), summed over $t = 1, \ldots, T$, together with (9). ∎

6 Applications

We now show how our regret bounds can be specialized to obtain bounds on adaptive and discounted regret, and on regret with time-selection functions. We show regret bounds only for the specific instance of the generalized share algorithm using update (5); but the discussion below also holds up to minor modifications for the forecaster studied in Theorem 1.

Adaptive regret

was introduced by [8] and can be viewed as a variant of discounted regret where the monotonicity assumption is dropped. For $\tau \in \{1, \ldots, T\}$, the $\tau$-adaptive regret of a forecaster is defined by

$$R^{\tau\text{-adapt}}_T = \max_{\substack{1 \leq r \leq s \leq T \\ s - r + 1 \leq \tau}}\left(\sum_{t=r}^{s}\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \min_{\boldsymbol{q}\in\Delta_d}\sum_{t=r}^{s}\boldsymbol{q}^\top\boldsymbol{\ell}_t\right). \quad (10)$$

The fact that this is a special case of (2) clearly emerges from the proof of Corollary 2 below.

Adaptive regret is an alternative way to measure the performance of a forecaster against a changing environment. It is a straightforward observation that adaptive regret bounds also lead to shifting regret bounds (in terms of hard shifts). In this paper we note that these two notions of regret share an even tighter connection, as they can both be viewed as instances of the same parent notion of regret, namely the generalized shifting regret introduced in Section 3. The work [8] essentially considered the case of online convex optimization with exp-concave loss functions; in the case of general convex functions, it also mentioned that the greedy projection forecaster of [2] enjoys adaptive regret guarantees. This is obtained in much the same way as we obtain an adaptive regret bound for the fixed-share forecaster in the next result.

Corollary 2.

Suppose that Algorithm 1 is run with the shared update (5). Then for all $T \geq 2$, for all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors $\boldsymbol{\ell}_t \in [0,1]^d$, and for all $\tau \in \{1, \ldots, T\}$,

$$R^{\tau\text{-adapt}}_T \leq \sqrt{\frac{\tau}{2}\Bigl(\ln d + \ln\bigl(e(\tau+1)\bigr)\Bigr)}$$

whenever $\eta$ and $\gamma$ are chosen optimally (depending on $\tau$ and $d$).

As mentioned in [8], standard lower bounds on the regret show that the obtained bound is optimal up to logarithmic factors.

Proof.

For $1 \leq r \leq s \leq T$ and $\boldsymbol{q} \in \Delta_d$, the regret in the right-hand side of (10) equals the regret considered in Theorem 2 against the sequence $\boldsymbol{u}_1^T$ defined as $\boldsymbol{u}_t = \boldsymbol{q}$ for $t = r, \ldots, s$ and $\boldsymbol{u}_t = \boldsymbol{0}$ for the remaining $t$. When $r \geq 2$, this sequence is such that $\boldsymbol{u}_{r-1} = \boldsymbol{0}$ and $\boldsymbol{u}_r = \boldsymbol{q}$, so that $D_{TV}(\boldsymbol{u}_r, \boldsymbol{u}_{r-1}) = 1$, while $D_{TV}(\boldsymbol{u}_t, \boldsymbol{u}_{t-1}) = 0$ for all $t \neq r$. When $r = 1$, we have $\|\boldsymbol{u}_1\|_1 = 1$ and $m(\boldsymbol{u}_1^T) = 0$. In all cases, $\|\boldsymbol{u}_1\|_1 + m(\boldsymbol{u}_1^T) = 1$, that is, $m_0 = 1$ is a suitable choice. Specializing the bound of Theorem 2 with this choice and with $\gamma = 1/(\tau+1)$ gives the result. ∎
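The comparison sequence used in this proof is easy to materialize; the following sketch (reusing `regularity` from Section 3) checks that $\|\boldsymbol{u}_1\|_1 + m(\boldsymbol{u}_1^T) = 1$ for an interval strictly inside the horizon.

```python
import numpy as np

d, T, r, s = 4, 10, 3, 7                   # interval [r, s] with r >= 2
q = np.array([0.0, 1.0, 0.0, 0.0])         # any fixed element of the simplex
us = [q if r <= t <= s else np.zeros(d) for t in range(1, T + 1)]
m = regularity(us)                         # only D_TV(u_r, u_{r-1}) = 1 contributes
print(us[0].sum() + m)                     # ||u_1||_1 + m(u_1^T) = 1.0
```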

Discounted regret

was introduced in [3, Section 2.11] and is defined by

$$\max_{i=1,\ldots,d}\;\sum_{t=1}^{T}\beta_{t,T}\bigl(\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \ell_{t,i}\bigr). \quad (11)$$

The discount factors $\beta_{t,T}$ measure the relative importance of more recent losses to older losses. For instance, for a given horizon $T$, the discounts may be larger as $t$ is closer to $T$. On the contrary, in a game-theoretic setting, the earlier losses may matter more than the more recent ones (because of interest rates), in which case $\beta_{t,T}$ would be smaller as $t$ gets closer to $T$. We mostly consider below monotonic sequences of discounts (both non-decreasing and non-increasing). Up to a normalization, we assume that all discounts $\beta_{t,T}$ are in $[0,1]$. As shown in [3], a minimal requirement to get non-trivial bounds is that the sum of the discounts satisfies $\sum_{t\leq T}\beta_{t,T}\to\infty$ as $T\to\infty$.

A natural objective is to show that the quantity in (11) is $o\bigl(\sum_{t\leq T}\beta_{t,T}\bigr)$, for instance, by bounding it by something of the order of $\sqrt{T\ln(dT)}$. We claim that Corollary 1 does so, at least whenever the sequences $(\beta_{t,T})_{t\leq T}$ are monotonic for all $T$. To support this claim, we only need to show that $m_0 = 1$ is a suitable value to deal with (11). Indeed, for all $T \geq 1$ and for all $\boldsymbol{q} \in \Delta_d$, the measure of regularity involved in the corollary satisfies, for the sequence $\boldsymbol{u}_t = \beta_{t,T}\,\boldsymbol{q}$,

$$\|\boldsymbol{u}_1\|_1 + m(\boldsymbol{u}_1^T) = \beta_{1,T} + \sum_{t=2}^{T}\bigl(\beta_{t,T} - \beta_{t-1,T}\bigr)_+ = \max\{\beta_{1,T},\,\beta_{T,T}\} \leq 1,$$

where the second equality follows from the monotonicity assumption on the discounts.
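Under the reconstructed definitions, this telescoping computation can also be verified numerically; the sketch below reuses `regularity` from Section 3 and checks both monotonic cases.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 50
q = rng.dirichlet(np.ones(d))
betas = rng.uniform(size=T)
for beta in (np.sort(betas), np.sort(betas)[::-1]):   # non-decreasing, non-increasing
    us = [b * q for b in beta]                        # u_t = beta_{t,T} q
    lhs = us[0].sum() + regularity(us)                # ||u_1||_1 + m(u_1^T)
    assert np.isclose(lhs, max(beta[0], beta[-1]))    # = max(beta_1, beta_T) <= 1
```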

The values of the discounts $\beta_{t,T}$ for all $t$ and $T$ are usually known in advance. However, the horizon $T$ is not. Hence, a calibration issue may arise. The online tuning of the parameters $\eta_t$ and $\gamma_t$ shown in Section 7.3 entails a forecaster that can get discounted regret bounds of the order of $\sqrt{T\ln(dT)}$ for all $T \geq 1$. The fundamental reason for this is that the discounts only come into the definition of the fixed-share forecaster via their sums. In contrast, the forecaster discussed in [3, Section 2.11] weighs each instance $t$ directly with $\beta_{t,T}$ (i.e., in the very definition of the forecaster) and therefore enjoys no regret guarantees for horizons other than $T$ (neither before nor after $T$). Therein, the knowledge of the horizon $T$ is so crucial that it cannot be dealt with easily, not even with online calibration of the parameters or with a doubling trick. We insist that for the fixed-share forecaster, much flexibility is gained, as some of the discounts can change in a drastic manner from one round to the next. However, we must admit that the bound of [3, Section 2.11] is smaller than the one obtained above, as it is of the order of $\sqrt{\bigl(\sum_{t\leq T}\beta_{t,T}^2\bigr)\ln d}$, in contrast to our $\sqrt{T\ln(dT)}$ bound. Again, this improvement was made possible by the knowledge of the time horizon $T$.

As for the comparison with the setting of discounted losses of [9], we note that the latter can be cast as a special case of our setting (since the discounting weights therein take the special form $\beta_{t,T} = \alpha_{t+1}\cdots\alpha_T$ for some sequence $(\alpha_s)$ of positive numbers). In particular, the fixed-share forecaster can satisfy the bound stated in [9, Theorem 2], for instance, by using the online tuning techniques of Section 7.3. A final reference to mention is the setting of time-selection functions of [10, Section 6], which basically corresponds to knowing in advance the weights of the comparison sequence the forecaster will be evaluated against. We thus generalize their results as well.

7 Refinements and extensions

We now show that techniques for refining the standard online analysis can be easily applied to our framework. We focus on the following: improvement for small losses, sparse target sequences, and dynamic tuning of parameters. Not all of them were within reach of previous analyses.

7.1 Improvement for small losses

The regret bounds of the fixed-share forecaster can be significantly improved when the cumulative loss of the best sequence of experts is small. The next result improves on Corollary 1 whenever $L_0 \ll T$, where $L_0$ is a bound on the cumulative loss of the comparison sequence. For concreteness, we focus on the fixed-share update (5).

Corollary 3.

Suppose Algorithm 1 is run with the update (5). Let $T \geq 2$, $0 < m_0 \leq T-1$, and $L_0 \geq 0$. For all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors $\boldsymbol{\ell}_t \in [0,1]^d$, and for all sequences $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T \in \mathbb{R}_+^d$ with $\|\boldsymbol{u}_t\|_1 \leq 1$, $\|\boldsymbol{u}_1\|_1 + m(\boldsymbol{u}_1^T) \leq m_0$, and $\sum_{t=1}^T \boldsymbol{u}_t^\top\boldsymbol{\ell}_t \leq L_0$,

$$\sum_{t=1}^T \|\boldsymbol{u}_t\|_1\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \sum_{t=1}^T \boldsymbol{u}_t^\top\boldsymbol{\ell}_t \leq \sqrt{2 L_0 K} + K, \qquad \text{where } K = m_0\ln d + (T-1)\,h\!\left(\frac{m_0}{T-1}\right),$$

whenever $\eta$ and $\gamma$ are optimally chosen in terms of $T$, $m_0$, and $L_0$.

Here again, the parameters $\eta$ and $\gamma$ may be tuned online using the techniques shown in Section 7.3. The above refinement is obtained by mimicking the analysis of Hedge forecasters for small losses (see, e.g., [3, Section 2.4]). In particular, one should substitute Lemma 1 with the following lemma in the analysis carried out in Section 5; its proof follows from the mere replacement of Hoeffding's inequality by [3, Lemma A.3], which states that for all $\eta > 0$ and for all random variables $X$ taking values in $[0,1]$, one has $\ln\mathbb{E}\bigl[e^{-\eta X}\bigr] \leq (e^{-\eta}-1)\,\mathbb{E}[X]$.

Lemma 2.

For all $t \geq 1$ and for all $\boldsymbol{q}_t \in \Delta_d$, Algorithm 1 satisfies

$$\frac{1-e^{-\eta}}{\eta}\,\boldsymbol{p}_t^\top\boldsymbol{\ell}_t - \boldsymbol{q}_t^\top\boldsymbol{\ell}_t \leq \frac{1}{\eta}\sum_{i=1}^d q_{t,i}\ln\frac{v_{t+1,i}}{p_{t,i}}.$$
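Like Lemma 1, this inequality (as reconstructed) holds round by round and can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
d, eta = 6, 0.7
p = rng.dirichlet(np.ones(d))                # current weights p_t
ell = rng.uniform(size=d)                    # loss vector in [0, 1]^d
v = p * np.exp(-eta * ell)
v /= v.sum()                                 # pre-weights after the loss update
for _ in range(1000):
    q = rng.dirichlet(np.ones(d))
    lhs = (1 - np.exp(-eta)) / eta * (p @ ell) - q @ ell
    assert lhs <= (q @ np.log(v / p)) / eta + 1e-12   # Lemma 2
```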

7.2 Sparse target sequences

The work [6] introduced forecasters that are able to efficiently compete with the best sequence of experts among all those sequences that only switch a bounded number of times and also take a small number of different values. Such “sparse” sequences of experts appear naturally in many applications. In this section we show that their algorithms in fact work very well in comparison with a much larger class of sequences that are “regular”—that is, $m(\boldsymbol{u}_1^T)$, defined in (3), is small—and “sparse” in the sense that a quantity $n(\boldsymbol{u}_1^T)$, measuring the effective number of components used by the sequence, is small. Note that when $\boldsymbol{u}_t \in \Delta_d$ for all $t$, two interesting upper bounds can be provided. First, denoting the union of the supports of these convex combinations by $S \subseteq \{1, \ldots, d\}$, we have $n(\boldsymbol{u}_1^T) \leq |S|$, the cardinality of $S$. Also, $n(\boldsymbol{u}_1^T) \leq \bigl|\{\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T\}\bigr|$, the cardinality of the pool of convex combinations. Thus, $n(\boldsymbol{u}_1^T)$ generalizes the notion of sparsity of [6].

Here we consider a family of shared updates of the form

$$p_{t+1,i} = (1-\gamma)\,v_{t+1,i} + \gamma\,\frac{w_{t,i}}{Z_t}, \quad (12)$$

where the $w_{t,i}$ are nonnegative weights that may depend on past and current pre-weights and $Z_t = \sum_{j=1}^d w_{t,j}$ is a normalization constant. Shared updates of this form were proposed by [6, Sections 3 and 5.2]. Apart from generalizing the regret bounds of [6], we believe that the analysis given below is significantly simpler and more transparent. We are also able to slightly improve their original bounds.

We focus on choices of the weights $w_{t,i}$ that satisfy the following conditions: there exists a constant $C \geq 1$ such that for all $i = 1, \ldots, d$ and all $t \geq 1$,

(13)

The next result improves on Theorem 2 when $n(\boldsymbol{u}_1^T) \ll d$ and $m(\boldsymbol{u}_1^T)$ is small, that is, when the dimension (or number of experts) $d$ is large but the sequence $\boldsymbol{u}_1^T$ is sparse. Its proof can be found in the supplementary material; it is a variation on the proof of Theorem 2.

Theorem 3.

Suppose Algorithm 1 is run with the shared update (12), with weights satisfying the conditions (13). Then for all $T \geq 1$, for all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors, and for all sequences $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T \in \mathbb{R}_+^d$,

Corollaries 8 and 9 of [6] can now be generalized (and even improved); we do so—in the supplementary material—by showing two specific instances of the generic update (12) that satisfy (13).

7.3 Online tuning of the parameters

The forecasters studied above need their parameters $\eta$ and $\gamma$ to be tuned according to various quantities, including the time horizon $T$. We show here how the trick of [11], which lets these parameters vary over time, can be extended to our setting. For the sake of concreteness we focus on the fixed-share update, i.e., Algorithm 1 run with the update (5). We respectively replace steps 3 and 4 of its description by the loss and shared updates

$$v_{t+1,i} = \frac{p_{t,i}^{\,\eta_{t+1}/\eta_t}\,e^{-\eta_{t+1}\ell_{t,i}}}{\sum_{j=1}^d p_{t,j}^{\,\eta_{t+1}/\eta_t}\,e^{-\eta_{t+1}\ell_{t,j}}} \qquad\text{and}\qquad p_{t+1,i} = \frac{\gamma_{t+1}}{d} + (1-\gamma_{t+1})\,v_{t+1,i} \quad (14)$$

for all $t \geq 1$ and all $i = 1, \ldots, d$, where $(\eta_t)$ and $(\gamma_t)$ are two sequences of positive numbers, indexed by $t \geq 1$. We also conventionally define $\eta_1 = \eta_2$ (the value of $\eta_1$ is immaterial since $\boldsymbol{p}_1$ is uniform). Theorem 2 is then adapted in the following way (when $\eta_t \equiv \eta$ and $\gamma_t \equiv \gamma$, Theorem 2 is exactly recovered).
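The updates (14), as reconstructed above, translate into the following sketch; the tuning sequences in the trailing comment are typical anytime choices and are an assumption, not the paper's prescription.

```python
import numpy as np

def fixed_share_time_varying(losses, etas, gammas):
    """Fixed share with time-varying parameters, following the updates (14):
    the previous weights are raised to the power eta_{t+1}/eta_t before the
    usual exponential reweighting, then mixed with the uniform distribution.

    losses : (T, d) array with entries in [0, 1]
    etas, gammas : arrays of length T + 1 of positive parameters
    """
    T, d = losses.shape
    p = np.full(d, 1.0 / d)
    predictions = []
    for t in range(T):
        predictions.append(p.copy())
        w = p ** (etas[t + 1] / etas[t]) * np.exp(-etas[t + 1] * losses[t])
        v = w / w.sum()                                     # loss update of (14)
        p = gammas[t + 1] / d + (1.0 - gammas[t + 1]) * v   # shared update of (14)
    return np.array(predictions)

# e.g.: etas = np.sqrt(8 * np.log(d) / np.arange(1, T + 2))   (non-increasing)
#       gammas = 1.0 / np.arange(2, T + 3)                    (non-increasing)
```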

Theorem 4.

The forecaster based on the updates (14) is such that, whenever $\eta_{t+1} \leq \eta_t$ and $\gamma_{t+1} \leq \gamma_t$ for all $t \geq 1$, the following performance bound is achieved: for all $T \geq 1$, for all sequences $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$ of loss vectors, and for all $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T \in \mathbb{R}_+^d$,

Due to space constraints, we provide an illustration of this bound only in the supplementary material.

Acknowledgments

The authors acknowledge support from the French National Research Agency (ANR) under grant EXPLO/RA (“Exploration–exploitation for efficient resource allocation”) and from the PASCAL2 Network of Excellence under EC grant no. 506778.

References

  • [1] M. Herbster and M. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
  • [2] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, ICML 2003, 2003.
  • [3] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • [4] M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.
  • [5] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247–282, Jun. 1999.
  • [6] O. Bousquet and M.K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2002.
  • [7] A. György, T. Linder, and G. Lugosi. Tracking the best of many experts. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pages 204–216, Bertinoro, Italy, Jun. 2005. Springer.
  • [8] E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
  • [9] A. Chernov and F. Zhdanov. Prediction with expert advice under discounted loss. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, ALT 2010, pages 255–269. Springer, 2010.
  • [10] A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
  • [11] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.

Appendix A Online convex optimization on the simplex

By using a standard reduction, the results of the main body of the paper (for linear optimization on the simplex) can be applied to online convex optimization on the simplex. In this setting, at each step the forecaster chooses $\boldsymbol{p}_t \in \Delta_d$ and is then given access to a convex loss $\ell_t : \Delta_d \to [0,1]$. Now, using Algorithm 1 with the loss vector $\boldsymbol{\ell}_t = \nabla\ell_t(\boldsymbol{p}_t)$ given by a subgradient of $\ell_t$ leads to the desired bounds. Indeed, by the convexity of $\ell_t$, the regret at each time $t$ with respect to any vector $\boldsymbol{q} \in \Delta_d$ is then bounded as

$$\ell_t(\boldsymbol{p}_t) - \ell_t(\boldsymbol{q}) \leq \nabla\ell_t(\boldsymbol{p}_t)^\top(\boldsymbol{p}_t - \boldsymbol{q}) = \boldsymbol{\ell}_t^\top(\boldsymbol{p}_t - \boldsymbol{q}).$$
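As a sketch of this reduction, the loss vector fed to Algorithm 1 is simply a (sub)gradient of the convex loss at the current prediction; the particular convex loss below is an illustrative placeholder.

```python
import numpy as np

def convex_loss(p, y, target=0.3):
    """Illustrative convex loss of a mixture prediction p @ y (placeholder)."""
    return 0.5 * (p @ y - target) ** 2

def loss_vector(p, y, target=0.3):
    """Gradient of convex_loss with respect to p: the linearized loss fed to Algorithm 1."""
    return (p @ y - target) * y

# By convexity, convex_loss(p_t, y) - convex_loss(q, y) <= loss_vector(p_t, y) @ (p_t - q),
# so the linear regret bounds of the main body transfer to the convex losses
# (after rescaling the loss vector into [0, 1]^d if needed).
```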

Appendix B Proof of Theorem 3; application of the bound to two different updates

Proof.

The beginning and the end of the proof are similar to those of the proof of Theorem 2, as they do not depend on the specific weight update. In particular, inequalities (6) and (7) remain the same. The proof is modified after (8), which this time we upper bound using the first condition in (13),

(15)

By definition of the shared update (12), we have $p_{t,i} \geq (1-\gamma)\,v_{t,i}$ and $p_{t,i} \geq \gamma\,w_{t,i}/Z_t$. We then upper bound the quantity at hand in (15) by