Learning The Best Expert Efficiently

11/11/2019 ∙ by Daron Anderson, et al. ∙ Trinity College Dublin

We consider online learning problems where the aim is to achieve regret which is efficient in the sense that it is of the same order as the lowest regret amongst K experts. This is a substantially stronger requirement than achieving O(√n) or O(log n) regret with respect to the best expert, and standard algorithms are insufficient, even in easy cases where the regrets of the available actions are very different from one another. We show that a particular lazy form of the online subgradient algorithm can be used to achieve minimal regret in a number of "easy" regimes while retaining an O(√n) worst-case regret guarantee. We also show that for certain classes of problem, minimal regret strategies exist for some of the remaining "hard" regimes.


1 Introduction

We consider online convex optimisation in the efficient regret setting. By the efficient regret setting we mean that our task is to choose a sequence of actions such that the regret is of the same order as the lowest regret amongst K experts. So if, for example, the regret of the best expert is O(log n) then we want to actually achieve O(log n) regret. This is, of course, much stronger than the usual requirement of O(√n) or O(log n) regret with respect to the best expert.

Our interest is motivated by applications such as the following. Suppose a person has to make a choice each day, for example what time to leave for work in the morning. Each day the person can use their insight, e.g. gained from experience or information from friends, to propose a time. The person is subject to behavioural biases as well as limited time and effort. In addition, suppose a recommender system is available that each day proposes a time that comes with an O(√n) regret guarantee. Our task each day is to decide between these two proposed times (or perhaps a combination of them) in such a way that the recommender provides a “safety net”. That is, if the person’s proposed times have consistently lower regret than those proposed by the recommender then we want to achieve this lower regret. But if the person’s judgement is poor and the regret of their choices is greater than O(√n), then we want to fall back to the O(√n) regret of the recommender system.

Intuitively, there are two easy cases where we might reasonably hope to achieve efficient regret. The first is where the difference in the regrets of the two experts is, in some appropriate sense, large. For example, one expert has O(1) regret and the other Θ(√n) regret. Perhaps surprisingly, it is easy to come up with examples where standard online learning algorithms fail to achieve O(1) regret in this case. The second easy case is where both experts have similar regret, e.g. both have O(1) regret. Unfortunately, again it is easy to come up with examples where standard algorithms fail to achieve O(1) regret even in this case.

In this paper we show that a particular form of the online subgradient algorithm, namely the Biased Lazy Subgradient algorithm, can be used to achieve efficient regret in such easy cases while retaining an O(√n) worst-case regret guarantee. This is not the standard greedy form of the algorithm but rather a lazy subgradient method with varying step size. The remaining harder cases correspond to situations where there is no consistent ordering of the regrets of the two experts or where the difference in their regrets is O(√n) or less. We show that for certain classes of expert, efficient regret strategies also exist for some of these harder cases.

1.1 Related Work

There are two main strands of related work. The first, initiated by Cesa-Bianchi et al. (2007), seeks better regret bounds in the low loss and i.i.d. stochastic regimes via second-order regret inequalities. Cesa-Bianchi et al. (2007) derives two main types of second-order inequality. One is of the form R_n = O(√(L*_n log K)) (translating to the loss setting), where R_n denotes the regret after n steps, K is the number of experts and L*_n = min_k Σ_{t=1}^n ℓ_{k,t} is the cumulative loss of the best expert, ℓ_{k,t} being the loss incurred by taking the action of expert k at step t. Since L*_n ≪ n when the loss is small, this improves on earlier bounds in the low loss regime. The second type of inequality obtained is of the form R_n = O(√(Σ_{t=1}^n v_t log K)) (again translating to the loss setting and also ignoring minor terms), where v_t = Σ_k w_{k,t} ℓ²_{k,t} for the Prod algorithm and v_t = Σ_k w_{k,t}(ℓ_{k,t} − Σ_j w_{j,t} ℓ_{j,t})² for the Hedge algorithm with adaptive step size, where w_{k,t} is the weight assigned to expert k at step t. Gaillard et al. (2014) build upon this to obtain regret inequalities of the form R_n^k = O(√(Q_n^k log K)) where Q_n^k = Σ_{t=1}^n (ℓ_{k,t} − ℓ̂_t)², with ℓ̂_t the loss incurred by the algorithm at step t. Using these they also obtain bounds for the low loss regime and also for i.i.d. stochastic losses. Wintenberger (2017) and Koolen and van Erven (2015) take a different approach and obtain second-order inequalities by modifying the Hedge algorithm to include a second-order loss term. A similar idea is also used by van Erven and Koolen (2016).

The low loss regime is not the same as the efficient regret regime, hence results for the low loss regime are of limited help in the efficient regret setting of interest in the present paper. Second-order inequalities based on the deviation ℓ_{k,t} − ℓ̂_t, or similar, can be expected to yield strong bounds when an algorithm quickly settles on a single expert. Unfortunately, that leaves open the question of establishing the conditions under which such rapid convergence takes place, which, as we will see, turns out to be the key issue.

The second main strand of related work aims to construct so-called universal algorithms, or algorithms achieving the “best of both worlds”. That is, a single algorithm that simultaneously achieves good regret in both the adversarial and stochastic settings, removing the need for prior knowledge of the setting when choosing the algorithm. One strategy for achieving this is to start off using an algorithm suited to stochastic losses and then switch irreversibly to an adversarial algorithm if evidence accumulates that the stochastic assumption is false. The other main strategy is to use reversible switches, with the decision as to which algorithm (or combination of algorithms) is used being updated in an online fashion. One such strategy, the (A,B)-Prod algorithm introduced by Sani et al. (2014), is probably the closest approach in the literature to that considered in the present paper and is discussed in more detail in Section 6. Note that this work seeking universal algorithms by combining two specialised algorithms has perhaps been superseded by recent results showing that the Hedge and Subgradient algorithms with step size Θ(1/√t) are in fact universal in this sense (see Mourtada and Gaïffas (2019) and Anderson and Leith (2019), respectively).

A related line of work uses the fact that popular algorithms such as Hedge can achieve good regret if the step size is tuned to the setting of interest, e.g. a step size of Θ(1/t) yields O(log n) regret for strongly convex losses. The approach taken is therefore to try to learn the best step size in an online fashion. See, for example, Erven et al. (2011) and van Erven and Koolen (2016).

A third recent strand of related work addresses combining learning algorithms in the bandit setting. Agarwal et al. (2017) and Singla et al. (2018) consider combining time-varying experts with the aim of minimising regret with respect to the best constant action (referred to as “competing with the best expert”). The bandit setting aside, the setup is otherwise quite similar to that considered in the present paper. The approach adopted is to manipulate the time-varying experts by adjusting in an online fashion the loss feedback provided to each expert. Regret performance of the same order as the best expert is achieved, up to polylogarithmic factors, when the best expert has O(√n) regret, with weaker relative guarantees when the best expert has lower regret.

2 Preliminaries

We start with the usual online setup where at each step t = 1, 2, …, n we take action x_t ∈ X, where X ⊂ ℝ^m is convex, closed and bounded, then observe vector ℓ_t ∈ ℝ^m and suffer loss ⟨x_t, ℓ_t⟩. While we focus on linear losses, the extension to convex losses is immediate by the standard subgradient bounding method.

Now suppose that at step t we are restricted to choose amongst a set of actions z_t^k ∈ X, k = 1, …, K. For example, action z_t^1 may be proposed by a human and action z_t^2 by an optimisation algorithm. That is, we are restricted to choosing a meta-action a_t ∈ Δ, where Δ is the K-simplex, with meta-action a corresponding to action Σ_k a^(k) z_t^k, where a^(k) denotes the k’th element of vector a. Defining the matrix Z_t := [z_t^1 ⋯ z_t^K], then x_t = Z_t a_t and so the loss associated with meta-action a_t is ⟨Z_t a_t, ℓ_t⟩ = ⟨a_t, ℓ̃_t⟩ with ℓ̃_t := Z_t^⊤ ℓ_t the vector of per-expert losses. For simplicity we assume all ‖ℓ̃_t‖ ≤ 1, where ‖·‖ is the Euclidean norm. The methods here immediately generalise to when we have a uniform bound ‖ℓ̃_t‖ ≤ L by a simple rescaling.
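To make the setup concrete, the following sketch (in NumPy, with all variable names ours rather than the paper's) assembles the matrix Z_t from the experts' proposed actions and checks that the loss of a meta-action can be computed either in X or in the simplex:

```python
import numpy as np

# Sketch of the meta-action setup (variable names are ours, not the paper's).
# K experts each propose an action z_t^k in X; a meta-action a in the K-simplex
# mixes them via the matrix Z_t whose k'th column is z_t^k.

K, m = 2, 3                       # two experts, actions in R^3
rng = np.random.default_rng(0)

Z_t = rng.uniform(-1, 1, (m, K))  # columns: proposed actions z_t^1, z_t^2
ell_t = rng.uniform(-1, 1, m)     # loss vector revealed after acting

a = np.array([0.7, 0.3])          # a meta-action in the 2-simplex
x_t = Z_t @ a                     # the action actually played
ell_tilde = Z_t.T @ ell_t         # per-expert losses, l~_t = Z_t^T l_t

# The loss agrees whether computed in X or in the simplex:
# <Z_t a, l_t> = <a, Z_t^T l_t>
assert np.isclose(x_t @ ell_t, a @ ell_tilde)
```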

The regret of a sequence of actions x_1, …, x_n with respect to the best fixed action in X is R_n := Σ_{t=1}^n ⟨x_t − x*, ℓ_t⟩, where x* ∈ arg min_{x∈X} Σ_{t=1}^n ⟨x, ℓ_t⟩. Substituting x_t = Z_t a_t and ℓ̃_t = Z_t^⊤ ℓ_t we have

R_n = Σ_{t=1}^n ⟨a_t, ℓ̃_t⟩ − Σ_{t=1}^n ⟨x*, ℓ_t⟩.

We can also define the regret of a_1, …, a_n with respect to the best fixed meta-action in Δ, namely

R̃_n := Σ_{t=1}^n ⟨a_t − a*, ℓ̃_t⟩,

where a* ∈ arg min_{a∈Δ} Σ_{t=1}^n ⟨a, ℓ̃_t⟩. Since min_{a∈Δ} Σ_{t=1}^n ⟨a, ℓ̃_t⟩ is a linear programme, a* is an extreme point of the simplex. That is, a* = e_{k*} where k* ∈ arg min_k Σ_{t=1}^n ⟨z_t^k, ℓ_t⟩ and e_k denotes the unit vector with all elements zero apart from the k’th element which is equal to one.

Observe that in general R̃_n ≠ R_n. Indeed,

R_n = Σ_{t=1}^n ⟨a_t, ℓ̃_t⟩ − min_k Σ_{t=1}^n ⟨z_t^k, ℓ_t⟩ + min_k Σ_{t=1}^n ⟨z_t^k, ℓ_t⟩ − Σ_{t=1}^n ⟨x*, ℓ_t⟩ = R̃_n + min_k R_n^k,

where R_n^k := Σ_{t=1}^n ⟨z_t^k − x*, ℓ_t⟩ is the regret of the k’th expert and the equality follows from the fact that

min_k Σ_{t=1}^n ⟨z_t^k − x*, ℓ_t⟩ = min_k Σ_{t=1}^n ⟨z_t^k, ℓ_t⟩ − Σ_{t=1}^n ⟨x*, ℓ_t⟩

since Σ_{t=1}^n ⟨x*, ℓ_t⟩ is a constant that does not depend on k. Our interest is in selecting a sequence a_1, …, a_n such that R_n has order no greater than min_k R_n^k, i.e. such that R̃_n is O(min_k R_n^k). We refer to sequences with this property as having efficient regret, or in short as being efficient.
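The decomposition R_n = R̃_n + min_k R_n^k is easy to check numerically. The sketch below does so on a toy problem of our own construction (X the probability simplex in ℝ^m, fixed experts, an arbitrary meta-action sequence); it illustrates the identity and is not the paper's experiment:

```python
import numpy as np

# Numerical check of R_n = R~_n + min_k R_n^k on a toy problem where X is the
# probability simplex in R^m and the experts play fixed actions.

rng = np.random.default_rng(1)
n, m, K = 1000, 4, 2

losses = rng.uniform(-1, 1, (n, m))       # l_1, ..., l_n
Z = rng.dirichlet(np.ones(m), size=K).T   # columns z^1, z^2 in X (fixed experts)
a_seq = rng.dirichlet(np.ones(K), size=n) # an arbitrary meta-action sequence

alg_loss = np.einsum('tk,mk,tm->', a_seq, Z, losses)  # sum_t <Z a_t, l_t>
cum = losses.sum(axis=0)                  # sum_t l_t
best_fixed = cum.min()                    # linear objective: min over vertices of X
expert_cum = Z.T @ cum                    # cumulative loss of each expert

R_n = alg_loss - best_fixed               # regret w.r.t. best fixed action in X
R_tilde = alg_loss - expert_cum.min()     # regret w.r.t. best meta-action
R_experts = expert_cum - best_fixed       # per-expert regrets R_n^k

assert np.isclose(R_n, R_tilde + R_experts.min())
```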

Figure 1: Performance of the Hedge and Greedy Subgradient algorithms in Example 2 (panels (a)-(b): Hedge; panels (c)-(d): Greedy Subgradient).

Importantly, it is easy to verify that common online learning algorithms do not generate sequences with this property, as the following example illustrates.

Example 2. Suppose the loss vectors ℓ_t are such that the per-step losses oscillate in sign while the cumulative loss of the second expert exceeds that of the first by order √t at step t. Suppose also that we are to choose between fixed actions z^1 and z^2, so that R_n^1 = O(1) while R_n^2 = Θ(√n). Figures 1(a)-(b) show the regret when using the Hedge algorithm (a_{t+1}^{(k)} ∝ exp(−η_t L_t^k), with L_t the vector of cumulative expert losses and step size η_t ∝ 1/√t) and Figures 1(c)-(d) when using the Greedy Subgradient algorithm (a_{t+1} = P_Δ(a_t − η_t ℓ̃_t), where P_Δ denotes the Euclidean projection onto the simplex). Despite the simplicity of the choice to be made in this example it can be seen that the regret of both algorithms is Θ(√n), whereas min_k R_n^k = O(1). It can be verified that for both algorithms similar behaviour is observed with constant stepsize, and also with the Prod algorithm (w_{t+1}^{(k)} = w_t^{(k)}(1 − η ℓ̃_t^{(k)}), with a_{t+1} ∝ w_{t+1}).

The difficulty here arises because the algorithms do not settle on the best expert z^1, but rather oscillate about a mixture of the actions proposed by the two experts. Due to the sign oscillations of the loss, such a mixture is liable to have Θ(√n) regret rather than the desired O(1).
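This failure mode is easy to reproduce. The sketch below uses a stand-in loss sequence with the properties described above (sign oscillations plus a cumulative gap of order √t); it is not the paper's exact Example 2, and the step size η_t = 1/√t is an assumption:

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, y.size + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + lam, 0.0)

# Stand-in sequence (NOT the paper's exact example): expert 1 incurs loss 0,
# expert 2 incurs (-1)^t + 1/sqrt(t), so per-step losses oscillate in sign
# while the cumulative gap L_t^2 - L_t^1 grows like sqrt(t).
n = 100_000
a_h, a_g = np.full(2, 0.5), np.full(2, 0.5)  # Hedge / greedy subgradient iterates
L = np.zeros(2)                              # cumulative expert losses
reg_hedge = reg_greedy = 0.0
for t in range(1, n + 1):
    l = np.array([0.0, (-1.0) ** t + 1.0 / np.sqrt(t)])
    eta = 1.0 / np.sqrt(t)
    reg_hedge += a_h @ l    # regret vs expert 1, whose loss is identically 0
    reg_greedy += a_g @ l
    L += l
    a_h = np.exp(-eta * L); a_h = a_h / a_h.sum()  # Hedge weights
    a_g = proj_simplex(a_g - eta * l)              # greedy subgradient step

# Both regrets grow like sqrt(n), while the best expert's regret stays O(1).
print(reg_hedge / np.sqrt(n), reg_greedy / np.sqrt(n))
```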

3 Gap Property of the Lazy Subgradient Method

The lazy subgradient method selects a_{t+1} according to,

a_{t+1} = P_Δ(−η_t L_t)   (1)

for step size η_t > 0, where L_t := Σ_{s=1}^t ℓ̃_s is the vector of cumulative expert losses and P_Δ is the Euclidean projection onto the K-simplex Δ. Recently, (Anderson and Leith, 2019, Lemma 2) established the following property of the Euclidean projection.

Lemma 3 (Anderson and Leith (2019)). Suppose y ∈ ℝ^K has two coordinates with y_j ≥ y_i + 1. Then P_Δ(y) has i-coordinate zero.

Figure 2 illustrates Lemma 3 for K = 2 dimensions. Points lying in the region between the two normals are projected onto the interior of the simplex. All other points are projected onto the closest extreme point, e.g. the point shown in Figure 2. Lemma 3 characterises such points.

Figure 2: Illustrating Lemma 3 on the plane. The simplex is indicated by the solid line segment, and the normals to the two extreme points of the simplex are indicated by the dashed lines. Points lying above the upper normal or below the lower one are projected onto the corresponding extreme point, e.g. the projection of the point shown is the point (0,1). Points lying between the normals are projected onto the interior of the simplex.
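Lemma 3 is straightforward to check numerically. The sketch below projects random points onto the simplex and verifies that any coordinate beaten by some other coordinate by at least 1 (our reading of the lemma's threshold) receives zero mass:

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex (as above)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, y.size + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + lam, 0.0)

# If some coordinate of y exceeds y_i by at least 1, then coordinate i of the
# projection is zero (our reading of Lemma 3's threshold).
rng = np.random.default_rng(2)
for _ in range(10_000):
    y = rng.normal(scale=2.0, size=5)
    a = proj_simplex(y)
    for i in range(y.size):
        if np.any(y - y[i] >= 1.0):
            assert a[i] <= 1e-12
```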

Applying Lemma 3 to the lazy subgradient method (1) we immediately have the following result.

Lemma (Subgradient Gap). Let n_0 ∈ {1, …, n}. Suppose R_t^k − R_t^1 ≥ 1/η_t for all t ≥ n_0 and all k ≠ 1. That is, the gap between the regret of the best expert and the other experts is at least 1/η_t. Then the regret of the subgradient update (1) satisfies R̃_n ≤ 2n_0.

Proof. Begin by observing that

R_t^k − R_t^j = Σ_{s=1}^t ⟨z_s^k − z_s^j, ℓ_s⟩ = L_t^k − L_t^j

and so R_t^k − R_t^1 ≥ 1/η_t implies L_t^k − L_t^1 ≥ 1/η_t, where L_t^k := Σ_{s=1}^t ⟨z_s^k, ℓ_s⟩ is the cumulative loss incurred by the k’th expert. Without loss of generality let expert 1 be the best expert, since we can always permute the experts so that this holds. Observe that the hypothesis implies η_t(L_t^k − L_t^1) ≥ 1 for t ≥ n_0. Letting y_t be the vector −η_t L_t, then y_t^{(1)} − y_t^{(k)} ≥ 1 for every k ≠ 1. By Lemma 3 it follows that P_Δ(y_t) has k’th coordinate zero. Since by assumption this holds for all k ≠ 1 and all t ≥ n_0, only the first coordinate of a_{t+1} = P_Δ(y_t) is non-zero, i.e. action z_t^1 is applied for t ≥ max{n_0, 1} + 1, where we need to take the max of n_0 and 1 since the projection is only used to select a_t from step 2 onwards and the initial a_1 is arbitrary. The regret is then

R̃_n = Σ_{t=1}^{n_0} ⟨a_t − e_1, ℓ̃_t⟩ + Σ_{t=n_0+1}^n ⟨a_t − e_1, ℓ̃_t⟩ = Σ_{t=1}^{n_0} ⟨a_t − e_1, ℓ̃_t⟩.

Since a_t and e_1 lie in the simplex the last term is upper bounded by 2n_0. Note that we can easily tighten this bound to replace the O(n_0) term with an O(√n_0) one via the usual worst-case bound on the regret of the subgradient method over the first n_0 steps.

Revisiting Example 2 in light of the Subgradient Gap lemma, it can be verified that L_t^2 − L_t^1 ≥ √t and so the lemma holds with η_t = 1/√t and n_0 = 1. Hence, the subgradient update with step size η_t = 1/√t yields O(1) regret, i.e. regret of the same order as the regret of the best expert, as desired. See Figure 3.

Figure 3: Performance of the Lazy Subgradient algorithm in Example 2 (with step size η_t = 1/√t).
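For comparison with the earlier sketch, the snippet below runs the lazy update (1) on the same stand-in sequence; once the scaled gap η_t(L_t^2 − L_t^1) reaches 1, Lemma 3 forces the projection to put zero mass on expert 2 and the regret stops growing. Again, the sequence and step size are our assumptions:

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex (as above)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, y.size + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + lam, 0.0)

# Lazy subgradient (1) on the stand-in sequence from Section 2: once
# eta_t * (L_t^2 - L_t^1) >= 1 the iterate locks onto the best expert.
n = 100_000
a = np.full(2, 0.5)     # arbitrary initial meta-action
L = np.zeros(2)
reg_lazy = 0.0
for t in range(1, n + 1):
    l = np.array([0.0, (-1.0) ** t + 1.0 / np.sqrt(t)])
    reg_lazy += a @ l                  # regret vs expert 1 (loss identically 0)
    L += l
    a = proj_simplex(-L / np.sqrt(t))  # a_{t+1} = P(-eta_t L_t), eta_t = 1/sqrt(t)

print(reg_lazy)  # stays O(1), matching the best expert, unlike Figure 1
```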

More generally, the Subgradient Gap lemma defines a class of “easy” cases where the regret of the best expert is sufficiently distinct from that of the other experts, in the sense that the regrets differ by at least 1/η_t. For these easy cases the lazy subgradient method achieves efficient regret. Typically we need to choose the step size proportional to 1/√n in order to ensure good worst case performance, in which case we need the gap between regrets to be proportional to √n in order to apply the lemma.

Another “easy” case where we might reasonably expect a learning algorithm to have efficient regret is when all of the experts have similar regret. Unfortunately it is not hard to devise examples where the subgradient method (1) yields Θ(√n) regret even though the regrets of the individual experts are all O(1), as the following example illustrates.

Example 3. Suppose ⟨z_t^1, ℓ_t⟩ = (−1)^t, i.e. the sequence −1, 1, −1, 1, …, and ⟨z_t^2, ℓ_t⟩ = (−1)^{t+1}, i.e. the sequence 1, −1, 1, −1, …, and that the step size is η_t = 1/√t. Since |L_t^k| ≤ 1 for k = 1, 2 the regret of both experts is O(1). Figure 4(a) shows the regret when these experts are combined using the subgradient method. It can be seen that the regret grows as √n. Figure 4(b) plots a_t^{(1)} vs time. It can be seen that the action oscillates about the (1/2, 1/2) point. The difficulty arises because the sign differences between ⟨z_t^1, ℓ_t⟩ and ⟨z_t^2, ℓ_t⟩ mean that such oscillations can yield larger cumulative loss than any fixed combination of z^1 and z^2.

Figure 4: Example 3, where the individual experts have regret upper bounded by a constant but when combined using the subgradient method the resulting actions have Θ(√n) regret. Left hand plot shows the regret of the combined action taken by the subgradient method and also the regret of experts 1 and 2 (regret shown is with respect to expert 2, but the behaviour is the same with respect to expert 1, or any fixed combination of the two). Right hand plot shows the action taken by the subgradient method vs time.

4 Biased Lazy Subgradient Method

4.1 Learning the Best of Two Experts

It turns out that it is indeed possible to use the Lazy Subgradient method to achieve efficient regret both when the gap condition of the Subgradient Gap lemma holds and when the difference between the regrets of the available experts is small. However, this requires biasing the loss sequence to which the subgradient method is applied. We begin by considering the case of K = 2 experts and at step t selecting,

a_{t+1} = P_Δ(−ηL_t + 2δe_1)   (2)

where L_t := Σ_{s=1}^t ℓ̃_s, with step size η > 0 and bias parameter δ ≥ 0. From now on we fix the parameter δ. Observe that this is just the Lazy Subgradient update applied to a biased version of the sequence of vectors ℓ̃_1, …, ℓ̃_t. We have in mind selecting z_t^1 to be the action with the known worst-case guarantee and using it as the benchmark against which to compare z_t^2.

We can rewrite this update equivalently as,

a_{t+1}^{(2)} = P_I(1/2 − δ − η(L_t^2 − L_t^1)/2)   (3)

where I is the interval [0, 1] and δ is the bias. To see this observe that a = (1 − α, α) with α = a^{(2)}. Hence, minimising ‖a − (−ηL_t + 2δe_1)‖² over a ∈ Δ is equivalent to minimising (α − 1/2 + δ + η(L_t^2 − L_t^1)/2)² over α ∈ [0, 1] (expanding the square and dropping constant terms). Changing variables to α = a^{(2)} now yields (3).

When written in the form (3) it can be seen that a_{t+1}^{(2)} = 0 when η(L_t^2 − L_t^1) ≥ 1 − 2δ and a_{t+1}^{(2)} = 1 when η(L_t^2 − L_t^1) ≤ −(1 + 2δ). That is, we retain a gap property similar to that discussed in Section 3, with the gap now tunable by adjusting δ. Hence, update (3) continues to achieve efficient regret in the easy case where there is a large gap between the regrets of the available experts.

Secondly, when η(L_t^2 − L_t^1) is less than 2δ − 1 in magnitude we have that a_{t+1}^{(2)} is clipped to the origin, i.e. a_{t+1} = e_1. Hence, we can use δ to control the action taken when the difference in the regrets of the two experts is small. In particular, choosing δ slightly greater than 1/2 ensures a_{t+1}^{(2)} = 0 whenever |L_t^2 − L_t^1| ≤ (2δ − 1)/η, i.e. we default to use of expert 1 when the difference in regrets is small. Hence, unlike the original lazy subgradient update in Section 3, the biased update (3) also achieves efficient regret in the second easy case where the available experts have similar regrets.
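A sketch of the two-expert biased update in the interval form (3) is given below. The parameterisation (weight on expert 2 clipped to [0,1], bias δ shifting the unbiased midpoint 1/2) follows our reconstruction of (3) above, and the loss sequence in the usage example is invented for illustration:

```python
import numpy as np

# Biased lazy subgradient for K = 2 experts, interval form (3): the weight on
# expert 2 is a clipped affine function of the scaled cumulative loss
# difference, shifted by the bias delta. Parameter names and the demo losses
# are ours; delta > 1/2 makes expert 1 the default.

def biased_lazy_weights(expert_losses, eta, delta):
    """expert_losses: (n, 2) per-step losses; returns the weights on expert 2."""
    n = expert_losses.shape[0]
    alpha = np.empty(n)
    gap = 0.0                   # L_t^1 - L_t^2, cumulative loss difference
    for t in range(n):
        # a_{t+1}^{(2)} = P_[0,1](1/2 - delta - eta (L_t^2 - L_t^1) / 2)
        alpha[t] = np.clip(0.5 - delta + 0.5 * eta * gap, 0.0, 1.0)
        gap += expert_losses[t, 0] - expert_losses[t, 1]
    return alpha

# Usage: expert 1 acts as the safety net; expert 2 only takes over once its
# cumulative loss is lower by order 1/eta.
rng = np.random.default_rng(3)
n = 10_000
losses = rng.uniform(0.0, 1.0, (n, 2))
losses[:, 1] -= 0.05            # expert 2 is slightly better on average
eta = 1.0 / np.sqrt(n)
alpha = biased_lazy_weights(losses, eta, delta=0.5 + eta)
print(alpha[:3], alpha[-3:])    # starts at expert 1, migrates towards expert 2
```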

We formalise these observations in the following lemma.

Lemma (Equilibrium Points). Under update (3), when either δ ≥ 1/2 and L_t^2 ≥ L_t^1 for all t, or η(L_t^2 − L_t^1) ≥ 1 − 2δ for all t, then a_t = e_1 for all t ≥ 2. When η(L_t^2 − L_t^1) ≤ −(1 + 2δ) for all t then a_t = e_2 for all t ≥ 2.

We now establish the worst-case performance of update (3).

Lemma (FTL). Under update (3) we have for each a ∈ Δ the inequality

Σ_{t=1}^n ⟨a_{t+1} − a, ℓ̃_t⟩ ≤ (1 + 8δ)/(4η).

Proof. Observe that update (2), and hence update (3), can be written as

a_{t+1} = arg min_{a∈Δ} { Σ_{s=1}^t ⟨a, ℓ̃_s⟩ + R(a)/η },   (4)

where R(a) := ½‖a‖² − 2δa^{(1)}, i.e. the update is follow-the-regularised-leader with regulariser R. The standard be-the-leader induction applied to (4) gives, for each a ∈ Δ,

Σ_{t=1}^n ⟨a_{t+1}, ℓ̃_t⟩ ≤ Σ_{t=1}^n ⟨a, ℓ̃_t⟩ + (R(a) − min_{b∈Δ} R(b))/η.

Over the simplex, ½‖a‖² ranges over [¼, ½] and −2δa^{(1)} over [−2δ, 0], so that R(a) − min_{b∈Δ} R(b) ≤ ¼ + 2δ = (1 + 8δ)/4. Since the above holds for all a ∈ Δ it holds in particular for a = a*.

Lemma (Worst-case Regret). Under update (3) with η = 1/√n the regret satisfies R̃_n = O(√n).

Proof. Begin by observing that for any a ∈ Δ we have

Σ_{t=1}^n ⟨a_t − a, ℓ̃_t⟩ = Σ_{t=1}^n ⟨a_t − a_{t+1}, ℓ̃_t⟩ + Σ_{t=1}^n ⟨a_{t+1} − a, ℓ̃_t⟩.

The previous lemma says the second sum is at most (1 + 8δ)/(4η). For the first sum write ⟨a_t − a_{t+1}, ℓ̃_t⟩ ≤ ‖a_t − a_{t+1}‖ ‖ℓ̃_t‖ ≤ ‖a_t − a_{t+1}‖. For t ≥ 2, since the Euclidean projection is 1-Lipschitz, ‖a_t − a_{t+1}‖ = ‖P_Δ(−ηL_{t−1} + 2δe_1) − P_Δ(−ηL_t + 2δe_1)‖ ≤ η‖ℓ̃_t‖ ≤ η, while the t = 1 term is at most 2 since a_1 and a_2 lie in the simplex. Combining the two we have

R̃_n ≤ 2 + (n − 1)η + (1 + 8δ)/(4η).

Choosing η = 1/√n the right-hand side is O(√n).

Combining the above lemmas yields the following result.

Theorem (Biased Subgradient Efficiency). Using update (3) with η = 1/√n and δ = 1/2 + η, we have

  1. Distinct Experts. When L_t^2 − L_t^1 ≥ −2 for all t, or η(L_t^1 − L_t^2) ≥ 1 + 2δ for all t, then R̃_n = O(1).

  2. Similar Experts. When |L_t^2 − L_t^1| ≤ 2 for all t then R̃_n = O(1).

  3. Worst Case. Otherwise R̃_n = O(√n).

Proof. For the worst case we use the Worst-case Regret lemma. Observe that ‖ℓ̃_t‖ ≤ 1 and for η = 1/√n we have

2 + (n − 1)η + (1 + 8δ)/(4η) = O(√n).

Hence the worst case follows from the Worst-case Regret lemma. The “distinct” and “similar” expert cases now follow from application of the Equilibrium Points lemma, noting that in each case the update settles on a single expert whose cumulative loss is within an additive constant of that of the best expert.

Figure 5: Example 3, where the experts are now combined using the biased lazy subgradient method (3). Left hand plot shows the regret of the combined action with respect to expert 2 and the right hand plot shows the action taken vs time.

Revisiting Example 3, now using the Biased Lazy Subgradient method (3), Figure 5 plots the performance. This can be compared directly with Figure 4. It can be seen that, in line with the Biased Subgradient Efficiency theorem, the Biased Lazy Subgradient method settles quickly on expert 1 and achieves O(1) regret, in contrast to the Θ(√n) regret achieved by the Lazy Subgradient method.

4.2 Discussion

When combining experts with Θ(√n) regret, the Biased Subgradient Efficiency theorem says that the combined regret will remain O(√n). When combining experts where one has Θ(√n) regret and the other has regret less than this, e.g. O(log n) or O(1), then provided the distinct experts condition holds the combined regret will be the same order as the better expert. When combining experts with similar regret less than O(√n) the combined regret will remain less than O(√n), and when combining experts with O(1) regret the combined regret will remain O(1). Probably the main limitation highlighted by the theorem is that when one expert has O(1) regret and the other Θ(√n) regret, but neither the distinct nor the similar experts condition holds, then the combined expert may have Θ(√n) regret. This behaviour can actually happen, as illustrated by the following example.

Figure 6: Example 4.2, where combining experts with O(1) and Θ(√n) regret using the biased subgradient method yields Θ(√n) regret. Left hand plot shows the regret of the combined action and also the regret of experts 1 and 2 with respect to expert 2. Right hand plot shows the action taken vs time.

Example 4.2. Suppose the losses oscillate in sign while the cumulative loss of expert 1 exceeds that of expert 2 by order √t at step t, so that the regret of the first expert is Θ(√n) while that of the second expert is O(1). The gap is then too small for the distinct experts condition of the theorem yet too large for the similar experts condition. Figure 6(a) shows the regret when these experts are combined using the biased subgradient method (3) with η = 1/√n. It can be seen that the regret grows as √n. Figure 6(b) plots a_t^{(2)} vs time.

4.3 Combining Two Learning Algorithms

The Biased Subgradient Efficiency theorem applies to general loss sequences and requires a gap of order √n between the regrets of the two experts in order for the biased subgradient algorithm to achieve efficient regret. A natural question is whether there exist classes of losses for which we can significantly shrink, or even remove, this gap. With this in mind, one class of particular interest is where the experts z_t^1 and z_t^2 are generated by learning algorithms converging at different rates to the same optimum. In this case we expect the per-step loss difference |⟨z_t^2 − z_t^1, ℓ_t⟩| to be at most some small γ and we exploit this to distinguish between experts with regrets that differ by O(γ√n) rather than by O(√n).

The source of the gap requirement in the theorem is that the step size η must be O(1/√n) in order to ensure O(√n) worst-case regret, but consequently η(L_t^2 − L_t^1) converges to the origin when |L_t^2 − L_t^1| is less than order √n. As a result, in this case update (3) cannot distinguish between the experts. But when we know in advance that |L_t^2 − L_t^1| grows by no more than γ per step, then we can rescale η so that η(L_t^2 − L_t^1) differs between low regret experts. Of course any such rescaling must maintain the growth of the worst-case regret at no more than O(√n) in order to retain the worst case performance guarantee. We have the following,

Theorem (Rescaled Biased Subgradient Efficiency). Suppose |⟨z_t^2 − z_t^1, ℓ_t⟩| ≤ γ for all t, for some γ > 0. Using update (3) with η = 1/(γ√n) and δ = 1/2 + γη, we have

  1. Distinct Experts. When L_t^2 − L_t^1 ≥ −2γ for all t, or η(L_t^1 − L_t^2) ≥ 1 + 2δ for all t, then R̃_n = O(γ).

  2. Similar Experts. When |L_t^2 − L_t^1| ≤ 2γ for all t then R̃_n = O(γ).

  3. Worst Case. Otherwise R̃_n = O(γ√n).