Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

by Zeyuan Allen-Zhu et al.
Princeton University

Regret bounds in online learning compare the player's performance to L^*, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon T. The more refined concept of first-order regret bound replaces this with a scaling √(L^*), which may be much smaller than √(T). It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full information and multi-armed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.




1 Introduction

The contextual bandit problem is an influential extension of the classical multi-armed bandit. It can be described as follows. Let $K$ be the number of actions, $[N] = \{1, \dots, N\}$ a set of experts (or “policies”), $T$ the time horizon, and denote $[K] = \{1, \dots, K\}$. At each time step $t \in [T]$,

  • The player receives from each expert $m \in [N]$ an “advice” $\xi_t^m \in \Delta_{[K]}$, that is, a probability distribution over the arms.

  • Using this advice and the feedback from previous rounds, the player selects a probability distribution $p_t \in \Delta_{[K]}$.


  • The adversary selects a loss function $\ell_t : [K] \to [0, 1]$.


  • The player plays an action $a_t \in [K]$ drawn at random from $p_t$ (and independently of the past).

  • The player’s suffered loss is $\ell_t(a_t)$, which is also the only feedback the player receives about the loss function $\ell_t$.

The player’s performance at the end of the $T$ rounds is measured through the regret with respect to the best expert: $R_T := \sum_{t=1}^{T} \ell_t(a_t) - \min_{m \in [N]} \sum_{t=1}^{T} \langle \xi_t^m, \ell_t \rangle$.


A landmark result by Auer et al. (2002) is that a regret of order $\sqrt{TK\log N}$ is achievable in this setting. The general intuition captured by regret bounds is that the player’s performance matches the best expert’s performance up to a lower-order term. However, the aforementioned bound might fail to capture this intuition if the best expert’s cumulative loss $L^\star := \min_{m \in [N]} \sum_{t=1}^{T} \langle \xi_t^m, \ell_t \rangle$ is much smaller than $T$. It is thus natural to ask whether one could obtain a stronger guarantee in which $T$ is essentially replaced by $L^\star$. This question was posed as a COLT 2017 open problem by Agarwal et al. (2017). Such bounds are called first-order regret bounds, and they are known to be achievable with full information Auer et al. (2002), as well as in the multi-armed bandit setting Allenberg et al. (2006) (see also Foster et al. (2016) for a different proof) and the semi-bandit framework Neu (2015); Lykouris et al. (2017). Our main contribution is a new algorithm for the contextual bandit problem, which we call MYGA (see Section 2), and for which we prove the following first-order regret bound, thus resolving the open problem.
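To get a sense of the improvement, the following snippet compares a $\sqrt{TK\log N}$-type bound with its first-order counterpart. The specific values of $K$, $N$, $T$, and $L^\star$ below are ours, purely for illustration:

```python
import math

K, N, T = 10, 1000, 10**6    # arms, experts, horizon (illustrative values)
L_star = 10**3               # loss of the best expert, much smaller than T

zeroth_order = math.sqrt(T * K * math.log(N))       # sqrt(T K log N) bound
first_order  = math.sqrt(L_star * K * math.log(N))  # sqrt(L* K log N) bound

print(f"sqrt(T K log N)  ~ {zeroth_order:.0f}")
print(f"sqrt(L* K log N) ~ {first_order:.0f}")
# The first-order bound is smaller by a factor of sqrt(T / L*).
```

Here the gap is a factor of $\sqrt{T/L^\star} \approx 31.6$: the smaller the best expert's loss, the stronger the first-order guarantee.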

Theorem 1.1.

For any loss sequence taking values in $[0,1]$, MYGA run with the learning rate $\eta$ and the threshold parameter $\beta$ chosen appropriately as functions of $K$, $N$, and (an upper bound on) $L^\star$ satisfies the first-order regret bound $R_T = O\big(\sqrt{K L^\star \log N}\big)$, up to additive lower-order terms.

2 Algorithm Description

In this section we describe the MYGA algorithm.

2.1 Truncation

We introduce a truncation operator $\mathcal{T}_{k,b}$ that takes as input an index $k \in [K]$ and a threshold $b \in (0,1]$. Treating the first $k$ arms as “majority arms” and the last $K - k$ arms as “minority arms,” $\mathcal{T}_{k,b}$ redistributes “multiplicatively” the probability mass of all minority arms below the threshold $b$ to the majority arms.

Definition 2.1.

For $k \in [K]$ and $b \in (0,1]$, the truncation operator $\mathcal{T}_{k,b} : \Delta_{[K]} \to \Delta_{[K]}$ is defined as follows. Given any $p \in \Delta_{[K]}$, we set $q = \mathcal{T}_{k,b}(p)$ where

$$q_i = \begin{cases} 0 & \text{if } i > k \text{ and } p_i < b, \\ p_i & \text{if } i > k \text{ and } p_i \geq b, \\ c \cdot p_i & \text{if } i \leq k, \end{cases} \qquad \text{with } c = 1 + \frac{\sum_{j > k \,:\, p_j < b}\, p_j}{\sum_{j \leq k} p_j}.$$

Equivalently, one can define $q_i$ for the majority arms $i \leq k$ with the following implicit formula:

$$q_i = p_i \cdot \frac{1 - \sum_{j > k \,:\, p_j \geq b}\, p_j}{\sum_{j \leq k} p_j}.$$

To see this, it suffices to note that the amount of mass in the majority arms after truncation is given by $1 - \sum_{j > k \,:\, p_j \geq b}\, p_j$.

Example 2.2.

If $K = 2$ and $k = 1$, then $\mathcal{T}_{1,b}$ simply adds $p_2$ into $p_1$ if $p_2 < b$.

Example 2.3.

An example with $K = 4$, $k = 2$, and $b = 0.2$ is as follows: $\mathcal{T}_{2,\,0.2}\big((0.35,\, 0.35,\, 0.2,\, 0.1)\big) = (0.4,\, 0.4,\, 0.2,\, 0)$. The last minority arm falls below the threshold, and its mass is redistributed multiplicatively over the two majority arms.
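In code, the truncation operator of Definition 2.1 can be sketched in a few lines of Python (a minimal sketch of ours; the function name and argument order are not from the paper):

```python
def truncate(p, k, b):
    """Truncation operator of Definition 2.1: arms 1..k (indices 0..k-1)
    are majority arms, arms k+1..K are minority arms.  Every minority
    arm with probability below the threshold b is zeroed out, and the
    removed mass is redistributed multiplicatively (i.e. proportionally)
    over the majority arms."""
    K = len(p)
    q = list(p)
    removed = 0.0
    for i in range(k, K):                # minority arms
        if q[i] < b:
            removed += q[i]
            q[i] = 0.0
    scale = 1.0 + removed / sum(p[:k])   # multiplicative redistribution
    for i in range(k):                   # majority arms
        q[i] = p[i] * scale
    return q

# K = 4 arms, pivot k = 2, threshold b = 0.2: the last arm (probability
# 0.1 < 0.2) is truncated and its mass is spread over arms 1 and 2.
q = truncate([0.35, 0.35, 0.2, 0.1], 2, 0.2)   # ≈ [0.4, 0.4, 0.2, 0.0]
```

Note that the output is again a probability distribution: the mass removed from the minority arms reappears on the majority arms.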

2.2 Informal description

MYGA is parameterized by two parameters: a classical learning rate $\eta > 0$, and a thresholding parameter $\beta \in (0,1)$. Also let $B := \{\beta (1+\beta)^j : j \geq 0\} \cap (0, 1]$ be a geometric grid of thresholds.

At a high level, a key feature of MYGA is to introduce a set of auxiliary experts, one for each $b \in B$. More precisely, in each round $t$, after receiving the expert advice $\xi_t^1, \dots, \xi_t^N$, MYGA calculates a distribution $\xi_t^{e_b}$ for each auxiliary expert $e_b$, $b \in B$. Then, MYGA uses the standard exponential weight updates on the extended expert set $\mathcal{E} := [N] \cup \{e_b : b \in B\}$, with learning rate $\eta$, to calculate a weight function $w_t$—see (2.3). Then, it computes

  • $h_t$, the weighted average of the expert advice in $[N]$: $h_t := \sum_{m \in [N]} w_t(m)\, \xi_t^m \big/ \sum_{m \in [N]} w_t(m)$.

  • $p_t$, the weighted average of the expert advice in $\mathcal{E}$: $p_t := \sum_{e \in \mathcal{E}} w_t(e)\, \xi_t^e \big/ \sum_{e \in \mathcal{E}} w_t(e)$. This definition is implicit, since the advice of the auxiliary experts is itself a function of $p_t$; see (2.2).

Using this information, MYGA calculates the probability distribution $p_t$ from which the arm $a_t$ is played at round $t$.

Let us now explain how $p_t$ and the auxiliary advice $\xi_t^{e_b}$, $b \in B$, are defined. First we remark that in the contextual bandit setting the arm index has no real meaning, since in each round we can permute the arms by some permutation $\sigma_t$ of $[K]$ and permute the experts’ advice and the loss vector by the same $\sigma_t$. For this reason, throughout this paper, we shall assume

$$h_t(1) \geq h_t(2) \geq \dots \geq h_t(K).$$

Let us define the “pivot” index $k_t$. Then, in order to perform truncation, MYGA views the first $k_t$ arms as “majority arms” and the last $K - k_t$ arms as “minority arms” of the current round $t$. At a high level we will have:

  • the distribution to play from is $p_t$, as given by the fixed-point equation (2.2).

  • Each auxiliary expert $e_b$, $b \in B$, is defined by the advice $\xi_t^{e_b} = \mathcal{T}_{k_t, b}(p_t)$.

We now give a more precise description in Algorithm 1.

1: learning rate $\eta > 0$, threshold parameter $\beta \in (0,1)$
2: $w_1(e) \leftarrow 1$ for every expert $e$, and $B \leftarrow \{\beta (1+\beta)^j : j \geq 0\} \cap (0,1]$
3: for $t = 1$ to $T$ do
4:     receive advice $\xi_t^m \in \Delta_{[K]}$ from each expert $m \in [N]$
5:     $h_t \leftarrow \sum_{m \in [N]} w_t(m)\, \xi_t^m \big/ \sum_{m \in [N]} w_t(m)$   (weighted average)
6:     assume wlog. $h_t(1) \geq h_t(2) \geq \dots \geq h_t(K)$ by permuting the arms
7:     choose the pivot index $k_t$   (the first $k_t$ arms are majority arms)
8:     find $p_t$ such that (2.2) holds   (such a $p_t$ can be found efficiently, see Lemma 6.1)
9:     $\xi_t^{e_b}(a) \leftarrow \mathcal{T}_{k_t, b}(p_t)(a)$ for every $b \in B$ and $a \in [K]$
10:     draw an arm $a_t$ from the probability distribution $p_t$ and receive feedback $\ell_t(a_t)$
11:     compute the loss estimator $\tilde{\ell}_t(a) \leftarrow \frac{\ell_t(a_t)}{p_t(a_t)}\, \mathbf{1}\{a = a_t\}$ for every $a \in [K]$
12:     update the exponential weights for every expert $e$: $w_{t+1}(e) \leftarrow w_t(e)\, \exp\big(-\eta\, \langle \xi_t^e, \tilde{\ell}_t \rangle\big)$
13: end for
Algorithm 1 MYGA (Make the minoritY Great Again)
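To make the above concrete, here is a rough Python sketch of one round of a MYGA-style update. This is our best-effort reconstruction, not the authors' implementation: the fixed point of (2.2) is computed by plain iteration rather than the exact procedure of Lemma 6.1, and the pivot `k` and the threshold grid are taken as given inputs.

```python
import math
import random

def truncate(p, k, b):
    """Zero out every minority arm (index >= k) below threshold b and
    rescale the majority arms (indices < k) multiplicatively."""
    q = list(p)
    removed = sum(x for x in q[k:] if x < b)
    q[k:] = [x if x >= b else 0.0 for x in q[k:]]
    scale = 1.0 + removed / sum(q[:k])
    q[:k] = [x * scale for x in q[:k]]
    return q

def myga_round(advice, weights, eta, thresholds, k, loss_fn, rng):
    """One round of a MYGA-style update (our reconstruction).

    advice     -- list of N advice distributions over K arms, already
                  permuted so that the majority arms come first
    weights    -- dict: expert id -> exponential weight; original experts
                  are keyed 0..N-1, auxiliary experts ('aux', b)
    eta        -- learning rate;  thresholds -- the grid B
    k          -- pivot index;  loss_fn -- maps played arm to a loss in [0, 1]
    """
    K = len(advice[0])
    N = len(advice)
    W = sum(weights.values())
    # Fixed point of (2.2): p is the weighted average of the original
    # advice and of its own truncations.  Plain iteration as a stand-in
    # for the exact computation of Lemma 6.1.
    p = [1.0 / K] * K
    for _ in range(200):
        mix = [sum(weights[m] * advice[m][a] for m in range(N))
               for a in range(K)]
        for b in thresholds:
            tb = truncate(p, k, b)
            mix = [mix[a] + weights[('aux', b)] * tb[a] for a in range(K)]
        p = [x / W for x in mix]
    # Play an arm and form the importance-weighted loss estimator.
    arm = rng.choices(range(K), weights=p)[0]
    est = [0.0] * K
    est[arm] = loss_fn(arm) / p[arm]
    # Exponential-weights update for original and auxiliary experts.
    for m in range(N):
        loss = sum(advice[m][a] * est[a] for a in range(K))
        weights[m] *= math.exp(-eta * loss)
    for b in thresholds:
        tb = truncate(p, k, b)
        loss = sum(tb[a] * est[a] for a in range(K))
        weights[('aux', b)] *= math.exp(-eta * loss)
    return p, arm
```

Calling `myga_round` once per round, re-sorting the advice each time so that the majority arms come first, gives the overall loop of Algorithm 1.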

3 Preliminaries

Definition 3.1.

For analysis purposes, let us define the truncated loss $\bar{\ell}_t$, so that

We next derive two lemmas that will prove useful to isolate the properties of the truncation operator that are needed to obtain a first-order regret bound.

Lemma 3.2.

Let and assume that for all , for some universal constant , and that . Then one has


Using , , and , we have

The rest of the proof follows from the standard argument for bounding the regret of Exp4, see e.g. [Theorem 4.2, Bubeck and Cesa-Bianchi (2012)] (with the minor modification that the assumption above implies the required bound on the estimated losses). ∎

The next lemma is straightforward.

Lemma 3.3.

In addition to the assumptions in Lemma 3.2, assume that there exist numerical constants such that


Then one has

We now see that it suffices to show that MYGA satisfies the assumptions of Lemma 3.2 and Lemma 3.3 for appropriate choices of $\eta$ and $\beta$ (assume that $L^\star$ is known), in which case one obtains a bound of order $\sqrt{K L^\star \log N}$.

In fact the assumption of Lemma 3.2 will be easily verified, and the real difficulty will be to prove (3.2). We observe that the standard trick of thresholding the arms with probability below would yield (3.2) with the right hand side replaced by , and in turn this leads to a regret of order . Our goal is to improve over this naive argument.

4 Proof of the Two-Armed Case

The goal of this section is to explain how our MYGA algorithm arises naturally. To focus on the main ideas we restrict to the case $K = 2$. The complete formal proof of Theorem 1.1 is given in Section 5.

Recall we have assumed without loss of generality that $h_t(1) \geq h_t(2)$ for each round $t$. This implies $p_t(1) \geq 1/2$ because $p_t$ is a weighted average of $h_t$ (which satisfies $h_t(1) \geq 1/2$) and of truncations of $p_t$ (which place at least mass $p_t(1)$ on arm $1$). In this simple case, for $b \in (0,1]$, we abbreviate our truncation operator $\mathcal{T}_{1,b}$ as $\mathcal{T}_b$, and it acts as follows. Given $p = (p_1, p_2) \in \Delta_{[2]}$:

if $p_2 < b$ we have $\mathcal{T}_b(p) = (1, 0)$;  and if $p_2 \geq b$ we have $\mathcal{T}_b(p) = p$.

In particular, we have $\mathcal{T}_b(p)_1 \geq p_1$ and $\mathcal{T}_b(p)_2 \leq p_2$ for all $b$. We refer to arm $1$ as the majority arm and arm $2$ as the minority arm. We denote $L_1 := \sum_{t=1}^{T} \ell_t(1)$ as the loss of the majority arm and $L_2 := \sum_{t=1}^{T} \ell_t(2)$ as the loss of the minority arm.

Since $\mathbb{E}\big[\tilde{\ell}_t(1)\big] \leq \ell_t(1)$ and $\mathbb{E}\big[\tilde{\ell}_t(2)\big] \leq \ell_t(2)$, we have

$$\mathbb{E} \sum_{t=1}^{T} \sum_{a \in \{1,2\}} \tilde{\ell}_t(a) \;\leq\; L_1 + L_2. \tag{4.1}$$

Observe also that one always has $L_1 \leq 2 \sum_{t=1}^{T} \langle p_t, \ell_t \rangle$ (indeed $p_t(1) \geq 1/2$), and thus the whole game to prove (3.2) is to upper bound the minority’s loss $L_2$.
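For concreteness, the observation that the majority loss is controlled by the played loss is the following one-line calculation (our expansion of the “indeed” step, writing $L_1 = \sum_{t=1}^{T} \ell_t(1)$ for the majority arm's total loss and using $p_t(1) \geq 1/2$):

```latex
L_1 \;=\; \sum_{t=1}^{T} \ell_t(1)
    \;\le\; \sum_{t=1}^{T} 2\, p_t(1)\, \ell_t(1)
    \;\le\; 2 \sum_{t=1}^{T} \langle p_t, \ell_t \rangle .
```

The first inequality uses $2\,p_t(1) \geq 1$ and $\ell_t(1) \geq 0$; the second adds the nonnegative term $2\,p_t(2)\,\ell_t(2)$.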

4.1 When the minority suffers small loss

Assume that $L_2 \leq c\, L_1$ for some constant $c > 0$. Then, because $L_1 \leq 2 \sum_{t=1}^{T} \langle p_t, \ell_t \rangle$, one can directly obtain (3.2) from (4.1) with constants depending only on $c$. In words, when the minority arm has a total loss comparable to the majority arm, simply playing from $p_t$ would satisfy a first-order regret bound.

Our main idea is to somehow enforce this relation between the minority and majority losses by “truncating” probabilities appropriately. Indeed, recall that if after some truncation we have $p_t(2) = 0$, then the minority arm is not played at round $t$, so the minority loss can be improved.

4.2 Make the minority great again

Our key new insight is captured by the following lemma which is proved using an integral averaging argument.

Definition 4.1.

For each $b \in (0,1]$, let $L(b) := \sum_{t=1}^{T} \langle \mathcal{T}_b(p_t), \ell_t \rangle$ be the expected loss if the truncated strategy $\mathcal{T}_b(p_t)$ is played at each round $t$.

Lemma 4.2.

As long as ,

In words, if $L_2$ is large, then it must be that some $b$ was a much better threshold than another, that is, the corresponding gap between the expected losses $L(\cdot)$ is large.

Proof of Lemma 4.2.

For any , define the function

Let us pick $b^\star$ to minimize this function, breaking ties by choosing the smaller value of $b^\star$. We make several observations:

  • because for any with we must have .

  • .

  • because .

Let us define the points

 and  .

Note that the tie-breaking rule for the choice of ensures (if then it must satisfy giving a contradiction).

Using the identity


we calculate that

Since , and , we conclude that

Given Lemma 4.2, a very intuitive strategy starts to emerge. Suppose we can somehow get an upper bound of the form


Then, putting this into Lemma 4.2 and using , we have for any ,

In words, the minority arm also suffers a small loss (and thus is great again!). Putting this into (4.1), we immediately get (3.2) as desired and finish the proof of Theorem 1.1 in the case $K = 2$.

Thus, we are left with showing (4.3). The main idea is to add each truncated strategy $\mathcal{T}_b(p_t)$ as an additional auxiliary expert. If we can achieve this, then (4.3) can be obtained from the regret formula in Lemma 3.2.

4.3 Expanding the set of experts

Assume for a moment that we somehow expand the set of experts into $[N] \cup \{e_b : b \in (0,1]\}$ so that:

$$\xi_t^{e_b} \;=\; \mathcal{T}_b(p_t) \qquad \text{for all } b \in (0,1] \text{ and } t \in [T]. \tag{4.4}$$

Then clearly (4.3) would be satisfied using Lemma 3.2, (4.1), and the fact that the loss of an expert can be no better than the loss of the best expert.

There are two issues with condition (4.4): first, it is self-referential, in the sense that it assumes the auxiliary advice $\xi_t^{e_b}$ takes a certain form depending on $p_t$, while $p_t$ is itself defined via this advice (recall (2.2)); and second, it potentially requires an infinite number of experts (one for each $b \in (0,1]$).

Let us first deal with the second issue via discretization.

Lemma 4.3.

In the same setting as Lemma 4.2, there exists $b \in B$ such that


For let be the smallest element in . For any we can rewrite (4.2) as (note that )

where . Using the same proof of Lemma 4.2, and redefining

we get that there exists and such that

The rest of the proof now follows as in the proof of Lemma 4.2, except that we minimize over the finite grid $B$ instead of $(0,1]$. ∎

Thus, instead of (4.4), we only need to require


We now resolve the self-referentiality of (4.5) by defining $p_t$ and the auxiliary advice simultaneously, as follows. Consider the map $\varphi$ defined by:

It suffices to find a fixed point $x^\star$ of $\varphi$: indeed, setting

$p_t := x^\star$   and   $\xi_t^{e_b} := \mathcal{T}_b(p_t)$ for $b \in B$,

we have that both (4.5) holds and $p_t$ is the correct weighted average of the expert advice in $[N] \cup \{e_b : b \in B\}$.

Finally, $\varphi$ has a fixed point since it is a nondecreasing function from a closed interval to itself. It is also not hard to find such a point algorithmically.
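The existence claim can also be made constructive. The generic sketch below (ours, not from the paper) locates a fixed point of a continuous map from an interval into itself by bisecting on the sign of $\varphi(x) - x$:

```python
def find_fixed_point(phi, lo=0.0, hi=1.0, tol=1e-12):
    """Find a fixed point of a continuous map phi : [lo, hi] -> [lo, hi].
    Since phi(lo) >= lo and phi(hi) <= hi, bisecting on the sign of
    phi(x) - x always keeps a fixed point bracketed."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi(mid) >= mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Example: phi(x) = (1 + x**2) / 3 maps [0, 1] into itself and is
# nondecreasing there; its fixed point solves x**2 - 3*x + 1 = 0.
x = find_fixed_point(lambda x: (1 + x * x) / 3)
```

For a monotone but possibly discontinuous map, the same bracketing argument underlies the existence proof (a one-dimensional Knaster–Tarski argument), and the paper's exact computation (Lemma 6.1) plays the role of this numerical search.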

This concludes the (slightly informal) proof for $K = 2$. We give the complete proof for arbitrary $K$ in the next section.

5 Proof of Theorem 1.1

In this section, we assume $p_t$ satisfies (2.2) and we defer the constructive proof of finding such a $p_t$ to Section 6. Recall the arm index has no real meaning, so without loss of generality we have permuted the arms so that

$$h_t(1) \geq h_t(2) \geq \dots \geq h_t(K).$$

We refer to $M_t := \{1, \dots, k_t\}$ as the set of majority arms and $S_t := \{k_t + 1, \dots, K\}$ as the set of minority arms at round $t$.¹ We let $L_{\mathrm{maj}}$ and $L_{\mathrm{min}}$ respectively be the total loss of the majority and minority arms. We again have

$$\mathbb{E} \sum_{t=1}^{T} \sum_{a \in [K]} \tilde{\ell}_t(a) \;\leq\; L_{\mathrm{maj}} + L_{\mathrm{min}}.$$

¹We stress that in the $K$-armed setting, although $k_t$ is chosen as a minimum index with respect to $h_t$, it may not be the minimum index with the analogous property with respect to $p_t$.
Thus, the whole game to prove (3.2) is to upper bound the majority loss and the minority loss.

5.1 Useful properties

We state a few properties about $p_t$ and its truncations.

Lemma 5.1.

In each round $t$, if $p_t$ satisfies (2.2), then


Let and . By (2.1) and since one has

Moreover $p_t$ is a mixture of $h_t$ and truncated versions of $p_t$, so similarly using (2.1) one has

Putting the two above displays together concludes the proof. ∎

Lemma 5.2.

In each round $t$, if $p_t$ satisfies (2.2), then

  • for every minority arm it satisfies , and

  • for every majority arm it satisfies .


For ease of notation, we drop the index $t$ in this proof.

  • For every minority arm , every , we have according to Definition 2.1. Therefore, we must have .

  • For every majority arm , we have (using Lemma 5.1)

    From the definition of , we can also conclude . This is because . ∎

The next lemma shows that the resulting $p_t$ satisfies the assumption of Lemma 3.2.

Lemma 5.3.

If $p_t$ satisfies (2.2), then for every arm $a \in [K]$:


For ease of notation, we drop the index $t$ in this proof.

By Definition 2.1 and Lemma 5.2, we have for every $a \in [K]$:

The other statement follows because whenever , Definition 2.1 says it must satisfy . ∎

5.2 Bounding the majority and minority losses

We first upper bound the majority loss and then upper bound the minority loss.

Lemma 5.4.

If satisfies (2.2), then .


Using Lemma 5.2 we have for any . Also, for every satisfying (owing to Definition 3.1 and Lemma 5.3). Therefore,