1 Introduction
The contextual bandit problem is an influential extension of the classical multi-armed bandit. It can be described as follows. Let $K$ be the number of actions, $E$ a set of experts (or “policies”), $T$ the time horizon, and denote $\Delta_K = \{x \in [0,1]^K : \sum_{i=1}^K x(i) = 1\}$. At each time step $t = 1, \dots, T$,

The player receives from each expert $e \in E$ an “advice” $\xi_t^e \in \Delta_K$.

Using advices and previous feedbacks, the player selects a probability distribution $p_t \in \Delta_K$.

The adversary selects a loss function $\ell_t : [K] \to [0,1]$, where $[K] = \{1, \dots, K\}$.

The player plays an action $a_t \in [K]$ at random from $p_t$ (and independently of the past).

The player’s suffered loss is $\ell_t(a_t)$, which is also the only feedback the player receives about the loss function $\ell_t$.
The player’s performance at the end of the $T$ rounds is measured through the regret with respect to the best expert:
$$R_T = \mathbb{E} \sum_{t=1}^T \ell_t(a_t) - \min_{e \in E} \mathbb{E} \sum_{t=1}^T \langle \xi_t^e, \ell_t \rangle. \tag{1.1}$$
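To make the protocol concrete, here is a minimal Python sketch of the interaction loop, played by an Exp4-style exponential-weights baseline in the spirit of Auer et al. (2002). It is not MYGA. The function name `exp4_regret`, the array layout, and the choice of learning rate are illustrative assumptions, and the loss sequence is assumed to be fixed in advance.

```python
import numpy as np

def exp4_regret(advice, losses, eta, seed=0):
    """Run an Exp4-style player on a fixed (oblivious) loss sequence.

    advice: array of shape (T, N, K); advice[t, e] is expert e's distribution over K arms.
    losses: array of shape (T, K) with entries in [0, 1].
    Returns (player's realized total loss, best expert's total expected loss).
    """
    rng = np.random.default_rng(seed)
    T, N, K = advice.shape
    log_w = np.zeros(N)                      # log-weights, one per expert
    player_loss = 0.0
    for t in range(T):
        w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
        w /= w.sum()
        p = w @ advice[t]                    # mixture of the experts' advice
        p /= p.sum()                         # guard against rounding drift
        a = rng.choice(K, p=p)               # play a random action from p
        player_loss += losses[t, a]          # bandit feedback: only this entry is observed
        est = np.zeros(K)
        est[a] = losses[t, a] / p[a]         # importance-weighted loss estimate
        log_w -= eta * (advice[t] @ est)     # exponential-weights update on the experts
    best = np.einsum("tnk,tk->n", advice, losses).min()
    return player_loss, best
```

The difference between the two returned values is one realization of the quantity whose expectation is the regret in (1.1).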
A landmark result by Auer et al. (2002) is that a regret of order $O(\sqrt{TK \log |E|})$ is achievable in this setting. The general intuition captured by regret bounds is that the player’s performance is equal to the best expert’s performance up to a term of lower order. However, the aforementioned bound might fail to capture this intuition if $T \gg L_T^* := \min_{e \in E} \mathbb{E} \sum_{t=1}^T \langle \xi_t^e, \ell_t \rangle$. It is thus natural to ask whether one could obtain a stronger guarantee where $T$ is essentially replaced by $L_T^*$. This question was posed as a COLT 2017 open problem Agarwal et al. (2017). Such bounds are called first-order regret bounds, and they are known to be possible with full information Auer et al. (2002), as well as in the multi-armed bandit setting Allenberg et al. (2006) (see also Foster et al. (2016) for a different proof) and the semi-bandit framework Neu (2015); Lykouris et al. (2017). Our main contribution is a new algorithm for contextual bandits, which we call MYGA (see Section 2), and for which we prove the following first-order regret bound, thus resolving the open problem.
Theorem 1.1.
For any loss sequence such that one has that MYGA with and satisfies
2 Algorithm Description
In this section we describe the MYGA algorithm.
2.1 Truncation
We introduce a truncation operator $T_\gamma^s$ that takes as input an index $s \in [K]$ and a threshold $\gamma > 0$. Then, treating the first $s$ arms as “majority arms” and the last $K - s$ arms as “minority arms,” $T_\gamma^s$ redistributes “multiplicatively” the probability mass of all minority arms below threshold $\gamma$ to the majority arms.
Definition 2.1.
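For $s \in [K]$ and $\gamma > 0$, the truncation operator $T_\gamma^s : \Delta_K \to \Delta_K$ is defined as follows. Given any $q \in \Delta_K$, we set
$$T_\gamma^s q(i) = \begin{cases} 0, & i > s \text{ and } q(i) \le \gamma, \\ q(i), & i > s \text{ and } q(i) > \gamma, \\ q(i) \left( 1 + \dfrac{\sum_{j > s : q(j) \le \gamma} q(j)}{\sum_{j \le s} q(j)} \right), & i \le s. \end{cases}$$
Equivalently one can define $T_\gamma^s q(i)$ for the majority arms $i \le s$ with the following implicit formula:
$$T_\gamma^s q(i) = \frac{q(i)}{\sum_{j \le s} q(j)} \left( 1 - \sum_{j > s} T_\gamma^s q(j) \right). \tag{2.1}$$
To see this it suffices to note that the amount of mass in the majority arms is given by
$$\sum_{i \le s} T_\gamma^s q(i) = 1 - \sum_{j > s} T_\gamma^s q(j).$$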
Example 2.2.
If $K = 2$ and $s = 1$, then $T_\gamma^1$ simply adds $q(2)$ into $q(1)$ if $q(2) \le \gamma$.
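The truncation operator of Definition 2.1 can be sketched in a few lines of Python. The sketch follows the verbal description in Section 2.1: minority arms at or below the threshold are zeroed out and their mass is spread over the majority arms in proportion to the majority arms' current mass. The function name `truncate`, the 1-based convention for `s`, and the non-strict comparison with the threshold are assumptions made for illustration.

```python
import numpy as np

def truncate(q, s, gamma):
    """Move the mass of small minority arms onto the majority arms.

    q: probability vector over K arms; arms 1..s (1-based) are the majority arms.
    gamma: threshold; minority arms with q(i) <= gamma are zeroed out (assumed non-strict).
    """
    q = np.asarray(q, dtype=float)
    K = q.size
    if not 1 <= s <= K:
        raise ValueError("s must lie in {1, ..., K}")
    majority_mass = q[:s].sum()
    if majority_mass <= 0:
        raise ValueError("majority arms must carry positive mass")
    small = np.zeros(K, dtype=bool)
    small[s:] = q[s:] <= gamma               # minority arms at or below the threshold
    moved = q[small].sum()                   # total mass to redistribute
    out = q.copy()
    out[small] = 0.0
    out[:s] *= 1.0 + moved / majority_mass   # multiplicative rescaling of majority arms
    return out
```

For instance, with `q = [0.5, 0.3, 0.15, 0.05]`, `s = 2` and `gamma = 0.1`, only the last arm is truncated and the two majority arms are scaled by the common factor `1 + 0.05 / 0.8`, so the output still sums to one.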
Example 2.3.
An example with and is as follows:
2.2 Informal description
MYGA is parameterized by two parameters: a classical learning rate , and a thresholding parameter . Also let
At a high level, a key feature of MYGA is to introduce a set of auxiliary experts, one for each . More precisely, in each round , after receiving expert advices , MYGA calculates a distribution for each . Then, MYGA uses the standard exponential weight updates on with learning rate , to calculate a weight function —see (2.3). Then, it computes

, the weighted average of expert advices in : .

, the weighted average of expert advices in : .
Using this information, MYGA calculates the probability distribution $p_t \in \Delta_K$ from which the arm is played at round $t$.
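Let us now explain how the played distribution and the auxiliary experts are defined. First we remark that in the contextual bandit setting, the arm index has no real meaning, since in each round $t$ we can permute the arms by some permutation $\pi_t$ of $[K]$ and permute the experts’ advices and the loss vector by the same $\pi_t$. For this reason, throughout this paper, we shall assume that the arms are sorted in each round so that $\xi_t(1) \ge \xi_t(2) \ge \dots \ge \xi_t(K)$, where $\xi_t$ denotes the weighted average of the advices of the original experts.
Let us define the “pivot” index $s_t = \min\{s \in [K] : \sum_{i \le s} \xi_t(i) \ge 1/2\}$. Then, in order to perform truncation, MYGA views the first $s_t$ arms as “majority arms” and the last $K - s_t$ arms as “minority arms” of the current round $t$. At a high level we will have: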

the distribution to play from is .

Each auxiliary expert is defined by .
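The relabeling of arms and the choice of pivot can be sketched as follows. The defining formula of the pivot is not legible in this copy of the text, so the sketch assumes that arms are sorted by decreasing averaged advice and that the pivot is the smallest prefix of sorted arms holding at least half of the mass, which is consistent with the two-armed discussion in Section 4. The function name `sort_and_pivot` is hypothetical.

```python
import numpy as np

def sort_and_pivot(xi):
    """Return (order, s): arms sorted by decreasing mass and the 1-based pivot index.

    Assumption: the pivot is the smallest s such that the first s sorted arms
    carry total mass at least 1/2.
    """
    xi = np.asarray(xi, dtype=float)
    order = np.argsort(-xi, kind="stable")   # permutation putting heavy arms first
    prefix = np.cumsum(xi[order])            # mass of the first 1, 2, ..., K sorted arms
    s = int(np.argmax(prefix >= 0.5)) + 1    # first prefix reaching half of the mass
    return order, s
```

Under this convention the sorted arms `order[:s]` play the role of majority arms and `order[s:]` the role of minority arms. With two arms the pivot is always 1, because the heavier arm alone carries at least half of the mass.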
We now give a more precise description in Algorithm 1.
(2.3) 
3 Preliminaries
Definition 3.1.
For analysis purpose, let us define the truncated loss , so that
We next derive two lemmas that will prove useful to isolate the properties of the truncation operator that are needed to obtain a first-order regret bound.
Lemma 3.2.
Let and assume that for all , for some universal constant , and that . Then one has
(3.1) 
Proof.
Using , , and , we have
The rest of the proof follows from the standard argument to bound the regret of Exp4, see e.g., [Theorem 4.2, Bubeck and Cesa-Bianchi (2012)] (with the minor modification that the assumption on implies that ). ∎
The next lemma is straightforward.
Lemma 3.3.
In addition to the assumptions in Lemma 3.2, assume that there exists some numerical constants such that
(3.2) 
Then one has
We now see that it suffices to show that MYGA satisfies the assumptions of Lemma 3.2 and Lemma 3.3 for , and (assume that is known), in which case one obtains a bound of order .
In fact the assumption of Lemma 3.2 will be easily verified, and the real difficulty will be to prove (3.2). We observe that the standard trick of thresholding the arms with probability below would yield (3.2) with the right hand side replaced by , and in turn this leads to a regret of order . Our goal is to improve over this naive argument.
4 Proof of the 2-Armed Case
The goal of this section is to explain how our MYGA algorithm arises naturally. To focus on the main ideas we restrict to the case $K = 2$. The complete formal proof of Theorem 1.1 is given in Section 5.
Recall we have assumed without loss of generality that for each round . This implies because . In this simple case, for , we abbreviate our truncation operator as , and it acts as follows. Given
if we have ; and if we have . 
In particular, we have and for all . We refer to arm as the majority arm and arm as the minority arm. We denote as the loss of the majority arm and as the loss of the minority arm.
Since and , we have
(4.1) 
Observe also that one always has (indeed ), and thus the whole game to prove (3.2) is to upper bound the minority’s loss .
4.1 When the minority suffers small loss
Assume that for some constant . Then, because , one can directly obtain (3.2) from (4.1) with . In words, when the minority arm has a total loss comparable to the majority arm, simply playing from would satisfy a first-order regret bound.
Our main idea is to somehow enforce this relation between the minority and majority losses, by “truncating” probabilities appropriately. Indeed, recall that if after some truncation we have , then it satisfies so the minority loss can be improved.
4.2 Make the minority great again
Our key new insight is captured by the following lemma which is proved using an integral averaging argument.
Definition 4.1.
For each , let be the expected loss if the truncated strategy is played at each round.
Lemma 4.2.
As long as ,
In words, if is large, then it must be that was a much better threshold compared to , that is is large.
Proof of Lemma 4.2.
For any , define the function
Let us pick to minimize , and breaking ties by choosing the smaller value of . We make several observations:

because for any with we must have .

.

because .
Let us define the points
and . 
Note that the tiebreaking rule for the choice of ensures (if then it must satisfy giving a contradiction).
Using the identity
(4.2) 
we calculate that
Since , and , we conclude that
Given Lemma 4.2, a very intuitive strategy starts to emerge. Suppose we can somehow get an upper bound of the form
(4.3) 
Then, putting this into Lemma 4.2 and using , we have for any ,
In words, the minority arm also suffers from a small loss (and thus is great again!). Putting this into (4.1), we immediately get (3.2) as desired and finish the proof of Theorem 1.1 in the case $K = 2$.
4.3 Expanding the set of experts
Assume for a moment that we somehow expand the set of experts into
so that:
(4.4) 
Then clearly (4.3) would be satisfied using Lemma 3.2, (4.1) and (the loss of an expert should be no better than the loss of the best expert ).
There are two issues with condition (4.4): first, it is self-referential, in the sense that it assumes satisfies a certain form depending on while is defined via (recall (2.2)); and second, it potentially requires an infinite number of experts (one for each ).
Let us first deal with the second issue via discretization.
Lemma 4.3.
In the same setting as Lemma 4.2, there exists such that
Proof.
Thus, instead of (4.4), we only need to require
(4.5) 
We now resolve the self-referentiality of (4.5) by defining simultaneously and as follows. Consider the map defined by:
It suffices to find a fixed point : indeed, setting
and for , 
we have both (4.5) holds and is the correct weighted average of expert advices in
Finally, has a fixed point since it is a nondecreasing function from a closed interval to itself. It is also not hard to find such a point algorithmically.
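One simple way to locate such a fixed point numerically is to halve a bracket, using only monotonicity. The sketch below is an illustration under that assumption and is not claimed to be the constructive procedure the paper defers to Section 6. The function name `monotone_fixed_point` and the tolerance parameters are hypothetical.

```python
def monotone_fixed_point(F, lo, hi, tol=1e-12, max_iter=200):
    """Localize a fixed point of a nondecreasing map F from [lo, hi] to itself.

    Invariant: F(lo) >= lo and F(hi) <= hi. By monotonicity F then maps the
    current bracket into itself, so a fixed point always remains inside it.
    """
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if F(mid) >= mid:
            lo = mid        # a fixed point lies in [mid, hi]
        else:
            hi = mid        # a fixed point lies in [lo, mid]
    return lo
```

No continuity is needed for the bracket to keep containing a fixed point. When the map has jumps, the returned endpoint is only guaranteed to lie within `tol` of a fixed point rather than to be one exactly.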
This concludes the (slightly informal) proof for $K = 2$. We give the complete proof for arbitrary $K$ in the next section.
5 Proof of Theorem 1.1
In this section, we assume satisfies (2.2) and we defer the constructive proof of finding to Section 6. Recall the arm index has no real meaning so without loss of generality we have permuted the arms so that
We refer to the set of majority arms and the set of minority arms at round . (Footnote 1: We stress that in the $K$-arm setting, although is the minimum index such that , it may not be the minimum index so that .) We let and respectively be the total loss of the majority and minority arms. We again have
(5.1) 
Thus, the whole game to prove (3.2) is to upper bound and .
5.1 Useful properties
We state a few properties about and its truncations.
Lemma 5.1.
In each round , if satisfies (2.2), then
Proof.
Lemma 5.2.
In each round , if satisfies (2.2), then

for every minority arm it satisfies , and

for every majority arm it satisfies .
Proof.
For the sake of notation we drop the index $t$ in this proof. Recall .

For every minority arm , every , we have according to Definition 2.1. Therefore, we must have .

For every majority arm , we have (using Lemma 5.1)
From the definition of , we can also conclude . This is because . ∎
The next lemma shows that setting satisfies the assumption of Lemma 3.2.
Lemma 5.3.
If satisfies (2.2), and , then for every arm :
Proof.
For the sake of notation we drop the index $t$ in this proof.
By Definition 2.1 and Lemma 5.2, we have for every :
The other statement follows because whenever , Definition 2.1 says it must satisfy . ∎
5.2 Bounding and
We first upper bound and then upper bound .
Lemma 5.4.
If satisfies (2.2), then .
Proof.
Using Lemma 5.2 we have for any . Also, for every satisfying (owing to Definition 3.1 and Lemma 5.3). Therefore,