The contextual bandit problem is an influential extension of the classical multi-armed bandit. It can be described as follows. Let $K$ be the number of actions, $[N] = \{1, \dots, N\}$ a set of experts (or “policies”), $T$ the time horizon, and denote $[K] = \{1, \dots, K\}$. At each time step $t \in [T]$:
The player receives from each expert $m \in [N]$ an “advice” $\xi_t(m) \in \Delta([K])$, a probability distribution over the actions.
Using the advice and previous feedback, the player selects a probability distribution $p_t \in \Delta([K])$.
The adversary selects a loss function $\ell_t : [K] \to [0,1]$.
The player plays an action $a_t$ at random from $p_t$ (and independently of the past).
The player’s suffered loss is $\ell_t(a_t)$, which is also the only feedback the player receives about the loss function $\ell_t$.
The player’s performance at the end of the $T$ rounds is measured through the regret with respect to the best expert:
$$R_T := \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(a_t)\Big] - \min_{m \in [N]} \sum_{t=1}^{T} \langle \xi_t(m), \ell_t \rangle.$$
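The protocol above can be sketched in a few lines. This is only an illustration of the interaction: the names, the uniform-averaging aggregation rule, and the random losses are placeholders, not the algorithm studied in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, T = 3, 5, 100  # actions, experts, horizon (illustrative sizes)

total_loss = 0.0
for t in range(T):
    # each expert provides advice: a probability distribution over the K actions
    xi = rng.dirichlet(np.ones(K), size=N)   # shape (N, K)
    # the player combines the advice (and, in general, past feedback) into p_t;
    # plain averaging is a placeholder, not a regret-minimizing rule
    p = xi.mean(axis=0)
    loss = rng.random(K)                     # adversary's loss vector in [0, 1]^K
    a = rng.choice(K, p=p)                   # action a_t drawn from p_t
    total_loss += loss[a]                    # bandit feedback: only loss[a] is observed
```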
A landmark result by Auer et al. (2002) is that a regret of order $\sqrt{TK \log N}$ is achievable in this setting. The general intuition captured by such regret bounds is that the player’s performance is equal to the best expert’s performance up to a term of lower order. However, the aforementioned bound might fail to capture this intuition if the best expert’s cumulative loss $L^* := \min_{m \in [N]} \sum_{t=1}^{T} \langle \xi_t(m), \ell_t \rangle$ is much smaller than $T$. It is thus natural to ask whether one could obtain a stronger guarantee where $T$ is essentially replaced by $L^*$. This question was posed as a COLT 2017 open problem by Agarwal et al. (2017). Such bounds are called first-order regret bounds, and they are known to be achievable with full information (Auer et al., 2002), as well as in the multi-armed bandit setting (Allenberg et al., 2006; see also Foster et al., 2016 for a different proof) and the semi-bandit framework (Neu, 2015; Lykouris et al., 2017). Our main contribution is a new algorithm for contextual bandits, which we call MYGA (see Section 2), and for which we prove the following first-order regret bound, thus resolving the open problem.
For any loss sequence, MYGA with suitable choices of the parameters $\eta$ and $\varepsilon$ satisfies a regret bound of order $\sqrt{K L^* \log N}$ up to lower-order terms, where $L^*$ is the cumulative loss of the best expert.
2 Algorithm Description
In this section we describe the MYGA algorithm.
We introduce a truncation operator $\mathcal{T}_{k,\tau}$ that takes as input an index $k \in [K]$ and a threshold $\tau \in (0,1)$. Then, treating the first $k$ arms as “majority arms” and the last $K - k$ arms as “minority arms,” $\mathcal{T}_{k,\tau}$ redistributes “multiplicatively” the probability mass of all minority arms below threshold $\tau$ to the majority arms.
For $k \in [K]$ and $\tau \in (0,1)$, the truncation operator $\mathcal{T}_{k,\tau}$ is defined as follows. Given any $p \in \Delta([K])$, we set
$$(\mathcal{T}_{k,\tau}\, p)(i) = \begin{cases} 0 & \text{if } i > k \text{ and } p(i) < \tau, \\ p(i) & \text{if } i > k \text{ and } p(i) \ge \tau, \\ c \cdot p(i) & \text{if } i \le k, \end{cases}$$
where $c \ge 1$ is the normalization constant making $\mathcal{T}_{k,\tau}\, p$ a probability distribution.
Equivalently, one can define $(\mathcal{T}_{k,\tau}\, p)(i)$ for the majority arms $i \le k$ with the following implicit formula:
$$(\mathcal{T}_{k,\tau}\, p)(i) = p(i) \cdot \frac{1 - \sum_{j > k \,:\, p(j) \ge \tau} p(j)}{\sum_{j \le k} p(j)}.$$
To see this it suffices to note that the amount of mass in the majority arms after truncation is given by $\sum_{i \le k} (\mathcal{T}_{k,\tau}\, p)(i) = 1 - \sum_{i > k \,:\, p(i) \ge \tau} p(i)$.
If $K = 2$ and $k = 1$, then $\mathcal{T}_{1,\tau}$ simply adds $p(2)$ into $p(1)$ if $p(2) < \tau$.
An example with $K = 4$, $k = 2$, and $\tau = 0.2$ is as follows: $\mathcal{T}_{2,\,0.2}\,(0.45,\, 0.3,\, 0.15,\, 0.1) = (0.6,\, 0.4,\, 0,\, 0)$.
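As a sanity check, the truncation operator can be implemented in a few lines. This is a minimal sketch following the definition above; `truncate` is an illustrative name, not from the paper.

```python
import numpy as np

def truncate(p, k, tau):
    """Sketch of T_{k,tau}: zero out minority arms (indices above k; here the
    0-indexed arms i >= k) whose mass is below tau, and redistribute the
    removed mass multiplicatively over the k majority arms."""
    p = np.asarray(p, dtype=float)
    q = p.copy()
    below = (np.arange(len(p)) >= k) & (p < tau)    # minority arms under the threshold
    removed = q[below].sum()
    q[below] = 0.0
    q[:k] *= (p[:k].sum() + removed) / p[:k].sum()  # multiplicative redistribution
    return q

# example: K = 4 arms, k = 2 majority arms, threshold tau = 0.2
print(truncate([0.45, 0.30, 0.15, 0.10], 2, 0.2))
```

Both minority arms fall below the threshold here, so their combined mass of $0.25$ is redistributed proportionally over the two majority arms.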
2.2 Informal description
MYGA is parameterized by two parameters: a classical learning rate $\eta > 0$, and a thresholding parameter $\varepsilon > 0$. Also let $\mathcal{E}$ denote the finite set of candidate thresholds obtained by discretizing $[\varepsilon, 1]$.
At a high level, a key feature of MYGA is to introduce a set of auxiliary experts, one for each candidate threshold $\tau$. More precisely, in each round $t$, after receiving the expert advice $\xi_t(1), \dots, \xi_t(N)$, MYGA calculates a distribution for each auxiliary expert. Then, MYGA uses the standard exponential weight update with learning rate $\eta$ to calculate a weight for every expert (see (2.3)). Then, it computes:
$\bar{p}_t$, the weighted average of the advice of the original experts in $[N]$;
$\bar{q}_t$, the weighted average of the advice of the auxiliary experts.
Using this information, MYGA calculates the probability distribution $p_t$ from which the arm $a_t$ is played at round $t$.
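The exponential-weight averaging step can be sketched as follows. This is an illustration only: the function and argument names are assumptions, and the exact weighting used by MYGA is the one given in (2.3).

```python
import numpy as np

def exp_weights_average(advice, cum_loss, eta):
    """Weighted average of expert advice under exponential weights.

    advice:   (N, K) array, one distribution over arms per expert
    cum_loss: (N,) cumulative (estimated) losses of the experts
    eta:      learning rate
    """
    w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
    w /= w.sum()
    return w @ advice   # a convex combination of distributions, hence a distribution

advice = np.array([[0.7, 0.3],    # expert 1's advice
                   [0.2, 0.8]])   # expert 2's advice
p_bar = exp_weights_average(advice, np.array([1.0, 3.0]), eta=0.5)
# expert 1 has the smaller cumulative loss, so p_bar leans toward its advice
```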
Let us now explain how $p_t$ and the auxiliary experts are defined. First we remark that in the contextual bandit setting, the arm index has no real meaning, since in each round we can permute the arms by some permutation of $[K]$ and permute the experts’ advice and the loss vector by the same permutation. For this reason, throughout this paper, we shall assume that in each round the arms are sorted so that $\bar{p}_t(1) \ge \bar{p}_t(2) \ge \cdots \ge \bar{p}_t(K)$.
Let us define the “pivot” index $k_t := \min\big\{k \in [K] : \sum_{i=1}^{k} \bar{p}_t(i) \ge 1/2\big\}$. Then, in order to perform truncation, MYGA views the first $k_t$ arms as “majority arms” and the last $K - k_t$ arms as “minority arms” of the current round $t$. At a high level we will have:
the distribution $p_t$ to play from is a truncation $\mathcal{T}_{k_t, \tau}\, \bar{p}_t$ of $\bar{p}_t$, for a suitable threshold $\tau$;
each auxiliary expert is defined by the advice $\mathcal{T}_{k_t, \tau}\, \bar{p}_t$, one for each candidate threshold $\tau$.
We now give a more precise description in Algorithm 1.
compute the loss estimator $\hat{\ell}_t$ as $\hat{\ell}_t(i) := \frac{\ell_t(i)\, \mathbb{1}\{a_t = i\}}{p_t(i)}$
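This is the standard importance-weighted estimator. A quick sketch (with illustrative names), together with an exhaustive check of its unbiasedness over the draw of the played arm:

```python
import numpy as np

def loss_estimator(observed_loss, a, p, K):
    """hat_ell_t(i) = ell_t(i) * 1{a_t = i} / p_t(i): only the played arm's
    loss is needed, matching the bandit feedback."""
    hat = np.zeros(K)
    hat[a] = observed_loss / p[a]
    return hat

# Unbiasedness: averaging over a ~ p recovers the full loss vector,
# even though each single realization reveals only one coordinate.
p = np.array([0.5, 0.3, 0.2])
ell = np.array([0.2, 0.9, 0.4])
mean = sum(p[a] * loss_estimator(ell[a], a, p, 3) for a in range(3))
```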
For analysis purposes, let us define the truncated loss $\tilde{\ell}_t$, so that
We next derive two lemmas that isolate the properties of the truncation operator needed to obtain a first-order regret bound.
Let and assume that for all , for some universal constant , and that . Then one has
Using , , and , we have
The rest of the proof follows from the standard argument used to bound the regret of Exp4; see, e.g., Theorem 4.2 of Bubeck and Cesa-Bianchi (2012), with the minor modification that the stated assumption implies the required range condition. ∎
The next lemma is straightforward.
In addition to the assumptions in Lemma 3.2, assume that there exist numerical constants such that
Then one has
In fact the assumption of Lemma 3.2 will be easily verified, and the real difficulty will be to prove (3.2). We observe that the standard trick of thresholding the arms whose probability falls below a fixed level would yield (3.2) with a larger right-hand side, and in turn this leads to a suboptimal regret bound. Our goal is to improve over this naive argument.
4 Proof of the Two-Armed Case
Recall we have assumed without loss of generality that $\bar{p}_t(1) \ge \bar{p}_t(2)$ for each round $t$. This implies $\bar{p}_t(1) \ge 1/2$ because $\bar{p}_t(1) + \bar{p}_t(2) = 1$. In this simple case, $k_t = 1$ for all $t$, so for $\tau \in (0,1)$ we abbreviate our truncation operator $\mathcal{T}_{1,\tau}$ as $\mathcal{T}_\tau$, and it acts as follows. Given $p \in \Delta(\{1,2\})$:
if $p(2) < \tau$ we have $\mathcal{T}_\tau\, p = (1, 0)$; and if $p(2) \ge \tau$ we have $\mathcal{T}_\tau\, p = p$.
In particular, we have $(\mathcal{T}_\tau\, p)(1) \ge p(1)$ and $(\mathcal{T}_\tau\, p)(2) \le p(2)$ for all $\tau$. We refer to arm $1$ as the majority arm and arm $2$ as the minority arm. We denote by $L_{\mathrm{maj}}$ the total loss of the majority arm and by $L_{\mathrm{min}}$ the total loss of the minority arm.
Since and , we have
Observe also that the majority’s loss is always under control, and thus the whole game to prove (3.2) is to upper bound the minority’s loss $L_{\mathrm{min}}$.
4.1 When the minority suffers small loss
Assume that $L_{\mathrm{min}} \le C\, L_{\mathrm{maj}}$ for some constant $C$. Then one can directly obtain (3.2) from (4.1). In words, when the minority arm has a total loss comparable to the majority arm’s, simply playing from $\bar{p}_t$ would satisfy a first-order regret bound.
Our main idea is to somehow enforce this relation between the minority and majority losses by “truncating” probabilities appropriately. Indeed, recall that truncation can only decrease the minority arm’s probability, so the minority loss can only be improved.
4.2 Make the minority great again
Our key new insight is captured by the following lemma, which is proved using an integral averaging argument.
For each $\tau$, let $L(\tau)$ be the expected loss if the truncated strategy $\mathcal{T}_\tau\, p_t$ is played at each round $t$.
As long as ,
In words, if the left-hand side is large, then the alternative threshold must have been much better than the current one, that is, the gap on the right-hand side is large.
Proof of Lemma 4.2.
For any , define the function
Let us pick $\tau^*$ to minimize $L(\tau)$, breaking ties by choosing the smaller value of $\tau$. We make several observations:
because for any with we must have .
Let us define the points
Note that the tie-breaking rule for the choice of $\tau^*$ ensures this (otherwise a smaller threshold would also attain the minimum, giving a contradiction).
Using the identity
we calculate that
Since , and , we conclude that
Given Lemma 4.2, a very intuitive strategy starts to emerge. Suppose we can somehow get an upper bound of the form
Then, plugging this into Lemma 4.2, we have for any $\tau$,
4.3 Expanding the set of experts
Assume for a moment that we somehow expand the set of experts so that:
There are two issues with condition (4.4): first, it is self-referential, in the sense that it assumes the auxiliary advice takes a certain form depending on $p_t$, while $p_t$ is itself defined via that advice (recall (2.2)); and second, it potentially requires an infinite number of experts (one for each threshold $\tau$).
Let us first deal with the second issue via discretization.
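A standard way to carry out such a discretization is a geometric grid of thresholds, so that only logarithmically many auxiliary experts are needed while any threshold is approximated within a constant factor. The sketch below is illustrative and hypothetical (the paper's exact grid may differ):

```python
def threshold_grid(eps):
    """Geometric grid {eps, 2*eps, 4*eps, ...} of candidate thresholds in [eps, 1)."""
    grid, tau = [], eps
    while tau < 1.0:
        grid.append(tau)
        tau *= 2.0
    return grid

def round_down(tau, grid):
    """Largest grid point <= tau; within a factor of 2 of tau when tau >= grid[0]."""
    return max(g for g in grid if g <= tau)

grid = threshold_grid(0.01)   # only 7 candidate thresholds cover [0.01, 1)
```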
In the same setting as Lemma 4.2, there exists a threshold $\tau'$ such that
Thus, instead of (4.4), we only need to require
We now resolve the self-referentiality of (4.5) by defining $p_t$ and the auxiliary advice simultaneously, as follows. Consider the map defined by:
It suffices to find a fixed point of this map: indeed, setting $p_t$ and the auxiliary advice accordingly, both (4.5) holds and $p_t$ is the correct weighted average of the expert advice.
Finally, the map has a fixed point since it is a nondecreasing function from a closed interval into itself. It is also not hard to find such a fixed point algorithmically.
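The algorithmic remark can be illustrated with a simple bisection: for a nondecreasing $f$ mapping $[lo, hi]$ into itself, the invariant $f(lo) \ge lo$ and $f(hi) \le hi$ holds initially and is preserved, so the interval shrinks onto a fixed point. A minimal sketch, assuming continuity of $f$ for the final accuracy guarantee:

```python
def approx_fixed_point(f, lo, hi, tol=1e-9):
    """Bisection on the sign of f(x) - x for a nondecreasing f: [lo, hi] -> [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) >= mid:
            lo = mid   # preserves the invariant f(lo) >= lo
        else:
            hi = mid   # preserves the invariant f(hi) <= hi
    return 0.5 * (lo + hi)

# f(x) = x/2 + 0.3 is nondecreasing on [0, 1] with unique fixed point 0.6
x = approx_fixed_point(lambda t: 0.5 * t + 0.3, 0.0, 1.0)
```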
This concludes the (slightly informal) proof for $K = 2$. We give the complete proof for arbitrary $K$ in the next section.
5 Proof of Theorem 1.1
In this section, we assume $p_t$ satisfies (2.2), and we defer the constructive proof of finding such a $p_t$ to Section 6. Recall that the arm index has no real meaning, so without loss of generality we have permuted the arms so that
We refer to the first $k_t$ arms as the set of majority arms and the remaining arms as the set of minority arms at round $t$.¹ We let $L_{\mathrm{maj}}$ and $L_{\mathrm{min}}$ respectively be the total loss of the majority and minority arms. We again have

¹We stress that in the $K$-armed setting, although $k_t$ is the minimum index such that $\sum_{i \le k_t} \bar{p}_t(i) \ge 1/2$, it may not be the minimum index for the corresponding arm-wise condition.
Thus, the whole game to prove (3.2) is to upper bound $L_{\mathrm{maj}}$ and $L_{\mathrm{min}}$.
5.1 Useful properties
We state a few properties of $p_t$ and its truncations.
In each round $t$, if $p_t$ satisfies (2.2), then
In each round $t$, if $p_t$ satisfies (2.2), then
for every minority arm it satisfies , and
for every majority arm it satisfies .
The next lemma shows that setting satisfies the assumption of Lemma 3.2.
If $p_t$ satisfies (2.2), then for every arm $i$:
5.2 Bounding $L_{\mathrm{maj}}$ and $L_{\mathrm{min}}$
We first upper bound $L_{\mathrm{maj}}$ and then upper bound $L_{\mathrm{min}}$.
If $p_t$ satisfies (2.2), then