In a bandit problem, a decision-maker sequentially chooses actions from given action sets and receives rewards corresponding to the selected actions. The goal of the decision-maker, also referred to as the policy, is to maximize the cumulative reward by utilizing the history of previous observations. This paper considers a variant of this problem, called the stochastic linear bandit, in which all actions are elements of $\mathbb{R}^d$ for some integer $d$ and the expected rewards depend on the actions through a linear function. We also let the action sets change over time. The classical multi-armed bandit (MAB) and the $K$-armed contextual bandit are special cases of this problem.
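As a rough illustration of this interaction protocol, a minimal simulation might look as follows; the dimensions, the Gaussian parameter vector, and the uniformly random policy are all hypothetical placeholders, not part of the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, K = 5, 100, 10          # dimension, horizon, actions per round (hypothetical)
theta = rng.normal(size=d)    # unknown parameter of the linear reward function

total_reward = 0.0
for t in range(T):
    actions = rng.normal(size=(K, d))   # action set for round t, changing over time
    x = actions[rng.integers(K)]        # a uniformly random policy's choice
    reward = x @ theta + rng.normal()   # linear mean reward plus mean-zero noise
    total_reward += reward
```

A real policy would replace the random choice with a rule that uses the history of past action-reward pairs.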
Since its introduction, the linear bandit problem has attracted a great deal of attention. Several algorithms based on the idea of the upper confidence bound (UCB) have been proposed and analysed (notable examples are [5, 7, 14, 1]). The best known regret bound for these algorithms is $\tilde{O}(d\sqrt{T})$, which matches the existing lower bounds up to logarithmic factors [7, 14, 18, 13, 12]. The best known algorithm in this family is the optimism in the face of uncertainty linear bandit (OFUL) algorithm.
A different line of research examines the performance of Thompson sampling (TS), a Bayesian heuristic that employs the posterior distribution of the reward function to balance exploration and exploitation; TS is also known as posterior sampling. [16, 17] proved an $\tilde{O}(d\sqrt{T})$ upper bound for the Bayesian regret of TS, thereby establishing its near-optimality in that setting. The best worst-case regret bound known thus far for TS, however, is $\tilde{O}(d^{3/2}\sqrt{T})$ [4, 3], which is worse than the previous bounds by a factor of $\sqrt{d}$. As stated in Section 8.2.1 of , it is an open question whether this extra factor can be eliminated by a more careful analysis.
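The posterior-sampling mechanism behind TS can be sketched, for the illustrative special case of a Gaussian prior and Gaussian noise (an assumption made here only for the sake of a closed-form posterior, not the general setting analyzed in this paper), as:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, lam, sigma = 3, 200, 1.0, 1.0   # hypothetical problem sizes and scales
theta = rng.normal(size=d)            # unknown parameter vector

# Gaussian-posterior Thompson sampling: maintain ridge-regression statistics.
V = lam * np.eye(d)        # posterior precision (up to the noise scale)
b = np.zeros(d)            # sum of reward-weighted actions
for t in range(T):
    actions = rng.normal(size=(10, d))
    mean = np.linalg.solve(V, b)
    cov = sigma**2 * np.linalg.inv(V)
    theta_tilde = rng.multivariate_normal(mean, cov)   # sample from the posterior
    x = actions[np.argmax(actions @ theta_tilde)]      # act greedily on the sample
    r = x @ theta + sigma * rng.normal()
    V += np.outer(x, x)
    b += r * x
```

Exploration arises because the posterior sample fluctuates around the mean estimate, with larger fluctuations in poorly explored directions.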
In addition, when there is a gap between the expected rewards of the top two actions, OFUL and TS are shown to have regret that depends on $T$ through $\log T$ instead of $\sqrt{T}$. According to , it remains an open problem whether this result extends to the cases when the action sets can be infinite. We defer to , and references therein, for a more thorough discussion.
On the other hand, in a subclass of the linear bandit problem known as the linear $K$-armed contextual bandit,  considered a more general version of the gap assumption, a certain type of margin condition for the action set, and proposed a novel extension of the $\epsilon$-greedy algorithm. Their OLS Bandit algorithm explicitly allocates a fraction of rounds to each arm, and uses these forced samples to discard obviously sub-optimal arms. They show that this filtering approach can lead to a near-optimal regret bound that grows logarithmically in $T$.  adapted this idea to the setting with very large $d$, and  extended it further to the case when both the number of arms and $d$ are large. In this paper, we demonstrate that a major generalization of this idea yields a unifying technique for analyzing all of the above algorithms, one that not only recovers known results from the literature, but also yields a number of new results and, notably, solves the two aforementioned open problems. Explicitly, the main contributions of this paper are:
We propose a general family of algorithms, called Two-phase Bandit, for the stochastic linear bandit problem and prove that they are rate optimal. We also show that TS, OFUL, and OLS Bandit are special cases of this family; therefore, we obtain a universal proof of rate optimality for all of these algorithms, in both the Bayesian and worst-case regret settings.
We consider the same generalized gap assumption as in , that
with positive probability, and obtain a poly-logarithmic (in $T$) gap-dependent regret bound for all of the above algorithms, when the action sets are drawn independently from an unknown distribution. To the best of our knowledge, this result is new for OFUL and TS.
Our proof also shows that TS is vulnerable (it can incur linear Bayesian and worst-case regret) if it uses an incorrect prior distribution for the unknown parameter vector of the linear reward function, or an incorrect noise distribution.
As a byproduct of our analyses in §3-§4, we obtain a set of conditions under which (a) TS is rate optimal and (b) we can shrink the confidence sets of OFUL without impacting its regret.
Organization. We introduce the notation and main assumptions in §2. In §3 we introduce the Two-phase Bandit algorithm and prove that it is rate optimal. In §4 we introduce the ROFUL algorithm, a special case of the Two-phase Bandit algorithm, and in §5 we show that OFUL and TS are special cases of ROFUL. Finally, in §6, we prove that TS can incur linear regret in the worst case or when it does not have correct information about the prior or the noise distribution.
2 Setting and notations
For any positive integer $n$, we denote $[n] := \{1, \dots, n\}$. For a positive semi-definite matrix $V$, we write $\|x\|_V := \sqrt{x^\top V x}$ for any vector $x$ of suitable size. By a grouped linear bandit (GLB) problem, we mean a tuple where:
is the prior distribution of a parameter vector on .
consists of orthogonal -dimensional subspaces of . By abuse of notation, we write to also denote the projection matrix from onto .
are random compact subsets of .
are random objects passed to (randomized) policies to function.
is a sequence of independent mean-zero sub-Gaussian random variables.
The main difference between our model and the common linear bandit formulation (for example, as defined in ) is the introduction of the ’s. Loosely speaking, we can consider each as a copy of ; thus, . In this case, our assumption on the action sets demands that each action have nonzero entries in only one of the copies of . Our problem can be regarded as a -dimensional instance of the ordinary (un-grouped) linear bandit; however, we will see in the next sections that this additional structure lets us improve the regret bound by a factor of in the gap-dependent setting and a factor of in the gap-independent setting. Three interesting special cases of this model are:
When and for all , our problem reduces to a simple multi-armed bandit problem.
When , and each action set contains exactly copies of a vector (one in each ), the problem is called -armed contextual bandit.
When , we get the ordinary stochastic linear bandit problem.
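For instance, the contextual-bandit encoding above can be sketched as follows, with hypothetical sizes: each action places a shared context vector in one of the orthogonal blocks, so that distinct actions have non-overlapping supports.

```python
import numpy as np

d, K = 4, 3                      # context dimension and number of arms (hypothetical)
context = np.arange(1.0, d + 1)  # the shared context vector for this round

# Action i places the context in the i-th block of a (d*K)-dimensional space;
# the K blocks are mutually orthogonal coordinate subspaces.
actions = np.zeros((K, d * K))
for i in range(K):
    actions[i, i * d:(i + 1) * d] = context
```

With this encoding, the i-th block of the global parameter vector plays the role of the parameter of arm i.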
The optimal and selected actions at time are denoted by and , respectively. The corresponding reward , where , is then revealed to the policy. We denote the history of observations up to time by . More precisely, we define
In this model, a policy is formally defined as a deterministic function that maps to an element of . We emphasize that this definition includes randomized policies as well: the random objects ’s are the source of randomness used by a policy.
The performance measure for evaluating the policies is the standard cumulative Bayesian regret defined as
The expectation is taken with respect to all the randomness in our model, including the prior distribution. Although we describe our setting in a Bayesian fashion, our results also cover the fixed-parameter setting, by taking the prior distribution to be a point mass at .
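Since the noise is mean-zero, it cancels in expectation and the per-round regret reduces to a difference of mean rewards. A sketch, with a stand-in random policy and hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 50
theta = rng.normal(size=d)

# Per-round regret: mean reward of the best available action minus that of
# the chosen one; summing over rounds gives the cumulative regret.
regret = 0.0
for t in range(T):
    actions = rng.normal(size=(8, d))
    means = actions @ theta
    chosen = rng.integers(8)            # stand-in for any policy's choice
    regret += means.max() - means[chosen]
```

The Bayesian regret additionally averages this quantity over the prior on the parameter vector.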
2.1 Action sets
In the next sections, we derive regret bounds for various types of action sets; this subsection therefore collects the definitions and notation we will use for action sets. We begin by defining the extremal points of an action set. [Extremal points] For an action set , define its extremal points to be all for which there are no and satisfying
The importance of this definition is that all the algorithms studied in this paper choose only extremal points of the action sets. This observation implies that the rewards attained by any of these algorithms on an action set belong to the reward profile of , defined by
The maximum attainable reward and gap of an action set for the parameter vector are defined respectively as
For any , write
In the above notations, for the sake of simplicity, we may use subscript to refer to . For instance, by we mean . We now define a gapped problem as follows: [Gapped problem] We call a GLB problem gapped if for some the following inequality holds:
Moreover, for a fixed gap level , we let be the indicator of the event . Note that all problems are gapped for all and ; this observation will help us obtain gap-independent bounds. Our notion of gap is more general than the well-known gap assumption in the literature (e.g., as in ), in which the inequality always holds, so that the indicator would be identically equal to one. We conclude this section by defining the near-optimal space, followed by the diversity condition and the margin condition. The latter two conditions will enable us to improve a term in our regret bound, which will appear in Eq. (6), to an expression that grows sub-linearly in . [Near-optimal space] Let be the smallest number such that there exists with and
Let us denote , and, as before, we also treat as the projection onto the subspace . The main purpose of this notion is to handle sub-optimal arms in the special case of the -armed contextual bandit. One might harmlessly assume that (or that it equals the identity operator) and follow the rest of the paper. [Diversity condition] We say that a GLB problem satisfies the diversity condition with parameter if is independent of and
[Margin condition] In a GLB problem, the margin condition holds if
where are two constants.
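For a finite action set, the maximum attainable reward and the gap can be computed directly; the sketch below uses hypothetical data and simply compares against the best strictly sub-optimal action (for a general compact set one would restrict attention to extremal points, which for a finite set requires a convex-hull or LP check):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([1.0, -0.5, 0.25])   # hypothetical parameter vector
actions = rng.normal(size=(6, 3))     # a finite action set

means = actions @ theta
best = means.max()                          # maximum attainable reward
suboptimal = means[means < best - 1e-12]    # rewards of strictly sub-optimal actions
gap = best - suboptimal.max()               # distance to the runner-up reward
```

A problem is then "gapped" at level delta when this quantity exceeds delta with positive probability over the draw of the action set.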
3 Two-phase bandit algorithm
In this section, we describe the Two-phase Bandit algorithm, an extension of the OLS Bandit algorithm (introduced by  for the special case of the -armed contextual bandit) to our more general grouped linear bandit problem. The Two-phase Bandit algorithm (presented in Algorithm 1) has two separate phases to handle the exploration-exploitation trade-off. At each time , a forced-sampling rule either determines which arm to pull (i.e., when is an element of ) or declines to pick one, in which case the best arm is chosen by exploiting the information gathered thus far (i.e., when ). The forced-sampling rule is allowed to depend on the history as well as the current action set ; the notation expresses this dependence explicitly.
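The overall control flow of such a two-phase scheme can be sketched as follows. The sparse forced-sampling schedule, the constant margin used by the blurry selector, and the least-squares estimator are all hypothetical placeholders chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 3, 200
theta = rng.normal(size=d)
V, b = np.eye(d), np.zeros(d)         # ridge-regression statistics

def forced_sample(t, actions):
    """Hypothetical forced-sampling rule: explore on an exponentially sparse schedule."""
    if t & (t + 1) == 0:              # rounds 0, 1, 3, 7, 15, ...
        return actions[rng.integers(len(actions))]
    return None                       # decline: hand control to the selectors

for t in range(T):
    actions = rng.normal(size=(10, d))
    x = forced_sample(t, actions)
    if x is None:                     # exploitation phase
        theta_hat = np.linalg.solve(V, b)
        means = actions @ theta_hat
        # Blurry selector: keep actions within a constant margin of the best.
        candidates = actions[means >= means.max() - 0.5]
        # Vivid selector: greedy choice among the surviving candidates.
        x = candidates[np.argmax(candidates @ theta_hat)]
    r = x @ theta + rng.normal()
    V += np.outer(x, x)
    b += r * x
```

The candidate set is never empty, since the empirically best action always survives the blurry filter.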
In the exploitation phase, two selectors decide which action to choose. The blurry selector first selects a candidate set, which is then passed to the vivid selector to pick a single action from. The idea behind this architecture is that, with high probability, the blurry selector eliminates all the actions that are suboptimal by a constant margin; the name “blurry” indicates the low accuracy of this selector. The vivid selector, on the other hand, should become more accurate, with high probability, as  grows; the diversity condition is the assumption that plays a crucial role in proving this type of result. We now state and briefly discuss the assumptions required for our results to hold. [Boundedness] There exist constants such that and for all and for all almost surely. The next assumption bounds how much regret the forced-sampling rule incurs in order to ensure that the blurry selector works properly. [Forced-sampling cost] Let be the indicator function for the event . Then there exists some such that
We are now ready to formalize what it means for the blurry selector to work properly. [Blurry selector bound] Let . There exists some such that for all , we have that
The above assumptions are sufficient to prove a bound which scales with as . Furthermore, we will show in §4 that can be tuned so that the regret grows as . We can obtain even sharper regret bounds under the additional assumptions stated below; for example, the vivid selector should satisfy certain properties. To highlight the main idea, we restrict our attention to a concrete scenario in which these properties hold and defer the most general cases to a longer version of the paper. Specifically, we assume that the vivid selector is a greedy selector, defined as follows. [Greedy selector] Let
be an estimator for . By the greedy selector with respect to , we mean the selector given by
[Reasonableness] Let be fixed. For an estimator , define
The estimator is called reasonable if
for some .
The vivid selector is a greedy selector for a reasonable estimator for all .
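A greedy selector of this form is simply an argmax of estimated mean rewards; a minimal sketch with hypothetical inputs:

```python
import numpy as np

def greedy_selector(estimator, actions):
    """Pick the action maximizing the estimated mean reward <x, estimator>."""
    return actions[np.argmax(actions @ estimator)]

theta_hat = np.array([0.2, -1.0])                          # hypothetical estimate
actions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # candidate set
chosen = greedy_selector(theta_hat, actions)               # picks [1.0, 0.0]
```

Reasonableness is then a property of the estimator itself (how fast it concentrates), not of this selection rule.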
Our last assumption requires the selected actions to be diverse in the near-optimal space. This condition is not a property of the model or the policy alone; it is a characteristic of a policy in combination with a model. [Linear expansion] We say that linear expansion holds if
for some constants and all . With all these assumptions in place, we are ready to state our main regret bound, which yields results in both the gap-independent and gap-dependent settings. The result is also applicable to the case in which the action sets are selected by an adversary. If Assumptions 1-3 hold, the cumulative regret of Algorithm 1 (denoted by policy ) satisfies the following inequality:
Theorem 3 provides a gap-independent bound, Eq. (6), as well as a gap-dependent bound, Eq. (7). For the special case of OLS Bandit, when , the latter yields an  bound on the regret. Note that we can follow the same peeling argument as in [9, 6] and obtain an  bound for OLS Bandit; however, this peeling argument may not apply to the more general algorithms that we analyze in the next sections.
Proof of Theorem 3.
We split the regret of the algorithm into the following three cases. We will then bound each term separately.
Forced-sampling phase ,
When and , where is the indicator function for ,
When and .
By and , we denote the regrets of the above items up to time . Clearly, we have that
Assumption 1 gives us an upper bound for
Next, notice that the maximum regret that can occur in each round is bounded above by
It thus follows that
From the definition of , we infer that whenever and , the regret at time cannot exceed , and in the case that the action set is gapped (at level ), the regret is zero. Using this observation, we get that
which completes the proof of (6).
The key idea in bounding is that under , ; hence, with probability 1,
Hence, we have that
Note that, whenever and , we have
As in the proof of (6), the regret would be zero if is larger than the above. This, in turn, implies that
which is the desired result. ∎
4 Randomized OFUL
In this section, we present an extension of the OFUL algorithm of , and prove that under mild conditions it enjoys the same regret bound as the original OFUL. We call this extension Randomized OFUL (ROFUL) and present its pseudo-code in Algorithm 2. It receives an arbitrary estimator and, at each round, makes a greedy decision using this estimator. We require this estimator to be reasonable (Definition 1) and optimistic (Definition 2), and in Theorem 4 we use these assumptions to derive our regret bounds for this algorithm. We now define optimism. Recall that we defined
Similarly, we write
[Optimism] We say that the estimator is optimistic if for some we have
with probability at least where is a fixed constant and is the indicator function for the typical event .
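A standard way to obtain an optimistic estimator is a ridge-regression upper confidence index, as in OFUL. The sketch below illustrates the greedy use of such an index; the confidence-width scale beta is a hypothetical constant standing in for the theoretically calibrated radius:

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, lam, beta = 3, 100, 1.0, 2.0    # beta: hypothetical confidence-width scale
theta = rng.normal(size=d)
V, b = lam * np.eye(d), np.zeros(d)   # ridge-regression statistics

for t in range(T):
    actions = rng.normal(size=(10, d))
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    # UCB(x) = <x, theta_hat> + beta * ||x||_{V^{-1}}: an optimistic index that
    # upper-bounds the true mean reward on the typical event.
    widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    ucb = actions @ theta_hat + beta * widths
    x = actions[np.argmax(ucb)]       # greedy with respect to the optimistic index
    r = x @ theta + rng.normal()
    V += np.outer(x, x)
    b += r * x
```

Shrinking beta shrinks the confidence sets; the conditions in this section characterize when that can be done without hurting the regret.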
For brevity and readability, we introduce some additional notation for this proof. Define:
Well-posed action set indicator:
Upper and lower confidence bounds:
Our strategy is to first represent ROFUL as an instance of Two-phase Bandit, and then, verify the assumptions of Theorem 3. In order to do so, let be given by
The blurry selector is also defined as
We also set the vivid selector to be the greedy selector with respect to the estimator . It is straightforward to verify that the Two-phase Bandit algorithm with the components defined above is equivalent to Algorithm 2. We thus need to show that the assumptions of Theorem 3 hold. We begin by computing the forced-sampling cost in Assumption 1.
It follows from the definition of that implies
This gives us
which in combination with (11) leads to
Next, letting , we get
which in turn yields
We deduce from optimism (8) that
almost surely. Hence, we have
Substituting the above inequality into (12), we obtain
Recall we assumed that for each , there exists such that . Now, let be such that . We have that
Finally, we apply Lemma 10 and Lemma 11 in  for each separately:
Therefore, we have