Recently, there has been a rise in the use of experiments by many organizations to optimize decisions (e.g., product recommendation in e-commerce, ad selection in digital advertising, or testing medical interventions in healthcare). However, running an experiment involves an opportunity cost, or regret (e.g., exposing some users or patients to a potentially inferior experience or treatment). To reduce this opportunity cost, a growing number of enterprises leverage multi-armed bandit (MAB) experiments (Scott, 2010, 2015; Johari et al., 2017). The MAB approach adaptively updates decisions based on partially available results of the experiment in order to minimize the regret. These practical motivations, which date back to Thompson (1933) and Lai and Robbins (1985), combined with its mathematical richness, have made the MAB problem the subject of intense study in statistics, operations research, electrical engineering, computer science, and economics over the last few decades (Russo et al., 2018; Lattimore and Szepesvari, 2019).
This paper considers a general version of the MAB problem, the stochastic linear bandit problem, in which a decision-maker sequentially chooses actions from given action sets and observes rewards corresponding to the selected actions. The rewards are stochastic, and their means depend on the actions through a fixed linear function. While initially unknown to the decision-maker, the reward function can be estimated as more decisions are made and their rewards are observed. The main goal of the decision-maker is to maximize its cumulative expected reward over a sequence of decision epochs (or time periods). Equivalently, one can measure the difference (referred to as expected regret, or regret for short) between the best achievable cumulative expected reward, obtained by an oracle that has access to the true mean of the reward function, and the cumulative expected reward obtained by the decision-maker.
The regret can be measured in a Bayesian or in a frequentist fashion. The Bayesian regret is used when the mean reward functions depend on random parameters; the expectations are then taken with respect to the randomness in the reward functions, the unknown parameters, and any additional randomness introduced by the decision-maker. The frequentist regret (also referred to as worst-case regret), on the other hand, is used when the mean reward functions are deterministic, so the expectation is only with respect to the other two sources of randomness.
The main challenge for the decision-maker is to design algorithms that efficiently balance exploration (experimenting with untested actions) and exploitation (choosing high-reward actions). Two approaches to this problem have attracted a great deal of attention. Dani et al. (2008) and Abbasi-Yadkori et al. (2011) utilize optimism in the face of uncertainty and obtain policies with worst-case regret bounds that are, as shown by Dani et al. (2008), minimax optimal up to logarithmic factors. The other approach, introduced by Thompson (1933), arises from a heuristic idea in the Bayesian setting: sample from the posterior distribution of the reward function, given past observations, and choose the best action as if this sample were the true reward function. This approach is known as Thompson Sampling (TS) or posterior sampling; although it is Bayesian in nature, it can be applied in the frequentist setting as well. The idea has become increasingly popular in practice due to its simplicity and empirical performance (Scott, 2010, 2015; Russo et al., 2018).
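As a concrete illustration, here is a minimal sketch of one round of posterior sampling for a linear model with a Gaussian prior and Gaussian noise (the function name, the conjugate-Gaussian model, and all default parameters are our own illustrative choices, not a specification from the text):

```python
import numpy as np

def lints_choose(X, y, actions, lam=1.0, noise_sd=1.0, inflation=1.0, rng=None):
    """One round of LinTS under a conjugate Gaussian model (a sketch).

    X: (n, d) past actions; y: (n,) observed rewards; actions: (k, d)
    current action set; inflation: posterior-variance inflation factor
    (1.0 corresponds to plain LinTS).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    # Gaussian posterior: precision lam*I + X^T X / sigma^2, mean Sigma X^T y / sigma^2
    cov = np.linalg.inv(lam * np.eye(d) + X.T @ X / noise_sd**2)
    cov = (cov + cov.T) / 2  # symmetrize against floating-point error
    mean = cov @ X.T @ y / noise_sd**2
    # Sample a plausible parameter vector and act greedily as if it were the truth
    theta_tilde = rng.multivariate_normal(mean, inflation**2 * cov)
    return actions[np.argmax(actions @ theta_tilde)]
```

Plain LinTS corresponds to `inflation=1`; the inflated variants discussed below scale the posterior covariance before sampling.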
TS has been extensively studied from both theoretical and empirical points of view. Most notably, Agrawal and Goyal (2012, 2013a) prove minimax near-optimal worst-case guarantees for TS in the multi-armed bandit (MAB) setting. Russo and Van Roy (2014) use the connection between TS and optimistic policies to provide the first theoretical guarantee for TS that covers a wide range of problems, including the stochastic linear bandit problem, for which the TS heuristic is referred to as LinTS. Their analysis yields a Bayesian regret bound for this problem which cannot be improved in general.
In the frequentist setting, however, Agrawal and Goyal (2013b) and Abeille et al. (2017) have obtained regret bounds for a variant of LinTS which samples from a posterior distribution whose variance is inflated by a factor of $\sqrt{d}$. This bound falls short of the optimal rate by a factor of $\sqrt{d}$. It has been an open question whether this extra factor can be eliminated in the linear bandit problem (see, e.g., Russo et al., 2018, page 78). We answer this question negatively. In particular, we construct examples showing that LinTS without inflation can incur linear regret for an exponentially long time when the noise distribution and/or the prior distribution does not match the one that LinTS assumes. The striking fact about these examples is that they can successfully deceive LinTS even if one reduces the variance of the noise. In fact, we will show that noiseless observations can cause LinTS to fail for an exponentially long time. It is important to understand this issue with LinTS for the following reasons:
In many applications, the exact prior and noise distributions are either unknown or not easy to sample from. In these cases, one needs to estimate or approximate the posterior distribution. However, as our examples demonstrate, LinTS is not robust to these mismatches.
This issue opens the door to adversarial attacks. Notice that in the posterior computation, it is often assumed that, conditional on the history, the set of actions is independent of the true reward function. This assumption may fail when an adversary with some knowledge about the true parameter can change the action sets. This scenario is particularly relevant in the presence of a competing firm that has acquired more data about the same problem.
We emphasize that these concerns do not apply to the optimism in the face of uncertainty linear bandit (OFUL) algorithm of Abbasi-Yadkori et al. (2011). These two issues thus call for a better understanding of LinTS in the frequentist setting. On the positive side, we use the framework introduced in Hamidi and Bayati (2020) to prove that, under additional assumptions, the inflation parameter can be significantly reduced while retaining the theoretical guarantees. We validate our assumptions through simulations in a synthetic setting.
2 Setting and notation
For a positive integer $n$, we write $[n] := \{1, 2, \dots, n\}$. For a positive semi-definite matrix $A$ and a vector $x$ of suitable size, we write $\|x\|_A := \sqrt{x^\top A x}$. For a matrix $A$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots$, we define its operator norm as $\|A\|_{\mathrm{op}} := \sigma_1$ and its trace (nuclear) norm as $\|A\|_{\mathrm{tr}} := \sum_i \sigma_i$.
Let be a sequence of random compact subsets of , where is the time horizon. We further assume that for all , almost surely. A policy sequentially interacts with this environment in rounds. At time , it receives the action set , chooses an action , and receives a stochastic reward , where is the unknown (and potentially random) vector of parameters. By we denote the arm with the maximum expected reward. We denote the history of observations up to time by ; more precisely, we define . In this model, a policy is formally defined as a (stochastic) function that maps to an element of .
We compare policies through their cumulative Bayesian regret defined as
Notice that the expectation is taken with respect to the entire randomness in our model, including the prior distribution. The frequentist regret bounds also follow by taking the prior distribution to be the measure that puts all its mass on a single vector.
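With the notation above, the cumulative Bayesian regret takes the standard form (our reconstruction, writing $x_t^{\ast}$ for the optimal arm at time $t$, $x_t$ for the chosen arm, and $\theta^{\ast}$ for the parameter vector):

```latex
\mathrm{BayesRegret}(T) \;=\; \mathbb{E}\left[\sum_{t=1}^{T} \big\langle x_t^{\ast} - x_t,\; \theta^{\ast} \big\rangle\right].
```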
3 Bayesian analyses are brittle
In this section, we demonstrate that LinTS with the proper posterior update rule may incur linear regret when the assumptions are slightly violated. These examples, in particular, settle an open question mentioned in (Russo et al., 2018, §8.1.2). More precisely, we show that LinTS's Bayesian regret (and thereby its worst-case regret) can grow linearly for an exponentially long time whenever the prior distribution or the noise distribution mismatches the one that LinTS works with. It furthermore follows from our strategy that an inflation rate of at least is needed to avoid these problems.
3.1 Noise reduction and LinTS’s failure
Here we show that reducing noise or the variance of the prior distribution can cause LinTS to fail. Our strategy for proving these results involves the following two steps:
We first construct small problem instances for which is marginally biased.
We then show that, by combining independent copies of these biased instances, Thompson sampling can incur linear Bayes regret.
Bias-introducing action sets.
In this section, we construct an example in which is marginally biased provided that either the prior distribution or the noise distribution mismatches the one that LinTS uses. Fix and let be the vector of unobserved parameters. At time , we reveal the following action sets to the policy:
For , LinTS has only one choice and thus . Assume that is revealed to the algorithm where . At time for the first time, LinTS has two choices. Let be such that . Then, is given to the algorithm where . The following lemma asserts that is marginally biased. Let . For any , we have
where and
are two independent standard normal random variables. Furthermore, satisfies
Stacking biased settings.
We prove that, by combining independent copies of the above example, LinTS can choose an incorrect action for at least rounds. Let be a positive integer and define . In the first rounds, we follow the action sets from the previous section for each pair , for . Namely, define
The following key lemma states that, with constant probability, is the optimal action while LinTS perceives it as suboptimal with an enormous gap. Letting , we have
We denote the above event by . Conditional on this event, for all , the optimal arm is , and the regret incurred by choosing is at least . Moreover, let be the probability of choosing at . As we will see, this probability is exponentially small as a function of , and whenever is not chosen, the probability of selecting it in the next round remains unchanged. This observation holds true up to the first time that is picked, which can, in turn, take an exponentially long time. Making this argument rigorous yields the following proposition: For fixed , we have
3.2 Mean shift and fixed action sets
In this subsection, we construct an example in which LinTS incurs linear Bayes regret while the action set is fixed over time. This example is nonetheless perhaps less appealing than the one in the previous subsection, as we shift the mean of the prior distribution. Let be fixed, and for , set the prior distribution to be . We then reveal the action set to LinTS for all , where
The next proposition highlights the key observations about why LinTS fails in this simple setting: For fixed and for sufficiently large , we have
with probability at least ,
with probability ,
Conditional on , , with probability at least ,
Conditional on , with probability at most ,
For , .
One can slightly modify the proof to obtain a similar result for .
It is easy to see that for any arbitrary constant , the same rate as in Equation 13 is achievable. Also, for , where , one can still obtain non-trivial results.
4 Improving LinTS
The aim of this section is to introduce a novel approach for reducing the inflation parameter in LinTS under additional assumptions. Before stating our results, we discuss the insights that lead to these assumptions.
4.1 Insights into LinTS’s optimism mechanism
This subsection is dedicated to intuition about the optimism mechanism of LinTS. We assume that is the ridge estimator of the parameter at some time and that is a confidence set that contains and with high probability. We reveal the action set to the policy, where is the optimal arm, i.e., . LinTS chooses only if
The left-hand side of this inequality can be decomposed as
This implies that a sufficient condition for to hold is
This inequality requires to compensate for the underestimation of the reward caused by the estimation error vector . These vectors are illustrated in Figure 1.
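In symbols (our reconstruction, writing $\hat\theta$ for the ridge estimator, $\tilde\theta$ for the posterior sample, $\theta^{\ast}$ for the true parameter, and $x^{\ast}$ for the optimal arm), the decomposition and the resulting sufficient condition read:

```latex
\langle x^{\ast}, \tilde\theta \rangle
  \;=\; \langle x^{\ast}, \theta^{\ast} \rangle
  \;+\; \langle x^{\ast}, \hat\theta - \theta^{\ast} \rangle
  \;+\; \langle x^{\ast}, \tilde\theta - \hat\theta \rangle,
\qquad\text{so optimism holds whenever}\qquad
\langle x^{\ast}, \tilde\theta - \hat\theta \rangle
  \;\ge\; \langle x^{\ast}, \theta^{\ast} - \hat\theta \rangle.
```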
OFUL explicitly seeks the that maximizes the left-hand side of Equation 5, and since with high probability, the desired "compensation inequality" holds and is selected. Thompson sampling, on the other hand, follows a stochastic approach and resorts to a randomly sampled point in to solve Equation 5. Recall that is the ridge estimator for the data collected thus far. In a fixed design setting (which is not the case in our bandit problem), the error vector would point in a random direction. Therefore, provided that is independent of , we have
The same expression also holds for ; therefore, the compensation inequality holds with constant probability. To summarize, the inequality in Equation 6 holds true if the following two conditions are met:
The error vector is distributed in a random direction.
The optimal action is independent of .
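A quick Monte Carlo check of these two conditions (a self-contained sketch under our own toy assumptions: an isotropic Gaussian estimation error and an independent, identically distributed posterior-sample deviation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200_000
x_star = rng.standard_normal(d)    # a fixed optimal direction, independent of the errors

err = rng.standard_normal((n, d))  # estimation error theta_hat - theta*, random direction
dev = rng.standard_normal((n, d))  # posterior-sample deviation theta_tilde - theta_hat

# Fraction of trials in which the sampled deviation compensates the underestimation,
# i.e. <x*, theta_tilde - theta_hat> >= <x*, theta* - theta_hat>
p_optimistic = np.mean(dev @ x_star >= -(err @ x_star))
```

Both inner products are independent centered Gaussians, so the compensation inequality holds with probability close to 1/2, matching the constant-probability optimism described above.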
The crucial point in the analysis of LinTS in the Bayesian setting is that, whenever LinTS has access to the true prior and noise distributions, the first condition above holds. In Section 3, however, we have shown that this condition is violated if LinTS uses an incorrect prior or noise distribution in computing the posterior. Agrawal and Goyal (2013b) and Abeille et al. (2017) take a conservative approach and propose inflating the variance of the posterior distribution by a factor of $\sqrt{d}$ to ensure with constant probability. We now present an alternative approach that leverages the randomness of the optimal action to reduce the need for exploration. The following assumption requires the optimal arm (rather than the error vector) to be distributed in a random direction. Assume that for any with , we have
for some fixed
. Unfortunately, this condition alone does not suffice to reduce the inflation rate of the posterior distribution. To see this, consider a case in which the largest eigenvalue of is much larger than the others; thereby, . Figure 2 illustrates this situation. In this case, we have
However, it follows from the definition of LinTS that . Assuming that , we see that . This suggests that is proportional to . We can now see that the assumption above is not sufficient for ensuring Equation 5, as we have
This observation implies that an inflation rate of order is necessary when the eigenvalues of differ significantly in magnitude. To make this notion precise, we define the thinness coefficient corresponding to as
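Since the coefficient depends on the eigenvalue spread of the covariance matrix, the following sketch computes an eigenvalue-based thinness proxy; the specific normalization used here (largest eigenvalue over average eigenvalue) is our illustrative stand-in, not necessarily the exact coefficient defined above:

```python
import numpy as np

def thinness_proxy(cov):
    """Eigenvalue-spread proxy for thinness of a positive semi-definite matrix.

    Returns lambda_max / (trace / d): equal to 1 for isotropic matrices and
    large when a single eigendirection dominates (the 'thin' regime in the text).
    """
    d = cov.shape[0]
    eigvals = np.linalg.eigvalsh(cov)  # ascending order for symmetric input
    return eigvals[-1] / (np.trace(cov) / d)
```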
We also make the following assumption. For , we have
for any positive definite with . With this assumption, we are now ready to state our formal results.
4.2 Formal results
At time , we say that problem is well-posed if , and the following inequalities are satisfied:
We denote the indicator function for this event by . The next lemma, loosely speaking, asserts that LinTS is optimistic with constant probability. [Optimism of LinTS] Set and . Whenever , we have
It is worth mentioning that one can also reduce the radius of the confidence ball in OFUL under these assumptions. More precisely, one can replace the radius of the confidence ball with while maintaining the same regret bound. Although this does not improve the regret bound, it may improve empirical performance, as it avoids unnecessary exploration. The main caveat of this result is the assumption that holds with high probability, since this is not merely a property of the action sets; it also depends on the policy through the actions that it chooses. We address this problem by setting the inflation rate to whenever . This yields the following result. If , we have
We will see in Section 5 that, in our simulations, is indeed large for a short period of time.
5.1 Average failure time of LinTS
We validate the examples in Section 3 through the following simulations:
Noise reduction example.
In this simulation, for each , we generate and run LinTS for rounds using the action sets in Equation 3. The reward for choosing an action is simply given by ; therefore, no noise is added to the reward. We then compute the probability that . We repeat this procedure 50 times to obtain and take the maximum . Figure 2(a) displays against .
Fixed action set example.
For given and , we draw and reveal the action set defined in Equation 4. Then, conditional on , we compute the probability that the next arm is either or . We repeat this process 50 times to obtain and, as before, define to be their maximum. Figure 2(b) shows for when varies between 1 and 120000. Figure 2(c), on the other hand, illustrates for and varying between 0 and 1.
5.2 Thinness over time
Here we investigate how thinness varies over time. We take a setting similar to the one described in the simulations section of Russo and Van Roy (2014). For , we generate . At each time , we generate i.i.d. random vectors from . Each of the following policies then chooses one of these actions:
TS-1: LinTS with no inflation ().
TS-2: LinTS with .
TS-3: LinTS with whenever and otherwise where
TS-4: LinTS with .
Each policy chooses for and receives feedback , where are i.i.d. standard Gaussian random variables. Next, we compute the thinness parameter for . We repeat this procedure 50 times. Figure 5 displays the thinness of these policies in our experiments; in particular, it shows that the thinness stays close to 1 for larger values of . Figure 5 also shows the cumulative regret of these policies. Notice that, while TS-3 may inflate the posterior variance by , its performance is in fact much closer to that of TS-1 and TS-2.
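The experimental loop above can be sketched as follows; the dimension, horizon, action-set size, and Gaussian action model are our stand-ins for the unspecified constants, and the `inflation` argument plays the role of the factor distinguishing the four policies:

```python
import numpy as np

def run_lints(inflation, d=5, T=300, k=10, noise_sd=1.0, seed=0):
    """Cumulative regret of LinTS with a given posterior-inflation factor on
    i.i.d. Gaussian action sets (a sketch of the simulation setup)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(d)            # true parameter, drawn once
    prec, b = np.eye(d), np.zeros(d)          # posterior precision and scaled response sum
    regret = 0.0
    for _ in range(T):
        actions = rng.standard_normal((k, d))  # fresh i.i.d. Gaussian action set
        cov = np.linalg.inv(prec)
        cov = (cov + cov.T) / 2                # symmetrize against floating-point error
        theta_tilde = rng.multivariate_normal(cov @ b, inflation**2 * cov)
        x = actions[np.argmax(actions @ theta_tilde)]
        regret += np.max(actions @ theta) - x @ theta
        reward = x @ theta + noise_sd * rng.standard_normal()
        prec += np.outer(x, x) / noise_sd**2   # conjugate Gaussian posterior update
        b += x * reward / noise_sd**2
    return regret
```

Comparing `run_lints(1.0)` with inflated variants such as `run_lints(np.sqrt(5))` mirrors, in miniature, the comparison among the policies above.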
Appendix A Proofs of Section 3
Proof of Section 3.1.
It follows from the definition of that
Next, at , the -th entry is updated according to
Moreover, the other entry remains unchanged, in other words
Therefore, setting , we have
and in particular
We can now compute this expression in terms of the selection bias coefficient given by
By definition, . Therefore, we have
On the other hand, it follows from the symmetry that
Using Appendix C for the sequence
we infer that
Consequently, we can write
Similarly, we can conclude that
This equality implies that is marginally biased whenever . Finally, Appendix A gives
and similarly for , we get that
Therefore, we have
This implies that the m.g.f. of satisfies
Proof of Section 3.1.
It follows from Equation 1 that
where . Assuming , we observe . Moreover, Equation 2 implies that
Using this inequality in combination with Equation 11, we obtain the following concentration inequality:
Next, note that , and thus, we have
For sufficiently large values of , we have , and hence
Proof of Section 3.1.
For , let be given by
We now have the following lower bound for the regret of Algorithm 1:
Define . We get that
Furthermore, it follows from the definition of that
By combining the above, we have that
It immediately follows that
which demonstrates that the regret of Thompson sampling grows linearly up to time . ∎
Proof of Section 3.2.
Therefore, and hold simultaneously with probability at least . This implies that is the optimal arm with high probability. For sufficiently large , this probability exceeds .
On the other hand, at , LinTS (Algorithm 1) will choose with probability . This holds true as is chosen if and only if
The claim follows from the fact that these two random variables are centered and independent normal random variables. In this case, we have