On Worst-case Regret of Linear Thompson Sampling

06/11/2020
by   Nima Hamidi, et al.
Stanford University

In this paper, we consider the worst-case regret of Linear Thompson Sampling (LinTS) for the linear bandit problem. Russo and Van Roy (2014) show that the Bayesian regret of LinTS is bounded above by O(d√(T)), where T is the time horizon and d is the number of parameters. While this bound matches the minimax lower bound for this problem up to logarithmic factors, the existence of a similar worst-case regret bound is still unknown. The only known worst-case regret bound for LinTS, due to Agrawal and Goyal (2013b) and Abeille et al. (2017), is O(d√(dT)), which requires the posterior variance to be inflated by a factor of O(√(d)). While this bound is a factor of √(d) away from the minimax optimal rate, in this paper we show that it is the best one can get, settling an open problem stated in Russo et al. (2018). Specifically, we construct examples to show that, without the inflation, LinTS can incur linear regret up to time exp(O(d)). We then demonstrate that, under mild conditions, a slightly modified version of LinTS requires only an O(1) inflation, where the constant depends on the diversity of the optimal arm.


1 Introduction

Recently, there has been a rise in the use of experiments by many organizations to optimize decisions (e.g., product recommendation in e-commerce, ad selection in digital advertising, or testing medical interventions in healthcare). However, running an experiment involves an opportunity cost or regret (e.g., exposing some users or patients to a potentially inferior experience or treatment). To reduce this opportunity cost, a growing number of enterprises leverage multi-armed bandit (MAB) experiments (Scott, 2010, 2015; Johari et al., 2017). The MAB approach works by adaptively updating decisions based on the partially available results of the experiment so as to minimize the regret. These practical motivations, which date back to Thompson (1933) and Lai and Robbins (1985), combined with its mathematical richness, have made the MAB problem the subject of intense study in statistics, operations research, electrical engineering, computer science, and economics over the last few decades (Russo et al., 2018; Lattimore and Szepesvari, 2019).

This paper considers a general version of the MAB problem, the stochastic linear bandit problem, in which a decision-maker sequentially chooses actions from given action sets and observes rewards corresponding to the selected actions. The rewards are stochastic, and their means depend on the actions through a fixed linear function. While initially unknown to the decision-maker, the reward function can be estimated as more decisions are made and their rewards are observed. The main goal of the decision-maker is to maximize its cumulative expected reward over a sequence of decision epochs (or time periods). Equivalently, one can measure the difference (referred to as the expected regret, or regret for short) between the best achievable cumulative expected reward, obtained by an oracle that has access to the true mean of the reward function, and the cumulative expected reward obtained by the decision-maker.

The regret can be measured in a Bayesian or in a frequentist fashion. The Bayesian regret is used when the mean reward functions depend on random parameters; the expectation is then taken with respect to the randomness in the reward functions, the unknown parameters, and any additional randomness introduced by the decision-maker. The frequentist regret (also referred to as the worst-case regret) is used when the mean reward functions are deterministic, so the expectation is only with respect to the other two sources of randomness.
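To make the distinction concrete, the display below gives a minimal formalization of the two notions, using x_t for the chosen action, x_t* for the optimal action, and θ* for the unknown parameter (the notation introduced formally in Section 2); it is a standard textbook-style definition rather than a quotation from the paper.

```latex
% Worst-case (frequentist) regret: theta^* is a fixed, deterministic parameter;
% the expectation is over the reward noise and the policy's internal randomness.
R_T(\theta^\ast) \;=\; \mathbb{E}\!\left[\,\sum_{t=1}^{T}
    \bigl(\langle x_t^\ast,\theta^\ast\rangle-\langle x_t,\theta^\ast\rangle\bigr)\right],
\qquad
\mathrm{WorstCase}(T) \;=\; \sup_{\theta^\ast} R_T(\theta^\ast).

% Bayesian regret: theta^* is additionally drawn from a prior, and the
% expectation integrates over that prior as well.
\mathrm{BayesRegret}(T) \;=\; \mathbb{E}_{\theta^\ast\sim \mathrm{prior}}\bigl[R_T(\theta^\ast)\bigr].
```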

The main challenge for the decision-maker is to design algorithms that efficiently balance exploration (experimenting with untested actions) and exploitation (choosing high-reward actions). Two approaches to this problem have attracted a great deal of attention. Dani et al. (2008) and Abbasi-Yadkori et al. (2011) utilize optimism in the face of uncertainty and obtain policies with worst-case regret bounds that are, as shown by Dani et al. (2008), minimax optimal up to logarithmic factors. The other approach, introduced by Thompson (1933), arises from a heuristic idea in the Bayesian setting which suggests sampling from the posterior distribution of the reward function, given past observations, and choosing the best action as if this sample were the true reward function. This approach is known as Thompson Sampling (TS) or posterior sampling, and although it is Bayesian in nature, it can be applied in the frequentist setting as well. This idea has become increasingly popular in practice due to its simplicity and empirical performance (Scott, 2010, 2015; Russo et al., 2018).

TS has been extensively studied from both theoretical and empirical points of view. Most notably, Agrawal and Goyal (2012, 2013a) prove near-minimax-optimal worst-case guarantees for TS in the multi-armed bandit (MAB) setting. Russo and Van Roy (2014) use the connection between TS and optimistic policies to provide the first theoretical guarantee for TS that covers a wide range of problems, including the stochastic linear bandit problem, in which the TS heuristic is referred to as LinTS. Their analysis yields a Bayesian regret bound for this problem which cannot be improved in general.

In the frequentist setting, however, Agrawal and Goyal (2013b) and Abeille et al. (2017) have obtained regret bounds for a variant of LinTS which samples from a posterior distribution whose variance is inflated by a factor of O(√(d)). The resulting bound is a factor of √(d) away from the optimal rate. It has been an open question whether this extra factor can be eliminated in the linear bandit problem, e.g., as stated in (Russo et al., 2018, page 78). We answer this question negatively. In particular, we construct examples to show that LinTS without inflation can incur linear regret up to time exp(O(d)) when the noise distribution and/or the prior distribution does not match the one that LinTS assumes. The striking fact about these examples is that they can successfully deceive LinTS even if one reduces the variance of the noise. In fact, we will show that noiseless observations can cause LinTS to fail for an exponentially long time. It is important to understand this issue with LinTS for the following reasons:

  1. In many applications, the exact prior and noise distributions are either unknown or not easy to sample from. In these cases, one needs to estimate or approximate the posterior distribution. However, as our examples demonstrate, LinTS is not robust to these mismatches.

  2. This issue opens the door to adversarial attacks. Notice that in the posterior computation, it is often assumed that, conditional on the history, the set of actions is independent of the true reward function. This assumption may not hold when an adversary who has some knowledge about the true parameter can alter the action sets. This scenario is particularly relevant in the presence of a competing firm that has acquired more data about the same problem.

We emphasize that these concerns do not apply to the optimism in the face of uncertainty linear bandit (OFUL) algorithm of Abbasi-Yadkori et al. (2011). These two issues thus call for a better understanding of LinTS in the frequentist setting. In fact, on the positive side, we use the framework introduced in Hamidi and Bayati (2020) to prove that, under additional assumptions, the inflation parameter can be significantly reduced while retaining the theoretical guarantees. We validate our assumptions through simulations in a synthetic setting.

2 Setting and notation

For any positive integer n, we denote the set {1, 2, …, n} by [n]. For a positive semi-definite matrix A and any vector x of suitable size, we write ‖x‖_A for √(xᵀAx). For a matrix M with singular values σ_1 ≥ σ_2 ≥ …, we define its operator norm and its nuclear (trace) norm as ‖M‖_op = σ_1 and ‖M‖_* = Σ_i σ_i, respectively.

Let X_1, …, X_T be a sequence of random compact subsets of R^d, where T is the time horizon. We further assume that the action sets are uniformly bounded for all t, almost surely. A policy sequentially interacts with this environment over T rounds. At time t, it receives X_t, chooses an action x_t ∈ X_t, and receives a stochastic reward r_t = ⟨x_t, θ*⟩ + ε_t, where θ* ∈ R^d is the unknown (and potentially random) vector of parameters and ε_t is zero-mean noise. By x_t* we denote the arm in X_t with maximum expected reward. We denote the history of observations up to time t by H_t; more precisely, H_t consists of the revealed action sets, the chosen actions, and the observed rewards up to time t. In this model, a policy is formally defined as a (stochastic) function that maps the observed history and the current action set to an element of X_t.

We compare policies through their cumulative Bayesian regret, defined as

BR(T) = E[ Σ_{t=1}^{T} ( ⟨x_t*, θ*⟩ − ⟨x_t, θ*⟩ ) ].

Notice that the expectation is taken with respect to the entire randomness in our model, including the prior distribution. Frequentist regret bounds follow as well by taking the prior distribution to be the measure that puts all of its mass on a single vector.
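For concreteness, here is a small Python sketch of the interaction protocol and the regret computation described above. The interface — a policy object exposing act and update methods, and Gaussian reward noise — is an illustrative assumption, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_policy(policy, theta_star, action_sets, noise_sd=1.0):
    """Play one episode and return the cumulative (pseudo-)regret.

    policy      : object exposing act(actions) -> index and update(x, r).
    theta_star  : unknown parameter vector (fixed here, i.e. the frequentist view).
    action_sets : list of (K, d) arrays, one compact action set per round.
    """
    regret = 0.0
    for actions in action_sets:
        means = actions @ theta_star                      # expected rewards <x, theta*>
        i = policy.act(actions)                           # policy picks an arm
        reward = means[i] + noise_sd * rng.standard_normal()
        regret += means.max() - means[i]                  # oracle mean minus chosen mean
        policy.update(actions[i], reward)
    return regret
```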

3 Bayesian analyses are brittle

In this section, we demonstrate that LinTS with a proper posterior update rule may incur linear regret when its assumptions are slightly violated. These examples, in particular, settle an open question mentioned in (Russo et al., 2018, §8.1.2). More precisely, we show that LinTS's Bayesian regret (and thereby its worst-case regret) can grow linearly up to time exp(O(d)) whenever the prior distribution or the noise distribution mismatches the one that LinTS works with. It furthermore follows from our strategy that one needs an inflation rate of order at least √(d) to avoid these problems.

3.1 Noise reduction and LinTS’s failure

Here we show that reducing the noise or the variance of the prior distribution can cause LinTS to fail. Our strategy for proving these results involves the following two steps:

  1. We first construct small problem instances for which the estimator of the unknown parameter is marginally biased.

  2. We then show that, by combining independent copies of these biased instances, Thompson sampling can incur linear Bayesian regret.

0:  Inflation parameter ι.
1:  Initialize V_1 ← I_d and θ̂_1 ← 0
2:  for t = 1, 2, …, T do
3:     Observe X_t
4:     Sample θ̃_t ∼ N(θ̂_t, ι · V_t⁻¹)
5:     x_t ← argmax_{x ∈ X_t} ⟨x, θ̃_t⟩
6:     Observe reward r_t
7:     V_{t+1} ← V_t + x_t x_tᵀ
8:     θ̂_{t+1} ← V_{t+1}⁻¹ (V_t θ̂_t + r_t x_t)
9:  end for
Algorithm 1 Linear Thompson sampling
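Below is a minimal Python sketch of Algorithm 1 that plugs into the run_policy loop sketched in Section 2. The ridge penalty of 1, the Gaussian posterior shape, and the name inflation for the variance-inflation parameter are assumptions chosen to mirror the pseudocode, not details taken verbatim from the paper.

```python
import numpy as np

class LinTS:
    """Linear Thompson sampling with an inflated Gaussian posterior (sketch of Algorithm 1)."""

    def __init__(self, d, inflation=1.0, ridge=1.0, seed=0):
        self.V = ridge * np.eye(d)        # regularized Gram matrix V_t
        self.b = np.zeros(d)              # running sum of r_s * x_s
        self.theta_hat = np.zeros(d)      # ridge estimate of theta*
        self.inflation = inflation        # multiplies the posterior covariance
        self.rng = np.random.default_rng(seed)

    def act(self, actions):
        # Sample theta_tilde ~ N(theta_hat, inflation * V^{-1}) and act greedily on it.
        cov = self.inflation * np.linalg.inv(self.V)
        theta_tilde = self.rng.multivariate_normal(self.theta_hat, cov)
        return int(np.argmax(actions @ theta_tilde))

    def update(self, x, r):
        self.V += np.outer(x, x)
        self.b += r * x
        self.theta_hat = np.linalg.solve(self.V, self.b)
```

With inflation=1.0 this is vanilla LinTS; an inflation of order √(d) mirrors the worst-case-safe variant of Agrawal and Goyal (2013b) and Abeille et al. (2017).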

Bias-introducing action sets.

In this section, we construct an example in which is marginally biased provided that either the prior distribution or the noise distribution mismatches the one that LinTS uses. Fix and let be the vector of unobserved parameters. At time , we reveal the following action sets to the policy:

For , LinTS has only one choice and thus . Assume that is revealed to the algorithm where . At time for the first time, LinTS has two choices. Let be such that . Then, is given to the algorithm where . The following lemma asserts that is marginally biased. Let . For any , we have

(1)

where and are two independent standard normal random variables. Furthermore,

satisfies

(2)

Stacking biased settings.

We prove that, by combining independent copies of the above example, LinTS can choose an incorrect action for at least rounds. Let be a positive integer and define . In the first rounds, we follow the action sets of the previous section for each pair, for . Namely, define

(3)

where

The following key lemma states that with constant probability

is the optimal action while LinTS perceives it as suboptimal with an enormous gap. Letting , we have

We denote the above event by . Conditional on this event, for all , the optimal arm is and the regret incurred by choosing is at least . Moreover, let be the probability of choosing at . As we will see, this probability is exponentially small as a function of , and whenever is not chosen, the probability of selecting it in the next round remains unchanged. This observation holds true up to the first time that is picked, which can, in turn, take an exponentially long time. By making this argument rigorous, we can state the following proposition: For fixed , we have

3.2 Mean shift and fixed action sets

In this subsection, we construct an example in which LinTS incurs linear Bayesian regret while the action set is fixed over time. This example, nonetheless, might be less appealing than the one in the previous subsection, as we shift the mean of the prior distribution. Let be fixed and, for , set the prior distribution to be . We now reveal the action set to LinTS for all , where

(4)

The next proposition highlights the key observations about why LinTS fails in this simple setting: For fixed and for sufficiently large , we have

  1. with probability at least ,

  2. with probability ,

  3. Conditional on , , with probability at least ,

  4. Conditional on , with probability at most ,

  5. For , .

One can slightly modify the proof to obtain a similar result for

It is easy to see that for any arbitrary constant , the same rate as in Equation 13 is achievable. Also, for where , one can still get non-trivial results.

4 Improving LinTS

The aim of this section is to introduce a novel approach for improving the inflation parameter in LinTS under additional assumptions. Before stating our results, we discuss the insights that lead to these assumptions.

4.1 Insights into LinTS’s optimism mechanism

This subsection is dedicated to the intuition behind the optimism mechanism of LinTS. We assume that θ̂ is the ridge estimator of the parameter at some time t and that C_t is a confidence set containing both the true parameter θ* and the sampled parameter θ̃ with high probability. We reveal the action set X_t to the policy, where x* ∈ X_t is the optimal arm, i.e., x* maximizes ⟨x, θ*⟩ over X_t. LinTS chooses x* only if

⟨x*, θ̃⟩ ≥ max_{x ∈ X_t} ⟨x, θ̃⟩.

The left-hand side of this inequality can be decomposed as

⟨x*, θ̃⟩ = ⟨x*, θ*⟩ + ⟨x*, θ̂ − θ*⟩ + ⟨x*, θ̃ − θ̂⟩.

This implies that a sufficient condition for the sampled parameter to be optimistic at x*, that is, for ⟨x*, θ̃⟩ ≥ ⟨x*, θ*⟩ to hold, is

⟨x*, θ̃ − θ̂⟩ ≥ ⟨x*, θ* − θ̂⟩.

(5)

This inequality requires the sampled perturbation θ̃ − θ̂ to compensate for the underestimation of the reward caused by the estimation error vector θ* − θ̂. These vectors are illustrated in Figure 1.

(a) Actual confidence set
(b) Translated confidence set
Figure 1: An illustration of a typical setting for , , , and .

OFUL explicitly seeks the parameter that maximizes the left-hand side of Equation 5, and since the true parameter lies in the confidence set with high probability, the desired "compensation inequality" holds and an optimistic action is selected. Thompson sampling, on the other hand, follows a stochastic approach and resorts to a randomly sampled point in the confidence set to satisfy Equation 5. Recall that θ̂ is the ridge estimator based on the data collected thus far. In a fixed-design setting (which is not the case in our bandit problem), the error vector would point in a random direction. Therefore, provided that x* is independent of the error vector, we have

(6)

The same expression also holds for θ̃ − θ̂; therefore, the compensation inequality holds with constant probability (the numerical sketch after the list below illustrates this). To summarize our observation, Equation 6 holds true if the following two conditions are met:

  1. The error vector is distributed in a random direction.

  2. The optimal action is independent of the error vector.
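The following Monte Carlo sketch illustrates the point numerically under an assumed Gaussian model (it is an illustration, not the paper's argument): when both the estimation error and the sampled perturbation point in random directions independent of x*, the compensation inequality ⟨x*, θ̃ − θ̂⟩ ≥ ⟨x*, θ* − θ̂⟩ holds with probability about 1/2, without any inflation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
# An arbitrary fixed optimal action and an arbitrary ill-conditioned design matrix V.
x_star = rng.standard_normal(d)
A = rng.standard_normal((d, d))
V_inv = np.linalg.inv(A @ A.T + np.eye(d))

n = 100_000
# Error theta* - theta_hat and perturbation theta_tilde - theta_hat, both drawn
# from N(0, V^{-1}) and independent of x_star (the "random direction" condition).
err = rng.multivariate_normal(np.zeros(d), V_inv, size=n)
pert = rng.multivariate_normal(np.zeros(d), V_inv, size=n)

compensated = (pert @ x_star) >= (err @ x_star)
print("P(compensation inequality) approx", compensated.mean())  # close to 0.5
```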

The crucial point in the analysis of LinTS in the Bayesian setting is that, whenever LinTS has access to the true prior and noise distributions, the first condition above holds. In Section 3, nonetheless, we have shown that this condition is violated if LinTS uses an incorrect prior or noise distribution in computing the posterior. Agrawal and Goyal (2013b) and Abeille et al. (2017) take a conservative approach and propose to inflate the variance of the posterior distribution by a factor of √(d) to ensure that the compensation inequality holds with constant probability. We now present an alternative approach that leverages the randomness of the optimal action to reduce the need for exploration. The following assumption requires the optimal arm (rather than the error vector) to be distributed in a random direction. Assume that for any with , we have

for some fixed

. Unfortunately, this condition alone does not suffice to reduce the inflation rate of the posterior distribution. To see this, consider a case in which the largest eigenvalue of

is much larger than the other ones; thereby, . Figure 2 illustrates this situation. In this case, we have

Figure 2: An illustration of a thin confidence set.

However, it follows from the definition of LinTS that . Assuming that , we realize that . This suggests that is proportional to . Now, we can see that the assumption above is not sufficient for ensuring Equation 5, as we have

This observation implies the necessity of an inflation rate of order √(d) when the eigenvalues of differ significantly in magnitude. To make this notion precise, we define the thinness coefficient corresponding to to be

We also make the following assumption. For , we have

for any positive definite with . With this assumption, we are now ready to state our formal results.
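The displayed definition of the thinness coefficient did not survive extraction here, so the sketch below uses one natural stand-in purely for illustration: the ratio of the largest eigenvalue of V_t⁻¹ to its average eigenvalue, which equals 1 for a round confidence ellipsoid and approaches d for a very elongated one. This is an assumed proxy, not the paper's exact formula.

```python
import numpy as np

def thinness(V):
    """Illustrative thinness proxy: largest eigenvalue of V^{-1} over its mean eigenvalue.

    Equals 1 when the confidence ellipsoid defined by V is a ball, and approaches
    d (the dimension) when one axis of the ellipsoid dominates all the others.
    """
    eig = np.linalg.eigvalsh(np.linalg.inv(V))
    return float(eig.max() / eig.mean())

round_V = np.eye(5)                                   # spherical confidence set -> thinness 1
thin_V = np.diag([1.0, 100.0, 100.0, 100.0, 100.0])   # one long axis in V^{-1}  -> thinness ~ 4.8
print(thinness(round_V), thinness(thin_V))
```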

4.2 Formal results

At time , we say that the problem is well-posed if , and the following inequalities are satisfied:

(7)

We denote the indicator function for this event by . The next lemma, loosely speaking, asserts that LinTS is optimistic with constant probability. [Optimism of LinTS] Set and . Whenever , we have

(8)

Using Theorem 2 in Hamidi and Bayati (2020), we can prove the following result: under the two assumptions of Section 4.1, provided that holds with probability at most , we have

It is worth mentioning that one can also reduce the radius of the confidence ball in OFUL under these assumptions. More precisely, one can replace the radius of the confidence ball with while maintaining the same regret bound. Although this does not improve the regret bound, it may improve the empirical performance, as it avoids unnecessary exploration. The main caveat of this result is the assumption that holds with high probability, since this is not a mere property of the action sets; indeed, it also depends on the policy through the actions that it chooses. We fix this problem by setting the inflation rate to whenever . This way, we obtain the following result: if , we have

We will see in Section 5 that, in our simulations, is indeed large only for a short initial period of time.

5 Simulations

5.1 Average failure time of LinTS

We validate the examples in Section 3 through the following simulations:

Noise reduction example.

In this simulation, for each , we generate and run LinTS for rounds using the action sets in Equation 3. The reward for choosing an action is simply given by ; therefore, no noise is added to the reward. We then compute the probability that . We repeat this procedure 50 times to get and take the maximum . Figure 3(a) displays against .

Fixed action set example.

For given and , we draw . We then reveal the action set as defined in Equation 4 and, conditional on , compute the probability that the next arm is either or . We repeat this process 50 times to get and, as before, define to be their maximum. Figure 3(b) shows for when varies between 1 and 120000. Figure 3(c), on the other hand, illustrates for with varying between 0 and 1.

(a) Ex. 1.
(b) Ex. 2: , varying .
(c) Ex. 2: , varying .
Figure 3: Logarithm of in the noise reduction and fixed action set examples.

5.2 Thinness over time

Here we investigate how the thinness varies over time. We consider a setting similar to the one described in the simulations section of Russo and Van Roy (2014). For , we generate . At each time , we generate i.i.d. random vectors from . Each of the following policies then chooses one of these actions:

  1. TS-1: LinTS with no inflation ().

  2. TS-2: LinTS with .

  3. TS-3: LinTS with whenever and otherwise where

  4. TS-4: LinTS with .

Figure 4: Thinness values over time.
Figure 5: Cumulative regret

Each policy chooses for and receives feedback where are i.i.d. standard Gaussian random variables. Next, we compute the thinness parameter for . We repeat this procedure 50 times. Figure 4 displays the thinness of these policies in our experiments; in particular, it shows that the thinness stays close to 1 for larger values of . Figure 5 shows the cumulative regret of these policies. Notice that, while TS-3 may inflate the posterior variance by , its performance is in fact much closer to that of TS-1 and TS-2.
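A compact Python sketch of this experiment follows. It reuses the LinTS class and the thinness() proxy from the earlier sketches; the dimension, number of candidate actions, horizon, constant inflation levels, and the adaptive rule used for TS-3 are assumed values, since the paper's exact settings did not survive extraction.

```python
import numpy as np

# Assumes the LinTS class and the thinness() helper from the sketches above are in scope.
rng = np.random.default_rng(2)
d, K, T = 30, 50, 2000                      # assumed dimension, actions per round, horizon

theta_star = rng.standard_normal(d)
action_sets = [rng.standard_normal((K, d)) for _ in range(T)]

def run_and_track(inflation_rule):
    """Run LinTS with a (possibly adaptive) variance-inflation rule.

    Returns the cumulative-regret path and the thinness path over the T rounds.
    """
    pol = LinTS(d, inflation=1.0)
    cum, regrets, thin = 0.0, [], []
    for actions in action_sets:
        pol.inflation = inflation_rule(pol.V)            # inflation may depend on V_t
        means = actions @ theta_star
        i = pol.act(actions)
        cum += means.max() - means[i]
        pol.update(actions[i], means[i] + rng.standard_normal())
        regrets.append(cum)
        thin.append(thinness(pol.V))
    return np.array(regrets), np.array(thin)

policies = {
    "TS-1": lambda V: 1.0,                                            # no inflation
    "TS-2": lambda V: 2.0,                                            # assumed constant inflation
    "TS-3": lambda V: float(np.sqrt(d)) if thinness(V) > 2.0 else 1.0,  # assumed adaptive rule
    "TS-4": lambda V: float(np.sqrt(d)),                              # conservative sqrt(d) inflation
}
results = {name: run_and_track(rule) for name, rule in policies.items()}
```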

Appendix A Proofs of Section 3

Proof of the first lemma in Section 3.1.

It follows from the definition of that

Next, at , the -th entry is updated according to

Moreover, the other entry remains unchanged, in other words

Therefore, setting , we have

and in particular

(9)

We can now compute this expression in terms of the selection bias coefficient given by

where and are two independent standard normal random variables. Our main tool in this calculation is the lemma in Appendix C. Recall that

By definition, . Therefore, we have

On the other hand, it follows from the symmetry that

Using the lemma in Appendix C for the sequence

we infer that

Consequently, we can write

Similarly, we can conclude that

(10)

Combining Equation 9 with Appendix A and Equation 10, we obtain

This equality implies that is marginally biased whenever . Finally, Appendix A gives

Noting that

and similarly for , we get that

Therefore, we have

This implies that the m.g.f. of satisfies

Proof of the key lemma in Section 3.1.

It follows from Equation 1 that

(11)

where . Assuming , we observe . Moreover, Equation 2 implies that

which means

Using this inequality in combination with Equation 11, we obtain the following concentration inequality

where .

Next, note that , and thus, we have

For sufficiently large values of , we have , and hence

Proof of the proposition in Section 3.1.

For , let be given by

We now have the following lower bound for the regret of Algorithm 1:

Define . We get that

Furthermore, it follows from the definition of that

By combining the above, we have that

It immediately follows that

which demonstrates that the regret of Thompson sampling grows linearly up to time . ∎

Proof of the proposition in Section 3.2.

Notice that

Therefore, and simultaneously with probability at least . This implies that is the optimal arm with high probability. For sufficiently large , this probability exceeds .

On the other hand, at , LinTS (Algorithm 1) will choose with probability . This holds true as is chosen if and only if

The claim follows from the fact that these two random variables are centered and independent normal random variables. In this case, we have