Thompson sampling is one of the most popular algorithms for sequential decision making under uncertainty. First proposed by Thompson (1933), it has been rediscovered several times over the subsequent decades and was eventually popularized in the machine learning literature by Chapelle and Li (2011) and Scott (2010), who pointed out its excellent empirical performance for solving contextual bandit problems. These empirical studies were followed by a sequence of breakthroughs on the front of theoretical analysis, spearheaded by the works of Agrawal and Goyal (2012, 2013a, 2013b), Kaufmann et al. (2012), and Russo and Van Roy (2014, 2016). Thanks to these successes, Thompson sampling has become one of the gold-standard methods for solving multi-armed bandit problems. Indeed, in the last decade, several Thompson-sampling-style methods have been developed and analyzed for a variety of problem settings.
The variety of different analysis techniques applied to Thompson sampling is perhaps even larger than the variety of problem settings that it has been applied to. The first key tools for analyzing the Bayesian regret of Thompson sampling for multi-armed bandits were developed by Russo and Van Roy (2014), and our analysis naturally borrows several of these tools. The worst-case results developed in that work were refined by Russo and Van Roy (2016), who proved for the first time “information-theoretic” bounds on the regret of TS that scale with the Shannon entropy of the optimal action under the prior on the model parameters. This result has inspired a range of follow-up works, including extensions to uncountable action sets (Dong and Van Roy, 2018), approximate implementations (Lu and Van Roy, 2017; Qin et al., 2022), and even new algorithms based on the analysis technique itself (Russo and Van Roy, 2018; Kirschner and Krause, 2018). One limitation that this technique has not overcome so far is its inability to deal with context in a satisfying way.
When considering i.i.d. contexts and finite policy classes, one can apply the theory of Russo and Van Roy (2016) and treat policies as actions to obtain regret bounds scaling with the entropy of the optimal policy, but this can lead to a polynomial dependence on the number of contexts. Another variant of Thompson sampling demonstrating a similar prior-dependent regret bound has been proposed by Li (2013), whose regret guarantees also suffer from a suboptimal dependence on the number of rounds $T$. A much more satisfying solution has been recently given by Zhang (2021), whose “feel-good Thompson sampling” method guarantees both frequentist and Bayesian regret bounds of the order $\sqrt{KT\log N}$, where $K$ is the number of actions, $T$ the number of rounds, and $N$ the support size of the prior distribution on model parameters $\theta$. Under a Lipschitzness assumption on the prior and the likelihood, Zhang proves a frequentist regret bound scaling with the log-prior-probability mass assigned to the true parameter. The techniques involved in proving these results drew substantial inspiration from the works of Foster and Rakhlin (2020) and Foster and Krishnamurthy (2021), and in fact results of a similar flavor were also proved recently by Foster et al. (2021).
Our own approach can be seen as a reconciliation of the analysis style of Zhang (2021) with the information-theoretic methodology of Russo and Van Roy (2016). Our main conceptual contribution is proposing an adjustment of the now-classic notion of “information ratio” proposed by Russo and Van Roy (2016) that applies to contextual bandits. In its original definition, the information ratio quantifies the tradeoff between incurring low regret and gaining information about the optimal action. As we will argue, this notion of information gain is inappropriate for contextual bandits. We propose a variant that measures the amount of information gained about the true model parameter instead of the optimal action (which may be context dependent). The complexity notion resulting from this extension is called the “lifted information ratio”. Our analysis shows that the Bayesian regret of Thompson sampling can be bounded in terms of the lifted information ratio and the Shannon entropy of the hidden parameter , which mirrors the result of Russo and Van Roy (2016). Along the way, we draw inspiration from the recently proposed analysis technique of Zhang (2021) for contextual bandits, and in fact we show that our notion of lifted information ratio bridges the concept of “decoupling coefficient” proposed by Zhang with the information ratio of Russo and Van Roy.
We state our main results in the context of $K$-armed contextual bandits with binary losses. For countable parameter spaces, we prove that the Bayesian regret of Thompson sampling satisfies a bound of order $\sqrt{KT\,H(\theta^*)}$, where $H$ denotes the Shannon entropy. This result is comparable to the bound of Foster and Krishnamurthy (2021) for the FastCB algorithm, which is of the order and holds in a frequentist sense. This is the best result we are aware of for this setting. To demonstrate the flexibility of our technique, we provide an extension to logistic bandits with Lipschitz continuous logits, generalizing the well-studied setting involving logits that are linear functions of the context and the parameter . For this setting, we prove a regret bound of order , where is the -covering number of under norm and is the Lipschitz constant of the logits. This implies a regret bound of order $\widetilde{O}(\sqrt{dKT})$ in the well-studied case of linear logits. Notably, the bound does not show any dependence on the smallest slope of the sigmoid link function that almost all existing results for this setting suffer from (Filippi et al., 2010; Kveton et al., 2020; Abeille et al., 2021; Faury et al., 2022). Indeed, this constant has plagued all regret bounds since the early work of Filippi et al. (2010) and was only recently moved to lower-order terms by the breakthrough work of Abeille et al. (2021). Bounds involving other potentially large problem-dependent constants have also been proved in the Bayesian setting by Dong and Van Roy (2018) and Dong et al. (2019). To our knowledge, our bounds are the first to entirely remove this factor.
Footnote 1: Despite our best efforts, we could not verify how the bounds of Zhang (2021) scale with problem-dependent factors in this setting, due to the heavy use of asymptotic notation in their proofs.
The rest of the paper is organized as follows. After introducing the necessary technical background in Section 2, we discuss matters of information gains, information ratios, and decoupling coefficients in Section 3. We state our main results and instantiate them in a variety of settings in Section 4. We provide the key ideas of the analysis in Section 5 and conclude in Section 6.
For a natural number $n$, $[n]$ denotes the set of the first $n$ natural numbers. For $x, y \in \mathbb{R}^d$, $\langle x, y \rangle$ denotes the canonical scalar product of $x$ and $y$, and $\|x\|$ the Euclidean norm of $x$. We denote the Shannon entropy of a discrete random variable $X$ with probability mass function $p$ as $H(X) = -\sum_x p(x) \log p(x)$. We use $0_d$ for a $d$-dimensional vector of zeros and $I_d$ for the $d \times d$ identity matrix.
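As a quick illustration of the entropy notation, the following minimal Python sketch computes the Shannon entropy of a probability mass function and compares it with the logarithm of the support size. The function name `shannon_entropy` and the example distributions are ours, and we assume natural logarithms (nats).

```python
import math


def shannon_entropy(pmf):
    """Shannon entropy (in nats) of a discrete distribution given as a list of probabilities."""
    # Zero-probability outcomes contribute nothing (convention 0 * log 0 = 0).
    return -sum(p * math.log(p) for p in pmf if p > 0.0)


# Illustrative distributions on a support of size 4 (hypothetical values).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = shannon_entropy(uniform)  # equals log(4) for the uniform distribution
h_skewed = shannon_entropy(skewed)    # strictly smaller than log(4)
```

The uniform distribution attains the largest entropy on a finite support, which is the elementary fact behind bounding the entropy of a finitely supported prior by the logarithm of its support size.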
We consider a parametric class of contextual bandits with parameter space , context space , and actions. To each parameter there corresponds a contextual bandit with loss distribution for each context and action , with the mean of the loss distribution denoted by . We will dedicate special attention to the case where the losses are binary, so that each loss distribution is a Bernoulli distribution whose parameter is the corresponding mean loss. For the main part of our theoretical analysis, we will assume is either a finite set or a bounded metric space.
We study the problem of regret minimization in the Bayesian setting. In this setting, the environment secretly samples a parameter from a known prior distribution over . We assume that the agent has full knowledge of the prior and the likelihood model . The agent interacts with the environment for rounds as follows. At each round , an adaptive adversary selects a context , possibly using randomization and taking into account the previous history of actions and losses, but not . After observing the context , the agent selects an action (possibly using randomization) and incurs a binary loss . The goal of the agent is to minimize the expected sum of losses. In the Bayesian setting, this is equivalent to minimizing the Bayesian regret, defined as follows:
where is the optimal action for round , and the expectation in (1) is over all sources of randomness: the initial sampling of from , the agent’s randomization over actions, and the randomness of the loss realizations.
Furthermore, let be the sigma-algebra representing the history of contexts, actions and losses observed by the agent up to and including time . We use to denote the distribution of the unknown parameter conditional on the past history , and simply call it the posterior distribution. We denote by the distribution over the agent’s actions conditional on and , and call it the agent’s policy. Finally, we will frequently use the shorthand notations and .
This paper is dedicated to the study of the celebrated Thompson Sampling (TS) algorithm, defined as follows. At each round , TS draws a parameter from the posterior distribution . Then, it selects the action that maximizes . Finally, it updates via Bayes’ rule, obtaining the new posterior. The algorithm can be equivalently defined as a method that plays actions according to their posterior probability of being optimal. The pseudocode is shown as Algorithm 1.
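The three steps described above (sample from the posterior, act greedily for the sample, update by Bayes’ rule) can be sketched in Python for a finite parameter set and binary losses. This is a minimal illustration under our own assumptions, not the paper's Algorithm 1: the table `mean_loss`, the function name `thompson_sampling`, and the i.i.d. contexts in the usage example are hypothetical, and since the feedback is a loss, the sketch picks the action with the smallest mean loss under the sampled parameter.

```python
import random


def thompson_sampling(mean_loss, prior, contexts, true_theta, rng):
    """Thompson sampling over a finite parameter set with Bernoulli losses.

    mean_loss[theta][x][a] is the probability that the loss equals 1.
    Returns the played actions, the observed losses, and the final posterior.
    """
    n_params = len(prior)
    posterior = list(prior)
    actions, losses = [], []
    for x in contexts:
        # 1. Sample a parameter index from the current posterior.
        theta_t = rng.choices(range(n_params), weights=posterior)[0]
        # 2. Play the action that looks best (smallest mean loss) for the sample.
        row = mean_loss[theta_t][x]
        a = min(range(len(row)), key=row.__getitem__)
        # 3. Observe a binary loss generated by the true parameter.
        loss = 1 if rng.random() < mean_loss[true_theta][x][a] else 0
        # 4. Bayes' rule: reweight each parameter by the likelihood of the loss.
        weights = [
            q * (mean_loss[th][x][a] if loss == 1 else 1.0 - mean_loss[th][x][a])
            for th, q in enumerate(posterior)
        ]
        total = sum(weights)
        posterior = [w / total for w in weights]
        actions.append(a)
        losses.append(loss)
    return actions, losses, posterior


# Hypothetical instance: 3 parameters, 2 contexts, 2 actions.
mean_loss = [
    [[0.2, 0.8], [0.6, 0.3]],
    [[0.7, 0.3], [0.4, 0.5]],
    [[0.5, 0.4], [0.1, 0.9]],
]
prior = [1 / 3, 1 / 3, 1 / 3]
rng = random.Random(0)
contexts = [rng.randrange(2) for _ in range(200)]  # i.i.d. contexts, for simplicity only
actions, losses, posterior = thompson_sampling(mean_loss, prior, contexts, 0, rng)
```

Because the sampled parameter is drawn from the posterior and the action is a deterministic function of the sample and the context, each action is played with exactly its posterior probability of being optimal.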
3 Regret, information ratio, and decoupling coefficient
The classic results of Russo and Van Roy (2016) establish that the regret of Thompson sampling in non-contextual multi-armed bandit problems can be upper bounded in terms of a quantity called the information ratio. Informally, the information ratio measures the tradeoff between achieving low regret and gaining information about the identity of the optimal action (which is a deterministic function of in the standard multi-armed bandit setting). The formal definition is given by
where denotes the mutual information between and the action-observation pair , conditioned on the history . Intuitively, having small information ratio implies that every time Thompson sampling suffers large regret, it has to gain a lot of information about the optimal action, which suggests that it should be possible to bound the total regret by the total amount of information that there is to be gained. The result of Russo and Van Roy (2016) confirms this intuition by showing that the regret of Thompson sampling is of the order , where and is the Shannon entropy. The information ratio itself can always be upper bounded by $K/2$, but better bounds can be shown when the loss function has favorable structural properties (e.g., whenever the reward function is a $d$-dimensional linear function, the information ratio is at most $d/2$).
While this result and the underlying information-theoretic framework are very elegant, they are inappropriate for studying contextual bandit problems. The specific challenge is that the optimal action changes from round to round, and gaining a large amount of information about for any given round may not necessarily be useful for predicting future actions. To see this, consider a stylized example with action set , where there exists an action whose loss entirely reveals the identity of the optimal action for context : . Taking this action provides maximal information gain about , but results in large regret and reveals nothing about the future losses. Thus, in the contextual setting, one can keep following a policy that provides low information ratio while suffering linear regret. This issue necessitates an alternative definition that still permits an effective information-theoretic analysis.
Our proposition is to consider a relaxed definition of the information ratio based on the mutual information between the true parameter and the observed loss. In particular, we define
where is the mutual information between and , conditioned on the history and the context-action pair . This quantity measures the information that the agent gains about . High values of intuitively allow making better predictions about the future loss realizations for all possible context sequences. Since is a deterministic function of given , the data processing inequality implies that the information gain about is always smaller than that about , which in turn implies that is greater than what one would obtain by directly generalizing the definition of Russo and Van Roy (2016). As this notion of information gain measures the efficiency of inferring the identity of a hidden parameter, we refer to as the lifted information gain, and as defined in Equation (2) as the lifted information ratio. As our analysis will establish, a bounded lifted information ratio guarantees low regret, and we will show that the ratio itself can be bounded reasonably under conditions similar to the ones required by the analysis of Russo and Van Roy (2016).
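For a finite parameter set and a fixed context, the lifted information gain and the lifted information ratio of Thompson sampling can be evaluated exactly. The sketch below is a minimal illustration under our own assumptions: `mean_loss_x[theta][a]` is a hypothetical table of mean losses for the current context, losses are Bernoulli, entropies are in nats, and the mutual information between the parameter and the loss is computed as the entropy of the loss under the posterior mean minus the posterior-average entropy of the loss.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, with 0 * log 0 = 0."""
    return -sum(v * math.log(v) for v in (p, 1.0 - p) if v > 0.0)


def lifted_information_ratio(mean_loss_x, posterior):
    """Return (expected regret, lifted information gain, ratio) for one round of TS."""
    n_actions = len(mean_loss_x[0])
    # Optimal (loss-minimizing) action for each candidate parameter.
    best = [min(range(n_actions), key=row.__getitem__) for row in mean_loss_x]
    # TS policy: posterior probability that each action is optimal.
    policy = [0.0] * n_actions
    for th, q in enumerate(posterior):
        policy[best[th]] += q
    # Posterior mean loss of each action.
    mean = [sum(q * row[a] for q, row in zip(posterior, mean_loss_x))
            for a in range(n_actions)]
    # Expected regret: the TS action is independent of the true parameter given the history.
    played = sum(policy[a] * mean[a] for a in range(n_actions))
    optimal = sum(q * mean_loss_x[th][best[th]] for th, q in enumerate(posterior))
    regret = played - optimal
    # Mutual information between parameter and loss, averaged over the TS action.
    gain = sum(
        policy[a] * (binary_entropy(mean[a])
                     - sum(q * binary_entropy(row[a])
                           for q, row in zip(posterior, mean_loss_x)))
        for a in range(n_actions)
    )
    ratio = regret ** 2 / gain if gain > 1e-15 else 0.0
    return regret, gain, ratio


# Hypothetical instance: 3 parameters, 2 actions, one fixed context.
mean_loss_x = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.4]]
posterior = [0.5, 0.3, 0.2]
regret, gain, ratio = lifted_information_ratio(mean_loss_x, posterior)
```

On this instance the expected regret is 0.22 and the ratio stays well below the number of actions; comparing it against `2 * K` in a test is only a loose sanity check chosen for this sketch.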
Our lifted information ratio is also closely related to a quantity appearing in the analysis of Zhang (2021), called the “decoupling coefficient”. Adapted to our Bayesian setting, this coefficient can be defined as the smallest constant such that the following inequality holds:
where the first line gives the original definition mirroring that of Zhang (2021) and the second line plugs in the choice of achieving the infimum. Reordering gives the value of the optimal :
which matches our definition of the lifted information ratio, up to the difference of replacing the mutual information by the root mean-squared error in predicting the true parameter . Notably, this definition essentially coincides with the lifted information ratio for the special case of Gaussian losses.
4 Main results
In this section, we state our main results concerning the Bayesian regret of Thompson sampling for contextual bandits. We will assume that the losses are binary and the action space is finite, unless otherwise stated. However, several of our results can be generalized beyond this setting. We will illustrate this in Section 4.3, where we provide some additional results for the classic setting of Gaussian linear contextual bandits.
We begin by stating two general regret bounds in terms of the lifted information ratio defined in Equation (2). Readers who are not interested in the full generality of our theory may skip to Section 4.2 for concrete regret bounds. Our first abstract bound applies to priors with finite entropy, the simplest example being finite parameter spaces.
Theorem 1. Assume is supported on the countable set and that the lifted information ratio for all rounds satisfies for some . Then, the Bayesian regret of TS after rounds can be bounded as
In particular if is a finite set with , the regret of TS satisfies
The proof of this theorem is given in Section 5.1. Unfortunately, the Shannon entropy can be unbounded for distributions with infinite support, which is in fact the typical situation that one encounters in practice. To address this concern, we develop a more general result that holds for a broader family of distributions. In the following, is a metric space with metric . We make the following regularity assumption on the likelihood function :
Assumption 1. There exists a constant such that for any , holds for all , , and .
Under this assumption, we can state a variant of Theorem 1 that applies to metric parameter spaces:
Theorem 2. Assume is a metric space, and is supported on with -covering number . Let Assumption 1 hold, and assume the lifted information ratio for all rounds satisfies for some . Then, the Bayesian regret of TS after rounds can be bounded as
The proof is based on a covering argument on top of the proof of the previous theorem, and is provided in Appendix A.2. To get a better understanding of Assumption 1, it is useful to notice that it is satisfied in basic settings, like logistic bandits with Lipschitz logits. See Section 4.2 for details.
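To make the notion of a covering concrete, the following Python sketch builds a cover of a finite point cloud greedily. It is only an illustration under our own assumptions: the function name `greedy_cover`, the Euclidean metric, and the grid of points are hypothetical, and the greedy construction yields an upper bound on the covering number of the point cloud (with centers restricted to the cloud) rather than the exact covering number of a parameter space.

```python
import math


def greedy_cover(points, eps):
    """Select centers so that every point lies within Euclidean distance eps of a center."""
    centers = []
    for p in points:
        # Add p as a new center only if no existing center already covers it.
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


# Hypothetical parameter set: a grid on the unit square.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
centers = greedy_cover(grid, eps=0.25)
log_cover_size = math.log(len(centers))  # plays the role of the log-covering number
```

Every point that is not selected as a center was rejected precisely because some earlier center lies within distance `eps` of it, so the returned set is a valid cover.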
4.1 Bounding the lifted information ratio
At this point, some readers may worry that the lifted information ratio may be impossible to bound due to the lifting to the space of parameters . To address this concern, we now turn to showing bounds on the lifted information ratio. We first consider the unstructured case, with a bound that holds for arbitrary parameter spaces, likelihoods, and priors.
Lemma 1. Suppose that the losses are binary and . Then, the lifted information ratio of Thompson sampling satisfies for all .
The proof of this lemma (provided in Section 5.2) relies on a decoupling argument between the choice of the action and of the parameters at round , inspired by Zhang (2021). Taking his argument one step further, we center our analysis around an application of convex conjugacy, which we believe may be applicable in a broader variety of settings. We wish to highlight that this proof technique is very different from the information-theoretic methodology pioneered by Russo and Van Roy (2016).
Next, we consider the case of linear expected losses in Euclidean parameter spaces, which, in principle, allows for an unbounded number of actions.
Lemma 2. Suppose that , the losses are binary, and the expected losses are linear functions of the form , where is a feature map, such that for all . Then, the lifted information ratio satisfies .
Notably, both of these results match the classic bounds of Russo and Van Roy (2016) on the standard definition of the information ratio for these settings (cf. their Propositions 5 and 3), implying that lifting to the space of parameters does not substantially impact the regret-information tradeoff.
4.2 Concrete regret bounds for Bernoulli bandits
We now instantiate our bounds in two well-studied settings for Bernoulli bandits. We start from the fully unstructured case, assuming finitely many actions and a finitely supported prior. The following regret bound follows directly from Theorem 1 and Lemma 1.
Consider a contextual bandit with actions and binary losses, and suppose , the support of , is finite with . Then, the Bayesian regret of TS satisfies:
This result is comparable to the best known regret guarantees for this problem due to Foster and Krishnamurthy (2021) and Zhang (2021), and matches the minimax rate for unstructured contextual bandits with a policy class of size (Beygelzimer et al., 2011; Dudík et al., 2011). However, we are not aware of a comparable result for the Thompson sampling algorithm in the literature, be it Bayesian or not.
Moving to Bernoulli bandits with structure, we consider a well-studied setting known as logistic bandits. In this model, the losses are generated by a Bernoulli distribution as , where is the sigmoid function. We only assume that , called the logit function, is -Lipschitz in , which directly implies that Assumption 1 holds. Notice that our definition generalizes the commonly used notion of logistic bandits that considers linear logit functions of the form , where is some feature map. Our result for logistic bandits is based on Theorem 2 and Lemma 1.
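The logistic model with linear logits can be sketched in a few lines of Python. The names `sigmoid`, `linear_logit` and `log_likelihood`, as well as the numerical values, are hypothetical. The sketch also assumes, as our reading of the regularity condition, that what matters is the Lipschitzness of the log-likelihood in the parameter: since the log-likelihood of a binary loss is 1-Lipschitz in the logit, a logit that is Lipschitz in the parameter yields a log-likelihood with the same Lipschitz constant.

```python
import math


def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_logit(theta, features):
    """Linear logit <theta, features>; `features` stands for a feature vector of (context, action)."""
    return sum(t * f for t, f in zip(theta, features))


def log_likelihood(loss, logit):
    """Log-probability of a binary loss when P(loss = 1) = sigmoid(logit)."""
    # log sigmoid(z) = -log(1 + exp(-z)) and log(1 - sigmoid(z)) = -log(1 + exp(z)).
    return -math.log1p(math.exp(-logit if loss == 1 else logit))


# Two hypothetical parameters and a feature vector of unit Euclidean norm.
theta_a, theta_b = [0.5, -1.0], [-0.3, 0.8]
features = [0.6, 0.8]
z_a, z_b = linear_logit(theta_a, features), linear_logit(theta_b, features)
param_dist = math.dist(theta_a, theta_b)
ll_gap = abs(log_likelihood(1, z_a) - log_likelihood(1, z_b))
mean_gap = abs(sigmoid(z_a) - sigmoid(z_b))
```

Here `ll_gap` is at most `param_dist` because the features have norm at most one, and `mean_gap` is at most a quarter of the logit gap because the derivative of the sigmoid never exceeds 1/4.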
Assume and for all . Consider a class of logistic bandits with actions and -Lipschitz logit function . Then, the Bayesian regret of TS after rounds can be bounded as:
We can further specialize our result to linear logits, the setting that is most commonly studied in the literature:
Assume , for all . Consider a class of logistic bandits with actions and linear logit function , with for all and . Then, the Bayesian regret of TS after rounds can be bounded as:
The proof of the corollary is given in Appendix A.3. A remarkable feature of this bound is that it shows no dependence on the minimum derivative of the sigmoid link function, albeit at the price of a factor in the bound. Nevertheless, we believe this to be the first regret guarantee that entirely gets rid of this potentially enormous constant without very strong assumptions. Indeed, this constant has been present in nearly all previous bounds we are aware of (Filippi et al., 2010; Li et al., 2017; Faury et al., 2020; Abeille et al., 2021; Faury et al., 2022), although these results have the advantage of holding in a frequentist sense. In the Bayesian setting, the works of Dong and Van Roy (2018) and Dong et al. (2019) have proved a variety of bounds on the regret of Thompson sampling for non-contextual logistic bandits, but none of them are directly comparable with our result above. Dong et al. (2019) prove a regret bound of order for a highly specialized setting with , and a range of other bounds under a variety of strong assumptions.
An improved bound for the many-actions setting that scales at most logarithmically with remains an open problem. Its difficulty is attested by a set of negative examples provided by Dong et al. (2019), and by a long-standing conjecture of Dong and Van Roy (2018) regarding the information ratio for logistic bandits that, to our knowledge, has not yet been verified in theory.
4.3 Beyond binary losses: linear bandits with Gaussian noise
We now illustrate how our techniques (in particular the lifted information ratio) can be extended beyond the case of binary losses, and in particular consider the classic setting of Bayesian linear contextual bandits, where the loss is a linear function of a $d$-dimensional feature map with additive Gaussian noise, and the prior is also Gaussian.
For this setting, the bound has already been shown for the classic information ratio by Russo and Van Roy (2016). However, we believe that our bound is new for any definition of information ratio. By combining this bound on the lifted information ratio with standard arguments for linear contextual bandits, we recover both of the well-known regret bounds of order and for this setting, due respectively to Abbasi-Yadkori et al. (2011) and Chu et al. (2011). This result, although not surprising, indicates once again that our notion of lifted information ratio does not lead to compromises in performance, even when the losses are not binary.
Footnote 2: Even in the cases where the dependence on the number of actions is not present in the regret bound, our analysis is restricted to the setting with a finite number of actions. Still, using a standard discretization argument, it is possible to extend the analysis to infinite action spaces. See Corollary 2 in Appendix B for a rigorous statement and the proof.
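For the Gaussian linear model, both the posterior update and the information gained from a single observation have closed forms. The Python sketch below uses the standard Bayesian linear regression formulas and is not taken from the paper's proofs: the helper names, the noise variance `noise_var`, and the two-dimensional example are our own assumptions. The information gain uses the standard identity that the mutual information between a Gaussian parameter and a noisy linear observation equals half the logarithm of one plus the signal-to-noise ratio.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def gaussian_posterior_update(mean, cov, phi, loss, noise_var):
    """Posterior of theta after observing loss = <theta, phi> + N(0, noise_var) noise."""
    cov_phi = mat_vec(cov, phi)
    denom = noise_var + dot(phi, cov_phi)
    residual = loss - dot(mean, phi)
    new_mean = [m + c * residual / denom for m, c in zip(mean, cov_phi)]
    # Rank-one covariance update: cov - (cov phi)(cov phi)^T / denom.
    d = len(phi)
    new_cov = [[cov[i][j] - cov_phi[i] * cov_phi[j] / denom for j in range(d)]
               for i in range(d)]
    return new_mean, new_cov


def gaussian_information_gain(cov, phi, noise_var):
    """Mutual information (nats) between theta ~ N(mean, cov) and the noisy linear loss."""
    return 0.5 * math.log(1.0 + dot(phi, mat_vec(cov, phi)) / noise_var)


# Hypothetical example: standard normal prior in two dimensions.
mean0, cov0 = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
phi = [1.0, 0.0]
gain = gaussian_information_gain(cov0, phi, noise_var=1.0)
mean1, cov1 = gaussian_posterior_update(mean0, cov0, phi, loss=1.0, noise_var=1.0)
```

In this example the observation halves the posterior variance along the first coordinate and leaves the second coordinate untouched, and the information gain equals half the logarithm of the resulting determinant ratio.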
This section presents the key ideas of the proofs of our main results. We only provide the proofs of Theorem 1 and Lemma 1, which we believe offer the most insight into our techniques. All other proofs, including those of auxiliary lemmas, are deferred to Appendices A and B. For the sake of clarity, we focus on the relatively simple case where is countable, so that (with a slight abuse of notation) we can write to denote the posterior probability associated with . Note, however, that our full proofs also handle the case of general distributions (details in Appendix A.1).
5.1 The proof of Theorem 1
Recalling the definition of the lifted information ratio (Equation 2), we first notice that the regret can be rewritten as follows:
where the first step uses the tower rule of expectation, the second step the definition of , and the final step follows from the Cauchy–Schwarz inequality.
The key challenge is then to bound the sum of information-gain terms. The following lemma provides a more tractable form of this sum:
Lemma 4. Under the assumptions of Theorem 1,
The proof of this lemma is based on a classic “Bayesian telescoping” argument that we have learned from Grünwald (2012). We provide the proof of Lemma 4 in Appendix A.1. Supposing now that the prior has bounded entropy, we can easily bound the term appearing on the right-hand side as follows:
This concludes the proof of the first statement. The second statement follows from the first using the trivial bound on the Shannon entropy of any finite-support distribution.
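The telescoping idea mentioned above can be checked numerically for a finite parameter set. The sketch below is our own illustration under stated assumptions: `likelihoods[t][theta]` is a hypothetical table holding the probability that each parameter assigns to the loss observed in round `t`. It verifies that the product of the posterior predictive probabilities equals the prior-weighted marginal likelihood, which is why the cumulative log-likelihood ratio of any parameter against the predictive distribution is at most the negative log of its prior mass.

```python
import math


def telescoping_check(prior, likelihoods):
    """Return (sum of log predictive probabilities, log marginal likelihood, final posterior)."""
    posterior = list(prior)
    log_predictive = 0.0
    for lik in likelihoods:
        # Posterior predictive probability of the loss observed in this round.
        pred = sum(q * l for q, l in zip(posterior, lik))
        log_predictive += math.log(pred)
        # Bayes' rule for the next round.
        posterior = [q * l / pred for q, l in zip(posterior, lik)]
    # Marginal likelihood of the whole sequence under the prior.
    marginal = sum(q * math.prod(lik[th] for lik in likelihoods)
                   for th, q in enumerate(prior))
    return log_predictive, math.log(marginal), posterior


# Hypothetical example: two parameters, three observed losses.
prior = [0.5, 0.5]
likelihoods = [[0.9, 0.4], [0.2, 0.5], [0.7, 0.6]]
log_predictive, log_marginal, posterior = telescoping_check(prior, likelihoods)

# Cumulative log-likelihood ratio of each parameter against the predictive distribution.
log_ratios = [sum(math.log(lik[th]) for lik in likelihoods) - log_predictive
              for th in range(len(prior))]
```

Each entry of `log_ratios` equals the log of the final posterior mass divided by the prior mass of that parameter, so it cannot exceed the negative log-prior; averaging this bound over a parameter drawn from the prior gives the entropy of the prior.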
5.2 The proof of Lemma 1
We start by introducing some notation that will be useful for the proof. In particular, we use to denote the binary relative entropy function defined for all as
and we use the convention . Furthermore, we define the posterior mean loss as . These notations allow us to conveniently rewrite the information gain as
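The rewriting of the information gain in terms of the binary relative entropy can be verified numerically. In the Python sketch below, the function names and the example values are hypothetical, the reference argument of the relative entropy is assumed to lie strictly between 0 and 1, and logarithms are natural. The sketch compares the posterior-average relative entropy between each parameter's mean loss and the posterior mean loss with the mutual information computed from entropies.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, with 0 * log 0 = 0."""
    return -sum(v * math.log(v) for v in (p, 1.0 - p) if v > 0.0)


def binary_kl(p, q):
    """Binary relative entropy d(p || q) in nats, assuming 0 < q < 1 and 0 * log 0 = 0."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def info_gain_via_kl(posterior, mean_losses):
    """Posterior-average relative entropy to the posterior mean loss."""
    mean = sum(q * l for q, l in zip(posterior, mean_losses))
    return sum(q * binary_kl(l, mean) for q, l in zip(posterior, mean_losses))


def info_gain_via_entropy(posterior, mean_losses):
    """Mutual information between parameter and loss, as a difference of entropies."""
    mean = sum(q * l for q, l in zip(posterior, mean_losses))
    return binary_entropy(mean) - sum(q * binary_entropy(l)
                                      for q, l in zip(posterior, mean_losses))


# Hypothetical posterior and mean losses for one context-action pair.
posterior = [0.5, 0.3, 0.2]
mean_losses = [0.2, 0.7, 0.5]
gain_kl = info_gain_via_kl(posterior, mean_losses)
gain_entropy = info_gain_via_entropy(posterior, mean_losses)
```

The two quantities agree up to floating-point error, since the cross terms in the relative entropies sum to the entropy of the posterior mean loss.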
We will now prove a generalization of Lemma 1, which will directly imply the original result:
Under the assumptions of Lemma 1, for all , the lifted information ratio of Thompson sampling satisfies .
The proof is based on an application of the Fenchel–Young inequality, which requires the introduction of the Legendre–Fenchel conjugate of with respect to its first argument. This function is defined for all as
where the second equality and the inequality follow from a set of straightforward calculations deferred to Appendix A.4. Turning to the actual proof, we consider the instantaneous pseudo-regret in a fixed round and write the following (for any ):
Choosing the value of for which the latter expression is minimal, we obtain . The proof is completed by taking the square on both sides and rearranging. ∎
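The Fenchel–Young step can be illustrated numerically. The sketch below relies on a standard fact rather than on the paper's own derivation: the Legendre–Fenchel conjugate of the binary relative entropy in its first argument is the log-moment-generating function of a Bernoulli variable. The function names and test values are hypothetical, and the reference probability is assumed to lie strictly between 0 and 1.

```python
import math


def binary_kl(p, q):
    """Binary relative entropy d(p || q) in nats, assuming 0 < q < 1 and 0 * log 0 = 0."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def conjugate(u, q):
    """sup over p in [0, 1] of p * u - d(p || q), which equals log(1 - q + q * exp(u))."""
    return math.log(1.0 - q + q * math.exp(u))


def maximizer(u, q):
    """The value of p attaining the supremum in the definition of the conjugate."""
    return q * math.exp(u) / (1.0 - q + q * math.exp(u))


# Fenchel-Young inequality: p * u <= d(p || q) + conjugate(u, q) for every p and u.
q = 0.3
slack = [
    binary_kl(i / 100, q) + conjugate(u, q) - (i / 100) * u
    for u in (-2.0, -0.5, 0.0, 1.5)
    for i in range(101)
]
tight = [
    maximizer(u, q) * u - binary_kl(maximizer(u, q), q) - conjugate(u, q)
    for u in (-2.0, -0.5, 0.0, 1.5)
]
```

All entries of `slack` are nonnegative, and the inequality holds with equality at `maximizer(u, q)`, which is what makes optimizing over the free parameter in the proof worthwhile.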
We have presented a new theoretical framework for analyzing Thompson sampling in contextual bandits, based on a new concept we call lifted information ratio, which measures the tradeoff between achieving low regret and gaining information about the true parameter vector describing the problem instance. We have shown that this relaxation of the classic information ratio of Russo and Van Roy (2016) can effectively deal with contextual information, while making essentially no compromises in any of the classically well-studied bandit settings. We have also managed to show some new results that advance the state of the art in the well-studied problem of logistic bandits. We believe that these results are very encouraging and that our newly proposed formalism may find many more applications in the future.
Throughout the paper, we have studied several different settings, some of which come with a wide range of possible choices for the form of the prior and the likelihood. In some of these scenarios, updating the posterior and sampling from it may be computationally challenging. We have ignored this aspect in order to focus on the pure online-decision aspects, and implicitly assumed that posterior sampling can be performed without approximations. In practice, several heuristics have been proposed; see, for instance, Dumitrascu et al. (2018) on efficient TS for logistic bandits. It would be interesting to study how approximate sampling affects our regret guarantees, along the lines of Phan et al. (2019) and Mazumdar et al. (2020).
As always, we leave many more questions open than what we have closed. One major question regarding logistic bandits is whether it is possible to improve our new results by significantly toning down the dependence on the number of actions. In light of existing hardness results for nonlinear bandit problems (e.g., for generalized linear bandits with ReLU activation; Dong et al., 2021; Foster et al., 2021), we suspect that this may not be possible. As a more modest goal, we are curious to find out if the lifted information ratio can be upper bounded in terms of the smallest-slope parameter, as done in many other works on logistic bandits since Filippi et al. (2010). We conjecture that a bound on the lifted information ratio is indeed possible, but we were not able to prove it so far. This is the case for the eluder dimension (Russo and Van Roy, 2013), another complexity measure that has been used to upper-bound the regret for contextual bandits. The eluder dimension for linear losses is , but for nonlinear losses we know only of bounds for the generalized linear case.
More broadly, we believe that the most interesting immediate challenge is to extend our results to hold beyond the Bayesian setting. As a counterexample by Zhang (2021) shows, this may not be possible in general, but we wonder if his “feel-good” adjustment of Thompson sampling could be analyzed with the techniques we introduced in this paper. Finally, we believe that our core idea of defining information gain in terms of problem parameters could be particularly useful in the more general setting of reinforcement learning, where information gain about the optimal policy may suffer from even worse issues than it does in the contextual setting.
References
- Abbasi-Yadkori et al. (2011). Improved algorithms for linear stochastic bandits. In NeurIPS, pp. 2312–2320.
- Abeille et al. (2021). Instance-wise minimax-optimal algorithms for logistic bandits. In AISTATS, Proceedings of Machine Learning Research, Vol. 130, pp. 3691–3699.
- Agrawal and Goyal (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, JMLR Proceedings, Vol. 23, pp. 39.1–39.26.
- Agrawal and Goyal (2013a). Further optimal regret bounds for Thompson sampling. In AISTATS, JMLR Workshop and Conference Proceedings, Vol. 31, pp. 99–107.
- Agrawal and Goyal (2013b). Thompson sampling for contextual bandits with linear payoffs. In ICML (3), JMLR Workshop and Conference Proceedings, Vol. 28, pp. 127–135.
- Beygelzimer et al. (2011). Contextual bandit algorithms with supervised learning guarantees. In AISTATS, JMLR Proceedings, Vol. 15, pp. 19–26.
- Bubeck and Cesa-Bianchi (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5 (1), pp. 1–122.
- Chapelle and Li (2011). An empirical evaluation of Thompson sampling. In NeurIPS, pp. 2249–2257.
- Chu et al. (2011). Contextual bandits with linear payoff functions. In AISTATS, JMLR Proceedings, Vol. 15, pp. 208–214.
- Dong et al. (2021). Provable model-based nonlinear bandit and reinforcement learning: shelve optimism, embrace virtual curvature. In NeurIPS, pp. 26168–26182.
- Dong et al. (2019). On the performance of Thompson sampling on logistic bandits. In COLT, Proceedings of Machine Learning Research, Vol. 99, pp. 1158–1160.
- Dong and Van Roy (2018). An information-theoretic analysis for Thompson sampling with many actions. In NeurIPS, pp. 4161–4169.
- Dudík et al. (2011). Efficient optimal learning for contextual bandits. In UAI, pp. 169–178.
- Dumitrascu et al. (2018). PG-TS: improved Thompson sampling for logistic contextual bandits. In NeurIPS, pp. 4629–4638.
- Faury et al. (2020). Improved optimistic algorithms for logistic bandits. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 3052–3060.
- Faury et al. (2022). Jointly efficient and optimal algorithms for logistic bandits. CoRR abs/2201.01985.
- Filippi et al. (2010). Parametric bandits: the generalized linear case. In NeurIPS, pp. 586–594.
- Foster et al. (2021). The statistical complexity of interactive decision making. CoRR abs/2112.13487.
- Foster and Krishnamurthy (2021). Efficient first-order contextual bandits: prediction, allocation, and triangular discrimination. In NeurIPS, pp. 18907–18919.
- Foster and Rakhlin (2020). Beyond UCB: optimal and efficient contextual bandits with regression oracles. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 3199–3210.
- Grünwald (2012). The safe Bayesian - learning the learning rate via the mixability gap. In ALT, Lecture Notes in Computer Science, Vol. 7568, pp. 169–183.
- Kaufmann et al. (2012). Thompson sampling: an asymptotically optimal finite-time analysis. In ALT, Lecture Notes in Computer Science, Vol. 7568, pp. 199–213.
- Kirschner and Krause (2018). Information directed sampling and bandits with heteroscedastic noise. In COLT, Proceedings of Machine Learning Research, Vol. 75, pp. 358–384.
- Kveton et al. (2020). Randomized exploration in generalized linear bandits. In AISTATS, Proceedings of Machine Learning Research, Vol. 108, pp. 2066–2076.
- Lattimore and Szepesvári (2020). Bandit algorithms. Cambridge University Press.
- Li et al. (2017). Provably optimal algorithms for generalized linear contextual bandits. In ICML, Proceedings of Machine Learning Research, Vol. 70, pp. 2071–2080.
- Li (2013). Generalized Thompson sampling for contextual bandits. CoRR abs/1310.7163.
- Lu and Van Roy (2017). Ensemble sampling. In Advances in Neural Information Processing Systems, Vol. 30.
- Mazumdar et al. (2020). On approximate Thompson sampling with Langevin algorithms. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 6797–6807.
- Phan et al. (2019). Thompson sampling and approximate inference. In NeurIPS, pp. 8801–8811.
- Qin et al. (2022). An analysis of ensemble sampling. CoRR abs/2203.01303.
- Russo and Van Roy (2013). Eluder dimension and the sample complexity of optimistic exploration. In NeurIPS, pp. 2256–2264.
- Russo and Van Roy (2014). Learning to optimize via posterior sampling. Math. Oper. Res. 39 (4), pp. 1221–1243.
- Russo and Van Roy (2016). An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research 17 (1), pp. 2442–2471.
- Russo and Van Roy (2018). Learning to optimize via information-directed sampling. Oper. Res. 66 (1), pp. 230–252.
- Schervish (1996). Theory of statistics. Springer Series in Statistics, Springer New York.
- Scott (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26 (6), pp. 639–658.
- Thompson (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
- Zhang (2021). Feel-good Thompson sampling for contextual bandits and reinforcement learning. CoRR abs/2110.00871.
Appendix A Omitted proofs
A.1 The proof of Lemma 4
For didactic purposes, we provide two proofs of this lemma. We start with the simple case of countably supported distributions, which allows us to spell out the steps in the proof using simple and intuitive notation. Then, we provide a proof for general prior distributions. Some general notations that we will use throughout are the following. We let be the distribution of the loss given context , action and a fixed parameter , and let denote the random variable with said distribution. Using this notation, notice that . Finally, we define the likelihood function
Proof for countably supported priors.
We first assume that the support is countable, which will allow us to reason about probability mass functions. In particular, with a slight abuse of our notation, we will write (which should otherwise be written as ). Defining the Bayesian posterior predictive distribution , we can write
Then, summing up and taking marginal expectations, we get
To proceed, let us notice that the posterior updates take the following form by definition:
Also, let us define the notation and notice that we can express this quantity by a recursive application of the above expression as
Then, we have
Proof for general prior distributions.
The proof follows from similar arguments, although we cannot work with probability mass functions any more. In particular, we will denote by the prior distribution of , which satisfies the following identity:
Similarly, we denote by the posterior distribution on after round , which satisfies
We now apply Bayes’ theorem for general distributions, which gives the following expression for :
where is the Radon–Nikodym derivative of the posterior measure with respect to the prior measure, which is always guaranteed to exist (cf. Theorem 1.31 of Schervish, 1996).
As in the previous proof, we once again define and , and compute the relation :
Then, we have