We first recall the general setting of online combinatorial optimization with both full feedback (full information game) and limited feedback (semi-bandit game). Let $\mathcal{A} \subseteq \{0,1\}^d$ be a fixed set of combinatorial actions, and assume that $\|a\|_1 = m$ for all $a \in \mathcal{A}$. An (oblivious) adversary selects a sequence $\ell_1, \dots, \ell_T \in [0,1]^d$ of linear loss functions, without revealing it to the player. At each time step $t \in [T]$, the player selects an action $a_t \in \mathcal{A}$, and suffers the instantaneous loss $\ell_t \cdot a_t$. The following feedback on the loss function $\ell_t$ is then obtained: in the full information game the entire loss vector $\ell_t$ is observed, and in the semi-bandit game only the loss on active coordinates is observed (i.e., one observes $\ell_t \odot a_t$, where $\odot$ denotes the entrywise product). Importantly, the player has access to external randomness, and can select their action based on the feedback observed so far. The player's objective is to minimize their total expected loss $\mathbb{E}\left[\sum_{t=1}^T \ell_t \cdot a_t\right]$. The player's performance at the end of the game is measured through the regret $R_T$, which is the difference between the achieved cumulative loss and the best one could have done with a fixed action. That is, with $L_T^* = \min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t \cdot a$, one has $R_T = \mathbb{E}\left[\sum_{t=1}^T \ell_t \cdot a_t\right] - L_T^*$. The optimal worst-case regret (i.e., the minimax value of $R_T$) is known for both the full information and the semi-bandit game. It is respectively of order $m\sqrt{T \log(d/m)}$ (Koolen et al. (2010)) and $\sqrt{mdT}$ (Audibert et al. (2014)).
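To make the protocol concrete, here is a minimal simulation sketch of the game loop and the regret computation. The uniform-random player, the i.i.d. uniform losses, and the small values of $d$, $m$, $T$ are placeholder assumptions for illustration only; they are not the algorithms or adversaries studied in this paper.

```python
import itertools
import random

d, m, T = 4, 2, 100  # dimension, action size, horizon (illustrative values)
# Action set A: all {0,1}^d vectors with exactly m ones.
actions = [a for a in itertools.product([0, 1], repeat=d) if sum(a) == m]

random.seed(0)
# Oblivious adversary: the whole loss sequence is fixed before the game starts.
losses = [[random.random() for _ in range(d)] for _ in range(T)]

total = 0.0
for t in range(T):
    a = random.choice(actions)  # placeholder player: uniform over A
    total += sum(l * ai for l, ai in zip(losses[t], a))  # loss l_t . a_t
    # Feedback: full information reveals losses[t]; semi-bandit reveals only
    # the entrywise product of l_t with the played action.
    feedback = [l * ai for l, ai in zip(losses[t], a)]

# Best fixed action in hindsight, and the realized regret.
L_star = min(sum(sum(l * ai for l, ai in zip(losses[t], a)) for t in range(T))
             for a in actions)
regret = total - L_star
```

Replacing the placeholder player by an actual strategy only changes the line that selects `a`.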
1.1 First-order regret bounds
It is natural to hope for strategies with regret $R_T = o(L_T^*)$, for one can then claim that $\mathbb{E}\left[\sum_{t=1}^T \ell_t \cdot a_t\right] = (1 + o(1))\, L_T^*$ (in other words the player's performance is close to the optimal in-hindsight performance up to a smaller order term). However, worst-case bounds fail to capture this behavior when $L_T^* = o(T)$. The concept of a first-order regret bound tries to remedy this issue, by asking for regret bounds scaling with $L_T^*$ instead of $T$. In Koolen et al. (2010) an optimal version of such a bound is obtained for the full information game:
Theorem 1 (Koolen et al. (2010))
In the full information game, there exists an algorithm such that for any loss sequence one has $R_T = O\left(\sqrt{m\, L_T^* \log(d/m)} + m \log(d/m)\right)$.
The state of the art for first-order regret bounds in the semi-bandit game is more complicated. It is known since Allenberg et al. (2006) that for $m = 1$ (i.e., the famous multi-armed bandit game) one can have an algorithm with regret of order $\sqrt{d\, L_T^*}$ up to logarithmic factors. On the other hand, for $m > 1$ the best bound, due to Lykouris et al. (2018), is suboptimal in its dependence on $m$. A byproduct of our main result (Theorem 4 below) is to give the first optimal first-order regret bound for the semi-bandit game:
In the semi-bandit game, there exists an algorithm such that for any loss sequence one has $R_T = O\left(\sqrt{d\, L_T^*}\right)$ up to logarithmic factors.
We derive this result (in fact this bound can also be obtained more directly with mirror descent and an entropic regularizer, as in Audibert et al. (2014)) using the recipe first proposed (in the context of partial feedback) in Bubeck et al. (2015). Namely, to show the existence of a randomized strategy with a given regret bound for any loss sequence, it is sufficient to show that for any distribution over loss sequences there exists a strategy whose expected regret satisfies the same bound. In other words, to prove Theorem 2 it is sufficient to restrict our attention to the Bayesian scenario, where one is given a prior distribution over the loss sequence $(\ell_1, \dots, \ell_T)$. Importantly, note that there is no independence whatsoever in such a random loss sequence, neither across times nor across coordinates for a fixed time.
1.2 Thompson Sampling
In the Bayesian setting one has access to a prior distribution on the optimal action $a^* = \operatorname{argmin}_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t \cdot a$. In particular, one can update this distribution as more observations on the loss sequence are collected. More precisely, denote $p_t$ for the posterior distribution of $a^*$ given all the information at the beginning of round $t$ (i.e., in the full information game this is $\ell_1, \dots, \ell_{t-1}$, while in the semi-bandit game it is $\ell_1 \odot a_1, \dots, \ell_{t-1} \odot a_{t-1}$). Then Thompson Sampling simply plays an action $a_t$ at random from $p_t$.
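As an illustration of exact posterior sampling, here is a toy sketch with $m = 1$ in which the prior is a uniform distribution over a small finite set of complete loss sequences; all names and the size of the toy prior are illustrative assumptions. The posterior over the optimal arm is obtained by pushing each surviving scenario's weight onto its in-hindsight best arm, and conditioning on bandit feedback simply discards inconsistent scenarios.

```python
import random

random.seed(1)
d, T = 3, 20  # number of arms and horizon (illustrative)

# Toy prior: uniform over a small finite set of complete loss sequences.
# No independence across time or coordinates is assumed anywhere.
scenarios = [[[random.random() for _ in range(d)] for _ in range(T)]
             for _ in range(5)]
weights = [1.0] * len(scenarios)

def best_arm(seq):
    """In-hindsight optimal arm of a loss sequence."""
    totals = [sum(loss[i] for loss in seq) for i in range(d)]
    return min(range(d), key=totals.__getitem__)

truth = scenarios[0]  # the sequence actually chosen by the adversary
for t in range(T):
    # Posterior p_t over the optimal arm: push each scenario's weight
    # onto that scenario's best arm, then normalize.
    post = [0.0] * d
    for w, s in zip(weights, scenarios):
        post[best_arm(s)] += w
    z = sum(post)
    post = [x / z for x in post]
    # Thompson Sampling: draw the played arm from p_t.
    arm = random.choices(range(d), weights=post)[0]
    obs = truth[t][arm]  # bandit feedback: loss of the played arm only
    # Condition: discard scenarios inconsistent with the observation.
    weights = [w if s[t][arm] == obs else 0.0 for w, s in zip(weights, scenarios)]
```

With continuous losses, a single observation almost surely eliminates every scenario other than the true one.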
This strategy has recently regained interest, as it is both efficient in practice and particularly elegant in theory. A breakthrough in the understanding of Thompson Sampling’s regret was made in Russo and Van Roy (2014a) where an information theoretic analysis was proposed. They consider in particular the combinatorial setting for which they prove the following result:
Theorem 3 (Russo and Van Roy (2014a))
Assume that the prior is such that the sequence $(\ell_t)_{t \in [T]}$ is i.i.d. Then in the full information game Thompson Sampling satisfies $\mathbb{E}[R_T] = O\left(m^{3/2}\sqrt{T \log(d/m)}\right)$, and in the semi-bandit game it satisfies $\mathbb{E}[R_T] = O\left(m\sqrt{dT}\right)$.
Assume furthermore that the prior is such that, for any $t$, conditionally on the history the coordinates $\ell_t(1), \dots, \ell_t(d)$ are independent. Then Thompson Sampling satisfies respectively $\mathbb{E}[R_T] = O\left(m\sqrt{T \log(d/m)}\right)$ and $\mathbb{E}[R_T] = O\left(\sqrt{mdT}\right)$ in the full information and semi-bandit game.
It was observed in Bubeck et al. (2015) that the assumption of independence across times is immaterial in the information theoretic analysis of Russo and Van Roy. However it turns out that the independence across coordinates (conditionally on the history) in Theorem 3 is key to obtain the worst-case optimal bounds $m\sqrt{T\log(d/m)}$ and $\sqrt{mdT}$. One of the contributions of our work is to show how to appropriately modify the notion of entropy to remove this assumption.
Most importantly, we propose a new analysis of Thompson Sampling that allows us to prove first-order regret bounds. In particular we show the following result:
For any prior, Thompson Sampling satisfies in the full information game $\mathbb{E}[R_T] = O\left(\sqrt{\mathbb{E}[L_T^*]\, m \log(d/m)} + m\log(d/m)\right)$. Furthermore, in the semi-bandit game, assuming that $L_T^* \le L^*$ almost surely, then $\mathbb{E}[R_T] = O\left(\sqrt{d\, L^*}\right)$ up to logarithmic factors.
To the best of our knowledge, such bounds were not even known for the full-information case with $m = 1$ (the so-called expert setting of Cesa-Bianchi et al. (1997)).
2 Information ratio and scale-sensitive information ratio
As a warm-up, and to showcase one of our key contributions, we focus here on the full information case with $m = 1$ (i.e., the expert setting). We start by recalling the general setting of Russo and Van Roy's analysis (Subsection 2.1), and how it applies in this expert setting (Subsection 2.2). We then introduce a new quantity, the scale-sensitive information ratio, and show that it naturally implies a first-order regret bound (Subsection 2.3). We conclude this section by showing a new bound between two classical distances on distributions (essentially the chi-squared and the relative entropy), and we explain how to apply it to control the scale-sensitive information ratio (Subsection 2.4).
2.1 Russo and Van Roy’s analysis
Let us denote $Z_t$ for the feedback received at the end of round $t$. That is, in full information one has $Z_t = \ell_t$, while in semi-bandit one has $Z_t = \ell_t \odot a_t$. Let us denote $p_t$ for the posterior distribution of $a^*$ conditionally on $Z_1, \dots, Z_{t-1}$. We write $\mathbb{E}_t$ for the integration with respect to this posterior and $a_t$ (recall that $p_t$ is the distribution of $a_t$ under Thompson Sampling). Let $I_t$ be the mutual information, under the posterior distribution, between $a^*$ and $Z_t$, that is $I_t = I(a^*; Z_t \mid Z_1, \dots, Z_{t-1})$. Let $r_t = \ell_t \cdot (a_t - a^*)$ be the instantaneous regret at time $t$. The information ratio introduced by Russo and Van Roy is defined as:
$$\Gamma_t = \frac{\left(\mathbb{E}_t[r_t]\right)^2}{I_t}. \qquad (1)$$
The point of the information ratio is the following result:
Proposition 1 (Proposition 1, Russo and Van Roy (2014a))
Consider a strategy such that $\Gamma_t \le \overline{\Gamma}$ for all $t \in [T]$. Then one has
$$\mathbb{E}[R_T] \le \sqrt{\overline{\Gamma}\, T\, H(p_1)},$$
where $H(p_1)$ denotes the Shannon entropy of the prior distribution of $a^*$ (in particular $H(p_1) \le \log|\mathcal{A}|$).
Proof The main calculation is as follows:
$$\mathbb{E}\left[\sum_{t=1}^T r_t\right] = \mathbb{E}\left[\sum_{t=1}^T \sqrt{\Gamma_t I_t}\right] \le \sqrt{\overline{\Gamma}}\;\mathbb{E}\left[\sum_{t=1}^T \sqrt{I_t}\right] \le \sqrt{\overline{\Gamma}\, T\, \mathbb{E}\left[\sum_{t=1}^T I_t\right]},$$
where the last inequality combines Cauchy–Schwarz with Jensen's inequality. Moreover it turns out that the total information accumulation can be easily bounded, by simply observing that the mutual information can be written as a drop in entropy, yielding the bound:
$$\mathbb{E}\left[\sum_{t=1}^T I_t\right] \le H(p_1).$$
2.2 Pinsker’s inequality and Thompson Sampling’s information ratio
We now describe how to control the information ratio (1) of Thompson Sampling in the expert setting. First note that the posterior distribution of satisfies (with a slight abuse of notation by viewing as a vector in ): . In particular this means that:
where the inequality uses that . Now combining (3) with Jensen followed by Pinsker’s inequality yields:
where we denote $\mathrm{KL}$ for the relative entropy (recall that Pinsker's inequality is simply $\|p - q\|_1 \le \sqrt{2\,\mathrm{KL}(p, q)}$). Furthermore a classical rewriting of the mutual information shows that the resulting quantity is equal to $I_t$ (see [Proposition 4, Russo and Van Roy (2014a)] for more details). In other words we just proved that $\mathbb{E}_t[r_t] \le \sqrt{I_t / 2}$, and thus:
In the expert setting, Thompson Sampling's information ratio (1) satisfies $\Gamma_t \le 1/2$ for all $t \in [T]$.
2.3 Scale-sensitive information ratio
We define the scale-sensitive information ratio as
$$\Gamma_t^* = \frac{\left(\mathbb{E}_t[r_t]\right)^2}{I_t\;\mathbb{E}_t[\ell_t \cdot a_t]}, \qquad (4)$$
where $\mathbb{E}_t[\ell_t \cdot a_t]$ is the player's expected instantaneous loss. With this new quantity we obtain the following refinement of Proposition 1:
Consider a strategy such that $\Gamma_t^* \le \overline{\Gamma}$ for all $t \in [T]$. Then one has
$$\mathbb{E}[R_T] \le \sqrt{\overline{\Gamma}\, H(p_1)\, \mathbb{E}[L_T^*]} + \overline{\Gamma}\, H(p_1).$$
Proof The main calculation is as follows:
$$\mathbb{E}\left[\sum_{t=1}^T r_t\right] = \mathbb{E}\left[\sum_{t=1}^T \sqrt{\Gamma_t^*\, I_t\,\mathbb{E}_t[\ell_t \cdot a_t]}\right] \le \sqrt{\overline{\Gamma}\;\mathbb{E}\left[\sum_{t=1}^T I_t\right]\;\mathbb{E}\left[\sum_{t=1}^T \ell_t \cdot a_t\right]} \le \sqrt{\overline{\Gamma}\, H(p_1)\;\mathbb{E}\left[\sum_{t=1}^T \ell_t \cdot a_t\right]}.$$
It only remains to use the fact that $\mathbb{E}\left[\sum_{t=1}^T \ell_t \cdot a_t\right] = \mathbb{E}[L_T^*] + \mathbb{E}[R_T]$, together with the elementary implication that $x \le \sqrt{c(b + x)}$ implies $x \le \sqrt{cb} + c$ (applied with $x = \mathbb{E}[R_T]$, $b = \mathbb{E}[L_T^*]$, and $c = \overline{\Gamma}\, H(p_1)$).
2.4 Reversed chi-squared/relative entropy inequality
We now describe how to control the scale-sensitive information ratio (4) of Thompson Sampling in the expert setting. As we saw in Subsection 2.2, the two key inequalities in the Russo–Van Roy information ratio analysis are a simple Cauchy–Schwarz followed by Pinsker's inequality (recall (3)):
In particular, as far as first-order regret bounds are concerned, the "scale" of the loss is lost in the first Cauchy–Schwarz. To control the scale-sensitive information ratio we propose to do the Cauchy–Schwarz step differently, as follows:
where $\chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{q_i}$ is the chi-squared divergence. Thus, to control the scale-sensitive information ratio (4), it only remains to relate the chi-squared divergence to the relative entropy. Unfortunately it is well-known that in general one only has $\mathrm{KL}(p, q) \le \chi^2(p, q)$ (which is the opposite of the inequality we need). Somewhat surprisingly, we show that the reverse inequality in fact holds true for a slightly weaker form of the chi-squared divergence, which turns out to be sufficient for our needs:
For distributions $p, q$ on $[d]$, define the positive chi-squared divergence by
$$\chi^2_+(q, p) = \sum_{i \,:\, q_i \ge p_i} \frac{(q_i - p_i)^2}{q_i}.$$
Also we denote $\mathrm{KL}(p, q) = \sum_{i=1}^d p_i \log\frac{p_i}{q_i}$. Then one has
$$\chi^2_+(q, p) \le 2\,\mathrm{KL}(p, q).$$
Proof Consider the function $f(x) = x\log x$, and observe that $f''(x) = 1/x$. In particular $f$ is convex, and for $x \le b$ it is $\frac{1}{b}$-strongly convex. Moreover one has
$$\mathrm{KL}(p, q) = \sum_{i=1}^d \Big( f(p_i) - f(q_i) - f'(q_i)(p_i - q_i) \Big).$$
This directly implies:
$$\mathrm{KL}(p, q) \ge \sum_{i=1}^d \frac{(p_i - q_i)^2}{2\max(p_i, q_i)} \ge \frac{1}{2} \sum_{i \,:\, q_i \ge p_i} \frac{(q_i - p_i)^2}{q_i},$$
which concludes the proof.
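As a numerical sanity check (assuming the positive chi-squared divergence takes the form used below, with the sum restricted to coordinates where $q_i \ge p_i$, which is our reading of the intended definition), the following snippet verifies both the classical direction $\mathrm{KL}(p, q) \le \chi^2(p, q)$ and the reversed direction for the positive variant, on random pairs of distributions:

```python
import math
import random

def kl(p, q):
    """Relative entropy KL(p, q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2(p, q):
    """Chi-squared divergence: sum_i (p_i - q_i)^2 / q_i."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def chi2_plus(q, p):
    """Positive chi-squared: restrict the sum to coordinates with q_i >= p_i."""
    return sum((qi - pi) ** 2 / qi for qi, pi in zip(q, p) if qi >= pi)

def rand_dist(n, rng):
    x = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(x)
    return [v / s for v in x]

rng = random.Random(0)
for _ in range(1000):
    p, q = rand_dist(5, rng), rand_dist(5, rng)
    assert kl(p, q) <= chi2(p, q) + 1e-12            # classical direction
    assert chi2_plus(q, p) <= 2 * kl(p, q) + 1e-12   # reversed direction
```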
In the expert setting, Thompson Sampling's scale-sensitive information ratio (4) is bounded by an absolute constant for all $t \in [T]$.
In the expert setting Thompson Sampling satisfies for any prior distribution:
3 Combinatorial setting and coordinate entropy
We now return to the general combinatorial setting, where the action set $\mathcal{A}$ is a subset of $\{0,1\}^d$, and we continue to focus on the full information game. Recall that, as described in Theorem 3, Russo and Van Roy's analysis yields in this case the suboptimal regret bound $m^{3/2}\sqrt{T\log(d/m)}$ (the optimal bound is $m\sqrt{T\log(d/m)}$). We first argue that this suboptimal bound comes from basing the analysis on the standard Shannon entropy. We then propose a different analysis based on the coordinate entropy.
3.1 Inadequacy of the Shannon entropy
Let us consider the simple scenario where $\mathcal{A}$ is the set of indicator vectors for the sets $\{(i-1)m + 1, \dots, im\}$, $i \in [d/m]$. In other words, the action set consists of $d/m$ disjoint intervals of size $m$. This problem is equivalent to a classical expert setting with $d/m$ actions, and losses with values in $[0, m]$. In particular there exists a prior distribution such that any algorithm must suffer regret of order $m\sqrt{T\log(d/m)}$ (the $\log(d/m)$ in the lower bound comes from the fact that there are only $d/m$ available actions).
Thus we see that, unless the regret bound reflects some of the structure of the action set (besides the fact that its elements have $m$ non-zero coordinates), one cannot hope for a better regret than $m\sqrt{T\, H(p_1)}$. For larger action sets, where $H(p_1)$ can be as large as $m\log(d/m)$, this quantity yields the suboptimal bound $m^{3/2}\sqrt{T\log(d/m)}$. This suggests that the Shannon entropy is not the right measure of uncertainty in this combinatorial setting.
Interestingly, a similar observation was made in Audibert et al. (2014), where it was shown that the regret of the standard multiplicative weights algorithm is also lower bounded by the same suboptimal $m^{3/2}$ rate. The connection to the present situation is that standard multiplicative weights corresponds to mirror descent with the Shannon entropy. To obtain an optimal algorithm, Koolen et al. (2010); Audibert et al. (2014) proposed to use mirror descent with a certain coordinate entropy. We show next that basing the analysis of Thompson Sampling on this coordinate entropy allows us to prove optimal guarantees.
3.2 Coordinate entropy analysis
For any vector $p \in [0,1]^d$, we define its coordinate entropy to simply be the sum of the entropies of the individual coordinates:
$$H_c(p) = \sum_{i=1}^d \Big( -p_i \log p_i - (1 - p_i)\log(1 - p_i) \Big).$$
For a $\{0,1\}^d$-valued random variable $X$ such as $a^*$, we define $H_c(X) = H_c(\mathbb{E}[X])$. Equivalently, the coordinate entropy of $X$ is the sum of the entropies of the (Bernoulli) random variables $X_1, \dots, X_d$.
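A short sketch of the coordinate entropy, applied to the disjoint-intervals example of Subsection 3.1, illustrates the gap between the two notions of uncertainty; the function names are illustrative:

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli(p) random variable (in nats)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def coordinate_entropy(p_vec):
    """Sum of the entropies of the individual {0,1}-valued coordinates."""
    return sum(binary_entropy(p) for p in p_vec)

# Disjoint-intervals example: d/m blocks of size m, uniform prior over blocks.
d, m = 64, 8
k = d // m
marginals = [1.0 / k] * d        # each coordinate is active with probability m/d
shannon = math.log(k)            # Shannon entropy of the uniform prior
coord = coordinate_entropy(marginals)
```

In this example the Shannon entropy of the prior is $\log(d/m)$, while the coordinate entropy is of order $m\log(ed/m)$, the scale at which the combinatorial bounds operate.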
This definition allows us to consider the information gain in each event separately in the information-theoretic analysis. By inspecting our earlier proof one easily obtains in the combinatorial setting, denoting now ,
As a result, the scale-sensitive information ratio with coordinate entropy is . Therefore
Using the inequality on the second term we obtain
This gives the claimed estimate
Consideration of the coordinate entropy suggests that it is unnecessary to leverage information from correlations between different arms, and we can essentially treat them as independent. Examination of our proofs reveals the following fact supporting this philosophy: any algorithm which observes arm $i$ at time $t$ with probability $p_t(i)$ satisfies the same regret estimates that we show for Thompson Sampling. For example, as long as no arm has probability more than $1/2$, we could pick two bandit arms half the time and none half the time, in a suitable way, and obtain the same regret guarantees. This remark extends to the thresholded variants of Thompson Sampling we discuss at the end of the paper.
Now we return to the $m = 1$ setting and consider the case of bandit feedback. We again begin by recalling the analysis of Russo and Van Roy, and then adapt it in analogy with the scale-sensitive framework. In this section, we require that an almost sure upper bound $L^*$ on the loss of the best action is given to the player. Under this assumption we show that Thompson Sampling obtains a regret bound of order $\sqrt{d\, L^*}$ up to logarithmic factors. Our Lemma 4 below generalizes a part of their proof and will be crucial in all our analyses.
4.1 The Russo and Van Roy Analysis for Bandit Feedback
In the bandit setting we cannot bound the regret by the movement of the posterior distribution. Indeed, the calculation (3) relies on the fact that the full loss vector $\ell_t$ is known at time $t$, which is only true for full feedback. However, a different information theoretic calculation gives a good estimate.
In the bandit setting, Thompson Sampling's information ratio satisfies $\Gamma_t \le d/2$ for all $t \in [T]$. Therefore it has expected regret at most $\sqrt{\frac{1}{2}\, d\, T\, H(p_1)}$.
Proof We set and . Then we have the calculation
By Lemma 4 below, this means
which is equivalent to .
The following lemma is a generalization of a calculation in Russo and Van Roy (2014a). We leave the proof to the Appendix.
Suppose a Bayesian player is playing a semi-bandit game with a hidden subset of arms. Each round , the player picks some subset of arms and observes all their losses. Define , and . Let . Then with the coordinate information gain we have
In the bandit case, we have an upper bound using the ordinary entropy:
4.2 General Theorem on Perfectly Bayesian Agents
Here we state a theorem on the behavior of a Bayesian agent in an online learning environment. In the next subsection we use it to give a nearly optimal regret bound for Thompson Sampling with bandit feedback. This theorem is stated in a rather general way in order to encompass the semi-bandit case as well as the thresholded version of Thompson Sampling. The proof goes by controlling the errors of unbiased and negatively biased estimators for the losses using a concentration inequality. Then we argue that because these estimators are accurate with high probability, a Bayesian agent will usually believe them to be accurate, even though this accuracy circularly depends on the agent’s past behavior. We relegate the detailed proof to the Appendix.
Consider an online learning game with arm set such that the player has a correct prior on the sequence of losses. Assume there always exists an action with total loss at most . Each round, the player plays some subset of actions, and pays/observes the loss for each of them. Let be the time- probability that is one of the optimal arms and the probability that the player plays arm in round . We suppose that there exist constants and a time-varying partition of the action set into rare and common arms such that:
If , then .
If , then .
Then the following statements hold for every .
The expected loss incurred by the player from rare arms is at most
The expected total loss that arm incurs while it is common is at most
This result is similar in spirit to the work Russo and Van Roy (2014b) which shows that Thompson Sampling outperforms any upper-confidence bound strategy.
4.3 First-Order Regret for Bandit Feedback
As Theorem 4.2 alluded to, we split the action set into rare and common arms for each round. Rare arms are those whose posterior probability of being optimal falls below a small threshold, while common arms are those at or above the threshold. Note that an arm can certainly switch from rare to common and back over time. We correspondingly split the loss function into a rare part and a common part, where the rare (resp. common) part coincides with the loss on the arms that are rare (resp. common) at that round and vanishes elsewhere.
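The split can be sketched as follows; the threshold `c` and all names are illustrative placeholders, not the exact constants of the analysis:

```python
def split_losses(loss, posterior, c):
    """Split a loss vector into rare and common parts.

    An arm is treated as rare at this round when its posterior probability
    of being optimal is below the (illustrative) threshold c, and common
    otherwise; the two parts sum back to the original vector.
    """
    rare = [l if p < c else 0.0 for l, p in zip(loss, posterior)]
    common = [l if p >= c else 0.0 for l, p in zip(loss, posterior)]
    return rare, common

loss = [0.2, 0.7, 0.1, 0.9]
posterior = [0.05, 0.4, 0.5, 0.05]
rare, common = split_losses(loss, posterior, c=0.1)
```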
Now we are ready to prove the first-order regret bound for bandits. Our inequalities follow a similar structure as in the full-feedback case, but there seems to be no clean formulation in terms of an information ratio.
Suppose that the best expert almost surely has total loss at most $L^*$. Then Thompson Sampling with bandit feedback obeys the regret estimate
Proof Fix and define and correspondingly. We split off the rare arm losses at the start of the analysis:
Substituting in the conclusion of Theorem 4.2B gives:
Taking gives the desired estimate.
5 Semi-bandit and Thresholded Thompson Sampling
We now consider semi-bandit feedback in the combinatorial setting, combining the intricacies of the previous two sections. We again have an action set $\mathcal{A}$ contained in the set $\{0,1\}^d$, but now we observe only the losses of the arms we played. A natural generalization of the bandit proof to higher $m$ yields a first-order regret bound with suboptimal dependence on $m$. However, we give a refined analysis using an additional trick of ranking the arms of the optimal action by their total loss, and performing an information theoretic analysis on a certain set partition of these optimal arms. This method allows us to obtain an improved bound for the semi-bandit regret. We leave the proof to the Appendix.
Theorem (semi-bandit Thompson Sampling)  The expected regret of Thompson Sampling in the semi-bandit case is
5.1 Thresholded Thompson Sampling
Unlike in the full-feedback case, our first-order regret bound for bandit Thompson Sampling has an additive lower-order term depending on $T$, so it is not completely $T$-independent. In fact, some mild $T$-dependence is inherent; an example is given in the Appendix.
However, this mild $T$-dependence can be avoided by using Thresholded Thompson Sampling. In Thresholded Thompson Sampling, the rare arms are never played, and the probabilities for the other arms are scaled up correspondingly. More precisely, for a threshold $\delta > 0$, $\delta$-thresholded Thompson Sampling is defined by letting $S_t = \{i : p_t(i) \ge \delta\}$ and playing $a_t$ at time $t$ from the distribution
$$\tilde{p}_t(i) = \frac{p_t(i)\,\mathbf{1}\{i \in S_t\}}{\sum_{j \in S_t} p_t(j)}.$$
This algorithm parallels the work of Lykouris et al. (2018), which uses an analogous modification of the EXP3 algorithm to obtain a first-order regret bound. Thresholded semi-bandit Thompson Sampling is defined similarly, where we only allow action sets containing no rare arms.
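A minimal sketch of the playing distribution of thresholded Thompson Sampling (names are illustrative):

```python
def thresholded_distribution(posterior, delta):
    """Playing distribution of delta-thresholded Thompson Sampling.

    Arms with posterior mass below delta are never played; the remaining
    probabilities are scaled up by a common normalization factor.
    """
    kept = [p if p >= delta else 0.0 for p in posterior]
    z = sum(kept)
    if z == 0.0:
        raise ValueError("threshold removed every arm")
    return [p / z for p in kept]

p = [0.02, 0.48, 0.30, 0.20]
q = thresholded_distribution(p, delta=0.05)
```

Note that the surviving arms keep their relative proportions; only the rare mass is redistributed among them.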
Thompson Sampling for bandit feedback, thresholded with , has expected regret
Thompson Sampling for semi-bandit feedback, thresholded with , has expected regret
- Allenberg et al.  C. Allenberg, P. Auer, L. Györfi, and G. Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th International Conference on Algorithmic Learning Theory (ALT), 2006.
- Audibert et al.  J.Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39:31–45, 2014.
- Bubeck et al.  S. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: regret in one dimension. In Proceedings of the 28th Annual Conference on Learning Theory (COLT), 2015.
- Cesa-Bianchi et al.  Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.
- Freedman  David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.
- Koolen et al.  W. Koolen, M. Warmuth, and J. Kivinen. Hedging structured concepts. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
- Lykouris et al.  T. Lykouris, K. Sridharan, and E. Tardos. Small-loss bounds for online learning with partial information. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), 2018.
- Neu  G. Neu. First-order regret bounds for combinatorial semi-bandits. In Proceedings of the 28th Annual Conference on Learning Theory (COLT), 2015.
- Russo and Van Roy [2014a] D. Russo and B. Van Roy. An information-theoretic analysis of thompson sampling. arXiv preprint arXiv:1403.5341, 2014a.
- Russo and Van Roy [2014b] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39:1221–1243, 2014b.
- Thompson  W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
Appendix A Appendix
A.1 Proof of Lemma 4
Proof We first claim that the relative entropy
is at most the entropy decrease in the law of the random variable from being given that . Indeed, let be a -valued random variable with expected value and conditionally independent of everything else. By definition,
is exactly the information gain in upon being told that . Since is a noisy realization of , the data processing inequality implies that the information gain of is more than the information gain in which proves the claim.
Now, continuing, we have that
is at most the entropy decrease in from being given whether or not . Therefore
Summing over gives the result. For the second assertion regarding the bandit case, we can consider the single random variable which allows us to do everything with ordinary entropy. See Russo and Van Roy [2014a] Proposition 3 for the detailed calculation.
A.2 Proof of Theorem 4.2
Here we prove Theorem 4.2. Recall the statement:
The following notations will be relevant to our analysis. Some have been defined in the main body, while some are only used in the Appendix.
The first group of variables are the instantaneous rare/common losses of an arm, while the second group track the corresponding total losses. The remaining variables are underbiased/unbiased estimates of the instantaneous losses and of the total losses, respectively.
To control the error of the estimators we rely on Freedman's inequality (Freedman (1975)), a refinement of Hoeffding–Azuma which is more efficient for highly asymmetric summands.
Theorem 7 (Freedman’s Inequality)
Let $(X_t)_{t \ge 1}$ be a martingale difference sequence, so that $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$. Suppose that we have a uniform estimate $|X_t| \le R$. Also define the conditional variance
$$\sigma_t^2 = \mathbb{E}\left[X_t^2 \mid \mathcal{F}_{t-1}\right],$$
and set $V_t = \sum_{s \le t} \sigma_s^2$ to be the total variance accumulated so far.
Then with probability at least $1 - \delta$, we have $\sum_{s \le t} X_s \le \sqrt{2 v \log(1/\delta)} + 2R\log(1/\delta)$ for all $t$ with $V_t \le v$.
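As a quick Monte Carlo sanity check of a Freedman-type bound of this shape (the constants below are one common instantiation, an assumption rather than the exact theorem statement), one can verify on a fair coin-flip martingale that the deviation bound is violated far less often than $\delta$:

```python
import math
import random

rng = random.Random(0)
T, trials, delta = 200, 2000, 0.05
R, v = 1.0, float(T)  # |X_t| <= 1 and per-step conditional variance 1
bound = math.sqrt(2 * v * math.log(1 / delta)) + 2 * R * math.log(1 / delta)

violations = 0
for _ in range(trials):
    s, running_max = 0.0, 0.0
    for _ in range(T):
        s += rng.choice([-1.0, 1.0])  # fair +/-1 martingale increments
        running_max = max(running_max, s)
    if running_max > bound:
        violations += 1

rate = violations / trials  # empirical probability of ever exceeding the bound
```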
The following extension to supermartingales is immediate by taking the Doob-Meyer decomposition of a supermartingale as a martingale plus a decreasing predictable process.
Let $(X_t)_{t \ge 1}$ be a supermartingale difference sequence, so that $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] \le 0$. Suppose that we have a uniform estimate $|X_t| \le R$. Also define the conditional variance
$$\sigma_t^2 = \mathbb{E}\left[X_t^2 \mid \mathcal{F}_{t-1}\right],$$
and set $V_t = \sum_{s \le t} \sigma_s^2$ to be the total variance accumulated so far.
Then with probability at least $1 - \delta$, we have $\sum_{s \le t} X_s \le \sqrt{2 v \log(1/\delta)} + 2R\log(1/\delta)$ for all $t$ with $V_t \le v$.
Towards proving the two claims in Theorem 4.2 we first prove two lemmas. They follow directly from proper applications of Freedman’s Theorem or its corollary.
In the context of Theorem 4.2, with probability at least , for all with we have
In the context of Theorem 4.2, fix constants and and assume . With probability at least , for all with we have
This second lemma has no dependence on and holds with . For proving Theorem 4.2 we will simply take . We will need to apply this lemma with for the semi-bandit analog.
Proof of Lemma 5:
We analyze the (one-sided) error in the underestimate for . Define the supermartingale for
We apply Corollary 1 to this supermartingale, taking
For the filtration, we take the loss sequence as known from the start, so that the only randomness comes from the player's choices. Equivalently, we act as the observing adversary; note that the process is still a supermartingale with respect to this filtration. Crucially, this means the conditional variance is bounded by . Therefore we have . We also note that with these parameters we have
Therefore, Freedman’s inequality tells us that with probability , for all with we have