In this paper, we focus on a recent variant of the multi-armed bandit (MAB) problem: the combinatorial sleeping MAB with long-term fairness constraints (CSMAB-F) [18]. In CSMAB-F, a learning agent needs to simultaneously pull a subset of the available arms in each round, subject to some constraints (usually a cardinality constraint), and observes only the reward of each pulled arm (the semi-bandit setting). Both the availability and the reward of each arm are stochastically generated, and long-term fairness among arms is further considered, i.e., each arm should be pulled at least a certain number of times over a long time horizon. The objective is to accumulate as much reward as possible over the finite time horizon. The CSMAB-F problem has a wide range of real-world applications. For example, in task assignment problems, we want each worker to be assigned a certain number of tasks (i.e., fairness constraints), while some of the workers may be unavailable in some time slots (i.e., sleeping arms). In movie recommendation systems that consider movie diversity, each movie genre should be recommended a certain number of times (i.e., fairness constraints), while genres that a user dislikes should not be recommended to that user (i.e., sleeping arms).
Upper Confidence Bound (UCB) and Thompson Sampling (TS) are two well-known families of algorithms for stochastic MAB problems. Theoretically, TS is comparable to UCB [13, 2], but in practice TS usually outperforms UCB-based algorithms significantly [5]. However, while the theoretical performance of UCB-based algorithms has been extensively studied for various MAB problems [3], there are only a few theoretical results for TS-based algorithms [2, 6, 21].
In [18], a UCB-based algorithm called Learning with Fairness Guarantee (LFG) was devised and a problem-independent regret bound was derived for the CSMAB-F problem, where is the number of arms, is the maximal number of arms that can be pulled simultaneously in each round, is the maximum arm weight, and is a parameter that balances the fairness and the reward. (A regret bound is called problem-dependent if it depends on a specific problem instance, and problem-independent if it does not depend on any problem instance.) However, as TS-based algorithms are usually comparable to UCB theoretically but perform better in practice, we are motivated to devise TS-based algorithms and derive regret bounds of such algorithms for the CSMAB-F problem. The contributions of this paper can be summarized as follows.
We devise the first TS-based algorithm for CSMAB-F problems with a provable upper regret bound. To be fully comparable with LFG, we incorporate the virtual queue techniques defined in  but make a modification on the queue evolution process to reduce the accumulated rounding errors.
Our regret bound is of the same polynomial order as the one achieved by LFG, but with lower coefficients. This shows again that TS-based algorithms can achieve theoretical guarantees comparable to those of UCB-based algorithms, here with a tighter bound.
We verify and validate the practical performance of the proposed algorithm, TSCSF-B, through numerical experiments and a real-world application. The results show that TSCSF-B outperforms LFG in practice.
It is noteworthy that our algorithmic framework and proof techniques are extensible to other MAB problems with other fairness definitions. Furthermore, if we do not consider the fairness constraints, our bound boils down to the first problem-independent upper regret bound of TS algorithms for CSMAB problems and matches the lower regret bound [17].
The remainder of this paper is organized as follows. In Section 2, we summarize the works most related to CSMAB-F. The problem formulation of CSMAB-F is presented in Section 3, following the formulation in [18] for comparison purposes. The proposed TS-based algorithm is presented in Section 4, with the main results, i.e., the fairness guarantee, performance bounds and proof sketches, presented in Section 5. Performance evaluations are presented in Section 6, followed by concluding remarks and future work in Section 7. Detailed proofs can be found in Appendix A.
2 Related Works
Many variants of the stochastic MAB problem have been proposed and the corresponding regret bounds have been derived. The works most related to ours concern the Combinatorial MAB (CMAB) and its variants; CMAB was first proposed and analyzed in . In CMAB, an agent needs to pull a combination of arms simultaneously from a fixed arm set. Considering a semi-bandit feedback setting, i.e., the individual reward of each arm in the played combinatorial action can be observed, the authors of  derived a sublinear problem-dependent upper regret bound based on a UCB algorithm, and this bound was further improved in . In , a problem-dependent lower regret bound was derived by constructing some problem instances. Very recently, the authors of  derived a problem-dependent regret bound of TS-based algorithms for CMAB problems.
All the aforementioned works assume that the arm set from which the learning agent can pull arms is fixed over all rounds, i.e., all the arms are always available and ready to be pulled. However, in practice, some of the arms may not be available in some rounds; for example, some items for online sale may be temporarily out of stock. Therefore, a body of literature has studied the setting of MAB with sleeping arms (SMAB) [16, 6, 12, 15, 20]. In the SMAB setting, the set of available arms for each round, i.e., the availability set, can vary. For the simplest version of SMAB (only one arm is pulled in each round), the problem-dependent regret bounds of UCB-based algorithms and TS-based algorithms have been analyzed in [16] and [6], respectively.
Regarding the combinatorial SMAB setting (CSMAB), some negative results are shown in [15], i.e., efficient no-regret learning is sometimes computationally hard. However, for some settings such as stochastic availability and stochastic reward, it has been shown that it is still possible to devise efficient learning algorithms with good theoretical guarantees [12, 18]. More importantly, the authors of [18] considered a new variant called the combinatorial sleeping MAB with long-term fairness constraints (CSMAB-F). In this setting, fairness among arms is further considered, i.e., each arm needs to be pulled a certain number of times. The authors designed a UCB-based algorithm called Learning with Fairness Guarantee (LFG) and provided a problem-independent time-averaged upper regret bound.
Given the attractive practical performance of TS-based algorithms and the lack of theoretical guarantees for them in CSMAB-F, it is desirable to devise a TS-based algorithm and derive regret bounds for it. We are interested in deriving a problem-independent regret bound, as it holds for all problem instances. In this work, we give the first provable regret bound that is of the same polynomial order as the one in [18] but with lower coefficients. To the best of our knowledge, the derived upper bound is also the first problem-independent regret bound of TS-based algorithms for CSMAB problems, and it matches the lower regret bound [17] when the long-term fairness constraints are relaxed.
3 Problem Formulation
In this section, we present the problem formulation of the stochastic combinatorial sleeping multi-armed bandit problem with fairness constraints (CSMAB-F), following  closely for comparison purposes. To state the problem clearly, we first introduce the CSMAB problem and then incorporate the fairness constraints.
Let set be an arm set and be the power set of . At the beginning of each round , a set of arms are revealed to a learning agent according to a fixed but unknown distribution over , i.e., . We call set the availability set in round . Meanwhile, each arm is associated with a random reward drawn from a fixed Bernoulli distribution with an unknown mean , and a fixed known non-negative weight for that arm. (Note that we only consider the Bernoulli distribution in this paper for brevity, but it is feasible to extend our algorithm and analysis to other general distributions with minor modifications; see [1, 2].) Note that for all the arms in , their rewards are drawn independently in each round . Then the learning agent pulls a subset of arms from the availability set with the cardinality no more than , i.e., , and receives a weighted random reward .
In this work, we consider the semi-bandit feedback setting, which is consistent with , i.e., the learning agent can observe the individual random reward of all the arms in . Note that since the availability set is drawn from a fixed distribution and the random rewards of the arms are also drawn from fixed distributions, we are in a bandit setting called the stochastic availability and the stochastic reward.
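To make the interaction protocol concrete, the following minimal sketch simulates a single round. It rests on simplifying assumptions that are not part of the formulation above: each arm is taken to be available independently with its own probability (the formulation allows an arbitrary distribution over subsets of arms), and the names `sample_availability`, `pull`, `avail_prob`, `mu` and `weights` are hypothetical.

```python
import random

def sample_availability(avail_prob, rng):
    """Draw an availability set; assumes arms are available independently."""
    return {i for i, p in enumerate(avail_prob) if rng.random() < p}

def pull(chosen, available, mu, weights, m, rng):
    """Pull the arms in `chosen`; return (per-arm rewards, weighted total reward)."""
    if not chosen <= available:
        raise ValueError("only available arms can be pulled")
    if len(chosen) > m:
        raise ValueError("at most m arms can be pulled per round")
    # Semi-bandit feedback: one Bernoulli reward is observed per pulled arm.
    rewards = {i: 1 if rng.random() < mu[i] else 0 for i in chosen}
    total = sum(weights[i] * x for i, x in rewards.items())
    return rewards, total
```

The agent sees only `rewards` for the arms it pulled, never the rewards of the arms it left out, which is what distinguishes semi-bandit feedback from full-information feedback.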
The objective of the learning agent is to pull the arms sequentially to maximize the expected time-averaged rewards over rounds, i.e., .
Furthermore, we consider the long-term fairness constraints proposed in , where each arm is expected to be pulled at least times when the time horizon is long enough, i.e.,
We say a vector is feasible if there exists a policy such that (1) is satisfied.
If we knew the availability set distribution and the mean reward for each arm in advance, and was feasible, then there would be a randomized algorithm which is the optimal solution for CSMAB-F problems. The algorithm chooses arms with probability when observing available arms . Let . We can determine by solving the following problem:
where the first constraint is equivalent to the fairness constraints defined in (1), and the second constraint states that for each availability set , the probability space for choosing should be complete.
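The display for problem (2) is not reproduced in this text. As a reading aid only, one way to write a program that matches the verbal description (an objective equal to the expected weighted reward, a fairness constraint per arm, and a normalization constraint per availability set) is sketched below. All notation is assumed rather than taken from the source: $P_A(Z)$ is the probability that the availability set is $Z$, $q(S, Z)$ is the probability of pulling $S$ when $Z$ is observed, $w_i$ and $\mu_i$ are the weight and mean reward of arm $i$, $r_i$ is its required long-run selection fraction, $m$ is the cardinality limit and $\mathcal{N}$ is the arm set.

```latex
\begin{align*}
\max_{q}\quad
  & \sum_{Z \subseteq \mathcal{N}} P_A(Z)
    \sum_{S \subseteq Z,\ |S| \le m} q(S, Z) \sum_{i \in S} w_i \mu_i \\
\text{s.t.}\quad
  & \sum_{Z \subseteq \mathcal{N}} P_A(Z)
    \sum_{S \subseteq Z,\ |S| \le m,\ i \in S} q(S, Z) \ \ge\ r_i,
    \qquad \forall i \in \mathcal{N}, \\
  & \sum_{S \subseteq Z,\ |S| \le m} q(S, Z) = 1,
    \qquad \forall Z \subseteq \mathcal{N}, \\
  & q(S, Z) \in [0, 1].
\end{align*}
```

Under this reading the program is linear in $q$, but the number of variables grows exponentially with the number of arms, so it serves as a benchmark rather than as a practical algorithm.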
Denote the optimal solution to (2) as , i.e., the optimal policy pulls with probability when observing an available arm set . We denote by the arms pulled by the optimal policy in round .
are unknown in advance, and the learning agent can only observe the available arms and the random rewards of the pulled arms. Therefore, the learning agent faces the dilemma between exploration and exploitation, i.e., in each round, the agent can either explore (acquire information to estimate the mean reward of each arm) or exploit (accumulate as much reward as possible). The quality of the agent’s policy is measured by the time-averaged regret, which is the performance loss caused by not always performing the optimal actions. Considering the stochastic availability of each arm, we define the time-averaged regret as follows:
4 Thompson Sampling with Beta Priors and Bernoulli Likelihoods for CSMAB-F (TSCSF-B)
The key challenges in designing an effective and efficient algorithm for the CSMAB-F problem are twofold. First, the algorithm should balance exploration and exploitation well in order to achieve a low time-averaged regret. Second, the algorithm should strike a good balance between satisfying the fairness constraints and accumulating more reward.
To address the first challenge, we adopt the Thompson sampling technique with beta priors and Bernoulli likelihoods to achieve the tradeoff between exploration and exploitation. The main idea is to assume a beta prior distribution with the shape parameters and (i.e., ) on the mean reward of each arm . Initially, we let , since we have no knowledge about each and is a uniform distribution on the unit interval. Then, after observing the available arms , we draw a sample from as an estimate for , and pull arms according to (6) as discussed later. The arms in return rewards , which are used to update the beta distributions based on Bayes' rule and the Bernoulli likelihood for all arms in :
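The update equation is not reproduced in this text. The sketch below shows the standard conjugate Beta-Bernoulli update that the description points to, under the assumption that a reward of 1 increments the first shape parameter and a reward of 0 increments the second. The function names are hypothetical.

```python
import random

def init_posteriors(n_arms):
    """Beta(1, 1) prior for every arm, i.e. uniform on the unit interval."""
    return [[1, 1] for _ in range(n_arms)]

def sample_means(posteriors, available, rng):
    """Draw one posterior sample per available arm as its mean-reward estimate."""
    return {i: rng.betavariate(*posteriors[i]) for i in available}

def update(posteriors, rewards):
    """Conjugate update for the pulled arms only.

    `rewards` maps each pulled arm to its observed 0/1 reward.
    """
    for i, x in rewards.items():
        posteriors[i][0] += x       # successes raise the first shape parameter
        posteriors[i][1] += 1 - x   # failures raise the second shape parameter
```

Arms that are unavailable or not pulled keep their posterior unchanged, so their samples stay widely spread and they continue to be explored when they next become available.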
After a number of rounds, we are able to see that the mean of the posterior beta distributions will converge to the true mean of the reward distributions.
The virtual queue technique [18, 19] can be used to ensure that the fairness constraints are satisfied. The high-level idea behind the design is to establish a time-varying queue to record the number of times that arm has failed to meet the fairness requirement. Initially, we set for all . For ease of presentation, let be a binary random variable indicating whether arm is pulled or not in round . Then for each arm , we maintain the queue as follows:
Intuitively, the length of the virtual queue for arm increases if the arm is not pulled in round . Therefore, arms with longer queues have been treated more unfairly and will be given a higher priority to be pulled in future rounds. Note that our queue evolution is slightly different from that of [18], in order to avoid the accumulation of rounding errors.
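The exact queue update is not reproduced in this text, so the following is only a sketch of the idea under stated assumptions. `recursive_queue` is the textbook virtual-queue recursion (add the required selection fraction `r`, subtract 1 if the arm was pulled, clip at zero). `cumulative_queue` is one hypothetical way to avoid accumulating floating-point error, by recomputing the backlog from the round index and an integer pull counter. Neither function is claimed to be the authors' equation, and the two forms are not equivalent in general because clipping happens at different points.

```python
def recursive_queue(q_prev, r, pulled):
    """Textbook recursion: the backlog grows by r and shrinks by 1 on a pull."""
    return max(q_prev + r - (1 if pulled else 0), 0.0)

def cumulative_queue(t, r, pull_count):
    """Backlog recomputed from counters: required pulls so far minus actual pulls."""
    return max(t * r - pull_count, 0.0)
```

In the cumulative form, the only floating-point operations are one multiplication and one subtraction per evaluation, so errors do not build up over rounds.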
To further balance the fairness and the reward, we introduce another parameter as a tradeoff between the reward and the virtual queue lengths. Then, in each round , the learning agent pulls arms as follows:
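The selection rule itself is not reproduced in this text. The sketch below assumes a common form for such rules: each available arm receives the score `queues[i] / eta + weights[i] * theta[i]`, and the agent pulls the (at most `m`) available arms with the largest scores. Because the assumed objective is additive over arms and every score is non-negative, taking the top `m` scores maximizes the sum. The function name and argument names are hypothetical.

```python
def select_arms(available, queues, theta, weights, eta, m):
    """Return up to m available arms with the largest combined score."""
    def score(i):
        # Fairness debt scaled by 1/eta, plus the weighted sampled mean reward.
        return queues[i] / eta + weights[i] * theta[i]

    # Sort by descending score; ties are broken by arm index for determinism.
    ranked = sorted(available, key=lambda i: (-score(i), i))
    return set(ranked[:m])
```

With `eta = float("inf")` the queue term vanishes and the rule reduces to picking the arms with the highest weighted samples, which corresponds to ignoring the fairness constraints.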
5 Results and Proofs
5.1 Fairness Satisfaction
Theorem 1. For any fixed and finite , when is long enough and the fairness constraint vector is feasible, the proposed TSCSF-B algorithm satisfies the long-term fairness constraints defined in (1).
Proof 5.1 (Proof Sketch)
The main idea of the proof of Theorem 1 is to show that the virtual queue of each arm is stable when is feasible and is long enough, for any fixed and finite . The proof is based on Lyapunov-drift analysis [19]. Since it is not our main contribution and it follows similar lines to the proof of Theorem 1 in [18], we omit the proof. Interested readers are referred to [18].
The long-term fairness constraints do not require arms to be pulled a certain number of times in each round, but rather by the end of the time horizon. Theorem 1 states that the fairness constraints can always be satisfied by TSCSF-B as long as is finite and is long enough. A higher may require a longer time for the fairness constraints to be satisfied (see Sec. 6).
5.2 Regret Bounds
Theorem 2. For any fixed and , the time-averaged regret of TSCSF-B is upper bounded by
Proof 5.2 (Proof Sketch)
We only provide a proof sketch here; the detailed proof can be found in Appendix A. The optimal policy for CSMAB-F is the randomized algorithm defined in Sec. 3, whereas the optimal policies for classic MAB problems are deterministic. We follow the basic idea in [18] and bound the regret between the randomized optimal policy and TSCSF-B (i.e., the regret) by the regret between a deterministic oracle and TSCSF-B. The deterministic oracle also knows the mean reward of each arm, and can achieve more reward than the optimal policy by slightly sacrificing the fairness constraints. Denote the arms pulled by the oracle in round as , which is defined by
Then, we can prove that the time-averaged regret defined in (3) is bounded by
where the first term is due to the queuing system, and the second part is due to the exploration and exploitation.
Next, we define two events and their complementary events for each arm to decompose . Let , where is the number of times that arm has been pulled at the beginning of round . Then for each arm , the two events and are defined as follows:
and let and be the complementary events for and , respectively. Notice that both and are low-probability events after a number of rounds.
With the events defined above, we can decompose as
Using the relationship between the summation and integration, we can bound and by .
Bounding and is the main theoretical contribution of our work and is not trivial. Since and are low-probability events, the total number of times they occur is bounded by a constant in expectation. Therefore, to bound and , the basic idea is to obtain the bounds for and . Currently, there is no existing work giving the bounds for and , and we prove that and are bounded by and , respectively. Then, it is straightforward to bound and by .
Compared with the time-averaged regret bound for LFG [18], we have the same first term , as we adopt the virtual queue system to satisfy the fairness constraints. On the other hand, the second part of our regret bound, which is also the first problem-independent regret bound for CSMAB problems, has lower coefficients than that of LFG. Specifically, the coefficient for the time-dependent term (i.e., ) is in our bound, smaller than in that of LFG, and the time-independent term (i.e., ) has a coefficient in our bound, which is also less than in the bound of LFG.
If we let , the algorithm only focuses on CSMAB problems, and the bound boils down to the first problem-independent bound of TS-based algorithms for CSMAB problems, which matches the lower bound proposed in .
Corollary 1. For any fixed and , when , the time-averaged regret of TSCSF-B is upper bounded by
The reason we let for a given is to keep the first term of the same or lower order than the second term. However, in practice, we need to tune according to such that both the fairness constraints and high rewards can be achieved.
6 Evaluations and Applications
6.1 Numerical Experiments
In this section, we compare the TSCSF-B algorithm with the LFG algorithm [18] in two settings. The first setting is identical to the setting in [18], where , , and . The mean reward vector for the three arms is . The availability of the three arms is , and the fairness constraints for the three arms are . To see the impact of on the time-averaged regret and fairness constraints, we compare the algorithms under and in a time horizon , where indicates that both algorithms do not consider the long-term fairness constraints.
Further, we test the algorithms in a more complicated setting where , , and . The mean reward vector for the six arms is . The availability of the six arms is , and the fairness constraints for the six arms are . This setting is challenging because higher fairness constraints are given to the arms with lower mean rewards and the arms with lower availability (i.e., arms and ). According to Corollary 1, we set and , and . Note that the following results are the average of independent experiments. We omit confidence intervals and deviations from the plots because they are too small to be seen in the figures; they are also omitted in most bandit papers.
6.1.1 Time-Averaged Regret
In each subplot, the x-axis represents the rounds and the y-axis is the time-averaged regret. An inset in each subplot zooms in on the first rounds. We also plot the optimal solution that considers fairness (Opt-F) (i.e., the optimal solution to CSMAB-F), and the optimal solution that does not consider fairness (Opt-NF) (i.e., the optimal solution to CSMAB). The time-averaged regret of Opt-NF is always below that of Opt-F, since Opt-NF does not need to satisfy the fairness constraints and can always achieve the highest rewards. By definition, the regret of Opt-F is always .
We can see that the proposed TSCSF-B algorithm performs better than the LFG algorithm, since it converges faster and achieves a lower regret, as shown in Fig. 1 and Fig. 2. It is noteworthy that the gap between TSCSF-B and LFG is larger in Fig. 2, which indicates that TSCSF-B performs better than LFG in more complicated scenarios.
In terms of , the algorithms with a higher can achieve a lower time-averaged regret. For example, in the first setting, the lowest regrets achieved by the two considered algorithms are around when , but they are much closer to Opt-F when . However, when we continue to increase to (see Fig. 1(c)), the considered algorithms achieve a negative time-averaged regret around rounds, but return to positive values afterwards. This is because, with a high , the algorithms preferentially pull the arms with the highest mean rewards, while the queues still ensure that fairness can be achieved in future rounds. When (see Fig. 1(d) and Fig. 2(b)), the fairness constraints are totally ignored and the regrets of the considered algorithms converge to that of Opt-NF. Therefore, significantly determines whether, and how quickly, the algorithms satisfy the fairness constraints.
6.1.2 Fairness Constraints
In the first setting, we show in Fig. 3 the final satisfaction of the fairness constraints for all arms under . is an interesting setting where, as mentioned above, the fairness constraints are not satisfied in the first few rounds. We want to point out that, in the first setting, the fairness constraint for arm is relatively difficult to satisfy, since arm has the lowest mean reward but a relatively high fairness constraint. However, we can see that the fairness constraints for all arms are eventually satisfied, which means both TSCSF-B and LFG are able to ensure the fairness constraints in this simple setting.
In the second setting with , the fairness constraints for arms and are difficult to satisfy, as both arms have high fairness constraints but low availability or low mean reward. However, both TSCSF-B and LFG manage to satisfy the fairness constraints for all the arms, as shown in Fig. 4.
6.2 Tightness of the Upper Bounds
Finally, we show the tightness of our bounds in the second setting, as plotted in Fig. 5. The x-axis represents the change of the time horizon , and the y-axis is the logarithmic time-averaged regret in the base of .
We can see that the upper bound of TSCSF-B is always below that of LFG. However, there is a big gap between the TSCSF-B upper bound and the actual time-averaged regret in the second setting. This is reasonable, since the upper bound is problem-independent, but it is still of interest to find tighter bounds for CSMAB-F problems.
6.3 High-Rating Movie Recommendation System
In this part, we consider a high-rating movie recommendation system. The objective of the system is to recommend high-rating movies to users, but the ratings for the considered movies are unknown in advance. Thus, the system needs to learn the ratings of the movies while simultaneously recommending the high-rating ones to its users. Specifically, when each user comes, the movies that are relevant to the user’s preference are available to be recommended. Then, the system recommends the user with a subset of the available movies subject. After consuming the recommended movies, the user gives feedback to the system, which can be used to update the ratings of the movies to better serve the upcoming users. In order to acquire accurate ratings or to ensure the diversity of recommended movies, each movie should be recommended at least a number of times.
The above high-rating movie recommendation problem can be modeled as a CSMAB-F problem under three assumptions. First, we take serving one user as a round by assuming the next user always comes after the current user finishes rating. This assumption can be easily relaxed by adopting the delayed feedback framework with an additive penalty to the regret [14]. Second, the availability set of movies is stochastically generated according to an unknown distribution. Last, given a movie, the ratings are i.i.d. over users with respect to an unknown distribution. The second and third assumptions are reasonable, as it has been discovered that user preferences and ratings towards movies have a strong relationship to the Zipf distribution [4, 10].
We implement TSCSF-B and LFG on the MovieLens 20M Dataset [11], which includes million ratings to movies by users. This dataset contains both users’ movie ratings between and and genre categories for each movie. In order to compare the proposed TSCSF-B algorithm to the LFG algorithm, we select movies with different genres as the ground set of arms , which are Toy Story (1995) in the genre of Adventure, Braveheart (1995) in Action, Pulp Fiction (1994) in Comedy, Godfather, The (1972) in Crime, and Alien (1979) in Horror.
Then, we count the total number of ratings on the selected genres and calculate the occurrence of each selected genre among the genres as the availability of the corresponding selected movie. We note that the availability of the selected movies is only used by the OPT-F algorithm and is not used to determine the available set of movies in each round. During the simulation, when each user comes, the available set of movies is determined by whether or not the user has rated these movies in the dataset.
The ratings are scaled into to satisfy as the rewards. We choose users who have rated at least one of the selected movies as the number of rounds (one round one user according to the first assumption) for the algorithms, and take their ratings as the rewards to the recommended movies. When each user comes, the system will select no more than movies for recommendation and each movie shares the same weight, i.e., , and the same fairness constraints. The fairness constraints are set as such that (2) has a feasible solution.
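The per-round preprocessing described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: ratings are assumed to lie on the 0.5 to 5.0 star scale used by MovieLens 20M, the scaling is assumed to be a division by the maximum rating (the exact scaling is not reproduced in this text), and `available_movies`, `scaled_reward` and `user_ratings` are hypothetical names.

```python
MAX_RATING = 5.0  # assumption: MovieLens 20M ratings range from 0.5 to 5.0 stars

def available_movies(user_ratings, selected):
    """Movies from the selected set that this user has rated in the dataset."""
    return {movie for movie in selected if movie in user_ratings}

def scaled_reward(user_ratings, movie):
    """Map a star rating into [0, 1] by dividing by the maximum rating (assumed)."""
    return user_ratings[movie] / MAX_RATING
```

For a user who rated only one of the selected movies, the availability set for that round contains just that movie, and its scaled rating is the reward observed if the movie is recommended.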
We adopt such an implementation, including the determination of the available movie set, the same movie weights, and the same fairness constraints, to ensure that our simulation introduces as little noise as possible into the MovieLens dataset.
We first show whether the considered algorithms are able to achieve accurate ratings. The final ratings of the selected movies by TSCSF-B and LFG under and are shown in Fig. 6(a). We set following Corollary 1. We can observe that the performance of TSCSF-B is better than that of LFG, since the ratings of TSCSF-B are much closer to the true average ratings, while the ratings acquired by the UCB-based LFG are higher than the true average ratings.
The final satisfaction of the fairness constraints of the selected movies is shown in Fig. 6(b). Both TSCSF-B and LFG can satisfy the fairness constraints of the five movies under .
On the other hand, the time-averaged regret is shown in Fig. 7. We can see that the time-averaged regret of TSCSF-B is below that of LFG, which indicates the proposed TSCSF-B algorithm converges much faster. Since we are unable to obtain the true distribution of the available movie set (as discussed in Setup), the rewards achieved by the OPT-F algorithm may not be the optimal one, which explains why the lines of both TSCSF-B and LFG are below that of OPT-F in Fig. 7.
Generally, TSCSF-B performs much better than LFG in this application, as it achieves better final ratings and faster convergence.
In this paper, we studied the stochastic combinatorial sleeping multi-armed bandit problem with fairness constraints, and designed the TSCSF-B algorithm with a provable problem-independent bound of when . Both numerical experiments and a real-world application were conducted to verify the performance of the proposed algorithm.
As part of the future work, we would like to derive a more rigorous relationship between and such that the algorithm can always satisfy the fairness constraints and achieve high rewards given any , as well as tighter bounds.
[1] (2012) Analysis of Thompson Sampling for the Multi-Armed Bandit Problem. In Proc. Conference on Learning Theory (COLT), Vol. 23, pp. 39.1–39.26.
[2] (2017-09) Near-Optimal Regret Bounds for Thompson Sampling. Journal of the ACM 64 (5), pp. 30:1–30:24.
[3] Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Foundations and Trends® in Machine Learning 5 (1), pp. 1–122.
[4] (2007) I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System. In Proc. ACM SIGCOMM Conference on Internet Measurement (IMC), New York, NY, USA, pp. 1–14.
[5] (2011) An Empirical Evaluation of Thompson Sampling. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 2249–2257.
[6] Analysis of Thompson Sampling for Stochastic Sleeping Bandits. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI).
[7] (2013) Combinatorial Multi-armed Bandit: General Framework and Applications. In Proc. International Conference on Machine Learning (ICML), pp. 151–159.
[8] (2015) Combinatorial Bandits Revisited. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 2116–2124.
[9] (2012) Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations. IEEE/ACM Transactions on Networking (TON) 20 (5), pp. 1466–1478.
[10] (2007) Youtube Traffic Characterization: A View from the Edge. In Proc. ACM SIGCOMM Conference on Internet Measurement (IMC), New York, NY, USA, pp. 15–28.
[11] (2016) The Movielens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5 (4), pp. 19.
[12] (2019) Intelligent Caching Algorithms in Heterogeneous Wireless Networks with Uncertainty. In Proc. IEEE Conference on Distributed Computing Systems (ICDCS).
[13] (2019) Problem-Dependent Regret Bounds for Online Learning with Feedback Graphs. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI).
[14] (2013) Online Learning under Delayed Feedback. In Proc. International Conference on Machine Learning (ICML), pp. 1453–1461.
[15] Hardness of Online Sleeping Combinatorial Optimization Problems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2181–2189.
[16] (2010) Regret Bounds for Sleeping Experts and Bandits. Machine Learning 80 (2-3), pp. 245–272.
[17] (2015) Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits. In Proc. Artificial Intelligence and Statistics (AISTATS), pp. 535–543.
[18] (2019-05) Combinatorial Sleeping Bandits with Fairness Constraints. In Proc. IEEE Conference on Computer Communications (INFOCOM), pp. 1702–1710.
[19] (2010) Stochastic Network Optimization with Application to Communication and Queueing Systems. Synthesis Lectures on Communication Networks 3 (1), pp. 1–211.
[20] (2014) Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2780–2788.
[21] (2018) Thompson Sampling for Combinatorial Semi-Bandits. In Proc. International Conference on Machine Learning (ICML), pp. 5101–5109.
Appendix A
A.1 Notations and Facts
Recall that is the number of times that arm has been pulled at the beginning of round . Recall is the empirical mean of arm at the beginning of round . Therefore, we have .
For each arm , we have two events and defined as follows:
Define as the history of the plays until time , i.e., , where is the arm pulled in round .
Recall that the arms pulled by the deterministic oracle in round are , which is defined by
Let indicate whether arm is played by TSCSF-B in round . In the same way, let indicate whether arm is played by in round .
Fact 1 (Chernoff bound). Let be independent 0-1 random variables such that . Let . Then, for any ,
and for any ,
Fact 2 (Hoeffding inequality). Let be random variables with common range and such that . Let . Then, for all ,
Fact 3 (Relationship between beta and Binomial distributions). Let be the cdf of the beta distribution with parameters and , and let be the cdf of the Binomial distribution with parameters and . Then, we have
for all positive integers and .
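The identity in Fact 3 is not reproduced in this text. The form commonly stated in the Thompson sampling literature is that the Beta(a, b) cdf at y equals one minus the Binomial(a + b - 1, y) cdf at a - 1, for positive integers a and b; treat this as an assumption about the omitted equation. The sketch below checks that form numerically using only the standard library. The helper names are hypothetical, and the beta cdf is obtained by Simpson integration of the density rather than by a library routine.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def beta_cdf(y, a, b, steps=2000):
    """Beta(a, b) cdf at y for positive integers a, b (composite Simpson rule)."""
    norm = a * comb(a + b - 1, a)  # equals 1 / B(a, b) when a, b are positive integers

    def pdf(x):
        return norm * x ** (a - 1) * (1 - x) ** (b - 1)

    h = y / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(y)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return total * h / 3

def identity_gap(y, a, b):
    """Absolute difference between the two sides of the assumed identity."""
    return abs(beta_cdf(y, a, b) - (1 - binomial_cdf(a - 1, a + b - 1, y)))
```

A gap close to zero for several parameter choices is consistent with the identity, which is what allows tail probabilities of the beta posterior to be handled with binomial tail bounds such as Facts 1 and 2.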
A.2 Proof of Theorem 2
The time-averaged regret of TSCSF-B can be upper bounded by
For all , the probability that event happens is upper bounded as follows:
For all , , the probability that event happens is upper bounded as follows:
Define event as the complementary event of as follows:
Then, we can decompose as
Since , is therefore bounded by
where the second inequality is due to Lemma 2.
Next, we show how to bound . Let be the round when arm is played for the -th time, i.e., . can be bounded as follows:
The last term in (9) can be further written as follows:
where the last inequality is due to Jensen’s inequality. Also by Jensen’s inequality, we have
Therefore, we can bound by
where the inequality is due to the fact that at most arms are selected in each round. Therefore, we have bounded by
Combining and gives
Define an event as the complementary event of :
can be decomposed as
Let be the round when arm is played for the -th time by policy , i.e., . Since , we can write as follows: