1 Introduction
In this paper, we focus on a recent variant of multi-armed bandit (MAB) problems: the combinatorial sleeping MAB with long-term fairness constraints (CSMAB-F) [18]. In CSMAB-F, a learning agent needs to pull a subset of the available arms simultaneously, subject to some constraints (usually a cardinality constraint), and only observes the reward of each pulled arm (the semi-bandit setting) in each round. Both the availability and the reward of each arm are stochastically generated, and long-term fairness among arms is further considered, i.e., each arm should be pulled at least a certain number of times over a long time horizon. The objective is to accumulate as much reward as possible within the finite time horizon. The CSMAB-F problem has a wide range of real-world applications. For example, in task assignment, we want each worker to be assigned a certain number of tasks (fairness constraints), while some workers may be unavailable in some time slots (sleeping arms). In movie recommendation systems that account for diversity, each movie genre should be recommended a certain number of times (fairness constraints), while genres a user dislikes should not be recommended to that user (sleeping arms).
Upper Confidence Bound (UCB) and Thompson Sampling (TS) are two well-known families of algorithms for stochastic MAB problems. Theoretically, TS is comparable to UCB [13, 2], but in practice TS usually outperforms UCB-based algorithms significantly [5]. However, while the theoretical performance of UCB-based algorithms has been extensively studied for various MAB problems [3], there are only a few theoretical results for TS-based algorithms [2, 6, 21].
In [18], a UCB-based algorithm called Learning with Fairness Guarantee (LFG) was devised, and a problem-independent regret bound was derived for the CSMAB-F problem, expressed in terms of the number of arms N, the maximal number K of arms that can be pulled simultaneously in each round, the maximum arm weight w_max, and a parameter η that balances fairness and reward. (If a regret bound depends on a specific problem instance, we call it problem-dependent; if it does not depend on any problem instance, we call it problem-independent.) However, as TS-based algorithms are usually comparable to UCB-based algorithms in theory but perform better in practice, we are motivated to devise TS-based algorithms and derive regret bounds for the CSMAB-F problem. The contributions of this paper can be summarized as follows.


We devise the first TS-based algorithm for CSMAB-F problems with a provable regret upper bound. To be fully comparable with LFG, we incorporate the virtual queue technique defined in [18], but modify the queue evolution process to reduce accumulated rounding errors.

Our regret bound is of the same polynomial order as the one achieved by LFG, but with lower coefficients. This again shows that TS-based algorithms can achieve theoretical guarantees comparable to those of UCB-based algorithms, with a tighter bound.

We validate the practical performance of our proposed algorithm through numerical experiments and real-world applications, which show that TSCSF-B does outperform LFG in practice.
It is noteworthy that our algorithmic framework and proof techniques are extensible to other MAB problems with other fairness definitions. Furthermore, if we do not consider the fairness constraints, our bound boils down to the first problem-independent regret upper bound of TS algorithms for CSMAB problems, and matches the lower regret bound [3].
The remainder of this paper is organized as follows. In Section 2, we summarize the works most related to CSMAB-F. The problem formulation of CSMAB-F is presented in Section 3, following [18] for comparison purposes. The proposed TS-based algorithm is presented in Section 4, with the main results, i.e., the fairness guarantee, performance bounds and proof sketches, presented in Section 5. Performance evaluations are presented in Section 6, followed by concluding remarks and future work in Section 7. Detailed proofs can be found in Appendix A.
2 Related Works
Many variants of the stochastic MAB problem have been proposed, and the corresponding regret bounds have been derived. The ones most related to our work are combinatorial MAB (CMAB) and its variants, first proposed and analyzed in [9]. In CMAB, an agent pulls a combination of arms simultaneously from a fixed arm set. Considering a semi-bandit feedback setting, i.e., the individual reward of each arm in the played combinatorial action can be observed, the authors of [7] derived a sublinear problem-dependent regret upper bound based on a UCB algorithm, and this bound was further improved in [17]. In [8], a problem-dependent regret lower bound was derived by constructing problem instances. Very recently, the authors of [21] derived a problem-dependent regret bound of TS-based algorithms for CMAB problems.
All the aforementioned works assume that the arm set from which the learning agent can pull arms is fixed over all rounds, i.e., all arms are always available to be pulled. In practice, however, some arms may be unavailable in some rounds; for example, some items for online sale are temporarily out of stock. Therefore, a line of work has studied the setting of MAB with sleeping arms (SMAB) [16, 6, 12, 15, 20]. In the SMAB setting, the set of available arms in each round, i.e., the availability set, can vary. For the simplest version of SMAB (only one arm is pulled in each round), the problem-dependent regret bounds of UCB-based and TS-based algorithms have been analyzed in [16] and [6], respectively.
Regarding the combinatorial SMAB setting (CSMAB), some negative results are shown in [15]: efficient no-regret learning algorithms are sometimes computationally hard to obtain. However, for settings such as stochastic availability with stochastic rewards, it is still possible to devise efficient learning algorithms with good theoretical guarantees [12, 18]. Most importantly for us, the authors of [18] considered a new variant called the combinatorial sleeping MAB with long-term fairness constraints (CSMAB-F). In this setting, fairness among arms is further considered, i.e., each arm needs to be pulled a certain number of times. The authors designed a UCB-based algorithm called Learning with Fairness Guarantee (LFG) and provided a problem-independent time-averaged regret upper bound.
Given the attractive practical performance and the lack of theoretical guarantees of TS-based algorithms for CSMAB-F, it is desirable to devise a TS-based algorithm and derive regret bounds for it. We are interested in problem-independent regret bounds, as they hold for all problem instances. In this work, we give the first provable regret bound that is of the same polynomial order as the one in [18] but with lower coefficients. To the best of our knowledge, the derived upper bound is also the first problem-independent regret bound of TS-based algorithms for CSMAB problems, and it matches the lower regret bound [17] when the long-term fairness constraints are relaxed.
3 Problem Formulation
In this section, we present the problem formulation of the stochastic combinatorial sleeping multi-armed bandit problem with long-term fairness constraints (CSMAB-F), following [18] closely for comparison purposes. To state the problem clearly, we first introduce the CSMAB problem and then incorporate the fairness constraints.
Let the arm set be {1, 2, …, N}. At the beginning of each round t, a set of arms A_t is revealed to a learning agent according to a fixed but unknown distribution P_A over subsets of the arm set, i.e., A_t ∼ P_A. We call A_t the availability set in round t. Meanwhile, each arm i is associated with a random reward X_i(t) drawn from a fixed Bernoulli distribution with an unknown mean μ_i (we consider Bernoulli distributions only for brevity; it is feasible to extend our algorithm and analysis with little modification to other general distributions, see [1, 2]), and a fixed, known non-negative weight w_i. Note that for all the arms in A_t, rewards are drawn independently in each round t. Then the learning agent pulls a subset of arms S_t from the availability set with cardinality no more than K, i.e., S_t ⊆ A_t with |S_t| ≤ K, and receives a weighted random reward Σ_{i∈S_t} w_i X_i(t). In this work, we consider the semi-bandit feedback setting, consistent with [18], i.e., the learning agent can observe the individual random reward of every arm in S_t. Note that since the availability set and the random rewards of the arms are all drawn from fixed distributions, we are in the bandit setting of stochastic availability and stochastic rewards.
The objective of the learning agent is to pull arms sequentially so as to maximize the expected time-averaged reward over T rounds, i.e., (1/T) E[ Σ_{t=1}^{T} Σ_{i∈S_t} w_i X_i(t) ].
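To make the setting concrete, here is a minimal simulation sketch of one CSMAB round. The arm-selection rule is a deliberately naive placeholder, and all names and numbers are our own illustration, not from the paper:

```python
import random

def draw_round(mu, avail_prob, weights, K, rng=random):
    """Simulate one CSMAB round: stochastic availability, Bernoulli rewards.

    mu[i]        : unknown mean reward of arm i (known only to the simulator)
    avail_prob[i]: probability that arm i is available this round
    weights[i]   : fixed, known non-negative weight of arm i
    K            : cardinality constraint (at most K arms may be pulled)
    """
    N = len(mu)
    # Availability set A_t: here each arm is available independently;
    # the paper only assumes A_t is drawn from some fixed distribution.
    A_t = [i for i in range(N) if rng.random() < avail_prob[i]]
    # A naive agent: pull up to K available arms uniformly at random.
    S_t = rng.sample(A_t, min(K, len(A_t)))
    # Semi-bandit feedback: observe the individual Bernoulli reward of
    # every pulled arm, and collect the weighted sum.
    rewards = {i: int(rng.random() < mu[i]) for i in S_t}
    total = sum(weights[i] * rewards[i] for i in S_t)
    return A_t, S_t, rewards, total
```

A learning algorithm would replace the random subset choice with an informed selection rule, which is exactly what Section 4 develops.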
Furthermore, we consider the long-term fairness constraints proposed in [18], where each arm i is expected to be pulled at least a fraction r_i of the time when the horizon is long enough. Letting d_i(t) ∈ {0, 1} indicate whether arm i is pulled in round t, the constraints read

lim inf_{T→∞} (1/T) Σ_{t=1}^{T} E[d_i(t)] ≥ r_i,  ∀ i.   (1)
We say the vector r = (r_1, …, r_N) is feasible if there exists a policy such that (1) is satisfied. If we knew the availability distribution P_A and the mean reward μ_i of each arm in advance, and r was feasible, then there would exist a randomized algorithm that is optimal for the CSMAB-F problem. When observing the availability set Z, the algorithm chooses the arms S ⊆ Z (with |S| ≤ K) with probability q_S^Z. Let q collect all such probabilities. We can determine q by solving the following problem:

max_q  Σ_Z P_A(Z) Σ_{S⊆Z, |S|≤K} q_S^Z Σ_{i∈S} w_i μ_i
s.t.   Σ_Z P_A(Z) Σ_{S⊆Z: i∈S} q_S^Z ≥ r_i,  ∀ i,
       Σ_{S⊆Z, |S|≤K} q_S^Z = 1,  ∀ Z,
       q_S^Z ≥ 0,   (2)
where the first constraint is equivalent to the fairness constraints defined in (1), and the second constraint states that, for each availability set Z, the probabilities of choosing the subsets of Z form a complete probability distribution.
Denote the optimal solution to (2) by q*, i.e., the optimal policy pulls S with probability q*_S^Z when observing the availability set Z. We denote by S_t* the arms pulled by the optimal policy in round t.
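Since (2) is a linear program in the variables q_S^Z, the optimal randomized policy for a small instance can be computed directly. The sketch below, a toy instance with our own numbers and assuming `scipy` is available, enumerates the variables and solves the LP with `scipy.optimize.linprog`:

```python
from itertools import combinations
from scipy.optimize import linprog

# Toy instance (our own numbers, for illustration only).
mu = [0.8, 0.5, 0.3]             # mean rewards
w  = [1.0, 1.0, 1.0]             # weights
r  = [0.3, 0.3, 0.3]             # fairness fractions
K  = 1                           # pull at most K arms per round
# Availability distribution P_A over sets Z (must sum to 1).
P_A = {(0, 1): 0.5, (1, 2): 0.3, (0, 1, 2): 0.2}

# Enumerate decision variables q_S^Z: one per (Z, S) with S ⊆ Z, |S| ≤ K.
variables = [(Z, S) for Z in P_A
             for k in range(K + 1) for S in combinations(Z, k)]

# linprog minimizes, so negate the expected weighted reward.
c = [-P_A[Z] * sum(w[i] * mu[i] for i in S) for Z, S in variables]

# Fairness: sum_Z P_A(Z) sum_{S: i in S} q_S^Z >= r_i  (as -lhs <= -r_i).
A_ub = [[-P_A[Z] * (i in S) for Z, S in variables] for i in range(len(mu))]
b_ub = [-ri for ri in r]

# Completeness: for each Z, the q_S^Z form a probability distribution.
A_eq = [[1.0 if Z == Z0 else 0.0 for Z, S in variables] for Z0 in P_A]
b_eq = [1.0] * len(P_A)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * len(variables))
```

Note that the number of variables grows with the number of availability sets and subsets, which is why this LP is only usable as an offline benchmark (OptF), not as a learning algorithm.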
However, P_A and the mean rewards μ_i are unknown in advance, and the learning agent can only observe the available arms and the random rewards of the pulled arms. Therefore, the learning agent faces the dilemma between exploration and exploitation, i.e., in each round the agent can either explore (acquire information to estimate the mean reward of each arm) or exploit (accumulate as much reward as possible). The quality of the agent's policy is measured by the time-averaged regret, which is the performance loss caused by not always performing the optimal actions. Considering the stochastic availability of the arms, we define the time-averaged regret as follows:

R(T) = (1/T) E[ Σ_{t=1}^{T} ( Σ_{i∈S_t*} w_i X_i(t) − Σ_{i∈S_t} w_i X_i(t) ) ].   (3)
4 Thompson Sampling with Beta Priors and Bernoulli Likelihoods for CSMAB-F (TSCSF-B)
The key challenges in designing an effective and efficient algorithm for the CSMAB-F problem are twofold. First, the algorithm should balance exploration and exploitation well in order to achieve a low time-averaged regret. Second, the algorithm should strike a good balance between satisfying the fairness constraints and accumulating rewards.
To address the first challenge, we adopt the Thompson sampling technique with beta priors and Bernoulli likelihoods to trade off exploration and exploitation. The main idea is to place a beta prior with shape parameters a_i and b_i, i.e., Beta(a_i, b_i), on the mean reward μ_i of each arm i. Initially, we let a_i(1) = b_i(1) = 1, since we have no knowledge about each μ_i and Beta(1, 1) is the uniform distribution on [0, 1]. Then, after observing the available arms A_t, we draw a sample θ_i(t) from Beta(a_i(t), b_i(t)) as an estimate for μ_i, and pull arms S_t according to (6) as discussed later. The arms in S_t return rewards, which are used to update the beta distributions by Bayes' rule with a Bernoulli likelihood for all arms i ∈ S_t:

a_i(t+1) = a_i(t) + X_i(t),  b_i(t+1) = b_i(t) + 1 − X_i(t).   (4)
After a number of rounds, the means of the posterior beta distributions concentrate around the true means of the reward distributions.
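A minimal sketch of the per-arm Thompson sampling state described above (the class and method names are our own):

```python
import random

class BetaArm:
    """Thompson-sampling state for one arm: a Beta(a, b) posterior over
    a Bernoulli mean reward, starting from the uniform Beta(1, 1) prior."""

    def __init__(self):
        self.a, self.b = 1, 1

    def sample(self):
        # Draw theta_i(t) ~ Beta(a, b) as this round's estimate of mu_i.
        return random.betavariate(self.a, self.b)

    def update(self, reward):
        # Bayes' rule with a Bernoulli likelihood: a success increments a,
        # a failure increments b.
        self.a += reward
        self.b += 1 - reward

    def mean(self):
        # Posterior mean, which concentrates around the true mu_i.
        return self.a / (self.a + self.b)
```

Early on, `sample()` is spread widely over [0, 1] (exploration); as observations accumulate, the posterior narrows and samples concentrate near the empirical mean (exploitation).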
The virtual queue technique [18, 19] can be used to ensure that the fairness constraints are satisfied. The high-level idea is to maintain a time-varying queue Q_i(t) recording how far arm i has fallen behind its fairness target. Initially, we set Q_i(1) = 0 for all arms. For ease of presentation, let d_i(t) be a binary random variable indicating whether arm i is pulled in round t. Then for each arm i, the queue is maintained as follows:

Q_i(t+1) = max( Q_i(t) + r_i − d_i(t), 0 ).   (5)
Intuitively, the virtual queue of arm i lengthens whenever the arm is not pulled in round t. Arms with longer queues therefore lag further behind their fairness constraints and are given a higher priority in future rounds. Note that our queue evolution is slightly different from [18] in order to avoid accumulating rounding errors.
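A minimal sketch of one round of the virtual-queue evolution, under our reading that each queue grows by the arm's fairness fraction and drains by one when the arm is pulled, clipped at zero:

```python
def update_queues(Q, r, pulled):
    """One round of virtual-queue evolution (our reading of the scheme).

    Q      : current queue lengths, one per arm
    r      : fairness fractions r[i]
    pulled : set of arm indices pulled this round

    Each queue grows by r[i] and drains by 1 if arm i was pulled,
    clipped at zero, so arms that keep missing their fairness target
    accumulate backlog.
    """
    return [max(q + ri - (i in pulled), 0.0)
            for i, (q, ri) in enumerate(zip(Q, r))]
```

An unpulled arm's queue thus rises by its fairness fraction every round, while a pulled arm's queue shrinks (or stays at zero), which is exactly the priority signal used in the selection rule.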
To further balance fairness and reward, we introduce another parameter η > 0 as a trade-off between the reward and the virtual queue lengths. Then, in each round t, the learning agent pulls arms as follows:

S_t = argmax_{S⊆A_t, |S|≤K} Σ_{i∈S} ( Q_i(t) + η w_i θ_i(t) ).   (6)
5 Results and Proofs
5.1 Fairness Satisfaction
Theorem 1
For any fixed and finite η, when T is long enough and the fairness constraint vector r is feasible, the proposed TSCSF-B algorithm satisfies the long-term fairness constraints defined in (1).
Proof 5.1 (Proof Sketch)
The main idea in proving Theorem 1 is to show that the virtual queue of each arm is stable when r is feasible and T is long enough, for any fixed and finite η. The proof is based on Lyapunov drift analysis [19]. Since it is not our main contribution and follows similar lines to the proof of Theorem 1 in [18], we omit it; interested readers are referred to [18].
Remark 1
The long-term fairness constraints do not require arms to be pulled a certain number of times in each round, but only by the end of the time horizon. Theorem 1 states that the fairness constraints can always be satisfied by TSCSF-B as long as η is finite and T is long enough. A higher η may require a longer time for the fairness constraints to be satisfied (see Sec. 6).
5.2 Regret Bounds
Theorem 2
For any fixed η and T, the time-averaged regret of TSCSF-B is upper bounded by
Proof 5.2 (Proof Sketch)
We only provide a proof sketch here; the detailed proof can be found in Appendix A. The optimal policy for CSMAB-F is a randomized algorithm as defined in Sec. 3, while the optimal policies for classic MAB problems are deterministic. Following the basic idea in [18], we bound the regret between the randomized optimal policy and TSCSF-B (i.e., the regret in (3)) by the regret between a deterministic oracle and TSCSF-B. The deterministic oracle also knows the mean reward of each arm, and can achieve more reward than the optimal policy by slightly sacrificing the fairness constraints. Denote by S_t^o the arms pulled by the oracle in round t.
Then, we can prove that the time-averaged regret defined in (3) is bounded by (7), where the first term is due to the queueing system, and the second term is due to exploration and exploitation.
Next, we define two events and their complementary events for each arm to decompose the regret. Let h_i(t) denote the number of times arm i has been pulled at the beginning of round t. Then for each arm i, the two events A_i(t) and B_i(t) are defined as follows:
and let ¬A_i(t) and ¬B_i(t) be the complementary events of A_i(t) and B_i(t), respectively. Notice that both A_i(t) and B_i(t) are low-probability events after a number of rounds.
With the events defined above, we can further decompose the regret term. Using the relationship between summation and integration, we can bound the terms associated with the complementary events. Bounding the terms associated with A_i(t) and B_i(t) is the main theoretical contribution of our work and is non-trivial. Since A_i(t) and B_i(t) are low-probability events, the expected number of times they occur is a constant. Therefore, the basic idea is to bound Pr(A_i(t)) and Pr(B_i(t)). As no existing work gives these bounds, we prove them in Lemmas 2 and 3, from which the bounds on the corresponding regret terms follow straightforwardly.
Remark 2
Compared with the time-averaged regret bound for LFG [18], our bound has the same first term, as we adopt the same virtual queue system to satisfy the fairness constraints. On the other hand, the second term of our regret bound, which is also the first problem-independent regret bound for CSMAB problems, has lower coefficients than that of LFG: both the coefficient of the time-dependent term and that of the time-independent term are smaller in our bound than in the bound of LFG.
If we let r_i = 0 for all arms, the algorithm focuses on pure CSMAB problems, and the bound boils down to the first problem-independent bound of TS-based algorithms for CSMAB problems, which matches the lower bound proposed in [17].
Corollary 1
For any fixed T, when η is chosen appropriately, the time-averaged regret of TSCSF-B is upper bounded by
Remark 3
The reason we choose η as a function of T is to control the first term to be of the same or lower order than the second term. In practice, however, η needs to be tuned according to T so that both the fairness constraints and high rewards can be achieved.
6 Evaluations and Applications
6.1 Numerical Experiments
In this section, we compare the TSCSF-B algorithm with the LFG algorithm [18] in two settings. The first setting is identical to that in [18], with three arms and the same cardinality constraint, mean rewards, availabilities, and fairness constraints. To see the impact of η on the time-averaged regret and the fairness constraints, we compare the algorithms under several values of η over a fixed time horizon, where the largest setting of η effectively makes both algorithms ignore the long-term fairness constraints.
Further, we test the algorithms in a more complicated setting with six arms. This setting is challenging because higher fairness constraints are given to the arms with lower mean rewards and to the arms with lower availability. According to Corollary 1, we set η accordingly. Note that the following results are averaged over independent runs. We omit confidence intervals and deviations because they are too small to be visible in the figures and are also omitted in most bandit papers.
6.1.1 TimeAveraged Regret
The timeaveraged regret results under the first setting and the second setting are shown in Fig. 1 and Fig. 2, respectively.
In each subplot, the x-axis represents the round and the y-axis the time-averaged regret. A small inset in each subplot zooms in on the early rounds. We also plot OPT with fairness considered (OptF), i.e., the optimal solution to CSMAB-F, and OPT without fairness (OptNF), i.e., the optimal solution to CSMAB. The time-averaged regret of OptNF is always below that of OptF, since OptNF does not need to satisfy the fairness constraints and can always achieve the highest reward. By definition, the regret of OptF is always 0.
We can see that the proposed TSCSF-B algorithm performs better than the LFG algorithm, converging faster and achieving a lower regret, as shown in Fig. 1 and Fig. 2. It is noteworthy that the gap between TSCSF-B and LFG is larger in Fig. 2, which indicates that TSCSF-B outperforms LFG by more in more complicated scenarios.
In terms of η, the algorithms with a higher η achieve a lower time-averaged regret. For example, in the first setting, the lowest regrets achieved by the two considered algorithms are clearly above OptF for the smallest η, but much closer to OptF for a larger η. However, when we continue to increase η (see Fig. 1(c)), the considered algorithms achieve a negative time-averaged regret in the early rounds, but recover to positive values afterwards. This is due to the fact that with a high η the algorithms preferentially pull the arms with the highest mean rewards, while the queues still ensure that fairness is achieved in future rounds. When the fairness constraints are ignored entirely (see Fig. 1(d) and Fig. 2(b)), the regrets of the considered algorithms converge to that of OptNF. Therefore, η significantly determines whether, and how quickly, the algorithms satisfy the fairness constraints.
6.1.2 Fairness Constraints
In the first setting, we show in Fig. 3 the final satisfaction of the fairness constraints for all arms under a large η, an interesting case where, as noted above, the fairness constraints are not satisfied in the first rounds. We point out that in the first setting one arm's fairness constraint is relatively difficult to satisfy, since that arm has the lowest mean reward but a relatively high fairness requirement. Nevertheless, the fairness constraints of all arms are eventually satisfied, which means both TSCSF-B and LFG are able to ensure fairness in this simple setting.
In the second setting, the fairness constraints of the arms with high fairness requirements but low availability or low mean rewards are difficult to satisfy. However, both TSCSF-B and LFG manage to satisfy the fairness constraints of all arms, as shown in Fig. 4.
6.2 Tightness of the Upper Bounds
Finally, we show the tightness of our bounds in the second setting, as plotted in Fig. 5. The x-axis represents the time horizon T, and the y-axis the time-averaged regret on a logarithmic scale.
We can see that the upper bound of TSCSF-B is always below that of LFG. However, there is a large gap between the TSCSF-B upper bound and the actual time-averaged regret in the second setting. This is reasonable, since the upper bound is problem-independent, but it remains of interest to find tighter bounds for CSMAB-F problems.
6.3 High-Rating Movie Recommendation System
In this part, we consider a high-rating movie recommendation system. The objective of the system is to recommend high-rating movies to users, but the ratings of the considered movies are unknown in advance. Thus, the system needs to learn the ratings of the movies while simultaneously recommending high-rating ones to its users. Specifically, when a user arrives, the movies relevant to the user's preferences are available to be recommended. The system then recommends a subset of the available movies to the user, subject to a cardinality constraint. After consuming the recommended movies, the user gives feedback to the system, which is used to update the estimated ratings of the movies to better serve upcoming users. In order to acquire accurate ratings and to ensure the diversity of recommended movies, each movie should be recommended at least a certain number of times.
The above high-rating movie recommendation problem can be modeled as a CSMAB-F problem under three assumptions. First, we treat serving one user as one round, assuming the next user arrives only after the current user finishes rating. This assumption can easily be relaxed by adopting the delayed-feedback framework, at the cost of an additive penalty to the regret [14]. Second, the availability set of movies is stochastically generated according to an unknown distribution. Last, given a movie, the ratings are i.i.d. across users with respect to an unknown distribution. The second and third assumptions are reasonable, as it has been observed that user preferences and movie ratings closely follow Zipf-like distributions [4, 10].
6.3.1 Setup
We implement TSCSF-B and LFG on the MovieLens 20M dataset [11], which includes about 20 million ratings of 27,000 movies by 138,000 users. The dataset contains both the users' movie ratings, ranging from 0.5 to 5, and genre labels for each movie. In order to compare the proposed TSCSF-B algorithm with the LFG algorithm, we select five movies with different genres as the ground set of arms: Toy Story (1995) in the genre of Adventure, Braveheart (1995) in Action, Pulp Fiction (1994) in Comedy, Godfather, The (1972) in Crime, and Alien (1979) in Horror.
Then, we count the total number of ratings for the selected genres and use the occurrence frequency of each selected genre among them as the availability of the corresponding selected movie. We note that this availability is only used by the OptF algorithm and is not used to determine the available set of movies in each round. During the simulation, when each user arrives, the available set of movies is determined by whether the user has rated those movies in the dataset.
The ratings are scaled into [0, 1] to serve as rewards. We take the users who have rated at least one of the selected movies as the rounds (one round per user, per the first assumption), and use their ratings as the rewards of the recommended movies. When each user arrives, the system selects at most K movies for recommendation; each movie shares the same weight, i.e., w_i = 1, and the same fairness constraint, set such that (2) has a feasible solution.
We adopt this implementation, including the determination of the available movie set, identical movie weights, and identical fairness constraints, to introduce as little noise as possible into the MovieLens dataset.
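A sketch of how such rounds could be built from the MovieLens `ratings.csv` file. The column names follow the published dataset; the five `movieId`s are the usual ones for these titles but are our assumption and should be checked against the data:

```python
import csv
from collections import defaultdict

# movieIds assumed for Toy Story, Braveheart, Pulp Fiction,
# The Godfather, and Alien -- verify against the dataset.
SELECTED = {1, 110, 296, 858, 1214}

def rounds_from_rows(rows):
    """Build one 'round' per user: the availability set is the subset of
    the five movies the user actually rated, and the rewards are the
    user's ratings scaled into [0, 1]."""
    by_user = defaultdict(dict)
    for row in rows:
        movie = int(row["movieId"])
        if movie in SELECTED:
            # MovieLens 20M ratings lie in [0.5, 5]; scale to [0, 1].
            by_user[int(row["userId"])][movie] = float(row["rating"]) / 5.0
    # Keep only users who rated at least one selected movie.
    return [(set(r), r) for r in by_user.values() if r]

def load_rounds(path="ratings.csv"):
    # ratings.csv columns: userId,movieId,rating,timestamp
    with open(path, newline="") as f:
        return rounds_from_rows(csv.DictReader(f))
```

Each returned pair (availability set, reward map) then plays the role of one (A_t, rewards) round for the bandit algorithms.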
6.3.2 Results
We first show whether the considered algorithms are able to achieve accurate ratings. The final ratings of the selected movies under TSCSF-B and LFG are shown in Fig. 6(a); the value of η is chosen according to Corollary 1. We can observe that TSCSF-B performs better than LFG: the ratings learned by TSCSF-B are much closer to the true average ratings, while the ratings acquired by the UCB-based LFG are higher than the true average ratings.
The final satisfaction of the fairness constraints of the selected movies is shown in Fig. 6(b). Both TSCSF-B and LFG satisfy the fairness constraints of the five movies.
On the other hand, the time-averaged regret is shown in Fig. 7. We can see that the time-averaged regret of TSCSF-B is below that of LFG, which indicates that the proposed TSCSF-B algorithm converges much faster. Since we are unable to obtain the true distribution of the available movie set (as discussed in the setup), the reward achieved by the OptF policy may not be optimal, which explains why the curves of both TSCSF-B and LFG lie below that of OptF in Fig. 7.
Overall, TSCSF-B performs much better than LFG in this application, achieving more accurate final ratings and faster convergence.
7 Conclusion
In this paper, we studied the stochastic combinatorial sleeping multi-armed bandit problem with long-term fairness constraints, and designed the TSCSF-B algorithm with a provable problem-independent regret bound for a properly chosen η. Both numerical experiments and real-world applications were conducted to verify the performance of the proposed algorithm.
As part of future work, we would like to derive a more rigorous relationship between η and T such that the algorithm always satisfies the fairness constraints while achieving high rewards for any T, as well as to derive tighter bounds.
References
 [1] (2012) Analysis of Thompson Sampling for the Multi-Armed Bandit Problem. In Proc. Conference on Learning Theory (COLT), Vol. 23, pp. 39.1–39.26. Cited by: footnote 3.
 [2] (2017) Near-Optimal Regret Bounds for Thompson Sampling. Journal of the ACM 64 (5), pp. 30:1–30:24. External Links: ISSN 0004-5411. Cited by: §1, footnote 3.

 [3] (2012) Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Foundations and Trends® in Machine Learning 5 (1), pp. 1–122. Cited by: §1, §1.
 [4] (2007) I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System. In Proc. ACM SIGCOMM Conference on Internet Measurement (IMC), New York, NY, USA, pp. 1–14. Cited by: §6.3.
 [5] (2011) An Empirical Evaluation of Thompson Sampling. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 2249–2257. Cited by: §1.

 [6] (2017) Analysis of Thompson Sampling for Stochastic Sleeping Bandits. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI). Cited by: §1, §2.
 [7] (2013) Combinatorial Multi-Armed Bandit: General Framework and Applications. In Proc. International Conference on Machine Learning (ICML), pp. 151–159. Cited by: §2.
 [8] (2015) Combinatorial Bandits Revisited. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 2116–2124. Cited by: §2.
 [9] (2012) Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations. IEEE/ACM Transactions on Networking (TON) 20 (5), pp. 1466–1478. Cited by: §2.
 [10] (2007) YouTube Traffic Characterization: A View from the Edge. In Proc. ACM SIGCOMM Conference on Internet Measurement (IMC), New York, NY, USA, pp. 15–28. Cited by: §6.3.
 [11] (2016) The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5 (4), Article 19. Cited by: §6.3.1.
 [12] (2019) Intelligent Caching Algorithms in Heterogeneous Wireless Networks with Uncertainty. In Proc. IEEE Conference on Distributed Computing Systems (ICDCS), Cited by: §2, §2.
 [13] (2019) Problem-Dependent Regret Bounds for Online Learning with Feedback Graphs. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI). Cited by: §1.
 [14] (2013) Online Learning under Delayed Feedback. In Proc. International Conference on Machine Learning (ICML), pp. 1453–1461. Cited by: §6.3.

 [15] (2016) Hardness of Online Sleeping Combinatorial Optimization Problems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2181–2189. Cited by: §2, §2.
 [16] (2010) Regret Bounds for Sleeping Experts and Bandits. Machine Learning 80 (2–3), pp. 245–272. Cited by: §2.
 [17] (2015) Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits. In Proc. Artificial Intelligence and Statistics (AISTATS), pp. 535–543. Cited by: §2, §2, Remark 2.
 [18] (2019) Combinatorial Sleeping Bandits with Fairness Constraints. In Proc. IEEE Conference on Computer Communications (INFOCOM), pp. 1702–1710. Cited by: 1st item, §1, §1, §1, §2, §2, §3, §3, §3, §4, Proof 5.1, Proof 5.2, §6.1, Remark 2.
 [19] (2010) Stochastic Network Optimization with Application to Communication and Queueing Systems. Synthesis Lectures on Communication Networks 3 (1), pp. 1–211. Cited by: §4, Proof 5.1.
 [20] (2014) Online Combinatorial Optimization with Stochastic Decision Sets and Adversarial Losses. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2780–2788. Cited by: §2.
 [21] (2018) Thompson Sampling for Combinatorial Semi-Bandits. In Proc. International Conference on Machine Learning (ICML), pp. 5101–5109. Cited by: §1, §2.
Appendix A Appendix
a.1 Notations and Facts
Recall that h_i(t) is the number of times arm i has been pulled at the beginning of round t, and that μ̂_i(t) is the empirical mean reward of arm i at the beginning of round t, i.e., the total observed reward of arm i divided by h_i(t).
For each arm i, we have the two events A_i(t) and B_i(t) defined as follows:
Define F_{t−1} as the history of plays up to round t − 1, i.e., the availability sets, the pulled arms, and the observed rewards in rounds 1, …, t − 1.
Recall that the arms pulled by the deterministic oracle in round t are denoted S_t^o. Let d_i(t) indicate whether arm i is played by TSCSF-B in round t. In the same way, let d_i^o(t) indicate whether arm i is played by the oracle in round t.
Fact 1 (Chernoff bound). Let X_1, …, X_n be independent 0–1 random variables with E[X_i] = p_i (not necessarily identical), and let X̄ = (1/n) Σ_{i=1}^{n} X_i and μ = E[X̄]. Then, for any 0 < λ < 1 − μ,
Pr( X̄ ≥ μ + λ ) ≤ exp( −n d(μ + λ, μ) ),
and for any 0 < λ < μ,
Pr( X̄ ≤ μ − λ ) ≤ exp( −n d(μ − λ, μ) ),
where d(a, b) = a ln(a/b) + (1 − a) ln( (1 − a)/(1 − b) ).
Fact 2 (Hoeffding's inequality). Let X_1, …, X_n be random variables with common range [0, 1] such that E[X_t | X_1, …, X_{t−1}] = μ. Let S_n = X_1 + ⋯ + X_n. Then, for all a ≥ 0,
Pr( S_n ≥ nμ + a ) ≤ exp( −2a²/n )  and  Pr( S_n ≤ nμ − a ) ≤ exp( −2a²/n ).
Fact 3 (Relationship between the beta and binomial distributions). Let F^{beta}_{a,b} be the cdf of the beta distribution with parameters a and b, and let F^{B}_{n,p} be the cdf of the binomial distribution with parameters n and p. Then we have
F^{beta}_{a,b}(y) = 1 − F^{B}_{a+b−1, y}(a − 1)
for all positive integers a and b.
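Fact 3 is easy to check numerically; the helper below (assuming `scipy` is available) evaluates both sides of the identity:

```python
from scipy.stats import beta, binom

def beta_binom_identity(a, b, y):
    """Evaluate both sides of F^beta_{a,b}(y) = 1 - F^B_{a+b-1, y}(a - 1),
    the beta-binomial cdf relationship of Fact 3, for integer a, b."""
    lhs = beta.cdf(y, a, b)
    rhs = 1.0 - binom.cdf(a - 1, a + b - 1, y)
    return lhs, rhs
```

For example, with a = 2, b = 3 and y = 0.5, both sides equal 11/16.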
a.2 Proof of Theorem 2
Proof A.1
Lemma 1
The timeaveraged regret of TSCSFB can be upper bounded by
,  (8) 
Lemma 2
For all t and all arms i, the probability that event A_i(t) happens is upper bounded as follows:
Lemma 3
For all t and all arms i, the probability that event B_i(t) happens is upper bounded as follows:
Bound
Define event is the complementary event of as follows:
Then, we can decompose as
Since , is therefore bounded by
where the second inequality is due to Lemma 2.
Next, we show how to bound the remaining term. Let τ_{i,k} be the round in which arm i is played for the k-th time. The term can be bounded as follows:
(9)  
The last term in (9) can be further written as follows:
where the last inequality is due to Jensen’s inequality. Also by Jensen’s inequality, we have
Therefore, we can bound by
where the inequality is due to the fact that at most K arms are selected in each round. Therefore, the term is bounded by
Combining and gives
Bound
Define an event as the complementary event of :
can be decomposed by
Let τ'_{i,k} be the round in which arm i is played for the k-th time by the oracle policy. Then the term can be written as follows: