The multi-armed bandit (MAB) model has been widely adopted for studying many practical optimization problems (network resource allocation, ad placement, crowdsourcing, etc.) with unknown parameters (see, e.g., ). In the basic stochastic MAB setting, there are arms (i.e., actions), each of which, if played, returns a random reward to the player (i.e., the decision maker). The random reward of each arm takes values in and is assumed to be independent and identically distributed (i.i.d.) over time. However, the reward distributions and the mean rewards are unknown a priori. The player decides which single arm to play in each round for a given time horizon of rounds, with a goal of maximizing the cumulative reward in the face of unknown mean rewards.
However, this basic MAB model neglects several important factors of the system in many real-world applications, where multiple actions can be simultaneously taken and an action could sometimes be “sleeping” (i.e., unavailable). Take wireless scheduling for example: multiple clients compete for a shared wireless channel to transmit packets to a common access point (AP). The AP decides which client(s) can transmit at what times. A successfully delivered packet will generate a random reward, which could represent the value of the information contained in the packet. In each scheduling cycle, multiple clients could be scheduled for simultaneous transmissions as the channel can typically be divided into multiple “sub-channels” using multiplexing technologies . On the other hand, some clients may be unable to transmit packets when experiencing a poor channel condition (due to fading or mobility). Furthermore, in addition to maximizing the reward, ensuring fairness among the clients or providing Quality of Service (QoS) guarantees to the clients is also a key design concern in wireless scheduling [3, 4], as well as in network resource allocation in general . These important factors (i.e., combinatorial actions, availability of actions, and fairness) are commonly shared by many other applications too (see more detailed discussions in Section VI). However, it remains largely unexplored in the literature to carefully integrate all these factors into a unified MAB model.
To that end, in this paper we propose a new Combinatorial Sleeping MAB model with Fairness constraints, called CSMAB-F, aiming to address the aforementioned modeling issues, which are practically important for a wide variety of applications. Compared to the basic MAB setting, in the proposed framework the set of available arms follows a certain distribution that is assumed to be i.i.d. over time and is unknown a priori. However, the information of available arms will be revealed at the beginning of each round. The player can then play multiple, but no more than , available arms and receives a compound reward being the weighted sum of the rewards of the played arms. We also impose fairness constraints that the player must ensure a (possibly different) minimum selection fraction for each individual arm. The goal is now to maximize the reward while satisfying the fairness requirement. We summarize our main contributions as follows.
First, to the best of our knowledge, this is the first work that integrates all three critical factors of combinatorial arms, availability of arms, and fairness into a unified MAB model. The proposed CSMAB-F framework successfully addresses these crucial modeling issues. This new problem, however, is much more challenging. In particular, integrating fairness constraints adds a new layer of difficulty to the combinatorial sleeping MAB problem, which is already quite challenging. This is because not only does the player encounter a fundamental tradeoff between exploitation (i.e., staying with the currently-known best option) and exploration (i.e., seeking better options) when attempting to maximize the reward, but she is also faced with a new dilemma: how to manage the balance between maximizing the reward and satisfying the fairness requirement? Several well-known MAB algorithms can successfully handle the exploitation-exploration tradeoff, but none of them was designed with fairness constraints in mind.
To address this new challenge, we extend an online learning algorithm, called Upper Confidence Bound (UCB), to deal with the exploitation-exploration tradeoff and employ the virtual queue technique to properly handle the fairness constraints. By carefully integrating these two techniques, we develop a new algorithm, called Learning with Fairness Guarantee (LFG), for the CSMAB-F problem. Further, we rigorously prove that not only is LFG feasibility-optimal, but it also has a time-average regret (i.e., the reward difference between an optimal algorithm that has a priori knowledge of the mean rewards and the considered algorithm) upper bounded by , where and are constants and is a design parameter that we can tune. Note that our regret analysis is more challenging because the traditional regret analysis is not applicable here due to the integration of virtual queues for handling the fairness constraints.
Finally, we conduct extensive simulations to elucidate the effectiveness of the proposed algorithm. From the simulation results, we observe that LFG can effectively meet the fairness requirement while achieving a good regret performance. Interestingly, the simulation results also reveal a critical tradeoff between the regret and the speed of convergence to a point satisfying the fairness constraints. We can control and optimize this tradeoff by tuning the value of parameter .
The rest of the paper is organized as follows. We first discuss related work and describe the proposed CSMAB-F framework in Sections II and III, respectively. Then, we develop the LFG algorithm for the CSMAB-F problem in Section IV, followed by the performance analysis in Section V. Detailed discussions about several real-world applications are provided in Section VI. Finally, we present simulation results in Section VII and make concluding remarks in Section VIII.
II Related Work
Starting with the seminal work of , the MAB problems have been extensively studied in a large body of work (see, e.g., [7, 1]). In the basic MAB setting, the authors of  establish a fundamental logarithmic lower bound on the regret of a class of “uniformly good policies” and propose UCB policies that asymptotically achieve the lower bound. Further, the work of  shows that logarithmic regret can be achieved uniformly over time rather than asymptotically by simpler sample-mean-based UCB policies and an -greedy policy.
Following this line of research, different MAB variants have been proposed to model several important factors of the system in real-world applications. The ones that are relevant to ours include combinatorial MAB (CMAB), where multiple arms form a super arm and can be simultaneously played [9, 10, 11, 12, 13, 14], and sleeping MAB (SMAB), where an arm could sometimes be “sleeping” (i.e., unavailable) [15, 16, 17, 18]. Being the first to study the CMAB problem, the work of  considers combinations of a fixed number of simultaneous plays. This simple combinatorial structure has been generalized to permutations  and matroids . The work of [12, 13] generalizes linear reward functions considered in [9, 10, 11] to include a large class of linear and nonlinear rewards. In , the authors prove a tight problem-specific lower bound for stochastic CMAB (where the reward of each played arm rather than the combinatorial reward is revealed) and propose an efficient sampling algorithm with an improved multiplicative factor. The work of  is among the first to study the SMAB problem. This work provides a computationally efficient algorithm for the setting of stochastic rewards while allowing both stochastic and adversarial availability. Follow-up work of [16, 17] studies the setting of adversarial rewards while the availability of arms is either stochastic or adversarial. Very recently, the authors of  analyze the performance of Thompson Sampling for the SMAB problem and show that it empirically performs better than other algorithms.
MAB settings with constraints have also been considered in prior studies. Most of them focus on bandits with budgets (see, e.g., ) or bandits with knapsacks (see, e.g., ), where no more plays can be made if the budget/knapsack constraints are violated. Hence, these types of constraints are very different from the long-term fairness constraints we consider in this paper. Some very recent work considers multi-type rewards  and multi-level rewards [22, 23]. They introduce a minimum guarantee requirement that the total reward of some type/level must be no smaller than a given threshold. However, these studies differ significantly from ours in the following key aspects. First, and most importantly, their constraints do not model fairness among arms. The required minimum guarantee is for the total rewards (of some type/level) rather than for each individual arm. Second, no learning algorithm is proposed in ; the proposed learning algorithms in [22, 23] may violate the constraints, although they show provable violation bounds. Third, they assume that all the arms are available at all times. Last but not least, the proof techniques for regret analysis in [22, 23] are very different from ours.
. A key idea of their proposed fair algorithm is that two arms should be played with equal probability until they can be distinguished with a high confidence. Another work studies how to learn proportionally fair allocations by considering the maximization of a logarithmic utility function. These studies are less relevant to our work, although they share some high-level similarities with ours in modeling fairness.
III System Model and Problem Formulation
In this section, we describe the detailed setting of our proposed CSMAB-F framework. Let $\mathcal{N}$ denote the set of arms. Each arm $i \in \mathcal{N}$ is associated with a reward in each round $t$. The reward is a random variable on $[0, 1]$ and follows a certain distribution with mean $\mu_i$. We assume that the reward for each arm is i.i.d. over time. The mean reward vector $\boldsymbol{\mu}$ is unknown a priori. In our setting, an arm could sometimes be “sleeping” (i.e., unavailable). Let $Z(t) \in \mathcal{P}(\mathcal{N})$ denote the set of available arms in round $t$, where $\mathcal{P}(\mathcal{N})$ is the power set of $\mathcal{N}$. We use $P_A(Z)$, where $Z \in \mathcal{P}(\mathcal{N})$, to denote the distribution of available arms, which is assumed to be i.i.d. over time. This distribution is unknown a priori, but the set of available arms $Z(t)$ will be revealed to the player at the beginning of each round $t$.
In each round, the player is allowed to play multiple, but no more than , available arms (i.e., arms belonging to ). Each subset of available arms is also called a super arm . We restrict the size of a chosen super arm to be no larger than so as to account for resource constraints (see discussions on applications in Section VI). Let represent the set of all feasible super arms when the set of available arms is observed, i.e., where denotes the cardinality of set . In round , a player selects a super arm and receives a compound reward , which is a weighted sum of the rewards of the played arms, i.e., , where is the weight of arm . We assume that the weights are fixed positive numbers known a priori and are upper bounded by a finite constant . The goal of the player is to maximize the expected time-average reward for a given time horizon of rounds, i.e., .
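The combinatorial structure described above can be made concrete with a short sketch. It enumerates the feasible super arms for an observed set of available arms and evaluates the weighted-sum compound reward. The function names, the dictionary-based representation, and the example values are illustrative assumptions, and the empty set is assumed to count as a feasible super arm.

```python
from itertools import combinations


def feasible_super_arms(available, m):
    """Return all subsets of `available` that contain at most `m` arms."""
    arms = sorted(available)
    max_size = min(m, len(arms))
    return [
        frozenset(combo)
        for size in range(max_size + 1)
        for combo in combinations(arms, size)
    ]


def compound_reward(super_arm, rewards, weights):
    """Weighted sum of the rewards of the played arms."""
    return sum(weights[i] * rewards[i] for i in super_arm)


# Hypothetical example: arms 0, 1, 2 are available and at most 2 may be played.
super_arms = feasible_super_arms({0, 1, 2}, m=2)
print(len(super_arms))  # 1 empty set + 3 singletons + 3 pairs = 7
print(compound_reward({0, 2},
                      rewards={0: 1.0, 1: 0.0, 2: 0.5},
                      weights={0: 1.0, 1: 1.0, 2: 2.0}))  # 1.0*1.0 + 2.0*0.5 = 2.0
```

The number of feasible super arms grows combinatorially with the number of available arms, which is why explicit enumeration like this is only practical for small instances.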
To describe the action for each individual arm, we use a binary vector to indicate whether each arm is played or not in round , where if arm is played, i.e., ; otherwise, . Then, the action vector must satisfy for all .
As we discussed in the introduction, in addition to maximizing the reward, ensuring fairness among the arms is also a key design concern for many real-world applications. To model the fairness requirement, we introduce the following constraints on a minimum selection fraction for each individual arm:
where is the required minimum fraction of rounds in which arm is played. The minimum selection fraction vector is said to be feasible if there exists a policy that makes a sequence of decisions for such that (1) is satisfied. Then, the maximal feasibility region is defined as the set of all such feasible vectors . A policy is said to be feasibility-optimal if it can support any vector (i.e., (1) is satisfied) strictly inside the maximal feasibility region .
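As a sketch of what a long-term minimum selection fraction constraint of the kind referred to as (1) could look like, consider the following form. The notation is assumed for illustration: $d_i(t) \in \{0, 1\}$ indicates whether arm $i$ is played in round $t$, $r_i$ is the required minimum selection fraction of arm $i$, and rounds are indexed from $0$. The exact form and indexing used by the authors may differ.

```latex
\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ d_i(t) \right] \ge r_i,
\quad \forall i \in \mathcal{N}.
```

For instance, under this reading a requirement of $r_i = 0.5$ means that arm $i$ must be played in at least half of the rounds on average over a long horizon, while short-term deviations are allowed.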
We now consider the special class of stationary and randomized policies called -only policies. An -only policy observes the set of available arms for each round and independently chooses a super arm as a (possibly randomized) function of the observed only. An -only policy is characterized by a group of probability distributions, denoted by $\boldsymbol{q}$, where $q_S(Z)$ is the probability that the policy chooses super arm $S$ when observing the set of available arms $Z$, and $\sum_{S \in \mathcal{S}(Z)} q_S(Z) = 1$ for all $Z \in \mathcal{P}(\mathcal{N})$. Then, under this policy, the action is i.i.d. over time with the following mean:
for every arm and for all , and thus, constraint (1) is equivalent to for every arm . Further, we have the following lemma.
Lemma 1. If a vector is strictly inside the maximal feasibility region , then there exists an -only policy that can support vector .
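Lemma 1 implies that there exists an optimal -only policy. Hence, assuming that the mean reward vector $\boldsymbol{\mu}$ is known in advance, one can formulate the reward maximization problem with minimum selection fraction constraints as the following linear program (LP):
$$
\begin{aligned}
\underset{\boldsymbol{q}}{\text{maximize}} \quad & \sum_{Z \in \mathcal{P}(\mathcal{N})} P_A(Z) \sum_{S \in \mathcal{S}(Z)} q_S(Z) \sum_{i \in S} w_i \mu_i \\
\text{subject to} \quad & \sum_{Z \in \mathcal{P}(\mathcal{N})} P_A(Z) \sum_{S \in \mathcal{S}(Z):\, i \in S} q_S(Z) \ge r_i, \quad \forall i \in \mathcal{N}, \\
& \sum_{S \in \mathcal{S}(Z)} q_S(Z) = 1, \quad \forall Z \in \mathcal{P}(\mathcal{N}), \\
& q_S(Z) \in [0, 1], \quad \forall S \in \mathcal{S}(Z),\ \forall Z \in \mathcal{P}(\mathcal{N}).
\end{aligned}
$$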
Suppose that an optimal solution to the above LP is . Then an optimal -only policy characterized by obtains the maximum reward:
However, the mean reward vector $\boldsymbol{\mu}$ is unknown to the player in advance. Hence, the player not only needs to maximize the reward based on the estimated mean rewards (i.e., exploitation), but she also has to simultaneously learn to obtain a more accurate estimate of the mean rewards (i.e., exploration). Such a learning process typically incurs a loss in the obtained reward, which is called the regret. Formally, the time-average regret of a policy for a time horizon of rounds, denoted by , is defined as the difference between the maximum reward and the expected time-average reward obtained under policy that chooses super arm in round , i.e.,
Note that minimizing the regret is equivalent to maximizing the reward. Hence, the regret is a commonly used metric in the MAB literature for measuring the performance of learning algorithms. In this paper, we will adopt the time-average regret defined in (4) as the main performance metric.
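The verbal definition of the time-average regret can be written out as follows. This is a sketch under assumed notation: $R^{*}$ denotes the optimal value of the LP above, $S(t)$ the super arm chosen by policy $\pi$ in round $t$, $X_i(t)$ the reward of arm $i$ in round $t$, and rounds are indexed from $0$ to $T-1$. The symbols and indexing are assumptions rather than the authors' exact statement of (4).

```latex
\mathcal{R}_{\pi}(T) \triangleq R^{*}
- \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} \sum_{i \in S(t)} w_i X_i(t) \right].
```

Because $R^{*}$ is the best reward among policies that satisfy the long-term fairness constraints, a policy that temporarily or permanently violates those constraints can collect more than $R^{*}$, in which case this quantity is negative.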
The key notations used in this paper are listed in Table I.
IV The LFG Algorithm
In this section, by carefully integrating the key ideas of UCB [6, 8] and the virtual queue technique , we develop a new algorithm, called Learning with Fairness Guarantee (LFG), to tackle the CSMAB-F problem. While UCB is extended to deal with the exploitation-exploration tradeoff, the virtual queue technique is employed to handle the fairness constraints.
There are two main challenges in designing an efficient algorithm for the CSMAB-F problem: (i) how to maximize the reward in the face of unknown mean rewards and (ii) how to satisfy the fairness constraints. Note that these two challenges cannot be addressed separately as they are tightly coupled together. Therefore, we need a holistic approach to manage the balance between maximizing the reward and satisfying the fairness constraints. In what follows, we will first discuss the key ideas for addressing each individual challenge and then propose the LFG algorithm by carefully integrating them.
The key to maximizing the reward under uncertainty is to strike a balance between exploitation (i.e., choosing the option that gave the highest rewards in the past) and exploration (i.e., seeking new options that might give higher rewards in the future). We extend a simple UCB policy based on the concept of optimism in the face of uncertainty to address this challenge and describe the details as follows.
Let be the number of times arm has been played by the end of round , i.e., . We set as the system begins at . Also, let be the sample mean of the observed rewards of arm by the end of round , i.e., . We set if arm has not been played yet by the end of round (i.e., if ). We use to denote the UCB estimate of arm in round , which is given as follows:
where and correspond to exploitation and exploration, respectively. We use the above truncated version of the UCB estimate (i.e., capped at 1) as the actual reward must be in . Similarly, we set if .
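A truncated UCB estimate of the kind described above can be sketched as follows. The function name and arguments are hypothetical, and the exploration constant `c = 1.5` is an illustrative placeholder rather than a value confirmed by this text. The sketch also assumes that an arm that has never been played receives the optimistic value 1.

```python
import math


def ucb_estimate(sample_mean, plays, t, c=1.5):
    """Truncated UCB index: sample mean plus exploration bonus, capped at 1.

    sample_mean: average observed reward of the arm so far
    plays:       number of times the arm has been played so far
    t:           current round index (assumed t >= 1 once plays > 0)
    c:           exploration constant (assumed value, for illustration only)
    """
    if plays == 0:
        return 1.0  # assumed optimistic default for an unplayed arm
    bonus = math.sqrt(c * math.log(t) / plays)
    return min(sample_mean + bonus, 1.0)


print(ucb_estimate(0.2, plays=100, t=1))  # log(1) = 0, so no bonus: 0.2
print(ucb_estimate(0.9, plays=1, t=100))  # large bonus, capped at 1.0
```

The cap reflects the fact that rewards are bounded by 1, so an index above 1 carries no extra information.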
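TABLE I: Key notations
| Notation | Description |
|---|---|
| $\mathcal{N}$; $N$ | Set of arms; number of arms |
| $\mathcal{P}(\mathcal{N})$ | Power set of $\mathcal{N}$ |
| $m$ | Maximum number of simultaneously played arms |
| $\mu_i$ | Mean reward of arm $i$ |
| $w_i$ | Weight of arm $i$ |
| $r_i$ | Required minimum selection fraction for arm $i$ |
| $X_i(t)$ | Reward of arm $i$ in round $t$ |
| $\hat{\mu}_i(t)$ | Sample mean of the observed reward of arm $i$ up to round $t$ |
| $\bar{\mu}_i(t)$ | UCB estimate of arm $i$ in round $t$ |
| $h_i(t)$ | Number of times arm $i$ has been played up to round $t$ |
| $d_i(t)$ | Indicator of whether arm $i$ is played or not in round $t$ |
| $Q_i(t)$ | Virtual queue length for arm $i$ in round $t$ |
| $Z(t)$ | Set of available arms in round $t$ |
| $S(t)$ | Super arm played in round $t$ |
| $P_A(Z)$ | Probability that the set of available arms is $Z$ |
| $\mathcal{S}(Z)$ | Set of feasible super arms when observing available arms $Z$ |
| $q_S(Z)$ | Probability that an -only policy chooses super arm $S$ when observing available arms $Z$ |
| $\mathcal{C}$ | Maximal feasibility region |
| $R^{*}$ | Maximum reward with a priori knowledge of $\boldsymbol{\mu}$ |
| $\mathcal{R}_{\pi}(T)$ | Time-average regret of policy $\pi$ |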
In the basic MAB setting, the classic UCB policy simply selects the arm that has the largest UCB estimate in each round [6, 8]. However, in the CSMAB-F setting we are faced with several new challenges introduced by combinatorial arms, availability of arms, and fairness constraints. In particular, integrating fairness constraints adds a new layer of difficulty to the combinatorial sleeping MAB problem, which is already quite challenging. This is because not only is the player faced with the exploitation-exploration dilemma when attempting to maximize the reward, but she also encounters a new tradeoff between maximizing the reward and satisfying the fairness requirement. Therefore, directly applying the UCB policy will not work, as it was designed without fairness constraints in mind. Next, we will explain how to use the virtual queue technique to properly handle the fairness constraints, as well as how to cohesively integrate it with UCB to address the overall challenge of the CSMAB-F problem.
Following the framework developed in , we create a virtual queue for each arm to handle the fairness constraints in (1). By slightly abusing the notation, we also use to denote the queue length of at the beginning of round , which is a counter that keeps track of the “debt” to arm up to round . Specifically, the virtual queue length evolves according to the following dynamics:
where . We set as the system begins at . As can be seen in the above queue-length evolution, the “debt” to arm increases by in each round as is the minimum selection fraction, and it decreases by one if arm is selected in round (i.e., ).
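The queue dynamics described in words above (the debt grows by the minimum selection fraction each round, drops by one when the arm is played, and never goes below zero) can be sketched in a few lines. The function name and arguments are hypothetical, and the sketch leaves the exact time indexing of (6) aside.

```python
def update_queue(queue, min_fraction, played):
    """One step of a virtual queue: add the per-round debt, subtract 1 if played,
    and project onto the nonnegative reals."""
    service = 1.0 if played else 0.0
    return max(queue + min_fraction - service, 0.0)


# Hypothetical trace for an arm with minimum selection fraction 0.5.
q = 0.0
q = update_queue(q, 0.5, played=False)  # debt accumulates: 0.5
q = update_queue(q, 0.5, played=False)  # debt accumulates: 1.0
q = update_queue(q, 0.5, played=True)   # 1.0 + 0.5 - 1 = 0.5
print(q)
```

A long queue therefore signals that an arm has been played less often than its requirement demands, which is the quantity the algorithm uses to prioritize under-served arms.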
Having introduced the UCB estimate and the virtual queues, we are now ready to describe the proposed LFG algorithm, which is presented in Algorithm 1. At the very beginning, we initialize and for all arms (lines 1-3). In each round , we first update the UCB estimates and the virtual queue lengths according to (5) and (6) for all arms , respectively, based on the decision and the feedback from the previous rounds (lines 4-11); we set if . Then, we observe the set of available arms (line 12) and select a super arm that maximizes the compound value of the updated and as follows (line 13):
where is a positive parameter we can tune to manage the balance between the reward and the virtual queue lengths. Note that the size of is exponential in . Hence, the complexity of selecting a super arm according to (7) could be prohibitively high in general. However, thanks to the special structure of linear compound reward, we can efficiently solve (7) and find a best super arm by iteratively selecting best individual arms. Specifically, we select a super arm consisting of the top- arms in , where . That is, starting with an empty , we iteratively select arm such that
and after each iteration, we update super arm by adding arm to it, i.e., . Repeating the above procedure for iterations solves (7) and finds a best super arm . After we play arms in and set vector accordingly (line 14), we observe the reward for all played arms (lines 15-17) and update and accordingly for all arms (lines 18-20).
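The top-arm selection step can be sketched as below. It assumes, based on the surrounding description, that each available arm is scored by its virtual queue length plus a tunable parameter (called `eta` here) times its weighted UCB estimate. The score form, the names, and the tie-breaking rule (lower arm index first) are assumptions for illustration.

```python
def select_super_arm(available, m, queues, ucb, weights, eta):
    """Pick up to m available arms with the largest queue-plus-weighted-UCB score."""
    def score(i):
        return queues[i] + eta * weights[i] * ucb[i]

    # Sort by descending score; break ties by arm index for determinism.
    ranked = sorted(available, key=lambda i: (-score(i), i))
    return set(ranked[:m])


# Hypothetical state: arm 0 has a large debt but a small estimated reward.
queues = {0: 5.0, 1: 0.0, 2: 0.0}
ucb = {0: 0.1, 1: 0.9, 2: 0.5}
weights = {0: 1.0, 1: 1.0, 2: 1.0}

print(select_super_arm({0, 1, 2}, 2, queues, ucb, weights, eta=1.0))    # {0, 1}
print(select_super_arm({0, 1, 2}, 2, queues, ucb, weights, eta=100.0))  # {1, 2}
```

Sorting costs on the order of $n \log n$ for $n$ available arms, in contrast to enumerating all feasible super arms. The two calls also illustrate how a small parameter favors the arm with a large debt, while a large parameter favors arms with high estimated rewards.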
Remark: As we mentioned earlier, we introduce a design parameter to manage the balance between the reward and virtual queue lengths. When is large, the LFG algorithm gives a higher priority to maximizing the reward compared to meeting the fairness constraints. This is because an arm with a large estimated reward (i.e., UCB estimate) will be favored, compared to another arm that has a small estimated reward but a large “debt” (i.e., virtual queue length). In contrast, when is small, the LFG algorithm gives a higher priority to meeting the fairness constraints because an arm with a large virtual queue length will be favored even if it has a small estimated reward. Indeed, our simulation results presented in Section VII reveal an interesting tradeoff between the regret and the speed of convergence to a point satisfying the fairness constraints.
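To see how the pieces of this section fit together, the following is a minimal simulation sketch of an LFG-style loop. It is not the authors' Algorithm 1: it assumes Bernoulli rewards, arms that are available independently of each other, an exploration constant of 1.5, and a queue update applied at the end of each round. All names and example parameters are hypothetical.

```python
import math
import random


def run_lfg(mu, p_avail, r, weights, m, eta, T, seed=0):
    """Simulate a simplified LFG-style loop and return per-arm statistics."""
    rng = random.Random(seed)
    n = len(mu)
    plays = [0] * n          # number of times each arm has been played
    reward_sum = [0.0] * n   # sum of observed rewards per arm
    queues = [0.0] * n       # virtual queue ("debt") per arm
    total_reward = 0.0

    for t in range(1, T + 1):
        # Truncated UCB estimate; unplayed arms get the optimistic value 1.
        ucb = [
            1.0 if plays[i] == 0 else
            min(reward_sum[i] / plays[i]
                + math.sqrt(1.5 * math.log(t) / plays[i]), 1.0)
            for i in range(n)
        ]
        # Observe which arms are awake in this round.
        available = [i for i in range(n) if rng.random() < p_avail[i]]
        # Play the (at most m) available arms with the largest score.
        ranked = sorted(available,
                        key=lambda i: -(queues[i] + eta * weights[i] * ucb[i]))
        chosen = set(ranked[:m])
        for i in chosen:
            x = 1.0 if rng.random() < mu[i] else 0.0  # Bernoulli reward
            plays[i] += 1
            reward_sum[i] += x
            total_reward += weights[i] * x
        # Debt grows by r_i each round and shrinks by one when the arm is played.
        for i in range(n):
            served = 1.0 if i in chosen else 0.0
            queues[i] = max(queues[i] + r[i] - served, 0.0)

    fractions = [h / T for h in plays]
    return fractions, queues, total_reward / T


fractions, queues, avg_reward = run_lfg(
    mu=[0.3, 0.5, 0.7], p_avail=[0.9, 0.8, 0.7], r=[0.5, 0.3, 0.2],
    weights=[1.0, 1.0, 1.0], m=2, eta=10.0, T=2000, seed=1)
print(fractions, avg_reward)
```

In this sketch each selection fraction is at least the required fraction minus the final queue length divided by the horizon, so an arm can fall short of its requirement only by the amount of debt still sitting in its queue. Rerunning with different values of `eta` is one way to explore the tradeoff mentioned in the remark.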
V Main Results
In this section, we analyze the performance of our proposed LFG algorithm and present our main results. Specifically, we show that the LFG algorithm is feasibility-optimal (i.e., it can satisfy any feasible requirement of minimum selection fraction for each individual arm) in Section V-A and derive an upper bound on the time-average regret in Section V-B.
V-A Feasibility Optimality
We first present the feasibility-optimality result. That is, the LFG algorithm can satisfy the fairness constraints in (1) for any minimum selection fraction vector strictly inside the maximal feasibility region .
Note that the constraints in (1) are satisfied as long as the virtual queue system defined in (6) is mean rate stable [28, pp. 56-57], i.e., . In our virtual queue system, mean rate stability is implied by a stronger notion called strong stability, i.e., . Therefore, in order to prove feasibility-optimality, it is sufficient to show that the virtual queue system is strongly stable whenever the minimum selection fraction vector is strictly inside . We state this result in Theorem 1.
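A short derivation may help to see why mean rate stability is enough for the fairness constraints. It is a sketch under assumed notation: $Q_i(t)$ is the virtual queue length of arm $i$, $d_i(t) \in \{0, 1\}$ indicates whether arm $i$ is played in round $t$, rounds are indexed from $0$ with $Q_i(0) = 0$, and the queue is assumed to evolve as in the first line below, which follows the verbal description of (6) but may differ from it in time indexing.

```latex
\begin{align*}
Q_i(t+1) &= \left[ Q_i(t) + r_i - d_i(t) \right]^{+} \ge Q_i(t) + r_i - d_i(t) \\
\Rightarrow \quad Q_i(T) &\ge \sum_{t=0}^{T-1} \left( r_i - d_i(t) \right)
  = T r_i - \sum_{t=0}^{T-1} d_i(t) \\
\Rightarrow \quad \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ d_i(t) \right]
  &\ge r_i - \frac{\mathbb{E}\left[ Q_i(T) \right]}{T}
\end{align*}
```

Letting $T \to \infty$, the last term vanishes when $\lim_{T \to \infty} \mathbb{E}[Q_i(T)]/T = 0$ (mean rate stability), so the long-run selection fraction of arm $i$ is at least $r_i$. Strong stability is the stronger requirement that $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[Q_i(t)] < \infty$.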
Theorem 1. The LFG algorithm is feasibility-optimal. Specifically, for any minimum selection fraction strictly inside the maximal feasibility region , the virtual queue system defined in (6) is strongly stable under LFG. That is,
where and is some positive constant satisfying that is still strictly inside , with being the -dimensional vector of all ones.
Remark: Note that the work of  also studies an MAB problem with minimum-guarantee constraints. However, their work differs significantly from ours because their considered minimum guarantee is for the total rewards (of some type/level) rather than for each individual arm, i.e., fairness among arms is not modeled. More importantly, the proposed learning algorithm in  may violate the constraints. Although they show that the violations are upper bounded by , this upper bound implies that the constraints may not be satisfied even after a long enough time. In stark contrast, Theorem 1 states that our proposed LFG algorithm can satisfy the (long-term) fairness constraints as long as the requirement is feasible. Another difference is that they do not consider sleeping bandits, which can further complicate the problem.
V-B Upper Bound on Regret
In this subsection, we prove an upper bound on the time-average regret (as defined in (4)) under the LFG algorithm. This upper bound is achieved uniformly over time (i.e., for any finite time horizon ) rather than asymptotically when goes to infinity. We state this result in Theorem 2.
Theorem 2. Under the LFG algorithm, the time-average regret defined in (4) has the following upper bound:
where , and .
Remark: The derived regret upper bound in (10) is quite appealing as it separately captures the impact of the fairness constraints and the impact of the uncertainty in the mean rewards for any finite time horizon . Note that the regret upper bound in (10) has two terms. The first term is inversely proportional to and is attributed to the impact of the fairness constraints. Specifically, when is small, the LFG algorithm gives a higher priority to meeting the fairness requirement by favoring an arm with a larger “debt” (i.e., virtual queue length) as in (8), even if this arm has a small estimated reward. This results in a larger regret captured in the first term. Conversely, a larger leads to a smaller regret captured in the first term, but it will take longer for the LFG algorithm to converge to a point satisfying the fairness constraints. This interesting tradeoff can also be observed from our simulation results in Section VII. The second term is of the order . This part of the regret corresponds to the notion of regret in typical MAB problems and is attributed to the cost that needs to be paid in the learning/exploration process. Note that the second term is an instance-independent upper bound that does not depend on the problem-specific parameter . Our derived bound on the time-average regret is consistent with the instance-independent result for basic MAB problems [1, Ch. 2.4.3] (time-average regret vs. cumulative regret).
VI Applications
In this section, we provide more detailed discussions about real-world applications of our proposed CSMAB-F framework. Specifically, we will discuss the following three applications as examples: scheduling of real-time traffic in wireless networks , ad placement in online advertising systems , and task assignment in crowdsourcing platforms .
VI-A Scheduling of Real-time Traffic in Wireless Networks
Consider the problem of scheduling real-time traffic with QoS constraints in a single-hop wireless network. Assume that there are clients competing for a shared wireless channel to transmit packets to a common AP (see, e.g., ). Time is slotted. The AP decides which client(s) can transmit at what times. Consider a scheduling cycle, called a frame, that consists of consecutive time slots. Every client generates one data packet at the beginning of each frame. To avoid interference, we assume that at most one client can transmit in each time slot. Note that some clients may sometimes be unable to transmit when experiencing poor channel conditions (due to fading or mobility). Assume that the channel conditions remain unchanged during a frame but may vary over frames and that the AP obtains the exact knowledge about the channel conditions via probing. At the beginning of each frame, the AP makes scheduling decisions by selecting an available client to transmit in each of the time slots; at the beginning of each time slot, the AP broadcasts a control packet that announces the scheduling decision, and then, the selected client transmits a packet to the AP in that time slot. We model real-time traffic by assuming that packets have a lifetime of time slots and expire at the end of the frame. The above framework is illustrated in Fig. 3. While a successfully delivered packet will generate a utility, which could represent the value of the information contained in the packet, an expired packet will be dropped at the end of the frame. We assume that the utility corresponding to each client is a random variable, and its mean is unknown a priori. There is a weight associated with each client, indicating the importance of the information provided by the client.
The goal of the AP is to maximize the cumulative utilities by scheduling packet transmissions in the face of unknown mean utilities. In addition, each client has a QoS requirement that a minimum delivery ratio must be guaranteed. Clearly, the scheduling problem with minimum delivery ratio guarantee can naturally be formulated as a CSMAB-F problem.
VI-B Ad Placement in Online Advertising Systems
Online advertising has emerged as a very popular Internet application . Take a page of Weather.com website shown in Fig. 3 for example. When an Internet user visits the webpage, the publisher dynamically chooses multiple ads from the ads pool to display in the ad-mix areas (highlighted by red circles in Fig. 3). We assume that the ads pool consists of ads, and the ad-mix area has a limited capacity, which allows displaying no more than ads simultaneously. Note that some ads are irrelevant to certain users, depending on the context including users’ characteristics (gender, interest, location, etc.) and content of the webpage. Hence, such irrelevant ads can be viewed as unavailable to those users, and the availability of ads depends on the distribution of the context. After seeing a displayed ad, the user may or may not click it. The click-through rate (i.e., the rate at which the ad is clicked) of each ad is unknown a priori. Each click of an ad will potentially generate a revenue for the advertiser, which can be viewed as the weight of the ad.
The goal of the ad publisher is to maximize the cumulative revenues by determining a best subset of ads to display in the face of unknown click-through rates. In addition, the publisher must guarantee a minimum display frequency for advertisers who pay a fixed cost over a specified period, regardless of users’ responses to the displayed ads. Obviously, the ad placement problem with minimum display frequency guarantee fits perfectly into our proposed CSMAB-F framework.
VI-C Task Assignment in Crowdsourcing Platforms
The increasing application of crowdsourcing is significantly changing the way people conduct business and many other activities . Consider a crowdsourcing platform such as Amazon Mechanical Turk, Amazon Flex (for package delivery), and Testlio (for software testing), as shown in Fig. 3. Tasks arriving to the crowdsourcing platform will be assigned to a group of workers with different unknown skill levels. Specifically, when a task arrives, the platform may divide the task into multiple sub-tasks; then the sub-tasks will be assigned to no more than workers from a pool of workers, due to the number of sub-tasks or a limited budget. Note that some workers could be unavailable to take certain tasks due to various reasons (time conflicts, location constraints, limited skills, preferences, etc.). Each completed task will generate a payoff that depends on the quality or efficiency of the workers. The payoff is a random variable, and its mean is unknown a priori due to unknown skill levels of workers.
The goal of the crowdsourcing platform is to maximize the cumulative payoffs by determining an optimal task allocation in the face of unknown mean payoffs. In addition, the platform has to take fairness towards workers into account through a minimum assignment ratio guarantee for each worker. This fairness guarantee helps maintain a healthy and sustainable platform with improved worker satisfaction and higher worker participation. Apparently, our proposed CSMAB-F framework can be applied to address the task assignment problem with minimum assignment ratio guarantee.
VII Numerical Results
In this section, we conduct simulations to evaluate the performance of our proposed LFG algorithm and discuss several interesting observations based on the simulation results.
We consider two scenarios for the simulations: (i) and ; (ii) and . Since the observations are similar for these two scenarios, we will focus on the discussion about the first scenario due to space limitations. We assume that the availability of arm is a binary random variable that is i.i.d. over time with mean . Then, the distribution of available arms can be computed as for all . We also assume binary rewards with the same unit weight (i.e., ) for all the arms. The detailed setting of other parameters is as follows: , , and .
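The distribution of available arms used in such a simulation can be computed from per-arm availability probabilities. The sketch below assumes that arms are available independently of each other, so the probability of observing a given set is a product over arms. The function name and the example probabilities are hypothetical and are not the paper's simulation parameters.

```python
from itertools import combinations


def availability_distribution(p):
    """Map each subset Z of arms to its probability, assuming arm i is
    available with probability p[i] independently of the other arms."""
    arms = range(len(p))
    dist = {}
    for size in range(len(p) + 1):
        for subset in combinations(arms, size):
            prob = 1.0
            for i in arms:
                prob *= p[i] if i in subset else 1.0 - p[i]
            dist[frozenset(subset)] = prob
    return dist


dist = availability_distribution([0.9, 0.8, 0.7])  # hypothetical probabilities
print(len(dist))                    # 2**3 = 8 possible available sets
print(dist[frozenset({0, 1, 2})])   # approximately 0.9 * 0.8 * 0.7
print(sum(dist.values()))           # approximately 1.0
```

The table has one entry per subset of arms, so this explicit form is only practical for a small number of arms such as those used in small-scale simulations.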
First, in order to demonstrate that LFG can effectively meet the fairness requirement, we compare LFG with a fairness-oblivious combinatorial MAB algorithm, called Learning with Linear Rewards (LLR) . We modify the LLR algorithm to accommodate sleeping bandits; the modified version is called LLR for Sleeping bandits (LLRS). In each round , observing the set of available arms , LLRS selects a super arm that has the largest weighted sum of the UCB estimates among all the feasible super arms in , i.e., . Note that LLRS is oblivious of the fairness constraints in (1).
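The LLRS selection step can be illustrated with the following Python sketch. It assumes that a feasible super arm is any subset of the available arms with at most a given number of elements, and that the weights and UCB estimates are nonnegative, in which case the maximizer is simply the top-ranked available arms. The UCB estimates are taken as an input, since their exact form is not restated here; the name `select_super_arm` and its parameters are hypothetical.

```python
def select_super_arm(ucb, weights, available, max_arms):
    """Pick up to max_arms available arms with the largest weighted UCB values.

    Assumptions: feasible super arms are all subsets of `available` with at
    most `max_arms` elements, and weights[i] * ucb[i] is nonnegative, so
    taking the top-ranked arms maximizes the weighted sum.
    """
    # Sort by decreasing weighted UCB; break ties by arm index for determinism.
    ranked = sorted(available, key=lambda i: (-weights[i] * ucb[i], i))
    return set(ranked[:max_arms])


# Toy inputs for illustration only.
toy_ucb = [0.9, 0.5, 0.7, 0.6]
toy_weights = [1.0, 1.0, 1.0, 1.0]
chosen = select_super_arm(toy_ucb, toy_weights, available={1, 2, 3}, max_arms=2)
```

In the toy call, arm 0 has the largest estimate but is unavailable, so the selection falls on the two best available arms. Nothing in this rule looks at how often an arm has been selected, which is why such a policy can violate a minimum selection fraction.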
We simulate LFG with and LLRS for rounds (at which all the considered algorithms are observed to converge) and present the results in Fig. 4. Fig. 4(a) shows the time-average regret over time for the considered algorithms; Fig. 4(b) shows the selection fraction of each arm at the end of the simulation (i.e., at ). From Fig. 4(a), we can make the following observations: (i) LFG with a larger results in a smaller regret, and LFG with approaches a zero regret; (ii) LLRS achieves the smallest regret, which is even negative (i.e., it achieves a reward larger than the optimal ). Observation (i) is expected, as we explained in Section V-B: the upper bound on regret in (10) approaches zero when both and become large. Observation (ii) is not surprising because LLRS is fairness-oblivious and may produce an infeasible solution. Indeed, Fig. 4(b) shows that Arm 1’s selection fraction under LLRS is smaller than the required value (0.4 vs. 0.5). This is because Arm 1 has the smallest mean reward and is not favored under LLRS, which is unaware of the fairness constraints. On the other hand, Fig. 4(b) also shows that with different values of , LFG consistently satisfies the required minimum selection fraction, which verifies our theoretical result on feasibility-optimality of LFG (Theorem 1).
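The feasibility check behind Fig. 4(b) amounts to comparing each arm's empirical selection fraction with its required minimum. The Python sketch below shows one way to compute this from a log of selected super arms; the function names and the toy history are hypothetical and do not reproduce the simulation data.

```python
def selection_fractions(history, num_arms):
    """Fraction of rounds in which each arm was selected.

    history: list with one set of selected arm indices per round.
    """
    if not history:
        raise ValueError("history must contain at least one round")
    counts = [0] * num_arms
    for selected in history:
        for i in selected:
            counts[i] += 1
    return [c / len(history) for c in counts]


def violated_arms(fractions, min_fractions):
    """Indices of arms whose selection fraction is below the required minimum."""
    return [i for i, (f, r) in enumerate(zip(fractions, min_fractions)) if f < r]


# Toy log of four rounds over three arms (illustration only).
toy_history = [{0, 1}, {1, 2}, {1}, {0, 1}]
toy_fractions = selection_fractions(toy_history, num_arms=3)
toy_violations = violated_arms(toy_fractions, min_fractions=[0.5, 0.5, 0.5])
```

Applying `selection_fractions` to growing prefixes of the history gives the selection fraction over time, which is the quantity needed to study how quickly an algorithm reaches a point that satisfies the fairness requirement.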
At first glance, the above observations seem to suggest that LFG with a large is desirable because that leads to a vanishing regret while still providing a fairness guarantee. However, what is missing here is the speed of convergence to a point satisfying the fairness requirement, which is another critical design concern in practice. To understand the convergence speed of LFG with different values of , in Fig. 5 we plot the selection fraction over time for each arm. Taking Fig. 5(a) for example, we can observe that the convergence slows down as increases. In addition, before LFG with converges (e.g., when ), the actual selection fraction of Arm 1 does not meet the required minimum value of 0.5. Since the constraints may be temporarily violated, the regret could even be negative before LFG converges (see Fig. 4(a)). Therefore, the simulation results reveal an interesting tradeoff between the regret and the convergence speed. We can control and optimize this tradeoff by tuning . For example, for the considered scenario, LFG with seems to achieve a good balance between the regret and the convergence speed.
Finally, we want to investigate the tightness of the upper bound derived in (10). Consider the average of 100 independent simulation runs for LFG with . Fig. 6 shows the time-average regret vs. the time horizon in a log-log plot. Recall that the upper bound in (10) has two terms. The impact of appears in the second term that is of the order . When becomes large, it becomes difficult to see the impact of on the regret as the first term becomes dominant. Therefore, we consider the region with (i.e., ). Fig. 6 seems to suggest that the time-average regret follows the order rather than . This implies that the upper bound in (10) is not tight. One reason could be that the bound is instance-independent. It remains open whether one can come up with novel analytical techniques to derive a better bound of .
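Reading an order of growth off a log-log plot amounts to estimating the slope of a straight-line fit. The Python sketch below shows such a fit using ordinary least squares on the logarithms of both axes. It applies only to strictly positive regret values, and the synthetic data is an arbitrary power law chosen solely to check the fit; it is not simulation output, and the function name `loglog_slope` is hypothetical.

```python
import math


def loglog_slope(horizons, regrets):
    """Least-squares slope of log(regret) against log(horizon).

    Both inputs must be strictly positive and of equal length.
    """
    xs = [math.log(t) for t in horizons]
    ys = [math.log(r) for r in regrets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic power law with exponent -0.5, used only to sanity-check the fit.
synthetic_horizons = [10 ** k for k in range(2, 7)]
synthetic_regrets = [3.0 / math.sqrt(t) for t in synthetic_horizons]
fitted_slope = loglog_slope(synthetic_horizons, synthetic_regrets)
```

A fitted slope that is steeper than the exponent predicted by an upper bound is consistent with the bound being loose, although a fit over a finite range of horizons is only suggestive.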
In this paper, we proposed a unified CSMAB-F framework that integrates several critical factors (i.e., combinatorial actions, availability of actions, and fairness) of the system in many real-world applications. In particular, no prior work has studied MAB problems with fairness constraints on a minimum selection fraction for each individual arm. To address the new challenges introduced by modeling these factors, we developed a new LFG algorithm that achieves a provable regret upper bound while effectively providing a fairness guarantee.
We leave the following interesting questions to our future work: Can one prove a tighter upper bound on regret? How can one develop efficient algorithms for a more general model that potentially accounts for nonlinear reward functions, more sophisticated combinatorial structures (e.g., matroids), and more general fairness criteria beyond the temporal fairness that we consider in this paper?
-  S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
-  T. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2001.
-  X. Liu, E. K. Chong, and N. B. Shroff, “A framework for opportunistic scheduling in wireless networks,” Computer networks, vol. 41, no. 4, pp. 451–474, 2003.
-  I. H. Hou, V. Borkar, and P. R. Kumar, “A theory of qos for wireless,” in Proceedings of IEEE INFOCOM, April 2009, pp. 486–494.
-  T. Lan, D. Kao, M. Chiang, and A. Sabharwal, “An axiomatic theory of fairness in network resource allocation,” in 2010 Proceedings IEEE INFOCOM, March 2010, pp. 1–9.
-  T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.
-  J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
-  P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, no. 2-3, pp. 235–256, 2002.
-  V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: Iid rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968–976, 1987.
-  Y. Gai, B. Krishnamachari, and R. Jain, “Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,” IEEE/ACM Transactions on Networking (TON), vol. 20, no. 5, pp. 1466–1478, 2012.
-  B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson, “Matroid bandits: Fast combinatorial optimization with learning,” in Proceedings of UAI, 2014.
-  W. Chen, Y. Wang, and Y. Yuan, “Combinatorial multi-armed bandit: General framework and applications,” in International Conference on Machine Learning, 2013, pp. 151–159.
-  W. Chen, W. Hu, F. Li, J. Li, Y. Liu, and P. Lu, “Combinatorial multi-armed bandit with general reward functions,” in Advances in Neural Information Processing Systems, 2016, pp. 1659–1667.
-  R. Combes, M. S. T. M. Shahi, A. Proutiere et al., “Combinatorial bandits revisited,” in Advances in Neural Information Processing Systems, 2015, pp. 2116–2124.
-  R. Kleinberg, A. Niculescu-Mizil, and Y. Sharma, “Regret bounds for sleeping experts and bandits,” Machine learning, vol. 80, no. 2-3, pp. 245–272, 2010.
-  V. Kanade, H. B. McMahan, and B. Bryan, “Sleeping experts and bandits with stochastic action availability and adversarial rewards,” in Artificial Intelligence and Statistics, 2009, pp. 272–279.
-  V. Kanade and T. Steinke, “Learning hurdles for sleeping experts,” ACM Transactions on Computation Theory (TOCT), vol. 6, no. 3, p. 11, 2014.
-  A. Chatterjee, G. Ghalme, S. Jain, R. Vaish, and Y. Narahari, “Analysis of thompson sampling for stochastic sleeping bandits.” in UAI, 2017.
-  R. Combes, C. Jiang, and R. Srikant, “Bandits with budgets: Regret lower bounds and optimal algorithms,” ACM SIGMETRICS Performance Evaluation Review, vol. 43, no. 1, pp. 245–257, 2015.
-  A. Badanidiyuru, R. Kleinberg, and A. Slivkins, “Bandits with knapsacks,” Journal of the ACM (JACM), vol. 65, no. 3, pp. 13:1–13:55, 2018.
-  E. V. Denardo, E. A. Feinberg, and U. G. Rothblum, “The multi-armed bandit, with constraints,” Annals of Operations Research, vol. 208, no. 1, pp. 37–62, 2013.
-  K. Cai, X. Liu, Y. J. Chen, and J. C. S. Lui, “An online learning approach to network application optimization with guarantee,” in Proceedings of IEEE INFOCOM, 2018, in press.
-  K. Chen, K. Cai, L. Huang, and J. Lui, “Beyond the click-through rate: Web link selection with multi-level feedback,” arXiv preprint arXiv:1805.01702, 2018.
-  M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth, “Fairness in learning: Classic and contextual bandits,” in Advances in Neural Information Processing Systems, 2016, pp. 325–333.
-  M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth, “Fair algorithms for infinite and contextual bandits,” arXiv preprint arXiv:1610.09559, 2016.
-  M. S. Talebi and A. Proutiere, “Learning proportionally fair allocations with low regret,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 2, no. 2, pp. 36:1–36:31, 2018.
-  W.-K. Hsu, J. Xu, X. Lin, and M. R. Bell, “Integrate learning and control in queueing systems with uncertain payoffs,” Purdue University, available at https://engineering.purdue.edu/%7elinx/papers.html, Tech. Rep., 2018.
-  M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,” Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.
-  “Adspeed ad server,” https://www.adspeed.com/, 2018, [Accessed: 2018-06-30].
-  F. Basık, B. Gedik, H. Ferhatosmanoglu, and K.-L. Wu, “Fair task allocation in crowdsourced delivery,” IEEE Transactions on Services Computing, 2018.