1 Introduction
In a classical stochastic multi-armed bandit (SMAB) problem, a decision maker is faced with $K$ choices (henceforth referred to as arms). At each time $t$, the decision maker decides which choice to select (referred to as pulling an arm). Once she pulls an arm, she receives a random reward drawn from a fixed reward distribution unknown to her. The arms that are not pulled do not give any reward. The goal of the decision maker is to pull arms so that the sum of the expected rewards over $T$ pulls is maximized. The challenge she faces is famously known in the literature as the exploration vs. exploitation dilemma, i.e., whether to explore the arms to find the best arm in terms of expected rewards or to pull the arm that has given the best average reward so far.
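To make the interaction protocol concrete, the following is a minimal sketch of the SMAB loop with Bernoulli rewards and an $\varepsilon$-greedy decision maker; the environment, the arm means, and the `epsilon` parameter are illustrative assumptions and not part of the formal model above.

```python
import random

def run_bandit(means, horizon, epsilon=0.1, seed=0):
    """Simulate an SMAB instance with Bernoulli rewards.

    means:   true (unknown-to-the-learner) mean reward of each arm
    horizon: number of rounds T
    Returns (pull_counts, total_reward).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k          # N_i(t): how often each arm was pulled
    reward_sum = [0.0] * k   # cumulative reward collected per arm
    total = 0.0
    for t in range(horizon):
        if t < k:                      # pull every arm once first
            arm = t
        elif rng.random() < epsilon:   # explore uniformly at random
            arm = rng.randrange(k)
        else:                          # exploit the best empirical mean
            arm = max(range(k), key=lambda i: reward_sum[i] / pulls[i])
        r = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        pulls[arm] += 1
        reward_sum[arm] += r
        total += r
    return pulls, total
```

Only the pulled arm yields a reward in each round, mirroring the protocol described above.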
In this paper we consider FairSMAB, a variant of the SMAB problem where, in addition to the above objective of maximizing the sum of the expected rewards (or, equivalently, minimizing the cumulative regret), the algorithm must also ensure that each arm is pulled for at least a given fraction of the total number of rounds, at any round. This imposes an additional constraint on the algorithm, which we refer to as a fairness constraint. The fairness constraint is specified as a vector of size $K$, where each component is the minimum fraction of the total number of time steps for which the corresponding arm has to be pulled. The goal is to minimize the regret while satisfying the fairness requirement of each arm.
Such fairness constraints are natural in many real-world resource allocation problems where the arms are individuals or agents competing for a common resource. In the context of the SMAB
setting, fairness constraints ensure that no individual is starved of opportunities, irrespective of her quality. This objective, which at times is at odds with the objective of maximizing efficiency, conforms with the
veil of ignorance doctrine of Rawls [1], wherein each individual has an equal claim to the resource without knowledge of their true qualities in the original position (refer to [2; 3] for a detailed discussion). For concreteness, we next present several motivating examples for the work done in this paper.

Sponsored Search: An advertiser, characterized by a click-through rate (CTR), competes for ad space on a search engine such as Google, Bing, etc. In the absence of any regulatory measures to ensure equitable allocation of ad space, new and/or local businesses run the risk of being starved of publicity by big corporations. Fairness regulations ensure that local businesses get the visibility on online platforms required to sustain their business.
Wireless Communication [4]: Consider a wireless communication system where a central access point allocates the channel to one of the transmitters for some fixed amount of time, called a time slot. For each successful transmission, a reward is generated that depends in some way on the transmitter (e.g., the quality of the information transmitted). In addition to maximizing reward, the access point also needs to guarantee a certain minimum quality of service to each transmitter irrespective of the reward it generates.
Crowdsourcing: Consider a crowdsourcing scenario where a central planner assigns several micro-tasks to crowdworkers (or agents). The goal is to ensure high-quality work from the agents. As the agents are heterogeneous in terms of their qualities, the planner seeks the best-quality agents. However, in order to induce participation, the algorithm has to guarantee each agent a certain number of tasks beforehand. In this work we capture this constraint in terms of the fraction of tasks to be assigned to each agent.
Our contributions: In this paper, we study the FairSMAB problem, a variant of the SMAB problem where, in addition to the goal of maximizing the expected cumulative reward, an algorithm also has to ensure that each arm is pulled for at least a given fraction of the total number of time steps, at any round. After formally defining the FairSMAB problem, we define the notion of fairness that we use in this paper. Further, we evaluate the regret of our algorithm with a fairness-aware regret notion called Regret$_T$. This regret notion is a natural extension of the conventional notion of regret, and is defined with respect to an optimal policy that also has to satisfy the fairness constraints. We then define a class of FairSMAB algorithms, called FairALG, characterized by two parameters: the unfairness tolerance $\alpha$, and the learning algorithm used as a black box by FairALG. We prove a fairness guarantee for FairALG that holds uniformly over time, independent of the choice of the learning algorithm. Further, when the learning algorithm is UCB1, we show that an $O(\log T)$ Regret$_T$ bound can be achieved. We then evaluate the cost of fairness in FairSMAB with respect to the conventional notion of regret. We conclude by providing detailed experimental results to validate our theoretical guarantees.
Outline of the Paper:
In the next section we discuss related work in the area of fairness in machine learning, and fairness in multi-armed bandits in particular. In Section 3 we discuss the model considered in the paper and introduce the notions of fairness and Regret$_T$. In Section 4 we propose T-aware algorithms that guarantee fairness at the end of $T$ rounds. In Section 5 we introduce a fair learning framework that guarantees fairness at any time $t$. Further, the proposed framework can be used as a black box with any learning algorithm. We use the UCB1 algorithm as a plug-in algorithm and show that FairUCB is Regret$_T$-optimal (up to problem-dependent constants). In Section 6 we compare the UCB1 algorithm with FairUCB based on the conventional notion of regret. In Section 7 we show via extensive simulations the trade-off between the fairness vector $r$ and the unfairness tolerance $\alpha$. We also compare the performance of the proposed FairUCB with the LFG algorithm proposed in [4]. We conclude the paper in Section 8 with a brief discussion of future work.

2 Related Work
There has been a surge in research efforts aimed at ensuring fairness in decision making by machine learning algorithms such as classification algorithms [5; 6; 7; 8], regression algorithms [9; 10], ranking and recommendation systems [11; 12; 13; 14; 15], online learning algorithms [4; 16; 17], etc. Here, we present the relevant work in the context of online learning, particularly in the SMAB setting.
Joseph et al. [18] propose a variant of the upper confidence bound algorithm that ensures what the authors call meritocratic fairness, i.e., an arm is never preferred over a better arm irrespective of the algorithm's confidence about the mean reward of each arm. This guarantees individual fairness (see [19]) for each arm while achieving efficiency in terms of sublinear regret. In contrast, we consider fairness constraints that are exogenously specified, and the choices made by the algorithm must adapt to these constraints so as to minimize the regret while satisfying them. The work by Liu et al. [17] aims at ensuring "treatment equality", wherein similar individuals are treated similarly in the SMAB setup. This outcome-based notion of fairness considers that the fairness constraints are built into the problem. Gillen et al. [16] consider individual fairness guarantees with respect to an unknown fairness metric.
A recent paper by Li et al. [4] considers a combinatorial, sleeping SMAB setup with fairness constraints similar to the ones considered in this paper. The algorithm in [4] controls the trade-off between minimizing regret and satisfying fairness constraints using a tuning parameter. In our simulations we consider the algorithm proposed in [4] as a baseline to compare the performance of our algorithm in terms of both fairness and regret. In addition to proving an instance-independent Regret$_T$ bound as in [4], we also show a Regret$_T$ bound with finer dependence on the instance parameters. Further, we provide an explicit dependence of the regret on the fairness constraints. We also provide a stronger fairness guarantee that holds uniformly over time, as compared to the asymptotic fairness guarantee in [4]. A detailed comparison of the two algorithms is given in Section 7.
3 Model
In this section we formally define the FairSMAB problem, and then define the notions of fairness and regret used in this work.
3.1 The FairSMAB Problem
An instance of a FairSMAB (respectively, SMAB) problem is a tuple $\langle T, [K], \mu, r \rangle$ (respectively, $\langle T, [K], \mu \rangle$), where $T$ is the time horizon, $[K] = \{1, 2, \ldots, K\}$ is the set of arms, $\mu = (\mu_1, \ldots, \mu_K)$ represents the means of the reward distributions associated with the arms, and $r = (r_1, \ldots, r_K)$ represents the fairness constraint vector. Given a fairness constraint vector $r$, $r_i$ is the fairness constraint for arm $i$ and denotes the minimum fraction of times arm $i$ needs to be pulled in $t$ rounds, for any $t \le T$. Note that $\mu_i \in [0,1]$ and $r_i \ge 0$ for all $i \in [K]$.
In each round $t$, a FairSMAB algorithm pulls an arm $i_t$ and collects the reward $X_{i_t}(t)$. We assume that the reward distributions are Bernoulli$(\mu_i)$ for each arm $i$. This assumption holds without loss of generality, since one can reduce an SMAB problem with general reward distributions supported on $[0,1]$ to an SMAB problem with Bernoulli rewards using the extension provided in [20]. Note that the true value of $\mu_i$ is unknown to the algorithm. Throughout this paper we assume without loss of generality that $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$, and arm 1 is called the optimal arm.
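The reduction from general $[0,1]$-supported rewards to Bernoulli rewards can be sketched as follows: on observing a reward $x \in [0,1]$, the learner feeds its algorithm a coin flip with bias $x$; the binary reward then has the same mean $\mu_i$ as the original arm. The helper names below are our own.

```python
import random

def binarize_reward(x, rng):
    """Map a reward x in [0, 1] to a Bernoulli sample with mean x.

    E[binarize_reward(x)] = 1 * x + 0 * (1 - x) = x, so the reduction
    preserves the mean of every arm's reward distribution.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    return 1 if rng.random() < x else 0

def empirical_mean(x, n=10000, seed=0):
    """Sanity check: the empirical mean of the binarized rewards
    concentrates around the original reward value x."""
    rng = random.Random(seed)
    return sum(binarize_reward(x, rng) for _ in range(n)) / n
```

Any analysis carried out for Bernoulli arms therefore carries over to general $[0,1]$-bounded rewards.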
The performance of a FairSMAB algorithm is evaluated based on the regret that it incurs and the fairness guarantee that the algorithm can satisfy. In the next section, we formalize the notions of fairness and regret that we use in this paper.
3.2 Notion of Fairness
In the FairSMAB setting, the fairness constraints are exogenously specified to the algorithm in the form of a vector $r = (r_1, \ldots, r_K)$ where $0 \le r_i \le 1/K$ for all $i \in [K]$, and consequently $\sum_{i \in [K]} r_i \le 1$; here $r_i$ denotes the minimum fraction of times arm $i$ has to be pulled in $t$ rounds, for any $t \le T$. We consider $r_i \le 1/K$ to be consistent with the notion of proportionality, wherein guaranteeing any arm a fraction greater than its proportional share, which is $1/K$, is unfair in itself. Note that our Regret$_T$ guarantees hold for any $r$ such that $\sum_{i \in [K]} r_i < 1$. We first begin with the definition of fairness put forth by Li et al. [4] and then define our notion of fairness.
Definition 1 ([4]).
A FairSMAB algorithm $\mathcal{A}$ is called (asymptotically) fair if $\liminf_{T \to \infty} \mathbb{E}[N_i(T)]/T \ge r_i$ for all $i \in [K]$, where $N_i(T)$ denotes the number of times $\mathcal{A}$ pulls arm $i$ in $T$ rounds.
We refer to the above notion of fairness as asymptotic fairness for reasons that are clear from the definition itself. In our work we prove a stronger notion of fairness that holds uniformly over time. In addition, we define our fairness in terms of the unfairness tolerance allowed in the system, which is denoted by a constant $\alpha \ge 0$ and is given to the algorithm. Formally, we introduce the following notion of fairness.
Definition 2.
Given an unfairness tolerance $\alpha \ge 0$, a FairSMAB algorithm is called $\alpha$-fair if $N_i(t) \ge \lfloor r_i\, t \rfloor - \alpha$ for all $t$ and for all arms $i \in [K]$.
In particular, if the above guarantee holds for $\alpha = 0$, then we call the FairSMAB algorithm $0$-fair. Note that our fairness guarantee holds uniformly over the time horizon and for any sequence of arm pulls by the algorithm. Hence it is much stronger than the guarantee in [4], which only guarantees asymptotic fairness. Note that, for a given constant $\alpha$, $\alpha$-fairness implies asymptotic fairness.
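Definition 2 can be checked mechanically on any realized sequence of pulls. The sketch below assumes the $\lfloor r_i\, t\rfloor - \alpha$ reading of the definition; the function name is ours.

```python
import math

def is_alpha_fair(pull_sequence, r, alpha):
    """Check Definition 2 on a realized sequence of arm pulls.

    pull_sequence: list of arm indices i_1, ..., i_T (0-based)
    r:             fairness fractions, one per arm
    alpha:         unfairness tolerance (a constant, not a fraction)

    Returns True iff N_i(t) >= floor(r_i * t) - alpha holds for every
    prefix length t and every arm i, i.e., uniformly over time.
    """
    counts = [0] * len(r)
    for t, arm in enumerate(pull_sequence, start=1):
        counts[arm] += 1
        for i, r_i in enumerate(r):
            if counts[i] < math.floor(r_i * t) - alpha:
                return False
    return True
```

For example, strict round-robin over two arms is $0$-fair for $r = (0.5, 0.5)$, while always pulling one arm is not.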
3.3 Notions of Regret
The performance of an SMAB algorithm is measured by the cumulative regret it incurs in $T$ rounds. The expected regret of an SMAB algorithm is defined as the difference between the cumulative reward of the optimal policy and that of the algorithm. In the SMAB setting, the optimal policy is the one that pulls the optimal arm in every round.
Definition 3.
The expected regret of an algorithm $\mathcal{A}$ after $T$ rounds is defined as:

$$\mathbb{E}[\mathcal{R}_T(\mathcal{A})] = T\,\mu_1 - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{i_t}\Big]. \qquad (1)$$
The expected regret of $\mathcal{A}$ can equivalently be written in terms of the expected number of pulls of the suboptimal arms and the expected regret incurred due to playing a suboptimal arm. In particular, if $\Delta_i = \mu_1 - \mu_i$ and $N_i(T)$ denotes the number of pulls of arm $i$ by $\mathcal{A}$ in $T$ rounds, then the expected regret of $\mathcal{A}$ after $T$ rounds can be written as:

$$\mathbb{E}[\mathcal{R}_T(\mathcal{A})] = \sum_{i=2}^{K} \Delta_i\, \mathbb{E}[N_i(T)]. \qquad (2)$$
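The gap decomposition of Equation (2) is direct to evaluate given the true means and the (expected) pull counts; the helper name below is ours.

```python
def expected_regret(mu, expected_pulls):
    """Expected regret via the gap decomposition (Equation 2).

    mu:             true means; the best arm is the one with max(mu)
    expected_pulls: E[N_i(T)] for each arm i

    Returns sum_i Delta_i * E[N_i(T)], where Delta_i = max(mu) - mu[i].
    The optimal arm contributes zero since its gap is zero.
    """
    best = max(mu)
    return sum((best - m) * n for m, n in zip(mu, expected_pulls))
```

For instance, with means $(0.9, 0.5, 0.4)$ and pull counts $(80, 10, 10)$, the regret is $0.4 \cdot 10 + 0.5 \cdot 10 = 9$.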
We call an algorithm optimal if it attains zero expected regret. It is easy to see that the above notion of regret does not adequately quantify the performance of a FairSMAB algorithm, as the optimal policy here does not account for the fairness constraints. We first characterize the fairness-aware optimal policy that we consider as a baseline.
Observation 1.
A FairSMAB algorithm $\mathcal{A}^\star$ is optimal iff it satisfies $N_i(T) = \lceil r_i\, T \rceil$ for all $i \ne 1$ and $N_1(T) = T - \sum_{i \ne 1} \lceil r_i\, T \rceil$.
From Observation 1 we have that an optimal FairSMAB algorithm that knows the values $\mu_i$ must play the suboptimal arms (arms $i \ne 1$) exactly $\lceil r_i\, T \rceil$ times in order to satisfy the fairness constraint, and play the optimal arm (arm 1) for the rest of the rounds, i.e., for $T - \sum_{i \ne 1} \lceil r_i\, T \rceil$ rounds. The regret of an algorithm is compared with such an optimal policy that satisfies the fairness constraints in the FairSMAB setting.
Definition 4.
Given a fairness constraint vector $r$ and the unfairness tolerance $\alpha$, the fairness-aware Regret$_T$ of a FairSMAB algorithm $\mathcal{A}$ is defined as:

$$\text{Regret}_T(\mathcal{A}) = \Big(T - \sum_{i \ne 1} \lceil r_i\, T \rceil\Big)\mu_1 + \sum_{i \ne 1} \lceil r_i\, T \rceil\, \mu_i - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{i_t}\Big]. \qquad (3)$$
Note that, for a given $r$, the above definition is reasonable only when $\sum_{i \in [K]} \lceil r_i\, T \rceil \le T$. Hence, we consider $T$ large enough so that the above definition is consistent. An algorithm that is not aware of the true means of the reward distributions of the arms faces the exploration vs. exploitation dilemma. On the one hand, it has to sufficiently explore all the arms so as to find an optimal arm; on the other, it must exploit the information gathered about the mean rewards of the arms. The fairness constraints assist in exploration by guaranteeing $\lceil r_i\, T \rceil$ samples for each arm $i$. Note that the first $\lceil r_i\, T \rceil$ pulls of any suboptimal arm $i$ do not incur any Regret$_T$, as the optimal fair algorithm also has to pull each suboptimal arm for $\lceil r_i\, T \rceil$ rounds. A learning algorithm that pulls a suboptimal arm $i$ for more than $\lceil r_i\, T \rceil$ rounds incurs a regret of $\Delta_i$ for each extra pull. The technical difficulty in designing an optimal algorithm for the FairSMAB problem lies in the conflicting constraints on the quantity $N_i(T)$ for a suboptimal arm $i$: for the algorithm to be fair we want $N_i(T)$ to be at least $\lfloor r_i\, T \rfloor - \alpha$, whereas to minimize the regret we want $N_i(T)$ to be close to $\lceil r_i\, T \rceil$.
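The fairness-aware baseline and the resulting Regret$_T$ of Equation (3) can be computed directly for a realized allocation; the function below is a sketch under our own naming, with arm 0 playing the role of the optimal arm.

```python
import math

def fair_regret(mu, r, pulls):
    """Fairness-aware Regret_T of a realized allocation (Equation 3).

    mu:    true means, with mu[0] the optimal arm
    r:     fairness fractions (arm 0 receives the remaining rounds,
           so r[0] plays no role in the baseline's quotas)
    pulls: realized pull counts N_i(T), summing to the horizon T

    The baseline pulls each suboptimal arm i exactly ceil(r_i * T)
    times and the optimal arm for the remaining rounds.
    """
    T = sum(pulls)
    quota = [math.ceil(r_i * T) for r_i in r]
    baseline = sum(q * m for q, m in zip(quota[1:], mu[1:]))
    baseline += (T - sum(quota[1:])) * mu[0]
    achieved = sum(n * m for n, m in zip(pulls, mu))
    return baseline - achieved
```

An allocation that matches the baseline exactly has zero Regret$_T$, and each extra pull of a suboptimal arm $i$ beyond its quota adds exactly $\Delta_i$.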
4 T-aware Algorithms
An algorithm that has access to the time horizon $T$ can trade off fairness and regret more effectively. To see this, notice that in order to identify the best arm quickly it is important that an algorithm explores the arms in the initial rounds. This observation, along with Observation 1, gives us that if the arms are pulled initially to satisfy the fairness constraints, the algorithm incurs no Regret$_T$ and at the same time learns the rewards of each arm. In other words, the algorithm incurs no Regret$_T$ for the first $\sum_{i \in [K]} \lceil r_i\, T \rceil$ rounds. If $r$ is such that $\lceil r_i\, T \rceil$ is sufficient to explore each arm $i$ and find the best arm with high probability, then one can pull the best arm for the rest of the rounds.^1 (^1 Notice that the fairness constraints are satisfied at $t = T$.) Guided by this intuition, we propose two T-aware FairSMAB algorithms that achieve sublinear regret.

Warming up – Naive Algorithm: We begin with Naive (Algorithm 1), a variant of the exploration-separated policy ExpSep [21], that achieves a sublinear regret guarantee in terms of the time horizon $T$. It is easy to see that Naive is fair. We show in Theorem 1 that Naive attains sublinear regret (proof in Appendix B).
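Algorithm 1 is not reproduced in this excerpt; the sketch below is one plausible reading of such an exploration-separated scheme, with the quota sizing ($\lceil r_i\, T\rceil$ up-front pulls per arm) and tie-breaking being our own assumptions: satisfy the fairness constraints first, then commit to the empirical best arm.

```python
import math
import random

def naive_fair(means, r, horizon, seed=0):
    """Exploration-separated sketch: fairness pulls first, then commit.

    Pulls arm i for ceil(r_i * horizon) rounds up front (satisfying
    the fairness constraint at t = horizon), then plays the arm with
    the best empirical mean for the remaining rounds.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls, rew = [0] * k, [0.0] * k

    def pull(i):
        pulls[i] += 1
        rew[i] += 1.0 if rng.random() < means[i] else 0.0

    for i in range(k):                        # fairness / exploration phase
        for _ in range(math.ceil(r[i] * horizon)):
            pull(i)
    best = max(range(k),
               key=lambda i: rew[i] / pulls[i] if pulls[i] else 0.0)
    while sum(pulls) < horizon:               # commit phase
        pull(best)
    return pulls
```

Because the quota pulls happen before any adaptive choice, the regret analysis separates cleanly into the two phases.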
Theorem 1.
The Regret$_T$ of the Naive algorithm for the FairSMAB problem is sublinear in $T$.
UCB-based Algorithm (T-FairUCB): We propose a UCB-based T-aware fair algorithm, T-FairUCB. This algorithm knows the time horizon $T$, effectively separates the fairness-constraint-satisfaction phase from the regret-minimization phase, and achieves logarithmic regret in terms of $T$, with dependence on the values of the fairness fractions.
T-FairUCB is presented in Algorithm 2. Note that T-FairUCB satisfies the fairness requirements of all arms at $t = T$ itself and hence it is fair. Next we show that T-FairUCB achieves logarithmic regret.
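Algorithm 2 itself is not shown in this excerpt; the sketch below captures the phase separation just described under our own assumptions: quotas of $\lceil r_i\, T\rceil$ pulls per arm (at least one, so that every empirical mean is defined), followed by standard UCB1 with exploration bonus $\sqrt{2\ln t / N_i(t)}$.

```python
import math
import random

def t_fair_ucb(means, r, horizon, seed=0):
    """T-aware fair UCB sketch: fairness phase, then UCB1.

    Phase 1 pulls each arm i for ceil(r_i * horizon) rounds (at least
    once), so the fairness constraints hold at t = horizon. Phase 2
    runs UCB1 on the counts and rewards accumulated in phase 1.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls, rew = [0] * k, [0.0] * k

    def pull(i):
        pulls[i] += 1
        rew[i] += 1.0 if rng.random() < means[i] else 0.0

    for i in range(k):                                   # fairness phase
        for _ in range(max(1, math.ceil(r[i] * horizon))):
            pull(i)
    t = sum(pulls)
    while t < horizon:                                   # UCB1 phase
        t += 1
        ucb = [rew[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
               for i in range(k)]
        pull(max(range(k), key=lambda i: ucb[i]))
    return pulls
```

The fairness-phase pulls double as exploration, which is exactly what makes the subsequent UCB phase incur only logarithmic regret.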
Theorem 2.
For the FairSMAB problem, T-FairUCB has $O(\log T)$ regret. In particular, its regret bound depends explicitly on the fairness fractions $r_i$ and the gaps $\Delta_i = \mu_1 - \mu_i$.
Proof Outline.
T-FairUCB does not incur any regret in the first $\sum_{i \in [K]} \lceil r_i\, T \rceil$ rounds. After these rounds, T-FairUCB decides which arm to play at time $t$ based on the UCB estimates of the arms. For the UCB algorithm, we know that $\mathbb{E}[N_i(T)] = O(\log T / \Delta_i^2)$ for any suboptimal arm $i$. Hence, if for any arm $i$ we have $\lceil r_i\, T \rceil \ge O(\log T / \Delta_i^2)$, then that arm will be played only a small constant number of times after the fairness phase, and hence the regret due to such an arm is bounded by a small value. On the other hand, if for some suboptimal arm $i$ we have $\lceil r_i\, T \rceil < O(\log T / \Delta_i^2)$, then we incur a regret equal to $\Delta_i$ for the extra pulls, i.e., for at most $O(\log T / \Delta_i^2) - \lceil r_i\, T \rceil$ rounds. Hence, the expected Regret$_T$ of T-FairUCB is $O(\log T)$. The proof is provided in Appendix B. ∎
A FairSMAB algorithm is evaluated on two criteria: the fairness guarantee it can provide and its Regret$_T$ bound. Our main contribution in this paper is a class of FairSMAB algorithms, called FairALG, characterized by two parameters: the unfairness tolerance $\alpha$, and the learning algorithm used as a black box by FairALG. In the next section we consider an "anytime" version of the algorithm: the time horizon $T$ is not given as an input, and hence the fairness guarantee has to be satisfied at all times $t$.
5 T-agnostic Algorithms
In this section, we provide the template of our proposed FairSMAB algorithm. Recall from Section 3.2 our definition of an $\alpha$-fair FairSMAB algorithm. For an algorithm to be $\alpha$-fair, it needs to satisfy $N_i(t) \ge \lfloor r_i\, t \rfloor - \alpha$ for all $t$ and for all arms $i$, i.e., the fairness deficit $r_i\, t - N_i(t)$ of every arm must stay bounded. In each round $t$ we are therefore interested in the arms that could possibly violate the fairness constraints, and hence look at arms $i$ such that $r_i\, t - N_i(t-1) > \alpha$. Having provided this intuition, we describe our algorithm.
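The template just described can be sketched as follows. The deficit rule follows the description above; Learn() is instantiated with UCB1 as a stand-in (making this a FairUCB-style sketch), the initialization branch avoids undefined empirical means, and all function names are our own.

```python
import math
import random

def fair_alg(means, r, alpha, horizon, seed=0):
    """FairALG template: forced fairness pulls first, Learn() otherwise.

    At each round t, arms whose deficit r_i * t - N_i(t-1) exceeds the
    tolerance alpha are candidates for a forced fairness pull; the
    candidate with the largest deficit is pulled. If no constraint is
    close to being violated, the round is delegated to the learning
    algorithm (UCB1 here).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls, rew = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        deficits = [r[i] * t - pulls[i] for i in range(k)]
        candidates = [i for i in range(k) if deficits[i] > alpha]
        if candidates:                       # forced fairness pull
            arm = max(candidates, key=lambda i: deficits[i])
        elif 0 in pulls:                     # initialize: try each arm once
            arm = pulls.index(0)
        else:                                # Learn() = UCB1 pull
            ucb = [rew[i] / pulls[i]
                   + math.sqrt(2 * math.log(t) / pulls[i])
                   for i in range(k)]
            arm = max(range(k), key=lambda i: ucb[i])
        pulls[arm] += 1
        rew[arm] += 1.0 if rng.random() < means[arm] else 0.0
    return pulls
```

Note that the fairness branch is independent of the plug-in learner, which is what allows the fairness guarantee of the next theorem to hold for any choice of Learn().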
We show that our fairness guarantee holds for any algorithm in this class, i.e., for FairALG with any plug-in learner. In particular, when the learning algorithm Learn() = UCB1, we call this algorithm FairUCB, and in Theorem 4 we prove an $O(\log T)$ Regret$_T$ bound for FairUCB.
5.1 Theoretical Results
We begin by analyzing the fairness guarantee provided by FairALG.
Theorem 3.
For a given $\alpha \ge 0$ and any fairness constraint vector $r$ with $\sum_{i \in [K]} r_i \le 1$, FairALG is $\alpha$-fair irrespective of the choice of the learning algorithm Learn().
Proof.

After each round $t$ (and before round $t+1$), define the fairness deficit of arm $i$ as $d_i(t) = r_i\, t - N_i(t)$, and consider the sets $A_t$ and $B_t$, as defined below:

$A_t = \{\, \text{arm } i : \alpha < d_i(t) < 1 + \alpha \,\}$,

$B_t = \{\, \text{arm } i : d_i(t) \le \alpha \,\}$.

Then the following lemma guarantees the fairness of the algorithm and is at the heart of the proof. It is proved immediately after the proof of the theorem.

Lemma 1.

For all $t \ge 1$, we have

1. $A_t \cup B_t = [K]$,

2. $d_i(t) < 1 + \alpha$, for all $i \in [K]$.

Condition 1 in Lemma 1 ensures that at any time $t$, the sets $A_t$ and $B_t$ form a partition of the set of arms. Hence the arm played at the $(t+1)$-th round by the algorithm is from one of these sets. As a part of the proof of Lemma 1, in Observation 2 we show that if $i$ is the arm played at the $(t+1)$-th round, then after $t+1$ rounds $i \in A_{t+1} \cup B_{t+1}$. Also, in Observation 3 we show that if an arm $j$ is not played in the $(t+1)$-th round, then after $t+1$ rounds arm $j \in A_{t+1} \cup B_{t+1}$ as well. We note that the two conditions in Lemma 1 are true after the first round, and the two observations together ensure that these conditions remain true for all $t$. Hence, all arms satisfy $d_i(t) < 1 + \alpha$ for all $t$, which implies $N_i(t) > r_i\, t - 1 - \alpha$. In particular, since $N_i(t)$ is an integer, we have $N_i(t) \ge \lfloor r_i\, t \rfloor - \alpha$, for all $t$, for all arms $i \in [K]$, which by Definition 2 proves that FairALG is $\alpha$-fair. ∎

Proof of Lemma 1.

We begin with two complementary observations and then prove the lemma by induction. Note that if arm $i$ is pulled in round $t+1$ then $d_i(t+1) = d_i(t) + r_i - 1$, and if it is not pulled then $d_i(t+1) = d_i(t) + r_i$.

Observation 2.

Let $i$ be the arm pulled by FairALG in round $t+1$.

1. if $i \in B_t$, then $i \in B_{t+1}$;

2. if $i \in A_t$, then $i \in A_{t+1} \cup B_{t+1}$.

Proof.

Case 1: $i \in B_t$. Then after round $t+1$, we have

$d_i(t+1) = d_i(t) + r_i - 1 \le \alpha + r_i - 1 \le \alpha$ (since $r_i \le 1$).

Case 2: $i \in A_t$. Then after round $t+1$, we have

$d_i(t+1) = d_i(t) + r_i - 1 < 1 + \alpha + r_i - 1$ (since $d_i(t) < 1 + \alpha$)

$\qquad\qquad \le 1 + \alpha$ (since $r_i \le 1$). ∎

Observation 3.

Let $j$ be any arm not pulled at time $t+1$.

1. If $j \in B_t$, then $j \in A_{t+1} \cup B_{t+1}$;

2. If $j \in A_t$, then $j \in A_{t+1} \cup B_{t+1}$.

Proof.

Case 1: $j \in B_t$. Then after round $t+1$, we have

$d_j(t+1) = d_j(t) + r_j \le \alpha + r_j$ (since $j \in B_t$)

$\qquad\qquad < 1 + \alpha$ (since $r_j < 1$).

Case 2: $j \in A_t$. Then $d_j(t) + r_j > \alpha$, so arm $j$ is a candidate for a fairness pull in round $t+1$. Since FairALG pulls the candidate with the largest deficit, the arm $i$ that is pulled satisfies $d_i(t) + r_i \ge d_j(t) + r_j$, and a counting argument over the candidate arms (using $\sum_{i \in [K]} r_i \le 1$) shows that

$d_j(t+1) = d_j(t) + r_j < 1 + \alpha$,

and hence $j \in A_{t+1} \cup B_{t+1}$. ∎

Induction base case ($t = 1$): Let $i$ be the arm pulled at $t = 1$. Then $d_i(1) = r_i - 1 \le 0 \le \alpha$, so $i \in B_1$.

For all $j \ne i$, we have $d_j(1) = r_j < 1 \le 1 + \alpha$, hence $j \in A_1 \cup B_1$.

Thus, conditions (1) and (2) of the lemma hold.

Inductive Step: Assuming the conditions in the lemma hold after round $t$, we show that they hold after round $t+1$. Let $i$ be the arm pulled in round $t+1$.

Case 1: $i \in B_t$. From Observation 2, we know $i \in B_{t+1}$. From Observation 3, we know that for any arm $j \ne i$, $j \in A_{t+1} \cup B_{t+1}$. Hence, every arm lies in $A_{t+1} \cup B_{t+1}$ and every deficit is smaller than $1 + \alpha$.

Thus, conditions (1) and (2) in the lemma hold after round $t+1$.

Case 2: $i \in A_t$.

From Observation 2, we know $i \in A_{t+1} \cup B_{t+1}$. Together with Observation 3 applied to every arm $j \ne i$, every arm lies in $A_{t+1} \cup B_{t+1}$, and every deficit remains smaller than $1 + \alpha$. Hence, conditions (1) and (2) of the lemma hold after round $t+1$. ∎
We proved above that, given an unfairness tolerance $\alpha$, FairALG is $\alpha$-fair. In particular, note that the guarantee also holds when $\alpha = 0$, and hence FairALG with $\alpha = 0$ is $0$-fair. Next, we provide an upper bound on the Regret$_T$ of FairUCB.
Theorem 4.
For the FairSMAB problem, FairUCB has $O(\log T)$ Regret$_T$. In particular, the Regret$_T$ of FairUCB depends on the gaps $\Delta_i = \mu_1 - \mu_i$ and on how the fairness quotas $\lceil r_i\, T \rceil$ compare with the $O(\log T / \Delta_i^2)$ exploration budget of UCB1.
Proof.
Recall that $\mathrm{UCB}_i(t) = \hat{\mu}_i(t) + c_i(t)$ is the UCB estimate of the mean of arm $i$, where $\hat{\mu}_i(t)$ is the empirical estimate of the mean of arm $i$ when it is played $N_i(t)$ times in $t$ rounds and $c_i(t) = \sqrt{2\ln t / N_i(t)}$ is the confidence interval of arm $i$ at round $t$. Similar to the analysis of the UCB1 algorithm (Appendix A, Theorem 8), we upper bound the expected number of times a suboptimal arm is pulled. We do this for each suboptimal arm by considering two cases, depending on the number of times the suboptimal arm is required to be pulled to satisfy its fairness constraint, i.e., on the value of the quantity $\lceil r_i\, T \rceil$.

Case 1: Let $i$ be a suboptimal arm with $\lceil r_i\, T \rceil \ge \frac{8\ln T}{\Delta_i^2}$ (follows from Section A.2). Since the fairness quota alone covers the exploration budget of UCB1, it follows from the proof of Theorem 8 that arm $i$ is pulled only a constant number of times beyond its quota. Hence, such an arm contributes only a constant to the Regret$_T$.

Case 2: Let $i$ be a suboptimal arm with $\lceil r_i\, T \rceil < \frac{8\ln T}{\Delta_i^2}$. Then the proof of Theorem 8 can be appropriately adapted to show that $\mathbb{E}[N_i(T)] \le \frac{8\ln T}{\Delta_i^2} + O(1)$. Hence, the extra pulls beyond the quota contribute at most $\Delta_i \big( \frac{8\ln T}{\Delta_i^2} - \lceil r_i\, T \rceil \big) + O(\Delta_i)$ to the Regret$_T$.

From the two cases discussed above, we can conclude that $\text{Regret}_T(\text{FairUCB}) = O(\log T)$. ∎
Next, we prove that the instance-independent Regret$_T$ of FairUCB is $O(\sqrt{KT\log T})$.
Theorem 5.
The instance-independent Regret$_T$ of FairUCB is $O(\sqrt{KT\log T})$.
Proof.
Recall from Definition 4 the expression for the Regret$_T$ of a FairSMAB algorithm $\mathcal{A}$. Note that, given any instance with $K$ arms, means $\mu$, and a constant $\alpha$, the Regret$_T$ for any fairness vector $r$ is bounded by the Regret$_T$ of the same instance with $r = \mathbf{0}$; this follows from the fact that $\mu_i \le \mu_1$ for all $i \in [K]$, so the fair baseline collects no more reward than the unconstrained optimal policy. But when $r = \mathbf{0}$, FairUCB is the same as UCB1. Hence, from the instance-independent regret bound of UCB1 (see Appendix A), the result follows. Thus we can bound the instance-independent Regret$_T$ of FairUCB as $O(\sqrt{KT\log T})$. ∎
6 Cost of Fairness
Our regret guarantees until now have been in terms of the extended notion of regret, i.e., Regret$_T$. In the previous section we showed that FairUCB achieves $O(\log T)$ Regret$_T$. We now evaluate the cost of fairness in terms of the conventional notion of regret, i.e., how much we lose in terms of regret in comparison to an SMAB algorithm without any fairness constraints. In particular, we show the trade-off between regret and fairness in terms of the unfairness tolerance $\alpha$.
Theorem 6.
For the FairSMAB problem with Learn() = UCB1, the conventional regret of FairALG is given by

$$\mathbb{E}[\mathcal{R}_T] = O\Big( \sum_{i \ne 1} \Delta_i \max\Big\{ \lceil r_i\, T \rceil,\ \frac{8 \ln T}{\Delta_i^2} \Big\} \Big),$$

where $\Delta_i = \mu_1 - \mu_i$.
Proof.
From Section 3, Equation (2), we know that $\mathbb{E}[\mathcal{R}_T] = \sum_{i=2}^{K} \Delta_i\, \mathbb{E}[N_i(T)]$, and hence we can bound the expected regret of an algorithm by bounding the expected number of pulls of each suboptimal arm. In particular, we want to bound the quantity $\mathbb{E}[N_i(T)]$ for every suboptimal arm $i$. We do this by considering two cases, depending on how many times the arm has to be pulled to satisfy the fairness constraint, i.e., on how large the quantity $\lceil r_i\, T \rceil$ is.
Case 1: Let $i$ be a suboptimal arm with $\lceil r_i\, T \rceil \ge \frac{8\ln T}{\Delta_i^2}$ (follows from Section A.2). Since the fairness pulls alone cover the exploration budget of UCB1, it follows from the proof of Theorem 8 that $\mathbb{E}[N_i(T)] \le \lceil r_i\, T \rceil + O(1)$.

Case 2: Let $i$ be a suboptimal arm with $\lceil r_i\, T \rceil < \frac{8\ln T}{\Delta_i^2}$. Then the proof of Theorem 8 can be appropriately adapted to show that $\mathbb{E}[N_i(T)] \le \frac{8\ln T}{\Delta_i^2} + O(1)$.

From the two cases discussed above, we can conclude that

$$\mathbb{E}[\mathcal{R}_T] = \sum_{i \ne 1} \Delta_i\, \mathbb{E}[N_i(T)] = O\Big( \sum_{i \ne 1} \Delta_i \max\Big\{ \lceil r_i\, T \rceil,\ \frac{8\ln T}{\Delta_i^2} \Big\} \Big),$$

where $\Delta_i = \mu_1 - \mu_i$. ∎
Theorem 6 captures the explicit trade-off in regret in terms of $r$, which characterizes the fairness constraints. Notice the trade-off between the fairness guarantees achieved by the algorithm and the asymptotic regret guarantees. If $\lceil r_i\, T \rceil \le \frac{8\ln T}{\Delta_i^2}$ for every suboptimal arm $i$, the regret is $O(\log T)$; in particular, for $r = \mathbf{0}$ the regret is $O(\log T)$. However, if $\lceil r_i\, T \rceil > \frac{8\ln T}{\Delta_i^2}$ for some arm $i$, then each additional fairness pull of arm $i$ incurs an additional regret equal to $\Delta_i$. Note that in this case the regret can be linear in $T$. We complement these results with simulations in Section 7.
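The threshold behaviour described above can be illustrated numerically by comparing the fairness quota $\lceil r_i\, T \rceil$ with the UCB1 exploration budget $\frac{8\ln T}{\Delta_i^2}$: when the quota fits inside the budget, the fairness pulls are "free" (they would have happened anyway, asymptotically), and otherwise each surplus pull costs $\Delta_i$ in conventional regret. The constant 8 follows the classical UCB1 analysis and is an assumption here, as is the function name.

```python
import math

def fairness_quota_is_free(r_i, delta_i, horizon):
    """Return True if arm i's fairness quota ceil(r_i * T) is within
    the number of pulls UCB1 would spend on it anyway, so that the
    fairness constraint adds no extra conventional regret
    (asymptotically, up to constants)."""
    quota = math.ceil(r_i * horizon)
    ucb_budget = 8 * math.log(horizon) / delta_i ** 2
    return quota <= ucb_budget
```

For example, with $T = 10^6$ and $\Delta_i = 0.1$, a tiny fraction $r_i = 10^{-4}$ is absorbed by exploration, while $r_i = 0.1$ is not.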
7 Experimental Results
In this section we show the results of simulations that validate our theoretical guarantees. First, we represent the cost of fairness by showing the trade-off between regret and fairness with respect to the unfairness tolerance $\alpha$. Second, we evaluate the performance of our algorithms in terms of Regret$_T$ and the fairness guarantee by considering the algorithm by Li et al. [4], called Learning with Fairness Guarantee (LFG), as a baseline.
7.1 Tradeoff: Fairness vs. Regret
We consider a FairSMAB instance with $K$ arms and fairness vector $r$, and show the results for a range of values of $\alpha$. Figure 5 shows the trade-off between regret, in the conventional sense, and fairness. As can be seen, the cost of fairness can be linear in terms of regret up to a certain value of $\alpha$. This implies that until the threshold for $\alpha$ is reached, where the regret drops from linear to logarithmic, the fairness constraints cause some suboptimal arms to be pulled more often than the number of times an arm needs to be pulled to determine its mean reward with sufficient confidence. On the other hand, for values of $\alpha$ beyond this threshold, the regret reduces drastically, and we recover the logarithmic regret expected from the classic UCB1 algorithm. Note that the threshold for $\alpha$ is problem-dependent in this case.
7.2 Comparison: FairUCB vs. LFG
As we detailed in Section 2, the work closest to ours is that by Li et al. [4], and their algorithm, called Learning with Fairness Guarantee (LFG), is used as a baseline in the following simulation results. The simulation parameters we consider for comparing Regret$_T$ are the same as in the previous section. Figure 5 shows the plot of time vs. Regret$_T$ for FairUCB and LFG. Note that FairUCB and LFG perform comparably in terms of the Regret$_T$ suffered. Also, the simulation results validate our theoretical claim of a logarithmic Regret$_T$ bound.
We next contrast the fairness guarantee of FairALG with that of LFG. To show this comparison we consider a second instance. Even though we tested the fairness guarantee for several values of $\alpha$, we show the plot for one value, as it turns out to be the appropriate scale to compare the performance of FairALG and LFG without losing any detail in terms of the fairness violation. Figure 5 shows the plot of time vs.