In a classical stochastic multi-armed bandit (S-MAB) problem, a decision maker is faced with a finite set of choices (henceforth referred to as arms). At each time step, the decision maker decides which choice to select (referred to as pulling an arm). Once the decision maker pulls an arm, she gets a random reward drawn from a fixed reward distribution unknown to her. The arms that are not pulled do not give any reward. The goal of the decision maker is to choose an arm in each round so that the total expected reward over all pulls is maximized. The challenge she faces is known in the literature as the exploration vs. exploitation dilemma: whether to explore the arms to find the best one in terms of expected reward, or to exploit by pulling the arm that has given the best average reward so far.
In this paper we consider Fair S-MAB, a variant of the S-MAB problem where, in addition to the above objective of maximizing the sum of the expected rewards (or equivalently minimizing the cumulative regret), the algorithm must ensure that, at the end of every round, each arm has been pulled for at least a given fraction of the rounds played so far. We refer to this additional constraint as a fairness constraint. It is specified by a vector with one component per arm, where each component is the minimum fraction of the time steps for which the corresponding arm has to be pulled. The goal is to minimize the regret while satisfying the fairness requirement of each arm.
Such fairness constraints are natural in many real-world resource allocation problems where the arms are individuals or agents competing for a common resource. In the context of the SMAB setting, fairness constraints ensure that no individual starves from a lack of opportunities, irrespective of her quality. This objective, which at times is at odds with the objective of maximizing efficiency, conforms with the veil of ignorance doctrine of Rawls, wherein each individual has an equal claim to the resource without knowledge of their true qualities in the original position (refer to [2; 3] for a detailed discussion). For concreteness, we next present several motivating examples for the work done in this paper.
Sponsored Search: An advertiser, characterized by a click-through rate (CTR), competes for ad space on a search engine such as Google or Bing. In the absence of any regulatory measures to ensure an equitable allocation of ad space, new or local businesses run the risk of being starved of publicity by big corporations. Fairness regulations ensure that local businesses get the visibility they require on online platforms to sustain their business.
Wireless Communication: Consider a wireless communication system where a central access point allocates the channel to one of the transmitters for some fixed amount of time, called a time slot. For each successful transmission, a reward is generated that depends in some way on the transmitter (e.g. the quality of information transmitted). In addition to maximizing reward, the access point also needs to guarantee a certain minimum quality of service to each transmitter irrespective of the reward it generates.
Crowdsourcing: Consider a crowdsourcing scenario where a central planner assigns several micro tasks to the crowdworkers (or agents). The goal is to ensure high quality work from the agents. As the agents are heterogeneous in terms of their qualities, the goal is to find the best quality agents. However, in order to induce participation from the agents, the algorithm has to ensure that each agent is guaranteed a certain number of tasks beforehand. In this work we capture this constraint in terms of the fraction of tasks to be assigned to each agent.
Our contributions: In this paper, we study the Fair-SMAB problem, a variant of the SMAB problem where, in addition to maximizing the expected cumulative reward, an algorithm also has to ensure that, at any round, each arm has been pulled for at least a given fraction of the total number of time steps. After formally defining the Fair-SMAB problem, we define the notion of fairness that we use in this paper. We evaluate the regret of our algorithm with a fairness-aware regret notion called r-Regret. This notion is a natural extension of the conventional notion of regret, and is defined with respect to an optimal policy that also has to satisfy the fairness constraints. We then define a class of Fair-SMAB algorithms, called Fair-ALG, characterized by two parameters: the unfairness tolerance and the learning algorithm used as a black box by Fair-ALG. We prove a fairness guarantee for Fair-ALG that holds uniformly over time, independent of the choice of the learning algorithm. Further, when the learning algorithm is UCB1, we show that a logarithmic r-Regret bound can be achieved. We then evaluate the cost of fairness in Fair-SMAB with respect to the conventional notion of regret. We conclude by providing detailed experimental results to validate our theoretical guarantees.
Outline of the Paper:
In the next section we discuss related work on fairness in machine learning, and on fairness in multi-armed bandits in particular. In Section 3 we discuss the model considered in the paper and introduce the notions of -fairness and r-Regret. In Section 4 we propose T-aware algorithms that guarantee -fairness at the end of T rounds. In Section 5 we introduce a fair learning framework that guarantees -fairness at any time. Further, the proposed framework can use any learning algorithm as a black box. We use the UCB algorithm as the plug-in algorithm and show that Fair-UCB is r-Regret optimal (up to a problem-dependent constant). In Section 6 we compare the UCB algorithm with Fair-UCB based on the conventional notion of regret. In Section 7 we show via extensive simulations the trade-off between the fairness vector and the unfairness tolerance value. We also compare the performance of the proposed Fair-UCB with the LFG algorithm of Li et al. We conclude the paper in Section 8 with a brief discussion of future work.
2 Related Work
There has been a surge in research efforts aimed at ensuring fairness in decision making by machine learning algorithms such as classification algorithms [5; 6; 7; 8], regression algorithms [9; 10], ranking and recommendation systems [11; 12; 13; 14; 15], online learning algorithms [4; 16; 17], etc. Here, we present the relevant work in the context of online learning, particularly in the SMAB setting.
Joseph et al. propose a variant of the upper confidence bound algorithm that ensures what the authors call meritocratic fairness, i.e., an arm is never preferred over a better arm irrespective of the algorithm's confidence over the mean reward of each arm. This guarantees individual fairness for each arm while achieving efficiency in terms of sub-linear regret. In contrast, we consider fairness constraints that are exogenously specified, and the choices made by the algorithm must adapt to these constraints so as to minimize the regret while satisfying them. The work by Liu et al. aims at ensuring "treatment equality", wherein similar individuals are treated similarly in the SMAB setup. This outcome-based notion of fairness considers that the fairness constraints are built into the problem. Gillen et al. consider individual fairness guarantees with respect to an unknown fairness metric.
A recent paper by Li et al. considers a combinatorial, sleeping SMAB setup with fairness constraints similar to the ones considered in this paper. Their algorithm controls the trade-off between minimizing regret and satisfying fairness constraints using a tuning parameter. In our simulations we use their algorithm as a baseline to compare the performance of our algorithm in terms of both fairness and regret. In addition to proving an instance-independent r-Regret bound as in their work, we also show an r-Regret bound with finer dependence on the instance parameters. Further, we provide an explicit dependence of regret on the fairness constraints. We provide a stronger fairness guarantee that holds uniformly over time, as compared to their asymptotic fairness guarantee. A detailed comparison of the two algorithms is given in Section 7.
In this section we formally define the Fair-SMAB problem followed by defining the notions of fairness and regret which are used in this work.
3.1 The Fair-SMAB Problem
An instance of a Fair-SMAB (respectively, SMAB) problem is a tuple (respectively, ), where is the time horizon, is the set of arms, represents the mean of the reward distribution associated with arm , and represents the fairness constraint vector. Given a fairness constraint vector , is the fairness constraint for arm and denotes the minimum fraction of times arm needs to be pulled in rounds, for any . Note that and .
In each round, a Fair-SMAB algorithm pulls an arm and collects the reward. We assume that the reward distribution of each arm is Bernoulli. This assumption holds without loss of generality, since one can reduce the SMAB problem with general distributions supported on [0,1] to an SMAB problem with Bernoulli rewards using the extension provided in . Note that the true means of the reward distributions are unknown to the algorithm. Throughout this paper we assume without loss of generality that arm 1 has the highest mean reward, and arm 1 is called the optimal arm.
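The interaction protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BernoulliBandit`, the helper `binarize`, and the `seed` parameter are hypothetical and not part of the paper. The helper shows one common way to turn a reward in [0,1] into a Bernoulli reward with the same mean, which may or may not coincide with the extension cited in the text.

```python
import random


class BernoulliBandit:
    """Stochastic bandit whose arm i pays 1 with probability means[i], else 0."""

    def __init__(self, means, seed=None):
        if not all(0.0 <= m <= 1.0 for m in means):
            raise ValueError("each mean must lie in [0, 1]")
        self.means = list(means)          # hidden from the learning algorithm
        self._rng = random.Random(seed)   # private generator for reproducibility

    @property
    def num_arms(self):
        return len(self.means)

    def pull(self, arm):
        """Return a Bernoulli reward for the pulled arm; other arms pay nothing."""
        return 1 if self._rng.random() < self.means[arm] else 0


def binarize(reward, rng=random):
    """Replace a reward in [0, 1] by a coin flip that succeeds with that probability."""
    return 1 if rng.random() < reward else 0
```

The algorithm only ever sees the output of `pull`; the vector `means` plays the role of the unknown mean rewards.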
The performance of a Fair-SMAB algorithm is evaluated based on the regret that it incurs and the fairness guarantee that the algorithm can satisfy. In the next section, we formalize the notions of fairness and regret that we use in this paper.
3.2 Notion of Fairness
In the Fair-SMAB setting, the fairness constraints are exogenously specified to the algorithm in the form of a vector where , for all , and consequently and denotes the minimum fraction of times an arm has to be pulled in rounds, for any . We consider to be consistent with the notion of proportionality, wherein guaranteeing any arm a fraction greater than its proportional fraction, which is , is unfair in itself. Note that our r-Regret guarantees hold for any such that where . We begin with the definition of fairness put forth by Li et al. and then define our notion of fairness.
Definition 1 ().
A Fair-SMAB algorithm is called (asymptotically) fair if for all .
We refer to the above notion of fairness as asymptotic fairness for reasons that are clear from the definition itself. In our work we prove a stronger notion of fairness that holds uniformly over time. In addition to this, we define our fairness in terms of the unfairness tolerance allowed in the system which is denoted by a constant and is given to the algorithm. Formally, we introduce the following notion of fairness.
Given an unfairness tolerance , a Fair-SMAB algorithm is called -fair if for all and for all arms .
In particular, if the above guarantee holds for , then we call the Fair-SMAB algorithm fair. Note that our fairness guarantee holds uniformly over the time horizon and for any sequence of arm pulls by the algorithm. Hence it is much stronger than the guarantee of Li et al., which only ensures asymptotic fairness. Note that, for a given constant , -fairness implies asymptotic fairness.
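To make the uniform-in-time requirement concrete, the following sketch checks a recorded sequence of arm pulls against a tolerance. It assumes, as our reading of the definition, that the condition for arm i after t rounds is floor(r_i * t) - N_{i,t} <= tolerance, where N_{i,t} is the number of pulls of arm i so far. The function names and the use of `Fraction` (to avoid floating-point error inside the floor) are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from math import floor


def max_fairness_deficit(pulls, r):
    """Largest floor(r[i] * t) - N[i][t] seen over all arms i and all rounds t."""
    counts = [0] * len(r)
    worst = float("-inf")
    for t, arm in enumerate(pulls, start=1):
        counts[arm] += 1
        for i, r_i in enumerate(r):
            worst = max(worst, floor(r_i * t) - counts[i])
    return worst


def is_fair_within_tolerance(pulls, r, tolerance):
    """True if the assumed fairness condition holds at every round of the sequence."""
    return max_fairness_deficit(pulls, r) <= tolerance


# Two arms, each entitled to a quarter of the rounds.
r = [Fraction(1, 4), Fraction(1, 4)]
print(is_fair_within_tolerance([0, 1, 0, 1], r, 0))  # alternating pulls
print(is_fair_within_tolerance([0, 0, 0, 0], r, 0))  # arm 1 is starved
```

Because the check runs over every prefix of the sequence, it reflects the any-time nature of the guarantee rather than an asymptotic one.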
3.3 Notions of Regret
The performance of an SMAB algorithm is measured by the cumulative regret it incurs in T rounds. The expected regret of an SMAB algorithm is defined as the difference between the cumulative reward of the optimal policy and that of the algorithm. In the SMAB setting, the optimal policy is the one that pulls the optimal arm in every round.
The expected regret of an algorithm after rounds is defined as:
The expected regret of can equivalently be written in terms of the expected number of pulls of the sub-optimal arms and the expected regret incurred due to playing a sub-optimal arm. In particular, if and denotes the number of pulls of an arm by in rounds, then the expected regret of after rounds is defined as:
We call an algorithm optimal if it attains zero regret. It is easy to see that the above notion of regret does not adequately quantify the performance of a Fair-SMAB algorithm as the optimal policy here does not account for the fairness constraints. We first characterize the fairness-aware optimal policy that we consider as a baseline.
A Fair-SMAB algorithm is optimal iff satisfies for all .
From Observation 1 we have that an optimal Fair-SMAB algorithm that knows the value of must play sub-optimal arms (arms ) exactly times in order to satisfy the fairness constraint and play the optimal arm (arm 1) for the rest of the rounds i.e. for rounds. The regret of an algorithm is compared with such an optimal policy that satisfies the fairness constraints in the Fair-SMAB setting.
Given a fairness constraint vector and the unfairness tolerance , the fairness-aware -Regret of a Fair-SMAB algorithm is defined as:
Note that for a given , the above definition is reasonable only when for all . Hence, we consider large enough so that the above definition is consistent. An algorithm that is not aware of the true means of the reward distributions of the arms faces the exploration vs. exploitation dilemma. On the one hand, it has to explore all the arms sufficiently to find an optimal arm; on the other, it must exploit the information gathered about the mean rewards of the arms. The fairness constraints assist in exploration by guaranteeing samples for each arm . Note that the pulls of any sub-optimal arm do not incur any r-Regret, as the optimal fair algorithm also has to pull each sub-optimal arm for rounds. A learning algorithm that pulls a sub-optimal arm for more than rounds incurs a regret of for each extra pull. The technical difficulty in designing an optimal algorithm for the Fair-SMAB problem lies in the conflicting constraints on the quantity for a sub-optimal arm : for the algorithm to be fair we want to be at least , whereas to minimize the regret we want to be close to .
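The difference between the two regret notions can be illustrated numerically. The sketch below charges each arm its gap to the best mean for every pull (conventional regret), or only for pulls beyond a per-arm quota that a fair optimal policy would also have to spend (fairness-aware regret). The quota rule max(0, floor(r_i * T) - tolerance) is an assumption made for this illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def conventional_regret(means, pulls_per_arm):
    """Sum over arms of (gap to the best mean) * (number of pulls)."""
    best = max(means)
    return sum((best - m) * n for m, n in zip(means, pulls_per_arm))


def fairness_aware_regret(means, pulls_per_arm, quotas):
    """Charge each arm only for pulls beyond its fairness quota (not clipped at 0)."""
    best = max(means)
    return sum((best - m) * (n - q) for m, n, q in zip(means, pulls_per_arm, quotas))


def quotas_from_constraints(r, horizon, tolerance=0):
    """Assumed quota rule: max(0, floor(r_i * T) - tolerance) pulls for arm i."""
    return [max(0, floor(r_i * horizon) - tolerance) for r_i in r]


means = [1.0, 0.5]                      # arm 0 is optimal, arm 1 has gap 0.5
pulls = [80, 20]                        # pull counts after T = 100 rounds
r = [Fraction(1, 10), Fraction(1, 10)]  # each arm is owed 10% of the rounds
quotas = quotas_from_constraints(r, 100)
print(conventional_regret(means, pulls))            # all 20 pulls of arm 1 are charged
print(fairness_aware_regret(means, pulls, quotas))  # only the 10 extra pulls are charged
```

In this example the fairness-aware notion forgives the ten pulls of the sub-optimal arm that any fair policy must make, which is the intuition stated in the paragraph above.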
4 T-aware Algorithms
An algorithm that has access to the time horizon T can trade off fairness and regret more effectively. To see this, notice that in order to identify the best arm quickly, an algorithm should explore the arms in the initial rounds. This observation, along with Observation 1, tells us that if the arms are pulled initially to satisfy the fairness constraints, the algorithm incurs no regret and at the same time learns the rewards from each arm. In other words the algorithm incurs no regret for first number of rounds. If is such that the is sufficient to explore each arm and find the best arm with high probability, then one can pull the best arm for the rest of the rounds. (Footnote 1: Notice that fairness constraints are satisfied at .) Guided by this intuition we propose two T-aware Fair-SMAB algorithms that achieve sub-linear regret.
Warming up – Naive Algorithm: We begin with Naive (Algorithm 1), a variant of the exploration-separated policy ExpSep, which achieves a sub-linear regret guarantee in terms of the time horizon T. It is easy to see that Naive is fair. We show in Theorem 1 that Naive attains sub-linear regret (proof in Appendix B).
The regret of Naive algorithm for Fair-SMAB problem,
UCB-based Algorithm (T-fair-ucb): We propose a UCB-based T-aware fair algorithm, T-fair-ucb. This algorithm knows the time horizon T, separates the fairness-constraint satisfaction phase from the regret-minimization phase, and achieves regret that is logarithmic in T, with a dependence on the values of the fairness fractions.
T-fair-ucb is presented in Algorithm 2. Note that T-fair-ucb satisfies the fairness requirements of all arms at itself and hence it is fair. Next we show that T-fair-ucb achieves logarithmic regret.
For Fair-SMAB problem, T-fair-ucb has regret . In particular, its r-dependent regret is given by
T-fair-ucb does not incur any regret in the first rounds. After , T-fair-ucb decides which arm to play at time based on the UCB estimates of the arms. For the UCB algorithm, we know that for any sub-optimal arm . Hence, if for any arm we have , then that arm will be played only a small constant number of times after , and hence the regret due to such an arm is bounded by a small value. On the other hand, if for some sub-optimal arm , , then we incur a regret equal to for rounds i.e. for at most rounds. Hence, the expected r-Regret of T-fair-ucb, . The proof is provided in Appendix B. ∎
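The two-phase idea behind a T-aware algorithm can be sketched as follows. Algorithm 2 is not reproduced in this text, so the code is one plausible reading rather than the authors' exact procedure: phase 1 pulls arm i for floor(r_i * T) rounds (at least once, so that every index is defined), and phase 2 runs the standard UCB1 index on top of the samples already collected. The function names, the simulated Bernoulli rewards, and the `seed` parameter are assumptions made for the illustration.

```python
import math
import random
from fractions import Fraction


def ucb1_index(empirical_mean, pulls, t):
    """Standard UCB1 index: empirical mean plus sqrt(2 ln t / pulls)."""
    return empirical_mean + math.sqrt(2.0 * math.log(t) / pulls)


def t_aware_fair_ucb(means, r, horizon, seed=0):
    """Return the per-arm pull counts of a two-phase, horizon-aware sketch."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    reward_sums = [0.0] * k

    def pull(arm):
        reward = 1 if rng.random() < means[arm] else 0  # simulated Bernoulli reward
        counts[arm] += 1
        reward_sums[arm] += reward

    # Phase 1: spend the fairness quota of every arm up front.
    for arm in range(k):
        quota = max(1, math.floor(r[arm] * horizon))
        for _ in range(quota):
            if sum(counts) < horizon:
                pull(arm)

    # Phase 2: UCB1, reusing the samples gathered in phase 1.
    while sum(counts) < horizon:
        t = sum(counts) + 1
        arm = max(range(k),
                  key=lambda i: ucb1_index(reward_sums[i] / counts[i], counts[i], t))
        pull(arm)
    return counts


r = [Fraction(1, 10)] * 3
print(t_aware_fair_ucb([0.9, 0.2, 0.2], r, horizon=1000))
```

Since the quotas are met before phase 2 starts, the fairness requirement at the horizon holds regardless of what the index rule does afterwards.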
A Fair-SMAB algorithm is evaluated based on two criteria: the fairness guarantee it can provide and the -Regret bound of the algorithm. Our main contribution in this paper is proposing a class of Fair-SMAB algorithms, called Fair-ALG, characterized by two parameters: the unfairness tolerance, and the learning algorithm used as a black-box by Fair-ALG. In the next section we consider an "any-time" version of the algorithm. We consider that the time horizon is not given as an input and hence the fairness guarantee has to be satisfied at all times.
5 T-agnostic Algorithms
In this section, we provide the template of our proposed Fair-SMAB algorithm. Recall from Section 3.2 our definition of an -fair Fair-SMAB algorithm. For an algorithm to be -fair , it needs to satisfy , for all , for all arms , which is equivalent to . In each round we are interested in the arms that could possibly violate the fairness constraints and hence look at arms such that . Having provided this intuition, we describe our algorithm.
We show that our fairness guarantee holds for any algorithm in this class, i.e. Fair-ALG. In particular, when the learning algorithm Learn() = UCB1, we call this algorithm Fair-UCB, and in Theorem 4 we prove an r-Regret bound for Fair-UCB.
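A minimal sketch of the template, under assumptions: since the algorithm listing and the exact selection condition are not reproduced in this text, we assume that before round t an arm is at risk when r_i * (t - 1) - N_{i,t-1} exceeds the tolerance, and that the most deficient at-risk arm is pulled; otherwise the black-box learner chooses. The signature `learn(counts, t)` is hypothetical and simplified, since a real learner such as UCB1 would also need the observed rewards.

```python
from fractions import Fraction
from math import floor


def fair_alg(r, tolerance, horizon, learn):
    """Return the sequence of arms pulled by the fairness wrapper around `learn`."""
    k = len(r)
    counts = [0] * k
    pulls = []
    for t in range(1, horizon + 1):
        # Deficit of arm i before round t: its entitlement so far minus its pulls.
        deficits = [r[i] * (t - 1) - counts[i] for i in range(k)]
        at_risk = [i for i in range(k) if deficits[i] > tolerance]
        if at_risk:
            arm = max(at_risk, key=lambda i: deficits[i])  # most deficient arm first
        else:
            arm = learn(counts, t)                         # defer to the black box
        counts[arm] += 1
        pulls.append(arm)
    return pulls


def always_first_arm(counts, t):
    """Deliberately unfair stand-in for Learn(): it always proposes arm 0."""
    return 0


r = [Fraction(1, 5)] * 3   # exact fractions avoid floating-point error in the floor
print(fair_alg(r, 0, 10, always_first_arm))
```

Even with a learner that never proposes arms 1 and 2, the wrapper forces enough pulls of them; this is the sense in which the fairness guarantee does not depend on the choice of the learning algorithm.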
5.1 Theoretical Results
We begin by first analyzing the fairness guarantee provided by Fair-ALG .
For a given and for any given fairness constraint vector where for all , Fair-ALG is -fair irrespective of the choice of the learning algorithm Learn().
After each round (and before round ), we consider the sets, , and , as defined below:
Let , for all . Then the following lemma guarantees the fairness of the algorithm and is at the heart of the proof. It is proved immediately after the proof of the theorem.
For , we have
, for all
Condition 1 in Lemma 1 ensures that at any time , the sets form a partition of the set of arms. Hence the arm played at the -th round by the algorithm is from one of these sets. As a part of the proof of Lemma 1, in Observation 2 we show that if is the arm played at the ()-th round then after rounds . Also in Observation 3 we show that if an arm is not played in the ()-th round then after rounds arm for all . We note that the two conditions in Lemma 1 are true after the first round, and then the two observations together ensure that these conditions remain true for all . Hence, all arms satisfy for all , which implies . In particular, we have , for all , for all , which by Definition 2 proves that Fair-ALG is -fair . ∎
Proof of Lemma 1.
We begin with two complementary observations and then prove the lemma by induction.
Let be the arm pulled by Fair-ALG in round .
if , then
if for some , then
Case 1: . Then after round , we have
Case 2: for some . Then after round , we have
Let be any arm not pulled at time .
If , then
If for some , then
Case 1: . Then after round , we have
Case 2: for some . Then after round , we have
Induction base case (): Let be the arm pulled at . Then
For all , we have . Hence,
Thus, conditions (1) and (2) of the lemma hold.
Inductive Step: Assuming the conditions in the lemma hold after round , we show that they hold after round .
Case 1: . From Observation 2, we know . From Observation 3, we know that for any arm , . Hence,
Thus, Conditions (1) and (2) in the lemma hold after round .
Case 2: , for some .
From Observation 2, we know . Hence,
Also, . Hence, Conditions (1) and (2) of the lemma hold after round . ∎
We proved above that, given an unfairness tolerance , Fair-ALG is -fair. In particular, note that the guarantee also holds when and hence Fair-ALG with is fair. Next, we provide an upper bound on the -Regret of Fair-UCB.
For Fair-SMAB problem, Fair-UCB has -Regret . In particular, the -Regret of Fair-UCB is given by
is the UCB estimate of the mean of arm, where is the empirical estimate of the mean of arm when it is played in rounds and
is the confidence interval of the arm at round . Similar to the analysis of the UCB1 algorithm (Appendix A, Theorem 8), we upper bound the expected number of times a sub-optimal arm is pulled. We do this for each sub-optimal arm by considering two cases, depending on the number of times the sub-optimal arm must be pulled to satisfy its fairness constraint, i.e., on the value of the quantity .
Case 1: Let and . Then
(Follows from Section A.2)
Since , it follows from the proof of Theorem 8 that . Hence, .
Case 2: Let and
Then the proof of Theorem 8 can be appropriately adapted to show that . Hence
Suppose . Then from the two cases discussed above, we can conclude that
Hence, . ∎
Next, we prove that the instance independent regret of Fair-UCB is .
The instance-independent -Regret of Fair-UCB is .
Recall from Definition 4 our expression for the -Regret of a Fair-SMAB algorithm . We know,
Note that, given any instance with arms, , and a constant ,
The last inequality follows from the fact that for all , and is a constant. This implies that the regret for any instance with given value of is bounded by the regret of the same instance for . But when , Fair-UCB is the same as UCB1. Hence, from the instance-independent regret bound of UCB1 (see Appendix A), the result follows. Thus we can bound the instance-independent regret of Fair-UCB as . ∎
6 Cost of Fairness
Our regret guarantees until now have been in terms of the extended notion of regret, i.e. r-Regret. In the previous section we showed that Fair-UCB achieves logarithmic r-Regret. We now evaluate the cost of fairness in terms of the conventional notion of regret, i.e., how much we lose in regret compared to an SMAB algorithm without any fairness constraints. In particular, we show the trade-off between regret and fairness in terms of the unfairness tolerance.
For the Fair-SMAB problem where Learn() = UCB1, the regret of Fair-ALG is given by
From Section 3, Equation 2 we know that and hence, we can bound the expected regret of an algorithm by bounding the expected number of pulls of a sub-optimal arm. In particular, we want to bound the quantity for every sub-optimal arm . We do this by considering two cases dependent on how many times the arm has been pulled to satisfy the fairness constraint, i.e. on how large is the quantity .
Case 1: Let and . Then
(Follows from Section A.2)
Since , it follows from the proof of Theorem 8 that .
Case 2: Let and
Then the proof of Theorem 8 can be appropriately adapted to show that . Hence
Then from the two cases discussed above, we can conclude that
where . ∎
Theorem 6 captures the explicit trade-off in regret in terms of which characterizes the fairness constraints. Notice the trade-off between the fairness guarantees achieved by the algorithm and the asymptotic regret guarantees. If we have that the regret is . This implies that for the regret is . However, if then for each , an additional regret equal to is incurred. Note in this case that the regret can be of . We complement these results with simulations in Section 7.
7 Experimental Results
In this section we show the results of simulations that validate our theoretical guarantees. First, we illustrate the cost of fairness by showing the trade-off between regret and fairness with respect to the unfairness tolerance. Second, we evaluate the performance of our algorithms in terms of r-Regret and the fairness guarantee, using the algorithm by Li et al., called Learning with Fairness Guarantee (LFG), as a baseline.
7.1 Trade-off: Fairness vs. Regret
We consider the following Fair-SMAB instance: , , and , where , and . We show the results for . Figure 5 shows the trade-off between regret in terms of the conventional notion and fairness. As can be seen, the cost of fairness can be linear in terms of regret up to a certain value of . This implies that until the threshold for is reached where the regret drops from linear to logarithmic, the fairness constraints cause some sub-optimal arms to be pulled more often than is needed to determine their mean rewards with sufficient confidence. On the other hand, for values of beyond this threshold, the regret reduces drastically, and we recover logarithmic regret as would be expected from the classic UCB1 algorithm. Note that the threshold for in this case is problem-dependent.
7.2 Comparison: Fair-UCB vs. LFG
As we detailed in Section 2, the work closest to ours is that by Li et al., and their algorithm, called Learning with Fairness Guarantee (LFG), is used as a baseline in the following simulation results. The simulation parameters that we consider for comparing r-Regret are the same as in the previous section. Figure 5 shows the plot of time vs. r-Regret for Fair-UCB and LFG. Note that Fair-UCB and LFG perform comparably in terms of the r-Regret suffered by the algorithm. Also, the simulation results validate our theoretical claim of a logarithmic r-Regret bound.
We next contrast the fairness guarantee of Fair-ALG with that of LFG. To show this comparison we consider an instance with , ,