
Towards Soft Fairness in Restless Multi-Armed Bandits

Restless multi-armed bandits (RMAB) is a framework for allocating limited resources under uncertainty. It is an extremely useful model for monitoring beneficiaries and executing timely interventions to ensure maximum benefit in public health settings (e.g., ensuring patients take medicines in tuberculosis care, or ensuring pregnant mothers listen to automated calls about good pregnancy practices). Due to the limited resources, certain communities or regions are typically starved of interventions, which can have follow-on effects. To avoid starvation of interventions across individuals/regions/communities, we first define a soft fairness constraint and then provide an approach to enforce it in RMABs. The soft fairness constraint requires that an algorithm never probabilistically favor one arm over another if the long-term cumulative reward of choosing the latter arm is higher. Our approach incorporates a softmax-based value iteration method into the RMAB setting to design selection algorithms that satisfy the proposed fairness constraint. Our method, referred to as SoftFair, also provides theoretical performance guarantees and is asymptotically optimal. Finally, we demonstrate the utility of our approach on simulated benchmarks and show that the soft fairness constraint can be handled without a significant sacrifice in value.





1 Introduction

The Restless Multi-Armed Bandit (RMAB) process is a generalization of the classical Multi-Armed Bandit (MAB) process, which has been studied since the 1930s Katehakis and Veinott Jr (1987). RMAB is a powerful framework for budget-constrained resource allocation tasks in which a decision-maker must select a subset of arms for intervention in each round. Each arm evolves according to an underlying Markov Decision Process (MDP). The overall objective in an RMAB model is to sequentially select arms so as to maximize the expected cumulative reward collected over all arms. RMAB is relevant to public health monitoring, recommendation systems, and many other domains. Tracking a patient's health or adherence and intervening at the right time is an ideal problem setting for an RMAB Akbarzadeh and Mahajan (2019); Bhattacharya (2018); Mate et al. (2020), where the patient's health/adherence state is represented by an arm. The resource limitation constraint in RMAB arises from the severely limited availability of healthcare personnel. By developing practically relevant approaches for solving RMAB under severe resource limitations, RMAB can assist patients in alleviating health issues such as diabetes Newman et al. (2018), hypertension Brownstein et al. (2007), tuberculosis Chang et al. (2013); Ong'ang'o et al. (2014), and depression Löwe et al. (2004); Mundorf et al. (2018).

While Whittle index based approaches Mate et al. (2021); Lee et al. (2019); Li and Varakantham (2022) address the RMAB problem with a finite time horizon by providing an asymptotically optimal solution, they are susceptible to starving arms, which can have severe repercussions in public health scenarios. Owing to the deterministic strategy of picking the arms that provide the maximum benefit, in many problems only a small set of arms ever gets picked. Figure 1, from our experimental analysis, provides one example, where almost 50% of the arms receive no interventions under the Whittle index approach. While this is optimal with respect to reward, interventions also help educate patients or beneficiaries on potential benefits, and starving many patients of interventions can leave them without a proper understanding of the program and reduce its effectiveness in the long run. Thus, there is a need to avoid starving arms without significantly sacrificing optimality.

Existing works have proposed different notions of fairness in the context of MAB to prevent starvation by enabling the selection of non-optimal arms. Li et al. (2019) study a Combinatorial Sleeping MAB model with Fairness constraints, called CSMAB-F; their fairness definition requires the algorithm to ensure a minimum selection fraction for each arm. Patil et al. (2020) introduce similar fairness constraints in the stochastic MAB problem, using a pre-specified vector to denote the guaranteed number of pulls. Joseph et al. (2016) define fairness as requiring that a worse arm is never favored over a better arm, despite uncertainty about payoffs. Li and Varakantham (2022) formulate the allocation decision-making problem as an RMAB with fairness constraints, where fairness is defined as a minimum rate at which a task or resource is assigned to a user. Since the guaranteed number (or proportion) of pulls is difficult to ascertain a priori, we generalize these fairness notions for MAB, together with the fairness notion introduced by Jabbari et al. (2017) for the reinforcement learning setting. We introduce a soft fairness constraint for RMABs, which requires that an RMAB algorithm never probabilistically favor one arm over another if the long-term cumulative reward of choosing the latter arm is higher.

In summary, our goal is to compute stochastic policies for selecting arms in finite horizon RMAB, which satisfy the soft fairness constraint. To that end, we make the following contributions:

  • A practically relevant algorithm called SoftFair, that enforces the soft fairness constraint and thereby avoids starvation of interventions for arms. Unlike the well-known Whittle index algorithm, SoftFair does not require any indexability assumptions.

  • Performance bounds and theoretical properties of the SoftFair algorithm.

  • Detailed experimental results which demonstrate that SoftFair is competitive with other policies while satisfying the soft fairness constraint.

2 RMAB with Soft Fairness Constraint

In this section, we formally introduce the RMAB problem with the soft fairness constraint. There are N independent arms, each of which evolves according to an associated Markov Decision Process (MDP) characterized by the tuple (S, A, P, R). S represents the state space, A represents the action space, P represents the transition function, and R is the reward function, whose values lie within the interval [0, 1]. T is the horizon of each episode, and β is the discount factor. We use s_t^i and a_t^i to denote the state and action of arm i at time step t, respectively. Let s_t and a_t denote the state vector and action vector of the RMAB over all arms, respectively.

A policy π maps states to a distribution over actions. In particular, π(a_t | s_t) denotes the probability of selecting action vector a_t in state s_t for the RMAB, with Σ_{a_t} π(a_t | s_t) = 1. Similar to Jabbari et al. (2017), we define fairness using the state-action value function as follows:

Definition 1

(Fairness) A stochastic policy π is fair if, for any time step t, any joint state s_t, and any pair of actions a, a':

π(a | s_t) > π(a' | s_t)  ⟹  Q_t(s_t, a) ≥ Q_t(s_t, a').
In summary, the goal of a solution approach is to generate a stochastic policy that never prefers one action over another if the cumulative long-term reward of selecting the latter one is higher. The notations that are frequently used in this paper are summarized in Table 1.
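As a concrete illustration of Definition 1, the fairness condition can be checked mechanically for any finite set of actions. The helper below is a hypothetical sketch of such a check, not part of the paper's algorithm:

```python
import numpy as np

def is_fair(action_probs, q_values, eps=1e-12):
    """Check Definition 1: the policy may place higher probability on action a
    than on a' only if Q(s, a) >= Q(s, a')."""
    action_probs = np.asarray(action_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    for a in range(len(action_probs)):
        for b in range(len(action_probs)):
            # a strictly more likely than b while being strictly worse: unfair
            if action_probs[a] > action_probs[b] + eps and q_values[a] < q_values[b] - eps:
                return False
    return True
```

Any policy that is monotone in the Q-values (e.g., a softmax over them, or the uniform policy) passes this check, while a policy that favors a lower-valued action fails it.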

In this paper, we specifically consider a discrete RMAB where each arm has two states, S = {0, 1}; state 1 (state 0) represents being in the "good" ("bad") state. There is a finite time horizon T for each episode, where T is known to the algorithm in advance. At each time step t, the algorithm can choose K arms to pull according to the observed states s_t, and all arms undergo an action-dependent Markovian state transition to a new state s_{t+1}. Each arm independently receives a reward determined by its new state. Specifically, a_t^i = 1 represents the choice to select arm i at time step t (active action), and a_t^i = 0 represents the decision to be passive for arm i (passive action). Then s_t denotes the vector of states observed at time step t, and a_t denotes the vector of actions taken at t. We have

Σ_{i=1}^{N} a_t^i = K  for all t ∈ [T],

which represents the limited resource constraint. R(s_t, a_t) is the total reward obtained from the RMAB at time step t under state s_t and action a_t. We use a simple reward function: the reward of arm i, determined by the next state s_{t+1}^i obtained by taking action a_t^i in the observed state s_t^i, is r_t^i = s_{t+1}^i for all arms i; note that the expected immediate reward is then Σ_{s'} P(s' | s_t^i, a_t^i) · s'.
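Since each arm is a two-state MDP whose reward is the realized next state, a single arm can be simulated in a few lines. The transition probabilities below are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# P[a][s] is the distribution over next states from state s under action a
# (a = 0 passive, a = 1 active); an intervention raises P(next state = 1).
P = np.array([
    [[0.8, 0.2],   # passive, from "bad" state 0
     [0.4, 0.6]],  # passive, from "good" state 1
    [[0.3, 0.7],   # active, from "bad" state 0
     [0.1, 0.9]],  # active, from "good" state 1
])

def step(state, action):
    """One Markovian transition; the reward r = s' is the next state itself."""
    next_state = int(rng.choice(2, p=P[action][state]))
    return next_state, next_state

# Expected immediate reward of acting in the bad state: sum_{s'} P(s'|0,1) * s'.
expected_reward = float(P[1][0] @ np.array([0.0, 1.0]))  # = 0.7 here
```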

Notation and description:
N: number of competing arms in the RMAB; K: number of arms that can be selected each round; T: time horizon; λ: multiplier parameter.
s_t^i, a_t^i: state and action of arm i at time step t; s_t, a_t: state vector and action vector of the RMAB.
[N]: the set of integers {1, ..., N}, and similarly [T].
Q_m(s, a, t): state-action value function for subsidy m and state s when taking action a at time step t, followed by the optimal Whittle index based policy in future time steps; V_m(s, t): value function for subsidy m and state s starting at time step t under the Whittle index based approach.
Q_t(s, a): state-action value function when taking action a at time step t in state s; V_t(s): value function at time step t in state s.

Table 1: Notations

The objective of the algorithm is to efficiently approximate the maximum cumulative long-term reward while satisfying the resource and fairness constraints. Towards this end, the reward maximization problem can be formulated as

max_π  E[ Σ_{t=1}^{T} Σ_{i=1}^{N} r_t^i ]

such that the resource constraint (Equation 2) and the fairness constraint (Equation 1) are satisfied.

In this paper, we consider the problem of interventions for patient adherence behaviors, and we assign the same value to the adherence of a given arm/patient over time.

3 Method

In this section, we design a probabilistically fair selection algorithm by carefully integrating ideas from value iteration methods with the RMAB setting. We show that our method, called SoftFair, is fair under the proposed fairness constraint. SoftFair relies on the notion of a known next-state distribution: a next-state distribution is said to be known when the algorithm has full knowledge of the transition probabilities. SoftFair requires particular care in computing the action probability distributions, and must restrict the set of such policies to balance fair exploration and fair exploitation. Correctly formulating this restriction to balance fairness and performance relies heavily on observations about the relationship between the two.

To implement value iteration in the RMAB setting, SoftFair first needs to estimate the value function of each arm's state at each time step and calculate the difference between the state-action values of the active and passive actions. SoftFair then maps each arm i's state to a state-specific probability distribution over actions for each time step t. Providing such decision support with a fairness mindset can promote acceptability Rajkomar et al. (2018); Kelly et al. (2019). In the case of beneficiaries, we assume that an arm/patient might consider participation fair when participation (i.e., receiving an active action) results in a greater increase in expected time spent in an adherent state compared to non-participation (i.e., the passive action on the arm/patient). Finally, we can sample the actions to obtain the next states s_{t+1}. Since the strong decomposability of the RMAB makes it sufficient to consider a single-arm process, we now give the details of how to construct the SoftFair algorithm so as to efficiently approximate our constrained long-term reward maximization objective.

The SoftFair method first independently computes, for each arm i in observed state s_t^i at time step t of episode e, the logit value

ΔQ_t^i = Q_t(s_t^i, a^i = 1) − Q_t(s_t^i, a^i = 0),    (4)

the gap between the active and passive state-action values under the current value function estimate. Here V_t(s_t^i) is the value function of arm i in the observed state s_t^i of episode e. SoftFair then maps all arms to a probability distribution over the action set based on the observed state vector s_t. Note that each sampled action selects only one arm; we can sample K times without replacement to obtain the K arms to pull, which ensures that we meet the resource constraint as well as the fairness constraint, and the algorithm is then apprised of the next states s_{t+1}. More specifically, we employ the softmax function to compute the corresponding action probability distribution:

π_t(a^i | s_t) = exp(λ ΔQ_t^i) / Σ_{j=1}^{N} exp(λ ΔQ_t^j),    (5)

where a^i (an indicator vector with value 1 in the i-th position and 0 elsewhere) denotes the action that selects arm i while keeping all other arms passive, and π_t(a^i | s_t) denotes the probability that arm i is selected under state s_t. λ is a multiplier parameter that adjusts the gap between the probabilities of choosing different arms; the update process of SoftFair converges to the Bellman equation (Equation 10) at an exponential rate in λ Song et al. (2019), and λ controls the asymptotic performance Kozuno et al. (2019). When λ → ∞, SoftFair reduces to the standard optimal Bellman operations Asadi and Littman (2017) (refer to Equation 10). After computing the relative probability that arm i is selected, we can derive the probability ρ_t^i that arm i is among the K selected arms; this can be computed through the permutation iteration, and note that ρ_t^i = π_t(a^i | s_t) if K = 1. For each arm i, the value function update at episode e can then be written, for every t ∈ [T], as

V_t(s_t^i) = ρ_t^i Q_t(s_t^i, 1) + (1 − ρ_t^i) Q_t(s_t^i, 0).    (6)
Similarly, we can rewrite the update equation for the state-action value function; we provide it in the appendix. The process of SoftFair is summarized in Algorithm 1.
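The selection step (Equations 4 and 5) and the inclusion probability ρ can be sketched as follows. The ΔQ values and λ below are illustrative, and ρ is estimated here by Monte Carlo sampling rather than the exact permutation iteration the paper uses:

```python
import numpy as np

def selection_probs(delta_q, lam):
    """Softmax (Equation 5) over active-vs-passive value gaps with multiplier lam."""
    delta_q = np.asarray(delta_q, dtype=float)
    z = lam * (delta_q - delta_q.max())  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def inclusion_probs(delta_q, lam, k, n_samples=20000, seed=0):
    """Monte Carlo estimate of rho_i = P(arm i is among the k arms drawn
    without replacement); the paper computes this exactly by permutation iteration."""
    rng = np.random.default_rng(seed)
    p = selection_probs(delta_q, lam)
    counts = np.zeros(len(p))
    for _ in range(n_samples):
        counts[rng.choice(len(p), size=k, replace=False, p=p)] += 1
    return counts / n_samples

delta_q = np.array([0.9, 0.5, 0.1, 0.05])   # hypothetical gaps Delta Q^i
probs = selection_probs(delta_q, lam=5.0)
```

As λ grows, the distribution concentrates on the arm with the largest gap (the greedy choice), while λ → 0 recovers uniform selection; with K = 1 the inclusion probability equals the selection probability itself.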

Input: Transition matrices P, time horizon T, observed states s_1, resource constraint K, multiplier parameter λ, number of episodes E
1  Initialize the value functions;
2  for episode e = 1, ..., E do
3      for step t = 1, ..., T do
4          for arm i = 1, ..., N do
5              Compute Q_t(s_t^i, a) and ΔQ_t^i according to Equation 4;
6          Compute the selection probabilities according to Equation 5;
7          Sample K arms and add them to the action set;
8          for arm i = 1, ..., N do
9              Compute the probability ρ_t^i that arm i will be activated;
10             Update the value function according to Equation 6;
11         Play the arms in the action set, and observe the next states s_{t+1};
Output: The value function for each arm
Algorithm 1 SoftFair Value Iteration (SoftFair)
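Putting the pieces together, one episode of a simplified SoftFair loop might look like the sketch below. For brevity the per-arm Q-tables are assumed precomputed, whereas Algorithm 1 interleaves the value-function update (Equation 6) with the episode; all parameter values are illustrative:

```python
import numpy as np

def softfair_episode(P, Q, lam, k, horizon, seed=0):
    """Run one episode for n two-state arms.

    P: (n, 2, 2, 2) array, P[i, a, s, s'] = transition probability of arm i.
    Q: (n, 2, 2) array, Q[i, s, a] = state-action value of arm i (precomputed here).
    Returns the total reward, where the per-step reward is the sum of next states.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    states = np.zeros(n, dtype=int)
    total = 0.0
    for _ in range(horizon):
        # logit = active-minus-passive value gap for each arm's current state
        gaps = Q[np.arange(n), states, 1] - Q[np.arange(n), states, 0]
        w = np.exp(lam * (gaps - gaps.max()))
        chosen = rng.choice(n, size=k, replace=False, p=w / w.sum())
        actions = np.zeros(n, dtype=int)
        actions[chosen] = 1  # exactly k active arms: the budget constraint holds
        states = np.array([
            rng.choice(2, p=P[i, actions[i], states[i]]) for i in range(n)
        ])
        total += states.sum()  # reward R(s') = s'
    return total
```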

4 Analysis of SoftFair

In this section, we formally analyze SoftFair and provide the associated theoretical support. We begin by connecting SoftFair to the well-known Whittle index algorithm mentioned earlier, and show why the Whittle index approach is not suitable for our setting (a fairness constraint and a finite horizon).

4.1 SoftFair vs. Whittle index based methods

The Whittle index policy is known to be asymptotically optimal for RMAB under an infinite time horizon. It independently assigns an index to each state of each arm to measure how attractive it is to pull that arm in that state. The index is computed using the concept of a "subsidy" m, which can be viewed as the opportunity cost of remaining passive and is awarded to the algorithm for each arm that is kept passive, in addition to the usual reward. The Whittle index of an arm is defined as the infimum subsidy that must be offered to make the algorithm indifferent between pulling and not pulling the arm. Consider a single arm in state s. At each time step t, let Q_m(s, 1) and Q_m(s, 0) denote its active and passive state-action value functions under subsidy m, respectively; we drop the subscript t when there is no ambiguity. We have:

Q_m(s, 0) = m + Σ_{s'} P(s' | s, 0) [R(s') + β V_m(s')],
Q_m(s, 1) = Σ_{s'} P(s' | s, 1) [R(s') + β V_m(s')].

The value function for state s is V_m(s) = max{Q_m(s, 0), Q_m(s, 1)}. The Whittle index can be formally written as:

W(s) = inf_m { m : Q_m(s, 0) ≥ Q_m(s, 1) }.
After computing the Whittle index for each arm, the policy pulls the arms whose current states have the highest indices at each time step. To use the Whittle index approach, the problem needs to satisfy a technical condition called indexability, introduced by Weber and Weiss (1990). Indexability can be expressed simply: if, for an arm under subsidy m, the optimal action is passive, then for every subsidy m' > m the optimal action should remain passive. The RMAB is indexable if every arm is indexable.
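For intuition, the infinite-horizon Whittle index of a single two-state arm can be computed numerically by binary search on the subsidy m, running value iteration with the subsidy added to the passive action. This is a hypothetical sketch under the reward R(s') = s' from Section 2, with illustrative transition probabilities and bounds:

```python
import numpy as np

def q_with_subsidy(P, m, beta=0.95, iters=500):
    """Infinite-horizon Q_m for a two-state arm; P[a][s][s'] are transitions,
    the reward is R(s') = s', and subsidy m is paid for the passive action."""
    R = np.array([0.0, 1.0])
    V = np.zeros(2)
    for _ in range(iters):
        Q = np.array([[(a == 0) * m + P[a][s] @ (R + beta * V)
                       for a in range(2)] for s in range(2)])
        V = Q.max(axis=1)
    return Q

def whittle_index(P, s, beta=0.95, lo=-2.0, hi=2.0, tol=1e-4):
    """Smallest subsidy making the passive action (weakly) optimal in state s."""
    while hi - lo > tol:
        m = (lo + hi) / 2
        Q = q_with_subsidy(P, m, beta)
        if Q[s, 0] >= Q[s, 1]:
            hi = m   # passive already optimal, so the index is at most m
        else:
            lo = m
    return (lo + hi) / 2
```

Indexability means that the set of states where the passive action is optimal grows monotonically with m, so the binary-search threshold is well defined for every state.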

However, in the case of interventions for public health, the Whittle index approach concentrates on the beneficiaries who can most improve the objective (public health outcomes). Some beneficiaries may therefore never get an intervention from public health workers, resulting in poor adherence behavior and hence a bad state from which improvements are small even with intervention, so that they are never selected by the index policy. Refer to Figure 1 for a picture of the difference between the Whittle index approach and SoftFair: when using the Threshold Whittle index method proposed by Mate et al. (2020), the activation frequency of the arms is extremely unbalanced, with nearly half of the arms never being selected. Such starvation of interventions may escalate to entire communities. To avoid such a cycle of bad outcomes, the RMAB needs to consider fairness in addition to maximizing cumulative long-term reward when picking arms.

Figure 1: The x-axis is the number of times an arm is selected, and the y-axis is the frequency distribution, for the RMAB given in Section 2. Left: the Whittle index algorithm. Right: SoftFair. As can be noted, without fairness constraints in place, the arm activation frequency is lopsided, and almost 50% of the arms never get activated.

However, in addition to failing to meet the fairness requirements of many real-world applications, traditional Whittle index based approaches also rely on the assumption of an infinite time horizon, and their performance deteriorates severely when the horizon is finite. Real-world phenomena are often formalized in a finite-horizon setting, which prohibits direct use of Whittle index based methods. We now show why the Whittle index based approach cannot be applied to the finite time horizon setting, by demonstrating that a phenomenon called Whittle index decay Mate et al. (2021) exists in our problem. All detailed proofs can be found in the Appendix.

Theorem 1

In round t, the Whittle index m_t(s) for an arm in observed state s is the value m that satisfies the indifference equation Q_m(s, a = 0, t) = Q_m(s, a = 1, t). The Whittle index decays as the current time step t increases: m_{t+1}(s) ≤ m_t(s).

Proof Sketch.  We first prove a lemma characterizing the value function V_m, and then compute the indices by solving the indifference equations for the two states. We then derive the decay property across consecutive time steps based on the derived lemma. The detailed proof can be found in the appendix.

The Whittle index based approach must solve the costly finite-horizon problem because the index value varies with the time step even in the same state, and computing the index under the finite-horizon setting is expensive in both time and space complexity Hu and Frazier (2017). As an alternative, our SoftFair can naturally approximate the optimal value function at arbitrary time steps while requiring less memory than model-free learning methods such as Q-learning. We now demonstrate why SoftFair satisfies our proposed fairness constraint while effectively approximating the cumulative reward maximization objective.

Theorem 2

Choosing the top K arms according to the probability values in Equation 5 (with λ → ∞) is equivalent to maximizing the cumulative long-term reward.

Proof Sketch.  We first derive the expression for the probability of selecting a given set of K arms, and then prove that selecting the arms with the highest values is equivalent to the optimal policy.

When λ approaches infinity, the algorithm becomes the optimal policy, which suffers from the starvation phenomenon. Given these facts, λ controls the trade-off between optimal performance and the fairness constraint.

Theorem 3

SoftFair is fair under our proposed fairness constraint, and λ controls the trade-off between fairness and optimal performance.

Proof Sketch.  The trade-off is governed by λ: a large λ means that SoftFair tends to choose arms with higher value, while a small λ means that SoftFair tends to ensure fairness among the arms.

4.2 Performance bound of SoftFair

We investigate the K = 1 case, since the multi-arm selection at each time step can be viewed as iterating the single selection. Let B_λ denote our soft operator at time step t (we omit the subscript where unambiguous):

(B_λ Q)(s, a) = Σ_{s'} P(s' | s, a) [ R(s') + β Σ_{a'} (exp(λ Q(s', a')) / Σ_{a''} exp(λ Q(s', a''))) Q(s', a') ].    (9)
Before we derive the performance bound for SoftFair, we first bound the state-action value function in the following lemma.

Lemma 1

The state-action value function is bounded within [0, 1/(1 − β)].

Proof Sketch.  The upper bound is obtained by induction, showing that the state-action values at each iteration remain bounded.

Corollary 1

Since the value function is a weighted combination of the state-action values, the same bound holds: V_t(s) ∈ [0, 1/(1 − β)] for every state s and time step t.

Following Song et al. (2019), we let δ denote the largest distance between state-action value functions. Then we have the following lemma.

Lemma 2

For any λ > 0 and any state s, let w denote the vector of softmax weights and q the vector of state-action values at s (the superscript ⊤ denotes the vector transpose). Then the gap between the softmax backup w⊤q and the greedy backup max_a q(a) is bounded by a function of δ and λ.

Proof Sketch.  We first rewrite the softmax backup in terms of the greedy backup, and then establish the bound by examining the possible values of the difference between state-action values under different actions.

In contrast to the soft operator in Equation 9, let B denote the Bellman optimality operator:

(B Q)(s, a) = Σ_{s'} P(s' | s, a) [ R(s') + β max_{a'} Q(s', a') ].    (10)

For the optimal state-action value function Q*, we have B Q* = Q*.
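The contrast between the two operators can be checked numerically: iterating each to (approximate) convergence on a small two-state arm shows the fixed point of the soft operator approaching the Bellman-optimal fixed point as λ grows. The transition probabilities and λ values below are illustrative:

```python
import numpy as np

def fixed_points(P, beta, lam, iters=2000):
    """Iterate the softmax operator (Eq. 9) and the Bellman optimality
    operator (Eq. 10) on a two-state arm with reward R(s') = s'."""
    R = np.array([0.0, 1.0])
    Q_soft = np.zeros((2, 2))
    Q_opt = np.zeros((2, 2))
    for _ in range(iters):
        new_soft = np.empty((2, 2))
        new_opt = np.empty((2, 2))
        for s in range(2):
            for a in range(2):
                soft_next = 0.0
                for s2 in range(2):
                    # Boltzmann weights over the next-state action values
                    w = np.exp(lam * (Q_soft[s2] - Q_soft[s2].max()))
                    w /= w.sum()
                    soft_next += P[a][s][s2] * (R[s2] + beta * (w @ Q_soft[s2]))
                new_soft[s, a] = soft_next
                new_opt[s, a] = sum(P[a][s][s2] * (R[s2] + beta * Q_opt[s2].max())
                                    for s2 in range(2))
        Q_soft, Q_opt = new_soft, new_opt
    return Q_soft, Q_opt
```

The gap between the soft fixed point and the optimal one shrinks as λ grows, matching the claim that λ → ∞ recovers the optimal Bellman operator.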

Theorem 4

Our SoftFair method achieves a performance bound with respect to the optimal value function V*: the gap between the SoftFair value function and V* is bounded, and the bound shrinks as λ increases.

Proof.  We prove the bound by induction based on Lemmas 1 and 2.

Conjecture 1

For the case where multiple arms can be pulled at each time step, i.e., K > 1, our SoftFair method achieves an analogous performance bound.

Figure 2: The intervention benefit of SoftFair is consistently greater than that of the other baselines. (a) We fix the budget K and horizon T, and vary the number of arms N. (b) We fix N and T, and vary K. (c) We fix N and K, and vary T.
Figure 3: (a) The intervention benefit for different multipliers λ. Here λ = ∞ refers to deterministically selecting the top K arms with the highest cumulative rewards, while still updating the value function with the softmax operator. (b) The action entropy of a single process, investigated over a range of values of λ.
Table 2: Results (intervention benefit and action entropy per policy) for the CPAP adherence dataset.

5 Experiments

In this section, we empirically demonstrate that our proposed method SoftFair, which enforces the probabilistic fairness constraint introduced in Section 2, can effectively approximate the cumulative reward maximization objective compared to the baselines on both (a) a realistic patient adherence behavior dataset Kang et al. (2013) and (b) a synthetic dataset in which the underlying structural constraints outlined in Appendix C are preserved. We consider the average reward criterion over the finite time horizon, and fix a common simulation scenario across methods. All results are averaged over 50 simulations. In particular, we compare our method against the following baselines:


  • Random: At each time step, the algorithm randomly selects K arms to play. This ensures that each arm has the same probability of being selected.

  • Myopic: A myopic policy ignores the impact of present actions on future rewards and instead focuses entirely on the predicted immediate returns. It selects the arms that maximize the expected reward at the immediate next time step. Formally, this corresponds to choosing the K arms with the largest gap in expected immediate reward between the active and passive actions at time step t under the observed state s_t.

  • FairMyopic: After computing the expected immediate reward gap for each arm, instead of deterministically selecting the arms with the highest immediate reward, we apply the softmax function over the gaps to obtain the probability of each arm being selected, and then sample arms according to these probabilities.

  • Oracle: The algorithm of Mate et al. (2021), under the assumption that the states of all arms are fully observable and the transition probabilities are known, without considering fairness constraints. Following Mate et al. (2021), we use a sigmoid function to approximate the Whittle index and select arms deterministically for the finite time horizon problem.

We examine policy performance from two perspectives, cumulative reward and fairness, using two metrics. (a) Intervention benefit: the difference between an algorithm's total expected cumulative reward and the total reward when no intervention is performed, normalized by the difference between the reward obtained without intervention (0% intervention benefit) and that of the asymptotically optimal but fairness-agnostic Oracle baseline (100% intervention benefit). (b) Action entropy: we compute the selection frequency distribution across all time steps and then its entropy H = −Σ_i p_i log p_i, where p_i is the normalized number of times arm i is selected (the number of times arm i has been selected divided by the total number of selections), with p_i log p_i := 0 if an arm is never selected.
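Both metrics can be implemented directly; the helper names below are ours, not the paper's:

```python
import numpy as np

def action_entropy(pull_counts):
    """Entropy of the normalized selection-frequency distribution, with the
    convention p * log p = 0 for arms that are never selected."""
    p = np.asarray(pull_counts, dtype=float)
    p = p / p.sum()
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def intervention_benefit(reward_alg, reward_none, reward_oracle):
    """Reward normalized between no intervention (0.0) and the
    fairness-agnostic Oracle (1.0)."""
    return (reward_alg - reward_none) / (reward_oracle - reward_none)
```

A uniform selection policy attains the maximum entropy log N, while a policy that always pulls the same arm has entropy 0.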

Realistic dataset: Obstructive sleep apnea is one of the most prevalent sleep disorders among adults, and continuous positive airway pressure (CPAP) therapy is a highly effective treatment when used consistently for the duration of each sleep bout; however, non-adherence to CPAP hinders effective treatment of this type of sleep disorder. Similar to Herlihy et al. (2021), we adapt the Markov model of CPAP adherence behavior in Kang et al. (2013) to a two-state system with the clinical adherence criteria. We add a small noise to each transition matrix so that the dynamics of each individual arm are different (see more details about the dataset in Appendix C).

In Table 2, we report average results for each algorithm. The Myopic method has the best performance, which is caused by the specific structure of the underlying transition matrices: since there is little difference between the Markovian models, the Myopic approach is indeed close to optimal in this case. However, given our notion of fairness, the Myopic technique is not fair. Furthermore, the Myopic policy can fail in some circumstances and perform even worse than the Random policy Mate et al. (2020). Meanwhile, our SoftFair performs well while adhering to the specified fairness requirement.

Synthetic dataset: (a) We first test the performance when the number of patients (arms) varies. As shown in Figure 2a, in addition to satisfying the fairness constraint, our SoftFair consistently outperforms the Random, Myopic and FairMyopic baselines. (b) We next compare the intervention benefit when the number of arms is fixed and the resource constraint K is varied. Figure 2b shows a gradual increase in the intervention benefit as K increases. One possible reason is that a larger resource budget makes the arms with higher cumulative rewards more likely to be selected, thereby reducing the performance gap with the Oracle method. (c) The performance of our method is slightly influenced by the time horizon T. As shown in Figure 2c, the common trend is that a smaller T leads to better performance. This means that our method can efficiently solve the RMAB over a finite time horizon, while a larger horizon makes convergence slower. Overall, all results demonstrate the effectiveness of our method compared to the other baselines.

Intervention benefit as λ changes: We also study the effect of the multiplier parameter λ on performance. Formally, a larger λ widens the gap between the probabilities of choosing the arms, resulting in better performance as it prefers arms with higher cumulative reward. Figure 3 shows that SoftFair performs well empirically as λ increases, and deterministically choosing the top K arms according to the computed values achieves the optimal result.

Action entropy comparison: We also compare the action entropy of a single process on the synthetic dataset over a range of λ. As shown in Figure 3, the Random policy has the highest value, as it selects all arms uniformly. Our proposed method SoftFair consistently has higher action entropy than the Oracle method, because we enforce the fairness constraint. FairMyopic has a high action entropy value, but it is unfair under our proposed fairness constraint, as it relies only on immediate rewards.

6 Conclusion

In this paper, we study fairness constraints in the context of the Restless Multi-Armed Bandit model, which is of critical importance for adherence problems in public health (e.g., monitoring adherence to preventive medicine for tuberculosis, or monitoring engagement of mothers with calls on good practices during pregnancy). To tackle the challenges introduced by this objective, we design a computationally efficient algorithm, SoftFair, by integrating the softmax value iteration technique into the RMAB setting. Our algorithm effectively approximates the optimal value function within the proven performance bounds.


  • N. Akbarzadeh and A. Mahajan (2019) Restless bandits with controlled restarts: indexability and computation of whittle index. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 7294–7300. Cited by: §1.
  • K. Asadi and M. L. Littman (2017) An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pp. 243–252. Cited by: §3.
  • B. Bhattacharya (2018) Restless bandits visiting villages: a preliminary study on distributing public health services. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies, pp. 1–8. Cited by: §1.
  • J. N. Brownstein, F. M. Chowdhury, S. L. Norris, T. Horsley, L. Jack Jr, X. Zhang, and D. Satterfield (2007) Effectiveness of community health workers in the care of people with hypertension. American journal of preventive medicine 32 (5), pp. 435–447. Cited by: §1.
  • A. H. Chang, A. Polesky, and G. Bhatia (2013) House calls by community health workers and public health nurses to improve adherence to isoniazid monotherapy for latent tuberculosis infection: a retrospective study. BMC public health 13 (1), pp. 1–7. Cited by: §1.
  • C. Herlihy, A. Prins, A. Srinivasan, and J. Dickerson (2021) Planning to fairly allocate: probabilistic fairness in the restless bandit setting. arXiv preprint arXiv:2106.07677. Cited by: §5.
  • W. Hu and P. Frazier (2017) An asymptotically optimal index policy for finite-horizon restless bandits. arXiv preprint arXiv:1707.00205. Cited by: §4.1.
  • S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth (2017) Fairness in reinforcement learning. In International conference on machine learning, pp. 1617–1626. Cited by: §1, §2.
  • M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth (2016) Fairness in learning: classic and contextual bandits. Advances in neural information processing systems 29. Cited by: §1.
  • Y. Kang, V. V. Prabhu, A. M. Sawyer, and P. M. Griffin (2013) Markov models for treatment adherence in obstructive sleep apnea. Age 49, pp. 11–6. Cited by: Appendix C, §5, §5.
  • M. N. Katehakis and A. F. Veinott Jr (1987) The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research 12 (2), pp. 262–268. Cited by: §1.
  • C. J. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, and D. King (2019) Key challenges for delivering clinical impact with artificial intelligence. BMC medicine 17 (1), pp. 1–9. Cited by: §3.
  • T. Kozuno, E. Uchibe, and K. Doya (2019) Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2995–3003. Cited by: footnote 2.
  • E. Lee, M. S. Lavieri, and M. Volk (2019) Optimal screening for hepatocellular carcinoma: a restless bandit model. Manufacturing & Service Operations Management 21 (1), pp. 198–212. Cited by: §1.
  • D. Li and P. Varakantham (2022) Efficient resource allocation with fairness constraints in restless multi-armed bandits. In The 38th Conference on Uncertainty in Artificial Intelligence, Cited by: §1, §1.
  • F. Li, J. Liu, and B. Ji (2019) Combinatorial sleeping bandits with fairness constraints. IEEE Transactions on Network Science and Engineering 7 (3), pp. 1799–1813. Cited by: §1.
  • B. Löwe, J. Unützer, C. M. Callahan, A. J. Perkins, and K. Kroenke (2004) Monitoring depression treatment outcomes with the patient health questionnaire-9. Medical care, pp. 1194–1201. Cited by: §1.
  • A. Mate, A. Biswas, C. Siebenbrunner, and M. Tambe (2021) Efficient algorithms for finite horizon and streaming restless multi-armed bandit problems. arXiv preprint arXiv:2103.04730. Cited by: §1, §4.1, 4th item.
  • A. Mate, J. A. Killian, H. Xu, A. Perrault, and M. Tambe (2020) Collapsing bandits and their application to public health interventions. arXiv preprint arXiv:2007.04432. Cited by: Appendix C, §1, §4.1, §5.
  • C. Mundorf, A. Shankar, T. Moran, S. Heller, A. Hassan, E. Harville, and M. Lichtveld (2018) Reducing the risk of postpartum depression in a low-income community through a community health worker intervention. Maternal and child health journal 22 (4), pp. 520–528. Cited by: §1.
  • P. M. Newman, M. F. Franke, J. Arrieta, H. Carrasco, P. Elliott, H. Flores, A. Friedman, S. Graham, L. Martinez, L. Palazuelos, et al. (2018) Community health workers improve disease control and medication adherence among patients with diabetes and/or hypertension in Chiapas, Mexico: an observational stepped-wedge study. BMJ global health 3 (1), pp. e000566. Cited by: §1.
  • J. R. Ong’ang’o, C. Mwachari, H. Kipruto, and S. Karanja (2014) The effects on tuberculosis treatment adherence from utilising community health workers: a comparison of selected rural and urban settings in Kenya. PLoS One 9 (2), pp. e88937. Cited by: §1.
  • V. Patil, G. Ghalme, V. Nair, and Y. Narahari (2020) Achieving fairness in the stochastic multi-armed bandit problem. Proceedings of the AAAI Conference on Artificial Intelligence 34 (04), pp. 5379–5386. Cited by: §1.
  • A. Rajkomar, M. Hardt, M. D. Howell, G. Corrado, and M. H. Chin (2018) Ensuring fairness in machine learning to advance health equity. Annals of internal medicine 169 (12), pp. 866–872. Cited by: §3.
  • Z. Song, R. Parr, and L. Carin (2019) Revisiting the softmax bellman operator: new benefits and new perspective. In International conference on machine learning, pp. 5916–5925. Cited by: Appendix B, Appendix B, §4.2, footnote 2.
  • T. E. Weaver and R. R. Grunstein (2008) Adherence to continuous positive airway pressure therapy: the challenge to effective treatment. Proceedings of the American Thoracic Society 5 (2), pp. 173–178. Cited by: Appendix C.
  • R. R. Weber and G. Weiss (1990) On an index policy for restless bandits. Journal of applied probability, pp. 637–648. Cited by: §4.1.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? We incorporate a softmax-based value iteration method in the RMAB setting to design selection algorithms that satisfy the proposed fairness constraint. The Method, Analysis, and Experiment sections reflect these claims.

    2. Did you describe the limitations of your work? There is a trade-off between optimal performance and satisfying the fairness constraint.

    3. Did you discuss any potential negative societal impacts of your work? We aim to ensure fairness in the resource allocation problem.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? Please see the Analysis Section and Appendix.

    2. Did you include complete proofs of all theoretical results? Please see the Analysis Section and Appendix.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We include the code, datasets, and a readme file for reproducibility.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Please see the Experiment Section.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Please see Table 2.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Our method does not require high computing resources.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? We cite the papers that provide the datasets and the baseline code used in our experiments.

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Appendix

a.1 More Details about SoftFair

Similarly, we can rewrite the update equation for the state-action value function; we provide it below.
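To make the softmax-style value update concrete, the sketch below runs value iteration for a single arm where the usual max over actions in the Bellman backup is replaced by a Boltzmann-softmax average. This is one common softmax operator, not necessarily the paper's exact update; the function names, the single-arm setup, and the transition/reward structure are all illustrative assumptions.

```python
import math

def boltzmann_backup(q_values, beta):
    """Expected Q-value under the softmax (Boltzmann) distribution over
    actions with inverse temperature beta; numerically stabilized."""
    m = max(q_values)
    w = [math.exp(beta * (q - m)) for q in q_values]
    z = sum(w)
    return sum(wi / z * q for wi, q in zip(w, q_values))

def softmax_value_iteration(P, R, beta, gamma=0.95, iters=200):
    """Value iteration for a single hypothetical arm, with the max in
    the Bellman backup replaced by a Boltzmann-softmax over actions.
    P[a][s][s2] are transition probabilities, R[s] are state rewards."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [R[s] + gamma * boltzmann_backup(
                 [sum(P[a][s][s2] * V[s2] for s2 in range(n))
                  for a in range(len(P))], beta)
             for s in range(n)]
    return V
```

As beta grows, the Boltzmann backup approaches the standard max-based backup, recovering ordinary value iteration.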


The probability of choosing an arm is given by the softmax function on . We can write down the probability of , where is the set of selected arms when . More specifically, we have , and note that denotes the probability that arm is in the set of selected arms when .

When , let denote the set of selected arms, and let denote the action that selects the arms in while keeping all other arms passive. We have , where can be obtained through a brute-force iteration over the permutations of .
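The brute-force permutation computation described above can be sketched as follows: under sequential softmax sampling without replacement, the probability that an arm lands in the selected set is the sum, over all ordered size-k selections containing it, of that sequence's probability. The function names and per-arm values here are illustrative, not the paper's notation.

```python
import itertools
import math

def softmax(values, beta=1.0):
    """Softmax distribution over per-arm values, inverse temperature beta."""
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def selection_probabilities(values, k, beta=1.0):
    """P(arm i is among the k selected arms) under sequential softmax
    sampling without replacement, via brute force over all ordered
    k-permutations of arms (exponential in k; illustrative only)."""
    n = len(values)
    prob_in = [0.0] * n
    for perm in itertools.permutations(range(n), k):
        p, remaining = 1.0, list(range(n))
        for arm in perm:
            # probability of drawing `arm` next among the remaining arms
            probs = softmax([values[i] for i in remaining], beta)
            p *= probs[remaining.index(arm)]
            remaining.remove(arm)
        for arm in perm:
            prob_in[arm] += p
    return prob_in
```

With k = 1 this reduces to the plain softmax distribution, and the inclusion probabilities always sum to k.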

Definition 2

(Fairness) Equivalently, a stochastic policy is fair if, for any time step , any joint state , and any two arms with , the following two statements are equivalent:


Here and denote any selected sets that include arm . A proof that these two statements are equivalent is provided in the next section (see Section B).

In summary, the goal of a solution approach is to generate a stochastic policy that never prefers one arm over another if the long-term cumulative reward of selecting the latter is higher.
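This fairness requirement can be phrased as a simple check on a policy's selection probabilities: no arm with a strictly lower long-term value may receive a strictly higher selection probability. A minimal sketch, where the value list, probability list, and function names are illustrative assumptions:

```python
import math

def softmax(values, beta=1.0):
    """Softmax distribution over per-arm values, inverse temperature beta."""
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def is_soft_fair(values, probs, tol=1e-12):
    """Soft fairness check: return False if some arm with a strictly
    higher long-term value is selected with strictly lower probability
    than an arm with a lower value."""
    n = len(values)
    for i in range(n):
        for j in range(n):
            if values[i] > values[j] and probs[i] + tol < probs[j]:
                return False
    return True
```

Because softmax is monotone in its inputs, a softmax-based selection distribution passes this check by construction, whereas a distribution that favors a lower-valued arm fails it.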

Appendix B Proofs

Lemma 3

Consider a single arm with a finite horizon , and let denote the value function starting from time step in state . Then for .

Proof of Lemma 3

Proof.  We drop the subscript , i.e., . For any state , we can always find an algorithm that ensures . For example, suppose the optimal algorithm for state starting from time step is . We can then construct an algorithm that keeps the same actions as until the last time step ( has one more time slot than ), and picks the action for the last time step according to the observed state . Since the reward is either or , we have , so
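The monotonicity argument above can be checked numerically: with nonnegative rewards, having one more remaining time step can only increase the optimal finite-horizon value. The sketch below runs backward induction on a hypothetical two-state, two-action single arm; the transition matrices and rewards are made-up examples, not the paper's model.

```python
def value_functions(T, P, r, actions=(0, 1)):
    """Backward induction for a single arm over horizon T.
    P[a][s][s2] are transition probabilities, r[s] are state rewards.
    Returns V where V[t][s] is the optimal value with T - t steps left."""
    n = len(r)
    V = [[0.0] * n for _ in range(T + 1)]  # terminal values V[T] = 0
    for t in range(T - 1, -1, -1):
        for s in range(n):
            V[t][s] = r[s] + max(
                sum(P[a][s][s2] * V[t + 1][s2] for s2 in range(n))
                for a in actions)
    return V
```

For every state and time step, V[t][s] >= V[t+1][s], matching Lemma 3: the policy with the longer remaining horizon can mimic the shorter one and collect at least as much reward.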


Proof of Theorem 1

Proof.  Consider the discounted reward criterion with discount factor (where corresponds to the average criterion). Again, we drop the subscript and let . Because the state is fully observable, we can easily calculate , which needs to satisfy , i.e., , thus . Similarly, can be solved by assuming equation :


Because and , from the structural constraint mentioned before and according to Lemma 3, we have . Now we show . Equivalently, this can be expressed as . Because the state is fully observable, we first derive the closed form of .

  • Case 1: The state ,


    Intuitively, we have (see Lemma 3) and , so we obtain and . Hence, we get

  • Case 2: For state , similarly, we can get


Thus .

Proof of Theorem 2

Proof.  According to Equation 4, we have


Substituting this into Equation 5, we obtain


As approaches infinity, our algorithm deterministically selects the arms with the highest values of . Let be the set of actions containing the arms with the highest-ranking values, and let contain any arms that are not among the top k. Let and denote the sets that include all arms except those in and , respectively. Thus the first action vector can be represented as , and the latter action vector is . We have: