1 Introduction
Timely preventive interventions (e.g., to check adherence or to provide medicines) can significantly alleviate many public health issues such as diabetes [23], hypertension [5], tuberculosis [6, 25], HIV [7, 15], and depression [19, 22]. In this paper, we focus on maternal health (the wellbeing of women during pregnancy, childbirth, and the postnatal period), where preventive interventions can impact millions of women. A key challenge in such preventive intervention problems is the limited amount of resources for intervention (e.g., availability of healthcare workers). Furthermore, human behavior (w.r.t. taking medicines or adhering to a protocol) changes over time and in response to interventions, thereby requiring a strategic assignment of the limited resources to the beneficiaries most in need.
We are specifically motivated by improving maternal health among low-income communities in developing countries, where maternal deaths remain unacceptably high due to a lack of timely preventive care information [26]. We work with a nonprofit organization, ARMMAN, that provides a free call-based program for around million pregnant women. This is similar to other programs, such as Momconnect (https://tinyurl.com/momconnectjnj). Each enrolled woman receives automated calls that equip her with critical life-saving healthcare information across weeks (pregnancy and up to months after childbirth). Unfortunately, the engagement behavior (overall time spent listening to automated calls) changes and, for most women, the overall engagement decreases. This can have serious implications for their health. We ask the question: how do we design a mechanism that systematically chooses whom to provide interventions to (e.g., an in-person visit by a healthcare worker) in order to maximize the overall engagement of the beneficiaries?
Preventive intervention problems of interest in this paper are challenging for several key reasons: i) the number of interventions is budgeted and much smaller than the total number of beneficiaries; ii) a beneficiary’s engagement is uncertain and may change after a few months; iii) the post-intervention improvement of a beneficiary’s engagement is uncertain; iv) intervention decisions at a time step affect the state of the beneficiaries and the decisions to be taken at the next step.
A relevant model for this setting is restless multi-armed bandits (RMABs). RMABs with prior knowledge of the uncertainty model have been studied for health interventions [18, 20, 21, 16, 2], sensor monitoring tasks [13, 10], anti-poaching patrols [24], and uplift modeling in eCommerce platforms [11]. Due to the unpredictability of human beneficiaries, it is unrealistic to know the uncertainty model a priori. Our key contribution is in designing an intervention scheme under a limited budget, without relying on a priori knowledge of the uncertainty model.
Contributions: First, we represent the preventive intervention problem as a Restless Multi-Armed Bandit (RMAB), where the uncertainty model associated with beneficiary behaviors with and without intervention is not known a priori, thereby precluding the direct application of Whittle Index based methods [29]. Second, we develop a model-free learning method based on Q-learning, referred to as Whittle Index based Q-Learning (WIQL), that executes actions based on the difference between the Q-values of active (intervention) and passive actions. We show that WIQL converges to the optimal solution asymptotically. Finally, to show that WIQL is a general approach and applicable to multiple domains, we evaluate WIQL on various examples. We then simulate the intervention problem in the context of maternal healthcare. We demonstrate that the model assumptions we make capture the problem of concern and that the intervention scheme employed by WIQL significantly improves the engagement among beneficiaries compared to the existing benchmarks.
2 Related Work
The restless multi-armed bandit (RMAB) problem was introduced by Whittle [29]. The main result involves formulating a relaxation of the problem and solving it optimally using a heuristic called the Whittle Index policy. This policy is optimal when the underlying Markov Decision Processes satisfy indexability, which is computationally intensive to verify. Moreover, Papadimitriou and Tsitsiklis (1994) established that solving RMABs is PSPACE-hard.
There are three main threads of relevant research. The first category focuses on specific classes of RMABs. Akbarzadeh et al. (2019) provide a class of bandits with “controlled restarts” and state-independent policies, which possess the indexability property and a Whittle index policy. Mate et al. [20] consider two-state bandits to model a health intervention problem, where taking an intervention collapses the uncertainty about the current state. Bhattacharya [2] models the problem of maximizing the coverage and spread of health information as an RMAB problem and proposes a hierarchical policy. Lee et al. [18] study the problem of screening patients to maximize early-stage cancer detection under limited resources, by formulating it as a subclass of RMAB. Similarly, Glazebrook et al. [10], Hsu (2018), Sombabu et al. (2020), and Liu and Zhao (2010) give Whittle indexability results for different subclasses of (hidden) Markov bandits. This category of research assumes that the transition and observation models are known beforehand. Instead, our focus is on providing learning methods when the transition model is unknown a priori.
The second category contains different learning methods for RMABs. Fu et al. [9] provide a Q-learning method where the Q value is defined based on the Whittle indices, states, and actions. However, they do not provide a proof of convergence to the optimal solution, and experimentally their method does not learn (near-)optimal policies. Along similar lines, Avrachenkov and Borkar [1] provide a fundamental change to the Q-value definition for computing the optimal Whittle index policy. However, their convergence proof requires all arms to be homogeneous, with the same underlying MDP. We provide a learning method that is not only shown to converge theoretically but also empirically outperforms the above-mentioned methods on benchmark instances and on a real problem setting with heterogeneous arms.
The third relevant line of work is to predict adherence to a health program and the effects of interventions, by formulating these as supervised learning problems. Killian et al. (2019) use the data from 99DOTS [8] and train a deep learning model to target patients at high risk of not adhering to the health program. Along similar lines, Nishtala et al. (2020) use the engagement data of a health program and train a deep learning model to predict patients who are at high risk of dropping out of the program. Son et al. (2010) use a Support Vector Machine to predict adherence to medication among heart failure patients. There are many other papers [12, 17] that train models on historical data for predicting adherence and provide interventions to patients who have a low adherence probability. These works assume the training data to be available beforehand. In contrast, we consider the online nature of the problem, where feedback is received after an intervention is provided, which in turn is used in making future decisions.
3 Preliminaries
RMAB models various stochastic scheduling problems, where an instance is a 3-tuple $(N, M, \{MDP_i\}_{i \in N})$ with $N$ denoting the set of arms, $M$ the budget restriction denoting how many arms can be pulled at a given time, and $MDP_i$ the Markov Decision Process for each arm $i$. An MDP consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, transition probabilities $\mathcal{P}$, and a reward function $R$. The action set $\mathcal{A}$ of each MDP consists of two actions: an active action ($1$) and a passive action ($0$). At each time step $t$, an action $A_i(t) \in \{0, 1\}$ is taken on each arm $i$, such that $\sum_{i \in N} A_i(t) = M$. Then, each arm $i$ transitions to a new state and observes a reward, according to the underlying $MDP_i$. Let $X_i(t)$ and $R_i(t)$ denote the current state and the reward obtained at time $t$, respectively. Now, a policy $\pi$ can be defined as a mapping from the current states of all beneficiaries to the actions to be taken on each arm. Thus, given a policy $\pi$, the action on an arm $i$ is denoted as:
$A_i^{\pi}(t) = 1$ if arm $i$ is selected by policy $\pi$ at time $t$, and $A_i^{\pi}(t) = 0$ otherwise.
In an RMAB, the goal is to find the policy $\pi^*$ that maximizes the total expected average reward until $T$ time steps, subject to the budget.
(1) $\max_{\pi} \; \frac{1}{T}\, \mathbb{E}\Big[\sum_{i \in N} \sum_{t=1}^{T} R_i(t)\Big]$
s.t. $\sum_{i \in N} A_i^{\pi}(t) = M$ for all $t \in \{1, \ldots, T\}$
This problem is PSPACE-hard, as shown in the notable paper by Papadimitriou and Tsitsiklis (1994). To deal with the computational hardness, an index-based heuristic policy based on the Lagrangian relaxation of the RMAB problem (1) was proposed by Whittle [29]: at each time step $t$, an index is computed for each arm depending on the current state of the arm, the transition probabilities, and the reward function of its MDP. Then, the top $M$ arms with the highest index values are selected.
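The selection step of an index policy can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_top_m`, the array layout `index_table[arm, state]`, and the tie-breaking rule (lower arm id first) are assumptions made for the example.

```python
import numpy as np

def select_top_m(index_table, states, m):
    """Return a 0/1 action vector that activates the m arms with the largest index.

    index_table[i, z] is the (assumed known) index of arm i in state z,
    and states[i] is the current state of arm i.
    """
    states = np.asarray(states)
    arms = np.arange(len(states))
    current_index = index_table[arms, states]
    # Stable sort on the negated values: highest index first, ties by arm id.
    order = np.argsort(-current_index, kind="stable")
    actions = np.zeros(len(states), dtype=int)
    actions[order[:m]] = 1
    return actions
```

By construction, exactly `m` entries of the returned vector are 1, so the per-step budget constraint holds.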
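Whittle Index-based policy: Whittle’s relaxation is to replace the budget constraint of $M$ on the number of arms with a time-averaged constraint, i.e.,
$\frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} \sum_{i \in N} A_i^{\pi}(t)\Big] = M$
Further, using Lagrangian relaxation (with $\lambda$ as the Lagrange multiplier) and dropping other constants, the objective function can be rewritten as:
(2) $\max_{\pi} \; \frac{1}{T}\, \mathbb{E}\Big[\sum_{i \in N} \sum_{t=1}^{T} \big(R_i(t) + \lambda\,(1 - A_i^{\pi}(t))\big)\Big]$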
Whittle showed that this problem can be decoupled and solved for each arm by computing the index $\lambda_i(Z)$, which acts like a subsidy that needs to be given to an arm $i$ at state $Z$ so that taking the action $0$ is as beneficial as taking the action $1$. Assuming indexability, choosing arms with higher subsidy leads to the optimal solution for Equation 2.
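As a worked step, the relaxed objective can be obtained as follows. The notation is an assumption made for this sketch: $R_i(t)$ is the reward of arm $i$ at time $t$, $A_i^{\pi}(t) \in \{0, 1\}$ is its action under policy $\pi$, $N$ is the set of arms, $M$ is the budget, and $\lambda$ is the multiplier attached to the time-averaged budget constraint.

```latex
\begin{align*}
L(\pi, \lambda)
 &= \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i \in N} R_i(t)\Big]
    + \lambda\Big(M - \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i \in N} A_i^{\pi}(t)\Big]\Big) \\
 &= \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i \in N}
      \big(R_i(t) + \lambda\,(1 - A_i^{\pi}(t))\big)\Big]
    + \lambda\,(M - |N|)
\end{align*}
```

The last term does not depend on $\pi$, so it can be dropped. The remaining sum separates over arms, which is why each arm can be treated on its own, with $\lambda$ acting as a per-step subsidy for the passive action.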
Definition 1 (Indexability)
Let $\Phi(\lambda)$ be the set of states for which it is optimal to take action $0$ when taking an active action costs $\lambda$. An arm is indexable if $\Phi(\lambda)$ monotonically increases from $\emptyset$ to $\mathcal{S}$ when $\lambda$ increases from $-\infty$ to $+\infty$. An RMAB problem is indexable if all the arms are indexable.
4 The Model
We formulate the problem of selecting $M$ out of $N$ state-transitioning arms at each time step as an RMAB, with beneficiaries being the arms in preventive healthcare intervention scenarios. We represent the engagement pattern as an MDP (Figure 1) with three (abstract) “behavioral” states.
Self Motivated ($S$): In this state, the beneficiary shows high engagement and there is no need to intervene.
Persuadable ($P$): In this state, the beneficiary engages less frequently, with a possibility of increasing engagement when intervened upon, which makes this the most interesting state.
Lost Cause ($L$): The engagement is very low in this state and very likely to remain low irrespective of intervention.
These three states capture different levels of engagement as well as differences in the benefit obtained from interventions, which cannot be represented using other existing intervention models, such as the model described by Mate et al. [20]. Note that the more states there are, the slower the convergence of any online algorithm. Thus, for short-term intervention programs (a year or two), we recommend a three-state model. In Section 7.2, we provide a mechanism to obtain the states from real data. Let $A_i(t) \in \{0, 1\}$ be the action taken on beneficiary $i$ at time $t$; $1$ denotes an intervention and $0$ otherwise. Also, $\sum_{i} A_i(t) = M$ for each time slot $t$. Depending on the action taken, each beneficiary changes state according to the transition probabilities.
Though our model is inspired by a preventive intervention in healthcare, this formulation captures the intervention problem in many other domains, such as sensor monitoring, anti-poaching patrols, and uplift modeling in eCommerce. Depending on the domain, a persuadable state is where the benefit of an intervention is maximum. When a beneficiary is in the persuadable state and an intervention is provided, the arm is more likely to transition to the self-motivated state whereas, when no intervention is provided, it is more likely to transition to the lost-cause state. These transitions are denoted as and for action and , respectively. Every transition generates a state-dependent reward. Since better engagements have higher rewards, we assume $R_S > R_P > R_L$ for the self-motivated, persuadable, and lost-cause states, respectively. The transition probabilities may be different for each beneficiary and thus, the effect of intervention would vary from one beneficiary to another. If the transition probabilities were known beforehand, one could compute the Whittle index policy and select accordingly. However, here, the transition probabilities are unknown, and hence we consider the problem of learning the Whittle index while simultaneously selecting a set of best arms depending on the estimated Whittle index.
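To make the model concrete, a single beneficiary can be simulated as below. All numbers are hypothetical placeholders chosen only to respect the qualitative assumptions in the text (an intervention in the persuadable state favors the self-motivated state, and no intervention favors the lost-cause state); they are not taken from the program data. The names `P`, `REWARD`, and `step`, the state encoding, and the choice to pay the reward of the state reached are assumptions of this sketch.

```python
import numpy as np

# Hypothetical state encoding: 0 = self-motivated, 1 = persuadable, 2 = lost cause.
# P[a, z, z2] = probability of moving from state z to z2 under action a
# (a = 0 passive, a = 1 active). Values are illustrative placeholders only.
P = np.array([
    [[0.8, 0.2, 0.0],   # passive, from self-motivated
     [0.1, 0.3, 0.6],   # passive, from persuadable: tends to drift to lost cause
     [0.0, 0.1, 0.9]],  # passive, from lost cause
    [[0.9, 0.1, 0.0],   # active, from self-motivated
     [0.7, 0.2, 0.1],   # active, from persuadable: tends to recover
     [0.0, 0.1, 0.9]],  # active, from lost cause: little effect
])

# State-dependent rewards, higher for better engagement (placeholder values).
REWARD = np.array([2.0, 1.0, 0.0])

def step(state, action, rng):
    """Sample the next state of one arm and return it with that state's reward."""
    next_state = int(rng.choice(3, p=P[action, state]))
    return next_state, float(REWARD[next_state])
```

A heterogeneous population can be obtained by giving each arm its own copy of `P` with different persuadable-state rows.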
5 Whittle Index based Q-learning (WIQL)
Q-Learning [28] is a well-studied reinforcement learning algorithm for estimating $Q^*(Z, a)$ for each state-action pair $(Z, a)$ of an MDP:
$Q^*(Z, a) := R(Z) + \sum_{Z' \in \mathcal{S}} \mathcal{P}(Z, a, Z')\, V^*(Z')$
where the optimal expected value of a state is given by
$V^*(Z) := \max_{a \in \mathcal{A}} Q^*(Z, a)$
Q-Learning estimates $Q^*$ using point samples: at each time $t$, an agent (policy maker) takes an action $a$ using the estimated $Q$ values at the current state $Z$, a reward $R(Z)$ is observed, a new state $Z'$ is reached, and the $Q$ values are updated according to the following update rule:
(3) $Q^{t+1}(Z, a) \leftarrow (1 - \alpha)\, Q^{t}(Z, a) + \alpha \big(R(Z) + \max_{a' \in \{0, 1\}} Q^{t}(Z', a')\big)$
Here, $\alpha \in [0, 1]$ is the learning parameter. When $\alpha = 0$, the agent does not learn anything new and retains the value obtained at step $t$, while $\alpha = 1$ stores only the most recent information and overwrites all previously obtained rewards. Setting $0 < \alpha < 1$ strikes a balance between the new values and the old ones. With mild assumptions on $\alpha$, the convergence of Q-Learning to the optimal $Q^*$ values has been established [27, 14, 4]. We build on these results to provide a Q-Learning approach for learning the Whittle index policy.
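A tabular form of this kind of update can be sketched as follows. The function name `q_update` and the `discount` argument are assumptions of this example; with `discount=1.0` the target is simply the observed reward plus the best estimated value of the next state.

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha, discount=1.0):
    """Return a copy of the table Q[state, action] after one Q-learning update.

    alpha = 0 keeps the old estimate; alpha = 1 overwrites it with the new target.
    """
    Q = Q.copy()
    target = reward + discount * Q[next_state].max()
    Q[state, action] = (1.0 - alpha) * Q[state, action] + alpha * target
    return Q
```

Returning a copy keeps the example side-effect free; an online learner would normally update the table in place.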
We adopt Q-Learning for RMABs and store the Q values separately for each arm, since each arm has its own MDP. Q values of state-action pairs are typically used for selecting the best action at each state; however, for RMABs, the problem is to select a set of arms. Our action-selection method ensures that an arm whose estimated benefit is higher is more likely to get selected. Algorithm 1 describes the action selection and the update rule. This algorithm is not specific to any particular state representation and can be used for any finite-state RMAB instance. However, when the number of states is large (10 states), the convergence is typically slow and hence not suitable for short-horizon problems.
We propose Whittle Index based Q-learning (WIQL), which uses an $\epsilon$-decay policy to select $M$ arms at each time step $t$. During early time steps, arms are likely to be selected uniformly at random. However, as time proceeds, more priority is given to the arms with a higher value of their estimated $\lambda_i(X_i(t))$. The selected set of arms (who receive interventions) is called . Each arm is restless, i.e., each arm transitions to a new state and observes a reward, with or without interventions. These observations are then used for updating the Q values in Step (Equation 3). While updating $Q_i(Z, a)$, we use a learning parameter $\alpha(c)$ that decreases with increase in $c$ (the number of times the action $a$ was taken on the arm $i$ when it was at state $Z$); e.g., $\alpha(c) = 1/(c+1)$ satisfies this criterion. These Q values are then used to estimate the Whittle index $\lambda_i(Z)$ as $Q_i(Z, 1) - Q_i(Z, 0)$.
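One round of the procedure described above can be sketched as follows. This is an illustrative reading, not a transcription of Algorithm 1: the exploration schedule `n / (n + t)`, the step size `1 / (count + 1)`, the array shape `Q[arm, state, action]`, and the names `wiql_select` and `wiql_update` are all assumptions made for the example.

```python
import numpy as np

def wiql_select(Q, states, m, t, rng):
    """Pick m arms: uniformly at random with a decaying probability,
    otherwise the arms with the largest Q(active) - Q(passive) gap."""
    states = np.asarray(states)
    n = len(states)
    eps = n / (n + t)  # assumed decay schedule; shrinks as t grows
    if rng.random() < eps:
        chosen = rng.choice(n, size=m, replace=False)
    else:
        arms = np.arange(n)
        benefit = Q[arms, states, 1] - Q[arms, states, 0]
        chosen = np.argsort(-benefit, kind="stable")[:m]
    actions = np.zeros(n, dtype=int)
    actions[chosen] = 1
    return actions

def wiql_update(Q, counts, states, actions, rewards, next_states):
    """Update every arm in place, active or passive, since all arms are restless."""
    for i, (z, a, r, z2) in enumerate(zip(states, actions, rewards, next_states)):
        counts[i, z, a] += 1
        alpha = 1.0 / (counts[i, z, a] + 1)  # assumed decreasing step size
        Q[i, z, a] = (1.0 - alpha) * Q[i, z, a] + alpha * (r + Q[i, z2].max())
```

Keeping a separate table per arm is what allows arms in the same state to be ranked differently when their dynamics differ.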
6 Theoretical Results
In this section, we show that WIQL does not alter the optimality guarantees of Q-learning. First, we show that taking intervention decisions based on the benefit (i.e., the difference in Q values of active and passive actions) is equivalent to optimizing the joint Q over all arms subject to the budget on intervention actions.
Theorem 1
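Taking action $1$ on the top $M$ arms according to $Q_i(s_i, 1) - Q_i(s_i, 0)$ is equivalent to maximizing $\sum_{i \in N} Q_i(s_i, a_i)$ over all possible action profiles $(a_1, \ldots, a_{|N|}) \in \{0, 1\}^{|N|}$ such that $\sum_{i \in N} a_i = M$, where $s_i$ denotes the current state of arm $i$.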
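Proof Sketch. For ease of explanation, we prove this for $M = 1$. Let $i^*$ be the arm that maximizes the benefit of taking an intervention action ($1$) at its current state $s_{i^*}$. Then, for any arm $j \neq i^*$,
(4) $Q_{i^*}(s_{i^*}, 1) - Q_{i^*}(s_{i^*}, 0) \geq Q_j(s_j, 1) - Q_j(s_j, 0)$, i.e., $Q_{i^*}(s_{i^*}, 1) + Q_j(s_j, 0) \geq Q_j(s_j, 1) + Q_{i^*}(s_{i^*}, 0)$
Adding $\sum_{i \neq i^*,\, i \neq j} Q_i(s_i, 0)$ on both sides,
(5) $Q_{i^*}(s_{i^*}, 1) + \sum_{i \neq i^*} Q_i(s_i, 0) \geq Q_j(s_j, 1) + \sum_{i \neq j} Q_i(s_i, 0)$
Equation (5) shows that taking the intervention action on $i^*$ and passive actions on all other arms would maximize $\sum_{i} Q_i(s_i, a_i)$ when $M = 1$. This argument holds true for any $M$ (complete proof in the Appendix).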
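The claim can also be checked numerically on small instances by comparing the top-$M$ rule against an exhaustive search over all budget-feasible action profiles. The function names below and the use of random Q values are assumptions made for this check; it illustrates the statement and does not replace the proof.

```python
import itertools
import numpy as np

def profile_value(q_passive, q_active, actions):
    """Sum of per-arm Q values under a 0/1 action profile."""
    return float(np.where(actions == 1, q_active, q_passive).sum())

def top_m_by_benefit(q_passive, q_active, m):
    """Activate the m arms with the largest Q(active) - Q(passive)."""
    benefit = q_active - q_passive
    actions = np.zeros(len(benefit), dtype=int)
    actions[np.argsort(-benefit, kind="stable")[:m]] = 1
    return actions

def best_value_bruteforce(q_passive, q_active, m):
    """Largest joint value over every profile with exactly m active arms."""
    n = len(q_passive)
    best = -np.inf
    for chosen in itertools.combinations(range(n), m):
        actions = np.zeros(n, dtype=int)
        actions[list(chosen)] = 1
        best = max(best, profile_value(q_passive, q_active, actions))
    return best
```

The exhaustive search enumerates all profiles with exactly `m` active arms, so it is only practical for a handful of arms.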
Theorem 2
WIQL converges to the optimal with probability $1$ when $\sum_{c} \alpha(c) = \infty$ and $\sum_{c} \alpha(c)^2 < \infty$.
Proof Sketch. This proof follows from (1) the convergence guarantee of the Q-Learning algorithm [27], (2) the $\epsilon$-decay selection policy, and (3) Theorem 1. It has been established in [27] that the update rule of Q-Learning converges to $Q^*$ whenever $\sum_{c} \alpha(c) = \infty$ and $\sum_{c} \alpha(c)^2 < \infty$. These assumptions require that all state-action pairs be visited infinitely often, which is guaranteed by the $\epsilon$-decay selection process, where each arm has a nonzero probability of being selected uniformly at random. Thus, $Q_i(Z, a)$ converges to $Q_i^*(Z, a)$, which implies that $Q_i(Z, 1) - Q_i(Z, 0)$ converges to $Q_i^*(Z, 1) - Q_i^*(Z, 0)$ (using series convergence operation). Finally, using Theorem 1, we claim that selecting $M$ arms based on the highest values of $Q_i(Z, 1) - Q_i(Z, 0)$ would lead to an optimal solution. This completes the proof that WIQL converges to the optimal.
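As a quick check of the two step-size conditions, consider a schedule of the form $\alpha(c) = 1/(c+1)$, where $c = 0, 1, 2, \ldots$ counts the visits to a given arm, state, and action (this particular schedule is used here as an illustrative assumption):

```latex
\sum_{c=0}^{\infty} \alpha(c) = \sum_{c=0}^{\infty} \frac{1}{c+1} = \infty
\qquad\text{and}\qquad
\sum_{c=0}^{\infty} \alpha(c)^2 = \sum_{c=0}^{\infty} \frac{1}{(c+1)^2} = \frac{\pi^2}{6} < \infty .
```

By contrast, a constant positive step size satisfies the first condition but violates the second, since its squares sum to infinity.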
7 Experimental Evaluation
We compare WIQL against five benchmarks: (1) OPT: assumes full knowledge of the underlying transition probabilities and has access to the optimal Whittle Indices, (2) AB [1], (3) Fu [9], (4) Greedy: greedily chooses the top $M$ arms with the highest difference in their observed average rewards between actions $1$ and $0$ at their current states, and (5) Random: chooses $M$ arms uniformly at random at each step.
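The Greedy benchmark can be sketched as follows. The class name `GreedyBaseline`, the array layout, and the convention that an unobserved arm, state, and action triple has average reward 0 are assumptions made for this example.

```python
import numpy as np

class GreedyBaseline:
    """Keeps a running average reward per (arm, state, action) and activates
    the arms whose active-minus-passive average is largest right now."""

    def __init__(self, n_arms, n_states):
        self.total = np.zeros((n_arms, n_states, 2))
        self.count = np.zeros((n_arms, n_states, 2))

    def observe(self, arm, state, action, reward):
        self.total[arm, state, action] += reward
        self.count[arm, state, action] += 1

    def select(self, states, m):
        states = np.asarray(states)
        # Average reward; entries never observed stay at 0 (assumed convention).
        avg = np.divide(self.total, self.count,
                        out=np.zeros_like(self.total), where=self.count > 0)
        arms = np.arange(len(states))
        gap = avg[arms, states, 1] - avg[arms, states, 0]
        actions = np.zeros(len(states), dtype=int)
        actions[np.argsort(-gap, kind="stable")[:m]] = 1
        return actions
```

Unlike a Q-value based rule, this baseline looks only at immediate average rewards and ignores where each action tends to lead next.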
We consider a numerical example and a maternal healthcare application to simulate RMAB instances using beneficiaries’ behavioral patterns from the call-based program. (We evaluate WIQL and the benchmark algorithms on two additional numerical examples, which are deferred to the Appendix.) For each experiment, we plot the total reward averaged over trials, to reduce the effect of randomness in the action-selection policy.
7.1 Numerical Example: Circulant Dynamics
This example has been studied in the existing literature on learning Whittle Indices [1, 9]. Each arm has four states two actions . The rewards are , , and for . The transition probabilities for each action are represented as a matrix:
The optimal Whittle Indices for each state are as follows: , , , and . Whittle Index policy would prefer taking action on arms who are currently at state , followed by those at state , then those at state , and lastly those at state .
Results: Figure 2 demonstrates that WIQL gradually increases towards the OPT policy, for two sets of experiments—(1) = and = and (2) = and =. We observe that AB converges towards an average reward of zero. This happens because it prioritizes arms who are currently at state over other states. Since, the expected reward of taking action at state is , the total average reward also tends to zero. The result obtained by the algorithm Fu is the same as what is shown in Figure of their paper [9] where the total average reward converges to a value of . As expected, Greedy and Random, one being too myopic and the other being too exploratory in nature, are unable to converge to the optimal value.
These observations show that WIQL outperforms the existing algorithms on the example that was considered in the earlier papers. Note that, while implementing the algorithms AB and Fu, we fixed the hyperparameters to the values specified for this example. However, for the real-world application that we consider next, it is not obvious how to obtain the best set of hyperparameters for their algorithms. Thus, we do not compare these algorithms for the maternal healthcare application. Next, we compare the performance of the WIQL algorithm with Greedy, Random, and a Myopic policy (defined in Section 7.2).
7.2 Real-world Application: Maternal Healthcare
We now focus on the maternal healthcare intervention problem, where only a small subset of beneficiaries can be selected for providing interventions every week. We use the data, obtained from the call-based preventive care program, that contains call records of enrolled beneficiaries: how long they listened to the call, whether an intervention was given, and when. The data also contain the ID of a healthcare worker dedicated to providing personalized intervention to each enrolled beneficiary. The data was collected for an experimental study towards building up a robust intervention program (which is the focus of this work). During the experimental study, only one intervention (a personalized visit by a health worker) was provided to a selected set of beneficiaries ( chosen from among beneficiaries) who were more likely to drop out of the program. We call this the Myopic intervention scheme and use it as a benchmark to compare our approach.
We use the three-state MDP model (Figure 1) to simulate the behavior of the beneficiaries. During a particular week, the beneficiaries listening to of the content of the automated calls are in state $S$ (self-motivated), those listening to are in state $P$ (persuadable), and those listening to are in state $L$ (lost cause). A reward of is obtained when a beneficiary is in state $S$, a reward of is obtained in state $P$, and a reward of is obtained in state $L$. Thus, a high total reward accumulated per week implies that a large number of beneficiaries are at either state $S$ or $P$.
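The mapping from weekly listening behavior to a behavioral state can be sketched as a simple thresholding rule. The cutoffs are deliberately left as required arguments, because the values used in the study are not reproduced here; the function name `engagement_state`, the use of a listened fraction in [0, 1], and the inclusive boundaries are assumptions of this example.

```python
def engagement_state(fraction_listened, high_cutoff, low_cutoff):
    """Map the fraction of call content listened to in a week to a state name.

    high_cutoff and low_cutoff are caller-supplied thresholds (hypothetical
    parameters); fractions at or above a cutoff fall in the higher state.
    """
    if not 0.0 <= fraction_listened <= 1.0:
        raise ValueError("fraction_listened must lie in [0, 1]")
    if not low_cutoff < high_cutoff:
        raise ValueError("low_cutoff must be smaller than high_cutoff")
    if fraction_listened >= high_cutoff:
        return "self-motivated"
    if fraction_listened >= low_cutoff:
        return "persuadable"
    return "lost cause"
```

For instance, with purely illustrative cutoffs of 0.5 and 0.1, a beneficiary who listened to 30% of the content in a week would be labeled persuadable.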
Some other observations on the call records are as follows. The beneficiaries who were in state on a particular week never transitioned to state in the immediate next week, unless they received an intervention. On the other hand, a few beneficiaries who were in state transitioned to state even without intervention. Moreover, the fraction of times a transition from state to occurred is almost the same with and without the intervention at state . Even though these transition probabilities at the level of all users are known based on this data, it is difficult to know the actual transitions for a given beneficiary a priori and hence, we simulate this behavior. We conduct two sets of simulations, namely static (the transition probabilities of each beneficiary remain the same throughout their enrollment) and dynamic (the transition probabilities change a few weeks after their enrollment).
7.2.1 Static State-Transition Model
We assume three categories of arms: (A) high chance of improvement: highly likely to improve their engagement on receiving an intervention, and to deteriorate in the absence of an intervention, when they are at state $P$, i.e., and , (B) medium chance of improvement: and , and (C) low chance of improvement: and . We assume arms belong to category A, arms belong to category B, and arms belong to category C. This assumption helps us determine the efficacy of any learning algorithm; in particular, the most efficient algorithm would quickly learn to intervene on the arms of category A whenever they are at state $P$. We compare WIQL with the Greedy, Random, and Myopic algorithms.
7.2.2 Dynamic State-Transition Model
For the dynamic setting, we simulate the first weeks as described in Section 7.2.1. Further, we assume that the arms in category (A) change to “medium” chance of improvement after they are enrolled for around weeks, and those in category (B) change to “low” chance of improvement. Also, there is a new set of arms with “high” chance of improvement. Note that, in reality, the behavior change could happen at any arbitrary time; however, to check whether WIQL adapts to a dynamic change of transition probabilities, we use this simulation environment.
We run each simulation for weeks (the length of the program). In Figure 3 we provide results obtained by considering various values of $M$, where a value of $M$ represents the total number of personalized visits made by the healthcare workers in a week. We observe that, for (each healthcare worker visits only beneficiaries per week), WIQL performs (only) marginally better than the other benchmarks. In contrast, when , the reward obtained by WIQL is higher than Greedy and significantly more than Myopic and Random. Comparing WIQL and Greedy based on their total reward per week, we observe that the per-week engagement under WIQL exceeds that under Greedy by a significant margin. Observe that the convergence of WIQL to a total reward of is quicker when $M$ is higher. This is because more sample points are observed per week. Additionally, we observe that the Myopic algorithm leads to overall low engagement among the beneficiaries, even under the static setting.
These results show that WIQL increases the performance of the healthcare intervention service. WIQL is able to learn which arms should be intervened at which state without any prior knowledge about the transition probabilities, and also adapts better to the underlying dynamics of the transition probabilities.
8 Conclusion and Discussion
We focus on a limited-resource sequential decision problem and formulate it as an RMAB setting. We provide a mechanism to systematically learn and decide on whom to intervene, and hence improve the overall benefit of intervention. Our method has the capacity to impact and improve the wellbeing of millions, for example, in the maternal healthcare domain, as demonstrated in this paper.
WIQL is a general solution for learning RMABs, and we demonstrate this using examples from other domains, such as Circulant Dynamics. Additionally, WIQL can be used in a more general setting where new arms arrive over time. For the departing arms, we can assume that each arm ends in a new state “dropped” and never transitions to the earlier states. Moreover, our experiments show that WIQL adapts to dynamic behavioral change. In practice, however, WIQL may not be directly applicable in domains where there are beneficiaries with extreme health risk. The nonzero probability of being selected randomly may come at the cost of a patient in critical need of intervention. One way to mitigate this issue is to assign priorities to beneficiaries depending on their comorbidities, possible complications during pregnancy and after childbirth, etc. If the high-risk class is small, we can target the intervention at all of them and run WIQL on the remaining beneficiaries. This constraint may hamper the convergence guarantee of WIQL; however, it would benefit the enrolled women at large.
Going ahead, it would be interesting to solve the RMAB learning problem considering a large number of states and more than two actions, each with a different cost of operation.
References
 [1] (2020) Whittle index based Q-learning for restless bandits with average reward. arXiv preprint arXiv:2004.14427.
 [2] (2018) Restless bandits visiting villages: a preliminary study on distributing public health services. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies, pp. 1–8.
 [3] (2021) Learning index policies for restless bandits with application to maternal healthcare. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1467–1468.
 [4] (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38 (2), pp. 447–469.
 [5] (2007) Effectiveness of community health workers in the care of people with hypertension. American Journal of Preventive Medicine 32 (5), pp. 435–447.
 [6] (2013) House calls by community health workers and public health nurses to improve adherence to isoniazid monotherapy for latent tuberculosis infection: a retrospective study. BMC Public Health 13 (1), pp. 894.
 [7] (2011) Thirty years after Alma-Ata: a systematic review of the impact of community health workers delivering curative interventions against malaria, pneumonia and diarrhoea on child mortality and morbidity in sub-Saharan Africa. Human Resources for Health 9 (1), pp. 27.
 [8] (2019) 99DOTS: a low-cost approach to monitoring and improving medication adherence. In Proceedings of the Tenth International Conference on Information and Communication Technologies and Development, pp. 1–12.
 [9] (2019) Towards Q-learning the Whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pp. 249–254.
 [10] (2006) Some indexable families of restless bandit problems. Adv. Appl. Probab, pp. 643–672.
 [11] (2019) Conversion uplift in e-commerce: a systematic benchmark of modeling strategies. International Journal of Information Technology & Decision Making 18 (03), pp. 747–791.
 [12] (2012) Predicting adherence to treatment for schizophrenia from dialogue transcripts. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 79–83.
 [13] (2012) Optimality of myopic scheduling and Whittle indexability for energy harvesting sensors. In Conference on Information Sciences and Systems (CISS).
 [14] (1994) Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pp. 703–710.
 [15] (2013) Using community health workers to improve clinical outcomes among people living with HIV: a randomized controlled trial. AIDS and Behavior 17 (9), pp. 2927–2934.
 [16] (2021) Beyond "to act or not to act": fast Lagrangian approaches to general multi-action restless bandits. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 710–718.
 [17] (2018) Predicting adherence to chronic disease medications in patients with long-term initial medication fills using indicators of clinical events and health behaviors. Journal of Managed Care & Specialty Pharmacy 24 (5), pp. 469–477.
 [18] (2019) Optimal screening for hepatocellular carcinoma: a restless bandit model. Manufacturing & Service Operations Management 21 (1), pp. 198–212.
 [19] (2004) Monitoring depression treatment outcomes with the Patient Health Questionnaire-9. Medical Care, pp. 1194–1201.
 [20] (2020) Collapsing bandits and their application to public health interventions. In Neural Information Processing Systems, NeurIPS.
 [21] (2021) Risk-aware interventions in public health: planning with restless multi-armed bandits. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 880–888.
 [22] (2018) Reducing the risk of postpartum depression in a low-income community through a community health worker intervention. Maternal and Child Health Journal 22 (4), pp. 520–528.
 [23] (2018) Community health workers improve disease control and medication adherence among patients with diabetes and/or hypertension in Chiapas, Mexico: an observational stepped-wedge study. BMJ Global Health.
 [24] (2016) Restless poachers: handling exploration exploitation tradeoffs in security domains. In International Joint Conference on Autonomous Agents and MultiAgent Systems, AAMAS.
 [25] (2014) The effects on tuberculosis treatment adherence from utilising community health workers: a comparison of selected rural and urban settings in Kenya. PLoS One 9 (2), pp. e88937.
 [26] (1994) Too far to walk: maternal mortality in context. Social Science & Medicine 38 (8), pp. 1091–1110.
 [27] (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
 [28] (1989) Learning from delayed rewards.
 [29] (1988) Restless bandits: activity allocation in a changing world. Journal of Applied Probability, pp. 287–298.
Appendix A Proof for Whittle Index QLearning
Theorem 3
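Taking action $1$ on the top $M$ arms according to $Q_i(s_i, 1) - Q_i(s_i, 0)$ is equivalent to maximizing $\sum_{i \in N} Q_i(s_i, a_i)$ over all possible action profiles $(a_1, \ldots, a_{|N|}) \in \{0, 1\}^{|N|}$ such that $\sum_{i \in N} a_i = M$, where $s_i$ denotes the current state of arm $i$.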
Proof Let be the set containing the top beneficiaries (i.e., beneficiaries with highest values of ), and be the set contains the non top M beneficiaries. Let represent any set of k beneficiaries. We then have:
Adding on both sides, we obtain:  
(7) 
Equation 7 shows that providing intervention to beneficiaries in and taking passive actions on all other arms would maximize .
Now the rest of the proof follows from (1) the convergence guarantee of the Q-Learning algorithm [27], (2) the $\epsilon$-decay selection policy, and (3) the indexability assumption of RMABs [29]. It has been established in [27] that the update rule of Q-Learning converges to $Q^*$ whenever $\sum_{c} \alpha(c) = \infty$ and $\sum_{c} \alpha(c)^2 < \infty$. These assumptions require that all state-action pairs be visited infinitely often, that is, the underlying MDP does not have a state that, after some finite time, is no longer reachable via any sequence of actions. Similarly, the assumptions require that each arm, state, and action triple be visited infinitely often. Thus, in addition to the reachability assumption on the MDP, we need to show that each arm is visited infinitely often. This is guaranteed by the $\epsilon$-decay selection process, where each arm has a nonzero probability of being selected uniformly at random. Thus, $Q_i(Z, a)$ converges to $Q_i^*(Z, a)$, which implies that $Q_i(Z, 1) - Q_i(Z, 0)$ converges to $Q_i^*(Z, 1) - Q_i^*(Z, 0)$ (using series convergence operation). Note that the value $Q_i^*(Z, 1) - Q_i^*(Z, 0)$ is the Whittle Index. Using these values for selecting the best arms would lead to an optimal solution for the relaxed Lagrangian problem, because of the indexability assumption (Definition 1).
Appendix B Experimental Evaluation
Here we provide the details of the benchmark algorithms that we use for evaluation. We also provide examples from two other domains, one considered in [1] and the other in [9].
B.1 Benchmark Algorithms
We compare WIQL with five other algorithms:

OPT: This algorithm assumes full knowledge of the underlying transition probabilities and has access to the optimal Whittle Indices for each arm and each state. At each time step, the top arms are chosen according to the Whittle Index values of their current states. OPT serves as the reference against which we visualize the convergence of the other algorithms.

AB: Proposed by [1], this method is based on a fundamental change in the definition of Q values that aims at converging to the optimal Whittle Index policy. Arms are assumed to be homogeneous, and a shared Q update step is used for all the arms. The shared update helps in collecting the point samples of Q values very quickly and results in fast convergence to the optimal policy. However, we consider scenarios where the arms may have different transition probabilities, so we need to differentiate among arms even when they are in the same state. Thus, for our experiments, we store Q values separately for each arm.

Fu: This method has been proposed by [9] to update Q values separately for each arm, each , and each state-action pair. They assume to be a set of input parameters. At each time step, is computed which represents the subsidy that minimizes the gap between Q values for taking action and that of action . For the experiments, we set and other hyperparameters the same as what they considered in their paper. This method is a heuristic with no convergence guarantee.

Greedy: This method maintains a running average reward for each arm, state, and action, based on the history of observations. At each time step, it greedily chooses the top arms with the highest difference in their average rewards between the active and passive actions at their current states.

Random: This method chooses arms uniformly at random at each step. Note that this algorithm is the same as setting in Step of the WIQL algorithm.
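The two simplest baselines above, Greedy and Random, can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, and treating an unseen arm, state, and action triple as having average reward zero is an assumption made here, since the text does not say how such triples are initialized.

```python
import random
from collections import defaultdict


class GreedySelector:
    """Running average reward per (arm, state, action); actions are 0 (passive), 1 (active)."""

    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.total = defaultdict(float)  # summed reward per (arm, state, action)
        self.count = defaultdict(int)    # number of observations per (arm, state, action)

    def observe(self, arm, state, action, reward):
        key = (arm, state, action)
        self.total[key] += reward
        self.count[key] += 1

    def average(self, arm, state, action):
        key = (arm, state, action)
        # Assumption: unseen triples default to an average reward of 0.0.
        return self.total[key] / self.count[key] if self.count[key] else 0.0

    def select(self, states, m):
        """Pick the m arms with the largest active-minus-passive average reward gap."""
        gaps = [self.average(i, s, 1) - self.average(i, s, 0)
                for i, s in enumerate(states)]
        ranked = sorted(range(self.num_arms), key=lambda i: gaps[i], reverse=True)
        return sorted(ranked[:m])


def random_select(num_arms, m, rng=random):
    """Pick m distinct arms uniformly at random."""
    return sorted(rng.sample(range(num_arms), m))
```

Unlike the Q-value-based methods, the Greedy rule looks only at immediate rewards and ignores how an action changes the arm's future state.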
B.2 Numerical Examples
In this section, we provide two additional numerical examples for more general RMAB instances.
B.2.1 Example with restart
This RMAB problem is considered by [1]. Here, each arm is assumed to be in one of five states at any point in time. There are two actions: active () and passive (). The active action forces an arm to restart from the first state. At each time step, action can be taken only on arms. The reward from an arm at a state is assumed to be when passive action is taken. The transition probabilities for each action are represented as a matrix, where , , , and .
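The restart structure described above can be sketched as follows. Only the active-action matrix follows directly from the text (every state moves to the first state with probability one). The passive matrix built by `placeholder_passive_matrix` is a hypothetical stand-in with an assumed drift probability, included only so the sketch runs; it does not reproduce the values used in [1].

```python
import random


def restart_matrix(num_states):
    """Active action: every state jumps to state 0 (the first state) with probability 1."""
    return [[1.0 if j == 0 else 0.0 for j in range(num_states)]
            for _ in range(num_states)]


def placeholder_passive_matrix(num_states, drift=0.9):
    """Hypothetical passive dynamics, NOT the matrix from [1].

    With probability `drift` the arm moves one state forward (or stays put
    in the last state); otherwise it falls back to state 0.
    """
    matrix = []
    for s in range(num_states):
        row = [0.0] * num_states
        row[min(s + 1, num_states - 1)] += drift
        row[0] += 1.0 - drift
        matrix.append(row)
    return matrix


def is_stochastic(matrix, tol=1e-9):
    """Check that every row is a probability distribution."""
    return all(abs(sum(row) - 1.0) < tol and min(row) >= 0.0 for row in matrix)


def step(state, action, passive, active, rng=random):
    """Sample the next state of one arm under the chosen action (1 = active)."""
    row = (active if action == 1 else passive)[state]
    return rng.choices(range(len(row)), weights=row)[0]
```

Whatever the passive dynamics are, `step(state, 1, passive, active)` always returns state 0, which is what makes the active action a restart.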
Results: Figure 4 compares the performance of our proposed method with the benchmarks. Similar to the Circulant Dynamics example, we observe that WIQL performs better in terms of the average total reward, compared to the other benchmark algorithms for the Example with Restart problem where the number of states is higher.
B.2.2 Mentoring Instructions
This RMAB problem is considered by [9]. Here, each arm, representing a student, is assumed to be in one of ten states at any point in time. At each time step, only mentors are available and thus, action can be taken only on arms, which would possibly improve the state of the student. A higher state index implies a better reward (state represents the best study level and state represents the worst study level). In particular, the reward at a state is assumed to be . The transition probabilities for each action are represented as a matrix, where , , , and .
Results: Figure 5 compares the performance of our proposed method with the benchmarks. We observe that WIQL performs marginally better in terms of the average total reward, compared to the other benchmark algorithms for the Mentoring Instructions problem. However, unlike the previous example, the margin of improvement is small. A possible reason is that the number of states is much larger than in the earlier example, so the algorithms need more time to learn the best policy. Nevertheless, this example shows that the Random policy may perform even worse when the number of states is as high as 10.