1 Introduction
Many sequential decision problems can be modeled as multi-armed bandit problems. A bandit problem models each potential decision as an arm. In each round, we play $M$ arms out of a total of $N$ arms by choosing the corresponding decisions. We then receive a reward from the played arms. The goal is to maximize the expected long-term total discounted reward. Consider, for example, displaying advertisements on an online platform with the goal of maximizing the long-term discounted click-through rate. This can be modeled as a bandit problem where each arm is an advertisement and we choose which advertisements to display every time a particular user visits the platform. Note that the reward of an arm, i.e., its click-through rate, is not stationary but depends on our past actions. For example, a user who has just clicked on a particular advertisement may be much less likely to click on the same advertisement in the near future. Such a problem is a classic case of the restless bandit problem, where the reward distribution of an arm depends on its state, which changes over time based on our past actions.
The restless bandit problem is notoriously intractable Papadimitriou and Tsitsiklis (1999). Most recent efforts, such as recovering bandits Pike-Burke and Grunewalder (2019), rotting bandits Seznec et al. (2020), and Brownian bandits Slivkins and Upfal (2008), only study special instances of the restless bandit problem. The fundamental challenge of the restless bandit problem lies in the explosion of the state space, as the state of the entire system is the Cartesian product of the states of the individual arms. A powerful tool traditionally used to solve the decision-making problem of restless multi-armed bandits (RMABs) is the Whittle index policy Whittle (1988). In a nutshell, the Whittle index policy calculates a Whittle index for each arm based on the arm's current state, where the index corresponds to the cost that we are willing to pay to play the arm, and then plays the arms with the highest indices. When the indexability condition is satisfied, it has been shown that the Whittle index policy is asymptotically optimal in a wide range of settings.
In this paper, we present Neural Whittle Index Network (NeurWIN), a principled machine learning approach that finds the Whittle indices for virtually all restless bandit problems. We note that the Whittle index is an artificial construct that cannot be directly measured. Finding the Whittle index is typically intractable. As a result, the Whittle indices of many practical problems remain unknown except for a few special cases.
We circumvent the challenges of finding the Whittle indices by leveraging an important mathematical property of the Whittle index. Consider an alternative problem where there is only one arm and we decide whether to play the arm at each time instance. In this problem, we need to pay a constant cost of $\lambda$ every time we play the arm. The goal is to maximize the long-term discounted net reward, defined as the difference between the rewards we obtain from the arm and the costs we pay to play it. Then, the optimal policy is to play the arm whenever the Whittle index becomes larger than $\lambda$. Based on this property, a neural network that produces the Whittle index can be viewed as one that finds the optimal policy for the alternative problem for any $\lambda$.
Using this observation, we propose a deep reinforcement learning method to train NeurWIN. To demonstrate the power of NeurWIN, we apply it to three recently studied restless bandit problems, namely, recovering bandits Pike-Burke and Grunewalder (2019), wireless scheduling Aalto et al. (2015), and stochastic deadline scheduling Yu et al. (2018). We compare NeurWIN against five other reinforcement learning algorithms and the application-specific baseline policies in the respective restless bandit problems. Experiment results show that the index policy using NeurWIN significantly outperforms the other reinforcement learning algorithms. Moreover, for problems where the Whittle indices are known, NeurWIN has virtually the same performance as the corresponding Whittle index policy, showing that NeurWIN indeed learns a precise approximation of the Whittle indices.
The rest of the paper is organized as follows: Section 2 reviews related literature. Section 3 provides formal definitions of the Whittle index and our problem statement. Section 4 introduces our training algorithm for NeurWIN. Section 5 demonstrates the utility of NeurWIN by evaluating its performance under three recently studied restless bandit problems. Finally, Section 6 concludes the paper.
2 Related work
Restless bandit problems were first introduced in Whittle (1988). They are known to be intractable, and are in general PSPACE-hard Papadimitriou and Tsitsiklis (1999). As a result, many studies focus on finding the Whittle index policy for restless bandit problems, such as Le Ny et al. (2008); Dance and Silander (2015); Meshram et al. (2018); Tripathi and Modiano (2019). However, these studies are only able to find the Whittle indices under various specific assumptions about the bandit problems.
There have been many studies on applying RL methods to bandit problems. Dann et al. (2017) proposed a tool called Uniform-PAC for contextual bandits. Zanette and Brunskill (2018) described a framework-agnostic approach towards guaranteeing RL algorithms' performance. Jiang et al. (2017) introduced contextual decision processes (CDPs) that encompass contextual bandits for RL exploration with function approximation. Riquelme et al. (2018) compared deep neural networks with Bayesian linear regression against other posterior sampling methods. However, none of these studies are applicable to restless bandits, where the state of an arm can change over time.
Deep RL algorithms have been utilized in problems that resemble restless bandit problems, including HVAC control Wei et al. (2017), cyber-physical systems Leong et al. (2020), and dynamic multi-channel access Wang et al. (2018). In all these cases, a major limitation of deep RL is scalability. As the state space grows exponentially with the number of arms, these studies can only be applied to small-scale systems, and their evaluations are limited to cases with at most 5 zones, 6 sensors, and 8 channels, respectively.
An emerging research direction is applying machine learning algorithms to learn Whittle indices. Borkar and Chadha (2018) proposed employing the LSPE(0) algorithm Yu and Bertsekas (2009) coupled with a polynomial function approximator. The approach was applied in Avrachenkov and Borkar (2019) for scheduling web crawlers. However, this work can only be applied to restless bandits whose states can be represented by a single number, and it only uses a polynomial function approximator, which may have low representational power Sutton and Barto (2018).
Biswas et al. (2021) studied a public health setting and modeled it as a restless bandit problem. A Q-learning based Whittle index approach was formulated to pick patients for interventions based on their states. Fu et al. (2019) proposed a Q-learning based heuristic to find Whittle indices. However, as shown in its experiment results, the heuristic may not produce Whittle indices even when the training converges.
Avrachenkov and Borkar (2020) proposed WIBQL: a Q-learning method for learning the Whittle indices by applying a modified tabular relative value iteration (RVI) algorithm from Abounadi et al. (2001). The experiments presented here show that WIBQL does not scale well with large state spaces.
3 Problem Setting
In this section, we provide a brief overview of restless bandit problems and the Whittle index. We then formally define the problem statement.
3.1 Restless Bandit Problems
A restless bandit problem consists of $N$ restless arms. In each round $t$, a control policy $\pi$ observes the state of each arm $i$, denoted by $s_i[t]$, and selects $M$ arms to activate. We call the selected arms active and the others passive. We use $a_i[t]$ to denote the policy's decision on each arm $i$, where $a_i[t] = 1$ if the arm is active and $a_i[t] = 0$ if it is passive at round $t$. Each arm $i$ generates a stochastic reward $r_i[t]$ with distribution $R_{i,act}(s_i[t])$ if it is active, and with distribution $R_{i,pass}(s_i[t])$ if it is passive. The state of each arm $i$ in the next round evolves by the transition kernel of either $P_{i,act}(s_i[t])$ or $P_{i,pass}(s_i[t])$, depending on whether the arm is active. The goal of the control policy is to maximize the expected total discounted reward, which can be expressed as $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \sum_{i=1}^{N} \beta^{t} r_i[t]\right]$ with $\beta$ being the discount factor.
A control policy is effectively a function that takes the vector $(s_1[t], s_2[t], \dots, s_N[t])$ as the input and produces the vector $(a_1[t], a_2[t], \dots, a_N[t])$ as the output. It should be noted that the space of inputs is exponential in $N$. If each arm can be in one of $K$ possible states, then the number of possible inputs is $K^N$. This feature, which is usually referred to as the curse of dimensionality, makes finding the optimal control policy intractable.
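To make the growth concrete, the following minimal Python sketch counts joint states when every arm has the same number of states. The function name is hypothetical, and the value of 120 states per arm is borrowed from the deadline scheduling setting of Section 5.2 purely for illustration.

```python
def joint_state_count(states_per_arm: int, num_arms: int) -> int:
    """Number of joint states when each arm has `states_per_arm` states."""
    return states_per_arm ** num_arms


if __name__ == "__main__":
    # The joint state space grows exponentially with the number of arms.
    for num_arms in (1, 2, 4, 8):
        print(num_arms, joint_state_count(120, num_arms))
```

A policy that operates on the joint state must cope with this count directly, whereas a per-arm index only ever sees the states of one arm.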
3.2 The Whittle Index
An index policy seeks to address the curse of dimensionality through decomposition. In each round, it calculates an index, denoted by $W_i(s_i[t])$, for each arm $i$ based on its current state. The index policy then selects the $M$ arms with the highest indices to activate. It should be noted that the index of an arm is independent of the states of any other arms. In this sense, learning the Whittle index of a restless arm is an auxiliary task to finding the control policy for restless bandits.
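The selection step of an index policy can be sketched in a few lines of Python. This is only an illustration: the function name, the argument names, and the tie-breaking rule (lower arm number first) are assumptions, not something specified in the text.

```python
from typing import Callable, Hashable, List, Sequence


def index_policy(
    states: Sequence[Hashable],
    index_fns: Sequence[Callable[[Hashable], float]],
    budget: int,
) -> List[int]:
    """Return a 0/1 activation decision per arm.

    Each arm's index depends only on that arm's own state, so the indices
    can be computed independently and then ranked.
    """
    indices = [fn(state) for fn, state in zip(index_fns, states)]
    # Stable sort: ties are broken in favor of the lower arm number (assumption).
    ranked = sorted(range(len(states)), key=lambda i: indices[i], reverse=True)
    chosen = set(ranked[:budget])
    return [1 if i in chosen else 0 for i in range(len(states))]


if __name__ == "__main__":
    # Three arms whose (hypothetical) index is simply the state value.
    print(index_policy([3, 1, 2], [float, float, float], budget=2))
```

The per-round cost is one index evaluation per arm plus a sort, which is what makes index policies attractive when the number of arms is large.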
The performance of an index policy depends on the design of the index function $W_i(\cdot)$. A popular index with a solid theoretical foundation is the Whittle index, which is defined below. Since we only consider one arm at a time, we drop the subscript $i$ for the rest of the paper.
Consider a system with only one arm, and an activation policy that determines whether to activate the arm in each round $t$. Suppose that the activation policy needs to pay an activation cost of $\lambda$ every time it chooses to activate the arm. The goal of the activation policy is to maximize the total expected discounted net reward, $\mathbb{E}\left[\sum_{t=0}^{\infty} \beta^{t} (r[t] - \lambda a[t])\right]$. The optimal activation policy can be expressed by the set of states in which it would activate the arm for a particular $\lambda$, and we denote this set by $\mathcal{S}(\lambda)$. Intuitively, the higher the cost, the less likely the optimal activation policy would activate the arm in a given state, and hence the set $\mathcal{S}(\lambda)$ would decrease monotonically. When an arm satisfies this intuition, we say that the arm is indexable.
Definition 1 (Indexability).
An arm is said to be indexable if $\mathcal{S}(\lambda)$ decreases monotonically from the set of all states to the empty set as $\lambda$ increases from $-\infty$ to $\infty$. A restless bandit problem is said to be indexable if all arms are indexable.
Definition 2 (The Whittle Index).
If an arm is indexable, then the Whittle index of each state $s$ is defined as $W(s) := \sup_{\lambda} \{\lambda : s \in \mathcal{S}(\lambda)\}$.
Even when an arm is indexable, finding its Whittle index can still be intractable, especially when the transition kernel of the arm is convoluted.¹ NeurWIN finds the Whittle index by leveraging the following property of the Whittle index. Consider the single-armed bandit problem. Suppose the initial state of an indexable arm is $s$ at round one. Consider two possibilities: the first is that the activation policy activates the arm at round one and then uses the optimal policy starting from round two; the second is that the activation policy does not activate the arm at round one and then uses the optimal policy starting from round two. Let $Q_{\lambda,act}(s)$ and $Q_{\lambda,pass}(s)$ be the expected discounted net rewards for these two possibilities, respectively, and let $D_s(\lambda) := Q_{\lambda,act}(s) - Q_{\lambda,pass}(s)$ be their difference. Clearly, the optimal activation policy should activate an arm under state $s$ and activation cost $\lambda$ if $D_s(\lambda) \geq 0$. We present the property more formally in the following proposition:
¹ Niño-Mora (2007) described a generic approach for finding the Whittle index. The complexity of this approach is at least exponential in the number of states.
Proposition 1.
(Zhao, 2019, Thm 3.14) If an arm is indexable, then, for every state $s$, $D_s(\lambda) \geq 0$ if and only if $\lambda \leq W(s)$.
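For a small arm whose model is fully known, the proposition suggests a direct numerical recipe: compute the active and passive values for a trial activation cost by value iteration, then bisect on the cost until the two values meet. The Python sketch below illustrates this under several assumptions that are not from the paper: the arm is tabular and indexable, rewards are given as expected values, the index lies inside the bracket `[lo, hi]`, and all function and argument names are hypothetical. It is not the NeurWIN method, which avoids needing the model.

```python
import numpy as np


def q_values(p_act, p_pass, r_act, r_pass, lam, beta, iters=1000):
    """Active/passive discounted net values of a single arm with activation cost lam."""
    v = np.zeros(len(r_act))
    for _ in range(iters):  # value iteration on the single-arm problem
        v = np.maximum(r_act - lam + beta * p_act @ v, r_pass + beta * p_pass @ v)
    q_act = r_act - lam + beta * p_act @ v
    q_pass = r_pass + beta * p_pass @ v
    return q_act, q_pass


def whittle_index(state, p_act, p_pass, r_act, r_pass, beta,
                  lo=-10.0, hi=10.0, tol=1e-6):
    """Bisect on the activation cost until activating and idling are equally good."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        q_act, q_pass = q_values(p_act, p_pass, r_act, r_pass, mid, beta)
        if q_act[state] >= q_pass[state]:
            lo = mid  # activating is still worthwhile, so the index is at least mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Toy arm whose state never changes: the index equals the active reward.
    eye = np.eye(2)
    r_act, r_pass = np.array([1.0, 2.0]), np.zeros(2)
    print([round(whittle_index(s, eye, eye, r_act, r_pass, beta=0.9), 4) for s in (0, 1)])
```

Each bisection step requires solving a full single-arm control problem, so this approach is only practical for small tabular arms with a known model.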
NeurWIN uses Prop. 1 to train neural networks that predict the Whittle index for any indexable arm. From Def. 1, a sufficient condition for indexability is that $D_s(\lambda)$ is a decreasing function of $\lambda$. Thus, we define the concept of strong indexability as follows:
Definition 3 (Strong Indexability).
An arm is said to be strongly indexable if $D_s(\lambda)$ is strictly decreasing in $\lambda$ for every state $s$.
Intuitively, as the activation cost increases, it becomes less attractive to activate the arm in any given state. Hence, one would expect $D_s(\lambda)$ to be strictly decreasing in $\lambda$ for a particular state $s$. In Section 5.5, we further use numerical results to show that all three applications we evaluate in this paper are strongly indexable.
3.3 Problem Statement
We now formally describe the objective of this paper. We assume that we are given a simulator of a single restless arm as a black box. The simulator provides two functionalities. First, it allows us to set the initial state of the arm to any arbitrary state $s$. Second, in each round $t$, the simulator takes $a[t]$, the indicator of whether the arm is activated, as the input and produces the next state $s[t+1]$ and the reward $r[t]$ as the outputs.
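The two functionalities of the black-box simulator can be written down as a small interface. The Python sketch below is one possible rendering: the names `ArmSimulator`, `reset`, and `step` are assumptions chosen for illustration, and `TwoStateArm` is a made-up toy arm, not one of the arms studied in this paper.

```python
import random
from typing import Hashable, Protocol, Tuple


class ArmSimulator(Protocol):
    """Black-box simulator of one restless arm (hypothetical interface)."""

    def reset(self, state: Hashable) -> None:
        """Set the arm to an arbitrary initial state."""

    def step(self, action: int) -> Tuple[Hashable, float]:
        """Apply action (1 = active, 0 = passive); return (next state, reward)."""


class TwoStateArm:
    """Toy arm: playing collects the state's value and drops the arm to state 0;
    a passive arm in state 0 moves to state 1 with probability 0.5."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self.state = 0

    def reset(self, state: int) -> None:
        self.state = state

    def step(self, action: int) -> Tuple[int, float]:
        reward = float(self.state) if action == 1 else 0.0
        if action == 1:
            self.state = 0
        elif self._rng.random() < 0.5:
            self.state = 1
        return self.state, reward


if __name__ == "__main__":
    arm: ArmSimulator = TwoStateArm()
    arm.reset(1)
    print(arm.step(1))  # playing in state 1 pays 1.0 and moves the arm to state 0
```

Nothing in this interface exposes the transition kernel or the reward distribution, which matches the black-box assumption in the problem statement.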
Our goal is to derive low-complexity index algorithms for restless bandit problems by training a neural network that approximates the Whittle index of each restless arm using its simulator. A neural network takes the state $s$ as the input and produces a real number $f_\theta(s)$ as the output, where $\theta$ is the vector containing all weights and biases of the neural network. Recall that $W(s)$ is the Whittle index of the arm. We aim to find an appropriate $\theta$ that makes $|f_\theta(s) - W(s)|$ small for all $s$. Such a neural network is said to be Whittle-accurate.
Definition 4 ($\gamma$-Whittle-accurate).
A neural network with parameters $\theta$ is said to be $\gamma$-Whittle-accurate if $|f_\theta(s) - W(s)| \leq \gamma$, for all $s$.
4 NeurWIN Algorithm: Neural Whittle Index Network
In this section, we present NeurWIN, a deep-RL algorithm that trains neural networks to predict the Whittle indices. Since the Whittle index of an arm is independent of the other arms, NeurWIN trains one neural network for each arm independently. In this section, we discuss how NeurWIN learns the Whittle index for a single arm.
4.1 Conditions for Whittle-Accurate
Before presenting NeurWIN, we discuss the conditions for a neural network to be Whittle-accurate.
Suppose we are given a simulator of an arm and a neural network with parameters $\theta$. We can then construct an environment of the arm along with an activation cost $\lambda$ as shown in Fig. 1. In each round $t$, the environment takes the real number $f_\theta(s[t])$ as the input. The input is first fed into a step function to produce $a[t] = \mathbb{1}(f_\theta(s[t]) \geq \lambda)$, where $\mathbb{1}(\cdot)$ is the indicator function. Then, $a[t]$ is fed into the simulator of the arm to produce $r[t]$ and $s[t+1]$. Finally, the environment outputs the net reward $r[t] - \lambda a[t]$ and the next state $s[t+1]$. We call this environment $Env(\lambda)$. Thus, the neural network can be viewed as a controller for $Env(\lambda)$. The following corollary is a direct result of Prop. 1.
Corollary 1.
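If the arm is indexable and $f_\theta(s) = W(s)$ for all $s$, then the neural network with parameters $\theta$ is the optimal controller for $Env(\lambda)$, for any $\lambda$ and initial state $s[0]$. Moreover, given $\lambda$ and $s[0]$, the optimal discounted net reward is $\max\{Q_{\lambda,act}(s[0]), Q_{\lambda,pass}(s[0])\}$.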
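The environment with an activation cost, and the discounted net reward that a controller collects in it, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the class and function names, the `reset`/`step` simulator interface, the toy `StaticArm`, the finite horizon, and the choice to activate on ties are all assumptions.

```python
class StaticArm:
    """Hypothetical toy simulator: the state never changes and playing pays the state's value."""

    def reset(self, state):
        self.state = state

    def step(self, action):
        return self.state, float(self.state) * action


class CostEnv:
    """Wraps a single-arm simulator with a fixed activation cost."""

    def __init__(self, simulator, activation_cost):
        self.sim = simulator
        self.cost = activation_cost

    def reset(self, state):
        self.sim.reset(state)

    def step(self, controller_output):
        # Step function: activate when the controller output reaches the cost.
        action = 1 if controller_output >= self.cost else 0
        next_state, reward = self.sim.step(action)
        return next_state, reward - self.cost * action  # net reward


def discounted_net_reward(env, index_fn, initial_state, beta, horizon):
    """Roll out the controller index_fn and accumulate the discounted net reward."""
    env.reset(initial_state)
    state, total = initial_state, 0.0
    for t in range(horizon):
        state, net = env.step(index_fn(state))
        total += beta ** t * net
    return total


if __name__ == "__main__":
    env = CostEnv(StaticArm(), activation_cost=0.5)
    # A controller that outputs the state value activates whenever it exceeds the cost.
    print(discounted_net_reward(env, float, initial_state=1, beta=0.5, horizon=3))
```

For this toy arm the controller that outputs the state value is optimal for every activation cost, because activating pays off exactly when the state value is at least the cost.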
Corollary 1 can be viewed as a necessary condition for a neural network to be 0-Whittle-accurate. Next, we establish a sufficient condition that shows how a near-optimal neural network controller for $Env(\lambda)$ is also Whittle-accurate. Let $\tilde{Q}_\theta(s', \lambda)$ be the average reward of applying a neural network with parameters $\theta$ to $Env(\lambda)$ with initial state $s'$. We can then formally define the concept of near-optimality as follows.
Definition 5 ($\epsilon$-optimal neural network).
A neural network with parameters is said to be optimal if there exists a small positive number such that for all , and .
Having outlined the necessary terms, we now establish a sufficient condition for the Whittle-accuracy of a neural network applied to $Env(\lambda)$.
Theorem 1.
If the arm is strongly indexable, then for any $\gamma > 0$, there exists a positive $\epsilon$ such that any $\epsilon$-optimal neural network controlling $Env(\lambda)$ is also $\gamma$-Whittle-accurate.
Proof.
For a given , let,
Since the arm is strongly indexable and is its Whittle index, we have .
We prove the theorem by establishing the following equivalent statement: If the neural network is not Whittleaccurate, then there exists states , activation cost , such that the discounted net reward of applying a neural network to with initial state is strictly less than .
Suppose the neural network is not Whittleaccurate, then there exists a state such that . We set . For the case , we set . Since , we have and . On the other hand, since , the neural network would activate the arm in the first round and its discounted reward is at most
For the case , a similar argument shows that the discounted reward for the neural network when is smaller than . This completes the proof.
∎
4.2 Training Procedure for NeurWIN
Based on Thm. 1 and Def. 5, we define our objective function as $\sum_{s_0, s_1} \tilde{Q}_\theta(s_1, \lambda = f_\theta(s_0))$, with the estimated index $f_\theta(s_0)$ set as the environment's activation cost. A neural network that achieves a near-optimal $\sum_{s_0, s_1} \tilde{Q}_\theta(s_1, \lambda = f_\theta(s_0))$ is also Whittle-accurate, which motivates the use of deep reinforcement learning to find Whittle-accurate neural networks. We therefore propose NeurWIN: an algorithm based on REINFORCE Williams (1992) that updates $\theta$ through stochastic gradient ascent, where the gradient is defined as $\nabla_\theta \tilde{Q}_\theta(s_1, \lambda = f_\theta(s_0))$. For the gradient to exist, we require the output of the environment to be differentiable with respect to the input $f_\theta(s[t])$. To fulfill this requirement, we replace the step function in Fig. 1 with a sigmoid function,
$$\sigma_m(f_\theta(s[t]) - \lambda) := \frac{1}{1 + \exp\left(-m \left(f_\theta(s[t]) - \lambda\right)\right)}, \qquad (1)$$
where $m$ is a sensitivity parameter. The environment then chooses $a[t] = 1$ with probability $\sigma_m(f_\theta(s[t]) - \lambda)$, and $a[t] = 0$ with probability $1 - \sigma_m(f_\theta(s[t]) - \lambda)$. We call this differentiable environment $Env^*(\lambda)$.
The complete NeurWIN pseudocode is provided in Alg. 1. Our training procedure consists of multiple mini-batches, where each mini-batch is composed of $R$ episodes. At the beginning of each mini-batch, we randomly select two states $s_0$ and $s_1$. Motivated by the condition in Thm. 1, we consider the environment $Env^*(f_\theta(s_0))$ with initial state $s_1$, and aim to improve the empirical discounted net reward of applying the neural network to such an environment.
In each episode $e$ from the current mini-batch, we set $\lambda = f_\theta(s_0)$ and the initial state to be $s_1$. We then apply the neural network with parameters $\theta$ to $Env^*(\lambda)$ and observe the sequences of actions $(a[1], a[2], \dots)$ and states $(s[1], s[2], \dots)$. We can use these sequences to calculate their gradients with respect to $\theta$ through backward propagation, which we denote by $h_e$. At the end of the mini-batch, NeurWIN will have stored the accumulated gradients for each of the mini-batch episodes to tune the parameters.
We also observe the discounted net reward of each episode and denote it by $G_e$. After all episodes in the mini-batch finish, we calculate the average of all $G_e$ as a bootstrapped baseline and denote it by $\bar{G}_b$. Finally, we do a weighted gradient ascent with the weight for episode $e$ being its offset net reward, $G_e - \bar{G}_b$.
When the step size is chosen appropriately, the neural network will be more likely to follow the action sequences of episodes with larger $G_e$ after the weighted gradient ascent, and thus will have a better empirical discounted net reward.
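The mini-batch procedure described above can be illustrated with a compact NumPy sketch. It is not the authors' implementation and makes several simplifying assumptions, each of which is hypothetical: the index function is linear in a hand-written feature map rather than a neural network, the activation cost is held fixed when differentiating, the weighted gradients are averaged over episodes, the simulator follows a `reset`/`step` interface, and every hyperparameter value is a placeholder rather than a value from the paper.

```python
import numpy as np


class StaticArm:
    """Hypothetical toy simulator: the state never changes and playing pays the state's value."""

    def reset(self, state):
        self.state = state

    def step(self, action):
        return self.state, float(self.state) * action


def features(state):
    # Hypothetical feature map for a linear index function f(s) = theta . phi(s).
    return np.array([1.0, float(state)])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def minibatch_update(theta, simulator, states, rng, episodes=5, horizon=50,
                     beta=0.9, m=5.0, step_size=0.01):
    """One REINFORCE-style mini-batch update with an average-return baseline."""
    s0 = states[rng.integers(len(states))]
    s1 = states[rng.integers(len(states))]
    lam = float(theta @ features(s0))  # estimated index of s0 used as activation cost
    grads, returns = [], []
    for _ in range(episodes):
        simulator.reset(s1)
        state, grad, ret = s1, np.zeros_like(theta), 0.0
        for t in range(horizon):
            phi = features(state)
            p_active = sigmoid(m * (theta @ phi - lam))
            action = int(rng.random() < p_active)
            # Gradient of the log-probability of the sampled action w.r.t. theta.
            grad += m * (action - p_active) * phi
            state, reward = simulator.step(action)
            ret += beta ** t * (reward - lam * action)
        grads.append(grad)
        returns.append(ret)
    baseline = float(np.mean(returns))
    # Weight each episode's gradient by its return offset from the baseline.
    ascent = sum((g - baseline) * h for g, h in zip(returns, grads))
    return theta + step_size * ascent / episodes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.zeros(2)
    for _ in range(200):
        theta = minibatch_update(theta, StaticArm(), [0, 1, 2], rng)
    print(theta)
```

Subtracting the mini-batch average does not change the expected gradient direction but reduces its variance; with a single episode per mini-batch the offset is zero and the parameters are left unchanged.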
5 Experiments
5.1 Overview
In this section, we demonstrate NeurWIN’s utility by evaluating it under three recently studied applications of restless bandit problems. In each application, we consider that there are arms and a controller can play of them in each round. We evaluate three different pairs of : and , and average the results of 50 independent runs when the problems are stochastic. Some applications consider that different arms can have different behaviors. For such scenarios, we consider that there are multiple types of arms and train a separate NeurWIN for each type. During testing, the controller calculates the index of each arm based on the arm’s state and schedules the arms with the highest indices.
The performance of NeurWIN is compared against the policies proposed in the respective recent studies. In addition, we also evaluate REINFORCE Williams (1992), Wolpertinger-DDPG (WOLP-DDPG) Dulac-Arnold et al. (2016), Amortized Q-learning (AQL) Van de Wiele et al. (2020), QWIC Fu et al. (2019), and WIBQL Avrachenkov and Borkar (2020). REINFORCE is a classical policy-based RL algorithm. Both WOLP-DDPG and AQL are model-free deep RL algorithms meant to address problems with big action spaces. All three of them view a restless bandit problem as a Markov decision problem. Under this view, the number of states is exponential in $N$ and the number of possible actions is $\binom{N}{M}$, which can be as large as in our setting. Neither REINFORCE nor WOLP-DDPG can support such a big action space, so we only evaluate them for and . On the other hand, QWIC and WIBQL aim to find the Whittle index through Q-learning. They are tabular RL methods and do not scale well as the state space increases. Thus, we only evaluate QWIC and WIBQL when the size of the state space is less than one thousand. We use open-source implementations for REINFORCE Hubbs (2019) and WOLP-DDPG ChangyWen (2021).
In addition, we use experiment results to evaluate two important properties. First, we evaluate whether these three applications are strongly indexable. Second, we evaluate the performance of NeurWIN when the simulator does not perfectly capture the actual behavior of an arm.
We use the same neural network architecture for NeurWIN in all three applications. The neural network is a fully connected one that consists of one input layer, one output layer, and two hidden layers. There are 16 and 32 neurons in the two hidden layers. The output layer has one neuron, and the input layer size is the same as the dimension of the state of a single arm. As for the REINFORCE, WOLP-DDPG, and AQL algorithms, we choose the neural network architectures so that the total number of parameters is slightly more than $N$ times the number of parameters in NeurWIN to make a fair comparison. The ReLU activation function is used for the two hidden layers. An initial learning rate is set for all cases, with the Adam optimizer Kingma and Ba (2015) employed for the gradient ascent step. The discount factor is with an episode horizon of timesteps. Each mini-batch consists of five episodes. More details can be found in the appendix.
5.2 Deadline Scheduling
A recent study Yu et al. (2018) proposes a deadline scheduling problem for the scheduling of electrical vehicle charging stations. In this problem, a charging station has charging spots and enough power to charge vehicles in each round. When a charging spot is available, a new vehicle may join the system and occupy the spot. Upon occupying the spot, the vehicle announces the time that it will leave the station and the amount of electricity that it needs to be charged. The charging station obtains a reward of for each unit of electricity that it provides to a vehicle.
However, if the station cannot fully charge the vehicle by the time it leaves, then the station needs to pay a penalty of , where is the amount of unfulfilled charge. The goal of the station is to maximize its net reward, defined as the difference between the amount of reward and the amount of penalty. In this problem, each charging spot is an arm. Yu et al. (2018) has shown that this problem is indexable. We further show in Appendix A that the problem is also strongly indexable.
We use exactly the same setting as in the recent study Yu et al. (2018) for our experiment. In this problem, the state of an arm is denoted by a pair of integers with and . The size of state space is 120 for each arm.
The experiment results are shown in Fig. 2. It can be observed that the performance of NeurWIN converges to that of the deadline Whittle index in 600 training episodes. In contrast, the other MDP algorithms show virtually no improvement over 2,000 training episodes and remain far worse than NeurWIN. This may be due to the explosion of the state space. Even when $N$ is only 4, the total number of possible states is $120^4$, making it difficult for the compared deep RL algorithms to learn the optimal control in just 2,000 episodes. QWIC performs poorly compared to NeurWIN and the deadline index, while WIBQL's performance degrades with more arms. The results suggest that neither QWIC nor WIBQL learns an accurate approximation of the Whittle index.
5.3 Recovering Bandits
The recovering bandits Pike-Burke and Grunewalder (2019) aim to model the time-varying behaviors of consumers. In particular, the model considers that a consumer who has just bought a certain product, say, a television, would be much less interested in advertisements of the same product in the near future. However, the consumer's interest in these advertisements may recover over time. Thus, the recovering bandit models the reward of playing an arm, i.e., displaying an advertisement, by a function $f(\min\{z, z_{max}\})$, where $z$ is the time since the arm was last played and $z_{max}$ is a constant specified by the arm. There is no known Whittle index or optimal control policy for this problem.
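A single recovering arm can be sketched as a small simulator whose state is the capped time since the arm was last played. In this illustrative Python sketch the class name, the `reset`/`step` interface, the convention that a played arm restarts its counter at 1, the zero reward for a passive arm, and the example recovery function are all assumptions made for the example rather than details taken from the text.

```python
from typing import Callable, Tuple


class RecoveringArm:
    """Arm whose reward depends on the (capped) time since it was last played."""

    def __init__(self, recovery_fn: Callable[[int], float], z_max: int) -> None:
        self.recovery_fn = recovery_fn  # hypothetical recovery function
        self.z_max = z_max              # cap on the waiting time
        self.z = 1

    def reset(self, state: int) -> None:
        self.z = state

    def step(self, action: int) -> Tuple[int, float]:
        if action == 1:
            reward = self.recovery_fn(min(self.z, self.z_max))
            self.z = 1  # assumption: the waiting time restarts after a play
        else:
            reward = 0.0  # assumption: a passive arm yields no reward
            self.z = min(self.z + 1, self.z_max)
        return self.z, reward


if __name__ == "__main__":
    # Hypothetical recovery function: reward grows linearly with the waiting time.
    arm = RecoveringArm(recovery_fn=float, z_max=3)
    arm.reset(1)
    print([arm.step(a) for a in (0, 0, 0, 1)])
```

Because the waiting time is capped, the arm has a finite state space whose size is set by the cap, and its state depends on the controller's past actions, which is what makes the bandit restless.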
The recent study Pike-Burke and Grunewalder (2019) considers the special case of . When the function $f$ is known, it proposes a $d$-step lookahead oracle. Once every $d$ rounds, the $d$-step lookahead oracle builds a $d$-step lookahead tree. Each leaf of the $d$-step lookahead tree corresponds to a sequence of actions. The $d$-step lookahead oracle then picks the leaf with the best reward and uses the corresponding actions in the next $d$ rounds. As the size of the tree grows exponentially with $d$, it is computationally infeasible to evaluate the $d$-step lookahead oracle when $d$ is large. We modify a heuristic introduced in Pike-Burke and Grunewalder (2019) to greedily construct a 20-step lookahead tree with leaves and pick the best leaf. Pike-Burke and Grunewalder (2019) also propose two online algorithms, RGP-UCB and RGP-TS, for exploring and exploiting $f$ when it is not known a priori. We incorporate the open-source implementations of these two algorithms from Pike-Burke and Grunewalder (2019) in our experiments.
In our experiment, we consider different arms have different functions . The state of each arm is its value of and we set for all arms.
Experiment results are shown in Fig. 3. It can be observed that NeurWIN is able to outperform the 20-step lookahead oracle in its respective setting with just a few thousand training episodes. Other algorithms perform poorly.
5.4 Wireless Scheduling
A recent paper Aalto et al. (2015) studies the problem of wireless scheduling over fading channels. In this problem, each arm corresponds to a wireless client. Each wireless client has some data to be transmitted and it suffers from a holding cost of 1 unit per round until it has finished transmitting all its data. The channel quality of a wireless client, which determines the amount of data that can be transmitted if the wireless client is scheduled, changes over time. The goal is to minimize the sum of the holding costs of all wireless clients. Equivalently, we view the reward of the system as the negative of the total holding cost. Finding the Whittle index through theoretical analysis is difficult. Even for the simplified case where the channel quality is i.i.d. over time and can only be in one of two possible states, Aalto et al. (2015) can only derive the Whittle index under some approximations. It then proposes a size-aware index policy using its approximated index.
In the experiment, we adopt the settings of channel qualities of the recent paper. The channel of a wireless client can be in either a good state or a bad state. Initially, the amount of load is uniform between 0 and 1Mb. The state of each arm is its channel state and the amount of remaining load. The size of state space is for each arm. We consider that there are two types of arms, and different types of arms have different probabilities of being in the good state. During testing, there are arms of each type.
Experiment results are shown in Fig. 4. It can be observed that NeurWIN is able to perform as well as the size-aware index policy with about training episodes. It can also be observed that other learning algorithms perform poorly.
5.5 Evaluation of NeurWIN’s Limitations
A limitation of NeurWIN is that it is designed for strongly indexable bandits. Hence, it is important to evaluate whether the considered bandit problem is strongly indexable. Recall that a bandit arm is strongly indexable if $D_s(\lambda)$ is strictly decreasing in $\lambda$, for all states $s$. We extensively evaluate the function $D_s(\lambda)$ of different states for the three applications considered in this paper. We find that all of them are strictly decreasing in $\lambda$. Fig. 5 shows the function $D_s(\lambda)$ of five randomly selected states for each of the three applications. The deadline and wireless scheduling cases were averaged over 50 runs. These results confirm that the three considered restless bandit problems are indeed strongly indexable.
Another limitation of NeurWIN is that it requires a simulator for each arm. To evaluate the robustness of NeurWIN, we test its performance when the simulator is not perfectly precise. In particular, let $R_{act}(s)$ and $R_{pass}(s)$ be the rewards of an arm in state $s$ when it is activated and not activated, respectively. Then, the simulator estimates that the rewards are and , respectively, where and are independent Gaussian random variables. The variance of these Gaussian random variables corresponds to the magnitude of the root mean square errors in the simulator.
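One way to emulate an imprecise simulator is to wrap an exact one and perturb its rewards. The Python sketch below is an illustration under explicit assumptions that go beyond the text: the error is taken to be multiplicative (the reward is scaled by one plus a Gaussian term), one error is drawn per state-action pair and then kept fixed, states are hashable, and the class names and the `reset`/`step` interface are hypothetical.

```python
import random


class StaticArm:
    """Hypothetical toy simulator: the state never changes and playing pays the state's value."""

    def reset(self, state):
        self.state = state

    def step(self, action):
        return self.state, float(self.state) * action


class NoisyRewardSimulator:
    """Wraps a simulator and perturbs rewards with a fixed Gaussian error per (state, action)."""

    def __init__(self, simulator, rel_std, seed=0):
        self.sim = simulator
        self.rel_std = rel_std  # relative standard deviation of the reward error
        self._rng = random.Random(seed)
        self._noise = {}        # one persistent error term per (state, action)
        self._state = None

    def reset(self, state):
        self.sim.reset(state)
        self._state = state

    def step(self, action):
        key = (self._state, action)
        if key not in self._noise:
            self._noise[key] = self._rng.gauss(0.0, self.rel_std)
        next_state, reward = self.sim.step(action)
        self._state = next_state
        # Assumed multiplicative error model: estimate = (1 + error) * true reward.
        return next_state, (1.0 + self._noise[key]) * reward


if __name__ == "__main__":
    noisy = NoisyRewardSimulator(StaticArm(), rel_std=0.4, seed=1)
    noisy.reset(2)
    print(noisy.step(1), noisy.step(1))  # the same state-action pair repeats its error
```

Keeping the error fixed per state-action pair models a systematically wrong simulator rather than fresh observation noise, so averaging over more training episodes does not remove it.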
We train NeurWIN using the noisy simulators with different levels of errors for the three problems. For each problem, we compare the performance of NeurWIN against the respective baseline policy. Unlike NeurWIN, the baseline policies make decisions based on the true reward functions rather than the estimated ones. The results for the case and are shown in Fig. 6. It can be observed that the performance of NeurWIN only degrades a little even when the root mean square error is as large as 40% of the actual rewards, and its performance remains similar or even superior to that of the baseline policies.
6 Conclusion
This paper introduced NeurWIN: a deep RL method for estimating the Whittle index for restless bandit problems. The performance of NeurWIN was evaluated on three different restless bandit problems. In each of them, NeurWIN outperforms or matches state-of-the-art control policies. The concept of strong indexability for bandit problems was also introduced. In addition, all three considered restless bandit problems were empirically shown to be strongly indexable.
There are several promising research directions for NeurWIN. One is extending NeurWIN to the offline policy case. One way to do so is to use data samples collected offline to construct a predictive model for each arm. Similar attempts have recently been made for general MDPs in Kidambi et al. (2020); Yu et al. (2020). NeurWIN would then learn the Whittle index based on this predictive model of a single arm, which is expected to require fewer data samples due to the decomposition provided by index policies.
Another direction is to investigate NeurWIN's performance in cases without strong indexability guarantees. For cases with verifiably non-strongly-indexable states, one would add a pre-processing step that provides performance precision thresholds on non-indexable arms.
Acknowledgment
This material is based upon work supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under Grant Numbers W911NF1810331, W911NF1910367 and W911NF1920243, in part by Office of Naval Research under Contracts N000141812048 and N000142112385, in part by the National Science Foundation under Grant Numbers CNS 1955696 and CPS2038963, and in part by the Ministry of Science and Technology of Taiwan under Contract Numbers MOST 1092636E009012 and MOST 1102628EA49014.
References
 Whittle index approach to sizeaware scheduling with timevarying channels. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 57–69. Cited by: §1, §5.4.

Learning algorithms for markov decision processes with average cost
. SIAM Journal on Control and Optimization 40 (3), pp. 681–698. External Links: Document, Link, https://doi.org/10.1137/S0363012999361974 Cited by: §2.  A learning algorithm for the whittle index policy for scheduling web crawlers. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Vol. , pp. 1001–1006. Cited by: §2.
 Whittle index based qlearning for restless bandits with average reward. CoRR abs/2004.14427. External Links: Link, 2004.14427 Cited by: §2, §5.1.

 Learn to Intervene: An Adaptive Learning Policy for Restless Bandits in Application to Preventive Healthcare. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, Canada, pp. 4039–4046 (en). External Links: ISBN 9780999241196, Link, Document Cited by: §2.
 A reinforcement learning algorithm for restless bandits. In 2018 Indian Control Conference (ICC), Vol. , pp. 89–94. Cited by: §2.
 ChangyWen/wolpertinger_ddpg. Note: https://github.com/ChangyWen/wolpertinger_ddpg (original date: 2019-06-21T02:39:45Z) Cited by: §5.1.

 When are kalman-filter restless bandits indexable? In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1711–1719. External Links: Link Cited by: §2.
 Unifying pac and regret: uniform pac bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 5717–5727. External Links: ISBN 9781510860964 Cited by: §2.
 Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679. Cited by: §5.1.
 Towards qlearning the whittle index for restless bandits. In 2019 Australian New Zealand Control Conference (ANZCC), Vol. , pp. 249–254. Cited by: §2, §5.1.

 Learning reinforcement learning: reinforce with pytorch! Towards Data Science. Note: https://towardsdatascience.com/learningreinforcementlearningreinforcewithpytorch5e8ad7fc7da0 Cited by: §5.1.
 Contextual decision processes with low Bellman rank are PAC-learnable. D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1704–1713. External Links: Link Cited by: §2.
 MOReL: modelbased offline reinforcement learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 21810–21823. External Links: Link Cited by: §6.
 Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §5.1.
 Multiuav dynamic routing with partial observations using restless bandit allocation indices. In 2008 American Control Conference, Vol. , pp. 4220–4225. Cited by: §2.
 Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems. Automatica 113, pp. 108759. External Links: ISSN 00051098, Document, Link Cited by: §2.
 On the whittle index for restless multiarmed hidden markov bandits. IEEE Transactions on Automatic Control 63 (9), pp. 3046–3053. Cited by: §2.
 Dynamic priority allocation via restless bandit marginal productivity indices. Top 15 (2), pp. 161–198. Cited by: footnote 1.
 The complexity of optimal queuing network control. Mathematics of Operations Research 24 (2), pp. 293–305. External Links: ISSN 0364765X, 15265471, Link Cited by: §1, §2.
 Recovering bandits. In Advances in Neural Information Processing Systems, pp. 14122–14131. Cited by: §1, §1, §5.3, §5.3.

 Deep bayesian bandits showdown: an empirical comparison of bayesian deep networks for thompson sampling. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
 A single algorithm for both restless and rested rotting bandits. In International Conference on Artificial Intelligence and Statistics, pp. 3784–3794. Cited by: §1.
 Adapting to a changing environment: the brownian restless bandits.. In COLT, pp. 343–354. Cited by: §1.
 Reinforcement learning: an introduction. Second edition, The MIT Press. External Links: Link Cited by: §2.
 A whittle index approach to minimizing functions of age of information. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Vol. , pp. 1160–1167. Cited by: §2.
 Qlearning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116. Cited by: §5.1.
 Deep reinforcement learning for dynamic multichannel access in wireless networks. IEEE Transactions on Cognitive Communications and Networking 4 (2), pp. 257–265. Cited by: §2.
 Deep reinforcement learning for building hvac control. In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), Vol. , pp. 1–6. Cited by: §2.
 Restless bandits: activity allocation in a changing world. Journal of applied probability, pp. 287–298. Cited by: §1, §2.
 Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Mach. Learn. 8 (3–4), pp. 229–256. External Links: ISSN 08856125, Link, Document Cited by: §4.2, §5.1.
 Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control 54 (7), pp. 1515–1531. Cited by: §2.
 MOPO: modelbased offline policy optimization. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 14129–14142. External Links: Link Cited by: §6.
 Deadline scheduling as restless bandits. IEEE Transactions on Automatic Control 63 (8), pp. 2343–2358. Cited by: Appendix A, §1, §5.2, §5.2, §5.2.
 Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 5747–5755. External Links: Link Cited by: §2.
 Multiarmed bandits: theory and applications to online learning in networks. Synthesis Lectures on Communication Networks 12 (1), pp. 1–165. Cited by: Proposition 1.
Appendix A Proof of Deadline Scheduling’s Strong Indexability
Theorem 2.
The restless bandit for the deadline scheduling problem is strongly indexable.
Proof.
Fix a state , the function is a continuous and piecewise linear function since the number of states is finite. Thus, it is sufficient to prove that is strictly decreasing at all points of where is differentiable. Let be the sequence of actions taken by a policy that activates the arm at round 1, and then uses the optimal policy starting from round 2. Let be the sequence of actions taken by a policy that does not activate the arm at round 1, and then uses the optimal policy starting from round 2. We prove this theorem by comparing and on every sample path. We consider the following two scenarios:
In the first scenario, and are the same starting from round 2. Let be the remaining job size when the current deadline expires under . Since is the same as starting from round 2, its remaining job size when the current deadline expires is . Thus, , which is strictly decreasing in whenever is differentiable.
In the second scenario, and are not the same after round 2. Let be the first time after round 2 that they are different. Since they are the same between round 2 and round , the remaining job size under is no larger than that under . Moreover, by [34], the Whittle index is increasing in job size. Hence, we can conclude that, on round , activates the arm and does not activate the arm. After round , and are in the same state and will choose the same actions for all following rounds. Thus, the two sequences only see different rewards on round 1 and round , and we have , which is strictly decreasing in whenever is differentiable.
Combining the two scenarios, the proof is complete. ∎
Appendix B Additional NeurWIN Results for Neural Networks With Different Numbers of Parameters
In this section, we show the total discounted reward performance of NeurWIN when trained with neural networks of different sizes. We compare the performance against each case’s proposed baseline in order to observe the convergence rate of NeurWIN. We use the same training parameters as in section 5, and plot the results for . Fig. 7 shows the performance for a larger neural network per arm with neurons in 2 hidden layers. Fig. 8 provides results for a smaller neural network per arm with neurons in 2 hidden layers. For the deadline and wireless scheduling cases, the larger neural network has 3345 parameters per arm, compared with 625 parameters for each network from section 5 and 165 parameters for the smaller neural network. For the recovering bandits’ case, the larger network has 3297 parameters, compared with 609 parameters for the network from section 5 and 157 parameters for the smaller neural network.
It can be observed that a neural network with more parameters converges to the baseline in fewer episodes than the smaller and original neural networks. In the case of deadline scheduling, the smaller network requires 1380 episodes in to reach the Whittle index performance. The larger network, in contrast, reaches the Whittle index performance in terms of total discounted rewards in considerably fewer episodes, at 180 episodes. The original network used in section 5 converges in approximately 600 episodes. The same observation holds for the wireless scheduling case, where the smaller network requires 10,000 episodes for , and fails to outperform the baseline for . The larger network converges in fewer episodes than the smaller and original networks.
More interestingly, the smaller neural networks fail to outperform the baselines in the recovering bandits case for and , which suggests that the selected neural network architecture is not rich enough for learning the Whittle index.
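The parameter counts quoted in this appendix can be checked with a short calculation. The sketch below counts the weights and biases of a fully connected network with a scalar output. The hidden-layer widths in `ASSUMED_WIDTHS` are assumptions: they are not preserved in the text above and were chosen only because they reproduce the stated counts for a two-dimensional state (deadline and wireless scheduling) and a scalar state (recovering bandits). Other widths could give the same totals.

```python
def mlp_param_count(input_dim, hidden_widths, output_dim=1):
    """Count weights and biases of a fully connected network."""
    widths = [input_dim, *hidden_widths, output_dim]
    # Each layer contributes fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))


# Hypothetical hidden-layer widths (assumptions, not taken from the text).
ASSUMED_WIDTHS = {"larger": (48, 64), "original": (16, 32), "smaller": (8, 14)}

for name, hidden in ASSUMED_WIDTHS.items():
    two_dim = mlp_param_count(2, hidden)  # deadline / wireless: 2-dimensional state
    one_dim = mlp_param_count(1, hidden)  # recovering bandits: scalar state
    print(name, two_dim, one_dim)
```

Under these assumed widths, the counts for the two-dimensional state are 3345, 625, and 165, and the counts for the scalar state are 3297, 609, and 157, which agree with the numbers reported above.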
Appendix C Training and Testing Environments
For all cases, we implement the NeurWIN algorithm using PyTorch, and train the agent on a single arm modelled after OpenAI’s Gym API.
All algorithms were trained and tested on a Windows 10 build 19043.985 machine with an AMD Ryzen 3950X CPU, and 64 GB 3600 MHz RAM.
Appendix D Deadline Scheduling Training and Inference Details
D.1 Formulated Restless Bandit for the Deadline Scheduling Case
The state , action , reward , and next state of one arm are listed below:
State : The state is a vector . denotes the job size (i.e. amount of electricity needed for an electric vehicle), and is the job’s time until the hard drop deadline is reached (i.e. time until an electric vehicle leaves).
Action : The agent can either activate the arm , or leave it passive . The next state changes based on two different transition kernels depending on the selected action. The reward is also dependent on the action at time .
Reward : The agent, at time , receives a reward from the arm,
(2) 
In the equation above, is a constant processing cost incurred when activating the arm, is the penalty function for failing to complete the job before . The penalty function was chosen to be .
Next state : The next state is given by
(3) 
where is the arrival probability of a new job (i.e. a new electric vehicle arriving at a charging station) if the position is empty. For training and inference, we set and for all , .
D.2 Training Setting
NeurWIN is trained for episodes on the deadline scheduling case. We save the trained model parameters at an interval of episodes for inferring the control policy after training. The training thus produces different sets of parameters, each of which outputs the estimated index given its respective training limit. The neural network has trainable parameters given as layers , where the input layer matches the state size.
For the deadline scheduling training, we set the sigmoid value , the episode’s time horizon timesteps, the minibatch size to episodes, and the discount factor . The processing cost is set to . The training procedure follows section 4.2 from the main text. The arm randomly picks an initial state , with a maximum , and maximum . The initial states are the same across episodes in one minibatch for return comparison. The sequence of job arrivals in an episode’s horizon is also fixed across a minibatch. This way, the minibatch returns are compared for one initial state, and used in tuning the estimated index value .
At the agent side, NeurWIN receives the initial state and sets the activation cost from a random state . Training then proceeds as described in NeurWIN’s pseudocode.
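The shared randomness within a minibatch (one initial state and one job-arrival sequence reused by every episode) can be sketched as follows. This is a minimal illustration and not NeurWIN’s released implementation: the function name `sample_minibatch_scenario`, its arguments, the lower bound of 1 on the sampled values, and the numeric values in the usage lines are all placeholders chosen for illustration.

```python
import random


def sample_minibatch_scenario(rng, max_deadline, max_job_size, horizon, arrival_prob):
    """Draw the randomness shared by all episodes of one minibatch.

    Returns an initial state and a fixed sequence of job-arrival outcomes.
    Argument names and sampling ranges are hypothetical.
    """
    initial_state = {
        "deadline": rng.randint(1, max_deadline),
        "job_size": rng.randint(1, max_job_size),
    }
    # One Bernoulli draw per timestep: does a new job arrive if the position is empty?
    arrivals = [rng.random() < arrival_prob for _ in range(horizon)]
    return initial_state, arrivals


# Placeholder values, not the paper's settings.
rng = random.Random(0)
initial_state, arrivals = sample_minibatch_scenario(
    rng, max_deadline=5, max_job_size=4, horizon=10, arrival_prob=0.5)
# Every episode in the minibatch starts from `initial_state` and replays `arrivals`,
# so differences in return come only from the agent's actions.
```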
For the MDP algorithms (REINFORCE, AQL, WOLPDDPG), the training hyperparameters are the same as for NeurWIN: the initial learning rate is , the episode time horizon is timesteps, and the discount factor is . The neural networks had two hidden layers with a total parameter count slightly larger than number of NeurWIN networks for proper comparison. QWIC was trained for the sets with a table for each arm. QWIC selects from a set of candidate threshold values as the index for each state. The algorithm learns the function . The estimated index per state is determined during training as,
(4) 
Initial timestep was selected to be , and at later timesteps set to be . Other training hyperparameters: Initial learning rate , training episode time horizon of timesteps, discount factor .
D.3 Inference Setting
In inferring the control policy, we test the trained parameters at different episode intervals. In other words, the trained models’ parameters are tested at an interval of episodes, and their discounted rewards are plotted for comparison.
From the trained NeurWIN models described in D.2, we instantiate arms, and activate arms at each timestep based on their indices. For example, we load a NeurWIN model trained for episodes on one arm, and set arms each with its own trained agent on episodes. Once the testing is complete, we load the next model trained at episodes, and repeat the process for episodes. The testing setting has the same parameters as the training setting with horizon timesteps, and discount factor .
For the deadline Whittle index, we calculate the indices using the closed-form Whittle index and activate the arms associated with the highest indices. The accumulated reward from all arms (activated and passive) is then discounted with .
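The index-based activation and the discounted reward used throughout the inference settings can be sketched as below. This is an illustrative sketch under stated assumptions rather than the evaluation code used for the reported results: the function names are hypothetical, ties are broken by arm position, and discounting is assumed to start with exponent 0 at the first timestep.

```python
def top_m_arms(indices, m):
    """Return the positions of the m arms with the largest index values."""
    ranked = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return set(ranked[:m])


def total_discounted_reward(rewards_per_step, beta):
    """Sum over timesteps t of beta**t times the summed reward of all arms."""
    return sum((beta ** t) * sum(step) for t, step in enumerate(rewards_per_step))


# Four arms with made-up index values; activate the two largest.
activated = top_m_arms([0.3, 1.2, -0.5, 0.9], m=2)

# Two timesteps, two arms each, with made-up rewards and discount factor.
value = total_discounted_reward([[1.0, 0.0], [0.5, 0.5]], beta=0.5)
```

Here `activated` contains positions 1 and 3, and `value` equals 1.0 + 0.5 * 1.0 = 1.5.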
For the MDP algorithms (REINFORCE, AQL, WOLPDDPG), the combined states of the arms form the MDP state, which is passed to the trained neural network. The reward is the sum of all rewards from the arms. Testing is performed for neural networks trained at different episode limits. The same testing parameters as for NeurWIN were chosen: horizon timesteps, discount factor .
We perform the testing over independent runs up to episodes, where the arms are seeded differently in each run. All algorithms were tested on the same seeded arms. The results for this setting are provided in the main text.
Appendix E Recovering Bandits’ Training and Inference Details
E.1 Formulated Restless Bandit for the Recovering Bandits’ Case
We list here the terms that describe one restless arm in the recovering bandits’ case:
State : The state is a single value called the waiting time. The waiting time indicates the time since the arm was last played. The arm state space is determined by the maximum allowed waiting time , giving a state space .
Action : As with all other considered cases, the agent can either activate the arm , or not select it . The action space is then .
Reward : The reward is provided by the recovering function , where is the time since the arm was last played at time . If the arm is activated, the function value at is the earned reward. A reward of zero is given if the arm is left passive . Fig. 9 shows the four recovering functions used in this work. The recovering functions are generated from,
(5) 
where the values specify the recovering function. The values for each class are provided in table 1.
Class  Value  Value 

A  10  0.2 
B  8.5  0.4 
C  7  0.6 
D  5.5  0.8 
Next state : The state evolves based on the selected action. If , the state is reset to , meaning that the bandit’s reward decays to its value at the initial waiting time . If the arm is left passive , the next state becomes .
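The arm dynamics listed above can be sketched as a small environment class. This is a minimal sketch under stated assumptions: the class and argument names are hypothetical, the initial waiting time is assumed to be 1, the passive transition is assumed to increment the waiting time and cap it at the maximum allowed waiting time, and the recovering function is supplied by the caller. The linear function in the usage lines is a placeholder and is not the function of equation (5).

```python
class RecoveringArm:
    """One restless arm whose state is the waiting time since it was last played."""

    def __init__(self, recovering_fn, z_max, z_init=1):
        self.recovering_fn = recovering_fn  # maps waiting time to reward
        self.z_max = z_max                  # maximum allowed waiting time
        self.z_init = z_init                # assumed initial waiting time
        self.z = z_init

    def step(self, action):
        if action == 1:
            # Activated: earn the recovering function's value, then reset.
            reward = self.recovering_fn(self.z)
            self.z = self.z_init
        else:
            # Passive: no reward, waiting time grows up to the cap (assumption).
            reward = 0.0
            self.z = min(self.z + 1, self.z_max)
        return self.z, reward


# Placeholder recovering function for illustration only.
arm = RecoveringArm(recovering_fn=lambda z: float(z), z_max=3)
arm.step(0)                      # waiting time becomes 2
arm.step(0)                      # waiting time becomes 3
arm.step(0)                      # stays at the cap of 3
state, reward = arm.step(1)      # earns the value at waiting time 3, then resets
```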
E.2 Training Setting
The training procedure for the NeurWIN algorithm follows the pseudocode in section 4. Here we discuss the parameter selection and details specific to the recovering bandits’ case. We train the neural network using NeurWIN for episodes, and save the trained parameters at an interval of episodes. In total, for training episodes, we end up with models for inference. The selected neural network has trainable parameters with two hidden layers given as neurons.
For training parameters, we select the sigmoid value , the episode’s time horizon timesteps, the minibatch size to episodes, and the discount factor . As with all other cases, each minibatch of episodes has the same initial state which is provided by the arm. To ensure the agent experiences as many states in as possible, we set an initial state sampling distribution given as . Hence, the probability of selecting the initial state to be is .
At the agent side, we set the activation cost at the beginning of each minibatch. is chosen to be the estimated index value of a randomly selected state in . The training continues as described in NeurWIN’s pseudocode: the agent receives the state, and selects an action . If the agent activates the arm , it receives a reward equal to the recovering function’s value at , and subtracts from it. Otherwise, the reward is kept the same for . We train NeurWIN independently for each of the four recovering functions described in table 1.
For the MDP algorithms, training hyperparameters were selected as: Initial learning rate , discount factor . The algorithms are trained on the MDP representation, where the state is the combined states of the arms.
QWIC training hyperparameters are the same. Training process is the same as in the deadline scheduling case from D.2.
E.3 Inference Setting
The inference setup measures NeurWIN’s control policy for several settings. We test, for a single run, the control policy of NeurWIN over a time horizon timesteps. We set arms such that a quarter have one recovering function class from table 1. For , we have three type A, three type B, two type C, and two type D arms.
At each timestep, the 8-lookahead and 20-lookahead policies rank the recovering functions’ reward values, and select the arms with the highest reward values for activation. The incurred discounted reward at time is the discounted sum of all activated arms’ rewards. The total discounted reward is then the sum of the discounted rewards over the time horizon . For inferring NeurWIN’s control policy, we record the total discounted reward for each of the models. For example, we instantiate arms each having a neural network trained to episodes. At each timestep , the neural networks provide the estimated index for . The control policy activates the arms with the highest index values. We then load the model parameters trained on episodes, and repeat the aforementioned testing process using the same seed values.
For REINFORCE, AQL, and WOLPDDPG, testing uses the same seeded arms as NeurWIN and the d-lookahead policies. The MDP state is the combined states of arms, and the total reward is the sum of all arm rewards. Testing is done for the 300 saved models, with discount factor over horizon . QWIC and WIBQL are also tested for the saved index mappings and tables.
Appendix F Wireless Scheduling Training and Inference Details
F.1 Restless Arm Definition for the Wireless Scheduling Case
Here we list the state , action , reward , and next state that form one restless arm:
State : The state is a vector , where is the arm’s remaining load in bits, and is the wireless channel’s state indicator. means a good channel state and a higher transmission rate , while is a bad channel state with a lower transmission rate .
Action : The agent either activates the arm , or keeps it passive . The reward and next state depend on the chosen action.
Reward : The arm’s reward is the negative of the holding cost , which is a cost incurred at each timestep for not completing the job.
Next state : The next state is . The remaining load equals , where if , and if . is 1 with probability and 0 otherwise, where is a parameter describing the probability that the channel is in a good state.
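A single wireless scheduling arm, as listed above, can be sketched as follows. This is an illustrative sketch and not the environment used for the reported results. The class and argument names are hypothetical, and several details are assumptions: an activated arm reduces its load by the rate of its current channel state, the load is clamped at zero, the holding cost is charged only while the job is unfinished, and the channel state is redrawn independently at every timestep.

```python
import random


class WirelessArm:
    """One restless arm with state (remaining load, channel indicator)."""

    def __init__(self, load, rate_good, rate_bad, q, holding_cost, rng=None):
        self.rng = rng or random.Random()
        self.load = load
        self.rate_good = rate_good        # transmission rate in a good channel
        self.rate_bad = rate_bad          # transmission rate in a bad channel
        self.q = q                        # probability of a good channel
        self.holding_cost = holding_cost  # cost per timestep while unfinished
        self.channel = self._sample_channel()

    def _sample_channel(self):
        return 1 if self.rng.random() < self.q else 0

    def step(self, action):
        # Reward is the negative holding cost while the job is not complete.
        reward = -self.holding_cost if self.load > 0 else 0.0
        if action == 1:
            rate = self.rate_good if self.channel == 1 else self.rate_bad
            self.load = max(self.load - rate, 0)  # clamping is an assumption
        self.channel = self._sample_channel()
        return (self.load, self.channel), reward


# Placeholder values; q=1.0 makes the channel always good for a deterministic demo.
arm = WirelessArm(load=10, rate_good=4, rate_bad=1, q=1.0, holding_cost=1.0)
next_state, reward = arm.step(1)  # load drops from 10 to 6
```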
F.2 Training Setting
The neural network has trainable parameters given as layers neurons. Training runs for episodes, and we save the model parameters every episodes. Hence, the training results in models trained up to different episode limits.
For the wireless scheduling case, we set the sigmoid value , the minibatch size to episodes, and the discount factor to . The maximum episode time horizon is set to timesteps. The holding cost is set to , which is incurred for each timestep the job is not completed. We also set the good transmission rate kb, and the bad channel transmission rate kb. We train NeurWIN on two different good channel probabilities, , and .
Each episode defines one job size sampled uniformly from the range . All episodes in one minibatch have the same initial state, as well as the same sequence of channel states .
At the agent side, NeurWIN receives the initial state , and sets the activation cost from a random state for all timesteps of all minibatch episodes. As mentioned before, we save the trained model at an interval of episodes. For training episodes, this results in models trained up to their respective episode limit.
F.3 Inference Setting
Testing compares the induced control policy for NeurWIN with the size-aware index and learning algorithms. The algorithms’ control policies are tested at different training episodes’ limits. We instantiate arms and activate arms at each timestep until all users’ jobs terminate, or the time limit is reached.
Half of the arms have a good channel probability . The other half has a good channel probability .
The size-aware index is defined as follows: at each timestep, the policy prioritizes arms in the good channel state, and calculates their secondary index. The secondary index of arm state is defined as,
(6) 
The size-aware policy then activates the highest indexed arms. In case the number of good channel arms is below , the policy also calculates the primary index of all remaining arms. The primary index of arm state is defined as,
(7) 
Rewards received from all arms are summed, and discounted using . The inference phase proceeds until all jobs have been completed.
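The two-tier selection of the size-aware policy can be sketched as below. This is a sketch of the selection logic only, under stated assumptions: the function name and arguments are hypothetical, arms with no remaining load are assumed to be skipped, and the secondary and primary indices are passed in as callables because their formulas are not reproduced here. The index functions in the usage lines are placeholders, not equations (6) and (7).

```python
def size_aware_activation(states, m, secondary_index, primary_index):
    """Pick up to m arms: good-channel arms first, then the remaining arms.

    states is a list of (remaining_load, channel_indicator) pairs.
    Returns the set of activated arm positions.
    """
    unfinished = [i for i, (load, _) in enumerate(states) if load > 0]
    good = [i for i in unfinished if states[i][1] == 1]
    rest = [i for i in unfinished if states[i][1] == 0]

    # Good-channel arms are ranked by their secondary index.
    good.sort(key=lambda i: secondary_index(states[i]), reverse=True)
    chosen = good[:m]

    # If fewer than m good-channel arms exist, fill up using the primary index.
    if len(chosen) < m:
        rest.sort(key=lambda i: primary_index(states[i]), reverse=True)
        chosen += rest[: m - len(chosen)]
    return set(chosen)


# Placeholder index: prefer smaller remaining load (illustration only).
placeholder_index = lambda state: -state[0]
states = [(5, 1), (3, 1), (8, 0), (2, 0), (0, 1)]
activated = size_aware_activation(states, 3, placeholder_index, placeholder_index)
```

With these placeholder indices, both good-channel arms with remaining load (positions 0 and 1) are activated, and the third slot goes to position 3, the bad-channel arm with the smaller load.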
For NeurWIN’s control policy, we record the total discounted reward for the offline-trained models. For example, we set arms each coupled with a model trained on episodes. The models output their arms’ indices, and the top indexed arms are activated. In case the remaining arms are fewer than , we activate all remaining arms at timestep . The reward at timestep is the sum of all arms’ rewards. Once testing for the current model is finished, we load the next model for each arm, and repeat the process. For the MDP algorithms, the MDP state is the combined states of all arms, with the reward being the sum of arms’ rewards.
We note that the arms’ initial loads are the same across runs, and that the sequence of good channel states is random. For all algorithms, we average the total discounted reward for all control policies over independent runs using the same seed values.