Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for arms so as to maximize the expected value of the cumulative rewards collected. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and number of arms are scaled up, while keeping their ratio as a constant. For the case when the system parameters are unknown, we develop a learning algorithm. Our learning algorithm uses the principle of optimism in the face of uncertainty and further uses a generative model in order to fully exploit the structure of Occupancy-Measured-Reward Index Policy. We call it the R(MA)^2B-UCB algorithm. As compared with the existing algorithms, R(MA)^2B-UCB performs close to an offline optimum policy, and also achieves a sub-linear regret with a low computational complexity. Experimental results show that R(MA)^2B-UCB outperforms the existing algorithms in both regret and run time.




1 Introduction

We study a variant of the popular restless multi-armed bandit (RMAB) problem [48] in which the decision maker makes choices over only a finite horizon and can choose from among multiple actions for each arm. We call this problem the restless multi-armed multi-action bandit (R(MA)^2B) problem. An RMAB problem requires a decision maker to choose from among a fixed number of competing “arms” in a sequential manner. Each arm is endowed with a “state” that evolves according to a Markov decision process (MDP) [38] that is independent of other arms. In the multi-armed bandit (MAB) problem [11], i.e., the “rested MAB” or simply the MAB, only the arms that are currently activated change state, and rewards are generated only from these arms. The goal is to maximize the expected value of the cumulative rewards collected by choosing the arms sequentially. The celebrated Gittins index policy [11] yields an efficient solution to the MAB. At each time, it assigns an index to each arm, which is a function of the current state of this arm, and then activates the arm with the largest index. However, the Gittins index policy is optimal only when the following assumptions hold: (i) the MAB problem involves rested bandits; (ii) only one arm can be activated at each decision epoch; and (iii) the objective is the infinite-horizon discounted expected reward. Whittle [48] generalized the Gittins policy to also allow for the evolution of those arms that are not currently activated (dubbed “a changing world setting”), thereby introducing the RMAB problem. Whittle’s setup also allows multiple arms to be activated simultaneously.

The RMAB problem is a very general setup that can be applied to a variety of sequential decision making problems, including job allocation [35, 16, 6], wireless communication [8, 40], sensor management [30, 1] and healthcare [9, 27, 31, 24]. However, the RMAB is notoriously intractable [37] and the optimal policy for an RMAB is rarely an index policy. To that end, Whittle proposed a heuristic policy for the infinite-horizon RMAB, which is now called the Whittle index policy. However, the Whittle index policy is well-defined only when the so-called indexability [48] condition is satisfied. Furthermore, even when an arm is indexable, obtaining its Whittle indices could still be intractable, especially when the corresponding controlled Markov process is convoluted [35]. Finally, the Whittle index policy is only guaranteed to be asymptotically optimal [47] under a difficult-to-verify condition which requires that the fluid approximation associated with the “population dynamics” of the overall system has a globally asymptotically stable attractor.

Inspired by Whittle’s work, many studies have focused on finding index policies for restless bandit problems, e.g., [34, 45, 15, 51, 7]. This line of work assumes that the system parameters are known to the decision maker. Since in reality the true values of the parameters are unavailable, and possibly time-varying, it is important to examine the RMAB from a learning perspective, e.g., [8, 28, 43, 29, 44, 36, 20, 19, 46, 49]. However, analyzing learning algorithms for the RMAB is in general hard, first because of the uncertainty in the learner’s knowledge of the system parameters, and second because the design of an optimal control policy is still unresolved even when the parameters are known.

Firstly, the existing algorithms such as [28, 43, 44, 29] that are based on the upper confidence bound (UCB) strategy [3] may not perform close to the offline optimum. This is because the baseline policy in these works is often a heuristic policy that does not have any theoretical performance guarantees. An example of such a heuristic policy is one that pulls only one arm, or a fixed set of arms. Such policies are known to yield highly sub-optimal performance in the RMAB setting, and this makes the learning regret [25] less meaningful. Secondly, the aforementioned learning algorithms with a theoretical guarantee of an regret are often computationally expensive. For example, the colored-UCRL2 algorithm [36] suffers from an exponential computational complexity, and its regret bound is exponential in the number of states and arms. This is because it needs to solve Bellman equations on a state space whose size grows exponentially with the number of arms. Thirdly, existing low-complexity policies such as [46, 49] often do not have a regret guarantee that scales as , and moreover they are restricted to a specific Markovian model, which is hard to generalize. In a different line of work, Thompson sampling based algorithms [20, 19] were used to solve this problem. These provide a theoretical guarantee in the Bayesian setup, but since the likelihood functions are very often complex, they require a computationally expensive method to update the posterior beliefs. To the best of our knowledge, there are no provably optimal policies for RMAB problems (let alone the R(MA)^2B under consideration) with an efficient learning algorithm that performs close to the offline optimum and achieves sub-linear regret with low computational complexity, all at once.

In this paper, we address the above challenges for R(MA)^2B problems that involve operating over a finite time horizon. In contrast to most of the aforementioned literature, which allows for only a binary decision set (activate or not activate), our setup allows the decision maker to choose from multiple actions for each arm. This is a very useful generalization since many applications are not limited to binary actions. For example, while streaming video by transmitting video data packets across wireless channels, the transmitter can dynamically choose varying levels of transmission power or video resolution (i.e., actions), which in turn affects the quality of video streaming experienced by the users. However, the analysis of restless bandits with multiple actions remains largely open for the general setting in the literature. We make progress toward R(MA)^2B problems by making the following contributions:

Asymptotically optimal index policy. For the general finite-horizon R(MA)^2B problems in which the system parameters are known, we propose an index policy which we call the Occupancy-Measured-Reward Index Policy. We show that our policy is asymptotically optimal, a result paralleling those known for the Whittle policy. However, unlike the Whittle index policy, our index policy does not require the indexability condition to hold, and is well-defined for both indexable and non-indexable R(MA)^2B problems. This result is significant since the indexability condition is hard to verify or may not hold in general, and the non-indexable settings have so far received little attention, even though they arise in many practical problems.

Reinforcement learning augmented index policy. We present one of the first generative model based reinforcement learning augmented algorithms toward an index policy in the context of finite-horizon R(MA)^2B problems. We call our algorithm R(MA)^2B-UCB. R(MA)^2B-UCB consists of a novel optimistic planning step similar to the UCB strategy, in which it obtains an estimate of the model by sampling state-action pairs in an offline manner and then solves a so-called extended linear programming problem that is posed in terms of certain “occupancy measures”. The complexity of this procedure is linear in the number of arms, as compared with the exponential complexity of the state-of-the-art colored-UCRL2 algorithm. Furthermore, we show that R(MA)^2B-UCB achieves a sub-linear regret and hence performs close to the offline optimum policy, since it contains an efficient exploitation step enabled by the optimistic planning used by our Occupancy-Measured-Reward Index Policy. This significantly outperforms other existing methods [28, 43, 44, 29] that often rely upon a heuristic policy. Moreover, the multiplicative “pre-factor” that goes with the time-horizon dependent function in the regret is quite low for our policy since the “exploitation step” that we propose is much more efficient; in fact, it is “exponentially better” than that of colored-UCRL2. Our simulation results also show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.

Notation. We denote the set of natural and real numbers by $\mathbb{N}$ and $\mathbb{R}$, respectively. We let $T$ be the finite number of total decision epochs (time). We denote the cardinality of a finite set $\mathcal{A}$ by $|\mathcal{A}|$. We also use $[N]$ to represent the set of integers $\{1, \ldots, N\}$ for $N \in \mathbb{N}$.

2 System Model

We begin by describing the finite-horizon R(MA)^2B problem in which the action set for each of the arms is allowed to be non-binary. Each arm $n \in [N]$ is described by a unichain Markov decision process (MDP) [22] $(\mathcal{S}_n, \mathcal{A}_n, P_n, r_n)$, where $\mathcal{S}_n$ is its finite state space, $\mathcal{A}_n$ is the set of finite actions, $P_n$ is the transition kernel and $r_n$ is the reward function. For ease of readability, we assume that all arms share the same state and action spaces, and these are denoted by $\mathcal{S}$ and $\mathcal{A}$, respectively. Our results and analysis can be extended in a straightforward manner to the case of different state and action spaces, though this will increase the complexity of notation.

Without loss of generality, we denote the action set as $\mathcal{A} = \{0, 1, \ldots, A\}$, where $A = |\mathcal{A}| - 1$. Using the standard terminology from the RMAB literature, we call an arm passive when action $a = 0$ is applied to it, and active otherwise. An activation cost of $a$ units is incurred each time action $a$ is applied to an arm (thus not activating the arm corresponds to zero activation cost). The total activation cost associated with activating a subset of the arms at each time $t$ is constrained by $B$ units. The quantity $B$ is called the activation budget. The initial state is chosen according to the initial distribution $\mathbf{s}_1$ and $T$ is the operating time horizon.

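We denote the state of arm $n$ at time $t$ as $s_n(t)$. The process $\{s_n(t)\}_{t \in [T]}$ evolves as a controlled Markov process, with the conditional probability distribution of $s_n(t+1)$ given by $\mathbb{P}(s_n(t+1) = s' \mid s_n(t) = s, a_n(t) = a) = P_n(s' \mid s, a)$ (almost surely). The instantaneous reward earned at time $t$ by activating arm $n$ is denoted by a random variable $r_n(t) := r_n(s_n(t), a_n(t))$. Without loss of generality, we assume that $r_n(s, a) \in [0, 1]$, with expectation $\bar{r}_n(s, a)$ [10], and let $\bar{r}_n(s, 0)$ be $0$, i.e., no reward is earned when the arm is passive. Denote the total reward earned at time $t$ by $R(t)$, i.e., $R(t) := \sum_{n=1}^{N} r_n(t)$. Let $\mathcal{F}_t$ denote the operational history until $t$, i.e., the sigma-algebra [41] generated by the random variables $\{s_n(\ell) : n \in [N], \ell \in [t]\}$ and $\{a_n(\ell) : n \in [N], \ell \in [t-1]\}$. Our goal is to derive a policy $\pi$ that makes decisions regarding which set of arms to activate at each time $t \in [T]$, so as to maximize the expected value of the cumulative rewards subject to a budget constraint on the activation resource, i.e.,

$$\max_{\pi} \ \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} R(t)\right] \quad \text{s.t.} \quad \sum_{n=1}^{N} a_n(t) \le B, \quad \forall t \in [T], \tag{1}$$
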
where the subscript $\pi$ indicates that the expectation is taken with respect to the measure induced by the policy $\pi$. We refer to the problem (1) as the “original problem”. Though this could be solved by using existing techniques such as dynamic programming [13], existing approaches suffer from the “curse of dimensionality” [4, 5], and hence are computationally intractable. We overcome this difficulty by developing a computationally feasible and provably optimal index-based policy.

3 Asymptotically Optimal Index Policy

In this section, we focus on the scenario when the controlled transition probabilities and the reward functions of each arm are known. We design an index policy for the finite-horizon R(MA)^2Bs, and show that it is asymptotically optimal. We begin by introducing a certain “relaxed problem” [48]. The relaxed problem can be solved efficiently since it can equivalently be posed as a linear program (LP) in the space of occupancy measures of the controlled Markov processes [2], where each such process corresponds to one arm. This forms the building block of our proposed index-based policy, and is described next.

3.1 The Relaxed Problem

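Consider the following problem obtained by replacing the “hard” constraint in (1), in which the activation cost at each time is limited by $B$ units, with a “relaxed” constraint in which this is required to hold only in expectation, i.e.,

$$\max_{\pi} \ \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} R(t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\left[\sum_{n=1}^{N} a_n(t)\right] \le B, \quad \forall t \in [T]. \tag{2}$$

Clearly, the optimal value of the relaxed problem (2) is an upper bound on the optimal value of (1). We note that an optimal policy for (2) might require randomization [2]. It is well known [2] that the relaxed problem (2) can be reduced to an LP in which the decision variables are the occupancy measures of the controlled process. More specifically, the occupancy measure $\mu$ of a policy $\pi$ of a finite-horizon MDP describes the probability with which the state-action pair $(s, a)$ is visited at time $t$. Formally,

$$\mu := \left\{ \mu_n(s, a; t) := \mathbb{P}_{\pi}\big(s_n(t) = s, \ a_n(t) = a\big) \ : \ n \in [N], \ s \in \mathcal{S}, \ a \in \mathcal{A}, \ t \in [T] \right\}.$$
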
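The relaxed problem (2) can be reformulated as the following LP [2] in which the decision variables are these occupancy measures:

$$\max_{\mu} \ \sum_{t=1}^{T} \sum_{n=1}^{N} \sum_{(s,a)} \mu_n(s, a; t)\, \bar{r}_n(s, a) \tag{3}$$

$$\text{s.t.} \quad \sum_{n=1}^{N} \sum_{(s,a)} a\, \mu_n(s, a; t) \le B, \quad \forall t \in [T], \tag{4}$$

$$\sum_{a} \mu_n(s, a; t) = \sum_{(s',a')} \mu_n(s', a'; t-1)\, P_n(s \mid s', a'), \quad \forall n \in [N], \ s \in \mathcal{S}, \ t \in [T] \setminus \{1\}, \tag{5}$$

$$\sum_{a} \mu_n(s, a; 1) = \mathbf{s}_1(s), \quad \forall n \in [N], \ s \in \mathcal{S}, \tag{6}$$

where (4) is a restatement of the constraint in (2) for all $t \in [T]$, which indicates the activation budget; (5) represents the transition of the occupancy measure from time $t-1$ to time $t$, for all $n \in [N]$ and $t \in [T] \setminus \{1\}$; and (6) indicates the initial condition for the occupancy measure at time $1$, for all $s \in \mathcal{S}$, with $\mathbf{s}_1$ denoting the initial state distribution. From the constraints (5)-(6), it can be easily checked that the occupancy measure satisfies $\sum_{(s,a)} \mu_n(s, a; t) = 1$ for all $n \in [N]$ and $t \in [T]$. Thus, the occupancy measure is a probability measure.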
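As a small illustration of the flow and initial-condition constraints (5)-(6), the sketch below propagates the occupancy measure of a fixed Markov randomized policy forward in time for a single arm and confirms that it sums to one at every step. The function name `occupancy_measure`, the two-state, two-action arm, and every numerical value are illustrative assumptions rather than anything taken from the paper.

```python
def occupancy_measure(init_dist, trans, policy, horizon):
    """Forward recursion for one arm's occupancy measure.

    init_dist[s]      : initial state distribution (assumed given)
    trans[s][a][s2]   : probability of moving from s to s2 under action a
    policy[t][s][a]   : probability of choosing action a in state s at step t
    Returns mu[t][s][a], the probability of visiting (s, a) at step t.
    """
    num_states = len(init_dist)
    num_actions = len(trans[0])
    state_dist = list(init_dist)
    mu = []
    for t in range(horizon):
        # Probability of (s, a) = probability of s times probability of a given s.
        mu_t = [[state_dist[s] * policy[t][s][a] for a in range(num_actions)]
                for s in range(num_states)]
        mu.append(mu_t)
        # Push the mass through the transition kernel to get the next state law.
        state_dist = [
            sum(mu_t[s][a] * trans[s][a][s2]
                for s in range(num_states) for a in range(num_actions))
            for s2 in range(num_states)
        ]
    return mu


# Hypothetical two-state, two-action arm with a uniformly random policy.
init_dist = [1.0, 0.0]
trans = [
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.3, 0.7]],  # from state 1 under actions 0 and 1
]
policy = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]
mu = occupancy_measure(init_dist, trans, policy, horizon=3)
totals = [sum(sum(row) for row in mu_t) for mu_t in mu]
```

Each entry of `totals` should equal one up to floating-point error, which is the normalization property stated above.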

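An optimal policy for the relaxed problem can be obtained from the solution of this LP as follows [2]. Let $\mu^\star = \{\mu_n^\star(s, a; t)\}$ be a solution of the above LP. Construct the following Markovian non-stationary randomized policy $\chi^\star = \{\chi_n^\star(s, a; t)\}$: if the state of arm $n$ is $s$ at time $t$, then $\chi_n^\star$ chooses action $a$ with probability

$$\chi_n^\star(s, a; t) := \frac{\mu_n^\star(s, a; t)}{\sum_{a' \in \mathcal{A}} \mu_n^\star(s, a'; t)}. \tag{7}$$

If the denominator of (7) equals zero, i.e., state $s$ for arm $n$ is not reachable at time $t$, arm $n$ can simply be made passive, i.e., $\chi_n^\star(s, 0; t) = 1$ and $\chi_n^\star(s, a; t) = 0$ for all $a \neq 0$.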
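The normalization in (7), including the fallback for unreachable states, can be sketched in a few lines. This is a minimal illustration for one arm at one time step; the function name `policy_from_occupancy` and the sample occupancy values are hypothetical, and action index 0 is assumed to be the passive action.

```python
def policy_from_occupancy(mu_t):
    """Turn one arm's occupancy measure at a fixed time into action probabilities.

    mu_t[s][a] is the occupancy mass of (s, a); the result chi_t[s][a] is the
    probability of choosing action a in state s.
    """
    chi_t = []
    for row in mu_t:
        total = sum(row)
        if total > 0:
            chi_t.append([mass / total for mass in row])
        else:
            # Unreachable state: put all probability on the passive action 0.
            chi_t.append([1.0] + [0.0] * (len(row) - 1))
    return chi_t


# Hypothetical occupancy values for three states and two actions.
mu_t = [[0.3, 0.1], [0.0, 0.0], [0.15, 0.45]]
chi_t = policy_from_occupancy(mu_t)
```

In this example the middle state carries no occupancy mass, so it is mapped to the passive action, while the other two rows are rescaled to sum to one.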

3.2 The Occupancy-Measured-Reward Index Policy

The Markov policy constructed from the solution to the above LP forms the building block of our index policy for the original problem (1). Note that the policy (7) is not always feasible for the original problem, since in the latter at most $B$ units of activation cost can be consumed at any time, while (7) could spend more than $B$ units at any given time. To this end, our index policy assigns an index to each arm based on its current state and the current time. We denote by $\psi_n(s; t)$ the index associated with arm $n$ in state $s$ at time $t$,

$$\psi_n(s; t) := \sum_{a \in \mathcal{A} \setminus \{0\}} \chi_n^\star(s, a; t)\, \bar{r}_n(s, a), \tag{8}$$

where $\chi_n^\star(s, a; t)$ is defined in (7). We call this the occupancy-measured-reward index (OMR index) since it is based solely upon the optimal occupancy measure derived by solving the LP (3)-(6) and the mean reward, representing the expected reward obtained by arm $n$ in state $s$ at time $t$. Let $\psi(t) := \{\psi_n(s_n(t); t) : n \in [N]\}$ be the OMR indices associated with the $N$ arms at time $t$. Let $a_n(t)$ be the action for arm $n$ in its current state at time $t$, and let $\Omega(t)$ be the set of arms that are active at time $t$. Our index policy then activates arms in decreasing order of their OMR indices. The choice of $\Omega(t)$ satisfies the constraint $\sum_{n \in \Omega(t)} a_n(t) \le B$. The remaining arms are kept passive at time $t$. For each arm that has been chosen to be activated, the action applied to it is selected randomly according to the probability $\chi_n^\star$ in (7). When multiple arms share the same OMR index, a tie-breaking rule is needed. Our tie-breaking rule randomly activates one arm and allocates the remaining activation costs across all possible actions according to the probability $\chi_n^\star$. If it happens that the indices of all the arms are zero, then all the remaining arms are made passive. We call this the Occupancy-Measured-Reward Index Policy (OMR Index Policy), and denote it as $\pi^\star$, which is summarized in Algorithm 1.

Input: Initialize $\psi(t)$ and $\Omega(t)$ as empty sets

1:  Construct the LP according to (3)-(6) and solve for the occupancy measure $\mu^\star$;
2:  Compute $\chi^\star$ according to (7);
3:  Construct the index set $\psi(t)$ and sort it in decreasing order;
4:  while $\sum_{n \in \Omega(t)} a_n(t) \le B$ do
5:     Activate arms according to the order in step 3 and randomly generate a feasible action according to the distribution $\chi^\star$. Store the newly activated arm into $\Omega(t)$;
6:  end while
Algorithm 1 OMR Index Policy
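The activation step of the policy can be sketched for a single time slot as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the index is taken to be the expected reward over active actions under the action probabilities, the cost of action `a` is assumed to be `a` units, ties are broken by arm order rather than at random, and an arm's action is drawn only from the active actions that still fit in the remaining budget. The names `omr_index` and `omr_step` and all numbers are hypothetical.

```python
import random


def omr_index(chi_row, mean_reward_row):
    """Expected reward over active actions (a >= 1); action 0 is passive."""
    return sum(p * r
               for a, (p, r) in enumerate(zip(chi_row, mean_reward_row))
               if a > 0)


def omr_step(states, chi_t, mean_reward, budget, rng):
    """Choose one action per arm for the current time slot.

    states[n]            : current state of arm n
    chi_t[n][s][a]       : action probabilities of arm n at this time
    mean_reward[n][s][a] : mean reward of arm n
    budget               : activation budget for this slot
    """
    num_arms = len(states)
    indices = [omr_index(chi_t[n][states[n]], mean_reward[n][states[n]])
               for n in range(num_arms)]
    # Visit arms in decreasing order of index.
    order = sorted(range(num_arms), key=lambda n: indices[n], reverse=True)
    actions = [0] * num_arms
    remaining = budget
    for n in order:
        if indices[n] <= 0 or remaining <= 0:
            break  # every later arm has a zero index, or the budget is spent
        probs = chi_t[n][states[n]]
        # Active actions that are affordable and have positive probability.
        feasible = [a for a in range(1, len(probs))
                    if a <= remaining and probs[a] > 0]
        if not feasible:
            continue
        weights = [probs[a] for a in feasible]
        chosen = rng.choices(feasible, weights=weights)[0]
        actions[n] = chosen
        remaining -= chosen
    return actions


# Hypothetical instance: three arms, two states, actions {0, 1, 2}.
chi_t = [
    [[0.2, 0.5, 0.3], [1.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]],
    [[1.0, 0.0, 0.0], [0.1, 0.0, 0.9]],
]
mean_reward = [
    [[0.0, 0.4, 0.8], [0.0, 0.1, 0.2]],
    [[0.0, 0.6, 0.9], [0.0, 0.3, 0.5]],
    [[0.0, 0.2, 0.3], [0.0, 0.5, 1.0]],
]
actions = omr_step([0, 0, 0], chi_t, mean_reward, budget=3,
                   rng=random.Random(0))
```

In this instance the second arm has the largest index and is served first, the third arm has a zero index and stays passive, and the total cost never exceeds the budget.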
Remark 1

Our index policy is computationally appealing since it is based only on the “relaxed problem”, which is solved as an LP. Furthermore, if all arms share the same MDP, the LP can be decomposed across arms as in [48], and hence the computational complexity does not scale with the number of arms. Even more importantly, our index policy is well-defined even if the underlying MDPs are not indexable [48]. This is in contrast to most of the existing Whittle index-based policies, which are only well defined when the system is indexable, a condition that is hard to verify and may not hold in general. A line of work [45, 15, 52] has focused on designing index policies without the indexability requirement, and closest to our work is the parallel work on restless bandits [52] with known transition probabilities and reward functions. In particular, [52] explores index policies that are similar to ours, but under the assumption that the individual MDPs of the arms are homogeneous. They consider the binary action setup, and focus mainly on characterizing the asymptotic sub-optimality gap. Our index policy in this section can be seen as complementary, since it considers the general case of heterogeneous MDPs in which multiple actions are allowed for each arm. Finally, the reinforcement learning augmented index policy and the regret analysis in the next section also distinguish our work.
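A back-of-the-envelope count helps make the computational claim concrete. The sketch below compares the size of the joint state space that a dynamic programming solution of the original problem would face with the number of occupancy variables in the relaxed LP. Counting one LP variable per (arm, state, action, time) tuple is an assumption about how the LP is laid out, and the problem sizes are hypothetical.

```python
def joint_state_count(num_states, num_arms):
    """Number of joint states when every arm has num_states states."""
    return num_states ** num_arms


def lp_variable_count(num_states, num_actions, num_arms, horizon):
    """One occupancy variable per (arm, state, action, time) tuple."""
    return num_arms * num_states * num_actions * horizon


# Hypothetical sizes: 10 arms, 5 states, 3 actions, horizon 20.
joint_states = joint_state_count(5, 10)
lp_variables = lp_variable_count(5, 3, 10, 20)
print(joint_states)  # 9765625
print(lp_variables)  # 3000
```

The joint state space grows exponentially in the number of arms, whereas the LP variable count grows linearly.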

3.3 Asymptotic Optimality

With a slight abuse of notation, we let the number of arms be $\rho N$ and the value of the activation constraint be $\rho B$ in the limit with $\rho \to \infty$. In other words, this represents scenarios in which there are $N$ different classes of arms and each class contains $\rho$ arms. Our OMR Index Policy achieves asymptotic optimality when the number of arms $\rho N$ and the activation constraint $\rho B$ go to infinity while holding $B/N$ constant. (We consider asymptotic optimality in the same limit as Whittle [48] and others [47, 45, 15, 51].) Let $V_{\pi}(\rho B, \rho N)$ denote the expected reward of the original problem (1) obtained by an arbitrary policy $\pi$ in this limit. Denote the optimal policy of the original problem (1) as $\pi^{opt}$ and the OMR Index Policy as $\pi^\star$.

Theorem 1

The OMR Index Policy achieves asymptotic optimality as follows:

$$\lim_{\rho \to \infty} \frac{1}{\rho} \Big( V_{\pi^{opt}}(\rho B, \rho N) - V_{\pi^\star}(\rho B, \rho N) \Big) = 0.$$

Remark 2

Theorem 1 indicates that as the number of per-class arms (i.e., $\rho$) goes to infinity, the gap between the performance achieved by our OMR Index Policy and the optimal policy is bounded, and thus the per-arm gap tends to zero.

4 Reinforcement Learning for the OMR Index Policy

Computing the OMR Index Policy requires the knowledge of the controlled transition probabilities and the reward functions associated with the MDPs of each arm. Since these quantities are typically unknown, we propose a generative model based reinforcement learning (RL) augmented algorithm that learns this policy.

4.1 The Learning Problem

The setup is similar to the finite-horizon R(MA)^2B described earlier, in which each arm is associated with a controlled MDP . The only difference is that now the agent does not know the quantities . To judge the performance of the learning algorithm, we use the popular metric of learning regret [25]. Let be the average value of expected rewards, and denote the optimal average reward rate by . Note that the optimal average reward rate is independent of the initial state for MDPs that have a finite diameter [38].

The regret of a policy is defined as follows,


is the cumulative rewards collected when the system begins in state . Thus, regret measures the difference between the rewards collected by the learning policy, and the optimal stationary policy that could be implemented if the system parameters were known to the agent.

4.2 A Generative Model Based Learning Algorithm

Our proposed RL algorithm is based on the UCB strategy [3, 25], and also uses a generative model similar to [23]. We call our RL algorithm the R(MA)^2B-UCB policy, and depict it in Algorithm 2.

There are two phases in R(MA)^2B-UCB: (i) a planning phase, and (ii) a policy execution phase. The planning phase (lines 1-6 in Algorithm 2) constructs a confidence ball that contains a set of plausible MDPs for each of the arms. Specifically, we explore a generative approach with a single-step simulator that can generate samples of the next state and reward given any state and action [23, 12]. The algorithm then obtains an optimistic estimate of the true MDP parameters by solving an optimistic planning problem in which the agent can choose MDP parameters from the confidence ball. This problem can be posed as an LP in which the decision variables are the occupancy measures corresponding to the processes associated with the arms. We can then define the corresponding OMR Index Policy. The planning problem, referred to as an extended LP in Algorithm 2, is described below. Our key contribution here is to choose the right value of to balance the accuracy and complexity, which contributes to the sub-linear regret and low complexity of R(MA)^2B-UCB.

In the policy execution phase (line 7 in Algorithm 2), the derived OMR Index Policy is executed. Our key contribution here is to leverage our proposed OMR Index Policy, rather than using heuristic ones as in existing algorithms. Since our proposed OMR Index Policy is near-optimal, this guarantees that R(MA)^2B-UCB performs close to the offline optimum. Moreover, this contributes to the low multiplicative “pre-factor” that goes with the time-horizon dependent function in the regret. The pre-factor of our algorithm is exponentially better than that of the state-of-the-art colored-UCRL2.

Input: Learning horizon , and learning counts .

1:  for  and  do
2:     Sample pairs of arm for times.
3:  end for
4:  Construct and according to (4.2);
5:  Compute the optimal solution of the extended LP (4.2);
6:  Establish the corresponding OMR Index Policy ;
7:  Execute for the rest of the game.
Algorithm 2 R(MA)^2B-UCB Policy

Optimistic planning. We sample each state-action pair of arm for (the value of will be specified later) number of times uniformly across all state-action pairs. We denote the number of times that a transition tuple was observed within as satisfying

where represents the state for arm at time and is the corresponding action. Then R(MA)^2B-UCB estimates the true transition probability and the true reward by the corresponding empirical averages as


R(MA)^2B-UCB further defines confidence intervals for the transition probabilities (resp. the rewards), such that the true transition probabilities (resp. true rewards) lie in them with high probability. Formally, for , we define


where the size of the confidence intervals is built using the empirical Hoeffding inequality [32]. For any , and it is defined as


The set of plausible MDPs associated with the confidence intervals is . Then R(MA)^2B-UCB computes a policy by performing optimistic planning. Given the set of plausible MDPs, it selects an optimistic transition (resp. reward) function and an optimistic policy by solving a “modified LP”, which is similar to the LP defined in (3)-(6), but with the transition and reward functions replaced by and in the confidence balls (4.2) since the corresponding true values are not available. More precisely, R(MA)^2B-UCB finds an optimal solution to the following problem


The extended LP problem. The modified LP can be further expressed as an extended LP by leveraging the state-action-state occupancy measure defined as to express the confidence intervals of the transition probabilities. The extended LP over is as follows:


where the last two constraints indicate that the transition probabilities lie in the desired confidence interval for . Such an approach was also used in [18, 39] in the context of adversarial MDPs and in [10, 21, 12] for constrained MDPs. Once we compute from (4.2), the policy is recovered from the computed occupancy measures as


Finally, we compute the OMR index as in (8) using (13), from which we construct the OMR Index Policy, and execute this policy to the end.
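The estimation part of the planning phase described above (visit counts, empirical averages, and a confidence radius) can be sketched as follows. This is only an illustration under stated assumptions: the radius uses the textbook two-sided Hoeffding form for a quantity bounded in [0, 1], which is not necessarily the exact expression used in the paper, and the names `estimate_model` and `hoeffding_radius`, the sample format, and the data are hypothetical.

```python
import math
from collections import defaultdict


def estimate_model(samples, num_states):
    """Empirical transition and reward estimates for one arm.

    samples is a list of (state, action, reward, next_state) tuples drawn
    from a generative model (an assumed input format).
    """
    visits = defaultdict(int)
    transitions = defaultdict(int)
    reward_sums = defaultdict(float)
    for s, a, r, s_next in samples:
        visits[(s, a)] += 1
        transitions[(s, a, s_next)] += 1
        reward_sums[(s, a)] += r
    p_hat, r_hat = {}, {}
    for (s, a), count in visits.items():
        # Empirical frequency of each next state and empirical mean reward.
        p_hat[(s, a)] = [transitions[(s, a, s2)] / count
                         for s2 in range(num_states)]
        r_hat[(s, a)] = reward_sums[(s, a)] / count
    return p_hat, r_hat, dict(visits)


def hoeffding_radius(num_samples, failure_prob):
    """Textbook Hoeffding half-width for a mean of values in [0, 1]."""
    return math.sqrt(math.log(2.0 / failure_prob) / (2.0 * num_samples))


# Hypothetical samples for the pair (state 0, action 1) of one arm.
samples = [(0, 1, 1.0, 1), (0, 1, 0.0, 0), (0, 1, 1.0, 1), (0, 1, 1.0, 1)]
p_hat, r_hat, visits = estimate_model(samples, num_states=2)
radius = hoeffding_radius(visits[(0, 1)], failure_prob=0.05)
```

The interval around each empirical estimate shrinks at rate one over the square root of the sample count, which is why the number of samples per state-action pair governs the trade-off between estimation accuracy and sampling cost.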

Remark 3

Although R(MA)^2B-UCB looks similar to an “explore-then-commit” policy [36], a key novelty of R(MA)^2B-UCB lies in using the approach of optimism-in-the-face-of-uncertainty [17, 33] to balance exploration and exploitation in a non-episodic offline manner. As a result, there is no need for R(MA)^2B-UCB to episodically search for a new MDP instance within the confidence ball with a higher reward as in [36, 46], which is computationally expensive (i.e., exponential in the number of arms). The second key novelty is that R(MA)^2B-UCB relies only on samples initially obtained by a generative model to construct an upper-confidence ball, from which a policy can be derived by solving an extended LP just once, with a computational complexity of (which is if all arms are identical). In contrast, existing algorithms, e.g., colored-UCRL2, are computationally expensive as they rely upon a complex recursive Bellman equation in order to derive the policy. Finally, R(MA)^2B-UCB uses the structure of our proposed near-optimal index policy in the policy execution phase rather than a heuristic one as in existing algorithms, e.g., [28, 43, 44, 29]. These key features ensure that R(MA)^2B-UCB achieves almost the same performance as the offline optimum, with a sub-linear regret at a low computational expense.

4.3 Regret Bound

We present our main theoretical results in this section.

Theorem 2

The regret of the R(MA)^2B-UCB policy with satisfies:


Since there are two phases in R(MA)^2B-UCB, we decompose the regret as , where is the regret for the planning phase and is the regret for the policy execution phase with . The first term in (14) is the worst-case regret from exploring each state-action pair under the generative model with time steps for sampling and at most arms being activated each time. The second term comes from the policy execution phase. Specifically, regret occurs when the explorations of each state-action pair construct a set of plausible MDPs that does not contain the true MDP in line 4 of Algorithm 2, which is a rare event with probability . The key then is to characterize the regret on the event that the true MDP lies in the set of plausible MDPs. Based on the optimism of plausible MDPs, the optimal average reward for the optimistic MDP is no less than . Therefore the expected regret can be bounded by which is directly related to the occupancy measure we defined.

Remark 4

Though R(MA)^2B-UCB is an offline non-episodic algorithm, it still achieves an regret no worse than the episodic colored-UCRL2. Note that for colored-UCRL2, the regret bound is instance-dependent due to its online episodic manner, such that the regret bound tends to be logarithmic in the horizon as well. However, R(MA)^2B-UCB adopts an explore-then-commit mechanism which uses generative model based sampling and constructs the sets of plausible MDPs only once. This removes the instance-dependent regret with order of . Though the state-of-the-art Restless-UCB [46] has a similar mechanism to ours in obtaining samples in an offline manner, it lowers its implementation complexity by sacrificing the regret performance to since it heavily depends on the performance of an offline oracle approximator for policy execution. Instead, we leverage our proposed provably optimal and computationally appealing index policy for the policy execution phase. This also contributes to the low multiplicative “pre-factor” in the regret.

5 Experiments

We now present our experimental results that validate our model and theoretical results. These verify the asymptotic optimality of the OMR Index Policy, and the sub-linear regret of the R(MA)^2B-UCB policy. In particular, we evaluate the R(MA)^2B-UCB policy under two real-world applications of restless bandit problems, namely “a deadline scheduling problem” where each arm has binary actions, and “dynamic video streaming over fading wireless channel” where each arm has multiple actions, using real video traces.

5.1 Evaluation of the OMR Index Policy

Binary actions: Since most existing index policies are designed only for the conventional binary action settings in which arms are chosen to be either active or passive, and cannot be applied to the multi-action setting that is considered in our paper, we first consider a controlled Markov process in which there are two actions for each arm, and the states evolve as a specific birth-and-death process where state can only transit to or . We compare OMR Index Policy with the following popular state-of-the-art index policies: Whittle index policy [48], the Fluid-priority policy of [52], and a priority based policy proposed in [45]. We consider a setup with 10 classes of arms, in which each arm has a state space . The arrival rates are set as with a departure rate . The controlled transition probabilities satisfy and . When a class- arm is activated, it receives a random reward that is a Bernoulli random variable with a state dependent rate , i.e., where uniformly distributed in . If the arm is not activated then no reward is received. The time horizon is set to and the activation ratio is set to . For the ease of exposition, we let the number of arms vary from to

Figure 1: Accumulated reward: binary action setting.
Figure 2: Average optimality gap: binary action setting.

The cumulative rewards collected by these policies are presented in Figure 1. We observe that the OMR Index Policy performs slightly better than the Fluid-priority policy. We conjecture that this is because the OMR Index Policy prioritizes the arms directly based on their contributions to the cumulative reward, while the Fluid-priority policy does not differentiate arms in the same priority category. More importantly, both the OMR Index Policy and the Fluid-priority policy significantly outperform the Whittle index policy.

We further validate the asymptotic optimality of OMR Index Policy (see Theorem 1). In particular, we compare the rewards obtained by OMR Index Policy and the two baselines with the theoretical upper bound obtained by solving the LP in (3)-(6). We call this difference the optimality gap. The average optimality gap, i.e., the ratio between the optimality gap and the number of arms, is illustrated for the different policies in Figure 2. Again, we observe that OMR Index Policy slightly outperforms the Fluid-priority policy in terms of the vanishing speed of the average optimality gap, since OMR Index Policy achieves a higher accumulated reward, as shown in Figure 1. Moreover, both OMR Index Policy and the Fluid-priority policy significantly outperform the Whittle index policy. This is due to the fact that the optimality gap of the Fluid-priority index policy (i.e. a constant ) does not scale with the number of arms, while that of the Whittle index policy does [52].
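The average optimality gap described above can be computed as in the short Python sketch below. The function and argument names are hypothetical; `lp_upper_bound` stands for the value of the LP relaxation and `policy_reward` for the accumulated reward of the policy being evaluated.

```python
def average_optimality_gap(lp_upper_bound, policy_reward, num_arms):
    """Optimality gap (upper bound minus achieved reward) divided by the number of arms."""
    if num_arms <= 0:
        raise ValueError("num_arms must be positive")
    return (lp_upper_bound - policy_reward) / num_arms
```

Asymptotic optimality corresponds to this quantity shrinking toward zero as the number of arms grows with the activation ratio held fixed.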

Figure 3: Optimality gap: multi-action setting.

Multiple actions: We further evaluate our index policy for the general multi-action setup, and consider a more general Markov process in which any two arbitrary states can communicate with each other. The controlled transition probabilities are generated randomly. For the ease of exposition, we consider the number of actions for each arm to assume values from the set . Our results and observations hold for other numbers of arms. Note that most existing index policies, including the two state-of-the-art index policies considered above, are designed only for the conventional binary action setup and cannot be applied to the multi-action setting considered in this paper. We therefore compare OMR Index Policy with the “greedy policy,” which at each time selects the actions that yield the maximum reward. Note that the choice of action depends on the current states, since the rewards depend on the state values. The performance in terms of optimality gap is shown in Figure 3. Firstly, we observe that the optimality gap slightly increases as the number of available actions increases while the number of arms is kept fixed. The impact of this marginal increase vanishes as the number of arms increases. Similar to the observations made in the case of binary actions, this indicates the asymptotic optimality of our proposed OMR Index Policy. Secondly, OMR Index Policy significantly outperforms the greedy policy, whose optimality gap increases with the number of arms.
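As a rough illustration of the greedy baseline, the Python sketch below picks, for every arm, the action with the largest immediate reward in its current state. The text does not say how the baseline enforces the activation budget, so the handling here is an assumption: action 0 is taken to be passive with zero cost, each action has a hypothetical activation cost `costs[a]`, and arms are served in decreasing order of their best immediate reward until the budget runs out.

```python
def greedy_actions(states, rewards, costs, budget):
    """Myopic action selection under an activation budget (illustrative only).

    states:  current state of each arm
    rewards: rewards[n][s][a] is the immediate reward of arm n in state s under action a
    costs:   costs[a] is the assumed activation cost of action a (costs[0] == 0, passive)
    budget:  total activation budget available in this time slot
    """
    candidates = []
    for n, s in enumerate(states):
        # Best action for this arm judged only by its immediate reward.
        best_a = max(range(len(costs)), key=lambda a: rewards[n][s][a])
        candidates.append((rewards[n][s][best_a], n, best_a))

    actions = [0] * len(states)  # every arm is passive by default
    remaining = budget
    # Serve arms with the largest immediate reward first (assumed tie-break: arm index).
    for _, n, a in sorted(candidates, key=lambda c: -c[0]):
        if costs[a] <= remaining:
            actions[n] = a
            remaining -= costs[a]
    return actions
```

Because the rule ignores how an action changes future states, it is a natural myopic reference point for an index policy that accounts for state evolution.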

Figure 4: Comparison of accumulated regret: binary action setting.
Figure 5: Comparison of average running time: binary action setting.

5.2 Evaluation of the R(MA)^2B-UCB Policy

Binary actions: We then evaluate the performance of R(MA)^2B-UCB. We compare it with two state-of-the-art algorithms for restless bandits: Restless-UCB [46] and a Thompson sampling (TS) based policy [20]. Note that Restless-UCB is also an offline learning policy similar to ours, while the TS-based policy is an online policy that has a sub-linear regret in the Bayesian setup but suffers from a high computational complexity. Colored-UCRL2 [36] is a popular algorithm for RMAB problems, but it is well known that its computational complexity grows exponentially with the number of arms. Furthermore, it has been shown in [46] that Restless-UCB outperforms colored-UCRL2, and hence we do not include colored-UCRL2 in our experiments.

We use the same settings as described above for evaluating our index policy and, for simplicity, choose the number of arms to be , though the results for a larger number of arms would be similar. For the TS-based policy, we set the prior distribution to be uniform over a finite support. Regrets of these algorithms are shown in Figure 4, in which we use Monte Carlo simulation with independent trials. R(MA)^2B-UCB achieves the lowest cumulative regret, which can be partly explained by the near-optimality of our index policy (see Remark 3). Restless-UCB sacrifices regret performance for a lower computational complexity, and hence performs worse than the online TS-based policy. When the number of samples is sufficiently large (i.e., is large), R(MA)^2B-UCB achieves a near-optimal performance.
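A regret curve averaged over independent Monte Carlo trials can be assembled as in the following Python sketch. It assumes that regret at each step is the per-step reward of a reference (offline optimum) policy minus the reward actually collected; the function names and the list-based data layout are hypothetical.

```python
def cumulative_regret(optimal_rewards, policy_rewards):
    """Running sum of per-step reward shortfalls relative to the reference policy."""
    regret, total = [], 0.0
    for opt, got in zip(optimal_rewards, policy_rewards):
        total += opt - got
        regret.append(total)
    return regret

def mean_regret_over_trials(optimal_rewards, trials):
    """Average the cumulative regret curves of several independent trials."""
    curves = [cumulative_regret(optimal_rewards, t) for t in trials]
    horizon = len(optimal_rewards)
    return [sum(c[i] for c in curves) / len(curves) for i in range(horizon)]
```

A sub-linear regret shows up as a curve whose slope flattens over time, or equivalently as the ratio of cumulative regret to elapsed time decreasing.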

Figure 6: Comparison of accumulated regret: multi-action setting.

We also compare the average run time of the different algorithms. In this experiment, the horizon is . The results are presented in Figure 5 and are averaged over Monte Carlo runs of a single-threaded program on an Intel Core i5-6400 desktop with 16 GB RAM. It is clear that R(MA)^2B-UCB is more efficient in terms of run time. For example, R(MA)^2B-UCB reduces the run time by up to (resp. ) as compared with Restless-UCB (resp. the TS-based policy) when there are arms, and reduces the corresponding run time by up to (resp. ) when there are arms. The improvement over colored-UCRL2 is even more significant when the number of arms is larger, since the time complexity of colored-UCRL2 grows exponentially with the number of arms; hence we omit that comparison here. A significant part of the improvement comes from the intrinsic design of our policy, which only needs to solve an LP once, while Restless-UCB needs a computation-intensive numerical approximation of the Oracle (e.g., Whittle index policy) and the TS-based policy is an online episodic algorithm that solves a Bellman equation for every episode.
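An average run time over repeated single-threaded runs can be measured with a small harness such as the Python sketch below. This is only an illustration of the measurement procedure; `run_once` is a hypothetical zero-argument callable that executes one full run of the algorithm under test.

```python
import time

def average_run_time(run_once, num_runs):
    """Mean wall-clock time in seconds of num_runs sequential calls to run_once."""
    if num_runs <= 0:
        raise ValueError("num_runs must be positive")
    start = time.perf_counter()
    for _ in range(num_runs):
        run_once()
    return (time.perf_counter() - start) / num_runs
```

Timing the whole batch and dividing by the number of runs avoids per-call timer overhead when individual runs are short.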

Multiple actions: We further evaluate R(MA)^2B-UCB under multi-action settings by considering a more general Markov process in which any two arbitrary states may communicate with each other and the transition probability matrices are randomly generated. The other settings remain the same as in the index policy evaluation. For the ease of exposition, we consider the number of actions to be and . Figure 6 shows the accumulated regret vs. time for R(MA)^2B-UCB under different numbers of actions. Since the Restless-UCB and TS-based policies are hard to extend to the multi-action setting, we do not consider them in this comparison. From Figure 6, we observe that R(MA)^2B-UCB achieves regret under multi-action settings, which validates our theoretical contributions in the paper (see Theorem 2). Furthermore, when the number of actions increases, it takes a larger number of time steps for the accumulated regret to converge. In other words, the planning phase in R(MA)^2B-UCB (see Algorithm 2) takes a longer time to learn the system parameters.

Case Study: A Deadline Scheduling Problem. We consider the deadline scheduling problem [50] for the scheduling of electric vehicle charging stations. A charging station (agent) has total charging spots (arms) and can charge vehicles in each round. The charging station obtains a reward for each unit of electricity that it provides to a vehicle and receives a penalty (negative reward) when a vehicle is not fully charged. The goal of the station is to maximize its net reward. We use exactly the same setting as in [50] for our experiment. More specifically, the state of an arm is denoted by a pair of integers , where is the amount of electricity that the vehicle still needs and is the time until the vehicle leaves the station. When a charging spot is available, its state is . and are upper-bounded by and , respectively. Hence, the size of state space is for each arm. The reward received by agent from arm is as follows,

where means being passive and means being active. The state transition satisfies

where is a random state when a new vehicle arrives at the charging spot . There are total charging spots and a maximum can be served at each time.

Figure 7: Comparison of accumulated regret in the deadline scheduling problem.

We compare the learning performance of our R(MA)^2B-UCB with the two state-of-the-art algorithms, i.e., Restless-UCB and a Thompson sampling (TS)-based policy, for this deadline scheduling problem; the results are shown in Figure 7. We observe that all three policies achieve sub-linear regrets, which is consistent with their theoretical performance. Our R(MA)^2B-UCB performs better than the other two state-of-the-art algorithms. Note that the TS-based policy has a lower cumulative regret than the other two policies when the number of time steps is small. This is because the TS-based policy is an episodic algorithm that improves the policy episode-by-episode, while R(MA)^2B-UCB and Restless-UCB run according to a fully random policy during the exploration phase.

Case Study: Dynamic Video Streaming Over Fading Wireless Channel. We consider the adaptive video streaming problem, where multiple users compete for network resources in order to deliver video packets over unreliable wireless channels. This problem can be cast as a restless multi-armed bandit problem that has multiple actions [42]. In particular, an access point connected to a server dynamically controls (i) the quality of video chunks that are delivered, and (ii) the channel resources, e.g., transmission power, that are allocated to users. These decisions are dynamic, i.e., based on each user’s current state, which in this case turns out to be the same as the remaining playtime. The goal is to maximize the total expected quality of experience (QoE).

The state of user at time is defined as with being the remaining play time of video chunks in its buffer, and being the quality of last successfully received video chunk before time . Specifically, represents the occurrence of a rebuffering event. The action for user at time determined by the access point is denoted as , where is the quality of the video chunk, and is the network resources allocated. The buffer length remains the same when one chunk is successfully transmitted to the user, otherwise the buffer length decreases by seconds. Therefore, the transition probability of the MDP associated with user is expressed as follows,

when a passive action is selected. When action is chosen for transmitting a video chunk to user at , we have

Note that if , user suffers from a rebuffering event, and the length increases by seconds if one chunk is successfully transmitted to user . The instantaneous reward received by user at time is defined by the QoE function as follows,

We evaluate the performance of the R(MA)^2B-UCB policy for adaptive video streaming over wireless networks using real video traces [26]. All videos are encoded into multiple chunks, with each chunk having a playtime of one second. Each video consists of three resolutions: p, p and p, from which we abstract the bitrate levels as . We consider users, and the total network resource is Mbps with . Denote each resource level in as index and , respectively. Therefore, in total there are different actions and the successful transmission probabilities under this trace are then calculated as follows, , , , , , , , , ,

Figure 8: Comparison of average QoE in the wireless video streaming problem.
Figure 9: Comparison of accumulated regret in the wireless video streaming problem.

Since the Restless-UCB and TS-based policies cannot be directly extended to multi-action settings, we compare the learning performance of our R(MA)^2B-UCB with two well-known heuristic algorithms, i.e., Greedy and Vanilla. In particular, Vanilla is a base case in which served users are allocated the highest resources with no differentiation between users, and Greedy is the case where each user greedily selects the action with the largest reward for its current state. The average QoE achieved by these policies is shown in Figure 8. We observe that R(MA)^2B-UCB significantly outperforms the two heuristic algorithms, achieving the highest average QoE. We further evaluate the corresponding learning regret, as shown in Figure 9. Since Greedy significantly outperforms Vanilla in average QoE, we do not include Vanilla in this comparison. It is clear that R(MA)^2B-UCB achieves a regret while Greedy achieves nearly linear regret as grows large.

6 Conclusion

In this paper, we studied an important extension of the popular restless multi-armed bandit problem that allows for choosing from multiple actions for each arm, which we denote by R(MA)^2B. We first proposed a computationally feasible index policy, dubbed OMR Index Policy, and showed that it is asymptotically optimal. Since the system parameters are often unavailable in practice, we then developed a learning algorithm that learns the index policy. We combined a generative approach to reinforcement learning with the UCB strategy to obtain the R(MA)^2B-UCB algorithm. It enjoys a low learning regret since it can fully exploit the structure of the proposed OMR Index Policy. We also showed that R(MA)^2B-UCB achieves a sub-linear regret. Our experimental results further showed that R(MA)^2B-UCB outperforms other state-of-the-art algorithms.


  • [1] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari (2009) Optimality of Myopic Sensing in Multichannel Opportunistic Access. IEEE Transactions on Information Theory 55 (9), pp. 4040–4050. Cited by: §1.
  • [2] E. Altman (1999) Constrained Markov Decision Processes. Vol. 7, CRC Press. Cited by: §3.1, §3.1, §3.
  • [3] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 (2), pp. 235–256. Cited by: §1, §4.2.
  • [4] R. Bellman (2010) Dynamic Programming. Princeton University Press, USA. External Links: ISBN 0691146683 Cited by: §2.
  • [5] D. P. Bertsekas (1995) Dynamic Programming and Optimal Control. Vol. 1, Athena Scientific Belmont, MA. Cited by: §2.
  • [6] D. Bertsimas and J. Niño-Mora (2000) Restless Bandits, Linear Programming Relaxations, and A Primal-Dual Index Heuristic. Operations Research 48 (1), pp. 80–90. Cited by: §1.
  • [7] D. B. Brown and J. E. Smith (2020) Index Policies and Performance Bounds for Dynamic Selection Problems. Management Science 66 (7), pp. 3029–3050. Cited by: §1.
  • [8] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao (2011) The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Regret. In Proc. of IEEE ICASSP, Cited by: §1, §1.
  • [9] S. Deo, S. Iravani, T. Jiang, K. Smilowitz, and S. Samuelson (2013) Improving Health Outcomes Through Better Capacity Allocation in A Community-based Chronic Care Model. Operations Research 61 (6), pp. 1277–1294. Cited by: §1.
  • [10] Y. Efroni, S. Mannor, and M. Pirotta (2020) Exploration-Exploitation in Constrained MDPs. arXiv preprint arXiv:2003.02189. Cited by: §2, §4.2.
  • [11] J. Gittins (1974) A Dynamic Allocation Index for the Sequential Design of Experiments. Progress in Statistics, pp. 241–266. Cited by: §1.
  • [12] A. HasanzadeZonuzy, D. Kalathil, and S. Shakkottai (2021) Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs. In Proc. of AAAI, Cited by: §4.2, §4.2.
  • [13] O. Hernández-Lerma and J. B. Lasserre (2012) Further Topics on Discrete-Time Markov Control Processes. Vol. 42, Springer Science & Business Media. Cited by: §2.
  • [14] W. Hoeffding (1994) Probability Inequalities for Sums of Bounded Random Variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Cited by: Proof 3.
  • [15] W. Hu and P. Frazier (2017) An Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits. arXiv preprint arXiv:1707.00205. Cited by: §1, Remark 1, footnote 1.
  • [16] P. Jacko (2010) Restless Bandits Approach to the Job Scheduling Problem and Its Extensions. Modern Trends in Controlled Stochastic Processes: Theory and Applications, pp. 248–267. Cited by: §1.
  • [17] T. Jaksch, R. Ortner, and P. Auer (2010) Near-Optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research 11 (4). Cited by: Remark 3.
  • [18] C. Jin, T. Jin, H. Luo, S. Sra, and T. Yu (2019) Learning Adversarial MDPs with Bandit Feedback and Unknown Transition. arXiv preprint arXiv:1912.01192. Cited by: §4.2.
  • [19] Y. H. Jung, M. Abeille, and A. Tewari (2019) Thompson Sampling in Non-Episodic Restless Bandits. arXiv preprint arXiv:1910.05654. Cited by: §1, §1.
  • [20] Y. H. Jung and A. Tewari (2019) Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems. Proc. of NeurIPS. Cited by: §1, §1, §5.2.
  • [21] K. C. Kalagarla, R. Jain, and P. Nuzzo (2021) A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints. In Proc. of AAAI, Cited by: §4.2.
  • [22] L. Kallenberg (2003) Finite State and Action MDPs. In Handbook of Markov Decision Processes, pp. 21–87. Cited by: §2.
  • [23] M. Kearns, Y. Mansour, and A. Y. Ng (2002) A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning 49 (2), pp. 193–208. Cited by: §4.2, §4.2.
  • [24] J. A. Killian, A. Perrault, and M. Tambe (2021) Beyond “To Act or Not to Act”: Fast Lagrangian Approaches to General Multi-Action Restless Bandits. In Proc. of AAMAS, Cited by: §1.
  • [25] T. Lattimore and C. Szepesvári (2020) Bandit Algorithms. Cambridge University Press. Cited by: §1, §4.1, §4.2.
  • [26] S. Lederer, C. Müller, and C. Timmerer (2012) Dynamic Adaptive Streaming Over HTTP Dataset. In Proc. of ACM MMSys, Cited by: §5.2.
  • [27] E. Lee, M. S. Lavieri, and M. Volk (2019) Optimal Screening for Hepatocellular Carcinoma: A Restless Bandit Model. Manufacturing & Service Operations Management 21 (1), pp. 198–212. Cited by: §1.
  • [28] H. Liu, K. Liu, and Q. Zhao (2011) Logarithmic Weak Regret of Non-Bayesian Restless Multi-Armed Bandit. In Proc. of IEEE ICASSP, Cited by: §1, §1, §1, Remark 3.
  • [29] H. Liu, K. Liu, and Q. Zhao (2012) Learning in A Changing World: Restless Multi-Armed Bandit with Unknown Dynamics. IEEE Transactions on Information Theory 59 (3), pp. 1902–1916. Cited by: §1, §1, §1, Remark 3.
  • [30] A. Mahajan and D. Teneketzis (2008) Multi-Armed Bandit Problems. In Foundations and Applications of Sensor Management, pp. 121–151. Cited by: §1.
  • [31] A. Mate, A. Perrault, and M. Tambe (2021) Risk-Aware Interventions in Public Health: Planning with Restless Multi-Armed Bandits. In Proc. of AAMAS, Cited by: §1.
  • [32] A. Maurer and M. Pontil (2009) Empirical Bernstein Bounds and Sample Variance Penalization. arXiv preprint arXiv:0907.3740. Cited by: §4.2.
  • [33] A. Mete, R. Singh, X. Liu, and P. Kumar (2021) Reward Biased Maximum Likelihood Estimation for Reinforcement Learning. In Proc. of L4DC, Cited by: Remark 3.
  • [34] J. Nino-Mora (2001) Restless Bandits, Partial Conservation Laws and Indexability. Advances in Applied Probability, pp. 76–98. Cited by: §1.
  • [35] J. Niño-Mora (2007) Dynamic Priority Allocation via Restless Bandit Marginal Productivity Indices. Top 15 (2), pp. 161–198. Cited by: §1.
  • [36] R. Ortner, D. Ryabko, P. Auer, and R. Munos (2012) Regret Bounds for Restless Markov Bandits. In Proc. of Algorithmic Learning Theory, Cited by: §1, §1, §5.2, Remark 3.
  • [37] C. H. Papadimitriou and J. N. Tsitsiklis (1994) The Complexity of Optimal Queueing Network Control. In Proc. of IEEE Conference on Structure in Complexity Theory, Cited by: §1.
  • [38] M. L. Puterman (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons. Cited by: §1, §4.1.
  • [39] A. Rosenberg and Y. Mansour (2019) Online Convex Optimization in Adversarial Markov Decision Processes. In Proc. of ICML, Cited by: §4.2.
  • [40] S. Sheng, M. Liu, and R. Saigal (2014) Data-Driven Channel Modeling Using Spectrum Measurement. IEEE Transactions on Mobile Computing 14 (9), pp. 1794–1805. Cited by: §1.
  • [41] A. N. Shiryaev (2007) Optimal Stopping Rules. Vol. 8, Springer Science & Business Media. Cited by: §2.
  • [42] R. Singh and P. Kumar (2015) Optimizing Quality of Experience of Dynamic Video Streaming over Fading Wireless Networks. In Proc. of IEEE CDC, Cited by: §5.2.
  • [43] C. Tekin and M. Liu (2011) Adaptive Learning of Uncontrolled Restless Bandits with Logarithmic Regret. In Proc. of Allerton, Cited by: §1, §1, §1, Remark 3.
  • [44] C. Tekin and M. Liu (2012) Online Learning of Rested and Restless Bandits. IEEE Transactions on Information Theory 58 (8), pp. 5588–5611. Cited by: §1, §1, §1, Remark 3.
  • [45] I. M. Verloop (2016) Asymptotically Optimal Priority Policies for Indexable and Nonindexable Restless Bandits. The Annals of Applied Probability 26 (4), pp. 1947–1995. Cited by: §1, §5.1, Remark 1, footnote 1.
  • [46] S. Wang, L. Huang, and J. Lui (2020) Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits. In Proc. of NeurIPS, Cited by: §1, §1, §5.2, Remark 3, Remark 4.
  • [47] R. R. Weber and G. Weiss (1990) On An Index Policy for Restless Bandits. Journal of Applied Probability, pp. 637–648. Cited by: §1, footnote 1.
  • [48] P. Whittle (1988) Restless Bandits: Activity Allocation in A Changing World. Journal of Applied Probability, pp. 287–298. Cited by: §1, §1, §3, §5.1, Remark 1, footnote 1.
  • [49] G. Xiong, R. Singh, and J. Li (2021) Learning Augmented Index Policy for Optimal Service Placement at the Network Edge. arXiv preprint arXiv:2101.03641. Cited by: §1, §1.
  • [50] Z. Yu, Y. Xu, and L. Tong (2018) Deadline Scheduling as Restless Bandits. IEEE Transactions on Automatic Control 63 (8), pp. 2343–2358. Cited by: §5.2.
  • [51] G. Zayas-Cabán, S. Jasin, and G. Wang (2019) An Asymptotically Optimal Heuristic for General Nonstationary Finite-Horizon Restless Multi-Armed, Multi-Action Bandits. Advances in Applied Probability 51 (3), pp. 745–772. Cited by: §1, footnote 1.
  • [52] X. Zhang and P. I. Frazier (2021) Restless Bandits with Many Arms: Beating the Central Limit Theorem. arXiv preprint arXiv:2107.11911. Cited by: §5.1, §5.1, Remark 1.

Appendix A Proof of Theorem 1

To prove Theorem 1, we first introduce some auxiliary notations. Let be the number of class- arms at state at time and be the number of class- arms at state at time that are being activated by action . In the following, we show that when the number of each class of arms goes to infinity, the ratios and converge.

Lemma 1

For and , we have

Proof 1

We prove the above equations by induction. When , denote the initial state of each arm as , and we have

Meanwhile, denote , and we have

Now we assume that the equations hold at time . Then we need to show that these conditions also hold for time

We first show that this is true for the first equation in Lemma 1. Denote as the number of class- arms which are activated under the policy and transit from state at time to state at time , and as the number of class- arms which are kept passive under the policy and transit from state