Near-optimality for infinite-horizon restless bandits with many arms

by Xiangyu Zhang, et al.
Cornell University

Restless bandits are an important class of problems with applications in recommender systems, active learning, revenue management and other areas. We consider infinite-horizon discounted restless bandits with many arms where a fixed proportion of arms may be pulled in each period and where arms share a finite state space. Although an average-case-optimal policy can be computed via stochastic dynamic programming, the computation required grows exponentially with the number of arms N. Thus, it is important to find scalable policies that can be computed efficiently for large N and that are near optimal in this regime, in the sense that the optimality gap (i.e., the loss of expected performance against an optimal policy) per arm vanishes for large N. However, the most popular approach, the Whittle index, requires a hard-to-verify indexability condition to be well-defined and another hard-to-verify condition to guarantee an o(N) optimality gap. We present a method resolving these difficulties. By replacing the global Lagrange multiplier used by the Whittle index with a sequence of Lagrange multipliers, one per time period up to a finite truncation point, we derive a class of policies, called fluid-balance policies, that have an O(√N) optimality gap. Unlike the Whittle index, fluid-balance policies do not require indexability to be well-defined and their O(√N) optimality gap bound holds universally, without additional sufficient conditions. We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.






1 Introduction

We study a stochastic control problem called the infinite-horizon restless bandit. In this problem, a decision maker is responsible for managing $N$ Markov decision processes (called “arms”) whose states are fully observed and belong to a common finite state space. For each arm in each time period, the decision maker can either activate the arm (also called “pulling” the arm) or idle it. Each arm then generates a random reward whose probability distribution depends in a known way on the action taken and the arm’s current state. A known transition kernel depending on the action and the arm’s current state then determines the probability distribution over the arm’s state in the next time period. When making decisions, the decision maker must respect a “budget” constraint that limits the number of arms that can be activated in each period. The objective is to maximize the expected total discounted reward over an infinite time horizon.

The infinite-horizon restless bandit problem was first formulated by Whittle (1980) and has since attracted much theoretical and practical interest. Many real-world decision-making problems are naturally formulated as restless bandits, with applications arising in network communication (Liu and Zhao 2008), unmanned aerial vehicles tracking (Le Ny et al. 2006), revenue management (Brown and Smith 2020) and active learning (Chen et al. 2013).

In principle, the problem can be solved by value iteration or other standard methods for maximizing the infinite-horizon expected total discounted reward in a stochastic dynamic program (Powell 2007). Unfortunately, the dimension of the collective description of arms’ states grows linearly with $N$. Thus, the computation required grows exponentially with respect to $N$ due to the “curse of dimensionality” (Powell 2007).

Because optimality appears unachievable by computationally tractable algorithms, theoretical analysis of restless bandits has focused on asymptotic optimality in the regime where the budget constraint grows proportionally with $N$. First defined by Whittle (1980), an asymptotically optimal policy is one whose optimality gap (the difference between an optimal policy’s expected performance and that of the given policy; briefly, opt gap) divided by the number of arms vanishes as the number of arms grows.

Asymptotic optimality has been hard to guarantee. The most popular approach to restless bandits, the Whittle index, was conjectured to be asymptotically optimal by Whittle (1980). However, Weber and Weiss (1990) show that this is false: there are problems where the Whittle index fails to be asymptotically optimal. That work also shows that the Whittle index is asymptotically optimal, but only if a certain hard-to-verify sufficient condition is met: that a differential equation characterizing the dynamics of the Whittle index in a certain fluid limit has a globally stable equilibrium. Moreover, for the Whittle index to be well defined, the problem must satisfy a so-called “indexability” condition, which may not be met and is hard to verify in practice. Another popular approach is simulation-based (Meshram and Kaza 2020, Nakhleh et al. 2021). However, the simulation-based methods from Meshram and Kaza (2020) and Nakhleh et al. (2021) do not provide theoretical guarantees on performance.

Although the existing infinite-horizon restless bandit policies of which we are aware suffer from difficulty in guaranteeing asymptotic optimality and, in the case of the Whittle index, challenges in establishing indexability and coping with its absence, some recent work shows that life is much easier for finite-horizon restless bandits.

For example, for finite-horizon restless bandits in the same asymptotic regime where budgets grow proportionally with $N$, Hu and Frazier (2017) propose a policy with an $o(N)$ opt gap, which is thus asymptotically optimal. Later, Brown and Smith (2020) propose policies with stronger $O(\sqrt{N})$ opt gaps. Moreover, Zhang and Frazier (2021) propose a class of policies with at most $O(\sqrt{N})$ opt gaps and, surprisingly, an $O(1)$ opt gap if a non-degeneracy condition is met. These rates for opt gaps hold universally, unlike the hard-to-verify conditions required for the Whittle index to be asymptotically optimal. Moreover, none of these policies requires an indexability condition. These papers overcome the challenges articulated above despite basing their analysis on the same Lagrangian relaxation technique proposed and used by Whittle (1980).

We argue in this paper that a key difference in the approach enabled these finite-horizon analyses to achieve asymptotic optimality and to avoid challenges in establishing indexability: their Lagrangian relaxation uses a sequence of Lagrange multipliers, one for each time period, while the Whittle index uses a single global Lagrange multiplier. Using a time-varying Lagrange multiplier is intuitive in the finite-horizon setting: the finite horizon causes the problem to be non-stationary, naturally inspiring a time-inhomogeneous approach.

We show in this paper that this time-inhomogeneous approach can be generalized to the infinite-horizon setting to overcome the shortcomings of the Whittle index and other past approaches to the infinite-horizon restless bandit. That a time-inhomogeneous approach would be relevant to the infinite-horizon setting may, at first glance, seem surprising: the infinite-horizon problem is stationary, implying the existence of stationary optimal policies, and suggesting that asymptotically optimal policies should also be stationary. Part of our contribution is to explain why non-stationarity is an important tool for providing asymptotic optimality in stationary infinite-horizon problems.

We provide a novel class of computationally scalable non-stationary infinite-horizon restless bandit policies called “fluid-balance” policies. We show that they are asymptotically optimal, achieving an $O(\sqrt{N})$ opt gap. This result does not require indexability or other sufficient conditions beyond those defining the problem we study, such as arms’ states belonging to a finite state space and state transitions that are conditionally independent across arms. Moreover, despite being time-inhomogeneous in an infinite-horizon problem, we show that they can be computed in finite time. They are computed by solving a finite linear program formed by truncating the infinite-horizon problem. Truncating at a suitably chosen finite period allows fluid-balance policies to achieve an $O(\sqrt{N})$ opt gap.

This requires going substantially beyond applying a previously proposed finite-horizon policy to the truncated problem. Policies with $O(\sqrt{N})$ opt gaps in the finite-horizon setting proposed in Brown and Smith (2020) and Zhang and Frazier (2021) have opt gap bounds that depend exponentially on the time horizon. Simply applying one of these policies and its associated performance bound to a truncated problem (and leveraging discounting to bound the reward obtained after truncation) results in an opt gap bound that grows faster than $\sqrt{N}$. Our fluid-balance policies and their analysis are specifically adapted to the infinite-horizon setting to circumvent this challenge.

We give intuition for why a non-stationary approach can resolve the past challenges in infinite-horizon restless bandits. Index policies, such as the Whittle index and the policies that we propose, operate by defining a priority or “index” for each state and then pulling arms in order from the ones in the highest priority to the lowest priority until the budget constraint on pulls in the current time period is exhausted. Essentially, non-stationary approaches are beneficial because performing well with an index policy requires using a different priority order over states in each time period.
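The priority mechanism described above can be sketched in a few lines of Python. This is a minimal illustration rather than any specific policy from the literature: the function names, the dictionary-based priority tables and the tie-breaking by arm position are all assumptions made for the example.

```python
from typing import Dict, Hashable, List, Sequence


def index_policy_pull(
    states: Sequence[Hashable],
    priority: Dict[Hashable, float],
    budget: int,
) -> List[int]:
    """Return a 0/1 action per arm, pulling the `budget` arms in the highest-priority states."""
    if not 0 <= budget <= len(states):
        raise ValueError("budget must lie between 0 and the number of arms")
    # Rank arm positions by the priority of their current state, highest first.
    # sorted() is stable, so ties are broken by arm position (an arbitrary choice).
    order = sorted(range(len(states)), key=lambda i: -priority[states[i]])
    actions = [0] * len(states)
    for i in order[:budget]:
        actions[i] = 1
    return actions


def time_dependent_pull(
    states: Sequence[Hashable],
    priorities_by_period: Sequence[Dict[Hashable, float]],
    t: int,
    budget: int,
) -> List[int]:
    """Same rule, but the priority order over states is looked up for period t."""
    return index_policy_pull(states, priorities_by_period[t], budget)
```

A stationary index policy corresponds to passing the same priority table in every period, while a non-stationary one supplies a different table per period through `priorities_by_period`.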

Understanding the need to use a time-dependent priority order relies on a linear programming analysis of the so-called fluid approximation. In this approximation, we take the limit as the number of arms grows large while scaling up the budget constraint. A policy can be understood as taking an occupation measure (a vector comprising the fraction of arms in each state at a particular time) as input and deciding the fraction of arms in each state to pull. In the fluid limit, an optimal policy’s decisions in a time period can be understood as pulling arms according to their marginal benefit from high to low until all resources are consumed. This marginal benefit in each period depends on the occupation measure in that period. Critically, under an optimal policy in the fluid limit, the occupation measure changes over time. This causes the optimal ranking over states to vary across periods. Thus, matching the optimal policy in the fluid limit requires an index policy to use a different priority order in different time periods.
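To illustrate how the occupation measure evolves in the fluid approximation, the sketch below propagates it one period forward. The array layout (`x[s][a]` for the fraction of arms in state `s` receiving action `a`, and `P[s][a][s2]` for transition probabilities) is an assumed convention for this example only.

```python
from typing import List


def propagate_occupation(
    x: List[List[float]],
    P: List[List[List[float]]],
) -> List[float]:
    """One fluid step: next-period state fractions from state-action fractions.

    x[s][a]     : fraction of arms in state s that receive action a (0 = idle, 1 = pull)
    P[s][a][s2] : probability of moving from s to s2 under action a
    """
    n_states = len(P)
    z_next = [0.0] * n_states
    for s in range(n_states):
        for a in (0, 1):
            for s2 in range(n_states):
                z_next[s2] += x[s][a] * P[s][a][s2]
    return z_next
```

Because the output generally differs from the input state fractions, the set of states that can absorb the budget, and hence the marginal-benefit ranking, can change from one period to the next.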

The rest of the paper is structured as follows. First, Section 2 discusses relevant past work on infinite-horizon restless bandits and the novelty of our work. Then, Section 3 formulates the restless bandit formally as a Markov decision process and Section 4 discusses a standard linear programming relaxation used to support analysis of restless bandits in the past literature. Our proposed fluid-balance policies and analysis are also based on this relaxation. Section 5 then introduces a technical condition, diffusion regularity, that is sufficient for an $O(\sqrt{N})$ opt gap. Section 6 proposes our fluid-balance policies and shows that they are diffusion regular and thus have an $O(\sqrt{N})$ opt gap. Section 7 uses two numerical experiments to explore the performance of fluid-balance policies. Finally, Section 8 concludes the paper and discusses possible future work.

2 Literature Review

This section first reviews approaches specifically designed for the infinite-horizon setting. It then reviews recent progress in the finite-horizon setting motivating our approach.

2.1 Infinite-horizon restless bandits

The infinite-horizon restless bandit problem was first formulated by Whittle (1980). Since then, the problem has attracted substantial research interest, both from theoretical and practical perspectives. Here we review two main streams of this research: the Whittle index and simulation-based approaches.

Whittle index. When the restless bandit problem was first formulated in Whittle (1980), that paper also proposed an index policy, the so-called Whittle index, as a solution. The Whittle index is defined by considering a problem with a single arm in which one can pull the arm, paying a cost, or idle it. The Whittle index for a state is the cost that makes an optimal policy indifferent between pulling the arm and idling it. This implies a ranking over states that, intuitively, is the same as ranking by a state’s “marginal productivity”: the difference in discounted long-run reward between activating and idling an arm in this state in the original problem (Niño-Mora 2007). Intuitively, it should be a good policy to simply pull the arms in the states with the highest marginal productivity. The Whittle index policy does exactly this: it activates arms according to their indices, from high to low, until all resources are used.
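As a rough illustration of this definition, the sketch below searches for the indifference cost of one state by bisection, solving the single-arm problem with value iteration at each candidate cost. It assumes the problem is indexable and that the supplied bracket `[lo, hi]` contains the index; the function names and the array layout (`P[s][a][s2]`, `r[s][a]`) are hypothetical choices for this example, not taken from the paper.

```python
from typing import List

Kernel = List[List[List[float]]]  # P[s][a][s2]
Reward = List[List[float]]        # r[s][a]


def single_arm_q(P: Kernel, r: Reward, gamma: float, cost: float,
                 tol: float = 1e-10, max_iter: int = 100_000) -> List[List[float]]:
    """Q-values of the single-arm problem in which each pull (a = 1) costs `cost`."""
    n = len(P)

    def backup(v: List[float]) -> List[List[float]]:
        return [
            [r[s][a] - cost * a + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n))
             for a in (0, 1)]
            for s in range(n)
        ]

    v = [0.0] * n
    for _ in range(max_iter):
        v_new = [max(q_s) for q_s in backup(v)]
        converged = max(abs(x - y) for x, y in zip(v_new, v)) < tol
        v = v_new
        if converged:
            break
    return backup(v)


def whittle_index_bisect(P: Kernel, r: Reward, gamma: float, state: int,
                         lo: float, hi: float, iters: int = 60) -> float:
    """Cost at which pulling and idling are equally good in `state`.

    Assumes indexability, so the preference for pulling switches exactly once
    as the cost increases, and that the switch point lies in [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        q = single_arm_q(P, r, gamma, mid)
        if q[state][1] > q[state][0]:
            lo = mid  # pulling is still preferred, so the index is above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When the preference for pulling is not monotone in the cost, the bisection may return any one of several indifference points, which is the non-uniqueness issue that arises without indexability.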

Although intuitively promising, Whittle (1980) noticed that the willingness to pull an arm in a single-arm problem is not always monotone: it may be optimal to pull the arm when the cost-per-pull is low, idle it when the cost-per-pull is in an intermediate range, and pull it when the cost-per-pull is high. In such settings, the Whittle index is not well-defined and its link to marginal productivity is lost. Whittle (1980) conjectured that indexability would imply asymptotic optimality: the difference between the Whittle index’s expected performance and that of an optimal policy divided by the number of arms vanishes as the number of arms grows, allowing a constant fraction of the arms to be pulled per time period. Later, however, Weber and Weiss (1990) provided a counterexample to Whittle’s conjecture: the Whittle index can fail to be asymptotically optimal even when the indexability condition is satisfied.

Responding to the challenge of establishing indexability, Gittins et al. (2011) and Niño-Mora (2001) establish alternate sufficient conditions for indexability and Glazebrook et al. (2006) characterize some indexable restless bandit families. Liu and Zhao (2008, 2010) and Le Ny et al. (2008) show that the systems they study are indexable. Guha and Munagala (2007, 2008), Guha et al. (2010), Ansell et al. (2003) and Jacko and Niño-Mora (2007) have extended these ideas to more general settings, e.g., convex rewards, convex resource budgets, and stochastically arriving and departing arms. Nevertheless, establishing indexability remains challenging for most problems and typically entails additional theoretical work that must be done on a problem-by-problem basis.

When a problem is not indexable, multiple values satisfy the conditions that usually define the Whittle index. Thus, attempting to deploy a Whittle index policy in practice without first verifying indexability requires the implementation to explicitly handle this non-uniqueness. The use of implementations assuming a unique Whittle index value in non-indexable problems creates a risk that Whittle index computation produces errors or fails to converge. Also, the intuition for why a Whittle index policy would perform well relies on indexability. When indexability is lacking, policies prioritizing arms based on a Whittle index computation (while handling non-uniqueness) may be less likely to perform well.

If indexability can be verified, establishing asymptotic optimality requires verifying the additional sufficient conditions discussed above. Past literature suggests that this may be even more difficult than verifying indexability. Most work using Whittle indices does not prove its asymptotic optimality in the problem studied (Liu and Zhao 2008, Le Ny et al. 2008) or only proves it in a specific parameter regime (Liu and Zhao 2010, Verloop 2016). Instead, past literature often relies on numerical simulation to justify the Whittle index’s performance.

Thus, despite the Whittle index’s popularity, the difficulty of verifying indexability and the additional conditions needed for asymptotic optimality remains a challenge when applying the Whittle index policy to real-world problems.

Simulation-based approaches. Responding to the limitations of the Whittle index policy, simulation-based approaches have been developed. For example, Meshram and Kaza (2020) develop rollout-based heuristic policies, and Nakhleh et al. (2021) and Wang et al. (2021) develop deep reinforcement learning strategies, using neural networks to approximate the value function. These works are primarily concerned with numerical performance on a collection of benchmark problem instances rather than with theoretical guarantees. A policy that performs well in the problem instances simulated may perform poorly in other closely related problem instances, and so performing well in a simulation-based study may not guarantee good performance across the wider range of problem instances faced after a policy is deployed to the field.

Moreover, if all benchmark policies included in a numerical study are asymptotically suboptimal, a new asymptotically optimal policy has the potential to significantly outperform all of them. Thus, identifying new asymptotically optimal policies is of significant interest.

2.2 Finite-horizon restless bandits

While the Whittle index faces challenges in verifying indexability and the additional conditions required for asymptotic optimality, recent progress on finite-horizon restless bandits provides algorithms without these drawbacks in this alternate setting.

In rapid succession, Hu and Frazier (2017), Zayas-Caban et al. (2019) and Brown and Smith (2020) proposed index policies and showed that they have $o(N)$, $O(\sqrt{N}\log N)$ and $O(\sqrt{N})$ opt gaps, respectively. Then, Zhang and Frazier (2021) proposed a class of index policies generalizing Brown and Smith (2020) and Hu and Frazier (2017), showing that this larger class of policies has at most an $O(\sqrt{N})$ opt gap and, surprisingly, an $O(1)$ opt gap if a non-degeneracy condition is met.

Unlike the Whittle index, these index policies do not require an indexability condition to be well-defined. Moreover, they come with performance guarantees that do not require verifying extra sufficient conditions: $O(\sqrt{N})$ for Zhang and Frazier (2021) and Brown and Smith (2020), $o(N)$ for Hu and Frazier (2017), and $O(\sqrt{N}\log N)$ for Zayas-Caban et al. (2019).

We build on ideas in these finite-horizon papers to develop policies and analysis for the infinite-horizon setting that avoid the drawbacks of past infinite-horizon work: we develop policies that are asymptotically optimal and do not require indexability or other sufficient conditions.

This requires substantial additional analysis. One might hope to simply truncate the infinite-horizon problem, apply a previously proposed finite-horizon policy with a finite-horizon performance guarantee to this truncated problem, and choose the truncation point large enough that the reward obtained afterward is a small part of the overall reward. This, however, does not produce a guarantee of asymptotic optimality in the infinite-horizon setting.

Indeed, previously proposed policies known to have an $O(\sqrt{N})$ opt gap in the finite-horizon setting have performance guarantees that depend exponentially on the horizon $T$: for both Brown and Smith (2020) (see its Proposition 5) and Zhang and Frazier (2021) (see its proof of Lemma 5), the opt gap bound for a problem with horizon $T$ is of order $\sqrt{N}$ multiplied by a factor that grows exponentially in $T$. Thus, applying either policy and its associated bound to an infinite-horizon discounted problem truncated at $T$ with discount factor $\gamma$ would give a bound that grows exponentially in $T$ on the opt gap realized up to the truncation time and a bound of order $\gamma^{T} N$ on the opt gap realized after the truncation time. Choosing $T$ to minimize the sum of these bounds would not provide an $O(\sqrt{N})$ opt gap bound.
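To make this trade-off concrete, the following calculation works through an illustrative case. The functional form of the bound and the constants $C_1$, $C_2$ and $c > 1$ are assumptions made for this sketch; they stand in for, and are not copied from, the constants in the cited finite-horizon results.

```latex
% Assumed form of the combined bound for truncation horizon T:
\[
  B(T) \;=\; C_1\, c^{T} \sqrt{N} \;+\; C_2\, \gamma^{T} N,
  \qquad c > 1,\quad 0 < \gamma < 1 .
\]
% The first term increases in T and the second decreases, so the minimum is,
% up to constant factors, attained where the two terms balance:
\[
  c^{T}\sqrt{N} = \gamma^{T} N
  \;\Longleftrightarrow\;
  T^{\ast} = \frac{\log N}{2\,\log(c/\gamma)} .
\]
% Substituting back:
\[
  B(T^{\ast}) \;=\; \Theta\!\left( N^{\frac{1}{2} + \frac{\log c}{2\log(c/\gamma)}} \right),
  \qquad
  \frac{1}{2} \;<\; \frac{1}{2} + \frac{\log c}{2\log(c/\gamma)} \;<\; 1 .
\]
```

Under these assumptions the best achievable exponent lies strictly between $1/2$ and $1$, so the truncation argument alone gives a bound that is $o(N)$ but not $O(\sqrt{N})$.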

3 System Model

This section formulates the restless bandit problem as a Markov decision process (MDP).

Model. The decision maker faces $N$ arms. Each arm is an MDP with an associated state space and action space. The arms share the same finite state space $\mathbb{S}$ and the same binary action space $\mathbb{A} = \{0, 1\}$, where 1 denotes pulling the arm and 0 denotes idling it. For arm $i$, we let $s_{t,i}$ denote its state and $a_{t,i}$ the action applied in period $t$.

As we move from time period $t$ to $t+1$, each arm transitions to its new state independently given its current state $s_{t,i}$ and the action applied $a_{t,i}$. We use a kernel to describe this stochastic transition. The kernel is assumed known to the decision maker and is denoted by $P$, where $P(s, a, s')$ gives the probability of an arm transitioning to state $s'$ conditioned on its current state being $s$ and action $a$ being taken. We assume that each arm shares the same transition kernel and that the transition kernel is time-homogeneous. Thus, as written above, the transition kernel does not depend on the arm index $i$ or the time index $t$. For simplicity, we assume each arm starts from a common state in the first period. Our analysis also applies if each arm’s initial state is chosen independently from a common distribution.

The decision maker must pull $\lfloor \alpha N \rfloor$ arms in each period, which we refer to as the budget constraint. The constant $\alpha$ is also known to the decision maker.

In each period, an arm generates a real-valued reward $r(s_{t,i}, a_{t,i})$ that is a deterministic function of its state $s_{t,i}$ and the action applied $a_{t,i}$. The decision maker’s objective is to maximize the total infinite-horizon expected discounted reward collected across all arms with discount factor $\gamma \in (0, 1)$ while respecting the budget constraint in each period.

To formulate this $N$-arm decision-making problem as an MDP, we introduce some additional notation. These $N$ arms form a new MDP, which we refer to as the joint MDP. The state space of this joint MDP is the Cartesian product of the single-arm state spaces $\mathbb{S}$: $\mathbb{S}^N$. Similarly, the action space of the joint MDP is the Cartesian product of the single-arm action spaces $\mathbb{A}$: $\mathbb{A}^N$. At period $t$, we denote the state of the joint MDP as $\mathbf{s}_t$ and the action of the joint MDP as $\mathbf{a}_t$, where the $i$-th components of $\mathbf{s}_t$ and $\mathbf{a}_t$ refer respectively to the state of arm $i$ and the action applied to it.

Since the state of each arm evolves independently given its previous state and the action applied, the probability of transitioning from one state to another in the joint MDP is the product of each arm’s transition probability. Mathematically speaking,

$$\Pr(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) = \prod_{i=1}^{N} P(s_{t,i}, a_{t,i}, s_{t+1,i}).$$

To clearly describe the budget constraint, we introduce a norm on the action space $\mathbb{A}^N$. For an element $\mathbf{a} \in \mathbb{A}^N$, its norm $|\mathbf{a}|$ is the sum of its components, noting that these components are non-negative. The norm $|\mathbf{a}_t|$ gives the number of arms pulled in period $t$. Thus, we can write our budget constraint as $|\mathbf{a}_t| = \lfloor \alpha N \rfloor$ for each period $t$.

The reward of the joint MDP is the sum of the rewards generated by each individual arm. To formalize this, we define the joint MDP’s reward function via $R(\mathbf{s}, \mathbf{a}) = \sum_{i=1}^{N} r(s_i, a_i)$, where $r(s, a)$ is the reward of a single arm in state $s$ under action $a$.

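A policy $\pi$ is a mapping from the product of the state space and the set of possible times to the action space. Under a policy $\pi$, the action taken in period $t$ is $\mathbf{a}_t = \pi(\mathbf{s}_t, t)$. The decision maker’s objective is to choose a policy that maximizes the joint MDP’s total expected discounted reward while respecting budget constraints. Mathematically speaking, this is the following stochastic constrained optimization problem,

$$\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R(\mathbf{s}_t, \mathbf{a}_t)\Big] \quad \text{s.t.} \quad |\mathbf{a}_t| = \lfloor \alpha N \rfloor \ \text{ for all } t, \tag{3}$$

where $\mathbb{E}_{\pi}$ takes the expectation under the distribution on states, actions, and rewards induced by the policy $\pi$.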
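The model can be simulated directly, which gives a simple way to estimate a policy’s discounted reward on one sample path. The sketch below is illustrative only: the function name, the list-based layout (`P[s][a][s2]`, `r[s][a]`), the finite simulation horizon, and the convention that the first period is undiscounted are assumptions made for the example.

```python
import random
from typing import Callable, List


def simulate_discounted_reward(
    P: List[List[List[float]]],
    r: List[List[float]],
    policy: Callable[[List[int], int], List[int]],
    n_arms: int,
    budget: int,
    gamma: float,
    horizon: int,
    init_state: int = 0,
    seed: int = 0,
) -> float:
    """Total discounted reward of `policy` along one sample path of length `horizon`."""
    rng = random.Random(seed)
    states = [init_state] * n_arms
    total, discount = 0.0, 1.0
    for t in range(horizon):
        actions = policy(states, t)
        if sum(actions) != budget:
            raise ValueError("policy violated the budget constraint")
        # Joint reward is the sum of per-arm rewards.
        total += discount * sum(r[s][a] for s, a in zip(states, actions))
        # Arms transition independently given their own state and action.
        states = [
            rng.choices(range(len(P)), weights=P[s][a])[0]
            for s, a in zip(states, actions)
        ]
        discount *= gamma
    return total
```

Averaging the returned value over many seeds gives a Monte Carlo estimate of the truncated expected discounted reward of the policy.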

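To measure the performance of a policy $\pi$, we define its value function,

$$V_N(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R(\mathbf{s}_t, \mathbf{a}_t)\Big],$$

as the expected total discounted reward collected by this policy. An optimal policy $\pi^*$ has performance $V_N(\pi^*) = \max_{\pi} V_N(\pi)$, where the maximum is taken over policies respecting the budget constraint. We define the optimality gap (opt gap) of a policy $\pi$ as

$$V_N(\pi^*) - V_N(\pi),$$

i.e., the difference in performance between this policy and an optimal policy. The smaller the opt gap, the better the policy.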

As an MDP with a large but finite state space, Problem (3) can be solved in principle via dynamic programming. However, the time complexity of this approach grows exponentially with the number of arms because of the so-called curse of dimensionality (Powell 2007): the joint MDP has a state space whose cardinality is $|\mathbb{S}|^N$. Thus, we would like to find policies that are both computationally tractable and have strong theoretical performance guarantees in the regime with many arms.
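A quick calculation shows how fast the joint state space grows. The helper below is a hypothetical name introduced for this illustration; it simply evaluates the number of joint states for a given number of single-arm states and arms.

```python
def joint_state_count(n_states: int, n_arms: int) -> int:
    """Number of joint states when each of n_arms arms has n_states possible states."""
    return n_states ** n_arms


# Even with only three states per arm, the joint state space quickly
# outgrows anything a dynamic program could enumerate.
for n_arms in (10, 50, 100):
    print(n_arms, joint_state_count(3, n_arms))
```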

4 Background: Preliminary Results and Notation

This section describes a linear programming relaxation of the restless bandit problem. This is a standard technique from the restless bandit literature. It is not part of our contribution and we introduce it simply to provide a self-contained treatment of our research contribution.

The relaxation provides an upper bound on an optimal policy’s performance, which can in turn bound the opt gap of any feasible policy. Also, the relaxation can be solved efficiently and its solution will provide insights into the design of an asymptotically optimal policy in later sections.

Linear Programming Relaxation. Following Hu and Frazier (2017), Zayas-Caban et al. (2019), Brown and Smith (2020) and Zhang and Frazier (2021), which apply linear programming relaxations to finite-horizon restless bandits, we describe an analogous relaxation for the infinite-horizon problem. We emphasize that this is not part of our research contribution and is introduced so that we can define notation and so that our treatment can be self-contained.

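Instead of solving the original problem (3) with cardinality constraints on the number of arms pulled in each period, we consider a modified version where these cardinality constraints are relaxed to constraints on the expected number of arms pulled:

$$\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R(\mathbf{s}_t, \mathbf{a}_t)\Big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}|\mathbf{a}_t| = \lfloor \alpha N \rfloor \ \text{ for all } t. \tag{4}$$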


From now on we assume $\alpha$ is a rational number and $N$ (an integer) is chosen such that $\alpha N$ is also an integer. Thus, we drop the floor operator in the constraint of (4). When $\alpha N$ is not an integer, analysis in Appendix 9.2 shows that the rounding error caused by the floor operator does not affect any asymptotic analysis in later sections.

To support computation, we consider a version of problem (4) that is truncated at some horizon $T$, introducing approximation error discussed below. In some problems with special structure, truncation will be unnecessary for computation, allowing us to choose $T = \infty$, while in others it will be necessary, requiring $T < \infty$. We refer to the resulting problems as the truncated relaxed problem and the truncated original problem: each keeps the objective and constraints of its untruncated counterpart, but only for periods up to $T$.




The truncated relaxed problem has two useful properties. First, regardless of the value of $T$, it can be decomposed across arms by an analysis similar to Fenchel duality, allowing us to solve the relaxed problem with $N$ arms via an equivalent single-arm problem. Second, since the set of feasible policies for the truncated relaxed problem is a superset of the set of feasible policies for the truncated original problem, the value of the truncated relaxed problem provides an upper bound on the value of the truncated original problem. We state these properties formally in Lemma 4, whose proof is left to Appendix 9.1.
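The approximation error from truncation can be bounded using discounting alone. As a minimal sketch, assuming per-arm rewards are bounded in absolute value by a constant `r_max` and the first period is undiscounted, the reward lost per arm after the first `T` periods is at most `gamma**T * r_max / (1 - gamma)`. The helper below (a hypothetical name) finds the smallest horizon that meets a per-arm tolerance.

```python
def truncation_horizon(gamma: float, r_max: float, eps: float) -> int:
    """Smallest T with gamma**T * r_max / (1 - gamma) <= eps (per-arm tail bound)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if eps <= 0.0 or r_max < 0.0:
        raise ValueError("eps must be positive and r_max non-negative")
    tail = r_max / (1.0 - gamma)  # bound on the full discounted reward of one arm
    horizon = 0
    while tail > eps:
        tail *= gamma  # dropping one more leading period shrinks the tail by gamma
        horizon += 1
    return horizon
```

Since the tail shrinks geometrically, the required horizon grows only logarithmically as the tolerance decreases.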

The single-arm truncated relaxed problem can also be formulated as a linear program. Choosing the components of the occupation measure as decision variables, the single-arm truncated relaxed problem can be formulated as


The first constraint ensures flow balance in each time period. The second constraint ensures that the budget constraint on the expected number of arms pulled is satisfied in each period. The third, fourth and fifth constraints ensure that the occupation measure forms a probability measure in each period. Denoting an optimal solution to (5) by $x$, the optimal value of the $N$-arm truncated relaxed problem is $N$ times the optimal value of (5).
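For orientation, one common way to write a single-arm occupation-measure linear program of this kind is sketched below. The notation is assumed for this sketch: $x_t(s,a)$ is the probability that the arm is in state $s$ and receives action $a$ in period $t$, $P(s,a,s')$ is the transition kernel, $r(s,a)$ the reward, $\gamma$ the discount factor, $\alpha$ the budget fraction and $s_{\mathrm{init}}$ the common initial state. It may differ in detail from formulation (5); in particular, the probability-measure requirements are condensed here into an initial condition and non-negativity.

```latex
\begin{align*}
\max_{x}\quad
  & \sum_{t=1}^{T} \gamma^{t-1} \sum_{s \in \mathbb{S}} \sum_{a \in \{0,1\}} r(s,a)\, x_t(s,a) \\
\text{s.t.}\quad
  & \sum_{a \in \{0,1\}} x_{t+1}(s',a)
      = \sum_{s \in \mathbb{S}} \sum_{a \in \{0,1\}} x_t(s,a)\, P(s,a,s')
      && \text{for all } s',\ 1 \le t < T
      && \text{(flow balance)} \\
  & \sum_{s \in \mathbb{S}} x_t(s,1) = \alpha
      && \text{for all } 1 \le t \le T
      && \text{(expected budget)} \\
  & \sum_{a \in \{0,1\}} x_1(s,a) = \mathbf{1}\{s = s_{\mathrm{init}}\}
      && \text{for all } s
      && \text{(initial condition)} \\
  & x_t(s,a) \ge 0
      && \text{for all } s,\ a,\ t .
\end{align*}
```

The initial condition together with flow balance implies that $x_t$ sums to one in every period, so each $x_t$ is a probability distribution over state-action pairs.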

Additional Notation. Starting in the next section, we analyze the optimal occupation measure and the number of arms in each state under a stochastic sample path. To support this analysis, we introduce some additional notation here.

Given an optimal occupation measure $x$ solving (5), we let $z_t(s) = \sum_{a} x_t(s, a)$ denote the probability that an arm is in state $s$ at period $t$. For notational simplicity, we use the vectors $z_t$ and $x_t$ to denote the distribution over an arm’s state and the distribution over an arm’s state-action pair given an optimal occupation measure.

We are also interested in the realized number of arms in each state under a stochastic sample path. We let $Z_t(s)$ denote the number of arms in state $s$ in period $t$ and let $X_t(s, a)$ denote the number of arms in state $s$ for which action $a$ is taken in period $t$. We have that $Z_t(s) = \sum_{a} X_t(s, a)$ for any $s$ and $t$. Similar to the vector notation used to describe an optimal occupation measure, we use the vectors $Z_t$ and $X_t$.

Starting from Section 5, we will be interested in deviations of the realized number of arms under a stochastic sample path from an optimal occupation measure. To characterize this deviation, we define the following diffusion statistics:

$$\tilde{Z}_t(s) = \frac{Z_t(s) - N z_t(s)}{\sqrt{N}}, \qquad \tilde{X}_t(s, a) = \frac{X_t(s, a) - N x_t(s, a)}{\sqrt{N}}.$$

We also use the vectors $\tilde{Z}_t$ and $\tilde{X}_t$ for notational simplicity.
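As a small numerical illustration of such statistics, the sketch below computes the deviation of realized state counts from their fluid targets. It assumes the deviation is scaled by the square root of the number of arms, which is the usual diffusion scaling; the function name and argument layout are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def diffusion_statistics(
    states: Sequence[int],
    z_opt: Sequence[float],
) -> List[float]:
    """Scaled deviation (count[s] - N * z_opt[s]) / sqrt(N) for each state s.

    states : current state of every arm (integers 0 .. len(z_opt) - 1)
    z_opt  : target fraction of arms in each state from the fluid solution
    The sqrt(N) scaling is an assumption of this sketch.
    """
    n_arms = len(states)
    counts = Counter(states)
    return [
        (counts.get(s, 0) - n_arms * z_opt[s]) / math.sqrt(n_arms)
        for s in range(len(z_opt))
    ]
```

When `z_opt` sums to one, the returned deviations sum to zero, since the realized counts and the targets both account for all arms.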

With this new notation, we can rewrite a policy in terms of its diffusion statistics. Given a policy $\pi$, we can rewrite it as a sequence of mappings, one per period, each taking the diffusion statistics of the arms’ states as input and returning the diffusion statistics of the state-action pairs.

We refer to this sequence as the induced maps from policy $\pi$. Section 5 characterizes a class of policies satisfying a property called diffusion regularity, which is defined in terms of their induced maps.

5 Diffusion Regularity Conditions

This section defines a property called diffusion regularity and shows that policies possessing this property satisfy a bound on their corresponding diffusion statistics’ first moments. This diffusion regularity property is shown to be satisfied by the fluid-balance policies proposed in Section 6, and the bound shown here is a tool used to understand their performance theoretically in that section.

The intuition behind the diffusion regularity condition is that as long as the state diffusion statistics remain bounded by a term that does not grow with $N$, the diffusion statistics they induce are also bounded by another term that does not grow with $N$. The first three conditions in the diffusion regularity condition are similar to conditions proposed in Zhang and Frazier (2021) and the last condition is added specifically for our infinite-horizon setting to guarantee that diffusion statistics accumulate noise at no more than a linear rate over time.

We now define diffusion regularity. A policy $\pi$ is diffusion regular up to period $T$ if its induced maps satisfy the following conditions, where $\|\cdot\|$ is the Euclidean norm:

  • For any , there is a constant such that for all and ,

  • For any , there is a constant such that for all ,

  • For any , there is a map such that for all as ;

  • For any , there is a constant such that for all and , we have

If a policy is diffusion regular up to period , its diffusion statistics’ first moments are bounded above by a linear function of time (Lemma 5). The proof of Lemma 5 may be found in the Appendix.

If a policy is diffusion regular, then there exists constant and (neither depends on ), s.t. for all and ( could be infinity),

6 Fluid-balance policy

This section defines fluid-balance policies, and shows that all fluid-balance policies are diffusion regular and achieve an $O(\sqrt{N})$ opt gap.

Roughly speaking, a fluid-balance policy is parameterized by two components: an optimal solution of the LP relaxation and a set of prioritization scores over states. The resulting fluid-balance policy pulls arms respecting two rules: a consistency rule and a prioritization rule. The consistency rule requires that, for each state $s$, the diffusion statistics of its state-action pairs share the same sign as the diffusion statistic of the state itself. The prioritization rule requires pulling arms according to the prioritization scores as much as possible while respecting the consistency rule.

Formally, a fluid-balance policy is parameterized by an occupation measure $x$ solving the truncated Problem (5) up to period $T$ and “priority-score” functions assigning each state a time-dependent real number. The fluid-balance policy with these components is defined by Algorithm 1.

Input: optimal occupation measure , priority-score functions .

1:for  do
2:     Input: the number of arms in state and their associated diffusion statistics
3:     For each state , set
4:     while  do
5:         Find the state with the lowest priority-score such that
7:     end while
8:     For each state , pull arms in state
9:end for
Algorithm 1 Fluid-balance policy

We can show that any fluid-balance policy is diffusion regular and thus satisfies the bound on the expected norm of its diffusion statistic provided in Lemma 5. In fact, it satisfies a more explicit bound than the one in Lemma 5. Both statements are shown in the following lemma, whose proof appears in the Appendix.

Lemma 6. Any fluid-balance policy is diffusion regular, and for .

We can now state our main result.

Theorem 6. Given any fluid-balance policy , .


Proof of Theorem 6. Let denote an optimal policy for the infinite-horizon Problem (3). Then, denoting ,

where the last inequality is due to the definition of .

We deal with these two terms and separately. For the first term,

Recalling Lemma 6, we have

For the second term,

Combining the bounds on the two terms, we conclude .

Based on Theorem 6, we can show the following proposition.

Proposition. Choosing implies that for any fluid-balance policy , we have


7 Numerical Experiment

This section illustrates the performance of fluid-balance policies through two numerical experiments, focusing on their advantage over the Whittle index policy.

The first experiment studies a simple non-indexable example encapsulating a tradeoff that is important in many more complex real-world decision problems. The decision-maker can either generate a substantial reward over a short time horizon or generate a steady small amount of reward over a very long horizon. We refer to this as the Slow-and-Steady Problem, borrowing from the idiom “slow and steady wins the race”. We show this example is not indexable, and thus the Whittle index is not well-defined. However, the fluid-balance policy can be computed analytically without truncation (i.e., the truncation point is ∞) and is in fact an optimal policy.

In the second experiment, we compare the fluid-balance policy against the Whittle index for a discounted version of a problem studied in Fu et al. (2019) and Biswas et al. (2021). Although this problem is indexable, the fluid-balance policy outperforms the Whittle index by over 30%. This problem is drawn from the literature studying bandits with unknown transition kernels, where the Whittle index policy’s performance is used to represent the performance achievable when transition kernels are known. Our results suggest that the fluid-balance policy better represents the performance achievable with full information and thus might be a better benchmark than the Whittle index policy.

7.1 The Slow-and-Steady Problem: Large Fleeting Rewards vs. Small Steady Rewards

This section constructs a simple non-indexable problem reflecting an important tradeoff arising in real-world decision making: should we generate one large reward immediately, or generate a sequence of small rewards over a much longer period? We refer to this as the “slow-and-steady problem,” echoing the proverb “slow and steady wins the race” often uttered when considering such tradeoffs. We show that the Whittle index is not well defined for the slow-and-steady problem, while the fluid-balance policy we consider is optimal not just asymptotically but also for finite N.

Problem Definition

In the slow-and-steady problem there are two states that generate non-zero rewards: the “Steady” state and the “Brief” state. Once an arm enters the Steady state, it stays there and generates a reward of each time it is activated. An arm in the Brief state stays in the Brief state until it is activated, at which point it generates a reward of and transitions to the “End” state. We think of as being small and as being large. Once an arm is in the End state, it stays in that state.

Before transitioning into either the Steady or Brief states, an arm oscillates between the “Uncommitted-Steady” and “Uncommitted-Brief” states until it is activated. When activated, the arm transitions into the End state with a small probability; otherwise it transitions into the Steady state (if it was in the Uncommitted-Steady state when activated) or the Brief state (if it was in the Uncommitted-Brief state when activated). For technical reasons, we also have a “Pre-Steady” state, which always transitions into the Steady state regardless of whether the arm is activated.
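The dynamics described in the two paragraphs above can be sketched in code. This is only an illustrative sketch: the names `step`, `reward`, `p_end`, `r_steady` and `r_brief` are hypothetical, the exact parameter values are not reproduced here, and we assume that an idle uncommitted arm switches deterministically between the two uncommitted states in each period.

```python
import random

def step(state, action, p_end, rng=random):
    """Next state of a single arm; action is 1 (activate) or 0 (idle).

    p_end is a hypothetical name for the small probability of moving to
    End when an uncommitted arm is activated.
    """
    if state == "Pre-Steady":
        return "Steady"                      # regardless of the action
    if state in ("Steady", "End"):
        return state                         # both states are absorbing
    if state == "Brief":
        return "End" if action == 1 else "Brief"
    # Remaining cases: the two uncommitted states.
    if action == 0:
        # Assumption: idle arms alternate deterministically.
        if state == "Uncommitted-Steady":
            return "Uncommitted-Brief"
        return "Uncommitted-Steady"
    if rng.random() < p_end:
        return "End"
    return "Steady" if state == "Uncommitted-Steady" else "Brief"

def reward(state, action, r_steady, r_brief):
    """Reward collected in the current period (zero unless activated)."""
    if action == 1 and state == "Steady":
        return r_steady
    if action == 1 and state == "Brief":
        return r_brief
    return 0.0
```

Setting `p_end` to 0 or 1 makes every transition deterministic, which is convenient for checking the sketch against the state diagram.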

Figure 1 shows the transition dynamics between states, where nodes represent states and edges between nodes represent transition probabilities. Formally, the transition kernel is given by:

Figure 1: State transition diagram for the Slow-and-Steady Problem. Each node stands for a state. Edges between nodes stand for transitions: dashed lines represent transitions when an arm is idled; solid lines represent transitions when an arm is activated. Unless the transition probability is specified on the edge, it is 1. Similarly, unless the reward is specified on an edge, it is 0.

Formally, the reward function is given by

with rewards for all other state-action pairs set to 0. We seek to maximize the expected infinite-horizon discounted reward with discount factor .

We assume our parameters satisfy , i.e. the Brief state generates a somewhat larger reward than the Steady state, but not too much larger. We also assume , i.e. the Uncommitted-Brief and Uncommitted-Steady states transition into the Brief and Steady states with a reasonably high probability. In these parameter ranges, and for the budget and initial occupation measure chosen below, the problem is not indexable, as we show later in this section.

The budget is set so that we can pull arms out of the total arms in each period. In the initial time period, arms are in the Uncommitted-Steady state, arms are in the End state and arms are in the Pre-Steady state, where and . These parameters are chosen so that activating all arms in the Pre-Steady and Uncommitted-Steady states respects the budget constraint.


We first show, in the following proposition, that this problem is not indexable. Thus, the Whittle index is not well-defined. Proofs of this and other propositions in this section appear in the appendix.

Proposition. The slow-and-steady problem is not indexable.

Towards defining a fluid-balance policy, we show in the next proposition that the infinite-horizon linear programming relaxation of the slow-and-steady problem defined above permits an analytical solution. This allows us to define fluid-balance policies without truncation (i.e., the truncation point used is ∞).

Proposition. Consider the policy that pulls all arms in the Uncommitted-Steady and Pre-Steady states in the first period, then pulls all arms in the Steady state from the second period onward. This policy is optimal in the linear programming relaxation (4) of the slow-and-steady problem.

Using this optimal policy for the relaxed problem, we construct a fluid-balance policy. This fluid-balance policy activates all arms in the Uncommitted-Steady and Pre-Steady states in the first period, then activates as many arms in the Steady state as possible starting from the second period. If there are fewer arms in the Steady state than the budget requires, it activates arms in the End state to meet the budget.
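As a concrete illustration, the activation rule just described might be sketched as follows. The function name `choose_pulls` and its arguments are hypothetical, periods are indexed from 0, and topping up with End arms in the first period is our own assumption (the parameters above are chosen so that the first-period activations respect the budget).

```python
def choose_pulls(counts, budget, period):
    """Number of arms to activate in each state for one period.

    counts: dict mapping state name -> number of arms currently there.
    budget: number of arms that must be activated this period.
    period: 0 for the first period, 1, 2, ... afterwards.
    """
    if period == 0:
        # First period: Uncommitted-Steady and Pre-Steady arms.
        order = ["Uncommitted-Steady", "Pre-Steady", "End"]
    else:
        # Later periods: Steady arms first, End arms fill the budget.
        order = ["Steady", "End"]
    pulls = {}
    remaining = budget
    for state in order:
        k = min(counts.get(state, 0), remaining)
        pulls[state] = k
        remaining -= k
    return pulls
```

For example, with 3 arms in the Steady state, 10 in the End state and a budget of 5, the sketch activates all 3 Steady arms and 2 End arms.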

Theorem 6 implies that the fluid-balance policy is asymptotically optimal, i.e., its opt gap is O(√N). Surprisingly, this fluid-balance policy is not only asymptotically optimal but exactly optimal, i.e., its opt gap is 0. This is shown in the following proposition.

Proposition. The fluid-balance policy is an optimal policy for the slow-and-steady problem.

The bound on the opt gap of a fluid-balance policy in Theorem 6 is derived by comparing a fluid-balance policy’s expected total discounted reward against the optimal reward of the linear programming relaxation (which is an upper bound on the value of an optimal policy) rather than against the value of an optimal policy in the original (unrelaxed) problem. Since the fluid-balance policy is optimal in the slow-and-steady problem, its opt gap is 0. The optimal reward of the linear programming relaxation, however, is strictly bigger than that of an optimal policy, causing the gap to the linear programming relaxation to remain . This is shown in the following proposition. , where .

7.2 Bandit literature benchmark

There is a stream of literature studying problems similar to the one we study, but where state transitions are unknown, e.g. Fu et al. (2019) and Biswas et al. (2021). Rather than designing policies based on knowledge of the problem’s state transition kernel, as in the restless bandit problem we study, this stream of literature designs algorithms that estimate the state transition kernel while simultaneously choosing actions and collecting rewards. A common practice in this literature is to benchmark a proposed algorithm’s performance against a full-information policy, such as the Whittle index policy, on a specific problem.

In this section, we study a problem based on one of the most commonly used benchmark problems from this literature. This problem is based on Fu et al. (2019) and Biswas et al. (2021), who study an undiscounted version in which arms are pulled per period. In the setting where arms are pulled per period, a numerical study shows that the Whittle index policy is also a fluid-balance policy, and thus is asymptotically optimal by Theorem 6. When the budget is arms per period, however, we find that the fluid-balance policy outperforms the Whittle index by over , suggesting that fluid-balance policies provide a better full-information benchmark.

Problem Definition

In the problem that we study, there are 4 different states: {0, 1, 2, 3}. Transition kernels for action and are given by

The reward solely depends on the state and is unaffected by the action:

We set the discount factor to and require arms to be pulled per period. Initially, there are arms in state , arms in state and arms in state .


Via direct calculation, we can show that this problem is indexable. Ranking states from the highest to the lowest according to the Whittle index, we have that the Whittle index prioritizes state 2 over state 1, and state 1 over state 3. It is unknown whether the Whittle index policy is asymptotically optimal in this problem, though numerical experiments below suggest that it is not.

To compute the fluid-balance policy, we solve a truncated version of the linear programming relaxation up to , since the infinite-horizon relaxation of this problem does not permit an analytical solution. This provides an accurate approximation of the upper bound implied by the linear programming relaxation because the total reward after period is less than , much smaller than the precision of a 64-bit floating point number.
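The size of the truncation error can be checked with a geometric-series bound: if per-period rewards are at most r_max and the discount factor is γ, the discounted reward collected per arm from period T onward is at most γ^T · r_max / (1 − γ). The sketch below uses hypothetical function names and illustrative values (γ = 0.9, r_max = 1, T = 500), which are not the values used in this experiment.

```python
import math
import sys

def tail_bound(gamma, T, r_max):
    """Upper bound on the discounted reward per arm from period T on."""
    return gamma ** T * r_max / (1.0 - gamma)

def min_truncation(gamma, r_max, tol):
    """Smallest T whose tail bound is at most tol (requires 0 < gamma < 1)."""
    return math.ceil(math.log(tol * (1.0 - gamma) / r_max) / math.log(gamma))

# Illustrative values only: the bound is far below double-precision
# machine epsilon (about 2.2e-16).
bound = tail_bound(0.9, 500, 1.0)
print(bound < sys.float_info.epsilon)
```

The helper `min_truncation` inverts the same bound, which is one way to pick a truncation point for a target tolerance.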

After solving the truncated relaxation problem, we need a final piece before implementing a fluid-balance policy: the prioritization score. We follow the Whittle index ordering here, ranking states from high to low as state 2, state 1, state 0, state 3.
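To make the role of this ranking concrete, here is a minimal sketch of a plain priority rule that spends the budget on states in the stated order. It is not the fluid-balance policy itself, which additionally enforces the consistency rule of Algorithm 1; the name `priority_pulls` and the example counts are hypothetical.

```python
def priority_pulls(counts, budget, ranking=(2, 1, 0, 3)):
    """Activate arms state by state, highest priority first.

    counts: dict mapping state -> number of arms currently in that state.
    ranking: states ordered from highest to lowest priority.
    """
    pulls = {state: 0 for state in ranking}
    remaining = budget
    for state in ranking:
        k = min(counts.get(state, 0), remaining)
        pulls[state] = k
        remaining -= k
        if remaining == 0:
            break
    return pulls
```

With counts `{0: 5, 1: 2, 2: 1, 3: 4}` and a budget of 4, this rule activates the single arm in state 2, both arms in state 1, and one arm in state 0.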

Figure 2 compares the performance of the Whittle index and fluid-balance policies. We use simulation with 2000 independent replications to estimate the performance of each policy, reporting the sample mean and confidence intervals. As we can see, the fluid-balance policy outperforms the Whittle index in both the small-N and large-N regimes.
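For reference, a sample mean and confidence interval can be computed from independent replications as sketched below. The confidence level is not stated here, so the 95% normal-approximation interval (z = 1.96) is an assumption, and `mean_and_ci` is a hypothetical helper name.

```python
import math
import statistics

def mean_and_ci(samples, z=1.96):
    """Sample mean and normal-approximation confidence interval.

    samples: total discounted rewards, one per independent replication.
    z: normal quantile; 1.96 corresponds to an assumed 95% level.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)
```

With 2000 replications the normal approximation is typically reasonable, and the half-width shrinks at rate 1/√n in the number of replications n.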

Figure 2: Performance comparison between the Whittle index and fluid-balance policies. The left panel shows the average reward per arm versus the number of arms (N), where we compare the upper bound from the linear programming relaxation with the performance of the Whittle index and fluid-balance policies as estimated via simulation. The right panel shows an upper bound on the opt gap (the upper bound from the linear programming relaxation minus a simulation-based estimate of the expected total discounted reward) versus the number of arms (N). The Whittle index’s opt gap grows linearly with N, consistent with a lack of asymptotic optimality, while the fluid-balance policy’s opt gap grows sublinearly, consistent with asymptotic optimality and our result from Theorem 6 that its opt gap is O(√N).

8 Conclusion and Future Work

In this paper, we propose a class of policies, called fluid-balance policies, which achieve an O(√N) opt gap universally as the number of arms N grows large. Unlike the Whittle index policy, fluid-balance policies do not require an indexability condition to be well-defined, and our results show they are asymptotically optimal without the need for difficult-to-verify sufficient conditions.

Although we restrict our analysis to restless bandits, we believe the techniques and insights we develop here can be generalized to multi-action multi-resource restless bandits, also known as weakly coupled Markov Decision Processes. Another interesting direction for future work would be characterizing a non-degeneracy condition, similar to that of Zhang and Frazier (2021), sufficient for a fluid-balance policy to achieve an O(1) optimality gap in an infinite-horizon restless bandit.

References


  • P. Ansell, K. D. Glazebrook, J. Nino-Mora, and M. O’Keeffe (2003) Whittle’s index policy for a multi-class queueing system with convex holding costs. Mathematical Methods of Operations Research 57 (1), pp. 21–39. Cited by: §2.1.
  • A. Biswas, G. Aggarwal, P. Varakantham, and M. Tambe (2021) Learn to intervene: an adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965. Cited by: §7.2, §7.2, §7.
  • D. B. Brown and J. E. Smith (2020) Index policies and performance bounds for dynamic selection problems. Management Science. Cited by: §1, §1, §1, §2.2, §2.2, §2.2, §4.
  • X. Chen, Q. Lin, and D. Zhou (2013) Optimistic knowledge gradient policy for optimal budget allocation in crowdsourcing. In International Conference on Machine Learning, pp. 64–72. Cited by: §1.
  • V. F. Farias and R. Madan (2011) The irrevocable multiarmed bandit problem. Operations Research 59 (2), pp. 383–399. Cited by: §9.1.
  • J. Fu, Y. Nazarathy, S. Moka, and P. G. Taylor (2019) Towards q-learning the whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pp. 249–254. Cited by: §7.2, §7.2, §7.
  • J. Gittins, K. Glazebrook, and R. Weber (2011) Multi-armed bandit allocation indices. John Wiley & Sons. Cited by: §2.1.
  • K. D. Glazebrook, D. Ruiz-Hernandez, and C. Kirkbride (2006) Some indexable families of restless bandit problems. Advances in Applied Probability 38 (3), pp. 643–672. Cited by: §2.1.
  • S. Guha, K. Munagala, and P. Shi (2010) Approximation algorithms for restless bandit problems. Journal of the ACM (JACM) 58 (1), pp. 3. Cited by: §2.1.
  • S. Guha and K. Munagala (2007) Approximation algorithms for budgeted learning problems. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pp. 104–113. Cited by: §2.1.
  • S. Guha and K. Munagala (2008) Sequential design of experiments via linear programming. arXiv preprint arXiv:0805.2630. Cited by: §2.1, §9.1.
  • W. Hu and P. Frazier (2017) An asymptotically optimal index policy for finite-horizon restless bandits. arXiv preprint arXiv:1707.00205. Cited by: §1, §2.2, §2.2, §4.
  • P. Jacko and J. Nino-Mora (2007) Time-constrained restless bandits and the knapsack problem for perishable items. Electronic Notes in Discrete Mathematics 28, pp. 145–152. Cited by: §2.1.
  • J. Le Ny, M. Dahleh, and E. Feron (2006) Multi-agent task assignment in the bandit framework. In Proceedings of the 45th IEEE Conference on Decision and Control, pp. 5281–5286. Cited by: §1.
  • J. Le Ny, M. Dahleh, and E. Feron (2008) Multi-uav dynamic routing with partial observations using restless bandit allocation indices. In 2008 American Control Conference, pp. 4220–4225. Cited by: §2.1, §2.1.
  • K. Liu and Q. Zhao (2008) A restless bandit formulation of opportunistic access: indexablity and index policy. In 2008 5th IEEE Annual Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks Workshops, pp. 1–5. Cited by: §1, §2.1, §2.1.
  • K. Liu and Q. Zhao (2010) Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access. IEEE Transactions on Information Theory 56 (11), pp. 5547–5567. Cited by: §2.1, §2.1.
  • R. Meshram and K. Kaza (2020) Simulation based algorithms for markov decision processes and multi-action restless bandits. arXiv preprint arXiv:2007.12933. Cited by: §1, §2.1.
  • K. Nakhleh, S. Ganji, P. Hsieh, I. Hou, and S. Shakkottai (2021) NeurWIN: neural whittle index network for restless bandits via deep rl. In Thirty-Fifth Conference on Neural Information Processing Systems, Cited by: §1, §2.1.
  • J. Nino-Mora (2001) Restless bandits, partial conservation laws and indexability. Advances in Applied Probability 33 (1), pp. 76–98. Cited by: §2.1.
  • J. Niño-Mora (2007) Dynamic priority allocation via restless bandit marginal productivity indices. Top 15 (2), pp. 161–198. Cited by: §2.1.
  • W. B. Powell (2007) Approximate dynamic programming: solving the curses of dimensionality. Vol. 703, John Wiley & Sons. Cited by: §1, §3.
  • R. T. Rockafellar (1970) Convex analysis. Princeton University Press, Princeton, NJ. Cited by: §9.1.
  • I. M. Verloop (2016) Asymptotically optimal priority policies for indexable and nonindexable restless bandits. The Annals of Applied Probability 26 (4), pp. 1947–1995. Cited by: §2.1.
  • K. Wang, S. Shat, H. Chen, A. Perrault, F. Doshi-Velez, and M. Tambe (2021) Learning mdps from features: predict-then-optimize for sequential decision problems by reinforcement learning. arXiv preprint arXiv:2106.03279. Cited by: §2.1.
  • R. R. Weber and G. Weiss (1990) On an index policy for restless bandits. Journal of Applied Probability 27 (3), pp. 637–648. Cited by: §1, §2.1.
  • P. Whittle (1980) Multi-armed bandits and the gittins index. Journal of the Royal Statistical Society: Series B (Methodological) 42 (2), pp. 143–149. Cited by: §1, §1, §1, §1, §2.1, §2.1, §2.1, §9.5.
  • G. Zayas-Caban, S. Jasin, and G. Wang (2019) An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Advances in Applied Probability 51 (3), pp. 745–772. Cited by: §2.2, §2.2, §4.
  • X. Zhang and P. I. Frazier (2021) Restless bandits with many arms: beating the central limit theorem. arXiv preprint arXiv:2107.11911. Cited by: §1, §1, §2.2, §2.2, §2.2, §4, §5, §8.

9 Appendix

This section provides the technical proofs omitted from the main paper.

9.1 Proof of Lemma 4

In the original Problem (3), the budget constraint restricts the number of arms actually pulled in each period, whereas the relaxed Problem (4) only requires the constraint to hold in expectation. A wider class of policies is therefore feasible in the relaxed problem, which implies


To prove , we use a Lagrangian relaxation argument similar to those of Farias and Madan (2011) and Guha and Munagala (2008).

By a straightforward adaptation of the proof of the Fenchel duality theorem (Rockafellar 1970),


where .

The left-hand side of Equation (8) equals . On the right-hand side, for fixed ,

Since all arms share the same transition kernel and reward function,

So we conclude


Applying Fenchel duality again to the single-arm problem,


Combining Equations (8), (9) and (9.1),

9.2 Discussion of the rounding error in budget constraints

We show that a rounding error in the budget of the relaxed Problem (4) changes the optimal objective value by at most a constant. Formally, denote

Then , where does not depend on . Thus, all of our asymptotic analysis of the opt gap continues to hold, since the rounded LP relaxation upper bound differs from the unrounded one by at most a constant.

The proof of the above statement is straightforward. As seen from Lemma 4, there exists a single-arm pulling strategy that pulls arms per period in expectation and achieves objective value . Thus, we can pull arms according to this strategy and pull the one remaining arm with probability at period . This shows

Similarly, we can show

Combining the above two inequalities yields the statement.

9.3 Proof of Lemma 5

To prove Lemma 5, first notice


where is the indicator function of event . Writing the dynamic equation (11) in vector form,

where is the sum of independent -dimensional Bernoulli random variables with mean . For simplicity, we denote in the following proof. With this new notation,