Scalable Operator Allocation for Multi-Robot Assistance: A Restless Bandit Approach

In this paper, we consider the problem of allocating human operators in a system with multiple semi-autonomous robots. Each robot is required to perform an independent sequence of tasks, subjected to a chance of failing and getting stuck in a fault state at every task. If and when required, a human operator can assist or teleoperate a robot. Conventional MDP techniques used to solve such problems face scalability issues due to exponential growth of state and action spaces with the number of robots and operators. In this paper we derive conditions under which the operator allocation problem is indexable, enabling the use of the Whittle index heuristic. The conditions can be easily checked to verify indexability, and we show that they hold for a wide range of problems of interest. Our key insight is to leverage the structure of the value function of individual robots, resulting in conditions that can be verified separately for each state of each robot. We apply these conditions to two types of transitions commonly seen in remote robot supervision systems. Through numerical simulations, we demonstrate the efficacy of Whittle index policy as a near-optimal and scalable approach that outperforms existing scalable methods.



There are no comments yet.


page 1

page 7


Scaled Autonomy: Enabling Human Operators to Control Robot Fleets

Autonomous robots often encounter challenging situations where their con...

Learn and Transfer Knowledge of Preferred Assistance Strategies in Semi-autonomous Telemanipulation

Increasing the autonomy level of a robot hand to accomplish remote objec...

Traversing Supervisor Problem: An Approximately Optimal Approach to Multi-Robot Assistance

The number of multi-robot systems deployed in field applications has inc...

A Task Allocation Approach for Human-Robot Collaboration in Product Defects Inspection Scenarios

The presence and coexistence of human operators and collaborative robots...

From Drinking Philosophers to Wandering Robots

In this paper, we consider the multi-robot path execution problem where ...

Adaptive Workload Allocation for Multi-human Multi-robot Teams for Independent and Homogeneous Tasks

Multi-human multi-robot (MH-MR) systems have the ability to combine the ...

Feasible Region-based Identification Using Duality (Extended Version)

We consider the problem of estimating bounds on parameters representing ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Advances in robot autonomy have lead to a decrease in the necessity of strict human supervision of robots. This has enabled the development of human-robot collaborative systems where the task is primarily executed by a number of semi-autonomous robots requiring intermittent assistance from a human teammate, either in event of a fault [1, 2] or to further increase performance of the multi-agent teams in warehouse operation [3], search-and-rescue [4] or in a social setting [5]. However, identifying which robot to assist in an uncertain environment is a challenging task for human operators [6, 3]. Moreover, as the number of robots increases, it becomes challenging for the operators to maintain awareness of every robot, which cripples system’s performance [7, 8]. Therefore, human operators can benefit from having a decision support system (DSS) that advises which robots require attention and when [9, 3].

In this paper, we present such a DSS for a multi-robot system comprising a fleet of semi-autonomous robots with multiple human operators available for assistance if and when required. Figure 1 presents an overview of the problem setup, showing

robots navigating in a city block-like environment, moving from start to goal locations. While navigating, a robot passes through a series of waypoints, each characterized by a different probability of success in progressing to the next waypoint. There is also a possibility that the robot may fail at a task and get stuck in a fault (error) state from which assistance from a human operator is required to continue. There are

identical human operators available (), each of whom can assist/teleoperate at most one robot at a time. While being assisted by an operator, robots have different probabilities of success and failure, and to get out of a fault state.

Figure 1: Overview of the multi-robot assistance problem for robots navigating in an environment. A number of mobile robots are tasked to navigate through a series of waypoints. Operators are allocated to the robots when required. This is done by separating the complete -robot problem into single robot problems and computing the Whittle index heuristic. Given the current state of the system, this heuristic can be used to efficiently compute the operator allocation.

It is possible to solve the above problem by modelling it as a Markov Decision Process (MDP) [10]

. However, such formulations suffer from the curse of dimensionality, making conventional MDP-based solution techniques scale poorly with problem size 

[11]. Moreover, the policy needs to be re-solved every time a robot or an operator enters or leaves the system.

The most closely related work to our problem is presented in [3], where the authors discuss single-operator multi-robot supervision systems. An advising agent guides the operator on which robot they should assist. The problem is solved using an -step look-ahead (myopic) approach, which provides an efficient and practical solution, but suffers from scalability issues with increasing number of operators and the look-ahead steps. Researchers have also discussed deterministic versions of the problem, where exact outcomes of robots’ actions and times for fault occurrences are known, and the allocation policy is determined using a sampling-based planner [12]. In [13], the authors present an approximation algorithm for a similar scheduling problem. These approaches however do not apply to the stochastic problems.

In this paper, we show how an index-based policy can provide a scalable and better performing solution than the myopic approach for the given multi-robot multi-operator allocation problem under stochastic transitions. Specifically, our work makes the following contributions:

1) We show that the operator allocation problem with multiple independent robots can be formulated as an instance of the Restless Multi-Armed Bandit (RMAB) problem. We leverage this formulation to obtain a Whittle index-based policy that scales linearly with the number of robots and is independent of the number of operators.

2) We derive simple conditions to verify indexability of the model.These conditions can be checked independently for each state of each robot, thus providing a method that scales linearly with the size of the problem. This method can be applied to systems with any number of states and does not require the optimal policy to be of a threshold type.

3) We then implement our approach in two practical scenarios and present numerical experiments. The results show that the proposed method provides near-optimal solutions and outperforms existing efficient solution approaches, namely the reactive policy, - and -step myopic policies, and the benefit-maximizing policy.

1.1 Background and Related Work

The problem of allocating operators in a multi-robot team bares similarities with the disciplines of multi-robot supervision, task scheduling and queuing theory. In this section, we briefly review the related research, followed by an introduction of a Restless-Multi Armed Bandits.

In the literature, several studies discuss the problem of enabling human operators to assist multiple robots such as a team of navigating robots, a fleet of multiple UAVs, or a team performing search and rescue operations [14, 15]. To understand and improve human supervision, researchers have used frameworks such as sliding autonomy to incorporate various human-robot team capabilities (like coordination and situational awareness) [16, 17]. Some studies also present interaction interfaces to facilitate and improve such supervision [18, 19]. The problem of assisting a number of independent robots, has also been studied under a learning framework. The approach presented in [2] learns the decision-making model of human operator from recorded data and tries to replicate that behaviour, optimizing based on the operator’s internal utility function. In contrast, the problem presented in this paper is designed to optimize a global performance metric assuming the knowledge of success and failure rates of robots with and without an operator allocated to them.

Such knowledge can be estimated using recorded data similar to the work presented in

[3]. For the scope of this paper, we will assume this knowledge takes the form of known transition probabilities.

In the queuing discipline, several studies have investigated the effects of different queuing techniques [20] or threshold-based strategies [21] to prioritize operator’s attention to the robots. However, the model that we study is different from a queuing model as it is possible for the robots to complete their tasks without the help of operators, and for the operators to be allocated to robots not stuck in a fault state.

The multi-target–multi-agent problems form another class of problems similar to the operator allocation problem. These problems deal with allocation of multiple agents to a number of targets aiming to detect or follow the targets under certain constraints [22, 23]. However, our problem setup is different because the behaviour of the targets (robots) changes with the allocation of agents (operators) and it is not possible to allocate multiple agents to a single target at once. Moreover, our problem presents a collaborative task, where both the robots and operators are working to achieve a common goal.

Restless Multi-Armed Bandit

Restless Multi-Armed Bandits (RMAB), first introduced in [24], is a generalization of Multi-Armed Bandits (MAB) [25] which has been previously used in problems like assisting human partners [26] and distributing resources among human teammates [27]. RMAB is a class of scheduling problems where limited resources have to be allocated among several alternative choices. Each choice, referred to as an arm, is a discrete-time controlled Markov process which evolves depending on the resource allocated to it. RMAB framework has been applied to problems in stochastic scheduling, patrol planning, sensor management and resource allocation in general [28].

Finding the optimal policy for RMAB suffers from the curse of dimensionality as the state space grows exponentially with the number of arms. In general, obtaining the optimal policy in an RMAB is PSPACE-hard [29]. However, the Whittle index policy offers a simpler and scalable alternative to the optimal policy. Even though the Whittle index policy does not guarantee an optimal solution, it minimizes expected cost for a relaxed problem under time-averaged constraint [24]. This approach is shown to work quite well for several scheduling and resource allocation problems [30, 31, 32]. A few studies have also implemented index-based methods to solve a sensor scheduling problem [33] or to serve a number of users transmitting a queue of data packets through a channel [31]. Therefore, it is a reasonable approach to solve an RMAB given that the problem satisfies a technical condition known as Indexability (more details in Section 3). Unfortunately, it is difficult to verify this condition in general and there is no universal framework that applies to all problems. Existing methods proposed for verifying indexability have been investigated for specific systems such as two state restless bandits [34, 35] or restless bandits with optimal threshold-based policy [34, 36, 37].

1.2 Organization

The contents of this paper are organized as follows: The multi-robot assistance problem is presented in Section 2. We discuss the general Restless Bandit Problem and define property of indexability in Section 3. In Section 5, we present two practical classes of transition functions and establish conditions under which problem indexability is ensured. In Section 6, we cover the calculation of Whittle index heuristic and present an efficient policy for the problem. Next, we present simulations of the problem in Section 7 to examine validity and performance of the presented policy. The paper ends with a brief discussion and conclusion.

2 Multi-Robot Assistance Problem

Consider a decision support system (DSS), consisting of a team of human operators supervising a fleet of semi-autonomous robots. Each robot is required to complete a sequence of tasks to reach its goal. We will use a fleet of robots delivering packages in a city as a running example but similar interpretations hold for other applications mentioned in previous sections (e.g., robots reaching a sequence of configurations [12]). In this case, the robot’s trajectory would correspond to a series of waypoints that a robot needs to navigate to reach its destination (goal location). At each waypoint, a robot can either operate autonomously or be teleoperated by one of the human operators. We assume that all human operators are identical in the way they operate the robots and that a human operator can operate at most one robot at a time. We now provide a mathematical model for different components of the system111 Remark on notation:

Throughout this paper, we use calligraphic font to denote sets and roman font to denote variables. Uppercase letters are used to represent random variables and the corresponding lowercase letters represent their realizations. Bold letters are used for variables pertaining to multi-robot system while light letters represent corresponding single-robot variables.


2.1 Model of the robots

It is assumed that when operating autonomously, each robot uses a pre-specified control algorithm to complete its task. For the delivery robot example, this could be, for instance, a SLAM-based local path planner that the robot uses for navigating between the waypoints. We will not model the details of this control algorithm but simply assume that this control is imperfect and occasionally causes the robot to enter a fault state while doing a task (e.g., delivery robot getting stuck in a pothole or losing its localization). We model this behaviour by assuming that while completing each task, the robot may be in one of the two internal states: a normal state (denoted by ) or a fault state (denoted by ). When a robot is being teleoperated, it may still be possible for it to enter into a fault state.

The operating state of robot at time , denoted by , is tuple of its current task and internal state. The state space for robot is given by

where the terminal state indicates that all tasks have been completed. The state space for all robots is denoted by .

The state of a robot evolves differently depending on whether it is operating autonomously (denoted by mode ) or teleoperated (denoted by ). Given robot in state operating in mode , let denote the probability of successfully completing the current task at the current time step and let denote the probability of toggling the current internal state (i.e. going from normal to fault state and vice-versa). A diagram describing these transitions is shown in Fig. 2. Note that the terminal state is an absorbing state, so and .

Figure 2: State-transition diagram for robot working on task , where in the robot is in the normal state , and in the robot is in the fault state . Transitions can occur between and , and the probabilities change with operating mode .

There is a per-step cost , where denotes the cost of operating robot in mode when the robot is in state . Note that the per-step cost is zero in the terminal state, i.e, .

2.2 Model of the decision support system (DSS)

There is a decision support system that helps to allocate operators to the robots. At each time the decision support system observes the operating state of all robots and picks at most robots to teleoperate. We capture this by the allocation , where


The allocation is selected according to a time-homogeneous Markov policy . The expected total cost incurred by any policy is given by


where is the discount factor and is the initial state with for every .

2.3 Problem objective

We impose the following assumptions on the model:

  1. Given an allocation by the DSS, the operating states of the robots evolve independently of each other.

  2. For every robot , the probability of getting out the faulty internal state when teleoperated is strictly greater than , i.e., .

  3. Under autonomous operation, a robot stays in the fault state, i.e., .

The design objective is to solve the following optimization problem:

Problem 1.

Given the set of robots, the system dynamics and the per-step costs, the number of human operators, and the discount factor , choose a policy to minimize the total discounted cost given by (2).

Optimal solution for Problem 1 can be found by modelling it as a Markov decision process and solving using dynamic programming [10]. However, the sizes of state and action spaces of the resulting model grows exponential with the number of robots and operators. Thus, solving Problem 1 using dynamic programming becomes intractable for larger systems. To address this, we model Problem 1 as a restless multi-armed bandit (RMAB) problem and use the notion of indexability to find an efficient and scalable policy. We start by providing an overview of RMAB in the next section.

3 Overview of Restless Multi-armed bandits

In this section we provide an overview of Restless Multi-Armed Bandits (RMAB), indexability and the Whittle index policy.

3.1 Restless Bandit Process

A restless bandit (RB) process is a controlled Markov process  where is the state space, is the action space, is the transition probability function, is the per-step cost function, and is the initial state. By convention, action  is called the passive action and action  is called the active action.

3.2 Restless Multi-armed Bandit Problem

A Restless Multi-armed Bandit (RMAB) is a collection of independently evolving RBs , . Each process is conventionally called an arm. A decision-maker selects at most arms () at each time instance. Let and denote the state of arm  and the action chosen for arm  at time . Let and where

denote the states and actions of all arms. As the dynamics of each arm are independent, we have

The instantaneous cost of the system is the sum of costs incurred by each arm. The performance of any time homogeneous Markov policy is measured by


where denotes the discount factor. Finally, the RMAB optimization problem is as follows:

Problem 2.

Given a discount factor , a collection of arms , and the number  of arms to be chosen at each time, choose a policy that minimizes .

As discussed earlier, to tackle the scalability issues of dynamic programming-based solutions, the Whittle index policy is one of the commonly used heuristic to solve a RMAB problem [24]. This policy is computationally efficient and it readily generalizes to the setting where or changes over time. Next, we present the required definitions.

3.3 Indexability and the Whittle index policy

In this section, we restrict our discussion to a single arm and therefore omit the superscript for the ease of notation. Consider an arm  where, for some penalty , modify the per-step cost as


Then the performance of any given time-homogeneous Markov policy  is given by


Now consider the following auxiliary problem:

Problem 3.

Given an arm , the discount factor  and the penalty , choose a Markov policy to minimize given by (5).

Problem 3 is a Markov decision process. Let us denote the optimal policy of Problem 3 by . It is assumed that the optimal policy picks passive action at any state where both the active and passive actions result in same expected cost. Next, define passive sets and indexability.

Definition 1 (Passive set).

Given , the passive set  is the set of states where passive action is prescribed by , i.e.,

Definition 2 (Indexability).

An arm is indexable if is non-decreasing in , i.e., for any ,

A RMAB problem is indexable if all arms are indexable.

Definition 3 (Whittle index).

For an indexable arm, the Whittle index of the state  of an arm is the smallest value of for which state  is part of , i.e.,


Equivalently, the Whittle index is the smallest value of for which is indifferent between the active action and passive action when the arm is in state .

The Whittle index policy is as follows: At each time, compute the Whittle indices of the current state of all arms and select the arms in states with highest Whittle indices (provided they are positive).

4 Indexability of the assistance problem

Problem 1 can be formulated as an instance of RMAB, where each robot corresponds to an arm. Under such a formulation, the state of arm corresponds to operating state of robot . The transition function corresponds to the robot state evolution shown in Fig. 2 and the cost function corresponds to the associated per-step cost . In addition, allocating an operator to robot corresponds to choosing the active action for that arm while autonomous operation corresponds to choosing the passive action. This motivates using the Whittle index policy to solve Problem 1. However, before we can implement this approach, we must check for indexability of the problem. As discussed earlier, there is no universal framework to verify indexability of a problem. Moreover, the optimal policy for the given problem does not show any threshold-based behaviour. Therefore, we determine sufficient conditions for indexability from first principles by using properties of the value function of each individual arm.

Since indexability has to be checked for each arm separately, for this analysis, we drop the superscript  from all variables.

Let be the unique fixed point of the following equation



represents the Q-value of taking action in state . Here the transition function denotes the probability of transition from state to state under action and is represented by Fig. 2. Let be the corresponding optimal policy

To ensure uniqueness of the , we follow the convention that when , the passive action is chosen. Let be the passive set given penalty  and be the Whittle index of state  for the problem of operator allocation in a single-robot system. Furthermore, define the Benefit function as


Then, a sufficient condition for indexability is as follows:

Lemma 1.

A sufficient condition for Problem 1 to be indexable is that the benefit function for each robot is monotonically increasing in for all states .


The result follows from the observation that using (8) and Def. 1, we can re-write the passive set as


Thus, monotonicity of the benefit function implies that the condition for indexability given in Def. 2 is satisfied.

We verify the monotonicity of by bounding the value function and establish the following:

Theorem 1.

Let denote the probability of repeating a task under mode with internal state . Define and as follows:


Then, the single-robot problem is indexable if for all :


See Appendix. ∎

The multi-robot problem is indexable if the conditions given in Theorem 1 hold true for all robots. In the next section, we present specific instances of the general model described in Section 2 which are indexable and discuss their relevance in practical assistance problems for (semi)autonomous delivery robots.

5 Special Cases: Robot transitions in the city

This section presents two specific classes of transition functions which represent two types of faults commonly occurring in systems with remote navigating robots.

5.1 Transition Type-1 : Faults with continuation

Consider the following transition behaviour along a robot’s waypoints. At each time step, the robot moves to its next waypoint with a probability representing, for example, the crowd in the area. There is also a probability of getting into a fault state such as encountering an unidentifiable obstacle. A human operator can teleoperate the robot to its next waypoint both from a normal or fault state. Such transitions represent faults where the robot is functioning properly but is unsure about how to proceed due to uncertainty in its surroundings. Thus the probability of success when being teleoperated is the same regardless of whether the robot is in its normal state or stopped in the fault state, i.e., and . The corresponding transition dynamics are shown in Figure 3. Note that in this case .

Figure 3: State-transition probabilities under the autonomous operation and teleoperation for type-1 transitions.

In this case, the coefficients and can be simplified to the following expressions:


Note that

Thus, and satisfy the sufficient condition of Theorem 1 for all allowed values of transition probabilities and the discount factor . Therefore, any robot following the Type-1 transitions is indexable.

5.2 Transition Type-2 : Faults with reset

Consider another type of transition where the robot can get into a fault state and needs error fixing while staying at its next waypoint. This includes scenarios such as losing localization or getting stuck in a minor obstacle. The human operator can try to assist the robot out of that situation by fixing the fault, resetting it back to its current waypoint (assuming the system is equipped with means to do so). Such transitions will mean that the probabilities and the corresponding transition dynamics are shown in Fig.4:

Figure 4: State-transition probabilities under the autonomous and assist/teleoperate actions for Type-2 transition dynamics.

Substituting the values of transition probabilities from Fig. 4 to the expressions of and , the coefficients can be simplified to the following:


Note that here is the same as (11) and, therefore, satisfies (10). For to satisfy (10), the condition results in the following condition on :


As , (13) also yields the following condition on :


Therefore, any robot state following the Type-2 transitions will satisfy the condition of indexability in Theorem 1 if (13) and (14) are satisfied, i.e., the probability that the robot transitions from a normal state to fault state during autonomous operation is not too high and the probability that the operator brings the robot from a fault state to a normal state is not too small.

As an example, consider a robot following Type-2 transitions with and . In this setting, any satisfies (14) and any satisfies (13). Thus, the model is indexable if there is at least a chance that teleoperation successfully resets the robot from the fault state to a normal state.

6 Computation of Whittle Index

As discussed in Section 1.1, once the indexability of the problem instance has been verified, we can compute Whittle indices for all robots and determine the Whittle index policy.

General approaches of computing Whittle indices are either based on adaptive greedy algorithm [38, 32] or binary search [36]. In this section, we briefly provide details on adaptive greedy algorithm and describe how the Whittle index policy works. Readers are encouraged to refer to [32] for a detailed explanation and validation of the algorithm. The algorithm is presented in Alg. 1 for computing Whittle indices for a single robot.

1:  Input: Robot .
2:  Initialize .
3:  while  do
4:     Compute , using Eq. (15).
7:     ,
9:  end while
Algorithm 1 Adaptive Greedy Algorithm for Whittle Index Computation

The algorithm operates as follows: For any subset

, define the policy vector

222 In the following expressions is used as a vector of size , constructed as a mapping from each state to corresponding action . as

Also define, , the cost vector for all states under a policy , and , the transition matrix under policy .

Then, in each iteration of the while loop, compute as follows:



The minimum value of calculated in line 5 in Alg. 1 corresponds to the Whittle indices of the minimizing states (Line 6). These states are then taken out of consideration in the next iteration of the while loop by including them in the passive set . When the Alg. 1 exits the while loop, the Whittle indices for all states of that robot are calculated. This procedure is then repeated for all the robots in the system.

Once the Whittle indices for all states of all robots are obtained, the Whittle index policy can be implemented as given in Alg. 2. In line 2 of the algorithm, the function returns indices of top positive elements in a set, where ties are broken randomly. As determined in [32], the computational complexity of this method is . In contrast, the computational complexity of finding the optimal policy for Problem 1 is using value iteration, where is the size of state space of individual robot.

1:  Input: Set of Whittle indices for all and , No. of Operators
3:   for all
4:   for all   // Allocate operators
5:  return  
Algorithm 2 Whittle Index Policy

7 Simulations and Results

In this section, we present performance results for a simulated multi-robot assistance problem under the following policies (described later): 1) Optimal policy, 2) Index policy, 3) Benefit maximizing Policy, 4) Myopic Policy, and 5) Reactive Policy. The problem and the solution frameworks for all policies were implemented using POMDPs.jl library in Julia [39].

7.1 Simulation Setup

For the simulations, a city map is generated as shown in Fig. 5 where the map is randomly divided into different zones corresponding to one of the two transition types presented in Section 5.

Figure 5: An instance of multi-robot assistance problem for robots navigating in a city block-like environment. Transition zones are marked by different color shadings. The green and blue areas represent type-1 and type-2 transitions respectively, as described in Section 5. Three robots are shown navigating to their corresponding goal locations, via a sequence of waypoints shown as black circles. These waypoints may lie in different transition zones resulting in varied performance for the robots.

The exact values of transition probabilities at different locations in the map are sampled randomly from a uniform distribution, according to Table 

1. The bounds on transition probabilities and for transition type-2 are determined by (13) and (14) respectively.

Probability Type-1 Type-2
Table 1: Probabilities values used for simulations for the two types of transitions.

For the teleoperation problem, we use the following cost structure:


with for any . This cost function captures the time that a robot takes to reach its goal, i.e., zero cost on reaching goal state, non-negative costs for the intermediate states, and a fixed additional cost of while being assisted.

Values of the different costs and the discount factor used are sampled from ranges specified in Table 2.


For simplicity, for a given parameter, same range is used for every state of all robots. Therefore, we remove the superscript .

Table 2: Parameter values used in the simulated assistance task.

Separate tests were performed to test the validity, performance and scalability of the Index policy. At the beginning of each simulation, a number of robots are placed on the map (ranging from to ) with randomly generated start and goal locations, and waypoints uniformly placed between the two. In practice, these waypoints are generated by a separate robot path planner for each individual robot, and are considered as an input for the operator allocation problem.

7.2 Baseline Policies

We consider the following baseline policies to assess the performance of the Index policy.

Optimal policy

The Optimal policy , as defined by (2), is found by encoding the complete problem with all robots as an MDP and solving it using the Sparse Value Iteration Solver from the POMDP.jl library.

Reactive policy

The reactive policy simply allocates an operator to any robot stuck in a fault state. If there are more such robots than operators, a random subset of those robots is selected.

Myopic policy

Myopic/Greedy Policies are commonly used to obtain fast (but sub-optimal) solutions to intractable problems. In this paper, we implement an -step myopic policy presented in [3] for . Define as the expected cost incurred by the system from current time step to infinity under passive actions. The -step myopic policy is then defined as


where the -step look-ahead cost is given by


is the cost incurred in the current time step.

Benefit maximizing policy

The Benefit maximizing policy is inspired by the advantage function

used in reinforcement learning, see for example 

[40]. This policy considers the benefit or advantage of taking the active action over the passive action for each robot and picks the robots with highest benefit at each time step, i.e.,


where corresponds to defined in (8) with .

7.3 Comparison with the Optimal policy

First, the Index policy is compared against the Optimal policy to validate its applicability for our problem. Due to its poor scalability, the Optimal policy cannot be computed for larger problem instances, therefore this test is limited to smaller problem size (up to robots and operators). The relative performance (ratio of the cost incurred under Optimal policy to that under Index policy) is shown in Fig. 6. For comparison, problem instances were tested under both policies and were simulated through Monte Carlo rollouts. Each problem instance is run until all robots reach their respective goal locations. This is repeated times and average cost is recorded.

Figure 6: Relative performance of Index policy compared to the optimal policy. The plots show distribution of indexable problem instances based on their performance under the two policies. Relative performance is calculated as the ratio of the cost incurred under optimal policy to that under index policy.

It is observed that the Index policy performs quite close to the optimal policy for all test cases. As the ratio of number of robots to number of operators increases, the Index policy starts to degrade in comparison to the optimal. However, the relative cost still remains within of the optimal. Also, note that the optimal policy minimizes the expectation of the cost incurred. The graph shows distribution of the relative performance, which is centered around 1. Therefore, we still see some test runs where the rollout resulted in a slightly lesser cost for the index policy. (see Fig. 6). However, as number of test runs are increased the average relative performance approaches .

7.4 Comparison with other baseline policies

Next, we compare the performance (measured as average cost incurred per robot before reaching its goal) of the Index policy with the three baseline policies on larger problem instances. For the comparison, a set of problem instances is created, each with a set of waypoints with randomly sampled transition probabilities according to Table 1. Each instance of the problem is then simulated separately under the different policies using Monte Carlo rollouts until all robots reach their goal states, repeated for 500 iterations. Each simulation iteration (rollout) is timed out at seconds for each policy. If an iteration takes longer than this time, the simulations are interrupted and the result for that test condition is not reported.

Figure 7 shows performance comparison of the four policies. The Index policy performs best out of the four policies, followed by the benefit maximizing policy () and the myopic policy (). The Reactive policy performed the worst as expected. As a side note, the average cost incurred per robot under any policy is strongly correlated with the ratio of number of robots to the number of available operators. This observation supports the intuition that as human operators are required to distribute their assistance among more robots, their effectiveness decreases.

Figure 7:

Performance comparison of the four policies for different number of operators available for allocation. Error bars in the plots show one standard-deviation above and below the average. Note that in larger problem instances, the simulations for the myopic policies timed out and could not be completed in the specified time limit (10 seconds per rollout).

7.5 Scalability

Table 3 shows the time that each policy takes to compute operator the allocation under different problem sizes. For these simulations, each robot is set to have waypoints. As observed from Table 3, the online computation times of the two Myopic policies scale exponentially with both the number of robots and the number of operators, with the time for -step Myopic policy growing at a much higher rate.

The computation times for the Index and Benefit maximizing policies scale linearly with the number of robots and are independent of the number of operators. Also, note that the Whittle index computation for one robot is independent from the rest. Therefore robots can be added/removed without re-computation of already computed indices. Furthermore, if the number of operators change to , the policy simply allocates operators to the robots with the highest Whittle indices. As a result, the policy is efficiently scalable with the number of robots and operators. For reference, the simulations were run on a Desktop PC with a core, GHz processor and GB of RAM.

Operators/ Robots Index Policy -step Myopic -step Myopic Benefit Maximizing
Table 3: Computation times of different policies (seconds)

8 Conclusions and Discussion

In this paper, we provide an analysis of operator allocation problem for a multi-robot assistance task and demonstrate the effectiveness of Restless Bandit framework to obtain a scalable policy. This policy is based on Whittle index heuristic and performs close to the optimal and significantly better than other efficient solution approaches. We also provide an analysis of indexability of such problems and give a simplified condition to quickly verify if a problem instance is indexable. These results can also be used to specify required transition behavior in the form of bounds on transition probabilities.

There are, however, a few limitations of the proposed approach. When a problem instance is not indexable, the Whittle indices are not defined and the methods of computing these indices may not give meaningful values. Therefore, the proposed approach is not applicable in such cases. Also, note that the conditions for indexability identified in Theorem 1 are sufficient but not necessary. However, the assistance problem presented here is mostly indexable. This was verified by randomly generating problem instances without the bounds given in Table 1 and numerically verifying monotonicity of the passive set . Out of the random instances, and instances were found to be indexable for discount factors respectively. However, the conditions were satisfied for and of these instances for . This suggests that model is, in general, indexable and the Whittle index heuristic is applicable. It also suggests that the system may benefit from improved conditions for indexability.


  • [1] Y. Wang, Z. Shi, C. Wang, and F. Zhang, “Human-robot mutual trust in (semi) autonomous underwater robots,” in Cooperative Robots and Sensor Networks 2014.   Springer, 2014, pp. 115–137.
  • [2] G. Swamy, S. Reddy, S. Levine, and A. D. Dragan, “Scaled autonomy: Enabling human operators to control robot fleets,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 5942–5948.
  • [3] A. Rosenfeld, N. Agmon, O. Maksimov, and S. Kraus, “Intelligent agent supporting human–multi-robot team collaboration,” Artificial Intelligence, vol. 252, pp. 211–231, 2017.
  • [4] A. Khasawneh, H. Rogers, J. Bertrand, K. C. Madathil, and A. Gramopadhye, “Human adaptation to latency in teleoperated multi-robot human-agent search and rescue teams,” Automation in Construction, vol. 99, pp. 265–277, 2019.
  • [5] K. Zheng, D. F. Glas, T. Kanda, H. Ishiguro, and N. Hagita, “Supervisory control of multiple social robots for navigation,” in 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013, pp. 17–24.
  • [6] J. Y. Chen and M. J. Barnes, “Supervisory control of multiple robots: Effects of imperfect automation and individual differences,” Human Factors, vol. 54, no. 2, pp. 157–174, 2012.
  • [7] D. R. Olsen Jr and S. B. Wood, “Fan-out: measuring human control of multiple robots,” in SIGCHI conference on Human factors in computing systems, 2004, pp. 231–238.
  • [8] S. Y. Chien, M. Lewis, S. Mehrotra, and K. Sycara, “Imperfect automation in scheduling operator attention on control of multi-robots,” in Human Factors and Ergonomics Society Annual Meeting, vol. 57, no. 1.   SAGE Publications Sage CA: Los Angeles, CA, 2013, pp. 1169–1173.
  • [9] J. Y. Chen and M. J. Barnes, “Human–agent teaming for multirobot control: A review of human factors issues,” IEEE Transactions on Human-Machine Systems, vol. 44, no. 1, pp. 13–29, 2014.
  • [10] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
  • [11] M. L. Littman, T. L. Dean, and L. P. Kaelbling, “On the complexity of solving markov decision problems,” in Eleventh Annual Conference on Uncertainty in Artificial Intelligence, 1995, pp. 394–402.
  • [12] S. A. Zanlongo, P. Dirksmeier, P. Long, T. Padir, and L. Bobadilla, “Scheduling and path-planning for operator oversight of multiple robots,” Robotics, vol. 10, no. 2, p. 57, 2021.
  • [13] S. K. K. Hari, A. Nayak, and S. Rathinam, “An approximation algorithm for a task allocation, sequencing and scheduling problem involving a human-robot team,” Robotics and Automation Letters, vol. 5, no. 2, pp. 2146–2153, 2020.
  • [14] J. R. Peters, V. Srivastava, G. S. Taylor, A. Surana, M. P. Eckstein, and F. Bullo, “Human supervisory control of robotic teams: Integrating cognitive modeling with engineering design,” IEEE Control Systems Magazine, vol. 35, no. 6, pp. 57–80, 2015.
  • [15] J. W. Crandall, M. L. Cummings, M. Della Penna, and P. M. De Jong, “Computing the effects of operator attention allocation in human control of multiple robots,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 41, no. 3, pp. 385–397, 2010.
  • [16] S. Musić and S. Hirche, “Control sharing in human-robot team interaction,” Annual Reviews in Control, vol. 44, pp. 342–354, 2017.
  • [17] M. B. Dias, B. Kannan, B. Browning, E. Jones, B. Argall, M. F. Dias, M. Zinck, M. Veloso, and A. Stentz, “Sliding autonomy for peer-to-peer human-robot teams,” in International conference on intelligent autonomous systems, 2008, pp. 332–341.
  • [18] D. Szafir, B. Mutlu, and T. Fong, “Designing planning and control interfaces to support user collaboration with flying robots,” The International Journal of Robotics Research, vol. 36, no. 5-7, pp. 514–542, 2017.
  • [19] E. A. Kirchner, S. K. Kim, M. Tabie, H. Wöhrle, M. Maurus, and F. Kirchner, “An intelligent man-machine interface—multi-robot control adapted for task engagement based on single-trial detectability of p300,” Frontiers in human neuroscience, vol. 10, p. 291, 2016.
  • [20] S. Y. Chien, Y. L. Lin, P. J. Lee, S. Han, M. Lewis, and K. Sycara, “Attention allocation for human multi-robot control: Cognitive analysis based on behavior data and hidden states,” International Journal of Human-Computer Studies, vol. 117, pp. 30–44, 2018.
  • [21] M. M. Raeissi, N. Brooks, and A. Farinelli, “A balking queue approach for modeling human-multi-robot interaction for water monitoring,” in International Conference on Principles and Practice of Multi-Agent Systems.   Springer, 2017, pp. 212–223.
  • [22] J. Berger, N. Lo, and M. Noel, “A new multi-target, multi-agent search-and-rescue path planning approach,” International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 8, no. 6, pp. 935–944, 2014.
  • [23] H. Pei, S. Chen, and Q. Lai, “Multi-target consensus circle pursuit for multi-agent systems via a distributed multi-flocking method,” International Journal of Systems Science, vol. 47, no. 16, pp. 3741–3748, 2016.
  • [24] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of applied probability, vol. 25, no. A, pp. 287–298, 1988.
  • [25] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 148–164, 1979.
  • [26] L. Chan, D. Hadfield-Menell, S. Srinivasa, and A. Dragan, “The assistive multi-armed bandit,” in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2019, pp. 354–363.
  • [27] H. Claure, Y. Chen, J. Modi, M. Jung, and S. Nikolaidis, “Multi-armed bandits with fairness constraints for distributing resources to human teammates,” in 2020 ACM/IEEE International Conference on Human-Robot Interaction, 2020, pp. 299–308.
  • [28] A. Mahajan and D. Teneketzis, “Multi-armed bandit problems,” in Foundations and applications of sensor management.   Springer, 2008, pp. 121–151.
  • [29] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999.
  • [30] G. Xiong, R. Singh, and J. Li, “Learning augmented index policy for optimal service placement at the network edge,” arXiv preprint arXiv:2101.03641, 2021.
  • [31] V. S. Borkar, G. S. Kasbekar, S. Pattathil, and P. Y. Shetty, “Opportunistic scheduling as restless bandits,” IEEE Transactions on Control of Network Systems, vol. 5, no. 4, pp. 1952–1961, 2017.
  • [32] N. Akbarzadeh and A. Mahajan, “Conditions for indexability of restless bandits and an algorithm to compute Whittle index,” 2021.
  • [33] S. Wu, K. Ding, P. Cheng, and L. Shi, “Optimal scheduling of multiple sensors over lossy and bandwidth limited channels,” IEEE Transactions on Control of Network Systems, vol. 7, no. 3, pp. 1188–1200, 2020.
  • [34] K. Avrachenkov, U. Ayesta, J. Doncel, and P. Jacko, “Congestion control of TCP flows in internet routers by means of index policy,” Computer Networks, vol. 57, no. 17, pp. 3463–3478, 2013.
  • [35] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010.
  • [36] N. Akbarzadeh and A. Mahajan, “Dynamic spectrum access under partial observations: A restless bandit approach,” in Canadian Workshop on Information Theory.   IEEE, 2019, pp. 1–6.
  • [37] N. Akbarzadeh and A. Mahajan, “Restless bandits: Indexability and computation of whittle index,” arXiv preprint arXiv:2008.06111, 2020.
  • [38] J. Niño-Mora, “Dynamic priority allocation via restless bandit marginal productivity indices,” Top, vol. 15, no. 2, pp. 161–198, 2007.
  • [39] M. Egorov, Z. N. Sunberg, E. Balaban, T. A. Wheeler, J. K. Gupta, and M. J. Kochenderfer, “POMDPs.jl: A framework for sequential decision making under uncertainty,”

    The Journal of Machine Learning Research

    , vol. 18, no. 1, pp. 831–835, 2017.
  • [40] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016.
  • [41] P. Carreno-Medrano, A. Dahiya, S. L. Smith, and D. Kulić, “Incremental estimation of users’ expertise level,” in International Conference on Robot & Human Interactive Communication (ROMAN).   IEEE, 2019.

Appendix A Preliminary Results

Lemma 2.

For all ,


Under an optimal policy , we have:

Therefore, we get

Since for all , we can write