1 Introduction
Advances in robot autonomy have led to a decrease in the need for strict human supervision of robots. This has enabled the development of human-robot collaborative systems in which the task is primarily executed by a number of semi-autonomous robots requiring intermittent assistance from a human teammate, either in the event of a fault [1, 2] or to further increase the performance of multi-agent teams in warehouse operation [3], search-and-rescue [4], or in a social setting [5]. However, identifying which robot to assist in an uncertain environment is a challenging task for human operators [6, 3]. Moreover, as the number of robots increases, it becomes difficult for the operators to maintain awareness of every robot, which degrades the system's performance [7, 8]. Therefore, human operators can benefit from a decision support system (DSS) that advises which robots require attention and when [9, 3].
In this paper, we present such a DSS for a multi-robot system comprising a fleet of semi-autonomous robots with multiple human operators available for assistance if and when required. Figure 1 presents an overview of the problem setup, showing robots navigating in a city block-like environment, moving from start to goal locations. While navigating, a robot passes through a series of waypoints, each characterized by a different probability of success in progressing to the next waypoint. There is also a possibility that the robot fails at a task and gets stuck in a fault (error) state, from which assistance from a human operator is required to continue. A number of identical human operators are available, each of whom can assist/teleoperate at most one robot at a time. While being assisted by an operator, a robot has different probabilities of success, of failure, and of getting out of a fault state. It is possible to solve the above problem by modelling it as a Markov Decision Process (MDP) [10].
However, such formulations suffer from the curse of dimensionality, making conventional MDP-based solution techniques scale poorly with problem size [11]. Moreover, the policy must be recomputed every time a robot or an operator enters or leaves the system. The work most closely related to our problem is presented in [3], where the authors discuss single-operator multi-robot supervision systems. An advising agent guides the operator on which robot they should assist. The problem is solved using a limited-lookahead (myopic) approach, which provides an efficient and practical solution but suffers from scalability issues as the number of operators and the number of lookahead steps increase. Researchers have also studied deterministic versions of the problem, where the exact outcomes of robots' actions and the times of fault occurrences are known, and the allocation policy is determined using a sampling-based planner [12]. In [13], the authors present an approximation algorithm for a similar scheduling problem. These approaches, however, do not apply to the stochastic problem considered here.
In this paper, we show how an index-based policy can provide a scalable and better-performing solution than the myopic approach for the given multi-robot multi-operator allocation problem under stochastic transitions. Specifically, our work makes the following contributions:
1) We show that the operator allocation problem with multiple independent robots can be formulated as an instance of the Restless Multi-Armed Bandit (RMAB) problem. We leverage this formulation to obtain a Whittle-index-based policy that scales linearly with the number of robots and is independent of the number of operators.
2) We derive simple conditions to verify indexability of the model. These conditions can be checked independently for each state of each robot, thus providing a method that scales linearly with the size of the problem. This method can be applied to systems with any number of states and does not require the optimal policy to be of a threshold type.
3) We then implement our approach in two practical scenarios and present numerical experiments. The results show that the proposed method provides near-optimal solutions and outperforms existing efficient solution approaches, namely the reactive policy, the myopic policies, and the benefit-maximizing policy.
1.1 Background and Related Work
The problem of allocating operators in a multi-robot team bears similarities to the disciplines of multi-robot supervision, task scheduling, and queuing theory. In this section, we briefly review the related research, followed by an introduction to Restless Multi-Armed Bandits.
In the literature, several studies discuss the problem of enabling human operators to assist multiple robots, such as a team of navigating robots, a fleet of multiple UAVs, or a team performing search-and-rescue operations [14, 15]. To understand and improve human supervision, researchers have used frameworks such as sliding autonomy to incorporate various human-robot team capabilities (such as coordination and situational awareness) [16, 17]. Some studies also present interaction interfaces to facilitate and improve such supervision [18, 19]. The problem of assisting a number of independent robots has also been studied under a learning framework. The approach presented in [2] learns the decision-making model of a human operator from recorded data and tries to replicate that behaviour, optimizing based on the operator's internal utility function. In contrast, the problem presented in this paper is designed to optimize a global performance metric, assuming knowledge of the success and failure rates of robots with and without an operator allocated to them.
Such knowledge can be estimated using recorded data, similar to the work presented in [3]. For the scope of this paper, we will assume this knowledge takes the form of known transition probabilities. In the queuing discipline, several studies have investigated the effects of different queuing techniques [20] or threshold-based strategies [21] to prioritize the operators' attention to the robots. However, the model that we study differs from a queuing model because the robots can complete their tasks without the help of operators, and the operators can be allocated to robots that are not stuck in a fault state.
The multi-target multi-agent problems form another class of problems similar to the operator allocation problem. These problems deal with the allocation of multiple agents to a number of targets, aiming to detect or follow the targets under certain constraints [22, 23]. However, our problem setup differs because the behaviour of the targets (robots) changes with the allocation of agents (operators), and it is not possible to allocate multiple agents to a single target at once. Moreover, our problem presents a collaborative task, where both the robots and the operators work to achieve a common goal.
Restless Multi-Armed Bandits
Restless Multi-Armed Bandits (RMAB), first introduced in [24], are a generalization of Multi-Armed Bandits (MAB) [25], which have previously been used in problems such as assisting human partners [26] and distributing resources among human teammates [27]. RMAB is a class of scheduling problems in which limited resources must be allocated among several alternative choices. Each choice, referred to as an arm, is a discrete-time controlled Markov process that evolves depending on the resource allocated to it. The RMAB framework has been applied to problems in stochastic scheduling, patrol planning, sensor management, and resource allocation in general [28].
Finding the optimal policy for an RMAB suffers from the curse of dimensionality, as the state space grows exponentially with the number of arms. In general, obtaining the optimal policy in an RMAB is PSPACE-hard [29]. However, the Whittle index policy offers a simpler and scalable alternative to the optimal policy. Even though the Whittle index policy does not guarantee an optimal solution, it minimizes the expected cost for a relaxed problem under a time-averaged constraint [24]. This approach has been shown to work quite well for several scheduling and resource allocation problems [30, 31, 32]. A few studies have also implemented index-based methods to solve a sensor scheduling problem [33] or to serve a number of users transmitting queues of data packets through a channel [31]. Therefore, it is a reasonable approach for solving an RMAB, provided the problem satisfies a technical condition known as indexability (more details in Section 3). Unfortunately, this condition is difficult to verify in general, and there is no universal framework that applies to all problems. Existing methods for verifying indexability have been investigated for specific systems, such as two-state restless bandits [34, 35] or restless bandits with an optimal threshold-based policy [34, 36, 37].
1.2 Organization
The contents of this paper are organized as follows. The multi-robot assistance problem is presented in Section 2. We discuss the general restless bandit problem and define the property of indexability in Section 3. In Section 4, we formulate the assistance problem as an RMAB and derive sufficient conditions for its indexability. In Section 5, we present two practical classes of transition functions and establish conditions under which indexability is ensured. In Section 6, we cover the calculation of the Whittle index heuristic and present an efficient policy for the problem. Next, we present simulations of the problem in Section 7 to examine the validity and performance of the presented policy. The paper ends with a brief discussion and conclusion.
2 Multi-Robot Assistance Problem
Consider a decision support system (DSS) consisting of a team of human operators supervising a fleet of semi-autonomous robots. Each robot is required to complete a sequence of tasks to reach its goal. We will use a fleet of robots delivering packages in a city as a running example, but similar interpretations hold for the other applications mentioned in previous sections (e.g., robots reaching a sequence of configurations [12]). In this case, a robot's trajectory corresponds to a series of waypoints that the robot needs to navigate to reach its destination (goal location). At each waypoint, a robot can either operate autonomously or be teleoperated by one of the human operators. We assume that all human operators are identical in the way they operate the robots and that a human operator can operate at most one robot at a time. We now provide a mathematical model for the different components of the system.
(Remark on notation: Throughout this paper, we use calligraphic font to denote sets and roman font to denote variables. Uppercase letters represent random variables and the corresponding lowercase letters represent their realizations. Bold letters are used for variables pertaining to the multi-robot system, while light letters represent the corresponding single-robot variables.)
2.1 Model of the robots
It is assumed that when operating autonomously, each robot uses a pre-specified control algorithm to complete its task. For the delivery robot example, this could be, for instance, a SLAM-based local path planner that the robot uses for navigating between waypoints. We do not model the details of this control algorithm but simply assume that the control is imperfect and occasionally causes the robot to enter a fault state while performing a task (e.g., a delivery robot getting stuck in a pothole or losing its localization). We model this behaviour by assuming that while completing each task, the robot may be in one of two internal states: a normal state or a fault state. When a robot is being teleoperated, it may still be possible for it to enter a fault state.
The operating state of a robot at a given time is a tuple of its current task and its internal state. The state space of a robot is given by
where the terminal state indicates that all tasks have been completed. The state space for all robots is denoted by .
The state of a robot evolves differently depending on whether it is operating autonomously or being teleoperated. For a robot in a given state and operating mode, let one transition probability denote the chance of successfully completing the current task at the current time step, and let another denote the chance of toggling the current internal state (i.e., going from the normal to the fault state and vice versa). A diagram describing these transitions is shown in Fig. 2. Note that the terminal state is absorbing, so the probabilities of leaving it are zero.
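To make the transition model above concrete, the following is a minimal Python sketch of one time step of a single robot's evolution. The paper's experiments were implemented in Julia; this illustration, including the names `step`, `p_success`, and `p_toggle`, and the assumption that completing a task resets the internal state to normal, is ours and not the paper's code.

```python
import random

# A robot's operating state is (task_index, internal), where internal is
# "normal" or "fault"; "goal" is the absorbing terminal state.
# p_success[mode][internal] and p_toggle[mode][internal] are assumed names
# for the per-mode task-completion and internal-state-toggle probabilities.

def step(state, mode, p_success, p_toggle, num_tasks, rng=random):
    """Advance one robot by one time step under the given operating mode."""
    if state == "goal":
        return "goal"  # the terminal state is absorbing
    task, internal = state
    u = rng.random()
    if u < p_success[mode][internal]:
        # Task completed: advance to the next task (assumed to start in
        # the normal internal state), or reach the goal after the last task.
        return "goal" if task + 1 == num_tasks else (task + 1, "normal")
    if u < p_success[mode][internal] + p_toggle[mode][internal]:
        # Toggle the internal state (normal -> fault or fault -> normal).
        return (task, "fault" if internal == "normal" else "normal")
    return state  # otherwise, repeat the current task
```

Under the model's assumptions, autonomous operation would set both probabilities to zero in the fault state, so an unassisted faulty robot never leaves that state.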
There is a per-step cost that depends on the robot's state and operating mode. Note that the per-step cost is zero in the terminal state.
2.2 Model of the decision support system (DSS)
There is a decision support system that helps allocate operators to the robots. At each time step, the decision support system observes the operating states of all robots and picks at most as many robots to teleoperate as there are operators. We capture this by the allocation vector, where
(1) 
The allocation is selected according to a time-homogeneous Markov policy. The expected total cost incurred by any policy is given by
(2) 
where the discount factor lies in the interval (0, 1) and the initial state collects the initial operating states of all robots.
2.3 Problem objective
We impose the following assumptions on the model:

Given an allocation by the DSS, the operating states of the robots evolve independently of each other.

For every robot, the probability of getting out of the fault state when teleoperated is strictly greater than zero.

Under autonomous operation, a robot remains in the fault state, i.e., it cannot recover from a fault without operator assistance.
The design objective is to solve the following optimization problem:
Problem 1.
Given the set of robots, the system dynamics and per-step costs, the number of human operators, and the discount factor, choose a policy that minimizes the total discounted cost given by (2).
An optimal solution to Problem 1 can be found by modelling it as a Markov decision process and solving it using dynamic programming [10]. However, the sizes of the state and action spaces of the resulting model grow exponentially with the number of robots and operators. Thus, solving Problem 1 using dynamic programming becomes intractable for larger systems. To address this, we model Problem 1 as a restless multi-armed bandit (RMAB) problem and use the notion of indexability to find an efficient and scalable policy. We start by providing an overview of RMABs in the next section.
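The exponential growth mentioned above is easy to quantify: with a fixed number of operating states per robot, the joint state space is the product of the per-robot state spaces, and the action set contains every way of choosing at most one robot per operator. The snippet below is an illustrative Python calculation (the function name and the example numbers are ours, not from the paper).

```python
from math import comb

def joint_sizes(per_robot_states, num_robots, num_operators):
    """Size of the joint state space and of the allocation action set
    for the centralized MDP formulation of the assistance problem."""
    num_states = per_robot_states ** num_robots
    # An allocation selects at most `num_operators` robots to teleoperate.
    num_actions = sum(comb(num_robots, j) for j in range(num_operators + 1))
    return num_states, num_actions

# Hypothetical example: 10 robots with 21 operating states each, 3 operators.
states, actions = joint_sizes(21, 10, 3)  # states is already ~1.7e13
```

Even this modest example yields trillions of joint states, which is why the paper turns to the RMAB decomposition instead of joint dynamic programming.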
3 Overview of Restless Multi-Armed Bandits
In this section, we provide an overview of Restless Multi-Armed Bandits (RMAB), indexability, and the Whittle index policy.
3.1 Restless Bandit Process
A restless bandit (RB) process is a controlled Markov process specified by a state space, an action space, a transition probability function, a per-step cost function, and an initial state. By convention, one action is called the passive action and the other is called the active action.
3.2 Restless Multi-Armed Bandit Problem
A Restless Multi-Armed Bandit (RMAB) is a collection of independently evolving RBs. Each process is conventionally called an arm. A decision-maker selects at most a fixed number of arms at each time instant. We write the states and actions of all arms as vectors; since the dynamics of each arm are independent, the joint transition probability is the product of the per-arm transition probabilities.
The instantaneous cost of the system is the sum of the costs incurred by the individual arms. The performance of any time-homogeneous Markov policy is measured by
(3) 
where the discount factor is as in (2). Finally, the RMAB optimization problem is as follows:
Problem 2.
Given a discount factor, a collection of arms, and the number of arms to be chosen at each time step, choose a policy that minimizes (3).
As discussed earlier, to tackle the scalability issues of dynamic-programming-based solutions, the Whittle index policy is one of the most commonly used heuristics for solving an RMAB problem [24]. This policy is computationally efficient, and it readily generalizes to settings where the number of arms or the number of selections changes over time. Next, we present the required definitions.
3.3 Indexability and the Whittle index policy
In this section, we restrict our discussion to a single arm and therefore omit the arm superscript for ease of notation. Consider an arm where, for some penalty on the active action, we modify the per-step cost as
(4) 
Then the performance of any given time-homogeneous Markov policy is given by
(5) 
Now consider the following auxiliary problem:
Problem 3.
Given an arm, the discount factor, and the penalty, choose a Markov policy to minimize the cost given by (5).
Problem 3 is a Markov decision process, and we denote its optimal policy accordingly. It is assumed that the optimal policy picks the passive action in any state where both the active and passive actions result in the same expected cost. Next, we define passive sets and indexability.
Definition 1 (Passive set).
Given a penalty, the passive set is the set of states in which the passive action is prescribed by the optimal policy, i.e.,
Definition 2 (Indexability).
An arm is indexable if the passive set is non-decreasing in the penalty, i.e., for any pair of ordered penalties,
An RMAB problem is indexable if all of its arms are indexable.
Definition 3 (Whittle index).
For an indexable arm, the Whittle index of a state is the smallest value of the penalty for which that state belongs to the passive set, i.e.,
(6) 
Equivalently, the Whittle index is the smallest value of the penalty for which the optimal policy is indifferent between the active and passive actions when the arm is in that state.
The Whittle index policy is then as follows: at each time step, compute the Whittle indices of the current states of all arms and select the arms with the highest Whittle indices (provided they are positive).
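The selection rule just described can be sketched in a few lines of Python (the paper's implementation is in Julia; the function name `whittle_allocation` and the random tie-breaking via shuffle-then-stable-sort are our own illustrative choices):

```python
import random

def whittle_allocation(indices, num_operators, rng=random):
    """Select (at most) num_operators arms with the highest positive
    Whittle indices; ties are broken uniformly at random.

    indices: list with the Whittle index of each arm's current state.
    Returns the set of selected arm positions.
    """
    # Keep only arms whose index is positive, per the policy statement.
    candidates = [i for i, w in enumerate(indices) if w > 0]
    rng.shuffle(candidates)  # randomize order so the stable sort breaks ties
    candidates.sort(key=lambda i: indices[i], reverse=True)
    return set(candidates[:num_operators])
```

Note that fewer than `num_operators` arms may be selected when not enough indices are positive, matching the "provided they are positive" caveat above.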
4 Indexability of the assistance problem
Problem 1 can be formulated as an instance of RMAB, where each robot corresponds to an arm. Under this formulation, the state of an arm corresponds to the operating state of the corresponding robot. The transition function corresponds to the robot state evolution shown in Fig. 2, and the cost function corresponds to the associated per-step cost. In addition, allocating an operator to a robot corresponds to choosing the active action for that arm, while autonomous operation corresponds to choosing the passive action. This motivates using the Whittle index policy to solve Problem 1. However, before we can apply this approach, we must verify the indexability of the problem. As discussed earlier, there is no universal framework for verifying indexability. Moreover, the optimal policy for the given problem does not exhibit any threshold-based behaviour. Therefore, we determine sufficient conditions for indexability from first principles, using properties of the value function of each individual arm.
Since indexability must be checked for each arm separately, we drop the robot superscript from all variables in this analysis.
Let be the unique fixed point of the following equation
where
(7) 
represents the Q-value of taking an action in a state. Here, the transition function denotes the probability of transitioning from one state to another under a given action, as depicted in Fig. 2, and the corresponding optimal policy picks the action with the smaller Q-value. To ensure uniqueness of the optimal policy, we follow the convention that when both actions have equal Q-values, the passive action is chosen. Let the passive set and the Whittle index be as defined in Section 3 for the single-robot operator allocation problem. Furthermore, define the benefit function as
(8) 
Then, a sufficient condition for indexability is as follows:
Lemma 1.
A sufficient condition for Problem 1 to be indexable is that the benefit function of each robot is monotonically increasing in the penalty for all states.
Proof.
The result follows from the observation that, using (8) and Def. 1, we can rewrite the passive set as
(9) 
Thus, monotonicity of the benefit function implies that the condition for indexability given in Def. 2 is satisfied.
∎
We verify the monotonicity of the benefit function by bounding the value function, and establish the following:
Theorem 1.
For each mode and internal state, let the repeat probability denote the probability of repeating the current task. Define the two coefficients as follows:
and
Then, the single-robot problem is indexable if, for all states:
(10) 
Proof.
See Appendix. ∎
The multi-robot problem is indexable if the conditions given in Theorem 1 hold for all robots. In the next section, we present specific instances of the general model described in Section 2 that are indexable and discuss their relevance to practical assistance problems for (semi-)autonomous delivery robots.
5 Special Cases: Robot transitions in the city
This section presents two specific classes of transition functions which represent two types of faults commonly occurring in systems with remote navigating robots.
5.1 Transition Type 1: Faults with continuation
Consider the following transition behaviour along a robot's waypoints. At each time step, the robot moves to its next waypoint with some probability of success, representing, for example, the level of crowding in the area. There is also a probability of entering a fault state, such as encountering an unidentifiable obstacle. A human operator can teleoperate the robot to its next waypoint from either the normal or the fault state. Such transitions represent faults where the robot is functioning properly but is unsure how to proceed due to uncertainty in its surroundings. Thus, the probability of success when being teleoperated is the same regardless of whether the robot is in its normal state or stopped in the fault state. The corresponding transition dynamics are shown in Figure 3.
In this case, the coefficients from Theorem 1 simplify to the following expressions:
(11) 
Note that
Thus, both coefficients satisfy the sufficient condition of Theorem 1 for all allowed values of the transition probabilities and the discount factor. Therefore, any robot following Type 1 transitions is indexable.
5.2 Transition Type 2: Faults with reset
Consider another type of transition in which the robot can enter a fault state and needs error fixing while remaining at its current waypoint. This includes scenarios such as losing localization or getting stuck on a minor obstacle. The human operator can assist the robot out of that situation by fixing the fault, resetting it back to its current waypoint (assuming the system is equipped with the means to do so). The resulting constraints on the transition probabilities and the corresponding transition dynamics are shown in Fig. 4.
Substituting the values of the transition probabilities from Fig. 4 into the expressions of Theorem 1, the coefficients simplify to the following:
(12) 
Note that the first coefficient is the same as in (11) and therefore satisfies (10). Requiring the second coefficient to satisfy (10) results in the following condition:
(13) 
Since the transition probabilities are bounded, (13) also yields the following condition:
(14) 
Therefore, any robot state following Type 2 transitions satisfies the indexability condition of Theorem 1 if (13) and (14) hold, i.e., if the probability that the robot transitions from the normal state to the fault state during autonomous operation is not too high and the probability that the operator brings the robot from the fault state back to the normal state is not too small.
6 Computation of Whittle Index
As discussed in Section 1.1, once the indexability of a problem instance has been verified, we can compute the Whittle indices for all robots and determine the Whittle index policy.
General approaches for computing Whittle indices are based either on an adaptive greedy algorithm [38, 32] or on binary search [36]. In this section, we briefly describe the adaptive greedy algorithm and explain how the Whittle index policy works. Readers are encouraged to refer to [32] for a detailed explanation and validation of the algorithm. Alg. 1 presents the procedure for computing the Whittle indices of a single robot.
The algorithm operates as follows. For any subset of states, define the corresponding policy vector, which maps each state to the action prescribed for it. Also define the cost vector for all states under a policy and the transition matrix under that policy.
Then, in each iteration of the while loop, compute the following:
(15) 
where
The minimum value computed in line 5 of Alg. 1 gives the Whittle indices of the minimizing states (line 6). These states are then removed from consideration in the next iteration of the while loop by adding them to the passive set. When Alg. 1 exits the while loop, the Whittle indices of all states of that robot have been computed. This procedure is then repeated for every robot in the system.
Once the Whittle indices of all states of all robots are obtained, the Whittle index policy can be implemented as given in Alg. 2. In line 2 of the algorithm, the selection function returns the indices of the top positive elements of a set, with ties broken randomly. As shown in [32], the computational complexity of this method is polynomial in the size of a single robot's state space and linear in the number of robots. In contrast, the computational complexity of finding the optimal policy for Problem 1 via value iteration grows exponentially with the number of robots.
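Since the matrix expressions of Alg. 1 are not reproduced here, the binary-search alternative mentioned at the start of this section [36] is easier to illustrate. The Python sketch below solves the penalized single-arm MDP (Problem 3) by value iteration and searches for the penalty at which a given state is indifferent between the two actions; it assumes indexability. All names (`q_values`, `whittle_index`, `P`, `c`) are ours, and this is an illustration rather than the paper's Julia implementation.

```python
import numpy as np

def q_values(P, c, lam, gamma, tol=1e-9):
    """Value iteration for the penalized single-arm MDP (Problem 3).
    P[a] is an (S, S) transition matrix and c[a] an (S,) cost vector for
    action a in {0: passive, 1: active}; the active action pays an
    additional penalty `lam`. Returns the converged Q-value vectors."""
    v = np.zeros(c[0].shape[0])
    while True:
        q0 = c[0] + gamma * P[0] @ v          # passive action
        q1 = c[1] + lam + gamma * P[1] @ v    # active action plus penalty
        v_new = np.minimum(q0, q1)
        if np.max(np.abs(v_new - v)) < tol:
            return q0, q1
        v = v_new

def whittle_index(P, c, state, gamma, lo=-100.0, hi=100.0, iters=60):
    """Binary search for the smallest penalty at which `state` prefers
    the passive action (i.e., its Whittle index), assuming indexability."""
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        q0, q1 = q_values(P, c, lam, gamma)
        if q1[state] < q0[state]:
            lo = lam   # active still strictly better: index lies above lam
        else:
            hi = lam   # passive (weakly) preferred: index lies at or below
    return 0.5 * (lo + hi)
```

For a toy two-state arm where the active action moves the arm from a costly state to a free absorbing state, the indifference penalty can be checked against the closed-form value obtained from the Bellman equations.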
7 Simulations and Results
In this section, we present performance results for a simulated multi-robot assistance problem under the following policies (described later): 1) the optimal policy, 2) the index policy, 3) the benefit-maximizing policy, 4) the myopic policy, and 5) the reactive policy. The problem and the solution frameworks for all policies were implemented using the POMDPs.jl library in Julia [39].
7.1 Simulation Setup
For the simulations, a city map is generated as shown in Fig. 5, where the map is randomly divided into zones corresponding to one of the two transition types presented in Section 5.
The exact values of the transition probabilities at different locations on the map are sampled randomly from uniform distributions, according to Table 1. The bounds on the transition probabilities for transition Type 2 are determined by (13) and (14), respectively.
[Table 1: sampling ranges of the transition probabilities for Type 1 and Type 2 transitions.]
For the teleoperation problem, we use the following cost structure:
(16) 
with non-negative costs for every state. This cost function captures the time that a robot takes to reach its goal, i.e., zero cost on reaching the goal state, non-negative costs for the intermediate states, and a fixed additional cost while being assisted.
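The cost structure just described can be written as a small Python helper (an illustrative sketch; the function and parameter names `per_step_cost`, `base_cost`, and `assist_cost` are ours, and the actual values are sampled as described below):

```python
def per_step_cost(state, mode, base_cost, assist_cost):
    """Per-step cost in the spirit of (16): zero at the goal state,
    a non-negative base cost otherwise, plus a fixed surcharge while
    the robot is being teleoperated ("teleop" mode)."""
    if state == "goal":
        return 0.0  # no cost once the goal is reached
    extra = assist_cost if mode == "teleop" else 0.0
    return base_cost + extra
```

Because the discounted sum of these costs is minimized, the cost structure effectively penalizes both time to goal and operator workload.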
The values of the different costs and the discount factor are sampled from the ranges specified in Table 2.
[Table 2: sampled value ranges for the cost parameters and the discount factor.]
For simplicity, the same range is used for a given parameter across every state of all robots; we therefore drop the robot superscript.
Separate tests were performed to assess the validity, performance, and scalability of the index policy. At the beginning of each simulation, a number of robots (varied across tests) are placed on the map with randomly generated start and goal locations and waypoints uniformly placed between the two. In practice, these waypoints would be generated by a separate path planner for each individual robot; here they are treated as an input to the operator allocation problem.
7.2 Baseline Policies
We consider the following baseline policies to assess the performance of the index policy.
Optimal policy
The optimal policy, as defined by (2), is found by encoding the complete problem with all robots as an MDP and solving it using the Sparse Value Iteration solver from the POMDPs.jl library.
Reactive policy
The reactive policy simply allocates an operator to any robot stuck in a fault state. If there are more such robots than operators, a random subset of these robots is selected.
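This baseline is simple enough to sketch directly (a Python illustration using the state convention from Section 2; the function name `reactive_allocation` is ours):

```python
import random

def reactive_allocation(states, num_operators, rng=random):
    """Reactive baseline: assist only robots currently in a fault state,
    choosing a uniformly random subset if they outnumber the operators."""
    faulty = [i for i, s in enumerate(states)
              if s != "goal" and s[1] == "fault"]  # s = (task, internal)
    if len(faulty) <= num_operators:
        return set(faulty)
    return set(rng.sample(faulty, num_operators))
```

Unlike the index policy, this rule never assists a robot preemptively, which is one intuition for why it performs worst in the experiments below.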
Myopic policy
Myopic/greedy policies are commonly used to obtain fast (but suboptimal) solutions to intractable problems. In this paper, we implement the finite-step-lookahead myopic policy presented in [3]. Define the passive cost as the expected cost incurred by the system from the current time step onward under passive actions. The myopic policy is then defined as
(17) 
where the lookahead cost is given by
where
is the cost incurred in the current time step.
Benefit-maximizing policy
The benefit-maximizing policy is inspired by the advantage function used in reinforcement learning; see, for example, [40]. This policy considers the benefit, or advantage, of taking the active action over the passive action for each robot and, at each time step, picks the robots with the highest benefit, i.e.,
(18)
where the benefit corresponds to the function defined in (8) with zero penalty.
7.3 Comparison with the Optimal policy
First, the index policy is compared against the optimal policy to validate its applicability to our problem. Due to its poor scalability, the optimal policy cannot be computed for larger problem instances, so this test is limited to small problem sizes. The relative performance (the ratio of the cost incurred under the optimal policy to that under the index policy) is shown in Fig. 6. For the comparison, the problem instances were tested under both policies and simulated through Monte Carlo rollouts. Each problem instance is run until all robots reach their respective goal locations; this is repeated across rollouts and the average cost is recorded.
We observe that the index policy performs quite close to the optimal policy in all test cases. As the ratio of the number of robots to the number of operators increases, the index policy starts to degrade relative to the optimal; however, the relative cost remains close to optimal throughout. Also, note that the optimal policy minimizes the expected cost, whereas the graph shows the distribution of the relative performance, which is centred around 1. Therefore, we still see some test runs where the rollout resulted in a slightly lower cost for the index policy (see Fig. 6). However, as the number of test runs increases, the average relative performance approaches 1.
7.4 Comparison with other baseline policies
Next, we compare the performance (measured as the average cost incurred per robot before reaching its goal) of the index policy with the three baseline policies on larger problem instances. For the comparison, a set of problem instances is created, each with a set of waypoints whose transition probabilities are randomly sampled according to Table 1. Each instance is then simulated separately under the different policies using Monte Carlo rollouts until all robots reach their goal states, repeated for 500 iterations. Each simulation iteration (rollout) is timed out after a fixed duration for each policy; if an iteration exceeds this time, the simulation is interrupted and the result for that test condition is not reported.
Figure 7 shows the performance comparison of the four policies. The index policy performs best, followed by the benefit-maximizing policy and the myopic policy. The reactive policy performs worst, as expected. As a side note, the average cost incurred per robot under any policy is strongly correlated with the ratio of the number of robots to the number of available operators. This observation supports the intuition that as human operators are required to distribute their assistance among more robots, their effectiveness decreases.
7.5 Scalability
Table 3 shows the time each policy takes to compute the operator allocation under different problem sizes. For these simulations, each robot is set to have a fixed number of waypoints. As observed from Table 3, the online computation times of the two myopic policies scale exponentially with both the number of robots and the number of operators, with the time for the longer-lookahead myopic policy growing at a much higher rate.
The computation times for the index and benefit-maximizing policies scale linearly with the number of robots and are independent of the number of operators. Also, note that the Whittle index computation for one robot is independent of the rest; therefore, robots can be added or removed without recomputing the already computed indices. Furthermore, if the number of operators changes, the policy simply allocates the available operators to the robots with the highest Whittle indices. As a result, the policy scales efficiently with the number of robots and operators. For reference, the simulations were run on a desktop PC.
[Table 3: computation times of the index policy, the two myopic policies, and the benefit-maximizing policy for different numbers of operators and robots.]
8 Conclusions and Discussion
In this paper, we provide an analysis of the operator allocation problem for a multi-robot assistance task and demonstrate the effectiveness of the restless bandit framework for obtaining a scalable policy. This policy is based on the Whittle index heuristic; it performs close to optimal and significantly better than other efficient solution approaches. We also provide an analysis of the indexability of such problems and give a simplified condition for quickly verifying whether a problem instance is indexable. These results can also be used to specify required transition behavior in the form of bounds on transition probabilities.
There are, however, a few limitations of the proposed approach. When a problem instance is not indexable, the Whittle indices are not defined, and the methods for computing them may not return meaningful values; the proposed approach is therefore not applicable in such cases. Note also that the conditions for indexability identified in Theorem 1 are sufficient but not necessary. In practice, the assistance problem presented here is mostly indexable. We verified this by randomly generating problem instances without the bounds given in Table 1 and numerically checking monotonicity of the passive set . Out of the random instances, and instances were found to be indexable for discount factors respectively, whereas the conditions were satisfied for and of these instances for . This suggests that the model is, in general, indexable and that the Whittle index heuristic is applicable; it also suggests that the system may benefit from sharper conditions for indexability.
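The numerical check mentioned above can be sketched as follows: for a single-robot MDP with a subsidy `lam` granted to the passive action, compute the optimal policy by value iteration for each `lam` on a grid and verify that the set of states where passivity is optimal grows monotonically with `lam`, which is the defining property of indexability. This is a generic cost-minimization sketch under assumed toy dynamics, not the paper's code:

```python
import numpy as np

def passive_set(P_act, P_pas, c_act, c_pas, lam, gamma=0.9, iters=2000):
    """States where staying passive (with subsidy lam) is optimal, found
    by value iteration on a single-robot, two-action, cost-minimizing MDP."""
    n = P_act.shape[0]
    V = np.zeros(n)
    for _ in range(iters):
        Q_act = c_act + gamma * P_act @ V          # assisted action
        Q_pas = c_pas - lam + gamma * P_pas @ V    # passive action + subsidy
        V = np.minimum(Q_act, Q_pas)
    return set(np.flatnonzero(Q_pas <= Q_act + 1e-9))

def is_indexable(P_act, P_pas, c_act, c_pas, lams):
    """Check numerically that the passive set is nested (nondecreasing)
    as the subsidy lam increases over the given grid."""
    prev = set()
    for lam in sorted(lams):
        cur = passive_set(P_act, P_pas, c_act, c_pas, lam)
        if not prev <= cur:    # monotonicity violated => not indexable
            return False
        prev = cur
    return True
```

A finer `lams` grid gives a stronger (but still only numerical) check; a violation at any grid point is conclusive evidence of non-indexability, while passing the check is evidence, not proof, of indexability.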
References
 [1] Y. Wang, Z. Shi, C. Wang, and F. Zhang, “Human-robot mutual trust in (semi) autonomous underwater robots,” in Cooperative Robots and Sensor Networks 2014. Springer, 2014, pp. 115–137.
 [2] G. Swamy, S. Reddy, S. Levine, and A. D. Dragan, “Scaled autonomy: Enabling human operators to control robot fleets,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 5942–5948.
 [3] A. Rosenfeld, N. Agmon, O. Maksimov, and S. Kraus, “Intelligent agent supporting human–multi-robot team collaboration,” Artificial Intelligence, vol. 252, pp. 211–231, 2017.
 [4] A. Khasawneh, H. Rogers, J. Bertrand, K. C. Madathil, and A. Gramopadhye, “Human adaptation to latency in teleoperated multi-robot human-agent search and rescue teams,” Automation in Construction, vol. 99, pp. 265–277, 2019.
 [5] K. Zheng, D. F. Glas, T. Kanda, H. Ishiguro, and N. Hagita, “Supervisory control of multiple social robots for navigation,” in 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013, pp. 17–24.
 [6] J. Y. Chen and M. J. Barnes, “Supervisory control of multiple robots: Effects of imperfect automation and individual differences,” Human Factors, vol. 54, no. 2, pp. 157–174, 2012.
 [7] D. R. Olsen Jr and S. B. Wood, “Fan-out: measuring human control of multiple robots,” in SIGCHI Conference on Human Factors in Computing Systems, 2004, pp. 231–238.
 [8] S. Y. Chien, M. Lewis, S. Mehrotra, and K. Sycara, “Imperfect automation in scheduling operator attention on control of multi-robots,” in Human Factors and Ergonomics Society Annual Meeting, vol. 57, no. 1. SAGE Publications, 2013, pp. 1169–1173.
 [9] J. Y. Chen and M. J. Barnes, “Human–agent teaming for multirobot control: A review of human factors issues,” IEEE Transactions on Human-Machine Systems, vol. 44, no. 1, pp. 13–29, 2014.
 [10] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 [11] M. L. Littman, T. L. Dean, and L. P. Kaelbling, “On the complexity of solving Markov decision problems,” in Eleventh Annual Conference on Uncertainty in Artificial Intelligence, 1995, pp. 394–402.
 [12] S. A. Zanlongo, P. Dirksmeier, P. Long, T. Padir, and L. Bobadilla, “Scheduling and path-planning for operator oversight of multiple robots,” Robotics, vol. 10, no. 2, p. 57, 2021.
 [13] S. K. K. Hari, A. Nayak, and S. Rathinam, “An approximation algorithm for a task allocation, sequencing and scheduling problem involving a human-robot team,” Robotics and Automation Letters, vol. 5, no. 2, pp. 2146–2153, 2020.
 [14] J. R. Peters, V. Srivastava, G. S. Taylor, A. Surana, M. P. Eckstein, and F. Bullo, “Human supervisory control of robotic teams: Integrating cognitive modeling with engineering design,” IEEE Control Systems Magazine, vol. 35, no. 6, pp. 57–80, 2015.
 [15] J. W. Crandall, M. L. Cummings, M. Della Penna, and P. M. De Jong, “Computing the effects of operator attention allocation in human control of multiple robots,” IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 41, no. 3, pp. 385–397, 2010.
 [16] S. Musić and S. Hirche, “Control sharing in human-robot team interaction,” Annual Reviews in Control, vol. 44, pp. 342–354, 2017.
 [17] M. B. Dias, B. Kannan, B. Browning, E. Jones, B. Argall, M. F. Dias, M. Zinck, M. Veloso, and A. Stentz, “Sliding autonomy for peer-to-peer human-robot teams,” in International Conference on Intelligent Autonomous Systems, 2008, pp. 332–341.
 [18] D. Szafir, B. Mutlu, and T. Fong, “Designing planning and control interfaces to support user collaboration with flying robots,” The International Journal of Robotics Research, vol. 36, no. 5–7, pp. 514–542, 2017.
 [19] E. A. Kirchner, S. K. Kim, M. Tabie, H. Wöhrle, M. Maurus, and F. Kirchner, “An intelligent man-machine interface—multi-robot control adapted for task engagement based on single-trial detectability of P300,” Frontiers in Human Neuroscience, vol. 10, p. 291, 2016.
 [20] S. Y. Chien, Y. L. Lin, P. J. Lee, S. Han, M. Lewis, and K. Sycara, “Attention allocation for human multi-robot control: Cognitive analysis based on behavior data and hidden states,” International Journal of Human-Computer Studies, vol. 117, pp. 30–44, 2018.
 [21] M. M. Raeissi, N. Brooks, and A. Farinelli, “A balking queue approach for modeling human-multi-robot interaction for water monitoring,” in International Conference on Principles and Practice of Multi-Agent Systems. Springer, 2017, pp. 212–223.
 [22] J. Berger, N. Lo, and M. Noel, “A new multi-target, multi-agent search-and-rescue path planning approach,” International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 8, no. 6, pp. 935–944, 2014.
 [23] H. Pei, S. Chen, and Q. Lai, “Multi-target consensus circle pursuit for multi-agent systems via a distributed multi-flocking method,” International Journal of Systems Science, vol. 47, no. 16, pp. 3741–3748, 2016.
 [24] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of applied probability, vol. 25, no. A, pp. 287–298, 1988.
 [25] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 148–164, 1979.
 [26] L. Chan, D. Hadfield-Menell, S. Srinivasa, and A. Dragan, “The assistive multi-armed bandit,” in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2019, pp. 354–363.
 [27] H. Claure, Y. Chen, J. Modi, M. Jung, and S. Nikolaidis, “Multi-armed bandits with fairness constraints for distributing resources to human teammates,” in 2020 ACM/IEEE International Conference on Human-Robot Interaction, 2020, pp. 299–308.
 [28] A. Mahajan and D. Teneketzis, “Multi-armed bandit problems,” in Foundations and Applications of Sensor Management. Springer, 2008, pp. 121–151.
 [29] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999.
 [30] G. Xiong, R. Singh, and J. Li, “Learning augmented index policy for optimal service placement at the network edge,” arXiv preprint arXiv:2101.03641, 2021.
 [31] V. S. Borkar, G. S. Kasbekar, S. Pattathil, and P. Y. Shetty, “Opportunistic scheduling as restless bandits,” IEEE Transactions on Control of Network Systems, vol. 5, no. 4, pp. 1952–1961, 2017.
 [32] N. Akbarzadeh and A. Mahajan, “Conditions for indexability of restless bandits and an algorithm to compute Whittle index,” 2021.
 [33] S. Wu, K. Ding, P. Cheng, and L. Shi, “Optimal scheduling of multiple sensors over lossy and bandwidth limited channels,” IEEE Transactions on Control of Network Systems, vol. 7, no. 3, pp. 1188–1200, 2020.
 [34] K. Avrachenkov, U. Ayesta, J. Doncel, and P. Jacko, “Congestion control of TCP flows in internet routers by means of index policy,” Computer Networks, vol. 57, no. 17, pp. 3463–3478, 2013.
 [35] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010.
 [36] N. Akbarzadeh and A. Mahajan, “Dynamic spectrum access under partial observations: A restless bandit approach,” in Canadian Workshop on Information Theory. IEEE, 2019, pp. 1–6.
 [37] N. Akbarzadeh and A. Mahajan, “Restless bandits: Indexability and computation of Whittle index,” arXiv preprint arXiv:2008.06111, 2020.
 [38] J. Niño-Mora, “Dynamic priority allocation via restless bandit marginal productivity indices,” Top, vol. 15, no. 2, pp. 161–198, 2007.

 [39] M. Egorov, Z. N. Sunberg, E. Balaban, T. A. Wheeler, J. K. Gupta, and M. J. Kochenderfer, “POMDPs.jl: A framework for sequential decision making under uncertainty,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 831–835, 2017.
 [40] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016.
 [41] P. Carreno-Medrano, A. Dahiya, S. L. Smith, and D. Kulić, “Incremental estimation of users’ expertise level,” in International Conference on Robot & Human Interactive Communication (RO-MAN). IEEE, 2019.
Appendix A Preliminary Results
Lemma 2.
For all ,
Proof.
Under an optimal policy , we have:
Therefore, we get
Since for all , we can write
∎
Define