Multi-agent Time-based Decision-making for the Search and Action Problem

02/27/2018 ∙ by Takahiro Miki, et al. ∙ ETH Zurich

Many robotic applications, such as search-and-rescue, require multiple agents to search for and perform actions on targets. However, such missions present several challenges, including cooperative exploration, task selection and allocation, time limitations, and computational complexity. To address this, we propose a decentralized multi-agent decision-making framework for the search and action problem with time constraints. The main idea is to treat time as an allocated budget in a setting where each agent action incurs a time cost and yields a certain reward. Our approach leverages probabilistic reasoning to make near-optimal decisions leading to maximized reward. We evaluate our method in the search, pick, and place scenario of the Mohamed Bin Zayed International Robotics Challenge (MBZIRC), by using a probability density map and reward prediction function to assess actions. Extensive simulations show that our algorithm outperforms benchmark strategies, and we demonstrate system integration in a Gazebo-based environment, validating the framework's readiness for field application.






I Introduction

Thanks to ongoing developments in sensing and processing technologies, mobile robots are becoming more capable of working in dynamic and challenging environments in many practical applications, such as search-and-rescue [1, 2], multi-robot exploration [3, 4, 5] and terrain monitoring [6]. In many multi-agent applications, however, efficient task allocation remains a field of open research. To perform fully autonomous missions, algorithms for agent cooperation and efficient area search are required.

To address this, our work focuses on cooperative planning strategies within the context of the Mohamed Bin Zayed International Robotics Challenge (MBZIRC) [7]. In one stage of this competition (Challenge 3), a team of unmanned aerial vehicles (UAVs) is required to collaboratively search, pick, and place a set of static and moving objects, gaining points for each one collected. This task poses several challenges:

  • coordinating multiple UAVs to explore the field,

  • tracking moving targets efficiently,

  • trading off exploration to find new objects with picking them up to score points,

  • making decisions based on the time limitation of a task.

A key aspect of such missions is that the timing to execute actions should be considered, given the targets found. In the MBZIRC, for instance, a UAV can greedily pick up an object to attain a certain score. However, it could be better to invest in exploration to find a more valuable object nearby. The optimal decision here includes several aspects, and differs with the time remaining until mission completion. With stricter time limits, exploration becomes riskier and acting greedily might be preferred. The exploration-exploitation trade-off must thus be addressed while accounting for the fact that optimal decisions differ at various stages of the mission.

Fig. 1: Our decision-making algorithm applied to the search, pick, and place scenario of the MBZIRC in a 3D simulation. We use a probability density map (PDM) (bottom-right) to plan exploratory paths on a grid representing the sum of expected scores to be found, calculated from the PDM (upper-right). By predicting the value of search and action decisions using the PDM, three UAVs coordinate implicitly to maximize the value of successful deliveries given the time constraints.

To tackle these issues, this paper introduces a multi-agent decision-making algorithm which accounts for (i) target movement, (ii) mission time constraints, and (iii) variable target rewards. The main idea is to treat time as a budget used by the agents. At a given time, all possible actions consume some budget while yielding a certain reward. Each agent has an initial budget specified as the mission time limit, and must choose the sequence of actions maximizing the final reward. To evaluate actions, future rewards are predicted by planning on a probabilistic map. A key aspect of our approach is that the agents need not operate fully synchronously. Moreover, by using implicit coordination [8], it does not suffer from computational growth with the number of agents, making it applicable to real-time scenarios.

The contributions of this paper are:

  1. A multi-agent decision-making framework which:

    • is decentralized and real-time with near-optimality guarantees,

    • considers a fixed time budget,

    • addresses the search and action problem without using a trade-off parameter.

  2. The validation of our framework on Challenge 3 of the MBZIRC, with an evaluation against alternative strategies.

While our framework is motivated by the MBZIRC, it can be used in any multi-agent search and action scenario.

The remainder of this paper is organized as follows. Section II describes related work. We formulate our proposed method as a general search and action problem in Section III and adapt it to the MBZIRC in Section IV. Sections V and VI detail our experimental set-up and results. In Section VII, we conclude with a view towards future work.

II Related Work

Significant work has been done in the field of decision-making for the search and action problem. This section provides a brief overview. We first discuss methods that only search for targets, then methods that also execute tasks on them.

II-A Coverage Methods

Pure exploration tasks are typically addressed using complete coverage methods. The aim is to generate an obstacle-free path passing through all points in an area, usually by decomposing the problem space [9]. Based on these ideas, Maza and Ollero [5] proposed a sweep-based approach for target search using multiple UAVs. However, their strategies are not applicable in our setup due to the presence of moving targets, which can enter areas already explored.

II-B Search and Pursuit-evasion

There are many search algorithms which allow for target movement. Isler et al. [10] categorized autonomous search problems using several features, dividing them into two categories: (i) adversarial pursuit-evasion games, and (ii) probabilistic search.

In a pursuit-evasion game, pursuers try to capture evaders as evaders try to avoid capture. One graph-based example of this game is cops-and-robbers [11, 12], which assumes that players have perfect knowledge of each other’s positions. Similarly, Adler et al. [13] studied hunter-and-rabbit, where players do not know each other’s positions unless found on the same vertex. Their search method aims to maximize worst-case performance so that the targets can be found regardless of how they move. These deterministic methods, however, do not cater for practical sensing limitations.

In contrast, probabilistic search methods try to maximize probability of detection or minimize expected detection time, assuming that the targets’ and searchers’ movements are independent of each other. Like Wong et al. [14], we follow a Bayesian approach to compute the probability density of targets and use this to coordinate multiple searchers.

Formally, general search problems can be formulated as a Partially Observable Markov Decision Process (POMDP) with hidden states. However, real-time planning within such frameworks is difficult because of the exponential growth in computation with the number of searchers, even with near-optimal solvers [15, 16]. To address this, Hollinger and Singh [8] propose implicit coordination with receding-horizon planning to achieve near-optimality while scaling only linearly with the number of teams.

The above methods assume that all agents act simultaneously, i.e., using a synchronization system. Alleviating this requirement, Messias et al. [17] study a multi-agent POMDP for asynchronous execution.

II-C Combined Search and Action

While the strategies above are suitable for efficient target search, they do not take into account any subsequent tasks on the targets (e.g., object pickup and delivery). Hollinger et al. [18] were the first to examine both aspects in the context of robotics. In their work, the aim is to minimize the combined search and action time with targets found in an environment using finite-horizon plans. However, a key assumption is that actions are executed as soon as targets are found. We leverage similar principles, but also allow agents to choose between search and action given the time budget constraints.

III Proposed Approach

In this section, we present our planning approach for the general search and action problem. By maximizing the attainable reward, our strategy aims to efficiently decide whether to explore an environment or to execute actions given the targets found so far. We begin by defining the problem, then outline the details of our algorithm.

III-A Problem Setup

Our problem involves finding tasks associated with static and moving targets and performing actions to complete them within a specified time limit.

We consider a graph-like grid environment with edge-connected cells. Each edge is weighted by a distance and an agent moves to detect tasks in its current cell. The agent searches for tasks associated with multiple targets, which can be either static or moving. The motion of moving targets is non-adversarial, i.e., it does not change with the searchers’ behaviour. Each target task can be executed through an action, which incurs a time cost and yields a certain reward.

The objective is to obtain the highest possible reward within the time limit with multiple search agents. To achieve this, the agents must explore the field, locate the tasks, and execute the associated actions cooperatively.

III-B Algorithm Design

The main aspects of the algorithm are: (i) decision-making for search vs. task execution, (ii) evaluation of action candidates, (iii) consideration of time constraints, and (iv) cooperation of multiple agents.

To solve this problem, we extend the Multi-robot Efficient Search Path Planning (MESPP) algorithm proposed by Hollinger and Singh [8]. The agents try to maximize a reward function within a finite horizon. Then, they cooperate implicitly by executing the actions in a sequential manner. Efficiency is provably within a constant factor of the optimum when the optimization function is non-decreasing and submodular [19]. To apply this to our problem, we define an optimization function with these properties that accounts for the remaining time at a particular stage in the mission. Our function is derived in three steps:

  1. Definition of a function which predicts the total reward within the time limit.

  2. Formulation of an action-evaluation function, and demonstration of a decision-making algorithm which employs this function.

  3. Derivation of the non-decreasing submodular optimization function for multiple agents.

III-B1 Reward prediction function

This function forecasts the total reward attainable within the time limit. We assume that agents have a set of already found tasks $T = \{\tau_1, \dots, \tau_n\}$. Each task $\tau_i$ is associated with an execution time (cost) $c_i$ and a reward $r_i$. With this information, the total reward can be predicted by choosing tasks such that the total time cost does not exceed a given constraint. This reward prediction function is represented as $R(T, t)$, where $t$ is the time budget remaining for a particular agent:

$$R(T, t) = \max_{x_1, \dots, x_n \in \{0, 1\}} \sum_{i=1}^{n} r_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} c_i x_i \le t. \tag{1}$$
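As a minimal sketch of the reward prediction described above, the function below picks the subset of found tasks with the highest total reward whose total cost fits the remaining budget. The name `predict_reward`, the `(cost, reward)` tuple layout, the integer time resolution, and all numbers are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# (cost in whole seconds, reward); integer costs are an assumption of this sketch.
Task = Tuple[int, int]


def predict_reward(tasks: List[Task], budget: int) -> int:
    """Maximum total reward of any task subset whose total cost fits the budget."""
    best = [0] * (budget + 1)  # best[b] = max reward using at most b seconds
    for cost, reward in tasks:
        # Iterate budgets downwards so that each task is used at most once.
        for b in range(budget, cost - 1, -1):
            best[b] = max(best[b], best[b - cost] + reward)
    return best[budget]


# Hypothetical found tasks: three objects with different costs and rewards.
found = [(40, 1), (60, 3), (50, 2)]
print(predict_reward(found, 100))  # the 40 s and 60 s tasks fit together -> 4
print(predict_reward(found, 30))   # nothing fits -> 0
```

The table-based formulation keeps the cost of one prediction proportional to the number of tasks times the budget resolution, which matters because the prediction is evaluated once per candidate action.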




III-B2 Action-evaluation function and decision-making algorithm

The action-evaluation function scores an exploratory action by the expected increase in the predicted reward. An exploratory action $a$ consuming time $c_a$ could increase or decrease the total reward, depending on whether or not it leads to finding new, more valuable, tasks.

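The utility of exploration depends on the set of already found tasks $T$, the remaining time budget $t$, and the probability of finding new tasks. By accounting for these effects, the action-evaluation function computes the reward increase:

$$V(a, T, t) = \sum_{T_{\mathrm{new}} \in \mathcal{T}_a} P(T_{\mathrm{new}} \mid a)\, R(T \cup T_{\mathrm{new}},\, t - c_a) - R(T, t), \tag{4}$$

where $R$ is the reward prediction function, $\mathcal{T}_a$ is the collection of task sets findable by performing action $a$ (including the case that no task is found), $P(T_{\mathrm{new}} \mid a)$ represents the probability of finding the set of tasks $T_{\mathrm{new}}$ with action $a$, and $R(T \cup T_{\mathrm{new}}, t - c_a)$ denotes the total reward with the new tasks after deducting the action cost $c_a$ from the agent’s budget $t$.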

Algorithm 1 shows our logic for determining the best action based on the expected reward increase. Eq. 4 is used to evaluate all exploratory actions (Lines 1-3) and pick the best one (Line 4). If the predicted reward decreases with the best exploratory action, a known task is executed instead (Lines 5-7).

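Input:  Found tasks $T$, time budget $t$
Output:  Highest-value action $a^*$
1:  for all exploratory actions $a$ do
2:     Calculate $V(a, T, t)$ using Eq. 4.
3:  end for
4:  $a^* \leftarrow \arg\max_a V(a, T, t)$
5:  if $V(a^*, T, t) < 0$ then
6:     execute_task()
7:  end if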
Algorithm 1 select_action procedure
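The selection logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: every exploratory action carries a hypothetical, explicitly enumerated outcome distribution, task costs are taken as independent of the agent position, and the reward prediction is done by brute force to keep the block short. All names and numbers are made up for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Task = Tuple[str, int, int]  # (name, cost, reward)
# Exploratory action: (name, time cost, outcomes). `outcomes` maps each set of
# newly found tasks (the empty set means "nothing found") to its probability.
Action = Tuple[str, int, Dict[FrozenSet[Task], float]]


def predict_reward(tasks: FrozenSet[Task], budget: int) -> int:
    """Brute-force reward prediction: best subset of tasks fitting the budget."""
    best = 0
    items = list(tasks)
    for k in range(len(items) + 1):
        for subset in combinations(items, k):
            if sum(cost for _, cost, _ in subset) <= budget:
                best = max(best, sum(reward for _, _, reward in subset))
    return best


def evaluate(action: Action, found: FrozenSet[Task], budget: int) -> float:
    """Expected change in predicted reward caused by one exploratory action."""
    _, cost, outcomes = action
    expected = sum(p * predict_reward(found | new, budget - cost)
                   for new, p in outcomes.items())
    return expected - predict_reward(found, budget)


def select_action(actions: List[Action], found: FrozenSet[Task], budget: int) -> str:
    """Name of the best exploratory action, or 'execute_task' if exploring hurts."""
    best = max(actions, key=lambda a: evaluate(a, found, budget))
    if evaluate(best, found, budget) < 0:
        return "execute_task"
    return best[0]


found = frozenset({("blue", 40, 1)})
actions = [("north", 30, {frozenset(): 0.5, frozenset({("red", 50, 3)}): 0.5})]
print(select_action(actions, found, 100))  # ample time: exploring pays off
print(select_action(actions, found, 60))   # little time: secure the known task
```

With a budget of 100 the exploratory action has an expected gain of one point, so it is chosen; with a budget of 60 the 30 s detour leaves no time for any pickup, so the known task is executed instead.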

III-B3 Optimization function

Finally, we define an optimization function for executing a sequence of actions $A$:

$$F(A) = R(T_A, t_0), \tag{5}$$

where $R$ is the reward prediction function, $T_A$ is the set of tasks found along actions $A$, and $t_0$ is the initial time limit.

The key idea is that executing Algorithm 1 is the same as maximizing the optimization function for a single action. Therefore, if Eq. 5 is nondecreasing and submodular (a proof of the properties required for near-optimality is presented in the Appendix), a sequential decision-making procedure at each replanning stage gives near-optimal solutions with multiple agents (Algorithm 2). After one agent plans, the chosen action is taken into account through the reward function, allowing the others to plan based on its decision. A key benefit of this approach is that its complexity increases only linearly with the number of agents, making it feasible for real-time planning.

Input:  Found tasks $T$, time budget $t$
  while  do
     select_action($T$, $t$)
     Update , , and .
  end while
Algorithm 2 replan_actions procedure
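The sequential, implicitly coordinated planning described above can be illustrated with a toy objective. In the sketch below, agents plan one after another and each picks the candidate path with the largest marginal gain given what earlier agents have already claimed. The objective (number of newly covered cells) is a hypothetical stand-in for the predicted reward, and `plan_sequentially` and the agent names are assumptions of this example.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def plan_sequentially(candidates: Dict[str, List[Set[Cell]]]) -> Dict[str, Set[Cell]]:
    """Agents plan in turn; each maximizes its marginal gain over claimed cells."""
    claimed: Set[Cell] = set()
    plan: Dict[str, Set[Cell]] = {}
    for agent, paths in candidates.items():
        # Only this agent's own candidates are evaluated, so the total work
        # grows linearly with the number of agents.
        best = max(paths, key=lambda path: len(path - claimed))
        plan[agent] = best
        claimed |= best  # later agents see this decision
    return plan


candidates = {
    "uav1": [{(0, 0), (0, 1)}, {(1, 0)}],
    "uav2": [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}],
}
print(plan_sequentially(candidates))
```

Here the second agent avoids the cells already claimed by the first one without any joint optimization over both agents, which is the point of implicit coordination.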

IV MBZIRC Application

This section overviews the MBZIRC task (Challenge 3) motivating our general decision-making approach. Then, we describe how its elements are adapted for this scenario.

IV-A Challenge Description

Challenge 3 of the MBZIRC takes place in an  m   m field containing three types of objects. Each object has a different score, distinguishable by color. Using three UAVs, the objective is to explore the field for objects, then pick up and drop them in a designated box to score points. Points are only obtained upon successful delivery to the box. The three object classes are:

  • Static. Does not move, has several point categories.

  • Moving. Moves randomly, has a higher point category.

  • Large. Does not move, requires UAV collaboration to pick up. (For simplicity, this work does not consider large objects. We note that they can be easily included in the current framework using simple cooperative logic.)

The field contains ten static, ten moving, and three large objects. The dropping box is located in the middle of the field and the three UAVs start from the same point. The time limit for the mission is  mins.

IV-B Action-evaluation Function

To show the flexibility of our framework, this section describes how it can be adapted to the MBZIRC setup. We specify the task set and define its probability density for exploration. Based on these ideas, we formulate the reward prediction function (Eq. 4) for decision-making.

IV-B1 Task definitions

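We associate a new task with each object in the arena. The cost $c_i$ of a task $\tau_i$ is the time taken to complete the pick-up and drop-off action, calculated as:

$$c_i = t_{\mathrm{move}} + t_{\mathrm{pick}} + t_{\mathrm{transfer}} + t_{\mathrm{drop}}, \tag{6}$$

where $t_{\mathrm{move}}$ is the time for a UAV to move to the object from its current position, $t_{\mathrm{pick}}$ is the time to pick up the object given its type, and $t_{\mathrm{transfer}}$ and $t_{\mathrm{drop}}$ are the times to transfer the object to and deposit it in the dropping box.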

The reward of a task is simply the score points that are obtained upon the successful delivery of an object.
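As a small worked example of the task cost, the sketch below sums the four time components for one object, assuming straight-line flight at constant speed. The function name, the coordinates, the speed, and the pick-up and drop times are all hypothetical values chosen for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def task_cost(uav: Point, obj: Point, box: Point, speed: float,
              t_pick: float, t_drop: float) -> float:
    """Time to fly to the object, pick it up, carry it to the box, and drop it."""
    t_move = math.dist(uav, obj) / speed      # UAV position -> object
    t_transfer = math.dist(obj, box) / speed  # object -> dropping box
    return t_move + t_pick + t_transfer + t_drop


# Hypothetical numbers: 50 m to the object, 40 m to the box, 5 m/s flight speed.
print(task_cost((0, 0), (30, 40), (30, 0), speed=5.0, t_pick=10.0, t_drop=5.0))  # 33.0
```

Because the first term depends on where the UAV currently is, the same object has a different cost for each agent and at each replanning step.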

IV-B2 Probability of finding new tasks

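The probability of finding new tasks (objects) is expressed using a Probability Density Map (PDM), updated through a Bayesian filtering procedure. The PDM is created by discretizing the arena area into a grid, where each cell represents a state $x$. At time-step $k$, the filter calculates the belief distribution $\mathrm{bel}(x_k)$ from measurement and control data using the equations [20]:

$$\overline{\mathrm{bel}}(x_k) = \sum_{x_{k-1}} p(x_k \mid u_k, x_{k-1})\, \mathrm{bel}(x_{k-1}), \tag{7}$$

$$\mathrm{bel}(x_k) = \eta\, p(z_k \mid x_k)\, \overline{\mathrm{bel}}(x_k), \tag{8}$$

where $p(x_k \mid u_k, x_{k-1})$ is the transition probability between successive states given the control input $u_k$, $p(z_k \mid x_k)$ is the measurement model for observation $z_k$, and $\eta$ is a normalization constant.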

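To handle multiple tasks, separate PDMs are stored for each object and updated using Eqs. 7 and 8. Note that the control input is neglected since the UAVs have no control over the objects’ motion. Static objects maintain a constant probability:

$$p(x_k \mid x_{k-1}) = \begin{cases} 1 & \text{if } x_k = x_{k-1}, \\ 0 & \text{otherwise,} \end{cases} \tag{9}$$

where $x_k$ denotes the cell occupied by the object at time-step $k$.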
The motion of moving objects is treated as a random walk, such that they enter adjacent cells with probability :


Observations are made by detecting the objects in images recorded by a downward-facing camera. For each UAV, state estimation is performed by fusing Global Positioning System (GPS) and visual odometry data. By combining this with attitude information, we determine a detected object’s position in a fixed coordinate frame, and compute the measurement likelihood given its color and position.
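The PDM update can be sketched as a discrete Bayes filter on a small grid. The block below is a simplified illustration: the random walk moves a fixed probability `p_move` to each 4-connected neighbour (setting `p_move = 0` gives the static-object model), and the measurement step only handles the case where the camera saw the listed cells and detected nothing. Both modelling choices, and all names and numbers, are assumptions of this sketch rather than the paper's exact models.

```python
from typing import List, Set, Tuple

Grid = List[List[float]]


def predict(belief: Grid, p_move: float) -> Grid:
    """Prediction step: probability p_move flows to each 4-connected neighbour."""
    rows, cols = len(belief), len(belief[0])
    new = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [(r + dr, c + dc)
                          for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            # Whatever does not move to a neighbour stays in the cell.
            new[r][c] += belief[r][c] * (1.0 - p_move * len(neighbours))
            for nr, nc in neighbours:
                new[nr][nc] += belief[r][c] * p_move
    return new


def update(belief: Grid, observed: Set[Tuple[int, int]], p_detect: float) -> Grid:
    """Measurement step for a 'nothing detected' observation of some cells.

    Assumes some probability mass remains after the update (eta > 0).
    """
    new = [[b * (1.0 - p_detect) if (r, c) in observed else b
            for c, b in enumerate(row)] for r, row in enumerate(belief)]
    eta = sum(map(sum, new))  # normalization constant
    return [[b / eta for b in row] for row in new]


belief = [[0.25, 0.25], [0.25, 0.25]]
belief = predict(belief, p_move=0.1)
belief = update(belief, observed={(0, 0)}, p_detect=1.0)
print(belief)
```

After an empty observation of one cell, its probability drops to zero and the remaining mass is redistributed over the unobserved cells; repeated prediction steps then let probability flow back into explored areas for moving objects.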

To reduce calculation, we approximate Eq. 4 as Eq. 11:


where $P(\tau_i \mid a)$ represents the probability of finding the specific task $\tau_i$ with action $a$.

IV-B3 Reward prediction function

From Section III-B1, reward predictions are determined by calculating the maximum attainable reward for a set of found tasks within a time limit. To do this, we cast Eq. 1 as the well-known knapsack problem and solve it using dynamic programming [21].

For our application, the set of found tasks contains their costs and rewards as defined in Section IV-B1. We treat the moving objects as static within a certain time frame, such that their probability of moving to cells whose distance is larger than the camera’s viewing range is zero. If the time since the last observation of a moving object exceeds a threshold, it is considered unknown and must be searched for again. However, the PDM continues to propagate the probability of a moving object, making it easier to search for it again.

A key concept is that the task cost changes with an agent’s position. If the UAV is close to an object, the time to move to it is relatively short. Moreover, upon a delivery, the UAV starts from the dropping box position. The choice of the first object is thus addressed as a special case, given that the order of the later objects does not change the optimization (the UAV always starts from the dropping box position).

We handle this by using two tables in our dynamic programming method. The first and second tables calculate the maximum reward with the cost considered from the dropping box position and an arbitrary UAV position in the arena, respectively, corresponding to successive and first objects. As new objects are found, they are added to the table and assigned one of three decision labels: (1) pick up now (first), (2) pick up later (successive), or (3) do not pick up.

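The first table $D_{\mathrm{box}}$, considering only cost from the dropping box position, is calculated as:

$$D_{\mathrm{box}}(i, t) = \max\bigl\{ D_{\mathrm{box}}(i-1, t),\; D_{\mathrm{box}}(i-1,\, t - c_i^{\mathrm{box}}) + r_i \bigr\}, \tag{12}$$

where $D_{\mathrm{box}}(i, t)$ is the maximum reward attainable from the first $i$ found tasks within budget $t$, $c_i^{\mathrm{box}}$ is the cost to pick up task $\tau_i$ from the dropping box, and $r_i$ is the reward of the task $\tau_i$. The second term applies only when $t \ge c_i^{\mathrm{box}}$.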

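Using Eq. 12, the second table $D_{\mathrm{uav}}$, considering cost from the current UAV position, is calculated as:

$$D_{\mathrm{uav}}(i, t) = \max\bigl\{ D_{\mathrm{box}}(i-1,\, t - c_i^{\mathrm{uav}}) + r_i,\; D_{\mathrm{uav}}(i-1,\, t - c_i^{\mathrm{box}}) + r_i,\; D_{\mathrm{uav}}(i-1, t) \bigr\}, \tag{13}$$

where $c_i^{\mathrm{uav}}$ is the pick-up cost of task $\tau_i$ from the current UAV position, $c_i^{\mathrm{box}}$ is its pick-up cost from the dropping box, and $r_i$ is its reward. The three terms of the maximum correspond to decision labels (1), (2), and (3), respectively.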

This reward prediction function captures the probability of finding objects during exploration. For example, with little time remaining, searching areas far away from the dropping box offers no rewards since deliveries are impossible.
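The two-table dynamic program described above can be sketched as follows. The block keeps one table for tasks costed from the dropping box and one for plans in which exactly one task (the first pickup) is costed from the current UAV position. Integer costs, the tuple layout `(cost_from_uav, cost_from_box, reward)`, and the numbers are assumptions of this example; the decision labels are not stored, only the best reward is returned.

```python
from typing import List, Tuple

NEG_INF = float("-inf")
Task = Tuple[int, int, int]  # (cost from UAV position, cost from box, reward)


def predict_reward_two_tables(tasks: List[Task], budget: int) -> float:
    """Best reward when the first pickup starts at the UAV and the rest at the box."""
    box = [0.0] * (budget + 1)      # every chosen task costed from the box
    uav = [NEG_INF] * (budget + 1)  # exactly one chosen task costed from the UAV
    for cost_uav, cost_box, reward in tasks:
        # Budgets are scanned downwards so entries read here still describe
        # the previous tasks only.
        for b in range(budget, -1, -1):
            first = box[b - cost_uav] + reward if b >= cost_uav else NEG_INF
            later = uav[b - cost_box] + reward if b >= cost_box else NEG_INF
            uav[b] = max(uav[b], first, later)  # labels (1), (2), (3)
        for b in range(budget, cost_box - 1, -1):
            box[b] = max(box[b], box[b - cost_box] + reward)
    return max(0.0, uav[budget])  # picking up nothing is always allowed


# Hypothetical tasks: the first object is close to the UAV but far from the box.
tasks = [(20, 60, 3), (50, 40, 1), (70, 50, 2)]
print(predict_reward_two_tables(tasks, 100))  # 5.0: first object now, third later
```

Treating only the first pickup as position-dependent keeps the problem a knapsack: once the UAV has delivered one object, every remaining task has a fixed round-trip cost from the box.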

IV-C Action Definitions

A search action (exploration) involves flying a path in the arena to obtain measurements from the downward-facing camera. To plan paths, we use 2D grid-based planning on the PDM at a constant altitude, where each cell has the dimensions of the camera field of view (FoV) (Fig. 2). For an arbitrary horizon, we enumerate every path executable from the current cell. In addition, the paths starting from the highest-probability cell are considered (four red arrows in Fig. 2). The cost of each action is computed as the travel time assuming constant velocity.

A task action involves picking up an object, then transferring it to and depositing it in the dropping box. This cost is defined in Eq. 6.
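The enumeration of search actions on the planning grid can be sketched as below: all 4-connected paths of a given horizon are generated from the current cell and scored by the predicted score of the cells they visit. The 3x3 grid, the score values, and the function names are hypothetical; paths are allowed to revisit cells, and a revisited cell is counted once.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def enumerate_paths(start: Cell, horizon: int, rows: int, cols: int) -> List[List[Cell]]:
    """All 4-connected paths of exactly `horizon` steps that stay on the grid."""
    paths = [[start]]
    for _ in range(horizon):
        paths = [path + [(r + dr, c + dc)]
                 for path in paths
                 for r, c in [path[-1]]
                 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
    return paths


def path_value(path: List[Cell], score_grid: List[List[float]]) -> float:
    """Sum of predicted scores over the distinct cells entered along the path."""
    return sum(score_grid[r][c] for r, c in set(path[1:]))


score_grid = [[0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 0.0, 5.0]]
paths = enumerate_paths((0, 0), horizon=2, rows=3, cols=3)
best = max(paths, key=lambda p: path_value(p, score_grid))
print(len(paths), path_value(best, score_grid))
```

The number of paths grows exponentially with the horizon, which is one reason to also consider a direct path to the highest-value cell in addition to the short-horizon enumeration.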

V Experimental Setup

(a) 2D simulator
(b) 3D simulator
(c) PDM
(d) PDM grid for planning
Fig. 2: (a) shows our 2D simulator with UAVs (green), their camera FoVs (green squares), static objects (red, blue, black), and moving objects (yellow). (b) depicts our RotorS setup used to simulate missions with all system components. (c) visualizes the PDM used for predicting scores with the position of each UAV (yellow cube) and its planned path (red line). The height of each colored cell corresponds to the probability of finding objects. Based on this, (d) illustrates the grid used for planning exploration paths. The color of a cell represents the sum of predicted score ($\sum_i p_i s_i$, where $p_i$ is the probability of finding object $i$ with score $s_i$). The color scale transitions from blue (low) to red (high). Red arrows indicate possible paths emanating from a UAV position. We also consider paths exploring the cell with the highest predicted score, shown in orange.

This section outlines the setups used for our simulation experiments. First, we detail the 2D simulator used to validate our framework by comparison to different decision-making methods. Then, we present the 3D simulator developed for testing the system with all mission components.

V-A 2D Simulation

Our decision-making framework is validated in a Python-based 2D simulator. It is assumed that the three UAVs fly at constant altitude and detect objects within the camera FoV reliably. Table I summarizes our experimental parameters.

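Group | Parameter | Value
Objects | No. of 1-point static | 4
Objects | No. of 2-point static | 3
Objects | No. of 3-point static | 3
Objects | No. of 3-point moving | 10
Objects | Velocity of moving |  m/s
UAV | Camera FoV area |  m
UAV | Velocity |  m/s
Reward | Grid resolution |  m
Reward | static |  s
Reward | moving |  s
Reward | static |  s
Reward | moving |  s
Reward | Calculation time |  s
Reward | Tracking timeout for moving |  s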
TABLE I: 2D simulation parameters

We use this setup to evaluate our algorithm against the three different decision-making strategies illustrated in Fig. 3 and described below.

Fig. 3: Three decision-making strategies used as benchmarks to evaluate our framework in 2D simulation: (top) Random, (middle) Cover-field-first, and (bottom) Cover-and-pickup.

V-A1 Random

All UAVs move by following random paths. When a UAV finds an object, it picks it up immediately.

V-A2 Cover-field-first

All UAVs first cover the field with a “zig-zag” coverage pattern and then pick up the static objects found. The order of objects picked up is chosen based on time efficiency (the cost of an object divided by its reward). When a UAV finds a moving object, it picks it up immediately. After picking up all the static objects, the UAVs explore the field randomly to find moving objects.

V-A3 Cover-and-pickup

Each UAV is assigned an area of the arena to cover using a “zig-zag” pattern. When a UAV finds an object, it picks it up immediately, and returns to its last position upon dropoff to continue the coverage path.

For each strategy above, and our own, we consider different initial time limits  s with the parameters in Table I. For each budget, five trials are conducted with randomly initialized object positions.
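Two building blocks of the benchmark strategies are simple enough to sketch: the “zig-zag” coverage pattern and the time-efficiency ordering used by cover-field-first. The grid size, the object tuples `(name, cost, reward)`, and the function names below are hypothetical and only illustrate the idea.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def zigzag_path(rows: int, cols: int) -> List[Cell]:
    """Visit every cell row by row, reversing direction on every other row."""
    path: List[Cell] = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in columns)
    return path


def pickup_order(objects: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Sort found objects by cost per reward point, most time-efficient first."""
    return sorted(objects, key=lambda obj: obj[1] / obj[2])


print(zigzag_path(2, 3))
print(pickup_order([("a", 60, 3), ("b", 30, 1), ("c", 30, 2)]))
```

Neither routine looks at the remaining time, which is the main difference from the budget-aware action selection evaluated against them.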

V-B 3D Simulation

We use RotorS [22], a Gazebo-based simulation environment, to demonstrate our system running in a mission scenario. Our UAVs are three AscTec Firefly platforms modelled with realistic dynamics and on-board sensors.

Our decision-making unit is implemented in an integrated system containing all other functions required for the mission, including attitude control, localization, object detection, state machine, visual servoing for pickup, collision avoidance, etc. [23] This module receives the objects detected in images and estimated UAV states as inputs, and outputs actions for waypoint-based exploration or target pickup.

V-C System Overview

Each UAV in our system has its own PDM and shares decision and measurement information with the others. The UAVs update the PDM individually, but using the same information, by relaying object detections to each other. The decision proceeds as shown in Fig. 4. The shared decisions are used for implicit coordination. In addition, the UAVs can detect the availability of other UAVs and can plan based on this information. This enables the agents to adapt if some agents crash.

Fig. 4: State machine of a UAV. The UAVs perform decision-making after each action (explore or pick up an object) while sending their decision, position, and measurements to each other.

VI Experimental Results

This section presents our experimental results. We evaluate our framework against benchmarks in 2D simulation. Then, we show examples of different decision-making behaviour to highlight the benefits of our approach in a typical mission.

VI-A Comparison Against Benchmarks

Fig. 5 depicts the evaluation results with varying time limits, as described in Section V-A. The y-axis measures the average competition points attained over the five trials. As expected, the cover-field-first method (yellow) obtains no points with the shortest limits ( s) because it lacks time to start collecting objects. This confirms that a decision-making strategy is needed to balance search and action when an exhaustive search is impossible. Our approach (red) performs best with shorter limits ( s). Unlike the benchmarks, our algorithm uses reward predictions to distinguish between object types given the UAV positions, allowing the UAVs to decide between search and pickup within the time limit, e.g., to seek a more valuable object than one already found. We show this with examples in Section VI-B.

With longer limits ( s), the importance of decision-making decreases as there is more time for deliveries. The cover-and-pickup strategy (green) scores highly since its “zig-zag” path ensures complete field coverage. Our method performs competitively even without this guarantee, as it keeps track of areas already explored. The cover-field-first method is worse in comparison as the UAVs search for highly-valued moving objects only later in the mission. The performance of the random strategy (blue) also deteriorates due to its low probability of finding objects away from the dropping box.

Fig. 6 shows the evolution of score for two missions with different limits. Using our algorithm, more points are obtained at later stages as the UAVs decide to secure object pickups when there is little time remaining.

Fig. 5: Comparison of our decision-making method (red) against benchmarks with different initial time limits. The solid dots show the Challenge scores averaged over 5 trials, with error bars indicating extrema. By accounting for the time remaining for search and action decisions, our method performs best across the tested range.
(a) Time limit:  s
(b) Time limit:  s
Fig. 6: Comparison of the score point evolution for our decision-making method (red) against benchmarks in (a)  s and (b)  s missions. In (a), significantly more points are obtained using our approach as it accounts for the time restriction.
(a) Explore
(b) Pick up moving object
(c) Pick up static objects
(d) Pick up static object instead of a moving one
Fig. 7: Examples of decision-making situations in our Gazebo-based environment. (a)-(d) show the visualization (left) and simulation views (right) corresponding to the four decisions described in the text. Blue (1pt), green (2pt), and red (3pt) shapes indicate static objects, while yellow (3pt) shapes represent moving ones. The UAV (red circles) and object (blue circles) positions are shown. Red and blue signify decisions to fly exploratory paths and pick up objects, respectively, with blue arrows illustrating pick-up actions. By using a reward prediction function, our algorithm accounts for different trade-off aspects to maximize the score obtained.

VI-B Decision-making Examples in a 3D Simulation

In the following, we present examples of decision-making using our framework in the RotorS environment (Section V-B). The aim is to show that our method can account for different situational aspects when making trade-offs, e.g., picking up moving objects before they are lost, or choosing between search and action based on the time remaining.

  (a) Explore. Early in the mission, the UAVs decided not to pick up objects despite having found them, as there was ample time to search for alternatives (Fig. 7(a)). Exploratory paths with implicit coordination are generated to maximize the predicted total score based on the PDM.

  (b) Pick up moving object. A UAV picked up a moving object upon its detection (Fig. 7(b)). Because moving objects are considered static only temporarily, neglecting pickup would lead to the object becoming unknown in the future, causing the predicted total score to decrease.

  (c) Pick up static objects. With many static objects and little time remaining, the UAVs decided to pick the objects up (Fig. 7(c)). As there was insufficient time to both risk exploration and return to the objects later, a decision to search caused the predicted total score to decrease.

  (d) Pick up static object instead of a moving one. A UAV found both a new static and a new moving object (Fig. 7(d)). With enough time, the moving object would be targeted. However, in a situation with little time remaining, the static object is preferred because its pick-up time is shorter. This decision considers both the time constraints and object type, so cannot be performed with simple rule-based logic, e.g., “Pick up a moving object as soon as it is found”.

VII Conclusion

This work introduced a decentralized multi-agent decision-making framework for the search and action problem in a time-constrained setting. Our algorithm uses probabilistic reasoning to make decisions leading to highest predicted rewards over a given time limit. The near-optimality of the output policies is guaranteed by exploiting the properties of the optimization function used for implicit coordination.

Our framework was applied in the MBZIRC search, pick, and place scenario by specifying a PDM and reward prediction function for action selection. We showed its advantages over alternative decision-making strategies in terms of mission performance with varying time limits. Experiments in a 3D environment demonstrated real-time system integration with examples of informed decision-making.

Future research will address adapting our algorithm to other search-and-action scenarios, e.g., search-and-rescue, where the time cost and reward can be defined for all tasks and exploratory actions. Here, the time limit can reflect a limited resource, e.g., battery level or energy consumption. Another interesting avenue for future research is the use of methods other than Bayesian filtering and a PDM to calculate the probability of finding new tasks, e.g., for unknown field size and probability distribution. Possible extensions involve allowing for flight at variable altitudes, alternative sensor types, and unknown environments/tasks.


This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644227 and from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 15.0029, and was partially sponsored by the Mohamed Bin Zayed International Robotics Challenge.

In the following, we sketch the proof of the aforementioned properties (positive monotonicity and submodularity) of the optimization function specified in Eq. 5.

-A Positive monotonicity

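If the function $F$ satisfies:

$$F(A) \le F(B) \quad \text{for all action sets } A \subseteq B,$$

it is called non-decreasing.

Proof. $A \subseteq B$, and $T_A \subseteq T_B$ because $B$ contains all actions in $A$, where $T_A$ and $T_B$ are the sets of tasks found along $A$ and $B$. From the definition of the reward prediction function, it is obvious that with a larger set of tasks, the predicted reward cannot decrease. As $F$ satisfies this condition, it is proven to be nondecreasing.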

-B Submodularity

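If, for any two sets where one contains the other, adding a new set increases the value of the function on the larger set by at most as much as on the smaller set, the function is called submodular.

In other words, the function exhibits diminishing returns: the more tasks are already available, the less an additional set contributes.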
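For reference, the two properties can be written in their standard textbook form. The notation below is generic and assumed for illustration (a set function f over subsets of a ground set V); it is not necessarily the notation used for the optimization function in Eq. 5.

```latex
% Generic notation (assumed): f maps subsets of a ground set V to real values.
\begin{align}
  % Non-decreasing (positive monotonicity)
  f(A) &\le f(B)
    && \forall\, A \subseteq B \subseteq V, \\
  % Submodularity (diminishing returns)
  f(A \cup C) - f(A) &\ge f(B \cup C) - f(B)
    && \forall\, A \subseteq B \subseteq V,\; C \subseteq V \setminus B.
\end{align}
```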

Proof. Since the smaller set is contained in the larger one, there are three cases, distinguished by whether the total cost of the tasks exceeds the time limit.

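Case 1. The total cost stays within the time limit for both the smaller and the larger set.

In this case, the reward increase is the same as the reward of the added task for both sets.

As a result, the two increases are equal and the submodularity condition holds with equality.
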
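Case 2. The total cost stays within the time limit for the smaller set but exceeds it for the larger set.

In this case, the increase for the larger set is at most the increase for the smaller set, because the tasks must be selected to satisfy the constraint, which reduces the increase of the function for the larger set.
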

Case 3. The total cost exceeds the time limit for both sets. This means that, for each set, a subset of the tasks has already been selected to match the constraint. Now, consider adding a new task. In this case, a reward increase occurs only by replacing already-selected tasks with the new task.

With the larger set of tasks, the initially selected tasks are obtained by replacing tasks selected from the smaller set. If the tasks that the new task would displace were not replaced in this operation, the reward increase is the same. If they were already replaced, the reward increase is smaller, because they were replaced by more valuable tasks. Then, the increase for the larger set is at most the increase for the smaller set.

As the optimization function satisfies the condition in all three cases above, it is submodular.
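Both properties can also be checked numerically by brute force on small ground sets. The sketch below is a generic checker, not part of the original framework. The two toy functions (`coverage`, a standard submodular example, and `squared_size`, a non-submodular one) and the `AREAS` data are hypothetical and only illustrate the definitions; they are not the optimization function of Eq. 5.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

SetFunction = Callable[[FrozenSet[str]], float]


def subsets(ground: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets of the ground set (exponential; use only for small sets)."""
    items = list(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_monotone(f: SetFunction, ground: Iterable[str]) -> bool:
    """True if f(A) <= f(B) for every pair of subsets with A contained in B."""
    all_sets = subsets(ground)
    return all(f(a) <= f(b) for a in all_sets for b in all_sets if a <= b)


def is_submodular(f: SetFunction, ground: Iterable[str]) -> bool:
    """True if the gain of adding any element never grows as the set grows."""
    ground_set = frozenset(ground)
    all_sets = subsets(ground_set)
    for a in all_sets:
        for b in all_sets:
            if not a <= b:
                continue
            for x in ground_set - b:
                gain_small = f(a | {x}) - f(a)
                gain_large = f(b | {x}) - f(b)
                if gain_large > gain_small + 1e-12:  # tolerance for float noise
                    return False
    return True


# Hypothetical example: each agent observes a set of grid cells.
AREAS = {"u1": {1, 2}, "u2": {2, 3}, "u3": {3, 4, 5}}


def coverage(selected: FrozenSet[str]) -> float:
    """Number of distinct cells observed by the selected agents."""
    covered = set()
    for name in selected:
        covered |= AREAS[name]
    return float(len(covered))


def squared_size(selected: FrozenSet[str]) -> float:
    """Monotone but not submodular: gains grow with the set size."""
    return float(len(selected) ** 2)


print(is_monotone(coverage, AREAS), is_submodular(coverage, AREAS))
print(is_monotone(squared_size, AREAS), is_submodular(squared_size, AREAS))
```

Checking single-element additions is sufficient here, since diminishing returns for single elements is equivalent to the set-based definition of submodularity.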


  • Rudol and Doherty [2008] P. Rudol and P. Doherty, “Human body detection and geolocalization for UAV search and rescue missions using color and thermal imagery,” in Aerospace Conference, 2008 IEEE.    IEEE, 2008, pp. 1–8.
  • Murphy et al. [2008] R. R. Murphy, S. Tadokoro, D. Nardi, A. Jacoff, P. Fiorini, H. Choset, and A. M. Erkmen, “Search and rescue robotics,” in Springer Handbook of Robotics.    Springer, 2008, pp. 1151–1173.
  • Kingston et al. [2016] D. Kingston, S. Rasmussen, and L. Humphrey, “Automated UAV tasks for search and surveillance,” in Control Applications (CCA), 2016 IEEE Conference on.    IEEE, 2016, pp. 1–8.
  • Jin et al. [2003] Y. Jin, A. A. Minai, and M. M. Polycarpou, “Cooperative real-time search and task allocation in UAV teams,” in Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, vol. 1.    IEEE, 2003, pp. 7–12.
  • Maza and Ollero [2007] I. Maza and A. Ollero, “Multiple UAV cooperative searching operation using polygon area decomposition and efficient coverage algorithms,” in Distributed Autonomous Robotic Systems 6.    Springer, 2007, pp. 221–230.
  • Popović et al. [2017] M. Popović, T. Vidal-Calleja, G. Hitz, I. Sa, R. Y. Siegwart, and J. Nieto, “Multiresolution Mapping and Informative Path Planning for UAV-based Terrain Monitoring,” in IEEE/RSJ International Conference on Intelligent Robots and Systems.    Vancouver: IEEE, 2017.
  • Bähnemann et al. [2017a] R. Bähnemann, M. Pantic, M. Popović, D. Schindler, M. Tranzatto, M. Kamel, M. Grimm, J. Widauer, R. Siegwart, and J. Nieto, “The ETH-MAV team in the MBZ international robotics challenge,” arXiv preprint arXiv:1710.08275, 2017.
  • Hollinger and Singh [2008] G. Hollinger and S. Singh, “Proofs and experiments in scalable, near-optimal search by multiple robots,” Proceedings of Robotics: Science and Systems IV, Zurich, Switzerland, vol. 1, 2008.
  • Galceran and Carreras [2013] E. Galceran and M. Carreras, “A survey on coverage path planning for robotics,” Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1258–1276, 2013.
  • Isler et al. [2011] V. Isler, G. A. Hollinger, and T. H. Chung, “Search and pursuit-evasion in mobile robotics, a survey,” 2011.
  • Nowakowski and Winkler [1983] R. Nowakowski and P. Winkler, “Vertex-to-vertex pursuit in a graph,” Discrete Mathematics, vol. 43, no. 2-3, pp. 235–239, 1983.
  • Aigner and Fromme [1984] M. Aigner and M. Fromme, “A game of cops and robbers,” Discrete Applied Mathematics, vol. 8, no. 1, pp. 1–12, 1984.
  • Adler et al. [2003] M. Adler, H. Räcke, N. Sivadasan, C. Sohler, and B. Vöcking, “Randomized pursuit-evasion in graphs,” Combinatorics, Probability and Computing, vol. 12, no. 03, pp. 225–244, 2003.
  • Wong et al. [2005] E.-M. Wong, F. Bourgault, and T. Furukawa, “Multi-vehicle bayesian search for multiple lost targets,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on.    IEEE, 2005, pp. 3169–3174.
  • Smith [2007] T. Smith, “Probabilistic planning for robotic exploration,” Ph.D. dissertation, Massachusetts Institute of Technology, 2007.
  • Kurniawati et al. [2008] H. Kurniawati, D. Hsu, and W. S. Lee, “SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces.” in Robotics: Science and systems, vol. 2008.    Zurich, Switzerland., 2008.
  • Messias et al. [2013] J. V. Messias, M. T. Spaan, and P. U. Lima, “Asynchronous execution in multiagent POMDPs: Reasoning over partially-observable events,” in AAMAS, vol. 13, 2013, pp. 9–14.
  • Hollinger et al. [2009] G. Hollinger, D. Ferguson, S. Srinivasa, and S. Singh, “Combining search and action for mobile robots,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on.    IEEE, 2009, pp. 952–957.
  • Krause and Golovin [2012] A. Krause and D. Golovin, “Submodular function maximization,” Tractability: Practical Approaches to Hard Problems, vol. 3, no. 19, p. 8, 2012.
  • Thrun [2002] S. Thrun, “Probabilistic robotics,” Communications of the ACM, vol. 45, no. 3, pp. 52–57, 2002.
  • Martello et al. [1999] S. Martello, D. Pisinger, and P. Toth, “Dynamic programming and strong bounds for the 0-1 knapsack problem,” Management Science, vol. 45, no. 3, pp. 414–424, 1999.
  • Furrer et al. [2016] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, Robot Operating System (ROS): The Complete Reference (Volume 1).    Cham: Springer International Publishing, 2016, ch. RotorS—A Modular Gazebo MAV Simulator Framework, pp. 595–625.
  • Bähnemann et al. [2017b] R. Bähnemann, D. Schindler, M. Kamel, R. Siegwart, and J. Nieto, “A decentralized multi-agent unmanned aerial system to search, pick up, and relocate objects,” arXiv preprint arXiv:1707.03734, 2017.