Thanks to ongoing developments in sensing and processing technologies, mobile robots are becoming capable of operating in dynamic and challenging environments in many practical applications, such as search-and-rescue [1, 2], multi-robot exploration [3, 4, 5], and terrain monitoring. In many multi-agent applications, however, efficient task allocation remains an open research problem. To perform fully autonomous missions, algorithms for agent cooperation and efficient area search are required.
To address this, our work focuses on cooperative planning strategies within the context of the Mohamed Bin Zayed International Robotics Challenge (MBZIRC, http://www.mbzirc.com/). In one stage of this competition (Challenge 3), a team of unmanned aerial vehicles (UAVs) is required to collaboratively search for, pick up, and place a set of static and moving objects, gaining points for each one collected. This task poses several challenges:
coordinating multiple UAVs to explore the field,
tracking moving targets efficiently,
trading off exploration to find new objects with picking them up to score points,
making decisions based on the time limitation of a task.
A key aspect of such missions is that the timing to execute actions should be considered, given the targets found. In the MBZIRC, for instance, a UAV can greedily pick up an object to attain a certain score. However, it could be better to invest in exploration to find a more valuable object nearby. The optimal decision here includes several aspects, and differs with the time remaining until mission completion. With stricter time limits, exploration becomes riskier and acting greedily might be preferred. The exploration-exploitation trade-off must thus be addressed while accounting for the fact that optimal decisions differ at various stages of the mission.
To tackle these issues, this paper introduces a multi-agent decision-making algorithm which accounts for (i) target movement, (ii) mission time constraints, and (iii) variable target rewards. The main idea is to treat time as a budget used by the agents. At a given time, all possible actions consume some budget while yielding a certain reward. Each agent has an initial budget specified as the mission time limit, and must choose the sequence of actions maximizing the final reward. To evaluate actions, future rewards are predicted by planning on a probabilistic map. A key aspect of our approach is that the agents do not need to operate fully synchronously. Moreover, by using implicit coordination, the algorithm does not suffer from computational growth with the number of agents, making it applicable to real-time scenarios.
The contributions of this paper are:
A multi-agent decision-making framework which:
is decentralized and real-time with near-optimality guarantees,
considers a fixed time budget,
addresses the search and action problem without using a trade-off parameter.
The validation of our framework on Challenge 3 of the MBZIRC, with an evaluation against alternative strategies.
While our framework is motivated by the MBZIRC, it can be used in any multi-agent search and action scenario.
The remainder of this paper is organized as follows. Section II describes related work. We formulate our proposed method as a general search and action problem in Section III, and describe its application to the MBZIRC in Section IV. Sections V and VI detail our experimental set-up and results. In Section VII, we conclude with a view towards future work.
II Related Work
Significant work has been done in the field of decision-making for the search and action problem. This section provides a brief overview. We first discuss searching for targets only before also considering executing tasks on them.
II-A Coverage Methods
Pure exploration tasks are typically addressed using complete coverage methods. The aim is to generate an obstacle-free path passing through all points in an area, usually by decomposing the problem space. Based on these ideas, Maza and Ollero proposed a sweep-based approach for target search using multiple UAVs. However, their strategies are not applicable in our setup due to the presence of moving targets, which can enter areas already explored.
II-B Search and Pursuit-evasion
There are many search algorithms which allow for target movement. Isler et al.  categorized autonomous search problems using several features, dividing them into two categories: (i) adversarial pursuit-evasion games, and (ii) probabilistic search.
In a pursuit-evasion game, pursuers try to capture evaders as evaders try to avoid capture. One graph-based example of this game is cops-and-robbers [11, 12], which assumes that players have perfect knowledge of each other’s positions. Similarly, Adler et al.  studied hunter-and-rabbit, where players do not know each other’s positions unless found on the same vertex. Their search method aims to maximize worst-case performance so that the targets can be found regardless of how they move. These deterministic methods, however, do not cater for practical sensing limitations.
In contrast, probabilistic search methods try to maximize probability of detection or minimize expected detection time, assuming that the targets’ and searchers’ movements are independent of each other. Like Wong et al. , we follow a Bayesian approach to compute the probability density of targets and use this to coordinate multiple searchers.
Formally, general search problems can be formulated as a Partially Observable Markov Decision Process (POMDP) with hidden states. However, real-time planning within such frameworks is difficult because of the exponential growth in computation with the number of searchers, even with near-optimal solvers [15, 16]. To address this, Hollinger and Singh propose implicit coordination with receding-horizon planning to achieve near-optimality while scaling only linearly with the number of searchers.
The above methods assume that all agents act simultaneously, i.e., using a synchronization system. Alleviating this requirement, Messias et al.  study a multi-agent POMDP for asynchronous execution.
II-C Combined Search and Action
While the strategies above are suitable for efficient target search, they do not take into account any subsequent tasks on them (e.g., object pickup and delivery). Hollinger et al.  were first to examine both aspects in the context of robotics. In their work, the aim is to minimize the combined search and action time with targets found in an environment using finite-horizon plans. However, a key assumption is that actions are executed as soon as targets are found. We leverage similar principles, but also allow agents to choose between search and action given the time budget constraints.
III Proposed Approach
In this section, we present our planning approach for the general search and action problem. By maximizing the attainable reward, our strategy aims to efficiently decide whether to explore an environment or to execute actions given the targets found so far. We begin by defining the problem, then outline the details of our algorithm.
III-A Problem Setup
Our problem involves finding tasks associated with static and moving targets and performing actions to complete them within a specified time limit.
We consider a graph-like grid environment with edge-connected cells. Each edge is weighted by a distance and an agent moves to detect tasks in its current cell. The agent searches for tasks associated with multiple targets, which can be either static or moving. The motion of moving targets is non-adversarial, i.e., it does not change with the searchers’ behaviour. Each target task can be executed through an action, which incurs a time cost and yields a certain reward.
The objective is to obtain the highest possible reward within the time limit with multiple search agents. To achieve this, they must explore the field, locate the tasks, and execute the associated actions cooperatively.
III-B Algorithm Design
The main aspects of the algorithm are: (i) decision-making for search vs. task execution, (ii) evaluation of action candidates, (iii) consideration of time constraints, and (iv) cooperation of multiple agents.
To solve this problem, we extend the Multi-robot Efficient Search Path Planning (MESPP) algorithm proposed by Hollinger and Singh. The agents maximize a reward function within a finite horizon, and cooperate implicitly by planning their actions in a sequential manner. Efficiency is provably within a constant factor of the optimum when the optimization function is non-decreasing and submodular. To apply this to our problem, we define an optimization function with these properties that accounts for the remaining time at a particular stage in the mission. Our function is derived in three steps:
Definition of a function which predicts the total reward within the time limit.
Formulation of an action-evaluation function, and demonstration of a decision-making algorithm which employs this function.
Derivation of the non-decreasing submodular optimization function for multiple agents.
III-B1 Reward prediction function
This function forecasts the total reward attainable within the time limit. We assume that agents have a set of already found tasks $\mathcal{T}$. Each task $i \in \mathcal{T}$ is associated with an execution time (cost) $c_i$ and a reward $r_i$. With this information, the total reward can be predicted by choosing tasks such that the total time cost does not exceed a given constraint. This reward prediction function is represented as $R(\mathcal{T}, t)$, where $t$ is the time budget remaining for a particular agent:

$$R(\mathcal{T}, t) = \max_{\mathcal{S} \subseteq \mathcal{T}} \sum_{i \in \mathcal{S}} r_i \quad \text{s.t.} \quad \sum_{i \in \mathcal{S}} c_i \le t. \qquad (1)$$
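As an illustration, the reward prediction function (Eq. 1) can be computed with a standard 0/1 knapsack dynamic program over the found tasks. The sketch below is our own; the task representation and function name are illustrative, and integer time costs are assumed:

```python
def predict_reward(tasks, budget):
    """Maximum total reward from `tasks` = [(cost, reward), ...] whose
    summed cost fits within the integer time `budget` (0/1 knapsack)."""
    best = [0] * (budget + 1)  # best[t] = max reward using time <= t
    for cost, reward in tasks:
        # Iterate budgets in reverse so each task is used at most once.
        for t in range(budget, cost - 1, -1):
            best[t] = max(best[t], best[t - cost] + reward)
    return best[budget]

# e.g. predict_reward([(3, 2), (2, 2), (4, 3)], 5) -> 4
```

Solving the full knapsack is only needed at replanning time; intermediate rows of the table give the prediction for every smaller budget for free.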
III-B2 Action-evaluation function and decision-making algorithm
The action-evaluation function assesses exploratory actions in terms of the expected increase in predicted reward. An exploratory action consuming a certain amount of time could increase or decrease the attainable total reward, depending on whether or not it leads to finding new, more valuable, tasks.
The utility of exploration depends on the set of already found tasks $\mathcal{T}$, the remaining time budget $t$, and the probability of finding new tasks. Accounting for these effects, the action-evaluation function computes the expected reward increase:

$$E(a) = \sum_{\mathcal{T}' \in \mathcal{N}(a)} P(\mathcal{T}' \mid a)\, R(\mathcal{T} \cup \mathcal{T}',\, t - c_a) \;-\; R(\mathcal{T}, t),$$

where $\mathcal{N}(a)$ is the set of task sets findable by performing action $a$ (including the empty set, i.e., the case that no task is found), $P(\mathcal{T}' \mid a)$ represents the probability of finding the task set $\mathcal{T}'$ with action $a$, and $R(\mathcal{T} \cup \mathcal{T}', t - c_a)$ denotes the total reward with the new tasks after deducting the action cost $c_a$ from the agent's budget $t$.
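To make the expectation concrete, a minimal sketch of the action evaluation could look as follows. The names are ours, the findable outcomes are enumerated explicitly as (probability, new tasks) pairs, and any budgeted reward predictor can stand in for `predict_reward`:

```python
def predict_reward(tasks, budget):
    """Max reward from `tasks` = [(cost, reward), ...] within `budget`."""
    best = [0] * (max(budget, 0) + 1)
    for cost, reward in tasks:
        for t in range(len(best) - 1, cost - 1, -1):
            best[t] = max(best[t], best[t - cost] + reward)
    return best[-1]

def action_value(found_tasks, budget, action_cost, outcomes):
    """Expected reward increase of an exploratory action.
    `outcomes` lists (probability, new_tasks) pairs covering all findable
    task sets, including (p, []) for the case that nothing is found."""
    baseline = predict_reward(found_tasks, budget)
    expected = 0.0
    for prob, new_tasks in outcomes:
        # Reward attainable with the new tasks, after paying the action cost.
        expected += prob * predict_reward(found_tasks + new_tasks,
                                          budget - action_cost)
    return expected - baseline
```

A negative value signals that exploring is expected to waste budget that could be spent executing known tasks.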
III-B3 Optimization function
Finally, we define an optimization function for executing a sequence of actions $A = (a_1, \dots, a_n)$:

$$F(A) = R(\mathcal{T}_A, T), \qquad (5)$$

where $\mathcal{T}_A$ is the set of tasks found along the actions in $A$ and $T$ is the initial time limit.
The key idea is that executing Algorithm 1 is equivalent to maximizing the optimization function for a single action. Therefore, if Eq. 5 is non-decreasing and submodular (a proof of the properties required for near-optimality is presented in the Appendix), a sequential decision-making procedure at each replanning stage gives near-optimal solutions with multiple agents (Algorithm 2). After one agent plans, the chosen action is taken into account through the reward function, allowing the others to plan based on its decision. A key benefit of this approach is that its complexity increases only linearly with the number of agents, making it feasible for real-time planning.
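The sequential, implicitly coordinated planning pattern described above can be sketched as follows. The function names and the team-value callback are our own illustrative choices, not the paper's implementation:

```python
def sequential_allocation(agents, candidate_actions, team_value):
    """Implicit coordination: agents commit to actions one at a time,
    each greedily maximizing the shared objective given the actions
    already chosen by earlier agents.  One sweep over candidates per
    agent, so the cost grows linearly with the number of agents."""
    plan = {}
    for agent in agents:
        best_action, best_score = None, float("-inf")
        for action in candidate_actions[agent]:
            trial = dict(plan)
            trial[agent] = action
            score = team_value(trial)  # objective over the joint plan so far
            if score > best_score:
                best_action, best_score = action, score
        plan[agent] = best_action
    return plan
```

With a non-decreasing submodular `team_value`, later agents automatically avoid duplicating earlier agents' choices, since repeated actions yield no marginal gain.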
IV MBZIRC Application
This section overviews the MBZIRC task (Challenge 3) motivating our general decision-making approach. Then, we describe how its elements are adapted for this scenario.
IV-A Challenge Description
Challenge 3 of the MBZIRC takes place in an outdoor field containing three types of objects. Each object has a different score, distinguishable by color. Using three UAVs, the competing teams must explore the field for objects, then pick them up and drop them in a designated box to score points. Points are only obtained upon successful delivery to the box. The three object classes are:
Static. Does not move, has several point categories.
Moving. Moves randomly, has a higher point category.
Large. Does not move, requires UAV collaboration to pick up. (For simplicity, this work does not consider large objects. We note that they can be easily included in the current framework using simple cooperative logic.)
The field contains ten static, ten moving, and three large objects. The dropping box is located in the middle of the field and the three UAVs start from the same point. The mission is subject to a fixed time limit.
IV-B Action-evaluation Function
To show the flexibility of our framework, this section describes how it can be adapted to the MBZIRC setup. We specify the task set and define its probability density for exploration. Based on these ideas, we formulate the reward prediction function (Eq. 4) for decision-making.
IV-B1 Task definitions
We associate a new task to each object in the arena. The cost $c_i$ of a task is the time taken to complete the pick-up and drop-off action, calculated as:

$$c_i = t_{\mathrm{move}} + t_{\mathrm{pick}} + t_{\mathrm{trans}} + t_{\mathrm{drop}}, \qquad (6)$$

where $t_{\mathrm{move}}$ is the time for a UAV to move to the object from its current position, $t_{\mathrm{pick}}$ is the time to pick up the object given its type, and $t_{\mathrm{trans}}$ and $t_{\mathrm{drop}}$ are the times to transfer the object to and deposit it in the dropping box.
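Assuming straight-line flight at constant velocity, a minimal sketch of this cost model (function and parameter names are ours) is:

```python
def travel_time(p_from, p_to, velocity):
    """Time to fly straight between two 2D points at constant speed."""
    dx, dy = p_to[0] - p_from[0], p_to[1] - p_from[1]
    return (dx * dx + dy * dy) ** 0.5 / velocity

def task_cost(uav_pos, obj_pos, box_pos, velocity, t_pick, t_drop):
    """Time cost of a pick-up task (cf. Eq. 6): fly to the object,
    pick it up, transfer it to the dropping box, and deposit it."""
    t_move = travel_time(uav_pos, obj_pos, velocity)
    t_trans = travel_time(obj_pos, box_pos, velocity)
    return t_move + t_pick + t_trans + t_drop
```

Because the cost depends on the UAV's current position, it must be recomputed whenever the agent replans.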
The reward of a task is simply the score points that are obtained upon the successful delivery of an object.
IV-B2 Probability of finding new tasks
The probability of finding new tasks (objects) is expressed using a Probability Density Map (PDM), updated through a Bayesian filtering procedure. The PDM is created by discretizing the arena area into a grid, where each cell represents a state $x$. At time-step $k$, the filter calculates the belief distribution $bel(x_k)$ from measurement and control data using the equations:

$$\overline{bel}(x_k) = \sum_{x_{k-1}} p(x_k \mid u_k, x_{k-1})\, bel(x_{k-1}), \qquad (7)$$

$$bel(x_k) = \eta\, p(z_k \mid x_k)\, \overline{bel}(x_k), \qquad (8)$$

where $p(x_k \mid u_k, x_{k-1})$ is the transition probability between successive states given the control input $u_k$, $p(z_k \mid x_k)$ is the measurement model for observation $z_k$, and $\eta$ is a normalization constant.
To handle multiple tasks, separate PDMs are stored for each object and updated using Eqs. 7 and 8. Note that the control input $u_k$ is neglected since the UAVs have no control over the objects' motion. Static objects maintain a constant probability:

$$p(x_k \mid x_{k-1}) = \begin{cases} 1 & \text{if } x_k = x_{k-1}, \\ 0 & \text{otherwise.} \end{cases}$$
The motion of moving objects is treated as a random walk, such that probability mass is transferred to adjacent cells with a fixed probability at each time-step.
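A minimal sketch of the PDM update, with our own function names and a simple 4-connected random-walk prediction step, could look like this (lists of lists stand in for a proper grid type):

```python
def predict_random_walk(pdm, p_stay):
    """Prediction step for a moving object on a 2D grid: each cell keeps
    its mass with probability `p_stay` and splits the rest evenly among
    its in-bounds 4-connected neighbours, so total probability is
    conserved, including at the borders."""
    rows, cols = len(pdm), len(pdm[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out[r][c] += p_stay * pdm[r][c]
            share = (1.0 - p_stay) / len(nbrs) if nbrs else 0.0
            for rr, cc in nbrs:
                out[rr][cc] += share * pdm[r][c]
    return out

def measurement_update(pdm, likelihood):
    """Bayes update: multiply by the per-cell measurement likelihood
    and renormalize (the constant eta in Eq. 8)."""
    post = [[pdm[r][c] * likelihood[r][c] for c in range(len(pdm[0]))]
            for r in range(len(pdm))]
    total = sum(sum(row) for row in post)
    return [[v / total for v in row] for row in post]
```

Observing a cell as empty (likelihood 0 there) pushes the object's probability mass toward unobserved cells, which is exactly what drives the search behaviour.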
Observations are made by detecting the objects in images recorded by a downward-facing camera. For each UAV, state estimation is performed by fusing Global Positioning System (GPS) and visual odometry data. By combining this with attitude information, we determine a detected object's position in a fixed coordinate frame, and compute the measurement likelihood given its color and position.
IV-B3 Reward prediction function
From Section III-B1, reward predictions are determined by calculating the maximum attainable reward for a set of found tasks within a time limit. To do this, we cast Eq. 1 as the well-known knapsack problem and solve it using dynamic programming.
For our application, the set of found tasks contains their costs and rewards as defined in Section IV-B1. We treat the moving objects as static within a certain time frame, such that their probability of moving to cells farther away than the camera's viewing range is zero. If the time since the last observation of a moving object exceeds a threshold, it is considered unknown and must be searched for again. However, moving objects propagate their probability mass in the PDM, making it easier to find them again.
A key concept is that the task cost changes with an agent’s position. If the UAV is close to an object, is relatively short. Moreover, upon a delivery, the UAV starts from the dropping box position. The choice of the first object is thus addressed as a special case, given that the order of the later objects does not change the optimization (the UAV always starts from the dropping box position).
We handle this by using two tables in our dynamic programming method. The first and second tables calculate the maximum reward with the cost considered from the dropping box position and an arbitrary UAV position in the arena, respectively, corresponding to successive and first objects. As new objects are found, they are added to the table and assigned one of three decision labels: (1) pick up now (first), (2) pick up later (successive), or (3) do not pick up.
The first table $T_{\mathrm{box}}$, considering only the cost from the dropping box position, is calculated with the standard knapsack recurrence:

$$T_{\mathrm{box}}[i][t] = \max\big(T_{\mathrm{box}}[i-1][t],\; T_{\mathrm{box}}[i-1][t - c_i^{\mathrm{box}}] + r_i\big),$$

where $c_i^{\mathrm{box}}$ is the cost to pick up task $i$ from the dropping box and $r_i$ is the reward of the task. The second table $T_{\mathrm{uav}}$ additionally allows each task to be chosen as the first pick-up, paid from the current UAV position:

$$T_{\mathrm{uav}}[i][t] = \max\big(T_{\mathrm{box}}[i-1][t - c_i^{\mathrm{uav}}] + r_i,\; T_{\mathrm{uav}}[i-1][t - c_i^{\mathrm{box}}] + r_i,\; T_{\mathrm{uav}}[i-1][t]\big),$$

where $c_i^{\mathrm{uav}}$ is the pick-up cost from the current UAV position. The first, second, and third terms of the maximum correspond to decision labels (1), (2), and (3), respectively.
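As a simplified illustration of this two-position bookkeeping (our own names, and not the exact tables), one can try each task as the "first" pick-up paid from the current UAV position, with all later pick-ups costed from the dropping box:

```python
def knapsack(tasks, budget):
    """Max reward from `tasks` = [(cost, reward), ...] within `budget`."""
    best = [0] * (budget + 1)
    for cost, reward in tasks:
        for t in range(budget, cost - 1, -1):
            best[t] = max(best[t], best[t - cost] + reward)
    return best[budget]

def best_plan_reward(tasks, budget):
    """Tasks are (cost_from_uav, cost_from_box, reward) triples.  Only
    the first pick-up starts from the UAV position; every later pick-up
    starts at the dropping box, so the order of later tasks is free."""
    # Option: pick nothing first, plan everything from the box position.
    best = knapsack([(cb, r) for _, cb, r in tasks], budget)
    # Option: task j is picked up first, from the current UAV position.
    for j, (cu, _, r) in enumerate(tasks):
        if cu <= budget:
            rest = [(cb, rr) for k, (_, cb, rr) in enumerate(tasks) if k != j]
            best = max(best, r + knapsack(rest, budget - cu))
    return best
```

This brute-forces the "first task" choice in O(n) knapsack solves, whereas the paper's two-table formulation folds that choice into a single dynamic program.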
This reward prediction function captures the probability of finding objects during exploration. For example, with little time remaining, searching areas far away from the dropping box offers no rewards since deliveries are impossible.
IV-C Action Definitions
A search action (exploration) involves flying a path in the arena to obtain measurements from the downward-facing camera. To plan paths, we use 2D grid-based planning on the PDM at a constant altitude, where each cell has the dimensions of the camera field of view (FoV) (Fig. 2). For an arbitrary horizon, we enumerate every path executable from the current cell. In addition, the paths starting from the highest-probability cell are considered (four red arrows in Fig. 2). The cost of each action is computed as the travel time assuming constant velocity.
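Enumerating candidate search paths over the grid for a fixed horizon can be sketched as follows (our own names; 4-connected moves, revisits allowed):

```python
def enumerate_paths(start, rows, cols, horizon):
    """All grid paths of length `horizon` starting at `start`, moving
    between 4-connected cells; each path is a candidate search action
    to be scored against the PDM."""
    paths = [[start]]
    for _ in range(horizon):
        nxt = []
        for path in paths:
            r, c = path[-1]
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    nxt.append(path + [(rr, cc)])
        paths = nxt
    return paths
```

The number of paths grows roughly as 4^horizon, which is why short receding horizons (plus the extra paths seeded at the highest-probability cell) are used in practice.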
A task action involves picking up an object, then transferring it to and depositing it in the dropping box. This cost is defined in Eq. 6.
V Experimental Setup
This section outlines the setups used for our simulation experiments. First, we detail the 2D simulator used to validate our framework by comparison to different decision-making methods. Then, we present the 3D simulator developed for testing the system with all mission components.
V-A 2D Simulation
Our decision-making framework is validated in a Python-based 2D simulator. It is assumed that the three UAVs fly at a constant altitude and detect objects within the camera FoV reliably. Table I summarizes our experimental parameters.
Table I: 2D simulation parameters.

| Objects | No. of 1-point static           | 4  |
|         | No. of 2-point static           | 3  |
|         | No. of 3-point static           | 3  |
|         | No. of 3-point moving           | 10 |
|         | Velocity of moving (m/s)        |    |
| UAV     | Camera FoV area (m^2)           |    |
|         | Tracking timeout for moving (s) |    |
We use this setup to evaluate our algorithm against the three different decision-making strategies illustrated in Fig. 3 and described below.
Cover-field-first: All UAVs first cover the field with a "zig-zag" coverage pattern and then pick up the static objects found. The order of pick-ups is chosen based on time efficiency (the cost of an object divided by its reward). When a UAV finds a moving object, it picks it up immediately. After picking up all the static objects, the UAVs explore the field randomly to find moving objects.
Covering-and-pickup: Each UAV is assigned an area of the arena to cover using a "zig-zag" pattern. When a UAV finds an object, it picks it up immediately, and returns to its last position upon drop-off to continue the coverage path.
For each strategy above, and our own, we consider different initial time limits (in seconds) with the parameters in Table I. For each budget, five trials are conducted with randomly initialized object positions.
V-B 3D Simulation
We use RotorS, a Gazebo-based simulation environment, to demonstrate our system running in a mission scenario. Our UAVs are three AscTec Firefly platforms modelled with realistic dynamics and on-board sensors.
Our decision-making unit is implemented in an integrated system containing all other functions required for the mission, including attitude control, localization, object detection, state machine, visual servoing for pickup, collision avoidance, etc.  This module receives the objects detected in images and estimated UAV states as inputs, and outputs actions for waypoint-based exploration or target pickup.
V-C System Overview
Each UAV in our system has its own PDM and shares decision and measurement information with the others. The UAVs update their PDMs individually, but using the same information, by relaying object detections to each other. The decision-making proceeds as shown in Fig. 4. The shared decisions are used for implicit coordination. In addition, the UAVs can detect the availability of other UAVs and plan based on this information. This enables the agents to adapt if some of them crash.
VI Experimental Results
This section presents our experimental results. We evaluate our framework against benchmarks in 2D simulation. Then, we show examples of different decision-making behaviour to highlight the benefits of our approach in a typical mission.
VI-A Comparison Against Benchmarks
Fig. 5 depicts the evaluation results with varying initial time limits, as described in Section V-A. The vertical axis measures the average competition points attained over the five trials. As expected, the cover-field-first method (yellow) obtains no points with the shortest limits because it lacks time to start collecting objects. This confirms that a decision-making strategy is needed to balance search and action when an exhaustive search is impossible. Our approach (red) performs best with shorter limits. Unlike the benchmarks, our algorithm uses reward predictions to distinguish between object types given the UAV positions, allowing them to decide between search and pickup within the time limit, e.g., to seek a more valuable object than one already found. We show this with examples in Section VI-B.
With longer limits, the importance of decision-making decreases as there is more time for deliveries. The covering-and-pickup strategy (green) scores highly since its "zig-zag" path ensures complete field coverage. Our method performs competitively even without this guarantee, as it keeps track of areas already explored. The cover-first method is worse in comparison as the UAVs search for highly-valued moving objects only later in the mission. The performance of the random strategy (blue) also deteriorates due to its low probability of finding objects away from the dropping box.
VI-B Decision-making Examples in a 3D Simulation
In the following, we present examples of decision-making using our framework in the RotorS environment (Section V-B). The aim is to show that our method can account for different situational aspects when making trade-offs, e.g., picking up moving objects before they are lost, choosing between search and action based on the time remaining.
Pick up a static object instead of a moving one. A UAV found both a new static and a new moving object (Fig. 7). With enough time, the moving object would be targeted. However, in a situation with little time remaining, the static object is preferred because its pick-up time is shorter. This decision considers both the time constraints and the object type, and so cannot be reproduced with simple rule-based logic, e.g., "pick up a moving object as soon as it is found".
VII Conclusion

This work introduced a decentralized multi-agent decision-making framework for the search and action problem in a time-constrained setting. Our algorithm uses probabilistic reasoning to make decisions leading to the highest predicted rewards over a given time limit. The near-optimality of the output policies is guaranteed by exploiting the properties of the optimization function used for implicit coordination.
Our framework was applied in the MBZIRC search, pick, and place scenario by specifying a PDM and reward prediction function for action selection. We showed its advantages over alternative decision-making strategies in terms of mission performance with varying time limits. Experiments in a 3D environment demonstrated real-time system integration with examples of informed decision-making.
Future research will address adapting our algorithm to further search-and-action scenarios where a time cost and reward can be defined for all tasks and exploratory actions, e.g., search-and-rescue. Here, the time limit can reflect a limited resource, such as battery level or energy consumption. Another interesting avenue is the use of methods other than Bayesian filtering on a PDM to estimate the probability of finding new tasks, e.g., for unknown field sizes and probability distributions. Possible extensions involve allowing for flight at variable altitudes, alternative sensor types, and unknown environments/tasks.
Acknowledgment

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 644227 and from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 15.0029, and was partially sponsored by the Mohamed Bin Zayed International Robotics Challenge.
Appendix

In the following, we sketch the proof of the aforementioned properties (positive monotonicity and submodularity) of the optimization function specified in Eq. 5.
-A Positive Monotonicity
If the function $F$ satisfies:

$$F(A) \le F(B), \quad \forall A \subseteq B,$$

it is called non-decreasing.

Proof. Let $\mathcal{T}_A$ and $\mathcal{T}_B$ denote the sets of tasks found along the action sequences $A$ and $B$. Then $\mathcal{T}_A \subseteq \mathcal{T}_B$, because $B$ contains all actions in $A$. From the definition of $R$, it is clear that with a larger set of tasks, the maximum attainable reward cannot decrease. As $F$ satisfies this condition, it is proven to be non-decreasing.
-B Submodularity

If the function $F$ satisfies:

$$F(A \cup C) - F(A) \ge F(B \cup C) - F(B), \quad \forall A \subseteq B,$$

it is called submodular.

This means that when a new set of actions $C$ is added, the increase of $F$ is smaller for the larger set $B$ than for the smaller set $A$.
Proof. Since $\mathcal{T}_A \subseteq \mathcal{T}_B$, there are three cases, distinguished by whether the total cost of the tasks exceeds the time limit. Let $\tau$ denote a new task found along the added actions $C$, with cost $c_\tau$ and reward $r_\tau$, and let $c(\cdot)$ denote the total cost of a task set.

Case 1. $c(\mathcal{T}_A \cup \{\tau\}) \le T$ and $c(\mathcal{T}_B \cup \{\tau\}) \le T$.
In this case, all tasks fit within the time limit, so the reward increase is the same as the reward of task $\tau$ for both sets. As a result,

$$F(A \cup C) - F(A) = F(B \cup C) - F(B) = r_\tau.$$

Case 2. $c(\mathcal{T}_A \cup \{\tau\}) \le T$ and $c(\mathcal{T}_B \cup \{\tau\}) > T$.
In this case, $F(B \cup C) - F(B) \le r_\tau$, because the tasks in $\mathcal{T}_B \cup \{\tau\}$ must be selected to satisfy the budget constraint, which can only reduce the increase of $F$.

Case 3. $c(\mathcal{T}_A \cup \{\tau\}) > T$ and $c(\mathcal{T}_B \cup \{\tau\}) > T$.
This means that the tasks are already selected from $\mathcal{T}_A$ and $\mathcal{T}_B$ to match the constraint. Now, consider adding the new task $\tau$. In this case, a reward increase occurs by replacing selected tasks with $\tau$. With the larger set of tasks $\mathcal{T}_B$, the initially selected tasks are obtained by replacing tasks from the selection over $\mathcal{T}_A$. If tasks are not replaced in this operation, the reward increase is the same. If the tasks are already replaced, the reward increase is smaller for $\mathcal{T}_B$ because they have already been replaced by more valuable tasks. Then,

$$F(A \cup C) - F(A) \ge F(B \cup C) - F(B).$$

As $F$ satisfies the inequality in all three cases, it is proven to be submodular.
References

- P. Rudol and P. Doherty, "Human body detection and geolocalization for UAV search and rescue missions using color and thermal imagery," in IEEE Aerospace Conference, 2008, pp. 1–8.
- R. R. Murphy, S. Tadokoro, D. Nardi, A. Jacoff, P. Fiorini, H. Choset, and A. M. Erkmen, "Search and rescue robotics," in Springer Handbook of Robotics. Springer, 2008, pp. 1151–1173.
- D. Kingston, S. Rasmussen, and L. Humphrey, "Automated UAV tasks for search and surveillance," in IEEE Conference on Control Applications (CCA), 2016, pp. 1–8.
- Y. Jin, A. A. Minai, and M. M. Polycarpou, "Cooperative real-time search and task allocation in UAV teams," in 42nd IEEE Conference on Decision and Control, vol. 1, 2003, pp. 7–12.
- I. Maza and A. Ollero, "Multiple UAV cooperative searching operation using polygon area decomposition and efficient coverage algorithms," in Distributed Autonomous Robotic Systems 6. Springer, 2007, pp. 221–230.
- M. Popović, T. Vidal-Calleja, G. Hitz, I. Sa, R. Y. Siegwart, and J. Nieto, "Multiresolution mapping and informative path planning for UAV-based terrain monitoring," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
- R. Bähnemann, M. Pantic, M. Popović, D. Schindler, M. Tranzatto, M. Kamel, M. Grimm, J. Widauer, R. Siegwart, and J. Nieto, "The ETH-MAV team in the MBZ international robotics challenge," arXiv preprint arXiv:1710.08275, 2017.
- G. Hollinger and S. Singh, "Proofs and experiments in scalable, near-optimal search by multiple robots," in Proceedings of Robotics: Science and Systems, 2008.
- E. Galceran and M. Carreras, "A survey on coverage path planning for robotics," Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1258–1276, 2013.
- V. Isler, G. A. Hollinger, and T. H. Chung, "Search and pursuit-evasion in mobile robotics, a survey," 2011.
- R. Nowakowski and P. Winkler, "Vertex-to-vertex pursuit in a graph," Discrete Mathematics, vol. 43, no. 2–3, pp. 235–239, 1983.
- M. Aigner and M. Fromme, "A game of cops and robbers," Discrete Applied Mathematics, vol. 8, no. 1, pp. 1–12, 1984.
- M. Adler, H. Räcke, N. Sivadasan, C. Sohler, and B. Vöcking, "Randomized pursuit-evasion in graphs," Combinatorics, Probability and Computing, vol. 12, no. 3, pp. 225–244, 2003.
- E.-M. Wong, F. Bourgault, and T. Furukawa, "Multi-vehicle Bayesian search for multiple lost targets," in IEEE International Conference on Robotics and Automation (ICRA), 2005, pp. 3169–3174.
- T. Smith, "Probabilistic planning for robotic exploration," Ph.D. dissertation, Carnegie Mellon University, 2007.
- H. Kurniawati, D. Hsu, and W. S. Lee, "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces," in Robotics: Science and Systems, 2008.
- J. V. Messias, M. T. Spaan, and P. U. Lima, "Asynchronous execution in multiagent POMDPs: Reasoning over partially-observable events," in AAMAS, 2013, pp. 9–14.
- G. Hollinger, D. Ferguson, S. Srinivasa, and S. Singh, "Combining search and action for mobile robots," in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 952–957.
- A. Krause and D. Golovin, "Submodular function maximization," in Tractability: Practical Approaches to Hard Problems, 2012.
- S. Thrun, "Probabilistic robotics," Communications of the ACM, vol. 45, no. 3, pp. 52–57, 2002.
- S. Martello, D. Pisinger, and P. Toth, "Dynamic programming and strong bounds for the 0-1 knapsack problem," Management Science, vol. 45, no. 3, pp. 414–424, 1999.
- F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, Robot Operating System (ROS): The Complete Reference (Volume 1). Cham: Springer International Publishing, 2016, ch. RotorS: A Modular Gazebo MAV Simulator Framework, pp. 595–625.
- R. Bähnemann, D. Schindler, M. Kamel, R. Siegwart, and J. Nieto, "A decentralized multi-agent unmanned aerial system to search, pick up, and relocate objects," arXiv preprint arXiv:1707.03734, 2017.