Efficient Multi-robot Exploration via Multi-head Attention-based Cooperation Strategy

by   Shuqi Liu, et al.

The goal of coordinated multi-robot exploration tasks is to employ a team of autonomous robots to explore an unknown environment as quickly as possible. Compared with human-designed methods, which began with heuristic and rule-based approaches, learning-based methods enable individual robots to learn sophisticated and hard-to-design cooperation strategies through deep reinforcement learning technologies. However, in decentralized multi-robot exploration tasks, learning-based algorithms are still far from being universally applicable to the continuous space due to the difficulties associated with area calculation and reward function designing; moreover, existing learning-based methods encounter problems when attempting to balance the historical trajectory issue and target area conflict problem. Furthermore, the scalability of these methods to a large number of agents is poor because of the exponential explosion problem of state space. Accordingly, this paper proposes a novel approach - Multi-head Attention-based Multi-robot Exploration in Continuous Space (MAMECS) - aimed at reducing the state space and automatically learning the cooperation strategies required for decentralized multi-robot exploration tasks in continuous space. Computational geometry knowledge is applied to describe the environment in continuous space and to design an improved reward function to ensure a superior exploration rate. Moreover, the multi-head attention mechanism employed helps to solve the historical trajectory issue in the decentralized multi-robot exploration task, as well as to reduce the quadratic increase of action space.




1 Introduction

The problems associated with exploring an unknown environment using a team of robots are among the fundamental problems in mobile robotics. These problems arise in a wide range of applications, including disaster rescue, planetary exploration, reconnaissance and surveillance sheng2006distributed ; burgard2005coordinated . The key question during exploration is that of how to figure out each agent’s next move so that the overall mission time is minimized and the exploration rate is maximized. We here focus on a sub-problem of robotic exploration, namely that of decentralized multi-robot exploration tasks, where robots make their own decisions without a centralized controller. In order to devise cooperation strategies, robots need to broadcast their local observations and historical trajectories by communicating with each other, which allows more information about the environment to be acquired.

To this end, several multi-agent exploration approaches have been developed. The original approaches, such as the frontier-based approach yamauchi1998frontier ; faigl2015benchmarking ; sharma2016frontier and the cost-utility approach burgard2000collaborative ; colares2016next , were designed by experts based on cooperation strategies including explicit communication and action rules. However, many real-world applications have proven too complex to be dealt with efficiently by human-designed strategies. Moreover, these approaches also find it difficult to cope with the “historical trajectory” issue. Most “pre-designed” methods assume that the current robot only communicates with nearby robots; however, more distant robots that have explored the surrounding areas also need to be involved in the communications network to avoid repeated exploration.

Recent work in this area has attempted to combine the strengths of deep learning techniques with the control policies for robotics applications pinto2016supersizing ; chen2014door ; shvets2018automatic . In particular, deep reinforcement learning (DRL) methods allow multiple agents to autonomously learn the required cooperation strategy kretzschmar2016socially ; gu2017deep ; kahn2018self . Therefore, these learning-based approaches can resolve the difficulties associated with developing precise and complicated control strategies for each move, and thus achieve more flexible and effective performance in complex scenarios.

Despite this progress, however, algorithms for multi-agent exploration are still far from being universal (to the continuous space) and scalable (to a larger number of agents). Various previous works yamauchi1998frontier ; carrillo2015autonomous ; mox2018information have modeled the exploration environment as a discrete space in which agents’ actions are restricted to their surrounding grids. When extending the task into continuous space, however, it is hard to design an accurate reward function based on the historical trajectories, meaning that some areas may be ignored or repeatedly explored. In addition, the maximum number of agents is limited in previous works of this kind, as the action space increases exponentially with the number of agents. Although recent single-head attention-based methods have shown great potential in multi-agent cooperation tasks by focusing only on the relevant agent, they are still simple and limited compared with multi-head attention mechanisms, as each attention head used in the multi-head methods can focus on a different weighted mixture of agents (e.g. locations, historical trajectories, etc.).

Accordingly, our proposed approach, Multi-head Attention-based Multi-robot Exploration in Continuous Space (MAMECS), extends these prior works in several directions. We model the environment as a continuous space in which agents can move to an arbitrary point at every step. Computational geometry knowledge is applied to describe the environment and design an improved reward function. Inspired by team behavior in real-world applications, where each team member tends to focus only on the teammates with which it has a cooperative or competitive relationship, we learn the multi-agent cooperation strategy through a multi-head attention-based critic. Each agent is therefore aware of which other agents it should pay attention to, rather than simply considering all agents at every time step. Moreover, the quadratic increase in the action space is sharply reduced by the attention mechanism, meaning that the number of agents involved can be increased.

We have validated our approach MAMECS on the typical multi-robot exploration task. Extensive experiments have shown that MAMECS can perfectly fit the continuous space, effectively extend the total number of agents and improve exploration performance compared with previous works. The rest of this paper is organized as follows. In section 2, we discuss related work, followed by a detailed description of our approach in section 3. We report experimental studies in section 4 and conclusion in section 5.

2 Related Work

Our approach MAMECS aims to solve the multi-robot exploration task through multi-head attention-based reinforcement learning in continuous space. Therefore, we mainly focus on two research fields: traditional “human-designed” methods for multi-robot exploration and the “learning-based” multi-robot exploration approaches.

2.1 “Human-designed” Methods for Multi-robot Exploration

Multi-robot exploration is a fundamental robotic problem, which employs a team of autonomous robots to explore an unknown environment with obstacles. Most early works started with heuristic and rule-based approaches. Yamauchi’s Frontier Based Exploration Using Mobile Robots yamauchi1998frontier is a foundational paper used by many successful approaches. In this approach, each robot makes the assignment that maximizes the joint utility to the frontiers and navigates to the nearest unvisited frontier.

Market-based approaches zlot2002multi employ the concept of frontier cells and utility in a market environment to produce complex coordinated strategies in multi-robot exploration. Spanning tree coverage approaches gabriely2001spanning adapt the single-robot complete coverage algorithm to the multi-robot scenario: each robot is assigned a part of the constructed spanning tree and covers its section in a counterclockwise fashion. Recent approaches andre2016collaboration ; corah2017efficient ; corah2019communication focus more on the mutual information for ranging sensors and attempt to maximize mutual information directly. The above methods are all based on precisely designed rules, which have to take all details and situations into account. These “pre-designed” cooperation methods therefore perform poorly, especially in partially observable tasks, because it is extremely hard for humans to design effective strategies based only on a local view of the whole environment.

2.2 Learning-based Methods in Multi-robot Exploration

Deep reinforcement learning has been proven effective for enabling sophisticated and hard-to-design behaviors of individual robots kretzschmar2016socially ; kahn2018self . For the multi-robot exploration task, geng2018learning proposes a learning-based method that enables the robots to actively learn the cooperation strategies as well as the action policies. Their method is robust enough to handle complex and dynamic environments and outperforms several “human-designed” methods. The communication model used in geng2018learning is CommNet sukhbaatar2016learning , which simply averages the communication messages to realize coordination. geng2019learning ; liu2019learning improve the communication process by introducing the attention mechanism, which can precisely determine whether communication is necessary for each pair of agents in the exploration scenario. The attention mechanism enables the agents to communicate only with the necessary partners and further improves the cooperation performance. However, the above methods simply model the environment by occupancy grids, which are discrete and make it easy to represent information such as historical trajectories.

In this paper, we focus on the multi-robot exploration task in continuous space, which is extremely difficult because of the challenges of reward function design. In addition, we exploit the multi-head attention mechanism, in which each head can focus on a different weighted mixture of agents (e.g., their locations and historical trajectories). Furthermore, our method is more flexible than the existing learning-based methods: it can further increase the number of agents within the limited action space and is closer to reality.

3 Our Approach

The exploration rate and scalability of existing multi-robot exploration methods fall short of the requirements of realistic applications. It is therefore worthwhile to improve the exploration rate and to increase the number of agents involved. In this section, we introduce our MAMECS method from the following aspects: the basic framework, learning the shared attentive critic, continuous environment modeling, the design of the entropy-oriented reward function and the exploration rate-based training approach.

3.1 Problem Formulation

We consider the application scenario of multi-robot exploration as a partially observable distributed environment. We assume that each agent can obtain the accurate positions of the other agents and of the obstacles within its visual range. Each agent $i$ learns a policy $\pi_i$ that maps its observation $o_i$ to a distribution over its actions $a_i$. While learning its individual policy, an agent has to attend selectively to the observations of other agents, so that the number of agents involved can be extended. Each agent should therefore weigh the different contributions of other agents to its decision-making process, rather than considering all of them all the time. Because agents cannot weight other agents’ observations on their own, they must learn to decide the importance of shared information and to calculate an integrated contribution.

3.2 Framework

More formally, a multi-head attention mechanism is introduced to centrally learn a critic that enables each agent to select which agents to attend to at each time step. The shared critic receives the observations, actions and historical trajectories from all agents and generates Q-values for each agent; the contribution of other agents’ information is evaluated by multiple attention heads through attention weights. During training, all critics are updated together by minimizing a joint loss function. The main architecture of MAMECS is shown in Fig. 1.

Figure 1: The architecture of MAMECS for agents. The multi-head attention determines the attention weight between each agent based on the inputs of agents’ observations, actions and historical trajectories.

The goal of our method is to selectively pay attention to other agents’ information in each agent’s decision-making process. In detail, the encoder takes each agent’s observation, action and historical trajectories as input, and the encoded information is fed into the multiple attention heads to generate the integrated message. The encoded information is evenly separated into as many parts as there are attention heads. Each attention head employs a separate set of parameters to weight the different contributions from other agents, and the differently weighted parts are concatenated into a single vector. Each agent takes the relevant information of the concatenated message and its local information into account when estimating its value function.

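The shared critic receives the observations $o = (o_1, \dots, o_N)$, actions $a = (a_1, \dots, a_N)$ and historical trajectories $\tau = (\tau_1, \dots, \tau_N)$ from all agents indexed by $i \in \{1, \dots, N\}$. The Q-value function for agent $i$ can be calculated as

$$Q_i^{\psi}(o, a, \tau) = f_i\big(g_i(o_i, a_i, \tau_i),\, x_i\big).$$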

Here, $f_i$ represents the Q-network, a two-layer multi-layer perceptron (MLP), and $g_i$ is the encoder, a one-layer MLP embedding function. $x_i$ represents the contribution from all other agents, which is the concatenated vector from the attention heads. For attention head $m$, the corresponding part $x_i^m$ of $x_i$ is the sum of the other agents’ embedded information $e_j = g_j(o_j, a_j, \tau_j)$, weighted by the attention weights $\alpha_{ij}^m$. To be concrete, agent $j$’s embedded information $e_j$ is transformed by a matrix $V^m$ into a value:

$$x_i^m = \sum_{j \neq i} \alpha_{ij}^m \, V^m e_j.$$

To evaluate the attention weight $\alpha_{ij}^m$ of agent $j$ with respect to agent $i$, a bilinear mapping is used to project the embedded features into a query-key system: $W_q^m$ transforms agent $i$’s embedded information $e_i$ into a “query” and $W_k^m$ transforms agent $j$’s embedded information $e_j$ into a “key”. We then apply a softmax over the similarity values between these two embeddings:

$$\alpha_{ij}^m \propto \exp\!\big(e_j^{\top} (W_k^m)^{\top} W_q^m e_i\big).$$

Each attention head uses a separate set of parameters to process the embedded information and calculates the contribution from all other agents to the current agent. The aggregated messages from the attention heads are then simply concatenated into a single vector.
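The query-key-value aggregation described in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors’ implementation: the function name `multi_head_contribution`, the random parameter matrices, and the scaling of the scores by the square root of the head dimension are all assumptions, and any nonlinearity applied to the values is omitted. The sizes (hidden dimension 128, 4 heads, 4 agents) follow the experimental settings reported later in the paper.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def multi_head_contribution(embeddings, i, W_q, W_k, V):
    """Aggregate the other agents' embeddings for agent i.

    embeddings: (N, d) array, one embedded vector per agent.
    W_q, W_k, V: (M, d_head, d) arrays, one parameter set per head.
    Returns the concatenated message (M * d_head,) and the
    attention weights (M, N - 1) over the other agents.
    """
    n_agents = embeddings.shape[0]
    others = [j for j in range(n_agents) if j != i]
    parts, weights_per_head = [], []
    for m in range(W_q.shape[0]):
        query = W_q[m] @ embeddings[i]          # (d_head,)
        keys = embeddings[others] @ W_k[m].T    # (N-1, d_head)
        values = embeddings[others] @ V[m].T    # (N-1, d_head)
        # Scaling by sqrt(d_head) is an assumption of this sketch.
        scores = keys @ query / np.sqrt(query.size)
        weights = softmax(scores)
        parts.append(weights @ values)          # weighted sum of values
        weights_per_head.append(weights)
    return np.concatenate(parts), np.stack(weights_per_head)


# Hypothetical toy data: 4 agents, hidden size 128, 4 heads.
rng = np.random.default_rng(0)
n_agents, d, n_heads = 4, 128, 4
d_head = d // n_heads
e = rng.normal(size=(n_agents, d))
W_q = 0.1 * rng.normal(size=(n_heads, d_head, d))
W_k = 0.1 * rng.normal(size=(n_heads, d_head, d))
V = 0.1 * rng.normal(size=(n_heads, d_head, d))
x_0, alpha_0 = multi_head_contribution(e, 0, W_q, W_k, V)
```

Each row of `alpha_0` is one head’s distribution over the three other agents, and `x_0` is the concatenated message that would be passed to agent 0’s Q-network together with its own embedding.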

3.3 Learning the Shared Attentive Critic

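Because the critic parameters are shared across all agents, all critics can be updated together by minimizing a joint regression loss function:

$$\mathcal{L}_Q(\psi) = \sum_{i=1}^{N} \mathbb{E}\Big[\big(Q_i^{\psi}(o, a, \tau) - y_i\big)^2\Big],$$

$$y_i = r_i + \gamma\, \mathbb{E}_{a' \sim \pi_{\bar{\theta}}(o')}\Big[Q_i^{\bar{\psi}}(o', a', \tau') - \alpha \log \pi_{\bar{\theta}_i}(a_i' \mid o_i')\Big],$$

where $Q_i^{\psi}$ is the estimated action-value for agent $i$, while $y_i$ is the ground-truth (target) value, computed from agent $i$’s reward $r_i$ and the discount factor $\gamma$. $\bar{\psi}$ and $\bar{\theta}$ are respectively the parameters of the target critics and target policies, and $\alpha$ is used to balance the entropy and rewards. Each agent’s policy is updated with the following gradient:

$$\nabla_{\theta_i} J(\pi_\theta) = \mathbb{E}\Big[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\,\big(-\alpha \log \pi_{\theta_i}(a_i \mid o_i) + Q_i^{\psi}(o, a, \tau) - b(o, a_{\setminus i})\big)\Big]$$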

We denote the set of all agents except $i$ by $\setminus i$. The multi-agent baseline $b(o, a_{\setminus i})$ is the average action value of all agents.


The baseline helps each agent judge its own contribution to the team in a cooperative scenario. By comparing an agent’s Q-value with the average action value, the specific contribution of that agent to the reward can be identified. Full training details and hyperparameters can be found in Section 4.3.

3.4 Dynamic Environment Modeling

Rather than modeling the environment using occupancy grids, we model the two-dimensional world in continuous space. Consequently, a robot can move to arbitrary positions on the map rather than only to positions on the grid cells surrounding it, which represents a more flexible and practical approach to the multi-robot exploration task. The basic idea behind our approach is to represent each agent as the center point of a circle, so that the agent can explore the area within the radius of this circle at each time step.

We first make some definitions: $o_i^t$ stands for the observation of robot $i$ at time step $t$; $a_i^t$ stands for the output action given the corresponding inputs; $x_i^t$ is the contribution from other agents, a sum of the embedded information of other agents weighted by the attention weights; $\tau_i^t$ stands for the coordinates of the agent’s historical trajectory; and $\pi_i$ stands for the policy that chooses controls based on the past observations, trajectories and contributions from other agents.

We assume that there is an underlying map $m$ primarily unknown to the agents. To be concrete, $m$ is a dynamic coordinate set which records the position of robot $i$ at time step $t$. Each robot wishes to infer its belief map over $m$ at time $t$, given all its previous observations, trajectories and other robots’ weighted contributions leading up to that time step. To simplify the problem, we assume that the individual maps of the agents, indexed as $m_i$, are independent.


In information gain approaches, the goal of exploration is twofold: not just to map the environment, but also to move the robot so as to maximize the amount of new information gathered about the environment. We apply an information gain method to measure the environment uncertainty in a probability distribution $p(m)$ by the entropy

$$H(p(m)) = -\sum_{m} p(m) \log p(m).$$

This is a measure of the uncertainty associated with the constructed belief map. As $p(m)$ becomes more peaked, $H$ decreases, and it reaches zero when the outcome of a random trial is certain.
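The behavior of the entropy described above (smaller for peaked distributions, zero for a certain outcome) can be checked numerically. The following is a minimal sketch for a discrete distribution; the function name `shannon_entropy` and the example distributions are hypothetical and only serve as an illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    if any(p < 0.0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (limit of p * log p is 0).
    return sum(-p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty over 4 outcomes
peaked = [0.85, 0.05, 0.05, 0.05]    # belief concentrated on one outcome
certain = [1.0, 0.0, 0.0, 0.0]       # outcome known with certainty

h_uniform = shannon_entropy(uniform)
h_peaked = shannon_entropy(peaked)
h_certain = shannon_entropy(certain)
```

The information gain of an action is then the difference between the entropy before and after the belief update.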

3.5 Entropy-oriented Reward Function

We now describe our reward function, which encourages the agents to explore as much of the unknown dynamic environment as possible in the shortest time. In the learning process, there is a central node that records the trajectory of each agent and gives the corresponding reward based on its performance. At time step $t$, agent $i$ obtains its own observation and the contribution from other agents. The agent is likely to execute the action with the highest reward, and it updates its belief map based on the obtained inputs and its historical trajectory.

To describe the reward function accurately, we first state our expectations of the agents in the exploration task. Each agent is expected to avoid collisions with other agents and obstacles in the environment, avoid exploring the same area repeatedly, explore the map in as few time steps as possible, and reduce the uncertainty of the whole map as soon as possible. In other words, behavior we encourage is rewarded positively, while behavior we wish the agents to avoid is rewarded negatively. At each time step, each agent therefore seeks a policy that reaches these goals.

The reward is the combination of three aspects: the reduction in the environment’s entropy, collision information, and repeated area coverage information. The first term is the information gain after agent $i$ takes an action, defined as the decrease in entropy. In the context of robotic exploration, we measure the information gain as the difference in map entropy between consecutive time steps; it is the value that we wish to maximize by selecting new poses. The second term accounts for collisions with other agents and obstacles and depends on the number of collisions. Two agents collide if the circles centered on their coordinates coincide. A collision incurs a negative reward.

A dedicated term captures the repeated area coverage information; it is a piecewise function of the agents’ intersection area. We assume that the environment is a square, while each agent is represented as a circle. For the problem of fully covering a square with a minimum number of circles of a given radius, there is no known way to find optimal solutions. In our case, however, agents are not required to fully cover the environment but are expected to achieve the maximum coverage ratio. We therefore propose a theorem on the coverage ratio and design a calculation method to achieve this goal (shown in Appendix A).
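The intersection area on which this term depends has a closed form when both agents are circles of the same radius. The sketch below uses the standard lens-area formula for two equal circles; the function name `circle_intersection_area` and the example coordinates are assumptions made for illustration, and the paper’s own calculation method is the one given in its Appendix A.

```python
import math


def circle_intersection_area(c1, c2, r):
    """Area of the overlap of two circles of equal radius r.

    c1, c2: (x, y) centers. Uses the lens-area formula
    2 r^2 acos(d / 2r) - (d / 2) sqrt(4 r^2 - d^2),
    where d is the distance between the centers.
    """
    d = math.dist(c1, c2)
    if d >= 2.0 * r:
        return 0.0  # tangent or disjoint circles do not overlap
    chord_term = math.sqrt(max(0.0, 4.0 * r * r - d * d))
    return 2.0 * r * r * math.acos(d / (2.0 * r)) - 0.5 * d * chord_term


# Hypothetical examples with unit radius.
area_tangent = circle_intersection_area((0.0, 0.0), (2.0, 0.0), 1.0)
area_partial = circle_intersection_area((0.0, 0.0), (1.0, 0.0), 1.0)
area_coincident = circle_intersection_area((0.0, 0.0), (0.0, 0.0), 1.0)
```

The area grows monotonically from zero (tangent circles) to the full circle area (coincident circles), which is the range over which the piecewise reward term is defined.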

We have proved that there is a better arrangement of circles that achieves a higher coverage ratio. As a result, the piecewise function reaches its maximum when the agents’ intersection area equals the intersection area of two circles in the second circumstance, which has the higher coverage ratio (shown in Fig. A10 in Appendix A). The function is designed to be continuous, so agents receive reward signals throughout the whole exploration process. The sparse reward problem is thus avoided, and agents are able to learn the exploration task better.



Figure 2: The piecewise function that calculates the repeated area coverage information based on the agents’ intersection area .

When the circles are tangent to each other, the coverage ratio may not be maximal even though the agent explores new areas in every time step, so the reward is not at its maximum. As the intersection area of the two circles increases, the reward increases and reaches its maximum at the intersection area of the second circumstance. It then falls in the form of a quadratic function and reaches its minimum when the two circles coincide completely.

3.6 Exploration Rate-based Training Approach

To simulate dynamic obstacles in a practical environment and to enhance the robustness of agents to dynamic settings, we gradually add random obstacles to the original environment at a fixed interval of time-steps. Each agent is thus expected to learn a strategy that avoids both static and dynamic obstacles. To help the model find a better local optimum and to accelerate training, curriculum learning is applied to the training process by gradually increasing task difficulty. In detail, the interval decreases over the course of training so that random obstacles are added more frequently, which means that the difficulty of the mission increases. However, a second quantity is held constant, which means that the number of random obstacles added each time decreases as they are added more frequently. This setting is essential because a crucial component of the success criterion is fixed for the given number of time-steps.

Each simulation is terminated after a specified number of time-steps and classified as a failure if collisions with obstacles have occurred or if the exploration rate is less than a given threshold. The exploration rate is calculated from three quantities: the explored area of the map, which is the union area of the agents’ trajectories over all time steps; the set of final obstacles, including the static and random obstacles in the map; and the total area of the environment, which is modeled as a square in this scenario. Since each agent and obstacle is represented as a circle with a certain radius, and the positions of newly generated circles may overlap with the area that has already been explored by the multi-agent system, we take the union operation to calculate the union area of these circles.

We use the adaptive Simpson algorithm, a classic computational geometry method, to calculate the union area of the circles. We first examine the relative positions of the circles to reduce the amount of calculation. If the centers of two circles coincide, only the area of one circle is retained; if the distance from the center of a circle to every other center exceeds the sum of the radii, we add the area of the complete circle to the total area. After this screening, we use the adaptive Simpson algorithm to calculate the area corresponding to each arc. We first randomly segment an arc; for each interval $(l, r)$, we recursively calculate the values of $f$ at the endpoints $l$ and $r$ and at the intermediate point $(l + r)/2$. $f(x)$ is taken as the total length of the intersection between the vertical line at $x$ and all the circles, so the area over the interval $(l, r)$ is approximately:

$$S(l, r) \approx \frac{r - l}{6}\left[f(l) + 4 f\!\left(\frac{l + r}{2}\right) + f(r)\right].$$
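The union-area computation by adaptive Simpson integration can be sketched as follows. This is a simplified illustration rather than the authors’ code: it skips the pre-screening of coincident and isolated circles, splits the integration range into equal pieces instead of random segments, and all function names, the tolerance `eps`, the number of `pieces` and the recursion cap `max_depth` are assumptions.

```python
import math


def chord_union_length(x, circles):
    """Total length covered on the vertical line at x by all circles."""
    intervals = []
    for cx, cy, r in circles:
        dx = abs(x - cx)
        if dx < r:
            half = math.sqrt(r * r - dx * dx)
            intervals.append((cy - half, cy + half))
    intervals.sort()
    total, cur_lo, cur_hi = 0.0, None, None
    for lo, hi in intervals:
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                total += cur_hi - cur_lo   # close the previous merged interval
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)       # overlapping chords are merged
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


def simpson(f_l, f_m, f_r, l, r):
    """Simpson's rule on (l, r) from the three sampled values."""
    return (r - l) / 6.0 * (f_l + 4.0 * f_m + f_r)


def adaptive_simpson(f, l, r, eps, whole, f_l, f_m, f_r, depth):
    """Recursively refine until the two-half estimate agrees with `whole`."""
    m = 0.5 * (l + r)
    f_lm, f_mr = f(0.5 * (l + m)), f(0.5 * (m + r))
    left = simpson(f_l, f_lm, f_m, l, m)
    right = simpson(f_m, f_mr, f_r, m, r)
    delta = left + right - whole
    if depth <= 0 or abs(delta) <= 15.0 * eps:
        return left + right + delta / 15.0
    return (adaptive_simpson(f, l, m, eps / 2.0, left, f_l, f_lm, f_m, depth - 1)
            + adaptive_simpson(f, m, r, eps / 2.0, right, f_m, f_mr, f_r, depth - 1))


def union_area(circles, eps=1e-6, pieces=64, max_depth=25):
    """Approximate the area of the union of circles given as (cx, cy, r)."""
    if not circles:
        return 0.0
    lo = min(cx - r for cx, _, r in circles)
    hi = max(cx + r for cx, _, r in circles)
    step = (hi - lo) / pieces

    def f(x):
        return chord_union_length(x, circles)

    area = 0.0
    for k in range(pieces):
        l, r = lo + k * step, lo + (k + 1) * step
        f_l, f_m, f_r = f(l), f(0.5 * (l + r)), f(r)
        whole = simpson(f_l, f_m, f_r, l, r)
        area += adaptive_simpson(f, l, r, eps / pieces, whole, f_l, f_m, f_r, max_depth)
    return area


# Hypothetical check: two unit circles whose centers are one radius apart.
two_circle_area = union_area([(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
```

Pre-splitting the range into several pieces reduces the risk that a coarse first estimate misses a narrow circle entirely, which is a known weakness of adaptive Simpson integration on functions that are zero over large stretches.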


4 Experiments

In this section, we first introduce our experimental settings and the storage of locations. Then, we show the training performance compared with the baseline methods. Finally, we present the attention visualization and the corresponding analysis.

4.1 Experiment Set Up

We use the MPE (Multi-agent Particle Environment) framework to construct an environment for testing various capabilities of our approach (MAMECS) and the baselines. The square map represents an artificial environment with various obstacles; it is large enough to provide the amount of exploration needed to test our method, but not so large as to cause inadequate exploration. The experimental environment has a continuous action space, so an agent can move anywhere on the map as determined by its velocity and acceleration parameters. Each agent can sense the environment information within a fixed exploration radius and has a communication range covering the whole environment. The goal of the whole system is to explore as much of the map as possible in a fixed time.

To be concrete, four agents enter the environment through four arrival points, and the positions traveled by each agent form a trajectory, represented as red circles with the same radius as the agent. As for the obstacles, there are 4 original blocks in the initial environment, and new blocks are introduced according to a uniform random distribution across the search space. The obstacles have the same size as the agents and stay on the map until the end of the episode. As for the number of agents, two new agents enter the map randomly from the four arrival points every 4 time-steps; however, the total number of agents at any given time is capped. Each agent has a life cycle of 60 time-steps and is encouraged not to collide with other agents or obstacles and to stay inside the map.

Figure 3: The experimental environment in continuous space which has dynamic number of blocks.

4.2 Storage of Location

Agents’ location information is a set of points in two-dimensional space, so we build a 2d tree to record and process these coordinates. A k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space; k is 2 in our two-dimensional environment. Each leaf node in the binary tree is a 2-dimensional point, and every non-leaf node can be thought of as generating a splitting hyperplane that divides the space into two parts. Points to the left of this hyperplane are represented by the left subtree of that node, and points to the right of the hyperplane are represented by the right subtree.

We first use the initial agents’ location in the environment to construct a balanced 2d tree. The feature with the largest variance is selected as the segmentation feature, so the segmented data will be relatively scattered. Then, we select the median of the feature as the segmentation point, thus the number of nodes in the left subtree and right subtree is approximately the same, which is convenient for binary search. When the agent moves to a new position, the coordinate will be recorded and added to the 2d tree. As for adding elements, we traverse the tree from the root node and move to either the left or the right child depending on which side of the node’s splitting plane contains the new node. When the agent leaves the environment due to the battery problem, its position will also be removed from the 2d tree.

To calculate the intersection area with other agents, we need to obtain the coordinates of an agent’s surrounding agents. This can be treated as a range search problem in the 2d tree: we find all points contained in a query rectangle that is centered on the coordinates of the current agent and whose side length is the diameter of an agent’s exploration range. We start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, we search a subtree only if it might contain a point contained in the query rectangle.
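The construction, insertion and range search described in this subsection can be sketched with a small 2d tree. The sketch is a simplification under stated assumptions: points are stored at every node rather than only at the leaves, newly inserted nodes simply alternate the splitting axis, deletion is omitted, and all names (`Node`, `build`, `insert`, `range_search`, `neighbors`) and the example coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    point: Point
    axis: int                       # 0 splits on x, 1 splits on y
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def widest_axis(points: List[Point]) -> int:
    """Pick the coordinate with the largest variance as the split feature."""
    def variance(k: int) -> float:
        mean = sum(p[k] for p in points) / len(points)
        return sum((p[k] - mean) ** 2 for p in points) / len(points)
    return 0 if variance(0) >= variance(1) else 1


def build(points: List[Point]) -> Optional[Node]:
    """Build a balanced tree by splitting at the median of the chosen axis."""
    if not points:
        return None
    axis = widest_axis(points)
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis, build(pts[:mid]), build(pts[mid + 1:]))


def insert(node: Optional[Node], point: Point, parent_axis: int = 1) -> Node:
    """Walk down from the root and attach the new point as a leaf."""
    if node is None:
        return Node(point, 1 - parent_axis)
    if point[node.axis] < node.point[node.axis]:
        node.left = insert(node.left, point, node.axis)
    else:
        node.right = insert(node.right, point, node.axis)
    return node


def range_search(node: Optional[Node], lo: Point, hi: Point, found: List[Point]) -> None:
    """Collect all points inside the axis-aligned rectangle [lo, hi]."""
    if node is None:
        return
    x, y = node.point
    if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
        found.append(node.point)
    split = node.point[node.axis]
    if lo[node.axis] <= split:      # rectangle reaches into the left half
        range_search(node.left, lo, hi, found)
    if hi[node.axis] >= split:      # rectangle reaches into the right half
        range_search(node.right, lo, hi, found)


def neighbors(root: Optional[Node], center: Point, radius: float) -> List[Point]:
    """Query a square centered on the agent whose side is the diameter."""
    found: List[Point] = []
    lo = (center[0] - radius, center[1] - radius)
    hi = (center[0] + radius, center[1] + radius)
    range_search(root, lo, hi, found)
    return found


# Hypothetical agent positions.
positions = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
root = build(positions)
root = insert(root, (6.0, 3.0))     # an agent moves to a new position
nearby = sorted(neighbors(root, (6.0, 3.0), 2.0))
```

Both pruning tests are inclusive, so points that lie exactly on a splitting line are found regardless of which subtree they ended up in.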

4.3 Training Performance

As for our training procedure, we use Soft Actor-Critic, an off-policy actor-critic method for maximum entropy reinforcement learning, and train for 40000 episodes. There are 12 threads that process training data in parallel and a replay buffer that stores an experience tuple for each time step. The environment is reset after every episode of 60 steps. The policy network and the attention critic network get updated 4 times after the first episode. In detail, we sample 1024 tuples from the replay buffer and update the critic parameters by minimizing the Q-function loss and the policy parameters through policy gradients. The Adam optimizer is used with a learning rate of 0.001. We use a discount factor of 0.99 and a temperature of 0.2 for Soft Actor-Critic. The embedded information uses a hidden dimension of 128, and 4 attention heads are used in our attention critics.
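For reference, the hyperparameters listed above can be collected in one configuration object. The values are taken from the paragraph above; the class name `TrainingConfig`, the field names and the derived `head_dim` property are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    episodes: int = 40_000          # total training episodes
    episode_length: int = 60        # time-steps before the environment resets
    rollout_threads: int = 12       # parallel threads processing training data
    updates_per_round: int = 4      # network updates per update round
    batch_size: int = 1024          # tuples sampled from the replay buffer
    learning_rate: float = 1e-3     # Adam learning rate
    discount: float = 0.99          # discount factor
    temperature: float = 0.2        # Soft Actor-Critic temperature
    hidden_dim: int = 128           # size of the embedded information
    attention_heads: int = 4        # heads in the attention critic

    @property
    def head_dim(self) -> int:
        """Size of the slice each head receives when the embedding is split evenly."""
        if self.hidden_dim % self.attention_heads != 0:
            raise ValueError("hidden_dim must be divisible by attention_heads")
        return self.hidden_dim // self.attention_heads


config = TrainingConfig()
```

With these settings each attention head works on a 32-dimensional slice of the 128-dimensional embedding.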

We compare our method MAMECS with two recently proposed approaches, MADDPG lowe2017multi and COMA foerster2018counterfactual , on the exploration task for each agent. MADDPG extends traditional actor-critic methods to multi-agent mixed cooperative-competitive environments and has become a common baseline in various multi-agent scenarios. Unlike MADDPG, COMA uses a centralized critic to estimate the Q-function and decentralized actors to optimize the agents’ policies. All methods have approximately the same number of parameters across agents, and each model is trained with 6 random seeds. Hyperparameters for each underlying algorithm are tuned based on performance and kept constant across all variants of critic architectures for that algorithm.

Figure 4: Exploration rate of 4 agents on MAMECS and baselines.

The performance of each approach is assessed by the average exploration rate in each episode. As shown in Fig. 4, MAMECS outperforms MADDPG and COMA in exploration rate: the three methods reach 94.65%, 91.52%, and 77.78%, respectively. This indicates that MAMECS has a better learning ability in the exploration task, which we attribute to its capability of focusing on other agents’ relevant information as determined by the attention heads. To be concrete, although MADDPG takes other agents’ observations as input, it does not weight the information differently. COMA uses a single centralized critic network for all agents, which may perform best in environments with global rewards and agents with similar action spaces. In our environments, however, agents face completely independent situations with different rewards.

Figure 5: Exploration rate of 16 agents on MAMECS. Error bars are a 95% confidence interval across 6 runs.

Figure 6: Attention “entropy” for each head over the course of training for the four agents in the multi-robot exploration environment

Because the action space grows exponentially with the number of agents in MADDPG and COMA, the exploration task with 16 agents is not trainable with these methods. MAMECS, however, focuses only on the relevant information from other agents, which is equivalent to pruning the space so that it grows linearly with the number of agents. Our approach can therefore be extended to 16 agents. Moreover, the exploration rate converges faster when more agents are involved (Fig. 5).
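A small arithmetic sketch illustrates the gap between exponential and linear growth. It assumes, purely for illustration, a discretized action set of 5 actions per agent (this number is hypothetical and not taken from the experiments) and counts, for the attention case, only how many other-agent embeddings a single critic aggregates.

```python
def joint_action_count(actions_per_agent, num_agents):
    """Size of the joint action space when every combination is enumerated."""
    return actions_per_agent ** num_agents


def attended_input_count(num_agents):
    """Number of other-agent embeddings one agent's attention critic aggregates."""
    return num_agents - 1


ACTIONS_PER_AGENT = 5  # hypothetical value, chosen only for illustration

for n in (4, 16):
    print(n, joint_action_count(ACTIONS_PER_AGENT, n), attended_input_count(n))
```

Moving from 4 to 16 agents multiplies the enumerated joint space by a factor of 5^12 under this assumption, while the number of attended inputs only grows from 3 to 15.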

After training, we evaluate MAMECS, MADDPG, and COMA over 1000 episodes and compare the number of collisions, the exploration rate, and the average reward at the end of each episode. As shown in Table 1, MAMECS outperforms the other methods on all three metrics. MAMECS not only raises the exploration rate by 2.83 percentage points over MADDPG but also reduces the number of collisions during exploration. MAMECS also obtains a higher reward, which indicates better performance in the exploration task.

Approach   Collisions   Exploration-Rate (%)   Average-Rewards
MAMECS     29 ± 12      93.25 ± 2.47           63.61 ± 4.92
MADDPG     37 ± 18      90.42 ± 2.53           55.21 ± 4.79
COMA       71 ± 16      76.64 ± 3.17           41.93 ± 5.87
Table 1: Average performance over 1000 out-of-sample episodes in 60 time-steps.
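Assuming that the two numbers in each cell of Table 1 are a mean and a standard deviation over the evaluation episodes, such a summary could be computed as sketched below. The per-episode values in `rates` are invented for illustration and are not data from the experiments; only the two means passed to `gap` are taken from Table 1.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of per-episode values."""
    return statistics.mean(samples), statistics.stdev(samples)


def gap(mean_a, mean_b):
    """Difference between two mean rates, in percentage points."""
    return round(mean_a - mean_b, 2)


# Hypothetical per-episode exploration rates (%), not data from the paper.
rates = [92.0, 94.0, 93.0, 95.0, 91.0]
mean_rate, std_rate = summarize(rates)

# Mean exploration rates reported in Table 1 for MAMECS and MADDPG.
improvement = gap(93.25, 90.42)
```

The value of `improvement` corresponds to the 2.83-point difference in exploration rate between MAMECS and MADDPG.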

4.4 Visualizing Attention

Furthermore, to demonstrate the effect of the attention heads on each agent during training, we measure the “entropy” of the attention weights for each agent and each of the four attention heads used in the exploration task (Figures 6 and 7). A lower entropy value indicates that the head is focusing on specific agents, with an entropy of 0 indicating attention focused on a single agent. For agents 0, 1, 2 and 3 in the exploration task, we plot the attention entropy of each head. Each agent tends to use a different combination of the four heads, and each agent relies on more than one head during exploration, although their use is not mutually exclusive. Such differing combinations are appropriate given the nature of the exploration task.
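The entropy measure can be sketched as the Shannon entropy of one head's attention weights over the other agents. The natural logarithm and the example weight vectors below are our assumptions; the base of the logarithm used for the figures is not stated.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one head's attention weights.

    Zero weights are skipped, following the convention 0 * log(0) = 0.
    """
    return sum(-w * math.log(w) for w in weights if w > 0.0)


# Hypothetical weight vectors of one agent over its three teammates.
focused = attention_entropy([1.0, 0.0, 0.0])        # all attention on one agent
uniform = attention_entropy([1 / 3, 1 / 3, 1 / 3])  # attention spread evenly
mixed = attention_entropy([0.7, 0.2, 0.1])          # partially focused
```

With three other agents, the entropy ranges from 0 for a fully focused head up to ln 3 for a head that attends to all agents equally, so a curve that falls during training indicates a head becoming more selective.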

Figure 7: Attention “entropy” for each head of four agents over the course of training in the multi-agent environment.
Figure 8: Left: An exploration state of four agents from the last training episode. Right: The corresponding heatmap of attention weight among each agent.

Since obstacles appear randomly during training and the topography of each part of the map differs, each agent faces different difficulties and receives an independent reward at every time step. In addition, each of the four attention heads uses a separate set of parameters to determine an aggregated contribution from all other agents, so each agent tends to be influenced differently by the others. It is therefore reasonable that each agent uses a different combination of the four attention heads.

As shown in Figure 6, each agent mostly uses attention head 2, which indicates that the observation and action information attended to by head 2 contributes most to the exploration task. Agent 1, however, relies mainly on both head 2 and head 1 during training. Consequently, all four attention heads are needed, because they attend to different aspects of the agents’ information.

To further analyze the impact of the attention mechanism, we consider the attention entropy of each head across the four agents (Figure 7). Similarly, each head focuses on different agents at every time step during training and attends to a different combination of the four agents, which agrees with the conclusion above. Each head places a different emphasis on agents’ observation and action information, as determined by its own set of parameters. For instance, heads 0, 1 and 3 focus more on the information of agent 1 later in training, while head 2 gives roughly equal attention to all agents. In addition, each head tends to focus strongly on the information of agent 1, consistent with Figure 6, where all four heads are used heavily by agent 1.

To investigate the correlation between the attention weights and the agents’ states, we pick a particular state from the last training episode that illustrates the optimization ability of the attention mechanism in MAMECS. Fig. 8 shows the exploration state of the four agents (left) and the corresponding heatmap of attention weights among the agents (right). Regions with higher attention weight are lighter in color, and the attention weights of each agent sum to 1 because of normalization.

Generally, the attention weight is larger between agents that are closer together, such as agents 0 and 2 or agents 1 and 3. Among agents far from the current agent, however, the one whose trajectory area is about to be explored by the current agent receives the higher weight. Specifically, for agent 0, the attention weight on agent 2 is higher than that on agent 1, and both are higher than that on agent 3, which illustrates the effect of the attention mechanism. Agent 2 is closest to agent 0, which leads to the highest weight. Although agent 0 is far from both agent 1 and agent 3, it is about to explore the trajectory area of agent 1, so it pays more attention to the information of agent 1 than to that of agent 3. The attention heads have thus learned the behavior we expect.

5 Conclusion

This paper proposes MAMECS, a multi-head attention-based training approach for the multi-robot exploration task. The key idea is to use a multi-head attention mechanism to select meaningful information from related agents when estimating critics. Evaluations on the multi-robot exploration task show that the model outperforms two recently proposed approaches, MADDPG and COMA. MAMECS obtains higher average rewards and improves exploration performance. We also analyze the attention weights to illustrate the function of each attention head.

In future work, we will compare MAMECS with other baseline methods in the Predator and Prey scenario. We will also increase the number of agents to further highlight the advantage of cooperation in multi-agent reinforcement learning systems.

6 Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Numbers 61751208, 61502510, and 61773390), the Outstanding Natural Science Foundation of Hunan Province (Grant Number 2017JJ1001), and the Advanced Research Program (No. 41412050202).

7 References


  • (1) Sheng W, Yang Q, Tan J, et al. Distributed multi-robot coordination in area exploration. Robotics and Autonomous Systems, 2006, 54(12): 945-955.
  • (2) Burgard W, Moors M, Stachniss C, et al. Coordinated multi-robot exploration. IEEE Transactions on robotics, 2005, 21(3): 376-386.
  • (3) Yamauchi B. Frontier-based exploration using multiple robots. Agents. 1998, 98: 47-53.
  • (4) Faigl J, Kulich M. On benchmarking of frontier-based multi-robot exploration strategies. 2015 european conference on mobile robots (ECMR). IEEE, 2015: 1-8.
  • (5) Sharma K R, Honc D, Dusek F, et al. Frontier Based Multi Robot Area Exploration Using Prioritized Routing. ECMS. 2016: 25-30.
  • (6) Burgard W, Moors M, Fox D, et al. Collaborative multi-robot exploration. ICRA. 2000: 476-481.
  • (7) Colares R G, Chaimowicz L. The next frontier: combining information gain and distance cost for decentralized multi-robot exploration. Proceedings of the 31st Annual ACM Symposium on Applied Computing. ACM, 2016: 268-274.
  • (8) Pinto L, Gupta A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. 2016 IEEE international conference on robotics and automation (ICRA). IEEE, 2016: 3406-3413.
  • (9) Chen W, Qu T, Zhou Y, et al. Door recognition and deep learning algorithm for visual based robot navigation. 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO 2014). IEEE, 2014: 1793-1798.
  • (10) Shvets A A, Rakhlin A, Kalinin A A, et al. Automatic instrument segmentation in robot-assisted surgery using deep learning. 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018: 624-628.
  • (11) Kretzschmar H, Spies M, Sprunk C, et al. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 2016, 35(11): 1289-1307.
  • (12) Gu S, Holly E, Lillicrap T, et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017: 3389-3396.
  • (13) Kahn G, Villaflor A, Ding B, et al. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018: 1-8.
  • (14) Yamauchi B. Frontier-based exploration using multiple robots. Agents. 1998, 98: 47-53.
  • (15) Carrillo H, Dames P, Kumar V, et al. Autonomous robotic exploration using occupancy grid maps and graph SLAM based on Shannon and Rényi entropy. 2015 IEEE international conference on robotics and automation (ICRA). IEEE, 2015: 487-494.
  • (16) Mox D, Cowley A, Hsieh M A, et al. Information Based Exploration with Panoramas and Angle Occupancy Grids. Distributed Autonomous Robotic Systems. Springer, Cham, 2018: 45-58.
  • (17) Zlot R, Stentz A, Dias M B, et al. Multi-robot exploration controlled by a market economy. Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292). IEEE, 2002, 3: 3016-3023.
  • (18) Gabriely Y, Rimon E. Spanning-tree based coverage of continuous areas by a mobile robot. Annals of mathematics and artificial intelligence, 2001, 31(1-4): 77-98.
  • (19) Andre T, Bettstetter C. Collaboration in multi-robot exploration: to meet or not to meet?. Journal of Intelligent & Robotic Systems, 2016, 82(2): 325-337.
  • (20) Corah M, Michael N. Efficient Online Multi-robot Exploration via Distributed Sequential Greedy Assignment. Robotics: Science and Systems. 2017.
  • (21) Corah M, O’Meadhra C, Goel K, et al. Communication-efficient planning and mapping for multi-robot exploration in large environments. IEEE Robotics and Automation Letters, 2019, 4(2): 1715-1721.
  • (22) Kretzschmar H, Spies M, Sprunk C, et al. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 2016, 35(11): 1289-1307.
  • (23) Kahn G, Villaflor A, Ding B, et al. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018: 1-8.
  • (24) Geng M, Zhou X, Ding B, et al. Learning to cooperate in decentralized multi-robot exploration of dynamic environments. International Conference on Neural Information Processing. Springer, Cham, 2018: 40-51.
  • (25) Sukhbaatar S, Fergus R. Learning multiagent communication with backpropagation. Advances in Neural Information Processing Systems. 2016: 2244-2252.
  • (26) Geng M, Xu K, Zhou X, et al. Learning to cooperate via an attention-based communication neural network in decentralized multi-robot exploration. Entropy, 2019, 21(3): 294.
  • (27) Liu S, Geng M, Xu K. Learning to Communicate Efficiently with Group Division in Decentralized Multi-agent Cooperation. 2019 IEEE International Conference on Service-Oriented System Engineering (SOSE). IEEE, 2019: 331-3316.
  • (28) Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems. 2017: 6379-6390.
  • (29) Foerster J N, Farquhar G, Afouras T, et al. Counterfactual multi-agent policy gradients. Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
  • (30) Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems. 2017: 6379-6390.

Appendix A

Because no method is known for finding an optimal way to fully cover a square with the minimum number of circles of a given radius, we propose a theorem that helps achieve a higher coverage ratio.

Theorem 1

Arrangement of circles tangent to each other does not necessarily lead to the maximum coverage ratio.

Proof A.1

We prove this theorem by example. Consider the coverage problem for 4 circles, where the coverage ratio is calculated as the ratio of the area of the circles’ union to the area of their circumscribed square. The circles are tangent to the edges of the circumscribed square. The following example illustrates 4 identical circles of equal radius arranged in two patterns.
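The coverage ratio defined above can be estimated numerically, as in the following minimal sketch. It uses a regular sampling grid rather than the exact computational-geometry calculation, and the layout of the centers (four unit circles in a 2 x 2 arrangement inside a square of side 4) is our assumption about the tangent pattern; the function and variable names are hypothetical.

```python
import math


def coverage_ratio(centers, radius, side, grid=400):
    """Estimate (area of the union of circles) / (area of the square).

    The square is [0, side] x [0, side]; each grid cell is tested at its
    centre point, so the result is an approximation.
    """
    step = side / grid
    r2 = radius * radius
    covered = 0
    for i in range(grid):
        x = (i + 0.5) * step
        for j in range(grid):
            y = (j + 0.5) * step
            if any((x - cx) ** 2 + (y - cy) ** 2 <= r2 for cx, cy in centers):
                covered += 1
    return covered / (grid * grid)


# Assumed layout: four unit circles in a 2 x 2 arrangement, each tangent to
# its neighbours and to the edges of a square of side 4.
tangent = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
ratio = coverage_ratio(tangent, radius=1.0, side=4.0)
```

For this layout the estimate should be close to π/4; passing a different list of centers and the side of the corresponding circumscribed square allows other arrangements to be compared in the same way.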

(a) 4 circles tangent to each other.
(b) 4 circles intersect in a certain pattern.
Figure 9: The diagram of 4 circles arranged in different patterns.

For the case of 4 circles tangent to each other, shown in Fig. A.9(a), the coverage ratio follows from the tangency and symmetry of the arrangement.
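As a worked illustration of the tangent case, suppose the four circles of radius $r$ form a 2 x 2 arrangement in which each circle touches its two neighbours and two edges of the circumscribed square. The symbols $r$, $a$, $S_{\text{square}}$, $S_{\text{union}}$ and $\eta_{a}$ are introduced here for illustration and are not necessarily the notation of the original derivation. Since the circles do not overlap, the union area is the sum of the four circle areas:

```latex
\begin{aligned}
a &= 4r, \\
S_{\text{square}} &= a^{2} = 16r^{2}, \\
S_{\text{union}} &= 4\pi r^{2}, \\
\eta_{a} &= \frac{S_{\text{union}}}{S_{\text{square}}}
         = \frac{4\pi r^{2}}{16r^{2}}
         = \frac{\pi}{4} \approx 78.54\%.
\end{aligned}
```

Under this assumption the ratio is independent of $r$, so any improvement must come from changing the arrangement rather than the size of the circles.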

For the other case, in which the 4 circles intersect in a certain pattern as shown in Fig. A.9(b), the resulting coverage ratio is higher than that of the tangent arrangement. This means that arranging circles tangent to each other does not necessarily lead to the maximum coverage ratio. The theorem is therefore proved by this example.