1 Introduction
The problems associated with exploring an unknown environment using a team of robots are among the fundamental problems in mobile robotics. They arise in a wide range of applications, including disaster rescue, planetary exploration, reconnaissance and surveillance [sheng2006distributed; burgard2005coordinated]. The key question during exploration is how to determine each agent's next move so that the overall mission time is minimized and the exploration rate is maximized. We focus here on a subproblem of robotic exploration, namely decentralized multi-robot exploration, where robots make their own decisions without a centralized controller. To devise cooperation strategies, robots need to broadcast their local observations and historical trajectories by communicating with each other, which allows more information about the environment to be acquired.
To this end, several multi-agent exploration approaches have been developed. The original approaches, such as the frontier-based approach [yamauchi1998frontier; faigl2015benchmarking; sharma2016frontier] and the cost-utility approach [burgard2000collaborative; colares2016next], were designed by experts based on cooperation strategies including explicit communication and action rules. However, many real-world applications have proven too complex to be dealt with efficiently by human-designed strategies. Moreover, these approaches find it difficult to cope with the "historical trajectory" issue: most "pre-designed" methods assume that the current robot only communicates with nearby robots; however, more distant robots that have explored the surrounding areas also need to be involved in the communications network to avoid repeated exploration.
Recent work in this area has attempted to combine the strengths of deep learning techniques with control policies for robotics applications [pinto2016supersizing; chen2014door; shvets2018automatic]. In particular, deep reinforcement learning (DRL) methods allow multiple agents to autonomously learn the required cooperation strategy [kretzschmar2016socially; gu2017deep; kahn2018self]. These learning-based approaches thus avoid the difficulties associated with developing precise and complicated control strategies for each move, and achieve more flexible and effective performance in complex scenarios. Despite this progress, however, algorithms for multi-agent exploration are still far from being universal (to continuous space) and scalable (to a larger number of agents). Various previous works [yamauchi1998frontier; carrillo2015autonomous; mox2018information] have modeled the exploration environment as a discrete space in which agents' actions are restricted to their surrounding grids. When extending the task into continuous space, however, it is hard to design an accurate reward function based on the historical trajectories, meaning that some areas may be ignored or repeatedly explored. In addition, the maximum number of agents is limited in previous works of this kind, as the action space increases exponentially with the number of agents. Although recent single-head attention-based methods have shown great potential in multi-agent cooperation tasks by focusing only on the relevant agents, they remain simple and limited compared with multi-head attention mechanisms, in which each attention head can focus on a different weighted mixture of agents (e.g. their locations, historical trajectories, etc.).
Accordingly, our proposed approach, Multi-head Attention-based Multi-robot Exploration in Continuous Space (MAMECS), extends these prior works in several directions. We model the environment as a continuous space in which agents can move to an arbitrary point at every step. Computational geometry is applied to describe the environment and to design an improved reward function. Inspired by team performance in real-world applications, where each team member tends to focus only on the teammates in a cooperative or competitive relationship with itself, we learn the multi-agent cooperation strategy through a multi-head attention-based critic. Each agent is therefore aware of which other agents it should be paying attention to, rather than simply considering all agents at every time step. Moreover, the rapid growth of the action space is sharply reduced by the selective attention mechanism, meaning that the number of agents involved can be increased.
We have validated MAMECS on a typical multi-robot exploration task. Extensive experiments show that MAMECS fits the continuous space well, effectively extends the total number of agents and improves exploration performance compared with previous works. The rest of this paper is organized as follows. In Section 2 we discuss related work, followed by a detailed description of our approach in Section 3. We report experimental studies in Section 4 and conclude in Section 5.
2 Related Work
Our approach MAMECS aims to solve the multi-robot exploration task through multi-head attention-based reinforcement learning in continuous space. We therefore focus on two research fields: traditional "human-designed" methods for multi-robot exploration and "learning-based" multi-robot exploration approaches.
2.1 “Human-designed” Methods for Multi-robot Exploration
Multi-robot exploration is a fundamental robotic problem, which employs a team of autonomous robots to explore an unknown environment with obstacles. Most early works started with heuristic and rule-based approaches. Yamauchi's frontier-based exploration using mobile robots [yamauchi1998frontier] is a foundational work underlying many successful approaches. In this approach, each robot makes the assignment that maximizes the joint utility over the frontiers and navigates to the nearest unvisited frontier.
Market-based approaches [zlot2002multi] employ the concepts of frontier cells and utility in a market environment to produce complex coordinated strategies in multi-robot exploration. Spanning-tree coverage approaches [gabriely2001spanning] adapted the single-robot complete-coverage algorithm to the multi-robot scenario: each robot is assigned a part of the constructed spanning tree and covers its section in a counterclockwise fashion. Recent approaches [andre2016collaboration; corah2017efficient; corah2019communication] focus more on the mutual information for ranging sensors and attempt to maximize mutual information directly. The above methods are all based on precisely designed rules, which must take all details and situations into account. These "pre-designed" cooperation methods therefore perform poorly, especially for partial-observation tasks: it is extremely hard for humans to design effective strategies based only on a local view of the whole environment.
2.2 Learning-based Methods in Multi-robot Exploration
Deep reinforcement learning has proven effective for enabling sophisticated and hard-to-design behaviors of individual robots [kretzschmar2016socially; kahn2018self]. For the multi-robot exploration task, [geng2018learning] proposes a learning-based method that enables robots to actively learn cooperation strategies as well as action policies. Their method is robust enough to handle complex and dynamic environments and beats the performance of several "human-designed" methods. The communication model used in [geng2018learning] is CommNet [sukhbaatar2016learning], which simply averages the communication messages to realize coordination. [geng2019learning; liu2019learning] improve the communication process by introducing an attention mechanism, which can precisely calculate whether communication is necessary for each pair of agents in the exploration scenario. The attention mechanism enables agents to communicate only with the necessary partners and further improves cooperation performance. However, the above methods model the environment with occupancy grids, which are discrete and make it easy to represent information such as historical trajectories. In this paper, we focus on the multi-robot exploration task in continuous space, which is considerably harder because of the difficulty of designing the reward function. Besides, we exploit the multi-head attention mechanism, in which each head can focus on a different weighted mixture of agents (i.e., their locations and historical trajectories). Furthermore, our method is more flexible than existing learning-based methods: it can further increase the number of agents within the limited action space and is closer to reality.
3 Our Approach
The exploration rate and scalability of existing multi-robot exploration methods struggle to satisfy the requirements of realistic applications. It is therefore worthwhile to improve the exploration rate and increase the number of agents involved. In this section, we introduce our MAMECS method from the following aspects: the basic framework, learning the shared attentive critic, continuous environment modeling, the design of the entropy-oriented reward function and the exploration-rate-based training approach.
3.1 Problem Formulation
We consider the application scenario of multi-robot exploration as a partially observable distributed environment. We assume that each agent can obtain the accurate positions of the other agents and of the obstacles within its visual range. Each agent i learns a policy π_i which maps its observation o_i to a distribution over actions a_i. The learning process of the individual policy has to attend selectively to observations from other agents, so that the number of agents involved can be extended. Each agent should therefore weigh the other agents' different contributions to its decision-making process, rather than considering them all at all times. Since each agent cannot weight the other agents' observations on its own, it must learn to decide the importance of the shared information and compute an integrated contribution.
3.2 Framework
More formally, a multi-head attention mechanism is introduced to centrally learn a critic that enables each agent to select which agents to attend to at each time step. The shared critic receives the observations, actions and historical trajectories from all agents and generates Q-values for each agent, and the contribution of the other agents' information is evaluated by multiple attention heads through attention weights. In the training process, all critics are updated together by minimizing a joint loss function. The main architecture of MAMECS is shown in Fig. 1.
The goal of our method is to selectively pay attention to other agents' information in each agent's decision-making process. In detail, the encoder takes each agent's observation, action and historical trajectories as input; the encoded information is then fed into the multiple attention heads to generate an integrated message x_i. x_i is evenly separated into H parts, H being the number of attention heads. Each attention head employs a separate set of parameters to weight the different contributions from the other agents, and the parts of the information with different weights are concatenated into a single vector x_i. Each agent takes the relevant information of the concatenated message x_i and its local information into account for estimating its value function Q_i(o, a). The shared critic receives the observations o = (o_1, ..., o_N), actions a = (a_1, ..., a_N) and historical trajectories from all agents indexed by i ∈ {1, ..., N}. The Q-value function for agent i can be calculated as
Q_i^ψ(o, a) = f_i(g_i(o_i, a_i), x_i)    (1)
Here, f_i represents the Q-network, a two-layer multi-layer perceptron (MLP), and g_i is the encoder, a one-layer MLP embedding function. x_i represents the contribution from all other agents, which is the vector concatenated from the H attention heads. For each attention head, the corresponding part of x_i is the weighted sum of the attention weights α_j and the embedded information v_j. To be concrete, agent j's embedded information e_j = g_j(o_j, a_j) is transformed by a matrix V, followed by an element-wise nonlinearity h, into a value:
v_j = h(V g_j(o_j, a_j))    (2)
To evaluate the corresponding attention weight α_j of agent i to agent j, a bilinear mapping is used to project the embedded features into a query-key system. W_q transforms agent i's embedded information e_i into a "query" and W_k transforms agent j's embedded information e_j into a "key". We then perform a Softmax operation over the similarity values between these two embeddings:
α_j ∝ exp(e_j^T W_k^T W_q e_i)    (3)
Each attention head uses a separate set of parameters (W_q, W_k, V) to process the embedded information and calculates a contribution from all other agents to the current agent. The aggregated messages from the attention heads are then simply concatenated into a single vector.
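This per-head computation can be sketched with numpy as follows. The matrix names W_q, W_k, V follow the query-key-value description above, but the dimensions are illustrative and the element-wise nonlinearity applied to the values is omitted for brevity:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def head_contribution(e_i, e_others, W_q, W_k, V):
    """One attention head's contribution to agent i from all other agents.

    e_i: agent i's embedded observation-action vector, shape (d,)
    e_others: embeddings e_j of the other agents, shape (n, d)
    W_q, W_k: query/key projections, shape (d_a, d); V: value projection (d_v, d)
    """
    query = W_q @ e_i                 # agent i's "query"
    keys = e_others @ W_k.T           # one "key" per other agent
    alpha = softmax(keys @ query)     # attention weights over the other agents
    values = e_others @ V.T           # per-agent "values" (nonlinearity omitted)
    return alpha @ values             # weighted sum of the other agents' values

rng = np.random.default_rng(0)
d, d_a, d_v, n_heads = 8, 4, 4, 4     # illustrative dimensions
e_i = rng.normal(size=d)
e_others = rng.normal(size=(3, d))    # three other agents
# Each head has its own parameter set; the head outputs are concatenated.
heads = [(rng.normal(size=(d_a, d)), rng.normal(size=(d_a, d)),
          rng.normal(size=(d_v, d))) for _ in range(n_heads)]
x_i = np.concatenate([head_contribution(e_i, e_others, *p) for p in heads])
```

Because each head has its own (W_q, W_k, V), each head can weight a different mixture of the other agents before the results are concatenated into x_i.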
3.3 Learning the Shared Attentive Critic
We now describe how all critics are updated together within the shared attentive critic. Because the critic parameters are shared across all agents, all critics can be updated together by minimizing a joint regression loss function:
L_Q(ψ) = Σ_{i=1}^{N} E_{(o,a,r,o')∼D} [ (Q_i^ψ(o, a) − y_i)^2 ]    (4)
where
y_i = r_i + γ E_{a'∼π_θ̄(o')} [ Q_i^ψ̄(o', a') − α log π_{θ̄_i}(a'_i | o'_i) ]    (5)
where Q_i^ψ(o, a) is the estimated action-value for agent i, while y_i is the target value. ψ̄ and θ̄ are the parameters of the target critics and target policies, respectively, and α is the temperature used to balance entropy and rewards. Each agent's policy is then updated with the following gradient:
∇_{θ_i} J(π_θ) = E_{o∼D, a∼π} [ ∇_{θ_i} log π_{θ_i}(a_i | o_i) ( −α log π_{θ_i}(a_i | o_i) + Q_i^ψ(o, a) − b(o, a_{\i}) ) ]    (6)
We represent the set of all agents except i as \i. b(o, a_{\i}) is the multi-agent baseline, the average action value obtained by marginalizing out agent i's action:
b(o, a_{\i}) = E_{a_i∼π_{θ_i}(o_i)} [ Q_i^ψ(o, (a_i, a_{\i})) ]    (7)
The baseline helps each agent judge its own contribution to the team in a cooperative scenario: by comparing an agent's Q-value with this average action value, the specific contribution of the agent to the reward can be identified. Full training details and hyperparameters can be found in the following sections.
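The role of the baseline can be illustrated with a minimal discrete-action sketch; the function names and the candidate-action enumeration are our own illustration (in a continuous action space the expectation would instead be estimated by sampling the policy):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def advantage(q_values, logits, action):
    """Advantage of the taken action over the multi-agent baseline.

    q_values: Q_i(o, (a_i, a_{-i})) evaluated for every candidate action a_i,
              holding the other agents' actions a_{-i} fixed.
    logits: agent i's policy logits over the same candidate actions.
    action: index of the action agent i actually took.
    """
    pi = softmax(np.asarray(logits, dtype=float))
    baseline = pi @ np.asarray(q_values, dtype=float)  # expected Q under pi_i
    return q_values[action] - baseline
```

A positive advantage means the chosen action contributed more to the team return than the agent's average action would have, which is exactly the signal used in the policy gradient.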
3.4 Dynamic Environment Modeling
Rather than modeling the environment using occupancy grids, we model the two-dimensional world in continuous space. Consequently, a robot can move to arbitrary positions on the map rather than only to the grid cells surrounding it, which represents a more flexible and practical approach to the multi-robot exploration task. The basic idea behind our approach is to represent each agent as the center point of a circle, so that the agent can explore the area within the radius of this circle at each time step.
We first make some definitions: o_t^i stands for the observation of robot i at time step t; a_t^i stands for the output action given the corresponding inputs; x_t^i is the contribution from the other agents, a weighted sum of the attention weights and the embedded information of the other agents; τ_t^i stands for the coordinates of the agent's historical trajectory; and π_i stands for the policy of choosing controls based on the past observations, trajectories and contributions from other agents.
We assume that there is an underlying map m primarily unknown to the agents. To be concrete, τ_t^i is a dynamic coordinate set which records the positions of robot i up to time step t. Each robot wishes to infer its belief map b_t^i over the map m at time t given all its previous observations, trajectories and the other robots' weighted contributions leading up to that time step. To simplify the problem, we assume the individual maps for each agent, indexed as m^i, are independent:
b_t(m) = ∏_{i=1}^{N} b_t^i(m^i) = ∏_{i=1}^{N} p(m^i | o_{1:t}^i, τ_{1:t}^i, x_{1:t}^i)    (8)
In information-gain approaches, the goal of exploration is twofold: not just to map the environment, but also to move the robot so as to maximize the amount of new information about the environment. We apply an information-gain method to measure the environment uncertainty in a probability distribution b_t^i by the entropy

H(b_t^i) = − Σ_{m^i} b_t^i(m^i) log b_t^i(m^i)    (9)
This is a measure of the uncertainty associated with the constructed belief map b_t^i. As b_t^i becomes more peaked, H(b_t^i) decreases, and it reaches zero when the outcome of a random trial is certain.
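As a concrete illustration, the entropy of a belief distribution can be computed as below; treating the belief map as a single discrete distribution is a simplifying assumption of this sketch:

```python
import numpy as np

def belief_entropy(belief):
    """Entropy H(b) of a belief map treated as a discrete distribution.

    Cells with zero probability contribute nothing to the sum, and a
    fully peaked (certain) belief has zero entropy.
    """
    p = np.asarray(belief, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```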
3.5 Entropyoriented Reward Function
We now describe our reward function, which encourages the agents to explore more of the unknown dynamic environment in the shortest time. In the learning process, there is a central node that records the trajectory of each agent and gives the corresponding reward based on their performance. At time step t, agent i obtains its own observation o_t^i and the contribution x_t^i from the other agents. The agent tends to execute the action with the highest reward and updates its belief map b_t^i based on the obtained inputs and its historical trajectory τ_t^i.
Before describing the reward function precisely, we first state our expectations of the agents in the exploration task. Each agent is expected to avoid collisions with other agents and with obstacles, avoid exploring the same area repeatedly, explore the map in minimal time steps, and reduce the uncertainty of the whole map as quickly as possible. In other words, behavior we encourage is rewarded positively, while behavior we wish the agents to avoid is rewarded negatively. At time step t, each agent thus seeks a policy that reaches the expected goals. The reward function is as follows:
r_t^i = ΔH_t^i + r_c N_c + f(S)    (10)
Here, r_t^i is the combination of three aspects: the reduction in the environment's entropy, collision information, and repeated-area-coverage information. ΔH_t^i is the information gain after agent i takes an action, defined as the decrease in entropy. In the context of robotic exploration, we measure the information gain as the difference in map entropy between time steps t − 1 and t; it is the value we wish to maximize by selecting new poses. As for collisions with other agents and obstacles, N_c refers to the number of collisions; two agents collide if the circles centered on their coordinates overlap. A collision incurs a negative reward r_c.
f(S) is applied to calculate the repeated-area-coverage information; it is a piecewise function of the agents' intersection area S. We assume that the environment is a square while each agent is represented as a circle. For the problem of fully covering a square with a minimum number of fixed-radius circles, there is no known way to find optimal solutions. In our case, however, agents are not required to fully cover the environment but are expected to achieve the maximum coverage ratio. We therefore propose a theorem on the coverage ratio and design a calculation method to achieve this goal (shown in Appendix A).
We have proved that there is a better arrangement of circles that achieves a higher coverage ratio. As a result, the piecewise function f(S) reaches its maximum when S equals the intersection area of two circles in the second circumstance, where there is a higher coverage ratio (the second circumstance is shown in Fig. A10 in Appendix A). f(S) is designed to be a continuous function, so agents receive reward signals throughout the exploration process. The sparse-reward problem is thus avoided and the agents are able to learn the exploration task better.
f(S) = { c_1 S (2S* − S) / (S*)^2,                      0 ≤ S ≤ S*
       { c_1 − (c_1 − c_2) (S − S*)^2 / (S_o − S*)^2,   S* < S ≤ S_o    (11)

where S* is the intersection area of two circles in the second arrangement and S_o is the area of a full circle (two circles coinciding completely),

S* = 2r^2 arccos(d / 2r) − (d / 2) √(4r^2 − d^2)    (12)

with r the exploration radius and d the distance between the two circle centers in that arrangement.
When the circles are tangent to each other, the coverage ratio may not be maximal, but the agent explores new areas at every time step, so the reward is f(0) = 0. As the intersection area of the two circles increases, the reward increases and reaches its maximum at S = S*. Then f(S) falls in the form of a quadratic function and reaches its bottom when the two circles coincide completely.
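The qualitative shape just described can be sketched as follows; the constants r_max, r_min and the exact quadratic form are illustrative assumptions, not the coefficients derived in Appendix A:

```python
def overlap_reward(S, S_star, S_full, r_max=1.0, r_min=-1.0):
    """Qualitative sketch of the piecewise overlap reward f(S).

    S: intersection area between the current agent's circle and an
       already-explored circle.
    S_star: intersection area of the high-coverage arrangement (reward peak).
    S_full: area of a whole circle, i.e. the circles coincide (reward floor).
    r_max and r_min are illustrative constants, not the paper's coefficients.
    """
    if S <= S_star:
        # rises continuously from f(0) = 0 (tangent circles) to r_max at S_star
        return r_max * (1.0 - ((S - S_star) / S_star) ** 2)
    # falls quadratically, bottoming out when the circles coincide completely
    t = (S - S_star) / (S_full - S_star)
    return r_max + (r_min - r_max) * t * t
```

Because the function is continuous in S, the agent receives a dense reward signal at every step of the exploration rather than a sparse one.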
3.6 Exploration Ratebased Training Approach
To simulate the dynamic obstacles in a practical environment and to enhance the robustness of the agents to dynamic settings, we gradually add random obstacles every T timesteps to the original environment. Each agent is thus expected to learn a strategy that avoids both static and dynamic obstacles. To help the model find a better local optimum and to accelerate training, curriculum learning is adapted to the training process by gradually increasing the task difficulty. In detail, the value of T decreases during training, so the frequency of adding random obstacles increases, meaning the difficulty of the mission increases. However, the total number of random obstacles added remains constant, so fewer obstacles are added each time when they are added more frequently. This is an essential setting because the exploration-rate threshold ρ_0 (a crucial component of the success standard) is fixed for the given timesteps.
Each simulation is terminated after a specified number of timesteps and classified as a failure if collisions with obstacles have occurred or the exploration rate ρ is less than a threshold ρ_0. Here ρ is calculated as follows:

ρ = S_e / (S − S_o)    (13)
Here, S_e is the explored area of the map, i.e., the union of the areas covered along the agents' trajectories at each time step. S_o is the area occupied by the final set of obstacles, including the static and random obstacles on the map. S is the total area of the environment, which is modeled as a square in this scenario. Since each agent and obstacle is represented as a circle with a certain radius, and the positions of newly generated circles may overlap with area already explored by the multi-agent system, we take the union operation when calculating the combined area of these circles.
We use the adaptive Simpson algorithm, a classic computational geometry method, to calculate the union area of the circles. We first examine the relative positions of the circles to reduce computation: if the centers of two circles coincide, only the area of one circle is retained; and if the distance from the center of a circle to every other center exceeds the sum of the radii, we add the area of that complete circle to the total. After this screening, we use the adaptive Simpson algorithm to calculate the area corresponding to each arc. Let f(x) be the total length of the intersection of the vertical line at x with the union of the circles. For each interval (l, r), we recursively evaluate f at the endpoints and at the intermediate point m = (l + r)/2, so the area over the interval (l, r) is approximated as:
∫_l^r f(x) dx ≈ (r − l)/6 · [ f(l) + 4 f(m) + f(r) ]    (14)
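A minimal sketch of this computation is given below. It omits the screening step described above and uses interval merging for the cross-section length f(x); the function names and tolerance are illustrative:

```python
import math

def cross_section(x, circles):
    """Total length of the vertical line at x clipped to the union of circles."""
    segs = []
    for cx, cy, r in circles:
        dx = abs(x - cx)
        if dx < r:
            h = math.sqrt(r * r - dx * dx)
            segs.append((cy - h, cy + h))
    if not segs:
        return 0.0
    segs.sort()
    total, (lo, hi) = 0.0, segs[0]
    for a, b in segs[1:]:              # merge overlapping intervals
        if a > hi:
            total += hi - lo
            lo, hi = a, b
        else:
            hi = max(hi, b)
    return total + (hi - lo)

def simpson(f, l, r):
    return (r - l) / 6.0 * (f(l) + 4.0 * f((l + r) / 2.0) + f(r))

def adaptive(f, l, r, whole, eps):
    """Recursively refine intervals until Simpson's estimate stabilizes."""
    m = (l + r) / 2.0
    left, right = simpson(f, l, m), simpson(f, m, r)
    if abs(left + right - whole) < 15.0 * eps:
        return left + right + (left + right - whole) / 15.0
    return (adaptive(f, l, m, left, eps / 2.0)
            + adaptive(f, m, r, right, eps / 2.0))

def union_area(circles, eps=1e-6):
    """Area of the union of circles given as (cx, cy, r) triples."""
    f = lambda x: cross_section(x, circles)
    lo = min(cx - r for cx, _, r in circles)
    hi = max(cx + r for cx, _, r in circles)
    return adaptive(f, lo, hi, simpson(f, lo, hi), eps)
```

For a single circle this integral recovers the circle's area, and coincident circles are counted only once because the merged intervals coincide.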
4 Experiments
In this section, we first introduce our experimental settings and the storage of locations. Then, we show the training performance compared with the baseline methods. Finally, we give the attention visualization and the corresponding analysis.
4.1 Experimental Setup
We use the MPE (Multi-agent Particle Environment) framework to construct an environment for testing the various capabilities of our approach (MAMECS) and the baselines. The square map represents an artificial environment with various obstacles, large enough to require the amount of exploration needed to test our method, but not so large as to cause inadequate exploration. The experimental environment has a continuous action space, so an agent can move anywhere on the map as determined by its velocity and acceleration parameters. Each agent can sense the environment within its exploration radius and has a communication range covering the whole environment. The goal of the whole system is to explore as much of the map as possible in a fixed time.
To be concrete, four agents enter the environment through four arrival points, and the positions traveled by each agent form a trajectory, represented as red circles with the same radius as the agent. As for the obstacles, there are 4 original blocks in the initial environment, and new blocks are introduced according to a uniform random distribution across the search space. The obstacles are the same size as the agents and stay on the map until the end of the episode. Regarding the number of agents, two new agents enter the map randomly from the four arrival points every 4 timesteps; however, the total number of agents at a given time is limited to 16. Each agent has a life cycle of 60 timesteps and is encouraged not to collide with other agents or obstacles and to stay inside the map.
4.2 Storage of Locations
The agents' location information is a set of points in two-dimensional space, so we build a 2-d tree to record and process these coordinates. A k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space; k is 2 in our two-dimensional environment. Each leaf node in the binary tree is a 2-dimensional point, and every non-leaf node can be thought of as generating a splitting hyperplane that divides the space into two parts: points to the left of this hyperplane are represented by the left subtree of that node, and points to the right of the hyperplane are represented by the right subtree.
We first use the agents' initial locations in the environment to construct a balanced 2-d tree. The feature with the largest variance is selected as the splitting feature, so the split data will be relatively scattered. We then select the median of that feature as the splitting point, so the numbers of nodes in the left and right subtrees are approximately equal, which is convenient for binary search. When an agent moves to a new position, the coordinate is recorded and added to the 2-d tree. To add an element, we traverse the tree from the root and move to either the left or the right child depending on which side of the node's splitting plane contains the new point. When an agent leaves the environment due to battery exhaustion, its positions are also removed from the 2-d tree.
To calculate the intersection areas with different agents, we need to obtain the coordinates of the surrounding agents. This can be treated as a range-search problem in the 2-d tree: find all points contained in a given query rectangle, which is centered on the coordinates of the current agent and whose side length is the diameter of an agent's exploration range. We start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, we search a subtree only if it might contain a point inside the query rectangle.
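A minimal 2-d tree sketch with rectangle range search follows. For simplicity it alternates split axes by depth instead of choosing the largest-variance feature described above, and the sample coordinates are hypothetical:

```python
def build(points, depth=0):
    """Build a 2-d tree, alternating the split axis by depth
    (the paper instead splits on the feature with the largest variance)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                      # median as the splitting point
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def range_search(node, rect, found=None):
    """Collect all points inside rect = (xmin, xmax, ymin, ymax)."""
    if found is None:
        found = []
    if node is None:
        return found
    xmin, xmax, ymin, ymax = rect
    x, y = node["point"]
    if xmin <= x <= xmax and ymin <= y <= ymax:
        found.append(node["point"])
    lo, hi = (xmin, xmax) if node["axis"] == 0 else (ymin, ymax)
    coord = node["point"][node["axis"]]
    # pruning rule: descend only into half-planes that intersect the rectangle
    if lo <= coord:
        range_search(node["left"], rect, found)
    if coord <= hi:
        range_search(node["right"], rect, found)
    return found

# hypothetical agent coordinates and a query rectangle around the current agent
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
neighbours = range_search(tree, (3, 8, 1, 5))   # xmin, xmax, ymin, ymax
```

Subtrees whose half-plane lies entirely outside the query rectangle are never visited, which is what makes the range search cheaper than a linear scan over all recorded positions.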
4.3 Training Performance
For our training procedure, we use an off-policy actor-critic method, Soft Actor-Critic, for maximum-entropy reinforcement learning over 40,000 training episodes. Twelve threads process training data in parallel, and a replay buffer stores an experience tuple for each time step. The environment is reset every episode of 60 steps. The policy network and the attention critic network are updated 4 times after the first episode. In detail, we sample 1024 tuples from the replay buffer and update the parameters through the Q-function loss and the policy gradient of the policy objective. The Adam optimizer is used with a learning rate of 0.001. We use a discount factor of 0.99 and a temperature of 0.2 for Soft Actor-Critic. The embedded information uses a hidden dimension of 128, and 4 attention heads are used in our attention critics.
We compare MAMECS to two recently proposed approaches, MADDPG [lowe2017multi] and COMA [foerster2018counterfactual], on the exploration task. MADDPG extends traditional actor-critic methods to multi-agent mixed cooperative-competitive environments and has become a common baseline in various multi-agent scenarios. Unlike MADDPG, COMA uses a centralized critic to estimate the Q-function and decentralized actors to optimize the agents' policies. All methods have approximately the same number of parameters across agents, and each model is trained with 6 random seeds. Hyperparameters for each underlying algorithm are tuned based on performance and kept constant across all variants of critic architectures for that algorithm.
The performance of each approach is assessed by the average exploration rate in each episode. As shown in Fig. 4, MAMECS outperforms MADDPG and COMA in exploration rate, which reach 94.65%, 91.52%, and 77.78%, respectively. This indicates that MAMECS has a better learning ability in the exploration task, which we attribute to its capability of focusing on other agents' relevant information as determined by the attention heads. To be concrete, although MADDPG takes other agents' observations as input, it does not weight this information differently. COMA uses a single centralized critic network for all agents, which may perform best in environments with global rewards and agents with similar action spaces; in our environment, however, agents face completely independent situations with different rewards.
Because the action space grows exponentially with the number of agents in MADDPG and COMA, the exploration task with 16 agents is not trainable with those methods. MAMECS, however, focuses only on the relevant information from other agents, which is equivalent to pruning the space so that it grows linearly with the number of agents. The exploration task can therefore be extended to 16 agents with our approach. Meanwhile, the exploration rate converges faster when more agents are involved (shown in Fig. 5).
After training, we evaluate MAMECS, MADDPG, and COMA by running 1000 episodes and comparing the number of collisions, the exploration rate and the average rewards at the end of each episode. As shown in Table 1, MAMECS outperforms the other methods in all aspects. MAMECS not only increases the exploration ratio by 2.83% over MADDPG but also reduces the number of collisions during the exploration process. Meanwhile, MAMECS obtains a higher reward, which means better performance in the exploration task.
Table 1: Comparison over 1000 evaluation episodes (mean ± standard deviation).

Approach | Collisions | Exploration Rate (%) | Average Rewards
MAMECS   | 29 ± 12    | 93.25 ± 2.47         | 63.61 ± 4.92
MADDPG   | 37 ± 18    | 90.42 ± 2.53         | 55.21 ± 4.79
COMA     | 71 ± 16    | 76.64 ± 3.17         | 41.93 ± 5.87
4.4 Visualizing Attention
Furthermore, in order to demonstrate the effect of the attention head on the agent during the training process, we test the “entropy” of the attention weights for each agent for each of the four attention heads that we use in the exploration task (Figures 6 and 7). A lower entropy value indicates that the head is focusing on specific agents, with an entropy of 0 indicating attention focused on one agent. In the exploration task for agents 0, 1, 2 and 3, we plot the attention entropy for each agent. In more detail, each agent tends to use a different combination of these four heads, indicating that each agent uses more than one attention head in the exploration process, although their use is not mutually exclusive. This different combination of attention heads is appropriate due to the nature of the exploration task.
Since obstacles appear randomly in the training process and the topography distribution of each part of the map is different, each agent faces various difficulties and gets the independent reward at every time step. In addition, each of the four attention heads uses a separate set of parameters to determine an aggregated contribution from all other agents, which means each agent tends to be influenced differently by other agents, so it is reasonable that each agent uses a different combination of four attention heads.
As shown in Figure 6, each agent mostly uses attention head 2, which indicates that the observation and action information attended to by head 2 assists most in the exploration task. Agent 1, however, requires the main participation of both head 2 and head 1 during the training process. As a result, it is clear that all four attention heads are necessary, owing to their different concerns about the agents' information.
To analyze the impact of the attention mechanism further, we consider the attention entropy of each attention head for the four agents (Figure 7). Similarly, each head focuses on different agents at every time step in the training process and on a different combination of the four agents, which matches the conclusion above. It is clear that each head has a different emphasis on the agents' observation and action information, determined by a specific set of parameters. For instance, heads 0, 1 and 3 prefer to focus more on the information of agent 1 later in the training phase, while head 2 attends roughly equally to all agents. Moreover, each head tends to give a large weight to the information of agent 1, which can also be seen in Figure 6, where all four heads are used heavily by agent 1.
To investigate the correlation between the attention weights and the states of the agents, we further pick a special state from the last training epoch that illustrates the optimization ability of the attention mechanism in MAMECS. The exploration state of the four agents from the last training episode (left) and the corresponding heatmap of the attention weights among the agents (right) are illustrated in Fig. 8. Regions with higher attention weight are lighter in color, and the attention weights of each agent sum to 1 due to normalization.
In general, the attention weight is larger between agents that are closer together, such as agents 0 and 2, or agents 1 and 3. Among agents far from the current agent, however, an agent whose trajectory area the current agent is about to explore receives a higher weight. Specifically, for agent 0, the attention weight on agent 2 is higher than that on agent 1, and both are higher than that on agent 3, which illustrates the effect of the attention mechanism. Agent 2 is closest to agent 0, which leads to the highest weight. Although agent 0 is far from both agent 1 and agent 3, it is about to explore the trajectory area of agent 1, so it pays more attention to the information of agent 1 than to that of agent 3. The multiple attention heads have therefore learned exactly what we expect.
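As a point of contrast for the learned weights, one can compute a purely distance-based attention prior (a softmax over negative pairwise distances); the learned mechanism deviates from this prior exactly in the trajectory-driven cases described above. The layout and function name below are illustrative, not the actual experimental configuration:

```python
import numpy as np

def distance_prior(positions, i, temperature=1.0):
    """Softmax over negative pairwise distances: a purely distance-based
    attention prior for agent i (closer agents get larger weight)."""
    pos = np.asarray(positions, dtype=float)
    others = [j for j in range(len(pos)) if j != i]
    d = np.array([np.linalg.norm(pos[i] - pos[j]) for j in others])
    logits = -d / temperature
    e = np.exp(logits - logits.max())   # numerically stable softmax
    w = e / e.sum()
    return dict(zip(others, w))         # other-agent index -> weight

# illustrative layout: agents 0 and 2 close together, agents 1 and 3 close together
positions = [(0, 0), (8, 8), (1, 0), (9, 8)]
w0 = distance_prior(positions, 0)
print(w0)   # agent 2 dominates; agent 1 (nearer) outweighs agent 3 (farther)
```

Under this prior alone, agent 0's weight on agent 1 would exceed that on agent 3 only because agent 1 is slightly nearer; the learned attention instead boosts agent 1 further because agent 0 is heading into agent 1's explored area.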
5 Conclusion
This paper proposes MAMECS, a multi-head attention-based training policy for the multi-robot exploration task. The key idea is to use a multi-head attention mechanism to select meaningful information from related agents when estimating the critics. Evaluations on the multi-robot exploration task clearly show that the model outperforms the recently proposed MADDPG and COMA approaches: MAMECS obtains higher average rewards and better exploration performance. We also analyze the attention weights to illustrate the function of each attention head.
In future work, we will compare the performance of MAMECS with other baseline methods in the Predator-Prey scenario. We will also increase the number of agents to further highlight the advantage of cooperative ability in multi-agent reinforcement learning systems.
6 Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant Numbers 61751208, 61502510, and 61773390), the Outstanding Natural Science Foundation of Hunan Province (Grant Number 2017JJ1001), and the Advanced Research Program (No. 41412050202).
7 References
 (1) Sheng W, Yang Q, Tan J, et al. Distributed multi-robot coordination in area exploration. Robotics and Autonomous Systems, 2006, 54(12): 945–955.
 (2) Burgard W, Moors M, Stachniss C, et al. Coordinated multi-robot exploration. IEEE Transactions on Robotics, 2005, 21(3): 376–386.
 (3) Yamauchi B. Frontier-based exploration using multiple robots. Agents. 1998, 98: 47–53.
 (4) Faigl J, Kulich M. On benchmarking of frontier-based multi-robot exploration strategies. 2015 European Conference on Mobile Robots (ECMR). IEEE, 2015: 1–8.
 (5) Sharma K R, Honc D, Dusek F, et al. Frontier Based Multi Robot Area Exploration Using Prioritized Routing. ECMS. 2016: 25–30.
 (6) Burgard W, Moors M, Fox D, et al. Collaborative multi-robot exploration. ICRA. 2000: 476–481.
 (7) Colares R G, Chaimowicz L. The next frontier: combining information gain and distance cost for decentralized multi-robot exploration. Proceedings of the 31st Annual ACM Symposium on Applied Computing. ACM, 2016: 268–274.
 (8) Pinto L, Gupta A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016: 3406–3413.
 (9) Chen W, Qu T, Zhou Y, et al. Door recognition and deep learning algorithm for visual based robot navigation. 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO 2014). IEEE, 2014: 1793–1798.
 (10) Shvets A A, Rakhlin A, Kalinin A A, et al. Automatic instrument segmentation in robot-assisted surgery using deep learning. 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018: 624–628.
 (11) Kretzschmar H, Spies M, Sprunk C, et al. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 2016, 35(11): 1289–1307.
 (12) Gu S, Holly E, Lillicrap T, et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017: 3389–3396.
 (13) Kahn G, Villaflor A, Ding B, et al. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018: 1–8.
 (14) Yamauchi B. Frontier-based exploration using multiple robots. Agents. 1998, 98: 47–53.
 (15) Carrillo H, Dames P, Kumar V, et al. Autonomous robotic exploration using occupancy grid maps and graph SLAM based on Shannon and Rényi entropy. 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015: 487–494.
 (16) Mox D, Cowley A, Hsieh M A, et al. Information Based Exploration with Panoramas and Angle Occupancy Grids. Distributed Autonomous Robotic Systems. Springer, Cham, 2018: 45–58.
 (17) Zlot R, Stentz A, Dias M B, et al. Multi-robot exploration controlled by a market economy. Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292). IEEE, 2002, 3: 3016–3023.
 (18) Gabriely Y, Rimon E. Spanning-tree based coverage of continuous areas by a mobile robot. Annals of Mathematics and Artificial Intelligence, 2001, 31(1–4): 77–98.
 (19) Andre T, Bettstetter C. Collaboration in multi-robot exploration: to meet or not to meet?. Journal of Intelligent & Robotic Systems, 2016, 82(2): 325–337.
 (20) Corah M, Michael N. Efficient Online Multi-robot Exploration via Distributed Sequential Greedy Assignment. Robotics: Science and Systems. 2017.
 (21) Corah M, O'Meadhra C, Goel K, et al. Communication-efficient planning and mapping for multi-robot exploration in large environments. IEEE Robotics and Automation Letters, 2019, 4(2): 1715–1721.
 (22) Kretzschmar H, Spies M, Sprunk C, et al. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 2016, 35(11): 1289–1307.
 (23) Kahn G, Villaflor A, Ding B, et al. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018: 1–8.
 (24) Geng M, Zhou X, Ding B, et al. Learning to cooperate in decentralized multi-robot exploration of dynamic environments. International Conference on Neural Information Processing. Springer, Cham, 2018: 40–51.
 (25) Sukhbaatar S, Fergus R. Learning multiagent communication with backpropagation. Advances in Neural Information Processing Systems. 2016: 2244–2252.
 (26) Geng M, Xu K, Zhou X, et al. Learning to cooperate via an attention-based communication neural network in decentralized multi-robot exploration. Entropy, 2019, 21(3): 294.
 (27) Liu S, Geng M, Xu K. Learning to Communicate Efficiently with Group Division in Decentralized Multi-agent Cooperation. 2019 IEEE International Conference on Service-Oriented System Engineering (SOSE). IEEE, 2019: 313–316.
 (28) Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems. 2017: 6379–6390.
 (29) Foerster J N, Farquhar G, Afouras T, et al. Counterfactual multi-agent policy gradients. Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
 (30) Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems. 2017: 6379–6390.
Appendix A
Since there is no known way to find optimal solutions to the problem of fully covering a square with a minimum number of equal-radius circles, we propose a theorem for achieving a higher coverage ratio.
Theorem 1
An arrangement of circles tangent to each other does not necessarily lead to the maximum coverage ratio.
Proof A.1
We prove this theorem by example. Consider the coverage problem for four circles, where the coverage ratio is defined as the ratio of the circles' union area to the area of their circumscribed square. The circles are tangent to the edges of the circumscribed square. The following example illustrates four identical circles of radius $r$ arranged in two patterns.
For the arrangement of four circles tangent to each other shown in Fig. A.9(a), the tangency and symmetry relations give a square of side $4r$ and non-overlapping circles, so the union area is

(15) $S_1 = 4\pi r^2$

So the coverage ratio is $R_1 = S_1 / (4r)^2 = \pi/4 \approx 0.785$.
For the other arrangement, in which the four circles intersect in the pattern shown in Fig. A.9(b), the union area and corresponding coverage ratio follow as

(16)

So the coverage ratio $R_2$ is higher than the coverage ratio $R_1$, which means that an arrangement of circles tangent to each other does not necessarily lead to the maximum coverage ratio. The theorem is therefore proved by this example.
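The coverage ratio of the tangent arrangement can be checked numerically. The sketch below assumes the 2×2 layout implied by the tangency conditions in the proof (four equal circles tangent to each other and to the square's edges) and estimates the ratio by Monte Carlo sampling; the function name is ours.

```python
import math
import random

def coverage_ratio_2x2(radius=1.0, samples=200_000, seed=0):
    """Monte Carlo estimate of the coverage ratio of four equal circles
    arranged 2x2, tangent to each other and to the circumscribed square."""
    rng = random.Random(seed)
    side = 4 * radius                       # square side for the tangent layout
    centers = [(radius, radius), (3 * radius, radius),
               (radius, 3 * radius), (3 * radius, 3 * radius)]
    hits = 0
    for _ in range(samples):
        x, y = rng.uniform(0, side), rng.uniform(0, side)
        # point is covered if it falls inside any of the four circles
        if any((x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
               for cx, cy in centers):
            hits += 1
    return hits / samples

est = coverage_ratio_2x2()
print(est, math.pi / 4)   # estimate should be close to pi/4 ~ 0.7854
```

Because the tangent circles do not overlap, the exact ratio is 4πr²/(4r)² = π/4, and the estimate converges to it as the sample count grows.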