Inner Attention Supported Adaptive Cooperation for Heterogeneous Multi Robots Teaming based on Multi-agent Reinforcement Learning

02/12/2020 ∙ by Chao Huang, et al. ∙ 0

Humans can selectively focus on different information depending on task requirements and on other people's abilities and availability, and can therefore adapt quickly to completely different and complex environments. If robots could obtain the same ability, their adaptability to new and unexpected situations would increase greatly. Recent efforts in heterogeneous multi-robot teaming have tried to achieve this ability, for example through methods based on communication and multi-modal information fusion. However, these methods not only suffer from an exponential explosion as the number of robots increases but also require huge computational resources. To that end, we introduce an inner attention actor-critic method that replicates aspects of flexible human cooperation. By bringing the attention mechanism used in computer vision and natural language processing into the realm of multi-robot cooperation, our method is able to dynamically select which robots to attend to. To test the effectiveness of the proposed method, several simulation experiments have been designed. The results show that the inner attention mechanism enables flexible cooperation and lowers resource consumption in rescue tasks.




1 Introduction

Heterogeneous multi-robot teams have gained great interest in recent decades because of their foreseen benefits over a single robot, such as increased reliability, the ability to resolve task complexity, improved performance, and simpler robot design [1, 2, 3]. They could be widely used in household, industrial, and societal applications, or in situations that are too dangerous for human beings. For example, in natural disaster search-and-rescue tasks [4, 5, 6], a heterogeneous multi-robot team can search a larger area and rescue more victims efficiently. In city traffic control [7, 8, 9], it is much easier and cheaper to deploy a large team of small, simple robots to monitor traffic conditions across the whole city. Moreover, in complex task situations [10, 11, 12], heterogeneous multi-robot teams have been used to resolve task complexity by allocating different subtasks to different kinds of robots.

Figure 1: An illustration of the human-inspired inner attention in the developed IAAC model. Using the inner attention mechanism, UAV1 can selectively attend to other robots when making decisions and flexibly participate in different tasks.

Even though applications of heterogeneous multi-robot teams have seen significant progress, effective deployment and cooperation of heterogeneous robot teams remain challenging. Several factors contribute to this. First, the discrepancy between simulation and the real environment in terms of dynamics and perception hinders simulation-to-reality transfer [13, 14, 15]. For example, real-world constraints such as obstacle distribution, terrain conditions, and system reliability can cause unexpected issues that undermine a robot team's performance [19, 20, 21]. Second, robots' differing capabilities, availability, and dynamically changing status make it challenging to balance the efficiency of task execution against the effective use of robot capabilities [16, 17, 18]. To cooperate flexibly in complex and challenging environments, robots must be able to deal with a huge number of unexpected situations and cooperate with each other based on selectively chosen information. For example, to rescue trapped victims in a flood disaster, a robot should learn to choose which robots to cooperate with by taking into consideration its own situation and the other robots' availability and capability. Moreover, real-world faults such as motor degradation and sensor failure can make some robots unreliable: they share incorrect information and degrade the team's performance.

To address these challenges, this paper develops an inner attention supported adaptive cooperation (IAAC) method, shown in Figure 1, by imitating the human attention mechanism. The assumption behind IAAC is that if robots are trained to dynamically select which robots to attend to and cooperate with, they can be flexible enough to deal with unknown, dynamic situations. This research makes three main contributions:

  1. A novel attention-supported method for formalizing a general concern-involved cooperation mechanism for heterogeneous robot teaming, which guides flexible cooperation under the limitations of robot capability and availability.

  2. A theoretical analysis of the robustness of flexible multi-robot cooperation, to show that the proposed method is robust to unexpected situations such as sensor failure.

  3. A deep reinforcement learning based simulation framework for evaluating the effectiveness, efficiency, and robustness of robot teaming.

Figure 2: The architecture of IAAC, showing the calculation of Q(o,a) with attention for robot i. The inner attention mechanism determines the attention weights between agents based on the agents' observations and actions. Each agent encodes its observations and actions, sends the encoding to the inner attention mechanism, and receives a weighted sum of the other agents' encodings.

2 Related Work

Heterogeneous multi-robot teams have been investigated in robotics as a way to accomplish complex tasks in a dynamic and unpredictable world. A number of studies aim to show the utility of human-designed heuristic strategies for increasing the performance of multi-robot teams [22, 23, 24]. However, these pre-designed cooperation methods must take all details and situations into account and perform poorly, especially on partially observable tasks. Many works also still require human intervention to achieve complex tasks [25, 26, 27]. When a large number of robots must be deployed, this requires many well-trained human operators, which is expensive and time consuming. In recent decades, some researchers have attempted to combine the strengths of deep reinforcement learning with control policies for robotics applications [28, 29, 30]. Multi-agent reinforcement learning has been shown to be effective for enabling sophisticated and hard-to-design behaviors of individual robots [31, 32, 33, 34, 35]. Even though it has been used in robotics research, multi-agent reinforcement learning is still far from being generically applicable to dynamic environments with complex tasks. For example, [36, 37, 38, 39, 40] formulated swarm systems as a special case of decentralized control and used an actor-critic deep reinforcement learning approach to control a group of cooperative agents, but the approach applies only to homogeneous robots. In [41, 42, 43, 44, 45], distributed task allocation was proposed in which agents can request help from cooperating neighbors, but it also may not be able to deal with heterogeneous agents. The approaches in [46, 47, 48] address heterogeneous multi-agent learning in urban traffic control; however, the proposed deep reinforcement learning approach learns ineffectively in complex traffic conditions.

The attention mechanism has been widely used in computer vision, natural language processing, and text classification [68, 69, 70]. In [49, 50, 51, 52], an attention mechanism was used to cooperate effectively by precisely obtaining the necessary communications from other agents. However, to determine whether communication is necessary for each pair of agents, each agent's local observation and trajectory are both needed, which is inefficient. The IAAC method presented in this paper dynamically selects which agents to pay attention to by using an inner attention mechanism. The intuition comes from the fact that, in many real-world environments, it is beneficial for an agent to know which other agents it should pay attention to. IAAC is therefore able to handle heterogeneous multi-robot teaming in dynamic situations. In addition, by using attention to select the information relevant for estimating critics, our method does not need heavy communication between agent pairs, which makes it more efficient.

There have been many efforts to deal with the possibility that a few robots break down during multi-robot cooperation. In [53, 54, 55, 56, 57], passive healing strategies were used to increase the resilience of multi-robot systems. Although this increases tolerance to faulty robots, passive healing strategies usually require relatively high swarm connectivity and the specification of tolerable speed values, which may be difficult to specify in advance. In [58, 59, 60], a multi-sensor fusion approach was proposed to maintain the integrity and continuity of the robots' localization. Although this approach can increase the fault tolerance of a multi-robot system, it also needs a fault detection and execution unit, which is computationally expensive. Moreover, to the best of our knowledge, there have been relatively few attempts to develop methods that are both effective for flexible cooperation and robust to unexpected real-world factors. With the inner attention mechanism, IAAC can train each robot to be flexible enough to adapt its behavior to the dynamic environment and robust enough to handle real-world failures such as robot breakdown and sensor failure.

Finally, most work on multi-robot cooperation has focused on developing cooperation strategies that accomplish complex tasks efficiently, and few studies have paid attention to other factors such as resource consumption. Even when multi-robot teams with different cooperation strategies rescue the same number of victims in the same period of time, they consume different amounts of resources such as fuel or electrical power. [61, 62, 63, 64, 65] proposed exploration strategies that take information gain and distance cost into account. [63] proposed a novel method that controls each robot based on a utility function combining information gain and distance cost. Even though these methods can improve multi-robot exploration in terms of map quality, exploration time, and information gain, they were evaluated only on homogeneous robots. In this paper, IAAC is applied to a heterogeneous multi-robot team under resource consumption constraints.

3 Inner Attention Supported Adaptive Cooperation

The cooperation flexibility, robustness to sensor failures, and resource efficiency of existing multi-robot teaming methods rarely satisfy the requirements of real-world applications, so improving these three properties is worthwhile. In this section, we introduce the IAAC method from the following aspects: prerequisite notation and formulation, the inner attention mechanism, and theoretical analysis.

3.1 Dynamic Teaming Modeling Using Multi-Agent Actor Critics

To model complex and dynamic relations in heterogeneous robot teaming, we use a multi-agent actor-critic deep reinforcement learning algorithm, an extension of the Markov decision process that has been shown to be effective in guiding multi-agent cooperation [66, 67, 35]. Multi-agent deep reinforcement learning is defined by the number of agents, $N$; the state space, $S$; a set of actions for all agents, $A = \{A_1, \ldots, A_N\}$; a transition probability function over next possible states, $T: S \times A_1 \times \cdots \times A_N \to P(S)$; a set of observations for all agents, $O = \{O_1, \ldots, O_N\}$; and a reward function for each agent, $R_i: S \times A_1 \times \cdots \times A_N \to \mathbb{R}$. We consider the application scenario of multi-agent cooperation as a fully observable environment in which each agent receives an observation, $o_i \in O_i$, that contains the accurate positions and statuses of the other agents. A robot's action selections include exchanging location information, sharing and allocating task goals, maintaining connectivity, etc. Using reinforcement learning to guide cooperation, each robot learns an individual policy function, $\pi_i: O_i \to P(A_i)$, which is a probability distribution over potential cooperation actions. The goal of multi-agent reinforcement learning is to learn an optimal cooperation strategy for each agent that maximizes its expected discounted return,

$$J_i(\pi_i) = \mathbb{E}_{a_1 \sim \pi_1, \ldots, a_N \sim \pi_N,\, s \sim T}\left[\sum_{t=0}^{\infty} \gamma^t\, r_{it}(s_t, a_{1t}, \ldots, a_{Nt})\right],$$

where $\gamma \in [0, 1]$ is the discount factor that determines how much the policy favors immediate reward over long-term gain.
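The expected discounted return weights the reward at step t by the discount factor raised to the power t. As a minimal sketch of that quantity for a single recorded episode (the function name `discounted_return` is hypothetical, and the default of 0.99 is the discount factor reported in the experiment settings), it can be computed as follows:

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma**t * rewards[t] for one agent's episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last step backwards so that each step costs a
    # single multiply-add (Horner's rule for the polynomial in gamma).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three steps with reward 1 and gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A small gamma makes the agent short-sighted, while a gamma close to 1 lets rewards late in the episode contribute almost as much as early ones.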

The actor-critic policy gradient algorithm solves reinforcement learning problems by modeling and optimizing the policy directly. To maximally improve team performance given the current status of all robots, a robot's policy is updated along the gradient

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_\psi(s, a)\right],$$

where $\theta$ denotes the parameters of the policy and $Q_\psi(s, a)$ is a function approximating the expected discounted return. It ameliorates the high variance of policy gradient methods by replacing the original return term in the policy gradient estimator. For each cooperation step, the action value $Q_\psi(s, a)$ for a robot depends on its neighbors' statuses and actions, and is learned by off-policy temporal-difference learning, minimizing the regression loss

$$L_Q(\psi) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(Q_\psi(s, a) - y\right)^2\right],$$

where $y = r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(s')}\left[Q_{\bar{\psi}}(s', a')\right]$, and $Q_{\bar{\psi}}$ is the target $Q$-value function, which is simply an exponential moving average of the past $Q$-functions. $D$ is the experience replay buffer, which stores previous robot cooperation experience so that it can be reused to further reduce the loss. $\bar{\psi}$ and $\psi$ denote the parameters of the target critics and the critics, respectively.
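The temporal-difference regression and the moving-average target critic can be illustrated with scalars. The following is a minimal sketch under stated assumptions, not the training code of this paper: the function names and the smoothing rate `tau` are hypothetical, critic parameters are represented as a flat list of floats, and the maximum-entropy term used during training is left out.

```python
from typing import List


def td_target(reward: float, next_q: float, gamma: float = 0.99,
              done: bool = False) -> float:
    """One-step target y = r + gamma * Q_target(s', a'); no bootstrap at episode end."""
    return reward if done else reward + gamma * next_q


def squared_td_loss(q_values: List[float], targets: List[float]) -> float:
    """Mean squared regression loss over a minibatch sampled from the replay buffer."""
    if not q_values or len(q_values) != len(targets):
        raise ValueError("need two non-empty lists of equal length")
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def soft_update(target_params: List[float], params: List[float],
                tau: float = 0.01) -> List[float]:
    """Exponential moving average: target <- (1 - tau) * target + tau * params."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]


if __name__ == "__main__":
    y = td_target(reward=1.0, next_q=2.0, gamma=0.5)
    print(y, squared_td_loss([1.0, 3.0], [y, 1.0]), soft_update([0.0], [1.0], tau=0.5))
```

Because the target parameters move only a fraction `tau` toward the current critic at each update, the regression target changes slowly, which is the stabilizing effect the moving average is meant to provide.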

In our multi-agent cooperation scenarios, naive application of actor-critic reinforcement learning encounters some limitations, notably non-stationarity: because each agent's policy changes as training progresses, the environment becomes non-stationary from the perspective of any individual agent. This presents learning stability challenges and prevents the straightforward use of experience replay, which is crucial for stabilizing policy gradient methods. Therefore, we use an extended actor-critic framework consisting of centralized training with decentralized execution, which allows the policies to use extra information to ease training, so long as this information is not used at test time. More concretely, in the gradient of the expected return for agent $i$, the action-value function $Q_i(x, a_1, \ldots, a_N)$ is calculated centrally, with the global objective of improving the whole team's performance on the task; it takes as input the actions of all agents, $a_1, \ldots, a_N$, in addition to the agents' statuses $x$, and outputs the $Q$-value for agent $i$. In the simplest case, $x$ could consist of the observations of all agents, $x = (o_1, \ldots, o_N)$; however, additional state information could also be included if available. Given that each robot may have different cooperation requirements, different teammates available, and limited perception capability, each robot executes its cooperation policy in a distributed manner. This learn-centrally, use-distributively scheme supports flexible teaming for heterogeneous robots, in which cooperators and cooperation actions are adjusted dynamically.

3.2 Robot Inner Attention for Team Adaptability Modeling

In the extended actor-critic framework with centralized training and decentralized execution, the critic computes the $Q$-value function for agent $i$ from the observations, $o = (o_1, \ldots, o_N)$, and actions, $a = (a_1, \ldots, a_N)$, of all agents, which brings a lot of redundant information into account. In addition, the joint action space grows exponentially with the number of agents. In real applications, each robot has different cooperation requirements, different teammates available, and limited perception capability. Therefore, to improve the agents' cooperation flexibility, we train the critic for each agent with the ability to selectively pay attention to other agents. That is, each agent is aware of which agents it should pay attention to, rather than simply taking all agents into consideration at every step.

The inner attention mechanism functions in a manner similar to differentiable key-value memory methods. Intuitively, the goal is for each agent to selectively pay attention to other agents during its decision-making process. The shared critic receives the observations, actions, and statuses of all agents and generates Q-values for each agent. Multiple attention heads evaluate the contribution of the other agents' statuses by generating different attention weights for different agents. This paradigm was chosen over other attention-based approaches because it makes no assumptions about the temporal or spatial locality of the inputs, unlike approaches taken in natural language processing and computer vision. With the inner attention mechanism, agents can cooperate flexibly, which makes them more robust to unexpected real-world failures and lets them consume fewer resources when rescuing the same number of victims.

In detail, the embedding function is a two-layer multilayer perceptron (MLP) that takes an agent's observations and actions as input. The embedded information is then fed into the inner attention mechanism to obtain the Q-value function for agent i, which is a function of agent i's embedding as well as the other agents' contributions:



is the rectified linear unit (ReLU), and are the parameters of the critics. As in a query-key system, the inner attention mechanism has shared query (), key (), and value () matrices. Each agent's embedding can be linearly transformed into and separately. The contribution from other agents, , is a weighted sum of the other agents' values:


The attention weight compares the similarity between and , and then passes the similarity value into a softmax function:


The matching is then scaled by the dimensionality of these two matrices to prevent vanishing gradients. In our experiments, we use sets of key-value parameters , each of which gives rise to an aggregated contribution from all other agents to the agent , and we simply concatenate the contributions from all sets of parameters into a single vector. Note that the matrices for extracting queries, keys, and values are shared across all agents, which encourages a common embedding space. Sharing critic parameters between agents is possible because multi-agent value-function approximation is essentially a multi-task regression problem.
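The query-key-value computation described above can be sketched for a single attention head. This is a minimal illustration under stated assumptions rather than the critic used in this paper: the names `attention_contribution`, `w_q`, `w_k`, and `w_v` are hypothetical, the similarity is a dot product divided by the square root of the query dimension (the common scaled dot-product convention), and no nonlinearity is applied to the values.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores: Vector) -> Vector:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_contribution(i: int, embeddings: List[Vector], w_q: Matrix,
                           w_k: Matrix, w_v: Matrix) -> Tuple[Vector, Vector]:
    """Return (attention weights, weighted sum of values) for agent i over agents j != i."""
    others = [j for j in range(len(embeddings)) if j != i]
    if not others:
        raise ValueError("attention needs at least one other agent")
    query = matvec(w_q, embeddings[i])
    keys = [matvec(w_k, embeddings[j]) for j in others]
    values = [matvec(w_v, embeddings[j]) for j in others]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    contribution = [sum(a * v[d] for a, v in zip(weights, values))
                    for d in range(len(values[0]))]
    return weights, contribution


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(attention_contribution(0, embeddings, eye, eye, eye))
```

With identity projections, agent 0 places more weight on the agent whose embedding points in the same direction as its own, which is the selective behavior the mechanism is meant to provide. Running several heads amounts to calling the function with different projection matrices and concatenating the returned contributions.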

Now, we describe our reward functions, which encourage the agents to cooperate in dynamic environments. During learning, agents receive rewards based on their behaviour. At each time step, the agent obtains its own observation and the contribution from other agents, and it is likely to execute the action with the highest reward. To describe the reward function accurately, we first state our expectations for the agents in the cooperation tasks. Each agent is expected to avoid collisions with other agents and obstacles in the environment and to cooperate with other agents to rescue victims. In other words, behavior we encourage is rewarded positively, while behavior we wish the agents to avoid is rewarded negatively. At time step t, each agent therefore seeks a policy that reaches the expected goals. The reward function for each robot is as follows:


Here, the reward is the combination of three aspects: the reward from interacting with the environment, the penalty for collisions with other robots or walls, and the step cost per rescued victim.


which represents that the agent's expected action is to rescue the closest victim and cooperate with proper candidates; is the set of victims to be rescued and robots to cooperate with, respectively, according to the expected cooperation.


implies that the agent should avoid collisions with obstacles, and is the indicator function indicating whether collides with a wall or other agents. is used to take resource consumption into account; it is the average number of steps needed to rescue one victim. Given that the expected goals are represented by reward functions, the inner attention mechanism can facilitate training by maintaining the flatness of minima and the existence of good sharp minima. In addition, when the inner attention mechanism is used, the number of linear regions is reduced, leading to a simpler loss landscape, while the approximation error remains small. This leads to lower sample complexity for achieving a desired prediction error. Therefore, the inner attention mechanism can facilitate learning and make it easier for the agents to learn the expected goals.

3.3 Theoretical analysis

To explain simply why the inner attention mechanism works, we consider a two-layer ReLU neural network with the inner attention mechanism, which is the same as the latter part of our critic network. The weights of the first layer are denoted as , the weights of the second layer as , and the ReLU function is represented by . The output of the two-layer neural network, when the input is , can then be written as:


Here, we show that when the inner attention mechanism is used, the number of linear regions is reduced, leading to lower sample complexity for achieving a desired prediction error.

Following [71], first assume the sparsity of the attention weights is ; then all the inputs of the neural network with corresponding attention weight will be inactive, and we can omit these inactive inputs. Assume the size of the hidden layer is . Then we can split the units into groups, with units in each group. That means different groups can represent the active inputs. In every group, for example in group , we can assign and choose the neural network layer's parameters for as:


Here we denote to be a vector with entry's value equal to 1 and all other entries assigned to 0. In the second layer, we assign , where is a well-designed vector, corresponding to to in each group. The designed network then has linear regions inside each group, given by the intervals:


Each of these intervals has a subset that is mapped by onto the interval . Therefore, the total number of linear regions is lower bounded by . The number of linear regions is thus bounded, which leads to a simpler neural network landscape. We can therefore improve data efficiency by adopting the inner attention mechanism; that is, we can achieve a desired prediction error with fewer training samples.

Second, to analyze the flatness properties of minima, we consider a minimum satisfying that for . For any has an infinite volume, and for any , we can find a stationary point such that the largest eigenvalue of is larger than . To analyze this, we define a scale transformation such that:


All the key, query, and value matrices remain the same. Then we know the Jacobian determinant for . Since , as we assign , the Jacobian determinant goes to infinity, and the volume of goes to infinity. For the Hessian matrix, we still assume a positive diagonal element in . Similarly, we have the Frobenius norm:


is lower bounded by . When we choose sufficiently small , the largest eigenvalue of is larger than any constant . Therefore, there exists a stationary point such that the operator norm of the Hessian is arbitrarily large [72].

Moreover, by using the inner attention mechanism, the agents can also be more robust to other agents' failures or broken sensors [73]. Consider that a small perturbation is added to a particular agent 's embedding, such that is changed to while all the other agents' embeddings remain unchanged. We then study how much this perturbation affects the attention weights . For a particular , the


is only changed by one term since:


where we use to denote the value after the perturbation. Therefore, with the perturbed input, each set of will have only one term changed. Furthermore, the changed term is the inner product between and a fixed vector ; although this could be large for some particular in a direction similar to , if the embeddings are scattered enough over the space, the inner products cannot be large for all . Therefore, the change to the next layer will be sparse. For instance, we can prove the sparsity under some distributional assumptions on :

For the perturbation part, the expected value , where is a fixed vector. Assume and are -dimensional vectors uniformly distributed on the unit sphere. Then it is easy to derive:


To bound this expectation, we first try to bound . where . Due to the rotation invariance we can obtain:


given that :


This implies . Using Markov's inequality, we can then find the probability results .

Therefore, as long as the norm of is not too large (usually regularized by during training) and the dimension is large enough, there will be a significant number of such that is perturbed negligibly. In contrast, embeddings from RNN-based models are relatively more sensitive to a perturbation of one robot's embedding. Similar to the previous case, we assume an embedding sequence , and an embedding is perturbed by . For the vanilla RNN model, the embeddings are sequentially computed as . If is perturbed, then all the will be altered. Therefore, a sensor failure or a broken robot can more easily influence all the embeddings.
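The sparsity argument rests on a standard property of high-dimensional geometry: for a vector x drawn uniformly from the unit sphere in d dimensions and any fixed unit vector z, the expected squared inner product is exactly 1/d, so most embeddings are nearly orthogonal to a given perturbation direction when d is large. A minimal Monte Carlo sketch of this fact follows; the function names, sample count, and seed are hypothetical choices made for illustration.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Uniform sample on the unit sphere, obtained by normalising a Gaussian vector."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_squared_projection(dim: int, samples: int = 5000, seed: int = 0) -> float:
    """Estimate E[(x . z)^2] for x uniform on the sphere and a fixed unit vector z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = random_unit_vector(dim, rng)
        # By rotation invariance any fixed unit z gives the same expectation,
        # so z is taken to be the first coordinate axis.
        total += x[0] ** 2
    return total / samples


if __name__ == "__main__":
    for dim in (4, 16, 64):
        print(dim, mean_squared_projection(dim), 1.0 / dim)
```

The estimate shrinks in proportion to 1/d, which is why a perturbation of one agent's embedding changes the attention scores of most other agents only slightly when the embedding dimension is large.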

4 Evaluation

In this section, we first introduce our experimental settings. Second, we present the training performance compared with the baseline method, together with the corresponding analysis. We then analyze the robots' dynamic cooperation and robustness. Finally, we analyze the efficiency of the proposed method in terms of resource consumption.

4.1 Experiment Settings

We construct an environment that tests various capabilities of IAAC, and we investigate three main directions. First, we study the method's ability to adapt flexibly to dynamic environments. We hypothesize that, with the inner attention mechanism, robots can cooperate dynamically according to different cooperation requirements and different available teammates. To this end, we implement a cooperative environment with two kinds of tasks, each of which must be accomplished through cooperation between different kinds of robots. This lets us evaluate our approach's ability to cooperate dynamically in a changing environment.

Figure 3: Illustration of the simulated environment. In the flood disaster, there are trapped victims with different injury levels. Victims with a low injury level (Task1) need rescue robots that provide useful information to guide them to safer places, while victims with a high injury level (Task2) need another kind of robot that provides emergency medicine immediately. The main robot team must figure out how to split into sub-teams that can rescue these victims effectively and efficiently.
1.0 1.0
1.5 0.5
1.5 0.5
Table 1: The configurations of robots

The environment in Figure 3 is implemented in the multi-agent particle environment (MPE) framework [66], a simple multi-agent particle world with continuous observations and a discrete action space, along with some basic simulated physics. We found MPE useful for creating experimental environments that involve multiple agents, complex settings, and diverse interactions among agents, while keeping the control and perception problems very simple. In our environment, we use discrete action spaces and a basic physics engine to control agents' movements; because agents' momentum is taken into account, the environment behaves similarly to the real world. The size of the artificial environment is 2 × 2, which accommodates the number of agents needed to test our method but is not so large as to cause inadequate exploration. Positions in the environment are continuous, so an agent can move anywhere on the map as determined by its velocity and acceleration. Each agent can sense the environment and has a communication range covering the whole environment. The goal of the whole system is to search for and rescue as many victims as possible during one episode.

Concretely, the environment involves 6 agents in total: 2 victims with different injury levels and 4 robots. The robots must cooperate with each other to rescue victims with different injury levels. Of the robots, 2 provide living supplies such as food and water, 1 provides victims with useful information such as the locations of safer places, and the remaining robot provides heavily injured victims with medical treatment. Of the victims, one is heavily injured and needs living supplies and medical treatment, delivered through cooperation between different kinds of robots; the other is in good health and needs living supplies and useful information to guide him to a safer place. All robots can see each other's positions relative to their own. In addition to the reward received when a victim is successfully rescued, robots are penalized for colliding with each other. The task therefore contains a mixture of individual rewards and requires different modes of attention depending on the task, environment conditions, and robot abilities. To analyze the properties of our inner attention mechanism, we compare it with a baseline model. In this model, we use uniform attention by fixing the attention weight to be $1/(N-1)$, where $N$ is the number of agents. This restriction prevents the model from focusing its attention on specific agents. Because the only change is fixing the attention weights, all models are implemented with approximately the same total number of parameters.

For training, we use an off-policy, extended actor-critic method for maximum entropy reinforcement learning and train for 25,000 episodes. There are 12 threads to process training data in parallel and a replay buffer to store experience tuples for each time step. The environment is reset after every episode of 100 steps. The policy network and the attention critic network are updated 4 times after the first episode. In detail, we sample 1024 tuples from the replay buffer and update the parameters of the Q-function loss and the policy objective through policy gradients. The Adam optimizer is used with a learning rate of 0.001, and the discount factor is 0.99. The embedded information uses a hidden dimension of 128, and 4 attention heads are used in our attention critics. The performance of each approach is assessed by the average reward per episode and the average number of steps consumed to rescue one victim.
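For reproducibility, the hyperparameters listed above can be gathered in one place. The following is a minimal sketch of such a configuration object; the class name, field names, and the helper method are hypothetical, while the values are the ones reported in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Training settings reported in the text; field names are illustrative."""
    episodes: int = 25_000        # total training episodes
    episode_length: int = 100     # steps before the environment is reset
    rollout_threads: int = 12     # parallel threads processing training data
    batch_size: int = 1024        # tuples sampled from the replay buffer per update
    updates_per_cycle: int = 4    # updates of the policy and attention critic
    learning_rate: float = 1e-3   # Adam step size
    gamma: float = 0.99           # discount factor
    hidden_dim: int = 128         # embedding dimension
    attention_heads: int = 4      # heads in the attention critic

    def total_env_steps(self) -> int:
        """Environment steps if every episode runs to its full length, counted once."""
        return self.episodes * self.episode_length


if __name__ == "__main__":
    config = TrainingConfig()
    print(config.total_env_steps())
```

Freezing the dataclass prevents a run from silently changing a setting halfway through training, which keeps logged configurations trustworthy.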

As shown in Figure 4 (left), the proposed method with the inner attention mechanism is competitive with the uniform attention weights method; that is, the two methods can rescue the same number of victims in the same time. However, Figure 4 (right) shows that the proposed method is more efficient: it takes fewer steps to rescue the same number of victims as the uniform attention weights method. In the next subsections, we investigate the proposed method in detail along three main directions. First, we study its ability to adapt to task varieties. Second, we evaluate the agents' robustness, especially when robots in the team fail. This scenario is analogous to real-life events such as a sensor failing or breaking. To this end, we modify the task environment by randomly fixing a broken agent's status, position, and actions to zero. Finally, in a real-life multi-robot cooperation system, we need to pay attention not only to developing a cooperation strategy that accomplishes complex tasks efficiently but also to other factors such as resource consumption and distance cost. Therefore, we also analyze the efficiency of the proposed method in terms of the resources (moving steps) consumed per rescued victim.

Figure 4: (Left) Average rewards on multi-robot cooperation. (Right) Average rewards on multi-robot cooperation when resource consumption (moving steps) is taken into consideration. Our model, IAAC, can rescue more victims while consuming the same amount of resources.

4.2 Adapting to Task Varieties

Figure 5: If the robots have been trained with flexible cooperation ability, then in the ideal case, after infinitely many experiments, UAV1 and UAV2 should have equal probability of participating in task1 or task2.
Inner Attention                 | Without Attention
0.526   0.474   0.32 < 3.84     | 0.903   0.097   80.64 > 3.84
0.441   0.559   1.77 < 3.84     | 0.085   0.915   81.39 > 3.84
Table 2: Comparison of UAV participation rates
Figure 6: Attention entropy of each head over the course of training for the four agents in the multi-robot cooperation environment. A lower entropy value indicates that the head focuses on specific agents when making decisions.

To analyze how well our proposed method adapts to task varieties, we designed two kinds of tasks in our simulated environment. In the first task, the victim is heavily injured and needs medical treatment in addition to living supplies. In the second task, the victim is in good health and needs useful information to guide him to a safer place. The robots providing living supplies should therefore learn to cooperate dynamically with the robots providing medical treatment or useful information, depending on the task (the victim's injury level) and the availability of the other robots. For example, to rescue a heavily injured victim, the robot that provides medical treatment should cooperate with the closest robot that provides living supplies, rather than with one farther away, provided that the closest robot is not occupied by another rescue task.

As shown in Figure 5, if the robots can cooperate flexibly with other agents in a dynamic environment, then in the ideal case, after infinitely many simulations, the number of times a robot providing living supplies cooperates with a robot providing medical treatment should equal the number of times it cooperates with a robot providing useful information. To measure the robots' flexibility quantitatively, we calculate the rate at which robots cooperate with each other over 80 episodes as follows:

$$r_{ij} = \frac{N_{ij}}{N_i}$$

where $N_i$ is the total number of victims rescued by robot i, and $N_{ij}$ is the total number of victims rescued by the cooperation of robot i and robot j. In Table 2, we compare the cooperation rates collected from the model trained by our approach and from the baseline method. The robots trained by our method with the inner attention mechanism are more flexible than those trained by the baseline model. As expected, the baseline model's critics use all information non-selectively, while our approach learns which robots to pay more attention to through the inner attention mechanism and compresses that information into a constant-sized vector. Thus, our approach is more flexible and more sensitive to dynamically changing environments.

In Figure 6, we also show the effect of the attention heads during training by measuring the entropy of the attention weights of each robot for each of the four attention heads used in the rescue task. A lower entropy value indicates that the head is focusing on specific agents, with an entropy of 0 indicating attention focused on a single agent. We plot the attention entropy for each of agents 0, 1, 2 and 3 in the rescue task. Since victims appear randomly during training and each agent is at a different position, each agent faces various situations and needs to cooperate reasonably with other agents in a dynamic environment. In addition, each of the four attention heads uses a separate set of parameters to determine an aggregated contribution from all other agents, which means that each agent tends to be influenced differently by the other agents.
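
The entropy measure discussed above can be illustrated with a short sketch. It assumes that the attention weights of one head are a softmax over the other three agents and that entropy is measured with the natural logarithm; neither the function names nor the example scores come from the experiments.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one head's attention weights.

    Zero weights are skipped because they contribute nothing to the sum.
    """
    return sum(-w * math.log(w) for w in weights if w > 0.0)


# Hypothetical scores of one agent's head over its three teammates.
uniform_head = softmax([0.0, 0.0, 0.0])   # attends to everyone equally
focused_head = softmax([50.0, 0.0, 0.0])  # attends almost only to one teammate

print(attention_entropy(uniform_head))  # equals ln(3), the maximum for 3 teammates
print(attention_entropy(focused_head))  # close to 0
```

Under these assumptions the entropy of a head lies between 0 (all weight on one teammate) and ln 3 (uniform weights), which gives a scale for reading curves such as those in Figure 6.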

The cooperation rate of different robots can represent the flexibility of multi-robot cooperation. However, it cannot explain the quality of this flexible cooperation; that is, this metric cannot evaluate whether the cooperation is reasonable. Therefore, we need other metrics to evaluate the quality of the agents' cooperation. To do that, we first need to determine what ideal cooperation is in multi-agent cooperation tasks. As mentioned above, ideal cooperation should take complex conditions into account, such as teammate availability, cooperation with agents of the proper kind, and dynamic situations. To simplify ideal cooperation in multi-agent search and rescue tasks, in this work we define it by two simple rules:

  1. A robot can only cooperate with robots of the proper kind.

  2. A robot should rescue its closest victim only if it is not occupied by other tasks.
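
One possible reading of these two rules can be sketched in code: for each kind of help a victim needs, pick the closest robot of that kind that is not already occupied. The `Robot` class, the kind labels and the function name `ideal_team` are hypothetical and serve only to illustrate the rules, not to reproduce the evaluation procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class Robot:
    name: str
    kind: str            # e.g. "supplies", "medicine" or "information"
    position: tuple      # (x, y) coordinates
    busy: bool = False   # True if occupied by another rescue task


def ideal_team(victim_position, required_kinds, robots):
    """Return {kind: robot} using the closest free robot of each required kind.

    Returns None if some required kind has no free robot.
    """
    team = {}
    for kind in required_kinds:
        # Rule 1: only robots of the proper kind are candidates.
        # Rule 2: occupied robots are excluded before choosing the closest one.
        candidates = [r for r in robots if r.kind == kind and not r.busy]
        if not candidates:
            return None
        team[kind] = min(
            candidates, key=lambda r: math.dist(r.position, victim_position)
        )
    return team


robots = [
    Robot("supply_near", "supplies", (1.0, 0.0), busy=True),
    Robot("supply_far", "supplies", (5.0, 0.0)),
    Robot("medic", "medicine", (0.0, 2.0)),
    Robot("guide", "information", (0.0, 1.0)),
]
# A heavily injured victim needs living supplies and medical treatment.
team = ideal_team((0.0, 0.0), ["supplies", "medicine"], robots)
print({kind: robot.name for kind, robot in team.items()})
```

In this example the nearer supply robot is skipped because it is busy, so the farther one joins the medical robot. A cooperation that deviates from such a choice would count as awkward under this reading.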

Figure 7: Awkward trajectories (red) and expected trajectories (blue) in different situations.

Figure 7 illustrates three situations of awkward cooperation, which are inconsistent with human intuition; all awkward trajectory paths are drawn as red dashed arrows. In situation 1, even though and are all closer to , should choose to cooperate with and rescue , because is closer to than . However, even though is also closer to than , since is already cooperating with , can only cooperate with and rescue . The awkward trajectory paths shown in Figure 7 do not satisfy the above description. Situation 2 is a very easy case, in which is much closer to and is much closer to . Therefore, should choose to cooperate with and should cooperate with . Situation 3 is similar to situation 1, except that both and are closer to . Therefore, the agents' optimal paths are similar to those in situation 1. In all these situations, even though the robots can cooperate with each other and rescue victims, their cooperation strategy is not optimal and consumes more resources.

To measure the robots' cooperation quality quantitatively, we count the number of cooperation cases that are consistent with human intuition and the number that are not. As shown in Table 3, for the method with the inner attention mechanism, after 20 episodes the rates of intuition-consistent cooperation in and are 0.85 (17/20) and 0.85 (17/20), and those in and are 0.90 (18/20) and 0.85 (17/20), which correspond to awkward cooperation rates of 0.15, 0.15, 0.10 and 0.15. For the baseline model without the attention mechanism, after 20 episodes the rates of intuition-consistent cooperation in and are 0.55 (11/20) and 0.55 (11/20), and those in and are 0.65 (13/20) and 0.50 (10/20), which correspond to awkward cooperation rates of 0.45, 0.45, 0.35 and 0.50. Therefore, compared with the baseline method, our approach produces more meaningful cooperation in dynamically changing environments.

Inner Attention   | Without Attention
0.15    0.10      | 0.45    0.35
0.15    0.15      | 0.45    0.5
Table 3: Awkward cooperation rate
Figure 8: The robots choose trajectories that are more efficient in resource consumption (black, fewer moving steps) rather than inefficient trajectories (red, more moving steps).
Figure 9: Average moving steps per episode. As training proceeds, all methods become more efficient in resource consumption. After training is finished, the IAAC method is more efficient than the baseline method.

In addition, to show quantitatively that our proposed method with the inner attention mechanism is more efficient than the baseline model (as illustrated in Figure 8), we calculate the average trajectory distance needed to rescue one victim as follows:

$$\bar{D}_T = \frac{D_T}{N_T}$$

where $\bar{D}_T$ is the average distance cost of rescuing one victim during time T, $D_T$ is the total distance obtained by summing all agents' trajectory lengths over the period T, and $N_T$ is the total number of victims rescued during time T. As Figure 9 shows, after 25,000 episodes of training, the model trained with the inner attention mechanism has a much lower cost than the model trained without the attention mechanism. Late in training, the average distance for the attention model is lower than that for the non-attention model; for example, the average distance cost of the inner attention model is 0.09 lower than that of the model without the inner attention mechanism. However, after 5,000 episodes of training, the model without the attention mechanism is more efficient than the model trained with attention. This is because the attention mechanism increases the complexity of the deep neural network, so the method without attention learns faster than the inner attention model early on and can rescue more victims in the same time. As training proceeds, the efficiency of rescuing victims increases for both methods, since the average distance needed to rescue one victim is much lower at episode 25,000 than at episode 5,000. In conclusion, our approach is more efficient in resource consumption than the baseline method.
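
The average distance per rescued victim described above (the summed trajectory length of all agents divided by the number of victims rescued in the same period) can be computed as in the following minimal sketch. The function names and the example trajectories are illustrative assumptions, not data from the experiments.

```python
import math


def trajectory_length(points):
    """Length of a piecewise-linear path given as a list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def average_distance_per_victim(trajectories, victims_rescued):
    """Total distance travelled by all agents divided by victims rescued."""
    if victims_rescued <= 0:
        raise ValueError("at least one victim must have been rescued")
    total_distance = sum(trajectory_length(t) for t in trajectories)
    return total_distance / victims_rescued


# Two hypothetical agents: one moves 5 units, the other 2 + 2 = 4 units.
trajectories = [
    [(0, 0), (3, 4)],
    [(0, 0), (0, 2), (2, 2)],
]
print(average_distance_per_victim(trajectories, victims_rescued=3))  # 3.0
```

A lower value means the team wastes less movement per rescue, which is the sense in which one model is called more efficient than another here.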

4.3 Adapting to Robot Availability

Figure 10: Robustness: the faulty robot UAV4 will have no influence on other robots’ flexible teaming.

As shown in Figure 10, in addition to enabling flexible cooperation, learning to pay attention to robots selectively makes the team more robust to sensor failures and broken robots, since the remaining normal robots can dynamically choose which robot to cooperate with in each situation. Moreover, in the method section we showed mathematically that the perturbation caused by one robot's failure has only a sparse effect on the attention scores when the input embeddings are scattered enough in the space. When some robots in a team are broken, the faults caused by sensor failures may have undesirable and uncontrollable effects on other teammates. In addition, broken robots may share incorrect information with other members of the team, leading to incorrect cooperative behavior. To measure the robustness of our proposed method, we assume is broken in our simulated environment and fix all of its status values to zero. We then calculate the robots' cooperation rates with each other. In the ideal case, if the robots are robust enough, and will have an equal chance of participating in and . That means each robot can flexibly choose which robot to cooperate with based on its own local situation and is not influenced by other faulty factors. As Table 4 shows, when is broken, the robots trained with the inner attention mechanism have an equal chance of cooperating with and providing living supplies for the victims, whereas the robots trained without the attention mechanism are affected by the broken . Similarly, when is broken, the same results are observed. Therefore, we conclude that our approach is more robust to sensor failures and broken robots than the baseline method.
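
The fault injection used in this experiment, fixing a broken robot's status to zero, can be sketched as follows. This is only an illustration under the assumption that each agent's status is a flat list of numbers; the function name `mask_broken_agents` and the example values are hypothetical.

```python
def mask_broken_agents(observations, broken):
    """Return a copy of the per-agent status vectors with broken agents zeroed.

    observations: list of per-agent lists of floats
    broken: set of agent indices whose status is fixed to zero
    """
    return [
        [0.0] * len(obs) if index in broken else list(obs)
        for index, obs in enumerate(observations)
    ]


# Four hypothetical agents with two status values each; agent 3 is broken.
observations = [[0.2, 0.5], [0.9, 0.1], [0.4, 0.4], [0.7, 0.3]]
masked = mask_broken_agents(observations, broken={3})
print(masked)
```

The healthy agents' inputs are left untouched, so any change in their behavior comes only from how their critics and policies react to the zeroed teammate.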

Inner Attention     0.536   0.464   0.505 < 3.84   | Inner Attention     0.495   0.505   0.009 < 3.84
Without Attention   0.909   0.091   74.6 > 3.84    | Without Attention   0.068   0.932   76.9 > 3.84
Table 4: UAV participation rates in tasks
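
The threshold 3.84 that appears in Tables 2 and 4 matches the 5% critical value of a chi-squared distribution with one degree of freedom. Assuming the reported statistic is a goodness-of-fit test of the observed participation counts against an equal split (an inference from that threshold, not something stated explicitly), it could be computed as in this sketch. The counts below are hypothetical and are not the experimental data.

```python
CRITICAL_VALUE = 3.84  # chi-squared critical value, 1 degree of freedom, 5% level


def participation_rates(count_a, count_b):
    """Fraction of rescues in which each of two robots participated."""
    total = count_a + count_b
    return count_a / total, count_b / total


def chi_squared_equal_split(count_a, count_b):
    """Goodness-of-fit statistic against equal expected participation."""
    expected = (count_a + count_b) / 2
    return ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected


# Hypothetical counts: a balanced team and a strongly skewed team.
for count_a, count_b in [(53, 47), (90, 10)]:
    statistic = chi_squared_equal_split(count_a, count_b)
    balanced = statistic < CRITICAL_VALUE
    print(participation_rates(count_a, count_b), statistic, balanced)
```

A statistic below the threshold means the observed split is statistically compatible with equal participation, while a value above it indicates that one robot is systematically preferred.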

5 Conclusion and Future Work

We presented an inner attention actor-critic method (IAAC) that helps a multi-robot team form flexible teams. It allows the team to cooperate flexibly, improves its robustness to sensor failure, and lowers the resources consumed to rescue one victim. The key idea is to use the inner attention mechanism to select meaningful information from related agents when estimating the critics. Three types of evaluation (adapting to task varieties, adapting to robot availability, and lowering the cost of rescuing victims) were carried out in simulation and show that the model with the inner attention mechanism performs better than the model without attention. With the inner attention mechanism, the robots cooperate with each other more flexibly, remain stable under some real-world faults, and rescue more victims while consuming fewer resources. The robots are encouraged to cooperate selectively with proper robots and discouraged from depending on broken or faulty robots. As a result, the negative influence of misleading information is largely reduced and the robots can cooperate properly. The simulation results for the three evaluations demonstrate the effectiveness of the inner attention mechanism in increasing the flexibility of the robots' cooperation. In future research, we will focus on increasing the number of agents and further highlighting the advantage of cooperation ability in multi-agent reinforcement learning systems.


  • [1] Saribatur, Zeynep G., Volkan Patoglu, and Esra Erdem. ”Finding optimal feasible global plans for multiple teams of heterogeneous robots using hybrid reasoning: an application to cognitive factories.” Autonomous Robots 43.1 (2019): 213-238.
  • [2] Hartman, Benjamin T., Richard D. Tatum, and Matthew J. Bays. ”Heterogeneous sensor-robot team positioning and mixed strategy scheduling.” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
  • [3] Rizk, Yara, Mariette Awad, and Edward W. Tunstel. ”Cooperative Heterogeneous Multi-Robot Systems: A Survey.” ACM Computing Surveys (CSUR) 52.2 (2019): 1-31.
  • [4] Matos, Aníbal, et al. ”Multiple robot operations for maritime search and rescue in euRathlon 2015 competition.” OCEANS 2016-Shanghai. IEEE, 2016.
  • [5] Mouradian, Carla, Sami Yangui, Roch H. Glitho. ”Robots as-a-service in cloud computing: search and rescue in large-scale disasters case study.” 2018 15th IEEE Annual Consumer Communications & Networking Conference (CCNC). IEEE, 2018.
  • [6] Beck, Zoltán, et al. ”Online planning for collaborative search and rescue by heterogeneous robot teams.” (2016).
  • [7] Alotaibi, Ebtehal Turki Saho, and Hisham Al-Rawi. ”Multi-robot path-planning problem for a heavy traffic control application: A survey.” International Journal of Advanced Computer Science and Applications 7.6 (2016): 10.
  • [8] Dai, Xuefeng, Qi Fan, and Dahui Li. ”Research status of operational environment partitioning and path planning for multi-robot systems.” Journal of Physics: Conference Series. Vol. 887. No. 1. IOP Publishing, 2017.
  • [9] Alotaibi, Ebtehal Turki Saho, and Hisham Al-Rawi. ”A complete multi-robot path-planning algorithm.” Autonomous Agents and Multi-Agent Systems 32.5 (2018): 693-740.
  • [10] Carpentier, Justin, et al. ”Multi-contact locomotion of legged robots in complex environments–the loco3d project.” RSS Workshop on Challenges in Dynamic Legged Locomotion. 2017.
  • [11] Chen, Xi, et al. ”Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments.” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
  • [12] Zhang, Xuebo, et al. ”Multilevel humanlike motion planning for mobile robots in complex indoor environments.” IEEE Transactions on Automation Science and Engineering 16.3 (2018): 1244-1258.
  • [13] Chalaki, Behdad, et al. ”Zero-shot autonomous vehicle policy transfer: From simulation to real-world via adversarial learning.” arXiv preprint arXiv:1903.05252 (2019).
  • [14] Hwangbo, Jemin, et al. ”Learning agile and dynamic motor skills for legged robots.” Science Robotics 4.26 (2019): eaau5872.
  • [15] James, Stephen, Andrew J. Davison, and Edward Johns. ”Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task.” arXiv preprint arXiv:1707.02267 (2017).
  • [16]

    Kim, Jaeseok, et al. ”Cleaning tasks knowledge transfer between heterogeneous robots: a deep learning approach.” Journal of Intelligent & Robotic Systems (2019): 1-15.

  • [17] Prorok, Amanda, M. Ani Hsieh, and Vijay Kumar. ”Fast redistribution of a swarm of heterogeneous robots.” Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS). 2016.
  • [18] Saribatur, Zeynep G., Volkan Patoglu, and Esra Erdem. ”Finding optimal feasible global plans for multiple teams of heterogeneous robots using hybrid reasoning: an application to cognitive factories.” Autonomous Robots 43.1 (2019): 213-238.
  • [19] Han, Mingming, et al. ”Research on Arbitrary Formation Control of Multiple Robots in Obstacle Environment.” Journal of Physics: Conference Series. Vol. 1069. No. 1. IOP Publishing, 2018.
  • [20] Liu, Bingqi, et al. ”Design and realize a snake-like robot in complex environment.” Journal of Robotics 2019 (2019).
  • [21] Janis, Atikah, and Abdullah Bade. ”Path planning algorithm in complex environment: a survey.” Transactions on Science and Technology 3.1 (2016): 31-40.
  • [22] Smith, Phillip, et al. ”Data transfer via uav swarm behaviours: Rule generation, evolution and learning.” Journal of Telecommunications and the Digital Economy 6.2 (2018): 35.
  • [23] Kelly, Stephen, and Malcolm I. Heywood. ”Discovering agent behaviors through code reuse: Examples from half-field offense and ms. pac-man.” IEEE Transactions on Games 10.2 (2017): 195-208.
  • [24] Long, Nathan K., et al. ”A Comprehensive Review of Shepherding as a Bio-inspired Swarm-Robotics Guidance Approach.” arXiv preprint arXiv:1912.07796 (2019).
  • [25] Schaefer, Kristin E., et al. ”Assessing multi-agent human-autonomy teams: US Army Robotic Wingman gunnery operations.” Micro-and Nanotechnology Sensors, Systems, and Applications XI. Vol. 10982. International Society for Optics and Photonics, 2019.
  • [26] Gregory, Jason M., et al. ”Enabling Intuitive Human-Robot Teaming Using Augmented Reality and Gesture Control.” arXiv preprint arXiv:1909.06415 (2019).
  • [27] Iqbal, Tariq, and Laurel D. Riek. ”Human-robot teaming: Approaches from joint action and dynamical systems.” Humanoid robotics: A reference (2019): 2293-2312.
  • [28] Gu, Shixiang, et al. ”Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates.” 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017.
  • [29] Kalashnikov, Dmitry, et al. ”Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation.” arXiv preprint arXiv:1806.10293 (2018).
  • [30] Andersena, Rasmus E., et al. ”Self-learning Processes in Smart Factories: Deep Reinforcement Learning for Process Control of Robot Brine Injection.” 29th International Conference on Flexible Automation and Intelligent Manufacturing: Faim 2019. Elsevier, 2019.
  • [31] Foerster, Jakob, et al. ”Learning to communicate with deep multi-agent reinforcement learning.” Advances in neural information processing systems. 2016.
  • [32] Leibo, Joel Z., et al. ”Multi-agent reinforcement learning in sequential social dilemmas.” arXiv preprint arXiv:1702.03037 (2017).
  • [33]

    Foerster, Jakob, et al. ”Stabilising experience replay for deep multi-agent reinforcement learning.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.

  • [34] Okdinawati, Liane, Togar M. Simatupang, and Yos Sunitiyoso. ”Multi-agent reinforcement learning for collaborative transportation management (ctm).” Agent-Based Approaches in Economics and Social Complex Systems IX. Springer, Singapore, 2017. 123-136.
  • [35] Gupta, Jayesh K., Maxim Egorov, and Mykel Kochenderfer. ”Cooperative multi-agent control using deep reinforcement learning.” International Conference on Autonomous Agents and Multiagent Systems. Springer, Cham, 2017.
  • [36] Shinde, Chinmay, et al. ”Distributed Reinforcement Learning Based Optimal Controller For Mobile Robot Formation.” 2018 European Control Conference (ECC). IEEE, 2018.
  • [37] Khan, Arbaaz, et al. ”Graph Policy Gradients for Large Scale Robot Control.” arXiv preprint arXiv:1907.03822 (2019).
  • [38] Raeissi, Masoume M., Nathan Brooks, and Alessandro Farinelli. ”A Balking Queue Approach for Modeling Human-Multi-Robot Interaction for Water Monitoring.” International Conference on Principles and Practice of Multi-Agent Systems. Springer, Cham, 2017.
  • [39] Khan, Arbaaz, et al. ”Collaborative multiagent reinforcement learning in homogeneous swarms.” (2018).
  • [40] Long, Pinxin, et al. ”Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning.” 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
  • [41] Gigliotta, Onofrio. ”Equal but different: Task allocation in homogeneous communicating robots.” Neurocomputing 272 (2018): 3-9.
  • [42] Nikitenko, Agris, et al. ”Task Allocation Methods for Homogeneous Multi-Robot Systems: Feed Pushing Case Study.” Automatic Control and Computer Sciences 52.5 (2018): 371-381.
  • [43] Gigliotta, Onofrio. ”Task Allocation in Evolved Communicating Homogeneous Robots: The Importance of Being Different.” International Conference on Practical Applications of Agents and Multi-Agent Systems. Springer, Cham, 2016.
  • [44] Jha, Devesh K. ”Algorithms for Task Allocation in Homogeneous Swarm of Robots.” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
  • [45] Khaluf, Yara, Seppe Vanhee, and Pieter Simoens. ”Local ant system for allocating robot swarms to time-constrained tasks.” Journal of Computational Science 31 (2019): 33-44.
  • [46] Chen, Yang, et al. ”Air-ground heterogeneous robot system path planning method based on neighborhood constraint.” U.S. Patent No. 10,459,437. 29 Oct. 2019.
  • [47] Artuñedo, Antonio, Raúl M. Del Toro, and Rodolfo E. Haber. ”Consensus-based cooperative control based on pollution sensing and traffic information for urban traffic networks.” Sensors 17.5 (2017): 953.
  • [48] Ma, Yuexin, Dinesh Manocha, and Wenping Wang. ”Autorvo: Local navigation with dynamic constraints in dense heterogeneous traffic.” arXiv preprint arXiv:1804.02915 (2018).
  • [49] Jiang, Jiechuan, and Zongqing Lu. ”Learning attentional communication for multi-agent cooperation.” Advances in neural information processing systems. 2018.
  • [50] Geng, Mingyang, et al. ”Learning to cooperate via an attention-based communication neural network in decentralized multi-robot exploration.” Entropy 21.3 (2019): 294.
  • [51] Capitan, Jesus, et al. ”Decentralized multi-robot cooperation with auctioned POMDPs.” The International Journal of Robotics Research 32.6 (2013): 650-671.
  • [52] Jiang, Jiechuan, and Zongqing Lu. ”Learning attentional communication for multi-agent cooperation.” Advances in neural information processing systems. 2018.
  • [53] Mathews, Nithin, et al. ”Mergeable nervous systems for robots.” Nature communications 8.1 (2017): 1-7.
  • [54] Mathews, Nithin, et al. ”Supervised morphogenesis: Exploiting morphological flexibility of self-assembling multirobot systems through cooperation with aerial robots.” Robotics and autonomous systems 112 (2019): 154-167.
  • [55] Pelc, Andrzej, and David Peleg. ”Broadcasting with locally bounded byzantine faults.” Information Processing Letters 93.3 (2005): 109-115.
  • [56] Saulnier, Kelsey, et al. ”Resilient flocking for mobile robot teams.” IEEE Robotics and Automation letters 2.2 (2017): 1039-1046.
  • [57] Zhang, Haotian, and Shreyas Sundaram. ”Robustness of information diffusion algorithms to locally bounded adversaries.” 2012 American Control Conference (ACC). IEEE, 2012.
  • [58] Wang, Hongling, et al. ”Information-fusion based robot simultaneous localization and mapping adapted to search and rescue cluttered environment.” 2017 18th International Conference on Advanced Robotics (ICAR). IEEE, 2017.
  • [59]

    Rostami, Vahid, et al. ”Localization and Navigation Omni-directional Robots based on Sensors Fusion and Particle Filter.” 2018 9th Conference on Artificial Intelligence and Robotics and 2nd Asia-Pacific International Symposium. IEEE, 2018.

  • [60] Abci, Boussad, et al. ”Multi-Robot Autonomous Navigation System Using Informational Fault Tolerant Multi-Sensor Fusion with Robust Closed Loop Sliding Mode Control.” 2018 21st International Conference on Information Fusion (FUSION). IEEE, 2018.
  • [61] Zlot, Robert, et al. ”Multi-robot exploration controlled by a market economy.” Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292). Vol. 3. IEEE, 2002.
  • [62] Visser, Arnoud, and Bayu A. Slamet. ”Balancing the information gain against the movement cost for multi-robot frontier exploration.” European Robotics Symposium 2008. Springer, Berlin, Heidelberg, 2008.
  • [63] Colares, Rafael Gonçalves, and Luiz Chaimowicz. ”The next frontier: combining information gain and distance cost for decentralized multi-robot exploration.” Proceedings of the 31st Annual ACM Symposium on Applied Computing. 2016.
  • [64] Visser, Arnoud, and Bayu A. Slamet. ”Including communication success in the estimation of information gain for multi-robot exploration.” 2008 6th International symposium on modeling and optimization in mobile, ad hoc, and wireless networks and workshops. IEEE, 2008.
  • [65] Gautam, Avinash, et al. ”Cluster, allocate, cover: An efficient approach for multi-robot coverage.” 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2015.
  • [66] Lowe, Ryan, et al. ”Multi-agent actor-critic for mixed cooperative-competitive environments.” Advances in neural information processing systems. 2017.
  • [67] Patkin, M. L., and G. N. Rogachev. ”Construction of multi-agent mobile robots control system in the problem of persecution with using a modified reinforcement learning method based on neural networks.” IOP Conference Series: Materials Science and Engineering. Vol. 312. No. 1. IOP Publishing, 2018.
  • [68] Vaswani, Ashish, et al. ”Attention is all you need.” Advances in neural information processing systems. 2017.
  • [69] Yang, Zichao, et al. ”Hierarchical attention networks for document classification.” Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 2016.
  • [70]

    Wang, Fei, et al. ”Residual attention network for image classification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

  • [71] Montufar, Guido F., et al. ”On the number of linear regions of deep neural networks.” Advances in neural information processing systems. 2014.
  • [72] Dinh, Laurent, et al. ”Sharp minima can generalize for deep nets.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.
  • [73] Hsieh, Yu-Lun, et al. ”On the robustness of self-attentive models.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.