Learning to cooperate is crucially important in multi-agent reinforcement learning. The key is to take the influence of other agents into consideration when performing distributed decision making. However, the multi-agent environment is highly dynamic, which makes it hard to learn abstract representations of influences between agents from only the low-order features that existing methods exploit. In this paper, we propose a graph convolutional model for multi-agent cooperation. The graph convolution architecture adapts to the dynamics of the underlying graph of the multi-agent environment, where the influence among agents is captured by their abstract relation representations. High-order features extracted by relation kernels of convolutional layers from gradually increased receptive fields are exploited to learn cooperative strategies. The gradient of an agent not only backpropagates to itself but also to other agents in its receptive fields to reinforce the learned cooperative strategies. Moreover, the relation representations are temporally regularized to make the cooperation more consistent. Empirically, we show that our model enables agents to develop more cooperative and sophisticated strategies than existing methods in jungle and battle games and routing in packet switching networks.
Cooperation is a widespread phenomenon in nature, from viruses, bacteria, and social amoebae to insect societies, social animals, and humans [Melis and Semmann2010]. Humans exceed all other species in the range and scale of cooperation. Therefore, for general artificial intelligence, it is crucially important to learn how to cooperate in multi-agent environments.
Deep reinforcement learning (RL) has shown human-level performance in games such as Atari games [Mnih et al.2015] and Go [Silver et al.2017]. However, motivated only by self-interest, individual agents usually ignore the benefit of other agents, damaging the common goal in multi-agent environments. Moreover, independent RL treats other agents as part of the environment, and thus the environment becomes unstable from the perspective of individual agents as the strategies of other agents change during training. In addition, the learned policy can easily overfit to the policies of other agents [Lanctot et al.2017], failing to generalize sufficiently during execution.
The key to multi-agent reinforcement learning (MARL) is taking the influence of other agents into consideration when performing distributed decision making. Currently, most MARL algorithms are designed based on this intuition. MADDPG [Lowe et al.2017] directly trains a centralized critic that receives the observations and actions of all agents. Algorithms based on communication, such as CommNet [Sukhbaatar, Fergus, and others2016], BiCNet [Peng et al.2017], and ATOC [Jiang and Lu2018], convey encoded observations and action intentions by information sharing among agents. Mean field [Yang et al.2018] represents the influence by the mean action of neighboring agents. However, due to the highly dynamic multi-agent environments, existing methods that only explore low-order features fail to learn abstract representations of influences between agents, restraining the range and scale of cooperation. That is to say, existing methods are insufficient to make agents understand their mutual interplay and form a relatively broad view of the environment.
In this paper, we capture the influence between agents by their relation. The intuition is that individuals in games or social networks are related to each other, and agents and their relations can be represented by a graph. Unlike low-dimensional regular grids such as images, the agent graph is irregular and dynamic, even lying on non-Euclidean domains, which makes it challenging to extract features. Inspired by convolution in domains such as social networks [Kipf and Welling2017], protein structure [Duvenaud et al.2015], and 3D point clouds [Charles et al.2017], we apply convolution operations to the graph of agents for cooperative tasks, where each agent is a node, each node connects to its neighbors, and the local observation of an agent is the attribute of its node. By using multi-head attention [Vaswani et al.2017] as the convolution kernel, graph convolution is able to extract relation representations, and features from neighboring nodes can be integrated just like the receptive field of a neuron in a normal convolutional neural network (CNN). High-order features extracted from gradually increased receptive fields are exploited to learn cooperative strategies. The gradient of an agent not only backpropagates to itself but also to other agents in its receptive fields to reinforce the learned cooperative strategies. Moreover, the relation representations are temporally regularized to make cooperation more consistent.
Our graph convolutional model, called DGN, is instantiated as an extension of the deep Q network and trained end-to-end, adopting the paradigm of centralized training and distributed execution. DGN abstracts the influence between agents by relation kernels, extracts latent features by convolution, and induces consistent cooperation by temporal relation regularization. Moreover, as DGN shares weights among all agents, it is easy to scale and better suited to large-scale MARL. We empirically show the learning effectiveness of DGN in jungle and battle games and routing in packet switching networks. DGN agents are shown to develop more cooperative and sophisticated strategies than existing methods. To the best of our knowledge, this is the first time that graph convolution has been successfully applied to MARL.
MADDPG [Lowe et al.2017] and COMA [Foerster et al.2018] are extensions of the actor-critic model for multi-agent environments, where MADDPG is designed for mixed cooperative-competitive environments and COMA is proposed to solve multi-agent credit assignment in cooperative settings. A centralized critic that takes as input the observations and actions of all agents is used in MADDPG and COMA. Zhang et al. [2018] consider networked critics that are updated via communication. However, all three models have to train an independent policy network for each agent, which tends to learn a policy specialized to specific tasks and easily overfits to the number of agents.
Several models have been proposed to learn multi-agent cooperation by communication. These models are end-to-end trainable by backpropagation. CommNet [Sukhbaatar, Fergus, and others2016] uses continuous communication for fully cooperative tasks. At a single communication step, each agent sends its hidden state as the message to the communication channel, and then the averaged message from other agents is fed into the next layer. BiCNet [Peng et al.2017] uses a recurrent neural network (RNN) as the communication channel to connect each individual agent’s policy and value networks. ATOC [Jiang and Lu2018] enables agents to learn dynamic communication with nearby agents using an attention mechanism. A bidirectional LSTM integrates the hidden thoughts of agents participating in a communication group and produces new thoughts for distributed decision making. These communication models show that information sharing does help, and as we will show later, they can be considered special instances of our model.
Most existing models are limited to the scale of dozens of agents, while some consider large-scale MARL. When the number of agents increases, learning becomes hard due to the curse of dimensionality and the exponential growth of agent interactions. Instead of considering the different effects of other individuals on each agent, Mean Field [Yang et al.2018] approximates the effect of other individuals by their mean action. However, the mean action eliminates the difference among these agents in terms of observation and action and thus incurs the loss of important information that helps cooperative decision making.
Many important real-world applications come in the form of graphs, such as social networks, protein-interaction networks, and 3D point clouds. In the last couple of years, several frameworks [Henaff, Bruna, and LeCun2015, Niepert, Ahmed, and Kutzkov2016, Kipf and Welling2017, Velickovic et al.2017] have been designed to extract locally connected features from arbitrary graphs. Typically, the goal is to learn a function of features on graphs. A graph convolutional network (GCN) takes as input the feature matrix that summarizes the attributes of each node and the adjacency matrix, and outputs a node-level feature matrix. The function is similar to the convolution operation in CNNs, where the kernels are convolved across local regions of the input and produce the feature maps.
Learning common sense knowledge is one of the keys to artificial intelligence. However, it has proven difficult for neural networks. Interaction networks aim to reason about objects, relations, and physics in complex systems. Interaction networks predict the future states and underlying properties, which is similar to the way humans think. Several frameworks have been proposed to model the interactions. IN [Battaglia et al.2016] focuses on the binary relations between entities. The model computes the effect of interaction and predicts the next state by taking the interaction into consideration. VIN [Watters et al.2017] predicts the future states from raw visual observations. VAIN [Hoshen2017] models multi-agent relations and predicts the future states with an attention mechanism.
The core idea of relational reinforcement learning (RRL) is to combine RL with relational learning by representing states and policies based on relations. Neural networks can operate on structured representations of a set of entities, non-locally compute interactions on a set of feature vectors, and perform relational reasoning via iterated message passing [Zambaldi et al.2018]. The relation block, multi-head dot-product attention [Vaswani et al.2017], is embedded into neural networks to learn the pairwise interaction representation. High-order interactions between entities can be captured by stacking multiple blocks recurrently or deeply.
DGN is built on the concept of graph convolution. We construct the multi-agent environment as a graph, where agents in the environment are represented by the nodes of the graph, and each node has edges connected to its nearest neighbors (e.g., in terms of distance or other metrics, depending on the environment). The intuition behind this is that nearer neighbors are more likely to interact with and affect each other. Moreover, in large-scale multi-agent environments, it is costly and less helpful to take all agents’ influence into consideration, because receiving a large amount of information requires high bandwidth and incurs high computational complexity, and agents cannot differentiate valuable information from globally shared information [Jiang and Lu2018]. In addition, as convolution can gradually increase the receptive field of an agent, the scope of cooperation is not restricted. Therefore, it is efficient and effective to consider only nearest neighbors. Unlike the static graph considered in GCNs, the graph of the multi-agent environment is continuously changing over time as agents move or enter/leave the environment. Therefore, DGN should be capable of adapting to the dynamics of the graph and learning as the multi-agent environment evolves.
We consider the partially observable environment, where at each timestep $t$ each agent $i$ receives a local observation $o_i^t$, which is the property of node $i$ in the graph, takes an action $a_i^t$, and gets a reward $r_i^t$. DGN consists of three types of modules: observation encoder, convolutional layer, and Q network, as illustrated in Figure 1. The local observation $o_i^t$ is encoded into a feature vector $h_i^t$ by an MLP for low-dimensional input or a CNN for visual input. The convolutional layer integrates the feature vectors in the local region (including node $i$ and its neighbors) and generates the latent feature vector $h_i^{\prime t}$. By stacking more convolutional layers, the receptive field of an agent gradually grows, more information is gathered, and thus the scope of cooperation can also increase. That is, with one convolutional layer, node $i$ can directly acquire feature vectors from the encoders of the nodes in one hop ($K$ neighbors). With two stacked layers, node $i$ can get the output of the first convolutional layer of the nodes in one hop, which contains the information from nodes in two hops. However, more convolutional layers will not increase the local region of node $i$, i.e., node $i$ still only directly receives information from its neighbors. This property is very important as we consider decentralized execution. Details of the convolution kernel are explained in the next section.
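The growth of the receptive field with depth can be illustrated with a small sketch. It is only an illustration under stated assumptions: the chain-shaped neighbor lists and the helper name `receptive_field` are hypothetical and not part of DGN itself.

```python
def receptive_field(neighbors, agent, n_layers):
    """Agents whose encoded observations can reach `agent` after n_layers convolutions."""
    field = {agent}
    for _ in range(n_layers):
        # Each layer pulls in the direct neighbors of everything already reachable.
        field |= {j for i in field for j in neighbors[i]}
    return field


# Hypothetical chain of five agents, each linked to its adjacent agents.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
one_layer = receptive_field(neighbors, agent=0, n_layers=1)   # {0, 1}
two_layers = receptive_field(neighbors, agent=0, n_layers=2)  # {0, 1, 2}
```

Note that in both cases agent 0 only ever exchanges data with agent 1 directly; the two-hop information arrives through agent 1's first-layer output.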
As the number and position of agents vary over time, the underlying graph continuously changes, which brings difficulties to graph convolution. To address the issue, we merge all agents’ feature vectors at time $t$ into a feature matrix $F^t$ with size $N \times L$ in the order of index, where $N$ is the number of agents and $L$ is the length of the feature vector. Then, we construct an adjacency matrix $C_i^t$ with size $(K+1) \times N$ for agent $i$, where the first row is the one-hot representation of node $i$’s index, and the $j$th row, $j = 2, \ldots, K+1$, is the one-hot representation of the index of the $(j-1)$th nearest neighbor. Then, we can obtain the feature vectors in the local region of node $i$ by $C_i^t \times F^t$.
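The one-hot construction above amounts to a row gather from the feature matrix. A minimal plain-Python sketch, assuming a toy setting with four agents, feature length two, and two neighbors; the helper names and all numbers are illustrative only.

```python
def one_hot(index, n):
    """Return a length-n one-hot row with a 1 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(n)]


def build_adjacency(agent, neighbors, n_agents):
    """Stack one-hot rows: the agent itself first, then its neighbors from nearest to farthest."""
    return [one_hot(agent, n_agents)] + [one_hot(j, n_agents) for j in neighbors]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical feature matrix: 4 agents, feature length 2, rows ordered by agent index.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Agent 2 with two nearest neighbors (agents 3 and 0, in that order).
adjacency = build_adjacency(agent=2, neighbors=[3, 0], n_agents=4)
local_features = matmul(adjacency, features)  # rows of agents 2, 3, 0
```

Because the adjacency matrix is rebuilt at every timestep, the same multiplication works no matter how the neighbor set changes.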
Inspired by DenseNet [Huang et al.2017], for each agent, the features of all the preceding layers are merged and fed into the Q network, so as to assemble and reuse the observation representation and features from different receptive fields, which respectively have distinctive contributions to the strategy that takes the cooperation at different scopes into consideration. The Q network selects the action that maximizes the Q-value with a probability of $1 - \epsilon$ or acts randomly with a probability of $\epsilon$. The gradient of the Q-loss of each agent will backpropagate not only to itself and its neighbors but also to other agents in its receptive fields. That is to say, the agent not only focuses on maximizing its own expected reward but also considers how its policy affects other agents, and hence agents are enabled to learn cooperation. Moreover, each agent receives the encoding of observations and intentions of influential agents, which makes the environment more stable from the perspective of an individual agent.
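The exploration rule described above is the usual epsilon-greedy scheme. A minimal sketch follows; the function name and the example Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]                             # hypothetical Q-values for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # epsilon = 0 always picks the best action
```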
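In DGN, all agents share weights, which significantly reduces the number of parameters. However, this does not prevent the emergence of complex cooperative strategies, as we will show in the experiments. We adopt the paradigm of centralized training and distributed execution. During training, at each timestep, we store the tuple $(O, A, O', R, C)$ in the replay buffer $\mathcal{B}$, where $O = \{o_1, \ldots, o_N\}$, $A = \{a_1, \ldots, a_N\}$, $O' = \{o_1', \ldots, o_N'\}$, $R = \{r_1, \ldots, r_N\}$, and $C = \{C_1, \ldots, C_N\}$. Note that we drop time $t$ in the notations for simplicity. Then, we sample a random minibatch of $S$ samples from $\mathcal{B}$ and minimize the loss
$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{S} \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(O_i, a_i; \theta) \right)^2,$$
where $y_i = r_i + \gamma \max_{a'} Q(O_i', a'; \theta')$, $O_i \subseteq O$ denotes the set of observations of all the agents in $i$’s receptive fields, $\gamma$ is the discount factor, and the model is parameterized by $\theta$, with $\theta'$ the parameters of the target network. To make the learning process more stable, we keep $C$ unchanged in two successive timesteps when computing the Q-loss in training. The gradients of the Q-loss of all agents are accumulated to update the parameters. Each agent not only minimizes its own Q-loss but also the Q-loss of the other agents with whom it collaborates. Then, we softly update the target network as
$$\theta' = \beta \theta + (1 - \beta) \theta'.$$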
During execution, all agents share the parameters and each agent only requires the information from its neighbors (e.g., via communication), regardless of the number of agents. Therefore, our model can easily scale and thus is suitable for large-scale MARL.
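The training step can be sketched as a mean squared one-step temporal-difference error averaged over agents and samples, followed by a soft target update. This is a simplified illustration, not the released implementation: Q-values are passed in as plain numbers, and names such as `q_loss`, `soft_update`, and the rate `beta` are assumptions made for the example.

```python
def td_target(reward, next_q_values, gamma):
    """One-step target: reward plus discounted best target-network Q-value."""
    return reward + gamma * max(next_q_values)


def q_loss(samples, gamma):
    """Mean squared TD error over all agents in all sampled transitions.

    Each sample is a list of per-agent tuples
    (q_taken, reward, next_q_values_from_target_network).
    """
    total = 0.0
    for agents in samples:
        per_agent = [(td_target(r, nq, gamma) - q) ** 2 for q, r, nq in agents]
        total += sum(per_agent) / len(agents)
    return total / len(samples)


def soft_update(target_params, params, beta):
    """Move every target parameter a fraction beta toward the online parameter."""
    return [beta * p + (1.0 - beta) * tp for p, tp in zip(params, target_params)]


# Hypothetical minibatch: one transition with two agents.
batch = [[(1.0, 0.0, [2.0, 1.0]), (0.5, 1.0, [0.0, 0.5])]]
loss = q_loss(batch, gamma=0.5)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], beta=0.25)
```

Since all agents share one set of parameters, a single loss value of this form drives the update for every agent at once.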
Convolution kernels integrate the information in the receptive field to extract the latent features among agents. One of the most important properties is that the kernel should be independent of the order of the input feature vectors. The mean operation meets this requirement. If we take the mean operation as the kernel, the model becomes CommNet with a graph convolution architecture. However, without learnable parameters, the mean kernel leads to only a slight performance gain. BiCNet uses a learnable kernel, an RNN. However, the input order of the feature vectors severely impacts the performance of the RNN, though the effect is alleviated by the bidirectional mechanism.
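Adopting the idea from RRL, we use multi-head dot-product attention as the kernel to compute interactions between entities. Unlike RRL, we take each agent rather than each pixel as an entity. For each agent $i$, there is a set of entities $\mathcal{E}_i$ ($K$ neighbors and itself) in the local region. Then, we can compute the interaction between $i$ and $j \in \mathcal{E}_i$ for each independent attention head $m$,
$$\alpha_{ij}^m = \frac{\exp\left(\tau \cdot W_Q^m h_i \cdot (W_K^m h_j)^{\mathsf{T}}\right)}{\sum_{e \in \mathcal{E}_i} \exp\left(\tau \cdot W_Q^m h_i \cdot (W_K^m h_e)^{\mathsf{T}}\right)},$$
where $\tau$ is a scaling factor. $\alpha_{ij}^m$ can be considered as the relation representation between entities. The features of the $M$-head attention are averaged and fed into function $\sigma$ (one-layer MLP with ReLU non-linearities) to produce the output of the convolutional layer,
$$h_i' = \sigma\left(\frac{1}{M} \sum_{m=1}^{M} \sum_{j \in \mathcal{E}_i} \alpha_{ij}^m W_V^m h_j\right).$$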
Figure 2 illustrates the computation of the convolutional layer with relation kernel.
Multi-head attention extracts multiple representations of the relation between individual agents, which makes the kernel independent of the order of input feature vectors and allows the model to jointly attend to information from different representation subspaces for different agents. Moreover, with multiple convolutional layers, high-order relation representations can be extracted, which effectively capture the influence of other agents and greatly help in making cooperative decisions.
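A minimal plain-Python sketch of a relation kernel of this kind is given below. It is an illustration under assumptions: the identity weight matrices and feature values are hypothetical, and a bare ReLU stands in for the one-layer MLP.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [dot(row, x) for row in w]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_head(h_self, h_local, w_q, w_k, w_v, tau):
    """One head: attention weights over the local region and the weighted sum of values."""
    query = matvec(w_q, h_self)
    keys = [matvec(w_k, h) for h in h_local]
    values = [matvec(w_v, h) for h in h_local]
    weights = softmax([tau * dot(query, k) for k in keys])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return weights, mixed


def relation_kernel(h_self, h_local, heads, tau):
    """Average the head outputs, then apply ReLU (standing in for the one-layer MLP)."""
    outputs = []
    for w_q, w_k, w_v in heads:
        outputs.append(attention_head(h_self, h_local, w_q, w_k, w_v, tau)[1])
    mean = [sum(col) / len(outputs) for col in zip(*outputs)]
    return [max(0.0, x) for x in mean]


identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, identity, identity)]  # one head with hypothetical identity weights
h_i, h_a, h_b = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
out_1 = relation_kernel(h_i, [h_i, h_a, h_b], heads, tau=1.0)
out_2 = relation_kernel(h_i, [h_i, h_b, h_a], heads, tau=1.0)  # neighbors reordered
```

Reordering the neighbors leaves the output unchanged up to floating-point rounding, which is the order-independence property the kernel is chosen for.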
As we train our model using deep Q learning, we use the future value estimate as the target for the current estimate. We follow this insight and apply it to the relation kernel in our model. Intuitively, if the relation representation produced by the relation kernel of the upper layer truly captures the abstract relation between surrounding agents and itself, such a relation representation should be stable/consistent for at least a short period of time, even when the state/feature of surrounding agents changes. Since in our relation kernel the relation is represented as the attention weight distribution over the states of surrounding agents, we use the attention weight distribution in the next state as the target for the current attention weight distribution to encourage the agent to form a consistent relation representation. As the relation in different states should be similar but not identical, we use KL divergence to compute the distance between the attention weight distributions in the two states.
It should be noted that we do not use the target network to produce the target relation representation as in normal deep Q learning. This is because the relation representation is highly correlated with the weights of feature extraction, but the update of such weights in the target network always lags behind that of the current network. Since we only focus on the self-consistency of the relation representation based on the current feature extraction network, we apply the current network, instead of the target network, to the next state to produce the new relation representation.
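Let $\mathcal{G}^{\kappa}(O_i; \theta)$ denote the attention weight distribution of relation representations at convolutional layer $\kappa$ for agent $i$, where $O_i \subseteq O$ is the set of observations of the agents in $i$’s receptive fields. Therefore, with temporal relation regularization, the loss is modified as below
$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{S} \frac{1}{N} \sum_{i=1}^{N} \left[ \left( y_i - Q(O_i, a_i; \theta) \right)^2 + \lambda D_{\mathrm{KL}}\left( \mathcal{G}^{\kappa}(O_i; \theta) \,\|\, z_i \right) \right],$$
where $y_i$ is the target value of the Q-loss, $z_i = \mathcal{G}^{\kappa}(O_i'; \theta)$, and $\lambda$ is the coefficient for the regularization loss.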
Temporal relation regularization of the upper layer in DGN helps the agent to form a long-term and consistent action policy in the highly dynamic environment with a lot of moving agents. This further helps agents to form cooperative behavior, since many cooperation tasks need long-term consistent actions of the collaborating agents to get the final reward. We will further analyze this in the experiments.
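The regularization term can be sketched for a single agent as a KL divergence between two attention distributions added to the squared temporal-difference error. The sketch assumes both distributions are over the same ordered set of neighbors; the coefficient `lam` and all numbers are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same neighbors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_loss(td_error, attn_now, attn_next, lam):
    """Squared TD error plus lam times the KL term between attention distributions."""
    return td_error ** 2 + lam * kl_divergence(attn_now, attn_next)


# Attention in the current state versus the next state (computed by the current network).
same = regularized_loss(0.5, [0.7, 0.2, 0.1], [0.7, 0.2, 0.1], lam=0.03)
drift = regularized_loss(0.5, [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], lam=0.03)
```

When the attention pattern is unchanged the penalty vanishes and only the squared error remains; a shift of attention to a different neighbor raises the loss.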
For the experiments, we adopt a large-scale gridworld platform MAgent [Zheng et al.2017]. In the environment, each agent corresponds to one grid and has a local observation that contains a square view of grids centered at the agent and its own coordinates. The discrete actions are moving or attacking. Two scenarios, jungle and battle, are considered to investigate the cooperation among agents. Also, we build an environment, routing, that simulates the routing in packet switching networks to verify the applicability of our model in real-world applications. These three scenarios are illustrated in Figure 3. The hyperparameters of DGN in the three scenarios are summarized in Table 1. In the experiments, we compare DGN with DQN, CommNet, and Mean Field Q-learning (MFQ). For fair comparison, all the models have a similar parameter scale. MADDPG is not considered as a baseline, because training an independent policy network for each agent makes MADDPG infeasible in large-scale scenarios. More importantly, most real-world applications are open systems, i.e., agents come and go as in battle, and thus it is impossible to train a model for every new agent. Please refer to the video at https://goo.gl/AFV9qi for more details about the experiments, and the code of DGN is available at https://github.com/PKU-AI-Edge/GraphConv4MARL.git/.
|# neighbors ($K$)|
|# convolutional layers|
|# attention heads|
|# encoder MLP layers|
|# encoder MLP units|
This scenario is a moral dilemma. There are agents and foods in the field, where foods are fixed. An agent gets a positive reward by eating food, but gets a higher reward by attacking another agent. At each timestep, each agent can move to or attack one of four neighboring grids. The reward is for moving, for attacking (eating) the food, for attacking another agent, for being attacked, and for attacking a blank grid (inhibiting excessive attacks). This experiment examines whether agents can learn the strategy of collaboratively sharing resources rather than attacking each other.
We trained all the models with the setting of and for episodes. Figure 3(a) shows their learning curves, where DGN-M is graph convolution with mean kernel, and each model is with three training runs. Table 2 shows the mean reward (averaged over all agents and timesteps) and number of attacks between agents (averaged over all agents) over test runs, each game unrolled with timesteps.
DGN outperforms all the baselines during training and test in terms of mean reward and number of attacks between agents. It is observed that DGN agents can properly select the close food and seldom hurt each other, and the food can be allocated rationally by the surrounding agents, as shown in Figure 4(a). Moreover, attacks between DGN agents are far fewer than those of the other models, i.e., and less than DGN-M and MFQ, respectively. Sneak attacks, fierce conflicts, and hesitation are the characteristics of CommNet and DQN agents, as illustrated in Figure 4(b), verifying their failure to learn cooperation. Although DGN-M and CommNet both use the mean operation, DGN-M greatly outperforms CommNet. This is attributed to the graph convolution, which can effectively extract latent features from surrounding agents. Moreover, comparing DGN with DGN-M, we can conclude that the relation kernel that abstracts the relation representation between agents does help to learn cooperative strategies.
We directly apply the trained model with and to the scenario of and . Higher agent density and food shortage make the moral dilemma more complicated. The slight drop in mean reward of all the models is because there is not enough food to supply each agent. DGN maintains the number of attacks, which means agents who cannot obtain the food stay in a waiting state without attacking others. However, agents of MFQ, CommNet, and DQN attack each other more frequently when there are more agents sharing food.
This scenario is a fully cooperative task, where agents learn to fight against enemies who have abilities superior to those of the agents. The agent’s moving or attacking range is the four neighboring grids, whereas the enemy can move to one of the twelve nearest grids or attack one of the eight neighboring grids. Each agent/enemy has six hit points (i.e., it is killed by six attacks). After the death of an agent/enemy, a new agent/enemy is added at a random location. The reward is for attacking the enemy, for being killed, and for attacking a blank grid. The pretrained DQN model built into MAgent takes the role of the enemy. As an individual enemy is much more powerful than an individual agent, an agent has to collaborate with others to develop coordinated tactics to fight enemies. Moreover, as the enemy has six hit points, agents have to cooperate continuously to kill the enemy. Therefore, the task is much more challenging than jungle in terms of learning to cooperate.
We trained all the models with the setting of and for episodes. Figure 3(b) shows the learning curves of all the models in terms of mean reward. DGN converges to a much higher mean reward than the baselines, and its learning curve is more stable. CommNet and DQN first get relatively high reward, but they eventually converge to much lower reward than the others. As observed in the experiment, at the beginning of training, DQN and CommNet learn suboptimal policies such as gathering as a group in a corner to avoid being attacked, since such behaviors generate relatively high reward. However, since the distribution of reward is uneven, i.e., agents at the exterior of the group are easily attacked, DQN and CommNet learn from the “low reward experiences” produced by the suboptimal policy and converge to more passive policies, which lead to much lower reward. We evaluate DGN and the baselines by running test games, each game unrolled with timesteps. Table 3 shows the mean reward, kills, deaths, and kill-death ratio.
DGN agents learn a series of tactical maneuvers, such as encircling and envelopment of a single flank. For a single enemy, DGN agents learn to encircle and attack it together. For a group of enemies, DGN agents learn to move against and attack one of the enemy’s open flanks, as depicted in Figure 4(c). CommNet agents adopt an active defense strategy. They seldom launch an attack but rather run away or gather together to avoid being attacked. DQN agents driven by self-interest fail to learn a rational policy. They are usually forced into a corner and passively react to the enemy’s attack, as shown in Figure 4(d). MFQ agents do not effectively cooperate with each other since there is no gradient backpropagated among agents to reinforce the cooperation during MFQ training.
In DGN, relation kernels can extract the interaction and influence among agents from gradually increased receptive fields, which can be easily exploited to yield cooperation. Moreover, the gradient backpropagation from an agent to other agents in the receptive field enforces the cooperation. Therefore, DGN outperforms other baselines.
DGN with temporal relation regularization, i.e., DGN+R, achieves consistently better performance than DGN, as shown in Figure 3(b) and Table 3. In the experiment, it is observed that DGN+R agents indeed behave more consistently and synchronously with each other, while DGN agents are more likely to be distracted by the new appearance of an enemy or friend nearby and abandon their originally intended trajectory. This results in fewer successful formations encircling a moving enemy, which might need consistent cooperation of agents to move across the field. DGN+R agents often overcome such distraction and show more long-term strategy and aim by moving more synchronously to chase the enemy until they encircle and destroy it. From this experiment, we can see that temporal relation regularization indeed helps agents to form more consistent cooperation.
This scenario is an abstraction of routing in packet switching networks, where the routing protocol tries to optimize the mean delay of data packets by making distributed decision at each router (i.e., by determining only the next hop of a packet at a router). The network consists of routers. Each router is randomly connected to a constant number of routers (three in the experiment), and the network topology is stationary. The bandwidth of each link is the same and set to . There are data packets with a random size between and , and each packet is randomly assigned a source and destination router. If there are multiple packets with the sum size larger than , they cannot go through a link simultaneously.
In the experiment, data packets are agents, and they aim to quickly reach the destination while avoiding congestion. At each timestep, the observation of a packet is its own attributes (i.e., current location, destination, and data size), the attributes of cables connected to its current location (i.e., load, length), and neighboring data packets (on the connected cable or routers). It takes some timesteps for a data packet to go through a cable, a linear function of the cable length. The action space of a packet is the choices of next hop. If the link to the selected next hop is overloaded, the data packet will stay at the current router and be punished with a reward . Once the data packet arrives at the destination, it leaves the system and gets a reward and another data packet enters the system with random initialization.
Table 4: Routing. Columns: Floyd, Floyd with BL, DGN+R, DGN, DGN-M, MFQ, CommNet, DQN. Rows: mean reward, delay, # delivered packets.
We trained all the models with the setting of and for episodes. Figure 3(c) shows the learning curves in terms of mean reward. DGN and DGN+R converge to much higher mean reward and more quickly than the baselines. DGN-M and MFQ have similar mean reward at the end, though MFQ converges faster than DGN-M. As expected, DQN performs the worst, which is much lower than others.
We evaluate all the models by running test games, each game unrolled with timesteps. Table 4 shows the mean reward, mean delay of data packets, and number of delivered packets, where the delay of a packet is measured by the timesteps taken by the packet from source to destination. To better interpret the performance of the models, we calculate the shortest path for every pair of nodes in the network using the Floyd algorithm. Then, during test, we directly calculate the mean delay based on the shortest path of each packet, which is (Floyd in Table 4). Note that this delay does not consider the bandwidth limitation (i.e., data packets can go through any link simultaneously). Thus, this is the ideal case for the routing problem. When considering the bandwidth limit, we let each packet follow its shortest path, and if a link is congested, the packet waits at the router until the link is unblocked. The resulting delay is (Floyd with BL in Table 4), which can be considered the practical solution.
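The ideal-case baseline can be sketched with the standard Floyd-Warshall all-pairs shortest-path algorithm. The four-router network, its link delays, and the packet list below are hypothetical and chosen only to keep the example small.

```python
INF = float("inf")


def floyd_warshall(weights):
    """All-pairs shortest path lengths from a square matrix of link delays."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Relax the path i -> j through intermediate router k.
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def mean_delay(dist, packets):
    """Average shortest-path delay over (source, destination) pairs, ignoring bandwidth."""
    return sum(dist[s][d] for s, d in packets) / len(packets)


# Hypothetical 4-router network; entries are link traversal times, INF means no link.
links = [
    [0, 1, 4, INF],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [INF, 6, 3, 0],
]
shortest = floyd_warshall(links)
ideal = mean_delay(shortest, [(0, 3), (0, 2)])
```

The bandwidth-limited variant would additionally make a packet wait whenever its next link is full, which this sketch does not model.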
As shown in Table 4, the performance of DGN-M, MFQ, CommNet, and DQN is worse than Floyd with BL. However, the delay and number of delivered packets of DGN are much better than those of the other models and also better than Floyd with BL. In the experiment, it is observed that DGN agents tend to select the shortest path to the destination and, more interestingly, learn to select different paths when congestion is about to occur. DQN agents cannot learn the shortest path due to myopia and easily cause congestion at some links without considering the influence of other agents. Information sharing indeed helps, as DGN-M, MFQ, and CommNet all outperform DQN. However, they are unable to develop the sophisticated routing protocol that DGN does. DGN+R has slightly better performance than DGN. This is because data packets with different destinations seldom cooperate continuously (sharing many links) along their paths.
To investigate how the traffic pattern affects the performance of the models, we perform the experiments with and , i.e., heavier data traffic, where all the models are retrained. From Table 4, we can see that DGN+R and DGN outperform other models and Floyd with BL. Under heavier traffic, DGN+R and DGN are much better than Floyd with BL, and DGN-M and MFQ are also better than Floyd with BL. The reason is that the strategy of Floyd with BL (i.e., simply following the shortest path) is favorable when traffic is light and congestion is rare, while this does not work well when traffic is heavy and congestion easily occurs. The performance of DGN+R and DGN under various traffic patterns demonstrates the learning effectiveness of our model.
We have proposed a graph convolutional model for multi-agent cooperation. DGN adapts to the dynamics of the underlying graph of the multi-agent environment and exploits convolution with relation kernels to extract relation representations from gradually increased receptive fields. Different orders of abstract relations are exploited to learn cooperative strategies. The gradient of an agent not only backpropagates to itself but also to other agents in its receptive fields to reinforce the learned cooperative strategies. Moreover, the relation representations are temporally regularized to make the cooperation more consistent. Empirically, DGN outperforms existing methods in a variety of cooperative multi-agent environments.
This work was supported in part by Peng Cheng Laboratory and NSFC under grant 61872009. We thank Zhehan Fu for sharing the GPUs for our experiments.
PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR’17, 77–85.