1 Introduction
A common difficulty of reinforcement learning in a multi-agent environment is that, in order to achieve successful coordination, agents require information about the relevance of environment objects to themselves and to other agents. For example, in the game of Pommerman it is important to know how relevant bombs placed in the environment are for teammates, e.g. whether or not the bombs can threaten them. While such information can be handcrafted into the state representation for well-understood environments, in lesser-known environments it is preferable to derive it as part of the learning process.
In this paper, we propose a novel method, named MAGNet, to learn such relevance information in the form of a relevance graph and to incorporate it into the reinforcement learning process. Furthermore, we propose the use of message generation techniques over this graph, inspired by the NerveNet architecture [18]. NerveNet was introduced in the context of robot locomotion, where it was applied to a graph of connected robot limbs. MAGNet uses a similar approach, but bases the message generation on the learned relevance graph.
We applied MAGNet to the popular Pommerman [11] multi-agent environment, and achieved significantly better performance than a baseline heuristic method and state-of-the-art RL techniques including DQN [12], MADDPG [9] and MCTS [5]. Additionally, we empirically demonstrate the effectiveness of the self-attention, graph sharing and message generation modules with an ablation study.

2 Deep Multi-Agent Reinforcement Learning
In this section we describe the state-of-the-art (deep) reinforcement learning techniques that have been applied to multi-agent domains. The algorithms introduced below (DQN, MCTSNet, and MADDPG) were also used as evaluation baselines in our experiments.
The majority of work in the area of reinforcement learning applies a Markov Decision Process (MDP) as a mathematical model [13]. An MDP is a tuple $(S, A, T, r)$, where $S$ is the state space, $A$ is the action space, $T(s, a, s')$ is the probability that action $a$ in state $s$ will lead to state $s'$, and $r(s, a, s')$ is the immediate reward received when action $a$ taken in state $s$ results in a transition to state $s'$. The problem of solving an MDP is to find a policy (i.e., a mapping from states to actions) which maximises the accumulated reward. When the environment dynamics (transition probabilities and reward function) are available, this task can be solved using policy iteration [2]. The problem of solving a multi-agent MDP is to find a set of policies $(\pi_1, \dots, \pi_N)$ that maximizes the expected reward $\mathbb{E}_{s \sim \rho^\pi}[r_i(s, \pi_i(s))]$ for every agent $i$, where $\rho^\pi$ is the distribution of states visited with policy $\pi$.
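When the dynamics are known, policy iteration alternates exact policy evaluation with greedy improvement. The following is a minimal sketch for a small finite MDP; the matrix representation and function name are illustrative choices, not part of the original formulation:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Solve a small MDP by policy iteration.

    P: transition probabilities, shape (S, A, S) -- P[s, a, s'] (assumed known).
    R: expected immediate rewards, shape (S, A).
    Returns a deterministic policy (one action per state) and its value function.
    """
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V as a linear system.
        P_pi = P[np.arange(n_states), policy]          # (S, S)
        R_pi = R[np.arange(n_states), policy]          # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. the one-step lookahead.
        Q = R + gamma * P @ V                          # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

Because the evaluation step is exact, the loop terminates once the greedy policy stops changing.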
2.1 Deep QNetworks
Q-learning is a value iteration method that tries to predict future rewards from the current state and an action. The algorithm applies so-called temporal-difference updates to propagate information about the values of state-action pairs $Q(s, a)$. After each transition $(s_t, a_t) \rightarrow (s_{t+1}, r_{t+1})$ in the environment, it updates the state-action values by the formula:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. The update modifies the value of taking action $a_t$ in state $s_t$, when after executing this action the environment returned reward $r_{t+1}$ and moved to a new state $s_{t+1}$.
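The temporal-difference update of Equation (1) can be sketched as a one-line tabular rule; the table layout and function name here are illustrative assumptions:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal-difference update of the state-action value table.

    Q: value table of shape (n_states, n_actions); alpha is the learning
    rate and gamma the discount factor, as in Equation (1).
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```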
Deep Q-learning utilizes a neural network to predict Q-values of state-action pairs [12]. This so-called deep Q-network is trained to minimize the difference between predicted and actual Q-values as follows:

$$L(\theta) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta) \right)^2 \right] \qquad (2)$$

$$y_t = r_{t+1} + \gamma \max_{a' \in A(s_{t+1})} Q(s_{t+1}, a'; \theta^-) \qquad (3)$$

where $a'$ is the best action according to the previous deep Q-network with parameters $\theta^-$, $\theta$ is the parameter vector of the current Q-function, and $A(s)$ denotes the set of all actions that are permitted in state $s$.

The simplest way to apply this approach in multi-agent systems is to use an independent network for every agent [15]. However, this approach has been shown to perform poorly in more complex environments [10]. One of the shortcomings of DQN learning in multi-agent settings is that past experience replay is less informative, because, unlike in a single-agent setting, the same action in the same state may produce a different result depending on the actions of other agents. A way to alleviate this problem is passing the parameters of other agents as additional environmental information [16].
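As a sketch of how the bootstrapped targets are computed in practice over a batch of transitions, consider the following; the array shapes and the terminal-state mask `dones` are our assumptions, not part of the original formulation:

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bootstrapped targets y_t of Equation (3) for a batch of transitions.

    next_q_values: the previous (target) network's Q-values for all permitted
    actions in s_{t+1}, shape (batch, n_actions).
    dones: assumed terminal-state mask that zeroes the bootstrap term.
    """
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)
```

The current network is then trained to regress its predictions $Q(s_t, a_t; \theta)$ onto these fixed targets.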
2.2 Monte-Carlo Tree Search Nets (MCTSNet)
An alternative approach to reinforcement learning is to directly find an optimal policy without the intermediate step of computing a value function. Policy gradient methods (e.g. [14]) have been developed to do just this.
Policy gradient methods have been shown to be successful in combination with Monte-Carlo tree search (MCTS) [3], a general, powerful, and widely used decision making algorithm, most commonly applied to games. In MCTS a sample tree of simulated future states is created, and evaluations of those states are backed up to the root of this so-called search tree to compute the best action.
A recent study [5] incorporates a neural network inside the tree search by expanding, evaluating and backing up a vector embedding of the states. The key idea is to assign a feature or "memory" vector to an internal state (search tree node), which is then propagated up the tree and used to calculate the value or action in the root node. This MCTSNet approach has been shown to outperform other MCTS methods.
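The backup direction can be illustrated with a toy sketch in which the learned backup MLP is replaced by a single linear map followed by a tanh; the weight matrix `W` and function name are purely illustrative assumptions:

```python
import numpy as np

def backup_path(memories, W):
    """Propagate a leaf evaluation to the root along one search path.

    memories: list of per-node "memory" vectors ordered root -> leaf.
    W: weights of a stand-in linear backup network (MCTSNet uses a trained
    MLP; a single linear map is an assumption of this sketch).
    Each parent memory is updated from its own memory and the already-updated
    child memory, mirroring the leaf-to-root backup direction.
    """
    updated = memories[-1]
    for i in range(len(memories) - 2, -1, -1):
        joint = np.concatenate([memories[i], updated])
        updated = np.tanh(W @ joint)
        memories[i] = updated
    return memories[0]  # root memory, used to pick the action
```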
2.3 Multiagent Deep Deterministic Policy Gradient
When dealing with continuous action spaces, the methods described above cannot be applied. To overcome this limitation, the actor-critic approach to reinforcement learning was proposed [14]. In this approach an actor algorithm tries to output the best action vector, while a critic tries to predict the value function for this action.
Specifically, in the Deep Deterministic Policy Gradient (DDPG [8]) algorithm two neural networks are used: $\mu(s)$ is the actor network that returns the action vector, and $Q(s, a)$ is the critic network that returns the value estimate of action $a$ in state $s$. The gradient for the critic network can be calculated in the same way as the gradient for deep Q-networks described above (Equation 3). Knowing the critic gradient, we can then compute the gradient for the actor as follows:

$$\nabla_{\theta^\mu} J = \mathbb{E}_{s \sim \rho^\pi} \left[ \nabla_a Q(s, a; \theta^Q) \big|_{a = \mu(s)} \, \nabla_{\theta^\mu} \mu(s; \theta^\mu) \right] \qquad (4)$$

where $\theta^Q$ and $\theta^\mu$ are the parameters of the critic and actor neural networks respectively, and $\rho^\pi(s)$ is the probability of reaching state $s$ with policy $\pi$.
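The deterministic policy gradient can be illustrated with a small chain-rule sketch, assuming a linear actor $\mu(s) = \theta_\mu s$ and a black-box critic gradient; both are illustrative assumptions, not the networks used in the paper:

```python
import numpy as np

def actor_gradient(states, theta_mu, critic_grad_a):
    """Monte-Carlo estimate of the actor gradient in Equation (4),
    for a linear actor mu(s) = theta_mu @ s (an illustrative assumption).

    critic_grad_a(s, a) must return dQ/da from the critic.
    The chain rule gives dJ/dtheta = E_s[ dQ/da * dmu/dtheta ], estimated
    here by averaging over a batch of sampled states.
    """
    grads = []
    for s in states:
        a = theta_mu @ s                    # actor output
        dq_da = critic_grad_a(s, a)         # shape (action_dim,)
        grads.append(np.outer(dq_da, s))    # dmu_i/dtheta_ij = s_j
    return np.mean(grads, axis=0)
```

Ascending this gradient moves the actor's output toward actions the critic rates more highly.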
The authors of [9] proposed an extension of this method by creating multiple actors, each with its own critic, where each critic takes in the respective agent's observations and the actions of all agents.
3 MAGNet approach and architecture
The overall network architecture of our MAGNet approach is shown in Figure 1. The whole process can be divided into a relevance graph generation stage (shown in the left part) and a decision making stage (shown in the right part). We view these as a regression and a classification problem, respectively. In this architecture, the concatenation of the current state and the previous action forms the input of the models, and the output is the next action. The details of the two processes are described below.
3.1 Relevance graph generation stage
In the first part of our MAGNet approach, a neural network is trained to produce a relevance graph. The relevance graph represents the relationship between agents and between agents and environment objects. The higher the weight of an edge between an agent $a$ and another agent $a'$ or object $o$, the more important $a'$ or $o$ is for the achievement of agent $a$'s task. The graph is generated by MAGNet from the current and previous states together with the respective actions.
Figure 4B shows an example of such a graph for two agents. The displayed graph only shows those edges which have a non-zero weight (thus there are objects to which agent 1 is not connected in the graph).
In MAGNet, a neural network is trained via backpropagation to output a relevance graph represented as an $|A| \times |O|$ matrix, where $|A|$ is the number of agents and $|O|$ is the maximum number of environment objects. The input to the network are the current and the two previous states (denoted by $s_t$, $s_{t-1}$, and $s_{t-2}$ in Figure 1), the two previous actions (denoted by $a_{t-1}$ and $a_{t-2}$), and the relevance graph produced at the previous time step (denoted by $G_{t-1}$). For the first learning step (i.e. $t = 0$), the input consists of three copies of the initial state, no actions, and a random relevance graph. The inputs are passed into a convolution and pooling layer, followed by a padding layer, and are then concatenated and passed into a fully connected layer and finally into the graph generation network (GGN). The GGN can be either a multilayer perceptron (MLP) or a self-attention network, which uses an attention mechanism to capture long- and short-term time dependencies; the latter is an analogue of a recurrent network such as an LSTM, but takes much less time to compute [17]. The result of the GGN is fed into a two-layer fully connected network with dropout, which produces the relevance graph matrix described above.

The loss function for the backpropagation training is composed of two parts:
$$L = \left\| G_t - G_{t-1} \right\|_2^2 \; + \; \xi \sum_{e \in E_{ev}} \left( w_t(e) - w^{ev}(e) \right)^2 \qquad (5)$$

The first component is based on the difference between the current graph $G_t$ and the one generated in the previous state, $G_{t-1}$. It is important to note that the graph topology is the same at each step; only the weights change. The second component comes into play when a special predefined event occurs, and is based on the difference between a selected edge weight $w^{ev}(e)$, updated according to heuristic rules, and the weight $w_t(e)$ of the same edge in the current graph. For example, a heuristic rule may specify that if a bomb explodes and kills the agent, the edge weight between the agent and the bomb is set to a high value (i.e. the bomb is clearly of high relevance to the agent).
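A minimal sketch of this two-part loss, written directly from the verbal description above; the squared-error form and the weighting factor `xi` are assumptions of this illustration:

```python
import numpy as np

def graph_loss(G_t, G_prev, event_edges, xi=1.0):
    """Two-part training loss for the relevance graph.

    G_t, G_prev: weight matrices of the current / previous relevance graph
    (the topology is fixed; only weights change).
    event_edges: dict mapping an edge index (i, j) to the heuristic target
    weight set when a predefined event fires (e.g. a bomb kills the agent).
    """
    smoothness = np.sum((G_t - G_prev) ** 2)          # graph should change slowly
    event_term = sum((G_t[i, j] - w_target) ** 2      # match heuristic targets
                     for (i, j), w_target in event_edges.items())
    return smoothness + xi * event_term
```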
The training of the neural network can be performed in two stages: first with a default rule-based AI agent, and then with a learning agent.
3.2 Decision making stage
The agent AI responsible for decision making is also represented as a neural network, whose inputs are the accumulated messages (generated by a method inspired by NerveNet [18] and described below) and the current state. The output of the network is an action to be executed.
The graph generated at the last step is $G = (V, E)$, where edges represent the relevance between agents and objects. Every vertex $v$ has a type $b(v)$, which in our case corresponds to one of: "ally", "enemy", "placed bomb" (about to explode), "increase kick ability", "increase blast power", or "extra bomb" (can be picked up). Every edge has a type as well, $c(e) \in \{1, 2\}$, corresponding to "edge between agents" and "edge between an agent and an object in the environment".
The final (action) vector is computed in 4 stages through a message passing system, similar to systems used for distributed computing as described in [1]. Stages 2 and 3 are repeated for a specified number of message propagation steps.

1. Initialization of the information vector. Each vertex $v$ has an initialization network associated with it according to its type $b(v)$, which takes the current individual observation $O_v$ as input and outputs the initial information vector $\mu^0_v$ for that vertex:

$$\mu^0_v = MLP^{b(v)}_{init}(O_v) \qquad (6)$$

2. Message generation. At message propagation step $\tau$, message networks compute output messages for every edge $e = (v, u)$ based on the type of the edge $c(e)$:

$$m^\tau_e = MLP^{c(e)}_{msg}(\mu^\tau_v) \qquad (7)$$

3. Message processing. The information vector at message propagation step $\tau + 1$ is updated by an update network associated with the vertex according to its type $b(v)$, which takes as input the sum of all message vectors from connected edges, each multiplied by the edge relevance $w(e)$, together with the information vector at the previous step:

$$\mu^{\tau+1}_v = MLP^{b(v)}_{up}\Big( \sum_{e = (u, v)} w(e)\, m^\tau_e, \; \mu^\tau_v \Big) \qquad (8)$$

4. Choice of action. Every vertex associated with an agent has a decision network, which takes its final information vector as input and computes the mean of a Gaussian policy:

$$a_v = MLP^{b(v)}_{dec}(\mu^T_v) \qquad (9)$$
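The four stages can be sketched as follows, with each per-type network collapsed to a single linear layer plus ReLU for illustration; the weight dictionary `W`, the string type labels, and the function name are all assumptions of this sketch, not the trained networks of the actual system:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def message_passing(obs, W, edges, vertex_type, edge_type, steps=2):
    """Toy version of the four decision-making stages.

    obs: per-vertex observation vectors; edges: list of (u, v, relevance);
    W: stand-in weight matrices keyed by (stage, type).
    """
    # Stage 1: initialize each vertex's information vector by its type.
    mu = {v: relu(W["init", vertex_type[v]] @ obs[v]) for v in obs}
    for _ in range(steps):
        # Stage 2: generate a message along every edge, by edge type.
        msgs = {(u, v): relu(W["msg", edge_type[u, v]] @ mu[u])
                for u, v, _ in edges}
        # Stage 3: update each vertex from its relevance-weighted message sum.
        new_mu = {}
        for v in mu:
            inbox = sum(w * msgs[u, v2] for u, v2, w in edges if v2 == v)
            if np.isscalar(inbox):          # vertex with no incoming edges
                inbox = np.zeros_like(mu[v])
            new_mu[v] = relu(W["up", vertex_type[v]]
                             @ np.concatenate([inbox, mu[v]]))
        mu = new_mu
    # Stage 4: agent vertices decode a mean action from their final vector.
    return {v: W["dec", vertex_type[v]] @ mu[v]
            for v in mu if vertex_type[v] == "agent"}
```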
All networks are trained using backpropagation following the DDPG actor-critic approach [8].
4 Experiments
4.1 Environment
In this paper, we use the popular Pommerman game environment, which can be played by up to 4 players [11]. This game has been used in many empirical evaluations of multi-agent algorithms, and is therefore especially suitable for a comparison to state-of-the-art techniques. In Pommerman, the environment is a grid-world where each agent can move in one of four directions, lay a bomb, or do nothing. A grid square is either clear (which means that an agent can enter it), wooden, or rigid. Wooden grid squares can not be entered, but can be destroyed by a bomb (i.e. turned into clear squares). Rigid squares are indestructible and impassable. When a wooden square is destroyed, there is a probability of an item appearing, e.g., an extra bomb, a bomb range increase, or a kick ability. Once a bomb has been placed in a grid square it explodes after 10 time steps. The explosion destroys any wooden square within range 1 and kills any agent within range 4. The last surviving agent wins the game.
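The explosion rules above can be made concrete with a toy sketch; the grid encoding, the use of Manhattan distance for the ranges, and the function name are assumptions of this illustration, not the environment's actual implementation:

```python
CLEAR, WOOD, RIGID = 0, 1, 2  # assumed cell codes for this sketch

def resolve_explosion(grid, bomb_pos, agents, wood_range=1, kill_range=4):
    """Toy resolution of one bomb explosion.

    grid: dict mapping (x, y) -> cell code; agents: list of (x, y) positions.
    Wooden squares within wood_range become clear; rigid squares are
    untouched; agents within kill_range are removed. Returns the survivors.
    """
    bx, by = bomb_pos
    for (x, y), cell in list(grid.items()):
        if cell == WOOD and abs(x - bx) + abs(y - by) <= wood_range:
            grid[(x, y)] = CLEAR  # wood is destroyed
    return [a for a in agents if abs(a[0] - bx) + abs(a[1] - by) > kill_range]
```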
The map of the environment is randomly generated for every episode. The game has two different modes: free-for-all and team match. Our experiments were carried out in the team match mode in order to evaluate the ability of MAGNet to exploit the discovered relationships between agents (e.g. being on the same team).
4.2 Network training
We first trained the graph generating network on 50,000 episodes with the default Pommerman AI as the decision making agent. After this initial training, the default AI was replaced with the learning decision making AI described in section 3. All learning graphs show the training episodes starting with this replacement (except the ones which explicitly show the relevance graph learning).
Table 1 shows results for different MAGNet variants in terms of achieved win percentage against a default agent after 50,000 episodes. The MAGNet variants differ in the complexity of the approach, ranging from the simplest version, which takes the relevance graph as a direct input, to the full version incorporating message generation, graph sharing, and self-attention. The table clearly shows the benefit of each extension.
MAGNet modules | Win %
Self-attention | Graph Sharing | Message Generation |
+ | + | + |
+ | + |   |
+ |   | + |
+ |   |   |
  | + | + |
  | + |   |
  |   | + |
  |   |   |
Each of the three extensions and their hyperparameters are described below:

Graph Generating Network (GGN): we used either an MLP (the number of layers and neurons was varied, and a network with 3 layers of 512-128-128 neurons achieved the best result) or a self-attention (SA) layer [17] with default parameters.

Graph Sharing (GS): relevance graphs were either trained individually for both agents, or in the form of a single shared graph for both agents.

Message Generation (MG): the message generation module was implemented as either an MLP or the message generation (MG) architecture described in Section 3. We tested the MLP and the message generation network with a range of hyperparameters. For the MLP, 3 fully connected layers of 1024-256-64 neurons achieved the best result, while for the message generation network 2 layers of 128-32 neurons and 5 message passing iterations showed the best result.
Dropout layers were individually optimized by grid search over the values [0, 0.2, 0.4].
We tested two convolution sizes, [3x3] and [5x5]; [5x5] convolutions showed the best result.
A Rectified Linear Unit (ReLU) transformation was used for all connections.
4.3 Evaluation Baselines
In our experiments, we compare the proposed method with state-of-the-art reinforcement learning algorithms in the team match mode. Figure 1(a) shows a comparison with DQN [12], MCTSNet [5], MADDPG [9], and the default heuristic AI. The latter is provided as part of the Pommerman environment [11]. Each of the reinforcement learning algorithms played a number of games (i.e. episodes) against the heuristic AI, and the respective win rates are shown.
All graphs display a 95% confidence interval to illustrate the statistical significance of our results.
The parameters chosen for the baselines were set as follows.
For DQN we implemented the multi-agent deep Q-learning approach that has been shown to be successful in past work [4]. In this method, training is performed in two repeated steps: first, one agent is trained at a time, while the policies of the other agents are kept fixed; second, the agent that was trained in the previous step distributes its policy to all of its allies as an additional environmental variable.
The network consists of five convolutional layers with 64 3x3 filters in each layer, followed by three fully connected layers with 128 neurons each, with residual connections [6] and batch normalization [7]. It takes as input an 11x11x4 environment tensor and a one-hot encoded action vector (a padded 1x6 vector), both provided by the Pommerman environment, and outputs a Q-function for that state. This network showed the best result at the parameter exploration stage.
Parameter exploration for MCTSNet led to the following settings. The backup network is a multilayer perceptron (MLP) with 5 fully connected layers of 64 neurons each, which takes the current "memory" vector of a node and the updated "memory" vector of its child, and updates the node's "memory" vector. The embedding network consists of 7 convolutional layers with 64 3x3 filters, followed by 3 fully connected layers with 128 neurons each, with residual connections [6] and batch normalization [7]; it takes as input an 11x11x4 environment tensor and a one-hot encoded action vector (a padded 1x6 vector) provided by Pommerman, and outputs a "memory" vector. The policy network has the same architecture, but with 5 convolutional layers of 32 3x3 filters each, and outputs an action for simulation. The readout network is a multilayer perceptron with 2 fully connected layers of 128 neurons each, which takes the root "memory" vector as input and outputs an action.
For our implementation of MADDPG, we used a multilayer perceptron (MLP) with 5 fully connected layers of 128 neurons each for the actor, and a 3-layer network with 128 neurons in each layer for the critic.
4.4 Selfattention and graph sharing in training a relevance graph
Figure 3 shows the shared graph loss value (Equation 5) with and without the self-attention module, and with and without graph sharing. As we can see from this figure, both self-attention and graph sharing significantly improve graph generation in terms of speed of convergence and final loss value. Furthermore, their effects are somewhat independent: using them together gives an additional improvement.
To provide further evidence for the usefulness of the shared graph approach, we let a MAGNet-Att-NerveNet team play against a MAGNet-Att-NerveNet-GS team. As the graph in Figure 2(a) shows, even though both have the same base architecture, the graph sharing method yields a higher win rate after 10,000 episodes.
4.5 Relevance graph visualization
Figure 4 shows examples of relevance graphs with the corresponding environment states. Red nodes denote friendly team agents, the purple nodes denote the agents on the opposing team, and the other nodes denote environment objects such as walls (green) and bombs (black). The lengths of edges represent their weights (shorter edge equals higher weight, i.e. higher relevance). The graphs in Figure 4B are shared, while the graphs in Figure 4C are agentindividual.
As can be seen when comparing the individual and shared graphs, in the shared case agent 1 and agent 2 have different strategies related to the opponent agents (agents 3 and 4). Agent 4 is of relevance to agent 1 but not to agent 2. Similarly, agent 3 is of relevance to agent 2, but not to agent 1. In contrast, when considering the individual graphs, both agents 3 and 4 have the same relevance to agents 1 and 2. Furthermore, it can be seen from all graphs that different environment objects are relevant to different agents.
5 Conclusion
In this paper we presented a novel method, MAGNet, for deep multi-agent reinforcement learning, incorporating information on the relevance of other agents and environment objects to the RL agent. We also extended this basic approach with various optimizations, namely self-attention, shared relevance graphs, and message generation inspired by NerveNet. The MAGNet variants were evaluated on the popular Pommerman game environment and compared to state-of-the-art MARL techniques. Our results show that MAGNet significantly outperforms all competitors.
6 Acknowledgments
This work was supported by Deep Learning Camp Jeju 2018, which was organized by the TensorFlow Korea User Group. This work was also partially supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program No. 10077659, 'Development of artificial intelligence based mobile manipulator for automation of logistics in manufacturing line and logistics center'.
References
 [1] H. Attiya and J. Welch. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. John Wiley & Sons, 2004.
[2] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 2005.
[3] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck. Monte-Carlo tree search: A new framework for game AI. In AIIDE, 2008.
[4] M. Egorov. Multi-agent deep reinforcement learning, 2016.
[5] A. Guez, T. Weber, I. Antonoglou, K. Simonyan, O. Vinyals, D. Wierstra, R. Munos, and D. Silver. Learning to search with MCTSnets. arXiv preprint arXiv:1802.04697, 2018.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[9] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

[10] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.
[11] T. Matiisen. Pommerman baselines. https://github.com/tambetm/pommerman-baselines, 2018.
 [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [13] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 [14] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.

[15] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.
[16] G. Tesauro. Extending Q-learning to general adaptive multi-agent systems. In Advances in Neural Information Processing Systems, pages 871–878, 2004.
 [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [18] T. Wang, R. Liao, J. Ba, and S. Fidler. Nervenet: Learning structured policy with graph neural networks. Proceedings of the International Conference on Learning Representations, 2018.