MAGNet: Multi-agent Graph Network for Deep Multi-agent Reinforcement Learning

12/17/2020, by Aleksandra Malysheva, et al., University of York

Over recent years, deep reinforcement learning has shown strong successes in complex single-agent tasks, and more recently this approach has also been applied to multi-agent domains. In this paper, we propose a novel approach to multi-agent reinforcement learning, called MAGNet, that utilizes a relevance graph representation of the environment obtained by a self-attention mechanism, and a message-generation technique. We applied our MAGNet approach to the synthetic predator-prey multi-agent environment and the Pommerman game, and the results show that it significantly outperforms state-of-the-art MARL solutions, including Multi-agent Deep Q-Networks (MADQN), Multi-agent Deep Deterministic Policy Gradient (MADDPG), and QMIX.


I Introduction

A common difficulty of reinforcement learning in a multi-agent environment (MARL) is that, in order to achieve successful coordination, agents require information about the relevance of environment objects to themselves and to other agents. For example, in the game of Pommerman [1] it is important to know how relevant bombs placed in the environment are for teammates, e.g. whether or not the bombs can threaten them. While such information can be hand-crafted into the state representation for well-understood environments, in lesser-known environments it is preferable to derive it as part of the learning process.

In this paper, we propose a novel method, named MAGNet (Multi-Agent Graph Network), to learn such relevance information in the form of a relevance graph and to incorporate it into the reinforcement learning process. The method works in two stages. In the first stage, a relevance graph is learned. In the second stage, this graph together with state information is fed to an actor-critic reinforcement learning network that is responsible for the decision making of the agent and incorporates message passing techniques along the relevance graph.

The contribution of this work is a novel technique to learn object and agent relevance information in a multi-agent environment, and incorporate this information in deep multi-agent reinforcement learning.

We applied MAGNet to the synthetic predator-prey game, commonly used to evaluate multi-agent systems [2], and to the popular Pommerman [1] multi-agent environment. We achieved significantly better performance than state-of-the-art MARL techniques, including MADQN [3], MADDPG [4], and QMIX [5]. Additionally, we empirically demonstrate the effectiveness of the self-attention mechanism [6], graph sharing, and the message passing system.

II Deep Multi-Agent Reinforcement Learning

In this section we describe state-of-the-art deep reinforcement learning techniques that have been applied to multi-agent domains. The algorithms introduced below (MADQN, MADDPG, and QMIX) were also used as evaluation baselines in our experiments.

II-A Multi-agent Deep Q-Networks

Deep Q-learning utilizes a neural network to predict the Q-values of state-action pairs [7]. This so-called deep Q-network is trained to minimize the following loss function:

$L(\theta) = \mathbb{E}\big[(y - Q(s, a; \theta))^2\big]$ (1)

$y = r + \gamma \max_{a' \in A(s')} Q(s', a'; \theta)$ (2)

where $s'$ is the state we transition into by taking action $a$ in state $s$, $r$ is the reward of that action, $\gamma$ is the discount factor, and $\theta$ is the parameter vector of the current Q-function approximation. $A(s')$ denotes the set of all actions that are permitted in state $s'$.
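To make Equations 1 and 2 concrete, the following sketch shows one way to compute the DQN target and loss in PyTorch. The names q_net, target_net, and the batch layout are our own illustrative assumptions (a separate target network is a common stabilization), not part of the MADQN implementation of [3].

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch holds transitions (s, a, r, s', done) as tensors
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y = r + gamma * max_a' Q(s', a'), zeroed at terminal states
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    # squared difference of Equation 1
    return F.mse_loss(q_values, targets)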

The Multi-agent Deep Q-Networks (MADQN, [3]) approach modifies this process for multi-agent systems by performing training in two repeated steps. First, agents are trained one at a time, while the policies of all other agents are kept fixed. When an agent finishes training, it distributes its policy to all of its allies as an additional environmental variable.

II-B Multi-agent Deep Deterministic Policy Gradient

When dealing with continuous action spaces, the MADQN method described above cannot be applied. To overcome this limitation, the actor-critic approach to reinforcement learning was proposed [8]. In this approach an actor algorithm tries to output the best action vector and a critic tries to predict the value function for this action.

Specifically, in the Deep Deterministic Policy Gradient (DDPG [9]) algorithm two neural networks are used: the actor network $\mu(s \mid \theta^\mu)$, which returns the action vector, and the critic network $Q(s, a \mid \theta^Q)$, which returns the Q-value, i.e. the value estimate of taking action $a$ in state $s$.

The gradient for the critic network can be calculated in the same way as the gradient for Deep Q-Networks described above (Equation 1). Knowing the critic gradient we can then compute the gradient for the actor as follows:

$\nabla_{\theta^\mu} J = \mathbb{E}_{s \sim \rho^\mu}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s \mid \theta^\mu)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big]$ (3)

where $\theta^Q$ and $\theta^\mu$ are the parameters of the critic and actor neural networks respectively, and $\rho^\mu(s)$ is the probability of reaching state $s$ with policy $\mu$.
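As an illustration of Equation 3, a minimal PyTorch sketch of the DDPG actor update follows; the actor, critic, and optimizer objects are placeholders, not the implementation used in the paper.

import torch

def ddpg_actor_update(actor, critic, actor_optimizer, states):
    # a = mu(s | theta^mu): actions proposed by the actor for the sampled states
    actions = actor(states)
    # maximizing Q(s, mu(s)) is equivalent to minimizing its negative;
    # the gradient flows through the critic into the actor parameters
    actor_loss = -critic(states, actions).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()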

The authors of [10] proposed an extension of this method by creating multiple actors, each with its own critic, where each critic takes as input the respective agent's observation and the actions of all agents. This constitutes the following value function for actor $i$:

$Q^{\mu}_i(o_i, a_1, \dots, a_N)$ (4)

where $o_i$ is agent $i$'s observation and $a_1, \dots, a_N$ are the actions of all $N$ agents.

This Multi-agent Deep Deterministic Policy Gradient (MADDPG) method showed the best results among widely used deep reinforcement learning techniques in continuous state and action spaces.
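For concreteness, a hedged sketch of a MADDPG-style centralized critic is shown below: each agent keeps its own critic that receives its observation together with the actions of all agents, as in Equation 4. The layer sizes and input conventions are illustrative assumptions.

import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    # Q_i(o_i, a_1, ..., a_N): one such critic is kept per agent i
    def __init__(self, obs_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_i, joint_actions):
        # concatenate the agent's observation with all agents' actions
        return self.net(torch.cat([obs_i, joint_actions], dim=-1))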

II-C QMIX

Another recent promising approach to deep multi-agent reinforcement learning is the QMIX [5] method. It utilizes individual Q-functions for every agent and a joint Q-function for the team of agents. The QMIX architecture consists of three types of neural networks. Agent networks evaluate individual Q-functions for the agents, taking as input the current observation and the previous action. The mixing network takes the individual Q-functions from the agent networks and the current state as input and calculates a joint Q-function for all agents. Hypernetworks add an additional layer of complexity to the mixing network: instead of passing the current state to the mixing network directly, the hypernetworks use it as input to compute the weights at each layer of the mixing network. We refer the reader to the original paper for a more complete explanation [5].
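The following simplified sketch illustrates this mixing idea: hypernetworks map the global state to non-negative weights of a small mixing network that combines the individual Q-values into a joint Q-value. The layer sizes and single-layer hypernetworks are simplifying assumptions and do not reproduce the exact architecture of [5].

import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # hypernetworks: the global state produces the mixing-network weights
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        # joint Q-value, monotonic in each individual Q-value
        return (torch.bmm(hidden, w2) + b2).view(-1)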

The authors empirically demonstrated on a number of RL domains that this approach outperforms both MADQN and MADDPG methods.

III MAGNet approach and architecture

Fig. 1: The overall network architecture of MAGNet. The left part shows the graph generation stage; the right part shows the decision making stage. $s_t$ denotes the state of the environment at step $t$, and $a_t$ denotes the action taken by the agent at step $t$. GGN refers to the Graph Generation Network.

Figure 1 shows the overall network architecture of our MAGNet approach. The whole process can be divided into a relevance graph generation stage (shown in the left part) and a decision making stage (shown in the right part). In this architecture, the concatenation of the current state and the previous action forms the input of the model, and the output is the next action. The details of the two stages are described below.

III-A Relevance graph generation stage

In the first part of our MAGNet approach, a neural network is trained to produce a relevance graph, which is represented as a numerical weight matrix whose dimensions are given by $|A|$, the number of agents, and $|O|$, a given maximum number of environment objects, e.g., bombs and walls in Pommerman. Weights for objects that are not present at the current time are set to 0. The relevance graph represents the relationships between agents and between agents and environment objects: the higher the absolute weight of an edge between an agent $a$ and another agent or object $b$, the more important $b$ is for the achievement of agent $a$'s task. Every vertex $v$ of the graph has a user-defined type $b(v)$; example types are "wall", "bomb", and "agent". These types are used in the message generation step (see below). The graph is generated by MAGNet from the current and previous states together with the respective actions.
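As a purely illustrative example, such a relevance graph can be stored as a dense weight matrix with one row per agent and one column per agent or environment object; the concrete sizes and the row/column convention below are our own assumptions.

import numpy as np

n_agents, max_objects = 4, 20            # illustrative sizes
vertex_types = ["agent"] * n_agents + ["none"] * max_objects

# rows: agents, columns: agents and environment objects
relevance = np.zeros((n_agents, n_agents + max_objects), dtype=np.float32)

# e.g. a bomb registered as object 0 that is highly relevant to agent 2
bomb_col = n_agents + 0
vertex_types[bomb_col] = "bomb"
relevance[2, bomb_col] = 0.9             # large |weight| = high relevance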

IV Relevance graph generation network

To generate this relevance graph, we train a neural network via back-propagation to output a graph representation matrix. The input to the network consists of the current and the two previous states (denoted by $s_t$, $s_{t-1}$, and $s_{t-2}$ in Figure 1), the two previous actions (denoted by $a_{t-1}$ and $a_{t-2}$), and the relevance graph produced at the previous time step (denoted by $G_{t-1}$). For the first learning step (i.e. $t = 0$), the input consists of three copies of the initial state, no actions, and a random relevance graph. The inputs are passed into a convolution and pooling layer, followed by a padding layer, and are then concatenated and passed into a fully connected layer and finally into the graph generation network (GGN). In this work we implement the GGN as either a multilayer perceptron (MLP) or a self-attention network, which uses an attention mechanism to capture long and short term time-dependencies. We present the results of both implementations in Table I. The self-attention network is an analogue of a recurrent network, such as an LSTM, but takes much less time to compute [6]. The result of the GGN is fed into a two-layer fully connected network with dropout, which produces the relevance graph matrix.
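A hedged sketch of this data flow is given below; the layer sizes, the use of adaptive pooling in place of the padding layer, and the MLP variant of the GGN are illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn

class GraphGenerator(nn.Module):
    def __init__(self, state_channels, n_agents, n_vertices, hidden=256):
        super().__init__()
        # shared convolution and pooling applied to each of the three input states
        self.conv = nn.Sequential(
            nn.Conv2d(state_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        self.fc_in = nn.LazyLinear(hidden)      # concatenated features -> hidden
        self.ggn = nn.Sequential(               # MLP variant of the GGN
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.head = nn.Sequential(              # two-layer output network with dropout
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, n_agents * n_vertices),
        )
        self.shape = (n_agents, n_vertices)

    def forward(self, states, prev_actions, prev_graph):
        # states: list of 3 tensors [B, C, H, W]; prev_actions, prev_graph: [B, d]
        feats = [self.conv(s) for s in states]
        x = torch.cat(feats + [prev_actions, prev_graph], dim=1)
        x = torch.relu(self.fc_in(x))
        return self.head(self.ggn(x)).view(-1, *self.shape)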

The loss function for the back-propagation training is defined as follows:

$L_{G} = \sum_{(v,u) \in E} \big(w_t(v, u) - w_{t-1}(v, u)\big)^2$ (5)

i.e. it is based on the squared difference between the weights $w_t(v,u)$ of the edges in the current graph and the weights $w_{t-1}(v,u)$ of the graph generated in the previous state. We could train the graph without this loss function, and instead only backpropagate the loss function of the decision making stage to the graph generation stage. However, we found that this lowers the performance (see Figure 3). We also found that the better way to train MAGNet is to first pre-train the graph generation network, with decisions made by the default agents that both the Pommerman and predator-prey environments provide, and only then add the agent networks (see also Section V-E). There are two alternatives for training relevance graphs: (1) train individual relevance graphs for every agent, or (2) train one shared graph (GS) that is the same for all agents on the team. We performed experiments to determine which way is better (see Table I).
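A minimal sketch of the temporal-consistency loss of Equation 5, assuming the generator outputs dense weight matrices as above:

import torch

def graph_consistency_loss(current_graph, previous_graph):
    # squared difference between edge weights at time t and t-1 (Equation 5)
    return ((current_graph - previous_graph) ** 2).sum()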

IV-A Decision making stage

The agent AI responsible for decision making is represented as a neural network whose inputs are accumulated messages and the current state of the environment. The output of the network is an action to be executed. This action is computed in 4 steps through a message passing system.

In the first step, individual (i.e. location-specific) observations of agents and objects are pre-processed by a neural network into an information vector (represented as a numerical vector). This neural network is initialized randomly and trained during the overall learning process using the same global loss function.

In the second step, a neural network (also trained) takes the information vector of an agent and maps it into a message (also a numerical vector), one for each connected vertex type in the relevance graph. Each message is multiplied by the weight of the corresponding edge and passed to the respective vertices.

In the third step, each agent or object in the relevance graph updates its information vector, also using a trained network, based on the incoming messages and the previous information vector. Steps 2 and 3 are repeated a given number of times, in our experiments five times. Finally, in the last step, the final message received by the agent, together with the current state information is mapped by a trained decision making network into an action.

Since the message passing system outputs an action, we view it as an actor in the DDPG actor-critic approach [9], and train it accordingly. A more formal description of this decision making stage is as follows.

  1. Initialization of the information vector. Each vertex $v$ has an initialization network $\gamma_{b(v)}$ associated with it according to its type $b(v)$. It takes the current individual observation $o_v$ as input and outputs the initial information vector $\mu^0_v$ for each vertex:

    $\mu^0_v = \gamma_{b(v)}(o_v)$ (6)
  2. Message generation. Message generation is performed in iterative steps. At message generation step $\tau$ (not to be confused with environmental time $t$), the message networks $\varphi_{c(v,u)}$ compute an output message for every edge $(v, u)$ based on the type of the edge $c(v, u)$, which is then multiplied by the weight $w(v, u)$ of the corresponding edge in the relevance graph:

    $m^\tau_{(v,u)} = w(v, u)\, \varphi_{c(v,u)}(\mu^{\tau-1}_v)$ (7)
  3. Message processing. The information vector of vertex $v$ at message propagation step $\tau$ is updated by an associated update network $\rho_{b(v)}$, according to its type $b(v)$. The network takes as input the sum of all incoming message vectors and the information vector at the previous step:

    $\mu^\tau_v = \rho_{b(v)}\Big(\mu^{\tau-1}_v, \sum_{u} m^\tau_{(u,v)}\Big)$ (8)
  4. Choice of action. All vertices that are associated with agents have a decision network $\pi_{b(v)}$ which takes their final information vector $\mu^T_v$ as input and computes the mean of the action for the Gaussian policy:

    $\bar{a}_v = \pi_{b(v)}(\mu^T_v)$ (9)
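The four steps can be summarized in the following illustrative sketch; the dictionary-based graph representation and the names of the type-specific networks (init_nets, msg_nets, update_nets, decision_net) are our own assumptions, not the authors' implementation.

import torch

def message_passing_actor(vertices, vertex_type, edges, edge_type, edge_weight,
                          observations, init_nets, msg_nets, update_nets,
                          decision_net, agent_vertices, n_iterations=5):
    # Step 1: type-specific initialization of every vertex's information vector
    info = {v: init_nets[vertex_type[v]](observations[v]) for v in vertices}

    for _ in range(n_iterations):               # steps 2 and 3 are repeated
        incoming = {v: [] for v in vertices}
        # Step 2: one message per edge, scaled by the relevance-graph weight
        for (v, u) in edges:
            incoming[u].append(edge_weight[(v, u)] * msg_nets[edge_type[(v, u)]](info[v]))
        # Step 3: update information vectors from the summed incoming messages
        info = {v: update_nets[vertex_type[v]](
                    info[v],
                    sum(incoming[v]) if incoming[v] else torch.zeros_like(info[v]))
                for v in vertices}

    # Step 4: agent vertices map their final information vector to an action mean
    return {v: decision_net(info[v]) for v in agent_vertices}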

V Experiments

V-A Environments

In this paper, we use two popular multi-agent benchmark environments for testing, the synthetic multi-agent predator-prey game [2], and the Pommerman game [1].

In the predator-prey environment, the aim of the predators is to kill the faster-moving prey within 500 iterations. The predator agents must learn to cooperate in order to surround and kill the prey. Every prey has a health of 10. A predator moving within a given range of a prey lowers the prey's health by 1 point per time step. Lowering the prey's health to 0 kills the prey. If even one prey survives after 500 iterations, the prey team wins. Random obstacles are placed in the environment at the start of the game.

The Pommerman game is a popular environment which can be played by up to 4 players. The multi-agent variant has 2 teams of 2 players each. This game has been used in recent competitions for multi-agent algorithms, and therefore is especially suitable for a comparison to state-of-the-art techniques.

In Pommerman, the environment is a grid-world where each agent can move in one of four directions, place a bomb, or do nothing. A grid square is either empty (which means that an agent can enter it), wooden, or rigid. Wooden grid squares cannot be entered, but can be destroyed by a bomb (i.e. turned into clear squares). Rigid squares are indestructible and impassable. When a wooden square is destroyed, there is a probability of items appearing, e.g., an extra bomb, a bomb range increase, or a kick ability. Once a bomb has been placed in a grid square it explodes after 10 time steps. The explosion destroys any wooden square within range 1 and kills any agent within range 4. If both agents of one team die, the team loses the game and the opposing team wins. The map of the environment is randomly generated for every episode.

The game has two different modes: free for all and team match. Our experiments were carried out in the team match mode in order to evaluate the ability of MAGNet to exploit the discovered relationships between agents (e.g. being on the same team).

We represent states in both environments as a tensor of dimensions $X \times Y \times M$, where $X$ and $Y$ are the dimensions of the field and $M$ is the maximum possible number of objects. The entry at position $(x, y, m)$ is 1 if object $m$ is present in grid cell $(x, y)$ and 0 otherwise. The predator-prey state is a 128x128x6 tensor, and the Pommerman state is an 11x11x4 tensor.
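For instance, a small numpy sketch of this one-hot object encoding (field size, object ids, and positions below are placeholders):

import numpy as np

def encode_state(object_positions, field_size, max_objects):
    # state[x, y, m] = 1 if object m occupies cell (x, y), else 0
    state = np.zeros((field_size[0], field_size[1], max_objects), dtype=np.float32)
    for m, (x, y) in object_positions.items():   # {object_id: (x, y)}
        state[x, y, m] = 1.0
    return state

# e.g. an 11x11 Pommerman-like board with four tracked object channels
state = encode_state({0: (3, 4), 1: (7, 7)}, field_size=(11, 11), max_objects=4)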

Fig. 2: The synthetic predator-prey (left) and the Pommerman game environment (right).

Figure 2 shows both test environments.

V-B Evaluation Baselines

Fig. 3: MAGNet variants compared to state-of-the-art MARL techniques in the predator-prey (left) and Pommerman (right) environments. MAGNet-DSH refers to MAGNet with domain specific heuristics (Section V-F). MAGNet-NoGL refers to MAGNet trained without the graph generation loss function (Equation 5), using only the final loss function of the decision making stage. MAGNet-NO-PT refers to MAGNet without pre-training of the graph generating network (Section V-E). Every algorithm was trained by playing against a default environment agent for a number of games (episodes), and the respective win percentage is shown. Default agents are provided by the environments. Shaded areas show the 95% confidence interval from 20 runs.

In our experiments, we compare the proposed method with state-of-the-art reinforcement learning algorithms in the two environments described above. Figure 3 shows a comparison with the MADQN [3], MADDPG [4], and QMIX [5] algorithms. Each of these algorithms was trained by playing a number of games (i.e. episodes) against the default AI provided with the games, and the respective win rates are shown. All graphs display a 95% confidence interval over 20 runs to show statistical significance.

The parameters for the MADQN baselines were set as follows through parameter exploration. The network for the predator-prey environment consists of seven convolutional layers with 64 5x5 filters in each layer, followed by five fully connected layers with 512 neurons each, with residual connections [11] and batch normalization [12]. It takes as input a 128x128x6 environment state tensor and a one-hot encoded action vector (a padded 1x5 vector) and outputs a Q-value for that state-action pair. Since the output of a DQN is discrete, but the predator-prey environment requires a continuous action, the agents employ only two speeds and 10 directions. The network for Pommerman consists of five convolutional layers with 64 3x3 filters in each layer, followed by three fully connected layers with 128 neurons each, with residual connections and batch normalization. It takes as input an 11x11x4 environment state tensor and a one-hot encoded action vector (a padded 1x6 vector), as provided by the Pommerman environment, and outputs a Q-value for that state-action pair.
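As an illustration, a hedged PyTorch sketch of the Pommerman MADQN baseline described above follows; residual connections and batch normalization are omitted for brevity, and the exact wiring of the action vector is an assumption.

import torch
import torch.nn as nn

class PommermanDQN(nn.Module):
    def __init__(self, in_channels=4, n_actions=6):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(5):                      # five conv layers, 64 3x3 filters each
            layers += [nn.Conv2d(c, 64, kernel_size=3, padding=1), nn.ReLU()]
            c = 64
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Sequential(                # three 128-neuron fully connected layers
            nn.Linear(64 * 11 * 11 + n_actions, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),                  # Q-value of the state-action pair
        )

    def forward(self, state, action_one_hot):
        # state: [B, 4, 11, 11] (channels first), action_one_hot: [B, 6]
        x = self.convs(state).flatten(1)
        return self.fc(torch.cat([x, action_one_hot], dim=1))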

For our implementation of MADDPG we used a multilayer perceptron (MLP) with 3 fully connected layers with 512-128-64 neurons for both the actor and the critic in the predator-prey game, and a 5-layer fully connected network with 128 neurons in each layer for the critic and a 3-layer network with 128 neurons in each layer for the actor in Pommerman.

Parameter exploration for QMIX led to the following settings for both environments. All agent networks are DQNs with a recurrent layer of a Gated Recurrent Unit (GRU [13]) with a 64-dimensional hidden state. The mixing network consists of a single hidden layer of 32 neurons. Since the output of MADDPG and QMIX is continuous, but Pommerman expects a discrete action, we discretized it.

As in the original QMIX paper [5], we decrease the exploration rate linearly from 1.0 to 0.05 over the first 50k time steps and then keep it constant. As can be seen from Figure 3, our MAGNet approach significantly outperforms current state-of-the-art algorithms.

V-C MAGNet network training

In both environments we first trained the graph generating network for 50,000 episodes with the same parameters and with the default AI as the decision making agents. Both the predator-prey and Pommerman environments provide these default agents. After this initial training, the default AI was replaced by the learning decision making AI described in Section III. All learning graphs show the training episodes starting after this replacement.

Table I shows the results for different MAGNet variants in terms of the win percentage achieved against a default agent after 600,000 episodes in the predator-prey game and 1,000,000 episodes in the game of Pommerman. The MAGNet variants differ in the complexity of the approach, ranging from the simplest version, which takes the learned relevance graph as a direct addition to the input, to the full version incorporating message generation, graph sharing, and self-attention. The table clearly shows the benefit of each extension.

MAGNet modules          Win % PP    Win % PM
SA   GS   MG
+    +    +
+    +    -
+    -    +
+    -    -
-    +    +
-    +    -
-    -    +
-    -    -
TABLE I: Influence of the different modules (SA: self-attention, GS: graph sharing, MG: message generation) on the performance of the MAGNet model in the predator-prey (PP) and Pommerman (PM) environments.

Each of the three extensions and their hyper-parameters are described below:

  • Self-attention (SA). We can train the Graph Generating Network (GGN) as a simple multi-layer perceptron (the number of layers and neurons was varied, and a network with 3 fully connected layers of 512-128-128 neurons achieved the best result). Alternatively, we can train it as the self-attention encoder part of a Transformer (SA) layer [6] with default parameters.

  • Graph Sharing (GS): relevance graphs are either trained individually for each agent, or in the form of a shared graph for all agents on one team.

  • Message Generation (MG): the message generation module is implemented either as an MLP or as the message generation (MG) architecture described in Section IV-A.

V-D MAGNet parameters

We define the vertex types $b(v)$ and edge types $c(e)$ of the relevance graph as follows.

In the predator-prey game the vertex types correspond to: "predator on team 1" (types 1, 2, 3), "predator on team 2" (types 4, 5, 6), "prey", and "wall". Every edge has a type as well; there are three edge types, corresponding to "edge between predators within one team", "edge between predators from different teams", and "edge between a predator and an object in the environment or a prey".

In the Pommerman game the vertex types correspond to: "ally", "enemy", "placed bomb" (about to explode), "increase kick ability", "increase blast power", and "extra bomb" (can be picked up). Every edge has one of two types, corresponding to "edge between the agents" and "edge between an agent and an object in the environment".
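Purely for illustration, these type sets could be encoded as simple lookup tables used to select the type-specific networks; the integer ids below are arbitrary.

# Pommerman vertex and edge types used to select type-specific networks
POMMERMAN_VERTEX_TYPES = {
    "ally": 0, "enemy": 1, "placed_bomb": 2,
    "increase_kick": 3, "increase_blast": 4, "extra_bomb": 5,
}
POMMERMAN_EDGE_TYPES = {
    "agent_agent": 0,     # edge between the agents
    "agent_object": 1,    # edge between an agent and an environment object
}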

We tested the MLP and the message generation network with a range of hyper-parameters, choosing the best ones. In the predator-prey game, the MLP has 3 fully connected layers with 512-512-128 neurons, while the message generation network has 5 layers with 512-512-128-128-32 neurons. For the Pommerman environment, the MLP has 3 fully connected layers with 1024-256-64 neurons, while the message generation network has 2 layers with 128-32 neurons. In both domains, 5 message passing iterations showed the best results.

Dropout layers were individually optimized by grid search over the [0, 0.2, 0.4] space. We tested two convolution sizes, 3x3 and 5x5; 5x5 convolutions showed the best results. A Rectified Linear Unit (ReLU) activation was used for all connections.

V-E No pre-training

With regards to pre-training of the graph generating network we need to answer the following questions. First, is it feasible to train the network without an external agent for pre-training, i.e. can we simultaneously train both the graph generating network and the decision making networks from the start? Second, does pre-training of the graph network improve the result?

To answer these questions, we performed experiments without pre-training of the graph network. Figure 3 shows the results of those experiments (MAGNet-NO-PT). As can be seen, the network can indeed learn without pre-training, but pre-training significantly improves the results. This may be due to the decision making error influencing the graph generator network in a negative way.

V-F Domain specific heuristics

We also performed experiments to see whether or not additional knowledge about the environment can improve the results of our method. To incorporate this knowledge, we change the loss function for graph generation in the following manner.

$L_{G} = \sum_{(v,u) \in E} \big(w_t(v, u) - w_{t-1}(v, u)\big)^2 + \sum_{k \in K_t} \big(w_t(e_k) - \xi_k\big)^2$ (10)

The first component is the same as before: it is based on the squared difference between the weights of the edges in the current graph and those of the graph generated in the previous state. The second component iterates through the events $k \in K_t$ at time $t$ and calculates the squared difference between the weight $w_t(e_k)$ of the edge $e_k$ involved in event $k$ and the event weight $\xi_k$.

For example, in the Pommerman environment we set the event weight corresponding to one of our team's agents killing an agent from the opposite team to 100, and the weight corresponding to an agent picking up a bomb to 25. In the predator-prey environment, if a predator kills a prey, we set the event's weight to 100; if a predator only wounds the prey, the weight for that event is set to 50.
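A hedged sketch of the heuristic loss of Equation 10 follows; events is assumed to be a list of (edge index, event weight) pairs such as the examples above.

import torch

def heuristic_graph_loss(current_graph, previous_graph, events):
    # first term: temporal consistency, as in Equation 5
    loss = ((current_graph - previous_graph) ** 2).sum()
    # second term: pull edges involved in events towards their event weights
    for (i, j), event_weight in events:
        loss = loss + (current_graph[i, j] - event_weight) ** 2
    return loss

# e.g. killing an enemy weighted 100, picking up an extra bomb weighted 25:
# events = [((2, 7), 100.0), ((0, 11), 25.0)]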

As can be seen in Figure 3 (line MAGNet-DSH), the model that uses this domain knowledge about the environment trains faster and performs better. It is however important to note that the MAGNet network without any heuristics still outperforms current state-of-the-art methods. For future research we consider creating a method for automatic assignment of the event weights.

VI Conclusion

In this paper we presented a novel method, MAGNet, for deep multi-agent reinforcement learning incorporating information on the relevance of other agents and environment objects to the RL agent. We also extended this basic approach with various optimizations, namely self-attention, shared relevance graphs, and message generation. The MAGNet variants were evaluated on the popular predator-prey and Pommerman game environments, and compared to state-of-the-art MARL techniques. Our results show that MAGNet significantly outperforms all competitors.

References