Deep Coordination Graphs

09/27/2019 ∙ by Wendelin Böhmer, et al. ∙ University of Oxford

This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible trade-off between representational capacity and generalization by factorizing the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks and parameter sharing improves generalization over the state-action space. We show that DCG can solve challenging predator-prey tasks that are vulnerable to the relative overgeneralization pathology and in which all other known value factorization approaches fail.




1 Introduction

One of the central challenges in cooperative multi-agent reinforcement learning (MARL, Oliehoek & Amato, 2016) is coping with the size of the joint action space, which grows exponentially in the number of agents. For example, this paper evaluates tasks where eight agents each have six actions to choose from, yielding a joint action space with more than a million actions. Efficient MARL methods must thus be able to generalize over large joint action spaces, in the same way that convolutional neural networks allow deep RL to generalize over large visual state spaces.
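As a quick check of the figure quoted above, the number of joint actions is the per-agent action count raised to the number of agents. The sketch below assumes every agent has the same number of actions; the function name is illustrative only.

```python
def joint_action_count(n_agents: int, n_actions: int) -> int:
    """Number of joint actions when every agent has n_actions choices."""
    return n_actions ** n_agents


# Eight agents with six actions each, as in the tasks evaluated in this paper.
print(joint_action_count(8, 6))  # 1679616
```

Adding a ninth agent multiplies the count by six again, which is why methods that enumerate joint actions do not scale.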

Even though few benchmark tasks actually require agent policies to be independently executable, one common approach to coping with large action spaces is to decentralize the decision policy and/or value function. For example, Figure 1a shows how the joint value function can be factorized into utility functions that each depend only on the actions of one agent (Sunehag et al., 2018; Rashid et al., 2018). Consequently, the joint value function can be efficiently maximized if each agent simply selects the action that maximizes its corresponding utility function. This factorization can represent any deterministic policy and thus can represent at least one optimal policy. However, that policy may not be learnable due to a game-theoretic pathology called relative overgeneralization (Panait et al., 2006): during exploration other agents act randomly and punishment caused by uncooperative agents may outweigh rewards that would be achievable with coordinated actions. If the employed value function does not have the representational capacity to distinguish the values of coordinated and uncoordinated actions, an optimal policy cannot be learned.
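Relative overgeneralization can be illustrated with a toy two-agent calculation. The sketch below is not taken from the paper: the payoff values, the function names and the assumption that the partner explores uniformly over six actions (only one of which is "catch") are all hypothetical. It computes the expected team reward of each own action when the partner's action is averaged out, which is all that a per-agent utility can represent.

```python
def expected_team_reward(payoff, partner_policy):
    """Expected reward of each own action, averaging over the partner's action."""
    return [sum(p * r for p, r in zip(partner_policy, row)) for row in payoff]


def hunter_payoff(reward, punishment):
    # Rows: own action (0 = other, 1 = catch); columns: partner's action.
    # A lone catch attempt by either agent punishes the whole team.
    return [[0.0, punishment],
            [punishment, reward]]


def greedy_under_exploration(reward, punishment, partner_catch_prob):
    """Own action that looks best while the partner still acts randomly."""
    policy = [1.0 - partner_catch_prob, partner_catch_prob]
    values = expected_team_reward(hunter_payoff(reward, punishment), policy)
    return "catch" if values[1] > values[0] else "wait"


# Assumed numbers: reward 10 for a joint catch, partner catches with probability 1/6.
for punishment in (0.0, -1.0, -5.0):
    print(punishment, greedy_under_exploration(10.0, punishment, 1 / 6))
```

With a strong enough punishment the averaged value of "catch" drops below that of not catching, even though the joint action (catch, catch) still yields the highest reward.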

However, Castellini et al. (2019) show that higher-order factorization of the value function works surprisingly well in one-shot games that are vulnerable to relative overgeneralization, even if each factor depends on the actions of only a small subset of agents. Such a higher-order factorization can be expressed as an undirected coordination graph (CG, Guestrin et al., 2002a), where each vertex represents one agent and each (hyper-)edge one payoff function over the joint action space of the connected agents. Figure 1b shows a CG with pairwise edges and the corresponding value factorization. Depending on the CG topology, the value can thus depend nontrivially on the actions of all agents, yielding a richer representation. Although the value can no longer be maximized by each agent individually, the greedy action can be found using message passing along the edges (also known as belief propagation, Pearl, 1988). Sparse cooperative Q-learning (Kok & Vlassis, 2006) applies CGs to MARL but does not scale to modern benchmarks, as each payoff function (the pairwise factors in Figure 1b) is represented as a table over the state and joint action space of the connected agents. Castellini et al. (2019) use neural networks to approximate payoff functions, but only in one-shot games, and still require a unique function for each edge in the CG. Consequently, each agent group, represented by an edge, must still experience all corresponding action combinations, which can require executing a significant subset of the joint action space.

Figure 1: Examples of value factorization for 3 agents: (a) sum of independent utilities (as in VDN, Sunehag et al., 2018) corresponds to an unconnected CG. QMIX uses a monotonic mixture of utilities instead of a sum (Rashid et al., 2018); (b) sum of pairwise payoffs (Castellini et al., 2019), which correspond to pairwise edges; (c) no factorization (as in QTRAN, Son et al., 2019) corresponds to one hyper-edge connecting all agents. Factorization allows parameter sharing between factors, shown next to the CG, which can dramatically improve the algorithm’s sample complexity.

To address these issues, this paper proposes the deep coordination graph (DCG), a deep RL algorithm that scales to modern benchmark tasks. DCG represents the value function as a CG with pairwise payoffs (Figure 1b) and individual utilities (Figure 1a); the method can be generalized to CGs with hyper-edges, that is, to payoff functions for more than 2 agents. This improves the representational capacity beyond state-of-the-art value factorization approaches like VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018). To achieve scalability, DCG employs parameter sharing between payoffs and utilities. Parameter sharing has long been a staple of factorized MARL. Methods like VDN and QMIX condition an agent’s utility on its history, that is, its past observations and actions, and share the parameters of all utility functions. Experiences of one agent are thus used to train all. This can dramatically improve the sample efficiency compared to unfactored methods (Foerster et al., 2016, 2018; Lowe et al., 2017; Schröder de Witt et al., 2019; Son et al., 2019), which correspond to a CG with one hyper-edge connecting all agents (Figure 1c). DCG takes parameter sharing one step further by approximating all payoff functions with the same neural network. To allow unique outputs for each payoff, the network is conditioned on a learned embedding of the participating agents’ histories. This requires only one more linear layer than VDN and thus has fewer parameters than QMIX.

DCG is trained end-to-end with deep Q-learning (DQN, Mnih et al., 2015), but uses message passing to coordinate greedy action selection between all agents in the graph. For $k$ message passes over $n$ agents with $m$ actions each, the time complexity of maximization is only $O(k\,m\,(n+m)\,|E|)$, where $|E| \le \frac{n^2-n}{2}$ is the number of (pairwise) edges, compared to $O(m^n)$ for DQN without factorization.

We compare DCG’s performance with that of other MARL Q-learning algorithms in a challenging family of predator-prey tasks that require coordinated actions. Here DCG is the only algorithm that solves the harder tasks. We also investigate the influence of graph topologies on the performance.

2 Related work

A general overview of cooperative deep MARL can be found in OroojlooyJadid & Hajinezhad (2019). Independent Q-learning (IQL, Tan, 1993) decentralizes the agents’ policy by modeling each agent as an independent Q-learner. However, the task from the perspective of a single agent becomes nonstationary as other agents change their policies. To address this, Foerster et al. (2017) show how to stabilize IQL when using experience replay buffers. Another approach to decentralized agents is centralized training and decentralized execution (Foerster et al., 2016) with a factorized value function. Value decomposition networks (VDN, Sunehag et al., 2018) perform centralized Q-learning with a value function that is the sum of independent utility functions for each agent (Figure 1a). The greedy policy can be executed by maximizing each utility independently. QMIX (Rashid et al., 2018) improves upon this approach by combining the agents’ utilities with a mixing network, which is monotonic in the utilities and depends on the global state. This allows different mixtures in different states and the central value can be maximized independently due to monotonicity. All of these approaches are derived in Appendix A.1 and can use parameter sharing between the value/utility functions. However, they represent the joint value with independent values/utilities and are therefore susceptible to the relative overgeneralization pathology. We demonstrate this by comparing DCG with all the above algorithms.

Another straightforward way to decentralize in MARL is to define the joint policy as a product of independent agent policies. This lends itself to the actor-critic framework, where the critic is discarded during execution and can therefore condition on the global state and all agents’ actions during training. Examples are MADDPG (Lowe et al., 2017) for continuous actions and COMA (Foerster et al., 2018) for discrete actions. Wei et al. (2018) specifically investigate the relative overgeneralization pathology in continuous multi-agent tasks and show improvement over MADDPG by introducing policy entropy regularization. MACKRL (Schröder de Witt et al., 2019) follows the approach in Foerster et al. (2018), but uses common knowledge to coordinate agents during centralized training. Son et al. (2019) define QTRAN, which also has a centralized critic but uses a greedy actor w.r.t. a VDN-factorized function. The corresponding utility functions are distilled from the critic under constraints that ensure proper decentralization. Böhmer et al. (2019) present another approach to decentralize a centralized value function, which is locally maximized by coordinate ascent and decentralized by training IQL agents from the same replay buffer. Centralized joint Q-value functions do not allow parameter sharing to the same extent as value factorization, and we compare DCG to QTRAN to demonstrate the advantage in sample efficiency. That being said, DCG value factorization can in principle be applied to any of the above centralized critics to equally improve sample efficiency at the same cost of representational capacity. We leave this to future work.

Other work deals with very large numbers of agents, which requires additional assumptions to reduce the sample complexity. For example, Yang et al. (2018) introduce mean-field multi-agent learning (MF-MARL), which factorizes a tabular value function for hundreds of agents into pairwise payoff functions between neighbors in a uniform grid of agents. These payoffs share parameters similarly to DCG. Chen et al. (2018) introduce a value factorization for a similar setup based on a low-rank approximation of the joint value. This approach is restricted by uniformity assumptions between agents, but otherwise uses parameter sharing similar to DCG. The value function cannot be maximized globally and must be locally maximized with coordinate ascent. These techniques are designed for much larger sets of agents and do not perform well in the usual MARL settings considered in this paper. Although they use parameter sharing techniques similar to those of DCG, we therefore do not compare against them.

Coordination graphs (CG) have been extensively studied in multi-agent robotics with given payoffs (e.g. Rogers et al., 2011; Yedidsion et al., 2018). Sparse cooperative Q-learning (SCQL, Kok & Vlassis, 2006) uses CG in discrete state and action spaces by representing all utility and payoff functions as tables. However, the tabular approach restricts practical application of SCQL to tasks with few agents and small state and action spaces. Castellini et al. (2019) use neural networks to approximate payoff functions, but only in one-shot games, and still require a unique function for each edge in the CG. DCG expands greatly upon these works by introducing parameter sharing between all payoffs (as in VDN/QMIX), conditioning on local information (as in MF-MARL) and evaluating in more complex tasks that are vulnerable to relative overgeneralization.

3 Background

In this paper we assume a Dec-POMDP for $n$ agents (Oliehoek & Amato, 2016). $S$ denotes a finite or continuous set of environmental states and $A^i$ the discrete set of actions available to agent $i$. At discrete time $t$, the next state $s_{t+1} \in S$ is drawn from the transition kernel $P(\cdot \mid s_t, \mathbf{a}_t)$, conditioned on the current state $s_t \in S$ and the joint action $\mathbf{a}_t \in A := A^1 \times \dots \times A^n$ of all agents. A transition yields the collaborative reward $r_t := r(s_t, \mathbf{a}_t)$, and $\gamma \in [0, 1)$ denotes the discount factor. Each agent observes the state only partially by drawing observations $o^i_t$ from its observation kernel $\sigma^i(\cdot \mid s_t)$. The history of agent $i$’s observations and actions is in the following denoted as $\tau^i_t := (o^i_0, a^i_0, o^i_1, \dots, o^i_t)$. Without loss of generality, this paper restricts itself to episodic tasks, which yield episodes of varying (but finite) length $T$.

3.1 Deep Q-learning

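The goal of collaborative multi-agent reinforcement learning (MARL) is to find an optimal policy $\pi^*$ that chooses joint actions $\mathbf{a}_t \in A$ such that the expected discounted sum of future reward is maximized. This can be achieved by estimating the optimal Q-value function:

$$q^*(\mathbf{a} \mid s) := \mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{T-1} \gamma^t r_t \,\Big|\, s_0 = s, \mathbf{a}_0 = \mathbf{a}\Big] = r(s, \mathbf{a}) + \gamma \int P(s' \mid s, \mathbf{a}) \max_{\mathbf{a}'} q^*(\mathbf{a}' \mid s')\, ds'. \tag{1}$$

Here we overload the notation $f(y \mid x)$ to also indicate the inputs $x$ and outputs $y$ of multivariate functions $f$.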


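The optimal policy greedily chooses the action $\mathbf{a} \in A$ that maximizes the corresponding optimal Q-value $q^*(\mathbf{a} \mid s_t)$. In fully observable discrete state and action spaces, $q^*$ can be learned in the limit from interactions with the environment (Watkins & Dayan, 1992). For large or continuous state spaces, $q^*$ can only be approximated, e.g., with a deep neural network $q_\theta$ (DQN, Mnih et al., 2015), parameterized by $\theta$, by minimizing the mean-squared Bellman loss with gradient descent:

$$\mathcal{L}_{\text{DQN}} := \mathbb{E}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\Big(r_t + \gamma \max_{\mathbf{a}'} q_{\bar\theta}(\mathbf{a}' \mid s_{t+1}) - q_\theta(\mathbf{a}_t \mid s_t)\Big)^2\Big]. \tag{2}$$

The expectation is estimated with samples from an experience replay buffer holding previously observed episodes (Lin, 1992), and $\bar\theta$ denotes the parameter of a separate target network, which is periodically replaced with a copy of $\theta$ to improve stability. Double Q-learning further stabilizes training by choosing the next action greedily w.r.t. the current network $q_\theta$, i.e., $q_{\bar\theta}(\arg\max_{\mathbf{a}'} q_\theta(\mathbf{a}' \mid s_{t+1}) \mid s_{t+1})$ instead of the target network's maximum $\max_{\mathbf{a}'} q_{\bar\theta}(\mathbf{a}' \mid s_{t+1})$ (van Hasselt et al., 2016).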
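The difference between the standard bootstrap target and the Double Q-learning target can be sketched for a single transition as follows. This is a minimal illustration with plain Python lists standing in for network outputs; the function names and the explicit `terminal` flag are assumptions made for the example, not part of the paper.

```python
def dqn_target(reward, gamma, q_target_next, terminal=False):
    """Standard DQN target: maximize the target network's next-state values."""
    if terminal:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, terminal=False):
    """Double Q-learning: the online network picks the action, the target network scores it."""
    if terminal:
        return reward
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best]


def squared_bellman_error(q_value, target):
    """Per-transition loss term; the target is treated as a constant."""
    return (target - q_value) ** 2


# Hypothetical next-state values for three actions.
online_next = [1.0, 3.0, 2.0]   # current network
target_next = [0.5, 1.5, 4.0]   # target network
print(dqn_target(1.0, 0.5, target_next))
print(double_dqn_target(1.0, 0.5, online_next, target_next))
```

In the example the two networks disagree about the best next action, so the Double Q-learning target is lower than the standard one, which is the mechanism that reduces overestimation.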

In partially observable environments, the learned policy cannot condition on the state $s_t$. Instead, Hausknecht & Stone (2015) approximate a Q-function that conditions on the agent’s history $\tau_t$, i.e., $q_\theta(a \mid \tau_t)$, by conditioning a recurrent neural network (e.g., a GRU, Chung et al., 2014) on the agent’s observations $o_t$ and last actions $a_{t-1}$, that is, $q_\theta(a \mid h_t)$ conditions on the recurrent network’s hidden state $h_t := h_\psi(h_{t-1}, o_t, a_{t-1})$, where $h_0$ is initialized with zeros.

Applying DQN to multi-agent tasks quickly becomes infeasible, due to the combinatorial explosion of state and action spaces. Moreover, DQN value functions cannot be maximized without evaluating all possible actions. To allow MARL Q-learning with efficient maximization, various algorithms based on value factorization have been developed. We derive IQL (Tan, 1993), VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018) and QTRAN (Son et al., 2019) in Appendix A.1.

3.2 Coordination graphs

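An undirected coordination graph (CG, Guestrin et al., 2002a) $G = \langle V, E \rangle$ contains a vertex $v_i \in V$ for each agent $1 \le i \le n$ and a set of undirected edges $\{i, j\} \in E$ between vertices $v_i$ and $v_j$. The graph is usually specified before training, but Guestrin et al. (2002b) suggest that the graph could also depend on the state, that is, each state can have its own unique CG. A CG induces a factorization of the Q-function into utility functions $f^i$ and payoff functions $f^{ij}$ (Fig. 1a and 1b):

$$q^{\text{CG}}(s_t, \mathbf{a}) := \frac{1}{|V|}\sum_{v_i \in V} f^i(a^i \mid s_t) + \frac{1}{|E|}\sum_{\{i,j\} \in E} f^{ij}(a^i, a^j \mid s_t). \tag{3}$$

The normalizations $\frac{1}{|V|}$ and $\frac{1}{|E|}$ are not strictly necessary, but potentially allow generalization to other CGs.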


The special case $E = \emptyset$ yields VDN, but each additional edge enables the representation of the value of the actions of a pair of agents and can thus help to avoid relative overgeneralization. Prior work also considered higher order coordination where the payoff functions depend on arbitrary sets of actions (Guestrin et al., 2002a; Kok & Vlassis, 2006; Castellini et al., 2019), corresponding to graphs with hyper-edges (Figure 1c). For the sake of simplicity we restrict ourselves here to pairwise edges, which yield at most $\frac{1}{2}(n^2 - n)$ edges, in comparison to up to $\frac{n!}{d!\,(n-d)!}$ hyper-edges of degree $d$. The induced Q-function can be maximized locally using max-plus, also known as belief propagation (Pearl, 1988). At time $t$ each node $i$ sends messages $\mu^{ij}_t(a^j)$ over all adjacent edges $\{i, j\} \in E$, which can be computed locally:

$$\mu^{ij}_t(a^j) \leftarrow \max_{a^i}\Big\{\frac{1}{|V|} f^i(a^i \mid s_t) + \frac{1}{|E|} f^{ij}(a^i, a^j \mid s_t) + \sum_{\{k,i\} \in E} \mu^{ki}_t(a^i) - \mu^{ji}_t(a^i)\Big\}. \tag{4}$$


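This process repeats for a number of iterations, after which each agent $i$ can locally find the action $a^i_*$ that maximizes the estimated Q-value:

$$a^i_* := \arg\max_{a^i}\Big\{\frac{1}{|V|} f^i(a^i \mid s_t) + \sum_{\{k,i\} \in E} \mu^{ki}_t(a^i)\Big\}. \tag{5}$$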


Convergence of messages is guaranteed for acyclic CGs (Pearl, 1988; Wainwright et al., 2004), but messages can diverge in cyclic graphs, for example fully connected CGs. Subtracting a normalization constant from each message before it is sent often leads to convergence in practice (Murphy et al., 1999; Crick & Pfeffer, 2002; Yedidia et al., 2003). See Algorithm 1 in the appendix for details.
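The message passing described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: utilities and payoffs are given as nested lists, all agents share the same number of actions, the message normalization subtracts the mean of each message, and all function names are hypothetical. On an acyclic graph with a unique maximizer the result can be checked against brute-force enumeration.

```python
from itertools import product


def cg_value(f_v, f_e, actions):
    """Normalized sum of utilities and pairwise payoffs for one joint action."""
    n, m = len(f_v), len(f_e)
    value = sum(f_v[i][actions[i]] for i in range(n)) / n
    if m:
        value += sum(f[actions[i]][actions[j]] for (i, j), f in f_e.items()) / m
    return value


def max_plus(f_v, f_e, iterations):
    """Greedy joint action via max-plus message passing (exact on acyclic graphs)."""
    n, m, n_act = len(f_v), len(f_e), len(f_v[0])
    # Normalized payoff tables for both directions, indexed [sender action][receiver action].
    pay = {}
    for (i, j), f in f_e.items():
        pay[(i, j)] = [[f[a][b] / m for b in range(n_act)] for a in range(n_act)]
        pay[(j, i)] = [[f[a][b] / m for a in range(n_act)] for b in range(n_act)]
    mu = {edge: [0.0] * n_act for edge in pay}            # mu[(i, j)][a_j]
    senders = {i: [s for (s, r) in pay if r == i] for i in range(n)}
    for _ in range(iterations):
        new_mu = {}
        for (i, j) in pay:
            # Everything node i knows, except what j itself told i.
            local = [f_v[i][a] / n + sum(mu[(k, i)][a] for k in senders[i] if k != j)
                     for a in range(n_act)]
            msg = [max(local[a] + pay[(i, j)][a][b] for a in range(n_act))
                   for b in range(n_act)]
            mean = sum(msg) / n_act                       # keeps messages bounded on cycles
            new_mu[(i, j)] = [x - mean for x in msg]
        mu = new_mu
    return [max(range(n_act),
                key=lambda a, i=i: f_v[i][a] / n + sum(mu[(k, i)][a] for k in senders[i]))
            for i in range(n)]


def brute_force(f_v, f_e):
    """Reference maximizer that enumerates every joint action."""
    n_act = len(f_v[0])
    return max(product(range(n_act), repeat=len(f_v)), key=lambda a: cg_value(f_v, f_e, a))


# Three agents on a line graph 0 - 1 - 2 with two actions each (made-up numbers).
f_v = [[0.0, 0.3], [0.2, 0.0], [0.0, 0.1]]
f_e = {(0, 1): [[1.0, 0.0], [0.0, 2.0]], (1, 2): [[0.5, 0.0], [0.0, 3.0]]}
print(max_plus(f_v, f_e, iterations=3), brute_force(f_v, f_e))
```

Passing an empty edge dictionary reduces the procedure to independent maximization of each utility, which corresponds to the VDN special case.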

4 Method

We now introduce the deep coordination graph (DCG), which learns the utility and payoff functions of a coordination graph with deep neural networks. A direct implementation as in Castellini et al. (2019) would learn a separate network for each function $f^i$ and $f^{ij}$. However, properly approximating these Q-values requires observing the joint actions of each agent pair in the edge set $E$, which for dense graphs can be a significant subset of the joint action space $A$ of all agents. We address this issue by focusing on an architecture that shares parameters across functions and restricts them to locally available information, i.e., to the histories of the participating agents.

Sunehag et al. (2018) introduce parameter sharing between the agents’ utility functions to dramatically improve the sample efficiency of VDN. Agents can have different action spaces but the choice of unavailable actions during maximization can be prevented by setting the utilities of unavailable actions to $-\infty$. Specialized roles for individual agents can be achieved by conditioning on the agent’s role, or more generally on the agent’s ID (Foerster et al., 2018; Rashid et al., 2018). We extend this approach with payoff functions specified by pairwise edges in a given CG (Guestrin et al., 2002a). We take inspiration from highly scalable methods (Yang et al., 2018; Chen et al., 2018) and improve over SCQL (Kok & Vlassis, 2006) and the approach of Castellini et al. (2019) by incorporating the following design principles:

  i. restricting the payoffs to local information of agents $i$ and $j$ only;

  ii. sharing parameters between all payoff and utility functions through a common RNN;

  iii. allowing transfer/generalization to different CG (as suggested in Guestrin et al., 2002b).

Restricting the payoff’s input (i) and sharing parameters (ii) improves sample efficiency significantly. As in Sunehag et al. (2018), all utilities are computed with the same neural network $f^v_\theta(a^i \mid h^i_t)$, but unlike Castellini et al. (2019), all payoffs are computed with the same neural network $f^e_\phi(a^i, a^j \mid h^i_t, h^j_t)$, too. Both share parameters through a common RNN $h^i_t := h_\psi(h^i_{t-1}, o^i_t, a^i_{t-1})$, whose hidden state $h^i_0$ is initialized with zeros.

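Generalization (or zero-shot transfer) of the learned functions onto new CGs in (iii) poses some practical design challenges. To be applicable to different graphs/topologies, DCG must be invariant to reshuffling of agent indices. This requires the payoff matrix $f^{ij}$, of dimensionality $|A^i| \times |A^j|$, to be the same as $(f^{ji})^\top$ with swapped inputs. We enforce invariance by computing the function $f^e_\phi$ for both combinations and using the average of the two. Note that this retains the ability to learn asymmetric payoff matrices $f^{ij} \neq (f^{ij})^\top$. However, this paper does not evaluate (iii) and we leave the transfer of a learned DCG onto different graphs to future work. The DCG Q-value function is:

$$q^{\text{DCG}}_{\theta\phi\psi}(\boldsymbol{\tau}_t, \mathbf{a}) := \frac{1}{|V|}\sum_{i=1}^{n} f^v_\theta(a^i \mid h^i_t) + \frac{1}{2|E|}\sum_{\{i,j\} \in E}\Big(f^e_\phi(a^i, a^j \mid h^i_t, h^j_t) + f^e_\phi(a^j, a^i \mid h^j_t, h^i_t)\Big). \tag{6}$$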
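The index-invariance argument can be made concrete with a small sketch. Here `toy_payoff_net` is a hypothetical stand-in for the learned payoff network (scalars replace the hidden states, and three actions per agent are assumed); only the averaging over both input orders reflects the construction described above.

```python
def transpose(matrix):
    """Swap the two action axes of a payoff matrix."""
    return [list(col) for col in zip(*matrix)]


def symmetrized_payoff(payoff_net, h_i, h_j):
    """Average the network over both agent orders; result is indexed [a_i][a_j]."""
    forward = payoff_net(h_i, h_j)               # indexed [a_i][a_j]
    backward = transpose(payoff_net(h_j, h_i))   # computed as [a_j][a_i], then transposed
    return [[0.5 * (f + b) for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward, backward)]


def toy_payoff_net(h_first, h_second, n_actions=3):
    # Hypothetical order-dependent function standing in for a trained network.
    return [[h_first * a * a + h_second * b for b in range(n_actions)]
            for a in range(n_actions)]


f_ij = symmetrized_payoff(toy_payoff_net, 0.3, -1.2)
f_ji = symmetrized_payoff(toy_payoff_net, -1.2, 0.3)
print(f_ij == transpose(f_ji))   # swapping the agents only transposes the matrix
print(f_ij == transpose(f_ij))   # the matrix itself need not be symmetric
```

The first comparison holds for any `payoff_net`, because both orders contribute the same two terms; the second shows that averaging does not force the payoff between two different agents to be symmetric in their actions.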


We train DCG end-to-end with the DQN loss in (2) and Double Q-learning (van Hasselt et al., 2016). Given the tensors $f^V \in \mathbb{R}^{|V| \times A}$ and $f^E \in \mathbb{R}^{|E| \times A \times A}$, $A := |\cup_i A^i|$, where all unavailable actions are set to $-\infty$, the Q-value can be maximized by message passing as defined in (4) and (5). The detailed procedure is shown in Algorithm 1 in the appendix. No gradients flow through the message passing loop, as DQN maximizes only the bootstrapped future value.

The key benefit of DCG lies in its ability to prevent relative overgeneralization during the exploration of agents: take the example of two hunters who have cornered their prey. The prey is dangerous and attempting to catch it alone can lead to serious injuries. From the perspective of each hunter, the expected reward for an attack depends on the actions of the other agent, who will initially behave randomly. If the punishment for attacking alone outweighs the reward for catching the prey, agents that cannot represent the value for joint actions (QMIX, VDN, IQL) cannot learn the optimal policy. However, estimating a value function over the joint action space (as in QTRAN) can be equally prohibitive, as it requires many more samples for the same prediction quality. DCG provides a flexible function class between these extremes that can be tailored to the task at hand.

5 Validation

Table 1: Tested graph topologies for DCG.

In this section we compare the performance of DCG with various topologies (see Table 1) to the state-of-the-art algorithms QTRAN (Son et al., 2019), QMIX (Rashid et al., 2018), VDN (Sunehag et al., 2018) and IQL (Tan, 1993). All algorithms are implemented in the multi-agent framework pymarl (Samvelyan et al., 2019).
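The contents of Table 1 did not survive in this copy of the text. As a rough guide, the sketch below generates edge sets for the topology names used in this section, assuming their usual graph-theoretic meaning; the exact definitions used in the experiments may differ, and the function name and the "VDN" label for the empty graph are illustrative.

```python
from itertools import combinations


def cg_edges(topology: str, n_agents: int):
    """Undirected edge list (pairs of agent indices) for a named topology."""
    if topology == "FULL":      # every pair of agents shares a payoff
        return list(combinations(range(n_agents), 2))
    if topology == "LINE":      # agent i is connected to agent i + 1
        return [(i, i + 1) for i in range(n_agents - 1)]
    if topology == "CYCLE":     # a line whose last agent connects back to the first
        return [(i, (i + 1) % n_agents) for i in range(n_agents)]
    if topology == "STAR":      # agent 0 is connected to all others
        return [(0, i) for i in range(1, n_agents)]
    if topology == "VDN":       # no edges: utilities only
        return []
    raise ValueError(f"unknown topology: {topology}")


for name in ("FULL", "CYCLE", "LINE", "STAR", "VDN"):
    print(name, len(cg_edges(name, 8)))
```

For eight agents the fully connected graph has 28 pairwise edges, whereas the sparse topologies have at most eight, so most agent pairs share no payoff function in the sparse cases.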

We evaluate these methods in two complex grid-world tasks: the first formulates the relative overgeneralization problem as a family of predator-prey tasks and the second investigates how artificial decentralization can hurt tasks that demand non-local coordination between agents. In the latter case, decentralized value functions (QMIX, VDN, IQL) cannot learn coordinated action selection between agents that cannot see each other directly and thus converge to a sub-optimal policy.

5.1 Relative overgeneralization

To model the challenge of relative overgeneralization, we consider a partially observable grid-world predator-prey task: 8 agents have to hunt 8 prey in a grid. Each agent can either move in one of the 4 compass directions, remain still, or try to catch any adjacent prey. Impossible actions, that is, moves into an occupied target position or catching when there is no adjacent prey, are treated as unavailable. The prey moves by randomly selecting one available movement or remains motionless if all surrounding positions are occupied. If two adjacent agents execute the catch action, a prey is caught and both the prey and the catching agents are removed from the grid. An agent’s observation is a sub-grid centered around it, with one channel showing agents and another indicating prey. Removed agents and prey are no longer visible and removed agents receive a special observation of all zeros. An episode ends if all agents have been removed or after a fixed maximum number of time steps. Capturing a prey yields a positive reward, but unsuccessful attempts by single agents are punished by a negative reward. The task is similar to one proposed by Son et al. (2019), but significantly more complex, both in terms of the optimal policy and in the number of agents.

Figure 2: Influence of the punishment for attempts to catch prey alone on greedy test episode return (mean and shaded standard error over 8 seeds) in a coordination task where 8 agents hunt 8 prey (dotted line denotes best possible return). Note that fully connected DCG (solid) is able to represent the value of joint actions and coordinate maximization, which leads to better performance for larger punishment, where DCG without edges (VDN, dashed) eventually fails.

Figure 3: Greedy test episode return for the coordination task of Figure 2 with strong punishment: (a) comparison to baseline algorithms; (b) comparison between DCG topologies. Note that QMIX, IQL and VDN (dashed) do not solve the task (return 0) due to relative overgeneralization and that QTRAN learns very slowly due to the large action space. The reliability of DCG depends on the CG-topology: all seeds with fully connected DCG solved the task, but the high standard error for CYCLE, LINE and STAR topologies is caused by some seeds succeeding while others fail completely.

To demonstrate the effect of relative overgeneralization, Figure 2 shows the average return of greedy test episodes for varying punishment as mean and standard error over 8 independent runs. Without punishment (Figure 2a), fully connected DCG (solid) performs as well as DCG without edges (VDN, dashed). However, for stronger punishment VDN becomes more and more unreliable, which is visible in the large standard errors in Figures 2b and 2c, until it fails completely in Figure 2d. This is due to relative overgeneralization, as VDN cannot represent the values of joint actions during exploration. DCG, on the other hand, learns only slightly slower with punishment and otherwise converges reliably to the optimal solution (dotted line).

Figure 3a shows how well DCG performs in comparison with the baseline algorithms in Appendix A.1 under strong punishment. Note that QMIX, IQL and VDN completely fail to learn the task (return 0) due to their restrictive value factorization. QTRAN estimates the values with a centralized function, which conditions on all agents’ actions, and can therefore learn the task. However, QTRAN requires many samples before a useful policy can be learned, due to the size of the joint action space. This is in line with the findings of Son et al. (2019), who required many more samples to learn a task with four agents than with two. In this light, fully connected DCG learns near-optimal policies remarkably quickly and reliably.

We also investigated the performance of various DCG topologies defined in Table 1. Figure 3b shows that in particular the reliability of the achieved test episode return depends strongly on the graph topology. While all seeds of fully connected DCG succeed, DCG with CYCLE, LINE and STAR topologies have varying means while exhibiting large standard errors. The high deviations are caused by some runs finding near-optimal policies, while others fail completely (return 0). One possible explanation is that for the failed seeds the rewarded experiences, observed in the initial exploration, are only amongst agents that do not share a payoff function. Due to the relative overgeneralization pathology, the learned greedy policy no longer explores ‘catch’ actions and existing payoff functions cannot experience the reward for coordinated actions anymore. It is therefore not surprising that fully connected graphs perform best, as they represent the largest function class and require the fewest assumptions. The topology also had little influence on the runtime of DCG, due to efficient batching on the GPU. The tested fully connected DCG only considers pairwise edges. Hyper-edges between more than two agents (Figure 1c) would yield even richer value representations, but would also require more samples to sufficiently approximate the payoff functions. This effect can be seen in the much slower learning of QTRAN in Figure 3a.

5.2 Artificial decentralization

Figure 4: Greedy test episode return (mean and shaded standard error, [number of seeds]) in a non-decentralizable task where 8 agents hunt 8 prey: (a) comparison to baseline algorithms; (b) comparison between DCG topologies. The prey turns randomly into punishing ghosts, which are indistinguishable from normal prey. The prey status is only visible at an indicator that is placed randomly at each episode in one of the grid’s corners. QTRAN, QMIX, IQL and VDN learn decentralized policies, which are at best suboptimal in this task (around lower dotted line). Fully connected DCG can learn a near-optimal policy (upper dotted line denotes best possible return).

The choice of decentralized value functions is in most cases motivated by the huge joint action spaces and not because the task actually requires decentralized execution: it is an artificial decentralization. While this often works surprisingly well, we want to investigate how existing algorithms deal with tasks that cannot be fully decentralized. One obvious case in which decentralization must fail is when the optimal policy cannot be represented by utility functions alone. For example, decentralized policies behave suboptimally in tasks where the optimal policy must condition on multiple agents’ observations in order to achieve the best return. Payoff functions in DCG, on the other hand, condition on pairs of agents and can thus represent a richer class of policies. Note that dependencies on more agents can be modeled as hyper-edges in the DCG (Figure 1c), but this hurts the sample efficiency as discussed above.

We evaluate the advantage of a richer policy class with a variation of the above predator-prey task. Inspired by the video game Pac-Man, at each turn a fair coin flip decides randomly whether all prey are turned into dangerous ghosts. To disentangle the effects of relative overgeneralization, prey can be caught by only one agent (without punishment), yielding a positive reward. However, if the agent captures a ghost, the team is punished with a negative reward. Ghosts are indistinguishable from normal prey, except for a special indicator that is placed in a random corner at the beginning of each episode. The indicator signals on an additional channel of the agents’ observations whether the prey are currently ghosts. Due to the short visibility range of the agents, the indicator is only visible in one of the positions closest to its corner.

Figure 4a shows the performance of QTRAN, QMIX, IQL and VDN, all of which have decentralized policies, in comparison to fully connected DCG (DCG). The baseline algorithms have to learn a policy that first identifies the location of the indicator and then herds prey into that corner, where the agent is finally able to catch it without risk. By contrast, DCG can learn a policy where one agent finds the indicator, allowing all agents that share an edge to condition their payoffs on that agent’s current observation. As a result, this policy can catch prey much more reliably, as seen in the high performance of DCG compared to all baseline algorithms. We also investigate the influence of DCG topologies shown in Table 1 on the performance, shown in Figure 4b. Note that while other topologies do not reach the same performance as fully connected DCG, they still reach a policy (around middle dotted line) that significantly outperforms all baseline algorithms.

6 Conclusions & Future Work

This paper introduced the deep coordination graph (DCG), an architecture for value factorization that is specified by a coordination graph and can be maximized by message passing. We evaluated deep Q-learning with DCG and showed that the architecture enables learning of tasks where relative overgeneralization causes all decentralized baselines to fail, whereas centralized critics are much less sample efficient than DCG. We also demonstrated that artificial decentralization can lead to suboptimal behavior in all compared methods except DCG. Fully connected DCG performed best in all experiments and should be preferred in the absence of prior knowledge about the task. Although not evaluated in this paper, DCG should be able to transfer/generalize to different graphs/topologies and can also be defined for higher-order dependencies. This would in principle allow the training of DCG on dynamically generated graphs, including hyper-edges with varying degrees, which we plan to investigate in future work.


Appendix A

A.1 Baseline algorithms


Independent Q-learning (IQL, Tan, 1993) is a straightforward approach to value decentralization that allows efficient maximization by modeling each agent as an independent DQN $q^i_\theta(a^i \mid \tau^i_t)$. The value functions can be trained without any knowledge of other agents, which are assumed to be part of the environment. This violates the stationarity assumption of the transition kernel $P$ and training can therefore become unstable (see e.g. Foerster et al., 2017). IQL is nonetheless widely used in practice, as parameter sharing between agents can make it very sample efficient.

Note that parameter sharing requires access to privileged information during training, a setting called centralized training and decentralized execution (Foerster et al., 2016). This is particularly useful for actor-critic methods like MADDPG (Lowe et al., 2017), multi-agent soft Q-learning (Wei et al., 2018), COMA (Foerster et al., 2018) and MACKRL (Schröder de Witt et al., 2019), where the centralized critic can condition on the underlying state and the joint action.


Another way to exploit centralized training is value function factorization. For example, value decomposition networks (VDN, Sunehag et al., 2018) perform centralized deep Q-learning on a joint Q-value function that factors as the sum of independent utility functions $f^i$, one for each agent $i$:

$$Q^{\text{VDN}}(\boldsymbol{\tau}_t, \boldsymbol{a}) := \sum_{i=1}^{n} f^i(a^i \mid \tau_t^i),$$

where $a^i$ is the action of agent $i$ and $\tau_t^i$ its history. This value function can be maximized by maximizing each agent’s utility independently.
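The maximization property of a sum of per-agent utilities can be checked on a toy example. The following is only a minimal sketch: the utility table and the helper names `vdn_greedy` and `brute_force_greedy` are hypothetical and not taken from the paper or from any framework.

```python
from itertools import product


def vdn_greedy(utilities):
    """Pick each agent's action independently by maximizing its own utility.

    utilities[i][a] is an illustrative utility of agent i for action a.
    Returns the joint action and the summed (VDN-style) joint value.
    """
    actions = [max(range(len(u)), key=u.__getitem__) for u in utilities]
    value = sum(u[a] for u, a in zip(utilities, actions))
    return actions, value


def brute_force_greedy(utilities):
    """Maximize the same summed value by enumerating every joint action."""
    def joint_value(joint_action):
        return sum(u[a] for u, a in zip(utilities, joint_action))

    joint_actions = product(*(range(len(u)) for u in utilities))
    best = max(joint_actions, key=joint_value)
    return list(best), joint_value(best)


# Hypothetical utilities for three agents with 3, 2 and 3 actions.
utilities = [[0.1, 0.7, 0.2], [0.5, 0.4], [0.0, 0.3, 0.9]]
actions, value = vdn_greedy(utilities)
```

The decentralized maximization needs one pass over each agent's actions, whereas the brute-force search grows exponentially with the number of agents; both return the same joint action for a sum of utilities.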


QMIX (Rashid et al., 2018) improves upon this concept by factoring the value function as

$$Q^{\text{QMIX}}(s_t, \boldsymbol{\tau}_t, \boldsymbol{a}) := \varphi\big(s_t, f^1(a^1 \mid \tau_t^1), \ldots, f^n(a^n \mid \tau_t^n)\big).$$

Here $\varphi$ is a monotonic mixing hypernetwork with non-negative weights that retains monotonicity in its inputs $f^i$, the utility of each agent $i$. Maximizing each utility therefore also maximizes the joint value $Q^{\text{QMIX}}$, as in VDN. The mixing parameters are generated by a separately parameterized neural network that conditions on the state $s_t$, allowing different mixing of utilities in different states. QMIX improves performance over VDN, in particular in StarCraft II micromanagement tasks (SMAC, Samvelyan et al., 2019).
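Why non-negative mixing weights preserve decentralized maximization can be illustrated with a small sketch. The weights below are fixed, made-up numbers, whereas in QMIX they would be produced by a hypernetwork from the state; the functions `elu` and `mix` and all constants are assumptions for illustration only.

```python
import math
from itertools import product


def elu(x):
    """ELU activation, which is monotonically increasing."""
    return x if x > 0 else math.exp(x) - 1.0


def mix(utils, w1, b1, w2, b2):
    """Two-layer mixing network; abs() keeps every weight non-negative."""
    hidden = [
        elu(sum(abs(w) * u for w, u in zip(row, utils)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


# Hypothetical mixing parameters (signs are irrelevant because of abs()).
W1 = [[0.5, -1.2], [2.0, 0.3]]
B1 = [-1.0, 0.1]
W2 = [-0.7, 1.5]
B2 = 0.2

# Hypothetical utilities for two agents with 3 and 2 actions.
utilities = [[0.2, -0.4, 1.0], [0.6, 0.9]]


def joint_value(joint_action):
    chosen = [u[a] for u, a in zip(utilities, joint_action)]
    return mix(chosen, W1, B1, W2, B2)


# Each agent maximizes its own utility ...
decentralized = tuple(max(range(len(u)), key=u.__getitem__) for u in utilities)
# ... which coincides with the maximum of the mixed joint value.
centralized = max(product(*(range(len(u)) for u in utilities)), key=joint_value)
```

Because every weight is non-negative and the activation is increasing, raising any single utility can never lower the mixed value, so the per-agent argmax is also a joint argmax.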


Recently Son et al. (2019) introduced QTRAN, which learns the centralized critic of a greedy policy w.r.t. a VDN factorized function, which in turn is distilled from the critic by regression under constraints. The algorithm defines three value functions , and , where is the centralized Q-value function, as in Section 3.1, and


They prove that the greedy policies w.r.t.  and are identical under the constraints:


with strict equality if and only if   . QTRAN minimizes the parameters of the centralized asymmetric value , , for each agent (which is similar to Foerster et al., 2018) with the combined loss :


where denotes what a greedy decentralized agent would have chosen. The decentralized value and the greedy difference , with parameters and respectively, are distilled by regression of the each in the constraints. First the equality constraint:


where the ‘detach’ operator stops the gradient flow through . The inequality constraints are more complicated. In principle one would have to compute a loss for every action which has a negative error in (12). Son et al. (2019) suggest to use only the action which minimizes :


We use this loss, which is called QTRAN-alt, as it is reported to perform significantly better. The losses are combined to  , with .

A.2 Hyper-parameters

All algorithms are implemented in the pymarl framework (Samvelyan et al., 2019). We aimed to keep the hyper-parameters close to those given in the framework and consistent for all algorithms.

All tasks used discount factor and ε-greedy exploration, which was linearly decayed from to within the first time steps. Every 2000 time steps we evaluated 20 greedy test trajectories with . Results are plotted by first applying histogram-smoothing (100 bins) to each seed, and then computing the mean and standard error between seeds.
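A linearly decayed ε-greedy schedule of the kind described here could be sketched as follows. The function names and every numeric value in the usage lines are placeholders chosen for illustration; they are not the hyper-parameters used in the experiments.

```python
import random


def linear_epsilon(step, eps_start, eps_end, decay_steps):
    """Interpolate linearly from eps_start to eps_end, then hold eps_end."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Illustrative values only: decay from 1.0 to 0.1 over 100 steps.
eps_mid = linear_epsilon(50, eps_start=1.0, eps_end=0.1, decay_steps=100)
# Greedy evaluation corresponds to epsilon = 0, so no random actions occur.
test_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```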

All methods are based on agents’ histories, which were individually summarized by conditioning a linear layer of 64 neurons on the current observation and previous action, followed by a ReLU activation and a GRU (Chung et al., 2014) of the same dimensionality. Both layers’ parameters are shared amongst agents, which can be identified by a one-hot encoded ID in the input. Independent value functions (for IQL), utility functions (for VDN/QMIX/QTRAN/DCG) and payoff functions (for DCG) are linear layers from the GRU output to the corresponding number of actions. The hyper-network of QMIX produces a mixing network with two layers connected with an ELU activation function, where the weights of each mixing-layer are generated by a linear hyper-layer with 32 neurons conditioned on the global state, that is, the full grid-world. For QTRAN, the critic computes the Q-value for an agent by taking all agents’ GRU outputs, all other agents’ one-hot encoded actions, and the one-hot encoded agent ID as input. The critic contains four successive linear layers with 64 neurons each and ReLU activations between them. The greedy difference also conditions on all agents’ GRU outputs and uses three successive linear layers with 64 neurons each and ReLU activations between them. We took the loss parameters from Son et al. (2019) without any hyper-parameter exploration.

All algorithms were trained with one RMSprop gradient step after each observed episode based on a batch of 32 episodes, which always contains the newest, from a replay buffer holding the last 500 episodes. The optimizer uses learning rate 0.0005,

and . Gradients with a norm were clipped. The target network parameters were replaced by a copy of the current parameters every 200 episodes.
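The sampling scheme described above (a bounded episode buffer whose batches always include the newest episode) might look like the sketch below. The class `EpisodeReplayBuffer` is a hypothetical illustration and not the pymarl implementation.

```python
import random
from collections import deque


class EpisodeReplayBuffer:
    """Keeps only the most recent `capacity` episodes."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest episode automatically.
        self.episodes = deque(maxlen=capacity)

    def add(self, episode):
        self.episodes.append(episode)

    def __len__(self):
        return len(self.episodes)

    def sample(self, batch_size, rng=random):
        """Return a batch that always contains the newest episode."""
        if not self.episodes:
            raise ValueError("cannot sample from an empty buffer")
        newest = self.episodes[-1]
        older = list(self.episodes)[:-1]
        # Fill the rest of the batch with distinct older episodes.
        k = min(batch_size - 1, len(older))
        return [newest] + rng.sample(older, k)


# Usage with the sizes mentioned in the text: 500 episodes, batches of 32.
buffer = EpisodeReplayBuffer(capacity=500)
for episode_id in range(600):  # integers stand in for real episodes
    buffer.add(episode_id)
batch = buffer.sample(32)
```

Sampling without replacement from the older episodes is a design choice of this sketch; the text does not say whether duplicates are allowed within a batch.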

function greedy()
    initialize forward and backward messages
    initialize “Q-value” without messages
    for each message pass do
        for each edge do                        (update forward and backward messages)
            forward message: maximize over the sender
            backward message: maximize over the receiver
            if message_normalization then       (to ensure converging messages)
                normalize forward message
                normalize backward message
        for each agent do                       (update “Q-value” with messages)
            “Q-value”: utility plus incoming messages
            select greedy action of agent
    return actions that maximize the joint Q-value
Algorithm 1 Greedy action selection with message passes in a coordination graph.
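The steps of Algorithm 1 can be sketched in plain Python as a standard max-plus message-passing routine on a pairwise graph. This is an assumption-laden illustration rather than the authors' implementation: the joint value is taken to be the unscaled sum of all utilities and payoffs, normalization subtracts the mean of each message, and the data layout (`utilities` as nested lists, `payoffs` as a dict keyed by edge) is hypothetical.

```python
def greedy(utilities, payoffs, num_passes, normalize=True):
    """Greedy joint action by max-plus message passing.

    utilities[i][a]       : utility of agent i for action a
    payoffs[(i, j)][a][b] : payoff of edge (i, j) for actions a (of i), b (of j)
    """
    # Messages forward (i -> j) and backward (j -> i), initialized to zero.
    fwd = {e: [0.0] * len(utilities[e[1]]) for e in payoffs}
    bwd = {e: [0.0] * len(utilities[e[0]]) for e in payoffs}
    # "Q-value" without messages.
    q = [list(u) for u in utilities]

    for _ in range(num_passes):
        new_fwd, new_bwd = {}, {}
        for (i, j), f in payoffs.items():
            e = (i, j)
            n_i, n_j = len(q[i]), len(q[j])
            # Forward: maximize over the sender's action, excluding
            # what the receiver previously sent back along this edge.
            new_fwd[e] = [max(q[i][a] - bwd[e][a] + f[a][b] for a in range(n_i))
                          for b in range(n_j)]
            # Backward: maximize over the receiver's action.
            new_bwd[e] = [max(q[j][b] - fwd[e][b] + f[a][b] for b in range(n_j))
                          for a in range(n_i)]
            if normalize:
                # Subtracting the mean keeps messages bounded on cyclic graphs.
                mean_f = sum(new_fwd[e]) / n_j
                mean_b = sum(new_bwd[e]) / n_i
                new_fwd[e] = [m - mean_f for m in new_fwd[e]]
                new_bwd[e] = [m - mean_b for m in new_bwd[e]]
        fwd, bwd = new_fwd, new_bwd

        # "Q-value": utility plus all incoming messages.
        q = [list(u) for u in utilities]
        for (i, j) in payoffs:
            q[j] = [v + m for v, m in zip(q[j], fwd[(i, j)])]
            q[i] = [v + m for v, m in zip(q[i], bwd[(i, j)])]

    # Each agent selects the action that maximizes its own "Q-value".
    return [max(range(len(qi)), key=qi.__getitem__) for qi in q]


# Hypothetical three-agent chain 0 - 1 - 2 with two actions per agent.
utilities = [[0.0, 0.2], [0.1, 0.0], [0.0, 0.3]]
payoffs = {
    (0, 1): [[1.0, 0.0], [0.0, 0.5]],
    (1, 2): [[0.8, 0.0], [0.0, 1.5]],
}
no_messages = greedy(utilities, payoffs, num_passes=0)
with_messages = greedy(utilities, payoffs, num_passes=3)
```

In this example the utilities alone favor action 0 for the middle agent, while the joint optimum of utilities plus payoffs is action 1 for every agent; the message passes recover that joint action. On trees such as this chain, max-plus is exact once the number of passes reaches the graph diameter, whereas on graphs with cycles it is only an approximation.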