The Representational Capacity of Action-Value Networks for Multi-Agent Reinforcement Learning

02/20/2019 · Jacopo Castellini, et al. · University of Oxford · University of Liverpool · Delft University of Technology

Recent years have seen the application of deep reinforcement learning techniques to cooperative multi-agent systems, with great empirical success. However, given the lack of theoretical insight, it remains unclear what the employed neural networks are learning, or how we should enhance their representational power to address the problems on which they fail. In this work, we empirically investigate the representational power of various network architectures on a series of one-shot games. Despite their simplicity, these games capture many of the crucial problems that arise in the multi-agent setting, such as an exponential number of joint actions or the lack of an explicit coordination mechanism. Our results quantify how well various approaches can represent the requisite value functions, and help us identify issues that can impede good performance.




1 Introduction

In future applications, intelligent agents will cooperate and/or compete as part of multi-agent systems (MASs) [30, 15, 33, 7]. Multi-agent reinforcement learning (MARL) uses RL to solve such problems and can lead to flexible and robust solutions [2], and recently, a variety of work [28, 24, 4, 17] has successfully applied deep MARL techniques. These approaches have shown good results, but given the lack of theoretical insight, it remains unclear what these neural networks are learning, or how we should enhance their representational power to address the problems on which they fail.

In this paper, we focus on value-based MARL approaches for cooperative MASs. Value-based single-agent RL methods use (deep) neural networks to represent the action-value function Q, either to select actions directly [19] or as a ‘critic’ in an actor-critic scheme [20, 16]. A straightforward way to extend such methods to the multi-agent setting is to simply replace the action a by the joint action (a_1, …, a_n) of all agents. However, this approach relies heavily on the function approximation abilities of the neural network, since it must generalize across a discrete joint action space whose size is exponential in the number of agents. Moreover, selecting a joint action that maximizes the Q-function requires that, as in deep Q-networks [19], the (now joint) actions be output nodes of the network. As a result, the computational and sample costs scale poorly in the number of agents.

Another approach to extending single-agent RL methods to MASs is to apply them to each agent independently. This improves scalability at the expense of quality, e.g., individual deep Q-learners may not be able to accurately represent the value of coordination. Furthermore, the environment becomes non-stationary from the perspective of a single agent and thus, unsurprisingly, their learning process may not converge [3, 29, 32].

A middle ground is to learn factored Q-value functions [9, 11], which represent the joint value but decompose it as the sum of a number of local components, each involving only a subset of the agents. Compared to independent learning, a factored approach can better represent the value of coordination and does not introduce non-stationarity. Compared to a naive joint approach, it scales better in the number of agents. Recently, factored approaches have shown success in deep MARL [27, 24].

This paper examines the representational capacity of these various approaches by studying the accuracy of the learned Q-function approximations Q̂. We consider the optimality of the greedy joint action, which is important when using Q̂ to select actions. We also consider the distance to the optimal value, as verifying the optimality of the greedy action requires bounding the approximation error ‖Q − Q̂‖. Furthermore, minimising this error is important for deriving good policy gradients in actor-critic architectures and for sequential value estimation in any approach (such as Q-learning) that relies on bootstrapping.

However, to minimise confounding factors, we focus on one-shot (i.e., non-sequential) problems. Specifically, we investigate the representational power of various network architectures on a series of one-shot games that require a high level of coordination. Despite their simplicity, these games capture many of the crucial problems that arise in the multi-agent setting, such as an exponential number of joint actions. While good performance in such one-shot settings does not necessarily imply good performance in the sequential setting, the converse certainly holds: any limitations we find in one-shot settings imply even greater limitations in the corresponding sequential settings. Thus, assessing the accuracy of various representations in one-shot problems is a key step towards understanding and improving deep MARL techniques.

2 Background

2.1 One-Shot Games

In this work, we focus on one-shot games, which have no notion of environment state. The model consists of a tuple ⟨D, {A_i}, {Q_i}⟩, where D = {1, …, n} is the set of agents, A_i is the set of individual actions for agent i (A = A_1 × … × A_n is the joint action set), and {Q_i} is the set of reward functions (we write Q for the reward function in the one-shot problem to make the link with sequential MARL more apparent). Each Q_i depends only on the joint action performed by the team of agents and expresses how much reward agent i gets from the overall team decision.

A cooperative one-shot game is a game in which all agents share the same reward function Q(a), so that the goal of the team is to maximize this shared reward by finding the optimal joint action to perform. In this work, we focus on cooperative games. Our aim is to investigate the representations of the action-value function obtained with various neural network approaches, and how close these are to the original function. We do not investigate the learning of an equilibrium strategy for the agents to exploit, as is typically considered in work on repeated games.

2.2 Coordination Graphs

In many problems, the decision of an agent is directly influenced by the decisions of only a small subset of the other agents [9]. This locality of interaction means the joint action-value function can be represented as the sum of smaller reward functions, one for each factor e:

Q(a) = Σ_{e=1}^{C} Q_e(a_e),

where C is the number of factors and a_e is the local joint action of the agents that participate in factor e.

The structure of the interactions between the agents can be represented with a (hyper-)graph called a coordination graph [14]. A coordination graph has a node for each agent in the team and (hyper-)edges connecting agents that appear in the same factor. Figure LABEL:fig:graphs shows some example coordination graphs. Coordination graphs are a useful instrument for representing interactions between agents, and many algorithms exploit such structures, requiring good approximations of the action-value function in order to efficiently select a maximizing joint action, e.g., variable elimination [9] or max-sum [25, 14].

There are many cases in which the problem itself is not perfectly factored according to an exploitable coordination graph. In these cases, however, it can still be useful to resort to an approximate factorization [11]:

Q̂(a) = Σ_{e=1}^{C} Q̂_e(a_e),

obtained by decomposing the original function into a desired number of local approximate terms Q̂_e(a_e), thus forming an approximation Q̂(a) of the original action-value function Q(a).
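To make the factored representation concrete, here is a minimal sketch in which small payoff tables (the values are hypothetical) stand in for the learned local components, and the joint value is evaluated as their sum:

```python
import itertools

# Hypothetical example: 4 agents with 2 actions each, decomposed into
# two pairwise factors over agents (0, 1) and (2, 3).
factors = [(0, 1), (2, 3)]               # agent indices in each factor
local_q = [
    {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0},  # Q_1(a_1, a_2)
    {(0, 0): 0.5, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.5},  # Q_2(a_3, a_4)
]

def q_hat(joint_action):
    """Approximate joint value: the sum of the local factor values."""
    return sum(q[tuple(joint_action[i] for i in scope)]
               for scope, q in zip(factors, local_q))

# Greedy joint action by brute-force enumeration (feasible only when the
# joint action space is small; coordination-graph algorithms such as
# variable elimination or max-sum avoid this exhaustive search).
best = max(itertools.product((0, 1), repeat=4), key=q_hat)
```

Here q_hat((0, 0, 0, 1)) evaluates to 3.0, the sum of the two local entries.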

3 Action-Value Functions for MARL

Current deep MARL approaches either assume that the joint action-value function can be represented efficiently by a neural network (when, in fact, the exponential number of joint actions usually makes a good approximation hard to learn), or that it suffices to represent (approximate) individual action values [18]. Our aim is to investigate to what degree these assumptions are valid by exploring them in the one-shot case, as well as to assess whether higher-order factorizations yield improved representations of such functions while keeping learning easy (as there are only small factors to be learned).

We use neural networks as function approximators to represent the various components of these factorizations. Our work explores a series of directions, both in terms of agent models (how to use these networks to represent the various agents/factors) and in terms of learning algorithms, combining them to independently assess the influence of these various aspects on the represented action-value functions. We couple each of the factorizations introduced later with the following two learning approaches.

  • Mixture of experts (MoE) [1]: each factor network optimizes its own output to predict the global reward, thus becoming an “expert” on its own field of action. The loss for the network representing factor e at training sample t is defined as:

    L_e(t) = ( r_t − Q̂_e(a_{e,t}) )²,

    where r_t is the common reward signal received after selecting joint action a_t and Q̂_e(a_{e,t}) is the output of the network for local joint action a_{e,t}. As we aim to assess how good the reconstructed action-value function is, after training we compute it from the factors as the mean over the appropriate local Q-values:

    Q̂(a) = (1/C) Σ_{e=1}^{C} Q̂_e(a_e).

  • Factored Q-function (FQ) [10]: we jointly optimize the factor networks to predict the global reward as the sum of their local Q-values Q̂_e(a_e). The loss for the sample at time t is identical for all networks:

    L(t) = ( r_t − Σ_{e=1}^{C} Q̂_e(a_{e,t}) )².

    After training, the joint action-value function is reconstructed as the sum of the appropriate local Q-values:

    Q̂(a) = Σ_{e=1}^{C} Q̂_e(a_e).
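The difference between the two update rules is easiest to see in a tabular sketch (dictionaries stand in for the factor networks, and the learning rate is an arbitrary choice): under MoE each factor regresses its own output onto the reward, while under FQ all factors share the error of their sum.

```python
def moe_step(local_q, factors, joint_action, reward, lr=0.1):
    # Mixture of experts: each factor independently regresses its own
    # output onto the global reward r_t.
    for scope, q in zip(factors, local_q):
        a_e = tuple(joint_action[i] for i in scope)
        q[a_e] += lr * (reward - q[a_e])

def fq_step(local_q, factors, joint_action, reward, lr=0.1):
    # Factored Q-function: the *sum* of the local values is regressed
    # onto r_t, so every factor is updated with the same error signal.
    a_es = [tuple(joint_action[i] for i in scope) for scope in factors]
    err = reward - sum(q[a_e] for q, a_e in zip(local_q, a_es))
    for q, a_e in zip(local_q, a_es):
        q[a_e] += lr * err
```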
We investigate four different factorizations:

  • Single agent decomposition: each agent i is represented by an individual neural network that computes its own individual action-values Q̂_i(a_i), one for each output unit, based only on its local action a_i. Under the mixture of experts, this corresponds to the standard independent learning approach to MARL, in which we learn local agent-wise components, while under the factored Q-function approach it corresponds to the value decomposition networks (VDN) of [27].

  • Random partition: agents are randomly partitioned into factors of size f, with each agent involved in exactly one factor. Each of the resulting factors is represented by a different neural network that outputs local action-values Q̂_e(a_e), one for each output unit, where a_e is the local joint action of the agents in factor e. We consider factors of size f = 2 and f = 3.

  • Overlapping factors: a fixed number of factors is picked at random from the set of all possible factors of size f. We require the sampled set to contain no duplicate factors (we use distinct ones) and to cover every agent at least once. Every factor is represented by a different neural network learning local action-values Q̂_e(a_e) for the local joint actions a_e, one for each output unit. In our experiments, we again use factors of size f = 2 and f = 3.

  • Complete factorization: each agent is grouped with every possible combination of the other agents in the team to form factors of size f, resulting in C(n, f) factors, each represented by its own network. Each of these networks learns local action-values Q̂_e(a_e), one for each output unit, conditioned on the local joint action a_e of factor e. As with the other factorizations, we consider factors of size f = 2 and f = 3.

This results in the following combinations:

                          Mix. of Experts   Factored Q
Single agent              M1 (= IL [29])    F1 (= VDN [27])
Random part. (f = 2, 3)   M2R, M3R          F2R, F3R
Overlapping (f = 2, 3)    M2O, M3O          F2O, F3O
Complete (f = 2, 3)       M2C, M3C          F2C, F3C
Table 1: Combinations of factorizations and learning rules.
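The factor sets underlying the last three structures can be generated mechanically; a sketch (the sampling details are our own choices):

```python
import itertools
import random

def random_partition(n_agents, f):
    """Disjoint factors of size f covering all agents (assumes f divides n)."""
    agents = list(range(n_agents))
    random.shuffle(agents)
    return [tuple(sorted(agents[i:i + f])) for i in range(0, n_agents, f)]

def overlapping_factors(n_agents, f, n_factors):
    """n_factors distinct size-f factors, with every agent in at least one."""
    all_factors = list(itertools.combinations(range(n_agents), f))
    while True:
        pick = random.sample(all_factors, n_factors)    # distinct factors
        if set().union(*pick) == set(range(n_agents)):  # full coverage
            return pick

def complete_factorization(n_agents, f):
    """Every possible size-f subset of agents: C(n, f) factors in total."""
    return list(itertools.combinations(range(n_agents), f))
```

For n = 6 and f = 2, for instance, the complete factorization contains 15 factors, each represented by its own network.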

4 Experiments

We investigate the representations obtained with the proposed combinations of factorization and learning approach on a series of challenging one-shot coordination games that do not have an explicit decomposition of the reward function (non-factored games), and then on two factored games. Table 2 summarizes the investigated games and their parameters.

Game                  n   |A_i|           |A|               Factored
Dispersion/Platonia   6   2               64                No
Climb/Penalty         6   3               729               No
Generalized FF        6   2 (per type)    64 (8192 total)   Yes
Aloha                 6   2               64                Yes
Table 2: Details of the investigated games.

4.1 Experimental Setup

Every neural network has a single hidden layer with leaky ReLU activation functions, while all output units are linear and output local action-values Q̂_e(a_e). (We also investigated deeper networks with 2 and 3 hidden layers, but found no improvements on the considered problems.) Given the absence of an environment state to feed to the networks as input, at every time step they receive only a constant scalar value. We used the mean squared error as the loss function and trained with RMSprop. For every game, we trained the networks by sampling a joint action a_t uniformly at random and collecting the resulting reward r_t (we do not use ε-greedy exploration because we are interested in representing the whole value function and not just the best-performing action at every training step). We then propagate the gradient update through each network from the output unit corresponding to its local joint action a_{e,t}. The loss function minimizes the squared difference between the collected reward at each training step and the approximation computed by the networks. After training, the learned action-value function Q̂(a) is compared to the original one. We also consider a baseline joint learner: a single neural network with an exponential number of output units, one per joint action. Every experiment was repeated over multiple runs with random initialization of the weights, each time sampling different factors for the random partitions and the overlapping factors; we report averages over these runs.
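The training procedure can be sketched as follows (a tabular stand-in for the small networks, with an arbitrary learning rate and sample budget; the reward function is supplied by the game):

```python
import random
from collections import defaultdict

def train_factored_q(reward_fn, n_agents, n_actions, factors,
                     n_samples=20000, lr=0.05):
    """Fit one table per factor (stand-ins for the factor networks) by
    regressing the sum of the local values onto the sampled reward."""
    local_q = [defaultdict(float) for _ in factors]
    for _ in range(n_samples):
        # Sample a joint action uniformly at random (no epsilon-greedy:
        # we want to represent the whole value function).
        a = tuple(random.randrange(n_actions) for _ in range(n_agents))
        a_es = [tuple(a[i] for i in scope) for scope in factors]
        err = reward_fn(a) - sum(q[a_e] for q, a_e in zip(local_q, a_es))
        # Propagate the update only through the selected output units.
        for q, a_e in zip(local_q, a_es):
            q[a_e] += lr * err
    return local_q
```

On a reward function that is exactly factored, this recovers the joint values to high accuracy; on the non-factored games below, the residual error is what we measure.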

4.2 Non-Factored Games

4.2.1 Dispersion Games

In the Dispersion Game, also known as the Anti-Coordination Game, the team of agents must divide as evenly as possible between the two local actions that each agent can perform [8]. This game requires explicit coordination, as neither local action is good per se; the obtained reward depends on the decision of the whole team. We investigate two versions of this game. In the first, the agents obtain a reward proportional to their dispersion coefficient, i.e., how evenly the agents are split between their two local actions. With n agents, each with local action set A_i = {a⁰, a¹}, and k denoting the number of agents choosing a¹, the reward function is:

R(a) = min(k, n − k).

In the second version, which we call the Sparse Dispersion Game, the agents receive a reward (which we set to the maximum dispersion coefficient with n agents: n/2) only if they are perfectly split:

R(a) = n/2 if k = n/2, and 0 otherwise.
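Reading the dispersion coefficient as the size of the smaller of the two groups (our interpretation of the description above, encoding the two local actions as 0 and 1), the two reward functions can be sketched as:

```python
def dispersion_reward(joint_action):
    """Dispersion coefficient: the size of the smaller of the two groups."""
    k = sum(joint_action)                  # agents choosing action 1
    return min(k, len(joint_action) - k)

def sparse_dispersion_reward(joint_action):
    """Maximum dispersion coefficient n/2, but only for a perfect split."""
    n, k = len(joint_action), sum(joint_action)
    return n // 2 if 2 * k == n else 0
```

With six agents, a perfect split such as (0, 0, 0, 1, 1, 1) yields 3 in both variants.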
Figure 1 shows the Q-function reconstructed by the proposed factorizations and learning approaches for these two games. In these plots (and those that follow), the x-axis enumerates the joint actions and the y-axis shows the corresponding values of the reconstructed function, with the colour of the bars encoding the action-value. We analyse the accuracy of the computed reconstructions in two respects: the total reconstruction error of Q̂(a) with respect to the original reward function, and whether a reconstruction produces a correct ranking of the joint actions. In a good reconstruction, the bars have the same relative heights as in the original function, indicating that the factorization correctly ranks the joint actions with respect to their value, and similar absolute values, indicating that the factorization can reconstruct a correct value for each reward component. However, reconstruction error alone is not a good accuracy measure: lower reconstruction error does not imply better decision-making, as a model could lower the total error by over- or underestimating the value of certain joint actions.

Figure 1: Reconstructed Q̂(a) for (a) the Dispersion Game and (b) its sparse variant.

Figure 1(a) shows that the proposed complete factorizations are able to almost perfectly reconstruct the relative ranking of the joint actions, meaning that these architectures can be reliably used for decision making. Moreover, the ones using the factored Q-function (F2C and F3C in the plot) also produce a generally good approximation of the various reward components (expressed by the heights of the bars), while those based on the mixture of experts produce a less precise reconstruction: the joint optimization of the former gives an advantage in this kind of tightly coordinated problem. Smaller factorizations, like the random pairings, are not sufficient to correctly represent this function, probably because a higher degree of connectivity is required to represent the coordination. Figure 1(b) is similar, but here the reconstruction is less accurate and the bar values differ considerably from the original ones. This is possibly due to the sparsity of the reward function, which requires the networks to approximate quite different values with the same output components. In this case, the sparseness of the reward function fools the representations into being similar to those of the non-sparse version.

4.2.2 Platonia Dilemma

In the Platonia Dilemma, an eccentric trillionaire gathers n people together and tells them that if exactly one of them sends him a telegram by noon the next day, that person will receive a billion dollars. In our cooperative version, the reward is set to the number of agents n and is received by the whole team, not just a single agent. Thus, for n agents with local action sets A_i = {send, not send}, the reward function is:

R(a) = n if exactly one agent sends the telegram, and 0 otherwise.
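Encoding "send the telegram" as action 1, the cooperative reward described above can be sketched as:

```python
def platonia_reward(joint_action):
    """n if exactly one agent sends the telegram (action 1), else 0."""
    n = len(joint_action)
    return n if sum(joint_action) == 1 else 0
```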
Figure 2: Reconstructed Q̂(a) for the Platonia Dilemma.

Figure 2 shows the reconstructed action-value functions for the Platonia Dilemma. None of the proposed factorizations can correctly represent this action-value function. While they are all able to rank the optimal joint actions (those in which exactly one agent sends the telegram) at the same level, they all fail to correctly rank and reconstruct one particular joint action: the one in which none of the agents sends the telegram. The unique symmetric equilibrium for the team in this game is for each agent to send the telegram with probability 1/n, so an agent usually gathers more reward by not sending the telegram itself and relying on someone else to do so. This results in an 'imbalanced' reward function in which, from an agent's perspective, the high reward is more often obtained by choosing one action rather than the other, and thus in an overestimate of the joint action in which all the agents play that same action, i.e., in which no one sends the telegram.

This imbalance in the reward given by the two actions is probably the cause of the poor reconstruction. Thus, for this kind of tightly coupled coordination problem, none of the techniques to approximate action-values currently employed in deep MARL suffice to guarantee a good action is taken, even if the coordination problem is conceptually simple.

4.2.3 Climb Game

In the Climb Game [31], each agent has three local actions A_i = {a⁰, a¹, a²}. Action a⁰ yields a high reward if all the agents choose it, but no reward if only some do. The other two are suboptimal actions that give a lower reward but do not require precise coordination. This game gives rise to a phenomenon called relative overgeneralization, which pushes the agents to underestimate a certain action (in our example, a⁰) because of the low rewards they receive while miscoordinating on it, even though they could obtain a higher reward by perfectly coordinating on it. The reward function has the form:

R(a) = r_high if all agents choose a⁰, 0 if only some agents choose a⁰, and r_low otherwise (with 0 < r_low < r_high).
Figure 3: Reconstructed Q̂(a) for the Climb Game with (a) the factored Q-function learning approach and (b) the mixture of experts learning approach.

Figure 3 shows the results obtained on the proposed Climb Game. The joint network is not able to learn the correct action-value function in the given training time, due to the large number of joint actions. This highlights again how joint learners are ill-suited to even moderately large multi-agent systems. By contrast, all the other architectures are able to correctly rank the suboptimal actions. Those using the factored Q-function with a complete factorization are also able to correctly reconstruct the values of those actions, as can be seen from the bars. However, only F2C can correctly rank and reconstruct the optimal (coordinated) joint action, while even F3C fails to do so and assigns it a large negative value. A likely cause is that, when optimizing the loss function, assigning negative values to the components forming that joint action reduces the overall error, even though one of the reconstructed reward values is then totally wrong. We can also observe that the mixture of experts plot looks broadly similar to that of the factored Q-functions, but more 'compressed' and noisy.

4.2.4 Penalty Game

Similarly to the Climb Game, in the Penalty Game [31] each agent has three local actions A_i = {a⁰, a¹, a²}. In this game, two local actions (say, a⁰ and a²) give a high reward if the agents perfectly coordinate on one of them, but a negative penalty if they mix them together. The third action a¹ is suboptimal: it gives a lower reward when the team coordinates on it, but no penalty as long as at least one of the agents uses it. This game can also lead to relative overgeneralization, as the suboptimal action gives a higher reward on average. We use a reward function of the form:

R(a) = r_high if all agents choose a⁰ or all choose a², r_low if all choose a¹, −p if both a⁰ and a² appear in a, and 0 otherwise (with 0 < r_low < r_high and penalty p > 0).
Figure 4: Reconstructed Q̂(a) for the Penalty Game with (a) the factored Q-function learning approach and (b) the mixture of experts learning approach.

Figure 4 presents the representations obtained by the investigated architectures. Given the high level of coordination required, all of the architectures using the mixture of experts learn a totally incorrect approximation, biased by the larger number of joint actions that yield a penalty rather than a positive reward. For this game, none of the architectures correctly reconstructs the whole structure of the action-value function: they all fail on the two optimal joint actions (at the two sides of the bar plots). This is probably due to the large gap between the reward values the agents can receive when choosing one of their locally optimal actions: they get a high reward if all agents coordinate perfectly, but far more often they miscoordinate and receive a negative penalty, so the approximation ranks those two joint actions as bad in order to correctly reconstruct the more common cases. Furthermore, the suboptimal action a¹ is hard to approximate correctly because, like the optimal ones, it usually yields a smaller reward than the one it gives when all the agents coordinate on it. Only F1 and F3C rank it above the alternatives, but, surprisingly, only F1 is also able to reconstruct its correct value.

4.3 Factored Games

4.3.1 Generalized Firefighting

The Generalized Firefighting problem [22] is an extension of the standard two-agent firefighting problem to n agents. It is a cooperative graphical Bayesian game: each agent i has some private information, called its local type θ_i, on which it can condition its decision. The combination of the agents' types determines the values of the reward function. A team of n firefighters must fight possible fires at N_h different houses. Each house can be burning or not. Each agent has a limited observation and action field: it can observe only a subset of the houses (the burning status of those houses forms its local type θ_i) and can fight the fire only at a subset of the houses (the sets of observed and reachable houses are fixed beforehand and are part of the problem specification). Each house yields a reward component: if an agent fights the fire at a burning house, that house gives a positive reward r; if the house is not burning, or if it is burning but no one fights the fire there, it provides no reward. The reward function is sub-additive: if two agents fight the fire at the same burning house, it yields a reward r⁺ < 2r. The overall reward experienced by the agents for a given joint type θ and joint action a is the sum of the rewards given by the houses.

In our experiments, a team of n = 6 agents has to fight fires at the houses. Each agent can observe a fixed set of houses and can fight fires at that same set of locations. Figure 5 shows the representations learned for a single joint type θ.

Figure 5: Reconstructed Q̂(a) for a single joint type of the Generalized Firefighting game.

This game requires less coordination than those studied earlier (agents only have to coordinate with the other agents that can fight fire at the same locations), and every investigated architecture correctly ranks all the joint actions, even the single-agent factorizations F1 and M1 (this also holds for the other joint types, whose plots we do not report; only the single-agent factorizations make some isolated errors). However, while the architectures using the factored Q-function also correctly reconstruct the reward value of each action, those using the mixture of experts are less precise in their reconstructions. Overall, this experiment demonstrates that there exist non-trivial coordination problems that can be effectively tackled using small factors, or even individual learning approaches.

4.3.2 Aloha

Aloha [21] is a partially observable game involving a set of islands, each with a radio station that tries to broadcast messages to its inhabitants. We present a slightly altered one-shot version in which the ruler of each island wants to send a radio message to its inhabitants but, since some islands are close to one another, simultaneous transmissions by nearby islands interfere and the messages are not correctly received. Given that all the rulers live in peace and want to maximize the total number of messages received by their populations, the reward signal is shared and the game is cooperative. It is a graphical game: the result of each island's transmission is affected only by the transmissions of nearby islands. Every ruler has two possible actions: send a message or not. A ruler who does not send a message contributes nothing to the total reward. A ruler who sends a message that is correctly received (no interference occurs) obtains a reward r, but one whose transmission interferes with someone else's incurs a penalty −p. The common reward that all the rulers receive at the end is the sum of these local contributions.

Our experiment uses a set of n = 6 islands arranged in a grid, with each island affected only by the transmissions of the islands beside it and in front of it (islands on the corners of the grid lack one of their side neighbours). The representations learned for this game are reported in Figure 6.

Figure 6: Reconstructed Q̂(a) for Aloha.

The plot clearly shows that this game is challenging for the proposed factorizations to learn: only three of them (plus the joint learner) correctly represent the reward function. The structure of the game is similar to that of Generalized Firefighting, with each agent depending directly on only a small subset of the others, but the different properties of the reward function make it harder to represent correctly. This is possibly due to the large difference between the two rewards an agent can receive when transmitting, depending on whether interference occurs. Observing only the total reward, the transmit action looks neutral per se, similarly to the two actions in the Dispersion Game: its outcome depends on the actions of the neighbouring agents, possibly fooling many of the proposed factorizations, especially those using the mixture of experts approach.

4.4 Summary of Results

Table LABEL:tab:measures presents the accuracy of the investigated representations under various measures, both in terms of reconstruction error and action ranking, as well as the quality of the action selection that these representations induce. To evaluate reconstruction error, we compute the mean squared error over all joint actions, and the same measure restricted to the actions that are optimal under the original reward function. We also assess how many of the optimal actions are considered optimal by the reconstructions, and compute the value loss (regret) incurred by acting greedily with respect to the represented value functions. We additionally report a variant that we call the Boltzmann value loss: the value loss of the expected reward accrued under a softmax distribution over all joint actions (this gives an indication of value loss amongst all good actions). Finally, we compute the number of correctly ranked actions (accounting for ties where needed) and the corresponding Kendall τ-b coefficient [13] between the computed ranking and the original one. For every method, mean values and standard errors across the runs are reported.
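Given the true and reconstructed value tables, the value loss and the fraction of preserved optimal actions can be computed directly; a minimal sketch (the tie tolerance is our own choice):

```python
def value_loss(q_true, q_hat):
    """Regret of acting greedily with respect to the reconstruction."""
    greedy = max(q_hat, key=q_hat.get)
    return max(q_true.values()) - q_true[greedy]

def optimal_actions_preserved(q_true, q_hat, tol=1e-9):
    """Fraction of truly optimal joint actions that the reconstruction
    also ranks as optimal (ties resolved with a small tolerance)."""
    best_true = max(q_true.values())
    best_hat = max(q_hat.values())
    opt_true = {a for a, v in q_true.items() if abs(v - best_true) <= tol}
    opt_hat = {a for a, v in q_hat.items() if abs(v - best_hat) <= tol}
    return len(opt_true & opt_hat) / len(opt_true)
```

For the Kendall τ-b coefficient, a library implementation with tie handling (e.g., scipy.stats.kendalltau) can be used instead of hand-rolling the pair counts.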

While many aspects can influence the learning outcome, our results have four main takeaways:

  • There are pathological examples, like the Platonia Dilemma, where all types of factorization result in selecting the worst possible joint action. Given that only joint learners seem able to address such problems, no scalable deep RL methods for dealing with them currently seem to exist.

  • Beyond those pathological examples, ‘complete factorizations’ of modest factor size coupled with the factored Q-function learning approach yield near-perfect reconstructions and rankings of the actions, even for non-factored action-value functions. Moreover, these methods scale much better than joint learners: for a given training time, we see that these complete factorizations already outperform fully joint learners on modestly sized problems (like the Climb Game or Generalized Firefighting with 6 agents), as can be seen from the training curves in Figure 7 for two of the proposed games.

    Figure 7: Training curves for the investigated models on (a) the Dispersion Game and (b) the Generalized Firefighting game.
  • For these more benign problems, random overlapping factors also achieve excellent performance, especially in terms of value loss, comparable to those of more computationally complex methods like joint learners and complete factorizations. This suggests that such approaches are a promising direction forward for scalable deep MARL in many problem settings.

  • Factorizations with the mixture of experts approach usually perform somewhat worse than the corresponding factored Q-function approaches. However, in some cases they perform better (e.g., in terms of value loss and MSE on optimal actions in the Penalty Game) or comparably (Dispersion, Generalized Firefighting), where M2R and M3R still outperform F1 (i.e., VDN) in terms of value loss. This is promising, because the mixture of experts learning approach requires no exchange of information between the neural networks, thus potentially facilitating learning in settings with communication constraints and making it easier to parallelize across multiple CPUs/GPUs.

These observations shed light on the performance of independent learners in MARL: while they can outperform joint learners on large problems, the degree of independence and the final outcome are hard to predict and are affected by many factors. Designing algorithms that can overcome these difficulties should be a primary focus of MARL research.

5 Related Work

Recently, many works have applied deep RL techniques to MASs, achieving strong performance. Gupta et al. [12] compare many standard deep reinforcement learning algorithms (DQN, DDPG, and TRPO) under a variety of learning schemes (joint learners, fully independent learners, etc.) on both discrete and continuous tasks. Tampuu et al. [28] present a variation of DQN capable of dealing with both competitive and cooperative settings in Pong. Techniques to enhance the learning process have also been investigated: Palmer et al. [23] apply leniency to independent double-DQN learners in a coordinated grid-world problem, while Foerster et al. [5] stabilize the experience replay buffer of DQNs by conditioning on when the samples were collected, thereby easing the non-stationarity of independent learning. Communication between agents has also been explored: Sukhbaatar et al. [26] investigate the emergence of a communication mechanism that can be learned directly through backpropagation. However, none of these works compare alternative representations of Q-values for such networks.

Many works address the problem of the exponentially large joint action space by exploiting centralized learning. Foerster et al. [6] present an actor-critic architecture with multiple independent actors but a single centralized critic, used to efficiently estimate both Q-values and a counterfactual baseline that tackles the credit assignment problem and guides the agents through learning under partial observability. However, this work still represents Q monolithically, and thus can experience scalability issues. On the other hand, Lowe et al. [17] maintain a separate critic network for every actor, together with inferred policies for the other agents in the environment, and apply the approach to both cooperative and competitive tasks. Sunehag et al. [27] train the agents independently using a value decomposition method that represents the original Q-function as the sum of local terms, each depending only on agent-wise information, while Rashid et al. [24] extend this idea by representing the Q-function as a monotonic nonlinear combination of such individual terms, computed by a mixing network, so that the maximization step can still be performed efficiently. While such mixing networks may lead to more accurate Q-values, our investigation shows that for many coordination problems, individual Q-components may not suffice.
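The decomposition schemes above can be sketched as follows. The per-agent utilities and mixing weights below are illustrative stand-ins for learned networks, not values from any of the cited papers; the sketch shows why monotonic combinations let each agent maximize its own utility independently:

```python
import math

# Hypothetical per-agent utilities q_i(a_i) for two agents with two actions each
# (illustrative numbers, standing in for the outputs of learned networks).
q1 = [1.0, 3.0]
q2 = [0.5, 2.0]

def q_vdn(a1, a2):
    # VDN-style decomposition: Q_tot is the plain sum of individual utilities.
    return q1[a1] + q2[a2]

def q_mix(a1, a2):
    # QMIX-style decomposition: a nonlinear but monotonic combination.
    # Non-negative weights guarantee dQ_tot/dq_i >= 0 (weights are illustrative).
    w1, w2, b = 0.7, 1.3, 0.2
    return math.tanh(w1 * q1[a1] + w2 * q2[a2] + b)

# Monotonicity in each q_i means per-agent maximization recovers the joint
# greedy action without enumerating all |A|^n joint actions.
greedy = (max(range(2), key=q1.__getitem__), max(range(2), key=q2.__getitem__))
```

Under either decomposition the greedy joint action here is (1, 1), found with 2·|A| evaluations instead of |A|²; this efficient maximization is what motivates the decompositions, at the cost of representational capacity.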

6 Conclusions

In this work, we investigated how well neural networks can represent action-value functions arising in multi-agent systems. This is an important question since accurate representations can enable taking (near-) optimal actions in value-based approaches, and computing good gradient estimates in actor-critic methods. In this paper, we focused on one-shot games as the simplest setting that captures the exponentially large joint action space of MASs. We compared a number of existing and new action-value network factorizations and learning approaches.

Our results highlight the difficulty of compactly representing action values in problems that require tight coordination, but indicate that using higher-order factorizations, with multiple agents in each factor, can substantially improve the accuracy of these representations. We also demonstrate that there are non-trivial coordination problems, some without a factored structure, that can be tackled quite well with simpler factorizations. Intriguingly, incomplete, overlapping factors perform very well in several settings. There are also settings where the mixture of experts approach, with its low communication requirements and amenability to parallelization, is competitive in terms of reconstruction quality.

While our results emphasize that appropriate architectural choices depend on the problem at hand, our analysis also shows general trends that can help in the design of novel algorithms and improve general performance, highlighting how factored action-value functions can be a viable way to obtain good representations without incurring excessive costs.


This research made use of a GPU donated by NVIDIA. F.A.O. is funded by EPSRC First Grant EP/R001227/1. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 758824, INFLUENCE).


  • Amato and Oliehoek [2015] Christopher Amato and Frans A. Oliehoek. Scalable planning and learning for multiagent POMDPs. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI’15, pages 1995–2002. American Association for Artificial Intelligence, 2015.

  • Busoniu et al. [2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38:156–172, 2008.
  • Claus and Boutilier [1998] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th/10th AAAI Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, AAAI’98/IAAI’98, pages 746–752. American Association for Artificial Intelligence, 1998.
  • Foerster et al. [2016] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 29, NIPS’16, pages 2137–2145. Curran Associates, Inc., 2016.
  • Foerster et al. [2017] Jakob N. Foerster, Nantas Nardelli, Gregory Farquhar, Philip H. S. Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML’17, pages 1146–1155. PMLR, 2017.
  • Foerster et al. [2018] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI’18, pages 2974–2982. American Association for Artificial Intelligence, 2018.
  • Ghavamzadeh et al. [2006] Mohammad Ghavamzadeh, Sridhar Mahadevan, and Rajbala Makar. Hierarchical multi-agent reinforcement learning. Autonomous Agents and Multi-Agent Systems, 13(2):197–229, 2006.
  • Grenager et al. [2002] Trond Grenager, Rob A. Powers, and Yoav Shoham. Dispersion games: General definitions and some specific learning results. In Proceedings of the 18th AAAI Conference on Artificial Intelligence, AAAI’02, pages 398–403. American Association for Artificial Intelligence, 2002.
  • Guestrin et al. [2002a] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems 14, NIPS’02, pages 1523–1530. Morgan Kaufmann Publishers Inc., 2002a.
  • Guestrin et al. [2002b] Carlos Guestrin, Michail G. Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, ICML’02, pages 227–234. Morgan Kaufmann Publishers Inc., 2002b.
  • Guestrin et al. [2003] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19(1):399–468, 2003.
  • Gupta et al. [2017] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Autonomous Agents and Multi-Agent Systems, pages 66–83. Springer International Publishing, 2017.
  • Kendall and Gibbons [1990] Maurice Kendall and Jean D. Gibbons. Rank Correlation Methods. A Charles Griffin Title, 5 edition, 1990.
  • Kok and Vlassis [2006] Jelle R. Kok and Nikos Vlassis. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7:1789–1828, 2006.
  • Leibo et al. [2017] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS’17, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
  • Lillicrap et al. [2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30, NIPS’17, pages 6379–6390. Curran Associates, Inc., 2017.
  • Matignon et al. [2012] Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. Knowledge Engineering Review, 27(1):1–31, 2012.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih et al. [2016] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of ICML’16, pages 1928–1937. PMLR, 2016.
  • Oliehoek [2010] Frans A. Oliehoek. Value-Based Planning for Teams of Agents in Stochastic Partially Observable Environments. PhD thesis, Informatics Institute, University of Amsterdam, 2010.
  • Oliehoek et al. [2011] Frans A. Oliehoek, Shimon Whiteson, and Matthijs T. J. Spaan. Exploiting agent and type independence in collaborative graphical Bayesian games, 2011.
  • Palmer et al. [2018] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS’18, pages 443–451. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML’18, pages 4292–4301. PMLR, 2018.
  • Rogers et al. [2011] A. Rogers, A. Farinelli, R. Stranders, and N. R. Jennings. Bounded approximate decentralised coordination via the max-sum algorithm. Artificial Intelligence, 175(2):730–759, 2011.
  • Sukhbaatar et al. [2016] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2252–2260. Curran Associates, Inc., 2016.
  • Sunehag et al. [2018] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS’18, pages 2085–2087. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Tampuu et al. [2017] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):1–15, 2017.
  • Tan [1993] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the 10th International Conference on Machine Learning, ICML’93, pages 330–337. Morgan Kaufmann Publishers Inc., 1993.
  • Van der Pol and Oliehoek [2016] Elise Van der Pol and Frans A. Oliehoek. Coordinated deep reinforcement learners for traffic light control. In NIPS’16 Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
  • Wei and Luke [2016] Ermo Wei and Sean Luke. Lenient learning in independent-learner stochastic cooperative games. Journal of Machine Learning Research, 17(84):1–42, 2016.
  • Wunder et al. [2010] Michael Wunder, Michael L. Littman, and Monica Babes. Classes of multiagent q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 1167–1174. Omnipress, 2010.
  • Ye et al. [2015] Dayong Ye, Minjie Zhang, and Yun Yang. A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5):10026–10047, 2015.