1 Introduction
Reinforcement Learning (RL) is the problem of finding an action policy that maximizes reward for an agent embedded in an environment SuttonB18 . It has recently has seen an explosion in popularity due to its many achievements in various fields such as, robotics levine2016end , industrial applications datacentercooling , gameplaying mnih2015human ; silver2017mastering ; silver2016mastering , and the list continues. However, most of these achievements have taken place in the singleagent realm, where one does not have to consider the dynamic environment provided by interacting agents that learn and affect one another.
This is the problem of Multi-Agent Reinforcement Learning (MARL), where we seek the best action policy for each agent in order to maximize its reward. The setting may be cooperative, where the agents share a reward, or competitive, where one agent's gain is another's loss. Examples of multi-agent reinforcement learning problems include the decentralized coordination of vehicles to their respective destinations while avoiding collisions, and the game of pursuit and evasion, where the pursuer seeks to minimize the distance between itself and the evader while the evader seeks the opposite. Other examples of multi-agent tasks can be found in Panait2005 and maddpg .
The key difference between MARL and single-agent RL (SARL) is the presence of interacting agents, which is why the achievements of SARL cannot be carried over naively to MARL. Specifically, the state transition probabilities in a MARL setting are inherently non-stationary from the perspective of any individual agent: the other agents in the environment are also updating their policies, so the Markov assumptions typically needed for SARL convergence are violated. This aspect of MARL gives rise to instability during training, where each agent is essentially trying to learn a moving target.
In this work, we present a novel method for MARL in the cooperative setting (with shared reward). Our method first trains a centralized expert with full observability, and then uses this expert as a supervisor for independently learning agents. There is a myriad of imitation/supervised learning algorithms; in this work we focus on adapting DAgger (Dataset Aggregation) DAgger to the multi-agent setting. After the imitation learning stage, the agents are able to act successfully in a decentralized manner. We call this algorithm Centralized Expert Supervises Multi-Agents (CESMA). CESMA adopts the framework of centralized training but decentralized execution KRAEMER201682 , whose end goal is to obtain agents that can act in a decentralized manner.
2 Related works
The most straightforward way of adapting single-agent RL algorithms to the multi-agent setting is to have the agents be independent learners. This was applied in Tan:1997:MRL:284860.284934 , but this training method suffers from stability issues, as the environment is non-stationary from the perspective of each agent Matignon:2012:RIR:2349641.2349642 ; Busoniu2010MultiagentRL ; claus1998dynamics . This non-stationarity was examined in OmidshafieiPAHV17 , and stabilizing experience replay was studied in foerster2017stabilising .
Another common approach to stabilizing the environment is to allow the agents to communicate. In SukhbaatarSF16 , the authors examine this using continuous communications, so that one may backpropagate in order to learn to communicate. In foerster2016learning , they give an in-depth study of communicating multi-agents, and also provide training methods for discrete communication. In jpaulos_schen , they decentralize a policy by examining what to communicate and by utilizing supervised learning, although they solve for a centralized policy mathematically and their assumptions require homogeneous communicating agents. Others approach the non-stationarity issue by having the agents take turns updating their weights while freezing the others for a time, although non-stationarity is still present egorov2016multi . Other attempts adapt Q-learning to the multi-agent setting: Distributed Q-Learning Lauer00analgorithm updates Q-values only when they increase, and updates the policy only for actions that are not greedy with respect to the values, while Hysteretic Q-Learning hysteretic provides a modification. Other approaches examine parameter sharing gupta2017cooperative between agents, but this requires a degree of homogeneity among the agents. In tesauro2004extending , their approach to non-stationarity was to input other agents' parameters into the Q function. A further approach to stabilizing the training of multi-agents is sukhbaatar2016learning , where the agents share information before selecting their actions.
From a more centralized viewpoint, oliehoek2008optimal ; qmix ; sunehag2017value derive a centralized value function for MARL, and in DBLP:journals/corr/UsunierSLC16 , they train a centralized controller and then sequentially select actions for each agent. The issue of an exploding action space was examined in tavakoli2018action .
A few works that follow the framework of centralized training but decentralized execution are: RLaR (Reinforcement Learning as Rehearsal) KRAEMER201682 , COMA (Counterfactual Multi-Agent), and also ijcai2018774 ; DBLP:journals/corr/abs181011702 , where the idea of knowledge reuse is examined. In NIPS2017_6887 , they examine the decentralization of policies from an information-theoretic perspective. There is also MADDPG maddpg , which trains in a centralized-critics, decentralized-actors framework; after training completes, the agents are separated from the critics and can execute in a fully distributed manner.
For surveys of MARL, see articles in bu2008comprehensive ; panait2005cooperative .
3 Background
In this section we briefly review the requisite material needed to define MARL problems. Additionally we summarize some of the standard approaches in general reinforcement learning and discuss their use in MARL.
Dec-POMDP: A formal framework for multi-agent systems is the decentralized partially-observable Markov decision process (Dec-POMDP) bernstein2005bounded . A Dec-POMDP is a tuple $(\mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, \{\Omega_i\}, \mathcal{P}, \mathcal{R})$, where $\mathcal{I}$ is the finite set of $N$ agents indexed $1$ to $N$, $\mathcal{S}$ is the set of states, $\mathcal{A}_i$ is the set of actions for agent $i$, and thus $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the joint action space, $\Omega_i$ is the observation space of agent $i$, and thus $\Omega = \Omega_1 \times \cdots \times \Omega_N$ is the joint observation space, $\mathcal{P}(s', \mathbf{o} \mid s, \mathbf{a})$ (where $\mathbf{o} = (o_1, \dots, o_N)$ and similarly for the others) is the state-transition probability for the whole system, and $\mathcal{R}$ is the reward. In the case when the joint observation $\mathbf{o}$ equals the world state of the system, we call the system a decentralized Markov decision process (Dec-MDP).

DAgger: The Dataset Aggregation (DAgger) algorithm DAgger is an iterative imitation learning algorithm that seeks to learn a policy from expert demonstration. The main idea is to let the learning policy navigate its way through the environment, and have it query the expert on the states that it sees. It does this by starting with a policy $\hat{\pi}_1$ that learns from a dataset $\mathcal{D}$ of expert trajectories through supervised learning. Using $\hat{\pi}_1$, a new dataset $\mathcal{D}_1$ is generated by rolling out the policy and having the expert provide supervision on the decisions that the policy made. This new dataset is aggregated with the existing one, $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_1$. This process is iterated, i.e. a new $\hat{\pi}_2$ is trained, another dataset $\mathcal{D}_2$ is obtained and aggregated into $\mathcal{D}$, and so on. Learning in this way has been shown to be more stable and to have nicer convergence properties, as learning utilizes trajectories drawn from the learner's state distribution, as opposed to only the expert's state distribution.
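To make the loop concrete, here is a minimal sketch of DAgger on a toy problem; the chain environment, the tabular "learner", and all names below are our own illustrative choices, not part of the original algorithm's presentation.

```python
import random

# Toy chain environment: states 0..10, goal at 10, actions -1 (left) / +1 (right).
# The "expert" always moves right; the learner imitates it via DAgger.
GOAL = 10

def expert(state):
    return +1 if state < GOAL else 0  # expert's action label for any queried state

class TabularLearner:
    """A stand-in for supervised learning: majority vote over expert labels."""
    def __init__(self):
        self.labels = {}                       # state -> list of expert labels
    def fit(self, dataset):
        self.labels = {}
        for s, a in dataset:
            self.labels.setdefault(s, []).append(a)
    def act(self, state):
        if state in self.labels:
            votes = self.labels[state]
            return max(set(votes), key=votes.count)
        return random.choice([-1, +1])         # unseen state: act randomly

def rollout(policy, steps=10):
    """Roll out the learner and record the states it visits."""
    s, visited = 0, []
    for _ in range(steps):
        visited.append(s)
        s = min(max(s + policy.act(s), 0), GOAL)
    return visited

random.seed(0)
dataset = [(s, expert(s)) for s in range(3)]   # a few initial expert demonstrations
learner = TabularLearner()
for _ in range(5):                             # DAgger iterations
    learner.fit(dataset)                       # train on the aggregated dataset
    for s in rollout(learner):                 # visit the LEARNER's states...
        dataset.append((s, expert(s)))         # ...and label them with the expert
learner.fit(dataset)

# The learner now matches the expert on every state it has been trained on.
assert all(learner.act(s) == expert(s) for s, _ in dataset)
```

The key DAgger step is that new training states come from the learner's own rollouts, while the labels always come from the expert.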
Policy Gradients (PG): One approach to RL problems is policy gradient methods NIPS1999_1713 : instead of directly learning state-action values, the parameters $\theta$ of the policy $\pi_\theta$ are adjusted to maximize the objective,
$$J(\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\left[ R \right],$$
where $\rho^\pi$ is the state distribution from following policy $\pi_\theta$. The gradient of the above expression can be written as NIPS1999_1713 ; SuttonB18 :
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right].$$
Many policy gradient methods seek to reduce the variance of the above gradient estimate, and thus many study how one estimates $Q^\pi(s,a)$ above SchulmanMLJA15 . For example, if we let the estimate be the sample return $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, then we get the REINFORCE algorithm REINFORCE . Or one can choose to learn $Q^\pi$ using temporal-difference learning Sutton1988LearningTP ; SuttonB18 , which yields the actor-critic algorithms SuttonB18 . Other policy gradient algorithms are DPG SilverDPG , DDPG LillicrapHPHETS15 , and A2C and A3C MnihBMGLHSK16 , to name a few.

Policy gradients have been applied to multi-agent problems; in particular, the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) maddpg uses an actor-critic approach to MARL, and this is the main baseline we test our method against. Another policy gradient method is Counterfactual Multi-Agent (COMA) FoersterFANW17 , which also uses an actor-critic approach.
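As a concrete instance of the score-function update, a minimal REINFORCE sketch on a two-armed bandit (the bandit, learning rate, and step count are illustrative choices of ours) is:

```python
import numpy as np

# Minimal REINFORCE on a two-armed bandit: arm 1 pays 1, arm 0 pays 0.
# The update is theta += lr * grad(log pi(a)) * return, with no baseline.
rng = np.random.default_rng(0)
theta = np.zeros(2)                       # policy parameters (logits)

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()                    # softmax distribution over the two arms

for _ in range(200):
    probs = policy(theta)
    a = rng.choice(2, p=probs)            # sample an action from pi_theta
    r = float(a == 1)                     # sample return of this one-step episode
    grad_log_pi = np.eye(2)[a] - probs    # gradient of log softmax at action a
    theta += 0.5 * grad_log_pi * r        # REINFORCE update

assert policy(theta)[1] > 0.9             # the policy has learned the better arm
```

Since the return multiplies the whole gradient, pulls of the zero-reward arm leave the parameters unchanged, and probability mass steadily shifts toward the rewarding arm.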
4 Methods
In this section, we explain the motivation and method of our approach: Centralized Expert Supervises MultiAgents (CESMA).
4.1 Treating a multiagent problem as a single agent problem
Intuitively, an optimal strategy for a multi-agent problem could be found by a centralized expert with full observability. This is because the centralized controller has the most information available about the environment, and therefore does not pay the high cost of partial observability that independent learners might. This is discussed further in Section 5.
To find this centralized expert, we treat the multi-agent problem as a single-agent problem in the joint observation and action space of all agents. This is done by concatenating the observations of all agents into one observation vector for the centralized expert, and the expert learns outputs that represent the joint actions of the agents.
Our framework does not impose any other particular constraints on the expert: any expert architecture whose output represents the joint actions of all of the agents may be used. Consequently, we are free to use any standard RL algorithm for the expert, such as DDPG or DQN, or potentially even analytically derived experts.
4.2 Curse of dimensionality and some reliefs
When training a centralized expert, both the observation space and the action space can grow exponentially. For example, if we use a DQN for our centralized expert, then the number of output nodes will typically grow exponentially, since each output must correspond to an element of the joint action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$.
This problem has been studied by QMIX qmix and VDNs (Value-Decomposition Networks) sunehag2017value , where the exponential scaling of the output space is avoided by keeping separate values for each agent and using their sum as the system value. Other techniques, such as action branching tavakoli2018action , have also been considered.
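The additive decomposition can be illustrated in a few lines; the sizes and names below are illustrative, not QMIX's or VDN's actual architecture:

```python
import numpy as np
from itertools import product

# Sketch of the value-decomposition idea: one small Q-vector per agent, with
# the system value defined as the sum (VDN-style).
rng = np.random.default_rng(0)
n_agents, n_actions = 3, 5
per_agent_q = [rng.normal(size=n_actions) for _ in range(n_agents)]  # N*|A| outputs, not |A|^N

def q_tot(joint_action):
    """System value of joint action (a_1, ..., a_N): the sum of per-agent values."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))

# Because Q_tot decomposes additively, the greedy joint action can be found by
# maximizing each agent's Q independently: linear, not exponential, work.
greedy = tuple(int(np.argmax(q)) for q in per_agent_q)

# Check against brute force over all |A|^N joint actions.
brute = max(product(range(n_actions), repeat=n_agents), key=q_tot)
assert greedy == brute
```

This is why summing per-agent values sidesteps the exploding output layer: the argmax distributes across the sum.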
In our experiments, we use DDPG (with Gumbel-Softmax action selection if the environment is discrete, as MADDPG does) to avoid the exploding number of input nodes from the observation space, as well as the exploding number of output nodes from the action space. Under this paradigm, the input and output nodes grow only linearly with the number of agents.
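For discrete actions, the Gumbel-Softmax trick draws a "soft" one-hot sample that remains differentiable with respect to the logits. A numpy sketch of the sampling step (following the standard formulation, not any specific codebase):

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Soft one-hot sample from a categorical distribution over actions."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    y = (logits + g) / temperature           # low temperature -> nearly one-hot
    y = np.exp(y - y.max())                  # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])          # actor's unnormalized action scores
soft_action = gumbel_softmax(logits, temperature=0.5, rng=rng)
discrete_action = int(np.argmax(soft_action))  # what the environment receives

assert np.isclose(soft_action.sum(), 1.0) and (soft_action >= 0).all()
```

In a deep-learning framework the same computation lets the critic's gradient flow back into the actor's logits even though the executed action is discrete.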
4.3 CESMA for supervised learning on the multiagents
In order to perform imitation learning to decentralize the centralized expert policy, we adapt DAgger to the multi-agent setting. There are many ways DAgger can be applied to multi-agents, but we implement the method that best allows the theoretical analysis of DAgger to apply: namely, after training the expert, we do supervised learning on a single neural network with disconnected components, each of which corresponds to one agent.
In more detail, after training a centralized expert $\pi^*$, we initialize the agents $\pi_1, \dots, \pi_N$ and initialize the dataset $\mathcal{D} = \emptyset$. The agents then step through the environment, storing each joint observation $\mathbf{o} = (o_1, \dots, o_N)$ the agents encounter along with its action label $\mathbf{a}^* = \pi^*(\mathbf{o})$. After $\mathcal{D}$ has reached a sufficient size, at every $T$th time step (chosen by the practitioner), we sample a batch from this dataset, and then distribute observation $o_i$ and action label $a_i^*$ to agent $i$. The observation and action label are then used to perform supervised learning on the agents. Having a shared dataset of trajectories in this way allows us to view $(\pi_1, \dots, \pi_N)$ as a single neural network with disconnected components, and thus the error bounds from DAgger apply directly, as discussed in Section 5.
See Figure 1 for a diagram. The appendix contains pseudocode for our method.
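The procedure can be sketched as follows; the toy expert, the integer observations, and the tabular agent policies are illustrative stand-ins for the trained networks, not the paper's actual implementation.

```python
import random

# Sketch of CESMA's decentralization stage (no communication), assuming a
# pre-trained centralized expert.
N = 2                                   # number of agents

def expert(joint_obs):
    """Centralized expert: joint observation -> joint action. In this toy,
    each component happens to depend only on o_i, so there is no partial
    observability problem of decentralization."""
    return tuple(o % 2 for o in joint_obs)

agents = [{} for _ in range(N)]         # per-agent policies: o_i -> a_i
dataset = []                            # shared dataset D of (joint obs, expert label)

random.seed(0)
for step in range(200):
    joint_obs = tuple(random.randrange(10) for _ in range(N))  # stubbed env step
    dataset.append((joint_obs, expert(joint_obs)))             # expert supervision
    if len(dataset) >= 32 and step % 4 == 0:                   # every T-th step
        for joint_obs_b, joint_label in random.sample(dataset, 32):
            for i in range(N):          # distribute (o_i, a_i*) to agent i
                agents[i][joint_obs_b[i]] = joint_label[i]

# Decentralized execution: each agent acts from its local observation only,
# and agrees with the corresponding component of the expert.
assert all(agents[i][o] == o % 2 for i in range(N) for o in agents[i])
```

The shared dataset is what lets the collection of agents be treated as one disconnected learner for the purposes of the DAgger analysis.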
The aforementioned procedure is sufficient when the agents do not have to communicate in order to solve the environment (e.g. they all have full observability, their observations are isomorphic to full observability, or the partial observability is not too detrimental). When communication is involved, we must modify the above method.
4.4 CESMA for multiagents that communicate
In order to train multiagents that communicate, especially in an environment where communication is necessary (see Section 6.3 for two such cases), we have to slightly modify the above procedure. Training the centralized expert in a communication scenario follows the original procedure, but since the expert has full observability, it has no incentive to communicate with itself. Indeed, most of the time the reward function is independent of the communication (although if one desired to act covertly, then this may not be the case). Thus any observations and actions regarding communication are not utilized when training the expert. In the next phase, the agents learn to communicate amongst themselves.
We leave the full details to the appendix, and opt here to provide an overview. In the environments we test, the communication action taken by an agent at time step $t$ appears to the other agents at the next time step $t+1$. Then, in our dataset of observations, we store together the observations from time steps $t$ and $t+1$.
The idea is to separate the losses on the agents' actions into an action loss and a communication loss. To obtain the action loss, we do supervised learning on the observations and actions at time $t+1$, where the input is the agent's observation but with updated communications from the other agents, and the label is the centralized expert's action label. The communication loss for agent $i$ is computed by first obtaining agent $i$'s output communication action at time $t$. This is then combined with the observations and actions of all the other agents at time $t+1$. We query the expert for the correct actions of the other agents, and we use this supervised learning loss to backpropagate the errors through to agent $i$'s weights. Indeed, our main motivation was the chain rule.
By using this procedure we allow the decentralized agents to learn communications that help reduce the penalty associated with operating in a partiallyobservable environment. Section 5.1 discusses this in more detail.
In this way, we have partially alleviated the issue of communication being a sparse reward signal: even when an agent communicates to another agent, the receiving agent is still learning, so it is unclear whether the broadcasting agent or the receiving agent was correct; we have a double ambiguity. With an expert supervisor, the correct action for the acting agent is clear.
5 Theoretical analysis
Although the proposed framework can handle a myriad of imitation learning algorithms, such as Forward Training ross2010efficient , SMILe ross2010efficient , SEARN daume2009search , and more, we use DAgger in our experiments, and thus we follow its theoretical analysis and provide some multi-agent extensions. Since our method can be viewed as using a single-agent learner with disconnected components that decomposes trivially into multi-agents, this gives us direct theoretical insight DAgger .
In our setting, the centralized expert observes the joint observations of all agents, and thus it is a function $\pi^* : \Omega \to \mathcal{A}$, and we can decompose $\pi^*$ into,
$$\pi^*(\mathbf{o}) = (\pi_1^*(\mathbf{o}), \dots, \pi_N^*(\mathbf{o})),$$
where $\pi_i^* : \Omega \to \mathcal{A}_i$. In order to decentralize our policy, our goal is to find policies $\pi_i : \Omega_i \to \mathcal{A}_i$ such that,
$$\pi_i(o_i) = \pi_i^*(\mathbf{o}).$$
Note that $\pi_i^*$ is able to observe the joint observation while $\pi_i$ is only able to observe its own local observation $o_i$. This means we may encounter issues where
$$\pi_i^*(\mathbf{o}) \neq \pi_i^*(\mathbf{o}') \quad \text{even though} \quad o_i = o_i',$$
so we want a single action $\pi_i(o_i)$ to match two conflicting expert labels.
Thus the agent policy $\pi_i$ can act suboptimally in certain situations, being unaware of the global observation. This unfortunate situation afflicts not only our algorithm but any multi-agent training algorithm (and, in general, any algorithm attempting to solve a POMDP). We call this the partial observability problem of decentralization (note that partial observability afflicts any algorithm trying to solve a POMDP, e.g. multi-agent systems, but here we examine it from the viewpoint of decentralization). More concretely, we say there is a partial observability problem of decentralization if there exist joint observations $\mathbf{o}$ and $\mathbf{o}'$ with $o_i = o_i'$ such that $\pi_i^*(\mathbf{o}) \neq \pi_i^*(\mathbf{o}')$.
As it pertains to our algorithm, if we can bound the cost of this problem, then we can obtain an error bound. Suppose $\ell$ is the surrogate loss of matching the expert policy $\pi^*$, and suppose the partial observability problem of decentralization is such that
$$\ell(\mathbf{o}, \pi_i) \le \epsilon_{po}$$
for all iterations $i$, observations $\mathbf{o}$, and policies $\pi_i$. Then we have the following theorem (whose proof is in the appendix),
Theorem 1.
If the number of iterations is $\tilde{O}(T)$, and $J(\pi)$ is the cost (negative of the reward) of executing policy $\pi$, then there exists a joint multi-agent policy $\boldsymbol{\pi} = (\pi_1, \dots, \pi_N)$ such that
$$J(\boldsymbol{\pi}) \le J(\pi^*) + u\, T\, \epsilon_{po} + O(1),$$
where $u$ bounds the increase in cost incurred by a single deviation from the expert action, as in the DAgger analysis DAgger .
Under these assumptions, decentralizing the policy always requires paying a cost of partial observability, independent of the number of iterations. In our experiments, however, we sometimes find this cost is negligible.
5.1 The need for communication
The most effective setting for our method is when all the agents are able to observe the full joint observation. Then, from the perspective of each agent, the only non-stationarity during learning comes from the other agents' policies, as opposed to the non-agent portion of the environment.
In the situation where each agent only has local observations, there is an incentive to communicate in order to avoid the partial observability problem of decentralization. Namely, we want for the multi-agent policy $\pi_i$,
$$\pi_i(o_i, m_i) = \pi_i^*(\mathbf{o}),$$
where $m_i$ is the communication to agent $i$ from either all or only some of the other agents. Note that we can view $m_i$ as a function of the other agents' observations, $m_i : \prod_{j \neq i} \Omega_j \to \mathcal{M}$ (where $\mathcal{M}$ is some communication action space). Then under the following conditions, we have the error bound:
Theorem 2.
If the multi-agent communication satisfies the condition
$$\pi_i(o_i, m_i) = \pi_i^*(\mathbf{o})$$
for all $\mathbf{o}$, then there is no partial observability cost of decentralization, and thus if the number of iterations is $\tilde{O}(T)$, then there exists a joint policy $\boldsymbol{\pi} = (\pi_1, \dots, \pi_N)$ whose surrogate loss is not bottlenecked by $\epsilon_{po}$. This implies,
$$J(\boldsymbol{\pi}) \le J(\pi^*) + O(1).$$
The proof is in the appendix. We remark that a naive communication protocol satisfying the above condition is for each agent to broadcast its full observation, i.e. $m_i$ is the identity operator.
6 Experiments
Our experiments are conducted in the Multi-Agent Particle Environment mordatch2017emergence ; maddpg provided by OpenAI, which has basic simulated physics (e.g. Newton's laws) and multiple multi-agent scenarios.
6.1 Preliminaries
In order to conduct comparisons to MADDPG, we also use the DDPG algorithm with Gumbel-Softmax jang2016categorical ; maddison2016concrete action selection for discrete environments, as they do. For the single-agent centralized expert neural network, we always make sure the number of parameters matches (or is lower than) that of MADDPG. For the decentralized agents, we use the same number of parameters as the decentralized agents in MADDPG (i.e. the actor part). We always use the same discount factor, as that seemed to work best both for our centralized expert and for MADDPG. Following their experimental procedure, we average our experiments over three runs; for the decentralization, we trained three separate centralized experts and used each of them to obtain three decentralized policies. Full details of our hyperparameters are in the appendix. Rewards are averaged over 1,000 episodes, and we always use two-hidden-layer neural networks. Brief descriptions of each environment are provided below, with fuller descriptions and some example pictures in the appendix.
6.2 Cooperative Navigation – Homogeneous and Non-homogeneous Agents
Here we examine the situation of agents occupying landmarks in a 2D plane, where the agents are either homogeneous or heterogeneous. The (continuous) observations of each agent are the relative positions of the other agents, the relative positions of each landmark, and its own velocity. The agents do not have access to the others' velocities, so we have partial observability. The reward is based on how close the nearest agent is to each landmark, and the actions of each agent are discrete: up, down, left, right, and do nothing.
In Figure 2, in all cases the fully centralized expert achieves a better reward than MADDPG and DDPG. We are also able to decentralize the expert policy (chosen to be the one with the best reward) so as to reach this same reward. We remark that our method seems to work better with more agents.
The six non-homogeneous agents case serves as a good experiment for what happens when we stop the centralized expert before it truly converges. In this case, decentralization to the same reward as the expert is quickest, occurring within the first 5,000 episodes. Intuitively, it makes sense that a suboptimal solution allows faster convergence.
We also observe that our method improves sample efficiency over MADDPG: we picked centralized experts that achieved a reward of 350 (averaged over the past 1,000 episodes), which took the centralized expert around 50,000 episodes. Decentralizing these policies so that we obtain a reward higher than 355 took around 7,000 episodes, so in total it took 57,000 episodes to obtain decentralized policies achieving a reward higher than 355. By contrast, it takes MADDPG around 80,000 to sometimes more than 100,000 episodes to obtain rewards higher than 355.
We also examined the use of DQNs: one with an exponential number of output nodes, and a QMIX/centralized-VDN variant that sums the individual agents' values, giving linear growth in the action outputs. We also test decentralization performance when each agent has its own dataset of trajectories, in the 3 non-homogeneous agents scenario. We hypothesized that the non-homogeneity of the agents might affect convergence, but this turned out not to be so. See Figure 2 for the reward curves.
6.3 Cooperative Communication
Here we adapt CESMA to tasks that involve communication. In these scenarios, the communication action taken by each agent at time step $t$ appears to the other agents at time step $t+1$. Although we require continuous communications to backpropagate, in practice we can use the softmax operator to bridge the discrete and the continuous, and during decentralized execution our agents are able to act with discrete communication inputs.
We examine three scenarios for CESMA that involve communication, using the training procedure described in Section 4.4. The first scenario, the “speaker and listener” environment, has a speaker who broadcasts the correct goal landmark (in a “language” it must learn) out of 3 possible choices, and a listener, who is blind to the correct goal landmark and must use this information to move there. Communication is a necessity in this environment. The second and third scenarios are cooperative navigation with communication: here we have two/three agents whose observation space includes the goal landmark of the other agent but not their own, and there are three/five possible goal landmarks.
We see in Figure 2 that we achieve a better reward in a more sample-efficient manner. For the speaker and listener environment, the centralized expert converges near-immediately, as does the decentralization process, while MADDPG has much higher variance in its convergence. In the cooperative navigation with communication scenarios, the story is similar: the centralized expert converges quickly, and the decentralization process is near-immediate.
6.4 Reward vs. loss, and slow and fast learners
In our experiments with cooperative navigation, when using the cross-entropy loss, we did not find an illuminating correlation between the reward and the loss. We reran the experiments in a truer DDPG fashion by solving a continuous version of the environment and using the mean-squared error for the supervised learning. We examined the loss in the cooperative navigation task with 3 agents, both homogeneous and non-homogeneous; the results are plotted in Figure 3. We found that in these cases the reward and loss were negatively correlated as expected, namely that we achieved a higher reward as the loss decreased. In the non-homogeneous case, we plot each individual agent's reward vs. its loss and found that the big and slow agent had the biggest loss, followed by the medium agent, with the small and fast agent being the quickest learner. This demonstrates that in non-homogeneous settings, some agents may be slower to imitate the expert than others.
7 Conclusion
We propose a MARL algorithm, called Centralized Expert Supervises Multi-Agents (CESMA), which takes the training paradigm of centralized training but decentralized execution. The algorithm first trains a centralized expert policy, and then adapts DAgger to obtain policies that execute in a decentralized fashion.
Appendix A Experiment where each agent has its own dataset of trajectories
Here we describe in a little more detail our procedure for experimenting with individual trajectory datasets for each agent vs. a shared dataset. Namely, we plot the learning curves for decentralizing a policy in the two cases: (1) when each agent has its own dataset of trajectories, and (2) when there is a shared dataset of trajectories (the one we use in the experiments). We tested on the cooperative navigation environment with 3 non-homogeneous agents. We hypothesized that the non-homogeneity of the agents would have an effect on the shared reward, but this turned out not to be so. It is interesting to note, however, that in the main text we found that some agents had a bigger loss when doing supervised learning from the expert.
Appendix B Experiment with DQNs
Here we examine decentralizing DQNs, described briefly in the main text, with a bit more detail here. We used the cross-entropy loss for the supervised learning portion, and used the cooperative navigation environment with 3 non-homogeneous agents. The DQNs we used are: the exponential-actions DQN, which is just a naive implementation of DQNs for the multi-agents, and a centralized VDN, where the system value is the sum of the individual agent values. We used a neural network with 200 hidden units and a batch size of 64, with separate learning rates for the exponential DQN and the QMIX/centralized VDN DQN. We also used noisy action selection for exploration.
Appendix C Pseudoalgorithm of CESMA (without communication)
Appendix D Detailed description of CESMA with communicating agents
The main intuition for our idea is very simple: the chain rule. To minimize notational burden, we first leave out some details and provide them afterwards. If the supervised learning loss function is $L$, the expert is $\pi^*$, and the agents are $\pi_1, \dots, \pi_N$, then in order to update the communication from agent $i$ to agent $j$, denoting by $\mathbf{c}_{-i \to j}$ the communications to agent $j$ from all other agents excluding agent $i$, we want to minimize:
$$L\big( \pi_j(o_j, c_{i \to j}, \mathbf{c}_{-i \to j})_{\text{action}},\; \pi_j^*(\mathbf{o}) \big),$$
and so we should backpropagate through the other agents' actions (namely the physical actions, which have the expert label, and not the communication actions, which do not) to update agent $i$'s communication. We note that both $c_{i \to j}$ and $\mathbf{c}_{-i \to j}$ are communication actions obtained from the current policies of the agents, i.e. we re-query the agents for their communication actions.

And to update the action loss of agent $i$, we want to minimize
$$L\big( \pi_i(o_i, \mathbf{c})_{\text{action}},\; \pi_i^*(\mathbf{o}) \big)$$
(note that bold characters are vectors, e.g. $\mathbf{c}$). We note that $\mathbf{c}$ is an up-to-date communication action, which we obtain by re-querying the other agents for their communication actions.
The above leaves out some notation for ease of understanding and to get the main idea. The technically and notationally correct version is:
If

- the supervised learning loss function is $L$,

- the expert is $\pi^*$,

- the agents at the current iteration are $\pi_1, \dots, \pi_N$,

- $o_j^t$ is the observation of agent $j$,

- $o_j^{t+1}$ is the observation of agent $j$ at the next timestep (i.e. after observing $o_j^t$ and communications from the other agents, and then taking a step in the environment, to get a new observation),

- $\mathbf{c}_{-j}^t$ is the communication action of the other agents towards agent $j$ that accompanies observation $o_j^{t+1}$,

- $\tilde{\mathbf{c}}_{-j}^t$ is the updated (i.e. using the most up-to-date policy) communication action of the other agents towards agent $j$,

then, to update the communication of agent $i$, we want to minimize:
$$\text{(communication loss)} \qquad \sum_{j \neq i} L\big( \pi_j(o_j^{t+1}, \tilde{\mathbf{c}}_{-j}^t)_{\text{action}},\; \pi_j^*(\mathbf{o}^{t+1}) \big),$$
where we note that $\tilde{\mathbf{c}}_{-j}^t$ contains the communication action from agent $i$ to agent $j$, and thus we can backprop; the subscript “action” means the physical action (and not the communication action).

And for the action loss, we want to minimize:
$$\text{(action loss)} \qquad L\big( \pi_i(o_i^{t+1}, \tilde{\mathbf{c}}_{-i}^t)_{\text{action}},\; \pi_i^*(\mathbf{o}^{t+1}) \big),$$
where $\tilde{\mathbf{c}}_{-i}^t$ is the updated communication from all the other agents to agent $i$ (i.e. we use the current, most up-to-date policies to form $\tilde{\mathbf{c}}_{-i}^t$). The need to use an updated version of the communication arises because the language of the agents tends to change drastically as training progresses. (One thing we did try was to set a low size limit on the trajectory dataset $\mathcal{D}$, but preliminary results showed it did not help much.) We do note that in order to construct $\tilde{\mathbf{c}}_{-i}^t$, we need to use the observations/communications from time step $t$, so a reliance on an older language is still there. One can imagine going all the way back to $t = 0$ in order to update the communications of the whole trajectory, but for our experiments, going back one timestep sufficed.
We give pseudocode in Algorithm 2.
A diagram of the action loss and the communication loss is given in Figures 4 (action loss) and 5 (communication loss).
We remark that we also considered a hybrid objective, where the actions are learned by supervised learning from the expert, and the communication is learned as in a standard RL algorithm (e.g. the learned values are over communication actions). Preliminary results showed this did not work well.
Appendix E More detailed descriptions of the environments used in the experiments
e.1 Cooperative navigation
The goal of this scenario is to have agents occupy landmarks in a 2D plane, and the agents are either homogeneous or heterogeneous. The environment consists of:

Observations: The (continuous) observations of each agent are the relative positions of the other agents, the relative positions of each landmark, and its own velocity. Agents do not have access to others' velocities, and thus each agent only partially observes the environment (aside from not knowing the other agents' policies).

Reward: At each timestep, if $x_i$ is the position of the $i$th agent and $l_k$ the position of the $k$th landmark, then the reward at time $t$ is,
$$R(t) = -\sum_k \min_i \lVert x_i - l_k \rVert.$$
This is a sum over each landmark of the minimum agent distance to that landmark. Agents also receive a negative reward at each timestep in which there is a collision.

Actions: Each agent's actions are discrete and consist of: up, down, left, right, and do nothing. These actions are acceleration vectors (except do nothing), which the environment takes and uses to simulate the agents' movements with basic physics (i.e. Newton's laws).
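The shared reward described above (a sum over landmarks of the distance to the closest agent) can be computed as follows; the positions are illustrative, the negation reflects the cost framing used elsewhere in the paper, and the collision penalty is omitted:

```python
import numpy as np

def navigation_reward(agent_pos, landmark_pos):
    """agent_pos: (n_agents, 2); landmark_pos: (n_landmarks, 2)."""
    # pairwise distances, shape (n_landmarks, n_agents)
    d = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    # closest agent per landmark, summed over landmarks, negated
    return -d.min(axis=1).sum()

agents = np.array([[0.0, 0.0], [1.0, 1.0]])
landmarks = np.array([[0.0, 0.0], [1.0, 0.0]])
r = navigation_reward(agents, landmarks)
# landmark 0 is covered (distance 0); landmark 1's nearest agent is at distance 1
assert np.isclose(r, -1.0)
```

Because every agent receives this same scalar, the task is fully cooperative: no agent can improve its own reward at another's expense.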
e.2 Speaker listener
In this scenario, the goal is for the listener agent to reach a goal landmark, but it does not know which landmark is the goal. Thus it is reliant on the speaker agent to provide the correct goal landmark. The observation of the speaker is just the color of the goal landmark, while the observation of the listener is the relative positions of the landmarks, and the reward is based on the distance to the goal landmark. The environment consists of:

Observations: The observation of the speaker is the goal landmark. The observation of the listener is the communication from the speaker, as well as the relative position of each landmark.

Reward: The reward is merely the negative (squared) distance from the listener to the goal landmark.

Actions: The actions of the speaker consist of just a communication, a 3-dimensional vector. The actions of the listener are the five movement actions: up, down, left, right, and do nothing.
e.3 Cooperative navigation with communication
In this particular scenario, we have one version with 2 agents and 3 landmarks, and another version with 3 agents and 5 landmarks. Each agent has a goal landmark that is known only by the other agents. Thus each agent must communicate to the other agents their goals. The environment consists of:

Observations: The observations of each agent consist of the agent's own velocity, the relative position of each landmark, the goal landmark of the other agent (a 3-dimensional RGB color value), and a communication observation from the other agent.

Reward: At each timestep, the reward is the negative sum of the distances between each agent and its goal landmark.

Actions: This time, agents have both a movement action and a communication action. The movement action is either do nothing, or an acceleration vector of magnitude one in the direction of up, down, left, or right. The communication action is a one-hot vector; here we choose the communication action to be 10-dimensional.
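A minimal sketch of this action structure; the particular index-to-direction assignment is an arbitrary choice of ours:

```python
import numpy as np

# One unit acceleration vector per discrete movement action
# (index assignment is illustrative, not prescribed by the environment).
MOVES = {
    0: np.array([0.0, 0.0]),   # do nothing
    1: np.array([0.0, 1.0]),   # up
    2: np.array([0.0, -1.0]),  # down
    3: np.array([-1.0, 0.0]),  # left
    4: np.array([1.0, 0.0]),   # right
}

def action_to_acceleration(move_idx):
    """Map a discrete movement action index to its acceleration vector."""
    return MOVES[move_idx]

def communication_message(comm_idx, dim=10):
    """Encode a communication action as a one-hot vector of length `dim`."""
    msg = np.zeros(dim)
    msg[comm_idx] = 1.0
    return msg
```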
Appendix F Hyperparameters

For all environments, we chose the same discount factor across all experiments, as that seemed to benefit the centralized expert as well as MADDPG (and independently trained DDPG). We always used a two-hidden-layer neural network for all of MADDPG's actors and critics, for the centralized expert, and for the decentralized agents. The training of MADDPG used the hyperparameters from the MADDPG paper maddpg , with the exception of the discount factor (instead of their 0.95), as that improved MADDPG's performance.

For the cooperative navigation environments with 3 agents, both homogeneous and heterogeneous: our centralized expert was a two-hidden-layer neural network with 225 units per layer (as that matched the number of parameters of MADDPG when choosing 128 hidden units for each of its 3 agents), and we used a batch size of 64 and a learning rate of 0.001. We also clipped the gradient norms. When decentralizing, each agent was a two-hidden-layer neural network with 128 units (as in MADDPG), trained with a batch size of 32 and a learning rate of 0.001. In our experiment comparing with MADDPG, we use the cross-entropy loss. The MADDPG and DDPG networks used 128 hidden units, gradient norms clipped at 0.5, and a learning rate of 0.01.

For the cooperative navigation environments with 6 agents, both homogeneous and heterogeneous: our centralized expert was a two-hidden-layer neural network with 240 units per layer (as that matched the number of parameters of MADDPG when choosing 128 hidden units for each of its agents' actors and critics), and we used a batch size of 32 and a learning rate of 0.0001. We also clipped the gradient norms. When decentralizing, each agent was a two-hidden-layer neural network with 128 units (as in MADDPG), trained with a batch size of 32 and a learning rate of 0.001. In our experiment comparing with MADDPG, we use the cross-entropy loss. The MADDPG and DDPG networks used 128 hidden units, gradient norms clipped at 0.5, and a learning rate of 0.01.

For the speaker-listener environment: our centralized expert was a two-hidden-layer neural network with 64 units per layer (which gave a lower number of parameters than MADDPG when choosing 64 hidden units for each of its 2 agents' actors and critics), and we used a batch size of 32 and a learning rate of 0.0001. When decentralizing, each agent was a two-hidden-layer neural network with 64 units (as in MADDPG), trained with a batch size of 32 and a learning rate of 0.001. In our experiment comparing with MADDPG, we use the cross-entropy loss. The MADDPG and DDPG networks used 64 hidden units, gradient norms clipped at 0.5, and a learning rate of 0.01.

For the cooperative navigation with communication environment: our centralized expert was a two-hidden-layer neural network with 95 units per layer (which matched the number of parameters of MADDPG when choosing 64 hidden units for each of its 2 agents' actors and critics), and we used a batch size of 32 and a learning rate of 0.0001. When decentralizing, each agent was a two-hidden-layer neural network with 64 units (as in MADDPG), trained with a batch size of 32 and a learning rate of 0.001. In our experiment comparing with MADDPG, we use the cross-entropy loss. The MADDPG and DDPG networks used 64 hidden units, gradient norms clipped at 0.5, and a learning rate of 0.01.

We run all environments for 25 time steps per episode.
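As a sanity check on how the expert's hidden width was matched to the agents' networks, one can count the parameters of the two-hidden-layer architecture directly; the input/output sizes in the example below are illustrative stand-ins, not values from the experiments:

```python
def mlp_param_count(in_dim, hidden, out_dim):
    """Number of weights and biases in a two-hidden-layer MLP with
    `hidden` units per hidden layer (the architecture used throughout)."""
    return (in_dim * hidden + hidden        # input -> first hidden layer
            + hidden * hidden + hidden      # first -> second hidden layer
            + hidden * out_dim + out_dim)   # second hidden layer -> output

# Illustrative only: with hypothetical observation/action sizes, a single
# wide expert network can be sized against several narrower agent networks.
expert_params = mlp_param_count(in_dim=54, hidden=225, out_dim=15)
per_agent_params = mlp_param_count(in_dim=18, hidden=128, out_dim=5)
```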
Appendix G Proofs of theorems
g.1 Mathematical notation and preliminaries
For completeness, we provide here proofs of the theorems from the main text. The proofs heavily borrow from DAgger .
In order to reduce notational burden, we denote,
so that
where is the joint multiagent observation. And we notate similarly.
We also denote the distribution of states seen at time , and the average distribution of states encountered if we follow for timesteps.
And letting be the cost of executing action a in state o, and letting , then we let
We use the following lemma which gives a total variation bound:
Lemma 3.
If is the average distribution of states encountered by (the policy that best mimics the expert on past trajectories at iteration ), and if is the average distribution of states encountered by the policy (which picks expert actions with probability and otherwise), then we have .
Proof.
Suppose is the distribution of states encountered over steps for a policy that picks expert actions at least once. Since picks actions from over steps with probability , we have
Then,
∎
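This lemma and its proof mirror Lemma 4.1 of the single-agent DAgger analysis DAgger ; in that setting, with $\beta_i$ the probability of querying the expert at iteration $i$ and $T$ the horizon, the decomposition and bound read:

```latex
d_{\bar{\pi}_i} \;=\; (1-\beta_i)^T \, d_{\pi_i} \;+\; \bigl(1-(1-\beta_i)^T\bigr)\, d,
\qquad
\|d_{\bar{\pi}_i} - d_{\pi_i}\|_1 \;\le\; 2\bigl(1-(1-\beta_i)^T\bigr) \;\le\; 2T\beta_i,
```

using $(1-\beta_i)^T \ge 1 - T\beta_i$. The multi-agent statement here is presumed to take the same form over joint observations.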
Now we can prove:
Theorem 4.
If the number of iterations is , then there exists a policy such that where
Proof.
Let , and let be a non-increasing sequence, and suppose is the largest where .
Now, the lemma above implies,
Then let be a constant bounding (the no-regret bound),
Then we have the following set of inequalities,
∎
So we have,
Corollary 5.
If the number of iterations is , then there exists a joint multiagent policy such that
where,
And now we can prove Theorem 1 from the main text:
Theorem 1.
Let be the  loss of matching the expert policy. And suppose we have the cost of partial observability,
If the number of iterations is , then there exists a joint multiagent policy such that
where,
Proof.
The condition on gives us that,
And thus we have,
∎
And now we prove Theorem 2 from the main text:
Theorem 2.
If the multiagent communication satisfies the following condition:
for all , then there is no partial observability problem of decentralization, and thus if the number of iterations is of , then there exists a policy such that . Then this implies,
Proof.
We need only show that there is no partial observability problem of decentralization when the above conditions hold. As a note: we now have that the observation of agent is not just , but now .
Suppose for contradiction that there is a partial observability problem of decentralization, and thus there exists an agent , observations and such that
Denote as the observation without and similarly for . We then see we must have,
for all possible policies . But of course by the above conditions, , and thus we can certainly construct a policy where the above does not hold true, a contradiction. And thus we must have that there is no partial observability problem of decentralization. ∎
References

(1) Daniel S. Bernstein, Eric A. Hansen, and Shlomo Zilberstein. Bounded policy iteration for decentralized POMDPs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 52–57, 2005.
 (2) Lucian Busoniu, Robert Babuska, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
 (3) Lucian Busoniu, Robert Babuska, and Bart De Schutter. Multiagent reinforcement learning: An overview. 2010.
 (4) Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998:746–752, 1998.
 (5) Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
 (6) Roel Dobbe, David Fridovich-Keil, and Claire Tomlin. Fully decentralized policies for multiagent systems: An information theoretic approach. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2941–2950. Curran Associates, Inc., 2017.
 (7) Maxim Egorov. Multiagent deep reinforcement learning, 2016.
 (8) Richard Evans and Jim Gao. DeepMind AI reduces Google data centre cooling bill by 40%, 2016.
 (9) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
 (10) Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multiagent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.
 (11) Jakob N. Foerster, Christian A. Schröder de Witt, Gregory Farquhar, Philip H. S. Torr, Wendelin Boehmer, and Shimon Whiteson. Multiagent common knowledge reinforcement learning. CoRR, abs/1810.11702, 2018.
 (12) Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multiagent policy gradients. CoRR, abs/1705.08926, 2017.
 (13) Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multiagent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66–83. Springer, 2017.
 (14) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 (15) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 (16) Landon Kraemer and Bikramjit Banerjee. Multiagent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
 (17) Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multiagent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 535–542. Morgan Kaufmann, 2000.
 (18) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 (19) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 (20) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multiagent actor-critic for mixed cooperative-competitive environments. CoRR, abs/1706.02275, 2017.
 (21) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 (22) L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multiagent teams. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 64–69, October 2007.
 (23) Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. Knowledge Engineering Review, 27(1):1–31, February 2012.
 (24) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
 (25) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 (26) Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multiagent populations. arXiv preprint arXiv:1703.04908, 2017.
 (27) Frans A. Oliehoek, Matthijs T. J. Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.
 (28) Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep decentralized multitask multiagent reinforcement learning under partial observability. CoRR, abs/1703.06182, 2017.
 (29) Liviu Panait and Sean Luke. Cooperative multiagent learning: The state of the art. Autonomous Agents and MultiAgent Systems, 11(3):387–434, Nov 2005.
 (31) James Paulos, Steven W. Chen, Daigo Shishika, and Vijay Kumar. Decentralization of multiagent policies by learning what to communicate. 2018.
 (32) Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorisation for deep multiagent reinforcement learning. CoRR, abs/1803.11485, 2018.
 (33) Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668, 2010.
 (34) Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. CoRR, abs/1011.0686, 2010.
 (35) John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.