1 Introduction
A valuable characteristic of multirobot systems is their robustness to failure. Multirobot systems can adapt to agent attrition and accomplish tasks even when the performance of individual agents degrades. This characteristic is important in applications that are inherently hazardous due to environmental factors or interaction with adversarial agents. Decision making in multiagent systems can be modeled as a decentralized partially observable Markov decision process (DecPOMDP) [1]. In general, computing an optimal policy is NEXP-complete [2]. While various solution techniques based on heuristic search [3, 4, 5] and dynamic programming [6, 7] exist, they tend to quickly become intractable.
One approach to approximating solutions to such problems is deep reinforcement learning (deep RL) [8, 9, 10, 11, 12]. However, a fundamental challenge is the multiagent credit assignment problem. When multiple agents act simultaneously toward a shared objective, it is often unclear which actions of which agents are responsible for the overall joint reward. There may be strong interdependencies between the actions of different agents and long delays between joint actions and their eventual rewards [13].
This work makes three contributions to multiagent decision making. First, we outline a definition and properties for the concept of system health in the context of Markov decision processes. These definitions provide the framework for a subset of DecPOMDPs that can be used to analyze multiagent systems operating in hazardous and adversarial environments. Second, we use the definition of health to formulate a multiagent credit assignment algorithm that accelerates and improves multiagent reinforcement learning. Third, we apply the health-informed crediting algorithm within a multiagent variant of the proximal policy optimization (PPO) [14] algorithm and demonstrate significant learning improvements compared to existing algorithms in a simple two-dimensional particle environment.
2 Related Work
Applying deep RL to multiagent decision making is an active area of research. Gupta et al. made early contributions to this field by demonstrating how algorithms such as TRPO, DQN, DDPG, and A3C can be extended to a range of cooperative multiagent problems [10]. Lowe et al. made further contributions to multiagent deep RL with the development of multiagent deep deterministic policy gradients (MADDPG), which is capable of training in cooperative and competitive environments [11].
Multiagent reinforcement learning is challenging due to the problems of nonstationarity and multiagent credit assignment. The nonstationarity problem arises when a learning agent treats all other learning agents as part of the environment dynamics. Since the individual agents are continuously changing their policies, the environment dynamics from the perspective of any single agent are continuously changing [15]. While Lowe et al. attempt to address the nonstationarity problem, MADDPG is still shown to become ineffective at learning for systems with more than three or four agents [11].
Gupta et al. partially address the nonstationarity problem by employing parameter sharing, whereby groups of homogeneous agents use identical copies of parameters for their local policies [10]. Parameter sharing techniques can give rise to the second fundamental challenge of multiagent learning: multiagent credit assignment, i.e. the challenge of identifying which actions from which agent at which time were most responsible for the overall performance (returns) of the system. Gupta et al. avoid explicit treatment of this problem by focusing on environments where the joint rewards can be decomposed into local rewards. However, in general, such local reward structures are not guaranteed to optimize joint returns [13].
Wolpert and Tumer made important contributions to addressing the credit assignment problem with their development of Wonderful Life Utility (WLU) and Aristocrat Utility (AU) which are forms of “difference rewards” [13]. Both WLU and AU attempt to assign credit to individual agents’ actions by taking the difference of utility received versus utility that would have been received had a different action been taken by the agent. The comparison between actual returns and hypothetical returns is sometimes referred to as counterfactual returns or learning [12]. Predating most of the advancements in deep reinforcement learning, Wolpert and Tumer’s work was restricted to small decision problems that could be handled in a tabular fashion [13, 16].
Foerster et al. formulated an Aristocrat-like crediting method that leverages a deep neural network state-action value function within a policy gradient algorithm, referred to as counterfactual multiagent (COMA) policy gradients [12]. By employing deep neural networks, Foerster et al. enabled crediting in large or continuous state spaces; however, their counterfactual baseline requires enumeration over all actions and is thus restricted to problems with small, discrete action spaces.
3 Problem Statement
The problems presented in this work can be modeled as decentralized partially observable Markov decision processes (DecPOMDPs), which are defined by the tuple $\langle \mathcal{I}, \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R \rangle$. $\mathcal{I} = \{1, \dots, n\}$ represents a finite set of agents. $\mathcal{S}$ is the joint state space of all agents (finite or infinite). Assuming states are described in vector form, let $\mathbf{s} \in \mathcal{S}$ be a specific state of the system. $\mathcal{A}^i(\mathbf{s})$ is the action space of the $i$th agent in joint state $\mathbf{s}$. The vector $\mathbf{a}_t = [a^1_t, \dots, a^n_t]$ represents a joint action at time $t$, where $a^i_t \in \mathcal{A}^i(\mathbf{s}_t)$. $\mathcal{O}^i(\mathbf{s})$ is the set of observations for the $i$th agent in joint state $\mathbf{s}$. The vector $\mathbf{o}_t = [o^1_t, \dots, o^n_t]$ represents a joint observation at time $t$, where $o^i_t \in \mathcal{O}^i(\mathbf{s}_t)$. $T(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ is the joint transition function that gives the probability density of arriving in state $\mathbf{s}'$ given that joint action $\mathbf{a}$ was taken in state $\mathbf{s}$. $R(\mathbf{s}, \mathbf{a})$ is the joint reward function that specifies an immediate reward for taking joint action $\mathbf{a}$ while in state $\mathbf{s}$.
3.1 System Health and Action Prognosis
Here, we introduce a concept we refer to simply as system health, though there are alternative definitions used in the field of prognostic decision making (PDM) [17, 18]. If we represent the current state of the system with the vector $\mathbf{s}$, then the system health $\mathbf{h}$ constitutes a subvector of $\mathbf{s}$. Without loss of generality, we can define the state vector as $\mathbf{s} = [\mathbf{x}, \mathbf{h}]$, where $\mathbf{x}$ is the non-health component of the state. Each element $h^i$ of the health vector corresponds to the health of an individual agent and lies in the interval $[0, 1]$, where 1 represents full health and 0 represents a fully degraded agent. In this paper, we choose to define health as a monotonically decreasing quantity that holds the following properties associated with reduction of system health:
Property 1
Monotonically decreasing health. Let $\mathbf{s}_t$ represent any state vector in which the health of agent $i$ equals $h^i_t$; we can then define the non-increasing nature of the health of the system as follows
$h^i_{t+1} \le h^i_t \quad \forall\, i, t$ (1)
Property 2
Constriction of reachable set in state space. Let $\mathcal{R}(\mathbf{s})$ denote the reachable set of joint state $\mathbf{s} = [\mathbf{x}, \mathbf{h}]$. The constriction of the reachable set can then be described as follows
$\mathcal{R}([\mathbf{x}, \mathbf{h}']) \subseteq \mathcal{R}([\mathbf{x}, \mathbf{h}]) \quad \forall\, \mathbf{h}' \le \mathbf{h}$ (2)
Property 3
Constriction of available actions in action space. Let $\mathcal{A}^i(\mathbf{s})$ represent the available action set for agent $i$ when the system is in state $\mathbf{s} = [\mathbf{x}, \mathbf{h}]$. Therefore the constriction of the available action set can be described as
$\mathcal{A}^i([\mathbf{x}, \mathbf{h}']) \subseteq \mathcal{A}^i([\mathbf{x}, \mathbf{h}]) \quad \forall\, \mathbf{h}' \le \mathbf{h}$ (3)
Property 4
Constriction of observable set in observation space. Let $\mathcal{O}^i(\mathbf{s})$ represent the set of possible observations for agent $i$ when the system is in state $\mathbf{s} = [\mathbf{x}, \mathbf{h}]$. Therefore the constriction of the observable set can be described as:
$\mathcal{O}^i([\mathbf{x}, \mathbf{h}']) \subseteq \mathcal{O}^i([\mathbf{x}, \mathbf{h}]) \quad \forall\, \mathbf{h}' \le \mathbf{h}$ (4)
Note that a reduction in health may also affect the reward function, for example through increased variance or a change in the expected value, but we treat these reward effects on a per-problem basis and make no general assertions about the reward signal as a function of system health.
Given a joint action $\mathbf{a}$ taken at joint state $\mathbf{s}$, we can define the expected reduction in health of the system as the action's health risk, also referred to as the action prognosis. Let the action prognosis $\rho(\mathbf{s}, \mathbf{a})$ be a vector containing the health risk of a joint state-action pair:
$\rho(\mathbf{s}, \mathbf{a}) = -\,\mathbb{E}\!\left[\mathbf{h}_{t+1} - \mathbf{h}_t \mid \mathbf{s}_t = \mathbf{s},\, \mathbf{a}_t = \mathbf{a}\right]$ (5)
Note that an action prognosis is a non-negative quantity, since the expected change in health is guaranteed to be non-positive by Property 1. Furthermore, for each joint state $\mathbf{s}$, there exists some individual action $\bar{a}^i$ that represents the maximum health risk to agent $i$:
$\bar{a}^i = \operatorname*{arg\,max}_{a^i \in \mathcal{A}^i(\mathbf{s})} \rho^i(\mathbf{s}, \mathbf{a})$ (6)
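As an illustration, the action prognosis and the maximum-risk action of Equation 6 can be estimated by Monte Carlo sampling of the transition function. This is only a sketch: `health` and `step_fn` are hypothetical stand-ins for a problem-specific health extractor and environment sampler, not part of the formulation above.

```python
import numpy as np

def action_prognosis(health, step_fn, state, joint_action, n_samples=100):
    """Estimate the action prognosis rho(s, a): the expected per-agent
    health decrease from taking `joint_action` in `state`.

    `step_fn(state, action, rng)` samples a successor state;
    `health(state)` returns the health vector (both are assumed helpers).
    """
    rng = np.random.default_rng(0)
    h0 = health(state)
    drops = np.zeros_like(h0)
    for _ in range(n_samples):
        s_next = step_fn(state, joint_action, rng)
        drops += h0 - health(s_next)   # non-negative by Property 1
    return drops / n_samples

def max_risk_action(health, step_fn, state, candidate_actions, agent_idx):
    """Pick, for one agent, the candidate joint action with the largest
    expected health decrease for that agent (Equation 6)."""
    risks = [action_prognosis(health, step_fn, state, a)[agent_idx]
             for a in candidate_actions]
    return candidate_actions[int(np.argmax(risks))]
```

In a real environment the expectation would be estimated from the simulator or a learned dynamics model rather than the toy callables assumed here.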
4 Approach
This section uses the properties of health and action prognosis to derive a multiagent reinforcement learning algorithm for training systems that operate in hazardous and adversarial environments. We restrict our scope to policy gradient methods because of their scalability to large and continuous state and action spaces [19]. We develop a health-informed policy gradient in Section 4.1 and use it to propose a new multiagent PPO variant in Section 4.2.
The goal of a cooperative multiagent learning problem is to develop a joint policy $\pi_{\boldsymbol\theta}$, parameterized by joint parameters $\boldsymbol\theta$, that maximizes the discounted joint returns $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, where $\gamma$ is the discount factor, $r_k$ is the empirical joint reward, and $T$ is the final time step in an episode or receding horizon. In the context of DecPOMDPs with no centralized control, the joint policy is composed of a set of local policies $\pi^i_{\theta^i}$ that are parameterized by $\theta^i$ and map an agent's local observations to its actions at each time step.
In general, each agent may learn its own individual policy $\pi^i$; however, this can lead to a nonstationary environment from the perspective of any one agent, which can confound the learning process [15]. Instead, for this paper, we assume that agents execute identical copies of the same policy in a decentralized fashion, referred to as parameter sharing [10, 20]. We adopt a centralized learning, decentralized execution architecture whereby training data can be centralized between training episodes even if no such centralization of information is possible during execution; this is a common approach in the multiagent RL literature [10, 11, 12].
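A minimal sketch of parameter sharing under this assumption: a single set of policy weights is applied independently to each agent's local observation. The observation and action dimensions, and the linear-tanh policy, are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))   # one shared weight matrix: obs dim 4 -> action dim 2

def local_policy(obs):
    """Every agent runs an identical copy of this mapping (parameter sharing)."""
    return np.tanh(obs @ W)

# Decentralized execution: each of 10 agents maps only its own observation.
observations = rng.normal(size=(10, 4))
actions = np.array([local_policy(o) for o in observations])
```

Because the weights are shared, experience from every agent can be pooled into one update during the centralized learning phase.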
Although parameter sharing mitigates some of the problems of nonstationarity, it does not resolve the problem of multiagent credit assignment. When the reward is a function of the joint state and actions of the system, i.e. $r_t = R(\mathbf{s}_t, \mathbf{a}_t)$, all agents receive the same reward signal at each time step. This obfuscates which agents are most responsible for the overall performance of the group and can significantly inhibit multiagent learning [13, 12]. We propose a novel counterfactual learning method that leverages health information to assign credit in multiagent systems and improve learning performance.
4.1 Health-Informed Multi-Agent Policy Gradients
To develop a policy gradient approach for health-based multiagent systems, we begin with the policy gradient theorem [21, Chapter 13]
$\nabla_{\boldsymbol\theta} J(\boldsymbol\theta) \propto \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{a}} Q_\pi(\mathbf{s}, \mathbf{a})\, \nabla_{\boldsymbol\theta} \pi(\mathbf{a} \mid \mathbf{s}, \boldsymbol\theta)$ (7)
Equation 7 gives an analytical expression for the gradient of the objective function $J$ with respect to policy parameters $\boldsymbol\theta$, where $Q_\pi$ is the true state-action value function and $\mu$ refers to the on-policy state distribution. To develop a practical algorithm, we need a sampling-based expression whose expected value is equal, or approximately equal, to Equation 7. A common form of the policy gradient expression used in sampling-based algorithms is given as [21, Chapter 13]:
$\nabla_{\boldsymbol\theta} J(\boldsymbol\theta) \approx \mathbb{E}_\pi\!\left[\Psi_t\, \nabla_{\boldsymbol\theta} \ln \pi(\mathbf{a}_t \mid \mathbf{s}_t, \boldsymbol\theta)\right]$ (8)
where $\Psi_t$ can take on a range of forms that affect the bias and variance of the learning process [22]. Typically, $\Psi_t$ takes the form of the return or discounted return $G_t$ (i.e. the REINFORCE algorithm [21, Chapter 13]); the return baselined on the state value function, $G_t - \hat{V}(\mathbf{s}_t)$ (i.e. REINFORCE with baseline [21, Chapter 13]); the state-action value function $Q_\pi(\mathbf{s}_t, \mathbf{a}_t)$; the advantage function $A_\pi(\mathbf{s}_t, \mathbf{a}_t) = Q_\pi(\mathbf{s}_t, \mathbf{a}_t) - V_\pi(\mathbf{s}_t)$; or the generalized advantage estimate [22, 14]. The joint state value function and joint state-action value function of policy $\pi$ are defined as
$V_\pi(\mathbf{s}) = \mathbb{E}_\pi\!\left[G_t \mid \mathbf{s}_t = \mathbf{s}\right]$ (9)
$Q_\pi(\mathbf{s}, \mathbf{a}) = \mathbb{E}_\pi\!\left[G_t \mid \mathbf{s}_t = \mathbf{s},\, \mathbf{a}_t = \mathbf{a}\right]$ (10)
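One widely used choice of $\Psi_t$ mentioned above is the generalized advantage estimate [22]. A minimal sketch of its standard backward recursion over one rollout (the value array is assumed to include a bootstrap value at index $T$):

```python
import numpy as np

def generalized_advantage(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE from empirical rewards and value predictions.

    rewards: length-T array of rewards r_t.
    values:  length-(T+1) array of value estimates V(s_t), with a
             bootstrap value appended for the final state.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                         # discounted sum of deltas
        adv[t] = gae
    return adv
```

With `lam=0` this reduces to the one-step TD error; with `lam=1` and `gamma=1` it reduces to the undiscounted return minus the baseline.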
A limitation of Equation 8 is that it is derived under the assumption of a single-agent or centrally controlled multiagent system. Let us recast Equation 8 in a decentralized context where each agent $i$ selects individual actions $a^i_t$ based on local observations $o^i_t$ and produces a separate state-action trajectory that can be used to compute policy gradients
$\nabla_{\boldsymbol\theta} J(\boldsymbol\theta) \approx \mathbb{E}_\pi\!\left[\sum_{i=1}^{n} \Psi^i_t\, \nabla_{\boldsymbol\theta} \ln \pi(a^i_t \mid o^i_t, \boldsymbol\theta)\right]$ (11)
If all agents have the same policy parameters, receive identical rewards throughout an episode, and employ the same $\Psi$ function, then Equation 11 renders the same gradient at each time step for all agents' actions. This uniformity in policy gradients is problematic because it results in all actions at a given time step being promoted equally during the next training cycle; this is precisely the multiagent credit assignment problem.
To overcome this problem, we use the $\Psi^i_t$ term to distinguish gradients between different agents' actions at the same time step. One option is to base $\Psi^i_t$ on a value function over the agent's own observations, which defines an observation-based value function referred to as a local critic. Since all agents are expected to receive distinct observations at each time step, $\Psi^i_t$ is expected to be distinct for each agent at a given time step. However, this approach relies on making value approximations from limited information, which can significantly impact learning, as we demonstrate in Section 5. If we instead leverage our prior assumption of centralized learning, then we can make direct use of the central value functions in Equations 9 and 10; however, this still does not resolve the multiagent credit assignment problem.
We can use the concepts of health and action prognosis to address the credit assignment problem. Our technique stems from the concepts of counterfactual baselines [12] and difference rewards [23]. The idea is that credit is assigned to agent $i$ at a given time step by comparing the joint return obtained with agent $i$ at its present state and chosen action against the expected return had agent $i$ been at its minimum health or chosen an action of maximum health risk (Equation 6).
In order to replace the summation over states and actions in Equation 7 with an expectation in Equation 8, we assume that policy $\pi$ is followed and that it produces states and actions in proportion to the summation terms. However, this assumption is potentially invalid for our purposes due to Properties 2 and 3. If a reduction in system health constricts the available actions and reachable states, and this constriction is not encoded within the policy, then the implied assumption in Equation 8 is violated. This would occur, for example, if the policy selects an action from a degraded health state that the physical agent is incapable of executing.
To overcome this inconsistency, we propose a simple heuristic: the policy gradient term for agent $i$ at time $t$ is multiplied by the true health of agent $i$ at time $t$, i.e. $\Psi^i_t \leftarrow h^i_t\, \Psi^i_t$. The justification for this heuristic is that, as an agent's health degrades and the available action set is constricted, it becomes less likely that the action selected by the policy aligns with the action executed by the agent. By attenuating the policy gradient by the health of the agent, policy learning occurs more slowly on data generated in low-health states, reducing the effect of mismatches between chosen and executed actions. Section 5 investigates the case where the health state is binary for each agent and a zero-health state shrinks the action set until only a zero-vector action is executable by the agent.
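The attenuation heuristic can be sketched as a reweighting of sampled gradient terms. The array shapes here are illustrative assumptions (T time steps, N agents, P policy parameters), and the gradients are assumed to be precomputed per agent:

```python
import numpy as np

def attenuated_policy_gradient(grad_log_probs, advantages, healths):
    """Health-attenuated policy gradient estimate (a sketch of the heuristic).

    grad_log_probs: (T, N, P) gradients of ln pi(a_t^i | o_t^i) per time, agent.
    advantages:     (T, N) credited Psi terms per time and agent.
    healths:        (T, N) true health h_t^i of each agent at each step.
    Returns the summed gradient, with each term scaled by the agent's health
    so data from low-health states contributes less to the update.
    """
    w = np.asarray(advantages) * np.asarray(healths)       # Psi <- h * Psi
    return (w[..., None] * np.asarray(grad_log_probs)).sum(axis=(0, 1))
```

Note that an agent with zero health contributes nothing to the update, matching the binary-health case studied in Section 5.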
We propose the following minimum-health and worst-case-prognosis baselines:
$\Psi^i_t = G_t - \hat{V}(\mathbf{s}^{\langle i\rangle}_t)$ (12)
$\Psi^i_t = G_t - \hat{Q}(\mathbf{s}_t, \mathbf{a}^{\langle i\rangle}_t)$ (13)
where $\mathbf{s}^{\langle i\rangle}_t$ represents the true joint state of the system at time $t$, except that the health of agent $i$ is replaced with its minimum health (typically 0). Likewise, $\mathbf{a}^{\langle i\rangle}_t$ represents the true joint action of the system at time $t$, except that the action of agent $i$ is replaced with its maximum-risk action, i.e. the action with the largest expected decrease in health.
The health-informed baselines are motivated by the concept of Wonderful Life Utility (WLU) [13] but adapted for deep RL policy gradients. Furthermore, Equation 12 provides a counterfactual baseline that is completely agnostic to the action space, a property not seen in prior work [13, 23, 12]. In contrast, other modernizations of WLU still require maintaining explicit counts of discrete actions and state visitations and are not well suited to continuous domains and deep RL [23]. Modern implementations of Aristocrat Utility (AU) such as COMA [12] require enumeration over all possible actions or computationally expensive Monte Carlo analysis at each time step. The health-informed baselines suffer no such limitations.
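A sketch of how the minimum-health baseline of Equation 12 could be computed with a learned central critic. The `agent_health_idx` mapping, which locates each agent's health entry within the joint state vector, is an assumed encoding for illustration:

```python
import numpy as np

def min_health_advantages(value_fn, joint_states, returns, agent_health_idx):
    """Minimum-health counterfactual crediting (sketch of Equation 12).

    Credit for agent i at time t is the empirical joint return minus the
    central critic's value of the same joint state with agent i's health
    replaced by its minimum (here 0).

    value_fn:         callable mapping a joint state vector to a scalar value.
    joint_states:     length-T sequence of joint state vectors.
    returns:          length-T sequence of empirical joint returns G_t.
    agent_health_idx: agent_health_idx[i] is the index of agent i's health.
    """
    T, N = len(joint_states), len(agent_health_idx)
    adv = np.zeros((T, N))
    for t in range(T):
        for i, k in enumerate(agent_health_idx):
            s_cf = np.array(joint_states[t], dtype=float)
            s_cf[k] = 0.0                       # counterfactual: h_i = h_min
            adv[t, i] = returns[t] - value_fn(s_cf)
    return adv
```

Because only the critic is queried at counterfactual states, no enumeration over actions is required, which is the action-space-agnostic property noted above.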
For the remainder of the paper, we focus on the minimum-health baseline, leaving more detailed analysis and testing of the worst-case-prognosis baseline for future work, though we hypothesize that the two would produce similar performance.
4.2 Health-Informed Multi-Agent Proximal Policy Optimization
While the health-informed credit assignment technique described in Section 4.1 is applicable to any reinforcement learning algorithm that uses value functions, we choose to demonstrate the application of these baselines within a multiagent variant of proximal policy optimization (PPO) [14]. PPO is chosen because it has been shown to work well with continuous action spaces [19] as well as multiagent environments [9].
We apply the health-informed counterfactual baselines in Equations 12 and 13 to PPO's surrogate objective function and formulate the clipped surrogate objective
$L^{CLIP}(\boldsymbol\theta) = \mathbb{E}\!\left[\min\!\left(r^i_t(\boldsymbol\theta)\,\Psi^i_t,\ \mathrm{clip}\!\left(r^i_t(\boldsymbol\theta),\, 1-\epsilon,\, 1+\epsilon\right)\Psi^i_t\right)\right]$ (14)
where $r^i_t(\boldsymbol\theta) = \pi_{\boldsymbol\theta}(a^i_t \mid o^i_t)\,/\,\pi_{\boldsymbol\theta_{\mathrm{old}}}(a^i_t \mid o^i_t)$ is the per-agent probability ratio.
As in the original PPO paper, we also train the value function network and augment the objective with an entropy bonus to encourage exploration [14]. We train a centralized critic $\hat{V}_{\boldsymbol\phi}$ using TD($\lambda$). In particular, we compute the $\lambda$-returns $G^\lambda_t$ for the rollouts and train the parameters $\boldsymbol\phi$ using gradient descent on the following loss:
$L(\boldsymbol\phi) = \mathbb{E}\!\left[\left(\hat{V}_{\boldsymbol\phi}(\mathbf{s}_t) - G^\lambda_t\right)^{2}\right]$ (15)
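The per-agent clipped surrogate and the critic regression loss can be sketched as follows. This is a simplified NumPy rendering that omits the entropy bonus and automatic differentiation; the advantages are assumed to be already credited (e.g. by the minimum-health baseline):

```python
import numpy as np

def clipped_surrogate(ratios, advantages, eps=0.2):
    """PPO clipped surrogate (as in Equation 14), applied elementwise over
    agents and time steps. `ratios` are pi_new(a|o) / pi_old(a|o)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()   # objective to maximize

def critic_loss(values, lambda_returns):
    """Mean-squared regression loss against the TD(lambda) returns
    (as in Equation 15)."""
    return np.mean((values - lambda_returns) ** 2)
```

In a full implementation both quantities would be computed on minibatches of rollout data and optimized with an automatic-differentiation framework.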
5 Experiments
This section presents experiments to demonstrate health-informed credit assignment in multiagent learning. The experiments compare multiagent deep deterministic policy gradients (MADDPG [11]) with three multiagent variants of proximal policy optimization (MAPPO), referred to as local critic, central critic, and min-health crediting. (See Appendix A for a comparison between MADDPG and our proposed variants of MAPPO in an environment taken directly from the original MADDPG paper [11].)
The local-critic MAPPO uses an advantage function based on local observation value estimates. The central-critic MAPPO uses an advantage function based on joint state value estimates, enabled by the centralized learning assumption. The minimum-health crediting MAPPO uses the health-informed counterfactual baseline in Equation 12.
The experiments are built within a forked version of OpenAI's Multi-Agent Particle Environment library that is extended to incorporate the concepts of system health and risk-taking [11, 24]. Since the scenarios used by Lowe et al. were designed for small groups of agents and did not incorporate the concept of health, new scenarios were developed. Section 5.1 describes a task in which multiple agents must link two fixed terminals in the presence of a hazard that can terminate individual agents, emulating the formation of an ad hoc communication network.
5.1 Ad Hoc Communication Networks in Hazardous Environments
The hazardous communication scenario is a variant of a classic multiagent coordination problem, modified to incorporate the concepts of health and risk-taking. The problem consists of two fixed terminals in a 2D environment and a group of robots that can act as communication relays over short distances and are capable of moving through the 2D space. The objective is for the robotic agents to arrange themselves to form an uninterrupted link between the terminals as quickly as possible and to maintain connectivity for as long as possible during an episode. The agents observe the relative state of other agents within some finite radius, i.e. a "neighborhood". There is no central commander directing the motion of agents; instead, each agent uses its own learned local policy to map its local observations of the environment and neighboring agents to velocity commands. For each time step in which an uninterrupted link is formed between the two fixed terminals, all agents receive a reward of 1, regardless of whether or not they form one of the links between terminals. If no continuous link exists between the terminals, then all agents receive a reward of 0.
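The joint reward described above reduces to a graph-connectivity check over the terminals and the living agents, with an edge between any two nodes within communication range. A minimal breadth-first-search sketch (the positions, flags, and radius are illustrative inputs):

```python
from collections import deque

def link_reward(terminal_a, terminal_b, agent_positions, alive, comm_radius):
    """Return 1 if an unbroken chain of living agents connects the two
    terminals (each hop within `comm_radius`), else 0.

    Positions are (x, y) tuples; `alive` flags terminated agents, which
    cannot relay and are excluded from the graph.
    """
    nodes = ([terminal_a]
             + [p for p, ok in zip(agent_positions, alive) if ok]
             + [terminal_b])

    def connected(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= comm_radius ** 2

    # BFS from terminal_a (node 0) toward terminal_b (last node).
    seen, frontier = {0}, deque([0])
    while frontier:
        u = frontier.popleft()
        for v in range(len(nodes)):
            if v not in seen and connected(nodes[u], nodes[v]):
                seen.add(v)
                frontier.append(v)
    return 1 if len(nodes) - 1 in seen else 0
```

Every agent receives this same scalar at each time step, which is exactly the shared-reward setting that motivates the crediting scheme of Section 4.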
The health-based variant of this classic problem comes in the form of environmental hazards, particularly hazards that cannot be directly observed. For each episode, an environmental hazard is randomly placed between the terminals. The location of the hazard is hidden: it is neither known nor directly observed by the agents. Any agent in proximity to the hazard is at risk of being terminated with some probability at each time step. Terminated agents, i.e. agents with zero health, cannot form communication links with other agents and thus cannot directly contribute to the objective of connecting the fixed terminals. While surviving agents cannot directly observe the location of the hazard, they can observe the locations of terminated agents; thus terminated agents can act as warning beacons to other agents. Figure 1 shows a snapshot of the hazardous communication scenario in which agents have successfully formed a complete link between terminals while using the observation of terminated agents to avoid the unobserved hazard.
From a single agent's perspective, the number of observed agents and environmental features varies over time, creating a variable-sized input for the agent's policy. To encode the variable-sized observation as a fixed-size input to a neural network policy, we use histograms to bin the relative states of neighboring agents [20]. In the experiments presented here, we assume that an agent can observe the relative bearing and distance to all other agents within some neighborhood radius. These observations are stored as counts in a matrix that represents the radial and angular bins of a 2D histogram. The matrix is then flattened and concatenated with other observation features (e.g. environment information of non-varying size) to form the agent's observation vector.
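A minimal sketch of this fixed-size histogram encoding. The bin counts and neighborhood radius are illustrative choices, not the values used in the experiments:

```python
import numpy as np

def neighbor_histogram(rel_positions, n_range_bins=3, n_bearing_bins=8, radius=1.0):
    """Encode a variable number of neighbors as a fixed-size vector:
    a 2D histogram over relative distance and bearing, flattened.

    rel_positions: iterable of (dx, dy) offsets to neighboring agents.
    """
    hist = np.zeros((n_range_bins, n_bearing_bins))
    for dx, dy in rel_positions:
        r = np.hypot(dx, dy)
        if r > radius:
            continue                                   # outside the neighborhood
        r_bin = min(int(r / radius * n_range_bins), n_range_bins - 1)
        theta = np.arctan2(dy, dx) % (2 * np.pi)       # bearing in [0, 2*pi)
        b_bin = min(int(theta / (2 * np.pi) * n_bearing_bins), n_bearing_bins - 1)
        hist[r_bin, b_bin] += 1
    return hist.flatten()
```

The flattened vector has a fixed length of `n_range_bins * n_bearing_bins` regardless of how many neighbors are in view, so it can be concatenated with the other observation features and fed to the policy network.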
5.2 Results
Figure 2 illustrates the improvement in learning for a 10-agent hazardous ad hoc communication problem when multiagent PPO is used with a centralized critic and health-informed crediting. The green-triangle line represents the learning curve for multiagent PPO using a local critic, i.e. value estimates are drawn from individual agents' observations. The orange-x curve represents multiagent PPO learning with a centralized critic but with no credit assignment. (Note that the centralized critic only has access to joint state information during offline, inter-episode training phases; such global information is not available during execution of the policy.) The blue-dot curve represents learning produced by multiagent PPO with a centralized critic applying a minimum-health counterfactual baseline (Equation 12). To serve as a comparison with an existing algorithm, the red-cross curve represents learning for MADDPG [11].
Figure 2 shows that, although a local critic may initially learn more quickly than a central critic, the centralized critic overtakes it in the long run and provides superior final performance at the end of training. Most importantly, we see that a centralized critic with health-informed credit assignment outperforms all other algorithms in terms of both initial learning rate and final performance. We also see that MADDPG fails to learn in this environment, most likely due to the relatively large number of agents.
All experiments represented in Figure 2 were run for 50,000 episodes, with each episode consisting of 50 time steps. Training occurred in batches of 256 episodes, which were broken into 8 minibatches and run over 8 epochs. For the multiagent PPO experiments, we used an entropy coefficient of 0.01 to ensure sufficient exploration [14]. Since the actor and critic networks are separate and do not share parameters, learning is not affected by the value function coefficient [14], which is thus ignored.
For all experiments represented in Figure 2, the policy network was a multilayer perceptron (MLP) with 2 fully connected hidden layers, each 64 units wide, and hyperbolic tangent activation functions. For experiments that used a local critic, the value function network matched the architecture of the policy network. For experiments that used a centralized critic, the value function network had a distinct architecture that was developed empirically: an 8-layer by 64-unit, fully connected MLP with an exponential linear unit (ELU) activation function [25]. We observed that the ELU activation function tended to outperform the rectified linear unit (ReLU) and hyperbolic tangent activations for central critic learning.
In order to ensure that the actor-critic model converges to a local optimum, we must ensure that the update timescales of the actor and critic are sufficiently slow and that the actor is updated sufficiently more slowly than the critic [12]. We used separate learning rates for the actor and the central critic with the Adam optimizer [26].
Figure 3 shows the effect of applying minimum-health credit assignment as a function of group size. With a group size of only two agents, the counterfactual credit assignment actually underperforms a purely factual, non-crediting approach. For groups of 4 and 8 agents, the minimum-health crediting technique shows slight but notable improvement in learning performance over non-credited learning. For a group size of 16 agents, the minimum-health credit assignment significantly outperforms a non-crediting approach. These results align with those shown in Figure 2, which represents a similar test with 10 agents. The curves in Figure 2 are averaged over multiple training experiments, whereas Figure 3 shows single-experiment curves.
6 Conclusions and Future Work
In this paper we have proposed a definition of system health in DecPOMDPs and shown how it can be used in policy gradient methods to improve multiagent reinforcement learning. The experiments presented in this paper serve as a proof of concept but are highly simplified compared to real-world problems. We believe that the techniques presented here are well suited for reinforcement learning in multiagent strategy game environments such as StarCraft II and DOTA 2 [8, 9].
We currently restrict our analysis to systems of homogeneous agents with shared local policies operating in cooperative environments. Using a centralized critic based on the joint state, we could extend this to heterogeneous agents with distinct local policies by forming policy groups of homogeneous agents within the heterogeneous team. To extend to partially observable stochastic games, we could apply this technique to each cooperative group of agents within the competitive game.
References
 Kochenderfer [2015] M. J. Kochenderfer. Decision making under uncertainty: Theory and application. MIT Press, 2015.
 Bernstein et al. [2002] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

 Szer and Charpillet [2005] D. Szer and F. Charpillet. An optimal best-first search algorithm for solving infinite horizon DecPOMDPs. In European Conference on Machine Learning (ECML), pages 389–399. Springer, 2005.
 Spaan et al. [2011] M. T. Spaan, F. A. Oliehoek, and C. Amato. Scaling up optimal heuristic search in DecPOMDPs via incremental expansion. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.
 Oliehoek et al. [2013] F. A. Oliehoek, S. Whiteson, and M. T. Spaan. Approximate solutions for factored DecPOMDPs with many agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 563–570, 2013.
 Hansen et al. [2004] E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI Conference on Artificial Intelligence (AAAI), volume 4, pages 709–715, 2004.
 Boularias and Chaibdraa [2008] A. Boularias and B. Chaibdraa. Exact dynamic programming for decentralized POMDPs with lossless policy compression. In International Conference on Automated Planning and Scheduling (ICAPS), pages 20–27. AAAI Press, 2008.
 [8] DeepMind. AlphaStar: Mastering the realtime strategy game StarCraft II. URL https://deepmind.com/blog/alphastarmasteringrealtimestrategygamestarcraftii/.
 [9] OpenAI. OpenAI Five. URL https://openai.com/five/#howopenaifiveworks.
 Gupta et al. [2017] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multiagent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 66–83. Springer, 2017.
 Lowe et al. [2017] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multiagent actorcritic for mixed cooperativecompetitive environments. In Advances in Neural Information Processing Systems (NIPS), pages 6379–6390, 2017.
 Foerster et al. [2018] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multiagent policy gradients. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
 Wolpert and Tumer [2002] D. H. Wolpert and K. Tumer. Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems, pages 355–369. World Scientific, 2002.
 Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 HernandezLeal et al. [2017] P. HernandezLeal, M. Kaisers, T. Baarslag, and E. M. de Cote. A survey of learning in multiagent environments: Dealing with nonstationarity. arXiv preprint arXiv:1707.09183, 2017.
 Tumer et al. [2002] K. Tumer, A. K. Agogino, and D. H. Wolpert. Learning sequences of actions in collectives of autonomous agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 378–385, 2002.
 Saxena et al. [2008] A. Saxena, J. Celaya, E. Balaban, K. Goebel, B. Saha, S. Saha, and M. Schwabacher. Metrics for evaluating performance of prognostic techniques. In IEEE International Conference on Prognostics and Health Management, pages 1–17, 2008.
 Balaban et al. [2013] E. Balaban, S. Narasimhan, M. Daigle, I. Roychoudhury, A. Sweet, C. Bond, and G. Gorospe. Development of a mobile robot test platform and methods for validation of prognosticsenabled decision making algorithms. International Journal of Prognostics and Health Management, 4(1):87, 2013.
 Duan et al. [2016] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), pages 1329–1338, 2016.
 Hüttenrauch et al. [2018] M. Hüttenrauch, A. Šošić, and G. Neumann. Local communication protocols for learning complex swarm behaviors with deep reinforcement learning. In International Conference on Swarm Intelligence, pages 71–83. Springer, 2018.
 Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2 edition, 2018.
 Schulman et al. [2016] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. Highdimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
 Nguyen et al. [2018] D. T. Nguyen, A. Kumar, and H. C. Lau. Credit assignment for collective multiagent rl with global rewards. In Advances in Neural Information Processing Systems (NIPS), pages 8102–8113, 2018.
 [24] R. Lowe. Multiagent particle environment. URL https://github.com/openai/multiagentparticleenvs.
 Clevert et al. [2015] D.A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
 Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Appendix A MADDPG Benchmarking
To provide a more direct comparison with the MADDPG algorithm, we ran our three variants of multiagent PPO on the cooperative navigation environment that appears in the original MADDPG paper [11].
Figure 4 shows MADDPG successfully learning an effective policy over a 50,000-episode training experiment. This learning curve is what we expect given that this environment was originally developed in conjunction with MADDPG. MADDPG nonetheless underperforms all of the MAPPO implementations except the local-critic MAPPO. The central-critic, non-crediting variant of MAPPO produces the best performance, outperforming the minimum-health baseline crediting variant. This is in contrast to the results for the hazardous communication scenario discussed in Section 5. However, this is to be expected because the cooperative navigation environment does not encapsulate the concepts of health or risk. We see that this causes the minimum-health crediting technique to display high variance between training experiments. The blue shaded region indicates that the worst-performing training run of minimum-health MAPPO underperforms all other algorithms, while the best-performing training run outperforms all other algorithms.