Health-Informed Policy Gradients for Multi-Agent Reinforcement Learning

08/02/2019 ∙ Ross E. Allen, et al. ∙ MIT ∙ Stanford University

This paper proposes a definition of system health in the context of multiple agents optimizing a joint reward function. We use this definition as a credit assignment term in a policy gradient algorithm to distinguish the contributions of individual agents to the global reward. The health-informed credit assignment is then extended to a multi-agent variant of the proximal policy optimization algorithm and demonstrated on simple particle environments that have elements of system health, risk-taking, semi-expendable agents, and partial observability. We show significant improvement in learning performance compared to policy gradient methods that do not perform multi-agent credit assignment.


1 Introduction

A valuable characteristic of multi-robot systems is their robustness to failure. Multi-robot systems can adapt to agent attrition and accomplish tasks even when the performance of individual agents degrades. This characteristic is important in applications that are inherently hazardous due to environmental factors or interaction with adversarial agents. Decision-making in multi-agent systems can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [1]. In general, computing an optimal policy for a Dec-POMDP is NEXP-complete [2]. While various solution techniques based on heuristic search [3, 4, 5] and dynamic programming [6, 7] exist, they tend to quickly become intractable.

One approach to approximate solutions to such problems is to use deep reinforcement learning (deep-RL) [8, 9, 10, 11, 12]. However, a fundamental challenge is the multi-agent credit assignment problem. When there are multiple agents acting simultaneously toward a shared objective, it is often unclear how to determine which actions of which agents are responsible for the overall joint reward. There may be strong inter-dependencies between the actions of different agents and long delays between joint actions and their eventual rewards [13].

This work makes three contributions to multi-agent decision making. First, we outline a definition and properties for the concept of system health in the context of Markov decision processes. These definitions provide the framework for a subset of Dec-POMDPs that can be used to analyze multi-agent systems operating in hazardous and adversarial environments. Second, we use the definition of health to formulate a multi-agent credit assignment algorithm that accelerates and improves multi-agent reinforcement learning. Third, we apply the health-informed crediting algorithm within a multi-agent variant of the proximal policy optimization (PPO) algorithm [14] and demonstrate significant learning improvements compared to existing algorithms in a simple two-dimensional particle environment.

2 Related Work

Applying deep-RL to multi-agent decision-making is an active area of research. Gupta et al. made early contributions to this field by demonstrating how algorithms such as TRPO, DQN, DDPG, and A3C can be extended to a range of cooperative multi-agent problems [10]. Lowe et al. made further contributions to the field of multi-agent deep-RL with the development of multi-agent deep deterministic policy gradients (MADDPG), which is capable of training in cooperative and competitive environments [11].

Multi-agent reinforcement learning is challenging due to the problems of non-stationarity and multi-agent credit assignment. The non-stationarity problem arises when a learning agent treats all other learning agents as part of the environment dynamics. Since the individual agents are continuously changing their policies, the environment dynamics from the perspective of any one agent are continuously changing [15]. While Lowe et al. attempt to address the non-stationarity problem, MADDPG has been shown to become ineffective at learning for systems with more than three or four agents [11].

Gupta et al. partially address the non-stationarity problem by employing parameter sharing, whereby groups of homogeneous agents use identical copies of parameters for their local policies [10]. Parameter sharing techniques can give rise to the second fundamental challenge of multi-agent learning: multi-agent credit assignment, i.e. the challenge of identifying which actions from which agent at which time were most responsible for the overall performance (returns) of the system. Gupta et al. avoid explicit treatment of this problem by focusing on environments where the joint reward can be decomposed into local rewards. However, in general, such local reward structures are not guaranteed to optimize joint returns [13].

Wolpert and Tumer made important contributions to addressing the credit assignment problem with their development of Wonderful Life Utility (WLU) and Aristocrat Utility (AU) which are forms of “difference rewards” [13]. Both WLU and AU attempt to assign credit to individual agents’ actions by taking the difference of utility received versus utility that would have been received had a different action been taken by the agent. The comparison between actual returns and hypothetical returns is sometimes referred to as counterfactual returns or learning [12]. Predating most of the advancements in deep reinforcement learning, Wolpert and Tumer’s work was restricted to small decision problems that could be handled in a tabular fashion [13, 16].

Foerster et al. formulated an aristocrat-like crediting method that was able to leverage a deep neural network state-action value function within a policy gradient algorithm, referred to as counterfactual multi-agent (COMA) policy gradients [12]. By employing deep neural networks, Foerster et al. enabled crediting in large or continuous state spaces; however, their counterfactual baseline required enumeration over all actions and was thus restricted to problems with small, discrete action spaces.

3 Problem Statement

The problems presented in this work can be modeled as decentralized partially observable Markov decision processes (Dec-POMDPs), which are defined by the tuple $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, T, R \rangle$. $\mathcal{I} = \{1, \dots, n\}$ represents a finite set of agents. $\mathcal{S}$ is the joint state space of all agents (finite or infinite). Assuming states are described in vector form, let $s_t \in \mathcal{S}$ be a specific state of the system. $\mathcal{A}_i(s)$ is the action space of the $i$th agent in joint state $s$. The vector $a_t = [a_{1,t}, \dots, a_{n,t}]$ represents a joint action at time $t$, where $a_{i,t} \in \mathcal{A}_i(s_t)$. $\mathcal{O}_i(s)$ is the set of observations for the $i$th agent in joint state $s$. The vector $o_t = [o_{1,t}, \dots, o_{n,t}]$ represents a joint observation at time $t$, where $o_{i,t} \in \mathcal{O}_i(s_t)$. $T(s_{t+1} \mid s_t, a_t)$ is the joint transition function that represents a probability density of arriving in state $s_{t+1}$ given that joint action $a_t$ was taken in state $s_t$. $R(s_t, a_t)$ is the joint reward function that specifies an immediate reward for taking joint action $a_t$ while in state $s_t$.
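To make the notation concrete, the following is a minimal sketch of how the Dec-POMDP tuple above might be represented in code. The container name, field layout, and callable signatures are illustrative assumptions, not part of any particular library or of the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class DecPOMDP:
    """Illustrative container for the Dec-POMDP tuple <I, S, {A_i}, {O_i}, T, R>."""
    n_agents: int                                                   # |I|, the finite set of agents
    state_dim: int                                                  # dimensionality of the joint state vector s
    action_space: Callable[[np.ndarray, int], Sequence]             # A_i(s): available actions of agent i in state s
    observation_fn: Callable[[np.ndarray, int], np.ndarray]         # O_i(s): local observation of agent i in state s
    transition_fn: Callable[[np.ndarray, np.ndarray], np.ndarray]   # samples s' ~ T(. | s, a)
    reward_fn: Callable[[np.ndarray, np.ndarray], float]            # joint reward R(s, a)

    def step(self, state: np.ndarray, joint_action: np.ndarray):
        """Sample one joint transition and the shared (joint) reward."""
        next_state = self.transition_fn(state, joint_action)
        reward = self.reward_fn(state, joint_action)
        observations = [self.observation_fn(next_state, i) for i in range(self.n_agents)]
        return next_state, observations, reward
```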

3.1 System Health and Action Prognosis

Here, we introduce a concept we refer to simply as system health, though there are alternative definitions used in the field of prognostic decision making (PDM) [17, 18]. If we represent the current state of the system with vector $s$, then the system health, $h$, constitutes a subvector of $s$. Without loss of generality we can define the state vector as $s = [x, h]$, where $x$ is the non-health component of the state. Each element $h_i$ of the health vector corresponds to the health of an individual agent and lies in the interval $[0, 1]$, where 1 represents full health and 0 represents a fully degraded agent. In this paper, we choose to define health as a monotonically decreasing quantity that satisfies the following properties associated with reduction of system health:

Property 1 (Monotonically decreasing health). Let $s_t$ represent any state vector where the health of agent $i$ equals $h_{i,t}$; thus we can define the non-increasing nature of the health of the system as follows:

$$h_{i,t+1} \leq h_{i,t} \quad \forall i \in \mathcal{I},\ \forall t \tag{1}$$

Property 2 (Constriction of reachable set in state space). Let $\mathcal{R}(s)$ denote the reachable set of joint state $s$, i.e. the set of states that can be reached from $s$ under some sequence of joint actions. For states $s = [x, h]$ and $s' = [x, h']$ with $h' \leq h$ elementwise, the constriction of the reachable set can be described as

$$\mathcal{R}(s') \subseteq \mathcal{R}(s) \tag{2}$$

Property 3 (Constriction of available actions in action space). Let $\mathcal{A}_i(s)$ represent the available action set for agent $i$ when the system is in state $s$. Therefore the constriction of the available action set can be described as

$$\mathcal{A}_i(s') \subseteq \mathcal{A}_i(s) \tag{3}$$

Property 4 (Constriction of observable set in observation space). Let $\mathcal{O}_i(s)$ represent the set of possible observations for agent $i$ when the system is in state $s$. Therefore the constriction of the observable set can be described as

$$\mathcal{O}_i(s') \subseteq \mathcal{O}_i(s) \tag{4}$$

where $s'$ again differs from $s$ only by an elementwise reduction in health.

Note that a reduction in health may also affect the reward function, for example through increased variance or a change in the expected value, but we leave these reward effects to be specified on a per-problem basis and make no general assertions about the reward signal as a function of system health.
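As a minimal illustration of Properties 1 and 3, the sketch below maintains a per-agent health vector that can only decrease and shrinks each agent's available action set as its health degrades. The specific constriction rule (scaling a velocity-command bound by health) is an assumption made for illustration only, not a requirement of the framework.

```python
import numpy as np

def degrade_health(health: np.ndarray, damage: np.ndarray) -> np.ndarray:
    """Property 1: health is non-increasing and clipped to [0, 1]."""
    return np.clip(health - np.abs(damage), 0.0, 1.0)

def available_action_bounds(max_speed: float, health_i: float) -> tuple:
    """Property 3 (illustrative): the action set of agent i shrinks with health.
    Here the admissible velocity-command magnitude scales with health, so a
    zero-health agent can only execute the zero action."""
    return (-max_speed * health_i, max_speed * health_i)

# Example: two agents, the second takes damage from a hazard.
health = np.array([1.0, 1.0])
health = degrade_health(health, damage=np.array([0.0, 0.6]))
print(available_action_bounds(max_speed=1.0, health_i=health[1]))  # (-0.4, 0.4)
```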

Given a joint action $a_t$ at joint state $s_t$, we can define the expected reduction in health of the system as the action's health risk, also referred to as the action prognosis. Let the action prognosis, $\rho(s_t, a_t)$, be a vector containing the health risk of a joint state-action pair:

$$\rho_i(s_t, a_t) = -\mathbb{E}\left[\, h_{i,t+1} - h_{i,t} \mid s_t, a_t \,\right] \tag{5}$$

Note that an action prognosis is a non-negative quantity since $\mathbb{E}[h_{i,t+1} - h_{i,t}]$ is guaranteed to be non-positive due to Property 1. Furthermore, for each joint state $s_t$, there exists some individual action $\bar{a}_{i,t}$ that represents the maximum health risk to agent $i$:

$$\bar{a}_{i,t} = \operatorname*{arg\,max}_{a_i \in \mathcal{A}_i(s_t)} \rho_i\big(s_t, [a_{1,t}, \dots, a_i, \dots, a_{n,t}]\big) \tag{6}$$
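Where the transition model can be sampled, the action prognosis in Equation 5 and the maximum-risk action in Equation 6 can be estimated by Monte Carlo sampling of single transitions, as in the sketch below. The simulator interface (`sample_next_health`), the dictionary state layout, and the use of a finite candidate action set are assumptions of this illustration.

```python
import numpy as np

def action_prognosis(sample_next_health, state, joint_action, agent_i, n_samples=100):
    """Estimate rho_i(s, a): expected one-step reduction in agent i's health (Eq. 5)."""
    h_now = state["health"][agent_i]
    drops = [h_now - sample_next_health(state, joint_action)[agent_i]
             for _ in range(n_samples)]
    return max(float(np.mean(drops)), 0.0)   # non-negative by Property 1

def max_risk_action(sample_next_health, state, joint_action, agent_i, candidate_actions):
    """Pick the individual action with the largest expected health reduction (Eq. 6)."""
    risks = []
    for a_i in candidate_actions:
        a = list(joint_action)
        a[agent_i] = a_i                      # swap in the candidate action for agent i
        risks.append(action_prognosis(sample_next_health, state, a, agent_i))
    return candidate_actions[int(np.argmax(risks))]
```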

4 Approach

This section uses the properties of health and action prognosis to derive a multi-agent reinforcement learning algorithm to train systems operating in hazardous and adversarial environments. We choose to restrict our scope to policy gradient methods because of their scalability to large and continuous state and action spaces [19]. We develop a health-informed policy gradient in Section 4.1 and use it to propose a new multi-agent PPO variant in Section 4.2.

The goal of a cooperative multi-agent learning problem is to develop a joint policy $\pi_\theta$, parameterized by joint parameters $\theta$, that maximizes the discounted joint returns, $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, where $\gamma \in [0, 1]$ is the discount factor, $r_k$ is the empirical joint reward, and $T$ is the final time step in an episode or receding horizon. In the context of Dec-POMDPs with no centralized control, the joint policy is composed of a set of local policies that are parameterized by $\theta_i$ and map an agent's local observations to its actions at each time step.

In general each agent may be learning its own individual policy $\pi_{\theta_i}$; however, this can lead to a non-stationary environment from the perspective of any one agent, which can confound the learning process [15]. Instead, for this paper, we assume that agents execute identical copies of the same policy $\pi_\theta$ in a decentralized fashion, referred to as parameter sharing [10, 20]. We adopt a centralized learning, decentralized execution architecture whereby training data can be centralized between training episodes even if no such centralization of information is possible during execution, a common approach in the multi-agent RL literature [10, 11, 12].

Although parameter sharing mitigates some of the problems of non-stationarity, it does not resolve the problem of multi-agent credit assignment. When the reward is a function of the joint state and joint action of the system, i.e. $r_t = R(s_t, a_t)$, all agents receive the same reward signal at each time step. This obfuscates which agents are most responsible for the overall performance of the group and can significantly inhibit multi-agent learning [13, 12]. We propose a novel counterfactual learning method that leverages health information in order to assign credit in multi-agent systems and improve learning performance.

4.1 Health-Informed Multi-Agent Policy Gradients

To develop a policy gradient approach for health-based multi-agent systems, we begin with the policy gradient theorem [21, Chapter 13]:

$$\nabla_\theta J(\theta) \propto \sum_{s} \mu(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) \tag{7}$$

Equation 7 gives an analytical expression for the gradient of the objective function $J(\theta)$ with respect to policy parameters $\theta$, where $Q^{\pi_\theta}(s, a)$ is the true state-action value and $\mu(s)$ refers to the on-policy state distribution. To develop a practical algorithm, we need a sampled expression whose expected value is equal, or approximately equal, to Equation 7. A common form of the policy gradient expression used in sampling-based algorithms is given as [21, Chapter 13]:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{8}$$

where $\Psi_t$ can take on a range of forms that affect the bias and variance of the learning process [22]. Typically $\Psi_t$ takes the form of the return or discounted return $G_t$ (i.e. the REINFORCE algorithm [21, Chapter 13]); the return baselined on the state value function, $G_t - V^\pi(s_t)$ (i.e. REINFORCE with baseline [21, Chapter 13]); the state-action value function $Q^\pi(s_t, a_t)$; the advantage function $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$; or the generalized advantage function [22, 14]. The joint state value function and joint state-action value function of policy $\pi_\theta$ are defined as

$$V^\pi(s_t) = \mathbb{E}_{\pi}\!\left[ \sum_{k=t}^{T} \gamma^{k-t} r_k \,\middle|\, s_t \right] \tag{9}$$

$$Q^\pi(s_t, a_t) = \mathbb{E}_{\pi}\!\left[ \sum_{k=t}^{T} \gamma^{k-t} r_k \,\middle|\, s_t, a_t \right] \tag{10}$$
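For concreteness, the sketch below computes two of the common choices of $\Psi$ listed above from a single rollout: the return baselined on the state value function, and a generalized advantage estimate in the style of [22]. It assumes value estimates have already been produced by some critic; the function names and default coefficients are illustrative.

```python
import numpy as np

def baselined_returns(rewards, values, gamma=0.99):
    """Psi_t = G_t - V(s_t): discounted return minus a state-value baseline.
    rewards and values are arrays of equal length over one rollout."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values)

def generalized_advantage(rewards, values, gamma=0.99, lam=0.95):
    """Psi_t = GAE(gamma, lambda); values has length len(rewards) + 1
    so that the final entry bootstraps the value beyond the rollout."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```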

A limitation of Equation 8 is that it is derived under the assumption of a single-agent or centrally controlled multi-agent system. Let us recast Equation 8 in a decentralized context where each agent $i$ selects individual actions, $a_{i,t}$, based on local observations, $o_{i,t}$, and produces a separate trajectory that can be used to compute policy gradients:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} \sum_{i=1}^{n} \Psi_{i,t}\, \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t}) \right] \tag{11}$$

If all agents have the same policy parameters, receive identical rewards throughout an episode, and employ the same $\Psi$ function, then Equation 11 weights all agents' actions identically at each time step. This uniformity in policy gradients is problematic because it results in all actions at a given time step being promoted equally during the next training cycle, i.e. the multi-agent credit assignment problem.

To overcome this problem, we use the $\Psi_{i,t}$ term to distinguish gradients between different agents' actions at the same time step. One option is to define $\Psi_{i,t}$ from an observation-action value function estimated from agent $i$'s local observations, referred to as a local critic. Since all agents are expected to receive distinct observations at each time step, $\Psi_{i,t}$ is expected to be distinct for each agent at a given time step. However, this approach relies on making value approximations based on limited information, which can significantly impact learning, as we demonstrate in Section 5. If we instead leverage our prior assumption of centralized learning, then we can make direct use of the central value functions in Equations 9 and 10; however, this still does not resolve the multi-agent credit assignment problem.

We can use the concepts of health and action prognosis to address the credit assignment problem. Our technique stems from the concepts of counterfactual baselines [12] and difference rewards [23]. The idea is that credit is assigned to agent $i$ at a given time step by comparing the joint return from having agent $i$ at its present state and chosen action with the expected return had agent $i$ been at its minimum health or chosen the action of maximum health risk (Equation 6).

In order to replace the summation over states and actions in Equation 7 with the expectation in Equation 8, we assume that policy $\pi_\theta$ is followed and that this policy produces states and actions in proportion to the summation terms. However, this assumption is potentially invalid for our purposes due to Properties 2 and 3. If a reduction in system health constricts the available actions and reachable states at a given state, and this constriction is not encoded within the policy, then the implied assumption in Equation 8 is invalid. This would occur, for example, if the policy selects an action from a low-health state that the physical agent is incapable of executing.

To overcome this inconsistency, we propose a simple heuristic: the policy gradient for agent $i$ at time $t$ is multiplied by the true health of agent $i$ at time $t$, i.e. $h_{i,t}$. The justification for this heuristic is that, as an agent's health degrades and its available action set is constricted, it becomes less likely that the action selected by the policy aligns with the action executed by the agent. By attenuating the policy gradient by the health of the agent, policy learning occurs more slowly on data generated in low-health states, which reduces the effect of mismatch between chosen and executed actions. Section 5 investigates the case where the health state is binary for each agent and a zero-health state shrinks the action set until only a zero-vector action is executable by the agent.

We propose the following two minimum-health and worst-case-prognosis baselines:

$$\Psi^{h_{\min}}_{i,t} = h_{i,t}\left( G_t - V^{\pi}(\tilde{s}_{i,t}) \right) \tag{12}$$

$$\Psi^{\rho_{\max}}_{i,t} = h_{i,t}\left( G_t - Q^{\pi}(s_t, \tilde{a}_{i,t}) \right) \tag{13}$$

where $\tilde{s}_{i,t}$ represents the true joint state of the system at time $t$, except that the health of agent $i$ is replaced with minimum health (typically 0). Likewise, $\tilde{a}_{i,t}$ represents the true joint action of the system at time $t$, except that the action of agent $i$ is replaced with its maximum-risk action, i.e. the action with the largest expected decrease in health.
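A minimal sketch of the minimum-health baseline in Equation 12 follows, assuming a centralized critic `V` over the joint state vector and a known index for each agent's health within that vector; both the interface and the state layout are assumptions of this illustration rather than a prescribed implementation.

```python
import numpy as np

def min_health_credit(V, joint_state, health_idx, return_t, agent_i, min_health=0.0):
    """Psi_{i,t} = h_{i,t} * (G_t - V(s_t with agent i's health set to minimum)).  (Eq. 12)

    V: callable mapping a joint state vector to a scalar value estimate.
    health_idx: list mapping agent index -> position of that agent's health in joint_state.
    return_t: empirical discounted joint return G_t from time t.
    """
    h_i = joint_state[health_idx[agent_i]]
    counterfactual_state = joint_state.copy()
    counterfactual_state[health_idx[agent_i]] = min_health   # "what if agent i were fully degraded?"
    baseline = V(counterfactual_state)
    return h_i * (return_t - baseline)
```

Because the counterfactual is formed by perturbing the state rather than the action, this crediting term only requires a state-value critic, which is what makes it agnostic to the action space.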

The health-informed baselines are motivated by the concept of Wonderful Life Utility (WLU) [13] but adapted for deep-RL policy gradients. Furthermore, Equation 12 provides a counterfactual baseline that is completely agnostic to the action space, a property not seen in prior work [13, 23, 12]. In contrast, other modernizations of WLU still require maintaining explicit counts of discrete actions and state visitations and are not well suited to continuous domains and deep-RL [23]. Modern implementations of Aristocrat Utility (AU) such as COMA [12] require enumeration over all possible actions or computationally expensive Monte Carlo analysis at each time step. The health-informed baselines suffer no such limitations.

For the remainder of the paper, we focus on the minimum-health baseline, leaving more detailed analysis and testing of the worst-case-prognosis baseline to future work, though we hypothesize that the two baselines would produce similar performance.

4.2 Health-Informed Multi-Agent Proximal Policy Optimization

While the health-informed credit assignment technique described in Section 4.1 is applicable to any reinforcement learning algorithm that uses value functions, we choose to demonstrate the application of these baselines within a multi-agent variant of proximal policy optimization (PPO) [14]. PPO is chosen because it has been shown to work well with continuous action spaces [19] as well as multi-agent environments [9].

We apply the health-informed counterfactual baselines in Equations 12 and 13 to PPO's surrogate objective function and formulate the clipped surrogate objective

$$L^{CLIP}(\theta) = \mathbb{E}_{i,t}\!\left[ \min\!\Big( \eta_{i,t}(\theta)\, \Psi_{i,t},\ \operatorname{clip}\!\big(\eta_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \Psi_{i,t} \Big) \right], \qquad \eta_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid o_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid o_{i,t})} \tag{14}$$

As in the original PPO paper, we also train the value function network and augment the objective with an entropy bonus to encourage exploration [14]. We train a centralized critic $V_\phi$ using TD($\lambda$). In particular, we compute the returns $\hat{G}_t$ for the rollouts from time step $t$ and train the parameters $\phi$ using gradient descent on the following loss:

$$L^{V}(\phi) = \mathbb{E}_t\!\left[ \big( V_\phi(s_t) - \hat{G}_t \big)^2 \right] \tag{15}$$
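The sketch below combines the clipped surrogate objective (Equation 14) and the critic regression loss (Equation 15) into a single minibatch update. It is a generic PyTorch rendering of standard PPO [14] with the per-agent credit term $\Psi_{i,t}$ substituted for the advantage, not the authors' exact training code; the tensor shapes, the distribution-returning policy, and the coefficient values are illustrative assumptions.

```python
import torch

def mappo_update(policy, critic, actor_opt, critic_opt,
                 obs, actions, old_log_probs, joint_states, returns, psi,
                 clip_eps=0.2, entropy_coef=0.01):
    """One minibatch update of the shared policy and the centralized critic.

    obs, actions, old_log_probs, psi: per-agent, per-timestep tensors (flattened).
    joint_states, returns: per-timestep tensors for the centralized critic.
    psi is the credit term, e.g. the minimum-health counterfactual baseline (Eq. 12).
    policy(obs) is assumed to return a torch.distributions object.
    """
    dist = policy(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective (Eq. 14), maximized, hence negated for gradient
    # descent, with an entropy bonus to encourage exploration.
    surrogate = torch.min(ratio * psi,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * psi)
    policy_loss = -(surrogate.mean() + entropy_coef * dist.entropy().mean())

    # Centralized critic regression toward empirical returns (Eq. 15).
    value_loss = (critic(joint_states).squeeze(-1) - returns).pow(2).mean()

    # Separate optimizers allow distinct actor and critic learning rates.
    actor_opt.zero_grad()
    policy_loss.backward()
    actor_opt.step()

    critic_opt.zero_grad()
    value_loss.backward()
    critic_opt.step()
```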

5 Experiments

This section presents experiments to demonstrate health-informed credit assignment in multi-agent learning. The experiments compare multi-agent deep deterministic policy gradients (MADDPG [11]) with three multi-agent variants of proximal policy optimization (MAPPO), referred to as local critic, central critic, and min-health crediting. (See Appendix A for a comparison between MADDPG and our proposed variants of MAPPO in an environment taken directly from the original MADDPG paper [11].)

The local-critic MAPPO uses an advantage function based on local observation value estimates, $V_\phi(o_{i,t})$. The central-critic MAPPO uses an advantage function based on joint state value estimates, $V_\phi(s_t)$, enabled by the centralized learning assumption. The minimum-health crediting MAPPO uses the health-informed counterfactual baseline in Equation 12.

The experiments are built within a forked version of OpenAI’s Multi-Agent Particle Environment library that is extended to incorporate the concepts of system health and risk-taking [11, 24]. Since the scenarios used by Lowe et al. were dedicated to small groups of agents and did not incorporate the concept of health, new scenarios were developed. Section 5.1 describes a task whereby multiple agents must link two fixed terminals in the presence of a hazard that can cause individual agents to be terminated; thus emulating the formation of an ad hoc communication network.

5.1 Ad Hoc Communication Networks in Hazardous Environments

Figure 1: The hazardous communication scenario with 16 agents. The larger black dots represent the terminals to be connected. The smaller interconnected dots represent agents. The red dot is the environmental hazard. Red crosses represent agents that have been terminated due to proximity to the hazard.

The hazardous communication scenario is a variant of a classic multi-agent coordination problem that is modified to incorporate the concepts of health and risk-taking. The problem consists of two fixed terminals in a 2D environment and a group of robots that can act as communication relays over short distances and are capable of moving through the 2D space. The objective is for the robotic agents to arrange themselves to form an uninterrupted link between the terminals as quickly as possible and maintain connectivity for as long as possible during an episode. The agents observe the relative state of other agents within some finite radius, i.e. a "neighborhood". There is no central commander directing the motion of agents; instead, each agent uses its own learned local policy to map its local observations of the environment and neighboring agents to velocity commands. For each time step in which an uninterrupted link is formed between the two fixed terminals, all agents receive a reward of 1, regardless of whether or not they form one of the links between terminals. If no continuous link exists between the terminals, then all agents receive a reward of 0.

The health-based variant of this classic problem comes in the form of environmental hazards, particularly hazards that cannot be directly observed. For each episode an environmental hazard is randomly placed between the terminals. The location of the hazard is not known or directly observed by the agents; it is hidden. Any agent in proximity to the hazard is at risk of being terminated with some fixed probability at each time step. Terminated agents, i.e. agents with zero health, cannot form communication links with other agents and thus cannot directly contribute to the objective of connecting the fixed terminals. While surviving agents cannot directly observe the location of the hazard, they are able to observe the location of terminated agents; thus terminated agents can act as warning beacons to other agents. Figure 1 shows a snapshot of the hazardous communication scenario where agents have successfully formed a complete link between terminals while using the observation of terminated agents to avoid the unobserved hazard.

From a single agent's perspective, the number of observed agents and environmental features varies over time, thus creating a variable-sized input for the agent's policy. To encode the variable-sized observation as a fixed-sized input to a neural network policy, we use histograms to bin the relative states of neighboring agents [20]. In the experiments presented here, we assume that an agent can observe the relative bearing and distance to all other agents within some neighborhood radius. These observations are then stored as counts in a matrix that represents radial and angular bins of a 2D histogram. The matrix is then flattened and concatenated with other observation features (e.g. environment information of non-varying size) to form the local observation vector $o_{i,t}$.
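The sketch below shows one way the fixed-size histogram encoding described above could be implemented with NumPy: neighbors within the sensing radius are binned by relative distance and bearing, and the flattened counts are concatenated with any fixed-size features. The bin counts and concatenation order are illustrative assumptions, not the exact configuration used in the experiments.

```python
import numpy as np

def encode_neighbors(own_pos, neighbor_positions, radius,
                     n_range_bins=4, n_bearing_bins=8, extra_features=()):
    """2D histogram of neighbor relative positions -> fixed-size observation vector."""
    hist = np.zeros((n_range_bins, n_bearing_bins))
    for pos in neighbor_positions:
        rel = np.asarray(pos, dtype=float) - np.asarray(own_pos, dtype=float)
        dist = np.linalg.norm(rel)
        if dist == 0.0 or dist > radius:
            continue                                  # outside the neighborhood (or self)
        bearing = np.arctan2(rel[1], rel[0])          # relative bearing in [-pi, pi]
        r_bin = min(int(dist / radius * n_range_bins), n_range_bins - 1)
        b_bin = int((bearing + np.pi) / (2 * np.pi) * n_bearing_bins) % n_bearing_bins
        hist[r_bin, b_bin] += 1                       # count one neighbor in this bin
    return np.concatenate([hist.flatten(), np.asarray(extra_features, dtype=float)])
```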

5.2 Results

Figure 2: Learning curves for the hazardous communication problem with 10 agents. Each of the four learning curves is an average over four independent training experiments, with the shaded region representing the minimum and maximum bounds of the four training experiments.

Figure 2 illustrates the improvement in learning for a 10-agent hazardous ad hoc communication problem when multi-agent PPO is used with a centralized critic and health-informed crediting. The green-triangle line represents the learning curve for multi-agent PPO using a local critic, i.e. value estimates are drawn from individual agents' observations. The orange-x curve represents multi-agent PPO learning with a centralized critic but with no credit assignment. (Note that the centralized critic only has access to joint state information during the offline, inter-episode training phases; such global information is not available during execution of the policy.) The blue-dot curve represents learning produced by multi-agent PPO with a centralized critic applying a minimum-health counterfactual baseline (Equation 12). To serve as a comparison with an existing algorithm, the red-cross curve represents learning for MADDPG [11].

Figure 2 shows that, although a local critic may initially learn more quickly than a central critic, the centralized critic overtakes it in the long run and provides superior final performance at the end of training. Most importantly, we see that a centralized critic with health-informed credit assignment outperforms all other algorithms in terms of both initial learning rate and final performance. We also see that MADDPG fails to learn in this environment, most likely due to the relatively large number of agents.

All experiments represented in Figure 2 were run for 50,000 episodes, with each episode consisting of 50 time steps. Training occurred in batches composed of 256 episodes. Training batches were broken into 8 minibatches and run over 8 epochs. For the multi-agent PPO experiments, we used an entropy coefficient of 0.01 to ensure sufficient exploration [14]. Since the actor and critic networks are separate and do not share parameters, learning is not affected by the value function coefficient [14], which is thus ignored.

For all experiments represented in Figure 2, the policy network was a multilayer perceptron (MLP) with 2 fully connected hidden layers, each 64 units wide, and a hyperbolic tangent activation function. For experiments that utilized a local critic, the value function network matched the architecture of the policy network. For experiments that utilized a centralized critic, the value function network had a distinct architecture that was developed empirically: an 8-layer by 64-unit, fully connected MLP with an exponential linear unit (ELU) activation function [25]. We observed that the ELU activation function tended to outperform the rectified linear unit (ReLU) and hyperbolic tangent activation functions for central critic learning.
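For reference, the sketch below reproduces the stated architectures in PyTorch: a 2-hidden-layer, 64-unit tanh MLP for the policy and an 8-hidden-layer, 64-unit ELU MLP for the centralized critic. The output heads (a 2D action mean and a scalar value) and the helper function are assumptions for illustration rather than a copy of the authors' code.

```python
import torch.nn as nn

def mlp(sizes, activation):
    """Fully connected MLP with the given activation between layers (none on the output)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), activation()]
    return nn.Sequential(*layers[:-1])   # drop the activation after the final layer

def make_policy(obs_dim, act_dim=2, hidden=64):
    """Policy: 2 hidden layers x 64 units, tanh activations, 2D action output."""
    return mlp([obs_dim, hidden, hidden, act_dim], nn.Tanh)

def make_central_critic(joint_state_dim, hidden=64, depth=8):
    """Centralized critic: 8 hidden layers x 64 units, ELU activations, scalar value output."""
    sizes = [joint_state_dim] + [hidden] * depth + [1]
    return mlp(sizes, nn.ELU)
```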

In order to ensure that the actor-critic model converges to a local optimum, we must ensure that the update timescales of the actor and critic are sufficiently slow and that the actor is updated sufficiently slower than the critic [12]. We therefore used a smaller learning rate for the actor than for the central critic, with the Adam optimizer [26].

Figure 3: Effect of minimum-health credit assignment on various group sizes (2, 4, 8, and 16 agents) in the hazardous communication scenario. The trend lines represent mean rewards per episode and the shaded regions represent the interquartile range of rewards per episode.

Figure 3 shows the effect of applying minimum-health credit assignment as a function of group size. With a group size of only two agents, the counterfactual credit assignment actually underperforms a purely factual, non-crediting approach. For groups of 4 and 8 agents, the minimum-health crediting technique shows slight but notable improvement in learning performance over non-crediting learning. For a group size of 16 agents, the minimum-health credit assignment significantly outperforms the non-crediting approach. These results align with those shown in Figure 2, which represents a similar test with 10 agents. The curves in Figure 2 are averaged over multiple training experiments, whereas Figure 3 shows single-experiment curves.

6 Conclusions and Future Work

In this paper we have proposed a definition for system health in Dec-POMDPs and shown how it can be used in policy gradient methods to improve multi-agent reinforcement learning. The experiments presented in this paper serve as a proof of concept but are highly simplified compared to real-world problems. We believe that the techniques presented here are well suited for reinforcement learning in multi-agent strategy game environments such as StarCraft II and DOTA 2 [8, 9].

We currently restrict our analysis to systems of homogeneous agents with shared local policies operating in cooperative environments. Using a centralized critic based on the joint state, we can extend this approach to heterogeneous agents with distinct local policies by forming policy groups of homogeneous agents within the heterogeneous team. To extend to partially observable stochastic games, we could apply this technique to each cooperative group of agents within the competitive game.

References

  • Kochenderfer [2015] M. J. Kochenderfer. Decision making under uncertainty: Theory and application. MIT Press, 2015.
  • Bernstein et al. [2002] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
  • Szer and Charpillet [2005] D. Szer and F. Charpillet. An optimal best-first search algorithm for solving infinite horizon Dec-POMDPs. In European Conference on Machine Learning (ECML), pages 389–399. Springer, 2005.
  • Spaan et al. [2011] M. T. Spaan, F. A. Oliehoek, and C. Amato. Scaling up optimal heuristic search in Dec-POMDPs via incremental expansion. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.
  • Oliehoek et al. [2013] F. A. Oliehoek, S. Whiteson, and M. T. Spaan. Approximate solutions for factored Dec-POMDPs with many agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 563–570, 2013.
  • Hansen et al. [2004] E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI Conference on Artificial Intelligence (AAAI), volume 4, pages 709–715, 2004.
  • Boularias and Chaib-draa [2008] A. Boularias and B. Chaib-draa. Exact dynamic programming for decentralized POMDPs with lossless policy compression. In International Conference on Automated Planning and Scheduling (ICAPS), pages 20–27. AAAI Press, 2008.
  • [8] DeepMind. AlphaStar: Mastering the real-time strategy game StarCraft II. URL https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
  • [9] OpenAI. OpenAI Five. URL https://openai.com/five/#how-openai-five-works.
  • Gupta et al. [2017] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 66–83. Springer, 2017.
  • Lowe et al. [2017] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NIPS), pages 6379–6390, 2017.
  • Foerster et al. [2018] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Wolpert and Tumer [2002] D. H. Wolpert and K. Tumer. Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems, pages 355–369. World Scientific, 2002.
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Hernandez-Leal et al. [2017] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.
  • Tumer et al. [2002] K. Tumer, A. K. Agogino, and D. H. Wolpert. Learning sequences of actions in collectives of autonomous agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 378–385, 2002.
  • Saxena et al. [2008] A. Saxena, J. Celaya, E. Balaban, K. Goebel, B. Saha, S. Saha, and M. Schwabacher. Metrics for evaluating performance of prognostic techniques. In IEEE International Conference on Prognostics and Health Management, pages 1–17, 2008.
  • Balaban et al. [2013] E. Balaban, S. Narasimhan, M. Daigle, I. Roychoudhury, A. Sweet, C. Bond, and G. Gorospe. Development of a mobile robot test platform and methods for validation of prognostics-enabled decision making algorithms. International Journal of Prognostics and Health Management, 4(1):87, 2013.
  • Duan et al. [2016] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), pages 1329–1338, 2016.
  • Hüttenrauch et al. [2018] M. Hüttenrauch, A. Šošić, and G. Neumann. Local communication protocols for learning complex swarm behaviors with deep reinforcement learning. In International Conference on Swarm Intelligence, pages 71–83. Springer, 2018.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2 edition, 2018.
  • Schulman et al. [2016] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
  • Nguyen et al. [2018] D. T. Nguyen, A. Kumar, and H. C. Lau. Credit assignment for collective multiagent RL with global rewards. In Advances in Neural Information Processing Systems (NIPS), pages 8102–8113, 2018.
  • [24] R. Lowe. Multi-agent particle environment. URL https://github.com/openai/multiagent-particle-envs.
  • Clevert et al. [2015] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Appendix A MADDPG Benchmarking

To provide a more direct comparison with the MADDPG algorithm, we ran our three variants of multi-agent PPO on the cooperative navigation environment that appears in the original MADDPG paper [11].

Figure 4: Learning curves for the cooperative navigation problem with 3 agents. Each of the four learning curves is an average over four independent training experiments, with the shaded region representing the minimum and maximum bounds of the four training experiments.

Figure 4 shows MADDPG successfully learning an effective policy over a 50,000-episode training experiment. This learning curve is what we expect to see given that this environment was originally developed in conjunction with MADDPG. It is shown, however, that MADDPG underperforms all of the MAPPO implementations except the local-critic MAPPO. The central-critic, non-crediting variant of MAPPO produces the best performance, outperforming the minimum-health baseline crediting variant. This is in contrast to the results for the hazardous communication scenario discussed in Section 5. However, this is to be expected given that the cooperative navigation environment does not encapsulate the concepts of health or risk. We see that this causes the minimum-health crediting technique to display high variance between training experiments. The blue shaded region indicates that the worst-performing training run of minimum-health MAPPO underperforms all other algorithms, while the best-performing training run outperforms all other algorithms.