Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning

05/28/2019 ∙ by Shariq Iqbal, et al. ∙ Google 7

Sparse rewards are one of the most important challenges in reinforcement learning. In the single-agent setting, these challenges have been addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces. Applying these techniques naively to the multi-agent setting results in individual agents exploring independently, without any coordination among themselves. We argue that learning in cooperative multi-agent settings can be accelerated and improved if agents coordinate with respect to what they have explored. In this paper we propose an approach for learning how to dynamically select between different types of intrinsic rewards which consider not just what an individual agent has explored, but all agents, such that the agents can coordinate their exploration and maximize extrinsic returns. Concretely, we formulate the approach as a hierarchical policy where a high-level controller selects among sets of policies trained on different types of intrinsic rewards and the low-level controllers learn the action policies of all agents under these specific rewards. We demonstrate the effectiveness of the proposed approach in a multi-agent learning domain with sparse rewards.




1 Introduction

Recent work in deep reinforcement learning effectively tackles challenging problems including the board game Go (Silver et al., 2016), Atari video games (Mnih et al., 2015), and simulated robotic continuous control (Lillicrap et al., 2016); however, these successful approaches often rely on frequent feedback indicating whether the learning agent is performing well, otherwise known as dense rewards. In many tasks, dense rewards can be difficult to specify without inducing locally optimal but globally sub-optimal behavior. As such, it is frequently desirable to specify only a sparse reward that simply signals whether an agent has attained success or failure on a given task. Despite their desirability, sparse rewards introduce their own set of challenges.

When rewards are sparse, determining which of an agent’s actions led to a reward becomes more difficult, a phenomenon known in reinforcement learning as the credit-assignment problem. Furthermore, if rewards cannot be obtained by random actions, an agent will never receive a signal through which it can begin learning. As such, researchers have devised methods which attempt to provide agents with additional reward signals, known as intrinsic rewards, through which they can learn meaningful behavior (Oudeyer and Kaplan, 2009). A large subset of these works focuses on learning intrinsic rewards that encourage exploration of the state space (Pathak et al., 2017; Houthooft et al., 2016; Burda et al., 2019; Ostrovski et al., 2017; Tang et al., 2017).

Exploring the state space provides a useful inductive bias for many sparse reward problems where the challenge lies in "finding" rewards that may only be obtained in parts of the state space that are hard to reach by random exploration. These exploration-focused approaches frequently formulate their intrinsic rewards to measure the "novelty" of a state, such that agents are rewarded for taking actions that lead to novel states. Our work approaches the question of how to apply novelty-based intrinsic motivation in the cooperative multi-agent setting.

Directly applying novelty-based intrinsic motivation to the multi-agent setting results in agents each exploring their shared state space independently from one another. In many cases, independent exploration may not be the most efficient method for exploration in multi-agent tasks. For example, consider a task where multiple agents are placed in a maze and their goal is to collectively reach all of the landmarks that are spread out through the maze. In this case, it would be inefficient for the agents to explore the same areas redundantly. Instead, it would be much more sensible for agents to "divide-and-conquer," or avoid redundant exploration. Thus, an ideal intrinsic reward for this task would encourage such behavior; however, a simple task can be constructed where the same behavior would not be ideal. For example, take the same maze but change the task such that all agents need to reach the same landmark. Divide-and-conquer would no longer be an optimal exploration strategy since agents only need to find one landmark and they all need to reach the same one. Cooperative multi-agent reinforcement learning can benefit from sharing information about exploration across agents; however, the question of what to do with that shared information depends on the task at hand.

In order to improve exploration in cooperative multi-agent reinforcement learning, we must first identify what kinds of inductive biases can potentially be useful for multi-agent tasks and then devise intrinsic reward functions that incorporate those biases. Then, we must find a way to allow our agents to adapt their exploration to the given task, rather than committing to one type of intrinsic reward function. In this work, we first introduce a candidate set of intrinsic rewards for multi-agent exploration which hold differing properties with regard to how they explore the state space. Subsequently, we present a hierarchical method for simultaneously learning policies trained on different intrinsic rewards and selecting the policies which maximize extrinsic returns. Importantly, all policies are trained using a shared replay buffer, drastically improving the sample efficiency and effectiveness of learning in cooperative multi-agent tasks with sparse rewards.

2 Related Work

Single-Agent Exploration

In order to solve sparse reward problems, researchers have long worked on improving exploration in reinforcement learning. To this end, prior works commonly propose reward bonuses that encourage agents to reach novel states. In tabular domains, reward bonuses based on the inverse state-action count have been shown to be effective in speeding up learning (Strehl and Littman, 2008). In order to scale count-based approaches to large state spaces, many recent works have focused on devising pseudo state counts to use as reward bonuses (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017). Alternatively, some work has focused on defining intrinsic rewards for exploration based on inspiration from psychology (Oudeyer and Kaplan, 2009). These works use various measures of novelty as intrinsic rewards including: transition dynamics prediction error (Pathak et al., 2017), information gain with respect to a learned dynamics model (Houthooft et al., 2016), and random state embedding network distillation error (Burda et al., 2019).

Multi-Agent Reinforcement Learning (MARL)

Multi-agent reinforcement learning introduces several unique challenges that recent work has attempted to address. These challenges include: multi-agent credit assignment in cooperative tasks with shared rewards (Sunehag et al., 2018; Rashid et al., 2018; Foerster et al., 2018), non-stationarity of the environment in the presence of other learning agents (Lowe et al., 2017; Foerster et al., 2018; Iqbal and Sha, 2019), and learning of communication protocols between cooperative agents (Foerster et al., 2016; Sukhbaatar et al., 2016; Jiang and Lu, 2018).

Exploration in MARL

While the fields of exploration in RL and multi-agent RL are popular, relatively little work has been done at the intersection of both. Carmel and Markovitch (1997) consider exploration with respect to opponent strategies in competitive games, and Verbeeck et al. (2005) consider exploration of a large joint action space in a load balancing problem. Jaques et al. (2018) define an intrinsic reward function for multi-agent reinforcement learning that encourages agents to take actions which have the biggest effect on other agents’ behavior, otherwise referred to as "social influence." These works, while important, do not address the problem of exploring a large state space, and whether this exploration can be improved in multi-agent systems. A recent approach to collaborative evolutionary reinforcement learning (Khadka et al., 2019) shares some similarities with our approach. As in our work, the authors devise a method for learning a population of diverse policies, training using a shared replay buffer among all learners to increase sample efficiency, and dynamically selecting the best learner; however, their work is focused on single-agent tasks and does not incorporate any notion of intrinsic rewards. As such, their method is not directly applicable to sparse reward problems in MARL.

3 Background


In this work, we consider the setting of decentralized POMDPs (Oliehoek et al., 2016), which are used to describe cooperative multi-agent tasks. A decentralized POMDP (Dec-POMDP) is defined by a tuple: $(\mathcal{S}, \mathcal{A}, T, \mathcal{O}, O, R, n, \gamma)$. In this setting we have $n$ total agents. $\mathcal{S}$ is the set of global states in the environment, while $\mathcal{O} = \mathcal{O}_1 \times \cdots \times \mathcal{O}_n$ is the set of joint observations across agents and $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$ is the set of possible joint actions across agents. A specific joint action at one time step is denoted as $a = (a_1, \ldots, a_n) \in \mathcal{A}$ and a joint observation is $o = (o_1, \ldots, o_n) \in \mathcal{O}$.

$T$ is the state transition function which defines the probability $P(s' \mid s, a)$, and $O$ is the observation function which defines the probability $P(o \mid a, s')$. $R$ is the reward function which maps the combination of state and joint actions to a single scalar reward. Importantly, this reward is shared between all agents, so Dec-POMDPs always describe cooperative problems. Finally, $\gamma$ is the discount factor which determines how much the agents should favor immediate reward over long-term gain.

Policy Gradients

Policy gradient techniques (Sutton et al., 2000; Williams, 1992) are used to estimate the gradient of the expected returns $J(\pi_\theta)$ with respect to the parameters $\theta$ of a policy, such that we can optimize the policy to maximize expected returns. This gradient estimate takes the following form:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s_t, a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}(s_t, a_t)\right]$$

In the case of the REINFORCE algorithm (Williams, 1992), $\hat{Q}$ is measured by empirical rollouts of the policy in the environment: $\hat{Q}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$. In the case of actor-critic methods (Konda and Tsitsiklis, 2000), $\hat{Q}$ is represented by a learned function approximation (the critic or state-action value function) in an attempt to reduce the variance of the policy (i.e. actor) gradient estimate.
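As an illustration of the rollout-based estimate used by REINFORCE, the discounted return from every time step of a finished episode can be computed in a single backward pass. This is a minimal sketch; the function name `discounted_returns` and the toy reward sequence are assumptions made for the example.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return sum_{t' >= t} gamma^(t' - t) * r_{t'} for every step t of one episode."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Work backwards so each return reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy episode with a single sparse reward at the final step.
episode_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

With a sparse reward, the earlier steps only receive a heavily discounted signal, which is one way to see why credit assignment becomes harder as rewards become rarer.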

Soft Actor-Critic

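Our approach uses Soft Actor-Critic (SAC) (Haarnoja et al., 2018) as its underlying algorithm. SAC incorporates an entropy term in the loss functions for both the actor and critic, in order to encourage exploration and prevent premature convergence to a sub-optimal deterministic policy. The policy gradient with an entropy term is computed as follows:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\left(-\frac{\log \pi_\theta(a \mid s)}{\alpha} + Q_\psi(s, a) - b(s)\right)\right]$$

where $\mathcal{D}$ is a replay buffer that stores past environment transitions, $\psi$ are the parameters of the learned critic, $b(s)$ is a state dependent baseline (e.g. the state value function $V(s)$), and $\alpha$ is a reward scale parameter determining the amount of entropy in an optimal policy. The critic is learned with the following loss function:

$$L_Q(\psi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_\psi(s, a) - y\right)^2\right], \qquad y = r + \gamma\, \mathbb{E}_{a' \sim \pi_\theta(s')}\left[Q_{\bar{\psi}}(s', a') - \frac{\log \pi_\theta(a' \mid s')}{\alpha}\right]$$

where $\bar{\psi}$ are the parameters of the target critic which is an exponential moving average of the past critics, updated as: $\bar{\psi} \leftarrow (1 - \tau)\bar{\psi} + \tau\psi$, where $\tau$ is a hyperparameter that controls the update rate.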
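The two mechanical pieces of the critic update described above, the moving-average target network and the entropy-adjusted bootstrap target, can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, parameters are plain NumPy arrays, and dividing the log-probability by the reward scale is an assumption based on the description of that parameter.

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]

def soft_q_target(reward, gamma, next_q, next_log_prob, alpha):
    """One-sample soft target: r + gamma * (Q_target(s', a') - log pi(a'|s') / alpha).

    Assumption: the entropy term is scaled by 1 / alpha (alpha as a reward scale).
    """
    return reward + gamma * (next_q - next_log_prob / alpha)

target = [np.zeros(2)]
online = [np.ones(2)]
target = soft_update(target, online, tau=0.1)  # target moves 10% of the way to online
y = soft_q_target(reward=1.0, gamma=0.5, next_q=2.0, next_log_prob=-1.0, alpha=2.0)
```

A small update rate keeps the bootstrap target slowly moving, which is the usual motivation for using a separate target critic.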

Centralized Training with Decentralized Execution

A number of works in deep multi-agent reinforcement learning have followed the paradigm of centralized training with decentralized execution (Lowe et al., 2017; Foerster et al., 2018; Sunehag et al., 2018; Rashid et al., 2018; Iqbal and Sha, 2019). This paradigm allows for agents to act in their environments without costly communication while maintaining the advantages of sharing information during training. Since most reinforcement learning applications use simulation for training, communication between agents during the training phase has a relatively low cost.

4 Intrinsic Reward Functions for Multi-Agent Exploration

In this section we present a set of intrinsic reward functions for exploration that incorporate information about what other agents have explored. These rewards assume that each agent (indexed by $i$) has a novelty function $f_i$ that determines how novel an observation is to it, based on its past experience. This function can be an inverse state visit count in discrete domains, or, in large/continuous domains, it can be represented by recent approaches for developing novelty-based intrinsic rewards in complex domains, such as random network distillation (Burda et al., 2019). Note that these rewards assume that all agents share the same observation space so that each agent’s novelty function can operate on all other agents’ observations.

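Reward type    Intrinsic reward for agent $i$
Independent    $f_i(o_i)$
Minimum        $\min_{k \in \{1, \ldots, n\}} f_k(o_i)$
Mean           $\mu(o_i)$
Covering       $f_i(o_i)\,\mathbb{1}[f_i(o_i) > \mu(o_i)]$
Burrowing      $f_i(o_i)\,\mathbb{1}[f_i(o_i) < \mu(o_i)]$
Table 1: Multi-agent intrinsic rewards for agent $i$, with $\mu(o_i) = \frac{1}{n}\sum_{k=1}^{n} f_k(o_i)$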

In Table 1 we define the intrinsic rewards that we use in our experiments. Independent rewards are analogous to single-agent approaches to exploration, which define the intrinsic reward for an agent as the novelty of its own new observation that occurs as a result of an action. The remaining intrinsic reward functions that we consider use the novelty functions of other agents, in addition to the agent’s own, to further shape exploration.

Minimum rewards consider how novel all agents find a specific agent’s observation and reward that agent based on the minimum of these novelties. This method leads to agents only being rewarded for exploring areas that no other agent has explored, which could be advantageous in scenarios where redundancy in exploration is not useful or even harmful. Mean rewards, on the other hand, take the average of all agents’ novelty functions, which results in agents exploring regions based on how novel they are on average to the whole team of agents, rather than just to themselves. Covering rewards reward an agent for exploring areas that it considers more novel than the average agent does. This reward results in agents shifting around the state space, only exploring regions as long as those regions are more novel to them than to their average teammate. Burrowing rewards do the opposite, only rewarding an agent for exploring areas that it considers less novel than the average agent does. As such, they result in agents continuing to explore the same regions until they exhaust all possible intrinsic rewards from those regions, somewhat akin to a depth-first search.
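The five reward types described above can be sketched from a single matrix of novelty values, where entry `[k, i]` holds how novel agent `k` finds agent `i`'s observation. This is an illustrative sketch rather than the authors' code: the function name and the matrix layout are assumptions, and for covering and burrowing it assumes the reward equals the agent's own novelty gated by the comparison with the team average.

```python
import numpy as np

def intrinsic_rewards(novelty):
    """Compute each reward type for every agent from novelty[k, i] = f_k(o_i).

    Returns a dict mapping reward type to a length-n vector (one entry per agent i).
    """
    novelty = np.asarray(novelty, dtype=float)
    own = np.diag(novelty)        # f_i(o_i): each agent's novelty of its own observation
    mean = novelty.mean(axis=0)   # average over all agents k of f_k(o_i)
    return {
        "independent": own,
        "minimum": novelty.min(axis=0),
        "mean": mean,
        "covering": own * (own > mean),   # assumed magnitude: own novelty when above average
        "burrowing": own * (own < mean),  # assumed magnitude: own novelty when below average
    }

# Two agents: row k, column i.
novelty = [[1.0, 0.6],
           [0.5, 0.4]]
rewards = intrinsic_rewards(novelty)
```

In this toy case agent 0 finds its own observation more novel than the team does on average, so it receives a covering reward and no burrowing reward, while the reverse holds for agent 1.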

Note that these are not meant to be a comprehensive set of intrinsic reward functions applicable to all cooperative multi-agent tasks but rather a set of examples of how exploration can be centralized in order to take other agents into account. Our approach, described in the following sections, is agnostic to the type of intrinsic rewards used and, as such, can incorporate other reward types not described here, as long as they can be computed off-policy.

5 Learning Policies for Multi-Agent Exploration

For many tasks, it is impossible to know a priori which intrinsic reward will be the most helpful. Furthermore, the type of reward that is most helpful could change over the course of training if the task is sufficiently complex. In this section we present our approach for simultaneously learning policies trained with different types of intrinsic rewards and dynamically selecting the best one.

Figure 1: Diagram of our model architecture, showing how parameters for actors and critics are shared.

Simultaneous Policy Learning

In order to learn policies for various types of intrinsic rewards in parallel, we utilize a shared replay buffer and off-policy learning to maximize sample efficiency. In other words, we learn policies and value functions for all intrinsic reward types from all collected data, regardless of which policies it was collected by. This parallel learning is made possible by the fact that we can compute our novelty functions off-policy, given we save the observations for each agent after each environment transition. For each type of reward, we learn a different "head" for our policies and critics. In other words, we learn a single network for each agent’s set of policies that shares early layers and branches out into different heads for each reward type. For critics, we learn a single network across all agents that shares early layers and branches out into separate heads for each agent and reward type. Importantly, we learn separate heads for intrinsic and extrinsic rewards, as in Burda et al. (2019), the reasons for which will become clear in the next section. We provide a diagram of our model architecture in Figure 1.
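The shared-base, multi-head structure of one agent's policy network can be sketched with plain NumPy. Everything here is illustrative: the class name, layer sizes, `tanh` activation, and random initialization are assumptions, and a real implementation would use a deep learning framework with trainable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadPolicy:
    """One agent's policies: a shared base layer feeding one softmax head per reward type."""

    def __init__(self, obs_dim, hidden_dim, n_actions, n_heads):
        self.w_base = rng.normal(scale=0.1, size=(obs_dim, hidden_dim))
        # One output matrix per intrinsic reward type.
        self.w_heads = rng.normal(scale=0.1, size=(n_heads, hidden_dim, n_actions))

    def action_probs(self, obs):
        """Return an (n_heads, n_actions) array with one action distribution per head."""
        hidden = np.tanh(obs @ self.w_base)                     # features shared by all heads
        logits = np.einsum("h,khd->kd", hidden, self.w_heads)   # per-head logits
        logits -= logits.max(axis=1, keepdims=True)             # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)

policy = MultiHeadPolicy(obs_dim=8, hidden_dim=16, n_actions=5, n_heads=5)
probs = policy.action_probs(np.ones(8))
```

Because every head reads the same shared features, a single forward pass through the base yields action distributions for all reward types, which is what makes learning them in parallel from one replay buffer inexpensive.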

We index agents by $i \in \{1, \ldots, n\}$ and intrinsic reward types by $j \in \{1, \ldots, m\}$, where $m$ is the total number of intrinsic reward types that we are considering. The policy for agent $i$, trained using reward $j$ (in addition to extrinsic rewards), is represented by $\pi_i^j$. It takes as input agent $i$’s observation, $o_i$, and outputs an action $a_i$. The parameters of this policy are $\Theta_i^j = \{\theta_i^{\text{share}}, \theta_i^j\}$, where $\theta_i^{\text{share}}$ is a shared base/input (for agent $i$) in a neural network and $\theta_i^j$ is a head/output specific to this reward type.

Additionally, we learn a head selector policy $\Pi$. This high-level policy aims to select, at the beginning of each episode, the action policy head (across all agents) which maximizes extrinsic returns. The selector policy $\Pi$ is parametrized by a vector, $\phi$, that contains an entry for every reward type. The probability of sampling head $j$ is: $\Pi(j) \propto \exp(\phi[j])$. Unlike the action policies, this high-level policy does not take any inputs. We simply want to learn which set of policies trained on the individual intrinsic reward functions has the highest expected extrinsic returns from the beginning of the episode. The procedure for learning this selector policy is detailed in the next section.
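Selecting a head at the start of an episode can be sketched as sampling from a categorical distribution over reward types. The sketch assumes the sampling probability is a softmax over the entries of the selector's parameter vector; the function names are hypothetical.

```python
import numpy as np

def selector_probs(phi):
    """Softmax over the selector parameters: probability of head j is proportional to exp(phi[j])."""
    phi = np.asarray(phi, dtype=float)
    exp = np.exp(phi - phi.max())   # subtract the max for numerical stability
    return exp / exp.sum()

def sample_head(phi, rng):
    """Pick the policy head that all agents will use for the coming episode."""
    return int(rng.choice(len(phi), p=selector_probs(phi)))

phi = np.zeros(5)                   # five reward types, initially equally likely
head = sample_head(phi, np.random.default_rng(0))
```

Since the selector takes no inputs, its entire state is this one vector, and the same sampled head index is shared by every agent for the whole episode.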

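The extrinsic critic for policy head $\pi_i^j$ is represented by $Q^{\text{ex}}_{i,j}$. It takes as input the global state $s$ and the actions of all other agents $a_{\setminus i}$, and it outputs the expected returns under policy $\pi_i^j$ for each possible action that agent $i$ can take, given all other agents’ actions. The parameters of this critic are $\Psi^{\text{ex}}_{i,j} = \{\psi^{\text{share}}, \psi^{\text{ex}}_{i,j}\}$, where $\psi^{\text{share}}$ is a shared base across all agents and reward types. A critic with similar structure exists for predicting the intrinsic returns of actions taken by $\pi_i^j$, represented by $Q^{\text{in}}_{i,j}$, which uses the parameters: $\Psi^{\text{in}}_{i,j} = \{\psi^{\text{share}}, \psi^{\text{in}}_{i,j}\}$. Note that the intrinsic critics share the same base parameters $\psi^{\text{share}}$.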

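We remove the symbols representing the parameters of the policies ($\Theta$) and the critics ($\Psi$) for readability. In our notation we use the absence of a subscript or superscript to refer to a group. For example, $\pi^j$ refers to all agents’ policies trained on intrinsic reward $j$. We train our critics with the following loss function, adapted from soft actor-critic:

$$L_Q(\Psi) = \mathbb{E}_{(s, o, a, r, s', o') \sim \mathcal{D}}\left[\sum_{j=1}^{m}\sum_{i=1}^{n}\left(Q^{\text{ex}}_{i,j}(s, a) - y^{\text{ex}}_{i,j}\right)^2 + \left(Q^{\text{in}}_{i,j}(s, a) - y^{\text{in}}_{i,j}\right)^2\right]$$

$$y^{\text{ex}}_{i,j} = r^{\text{ex}}(s, a) + \gamma\, \mathbb{E}_{a' \sim \bar{\pi}^j(o')}\left[\bar{Q}^{\text{ex}}_{i,j}(s', a') - \frac{\log \bar{\pi}_i^j(a_i' \mid o_i')}{\alpha}\right]$$

$$y^{\text{in}}_{i,j} = r^{\text{in}}_{i,j}(o_i') + \gamma\, \mathbb{E}_{a' \sim \bar{\pi}^j(o')}\left[\bar{Q}^{\text{in}}_{i,j}(s', a')\right]$$

where $\bar{Q}$ refers to the target Q-function, an exponential weighted average of the past Q-functions, used for stability, and $\bar{\pi}$ are similarly updated target policies. The intrinsic rewards laid out in Table 1 are represented as a function of the observations that result from the action taken, $r^{\text{in}}_{i,j}(o_i')$, where $j$ specifies the type of reward. Importantly, we can calculate these loss functions for expected intrinsic and extrinsic returns for all policies given a single environment transition, allowing us to learn multiple policies for each agent in parallel. We train each policy head with the following gradient:

$$\nabla_{\Theta_i^j} J(\pi_i^j) = \mathbb{E}_{(s, o) \sim \mathcal{D},\, a \sim \pi^j}\left[\nabla_{\Theta_i^j} \log \pi_i^j(a_i \mid o_i)\left(-\frac{\log \pi_i^j(a_i \mid o_i)}{\alpha} + A_i^j(s, a)\right)\right]$$

$$A_i^j(s, a) = Q^{\text{ex}}_{i,j}(s, a) + \beta\, Q^{\text{in}}_{i,j}(s, a) - V_i^j(s, a_{\setminus i})$$

$$V_i^j(s, a_{\setminus i}) = \sum_{a_i' \in \mathcal{A}_i} \pi_i^j(a_i' \mid o_i)\left(Q^{\text{ex}}_{i,j}(s, \{a_i', a_{\setminus i}\}) + \beta\, Q^{\text{in}}_{i,j}(s, \{a_i', a_{\setminus i}\})\right)$$

where $\beta$ is a scalar that determines the weight of the intrinsic rewards, relative to extrinsic rewards, and $A_i^j$ is a multi-agent advantage function (Foerster et al., 2018; Iqbal and Sha, 2019), used for helping with multi-agent credit assignment.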
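One way to read the multi-agent advantage is as a counterfactual comparison in the style of Foerster et al. (2018): the combined value of the action an agent actually took, minus the expected combined value over that agent's own alternatives while the other agents' actions are held fixed. The sketch below assumes this form; the function and argument names are hypothetical.

```python
import numpy as np

def multi_agent_advantage(q_ex, q_in, policy_probs, action, beta):
    """Advantage of one agent's chosen action with the other agents' actions held fixed.

    q_ex, q_in: extrinsic / intrinsic value estimates for each of the agent's
        possible actions, shape (n_actions,)
    policy_probs: the agent's action distribution, shape (n_actions,)
    beta: weight of intrinsic returns relative to extrinsic returns
    """
    q_total = np.asarray(q_ex, dtype=float) + beta * np.asarray(q_in, dtype=float)
    baseline = float(np.dot(policy_probs, q_total))  # expected value under the agent's own policy
    return float(q_total[action] - baseline)

advantage = multi_agent_advantage(
    q_ex=[1.0, 0.0], q_in=[0.0, 1.0], policy_probs=[0.5, 0.5], action=0, beta=0.5
)
```

Because the critic already outputs a value for every action of the agent in question, the baseline needs no additional forward passes.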

Dynamic Policy Selection

Now that we have established a method for simultaneously learning policies using different intrinsic reward types, we must devise a means of selecting between these policies. In order to select policies to use for environment rollouts, we must consider which policies maximize extrinsic returns, while taking into account the fact that there may still be "unknown unknowns," or regions that the agents have not seen yet where they may be able to further increase their extrinsic returns. As such, we must learn a meta-policy that, at the beginning of each episode, selects between the different sets of policies trained on different intrinsic rewards and maximizes extrinsic returns without collapsing to a single set of policies too early.

The most sensible metric for selecting policies is the expected extrinsic returns given by each policy head. Fortunately, we are learning separate Q-function heads for extrinsic returns, so we can leverage their predictions to learn a policy selector using policy gradients. We use the following gradient to train the policy selector, $\Pi$:

$$\nabla_\phi J(\Pi) = \mathbb{E}_{j \sim \Pi}\left[\nabla_\phi \log \Pi(j)\left(-\frac{\log \Pi(j)}{\eta} + \mathbb{E}_{(s, o) \sim \rho,\, a \sim \pi^j(o)}\left[\frac{1}{n}\sum_{i=1}^{n} Q^{\text{ex}}_{i,j}(s, a)\right]\right)\right]$$

where $\rho$ is the initial state/observation distribution, and $\eta$ is a parameter similar to $\alpha$ for the low-level policies, which promotes entropy in the selector policy. Entropy in the policy selector is important in order to prevent it from collapsing onto a single exploration type that does well at first but does not continue to explore as effectively as others. As such, we can learn a diverse set of behaviors based on various multi-agent intrinsic reward functions and select the one that maximizes performance on the task at hand at any point during training, while continuing to consider other policies that may lead to greater rewards.
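To make the role of the entropy term concrete, the following sketch computes the exact expected policy gradient for a softmax selector, given one extrinsic return estimate per head. It is an illustration under stated assumptions, not the authors' code: the selector is assumed to be a softmax over its parameter vector, the per-head return estimates are supplied as plain numbers (in the paper they come from the extrinsic critic heads), and the entropy bonus is assumed to be the negative log-probability divided by the entropy parameter.

```python
import numpy as np

def selector_gradient(phi, head_returns, eta):
    """Expected entropy-regularized policy gradient for a softmax selector."""
    phi = np.asarray(phi, dtype=float)
    probs = np.exp(phi - phi.max())
    probs /= probs.sum()
    # Per-head learning signal: estimated extrinsic return plus an entropy bonus.
    signal = np.asarray(head_returns, dtype=float) - np.log(probs) / eta
    # For a softmax, d log Pi(j) / d phi = onehot(j) - probs, so the expectation
    # over heads j reduces to probs * (signal - expected signal).
    return probs * (signal - np.dot(probs, signal))

phi = np.zeros(3)
for _ in range(200):
    phi += 0.1 * selector_gradient(phi, head_returns=[0.0, 1.0, 2.0], eta=1.0)
```

Under these assumptions the update stops changing the selector once the signal is equal across heads, which happens when each head's probability is proportional to the exponential of its return times the entropy parameter. The selector therefore favors the best head without putting all of its mass on it.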

6 Experiments

We begin by describing our evaluation domain and then present experimental results which demonstrate the effectiveness of our approach. We provide details in the supplementary material and will share code for both the model and environment.

6.1 Gridworld Domain


Figure 2: (Left) Rendering of our evaluation domain. Agents start each episode in the central room and must complete various tasks related to collecting the yellow treasures placed around the map. (Right) Mean number of treasures found per episode on task 1 with 2 agents. Each variant is run with 6 random seeds. For each run, we first calculate the running mean over a 100 episode window, then we plot a shaded 68% confidence interval across runs per variant with a dark line representing the mean. Our approach (Multi-Exploration) is competitive with the best individual intrinsic reward function, without any prior knowledge provided.

In order to test the effectiveness of our approach, we use a multi-agent gridworld domain (pictured in Fig. 2, left). This domain allows us to design environments where the primary challenge lies in a combination of exploring the state space efficiently and coordinating behaviors.

We use a maximum of four agents and encode several tasks related to collecting the yellow treasure, each of which requires a different type of exploration. In task 1, agents must cooperatively collect all treasure on the map in order to complete the task. In task 2, agents must all collect the same treasure; the first agent to collect a treasure during an episode determines the goal for the rest of the agents. In task 3, agents must each collect the specific treasure that is assigned to them. The two agent version of each task uses agents 1-2 and treasure A-B, while the three agent versions use 1-3, A-C, and the four agent versions use 1-4, A-D. Agents receive a negative time penalty at each step, so they are motivated to complete the task as quickly as possible. The only positive reward comes from any agent collecting a treasure that is allowed by the specific task, and rewards are shared between all agents. The optimal strategy in task 1 is for agents to spread out and explore separate portions of the map, while in task 2 they should explore the same areas, and in task 3 independently.

For added challenge, the environment includes two sources of stochasticity: random transitions and black holes. At each step there is a 10% chance of an agent’s action being replaced by a random one. Furthermore, there are several "black holes" placed around the map which have a probability of opening at each time step. This probability changes at each step using a biased random walk such that it moves toward one, until the hole opens and it resets to zero. If an agent steps into a black hole when it is open, they will be sent back to their starting position. The spaces colored as black are holes that are currently open, while the gray spaces are holes that have the possibility of opening at the next step (the darker they are, the higher the probability).
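Both sources of stochasticity can be simulated in a few lines. The sketch below is illustrative only: the class and function names are hypothetical, and the random-walk step sizes and bias (`up_step`, `down_step`, `up_prob`) are placeholder values, not settings taken from the paper.

```python
import random

def apply_action_noise(action, n_actions, rng, noise_prob=0.1):
    """With probability noise_prob, replace the chosen action with a uniformly random one."""
    if rng.random() < noise_prob:
        return rng.randrange(n_actions)
    return action

class BlackHole:
    """Opening probability drifts toward one, then resets to zero once the hole opens."""

    def __init__(self, up_step=0.1, down_step=0.05, up_prob=0.7):
        # Placeholder walk parameters (assumptions): steps up are larger and more likely.
        self.up_step = up_step
        self.down_step = down_step
        self.up_prob = up_prob
        self.open_prob = 0.0

    def step(self, rng):
        """Advance one time step and report whether the hole is open on this step."""
        is_open = rng.random() < self.open_prob
        if is_open:
            self.open_prob = 0.0
        else:
            delta = self.up_step if rng.random() < self.up_prob else -self.down_step
            self.open_prob = min(1.0, max(0.0, self.open_prob + delta))
        return is_open

rng = random.Random(0)
hole = BlackHole()
open_steps = sum(hole.step(rng) for _ in range(1000))
```

Exposing the current opening probability to nearby agents, as the observations in this domain do, lets a policy learn to wait until a crossing is likely to be safe.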

Agents observe their global position in $(x, y)$ coordinates (scalars), as well as local information regarding walls in adjacent spaces, the probability of their adjacent spaces opening into a black hole, the relative position of other agents (if they are within 3 spaces), and information about which treasures the agent has already collected in the given episode. The global state is represented by the $(x, y)$ coordinates of all agents, as one-hot encoded vectors for $x$ and $y$ separately, as well as the local information of all agents regarding black holes, walls, and treasures collected. Each agent’s action space consists of the 4 cardinal directions as well as an option to not move, which is helpful in cases where an agent is waiting for a black hole to be safe to cross. The novelty function for each agent, $f_i$, which is used for calculating the intrinsic rewards in Table 1, is defined as $f_i(o_i) = \frac{1}{N^{\zeta}}$, where $N$ is the number of times that the agent has visited its current cell and $\zeta$ is a decay rate selected as a hyperparameter to work well for our purposes.
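A count-based novelty function of this kind can be sketched as a small per-agent counter. The sketch assumes an inverse-count form, one over the visit count raised to the decay rate, and treats unvisited cells as having a count of one so that the value stays finite. The class name, method names, and the decay value used in the example are all hypothetical.

```python
class CountNovelty:
    """Count-based novelty for one agent: 1 / N**zeta, with N the visit count of a cell."""

    def __init__(self, zeta):
        self.zeta = zeta
        self.counts = {}

    def visit(self, cell):
        """Record that this agent has just entered `cell`."""
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def novelty(self, cell):
        """Novelty of `cell` to this agent; unvisited cells are treated as count 1 (assumption)."""
        n = max(self.counts.get(cell, 0), 1)
        return 1.0 / (n ** self.zeta)

agent_novelty = CountNovelty(zeta=0.5)  # illustrative decay rate
for _ in range(4):
    agent_novelty.visit((2, 3))
```

Because the counter can be queried for any cell, one agent's novelty function can score another agent's position, which is what the multi-agent rewards in Table 1 require.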

6.2 Results and Analysis

Each individual training run uses an Intel Core i7-6800K CPU for environment rollouts, and an NVIDIA Titan Xp GPU for training. Figure 2 (right) shows the results of our approach over the course of training on the 2 agent version of task 1, and the final results on each task can be found in Table 2. Training curves for all tasks can be found in the supplement. We train a team of agents using each of the multi-agent intrinsic reward functions defined in Table 1 individually, and then test our dynamic policy selection approach. We find that our approach is competitive with the best performing individual exploration method in all tasks. This performance is exciting since our method receives no prior information about which type of exploration would work best, while each type carries its own inductive bias.

Intrinsic reward type (fixed, or adaptive as in our approach, Multi)
Task  Agents  Independent  Minimum      Mean         Covering     Burrowing    Multi
1     2       0.14 ± 0.05  1.62 ± 0.59  0.00 ± 0.00  0.13 ± 0.12  1.98 ± 0.06  2.00 ± 0.00
1     3       1.16 ± 0.11  1.49 ± 0.76  0.36 ± 0.48  0.00 ± 0.00  2.06 ± 1.05  2.15 ± 1.24
1     4       0.84 ± 0.29  1.78 ± 0.44  0.48 ± 0.53  0.00 ± 0.00  1.90 ± 0.49  1.71 ± 1.09
2     2       2.00 ± 0.00  0.92 ± 0.10  0.32 ± 0.49  1.11 ± 0.99  0.98 ± 0.05  1.63 ± 0.57
3     2       1.39 ± 0.94  0.67 ± 1.03  0.17 ± 0.41  0.29 ± 0.37  0.67 ± 1.03  1.18 ± 0.96
Table 2: Number of treasures found, with standard deviation

In order to better understand how each reward type encourages agents to explore the state space, we visualize their exploration in videos, viewable at the anonymized link below. Independent rewards, as expected, result in agents exploring the whole state space without taking other agents into consideration. As a result, on task 1, which requires coordination between agents to spread out and explore different areas, independent rewards struggle; however, on task 3, where agents receive individualized goals, independent exploration performs well. Task 2 also requires coordination, but the rate of black holes dropping out is lower on that task, making exploration much easier. As a result, independent rewards also perform well on task 2. In future work, we will explore intrinsic reward functions that explicitly reward agents for exploring the same regions, which may surpass all present methods on tasks that require agents to concurrently explore similar regions, such as task 2.

Minimum rewards prevent agents from exploring the same regions redundantly but can lead to situations where one of the agents is the first to explore all regions that provide sparse extrinsic rewards. In these cases, other agents are not aware of the extrinsic rewards and are also not motivated to explore for them since another agent has already done so. Mean rewards also experience failure cases because individual agents can continually explore regions where their rewards are high as long as the other agents do not enter those regions, leading to degenerate behavior where agents split up and exploit small regions where they can continually receive high intrinsic rewards. Mean rewards may benefit from being combined with an intrinsic reward function that encourages all agents to explore similar areas. Covering rewards, as expected, lead to behavior where agents are constantly switching the regions that they explore. While this behavior does not prove to be useful in the tasks we test, since the switching slows down overall exploration progress, it may be useful in scenarios where agents are required to spread out. Finally, burrowing rewards cause agents to each explore different subregions and continue to explore those regions until they exhaust their options. This behavior is particularly effective on task 1, where agents need to spread out and explore the whole map in a mutually exclusive fashion.

We find that our method, Multi-Exploration, is effective on all tasks, coming close to or surpassing the top performing exploration method in every scenario. This flexibility is advantageous, as no other individual exploration method performs well on all task settings. Overall, we can see that multi-agent tasks can, in some cases, benefit from intrinsic rewards that take into account what other agents have explored, but there are various ways to incorporate that information with differing properties. We find that our method is able to reliably attain or nearly attain the performance level of the best reward function without prior knowledge, making it an ideal approach for multi-agent reinforcement learning in sparse reward settings. Importantly, our approach can be further improved simply by devising more intrinsic reward functions that take into account all agents.

7 Conclusion

We propose a set of multi-agent intrinsic reward functions with differing properties, and compare them both qualitatively (through videos) and quantitatively on several multi-agent exploration tasks. Furthermore, we propose a method for learning policies for all intrinsic reward types simultaneously while dynamically selecting the most effective ones. We show that our method is capable of matching the performance of the best performing intrinsic reward type on various tasks while using the same number of samples. In future work we hope to introduce methods for directly learning the multi-agent intrinsic reward functions, rather than selecting from a set.


  • Bellemare et al. [2016] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • Burda et al. [2019] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.
  • Carmel and Markovitch [1997] David Carmel and Shaul Markovitch. Exploration and adaptation in multiagent systems: A model-based approach. In IJCAI (1), pages 606–611, 1997.
  • Foerster et al. [2016] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
  • Foerster et al. [2018] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence, 2018.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018.
  • Houthooft et al. [2016] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
  • Iqbal and Sha [2019] Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Long Beach, CA, USA, 10–15 Jul 2019.
  • Jaques et al. [2018] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A Ortega, D J Strouse, Joel Z Leibo, and Nando de Freitas. Social influence as intrinsic motivation for Multi-Agent deep reinforcement learning. arXiv preprint arXiv:1810.08647, October 2018.
  • Jiang and Lu [2018] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733, 2018.
  • Khadka et al. [2019] Shauharda Khadka, Somdeb Majumdar, Santiago Miret, Evren Tumer, Tarek Nassar, Zach Dwiel, Yinyin Liu, and Kagan Tumer. Collaborative evolutionary reinforcement learning. arXiv preprint arXiv:1905.00976, 2019.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382–6393, 2017.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Oliehoek et al. [2016] Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
  • Ostrovski et al. [2017] Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2721–2730. JMLR.org, 2017.
  • Oudeyer and Kaplan [2009] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.
  • Pathak et al. [2017] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2778–2787, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4295–4304, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Strehl and Littman [2008] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. ISSN 0022-0000. Learning Theory 2005.
  • Sukhbaatar et al. [2016] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
  • Sunehag et al. [2018] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, pages 2085–2087, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems.
  • Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Tang et al. [2017] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
  • Verbeeck et al. [2005] Katja Verbeeck, Ann Nowé, and Karl Tuyls. Coordinated exploration in multi-agent reinforcement learning: an application to load-balancing. In Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems, pages 1105–1106. ACM, 2005.
  • Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.

8 Appendix

8.1 Environment Details

As described in the main text, we use a multi-agent gridworld domain for our experiments. The black holes, which send agents back to their starting positions when stepped into, are an important aspect of the environment because they make exploration more difficult. The probability, , of a black hole opening at each step, , evolves as such: , where for Task 1 and for Tasks 2 and 3.

8.2 Training Details

The training procedure is detailed in Algorithm 1, and all hyperparameters are listed in Table 3. Hyperparameters were selected by tuning one parameter at a time, guided by intuition, on Task 1 with 2 agents; the resulting values were then applied to the remaining settings with minimal changes. Where hyperparameters differ between settings, we note the difference in a footnote.

Initialize environment with agents
Initialize replay buffer,
for  do
    if episode done or  then
        ResetEnv
        Sample policy head
    end if
    Select actions for each agent,
    Send actions to environment and get , ,
    Store transitions for all environments in
    if  then
        for  do
            Sample minibatch,
            UpdateCritic() (Eqs 5-7 in main text)
            UpdatePolicies() (Eqs 8-10 in main text)
            UpdateSelector() (Eqs 11-13 in main text)
            Update target parameters:
        end for
    end if
end for
Algorithm 1: Training Procedure for Multi-Explore w/ Soft Actor-Critic [Haarnoja et al., 2018]
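
The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions, not the authors' implementation: the `env` and `agent` interfaces (`reset`, `step`, `sample_head`, `act`, `update_critic`, `update_policies`, `update_selector`, `update_targets`) are hypothetical names, a single environment stands in for the "all environments" mentioned in the algorithm, the reset condition is assumed to be the episode reaching `max_ep_length`, and the check that the buffer holds at least one minibatch before updating is an added safeguard.

```python
import random
from collections import deque


def run_training(env, agent, total_steps, steps_per_update, n_iters,
                 max_ep_length, batch_size, buffer_size, seed=0):
    """Skeleton of the outer loop of Algorithm 1 (hypothetical interfaces)."""
    rng = random.Random(seed)
    buffer = deque(maxlen=buffer_size)  # replay buffer with a maximum size
    obs, head, ep_len = None, None, 0
    done = True                         # forces a reset on the first step
    n_updates = 0
    for t in range(1, total_steps + 1):
        # Reset when the episode ends or hits the length cap (assumed condition).
        if done or ep_len >= max_ep_length:
            obs = env.reset()
            head = agent.sample_head(rng)  # one policy head per episode
            ep_len = 0
        actions = agent.act(obs, head, rng)
        next_obs, rewards, done = env.step(actions)
        buffer.append((obs, actions, rewards, next_obs, done, head))
        obs = next_obs
        ep_len += 1
        # Every `steps_per_update` environment steps, run `n_iters` updates.
        if t % steps_per_update == 0 and len(buffer) >= batch_size:
            for _ in range(n_iters):
                batch = rng.sample(list(buffer), batch_size)
                agent.update_critic(batch)     # Eqs 5-7 in main text
                agent.update_policies(batch)   # Eqs 8-10 in main text
                agent.update_selector(batch)   # Eqs 11-13 in main text
                agent.update_targets()
                n_updates += 1
    return n_updates
```

With the values in Table 3, this schedule would be called with `total_steps=1_000_000`, `steps_per_update=100`, `n_iters=50`, `max_ep_length=500`, and `batch_size=1024`; the replay buffer size is left as a parameter here.
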

| Name | Description | Value |
|---|---|---|
| layers | input/output dimensions for layers in | (,) |
| layers | input/output dimensions for layers in | (, ) |
| layers | input/output dimensions for layers in | (,) |
| layers | input/output dimensions for layers in | (, |
| layers | input/output dimensions for layers in | (, |
| nonlinearity | type of nonlinearity used in all networks | ReLU |
| lr | learning rate for centralized critic | 0.001 |
| optimizer | optimizer for centralized critic | Adam [Kingma and Ba, 2014] |
| lr | learning rate for decentralized policies | 0.001 |
| optimizer | optimizer for decentralized policies | Adam |
| lr | learning rate for policy selector | 0.04 |
| optimizer | optimizer for policy selector | SGD |
|  | target function update rate | 0.005 |
| bs | batch size | 1024 |
| total steps | number of total environment steps | 1e6 |
| steps per update | number of environment steps between updates | 100 |
| niters | number of iterations per update | 50 |
| max ep length | maximum length of an episode before resetting | 500 |
| penalty | coefficient for weight decay on parameters of Q-function | 0.001 |
| penalty | coefficient on penalty on pre-softmax output of policies | 0.001 |
| penalty | coefficient for weight decay on parameters of policy selector | 0.001 |
|  | maximum size of replay buffer |  |
|  | action policy reward scale | 100 |
|  | selector policy reward scale | 15 for Task 1 w/ 3 agents and Task 3 w/ 2 agents; 25 for all others |
|  | discount factor | 0.99 |
|  | relative weight of intrinsic rewards to extrinsic | 0.1 |
|  | decay rate of count-based rewards | 0.7 |

Table 3: Hyperparameter settings across all runs.
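
For readers reproducing the setup, the legible values in Table 3 could be gathered into a single configuration object, as in the sketch below. The field names are hypothetical and chosen for illustration. The `soft_update` helper assumes that the "target function update rate" is used in the Polyak-averaging form that is standard for Soft Actor-Critic; the exact update expression is not legible in Algorithm 1, so this is an assumption rather than the authors' stated rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    """Values copied from Table 3; field names are illustrative only."""
    critic_lr: float = 0.001             # optimizer: Adam
    policy_lr: float = 0.001             # optimizer: Adam
    selector_lr: float = 0.04            # optimizer: SGD
    target_update_rate: float = 0.005
    batch_size: int = 1024
    total_steps: int = 1_000_000
    steps_per_update: int = 100
    n_iters: int = 50
    max_ep_length: int = 500
    critic_weight_decay: float = 0.001
    policy_logit_penalty: float = 0.001  # penalty on pre-softmax policy outputs
    selector_weight_decay: float = 0.001
    action_reward_scale: float = 100.0
    discount: float = 0.99
    intrinsic_weight: float = 0.1        # intrinsic relative to extrinsic reward
    count_decay: float = 0.7             # decay rate of count-based rewards


def selector_reward_scale(task: int, n_agents: int) -> float:
    """15 for Task 1 w/ 3 agents and Task 3 w/ 2 agents; 25 for all others."""
    return 15.0 if (task, n_agents) in {(1, 3), (3, 2)} else 25.0


def soft_update(target, source, rate):
    """Assumed Polyak form: target <- rate * source + (1 - rate) * target."""
    return [rate * s + (1.0 - rate) * t for t, s in zip(target, source)]
```

Keeping the per-setting exception in a small function such as `selector_reward_scale` makes the one footnoted difference between settings explicit instead of burying it in a default value.
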

8.3 Training Curves

Figure 3: Results on Task 1 with 2 agents.
Figure 4: Results on Task 1 with 3 agents.
Figure 5: Results on Task 1 with 4 agents.
Figure 6: Results on Task 2 with 2 agents.
Figure 7: Results on Task 3 with 2 agents.