Multi-Agent Reinforcement Learning with Emergent Roles

03/18/2020 ∙ by Tonghan Wang, et al.

The role concept provides a useful tool to design and understand complex multi-agent systems, which allows agents with a similar role to share similar behaviors. However, existing role-based methods use prior domain knowledge and predefine role structures and behaviors. In contrast, multi-agent reinforcement learning (MARL) provides flexibility and adaptability, but less efficiency in complex tasks. In this paper, we synergize these two paradigms and propose a role-oriented MARL framework (ROMA). In this framework, roles are emergent, and agents with similar roles tend to share their learning to be specialized on certain sub-tasks. To this end, we construct a stochastic role embedding space by introducing two novel regularizers and conditioning individual policies on roles. Experiments show that our method can learn dynamic, versatile, identifiable, and specialized roles, which help our method push forward the state of the art on the StarCraft II micromanagement benchmark. Demonstrative videos are available at https://sites.google.com/view/romarl/.


1 Introduction

Many real-world systems can be modeled as multi-agent systems (MAS), such as autonomous vehicle teams (cao2012overview), intelligent warehouse systems (nowe2012game), and sensor networks (zhang2011coordinated). Cooperative multi-agent reinforcement learning (MARL) provides a promising approach to developing these systems, as it allows agents to deal with uncertainty and adapt to the dynamics of an environment. In recent years, cooperative MARL has achieved prominent progress, and many deep methods have been proposed (foerster2018counterfactual; sunehag2018value; rashid2018qmix; son2019qtran; vinyals2019grandmaster; wang2020learning; baker2020emergent).

Figure 1: Visualization of our learned role representations at a timestep. The blue agent has the maximum health, while the red ones are dead. The corresponding policy is that agent 6 moves towards enemies to take on more firepower, so that more seriously injured agents are protected. Roles can change adaptively and will aggregate according to responsibilities that are compatible with individual characteristics, such as location, agent type, health, etc.

To achieve scalability, these deep MARL methods adopt a simple mechanism whereby all agents share and learn a single decentralized value or policy network. However, such simple sharing is often not effective for many complex multi-agent tasks. For example, in Adam Smith’s Pin Factory, workers must complete up to eighteen different tasks to create one pin (smith1937wealth). In this case, it is a heavy burden for a single shared policy to represent and learn all the required skills. On the other hand, it is also not necessary for each agent to use a distinct policy network, which leads to high learning complexity, because some agents often perform similar sub-tasks from time to time. The question is how to give full play to agents’ specialization and dynamic sharing in order to improve learning efficiency.

A natural concept that comes to mind is the role. A role is a comprehensive pattern of behavior, often specialized for certain tasks. Agents with similar roles show similar behaviors and thus can share their experiences to improve performance. Role theory has been widely studied in economics, sociology, and organization theory. Researchers have also introduced the concept of role into MAS (becht1999rope; stone1999task; depke2001roles; ferber2003agents; odell2004metamodel; bonjean2014adelfe; Lhaksmana2018role). In these role-based frameworks, the complexity of agent design is reduced via task decomposition: roles are defined and associated with responsibilities made up of a set of sub-tasks, so that the policy search space is effectively decomposed (zhu2008role). However, these works exploit prior domain knowledge to decompose tasks and predefine the responsibilities of each role, which prevents role-based MAS from being dynamic and adaptive in uncertain environments.

To leverage the benefits of both role-based and learning methods, in this paper we propose a role-oriented multi-agent reinforcement learning framework (ROMA). This framework implicitly introduces the role concept into MARL, where it serves as an intermediary to enable agents with similar responsibilities to share their learning. We achieve this by ensuring that agents with similar roles have both similar policies and similar responsibilities. To establish the connection between roles and decentralized policies, ROMA conditions agents’ policies on individual roles, which are parameterized in a stochastic embedding space and determined by agents’ local observations. To associate roles with responsibilities, we introduce two regularizers that make roles identifiable by behaviors and specialized in certain sub-tasks. We show how well-formed role representations can be learned by optimizing tractable variational estimations of the proposed regularizers. In this way, our method synergizes role-based and learning methods while avoiding their individual shortcomings: we provide a flexible and general-purpose mechanism that promotes the emergence and specialization of roles, which in turn provides an adaptive learning-sharing mechanism for efficient policy learning.

We test our method on StarCraft II micromanagement environments (vinyals2017starcraft; samvelyan2019starcraft) (StarCraft II is a trademark of Blizzard Entertainment™). Results show that our method significantly pushes forward the state of the art of MARL algorithms, by virtue of the adaptive policy sharing among agents with similar roles. Visualizations of the role representations in both homogeneous and heterogeneous agent teams demonstrate that the learned roles can adapt automatically in dynamic environments, and that agents with similar responsibilities have similar roles. In addition, we show the emergence and evolution process of roles, highlighting the connection between role-driven sub-task specialization and the improvement of team efficiency in our framework. These results provide a new perspective on understanding and promoting the emergence of cooperation among agents.

2 Background

In our work, we consider a fully cooperative multi-agent task that can be modelled by a Dec-POMDP (oliehoek2016concise) $G=\langle I, S, A, P, R, \Omega, O, n, \gamma\rangle$, where $A$ is the finite action set, $I \equiv \{1, 2, \dots, n\}$ is the finite set of agents, $\gamma \in [0, 1)$ is the discount factor, and $s \in S$ is the true state of the environment. We consider partially observable settings, where agent $i$ only has access to an observation $o_i \in \Omega$ drawn according to the observation function $O(s, i)$. Each agent has a history $\tau_i \in T \equiv (\Omega \times A)^*$. At each timestep, each agent $i$ selects an action $a_i \in A$, forming a joint action $\boldsymbol{a} \in A^n$, which leads to a next state $s'$ according to the transition function $P(s' \mid s, \boldsymbol{a})$ and a shared reward $r = R(s, \boldsymbol{a})$ for each agent. The joint policy $\boldsymbol{\pi}$ induces a joint action-value function $Q^{\boldsymbol{\pi}}_{tot}(s, \boldsymbol{\tau}, \boldsymbol{a}) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \boldsymbol{\tau}_0 = \boldsymbol{\tau}, \boldsymbol{a}_0 = \boldsymbol{a}, \boldsymbol{\pi}\right]$.

Figure 2: Schematics of our approach. The role encoder generates a role embedding distribution, from which a role is sampled and serves as an input to the hyper-net. The hyper-net generates the parameters of the local utility network. Local utilities are fed into a mixing network to get an estimation of the global action value. During the interactions with other agents and the environment, individual trajectories are collected and fed into the trajectory encoder to get posterior estimations of the role distributions. The framework can be trained in an end-to-end fashion.

To learn policies for agents effectively, the paradigm of centralized training with decentralized execution (CTDE) (foerster2016learning; foerster2018counterfactual; wang2020influence) has recently attracted attention in deep MARL as a way to deal with non-stationarity. One promising way to exploit the CTDE paradigm is value function decomposition (sunehag2018value; rashid2018qmix; son2019qtran; wang2020learning), which learns a decentralized utility function for each agent and uses a mixing network to combine these local utilities into a global action value. To achieve learning scalability, existing CTDE methods typically learn a shared local value or policy network for all agents. However, this simple sharing mechanism is often not sufficient for learning complex tasks, where agents may require diverse responsibilities or skills to achieve their goals. In this paper, we aim to develop a novel MARL framework that addresses this challenge by achieving efficient shared learning while allowing agents to learn sufficiently diverse skills.
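To make the value-decomposition idea concrete, here is a minimal sketch of its simplest instance, VDN-style additive mixing with a single shared utility network; this illustrates the baseline sharing mechanism discussed above, not ROMA's mixing network, and the class and variable names are illustrative assumptions.

```python
# Minimal sketch of CTDE value decomposition (VDN-style additive mixing).
# All names here are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class LocalUtility(nn.Module):
    """Per-agent utility Q_i(o_i, a_i) computed from the agent's observation."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # (batch, n_agents, n_actions)

def additive_mixing(local_qs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Combine the chosen local utilities into a global value Q_tot = sum_i Q_i."""
    chosen = torch.gather(local_qs, dim=-1, index=actions.unsqueeze(-1)).squeeze(-1)
    return chosen.sum(dim=-1)  # (batch,)

# Usage: 3 agents sharing one utility network (the simple sharing discussed above).
utility = LocalUtility(obs_dim=16, n_actions=5)
obs = torch.randn(8, 3, 16)              # (batch, n_agents, obs_dim)
actions = torch.randint(0, 5, (8, 3))    # joint action
q_tot = additive_mixing(utility(obs), actions)
```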

3 Method

In this section, we present a novel role-oriented MARL framework (ROMA) that introduces the role concept into MARL and enables adaptive shared learning among agents. ROMA adopts the CTDE paradigm. As shown in Figure 2, it learns local Q-value functions for agents, which are fed into a mixing network to compute a global TD loss for centralized training. During execution, the mixing network is removed and each agent acts according to the local policy derived from its value function. Agents’ value functions or policies depend on their roles, each of which is responsible for a set of similar, automatically identified sub-tasks. To enable efficient and effective shared learning among agents with similar behaviors, ROMA automatically learns roles that are:

i) Identifiable: The role of an agent can be identifiable by its behavior patterns;

ii) Specialized: Agents with similar roles are expected to specialize in similar responsibilities;

iii) Dynamic: An agent’s role can automatically adapt to the dynamics of the environment;

iv) Versatile: The emergent roles are diverse enough to efficiently fulfill a task.

Formally, each agent $i$ has a local utility function (or an individual policy) whose parameters $\theta_i$ are conditioned on its role $\rho_i$. To learn roles with the desired properties, we encode roles in a stochastic embedding space, and the role $\rho_i$ of agent $i$ is drawn from a multivariate Gaussian distribution $\mathcal{N}(\mu_{\rho_i}, \sigma_{\rho_i})$. To enable the dynamic property, ROMA conditions an agent's role on its local observation and uses a trainable neural network $f$ to learn the parameters of the Gaussian distribution of the role:

$$\left(\mu_{\rho_i}, \sigma_{\rho_i}\right) = f\!\left(o_i; \theta_\rho\right), \qquad \rho_i \sim \mathcal{N}\!\left(\mu_{\rho_i}, \sigma_{\rho_i}\right), \qquad (1)$$

where $\theta_\rho$ are the parameters of $f$. The sampled role $\rho_i$ is then fed into a hyper-network $g$, parameterized by $\theta_h$, to generate the parameters of the individual policy, $\theta_i = g(\rho_i; \theta_h)$. We call $f$ the role encoder and $g$ the role decoder. In the next two sub-sections, we describe two regularizers for learning identifiable, versatile, and specialized roles.
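Before turning to the regularizers, here is a minimal sketch of the role encoder $f$ and role decoder (hyper-network) $g$ just described, assuming a 3-dimensional role space (as used later in Sec. 5) and a single role-generated output layer; class names and layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the role encoder f (observation -> Gaussian role distribution) and
# the role decoder g (sampled role -> parameters of one utility layer).
# Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributions import Normal

class RoleEncoder(nn.Module):
    """Maps a local observation o_i to the Gaussian role distribution N(mu, sigma)."""
    def __init__(self, obs_dim: int, role_dim: int = 3, hidden: int = 12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, role_dim)
        self.log_sigma = nn.Linear(hidden, role_dim)

    def forward(self, obs: torch.Tensor) -> Normal:
        h = self.net(obs)
        # Lower-bound the scale for numerical stability (cf. the clipping in Appendix A.2).
        return Normal(self.mu(h), self.log_sigma(h).exp().clamp(min=0.1))

class RoleDecoder(nn.Module):
    """Hyper-network: generates the weights of one utility layer from a sampled role."""
    def __init__(self, role_dim: int, in_dim: int, n_actions: int, hidden: int = 12):
        super().__init__()
        self.w = nn.Sequential(nn.Linear(role_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, in_dim * n_actions))
        self.b = nn.Linear(role_dim, n_actions)
        self.in_dim, self.n_actions = in_dim, n_actions

    def forward(self, role: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        w = self.w(role).view(-1, self.in_dim, self.n_actions)
        return torch.bmm(features.unsqueeze(1), w).squeeze(1) + self.b(role)

# Usage: sample a role (reparameterized) and compute role-conditioned utilities.
enc, dec = RoleEncoder(obs_dim=16), RoleDecoder(role_dim=3, in_dim=64, n_actions=5)
obs, feats = torch.randn(8, 16), torch.randn(8, 64)   # feats: e.g. a GRU hidden state
role = enc(obs).rsample()
q_values = dec(role, feats)                            # (8, 5)
```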

3.1 Identifiable and Versatile Roles

Introducing a latent role embedding and conditioning individual policies on this embedding does not automatically generate versatile and identifiable roles. To ensure these two properties, we can maximize the entropy $H(\rho_i \mid o_i)$ so that roles are diverse given the local observation $o_i$, and, meanwhile, minimize the conditional entropy $H(\rho_i \mid \tau_i, o_i)$ of the role given an experience trajectory $\tau_i$ to make a behavior identifiable by a role. Therefore, we propose a regularizer for ROMA that maximizes $H(\rho_i \mid o_i) - H(\rho_i \mid \tau_i, o_i)$, which is mathematically equal to $I(\tau_i; \rho_i \mid o_i)$, the conditional mutual information between the individual trajectory and the role given the current observation.
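For completeness, the identity invoked above can be written out explicitly (a standard decomposition of conditional mutual information; time-step superscripts are omitted for readability):

```latex
\begin{align*}
I(\tau_i;\rho_i \mid o_i)
  &= H(\rho_i \mid o_i) - H(\rho_i \mid \tau_i, o_i) \\
  &= \mathbb{E}_{\tau_i,\,\rho_i,\,o_i}\!\left[
       \log \frac{p(\rho_i \mid \tau_i, o_i)}{p(\rho_i \mid o_i)} \right].
\end{align*}
```

Maximizing this quantity therefore keeps the role distribution broad given the observation alone while making it sharply determined once the agent's trajectory is known.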

However, estimating and maximizing mutual information is often intractable. Drawing inspiration from the literature of variational inference (wainwright2008graphical; alemi2017deep), we introduce a variational posterior estimator to derive a tractable lower bound of the mutual information for each time step $t$:

$$I(\rho_i^t; \tau_i^{t-1} \mid o_i^t) = \mathbb{E}_{\tau_i^{t-1}, o_i^t, \rho_i^t}\!\left[\log \frac{p(\rho_i^t \mid \tau_i^{t-1}, o_i^t)}{p(\rho_i^t \mid o_i^t)}\right] \ge \mathbb{E}_{\tau_i^{t-1}, o_i^t, \rho_i^t}\!\left[\log q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)\right] + H(\rho_i^t \mid o_i^t), \qquad (2)$$

where $q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t)$ is the variational estimator parameterised with $\xi$, and the inequality holds because of the non-negativity of the KL divergence. The distribution $q_\xi$ can be arbitrary; we use a GRU (cho2014learning) to encode an agent's history of observations and actions, which we call the trajectory encoder. The lower bound in Eq. 2 can be further rewritten as a loss function to be minimized:

$$\mathcal{L}_I(\theta_\rho, \xi) = \mathbb{E}_{(\tau_i^{t-1}, o_i^t) \sim D}\!\left[ CE\!\left( p(\rho_i^t \mid o_i^t) \,\big\|\, q_\xi(\rho_i^t \mid \tau_i^{t-1}, o_i^t) \right) - H(\rho_i^t \mid o_i^t) \right], \qquad (3)$$

where $D$ is a replay buffer, and $H$ and $CE$ are the entropy and cross entropy operators, respectively. The detailed derivation can be found in Appendix A.1.
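As a minimal sketch of how the identifiability regularizer in the reconstructed Eq. 3 could be computed, assume that both the role encoder $p(\rho \mid o)$ and the trajectory encoder $q_\xi(\rho \mid \tau, o)$ output diagonal Gaussians; since the role encoder only sees $o$, the cross-entropy-minus-entropy form reduces to a KL divergence. Function and variable names below are illustrative, not the authors' implementation.

```python
# Sketch of the identifiability regularizer L_I (reconstructed Eq. 3) for
# diagonal-Gaussian role distributions. Because CE(p || q) - H(p) = KL(p || q),
# the loss can be computed with a closed-form KL divergence.
import torch
from torch.distributions import Normal, kl_divergence

def identifiability_loss(role_dist: Normal, traj_posterior: Normal) -> torch.Tensor:
    """L_I = E[CE(p(rho|o) || q_xi(rho|tau,o)) - H(p(rho|o))] = E[KL(p || q_xi)]."""
    kl = kl_divergence(role_dist, traj_posterior)   # elementwise, (batch, role_dim)
    return kl.sum(dim=-1).mean()

# Usage with dummy distribution parameters: batch of 8 agents, 3-dimensional roles.
p = Normal(torch.randn(8, 3), torch.rand(8, 3) + 0.1)   # role encoder output
q = Normal(torch.randn(8, 3), torch.rand(8, 3) + 0.1)   # trajectory encoder output
loss_I = identifiability_loss(p, q)
```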

3.2 Specialized Roles

The formulation so far does not promote sub-task specialization, which is a key component for decomposing the task and improving efficiency in multi-agent systems. Minimizing the loss $\mathcal{L}_I$ enables roles to contain enough information about long-term behaviors, but it does not explicitly ensure that agents with similar behaviors have similar role embeddings.

Intuitively, for any two agents in a population with clear sub-task specialization, either they have similar roles or they have quite different responsibilities, which can be characterized by their behavior trajectories. However, during the process of role emergence it is not known in advance which agents should have similar roles, nor is the similarity between behaviors straightforward to define.

To resolve this problem, we define another regularizer. To encourage two agents $i$ and $j$ to have similar roles and similar behaviors, we can maximize $I(\rho_i; \tau_j)$, the mutual information between the role of agent $i$ and the trajectory of agent $j$. Directly optimizing these objectives for all agent pairs would result in all agents having the same role, and, correspondingly, the same policy, which would limit system performance. To settle this issue, we introduce a dissimilarity model $d_\phi$, a trainable neural network taking two trajectories as input, and seek to maximize $I(\rho_i; \tau_j) + d_\phi(\tau_i, \tau_j)$ while minimizing the number of non-zero elements in the matrix $D_\phi = (d_{ij})$, where $d_{ij} = d_\phi(\tau_i, \tau_j)$ is the estimated dissimilarity between the trajectories of agents $i$ and $j$. This formulation makes sure that the dissimilarity is high only when the mutual information is low, so that the set of learned roles is compact yet diverse enough to cover the identified sub-tasks and efficiently solve the given task. Formally, the following constrained objective shapes the role embedding learning and encourages sub-task specialization:

(4)

where $U$ is a hyper-parameter that controls the compactness of the role representation. Relaxing the matrix norm with the Frobenius norm, we obtain the following objective to minimize:

(5)

However, as estimating and optimizing the mutual information terms is intractable, we use the variational posterior estimator introduced in Sec. 3.1 to construct an upper bound, which serves as the second regularizer of ROMA:

(6)

where $D$ is the replay buffer, $\tau$ is the joint trajectory, $o$ is the joint observation, and $d_{ij} = d_\phi(\tau_i, \tau_j)$. A detailed derivation can be found in Appendix A.2.
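Because the exact relaxed objective in Eqs. 4–6 is not fully specified in this text, the sketch below should be read as one plausible instantiation of the ingredients named above, under stated assumptions: a trainable dissimilarity network standing in for $d_\phi$, the variational term $\log q_\xi(\rho_i \mid \tau_j)$ as a tractable stand-in for $I(\rho_i; \tau_j)$, a hinge at the threshold $U$, and a Frobenius-norm penalty on the dissimilarity matrix. All names and the hinge form are assumptions, not the paper's definition.

```python
# Hedged sketch of a specialization-style regularizer: for each agent pair (i, j),
# either the variational score log q_xi(rho_i | tau_j) or the dissimilarity
# d_phi(tau_i, tau_j) should exceed the threshold U, while the dissimilarity
# matrix is kept small in Frobenius norm. Forms and names are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Normal

class Dissimilarity(nn.Module):
    """d_phi: maps a pair of trajectory embeddings to a non-negative dissimilarity."""
    def __init__(self, traj_dim: int, hidden: int = 12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * traj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, tau_i: torch.Tensor, tau_j: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([tau_i, tau_j], dim=-1)).squeeze(-1)

def specialization_loss(roles: torch.Tensor, traj_emb: torch.Tensor,
                        posterior: Normal, d_phi: Dissimilarity,
                        U: float = 2.0) -> torch.Tensor:
    """roles: (n, role_dim); traj_emb: (n, traj_dim); posterior: q_xi(. | tau_j) per agent j."""
    n = roles.size(0)
    loss = roles.new_zeros(())
    d_matrix = roles.new_zeros(n, n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mi_proxy = posterior.log_prob(roles[i])[j].sum()        # log q_xi(rho_i | tau_j)
            d_ij = d_phi(traj_emb[i], traj_emb[j])
            d_matrix[i, j] = d_ij
            loss = loss + torch.clamp(U - mi_proxy - d_ij, min=0)   # hinge at threshold U
    return loss + torch.norm(d_matrix, p='fro')                     # Frobenius relaxation

# Usage with dummy tensors: 4 agents, 3-dim roles, 8-dim trajectory embeddings.
roles, traj = torch.randn(4, 3), torch.randn(4, 8)
q_xi = Normal(torch.randn(4, 3), torch.rand(4, 3) + 0.1)   # q_xi(. | tau_j), j = 1..4
loss_D = specialization_loss(roles, traj, q_xi, Dissimilarity(traj_dim=8))
```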

3.3 Overall Optimization Objective

We have introduced optimization objectives for learning roles that are versatile, identifiable, and specialized. Apart from these regularizers, all the parameters in the framework are updated by gradients induced by the standard TD loss of reinforcement learning. As shown in Figure 2, to compute the global TD loss, individual utilities are fed into a mixing network whose output is the estimate of the global action value $Q_{tot}$. In this paper, our ROMA implementation uses the mixing network introduced by QMIX (rashid2018qmix) (see Appendix D) for its monotonic approximation, but it can easily be replaced by other mixing methods. The parameters of the mixing network are generated by a hyper-net conditioned on the global state. Therefore, the final learning objective of ROMA is:

$$\mathcal{L}(\theta) = \mathcal{L}_{TD}(\theta) + \lambda_I \mathcal{L}_I(\theta_\rho, \xi) + \lambda_D \mathcal{L}_D(\theta_\rho, \phi, \xi), \qquad (7)$$

where $\lambda_I$ and $\lambda_D$ are scaling factors, and $\mathcal{L}_{TD}(\theta) = \left[ r + \gamma \max_{\boldsymbol{a}'} Q_{tot}(s', \boldsymbol{\tau}', \boldsymbol{a}'; \theta^-) - Q_{tot}(s, \boldsymbol{\tau}, \boldsymbol{a}; \theta) \right]^2$, where $\theta^-$ are the parameters of a periodically updated target network. In our centralized training with decentralized execution paradigm, only the role encoder, the role decoder, and the individual policies are used during execution.
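A compact sketch of how the overall objective in Eq. 7 could be assembled during centralized training: the TD loss uses a detached target-network estimate, and the two regularizers are added with their scaling factors. The batch fields and the placeholder scaling values are hypothetical, and the mixing-network details are omitted.

```python
# Sketch of the total objective L = L_TD + lambda_I * L_I + lambda_D * L_D (Eq. 7)
# with a standard one-step TD target. Field names (batch.*) are placeholders.
import torch
from types import SimpleNamespace

def total_loss(batch, q_tot, target_q_tot, loss_I, loss_D,
               lambda_I, lambda_D, gamma=0.99):
    """q_tot: Q_tot(s, tau, a; theta); target_q_tot: max_a' Q_tot(s', tau', a'; theta^-),
    detached so that gradients do not flow into the periodically updated target network."""
    td_target = batch.reward + gamma * (1 - batch.terminated) * target_q_tot.detach()
    td_loss = torch.mean((td_target - q_tot) ** 2)
    return td_loss + lambda_I * loss_I + lambda_D * loss_D

# Usage with dummy tensors (batch of 8 transitions); the scaling factors are placeholders
# standing in for the grid-searched values mentioned in Sec. 5.
batch = SimpleNamespace(reward=torch.randn(8), terminated=torch.zeros(8))
loss = total_loss(batch, q_tot=torch.randn(8), target_q_tot=torch.randn(8),
                  loss_I=torch.tensor(0.3), loss_D=torch.tensor(0.5),
                  lambda_I=1.0, lambda_D=1.0)
```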

4 Related Works

The emergence of roles has been documented in many natural systems, such as bees (jeanson2005emergence), ants (gordon1996organization), and humans (butler2012condensed). In these systems, roles are closely related to the division of labor and are crucial to the improvement of labor efficiency. Many multi-agent systems are inspired by these natural systems. They decompose the task, make agents with the same role specialize in certain sub-tasks, and thus reduce design complexity (wooldridge2000gaia; omicini2000soda; padgham2002prometheus; pavon2003agent; cossentino2005passi; zhu2008role; spanoudakis2010using; deloach2010mase; bonjean2014adelfe). These methodologies are designed for tasks with clear structure, such as software engineering (bresciani2004tropos). Therefore, they tend to use predefined roles and associated responsibilities (Lhaksmana2018role). In contrast, we focus on how to implicitly introduce the concept of roles into general multi-agent sequential decision making under dynamic and uncertain environments.

(a) Roles are characterized by agents’ positions.
(b) Roles are characterized by agents’ remaining hit points.
(c) Roles are characterized by agents’ remaining hit points and positions.
Figure 3: Dynamic role adaptation at three time steps of an episode (means of the role distributions, $\mu_{\rho_i}$, are shown, without using any dimensionality reduction techniques). The role encoder learns to focus on different parts of the observations according to the automatically discovered demands of the task. The role-induced strategy helps (a) quickly form the offensive arc; (b) protect injured agents; (c) protect dying agents and alternate fire.

Deep multi-agent reinforcement learning has witnessed vigorous progress in recent years. COMA (foerster2018counterfactual), MADDPG (lowe2017multi), PR2 (wen2019probabilistic), and MAAC (iqbal2019actor) explore multi-agent policy gradients. Another line of research focuses on value-based multi-agent RL, where value-function factorization is the most popular approach. VDN (sunehag2018value), QMIX (rashid2018qmix), and QTRAN (son2019qtran) have progressively enlarged the family of functions that can be represented by the mixing network. NDQ (wang2020learning) proposes nearly decomposable value functions to address the miscoordination problem in learning fully decentralized value functions. Emergence is a topic of increasing interest in deep MARL. Works on the emergence of communication (foerster2016learning; lazaridou2017multi; das2017learning; mordatch2018emergence; wang2020learning), the emergence of fairness (jiang2019learning), and the emergence of tool usage (baker2020emergent) provide a deep learning perspective on understanding both natural and artificial multi-agent systems.

To learn diverse and identifiable roles, we propose to optimize the mutual information between individual roles and trajectories. A recent work studying multi-agent exploration, MAVEN (mahajan2019maven), uses a similar objective. Different from ROMA, MAVEN aims at committed exploration. This difference in high-level purpose leads to many technical distinctions. First, MAVEN optimizes the mutual information between the joint trajectory and a latent variable conditioned on a Gaussian or uniform random variable to encourage diverse joint trajectories, whereas we maximize the mutual information between individual roles and individual trajectories. Second, apart from the mutual information objective, we propose a novel regularizer to learn specialized roles, while MAVEN adopts a hierarchical structure and encourages the latent variable to help obtain more environmental reward. We empirically compare ROMA with MAVEN in Sec. 5. More related works are discussed in Appendix D.

5 Experiments

Our experiments aim to answer the following questions: (1) Can the learned roles automatically adapt in dynamic environments? (Sec. 5.1.) (2) Can our method promote sub-task specialization? That is, do agents with similar responsibilities have similar role embedding representations, while agents with different responsibilities have role embedding representations far from each other? (Sec. 5.1, 5.3.) (3) Can such sub-task specialization improve the performance of multi-agent reinforcement learning algorithms? (Sec. 5.2.) (4) How do roles evolve during training, and how do they influence team performance? (Sec. 5.4.) Videos of our experiments are available online (https://sites.google.com/view/romarl/).

Baselines We compare our method with the baselines shown in Table 1. In particular, we carry out the following ablation studies: (i) ROMA-RAW has the same framework as ROMA but does not include $\mathcal{L}_I$ and $\mathcal{L}_D$ in the optimization objective. This ablation is designed to highlight the contribution of the proposed regularizers. (ii) QMIX-NPS is the same as QMIX (rashid2018qmix), but agents do not share parameters. Our method achieves adaptive learning sharing, so the comparison against QMIX (parameters shared among agents) and QMIX-NPS tests whether this flexibility improves learning efficiency. (iii) QMIX-LAR is QMIX with a number of parameters similar to our framework, which tests whether the superiority of our method comes from an increased number of parameters.

Figure 4: Comparison of our method against baseline algorithms. Results for more maps can be found in Appendix C.1.
Figure 5: Comparison of our method against ablations.
Alg.        Description
---------   ------------------------------------------------------
Related works
  IQL       Independent Q-learning
  COMA      Foerster et al. (foerster2018counterfactual)
  QMIX      Rashid et al. (rashid2018qmix)
  QTRAN     Son et al. (son2019qtran)
  MAVEN     Mahajan et al. (mahajan2019maven)
Ablations
  ROMA-RAW  Without $\mathcal{L}_I$ and $\mathcal{L}_D$
  QMIX-NPS  QMIX without parameter sharing among agents
  QMIX-LAR  QMIX with a similar number of parameters to ROMA
Table 1: Baseline algorithms.

We carry out a grid search over the loss coefficients $\lambda_I$ and $\lambda_D$ and fix the selected values across all the experiments. The dimensionality of the latent role space is set to 3, so we do not use any dimensionality reduction techniques when visualizing the role embedding representations. Other hyperparameters are also fixed in our experiments and are listed in Appendix B.1. For ROMA, we use elementary network structures (fully-connected networks or GRUs) for the role encoder, role decoder, and trajectory encoder. The details of the architecture of our method and the baselines can be found in Appendix B.

5.1 Dynamic Roles

Answering the first and second questions, we show snapshots from an episode played by ROMA agents on the StarCraft II micromanagement benchmark (SMAC) map where 10 Marines face 11 enemy Marines. As shown in Fig. 3 (the corresponding role representations are presented in Fig. 1), although observations contain much information, such as positions, health points, shield points, and the states of ally and enemy units, the role encoder learns to focus on different parts of the observations according to the dynamically changing situation. At the beginning of the episode, agents need to form a concave arc to maximize the number of agents whose shooting range covers the front line of enemies. ROMA learns to allocate roles according to agents’ relative positions, so that agents can form the offensive formation quickly using specialized policies. In the middle of the battle, one important tactic is to protect the injured ranged units. Our method learns this maneuver, and roles cluster according to the remaining health points. The healthiest agents have role representations far from those of the other agents. Such representations result in differentiated strategies: the healthiest agents move forward to take on more firepower while the other agents move backward, firing from a distance. In the meantime, some roles also cluster according to positions (agents 3 and 8). The corresponding behavior is that agents with different roles fire alternately to share the firepower. We can also observe that the role representations of dead agents aggregate together, representing a special group that grows as the battle proceeds.

These results demonstrate that our method learns dynamic roles and roles cluster clearly corresponding to automatically detected sub-tasks, in line with implicit constraints of the proposed optimization objectives.

(a) Strategy: sacrificing Zealots 9 and 7 to minimize Banelings’ splash damage.
(b) Strategy: forming an offensive concave arc quickly.
(c) Strategy: green Zerglings hide away and Banelings kill most enemies by explosion.
Figure 6: Learned roles for three maps (means of the role distributions, $\mu_{\rho_i}$, are shown, without using any dimensionality reduction techniques), and the related, automatically discovered responsibilities.

5.2 Performance on StarCraft II

To test whether these roles and the corresponding sub-task specialization can improve learning efficiency, we evaluate our method on the StarCraft II micromanagement (SMAC) benchmark (samvelyan2019starcraft). This benchmark consists of various maps that have been classified as easy, hard, and super hard. We compare ROMA with the algorithms shown in Table 1 and present results for three hard maps and two super hard maps. Although the SMAC benchmark is challenging, it is not specially designed to test performance in tasks with many agents. We thus introduce three new SMAC maps to test the scalability of our method, which are described in detail in Appendix C.

For evaluation, all experiments in this section are carried out with 5 different random seeds, and results are shown with confidence intervals. Among these maps, four feature heterogeneous agents, and the others have homogeneous agents. Fig. 4 shows that our method yields substantially better results than all the alternative approaches on both homogeneous and heterogeneous maps (additional plots can be found in Appendix C.1). MAVEN overcomes the negative effects of QMIX’s monotonicity constraint on exploration. However, it performs less satisfactorily than QMIX on most maps. We believe this is because agents start engaging in the battle immediately after spawning on SMAC maps, so exploration is not the critical factor affecting performance.

Ablations We carry out ablation studies, comparing against the ablations shown in Table 1, and present results on three maps, covering both heterogeneous and homogeneous teams, in Fig. 5. The superiority of our method over ROMA-RAW highlights the contribution of the proposed regularizers: ROMA-RAW performs even worse than QMIX on two of the three maps. The comparison between QMIX-NPS and QMIX demonstrates that parameter sharing can, as documented (foerster2018counterfactual; rashid2018qmix), speed up training. As discussed in the introduction, neither of these two paradigms may achieve the best possible performance. In contrast, our method provides a dynamic learning-sharing mechanism: agents committed to a certain responsibility have similar policies. The comparison of ROMA, QMIX, and QMIX-NPS shows that such sub-task specialization can indeed improve team performance. Moreover, the comparison of ROMA against QMIX-LAR shows that the superiority of our method does not depend on a larger number of parameters.

The performance gap between ROMA and the ablations is more significant on maps with more than ten agents. This observation supports the discussion in previous sections: the emergence of roles is more likely to improve labor efficiency in larger populations.

Figure 7: Role emergence and evolution during training (role representations at the first time step are shown; means of the role distributions, $\mu_{\rho_i}$, are shown without using any dimensionality reduction techniques). The emergence and specialization of roles is closely connected to the improvement of team performance. The agents on this map are heterogeneous; we show the role evolution process for a homogeneous team in Appendix C.3.

5.3 Role Embedding Representations

To explain the superiority of ROMA, we present the learned role embedding representations for three maps in Fig. 6. Roles are representative of the automatically discovered sub-tasks in the learned winning strategy. In the map shown in Fig. 6(a), ROMA learns to sacrifice Zealots 9 and 7 to kill all the enemy Banelings. Specifically, Zealots 9 and 7 move to the frontier one by one to minimize the splash damage, while the other agents stay away and wait until all Banelings explode. Fig. 6(a) shows the role embedding representations while performing the first sub-task, where agent 9 is sacrificed. We can see that the role of Zealot 9 is quite different from those of the other agents. Correspondingly, the strategy at this time is for agent 9 to move rightward while the other agents keep still. Detailed analysis for the other two maps can be found in Appendix C.2.

5.4 Emergence and Evolution of Roles

We have shown the learned role representations and the performance of our method, but the relationship between roles and performance remains unclear. To fill this gap, we visualize the emergence and evolution of roles during training on a heterogeneous map and a homogeneous map. We discuss the results for the heterogeneous map here and defer the analysis of the homogeneous map to Appendix C.3.

On this map, 1 Medivac, 2 Marauders, and 7 Marines face a stronger enemy team consisting of 1 Medivac, 3 Marauders, and 8 Marines. Among the three involved unit types, the Medivac is the most special one because it can heal injured units. In Fig. 7, we show one of the learning curves of ROMA (red) and the role representations at the first environment step at three different stages of training. When training begins, roles are random, and the agents are exploring the environment to learn the basic dynamics and structure of the task. After some training, ROMA has learned that the responsibilities of the Medivac are different from those of the Marines and Marauders. The role, and correspondingly the policy, of the Medivac becomes quite different (Fig. 7, middle). Such differentiation in behaviors enables the agents to start winning the game. Gradually, ROMA learns that Marines and Marauders have dissimilar characteristics and should take on different sub-tasks, as indicated by the differentiation of their role representations (Fig. 7, right). This further specialization facilitates the subsequent performance increase. Once the responsibilities of the roles are clear, the win rate gradually converges (Fig. 4, top left). For comparison, ROMA without $\mathcal{L}_I$ and $\mathcal{L}_D$ cannot win even once on this challenging task (ROMA-RAW in Fig. 5, left). These results demonstrate that the gradually specialized roles are indispensable for improving team performance.

In summary, our experiments demonstrate that ROMA can learn dynamic, identifiable, versatile, and specialized roles that effectively decompose the task. Drawing support from these emergent roles, our method significantly pushes forward the state of the art of multi-agent reinforcement learning algorithms.

6 Closing Remarks

We have introduced the concept of roles into deep multi-agent reinforcement learning by capturing emergent roles and encouraging them to specialize on sets of automatically detected sub-tasks. Such a deep role-oriented multi-agent learning framework provides another perspective for explaining and promoting cooperation within agent teams, and it implicitly draws a connection to the division of labor, which has long been practiced in many natural systems.

To the best of our knowledge, this paper makes a first attempt at learning roles via deep reinforcement learning. The gargantuan task of understanding the emergence of roles, the division of labor, and the interactions among more complex roles in hierarchical organizations still lies ahead. We believe that these topics are fundamental and indispensable for building effective, flexible, and general-purpose multi-agent systems, and that this paper can help tackle these challenges.

References

Appendix A Mathematical Derivation

a.1 Identifiable and Versatile Roles

For learning identifiable and versatile roles, we propose to maximize the conditional mutual information between roles and local observation-action histories given the current observations. In Sec. 3.1 of the paper, we introduce a posterior estimator $q_\xi$ and derive a tractable lower bound of the mutual information term:

(8)

Then it follows that:

(9)

The role encoder is conditioned on the local observations, so given the observations, the distributions of roles are independent of the local histories. Thus, we have

(10)

In practice, we use a replay buffer and minimize

(11)

a.2 Specialized Roles

Conditioning roles on local observations enables roles to be dynamic, and optimizing $\mathcal{L}_I$ enables roles to be identifiable and versatile, but these formulations do not explicitly encourage specialized roles. To make up for this shortcoming, we propose a role differentiation objective in Sec. 3.2 of the paper, which involves a mutual information maximization objective (maximizing $I(\rho_i; \tau_j)$ for agent pairs $i \neq j$). Here, we derive a variational lower bound of this mutual information objective to make it feasible to optimize.

(12)

We clip the variances of the role distributions at a small value (0.1) to ensure that the entropies of the role distributions are always non-negative, so that the last inequality holds. Then, it follows that:

(13)

where $q_\xi$ is the trajectory encoder introduced in Sec. A.1, and the KL divergence term can be left out when deriving the lower bound because it is non-negative. Therefore, we have:

(14)

Recall that, in order to learn specialized roles, we propose to minimize:

(15)

where $D_\phi = (d_{ij})$ and $d_{ij} = d_\phi(\tau_i, \tau_j)$ is the estimated dissimilarity between the trajectories of agents $i$ and $j$. For the mutual information term, we have:

(16)

where $\tau$ is the joint trajectory and $o$ is the joint observation. We denote

(17)

Because

(18)

it follows that:

(19)

So that

(20)

which means that Eq. 15 satisfies:

(21)

We minimize this upper bound to optimize Eq. 15. In practice, we use a replay buffer, and minimize:

(22)

where $D$ is the replay buffer, $\tau$ is the joint trajectory, $o$ is the joint observation, and $d_{ij} = d_\phi(\tau_i, \tau_j)$.

Appendix B Architecture, Hyperparameters, and Infrastructure

b.1 Roma

In this paper, we base our algorithm on QMIX (rashid2018qmix), whose framework is shown in Fig. 12 and described in Appendix D. In ROMA, each agent has a neural network to approximate its local utility. The local utility network consists of three layers: a fully-connected layer, followed by a 64-dimensional GRU, followed by another fully-connected layer that outputs an estimated value for each action. The local utilities are fed into a mixing network that estimates the global action value. The mixing network has a 32-dimensional hidden layer with ReLU activation. The parameters of the mixing network are generated by a hyper-net conditioned on the global state. This hyper-net has a fully-connected hidden layer of 32 dimensions. These settings are the same as in QMIX.

Figure 8: Additional results on the SMAC benchmark.

We use very simple network structures for the components related to role embedding learning, i.e., the role encoder, the role decoder, and the trajectory encoder. The multivariate Gaussian distributions from which the individual roles are drawn have their means and variances generated by the role encoder, which is a fully-connected network with a 12-dimensional hidden layer and ReLU activation. The parameters of the second fully-connected layer of each local utility approximator are generated by the role decoder, whose inputs are the individual roles, which are 3-dimensional in all experiments. The role decoder is also a fully-connected network with a 12-dimensional hidden layer and ReLU activation. For the trajectory encoder, we again use a fully-connected network with a 12-dimensional hidden layer and ReLU activation. The inputs of the trajectory encoder are the hidden states of the GRUs in the local utility functions after the last time step.
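To make the layer sizes above concrete, the sketch below assembles the described per-agent utility network: a fully-connected layer, a 64-dimensional GRU, and a final layer whose weights and biases are generated from the 3-dimensional role by a hyper-network with a 12-dimensional hidden layer. The input dimension and exact wiring are illustrative assumptions, not the authors' code.

```python
# Sketch of the described local utility network: FC -> 64-dim GRU -> a final
# fully-connected layer whose parameters are produced by the role decoder
# (a hyper-network over a 3-dimensional role). Sizes/wiring are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoleConditionedUtility(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, role_dim: int = 3,
                 rnn_hidden: int = 64, hyper_hidden: int = 12):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, rnn_hidden)
        self.rnn = nn.GRUCell(rnn_hidden, rnn_hidden)
        # Role decoder (hyper-network): role -> parameters of the last layer.
        self.hyper_w = nn.Sequential(nn.Linear(role_dim, hyper_hidden), nn.ReLU(),
                                     nn.Linear(hyper_hidden, rnn_hidden * n_actions))
        self.hyper_b = nn.Sequential(nn.Linear(role_dim, hyper_hidden), nn.ReLU(),
                                     nn.Linear(hyper_hidden, n_actions))
        self.rnn_hidden, self.n_actions = rnn_hidden, n_actions

    def forward(self, obs, h, role):
        x = F.relu(self.fc1(obs))                 # (batch, rnn_hidden)
        h = self.rnn(x, h)                        # GRU hidden state (this is what the
                                                  # trajectory encoder consumes, per the text)
        w = self.hyper_w(role).view(-1, self.rnn_hidden, self.n_actions)
        q = torch.bmm(h.unsqueeze(1), w).squeeze(1) + self.hyper_b(role)
        return q, h                               # (batch, n_actions), new hidden state

# Usage with dummy tensors: batch of 8 agents' observations.
net = RoleConditionedUtility(obs_dim=16, n_actions=5)
h0 = torch.zeros(8, 64)
q, h1 = net(torch.randn(8, 16), h0, role=torch.randn(8, 3))
```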

(a) Strategy: sacrificing Zealots 9 and 7 to minimize Banelings’ splash damage.
(b) Strategy: forming an offensive concave arc quickly.
(c) Strategy: green Zerglings hide away and Banelings kill most enemies by explosion.
Figure 9: (Reproduced from Fig. 6 in the paper, for quick reference.) Learned roles for three maps (means of the role distributions, $\mu_{\rho_i}$, are shown, without using any dimensionality reduction techniques), and the related, automatically discovered responsibilities.

For all experiments, we fix the loss coefficients $\lambda_I$ and $\lambda_D$ and the discount factor $\gamma$. The optimization is conducted using RMSprop with a fixed learning rate, $\alpha$ of 0.99, and no momentum or weight decay. For exploration, we use $\epsilon$-greedy, with $\epsilon$ annealed linearly at the beginning of training and then kept constant for the rest of the training. We run parallel environments to collect samples. Batches of 32 episodes are sampled from the replay buffer, and the whole framework is trained end-to-end on fully unrolled episodes. All experiments on StarCraft II use the default reward and observation settings of the SMAC benchmark.

Experiments are carried out on an NVIDIA RTX 2080 Ti GPU.

Figure 10: The process of role emergence and evolution on the map .

b.2 Baselines and Ablations

We compare ROMA with various baselines and ablations, which are listed in Table 1 of the paper. For COMA (foerster2018counterfactual), QMIX (rashid2018qmix), and MAVEN (mahajan2019maven), we use the code provided by the authors, where the hyper-parameters have been fine-tuned on the SMAC benchmark. QMIX-NPS uses an architecture identical to QMIX; the only difference is that QMIX-NPS does not share parameters among agents. Compared to QMIX, QMIX-LAR adds two more fully-connected layers of 80 and 25 dimensions after the GRU layer of the local utility function, so that it has approximately the same number of parameters as ROMA.

Appendix C Additional Experimental Results

We benchmark our method on StarCraft II unit micromanagement tasks. To test the scalability of the proposed approach, we introduce three maps. The first map features symmetric teams consisting of 4 Banelings and 6 Zerglings. In the second map, 6 Stalkers and 4 Zealots learn to defeat 10 Banelings and 30 Zerglings. The third map features asymmetric teams consisting of 10 Zerglings & 5 Banelings and 2 Zealots & 3 Stalkers, respectively.

c.1 Performance Comparison against Baselines

Fig. 8 presents the performance of ROMA against various baselines on three maps. Performance comparisons on the other maps are shown in Fig. 4 of the paper. We can see that the advantage of ROMA is more significant on maps with more agents.

c.2 Role Embedding Representations

Fig. 9 shows various roles learned by ROMA. Roles are closely related to the sub-tasks in the learned winning strategy.

For the map in Fig. 9(b), the winning strategy is to form an offensive concave arc before engaging in the battle. Fig. 9(b) illustrates the role embedding representations at the first time step, when the agents are going to set up the attack formation. We can see that the roles aggregate according to the relative positions of the agents. Such role differentiation leads to different moving strategies, so that agents can quickly form the arc without collisions.

Similar role-behavior relationships can be seen in all tasks. We present another example, shown in Fig. 9(c). In the winning strategy learned by ROMA, Zerglings 4 & 5 and the Banelings kill most of the enemies, taking advantage of the splash damage of the Banelings, while Zerglings 6-9 hide away, wait until the explosion is over, and then kill the remaining enemies. Fig. 9(c) shows the role embedding representations before the explosion. We can see clear clusters closely corresponding to the automatically detected sub-tasks at this time step.

Supported by these results, we can conclude that ROMA can automatically decompose the task and learn versatile roles, each of which is specialized in a certain sub-task.

c.3 Additional Results for Role Evolution

Figure 11: Screenshot at the first time step of the battle.

In Fig. 7 of the paper, we show how roles emerge and evolve on a map where the involved agents are heterogeneous. In this section, we discuss the case of homogeneous agent teams. To this end, we visualize the emergence and evolution process of roles on the map that features 10 ally Marines facing 11 enemy Marines. In Fig. 10, we show the roles at the first time step of the battle (a screenshot can be found in Fig. 11) at four different stages during training. At this moment, agents need to form an offensive concave arc quickly. We can see that ROMA gradually learns to allocate roles according to the relative positions of the agents. Such roles, and the corresponding differentiation in the individual policies, help agents form the offensive arc more efficiently. Since setting up an attack formation is critical for winning the game, a connection between the specialization of the roles at the first time step and the improvement of the win rate can be observed.

Figure 12: The framework of QMIX, reproduced from the original paper (rashid2018qmix). (a) The architecture of the mixing network (blue), whose weights and biases are generated by a hyper-net (red) conditioned on the global state. (b) The overall QMIX structure. (c) Local utility network structure.

Appendix D Related Works

Multi-agent reinforcement learning holds the promise of solving many real-world problems and has been making vigorous progress recently. To avoid an otherwise exponentially large state-action space, factored MDPs for multi-agent systems have been proposed (guestrin2002multiagent). Coordination graphs (bargiacchi2018learning; yang2018glomo; grover2018evaluating; kipf2018neural) and explicit communication (sukhbaatar2016learning; hoshen2017vain; jiang2018learning; singh2019learning; das2019tarmac; kim2019learning) have been studied to model the dependence between the decision-making processes of agents. Training decentralized policies faces two challenges: non-stationarity (tan1993multi) and credit assignment (foerster2018counterfactual; nguyen2018credit). To resolve these problems, Sunehag et al. (sunehag2018value) propose a value decomposition method called VDN. VDN learns a global action-value function, which is factored as the sum of each agent’s local Q-value. QMIX (rashid2018qmix) extends VDN by representing the global value function as a learnable, state-conditioned, and monotonic combination of the local Q-values. In this paper, we use the mixing network of QMIX. The framework of QMIX is shown in Fig. 12.
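As a concrete reference for the monotonic combination described above, here is a sketch of a QMIX-style mixing network in which the mixing weights are generated by state-conditioned hyper-networks and passed through an absolute value, keeping $Q_{tot}$ monotonic in each local Q-value; the layer sizes are illustrative rather than taken from this paper.

```python
# Sketch of a QMIX-style monotonic mixer: non-negative (abs-transformed) weights
# generated from the global state keep dQ_tot/dQ_i >= 0. Sizes are illustrative;
# see (rashid2018qmix) for the original design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, local_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # local_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = F.elu(torch.bmm(local_qs.unsqueeze(1), w1) + b1)   # (batch, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed, 1)
        q_tot = torch.bmm(hidden, w2).squeeze(-1) + self.hyper_b2(state)
        return q_tot.squeeze(-1)                                     # (batch,)

# Usage: combine 3 agents' chosen local Q-values under a 48-dimensional global state.
mixer = MonotonicMixer(n_agents=3, state_dim=48)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 48))
```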

The StarCraft II unit micromanagement task is considered one of the most challenging cooperative multi-agent testbeds due to its high degree of control complexity and environmental stochasticity. Usunier et al. (usunier2017episodic) and Peng et al. (peng2017multiagent) study this problem from a centralized perspective. To facilitate decentralized control, we test our method on the SMAC benchmark (samvelyan2019starcraft), the same setting as in (foerster2017stabilising; foerster2018counterfactual; rashid2018qmix; mahajan2019maven).