Hierarchical Cooperative Multi-Agent Reinforcement Learning with Skill Discovery

by Jiachen Yang, et al.

Human players in professional team sports achieve high-level coordination by dynamically choosing complementary skills and executing primitive actions to perform these skills. As a step toward creating intelligent agents with this capability for fully cooperative multi-agent settings, we propose a two-level hierarchical multi-agent reinforcement learning (MARL) algorithm with unsupervised skill discovery. Agents learn useful and distinct skills at the low level via independent Q-learning, while they learn to select complementary latent skill variables at the high level via centralized multi-agent training with an extrinsic team reward. The set of low-level skills emerges from an intrinsic reward that solely promotes the decodability of latent skill variables from the trajectory of a low-level skill, without the need for hand-crafted rewards for each skill. For scalable decentralized execution, each agent independently chooses latent skill variables and primitive actions based on local observations. Our overall method enables the use of general cooperative MARL algorithms for training high-level policies and single-agent RL for training low-level skills. Experiments on a stochastic high-dimensional team game show the emergence of useful skills and cooperative team play. The interpretability of the learned skills shows the promise of the proposed method for achieving human-AI cooperation in team sports games.






1 Introduction

We focus on cooperative multi-agent learning specifically in the context of team sports. Strategic team play in real-world sports is often driven by parallel execution of a sequence of subgoals by each player on the team jc: cite. Taking football for example, Player 1 may aim to reach an open spot while Player 2 prepares to make a pass, whereas Player 3 stays in a defensive zone. If the opponent acquires possession, the subgoals of each player immediately change: Player 3 may switch to blocking the opponent, while Players 1 and 2 may return to their side of the field to help with defense. One step toward attaining agents who exhibit strategic human-like team play in their high-level behavior is for agents to discover and learn a set of subgoals and execute high-level temporally-extended behavior (i.e. options [30]) in the game.

Training agents to accomplish subgoals also constrains the set of possible agent behaviors to a smaller space that is easier to identify and interpret, and potentially more similar to strategic human-like play. Moreover, having agents who can reliably execute certain behavior is useful for human-machine cooperation in sports games, where human players may expect AI-controlled teammates to cooperate by choosing and executing the correct subgoals (e.g. agents should go on offense when their human teammates have the ball). Furthermore, discovering and mastering subgoals may induce a natural decomposition of the multi-agent learning problem, enabling the team to execute higher-level strategies, leading to faster learning and higher performance than otherwise. While real-world players tend to have fixed roles that limit the set of subgoals they are capable of achieving based on their strengths and weaknesses, we start by considering only the case of a homogeneous agent population in which any agent can choose from all possible subgoals. Hence, our subgoals are not identical to roles. When agents are distinguishable, it is possible to condition subgoal selection and agent policies on their respective feature vectors.

2 Problem formulation and desired outcome

We work with a homogeneous team of interchangeable agents, in the context of centralized training and decentralized execution. This is a realistic choice in the context of team sports, given that sports teams train with centralized instructions from coaches while all players act in a decentralized manner using their individual observations during real competitions. Agents should discover and master a set of distinct skills that are useful for accomplishing the global task (i.e. winning the game). These skills should correspond to intuitive subgoals such as moving into offensive position and blocking an opponent from scoring. Each agent must learn to choose subgoals dynamically given its current observation. For ease of exposition, we describe our formulation and method using a fully cooperative multi-agent setting, where the team plays against a fixed opponent; it is straightforward to train with self-play.

Real-world agents within a team often execute high-level strategies by dynamically coordinating each agent’s choice of subgoals. These subgoals can be attained when each agent learns a diverse set of skills and learns to choose the correct skill to execute given the state of the system. While it is plausible that training agents using the final game outcome (e.g. winning the game) can lead to the emergence of subgoal-directed behavior, we investigate the benefits of an explicit hierarchical multi-agent architecture involving automatic skill discovery without subgoal-specific rewards.

2.1 Notation

A cooperative Markov game is a tuple $\langle \mathcal{S}, \{\mathcal{O}^n\}_{n=1}^N, \{\mathcal{A}^n\}_{n=1}^N, P, R, \gamma \rangle$ with $N$ agents labeled by $n \in \{1,\dots,N\}$. At time $t$ and global state $s_t \in \mathcal{S}$, each agent $n$ receives an observation $o^n_t \in \mathcal{O}^n$ and chooses an action $a^n_t \in \mathcal{A}^n$. The environment moves to $s_{t+1}$ due to joint action $\mathbf{a}_t := (a^1_t,\dots,a^N_t)$, according to transition probability $P(s_{t+1} \mid s_t, \mathbf{a}_t)$. The team receives a global reward $R(s_t, \mathbf{a}_t)$ and the learning task is to find stochastic decentralized policies $\pi^n(a^n \mid o^n)$, conditioned only on local observations, to maximize $\mathbb{E}\big[\sum_{t} \gamma^t R(s_t, \mathbf{a}_t)\big]$, where $\gamma \in [0,1)$ and the joint policy factorizes as $\boldsymbol{\pi}(\mathbf{a} \mid s) = \prod_{n=1}^N \pi^n(a^n \mid o^n)$ due to decentralization. Let $a^{-n}$ denote all agents' actions, except that of agent $n$. Let boldface $\mathbf{a}$ denote the joint action.
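As a concrete illustration of this factorization, the following minimal sketch (function names are ours, not the paper's) computes a joint-action probability under decentralized policies and the discounted team return; a policy here is any callable mapping a local observation to a dict of action probabilities:

```python
def joint_policy_prob(individual_policies, observations, actions):
    """Probability of a joint action: the joint policy factorizes as the
    product of per-agent policies, each conditioned only on its own local
    observation (hypothetical sketch)."""
    p = 1.0
    for pi, o, a in zip(individual_policies, observations, actions):
        p *= pi(o)[a]
    return p

def discounted_return(team_rewards, gamma=0.99):
    """The team objective: discounted sum of the shared global reward."""
    return sum((gamma ** t) * r for t, r in enumerate(team_rewards))
```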

2.2 Multi-agent skill discovery

We hypothesize that a skill can be inferred from the trajectory generated by an agent who is executing the skill. For example, an agent executing a defensive skill may move toward opponents in a specific trajectory in state space, and its choice of actions to achieve the subgoal of defense is only weakly dependent on the behavior of physically distant agents on its own team. While some skills may only be identified from coordinated behavior (i.e. involving trajectories of two or more agents), most real-world team sports have distinguishing features that differentiate players based on an agent's trajectory alone jc: cite.

We want external control of the skill executed by the agent, to allow further analysis and interpretation of the learned skill. Hence, we start with a finite set of initial skill vectors $\{z^k\}_{k=1}^K$, upon which we condition the low-level action policies $\pi(a \mid o, z)$.

How do we ensure that policies conditioned on a vector $z$ are good for maximizing the global reward, without using subgoal-specific rewards? Phrased differently, how can the skill vectors attain meaning such that conditioning a policy on them will lead to behavior that is useful for winning the game?

What kind of trajectories are good for unsupervised learning of skills? If all agents are able to act optimally for subgoals, then every agent’s trajectory is informative for learning a latent representation of roles. However, at the start of multi-agent learning, agents’ individual trajectories may not be differentiated enough to allow learning of skills. This is especially problematic because we further wish to train a “coach” that assigns specific subgoals for each agent to accomplish.

2.3 Cooperative multi-agent subgoal selection

We investigate whether cooperative multi-agent learning at the higher level of skill selection can lead to higher team performance via the emergence of interpretable strategies. Analogous to the way that sports team coaches may train players individually to master certain skills, then train the full team to execute certain combinations of skills in parallel, we consider a hierarchical architecture involving MARL at the level of skill selection while relying on simple independent learning for low-level actions.

We train a high-level joint action-value function $Q_{\text{tot}}$ that factorizes into individual utilities $Q^n(o^n, z^n)$ using QMIX, to select among the discrete set of roles.

Note that the key difference from single-agent hierarchical RL is that skills are chosen simultaneously by all agents in the MARL setting, whereas a single option is chosen sequentially in time in methods such as FeUdal Networks and option-critic [3, 33].

At the low level, we train policies $\pi(a^n \mid o^n, z^n)$ where $z^n$ is a skill chosen by the high-level policy. For $T$ time steps, all agents sustain their role vectors. jc: It is bad to hardcode the number of steps to sustain a skill. Suppose a skill really requires more than $T$ steps to complete, and suppose that the high-level policy manages to learn to sustain that skill for multiple high-level steps. Even in this optimistic case, it seems hard for the low-level policy to "continue" a skill from where it left off during the previous high-level step. We can choose sensible values for $T$ by domain knowledge, e.g. average possession length between each turnover.

3 Proposed method

We aim to develop a synthesis of latent variable models and multi-agent hierarchical RL to learn both a representation of player roles and the optimal assignment of roles to agents on the team. To do so, we can draw inspiration from variational inference [16], trajectory embedding [6], and hierarchical reinforcement learning [29, 33, 3]. See Algorithm 1 for a sketch of the algorithm.

3.1 Multi-agent skill discovery

In contrast to the baseline methods, which require domain knowledge of reward functions associated with each role, this method is designed to learn role-specific policies without reward functions. We build on the general framework of variational option discovery [1], in which the key idea is to train a policy $\pi_\theta(a \mid o, z)$ (viewed as an encoder) that maps from a context vector $z \sim p(z)$ to a trajectory $\tau$, and a classifier $q_\phi(z \mid \tau)$ (viewed as a decoder) that aims to recover $z$ from $\tau$, by maximizing the following objective:

$$ \max_{\theta, \phi}\; \mathbb{E}_{z \sim p(z),\; \tau \sim \pi_\theta(\cdot \mid \cdot,\, z)} \big[ \log q_\phi(z \mid \tau) \big] + \beta\, \mathcal{H}(\pi_\theta) $$

The key intuition is that the policy is trained to encode different $z$ into modes of behavior that are sufficiently different such that the decoder can easily reconstruct the original $z$. Examples may be "move toward the opponent net" versus "move to defend one's own net". This approach is feasible even when there is insufficient domain knowledge to design rewards that give meaning to the set of $z$.
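A minimal sketch of the decoder side of this idea, assuming a hypothetical linear-softmax classifier over a fixed-length trajectory feature vector (the actual decoder architecture is unspecified here):

```python
import numpy as np

def decoder_log_prob(W, traj_features, z):
    """Log-probability log q_phi(z | tau) under a softmax decoder.
    W is a hypothetical (n_skills x feature_dim) linear decoder and
    traj_features summarizes one agent's trajectory; the true skill
    index z earns a high value only when trajectories under different
    skills look different to the classifier."""
    logits = W @ traj_features
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(log_probs[z])
```

The policy (encoder) is then rewarded with this log-probability for generating trajectories from which its own skill index is easy to recover.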

We bring this idea into a hierarchical MARL algorithm as follows. We start with a set of $K$ vectors $\{z^k\}_{k=1}^K$ that will acquire meaning as skill vectors during the course of training. This set defines a discrete action space for high-level policies that choose skills based on observations. Let $\boldsymbol{\pi}_h$ denote the joint high-level policy. Let $\pi_l(a \mid o, z)$ denote the low-level agent controller that takes actions in the game, conditioned on a role $z$.

We have the following bilevel optimization problem: the high-level policy maximizes the extrinsic team return, subject to the low-level controller being optimal for its own mixed reward:

$$ \max_{\boldsymbol{\pi}_h}\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t R(s_t, \mathbf{a}_t)\Big] \quad \text{s.t.} \quad \pi_l \in \operatorname*{arg\,max}_{\pi}\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t r^n_t\Big] $$

We train low-level agent controllers using single-agent RL using reward function $r^n_t$:

$$ r^n_t = \alpha\, R(s_t, \mathbf{a}_t) + (1 - \alpha)\, \log q_\phi(z^n \mid \tau^n) $$

where $\alpha \in [0,1]$ determines the amount of intrinsic versus extrinsic reward. jc: In my current implementation that manages to get non-zero win rate, I used IQL for the low-level policy, which means we don't have the entropy term because IQL is deterministic. Without the entropy term, the formal analogy with the VAE equation is lost, i.e. we retain the "reconstruction" term as the $\log q_\phi(z \mid \tau)$, but lose the "regularization" term.

Generating useful low-level trajectories Consider the case where $\alpha$ is gradually decreased from 1 to 0 according to some schedule. At the beginning of training, $\alpha = 1$ means that all low-level policies learn independently to achieve environment rewards by scoring a goal or getting possession of the ball. At this point, the decoder $q_\phi$ has no effect on the reward for $\pi_l$. As low-level policies improve and learn to win against the opponent, there will be trajectories that correspond to identifiable and useful behavior such as making shot attempts, intercepting passes, and defending the home net. Now as $1 - \alpha$ increases, the low-level policy is rewarded for generating more distinguishable modes of behavior when conditioned on different $z$, while still optimizing for environment reward. In practice, we decrease $\alpha$ only when the win rate in evaluation episodes exceeds some threshold, e.g. 50%. This means we initially rely on the environment reward to get the low-level policy to produce useful behavior, such that the decoder can be trained on meaningful data. As $\alpha$ decreases, the policy is increasingly rewarded for generating trajectories that are easy for the decoder to classify.
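The win-rate-gated schedule and the mixed reward described above can be sketched as follows (the threshold, step size, and function names are illustrative choices, not values from the paper):

```python
def update_alpha(alpha, eval_win_rate, threshold=0.5, step=0.05, alpha_min=0.0):
    """Anneal the extrinsic-reward weight alpha toward 0, but only after
    the evaluation win rate exceeds a threshold, so the decoder is always
    trained on behavior that is already useful for winning."""
    if eval_win_rate > threshold:
        alpha = max(alpha_min, alpha - step)
    return alpha

def low_level_reward(alpha, env_reward, log_q):
    """Mixed low-level reward: alpha * R_env + (1 - alpha) * log q(z | tau)."""
    return alpha * env_reward + (1.0 - alpha) * log_q
```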

Trajectory segmentation. When the goal is to discover "interesting" or distinguishable behavior, it is feasible to set a fixed trajectory length and use the full trajectory under the policy for training the decoder [1]. However, our final goal in the MARL setting is to learn skills that can be constructed into high-level strategies when concatenated in a temporal sequence, which means we must learn from segments of full episodic trajectories. The simple idea of segmenting an agent's full trajectory into fixed-length segments may produce segments that include both offense and defense behavior, such as when the agent team loses possession in the middle of the segment, resulting in difficulty for the decoder. We resolve this difficulty by using the time points where the high-level policy chooses a new set of skill assignments as a natural segmentation for creating low-level training data. Intuitively, this improves consistency of the hierarchy: $\boldsymbol{\pi}_h$ optimizes the choice of skills across longer time intervals, while $\pi_l$ learns to generate trajectory segments in between the time points and $q_\phi$ learns to associate them with skill vectors.
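The segmentation rule can be sketched as a small helper (hypothetical names; a trajectory is any list of per-step records, and decision steps are the times when the high-level policy re-selects skills):

```python
def segment_trajectory(trajectory, decision_steps):
    """Split one agent's episode trajectory at the time steps where the
    high-level policy re-selects skills; each segment becomes one
    (tau, z) training pair for the decoder. decision_steps must be
    sorted and begin at 0."""
    boundaries = list(decision_steps) + [len(trajectory)]
    return [trajectory[s:e] for s, e in zip(boundaries, boundaries[1:])]
```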

Dynamic multi-agent skill coordination via MARL

jc: What if the high-level policy stops assigning some $z$? Then the effective set of skills will be reduced. Consider using max entropy for the multi-agent joint policy? But perhaps some environments don't need too many skills.

We run multi-agent RL at the level of skills via $Q_{\text{tot}}(s, \mathbf{z}) = f_{\text{mix}}\big(Q^1(o^1, z^1), \dots, Q^N(o^N, z^N)\big)$, where $Q^n$ are individual utility functions trained via QMIX from a joint $Q_{\text{tot}}$ using the extrinsic reward. While the low-level policies are trained to generate useful behavior such that each mode of behavior is associated with a particular $z$, the high-level policy is trained to produce strategic team play by composing simultaneous skills in sequence at a high level of temporal abstraction.
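The monotonic mixing constraint at the heart of QMIX can be illustrated with a minimal sketch (a single linear mixing layer with non-negative weights; the actual QMIX mixer is a state-conditioned hypernetwork, so this is only a schematic):

```python
import numpy as np

def qmix_total(individual_qs, mixing_weights, mixing_bias=0.0):
    """Monotonic mixing of per-agent skill utilities Q^n(o^n, z^n) into
    Q_tot. QMIX guarantees dQ_tot/dQ^n >= 0 by constraining the mixing
    weights to be non-negative (here via abs()), so the argmax of Q_tot
    can be computed from per-agent argmaxes at decentralized execution."""
    w = np.abs(np.asarray(mixing_weights, dtype=float))
    return float(w @ np.asarray(individual_qs, dtype=float) + mixing_bias)
```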


jc: We need to explain the change from stochastic policy in the formulation to deterministic policy in implementation.

jc: We should also emphasize that any MARL algorithm with decentralized execution can be used in place of QMIX. We chose QMIX because it is a more expressive function class than VDN [sunehag2017value]. QTRAN is a more recent alternative [28].

Curriculum for increasing number of skills. jc: I do not currently use this curriculum because it creates implementation difficulties for the high-level Q function. It may be difficult to train a large number of skills all at once, because skill vectors have no prior meaning, policies do not get any useful information from $z$, and the decoder does not get meaningful pairs of $(\tau, z)$. Achiam et al. [1] addressed this challenge by slowly increasing the number of skills to be learned. Since we need to define the action space for the high-level policy, we can limit the decentralized skill selection to the first $k$ skills currently allowed. If we do not use this, we can just say that the high-level policy determines the distribution of $z$ for training the low-level encoder-decoder pair.

Using pretrained policies. jc: Not used currently. We can initialize using a pretrained policy acquired from either pretraining or expert data, using a function augmentation method such as in [35]. This means that we start with low-level policies that can already generate goal-directed trajectories that are conducive for segmenting into distinguishable sub-trajectories.

1:procedure Algorithm
2:      Initialize $Q^n$, $\pi_l$, $q_\phi$, high-level replay buffer $B_h$, low-level replay buffer $B_l$, and trajectory-skill dataset $D$
3:      for each episode do
4:             env.reset()
5:            Initialize low-level trajectory storage $\tau^n$ of max length $T$ for each agent
6:            for each step $t$ in episode do
7:                 if $t \bmod T = 0$ then
8:                       if $t > 0$ then
9:                             Compute SMDP reward $\sum_{i=0}^{T-1} \gamma^i R_{t-T+i}$
10:                             Store high-level transition into $B_h$
11:                             For each agent, store $(\tau^n, z^n)$ into $D$
12:                             Compute intrinsic reward for each agent using $(\tau^n, z^n)$ and decoder $q_\phi$
13:                       end if
14:                       For each agent, select new skill $z^n$ from $\epsilon$-greedy policy using $Q^n$
15:                       Every fixed number of high-level steps, update $Q^n$ using (6)
16:                 end if
17:                 For each agent, compute low-level action $a^n$ from $\epsilon$-greedy policy using $\pi_l$
18:                 Take step in environment = env.step()
19:                 Compute low-level reward $r^n$ for each agent
20:                 For all agents, store transitions into low-level replay buffer $B_l$, and append to trajectory $\tau^n$
21:                 Every fixed number of low-level steps, update $\pi_l$ using IQL
22:                 if size of $D$ reaches capacity then
23:                       Downsample and update decoder $q_\phi$ using all pairs of $(\tau, z)$ from $D$, then empty $D$
24:                 end if
25:            end for
26:      end for
27:end procedure
Algorithm 1 Hierarchical multi-agent role discovery

4 Related works

4.1 MARL

Early theoretical work analyzed MARL in discrete state and action spaces [31, 20, 13].

jc: Remove the less relevant ones. When a global objective is the sum of agents' individual objectives, value-decomposition methods optimize a centralized Q-function while preserving scalable decentralized execution [sunehag2017value, 26, 28]. In the special case when all agents have a single goal and share a global reward, COMA [9] uses a counterfactual baseline. However, the centralized critic in these methods does not evaluate the specific impact of an agent's action on another's success in the general multi-goal setting. While MADDPG [21] and M3DDPG [19] apply to agents with different rewards, they do not specifically address the need for cooperation; in fact, they do not distinguish the problems of cooperation and competition, despite the fundamental difference.

4.2 Hierarchical models and option discovery

Early work on hierarchical MARL includes [22, 11].

Wilson et al. [34] proposed a multi-agent role discovery method that learns the number of roles, one policy per role, and the assignment of roles to agents, based on a Dirichlet process. They take a Bayesian policy search approach instead of RL, and manually design conditional probability distributions for role assignment. In their setting, agents have distinct identities and features, which means they can consider similarity between agents and agents' previous performance in certain roles when deciding the assignment. The key difference from our setting is that our agents are interchangeable.

Self-consistent trajectory autoencoder


An early paper on subgoal discovery is [23]. We build on the general framework of variational option discovery algorithms [1], which encompasses variations such as [8]. While VALOR aims to learn only diverse trajectories from which a decoder can easily reconstruct context vectors, without regard for downstream tasks, our algorithm explicitly encourages the generated trajectories to be useful for maximizing a sparse environment reward. [27] uses the predictability of state transitions as intrinsic reward.

VAE [16].

Multi-agent imitation learning of coordination

[18]. They use an alternating optimization method to train both individual policies and a latent structure model. Because they have unstructured expert data in which indices are not attached to unique expert roles, the role assignment problem they face is different from ours. Their approach involves interleaving three steps: restructure the expert data using a latent variable model; train policies via imitation learning from the restructured data; rollout policies to generate trajectories for learning parameters of the latent variable model.

There may be much prior work on augmenting the reward function; an example is [15].

Hierarchical learning and options papers may contain useful techniques for us to adapt. The key idea in Feudal Networks (FuN) is to train a Manager to produce directional goal vectors using the environment reward, while training a Worker to take low-level actions via an intrinsic reward defined by cosine distance between directional goals and actual state differences [7, 33].

When plausible sub-goal candidates are known, [17] proposed to train a meta-controller to select goals for a low-level controller to reach sequentially in time. They used an intrinsic binary reward for goal attainment by the low-level controller.

Single-agent hierarchical RL has been extended to MARL [2]. In their architecture, a Manager receives environment rewards and gives goal vectors to many Workers. The hierarchy is a directed tree with the Manager at the root and Workers at each lower node. Worker rewards depend on the goals given by the Manager. Because, in their problem formulation, the Manager is associated with a specific agent who takes actions in the environment (e.g. the Manager is a "speaker" who sends messages to "listeners"), their method does not directly apply to the team sports case where all agents have equal status. Moreover, the training of Worker agents requires hand-designed reward functions, which are not straightforward to specify for player roles in team sports.

Mention [32] in passing. Their description of hierarchical versions of existing MARL algorithms is not clear.

4.3 Team sports

According to [5], each role in a formation is unique.

4.4 Competitive multi-agent training via self-play

[4] found that training against random old versions of opponents instead of against the most recent opponent improved learning stability. They also used dense rewards that annealed toward zero to encourage learning of basic motor functions in their 3D physical tasks.

5 Experimental setup

Environments. Half Field Offense (HFO) is another possible additional experimental domain [12]. It is built on the RoboCup 2D simulation platform with reduced complexity, containing only the core decision-making logic. The multi-agent setting allows usage of automated NPCs in addition to the player-controlled (i.e. learning) agents, up to 10 players per side. It supports both high-level and low-level state and action spaces.

5.1 Baselines

jc: While there are few, if any, hierarchical multi-agent papers out there, we may be expected to compare against some, e.g. treating the problem as centralized control and using single-agent hierarchical RL?

There are very few works on multi-agent role assignment. We compare win rate against baseline QMIX and a method with hand-designed subgoal rewards. While higher win rate is preferable, we focus more on investigating the behavior of the resulting policies. Flat QMIX and hierarchical QMIX with hand-designed subgoal rewards can be viewed as loose upper bounds on the win rate of our unsupervised method. We also compare the average number of timesteps required to score a goal.

We consider the following baselines to demonstrate the improvement due to our algorithm:

Baseline 1: QMIX. This is our baseline MARL algorithm without role assignment.

Baseline 2: multi-agent subgoal assignment with fixed reward functions. This is a simple extension of QMIX into a hierarchical architecture. The key idea is to achieve temporal abstraction at the level of multi-agent role assignment, rather than just temporal abstraction for any particular agent. Here we rely on role-specific reward functions given by domain knowledge.

  • Let $Z = I_K$ be the $K \times K$ identity matrix representing a set of $K$ predefined role vectors.

  • For each 1-hot row vector $z^k$ in $Z$, we have an associated domain-specific reward function $r^k$.

  • We train a high-level QMIX architecture to select role vectors via decentralized individual functions $Q^n(o^n, z)$.

  • Each selected role vector $z^k$ is given as input to a low-level policy $\pi_l(a \mid o, z^k)$, which is trained independently using the reward function $r^k$ with single-agent RL.

Role-specific reward functions are

  • Score goal: the agent is rewarded for scoring a goal.

  • Regain possession: the agent is rewarded for regaining possession of the ball from the opponent team.

  • Get into scoring position: agent stays near opponent net and there is no opponent in the line between agent and the teammate with ball.

  • Get into defensive position: agent stays near its own net and obstructs the line between opponent with ball and the net.

We choose a fixed number of steps $T$, after which the high-level role assigner chooses all roles again. The loss function for training the role assignment policy is

$$ \mathcal{L}(\theta) = \mathbb{E}\Big[ \big( y_t - Q_{\text{tot}}(s_t, \mathbf{z}_t; \theta) \big)^2 \Big], \qquad y_t = \sum_{i=0}^{T-1} \gamma^i R_{t+i} + \gamma^T \max_{\mathbf{z}} Q_{\text{tot}}(s_{t+T}, \mathbf{z}; \theta^-) $$

This is the semi-MDP version of 1-step Q-learning. In the single-agent setting, $Q$ converges to the optimal value function over options under standard assumptions of Q-learning [29].
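The SMDP target above can be sketched as a small helper (hypothetical names; `rewards` holds the global rewards accrued while the selected roles were sustained, and `max_q_next` is the greedy value at the next decision point):

```python
def smdp_target(rewards, gamma, max_q_next):
    """Semi-MDP 1-step Q-learning target for roles sustained over k
    low-level steps:
        y = sum_{i=0}^{k-1} gamma^i * R_{t+i}
            + gamma^k * max_z' Q_tot(s_{t+k}, z')."""
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** k) * max_q_next
```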

Baseline 3. Consider direct extension of another hierarchical RL method to the multi-agent setting. Consider implementing and comparing to Feudal Multiagent Hierarchies [2].

5.2 Evaluation of multi-agent skill discovery

Key things to show: 1. we really discovered interpretable skills, and 2. the skills are used to create high-level strategies

jc: How to visualize the learned skills? Unlike single-agent skill discovery in Mujoco, we cannot simply plot the trajectory of an agent in the absence of other agents.

jc: To show high-level strategy, we must show that 1. high-level policy assigns complementary skills across agents, and 2. high-level policy chooses good sequence of skills across time.

Quantifying role-specific behavior. Suppose the agents have learned to associate each $z$ with a role-specific behavior. How do we find out which role corresponds to each $z$? One way to do this is to define a set of features and aggregate the measurements across all agents whenever $z$ was assigned to the agent. Assuming a well-chosen set of features, if each $z$ corresponds to distinct behavior, then the measurements will differ. Based on commonly used measurements in professional sports [10, 24, 25], we can measure the following quantities conditioned on each role vector $z$:

  • Goals: agent scored a goal

  • Offensive rebound: agent’s team made a shot attempt, which missed, and the agent got possession of the ball

  • Shot attempts: agent attempted to score a goal

  • Made pass: agent made a successful pass

  • Received pass: agent received a pass from a teammate

  • Steals: agent retrieved possession from opponent by direct physical contact

  • Blocked shots and defensive rebounds: opponent with ball made a failed shot attempt and agent got possession

  • Blocked pass: opponent with ball made a failed pass attempt and agent got possession.

For each skill, plot the distribution of events that occurred under the skill. If skills lead to different behavior, then the counts for each event will differ among skills.
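The event tally described above can be sketched as follows (the log format, a list of (skill_id, event_name) records, is a hypothetical choice for illustration):

```python
from collections import Counter, defaultdict

def tally_events_per_skill(event_log):
    """Accumulate behavioral-event counts conditioned on the active
    skill. One (skill_id, event_name) record is appended whenever an
    agent satisfies an event condition (goal, steal, made pass, ...);
    the per-skill distributions can then be plotted and compared."""
    counts = defaultdict(Counter)
    for skill, event in event_log:
        counts[skill][event] += 1
    return counts
```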

Distribution of skills. Check whether all available $z$ are used by the high-level policy. Also plot the distribution separately for the two cases when the agent team has possession and when the opponent has possession.

Distribution of raw actions for each skill. For each skill, plot the distribution of raw actions taken by low-level policy when conditioned on that skill.

Using a single skill for the whole game. What if we force all agents to use the same skill and never change the $z$ during the whole episode?

Plots of (x,y) trajectories. For each skill that mostly determines physical movement, record and plot some representative trajectories in (x,y) space. Plot each skill in a different color.

How well do agents help one another. Suppose we fix only one agent to use a single skill for the whole episode, e.g. force it to be a defensive player. We can investigate whether the other teammates change their behavior to play a complementary role to maintain team success, e.g. they choose the offensive skill.

Decoder. Show that the decoder converges?

6 Results

(Figure: (a) Win rate of QMIX and MARA against stock AI in STS2; (b) behavioral measurements of MARA policy for offense and defense roles.)

Main comparisons

Quantifying role-specific behavior. We use the following procedure to confirm whether different role vectors correspond to distinguishable behavior: for each role vector, we accumulate the count of each quantity whenever any agent with the role satisfies the condition for that quantity, and average over 100 test episodes.

Skill evaluation.


7 Discussion and future work

When skill-specific reward functions are known and easy to design, such as in our baseline method, can we insert this prior knowledge into the architecture?

How to optimize the number of subgoals.

When agents have distinct identities and differ according to a set of features (e.g. agent A "runs faster" than agent B), similar agents should be assigned to the same role. It is crucial to have agent features, because agents' locations in the shared environment are not enough to differentiate them: e.g. two basketball players can perform different roles, such as a "larger" player performing a screen for a more "agile" player to find an open spot to shoot.

How to learn asynchronous termination of subgoals. Subgoal assignment can be asynchronous across agents. For example, one agent may always play a defensive role, while another agent on the other end of the field may change roles from offense to midfield defender when the opponent team makes an interception.


  • [1] J. Achiam, H. Edwards, D. Amodei, and P. Abbeel (2018) Variational option discovery algorithms. arXiv preprint arXiv:1807.10299. Cited by: §3.1, §3.1, §3.1, §4.2.
  • [2] S. Ahilan and P. Dayan (2019) Feudal multi-agent hierarchies for cooperative reinforcement learning. arXiv preprint arXiv:1901.08492. Cited by: §C.2, §4.2, §5.1.
  • [3] P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.3, §3.
  • [4] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch (2018) Emergent complexity via multi-agent competition. In International Conference on Learning Representations, Cited by: §4.4.
  • [5] A. Bialkowski, P. Lucey, P. Carr, Y. Yue, S. Sridharan, and I. Matthews (2014) Identifying team style in soccer using formations learned from spatiotemporal tracking data. In 2014 IEEE International Conference on Data Mining Workshop, pp. 9–14. Cited by: §4.3.
  • [6] J. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine (2018) Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In International Conference on Machine Learning, pp. 1008–1017. Cited by: §3, §4.2.
  • [7] P. Dayan and G. E. Hinton (1993) Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278. Cited by: §4.2.
  • [8] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2019) Diversity is all you need: learning skills without a reward function. In International Conference on Learning Representations, Cited by: §4.2.
  • [9] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.1.
  • [10] A. Franks, A. Miller, L. Bornn, K. Goldsberry, et al. (2015) Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics 9 (1), pp. 94–121. Cited by: §5.2.
  • [11] M. Ghavamzadeh, S. Mahadevan, and R. Makar (2006) Hierarchical multi-agent reinforcement learning. Autonomous Agents and Multi-Agent Systems 13 (2), pp. 197–229. Cited by: §4.2.
  • [12] M. Hausknecht, P. Mupparaju, S. Subramanian, S. Kalyanakrishnan, and P. Stone (2016) Half field offense: an environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop, Cited by: §5.
  • [13] J. Hu and M. P. Wellman (2003) Nash Q-learning for general-sum stochastic games. Journal of machine learning research 4 (Nov), pp. 1039–1069. Cited by: §4.1.
  • [14] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. (2019) Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science 364 (6443), pp. 859–865. Cited by: §C.1.
  • [15] N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. Ortega, D. Strouse, J. Z. Leibo, and N. De Freitas (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pp. 3040–3049. Cited by: §4.2.
  • [16] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In International Conference on Learning Representations, Cited by: §3, §4.2.
  • [17] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683. Cited by: §4.2.
  • [18] H. M. Le, Y. Yue, P. Carr, and P. Lucey (2017) Coordinated multi-agent imitation learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1995–2003. Cited by: §4.2.
  • [19] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell (2019) Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §4.1.
  • [20] M. L. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163. Cited by: §4.1.
  • [21] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6382–6393. Cited by: §4.1.
  • [22] R. Makar, S. Mahadevan, and M. Ghavamzadeh (2001) Hierarchical multi-agent reinforcement learning. In Proceedings of the fifth international conference on Autonomous agents, pp. 246–253. Cited by: §4.2.
  • [23] A. McGovern and A. G. Barto (2001) Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 361–368. Cited by: §4.2.
  • [24] NHL (2019) NHL statistics glossary. Note: http://www.nhl.com/stats/glossary, online; accessed 25-June-2019. Cited by: §5.2.
  • [25] M. Oh, S. Keshri, and G. Iyengar (2015) Graphical model for basketball match simulation. In Proceedings of the 2015 MIT Sloan Sports Analytics Conference, Boston, MA, USA, Vol. 2728. Cited by: §5.2.
  • [26] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4295–4304. Cited by: §4.1.
  • [27] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman (2019) Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657. Cited by: §4.2.
  • [28] K. Son, D. Kim, W. J. Kang, D. Hostallero, and Y. Yi (2019) QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, Cited by: §3.1, §4.1.
  • [29] M. Stolle and D. Precup (2002) Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pp. 212–223. Cited by: §3, §5.1.
  • [30] R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §1.
  • [31] M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §4.1.
  • [32] H. Tang, J. Hao, T. Lv, Y. Chen, Z. Zhang, H. Jia, C. Ren, Y. Zheng, C. Fan, and L. Wang (2018) Hierarchical deep multiagent reinforcement learning. arXiv preprint arXiv:1809.09332. Cited by: §4.2.
  • [33] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3540–3549. Cited by: §2.3, §3, §4.2.
  • [34] A. Wilson, A. Fern, and P. Tadepalli (2010) Bayesian policy search for multi-agent role discovery. In Twenty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §4.2.
  • [35] J. Yang, A. Nakhaei, D. Isele, H. Zha, and K. Fujimura (2018) CM3: cooperative multi-goal multi-stage multi-agent reinforcement learning. arXiv preprint arXiv:1809.05188. Cited by: §3.1.

Appendix A Experimental details

a.1 Sts2

State. We define a state representation that is invariant under a 180-degree rotation of the playing field and a switch of team perspective. Taking the Home team as an example, the state vector has the following components (dimension 34 for 3v3):

  • Puck carrier: position of puck carrier relative to goal, normalized to , and absolute velocity of puck carrier

  • 1-hot vector indicating which player has puck; all entries are zero if team does not have possession

  • For each team player: position normalized to and absolute velocity

  • 1-hot vector indicating which opponent player has puck; all entries are zero if opponent team does not have possession

  • For each opponent player: position normalized to and absolute velocity
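As a sanity check on the stated dimension, the 3v3 state vector above can be assembled as follows (a minimal sketch; the ordering of components within the vector is an assumption):

```python
import numpy as np

N_PLAYERS = 3  # players per team (3v3)

def one_hot(idx, n=N_PLAYERS):
    """1-hot possession vector; all zeros if idx is None (no possession)."""
    v = np.zeros(n)
    if idx is not None:
        v[idx] = 1.0
    return v

def build_state(puck_rel_pos, puck_vel, team_possessor, team_pos, team_vel,
                opp_possessor, opp_pos, opp_vel):
    """Assemble the Home-team state vector from the components listed above.

    Positions and velocities are 2-D (x, y); possessor arguments are a
    player index or None.
    """
    parts = [puck_rel_pos, puck_vel]           # 2 + 2
    parts.append(one_hot(team_possessor))      # 3
    for p, v in zip(team_pos, team_vel):       # 3 x (2 + 2)
        parts += [p, v]
    parts.append(one_hot(opp_possessor))       # 3
    for p, v in zip(opp_pos, opp_vel):         # 3 x (2 + 2)
        parts += [p, v]
    return np.concatenate(parts)

z = np.zeros(2)
state = build_state(z, z, 0, [z] * 3, [z] * 3, None, [z] * 3, [z] * 3)
assert state.shape == (34,)  # 4 + 3 + 12 + 3 + 12 = 34
```

The component sizes sum to 4 + 3 + 12 + 3 + 12 = 34, consistent with the stated 3v3 dimension; the analogous count for the egocentric observation below (4 + 1 + 1 + 4 + 8 + 1 + 12) gives the stated 31.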

Observation. Each agent on the home team has its own egocentric observation vector with the following components (dimension 31 for 3v3):

  • Relative position of puck carrier normalized by dimensions of the field, and relative velocity

  • Binary indicator of whether the agent is the puck carrier

  • Binary indicator of whether agent’s team has possession of puck

  • Agent’s position normalized to and absolute velocity

  • For each teammate: relative position normalized by dimensions of the field, and relative velocity

  • Binary indicator of whether the opponent team has possession of the puck

  • For each opponent agent: relative position normalized by dimensions of field, and relative velocity

Action. At each time step, each agent chooses from the following discrete action set: do-nothing, shoot, pass-1, … , pass-N, down, up, right, left. If an agent that is not the puck carrier attempts to shoot or pass, or if the puck carrier chooses to pass to itself, the action is replaced by do-nothing.
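The do-nothing fallback can be implemented as a mask applied to the chosen action before execution; a sketch, assuming pass-i targets teammate i (1-indexed):

```python
def make_actions(n_players):
    """Discrete action set; pass-i targets teammate i (1-indexed)."""
    return (["do-nothing", "shoot"]
            + [f"pass-{i}" for i in range(1, n_players + 1)]
            + ["down", "up", "right", "left"])

def resolve_action(action, agent_idx, has_puck):
    """Replace invalid choices with do-nothing, per the rules above."""
    if not has_puck and (action == "shoot" or action.startswith("pass-")):
        return "do-nothing"   # only the puck carrier may shoot or pass
    if has_puck and action == f"pass-{agent_idx + 1}":
        return "do-nothing"   # no passing to oneself
    return action

assert resolve_action("shoot", 0, has_puck=False) == "do-nothing"
assert resolve_action("pass-1", 0, has_puck=True) == "do-nothing"
assert resolve_action("pass-2", 0, has_puck=True) == "pass-2"
```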

Reward. The team receives a reward when it scores, a penalty when the opponent scores, a reward on the single step when it gains possession of the puck from the opponent, and a penalty at the step when it loses possession of the puck to the opponent. We also consider a small dense per-step bonus or penalty for having or not having possession of the puck, respectively.
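A sketch of this team reward, with hypothetical magnitudes standing in for the values above (the actual numbers are not given here, so these are placeholders):

```python
# Hypothetical magnitudes -- placeholders, not the values used in the paper.
R_SCORE, R_CONCEDE = 1.0, -1.0   # sparse scoring terms
R_GAIN, R_LOSE = 0.1, -0.1       # single-step possession-change terms
R_POSSESS = 0.001                # dense per-step possession term

def team_reward(scored, conceded, gained, lost, has_possession):
    r = 0.0
    if scored:
        r += R_SCORE
    if conceded:
        r += R_CONCEDE
    if gained:    # the step on which possession is won from the opponent
        r += R_GAIN
    if lost:      # the step on which possession is lost to the opponent
        r += R_LOSE
    r += R_POSSESS if has_possession else -R_POSSESS
    return r
```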

a.1.1 Simple hierarchical MARL

The following are the manually designed role-specific rewards.

  • Agent in role 0 gets +1 reward if it had control of the puck at the previous state and scored a goal at the current state

  • Agent in role 1 gets +1 reward if its team did not have possession of the puck at the previous state and it gained possession from the opponent team at the current state

  • Agent in role 2 receives a bounded reward value according to the following rules. The reward is non-zero only if the agent’s team has possession of the puck and this particular agent does not. Let i denote the agent in question and j the teammate with possession, where i ≠ j. Let d be the normalized distance between agent i and the opponent net. Let o be the smallest orthogonal distance between any opponent and the line from agent i to teammate j, scaled such that o = 1 if no opponent’s projection lies on the line and o = 0 if any opponent is on the line. The reward is then a function of d and o.

  • Agent in role 3 receives a bounded reward value according to the following rules. The reward is non-zero only if the opponent team has possession. Let d denote the normalized distance between this agent and its own net. Let o denote the orthogonal distance from this agent’s position to the line between the opponent puck carrier and the net that the opponent is attacking, scaled such that o equals 1 if the orthogonal distance is zero. The reward is then a function of d and o.
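The role-2 reward above depends on orthogonal distances from opponents to the pass lane; a minimal numpy sketch of that geometric piece (the combination into a scalar reward is not reproduced here):

```python
import numpy as np

def lane_openness(agent, carrier, opponents):
    """Smallest orthogonal distance from any opponent to the segment between
    `agent` and the teammate `carrier` holding the puck.  Opponents whose
    projection falls outside the segment are ignored; returns inf if no
    opponent projects onto the lane."""
    a = np.asarray(agent, float)
    c = np.asarray(carrier, float)
    d = c - a
    length_sq = float(d @ d)
    best = np.inf
    for o in np.asarray(opponents, float):
        t = float((o - a) @ d) / length_sq  # projection parameter along lane
        if 0.0 <= t <= 1.0:                 # projection lies on the segment
            best = min(best, float(np.linalg.norm(o - (a + t * d))))
    return best

# Opponent at (1, 1) projects onto the midpoint of the lane (0,0)-(2,0):
assert np.isclose(lane_openness([0, 0], [2, 0], [[1, 1]]), 1.0)
```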

a.1.2 Fractional reward weights

We conduct an experiment to see if different fractional reward parameters for different roles produce significant differences in performance. As defined in Section C.1, the role-specific reward functions are defined by a set of conditions with one real-valued parameter for each condition.

For the proof-of-concept experiment, Experiment 1 uses the parameter matrix in Table 1 and Experiment 2 uses the matrix in Table 2.

Role     Goal   OR    Off. side   Def. side   Midfield   Steal   DR    Intercept
Role 1   1.0    0.1   0.002       0           0          0       0     0
Role 2   0      0     0           0.002       0          1.0     1.0   1.0
Table 1: Experiment 1

Role     Goal   OR    Off. side   Def. side   Midfield   Steal   DR    Intercept
Role 1   1.0    0.1   0.002       0           0          0.5     0     0.5
Role 2   0.5    0     0           0.002       0          1.0     1.0   1.0
Table 2: Experiment 2
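Each row of these matrices weights a binary indicator vector of the events that occurred at the current step, so the role-specific reward is a dot product. A sketch using the Table 1 values:

```python
import numpy as np

EVENTS = ["Goal", "OR", "Off. side", "Def. side",
          "Midfield", "Steal", "DR", "Intercept"]

# Parameter matrix from Table 1 (Experiment 1): one row per role.
W = np.array([
    [1.0, 0.1, 0.002, 0.0,   0.0, 0.0, 0.0, 0.0],  # Role 1
    [0.0, 0.0, 0.0,   0.002, 0.0, 1.0, 1.0, 1.0],  # Role 2
])

def role_reward(role, occurred):
    """Dot product of the role's weight row with the event indicator."""
    e = np.array([1.0 if ev in occurred else 0.0 for ev in EVENTS])
    return float(W[role] @ e)

assert np.isclose(role_reward(0, {"Goal"}), 1.0)
assert np.isclose(role_reward(1, {"Steal", "Def. side"}), 1.002)
```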

a.1.3 Quantifying role-specific behavior

a.2 Multi-agent skill discovery

Algorithm design choices

  • Number of skills: fewer skills may be easier to learn, while more skills may be required to produce complex behavior for scoring.

  • Curriculum for gradually increasing the number of skills: does it make sense to do this given that high-level training is off-policy from a replay buffer?

  • Using state differences instead of raw states as the trajectory input to the decoder. This may help prevent the agent from doing nothing, since the decoder could not discriminate between the resulting constant trajectories, which would penalize the policy.

  • Condition for decreasing

  • Dense reward for possession: this may be a reason for getting stuck in the local optimum of holding possession

  • Penalty for draw to prevent local optimum of holding possession

Also consider using two phases of training: in the first phase, train only the low-level policy and decoder using the environment reward and intrinsic reward, without involving the high-level policy; in the second phase, train only the high-level policy and do not update the low-level policy.
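The two-phase schedule amounts to gating which components receive updates in each phase; a structural sketch with placeholder update functions:

```python
def train(num_episodes, phase1_episodes, run_episode,
          update_low, update_decoder, update_high):
    """Phase 1: train only the low-level policy and decoder (environment
    plus intrinsic reward).  Phase 2: train only the high-level policy,
    with the low level frozen."""
    for ep in range(num_episodes):
        batch = run_episode()
        if ep < phase1_episodes:
            update_low(batch)      # env + intrinsic reward
            update_decoder(batch)  # skill-decodability objective
        else:
            update_high(batch)     # extrinsic team reward only

calls = []
train(4, 2, lambda: None,
      lambda b: calls.append("low"),
      lambda b: calls.append("dec"),
      lambda b: calls.append("high"))
assert calls == ["low", "dec", "low", "dec", "high", "high"]
```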

Appendix B Schedule

  • November 15 - AAMAS deadline

  • November 15 - ICLR rebuttal ends

  • September 25 - ICLR deadline

  • August 16 - Last day

  • August 9 - Finish experiments, begin paper polishing

  • August 6 - Present poster

  • August 2 - Finish setting up Half Field Offense.

  • July 26 - Finish policy evaluation on STS2

  • July 19 - Set up Google Football environment jc: was not feasible; instead, we got positive results with new algorithm on STS2

  • July 12 - Preliminary results of proposed method on STS2

  • July 5 - Baseline 2 on STS2

  • June 28 - Experiments to test feasibility of proposed method

  • June 21 - Method proposal, second round

  • June 14 - Method proposal and baseline experiment

  • June 7 - Problem formulation

  • May 31 - Project direction

Appendix C Old material

1:procedure Algorithm
2:       jc: Needs to be updated because now I use IQL as the low-level policy
3:      Initialize , , , high-level replay buffer , and dataset of trajectories and subgoals
4:      for each episode do
5:             env.reset()
6:            Initialize low-level trajectory buffers of max length for each agent
7:            for each step in episode do
8:                 if  then
9:                       if  then
10:                             Compute SMDP reward
11:                             Store high-level transition into replay buffer
12:                             For each agent, store into
13:                       end if
14:                       For all agents, select new
15:                       If , update using (6)
16:                 end if
17:                 For all agents, compute low-level action
18:                 Take step in environment = env.step()
19:                 For all agents, append into , respectively
20:                 if size of  then
21:                       Use (4) as the reward to update with each pair of in
22:                       Downsample and update the decoder with all pairs of
23:                       Reset
24:                 end if
25:            end for
26:      end for
27:end procedure
Algorithm 2 Hierarchical multi-agent role discovery

c.1 Method 2: Learning reward parameters

jc: Problems with this method are: it is computationally expensive to run the outer-loop optimization; perhaps the space of reward parameters is not very interesting; most of the manual work is in designing the set of notable events, while assigning values to each event is not difficult.

In the absence of expert knowledge about the team sport, we may only know a set of notable events that occur in the game (e.g., a scored goal, an interception), without knowing what reward values to associate with each event. This second approach uses exactly that minimal domain knowledge: instead of fully defining the reward functions, we only need to identify a set of notable events, such as scoring, intercepting a pass, and reaching an open position far from opponents. Merely identifying such events is much easier than the ill-defined problem of manually specifying numerical values for terms in a complicated reward function. This is similar to the notion of “game events”, as in Quake III Arena [14]. Let there be E events and N roles. We associate all subgoals with an N x E matrix W, where row W_g is a vector of parameters that defines the reward function for subgoal g; the reward for attaining event e by an agent assigned to subgoal g is the entry W_{g,e}. We can optimize the reward weights by an outer loop of population-based training [14], where a population of weight matrices is maintained and underperforming ones are substituted by perturbed versions of better ones. Alternatively, we can treat the optimization of W as an RL problem in a continuous action space of dimension NE. Given a reward weighting matrix, the optimization of the high- and low-level policies follows the procedure in Method 1: the high-level subgoal assignment policy learns to choose a subgoal for each agent to maximize the environment reward, and the low-level controller learns to take actions that maximize the reward for its assigned subgoal.
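The outer-loop population-based training of the weight matrix can be sketched as follows (population size, perturbation scale, and truncation fraction are assumptions; scoring a candidate would in practice require training and evaluating a team under that matrix):

```python
import numpy as np

def pbt_step(population, scores, rng, frac=0.25, noise=0.05):
    """One exploit/explore step: replace the bottom `frac` of weight
    matrices with perturbed copies of the top `frac`."""
    order = np.argsort(scores)             # ascending: worst first
    k = max(1, int(frac * len(population)))
    for worst, best in zip(order[:k], order[-k:]):
        population[worst] = (population[best]
                             + noise * rng.standard_normal(population[best].shape))
    return population

rng = np.random.default_rng(0)
population = [rng.random((2, 8)) for _ in range(8)]  # candidate weight matrices
scores = [float(w.sum()) for w in population]        # stand-in for evaluating W
population = pbt_step(population, scores, rng)
assert len(population) == 8
```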

Notable events are

  1. Scoring

  2. Getting an offensive rebound

  3. Staying on offensive side of field

  4. Staying on defensive side of field

  5. Staying around midfield

  6. Getting ball from opponent by direct physical contact

  7. Getting ball after opponent attempted a shot

  8. Getting ball after opponent attempted a pass

c.2 Method 4

jc: This method does not make sense, ignore it.

  1. Train a multi-agent exploration policy that produces trajectories that are diverse across agents. jc: Do we really need a separate policy or can we just use the agent policies? Can we modify the training objective of the policy so that it is encouraged to produce trajectories that are “useful” for inference? Define “useful”.

  2. Train a latent variable model, involving a trajectory encoder that encodes trajectories into a latent space , which is interpreted as the space of roles.

  3. Cluster the set of latent vectors into a small number of discrete clusters , each interpreted as a role.

  4. Train a role assignment policy that chooses a cluster center given the state and agent ’s observation .

  5. Train the low level policy that conditions on the role vector to act in the environment.

A potential downside of this approach is that it requires three separate objective functions (for the trajectory encoder, the role assignment policy, and the low-level policy) and is not end-to-end differentiable.
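Step 3 of the pipeline above is a standard clustering of latent trajectory embeddings; a minimal k-means sketch (with a deterministic initialization for the demo):

```python
import numpy as np

def kmeans(Z, k, iters=20):
    """Cluster latent trajectory embeddings Z (n, d) into k role centers."""
    Z = np.asarray(Z, float)
    # Deterministic init for the demo: spread centers across the dataset.
    centers = Z[np.linspace(0, len(Z) - 1, k).astype(int)].copy()
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        # Assign each embedding to its nearest role center.
        dists = np.linalg.norm(Z[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned embeddings.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs of embeddings recover two distinct roles:
Z = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
centers, labels = kmeans(Z, k=2)
assert labels[:5].tolist() == [0] * 5 and labels[5:].tolist() == [1] * 5
```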

Difference from Feudal Multi-agent Hierarchies [2] jc: to do.

Role discovery. The initial idea is to cluster the set of trajectories. However, the meaning of the cluster centers may change during training.

Diverse trajectory generation for role discovery. Again, we could use the VAE performance as a reward, but this becomes reward engineering. We need an extrinsic notion of similarity between trajectories generated under a fixed latent, and of difference between trajectories generated under different latents: maximize the entropy of the trajectory distribution across agents, minimize the entropy of the trajectory distribution of a single agent, and consider each trajectory as one sample.

Dynamic role assignment. Either a single function assigns roles to all agents, or each agent has its own role selection function. What are the differences?