Learning Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning

by Dong-Ki Kim, et al.

Heterogeneous knowledge naturally arises among different agents in cooperative multiagent reinforcement learning. As such, learning can be greatly improved if agents can effectively pass their knowledge on to other agents. Existing work has demonstrated that peer-to-peer knowledge transfer, a process referred to as action advising, improves team-wide learning. In contrast to previous frameworks that advise at the level of primitive actions, we aim to learn high-level teaching policies that decide when and what high-level action (e.g., sub-goal) to advise a teammate. We introduce a new learning-to-teach framework called hierarchical multiagent teaching (HMAT). The proposed framework solves difficulties faced by prior work on multiagent teaching when operating in domains with long horizons, delayed rewards, and continuous states/actions by leveraging temporal abstraction and deep function approximation. Our empirical evaluations show that HMAT accelerates team-wide learning progress in difficult environments that are more complex than those explored in previous work. HMAT also learns teaching policies that can be transferred to different teammates/tasks and can even teach teammates with heterogeneous action spaces.





1 Introduction

In cooperative multiagent reinforcement learning (MARL), agents commonly develop their own knowledge of a domain since they have unique experiences. The history of human social groups provides evidence that the collective intelligence of multiagent populations may be greatly boosted if agents share their learned behaviors with others rogers2010diffusion . With this motivation in mind, we explore new methodologies for allowing multiagent populations of deep RL agents to effectively share their knowledge and learn from other agents while maximizing collective reward.

Recently proposed frameworks allow for various types of knowledge transfer between agents Taylor:2009:TLR:1577069.1755839 . In this paper, we focus on transfer based on action advising, in which an experienced “teacher” agent helps a less experienced “student” agent by suggesting which action to take next. Action advising allows a student to directly execute suggested actions without incurring much computational overhead. Recent work on action advising includes the Learning to Coordinate and Teach Reinforcement (LeCTR) framework omidshafiei18teach , in which agents learn when and what actions to advise. While LeCTR improves upon prior advising methods based on heuristics amir2016interactive ; clouse1996integrating ; SilvaGC17 ; torrey2013teaching , it faces limitations in scaling to more complicated tasks with high-dimensional state-action spaces, long time horizons, and delayed rewards. The key difficulty is teacher credit assignment omidshafiei18teach : learning teacher policies requires estimates of the impact of each piece of advice on the student agent’s learning progress, but these estimates are difficult to obtain. A simple function approximation, such as tile coding, has been used to simplify the learning of teaching policies in LeCTR, but that approach does not scale well.

This paper proposes a new learning-to-teach framework, hierarchical multiagent teaching (HMAT). Specifically, scalability is improved by representing student policies using nonlinear function approximators (e.g., deep neural networks (DNNs)) and hierarchical reinforcement learning (HRL) sutton1999 ; Kulkarni16hrl ; nachum18hrl , which allows advising with temporally extended sequences of primitive actions. Credit assignment remains a significant issue since DNNs use mini-batches to stabilize learning Goodfellow-et-al-2016 , and mini-batches can be randomly selected from a replay memory mnih15dqn . Hence, the student’s learning progress is affected by a batch of advice suggested at varying times, and identifying the extent to which each piece of advice contributes to the student’s learning is very challenging. Additional challenges include handling large state-action spaces, long time horizons, and delayed rewards.

The main contribution of this work is a method to resolve the teacher credit assignment problem with deep student policies: a sequence of advice (i.e., a sub-goal plan), instead of a single piece of advice, is used to update the student’s policy, and a temporary student policy is used to estimate the teacher’s contribution to the student’s learning progress. The second contribution is learning (via HRL) high-level teacher policies that advise students to take high-level actions (e.g., sub-goals). Empirical evaluations demonstrate multiple advantages of HMAT. First, HMAT accelerates team-wide learning progress for complex tasks compared to previous techniques such as LeCTR. This benefit results from the improved credit assignment determination that helps the teacher learning process; the representation of student policies using DNNs, which can handle large state-action spaces; and the high-level advising, which addresses long horizons and delayed rewards. Agents can also learn high-level teaching policies that are transferable to different types of agents and/or tasks. Finally, agents can have different dynamics/action spaces, because high-level teaching enables knowledge transfer that is agnostic to these details.

2 Background

We consider a cooperative MARL setting in which agents jointly interact in the environment and then receive feedback via local observations and a shared team reward. This setting can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP), defined as a tuple $\langle \mathcal{I}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \Omega, \mathcal{O}, R, \gamma \rangle$ OliehoekAmato16book : $\mathcal{I}$ is the set of agents, $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of joint actions, $\mathcal{T}$ is the transition probability function, $\Omega$ is the set of joint observations, $\mathcal{O}$ is the observation probability function, $R$ is the reward function, and $\gamma \in [0,1)$ is the discount factor. At each timestep $t$, each agent $i$ executes an action $a^i_t$ according to its policy $\pi^i(a^i_t \mid h^i_t; \theta^i)$ parameterized by $\theta^i$, where $h^i_t$ is agent $i$'s observation history at timestep $t$. The joint action $a_t$ yields a transition from the current state $s_t$ to the next state $s_{t+1}$ with probability $\mathcal{T}(s_{t+1} \mid s_t, a_t)$. A joint observation $o_{t+1}$ is then obtained, and the team receives a shared reward $r_t = R(s_t, a_t)$. The agents’ objective is to maximize the expected cumulative reward $\mathbb{E}[\sum_t \gamma^t r_t]$. The policy parameter $\theta$ will often be omitted, and reactive policies are considered (e.g., $\pi^i(a^i_t \mid o^i_t)$).
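The objective above can be anchored with a short sketch; the helper below is hypothetical (not from the paper's code) and simply computes the discounted cumulative team reward that the agents maximize:

```python
def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted team reward: sum_t gamma^t * r_t.

    Iterating the rewards in reverse lets us accumulate the return
    with a single multiply-add per step (Horner's rule).
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0], gamma=0.5)` evaluates to `1.0 + 0.5 * 1.0 = 1.5`.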

2.1 Learning to Teach in Cooperative MARL

We review key concepts and notations in the learning-to-teach framework.

Task-Level Learning Problem   We consider a cooperative MARL setting with two agents, $i$ and $j$, in a shared environment. At each learning iteration, the agents interact in the environment, collect experiences, and update their policies, $\pi^i$ and $\pi^j$, with learning algorithms $\mathbb{L}^i$ and $\mathbb{L}^j$. The resulting policies aim to coordinate and optimize final task performance. This problem of learning task-related policies is referred to as the task-level learning problem $\mathcal{P}_{\text{task}}$.

Advice-Level Learning Problem   Throughout task-level learning, even without assuming that any agent is an expert, agents may still develop unique skills from their experiences. As such, it is potentially beneficial for agents to advise one another using their specialized knowledge to improve final performance and accelerate team-wide learning. The problem of learning teacher policies that decide when and what to advise is referred to as the advice-level learning problem $\mathcal{P}_{\text{advice}}$, where the tilde notation (e.g., $\tilde{\pi}$) refers to a teacher policy property.

Learning task-level policies and advice-level policies are both RL problems, interleaved within the learning-to-teach framework. However, there are important differences between $\mathcal{P}_{\text{task}}$ and $\mathcal{P}_{\text{advice}}$. One difference lies in the definition of learning episodes, as rewards about the success of advice are naturally delayed relative to typical task-level rewards. For $\mathcal{P}_{\text{task}}$, an episode terminates either when agents arrive at a terminal state or when the timestep exceeds a pre-specified horizon. In contrast, for $\mathcal{P}_{\text{advice}}$, an episode ends when the task-level policies have converged, forming one “episode” for learning teaching policies. Upon completion of the advice-level episode, the task-level policies are re-initialized and training proceeds for another advice-level episode. To avoid confusion, we refer to an episode as one task-level problem episode and a session as one advice-level problem episode (see Figure 0(a)). $\mathcal{P}_{\text{task}}$ and $\mathcal{P}_{\text{advice}}$ also have different learning objectives. Task-level learning aims to coordinate and maximize the cumulative reward per episode, whereas advice-level learning aims to maximize the cumulative teacher reward per session, which corresponds to accelerating team-wide learning progress (i.e., maximizing the area under the learning curve in one session). Lastly, task-level policies are inherently off-policy, while teacher policies are not necessarily so. This is because task-level policies are updated with experiences affected by the teacher policies, rather than with experiences generated by following the agents’ task-level policies alone.
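The episode/session structure described above can be sketched as two nested loops. All names below are hypothetical stand-ins (the real framework also interleaves advising and policy updates inside each episode); the key points illustrated are the task-policy reset at every session boundary and the teacher objective being the area under the task-level learning curve:

```python
def run_sessions(n_sessions, episodes_per_session, init_task_policies,
                 run_episode, update_teachers):
    """Outer loop: advice-level 'sessions'. Inner loop: task-level 'episodes'."""
    auc_per_session = []
    for _ in range(n_sessions):
        task_policies = init_task_policies()      # session reset: fresh students
        returns = []
        for _ in range(episodes_per_session):
            returns.append(run_episode(task_policies))
        # teacher objective ~ area under the task-level learning curve
        auc = sum(returns)
        auc_per_session.append(auc)
        update_teachers(auc)                      # teachers persist across sessions
    return auc_per_session
```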

Figure 1: (a) Illustration of the task-level learning progress in each session. After completing a pre-determined episode count, the task-level policies are re-initialized (teaching session reset). With each new teaching session, the teacher policies are better at advising students, leading to faster learning progress for the task-level policies. (b) Agents teach each other according to an advising protocol. For example, a knowledgeable agent $j$ evaluates the action that agent $i$ intends to take and advises if needed.

2.2 Hierarchical Reinforcement Learning (HRL)

HRL is a structured framework with multi-level reasoning and extended temporal abstraction Parr1998 ; sutton1999 ; MAXQ2000 ; Kulkarni16hrl ; bacon16hrl ; vezhnevets17hrl ; abstractoptions ; feudalrl ; fun . HRL efficiently decomposes a complex problem into simpler sub-problems, which offers a benefit over non-hierarchical approaches in solving difficult tasks with long horizons and delayed rewards. The closest HRL framework to the one we use in this paper is that of nachum18hrl , with a two-layer hierarchical structure: a higher-level manager policy ($\pi_m$) and a lower-level worker policy ($\pi_w$). The manager policy obtains an observation and plans a high-level sub-goal for the worker policy. The worker policy attempts to reach this sub-goal from the current state by executing primitive actions in the environment. Following this framework, an updated sub-goal is generated by the manager every $c$ timesteps, and a sequence of primitive actions is executed by the worker. The manager learns to accomplish the task by optimizing the cumulative environment reward and stores an experience every $c$ timesteps. By contrast, the worker learns to reach the sub-goal by maximizing a cumulative intrinsic reward and stores an experience at each timestep. Without loss of generality, we denote the next observation with the prime symbol (e.g., $o'$).
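A minimal sketch of this two-level rollout follows. All names are hypothetical, the environment is a toy 1-D state, and the intrinsic reward is assumed to be the negative distance to the sub-goal; the point is the bookkeeping: the manager stores one experience (with the summed environment reward) per `c` steps, while the worker stores one intrinsically rewarded experience per step:

```python
def hierarchical_rollout(env, manager, worker, c, horizon):
    """Two-level rollout: manager emits a sub-goal every c steps, worker acts."""
    obs = env.reset()
    manager_experiences, worker_experiences = [], []
    goal, goal_obs, ext_reward_sum = None, None, 0.0
    for t in range(horizon):
        if t % c == 0:
            if goal is not None:   # close out the previous c-step segment
                manager_experiences.append((goal_obs, goal, ext_reward_sum, obs))
            goal, goal_obs, ext_reward_sum = manager(obs), obs, 0.0
        action = worker(obs, goal)
        next_obs, reward = env.step(action)
        intrinsic = -abs(goal - next_obs)   # assumed: distance-based intrinsic reward
        worker_experiences.append((obs, goal, action, intrinsic, next_obs))
        ext_reward_sum += reward
        obs = next_obs
    manager_experiences.append((goal_obs, goal, ext_reward_sum, obs))
    return manager_experiences, worker_experiences
```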

3 Overview of Hierarchical Multiagent Teaching (HMAT)

3.1 Deep Hierarchical Task-Level Policy

HMAT addresses the limited scalability of LeCTR by using deep function approximation and learns high-level teacher policies that decide what high-level actions (e.g., sub-goals) to advise fellow agents and when that advice should be given. Specifically, to extend the task-level policies with DNNs and hierarchical representations, we replace $\pi^i$ and $\pi^j$ with deep hierarchical policies consisting of manager policies, $\pi^i_m$ and $\pi^j_m$, and worker policies, $\pi^i_w$ and $\pi^j_w$ (see Figure 0(b)). The manager and worker policies in HMAT are trained with different objectives. Managers learn to accomplish the task together (i.e., solving $\mathcal{P}_{\text{task}}$) by optimizing the cumulative reward, while workers are trained to reach the sub-goals suggested by their managers (see appendix C for details).

In this paper, we focus on transferring knowledge at the manager level instead of the worker level, since manager policies represent abstract knowledge, which is more relevant to fellow agents. Such a consideration also allows managers to transfer knowledge even if the workers have dynamics, action spaces, or observation spaces unique to each agent. Therefore, hereafter, when we discuss task-level policies, it is implied that we are discussing only the manager policies. The manager subscript will often be omitted when discussing these task-level policies (i.e., $\pi \equiv \pi_m$) to simplify notation.

3.2 Advice-Level Learning in Hierarchical Settings

We consider the problem of sharing knowledge between managers via teacher policies, $\tilde{\pi}^i$ and $\tilde{\pi}^j$, which learn when and what sub-goal actions to advise. Consider Figure 0(b), where agents are learning to coordinate with hierarchical task-level policies (solving $\mathcal{P}_{\text{task}}$) while advising each other via teacher policies (solving $\mathcal{P}_{\text{advice}}$). There are two roles in Figure 0(b): that of a student agent (i.e., an agent whose manager policy receives advice) and that of a teacher agent (i.e., an agent whose teacher policy gives advice). Note that agents $i$ and $j$ can simultaneously teach each other, but, for clarity, Figure 0(b) only shows a one-way interaction. Here, student $i$ has decided that it is appropriate to strive for a sub-goal by querying its manager policy. Before $i$ passes the sub-goal to its worker, $j$'s teacher policy checks $i$'s intended sub-goal and decides whether or not to advise. Having decided to advise, $j$ transforms its task-level knowledge into desirable sub-goal advice via its teacher policy and suggests it to $i$. After student $i$ accepts the advice from the teacher, the updated sub-goal is passed to $i$'s worker policy, which then generates a primitive action.
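The advising protocol above reduces to a small decision rule. The sketch below uses hypothetical callables for the student's manager and the teacher policy; the teacher's two actions (whether to advise, and which sub-goal to suggest) are modeled as a returned pair:

```python
def advised_subgoal(student_manager, teacher_policy, student_obs):
    """One advising step: teacher may override the student's intended sub-goal."""
    intended = student_manager(student_obs)            # student's own sub-goal
    advise, suggestion = teacher_policy(student_obs, intended)
    # "when to advise": if the teacher declines, the student's sub-goal stands
    return suggestion if advise else intended
```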

4 Details of HMAT

4.1 Algorithm of Hierarchical Multiagent Teaching

HMAT iterates over the following three phases to learn how to coordinate with deep hierarchical task-level policies (solving $\mathcal{P}_{\text{task}}$) and how to provide advice using the teacher policies (solving $\mathcal{P}_{\text{advice}}$). These phases are designed to address the teacher credit assignment issue with deep task-level policies. As discussed, identifying which portions of the advice led to successful student learning is difficult (see Section 1). That issue is addressed by adapting ideas developed for learning an exploration policy for a single agent xu18meta , which include an extended view of actions (see below) and the use of a temporary policy for measuring the exploration policy's reward. We adapt and extend these ideas from learning-to-explore in a single-agent setting to our learning-to-teach multiagent setting. Pseudocode for HMAT is presented in appendix A.

Phase I (Advising Phase)   Agents advise one another using their teacher policies according to the advising protocol during one episode. This process generates a batch of task-level experiences influenced by the teaching policies' behavior. Specifically, we extend the concept of teacher policies by providing advice in the form of a sequence of multiple sub-goals in one episode, instead of just providing one piece of advice before updating the task-level policies. One teacher action in this extended view corresponds to providing multiple pieces of advice during one episode, which contrasts with previous approaches to teaching in MARL that were updated based on a single piece of advice amir2016interactive ; clouse1996integrating ; SilvaGC17 ; omidshafiei18teach ; torrey2013teaching . This extension is important because, by following a sequence of advice in an episode, a batch of task-level experiences $\mathcal{D}_{\text{advice}}$ for agents $i$ and $j$ is generated, enabling stable mini-batch updates of the DNN policies.

Phase II (Advice Evaluation Phase)   Learning teacher policies requires reward feedback on the advice given in Phase I. Phase II evaluates and estimates the impact the advice had on improving team-wide learning progress, yielding the teacher policy's reward. A temporary task-level policy is used to estimate the teacher reward, so agents copy their current task-level policies to temporary policies $\hat{\pi}^i$ and $\hat{\pi}^j$. To determine the teacher reward, the temporary policies are updated for a small number of iterations using only $\mathcal{D}_{\text{advice}}$. The updated temporary policies then generate a batch of self-practice experiences $\mathcal{D}_{\text{practice}}$ by rolling out for a fixed number of timesteps without involving the teacher policies. These self-practice experiences, which are based on the updated temporary policies, reflect how the agents (on their own) would perform after the advising phase and can be used to estimate the impact of the advice on team-wide learning. A teacher reward function (Section 4.2) uses the self-practice experiences to compute the teacher reward. A key point is that the temporary policies used to compute the teacher reward are updated based only on $\mathcal{D}_{\text{advice}}$ (i.e., experiences from past iterations are not utilized), which resolves the teacher credit assignment issue.
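The advice-evaluation flow can be sketched as follows, with hypothetical `update` and `self_practice_rollout` helpers. The point is that the temporary policy is a deep copy of the current policy, updated only on the advice batch, so the resulting self-practice batch isolates the advice's contribution:

```python
import copy

def evaluate_advice(task_policy, update, d_advice, self_practice_rollout,
                    n_updates=1):
    """Phase II sketch: measure what the advice batch alone did for the student."""
    temp_policy = copy.deepcopy(task_policy)       # copy, leave the original intact
    for _ in range(n_updates):
        temp_policy = update(temp_policy, d_advice)    # update on D_advice ONLY
    # roll out the updated temporary policy with no teacher involved
    return self_practice_rollout(temp_policy)
```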

Phase III (Policy Update Phase)   Task-level policies are updated to solve $\mathcal{P}_{\text{task}}$ using the learning algorithms $\mathbb{L}^i$ and $\mathbb{L}^j$, and advice-level policies are updated to solve $\mathcal{P}_{\text{advice}}$ using the learning algorithms $\tilde{\mathbb{L}}^i$ and $\tilde{\mathbb{L}}^j$. In particular, the task-level policies $\pi^i$ and $\pi^j$ are updated for the next iteration by randomly sampling experiences from the task-level experience memories $\mathcal{D}^i$ and $\mathcal{D}^j$. As in xu18meta , both $\mathcal{D}_{\text{advice}}$ and $\mathcal{D}_{\text{practice}}$ are added to the task-level memories. Similarly, after adding the teacher experiences collected from the advising (Phase I) and advice evaluation (Phase II) phases to the teacher experience memories $\tilde{\mathcal{D}}^i$ and $\tilde{\mathcal{D}}^j$, the teacher policies are updated by randomly sampling from their replay memories, but at a slower update frequency than that of the task-level policies.

4.2 Details of Teacher Policy

We explain important components of the teacher policy (focusing on teacher $j$ advising student $i$, for clarity). See Appendix B for additional details.

Teacher Observation and Action   Teacher-level observations compactly provide information about the nature of the heterogeneous knowledge between the two agents. Given its teacher observation, teacher $j$ decides when and what to advise, with one action deciding whether or not it should provide advice and another action selecting the sub-goal to give as advice. If no advice is provided, student $i$ executes its originally intended sub-goal.

Teacher Reward Function   Teacher policies aim to maximize the cumulative teacher reward in a session, which should result in faster team-wide learning progress. Recall from Phase II that the batch of self-practice experiences reflects how agents perform by themselves after one advising phase. The open question is how to identify an appropriate teacher reward function that can map the self-practice experiences to a measure of learning performance. Intuitively, maximizing the reward returned by a teacher reward function means that teachers should advise so that learning performance is maximized after one advising phase. In this work, we consider a new reward function, called current rollout (CR), which returns the sum of the rewards in the self-practice experiences $\mathcal{D}_{\text{practice}}$. We also evaluate different choices of teacher reward functions, including the ones in omidshafiei18teach and xu18meta .
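Assuming each self-practice experience stores its reward at a fixed index, the CR reward is a one-line reduction; the tuple layout below is an assumption for illustration:

```python
def cr_teacher_reward(self_practice_batch):
    """Current rollout (CR) reward: sum of rewards in the self-practice batch.

    Each experience is assumed to be a tuple (obs, action, reward, next_obs).
    """
    return sum(exp[2] for exp in self_practice_batch)
```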

Teacher Experiences   One teacher experience corresponds to a tuple $(\tilde{o}, \tilde{a}, \tilde{r}, \tilde{o}')$, where $\tilde{o}$ is a teacher observation; $\tilde{a}$ is a teacher action; $\tilde{r}$ is the estimated teacher reward; and $\tilde{o}'$ is the next teacher observation, obtained with the updated temporary policy (i.e., representing the change in student $i$'s knowledge due to the advice $\tilde{a}$).

4.3 Training Protocol

Task-Level Training   To accommodate the inherently off-policy task-level policies (Section 2.1), we use the off-policy TD3 fujimoto18td3 algorithm to train the worker and manager policies. TD3 is an actor-critic algorithm that introduces two critics, $Q_1$ and $Q_2$, to reduce the overestimation of Q-value estimates in DDPG lillicrap15ddpg and yields more robust learning performance. Originally, TD3 is a single-agent deep RL algorithm accommodating continuous states/actions. Here, we extend TD3 to multiagent settings, with the resulting algorithm termed MATD3, and the non-stationarity in MARL is addressed by applying centralized critics/decentralized actors foerster2017counterfactual ; lowe17maddpg . Another algorithm, termed HMATD3, further extends MATD3 with HRL. In HMATD3, an agent $i$'s task policy critics, $Q^i_1$ and $Q^i_2$, minimize the following critic loss:

$L(\theta^i_k) = \mathbb{E}\big[\big(y - Q^i_k(o, a^i, a^j)\big)^2\big], \quad k \in \{1, 2\},$   (1)

where $y = r + \gamma \min_{k=1,2} Q^i_{k,\text{target}}(o', a'^i, a'^j)$; $a'^i = \mu^i_{\text{target}}(o'^i) + \epsilon$; $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$; the subscript “target” denotes the target network; and $o = \{o^i, o^j\}$ denotes the joint observation for the centralized critic. The agent $i$'s actor policy $\mu^i$ with parameter $\phi^i$ is updated by:

$\nabla_{\phi^i} J = \mathbb{E}\big[\nabla_{a^i} Q^i_1(o, a^i, a^j)\big|_{a^i = \mu^i(o^i)}\, \nabla_{\phi^i} \mu^i(o^i)\big].$   (2)
Advice-Level Training   TD3 is also used for updating the teacher policies. We modify eq. 1 and eq. 2 to account for the teacher's extended view. Considering agent $j$ for clarity, agent $j$'s teacher policy critics, $\tilde{Q}^j_1$ and $\tilde{Q}^j_2$, minimize the following critic loss:

$L(\tilde{\theta}^j_k) = \mathbb{E}\big[\big(\tilde{y} - \tilde{Q}^j_k(\tilde{o}, \tilde{a})\big)^2\big], \quad k \in \{1, 2\},$   (3)

where $\tilde{y} = \tilde{r} + \gamma \min_{k=1,2} \tilde{Q}^j_{k,\text{target}}(\tilde{o}', \tilde{a}')$; $\tilde{a}' = \tilde{\mu}^j_{\text{target}}(\tilde{o}') + \epsilon$; $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$. The agent $j$'s teacher actor policy $\tilde{\mu}^j$ with parameter $\tilde{\phi}^j$ is updated by:

$\nabla_{\tilde{\phi}^j} J = \mathbb{E}\big[\nabla_{\tilde{a}} \tilde{Q}^j_1(\tilde{o}, \tilde{a})\big|_{\tilde{a} = \tilde{\mu}^j(\tilde{o})}\, \nabla_{\tilde{\phi}^j} \tilde{\mu}^j(\tilde{o})\big],$   (4)

where $\tilde{a}$ denotes the extended teacher action (i.e., the sequence of advice provided in one episode).
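The clipped double-Q target shared by these critic losses can be sketched numerically as follows. Function names are illustrative; real critics and actors would be neural networks, and observations/actions would be vectors:

```python
import random

def td3_target(reward, next_obs, q1_target, q2_target, actor_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, done=False):
    """TD3-style target: target-policy smoothing + clipped double-Q minimum."""
    # smoothing noise, clipped to [-noise_clip, noise_clip]
    eps = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
    next_action = actor_target(next_obs) + eps
    # take the minimum of the two target critics to curb overestimation
    q = min(q1_target(next_obs, next_action), q2_target(next_obs, next_action))
    return reward + (0.0 if done else gamma * q)
```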

5 Related Work

Imitation learning studies how to learn a policy from expert demonstrations daswani15imitation ; ross11imitation ; ross14imitation . Recent work has applied imitation learning to multiagent coordination le2017coordinated and explored effective combinations of imitation learning and hierarchical RL le2018hierarchical . Curriculum learning bengio2009curriculum ; graves17curriculum ; tsvetkov16 , which progressively increases task difficulty, is also relevant. While most work on these topics focuses on learning knowledge for solving single-agent problems, we study peer-to-peer knowledge transfer in cooperative MARL. Related work by Xu et al. xu18meta learns an exploration policy, which relates to our approach to learning teaching policies. Their approach led to unstable learning in our setting, motivating our policy update rule (Section 4.3). We also consider various teacher reward functions, including the one in xu18meta , and our new CR reward function empirically performs better in our domains.

6 Evaluation

We demonstrate HMAT’s performance in increasingly challenging domains that involve continuous states/actions, long horizons, and delayed rewards.

(a) Cooperative one box domain.
(b) Cooperative two box domain.
Figure 2: Two scenarios used for evaluating the learning-to-teach framework.

6.1 Evaluation Domains and Tasks

Our domains are based on OpenAI's multiagent particle environment, which supports continuous observation/action spaces. We modify the environment and propose new domains, called cooperative one/two box push, that require coordination between agents (see appendix C for additional details).

Cooperative One Box Push (COBP)   The domain consists of one round box and two agents (Figure 1(a)). The objective is to move the box to the target on the left side as soon as possible. The box can be moved if and only if the two agents act on it together. This unique property requires that the agents coordinate. The domain has a delayed reward because there is no change in reward until the box is moved by the two agents. An episode in COBP ends when the timestep exceeds a pre-specified horizon.

Cooperative Two Box Push (CTBP)   This domain is similar to the one box domain but with increased complexity. There are two round boxes in the domain (Figure 1(b)). The objective is to move the left box to the left target and the right box to the right target. In addition, the boxes have different masses: one box is 3x heavier than the other. An episode in CTBP ends when the timestep exceeds a pre-specified horizon.

Heterogeneous Knowledge   For each domain, we provide each agent with a different set of priors to ensure heterogeneous knowledge between them and to motivate interesting teaching scenarios. For COBP (Figure 1(a)), one agent is first trained to move the box to the target. This experienced agent is then teamed up with a new agent that has no knowledge of the domain, and the experienced agent, which understands how to move the box, should teach the new agent by giving good advice to improve its learning progress. For CTBP (Figure 1(b)), one pair of agents receives prior training on how to move the left box to the left target, and another pair understands how to move the right box to the right target. These two teams have different skills, as the tasks involve moving boxes with different weights (light vs. heavy) and in different directions (left vs. right). One agent from each team is then paired with one from the other, and each agent should transfer its specialized knowledge to its new teammate, so that there is a two-way transfer of knowledge in which each agent is the primary teacher at one point and the primary student at another.

Algorithm   Hierarchical?   Teaching?   One Box Push                       Two Box Push
AI          no              yes         11.33 ± 2.46
AICI        no              yes         10.60 ± 0.85
MAT         no              yes         10.04 ± 0.38
HMATD3      yes             no          10.24 ± 0.20
HAI         yes             yes         10.23 ± 0.19
HAICI       yes             yes         10.25 ± 0.26
HMAT        yes             yes         10.10 ± 0.19 (AUC 458 ± 8)         27.49 ± 0.96 (AUC 5032 ± 186)

Table 1: Final performance and AUC for different algorithms. Results show the mean and standard deviation computed over sessions. Best results in bold (computed via a t-test).

6.2 Baselines

We compare to several baselines in order to provide context for the performance of HMAT (see Appendix D for additional details).

No-Teaching Baselines:   MATD3 and HMATD3 are the baselines for primitive and hierarchical MARL without teaching, respectively.

LeCTR Baselines:   This includes the LeCTR framework omidshafiei18teach with task-level policies that use tile coding (LeCTR–Tile). We also consider modifications of that framework in which the task-level policies are learned with deep RL methods: MATD3 (LeCTR–D) and HMATD3 (LeCTR–HD). Lastly, we compare two LeCTR baselines using online MATD3 (LeCTR–OD) and online HMATD3 (LeCTR–OHD), where an online update denotes a policy update with the most recent experience.

Heuristic Teaching Baselines:   Two heuristic-based primitive teaching baselines, Ask Important (AI) and Ask Important–Correct Important (AICI) amir2016interactive , are compared. In AI, each student asks for advice based on the importance of a state, measured using the student's Q-values. When asked, the teacher agent always advises with its best primitive action for the given student state. Students in AICI also ask for advice, but each teacher can decide whether to advise with its best action or not to advise at all. The teacher decides based on the state importance, measured using the teacher's Q-values, and on the difference between the student's intended action and the teacher's intended action at the student's state. AICI is one of the best-performing heuristic algorithms in amir2016interactive . Hierarchical AI (HAI) and hierarchical AICI (HAICI) are similar to AI and AICI but teach in the hierarchical setting (i.e., managers teach each other).

HMAT Variants:   A non-hierarchical variant of HMAT, called MAT, is also compared, under different choices of teacher reward functions. For clarity, we only show results with the CR reward function. Performance comparisons between different reward functions are shown in appendix B.

Figure 3: (a) and (b) Task-level learning progress in the one box push and two box push domains, respectively. The oracles in (a) and (b) refer to the performance of converged HMATD3. For fair comparison, HMAT and MAT include the number of episodes used in Phases I and II when counting the number of training episodes. (c) AUC with heterogeneous action spaces, as a function of the action rotation. The mean and confidence interval computed over multiple sessions are shown in all figures.

6.3 Results on One Box and Two Box Push

Table 1 compares HMAT and its baselines. The results show both the final task-level performance and the area under the task-level learning curve (AUC); higher values are better for both metrics. The results demonstrate improved task-level learning with HMAT compared to HMATD3, and with MAT compared to MATD3, as indicated by the higher final performance and larger rate of learning (AUC) in Figures 2(b) and 2(a). HMAT also shows better performance than MAT. These results demonstrate the benefits of high-level advising, which helps address the long time horizons and delayed rewards in these two domains.

HMAT also achieves better performance than the LeCTR baselines. LeCTR–Tile shows the smallest final performance and AUC due to the limitations of the tile-coding representation of the policies in these complex domains. The teacher policies in LeCTR–D and LeCTR–HD have poor estimates of the teacher credit assignment with deep task-level policies, which results in unstable learning of the teaching policies and worse performance than the no-teaching baselines (i.e., LeCTR–D vs. MATD3, LeCTR–HD vs. HMATD3). In contrast, both LeCTR–OD and LeCTR–OHD have good estimates of the teacher credit assignment, as the task-level policies are updated online. However, these two approaches suffer from instability caused by the absence of mini-batch updates for the DNN policies. Finally, HMAT attains the best performance in terms of final performance and AUC compared to the heuristic-based baselines.

These combined results demonstrate the key advantage of HMAT in that it can accelerate the learning progress for complex tasks with continuous states/actions, long horizons, and delayed rewards.

6.4 Results on Transferability and Heterogeneous Action Spaces

HMAT can learn high-level teaching strategies that are transferable to different types of agents and/or tasks. We evaluate transferability from the following two perspectives, as well as teaching with heterogeneous action spaces. The numerical results below are means and standard deviations computed over multiple sessions.

Transfer across Different Student Types   We first create a small population of students, each having different knowledge. Specifically, we create students that can push the box to distinct areas in the one-box push domain: top-left, top, top-right, right, bottom-right, bottom, and bottom-left. This population is divided into a train group (top-left, top, bottom-right, bottom, and bottom-left), a validation group (top-right), and a test group (right). After the teacher policy has converged, we fix the policy and transfer it to a different setting in which the teacher advises a student in the test group. Although the teacher has never interacted with the student in the test group before, it achieves a higher AUC than the no-teaching baseline (HMATD3).

Transfer across Different Tasks   We first train the teacher in the one box push domain, where it learns to transfer knowledge about how to move the box to the left. We then fix the converged teacher policy and evaluate it on the different task of moving the box to the right. Task-level learning with teaching achieves a higher AUC than task-level learning without teaching. Thus, learning is faster even when using pre-trained teacher policies from a different task.

Teaching with Heterogeneous Action Spaces   We consider heterogeneous action space variants of the one-box push domain, in which one agent has a remapped primitive action space (e.g., rotated actions) compared to its teammate. The original heuristic-based algorithms advise primitive actions and thus assume action-space homogeneity. As Figure 2(c) shows, the mean AUC of AICI decreases substantially when the action space is remapped. Advising with HRL helps to teach a teammate with a heterogeneous action space, and HMAT achieves the best performance regardless of the action rotation.

7 Conclusion

The paper presents HMAT, which utilizes deep function approximations and HRL to transfer knowledge between agents in cooperative MARL problems. We propose a method to overcome the teacher credit assignment issue and show accelerated learning progress in challenging domains. Ongoing work is expanding the framework to more than two agents.


This work was supported by IBM (as part of the MIT-IBM Watson AI Lab initiative) and the AWS Machine Learning Research Awards program. Dong-Ki Kim was also supported by a Kwanjeong Educational Foundation Fellowship.


Appendix A HMAT Pseudocode

 1: Maximum number of episodes in a session
 2: Teacher update frequency
 3: Initialize advice-level policies and memories
 4: for each teaching session do
 5:     Re-initialize task-level policy parameters
 6:     Re-initialize task-level memories
 7:     Re-initialize the train episode count
 8:     while the episode count is below the maximum do
 9:         Collect task-level experiences with the teacher’s advice   ▷ Phase I
10:         Update the episode count
11:         Copy temporary task-level policies
12:         Update the temporary task-level policies using Eqs. (1)–(2)
13:         Perform self-practice with the temporary policies   ▷ Phase II
14:         Update the episode count
15:         Get the teacher reward from the self-practice experiences
16:         Add the advised and self-practice experiences to the task-level memories
17:         Add a teacher experience to the teacher memory   ▷ Phase III
18:         Update the task-level policies using Eqs. (1)–(2)
19:         if the episode count mod the teacher update frequency = 0 then
20:             Update the teacher policies using Eqs. (3)–(4)
21:         end if
22:     end while
23: end for
Algorithm 1 HMAT Pseudocode
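The session structure above can be sketched as a minimal, runnable loop. Policies are reduced to scalars and rollouts to toy reward lists; every helper here (`rollout`, `update_policy`) is a hypothetical stand-in for the paper's Phase I–III procedures, not the actual TD3-based updates:

```python
import random

def rollout(policy):
    """Stand-in for an episode rollout: returns a list of toy rewards."""
    return [random.random() * policy for _ in range(5)]

def update_policy(policy, rewards):
    """Stand-in for a task-level policy update (Eqs. 1-2 in the paper)."""
    return policy + 0.01 * sum(rewards)

def run_session(teacher, max_episodes, update_freq):
    task_policy = 1.0            # task-level policy, re-initialized each session
    teacher_memory = []
    episode = 0
    while episode < max_episodes:
        # Phase I: collect experiences while the teacher advises.
        advised = rollout(task_policy + teacher)
        episode += 1
        temp_policy = update_policy(task_policy, advised)

        # Phase II: self-practice with the temporarily updated policy;
        # the resulting rollout reward becomes the teacher's reward signal.
        practice = rollout(temp_policy)
        episode += 1
        teacher_reward = sum(practice)

        # Phase III: store the teacher experience and update the policies.
        teacher_memory.append(teacher_reward)
        task_policy = temp_policy
        if episode % update_freq == 0:
            teacher += 0.001 * teacher_memory[-1]   # stand-in teacher update
    return task_policy, teacher

final_task, final_teacher = run_session(teacher=0.5, max_episodes=10, update_freq=2)
```

The key structural point is that the teacher is updated on a slower timescale (every `update_freq` episodes) than the task-level policy it advises.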

Appendix B Additional Details of Teacher Policy

b.1 Teacher Observation

While agents can simultaneously advise one another, for clarity, we detail the teacher observation when one agent acts as the teacher and the other as the student. The teacher agent’s observation consists of:

the centralized critics of the teacher and student agents, respectively; the average rewards over the last few iterations of Phase I and Phase II, respectively; and the remaining time in the current session.

b.2 Teacher Reward Function

Recall that in the advice evaluation phase of HMAT (Phase II), the batch of self-practice experiences reflects how the agents perform by themselves after one advising phase. The next important question, then, is to identify an appropriate teacher reward function that transforms the self-practice experiences into an estimate of the learning progress. In this section, we detail our choices of teacher reward functions, including those of omidshafiei18teach and xu18meta , as summarized in Table 2.

Teacher Reward Name             Description
VEG: Value Estimation Gain      Student’s Q-value above a threshold  omidshafiei18teach
DR: Difference Rollout          Difference in rollout reward before/after advising  xu18meta
CR: Current Rollout             Rollout reward after the advising phase
Table 2: Summary of teacher reward functions. Rollout reward denotes the sum of rewards in the self-practice experiences.

Value Estimation Gain (VEG)   The value estimation gain (VEG) teacher reward function was introduced in, and performed best within, the Learning to Coordinate and Teach Reinforcement (LeCTR) framework omidshafiei18teach . VEG rewards a teacher when its student’s Q-value exceeds a threshold . The motivation of VEG is that the student’s Q-value is correlated with its estimated learning progress omidshafiei18teach . Algorithm 2 describes how teachers receive the VEG teacher reward. Note that the Q-functions in the algorithm correspond to those of the updated temporary task-level policies. Threshold values of  and  are used for the one-box and two-box push domains, respectively.

 1: Self-practice experiences
 2: Initialize the first agent’s teacher reward
 3: Initialize the second agent’s teacher reward
 4: for each experience in the self-practice batch do
 5:     if the first agent’s student Q-value exceeds the threshold then
 6:         Credit the VEG reward to the first agent
 7:     end if
 8:     if the second agent’s student Q-value exceeds the threshold then
 9:         Credit the VEG reward to the second agent
10:     end if
11: end for
Algorithm 2 VEG Reward Pseudocode
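Under the assumption of a simple +1 credit per qualifying state (the exact reward magnitude is not specified here), the VEG rule can be sketched as:

```python
def veg_reward(student_q_values, threshold):
    """Value Estimation Gain sketch: the teacher is credited whenever the
    student's Q-value estimate exceeds the threshold.

    The +1 credit per state is an illustrative choice, not the paper's
    exact magnitude.
    """
    return sum(1.0 for q in student_q_values if q > threshold)

# Only the Q-value 1.5 clears the threshold of 1.0:
reward = veg_reward([0.2, 0.9, 1.5], threshold=1.0)
```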

Difference Rollout (DR)   The difference rollout (DR) teacher reward function is used in the work of Xu et al. xu18meta for learning an exploration policy for a single agent. DR requires an additional batch of self-practice experiences collected before the advising phase. The teacher reward is then calculated as the difference between the sum of rewards in the post-advising batch and the sum of rewards in the pre-advising batch.

Current Rollout (CR)   The current rollout (CR) is a simpler reward function than DR: CR returns the sum of rewards in the self-practice experiences collected after the advising phase.
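The CR and DR computations follow directly from their definitions; the function names below are hypothetical:

```python
def current_rollout_reward(post_advice_rewards):
    """CR: sum of rewards in the self-practice batch collected after advising."""
    return sum(post_advice_rewards)

def difference_rollout_reward(pre_advice_rewards, post_advice_rewards):
    """DR: improvement of the post-advising rollout over a rollout
    collected before the advising phase."""
    return sum(post_advice_rewards) - sum(pre_advice_rewards)
```

Note that DR requires an extra rollout before advising, which doubles the sampling cost of the reward signal relative to CR.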

Performance Comparisons   Our empirical experiments in Table 3 show that CR performs best with HMAT. Consistent with this observation, the learning progress estimated with CR has a very high correlation with the true learning progress (see Appendix E).

b.3 Teacher Experience

In this section, we detail how the next teacher observation in the teacher experience is obtained (focusing on one teacher for clarity). To obtain the next teacher observation, the teacher’s observation is updated by replacing the student’s Q-values (see Section B.1) with those of the student’s updated temporary policy (i.e., representing the change in the student’s knowledge due to the advice). The Phase I/II average rewards and the remaining session time are also updated in the next teacher observation.

Algorithm          Hierarchical?   Teaching?   One Box Push                 Two Box Push
HMAT (with VEG)    ✓               ✓           10.38 ± 0.25                 29.73 ± 2.89    4694 ± 366
HMAT (with DR)     ✓               ✓           10.36 ± 0.32                 28.12 ± 1.58    4758 ± 286
HMAT (with CR)     ✓               ✓           10.10 ± 0.19    458 ± 8      27.49 ± 0.96    5032 ± 186
Table 3: Performance and AUC for different teacher reward functions. Results show the mean and standard deviation computed over multiple sessions. Best results in bold (computed via a t-test).

Appendix C Additional Experimental/Domain Details

c.1 Implementation Details

Each policy’s actor and critic are two-layer feed-forward neural networks with rectified linear unit (ReLU) activations. For the task-level actor policies, a final tanh activation layer, which bounds the output between -1 and 1, is used. Actors for the worker and primitive task-level policies output two actions that correspond to the x–y forces used to move in the one-box and two-box push domains. Similarly, actors for the manager task-level policies output two actions, but these correspond to sub-goals given as (x, y) coordinates.
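As an illustration, the task-level actor architecture described above can be sketched as follows; the layer sizes, observation dimension, and parameter container are hypothetical:

```python
import numpy as np

def task_actor(obs, params):
    """Two-hidden-layer ReLU network with a tanh output head, mirroring the
    task-level actors described above (layer sizes are illustrative)."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = np.maximum(0.0, obs @ w1 + b1)       # hidden layer 1 (ReLU)
    h2 = np.maximum(0.0, h1 @ w2 + b2)        # hidden layer 2 (ReLU)
    return np.tanh(h2 @ w3 + b3)              # bounded 2-D action (force or sub-goal)

rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 64)), np.zeros(64),
          rng.normal(size=(64, 64)), np.zeros(64),
          rng.normal(size=(64, 2)), np.zeros(2))
action = task_actor(rng.normal(size=8), params)
```

The tanh head guarantees the action stays in a bounded box, which is what allows the same network shape to emit either forces or (x, y) sub-goals after scaling.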

HMAT’s teacher-level actor policies output four actions: the first two outputs correspond to what to advise (i.e., continuous sub-goal advice as an (x, y) coordinate) and the last two outputs correspond to when to advise (i.e., discrete actions converted to a one-hot encoding). Two separate final layers, with tanh and linear activations, are applied to the what-to-advise and when-to-advise actions, respectively. Note that the Gumbel-Softmax estimator jang2016categorical is used to compute gradients for the discrete when-to-advise actions.
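A minimal sketch of the Gumbel-Softmax relaxation used for the discrete when-to-advise output; the logits, temperature, and the meaning of index 0 are illustrative assumptions:

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Draw an approximately one-hot, differentiable sample (Jang et al., 2016).

    Sketch of the relaxation for a discrete two-way decision; a deep-learning
    framework would backpropagate through the softmax at training time.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    u = rng.uniform(1e-10, 1.0, size=len(logits))
    gumbel_noise = -np.log(-np.log(u))
    y = (np.asarray(logits, dtype=float) + gumbel_noise) / temperature
    exp_y = np.exp(y - y.max())               # numerically stable softmax
    return exp_y / exp_y.sum()

sample = gumbel_softmax([2.0, -1.0], temperature=0.5)
advise_now = bool(sample.argmax() == 0)       # index 0 = "advise now" (assumed)
```

Lower temperatures push the sample closer to a one-hot vector at the cost of higher-variance gradients.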

Because our focus is on teaching at the manager level, not at the worker level, we pre-train and fix the worker policies by giving them randomly generated sub-goals. Per Section 2.2, the intrinsic reward used to pre-train a worker policy is the negative distance between the agent’s current position and its sub-goal. All hierarchical methods presented in this work use pre-trained workers.
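The worker's intrinsic reward can be written directly from this definition:

```python
import numpy as np

def intrinsic_reward(position, sub_goal):
    """Worker's intrinsic reward: negative Euclidean distance between the
    agent's current (x, y) position and the manager's sub-goal."""
    return -float(np.linalg.norm(np.asarray(position) - np.asarray(sub_goal)))
```

Reaching the sub-goal yields the maximum reward of 0; for example, `intrinsic_reward([0, 0], [3, 4])` returns -5.0.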

Amazon Elastic Compute Cloud (EC2) C5 and P3 instances are used as the computing infrastructure.

c.2 Cooperative One-Box Push Domain

  • Each agent’s observation includes its position/speed, the position of the box, the position of the left target, and the other agent’s position.

  • At the start of every episode, the box and the left target are initialized at () and (), respectively. Agents are initialized at random locations.

  • The domain returns a team reward .

  • The box and the agents have radii of  and , respectively.

  • The width and height of the domain range from  to .

  • The maximum number of timesteps per episode is .

  • All methods use a discount factor of .

  • The maximum number of episodes in a session is .

  • Adam optimizer; actor learning rate of  and critic learning rate of .

  • As in TD3 fujimoto18td3 , task-level policies for all methods use the first  timesteps for purely random exploration, followed by exploration with added Gaussian noise (mean  and standard deviation ). Empirically, adding the purely random exploration results in more stable learning.

  • Managers in all hierarchical methods generate a sub-goal every  timesteps.

  • The self-practice rollout length is  timesteps (two episodes).

  • The teacher update frequency is  episodes.

  • Task-level batch size of ; teacher-level batch size of .

  • We focus on knowledge transfer from one agent to the other in this domain. Recall that one agent already understands how to move the box to the left side. Therefore, we fix that agent’s task-level policy weights but train the other agent’s task-level policy throughout the experiments.

c.3 Cooperative Two-Box Push Domain

  • Each agent’s observation includes its position/speed, the positions of the boxes, the positions of the targets, and the other agent’s position.

  • At the start of every episode, the two boxes are initialized at () and (), respectively. The two targets are initialized at () and (), respectively. Agents are reset at random locations.

  • The boxes have a radius of . The agents have a radius of .

  • The domain returns a team reward .

  • The left box has a mass of  and the right box has a mass of .

  • The width and height of the domain range from  to .

  • The maximum number of timesteps per episode is .

  • All methods use a discount factor of .

  • The maximum number of episodes in a session is .

  • Adam optimizer; actor learning rate of  and critic learning rate of .

  • Task-level policies for all methods use the first  timesteps for purely random exploration, followed by exploration with added Gaussian noise (mean  and standard deviation ).

  • Managers in all hierarchical methods generate a sub-goal every  timesteps.

  • The self-practice rollout length is  timesteps (two episodes).

  • The teacher update frequency is  episodes.

  • Task-level batch size of ; teacher-level batch size of .

  • Both agents’ task-level policies are trained.

Appendix D Baseline Details

d.1 LeCTR Baseline Details

For the original LeCTR framework (LeCTR–Tile), various parameters related to the tile-coding task-level policies are evaluated. Specifically, we try various combinations of the hash size, the number of tilings, the number of tiles per tiling, and other related parameters.

d.2 Heuristic-Based Teaching Baseline Details

Heuristic-based teaching methods amir2016interactive , clouse1996integrating , SilvaGC17 , torrey2013teaching use various heuristics to decide when to advise. When it decides to advise, a teacher agent advises with its best action at the student’s state/observation. We compare the Ask Important (AI) heuristic amir2016interactive and the Ask Important–Correct Important (AICI) heuristic amir2016interactive in our experiments. We also consider hierarchical variants of AI and AICI: hierarchical AI (HAI) and hierarchical AICI (HAICI). Because the heuristic-based teaching approaches assume tile-coding and discrete action spaces, we apply minor modifications to combine them with MATD3 or HMATD3. In this section, we detail our modifications to AI, AICI, HAI, and HAICI.

AI and HAI   In AI, a student agent asks for advice when the importance of a state exceeds a threshold . A state is considered important when receiving advice there can yield a significantly better outcome, as determined by the student agent’s Q-function. We make the following modifications to measure state importance with MATD3. First, we uniformly sample actions to effectively represent all possible actions. Second, the importance measure is modified to use the centralized critics. Specifically, considering one agent as the teacher and the other as the student for clarity, the student asks for advice when the importance, computed over the set of sampled actions, exceeds the threshold. Several threshold values are evaluated, and AI performs best with a value of .
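A simplified sketch of this sampled-action importance measure follows; the toy critic and threshold are illustrative, and the centralized-critic details are omitted:

```python
import numpy as np

def state_importance(q_function, state, sampled_actions):
    """Spread of Q-values over uniformly sampled actions: a continuous-action
    stand-in for the AI importance measure. A large spread means choosing a
    bad action at this state is costly, so advice is valuable."""
    q_values = [q_function(state, a) for a in sampled_actions]
    return max(q_values) - min(q_values)

# Toy scalar-action critic: Q peaks when the action matches the state.
q = lambda s, a: -(a - s) ** 2
actions = np.linspace(-1.0, 1.0, 10)   # uniform action samples
asks_for_advice = bool(state_importance(q, 0.9, actions) > 1.0)
```

The student only queries the teacher when `asks_for_advice` is true, which limits advice to states where the Q-spread (and hence the stakes) is large.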

HAI is similar to AI, except that it operates in the hierarchical setting. Specifically, in HAI, a teacher agent gives sub-goal advice instead of primitive-action advice, and larger threshold values are evaluated to account for the manager’s sub-goal generation frequency of every  timesteps. HAI performs best with a value of .

AICI and HAICI   As in AI, students in AICI ask for advice when the state importance is larger than a threshold . When asked, however, teachers can decide whether or not to give advice based on their own importance estimate and the difference between the student’s intended action and the teacher’s intended action at the student’s state. Similar to AI, minor modifications are made for AICI to advise MATD3 task-level policies. The student asks for advice as in AI, but the teacher in AICI decides to advise based on its centralized critics evaluated over the set of sampled actions. Combinations of the two threshold values are evaluated, and AICI performs best with values of  and , respectively.

Similarly for HAICI, combinations of the two threshold values are evaluated, and HAICI performs best with values of  and , respectively.

Figure 4: Ground-truth learning progress vs. estimated learning progress (with CR). The CR teacher reward function estimates the true learning progress well, with a high correlation of .
Figure 5: Teacher actor loss for different numbers of threads. With an increasing number of threads, the teacher policy converges faster.

Appendix E Analysis of Teacher Reward Function

Computing the ground-truth learning progress of task-level policies often requires an expert policy and can be computationally expensive graves17curriculum . Thus, our framework uses an estimate of the learning progress as the teacher reward. However, it is important to understand how close this estimate is to the true learning progress: the goal of the teacher policies is to maximize the cumulative teacher reward, so a poor estimate would result in learning undesirable teachers. In this section, we measure the difference between the true and estimated learning progress and analyze the CR teacher reward function, which performed best.

In imitation learning, assuming access to an expert, one method to measure the true learning progress is to measure the distance between the actions of a learning agent and the optimal actions of the expert ross14imitation , daswani15imitation . Similarly, we pre-train expert policies (HMATD3) and measure the true learning progress by these action differences. The comparison between the true learning progress and the progress estimated with the CR teacher reward function is shown in Figure 4. The Pearson correlation is , which empirically shows that the CR teacher reward function estimates the learning progress well.
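The correlation reported here can be computed with a standard Pearson estimator; the helper below is generic, and the data it would receive come from the expert-comparison procedure described above:

```python
import numpy as np

def pearson_correlation(estimated_progress, true_progress):
    """Pearson correlation between the CR-estimated learning progress and
    the expert-based ground truth (both given as equal-length sequences)."""
    x = np.asarray(estimated_progress, dtype=float)
    y = np.asarray(true_progress, dtype=float)
    x, y = x - x.mean(), y - y.mean()        # center both series
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))
```

A value near 1 indicates the CR reward tracks the true learning progress almost linearly.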

Appendix F Asynchronous Hierarchical Multiagent Teaching

Similar to deep RL algorithms that require millions of episodes to learn a useful policy mnih15dqn , our teacher policies require many sessions to learn. Because one session consists of many episodes, considerable time may be needed before the teacher policies converge. We address this potential issue with asynchronous policy updates using multi-threading, as in asynchronous advantage actor-critic (A3C) mniha16A3C . A3C demonstrated a reduction in training time that is roughly linear in the number of threads. We show that our HMAT variant, asynchronous HMAT, likewise achieves a roughly linear reduction in training time as a function of the number of threads (see Figure 5).