Learning to Teach in Cooperative Multiagent Reinforcement Learning

by   Shayegan Omidshafiei, et al.
Northeastern University

We present a framework and algorithm for peer-to-peer teaching in cooperative multiagent reinforcement learning. Our algorithm, Learning to Coordinate and Teach Reinforcement (LeCTR), trains advising policies by using students' learning progress as a teaching reward. Agents using LeCTR learn to assume the role of a teacher or student at the appropriate moments, exchanging action advice to accelerate the entire learning process. Our algorithm supports teaching heterogeneous teammates, advising under communication constraints, and learns both what and when to advise. LeCTR is demonstrated to outperform the final performance and rate of learning of prior teaching methods on multiple benchmark domains. To our knowledge, this is the first approach for learning to teach in a multiagent setting.




In social settings, innovations by individuals are taught to others in the population through communication channels [Rogers2010], which improves not only final performance, but also the effectiveness of the entire learning process (i.e., rate of learning). There exist analogous settings where learning agents interact and adapt their behaviors in a shared environment (e.g., autonomous cars and assistive robots). While any given agent may not be an expert during learning, it may have local knowledge that teammates are unaware of. Similar to human social groups, these learning agents would likely benefit from communication to share knowledge and teach skills, thereby improving the effectiveness of system-wide learning. It is also desirable for agents in such systems to learn to teach one another, rather than rely on hand-crafted teaching heuristics created by domain experts. The benefit of learned peer-to-peer teaching is that it can accelerate learning even without relying on the existence of “all-knowing” teachers. Despite these potential advantages, no algorithms exist for learning to teach in multiagent systems.

This paper targets the learning to teach problem in the context of cooperative Multiagent Reinforcement Learning (MARL). Cooperative MARL is a standard framework for settings where agents learn to coordinate in a shared environment. Recent works in cooperative MARL have shown final task performance can be improved by introducing inter-agent communication mechanisms [Sukhbaatar, Fergus, and others2016, Foerster et al.2016, Lowe et al.2017]. Agents in these works, however, merely communicate to coordinate in the given task, not to improve overall learning by teaching one another. By contrast, this paper targets a new multiagent paradigm in which agents learn to teach by communicating action advice, thereby improving final performance and accelerating teamwide learning.

The learning to teach in MARL problem has unique inherent complexities that compound the delayed reward, credit assignment, and partial observability issues found in general multiagent problems [Oliehoek and Amato2016]. As such, there are several key issues that must be addressed. First, agents must learn when to teach, what to teach, and how to learn from what is being taught. Second, despite coordinating in a shared environment, agents may be independent/decentralized learners with privacy constraints (e.g., robots from distinct corporations that cannot share full policies), and so must learn how to teach under these constraints. A third issue is that agents must estimate the impact of each piece of advice on their teammate’s learning progress. Delays in the accumulation of knowledge make this credit assignment problem difficult, even in supervised/unsupervised learning [Graves et al.2017]. Nonstationarities due to agent interactions and the temporally-extended nature of MARL compound these difficulties in our setting. These issues are unique to our learning to teach setting and remain largely unaddressed in the literature, despite being of practical importance for future decision-making systems. One of the main reasons for the lack of progress addressing these inherent challenges is the significant increase in the computational complexity of this new teaching/learning paradigm compared to multiagent problems that have previously been considered.

Our paper targets the problem of learning to teach in a multiagent team, which has not been considered before. Each agent in our approach learns both when and what to advise, then uses the received advice to improve local learning. Importantly, these roles are not fixed (see Fig. 1); these agents learn to assume the role of student and/or teacher at appropriate moments, requesting and providing advice to improve teamwide performance and learning. In contrast to prior works, our algorithm supports teaching of heterogeneous teammates and applies to settings where advice exchange incurs a communication cost. Comparisons conducted against state-of-the-art teaching methods show that our teaching agents not only learn significantly faster, but also learn to coordinate in tasks where existing methods fail.

Background: Cooperative MARL

Our work targets cooperative MARL, where agents execute actions that jointly affect the environment, then receive feedback via local observations and a shared reward. This setting is formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), defined as $\langle \mathcal{I}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \Omega, \mathcal{O}, \gamma \rangle$ [Oliehoek and Amato2016]; $\mathcal{I}$ is the set of $n$ agents, $\mathcal{S}$ is the state space, $\mathcal{A} = \times_i \mathcal{A}^i$ is the joint action space, and $\Omega = \times_i \Omega^i$ is the joint observation space. (Superscript $i$ denotes parameters for the $i$-th agent; refer to the supplementary material for a notation list.) Joint action $a = \langle a^1, \ldots, a^n \rangle$ causes state $s \in \mathcal{S}$ to transition to $s' \in \mathcal{S}$ with probability $P(s' \mid s, a) = \mathcal{T}(s, a, s')$. At each timestep $t$, joint observation $o = \langle o^1, \ldots, o^n \rangle$ is observed with probability $P(o \mid s', a) = \mathcal{O}(o, s', a)$. Given its observation history, $h^i_t = (o^i_1, \ldots, o^i_t)$, agent $i$ executes actions dictated by its policy $a^i = \pi^i(h^i_t)$. The joint policy is denoted by $\pi = \langle \pi^1, \ldots, \pi^n \rangle$ and parameterized by $\theta$. It may sometimes be desirable to use a recurrent policy representation (e.g., recurrent neural network) to compute an internal state $h_t$ that compresses the observation history, or to explicitly compute a belief state (probability distribution over states); with abuse of notation, we use $h_t$ to refer to all such variations of internal states/observation histories. At each timestep, the team receives reward $r_t = \mathcal{R}(s_t, a_t)$, with the objective being to maximize value, $V(s; \theta) = \mathbb{E}[\sum_t \gamma^t r_t \mid s_0 = s]$, given discount factor $\gamma \in [0, 1)$. Let action-value $Q^i(o^i, a^i; h^i)$ denote agent $i$’s expected value for executing action $a^i$ given a new local observation $o^i$ and internal state $h^i$, and using its policy thereafter. We denote by $\vec{Q}^i(o^i; h^i)$ the vector of action-values (for all actions) given new observation $o^i$.
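As a small illustration of the quantities in this section, the sketch below computes a sampled discounted return (one sample of the value being maximized) and reads a greedy action off an action-value vector. The function names and the example inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one episode of shared team rewards."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def greedy_action(q_vector: Sequence[float]) -> int:
    """Index of the largest entry in an agent's action-value vector."""
    return max(range(len(q_vector)), key=lambda action: q_vector[action])
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` sums 1 + 0.5 + 0.25, and `greedy_action` returns the first maximizing index when several entries tie.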


Teaching in Cooperative MARL

Figure 1: Overview of teaching via action advising in MARL. Each agent learns to execute the task using task-level policy $\pi$, to request advice using learned student policy $\tilde{\pi}_S$, and to respond with action advice using learned teacher policy $\tilde{\pi}_T$. Each agent can assume a student and/or teacher role at any time. In this example, agent $i$ uses its student policy to request help, and agent $j$ advises action $\tilde{a}^j$, which the student executes instead of its originally-intended action $a^i$. By learning to transform the local knowledge captured in task-level policies into action advice, the agents can help one another learn.

This work explores multiagent teaching in a setting where no agent is necessarily an all-knowing expert. This section provides a high-level overview of the motivating scenario. Consider the cooperative MARL setting in Fig. 1, where agents $i$ and $j$ learn a joint task (i.e., a Dec-POMDP). In each learning iteration, these agents interact with the environment and collect data used by their respective learning algorithms, $\mathbb{L}^i$ and $\mathbb{L}^j$, to update their policy parameters, $\theta^i$ and $\theta^j$. This is the standard cooperative MARL problem, which we hereafter refer to as $P_{\text{Task}}$: the task-level learning problem. For example, task-level policy $\pi^i$ is the policy agent $i$ learns and uses to execute actions in the task. Thus, task-level policies summarize each agent’s learned behavioral knowledge.

During task-level learning, it is unlikely for any agent to be an expert. However, each agent may have unique experiences, skill sets, or local knowledge of how to learn effectively in the task. Throughout the learning process, it would likely be useful for agents to advise one another using this local knowledge, in order to improve final performance and accelerate teamwide learning. Moreover, it would be desirable for agents to learn when and what to advise, rather than rely on hand-crafted and domain-specific advising heuristics. Finally, following advising, agents should ideally have learned effective task-level policies that no longer rely on teammate advice at every timestep. We refer to this new problem, which involves agents learning to advise one another to improve joint task-level learning, as $\tilde{P}_{\text{Advise}}$: the advising-level problem.

The advising mechanism used in this paper is action advising, where agents suggest actions to one another. By learning to appropriately transform local knowledge (i.e., task-level policies) into action advice, teachers can affect students’ experiences and their resulting task-level policy updates. Action advising makes few assumptions, in that learners need only use task-level algorithms that support off-policy exploration (enabling execution of action advice for policy updates), and that they receive advising-level observations summarizing teammates’ learning progress (enabling learning of when/what to advise). Action advising has a good empirical track record [Torrey and Taylor2013, Taylor et al.2014, Fachantidis, Taylor, and Vlahavas2017, da Silva, Glatt, and Costa2017]. However, existing frameworks have key limitations: the majority are designed for single-agent RL and do not consider multiagent learning; their teachers always advise optimal actions to students, making decisions about when (not what) to teach; they also use heuristics for advising, rather than training teachers by measuring student learning progress. By contrast, agents in our paper learn to interchangeably assume the role of a student (advice requester) and/or teacher (advice responder), denoted $S$ and $T$, respectively. Each agent learns task-level policy $\pi$ used to actually perform the task, student policy $\tilde{\pi}_S$ used to request advice during task-level learning, and teacher policy $\tilde{\pi}_T$ used to advise a teammate during task-level learning. (Tilde accents, e.g., $\tilde{\pi}$, denote advising-level properties.)

Before detailing the algorithm, let us first illustrate the multiagent interactions in this action advising scenario. Consider again Fig. 1, where agents are learning to execute a task (i.e., solving $P_{\text{Task}}$) while advising one another. While each agent in our framework can assume a student and/or teacher role at any time, Fig. 1 visualizes the case where agent $i$ is the student and agent $j$ the teacher. At a given task-level learning timestep, agent $i$’s task-level policy $\pi^i$ outputs an action (‘original action $a^i$’ in Fig. 1). However, as the agents are still learning to solve $P_{\text{Task}}$, agent $i$ may prefer to execute an action that maximizes local learning. Thus, agent $i$ uses its student policy $\tilde{\pi}^i_S$ to decide whether to ask teammate $j$ for advice. If this advice request is made, teammate $j$ checks its teacher policy $\tilde{\pi}^j_T$ and task-level policy $\pi^j$ to decide whether to respond with action advice. Given a response, agent $i$ then executes the advised action ($\tilde{a}^j$ in Fig. 1) as opposed to its originally-intended action ($a^i$ in Fig. 1). This results in a local experience that agent $i$ uses to update its task-level policy. A reciprocal process occurs when the agents’ roles are reversed. The benefit of advising is that agents can learn to use local knowledge to improve teamwide learning.
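The request/response interaction described above can be sketched as a single decision step. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, the three policies are passed as plain callables, and the teacher is given the same observation as the student for simplicity.

```python
from typing import Callable, Optional, Tuple


def choose_task_action(
    observation: object,
    task_policy: Callable[[object], int],
    student_policy: Callable[[object], bool],
    teacher_policy: Callable[[object], Optional[int]],
) -> Tuple[int, bool]:
    """Return (action to execute, whether the action came from advice)."""
    # The student's own task-level policy proposes the original action.
    original_action = task_policy(observation)
    # The student policy decides whether to request advice at all.
    if not student_policy(observation):
        return original_action, False
    # The teacher may decline to advise by returning None.
    advice = teacher_policy(observation)
    if advice is None:
        return original_action, False
    # Otherwise the advised action replaces the original one.
    return advice, True
```

The returned flag records whether advice was actually followed, which is the information an advising-level learner would need to decide whether the step earns an advising reward.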

Similar to recent works that model the multiagent learning process [Hadfield-Menell et al.2016, Foerster et al.2018], we focus on the pairwise (two agent) case, targeting the issues of when/what to advise, then detail extensions to $n$ agents. Even in the pairwise case, there exist issues unique to our learning to teach paradigm. First, note that the objectives of $P_{\text{Task}}$ and $\tilde{P}_{\text{Advise}}$ are distinct. Task-level problem, $P_{\text{Task}}$, has a standard MARL objective of agents learning to coordinate to maximize final performance in the task. Learning to advise ($\tilde{P}_{\text{Advise}}$), however, is a higher-level problem, where agents learn to influence teammates’ task-level learning by advising them. However, $P_{\text{Task}}$ and $\tilde{P}_{\text{Advise}}$ are also coupled, as advising influences the task-level policies learned. Agents in our problem must learn to advise despite the nonstationarities due to changing task-level policies, which are also a function of algorithms $\langle \mathbb{L}^i, \mathbb{L}^j \rangle$ and policy parameterizations $\langle \theta^i, \theta^j \rangle$.

Learning to teach is also distinct from prior works that involve agents learning to communicate [Sukhbaatar, Fergus, and others2016, Foerster et al.2016, Lowe et al.2017]. These works focus on agents communicating in order to coordinate in a task. By contrast, our problem focuses on agents learning how advising affects the underlying task-level learning process, then using this knowledge to accelerate learning even when agents are non-experts. Thus, the objectives of communication-based multiagent papers are disparate from ours, and the two approaches may even be combined.

LeCTR: Algorithm for Learning to Coordinate and Teach Reinforcement

This section introduces our learning to teach approach, details how issues specific to our problem setting are resolved, and summarizes the overall training protocol. Pseudocode is presented in the supplementary material due to limited space.

Overview   Our algorithm, Learning to Coordinate and Teach Reinforcement (LeCTR), solves advising-level problem $\tilde{P}_{\text{Advise}}$. The objective is to learn advising policies that augment agents’ task-level algorithms to accelerate solving of $P_{\text{Task}}$. Our approach involves 2 phases (see Fig. 2):

  • Phase I: agents learn from scratch using blackbox learning algorithms and latest advising policies.

  • Phase II: advising policies are updated using advising-level rewards correlated to teammates’ task-level learning.

No restrictions are placed on agents’ task-level algorithms (i.e., they can be heterogeneous). Iteration of Phases I and II enables training of increasingly capable advising policies.
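The iteration of the two phases can be sketched as an outer loop. This is a schematic under stated assumptions: `run_task_level_learning` and `update_advising_policies` are hypothetical callables standing in for Phase I and Phase II, and the paper's actual pseudocode is in its supplementary material.

```python
from typing import Any, Callable, List


def train_advising_policies(
    advising_params: Any,
    num_epochs: int,
    run_task_level_learning: Callable[[Any], List[Any]],
    update_advising_policies: Callable[[Any, List[Any]], Any],
) -> Any:
    """Alternate Phase I and Phase II for a fixed number of epochs."""
    for _ in range(num_epochs):
        # Phase I: task-level agents learn from scratch while the latest
        # advising policies act; advising-level experiences are collected.
        experiences = run_task_level_learning(advising_params)
        # Phase II: advising policies are updated from those experiences.
        advising_params = update_advising_policies(advising_params, experiences)
    return advising_params
```

Keeping the task-level learner behind a callable reflects the point made above that no restrictions are placed on the task-level algorithms.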

Figure 2: LeCTR consists of two iterated phases: task-level learning (Phase I), and advising-level learning (Phase II). In Phase II, advising policies are trained using rewards correlated to task-level learning (see Table 1). Task-level, student, and teacher policy colors above follow the convention of Fig. 1.

Advising Policy Inputs & Outputs   LeCTR learns student policies $\langle \tilde{\pi}^i_S, \tilde{\pi}^j_S \rangle$ and teacher policies $\langle \tilde{\pi}^i_T, \tilde{\pi}^j_T \rangle$ for agents $i$ and $j$, constituting a jointly-initiated advising approach that learns when to request advice and when/what to advise. It is often infeasible to learn high-level policies that directly map task-level policy parameters $\langle \theta^i, \theta^j \rangle$ (i.e., local knowledge) to advising decisions: the agents may be independent/decentralized learners and the cost of communicating task-level policy parameters may be high; sharing policy parameters may be undesirable due to privacy concerns; and learning advising policies over the task-level policy parameter space may be infeasible (e.g., if the latter policies involve millions of parameters). Instead, each LeCTR agent learns advising policies over advising-level observations $\tilde{o}$. As detailed below, these observations are selected to provide information about agents’ task-level state and knowledge in a more compact manner than full policy parameters $\langle \theta^i, \theta^j \rangle$.

Each LeCTR agent can be a student, teacher, or both simultaneously (i.e., request advice for its own state, while advising a teammate in a different state). For clarity, we detail advising protocols when agents $i$ and $j$ are student and teacher, respectively (see Fig. 1). LeCTR uses distinct advising-level observations for student and teacher policies. Student policy $\tilde{\pi}^i_S$ for agent $i$ decides when to request advice using advising-level observation $\tilde{o}^i_S = \langle o^i, \vec{Q}^i(o^i; h^i) \rangle$, where $o^i$ and $\vec{Q}^i(o^i; h^i)$ are the agent’s task-level observation and action-value vectors, respectively. Through $\tilde{o}^i_S$, agent $i$ observes a measure of its local task-level observation and policy state. Thus, agent $i$’s student-perspective action is $\tilde{a}^i_S = \tilde{\pi}^i_S(\tilde{o}^i_S) \in \{\text{request advice}, \text{do not request advice}\}$.

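Similarly, agent $j$’s teacher policy $\tilde{\pi}^j_T$ uses advising-level observation $\tilde{o}^j_T = \langle o^i, \vec{Q}^i(o^i; h^i), \vec{Q}^j(o^i; h^i) \rangle$ to decide when/what to advise. $\tilde{o}^j_T$ provides teacher agent $j$ with a measure of student $i$’s task-level state/knowledge (via $o^i$ and $\vec{Q}^i(o^i; h^i)$) and of its own task-level knowledge given the student’s context (via $\vec{Q}^j(o^i; h^i)$). Using $\tilde{\pi}^j_T$, teacher $j$ decides what to advise: either an action from student $i$’s action space, $\mathcal{A}^i$, or a special no-advice action $\tilde{a}_\emptyset$. Thus, the teacher-perspective action for agent $j$ is $\tilde{a}^j_T = \tilde{\pi}^j_T(\tilde{o}^j_T) \in \mathcal{A}^i \cup \{\tilde{a}_\emptyset\}$.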

Given no advice, student $i$ executes originally-intended action $a^i$. However, given advice $\tilde{a}^j_T$, student $i$ executes action $\beta^i(\tilde{a}^j_T)$, where $\beta^i$ is a local behavioral policy not known by $j$. The assumption of local behavioral policies increases the generality of LeCTR, as students may locally transform advised actions before execution.

Following advice execution, agents collect task-level experiences and update their respective task-level policies. A key feature is that LeCTR agents learn what to advise by training teacher policies $\langle \tilde{\pi}^i_T, \tilde{\pi}^j_T \rangle$, rather than always advising actions they would have taken in students’ states. These agents may learn to advise exploratory actions or even decline to advise if they estimate that such advice will improve teammate learning.
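The advising-level inputs and outputs described in this subsection can be sketched as follows. This is an illustrative encoding only: the flat-list concatenation, the function names, and the convention that the last teacher output index means "no advice" are all assumptions, not details confirmed by the paper.

```python
from typing import List, Optional, Sequence


def student_advising_observation(
    obs: Sequence[float], q_student: Sequence[float]
) -> List[float]:
    """Student view: own task-level observation plus own action-value vector."""
    return list(obs) + list(q_student)


def teacher_advising_observation(
    student_obs: Sequence[float],
    q_student: Sequence[float],
    q_teacher_in_student_context: Sequence[float],
) -> List[float]:
    """Teacher view: student's observation and action-values, plus the
    teacher's own action-values evaluated in the student's context."""
    return list(student_obs) + list(q_student) + list(q_teacher_in_student_context)


def decode_teacher_output(index: int, num_student_actions: int) -> Optional[int]:
    """Map a teacher output index to advice (assumed encoding).

    Indices 0..num_student_actions-1 are actions in the student's action
    space; the extra final index stands for the special no-advice action.
    """
    if not 0 <= index <= num_student_actions:
        raise ValueError("teacher output index out of range")
    return None if index == num_student_actions else index
```

Note that the teacher's output ranges over the student's action space rather than its own, which is what allows advice to be meaningful when teammates are heterogeneous.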

Advising reward name | Description
JVG: Joint Value Gain | Task-level value improvement after learning
QTR: Q-Teaching Reward | Teacher’s estimate of best vs. intended student action
LG: Loss Gain | Student’s task-level loss reduction
LGG: Loss Gradient Gain | Student’s task-level policy gradient magnitude
TDG: TD Gain | Student’s temporal difference (TD) error reduction
VEG: Value Estimation Gain | Student’s value estimate gain above threshold
Table 1: Summary of rewards used to train advising policies. Rewards shown are for the case where agent $i$ is student and agent $j$ teacher (i.e., flip the indices for the reverse case). Each reward corresponds to a different measure of task-level learning after the student executes action advice and uses it to update its task-level policy. Refer to the supplementary material for more details.
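To make the reward descriptions in Table 1 concrete, the sketch below gives one plausible reading of each as a function of quantities measured before and after the student's task-level update. These formulas are assumptions inferred from the one-line descriptions only (for instance, the squared L2 norm for LGG and the strict threshold for VEG); the exact definitions are given in the paper's supplementary material.

```python
from typing import Sequence


def joint_value_gain(value_before: float, value_after: float) -> float:
    """JVG: task-level value improvement after learning."""
    return value_after - value_before


def q_teaching_reward(q_teacher: Sequence[float], intended_action: int) -> float:
    """QTR: teacher's estimate of the best action vs. the student's intended one."""
    return max(q_teacher) - q_teacher[intended_action]


def loss_gain(loss_before: float, loss_after: float) -> float:
    """LG: reduction in the student's task-level loss."""
    return loss_before - loss_after


def loss_gradient_gain(gradient: Sequence[float]) -> float:
    """LGG: magnitude of the student's gradient (squared L2 norm assumed)."""
    return sum(g * g for g in gradient)


def td_gain(td_error_before: float, td_error_after: float) -> float:
    """TDG: reduction in the magnitude of the student's TD error."""
    return abs(td_error_before) - abs(td_error_after)


def value_estimation_gain(value_estimate: float, threshold: float) -> float:
    """VEG: binary reward, 1 if the student's value estimate exceeds a threshold."""
    return 1.0 if value_estimate > threshold else 0.0
```

All six functions return larger values when the student appears to have learned more, so any of them can be plugged in as the advising-level reward for a student-teacher pair.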

Rewarding Advising Policies   Recall in Phase II of LeCTR, advising policies are trained to maximize advising-level rewards that should, ideally, reflect the objective of accelerating task-level learning. Without loss of generality, we focus again on the case where agents $i$ and $j$ assume student and teacher roles, respectively, to detail these rewards. Since student policy $\tilde{\pi}^i_S$ and teacher policy $\tilde{\pi}^j_T$ must coordinate to help student $i$ learn, they receive identical advising-level rewards, $\tilde{r}^i_S = \tilde{r}^j_T$. The remaining issue is to identify advising-level rewards that reflect learning progress.

Remark 1.

Earning task-level rewards by executing advised actions may not imply actual learning. Thus, rewarding advising-level policies with the task-level reward, $r$, received after advice execution can lead to poor advising policies.

We evaluate many choices of advising-level rewards, which are summarized and described in Table 1. The unifying intuition is that each reward type corresponds to a different measure of the advised agent’s task-level learning, which occurs after executing an advised action. Readers are referred to the supplementary material for more details.

Note that at any time, the task-level action executed by agent $i$ may either be selected by its local task-level policy, or by a teammate $j$ via advising. In Phase II, pair $\langle \tilde{\pi}^i_S, \tilde{\pi}^j_T \rangle$ is rewarded only if advising occurs (with zero advising reward otherwise). Analogous advising-level rewards apply for the reverse student-teacher pairing $j$-$i$, where $\tilde{r}^j_S = \tilde{r}^i_T$. During Phase II, we train all advising-level policies using a joint advising-level reward $\tilde{r} = \tilde{r}^i_T + \tilde{r}^j_T$ to induce cooperation.

Advising-level rewards are only used during advising-level training, and are computed either from information already available to agents or from exchanged scalar values (rather than full policy parameters). It is sometimes desirable to consider advising under communication constraints, which can be done by deducting a communication cost from these advising-level rewards for each piece of advice exchanged.
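The two rules above (zero reward when no advice is exchanged, and a per-advice communication cost) can be sketched as follows. The function names are hypothetical, and combining the two pairings' rewards by summation is an assumption made for this illustration.

```python
def pair_advising_reward(
    learning_progress: float, advice_given: bool, comm_cost: float = 0.0
) -> float:
    """Reward shared by one student-teacher pair for one timestep.

    The pair earns nothing if no advice was exchanged; otherwise the
    communication cost is deducted from the learning-progress reward.
    """
    if not advice_given:
        return 0.0
    return learning_progress - comm_cost


def joint_advising_reward(reward_pair_ij: float, reward_pair_ji: float) -> float:
    """Joint reward over both pairings (sum assumed here)."""
    return reward_pair_ij + reward_pair_ji
```

With a binary learning-progress reward of 1 and a cost of 0.5 per piece of advice, two-way advising yields a joint reward of 1 only when both pieces of advice are rewarded, so the cost discourages advice that does not reliably help.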

Training Protocol   Recall LeCTR’s two phases are iterated to enable training of increasingly capable advising policies. In Phase I, task-level learning is conducted using agents’ blackbox learning algorithms and latest advising policies. At the task-level, agents may be independent learners with distinct algorithms. Advising policies are executed in a decentralized fashion, but their training in Phase II is centralized. Our advising policies are trained using the multiagent actor-critic approach of Lowe et al. [Lowe et al.2017]. Let joint advising-level observations, advising-level actions, and advising-level policies (i.e., ‘actors’) be, respectively, denoted by $\tilde{o} = \langle \tilde{o}^i_S, \tilde{o}^j_S, \tilde{o}^i_T, \tilde{o}^j_T \rangle$, $\tilde{a} = \langle \tilde{a}^i_S, \tilde{a}^j_S, \tilde{a}^i_T, \tilde{a}^j_T \rangle$, and $\tilde{\pi} = \langle \tilde{\pi}^i_S, \tilde{\pi}^j_S, \tilde{\pi}^i_T, \tilde{\pi}^j_T \rangle$, with $\tilde{\theta}$ parameterizing $\tilde{\pi}$. To induce $\tilde{\pi}$ to learn to teach both agents $i$ and $j$, we use a centralized action-value function (i.e., ‘critic’) $\tilde{Q}(\tilde{o}, \tilde{a}; \tilde{\theta})$ with a joint advising-level reward $\tilde{r}$. Critic $\tilde{Q}$ is trained by minimizing loss,


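$$\mathcal{L}(\tilde{\theta}) = \mathbb{E}_{\tilde{o}, \tilde{a}, \tilde{r}, \tilde{o}' \sim \tilde{\mathcal{M}}} \Big[ \big( \tilde{r} + \gamma \tilde{Q}(\tilde{o}', \tilde{a}'; \tilde{\theta}) - \tilde{Q}(\tilde{o}, \tilde{a}; \tilde{\theta}) \big)^2 \Big] \Big|_{\tilde{a}' = \tilde{\pi}(\tilde{o}')},$$

where $\tilde{Q}$ is the centralized critic over joint advising-level observations $\tilde{o}$ and actions $\tilde{a}$, $\tilde{a}' = \tilde{\pi}(\tilde{o}')$ are next advising actions computed using the advising policies, and $\tilde{\mathcal{M}}$ denotes advising-level replay buffer [Mnih et al.2015]. The policy gradient theorem [Sutton et al.2000] is invoked on objective $J(\tilde{\theta}) = \mathbb{E}[\sum_{k} \gamma^k \tilde{r}_{t+k}]$ to update advising policies using gradients,

$$\nabla_{\tilde{\theta}} J(\tilde{\theta}) = \mathbb{E}_{\tilde{o}, \tilde{a} \sim \tilde{\mathcal{M}}} \Big[ \sum_{\alpha \in \{i, j\},\, \rho \in \{S, T\}} \nabla_{\tilde{\theta}^\alpha_\rho} \log \tilde{\pi}^\alpha_\rho(\tilde{a}^\alpha_\rho \mid \tilde{o}^\alpha_\rho) \, \tilde{Q}(\tilde{o}, \tilde{a}; \tilde{\theta}) \Big],$$

where $\tilde{\pi}^\alpha_\rho$ is agent $\alpha$’s policy in role $\rho$.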
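The critic update described above amounts to a mean squared temporal-difference error over samples from the advising-level replay buffer. The sketch below shows that computation in plain Python under stated assumptions: `q_fn` and `next_actions_fn` are hypothetical callables standing in for the centralized critic and the joint advising policy, and no gradient step is shown.

```python
from typing import Any, Callable, Sequence, Tuple

Transition = Tuple[Any, Any, float, Any]  # (obs, actions, reward, next_obs)


def critic_td_loss(
    batch: Sequence[Transition],
    q_fn: Callable[[Any, Any], float],
    next_actions_fn: Callable[[Any], Any],
    gamma: float,
) -> float:
    """Mean squared TD error of a centralized critic over a sampled batch."""
    total = 0.0
    for obs, actions, reward, next_obs in batch:
        # Next advising actions come from the current advising policies.
        next_actions = next_actions_fn(next_obs)
        target = reward + gamma * q_fn(next_obs, next_actions)
        total += (target - q_fn(obs, actions)) ** 2
    return total / len(batch)
```

In a real implementation `q_fn` would be a differentiable network and this scalar would be minimized with an optimizer; the sketch only fixes what is being averaged.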

During training, the advising feedback nonstationarities mentioned earlier are handled as follows: in Phase I, task-level policies are trained online (i.e., no replay memory is used so impact of advice on task-level policies is immediately observed by agents); in Phase II, centralized advising-level learning reduces nonstationarities due to teammate learning, and reservoir sampling is used to further reduce advising reward nonstationarities (see supplementary material for details). Our overall approach stabilizes advising-level learning.
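Reservoir sampling, mentioned above, keeps a uniform random sample of everything seen so far in a fixed-size buffer, so older advising-level experiences are not systematically displaced by recent ones. The sketch below is the standard Algorithm R; the class name is hypothetical, and how the paper integrates it into training is described only in its supplementary material.

```python
import random
from typing import Any, List, Optional


class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform sample of all added items."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # total number of items offered so far
        self._rng = random.Random(seed)

    def add(self, item: Any) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen by drawing a
        # slot uniformly from all items seen and replacing only if it is
        # inside the buffer.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item
```

Compared with a first-in-first-out replay buffer, every experience ever offered has the same probability of being in the buffer at any time.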

Advising $n$ Agents   In the $n$-agent case, students must also decide how to fuse advice from multiple teachers. This is a complex problem requiring full investigation in future work; feasible ideas include using majority voting for advice fusion (as in [da Silva, Glatt, and Costa2017]), or asking a specific agent for advice by learning a ‘teacher score’ modulated based on teacher knowledge/previous teaching experiences.
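Majority voting, one of the fusion ideas mentioned above, can be sketched in a few lines. The function name, the use of `None` for a teacher that declines to advise, and the tie-breaking rule (lowest action index) are assumptions made for this illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def fuse_advice_by_majority(
    advised_actions: Sequence[Optional[int]], fallback_action: int
) -> int:
    """Pick the most frequently advised action, or the fallback if none."""
    votes = [action for action in advised_actions if action is not None]
    if not votes:
        # No teacher advised: the student keeps its own intended action.
        return fallback_action
    counts = Counter(votes)
    top = max(counts.values())
    # Break ties deterministically by choosing the smallest action index.
    return min(action for action, count in counts.items() if count == top)
```

A learned ‘teacher score’ would replace the equal vote weights used here with per-teacher weights.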


We conduct empirical evaluations on a sequence of increasingly challenging domains involving two agents. In the ‘Repeated’ game domain, agents coordinate to maximize the payoffs in Fig. 3(a) over multiple timesteps. In ‘Hallway’ (see Fig. 4(a)), agents only observe their own positions and receive a reward if they reach opposite goal states; task-level actions are ‘move left/right’, and states are agents’ joint grid positions. The higher-dimensional ‘Room’ game (see Fig. 5(a)) has the same state/observation/reward structure, but 4 actions (‘move up/right/down/left’). Recall student-perspective advising-level actions are to ‘ask’ or ‘not ask’ for advice. Teacher-perspective actions are to advise an action from the teammate’s task-level action space, or to decline advising.

For the Repeated, Hallway, and Room games, each iteration of LeCTR Phase I consists of a game-specific number of task-level learning iterations. Our task-level agents are independent Q-learners with tabular policies for the Repeated game and tile-coded policies [Sutton and Barto1998] for the other games. Advising policies are neural networks with internal rectified linear unit activations. Refer to the supplementary material for hyperparameters. The advising-level learning nature of our problem makes these domains challenging, despite their visual simplicity; their complexity is comparable to domains tested in recent MARL works that learn over multiagent learning processes [Foerster et al.2018], which also consider two-agent repeated/gridworld games.

(a) Repeated game payoffs. Each agent has 2 actions.
(b) Counterexample showing a poor advising reward choice (the task-level reward $r$). LeCTR Phase I and II iterations are shown as background bands.
Figure 3: Repeated game. (b) shows a counterexample where using the task-level reward $r$ as the advising-level reward yields poor advising, as teachers learn to advise actions that maximize reward (left half, green), but do not actually improve student task-level learning (right half, blue).

Counterexample demonstrating Remark 1   Fig. 3(b) shows results given a poor choice of advising-level reward, namely the task-level reward $r$, in the Repeated game. The left plot (in green) shows task-level return received due to both local policy actions and advised actions, which increases as teachers learn. However, in the right plot (blue) we evaluate how well task-level policies perform by themselves, after they have been trained using the final advising-level policies. The poor performance of the resulting task-level policies indicates that advising policies learned to maximize their own rewards by always advising optimal actions to students, thereby disregarding whether task-level policies actually learn. No exploratory actions are advised, causing poor task-level performance after advising. This counterexample demonstrates that advising-level rewards that reflect student learning progress, rather than task-level reward $r$, are critical for useful advising.

Algorithm | Repeated Game: final value | Repeated Game: AUC | Hallway Game: final value | Hallway Game: AUC | Room Game: final value | Room Game: AUC
Independent Q-learning (No Teaching) | | | | | |
Ask Important [Amir et al.2016] | | | | | |
Ask Uncertain [Clouse1996] | | | | | |
Early Advising [Torrey and Taylor2013] | | | | | |
Import. Advising [Torrey and Taylor2013] | | | | | |
Early Correcting [Amir et al.2016] | | | | | |
Correct Important [Torrey and Taylor2013] | | | | | |
AdHocVisit [da Silva, Glatt, and Costa2017] | | | | | |
AdHocTD [da Silva, Glatt, and Costa2017] | | | | | |
LeCTR (with JVG) | 4.16 ±1.17 | 405 ±114 | | | |
LeCTR (with QTR) | 4.52 ±0.00 | 443 ±3 | | | |
LeCTR (with TDG) | | | | | |
LeCTR (with LG) | 3.88 ±1.51 | | | | |
LeCTR (with LGG) | 4.41 ±0.69 | 430 ±53 | | | |
LeCTR | 4.52 ±0.00 | 443 ±3 | 0.77 ±0.00 | 71 ±3 | 0.68 ±0.07 | 79 ±16
Table 2: Final task-level value and Area under the Curve (AUC) for teaching algorithms. Best results in bold (computed via a $t$-test). Independent Q-learning corresponds to the no-teaching case. The final version of LeCTR uses the VEG advising-level reward.

(a) Hallway domain overview.
(b) Simultaneous learning & teaching.
Figure 4: Hallway game. (a) Agents receive reward by navigating to opposite states in a grid hallway. (b) LeCTR accelerates learning & teaching compared to no-teaching.

(a) Room domain overview.
(b) Teaching heterogeneous agents.
Figure 5: Room game. (a) Agents receive reward by navigating to opposite goals in a grid. (b) LeCTR outperforms prior approaches when agents are heterogeneous.

Comparisons to existing teaching approaches   Table 2 shows extensive comparisons of existing heuristics-based teaching approaches, no teaching (independent Q-learning), and LeCTR with all advising rewards introduced in Table 1. We use the VEG advising-level reward in the final version of our LeCTR algorithm, but show all advising reward results for completeness. We report both final task-level performance after teaching and area under the task-level learning curve (AUC) as a measure of rate of learning; higher values are better for both. Single-agent approaches requiring an expert teacher are extended to the MARL setting by using teammates’ policies (pre-trained to expert level) as each agent’s teacher. In the Repeated game, LeCTR attains best performance in terms of final value and rate of learning (AUC). Existing approaches always advise the teaching agent’s optimal action to its teammate, resulting in suboptimal returns. In the Hallway and Room games, approaches that tend to over-advise (e.g., Ask Uncertain, Early Advising, and Early Correcting) perform poorly. AdHocVisit and AdHocTD fare better, as their probabilistic nature permits agents to take exploratory actions and sometimes learn optimal policies. Importance Advising and Correct Important heuristics lead agents to suboptimal (distant) goals in Hallway and Room, yet attain positive value due to domain symmetries.
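The AUC metric used above summarizes a whole learning curve in one number, so a method that reaches the same final value sooner scores higher. A minimal sketch of one way to compute it follows; the trapezoidal rule and the unit spacing between evaluation points are assumptions, since the paper's exact computation is not given in this section.

```python
from typing import Sequence


def area_under_learning_curve(values: Sequence[float], spacing: float = 1.0) -> float:
    """Trapezoidal area under a task-level learning curve.

    `values` holds task-level performance at evenly spaced evaluation points.
    """
    if len(values) < 2:
        return 0.0
    area = 0.0
    for left, right in zip(values, values[1:]):
        area += 0.5 * (left + right) * spacing
    return area
```

For instance, the curves `[0, 1, 1]` and `[0, 0, 1]` share the same final value, but the first has the larger area because it improves earlier.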

LeCTR outperforms all approaches when using the VEG advising-level reward (Table 2). While the JVG advising-level reward seems an intuitive measure of learning progress due to directly measuring task-level performance, its high variance in situations where the task-level value is sensitive to policy initialization sometimes destabilizes training. JVG is also expensive to compute, requiring game rollouts after each advice exchange. LG and TDG perform poorly due to the high variance of task-level losses used to compute them. We hypothesize that VEG performs best as its thresholded binary advising-level reward filters the underlying noisy task-level losses for teachers. A similar result is reported in recent work on teaching of supervised learners, where threshold-based advising-level rewards have good empirical performance [Fan et al.2018]. Fig. 4(b) shows improvement of LeCTR’s advising policies due to training, measured by the number of task-level episodes needed to converge to the max value reached. LeCTR outperforms the rate of learning for the no-teaching case, stabilizing over subsequent training epochs.

(a) Hallway domain.
(b) Room domain.
Figure 6: LeCTR accelerates multiagent transfer learning.

(a) No communication cost ($c = 0$): agents advise opposite actions.
(b) With a nonzero communication cost ($c > 0$), one agent leads & advises actions opposite its own.
Figure 7: Hallway game, impact of communication cost $c$ on advising policy behaviors. First and second rows show probabilities of action advice for agents $i$ and $j$, respectively, as their advising policies are trained using LeCTR.

Teaching for transfer learning   Learning to teach can also be applied to multiagent transfer learning. We first pre-train task-level policies in the Hallway/Room tasks, flip agents’ initial positions, then train agents to use teammates’ pre-trained task-level policies to accelerate learning in the flipped task. Results for Hallway and Room are shown in Figs. 6(a) and 6(b), respectively, where advising accelerates rate of learning using prior task knowledge. Next, we test transferability of advising policies themselves (i.e., use advising policies trained for one task to accelerate learning in a brand new, but related, task). We fix (no longer train) advising policies from the above transfer learning test. We then consider 2 variants of Room: one with the domain (including initial agent positions) flipped vertically, and one flipped vertically and horizontally. We evaluate the fixed advising policies (trained to transfer from the original task to the flipped task) on transfer between these two variants. Learning without advising on this transfer yields a lower AUC than using the fixed advising policies. Thus, learning is accelerated even when using pre-trained advising policies. While transfer learning typically involves more significant differences in tasks, these preliminary results motivate future work on applications of advising for MARL transfer learning.

Advising heterogeneous teammates   We consider heterogeneous variants of the Room game where one agent, $j$, uses rotated versions of its teammate $i$’s action space; e.g., for a 90° rotation, agent $i$’s action indices correspond to (up/right/down/left), while $j$’s to (left/up/right/down). Comparisons of LeCTR and the best-performing existing methods are shown in Fig. 5(b) for all rotations. Prior approaches (Importance Advising and Correct Important) work well for homogeneous actions (0° rotation). However, they attain 0 AUC for heterogeneous cases, as agents always advise action indices corresponding to their local action spaces, leading teammates to no-reward regions. AdHocVisit works reasonably well for all rotations, by sometimes permitting agents to explore. LeCTR attains highest AUC for all rotations.
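The failure mode described above comes from the same action index meaning different movements for the two agents. The sketch below builds the rotated action orderings from the example and shows the index translation a teacher would implicitly need to learn. The helper names are hypothetical, and LeCTR learns this mapping through its teacher policy rather than being given it.

```python
from typing import Tuple

DIRECTIONS: Tuple[str, ...] = ("up", "right", "down", "left")


def rotated_action_space(quarter_turns: int) -> Tuple[str, ...]:
    """Action ordering after shifting indices by a number of quarter turns."""
    k = quarter_turns % len(DIRECTIONS)
    split = len(DIRECTIONS) - k
    return DIRECTIONS[split:] + DIRECTIONS[:split]


def translate_action_index(
    index: int, from_space: Tuple[str, ...], to_space: Tuple[str, ...]
) -> int:
    """Index in `to_space` of the movement that `index` denotes in `from_space`."""
    return to_space.index(from_space[index])
```

With `space_i = rotated_action_space(0)` and `space_j = rotated_action_space(1)`, agent `j`’s index 1 means "up", whereas index 1 in agent `i`’s space means "right"; a heuristic that advises its own local index therefore sends the teammate in the wrong direction.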

Effect of communication cost on advice exchange   We evaluate impact of communication cost on advising by deducting cost from advising rewards for each piece of advice exchanged. Fig. 7 shows a comparison of action advice probabilities for communication costs and in the Hallway game. With no cost ( in Fig. 6(a)), agents learn to advise each other opposite actions ( and , respectively) in addition to exploratory actions. As LeCTR’s VEG advising-level rewards are binary ( or ), two-way advising nullifies positive advising-level rewards, penalizing excessive advising. Thus, when (Fig. 6(b)), advising becomes unidirectional: one agent advises opposite exploratory actions of its own, while its teammate tends not to advise.
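As a rough illustration of the cost mechanism, the sketch below deducts a fixed cost from a binary advising-level reward for each piece of advice exchanged. The cost value of 0.5 is a hypothetical choice made only so that two-way advising exactly cancels a reward of 1; it is not a value reported in the text above.

```python
def advising_reward_with_cost(base_reward, num_advice_exchanged, cost):
    """Deduct a fixed communication cost for every piece of advice exchanged."""
    return base_reward - cost * num_advice_exchanged


# Hypothetical cost, chosen for illustration only.
COST = 0.5

# Binary advising-level reward of 1 with one-way versus two-way advising.
one_way = advising_reward_with_cost(1.0, num_advice_exchanged=1, cost=COST)
two_way = advising_reward_with_cost(1.0, num_advice_exchanged=2, cost=COST)
print(one_way, two_way)
```

With these numbers, one-way advising keeps a positive net reward while two-way advising nets zero, which is consistent with the unidirectional advising behavior described above.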

Related Work

Effective diffusion of knowledge has been studied in many fields, including inverse reinforcement learning [Ng and Russell2000], apprenticeship learning [Abbeel and Ng2004], and learning from demonstration [Argall et al.2009], wherein students discern and emulate key demonstrated behaviors. Works on curriculum learning [Bengio et al.2009] are also related, particularly automated curriculum learning [Graves et al.2017]. Though Graves et al. focus on single-student supervised/unsupervised learning, they highlight interesting measures of learning progress also used here. Several works meta-learn active learning policies for supervised learning [Bachman, Sordoni, and Trischler2017, Fang, Li, and Cohn2017, Pang, Dong, and Hospedales2018, Fan et al.2018]. Our work also uses advising-level meta-learning, but in the regime of MARL, where agents must learn to advise teammates without destabilizing coordination.

In action advising, a student executes actions suggested by a teacher, who is typically an expert always advising the optimal action [Torrey and Taylor2013]. These works typically use the state importance value $I_\rho(s, \hat{a}) = \max_a Q_\rho(s, a) - Q_\rho(s, \hat{a})$ to decide when to advise (with $\rho = S$ for the student and $\rho = T$ for the teacher), estimating the performance difference between the student's best action and its intended/worst-case action $\hat{a}$. In student-initiated approaches such as Ask Uncertain [Clouse1996] and Ask Important [Amir et al.2016], the student decides when to request advice using heuristics based on $I_S(s, \hat{a})$. In teacher-initiated approaches such as Importance Advising [Torrey and Taylor2013], Early Correcting [Amir et al.2016], and Correct Important [Torrey and Taylor2013], the teacher decides when to advise by comparing student policy $\pi_S$ to expert policy $\pi_T$. Q-Teaching [Fachantidis, Taylor, and Vlahavas2017] learns when to advise by rewarding the teacher when it advises. See the supplementary material for details of these approaches.

While most works on information transfer target single-agent settings, several exist for MARL. These include imitation learning of expert demonstrations [Le et al.2017], cooperative inverse reinforcement learning with a human and robot [Hadfield-Menell et al.2016], and transfer to parallel learners in tasks with similar value functions [Taylor et al.2013]. To our knowledge, AdHocVisit and AdHocTD [da Silva, Glatt, and Costa2017] are the only action advising methods that do not assume expert teachers; teaching agents always advise the action they would have locally taken in the student's state, using state visit counts as a heuristic to decide when to exchange advice. Wang et al. [Wang et al.2018] use da Silva, Glatt, and Costa's teaching algorithm with minor changes.


Conclusion

This work introduced a new paradigm for learning to teach in cooperative MARL settings. Our algorithm, LeCTR, uses agents' task-level learning progress as advising policy feedback, training advisors that improve the rate of learning without harming final performance. Unlike prior works [Torrey and Taylor2013, Taylor et al.2014, Zimmer, Viappiani, and Weng2014], our approach avoids hand-crafted advising policies and does not assume expert teachers. Due to the many complexities involved, we focused on the pairwise problem, targeting the issues of when and what to teach. A natural avenue for future work is to investigate the $n$-agent setting, extending the ideas presented here where appropriate.


Acknowledgments

Research funded by IBM (as part of the MIT-IBM Watson AI Lab initiative) and a Kwanjeong Educational Foundation Fellowship. The authors thank Dr. Kasra Khosoussi for fruitful discussions early in the paper development process.


References

  • [Abbeel and Ng2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 1.
  • [Amir et al.2016] Amir, O.; Kamar, E.; Kolobov, A.; and Grosz, B. J. 2016. Interactive teaching strategies for agent training. International Joint Conferences on Artificial Intelligence.
  • [Argall et al.2009] Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and autonomous systems 57(5):469–483.
  • [Bachman, Sordoni, and Trischler2017] Bachman, P.; Sordoni, A.; and Trischler, A. 2017. Learning algorithms for active learning. In International Conference on Machine Learning, 301–310.
  • [Bengio et al.2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, 41–48. ACM.
  • [Clouse1996] Clouse, J. A. 1996. On integrating apprentice learning and reinforcement learning.
  • [da Silva, Glatt, and Costa2017] da Silva, F. L.; Glatt, R.; and Costa, A. H. R. 2017. Simultaneously learning and advising in multiagent reinforcement learning. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 1100–1108. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Fachantidis, Taylor, and Vlahavas2017] Fachantidis, A.; Taylor, M. E.; and Vlahavas, I. 2017. Learning to teach reinforcement learning agents. Machine Learning and Knowledge Extraction 1(1):2.
  • [Fan et al.2018] Fan, Y.; Tian, F.; Qin, T.; Li, X.-Y.; and Liu, T.-Y. 2018. Learning to teach. In International Conference on Learning Representations.
  • [Fang, Li, and Cohn2017] Fang, M.; Li, Y.; and Cohn, T. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 595–605.
  • [Foerster et al.2016] Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.
  • [Foerster et al.2018] Foerster, J. N.; Chen, R. Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; and Mordatch, I. 2018. Learning with opponent-learning awareness. In Proceedings of the 17th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Graves et al.2017] Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. In International Conference on Machine Learning, 1311–1320.
  • [Hadfield-Menell et al.2016] Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, 3909–3917.
  • [Jang, Gu, and Poole2016] Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Le et al.2017] Le, H. M.; Yue, Y.; Carr, P.; and Lucey, P. 2017. Coordinated multi-agent imitation learning. In International Conference on Machine Learning, 1995–2003.
  • [Lowe et al.2017] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, O. P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, 6382–6393.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
  • [Ng and Russell2000] Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, 663–670. Morgan Kaufmann Publishers Inc.
  • [Oliehoek and Amato2016] Oliehoek, F. A., and Amato, C. 2016. A concise introduction to decentralized POMDPs, volume 1. Springer.
  • [Pang, Dong, and Hospedales2018] Pang, K.; Dong, M.; and Hospedales, T. 2018. Meta-learning transferable active learning policies by deep reinforcement learning.
  • [Rogers2010] Rogers, E. M. 2010. Diffusion of innovations. Simon and Schuster.
  • [Sukhbaatar, Fergus, and others2016] Sukhbaatar, S.; Fergus, R.; et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, 2244–2252.
  • [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
  • [Sutton et al.2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063.
  • [Taylor et al.2013] Taylor, A.; Dusparic, I.; Galván-López, E.; Clarke, S.; and Cahill, V. 2013. Transfer learning in multi-agent systems through parallel transfer. In Workshop on Theoretically Grounded Transfer Learning at the 30th International Conf. on Machine Learning (Poster), volume 28,  28. Omnipress.
  • [Taylor et al.2014] Taylor, M. E.; Carboni, N.; Fachantidis, A.; Vlahavas, I.; and Torrey, L. 2014. Reinforcement learning agents providing advice in complex video games. Connection Science 26(1):45–63.
  • [Torrey and Taylor2013] Torrey, L., and Taylor, M. 2013. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Wang et al.2018] Wang, Y.; Lu, W.; Hao, J.; Wei, J.; and Leung, H.-F. 2018. Efficient convention emergence through decoupled reinforcement social learning with teacher-student mechanism. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 795–803. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Zimmer, Viappiani, and Weng2014] Zimmer, M.; Viappiani, P.; and Weng, P. 2014. Teacher-student framework: a reinforcement learning approach. In AAMAS Workshop Autonomous Robots and Multirobot Systems.

Supplementary Material

Details of Advising-level Rewards

Recall in Phase II of LeCTR, advising policies are trained to maximize advising-level rewards that should, ideally, reflect the objective of accelerating task-level learning. Selection of an appropriate advising-level reward is, itself, non-obvious. Due to this, we considered a variety of advising-level rewards, each corresponding to a different measure of task-level learning after the student executes an action advice and uses it to update its task-level policy. Advising-level rewards below are detailed for the case where agent is student and agent teacher (i.e., flip the indices for the reverse case). Recall that the shared reward used to jointly train all advising policies is .

  • Joint Value Gain (JVG): Let $\boldsymbol{\theta}_t$ and $\boldsymbol{\theta}_{t+1}$, respectively, denote agents' joint task-level policy parameters before and after learning from an experience resulting from action advice. The JVG advising-level reward measures improvement in task-level value due to advising, such that

    $$\tilde{r}^j_T = V(s; \boldsymbol{\theta}_{t+1}) - V(s; \boldsymbol{\theta}_t).$$

    This is, perhaps, the most intuitive choice of advising-level reward, as it directly measures the gain in task-level performance due to advising. However, the JVG reward has high variance in situations where the task-level value is sensitive to policy initialization, which sometimes destabilizes training. Moreover, the JVG reward requires a full evaluation of task-level performance after each advising step, which can be expensive due to the game rollouts required.

  • Q-Teaching Reward (QTR): The QTR advising-level reward extends Q-Teaching [Fachantidis, Taylor, and Vlahavas2017] to MARL by using

    $$\tilde{r}^j_T = \max_a Q_T(o^i, a; \theta^i) - Q_T(o^i, \hat{a}^i; \theta^i)$$

    each time advising occurs. The motivating intuition for QTR is that teachers should have a higher probability of advising when they estimate that the student's intended action, $\hat{a}^i$, can be outperformed by a different action (the $\arg\max$ action).

  • TD Gain (TDG): For temporal difference (TD) learners, the TDG advising-level reward measures improvement of student $i$'s task-level TD error due to advising,

    $$\tilde{r}^j_T = |\delta^i_t| - |\delta^i_{t+1}|,$$

    where $\delta^i_t$ is $i$'s TD error at timestep $t$. For example, if agents are independent Q-learners at the task-level, then

    $$\delta^i = r + \gamma \max_{a'} Q(o'^i, a'; \theta^i) - Q(o^i, a^i; \theta^i).$$

    The motivating intuition for the TDG advising-level reward is that actions that are anticipated to reduce student $i$'s task-level TD error should be advised by the teacher $j$.

  • Loss Gain (LG): The LG advising-level reward applies to many loss-based algorithms, measuring improvement of the task-level loss function used by student learner $i$,

    $$\tilde{r}^j_T = \mathcal{L}(\theta^i_t) - \mathcal{L}(\theta^i_{t+1}).$$

    For example, if agents are independent Q-learners using parameterized task-level policies at the task-level, then

    $$\mathcal{L}(\theta^i) = \big(r + \gamma \max_{a'} Q(o'^i, a'; \theta^i) - Q(o^i, a^i; \theta^i)\big)^2.$$

    The motivating intuition for the LG reward is similar to the TDG, in that teachers should advise actions they anticipate to decrease the student's task-level loss function.

  • Loss Gradient Gain (LGG): The LGG advising-level reward is an extension of the gradient prediction gain [Graves et al.2017], which measures the magnitude of student parameter updates due to teaching,

    $$\tilde{r}^j_T = \|\nabla_{\theta^i} \mathcal{L}(\theta^i)\|_2^2.$$

    The intuition here is that larger task-level parameter updates may be correlated with learning progress.

  • Value Estimation Gain (VEG): VEG rewards teachers when the student's local value function estimates exceed a threshold $\tau$,

    $$\tilde{r}^j_T = \mathbb{1}\big(\hat{V}(\theta^i) > \tau\big),$$

    using $\hat{V}(\theta^i) = \max_{a^i} Q(o^i, a^i; \theta^i)$ and indicator function $\mathbb{1}(\cdot)$. The motivation here is that the student's value function approximation is correlated with its estimated performance as a function of its local experiences. A convenient means of choosing $\tau$ is to set it as a fraction of the value estimated when no teaching occurs.
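The rewards listed above can be sketched for tabular, independent Q-learners as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Q-values are plain Python lists indexed by action, the loss is taken to be the squared one-step TD error, and all function names and numeric values are hypothetical.

```python
def td_error(q_values, action, reward, next_q_values, gamma):
    """One-step Q-learning TD error: r + gamma * max_a' Q(o', a') - Q(o, a)."""
    return reward + gamma * max(next_q_values) - q_values[action]


def td_gain(delta_before, delta_after):
    """TDG: reduction in the magnitude of the student's TD error."""
    return abs(delta_before) - abs(delta_after)


def loss_gain(loss_before, loss_after):
    """LG: reduction of the student's task-level loss."""
    return loss_before - loss_after


def q_teaching_reward(teacher_q_values, intended_action):
    """QTR: teacher's estimate of best versus intended student action."""
    return max(teacher_q_values) - teacher_q_values[intended_action]


def value_estimation_gain(student_q_values, threshold):
    """VEG: binary reward when the student's value estimate exceeds a threshold."""
    return 1.0 if max(student_q_values) > threshold else 0.0


# Hypothetical student Q-values before and after learning from advised action 1.
q_before, q_after = [0.0, 0.2], [0.0, 0.7]
next_q = [0.5, 0.1]

delta_before = td_error(q_before, 1, reward=1.0, next_q_values=next_q, gamma=0.9)
delta_after = td_error(q_after, 1, reward=1.0, next_q_values=next_q, gamma=0.9)

tdg = td_gain(delta_before, delta_after)
lg = loss_gain(delta_before ** 2, delta_after ** 2)  # squared TD error as loss
qtr = q_teaching_reward([0.3, 0.9, 0.1], intended_action=0)
veg = value_estimation_gain(q_after, threshold=0.5)
print(tdg, lg, qtr, veg)
```

JVG and LGG are omitted from the sketch because they require, respectively, task-level rollouts and gradients of a parameterized loss, both of which depend on the chosen task and learner.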

Details of Heuristics-based Advising Approaches

Existing works on action advising typically use the state importance value $I_\rho(s, \hat{a}) = \max_a Q_\rho(s, a) - Q_\rho(s, \hat{a})$ to decide when to advise, where $\rho = S$ for student-initiated advising, $\rho = T$ for teacher-initiated advising, $Q_\rho$ is the corresponding action-value function, and $\hat{a}$ is the student's intended action if known (or the worst-case action otherwise). $I_\rho(s, \hat{a})$ estimates the performance difference of best versus intended student action in state $s$. The following is a summary of prior advising approaches:

  • The Ask Important heuristic [Amir et al.2016] requests advice whenever $I_S(s, \hat{a}) \geq k$, where $k$ is a threshold parameter.

  • Ask Uncertain requests when $I_S(s, \hat{a}) < k$ [Clouse1996], where $k$ is a threshold parameter.

  • Early Advising advises until advice budget depletion.

  • Importance Advising advises when $I_T(s, \hat{a}) \geq k$ [Torrey and Taylor2013], where $k$ is a threshold parameter.

  • Early Correcting advises when $\pi_S(s) \neq \pi_T(s)$ [Amir et al.2016].

  • Correct Important advises when $I_T(s) \geq k$ and $\pi_S(s) \neq \pi_T(s)$ [Torrey and Taylor2013], where $k$ is a threshold parameter.

  • Q-Teaching [Fachantidis, Taylor, and Vlahavas2017] learns when to advise by rewarding the teacher when advising occurs. Constrained by a finite advice budget, Q-Teaching has advising performance similar to Importance Advising, with the advantage of not requiring a tuned threshold $k$.

Pairwise combinations of student- and teacher-initiated approaches can be used to constitute a jointly-initiated approach [Amir et al.2016], such as ours. As shown in our experiments, application of single-agent teaching approaches yields poor performance in MARL games.
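The threshold heuristics summarized above can be sketched as small predicates over action-value lists. This is an illustrative reading rather than any cited implementation: Q-values are plain lists indexed by action, the worst-case action is used when the intended action is unknown, and the function names and threshold values are hypothetical.

```python
def state_importance(q_values, intended_action=None):
    """Best action value minus the intended (or worst-case) action value."""
    reference = min(q_values) if intended_action is None else q_values[intended_action]
    return max(q_values) - reference


def greedy_action(q_values):
    """Index of the highest-valued action (first one on ties)."""
    return q_values.index(max(q_values))


def ask_important(student_q, threshold):
    """Student requests advice in states it considers important."""
    return state_importance(student_q) >= threshold


def ask_uncertain(student_q, threshold):
    """Student requests advice when its action values are nearly indistinguishable."""
    return state_importance(student_q) < threshold


def importance_advising(teacher_q, threshold, intended_action=None):
    """Teacher advises in states it considers important."""
    return state_importance(teacher_q, intended_action) >= threshold


def early_correcting(student_q, teacher_q):
    """Teacher advises whenever the student's greedy action differs from its own."""
    return greedy_action(student_q) != greedy_action(teacher_q)


def correct_important(student_q, teacher_q, threshold):
    """Teacher advises only in important states where the student would err."""
    return importance_advising(teacher_q, threshold) and early_correcting(student_q, teacher_q)


# Hypothetical values: the teacher is confident, the student is undecided.
teacher_q = [0.1, 0.9, 0.2]
student_q = [0.30, 0.28, 0.29]
print(ask_uncertain(student_q, 0.1), correct_important(student_q, teacher_q, 0.5))
```

All of these rules decide only when to advise; the advised action itself is still the teacher's greedy action, which is the assumption examined in the next section.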

Optimal Action Advising

Note that in the majority of prior approaches, the above heuristics are used to decide when to advise. To address the question of what to advise, these works typically assume that teachers have expert-level knowledge and always advise the optimal action to students.

Optimal action advising has a strong empirical track record in single-agent teaching approaches [Torrey and Taylor2013, Zimmer, Viappiani, and Weng2014, Amir et al.2016]. In such settings, the assumed homogeneity of the teacher and student’s optimal policies indeed leads optimal action advice to improve student learning (i.e., when the expert teacher’s optimal policy is equivalent to student’s optimal policy). In the context of multiagent learning, however, this advising strategy has primarily been applied to games where behavioral homogeneity does not substantially degrade team performance [da Silva, Glatt, and Costa2017]. However, there exist scenarios where multiple agents learn best by exhibiting behavioral diversity (e.g., by exploring distinct regions of the state-action space), or where agents have heterogeneous capabilities/action/observation spaces altogether (e.g., coordination of 2-armed and 3-armed robots, robots with different sensors, etc.). Use of optimal action advising in cooperative multiagent tasks can lead to suboptimal joint return, particularly when the optimal policies for agents are heterogeneous. We show this empirically in several of our experiments.

In contrast to earlier optimal advising approaches, our LeCTR algorithm applies to the above settings in addition to the standard homogeneous case; this is due to LeCTR’s ability to learn a policy over not only when to advise, but also what to advise. As shown in our experiments, while existing probabilistic advising strategies (e.g., AdHocTD and AdHocVisit) attain reasonable performance in heterogeneous action settings, they do so passively by permitting students to sometimes explore their local action spaces. By contrast, LeCTR agents attain even better performance by actively learning what to advise within teammates’ action spaces; this constitutes a unique strength of our approach.

Architecture, Training Details, and Hyperparameters

At the teaching level, our advising-level critic is parameterized by a 3-layer multilayer perceptron (MLP), consisting of internal rectified linear unit (ReLU) activations, linear output, and 32 hidden units per layer. Our advising-level actors (advice request/response policies) use a similar parameterization, with the softmax function applied to outputs for discrete advising-level action probabilities. Recurrent neural networks may also be used in settings where use of advising-level observation histories yields better performance, though we did not find this necessary in our domains. As in Lowe et al. [Lowe et al.2017], we use the Gumbel-Softmax estimator [Jang, Gu, and Poole2016] to compute gradients for the teaching policies over discrete advising-level actions (readers are referred to their paper for additional details).
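For readers unfamiliar with the Gumbel-Softmax relaxation, the sketch below shows only the forward sampling step in plain Python: Gumbel noise is added to the logits and a temperature-scaled softmax is applied. The function name, temperature, and logits are hypothetical, and computing gradients through the sample would additionally require an automatic differentiation framework, which is not shown.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    # Gumbel(0, 1) noise via inverse transform; clamp to avoid log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scaled = [(l + g) / temperature for l, g in zip(logits, noise)]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical logits over three discrete advising-level actions.
sample = gumbel_softmax_sample([1.0, 0.5, -0.5], temperature=0.5, rng=rng)
print(sample)
```

Lower temperatures push the sample closer to a one-hot vector, while higher temperatures make it smoother; the temperature used here is an arbitrary illustrative value.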

Policy training is conducted with the Adam optimization algorithm [Kingma and Ba2014], using a learning rate of . We use at the task-level and at the advising-level to induce long-horizon teaching policies. Similar to Graves et al. [Graves et al.2017], we use reservoir sampling to adaptively rescale advising-level rewards with time-varying and non-normalized magnitudes (all except VEG) to the interval . Refer to Graves et al. [Graves et al.2017] for details on how this is conducted.
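One way to realize reservoir-based reward rescaling is sketched below. It is an assumption-laden illustration in the spirit of Graves et al., not a reproduction of either paper's code: the reservoir capacity, the 20th and 80th percentile bounds, the target interval of [-1, 1], and the class name are all hypothetical choices.

```python
import random


class RewardRescaler:
    """Keeps a uniform reservoir of past rewards and rescales new ones by quantiles."""

    def __init__(self, capacity, low_q=0.2, high_q=0.8, seed=0):
        self.capacity = capacity
        self.low_q = low_q
        self.high_q = high_q
        self.reservoir = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, reward):
        """Reservoir sampling: every reward seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(reward)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.reservoir[slot] = reward

    def _quantile(self, q):
        ordered = sorted(self.reservoir)
        return ordered[int(q * (len(ordered) - 1))]

    def rescale(self, reward):
        """Map a raw reward to [-1, 1] using the reservoir's low and high quantiles."""
        if not self.reservoir:
            return 0.0
        low, high = self._quantile(self.low_q), self._quantile(self.high_q)
        if high <= low:
            return 0.0
        if reward <= low:
            return -1.0
        if reward >= high:
            return 1.0
        return 2.0 * (reward - low) / (high - low) - 1.0


rescaler = RewardRescaler(capacity=100)
for raw in range(11):  # hypothetical raw rewards 0.0 ... 10.0
    rescaler.add(float(raw))
print(rescaler.rescale(5.0), rescaler.rescale(-3.0), rescaler.rescale(100.0))
```

Because the quantiles are recomputed from the reservoir, the mapping adapts as reward magnitudes drift during training; a binary reward such as VEG needs no such rescaling.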

Experimental Procedures

In Table 2, is computed by running each algorithm until convergence of task-level policies , and computing the mean value obtained by the final joint policy . The area under the learning curve (AUC) is computed by intermittently evaluating the resulting task-level policies throughout learning; while teacher advice is used during learning, the AUC is computed by evaluating the resulting after advising (i.e., in the absence of teacher advice actions, such that AUC measures actual student learning progress). All results and uncertainties are reported using at least 10 independent runs, with most results using over 20 independent runs. In Table 2, best results in bold are computed using a Student's $t$-test with significance level .
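The AUC computation described above can be sketched with the trapezoidal rule over intermittent evaluations, followed by aggregation across independent runs. The evaluation schedule, the example values, and the use of the standard error as the reported uncertainty are assumptions made for illustration.

```python
import math


def area_under_curve(eval_steps, eval_values):
    """Trapezoidal area under a learning curve sampled at the given steps."""
    area = 0.0
    for k in range(1, len(eval_steps)):
        width = eval_steps[k] - eval_steps[k - 1]
        area += 0.5 * width * (eval_values[k] + eval_values[k - 1])
    return area


def mean_and_stderr(samples):
    """Sample mean and standard error across independent runs."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical evaluations of the task-level policies (without advice) for one run.
auc = area_under_curve([0, 10, 20], [0.0, 0.5, 1.0])

# Hypothetical AUCs from three independent runs.
mean_auc, stderr_auc = mean_and_stderr([10.0, 12.0, 14.0])
print(auc, mean_auc, stderr_auc)
```

Evaluating the policies without advice at each checkpoint is what makes the resulting area reflect the student's own learning progress rather than the quality of the advice it receives.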


Notation

The following summarizes the notation used throughout the paper. In general: superscripts denote properties for an agent; bold notation denotes joint properties for the team; tilde accents denote properties at the advising-level; and bold characters with tilde accent denote joint advising-level properties.

Symbol Definition
Task (a Dec-POMDP)
Task-level learning algorithm
Joint task-level policy
Agent ’s task-level policy
Agent ’s task-level policy parameters
Task-level value
Agent ’s action value function
Agent ’s action value vector (i.e., vector of action-values for all actions)
State space
State transition function
Joint action space
Joint action
Agent ’s action space
Agent ’s action
Joint observation space
Joint observation
Agent ’s observation space
Agent ’s observation
Observation function
Reward function
Task-level reward
Discount factor
Task-level experience replay memory
Task-level temporal difference error
Agent’s advising-level role, where for student role, for teacher role
Agent ’s advice request policy
Agent ’s advice response policy
Joint advising policy
Joint advising-level policy parameters
Agent ’s advice request action
Special no-advice action
Agent ’s advice response
Joint advising action
Agent ’s behavioral policy
Agent ’s student-perspective advising obs.
Agent ’s teacher-perspective advising obs.
Joint advising obs.
Advising-level reward
Advising-level discount factor
Advising-level experience replay memory

LeCTR Algorithm

1:function GetAdviseObs()
2:     for agents  do
3:         Let denote ’s teammate.
6:     end for
7:     return
8:end function
Algorithm 1 Get advising-level observations
1:for Phase II episode to  do
2:     Initialize task-level policy parameters
3:     for Phase I episode to  do
4:          initial task-level observation
5:         for task-level timestep to  do
7:              for agents  do
8:                  Exchange advice via advising policies
9:                  if No advising occurred then
10:                       Select action via local policy
11:                  end if
12:              end for
13:              ,
14:               Execute action in task
18:               Compute advising-level rewards
19:              Store in buffer
20:         end for
21:     end for
22:     Update advising-level critic by minimizing loss,
23:     for agents  do
24:         for roles  do
25:              Update advising policy parameters via,
26:         end for
27:     end for
28:end for
Algorithm 2 LeCTR Algorithm
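The per-agent action selection inside the inner loop of Algorithm 2 can be sketched as follows: an agent executes the advised action when advice was exchanged and otherwise falls back to its local task-level policy. The `NO_ADVICE` sentinel, the callable policies, and the function names are hypothetical stand-ins for the symbols lost from the listing above.

```python
NO_ADVICE = None  # stand-in for the special no-advice action


def behavioral_action(advised_action, local_policy, observation):
    """Use the teammate's advice if any was given, else act via the local policy."""
    if advised_action is NO_ADVICE:
        return local_policy(observation)
    return advised_action


def joint_behavioral_action(advice, local_policies, observations):
    """Apply the rule above to every agent to form the joint action."""
    return [
        behavioral_action(a, policy, obs)
        for a, policy, obs in zip(advice, local_policies, observations)
    ]


# Hypothetical two-agent example: agent 0 receives advice, agent 1 does not.
policies = [lambda obs: 0, lambda obs: 3]
joint_action = joint_behavioral_action([2, NO_ADVICE], policies, ["o0", "o1"])
print(joint_action)
```

The sketch covers only action selection; the advising-level critic and policy updates at the end of the algorithm depend on losses that are not recoverable from the listing and are therefore not shown.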