Learning to Interactively Learn and Assist

06/24/2019 ∙ by Mark Woodward, et al.

When deploying autonomous agents in the real world, we need to think about effective ways of communicating our objectives to them. Traditional skill learning has revolved around reinforcement and imitation learning, each with their own constraints on the format and temporal distribution with which information between the human and the agent is exchanged. In contrast, when humans communicate with each other, they make use of a large vocabulary of informative behaviors, including non-verbal communication, which help to disambiguate their message throughout learning. Communicating throughout learning allows them to identify any missing information, whereas the large vocabulary of behaviors helps with selecting appropriate behaviors for communicating the required information. In this paper, we introduce a multi-agent training framework, which emerges physical information-communicating behaviors. The agent is trained, on a variety of tasks, with another agent, who knows the task and serves as a human surrogate. Our approach produces an agent that is capable of learning interactively from a human user, without a set of explicit demonstrations or a reward function. We conduct user experiments on object gathering tasks with pixel observations, and confirm that the trained agent learns from the human and that the joint performance significantly exceeds the performance of the human acting alone. Further, through a series of experiments, we demonstrate the emergence of a variety of learning behaviors, including information-sharing, information-seeking, and question-answering.




1 Introduction

Many tasks that we would like our agents to perform, such as unloading a dishwasher, straightening a room, or restocking shelves are inherently user-specific, requiring information from the user in order to fully learn all the intricacies of the task. The traditional paradigm for agents to learn such tasks is through rewards and demonstrations. However, iterative reward engineering with untrained human users is impractical in real-world settings, while demonstrations are often burdensome to provide. In contrast, humans learn from a variety of interactive communicative behaviors, including nonverbal gestures and partial demonstrations, each with their own information capacity and effort. Can we enable agents to learn tasks from humans through such unstructured interaction, requiring minimal effort from the human user?

The effort required by the human user is affected by many aspects of the learning problem, including restrictions on when the agent is allowed to act and restrictions on the behavior space of either human or agent, such as limiting the user feedback to rewards or demonstrations. We consider a setting where both the human and the agent are allowed to act throughout learning, which we refer to as interactive learning. Unlike collecting a set of demonstrations before training, interactive learning allows the user to act selectively, only when they deem the information necessary and useful, reducing the user’s effort. Examples of such interactions include allowing user interventions, or agent requests, for demonstrations Kelly et al. (2018), rewards Warnell et al. (2018); Arumugam et al. (2019), or preferences Christiano et al. (2017). While these methods allow the user to provide feedback throughout learning, the communication interface is restricted to structured forms of supervision, which may be inefficient for a given situation. For example, in a dishwasher unloading task, given the history of learning, it may be sufficient to point at the correct drawer rather than provide a full demonstration.

To this end, we propose to allow the agent and the user to exchange information through an unstructured interface. To do so, the agent and the user need a common prior understanding of the meaning of different unstructured interactions, along with the context of the space of tasks that the user cares about. Indeed, when humans communicate tasks to each other, they come in with rich prior knowledge and common sense about what the other person may want and how they may communicate that, enabling them to communicate concepts effectively and efficiently Peloquin et al. (2019).

In this paper, we propose to allow the agent to acquire this prior knowledge through joint pre-training with another agent who knows the task and serves as a human surrogate. The agents are jointly trained on a variety of tasks, where actions and observations are restricted to the physical environment. Since the first agent is available to assist, but only the second agent is aware of the task, interactive learning behaviors should emerge to accomplish the task efficiently. We hypothesize that, by restricting the action and observation spaces to the physical environment, the emerged behaviors can transfer to learning from a human user. An added benefit of our framework is that, by training on a variety of tasks from the target task domain, much of the non-user specific task prior knowledge is pre-trained into the agent, further reducing the effort required by the user.

We evaluate various aspects of agents trained with our framework on several simulated object gathering task domains, including a domain with pixel observations, shown in Figure 1. We show that our trained agents exhibit emergent information-gathering behaviors in general and explicit question-asking behavior where appropriate. Further, we conduct a user study with trained agents, where the users score significantly higher with the agent than without the agent, which demonstrates that our approach can produce agents that can learn from and assist human users.

The key contribution of our work is a training framework that allows agents to quickly learn new tasks from humans through unstructured interactions, without an explicitly-provided reward function or demonstrations. Critically, our experiments demonstrate that agents trained with our framework generalize to learning test tasks from human users, demonstrating interactive learning with a human in the loop. In addition, we introduce a novel multi-agent model architecture for cooperative multi-agent training that exhibits improved training characteristics. Finally, our experiments on a series of object-gathering task domains illustrate a variety of emergent interactive learning behaviors and demonstrate that our method can scale to raw pixel observations.

(a) After 100 gradient steps
(b) After 1k gradient steps
(c) After 40k gradient steps
(d) With human principal
Figure 1: Episode traces after 100, 1k, and 40k pre-training steps for the cooperative fruit collection domain of Experiment 4. The principal agent “P” (pink) is told the fruit to be collected, lemons or plums, in its observations. Within an episode, the assistant agent “A” (blue) must infer the fruit to be collected from observations of the principal. Each agent observes an overhead image of itself and its nearby surroundings. By the end of training (Figure 1c) the assistant is inferring the correct fruit and the agents are coordinating. This inference and coordination transfers to human principals (Figure 1d).

2 Related Work

The traditional means of passing task information to an agent include specifying a reward function (Barto and Sutton, 1998; Bertsekas and Tsitsiklis, 1996) that can be hand-crafted for the task (Singh et al., 2009; Levine et al., 2016; Chebotar et al., 2017) and providing demonstrations (Schaal, 1999; Abbeel and Ng, 2004) before the agent starts training. More recent works explore the concept of the human supervision being provided throughout training by either providing rewards during training (Warnell et al., 2018; Thomaz et al., 2005; Isbell et al., 2001; Perez-Dattari et al., 2018) or demonstrations during training; either continuously (Kelly et al., 2018; Ross et al., 2011b) or at the agent’s discretion (Xu et al., 2018; Hester et al., 2018; James et al., 2018; Yu et al., 2018b; Borsa et al., 2017; Krening, 2018; Ross et al., 2011a; Brown et al., 2018). In all of these cases, however, the reward and demonstrations are the sole means of interaction.

Another recent line of research involves the human expressing their preference between agent generated trajectories (Christiano et al., 2017; Mindermann et al., 2018; Ibarz et al., 2018). Here again, the interaction is restricted to a single modality.

Our work builds upon the idea of meta-learning, or learning to learn (Schmidhuber, 1987; Thrun and Pratt, 2012; Bengio et al., 1991). Meta-learning for control has been considered in the context of reinforcement learning (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017) and imitation learning (Duan et al., 2017; Yu et al., 2018a). Our problem setting differs from these, as the agent is learning by observing and interacting with another agent, as opposed to using reinforcement or imitation learning. In particular, our method builds upon recurrence-based meta-learning approaches (Santoro et al., 2016; Duan et al., 2016; Wang et al., 2016) in the context of the multi-agent task setting.

When a broader range of interactive behaviors is desired, prior works have introduced a multi-agent learning component (Potter and Jong, 1994; Palmer et al., 2018; Matignon et al., 2007). These methods are more closely related to ours in that, during training, they also maximize a joint reward function between the agents and emerge cooperative behavior (Gupta et al., 2017; Foerster et al., 2018, 2016; Lazaridou et al., 2016; Andreas et al., 2017). Multiple works Gupta et al. (2017); Foerster et al. (2018) emerge cooperative behavior but in task domains that do not require knowledge transfer between the agents, while others Foerster et al. (2016); Mordatch and Abbeel (2018); Lowe et al. (2017); Lazaridou et al. (2016); Andreas et al. (2017) all emerge communication over a communication channel. Such communication is known to be difficult to interpret (Lazaridou et al., 2016), without post-inspection (Mordatch and Abbeel, 2018) or a method for translation (Andreas et al., 2017). Critically, none of these prior works conduct user experiments to evaluate transfer to humans.

Mordatch and Abbeel (2018) experiment with tasks similar to ours, in which information must be communicated between the agents, and communication is restricted to the physical environment. This work demonstrates the emergence of pointing, demonstrations, and pushing behavior. Unlike this prior approach, however, our algorithm does not require a differentiable environment. We also demonstrate our method with pixel observations and conduct a user experiment to evaluate transfer to humans.

3 Preliminaries

In this section, we review the cooperative partially observable Markov game (Littman, 1994), which serves as the foundation for tasks in Section 4. A cooperative partially observable Markov game is defined by the tuple $\langle \mathcal{S}, \{\mathcal{A}^i\}, T, R, \{\Omega^i\}, \{O^i\}, \gamma, H \rangle$, where $i \in \{1, \dots, N\}$ indexes the agent among $N$ agents, $\mathcal{S}$, $\mathcal{A}^i$, and $\Omega^i$ are state, action, and observation spaces, $T$ is the transition function, $R$ is the reward function, $O^i$ are the observation functions, $\gamma$ is the discount factor, and $H$ is the horizon.

The functions $T$ and $R$ are not accessible to the agents. At time $t$, the environment accepts actions $\{a_t^i\}$, samples the next state $s_{t+1}$ from $T$, and returns reward $r_t$ and observations $\{o_{t+1}^i\}$. The objective of the game is to choose actions to maximize the expected discounted sum of future rewards:

$$\max_{\{a_{t_0}^i\}} \; \mathbb{E}\left[ \sum_{t=t_0}^{H} \gamma^{\,t - t_0} \, r_t \right] \tag{1}$$

Note that, while the action and observation spaces vary for the agents, they share a common reward which leads to a cooperative task.
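As a small worked illustration of the objective in this section, the sketch below computes the discounted sum of a sequence of shared rewards. The function name `discounted_return` and the example reward values are assumptions made for illustration; they do not come from the paper.

```python
def discounted_return(rewards, gamma, t0=0):
    """Discounted sum of the shared rewards from time step t0 onward.

    rewards[t] is the joint reward r_t received by all agents at step t.
    """
    return sum(gamma ** (t - t0) * r for t, r in enumerate(rewards) if t >= t0)


# Hypothetical 4-step episode in which the team collects an object at steps 1 and 3.
example_rewards = [0.0, 1.0, 0.0, 1.0]
example_return = discounted_return(example_rewards, gamma=0.5)
```

Because the reward is shared, every agent evaluates the same return, which is what makes the game cooperative.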

4 The LILA Training Framework

We now describe our training framework for producing an assisting agent that can learn a task interactively from a human user. We define a task to be an instance of a cooperative partially observable Markov game as described in Section 3, with $N = 2$. To enable the agent to solve such tasks, we train the agent, whom we call the “assistant” (superscript $A$), jointly with another agent, whom we call the “principal” (superscript $P$), on a variety of tasks. Critically, the principal’s observation function informs it of the task. The principal agent acts as a human surrogate, which allows us to replace it with a human once the training is finished. By informing the principal of the current task and withholding rewards and gradient updates until the end of each task, the agents are encouraged to emerge interactive learning behaviors in order to inform the assistant of the task and allow it to contribute to the joint reward. We limit actions and observations to the physical environment, with the hope of emerging human-compatible behaviors.

Footnote 1: Our tasks are similar to tasks in Hadfield-Menell et al. (2016), but with partially observable state and without access to the other agent’s actions, which should better generalize to learning from humans in natural environments.

In order to train the agents, we consider two different models. We first introduce a simple model that we find works well in tabular environments. Then, in order to scale our approach to pixel observations, we introduce a modification to the first model that we found was important in increasing the stability of learning.

Figure 2: Information flow for the two models used in our experiments; red paths are only needed during training. The MADDRQN model (Figure 2b) uses a centralized value function with per-agent advantage functions. The centralized value function is only used during training. The MAIDRQN model (Figure 2a) is used in experiments 1-3 and the MADDRQN model (Figure 2b) is used in experiment 4, where it exhibits superior training characteristics for learning from pixels.

Multi-Agent Independent DRQN (MAIDRQN): The first model uses two deep recurrent Q-networks (DRQN) (Hausknecht and Stone, 2015) that are each trained with Q-learning (Watkins, 1989). Let $Q^i(o_t^i, a_t^i, h_t^i)$ be the action-value function for agent $i$, which maps from the current observation, action, and history, $(o_t^i, a_t^i, h_t^i)$, to the expected discounted sum of future rewards. The MAIDRQN method optimizes the following loss:

$$L^{\text{MAIDRQN}} = \sum_{i} \sum_{t} \left[ r_t + \gamma \max_{a} Q^i(o_{t+1}^i, a, h_{t+1}^i) - Q^i(o_t^i, a_t^i, h_t^i) \right]^2 \tag{2}$$

Footnote 2: In our experiments we do not use a lagged “target” Q-network (Mnih et al., 2013), but we do stop gradients through the Q-network for the target time step.

The networks are trained simultaneously on the same transitions, but they do not share weights, and gradient updates are made independently for each agent. The model architecture is a recurrent neural network, depicted in Figure 2(a); see Section A.1.1 for details. We use this model for experiments 1-3.
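The independent Q-learning loss described above can be sketched in plain Python for a single recorded episode. This is a minimal sketch, not the authors' implementation: the nested-list layout, the function name `maidrqn_loss`, the zero bootstrap at the final step, and all numbers are assumptions, and the real models are recurrent networks trained by gradient descent with gradients stopped through the target.

```python
def maidrqn_loss(q_values, actions, rewards, gamma):
    """Sum of squared one-step TD errors over agents and time steps.

    q_values[i][t][a]: agent i's Q estimate for action a at step t
                       (already conditioned on its observation and history).
    actions[i][t]:     action taken by agent i at step t.
    rewards[t]:        joint reward shared by all agents at step t.
    """
    horizon = len(rewards)
    loss = 0.0
    for q_i, a_i in zip(q_values, actions):
        for t in range(horizon):
            # Assumption: no bootstrapping past the final step of the episode.
            bootstrap = max(q_i[t + 1]) if t + 1 < horizon else 0.0
            target = rewards[t] + gamma * bootstrap  # treated as a constant
            loss += (target - q_i[t][a_i[t]]) ** 2
    return loss


# Hypothetical two-agent, two-step, two-action episode.
example_q = [
    [[1.0, 0.0], [0.5, 2.0]],  # agent 0
    [[0.0, 0.0], [1.0, 0.0]],  # agent 1
]
example_actions = [[0, 1], [1, 0]]
example_rewards = [1.0, 0.0]
example_loss = maidrqn_loss(example_q, example_actions, example_rewards, gamma=0.9)
```

Note that each agent's error term uses the same joint reward but only that agent's own Q estimates, which is the source of the difficulty discussed for this model later in the section.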

Multi-Agent Dueling DRQN (MADDRQN): With independent Q-learning, as in MAIDRQN, the other agent’s changing behavior and unknown actions make it difficult to estimate the Bellman target, $r_t + \gamma \max_{a} Q^i(o_{t+1}^i, a, h_{t+1}^i)$, in Equation 2, which leads to instability in training. This model addresses part of the instability that is caused by unknown actions.

If $Q^*(o, a, h)$ is the optimal action-value function, then the optimal value function is $V^*(o, h) = \max_a Q^*(o, a, h)$, and the optimal advantage function is defined as $A^*(o, a, h) = Q^*(o, a, h) - V^*(o, h)$ (Wang et al., 2015). The advantage function captures how inferior an action is to the optimal action in terms of the expected sum of discounted future rewards. This allows us to express $Q^*$ in a new form, $Q^*(o, a, h) = V^*(o, h) + A^*(o, a, h)$. We note that the value function is not needed when selecting actions: $\arg\max_a Q^*(o, a, h) = \arg\max_a A^*(o, a, h)$. We leverage this idea by making the following approximation to an optimal, centralized action-value function for multiple agents:

$$Q^*(\{o^i, a^i, h^i\}) = V^*(\{o^i, h^i\}) + A^*(\{o^i, a^i, h^i\}) \approx V(\{o^i, h^i\}) + \sum_i A^i(o^i, a^i, h^i) \tag{3}$$

where $A^i$ is an advantage function for agent $i$ and $V$ is a joint value function.

Footnote 3: The approximation is due to the substitution of $\sum_i A^i$ for $A^*$ in Equation 5, which implies that the agents’ current actions have independent effects on expected future rewards, and is not true in general. Nevertheless, it is a useful approximation.

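The training loss for this model, assuming the use of per-agent recurrent networks, is:

$$L^{\text{MADDRQN}} = \sum_{t} \left[ r_t + \gamma \max_{\{a^i\}} \hat{Q}(\{o_{t+1}^i, a^i, h_{t+1}^i\}) - \hat{Q}(\{o_t^i, a_t^i, h_t^i\}) \right]^2 \tag{4}$$

$$\hat{Q}(\{o_t^i, a_t^i, h_t^i\}) = V(\{o_t^i, h_t^i\}) + \sum_i A^i(o_t^i, a_t^i, h_t^i) \tag{5}$$

Once trained, each agent selects its actions according to its advantage function $A^i$,

$$a_t^i = \arg\max_{a} A^i(o_t^i, a, h_t^i) \tag{6}$$

as opposed to the Q-function in the case of MAIDRQN.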
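To illustrate why each agent can act on its own advantage function alone, the sketch below checks, on a toy example, that per-agent greedy selection matches the joint action that maximizes a centralized value of the form "joint value plus the sum of per-agent advantages". The helper names and all numbers are hypothetical and serve only as an illustration of the decomposition.

```python
from itertools import product


def joint_q(value, advantages, joint_action):
    """Centralized action value: joint value plus the sum of per-agent advantages."""
    return value + sum(adv[a] for adv, a in zip(advantages, joint_action))


def greedy_per_agent(advantages):
    """Each agent independently picks the action with its own highest advantage."""
    return tuple(max(range(len(adv)), key=adv.__getitem__) for adv in advantages)


# Hypothetical values: one joint value and two agents with 3 and 2 actions.
example_value = 1.5
example_advantages = [[0.0, -0.3, -1.0], [-0.2, 0.0]]

decentralized = greedy_per_agent(example_advantages)
centralized = max(
    product(range(3), range(2)),
    key=lambda joint_action: joint_q(example_value, example_advantages, joint_action),
)
```

Since the joint value does not depend on the actions and the advantages add, the maximization separates across agents, so the centralized value function is not needed at execution time.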

In the loss for the MAIDRQN model, Equation 2, there is a squared error term for each $Q^i$ which depends on the joint reward $r$. This means that, in addition to estimating the immediate reward due to their own actions, each $Q^i$ must estimate the immediate reward due to the actions of the other agents, without access to their actions or observations. By using a joint action-value function and decomposing it into advantage functions and a value function, each $A^i$ can ignore the immediate reward due to the other agent, simplifying the optimization.

We refer to this model as a multi-agent dueling deep recurrent Q-network (MADDRQN), in reference to the single agent dueling network of Wang et al. (2015). The MADDRQN model, which uses a recurrent network, is depicted in Figure 2(b); see Section A.1.2 for details. The MADDRQN model is used in experiment 4.
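A minimal sketch of a one-step loss built on this decomposition is given below, assuming the centralized action value is approximated as a joint value plus the sum of per-agent advantages. The data layout, the function name `maddrqn_loss`, the zero bootstrap at the final step, and the numbers are assumptions for illustration rather than the paper's exact formulation.

```python
def maddrqn_loss(values, advantages, actions, rewards, gamma):
    """Sum over time of squared TD errors of the decomposed joint action value.

    values[t]:           joint value estimate at step t.
    advantages[i][t][a]: agent i's advantage estimate for action a at step t.
    actions[i][t]:       action taken by agent i at step t.
    rewards[t]:          joint reward at step t.
    """
    horizon = len(rewards)
    loss = 0.0
    for t in range(horizon):
        q_taken = values[t] + sum(
            adv[t][act[t]] for adv, act in zip(advantages, actions)
        )
        if t + 1 < horizon:
            # The max over joint actions separates into per-agent maxima.
            bootstrap = values[t + 1] + sum(max(adv[t + 1]) for adv in advantages)
        else:
            bootstrap = 0.0  # assumption: no bootstrapping past the episode end
        loss += (rewards[t] + gamma * bootstrap - q_taken) ** 2
    return loss


# Hypothetical two-agent, two-step, two-action episode.
example_values = [2.0, 1.0]
example_advantages = [
    [[0.0, -1.0], [0.0, -0.5]],  # agent 0
    [[-0.5, 0.0], [-1.0, 0.0]],  # agent 1
]
example_actions = [[0, 1], [1, 1]]
example_rewards = [1.0, 1.0]
example_loss = maddrqn_loss(
    example_values, example_advantages, example_actions, example_rewards, gamma=0.5
)
```

In contrast to the independent loss, there is a single error term per time step, so the joint reward is explained once by the shared value and the agents' advantages together.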

Training Procedure: We use a standard episodic training procedure, with the task changing on each episode. Here, we describe the training procedure for the MADDRQN model; the training procedure for the MAIDRQN model is similar, see Algorithm 2 in the appendix for details. We assume access to a subset of tasks, $\mathcal{D}^{\text{train}}$, from a task domain, $\mathcal{D}$. First, we initialize the parameters $\theta^P$, $\theta^A$, and $\theta^V$. Then, the following procedure is repeated until convergence. A batch of tasks is uniformly sampled from $\mathcal{D}^{\text{train}}$. For each task $\mathcal{T}$ in the batch, a trajectory, $\tau$, is collected by playing out an episode in an environment configured to $\mathcal{T}$, with actions chosen $\epsilon$-greedy according to $A^P$ and $A^A$. The hidden states for the recurrent LSTM cells are reset to $0$ at the start of each episode. The loss for each trajectory is calculated using Equations 4 and 5. Finally, a gradient step is taken with respect to $\theta^P$, $\theta^A$, and $\theta^V$ on the sum of the episode losses. Algorithm 1 in the appendix describes this training procedure in detail.
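The outer loop of the procedure described above might be organized as in the following skeleton. It is only a sketch: `collect_episode`, `episode_loss`, and `gradient_step` are hypothetical callables standing in for the environment rollout, the trajectory loss, and the optimizer update, and sampling with replacement via `rng.choice` is an assumption about how the batch is drawn.

```python
import random


def epsilon_greedy(advantages, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-advantage one."""
    if rng.random() < epsilon:
        return rng.randrange(len(advantages))
    return max(range(len(advantages)), key=advantages.__getitem__)


def train(train_tasks, collect_episode, episode_loss, gradient_step,
          batch_size, num_steps, epsilon, seed=0):
    """Repeat: sample a batch of tasks, roll out one episode each, update once."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_steps):
        batch = [rng.choice(train_tasks) for _ in range(batch_size)]
        total_loss = 0.0
        for task in batch:
            # The rollout is expected to reset recurrent state at episode start.
            trajectory = collect_episode(task, epsilon, rng)
            total_loss += episode_loss(trajectory)
        gradient_step(total_loss)  # one update on the summed episode losses
        history.append(total_loss)
    return history


# Usage with trivial stand-ins, only to show the call pattern.
recorded_updates = []
loss_history = train(
    train_tasks=["lemons", "plums"],
    collect_episode=lambda task, epsilon, rng: [task],
    episode_loss=lambda trajectory: 1.0,
    gradient_step=recorded_updates.append,
    batch_size=4,
    num_steps=3,
    epsilon=0.1,
)
```

Rewards and updates are applied only after whole episodes, which matches the description of withholding gradient updates until the end of each task.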

5 Experimental Results

| Exp. | Model | Principal Motion Penalty | Grid Shape | Num. Objects | Observations | Observation Window |
|---|---|---|---|---|---|---|
| 1a | MAIDRQN | 0.0 | 5x5 | 10 | Binary Vectors | Full |
| 1b | MAIDRQN | -0.4 | 5x5 | 10 | Binary Vectors | Full |
| 2 | MAIDRQN | 0.0 | 5x5 | 10 | Binary Vectors | 1-Cell |
| 3 | MAIDRQN | -0.1 | 3x1 “L” | 1 | Binary Vectors | 1-Cell |
| 4 | MADDRQN | 0.0 | 5x5 | 10 | Pixels | 2-Cells |

Table 1: Experimental configurations for our 4 experiments. Experiment 1 has two sub-experiments, 1A and 1B. In 1B, the agents incur a penalty whenever the principal moves. The Observation Window column lists the radius of cells visible to each agent.

We design a series of experiments in order to study how different interactive learning behaviors may emerge, to test whether our method can scale to pixel observations, and to evaluate the ability for the agents to transfer to a setting with a human user.

We conduct four experiments on grid-world environments, where the goal is to collect all objects from one of two object classes. Two agents, the principal and the assistant, act simultaneously and may move in one of the four cardinal directions or may choose not to move, giving five possible actions per agent.

Within an experiment, tasks vary by the placement of objects, and by the class of objects to collect, which we call the “target class”. The target class is supplied to the principal as a two dimensional, one-hot vector. Each episode consists of a single task and lasts for 10 time-steps. Table 1 gives the setup for each experiment; see Section A.2 for details.
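The task setup described above can be sketched as a tiny grid-world step function. This is an illustrative sketch only: the action encoding, the clipping at the grid boundary, the reward of +1 for a target-class object and -1 otherwise, and all helper names are assumptions, since the exact environment rules are given in the paper's appendix rather than in this section.

```python
EPISODE_LENGTH = 10  # each episode lasts 10 time steps

# Five actions per agent: stay, or move in one of the four cardinal directions.
ACTIONS = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}


def move(position, action, grid_size=5):
    """Apply an action to a (row, col) position, clipping at the grid boundary."""
    d_row, d_col = ACTIONS[action]
    row = min(max(position[0] + d_row, 0), grid_size - 1)
    col = min(max(position[1] + d_col, 0), grid_size - 1)
    return (row, col)


def one_hot_target(target_class, num_classes=2):
    """Two-dimensional one-hot vector given to the principal only."""
    return [1 if k == target_class else 0 for k in range(num_classes)]


def step(positions, actions, objects, target_class):
    """Move both agents simultaneously and collect any object they land on.

    objects maps (row, col) -> object class and is modified in place.
    Returns the new positions and the joint reward (assumed +1 / -1 scheme).
    """
    new_positions = [move(p, a) for p, a in zip(positions, actions)]
    reward = 0.0
    for pos in new_positions:
        if pos in objects:
            collected_class = objects.pop(pos)
            reward += 1.0 if collected_class == target_class else -1.0
    return new_positions, reward


# Hypothetical task: principal at (0, 0), assistant at (4, 4), one class-0 object.
example_objects = {(0, 1): 0}
example_positions, example_reward = step(
    [(0, 0), (4, 4)], [4, 1], example_objects, target_class=0
)
```

Only the principal would receive `one_hot_target(...)` in its observation; the assistant has to infer the target class from what it sees in the grid.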

We collected 10 training runs per experiment, and report the aggregated performance of the 10 trained agent pairs on 100 test tasks not seen during training. The training batch size was 100 episodes and the models were trained for 150,000 gradient steps (Experiments 1-3) or 40,000 gradient steps (Experiment 4). Videos for all experiments are available on the supplementary website (https://interactive-learning.github.io).

Figure 3: Training curve for Experiment 1B. Error bars are 1 standard deviation. At the end of training, nearly all of the joint reward is due to the assistant’s actions, indicating that the trained assistant can learn the task and complete it independently.

(a) The assistant learns from a single principal movement. (b) The assistant learns from a lack of principal movement.
Figure 4: Episode traces of trained agents on test tasks from Experiment 1B. The assistant rapidly learns the target shape from the principal and collects all instances.
| Method Name | 1A: Joint Reward | 1A: Reward due to P | 1A: Reward due to A | 1B: Joint Reward | 1B: Reward due to P | 1B: Reward due to A |
|---|---|---|---|---|---|---|
| Oracle-A | 4.9 ± 0.2 | 2.5 ± 0.1 | 2.4 ± 0.1 | 4.0 ± 0.1 | 0.0 ± 0.0 | 4.0 ± 0.1 |
| MAIDRQN | 4.6 ± 0.2 | 3.3 ± 0.2 | 1.3 ± 0.3 | 3.6 ± 0.1 | 0.4 ± 0.1 | 3.2 ± 0.1 |
| FeedFwd-A | 4.1 ± 0.1 | 4.1 ± 0.1 | 0.0 ± 0.0 (see Footnote 5) | 2.0 ± 0.4 | 0.7 ± 0.3 | 1.3 ± 0.6 |
| Solo-P | 4.0 ± 0.1 | 4.0 ± 0.1 | N/A | 1.2 ± 0.1 | 1.2 ± 0.1 | N/A |

Table 2: Results for Experiments 1A and 1B. Experiment 1B includes a motion penalty for the principal’s actions. In both experiments, MAIDRQN outperforms the principal acting alone. All performance increases are significant, except for FeedFwd-A and Solo-P in Experiment 1A, which are statistically equivalent.

Footnote 5: The FeedForward assistant moves 80% of the time, but it never collects an object.

Experiment 1 A&B – Learning and Assisting: In this experiment we explore if the assistant can be trained to learn and assist the principal. Table 2 shows the experimental results without and with a penalty for motion of the principal (Experiments 1A and 1B respectively). Figures 3 and 4 show the learning curve and trajectory traces for trained agents in Experiment 1B.

The joint reward of our approach (MAIDRQN) exceeds that of a principal trained to act alone (Solo-P), and approaches the optimal setting where the assistant also observes the target class (Oracle-A). Further, we see that the reward due to the assistant is positive, and even exceeds the reward due to the principal when the motion penalty is present (Experiment 1B). This demonstrates that the assistant learns the task from the principal and assists the principal. Our approach also outperforms an ablation in which the assistant’s LSTM is replaced with a feed-forward network (FeedFwd-A), highlighting the importance of recurrence and memory.

Experiment 2 – Active Information Gathering: In this experiment we explore if, in the presence of additional partial observability, the assistant will take actions to actively seek out information. This experiment restricts the view of each agent to a 1-cell window and only places objects around the exterior of the grid, requiring the assistant to move with the principal and observe its behavior; see Figure 5. Figure 6 shows trajectory traces for two test tasks. The average joint reward, reward due to the principal, and reward due to the assistant are , , and respectively. This shows that our training framework can produce information-seeking behaviors.
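The restricted view used in this experiment can be pictured as a square crop of the grid centered on the agent. The sketch below is one plausible way to compute such a window for a given radius; the function name `observe`, the padding value for cells outside the grid, and the integer cell encoding are assumptions made for illustration.

```python
def observe(grid, position, radius, pad=-1):
    """Return the (2*radius + 1)-wide square window centered on the agent.

    grid is a list of rows; cells outside the grid are filled with `pad`.
    Everything outside the returned window is hidden from the agent.
    """
    num_rows, num_cols = len(grid), len(grid[0])
    center_row, center_col = position
    window = []
    for r in range(center_row - radius, center_row + radius + 1):
        window_row = []
        for c in range(center_col - radius, center_col + radius + 1):
            inside = 0 <= r < num_rows and 0 <= c < num_cols
            window_row.append(grid[r][c] if inside else pad)
        window.append(window_row)
    return window


# Hypothetical 3x3 grid with distinct cell values, viewed from the top-left corner.
example_grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
corner_view = observe(example_grid, position=(0, 0), radius=1)
center_view = observe(example_grid, position=(1, 1), radius=1)
```

With a radius of 1 the assistant cannot see the principal unless the two are in neighboring cells, which is why it has to stay close to the principal to observe a disambiguating action.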

(a) Principal View (b) Assistant View
Figure 5: Visualization of the 1-cell observation window used in experiments 2 and 3. Cell contents outside of an agent’s window are hidden from that agent.
(a) 2 step info. seek (b) 3 step info. seek
Figure 6: Episode traces of trained agents on test tasks from Experiment 2. The assistant moves with the principal until it observes a disambiguating action.
(a) The square should be collected (green), but the assistant does not observe this (grey under green).
(b) The circle should not be collected (red), but the assistant does not observe this (grey under red)
Figure 7: Episode roll-outs for trained agents from Experiment 3. When the assistant is uncertain of an object it requests information from the principal by moving into its visual field and observing the response.

Experiment 3 – Interactive Questioning and Answering: In this experiment we explore if there is a setting where explicit questioning and answering can emerge. On 50% of the tasks, the assistant is allowed to observe the target class. This adds uncertainty for the principal, and discourages it from proactively informing the assistant. Figure 7 shows the first several states of tasks in which the assistant does not observe the target class.

Footnote 6: The test and training sets are the same in Experiment 3, since there are only 8 possible tasks.

The emerged behavior is for the assistant to move into the visual field of the principal, effectively asking the question, then the principal moves until it sees the object, and finally answers the question by moving one step closer only if the object should be collected. The average joint reward, reward due to the principal, and reward due to the assistant are , , and respectively. This demonstrates that our framework can emerge question-answering, interactive behaviors.

Experiment 4 – Learning from and Assisting a Human Principal with Pixel Observations: In this final experiment we explore if our training framework can extend to pixel observations and whether the trained assistant can learn from a human principal. Figure 8 shows examples of the pixel observations. Ten participants, who were not familiar with this research, were paired with the 10 trained assistants, and “played” 20 games with the assistant and 20 games without the assistant. Participants were randomly assigned which setting to play first. Figure 1 shows trajectory traces on test tasks at several points during training and with a human principal after training. Table 3 shows the experimental results.

The participants scored significantly higher with the assistant than without. This demonstrates that our framework can produce agents that can learn from humans.

Unlike the previous experiments, stability was a challenge in this problem setting; most training runs of MAIDRQN became unstable and dropped below 0.1 joint reward before the end of training. Hence, we chose to use the MADDRQN model because we found it to be more stable than MAIDRQN. The failure rate was 64% for MADDRQN vs 75% for MAIDRQN, and the mean failure time was 5.6 hours vs 9.7 hours respectively, which saved training time and was a practical benefit.

(a) Principal
(b) Assistant
Figure 8: Example observations for experiment 4. The principal’s observation also includes a two-dimensional one-hot vector indicating the fruit to collect, plums in this case. These are the 7th observations from the human-agent trajectory in Figure 1(d).

| Players | Joint Reward | Reward due to Principal | Reward due to Assistant |
|---|---|---|---|
| Agent&Agent | 4.6 ± 0.2 | 2.6 ± 0.2 | 2.0 ± 0.2 |
| Human&Agent | 4.2 ± 0.4 | 2.9 ± 0.3 | 1.3 ± 0.5 |
| Agent | 3.9 ± 0.1 | 3.9 ± 0.1 | N/A |
| Human | 3.8 ± 0.3 | 3.8 ± 0.3 | N/A |

Table 3: Results for Experiment 4. The trained assistant is able to learn from the human and significantly increase their score (Human&Agent) over the human acting alone (Human). (See Footnote 8.)

Footnote 8: Significance is based on a t-test of the participants’ change in score, which is more significant than the table’s standard deviations would suggest.
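To see why a test on each participant's change in score can be more significant than the spread of the raw scores suggests, the sketch below computes a paired t statistic on made-up numbers. The scores are purely hypothetical and are not the study's data, and the choice of a paired, one-sample test on the differences is an assumption about the analysis.

```python
import math
import statistics


def paired_t_statistic(with_assistant, without_assistant):
    """t statistic and degrees of freedom for the mean per-participant change."""
    diffs = [w - wo for w, wo in zip(with_assistant, without_assistant)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the changes
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical scores for four participants (NOT the paper's data).
scores_without = [3.0, 3.5, 4.0, 4.5]
scores_with = [3.4, 3.8, 4.5, 4.9]

t_stat, dof = paired_t_statistic(scores_with, scores_without)
spread_between_people = statistics.stdev(scores_without)
```

In this toy example the participants differ from one another by more than the average improvement, yet every participant improves by a similar amount, so the statistic computed on the differences is large.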


6 Summary and Future Work

We introduced the LILA training framework, which trains an assistant to learn interactively from a knowledgeable principal through only physical actions and observations in the environment. LILA produces the assistant by jointly training it with a principal, who is made aware of the task through its observations, on a variety of tasks, and restricting the observation and action spaces to the physical environment. We further introduced the MADDRQN algorithm, in which the agents have individual advantage functions but share a value function during training. MADDRQN fails less frequently than MAIDRQN, which is a practical benefit when training. The experiments demonstrate that, depending on the environment, LILA emerges behaviors such as demonstrations, partial demonstrations, information seeking, and question answering. Experiment 4 demonstrated that LILA scales to environments with pixel observations, and, crucially, that LILA is able to produce agents that can learn from and assist humans.

A possible future extension involves training with populations of agents. In our experiments, the agents sometimes emerged overly co-adapted behaviors. For example, in Experiment 2, the agents tend to always move in the same direction in the first time step, but the direction varies by the training run. This makes agents paired across runs less compatible, and less likely to generalize to human principals. We believe that training an assistant across populations of agents will reduce such co-adapted behaviors. Finally, because LILA relies on the emergence of behaviors, the trained assistant can only learn from behaviors that emerged during training. Further research should seek to minimize these limitations, perhaps through advances in online meta-learning Finn et al. (2019).


  • Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
  • Andreas et al. [2017] Jacob Andreas, Anca Dragan, and Dan Klein. Translating neuralese. arXiv preprint arXiv:1704.06960, 2017.
  • Arumugam et al. [2019] Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and Michael L Littman. Deep reinforcement learning from policy-dependent human feedback. arXiv preprint arXiv:1902.04257, 2019.
  • Barto and Sutton [1998] Andrew Barto and Richard S. Sutton. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Bengio et al. [1991] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. In Proc. of the International Joint Conference on Neural Networks (IJCNN), 1991.
  • Bertsekas and Tsitsiklis [1996] Dmitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • Borsa et al. [2017] Diana Borsa, Bilal Piot, Remi Munos, and Olivier Pietquin. Observational learning by reinforcement learning, 2017.
  • Brown et al. [2018] Daniel S. Brown, Yuchen Cui, and Scott Niekum. Risk-aware active inverse reinforcement learning. In Proc. of Conference on Robot Learning (CoRL), 2018.
  • Chebotar et al. [2017] Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning (ICML), 2017.
  • Christiano et al. [2017] Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proc. of the Conference on Neural Information Processing Systems (NIPS), 2017.
  • Duan et al. [2016] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning, 2016.
  • Duan et al. [2017] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017.
  • Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. of the International Conference on Machine Learning (ICML), 2017.
  • Finn et al. [2019] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.
  • Foerster et al. [2016] Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Proc. of the Conference on Neural Information Processing Systems (NIPS), 2016.
  • Foerster et al. [2018] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Gupta et al. [2017] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Proc. of Autonomous Agents and Multiagent Systems (AAMAS), 2017.
  • Hadfield-Menell et al. [2016] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning. In Proc. of the Conference on Neural Information Processing Systems (NIPS), 2016.
  • Hausknecht and Stone [2015] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series, 2015.
  • Hester et al. [2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. In AAAI, 2018.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
  • Ibarz et al. [2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In Proc. of the Conference on Neural Information Processing Systems (NIPS), 2018.
  • Isbell et al. [2001] Charles Lee Isbell, Christian R. Shelton, Michael Kearns, Satinder P. Singh, and Peter Stone. Cobot: A social reinforcement learning agent. In Proc. of the Conference on Neural Information Processing Systems (NIPS), 2001.
  • James et al. [2018] Stephen James, Michael Bloesch, and Andrew J. Davison. Task-embedded control networks for few-shot imitation learning. In Proc. of the Conference on Robot Learning (CoRL), 2018.
  • Kelly et al. [2018] Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. HG-DAgger: Interactive imitation learning with human experts, 2018.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of the International Conference for Learning Representations (ICLR), 2015.
  • Krening [2018] Samantha Krening. Newtonian action advice: Integrating human verbal instruction with reinforcement learning, 2018.
  • Lazaridou et al. [2016] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
  • LeCun et al. [1995] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
  • Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Littman [1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
  • Matignon et al. [2007] Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Proc. of the International Conference on Intelligent Robots and Systems (IROS), 2007.
  • Mindermann et al. [2018] Soren Mindermann, Rohin Shah, Adam Gleave, and Dylan Hadfield-Menell. Active inverse reward design. In Proc. of the ICML/IJCAI/AAMAS Workshop on Goals for Reinforcement Learning, 2018.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. In Proc. of the Conference on Neural Information Processing Systems (NIPS), Workshop on Deep Learning, 2013.
  • Mordatch and Abbeel [2018] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proc. of the AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Palmer et al. [2018] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. In Proc. of Autonomous Agents and Multiagent Systems (AAMAS), 2018.
  • Peloquin et al. [2019] Benjamin N. Peloquin, Noah D. Goodman, and Michael C. Frank. The interactions of rational, pragmatic agents lead to efficient language structure and use. PsyArXiv, 2019.
  • Perez-Dattari et al. [2018] Rodrigo Perez-Dattari, Carlos Celemin, Javier Ruiz del Solar, and Jens Kober. Interactive learning with corrective feedback for policies based on deep neural networks. In Proc. of the International Symposium on Experimental Robotics (ISER), 2018.
  • Potter and Jong [1994] Mitchell A. Potter and Kenneth A. De Jong. A cooperative coevolutionary approach to function optimization. The Third Parallel Problem Solving From Nature, pages 249–257, 1994.
  • Ross et al. [2011a] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011a.
  • Ross et al. [2011b] Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011b.
  • Santoro et al. [2016] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. One-shot learning with memory-augmented neural networks. In Proc. of the International Conference on Machine Learning (ICML), 2016.
  • Schaal [1999] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, pages 233–242, 1999.
  • Schmidhuber [1987] Jürgen Schmidhuber. Evolutionary Principles in Self-Referential Learning. PhD thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
  • Singh et al. [2009] Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from. In Proceedings of the annual conference of the cognitive science society, pages 2601–2606, 2009.
  • Thomaz et al. [2005] Andrea Lockerd Thomaz, Guy Hoffman, and Cynthia Breazeal. Real-time interactive reinforcement learning for robots. In AAAI, 2005.
  • Thrun and Pratt [2012] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
  • Wang et al. [2016] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Wang et al. [2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
  • Warnell et al. [2018] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Watkins [1989] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
  • Xu et al. [2018] Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, and Chelsea Finn. Learning a prior over intent via meta-inverse reinforcement learning, 2018.
  • Yu et al. [2018a] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018a.
  • Yu et al. [2018b] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In Proc. of Robotics: Science and Systems (RSS), 2018b.

Appendix A Appendix

A.1 Model Architectures

A.1.1 Multi-Agent Independent DRQN (MAIDRQN)

The MAIDRQN model uses independent DRQN models; see Figure 1(a). There is an independent parallel path for each agent, giving one action-value (Q) function for the principal and one for the assistant. Each path consists of a convolutional neural network (CNN) [LeCun et al., 1995], followed by a long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997], followed by a fully-connected layer. Each CNN is a 2-layer network with 10 3x3, stride-1 filters per layer and rectified linear unit (ReLU) activations. The LSTMs have 50 hidden units. The fully-connected layer outputs are 5-dimensional, representing Q-values for the 5 actions in the environment.

A.1.2 Multi-Agent Dueling DRQN (MADDRQN)

Up to the final layer, the MADDRQN model is structured like the MAIDRQN model, with the following modifications. The CNN filter weights are shared between the agents. The first CNN layer has 16 8x8 filters with stride 4, and the second layer has 32 8x8 filters with stride 2, both with ReLU activations. A 256-unit fully-connected layer with ReLU activations is inserted between the CNN and the LSTM, also with weights shared between the agents. Information about the current task is represented as a one-hot vector and concatenated to the output of the fully-connected layer before it is fed to the principal’s LSTM. The output of each LSTM is fed to a separate fully-connected layer with 5 outputs, and the softmax across the 5 outputs is subtracted from each output, producing an advantage function for each agent. In a separate path, the outputs from the two LSTMs are concatenated and fed to a fully-connected layer with a single output, representing the shared joint value function. The MADDRQN model is depicted in Figure 1(b).
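As a quick check on the layer sizes above, the following sketch computes the spatial size of the CNN feature maps for the 64x64x3 observations described in the experimental details for Experiment 4. It assumes unpadded ("valid") convolutions, which the text does not state, and the helper name `conv_out` is illustrative only.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# MADDRQN CNN as described: 16 filters of 8x8 with stride 4,
# then 32 filters of 8x8 with stride 2.
# Assumption: no padding, and a 64x64 input image.
side_after_conv1 = conv_out(64, kernel=8, stride=4)
side_after_conv2 = conv_out(side_after_conv1, kernel=8, stride=2)

# Number of values fed to the 256-unit fully-connected layer.
flat_features = side_after_conv2 * side_after_conv2 * 32

print(side_after_conv1, side_after_conv2, flat_features)
```

Under these assumptions the 256-unit fully-connected layer would receive a 512-dimensional input; a padded convolution would give a different size.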

A.1.3 Multi-Agent Independent DRQN (MAIDRQN) (Experiment 4)

In Experiment 4, we compared MADDRQN with MAIDRQN. For a fair comparison, the MAIDRQN model in Experiment 4 is the MADDRQN model without the value function and without the softmax subtracted from the fully-connected output layers, giving an independent action-value function for each agent. All other aspects, including the fully-connected layer after the CNN layers and the sharing of weights, are identical to MADDRQN.

A.2 Experimental Details

All experiments use ε-greedy exploration during training and greedy (argmax) action selection when evaluating. The Adam optimizer with default parameters was used to train the models [Kingma and Ba, 2015].
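The action-selection rule can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `epsilon_greedy` and its signature are assumptions, and the exploration rate is left as a parameter because its value is not recoverable from the text.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: Optional[random.Random] = None) -> int:
    """Pick a random action with probability epsilon, else the argmax action."""
    source = rng if rng is not None else random
    if source.random() < epsilon:
        # Explore: uniform over all actions.
        return source.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Evaluation corresponds to epsilon = 0 (pure argmax).
q = [0.1, 0.9, 0.3, 0.0, 0.2]
greedy_action = epsilon_greedy(q, epsilon=0.0)
```

Setting `epsilon=0.0` recovers the argmax policy used at evaluation time.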

All experiments operate on a grid. On every step, each agent can take one of five actions: move left, move right, move up, move down, or stay in place. Actions are taken in parallel, and the subsequent observations reflect the effect of both agents’ actions. If either agent moves onto a cell containing an object, then the object is removed and considered “collected”, with resulting joint reward defined by . The agents may occupy the same cell in the shape environment of Experiments 1-3, but collide with each other in the fruit environment of Experiment 4; if, in the fruit environment, the agents attempt to move onto the same cell, the principal gets to move to the cell and the assistant remains at its current location.
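The parallel-move and collision rule can be sketched as below. The action names, the (row, column) coordinate convention, the clipping of moves at the grid boundary, and the function name `joint_step` are all assumptions for illustration; object collection and rewards are omitted.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col); an assumed convention

# Assumed action encoding; the text lists the five actions but not their indices.
MOVES: Dict[str, Cell] = {
    "left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0), "stay": (0, 0),
}


def _target(pos: Cell, action: str, size: int) -> Cell:
    """Cell the agent tries to enter; off-grid moves are clipped (an assumption)."""
    d_row, d_col = MOVES[action]
    row = min(max(pos[0] + d_row, 0), size - 1)
    col = min(max(pos[1] + d_col, 0), size - 1)
    return (row, col)


def joint_step(principal: Cell, assistant: Cell, a_p: str, a_a: str,
               size: int, agents_collide: bool) -> Tuple[Cell, Cell]:
    """Apply both actions in parallel and return the new positions."""
    new_p = _target(principal, a_p, size)
    new_a = _target(assistant, a_a, size)
    if agents_collide and new_p == new_a:
        # Fruit environment: the principal takes the contested cell and the
        # assistant stays put. Other cases (swaps, moving onto a stationary
        # agent) are not specified in the text and are not handled specially.
        new_a = assistant
    return new_p, new_a
```

With `agents_collide=False` the sketch matches the shape environment, where both agents may occupy the same cell.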

In experiments using the shape environment, the agents observe the world as a grid of binary vectors, one binary vector for each cell. Some fields in the binary vector are zeroed out depending on the agent and the experiment. Each binary vector contains: 1 bit indicating if the cell is visible to the agent, 1 bit indicating the presence of the principal, 1 bit indicating the presence of the assistant, 1 bit indicating the presence of an object, 2 bits indicating the class of the object in one-hot form, 2 bits indicating whether the object should be collected in one-hot form. All bits are zeroed if the cell is not visible to the agent, and all object bits are zeroed if no object is present in the cell.
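A minimal sketch of the 8-bit cell encoding follows. The bit order mirrors the order in which the fields are listed above, and the choice of which one-hot position means "collect" is an assumption; the function name `encode_cell` is hypothetical.

```python
from typing import List, Optional


def encode_cell(visible: bool, principal: bool, assistant: bool,
                object_class: Optional[int],
                should_collect: Optional[bool]) -> List[int]:
    """Return the 8-bit vector for one grid cell.

    Assumed layout: [visible, principal, assistant, object present,
    class one-hot (2 bits), should-collect one-hot (2 bits)].
    """
    if not visible:
        # All bits are zeroed for cells the agent cannot see.
        return [0] * 8
    bits = [1, int(principal), int(assistant)]
    if object_class is None:
        # No object: the object bit and both one-hot fields are zeroed.
        bits += [0, 0, 0, 0, 0]
    else:
        class_bits = [1, 0] if object_class == 0 else [0, 1]
        collect_bits = [1, 0] if should_collect else [0, 1]
        bits += [1] + class_bits + collect_bits
    return bits


empty_cell = encode_cell(True, False, True, None, None)
object_cell = encode_cell(True, True, False, 1, True)
```

The fields sum to 1 + 1 + 1 + 1 + 2 + 2 = 8 bits per cell.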

In Experiment 1, the bits indicating whether the object should be collected are set to zero in the assistant’s binary vectors.

In Experiment 2, the bits indicating whether the object should be collected are set to zero in the assistant’s binary vectors. Further, for both agents, if a cell is more than 1 cell away from the agent, then all bits in the binary vector for that cell are set to zero.

In Experiment 3, on half of the tasks, the bits indicating whether the object should be collected are set to zero in the assistant’s binary vectors. Further, for both agents, if a cell is more than 1 cell away from the agent, then all bits in the binary vector for that cell are set to zero.
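The per-experiment masking in Experiments 1-3 can be sketched as one function. It assumes that the last two bits of each 8-bit cell vector are the should-collect bits, and that "more than 1 cell away" means Chebyshev distance (a 3x3 window around the agent); neither is stated explicitly in the text, and `mask_observation` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Grid = List[List[List[int]]]  # grid[row][col] is an 8-bit cell vector


def mask_observation(grid: Grid, agent_pos: Tuple[int, int],
                     hide_goodness: bool,
                     view_radius: Optional[int]) -> Grid:
    """Return a masked copy of the observation for one agent."""
    masked: Grid = []
    for r, row in enumerate(grid):
        out_row = []
        for c, cell in enumerate(row):
            cell = list(cell)  # copy so the input grid is left untouched
            if hide_goodness:
                # Zero the (assumed) should-collect bits.
                cell[6:8] = [0, 0]
            if view_radius is not None:
                # Assumed Chebyshev distance from the agent to the cell.
                dist = max(abs(r - agent_pos[0]), abs(c - agent_pos[1]))
                if dist > view_radius:
                    cell = [0] * len(cell)
            out_row.append(cell)
        masked.append(out_row)
    return masked


# Experiment 2, assistant's view: goodness hidden and a 1-cell view radius.
demo_grid = [[[1] * 8 for _ in range(4)]]
demo_view = mask_observation(demo_grid, (0, 0), hide_goodness=True, view_radius=1)
```

Under these assumptions, Experiment 1 corresponds to `hide_goodness=True, view_radius=None` for the assistant, and Experiment 3 sets `hide_goodness=True` on only half of the tasks.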

In Experiment 4, using the fruit environment, the agents observe the world as a 64x64x3 color image centered on the respective agent (see Figure 8). The camera view includes the entire world when the agent is at the center of the grid, but the world becomes partially observable as the agent moves away from the center. The principal receives the class to forage, in one-hot form, as an additional observation; this class is concatenated to the principal’s LSTM input. In Experiment 4, the assistant never observes the class of objects to collect.

A.3 Training Algorithms

Algorithm 1 Training Procedure using the MADDRQN model
Require: A set of training tasks
Require: A recurrent artificial neural network that outputs an advantage function for each agent and a shared joint value function
Require: An environment interface with two methods
Require: The model parameters
  Initialize the model parameters
  while not done do
     Uniformly sample a batch of tasks from the training set
     for all tasks in the batch do
        Reset the memory of the recurrent neural networks
        repeat
           Choose an action for each agent using ε-greedy exploration on that agent’s action-value function
        until end of episode
        Calculate the loss using Equations 4 and 5
     end for
     Take a gradient step on the loss w.r.t. the model parameters
  end while

Algorithm 2 Training Procedure using the MAIDRQN model
Require: A set of training tasks
Require: Two recurrent artificial neural networks that output the action-value functions of the two agents
Require: An environment interface with two methods
Require: The parameters of the two networks
  Initialize the parameters of the two networks
  while not done do
     Uniformly sample a batch of tasks from the training set
     for all tasks in the batch do
        Reset the memory of the recurrent neural networks
        repeat
           Choose an action for each agent using ε-greedy exploration on that agent’s action-value function
        until end of episode
        Calculate the loss using Equation 2
     end for
     Take a gradient step on the loss w.r.t. the parameters of the two networks
  end while

Algorithm 2 describes the training procedure for the Multi-Agent Independent DRQN (MAIDRQN) model, used in Experiments 1-3. Algorithm 1 describes the training procedure for our Multi-Agent Dueling DRQN (MADDRQN) model, used in Experiment 4.
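The outer structure shared by Algorithms 1 and 2 can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the authors' implementation: the method names `reset`, `step`, `reset_memory`, `q_values`, and `gradient_step` are hypothetical, tasks are sampled with replacement, the loss (Equation 2, or Equations 4 and 5) is hidden behind `gradient_step`, and `ToyEnv` and `ToyModel` are stand-ins that only make the loop runnable.

```python
import random
from typing import List, Sequence, Tuple


def pick_action(q: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy choice over one agent's action values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def train(tasks, env, model, num_iterations: int, batch_size: int,
          epsilon: float, seed: int = 0) -> int:
    """Run the training loop; returns the total number of environment steps."""
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(num_iterations):  # stands in for "while not done"
        batch = [rng.choice(tasks) for _ in range(batch_size)]
        episodes: List[List[Tuple[int, int, float]]] = []
        for task in batch:
            model.reset_memory()  # reset the recurrent state per episode
            obs_p, obs_a = env.reset(task)
            done = False
            transitions = []
            while not done:
                q_p, q_a = model.q_values(obs_p, obs_a)
                a_p = pick_action(q_p, epsilon, rng)
                a_a = pick_action(q_a, epsilon, rng)
                (obs_p, obs_a), reward, done = env.step(a_p, a_a)
                transitions.append((a_p, a_a, reward))
                total_steps += 1
            episodes.append(transitions)
        model.gradient_step(episodes)  # loss and optimizer live here
    return total_steps


class ToyEnv:
    """Stand-in environment: fixed-length episodes with zero reward."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self, task):
        self.t = 0
        return (task, task)

    def step(self, a_p: int, a_a: int):
        self.t += 1
        return (self.t, self.t), 0.0, self.t >= self.horizon


class ToyModel:
    """Stand-in model: constant Q-values, counts resets and updates."""

    def __init__(self) -> None:
        self.updates = 0
        self.resets = 0

    def reset_memory(self) -> None:
        self.resets += 1

    def q_values(self, obs_p, obs_a):
        return [0.0] * 5, [0.0] * 5

    def gradient_step(self, episodes) -> None:
        self.updates += 1


toy_model = ToyModel()
steps_taken = train([0, 1, 2], ToyEnv(horizon=3), toy_model,
                    num_iterations=4, batch_size=2, epsilon=0.1)
```

In a real implementation, `gradient_step` would compute the loss over the collected episodes and apply one Adam update, as described in the experimental details.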

A.4 Experiment 3 Rollouts

(a) The assistant immediately consumes the good object when it observes object goodness
(b) The assistant asks the principal whether the object is good when it does not observe object goodness.
(c) The assistant ignores the bad object when it observes object goodness
(d) The assistant asks the principal whether the object is good when it does not observe object goodness.
Figure 9: The assistant has learned how and when to ask the principal questions. Whether the assistant observes object goodness varies randomly by episode. When asking a question, as shown in cases (b) and (d), the assistant moves into the visual field of the principal. Next, the principal moves to the right to observe the object, then moves one step closer if the object is good or does not move if the object is bad. The assistant understands this answer and proceeds correctly.

Figure 9 shows rollouts of the emerged behaviors in 4 of the 8 tasks.

In the case where the assistant observes object goodness and the object is bad, Figure 9(c), the assistant correctly does not collect the bad object. However, it does move into the principal’s visual field later in the episode. A late appearance in the principal’s visual field is evidently not a well-formed question, since the principal does not move to answer. The emergence of time-step-dependent behaviors such as this might be avoided by starting agents at different random steps in the episode.