1 Introduction
Reinforcement learning (RL) algorithms have the potential to automate a wide range of decision-making and control tasks across a variety of domains, as demonstrated by successful recent applications ranging from robotic control (Kober & Peters, 2012; Levine et al., 2016) to game playing (Mnih et al., 2015; Silver et al., 2016). A key assumption of the RL problem statement is the availability of a reward function that accurately describes the desired task. For many real-world tasks, reward functions can be challenging to specify manually, yet they are crucial to good performance (Amodei et al., 2016). Most real-world tasks are multifaceted: the reward must account for multiple factors (e.g., an autonomous vehicle navigating a city at night) while also providing appropriate shaping to make the task feasible with tractable exploration (Ng et al., 1999). These challenges are compounded by the inherent difficulty of specifying rewards for tasks with high-dimensional observation spaces such as images.
Inverse reinforcement learning (IRL) aims to address this problem by instead inferring the reward function from demonstrations of the task (Ng & Russell, 2000). This has the appealing benefit of taking a data-driven approach to reward specification in place of hand engineering. In practice, however, reward functions are rarely learned, as it can be prohibitively expensive to provide demonstrations that cover the variability common in real-world tasks (e.g., collecting demonstrations of opening every type of door knob). In addition, while learning a complex function from high-dimensional observations might make an expressive function approximator seem like a reasonable modelling assumption, in the “few-shot” domain it is notoriously difficult to unambiguously recover a good reward function with expressive function approximators. Prior solutions have thus instead relied on low-dimensional linear models with handcrafted features that effectively encode a strong prior on the relevant features of a task. This requires engineering a set of features by hand that work well for a specific problem. In this work, we propose an approach that instead explicitly learns expressive features that are robust even when learning with limited demonstrations.
Our approach relies on the key observation that related tasks share common structure that we can leverage when learning new tasks. To illustrate, consider a robot navigating through a home. While the exact reward function we provide to the robot may differ depending on the task, there is structure in the space of useful behaviors, such as navigating to a series of landmarks, and there are certain behaviors we always want to encourage or discourage, such as avoiding obstacles or staying a reasonable distance from humans. This notion agrees with our understanding of why humans can easily infer the intents and goals (i.e., reward functions) of even abstract agents from just one or a few demonstrations (Baker et al., 2007), as humans have access to strong priors, accrued over many years, about how other humans accomplish similar tasks. Similarly, our objective is to discover the common structure among different tasks, and encode that structure in a way that can be used to infer reward functions from a few demonstrations.
More specifically, in this work we assume access to a set of tasks, along with demonstrations of the desired behaviors for those tasks, which we refer to as the meta-training set. From these tasks, we then learn a reward function parameterization that enables effective few-shot learning when used to initialize IRL in a novel task. Our method is summarized in Fig. 1. Our key contribution is an algorithm that enables efficient learning of new reward functions by using meta-training to build a rich “prior” for goal inference. Using our proposed approach, we show that we can learn deep neural network reward functions from raw pixel observations with substantially better data efficiency than existing methods and standard baselines.
2 Related Work
Inverse reinforcement learning (IRL) (Ng & Russell, 2000) is the problem of inferring an expert’s reward function directly from demonstrations. Prior methods for performing IRL range from margin-based approaches (Abbeel & Ng, 2004; Ratliff et al., 2006) to probabilistic approaches (Ramachandran & Amir, 2007; Ziebart et al., 2008). Although it is possible to extend our approach to any other IRL method, in this work we base our work on the maximum entropy (MaxEnt) framework (Ziebart et al., 2008). In addition to allowing for suboptimality in the expert demonstrations, MaxEnt IRL can be reframed as a maximum likelihood estimation problem.
In part to combat the underspecified nature of IRL, prior work has often used low-dimensional linear parameterizations with handcrafted features (Abbeel & Ng, 2004; Ziebart et al., 2008). In order to learn from high-dimensional input, Wulfmeier et al. (2015) proposed applying fully convolutional networks (Shelhamer et al., 2017) to the MaxEnt IRL framework (Ziebart et al., 2008) for several navigation tasks (Wulfmeier et al., 2016a, b). Other methods that have incorporated neural network rewards include guided cost learning (GCL) (Finn et al., 2017a), which uses importance sampling and regularization for scalability to high-dimensional spaces, and adversarial IRL (Fu et al., 2018). Several other methods have also proposed imitation learning approaches based on adversarial frameworks that resemble IRL, but do not aim to directly recover a reward function (Ho & Ermon, 2016; Li et al., 2017; Hausman et al., 2017; Kuefler & Kochenderfer, 2018). In this work, instead of improving the ability to learn reward functions on a single task, we focus on the problem of effectively learning to use prior demonstration data from other IRL tasks, allowing us to learn new tasks from a limited number of demonstrations even with expressive nonlinear reward functions.

Prior work has explored the problem of multitask IRL, where the demonstrated behavior is assumed to have originated from multiple experts achieving different goals. Some of these approaches aim to incorporate a shared prior over reward functions by extending the Bayesian IRL framework (Ramachandran & Amir, 2007) to the multitask setting (Dimitrakakis & Rothkopf, 2012; Choi & Kim, 2012). Other approaches have clustered demonstrations while simultaneously inferring reward functions for each cluster (Babeş-Vroman et al., 2011) or introduced regularization between rewards to a common “shared reward” (Li & Burdick, 2017). Our work is similar in that we also seek to encode prior information common to the tasks. However, a critical difference is that our method specifically aims to distill the meta-training tasks into a prior that can then be used to learn rewards for new tasks efficiently. The goal therefore is not to acquire good reward functions that explain the meta-training tasks, but rather to use them to learn efficiently on new tasks.
Our approach builds on work on the broader problem of meta-learning (Schmidhuber, 1987; Bengio et al., ; Naik & Mammone, 1992; Thrun & Pratt, 2012) and generative modelling (Rezende et al., 2016; Reed et al., 2018; Mordatch, 2018). Prior work has proposed a variety of solutions for learning to learn, including memory-based methods (Duan et al., 2016; Santoro et al., 2016; Wang et al., 2016; Mishra et al., 2017), methods that learn an optimizer and/or initialization (Andrychowicz et al., 2016; Ravi & Larochelle, 2016; Finn et al., 2017a; Li & Malik, 2017), and methods that compare new datapoints in a learned metric space (Koch, 2015; Vinyals et al., 2016; Shyam et al., 2017; Snell et al., 2017). Our work is motivated by the goal of broadening the applicability of IRL, but in principle it is possible to adapt many of these meta-learning approaches for our problem statement. We leave it to future work to do a comprehensive investigation of different meta-learning approaches which could broaden the applicability of IRL.
3 Preliminaries and Overview
In this section, we introduce our notation and describe the IRL and meta-learning problems.
3.1 Learning Rewards via Inverse Reinforcement Learning
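The standard Markov decision process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, p_s, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the set of possible states and actions respectively, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\gamma \in [0, 1]$ is the discount factor and $p_s: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ denotes the transition distribution over the next state $s_{t+1}$, given the current state $s_t$ and current action $a_t$. Typically, the goal of “forward” RL is to maximize the expected discounted return $R(\tau) = \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t)$.
In IRL, we instead assume that the reward function is unknown but that we have access to a set of expert demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_K\}$, where $\tau_k = \{s_1, a_1, \ldots, s_T, a_T\}$.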
The goal of IRL is to recover the unknown reward function $r$ from the set of demonstrations. We build on the maximum entropy (MaxEnt) IRL framework of Ziebart et al. (2008), which models the probability of the trajectories as being distributed proportional to their exponentiated return:
$$p(\tau) = \frac{1}{Z} \exp\left(R(\tau)\right), \tag{1}$$
where $Z$ is the partition function, $Z = \int_{\tau} \exp\left(R(\tau)\right) d\tau$. This distribution can be shown to be induced by the optimal policy in the entropy-regularized forward RL problem:
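$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} r(s_t, a_t) - \log \pi(a_t \mid s_t)\right]. \tag{2}$$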
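This formulation allows us to pose the reward learning problem as a maximum likelihood estimation (MLE) problem in an energy-based model $r_{\theta}$:
$$\min_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D}}\left[\mathcal{L}_{\mathrm{IRL}}(\tau)\right] = \min_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D}}\left[-\log p_{r_{\theta}}(\tau)\right]. \tag{3}$$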
Learning in general energy-based models of this form is common in many applications such as structured prediction. However, in contrast to applications where learning can be supervised by millions of labels (e.g., semantic segmentation), the learning problem in Eq. 3 must typically be performed with a relatively small number of example demonstrations. In this work, we seek to address this issue in IRL by providing a way to integrate information from prior tasks to constrain the optimization in Eq. 3 in the regime of limited demonstrations.
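To make the maximum likelihood view concrete, the following is a minimal sketch, assuming a toy setting in which the set of possible trajectories is small enough to enumerate so that the partition function is a finite sum. The function names and the example returns are hypothetical and are not part of the method described here.

```python
import math


def maxent_trajectory_probs(returns):
    """Return p(tau) = exp(R(tau)) / Z for a finite list of trajectory returns."""
    shift = max(returns)  # subtract the max return for numerical stability
    weights = [math.exp(r - shift) for r in returns]
    z = sum(weights)  # partition function (up to the constant shift)
    return [w / z for w in weights]


def demo_nll(returns, demo_indices):
    """Average negative log-likelihood of the demonstrated trajectories."""
    probs = maxent_trajectory_probs(returns)
    return -sum(math.log(probs[i]) for i in demo_indices) / len(demo_indices)


# Hypothetical example: three candidate trajectories with returns 0, 1 and 2,
# and three demonstrations that mostly follow the highest-return trajectory.
returns = [0.0, 1.0, 2.0]
probs = maxent_trajectory_probs(returns)
nll = demo_nll(returns, [2, 2, 1])
```

Under this model, trajectories whose returns differ by one unit differ in probability by a factor of $e$, which is how the model tolerates suboptimal demonstrations while still preferring high-return behavior.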
3.2 Meta-Learning
The goal of meta-learning algorithms is to optimize for the ability to learn efficiently on new tasks. Rather than attempt to generalize to new datapoints, meta-learning can be understood as attempting to generalize to new tasks. It is assumed in the meta-learning setting that there are two sets of tasks that we refer to as the meta-training set $\{\mathcal{T}_i;\ i = 1, \ldots, N\}$ and meta-test set $\{\mathcal{T}_j;\ j = 1, \ldots, M\}$, which are both drawn from a distribution $p(\mathcal{T})$. During meta-training time, the meta-learner attempts to learn the structure of the tasks in the meta-training set, such that when it is presented with a new test task, it can leverage this structure to learn efficiently from a limited number of examples.
To illustrate this distinction, consider the few-shot learning setting. Let $f$ denote the learner, and let a task be defined by learning from training examples $X^{tr}_{\mathcal{T}}$, $Y^{tr}_{\mathcal{T}}$, and evaluating on test examples $X^{test}_{\mathcal{T}}$, $Y^{test}_{\mathcal{T}}$. One approach to meta-learning is to directly parameterize the meta-learner with an expressive model such as a recurrent or recursive neural network (Duan et al., 2016; Mishra et al., 2017) conditioned on the task training data and the inputs for the test task: $Y^{test}_{\mathcal{T}} = f_{\theta}(X^{tr}_{\mathcal{T}}, Y^{tr}_{\mathcal{T}}, X^{test}_{\mathcal{T}})$. Such a model is optimized using log-likelihood across all tasks. In this approach to meta-learning, since neural networks are known to be universal function approximators (Siegelmann & Sontag, 1995), any desired structure between tasks can be implicitly encoded.
Rather than learn a single black-box function, another approach to meta-learning is to learn components of the learning procedure such as the initialization (Finn et al., 2017a) or the optimization algorithm (Ravi & Larochelle, 2016; Andrychowicz et al., 2016). In this work we extend the approach of model-agnostic meta-learning (MAML) introduced by Finn et al. (2017a), which learns an initialization that is adapted by gradient descent. Concretely, in the supervised learning case, given a loss function $\mathcal{L}_{\mathcal{T}}$ (e.g., cross-entropy), MAML performs the following optimization:
$$\min_{\theta} \sum_{\mathcal{T}} \mathcal{L}^{test}_{\mathcal{T}}\left(f_{\phi_{\mathcal{T}}}\right) = \min_{\theta} \sum_{\mathcal{T}} \mathcal{L}^{test}_{\mathcal{T}}\left(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}^{tr}_{\mathcal{T}}(f_{\theta})}\right), \tag{4}$$
where the optimization is over an initial set of parameters $\theta$ and the loss on the held-out tasks $\mathcal{L}^{test}_{\mathcal{T}}$ becomes the signal for learning the initial parameters for gradient descent on $\mathcal{L}^{tr}_{\mathcal{T}}$. This optimization is analogous to adding a constraint in a multitask setting, which we show in later sections is analogous in our setting to learning a prior over reward functions.
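The bilevel structure of this objective can be illustrated with a minimal sketch, assuming a deliberately simple setting: a scalar parameter and per-task quadratic losses $\frac{1}{2}(\theta - c)^2$, so that the gradient through the inner update is available in closed form. The task targets, step sizes and function names below are hypothetical choices for illustration only.

```python
def inner_update(theta, c_train, alpha):
    """One gradient step on the training loss 0.5 * (theta - c_train) ** 2."""
    grad = theta - c_train
    return theta - alpha * grad


def meta_gradient(theta, c_train, c_test, alpha):
    """Gradient of the post-update test loss with respect to the initialization."""
    phi = inner_update(theta, c_train, alpha)
    dphi_dtheta = 1.0 - alpha  # derivative of the inner update w.r.t. theta
    dloss_dphi = phi - c_test  # derivative of 0.5 * (phi - c_test) ** 2
    return dphi_dtheta * dloss_dphi


def maml(tasks, alpha=0.1, beta=0.5, steps=200, theta=0.0):
    """Gradient descent on the average post-update test loss across tasks."""
    for _ in range(steps):
        grads = [meta_gradient(theta, c_tr, c_te, alpha) for c_tr, c_te in tasks]
        theta -= beta * sum(grads) / len(grads)
    return theta


# Two hypothetical tasks, each given as (train target, test target).
theta_star = maml([(-1.0, -1.0), (3.0, 3.0)])
```

In this toy case the learned initialization settles midway between the two task optima, the point from which one gradient step reaches either task with the smallest average test loss.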
4 Learning to Learn Rewards
Our goal in meta-IRL is to learn how to learn reward functions across many tasks such that the model can infer the reward function for a new task using only one or a few expert demonstrations. Intuitively, we can view this problem as aiming to learn a prior over the intentions of human demonstrators, such that when given just one or a few demonstrations of a new task, we can combine the learned prior with the new data to effectively determine the human’s reward function. Such a prior is helpful in inverse reinforcement learning settings, since the space of relevant reward functions is much smaller than the space of all possible rewards definable on the raw observations.
During meta-training, we have a set of tasks $\{\mathcal{T}_i;\ i = 1, \ldots, N\}$. Each task $\mathcal{T}_i$ has a set of demonstrations $\mathcal{D}_{\mathcal{T}_i}$ from an expert policy, which we partition into disjoint $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and $\mathcal{D}^{test}_{\mathcal{T}_i}$ sets. The demonstrations for each meta-training task are assumed to be produced by the expert according to the maximum entropy model in Section 3.1. During meta-training, these tasks will be used to encode common structure so that our model can quickly acquire rewards for new tasks from just a few demonstrations.
After meta-training, our method is presented with a new task. During this meta-test phase, the algorithm must infer the parameters of the reward function for the new task from a few demonstrations. As is standard in meta-learning, we assume that the test task is from the same distribution of tasks seen during meta-training, a distribution that we denote as $p(\mathcal{T})$.
4.1 Meta Reward and Intention Learning (MandRIL)
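In order to meta-learn a reward function that can act as a prior for new tasks and new environments, we first formalize the notion of a good reward by defining a loss $\mathcal{L}^{\mathcal{T}}_{\mathrm{IRL}}(\theta)$ on the reward function $r_{\theta}$ for a particular task $\mathcal{T}$. We use the MaxEnt IRL loss discussed in Section 3, which, for a given $\mathcal{D}_{\mathcal{T}}$, leads to the following gradient (Ziebart et al., 2008):
$$\nabla_{\theta} \mathcal{L}^{\mathcal{T}}_{\mathrm{IRL}}(\theta) = \frac{\partial r_{\theta}}{\partial \theta} \left[\mathbb{E}_{\tau}[\mu_{\tau}] - \mu_{\mathcal{D}_{\mathcal{T}}}\right], \tag{5}$$
where $\mu_{\tau}$ are the state visitations under the optimal maximum entropy policy under $r_{\theta}$, and $\mu_{\mathcal{D}_{\mathcal{T}}}$ are the mean state visitations under the demonstrated trajectories.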
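As a minimal sketch of this gradient, assume the simplest possible reward parameterization: a tabular reward with one parameter per state, so that the Jacobian of the reward with respect to the parameters is the identity and the gradient reduces to a difference of visitation vectors. The function names and the example numbers are hypothetical, and the expected visitations are taken as given rather than computed from a policy.

```python
def mean_state_visitations(demos, num_states):
    """Average per-trajectory state visit counts over a set of demonstrations."""
    mu = [0.0] * num_states
    for trajectory in demos:
        for state in trajectory:
            mu[state] += 1.0 / len(demos)
    return mu


def irl_gradient_tabular(expected_mu, demo_mu):
    """MaxEnt IRL loss gradient for a tabular reward (d r / d theta = identity)."""
    return [e - d for e, d in zip(expected_mu, demo_mu)]


# Two hypothetical demonstrations over three states, given as state sequences.
demo_mu = mean_state_visitations([[0, 1, 2], [0, 2, 2]], num_states=3)
# Hypothetical visitations of the current MaxEnt policy (assumed, not computed).
grad = irl_gradient_tabular([1.0, 1.0, 1.0], demo_mu)
```

A gradient descent step on this loss lowers the reward of states the current policy visits more often than the expert and raises the reward of states it visits less often.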
If our end goal were to achieve a single reward function that works as well as possible across all tasks in $\{\mathcal{T}_i;\ i = 1, \ldots, N\}$, then we could simply follow the mean gradient across all tasks. However, our objective is different: instead of optimizing performance on the meta-training tasks, we aim to learn a reward function that can be quickly and efficiently adapted to new tasks at meta-test time. In doing so, we aim to encode prior information over the task distribution in this learned reward prior.
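We propose to implement such a learning algorithm by finding the parameters $\theta$, such that starting from $\theta$ and taking a small number of gradient steps on a few demonstrations from a given task $\mathcal{T}$ leads to a reward function for which a set of test demonstrations have high likelihood, with respect to the MaxEnt IRL model. In particular, we would like to find a $\theta$ such that the parameters
$$\phi_{\mathcal{T}} = \theta - \alpha \nabla_{\theta} \mathcal{L}^{tr}_{\mathcal{T}}(\theta) \tag{6}$$
lead to a reward function $r_{\phi_{\mathcal{T}}}$ for task $\mathcal{T}$, such that the IRL loss (corresponding to negative log-likelihood) for a disjoint set of test demonstrations, given by $\mathcal{L}^{test}_{\mathcal{T}}$, is minimized. The corresponding optimization problem for $\theta$ can therefore be written as follows:
$$\min_{\theta} \sum_{i=1}^{N} \mathcal{L}^{test}_{\mathcal{T}_i}\left(\phi_{\mathcal{T}_i}\right) = \min_{\theta} \sum_{i=1}^{N} \mathcal{L}^{test}_{\mathcal{T}_i}\left(\theta - \alpha \nabla_{\theta} \mathcal{L}^{tr}_{\mathcal{T}_i}(\theta)\right). \tag{7}$$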
Our method acquires this prior over rewards in the task distribution by optimizing this loss. This amounts to an extension of the MAML algorithm in Section 3.2 to the inverse reinforcement learning setting. This extension is quite challenging, because computing the MaxEnt IRL gradient requires repeatedly solving for the current maximum entropy policy and visitation frequencies, and the MAML objective requires computing derivatives through this gradient step. Next, we describe in detail how this is done. An overview of our method is also outlined in Alg. 1.
Meta-training. The computation of the meta-gradient for the objective in Eq. 7 can be conceptually separated into two parts. First, we perform the update in Eq. 6 by computing the expected state visitations $\mu$, which is the expected number of times an agent will visit each state. We denote this overall procedure as StateVisitationsPolicy, and follow Ziebart et al. (2008) by first computing the maximum entropy optimal policy in Eq. 2 under the current $r_{\theta}$, and then approximating $\mu$ using dynamic programming. Next, we compute the state visitation distribution of the expert using a procedure which we denote as StateVisitationsTraj. This can be done either empirically, by averaging the state visitations of the expert’s demonstrations, or by using StateVisitationsPolicy if the true reward is available at meta-training time. This allows us to recover the IRL gradient according to Eq. 5, which we can then apply to compute $\phi_{\mathcal{T}}$ according to Eq. 6.
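The two-pass computation described above can be sketched as follows, assuming a small MDP with deterministic transitions, a state-only reward and a finite horizon. These are simplifying assumptions made for illustration, and the function below is a hypothetical stand-in for StateVisitationsPolicy rather than the implementation used in our experiments. A backward pass computes the soft (log-sum-exp) value function and the corresponding maximum entropy policy; a forward pass then propagates the start-state distribution through that policy and sums the per-step state distributions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def state_visitations_policy(reward, next_state, start_dist, horizon):
    """Expected state visit counts under the finite-horizon MaxEnt policy.

    reward[s] is a state-only reward; next_state[s][a] is the deterministic
    successor of state s under action a (both are illustrative assumptions).
    """
    n, n_actions = len(reward), len(next_state[0])
    # Backward pass: soft value iteration, starting from the final time step.
    value = list(reward)
    policies = []
    for _ in range(horizon - 1):
        q = [[reward[s] + value[next_state[s][a]] for a in range(n_actions)]
             for s in range(n)]
        value = [logsumexp(q[s]) for s in range(n)]
        policies.append([[math.exp(q[s][a] - value[s]) for a in range(n_actions)]
                         for s in range(n)])
    policies.reverse()  # policies[t] is now the policy used at time step t
    # Forward pass: push the state distribution through the policy over time.
    dist = list(start_dist)
    mu = list(dist)
    for pi in policies:
        nxt = [0.0] * n
        for s in range(n):
            for a in range(n_actions):
                nxt[next_state[s][a]] += dist[s] * pi[s][a]
        dist = nxt
        mu = [m + d for m, d in zip(mu, dist)]
    return mu


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
chain = [[0, 1], [0, 2], [1, 2]]
mu_flat = state_visitations_policy([0.0, 0.0, 0.0], chain, [1.0, 0.0, 0.0], 4)
mu_goal = state_visitations_policy([0.0, 0.0, 1.0], chain, [1.0, 0.0, 0.0], 4)
```

With a zero reward the policy is uniform, whereas placing reward on the rightmost state shifts the expected visitations toward it; in both cases the visitation counts sum to the horizon.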
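Second, we need to differentiate through this update to compute the gradient of the meta-loss in Eq. 7. Note that the meta-loss itself is the IRL loss evaluated with a different set of test demonstrations. We follow the same procedure as above to evaluate the gradient of $\mathcal{L}^{test}_{\mathcal{T}}$ with respect to the post-update parameters $\phi_{\mathcal{T}}$, and then apply the chain rule to compute the meta-gradient:
$$\begin{aligned} \nabla_{\theta} \mathcal{L}^{test}_{\mathcal{T}}(\phi_{\mathcal{T}}) &= \frac{\partial \phi_{\mathcal{T}}}{\partial \theta} \, \frac{\partial r_{\phi_{\mathcal{T}}}}{\partial \phi_{\mathcal{T}}} \, \frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial r_{\phi_{\mathcal{T}}}} \\ &= \left(I - \alpha \frac{\partial}{\partial \theta}\left[\frac{\partial r_{\theta}}{\partial \theta}\left(\mathbb{E}_{\tau}[\mu_{\tau}] - \mu_{\mathcal{D}^{tr}_{\mathcal{T}}}\right)\right]\right) \frac{\partial r_{\phi_{\mathcal{T}}}}{\partial \phi_{\mathcal{T}}} \, \frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial r_{\phi_{\mathcal{T}}}}, \end{aligned} \tag{8}$$
where on the second line we differentiate through the MaxEnt IRL update. The derivation of this expression is somewhat more involved and provided in the supplementary Appendix D.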
Meta-testing. Once we have acquired the meta-trained parameters $\theta$ that encode a prior over $p(\mathcal{T})$, we can leverage this prior to enable fast, few-shot IRL of novel tasks in $\{\mathcal{T}_j;\ j = 1, \ldots, M\}$. For each task, we first compute the state visitations from the available set of demonstrations for that task. Next, we use these state visitations to compute the gradient, which is the same as the inner loss gradient computation of the meta-training loop in Alg. 1. We apply this gradient to adapt the parameters $\theta$ to the new task. Even if the model was trained with only one to three inner gradient steps, we found in practice that it was beneficial to take substantially more gradient steps during meta-testing; performance continued to improve with up to 20 steps.
4.2 Interpretation as Learning a Prior over Intent
The objective in Eq. 7 optimizes for parameters that enable the reward function to adapt and generalize efficiently on a wide range of tasks. Intuitively, constraining the space of reward functions to lie within a few steps of gradient descent can be interpreted as expressing a “locality” prior over reward function parameters. This intuition can be made more concrete with the following analysis.
By viewing IRL as maximum likelihood estimation, we can take the perspective of Grant et al. (2018), who showed that for a linear model, fast adaptation via a few steps of gradient descent in MAML is performing MAP inference over $\phi$, under a Gaussian prior with the mean $\theta$ and a covariance that depends on the step size, number of steps and curvature of the loss. This is based on the connection between early stopping and regularization previously discussed in Santos (1996), which we refer the readers to for a more detailed discussion. The interpretation of MAML as imposing a Gaussian prior on the parameters is exact in the case of a likelihood that is quadratic in the parameters (such as the log-likelihood of a Gaussian in terms of its mean). For any nonquadratic likelihood, this is an approximation in a local neighborhood around $\theta$ (i.e., up to convex quadratic approximation). In the case of very complex parameterizations, such as deep function approximators, this is a coarse approximation and unlikely to be the mode of a posterior. However, we can still frame the effect of early stopping and initialization as serving as a prior in a similar way as prior work (Sjöberg & Ljung, 1995; Duvenaud et al., 2016; Grant et al., 2018). More importantly, this interpretation hints at future extensions to our approach that could benefit from employing more fully Bayesian approaches to reward and goal inference.
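The exactness claim for quadratic likelihoods can be checked numerically in the scalar case. The sketch below assumes a loss $\frac{a}{2}(\phi - c)^2$ with curvature $a$ and a step size satisfying $0 < \alpha a < 1$; it is a hand-derived special case for illustration, not the general result of Grant et al. (2018). After $k$ gradient steps from $\theta$, the iterate equals the MAP estimate under a Gaussian prior centered at $\theta$ with precision $\lambda = a\rho / (1 - \rho)$, where $\rho = (1 - \alpha a)^k$. All function names and numbers are hypothetical.

```python
def gradient_descent(theta, a, c, alpha, k):
    """Run k gradient steps on 0.5 * a * (phi - c) ** 2, starting from theta."""
    phi = theta
    for _ in range(k):
        phi -= alpha * a * (phi - c)
    return phi


def equivalent_prior_precision(a, alpha, k):
    """Precision of the Gaussian prior whose MAP estimate matches k GD steps."""
    rho = (1.0 - alpha * a) ** k  # fraction of the initial error that remains
    return a * rho / (1.0 - rho)


def map_estimate(theta, a, c, lam):
    """Minimizer of 0.5 * a * (phi - c) ** 2 + 0.5 * lam * (phi - theta) ** 2."""
    return (a * c + lam * theta) / (a + lam)


# Hypothetical numbers: curvature 2, step size 0.1, three steps from theta = 0.
phi_gd = gradient_descent(0.0, a=2.0, c=1.0, alpha=0.1, k=3)
lam = equivalent_prior_precision(a=2.0, alpha=0.1, k=3)
phi_map = map_estimate(0.0, a=2.0, c=1.0, lam=lam)
```

Taking more steps or a larger step size shrinks $\rho$ and therefore the prior precision, which matches the intuition that longer adaptation trusts the new demonstrations more and the initialization less.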
5 Experiments
Our evaluation seeks to answer two questions. First, we aim to test our core hypothesis that leveraging prior task experience enables reward learning for new tasks with just a few demonstrations. Second, we compare our method with alternative algorithms that make use of multitask experience.
We test our core hypothesis by comparing learning performance on a new task starting from the learned initialization produced by MandRIL, compared to starting from scratch with a random initialization. This comparison is meant to evaluate whether prior experience on other tasks can in fact make inverse RL more efficient.
To our knowledge, there is no prior work that addresses the meta-inverse reinforcement learning problem introduced in this paper. Thus, to provide a point of comparison and calibrate the difficulty of the tasks, we adapt two alternative black-box meta-learning methods to the IRL setting. The comparisons to both of the black-box methods described below evaluate the importance of incorporating the IRL gradient into the meta-learning process, rather than learning the adaptation process entirely from scratch.


Demo conditional model: Our method implicitly conditions on the demonstrations through the gradient update. In principle, a conditional deep model with sufficient capacity could implicitly implement a similar learning rule. Thus, we consider a conditional model (often referred to as a “contextual model” (Finn et al., 2017b)), which receives the demonstration as an additional input.
Our approach can be understood as explicitly optimizing for an effective parameter initialization for the IRL problem. In order to test the benefits of our proposed formulation, we also compare with fine-tuning an initialization obtained with the same set of prior tasks, but with supervised pretraining as follows:


Supervised pretraining: We compare against a baseline that follows the average gradient across tasks during meta-training and is then fine-tuned at meta-test time (as discussed in Section 4). This comparison evaluates the benefits of optimizing explicitly for weights that perform well under fine-tuning. We compare to pretraining on a single task as well as on all the meta-training tasks.
Next, we describe our environment and evaluation.
SpriteWorld navigation domain.
Since most prior IRL works (and multitask IRL works) have studied settings where linear reward function approximators suffice (i.e., low-dimensional state spaces made up of hand-designed features), we design an experiment that is significantly more challenging, requiring rewards to be learned from raw pixels, while still exhibiting the multitask structure needed to test our core hypothesis. We consider a navigation problem where we aim to learn a convolutional neural network that directly maps image pixels to rewards. To do so, we introduce “SpriteWorld,” a synthetically generated task, some examples of which are shown in Fig. 3. The task visuals are inspired by StarCraft and by work applying learning algorithms to perform micromanagement (e.g., Synnaeve et al., 2016), although we do not use the game engine. Tasks involve navigating to goal objects while exhibiting preference over terrain types (e.g., the agent prefers to traverse dirt tiles over traversing grass tiles). At meta-test time, we provide one or a few demonstrations in a single training environment and evaluate the reward learned using these demonstrations in a new test environment that contains the same objects as the training environment, but arranged differently. Evaluating in a new test environment is critical to verify that the learned reward captures the correct visual cues, rather than simply memorizing the demonstration trajectory.

[Figure caption fragment] However, in both test settings, MandRIL achieves comparable performance to the training environment, while the other methods overfit until they receive at least 10 demonstrations. The recurrent meta-learner has a value difference larger than 60 in both test settings. Shaded regions show 95% confidence intervals.
The underlying MDP structure of SpriteWorld is a grid, where the states are each of the grid cells, and the actions enable the agent to move to any one of its 8-connected neighbors. We generate unique tasks from this domain as follows. First, we randomly choose a set of 3 sprites from a total of 100 sprites from the original game (creating a total of 161,700 unique tasks). We randomly place these three sprites within a randomly generated terrain tiling; we designate one of the sprites to be the goal of the navigation task, i.e., the object to which the agent must navigate. The other two objects are treated as obstacles, and the agent incurs a large negative reward for failing to avoid them. In each task, we optimize our model on a meta-training set and evaluate the ability of the reward function to generalize to a rearrangement of the same objects. For example, suppose the goal of the task is to navigate to sprite A, while avoiding sprites B and C. Then, to generate an environment for evaluation, we resample the positions of the sprites, while the underlying task remains the same (i.e., navigate to A). This requires the model to make use of the visual patterns in the scene to generalize effectively, rather than simply memorizing positions of sprites. We evaluate on novel combinations of units seen in meta-training, as well as the ability to generalize to new unseen units. We provide further details on our setup in Appendices A and B.
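The task count quoted above is the number of unordered sets of 3 sprites drawn from 100, which can be checked directly. The variable names are hypothetical; note that this counts sprite sets only and does not distinguish which of the three sprites is designated as the goal.

```python
import math

num_sprites = 100      # sprites available, as stated in the text
sprites_per_task = 3   # sprites chosen per task, as stated in the text

# Number of unordered sprite sets: 100 choose 3.
num_sprite_sets = math.comb(num_sprites, sprites_per_task)
```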
We measure performance using the expected value difference, which measures the suboptimality of a policy learned under the learned reward; this is a standard performance metric used in prior IRL work (Levine et al., 2011; Wulfmeier et al., 2015). The metric is computed by taking the difference between the value of the optimal policy under the learned reward and the value of the optimal policy under the true reward.
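A minimal sketch of this metric follows, assuming a small MDP with deterministic transitions, a state-only reward and a finite horizon, and assuming the sign convention in which the metric is the true-reward value of the truly optimal policy minus the true-reward value of the policy that is optimal for the learned reward (so that lower is better and zero means no suboptimality). The function names and the toy chain are hypothetical.

```python
def optimal_policy(reward, next_state, horizon):
    """Finite-horizon value iteration; returns one greedy policy per time step."""
    n = len(reward)
    value = list(reward)  # value at the final time step
    policy = []
    for _ in range(horizon - 1):
        q = [[reward[s] + value[ns] for ns in next_state[s]] for s in range(n)]
        policy.append([max(range(len(q[s])), key=q[s].__getitem__)
                       for s in range(n)])
        value = [max(q[s]) for s in range(n)]
    policy.reverse()  # policy[t] is now the decision rule at time step t
    return policy


def rollout_value(policy, reward, next_state, start):
    """Total reward collected by following a time-indexed policy from start."""
    state, total = start, reward[start]
    for decision_rule in policy:
        state = next_state[state][decision_rule[state]]
        total += reward[state]
    return total


def expected_value_difference(true_reward, learned_reward, next_state,
                              start_dist, horizon):
    """Suboptimality, under the true reward, of planning with the learned reward."""
    pi_true = optimal_policy(true_reward, next_state, horizon)
    pi_learned = optimal_policy(learned_reward, next_state, horizon)
    evd = 0.0
    for start, prob in enumerate(start_dist):
        v_true = rollout_value(pi_true, true_reward, next_state, start)
        v_learned = rollout_value(pi_learned, true_reward, next_state, start)
        evd += prob * (v_true - v_learned)
    return evd


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
chain = [[0, 1], [0, 2], [1, 2]]
true_r = [0.0, 0.0, 1.0]
evd_good = expected_value_difference(true_r, [0.0, 0.0, 1.0], chain, [1.0, 0.0, 0.0], 4)
evd_bad = expected_value_difference(true_r, [1.0, 0.0, 0.0], chain, [1.0, 0.0, 0.0], 4)
```

In this toy example a learned reward that matches the true one yields zero value difference, while a reward that mistakenly favors the leftmost state keeps the agent away from the goal and forfeits the full attainable return.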
Evaluation protocol.
We evaluate on held-out tasks that were unseen during meta-training. We consider two settings: (1) tasks involving new combinations and placements of sprites, with individual sprites that were present during meta-training, and (2) tasks with combinations of entirely new sprites, which we refer to as “out of domain objects.” For each task, we generate one environment (set of sprite positions) along with one or a few demonstrations for adapting the reward, and generate a second environment (with new sprite positions) where we evaluate the adapted reward. In these meta-test time evaluations, we refer to the performance on the first environment as “training performance” (not to be confused with meta-training) and to performance on the second as “testing performance”. We evaluate on 32 randomly generated meta-test tasks.
Results.
The results are shown in Fig. 4, which illustrates test performance with in-distribution and out-of-distribution sprites. Our approach, MandRIL, achieves consistently better performance in both settings. Most significantly, our approach performs well even with single-digit numbers of demonstrations. By comparison, alternative meta-learning methods generally overfit considerably, attaining good training performance (see Appendix A for curves) but poor test performance. Learning the reward function from scratch is in fact the most competitive baseline: as the number of demonstrations increases, simply training the fully convolutional reward function from scratch on the new task is the only method that matches the performance of MandRIL when provided or more demonstrations. However, with only a few demonstrations, MandRIL has substantially lower value difference. It is worth noting the performance of MandRIL on the out-of-distribution test setting (Fig. 4, bottom): although the evaluation is on new sprites, MandRIL is still able to adapt via gradient descent and exceed the performance of learning from scratch and all other methods.
Finally, as it is common practice to fine-tune representations obtained from a supervised pretraining phase, we perform this comparison in Figure 5. We compare against an approach that follows the mean gradient across the tasks at meta-training time and is fine-tuned at meta-test time, which we find is less effective than learning from a random initialization. We conclude that fine-tuning reward functions learned in this manner is not an effective way of using prior task information. When using a single task for pretraining, we find that it performs comparably to random initialization. In contrast, we find that our approach, which explicitly optimizes for initial weights for fine-tuning, robustly improves performance.
6 Conclusion
In this work, we present an approach that enables few-shot learning for reward functions of new tasks. We achieve this through a novel formulation of inverse reinforcement learning that learns to encode common structure across tasks. Using our meta-IRL approach, we show that we can leverage data from previous tasks to effectively learn deep neural network reward functions from raw pixel observations for new tasks, from only a handful of demonstrations. Our work paves the way for future work that considers environments with unknown dynamics, or work that employs more fully probabilistic approaches to reward and goal inference (Kim et al., 2018; Finn et al., 2018).
Acknowledgments
We thank Frederik Ebert, Adam Gleave, Erin Grant, Rowan McAllister, Charlotte Nguyen, Sid Reddy and Aravind Srinivas for comments on an earlier version of this paper. This work was supported by the Open Philanthropy Foundation, NVIDIA, NSF IIS 1651843 and IIS 1700696. CF was supported by an NSF graduate research fellowship. We also thank Amazon AWS for providing compute used in this project.
References

Abbeel & Ng (2004) Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML, New York, NY, USA, 2004.
Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
 BabeşVroman et al. (2011) Monica BabeşVroman, Vukosi Marivate, Kaushik Subramanian, and Michael Littman. Apprenticeship learning about multiple intentions. In International Conference on International Conference on Machine Learning, ICML, USA, 2011.
 Baker et al. (2007) Chris Baker, Joshua B Tenenbaum, and Rebecca R Saxe. Goal inference as inverse planning. 01 2007.
 (6) Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule.
 Choi & Kim (2012) Jaedeug Choi and KeeEung Kim. Nonparametric bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems, NIPS, USA, 2012.
 Dimitrakakis & Rothkopf (2012) Christos Dimitrakakis and Constantin A. Rothkopf. Bayesian multitask inverse reinforcement learning. In European Conference on Recent Advances in Reinforcement Learning, EWRL, 2012.
 Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
 Duan et al. (2017) Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. Oneshot imitation learning. In NIPS, pp. 1087–1098, 2017.
 Duvenaud et al. (2016) David Duvenaud, Dougal Maclaurin, and Ryan Adams. Early stopping as nonparametric variational inference. In Artificial Intelligence and Statistics, pp. 1070–1077, 2016.
 Finn et al. (2017a) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017a.
 Finn et al. (2017b) Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017b.
 Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. NIPS, 2018.
 Fu et al. (2018) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. International Conference on Learning Representations, 2018.
 Grant et al. (2018) Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. International Conference on Learning Representations (ICLR), 2018.
 Hausman et al. (2017) Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph J Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 1235–1245, 2017.
 Ho & Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Neural Information Processing Systems (NIPS), 2016.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Kim et al. (2018) Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. NIPS, 2018.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kober & Peters (2012) Jens Kober and Jan Peters. Reinforcement learning in robotics: A survey. In Reinforcement Learning, pp. 579–610. Springer, 2012.
 Koch (2015) Gregory Koch. Siamese neural networks for one-shot image recognition. 2015.
 Kuefler & Kochenderfer (2018) Alex Kuefler and Mykel J. Kochenderfer. Burn-in demonstrations for multi-modal imitation learning. 2018.
 Levine et al. (2011) Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pp. 19–27, 2011.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17(39):1–40, 2016.
 Li & Malik (2017) Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
 Li & Burdick (2017) Kun Li and Joel W. Burdick. Meta inverse reinforcement learning via maximum reward sharing for human motion analysis. CoRR, abs/1710.03592, 2017.
 Li et al. (2017) Yunzhu Li, Jiaming Song, and Stefano Ermon. Inferring the latent structure of human decision-making from raw visual inputs. arXiv preprint arXiv:1703.08840, 2017.
 Mishra et al. (2017) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mordatch (2018) Igor Mordatch. Concept learning with energy-based models. In ICLR Workshop, 2018.
 Naik & Mammone (1992) Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pp. 437–442. IEEE, 1992.
 Ng & Russell (2000) Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
 Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. 1999.
 Ramachandran & Amir (2007) Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 2007.
 Ratliff et al. (2006) Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. ACM, 2006.
 Ravi & Larochelle (2016) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
 Reed et al. (2018) Scott Reed, Yutian Chen, Thomas Paine, Aaron van den Oord, Oriol Vinyals, SM Ali Eslami, Danilo Rezende, and Nando de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. In ICLR, 2018.
 Rezende et al. (2016) Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.
 Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, 2016.
 Santos (1996) Reginaldo J Santos. Equivalence of regularization and truncated iteration for general ill-posed problems. Linear algebra and its applications, 236:25–33, 1996.
 Schmidhuber (1987) Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
 Shelhamer et al. (2017) E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), April 2017.
 Shyam et al. (2017) Pranav Shyam, Shubham Gupta, and Ambedkar Dukkipati. Attentive recurrent comparators. arXiv preprint arXiv:1703.00767, 2017.
 Siegelmann & Sontag (1995) Hava T Siegelmann and Eduardo D Sontag. On the computational power of neural nets. Journal of computer and system sciences, 50(1):132–150, 1995.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sjöberg & Ljung (1995) Jonas Sjöberg and Lennart Ljung. Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6):1391–1407, 1995.
 Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090, 2017.
 Synnaeve et al. (2016) Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, Florian Richoux, and Nicolas Usunier. Torchcraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.
 Thrun & Pratt (2012) Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
 Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
 Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
 Wulfmeier et al. (2016a) M. Wulfmeier, D. Z. Wang, and I. Posner. Watch this: Scalable cost-function learning for path planning in urban environments. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2089–2095, Oct 2016a.
 Wulfmeier et al. (2015) Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. In Neural Information Processing Systems Conference, Deep Reinforcement Learning Workshop, volume abs/1507.04888, 2015.
 Wulfmeier et al. (2016b) Markus Wulfmeier, Dushyant Rao, and Ingmar Posner. Incorporating human domain knowledge into large scale cost function learning. CoRR, abs/1612.04318, 2016b.
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Ziebart et al. (2008) Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08. AAAI Press, 2008.
Appendix A Experimental Details
The input to our reward function for all experiments is a resized RGB image, and the output is defined over the underlying MDP state space. In order to compute the optimal policy, we use Q-iteration. In our experiments, all reward functions are parameterized starting from the same base learner. The first layer is an 8x8 convolution with a stride of 2, 256 filters and symmetric padding of 4. The second layer is a 4x4 convolution with a stride of 2, 128 filters and symmetric padding of 1. The third and fourth layers are 3x3 convolutions with a stride of 1, 64 filters and symmetric padding of 1. The final layer is a 1x1 convolution.
Our LSTM (Hochreiter & Schmidhuber, 1997) implementation is based on the variant used by Zaremba et al. (2014). The input to the LSTM at each time step is the location of the agent, whose coordinates we embed separately. This is then used to predict an additional channel that is fed into the base CNN architecture described above. We also experimented with conditioning the initial hidden state on image features from a separate CNN, but found that this did not improve performance.
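As a rough check on the base architecture, the sketch below traces the spatial size of a square input through the five convolutions. The kernel sizes (8, 4, 3, 3, 1) are read from the hyperparameter table in this appendix; the padding of the final layer is not stated and is assumed to be 0, and `HYPOTHETICAL_INPUT_SIZE` is an illustrative placeholder rather than a value taken from the text.

```python
# Layer specs as (filters, kernel, stride, padding), following this appendix.
# The padding of the final 1x1 layer is not stated; 0 is an assumption.
LAYERS = [
    (256, 8, 2, 4),
    (128, 4, 2, 1),
    (64, 3, 1, 1),
    (64, 3, 1, 1),
    (1, 1, 1, 0),
]


def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a convolution, using the usual floor convention."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(input_size, layers=LAYERS):
    """Return the (channels, height, width) shape after each layer for a square input."""
    shapes = []
    size = input_size
    for filters, kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
        shapes.append((filters, size, size))
    return shapes


# Illustrative input size only; the actual resized image size is not given here.
HYPOTHETICAL_INPUT_SIZE = 80
for shape in trace_shapes(HYPOTHETICAL_INPUT_SIZE):
    print(shape)
```

Under these assumptions the two stride-2 layers reduce the spatial resolution by roughly a factor of four, and the remaining layers preserve it, so the single-channel output can be read as one reward value per coarse grid cell.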
In our demo-conditional model, we preserve the spatial information of the demonstrations by feeding in the state visitation map as an image grid, upsampled with bilinear interpolation, as an additional channel to the image. In our setup, both demo-conditional models share the same convolutional architecture and differ only in how they condition on the demonstrations.
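A minimal pure-Python sketch of this conditioning step is given below: a coarse state visitation grid is bilinearly upsampled to the image resolution and appended as an extra channel. The function names, the list-of-lists image layout, and the "align corners" interpolation convention are all assumptions made for illustration; the interpolation convention actually used is not stated.

```python
def bilinear_upsample(grid, out_h, out_w):
    """Upsample a 2D list of floats to (out_h, out_w) with bilinear interpolation.

    Assumes the "align corners" convention: input corner cells map exactly
    onto output corner pixels.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Fractional source row for output row i.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Fractional source column for output column j.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = (1 - wx) * grid[y0][x0] + wx * grid[y0][x1]
            bottom = (1 - wx) * grid[y1][x0] + wx * grid[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def add_visitation_channel(image, visitation):
    """Append the upsampled visitation map to `image`, a list of 2D channels."""
    height, width = len(image[0]), len(image[0][0])
    return image + [bilinear_upsample(visitation, height, width)]


# Toy example: a 3-channel 3x3 "image" and a 2x2 visitation grid.
rgb = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
visitation = [[0.0, 1.0], [1.0, 0.0]]
conditioned = add_visitation_channel(rgb, visitation)
print(len(conditioned))  # 4 channels
print(conditioned[3])
```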
For all our methods, we optimized our model with Adam (Kingma & Ba, 2014). We tuned over the learning rate, the inner learning rate of our approach, and weight decay on the initial parameters. In our LSTM learner, we experimented with different embedding sizes, as well as the dimensionality of the LSTM, although we found that these hyperparameters did not impact performance. A negative result we found was that bias transformation (Finn et al., 2017b) did not help in our experimental setting.
| Hyperparameter | Value |
|---|---|
| Architecture (filters, kernel, stride) | Conv(256, 8x8, 2) |
| | Conv(128, 4x4, 2) |
| | Conv(64, 3x3, 1) |
| | Conv(64, 3x3, 1) |
| | Conv(1, 1x1, 1) |
| Learning rate | Best chosen from {0.0001, 0.00001} |
| Inner learning rate | Best chosen from {0.001, 0.0005} |
| Weight decay | Best chosen from {0, 0.0001} |
| LSTM hidden dimension | Best chosen from {128, 256} |
| LSTM embedding sizes | Best chosen from {64, 128} |
| Batch size | 16 |
| Number of meta-training environments | 1000 |
| Number of meta-val/test environments | 32 |
| Maximum horizon (T) | 15 |
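The search over the learning rate, inner learning rate, and weight decay listed in the table can be sketched as a small grid search. The value sets are copied from the table; the selection criterion (lowest loss returned by a user-supplied `evaluate` function) and all function names are assumptions, since the text only says the best setting was chosen.

```python
from itertools import product

# Search ranges copied from the hyperparameter table.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-5],
    "inner_learning_rate": [1e-3, 5e-4],
    "weight_decay": [0.0, 1e-4],
}


def enumerate_configs(space):
    """Return every combination of hyperparameter values as a dict."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


def select_best(configs, evaluate):
    """Pick the config with the lowest score; `evaluate` is hypothetical and user-supplied."""
    return min(configs, key=evaluate)


configs = enumerate_configs(SEARCH_SPACE)
print(len(configs))  # 8 combinations
```

In practice `evaluate` would train a model with the given configuration and return a meta-validation loss; here it is left abstract.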
Appendix B Environment Details
The sprites in our environment are extracted directly from the StarCraft files. We used a total of 100 random units for meta-training. Evaluation on new objects was performed with 5 randomly selected sprites. For computational efficiency, we create a meta-training set of 1000 tasks and cache the optimal policy and state visitations under the true cost. Our evaluation is over 32 tasks. Our set of sprites was divided into two categories: buildings and characters. Each character had multiple poses (taken from different frames of animation, such as walking/running/flying), whereas buildings only had a single pose. During meta-training the units were randomly placed, but to avoid the possibility that the agent would not need to actively avoid obstacles, the units were placed away from the boundary of the image in both the meta-validation and meta-test sets.
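The caching step described above can be illustrated with a finite-horizon Q-iteration on a toy problem. This is a sketch under strong simplifying assumptions: a deterministic five-state chain (hypothetical, not the StarCraft grid), a state-only reward, and a greedy rollout to obtain visitation counts. Only the horizon of 15 is taken from the hyperparameter table.

```python
def q_iteration(n_states, n_actions, step, reward, horizon):
    """Finite-horizon Q-iteration for a deterministic MDP.

    step(s, a) returns the next state; reward[s] is received for being in s.
    Returns q[t][s][a] for t = 0 .. horizon - 1.
    """
    value = [0.0] * n_states  # value after the final time step
    q = [None] * horizon
    for t in reversed(range(horizon)):
        q[t] = [[reward[s] + value[step(s, a)] for a in range(n_actions)]
                for s in range(n_states)]
        value = [max(q[t][s]) for s in range(n_states)]
    return q


def greedy_visitations(q, step, start, n_states):
    """Count state visits when following the greedy policy from `start`."""
    counts = [0] * n_states
    s = start
    for q_t in q:
        counts[s] += 1
        a = max(range(len(q_t[s])), key=lambda act: q_t[s][act])
        s = step(s, a)
    return counts


# Hypothetical chain MDP: action 0 moves left, action 1 moves right.
N_STATES, HORIZON = 5, 15


def chain_step(s, a):
    return max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))


reward = [0.0, 0.0, 0.0, 0.0, 1.0]
q = q_iteration(N_STATES, 2, chain_step, reward, HORIZON)

# Cache the solution per task so that it is computed only once.
cache = {"task_0": (q, greedy_visitations(q, chain_step, start=0, n_states=N_STATES))}
print(cache["task_0"][1])  # [1, 1, 1, 1, 11]
```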
The terrain in each environment was randomly generated using a set of tiles, each belonging to a specific category (e.g. grass, dirt, water). For each tile, we also specified a set of possible tiles for each of its 4-neighbors. Using these constraints on the neighbors, we generated random environment terrains using a graph traversal algorithm, where successor tiles were sampled randomly from this set of possible tiles. This process resulted in randomly generated, seamless environments. The names of the units used in our experiments are as follows (names are from the original game files):
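The terrain generation procedure can be sketched as a breadth-first traversal that samples each tile from the set allowed by its already-placed 4-neighbors. The tile categories, the `ALLOWED` compatibility table, and the choice of breadth-first order are hypothetical; the text specifies only that a graph traversal with neighbor constraints was used.

```python
import random
from collections import deque

# Hypothetical tile categories and symmetric neighbor constraints.
ALLOWED = {
    "grass": {"grass", "dirt"},
    "dirt": {"grass", "dirt", "water"},
    "water": {"dirt", "water"},
}


def neighbors(r, c, height, width):
    """Yield the in-bounds 4-neighbors of cell (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < height and 0 <= nc < width:
            yield nr, nc


def generate_terrain(height, width, seed=0):
    """Fill a grid by breadth-first traversal from a random start cell."""
    rng = random.Random(seed)
    grid = [[None] * width for _ in range(height)]
    start = (rng.randrange(height), rng.randrange(width))
    grid[start[0]][start[1]] = rng.choice(sorted(ALLOWED))
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in neighbors(r, c, height, width):
            if grid[nr][nc] is not None:
                continue
            # Candidates must be allowed by every already-placed neighbor.
            candidates = set(ALLOWED)
            for ar, ac in neighbors(nr, nc, height, width):
                if grid[ar][ac] is not None:
                    candidates &= ALLOWED[grid[ar][ac]]
            grid[nr][nc] = rng.choice(sorted(candidates))
            queue.append((nr, nc))
    return grid


def is_seamless(grid):
    """Check that every pair of adjacent tiles satisfies the constraints."""
    h, w = len(grid), len(grid[0])
    return all(grid[nr][nc] in ALLOWED[grid[r][c]]
               for r in range(h) for c in range(w)
               for nr, nc in neighbors(r, c, h, w))


terrain = generate_terrain(6, 8, seed=1)
for row in terrain:
    print(" ".join(tile[0] for tile in row))
print(is_seamless(terrain))
```

In this toy constraint set "dirt" is compatible with every category, so the candidate set is never empty; a richer tile set could require backtracking when no tile satisfies all placed neighbors.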
The list of buildings used is: academy, assim, barrack, beacon, cerebrat, chemlab, chrysal, cocoon, comsat, control, depot, drydock, egg, extract, factory, fcolony, forge, gateway, genelab, geyser, hatchery, hive, infest, lair, larva, mutapit, nest, nexus, nukesilo, nydustpit, overlord, physics, probe, pylon, prism, pillbox, queen, rcluster, refinery, research, robotic, sbattery, scolony, spire, starbase, stargate, starport, temple, warm, weaponpl, wessel.
The list of characters used is: acritter, arbiter, archives, archon, avenger, battlecr, brood, bugguy, carrier, civilian, defiler, dragoon, drone, dropship, firebat, gencore, ghost, guardian, hydra, intercep, jcritter, lurker, marine, missile, mutacham, mutalid, sapper, scout, scv, shuttle, snakey, spider, stank, tank, templar, trilob, ucereb, uikerr, ultra, vulture, witness, zealot, zergling.
Appendix C Meta-Test Training Performance
Appendix D Detailed Meta-Objective Derivation
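We define the quality of the reward function $r_\theta$, parameterized by $\theta \in \mathbb{R}^k$, on task $\mathcal{T}$ with the MaxEnt IRL loss, $\mathcal{L}^{\mathcal{T}}_{\text{IRL}}(\theta)$, described in Section 4. The corresponding gradient is
$$\nabla_\theta \mathcal{L}^{\mathcal{T}}_{\text{IRL}}(\theta) = \frac{\partial r_\theta}{\partial \theta}\left(\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}\right), \tag{9}$$
where $\partial r_\theta / \partial \theta$ is the $k \times |\mathcal{S}||\mathcal{A}|$-dimensional Jacobian matrix of the reward function $r_\theta$ with respect to the parameters $\theta$. Here, $\mu_\tau \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is the vector of state visitations under the trajectory $\tau$ (i.e. the vector whose elements are 1 if the corresponding state-action pair has been visited by the trajectory $\tau$, and 0 otherwise), and $\mu_{\mathcal{D}_\mathcal{T}} = \frac{1}{|\mathcal{D}_\mathcal{T}|}\sum_{\tau \in \mathcal{D}_\mathcal{T}} \mu_\tau$ is the mean state visitations over all demonstrated trajectories in $\mathcal{D}_\mathcal{T}$. Let $\phi_\mathcal{T} \in \mathbb{R}^k$ be the updated parameters after a single gradient step. Then
$$\phi_\mathcal{T} = \theta - \alpha \nabla_\theta \mathcal{L}^{tr}_{\mathcal{T}}(\theta). \tag{10}$$
Let $\mathcal{L}^{test}_{\mathcal{T}}$ be the MaxEnt IRL loss, where the expectation over trajectories is computed with respect to a test set that is disjoint from the set of demonstrations used to compute $\mathcal{L}^{tr}_{\mathcal{T}}$ in Eq. 10. We seek to minimize
$$\mathcal{L}^{test}_{\mathcal{T}}(\phi_\mathcal{T}) = \mathcal{L}^{test}_{\mathcal{T}}\left(\theta - \alpha \nabla_\theta \mathcal{L}^{tr}_{\mathcal{T}}(\theta)\right) \tag{11}$$
over the parameters $\theta$. To do so, we first compute the gradient of Eq. 11, which we derive here. Applying the chain rule,
$$\begin{aligned}
\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial \theta}
&= \frac{\partial \phi_\mathcal{T}}{\partial \theta}\,\frac{\partial r_{\phi_\mathcal{T}}}{\partial \phi_\mathcal{T}}\,\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial r_{\phi_\mathcal{T}}} \\
&= \frac{\partial}{\partial \theta}\left(\theta - \alpha \nabla_\theta \mathcal{L}^{tr}_{\mathcal{T}}(\theta)\right)\frac{\partial r_{\phi_\mathcal{T}}}{\partial \phi_\mathcal{T}}\,\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial r_{\phi_\mathcal{T}}} \\
&= \left(I - \alpha \frac{\partial}{\partial \theta}\left(\frac{\partial r_\theta}{\partial \theta}\left(\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}\right)\right)\right)\frac{\partial r_{\phi_\mathcal{T}}}{\partial \phi_\mathcal{T}}\,\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial r_{\phi_\mathcal{T}}},
\end{aligned} \tag{12}$$
where in the last line we substitute in the gradient of the MaxEnt IRL loss in Eq. 9. In Eq. 12, we use the following notation:

$\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial \theta}$ denotes the $k$-dimensional vector of partial derivatives $\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial \theta_i}$,

$\frac{\partial \phi_\mathcal{T}}{\partial \theta}$ denotes the $k \times k$-dimensional matrix of partial derivatives $\frac{\partial \phi_{\mathcal{T},i}}{\partial \theta_j}$,

and, $\frac{\partial \mathcal{L}^{test}_{\mathcal{T}}}{\partial r_{\phi_\mathcal{T}}}$ denotes the $|\mathcal{S}||\mathcal{A}|$-dimensional gradient vector of $\mathcal{L}^{test}_{\mathcal{T}}$ with respect to $r_{\phi_\mathcal{T}}$.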
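We will now focus on the term inside of the parentheses in Eq. 12, which is a $k \times k$-dimensional matrix of partial derivatives:
$$\begin{aligned}
\frac{\partial}{\partial \theta}\left(\frac{\partial r_\theta}{\partial \theta}\left(\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}\right)\right)
&= \sum_{i=1}^{|\mathcal{S}||\mathcal{A}|}\left[\frac{\partial^2 r_{\theta,i}}{\partial \theta^2}\left(\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}\right)_i + \frac{\partial r_{\theta,i}}{\partial \theta}\left(\frac{\partial}{\partial \theta}\left(\mathbb{E}_\tau[\mu_\tau]\right)_i\right)^{\top}\right] \\
&= \sum_{i=1}^{|\mathcal{S}||\mathcal{A}|}\left[\frac{\partial^2 r_{\theta,i}}{\partial \theta^2}\left(\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}\right)_i + \frac{\partial r_{\theta,i}}{\partial \theta}\left(\frac{\partial r_\theta}{\partial \theta}\,\frac{\partial}{\partial r_\theta}\left(\mathbb{E}_\tau[\mu_\tau]\right)_i\right)^{\top}\right],
\end{aligned}$$
where between the first and second lines, we apply the chain rule to expand the second term. In this expression, we make use of the following notation:

$\frac{\partial^2 r_{\theta,i}}{\partial \theta^2}$ denotes the $k \times k$-dimensional matrix of second-order partial derivatives of the form $\frac{\partial^2 r_{\theta,i}}{\partial \theta_j \partial \theta_l}$,

$\left(\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}\right)_i$ denotes the $i$-th element of the $|\mathcal{S}||\mathcal{A}|$-dimensional vector $\mathbb{E}_\tau[\mu_\tau] - \mu_{\mathcal{D}_\mathcal{T}}$,

$\frac{\partial r_{\theta,i}}{\partial \theta}$ denotes the $k \times 1$-dimensional matrix of partial derivatives of the form $\frac{\partial r_{\theta,i}}{\partial \theta_j}$,

and, $\frac{\partial}{\partial r_\theta}\mathbb{E}_\tau[\mu_\tau]$ is the $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|$-dimensional Jacobian matrix of $\mathbb{E}_\tau[\mu_\tau]$ with respect to the reward function $r_\theta$, whose $i$-th column is $\frac{\partial}{\partial r_\theta}\left(\mathbb{E}_\tau[\mu_\tau]\right)_i$ (we will examine in more detail exactly what this is below).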
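The gradient in Eq. 9 and the inner update in Eq. 10 can be checked numerically on a toy problem. The sketch below makes several assumptions that are not in the text: a tabular reward in which each parameter is the reward of one state-action pair (so the Jacobian of the reward is the identity), a hypothetical set of three enumerable trajectories, and a trajectory distribution proportional to the exponentiated trajectory reward. Under these assumptions the loss gradient reduces to the expected visitations minus the demonstrated visitations.

```python
import math

# Hypothetical toy problem: three trajectories over four state-action pairs.
# Each row is a visitation vector (1 if the pair is visited, 0 otherwise).
TRAJECTORIES = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
DEMOS = [0, 0, 2]  # indices of the demonstrated trajectories


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_visitation(indices):
    """Mean visitation vector over the trajectories listed in `indices`."""
    n = len(TRAJECTORIES[0])
    return [sum(TRAJECTORIES[i][j] for i in indices) / len(indices)
            for j in range(n)]


def expected_visitation(theta):
    """Expected visitations under p(tau) proportional to exp(theta . mu_tau)."""
    weights = [math.exp(dot(theta, mu)) for mu in TRAJECTORIES]
    z = sum(weights)
    return [sum(w * mu[j] for w, mu in zip(weights, TRAJECTORIES)) / z
            for j in range(len(theta))]


def irl_loss(theta, demos):
    """Negative log-likelihood of the demonstrations under the MaxEnt model."""
    log_z = math.log(sum(math.exp(dot(theta, mu)) for mu in TRAJECTORIES))
    return log_z - dot(theta, mean_visitation(demos))


def irl_grad(theta, demos):
    """Eq. 9 with an identity Jacobian: expected minus demonstrated visitations."""
    mu_d = mean_visitation(demos)
    return [e - d for e, d in zip(expected_visitation(theta), mu_d)]


def inner_step(theta, demos, alpha):
    """Eq. 10: a single gradient step on the training demonstrations."""
    return [t - alpha * g for t, g in zip(theta, irl_grad(theta, demos))]


theta = [0.1, -0.2, 0.3, 0.0]
phi = inner_step(theta, DEMOS, alpha=0.5)
print(irl_loss(theta, DEMOS), irl_loss(phi, DEMOS))
```

In this tabular setting the Jacobian of the expected visitations with respect to the reward is the covariance matrix of the visitation vectors under the trajectory distribution, which gives one concrete instance of the last notation item above.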