1 Introduction
Inverse reinforcement learning (IRL) is the task of determining the reward function that generated a set of trajectories: sequences of state-action pairs (Ng & Russell, 2000). It is a form of learning from demonstration that assumes the demonstrator follows a near-optimal policy with respect to an unknown reward. Sample efficiency is a key design goal, since it is costly to elicit human demonstrations.
In practice, demonstrations are often generated from multiple reward functions. This situation naturally arises when the demonstrations are for different tasks, such as grasping different types of objects, as depicted in fig. 1. Less obviously, it also occurs when different individuals perform what is nominally the same task, reflecting individuals’ unique preferences and styles. In this paper, we assume demonstrations of the same task are assigned a common label.
A naive solution to multi-task IRL is to repeatedly apply a single-task IRL algorithm to demonstrations of each task. However, this method requires the number of samples to increase proportionally with the number of tasks, which is prohibitive in many settings. Fortunately, the reward functions for related tasks are often similar, and exploiting this structure can enable greater sample efficiency.
Previous work on the multi-task IRL problem (Dimitrakakis & Rothkopf, 2011; Babeş-Vroman et al., 2011; Choi & Kim, 2012) builds on Bayesian IRL (Ramachandran & Amir, 2007). Unfortunately, no extant Bayesian IRL methods scale to complex environments with high-dimensional, continuous state spaces such as robotics. By contrast, approaches based on maximum causal entropy show more promise (Ziebart et al., 2010). Although the original maximum causal entropy IRL algorithm is limited to discrete state spaces, recent extensions such as guided cost learning and adversarial IRL scale to challenging continuous control environments (Finn et al., 2016; Fu et al., 2018).
Our two main contributions in this paper are:

Regularised Maximum Causal Entropy (MCE). We present a formulation of the multi-task IRL problem in the MCE framework. Our approach simply adds a regularisation term to the loss, retaining the computational efficiency of the original MCE IRL algorithm. We evaluate on a gridworld that takes hundreds of demonstrations for MCE IRL to solve. By contrast, after a single demonstration our regularised variant recovers a reward leading to a near-optimal policy.

Meta-Learning Rewards. We describe preliminary work applying meta-learning to adversarial IRL. Evaluating on multi-task variants of continuous control tasks, we find that baseline single-task adversarial IRL performs poorly even when given ample samples. This limitation consequently affects our multi-task variant. The poor performance appears to be due to the multimodal nature of optimal policies in these environments, presenting a challenge not present in other benchmarks. We conjecture this is analogous to the mode collapse problem in generative adversarial networks (Goodfellow et al., 2014), and conclude with suggestions for further work.
2 Preliminaries and Single-Task IRL
A Markov Decision Process (MDP) is a tuple $M = (S, A, T, \gamma, \rho, r)$ where $S$ and $A$ are sets of states and actions; $T(s' \mid s, a)$ is the probability of transitioning to $s'$ from $s$ after taking action $a$; $\gamma \in [0, 1)$ is a discount factor; $\rho(s)$ is the probability of starting in $s$; and $r(s, a)$ is the reward upon taking action $a$ in state $s$. We write $M \setminus r$ to denote an MDP without a reward function.

In the single-task IRL problem, the IRL algorithm is given access to an $M \setminus r$ and demonstrations $D$ from an (approximately) optimal policy. The goal is to recover a reward function $r$ that explains the demonstrations $D$. Note this is an ill-posed problem: many reward functions, including the constant zero reward function $r(s, a) = 0$, make the demonstrations optimal.
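For concreteness, a finite MDP of this form might be stored as follows. This is an illustrative sketch only: the field names and the `strip_reward` helper are our own choices, not notation from the paper.

```python
from typing import NamedTuple, Optional
import numpy as np

class MDP(NamedTuple):
    """Finite MDP (S, A, T, gamma, rho, r), with S states and A actions."""
    T: np.ndarray            # transitions, shape (S, A, S): T[s, a, s'] = P(s' | s, a)
    gamma: float             # discount factor in [0, 1)
    rho: np.ndarray          # initial state distribution, shape (S,)
    r: Optional[np.ndarray]  # reward, shape (S, A), or None if unknown

def strip_reward(mdp: MDP) -> MDP:
    """Return the 'MDP without a reward function' (M \\ r) handed to an IRL algorithm."""
    return mdp._replace(r=None)
```

In the IRL setting below, the learner sees only `strip_reward(mdp)` plus demonstrations, and must infer `r`.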
Bayesian IRL addresses this identification problem by inferring a posterior distribution $P(r \mid D)$ (Ramachandran & Amir, 2007). Although some probability mass will be placed on degenerate reward functions, for reasonable priors the majority of the probability will lie on more plausible explanations.
By contrast, maximum causal entropy chooses a single reward function, using the principle of maximum entropy to select the least specific reward function that is still consistent with the demonstrations (Ziebart et al., 2008, 2010). It models the demonstrations as being sampled from a stochastic expert policy

$$\pi(a \mid s) = \exp\left(Q^{\mathrm{soft}}(s, a) - V^{\mathrm{soft}}(s)\right), \qquad (1)$$

that is noisily optimal for the softmax Bellman equations:

$$Q^{\mathrm{soft}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[V^{\mathrm{soft}}(s')\right], \qquad V^{\mathrm{soft}}(s) = \log \sum_{a \in A} \exp Q^{\mathrm{soft}}(s, a). \qquad (2)$$
Note there can exist multiple solutions to these softmax Bellman equations (Asadi & Littman, 2017).
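On a finite MDP, the softmax Bellman equations can be solved by fixed-point iteration (soft value iteration). The sketch below is illustrative rather than the paper's implementation: the function name is ours, and we use a plain log-sum-exp rather than a numerically stabilised one.

```python
import numpy as np

def soft_value_iteration(T, r, gamma, n_iter=200):
    """Fixed-point iteration on the softmax Bellman equations.

    T: (S, A, S) transition probabilities; r: (S, A) rewards.
    Returns the noisily optimal stochastic policy pi[s, a] = exp(Q[s, a] - V[s]),
    along with the soft Q and V functions.
    """
    S, A, _ = T.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        Q = r + gamma * (T @ V)            # Q[s, a] = r[s, a] + gamma * E[V(s')]
        V = np.log(np.exp(Q).sum(axis=1))  # soft (log-sum-exp) backup over actions
    pi = np.exp(Q - V[:, None])            # rows of pi sum to one by construction
    return pi, Q, V
```

Since the fixed point need not be unique (as noted above), the iterate reached depends on initialisation; here we start from $V = 0$.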
To reduce the dimension of the problem, it is common to assume the reward function is linear in features $\phi(s, a) \in \mathbb{R}^d$ over the state-action pairs:

$$r(s, a) = \theta^\top \phi(s, a). \qquad (3)$$
Let the expert demonstration $D$ consist of $m$ trajectories $\tau_i$. For convenience, write the discounted feature counts of a trajectory, and their empirical mean over the demonstrations, as:

$$f(\tau) = \sum_{t} \gamma^t \phi(s_t, a_t), \qquad (4)$$

$$f_D = \frac{1}{m} \sum_{i=1}^{m} f(\tau_i). \qquad (5)$$
Given a known feature map $\phi$, the IRL problem reduces to finding weights $\theta$.
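The discounted feature counts defined above are straightforward to compute from demonstration data. A minimal sketch, assuming each trajectory is a list of state-action pairs and `phi` is the feature map (both names ours):

```python
import numpy as np

def feature_counts(trajectory, phi, gamma):
    """Discounted feature counts f(tau) = sum_t gamma^t * phi(s_t, a_t)."""
    return sum(gamma ** t * phi(s, a) for t, (s, a) in enumerate(trajectory))

def empirical_feature_counts(demos, phi, gamma):
    """Empirical mean f_D of feature counts over demonstrations D = {tau_1, ..., tau_m}."""
    return sum(feature_counts(tau, phi, gamma) for tau in demos) / len(demos)
```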
A key insight behind maximum causal entropy IRL is that actions in the trajectory sequence depend causally on previous states and actions: i.e. $a_t$ may depend on $s_{1:t}$ and $a_{1:t-1}$, but not on states or actions that occur later in time. The causal log-likelihood of a trajectory $\tau$ is defined to be:

$$\log p(\tau) = \sum_{t} \log \pi(a_t \mid s_{1:t}, a_{1:t-1}), \qquad (6)$$
with the causal entropy of a policy $\pi$ defined in terms of the causal log-likelihood of its trajectories:

$$H(\pi) = \mathbb{E}_{\tau \sim \pi}\left[-\log p(\tau)\right]. \qquad (7)$$
Maximum causal likelihood estimation of $\theta$ given the expert demonstrations $D$ is equivalent to maximising the causal entropy $H(\pi)$ of the stochastic policy $\pi$ subject to the constraint that its expected feature counts match those of the demonstrations:

$$\mathbb{E}_{\tau \sim \pi}\left[f(\tau)\right] = f_D. \qquad (8)$$
Note this constraint guarantees $\pi$ attains the same (expected) reward as the expert demonstrations (Abbeel & Ng, 2004). Maximum causal entropy thus recovers reward weights that match the performance of the expert, while avoiding degeneracy by maximising the diversity of the policy.
3 Methods for Multi-Task IRL
In multi-task IRL, the reward $r^i$ varies between MDPs $M^i$, each with associated expert demonstrations $D^i$. If the reward functions are unrelated to each other, we cannot do better than repeated application of a single-task IRL algorithm. However, in practice similar tasks have reward functions with similar structure, enabling specialised multi-task IRL algorithms to accurately infer the reward with fewer demonstrations.
In the next section, we solve the multi-task IRL problem using the original maximum causal entropy IRL algorithm with an additional regularisation term. Following this, we describe how our method can be extended to scalable approximations of maximum causal entropy IRL.
3.1 Regularised Maximum Causal Entropy IRL
In the multi-task setting, we must jointly infer reward weights $\theta^i$ that explain each demonstration $D^i$. To make progress we must make some assumption on the relationship between different reward weights. A natural assumption is that the reward weights for most tasks lie close to the mean across all tasks, i.e. $\lVert \theta^i - \bar\theta \rVert_2$ should be small, where $\bar\theta = \frac{1}{N} \sum_{i=1}^{N} \theta^i$. This corresponds to a prior that $\theta^i$ is drawn from i.i.d. Gaussians with mean $\bar\theta$ and variance monotonic with $1/\lambda$. In practice, we do not know $\bar\theta$, but we can estimate it by taking the mean of the current iterates for $\theta^i$. This results in a pleasingly simple inference procedure. The regularised loss is:

$$L_{\mathrm{reg}}(\theta^i) = -\log p(D^i \mid \theta^i) + \lambda \lVert \theta^i - \bar\theta \rVert_2^2, \qquad (9)$$

with gradient:

$$\nabla_{\theta^i} L_{\mathrm{reg}}(\theta^i) = \mathbb{E}_{\tau \sim \pi_{\theta^i}}\left[f(\tau)\right] - f_{D^i} + 2\lambda \left(\theta^i - \bar\theta\right). \qquad (10)$$
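The regularised loss above admits a simple gradient-descent step: the standard MCE IRL gradient (expected minus empirical feature counts) plus an L2 penalty pulling each task's weights towards their running mean. A minimal sketch, where `expected_counts_fn` is a placeholder we assume is given (e.g. computed via soft value iteration), not part of the paper's interface:

```python
import numpy as np

def multitask_mce_update(thetas, expert_counts, expected_counts_fn, lam, lr):
    """One gradient-descent step of regularised multi-task MCE IRL (a sketch).

    thetas: (N, d) current reward weights, one row per task.
    expert_counts: (N, d) empirical expert feature counts per task.
    expected_counts_fn: maps theta -> expected feature counts of the MCE policy for theta.
    lam: regularisation strength; lr: learning rate.
    """
    # The unknown common mean is estimated from the current iterates, then
    # each task's weights are pulled towards it by the L2 penalty.
    theta_bar = thetas.mean(axis=0)
    new_thetas = np.empty_like(thetas)
    for i, theta in enumerate(thetas):
        grad = (expected_counts_fn(theta) - expert_counts[i]
                + 2.0 * lam * (theta - theta_bar))
        new_thetas[i] = theta - lr * grad
    return new_thetas
```

With `lam = 0` this reduces to independent single-task MCE IRL updates; larger `lam` shares more statistical strength across tasks.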
3.2 Meta-Learning Reward Networks
In the previous section, we saw how multi-task IRL can be incorporated directly into the Maximum Causal Entropy (MCE) framework. However, the original MCE IRL algorithm has two major limitations. First, it assumes the MDP's dynamics are known, whereas in many applications (e.g. robotics) the dynamics are unknown and must also be learned. Second, it requires the practitioner to provide a feature mapping $\phi$ such that the resulting reward is linear. For many tasks, finding these features may be the bulk of the problem, negating the benefit of IRL.
Both of these shortcomings are addressed by guided cost learning (Finn et al., 2016) and its successor adversarial IRL (Fu et al., 2018), scalable approximations of MCE IRL. Specifically, adversarial IRL uses a neural network to represent the reward as a function from states and actions, obviating the need to specify a feature map $\phi$. Furthermore, it can handle unknown transition dynamics since it estimates the loss gradient via sampling rather than direct computation, and so only requires access to a simulation of the environment for rollouts.

Naively, we could directly translate the regularisation approach given in the previous section to this setting, applying it to the parameters of the neural network. However, regularising the parameter space may not regularise the output space: small changes in some parameters may have a large effect on the predicted reward, while large changes in other parameters may have little effect.
A more promising approach is to meta-learn the reward network parameters. We selected Reptile (Nichol et al., 2018) as the basis for our initial experiments due to its computational efficiency, a key consideration given that IRL in complex environments is already computationally demanding. Moreover, Reptile attains accuracy on few-shot supervised learning tasks similar to that of more computationally expensive algorithms.
Our meta adversarial IRL (meta-AIRL) method is described in algorithm 1. We seek to find an initialisation for the reward network that can be quickly fine-tuned for any given task (by running adversarial IRL on demonstrations of that task). To achieve this, we repeatedly sample a task and run several steps of adversarial IRL, starting from our current initialisation. The initialisation is then updated along the line between the initialisation and the final iterate of adversarial IRL. Although this appears superficially similar to joint training, with more than one inner step it is an approximation to first-order model-agnostic meta-learning (MAML) (Finn et al., 2017), a more principled but computationally expensive meta-learning algorithm.
Algorithm 1 cannot be applied verbatim since adversarial IRL jointly learns a reward function and a policy optimising that reward function. This is analogous to a GAN, where the policy network is a generator and the reward network defines a discriminator (assigning greater probability to higher-reward trajectories). We developed two concrete implementations of meta-AIRL, differing only in the policy used during meta-training:

Random. The simplest solution is to randomly initialise the policy at the start of each new task. This reduces adversarial IRL to a pure sample-based approximation of MCE IRL. We expect this to work in simple environments, where a random agent can cover most of the state space, but to fail in more challenging environments.

Task-specific. A more sophisticated option is to maintain separate policy parameters per task. This method learns reward parameters that can be quickly fine-tuned to discriminate data from a distribution of generators.
However, since the policy for a task is updated only when that task is sampled, care must be taken to ensure the interval between successive samples of a task does not grow too large. Otherwise, the policies for many tasks might become very suboptimal for the current reward network weights, slowing convergence. Accordingly, we suggest training in minibatches of a small number of tasks.
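Although algorithm 1 is not reproduced in this extract, the Reptile-style outer loop it describes — sample a task, run a few steps of adversarial IRL from the current initialisation, then move the initialisation towards the adapted parameters — can be sketched as follows. Here `inner_fn` is a placeholder standing in for the adversarial IRL inner loop, which is assumed rather than implemented:

```python
import numpy as np

def meta_airl_reptile(init_params, tasks, inner_fn, meta_lr, n_meta_steps, rng):
    """Reptile-style outer loop for meta-AIRL (an illustrative sketch).

    init_params: flat vector of reward-network parameters.
    tasks: list of task identifiers to sample from.
    inner_fn(params, task): runs a few steps of adversarial IRL on `task`,
        starting from `params`, and returns the final reward parameters.
    """
    params = init_params.copy()
    for _ in range(n_meta_steps):
        task = tasks[rng.integers(len(tasks))]  # sample a task uniformly
        adapted = inner_fn(params, task)        # inner loop: adversarial IRL steps
        # Move the initialisation along the line towards the adapted parameters.
        params = params + meta_lr * (adapted - params)
    return params
```

In the task-specific variant described above, `inner_fn` would also read and write per-task policy parameters; in the random variant it would reinitialise the policy on every call.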
4 Related Work
Previous work in multi-task IRL has approached the problem from a Bayesian perspective. Dimitrakakis & Rothkopf (2011) model reward-policy function pairs as being drawn from a common (unknown) prior, over which they place a hyperprior. This work provides a solid theoretical basis for work on multi-task IRL, but the inference problem is intractable even for moderately sized finite-state MDPs.
Complementary work has tackled an unlabelled variant of the multi-task IRL problem. That is, not only are the reward functions unknown, it is also not known which reward each trajectory is paired with. Babeş-Vroman et al. (2011) use expectation-maximisation to cluster trajectories, an approach applicable to several IRL algorithms. Choi & Kim (2012) instead take a Bayesian IRL approach using a Dirichlet process mixture model, allowing a variable number of clusters. Both methods reduce the problem to multiple single-task IRL problems, and so, unlike our work, do not exploit similarities between reward functions.
Amin et al. (2017) have studied the similar problem of repeated IRL: learning a common reward component shared across tasks, given known task-specific reward components. Although this could be solved by applying IRL to any one of the tasks, a repeated IRL algorithm can attain better bounds and even resolve the ambiguity inherent in single-task IRL.
IRL is often used for imitation learning. Multi-task imitation learning has also been studied from a non-IRL perspective, especially in the context of generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). Recent extensions to GAIL augment trajectories with a latent intention variable that specifies the task, and then maximise mutual information between the state-action pairs and the intention variable (Hausman et al., 2017; Li et al., 2017). These approaches focus on disentangling trajectories from different tasks, and are not intended to speed up the learning of new tasks.
However, the imitation learning community does address this problem in one-shot imitation learning: having seen a distribution of trajectories over various tasks, learn a new task from a single demonstration. Wang et al. (2017) use GAIL with the discriminator conditioned on the output generated by an LSTM encoder. After training on unlabelled trajectories, this method can perform one-shot imitation learning by conditioning on the code of a demonstration trajectory. One-shot imitation learning has also been tackled within the behavioural cloning paradigm (Duan et al., 2017).
Multi-agent GAIL (Song et al., 2018) is the imitation learning method most similar to our paper. Although GAIL does not explicitly learn a reward function, it is equivalent to IRL composed with RL. Similar to our work, multi-agent GAIL seeks to improve sample efficiency by exploiting similarity between the reward functions. However, unlike our work, multi-agent GAIL makes strong assumptions on the reward functions (e.g. zero-sum games).
In concurrent work, Xu et al. (2018) independently developed a meta-IRL algorithm, applying MAML directly to Maximum Entropy IRL. They obtain good performance on a “Spriteworld” navigation domain, consisting of a gridworld with overlaid “sprite” textures. However, their algorithm inherits the limitations of Maximum Entropy IRL: the MDP must have a finite state space and known transition dynamics. By contrast, our meta-AIRL method (algorithm 1) learns the transition dynamics and can operate in infinite-state-space MDPs such as continuous control environments. However, this flexibility comes at a cost, with our experiments showing that adversarial IRL has difficulty learning from non-unimodal policies. Accordingly, we view Xu et al. (2018) and our own work as complementary, making different trade-offs in order to target different applications.
5 Experiments
5.1 Regularised Maximum Causal Entropy IRL
We evaluate our regularised maximum causal entropy (MCE) IRL algorithm in a few-shot reward learning problem on the gridworld depicted in fig. 2. Transitions in the gridworld are stochastic, with a fixed probability of moving in the desired direction and the remaining probability split evenly between the two orthogonal directions. Each cell in the gridworld is either a wall (in which case the state can never be visited) or one of five object types: dirt, grass, lava, gold and silver.
We define three different reward functions A, B and A+B in terms of these object types, as specified by the legend of fig. 2. The reward functions assign the same weights to dirt, grass and lava, but differ in the weights for gold and silver. A likes silver but is neutral about gold, B has the opposite preferences and A+B likes both gold and silver. We generate synthetic demonstrations for each of these three reward functions using the MCE planner given by eq. (1).
Our multi-task IRL algorithm is then presented with demonstrations from an optimal policy for each reward function. Demonstrations for the few-shot environment are restricted to a small, varying number of trajectories, while demonstrations for the other two environments contain a larger, fixed number of trajectories. To make the task more challenging, our algorithm is not provided with the feature representation, instead having to learn the reward separately for each state. We repeat all experiments for 5 random seeds.
5.1.1 Comparison to Baselines
We compare against two baselines. The first (‘single’) corresponds to using single-task MCE IRL, seeing only the trajectories from the few-shot environment. The second (‘joint training’) combines the demonstrations from all three environments into a single sequence of trajectories. For reference, we also display the value obtained by an optimal (‘oracle’) policy. Figure 3 shows the best out of 5 random seeds.
Our multi-task IRL algorithm recovers a near-optimal policy in all 5 runs after only two trajectories, and in the best case requires only a single trajectory. By contrast, the ‘single’ baseline requires many more trajectories to recover a good policy even in the best case, and with substantially more trajectories several seeds still obtain negative total rewards.
The ‘joint training’ baseline performs well on A+B. This is unsurprising, since an optimal policy in A or B is near-optimal in A+B. However, it fares poorly in both the A and B environments, never obtaining a positive reward even in the best case.
Note that all approaches fail in the zero-shot case on A and B, making the success of multi-task IRL in the few-shot case all the more remarkable. Demonstrations solely from non-target environments are not enough to recover a good reward in the target, so substantial learning must be taking place with only one or two trajectories.
5.1.2 Hyperparameter Choice
Our regularised MCE IRL algorithm takes a hyperparameter $\lambda$ that specifies the regularisation strength. We show in fig. 4 the results of a hyperparameter sweep over values of $\lambda$ spanning two orders of magnitude. As expected, the weakest regularisation constant suffers from high variance across the random seeds when the number of trajectories is small.

Perhaps more surprisingly, the strongest regularisation constant also has high variance. We conjecture that it imposes too strong a prior, making it highly sensitive to the trajectories observed in the off-target environments.
The median regularisation constant attains the lowest variance and highest mean of the hyperparameters tested, and was used in the previous section’s experiments.
These results indicate that where sample efficiency is paramount, it is important to choose a regularisation hyperparameter suited to the task distribution. However, the algorithm is reasonably robust to hyperparameter choice, with all parameters (varying across two orders of magnitude) attaining near-optimal performance after as few as 20 trajectories. By contrast, the single-task IRL algorithm did not achieve this level of performance even in the best case until observing 50 or more trajectories.
5.2 Meta-Learning Reward Networks
We evaluated both our random and task-specific implementations of meta-AIRL (algorithm 1) on a multi-task variant of the mountain car continuous control problem. As a baseline, we compare against both a standard version of single-task AIRL and one with a random policy.
Our test environment, illustrated in figure 6, is a symmetric version of Gym’s MountainCarContinuous-v0 environment. Both the left and right mountains have a flag at their peak, and the episode ends as soon as the car touches either flag. One flag is the goal, and the agent receives a reward for reaching this flag. The other flag is a decoy, and the agent receives a penalty if it touches this flag. In addition, there is a quadratic control cost.
We evaluate in two test cases. The fixed test case consists of two environments: one where the goal flag is always on the left, another where it is always on the right. In the variable test case, the side of the goal flag is chosen randomly at the start of each episode. It consists of two environments: one where the blue flag is the goal, another where it is the red flag. The position of both flags is included in the state.
Figure 5 reports the value of the resulting policies in the fixed test case (top row) and the variable test case (bottom row). The left column corresponds to policies learnt directly by AIRL, and the right column to policies trained with PPO on the reward learnt by AIRL.
For the fixed test case, we see that both the single-task baseline and our meta-AIRL algorithm produce near-optimal solutions. This is unsurprising: the optimal policy is unimodal, and so it is simple to extrapolate from a single trajectory, especially in a low-dimensional environment such as mountain car.
In the variable test case, figure 5 shows that single-task AIRL is unable to reliably find a good solution even after observing many trajectories. Reptile can only learn a good meta-initialisation in the outer loop if consistent progress is made in the AIRL inner loop, so unsurprisingly our meta-AIRL algorithm also fails in this environment. Note the variable test case has a bimodal expert policy: the best trajectory depends on whether the target flag is on the left or right.
Our findings suggest that adversarial IRL attains good performance only in environments with a unimodal optimal policy. In these environments, a handful of trajectories is sufficient to recover the reward, leaving little room for improvement from applying meta-learning. While existing simulated robotics benchmarks can largely be solved by unimodal policies, many practical tasks (such as multi-step assembly) cannot, making this a pressing area for further research.
6 Conclusions and Future Work
Sample-efficient solutions to the multi-task IRL problem are critical for enabling real-world applications, where collecting human demonstrations is expensive and slow. The multi-task IRL problem has previously been studied exclusively from a Bayesian IRL perspective. In this paper we took the alternative approach of formulating the multi-task problem inside the maximum causal entropy IRL framework by adding a regularisation term to the loss. Our experiments show that the resulting multi-task IRL algorithm can perform one-shot imitation learning in an environment that single-task IRL requires hundreds of demonstrations to learn.
Maximum causal entropy IRL (Ziebart et al., 2010) cannot scale to MDPs with large or infinite state spaces, and moreover requires known dynamics. Both of these problems are alleviated by recent extensions to maximum causal entropy IRL, such as guided cost learning and adversarial IRL (Finn et al., 2016; Fu et al., 2018). Our second contribution is to show how, in this function-approximator setting, multi-task IRL can be framed as a meta-learning problem.
Testing of our prototype meta-AIRL method (algorithm 1) found that adversarial IRL (Fu et al., 2018) can only learn from unimodal expert policies, seriously limiting the applicability of meta-AIRL. We conjecture this limitation of adversarial IRL is related to the well-known problem of mode collapse in generative adversarial networks (GANs). A fruitful research direction might be to apply recent innovations in GAN training, such as unrolling the optimisation of the discriminator (Metz et al., 2017) or variational learning (Srivastava et al., 2017), to stabilise adversarial IRL training.
Another limitation of adversarial IRL is that the inferred reward network (discriminator) often overfits to the jointly learnt policy (generator). In particular, training a policy using the reward network fails from most random initialisations, even if the policy learnt by adversarial IRL obtains good performance. Our prototype performed meta-learning only on the reward network, making it particularly sensitive to this limitation. Jointly meta-learning the reward and policy networks could improve performance, but we believe this will first require significant improvements in meta-RL algorithms.
The source code for our algorithms and experiments is open source and available at —removed for double blind— .
Acknowledgements
Removed for double blind review.
References
 Abbeel & Ng (2004) Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 Amin et al. (2017) Amin, Kareem, Jiang, Nan, and Singh, Satinder. Repeated inverse reinforcement learning. In NIPS. 2017.
 Asadi & Littman (2017) Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. In ICML, 2017.
 Babeş-Vroman et al. (2011) Babeş-Vroman, Monica, Marivate, Vukosi, Subramanian, Kaushik, and Littman, Michael. Apprenticeship learning about multiple intentions. In ICML, 2011.
 Choi & Kim (2012) Choi, Jaedeug and Kim, KeeEung. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In NIPS, 2012.
 Dimitrakakis & Rothkopf (2011) Dimitrakakis, Christos and Rothkopf, Constantin A. Bayesian multitask inverse reinforcement learning. In EWRL, 2011.
 Duan et al. (2017) Duan, Yan, Andrychowicz, Marcin, Stadie, Bradly, Ho, Jonathan, Schneider, Jonas, Sutskever, Ilya, Abbeel, Pieter, and Zaremba, Wojciech. Oneshot imitation learning. In NIPS, 2017.
 Finn et al. (2016) Finn, Chelsea, Levine, Sergey, and Abbeel, Pieter. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016.
 Finn et al. (2017) Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Modelagnostic metalearning for fast adaptation of deep networks. In ICML, 2017.
 Fu et al. (2018) Fu, Justin, Luo, Katie, and Levine, Sergey. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.
 Goodfellow et al. (2014) Goodfellow, Ian, PougetAbadie, Jean, Mirza, Mehdi, Xu, Bing, WardeFarley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS. 2014.
 Hausman et al. (2017) Hausman, Karol, Chebotar, Yevgen, Schaal, Stefan, Sukhatme, Gaurav, and Lim, Joseph J. Multimodal imitation learning from unstructured demonstrations using generative adversarial nets. In NIPS. 2017.
 Ho & Ermon (2016) Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In NIPS. 2016.
 Li et al. (2017) Li, Yunzhu, Song, Jiaming, and Ermon, Stefano. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NIPS. 2017.
 Metz et al. (2017) Metz, Luke, Poole, Ben, Pfau, David, and SohlDickstein, Jascha. Unrolled generative adversarial networks. In ICLR, 2017.
 Ng & Russell (2000) Ng, Andrew Y. and Russell, Stuart. Algorithms for inverse reinforcement learning. In ICML, 2000.
 Nichol et al. (2018) Nichol, Alex, Achiam, Joshua, and Schulman, John. On firstorder metalearning algorithms. CoRR, abs/1803.02999, 2018.
 Ramachandran & Amir (2007) Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In IJCAI, 2007.
 Song et al. (2018) Song, Jiaming, Ren, Hongyu, Sadigh, Dorsa, and Ermon, Stefano. Multiagent generative adversarial imitation learning. In ICLR Workshop. 2018.
 Srivastava et al. (2017) Srivastava, Akash, Valkoz, Lazar, Russell, Chris, Gutmann, Michael U., and Sutton, Charles. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In NIPS. 2017.
 Wang et al. (2017) Wang, Ziyu, Merel, Josh S, Reed, Scott E, de Freitas, Nando, Wayne, Gregory, and Heess, Nicolas. Robust imitation of diverse behaviors. In NIPS. 2017.
 Xu et al. (2018) Xu, Kelvin, Ratner, Ellis, Dragan, Anca D., Levine, Sergey, and Finn, Chelsea. Learning a prior over intent via metainverse reinforcement learning. CoRR, abs/1805.12573, 2018.
 Ziebart et al. (2008) Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
 Ziebart et al. (2010) Ziebart, Brian D., Bagnell, J. Andrew, and Dey, Anind K. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.