Inverse reinforcement learning (IRL) is the task of determining the reward function that generated a set of trajectories: sequences of state-action pairs (Ng & Russell, 2000). It is a form of learning from demonstration that assumes the demonstrator follows a near-optimal policy with respect to an unknown reward. Sample efficiency is a key design goal, since it is costly to elicit human demonstrations.
In practice, demonstrations are often generated from multiple reward functions. This situation naturally arises when the demonstrations are for different tasks, such as grasping different types of objects, as depicted in fig. 1. Less obviously, it also occurs when different individuals perform what is nominally the same task, reflecting individuals’ unique preferences and styles. In this paper, we assume demonstrations of the same task are assigned a common label.
A naive solution to multi-task IRL is to repeatedly apply a single-task IRL algorithm to demonstrations of each task. However, this method requires that the number of samples increases proportionally with the number of tasks, which is prohibitive in many settings. Fortunately, the reward functions for related tasks are often similar, and exploiting this structure can enable greater sample efficiency.
Previous work on the multi-task IRL problem (Dimitrakakis & Rothkopf, 2011; Babeş-Vroman et al., 2011; Choi & Kim, 2012) builds on Bayesian IRL (Ramachandran & Amir, 2007). Unfortunately, no extant Bayesian IRL methods scale to complex environments with high-dimensional, continuous state spaces, such as those found in robotics. By contrast, approaches based on maximum causal entropy show more promise (Ziebart et al., 2010). Although the original maximum causal entropy IRL algorithm is limited to discrete state spaces, recent extensions such as guided cost learning and adversarial IRL scale to challenging continuous control environments (Finn et al., 2016; Fu et al., 2018).
Our two main contributions in this paper are:
Regularised Maximum Causal Entropy (MCE). We present a formulation of the multi-task IRL problem in the MCE framework. Our approach simply adds a regularisation term to the loss, retaining the computational efficiency of the original MCE IRL algorithm. We evaluate in a gridworld that MCE IRL needs hundreds of demonstrations to solve. By contrast, after a single demonstration our regularised variant recovers a reward leading to a near-optimal policy.
Meta-Learning Rewards. We describe preliminary work applying meta-learning to adversarial IRL. Evaluating on multi-task variants of continuous control tasks, we find baseline single-task adversarial IRL has poor performance even when given ample samples. This limitation consequently affects our multi-task variant. The poor performance appears to be due to the multimodal nature of optimal policies in these environments, presenting a challenge not present in other benchmarks. We conjecture this is analogous to the mode collapse problem in generative adversarial networks (Goodfellow et al., 2014), and conclude with suggestions for further work.
2 Preliminaries and Single-Task IRL
A Markov Decision Process (MDP) $M$ is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, \mu, R)$, where $\mathcal{S}$ and $\mathcal{A}$ are sets of states and actions; $\mathcal{T}(s' \mid s, a)$ is the probability of transitioning to $s'$ from $s$ after taking action $a$; $\gamma$ is a discount factor; $\mu(s)$ is the probability of starting in $s$; and $R(s, a)$ is the reward upon taking action $a$ in state $s$. We write $M \setminus R$ to denote an MDP without a reward function.
In the single-task IRL problem, the IRL algorithm is given access to an MDP without a reward function, $M \setminus R$, and demonstrations $\mathcal{D}$ from an (approximately) optimal policy. The goal is to recover a reward function $R$ that explains the demonstrations $\mathcal{D}$. Note this is an ill-posed problem: many reward functions $R$, including the constant zero reward function $R(s, a) = 0$, make the demonstrations optimal.
Bayesian IRL addresses this identification problem by inferring a posterior distribution $P(R \mid \mathcal{D})$ (Ramachandran & Amir, 2007). Although some probability mass will be placed on degenerate reward functions, for reasonable priors the majority of the probability will lie on more plausible explanations.
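By contrast, maximum causal entropy chooses a single reward function, using the principle of maximum entropy to select the least specific reward function that is still consistent with the demonstrations (Ziebart et al., 2008, 2010). It models the demonstrations as being sampled from:
$$\pi(a_t \mid s_t) = \exp\left(Q^{\mathrm{soft}}(s_t, a_t) - V^{\mathrm{soft}}(s_t)\right), \tag{1}$$
a stochastic expert policy that is noisily optimal for:
$$V^{\mathrm{soft}}(s_t) = \log \sum_{a} \exp Q^{\mathrm{soft}}(s_t, a), \qquad Q^{\mathrm{soft}}(s_t, a_t) = R(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\left[V^{\mathrm{soft}}(s_{t+1})\right].$$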
Note there can exist multiple solutions to these softmax Bellman equations (Asadi & Littman, 2017).
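To make the soft Bellman backup concrete, the following is a minimal sketch for a small tabular MDP, iterating from $V = 0$. It is an illustration under stated assumptions rather than the planner used in our experiments: the function name `soft_value_iteration`, the list-based MDP encoding and the two-state example are all hypothetical.

```python
import math


def soft_value_iteration(transitions, rewards, gamma, n_iters=500):
    """Repeatedly apply the soft Bellman backup on a small tabular MDP.

    transitions[s][a] is a list of (next_state, probability) pairs and
    rewards[s][a] is R(s, a). Returns (V, policy), where
    policy[s][a] = exp(Q(s, a) - V(s)).
    """
    n_states = len(rewards)
    V = [0.0] * n_states
    for _ in range(n_iters):
        # Q(s, a) = R(s, a) + gamma * E[V(s')]
        Q = [
            [
                rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
                for a in range(len(rewards[s]))
            ]
            for s in range(n_states)
        ]
        # V(s) = log sum_a exp Q(s, a), shifted by the max for numerical stability.
        V = [max(q) + math.log(sum(math.exp(x - max(q)) for x in q)) for q in Q]
    policy = [[math.exp(x - v) for x in q] for q, v in zip(Q, V)]
    return V, policy


# Hypothetical two-state example: in state 0, action 1 earns reward 1 and moves
# to the absorbing state 1; every other action earns reward 0.
transitions = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(1, 1.0)], [(1, 1.0)]],
]
rewards = [[0.0, 1.0], [0.0, 0.0]]
V, policy = soft_value_iteration(transitions, rewards, gamma=0.9)
```

Each row of `policy` sums to one by construction, since $V^{\mathrm{soft}}$ is the log-sum-exp of the corresponding row of $Q^{\mathrm{soft}}$. In the example the rewarded action in state 0 is preferred but not chosen deterministically, which is the sense in which the expert is noisily optimal.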
To reduce the dimension of the problem, it is common to assume the reward function is linear in features over the state-action pairs:
$$R(s, a) = \theta^\top \phi(s, a).$$
Let the expert demonstration $\mathcal{D}$ consist of trajectories $\tau = (s_0, a_0, \dots, s_T, a_T)$. For convenience, write:
$$\Phi(\tau) = \sum_{t=0}^{T} \gamma^t \phi(s_t, a_t).$$
Given a known feature map $\phi$, the IRL problem reduces to finding weights $\theta$.
A key insight behind maximum causal entropy IRL is that actions in the trajectory sequence depend causally on previous states and actions: i.e. $a_t$ may depend on $s_{0:t}$ and $a_{0:t-1}$, but not on states or actions that occur later in time. The causal log-likelihood of a trajectory $\tau$ is defined to be:
$$L(\tau) = \sum_{t=0}^{T} \log \pi(a_t \mid s_{0:t}, a_{0:t-1}),$$
with the causal entropy of a policy $\pi$ defined in terms of the causal log-likelihood of its trajectories:
$$H(\pi) = \mathbb{E}_{\tau \sim \pi}\left[-L(\tau)\right].$$
Maximum causal likelihood estimation of $\theta$ given the expert demonstrations $\mathcal{D}$ is equivalent to maximising the causal entropy $H(\pi)$ of the stochastic policy $\pi$ subject to the constraint that its expected feature counts match those of the demonstrations:
$$\mathbb{E}_{\tau \sim \pi}\left[\Phi(\tau)\right] = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \Phi(\tau).$$
Note this constraint guarantees $\pi$ attains the same (expected) reward as the expert demonstrations (Abbeel & Ng, 2004). Maximum causal entropy thus recovers reward weights that match the performance of the expert, while avoiding degeneracy by maximising the diversity of the policy.
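The feature-matching constraint also suggests how the reward weights can be fitted: a standard result for maximum causal entropy IRL is that the gradient of the causal log-likelihood with respect to the weights is the difference between the expert's feature counts and those expected under the current policy. The sketch below illustrates this under assumptions of our own choosing: the policy's expectation is estimated from sampled trajectories rather than computed exactly, and the helper names and the toy feature map are hypothetical.

```python
def discounted_feature_counts(trajectory, phi, gamma):
    """Return sum_t gamma^t * phi(s_t, a_t) for one trajectory of (s, a) pairs."""
    total = None
    for t, (s, a) in enumerate(trajectory):
        feats = phi(s, a)
        if total is None:
            total = [0.0] * len(feats)
        for k, f in enumerate(feats):
            total[k] += (gamma ** t) * f
    return total


def mean_feature_counts(trajectories, phi, gamma):
    """Average the discounted feature counts over a set of trajectories."""
    counts = [discounted_feature_counts(tau, phi, gamma) for tau in trajectories]
    n = len(counts)
    return [sum(c[k] for c in counts) / n for k in range(len(counts[0]))]


def feature_matching_gradient(expert_trajs, policy_trajs, phi, gamma):
    """Expert feature counts minus (sampled) policy feature counts."""
    expert = mean_feature_counts(expert_trajs, phi, gamma)
    learner = mean_feature_counts(policy_trajs, phi, gamma)
    return [e - l for e, l in zip(expert, learner)]


# Hypothetical example with one-hot state features over two states.
def one_hot_state(s, a):
    return [1.0 if s == 0 else 0.0, 1.0 if s == 1 else 0.0]


expert_trajs = [[(0, 0), (1, 0)]]  # the expert moves from state 0 to state 1
policy_trajs = [[(0, 0), (0, 0)]]  # the current policy stays in state 0
grad = feature_matching_gradient(expert_trajs, policy_trajs, one_hot_state, gamma=0.5)
```

Here `grad` is negative for the feature of state 0 and positive for that of state 1, so a gradient ascent step raises the reward of the state the expert visits more often than the policy does. The gradient vanishes exactly when the feature counts match.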
3 Methods for Multi-Task IRL
In multi-task IRL, the reward $R_i$ varies between MDPs $M_i$ with associated expert demonstrations $\mathcal{D}_i$. If the reward functions are unrelated to each other, we cannot do better than repeated application of a single-task IRL algorithm. However, in practice similar tasks have reward functions with similar structure, enabling specialised multi-task IRL algorithms to accurately infer the reward with fewer demonstrations.
In the next section, we solve the multi-task IRL problem using the original maximum causal entropy IRL algorithm with an additional regularisation term. Following this, we describe how our method can be extended to scalable approximations of maximum causal entropy IRL.
3.1 Regularised Maximum Causal Entropy IRL
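In the multi-task setting, we must jointly infer reward weights $\theta_i$ that explain each demonstration $\mathcal{D}_i$. To make progress we must make some assumption on the relationship between different reward weights. A natural assumption is that the reward weights for most tasks lie close to the mean across all tasks, i.e. $\lVert \theta_i - \bar{\theta} \rVert_2$ should be small, where $\bar{\theta} = \frac{1}{N} \sum_{i=1}^{N} \theta_i$ and $N$ is the number of tasks. This corresponds to a prior that each $\theta_i$ is drawn from i.i.d. Gaussians with mean $\bar{\theta}$ and variance monotonic with $\lambda^{-1}$. In practice, we do not know $\bar{\theta}$, but we can estimate it by taking the mean of the current iterates for $\theta_i$. This results in a pleasingly simple inference procedure. The regularised loss is:
$$\mathcal{L}_i(\theta_i) = -\sum_{\tau \in \mathcal{D}_i} L(\tau; \theta_i) + \frac{\lambda}{2} \lVert \theta_i - \bar{\theta} \rVert_2^2,$$
where $L(\tau; \theta_i)$ is the causal log-likelihood of trajectory $\tau$ under the maximum causal entropy policy for weights $\theta_i$.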
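The effect of the regulariser on each gradient step can be sketched as follows. This is an illustration only: it assumes the penalty takes the form $\frac{\lambda}{2} \lVert \theta_i - \bar{\theta} \rVert_2^2$ with $\bar{\theta}$ held fixed at the mean of the current iterates, and the function name `regularised_gradients` and the example values are hypothetical.

```python
def regularised_gradients(task_grads, thetas, lam):
    """Add the pull towards the mean weights to each task's loss gradient.

    task_grads[i] is the gradient of task i's unregularised loss (the negative
    log-likelihood) at thetas[i]. The mean of the current iterates is treated
    as a constant, so the penalty contributes lam * (theta_i - theta_bar).
    """
    n_tasks = len(thetas)
    dim = len(thetas[0])
    theta_bar = [sum(th[k] for th in thetas) / n_tasks for k in range(dim)]
    return [
        [g[k] + lam * (th[k] - theta_bar[k]) for k in range(dim)]
        for g, th in zip(task_grads, thetas)
    ]


# Hypothetical example: two tasks whose weights disagree, with zero data gradient
# so that only the regulariser acts.
thetas = [[0.0, 2.0], [2.0, 0.0]]
task_grads = [[0.0, 0.0], [0.0, 0.0]]
grads = regularised_gradients(task_grads, thetas, lam=0.5)

# One gradient descent step moves both weight vectors towards their mean [1, 1].
step_size = 1.0
updated = [[th[k] - step_size * g[k] for k in range(2)] for th, g in zip(thetas, grads)]
```

With few demonstrations the data gradient is noisy and the pull towards the mean dominates, whereas with many demonstrations the likelihood term outweighs it. This is the mechanism by which information is shared between tasks.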
3.2 Meta-Learning Reward Networks
In the previous section, we saw how multi-task IRL can be incorporated directly into the Maximum Causal Entropy (MCE) framework. However, the original MCE IRL algorithm has two major limitations. First, it assumes the MDP’s dynamics are known, whereas in many applications (e.g. robotics) the dynamics are unknown and must also be learned. Second, it requires the practitioner to provide a feature mapping such that the resulting reward is linear. For many tasks, finding these features may be the bulk of the problem, negating the benefit of IRL.
These limitations are addressed by guided cost learning and adversarial IRL (Finn et al., 2016; Fu et al., 2018), scalable approximations of MCE IRL. Specifically, adversarial IRL uses a neural network to represent the reward $R_\theta$ as a function of states and actions, obviating the need to specify a feature map $\phi$. Furthermore, it can handle unknown transition dynamics since it estimates the loss gradient via sampling rather than direct computation, and so only requires access to a simulation of the environment for rollouts.
Naively, we could directly translate the regularisation approach given in the previous section to this setting, applying it to the parameters $\theta$ of the neural network. However, regularising the parameter space may not regularise the output space: small changes in some parameters may have a large effect on the predicted reward, while large changes in other parameters may have little effect.
A more promising approach is to meta-learn the reward network parameters $\theta$. We selected Reptile (Nichol et al., 2018) as the basis for our initial experiments due to its computational efficiency, a key consideration given that IRL in complex environments is already computationally demanding. Moreover, Reptile attains accuracy in few-shot supervised learning tasks similar to that of more computationally expensive algorithms.
Our meta adversarial IRL (meta-AIRL) method is described in algorithm 1. We seek to find an initialisation $\theta$ for the reward network that can be quickly finetuned for any given task (by running adversarial IRL on demonstrations of that task). To achieve this, we repeatedly sample a task and run $k$ steps of adversarial IRL, starting from our current initialisation $\theta$. The initialisation is then updated along the line between the initialisation and final iterate of adversarial IRL. Although this appears superficially similar to joint training, for $k > 1$ it is an approximation to first-order model-agnostic meta-learning (MAML) (Finn et al., 2017), a more principled but computationally expensive meta-learning algorithm.
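The outer loop just described can be sketched in a few lines. This is a minimal illustration of a Reptile-style update and not a reproduction of algorithm 1: `inner_update` stands in for one step of adversarial IRL on a task, and the toy quadratic tasks, the parameter names and the step sizes are all assumptions made for the example.

```python
import random


def reptile_meta_train(theta, tasks, inner_update, k, meta_lr, n_meta_iters, rng):
    """Reptile-style meta-training of an initialisation theta (a list of floats).

    inner_update(params, task) returns the parameters after one inner step on
    the task; in meta-AIRL this would be one step of adversarial IRL.
    """
    for _ in range(n_meta_iters):
        task = rng.choice(tasks)
        adapted = list(theta)
        for _ in range(k):
            adapted = inner_update(adapted, task)
        # Move the initialisation along the line towards the final inner iterate.
        theta = [t + meta_lr * (a - t) for t, a in zip(theta, adapted)]
    return theta


# Toy stand-in for the inner loop: each task is a target vector, and one inner
# step moves the parameters part of the way towards it.
def toy_inner_update(params, target, step=0.5):
    return [p + step * (t - p) for p, t in zip(params, target)]


tasks = [[-1.0], [1.0]]
theta = reptile_meta_train(
    [5.0], tasks, toy_inner_update, k=3, meta_lr=0.1, n_meta_iters=200,
    rng=random.Random(0),
)
```

In this toy problem the initialisation ends up between the two task optima, from where either task can be reached in a few inner steps.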
Algorithm 1 cannot be applied verbatim since adversarial IRL jointly learns a reward function and a policy optimising that reward function. This is analogous to a GAN, where the policy network is a generator and the reward network defines a discriminator (assigning greater probability to higher reward trajectories). We developed two concrete implementations of meta-AIRL, differing only in the policy used during meta-training:
Random. The simplest solution is to randomly initialise the policy at the start of each new task. This reduces adversarial IRL to a pure sample-based approximation of MCE IRL. We expect this to work in simple environments, where a random agent can cover most of the state space, but to fail in more challenging environments.
Task-specific. A more sophisticated option is to maintain separate policy parameters per task. This method learns reward parameters that can be quickly finetuned to discriminate data from a distribution of generators.
However, since the policy for a task is updated only when that task is sampled, care must be taken to ensure the interval between successive samples of the same task does not grow too large. Otherwise, policies for many tasks might become very suboptimal for the current reward network weights, slowing convergence. Accordingly, we suggest training in mini-batches of small numbers of tasks.
4 Related Work
Previous work in multi-task IRL has approached the problem from a Bayesian perspective. Dimitrakakis & Rothkopf (2011) model reward-policy function pairs as being drawn from a common (unknown) prior, over which they place a hyperprior. This work provides a solid theoretical basis for work on multi-task IRL, but the inference problem is intractable even for moderately sized finite-state MDPs.
Complementary work has tackled an unlabelled variant of the multi-task IRL problem. That is, not only are the reward functions unknown, it is also not known which reward each trajectory is paired with. Babeş-Vroman et al. (2011) use expectation-maximisation to cluster trajectories, an approach applicable to several IRL algorithms. Choi & Kim (2012) instead take a Bayesian IRL approach using a Dirichlet process mixture model, allowing a variable number of clusters. Both methods reduce the problem to multiple single-task IRL problems, and so unlike our work do not exploit similarities between reward functions.
Amin et al. (2017) have studied the similar problem of repeated IRL: learning a common reward component shared across tasks, given known task-specific reward components. Although this could be solved by applying IRL to any one of the tasks, a repeated IRL algorithm can attain better bounds and even resolve the ambiguity inherent in single-task IRL.
IRL is often used for imitation learning. Multi-task imitation learning has also been studied from a non-IRL perspective, especially in the context of generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). Recent extensions to GAIL augment trajectories with a latent intention variable that specifies the task, and then maximise mutual information between the state-action pairs and intention variable (Hausman et al., 2017; Li et al., 2017). These approaches are focused on disentangling trajectories from different tasks, and are not intended to speed up the learning of new tasks.
However, the imitation learning community does address this problem in one-shot imitation learning: having seen a distribution of trajectories over various tasks, learn a new task from a single demonstration. Wang et al. (2017) use GAIL with the discriminator conditioned on the output generated by an LSTM encoder. After training on unlabelled trajectories, this method can perform one-shot imitation learning by conditioning on the code of a demonstration trajectory. One-shot imitation learning has also been tackled within the behavioural cloning paradigm (Duan et al., 2017).
Multi-agent GAIL (Song et al., 2018) is the imitation learning method most similar to our paper. Although GAIL does not explicitly learn a reward function, it is equivalent to IRL composed with RL. Similar to our work, multi-agent GAIL seeks to improve sample efficiency by exploiting similarity between the reward functions. However, unlike our work, multi-agent GAIL makes strong assumptions on the reward function (e.g. zero sum games).
In concurrent work, Xu et al. (2018) independently developed a meta-IRL algorithm, applying MAML directly to Maximum Entropy IRL. They obtain good performance on a “Spriteworld” navigation domain, consisting of a gridworld with overlaid “sprite” textures. However, their algorithm inherits the limitations of Maximum Entropy IRL: the MDP must have a finite state space and known transition dynamics. By contrast, our meta-AIRL method (algorithm 1) learns the transition dynamics and can operate in infinite state space MDPs such as continuous control environments. However, this flexibility comes at a cost, with our experiments showing that adversarial IRL has difficulty learning from non-unimodal policies. Accordingly, we view Xu et al. (2018) and our own work as being complementary, making different trade-offs in order to target different applications.
5.1 Regularised Maximum Causal Entropy IRL
We evaluate our regularised maximum causal entropy (MCE) IRL algorithm in a few-shot reward learning problem on the gridworld depicted in fig. 2. Transitions in the gridworld are stochastic, with probability of moving in the desired direction, and of moving in each of the two orthogonal directions. Each cell in the gridworld is either a wall (in which case the state can never be visited), or one of five object types: dirt, grass, lava, gold and silver.
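The stochastic transitions can be illustrated with a short sketch. Several details here are assumptions rather than a description of our environment: the probability `p_intended` of moving in the desired direction is left as a parameter (the value 0.8 below is purely illustrative), the remaining probability is split equally between the two orthogonal directions, and a move into a wall or off the grid is assumed to leave the agent where it is.

```python
# Compass moves as (row, column) offsets, and the directions orthogonal to each.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
ORTHOGONAL = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}


def transition_distribution(grid, pos, action, p_intended):
    """Return {next_cell: probability}; grid[r][c] == '#' marks a wall."""
    p_slip = (1.0 - p_intended) / 2.0
    outcomes = [(action, p_intended)] + [(d, p_slip) for d in ORTHOGONAL[action]]
    dist = {}
    for direction, prob in outcomes:
        dr, dc = MOVES[direction]
        r, c = pos[0] + dr, pos[1] + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        # Assumption: blocked moves leave the agent in its current cell.
        nxt = (r, c) if inside and grid[r][c] != "#" else pos
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


# Hypothetical 3x3 grid with a wall in the centre; the agent tries to move
# south into the wall from the top-middle cell.
grid = ["...", ".#.", "..."]
dist = transition_distribution(grid, (0, 1), "S", p_intended=0.8)
```

In this example the intended move is blocked, so the agent stays put with probability `p_intended` and slips east or west otherwise.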
We define three different reward functions A, B and A+B in terms of these object types, as specified by the legend of fig. 2. The reward functions assign the same weights to dirt, grass and lava, but differ in the weights for gold and silver. A likes silver but is neutral about gold, B has the opposite preferences and A+B likes both gold and silver. We generate synthetic demonstrations for each of these three reward functions using the MCE planner given by eq. (1).
Our multi-task IRL algorithm is then presented with demonstrations from an optimal policy for each reward function. Demonstrations for the few-shot environment are restricted to trajectories, varying between and , while demonstrations for the other two environments contain trajectories. To make the task more challenging, our algorithm is not provided with the feature representation, instead having to learn the reward separately for each state. We repeat all experiments for 5 random seeds.
5.1.1 Comparison to Baselines
We compare against two baselines. The first (‘single’) corresponds to using single-task MCE IRL, seeing only the trajectories from the few-shot environment. The second (‘joint training’) combines the demonstrations from all three environments into a single -length sequence of trajectories. For reference, we also display the value obtained by an optimal (‘oracle’) policy. Figure 3 shows the best out of 5 random seeds.
Our multi-task IRL algorithm recovers a near-optimal policy in all 5 runs after only two trajectories, and in the best case requires only a single trajectory. By contrast, the ‘single’ baseline requires trajectories or more to recover a good policy even in the best case, and after trajectories several seeds still obtain negative total rewards.
The ‘joint training’ baseline performs well on A+B. This is unsurprising, since an optimal policy in A or B is near-optimal in A+B. However, it fares poorly in both the A and B environments, never obtaining a positive reward even in the best case.
Note that all approaches fail in the zero-shot case on A and B, making the success of multi-task IRL in the few-shot case all the more remarkable. Demonstrations solely from non-target environments are not enough to recover a good reward in the target, and so substantial learning must be taking place with only one or two trajectories.
5.1.2 Hyperparameter Choice
Our regularised MCE IRL algorithm takes a hyperparameter $\lambda$ that specifies the regularisation strength. We show in fig. 4 the results of a hyperparameter sweep between and . As expected, the weakest regularisation constant suffers from high variance across the random seeds when the number of trajectories is small.
Perhaps more surprising, the strongest regularisation constant also has high variance. We conjecture that it imposes too strong a prior, making it highly sensitive to the trajectories observed in the off-target environments.
The median regularisation constant attains the lowest variance and highest mean of the hyperparameters tested, and was used in the previous section’s experiments.
These results indicate that where sample efficiency is paramount, it is important to choose a regularisation hyperparameter suited to the task distribution. However, the algorithm is reasonably robust to hyperparameter choice, with all parameters (varying across two orders of magnitude) attaining near-optimal performance after as few as 20 trajectories. By contrast, the single-task IRL algorithm did not achieve this level of performance even in the best case until observing 50 or more trajectories.
5.2 Meta-Learning Reward Networks
We evaluated both our random and task-specific implementations of meta-AIRL (algorithm 1) on a multi-task variant of the mountain car continuous control problem. As a baseline we compare against both a standard version of single-task AIRL and one with a random policy.
Our test environment, illustrated in figure 6, is a symmetric version of Gym’s MountainCarContinuous-v0 environment. Both the left and right mountains have a flag at their peak, and the episode ends as soon as the car touches either flag. One flag is the goal, and the agent receives reward from reaching this flag. The other flag is a decoy, and the agent receives a penalty if it touches this flag. In addition, there is a quadratic control cost.
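The reward structure of this environment can be summarised in a short sketch. It is an illustration only and does not call Gym: the function name, the flag positions, and the magnitudes of the goal reward, decoy penalty and control cost are hypothetical placeholders, since only the qualitative structure is specified above.

```python
def symmetric_mountain_car_reward(position, action, goal_side, *, left_flag,
                                  right_flag, goal_reward, decoy_penalty,
                                  control_cost):
    """Return (reward, done) for one step of the symmetric mountain car.

    goal_side is "left" or "right". The episode ends when the car touches
    either flag, and a quadratic control cost is paid on every step.
    """
    reward = -control_cost * action ** 2
    if position <= left_flag:
        touched = "left"
    elif position >= right_flag:
        touched = "right"
    else:
        return reward, False
    reward += goal_reward if touched == goal_side else -decoy_penalty
    return reward, True


# Hypothetical parameter values, chosen only for illustration.
params = dict(left_flag=-1.0, right_flag=1.0, goal_reward=100.0,
              decoy_penalty=100.0, control_cost=0.1)
r_goal, done_goal = symmetric_mountain_car_reward(1.0, 0.5, "right", **params)
r_decoy, done_decoy = symmetric_mountain_car_reward(-1.0, 0.0, "right", **params)
r_mid, done_mid = symmetric_mountain_car_reward(0.0, 1.0, "right", **params)
```

The same trajectory is rewarded or penalised depending on `goal_side`, which is why an expert that must handle both goal positions follows a bimodal policy.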
We evaluate in two test cases. The fixed test case consists of two environments: one where the goal flag is always on the left, another where it is always on the right. In the variable test case, the side of the goal flag is chosen randomly at the start of each episode. This test case also consists of two environments: one where the blue flag is the goal, another where it is the red flag. The position of both flags is included in the state.
Figure 5 reports the value of the resulting policies in the fixed test case (top row) and variable test case (bottom row). The left column corresponds to policies learnt directly by AIRL, and the right column to policies trained with PPO on the reward learnt by AIRL.
For the fixed test case, we see that both the single-task baseline and our meta-AIRL algorithm produce near-optimal solutions. This is unsurprising: the optimal policy is unimodal, and so it is simple to extrapolate from a single trajectory, especially in a low-dimensional environment such as mountain car.
In the variable test case, figure 5 shows that single-task AIRL is unable to reliably find a good solution even after observing trajectories. Reptile can only learn a good meta-initialisation in the outer loop if consistent progress is made in the AIRL inner loop, so unsurprisingly our meta-AIRL algorithm also fails in this environment. Note the variable test case has a bimodal expert policy: the best trajectory depends on whether the target flag is on the left or right.
Our findings suggest that adversarial IRL attains good performance only in environments with a unimodal optimal policy. In these environments, a handful of trajectories is sufficient to recover the reward, leaving little room for improvement from applying meta-learning. While existing simulated robotics benchmarks can largely be solved by unimodal policies, many practical tasks (such as multi-step assembly) cannot, making this a pressing area for further research.
6 Conclusions and Future Work
Sample-efficient solutions to the multi-task IRL problem are critical for enabling real-world applications, where collecting human demonstrations is expensive and slow. The multi-task IRL problem has previously been studied exclusively from a Bayesian IRL perspective. In this paper we took the alternative approach of formulating the multi-task problem inside the maximum causal entropy IRL framework by adding a regularisation term to the loss. Experiments find our multi-task IRL algorithm can perform one-shot imitation learning in an environment where single-task IRL requires hundreds of demonstrations.
Maximum causal entropy IRL (Ziebart et al., 2010) cannot scale to MDPs with large or infinite state spaces, and moreover requires known dynamics. Both these problems have been alleviated by recent extensions to maximum causal entropy IRL, such as guided cost learning and adversarial IRL (Finn et al., 2016; Fu et al., 2018). Our second contribution is to show how in this function approximator setting, multi-task IRL can be framed as a meta-learning problem.
Testing of our prototype meta-AIRL method (algorithm 1) found that adversarial IRL (Fu et al., 2018) can only learn from unimodal expert policies, seriously limiting the applicability of meta-AIRL. We conjecture this limitation in adversarial IRL is related to the well-known problem of mode collapse in generative adversarial networks (GANs). A fruitful research direction might be to apply recent innovations in GAN training such as unrolling the optimisation of the discriminator (Metz et al., 2017) or variational learning (Srivastava et al., 2017) to stabilise adversarial IRL training.
Another limitation of adversarial IRL is that the inferred reward network (discriminator) often overfits to the jointly learnt policy (generator). In particular, training a policy using the reward network fails from most random initialisations, even if the policy learnt by adversarial IRL obtains good performance. Our prototype performed meta-learning only on the reward network, making it particularly sensitive to this limitation. Jointly meta-learning the reward and policy network could improve performance, but we believe this will first require significant improvements in meta-RL algorithms.
The source code for our algorithms and experiments is open source and available at —removed for double blind— .
Removed for double blind review.
- Abbeel & Ng (2004) Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
- Amin et al. (2017) Amin, Kareem, Jiang, Nan, and Singh, Satinder. Repeated inverse reinforcement learning. In NIPS, 2017.
- Asadi & Littman (2017) Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. In ICML, 2017.
- Babeş-Vroman et al. (2011) Babeş-Vroman, Monica, Marivate, Vukosi, Subramanian, Kaushik, and Littman, Michael. Apprenticeship learning about multiple intentions. In ICML, 2011.
- Choi & Kim (2012) Choi, Jaedeug and Kim, Kee-Eung. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In NIPS, 2012.
- Dimitrakakis & Rothkopf (2011) Dimitrakakis, Christos and Rothkopf, Constantin A. Bayesian multitask inverse reinforcement learning. In EWRL, 2011.
- Duan et al. (2017) Duan, Yan, Andrychowicz, Marcin, Stadie, Bradly, Ho, Jonathan, Schneider, Jonas, Sutskever, Ilya, Abbeel, Pieter, and Zaremba, Wojciech. One-shot imitation learning. In NIPS, 2017.
- Finn et al. (2016) Finn, Chelsea, Levine, Sergey, and Abbeel, Pieter. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016.
- Finn et al. (2017) Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
- Fu et al. (2018) Fu, Justin, Luo, Katie, and Levine, Sergey. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.
- Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.
- Hausman et al. (2017) Hausman, Karol, Chebotar, Yevgen, Schaal, Stefan, Sukhatme, Gaurav, and Lim, Joseph J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NIPS, 2017.
- Ho & Ermon (2016) Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In NIPS, 2016.
- Li et al. (2017) Li, Yunzhu, Song, Jiaming, and Ermon, Stefano. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NIPS, 2017.
- Metz et al. (2017) Metz, Luke, Poole, Ben, Pfau, David, and Sohl-Dickstein, Jascha. Unrolled generative adversarial networks. In ICLR, 2017.
- Ng & Russell (2000) Ng, Andrew Y. and Russell, Stuart. Algorithms for inverse reinforcement learning. In ICML, 2000.
- Nichol et al. (2018) Nichol, Alex, Achiam, Joshua, and Schulman, John. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
- Ramachandran & Amir (2007) Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In IJCAI, 2007.
- Song et al. (2018) Song, Jiaming, Ren, Hongyu, Sadigh, Dorsa, and Ermon, Stefano. Multi-agent generative adversarial imitation learning. In ICLR Workshop, 2018.
- Srivastava et al. (2017) Srivastava, Akash, Valkov, Lazar, Russell, Chris, Gutmann, Michael U., and Sutton, Charles. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In NIPS, 2017.
- Wang et al. (2017) Wang, Ziyu, Merel, Josh S, Reed, Scott E, de Freitas, Nando, Wayne, Gregory, and Heess, Nicolas. Robust imitation of diverse behaviors. In NIPS, 2017.
- Xu et al. (2018) Xu, Kelvin, Ratner, Ellis, Dragan, Anca D., Levine, Sergey, and Finn, Chelsea. Learning a prior over intent via meta-inverse reinforcement learning. CoRR, abs/1805.12573, 2018.
- Ziebart et al. (2008) Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
- Ziebart et al. (2010) Ziebart, Brian D., Bagnell, J. Andrew, and Dey, Anind K. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.