1 Introduction
The increasing capabilities of autonomous systems allow them to be applied in more and more complex environments. However, fast and easy deployment requires simple programming approaches for adjusting autonomous systems to new goals and tasks. A well-known class of approaches offering such properties is apprenticeship learning (AL, also known as learning from demonstration) Argall2009SurveyLfD , which summarizes the methods for teaching new skills by demonstration instead of programming them directly. Common AL methods include, for instance, behavioral cloning (BC) Bain1999BehavioralCloning , AL via inverse reinforcement learning (IRL) Russell1998LearningAgents , and adversarial imitation learning GAIL .
The goal of behavioral cloning is to mimic the expert's behavior by estimating the policy directly from demonstrations. This method typically requires a large amount of data, and the resulting models often suffer from compounding errors due to covariate shift caused by policy errors ERILross10a . Instead of copying the behavior directly, inverse reinforcement learning assumes a rational agent and estimates an unknown reward function that represents the agent's underlying motivations and goals. Hence, the reward function is considered to be a more parsimonious Ng2000AlgorithmsForIRL or succinct Abbeel:2004:ALV:1015330.1015430 description of the expert's objective. It allows optimal actions to be inferred for unobserved states, new environments, or different dynamics. Although IRL has been shown to be very sample-efficient in terms of expert demonstrations (Abbeel2008ParkingLotNavigation ; Kuderer2013TeachingMobileRobots ), it often requires careful hand-engineering of reward functions, as the IRL problem is known to be ill-posed: multiple reward functions can explain a given observed behavior. Furthermore, many IRL approaches require solving the reinforcement learning (RL) problem in an inner loop, which becomes increasingly prohibitive in high-dimensional continuous spaces.

To overcome these problems, generative adversarial imitation learning (GAIL) GAIL proposes a GAN-based method that learns the policy directly by matching occupancy measures, which is shown to be similar to running RL after IRL, and therefore preserves the advantages of IRL while being computationally more viable. However, the method intrinsically minimizes the Jensen-Shannon divergence by learning a discriminator, which ultimately cannot be interpreted as a proper reward function. More recently, adversarial IRL fu2018learning decouples the discriminator so that it outputs a valid reward function, but it requires a specific network architecture. Moreover, standard GAN training Goodfellow:2014:GAN is known to be prone to instabilities, e.g. mode collapse or vanishing gradients. Hence, researchers have investigated several alternatives to the Jensen-Shannon divergence objective, most notably a Wasserstein GAN-based objective (Wang_WIOC ; tail2018socially ). While the quantitative results indicate an improvement in performance, a theoretical justification has been missing so far.
In this work, we show that it is indeed possible to generalise the existing approaches to a larger space of reward functions. In particular, we derive a novel imitation learning algorithm built on the Wasserstein distance. Our contribution is threefold. First, we justify theoretically that there is a natural connection between apprenticeship learning by minimizing Integral Probability Metrics, e.g. the Wasserstein distance sriperumbudur2012empirical , and apprenticeship learning via inverse reinforcement learning. This enables a broader class of reward functions with desirable properties, e.g. smoothness. When choosing the Wasserstein distance, we observe a strong connection to optimal transport (OT) theory: the Kantorovich potential in the dual form of the OT problem can be interpreted as a valid reward function. Second, we propose a novel approach called Wasserstein Adversarial Imitation Learning (WAIL), which leverages regularized optimal transport to enable large-scale applications. Finally, we perform several robotic experiments, in which our approach outperforms the baselines in terms of average cumulative rewards and shows a significant improvement in sample efficiency, requiring just one expert demonstration.

2 Background
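Preliminaries. An infinite-horizon discounted Markov decision process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \mu_0, \gamma)$ consisting of the finite state space $\mathcal{S}$, the finite action space $\mathcal{A}$, and the transition function $p(\cdot \mid s, a)$, which is a probability measure on $\mathcal{S}$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Moreover, we have the reward function $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ (i.e., a function mapping from $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}$), the starting distribution $\mu_0$ on $\mathcal{S}$ satisfying $\mu_0(s) > 0$ for all $s \in \mathcal{S}$, and finally the discount factor $\gamma \in (0, 1)$.

If we combine this setting with a stochastic policy $\pi$ from the set of policies $\Pi$, i.e. a conditional probability distribution $\pi(\cdot \mid s)$ on $\mathcal{A}$ given some state $s$, we obtain a Markov chain $(s_t, a_t)_{t \ge 0}$ in the following natural way: take a random starting state $s_0 \sim \mu_0$, choose a random action $a_t \sim \pi(\cdot \mid s_t)$, and then restart the chain with probability $1 - \gamma$ or choose the next state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ otherwise; repeat the last two steps.

It is well known that $(s_t, a_t)_{t \ge 0}$, if seen as a Markov chain on $\mathcal{S} \times \mathcal{A}$, has a stationary distribution (or occupancy measure) $\rho_\pi$ which satisfies $\rho_\pi(s, a) = \pi(a \mid s) \sum_{a'} \rho_\pi(s, a')$ as well as the Bellman equation $\sum_{a} \rho_\pi(s, a) = (1 - \gamma)\,\mu_0(s) + \gamma \sum_{s', a'} p(s \mid s', a')\,\rho_\pi(s', a')$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. Moreover, there is a one-to-one correspondence between those measures and the policies in $\Pi$.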
Proposition 2.1.
(see Theorem 2 of Syed:2008:ALU:1390156.1390286 ) The mapping defined by $\pi \mapsto \rho_\pi$ is a bijection between the set of policies $\Pi$ and the set of measures on $\mathcal{S} \times \mathcal{A}$ satisfying the Bellman equation.
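The correspondence between policies and occupancy measures can be checked numerically on a small random MDP. The following is a minimal sketch, not the authors' code: the array layout (`P[s, a, s2]`, `pi[s, a]`, `mu0[s]`), the function names, and the normalized form of the Bellman flow equation are assumptions made for illustration.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma):
    """Normalized discounted occupancy measure rho[s, a] of policy pi.

    P[s, a, s2] are transition probabilities, pi[s, a] is the policy,
    mu0[s] is the start distribution and gamma the discount factor.
    """
    n_s = P.shape[0]
    # State-to-state kernel under pi: T[s, s2] = sum_a pi[s, a] * P[s, a, s2].
    T = np.einsum("sa,sat->st", pi, P)
    # The state marginal d solves d = (1 - gamma) * mu0 + gamma * T^T d.
    d = np.linalg.solve(np.eye(n_s) - gamma * T.T, (1.0 - gamma) * mu0)
    return d[:, None] * pi

def policy_from_occupancy(rho):
    """Inverse map: pi(a | s) = rho(s, a) / sum_a' rho(s, a')."""
    return rho / rho.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)
mu0 = np.full(n_s, 1.0 / n_s)
r = rng.random((n_s, n_a))

rho = occupancy_measure(P, pi, mu0, gamma)

# Residual of the Bellman flow equation, one entry per state.
inflow = np.einsum("sat,sa->t", P, rho)
flow_residual = rho.sum(axis=1) - (1.0 - gamma) * mu0 - gamma * inflow

# Going back from the occupancy measure recovers the policy.
pi_recovered = policy_from_occupancy(rho)

# Discounted return computed two ways: via the value function and via rho.
T = np.einsum("sa,sat->st", pi, P)
v = np.linalg.solve(np.eye(n_s) - gamma * T, (pi * r).sum(axis=1))
return_from_values = float(mu0 @ v)
return_from_rho = float((rho * r).sum() / (1.0 - gamma))
```

Under these assumptions the flow residual vanishes, the recovered policy matches the original one, and the two return computations agree, which is the relationship the preliminaries rely on.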
Due to this correspondence, we write $\mathbb{E}_\pi[X]$ for the expected value of a random variable $X$ on $\mathcal{S} \times \mathcal{A}$ with respect to $\rho_\pi$. We observe that the expected cumulative reward of $\pi$, i.e. the expected sum of rewards up to the first restart of the chain, is given by $\frac{1}{1-\gamma}\,\mathbb{E}_\pi[r(s, a)]$. Hence it is easy to derive that it is also equal to the expected discounted return $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$ of the chain without restarts. For brevity, we sometimes write $\mathbb{E}_\pi[r]$.

Inverse Reinforcement Learning. The goal of reinforcement learning is to learn a policy that maximizes the expected cumulative reward. Hence, typical RL approaches assume the reward function to be given. However, for many problems the true reward function is not known or too hard to specify. Therefore, IRL instead estimates the unknown reward function from the expert policy $\pi_E$ or from demonstrations. Moreover, to weaken the assumption about the optimality of the policy, maximum causal entropy IRL (MCE-IRL) Ziebart2010Modeling has been proposed, which learns a reward function as

    $\max_{r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} \Big( \min_{\pi \in \Pi} -H(\pi) - \mathbb{E}_\pi[r(s, a)] \Big) + \mathbb{E}_{\pi_E}[r(s, a)]$    (1)

where $H(\pi) = \mathbb{E}_\pi[-\log \pi(a \mid s)]$ is the discounted causal entropy of the policy ZhouMCEIRL . In practice, the full expert policy is often not available, while it is possible to query expert demonstrations. Hence, the expected expert rewards are often approximated via a finite set of demonstrations. MCE-IRL then seeks a reward function that assigns high rewards to expert demonstrations and low rewards to all others, in favor of high-entropy policies. The corresponding RL problem then derives a policy from this reward function, $\mathrm{RL}(r) = \arg\min_{\pi \in \Pi} -H(\pi) - \mathbb{E}_\pi[r(s, a)]$, and is embedded in the inner loop of IRL.
Generative Adversarial Imitation Learning. IRL-based imitation learning approaches have been shown to generalise well if the estimated reward functions are properly designed. However, many approaches are inefficient, as they require solving the RL problem in an inner loop. In addition, the goal of imitation learning is typically an imitation policy. Hence, the reward function from the intermediate IRL step is often not required. Previous work GAIL has characterized the policies that originate from running RL after IRL, and proposes to extend the IRL problem from Eq.(1) by imposing an additional cost function regularizer $\psi$,

    $\mathrm{IRL}_\psi(\pi_E) = \arg\max_{c} \; -\psi(c) + \Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s, a)] \Big) - \mathbb{E}_{\pi_E}[c(s, a)]$    (2)

with cost functions $c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. Based on this definition, the combined optimization problem of learning a policy via RL with the learned reward function from IRL results in:

    $\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} \; -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})$    (3)

where $\psi^*$ is the convex conjugate of the cost function regularizer, which measures a distance between the induced occupancy measures of the expert and the learned policy. In practice, a discriminator $D$ is employed to differentiate the state-action pairs of the expert from those of the learned policy. The distance measure is formulated as $\psi^*(\rho_\pi - \rho_{\pi_E}) = \max_{D} \mathbb{E}_\pi[\log D(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))]$, taking $\log D(s, a)$ as the surrogate cost function to guide the reinforcement learning. At convergence, the discriminator cannot distinguish the expert from the learned policy and outputs $D(s, a) = 1/2$ everywhere. Therefore, it cannot be used as a valid cost or reward function. Since the negative cost can be treated as a reward function, we use the reward function representation throughout this work for consistency.

3 From Apprenticeship Learning to Optimal Transport
Suppose we aim to learn an imitation policy $\pi$ that tries to recover the expert's policy $\pi_E$ with corresponding occupancy measure $\rho_E$, while achieving similar expected rewards under the unknown expert's reward function. For this purpose, apprenticeship learning approaches via IRL have been proposed, which learn a reward function to derive a policy. Similarly, we use the causal entropy regularized apprenticeship learning Ho:2016:MIL:3045390.3045681 formulation as follows,

    $\min_{\pi \in \Pi} \; -H(\pi) + \max_{r \in \mathcal{R}} \; \mathbb{E}_{\pi_E}[r(s, a)] - \mathbb{E}_\pi[r(s, a)]$    (4)

As mentioned by GAIL , most apprenticeship learning approaches strongly restrict the reward function space $\mathcal{R}$ to derive efficient methods, to enhance generalisability, and to provide feasible solutions for the ill-posed IRL problem. Many approaches assume the reward function to be a linear combination of hand-engineered basis functions Abbeel:2004:ALV:1015330.1015430 ; Ziebart2008MaxEntIRL ; Syed2007GameTheoreticAL or learn to construct features from a large collection of component features levine2010feature . More recent approaches use Gaussian processes Levine2011GPIRL or deep learning-based models Finn2016GuidedCostLearning to learn nonlinear reward functions from the feature or state-action space. However, they require careful regularization to ensure that the reward functions do not degenerate.

From apprenticeship learning to Wasserstein distance. Although it is often easier to specify general properties of the desired reward functions $\mathcal{R}$, in practice it might not be possible to specify arbitrary reward function spaces and to derive efficient solutions for the corresponding apprenticeship learning problem. However, suppose the function space is closed under negation, i.e. $r \in \mathcal{R} \Rightarrow -r \in \mathcal{R}$, which is true if for each reward function $r$ there exists a cost function $c = -r$. Then we observe that the latter part of the regularized apprenticeship learning problem in Eq.(4) can be interpreted as an Integral Probability Metric (IPM) sriperumbudur2012empirical ,

    $\phi_{\mathcal{R}}(\rho_\pi, \rho_E) = \sup_{r \in \mathcal{R}} \big| \mathbb{E}_{\pi_E}[r(s, a)] - \mathbb{E}_\pi[r(s, a)] \big|$

between the induced occupancy measures $\rho_\pi$ and $\rho_E$. Depending on how $\mathcal{R}$ is chosen, $\phi_{\mathcal{R}}$ turns into different metrics for probability measures, such as total variation, $\mathcal{R} = \{r : \|r\|_\infty \le 1\}$; Maximum Mean Discrepancy, $\mathcal{R} = \{r : \|r\|_{\mathcal{H}} \le 1\}$ with $\|\cdot\|_{\mathcal{H}}$ being the norm of a reproducing kernel Hilbert space $\mathcal{H}$; and also the Wasserstein distance, $\mathcal{R} = \{r : \|r\|_{L} \le 1\}$ being the Lipschitz functions with constant $1$ with respect to some distance function $d$ on $\mathcal{S} \times \mathcal{A}$.
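To make the role of the function class concrete, the three IPMs can be compared on discrete distributions supported on a one-dimensional grid. This is an illustrative sketch only: the 1-D setting, the Gaussian kernel, the bandwidth and all function names are assumptions, and with the unit sup-norm ball the first quantity is the L1 distance, i.e. twice the usual total variation.

```python
import numpy as np

def ipm_total_variation(p, q):
    # sup over |r| <= 1 of sum_i r_i (p_i - q_i), attained at r = sign(p - q).
    return float(np.abs(p - q).sum())

def ipm_mmd(p, q, x, bandwidth=1.0):
    # Unit ball of a Gaussian-kernel RKHS: MMD = sqrt((p - q)^T K (p - q)).
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / bandwidth ** 2)
    diff = p - q
    return float(np.sqrt(max(diff @ K @ diff, 0.0)))

def ipm_wasserstein1(p, q, x):
    # 1-Lipschitz functions on the line: W1 = integral of |CDF_p - CDF_q|.
    cdf_gap = np.cumsum(p - q)[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(x)))

x = np.array([0.0, 1.0, 2.0, 3.0])
expert = np.array([1.0, 0.0, 0.0, 0.0])   # all mass at x = 0
near = np.array([0.0, 1.0, 0.0, 0.0])     # all mass at x = 1
far = np.array([0.0, 0.0, 0.0, 1.0])      # all mass at x = 3

tv_near, tv_far = ipm_total_variation(expert, near), ipm_total_variation(expert, far)
w1_near, w1_far = ipm_wasserstein1(expert, near, x), ipm_wasserstein1(expert, far, x)
mmd_near, mmd_far = ipm_mmd(expert, near, x), ipm_mmd(expert, far, x)
```

In this toy case the total variation cannot tell the nearby distribution from the distant one because the supports do not overlap, whereas the Wasserstein distance grows with the displacement.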
In RL, many approaches suffer from sparse and delayed rewards. Hence, for a subsequent RL task it is beneficial to have smooth reward functions that are easier to optimize and interpret. To continue our discussion, we investigate choosing $\mathcal{R}$ as a class of smooth reward functions, namely those which are Lipschitz(1)-continuous with respect to some metric $d$ on $\mathcal{S} \times \mathcal{A}$.
Proposition 3.1.
Given two occupancy measures $\rho_\pi$ and $\rho_E$ induced by the policy $\pi$ and the expert $\pi_E$ respectively, the causal entropy regularized apprenticeship learning problem (4) can be formulated as follows,

    $\min_{\pi \in \Pi} \; -H(\pi) + W_1^d(\rho_\pi, \rho_E)$    (5)

where $W_1^d$ is the well-known 1-Wasserstein distance of the two measures with respect to the ground cost function $d$, which is a valid distance metric defined on the state-action space.
Proof. The proposition follows by treating the reward function as the Kantorovich potential in the dual form of the optimal transport (OT) problem computationalOT . Formally, let $x = [s, a]$ denote the concatenated state-action vector to avoid cluttered notation, and assume $x \sim \rho_E$ and $y \sim \rho_\pi$. For a certain policy we write the dual form of OT as

    $W_c(\rho_\pi, \rho_E) = \sup_{f, g} \; \mathbb{E}_{x \sim \rho_E}[f(x)] + \mathbb{E}_{y \sim \rho_\pi}[g(y)]$    (6)
    s.t. $f(x) + g(y) \le c(x, y)$

where $f, g$ are known as Kantorovich potentials. Moreover, if $c = d$ is a distance metric defined on $\mathcal{S} \times \mathcal{A}$, then $f$ is a Lip(1) function and $g = -f$ by the c-transform trick computationalOT , and this reduces to the 1-Wasserstein distance $W_1^d$. Taking the Kantorovich potential $f$ as the reward function $r$ in Eq.(4), we conclude our proposition.
∎
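The argument of the proof can be illustrated in one dimension, where both sides of the duality have closed forms. The sketch below is an assumption-laden toy example (discrete measures on a sorted 1-D grid, hypothetical function names), not part of the original derivation: it builds a 1-Lipschitz potential, reads it as a reward, and checks that the dual value matches the primal 1-Wasserstein distance.

```python
import numpy as np

def w1_primal(x, p_expert, p_policy):
    """1-Wasserstein distance on a sorted 1-D grid via the CDF formula."""
    gap = np.cumsum(p_expert - p_policy)[:-1]
    return float(np.sum(np.abs(gap) * np.diff(x)))

def kantorovich_potential(x, p_expert, p_policy):
    """A 1-Lipschitz potential r that attains the dual optimum in 1-D."""
    gap = np.cumsum(p_expert - p_policy)[:-1]  # CDF_E - CDF_pi between grid points
    slopes = -np.sign(gap)                     # slope of r on each interval, |slope| <= 1
    return np.concatenate([[0.0], np.cumsum(slopes * np.diff(x))])

x = np.array([0.0, 0.5, 2.0, 3.0])
p_expert = np.array([0.5, 0.3, 0.2, 0.0])
p_policy = np.array([0.0, 0.1, 0.4, 0.5])

r = kantorovich_potential(x, p_expert, p_policy)
primal_value = w1_primal(x, p_expert, p_policy)
# Dual objective with a single potential: E_expert[r] - E_policy[r].
dual_value = float(r @ (p_expert - p_policy))
max_slope = float(np.max(np.abs(np.diff(r) / np.diff(x))))
```

Here the potential decreases as one moves away from the region where the expert places its mass, so it behaves like a reward that is high on expert-like points.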
There are many ways of measuring the distance between two probability measures, such as total variation, KL-divergence, $\chi^2$ distance and so on. None of them reflects the underlying geometric structure of the sample space, and the distance is therefore sometimes ill-defined when the measures do not actually overlap improved_GAN . On the other hand, the Wasserstein distance, which originates from optimal transport, is a canonical way of lifting geometry to define a proper distance metric between two distributions. Moreover, the Kantorovich duality yields the dual form of OT as in Eq.(6), which allows the application of stochastic gradient methods and makes it suitable for large-scale problems. In the special case of Euclidean spaces, by choosing $d(x, y) = \|x - y\|_2$, we can interpret the Wasserstein distance in Eq.(5) as follows,

    $W_1^d(\rho_\pi, \rho_E) = \sup_{\|\nabla r\| \le 1} \; \mathbb{E}_{\pi_E}[r(s, a)] - \mathbb{E}_\pi[r(s, a)]$    (7)

The constraint on the gradient of the reward function implies that the gradient norm at any point is upper-bounded by $1$: $\|\nabla r(x)\| \le 1$. This simple form suggests several ways of computing the Wasserstein distance by enforcing the Lipschitz condition, such as weight clipping wgan and gradient penalty improved_GAN .
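The two ways of enforcing the Lipschitz condition mentioned above can be sketched for a small reward network. This is a hedged illustration rather than the paper's implementation: the one-hidden-layer tanh network, the clipping threshold `c`, and the interpolation scheme for the penalty are assumptions in the spirit of the cited weight clipping and gradient penalty methods.

```python
import numpy as np

def reward(params, x):
    """r(x) = v . tanh(W x + b) for a batch x of shape (batch, dim)."""
    W, b, v = params
    return np.tanh(x @ W.T + b) @ v

def reward_input_gradient(params, x):
    """Analytic gradient of r with respect to its input, shape (batch, dim)."""
    W, b, v = params
    h = np.tanh(x @ W.T + b)
    return ((1.0 - h ** 2) * v) @ W

def clip_weights(params, c=0.01):
    """Weight clipping: keep every parameter inside [-c, c]."""
    return tuple(np.clip(p, -c, c) for p in params)

def gradient_penalty(params, x_expert, x_policy, rng):
    """Penalty on gradient norms deviating from 1 at random interpolates."""
    eps = rng.random((x_expert.shape[0], 1))
    x_hat = eps * x_expert + (1.0 - eps) * x_policy
    grad_norm = np.linalg.norm(reward_input_gradient(params, x_hat), axis=1)
    return float(np.mean((grad_norm - 1.0) ** 2))

rng = np.random.default_rng(0)
dim, hidden = 3, 8
params = (rng.normal(size=(hidden, dim)), rng.normal(size=hidden), rng.normal(size=hidden))
x_expert = rng.normal(size=(16, dim))
x_policy = rng.normal(loc=1.0, size=(16, dim))

penalty = gradient_penalty(params, x_expert, x_policy, rng)
clipped = clip_weights(params, c=0.01)
```

Clipping bounds the Lipschitz constant only indirectly through the weights, while the penalty acts on the input gradient itself and would be added to the reward loss with a weighting coefficient.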
Relation to generative adversarial imitation learning. In the canonical form of OT, we consider a ground cost $c$ (not necessarily a distance), and the Kantorovich potentials can be chosen differently as seen in Eq.(6), as long as the potentials ($f$, $g$) are Lipschitz regular: $f(x) + g(y) \le c(x, y)$. Given a discriminator $D_w$ parameterized by $w$, and letting $f = \log(1 - D_w)$ and $g = \log D_w$, the OT problem becomes:

    $\sup_{w} \; \mathbb{E}_{x \sim \rho_E}[\log(1 - D_w(x))] + \mathbb{E}_{y \sim \rho_\pi}[\log D_w(y)]$    (8)
    s.t. $\log(1 - D_w(x)) + \log D_w(y) \le c(x, y)$

It is then obvious that, if the ground cost is a nonnegative constant, the constraint always holds, since both logarithms are nonpositive. Therefore Eq.(8) recovers the objective of generative adversarial imitation learning (GAIL) GAIL . Note that choosing a second Kantorovich potential $g$ relaxes Proposition 3.1 to arbitrary functions $f$ and $g$; however, the problem then no longer resembles the apprenticeship learning problem in Eq.(4). This implies that if $g \ne -f$, the Kantorovich potentials might not resemble a valid reward function, and in Sec. 2 we emphasized the same conclusion for GAIL as well.
From Wasserstein distance to IRL. Moreover, we remark that the cost function regularizer defined in Eq.(2) can be chosen such that it induces a Wasserstein distance between the occupancy measures $\rho_\pi$ and $\rho_E$. A natural question is what the cost function regularizer in the context of GAIL looks like for the Wasserstein distance, and which type of IRL problem it actually solves. Note that we use a cost function instead of a reward here in order to make the result comparable to GAIL.
Proposition 3.2.
If the cost regularizer for generative adversarial imitation learning is chosen as
then the method coincides with our approach, i.e. . In particular the inverse reinforcement learning part becomes
On the one hand, the first term of the IRL objective estimates an occupancy measure $\rho$ that minimizes the Wasserstein distance to the expert occupancy measure $\rho_E$, while being regularized by its induced expected cost $\mathbb{E}_\rho[c]$. On the other hand, the second term finds a policy $\pi$ that minimizes the induced expected cost while maximizing the causal entropy of $\pi$. Therefore, the convex conjugate of the cost regularizer reduces to the Wasserstein distance, and the IRL problem couples the Wasserstein distance minimization with the entropy maximization. More details of the proof can be found in the supplementary material Sec. A.
Starting from apprenticeship learning, we extend the choice of reward functions by leveraging IPMs. In particular, we formulate apprenticeship learning in terms of the 1-Wasserstein distance. Moreover, the Kantorovich potentials in OT can represent a more general set of functions, namely Lipschitz regular ones. We show that this corresponds to a certain type of IRL problem. It thus enables a wide range of applications for imitation learning by choosing from a large set of reward functions. Furthermore, similar to GAIL, we can summarise an efficient iterative procedure: the objective in Eq.(6) updates the dual potentials to improve the reward estimate, and the policy is then improved under the guidance of this reward so as to generate expert-like state-action pairs.
4 Wasserstein Adversarial Imitation Learning
Following Proposition 3.1, we propose a practical imitation learning algorithm based on the 1-Wasserstein distance, where we use a single Kantorovich potential to represent a valid reward function. Although optimal transport theory is well established, it is known to be computationally difficult. Recent advances in computational optimal transport enable large-scale applications using stochastic gradient methods and parallel computing on modern GPU platforms NIPS2013_4927 ; pmlrv84blondel18a . Moreover, OT solvers have also been extended to semi-discrete and continuous-continuous cases with arbitrary parameterizations of the dual variables and the ground cost DBLP:conf/iclr/SeguyDFCRB18 ; Genevay:2016:SOL:3157382.3157482 . For a more complete treatment of computational OT, we refer readers to the short book computationalOT .
Regularized optimal transport. For many problems, it is infeasible to solve the OT problem directly because it is not easy to enforce the Lipschitz continuity of the rewards. Therefore, we resort to the entropic regularization of OT and cast the problem as a single convex optimization problem. Regularized OT has been studied thoroughly in recent years and shows nice properties such as strong convexity as well as a smooth approximation of the original OT problem wilson1968use ; pmlrv84blondel18a . Let the ground cost $d$ be a distance metric defined on the state-action space. We compute the regularized dual form of OT as follows, where for simplicity we denote the 1-Wasserstein distance by $W$.
(9) 
where the regularization term penalizes the reward function in such a way that it decreases the objective if $r$ is not a Lipschitz(1) function. This decrease is larger for small values of the regularization parameter $\epsilon$, and we obtain the solution of the original OT problem in the limit $\epsilon \to 0$ (see DBLP:journals/moc/ChizatPSV18 for the proof and further details).
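A minibatch estimate of a regularized dual objective can be sketched as follows. The exact regularizer is not reproduced here; the code assumes the quadratic (L2) penalty $\frac{1}{4\epsilon}\,(r(x) - r(y) - d(x, y))_+^2$ with a single potential and a Euclidean ground cost, which is one commonly used form in large-scale OT, and all function names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(x, y):
    """Ground cost d(x_i, y_j) for all expert/policy pairs."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

def regularized_dual_objective(r_expert, r_policy, cost, eps):
    """E_E[r] - E_pi[r] - 1/(4 eps) * E[(r(x) - r(y) - d(x, y))_+^2]."""
    violation = r_expert[:, None] - r_policy[None, :] - cost
    penalty = np.mean(np.maximum(violation, 0.0) ** 2) / (4.0 * eps)
    return float(r_expert.mean() - r_policy.mean() - penalty)

def lipschitz_reward(x):
    # 1-Lipschitz by the triangle inequality, so it never violates the constraint.
    return -np.linalg.norm(x, axis=1)

def steep_reward(x):
    # Lipschitz constant 5, so the constraint is violated on many pairs.
    return 5.0 * lipschitz_reward(x)

rng = np.random.default_rng(0)
x_expert = rng.normal(0.0, 1.0, size=(64, 2))
x_policy = rng.normal(3.0, 1.0, size=(64, 2))
cost = pairwise_euclidean(x_expert, x_policy)

obj_lipschitz = regularized_dual_objective(
    lipschitz_reward(x_expert), lipschitz_reward(x_policy), cost, eps=0.1)
obj_steep_small_eps = regularized_dual_objective(
    steep_reward(x_expert), steep_reward(x_policy), cost, eps=0.1)
obj_steep_large_eps = regularized_dual_objective(
    steep_reward(x_expert), steep_reward(x_policy), cost, eps=1.0)
```

In this sketch a 1-Lipschitz reward incurs no penalty, while a steeper reward is penalized more strongly as the regularization parameter shrinks, which matches the behavior described for the regularizer.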
Note that the set of expert demonstrations, i.e. the samples from $\rho_E$, is typically finite, while arbitrarily many samples can be drawn from the policy. Without loss of generality, we consider both as generic continuous density measures and parameterize the reward $r_w$ by a deep neural network with parameters $w$. For continuous-continuous OT optimization, we take a stochastic ascent step on a minibatch of samples from both expert and policy.

Policy gradient. Following the reward update, we take a policy gradient step by maximizing the expected reward while regularizing the causal entropy of the policy $\pi_\theta$ with a factor $\lambda$,

    $\max_{\theta} \; \mathbb{E}_{\pi_\theta}[r_w(s, a)] + \lambda H(\pi_\theta)$    (10)

Note that the estimated reward from the previous step is treated as a fixed reward. Similar to GAIL , for a finite minibatch of state-action pairs, the gradient of the causal entropy term can be rewritten as $\nabla_\theta H(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_{\log}(s, a)]$ with $Q_{\log}(\bar{s}, \bar{a}) = \mathbb{E}_{\pi_\theta}[-\log \pi_\theta(a \mid s) \mid s_0 = \bar{s}, a_0 = \bar{a}]$, such that the empirical policy gradient becomes,

    $\hat{\mathbb{E}}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q(s, a)\big], \quad Q(\bar{s}, \bar{a}) = \hat{\mathbb{E}}\big[r_w(s, a) - \lambda \log \pi_\theta(a \mid s) \mid s_0 = \bar{s}, a_0 = \bar{a}\big]$    (11)

This is the standard policy gradient with the Kantorovich potential plus the negative log-likelihood of the policy as a fixed reward. In other words, the entropy term assigns additional reward to state-action pairs about which the policy is uncertain. However, our reward function from the OT problem remains a valid reward function after training, whereas the surrogate reward of GAIL becomes constant and is no longer useful after convergence. Finally, we employ trust region policy optimization (TRPO) pmlrv37schulman15 to update the policy by taking a KL-constrained natural gradient step.
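The policy step described above can be sketched for a tabular softmax policy, treating the learned reward plus the negative log-likelihood of the policy as a fixed per-step reward. This is a simplified illustration under stated assumptions: a single trajectory, a plain score-function estimator instead of TRPO, and hypothetical names such as `lam` for the entropy factor.

```python
import numpy as np

def entropy_augmented_returns(rewards, log_probs, lam, gamma):
    """Discounted returns of the fixed reward r_t - lam * log pi(a_t | s_t)."""
    augmented = rewards - lam * log_probs
    returns = np.zeros_like(augmented)
    running = 0.0
    for t in reversed(range(len(augmented))):
        running = augmented[t] + gamma * running
        returns[t] = running
    return returns

def softmax_policy_gradient(theta, states, actions, returns):
    """Score-function gradient for a policy pi(a | s) proportional to exp(theta[s, a])."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, returns):
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        score = -probs
        score[a] += 1.0          # d log pi(a | s) / d theta[s, :]
        grad[s] += q * score
    return grad / len(states)

theta = np.zeros((2, 2))                      # uniform policy over two actions
states = np.array([0, 1, 0])
actions = np.array([1, 0, 1])
rewards = np.array([1.0, 0.0, 2.0])           # stands in for the learned reward
log_probs = np.full(3, np.log(0.5))           # log pi(a_t | s_t) under theta

plain_returns = entropy_augmented_returns(rewards, log_probs, lam=0.0, gamma=0.5)
augmented_returns = entropy_augmented_returns(rewards, log_probs, lam=0.1, gamma=0.5)
grad = softmax_policy_gradient(theta, states, actions, augmented_returns)
```

Because the log-probabilities are negative, the entropy term raises the returns, and the resulting gradient pushes probability towards the actions that received high augmented returns.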
The algorithm is listed as Algorithm (1) and named Wasserstein Adversarial Imitation Learning (WAIL), since it follows the basic style of adversarial training. On the one hand, the reward update increases the objective in Eq.(9) and thereby improves the estimate of the Wasserstein distance. On the other hand, the policy gradient in Eq.(11) drives the policy towards the expert region to generate more expert-like state-action pairs, which in turn decreases the objective in Eq.(9). For completeness, we derive a theorem showing that Algorithm (1) converges to an optimal solution. The proof of Theorem 4.1 is detailed in supplementary material Sec. A.
Theorem 4.1.
5 Experiments
We evaluate Algorithm (1) on both classic and high-dimensional continuous control tasks whose environments are provided with true reward functions in OpenAI's gym gym16 . Similarly to GAIL , the expert demonstrations are sampled from policies that are pretrained by TRPO with generalised advantage estimation Schulmanetal_ICLR2016 , in different sizes (number of trajectories). Each expert trajectory contains about state-action pairs. The TRPO training setup and the expert performance are detailed in supplementary material Sec. B.
Baselines. To evaluate the imitation performance of Algorithm (1), we compare WAIL with two baselines, i.e. GAIL and behavioral cloning (BC). The GAIL baseline uses a discriminator over the state-action space and takes its logarithm as the cost function: $\log D(s, a)$. In GAIL , the authors show superior imitation performance over other approaches such as feature expectation matching (FEM) and game-theoretic apprenticeship learning (GTAL). Behavioral cloning, on the other hand, is a straightforward baseline to compare with, since it requires no environment interaction but only expert demonstrations. None of the methods considered in our experiments assumes the existence of the true reward function; they all estimate the optimal policy directly.
For all the methods, we parameterize the policy and value function with the same neural network architecture, which has two layers of units with tanh activations. We model a stochastic policy, which outputs the parameters of a normal distribution with a diagonal covariance.
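A policy of this kind can be sketched with plain NumPy. The sketch makes several assumptions that are not taken from the source: the hidden width (`hidden=64`), the initialization scheme, a state-independent log standard deviation, and all function names.

```python
import numpy as np

def init_policy(obs_dim, act_dim, hidden=64, seed=0):
    """Two tanh hidden layers followed by a linear layer for the mean."""
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, hidden, hidden, act_dim]
    layers = [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    return {"layers": layers, "log_std": np.zeros(act_dim)}

def policy_mean(params, obs):
    h = obs
    for W, b in params["layers"][:-1]:
        h = np.tanh(h @ W + b)
    W, b = params["layers"][-1]
    return h @ W + b

def sample_action(params, obs, rng):
    """Draw a ~ N(mean(obs), diag(exp(log_std))^2)."""
    mean = policy_mean(params, obs)
    return mean + np.exp(params["log_std"]) * rng.standard_normal(mean.shape)

def log_prob(params, obs, act):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    mean = policy_mean(params, obs)
    log_std = params["log_std"]
    z = (act - mean) / np.exp(log_std)
    return -0.5 * np.sum(z ** 2 + 2.0 * log_std + np.log(2.0 * np.pi), axis=-1)

params = init_policy(obs_dim=4, act_dim=2)
obs = np.zeros((3, 4))
actions = sample_action(params, obs, np.random.default_rng(1))
log_likelihood = log_prob(params, obs, actions)
```

Maximizing the mean of `log_prob` over expert state-action pairs would correspond to a behavioral cloning objective, and the same log-probabilities enter the entropy-regularized policy gradient.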
BC is trained by maximizing the log-likelihood of the expert demonstrations with Adam adam14 . For the discriminator in GAIL and the reward function in WAIL, we use the same network architecture as for the policy network. In WAIL, we choose the Euclidean distance in the state-action space as the ground transport cost. Moreover, we found that the $L_2$ regularization for OT is more stable and less sensitive to hyperparameters, and therefore we use it throughout our experiments. The learning rates and regularization factors for all three methods are fine-tuned to achieve optimal results. Finally, the parameters and environment interactions of the TRPO steps are the same as for the training of the expert policies.

Results. We evaluate our approach WAIL on different control tasks in terms of expert sample complexity and show the results in Fig. (3). To compare with the baselines, we compute the average environment reward by randomly sampling trajectories using the learned policy and scale it such that the expert reward is $1$ and the random policy reward is $0$. WAIL outperforms both GAIL and BC on almost all the learning tasks across varying data sizes, and in particular we observe that WAIL is extremely sample-efficient with respect to expert data. For all the tasks, only one expert trajectory is sufficient for WAIL to approach expert behavior. In the classic control tasks (Cartpole, Mountaincar and Acrobot), all three methods achieve near-expert performance even with only one demonstration. For all the high-dimensional MuJoCo environments except Reacher, BC can only imitate the expert's behavior when trained with enough demonstrations, while GAIL shows more promising results than BC in terms of expert sample sizes. WAIL, on the other hand, dominates on all the tasks for almost all the sample sizes, even when there is only one expert demonstration. Note that we observe a performance drop for the Humanoid task when we increase the data size, which might be caused by the higher variance of the expert policy. For the Reacher task, which is notably more difficult to learn, both WAIL and BC perform consistently well over varying data sizes (see the upper part of Fig. (b) for the zoomed-in view). However, GAIL only achieves the expert's performance when given a larger number of demonstrations. As suggested by GAIL , the causal entropy term improves GAIL slightly only for small sample sizes. Finally, we report the full experiment results in supplementary material Sec. C.

Reward surface. Finally, we show the reward surface of both WAIL and GAIL on the Humanoid environment, which has the highest number of state dimensions ($376$) and action dimensions ($17$) among the experiments. The reward surface is drawn on a 2D board onto which the expert state-action pairs are projected via PCA Abdi2010PCA . After training, we take the reward function $r(s, a)$ for WAIL and the negative log-discriminator $-\log D(s, a)$ for GAIL to compute the reward score for each discretized point on the 2D board, where each point is projected back to the state-action space via inverse PCA. The reward surface is then rescaled and presented in Fig. (4). The projected state-action pairs spread over the lower-left region as two wings. For WAIL, the learned reward function assigns higher reward along the directions of the wings. When the data size is small, rewards are concentrated on either wing. For larger data sizes, we see a smooth reward function with increasing rewards along both wings. For GAIL, on the other hand, the discriminator saturates for small datasets and fails to imitate the expert. With an increasing number of demonstrations, the discriminator only learns to differentiate the expert from everything else and eventually assigns a constant reward almost everywhere. Thus it cannot serve as a proper reward to update the policy afterwards. In summary, the reward surface for WAIL is much smoother than that of GAIL for all the expert data sizes.

Although for a large portion of the 2D board the back-projected state-action pairs might fall outside the expert support, the reward scores computed via our approach are still well defined thanks to the geometric property of the Wasserstein distance.
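The reward-surface procedure can be sketched in a few lines: fit a 2D PCA on the expert state-action pairs, lay a grid over the projected plane, map each grid point back with the inverse PCA, and score it with the reward function. The grid resolution, the margin, the rescaling to [0, 1], and the stand-in reward function are assumptions made for this illustration.

```python
import numpy as np

def fit_pca_2d(x):
    """Return the data mean and the first two principal directions."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:2]

def reward_surface(reward_fn, x_expert, resolution=50, margin=0.5):
    mean, components = fit_pca_2d(x_expert)
    projected = (x_expert - mean) @ components.T          # expert points on the 2D board
    lo = projected.min(axis=0) - margin
    hi = projected.max(axis=0) + margin
    g0, g1 = np.meshgrid(np.linspace(lo[0], hi[0], resolution),
                         np.linspace(lo[1], hi[1], resolution))
    grid = np.stack([g0.ravel(), g1.ravel()], axis=1)
    x_grid = grid @ components + mean                     # inverse PCA
    scores = reward_fn(x_grid).reshape(resolution, resolution)
    span = scores.max() - scores.min()
    surface = (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
    return surface, projected

def toy_reward(x):
    # Stand-in for a learned reward: higher near the origin of state-action space.
    return -np.linalg.norm(x, axis=1)

rng = np.random.default_rng(0)
x_expert = rng.normal(size=(100, 5))                      # hypothetical state-action pairs
surface, projected = reward_surface(toy_reward, x_expert, resolution=20)
```

Grid points far from the projected expert data correspond to state-action vectors outside the expert support, which is where the difference between a smooth potential and a saturated discriminator becomes visible.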
6 Conclusion
In this work, a natural connection between apprenticeship learning, optimal transport and IRL is established and justified theoretically. Based on this observation, we present a novel imitation learning approach that builds on the Wasserstein distance and enables the choice of smooth reward functions in the state-action space. Our approach is model-free and outputs the dual function as an intermediate result that can serve as a proper reward function. In several robotic control tasks, we demonstrate that our approach outperforms the other baselines by a large margin; in particular, it achieves expert behavior even with an extremely small number of expert demonstrations. This property might stem from the optimization of the OT problem and the smooth reward function. We leave the analysis of the sample complexity to future work.
Moreover, we remark that our approach WAIL offers three advantages. First, the ground cost can be equipped with a properly defined prior that encodes domain knowledge about the state-action space, or can even be estimated simultaneously. Second, the reward function is smooth and well defined. When it is parameterized by a neural network, it can represent a much more expressive family of reward functions. Third, optimal transport intrinsically learns a transportation map in the state-action space. This implies that the reward function can be transferred to different learning tasks as long as we can estimate such a transport map. We leave the justification of these remarks to future work.
References
 (1) P. Abbeel, D. Dolgov, A. Y. Ng, and S. Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. pages 1083–1090, Sept 2008.
 (2) P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 1–, New York, NY, USA, 2004. ACM.
 (3) H. Abdi and L. J. Williams. Principal component analysis. WIREs Comput. Stat., 2(4):433–459, July 2010.
 (4) B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robot. Auton. Syst., 57(5):469–483, May 2009.
 (5) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 (6) M. Bain and C. Sammut. A framework for behavioural cloning. In Machine Intelligence 15, Intelligent Agents [St. Catherine’s College, Oxford, July 1995], pages 103–129, Oxford, UK, 1999. Oxford University.
 (7) M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 880–889, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
 (8) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016. arXiv:1606.01540.
 (9) L. Chizat, G. Peyré, B. Schmitzer, and F. Vialard. Scaling algorithms for unbalanced optimal transport problems. Math. Comput., 87(314):2563–2609, 2018.
 (10) M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
 (11) C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 49–58, 2016.
 (12) J. Fu, K. Luo, and S. Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018.
 (13) A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 3440–3448, USA, 2016. Curran Associates Inc.
 (14) I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
 (15) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
 (16) J. Ho and S. Ermon. Generative adversarial imitation learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4565–4573. Curran Associates, Inc., 2016.
 (17) J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 2760–2769. JMLR.org, 2016.
 (18) D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
 (19) M. Kuderer, H. Kretzschmar, and W. Burgard. Teaching mobile robots to cooperatively navigate in populated environments. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, 2013.
 (20) S. Levine, Z. Popovic, and V. Koltun. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 1342–1350, 2010.
 (21) S. Levine, Z. Popovic, and V. Koltun. Nonlinear inverse reinforcement learning with gaussian processes. In J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 19–27. Curran Associates, Inc., 2011.
 (22) A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 663–670, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
 (23) G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5–6):355–602, 2019.
 (24) S. Ross and D. Bagnell. Efficient reductions for imitation learning. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.

 (25) S. Russell. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 101–103. ACM, 1998.
 (26) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France, 07–09 Jul 2015. PMLR.
 (27) J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 (28) V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. In ICLR. OpenReview.net, 2018.
 (29) B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, G. R. Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

 (30) U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1032–1039, New York, NY, USA, 2008. ACM.
 (31) U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, NIPS, pages 1449–1456. Curran Associates, Inc., 2007.
 (32) L. Tai, J. Zhang, M. Liu, and W. Burgard. Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1111–1117. IEEE, 2018.
 (33) Y. Wang, L. Song, and H. Zha. Learning to optimize via Wasserstein deep inverse optimal control. CoRR, abs/1805.08395, 2018.
 (34) A. Wilson. The Use of Entropy Maximising Models in the Theory of Trip Distribution, Mode Split and Route Split. Working papers // Centre for Environmental Studies. Centre for Environmental Studies, 1968.
 (35) Z. Zhou, M. Bloem, and N. Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Transactions on Automatic Control, 63(9):2787–2802, Sep. 2018.
 (36) B. D. Ziebart, J. A. Bagnell, and A. K. Dey. Modeling interaction via the principle of maximum causal entropy. In Proc. of the International Conference on Machine Learning, pages 1255–1262, 2010.
 (37) B. D. Ziebart, A. Maas, J. A. D. Bagnell, and A. Dey. Maximum entropy inverse reinforcement learning. In Proceeding of AAAI 2008, July 2008.
Appendix A Proofs
A.1 Proof of Proposition 3.2
Let be the corresponding cost function for the reward . Let . If we consider the function defined by then we can see that it is convex and lower semicontinuous as a pointwise supremum of linear functions. Hence . Moreover we have that
and hence . The second claim follows by plugging in into the Irl formula from [16].
∎
A.2 Proof of Theorem 4.1
Let us denote the function on the right hand side of Eq.(9), by , i.e.,
where and parameterize the reward function and the policy. First, we show that for any given , in which are obtained from the policy gradient step in Algorithm (1) converges. To do so, it suffices to show that is Cauchy.
where denotes the total variation distance. In the above derivation, we also use the fact that for any around the . The second inequality is due to the definition of total variation. The last inequality follows from Pinsker’s inequality and the fact that, by the policy gradient update, the KL-divergence between two consecutive policies is bounded by s. The above inequality and the assumption in Theorem 4.1 imply that is Cauchy and hence converges. Furthermore, since the convergence is independent of , it is uniform.
On the other hand, since for fixed parameter , is a continuous and concave function of , there exists such that , when converges uniformly to . The concavity implies that also converges to . Moreover, assuming is uniformly continuous with respect to , we obtain that . These results imply that obtained from the gradient ascent step in Algorithm (1) converges to , which concludes the proof.
∎
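As a numerical illustration of the step above that invokes Pinsker’s inequality, the following minimal sketch checks that the total variation distance between two discrete distributions is bounded by the square root of half their KL-divergence. The two distributions `pi_old` and `pi_new` are hypothetical action distributions of two consecutive policies in a single state; they are not taken from the paper.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pinsker_bound(p, q):
    """Upper bound on the total variation given by Pinsker's inequality."""
    return math.sqrt(0.5 * kl_divergence(p, q))


# Hypothetical action distributions of two consecutive policies (assumption).
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.45, 0.35, 0.2]

tv = total_variation(pi_old, pi_new)
bound = pinsker_bound(pi_old, pi_new)
print(f"TV = {tv:.4f} <= sqrt(KL / 2) = {bound:.4f}: {tv <= bound}")
```

A small KL-divergence between consecutive policies therefore forces a small total variation distance, which is the property the Cauchy argument relies on.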
Appendix B Environments and expert policies
We run experiments on demonstrations from expert policies trained by TRPO with generalized advantage estimation on 9 different control tasks. The environment details and the performance of both expert and random policies are listed in Table 1. Expert performance is evaluated by sampling trajectories from the pretrained expert policy, while random-policy performance is computed from trajectories sampled from a randomly initialized policy.
Env. name  State dimension  Action dimension  Random policy performance  Expert performance 

CartPole-v1  
Acrobot-v1  
MountainCar-v0  
Hopper-v2  
Walker2d-v2  
HalfCheetah-v2  
Ant-v2  
Humanoid-v2  
Reacher-v2 
Env. name  Batch size  Iterations  max. episode steps  max. KL  Damping  

CartPole-v1  
Acrobot-v1  
MountainCar-v0  
Hopper-v2  
Walker2d-v2  
HalfCheetah-v2  
Ant-v2  
Humanoid-v2  
Reacher-v2 
In Table 2, we list the TRPO training parameters for the expert policies, where the parameters γ and λ are the two discount factors that trade off bias and variance and have the same meaning as in [27]. For the classic control tasks, i.e. CartPole, MountainCar and Acrobot, we chose the maximal episode steps as , while the other tasks are rolled out for longer trajectories with time steps.
Appendix C Experiment details and further results
We first describe the training parameters for our approach, Wail. In Table 3, the ground cost and the type of regularization are fixed throughout all experiments. The regularization factor and the learning rate are fine-tuned to achieve the best performance for each task. We train Wail with the same number of environment interactions as used to train the expert TRPO policies; however, the Ant task requires more iterations to converge.
Env. name  Iterations  Reg. value  Learning rate  Ground cost  Regularization 

CartPole-v1  Euclidean  
Acrobot-v1  Euclidean  
MountainCar-v0  Euclidean  
Hopper-v2  Euclidean  
Walker2d-v2  Euclidean  
HalfCheetah-v2  Euclidean  
Ant-v2  Euclidean  
Humanoid-v2  Euclidean  
Reacher-v2  Euclidean 
We complement the experimental results of the paper with Table 4 and Table 5, which list all results on the control tasks for varying sizes of the expert demonstration set. The performance is computed by averaging the rewards of randomly sampled trajectories generated by the policy after training. For the Reacher task, we also report results for Gail with an additional causal entropy term. None of the other tasks requires the causal entropy term.
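The evaluation protocol described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors’ evaluation code: `ToyEnv`, `rollout_return` and `average_return` are hypothetical names, the environment is a stand-in for a real control task, and we assume that the reward of a trajectory means its undiscounted cumulative reward.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a control task with a fixed horizon."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return float(self.t)

    def step(self, action):
        # Reward 1 for action 1, reward 0 otherwise (toy dynamics).
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return float(self.t), reward, done


def rollout_return(env, policy, max_steps):
    """Undiscounted cumulative reward of one sampled trajectory."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total


def average_return(env, policy, num_trajectories, max_steps):
    """Average trajectory reward over independently sampled rollouts."""
    returns = [rollout_return(env, policy, max_steps)
               for _ in range(num_trajectories)]
    return sum(returns) / len(returns)


def trained_policy(state):
    return 1  # toy "trained" policy: always picks the rewarded action


def random_policy(state):
    return random.choice([0, 1])


print(average_return(ToyEnv(), trained_policy, num_trajectories=10, max_steps=100))
print(average_return(ToyEnv(), random_policy, num_trajectories=10, max_steps=100))
```

The same routine applied to a randomly initialized policy gives a random-policy baseline against which the imitation policies can be compared.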
Tasks  Dataset Sizes  Bc  Gail  Wail 

CartPole-v1  1  
4  
7  
10  
MountainCar-v0  1  
4  
7  
10  
Acrobot-v1  1  
4  
7  
10  
Hopper-v2  1  
4  
10  
25  
50  
Walker2d-v2  1  
4  
10  
25  
50  
HalfCheetah-v2  1  
4  
10  
25  
50  
Ant-v2  1  
4  
10  
25  
50  
Humanoid-v2  1  
4  
10  
25  
50  
80  
100 
Dataset Sizes  Bc  Gail  Wail  Gail, 

1  
4  
10  
25  
50  
80  
100 
Moreover, we include the training curves of Wail on all tasks for different dataset sizes. In Fig. 5, the reward is computed over all trajectories sampled at each training iteration; the same trajectories are also used for the TRPO step.
Appendix D Reward surface
To verify our claim that the reward function learned by Wail is smoother and can be used as a real reward, we show the reward surfaces of both Wail and Gail on a 2D plane. For all tasks, we take the learned reward function in Wail, or the negative cost function in Gail, to compute the reward score of each point on the 2D surface, onto which we project a subset of the expert samples (state-actions) using PCA. Note that each point on the 2D surface is transformed back to the state-action space and then evaluated for its reward. In Fig. 2, we show the reward surfaces for all tasks with varying expert dataset sizes.
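The projection procedure described above can be sketched as follows. This is a minimal illustration under assumptions rather than the authors’ plotting code: PCA is computed with a plain SVD in NumPy, the function names are hypothetical, `expert_samples` is synthetic data, and `toy_reward` is a placeholder for the learned reward function (or negative cost function).

```python
import numpy as np


def fit_pca_2d(samples):
    """Return the sample mean and the first two principal directions."""
    mean = samples.mean(axis=0)
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:2]  # components have shape (2, d)


def project(samples, mean, components):
    """Map state-action samples to 2D PCA coordinates."""
    return (samples - mean) @ components.T


def reconstruct(points_2d, mean, components):
    """Map 2D PCA coordinates back to the state-action space."""
    return points_2d @ components + mean


def reward_surface(reward_fn, samples, resolution=50):
    """Evaluate reward_fn on a 2D grid spanning the projected samples."""
    mean, components = fit_pca_2d(samples)
    coords = project(samples, mean, components)
    xs = np.linspace(coords[:, 0].min(), coords[:, 0].max(), resolution)
    ys = np.linspace(coords[:, 1].min(), coords[:, 1].max(), resolution)
    grid_x, grid_y = np.meshgrid(xs, ys)
    grid = np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)
    # Each grid point is transformed back to state-action space, then scored.
    state_actions = reconstruct(grid, mean, components)
    rewards = reward_fn(state_actions).reshape(resolution, resolution)
    return grid_x, grid_y, rewards


# Synthetic stand-ins (assumptions): 200 expert state-action pairs of
# dimension 6 and a placeholder reward in place of the learned one.
rng = np.random.default_rng(0)
expert_samples = rng.normal(size=(200, 6))


def toy_reward(state_actions):
    return -np.linalg.norm(state_actions, axis=1)


grid_x, grid_y, rewards = reward_surface(toy_reward, expert_samples, resolution=20)
print(rewards.shape)
```

The returned arrays can be passed directly to a surface or contour plotting routine. Because only two principal directions are kept, the back-transformed points lie on a 2D affine subspace of the state-action space, so the surface shows a slice of the reward function rather than the whole function.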