1 Introduction
Reinforcement learning (RL) has emerged as a promising tool for solving complex decision-making and control tasks from predefined high-level reward functions (Silver et al., 2016; Qureshi et al., 2017). However, defining an optimizable reward function that inculcates the desired behavior can be challenging for many robotic applications, including learning social-interaction skills (Qureshi et al., 2018), dexterous manipulation (Finn et al., 2016b), autonomous driving (Kuderer et al., 2015), and robotic surgery (Yip & Das, 2017).
Inverse reinforcement learning (IRL) (Ng et al., 2000)
addresses the problem of learning reward functions from expert demonstrations, and it is often considered a branch of imitation learning
(Argall et al., 2009). Prior work in IRL includes maximum-margin (Abbeel & Ng, 2004; Ratliff et al., 2006) and maximum-entropy (Ziebart et al., 2008) formulations. Currently, maximum entropy (MaxEnt) IRL is a widely used approach to IRL, and it has been extended to nonlinear function approximators such as neural networks in scenarios with unknown dynamics by leveraging sampling-based techniques
(Boularias et al., 2011; Finn et al., 2016b; Kalakrishnan et al., 2013). However, designing an IRL algorithm is usually complicated, as it requires some degree of hand-engineering, such as choosing domain-specific regularizers (Finn et al., 2016b). Rather than learning reward functions and solving the IRL problem, imitation learning (IL) methods learn a policy directly from expert demonstrations. Prior work addressed the IL problem through behavior cloning (BC), which learns a policy from expert trajectories using supervised learning
(Pomerleau, 1991). Although BC methods are simple solutions to IL, they require a large amount of data because of compounding errors induced by covariate shift (Ross et al., 2011). To overcome these limitations, the generative adversarial imitation learning (GAIL) algorithm (Ho & Ermon, 2016) was proposed. GAIL uses the Generative Adversarial Network (GAN) formulation (Goodfellow et al., 2014), i.e., a generator-discriminator framework, in which the generator learns to generate expert-like trajectories and the discriminator learns to distinguish between generated and expert trajectories. Although GAIL is a highly effective and efficient framework, it does not recover transferable/portable reward functions along with the policies. Reward-function learning is ultimately preferable, if possible, over direct imitation learning, as rewards are portable functions that represent the most basic and complete representation of agent intention, and they can be re-optimized for new environments and new agents. Reward learning is challenging, as there can be many optimal policies explaining a set of demonstrations and many reward functions inducing an optimal policy (Ng et al., 1999). Recently, the adversarial inverse reinforcement learning (AIRL) framework (Fu et al., 2017), an extension of GAIL, was proposed; it offers a solution to the former issue by exploiting the maximum entropy IRL method (Ziebart et al., 2008), whereas the latter issue is addressed through learning disentangled reward functions, i.e., making the reward a function of state only instead of both state and action. The disentangled reward prevents action-driven reward shaping (Fu et al., 2017) and makes it possible to recover transferable reward functions, but it has two main disadvantages. First, AIRL fails to recover the ground-truth reward when that reward is a function of both state and action.
For example, the reward function in locomotion or ambulation tasks contains a penalty term that discourages actions with large magnitudes. This need for action regularization is well known in the optimal control literature and limits the use of state-only reward functions in most practical real-life applications. Second, reward shaping plays a vital role in quickly recovering invariant policies (Ng et al., 1999); thus, for AIRL, it is usually not possible to simultaneously recover optimal/near-optimal policies when learning disentangled rewards.
In this paper, we propose the empowerment-based adversarial inverse reinforcement learning (EAIRL) algorithm (supplementary material is available at sites.google.com/view/eairl). Empowerment (Salge et al., 2014) is a mutual-information-based measure that, like state- or action-value functions, assigns a value to a given state, quantifying the extent to which an agent can influence its environment. Our method uses variational information maximization (Mohamed & Rezende, 2015) to learn empowerment in parallel with learning the reward and policy from expert data. The empowerment acts as a potential function for shaping rewards. Our experimentation shows that the proposed method recovers not only near-optimal policies but also robust, near-optimal, transferable, non-disentangled (state-action) reward functions. The results on reward learning show that EAIRL outperforms several state-of-the-art methods in recovering ground-truth reward functions. On policy learning, the results demonstrate that policies learned through EAIRL perform comparably to GAIL and to AIRL with a non-disentangled (state-action) reward function, but significantly outperform policies learned through AIRL with a disentangled reward and through the GAN interpretation of Guided Cost Learning (GAN-GCL) (Finn et al., 2016a).
2 Background
We consider a Markov decision process (MDP) represented as a tuple (S, A, P, r, ρ_0, γ), where S denotes the state space, A denotes the action space, P represents the transition probability distribution, i.e., P(s_{t+1}|s_t, a_t), r(s, a) corresponds to the reward function, ρ_0 is the initial state distribution, and γ ∈ [0, 1) is the discount factor. Let q_φ(a|s, s') be an inverse model that maps the current state s and the next state s' to a distribution over actions. Let π_θ(a|s) be a stochastic policy that takes a state and outputs a distribution over actions. Let τ and τ_E denote sets of trajectories, i.e., sequences of state-action pairs (s_0, a_0), ..., (s_T, a_T), generated by a policy π and an expert policy π_E, respectively, where T denotes the terminal time. Finally, let Φ(s) be a potential function that quantifies the utility of a given state s. In our proposed work, we use an empowerment-based potential function for reward shaping to adversarially learn both the reward function and the policy. Therefore, the following sections provide a brief background on potential-based reward-shaping functions and their benefits to imitation learning, on adversarial reward and policy learning, and on the variational information-maximization approach to learning empowerment.
2.1 Shaping Rewards
In this section, we briefly describe a formal framework of reward shaping and its importance to policy and reward learning (for details, see (Ng et al., 1999)). We consider a general form of reward function r(s, a, s'), i.e., the reward is a function of the current state s, the action a, and the next state s'. Let F(s, a, s') be a reward-shaping function and r̄ be the transformed reward function, r̄(s, a, s') = r(s, a, s') + F(s, a, s'). Ng et al. (1999) proved that the optimal behavior of a policy remains unchanged if the reward undergoes a transformation through a shaping function of the form F(s, a, s') = γΦ(s') − Φ(s), i.e.,
Theorem 1 (see (Ng et al., 1999)). We say F is a potential-based shaping function if there exists a real-valued function Φ such that F(s, a, s') = γΦ(s') − Φ(s). Then F being a potential-based shaping function is a necessary and sufficient condition to guarantee that an optimal policy learned in the MDP (S, A, P, r + F, ρ_0, γ) is also optimal in the MDP (S, A, P, r, ρ_0, γ), i.e., the policy is invariant to such reward transformations.
Reward shaping plays a vital role in learning both rewards and policies from expert demonstrations (Ng et al., 1999). In the former case, reward shaping determines the extent to which the true reward function can be recovered, whereas in the latter case, reward shaping speeds up the learning process by supplementing the actual reward function to guide learning. Despite the several advantages of shaping rewards, a potential-based shaping function, which is a necessary and sufficient condition for preserving policy behavior (Ng et al., 1999), is usually not available. There exist several methods (Asmuth et al., 2008; Grzes & Kudenko, 2009) to learn potential-based reward-shaping functions, but they assume the availability of the transition model and have only been demonstrated on small-scale maze-solving problems. In this paper, we show that we can learn a potential-based reward-shaping function without a transition model, and one that scales to higher-dimensional problems, by modeling the potential function as empowerment (Salge et al., 2014), which we learn efficiently online through variational information maximization (Mohamed & Rezende, 2015).
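As a concrete check of Theorem 1, the following sketch (a toy example of our own, not from the paper) runs value iteration on a small chain MDP with and without a potential-based shaping term F(s, a, s') = γΦ(s') − Φ(s), and confirms that the greedy policy is unchanged:

```python
import numpy as np

# Toy 5-state chain MDP (illustrative only): action 0 steps left, action 1
# steps right, and taking action 1 in state 3 (i.e., reaching state 4) pays 1.
n_states, n_actions, gamma = 5, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # deterministic transitions
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0
R = np.zeros((n_states, n_actions))
R[3, 1] = 1.0

# Potential-based shaping F(s, a, s') = gamma*phi(s') - phi(s) for an
# arbitrary potential phi; the expectation over s' gives a (state, action) table.
phi = np.array([0.0, 2.0, -1.0, 0.5, 3.0])
F = gamma * P.dot(phi) - phi[:, None]

def greedy_policy(reward):
    """Run value iteration, then return the greedy policy w.r.t. the Q-values."""
    V = np.zeros(n_states)
    for _ in range(500):
        Q = reward + gamma * P.dot(V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Theorem 1: shaping with F leaves the optimal policy unchanged.
assert np.array_equal(greedy_policy(R), greedy_policy(R + F))
```

Note that F changes every intermediate value estimate but not the argmax of the Q-values, which is exactly the invariance the theorem guarantees.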
2.2 Adversarial Inverse Reinforcement Learning
This section briefly describes the Adversarial Inverse Reinforcement Learning (AIRL) (Fu et al., 2017) algorithm, which forms a baseline for our proposed method. AIRL is a state-of-the-art IRL method that builds on GAIL (Ho & Ermon, 2016), the maximum entropy IRL framework (Ziebart et al., 2008), and GAN-GCL, a GAN interpretation of Guided Cost Learning (Finn et al., 2016b, a).
GAIL is a model-free adversarial learning framework, inspired by GANs (Goodfellow et al., 2014), in which a policy π learns to imitate the expert policy behavior by minimizing the Jensen-Shannon divergence between the state-action distribution generated by π and the expert state-action distribution of π_E through the following objective:

min_π max_{D∈(0,1)} E_π[log D(s, a)] + E_{π_E}[log(1 − D(s, a))] − λH(π)   (1)

where D is the discriminator that performs binary classification to distinguish between samples generated by π and π_E, λ is a hyperparameter, and H(π) is an entropy regularization term E_π[−log π(a|s)]. Note that GAIL does not recover a reward; however, Finn et al. (2016a) show that the discriminator can be modeled as a reward function. AIRL (Fu et al., 2017) presents a formal implementation of (Finn et al., 2016a) and extends GAIL to recover a reward along with the policy by imposing the following structure on the discriminator:
D_{θ,φ}(s, a, s') = exp(f_{θ,φ}(s, a, s')) / ( exp(f_{θ,φ}(s, a, s')) + π(a|s) )   (2)
where f_{θ,φ}(s, a, s') = r_θ(s) + γh_φ(s') − h_φ(s) comprises a disentangled reward term r_θ(s) with training parameters θ and a shaping term h_φ(s) with training parameters φ. The entire f_{θ,φ}(s, a, s') is trained as a binary classifier to distinguish between expert demonstrations τ_E and policy-generated demonstrations τ. The policy is trained to maximize the discriminative reward r̂(s, a, s') = log D(s, a, s') − log(1 − D(s, a, s')). Note that the function h_φ consists of free parameters, as no structure is imposed on it, and, as mentioned in (Fu et al., 2017), the reward function r_θ and the shaping function h_φ are tied to each other up to a constant; thus the impact of h_φ, the shaping term, on the recovered reward is quite limited, and the benefits of reward shaping are barely utilized.
2.3 Empowerment as Maximal Mutual Information
Mutual information (MI), an information-theoretic measure, quantifies the dependency between two random variables.
In intrinsically motivated reinforcement learning, the maximum of the mutual information between a sequence of K actions a and the final state s' reached after the execution of a, conditioned on the current state s, is often used as a measure of internal reward (Mohamed & Rezende, 2015), known as empowerment Φ(s), i.e.,
Φ(s) = max_w I(a, s'|s) = max_w E_{p(s'|a,s) w(a|s)}[ log ( p(a, s'|s) / (w(a|s) p(s'|s)) ) ]   (3)
where p(s'|a, s) is the K-step transition probability, w(a|s) is a distribution over action sequences, and p(a, s'|s) is the joint distribution of action sequences and final states. (In our proposed work, we consider only immediate step transitions, i.e., K = 1; hence the variables a and s' are represented in non-bold notation.) Intuitively, the empowerment Φ(s) of a state quantifies the extent to which an agent can influence its future. Empowerment, like a value function, is a potential function, and it has been used in reinforcement learning before, but its applications were limited to small-scale cases due to the computational intractability of MI maximization in higher-dimensional problems. However, a scalable method (Mohamed & Rezende, 2015) was recently proposed that learns empowerment through the more efficient maximization of a variational lower bound, which has been shown to be equivalent to maximizing MI (Agakov, 2004). The lower bound was derived (for the complete derivation, see Appendix A.1) by representing MI in terms of the difference in conditional entropies and utilizing the non-negativity of the KL divergence, i.e.,
I(s) ≥ I^{w,q}(s) = E_{p(s'|a,s) w(a|s)}[ log q_φ(a|s', s) − log w(a|s) ]   (4)
where I(s) denotes the mutual information of Eqn. 3, I^{w,q}(s) is its variational lower bound, q_φ(a|s', s) is a variational distribution with parameters φ, and w_θ(a|s) is a distribution over actions with parameters θ.
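The bound in Eqn. 4 can be verified numerically on a tiny discrete example (two actions, two next states, constructed by us purely for illustration): the bound is tight when q is the true action posterior, and strictly below the MI for any other q:

```python
import numpy as np

# Illustrative distributions: w(a|s) over 2 actions, p(s'|a, s) over 2 states.
w = np.array([0.3, 0.7])                       # w(a|s)
p_next = np.array([[0.9, 0.1],                 # p(s'|a=0, s)
                   [0.2, 0.8]])                # p(s'|a=1, s)
joint = w[:, None] * p_next                    # p(a, s'|s)
p_sp = joint.sum(axis=0)                       # p(s'|s)

# Exact mutual information I(a, s'|s).
I = (joint * np.log(joint / (w[:, None] * p_sp[None, :]))).sum()

# The true posterior q*(a|s', s) attains the bound with equality; any other
# variational q falls below it (the gap is an expected KL divergence).
q_star = joint / p_sp[None, :]
bound_star = (joint * (np.log(q_star) - np.log(w[:, None]))).sum()
q_bad = np.array([[0.5, 0.5], [0.5, 0.5]])     # a deliberately crude q
bound_bad = (joint * (np.log(q_bad) - np.log(w[:, None]))).sum()

assert np.isclose(I, bound_star)
assert bound_bad <= I
```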
Finally, the lower bound in Eqn. 4 is maximized under a constraint on the entropy of w (to avoid divergence, see (Mohamed & Rezende, 2015)) to compute the empowerment as follows:

Φ(s) = max_{w,q} E_{p(s'|a,s) w(a|s)}[ β log q_φ(a|s', s) − log w(a|s) ]   (5)
where β is a temperature term that enforces the entropy constraint. Mohamed & Rezende (2015)
also applied the principles of Expectation-Maximization (EM)
(Agakov, 2004) to learn empowerment, i.e., alternately maximizing Eqn. 5 with respect to q and w. Given a set of training trajectories, the maximization of Eqn. 5 w.r.t. q_φ is a supervised maximum log-likelihood problem, whereas the maximization w.r.t. w is determined through the functional derivative ∂I^{w,q}/∂w = 0 under the constraint Σ_a w(a|s) = 1. The optimal w that maximizes Eqn. 5 turns out to be w*(a|s) = (1/Z(s)) exp(u(s, a)), where u(s, a) = β E_{p(s'|s,a)}[log q_φ(a|s', s)] and Z(s) is the normalization term. Substituting w* into Eqn. 5 shows that the empowerment is Φ(s) = log Z(s) (for the full derivation, see Appendix A.2). Since exp(u(s, a)) is an unnormalized distribution over actions, Mohamed & Rezende (2015) introduced the approximation log π(a|s) + Φ(s) ≈ β log q_φ(a|s', s), where π(a|s)
is a normalized distribution and the scalar function Φ(s)
accounts for the normalization term log Z(s). Finally, the parameters of the policy π and the scalar function Φ are optimized by minimizing the discrepancy between the two approximations through the squared error:

l_I(s, a, s') = | log π(a|s) + Φ(s) − β log q_φ(a|s', s) |²   (6)
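To make the squared-error objective of Eqn. 6 concrete, the following sketch (illustrative scalars only, not the paper's implementation) evaluates the loss for one (s, a, s') sample and checks that it vanishes exactly when the scalar Φ absorbs the normalization gap between β log q and log π:

```python
import numpy as np

# One-sample evaluation of l_I with made-up numbers standing in for the
# policy, inverse-model, and potential-function outputs.
beta = 1.0
log_pi = np.log(0.2)      # log pi(a|s), from the normalized policy
log_q = np.log(0.6)       # log q_phi(a|s', s), from the inverse model
Phi = 0.8                 # scalar empowerment estimate Phi(s)

l_I = (log_pi + Phi - beta * log_q) ** 2
assert l_I >= 0.0

# The loss is zero exactly when Phi(s) = beta*log q - log pi, i.e., when the
# scalar Phi accounts for the normalizer log Z(s) as intended.
Phi_star = beta * log_q - log_pi
assert np.isclose((log_pi + Phi_star - beta * log_q) ** 2, 0.0)
```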
3 Empowered Adversarial Inverse Reinforcement Learning
We present an inverse reinforcement learning algorithm that simultaneously and adversarially learns a robust, transferable reward function and policy from expert demonstrations. Our proposed method comprises (i) an inverse model q_φ(a|s, s') that takes the current state s and the next state s' and outputs a distribution over the actions that could have produced the s → s' transition, (ii) a reward function r_ξ(s, a), with parameters ξ, that is a function of both state and action, (iii) an empowerment-based potential function Φ_ψ(s), with parameters ψ, that determines the reward-shaping function F(s, s') = γΦ_ψ(s') − Φ_ψ(s), and (iv) a policy model π_θ(a|s) that outputs a distribution over actions given the current state s. All these models are trained simultaneously based on the objective functions described in the following sections.
3.1 Inverse model optimization
As mentioned in Section 2.3, learning the inverse model q_φ is a supervised maximum log-likelihood problem. Therefore, given a set of trajectories τ ~ π, where a single trajectory is a sequence of states and actions, i.e., τ_i = {s_0, a_0, ..., s_T, a_T}, the inverse model is trained to minimize the mean-squared error between its predicted action â ~ q_φ(·|s_t, s_{t+1}) and the action a_t taken along the generated trajectory τ_i, i.e.,

l_q(s_t, a_t, s_{t+1}) = ‖ â − a_t ‖²   (7)
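A minimal sketch of this regression (our own toy setup: linear dynamics s' = s + a, so the exact inverse model is a = s' − s) fits a linear inverse model by gradient descent on the MSE loss of Eqn. 7:

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.normal(size=(256, 1))            # states
a = rng.normal(size=(256, 1))            # actions taken
s_next = s + a                           # toy deterministic dynamics

X = np.hstack([s, s_next])               # inverse-model input (s, s')
W = np.zeros((2, 1))                     # linear inverse model: a_hat = X W
for _ in range(500):
    pred = X.dot(W)
    W -= 0.1 * X.T.dot(pred - a) / len(X)   # gradient of the MSE loss

# The learned weights recover a = -1*s + 1*s', the true inverse dynamics.
assert np.allclose(W.ravel(), [-1.0, 1.0], atol=1e-2)
```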
3.2 Empowerment optimization
As shown in Section 2.3, the empowerment is expressed in terms of the normalization function Z(s) of the optimal w*, i.e., Φ(s) = log Z(s). Therefore, the empowerment estimate Φ_ψ(s) is obtained by minimizing the loss function l_I(s, a, s'), presented in Eqn. 6, w.r.t. the parameters ψ, where the inputs (s, a, s') are sampled from the policy-generated trajectories τ.
3.3 Reward function
To train the reward function, we first compute the discriminator as follows:

D_{ξ,ψ}(s, a, s') = exp(r_ξ(s, a) + γΦ_{ψ'}(s') − Φ_ψ(s)) / ( exp(r_ξ(s, a) + γΦ_{ψ'}(s') − Φ_ψ(s)) + π_θ(a|s) )   (8)
where r_ξ(s, a) is the reward function to be learned with parameters ξ. We also maintain target parameters ψ' and learning parameters ψ for the empowerment-based potential function. The target parameters ψ' are a replica of ψ, except that they are updated to the learning parameters only after a fixed number of training epochs. Note that keeping a stationary target Φ_{ψ'} stabilizes learning, as also highlighted in (Mnih et al., 2015). Finally, the discriminator/reward-function parameters ξ are trained via binary logistic regression to discriminate between expert trajectories τ_E and generated trajectories τ, i.e.,

max_ξ E_{τ_E}[ log D_{ξ,ψ}(s, a, s') ] + E_τ[ log(1 − D_{ξ,ψ}(s, a, s')) ]   (9)
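A hedged sketch of Eqns. 8 and 9 for a single pair of samples (all numbers are stand-ins for network outputs, and the target potential Φ' is simply a second set of values here):

```python
import numpy as np

gamma = 0.9

def discriminator(r, Phi_s, Phi_next_target, pi):
    """Eq. 8: D = exp(f) / (exp(f) + pi), with f = r + gamma*Phi'(s') - Phi(s)."""
    f = r + gamma * Phi_next_target - Phi_s
    return np.exp(f) / (np.exp(f) + pi)

# Illustrative expert and policy-generated samples.
d_exp = discriminator(r=1.5, Phi_s=0.2, Phi_next_target=0.4, pi=0.3)
d_gen = discriminator(r=-0.5, Phi_s=0.2, Phi_next_target=0.1, pi=0.6)

# Binary logistic-regression loss of Eq. 9 (expert -> 1, generated -> 0).
loss = -(np.log(d_exp) + np.log(1.0 - d_gen))
assert 0.0 < d_gen < d_exp < 1.0 and loss > 0.0
```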
3.4 Policy optimization
We train our policy π_θ to maximize the discriminative reward r̂(s, a, s') = log D(s, a, s') − log(1 − D(s, a, s')) and to minimize the loss function l_I that learns the empowerment, weighted by a hyperparameter λ_I. Hence, the overall policy training objective is:

max_θ E_τ[ r̂(s, a, s') ] − λ_I l_I(s, a, s')   (10)
where the policy parameters θ are updated by taking a KL-constrained natural gradient step using a policy optimization method such as TRPO (Schulman et al., 2015), or an approximated step such as PPO (Schulman et al., 2017).
Algorithm 1 outlines the overall training procedure for training all function approximators simultaneously. Note that the expert samples τ_E are seen by the discriminator only, whereas all other models are trained using the policy-generated samples τ. Furthermore, as highlighted in (Fu et al., 2017), the discriminative reward boils down to the following expression:

r̂(s, a, s') = log D_{ξ,ψ}(s, a, s') − log(1 − D_{ξ,ψ}(s, a, s')) = r_ξ(s, a) + F(s, s') − log π_θ(a|s)   (11)

where F(s, s') = γΦ_{ψ'}(s') − Φ_ψ(s). Hence, our policy training objective maximizes the learned shaped reward r_ξ(s, a) + F(s, s') and, through l_I, minimizes the discrepancy between log π_θ(a|s) + Φ_ψ(s) and β log q_φ(a|s', s), with the −log π_θ(a|s) term acting as an entropy regularizer. Moreover, note that the function f(s, a, s') = r_ξ(s, a) + γΦ_{ψ'}(s') − Φ_ψ(s) can be viewed as a single-sample estimate of the advantage function, i.e.,

f(s, a, s') = r_ξ(s, a) + γΦ_{ψ'}(s') − Φ_ψ(s) ≈ r(s, a) + γV(s') − V(s)   (12)

with the empowerment Φ playing the role of the state-value function V(s).
Hence, our method trains the policy under reward transformations, which leads to learning an invariant and robust policy from expert demonstrations.
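The identity behind Eqns. 11 and 12 can be checked directly with scalar stand-ins (illustrative numbers, not actual network outputs): the discriminative reward log D − log(1 − D) equals the shaped logit f minus log π:

```python
import numpy as np

gamma = 0.9
r, Phi_s, Phi_next, pi = 0.7, 0.3, 0.5, 0.2   # stand-ins for model outputs

# Shaped reward f(s, a, s') = r(s, a) + gamma*Phi(s') - Phi(s), which also
# reads as a one-sample advantage estimate with Phi in the role of V.
f = r + gamma * Phi_next - Phi_s
D = np.exp(f) / (np.exp(f) + pi)
r_hat = np.log(D) - np.log(1.0 - D)

# Eq. 11: the policy's reward is the shaped reward plus an entropy bonus.
assert np.isclose(r_hat, f - np.log(pi))
```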
4 Results
Our proposed method, EAIRL, simultaneously learns a reward and a policy from expert demonstrations. We evaluate our method against state-of-the-art policy and reward learning techniques on several control tasks in OpenAI Gym. For policy learning, we compare our method against GAIL, GAN-GCL, AIRL with a state-only reward, denoted AIRL(s), and AIRL with a state-action reward, denoted AIRL(s, a). For reward learning, we compare our method against AIRL(s) and AIRL(s, a) only, as GAIL does not recover a reward and GAN-GCL has been shown to exhibit inferior performance to AIRL (see (Fu et al., 2017)). Furthermore, in the comparisons, we also include the expert performance, which represents a policy learned by optimizing a ground-truth reward using TRPO. The performance of each method is evaluated in terms of the average total reward accumulated by an agent during a trial (denoted as the score), and for each experiment we run five trials.
Algorithm | States-Only | Pointmass-Maze | Crippled-Ant
Expert | N/A | |
EAIRL (Ours) | No | |
AIRL(s) | Yes | |
AIRL(s, a) | No | |
Table 1: Evaluation of reward learning on transfer-learning tasks. Mean scores (higher is better) with standard deviations over five trials.
4.1 Reward learning performance (Transfer learning experiments)
To evaluate the learned rewards, we consider a transfer-learning problem in which the testing environments differ from the training environments. More precisely, the rewards learned via IRL in the training environments are used to re-optimize a new policy in the testing environments. We consider two test cases, shown in Fig. 1 and Fig. 2, in which the agent's dynamics and the physical environment are modified, respectively.
In the first test case, as shown in Fig. 1(a), we modify the agent itself during testing. We trained a reward function to make a standard quadruped ant run forward. During testing, we disabled the front two legs (indicated in red) of the ant (crippled ant), and the learned reward was used to re-optimize the policy to make the crippled ant move forward. Note that the crippled ant cannot move sideways (see Appendix B.1); therefore, the agent has to change its gait to run forward. In the second test case, shown in Fig. 2(a), the agent learns to navigate a 2D point mass to the goal region in a simple maze. We reposition the maze's central wall during testing so that the agent has to take a different path than in the training environment to reach the target (see Appendix B.2).
Fig. 1(b) and Fig. 2(b) compare the policy performance scores of EAIRL, AIRL(s), and AIRL(s, a) over five different trials in the aforementioned transfer-learning tasks. The expert score is shown as a horizontal line to indicate the standard set by an expert policy. Table 1 summarizes the mean scores of the five trials, with standard deviations, for the above-mentioned transfer-learning experiments. It can be seen that our method recovers near-optimal reward functions, as the policy scores almost reach the expert scores in all five trials. Furthermore, our method performs significantly better than both AIRL(s) and AIRL(s, a) in matching the expert's performance.
4.2 Policy learning performance (Imitation learning)
Table 2 presents the means and standard deviations of the policy-learning performance scores, over five different trials, on various control tasks. For each algorithm, we provided 20 expert demonstrations for imitation, generated by optimizing a policy on a ground-truth reward using TRPO. The tasks, shown in Fig. 3, include (i) making a 2D cheetah robot run forward, (ii) making a 3D quadruped robot (ant) move forward, (iii) making a 2D robot swim (swimmer), and (iv) keeping a frictionless pendulum standing upright. It can be seen that EAIRL, AIRL(s, a), and GAIL demonstrate similar performance and successfully learn to imitate the expert policy, whereas AIRL(s) and GAN-GCL fail to recover a policy.
Methods | HalfCheetah | Ant | Swimmer | Pendulum
Expert | | | |
GAIL | | | |
GAN-GCL | | | |
AIRL(s, a) | | | |
AIRL(s) | | | |
EAIRL | | | |
Table 2: Policy-learning performance across environments. Mean scores with standard deviations over five trials.
5 Discussion
This section highlights the importance of state-action rewards and potential-based reward-shaping functions for learning policies and rewards, respectively, from expert demonstrations.
Ng et al. (1999) theoretically discussed the importance of potential-based reward shaping for policy invariance in an MDP but, to the best of our knowledge, no prior work has reported a practical approach to learning a potential-based reward-shaping function or its implications for IRL. Note that our method, EAIRL, and AIRL with a state-action reward function, i.e., AIRL(s, a), share the same discriminator formulation, except that AIRL(s, a) does not impose any structure on the reward-shaping function, while our method models the reward-shaping function through empowerment. The numerical results on reward learning, reported in the previous section, indicate that AIRL(s, a) fails to learn rewards, whereas EAIRL recovers near-optimal reward functions. This highlights the positive impact of a potential-based reward-shaping function on reward learning. Thus, our experimentation validates the theoretical proposition of (Ng et al., 1999) that the reward-shaping function determines the extent to which the true reward function can be recovered from expert demonstrations.
Our experimentation also highlights the importance of modeling the discriminator/reward function in the adversarial learning framework as a function of both state and action. The notion of disentangled rewards leaves the discriminator dependent on states only. The results show that AIRL with disentangled rewards fails to learn a policy, whereas EAIRL, GAIL, and AIRL with a state-action reward successfully recover policies. Hence, it is crucial to model the reward/discriminator as a function of state and action, as otherwise adversarial imitation learning fails to retrieve a policy from expert data.
Our method leverages both a potential-based reward-shaping function and state-action-dependent rewards, and therefore learns both reward and policy simultaneously. By contrast, GAIL learns a policy but cannot recover a reward function, whereas AIRL cannot learn a reward and a policy simultaneously.
6 Conclusions and Future Work
We present an approach to adversarial reward and policy learning from expert demonstrations that efficiently and effectively utilizes reward shaping for inverse reinforcement learning. We learn a potential-based reward-shaping function in parallel with learning the reward and policy. Our method transforms the learned reward through the shaping function, which leads to acquiring a policy that is invariant under reward transformations. The invariant policy, in turn, guides the reward-learning process to recover a near-optimal reward. We show that our method successfully learns near-optimal rewards and policies and performs significantly better than state-of-the-art IRL methods in both imitation learning and transfer learning. The learned rewards are shown to be transferable to environments that are dynamically or structurally different from the training environments.
In future work, we plan to extend our method to learn rewards and policies from diverse human/expert demonstrations, as the proposed method assumes that a single expert generates the training data. Another exciting direction is to learn from suboptimal demonstrations that contain failures in addition to optimal behaviors.
References

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.
David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.
Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
John Asmuth, Michael L. Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In AAAI, pp. 604–609, 2008.
Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189, 2011.
Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016b.
Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Marek Grzes and Daniel Kudenko. Learning shaping rewards in model-based reinforcement learning. In Proc. AAMAS 2009 Workshop on Adaptive Learning Agents, volume 115, 2009.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
Mrinal Kalakrishnan, Peter Pastor, Ludovic Righetti, and Stefan Schaal. Learning objective functions for manipulation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 1331–1336. IEEE, 2013.
Markus Kuderer, Shilpa Gulati, and Wolfram Burgard. Learning driving styles for autonomous vehicles from demonstration. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 2641–2646. IEEE, 2015.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2125–2133, 2015.
Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.
Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa, and Hiroshi Ishiguro. Show, attend and interact: Perceivable human-robot social interaction through neural attention Q-network. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 1639–1645. IEEE, 2017.
Ahmed Hussain Qureshi, Yutaka Nakamura, Yuichiro Yoshikawa, and Hiroshi Ishiguro. Intrinsically motivated reinforcement learning for human-robot interaction in the real-world. Neural Networks, 2018.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 729–736. ACM, 2006.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.
Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment: an introduction. In Guided Self-Organization: Inception, pp. 67–114. Springer, 2014.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
Michael Yip and Nikhil Das. Robot autonomy for surgery. arXiv preprint arXiv:1707.03080, 2017.
Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438, 2008.
Appendices
Appendix A Variational Empowerment
For completeness, we present a derivation of mutual information (MI) as a variational lower bound and the maximization of this lower bound to learn empowerment.
A.1 Variational Information Lower Bound
As mentioned in Section 2.3, the variational lower bound representation of MI is obtained by defining MI as a difference of conditional entropies and applying the non-negativity of the KL divergence.
A.2 Variational Information Maximization
The empowerment is the maximum of MI, and it can be formalized as follows by exploiting the variational lower bound formulation (for details, see (Mohamed & Rezende, 2015)):

Φ(s) = max_{w,q} E_{p(s'|a,s) w(a|s)}[ β log q_φ(a|s', s) − log w(a|s) ]   (13)
As mentioned in Section 2.3, given a set of training trajectories, the maximization of Eqn. 13 w.r.t. the inverse model q_φ is a supervised maximum log-likelihood problem. The maximization of Eqn. 13 w.r.t. w is derived through a functional derivative under the constraint Σ_a w(a|s) = 1. For simplicity, we consider discrete state and action spaces, and the derivation proceeds by setting the functional derivative of the constrained objective to zero.
By using the constraint Σ_a w(a|s) = 1, it can be shown that the optimal solution is w*(a|s) = (1/Z(s)) exp(u(s, a)), where u(s, a) = β E_{p(s'|s,a)}[log q_φ(a|s', s)] and Z(s) = Σ_a exp(u(s, a)). This solution maximizes the lower bound, since the objective of Eqn. 13 is concave in w.
Appendix B Transfer learning problems
B.1 Ant environment
The following figures show the difference between the path profiles of the standard and the crippled ant. It can be seen that the standard ant can move sideways, whereas the crippled ant has to rotate in order to move forward.
B.2 Maze environment
The following figures show the path profiles of a 2D point-mass agent reaching the target in the training and testing environments. It can be seen that in the testing environment the agent has to take the opposite route, compared to the training environment, to reach the target.
Appendix C Implementation Details
C.1 Network Architectures
We use a two-layer ReLU network with 32 units in each layer for the potential function, the reward function, and the discriminators of GAIL and GAN-GCL. Furthermore, the policies of all presented models and the inverse model of EAIRL are represented by two-layer ReLU networks with 32 units in each layer, where the network's output parametrizes a Gaussian distribution, i.e., we assume a Gaussian policy.
C.2 Hyperparameters
For all experiments, we use a fixed temperature term β. We set the entropy regularization weight to 0.1 and 0.001 for reward and policy learning, respectively. The hyperparameter λ_I was set to 1 for reward learning and 0.001 for policy learning. The target parameters of the empowerment-based potential function were updated every 5 and 2 epochs during reward and policy learning, respectively. Although the reward-learning parameters are also applicable to policy learning, we decrease the magnitudes of the entropy and information regularizers during policy learning to speed up policy convergence to optimal values. Furthermore, we set the batch size to 2,000 steps per TRPO update for the pendulum environment and 20,000 steps for the remaining environments. For the methods (Fu et al., 2017; Ho & Ermon, 2016) presented for comparison, we use their suggested hyperparameters. We also use policy samples from the previous 20 iterations as negative data to train the discriminators of all IRL methods presented in this paper, to prevent the parametrized reward functions from overfitting to the current policy samples.
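For reference, the reported settings can be collected into a single configuration sketch (key names are our own; the value of the temperature β is not stated in the text and is deliberately left out):

```python
# Hedged summary of the hyperparameters reported in Sec. C.2.
HPARAMS = {
    "reward_learning": {
        "entropy_weight": 0.1,
        "lambda_I": 1.0,              # weight of the information loss l_I
        "target_update_epochs": 5,    # Phi' <- Phi update interval
    },
    "policy_learning": {
        "entropy_weight": 0.001,
        "lambda_I": 0.001,
        "target_update_epochs": 2,
    },
    "batch_size_steps": {"pendulum": 2000, "other_envs": 20000},
    "discriminator_negative_data_iters": 20,
}

# Regularizers are deliberately smaller during policy learning (see text).
assert HPARAMS["reward_learning"]["lambda_I"] > HPARAMS["policy_learning"]["lambda_I"]
```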