Imitation runs through the entire growth and learning of human beings. In early childhood, we imitate the facial and manual gestures of others, and we imitate language as we learn to speak. Inverse reinforcement learning (IRL) and imitation learning (IL) formulate such imitation as recovering a policy from expert demonstrations. IRL attempts to infer the reward function that explains the observed behavior. For instance, a skilled driver can be thought of as holding some secret that leads him to drive well. IRL formulates this secret as an explicit reward function, through which the driver can receive feedback and adjust his driving policy. Based on the inferred reward function, the expert's policy can be recovered via reinforcement learning (RL). As for IL, it can be realized by several methods, such as IRL and supervised learning (SL). However, the policy recovered via IL or IRL is almost always sub-optimal, and it is intractable to outperform the demonstrator. This is mainly due to two challenges. On the one hand, it is difficult to provide high-quality demonstrations for the agent, considering the completeness of the sampling: the demonstration level is therefore always below the actual expert level, and the optimal level is likely to be much higher still. On the other hand, IRL only aims to find a reward function that justifies the demonstrations, without any further exploration to improve the policy. This critical flaw has caused both IRL and IL to be underestimated for a long time.
In this paper, we investigate the IRL problem of extrapolating a beyond-demonstrator (BD) policy. With the wide application of RL, its dilemmas have also become increasingly obvious. In RL, the reward signal characterizes the optimization objective and determines the final pattern of the learned policy. But in most realistic scenarios the reward signal is sparse or missing, and the design of the reward function lacks a strict theoretical basis. Moreover, the tremendous computation and interaction demanded by RL become ever more intractable as task complexity grows. Therefore, if IRL can not only imitate but also outperform the expert policy, it will provide a more economical and efficient way to build expert systems.
Some pioneering work has been achieved on beyond-demonstrator IRL. Brown et al. investigate why traditional IRL cannot outperform the expert, and propose the trajectory-ranked reward extrapolation (T-REX) framework. T-REX first sorts the sampled trajectories according to their cumulative rewards. Then it forms a reward function parameterized by a deep neural network (DNN), and trains this network so that the ranking relation holds. This can be seen as following the pattern of the maximum likelihood method (MLM): it infers the reward function that explains the observed, ranked trajectories. By exploring the potential reward space, T-REX provides a high-quality reward function and thus learns advantageous policies. Simulation results show that T-REX outperforms state-of-the-art (SOTA) IRL and IL methods. Huang et al. extend T-REX to the multi-agent setting and propose MA-TREX. Building on T-REX, Goo et al. first study the sufficient condition under which such methods successfully extrapolate beyond the performance of the demonstrator. They then propose Disturbance-based Reward Extrapolation (D-REX), another ranking-based IL method, which automatically generates ranked demonstrations by injecting noise during policy learning. Moreover, D-REX thoroughly deprecates additional supervision: it can be performed even when no labeled demonstrations exist. Both T-REX and D-REX require a large number of demonstrations, yet it is difficult to guarantee the completeness of the sampling. Yu et al. propose the generative intrinsic reward driven imitation learning (GIRIL) framework, which takes a one-life demonstration to learn a family of intrinsic reward functions. GIRIL introduces to IL the concept of the intrinsic reward, which was first applied in RL.
GIRIL uses a variational autoencoder (VAE) as its model base, which contains an encoder and a decoder. The encoder accepts a state and the corresponding next-state and outputs a latent variable as the encoded information. The latent variable is then packed with the state and sent to the decoder to predict the next-state. Through the VAE, GIRIL completes the reconstruction of the state transition and learns the transition model of the task. Finally, the difference between the real next-state and the predicted next-state serves as the intrinsic reward function. Using the prediction error as the reward function can be interpreted as the curiosity of the agent, which realizes extensive and comprehensive exploration. Thus GIRIL can not only imitate the expert policy but also extrapolate a beyond-demonstrator policy.
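The core idea, prediction error of a learned transition model as intrinsic reward, can be sketched without the full VAE machinery. The linear stand-in model below is our assumption for illustration, not GIRIL's actual decoder:

```python
import numpy as np

def intrinsic_reward(predict_next, s, a, s_next):
    """Reconstruction error of the transition model as intrinsic reward:
    transitions the model cannot reproduce are 'surprising'."""
    err = predict_next(s, a) - s_next
    return 0.5 * float(err @ err)

# Stand-in for the trained decoder: assumes deterministic dynamics s' = s + a.
predict_next = lambda s, a: s + a

s, a = np.array([0.0, 0.0]), np.array([1.0, 0.0])
r_seen  = intrinsic_reward(predict_next, s, a, np.array([1.0, 0.0]))  # matches model
r_novel = intrinsic_reward(predict_next, s, a, np.array([1.0, 2.0]))  # surprising
```

A transition the model already explains yields zero reward, while an unfamiliar one yields a positive reward, which is exactly the curiosity signal described above.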
Considering their detailed learning procedures, these works follow the same pattern, which can be summarized as "two-stage" algorithms: the algorithm first learns a reward function, then recovers the policy via RL. In contrast, we summarize algorithms that learn the policy straightforwardly as "one-stage" algorithms, such as generative adversarial imitation learning (GAIL) and variational adversarial imitation learning (VAIL). All of the above beyond-demonstrator methods are two-stage, because they realize the transcendence via a specialized reward function generated by IRL. However, the two-stage procedure increases computational complexity while introducing more variance. The latter RL procedure is entirely independent of the former IRL procedure; if the IRL cannot generate a high-quality reward function, the final policy may not be successfully obtained. Although GIRIL reduces the number of demonstrations, it does not reduce the consequent training complexity. Compared with two-stage algorithms, one-stage algorithms are more efficient and robust. To the best of our knowledge, there is no one-stage algorithm that realizes beyond-demonstrator learning. Fu et al. propose a framework entitled adversarial inverse reinforcement learning (AIRL), which infers the reward function and learns the policy at the same time. AIRL transforms the IRL problem into a generative adversarial (GA) fashion, where a policy generates trajectories and a discriminator evaluates whether the trajectories come from the expert. Meanwhile, the score of the discriminator is set as the reward function of the policy; to maximize this reward, the policy must approach the expert so as to obtain a higher score. Once training is over, both the reward function and the policy are obtained. AIRL is a special one-stage algorithm, but it learns the policy based on the inferred reward function. Moreover, the inference of the reward function and the learning of the policy are closely related, which realizes mutual supervision and effectively reduces the variance. Building on AIRL, we argue it is feasible to inherit its basis and redesign the reward function to achieve both imitation and transcendence. If so, the efficiency of BD-IRL will be greatly promoted, and the construction of expert systems will be more convenient.
In this work, we propose a framework entitled hybrid adversarial inverse reinforcement learning (HAIRL), which is a model-free, one-stage, generative-adversarial (GA) and curiosity-driven inverse reinforcement learning algorithm. HAIRL realizes the goal of behavior imitation while extrapolating a beyond-demonstrator policy via the GA fashion and a curiosity module. Simulation results show that HAIRL outperforms similar SOTA algorithms. Our main contributions can be summarized as follows:
We first analyze the flaws of existing IRL and IL algorithms and their causes, along with the research tendency of IRL. We then review and classify existing work on BD-IRL, and discuss the feasibility of building beyond-demonstrator, one-stage IRL algorithms.
We dive into the structures of AIRL and the intrinsic curiosity module (ICM), and make improvements to both. Based on the improved AIRL and ICM, we propose the HAIRL framework, which successfully integrates imitation and exploration into one procedure. Moreover, HAIRL has higher efficiency and lower variance.
We compare the performance of HAIRL and other algorithms on multiple environments within OpenAI Gym. The simulation results show that HAIRL makes efficient extrapolation while greatly reducing computation. Moreover, we evaluate the performance of HAIRL with different amounts of demonstrations and noise. These further results demonstrate that HAIRL is an adaptive and robust framework.
II Problem Formulation
We study the inverse reinforcement learning (IRL) problem, which considers an entropy-regularized Markov decision process (ER-MDP) defined as below:
The ER-MDP can be defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \rho_0)$, where:
$\mathcal{S}$ is the state space;
$\mathcal{A}$ is the action space;
$\mathcal{P}(s' \mid s, a)$ is the transition probability;
$r(s, a)$ is the reward function;
$\gamma \in (0, 1]$ is a discount factor;
$\rho_0$ is the initial state distribution.
Denote by $\pi$ the policy which selects the actions in the ER-MDP. In the standard reinforcement learning (RL) setting, the reward function and the initial state distribution are unknown and can only be accessed through interaction with the MDP. Based on the former configurations, we can define the optimization objective of the RL.
Given the Markov decision process $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \rho_0)$, the goal of the RL is:
$$\pi^{*} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \big( r(s_t, a_t) + H(\pi(\cdot \mid s_t)) \big)\right],$$
where $\Pi$ is the set of all stationary stochastic policies, $\tau = (s_0, a_0, s_1, a_1, \dots)$ is the trajectory generated by the policy, and $H(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi}\left[-\log \pi(a \mid s_t)\right]$ is the causal entropy.
This optimization objective implies that the learning method should not only maximize the cumulative reward but also maximize the entropy of each output action. This randomizes the policy so that the agent can explore more state-action pairs. The following theorem elaborates the optimal policy for the ER-MDP, based on the policy improvement theorem in .
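For the entropy-regularized objective above, the optimal policy takes the standard energy-based form known from the soft Q-learning literature; the symbols $Q^{*}_{\mathrm{soft}}$ and $V^{*}_{\mathrm{soft}}$ below are our notation for the soft action-value and soft value functions, not defined in the text above:

```latex
\pi^{*}(a \mid s) = \exp\!\left( Q^{*}_{\mathrm{soft}}(s, a) - V^{*}_{\mathrm{soft}}(s) \right),
\qquad
V^{*}_{\mathrm{soft}}(s) = \log \int_{\mathcal{A}} \exp\!\left( Q^{*}_{\mathrm{soft}}(s, a) \right) \mathrm{d}a .
```

That is, actions are sampled in proportion to the exponentiated soft Q-values, which keeps the policy stochastic while favoring high-reward actions.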
The proof can be found in . ∎
Inverse reinforcement learning aims to find a reward function that explains the observed behavior. Given a set of trajectories $\mathcal{D} = \{\tau_1, \dots, \tau_N\}$, the trajectories are assumed in IRL to be drawn from an optimal policy. The IRL problem can then be defined as:
Given the ER-MDP and a set of trajectories $\mathcal{D}$, the goal of the IRL is:
$$\max_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D}}\left[\log p_{\theta}(\tau)\right],$$
where $p_{\theta}(\tau) \propto \exp\left(\sum_{t} \gamma^{t} r_{\theta}(s_t, a_t)\right)$, and $r_{\theta}$ is the parameterized reward function.
III Hybrid Adversarial Inverse Reinforcement Learning
IV Experiments and Results
V-A Extrinsic Reward Block
using the GA fashion. In a standard GA structure, a generator aims to capture the distribution of the training data, while a discriminator estimates the probability that a sample comes from the training data rather than from the generator. In AIRL, the generator is served by a policy $\pi$ parameterized by a deep neural network (DNN). The policy accepts states from the environment and takes corresponding actions. A discriminator is then designed to judge whether a state-action pair is generated by the expert policy. Instead of directly outputting the estimated probability, the discriminator takes a special form:
$$D_{\theta}(s, a) = \frac{\exp\left(f_{\theta}(s, a)\right)}{\exp\left(f_{\theta}(s, a)\right) + \pi(a \mid s)},$$
where $D_{\theta}$ is the discriminator parameterized by a DNN with parameters $\theta$, and $f_{\theta}$ is the learned function. Finally, the loss function of the discriminator is the cross-entropy:
$$\mathcal{L}_{D} = -\mathbb{E}_{(s,a) \sim \tau_{E}}\left[\log D_{\theta}(s, a)\right] - \mathbb{E}_{(s,a) \sim \tau_{\pi}}\left[\log\left(1 - D_{\theta}(s, a)\right)\right].$$
Meanwhile, the policy is set to maximize the following objective:
$$\mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \log D_{\theta}(s_t, a_t) - \log\left(1 - D_{\theta}(s_t, a_t)\right)\right].$$
In each training epoch, the generation policy $\pi$ is first executed to generate trajectories $\tau_{\pi}$. The discriminator is then trained with the generated trajectories $\tau_{\pi}$ and the expert trajectories $\tau_{E}$, so as to distinguish expert state-action pairs from those in $\tau_{\pi}$. The discriminator estimates the probability that a state-action pair is generated by the expert policy $\pi_{E}$. This induces the corresponding rewards for all state-action pairs in $\tau_{\pi}$, which are used for policy updates via any policy optimization method. Therefore, the more similar $\pi$ is to $\pi_{E}$, the higher the reward $\pi$ obtains based on Eq. 8, and the generation policy continuously adjusts itself to approach the expert policy.
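The special discriminator form and its induced reward can be checked numerically; in particular, $\log D - \log(1 - D)$ collapses to $f_{\theta}(s, a) - \log \pi(a \mid s)$. The scalars below are stand-ins for the network outputs, chosen only for illustration:

```python
import numpy as np

def airl_discriminator(f, log_pi):
    """AIRL's special discriminator: D = exp(f) / (exp(f) + pi(a|s))."""
    return np.exp(f) / (np.exp(f) + np.exp(log_pi))

def airl_reward(f, log_pi):
    """Policy reward log D - log(1 - D)."""
    d = airl_discriminator(f, log_pi)
    return np.log(d) - np.log(1.0 - d)

# Scalar stand-ins for f_theta(s, a) and log pi(a|s):
f, log_pi = 1.5, -0.7
r = airl_reward(f, log_pi)
print(abs(r - (f - log_pi)) < 1e-9)  # the reward simplifies to f - log pi
```

This identity is why AIRL's reward is an entropy-regularized quantity: the policy is rewarded by $f_{\theta}$ while being penalized by its own log-probability.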
Through the discriminator, the learning process is simplified to merely simulating the expert policy. Regardless of the original extrinsic reward of a task (e.g., speed, time, etc.), AIRL transforms the IRL problem into an RL problem with a specific reward function, and this reward function only measures the difference between the generation policy and the expert policy. This allows the agent to focus on imitation rather than exploration, which speeds up learning. However, this same operation limits the further improvement of the agent.
In HAIRL, the generation policy is expected to efficiently imitate the expert policy, so the discriminator that classifies expert data is taken to serve as the extrinsic reward block. This block generates rewards for state-action pairs when executing the generation policy, which guides the generation policy to rapidly approach the expert policy. In AIRL, the discriminator takes the cross-entropy as its loss function, which induces the corresponding reward function. However, such a loss function cannot indicate the training progress and may lead to mode collapse, and it is difficult to balance the training levels of the generator and discriminator. Moreover, the original reward function Eq. 8 is monotonic, and its value may approach infinity if the discriminator over-fits. To address these problems, the Wasserstein distance (WD) is leveraged as the loss function of the discriminator, which induces the Wasserstein GAN in . For the discriminator, the loss function is formulated as:
$$\mathcal{L}_{D} = \mathbb{E}_{(s,a) \sim \tau_{\pi}}\left[D_{\theta}(s, a)\right] - \mathbb{E}_{(s,a) \sim \tau_{E}}\left[D_{\theta}(s, a)\right].$$
Furthermore, the discriminator form of Eq. 6 is deprecated: the discriminator now directly outputs a scalar score. For the generation policy, the reward function is set as:
$$r^{E}(s, a) = D_{\theta}(s, a).$$
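A minimal numeric sketch of the Wasserstein critic loss and the induced extrinsic reward follows; the toy critic and data are assumptions for illustration only, and a real implementation also needs a Lipschitz constraint on the critic (e.g., weight clipping as in WGAN):

```python
import numpy as np

def wd_critic_loss(critic, gen_pairs, exp_pairs):
    """Wasserstein critic loss: push scores down on generated pairs
    and up on expert pairs."""
    return float(np.mean([critic(x) for x in gen_pairs])
                 - np.mean([critic(x) for x in exp_pairs]))

# Toy critic: scores state-action pairs by closeness to the expert mean 1.0.
critic = lambda x: -abs(x - 1.0)

gen_pairs = [0.1, 0.2, 0.3]     # far from expert behaviour -> low scores
exp_pairs = [0.9, 1.0, 1.1]     # expert data -> high scores

loss = wd_critic_loss(critic, gen_pairs, exp_pairs)
reward = critic(0.95)            # extrinsic reward: the critic score itself
```

Minimizing this loss widens the score gap between expert and generated pairs, while the generation policy maximizes the same score as its extrinsic reward.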
According to Def. 3, let:
$$p_{\theta}(\tau) = \frac{1}{Z_{\theta}} \exp\left(\sum_{t} \gamma^{t} r_{\theta}(s_t, a_t)\right),$$
where $Z_{\theta}$ is the partition function. Then the gradient of the log-likelihood can be computed as:
$$\nabla_{\theta} \mathbb{E}_{\tau \sim \mathcal{D}}\left[\log p_{\theta}(\tau)\right] = \mathbb{E}_{\tau \sim \mathcal{D}}\left[\sum_{t} \gamma^{t} \nabla_{\theta} r_{\theta}(s_t, a_t)\right] - \mathbb{E}_{\tau \sim p_{\theta}}\left[\sum_{t} \gamma^{t} \nabla_{\theta} r_{\theta}(s_t, a_t)\right].$$
Let $\rho_{\mathcal{D}, t}$ and $\rho_{\theta, t}$ be the state-action marginals at time $t$ under the demonstrations and under $p_{\theta}$, respectively; the above equation can be rewritten as:
$$\sum_{t} \gamma^{t} \left( \mathbb{E}_{(s,a) \sim \rho_{\mathcal{D}, t}}\left[\nabla_{\theta} r_{\theta}(s, a)\right] - \mathbb{E}_{(s,a) \sim \rho_{\theta, t}}\left[\nabla_{\theta} r_{\theta}(s, a)\right] \right).$$
Taking the derivative of the loss function Eq. 9 w.r.t. $\theta$:
$$\nabla_{\theta} \mathcal{L}_{D} = \mathbb{E}_{(s,a) \sim \tau_{\pi}}\left[\nabla_{\theta} D_{\theta}(s, a)\right] - \mathbb{E}_{(s,a) \sim \tau_{E}}\left[\nabla_{\theta} D_{\theta}(s, a)\right].$$
As $-\nabla_{\theta} \mathcal{L}_{D}$ takes the same form as the log-likelihood gradient with $D_{\theta}$ playing the role of $r_{\theta}$, the two objectives are consistent. Moreover, because $Z_{\theta}$ is constant with respect to the samples, we only need to maximize the exponentiated return term. So the theorem is proved. ∎
We refer to Eq. 10, which serves as the extrinsic reward block, as $r^{E}$. Assuming only the extrinsic reward is available to the generation policy, it is set to maximize the following objective:
$$\mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t} r^{E}(s_t, a_t)\right].$$
By maximizing Eq. 11, the WD between the expert policy and the generation policy is decreased, which means $\pi$ becomes more and more similar to $\pi_{E}$.
V-B Intrinsic Reward Block
Through the extrinsic reward, the generation policy can quickly approach the expert policy. After the generation policy finishes its "courses", it should actively explore the environment to achieve better performance than the expert policy. To realize this exploration, the intrinsic curiosity module (ICM) is introduced to serve as the intrinsic reward block, which contains a forward model and an inverse model. Different from the original ICM, which encodes the state-action pairs, we design the end-to-end ICM (E-ICM) shown in Fig. 1. The E-ICM drops the encoding procedure and applies end-to-end training, which simplifies the model architecture and decreases the variance.
Consider a state-action pair $(s_t, a_t)$ generated by the generation policy at time $t$ and the resulting state $s_{t+1}$ at time $t+1$. The inverse model accepts $(s_t, s_{t+1})$ and outputs a predicted action $\hat{a}_t$, which may result in $s_{t+1}$ at time $t+1$. Then $(s_t, \hat{a}_t)$ is sent to the forward model to predict the next state $\hat{s}_{t+1}$.
Denote by $g_{\phi}$ the inverse model parameterized by a DNN with parameters $\phi$; $g_{\phi}$ is trained to optimize the following objective:
$$\min_{\phi} \; \mathcal{L}_{I}\left(\hat{a}_t, a_t\right),$$
where $\mathcal{L}_{I}$ is the loss function that evaluates the difference between the predicted and real actions. For instance, $\mathcal{L}_{I}$ can be the cross-entropy if the action is discrete.
Similarly, denote by $f_{\psi}$ the forward model parameterized by a DNN with parameters $\psi$; $f_{\psi}$ is trained to minimize the following loss function:
$$\mathcal{L}_{F} = \frac{1}{2}\left\lVert \hat{s}_{t+1} - s_{t+1} \right\rVert_{2}^{2}.$$
Finally, the intrinsic reward is defined as:
$$r^{I}_{t} = \frac{1}{2}\left\lVert \hat{s}_{t+1} - s_{t+1} \right\rVert_{2}^{2}.$$
Through the inverse model and the forward model, the transition process is reconstructed; the E-ICM is actually trained to learn the underlying transition model of the MDP. If the E-ICM is well-trained, all the familiar transition processes will be fully predictable and produce small prediction errors. Accordingly, if a state-action pair produces a large prediction error, it can be considered to lie outside the explored region. This signals unseen search space, which motivates the policy to make further exploration.
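The curiosity signal can be sketched end-to-end with toy stand-ins for the two models; the deliberately imperfect forward model below is our assumption, standing in for an under-trained DNN:

```python
import numpy as np

# Toy stand-ins for the E-ICM networks:
inverse_model = lambda s, s_next: s_next - s          # predicted action a_hat
forward_model = lambda s, a_hat: s + 0.5 * a_hat      # imperfect next-state model

def eicm_intrinsic_reward(s, s_next):
    """Forward-model prediction error as the curiosity reward: the inverse
    model infers the action, the forward model replays it, and the mismatch
    with the real next state is the intrinsic reward."""
    a_hat = inverse_model(s, s_next)
    s_hat = forward_model(s, a_hat)
    diff = s_hat - s_next
    return 0.5 * float(diff @ diff)

s = np.array([0.0])
r_familiar = eicm_intrinsic_reward(s, np.array([0.0]))  # no transition: zero error
r_novel    = eicm_intrinsic_reward(s, np.array([2.0]))  # large jump: high curiosity
```

Transitions the models can already explain contribute nothing, while poorly modeled transitions yield positive reward and pull the policy toward unexplored regions.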
V-C Hybrid Reward
So far the extrinsic reward block and the intrinsic reward block have been obtained, and they form the hybrid reward as follows:
$$r_{t} = \lambda_{t} \, r^{E}_{t} + \left(1 - \lambda_{t}\right) r^{I}_{t},$$
where $\lambda_{t}$ is a decaying scalar that weights the extrinsic reward against the intrinsic reward.
At the beginning of training, the importance of the extrinsic reward is higher than that of the intrinsic reward, which drives the generation policy to rapidly approach the expert policy. Once the intrinsic term gains higher importance, the generation policy turns to more exploration rather than imitation.
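This schedule can be sketched with an exponentially decaying weight; the decay rate and the convex-combination form are illustrative assumptions, not the paper's exact schedule:

```python
def hybrid_reward(r_ext, r_int, step, decay=0.999):
    """Hybrid reward with a decaying weight: imitation (extrinsic)
    dominates early in training, exploration (intrinsic) later."""
    lam = decay ** step
    return lam * r_ext + (1.0 - lam) * r_int

early = hybrid_reward(r_ext=1.0, r_int=0.0, step=0)       # pure imitation
late  = hybrid_reward(r_ext=1.0, r_int=0.0, step=10_000)  # weight has decayed
```

The same extrinsic signal thus contributes almost fully at step 0 and almost nothing after many steps, handing control over to the curiosity term.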
-  M. Iacoboni, R. P. Woods, M. Brass, H. Bekkering, J. C. Mazziotta, and G. Rizzolatti, “Cortical mechanisms of human imitation,” science, vol. 286, no. 5449, pp. 2526–2528, 1999.
-  A. N. Meltzoff and M. K. Moore, “Imitation of facial and manual gestures by human neonates,” Science, vol. 198, no. 4312, pp. 75–78, 1977.
-  A. Y. Ng, S. J. Russell, et al., “Algorithms for inverse reinforcement learning.,” in Icml, vol. 1, p. 2, 2000.
-  A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–35, 2017.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
-  Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations,” arXiv preprint arXiv:1904.06387, 2019.
-  X. Yu, Y. Lyu, and I. Tsang, "Intrinsic reward driven imitation learning via generative model," in International Conference on Machine Learning, pp. 10925–10935, PMLR, 2020.
-  D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in , pp. 16–17, 2017.
-  R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics 8th ed. Pearson, 2019.
-  S. Huang, B. Yang, H. Chen, H. Piao, Z. Sun, and Y. Chang, “Ma-trex: Mutli-agent trajectory-ranked reward extrapolation via inverse reinforcement learning,” in International Conference on Knowledge Science, Engineering and Management, pp. 3–14, Springer, 2020.
-  D. S. Brown, W. Goo, and S. Niekum, “Better-than-demonstrator imitation learning via automatically-ranked demonstrations,” in Conference on Robot Learning, pp. 330–359, PMLR, 2020.
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
-  J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in neural information processing systems, pp. 4565–4573, 2016.
-  X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine, “Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow,” arXiv preprint arXiv:1810.00821, 2018.
-  J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” arXiv preprint arXiv:1710.11248, 2017.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
-  B. D. Ziebart, “Modeling purposeful adaptive behavior with the principle of maximum causal entropy,” 2010.
-  M. Bloem and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” in 53rd IEEE Conference on Decision and Control, pp. 4911–4916, IEEE, 2014.
-  T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” arXiv preprint arXiv:1702.08165, 2017.
-  B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.,” in Aaai, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, pp. 2672–2680, 2014.
-  M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” arXiv preprint arXiv:1701.04862, 2017.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energy-based learning,” Predicting structured data, vol. 1, no. 0, 2006.