I Introduction
Imitation runs through the growth and learning of human beings from beginning to end [1]. In early childhood, we imitate the facial and manual gestures of others, and we imitate language as we learn to speak [2]. Inverse reinforcement learning (IRL) and imitation learning (IL) formulate such imitation behavior as recovering a policy from the demonstrations of an expert [3][4]. IRL attempts to infer the reward function that explains the observed behavior. For instance, a sophisticated driver can be thought to hold some secrets that lead him to drive well; IRL formulates such secrets as an explicit reward function, through which the driver can get feedback and adjust his driving policy. Based on the inferred reward function, the policy of the expert can be obtained via reinforcement learning (RL) [5]. As for IL, it has several optional methods, such as IRL and supervised learning (SL)
[6]. However, the policy recovered via IL and IRL is usually suboptimal, and it is intractable to outperform the demonstrators [7]. This is mainly due to two challenges. On the one hand, it is difficult to provide high-quality demonstrations for the agent considering the completeness of the sampling. This keeps the demonstration level below the actual expert level, while the optimal level is likely to be much higher than the expert level. On the other hand, IRL only aims to find a reward function that justifies the demonstrations, without making any further exploration to improve the policy [8]. This critical flaw has caused both IRL and IL to be underestimated for a long time.

In this paper, we investigate the IRL problem of extrapolating a beyond-demonstrator (BD) policy via IRL. With the wide application of RL, its dilemma is also increasingly obvious. In RL, the reward signal characterizes the optimization objective, which determines the final pattern of the learned policy [5]. But the reward signal in most realistic scenarios is sparse or missing, and the design of the reward function lacks a strict theoretical basis [9]. Moreover, the tremendous computations and interactions of RL become more intractable as task complexity grows. Therefore, if IRL can not only imitate but also outperform the policy of the expert, it will provide a more economical and efficient way to build expert systems.
Some pioneering work has been achieved in the research of beyond-demonstrator IRL. Brown et al. investigate why traditional IRL cannot outperform the expert, and propose a trajectory-ranked reward extrapolation (T-REX) framework [7]. T-REX first sorts the sampled trajectories according to the cumulative reward of each trajectory. Then T-REX forms a reward function parameterized by a deep neural network (DNN). Finally, this network is trained to make the ranking relation hold. This method can be considered as following the pattern of the maximum likelihood method (MLM), which infers the reward function that explains the observed and ranked trajectories [10]. T-REX explores the potential reward space to provide a high-quality reward function, so as to learn advantageous policies. The simulation results demonstrate that T-REX outperforms the state-of-the-art (SOTA) IRL and IL methods. Huang et al. extend T-REX to the multi-agent task and propose MA-TREX [11]. Based on T-REX, Brown et al. first study the sufficient conditions that impel such methods to successfully extrapolate beyond the performance of the demonstrator [12]. They then propose disturbance-based reward extrapolation (D-REX), which is also a ranking-based IL method, but D-REX can automatically generate the ranked demonstrations by injecting noise during policy learning. Moreover, D-REX completely removes the additional supervision, so it can be performed even when there are no labeled demonstrations. Both T-REX and D-REX require a large number of demonstrations, but it is difficult to guarantee the completeness of the sampling. Yu et al. propose a generative intrinsic reward driven imitation learning (GIRIL) framework, which takes a one-life demonstration to learn a family of intrinsic reward functions [8]. GIRIL first introduces to IL the concept of the intrinsic reward, which was originally applied in RL [9]. GIRIL uses a variational autoencoder (VAE) as the model base, which contains an encoder and a decoder
[13]. The encoder accepts a state and the corresponding next-state and outputs a latent variable as the encoding information. The latent variable is then packed with the state and sent to the decoder to predict the next-state. Through the VAE, GIRIL completes the reconstruction of the state transition and learns the transition model of the task. Finally, the difference between the real next-state and the predicted next-state serves as the intrinsic reward function. Using the prediction error as the reward function is interpreted as the curiosity of the agent, which realizes extensive and comprehensive exploration. Thus, GIRIL can not only imitate the expert policy but also extrapolate a beyond-demonstrator policy.

Considering the detailed learning procedures, these works follow the same pattern, which can be summarized as "two-stage" algorithms: the algorithm first learns a reward function, then recovers the policy via RL methods. In contrast, we summarize the algorithms that straightforwardly learn the policies as "one-stage" algorithms, such as generative adversarial imitation learning (GAIL) and variational adversarial imitation learning (VAIL) [14][15]. All of these beyond-demonstrator methods are two-stage, because they realize the transcendence via a specialized reward function generated by the IRL. However, the two-stage procedure increases the computational complexity while introducing more variance. The latter RL procedure is entirely independent of the former IRL procedure; if the IRL cannot generate a high-quality reward function, the final policy may not be successfully obtained. Although GIRIL reduces the number of demonstrations, it does not reduce the consequent training complexity. Compared with the two-stage algorithms, the one-stage algorithms are more efficient and robust. To the best of our knowledge, there is no one-stage algorithm that realizes beyond-demonstrator learning. In [16], Fu et al. propose a novel framework entitled adversarial inverse reinforcement learning (AIRL), which infers the reward function and learns the policy at the same time. AIRL transforms the IRL problem into a generative adversarial (GA) fashion, where a policy generates trajectories and a discriminator evaluates whether the trajectories are from the expert. Meanwhile, the score of the discriminator is set as the reward function of the policy. To maximize the reward function, the policy needs to approach the expert to get a higher score. Once the training is over, both the reward function and the policy are obtained. AIRL is a special one-stage algorithm, but it learns the policy based on the inferred reward function. Moreover, the inference of the reward function and the learning of the policy are closely related, which realizes mutual supervision and effectively reduces the variance. Based on AIRL, we argue it is feasible to inherit its basis and redesign the reward function to achieve both imitation and transcendence. If so, the efficiency of BD-IRL will be greatly promoted, and the construction of expert systems will be more convenient.

In this work, we propose a framework entitled hybrid adversarial inverse reinforcement learning (HAIRL), which is a model-free, one-stage, generative-adversarial and curiosity-driven inverse reinforcement learning algorithm. HAIRL realizes the goal of behavior imitation while extrapolating a beyond-demonstrator policy via the GA fashion and a curiosity module. The simulation results show that HAIRL outperforms the current similar SOTA algorithms. Our main contributions can be summarized as follows:

We first analyze the flaws of existing IRL and IL algorithms and their underlying causes, along with the research tendency of IRL. Then we review and classify the existing work on BD-IRL, and discuss the feasibility of building beyond-demonstrator, one-stage IRL algorithms.

We dive into the structures of AIRL and the intrinsic curiosity module (ICM), and make improvements on both. Based on the improved AIRL and ICM, we propose the HAIRL framework, which successfully integrates imitation and exploration into one procedure. Moreover, HAIRL has higher efficiency and lower variance.

We compare the performance of HAIRL and other algorithms on multiple environments within OpenAI Gym [17]. The simulation results show that HAIRL makes efficient extrapolation while greatly reducing the computation. Moreover, we evaluate the performance of HAIRL with different amounts of demonstrations and noise. The further experimental results demonstrate that HAIRL is an adaptive and robust framework.
II Problem Formulation
We study the inverse reinforcement learning (IRL) problem, which considers an entropy-regularized Markov decision process (ERMDP) defined below [18]:

Definition 1.
The ERMDP can be defined as a tuple (S, A, P, r, γ, ρ0), where:

S is the state space;

A is the action space;

P(s′ | s, a) is the transition probability;

r(s, a) is the reward function;

γ ∈ [0, 1) is a discount factor;

ρ0 is the initial state distribution.
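For concreteness, the tuple of Definition 1 can be carried around in code as a single record. The following is a minimal sketch; the field names and the toy two-state example are our own choices, not notation from the paper:

```python
from typing import Callable, NamedTuple, Sequence

class ERMDP(NamedTuple):
    """Entropy-regularized MDP tuple from Definition 1 (field names assumed)."""
    states: Sequence          # state space S
    actions: Sequence         # action space A
    transition: Callable      # transition probability P(s' | s, a)
    reward: Callable          # reward function r(s, a)
    gamma: float              # discount factor in [0, 1)
    rho0: Callable            # initial state distribution

# Illustrative instance: a trivial two-state chain with constant reward.
chain = ERMDP(
    states=[0, 1],
    actions=[0],
    transition=lambda s, a: 1 - s,   # deterministically flip the state
    reward=lambda s, a: 1.0,
    gamma=0.9,
    rho0=lambda: 0,
)
```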
Denote by π the policy which selects the actions in the ERMDP. In the standard RL setting, the reward function and the initial state distribution are unknown and can only be estimated through interaction with the MDP. Based on the former configurations, we can define the optimization objective of the RL.
Definition 2.
Given the Markov decision process defined above, the goal of the RL is:
(1)  π* = arg max_{π ∈ Π} E_{τ∼π} [ Σ_{t=0}^{∞} γ^t ( r(s_t, a_t) + H(π(·|s_t)) ) ]
where Π is the set of all stationary stochastic policies, τ = (s_0, a_0, s_1, a_1, …) is the trajectory generated by the policy π, and H(π(·|s_t)) = E[−log π(a_t|s_t)] is the causal entropy [19].
This optimization objective implies that the learning method should not only maximize the cumulative reward but also maximize the entropy of each output action. This randomizes the policy so that the agent can explore more state-action pairs. The following theorem elaborates the optimal policy for the ERMDP based on the policy improvement theorem in [5].
Theorem 1.
For the ERMDP, the optimal policy that maximizes Eq. 1 satisfies:
(2)  π*(a|s) = exp( Q*_soft(s, a) − V*_soft(s) )
(3)  V*_soft(s) = log ∫ exp( Q*_soft(s, a) ) da
(4)  Q*_soft(s, a) = r(s, a) + γ E_{s′∼P} [ V*_soft(s′) ]
where Q*_soft and V*_soft are the soft Q-function and soft value function defined in [20].
Proof.
The proof can be found in [20]. ∎
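To make the objective of Eq. 1 concrete, the entropy-regularized return of a single sampled trajectory can be computed directly. The sketch below is ours: the function name is assumed, and a temperature parameter `alpha` is included for generality although Eq. 1 uses a unit entropy weight:

```python
import numpy as np

def soft_return(rewards, action_probs, gamma=0.99, alpha=1.0):
    """Discounted return of one trajectory with the causal-entropy bonus:
    sum_t gamma^t * ( r_t + alpha * H(pi(.|s_t)) ).
    `action_probs[t]` is the action distribution pi(.|s_t) at step t."""
    total = 0.0
    for t, (r, p) in enumerate(zip(rewards, action_probs)):
        p = np.asarray(p, dtype=float)
        entropy = -np.sum(p * np.log(p + 1e-12))   # H(pi(.|s_t))
        total += gamma ** t * (r + alpha * entropy)
    return total
```

A uniform policy over |A| actions earns an extra alpha·log|A| bonus per discounted step, which is exactly why the objective keeps the policy stochastic and exploratory.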
Inverse reinforcement learning aims to find a reward function that explains the observed behavior [3]. Given a set of trajectories D = {τ_1, …, τ_N}, the trajectories are assumed to be drawn from an optimal policy. Then the IRL problem can be defined as:
Definition 3.
Given the ERMDP and a set of trajectories D, the goal of the IRL is:
(5)  max_θ E_{τ∼D} [ log p_θ(τ) ]
where p_θ(τ) ∝ exp( Σ_t γ^t r_θ(s_t, a_t) ) is the trajectory distribution induced by the reward, and r_θ is the parameterized reward function.
III Hybrid Adversarial Inverse Reinforcement Learning
III-A Extrinsic Reward Block
We build the HAIRL based on adversarial inverse reinforcement learning [16], which solves the IRL problem in Def. 3 in a GA fashion. In a standard GA structure, a generator aims to capture the distribution of the training data, while a discriminator estimates the probability that a sample belongs to the training data rather than to the generator [22]. In AIRL, the generator is served by a policy π parameterized by a DNN. The policy accepts states from the environment and takes corresponding actions. Then a discriminator is designed to judge whether a state-action pair is generated by the expert policy or not. Instead of directly outputting the estimated probability, the discriminator takes a special form:
(6)  D_θ(s, a) = exp( f_θ(s, a) ) / ( exp( f_θ(s, a) ) + π(a|s) )
where D_θ is the discriminator parameterized by a DNN with parameters θ, and f_θ is the learned function. Finally, the loss function of the discriminator is:
(7)  L_D(θ) = − E_{(s,a)∼τ_E} [ log D_θ(s, a) ] − E_{(s,a)∼τ_G} [ log( 1 − D_θ(s, a) ) ]
where τ_E and τ_G denote the expert and generated trajectories, respectively.
Meanwhile, the policy is set to maximize the following objective:
(8)  max_π E_{τ∼π} [ Σ_t γ^t r(s_t, a_t) ]
where r(s, a) = log D_θ(s, a) − log( 1 − D_θ(s, a) ).
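The special discriminator form and its induced reward can be sanity-checked numerically. The sketch below is ours (the function names are assumptions); it also verifies that the reward of Eq. 8 simplifies algebraically to f(s, a) − log π(a|s):

```python
import math

def airl_discriminator(f_value, pi_prob):
    """Discriminator of Eq. (6): D(s, a) = exp(f) / (exp(f) + pi(a|s))."""
    ef = math.exp(f_value)
    return ef / (ef + pi_prob)

def airl_reward(f_value, pi_prob):
    """Reward of Eq. (8): log D - log(1 - D); for this discriminator
    form it reduces to f(s, a) - log pi(a|s)."""
    d = airl_discriminator(f_value, pi_prob)
    return math.log(d) - math.log(1.0 - d)
```

Note that as D approaches 1 the reward diverges, which is exactly the overfitting issue the Wasserstein loss is later introduced to avoid.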
In each training epoch, the generation policy π_G will first be executed to generate trajectories τ_G. Then the discriminator is trained with the generated trajectories τ_G and the expert trajectories τ_E, so as to distinguish the expert data (state-action pairs) from τ_G. The discriminator estimates the probability that a state-action pair is generated by the expert policy π_E. This induces the corresponding rewards for all the state-action pairs in τ_G, which are used for the policy update via any policy optimization method. Therefore, the more similar π_G is to π_E, the higher reward π_G will obtain based on Eq. 8, and the generation policy will continuously adjust itself to approach the expert policy.

Through the discriminator, the learning process is simplified to merely simulating the expert policy. Regardless of the original extrinsic reward (e.g., speed, time, etc.) of a task, AIRL transforms the IRL problem into an RL problem with a specific reward function, and this reward function only measures the difference between the generation policy and the expert policy. This allows the agent to focus on imitation rather than exploration, which speeds up the learning process. Accordingly, this operation limits the improvement of the agent.
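The training epoch described above can be sketched as a thin driver function. Everything here is a stand-in callable; the names `rollout`, `d_step`, `p_step`, and `reward_fn` are our own, not the paper's:

```python
def airl_epoch(rollout, d_step, p_step, reward_fn, expert_batch):
    """One adversarial training epoch as described above:
    1) run the generation policy to collect state-action pairs,
    2) train the discriminator against the expert batch,
    3) relabel the generated pairs with discriminator-induced rewards,
    4) update the policy with any policy-optimization method."""
    gen_batch = rollout()                                  # step 1
    d_step(gen_batch, expert_batch)                        # step 2
    rewards = [reward_fn(s, a) for (s, a) in gen_batch]    # step 3
    p_step(gen_batch, rewards)                             # step 4
    return rewards
```

The point of the decomposition is that step 4 can be any off-the-shelf policy optimizer: only the reward relabeling in step 3 couples it to the discriminator.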
In HAIRL, the generation policy is expected to efficiently imitate the expert policy, so the discriminator for classifying the expert data is taken to serve as the extrinsic reward block. This block generates rewards for state-action pairs when executing the generation policy, which guides the generation policy to rapidly approach the expert policy. In AIRL, the discriminator takes the cross-entropy as its loss function, which induces the corresponding reward function. However, such a loss function cannot indicate the training progress and may lead to model collapse [23], and it is difficult to balance the training levels of the generator and the discriminator. Moreover, the original reward function of Eq. 8 is monotonic, and its value may approach infinity if the discriminator overfits. To address these problems, the Wasserstein distance (WD) is leveraged to serve as the loss function of the discriminator, which induces the Wasserstein GAN in [24]. For the discriminator, the loss function is formulated as:
(9)  L_D(θ) = E_{(s,a)∼τ_G} [ D_θ(s, a) ] − E_{(s,a)∼τ_E} [ D_θ(s, a) ]
Furthermore, the discriminator form of Eq. 6 is deprecated: the discriminator now directly outputs its estimation score. For the generation policy, the reward function is set as:
(10)  r_E(s, a) = D_θ(s, a)
Theorem 2.
Minimizing the loss function of Eq. 9 is consistent with the IRL objective in Def. 3, with the discriminator output D_θ serving as the reward function r_θ.
Proof.
According to Def. 3, let:
L(θ) = E_{τ∼D} [ Σ_t γ^t r_θ(s_t, a_t) ] − log Z_θ
where Z_θ normalizes the trajectory distribution. Through the Boltzmann distribution, the demonstrations can be modeled as [21][25]:
p_θ(τ) = exp( Σ_t γ^t r_θ(s_t, a_t) ) / Z_θ
where Z_θ is the partition function. Then the gradient of L(θ) can be computed as:
∂L/∂θ = E_{τ∼D} [ Σ_t γ^t ∂r_θ(s_t, a_t)/∂θ ] − E_{τ∼p_θ} [ Σ_t γ^t ∂r_θ(s_t, a_t)/∂θ ]
Let ρ_t(s, a) be the state-action marginal at time t; the above equation can be rewritten as:
∂L/∂θ = Σ_t γ^t ( E_{ρ_t^D} [ ∂r_θ(s, a)/∂θ ] − E_{ρ_t^θ} [ ∂r_θ(s, a)/∂θ ] )
Taking the derivative of the loss function Eq. 9 w.r.t. θ:
∂L_D/∂θ = E_{τ_G} [ ∂D_θ(s, a)/∂θ ] − E_{τ_E} [ ∂D_θ(s, a)/∂θ ]
As ∂L_D/∂θ = −∂L/∂θ when r_θ = D_θ, the two objectives are consistent. Moreover, because the expert term E_{τ_E}[D_θ(s, a)] is constant with respect to the generation policy, the policy only needs to maximize E_{τ_G}[D_θ(s, a)]. So the theorem is proved. ∎
We take r_E of Eq. 10 to serve as the extrinsic reward block. Assuming there is only the extrinsic reward r_E for the generation policy, it is set to maximize the following objective:
(11)  max_π E_{τ∼π} [ Σ_t γ^t r_E(s_t, a_t) ]
By maximizing Eq. 11, the WD between the expert policy and the generation policy is decreased, which means π_G becomes more and more similar to π_E.
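The Wasserstein critic update implied by Eqs. 9-11 can be sketched with plain arrays. The weight-clipping helper follows the original WGAN recipe [24] (the clip threshold 0.01 is that paper's default); the function names are our own:

```python
import numpy as np

def critic_loss(d_gen, d_exp):
    """Eq. (9): the critic is trained to score expert pairs high and
    generated pairs low, so minimizing E_gen[D] - E_exp[D] trains it to
    estimate the Wasserstein distance between the two distributions."""
    return np.mean(d_gen) - np.mean(d_exp)

def clip_weights(weights, c=0.01):
    """Keep the critic (approximately) 1-Lipschitz by clipping every
    parameter into [-c, c] after each update, as in WGAN."""
    return [np.clip(w, -c, c) for w in weights]

def policy_objective(d_gen):
    """Eq. (11): with r_E(s, a) = D(s, a), the generation policy simply
    maximizes the mean critic score of its own state-action pairs."""
    return np.mean(d_gen)
```

Because the critic score is unbounded in neither direction yet Lipschitz-constrained, it avoids both the vanishing-gradient and the divergent-reward behavior of the cross-entropy discriminator noted above.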
III-B Intrinsic Reward Block
Through the extrinsic reward, the generation policy can quickly approach the expert policy. After the generation policy finishes its "courses", it should actively explore the environment to achieve better performance than the expert policy. To realize the exploration, the intrinsic curiosity module (ICM) is introduced to serve as the intrinsic reward block, which contains a forward model and an inverse model [9]. Different from the original ICM, which encodes the state-action pairs into a feature space, we design the end-to-end ICM (EICM) shown in Fig. 1. The EICM drops the encoding procedure and applies end-to-end training, which simplifies the model architecture and decreases the variance.
Consider a state-action pair (s_t, a_t) generated by the generation policy at time t and the resulting state s_{t+1} at time t+1. The inverse model accepts (s_t, s_{t+1}) and outputs a predicted action â_t, which may result in the transition to s_{t+1}. Then â_t is sent together with s_t to the forward model to predict the next state ŝ_{t+1}.
Denote by g_ψ the inverse model parameterized by a DNN with parameters ψ; g_ψ is trained to optimize the following objective:
(12)  min_ψ L_I( â_t, a_t ),  with  â_t = g_ψ(s_t, s_{t+1})
where L_I is the loss function that evaluates the difference between the predicted and real actions. For instance, L_I can be the cross-entropy if the action space is discrete.
Similarly, denote by h_φ the forward model parameterized by a DNN with parameters φ; h_φ is trained to minimize the following loss function:
(13)  L_F(φ) = ½ ‖ h_φ(s_t, â_t) − s_{t+1} ‖²
Finally, the intrinsic reward is defined as:
(14)  r_I(s_t, a_t) = ½ ‖ ŝ_{t+1} − s_{t+1} ‖²,  with  ŝ_{t+1} = h_φ(s_t, â_t)
Through the inverse model and the forward model, the transition process is reconstructed, and the EICM is trained to learn the underlying transition model of the MDP. Assume the EICM is well-trained; then the experienced transition processes will be fully predictable, which produces little prediction error. Accordingly, if a state-action pair produces a large prediction error, it can be considered out of the learned region. Moreover, this indicates unseen search space, which motivates the policy to make further exploration.
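A toy version of the EICM with linear stand-ins for the two DNNs illustrates the data flow of Eqs. 12-14. The dimensions, weight matrices, and function names below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear stand-ins for the two DNNs (dimensions are illustrative).
S_DIM, A_DIM = 4, 2
W_inv = rng.normal(size=(2 * S_DIM, A_DIM))      # inverse model: (s_t, s_{t+1}) -> a_hat
W_fwd = rng.normal(size=(S_DIM + A_DIM, S_DIM))  # forward model: (s_t, a_hat) -> s_hat

def inverse_model(s_t, s_next):
    """Predict the action that led from s_t to s_{t+1} (target of Eq. 12)."""
    return np.concatenate([s_t, s_next]) @ W_inv

def forward_model(s_t, a_hat):
    """Predict the next state from the state and predicted action (Eq. 13)."""
    return np.concatenate([s_t, a_hat]) @ W_fwd

def intrinsic_reward(s_t, s_next):
    """Prediction error of the reconstructed transition (Eq. 14): large on
    transitions the module has not learned, i.e. unexplored regions."""
    a_hat = inverse_model(s_t, s_next)
    s_hat = forward_model(s_t, a_hat)
    return 0.5 * np.sum((s_hat - s_next) ** 2)
```

Since the reward is exactly the forward-model error, regions the EICM predicts well yield near-zero curiosity while unfamiliar transitions yield large rewards, which is the exploration signal described above.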
III-C Hybrid Reward
So far, the extrinsic reward block and the intrinsic reward block have been obtained respectively, and they form the hybrid reward as follows:
(15)  r_H(s, a) = λ r_E(s, a) + ( 1 − λ ) r_I(s, a)
where λ ∈ [0, 1] is a decaying scalar that weights the extrinsic reward against the intrinsic reward.
At the beginning of the training, the importance of r_E is higher than that of r_I, which drives the generation policy to rapidly approach the expert policy. As λ decays, the intrinsic term gains higher importance, and the generation policy turns to more exploration rather than imitation.
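The schedule can be sketched directly. The linear interpolation and the exponential decay rate below are our assumed instantiation of Eq. 15, not necessarily the paper's exact schedule:

```python
def hybrid_reward(r_ext, r_int, lam):
    """Hybrid reward of Eq. (15): lam * r_E + (1 - lam) * r_I, lam in [0, 1]."""
    return lam * r_ext + (1.0 - lam) * r_int

def decayed_lambda(epoch, lam0=1.0, decay=0.99):
    """Exponentially decaying weight (assumed schedule): imitation
    dominates early epochs, exploration dominates later ones."""
    return lam0 * decay ** epoch
```

Any monotonically decreasing schedule produces the same qualitative behavior; the decay rate trades off how long the imitation phase lasts before curiosity takes over.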
References
 [1] M. Iacoboni, R. P. Woods, M. Brass, H. Bekkering, J. C. Mazziotta, and G. Rizzolatti, “Cortical mechanisms of human imitation,” science, vol. 286, no. 5449, pp. 2526–2528, 1999.
 [2] A. N. Meltzoff and M. K. Moore, “Imitation of facial and manual gestures by human neonates,” Science, vol. 198, no. 4312, pp. 75–78, 1977.
 [3] A. Y. Ng, S. J. Russell, et al., “Algorithms for inverse reinforcement learning,” in ICML, vol. 1, p. 2, 2000.
 [4] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–35, 2017.
 [5] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.

 [6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [7] D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations,” arXiv preprint arXiv:1904.06387, 2019.
 [8] X. Yu, Y. Lyu, and I. Tsang, “Intrinsic reward driven imitation learning via generative model,” in International Conference on Machine Learning, pp. 10925–10935, PMLR, 2020.
 [9] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
 [10] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics, 8th ed. Pearson, 2019.
 [11] S. Huang, B. Yang, H. Chen, H. Piao, Z. Sun, and Y. Chang, “MA-TREX: Mutli-agent trajectory-ranked reward extrapolation via inverse reinforcement learning,” in International Conference on Knowledge Science, Engineering and Management, pp. 3–14, Springer, 2020.
 [12] D. S. Brown, W. Goo, and S. Niekum, “Better-than-demonstrator imitation learning via automatically-ranked demonstrations,” in Conference on Robot Learning, pp. 330–359, PMLR, 2020.
 [13] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [14] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in neural information processing systems, pp. 4565–4573, 2016.
 [15] X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine, “Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow,” arXiv preprint arXiv:1810.00821, 2018.
 [16] J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” arXiv preprint arXiv:1710.11248, 2017.
 [17] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
 [18] B. D. Ziebart, “Modeling purposeful adaptive behavior with the principle of maximum causal entropy,” 2010.
 [19] M. Bloem and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” in 53rd IEEE Conference on Decision and Control, pp. 4911–4916, IEEE, 2014.
 [20] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energybased policies,” arXiv preprint arXiv:1702.08165, 2017.
 [21] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.,” in Aaai, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008.
 [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680, 2014.
 [23] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” arXiv preprint arXiv:1701.04862, 2017.
 [24] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
 [25] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energybased learning,” Predicting structured data, vol. 1, no. 0, 2006.