Hybrid Adversarial Inverse Reinforcement Learning

In this paper, we investigate the problem of inverse reinforcement learning (IRL), especially beyond-demonstrator (BD) IRL. BD-IRL aims not only to imitate the expert policy but also to extrapolate a beyond-demonstrator policy from finite expert demonstrations. Most current BD-IRL algorithms are two-stage: they first infer a reward function and then learn the policy via reinforcement learning (RL). Because of the two separate procedures, two-stage algorithms have high computational complexity and lack robustness. To overcome these flaws, we propose a BD-IRL framework entitled hybrid adversarial inverse reinforcement learning (HAIRL), which integrates imitation and exploration into one procedure. Simulation results show that HAIRL is more efficient and robust than similar state-of-the-art (SOTA) algorithms.




I Introduction

Imitation runs through human growth and learning from beginning to end [1]. In early childhood, we imitate the facial and manual gestures of others, and we imitate language as we learn to speak [2]. Inverse reinforcement learning (IRL) and imitation learning (IL) formulate such imitation as recovering a policy from expert demonstrations [3][4]. IRL attempts to infer the reward function that explains the observed behavior. For instance, a sophisticated driver is thought to hold some secret that leads him to drive well. IRL formulates such a secret as an explicit reward function, through which the driver can receive feedback and adjust his driving policy. Based on the inferred reward function, the expert policy can be obtained via reinforcement learning (RL) [5]. As for IL, it can be realized by several methods, such as IRL and supervised learning (SL) [6]. However, the policy recovered via IL or IRL is usually sub-optimal, and it is intractable to outperform the demonstrator [7]. This is mainly due to two challenges. On the one hand, it is difficult to provide high-quality demonstrations for the agent, considering the completeness of sampling; the demonstration level is therefore usually below the actual expert level, and the optimal level is likely to be much higher still. On the other hand, IRL only aims to find a reward function that justifies the demonstrations, without any further exploration to improve the policy [8]. This critical flaw has caused both IRL and IL to be underestimated for a long time.

In this paper, we investigate the IRL problem of extrapolating a beyond-demonstrator (BD) policy. With the wide application of RL, its dilemmas have become increasingly obvious. In RL, the reward signal characterizes the optimization objective, which determines the final pattern of the learned policy [5]. However, in most realistic scenarios the reward signal is sparse or missing, and the design of the reward function lacks a strict theoretical basis [9]. Moreover, the tremendous computation and interaction required by RL become more intractable as task complexity grows. Therefore, if IRL can not only imitate but also outperform the expert policy, it will provide a more economical and efficient way to build expert systems.

Some pioneering work has been achieved in the research of beyond-demonstrator IRL. Brown et al. investigate why traditional IRL cannot outperform the expert and propose the trajectory-ranked reward extrapolation (T-REX) framework [7]. T-REX first sorts the sampled trajectories according to the cumulative reward of each trajectory, then forms a reward function parameterized by a deep neural network (DNN), and finally trains this network to make the ranking relation hold. The method can be viewed as following the maximum likelihood pattern, inferring a reward function that explains the observed, ranked trajectories [10]. T-REX explores the potential reward space to provide a high-quality reward function, so as to learn advantageous policies. Simulation results demonstrate that T-REX outperforms state-of-the-art (SOTA) IRL and IL methods. Huang et al. extend T-REX to the multi-agent setting and propose MA-TREX [11]. Building on T-REX, Goo et al. first study the sufficient conditions under which such methods can successfully extrapolate beyond the performance of the demonstrator [12]. They then propose disturbance-based reward extrapolation (D-REX), another ranking-based IL method, which automatically generates ranked demonstrations by injecting noise during policy learning. Moreover, D-REX thoroughly deprecates additional supervision and can be performed even when no labeled demonstrations are available. Both T-REX and D-REX require a large number of demonstrations, but it is difficult to guarantee the completeness of sampling. Yu et al. propose the generative intrinsic reward driven imitation learning (GIRIL) framework, which takes a one-life demonstration to learn a family of intrinsic reward functions [8]. GIRIL introduces into IL the concept of the intrinsic reward, first applied in RL [9]. It uses a variational autoencoder (VAE), which contains an encoder and a decoder, as its model base [13]. The encoder accepts a state and the corresponding next state and outputs a latent variable as the encoded information; the latent variable is then packed with the state and sent to the decoder to predict the next state. Through the VAE, GIRIL reconstructs the state transition and learns the transition model of the task. Finally, the difference between the real and predicted next states serves as the intrinsic reward function. Using the prediction error as the reward is interpreted as the agent's curiosity, which realizes extensive and comprehensive exploration. Thus GIRIL can not only imitate the expert policy but also extrapolate a beyond-demonstrator policy.

Considering the detailed learning procedures, these works follow the same pattern, which can be summarized as "two-stage" algorithms: the algorithm first learns a reward function and then recovers the policy via RL. By contrast, we summarize algorithms that learn the policy directly as "one-stage" algorithms, such as generative adversarial imitation learning (GAIL) and variational adversarial imitation learning (VAIL) [14][15]. All of the beyond-demonstrator methods above are two-stage, because they realize the transcendence via a specialized reward function generated by IRL. However, the two-stage procedure increases the computational complexity while introducing more variance. The later RL procedure is entirely independent of the earlier IRL procedure; if the IRL stage cannot generate a high-quality reward function, the final policy may not be successfully obtained. Although GIRIL reduces the number of demonstrations, it does not reduce the subsequent training complexity. Compared with two-stage algorithms, one-stage algorithms are more efficient and robust. To the best of our knowledge, there is no one-stage algorithm that realizes beyond-demonstrator learning. In [16], Fu et al. propose a framework entitled adversarial inverse reinforcement learning (AIRL), which infers the reward function and learns the policy at the same time. AIRL transforms the IRL problem into a generative adversarial (GA) fashion, where a policy generates trajectories and a discriminator evaluates whether the trajectories come from the expert. Meanwhile, the score of the discriminator serves as the reward function of the policy; to maximize this reward, the policy must approach the expert to obtain higher scores. Once training is over, both the reward function and the policy are obtained. AIRL is a special one-stage algorithm in that it learns the policy based on the inferred reward function. Moreover, the inference of the reward function and the learning of the policy are closely related, which realizes mutual supervision and effectively reduces the variance. On the basis of AIRL, we argue that it is feasible to inherit its structure and redesign the reward function to achieve both imitation and transcendence. If so, the efficiency of BD-IRL will be greatly promoted, and the construction of expert systems will be more convenient.

In this work, we propose a framework entitled hybrid adversarial inverse reinforcement learning (HAIRL), which is a model-free, one-stage, generative-adversarial (GA), curiosity-driven inverse reinforcement learning algorithm. HAIRL realizes behavior imitation while extrapolating a beyond-demonstrator policy via the GA fashion and a curiosity module. Simulation results show that HAIRL outperforms similar SOTA algorithms. Our main contributions can be summarized as follows:

  • We first analyze the flaws of existing IRL and IL algorithms and their causes, along with the research tendency of IRL. We then review and classify existing work on BD-IRL and discuss the feasibility of building beyond-demonstrator, one-stage IRL algorithms.

  • We dive into the structures of AIRL and the intrinsic curiosity module (ICM) and improve both. Based on the improved AIRL and ICM, we propose the HAIRL framework, which successfully integrates imitation and exploration into one procedure. Moreover, HAIRL has higher efficiency and lower variance.

  • We compare the performance of HAIRL and other algorithms on multiple environments within OpenAI Gym [17]. The simulation results show that HAIRL makes efficient extrapolation while greatly reducing computation. Moreover, we evaluate the performance of HAIRL with different amounts of demonstrations and noise; these further experimental results demonstrate that HAIRL is an adaptive and robust framework.

The remainder of the paper is organized as follows. Section II gives the problem formulation. Section III elaborates the HAIRL framework. Section IV presents the simulation results and numerical analysis. Finally, Section V concludes the paper and discusses future prospects.

II Problem Formulation

We study the inverse reinforcement learning (IRL) problem under an entropy-regularized Markov decision process (ER-MDP), defined as follows.

Definition 1.

The ER-MDP can be defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \rho_0)$, where:

  • $\mathcal{S}$ is the state space;

  • $\mathcal{A}$ is the action space;

  • $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the transition probability;

  • $r(s, a)$ is the reward function;

  • $\gamma \in [0, 1)$ is a discount factor;

  • $\rho_0$ is the initial state distribution.

Denote by $\pi$ the policy that selects actions in the ER-MDP. In the standard reinforcement learning (RL) setting, the reward function and the initial state distribution are unknown and can only be accessed through interaction with the MDP. Based on these configurations, we can define the optimization objective of RL.

Definition 2.

Given the Markov decision process $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \rho_0)$, the goal of RL is:

$$\pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) + \mathcal{H}(\pi(\cdot \mid s_t))\big)\right] \tag{1}$$

where $\Pi$ is the set of all stationary stochastic policies, $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is the trajectory generated by the policy $\pi$, and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the causal entropy [19].

This optimization objective implies that the learning method should maximize not only the cumulative reward but also the entropy of each output action. Such an operation randomizes the policy so that the agent can explore more state-action pairs. The following theorem gives the optimal policy for the ER-MDP, based on the policy improvement theorem in [5].
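The objective in Def. 2 can be made concrete on a toy trajectory. A minimal sketch, where the rewards, the discount, and the entropy weight `alpha` are illustrative assumptions (Eq. 1 uses a unit entropy weight):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_regularized_return(rewards, action_dists, gamma=0.99, alpha=1.0):
    """Discounted return with a per-step entropy bonus:
    sum_t gamma^t * (r_t + alpha * H(pi(.|s_t)))."""
    total = 0.0
    for t, (r, dist) in enumerate(zip(rewards, action_dists)):
        total += gamma ** t * (r + alpha * entropy(dist))
    return total

# A deterministic policy earns no entropy bonus; a uniform one earns the most.
rewards = [1.0, 1.0, 1.0]
greedy = [[1.0, 0.0]] * 3
uniform = [[0.5, 0.5]] * 3
print(entropy_regularized_return(rewards, greedy))
print(entropy_regularized_return(rewards, uniform))
```

The gap between the two returns is exactly the discounted entropy bonus, which is what pushes the optimal policy toward stochasticity.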

Theorem 1.

Given the ER-MDP, define the soft action-value function and soft state-value function [20]:

$$Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{(s_{t+1}, a_{t+1}, \ldots) \sim \pi}\left[\sum_{l=1}^{\infty} \gamma^l \big(r(s_{t+l}, a_{t+l}) + \mathcal{H}(\pi(\cdot \mid s_{t+l}))\big)\right] \tag{2}$$

$$V_{\mathrm{soft}}(s_t) = \log \int_{\mathcal{A}} \exp\big(Q_{\mathrm{soft}}(s_t, a')\big)\, da' \tag{3}$$

Then the optimal policy for Eq. (1) is given by:

$$\pi^*(a_t \mid s_t) = \exp\big(Q_{\mathrm{soft}}(s_t, a_t) - V_{\mathrm{soft}}(s_t)\big) \tag{4}$$

The proof can be found in [20]. ∎
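The softmax relationship in Theorem 1 can be checked numerically. A minimal sketch, assuming a discrete action space so the integral in Eq. 3 becomes a sum (the Q-values are made-up):

```python
import math

def soft_value(q_values):
    """Soft state value V(s) = log sum_a exp(Q(s, a)),
    computed with the max-shift trick for numerical stability."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))

def soft_optimal_policy(q_values):
    """Optimal ER-MDP policy pi*(a|s) = exp(Q(s, a) - V(s)):
    a Boltzmann distribution over the soft Q-values."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]

q = [2.0, 1.0, 0.0]
pi = soft_optimal_policy(q)
print(pi)       # probabilities favour the high-Q action
print(sum(pi))  # a valid distribution: sums to 1
```

Unlike a hard argmax, every action keeps nonzero probability, which is what preserves exploration under the entropy-regularized objective.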

Inverse reinforcement learning aims to find a reward function that explains the observed behavior [3]. Given a set of trajectories $\mathcal{T}$, the trajectories are assumed to be drawn from an optimal policy. The IRL problem can then be defined as:

Definition 3.

Given the ER-MDP and a set of trajectories $\mathcal{T}$, the goal of IRL is:

$$\max_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{T}}\big[\log p_\theta(\tau)\big] \tag{5}$$

where $p_\theta(\tau) \propto \exp\big(\sum_{t=0}^{T} \gamma^t r_\theta(s_t, a_t)\big)$, and $r_\theta$ is the parameterized reward function.

Def. 3 induces a maximum likelihood problem, which aims to find a parameterized reward function that maximizes the likelihood of the trajectories $\mathcal{T}$. Moreover, the probability of a trajectory is proportional to the exponentiated cumulative reward of that trajectory [21].
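The Boltzmann weighting behind Def. 3 can be illustrated numerically. A minimal sketch, where the three trajectory returns are made-up values:

```python
import math

def trajectory_probs(returns):
    """Boltzmann weighting: p(tau) proportional to exp(R(tau)),
    normalised by the partition function Z = sum_tau exp(R(tau))."""
    m = max(returns)
    weights = [math.exp(r - m) for r in returns]  # shift by max for stability
    z = sum(weights)
    return [w / z for w in weights]

returns = [5.0, 3.0, 1.0]
probs = trajectory_probs(returns)
print(probs)  # the highest-return trajectory is the most likely
```

Note that a unit difference in return translates into a multiplicative factor of $e$ in probability, so high-return trajectories dominate the likelihood.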

III Hybrid Adversarial Inverse Reinforcement Learning

IV Experiments and Results

V Conclusion

Fig. 1: The architecture of HAIRL. (a) The calculation process of the hybrid reward. (b) The curiosity module.

III-A Extrinsic Reward Block

We build HAIRL on adversarial inverse reinforcement learning [16], which solves the IRL problem of Def. 3 in a GA fashion. In a standard GA structure, a generator aims to capture the distribution of the training data, while a discriminator estimates the probability that a sample comes from the training data rather than from the generator [22]. In AIRL, the generator is a policy $\pi$ parameterized by a deep neural network (DNN). The policy accepts states from the environment and takes corresponding actions. A discriminator is then designed to judge whether a state-action pair is generated by the expert policy. Instead of directly outputting the estimated probability, the discriminator takes a special form:

$$D_\theta(s, a) = \frac{\exp\big(f_\theta(s, a)\big)}{\exp\big(f_\theta(s, a)\big) + \pi(a \mid s)} \tag{6}$$

where $D_\theta$ is the discriminator parameterized by a DNN with parameters $\theta$, and $f_\theta$ is the learned function. Finally, the loss function of the discriminator is:

$$\mathcal{L}_D = -\mathbb{E}_{\tau_E}\big[\log D_\theta(s, a)\big] - \mathbb{E}_{\tau_g}\big[\log\big(1 - D_\theta(s, a)\big)\big] \tag{7}$$

Meanwhile, the policy is set to maximize the following objective:

$$\max_{\pi} \; \mathbb{E}_{\tau_g}\left[\sum_{t=0}^{T} \hat{r}(s_t, a_t)\right] \tag{8}$$

where $\hat{r}(s_t, a_t) = \log D_\theta(s_t, a_t) - \log\big(1 - D_\theta(s_t, a_t)\big)$ and $\tau_g$ denotes the trajectories generated by $\pi$.
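The AIRL reward admits a closed-form simplification: substituting the special discriminator of Eq. 6 into $\log D - \log(1 - D)$ cancels the shared normalizer and leaves $f_\theta(s, a) - \log \pi(a \mid s)$. A small numeric sketch with made-up values of $f_\theta$ and $\pi(a \mid s)$:

```python
import math

def airl_discriminator(f, pi_a):
    """AIRL discriminator form: D = exp(f(s,a)) / (exp(f(s,a)) + pi(a|s))."""
    return math.exp(f) / (math.exp(f) + pi_a)

def airl_reward(f, pi_a):
    """Policy reward log D - log(1 - D), computed via the discriminator.
    Algebraically this equals f(s,a) - log pi(a|s)."""
    d = airl_discriminator(f, pi_a)
    return math.log(d) - math.log(1.0 - d)

f, pi_a = 1.5, 0.3
print(airl_reward(f, pi_a))  # via the discriminator
print(f - math.log(pi_a))    # identical in closed form
```

The $-\log \pi(a \mid s)$ term is exactly the entropy bonus of the ER-MDP objective, which is why AIRL recovers an entropy-regularized optimal policy.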

In each training epoch, the generation policy $\pi_g$ is first executed to generate trajectories $\tau_g$. The discriminator is then trained with the generated trajectories $\tau_g$ and the expert trajectories $\tau_E$, so as to distinguish expert state-action pairs from those in $\tau_g$. The discriminator estimates the probability that a state-action pair is generated by the expert policy $\pi_E$. This induces the corresponding rewards for all the state-action pairs in $\tau_g$, which are used for the policy update via any policy optimization method. Therefore, the more similar $\pi_g$ is to $\pi_E$, the higher the reward $\pi_g$ obtains based on Eq. 8, and the generation policy continuously adjusts itself to approach the expert policy.

Through the discriminator, the learning process is simplified to merely imitating the expert policy. Regardless of the original extrinsic reward (e.g., speed, time, etc.) of a task, AIRL transforms the IRL problem into an RL problem with a specific reward function that only measures the difference between the generation policy and the expert policy. This allows the agent to focus on imitation rather than exploration, which speeds up learning; however, it also limits the improvement of the agent.

In HAIRL, the generation policy is expected to imitate the expert policy efficiently, so the discriminator that classifies expert data serves as the extrinsic reward block. This block generates rewards for the state-action pairs visited by the generation policy, guiding it to approach the expert policy rapidly. In AIRL, the discriminator takes the cross-entropy as its loss function, which induces the corresponding reward function. However, such a loss function cannot indicate the training progress and may lead to model collapse [23], and it is difficult to balance the training levels of the generator and the discriminator. Moreover, the original reward function in Eq. 8 is monotonic, and its value may approach infinity if the discriminator overfits. To address these problems, the Wasserstein distance (WD) is leveraged as the loss function of the discriminator, which induces the Wasserstein GAN of [24]. For the discriminator, the loss function is formulated as:

$$\mathcal{L}_D = \mathbb{E}_{\tau_g}\big[D_\theta(s, a)\big] - \mathbb{E}_{\tau_E}\big[D_\theta(s, a)\big] \tag{9}$$

Furthermore, the discriminator form of Eq. 6 is deprecated; the discriminator now directly outputs its estimated score. For the generation policy, the extrinsic reward function is set as:

$$r_e(s, a) = D_\theta(s, a) \tag{10}$$

Theorem 2.

The objective Eq. 5 can be achieved by minimizing the loss function Eq. 9 of the discriminator while maximizing the reward function Eq. 10.

Proof.

According to Def. 3, let:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \tau_E}\big[\log p_\theta(\tau)\big]$$

where $p_\theta(\tau)$ is the likelihood of trajectory $\tau$ under the parameterized reward $r_\theta$. Through the Boltzmann distribution, the demonstrations can be modeled as [21][25]:

$$p_\theta(\tau) = \frac{1}{Z_\theta} \exp\left(\sum_{t=0}^{T} \gamma^t r_\theta(s_t, a_t)\right)$$

where $Z_\theta$ is the partition function. Then the gradient of $\mathcal{L}(\theta)$ can be computed as:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{\tau_E}\left[\sum_{t=0}^{T} \gamma^t \nabla_\theta r_\theta(s_t, a_t)\right] - \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{T} \gamma^t \nabla_\theta r_\theta(s_t, a_t)\right]$$

Let $\mu_t$ be the state-action marginal at time $t$; the above equation can be rewritten as:

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{t=0}^{T} \gamma^t \Big( \mathbb{E}_{(s_t, a_t) \sim \tau_E}\big[\nabla_\theta r_\theta(s_t, a_t)\big] - \mathbb{E}_{(s_t, a_t) \sim \mu_t}\big[\nabla_\theta r_\theta(s_t, a_t)\big] \Big)$$

Taking the derivative of the loss function Eq. 9 w.r.t. $\theta$:

$$\nabla_\theta \mathcal{L}_D = \mathbb{E}_{(s, a) \sim \tau_g}\big[\nabla_\theta D_\theta(s, a)\big] - \mathbb{E}_{(s, a) \sim \tau_E}\big[\nabla_\theta D_\theta(s, a)\big]$$

As $r_\theta = D_\theta$ by Eq. 10, minimizing $\mathcal{L}_D$ ascends $\mathcal{L}(\theta)$ when the generated marginal matches $\mu_t$; thus the two objectives are consistent. Moreover, because the expert term is constant with respect to the generation policy, the policy only needs to maximize $\mathbb{E}_{\tau_g}\big[D_\theta(s, a)\big]$. So the theorem is proved. ∎

We refer to Eq. 10 as $r_e$, which serves as the extrinsic reward block. Assuming the generation policy receives only $r_e$, it is set to maximize the following objective:

$$\max_{\pi_g} \; \mathbb{E}_{\tau_g}\left[\sum_{t=0}^{T} \gamma^t r_e(s_t, a_t)\right] \tag{11}$$

By maximizing Eq. 11, the WD between the expert policy and the generation policy is decreased, which means $\pi_g$ becomes more and more similar to $\pi_E$.
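As a sanity check on the Wasserstein critic loss of Eq. 9, it can be evaluated on made-up score batches; minimizing it drives expert scores up and generated scores down. The gradient steps and the Lipschitz constraint of the WGAN critic are omitted in this sketch:

```python
def wasserstein_critic_loss(d_gen, d_exp):
    """Eq.-9-style loss: E_{tau_g}[D(s,a)] - E_{tau_E}[D(s,a)].
    Minimising it pushes the critic score up on expert pairs
    and down on generated pairs."""
    return sum(d_gen) / len(d_gen) - sum(d_exp) / len(d_exp)

# Made-up critic scores for generated and expert state-action pairs.
d_gen = [0.2, 0.1, 0.3]
d_exp = [0.8, 0.9, 0.7]
print(wasserstein_critic_loss(d_gen, d_exp))  # negative: the critic separates them
```

Unlike the cross-entropy loss of Eq. 7, this difference of means stays bounded for bounded critic outputs, which is the property HAIRL relies on to avoid the divergence of Eq. 8.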

III-B Intrinsic Reward Block

Through the extrinsic reward, the generation policy can quickly approach the expert policy. After the generation policy finishes its "courses", it should actively explore the environment to achieve better performance than the expert policy. To realize this exploration, the intrinsic curiosity module (ICM), which contains a forward model and an inverse model, is introduced as the intrinsic reward block [9]. Different from the original ICM, which encodes the state-action pairs, we design the end-to-end ICM (E-ICM) shown in Fig. 1. The E-ICM drops the encoding procedure and applies end-to-end training, which simplifies the model architecture and decreases the variance.

Consider a state-action pair $(s_t, a_t)$ generated by the generation policy at time $t$ and the resulting state $s_{t+1}$ at time $t+1$. The inverse model accepts $(s_t, s_{t+1})$ and outputs a predicted action $\hat{a}_t$ that may have caused the transition to $s_{t+1}$. Then $(s_t, \hat{a}_t)$ is sent to the forward model to predict the next state $\hat{s}_{t+1}$.

Denote by $g(s_t, s_{t+1}; \theta_I)$ the inverse model parameterized by a DNN with parameters $\theta_I$; it is trained to optimize the following objective:

$$\min_{\theta_I} L_I\big(\hat{a}_t, a_t\big), \qquad \hat{a}_t = g(s_t, s_{t+1}; \theta_I) \tag{12}$$

where $L_I$ is a loss function that evaluates the difference between the predicted and real actions. For instance, $L_I$ can be the cross-entropy if the action space is discrete.

Similarly, denote by $f(s_t, \hat{a}_t; \theta_F)$ the forward model parameterized by a DNN with parameters $\theta_F$; it is trained to minimize the following loss function:

$$L_F = \frac{1}{2}\big\|\hat{s}_{t+1} - s_{t+1}\big\|_2^2, \qquad \hat{s}_{t+1} = f(s_t, \hat{a}_t; \theta_F) \tag{13}$$
Finally, the intrinsic reward is defined as:

$$r_i(s_t, a_t) = \frac{\eta}{2}\big\|\hat{s}_{t+1} - s_{t+1}\big\|_2^2 \tag{14}$$

where $\eta > 0$ is a scaling factor.
Through the inverse and forward models, the transition process is reconstructed, and the E-ICM actually learns the underlying transition model of the MDP. If the E-ICM is well trained, all possible transitions become fully predictable and produce little prediction error. Accordingly, a state-action pair that produces a large prediction error can be considered outside the agent's experience; it signals unseen search space and motivates the policy to explore further.
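The prediction-error reward of the E-ICM can be sketched with toy stand-ins for the two networks. The rules inside `inverse_model` and `forward_model`, the weights, and the scaling factor are all illustrative assumptions, not the learned DNNs:

```python
def inverse_model(state, next_state):
    """Hypothetical stand-in for the inverse network g(s_t, s_{t+1}):
    guesses the action from the change in the first state component."""
    return 1 if next_state[0] >= state[0] else 0

def forward_model(state, action, w=0.9, b=0.1):
    """Hypothetical linear stand-in for the forward network f(s_t, a_t):
    predicts s_{t+1} from the state and the (predicted) action."""
    return [w * s + b * action for s in state]

def intrinsic_reward(pred_next, real_next, eta=0.5):
    """r_i = (eta / 2) * ||s_hat - s||^2: a large prediction error marks an
    unfamiliar transition and yields a large exploration bonus."""
    err = sum((p - r) ** 2 for p, r in zip(pred_next, real_next))
    return eta / 2.0 * err

state = [1.0, 2.0]
a_hat = inverse_model(state, [1.0, 1.9])     # inverse model proposes an action
pred = forward_model(state, a_hat)           # forward model predicts s_{t+1}
print(intrinsic_reward(pred, [1.0, 1.9]))    # familiar transition: tiny bonus
print(intrinsic_reward(pred, [5.0, -3.0]))   # surprising transition: large bonus
```

The same chaining (state pair in, predicted action, predicted next state, squared error out) is what the E-ICM trains end to end.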

III-C Hybrid Reward

So far, the extrinsic reward block and the intrinsic reward block have been obtained; together they form the hybrid reward:

$$r_h(s_t, a_t) = \lambda\, r_e(s_t, a_t) + (1 - \lambda)\, r_i(s_t, a_t) \tag{15}$$

where $\lambda$ is a decaying scalar that weights the extrinsic reward against the intrinsic reward.

At the beginning of training, the weight of the extrinsic term $r_e$ is higher than that of the intrinsic term $r_i$, which drives the generation policy to rapidly approach the expert policy. As the weight decays, the intrinsic term gains importance, and the generation policy turns to more exploration rather than imitation.
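A minimal sketch of the hybrid combination, assuming a convex combination of the two rewards with an exponentially decaying weight (the exact decay schedule is not specified above and is an illustrative choice):

```python
def hybrid_reward(r_ext, r_int, epoch, decay=0.99):
    """Hybrid reward lam * r_e + (1 - lam) * r_i with lam decaying per epoch,
    so imitation dominates early training and exploration dominates later."""
    lam = decay ** epoch
    return lam * r_ext + (1.0 - lam) * r_int

r_e, r_i = 1.0, 0.2  # made-up extrinsic and intrinsic rewards
early = hybrid_reward(r_e, r_i, epoch=0)
late = hybrid_reward(r_e, r_i, epoch=500)
print(early)  # dominated by the extrinsic (imitation) term
print(late)   # dominated by the intrinsic (exploration) term
```

Any monotone schedule for the weight works here; what matters is only that the mixture shifts smoothly from imitation to exploration.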

1:  Obtain expert trajectories $\tau_E$
2:  Initialize the generation policy network $\pi_g$, discriminator $D_\theta$, inverse model $g$ and forward model $f$ with parameters $\phi$, $\theta$, $\theta_I$ and $\theta_F$.
3:  for epoch = 1, 2, … do
4:     Execute policy $\pi_g$ and collect the trajectories $\tau_g$
5:     Train the discriminator by minimizing the loss function defined in Eq. 9 with $\tau_g$ and $\tau_E$.
6:     Train the E-ICM by minimizing the loss functions defined in Eq. 12 and Eq. 13 with $\tau_g$.
7:     Calculate the hybrid reward for each state-action pair in $\tau_g$.
8:     Update $\pi_g$ with respect to the hybrid reward using any policy optimization method.
9:  end for
Algorithm 1 Hybrid Adversarial IRL
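The control flow of Algorithm 1 can be sketched as follows. Every learned component is replaced by a stub, so the snippet only illustrates the per-epoch ordering of steps 4-8, not an actual implementation:

```python
import random

def train_hairl(expert_trajs, epochs=3, seed=0):
    """Skeleton of Algorithm 1 with stubbed components; the network updates
    are no-ops, so only the control flow is illustrated."""
    random.seed(seed)
    policy = {"params": 0.0}                       # stand-in for the policy network

    def execute_policy():                          # step 4: collect trajectories
        return [[(random.random(), random.randint(0, 1)) for _ in range(5)]]

    def discriminator_step(gen, exp):              # step 5: Eq. 9 update (stub)
        pass

    def eicm_step(gen):                            # step 6: Eq. 12 / Eq. 13 update (stub)
        pass

    def hybrid_reward(pair):                       # step 7: extrinsic + intrinsic (stub)
        return 0.0

    history = []
    for epoch in range(epochs):
        gen_trajs = execute_policy()
        discriminator_step(gen_trajs, expert_trajs)
        eicm_step(gen_trajs)
        rewards = [hybrid_reward(p) for traj in gen_trajs for p in traj]
        policy["params"] += sum(rewards)           # step 8: placeholder policy update
        history.append(len(rewards))
    return history

print(train_hairl(expert_trajs=[], epochs=3))
```

Note that the discriminator, the E-ICM, and the policy are all updated inside the same loop, which is exactly what makes HAIRL one-stage.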