Imitation learning based on entropy-regularized forward and inverse reinforcement learning

by Eiji Uchibe et al.

This paper proposes Entropy-Regularized Imitation Learning (ERIL), which combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. ERIL minimizes the reverse Kullback-Leibler (KL) divergence between two probability distributions induced by a learner and an expert. The inverse RL step in ERIL evaluates the log-ratio between the two distributions using the density-ratio trick, which is widely used in generative adversarial networks. More specifically, the log-ratio is estimated by training two binary discriminators. The first discriminator is a state-only function that tries to distinguish states generated by the forward RL step from the expert's states. The second discriminator is a function of the current state, the action, and the transitioned state, and it distinguishes the generated experiences from those provided by the expert. Because the second discriminator shares its hyperparameters with the forward RL step, they can be used to control the discriminator's capacity. The forward RL step minimizes the reverse KL divergence estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy under entropy regularization; consequently, the new policy is derived from an algorithm that resembles Dynamic Policy Programming and Soft Actor-Critic. Our experiments on MuJoCo-simulated environments show that ERIL is more sample-efficient than previous adversarial imitation-learning methods. We further apply the method to human behavior in a pole-balancing task and show that the estimated reward functions capture how each subject achieves the goal.
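The density-ratio trick mentioned in the abstract can be illustrated in isolation: a binary discriminator trained to separate expert samples (label 1) from learner samples (label 0) has, at optimality, a logit equal to the log-ratio log p_expert(x) − p_learner(x)'s density counterpart, log p_expert(x)/p_learner(x). The sketch below is an illustrative toy, not the paper's implementation; the Gaussian distributions, learning rate, and sample sizes are assumptions chosen so that the true log-ratio is known in closed form.

```python
import numpy as np

# Toy density-ratio trick (hedged sketch, not ERIL's actual code):
# a logistic discriminator's logit estimates log p_expert(x) / p_learner(x).

rng = np.random.default_rng(0)

# Two 1-D Gaussians stand in for the expert and learner state distributions.
# For N(1,1) vs N(0,1) the true log-ratio is exactly x - 0.5.
expert = rng.normal(loc=1.0, scale=1.0, size=5000)
learner = rng.normal(loc=0.0, scale=1.0, size=5000)

x = np.concatenate([expert, learner])
y = np.concatenate([np.ones(5000), np.zeros(5000)])  # 1 = expert, 0 = learner

# Minimal logistic-regression discriminator trained by gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    logits = w * x + b
    p = 1.0 / (1.0 + np.exp(-logits))   # P(expert | x)
    w -= 0.5 * np.mean((p - y) * x)     # gradient of binary cross-entropy
    b -= 0.5 * np.mean(p - y)

# The fitted logit w*x + b should recover slope ~1 and intercept ~-0.5,
# i.e. the analytic log-ratio x - 0.5.
print(f"estimated log-ratio: {w:.2f} * x + {b:.2f}")
```

In ERIL's setting the same idea is applied with neural-network discriminators over states (first discriminator) and over state-action-next-state triples (second discriminator), and the estimated log-ratio supplies the reverse-KL objective that the forward RL step then minimizes.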




Related papers:

- Model-Based Imitation Learning Using Entropy Regularization of Model and Policy
- Discriminator Soft Actor Critic without Extrinsic Rewards
- Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization
- Weighted Maximum Entropy Inverse Reinforcement Learning
- A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models
- Regularized Inverse Reinforcement Learning
- Uniqueness and Complexity of Inverse MDP Models
