PAGAR: Imitation Learning with Protagonist Antagonist Guided Adversarial Reward

06/02/2023
by   Weichao Zhou, et al.
0

Imitation learning (IL) algorithms often rely on inverse reinforcement learning (IRL) to first learn a reward function from expert demonstrations. However, IRL can suffer from identifiability issues and there is no performance or efficiency guarantee when training with the learned reward function. In this paper, we propose Protagonist Antagonist Guided Adversarial Reward (PAGAR), a semi-supervised learning paradigm for designing rewards for policy training. PAGAR employs an iterative adversarially search for reward functions to maximize the performance gap between a protagonist policy and an antagonist policy. This allows the protagonist policy to perform well across a set of possible reward functions despite the identifiability issue. When integrated with IRL-based IL, PAGAR guarantees that the trained policy succeeds in the underlying task. Furthermore, we introduce a practical on-and-off policy approach to IL with PAGAR. This approach maximally utilizes samples from both the protagonist and antagonist policies for the optimization of policy and reward functions. Experimental results demonstrate that our algorithm achieves higher training efficiency compared to state-of-the-art IL/IRL baselines in standard settings, as well as zero-shot learning from demonstrations in transfer environments.

READ FULL TEXT

page 8

page 9

research
05/03/2020

Off-Policy Adversarial Inverse Reinforcement Learning

Adversarial Imitation Learning (AIL) is a class of algorithms in Reinfor...
research
04/07/2022

Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

The goal of imitation learning is to mimic expert behavior from demonstr...
research
06/01/2022

Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble

Inverse reinforcement learning (IRL) recovers the underlying reward func...
research
05/09/2020

Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

Dialogue policy optimization often obtains feedback until task completio...
research
04/13/2020

Imitation Learning for Fashion Style Based on Hierarchical Multimodal Representation

Fashion is a complex social phenomenon. People follow fashion styles fro...
research
04/24/2018

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling

Though impressive results have been achieved in visual captioning, the t...
research
02/21/2020

Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Bayesian reward learning from demonstrations enables rigorous safety and...

Please sign up or login with your details

Forgot password? Click here to reset