Improved Policy Optimization for Online Imitation Learning

We consider online imitation learning (OIL), where the task is to find a policy that imitates the behavior of an expert via active interaction with the environment. We aim to bridge the gap between the theory and practice of policy optimization algorithms for OIL by analyzing one of the most popular OIL algorithms, DAGGER. Specifically, if the class of policies is sufficiently expressive to contain the expert policy, we prove that DAGGER achieves constant regret. Unlike previous bounds that require the losses to be strongly convex, our result only requires the weaker assumption that the losses be strongly convex with respect to the policy's sufficient statistics (not its parameterization). To ensure convergence for a wider class of policies and losses, we augment DAGGER with an additional regularization term. In particular, we propose a Follow-the-Regularized-Leader (FTRL) variant for OIL, along with an adaptive version, and develop a memory-efficient implementation that matches the memory requirements of Follow-the-Leader (FTL). Assuming that the loss functions are smooth and convex with respect to the parameters of the policy, we also prove that FTRL achieves constant regret for any sufficiently expressive policy class, while retaining O(√T) regret in the worst case. We demonstrate the effectiveness of these algorithms with experiments on synthetic and high-dimensional control tasks.
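To make the DAGGER loop and the role of the regularizer concrete, here is a minimal Python sketch on a toy one-dimensional linear-control problem. The dynamics, the expert controller, and the linear policy class are illustrative assumptions, not the paper's actual setup. With a squared loss, the refit on the aggregated data depends on that data only through running sufficient statistics, which is one way the per-round memory cost can stay constant, matching FTL; setting `lam > 0` turns the FTL refit into an FTRL-style regularized refit.

```python
import numpy as np

# Toy DAGGER sketch (all names and constants here are illustrative).
# Each round: roll out the current policy, query the expert on the
# visited states, aggregate, and refit the policy on everything so far.

rng = np.random.default_rng(0)

def expert_action(s):
    return -0.8 * s  # hypothetical expert: a fixed linear controller

def rollout(theta, horizon=20):
    """Roll out the linear policy a = theta * s; return visited states."""
    s, states = rng.normal(), []
    for _ in range(horizon):
        states.append(s)
        s = 0.9 * s + theta * s + 0.1 * rng.normal()  # toy dynamics
    return np.array(states)

lam = 1.0        # quadratic regularization strength (lam = 0 recovers FTL)
A, b = 0.0, 0.0  # running sufficient statistics: sum of s^2, sum of s*a
theta = 0.0      # initial policy parameter
for t in range(100):                    # T online rounds
    states = rollout(theta)
    labels = expert_action(states)      # expert queried on-policy
    A += states @ states                # aggregate without storing the data
    b += states @ labels
    theta = b / (A + lam)               # regularized least-squares refit

print(f"learned theta = {theta:+.3f} (expert uses -0.8)")
```

As the aggregated data grows, the regularizer's influence vanishes and the learned parameter approaches the expert's; for general smooth losses the refit would be an optimization step rather than this closed form.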

Related research:

- Explaining Fast Improvement in Online Policy Optimization (07/06/2020): Online policy optimization (OPO) views policy optimization for sequentia...
- A Dynamic Regret Analysis and Adaptive Regularization Algorithm for On-Policy Robot Imitation Learning (11/06/2018): On-policy imitation learning algorithms such as Dagger evolve a robot co...
- Model-Based Imitation Learning with Accelerated Convergence (06/12/2018): Sample efficiency is critical in solving real-world reinforcement learni...
- Online Convex Optimization with Unconstrained Domains and Losses (03/07/2017): We propose an online convex optimization algorithm (RescaledExp) that ac...
- Modeling Strong and Human-Like Gameplay with KL-Regularized Search (12/14/2021): We consider the task of building strong but human-like policies in multi...
- Online Apprenticeship Learning (02/13/2021): In Apprenticeship Learning (AL), we are given a Markov Decision Process ...
- Follow the Clairvoyant: an Imitation Learning Approach to Optimal Control (11/14/2022): We consider control of dynamical systems through the lens of competitive...
