Explaining Fast Improvement in Online Policy Optimization

07/06/2020
by Xinyan Yan, et al.

Online policy optimization (OPO) views policy optimization for sequential decision making as an online learning problem. In this framework, the algorithm designer defines a sequence of online loss functions such that the regret rate in online learning implies the policy convergence rate, while the minimal loss achievable by the policy class determines the policy's performance bias. This reduction has been applied successfully to a range of policy optimization problems, including imitation learning, structured prediction, and system identification. Interestingly, the policy improvement observed in practice is usually much faster than existing theory suggests. In this work, we provide an explanation of this fast policy improvement phenomenon. Let ϵ denote the policy class bias and suppose the online loss functions are convex, smooth, and non-negative. We prove that, after N rounds of OPO with stochastic feedback, the policy converges at a rate of Õ(1/N + √(ϵ/N)), both in expectation and with high probability. In other words, adopting a sufficiently expressive policy class in OPO yields two benefits: as the policy class becomes richer, the performance bias ϵ decreases and the convergence rate improves. We further verify this new theoretical insight in an online imitation learning experiment.
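To make the reduction concrete, below is a minimal sketch of an OPO-style loop in an online-imitation-learning setting: each round defines a convex, smooth, non-negative loss on the states visited by the current policy, and the policy is updated with one online gradient step. This is an illustrative assumption of how such a loop might look, not the paper's implementation; the names `env_rollout`, `expert_action`, and `LinearPolicy`, along with the learning-rate schedule, are hypothetical placeholders to be supplied by the user.

```python
import numpy as np

class LinearPolicy:
    """Linear policy pi(s) = W s; a stand-in for any parametric policy class."""
    def __init__(self, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((action_dim, state_dim))

    def act(self, state):
        return self.W @ state

def online_policy_optimization(env_rollout, expert_action, policy, rounds=100, lr=0.1):
    """Hypothetical OPO loop: each round n defines an online loss (here, a squared
    imitation loss on states visited by the current policy) and takes one
    online-gradient-descent step on the policy parameters."""
    avg_losses = []
    total_loss = 0.0
    for n in range(1, rounds + 1):
        states = env_rollout(policy)          # states visited by the current policy
        grad = np.zeros_like(policy.W)
        loss_n = 0.0
        for s in states:
            err = policy.act(s) - expert_action(s)
            loss_n += 0.5 * float(err @ err)  # convex, smooth, non-negative per-state loss
            grad += np.outer(err, s)          # gradient of the squared loss w.r.t. W
        loss_n /= len(states)
        grad /= len(states)
        policy.W -= (lr / np.sqrt(n)) * grad  # online gradient descent step
        total_loss += loss_n
        avg_losses.append(total_loss / n)     # running average loss tracks the regret rate
    return avg_losses
```

In this sketch, the decay of the running average loss with the number of rounds N plays the role of the regret rate in the reduction, while the best average loss attainable within the policy class corresponds to the bias term ϵ in the abstract.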


Related research

Convergence of Value Aggregation for Imitation Learning (01/22/2018)
Value aggregation is a general framework for solving imitation learning ...

Improved Policy Optimization for Online Imitation Learning (07/29/2022)
We consider online imitation learning (OIL), where the task is to find a...

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (11/02/2010)
Sequential prediction problems such as imitation learning, where future ...

Predictor-Corrector Policy Optimization (10/15/2018)
We present a predictor-corrector framework, called PicCoLO, that can tra...

Continuous Online Learning and New Insights to Online Imitation Learning (12/03/2019)
Online learning is a powerful tool for analyzing iterative algorithms. H...

Quantile Filtered Imitation Learning (12/02/2021)
We introduce quantile filtered imitation learning (QFIL), a novel policy...

The Choice Function Framework for Online Policy Improvement (10/01/2019)
There are notable examples of online search improving over hand-coded or...
