Policy Improvement from Multiple Experts

07/01/2020
by Ching-An Cheng, et al.

Despite its promise, reinforcement learning's real-world adoption has been hampered by its need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an expert policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal experts, which may provide conflicting advice in a state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-expert case the return of the expert's policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-expert setting. In this paper, we propose the state-wise maximum of the expert policies' values as a natural baseline to resolve conflicting advice from multiple experts. Using a reduction of policy optimization to online learning, we introduce a novel IL algorithm, MAMBA, which can provably learn a policy competitive with this benchmark. In particular, MAMBA optimizes policies by using a gradient estimator in the style of generalized advantage estimation (GAE). Our theoretical analysis shows that this design makes MAMBA robust and enables it to outperform the expert policies by a larger margin than the IL state of the art, even in the single-expert case. In an evaluation against standard policy gradient with GAE and AggreVaTeD, we showcase MAMBA's ability to leverage demonstrations from both a single weak expert and multiple weak experts, and to significantly speed up policy optimization.
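As a rough illustration of the baseline described in the abstract, the sketch below computes the state-wise maximum of several experts' value estimates along one trajectory and plugs it into the standard GAE recursion. This is not the paper's estimator: the function names (`max_baseline`, `gae_advantages`), the array layout, and the default `gamma` and `lam` values are assumptions made for illustration, and the expert values are assumed to be given rather than learned.

```python
import numpy as np


def max_baseline(expert_values):
    """State-wise maximum over experts.

    expert_values: array-like of shape (K, T + 1), where entry [k, t] is an
    (assumed given) estimate of expert k's value at the t-th visited state.
    Returns an array of shape (T + 1,).
    """
    return np.max(np.asarray(expert_values, dtype=float), axis=0)


def gae_advantages(rewards, baseline, gamma=0.99, lam=0.9):
    """Standard GAE recursion using `baseline` in place of a value function.

    rewards: length-T sequence of rewards.
    baseline: length-(T + 1) sequence of baseline values, including the
    value of the state reached after the last reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # One-step temporal-difference residuals with respect to the baseline.
    deltas = rewards + gamma * baseline[1:] - baseline[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Two hypothetical experts that disagree about which states are valuable.
values = [[1.0, 0.0, 2.0],
          [0.0, 3.0, 1.0]]
f_max = max_baseline(values)
adv = gae_advantages([1.0, 1.0], f_max, gamma=0.5, lam=1.0)
```

With `lam=0` the recursion reduces to the one-step advantage with respect to the max baseline, and with `lam=1` it sums the discounted residuals over the rest of the trajectory; how MAMBA itself weights these terms is specified in the full paper, not here.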

