Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

05/29/2018
by Wen Sun, et al.

In this paper, we propose to combine imitation learning and reinforcement learning via the idea of reward shaping using an oracle. We study how a near-optimal cost-to-go oracle affects the planning horizon and demonstrate that the oracle shortens the learner's planning horizon as a function of its accuracy: a globally optimal oracle can shorten the planning horizon to one, leading to a one-step greedy Markov Decision Process that is much easier to optimize, while an oracle that is far from optimal requires planning over a longer horizon to achieve near-optimal performance. This insight bridges the gap between imitation learning and reinforcement learning and interpolates between the two. Motivated by it, we propose Truncated HORizon Policy Search (THOR), a method that searches for policies that maximize the total reshaped reward over a finite planning horizon when the oracle is sub-optimal. We experimentally demonstrate that a gradient-based implementation of THOR can outperform both RL and IL baselines even when the oracle is sub-optimal.
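The relationship between oracle accuracy and planning horizon can be illustrated on a toy problem. The sketch below is not the gradient-based THOR implementation described above; it is a small tabular example under stated assumptions: a hypothetical deterministic 6-state chain MDP, a reward (rather than cost) formulation, and the standard potential-based shaping form r(s, a) + gamma * V_e(s') - V_e(s), where V_e is the oracle's value estimate. All function and variable names are illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=500):
    """Optimal values for a deterministic MDP.
    P[s, a] is the next state, R[s, a] is the reward."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * V[P], axis=1)
    return V

def truncated_horizon_policy(P, R, gamma, oracle_V, k):
    """Greedy policy from k-step planning on the reshaped reward.
    Assumed shaping form: r + gamma * V_e(s') - V_e(s)."""
    shaped = R + gamma * oracle_V[P] - oracle_V[:, None]
    V = np.zeros(R.shape[0])
    for _ in range(k - 1):          # k - 1 backups, then one greedy step
        V = np.max(shaped + gamma * V[P], axis=1)
    return np.argmax(shaped + gamma * V[P], axis=1)

# Hypothetical chain: states 0..5, action 0 = left, action 1 = right.
# State 5 is an absorbing goal; entering it from state 4 pays reward 1.
n, gamma = 6, 0.9
P = np.zeros((n, 2), dtype=int)
R = np.zeros((n, 2))
for s in range(n - 1):
    P[s, 0] = max(s - 1, 0)
    P[s, 1] = s + 1
P[n - 1, :] = n - 1
R[n - 2, 1] = 1.0

V_star = value_iteration(P, R, gamma)
zero_oracle = np.zeros(n)           # an uninformative oracle

policy_optimal_oracle_k1 = truncated_horizon_policy(P, R, gamma, V_star, k=1)
policy_zero_oracle_k1 = truncated_horizon_policy(P, R, gamma, zero_oracle, k=1)
policy_zero_oracle_k5 = truncated_horizon_policy(P, R, gamma, zero_oracle, k=5)
```

In this toy setting, one-step greedy planning with the optimal oracle moves right from every non-goal state, while the uninformative oracle needs a horizon of five steps to recover the same behavior. This mirrors, in miniature, the interpolation between imitation learning (short horizon, accurate oracle) and reinforcement learning (long horizon, no oracle).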


