Beyond the One Step Greedy Approach in Reinforcement Learning

02/10/2018
by   Yonathan Efroni, et al.
The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the evaluation stage, e.g., n-step and trace-based returns, have been analyzed in previous works. However, multiple-step lookahead policy improvement, despite growing empirical evidence of its strength, has to our knowledge not yet been carefully analyzed. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions, and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study.
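
To make the idea of multiple-step lookahead improvement concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes one common formalization: the h-step greedy policy is the one-step greedy policy with respect to the value obtained by applying the Bellman optimality backup h-1 times. The tabular representation (`P`, `R`), the function names, and the three-state chain are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: tabular policy iteration with an h-step greedy
# improvement step. All names and the toy MDP below are assumptions.

def bellman_backup(P, R, gamma, v):
    """Return q[s][a] = R[s][a] + gamma * sum_s2 P[s][a][s2] * v[s2]."""
    n_states = len(P)
    return [
        [
            R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))
            for a in range(len(P[s]))
        ]
        for s in range(n_states)
    ]


def h_step_greedy(P, R, gamma, v, h):
    """One-step greedy policy w.r.t. v after h-1 optimal Bellman backups."""
    for _ in range(h - 1):
        v = [max(q_s) for q_s in bellman_backup(P, R, gamma, v)]
    q = bellman_backup(P, R, gamma, v)
    # Ties are broken in favor of the lowest action index.
    return [max(range(len(q_s)), key=q_s.__getitem__) for q_s in q]


def evaluate(P, R, gamma, policy, sweeps=500):
    """Approximate the value of a fixed policy by repeated backups."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        q = bellman_backup(P, R, gamma, v)
        v = [q[s][policy[s]] for s in range(len(P))]
    return v


def h_step_policy_iteration(P, R, gamma, h, max_iters=100):
    """Alternate evaluation and h-step greedy improvement until stable."""
    policy = [0] * len(P)
    improvements = 0
    for _ in range(max_iters):
        v = evaluate(P, R, gamma, policy)
        new_policy = h_step_greedy(P, R, gamma, v, h)
        if new_policy == policy:
            break
        policy = new_policy
        improvements += 1
    return policy, evaluate(P, R, gamma, policy), improvements


# Hypothetical 3-state chain: action 0 stays, action 1 moves right.
# State 2 is absorbing and is the only state that pays a reward.
P = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]

policy_1, value_1, steps_1 = h_step_policy_iteration(P, R, gamma=0.9, h=1)
policy_2, value_2, steps_2 = h_step_policy_iteration(P, R, gamma=0.9, h=2)
```

With `h=1` the sketch reduces to standard Policy Iteration. On this toy chain, a larger `h` lets reward information travel further in a single improvement step, at the cost of more backups per step; whether that trade-off pays off in general depends on the problem.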

research
05/21/2018

Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

Multiple-step lookahead policies have demonstrated high empirical compet...
research
03/13/2023

Path Planning using Reinforcement Learning: A Policy Iteration Approach

With the impact of real-time processing being realized in the recent pas...
research
10/26/2021

Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO

Policy optimization is a fundamental principle for designing reinforceme...
research
07/13/2020

Structured Policy Iteration for Linear Quadratic Regulator

Linear quadratic regulator (LQR) is one of the most popular frameworks t...
research
11/15/2019

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

Off-policy policy evaluation (OPE) is the problem of estimating the onli...
research
02/29/2016

Easy Monotonic Policy Iteration

A key problem in reinforcement learning for control with general functio...
research
11/23/2021

Schedule Based Temporal Difference Algorithms

Learning the value function of a given policy from data samples is an im...
