Dual Policy Iteration

05/28/2018
by Wen Sun, et al.

Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) that is deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework that reduces the update of the non-reactive policy to model-based optimal control using learned local models, and that provides a theoretically sound way of unifying model-free and model-based RL approaches when the dynamics are unknown. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.
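To make the alternation described above concrete, here is a minimal sketch on a toy five-state chain. It is an illustration only, not the algorithm from the paper: the chain environment, the exact tabular "imitation" step, and every name in it (step, evaluate, lookahead, dual_policy_iteration) are assumptions made for this example. In particular, the dynamics are assumed known here, whereas the paper works with learned local models.

```python
# Toy illustration of alternating a reactive and a non-reactive policy.
# Everything here (environment, function names, constants) is hypothetical.
GAMMA = 0.9
N_STATES = 5
ACTIONS = (-1, +1)  # move left / move right along the chain


def step(state, action):
    """Assumed-known toy dynamics; reward 1 for arriving at the last state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def evaluate(policy, sweeps=200):
    """Iterative policy evaluation of a deterministic reactive policy."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            nxt, r = step(s, policy[s])
            values[s] = r + GAMMA * values[nxt]
    return values


def lookahead(state, values, depth):
    """Non-reactive policy: depth-limited search that uses the reactive
    policy's value estimates at the leaves (the 'guidance')."""
    def q(s, a, d):
        nxt, r = step(s, a)
        if d == 1:
            return r + GAMMA * values[nxt]
        return r + GAMMA * max(q(nxt, b, d - 1) for b in ACTIONS)
    return max(ACTIONS, key=lambda a: q(state, a, depth))


def dual_policy_iteration(rounds=5, depth=2):
    reactive = [-1] * N_STATES  # deliberately poor start: always move left
    for _ in range(rounds):
        # 1) Improve the non-reactive policy with guidance from the reactive one.
        values = evaluate(reactive)
        expert = [lookahead(s, values, depth) for s in range(N_STATES)]
        # 2) Update the reactive policy under supervision from the expert.
        #    In this tabular toy, "supervised learning" is an exact copy.
        reactive = expert
    return reactive


if __name__ == "__main__":
    print(dual_policy_iteration())
```

On this toy chain the loop should settle on moving right in every state after a couple of rounds. With a function approximator in place of the table, step 2 would become a regression or classification fit rather than an exact copy.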


