Relative Policy-Transition Optimization for Fast Policy Transfer

06/13/2022
by Lei Han, et al.

We consider the problem of policy transfer between two Markov Decision Processes (MDPs). Building on existing theoretical results in reinforcement learning (RL), we introduce a lemma that measures the relativity between two arbitrary MDPs, that is, the difference between any two cumulative expected returns defined under different policies and environment dynamics. Based on this lemma, we propose two new algorithms, Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modeling, respectively. RPO uses the relative policy gradient to update a policy evaluated in one environment so that it maximizes the return in another, while RTO uses the relative transition gradient to update the parameterized dynamics model (if one exists) and reduce the gap between the dynamics of the two environments. Integrating the two algorithms yields the complete algorithm, Relative Policy-Transition Optimization (RPTO), in which the policy interacts with both environments simultaneously, so that data collection from the two environments, policy updates, and transition updates are completed in one closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on OpenAI Gym's classic control tasks by creating policy transfer problems through varied environment dynamics.
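To make the quantity in the lemma concrete, the following is a minimal sketch of the gap between two cumulative expected returns, each defined by its own policy and its own transition dynamics. It is not the paper's lemma, RPO, or RTO: it only evaluates both returns exactly on a small tabular problem and subtracts them. The two-state MDPs, the reward table, the policies, and the function names `expected_return` and `return_gap` are all hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: a hypothetical 2-state, 2-action tabular MDP pair.
# None of the names or numbers below come from the paper.

def expected_return(P, R, pi, mu, gamma=0.9, sweeps=2000):
    """Discounted expected return of policy `pi` under dynamics `P`.

    P[s][a][t] : probability of moving from state s to state t under action a
    R[s][a]    : reward for taking action a in state s
    pi[s][a]   : probability that the policy picks action a in state s
    mu[s]      : initial state distribution
    Uses plain iterative policy evaluation (repeated Bellman backups).
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [
            sum(
                pi[s][a] * (R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n)))
                for a in range(len(pi[s]))
            )
            for s in range(n)
        ]
    return sum(mu[s] * V[s] for s in range(n))


def return_gap(P_src, pi_src, P_tgt, pi_tgt, R, mu, gamma=0.9):
    """Return under (P_tgt, pi_tgt) minus return under (P_src, pi_src)."""
    return (expected_return(P_tgt, R, pi_tgt, mu, gamma)
            - expected_return(P_src, R, pi_src, mu, gamma))


# Shared reward table and initial distribution (assumed for the example).
R = [[0.0, 1.0], [1.0, 0.0]]
mu = [1.0, 0.0]

# Two environments that differ only in their transition dynamics.
P_source = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
P_target = [[[0.6, 0.4], [0.5, 0.5]], [[0.4, 0.6], [0.3, 0.7]]]

# Two different stochastic policies.
pi_source = [[0.5, 0.5], [0.5, 0.5]]
pi_target = [[0.1, 0.9], [0.8, 0.2]]

gap = return_gap(P_source, pi_source, P_target, pi_target, R, mu)
print("return gap between the two (dynamics, policy) pairs:", gap)
```

When both the dynamics and the policy coincide, the gap is zero. In this sketch the gap is found by evaluating each return separately, which is only feasible for small tabular problems with known dynamics.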
