Trust Region Policy Optimization of POMDPs

10/18/2018
by Kamyar Azizzadenesheli et al.

We propose Generalized Trust Region Policy Optimization (GTRPO), a Reinforcement Learning algorithm that extends Trust Region Policy Optimization (TRPO) to Partially Observable Markov Decision Processes (POMDPs). While policy gradient methods in principle require no model assumptions, previous studies of more sophisticated policy gradient methods are mainly limited to MDPs. Many real-world decision-making tasks, however, are inherently non-Markovian, i.e., only an incomplete representation of the environment is observable. Moreover, most of the advanced policy gradient methods are designed for infinite-horizon MDPs. Our proposed algorithm, GTRPO, is a policy gradient method for continuous episodic POMDPs. We prove that its policy updates monotonically improve the expected cumulative return. We empirically study GTRPO on many RoboSchool environments, an extension of the MuJoCo environments, and provide insights into its empirical behavior.
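
For context, a minimal sketch of the standard trust-region policy update that GTRPO generalizes; this is the textbook MDP formulation of TRPO, not the POMDP-specific objective from the paper, and the trust-region radius \delta and advantage estimate A below are assumptions of this sketch:

    \max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
        \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \,
               A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]
    \quad \text{subject to} \quad
    \mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)
        \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta.

Since the latent state s is not observed in a POMDP, GTRPO instead conditions its policies on the available observations; the exact surrogate objective and the monotonic-improvement guarantee are developed in the full paper.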
