Revisit Policy Optimization in Matrix Form

09/19/2019
by Sitao Luan, et al.

In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written in closed form as V_π = (I − γP_π)^{-1} r_π, where P_π is the state transition matrix under policy π and r_π is the expected reward under π. The difficulty is that P_π and r_π both entangle π with the environment dynamics, so every update to π changes them as well. In this paper, we leverage the notation of Wang et al. (2007) to disentangle π from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem (Sutton and Barto, 2018) and TRPO (Schulman et al., 2015) can be put into this more general framework, and that the notation has good potential to be extended to model-based reinforcement learning.
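As a concrete illustration of the closed-form evaluation above, here is a minimal NumPy sketch for a small tabular MDP. The function name, array layout, and example numbers are our own choices for illustration, not notation from the paper:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Matrix-form policy evaluation: V_pi = (I - gamma * P_pi)^{-1} r_pi.

    P:  (S, A, S) transition tensor, P[s, a, s'] = Pr(s' | s, a)
    R:  (S, A) expected immediate reward for action a in state s
    pi: (S, A) stochastic policy, pi[s, a] = Pr(a | s)
    """
    # Mixing the policy with the environment dynamics yields the
    # state-to-state transition matrix P_pi and reward vector r_pi --
    # exactly the entanglement with pi described in the abstract.
    P_pi = np.einsum("sa,sat->st", pi, P)  # (S, S)
    r_pi = np.einsum("sa,sa->s", pi, R)    # (S,)
    S = P_pi.shape[0]
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Example: a 2-state, 2-action MDP evaluated under a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(evaluate_policy(P, R, pi))  # state values V_pi
```

Because every update to π changes P_π and r_π, this whole computation must be redone after each policy improvement step, which is the coupling the paper's notation is designed to expose and disentangle.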
