Refined Policy Improvement Bounds for MDPs

07/16/2021
by J. G. Dai, et al.

The policy improvement bound on the difference of discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound degenerates as the discount factor approaches one, which makes the applicability of TRPO and related algorithms questionable in that regime. We refine the results in <cit.> and propose a novel bound that is "continuous" in the discount factor. In particular, our bound is also applicable to MDPs with long-run average rewards.
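For context, a commonly cited form of this bound (in the spirit of the TRPO line of work; the exact constants and notation vary across references, so the following is an illustrative sketch rather than the refined statement of this paper) compares the discounted returns J(\pi) and J(\pi') of two policies:

\[
  J(\pi') - J(\pi) \;\ge\;
  \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}\!\big[A^{\pi}(s,a)\big]
  \;-\;
  \frac{2\gamma\,\epsilon}{(1-\gamma)^{2}}\,
  \mathbb{E}_{s \sim d^{\pi}}\!\Big[D_{\mathrm{TV}}\big(\pi'(\cdot \mid s)\,\big\|\,\pi(\cdot \mid s)\big)\Big],
  \qquad
  \epsilon = \max_{s}\Big|\mathbb{E}_{a \sim \pi'}\big[A^{\pi}(s,a)\big]\Big|,
\]

where d^{\pi} denotes the (normalized) discounted state-visitation distribution of \pi and A^{\pi} its advantage function. The penalty term scales like (1-\gamma)^{-2}, so the right-hand side becomes vacuous as the discount factor \gamma approaches one; this is the degeneracy referred to above, and a bound that remains meaningful in that limit is also what makes the long-run average-reward setting tractable.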

Related research

01/31/2023 · Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor
We introduce the Blackwell discount factor for Markov Decision Processes...

05/01/2022 · Processing Network Controls via Deep Reinforcement Learning
Novel advanced policy gradient (APG) algorithms, such as proximal policy...

10/10/2017 · On- and Off-Policy Monotonic Policy Improvement
Monotonic policy improvement and off-policy learning are two main desira...

02/02/2023 · Average-Constrained Policy Optimization
Reinforcement Learning (RL) with constraints is becoming an increasingly...

11/28/2022 · Some Upper Bounds on the Running Time of Policy Iteration on Deterministic MDPs
Policy Iteration (PI) is a widely used family of algorithms to compute o...

02/16/2016 · Q(λ) with Off-Policy Corrections
We propose and analyze an alternate approach to off-policy multi-step te...

10/15/2022 · When to Update Your Model: Constrained Model-based Reinforcement Learning
Designing and analyzing model-based RL (MBRL) algorithms with guaranteed...