Refined Policy Improvement Bounds for MDPs

07/16/2021
by J. G. Dai, et al.

The policy improvement bound on the difference of discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound degenerates as the discount factor approaches one, making the applicability of TRPO and related algorithms questionable when the discount factor is close to one. We refine the results in <cit.> and propose a novel bound that is "continuous" in the discount factor. In particular, our bound also applies to MDPs with long-run average rewards.
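
For context, the classical policy improvement bound from the TRPO analysis is, up to notation, roughly of the following form; this is a sketch of the standard statement, with \gamma the discount factor, \epsilon the maximum absolute advantage, and D_{\mathrm{KL}}^{\max} the worst-case per-state KL divergence between the two policies:

\[
\eta(\tilde\pi) \;\ge\; L_\pi(\tilde\pi) \;-\; \frac{4\,\epsilon\,\gamma}{(1-\gamma)^2}\, D_{\mathrm{KL}}^{\max}(\pi,\tilde\pi),
\qquad
\epsilon \;=\; \max_{s,a}\,\bigl|A_\pi(s,a)\bigr|.
\]

The (1-\gamma)^{-2} coefficient is the source of the degeneracy noted above: as \gamma \to 1 the penalty term blows up, the bound becomes vacuous, and it gives no information in the long-run average-reward limit.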
