Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization
This paper proposes a first-order gradient reinforcement learning algorithm that can be seen as a variant of Trust Region Policy Optimization (TRPO). The method, which we call policy optimization with penalized point probability distance (POP3D), retains nearly all of the favorable properties of proximal policy optimization (PPO), such as easy implementation, fast learning, and high score capability. Like PPO, we use a single unconstrained surrogate objective, in which a penalty term based on the point probability distance prevents the update step from growing too large. Experiments on 49 Atari games verify that POP3D is state-of-the-art within 40 million frame steps under two common metrics, making it a competitive alternative to PPO. Comparison experiments against PPO on MuJoCo environments further show that POP3D is also competitive in the continuous control domain. The code is released on GitHub: https://github.com/cxxgtxy/POP3D.git.
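The abstract describes the surrogate objective only at a high level. Below is a minimal, hedged sketch of what such a penalized objective could look like, assuming the point probability distance is the squared gap between the new and old policies' probabilities at the sampled action and that `beta` is a fixed penalty coefficient; both are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def pop3d_style_surrogate(new_logp, old_logp, advantages, beta=5.0):
    """Sketch of a penalized surrogate objective (to be maximized).

    Assumptions (not specified in the abstract):
      - the point probability distance is (pi_new(a|s) - pi_old(a|s))**2
        at the sampled action,
      - beta is a fixed, hypothetical penalty coefficient.
    """
    # Importance-sampling ratio pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(new_logp - old_logp)
    policy_gain = ratio * advantages
    # Penalized point probability distance (assumed form): squared difference
    # of the two policies' probabilities at the sampled action.
    pp_distance = (torch.exp(new_logp) - torch.exp(old_logp)) ** 2
    return (policy_gain - beta * pp_distance).mean()


if __name__ == "__main__":
    # Toy usage with random (negative) log-probabilities and advantages.
    new_logp = torch.randn(8).clamp(max=0.0)
    old_logp = torch.randn(8).clamp(max=0.0)
    advantages = torch.randn(8)
    loss = -pop3d_style_surrogate(new_logp, old_logp, advantages)  # negate to minimize
    print(loss.item())
```

Like PPO's clipped objective, this penalty discourages large deviations from the behavior policy while keeping the optimization a simple unconstrained first-order update.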