An Adaptive Clipping Approach for Proximal Policy Optimization

04/17/2018
by Gang Chen, et al.
Proximal policy optimization (PPO) algorithms were recently proposed as first-order optimization methods for effective reinforcement learning. Although PPO is inspired by the same learning theory that justifies trust region policy optimization (TRPO), it substantially simplifies algorithm design and improves data efficiency by performing multiple epochs of clipped policy optimization on sampled data. Clipping in PPO is an important new mechanism for efficient and reliable policy updates, but it may fail to adaptively improve learning performance in accordance with the importance of each sampled state. To address this issue, this paper proposes a new surrogate learning objective featuring an adaptive clipping mechanism, which leads to a new algorithm, PPO-λ. PPO-λ optimizes policies repeatedly based on a theoretical target for adaptive policy improvement. Meanwhile, destructively large policy updates are prevented through both clipping and adaptive control of the hyperparameter λ, ensuring high learning reliability. PPO-λ retains the simple and efficient design of PPO. Empirically, on several Atari game-playing tasks and benchmark control tasks, PPO-λ achieved clearly better performance than PPO.
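
For readers unfamiliar with the clipping the abstract refers to, the following is a minimal sketch of the standard PPO clipped surrogate objective, not the PPO-λ objective, whose form the abstract does not give. The function name `clipped_surrogate`, its argument names, and the default `epsilon=0.2` are assumptions made for illustration. Note that the same clipping range is applied to every sample regardless of how important its state is, which is the limitation the abstract describes.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean of the standard PPO clipped surrogate over a batch of samples.

    ratios:     probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    advantages: advantage estimates, one per sample
    epsilon:    clipping range (0.2 is an assumed, commonly used value)
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("at least one sample is required")

    total = 0.0
    for r, adv in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Take the pessimistic (smaller) of the unclipped and clipped terms,
        # so moving the ratio further outside the interval earns no extra gain.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)


# Example: a ratio of 1.5 with a positive advantage is capped at 1 + epsilon.
objective = clipped_surrogate([1.5, 0.5], [1.0, -1.0])
```

In the example, the first sample contributes its clipped term (the ratio is capped at 1.2) and the second contributes the clipped term as well (the ratio is raised to 0.8), so neither sample rewards moving the policy further from the old one.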

research
06/14/2020

Optimistic Distributionally Robust Policy Optimization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization...
research
12/16/2018

A Logarithmic Barrier Method For Proximal Policy Optimization

Proximal policy optimization (PPO) has been proposed as a first-order opt...
research
10/29/2021

Generalized Proximal Policy Optimization with Sample Reuse

In real-world decision making tasks, it is critical for data-driven rein...
research
01/31/2022

You May Not Need Ratio Clipping in PPO

Proximal Policy Optimization (PPO) methods learn a policy by iteratively...
research
01/11/2022

Learning Robust Policies for Generalized Debris Capture with an Automated Tether-Net System

Tether-net launched from a chaser spacecraft provides a promising method...
research
05/29/2018

Supervised Policy Update

We propose a new sample-efficient methodology, called Supervised Policy ...
research
01/29/2019

Trust Region-Guided Proximal Policy Optimization

Model-free reinforcement learning relies heavily on a safe yet explorato...
