You May Not Need Ratio Clipping in PPO

01/31/2022
by Mingfei Sun, et al.

Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data. Ratio-clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. Ratio clipping yields a pessimistic estimate of the original surrogate objective and has been shown to be crucial for strong performance. We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the ratios. Instead, one can directly optimize the original surrogate objective for multiple epochs; the key is to find a proper condition for early-stopping the optimization epochs in each iteration. Our theoretical analysis sheds light on how to determine when to stop the optimization epochs, and we call the resulting algorithm Early Stopping Policy Optimization (ESPO). We compare ESPO with PPO across many continuous control tasks and show that ESPO significantly outperforms PPO. Furthermore, we show that ESPO can be easily scaled up to distributed training with many workers, where it also delivers strong performance.
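
To make the idea concrete, below is a minimal PyTorch sketch of one ESPO-style update. The stopping rule shown (mean absolute deviation of the probability ratios from 1 against a fixed threshold) is an illustrative assumption, not necessarily the paper's exact condition, and the names espo_update, ratio_dev_threshold, and the policy.log_prob interface are hypothetical.

import torch

def espo_update(policy, optimizer, states, actions, advantages,
                old_log_probs, max_epochs=10, ratio_dev_threshold=0.25):
    # Run multiple optimization epochs of the UNCLIPPED surrogate
    # L = E[ratio * advantage] on one batch of sampled data, stopping
    # early once the ratios drift too far from 1.
    # policy.log_prob(states, actions) is an assumed interface returning
    # per-sample log-probabilities under the current policy;
    # old_log_probs were recorded under the data-collecting policy.
    stopped_at = max_epochs
    for epoch in range(max_epochs):
        log_probs = policy.log_prob(states, actions)
        ratios = torch.exp(log_probs - old_log_probs.detach())

        # Illustrative stopping rule (an assumption, not the paper's
        # exact condition): mean absolute deviation of ratios from 1.
        if (ratios - 1.0).abs().mean().item() > ratio_dev_threshold:
            stopped_at = epoch
            break

        loss = -(ratios * advantages).mean()  # negate to maximize surrogate
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return stopped_at

The design point is that the surrogate itself is left unclipped; the trust region is enforced by when optimization stops rather than by how the objective is shaped.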

Related research

02/15/2023  Trust-Region-Free Policy Optimization for Stochastic Policies
Trust Region Policy Optimization (TRPO) is an iterative method that simu...

04/17/2018  An Adaptive Clipping Approach for Proximal Policy Optimization
Very recently proximal policy optimization (PPO) algorithms have been pr...

12/04/2020  Proximal Policy Optimization Smoothed Algorithm
Proximal policy optimization (PPO) has yielded state-of-the-art results ...

12/10/2019  Imitation Learning via Off-Policy Distribution Matching
When performing imitation learning from expert demonstrations, distribut...

12/13/2022  PPO-UE: Proximal Policy Optimization via Uncertainty-Aware Exploration
Proximal Policy Optimization (PPO) is a highly popular policy-based deep...

01/31/2022  Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO
We present a new monotonic improvement guarantee for optimizing decentra...

09/23/2020  Revisiting Design Choices in Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a popular deep policy gradient alg...
