Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO

10/26/2021
by   Hsuan-Yu Yao, et al.
7

Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-clip has not been justified via theoretical proof up to date. This paper proposes to rethink policy optimization and reinterpret the theory of PPO-clip based on hinge policy optimization (HPO), called to improve policy by hinge loss in this paper. Specifically, we first identify sufficient conditions of state-wise policy improvement and then rethink policy update as solving a large-margin classification problem with hinge loss. By leveraging various types of classifiers, the proposed design opens up a whole new family of policy-based algorithms, including the PPO-clip as a special case. Based on this construct, we prove that these algorithms asymptotically attain a globally optimal policy. To our knowledge, this is the first ever that can prove global convergence to an optimal policy for a variant of PPO-clip. We corroborate the performance of a variety of HPO algorithms through experiments and an ablation study.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/25/2019

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Proximal policy optimization and trust region policy optimization (PPO a...
research
03/13/2020

Taylor Expansion Policy Optimization

In this work, we investigate the application of Taylor expansions in rei...
research
07/19/2021

Reward-Weighted Regression Converges to a Global Optimum

Reward-Weighted Regression (RWR) belongs to a family of widely known ite...
research
02/10/2018

Beyond the One Step Greedy Approach in Reinforcement Learning

The famous Policy Iteration algorithm alternates between policy improvem...
research
07/15/2023

Evaluation of Deep Reinforcement Learning Algorithms for Portfolio Optimisation

We evaluate benchmark deep reinforcement learning (DRL) algorithms on th...
research
04/24/2019

Towards Combining On-Off-Policy Methods for Real-World Applications

In this paper, we point out a fundamental property of the objective in r...
research
05/18/2022

A2C is a special case of PPO

Advantage Actor-critic (A2C) and Proximal Policy Optimization (PPO) are ...

Please sign up or login with your details

Forgot password? Click here to reset