Proximal Policy Optimization and its Dynamic Version for Sequence Generation

by Yi-Lin Tuan, et al.

In sequence generation tasks, many works use policy gradient for model optimization to tackle the intractable backpropagation issue that arises when maximizing non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which has been proven to be a more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks, including a synthetic experiment and a chit-chat chatbot. The results show that PPO and PPO-dynamic outperform policy gradient in both stability and performance.
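The abstract's core idea of replacing the policy gradient objective with PPO rests on PPO's clipped surrogate loss. As a minimal sketch (not the paper's implementation; the function name, the fixed `epsilon`, and the single-token setting are illustrative assumptions, and PPO-dynamic's adaptive clipping range is not shown):

```python
def ppo_clip_loss(ratio, advantage, epsilon=0.2):
    """Clipped surrogate loss for a single sampled action.

    ratio: pi_new(a|s) / pi_old(a|s), the probability ratio.
    advantage: estimated advantage of the sampled action.
    epsilon: fixed clipping range (PPO-dynamic would adapt this).
    """
    # Clamp the ratio to [1 - epsilon, 1 + epsilon].
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # PPO maximizes the minimum of the unclipped and clipped terms,
    # so the loss to minimize is the negative of that minimum.
    return -min(ratio * advantage, clipped * advantage)

# A ratio outside the clip range with positive advantage earns no
# extra objective beyond the clipped value, limiting the update size.
print(ppo_clip_loss(1.5, 1.0))  # -> -1.2 (clipped at 1 + 0.2)
print(ppo_clip_loss(0.9, 1.0))  # -> -0.9 (inside the range)
```

Compared with a plain policy gradient update, this clipping keeps the new policy close to the sampling policy, which is the stability advantage the abstract refers to.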
