Soft policy optimization using dual-track advantage estimator

09/15/2020
by Yubo Huang, et al.

In reinforcement learning (RL), we generally want the agent to explore as many states as possible early in training and then exploit what it has learned to find the highest-return trajectory. Following this principle, we soften proximal policy optimization (PPO) by adding an entropy term and setting its temperature coefficient dynamically to balance exploration and exploitation. While maximizing the expected reward, the agent also keeps seeking alternative trajectories, which helps it avoid locally optimal policies. However, the extra randomness induced by the entropy term slows training in the early stage. To counter this, we combine the temporal-difference (TD) method with the generalized advantage estimator (GAE) and propose the dual-track advantage estimator (DTAE), which accelerates the convergence of the value functions and further improves the performance of the algorithm. Compared with other on-policy RL algorithms on the MuJoCo environments, the proposed method not only trains significantly faster but also achieves state-of-the-art cumulative return.
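The abstract names its building blocks without giving formulas, so the following is only a minimal sketch of the standard ingredients it refers to: one-step TD advantages, GAE, and a PPO clipped surrogate with an entropy bonus. It is not the paper's DTAE or its temperature schedule, since neither is specified here. The function names, the default hyperparameters (`gamma`, `lam`, `clip_eps`), and the choice to pass `temperature` in as a plain argument are all assumptions made for illustration.

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final reward.
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: discounted sum of TD residuals.

    A_t = sum_k (gamma * lam)^k * delta_{t+k}. With lam = 0 this reduces
    to the one-step TD advantage above.
    """
    deltas = td_advantages(rewards, values, gamma)
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the sum from the step after it.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def soft_clipped_objective(ratio, advantage, entropy, temperature, clip_eps=0.2):
    """Per-sample PPO clipped surrogate plus an entropy bonus.

    `temperature` weights the entropy term. How the paper adapts it during
    training is not stated in the abstract, so it is a caller-supplied
    value here (assumption).
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return surrogate + temperature * entropy
```

In this sketch, a larger `temperature` rewards higher-entropy (more exploratory) policies, and lowering it over training shifts the objective back toward the plain clipped surrogate. The two advantage functions show the two "tracks" the abstract mentions, but how DTAE actually combines them is left to the full text.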

research
12/03/2022

Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics

In constrained reinforcement learning (C-RL), an agent seeks to learn fr...
research
06/23/2023

Reinforcement Learning with Temporal-Logic-Based Causal Diagrams

We study a class of reinforcement learning (RL) tasks where the objectiv...
research
02/28/2020

Mixed Reinforcement Learning with Additive Stochastic Uncertainty

Reinforcement learning (RL) methods often rely on massive exploration da...
research
08/19/2022

Entropy Augmented Reinforcement Learning

Deep reinforcement learning has gained a lot of success with the presenc...
research
07/13/2021

Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks

We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on ...
research
05/07/2023

Truncating Trajectories in Monte Carlo Reinforcement Learning

In Reinforcement Learning (RL), an agent acts in an unknown environment ...
research
08/31/2023

Curriculum Proximal Policy Optimization with Stage-Decaying Clipping for Self-Driving at Unsignalized Intersections

Unsignalized intersections are typically considered as one of the most r...
