Mixed Policy Gradient

02/23/2021
by Yang Guan, et al.

Reinforcement learning (RL) has great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven, relying on millions of iterations and large amounts of empirical data to learn a policy. Although data-driven RL may have excellent asymptotic performance, it usually converges slowly. By contrast, model-driven RL employs a differentiable transition model to improve convergence speed, with the policy gradient (PG) calculated by backpropagation through time (BPTT). However, such methods suffer from numerical instability, sensitivity to model error, and low computing efficiency, which may lead to poor policies. In this paper, a mixed policy gradient (MPG) method is proposed, which uses both empirical data and the transition model to construct the PG, so as to accelerate convergence without losing the optimality guarantee. MPG contains two types of PG: 1) the data-driven PG, which is obtained by directly calculating the derivative of the learned Q-value function with respect to actions, and 2) the model-driven PG, which is calculated using BPTT based on the model-predictive return. We unify them by revealing the correlation between the upper bound of the unified PG error and the predictive horizon, where the data-driven PG is regarded as the 0-step model-predictive return. Relying on that, MPG employs a rule-based method to adaptively adjust the weights of the data-driven and model-driven PGs. In particular, to get a more accurate PG, the weight of the data-driven PG is designed to grow over the course of learning while the weight of the model-driven PG decreases. In addition, an asynchronous learning framework is proposed to reduce the wall-clock time needed for each update iteration. Simulation results show that the MPG method achieves the best asymptotic performance and convergence speed among the compared baseline algorithms.
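The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual rule MPG uses, so the linear schedule below is purely an assumption chosen for illustration, and the names `data_driven_weight` and `mix_policy_gradients` are hypothetical. The sketch only shows a convex combination of two gradient estimates in which the data-driven weight grows and the model-driven weight shrinks as training proceeds.

```python
from typing import List, Sequence


def data_driven_weight(iteration: int, total_iterations: int) -> float:
    """Hypothetical linear schedule (an assumption, not the paper's rule).

    Returns a weight in [0, 1] that grows as training proceeds.
    """
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    fraction = iteration / total_iterations
    return min(1.0, max(0.0, fraction))


def mix_policy_gradients(
    data_pg: Sequence[float],
    model_pg: Sequence[float],
    iteration: int,
    total_iterations: int,
) -> List[float]:
    """Convex combination of a data-driven and a model-driven PG estimate.

    Both gradients are flat lists of floats with one entry per policy
    parameter; how each estimate is computed is outside this sketch.
    """
    if len(data_pg) != len(model_pg):
        raise ValueError("gradient estimates must have the same length")
    w_data = data_driven_weight(iteration, total_iterations)
    w_model = 1.0 - w_data  # model-driven weight decreases as w_data grows
    return [w_data * g_d + w_model * g_m for g_d, g_m in zip(data_pg, model_pg)]


if __name__ == "__main__":
    data_pg = [1.0, 2.0]   # placeholder data-driven gradient
    model_pg = [3.0, 4.0]  # placeholder model-driven gradient
    for step in (0, 50, 100):
        print(step, mix_policy_gradients(data_pg, model_pg, step, 100))
```

Early in training this sketch relies entirely on the model-driven estimate, and by the final iteration it relies entirely on the data-driven one. Any monotone schedule could be substituted for the linear one.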


research
06/24/2019

Ranking Policy Gradient

Sample inefficiency is a long-lasting problem in reinforcement learning ...
research
01/06/2021

Smoothed functional-based gradient algorithms for off-policy reinforcement learning

We consider the problem of control in an off-policy reinforcement learni...
research
05/15/2022

Policy Gradient Method For Robust Reinforcement Learning

This paper develops the first policy gradient method with global optimal...
research
01/08/2020

A Nonparametric Offpolicy Policy Gradient

Reinforcement learning (RL) algorithms still suffer from high sample com...
research
10/19/2022

Integrated Decision and Control for High-Level Automated Vehicles by Mixed Policy Gradient and Its Experiment Verification

Self-evolution is indispensable to realize full autonomous driving. This...
research
05/25/2020

Meta-Reinforcement Learning for Trajectory Design in Wireless UAV Networks

In this paper, the design of an optimal trajectory for an energy-constra...
research
05/11/2023

Towards Theoretical Understanding of Data-Driven Policy Refinement

This paper presents an approach for data-driven policy refinement in rei...
