Average-Reward Reinforcement Learning with Trust Region Methods

06/07/2021
by Xiaoteng Ma, et al.

Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks, such as finance-related problems, many engineering problems weight future rewards equally and therefore prefer a long-run average criterion. In this paper, we study the reinforcement learning problem under the long-run average criterion. First, we develop a unified trust region theory that covers both the discounted and average criteria. For the average criterion, we derive a novel performance bound within the trust region using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach with the average criterion, and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, we conduct experiments in the MuJoCo continuous control environments. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.
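To make the contrast between the two criteria concrete, here is a minimal Python sketch. It is not taken from the paper: the function names, the reward sequences, and the discount factor 0.9 are all hypothetical choices for illustration, and the average is computed over a finite trajectory as a stand-in for the long-run limit.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def average_reward(rewards):
    """Empirical average reward: every time step carries equal weight."""
    return sum(rewards) / len(rewards)


# Hypothetical trajectories with the same total reward,
# collected early in one case and late in the other.
early = [1.0] * 5 + [0.0] * 5
late = [0.0] * 5 + [1.0] * 5

gamma = 0.9  # assumed discount factor, chosen only for illustration

print("discounted:", discounted_return(early, gamma), discounted_return(late, gamma))
print("average:   ", average_reward(early), average_reward(late))
```

Under these assumptions the discounted criterion ranks the early-reward trajectory above the late-reward one, while the average criterion scores them identically, which is the sense in which the average criterion treats future rewards equally.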

research
06/14/2021

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

We develop theory and algorithms for average-reward on-policy Reinforcem...
research
05/24/2023

Inverse Reinforcement Learning with the Average Reward Criterion

We study the problem of Inverse Reinforcement Learning (IRL) with an ave...
research
12/26/2019

Quasi-Newton Trust Region Policy Optimization

We propose a trust region method for policy optimization that employs Qu...
research
12/06/2022

First-order perturbation theory of trust-region subproblems

Trust-region subproblem (TRS) is an important problem arising in many ap...
research
02/02/2023

Average-Constrained Policy Optimization

Reinforcement Learning (RL) with constraints is becoming an increasingly...
research
06/15/2022

Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning

Keeping risk under control is often more crucial than maximizing expecte...
research
01/18/2019

On-Policy Trust Region Policy Optimisation with Replay Buffers

Building upon the recent success of deep reinforcement learning methods,...
