Variational Policy Gradient Method for Reinforcement Learning with General Utilities

07/04/2020
by Junyu Zhang, et al.

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. Since dynamic programming therefore no longer applies, we focus on direct policy search. Analogously to the Policy Gradient Theorem <cit.> available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, even though the optimization problem is nonconvex. We also establish a convergence rate of order O(1/t) by exploiting the hidden convexity of the problem, and prove that the scheme converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.
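
To make the notion of a "utility function of the state-action occupancy measure" concrete, the following is a minimal sketch for a small tabular MDP. It is not the paper's variational gradient estimator. It only computes an occupancy measure in closed form and evaluates two utilities on it: a linear one (the usual cumulative reward) and a concave one (entropy, as an exploration-style objective). The discounted, unnormalized definition of the occupancy measure, the two-state MDP, and all function and variable names are assumptions made for illustration.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma):
    """Discounted occupancy lam[s, a] = sum_t gamma^t * Pr(s_t = s, a_t = a).

    P:   transition kernel, shape (S, A, S), P[s, a, s2] = Pr(s2 | s, a)
    pi:  policy, shape (S, A), each row sums to 1
    mu0: initial state distribution, shape (S,)
    """
    S = P.shape[0]
    # State-to-state transitions under pi: P_pi[s, s2] = sum_a pi[s, a] * P[s, a, s2]
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state visitation d solves d = mu0 + gamma * P_pi^T d
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d[:, None] * pi

def cumulative_reward_utility(lam, r):
    """Linear utility <lam, r>: the standard expected discounted return."""
    return float(np.sum(lam * r))

def entropy_utility(lam, gamma):
    """Concave utility: entropy of the occupancy measure normalized to sum to 1."""
    p = np.clip(lam * (1.0 - gamma), 1e-12, None)
    return float(-np.sum(p * np.log(p)))

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.5, 0.5], [0.3, 0.7]])
mu0 = np.array([1.0, 0.0])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

lam = occupancy_measure(P, pi, mu0, gamma)
print("occupancy measure:\n", lam)
print("cumulative reward:", cumulative_reward_utility(lam, r))
print("entropy utility:  ", entropy_utility(lam, gamma))
```

In this sketch the cumulative-reward case is linear in the occupancy measure, which is why it appears as a special case of the general concave setting, whereas the entropy utility cannot be written as an expected sum of per-step rewards fixed in advance.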

research
10/03/2022

Policy Gradient for Reinforcement Learning with General Utilities

In Reinforcement Learning (RL), the goal of agents is to discover an opt...
research
02/17/2021

On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

Policy gradient gives rise to a rich class of reinforcement learning (RL...
research
03/07/2020

Convergence of Q-value in case of Gaussian rewards

In this paper, as a study of reinforcement learning, we converge the Q f...
research
07/25/2023

Submodular Reinforcement Learning

In reinforcement learning (RL), rewards of states are typically consider...
research
07/03/2023

Monte Carlo Policy Gradient Method for Binary Optimization

Binary optimization has a wide range of applications in combinatorial op...
research
11/28/2022

Quantile Constrained Reinforcement Learning: A Reinforcement Learning Framework Constraining Outage Probability

Constrained reinforcement learning (RL) is an area of RL whose objective...
research
01/21/2022

Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

We develop a new measure of the exploration/exploitation trade-off in in...
