
Variational Policy Gradient Method for Reinforcement Learning with General Utilities
In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, for example in constrained problems, exploration, and acting upon prior experience. In this paper, we consider policy optimization in Markov decision problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation; since dynamic programming no longer applies, we focus on direct policy search. Analogously to the Policy Gradient Theorem <cit.> available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, even though the optimization problem is nonconvex. We also establish its rate of convergence, of order O(1/t), by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.
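To make the saddle-point construction more concrete, here is a minimal sketch of the Fenchel-dual reformulation that underlies the theorem; the notation (occupancy measure \lambda(\theta), concave conjugate \bar F) is illustrative, and the precise statement in the paper may differ. Writing the objective as R(\theta) = F(\lambda(\theta)), where \lambda(\theta) is the discounted state-action occupancy measure induced by \pi_\theta, concavity of F gives, via the concave Fenchel conjugate \bar F(z) = \inf_\lambda \{ \langle z, \lambda \rangle - F(\lambda) \},

\[ R(\theta) \;=\; \inf_{z} \big\{ \langle z, \lambda(\theta) \rangle - \bar F(z) \big\}. \]

For any fixed dual variable z, the pairing \langle z, \lambda(\theta) \rangle is a standard cumulative-reward RL objective with per-step reward z(s,a), so its \theta-gradient follows from the classical policy gradient theorem; coupling this with the minimization over z is what produces the stochastic saddle point problem referenced above.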
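As a further illustration, below is a short, self-contained Python sketch of the resulting "dual reward, then policy gradient" structure on a toy MDP. It is not the authors' algorithm: instead of solving the full stochastic saddle point, it uses the shortcut z = grad F(lambda), which is valid for smooth concave utilities, together with a plain REINFORCE estimator; the entropy utility and the transition tensor P are hypothetical stand-ins chosen only to make the sketch runnable.

import numpy as np

# Toy 2-state, 2-action MDP (hypothetical; chosen only for illustration).
nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))          # P[s, a] = next-state distribution
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.1, 0.9]

def F(lam):
    # Example general utility: entropy of the occupancy measure
    # (an exploration objective, one of the cases the abstract mentions).
    return -np.sum(lam * np.log(lam + 1e-8))

def grad_F(lam):
    return -(np.log(lam + 1e-8) + 1.0)

def policy(theta):
    # Tabular softmax policy.
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sample_episode(pi, rng, T=60):
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(nA, p=pi[s])
        traj.append((s, a))
        s = rng.choice(nS, p=P[s, a])
    return traj

def occupancy(trajs):
    # Empirical discounted state-action occupancy, truncated at horizon T.
    lam = np.zeros((nS, nA))
    for traj in trajs:
        for t, (s, a) in enumerate(traj):
            lam[s, a] += (1.0 - gamma) * gamma ** t
    return lam / len(trajs)

rng = np.random.default_rng(0)
theta = np.zeros((nS, nA))
for it in range(300):
    pi = policy(theta)
    trajs = [sample_episode(pi, rng) for _ in range(32)]
    lam = occupancy(trajs)
    z = grad_F(lam)                  # dual variable acting as a "shadow reward"
    g = np.zeros_like(theta)         # REINFORCE gradient of <z, lambda(theta)>
    for traj in trajs:
        ret = 0.0
        for t in reversed(range(len(traj))):
            s, a = traj[t]
            ret = z[s, a] + gamma * ret            # return-to-go under reward z
            glog = -pi[s].copy(); glog[a] += 1.0   # grad log softmax at (s, a)
            g[s] += gamma ** t * ret * glog
    theta += 0.1 * g / len(trajs)

print("occupancy entropy after training:", F(occupancy(trajs)))

The global O(1/t) convergence claimed in the abstract rests on the hidden convexity of the problem in the occupancy-measure parametrization; this toy sketch only illustrates the gradient structure and makes no attempt to reproduce that analysis.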