A functional mirror ascent view of policy gradient methods with function approximation

08/12/2021
by Sharan Vaswani, et al.

We use functional mirror ascent to propose a general framework (referred to as FMA-PG) for designing policy gradient (PG) methods. The functional perspective distinguishes between a policy's functional representation (i.e., its sufficient statistics) and its parameterization (i.e., how these statistics are represented), and naturally leads to computationally efficient off-policy updates. For simple policy parameterizations, the FMA-PG framework ensures that the optimal policy is a fixed point of the updates. It also handles complex policy parameterizations (e.g., neural networks) while guaranteeing policy improvement. Our framework unifies several PG methods and paves the way for designing sample-efficient variants of existing methods. Moreover, it recovers important implementation heuristics (e.g., using forward vs. reverse KL divergence) in a principled way. With a softmax functional representation, FMA-PG yields a variant of TRPO with additional desirable properties, and it also suggests an improved variant of PPO, whose robustness and efficiency we demonstrate empirically on MuJoCo. We further evaluate the algorithms instantiated by FMA-PG via experiments on simple reinforcement learning problems.
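For intuition, here is a minimal sketch of a generic functional mirror ascent update of the kind the framework builds on; the notation (z for the policy's functional representation, \Phi for the mirror map, \eta for the step size, \ell_t for the surrogate) is illustrative and the paper's exact surrogate may differ:

\[
z_{t+1} = \arg\max_{z} \; \big\langle \nabla_{z} J(z_t),\, z - z_t \big\rangle \;-\; \tfrac{1}{\eta}\, D_{\Phi}(z, z_t),
\]

where z is the policy's functional representation (e.g., per-state action probabilities or softmax logits), J is the expected return, and D_{\Phi} is the Bregman divergence induced by \Phi. With a parameterization z_{\theta} (e.g., a neural network), the update can be approximated by several gradient ascent steps on the surrogate

\[
\ell_t(\theta) = \big\langle \nabla_{z} J(z_{\theta_t}),\, z_{\theta} - z_{\theta_t} \big\rangle \;-\; \tfrac{1}{\eta}\, D_{\Phi}(z_{\theta}, z_{\theta_t}),
\]

which can reuse samples collected under the previous policy (hence off-policy updates); with a softmax representation and a KL-inducing mirror map, such surrogates take a TRPO/PPO-like form.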


Related research

09/18/2019
Sample Efficient Policy Gradient Methods with Recursive Variance Reduction
Improving the sample efficiency in reinforcement learning has been a lon...

05/18/2020
Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning
Entropy augmented to reward is known to soften the greedy argmax policy ...

02/21/2018
Variational Inference for Policy Gradient
Inspired by the seminal work on Stein Variational Inference and Stein Va...

02/07/2020
Provably efficient reconstruction of policy networks
Recent research has shown that learning policies parametrized by large ...

07/17/2021
Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences
Approximate Policy Iteration (API) algorithms alternate between (approxi...

12/22/2021
An Alternate Policy Gradient Estimator for Softmax Policies
Policy gradient (PG) estimators for softmax policies are ineffective wit...

12/28/2016
Efficient iterative policy optimization
We tackle the issue of finding a good policy when the number of policy u...
