SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Search

01/30/2023
by   Gal Dalal, et al.

Despite the popularity of policy gradient methods, they are known to suffer from large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax – a generalization of softmax that takes planning into account. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We consider two variants of SoftTreeMax, one for cumulative reward and one for exponentiated reward. For both, we analyze the gradient variance and reveal for the first time the role of a tree expansion policy in mitigating this variance. We prove that the resulting variance decays exponentially with the planning horizon as a function of the expansion policy. Specifically, we show that the closer the resulting state transitions are to uniform, the faster the decay. In a practical implementation, we utilize a parallelized GPU-based simulator for fast and efficient tree search. Our differentiable tree-based policy leverages all gradients at the tree leaves in each environment step, instead of the traditional single-sample-based gradient. We then show in simulation how the gradient variance is reduced by three orders of magnitude, leading to better sample complexity than the standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance with a faster run time than distributed PPO. Lastly, we demonstrate that high reward correlates with lower variance.
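To make the idea concrete, the sketch below computes SoftTreeMax-style action probabilities for a single state by exhaustively expanding a depth-d tree, accumulating discounted rewards along each path, and adding a discounted leaf score. This is a minimal illustration, not the paper's implementation: it assumes a deterministic single-step simulator `step(s, a) -> (s_next, r)`, a placeholder leaf-scoring function `leaf_logit(s)`, and a log-sum-exp aggregation over paths (loosely mirroring the exponentiated-reward variant); the paper's GPU-parallel tree expansion and exact aggregation for each variant differ.

```python
import numpy as np

def softtreemax_probs(s0, step, leaf_logit, actions, depth, gamma, beta=1.0):
    """Hedged sketch of a SoftTreeMax-style policy for one state.

    For each root action, expand all action sequences of length `depth`,
    sum the discounted rewards along each path, add gamma**depth times the
    leaf logit, aggregate paths with log-sum-exp, and softmax over root actions.
    `step` and `leaf_logit` are hypothetical interfaces, not the paper's API.
    """
    def expand(s, remaining, acc):
        # Return per-path scores reachable from state s in `remaining` more steps.
        if remaining == 0:
            return [acc + gamma ** depth * leaf_logit(s)]
        scores = []
        for a in actions:
            s_next, r = step(s, a)
            # Reward collected at tree level (depth - remaining) gets gamma**(depth - remaining).
            scores.extend(expand(s_next, remaining - 1,
                                 acc + gamma ** (depth - remaining) * r))
        return scores

    logits = []
    for a in actions:
        s1, r = step(s0, a)                     # immediate reward, gamma**0
        path_scores = np.array(expand(s1, depth - 1, r))
        # Log-sum-exp aggregation over all paths under this root action.
        m = path_scores.max()
        logits.append(m + np.log(np.mean(np.exp(path_scores - m))))

    logits = beta * np.array(logits)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```

In this sketch the gradient of the policy would flow through `leaf_logit` at every leaf of the tree rather than through a single sampled trajectory, which is the mechanism the abstract credits for the variance reduction.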


