SoftTreeMax: Policy Gradient with Tree Search

09/28/2022
by   Gal Dalal, et al.

Policy-gradient methods are widely used for learning control policies. They can be easily distributed to multiple workers and reach state-of-the-art results in many domains. Unfortunately, they exhibit large variance and consequently suffer from high sample complexity, since they aggregate gradients over entire trajectories. At the other extreme, planning methods, like tree search, optimize the policy using single-step transitions that consider future lookahead. These approaches have been mainly considered for value-based algorithms. Planning-based algorithms require a forward model and are computationally intensive at each step, but are more sample efficient. In this work, we introduce SoftTreeMax, the first approach that integrates tree search into policy gradient. Traditionally, gradients are computed for single state-action pairs. Instead, our tree-based policy structure leverages all gradients at the tree leaves in each environment step. This allows us to reduce the variance of gradients by three orders of magnitude and to benefit from better sample complexity compared with standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance with faster run time compared with distributed PPO.
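The abstract does not spell out the policy's functional form, so the following is only a rough sketch of the general idea of a tree-based softmax policy, not the authors' definition. It assumes a hypothetical toy deterministic MDP (`NEXT_STATE`, `REWARD`), a per-state parameter vector `theta` standing in for a learned leaf score, exhaustive expansion to a fixed `depth`, and a plain average over the leaves under each root action. All of these names and choices are illustrative assumptions.

```python
import itertools
import numpy as np

# Hypothetical toy deterministic forward model: 3 states, 2 actions.
NEXT_STATE = np.array([[1, 2], [2, 0], [0, 1]])           # NEXT_STATE[s, a]
REWARD = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])   # REWARD[s, a]


def tree_logits(state, theta, depth, gamma=0.99):
    """Score each root action by expanding the tree to `depth`.

    Each leaf contributes its discounted cumulative reward plus the
    discounted parameter of the leaf state. Averaging over leaves is an
    assumption made for this sketch.
    """
    n_actions = NEXT_STATE.shape[1]
    logits = np.zeros(n_actions)
    for first in range(n_actions):
        scores = []
        # Enumerate every action sequence that starts with `first`.
        for tail in itertools.product(range(n_actions), repeat=depth - 1):
            s, ret = state, 0.0
            for t, a in enumerate((first,) + tail):
                ret += gamma**t * REWARD[s, a]
                s = NEXT_STATE[s, a]
            scores.append(ret + gamma**depth * theta[s])
        logits[first] = np.mean(scores)
    return logits


def soft_tree_max_policy(state, theta, depth, beta=1.0, gamma=0.99):
    """Softmax over the tree-based scores, with temperature-like `beta`."""
    z = beta * tree_logits(state, theta, depth, gamma)
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Because every leaf's `theta` entry enters the root logits, differentiating the log-probability of the chosen action with respect to `theta` touches all leaves at once, which is the property the abstract attributes to the tree-based policy structure. The sketch covers only the forward computation of the policy; the gradient estimator and the variance analysis are left to the full text.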

research
01/30/2023

SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Search

Despite the popularity of policy gradient methods, they are known to suf...
research
02/01/2022

PAGE-PG: A Simple and Loopless Variance-Reduced Policy Gradient Method with Probabilistic Gradient Estimation

Despite their success, policy gradient methods suffer from high variance...
research
02/18/2019

Fast Efficient Hyperparameter Tuning for Policy Gradients

The performance of policy gradient methods is sensitive to hyperparamete...
research
07/14/2020

Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting

Policy gradient methods have shown success in learning control policies ...
research
02/03/2022

ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search

A tree-based online search algorithm iteratively simulates trajectories ...
research
02/17/2020

Adaptive Experience Selection for Policy Gradient

Policy gradient reinforcement learning (RL) algorithms have achieved imp...
research
02/20/2023

Improving Deep Policy Gradients with Value Function Search

Deep Policy Gradient (PG) algorithms employ value networks to drive the ...
