Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction

07/04/2021
by Assaf Hallak, et al.

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: distribution shift and scalability.

We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even with access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate, and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games.

Next, we address the scalability issue: the computational complexity of exhaustive TS scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, also enables training with TS of depths that were not feasible before. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow.
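The depth-synchronous expansion behind Batch-BFS can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: `step_fn` stands in for a batched simulator and `value_fn` for the pre-trained value network, both of which are assumptions for illustration. The key idea it shows is that each depth of the tree is advanced with a single batched call over the whole frontier, rather than one simulator call per node.

```python
import numpy as np

def batch_bfs_value(root_state, step_fn, value_fn, n_actions, depth, gamma=0.99):
    """Sketch of depth-synchronous tree expansion (hypothetical API).

    At every depth, all frontier nodes take every action at once through a
    single vectorized simulator call, so the per-depth cost is one batched
    step instead of n_actions**d sequential ones.
    """
    states = root_state[None, :]   # frontier of tree nodes: (1, state_dim)
    returns = np.zeros(1)          # discounted reward accumulated per path
    for d in range(depth):
        n = states.shape[0]
        # Replicate each frontier state for every action: (n * n_actions, ...)
        rep = np.repeat(states, n_actions, axis=0)
        acts = np.tile(np.arange(n_actions), n)
        next_states, rewards = step_fn(rep, acts)   # one batched call per depth
        returns = np.repeat(returns, n_actions) + (gamma ** d) * rewards
        states = next_states
    # Bootstrap the leaves with the pre-trained value, then pick the best path.
    leaf_values = returns + (gamma ** depth) * value_fn(states)
    best_leaf = int(np.argmax(leaf_values))
    # Leaves are ordered lexicographically by action sequence, so the root
    # action of the best path is recovered by integer division.
    first_action = best_leaf // (n_actions ** (depth - 1))
    return first_action, leaf_values[best_leaf]

# Toy deterministic environment: 1-D state, action 1 moves +1, action 0 moves -1,
# reward equals the next state. Always choosing action 1 is optimal.
def step_fn(states, actions):
    nxt = states + np.where(actions[:, None] == 1, 1.0, -1.0)
    return nxt, nxt[:, 0]

def value_fn(states):
    return states[:, 0]

action, value = batch_bfs_value(np.zeros(1), step_fn, value_fn,
                                n_actions=2, depth=3, gamma=1.0)
```

On GPU, the same pattern maps directly onto batched tensor ops, which is where the two-orders-of-magnitude runtime reduction reported above comes from; the frontier still grows as n_actions**depth, so memory, not wall-clock per node, becomes the binding constraint.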


Related research

08/15/2020  Reducing Sampling Error in Batch Temporal Difference Learning
Temporal difference (TD) learning is one of the main foundations of mode...

01/26/2023  Model-based Offline Reinforcement Learning with Local Misspecification
We present a model-based offline reinforcement learning policy performan...

02/23/2023  Targeted Search Control in AlphaZero for Effective Policy Improvement
AlphaZero is a self-play reinforcement learning algorithm that achieves ...

10/06/2021  No-Press Diplomacy from Scratch
Prior AI successes in complex games have largely focused on settings wit...

06/23/2023  Correcting discount-factor mismatch in on-policy policy gradient methods
The policy gradient theorem gives a convenient form of the policy gradie...

07/11/2019  Safe Policy Improvement with Soft Baseline Bootstrapping
Batch Reinforcement Learning (Batch RL) consists in training a policy us...
