TreeQN and ATreeC: Differentiable Tree Planning for Deep Reinforcement Learning

by Gregory Farquhar, et al.
University of Oxford

Combining deep model-free reinforcement learning with on-line planning is a promising approach to building on the successes of deep RL. On-line planning with look-ahead trees has proven successful in environments where transition models are known a priori. However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning. To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. We also propose ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimised for its actual use in the planner. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a box-pushing task, as well as n-step DQN and value prediction networks (Oh et al., 2017) on multiple Atari games, with deeper trees often outperforming shallower ones. We also present a qualitative analysis that sheds light on the trees learned by TreeQN.



1 Introduction

Footnote 1: Our code is available at

A promising approach to improving model-free deep reinforcement learning (RL) is to combine it with on-line planning. The model-free value function can be viewed as a rough global estimate which is then locally refined on the fly for the current state by the on-line planner. Crucially, this does not require new samples from the environment but only additional computation, which is often available.

One strategy for on-line planning is to use look-ahead tree search (knuth1975analysis; browne2012survey). Traditionally, such methods have been limited to domains where perfect environment simulators are available, such as board or card games (coulom2006efficient; sturtevant2008analysis). However, in general, models for complex environments with high dimensional observation spaces and complex dynamics must be learned from agent experience. Unfortunately, to date, it has proven difficult to learn models for such domains with sufficient fidelity to realise the benefits of look-ahead planning (oh2015action; talvitie2017self).

A simple approach to learning environment models is to maximise a similarity metric between model predictions and ground truth in the observation space. This approach has been applied with some success in cases where model fidelity is less important, e.g., for improving exploration (chiappa2017recurrent; oh2015action). However, this objective causes significant model capacity to be devoted to predicting irrelevant aspects of the environment dynamics, such as noisy backgrounds, at the expense of value-critical features that may occupy only a small part of the observation space (pathak2017curiosity). Consequently, current state-of-the-art models still accumulate errors too rapidly to be used for look-ahead planning in complex environments.

Another strategy is to train a model such that, when it is used to predict a value function, the error in those predictions is minimised. Doing so can encourage the model to focus on features of the observations that are relevant for the control task. An example is the predictron (silver2017predictron), where the model is used to aid policy evaluation without addressing control. Value prediction networks (VPNs, oh2017value) take a similar approach but use the model to construct a look-ahead tree only when constructing bootstrap targets and selecting actions, similarly to TD-search (silver2012temporal). Crucially, the model is not embedded in a planning algorithm during optimisation.

We propose a new tree-structured neural network architecture to address the aforementioned problems. By formulating the tree look-ahead in a differentiable way and integrating it directly into the $Q$-function or policy, we train the entire agent, including its learned transition model, end-to-end. This ensures that the model is optimised for the correct goal and is suitable for on-line planning during execution of the policy.

Since the transition model is only weakly grounded in the actual environment, our approach can alternatively be viewed as a model-free method in which the fully connected layers of DQN are replaced by a recursive network that applies transition functions with shared parameters at each tree node expansion.

The resulting architecture, which we call TreeQN, encodes an inductive bias based on the prior knowledge that the environment is a stationary Markov process, which facilitates faster learning of better policies. We also present an actor-critic variant, ATreeC, in which the tree is augmented with a softmax layer and used as a policy network.
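
To make the recursive structure concrete, here is a minimal NumPy sketch of a tree-structured $Q$-value estimate. It is an illustration under stated assumptions, not the TreeQN architecture itself: the names `tree_q` and `make_params`, the linear and tanh parameterisation, and the plain max backup are all hypothetical choices made for brevity. The point it shows is that one set of transition parameters is reused at every node expansion.

```python
import numpy as np

def tree_q(z, depth, params, gamma=0.99):
    """Estimate Q-values for abstract state z with a depth-limited tree backup.

    Hypothetical parameterisation (assumed for illustration only):
      params["W_trans"]:  (n_actions, d, d) one transition matrix per action,
                          shared across every node expansion
      params["w_reward"]: (n_actions, d) linear reward predictor per action
      params["w_value"]:  (d,) linear state-value predictor
    """
    W_trans = params["W_trans"]
    w_reward = params["w_reward"]
    w_value = params["w_value"]
    n_actions = W_trans.shape[0]
    q = np.empty(n_actions)
    for a in range(n_actions):
        reward = w_reward[a] @ z              # predicted reward for (z, a)
        z_next = np.tanh(W_trans[a] @ z)      # predicted next abstract state
        if depth == 1:
            backup = w_value @ z_next         # leaf: use the value estimate
        else:
            # Interior node: expand again with the same parameters, back up the max.
            backup = tree_q(z_next, depth - 1, params, gamma).max()
        q[a] = reward + gamma * backup
    return q

def make_params(n_actions=3, d=4, seed=0):
    """Random parameters standing in for learned weights."""
    rng = np.random.default_rng(seed)
    return {
        "W_trans": rng.normal(scale=0.5, size=(n_actions, d, d)),
        "w_reward": rng.normal(size=(n_actions, d)),
        "w_value": rng.normal(size=d),
    }

params = make_params()
q_values = tree_q(np.ones(4), depth=2, params=params)
greedy_action = int(q_values.argmax())
```

Because every operation above is differentiable almost everywhere, an autodiff framework could in principle propagate a $Q$-learning loss through the whole tree into the transition, reward, and value parameters.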

We show that TreeQN and ATreeC outperform their DQN-based counterparts in a box-pushing domain and a suite of Atari games, with deeper trees often outperforming shallower trees, and TreeQN outperforming VPN (oh2017value) on most Atari games. We also present ablation studies investigating various auxiliary losses for grounding the transition model more strongly in the environment, which could improve performance as well as lead to interpretable internal plans. While we show that grounding the reward function is valuable, we conclude that how to learn strongly grounded transition models and generate reliably interpretable plans without compromising performance remains an open research question.

2 Background

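We consider an agent learning to act in a Markov Decision Process (MDP), with the goal of maximising its expected discounted sum of rewards $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, by learning a policy $\pi(s)$ that maps states $s \in \mathcal{S}$ to actions $a \in \mathcal{A}$. The state-action value function ($Q$-function) is defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_t \mid s_t = s, a_t = a]$; the optimal $Q$-function is $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$.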

The Bellman optimality equation writes $Q^*$ recursively as

$$Q^*(s, a) = \mathcal{T} Q^*(s, a) \equiv r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a'),$$

where $P$ is the MDP state transition function and $r$ is a reward function, which for simplicity we assume to be deterministic. $Q$-learning (watkins1992q) uses a single-sample approximation of the contraction operator $\mathcal{T}$ to iteratively improve an estimate of $Q^*$.
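
The difference between the full Bellman backup and its single-sample approximation can be sketched on a tabular problem. The functions `bellman_backup` and `q_learning_update`, the learning rate `lr`, and the two-state MDP below are hypothetical and chosen only for illustration.

```python
import numpy as np

def bellman_backup(Q, P, r, gamma):
    """Full backup: r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a').

    Q: (S, A) value table, P: (S, A, S) transition probabilities, r: (S, A) rewards.
    """
    return r + gamma * P @ Q.max(axis=1)

def q_learning_update(Q, s, a, reward, s_next, lr, gamma):
    """Single-sample update: move Q[s, a] toward reward + gamma * max_a' Q[s_next, a']."""
    target = reward + gamma * Q[s_next].max()
    Q = Q.copy()
    Q[s, a] += lr * (target - Q[s, a])
    return Q

# Toy MDP (an assumption for this example): action 0 stays, action 1 switches state.
# The only reward is 1 for staying in state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
r = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9

# Repeatedly applying the contraction converges to the optimal Q-function.
Q_star = np.zeros((2, 2))
for _ in range(500):
    Q_star = bellman_backup(Q_star, P, r, gamma)

# One sampled transition (s=1, a=0, reward=1, s'=1) nudges a fresh estimate.
Q_sampled = q_learning_update(np.zeros((2, 2)), 1, 0, 1.0, 1, lr=0.5, gamma=gamma)
```

The full backup needs the transition function `P`, whereas the sampled update only needs one observed transition, which is why the latter applies when the model is unknown.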

In deep $Q$-learning (mnih2015human), $Q$ is represented by a deep neural network with parameters $\theta$, and is improved by regressing $Q(s, a; \theta)$ to a target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$, where $\theta^-$ are the parameters of a target network periodically copied from $\theta$.

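We use a version of $n$-step $Q$-learning (mnih2016asynchronous) with synchronous environment threads. In particular, starting at a timestep $t$, we roll forward $n_{\text{env}}$ threads for $n$ timesteps each. We then bootstrap off the final states only and gather all $n_{\text{env}} \times n$ transitions in a single batch for the backward pass, minimising the loss:

$$\mathcal{L}_{\text{nstep-}Q} = \sum_{\text{envs}} \sum_{j=1}^{n} \Big( \sum_{k=1}^{j} \gamma^{j-k} r_{t+n-k} + \gamma^{j} \max_{a'} Q(s_{t+n}, a'; \theta^-) - Q(s_{t+n-j}, a_{t+n-j}; \theta) \Big)^2.$$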

If the episode terminates, we use the remaining episode return as the target, without bootstrapping.
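
The $n$-step targets described above can be computed with a single backward pass over each rollout. The following is a minimal sketch; the function names, array layout, and the `dones` flag convention (1 when the episode ended on that transition) are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def nstep_targets(rewards, dones, bootstrap_value, gamma):
    """Compute n-step bootstrapped targets for a batch of rollouts.

    rewards, dones:  shape (n_env, n); dones[i, j] is 1 if the episode in
                     thread i terminated on transition j, else 0
    bootstrap_value: shape (n_env,), e.g. max_a' Q(s_{t+n}, a') from the
                     target network
    Returns targets of shape (n_env, n), one per transition.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    n_env, n = rewards.shape
    targets = np.zeros((n_env, n))
    running = np.asarray(bootstrap_value, dtype=float).copy()
    for step in reversed(range(n)):
        # A terminal transition cuts off everything after it, including the bootstrap.
        running = rewards[:, step] + gamma * running * (1.0 - dones[:, step])
        targets[:, step] = running
    return targets

def nstep_q_loss(q_taken, targets):
    """Sum of squared errors between Q(s, a) of the taken actions and the targets."""
    return float(((targets - q_taken) ** 2).sum())

# One thread, three steps, bootstrap value 10, gamma 0.5.
example = nstep_targets([[1.0, 1.0, 1.0]], [[0, 0, 0]], [10.0], gamma=0.5)
# Same rollout, but the episode ends on the second transition.
example_done = nstep_targets([[1.0, 1.0, 1.0]], [[0, 1, 0]], [10.0], gamma=0.5)
```

In the second call, the targets for the first two transitions contain only the remaining episode return, while the third transition (the start of a new episode) still bootstraps from the final state.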

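This algorithm’s actor-critic counterpart is A2C, a synchronous variant of A3C (mnih2016asynchronous) in which a policy $\pi$ and state-value function $V(s)$ are trained using the gradient:

$$\Delta\theta = \sum_{\text{envs}} \sum_{j=1}^{n} \Big[ \nabla_{\theta_\pi} \log \pi(a_{t+n-j} \mid s_{t+n-j}) \, A_j(s_{t+n-j}, a_{t+n-j}) + \beta \nabla_{\theta_\pi} H(\pi(s_{t+n-j})) - \alpha \nabla_{\theta_V} A_j(s_{t+n-j}, a_{t+n-j})^2 \Big],$$

where $A_j$ is an advantage estimate given by $\sum_{k=1}^{j} \gamma^{j-k} r_{t+n-k} + \gamma^{j} V(s_{t+n}) - V(s_{t+n-j})$, $H$ is the policy entropy, $\beta$ is a hyperparameter tuning the degree of entropy regularisation, and $\alpha$ is a hyperparameter controlling the relative learning rates of actor and critic.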
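
As a rough illustration of how the three terms of the actor-critic objective combine, the sketch below evaluates a scalar loss from policy logits, value predictions, and $n$-step return targets. The function name `a2c_loss`, the argument names, and the default coefficients `beta=0.01` and `alpha=0.5` are assumptions for this example, not values from the paper. NumPy has no automatic differentiation, so this only computes the loss value; in an autodiff framework the advantage would be held constant inside the policy term.

```python
import numpy as np

def a2c_loss(logits, actions, values, targets, beta=0.01, alpha=0.5):
    """Scalar A2C-style loss to minimise, summed over a batch of transitions.

    logits:  (batch, n_actions) unnormalised policy scores
    actions: (batch,) integer actions that were taken
    values:  (batch,) critic predictions V(s)
    targets: (batch,) n-step bootstrapped returns
    """
    logits = np.asarray(logits, dtype=float)
    actions = np.asarray(actions)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)

    advantages = np.asarray(targets, dtype=float) - np.asarray(values, dtype=float)
    logp_taken = log_probs[np.arange(len(actions)), actions]

    policy_term = (logp_taken * advantages).sum()   # maximised
    entropy = -(probs * log_probs).sum()            # maximised, weighted by beta
    value_term = (advantages ** 2).sum()            # minimised, weighted by alpha
    return float(-(policy_term + beta * entropy) + alpha * value_term)

# One transition, two actions, uniform policy, advantage of 1.
loss_no_entropy = a2c_loss([[0.0, 0.0]], [0], [1.0], [2.0], beta=0.0, alpha=0.5)
loss_with_entropy = a2c_loss([[0.0, 0.0]], [0], [1.0], [2.0], beta=1.0, alpha=0.5)
```

Raising `beta` rewards more uniform policies, while `alpha` scales how strongly the critic's squared error contributes relative to the actor terms.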

These algorithms were chosen for their simplicity and reasonable wallclock speeds, but TreeQN can also be used in other algorithms, as described in the section presenting the TreeQN method. Our implementations are based on OpenAI Baselines (baselines).