## 1 Introduction


¹ Our code is available at https://github.com/oxwhirl/treeqn.

A promising approach to improving model-free deep reinforcement learning (RL) is to combine it with on-line planning. The model-free value function can be viewed as a rough global estimate which is then locally refined on the fly for the current state by the on-line planner. Crucially, this does not require new samples from the environment but only additional computation, which is often available.

One strategy for on-line planning is to use look-ahead tree search (Knuth & Moore, 1975; Browne et al., 2012). Traditionally, such methods have been limited to domains where perfect environment simulators are available, such as board or card games (Coulom, 2006; Sturtevant, 2008). However, in general, models for complex environments with high dimensional observation spaces and complex dynamics must be learned from agent experience. Unfortunately, to date, it has proven difficult to learn models for such domains with sufficient fidelity to realise the benefits of look-ahead planning (Oh et al., 2015; Talvitie, 2017).

A simple approach to learning environment models is to maximise a similarity metric between model predictions and ground truth in the observation space. This approach has been applied with some success in cases where model fidelity is less important, *e.g.*, for improving exploration (Chiappa et al., 2017; Oh et al., 2015). However, this objective causes significant model capacity to be devoted to predicting irrelevant aspects of the environment dynamics, such as noisy backgrounds, at the expense of value-critical features that may occupy only a small part of the observation space (Pathak et al., 2017). Consequently, current state-of-the-art models still accumulate errors too rapidly to be used for look-ahead planning in complex environments.

Another strategy is to train a model such that, when it is used to predict a value function, the error in those predictions is minimised. Doing so can encourage the model to focus on the features of the observations that are relevant to the control task. An example is the *predictron* (Silver et al., 2017), where the model is used to aid policy evaluation without addressing control. *Value prediction networks* (VPNs, Oh et al., 2017) take a similar approach but use the model to construct a look-ahead tree only when constructing bootstrap targets and selecting actions, similarly to *TD-search* (Silver et al., 2012). Crucially, the model is not embedded in a planning algorithm during optimisation.

We propose a new tree-structured neural network architecture to address the aforementioned problems. By formulating the tree look-ahead in a differentiable way and integrating it directly into the $Q$-function or policy, we train the entire agent, including its learned transition model, end-to-end. This ensures that the model is optimised for the correct goal and is suitable for on-line planning during execution of the policy. Since the transition model is only weakly grounded in the actual environment, our approach can alternatively be viewed as a model-free method in which the fully connected layers of DQN are replaced by a recursive network that applies transition functions with shared parameters at each tree node expansion.

The resulting architecture, which we call TreeQN, encodes an inductive bias based on the prior knowledge that the environment is a stationary Markov process, which facilitates faster learning of better policies. We also present an actor-critic variant, ATreeC, in which the tree is augmented with a softmax layer and used as a policy network.
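To make the recursive structure concrete, here is a minimal forward-pass sketch in NumPy. All names, shapes, and the specific nonlinearity are hypothetical; this illustrates the idea rather than the paper's implementation. A per-action transition function with shared parameters maps a state embedding to child embeddings, reward and value heads score each node, and values are backed up through a max, all of which is differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, ACTIONS = 8, 3                                   # hypothetical sizes
W_trans = 0.1 * rng.normal(size=(ACTIONS, EMB, EMB))  # shared per-action transitions
w_reward = 0.1 * rng.normal(size=(ACTIONS, EMB))      # reward head
w_value = 0.1 * rng.normal(size=EMB)                  # value head

def tree_q(z, depth, gamma=0.99):
    """Q-values at embedding z from a depth-`depth` differentiable look-ahead."""
    q = np.empty(ACTIONS)
    for a in range(ACTIONS):
        z_next = np.tanh(W_trans[a] @ z)   # learned transition to the child node
        r = w_reward[a] @ z                # predicted immediate reward
        if depth == 1:
            v = w_value @ z_next           # leaf: value head
        else:
            v = tree_q(z_next, depth - 1, gamma).max()  # internal node: max-backup
        q[a] = r + gamma * v
    return q

z0 = rng.normal(size=EMB)      # embedding of the current observation
q_plan = tree_q(z0, depth=2)   # depth-2 tree look-ahead Q-values, one per action
```

Because every node applies the same transition parameters, deeper trees add computation but no new parameters, which is the inductive bias described above.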

We show that TreeQN and ATreeC outperform their DQN-based counterparts in a box-pushing domain and a suite of Atari games, with deeper trees often outperforming shallower trees, and TreeQN outperforming VPN (Oh et al., 2017) on most Atari games. We also present ablation studies investigating various auxiliary losses for grounding the transition model more strongly in the environment, which could improve performance as well as lead to interpretable internal plans. While we show that grounding the reward function is valuable, we conclude that how to learn strongly grounded transition models and generate reliably interpretable plans without compromising performance remains an open research question.

## 2 Background

We consider an agent learning to act in a Markov Decision Process (MDP), with the goal of maximising its expected discounted sum of rewards $\mathbb{E}\big[\sum_{t} \gamma^t r_t\big]$, by learning a policy $\pi(s)$ that maps states $s$ to actions $a$. The state-action value function ($Q$-function) is defined as $Q^\pi(s, a) = \mathbb{E}\big[\sum_{t} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\big]$; the optimal $Q$-function is $Q^*(s, a) = \max_\pi Q^\pi(s, a)$. The Bellman optimality equation writes $Q^*$ recursively as

$$Q^*(s, a) = \mathcal{B}Q^*(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big],$$

where $s' \sim P(\cdot \mid s, a)$ is drawn from the MDP state transition function and $r(s, a)$ is a reward function, which for simplicity we assume to be deterministic. $Q$-learning (Watkins & Dayan, 1992) uses a single-sample approximation of the contraction operator $\mathcal{B}$ to iteratively improve an estimate of $Q^*$.
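As an illustration of that single-sample backup, consider a tabular sketch (the paper's setting is of course deep function approximation; the step size `alpha` here is a hypothetical hyperparameter):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Single-sample approximation of the Bellman optimality backup."""
    target = r + gamma * np.max(Q[s_next])  # bootstrapped target
    Q[s, a] += alpha * (target - Q[s, a])   # move the estimate toward the target
    return Q

# Tiny two-state, two-action example
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)  # Q[0, 1] becomes 0.1
```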

In deep $Q$-learning (Mnih et al., 2015), $Q$ is represented by a deep neural network with parameters $\theta$, and is improved by regressing $Q(s, a; \theta)$ to a target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$, where $\theta^-$ are the parameters of a target network periodically copied from $\theta$.
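The target computation can be sketched as a vectorised operation (hypothetical names and shapes; `next_q_target` holds the target network's Q-values for the successor states):

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """One-step targets r + gamma * max_a' Q(s', a'; theta^-), zeroed at terminals.

    rewards:       (batch,) rewards
    next_q_target: (batch, num_actions) target-network Q-values at s'
    dones:         (batch,) 1.0 where the episode terminated, else 0.0
    """
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)
```

Only the target network's parameters $\theta^-$ appear inside the max, so gradients flow solely through $Q(s, a; \theta)$ in the regression loss.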

We use a version of $n$-step $Q$-learning (Mnih et al., 2016) with synchronous environment threads. In particular, starting at a timestep $t$, we roll forward $n_\text{envs}$ threads for $n$ timesteps each. We then bootstrap off the final states only and gather all $n_\text{envs} \times n$ transitions in a single batch for the backward pass, minimising the loss:

$$\mathcal{L}_{n\text{-step-}Q} = \sum_{\text{envs}} \sum_{j=1}^{n} \left( \sum_{k=1}^{j} \gamma^{j-k} r_{t+n-k} + \gamma^{j} \max_{a'} Q(s_{t+n}, a'; \theta^-) - Q(s_{t+n-j}, a_{t+n-j}; \theta) \right)^{2}. \qquad (1)$$

If the episode terminates, we use the remaining episode return as the target, without bootstrapping.
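Concretely, the per-thread targets can be accumulated backwards from the bootstrap state. This is a sketch with hypothetical names, one list of rewards per environment thread, assuming termination can only occur at the rollout's final state as in the batching described above:

```python
def nstep_targets(rewards, bootstrap_value, terminated, gamma=0.99):
    """n-step targets for one environment thread's rollout.

    rewards:         [r_t, ..., r_{t+n-1}]
    bootstrap_value: max_a' Q(s_{t+n}, a'; theta^-) from the target network
    terminated:      True if the episode ended, in which case we do not bootstrap
    """
    R = 0.0 if terminated else bootstrap_value
    targets = []
    for r in reversed(rewards):
        R = r + gamma * R          # fold in one more reward
        targets.append(R)
    targets.reverse()              # targets[j] is the target for (s_{t+j}, a_{t+j})
    return targets
```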

This algorithm’s actor-critic counterpart is A2C, a synchronous variant of A3C (Mnih et al., 2016) in which a policy $\pi(a \mid s; \theta)$ and state-value function $V(s; \theta)$ are trained using the gradient:

$$\nabla_\theta \mathcal{L} = \sum_{\text{envs}} \sum_{j=1}^{n} \Big[ -\nabla_\theta \log \pi(a_{t+n-j} \mid s_{t+n-j}; \theta)\, A_j - \beta\, \nabla_\theta H\big(\pi(s_{t+n-j}; \theta)\big) + \alpha\, \nabla_\theta \tfrac{1}{2} A_j^2 \Big], \qquad (2)$$

where $A_j = \sum_{k=1}^{j} \gamma^{j-k} r_{t+n-k} + \gamma^{j} V(s_{t+n}; \theta) - V(s_{t+n-j}; \theta)$ is an advantage estimate, $H$ is the policy entropy, $\beta$ is a hyperparameter tuning the degree of entropy regularisation, and $\alpha$ is a hyperparameter controlling the relative learning rates of actor and critic. These algorithms were chosen for their simplicity and reasonable wallclock speeds, but TreeQN can also be used in other algorithms, as described in Section LABEL:sec:method_treeqn. Our implementations are based on OpenAI Baselines (Dhariwal et al., 2017).
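As a sketch, a scalar surrogate whose gradient has the same three terms as the A2C update might be assembled per batch as follows (hypothetical names; in practice the advantage is treated as a constant in the policy term, e.g. via a stop-gradient):

```python
import numpy as np

def a2c_loss(log_probs, values, targets, entropies, alpha=0.5, beta=0.01):
    """Policy-gradient term + entropy bonus + weighted critic regression.

    log_probs: (batch,) log pi(a_j | s_j) of the actions taken
    values:    (batch,) V(s_j) predictions
    targets:   (batch,) n-step value targets
    entropies: (batch,) policy entropy H at each state
    """
    adv = targets - values                  # advantage estimates A_j
    policy_loss = -(log_probs * adv).sum()  # reinforce actions with positive advantage
    value_loss = 0.5 * (adv ** 2).sum()     # critic regression
    return policy_loss - beta * entropies.sum() + alpha * value_loss
```

Minimising this surrogate descends the policy-gradient and value-regression terms while ascending the entropy bonus, matching the roles of $\alpha$ and $\beta$ above.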