QXplore: Q-learning Exploration by Maximizing Temporal Difference Error

by Riley Simmons-Edler, et al.
Princeton University

A major challenge in reinforcement learning for continuous state-action spaces is exploration, especially when reward landscapes are very sparse. Several recent methods provide an intrinsic motivation to explore by directly encouraging RL agents to seek novel states. A potential disadvantage of pure state novelty-seeking behavior is that unknown states are treated equally regardless of their potential for future reward. In this paper, we propose that the temporal difference error of predicting primary reward can serve as a secondary reward signal for exploration. This leads to novelty-seeking in the absence of primary reward, and at the same time accelerates exploration of reward-rich regions in sparse (but nonzero) reward landscapes compared to state novelty-seeking. This objective draws inspiration from dopaminergic pathways in the brain that influence animal behavior. We implement this idea with an adversarial method in which Q and Qx are the action-value functions for primary and secondary rewards, respectively. Secondary reward is given by the absolute value of the TD-error of Q. Training is off-policy, based on a replay buffer containing a mixture of trajectories induced by Q and Qx. We characterize performance on a suite of continuous control benchmark tasks against recent state-of-the-art exploration methods and demonstrate comparable or better performance on all tasks, with much faster convergence for Q.








1 Introduction

Deep reinforcement learning (RL) has recently achieved impressive results across several challenging domains, such as playing games (Mnih et al., 2016; Silver et al., 2017; OpenAI, 2018) and controlling robots (OpenAI et al., 2018; Kalashnikov et al., 2018). In many of these tasks, a well-shaped reward function is used to assist in learning good policies. On the other hand, deep RL remains challenging for tasks where the reward function is very sparse. In these settings, state-of-the-art RL methods often perform poorly and train very slowly, if at all, due to the low probability of observing improved rewards by following the current optimal policy or a naive exploration policy such as ε-greedy sampling.

The challenge of learning from sparse rewards is typically framed as a problem of exploration, inspired by the notion that a successful RL agent must efficiently explore the state space of its environment in order to find improved sources of reward. One common exploration paradigm is to directly determine the novelty of states and to encourage the agent to visit states with the highest novelty. In small, countable state spaces this can be achieved by counting how many times each state has been visited. Direct counting of states is challenging or impossible in high-dimensional or continuous state spaces, but recent work (Tang et al., 2017; Bellemare et al., 2016; Fu et al., 2017) using count-like statistics has shown success on benchmark tasks with complex state spaces. Another paradigm for exploration learns a dynamics model of the environment and computes a novelty measure proportional to the error of the model in predicting transitions in the environment. This exploration method relies on the core assumption that well-modeled regions of the state space are similar to previously visited states and thus are less interesting than other regions of state space. Predictions of the transition dynamics can be directly computed (Pathak et al., 2017; Stadie et al., 2015; Savinov et al., 2019; Burda et al., 2019a), or related to an information gain objective on the state space, as described in VIME (Houthooft et al., 2016) and EMI (Kim et al., 2018).

Several exploration methods have recently been proposed that capitalize on the function approximation properties of neural networks. Random network distillation (RND) trains a function to predict the output of a randomly-initialized neural network from an input state, and uses the approximation error as a reward bonus for a separately-trained RL agent (Burda et al., 2019b). Similarly, DORA (Fox et al., 2018) trains a network to predict zero on observed states, and deviations from zero are used to indicate unexplored states. An important shortcoming of existing exploration methods is that they incorporate information only about states, and therefore treat all unobserved states the same, regardless of their viability for future reward. This can be problematic in sparse-reward scenarios where successful exploration requires finding the unobserved states that actually lead to higher rewards. As a result, current state-novelty exploration methods can be quite sample inefficient and can require very long training times to achieve good performance (Burda et al., 2019b; Ecoffet et al., 2019; Andrychowicz et al., 2017).

In this paper we propose QXplore, a new exploration formulation that seeks novelty in the predicted reward landscape instead of novelty in the state space. QXplore exploits the inherent reward-space signal from the computation of temporal difference error (TD-error) in value-based RL, and explicitly promotes visiting states where the current understanding of reward dynamics is poor. Our formulation draws inspiration from biological models of dopamine pathways in the brain, where levels of dopamine correlate with TD-error in learning trials (Niv et al., 2005). Dopamine-seeking behavior has previously been described in animals (Arias-Carrión and Pöppel, 2007) and serves as a biologically plausible exploration objective, in contrast to simple state novelty. In the following sections, we describe QXplore and demonstrate its utility for sample-efficient learning on a variety of complex benchmark environments with continuous control and sparse rewards.

2 QXplore: TD-Error as Adversarial Reward Signal

2.1 Method Overview

Figure 1: Method diagram for QXplore. We define two Q-functions, Q and Qx, each of which samples trajectories from the environment, and which share replay buffer contents. Qx's reward function rx is the unsigned temporal difference error of the current Q on data sampled from both policies. πQx is defined by Qx and maximizes the TD-error of Q, while πQ is defined by Q and maximizes reward.

We first provide an overview of the method; a visual representation is depicted in Figure 1. At a high level, QXplore is a Q-learning method that jointly trains two independent agents, each equipped with its own Q-function and reward function:

  1. Q: A standard Q-function that learns a value function on reward provided by the external environment.

  2. Qx: A Q-function that learns a value function directly on the TD-error of Q.

Together, Q and Qx form an adversarial pair: a policy that samples from Qx achieves reward when the agent ventures into states whose reward dynamics are foreign to Q (i.e., where Q under- or overestimates the reward achieved). Separate replay buffers are maintained for each agent, but each agent receives samples from both buffers at train time.

QXplore’s formulation is independent of the Q-learning method used, and is therefore a general exploration technique that can be applied to any Q-learning setting, and can easily be combined with other exploration schemes.

2.2 Preliminaries

We consider RL in the terminology of Sutton and Barto (1998), in which an agent seeks to maximize reward in a Markov Decision Process (MDP). An MDP consists of states s ∈ S, actions a ∈ A, and a state transition function p(s_{t+1} | s_t, a_t) giving the probability of moving to state s_{t+1} after taking action a_t from state s_t at discrete timestep t. Rewards are sampled from a reward function r(s_t, a_t). An RL agent has a policy π(a_t | s_t) that gives the probability of taking action a_t when in state s_t. The agent aims to learn a policy to maximize the expectation of the time-decayed sum of rewards R = Σ_t γ^t r_t, where γ ∈ (0, 1).
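The time-decayed sum of rewards above can be written out directly; a minimal sketch (the function name is ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Time-decayed sum of rewards: R = sum_t gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

For example, three rewards of 1 with γ = 0.5 give a return of 1 + 0.5 + 0.25 = 1.75.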

A value function V_ϕ with parameters ϕ computes the expected future return V_ϕ(s_t) ≈ E_π[Σ_{t'≥t} γ^{t'−t} r_{t'}] for some policy π. Temporal difference (TD) error measures the bootstrapped error between the value function at the current timestep and the next timestep as

δ_t = r(s_t, a_t) + γ V_ϕ(s_{t+1}) − V_ϕ(s_t).
A Q-function is a value function of the form Q(s_t, a_t), which computes the expected future reward assuming the optimal action is taken at each future timestep. An approximation Q_θ to this optimal Q-function with parameters θ may be trained using a mean squared TD-error objective given some target Q-function Q̄, commonly a time-delayed version of Q_θ (Mnih et al., 2015). Extracting a policy given Q_θ amounts to approximating argmax_a Q_θ(s, a). Many methods exist for approximating the argmax operation in both discrete and continuous action spaces (Lillicrap et al., 2015; Haarnoja et al., 2018). Following the convention of Mnih et al. (2016), we train using an off-policy replay buffer of previously visited (s_t, a_t, r_t, s_{t+1}) tuples, which we sample uniformly.
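As a concrete illustration of this objective, here is a minimal numpy sketch of the bootstrapped target and mean squared TD-error loss over a minibatch (function names and the `done` masking convention are our own, not from the paper):

```python
import numpy as np

def td_targets(r, q_target_next_max, done, gamma=0.99):
    # Bootstrapped target: r + gamma * max_a' Qbar(s', a'), zeroed at terminal states
    return r + gamma * (1.0 - done) * q_target_next_max

def mse_td_loss(q_sa, r, q_target_next_max, done, gamma=0.99):
    # Mean squared TD-error used to train Q_theta (the target network is held fixed)
    delta = td_targets(r, q_target_next_max, done, gamma) - q_sa
    return np.mean(delta ** 2)
```

In practice `q_sa` and `q_target_next_max` would come from the online and target networks respectively; here they are plain arrays.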

2.3 TD-error objective

First we describe our TD-error exploration objective. The core concept of using TD-error as part of a reward function was first described as a count-based exploration weighting function in work by Schmidhuber, Thrun, and Möller (Schmidhuber, 1991; Thrun and Möller, 1991, 1992). The operative notion is that, state visitation frequencies being equal, it is better to visit states with high associated TD-error in order to learn about the environment more efficiently. Later, Gehring and Precup (2013) used TD-error as a negative signal to constrain exploration to states that are well understood by the value function, avoiding common failure modes.

In contrast to these previous works, we treat TD-error as a reward signal and use a Q-function trained on this signal to imply an exploration policy directly, rather than as a supplementary objective or a confidence bound. Crucially, when combined with neural network function approximators, this signal provides meaningful exploration information everywhere, as discussed in Section 2.5. For a given Q-function Q_θ with parameters θ and TD-error δ, we define our exploration reward function as

rx(s_t, a_t, s_{t+1}) = |δ_t| = |r(s_t, a_t) + γ max_{a'} Q̄(s_{t+1}, a') − Q_θ(s_t, a_t)|

for some extrinsic reward function r and target Q-function Q̄. Notably, we use the absolute value of the temporal difference, rather than the squared error used to compute updates for Q_θ, to keep the magnitudes of r and rx comparable and to reduce the influence of outlier temporal differences on the gradients of Qx, which we describe below.
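A sketch of this exploration reward, using the same bootstrapped target convention as standard Q-learning (names and the `done` mask are our own; the paper's implementation details may differ):

```python
import numpy as np

def qxplore_reward(r, q_sa, q_target_next_max, done, gamma=0.99):
    """Exploration reward rx: the absolute (not squared) TD-error of Q."""
    td_error = r + gamma * (1.0 - done) * q_target_next_max - q_sa
    return np.abs(td_error)
```

Note that rx is large whether Q under-predicts or over-predicts the bootstrapped return, which is why both cases attract the exploration policy.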

Intuitively, a policy maximizing the expected sum of rx will sample trajectories where Q does not have a good estimate of the future rewards it will experience. This is useful for exploration because rx will be large not only for state-action pairs producing unexpected reward, but for all state-action pairs leading to such states. In addition, a policy maximizing TD-error can be seen as an adversarial teacher for training Q. Further, TD-error-based exploration with a dedicated exploration policy removes the exploitation-versus-exploration tradeoff that state-novelty methods must contend with, as maximizing TD-error will by definition produce trajectories that provide information about the task for Q to train on.

2.4 Qx: Learning an adversarial Q-function to maximize TD-error

  Input: MDP M, Q-function Q_θ with target Q̄, Q-function Qx_φ with target Q̄x, replay buffers B and Bx, batch size N and sampling ratios k and kx, CEM policies π_Q and π_Qx, time decay parameter γ, soft target update rate τ, and environments E and Ex
  while not converged do
     Reset E, Ex
     while E and Ex are not done do
        Sample environments: step E with π_Q and store the transition in B; step Ex with π_Qx and store the transition in Bx
        Sample minibatches:
           b: kN elements from B and (1 − k)N elements from Bx
           bx: kxN elements from Bx and (1 − kx)N elements from B
        Update θ on b by minimizing the mean squared TD-error; compute rx on bx and update φ on bx
        Soft-update targets: Q̄ ← τQ_θ + (1 − τ)Q̄; Q̄x ← τQx_φ + (1 − τ)Q̄x
     end while
  end while
Algorithm 1 QXplore Algorithm

Next, we describe how we use the TD-error signal defined in Section 2.3 to define an exploration policy. rx itself is a generic reward objective that can be combined with many RL algorithms, but given its derivation from a Q-function trained off-policy via a bootstrapped target function, training a second Q-function to maximize rx allows the entire algorithm to be trained off-policy, with a replay buffer shared between Q and the Q-function maximizing rx, which we term Qx. This approach is beneficial for exploration, as trajectories producing improved reward may be sampled only very rarely, and a shared replay buffer improves data efficiency for training both Q-functions. Extensions using on-policy training with actor-critic methods (Mnih et al., 2016) and/or substituting a value function for Qx are also possible, though we do not explore them here.

We define a Q-function Qx, with parameters φ, whose reward objective is rx. We train Qx using the standard bootstrapped loss function

J(φ) = E[(rx(s_t, a_t, s_{t+1}) + γ max_{a'} Q̄x(s_{t+1}, a') − Qx(s_t, a_t))²],

where Q̄x is a target network for Qx.
The two Q-functions, Q and Qx, are trained off-policy in parallel, sharing replay data so that Q can train on sources of reward discovered by Qx and so that Qx can better predict the TD-errors of Q. Since the two share data, Qx acts as an adversarial teacher for Q, sampling trajectories that produce high TD-error under Q and thus provide novel information about the reward landscape. To avoid off-policy divergence issues due to the two Q-functions' different reward objectives, we perform rollouts of both Q-functions in parallel, using a cross-entropy method policy inspired by Kalashnikov et al. (2018), and sample a fixed ratio of experiences collected by each policy for each training batch. We found the ratio of training experiences collected by each policy to be critical for stable training, as described in Section 3.3. Our full method is shown in Figure 1, with pseudocode in Algorithm 1.
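The fixed-ratio batch mixing can be sketched as follows (a simplification with our own names; `self_ratio=0.75` mirrors the 75/25 mix used in our experiments, and buffer entries would be transition tuples in practice):

```python
import random

def sample_mixed_batch(self_buffer, other_buffer, batch_size=128,
                       self_ratio=0.75, rng=random):
    """Draw a training batch for one agent: self_ratio of the batch from its own
    replay buffer, the remainder from the other agent's buffer."""
    n_self = int(batch_size * self_ratio)
    n_other = batch_size - n_self
    batch = [rng.choice(self_buffer) for _ in range(n_self)]
    batch += [rng.choice(other_buffer) for _ in range(n_other)]
    rng.shuffle(batch)
    return batch
```

Each agent calls this with the roles of the two buffers swapped, so both see a fixed fraction of the other policy's experience every update.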

2.5 State Novelty from Neural Network Function Approximation Error

A key question in using TD-error for exploration is: what happens when the reward landscape is flat? Theoretically, in the case that r(s, a) = c for some constant c, an optimal Q-function which generalizes perfectly to unseen states will, in the infinite-horizon case, simply output the constant value c/(1 − γ). This results in a TD-error of 0 everywhere and thus no exploration signal. However, with neural network function approximation, we find that perfect generalization to unseen state-action pairs does not occur; in fact, we observe in Figure 2 that the distance of a new datapoint from the training data manifold correlates with the magnitude of the network output's deviation from the constant, and thus with TD-error. As a result, when the reward landscape is flat, TD-error exploration converges to a form of state-novelty exploration. This property of neural network function approximation has been used to good effect by several previous exploration methods, including RND (Burda et al., 2019b) and DORA (Fox et al., 2018).
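The flat-reward argument can be checked numerically: with r(s, a) = c everywhere, a Q-function that outputs the constant c/(1 − γ) has zero TD-error up to floating-point error. A small sketch of this arithmetic (our own illustration):

```python
def flat_reward_td_error(c, gamma):
    # Optimal infinite-horizon Q-value under constant reward c: c / (1 - gamma)
    q = c / (1.0 - gamma)
    # TD-error: r + gamma * Q(s') - Q(s), with Q constant everywhere
    return c + gamma * q - q
```

Any deviation of the network's output from this constant on unseen inputs therefore shows up directly as nonzero TD-error, which is what Figure 2 illustrates.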

Figure 2: Predictions of 3-layer MLPs with 256 hidden units per layer trained to imitate a target function, with training data sampled uniformly from two disjoint ranges (the dark shaded regions in the figure). Each line is the final response curve of an independently trained network once its training error has converged (MSE < 1e-7). The networks consistently fail either to extrapolate outside of the training regions or to interpolate between the two training regions.

3 Experiments

We performed a number of experiments to demonstrate the effectiveness of QXplore on continuous control benchmark tasks. We benchmark on five continuous control tasks using the MuJoCo physics simulator, each of which requires exploration due to sparse rewards. We limit our investigation in this work to continuous control exploration tasks for two reasons. First, the exploration problem is more isolated in such tasks compared to Atari games and other visual tasks, while still remaining highly non-trivial to solve. Second, due to the smaller state observations (length ~20 vectors for most tasks) compared to images, state-novelty methods have relatively less information to guide exploration, and have been reported to perform worse (Kim et al., 2018), making non-image-based RL tasks a more compelling use case for QXplore, as TD-error prediction should be unaffected by observation dimensionality.

We first compare with a state-of-the-art state novelty-based method, RND (Burda et al., 2019b), as well as ε-greedy sampling as a simple baseline. We then present several ablations of QXplore, along with an analysis of its robustness to key hyperparameters.

3.1 Experimental Setup

Figure 3: Illustration of our CheckpointCheetah task versus the SparseHalfCheetah task. CheckpointCheetah receives reward 1 when crossing each checkpoint for the first time and 0 otherwise, with optimal behaviour similar to the original HalfCheetah task from OpenAI Gym (Brockman et al., 2016) (run as fast as you can), while SparseHalfCheetah receives reward 0 during any step in which it is in the goal zone and -1 otherwise.

We describe here the details of our implementation and training parameters. We held these factors constant for QXplore, RND, and ε-greedy to enable a fair comparison. We used an off-policy Q-learning method based on TD3 (Fujimoto et al., 2018) and CGP (Simmons-Edler et al., 2019), with twin Q-functions and a cross-entropy method policy for better hyperparameter robustness. Each network (Q, Qx, and RND's random and predictor networks) consisted of a 3-layer MLP with 256 neurons per hidden layer and ReLU non-linearities. We used a batch size of 128 and a learning rate of 0.001, and for QXplore we sampled training batches for Q and Qx consisting of 75% self-collected data and 25% data collected by the other Q-function's policy.

We evaluate on several benchmark tasks. First, the SparseHalfCheetah task originally proposed in VIME (Houthooft et al., 2016), in which a simulated cheetah receives a reward of 0 if it is at least 5 units forward from the initial position and otherwise receives a reward of -1. Second, we trained on a variant of the HalfCheetah task from the OpenAI Gym (Brockman et al., 2016) that we refer to as CheckpointCheetah, in which reward is 0 except when the cheetah passes through a checkpoint; checkpoints are spaced 5 units apart. The simulation step size is larger than in SparseHalfCheetah, making reaching the first checkpoint easier, but the small number of states that provide reward means that state novelty is less efficient at reaching high performance. We illustrate the difference between the two tasks in Figure 3.
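As an illustration, the two reward schemes can be sketched as follows (a simplification with our own function names; the actual environments compute reward from simulator state):

```python
def sparse_cheetah_reward(x):
    """SparseHalfCheetah-style reward: 0 once at least 5 units forward, else -1."""
    return 0.0 if x >= 5.0 else -1.0

def checkpoint_reward(x, claimed, spacing=5.0):
    """CheckpointCheetah-style reward: +1 the first time each checkpoint
    (a multiple of `spacing`) is passed; `claimed` tracks visited checkpoints."""
    idx = int(x // spacing)
    if idx >= 1 and idx not in claimed:
        claimed.add(idx)
        return 1.0
    return 0.0
```

The first scheme rewards a whole region of states, while the second rewards only a handful of first-visit events, which is why state novelty alone is less efficient on CheckpointCheetah.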

Finally, we benchmark on three OpenAI Gym tasks, FetchPush, FetchSlide, and FetchPickAndPlace, originally developed for goal-directed exploration methods such as HER (Andrychowicz et al., 2017). In each of the three, the objective is to move a block to a target position, with a reward function returning -1 if the block is not at the target and 0 if it is. We trained each method with 5 random seeds for 500,000 timesteps on the cheetah tasks and 1,000,000 timesteps on the Fetch tasks (1,000 episodes for the cheetah tasks, 20,000 episodes for the Fetch tasks). For consistency, we structured the reward function of the SparseHalfCheetah task to match the Fetch tasks, such that the baseline reward level is -1 while a successful state provides 0 reward.

3.2 Exploration Benchmark Performance

(a) SparseHalfCheetah
(b) CheckpointCheetah
(c) FetchPush-v1
(d) FetchPickAndPlace-v1
Figure 4: Performance of QXplore on various benchmarks, compared with RND and ε-greedy sampling. Note that due to parallel rollouts of πQ and πQx, QXplore samples twice as many trajectories as the other methods, but performs the same number of training iterations. QXplore performs consistently better due to faster exploration and the separation of exploration and reward-maximization objectives.

We show the performance of each method on each task in Figure 4. QXplore outperforms RND and ε-greedy on both cheetah tasks. This is consistent with our belief that TD-error allows exploration to focus on reward-relevant regions of the state space, leading to faster convergence. QXplore also achieves improved performance in only 20,000 episodes on the Fetch tasks, which took 30,000-50,000 episodes with HER (Andrychowicz et al., 2017), while RND and ε-greedy both fail to make progress. While the comparison is not apples-to-apples due to differing baseline methods (TRPO (Schulman et al., 2015) versus Q-learning), the results reported by EMI on the SparseHalfCheetah task for both their method and EX2 (Fu et al., 2017) are similar to QXplore's performance, but require 10 times more iterations (Kim et al., 2018).

Qualitatively, on SparseHalfCheetah we observe some interesting behaviour late in training. After initially converging to obtain reward consistently, Qx appears to get "bored" during the last ~200 episodes and will try to move closer to the 5-unit reward threshold, often jumping back and forth across it during an episode, which results in reduced reward but higher TD-error. This behaviour is distinctive of TD-error seeking over state-novelty seeking, as such states are not novel compared to moving further past the threshold, but do result in higher TD-error.

3.3 Robustness

As RL tasks are highly heterogeneous, and good parameterization and performance can be hard to obtain in practice for many methods (Henderson et al., 2018), we provide sweeps over core hyperparameters of QXplore on SparseHalfCheetah to demonstrate some degree of robustness, and thus a hope for good behaviour on other tasks without extensive parameter tuning. The parameters we consider are the learning rates of Q and Qx, as well as the ratio of experiences collected using each Q-function used to train each Q-function (self-data versus other-data). We found that while QXplore is fairly sensitive to learning rate, keeping the learning rates of Q and Qx the same works well. We also found that performance is surprisingly invariant to the on/off-policy data ratio, including when Q is trained entirely off-policy on data collected by Qx, suggesting that the data collected by Qx is sufficient to train a policy that maximizes reward decently well even without directly observing the reward function. Figures and further details can be found in the supplement.

3.4 TD-Error Versus Reward Prediction Error

While Qx is described for TD-error reward, we ran several tests with γ set to 0, such that Qx simply predicts rx for the current step instead of acting as a Q-function, shown in Figure 5. Reward-error reward for Qx still produces the same state-novelty fallback behaviour in the absence of reward, but provides less information than TD-errors do and does not allow us to use Qx as an optimal Q-function once trained. We tested this variant on SparseHalfCheetah and FetchPush (where Qx does not consistently find reward but does provide data to train Q), and observe that while it still finds reward in the case of SparseHalfCheetah, it takes consistently longer to converge for all runs, demonstrating that TD-error makes learning easier in sparse reward environments.

(a) SparseHalfCheetah
(b) FetchPush-v1
Figure 5: Performance of QXplore with γ > 0 (QXplore) and γ = 0 (QXplore-near). In QXplore-near, Qx is trained with a γ of 0 and only predicts the reward of the current state-action pair. While QXplore-near is capable of finding reward through the state-novelty fallback described in Section 2.5, it converges much more slowly.

4 Discussion and Conclusions

Here, we have described a new source of exploration signal in reinforcement learning, inspired by animal behaviour in response to dopamine signals in the brain, which correlate with the TD-error of the value function. We instantiate a reward function using TD-error and show that, when combined with neural network function approximation, it is sufficient to discover solutions to challenging exploration tasks in much less time than recent state novelty-based exploration methods. TD-error has different advantages and disadvantages for exploration compared to state prediction, and we hope that our results can spur further work on diverse exploration signals in RL.

It is worth noting that there may be additional benefits provided by Qx when training Q in non-exploration contexts. Maximizing TD-error can be seen as a form of hard example mining, and for complex tasks could result in better generalization behaviour. In contrast to our work, minimizing TD-error has previously been used to constrain exploration in environments such as helicopter control, where the cost of risky behavior can be high (Gehring and Precup, 2013).

Stochastic reward sources serve as attractors for Qx, since unpredictable noise places a lower bound on the TD-error for those states, similar to the "Noisy-TV" problem described for state prediction methods (Burda et al., 2019b). In adversarial environments, this can prevent exploration, depending on the variance of the reward relative to the total TD-error. Stochastic policies also introduce some measure of unpredictable noise in the target Q-values, resulting in higher TD-error. Empirically, these factors do not prevent learning for the tasks tested here, and tasks that are adversarial in this way may be uncommon in practice (as highly noisy reward sources are undesirable for practical reasons), but we hope to address this issue in future work through variance estimation and rx rescaling, or a signed TD-error which preserves the mean TD-error at zero.

We emphasize that we have only scratched the surface in terms of TD-error exploration here. Our instantiation makes several assumptions, such as the use of off-policy Q-learning, the use of two sampling policies, and the use of unsigned TD-error which is positive for both under-prediction and over-prediction of future rewards. We experimented briefly with a signed TD-error but more work remains to be done in this direction. In addition, explicitly combining TD-error rewards with extrinsic rewards is also worth investigating, as the two objectives are partially correlated for most tasks.


Appendix A: Hyperparameters

CEM Policy
      iterations 4
      number of samples 64
      top k 6
All Networks
      neurons per layer 256
      number of layers 3
      non-linearities ReLU
      optimizer Adam
      Adam momentum terms
      Q learning rate 0.001
      batch size 128
      time decay (γ) 0.99
      target Q-function update rate (τ) 0.005
      target update frequency 2
      TD3 policy noise 0.2
      TD3 noise clip 0.5
      training steps per env timestep 1
      Qx learning rate 0.001
      Q batch self-data ratio 0.75
      Qx batch self-data ratio 0.75
      RND predictor network learning rate 0.001

Table 1: Parameters used for benchmark runs.

We present the parameters we used for the benchmark tasks in Table 1.

Appendix B: Parameter Sweeps

We performed two sets of parameter sweeps for QXplore: varying the learning rates of Q and Qx, and varying the ratios of data sampled by each Q-function's policy used in training batches for each method. For learning rate, we tested combinations of (QLR, QxLR) = (0.01, 0.01), (0.01, 0.001), (0.001, 0.01), (0.001, 0.001), (0.001, 0.0001), (0.0001, 0.001), and (0.0001, 0.0001).

For batch data ratios, we tested combinations (specified as the self-fraction for Q, then the self-fraction for Qx) of (0, 1), (0.25, 0.75), (0.5, 0.5), and (0.75, 0.25).

Results for these sweeps can be seen in Figures 6 and 7. QXplore is sensitive to learning rate, but relatively robust to the training data mix, to the point of training strictly off-policy with only modest performance loss.

(a) SparseHalfCheetah
(b) FetchPush-v1
Figure 6: Learning rate sweeps for Q and Qx
(a) SparseHalfCheetah
(b) FetchPush-v1
Figure 7: Sample ratio sweeps for Q and Qx

Appendix C: Unsuccessful Tasks

As part of our experiments, we attempted to train QXplore (and competing methods) on two extra tasks: FetchSlide-v1 and SwimmerGather-v1. We were unable to get any of the methods to train on them, but we believe this is because we trained for many fewer iterations and episodes than reported in previous work, and our lack of performance is consistent with other results at that time scale. See Figure 8.

(a) FetchSlide-v1
(b) SwimmerGather-v1
Figure 8: Tasks which we attempted on which no method trained successfully.