
Efficient Eligibility Traces for Deep Reinforcement Learning

Eligibility traces are an effective technique to accelerate reinforcement learning by smoothly assigning credit to recently visited states. However, their online implementation is incompatible with modern deep reinforcement learning algorithms, which rely heavily on i.i.d. training data and offline learning. We utilize an efficient, recursive method for computing λ-returns offline that can provide the benefits of eligibility traces to any value-estimation or actor-critic method. We demonstrate how our method can be combined with DQN, DRQN, and A3C to greatly enhance the learning speed of these algorithms when playing Atari 2600 games, even under partial observability. Our results indicate several-fold improvements to sample efficiency on Seaquest and Q*bert. We expect similar results for other algorithms and domains not considered here, including those with continuous actions.





1 Introduction

Eligibility traces [1, 15, 34] have been a historically successful approach to the credit assignment problem in reinforcement learning. By applying time-decaying 1-step updates to previously visited states, eligibility traces provide an efficient and online mechanism for generating the λ-return at each timestep [32]. The λ-return, equivalent to an exponential average of all n-step returns [36], interpolates between low-variance (temporal-difference [31]) and low-bias (Monte Carlo) returns and often yields notably faster empirical convergence. Eligibility traces can be effective when the reward signal is sparse or the environment is partially observable.

More recently, deep reinforcement learning has shown promise on a variety of high-dimensional tasks such as Atari 2600 games [22], Go [30], Doom [16], 3D maze navigation [20], and robotic locomotion [6, 10, 17, 18, 26]. While these methods could theoretically benefit from eligibility traces [33], they utilize offline learning schemes that render them fundamentally incompatible. This is primarily because temporally successive states are often highly correlated, but successfully training the neural network requires independent and identically distributed (i.i.d.) training data to prevent overfitting. Circumventing this issue requires unconventional solutions; for example, DQN [22], DDPG [19], and ACER [35] perform gradient updates on randomly-sampled past experience. Asynchronous methods like A3C [21] aggregate parameter updates across environment instances in a non-deterministic order. Policy gradient methods such as TRPO [27], PPO [29], and ACKTR [37] alternate between trajectory sampling and offline, gradient-based policy improvement. These strategies represent a marked departure from the incremental learning of classic temporal-difference methods. Consequently, the benefits of eligibility traces and the λ-return have been largely unexplored in the context of deep reinforcement learning.

In this paper, we propose a general strategy for reconciling the λ-return with deep reinforcement learning. We begin with an efficient, recursive technique for computing an entire sequence of λ-returns offline in linear time with respect to its length, as opposed to the quadratic time complexity of the traditional definition. This formulation enables the fast calculation of long λ-return sequences and is ideal for the offline learning schemes prevalent in state-of-the-art deep reinforcement learning. We demonstrate how this technique can be incorporated into DQN, DRQN [9], and A3C to significantly increase the sample efficiency of these algorithms (with necessary modifications) on Atari 2600 games, even when the complete state information is unavailable. In environments such as Seaquest and Q*bert, our method can increase the final score by nearly 300% after training for the same duration. Our methodology is general enough to be adapted for other value-based or actor-critic methods beyond those described here, including those with continuous action spaces.

2 Background

Reinforcement learning is the problem where an agent must interact with an unknown environment through trial-and-error in order to maximize its cumulative reward [32]. We first consider the standard setting where the environment can be formulated as a Markov Decision Process (MDP) defined by the 4-tuple $(\mathcal{S}, \mathcal{A}, P, R)$. At a given timestep $t$, the environment exists in state $s_t \in \mathcal{S}$. The agent takes an action $a_t \in \mathcal{A}$ according to policy $\pi(a_t \mid s_t)$, causing the environment to transition to a new state $s_{t+1}$ and yield a reward $r_t$. Hence, the agent’s goal can be formalized as finding a policy that maximizes the expected discounted return $\mathbb{E}_\pi\left[\sum_{i=0}^{H} \gamma^i r_i\right]$ up to some horizon $H$. The discount $\gamma \in [0, 1]$ affects the relative importance of future rewards and allows the sum to converge in the case where $H \to \infty$, $\gamma < 1$. An important property of the MDP is that every state satisfies the Markov property; that is, the agent needs to consider only the current state when selecting an action in order to perform optimally.

In reality, most problems of interest violate the Markov property. Information presently accessible to the agent may be incomplete or otherwise unreliable, and therefore is no longer a sufficient statistic for the environment’s history [12]. We can extend our previous formulation to the more general case of the Partially Observable Markov Decision Process (POMDP) defined by the 6-tuple $(\mathcal{S}, \mathcal{A}, P, R, \Omega, O)$. At a given timestep $t$, the environment exists in state $s_t$ and reveals observation $o_t \in \Omega$. The agent takes an action $a_t$ according to policy $\pi(a_t \mid o_0, \dots, o_t)$ and receives a reward $r_t$, causing the environment to transition to a new state $s_{t+1}$. In this setting, the agent may need to consider arbitrarily long sequences of past observations when selecting actions in order to perform well. (Footnote 1: To achieve optimality, the policy must additionally consider the action history in general, i.e. $\pi(a_t \mid o_0, a_0, \dots, o_{t-1}, a_{t-1}, o_t)$. Our theory presented here is straightforward to extend to this case.)

We can mathematically unify the MDP and POMDP by introducing the notion of an approximate state $\hat{s}_t = f(o_0, \dots, o_t)$, where $f$ is an arbitrary transformation of the observation history. In practice, $f$ might consider only a subset of the history, or even just the most recent observation. This allows for the identical treatment of the MDP and POMDP by generalizing the notion of a Bellman backup, and greatly simplifies the following sections. However, it is important to emphasize that $\hat{s}_t \neq s_t$ except in the strict case of the MDP, and that the choice of $f$ can otherwise affect the solution quality.
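As a concrete illustration of one possible history transformation, the sketch below builds an approximate state from the most recent `k` observations. The class name `FrameStack` and the choice to pad the start of an episode by repeating the first observation are assumptions made for this example, not details taken from the paper.

```python
from collections import deque


class FrameStack:
    """Hypothetical history transformation: keep only the k most recent observations."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # old observations fall off the left automatically

    def reset(self, first_observation):
        # Assumption: pad the history by repeating the first observation k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_observation)
        return tuple(self.frames)

    def step(self, observation):
        # Append the newest observation and return the approximate state.
        self.frames.append(observation)
        return tuple(self.frames)
```

With `k = 1` this reduces to using only the latest observation; larger `k` trades memory for a longer view of the history.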

2.1 Eligibility traces

Value-based reinforcement learning algorithms seek to produce an accurate estimate $V(\hat{s}_t)$ of the expected discounted return achieved by following policy $\pi$ from state $\hat{s}_t$. Suppose the agent acts according to $\pi$ and experiences the finite trajectory $\hat{s}_t, a_t, r_t, \hat{s}_{t+1}, \dots, \hat{s}_T$. The 1-step temporal-difference update can be used to improve the estimate: $V(\hat{s}_t) \leftarrow V(\hat{s}_t) + \alpha\,[r_t + \gamma V(\hat{s}_{t+1}) - V(\hat{s}_t)]$, where $\alpha$ is the learning rate controlling the magnitude of the update. The primary drawback of this update is that only the current reward $r_t$ directly affects it; future rewards must influence the estimate indirectly through $V(\hat{s}_{t+1})$, which can result in slow learning. The update may also suffer from estimation bias. To increase the immediate sensitivity to future rewards, and to decrease the bias, $n$-step updates can be performed instead: $V(\hat{s}_t) \leftarrow V(\hat{s}_t) + \alpha\,[R_t^{(n)} - V(\hat{s}_t)]$, where $R_t^{(n)}$ is the $n$-step return. (Footnote 2: The $n$-step return ($n \geq 1$) is defined by $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(\hat{s}_{t+n})$. If $\hat{s}_{t+n}$ is terminal, then $V(\hat{s}_{t+n}) = 0$ by definition and the $n$-step return is equivalent to the Monte Carlo return.) When $n$ is large, the $n$-step return simultaneously considers many rewards and can more rapidly assign credit. On the other hand, the combination of these rewards can have higher variance and require more samples to converge to the true expectation. This tradeoff can be more effectively balanced by averaging $n$-step returns [32]. The λ-return is defined as the exponential average of all $n$-step returns:

$$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{N-1} \lambda^{n-1} R_t^{(n)} + \lambda^{N-1} R_t^{(N)} \tag{1}$$

where $\lambda \in [0, 1]$ is a hyperparameter that controls the decay rate and $N = T - t$. The final $N$-step return receives the total weight of all hypothetical returns longer than it. When $\lambda = 0$, Equation (1) reduces to the 1-step temporal-difference return. When $\lambda = 1$ and $\hat{s}_T$ is terminal, Equation (1) reduces to the Monte Carlo return. The λ-return can thus be seen as a smooth interpolation between these methods. Furthermore, the monotonically decreasing weights can be interpreted as a specific form of credit assignment relying only on the reasonable assumption that recent states are likelier to have contributed to a given reward.
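To make the forward-view definition concrete, here is a minimal Python sketch that evaluates the λ-return directly as a weighted average of n-step returns. It assumes the usual truncated form, in which the longest available return absorbs the leftover weight; the function names and the list-based layout of `rewards` and `values` are illustrative choices, not the authors' code.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards plus a bootstrapped value.

    rewards[k] is r_k and values[k] estimates V(s_k), so len(values) == len(rewards) + 1.
    A terminal final state is represented by values[-1] == 0.0.
    """
    discounted = sum(gamma ** i * rewards[t + i] for i in range(n))
    return discounted + gamma ** n * values[t + n]


def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return at time t, truncated at the end of the trajectory."""
    N = len(rewards) - t  # number of n-step returns available from time t
    weighted = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, N)
    )
    # The longest return receives all of the remaining weight, lam ** (N - 1).
    return weighted + lam ** (N - 1) * n_step_return(rewards, values, t, N, gamma)
```

Setting `lam=0.0` recovers the 1-step return and `lam=1.0` recovers the longest return, which is the Monte Carlo return when the final state is terminal. Each call costs time quadratic in the remaining trajectory length because every n-step return is recomputed from scratch.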

The λ-return presented here is the "forward view" [32], meaning its calculation must be delayed and conducted offline in practice. It is also expensive to compute, which we discuss further in Section 3. The λ-return can be more efficiently implemented in the "backward view" [32] using eligibility traces to gradually produce the return at each timestep. Although the backward view is generally applicable to function approximators [33], modern deep reinforcement learning algorithms do not learn incrementally and cannot use this technique. In the next sections, we illustrate this incompatibility through the examples of DQN and A3C. Later, we discuss a more efficient forward-view calculation that is practical for deep reinforcement learning.

2.2 Deep Q-Network

Deep Q-Network (DQN) was one of the first notable successes of deep reinforcement learning, achieving human-level performance on Atari 2600 games using only the screen pixels as input [22]. DQN can be viewed as the deep-learning analog of Q-Learning [36], in which the estimate $Q(s_t, a_t)$ of the expected greedy return achieved after taking action $a_t$ from state $s_t$ is updated incrementally: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]$. Because maintaining tabular information for every state-action pair is not feasible for large state spaces, DQN instead learns a parameterized function $Q(\hat{s}_t, a_t; \theta)$ (implemented as a deep neural network) to generalize over states. Unfortunately, directly updating $\theta$ according to a gradient-based version of the Q-Learning update does not work well. Training samples must be i.i.d. to prevent the neural network from overfitting and performing poorly, but sequentially collected experience is highly correlated. To overcome this, transitions $(\hat{s}_t, a_t, r_t, \hat{s}_{t+1})$ are stored in a replay memory $D$ and gradient descent is performed on uniformly-sampled minibatches of past experience. Hence, DQN becomes a minimization problem where the following loss is iteratively approximated and reduced:

$$L(\theta) = \mathbb{E}_{(\hat{s}, a, r, \hat{s}') \sim D}\left[\left(r + \gamma \max_{a'} Q(\hat{s}', a'; \theta^-) - Q(\hat{s}, a; \theta)\right)^2\right]$$

The parameters $\theta^-$ are a stale copy of $\theta$ that helps prevent oscillations or divergence of $Q$. Unfortunately, randomly sampling experience in this manner does not permit eligibility traces, which require online and incremental learning. Another limitation of DQN is its assumption that the input is Markovian. Because Atari 2600 games are partially observable given a single game frame, the four most-recent observations were concatenated together to form an approximate state [22]. This technique does not scale well for other domains where arbitrarily distant past observations may be necessary. DQN can use a recurrent neural network (RNN) to more effectively address partial observability, producing a variant called Deep Recurrent Q-Network (DRQN) [9]. Pseudocode for DQN (and DRQN) is provided in the Supplementary Material.
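The 1-step targets and squared-error loss that DQN minimizes on a sampled minibatch can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic: the function names are hypothetical, the action values are passed in as ordinary lists rather than produced by a network, and terminal next states are assumed to contribute no bootstrap term.

```python
def dqn_targets(rewards, next_q_values, terminals, gamma):
    """One-step Q-learning targets: r + gamma * max_a' Q(s', a') from the stale network.

    next_q_values[j] holds the stale (target) network's action values for the
    j-th next state. A terminal next state has no bootstrap term.
    """
    return [
        r if done else r + gamma * max(q)
        for r, q, done in zip(rewards, next_q_values, terminals)
    ]


def mean_squared_error(targets, predictions):
    """Mean squared error between the targets and the online network's Q(s, a)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)
```

In an actual agent the minibatch would be drawn uniformly from the replay memory and the loss would be differentiated with respect to the online parameters only.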

2.3 Asynchronous Advantage Actor-Critic

Asynchronous Advantage Actor-Critic (A3C) is a multi-threaded actor-critic method [21]. Unlike DQN, which directly estimates action values, A3C iteratively improves a parameterized, stochastic policy. This is accomplished by alternately sampling a short trajectory from the environment and updating the policy according to the estimated advantage of each state-action pair. If the trajectory begins from state $\hat{s}_t$, then its length $N$ is the number of timesteps from time $t$ until episode termination, or a fixed hyperparameter $t_{max}$, whichever is smaller. Thus, the advantage of the state-action pair $(\hat{s}_k, a_k)$ in the trajectory is calculated by $A(\hat{s}_k, a_k) = R_k^{(t+N-k)} - V(\hat{s}_k; \theta_v)$. The vectors $\theta$ and $\theta_v$ parameterize the policy $\pi$ and the value function $V$, respectively. While these vectors are treated separately for formality, in practice both are implemented as a single neural network sharing all parameters except those in the final layer. Upon completion of a trajectory, the policy and value function are updated:

$$\theta \leftarrow \theta + \alpha \sum_{k=t}^{t+N-1} \nabla_\theta \log \pi(a_k \mid \hat{s}_k; \theta)\, A(\hat{s}_k, a_k) \tag{2}$$

$$\theta_v \leftarrow \theta_v - \alpha \sum_{k=t}^{t+N-1} \nabla_{\theta_v} A(\hat{s}_k, a_k)^2 \tag{3}$$

Equation (2) improves the policy by altering action log-likelihoods in proportion to their advantages. Equation (3) improves the value function by reducing the squared error between the actual return and the expected return. As before, conducting these updates sequentially would result in poor performance due to the strong correlation between successive states. Rather than using a replay memory, separate actors operate on distinct instances of the environment in parallel. Each actor asynchronously updates the same global parameter vectors $\theta$ and $\theta_v$. The stochasticity of the policy and environment helps to decorrelate the gradients. However, it is this asynchronous framework that precludes the usage of eligibility traces, as gradients are interleaved from independent environment instances in a non-deterministic order. Pseudocode for A3C is provided in the Supplementary Material.
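The per-step advantages for one sampled trajectory can be computed in a single backward pass, since each n-step return is the immediate reward plus the discounted return of the following step. The sketch below assumes the standard A3C convention of bootstrapping from the value of the state that follows the last step (zero if that state is terminal); the function name and argument layout are illustrative.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """Advantages R - V(s_k) for each step k of a short trajectory.

    values[k] is the critic's estimate for the k-th visited state, and
    bootstrap_value is the estimate for the state after the final step
    (0.0 if the episode terminated there).
    """
    running_return = bootstrap_value
    advantages = [0.0] * len(rewards)
    for k in reversed(range(len(rewards))):
        # The return at step k reuses the return already computed for step k + 1.
        running_return = rewards[k] + gamma * running_return
        advantages[k] = running_return - values[k]
    return advantages
```

The first step of the trajectory therefore uses the longest return and the last step uses a 1-step return.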

3 Sample-efficient learning with λ-returns

Equation (1) provides theoretical guidance on how the forward-view λ-return is constructed. However, it is often the case with deep reinforcement learning methods that a sequence of λ-returns needs to be calculated, one for every state along a trajectory. Computing Equation (1) repeatedly for each state in an $N$-step trajectory would require roughly $O(N^2)$ operations. While this may be feasible for short trajectories, calculating the returns for an arbitrarily long trajectory becomes impractical. (Footnote 3: For this reason, the λ-return calculation is often truncated in practice, but this introduces error and prohibits λ-values very close to 1. The efficient formulation in Equation (4) eliminates the need for truncation.) Alternatively, given the full trajectory, the λ-returns can be calculated backwards more efficiently using recursion:

$$R_t^\lambda = R_t^{(1)} + \gamma\lambda\left[R_{t+1}^\lambda - V(\hat{s}_{t+1})\right] \tag{4}$$

The recursion starts at the end of the trajectory, where the bracketed term vanishes and the final λ-return equals the 1-step return. This formulation has been used in prior work (e.g. [5, 24]), but not in the context of deep reinforcement learning. We include a derivation in the Supplementary Material. Equation (4) provides a compact update rule for calculating $R_t^\lambda$ given $R_{t+1}^\lambda$ in a constant number of operations. Therefore, the entire sequence of λ-returns can be computed with $O(N)$ time complexity, implying Equation (1) is asymptotically suboptimal. This is crucial for deep reinforcement learning, where returns may need to be estimated along extremely lengthy trajectories. As an illuminating example, DQN(λ) presented in the next section must periodically update λ-returns stored in its replay memory, which has a typical capacity of one million transitions. Such an operation would not be practical using Equation (1). Another consequence of Equation (4) is that truncation of the λ-return calculation is unnecessary, as doing so no longer reduces the total runtime. As such, exact calculation can be conducted for any value of λ, whereas values arbitrarily close to 1 previously incurred intolerable truncation error.
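A minimal Python sketch of the backward recursion follows. It assumes the recursive form in which each λ-return equals the 1-step return plus a discounted, λ-weighted correction from the next step's λ-return, and it seeds the recursion so that this correction is zero at the final transition. The function name and the list-based inputs are illustrative assumptions.

```python
def lambda_returns(rewards, values, gamma, lam):
    """All lambda-returns along a trajectory in one backward pass (linear time).

    rewards[t] is r_t and values[t] estimates V(s_t), with
    len(values) == len(rewards) + 1; use values[-1] == 0.0 for a terminal state.
    """
    T = len(rewards)
    returns = [0.0] * T
    # Seed with V(s_T) so the correction term vanishes for the last transition.
    next_return = values[T]
    for t in reversed(range(T)):
        one_step = rewards[t] + gamma * values[t + 1]
        returns[t] = one_step + gamma * lam * (next_return - values[t + 1])
        next_return = returns[t]
    return returns
```

Each iteration does a constant amount of work, so a trajectory of a million transitions needs on the order of a million operations rather than the square of that.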

3.1 DQN(λ)

We introduce a new algorithm called DQN(λ), presented in Algorithm 1, that combines the efficient λ-return calculation in Equation (4) with DQN. Because DQN randomly samples past experience from a replay memory, non-trivial changes to the algorithm are required. Our discussion thus far has been limited to state-value estimation; hence, we first redefine $V(\hat{s}_t) = \max_{a'} Q(\hat{s}_t, a'; \theta)$. With this new formulation, each $n$-step return becomes a sum of discounted rewards corrected by a final maximization over action values. This is equivalent to Peng’s Q(λ) [24]. This brings us to our principal modification of DQN: in addition to storing each reward $r_t$ in replay memory $D$, we store $R_t^\lambda$. Training becomes a matter of sampling a minibatch of precomputed λ-returns from $D$ and reducing the squared error. Of course, the calculation of $R_t^\lambda$ must be deferred because of its dependency on future states and rewards, so transitions are appended to an intermediate list $L$. When a terminal state is reached, the λ-returns are efficiently calculated along $L$ and then stored in $D$. (Footnote 4: We consider only episodic tasks in our work. A modification where the task is divided into "episodes" of arbitrary length, and all λ-returns bootstrap from the value function at the end, could accommodate continuing tasks.) The new loss function becomes the following:

$$L(\theta) = \mathbb{E}_{(\hat{s}_t, a_t, R_t^\lambda) \sim D}\left[\left(R_t^\lambda - Q(\hat{s}_t, a_t; \theta)\right)^2\right]$$

The final remaining challenge is that the λ-returns become outdated as $\theta$ changes, which greatly slows learning if the capacity of the replay memory is large. However, this presents an opportunity to eliminate the target network altogether. Rather than copying parameters $\theta$ to $\theta^-$, we periodically "refresh" all of the λ-returns using the present Q-function. This allows for significantly faster learning while simultaneously providing stable temporal-difference targets. We note that DQN(λ) is equivalent to DQN when $\lambda = 0$. When an RNN is used for the Q-function, we refer to Algorithm 1 as DRQN(λ).

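procedure refresh($L$)
     for transition $(\hat{s}_k, a_k, r_k, R_k, \hat{s}_{k+1}) \in L$ processing back-to-front do
         if $\hat{s}_{k+1}$ is terminal then
              $R_k \leftarrow r_k$
         else
              Get adjacent transition $(\hat{s}_{k+1}, a_{k+1}, r_{k+1}, R_{k+1}, \hat{s}_{k+2})$ from $L$
              $R_k \leftarrow r_k + \gamma \max_{a'} Q(\hat{s}_{k+1}, a'; \theta) + \gamma\lambda\,[R_{k+1} - \max_{a'} Q(\hat{s}_{k+1}, a'; \theta)]$
         end if
     end for
end procedure
Initialize replay memory $D$ to its fixed capacity, parameter vector $\theta$ randomly
Initialize state $\hat{s}_0$, episode start, transition list $L$ (empty)
for each timestep $t$ do
     if the refresh period has elapsed then refresh($D$) end if
     Execute action $a_t$
     Receive reward $r_t$ and new observation $o_{t+1}$
     Approximate state $\hat{s}_{t+1}$ from the observation history using $f$
     Append transition $(\hat{s}_t, a_t, r_t, R_t, \hat{s}_{t+1})$ to $L$ // Set $R_t$ arbitrarily; it will be updated upon episode termination
     if $\hat{s}_{t+1}$ is terminal then
         refresh($L$); store $L$ in $D$; reset $L$ and the environment
     end if
     Sample random minibatch of transitions $(\hat{s}_j, a_j, r_j, R_j, \hat{s}_{j+1})$ from $D$
     Improve Q-function with a gradient step on $(R_j - Q(\hat{s}_j, a_j; \theta))^2$
end for
Algorithm 1 DQN(λ)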
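The refresh step can be sketched in Python as a single backward sweep over stored transitions. This is a hedged illustration rather than the authors' implementation: the `Transition` dataclass, the `max_q` callable (standing in for the maximum action value under the current Q-function), and the assumption that transitions are stored in chronological order are all choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Transition:
    """Hypothetical replay record that stores the lambda-return next to the reward."""
    state: Sequence[float]
    action: int
    reward: float
    lambda_return: float
    next_state: Sequence[float]
    terminal: bool


def refresh(transitions: List[Transition],
            max_q: Callable[[Sequence[float]], float],
            gamma: float, lam: float) -> None:
    """Recompute every stored lambda-return in place, sweeping back-to-front.

    Assumes transitions[k + 1] is the successor of transitions[k] unless
    transitions[k] ends an episode.
    """
    for k in reversed(range(len(transitions))):
        tr = transitions[k]
        if tr.terminal:
            tr.lambda_return = tr.reward  # nothing to bootstrap from
            continue
        bootstrap = max_q(tr.next_state)
        if k + 1 < len(transitions):
            next_return = transitions[k + 1].lambda_return
        else:
            next_return = bootstrap  # no stored successor: fall back to the 1-step return
        tr.lambda_return = (tr.reward + gamma * bootstrap
                            + gamma * lam * (next_return - bootstrap))
```

Because the sweep visits each transition once, refreshing the whole memory costs time linear in its size, which is what makes replacing the target network with periodic refreshes affordable.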

3.2 A3C(λ)

We now introduce A3C(λ), shown in Algorithm 2, which combines the efficient λ-return calculation and A3C. We begin by defining the advantage only for the critic to incorporate the λ-return: $A^\lambda(\hat{s}_k, a_k) = R_k^\lambda - V(\hat{s}_k; \theta_v)$. As with the $n$-step returns before, the λ-return is computed up to an episode boundary or $t_{max}$, whichever comes first. The loss function in Equation (3) operates identically on this new advantage, and Equation (2) remains unchanged. The efficient λ-return calculation has no impact on runtime because the $n$-step returns could previously be calculated recursively as well. However, the λ-return enables larger values of $t_{max}$ due to its variance-reducing properties. A3C(λ) reduces to A3C when $\lambda = 1$.
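The split between the actor's and the critic's advantage can be illustrated with one backward pass that tracks both kinds of return at once. The sketch assumes that the actor keeps the ordinary n-step advantage while the critic uses the λ-return, as described above; the function name and the bootstrap convention (value of the state after the last step, zero if terminal) are assumptions of this example.

```python
def a3c_lambda_advantages(rewards, values, bootstrap_value, gamma, lam):
    """Return (actor_advantages, critic_advantages) for one sampled trajectory."""
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    n_step = bootstrap_value      # running n-step return (actor)
    lam_return = bootstrap_value  # running lambda-return (critic)
    actor = [0.0] * n
    critic = [0.0] * n
    for k in reversed(range(n)):
        n_step = rewards[k] + gamma * n_step
        one_step = rewards[k] + gamma * next_values[k]
        lam_return = one_step + gamma * lam * (lam_return - next_values[k])
        actor[k] = n_step - values[k]
        critic[k] = lam_return - values[k]
    return actor, critic
```

With `lam=1.0` the two lists coincide, which matches the statement that the method reduces to A3C in that case.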

4 Related work

The λ-return has been used in prior work to improve the sample efficiency of DRQN for Atari 2600 games [7]. Because RNNs produce a sequence of action values during truncated backpropagation through time, these precomputed values can be exploited to calculate λ-returns with little computational expense over standard DRQN. The problem with this approach is its lack of generality; the Q-function is restricted to RNNs and the length over which the λ-return calculation is considered is constrained to the length of the training sequence. This either prohibits λ-values close to 1 (where truncation bias would be significant) or requires that the training sequences become unacceptably long. In the latter case, computation time scales unfavorably and training issues with exploding and vanishing gradients can occur [3]. The λ-return must also be calculated on every training step, even when the input sequence and target network do not change. In contrast, our proposed DQN(λ) with recursive λ-return calculation is more efficient and makes no assumptions about the Q-function parameterization. By only periodically updating λ-returns stored in the replay memory, we avoid repeated calculations and eliminate the need for a target network altogether. This strategy provides additional flexibility by decoupling the training sequence length from the λ-return length.

Generalized advantage estimation (GAE) [28] is a related method for reducing the variance of actor-critic updates at the expense of increased bias. GAE computes an exponential average of $n$-step advantage estimators along a trajectory, which is mathematically equivalent to A3C(λ)’s advantage estimate where the value function baseline is subtracted from the λ-return. However, it is opposite to our approach in the sense that the λ-estimator is used to determine the actor gradient (policy improvement), whereas we use it to determine the critic gradient (value estimation). In our experiments, we found the latter to work better for A3C, though this may not be true for actor-critic methods in general. The two strategies are not mutually exclusive, and could be combined to create a continuum of actor-critic algorithms with two independent λ-parameters.
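The stated equivalence can be checked numerically. The sketch below computes GAE in its usual form, as a discounted sum of temporal-difference errors with decay γλ, truncated at the end of the trajectory; the function name and input layout are illustrative assumptions.

```python
def gae(rewards, values, bootstrap_value, gamma, lam):
    """Generalized advantage estimates via a backward sum of TD errors.

    values[k] is V(s_k) for each visited state; bootstrap_value is V of the
    state after the final step (0.0 if terminal).
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    advantages = [0.0] * n
    running = 0.0
    for k in reversed(range(n)):
        delta = rewards[k] + gamma * next_values[k] - values[k]  # 1-step TD error
        running = delta + gamma * lam * running
        advantages[k] = running
    return advantages
```

Adding `values[k]` back to each estimate yields the truncated λ-return for step `k`, so the two quantities differ only by the baseline.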

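// Assume global shared parameter vectors $\theta$ and $\theta_v$ and global shared counter
// Recall that $\hat{s}_t$ is the approximate state computed from the observation history (Section 2)
repeat
     Reset local counter
     Reset state
     repeat
         Execute action $a_t \sim \pi(\cdot \mid \hat{s}_t; \theta)$
         Receive reward $r_t$ and new observation $o_{t+1}$
         Approximate state $\hat{s}_{t+1}$
     until $\hat{s}_{t+1}$ is terminal or $t_{max}$ steps have been taken
     for each state-action pair $(\hat{s}_k, a_k)$ in the trajectory, back-to-front do
         Improve actor using Equation (2)
         Improve critic using Equation (3) with the λ-return advantage
     end for
until the global counter reaches its limit
Algorithm 2 A3C(λ)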

5 Experiments

In order to characterize the performance benefits of λ-returns when combined with deep reinforcement learning, we conducted numerous experiments on a subset of the Atari 2600 games. Our primary goal was to evaluate the absolute sample efficiency of DQN(λ), DRQN(λ), and A3C(λ) under different conditions with respect to the observability of the environment. Specifically, we tested DQN(λ) and A3C(λ) with both 1-frame and 4-frame inputs and DRQN(λ) trained on sequences of length 4. These can be seen as distinct instantiations of the history transformation $f$ discussed in Section 2.

We chose four of the Atari 2600 games for our experiments: Pong, Breakout, Seaquest, and Q*bert. We used the OpenAI Gym [4] to provide an interface to the Arcade Learning Environment [2], where observations consisted of the raw game frame pixels. For each experiment, we compared the unaltered algorithms against their respective λ-variants with three different values of λ. We did not systematically tune these, and it is likely better values exist for these environments. All game frames were subjected to the same preprocessing steps described in [22]. In addition to these procedures, we converted the frames to grayscale and normalized their intensity values between 0 and 1. For comparison, we used the same convolutional neural network from [22] for DQN(λ) and A3C(λ). We replaced the penultimate fully connected layer with a 512-unit LSTM [11] for DRQN(λ) to match the architecture in [9]. All agents were trained for ten million timesteps.

During training of DQN(λ) and DRQN(λ), exploration was linearly annealed from 1 to 0.1 over the first one million timesteps and then to 0.05 by the end of training. We used Adam optimization [14]. All other hyperparameters matched those in [22]. For A3C(λ), we utilized 16 actors. We used Adam optimization with the same hyperparameters as those for DQN(λ). We also added an entropy bonus to the objective in order to encourage exploration.
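The exploration schedule described above can be written as a small piecewise-linear function. The sketch assumes that the second phase (from 0.1 down to 0.05) is also linear and that training lasts ten million timesteps, as stated elsewhere in the experiments; the function and parameter names are hypothetical.

```python
def exploration_rate(t, anneal_steps=1_000_000, total_steps=10_000_000):
    """Piecewise-linear exploration schedule: 1.0 -> 0.1 -> 0.05."""
    if t < anneal_steps:
        # First phase: 1.0 down to 0.1 over the first anneal_steps timesteps.
        return 1.0 + (0.1 - 1.0) * t / anneal_steps
    # Second phase (assumed linear): 0.1 down to 0.05 by the end of training.
    fraction = min((t - anneal_steps) / (total_steps - anneal_steps), 1.0)
    return 0.1 + (0.05 - 0.1) * fraction
```

Clamping `fraction` at 1 keeps the rate at 0.05 if the function is called past the nominal end of training.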

Our experiments in Figure 1 indicate that DQN can benefit significantly from λ-returns, achieving sample-efficiency improvements ranging from two- to four-fold on Seaquest and Q*bert, with both 1-frame and 4-frame inputs. We note that the final scores achieved in these cases constitute human-level performance according to the definition in [22] after being trained for only one-fifth of the duration. This demonstrates that DQN(λ) can generate highly successful control policies with far fewer samples from the environment than DQN. All three values of λ that we tested performed similarly, and there was no obvious choice for the best value in general. DQN(λ) and DRQN(λ) results for Pong and Breakout are included in the Supplementary Material.

We note that DRQN(λ) performed nearly as well as DQN(λ) with 4-frame input, suggesting that the RNN is capable of producing meaningful state approximations. This is expected based on the findings in [9]. DRQN(λ) outperformed DRQN by one order of magnitude on Seaquest, and doubled the final score on Q*bert, indicating that recurrent Q-functions are also capable of benefiting significantly from λ-returns, as was studied in [7].

Similarly to DQN(λ), A3C(λ) saw the largest improvements on Seaquest and Q*bert (Figure 2). Sample efficiency in these cases was approximately doubled. We note that A3C(λ) generally performed worse than DQN(λ) on these tasks, appearing to converge to local optima; for example, the agent never learned to return to the surface for oxygen in Seaquest. However, the increased convergence speed to these local optima still shows that the λ-return can benefit A3C. These preliminary results suggest that DQN may be more sensitive than A3C to eligibility traces because of its direct value estimation of actions, but more experiments would be needed to conclusively state this. A3C(λ) results for Pong and Breakout (included in the Supplementary Material) also showed modest performance improvements in some cases.

Figure 1: Sample efficiency comparison of various λ-values for DQN(λ) with 1-frame, 4-frame, and recurrent inputs on Seaquest and Q*bert. Training consisted of 40 epochs of 250,000 timesteps each, for a total of 10 million timesteps. Results were averaged over 3 random seeds.

Figure 2: Sample efficiency comparison of various λ-values for A3C(λ) with 1-frame and 4-frame inputs on Seaquest and Q*bert. Training consisted of 40 epochs of 250,000 timesteps each, for a total of 10 million timesteps. Results were averaged over 5 random seeds.

6 Discussion and conclusion

Eligibility traces were historically successful in enhancing the empirical performance of temporal-difference methods. However, recent advances in reinforcement learning have departed from traditional, tabular-based return estimation in order to scale to previously intractable problems. This trend has been primarily driven by progress in neural networks, which require i.i.d. training data to avoid overfitting and necessitate offline learning schemes such as experience replay, asynchronous parameter updates, and Monte Carlo return estimation. Consequently, state-of-the-art methods are incompatible with the incremental learning needed for eligibility traces, and their expected performance improvements have eluded them.

We highlighted an offline, recursive method for calculating a sequence of λ-returns in linear time with respect to its length. This is highly useful for deep reinforcement learning algorithms, which often need to repeatedly estimate returns for every visited state in a long trajectory. The same procedure would require a quadratic number of operations with respect to length when using the traditional forward-view definition of the λ-return, and is far less practical for this purpose. We proposed significant modifications to DQN, DRQN, and A3C that enable incorporation of this fast λ-return calculation for enhanced performance. Our experiments on Seaquest and Q*bert show that these λ-variants can improve sample efficiency by a large factor, even when the complete state information is obscured.

We chose Peng’s Q(λ) for our λ-return calculation in DQN(λ), although many alternatives exist. One possibility is Watkins’s Q(λ) [36]. In contrast to Peng’s Q(λ), which unconditionally conducts Bellman backups using the maximizing action, Watkins’s Q(λ) terminates the λ-return calculation by setting $\lambda = 0$ whenever an exploratory action is taken. This ensures that only on-policy data with respect to the greedy policy is used. Watkins’s Q(λ) was recently proved to converge optimally in the tabular setting [23]. Unfortunately, terminating the returns slows learning when exploration is frequent and erases most of the benefit of using λ-returns early in the training process. A "naive" implementation could simply ignore this, but it is unclear what implications this would have for performance and convergence [32]. Peng’s Q(λ) similarly mixes on- and off-policy data in the return estimation and can yield better empirical performance; however, it has not been proved to converge. Because DQN is not guaranteed to converge optimally anyway as a consequence of its nonlinear function approximation, this is not necessarily crucial. An empirical comparison of these Q(λ) variants in a deep reinforcement learning setting would be an interesting avenue for future work.

It is important to note that when the behavior policy differs from the target policy, the λ-return will be a biased estimator for the expected discounted return. This is especially relevant to DQN(λ) because the replay memory collects samples obtained from a continuum of differing policies as exploration is annealed. Consequently, the return estimates may be slightly biased. Importance sampling is a standard technique for correcting bias, but it substantially increases variance too. Other methods have been proposed to more favorably balance bias correction with lower variance: for example, Tree Backup [25], Q*(λ) [8], and Retrace(λ) [23]. We did not consider bias correction in our work because it is orthogonal to DQN(λ); feasibly, any of these strategies could be incorporated into our algorithm and performance would be expected to improve. For A3C(λ), bias correction is unnecessary because the policy changes negligibly between sampling a trajectory and applying its corresponding gradient.

Our methodology described here is general enough to be adapted for any value-based or actor-critic method. We expect similar empirical results for other state-of-the-art deep reinforcement learning algorithms, and hope that this work serves as inspiration for combining λ-returns with them. For example, replay-based methods like DDPG and ACER could benefit substantially from our refresh procedure proposed for DQN(λ). TRPO, PPO, and ACKTR all estimate returns along trajectories in a parallel manner to A3C(λ) and could utilize λ-returns to improve value estimation of the critic. Further modifications to actor-critic methods could include incorporating GAE, discussed in Section 4, for potentially larger improvements to sample efficiency. These results would be especially useful in scenarios where samples are difficult to obtain, such as robotic systems acting in physical environments. Similarly, there exist interesting opportunities to substantially reduce computation time for expensive algorithms by decreasing the total number of gradient updates needed during training. This includes instances where very large neural networks are necessary, or where costly auxiliary optimization techniques are used in conjunction with backpropagation, or both (e.g. [13]).