1 Introduction
Eligibility traces [1, 15, 34] have historically been a successful approach to the credit assignment problem in reinforcement learning. By applying time-decaying 1-step updates to previously visited states, eligibility traces provide an efficient, online mechanism for generating the $\lambda$-return at each timestep [32]. The $\lambda$-return, equivalent to an exponential average of all $n$-step returns [36], interpolates between low-variance (temporal-difference [31]) and low-bias (Monte Carlo) returns and often yields notably faster empirical convergence. Eligibility traces can be especially effective when the reward signal is sparse or the environment is partially observable.
More recently, deep reinforcement learning has shown promise on a variety of high-dimensional tasks such as Atari 2600 games [22], Go [30], Doom [16], 3D maze navigation [20], and robotic locomotion [6, 10, 17, 18, 26]. While these methods could theoretically benefit from eligibility traces [33], they rely on offline learning schemes that render them fundamentally incompatible. This is primarily because temporally successive states are often highly correlated, whereas successfully training a neural network requires independent and identically distributed (i.i.d.) training data to prevent overfitting. Circumventing this issue requires unconventional solutions; for example, DQN [22], DDPG [19], and ACER [35] perform gradient updates on randomly sampled past experience. Asynchronous methods like A3C [21] aggregate parameter updates across environment instances in a nondeterministic order. Policy gradient methods such as TRPO [27], PPO [29], and ACKTR [37] alternate between trajectory sampling and offline, gradient-based policy improvement. These strategies represent a marked departure from the incremental learning of classic temporal-difference methods. Consequently, the benefits of eligibility traces and the $\lambda$-return have been largely unexplored in the context of deep reinforcement learning.
In this paper, we propose a general strategy for reconciling the $\lambda$-return with deep reinforcement learning. We begin with an efficient, recursive technique for computing an entire sequence of $\lambda$-returns offline in linear time with respect to its length, as opposed to the quadratic time complexity of the traditional definition. This formulation enables the fast calculation of long $\lambda$-return sequences and is well suited to the offline learning schemes prevalent in state-of-the-art deep reinforcement learning. We demonstrate how this technique can be incorporated into DQN, DRQN [9], and A3C (with necessary modifications) to significantly increase the sample efficiency of these algorithms on Atari 2600 games, even when the complete state information is unavailable. In environments such as Seaquest and Q*bert, our method can increase the final score by nearly 300% after training for the same duration. Our methodology is general enough to be adapted for other value-based or actor-critic methods beyond those described here, including those with continuous action spaces.
2 Background
Reinforcement learning is the problem in which an agent must interact with an unknown environment through trial and error in order to maximize its cumulative reward [32]. We first consider the standard setting where the environment can be formulated as a Markov Decision Process (MDP) defined by the 4-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$. At a given timestep $t$, the environment exists in state $s_t \in \mathcal{S}$. The agent takes an action $a_t \in \mathcal{A}$ according to policy $\pi(a_t \mid s_t)$, causing the environment to transition to a new state $s_{t+1} \sim \mathcal{P}(s_t, a_t)$ and yield a reward $r_t \sim \mathcal{R}(s_t, a_t, s_{t+1})$. Hence, the agent’s goal can be formalized as finding a policy that maximizes the expected discounted return $\mathbb{E}_\pi\left[\sum_{i=0}^{H} \gamma^i r_i\right]$ up to some horizon $H$. The discount $\gamma \in [0, 1]$ affects the relative importance of future rewards and allows the sum to converge in the case where $H \to \infty$, $\gamma \neq 1$. An important property of the MDP is that every state $s \in \mathcal{S}$ satisfies the Markov property; that is, the agent needs to consider only the current state $s_t$ when selecting an action in order to perform optimally.
In reality, most problems of interest violate the Markov property. Information presently accessible to the agent may be incomplete or otherwise unreliable, and therefore is no longer a sufficient statistic for the environment’s history [12]. We can extend our previous formulation to the more general case of the Partially Observable Markov Decision Process (POMDP) defined by the 6-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \Omega, \mathcal{O})$. At a given timestep $t$, the environment exists in state $s_t \in \mathcal{S}$ and reveals observation $o_t \sim \mathcal{O}(s_t)$, $o_t \in \Omega$. The agent takes an action $a_t \in \mathcal{A}$ according to policy $\pi(a_t \mid o_0, \dots, o_t)$ and receives a reward $r_t \sim \mathcal{R}(s_t, a_t, s_{t+1})$, causing the environment to transition to a new state $s_{t+1} \sim \mathcal{P}(s_t, a_t)$. In this setting, the agent may need to consider arbitrarily long sequences of past observations when selecting actions in order to perform well. (Footnote 1: To achieve optimality, the policy must additionally consider the action history in general, i.e. $\pi(a_t \mid o_0, a_0, \dots, o_{t-1}, a_{t-1}, o_t)$. The theory presented here is straightforward to extend to this case.)
We can mathematically unify the MDP and POMDP by introducing the notion of an approximate state $\hat{s}_t = f(o_0, \dots, o_t)$, where $f$ is an arbitrary transformation of the observation history. In practice, $f$ might consider only a subset of the history, or even just the most recent observation. This allows for the identical treatment of the MDP and POMDP by generalizing the notion of a Bellman backup, and greatly simplifies the following sections. However, it is important to emphasize that $\hat{s}_t \neq s_t$ except in the strict case of the MDP, and that the choice of $f$ can otherwise affect the solution quality.
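As a concrete illustration of one possible history transformation, the sketch below stacks the $k$ most recent observations into an approximate state. The helper name `make_history_transform` and the choice to pad with copies of the first observation at the start of an episode are assumptions made for this example, not details taken from the text.

```python
from collections import deque


def make_history_transform(k):
    """Return a function f that maps each new observation to an approximate state.

    The approximate state is a tuple of the k most recent observations.
    Create a fresh transform at the start of every episode.
    """
    frames = deque(maxlen=k)

    def f(observation):
        if not frames:
            # Assumption: pad the history with copies of the first observation.
            frames.extend([observation] * k)
        else:
            # The deque drops the oldest observation automatically.
            frames.append(observation)
        return tuple(frames)

    return f


f = make_history_transform(4)
first_state = f("o0")
second_state = f("o1")
```

Setting `k = 1` recovers the case where only the most recent observation is used.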
2.1 Eligibility traces
Value-based reinforcement learning algorithms seek to produce an accurate estimate $V(\hat{s}_t)$ of the expected discounted return achieved by following policy $\pi$ from state $\hat{s}_t$. Suppose the agent acts according to $\pi$ and experiences the finite trajectory $\hat{s}_t, a_t, r_t, \hat{s}_{t+1}, a_{t+1}, r_{t+1}, \dots, \hat{s}_T$. The 1-step temporal-difference update can be used to improve the estimate: $V(\hat{s}_t) \leftarrow V(\hat{s}_t) + \alpha \left[ r_t + \gamma V(\hat{s}_{t+1}) - V(\hat{s}_t) \right]$, where $\alpha$ is the learning rate controlling the magnitude of the update. The primary drawback of this update is that only the current reward $r_t$ directly affects it; future rewards must influence the estimate indirectly through $V(\hat{s}_{t+1})$, which can result in slow learning. The update may also suffer from estimation bias. To increase the immediate sensitivity to future rewards, and to decrease the bias, $n$-step updates can be performed instead: $V(\hat{s}_t) \leftarrow V(\hat{s}_t) + \alpha \left[ R_t^{(n)} - V(\hat{s}_t) \right]$, where $R_t^{(n)}$ is the $n$-step return. (Footnote 2: The $n$-step return $R_t^{(n)}$ is defined by $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(\hat{s}_{t+n})$. If $\hat{s}_{t+n}$ is terminal, then $V(\hat{s}_{t+n}) = 0$ by definition and the $n$-step return is equivalent to the Monte Carlo return.) When $n$ is large, the $n$-step return simultaneously considers many rewards and can more rapidly assign credit. On the other hand, the combination of these rewards can have higher variance and require more samples to converge to the true expectation. This tradeoff can be more effectively balanced by averaging $n$-step returns [32]. The $\lambda$-return is defined as the exponential average of all $n$-step returns:
$$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{N-1} \lambda^{n-1} R_t^{(n)} + \lambda^{N-1} R_t^{(N)} \tag{1}$$
where $\lambda \in [0, 1]$ is a hyperparameter that controls the decay rate and $N = T - t$. The final $n$-step return $R_t^{(N)}$ receives the total weight of all hypothetical returns longer than it. When $\lambda = 0$, Equation (1) reduces to the 1-step temporal-difference return. When $\lambda = 1$ and $\hat{s}_T$ is terminal, Equation (1) reduces to the Monte Carlo return. The $\lambda$-return can thus be seen as a smooth interpolation between these methods. Furthermore, the monotonically decreasing weights can be interpreted as a specific form of credit assignment relying only on the reasonable assumption that recent states are likelier to have contributed to a given reward.
The $\lambda$-return presented here is the "forward view" [32], meaning its calculation must be delayed and conducted offline in practice. It is also expensive to compute, which we discuss further in Section 3. The $\lambda$-return can be more efficiently implemented in the "backward view" [32] using eligibility traces to gradually produce the $\lambda$-return at each timestep. Although the backward view is generally applicable to function approximators [33], modern deep reinforcement learning algorithms do not learn incrementally and cannot use this technique. In the next sections, we illustrate this incompatibility through the examples of DQN and A3C. Later, we discuss a more efficient forward-view calculation that is practical for deep reinforcement learning.
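To make the forward-view definition in Equation (1) concrete, the following is a minimal sketch that computes it directly from a list of rewards and value estimates. The function names and the list-based trajectory layout are assumptions chosen for this example; `values` carries one extra entry for the final state, which is 0.0 when that state is terminal.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards plus a bootstrapped value.

    rewards[i] is r_i for i = 0..T-1; values[i] is V(s_i) for i = 0..T.
    """
    discounted = sum(gamma ** i * rewards[t + i] for i in range(n))
    return discounted + gamma ** n * values[t + n]


def lambda_return_forward(rewards, values, t, gamma, lam):
    """Forward-view lambda-return: exponential average of all n-step returns."""
    N = len(rewards) - t  # number of steps remaining in the trajectory
    total = 0.0
    for n in range(1, N):
        weight = (1.0 - lam) * lam ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # The longest return receives all of the remaining weight.
    total += lam ** (N - 1) * n_step_return(rewards, values, t, N, gamma)
    return total


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # last state is terminal
td_return = lambda_return_forward(rewards, values, 0, gamma=0.9, lam=0.0)
mc_return = lambda_return_forward(rewards, values, 0, gamma=0.9, lam=1.0)
```

With `lam=0.0` the result is the 1-step temporal-difference return, and with `lam=1.0` on a terminating trajectory it is the Monte Carlo return, matching the two limiting cases described above.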
2.2 Deep Q-Network
Deep Q-Network (DQN) was one of the first notable successes of deep reinforcement learning, achieving human-level performance on Atari 2600 games using only the screen pixels as input [22]. DQN can be viewed as the deep-learning analog of Q-Learning [36], in which the estimate $Q(\hat{s}_t, a_t)$ of the expected greedy return achieved after taking action $a_t$ from state $\hat{s}_t$ is updated incrementally: $Q(\hat{s}_t, a_t) \leftarrow Q(\hat{s}_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(\hat{s}_{t+1}, a') - Q(\hat{s}_t, a_t) \right]$. Because maintaining tabular information for every state-action pair is not feasible for large state spaces, DQN instead learns a parameterized function $Q(\hat{s}_t, a_t; \theta)$ (implemented as a deep neural network) to generalize over states. Unfortunately, directly updating $\theta$ according to a gradient-based version of the Q-Learning update does not work well. Training samples must be i.i.d. to prevent the neural network from overfitting and performing poorly, but sequentially collected experience is highly correlated. To overcome this, transitions $(\hat{s}_t, a_t, r_t, \hat{s}_{t+1})$ are stored in a replay memory $D$ and gradient descent is performed on uniformly sampled minibatches of past experience. Hence, DQN becomes a minimization problem where the following loss is iteratively approximated and reduced:
$$\mathcal{L}(\theta) = \mathbb{E}_{(\hat{s}_t, a_t, r_t, \hat{s}_{t+1}) \sim D} \left[ \left( r_t + \gamma \max_{a'} Q(\hat{s}_{t+1}, a'; \theta^-) - Q(\hat{s}_t, a_t; \theta) \right)^2 \right]$$
The parameters $\theta^-$ are a stale copy of $\theta$ that helps prevent oscillations or divergence of $Q$. Unfortunately, randomly sampling experience in this manner does not permit eligibility traces, which require online and incremental learning. Another limitation of DQN is its assumption that the input is Markovian. Because Atari 2600 games are partially observable given a single game frame, the four most recent observations were concatenated together to form an approximate state [22]. This technique does not scale well to other domains where arbitrarily distant past observations may be necessary. DQN can use a recurrent neural network (RNN) to more effectively address partial observability, producing a variant called Deep Recurrent Q-Network (DRQN) [9]. Pseudocode for DQN (and DRQN) is provided in the Supplementary Material.
2.3 Asynchronous Advantage Actor-Critic
Asynchronous Advantage Actor-Critic (A3C) is a multithreaded actor-critic method [21]. Unlike DQN, which directly estimates action values, A3C iteratively improves a parameterized, stochastic policy. This is accomplished by alternately sampling a short trajectory from the environment and updating the policy according to the estimated advantage of each state-action pair. If the trajectory begins from state $\hat{s}_t$, then its length $N$ is the number of timesteps from time $t$ until episode termination, or a fixed hyperparameter $t_{max}$, whichever is smaller. Thus, the advantage of the $k$-th state-action pair $(\hat{s}_{t+k}, a_{t+k})$ in the trajectory, for $0 \le k \le N - 1$, is calculated by $A(\hat{s}_{t+k}, a_{t+k}; \theta_v) = R_{t+k}^{(N-k)} - V(\hat{s}_{t+k}; \theta_v)$. The vectors $\theta$ and $\theta_v$ parameterize the policy $\pi$ and the value function $V$, respectively. While these vectors are treated separately for formality, in practice both are implemented as a single neural network sharing all parameters except those in the final layer. Upon completion of a trajectory, the policy and value function are updated:
$$\theta \leftarrow \theta + \alpha \sum_{k=0}^{N-1} \nabla_\theta \log \pi(a_{t+k} \mid \hat{s}_{t+k}; \theta) \, A(\hat{s}_{t+k}, a_{t+k}; \theta_v) \tag{2}$$
$$\theta_v \leftarrow \theta_v - \alpha \sum_{k=0}^{N-1} \nabla_{\theta_v} \tfrac{1}{2} A(\hat{s}_{t+k}, a_{t+k}; \theta_v)^2 \tag{3}$$
Equation (2) improves the policy by altering action log-likelihoods in proportion to their advantages. Equation (3) improves the value function by reducing the squared error between the actual return and the expected return. As before, conducting these updates sequentially would result in poor performance due to the strong correlation between successive states. Rather than using a replay memory, separate actors operate on distinct instances of the environment in parallel. Each actor asynchronously updates the same global parameter vectors $\theta$ and $\theta_v$. The stochasticity of the policy and environment helps to decorrelate the gradients. However, it is this asynchronous framework that precludes the use of eligibility traces, as gradients are interleaved from independent environment instances in a nondeterministic order. Pseudocode for A3C is provided in the Supplementary Material.
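The advantage calculation described above can be sketched with a single backward pass over a sampled trajectory segment. This is an illustrative sketch only: the function name and list layout are assumptions, and the value estimates are passed in as plain numbers rather than produced by a neural network.

```python
def n_step_advantages(rewards, values, gamma):
    """Advantage of each state-action pair in a short trajectory segment.

    rewards[k] is the reward at the k-th step (k = 0..N-1); values[k] is the
    critic's estimate for the k-th state (k = 0..N). values[N] bootstraps the
    return and must be 0.0 if the segment ended in a terminal state.
    """
    N = len(rewards)
    advantages = [0.0] * N
    ret = values[N]
    for k in reversed(range(N)):
        # After this line, ret is the (N - k)-step return from the k-th state.
        ret = rewards[k] + gamma * ret
        advantages[k] = ret - values[k]
    return advantages


advantages = n_step_advantages([1.0, 0.0, 2.0], [0.5, 0.4, 0.3, 0.0], gamma=0.9)
```

Each state uses the longest return available to it within the segment, so earlier states look further ahead than later ones.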
3 Sample-efficient learning with $\lambda$-returns
Equation (1) provides theoretical guidance on how the forward-view $\lambda$-return is constructed. However, deep reinforcement learning methods often need an entire sequence of $\lambda$-returns, one for every state along a trajectory. Computing Equation (1) repeatedly for each state in an $N$-step trajectory would require roughly $N + (N - 1) + \dots + 1 = O(N^2)$ operations. While this may be feasible for short trajectories, calculating the $\lambda$-returns for an arbitrarily long trajectory becomes impractical. (Footnote 3: For this reason, the $\lambda$-return calculation is often truncated in practice, but this introduces error and prohibits $\lambda$ values very close to 1. The efficient formulation in Equation (4) eliminates the need for truncation.) Alternatively, given the full trajectory, the $\lambda$-returns can be calculated backwards more efficiently using recursion:
$$R_t^\lambda = R_t^{(1)} + \gamma \lambda \left[ R_{t+1}^\lambda - V(\hat{s}_{t+1}) \right] \tag{4}$$
This formulation has been used in prior work (e.g. [5, 24]), but not in the context of deep reinforcement learning. We include a derivation in the Supplementary Material. Equation (4) provides a compact update rule for calculating $R_t^\lambda$ given $R_{t+1}^\lambda$ in a constant number of operations. Therefore, the entire sequence of $\lambda$-returns can be computed with $O(N)$ time complexity, implying Equation (1) is asymptotically suboptimal. This is crucial for deep reinforcement learning, where $\lambda$-returns may need to be estimated along extremely lengthy trajectories. As an illuminating example, DQN($\lambda$), presented in the next section, must periodically update $\lambda$-returns stored in its replay memory, which has a typical capacity of one million transitions. Such an operation would not be practical using Equation (1). Another consequence of Equation (4) is that truncation of the $\lambda$-return calculation is unnecessary, as doing so no longer reduces the total runtime. As such, exact calculation can be conducted for any value of $\lambda \in [0, 1]$, whereas values arbitrarily close to 1 previously incurred intolerable truncation error.
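The backward recursion can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and list layout are chosen for the example, and the recursion is seeded with the value of the final state so that the last $\lambda$-return equals the 1-step return.

```python
def lambda_returns(rewards, values, gamma, lam):
    """Compute the lambda-return for every timestep of a trajectory in O(N).

    rewards[t] is r_t for t = 0..N-1; values[t] is V(s_t) for t = 0..N,
    where values[N] is the bootstrap value (0.0 if s_N is terminal).
    """
    N = len(rewards)
    returns = [0.0] * N
    # Seeding with V(s_N) makes the correction term vanish at the last step.
    next_return = values[N]
    for t in reversed(range(N)):
        one_step = rewards[t] + gamma * values[t + 1]
        returns[t] = one_step + gamma * lam * (next_return - values[t + 1])
        next_return = returns[t]
    return returns


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # last state is terminal
returns = lambda_returns(rewards, values, gamma=0.9, lam=0.5)
```

Each iteration performs a constant amount of work, so the cost grows linearly with the trajectory length regardless of the value of `lam`.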
3.1 DQN($\lambda$)
We introduce a new algorithm called DQN($\lambda$), presented in Algorithm 1, that combines the efficient $\lambda$-return calculation in Equation (4) with DQN. Because DQN randomly samples past experience from a replay memory, nontrivial changes to the algorithm are required. Our discussion thus far has been limited to state-value estimation; hence, we first redefine the $n$-step return in terms of action values: $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_{a'} Q(\hat{s}_{t+n}, a'; \theta)$. With this new formulation, each $n$-step return becomes a sum of discounted rewards corrected by a final maximization over action values. This is equivalent to Peng’s Q($\lambda$) [24]. This brings us to our principal modification of DQN: in addition to storing each reward $r_t$ in replay memory $D$, we store its $\lambda$-return $R_t^\lambda$. Training becomes a matter of sampling a minibatch of precomputed $\lambda$-returns from $D$ and reducing the squared error. Of course, the calculation of $R_t^\lambda$ must be deferred because of its dependency on future states and rewards, so transitions are first appended to an intermediate list $L$. When a terminal state is reached, the $\lambda$-returns are efficiently calculated along $L$ and then stored in $D$. (Footnote 4: We consider only episodic tasks in our work. A modification where the task is divided into "episodes" of arbitrary length, and all $\lambda$-returns bootstrap from the value function at the end, could accommodate continuing tasks.) The new loss function becomes the following:
$$\mathcal{L}(\theta) = \mathbb{E}_{(\hat{s}_t, a_t, R_t^\lambda) \sim D} \left[ \left( R_t^\lambda - Q(\hat{s}_t, a_t; \theta) \right)^2 \right]$$
The final remaining challenge is that the $\lambda$-returns become outdated as $\theta$ changes, which greatly slows learning if the capacity of the replay memory is large. However, this presents an opportunity to eliminate the target network altogether. Rather than copying parameters $\theta$ to $\theta^-$, we periodically "refresh" all of the $\lambda$-returns using the present Q-function. This allows for significantly faster learning while simultaneously providing stable temporal-difference targets. We note that DQN($\lambda$) is equivalent to DQN when $\lambda = 0$. When an RNN is used for the Q-function, we refer to Algorithm 1 as DRQN($\lambda$).
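The storage and refresh steps described above can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: the class name `LambdaReplayMemory`, the callable `q_fn` (assumed to map a state to a list of action values), whole-episode storage, and the absence of a capacity limit are all assumptions made to keep the example short.

```python
import random


def peng_lambda_returns(rewards, next_max_q, gamma, lam):
    """Recursive lambda-returns that bootstrap from max_a Q(s_{t+1}, a).

    next_max_q[t] is max_a Q(s_{t+1}, a), or 0.0 when s_{t+1} is terminal.
    """
    returns = [0.0] * len(rewards)
    next_return = next_max_q[-1]
    for t in reversed(range(len(rewards))):
        mixed = lam * next_return + (1.0 - lam) * next_max_q[t]
        returns[t] = rewards[t] + gamma * mixed
        next_return = returns[t]
    return returns


class LambdaReplayMemory:
    def __init__(self, gamma, lam):
        self.gamma, self.lam = gamma, lam
        self.episodes = []

    def _compute(self, episode, q_fn):
        # The final state of a stored episode is terminal, so its value is 0.
        inner_states = episode["states"][1:-1]
        next_max_q = [max(q_fn(s)) for s in inner_states] + [0.0]
        return peng_lambda_returns(episode["rewards"], next_max_q,
                                   self.gamma, self.lam)

    def add_episode(self, states, actions, rewards, q_fn):
        """states has one more entry than rewards; the last one is terminal."""
        episode = {"states": states, "actions": actions, "rewards": rewards}
        episode["returns"] = self._compute(episode, q_fn)
        self.episodes.append(episode)

    def refresh(self, q_fn):
        """Recompute every stored lambda-return with the current Q-function."""
        for episode in self.episodes:
            episode["returns"] = self._compute(episode, q_fn)

    def sample(self, batch_size):
        """Uniformly sample (state, action, lambda-return) training targets."""
        flat = [(ep["states"][t], ep["actions"][t], ep["returns"][t])
                for ep in self.episodes for t in range(len(ep["rewards"]))]
        return random.sample(flat, batch_size)
```

Calling `refresh` at a fixed interval plays the role that copying parameters to a target network plays in DQN: the regression targets stay fixed between refreshes.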
3.2 A3C($\lambda$)
We now introduce A3C($\lambda$), shown in Algorithm 2, which combines the efficient $\lambda$-return calculation with A3C. We begin by redefining the advantage, for the critic only, to incorporate the $\lambda$-return: $A^\lambda(\hat{s}_{t+k}, a_{t+k}; \theta_v) = R_{t+k}^\lambda - V(\hat{s}_{t+k}; \theta_v)$. As with the $n$-step returns before, the $\lambda$-return is computed up to an episode boundary or $t_{max}$, whichever comes first. The loss function in Equation (3) operates identically on this new advantage, and Equation (2) remains unchanged. The efficient $\lambda$-return calculation has no impact on runtime because the $n$-step returns could previously be calculated recursively as well. However, the $\lambda$-return enables larger values of $t_{max}$ due to its variance-reducing properties. A3C($\lambda$) reduces to A3C when $\lambda = 1$.
4 Related work
The $\lambda$-return has been used in prior work to improve the sample efficiency of DRQN for Atari 2600 games [7]. Because RNNs produce a sequence of action values during truncated backpropagation through time, these precomputed values can be exploited to calculate $\lambda$-returns with little computational expense over standard DRQN. The problem with this approach is its lack of generality: the Q-function is restricted to RNNs, and the length over which the $\lambda$-return calculation is considered is constrained to the length of the training sequence. This either prohibits $\lambda$ values close to 1 (where truncation bias would be significant) or requires that the training sequences become unacceptably long. In the latter case, computation time scales unfavorably and training issues with exploding and vanishing gradients can occur [3]. The $\lambda$-return must also be calculated on every training step, even when the input sequence and target network do not change. In contrast, our proposed DQN($\lambda$) with recursive $\lambda$-return calculation is more efficient and makes no assumptions about the Q-function parameterization. By only periodically updating $\lambda$-returns stored in the replay memory, we avoid repeated calculations and eliminate the need for a target network altogether. This strategy provides additional flexibility by decoupling the training sequence length from the $\lambda$-return length.
Generalized advantage estimation (GAE) [28] is a related method for reducing the variance of actor-critic updates at the expense of increased bias. GAE computes an exponential average of $n$-step advantage estimators along a trajectory, which is mathematically equivalent to A3C($\lambda$)’s advantage estimate, where the value function baseline is subtracted from the $\lambda$-return. However, it is opposite to our approach in the sense that the estimator is used to determine the actor gradient (policy improvement), whereas we use it to determine the critic gradient (value estimation). In our experiments, we found the latter to work better for A3C, though this may not be true for actor-critic methods in general. The two strategies are not mutually exclusive either, and could be combined to create a continuum of actor-critic algorithms with two independent parameters.
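The stated equivalence between GAE and the $\lambda$-return with the value baseline subtracted can be checked numerically on a small trajectory. The sketch below assumes a finite trajectory whose last return bootstraps from the final value estimate; the function names and sample numbers are made up for illustration.

```python
def lambda_returns(rewards, values, gamma, lam):
    """Recursive lambda-returns; values has one extra entry for the last state."""
    returns = [0.0] * len(rewards)
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        one_step = rewards[t] + gamma * values[t + 1]
        returns[t] = one_step + gamma * lam * (next_return - values[t + 1])
        next_return = returns[t]
    return returns


def gae_advantages(rewards, values, gamma, lam):
    """Exponentially weighted sum of temporal-difference errors."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0, -1.0]
values = [0.5, 0.4, 0.3, 0.2, 0.0]
returns = lambda_returns(rewards, values, gamma=0.99, lam=0.8)
advantages = gae_advantages(rewards, values, gamma=0.99, lam=0.8)
# Largest disagreement between (lambda-return - baseline) and the GAE estimate.
max_gap = max(abs(returns[t] - values[t] - advantages[t])
              for t in range(len(rewards)))
```

Up to floating-point rounding, `max_gap` is zero, so the two estimators differ only in whether they are applied to the actor or to the critic.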
5 Experiments
In order to characterize the performance benefits of $\lambda$-returns when combined with deep reinforcement learning, we conducted numerous experiments on a subset of the Atari 2600 games. Our primary goal was to evaluate the absolute sample efficiency of DQN($\lambda$), DRQN($\lambda$), and A3C($\lambda$) under different conditions with respect to the observability of the environment. Specifically, we tested DQN($\lambda$) and A3C($\lambda$) with both 1-frame and 4-frame inputs, and DRQN($\lambda$) trained on sequences of length 4. These can be seen as distinct instantiations of the history transformation $f$ discussed in Section 2.
We chose four Atari 2600 games for our experiments: Pong, Breakout, Seaquest, and Q*bert. We used the OpenAI Gym [4] to provide an interface to the Arcade Learning Environment [2], where observations consisted of the raw game frame pixels. For each experiment, we compared the unaltered algorithms against their respective $\lambda$-return variants with several values of $\lambda$. We did not systematically tune these values, and it is likely that better ones exist for these environments. All game frames were subjected to the same preprocessing steps described in [22]. In addition to these procedures, we converted the frames to grayscale and normalized their intensity values between 0 and 1. For comparison, we used the same convolutional neural network from [22] for DQN($\lambda$) and A3C($\lambda$). We replaced the penultimate fully connected layer with a 512-unit LSTM [11] for DRQN($\lambda$) to match the architecture in [9]. All agents were trained for ten million timesteps.
During training of DQN($\lambda$) and DRQN($\lambda$), exploration was linearly annealed from 1 to 0.1 over the first one million timesteps and then to 0.05 by the end of training. We used Adam optimization [14] with a learning rate of and parameter of . All other hyperparameters matched those in [22]. For A3C($\lambda$), we utilized 16 actors with . We used Adam optimization with the same hyperparameters as those for DQN($\lambda$). We also added an entropy bonus with to the objective in order to encourage exploration.
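The two-phase exploration schedule described for DQN($\lambda$) and DRQN($\lambda$) can be written as a small function. The function name is hypothetical, and the sketch assumes the second phase (from 0.1 to 0.05) is also linear, which the text does not state explicitly.

```python
def exploration_rate(step, first_phase=1_000_000, total_steps=10_000_000):
    """Exploration rate at a given training timestep.

    Phase 1: linear from 1.0 to 0.1 over the first `first_phase` steps.
    Phase 2: from 0.1 to 0.05 by `total_steps` (assumed linear here).
    """
    if step < first_phase:
        return 1.0 + (0.1 - 1.0) * step / first_phase
    # Clamp so that steps beyond the end of training stay at 0.05.
    fraction = min(1.0, (step - first_phase) / (total_steps - first_phase))
    return 0.1 + (0.05 - 0.1) * fraction


midpoint_rate = exploration_rate(500_000)
final_rate = exploration_rate(10_000_000)
```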
Our experiments in Figure 1 indicate that DQN can benefit significantly from $\lambda$-returns, achieving sample-efficiency improvements ranging from two- to fourfold on Seaquest and Q*bert, with both 1-frame and 4-frame inputs. We note that the final scores achieved in these cases constitute human-level performance according to the definition in [22] after training for only one-fifth of the duration. This demonstrates that DQN($\lambda$) can generate highly successful control policies with far fewer samples from the environment than DQN. All three values of $\lambda$ that we tested performed similarly, and there was no obvious choice for the best value in general. DQN($\lambda$) and DRQN($\lambda$) results for Pong and Breakout are included in the Supplementary Material.
We note that DRQN($\lambda$) performed nearly as well as DQN($\lambda$) with 4-frame input, suggesting that the RNN is capable of producing meaningful state approximations. This is expected based on the findings in [9]. DRQN($\lambda$) outperformed DRQN by one order of magnitude on Seaquest and doubled the final score on Q*bert, indicating that recurrent Q-functions are also capable of benefiting significantly from $\lambda$-returns, as was studied in [7].
Similarly to DQN($\lambda$), A3C($\lambda$) saw the largest improvements on Seaquest and Q*bert (Figure 2). Sample efficiency in these cases was approximately doubled. We note that A3C($\lambda$) generally performed worse than DQN($\lambda$) on these tasks, appearing to converge to local optima; for example, the agent never learned to return to the surface for oxygen in Seaquest. However, the increased convergence speed to these local optima still shows that the $\lambda$-return can benefit A3C. These preliminary results suggest that DQN may be more sensitive than A3C to eligibility traces because of its direct value estimation of actions, but more experiments would be needed to state this conclusively. A3C($\lambda$) results for Pong and Breakout (included in the Supplementary Material) also showed modest performance improvements in some cases.
Figure 1: DQN($\lambda$) with 1-frame, 4-frame, and recurrent inputs on Seaquest and Q*bert. Training consisted of 40 epochs of 250,000 timesteps each, for a total of 10 million timesteps. Results were averaged over 3 random seeds.
6 Discussion and conclusion
Eligibility traces were historically successful in enhancing the empirical performance of temporal-difference methods. However, recent advances in reinforcement learning have departed from traditional, tabular-based return estimation in order to scale to previously intractable problems. This trend has been primarily driven by progress in neural networks, which require i.i.d. training data to avoid overfitting and necessitate offline learning schemes such as experience replay, asynchronous parameter updates, and Monte Carlo return estimation. Consequently, state-of-the-art methods are incompatible with the incremental learning needed for eligibility traces, and the performance improvements expected from them have remained out of reach.
We highlighted an offline, recursive method for calculating a sequence of $\lambda$-returns in linear time with respect to its length. This is highly useful for deep reinforcement learning algorithms, which often need to repeatedly estimate returns for every visited state in a long trajectory. The same procedure would require a quadratic number of operations with respect to length when using the traditional forward-view definition of the $\lambda$-return, and is far less practical for this purpose. We proposed significant modifications to DQN, DRQN, and A3C that enable incorporation of this fast $\lambda$-return calculation for enhanced performance. Our experiments on Seaquest and Q*bert show that these variants can improve sample efficiency by a large factor, even when the complete state information is obscured.
We chose Peng’s Q($\lambda$) for our $\lambda$-return calculation in DQN($\lambda$), although many alternatives exist. One possibility is Watkins’s Q($\lambda$) [36]. In contrast to Peng’s Q($\lambda$), which unconditionally conducts Bellman backups using the maximizing action, Watkins’s Q($\lambda$) terminates the $\lambda$-return calculation by setting $\lambda = 0$ whenever an exploratory action is taken. This ensures that only on-policy data with respect to the greedy policy is used. Watkins’s Q($\lambda$) was recently proved to converge optimally in the tabular setting [23]. Unfortunately, terminating the returns slows learning when exploration is frequent and erases most of the benefit of using $\lambda$-returns early in the training process. A "naive" implementation could simply ignore this, but it is unclear what implications this would have for performance and convergence [32]. Peng’s Q($\lambda$) similarly mixes on- and off-policy data in the return estimation and can yield better empirical performance; however, it has not been proved to converge, even as . Because DQN is not guaranteed to converge optimally anyway as a consequence of its nonlinear function approximation, this is not necessarily crucial. An empirical comparison of these Q($\lambda$) variants in a deep reinforcement learning setting would be an interesting avenue for future work.
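The difference between the two variants can be illustrated with a single recursive routine that optionally cuts the return at exploratory actions. This is a sketch under assumptions: the function name, the `greedy` flag list, and the indexing convention (the return from time $t$ stops looking ahead when the action at time $t+1$ was exploratory) are choices made for this example.

```python
def q_lambda_returns(rewards, next_max_q, greedy, gamma, lam, watkins):
    """Lambda-returns for Q-learning, in Peng's or Watkins's style.

    next_max_q[t] is max_a Q(s_{t+1}, a), or 0.0 when s_{t+1} is terminal.
    greedy[t] is True when the action taken at time t was the greedy one.
    """
    N = len(rewards)
    returns = [0.0] * N
    next_return = next_max_q[-1]
    for t in reversed(range(N)):
        # Watkins's variant uses lambda = 0 when the next action was exploratory.
        cut = watkins and t + 1 < N and not greedy[t + 1]
        lam_t = 0.0 if cut else lam
        mixed = lam_t * next_return + (1.0 - lam_t) * next_max_q[t]
        returns[t] = rewards[t] + gamma * mixed
        next_return = returns[t]
    return returns


rewards = [1.0, 0.0, 2.0]
next_max_q = [1.0, 1.0, 0.0]
greedy = [True, False, True]  # the second action was exploratory
peng = q_lambda_returns(rewards, next_max_q, greedy, 0.9, 0.5, watkins=False)
cut = q_lambda_returns(rewards, next_max_q, greedy, 0.9, 0.5, watkins=True)
```

In this example only the first return differs between the two variants, because it is the only one whose lookahead crosses the exploratory action.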
It is important to note that when the behavior policy differs from the target policy, the $\lambda$-return will be a biased estimator of the expected discounted return. This is especially relevant to DQN($\lambda$) because the replay memory collects samples obtained from a continuum of differing policies as exploration is annealed. Consequently, the return estimates may be slightly biased. Importance sampling is a standard technique for correcting bias, but it substantially increases variance too. Other methods have been proposed to more favorably balance bias correction with lower variance: for example, Tree Backup [25], Q*($\lambda$) [8], and Retrace($\lambda$) [23]. We did not consider bias correction in our work because it is orthogonal to DQN($\lambda$); in principle, any of these strategies could be incorporated into our algorithm, and performance would be expected to improve. For A3C($\lambda$), bias correction is unnecessary because the policy changes negligibly between sampling a trajectory and applying its corresponding gradient.
The methodology described here is general enough to be adapted for any value-based or actor-critic method. We expect similar empirical results for other state-of-the-art deep reinforcement learning algorithms, and hope that this work serves as inspiration for combining $\lambda$-returns with them. For example, replay-based methods like DDPG and ACER could benefit substantially from the refresh procedure proposed for DQN($\lambda$). TRPO, PPO, and ACKTR all estimate returns along trajectories in a manner analogous to A3C($\lambda$) and could use $\lambda$-returns to improve value estimation of the critic. Further modifications to actor-critic methods could include incorporating GAE, discussed in Section 4, for potentially larger improvements to sample efficiency. These results would be especially useful in scenarios where samples are difficult to obtain, such as robotic systems acting in physical environments. Similarly, there exist interesting opportunities to substantially reduce computation time for expensive algorithms by decreasing the total number of gradient updates needed during training. This includes instances where very large neural networks are necessary, or where costly auxiliary optimization techniques are used in conjunction with backpropagation, or both (e.g. [13]).
References
[1] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, 13(5):834–846, 1983.
[2] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
[6] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
[7] Jean Harb and Doina Precup. Investigating recurrence and eligibility traces in deep Q-networks. arXiv preprint arXiv:1704.05495, 2017.
[8] Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. Q($\lambda$) with off-policy corrections. In International Conference on Algorithmic Learning Theory, pages 305–320. Springer, 2016.
[9] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. CoRR, abs/1507.06527, 2015.
[10] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[12] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
[13] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] A Harry Klopf. Brain function and adaptive systems: a heterostatic theory. Technical report, Air Force Cambridge Research Labs, Hanscom AFB, MA, 1972.
[16] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI, pages 2140–2146, 2017.
[17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[18] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
[19] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[20] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
[21] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[23] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
[24] Jing Peng and Ronald J Williams. Incremental multi-step Q-learning. In Machine Learning Proceedings 1994, pages 226–232. Elsevier, 1994.
[25] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
[26] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
[27] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[28] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[30] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[31] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
[32] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1st edition, 1998.
[33] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press Cambridge, 2nd edition, 2017. In progress.
[34] Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, 1984.
[35] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
[36] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
[37] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.