1 Introduction
Until recently, using temporal difference (TD) methods to approximate a value function from off-policy samples was potentially unstable without resorting to quadratic (in the number of features) computation and storage, even in the case of linear approximations. Off-policy learning involves learning an estimate of the total future reward that we would expect to observe if the agent followed some target policy, while learning from samples generated by a different behavior policy. This off-policy, policy-evaluation problem, when combined with a policy improvement step, can be used to model many different learning scenarios, such as learning from many policies in parallel sutton2011horde, learning from demonstrations argall2009asurvey, learning from batch data lin1992self, or simply learning about the optimal policy while following an exploratory policy, as in the case of Q-learning watkins1989learning. In this paper, we focus exclusively on the off-policy, policy-evaluation problem, commonly referred to as value function approximation or simply the prediction problem. Over the past decade there has been a proliferation of new linear-complexity, policy-evaluation methods designed to be convergent in the off-policy case.
These novel algorithmic contributions have focused on different ways of achieving stable off-policy prediction learning. The first such methods were the gradient TD family of algorithms that perform approximate stochastic gradient descent on the mean squared projected Bellman error (MSPBE). The primary drawbacks of these methods are the requirement for a second set of learned weights, a second step-size parameter, and potentially high-variance updates due to importance sampling. Empirically the results have been mixed, with some results indicating that TD can be superior in on-policy settings sutton2009fast, and others concluding the exact opposite dann2014policy.
Later, provisional TD (PTD) was introduced sutton2014anew to rectify the issue that the bootstrap parameter $\lambda$, used in gradient TD methods maei2011gradient, does not correspond well with the same parameter used by conventional TD learning sutton1988learning. Specifically, for $\lambda = 1$, gradient TD methods do not correspond to any known variant of off-policy Monte Carlo. The PTD algorithm fixes this issue, and in on-policy prediction is exactly equivalent to the conventional TD algorithm. PTD does not use gradient corrections, and is only guaranteed to converge in the tabular off-policy prediction setting. Its empirical performance relative to TD and gradient TD, however, is completely unknown.
Recently Sutton et al. sutton2015anemphatic observed that conventional TD does not correct its update based on the notion of a follow-on distribution. This distributional mismatch provides another way to understand the divergence of conventional off-policy TD. They derive the Emphatic TD (ETD) algorithm, which surprisingly achieves convergence yu2015onconvergence without the need for a second set of weights like those used by gradient TD methods. Like gradient TD methods, however, it seems that this algorithm also suffers from high variance due to importance sampling. Hallak et al. hallak2015generalized introduced a variant of ETD that utilizes a scaling parameter $\beta$, which is meant to reduce the magnitude of the follow-on trace. Comparative empirical studies for ETD and ETD($\beta$), however, have been limited.
The most recent contribution to this line of work explores a mirror-prox approach to minimizing the MSPBE mahadevan2012sparse ; mahadevan2014proximal ; liu2015finite. The main benefit of this work was that it enabled the first finite-sample analysis of an off-policy TD-based method with function approximation, and the application of advanced stochastic gradient optimizations. Liu et al. liu2015finite introduced two mirror-prox TD algorithms, one based on the GTD2 algorithm sutton2009fast and the other based on TDC sutton2009fast (GTD2 and TDC are gradient TD methods that do not use eligibility traces), and showed that these methods outperform their base counterparts on Baird’s counterexample baird1995residual, but did not extend their new methods with eligibility traces.
A less widely known approach to the off-policy prediction problem is based on algorithms that perform precisely TD updates when the data is sampled on-policy, and corrected gradient-TD style updates when the data is generated off-policy. The idea is to exploit the supposed superior efficiency of TD in on-policy learning, while maintaining robustness in the off-policy case. These “hybrid” TD methods were introduced for state value-function based prediction maei2011gradient, and state-action value-function based prediction hackman2012thesis, but have not been extended to utilize eligibility traces, nor compared with the recent developments in linear off-policy TD learning (many developed since 2014).
Meanwhile, a separate but related thread of algorithmic development has sought to improve the operation of eligibility traces used in both on- and off-policy TD algorithms. This direction is based on another non-equivalence observation: the update performed by the forward-view variant of conventional TD is only equivalent to its backward-view update at the end of sampling trajectories. The proposed true-online TD (TOTD) prediction algorithm vanseijen2014true and true-online GTD (TOGTD) prediction algorithm vanhasselt2014off remedy this issue, and have been shown to outperform conventional TD and gradient TD methods, respectively, on chain domains. The TOTD algorithm requires only a modest increase in computational complexity over TD; the TOGTD algorithm, however, is significantly more complex to implement and requires three eligibility traces compared to GTD. Nevertheless, both TOTD and TOGTD achieve linear complexity, and can be implemented in a completely incremental way.
Although the asymptotic convergence properties of many of these methods have been rigorously characterized, empirically there is still much we do not understand about this now large collection of methods. A frequent criticism of gradient TD methods, for example, is that they are hard to tune and not well understood empirically. It is somewhat disappointing that perhaps the most famous application of reinforcement learning (learning to play Atari games mnih2015human) uses potentially divergent off-policy Q-learning. In addition, we have very little understanding of how these methods compare in terms of learning speed, robustness, and parameter sensitivity. By clarifying some of the empirical properties of these algorithms, we hope to promote more widespread adoption of these theoretically sound and computationally efficient algorithms.
This paper has two primary contributions. First, we introduce a novel extension of hybrid methods to eligibility traces, resulting in two new algorithms, HTD($\lambda$) and true-online HTD($\lambda$). Second, we provide an empirical study of TD-based prediction learning with linear function approximation. The conclusions of our experiments are surprisingly clear:
1. GTD($\lambda$) and TOGTD($\lambda$) should be preferred if robustness to off-policy sampling is required.
2. Between the two, GTD($\lambda$) should be preferred if computation time is at a premium.
3. Otherwise, TOETD($\beta$) was clearly the best across our experiments, except on Baird’s counterexample.
2 Background
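This paper investigates the problem of estimating the discounted sum of future rewards online and with function approximation. In the context of reinforcement learning we take online to mean that the agent makes decisions, the environment produces outcomes, and the agent updates its parameters in a continual, real-time interaction stream. We model the agent’s interaction as a Markov decision process defined by a countably infinite set of states $\mathcal{S}$, a finite set of actions $\mathcal{A}$, and a scalar discount function $\gamma : \mathcal{S} \to [0, 1]$. The agent’s observation of the current situation is summarized by the feature vector $\mathbf{x}(S_t) \in \mathbb{R}^d$, where $S_t \in \mathcal{S}$ is the current state and $d$ is the number of features. On each time step $t$, the agent selects an action $A_t \in \mathcal{A}$ according to its behavior policy $\mu$, where $\mu : \mathcal{S} \times \mathcal{A} \to [0, 1]$. The environment then transitions into a new state $S_{t+1}$, and emits a scalar reward $R_{t+1} \in \mathbb{R}$. The agent’s objective is to evaluate a fixed target policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, or estimate the expected return for policy $\pi$:
$$v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s,\ A_{t+i} \sim \pi \ \ \forall i \ge 0\right], \qquad G_t = \sum_{i=0}^{\infty} \Big(\prod_{j=1}^{i} \gamma(S_{t+j})\Big) R_{t+i+1},$$
where $v_\pi$ is called the state-value function for policy $\pi$.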
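All the methods evaluated in this study perform temporal difference updates, and most utilize eligibility traces. The TD($\lambda$) algorithm is the prototypical example of these concepts and is useful for understanding all the other algorithms discussed in the remainder of this paper. TD($\lambda$) estimates $v_\pi$ as a linear function of the weight vector $\mathbf{w} \in \mathbb{R}^d$, where the estimate is formed as an inner product between the weight vector and the features of the current state: $\mathbf{w}^\top \mathbf{x}(S_t) \approx v_\pi(S_t)$. The algorithm maintains a memory trace of recently experienced features, called the eligibility trace $\mathbf{e}_t \in \mathbb{R}^d$, allowing updates to assign credit to previously visited states. The TD($\lambda$) algorithm requires linear computation and storage, $O(d)$, and can be implemented incrementally as follows, writing $\mathbf{x}_t = \mathbf{x}(S_t)$ and $\gamma_t = \gamma(S_t)$:
$$\delta_t = R_{t+1} + \gamma_{t+1}\mathbf{w}_t^\top \mathbf{x}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t$$
$$\mathbf{e}_t = \lambda \gamma_t \mathbf{e}_{t-1} + \mathbf{x}_t$$
$$\Delta \mathbf{w}_t = \alpha \delta_t \mathbf{e}_t$$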
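As an illustration, one incremental TD($\lambda$) step might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function name `td_lambda_step`, its argument names, and the use of NumPy are all hypothetical, and the trace is the standard accumulating trace.

```python
import numpy as np

def td_lambda_step(w, e, x, r, x_next, alpha, gamma, gamma_next, lam):
    """One linear TD(lambda) update (hypothetical helper, accumulating trace).

    w, e      : weight vector and eligibility trace, shape (d,)
    x, x_next : feature vectors of the current and next state
    gamma     : discount of the current state, gamma_next of the next state
    """
    # Decay the old trace and add the current features.
    e = lam * gamma * e + x
    # TD error computed with the weights before the update.
    delta = r + gamma_next * (w @ x_next) - (w @ x)
    # Linear-time weight update along the trace.
    w = w + alpha * delta * e
    return w, e, delta

# Usage: two states, one transition with reward 1.
w, e = np.zeros(2), np.zeros(2)
w, e, delta = td_lambda_step(w, e, np.array([1.0, 0.0]), 1.0,
                             np.array([0.0, 1.0]),
                             alpha=0.5, gamma=0.9, gamma_next=0.9, lam=0.5)
```

Each step touches only vectors of length $d$, which is what makes the method linear in computation and storage.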
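In the case when the data is generated by a behavior policy, $\mu$, with $\pi \neq \mu$, we say that the data is generated off-policy. In the off-policy setting we must estimate $v_\pi$ with samples generated by selecting actions according to $\mu$. This setting can cause the TD($\lambda$) algorithm to diverge. The GTD($\lambda$) algorithm solves the divergence issue by minimizing the MSPBE, resulting in a stochastic gradient descent algorithm that looks similar to TD($\lambda$), with some important differences. GTD($\lambda$) uses importance weights, $\rho_t = \pi(S_t, A_t)/\mu(S_t, A_t)$, in the eligibility trace to reweight the data and obtain an unbiased estimate of the expected return under $\pi$. Note that in the policy iteration case (not studied here) it is still reasonable to assume knowledge of $\pi(s, a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$; for example, when $\pi$ is near greedy with respect to the current estimate of the state-action value function. GTD($\lambda$) has an auxiliary set of learned weights, $\mathbf{h}$, in addition to the primary weights $\mathbf{w}$, which maintain a quasi-stationary estimate of a part of the MSPBE. Like the TD($\lambda$) algorithm, GTD($\lambda$) requires only linear computation and storage and can be implemented fully incrementally as follows:
$$\delta_t = R_{t+1} + \gamma_{t+1}\mathbf{w}_t^\top \mathbf{x}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t$$
$$\mathbf{e}_t = \rho_t\left(\lambda \gamma_t \mathbf{e}_{t-1} + \mathbf{x}_t\right)$$
$$\Delta \mathbf{w}_t = \alpha\left[\delta_t \mathbf{e}_t - \gamma_{t+1}(1 - \lambda)(\mathbf{e}_t^\top \mathbf{h}_t)\,\mathbf{x}_{t+1}\right]$$
$$\Delta \mathbf{h}_t = \alpha_h\left[\delta_t \mathbf{e}_t - (\mathbf{x}_t^\top \mathbf{h}_t)\,\mathbf{x}_t\right]$$
The auxiliary weights also make use of a step-size parameter, $\alpha_h$, which is usually not equal to $\alpha$.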
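A single GTD($\lambda$) step can be sketched in the same style. This is a hedged sketch of the commonly published form of the update (importance-weighted trace, gradient-correction term on the primary weights, least-mean-squares style update on the auxiliary weights); the function name `gtd_lambda_step` and its arguments are hypothetical, and a constant $\lambda$ is assumed.

```python
import numpy as np

def gtd_lambda_step(w, h, e, x, r, x_next, rho,
                    alpha, alpha_h, gamma, gamma_next, lam):
    """One linear GTD(lambda) update (hypothetical helper).

    w, h : primary and auxiliary weights, shape (d,)
    rho  : importance weight pi(S_t, A_t) / mu(S_t, A_t)
    """
    # Importance-weighted eligibility trace.
    e = rho * (lam * gamma * e + x)
    delta = r + gamma_next * (w @ x_next) - (w @ x)
    # TD update plus the gradient correction that uses the auxiliary weights.
    w_new = w + alpha * (delta * e
                         - gamma_next * (1.0 - lam) * (e @ h) * x_next)
    # Auxiliary weights are updated with their own step size alpha_h.
    h_new = h + alpha_h * (delta * e - (x @ h) * x)
    return w_new, h_new, e

# Usage: with h = 0 and rho = 1 the primary update matches TD(lambda).
w, h, e = np.zeros(2), np.zeros(2), np.zeros(2)
w, h, e = gtd_lambda_step(w, h, e, np.array([1.0, 0.0]), 1.0,
                          np.array([0.0, 1.0]), rho=1.0,
                          alpha=0.5, alpha_h=0.1,
                          gamma=0.9, gamma_next=0.9, lam=0.5)
```

The correction term vanishes when $\lambda = 1$ or when $\mathbf{h}$ is zero, so the extra cost over TD($\lambda$) is one additional vector of weights and two extra inner products per step.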
Due to space constraints we do not describe the other TD-based linear learning algorithms found in the literature and investigated in our study. We provide each algorithm’s pseudocode in the appendix, and in the next section describe two new off-policy, gradient TD methods, before turning to empirical questions.
3 HTD derivation
Conventional temporal difference updating can be more data efficient than gradient temporal difference updating, but the correction term used by gradient-TD methods helps prevent divergence. Previous empirical studies sutton2009fast demonstrated situations (specifically on-policy) where linear TD(0) can outperform gradient TD methods, and others hackman2012thesis demonstrated that Expected Sarsa(0) can outperform multiple variants of the GQ(0) algorithm, even under off-policy sampling. On the other hand, TD($\lambda$) can diverge on small, though somewhat contrived, counterexamples.
The idea of hybrid-TD methods is to achieve sample efficiency closer to TD($\lambda$) during on-policy sampling, while ensuring non-divergence under off-policy sampling. To achieve this, a hybrid algorithm could do conventional, uncorrected TD updates when the data is sampled on-policy, and use gradient corrections when the data is sampled off-policy. This approach was pioneered by Maei maei2011gradient, leading to the derivation of the Hybrid Temporal Difference learning algorithm, or HTD(0). Later, Hackman hackman2012thesis produced a hybrid version of the GQ(0) algorithm, estimating state-action value functions rather than state-value functions as we do here. In this paper, we derive the first hybrid temporal difference method to make use of eligibility traces, called HTD($\lambda$).
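The key idea behind the derivation of HTD learning methods is to modify the gradient of the MSPBE to produce a new learning algorithm. Let $\mathbb{E}_\mu[\cdot]$ represent the expectation according to samples generated under the behavior policy, $\mu$. The MSPBE sutton2009fast can be written as
$$\text{MSPBE}(\mathbf{w}) = \mathbb{E}_\mu[\delta_t \mathbf{e}_t]^\top\, \mathbb{E}_\mu[\mathbf{x}_t \mathbf{x}_t^\top]^{-1}\, \mathbb{E}_\mu[\delta_t \mathbf{e}_t] = (-\mathbf{A}_\pi \mathbf{w} + \mathbf{b}_\pi)^\top \mathbf{C}^{-1} (-\mathbf{A}_\pi \mathbf{w} + \mathbf{b}_\pi),$$
where $\mathbf{C} = \mathbb{E}_\mu[\mathbf{x}_t \mathbf{x}_t^\top]$ and
$$\mathbf{A}_\pi = \mathbb{E}_\mu\left[\mathbf{e}_t (\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})^\top\right], \qquad \mathbf{b}_\pi = \mathbb{E}_\mu\left[R_{t+1}\mathbf{e}_t\right]. \tag{1}$$
Therefore, the relative importance given to states in the MSPBE is weighted by the stationary distribution of the behavior policy, $\mu$ (since it is generating samples), but the transitions are reweighted to reflect the returns that $\pi$ would produce.
The gradient of the MSPBE is:
$$-\tfrac{1}{2}\nabla_{\mathbf{w}} \text{MSPBE}(\mathbf{w}) = \mathbf{A}_\pi^\top \mathbf{C}^{-1} (-\mathbf{A}_\pi \mathbf{w} + \mathbf{b}_\pi). \tag{2}$$
Assuming $\mathbf{A}_\pi^\top \mathbf{C}^{-1}$ is nonsingular, we get the TD fixed-point solution:
$$\mathbf{A}_\pi^\top \mathbf{C}^{-1} (-\mathbf{A}_\pi \mathbf{w} + \mathbf{b}_\pi) = \mathbf{0} \implies -\mathbf{A}_\pi \mathbf{w} + \mathbf{b}_\pi = \mathbf{0}. \tag{3}$$
The value of $\mathbf{w}$ for which (3) is zero is the solution found by linear TD($\lambda$) and LSTD($\lambda$) where $\pi = \mu$. The gradient of the MSPBE yields an incremental learning rule with the following general form (see bertsekas1996neuro):
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\, \mathbf{M} (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi), \tag{4}$$
where $\mathbf{M} = \mathbf{I}$ for TD($\lambda$) and $\mathbf{M} = \mathbf{A}_\pi^\top \mathbf{C}^{-1}$ for GTD($\lambda$). The update rule, in the case of TD($\lambda$), will yield stable convergence if $\mathbf{A}_\pi$ is positive definite (as shown by Tsitsiklis and van Roy tsitsiklis1997ananalysis). In off-policy learning, we require $\mathbf{A}_\pi^\top \mathbf{C}^{-1} \mathbf{A}_\pi$ to be positive definite to satisfy the conditions of the ordinary differential equation proof of convergence maei2010gq, which holds because $\mathbf{C}^{-1}$ is positive definite and therefore $\mathbf{A}_\pi^\top \mathbf{C}^{-1} \mathbf{A}_\pi$ is positive definite, because $\mathbf{A}_\pi$ is full rank (true by assumption). See Sutton et al. sutton2015anemphatic for a nice discussion on why this matrix must be positive definite to ensure stable, non-divergent iterations. The $\mathbf{C}^{-1}$ matrix in Equation (3) can be replaced by any positive definite matrix and the fixed point will be unaffected, but the rate of convergence will almost surely change.
Instead of following the usual recipe for deriving GTD, let us try replacing $\mathbf{C}^{-1}$ with
$$\mathbf{A}_\mu^{-\top} = \mathbb{E}_\mu\left[(\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})(\mathbf{e}^\mu_t)^\top\right]^{-1},$$
where $\mathbf{e}^\mu_t$ is the regular on-policy trace for the behavior policy (i.e., no importance weights):
$$\mathbf{e}^\mu_t = \lambda \gamma_t \mathbf{e}^\mu_{t-1} + \mathbf{x}_t.$$
The matrix $\mathbf{A}_\mu^{-\top}$ is a positive definite matrix (proved by Tsitsiklis and van Roy tsitsiklis1997ananalysis). Plugging $\mathbf{A}_\mu^{-\top}$ into (2) results in the following expected update:
$$\tfrac{1}{\alpha}\mathbb{E}[\Delta \mathbf{w}_t] = \mathbf{A}_\pi^\top \mathbf{A}_\mu^{-\top} (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi) = (\mathbf{A}_\mu^\top - \mathbf{A}_\mu^\top + \mathbf{A}_\pi^\top)\, \mathbf{A}_\mu^{-\top} (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi)$$
$$= (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi) + (\mathbf{A}_\pi^\top - \mathbf{A}_\mu^\top)\, \mathbf{A}_\mu^{-\top} (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi). \tag{5}$$
As in the derivation of GTD($\lambda$) maei2011gradient, let the vector $\mathbf{h}_t$ form a quasi-stationary estimate of the final term,
$$\mathbf{h}_t \approx \mathbf{A}_\mu^{-\top} (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi).$$
Getting back to the primary weight update, we can sample the first term using the fact that $(-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi) = \mathbb{E}_\mu[\delta_t \mathbf{e}_t]$ (see maei2011gradient) and use (1) to get the final stochastic update
$$\Delta \mathbf{w}_t = \alpha\left[\delta_t \mathbf{e}_t + (\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})(\mathbf{e}_t - \mathbf{e}^\mu_t)^\top \mathbf{h}_t\right]. \tag{6}$$
Notice that when the data is generated on-policy ($\pi = \mu$), $\mathbf{e}_t = \mathbf{e}^\mu_t$, and thus the correction term disappears and we are left with precisely linear TD($\lambda$). When $\pi \neq \mu$, the TD update is corrected as in GTD: unsurprisingly, the correction is slightly different but has the same basic form.
To complete the derivation, we must derive an incremental update rule for $\mathbf{h}_t$. We have a linear system, because
$$\mathbf{h}_t = \mathbf{A}_\mu^{-\top} (-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi) \implies \mathbf{A}_\mu^\top \mathbf{h}_t = -\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi.$$
Following the general expected update in (4),
$$\mathbf{h}_{t+1} \leftarrow \mathbf{h}_t + \alpha_h\left[(-\mathbf{A}_\pi \mathbf{w}_t + \mathbf{b}_\pi) - \mathbf{A}_\mu^\top \mathbf{h}_t\right], \tag{7}$$
which converges if $\mathbf{A}_\mu^\top$ is positive definite for any fixed $\mathbf{w}_t$ and $\alpha_h$ is chosen appropriately (see Sutton et al.’s recent paper sutton2015anemphatic for an extensive discussion of convergence in expectation). To sample this update, recall
$$\mathbf{A}_\mu^\top = \mathbb{E}_\mu\left[(\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})(\mathbf{e}^\mu_t)^\top\right],$$
giving the stochastic update rule for $\mathbf{h}_t$:
$$\Delta \mathbf{h}_t = \alpha_h\left[\delta_t \mathbf{e}_t - (\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})(\mathbf{e}^\mu_t)^\top \mathbf{h}_t\right].$$
As in GTD, $\alpha$ and $\alpha_h$ are step-size parameters, and $\delta_t = R_{t+1} + \gamma_{t+1}\mathbf{w}_t^\top \mathbf{x}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t$. This hybrid-TD algorithm should converge under off-policy sampling using a proof technique similar to the one used for GQ($\lambda$) (see Maei & Sutton’s proof maei2010gq), but we leave this to future work. The HTD($\lambda$) algorithm is completely specified by the following equations:
$$\mathbf{e}_t = \rho_t\left(\lambda \gamma_t \mathbf{e}_{t-1} + \mathbf{x}_t\right)$$
$$\mathbf{e}^\mu_t = \lambda \gamma_t \mathbf{e}^\mu_{t-1} + \mathbf{x}_t$$
$$\Delta \mathbf{w}_t = \alpha\left[\delta_t \mathbf{e}_t + (\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})(\mathbf{e}_t - \mathbf{e}^\mu_t)^\top \mathbf{h}_t\right]$$
$$\Delta \mathbf{h}_t = \alpha_h\left[\delta_t \mathbf{e}_t - (\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1})(\mathbf{e}^\mu_t)^\top \mathbf{h}_t\right]$$
This algorithm can be made more efficient by exploiting the common terms in $\Delta \mathbf{w}_t$ and $\Delta \mathbf{h}_t$, as shown in the appendix.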
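A single HTD($\lambda$) step might look like the sketch below. It assumes the two-trace form of the algorithm (an importance-weighted trace and an uncorrected behavior trace) with a constant $\lambda$; the function name `htd_lambda_step` and its arguments are hypothetical. The shared quantities $\delta_t \mathbf{e}_t$ and $\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1}$ are computed once and reused in both updates.

```python
import numpy as np

def htd_lambda_step(w, h, e, e_mu, x, r, x_next, rho,
                    alpha, alpha_h, gamma, gamma_next, lam):
    """One linear HTD(lambda) update (hypothetical helper).

    e    : importance-weighted trace
    e_mu : behavior-policy trace without importance weights
    """
    e = rho * (lam * gamma * e + x)
    e_mu = lam * gamma * e_mu + x
    delta = r + gamma_next * (w @ x_next) - (w @ x)
    # Terms shared by the primary and auxiliary updates.
    td_term = delta * e
    diff = x - gamma_next * x_next
    # The correction is proportional to (e - e_mu), so it is zero on-policy.
    w_new = w + alpha * (td_term + diff * ((e - e_mu) @ h))
    h_new = h + alpha_h * (td_term - diff * (e_mu @ h))
    return w_new, h_new, e, e_mu

x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
h0 = np.array([1.0, 2.0])
# On-policy (rho = 1): the primary update is exactly the TD(lambda) update.
w_on, h_on, _, _ = htd_lambda_step(np.zeros(2), h0, np.zeros(2), np.zeros(2),
                                   x, 1.0, x_next, 1.0,
                                   0.5, 0.1, 0.9, 0.9, 0.5)
# Off-policy (rho = 0.5): the correction term is active.
w_off, _, _, _ = htd_lambda_step(np.zeros(2), h0, np.zeros(2), np.zeros(2),
                                 x, 1.0, x_next, 0.5,
                                 0.5, 0.1, 0.9, 0.9, 0.5)
```

In the on-policy call the two traces coincide, so the value of `h0` has no effect on the primary weights; only the off-policy call is influenced by the auxiliary weights.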
4 True online HTD
Recently, a new forward-backward view equivalence has been proposed for online TD methods, resulting in true-online TD vanseijen2014true and true-online GTD vanhasselt2014off algorithms. The original forward-backward equivalence was for offline TD($\lambda$) (the idea of defining a forward-view objective and then converting this computationally impractical forward view into an efficiently implementable algorithm using traces is extensively treated in Sutton and Barto’s introductory text sutton1998reinforcement). To derive a forward-backward equivalence under online updating, a new truncated return was proposed, which uses the online weight vector that changes into the future,
with . A forwardview algorithm can be defined that computes online assuming access to future samples, and then an exactly equivalent incremental backwardview algorithm can be derived that does not require access to future samples. This framework was used to derive the TOTD algorithm for the onpolicy setting, and TOGTD for the more general offpolicy setting. This new trueonline equivalence is not only interesting theoretically, but also translates into improved prediction and control performance vanseijen2014true ; vanhasselt2014off .
In this section, we derive a true-online variant of HTD($\lambda$). When used on-policy, HTD($\lambda$) behaves similarly to TOTD($\lambda$). Our goal in this section is to combine the benefits of both hybrid learning and true-online traces in a single algorithm. We proceed with a similar derivation to TOGTD($\lambda$) (vanhasselt2014off, Theorem 4), with the main difference appearing in the update of the auxiliary weights. Notice that the primary weights $\mathbf{w}$ and the auxiliary weights $\mathbf{h}$ of HTD($\lambda$) have a similar structure. Recall from (5) that the modified gradient of the MSPBE, or expected primary-weight update, can be written as:
Similarly, we can rewrite the expected update of the auxiliary weights by plugging into (7):
As in the derivation of TOGTD (vanhasselt2014off, Equations 17 and 18), for TOHTD we will sample the second part of the update using a backward view and obtain forward-view samples for . The resulting TOHTD($\lambda$) algorithm is completely specified by the following equations
(8)  
In order to prove that this is a true-online update, we use the constructive theorem due to van Hasselt et al. vanhasselt2014off.
Theorem 1 (True-online HTD($\lambda$))
For any , the weight vectors as defined by the forward view
are equal to , as defined by the backward view in (8).
Proof
We apply (vanhasselt2014off, , Theorem 1). The substitutions are
where is called in van Hasselt’s Theorem 1 vanhasselt2014off . The proof then follows through in the same way as in van Hasselt’s Theorem 4 vanhasselt2014off , where we apply Theorem 1 to and separately.
Our TOHTD(0) algorithm is equivalent to HTD(0), but TOHTD($\lambda$) is not equivalent to TOTD($\lambda$) under on-policy sampling. To achieve the latter equivalence, replace and . We opted for the first equivalence for two reasons. In preliminary experiments, TOHTD($\lambda$) described in Equation (8) already exhibited similar performance compared to TOTD($\lambda$), and so designing for the second equivalence was unnecessary. Further, TOGTD($\lambda$) was derived to ensure equivalence between TOGTD(0) and GTD(0); this choice, therefore, better parallels that equivalence.
Given our two new hybrid methods and the long list of existing linear prediction algorithms, we now focus on how these algorithms perform in practice.
5 Experimental study
Our empirical study focused on three main aspects: (1) early learning performance with different feature representations, (2) parameter sensitivity, and (3) efficacy in on- and off-policy learning. The majority of our experiments were conducted on random MDPs (variants of those used in previous studies mahmood2015off ; geist2014off). Each random MDP contains 30 states, and three actions in each state. From each state, and for each action, the agent can transition to one of four next states, assigned randomly from the entire set without replacement. Transition probabilities for each MDP instance are randomly sampled from and the transitions were normalized to sum to one. The expected reward for each transition is also generated randomly in and the reward on each transition was sampled without noise. Two transitions are randomly selected to terminate: for . Each problem instance is held fixed during learning.
We experimented with three different feature representations. The first is a tabular representation where each state is represented with a binary vector with a single one corresponding to the current state index. This encoding allows perfect representation of the value function with no generalization over states. The second representation is computed by taking the tabular representation and aliasing five states to all have the same feature vector, so the agent cannot differentiate these states. These five states were selected randomly without replacement for each MDP instance. The third representation is a dense binary encoding where the feature vector for each state is the binary encoding of the state index, and thus the feature vector for a 30-state MDP requires just five components. Although the binary representation appears to exhibit an inappropriate amount of generalization, we believe it to be more realistic than a tabular representation, because access to MDP state is rare in real-world domains (e.g., a robot with continuous sensor values). The binary representation should be viewed as an approximation to the poor, and relatively low-dimensional (compared to the number of states in the world) representations common in real applications. All feature encodings were normalized. Experiments conducted with the binary representation use , and the rest use .
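The three feature representations can be sketched as follows. This is a sketch under stated assumptions rather than the code used in the study: the function names are hypothetical, normalization is assumed to mean scaling each feature vector to unit Euclidean length, and state indices are assumed to start at 1 for the binary encoding so that no state maps to the all-zero vector.

```python
import numpy as np

def tabular_features(n_states):
    """One-hot vector per state; rows already have unit length."""
    return np.eye(n_states)

def aliased_features(n_states, n_aliased, rng):
    """Tabular features where n_aliased randomly chosen states share a vector."""
    X = np.eye(n_states)
    aliased = rng.choice(n_states, size=n_aliased, replace=False)
    X[aliased] = X[aliased[0]].copy()
    return X, aliased

def binary_features(n_states):
    """Binary encoding of the (1-based, by assumption) state index."""
    n_bits = n_states.bit_length()          # 30 states -> 5 bits
    idx = np.arange(1, n_states + 1)
    X = ((idx[:, None] >> np.arange(n_bits)) & 1).astype(float)
    # Assumed normalization: unit Euclidean norm per feature vector.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_tab = tabular_features(30)
X_ali, aliased_states = aliased_features(30, 5, rng)
X_bin = binary_features(30)
```

With 30 states the tabular and aliased matrices have 30 columns, while the binary matrix has only five, which is the source of the heavy generalization discussed above.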
To generate policies with purposeful behavior, we forced the agent to favor a single action in each state. The target policy is generated by randomly selecting an action and assigning it probability 0.9 (i.e., $\pi(s, a) = 0.9$) in each state, and then assigning the remaining actions the remaining probability evenly. In the off-policy experiments the behavior policy is modified to be slightly different from the target policy, by selecting the same base action, but instead assigning a probability of 0.8 (i.e., $\mu(s, a) = 0.8$). This choice ensures that the policies are related, but guarantees that $\rho_t$ is never greater than 1.5, thus avoiding inappropriately large variance due to importance sampling (see the recent study by Mahmood & Sutton mahmood2015off for an extensive treatment of off-policy learning domains with large variance due to importance sampling).
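The policy construction and the resulting importance-sampling ratios can be checked with a short sketch. The helper name `make_policies` is hypothetical; the probabilities 0.9 and 0.8 and the three actions per state come from the description above. With three actions the non-favored actions receive 0.05 each under the target policy and 0.1 each under the behavior policy, so the ratio is 0.9/0.8 = 1.125 for the favored action and 0.05/0.1 = 0.5 otherwise.

```python
import numpy as np

def make_policies(n_states, n_actions, rng, p_target=0.9, p_behavior=0.8):
    """Target and behavior policies that favor the same action in each state."""
    favored = rng.integers(n_actions, size=n_states)
    rows = np.arange(n_states)
    # Spread the remaining probability evenly over the other actions.
    pi = np.full((n_states, n_actions), (1.0 - p_target) / (n_actions - 1))
    mu = np.full((n_states, n_actions), (1.0 - p_behavior) / (n_actions - 1))
    pi[rows, favored] = p_target
    mu[rows, favored] = p_behavior
    return pi, mu

rng = np.random.default_rng(0)
pi, mu = make_policies(30, 3, rng)
rho = pi / mu   # importance-sampling ratio for every state-action pair
```

Both possible ratio values are below the stated bound of 1.5, which keeps the variance introduced by importance sampling small in the random MDP experiments.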
Our experiment compared 12 different linear-complexity value function learning algorithms: GTD($\lambda$), HTD($\lambda$), true-online GTD($\lambda$), true-online HTD($\lambda$), true-online ETD($\lambda$), true-online ETD($\beta$), PTD($\lambda$), GTD2-mp($\lambda$), TDC-mp($\lambda$), linear off-policy TD(0), TD($\lambda$), and true-online TD($\lambda$). The latter two are applicable only in on-policy domains, and the two mirror-prox methods are straightforward extensions (described in the appendix) of the GTD2-mp and TDC-mp methods mahadevan2014proximal to handle traces ($\lambda > 0$). We drop the $\lambda$ designation of each method in the figure labels to reduce clutter.
Our results were generated by performing a large parameter sweep, averaged over many independent runs, for each random MDP instance, and then averaging the results over the entire set of MDPs. We tested 14 different values of the stepsize parameter , seven values of (), and 20 values of . We intentionally precluded smaller values of from the parameter sweep because many of the gradient TD methods simply become their onpolicy variants as approaches zero, whereas in some offpolicy domains values of are required to avoid divergence white2015thesis . We believe this range of fairly reflects how the algorithms would be used in practice if avoiding divergence was a priority. The parameter of TOETD() was set equal to . Each algorithm instance, defined by one combination of , and was evaluated using the mean absolute value error on each time step,
averaged over 30 MDPs, each with 100 runs. Here $v_\pi$ denotes the true state-value function, which can be easily computed with access to the parameters of the MDP.
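For a finite MDP with known parameters, the true state-value function can be obtained from a single linear solve. The sketch below is illustrative only: the function names, the array layout `P[s, a, s']` and `R[s, a, s']`, and the constant discount `gamma` are assumptions (termination on selected transitions is not modeled), and the state weighting `d` in the error is a placeholder supplied by the caller rather than the weighting used in the study.

```python
import numpy as np

def true_values(P, R, pi, gamma):
    """Solve v = r_pi + gamma * P_pi v for the target policy pi.

    P[s, a, t] : probability of moving from s to t under action a
    R[s, a, t] : expected reward on that transition
    pi[s, a]   : target policy probabilities
    """
    P_pi = np.einsum("sa,sat->st", pi, P)            # state-to-state matrix
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)      # expected one-step reward
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def mean_abs_value_error(X, w, v_pi, d):
    """Weighted mean absolute error of the linear estimate X @ w."""
    return float(d @ np.abs(X @ w - v_pi))

# Two states, one action: state 0 moves to state 1 with reward 1,
# state 1 loops on itself with reward 0.
P = np.zeros((2, 1, 2)); P[0, 0, 1] = 1.0; P[1, 0, 1] = 1.0
R = np.zeros((2, 1, 2)); R[0, 0, 1] = 1.0
v = true_values(P, R, np.ones((2, 1)), gamma=0.9)
err = mean_abs_value_error(np.eye(2), np.array([0.5, 0.0]), v,
                           d=np.array([0.5, 0.5]))
```

In the toy example the solve gives a value of 1 for the first state and 0 for the second, so an estimate of 0.5 for the first state yields a weighted error of 0.25.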
The graphs in Figures 1 and 2 include (a) learning curves with , and selected to minimize the mean absolute value error, for each of the three different feature representations, and (b) parameter sensitivity graphs for , and , in which the mean absolute value error is plotted against the parameter value, while the remaining two parameters are selected to minimize mean absolute value error. These graphs are included across feature representations, for on and offpolicy learning. Across all results the parameters are selected to optimize performance over the last half of the experiment to ensure stable learning throughout the run.
To analyze large variance due to importance sampling and off-policy learning we also investigated Baird’s counterexample baird1995residual, a simple MDP that causes TD learning to diverge. This seven-state MDP uses a target policy that is very different from the behavior policy, and a feature representation that allows perfect representation of the value function but also causes inappropriate generalization. We used the variant of this problem described by Maei maei2011gradient and White (white2015thesis, Figure 7.1). We present results with the root mean squared error in Figure 1. (In this counterexample the mean absolute value error is not appropriate because the optimal values for this task are zero. The MSPBE is often used as a performance measure, but the MSPBE changes with $\lambda$; for completeness, we include results with the MSPBE in the appendix.) The experiment was conducted in the same way as the random MDPs, except we did not average over MDPs (there is only one) and we used different parameter ranges. We tested 11 different values of the step-size parameter $\alpha$, 12 values of (), and the same 20 values of . We did not evaluate TD(0) on this domain because the algorithm will diverge and that has been shown many times before.
In addition to the performance results in Figures 1 and 2, Table 1 summarizes the runtime comparison for these algorithms. Though the algorithms are all linear in storage and computation, they do differ in both implementation and runtime, particularly due to true-online traces. The appendix contains several plots of runtime versus value error illustrating the trade-off between computation and sample complexity for each algorithm. Due to space constraints, we have included the aliased tabular representation results for on-policy learning in the appendix, since they are similar to the tabular representation results in on-policy learning.
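Table 1: Runtime comparison of the algorithms under on-policy and off-policy sampling.
Setting | TD(0) | TD($\lambda$) | TOTD | PTD | GTD | TOETD | TOETD($\beta$) | HTD | TOGTD | GTD-MP | TDC-MP | TOHTD
On-policy | 120.0 | 132.7 | 150.1 | 172.4 | 204.6 | 287.8 | 286.0 | 311.8 | 366.2 | 467.4 | 466.2 | 466.0
Off-policy | 108.3 | - | - | 158.7 | 175.2 | 249.65 | 254.7 | 267.5 | 316.2 | 407.8 | 395.7 | 403.3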
[Figure: panels for tabular features, aliased tabular features, and binary features; learning curves and parameter sensitivity. The error bars are standard errors computed from 100 independent runs.]
[Figure: panels for tabular features, binary features, and Baird’s counterexample.]

6 Discussion
There are three broad conclusions suggested by our results. First, we could not clearly demonstrate the supposed superiority of TD($\lambda$) over gradient TD methods in the on-policy setting. In both tabular and aliased feature settings GTD($\lambda$) achieved faster learning and superior parameter sensitivity compared to TD, PTD, and HTD. Notably, the sensitivity of the GTD algorithm was very reasonable in both domains; however, large values of the auxiliary step size were required to achieve good performance on Baird’s for both GTD($\lambda$) and TOGTD($\lambda$). Our on-policy experiments with binary features did indicate a slight advantage for TD($\lambda$), PTD, and HTD, and that PTD and HTD exhibit zero sensitivity to the choice of the auxiliary step size, as expected. In off-policy learning there is little difference between GTD($\lambda$), PTD, and HTD. Our results combined with the prior work of Dann et al. dann2014policy suggest that the advantage of conventional TD($\lambda$) over gradient TD methods, in on-policy learning, is limited to specific domains.
Our second conclusion is that the new mirror-prox methods achieved poor performance in most settings except Baird’s counterexample. Both GTD2-mp and TDC-mp achieved the best performance in Baird’s counterexample. We hypothesize that the two-step gradient computation more effectively uses the transition to state 7, and so is ideally suited to the structure of the domain (Baird’s counterexample uses a specific initialization of the primary weights, far from one of the true solutions). However, the GTD2-mp method performed worse than off-policy TD(0) in all off-policy random MDP domains, while the learning curves of TDC-mp exhibited higher variance than other methods in all but the on-policy binary case and high parameter sensitivity across all settings except Baird’s. This does not seem to be a consequence of the extension to eligibility traces because in all cases except Baird’s, both TDC-mp and GTD2-mp performed best with . Like GTD and HTD, the mirror-prox methods would likely have performed better with smaller values of the auxiliary step size; however, this is undesirable because a larger auxiliary step size is required to ensure good performance in some off-policy domains, such as Baird’s (e.g., ).
Third and finally, several methods exhibited non-convergent behavior on Baird’s counterexample. All methods that exhibited reliable error reduction in Baird’s did so with $\lambda$ near zero, suggesting that eligibility traces are of limited use in these more extreme off-policy domains. In the case of PTD, non-convergent behavior is not surprising since our implementation of this algorithm does not include gradient correction (a possible extension suggested by the authors sutton2014anew) and thus is only guaranteed to converge under off-policy sampling in the tabular case. For the emphatic TD methods the performance on Baird’s remains a concern, especially considering how well TOETD($\beta$) performed in all our other experiments. The addition of the $\beta$ parameter appears to significantly improve TOETD in the off-policy domain with binary features, but could not mitigate the large variance in produced by the counterexample. It is not clear if this bad behavior is inherent to emphatic TD methods, or could be solved by more careful specification of the state-based interest function. (The variance of TOETD has been examined before in two-state domains sutton2015anemphatic. ETD is thought to have higher variance than other TD algorithms due to the emphasis weighting.) In our implementation, we followed the original authors’ recommendation of setting the interest for each state to 1.0 sutton2015anemphatic, because all our domains were discounted and continuing. Additionally, both HTD($\lambda$) and TOHTD($\lambda$) did not diverge on Baird’s, but their performance was less than satisfactory.
Overall, the conclusions implied by our empirical study are surprisingly clear. If guarding against large variance due to off-policy sampling is a chief concern, then GTD($\lambda$) and TOGTD($\lambda$) should be preferred. Between the two, GTD($\lambda$) should be preferred if computation is at a premium. If poor performance in problems like Baird’s is not a concern, then TOETD($\beta$) was clearly the best across our experiments, and exhibited nearly the best runtime results. TOETD($\lambda$), on the other hand, exhibited high variance in off-policy domains and sharp parameter sensitivity, indicating that parameter tuning of emphatic methods may be an issue in practice.
7 Appendix
Additional results and analysis can be found in the full version of the paper: http://arxiv.org/abs/1602.08771.
References
[1] B. Argall, S. Chernova, M. M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 2009.
[2] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific Press, 1996.
[4] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research, 2014.
[5] M. Geist and B. Scherrer. Off-policy learning with eligibility traces: a survey. The Journal of Machine Learning Research, 2014.
[6] L. Hackman. Faster Gradient-TD Algorithms. PhD thesis, University of Alberta, 2012.
[7] A. Hallak, A. Tamar, R. Munos, and S. Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. arXiv preprint arXiv:1509.05172, 2015.
[8] L.-J. Lin. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning, 1992.
[9] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-Sample Analysis of Proximal Gradient TD Algorithms. In Conference on Uncertainty in Artificial Intelligence, 2015.
[10] H. Maei and R. Sutton. GQ($\lambda$): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In AGI, 2010.
[11] H. R. Maei. Gradient temporal-difference learning algorithms. University of Alberta, 2011.
[12] S. Mahadevan and B. Liu. Sparse Q-learning with mirror descent. In Conference on Uncertainty in Artificial Intelligence, 2012.
[13] S. Mahadevan, B. Liu, P. S. Thomas, W. Dabney, S. Giguere, N. Jacek, I. Gemp, and J. Liu. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. CoRR abs/1405.6757, 2014.
[14] A. R. Mahmood and R. Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity. In Conference on Uncertainty in Artificial Intelligence, 2015.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[16] R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[17] R. Sutton, H. Maei, D. Precup, and S. Bhatnagar. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, 2009.
[18] R. Sutton, J. Modayil, M. Delp, T. Degris, P. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.
[19] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
[20] R. S. Sutton, A. R. Mahmood, D. Precup, and H. van Hasselt. A new Q($\lambda$) with interim forward view and Monte Carlo equivalence. In International Conference on Machine Learning, 2014.
[21] R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 2015.
[22] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
[23] H. van Hasselt, A. R. Mahmood, and R. Sutton. Off-policy TD($\lambda$) with a true online equivalence. In Conference on Uncertainty in Artificial Intelligence, 2014.
[24] H. van Seijen and R. Sutton. True online TD($\lambda$). In International Conference on Machine Learning, 2014.
[25] C. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, 1989.
[26] A. White. Developing a predictive approach to knowledge. PhD thesis, University of Alberta, 2015.
[27] H. Yu. On convergence of emphatic temporal-difference learning. In Annual Conference on Learning Theory, 2015.
Appendix A Algorithms
The original ETD($\lambda$) algorithm as proposed by Sutton et al. (2015) is a not entirely obvious manipulation of the true-online ETD($\lambda$) described above and used in our experiments. The difference is in the definition of the eligibility trace and the primary weight update. To achieve the original ETD($\lambda$) algorithm, modify the above true-online ETD($\lambda$) algorithm to use
and
In all the algorithms that follow, we assume , are initialized arbitrarily, and eligibility traces are initialized to a vector of zeros (e.g., ).