1. Introduction
As a basic mechanism in reinforcement learning (RL), eligibility traces Sutton (1988) unify and generalize temporal-difference (TD) and Monte Carlo methods Sutton and Barto (2011). As a temporary record of an event (e.g., taking an action or visiting a state), an eligibility trace marks the memory parameters associated with the event as eligible for undergoing changes Sutton and Barto (1998). The eligible traces are then used to assign credit to the current TD-error, which drives the learning of policies. With traces, credit is passed through multiple preceding states, and learning is therefore often significantly faster Singh and Dayan (1998).
In on-policy TD learning with traces (e.g., TD(λ), Sarsa(λ)), the credit assigned to previous states decays exponentially according to the parameter λ. If λ = 0, the traces are set to zero immediately and the on-policy TD learning algorithm with traces is equal to one-step TD learning. If λ = 1, the traces fade away slowly and no bootstrapping is performed, thus producing the Monte Carlo algorithm with online update Sutton et al. (2014). Moreover, an intermediate value of λ often makes the learning algorithm perform better than the methods at either extreme.
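To make the trace mechanism concrete, the following is a minimal tabular sketch of one on-policy TD(λ)-style control update with accumulating traces. The state/action space sizes, hyperparameter values, and the single illustrative transition are our own assumptions for illustration, not taken from the paper.

```python
import numpy as np

def td_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, lam=0.8):
    """One on-policy TD(lambda)-style update with accumulating traces."""
    e[s, a] += 1.0                                   # mark (s, a) as eligible
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    Q += alpha * delta * e                           # credit ALL eligible pairs
    e *= gamma * lam                                 # traces decay exponentially
    return Q, e

# Illustrative single transition in a toy 5-state, 2-action table.
Q = np.zeros((5, 2))
e = np.zeros((5, 2))
Q, e = td_lambda_step(Q, e, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Note how every previously visited pair with a nonzero trace receives a share of the current TD error, which is exactly how credit propagates backward faster than with one-step TD.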
In the off-policy case, when the samples generated from a behavior policy are used to learn a different target policy, the usual approach is to measure the difference between the two policies in a probabilistic way. For example, Per-Decision Importance Sampling Precup et al. (2000) weights returns based on the mismatch between the target and behavior probabilities of the related actions. Alternatively, the Tree-backup (TB) algorithm Precup et al. (2000) combines the value estimates of the actions along the traces according to their target-policy probabilities. More recently, Retrace(λ) Munos et al. (2016) combines Naive Q(λ) with importance sampling, and offers a safe (whatever the behavior policy is) and efficient (able to learn from full returns) way of off-policy reinforcement learning. However, existing off-policy learning methods based on state-action probabilities are inefficient in utilizing the traces, especially when the target policy is deterministic, which is common in control problems. If the target policy is deterministic, its probability is zero whenever an exploratory action is taken. In this setting, importance sampling always involves a large variance, since the importance ratio may be greater than 1, and is rarely used in practice. Retrace(λ) and TB(λ) are identical to Watkins' Q(λ) Watkins (1989), and the traces are cut when an exploratory action is taken. This may lose the advantage of eligibility traces and slow down the learning process Sutton and Barto (1998). Peng's Q(λ) Peng and Williams (1994) tried to solve this problem, but fails to converge to the optimal value. On the other hand, some existing methods do not depend on target-policy probabilities and can learn from full returns without cutting traces under a greedy target policy. Unfortunately, some of them face limitations in convergence. For instance, Naive Q(λ) Sutton and Barto (1998) never cuts traces and thus provides a way to use full returns when performing off-policy RL with eligibility traces, which can sometimes achieve a better performance than Watkins' Q(λ) Leng et al. (2009). A more recent work by Harutyunyan et al. (2016) shows that Naive Q(λ) for control can converge to the optimal value under some conditions. An open question is: what about the intermediate region between target-policy-probability-based and non-target-policy-probability-based methods?
To address the above question, in this paper we propose the TBQ(σ) algorithm, which unifies TB(λ) (cutting traces immediately) and Naive Q(λ) (never cutting traces). By introducing a new parameter σ to express the degree of trace utilization, TBQ(σ) creates a continuous integration and role shift between TB(λ) and Naive Q(λ). If σ = 1, TBQ(σ) becomes Naive Q(λ), which never cuts traces; and if σ = 0, TBQ(σ) becomes TB(λ), which under a greedy target policy is Watkins' Q(λ). We then theoretically analyze the contraction property of TBQ(σ) for both policy evaluation and control settings. We also derive the online version of TBQ(σ) and give the convergence proof. Compared to TB(λ), TBQ(σ) is efficient in trace utilization under a greedy target policy. Compared to Naive Q(λ), TBQ(σ) can achieve convergence by adjusting a suitable σ. We empirically show that, for ε-greedy behavior policies, there exists some degree of trace utilization σ ∈ (0, 1) that improves the efficiency of trace utilization, thereby accelerating the learning process and improving the performance as well.
2. Preliminaries and Problem Settings
Here, we introduce some basic concepts, our target problems, notations, and related work.
2.1. Preliminaries and Problem Settings
A reinforcement learning problem can be formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is the action space, $\gamma \in [0, 1)$ is the discount factor, and $P$ maps each state-action pair to a distribution over $\mathcal{S}$. A policy $\pi$ is a probability distribution over the action set $\mathcal{A}$. The state-action value function $Q^\pi$ is a mapping from $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}$, which indicates the expected discounted future reward when taking action $a$ at state $s$ under policy $\pi$:
(1) $Q^\pi(s, a) = \mathbb{E}_\pi\big[\textstyle\sum_{t=0}^{T} \gamma^t r_t \mid s_0 = s, a_0 = a\big],$
where $T$ is the time of termination. For each policy $\pi$, we define the operator $P^\pi$ Harutyunyan et al. (2016):
$(P^\pi Q)(s, a) := \textstyle\sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q(s', a').$
For an arbitrary policy $\pi$ we use $Q^\pi$ to denote the unique Q-function corresponding to $\pi$:
$Q^\pi = \textstyle\sum_{t \ge 0} \gamma^t (P^\pi)^t r.$
The Bellman operator $\mathcal{T}^\pi$ for a policy $\pi$ is defined as:
(2) $\mathcal{T}^\pi Q := r + \gamma P^\pi Q.$
Obviously, $\mathcal{T}^\pi$ has a unique fixed point $Q^\pi$:
(3) $\mathcal{T}^\pi Q^\pi = Q^\pi.$
The Bellman optimality operator $\mathcal{T}$ introduces a maximization over a set of policies and is defined as:
(4) $(\mathcal{T} Q)(s, a) := r(s, a) + \gamma \textstyle\sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a').$
Its unique fixed point is $Q^*$.
The Bellman equation can also be extended using the exponentially weighted sum of n-step returns Sutton (1988):
(5) $\mathcal{T}^\pi_\lambda Q := (1 - \lambda) \textstyle\sum_{n \ge 0} \lambda^n (\mathcal{T}^\pi)^{n+1} Q.$
In this λ-return version of the Bellman equation, the fixed point of $\mathcal{T}^\pi_\lambda$ is also $Q^\pi$. By varying the parameter λ from 0 to 1, $\mathcal{T}^\pi_\lambda$ provides a continuous connection and role shift between one-step TD learning and Monte Carlo methods.
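As an illustration of this weighting, the sketch below computes a finite-horizon λ-return as the exponentially weighted sum of n-step returns: λ = 0 recovers the one-step TD target, and λ = 1 recovers the full Monte Carlo return. The function name and the bootstrap-value array are our own, hypothetical choices.

```python
def lambda_return(rewards, values, gamma=1.0, lam=0.5):
    """Finite-horizon lambda-return: (1-lam) * sum_n lam^(n-1) * G_{0:n},
    where G_{0:n} bootstraps on values[n]; the full return gets the tail weight."""
    T = len(rewards)
    G = 0.0
    ret = 0.0  # running discounted reward sum
    for n in range(1, T):
        ret += gamma ** (n - 1) * rewards[n - 1]
        # n-step return bootstraps on the value estimate after n steps
        G += (1 - lam) * lam ** (n - 1) * (ret + gamma ** n * values[n])
    # the terminal (Monte Carlo) return receives the remaining weight lam^(T-1)
    ret += gamma ** (T - 1) * rewards[T - 1]
    G += lam ** (T - 1) * ret
    return G

# lam=0 yields the one-step TD target; lam=1 yields the Monte Carlo return.
td_target = lambda_return([1.0, 1.0, 1.0], [0.0, 0.5, 0.5], gamma=1.0, lam=0.0)
mc_return = lambda_return([1.0, 1.0, 1.0], [0.0, 0.5, 0.5], gamma=1.0, lam=1.0)
```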
In this paper, we consider two types of RL problems, and mainly focus on the action-value case under the off-policy setting. That is, in a policy evaluation problem, we wish to estimate $Q^\pi$ of a fixed target policy π from samples drawn from a different behavior policy μ; in a control problem, we seek to approximate $Q^*$ through the iteration of Q-values. We especially focus on the learning scenario where the target policy is greedy, which is common in the control setting. Our main challenge is to improve the efficiency of trace utilization while ensuring convergence during off-policy learning.
2.2. Related Work
Based on the usage of the target policy's probability when calculating the return, existing works can be divided into two categories:
2.2.1. Target-policy-probability-based methods.
Multi-step methods face challenges in the off-policy setting, which has motivated many methods to address them. The most common approach is to measure the two policies in a probabilistic sense Meng et al. (2018). Based on the work in Munos et al. (2016), several off-policy return-based methods that rely on target-policy probabilities — importance sampling (IS), tree-backup, and Retrace(λ) — can be expressed in a unified operator as follows:
(6) $\mathcal{R}Q(s, a) := Q(s, a) + \mathbb{E}_\mu\Big[\textstyle\sum_{t \ge 0} \gamma^t \big(\prod_{i=1}^{t} c_i\big)\big(r_t + \gamma \mathbb{E}_\pi Q(s_{t+1}, \cdot) - Q(s_t, a_t)\big)\Big],$
where $\mathbb{E}_\pi Q(s, \cdot) := \sum_a \pi(a \mid s) Q(s, a)$, and the trace coefficients $c_i$ distinguish the methods below.
Importance sampling: $c_i = \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}$. The IS methods correct the difference between target policy and behavior policy by the ratio of their probabilities Sutton and Barto (2011). For example, Per-Decision Importance Sampling (PDIS) Precup et al. (2000) incorporates eligibility traces with importance sampling. Since the estimated value contains a cumulative product of importance ratios, which may exceed 1, IS methods suffer from large variance and are seldom used in practice. In addition, weighted importance sampling Precup (2000) can reduce the variance of IS, but leads to a biased estimation.
Tree-backup: $c_i = \lambda \pi(a_i \mid s_i)$. The TB(λ) algorithm Precup et al. (2000) provides an alternative way of off-policy learning without IS. In control problems, if the target policy is greedy, then TB(λ) reduces to Watkins' Q(λ) Watkins and Dayan (1992). In this case, TB(λ) is not efficient, as it cuts the traces whenever an exploratory action is encountered and is not able to learn from the full returns.
Retrace(λ): $c_i = \lambda \min\big(1, \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}\big)$, proposed in Munos et al. (2016). Compared to IS methods, Retrace(λ) truncates the importance ratio at 1 to reduce the variance of IS. It is proved to converge under any behavior policy and can learn from full returns when the behavior and target policies are close. However, in the control case where the target policy is greedy, Retrace(λ) is identical to TB(λ) and is not efficient in utilizing traces.
2.2.2. Non-target-policy-probability-based methods.
In addition, there are also some methods that do not depend on target-policy probabilities and can make full use of the traces:
General Q(λ): General Q(λ) Van Seijen et al. (2009) Hasselt (2011) generalizes the on-policy Sarsa(λ) update to the off-policy setting while keeping the traces intact. In the control case, when the target policy is greedy, General Q(λ) is identical to Peng's Q(λ) Peng and Williams (1994). It does not cut traces as Watkins' Q(λ) does. However, when learning is off-policy, General Q(λ) leads to a biased estimation and does not converge to $Q^\pi$.
Q(λ) with off-policy corrections Harutyunyan et al. (2016): an off-policy correction method based on a Q-baseline. Their proposed operator is the same as (6) with $c_i = \lambda$. Their algorithms are named $Q^\pi(\lambda)$ and $Q^*(\lambda)$ for policy evaluation and control, respectively. If the distance between the target and behavior policies is small, $Q^\pi(\lambda)$ converges to its fixed point $Q^\pi$. In control scenarios, $Q^*(\lambda)$ is equal to Naive Q(λ) Sutton and Barto (1998) and is guaranteed to converge to $Q^*$ for sufficiently small λ. Besides, they also show empirically that there in fact exists a trade-off between λ and the policy distance beyond the convergence guarantee, which can make learning faster and better. In addition, Q(σ, λ) is proposed in Yang et al. (2018) to combine Sarsa(λ) and $Q^\pi(\lambda)$, and inherits similar properties to Q(λ).
In conclusion, existing off-policy learning methods based on target-policy probabilities are inefficient in utilizing eligibility traces, especially when the target policy is greedy. In this scenario, the traces are cut immediately whenever an exploratory action is encountered, which may lose the advantage of eligibility traces and slow down the learning process. In addition, existing non-target-policy-probability-based methods can make full use of the traces, but may face limitations in convergence. In this paper, we try to resolve this dilemma by creating a hybridization of these two kinds of methods.
3. TBQ(σ): Degree of Trace Utilization
In the RL literature, unifying different algorithmic ideas to leverage the pros and cons of each idea and to produce better algorithms has been a pragmatic approach De Asis et al. (2017). This also applies to several policy learning methods, e.g., TD(λ) to unify TD learning and Monte Carlo methods, Q(σ) De Asis et al. (2017) to fuse multi-step tree-backup and Sarsa, and Q(σ, λ) Yang et al. (2018) to integrate Q(σ) and Sarsa(λ). Such hybridization is useful for balancing the capabilities of the different trace-cutting methods discussed above. Accordingly, in this paper we introduce a new parameter σ into trace-cutting to control the degree of trace utilization. The proposed method, TBQ(σ), unifies TB(λ) (cutting traces immediately) and Naive Q(λ) (never cutting traces).
We first give the definition of the operator used for the update equation of TBQ(σ):
Definition 3.1.
The proposed operator $\mathcal{T}^{\sigma}_{\lambda}$ is a map on Q-functions,
(7) $\mathcal{T}^{\sigma}_{\lambda} Q(s, a) := Q(s, a) + \mathbb{E}_\mu\Big[\textstyle\sum_{t \ge 0} \gamma^t \big(\prod_{i=1}^{t} c_i\big)\big(r_t + \gamma \mathbb{E}_\pi Q(s_{t+1}, \cdot) - Q(s_t, a_t)\big)\Big],$
where $c_i = \lambda \big[(1 - \sigma)\,\pi(a_i \mid s_i) + \sigma\big]$.
TBQ(σ) linearly combines TB(λ) and Naive Q(λ) through the degree parameter σ. When σ = 0, TBQ(σ) reduces to TB(λ); when σ = 1, it becomes Naive Q(λ). By adjusting the parameter σ from 0 to 1 we obtain a continuous integration and role shift between cutting the traces immediately and never cutting them. We now analyze the contraction property of $\mathcal{T}^{\sigma}_{\lambda}$ in policy evaluation. Throughout, $\|\cdot\|$ denotes the supremum norm.
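Since TBQ(σ) is a linear interpolation between the TB(λ) coefficient λπ(a|s) and the Naive Q(λ) coefficient λ, the per-step trace coefficient can be sketched in a few lines (function and argument names are our own):

```python
def trace_coefficient(lam, sigma, pi_prob):
    """TBQ(sigma) per-step trace coefficient: lam * ((1 - sigma) * pi(a|s) + sigma).
    sigma = 0 recovers TB(lambda); sigma = 1 recovers Naive Q(lambda)."""
    return lam * ((1.0 - sigma) * pi_prob + sigma)
```

Under a greedy target policy, pi_prob is 0 for an exploratory action, so σ directly controls how much of the trace survives such a step instead of being cut to zero.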
Theorem 3.2. The proposed operator $\mathcal{T}^{\sigma}_{\lambda}$ has a unique fixed point $Q^\pi$. If the behavior policy μ and the target policy π are sufficiently close, i.e., their distance satisfies a bound depending on λ, σ and γ, then the iterates of $\mathcal{T}^{\sigma}_{\lambda}$ converge to $Q^\pi$.
Proof.
Unfolding the operator:
Taking the supremum norm:
Per Lemma 1 in Harutyunyan et al. (2016) we have:
where ε denotes the distance between π and μ:
Per Theorem 1 in Munos et al. (2016) we have:
Adding the above two items we have:
where .
Further, for any λ ∈ [0, 1] and σ ∈ [0, 1], we have
∎
Theorem 3.2 indicates that, for any σ ∈ [0, 1], if the distance between the two policies is small enough with respect to σ, then repeated application of $\mathcal{T}^{\sigma}_{\lambda}$ converges to $Q^\pi$. Compared to Harutyunyan et al. (2016), our algorithm admits a wider convergence range w.r.t. λ. We thus provide a hybridization of trace utilization based on TB(λ) and Naive Q(λ). In practice, the convergence condition can be satisfied by adjusting the parameter σ to the situation at hand.
4. TBQ(σ) for Control
In control problems, we want to estimate $Q^*$ by iteratively applying policy evaluation and policy improvement, which is referred to as generalized policy iteration (GPI) Sutton and Barto (1998). Denoting by $Q_k$ the Q-value and by $\pi_k$ the corresponding target policy at step k of the iteration, under an arbitrary behavior policy μ, $Q^*$ can be retrieved by our operator through the following steps:

Policy evaluation step: $Q_{k+1} = \mathcal{T}^{\sigma}_{\lambda} Q_k$ (with target policy $\pi_k$);

Policy improvement step: $\pi_{k+1} = \mathcal{G}(Q_{k+1})$.
We here use the notation $\mathcal{G}(Q)$ to represent the policy that is greedy with respect to $Q$. Based on GPI, the TBQ(σ) algorithm for control problems is depicted in Algorithm 1 with an online forward view. Note that $\mathbb{1}\{\cdot\}$ is the indicator function.
To analyze the convergence of Algorithm 1, we first consider the offline version of the TBQ(σ) algorithm. The following lemma states that, if σ satisfies a certain condition with respect to λ, then the offline version of TBQ(σ) is guaranteed to converge.
Lemma 4.1.
Consider the sequence $(Q_k)$ generated by the operator $\mathcal{T}^{\sigma}_{\lambda}$ under a greedy target policy $\pi_k$ and an arbitrary behavior policy μ. We have:
$\|Q_{k+1} - Q^*\| \le \eta \|Q_k - Q^*\|,$
where η is a contraction factor depending on γ, λ and σ.
Specifically, if η < 1, then the sequence $(Q_k)$ converges to $Q^*$ exponentially fast.
Proof.
Unfolding the operator:
As a consequence, we deduce the result:
∎
Lemma 4.1 states that, for any σ ∈ [0, 1], if η < 1 then the offline control algorithm is guaranteed to converge. However, similar to Harutyunyan et al. (2016), in practice there exist trade-offs between λ and σ under different ε values, which go beyond the convergence guarantee. By introducing the new parameter σ, we can relax the constraint on λ by choosing a suitable σ, and the traces can still be utilized when an exploratory action is taken. In addition, compared to Naive Q(λ), we obtain a wider convergence range by tuning σ. Although we do not give a detailed theoretical analysis of the λ–σ relationship under different ε, in the experimental part we will show that, for any λ and ε in ε-greedy policies, there exists some degree of trace utilization σ that accelerates the learning process and yields a better performance by utilizing the full returns.
4.1. Convergence Analysis of the TBQ(σ) Algorithm
We now consider the convergence proof of the TBQ(σ) algorithm described in Algorithm 1. First, we make some assumptions similar to Harutyunyan et al. (2016); Munos et al. (2016).
Assumption 1.
Bounded step-sizes: $\alpha_k \in (0, 1]$, $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$.
Assumption 2.
Minimum visit frequency: all $(s, a)$ pairs are visited infinitely often.
Assumption 3.
Finite sample trajectories: $T_k < \infty$, where $T_k$ denotes the length of the k-th sample trajectory.
Under these assumptions, Algorithm 1 converges to $Q^*$ with probability 1, as stated below: consider the sequence of Q-functions $(Q_k)$ generated by Algorithm 1, where $\pi_k$ is the greedy policy with respect to $Q_k$; if η < 1, then $Q_k \to Q^*$ with probability 1 under Assumptions 1–3.
Proof.
For reading convenience, we first define some notations: let k denote the k-th iteration, $T_k$ the length of the current trajectory, and t the index of the t-th sample in that trajectory. The accumulating trace Sutton and Barto (1998) can then be written as:
(8) $e_t(s, a) = \gamma \lambda \big[(1 - \sigma)\,\pi(a_t \mid s_t) + \sigma\big]\, e_{t-1}(s, a) + \mathbb{1}\{s_t = s, a_t = a\}.$
We use the subscript k to emphasize the online setting; then Equation (7) can be written as:
(9) $Q_{k+1}(s, a) = Q_k(s, a) + \alpha_k \textstyle\sum_{t=0}^{T_k} \delta_t\, e_t(s, a),$
(10) $\delta_t = r_t + \gamma \max_{a'} Q_k(s_{t+1}, a') - Q_k(s_t, a_t).$
Since the trace-decay coefficient satisfies $\gamma \lambda [(1 - \sigma)\pi(a_t \mid s_t) + \sigma] \le \gamma \lambda < 1$, based on Assumption 3 we have:
(11) $e_t(s, a) \le \textstyle\sum_{i=0}^{t} (\gamma \lambda)^i \le \frac{1}{1 - \gamma \lambda}.$
Therefore, the total update is bounded based on Equation (11). Further, the update Equation (9) can be rewritten in the standard stochastic-approximation form required below.
Based on Assumptions 1 and 2, the new step-size satisfies Assumption (a) of Proposition 4.5 in Bertsekas and Tsitsiklis (1996). Lemma 4.1 states that the operator is a contraction, which satisfies Assumption (c) of Proposition 4.5. Based on Equation (7) and the bounded reward function, the variance of the noise term is bounded, so Assumption (b) of Proposition 4.5 is satisfied. The noise term can also be shown to satisfy Assumption (d) of Proposition 4.5, based on Proposition 5.2 in Bertsekas and Tsitsiklis (1996). Finally, we apply Proposition 4.5 of Bertsekas and Tsitsiklis (1996) to conclude that the sequence $(Q_k)$ converges to $Q^*$ with probability 1. ∎
4.2. Online Backward Version of TBQ(σ)
Since the online forward-view algorithm described in Algorithm 1 needs extra memory to store the trajectories, we also provide an online backward version of TBQ(σ). Based on the equivalence between the forward and backward views of eligibility traces Sutton and Barto (1998), the backward-view version of TBQ(σ) can be implemented as in Algorithm 2. The backward view provides a more concise form and is more efficient to execute.
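As a rough illustration of what a single backward-view update could look like under a greedy target policy, here is a tabular sketch whose trace-decay rule follows the interpolated coefficient λ[(1 − σ)π(a|s) + σ] from Section 3. All names, array shapes, and hyperparameters are our own assumptions; the paper's Algorithm 2 may differ in details.

```python
import numpy as np

def tbq_sigma_backward_step(Q, e, s, a, r, s_next, pi_prob_a,
                            alpha=0.1, gamma=0.99, lam=0.8, sigma=0.5):
    """One backward-view TBQ(sigma)-style step under a greedy target policy.

    pi_prob_a is the greedy target policy's probability of the action the
    behavior policy actually took: 1 if the action is greedy, else 0.
    """
    e[s, a] += 1.0
    # greedy target policy bootstraps on max_a' Q(s', a')
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q += alpha * delta * e
    # interpolate between cutting the trace (TB) and keeping it (Naive)
    e *= gamma * lam * ((1.0 - sigma) * pi_prob_a + sigma)
    return Q, e
```

With sigma = 0 an exploratory action (pi_prob_a = 0) zeroes the trace, as in Watkins' Q(λ); with sigma = 1 the trace decays by γλ regardless, as in Naive Q(λ).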
5. Experiments
In this section, we explore the λ–σ trade-off in the control case on several environments. We empirically find that, for given λ and ε, there exists some degree of trace utilization σ that improves the efficiency of trace utilization.
5.1. 19-State Random Walk
The 19-state random walk problem is a one-dimensional MDP environment which is widely used in RL Sutton and Barto (2011); De Asis et al. (2017). There are two terminal states at the two ends of the environment; a transition into the left terminal receives reward 0 and into the right terminal receives reward 1. At each state the agent has two actions: left and right. We here apply the online forward-view TBQ(σ), using an ε-greedy policy as behavior policy and a greedy policy as target policy. For each episode, the maximum number of steps is bounded by 100. We then measure the mean squared error (MSE) between the estimated Q-values and the analytically computed optimal Q-values after 10,000 episodes. We test 3 different ε values: 0.1, 0.5, 1; the corresponding distances between target and behavior policy are 0.05, 0.25, 0.5, respectively. For each ε, we test λ values from 0 to 1 with step 0.1, and for each λ we also try σ values from 0 to 1 with step 0.1. The learning step-size is tuned to 0.3. All results are averaged across 10 independent runs with fixed random seeds. We compare TBQ(σ) with TB(λ) and Naive Q(λ). For TBQ(σ), we also mark the best-performing σ, with the results shown in Figure 1.
Figure 1(a) shows that ε = 0.1 is too small for the agent to explore the whole environment: the agent can seldom reach the left terminal. In addition, since exploratory actions are rarely taken, the MSEs of TB(λ) for different λ values vary little. Naive Q(λ) never cuts traces and converges for small λ; for larger λ it diverges. The MSEs of TBQ(σ) vary little in the regime where Naive Q(λ) converges; where it diverges, we can still tune σ to reach a lower MSE. The best σ of TBQ(σ) decreases as λ increases. For ε = 0.5 (Figure 1(b)), we observe results similar to ε = 0.1: when Naive Q(λ) diverges, TBQ(σ) can still benefit from learning from the full returns by adjusting a suitable σ, and the MSE is reduced as well. When ε = 1, the behavior policy becomes completely random. The performance of TB(λ) and Naive Q(λ) is nearly the same for small λ; for larger λ we can again adjust a suitable σ to ensure the convergence of TBQ(σ).
In this experiment, we observe that for given λ and ε, we can adjust a suitable σ to learn from the full returns while avoiding cutting traces too often. In practice, when ε is close to 0, σ can be set to 1 to make full use of the traces; when ε is close to 1, σ can be set to a small number near 0 to improve the efficiency of trace utilization.
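The environment described above can be sketched in a few lines; the state and action encodings (states 0–18, middle start) are our own illustrative choices.

```python
# Minimal 19-state random walk: reward 0 at the left terminal, 1 at the right.
class RandomWalk19:
    LEFT, RIGHT = 0, 1

    def __init__(self):
        self.n_states = 19
        self.reset()

    def reset(self):
        self.state = self.n_states // 2  # start in the middle state
        return self.state

    def step(self, action):
        """Returns (state, reward, done)."""
        self.state += 1 if action == self.RIGHT else -1
        if self.state < 0:               # left terminal
            return self.state, 0.0, True
        if self.state >= self.n_states:  # right terminal
            return self.state, 1.0, True
        return self.state, 0.0, False
```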
5.2. 10×10 Maze Environment
The Maze environment is a two-dimensional navigation task (we use the gym-maze environment: https://github.com/MattChanTK/gym-maze). The agent's goal is to find the shortest path from the start to the goal. In each state, the agent has 4 actions: up, down, left, and right. If the path is blocked, the agent stays at its current location. The reward is 1 when the agent reaches the goal, while at any intermediate state the agent receives reward −0.0001. Each episode terminates when the agent reaches the goal or the step count exceeds 2,000. To ensure adequate exploration and to speed up training, we adopt an ε-greedy policy as behavior policy and linearly decay ε from 1 to 0.1 by 0.02. In this experiment, we use the online backward version of TBQ(σ). The learning rate is tuned to 0.05. We use 6 different σ values for TBQ(σ): {0, 0.2, 0.4, 0.6, 0.8, 1}, and measure the average total steps per episode. The results are averaged across 10 independent runs with fixed random seeds.
The result is illustrated in Figure 2. Since the shortest path of the maze is deterministic, TBQ(σ) gradually accelerates the learning process as σ varies from 0 to 0.8. However, Naive Q(λ) (σ = 1) diverges and cannot find the shortest path. The convergence of TBQ(σ) is fastest at σ = 0.8. The result shows that, in practice, we can accelerate the learning process by adjusting a suitable parameter σ in the TBQ(σ) algorithm.
5.3. TBQ(σ) with Function Approximator
We also evaluate the TBQ(σ) algorithm using neural networks as function approximators. With the help of deep Q-networks (DQN) Mnih et al. (2015), the version with a function approximator can be easily implemented. We adopt the online forward view for updating the parameters of the neural network. Unlike traditional DQN, we replay 4 consecutive sequences of samples of length 8 for each update. We evaluate TBQ(σ) on the CartPole problem Barto et al. (1983), and adopt OpenAI Gym as the evaluation platform (http://gym.openai.com) Brockman et al. (2016). In this setting, a pole is attached by an unactuated joint to a cart, which moves along a track. The agent's goal is to prevent the pole from falling over, using two actions that control the cart: move left or move right. Since the observation space is continuous, we adopt a two-layer neural network with 64 nodes per layer to approximate the Q-values of the state-action pairs. We use an ε-greedy policy as behavior policy and exponentially decay ε from 1 to 0.1 by a factor of 0.995 to ensure adequate exploration. In addition, the target network parameters θ⁻ are updated by soft replacement Lillicrap et al. (2016) from the evaluation network parameters θ: θ⁻ ← τθ + (1 − τ)θ⁻.

Parameter  Value
Discount factor γ  0.99
Initial exploration ε  1
Final exploration ε  0.1
Optimizer  Adam Kingma and Ba (2014)
Initial learning rate  0.001
Replay memory size  20000
Replay start episode  100
λ  1
τ  0.001
In this setting, at the beginning of the learning process the distance between target and behavior policy is at its maximum; when ε decays to 0.1, the two policies become close. Therefore, to ensure convergence, we adopt a dynamic σ, linearly increased from 0.1 to 1 with step 0.01. Other main learning parameters are listed in Table 1. The results are averaged across 5 independent runs with fixed random seeds and shown in Figure 3; we smooth them with a right-centred moving average over 50 successive episodes. With a suitable dynamic σ, TBQ(σ) outperforms TB(λ) and Naive Q(λ) on the CartPole problem. The result indicates that, in practice, we can improve learning by adjusting a suitable parameter σ using the TBQ(σ) algorithm.
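The soft target-network replacement used above, θ⁻ ← τθ + (1 − τ)θ⁻, can be sketched as follows (plain floats stand in for network parameters; the dict layout is our own illustrative choice, with τ = 0.001 as in Table 1):

```python
def soft_update(target_params, eval_params, tau=0.001):
    """Polyak-style soft replacement: target <- tau*eval + (1-tau)*target."""
    return {k: tau * eval_params[k] + (1.0 - tau) * target_params[k]
            for k in target_params}

# Each call nudges the target parameters a small step toward the evaluation ones.
target = {'w': 0.0}
target = soft_update(target, {'w': 1.0})
```

Compared to periodically copying the whole network, this keeps the bootstrap targets changing smoothly, which tends to stabilize training.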
6. Discussion and Conclusion
In this paper, we propose a new off-policy learning method called TBQ(σ) to control the degree of utilizing off-policy traces. TBQ(σ) unifies TB(λ) and Naive Q(λ). Theoretical analysis shows the contraction property of TBQ(σ) in both policy evaluation and control, and its convergence is proved in the control setting. We also provide two versions of the TBQ(σ) control algorithm: an online forward version and an online backward version.
Compared to TB(λ), the proposed algorithm improves the efficiency of trace utilization when the target policy is greedy. Compared to Naive Q(λ), our algorithm has a relatively loose convergence requirement. Since the trace coefficient in our algorithm is less than 1, the variance of our algorithm is bounded Munos et al. (2016). Although we are not yet able to give a further theoretical analysis of how the bootstrapping parameter λ and the trace-cutting degree σ jointly affect convergence, we empirically show that existing off-policy learning algorithms with eligibility traces can be improved and accelerated by adjusting a suitable trace-cutting degree parameter σ. The theoretical relationship between λ and σ remains for future work.
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is partly supported by National Key Research and Development Plan under Grant No. 2016YFB1001203, Zhejiang Provincial Natural Science Foundation of China (LR15F020001).
References
 Barto et al. (1983) Andrew G Barto, Richard S Sutton, and Charles W Anderson. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics 5 (1983), 834–846.
 Bertsekas and Tsitsiklis (1996) Dimitri P. Bertsekas and John N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. Openai gym. arXiv preprint arXiv:1606.01540 (2016).

 De Asis et al. (2017) Kristopher De Asis, J Fernando Hernandez-Garcia, G Zacharias Holland, and Richard S Sutton. 2017. Multi-step reinforcement learning: A unifying algorithm. AAAI Conference on Artificial Intelligence (2017).
 Precup et al. (2000) Doina Precup, Richard S. Sutton, and Satinder Singh. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, 759–766.
 Harutyunyan et al. (2016) Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. 2016. Q(λ) with Off-Policy Corrections. In International Conference on Algorithmic Learning Theory. Springer, 305–320.
 Hasselt (2011) H Hasselt. 2011. Insights in reinforcement learning: formal analysis and empirical evaluation of temporaldifference learning algorithms. Ph.D. Dissertation. Universiteit Utrecht.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Leng et al. (2009) Jinsong Leng, Colin Fyfe, and Lakhmi C Jain. 2009. Experimental analysis on Sarsa(λ) and Q(λ) with different eligibility traces strategies. Journal of Intelligent & Fuzzy Systems 20, 1-2 (2009), 73–82.
 Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR) (2016).
 Meng et al. (2018) Wenjia Meng, Qian Zheng, Long Yang, Pengfei Li, and Gang Pan. 2018. Qualitative Measurements of Policy Discrepancy for Returnbased Deep QNetwork. arXiv preprint arXiv:1806.06953 (2018).
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
 Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems. 1054–1062.
 Peng and Williams (1994) Jing Peng and Ronald J Williams. 1994. Incremental multi-step Q-learning. In Machine Learning Proceedings 1994. Elsevier, 226–232.
 Precup (2000) Doina Precup. 2000. Temporal abstraction in reinforcement learning. Ph.D. Dissertation. University of Massachusetts Amherst.
 Singh and Dayan (1998) Satinder Singh and Peter Dayan. 1998. Analytical mean squared error curves for temporal difference learning. Machine Learning 32, 1 (1998), 5–40.
 Sutton et al. (2014) Rich Sutton, Ashique Rupam Mahmood, Doina Precup, and Hado van Hasselt. 2014. A new Q(λ) with interim forward view and Monte Carlo equivalence. In International Conference on Machine Learning. 568–576.
 Sutton (1988) Richard S Sutton. 1988. Learning to predict by the methods of temporal differences. Machine learning 3, 1 (1988), 9–44.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. MIT press.
 Sutton and Barto (2011) Richard S Sutton and Andrew G Barto. 2011. Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
 Van Seijen et al. (2009) Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. 2009. A theoretical and empirical analysis of Expected Sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL’09. IEEE Symposium on. IEEE, 177–184.
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
 Watkins (1989) Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. King’s College, Cambridge.
 Yang et al. (2018) Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, and Gang Pan. 2018. A Unified Approach for Multistep TemporalDifference Learning with Eligibility Traces in Reinforcement Learning. International Joint Conference on Artificial Intelligence (IJCAI) (2018).