TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

05/17/2019 ∙ by Longxiang Shi, et al. ∙ University of Technology Sydney ∙ Zhejiang University

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies in a probabilistic way, as in importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement utilize traces inefficiently under a greedy target policy, which makes them ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ(σ), which unifies the tree-backup algorithm and Naive Q(λ). By introducing a new parameter σ to express the degree of utilizing traces, TBQ(σ) creates an effective integration of TB(λ) and Naive Q(λ) and a continuous role shift between them. The contraction property of TBQ(σ) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ(σ) and give the convergence proof. We empirically show that, for ϵ∈(0,1] in ϵ-greedy policies, there exists some degree of utilizing traces for λ∈[0,1] that improves the efficiency of trace utilization for off-policy reinforcement learning, both accelerating the learning process and improving the performance.




1. Introduction

As a basic mechanism in reinforcement learning (RL), eligibility traces Sutton (1988) unify and generalize temporal-difference (TD) and Monte Carlo methods Sutton and Barto (2011). As a temporary record of an event (e.g., taking an action or visiting a state) in RL, eligibility traces mark the memory parameters associated with the event as eligible for undergoing changes Sutton and Barto (1998). The eligible traces are then used to assign credit to the current TD-error which leads the learning of policies. With traces, credit is passed through multiple preceding states and therefore learning is often significantly faster Singh and Dayan (1998).

With on-policy TD learning with traces (e.g., TD(λ), Sarsa(λ)), the assignment of credit to previous states decays exponentially according to the parameter λ. If λ = 0, the traces are set to zero immediately and the on-policy TD learning algorithm with traces is equal to one-step TD learning. If λ = 1, the traces fade away slowly and no bootstrapping is made, thus producing the Monte Carlo algorithm with online update Sutton et al. (2014). Moreover, an intermediate value of λ can make the learning algorithm perform better than the method at either extreme.
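The role of λ in credit assignment can be illustrated with a minimal Python sketch. It assumes an accumulating trace that is multiplied by γλ at every step for a state visited once; the function name `eligibility_after` and the default γ = 1 are illustrative choices, not taken from the paper.

```python
def eligibility_after(steps, lam, gamma=1.0):
    """Trace left on a state visited once, `steps` time steps ago."""
    # An accumulating trace is multiplied by gamma * lam at every step.
    return (gamma * lam) ** steps


# lam = 0: only the most recent state keeps any credit (one-step TD).
one_step = [eligibility_after(k, lam=0.0) for k in range(4)]
# lam = 1 with gamma = 1: every visited state keeps full credit (Monte Carlo).
monte_carlo = [eligibility_after(k, lam=1.0) for k in range(4)]
# Intermediate lam: credit fades exponentially with the age of the visit.
intermediate = [eligibility_after(k, lam=0.5) for k in range(4)]
```

With these settings the three lists show the two extremes and the exponential fade in between.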

In the off-policy case, when samples generated from a behavior policy are used to learn a different target policy, the usual approach is to measure the difference between the two policies in a probabilistic way. For example, Per-Decision Importance Sampling Doina Precup (2000) weights returns based on the mismatch between the target and behavior probabilities of the related actions. Alternatively, the Tree-backup (TB) algorithm Doina Precup (2000) combines the value estimates for the actions along the traces according to their probabilities under the target policy. More recently, Retrace(λ) Munos et al. (2016) combines Naive Q(λ) with importance sampling, and offers a safe (whatever the behavior policy is) and efficient (can learn from full returns) way for off-policy reinforcement learning. However, existing off-policy learning methods based on state-action probability are inefficient when utilizing the traces, especially when the target policy is deterministic, which is common in control problems. If the target policy is deterministic, the probability of an action under the target policy is zero when an exploratory action is taken. In this setting, importance sampling suffers from large variance, since the importance ratio may be greater than 1, and is rarely used in practice. Retrace(λ) and TB(λ) are identical to Watkins’ Q(λ) Watkins (1989), and the traces are cut when an exploratory action is taken. This may lose the advantage of eligibility traces and slow down the learning process Sutton and Barto (1998). Peng’s Q(λ) Peng and Williams (1994) tried to solve this problem, but fails to converge to the optimal value.

On the other hand, some existing methods do not depend on target policy probabilities and can learn from full returns without cutting traces under the greedy target policy. Unfortunately, some of them may face limitations in convergence. For instance, Naive Q(λ) Sutton and Barto (1998) never cuts traces and thus provides a way to use full returns when performing off-policy RL with eligibility traces, which can sometimes achieve better performance than Watkins’ Q(λ) Leng et al. (2009). A more recent work by Harutyunyan et al. (2016) shows that Naive Q(λ) for control can converge to the optimal value under some conditions. An open question is what happens in the intermediate regime between methods that are based on target policy probabilities and methods that are not.

To address the above question, in this paper we propose the TBQ(σ) algorithm, which unifies TB(λ) (cutting traces immediately) and Naive Q(λ) (never cutting traces). By introducing a new parameter σ to express the degree of utilizing traces, TBQ(σ) creates a continuous integration and role shift between TB(λ) and Naive Q(λ). If σ = 1, TBQ(σ) reduces to Naive Q(λ), which never cuts traces; if σ = 0, TBQ(σ) reduces to Watkins’ Q(λ). We then theoretically analyze the contraction property of TBQ(σ) for both policy evaluation and control settings. We also derive the online version of TBQ(σ) and give the convergence proof. Compared to TB(λ), TBQ(σ) is efficient in trace utilization with the greedy target policy. Compared to Naive Q(λ), TBQ(σ) can achieve convergence by adjusting σ suitably. We empirically show that, for ε ∈ (0, 1] in ε-greedy policies, there exists some degree of utilizing traces for λ ∈ [0, 1] that improves the efficiency of trace utilization, thereby accelerating the learning process and improving the performance as well.

2. Preliminaries and Problem Settings

Here, we introduce some basic concepts, our target problems, notations, and related work.

2.1. Preliminaries and Problem Settings

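A reinforcement learning problem can be formulated as a Markov Decision Process (MDP) (X, A, γ, P, r), where X is a finite state space, A is the action space, γ is the discount factor, P maps each state-action pair (x, a) ∈ X × A to a distribution over X, and r is the reward function. A policy π assigns to each state a probability distribution over the set A.

The state-action value Q is a mapping on X × A to ℝ, which indicates the expected discounted future reward when taking action a at state x under policy π:

$$Q^{\pi}(x, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_{t} \,\middle|\, x_{0} = x,\ a_{0} = a\right],$$

where T is the time of termination. For each policy π, we define the operator P^π Harutyunyan et al. (2016):

$$(P^{\pi} Q)(x, a) = \sum_{x' \in X} \sum_{a' \in A} P(x' \mid x, a)\, \pi(a' \mid x')\, Q(x', a').$$

For an arbitrary policy π we use Q^π to describe the unique Q-function corresponding to π:

$$Q^{\pi} = \sum_{t \ge 0} \gamma^{t} (P^{\pi})^{t} r.$$

The Bellman operator T^π for a policy π is defined as:

$$T^{\pi} Q = r + \gamma P^{\pi} Q.$$

Obviously, T^π has a unique fixed point Q^π:

$$T^{\pi} Q^{\pi} = Q^{\pi} = (I - \gamma P^{\pi})^{-1} r.$$

The Bellman optimality operator T introduces a maximization over a set of policies and is defined as:

$$T Q = r + \gamma \max_{\pi} P^{\pi} Q.$$

Its unique fixed point is Q*.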
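As a small illustration of the fixed-point property, the Bellman optimality operator can be applied repeatedly on a toy problem. The sketch below is plain Python; the two-state MDP (`P`, `R`) and the function name `bellman_optimality` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def bellman_optimality(Q, P, R, gamma):
    """Apply T once: (TQ)(x, a) = R[x][a] + gamma * sum_y P[x][a][y] * max_b Q[y][b]."""
    return [
        [
            R[x][a] + gamma * sum(p * max(Q[y]) for y, p in enumerate(P[x][a]))
            for a in range(len(R[x]))
        ]
        for x in range(len(R))
    ]


# Hypothetical two-state MDP: state 1 is absorbing with zero reward.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: both actions stay
]
R = [[0.0, 1.0], [0.0, 0.0]]   # reward 1 only for leaving state 0
gamma = 0.9

Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):            # repeated application approaches the fixed point
    Q = bellman_optimality(Q, P, R, gamma)
```

In this toy case the iterates stop changing after two applications, with the value of staying in state 0 equal to γ times the value of leaving it.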

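The Bellman equation can also be extended using the exponentially weighted sum of n-step returns Sutton (1988):

$$T^{\pi}_{\lambda} Q = (1 - \lambda) \sum_{n \ge 0} \lambda^{n} (T^{\pi})^{n+1} Q = Q + (I - \lambda \gamma P^{\pi})^{-1} (T^{\pi} Q - Q).$$

In this λ-return version of the Bellman equation, the fixed point of T^π_λ is also Q^π. By varying the parameter λ from 0 to 1, T^π_λ provides a continuous connection and role shift between one-step TD learning and Monte Carlo methods.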
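The exponential weighting of n-step returns can be made concrete with a short Python sketch. It assumes a finite episode whose n-step returns are already available as a list; the function name `lambda_return` and the example numbers are illustrative only.

```python
def lambda_return(n_step_returns, lam):
    """Mix the n-step returns G_1..G_N of one finite episode into a lambda-return."""
    n = len(n_step_returns)
    # Weight (1 - lam) * lam**k on the (k+1)-step return, for all but the last one.
    weights = [(1.0 - lam) * lam ** k for k in range(n - 1)]
    # The remaining mass lam**(N-1) goes to the full (Monte Carlo) return.
    weights.append(lam ** (n - 1))
    return sum(w * g for w, g in zip(weights, n_step_returns))


returns = [1.0, 2.0, 4.0]                   # hypothetical 1-, 2- and 3-step returns
td_target = lambda_return(returns, 0.0)     # picks the one-step return
mc_target = lambda_return(returns, 1.0)     # picks the full return
mixed_target = lambda_return(returns, 0.5)  # weights 0.5, 0.25, 0.25
```

The weights always sum to 1, so λ only shifts mass between short and long returns.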

In this paper, we consider two types of RL problems, and mainly focus on the action-value case under the off-policy setting. That is, in a policy evaluation problem, we wish to estimate Q^π of a fixed policy π from samples drawn from a different behavior policy μ; in a control problem, we seek to approximate Q* based on the iteration of Q-values. We especially focus on the learning scenario in which the target policy is greedy, which is common in the control setting. Our main challenge is to improve the efficiency of trace utilization while ensuring convergence during the off-policy learning process.

2.2. Related Work

Based on the usage of target policy probability when calculating the λ-return, existing works can be divided into two categories:

2.2.1. Target policy probability-based methods.

The n-step methods face challenges in the off-policy setting, which has motivated many methods to address them. The most common approach is to measure the two policies in a probabilistic sense Meng et al. (2018). Based on the work in Munos et al. (2016), several off-policy return-based methods based on target policy probability, namely importance sampling (IS), tree-backup and Retrace(λ), can be expressed with a unified operator as follows:

$$\mathcal{R} Q(x, a) = Q(x, a) + \mathbb{E}_{\mu}\left[\sum_{t \ge 0} \gamma^{t} \left(\prod_{s=1}^{t} c_{s}\right) \left(r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t})\right)\right] \tag{6}$$

Importance sampling: c_s = π(a_s|x_s)/μ(a_s|x_s). The IS methods correct the difference between target policy and behavior policy by the ratio of their probabilities Sutton and Barto (2011). For example, Per-Decision Importance Sampling (PDIS) Doina Precup (2000) incorporates eligibility traces with importance sampling. Since the estimate contains a cumulative product of importance ratios (π(a_s|x_s)/μ(a_s|x_s)), which may exceed 1, IS methods suffer from large variance and are seldom used in practice. In addition, weighted importance sampling Precup (2000) can reduce the variance of IS, but leads to a biased estimate.

Tree-backup: c_s = λπ(a_s|x_s). The TB(λ) algorithm Doina Precup (2000) provides an alternative way for off-policy learning without IS. In control problems, if the target policy is greedy, then TB(λ) produces Watkins’ Q(λ) Watkins and Dayan (1992). In this case, TB(λ) is not efficient as it cuts traces when an exploratory action is encountered and is not able to learn from the full returns.

Retrace(λ): c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)), was proposed in Munos et al. (2016). Compared to IS methods, this method truncates the importance ratio at 1 to reduce the variance of IS. It is proved to converge under any behavior policy and can learn from full returns when the behavior and target policies are near. However, in the control case when the target policy is greedy, Retrace(λ) is identical to TB(λ) and is not efficient in utilizing traces.
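The three probability-based trace coefficients can be compared side by side in a few lines of Python. This is a sketch of the standard per-step coefficients from Munos et al. (2016); the function name `trace_coefficient` and the example probabilities (a greedy target policy with an ε-greedy behavior policy over two actions, ε = 0.5) are illustrative assumptions.

```python
def trace_coefficient(method, pi_prob, mu_prob, lam=1.0):
    """Per-step coefficient c_s, given pi(a_s|x_s) and mu(a_s|x_s)."""
    if method == "is":        # importance sampling
        return pi_prob / mu_prob
    if method == "tb":        # tree-backup
        return lam * pi_prob
    if method == "retrace":   # importance ratio truncated at 1
        return lam * min(1.0, pi_prob / mu_prob)
    raise ValueError(f"unknown method: {method}")


methods = ("is", "tb", "retrace")
# Exploratory action: the greedy target policy gives it probability 0.
exploratory = {m: trace_coefficient(m, 0.0, 0.25, lam=0.9) for m in methods}
# Greedy action: the target policy gives it probability 1.
greedy = {m: trace_coefficient(m, 1.0, 0.75, lam=0.9) for m in methods}
```

For the exploratory action all three coefficients are zero, which is the trace cutting discussed above; for the greedy action tree-backup and Retrace(λ) coincide.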

2.2.2. Non-target policy probability-based methods.

In addition, there are also some methods that do not depend on target policy probability and can make full use of the traces:

General Q(λ): General Q(λ) Van Seijen et al. (2009) Hasselt (2011) generalizes the on-policy Sarsa(λ) using the following update equation:

In the control case, when the target policy is greedy, General Q(λ) is identical to Peng’s Q(λ) Peng and Williams (1994). It does not cut traces as often as Watkins’ Q(λ). However, when learning is off-policy, General Q(λ) leads to a biased estimate and does not converge to Q^π.

Q(λ) with off-policy corrections Harutyunyan et al. (2016): it is an off-policy correction method based on a Q-baseline. Their proposed operator is the same as (6) with c_s = λ. Their algorithms are named Q^π(λ) and Q*(λ) for policy evaluation and control, respectively. If the distance ε between the target policy and the behavior policy is small, i.e., λ < (1−γ)/(γε), Q^π(λ) converges to its fixed point Q^π. In control scenarios, Q*(λ) is equal to Naive Q(λ) Sutton and Barto (1998) and is guaranteed to converge to Q* under λ < (1−γ)/(2γ). Besides, they also empirically show that there exists some trade-off between λ and ε beyond the convergence guarantee, which can make the learning faster and better. In addition, Q(σ, λ) is proposed in Yang et al. (2018) to combine Sarsa(λ) and Q^π(λ), and inherits similar properties to Q^π(λ).

In conclusion, existing off-policy learning methods based on target policy probability are inefficient when utilizing eligibility traces, especially when the target policy is greedy. In this scenario, the traces are cut immediately when an exploratory action is encountered, which may lose the advantage of eligibility traces and slow down the learning process. In addition, existing non-target policy probability-based methods can make full use of the traces, but may face limitations in convergence. In this paper, we try to solve this dilemma by creating a hybridization of those two kinds of methods.

3. TBQ(σ): Degree of Traces Utilization

In the RL literature, unifying different algorithmic ideas to balance the pros and cons of each and to produce better algorithms has been a pragmatic approach De Asis et al. (2017). This also applies to several policy learning methods, e.g., TD(λ) to unify TD-learning and Monte Carlo methods, Q(σ) De Asis et al. (2017) to fuse multi-step tree-backup and Sarsa, and Q(σ, λ) Yang et al. (2018) to integrate Q^π(λ) and Sarsa(λ). Such hybridization is useful for balancing the capabilities of the different trace-cutting methods discussed above. Accordingly, in this paper, we introduce a new parameter σ into trace-cutting to control the degree of utilizing traces. The proposed method, TBQ(σ), unifies TB(λ) (cutting traces immediately) and Naive Q(λ) (never cutting traces).

We first give the definition of the operator used in the update equation of TBQ(σ):

Definition 3.1.

The proposed operator is a map on to ,



TBQ(σ) linearly combines TB(λ) and Naive Q(λ) by using the degree parameter σ. When σ = 0, TBQ(σ) reduces to TB(λ), and when σ = 1, TBQ(σ) reduces to Naive Q(λ). By adjusting the parameter σ from 0 to 1, we can produce a continuous integration and role shift between cutting the traces immediately and never cutting traces. We then analyze the contraction property of the operator in policy evaluation. We here use ‖·‖ to represent the supremum norm.
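One way to picture the linear combination is at the level of the per-step trace coefficient. The Python sketch below assumes the coefficient takes the form λ((1 − σ)π(a|x) + σ), i.e. a convex mix of the TB(λ) coefficient λπ(a|x) and the Naive Q(λ) coefficient λ. This form is an assumption made for illustration, since the defining equation is not reproduced in this text, and `tbq_coefficient` is a hypothetical name.

```python
def tbq_coefficient(pi_prob, lam, sigma):
    """Assumed TBQ(sigma) trace coefficient for one step."""
    # Convex mix of the TB(lambda) coefficient lam * pi
    # and the Naive Q(lambda) coefficient lam (assumed form, see lead-in).
    return lam * ((1.0 - sigma) * pi_prob + sigma)


lam = 0.8
tb_like = tbq_coefficient(0.3, lam, sigma=0.0)      # equals lam * pi
naive_like = tbq_coefficient(0.3, lam, sigma=1.0)   # equals lam
# Greedy target policy, exploratory action (pi = 0): trace scaled, not cut.
partial_cut = tbq_coefficient(0.0, lam, sigma=0.5)  # equals lam * sigma
```

Under this reading, a greedy target policy leaves the trace untouched by σ on greedy actions and scales it by σ on exploratory ones, so σ = 0 cuts the trace and σ = 1 never does.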

Theorem 3.2. The proposed operator has a unique fixed point Q^π. If the behavior policy and target policy are near, i.e.,

, then .


Unfolding the operator:

Taking the supremum norm:

Per Lemma 1 in Harutyunyan et al. (2016) we have:

where is the distance between and :

Per Theorem 1 in Munos et al. (2016) we have:

Adding the above two items we have:

where .

Further, for , , we have

Theorem 3.2 indicates that, for any σ, if the two policies are sufficiently close, then repeated application of the operator converges to Q^π. Compared to Q^π(λ) Harutyunyan et al. (2016), our algorithm derives a wider convergence range. We provide a hybridization of utilizing traces based on TB(λ) and Naive Q(λ). In practice, the convergence condition can be satisfied by adjusting the parameter σ under different situations.

4. TBQ(σ) for Control

In control problems, we want to estimate Q* by iteratively applying policy evaluation and policy improvement processes, which is referred to as generalized policy iteration (GPI) Sutton and Barto (1998). Denoting Q_k as the Q-value and π_k as the corresponding target policy in the iteration process under an arbitrary behavior policy μ_k at step k, Q_{k+1} can be retrieved by our operator using the following steps:

  • Policy evaluation step:

  • Policy improvement step:

We here use the notion to represent , which is greedy with respect to . Based on GPI, the TBQ() algorithm for control problems is depicted in Algorithm 1 with an online forward view, i.e., TBQ(). Note that is the indicator function.

To analyze the convergence of Algorithm 1, we first consider the off-line version of the TBQ(σ) algorithm. The following lemma states that, if λ satisfies some condition with regard to σ, then the off-line version of TBQ(σ) is guaranteed to converge.


Lemma 4.1. Considering the sequence Q_k generated by the operator under a greedy target policy π_k and an arbitrary behavior policy μ_k, we have:

where .

Specifically, if , then the sequence Q_k converges to Q* exponentially fast.


Unfolding the operator:

based on Harutyunyan et al. (2016) and Munos et al. (2016), we have:

As a consequence, we deduce the result:

Lemma 4.1 states that, for any σ, if the condition of the lemma holds, then the off-line control algorithm is guaranteed to converge. However, similar to Q*(λ) Harutyunyan et al. (2016), in practice there exist some trade-offs between λ and ε under different σ values, which go beyond the convergence guarantee. By introducing a new parameter σ, we can relax the λ-ε relationship by choosing a suitable σ. The traces can also be utilized when an exploratory action is taken. In addition, compared to Naive Q(λ), we derive a wider convergence range by tuning σ. Although we have not given a detailed theoretical analysis of the λ-ε relationship under different σ, in the experiment part we will show that for any λ ∈ [0, 1] and ε ∈ (0, 1] in ε-greedy policies, there exists some degree of utilizing traces σ that can accelerate the learning process and yield better performance by utilizing the full returns.

  Input: discounting factor γ, degree of utilizing traces σ, bootstrapping parameter λ, and stepsize α
  Initialization: arbitrary
  for Episode from to  do
     Sample a trajectory from
     for Sample from to  do
     end for
  end for
Algorithm 1 TBQ(σ): The online forward view version of the TBQ(σ) algorithm

4.1. Convergence Analysis of TBQ(σ) Algorithm

We now consider the convergence proof of TBQ(σ) described in Algorithm 1. First, we make some assumptions similar to Harutyunyan et al. (2016) and Munos et al. (2016).

Assumption 1.

For bounded stepsize α_k: Σ_k α_k = ∞ and Σ_k α_k² < ∞.


Assumption 2.

Minimum visit frequency: all (x, a) pairs are visited infinitely often.

Assumption 3.

Finite sample trajectories: , denotes the length of sample trajectories.

Under those assumptions, Algorithm 1 can converge to with probability 1 as stated below: Considering the sequence of Q-functions generated from Algorithm 1, where is the greedy policy with respect to , if , then under Assumptions 1-3, with probability 1.


For reading convenience, we first define some notations: Let denote the th iteration, denote the length of the trajectory, denote the th sample of current trajectory, then the accumulating trace Sutton and Barto (1998) can be written as:


We use to emphasize the online setting, then Equation (7) can be written as:


Since , based on Assumption 3, we have:


Therefore, the total update is bounded based on Equation (11). Further, we can rewrite the update Equation (9) as:

Based on Assumptions 1 and 2, the new stepsize satisfies Assumption (a) of Proposition 4.5 in Bertsekas and Tsitsiklis (1996). Lemma 4.1 states that the operator is a contraction, which satisfies Assumption (c) of Proposition 4.5 in Bertsekas and Tsitsiklis (1996). Based on Equation (7) and the bounded reward function, the variance of the noise term is bounded, thus Assumption (b) of Proposition 4.5 in Bertsekas and Tsitsiklis (1996) is satisfied. The noise term can also be shown to satisfy Assumption (d) of Proposition 4.5 in Bertsekas and Tsitsiklis (1996), based on Proposition 5.2 in Bertsekas and Tsitsiklis (1996). Finally, we are able to apply Proposition 4.5 in Bertsekas and Tsitsiklis (1996) to conclude that the sequence Q_k converges to Q* with probability 1. ∎

4.2. Online Backward Version of TBQ(σ)

Since the online forward view algorithm described in Algorithm 1 needs extra memory to store the trajectories, we also provide an online backward version of TBQ(σ). Based on the equivalence between the forward view and the backward view of eligibility traces Sutton and Barto (1998), the online backward view version of TBQ(σ) can be implemented as in Algorithm 2. The backward view version provides a more concise form and is more efficient to execute.

  Input: discounting factor γ, degree of cutting traces σ, bootstrapping parameter λ and stepsize α
  Initialization: arbitrary
  for  from to  do
        Take action a, observe state x′ and receive reward r
        Choose a′ from x′ using ε-greedy policy based on Q
        for all  do
           if  then
           end if
        end for
     until  is terminal
  end for
Algorithm 2 TBQ(σ): On-line backward version of the TBQ(σ) algorithm
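Because the body of the backward-view algorithm is only partly legible here, the following Python sketch shows one plausible tabular update step in the style of Watkins' Q(λ) from Sutton and Barto (1998). It is an assumption, not the authors' listing: after a greedy next action the trace decays by γλ, and after an exploratory one by γλσ instead of being reset to zero. The function name `tbq_backward_step` and all argument names are hypothetical.

```python
from collections import defaultdict


def tbq_backward_step(Q, E, x, a, r, x2, a2, actions, done,
                      alpha=0.3, gamma=1.0, lam=0.5, sigma=0.5):
    """One backward-view update; Q and E are dicts keyed by (state, action)."""
    # Greedy (target-policy) value of the next state; zero at termination.
    best_next = 0.0 if done else max(Q[(x2, b)] for b in actions)
    delta = r + gamma * best_next - Q[(x, a)]
    E[(x, a)] += 1.0  # accumulating trace
    # Is the action chosen for the next step greedy w.r.t. the current Q?
    greedy_next = (not done) and Q[(x2, a2)] == best_next
    # Assumed coefficient: lam after a greedy action, lam * sigma otherwise.
    c = lam if greedy_next else lam * sigma
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * c
    return delta


actions = ["L", "R"]
Q, E = defaultdict(float), defaultdict(float)
# Two-step toy episode: move right from state 0, then reach a rewarding terminal.
tbq_backward_step(Q, E, 0, "R", 0.0, 1, "L", actions, done=False)
tbq_backward_step(Q, E, 1, "L", 1.0, None, None, actions, done=True)
```

In the toy episode the reward received at the second step also updates the first state-action pair through its decayed trace. With `sigma=0.0` the same function cuts the trace after an exploratory action, and with `sigma=1.0` it never cuts.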

5. Experiments

In this section, we explore the trade-off in the control case in several environments. We empirically find that, for ε ∈ (0, 1] and λ ∈ [0, 1], there exists some degree of utilizing traces σ that can improve the efficiency of trace utilization.

5.1. 19-State Random Walk

The 19-state random walk problem is a one-dimensional MDP environment which is widely used in RL Sutton and Barto (2011) De Asis et al. (2017). There are two terminal states at the two ends of the environment; a transition to the left terminal receives a reward of 0 and a transition to the right terminal receives a reward of 1. The agent at each state has two actions: left and right. We here apply the online forward version of TBQ(σ), using an ε-greedy policy as the behavior policy and a greedy policy as the target policy. For each episode, the maximum number of steps is bounded at 100. We then measure the mean-squared error (MSE) between the estimated Q-values and the analytically computed optimal Q-values after 10,000 episodes of offline running. We test 3 different ε values: 0.1, 0.5, 1. The corresponding distance between target policy and behavior policy is 0.05, 0.25, 0.5, respectively. For each ε, we test different λ values from 0 to 1 with stepsize 0.1. For each λ, we also try different σ values from 0 to 1 with stepsize 0.1. The learning stepsize is tuned to 0.3. All results are averaged across 10 independent runs with fixed random seeds. We compare TBQ(σ) with TB(λ) and Naive Q(λ). For TBQ(σ), we also mark out the best-performing σ, with the results shown in Figure 1.
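The quoted distances can be reproduced with a short Python sketch, under the assumption that the distance is measured as the largest per-action probability gap between a greedy target policy and an ε-greedy behavior policy over the two actions. The text here does not spell out the distance measure, so this is one reading that matches the numbers; the function names are illustrative.

```python
def epsilon_greedy_probs(n_actions, greedy_index, epsilon):
    """Action probabilities of an epsilon-greedy policy."""
    probs = [epsilon / n_actions] * n_actions
    probs[greedy_index] += 1.0 - epsilon
    return probs


def max_probability_gap(epsilon, n_actions=2):
    """Largest per-action gap between greedy target and epsilon-greedy behavior."""
    target = [1.0 if i == 0 else 0.0 for i in range(n_actions)]
    behavior = epsilon_greedy_probs(n_actions, 0, epsilon)
    return max(abs(p - q) for p, q in zip(target, behavior))


distances = [max_probability_gap(eps) for eps in (0.1, 0.5, 1.0)]
```

With two actions the gap is ε/2, which gives the three distances quoted above.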

Figure 1. λ-σ relationship under different ε values

Figure 1(a) shows that ε = 0.1 is too small for the agent to explore the whole environment. The agent can seldom reach the left terminal. In addition, since exploratory actions are rarely taken, the MSEs of TB(λ) vary little across λ values. Naive Q(λ) never cuts traces and enjoys convergence when . When , Naive Q(λ) diverges. The MSEs of TBQ(σ) vary little when Naive Q(λ) converges. When , we can still tune σ to reach a lower MSE. The best σ of TBQ(σ) decreases as λ increases. When ε = 0.5 (Figure 1(b)), we observe results similar to those for ε = 0.1. When Naive Q(λ) diverges, TBQ(σ) can still benefit from learning from the full returns by adjusting a suitable σ, and the MSE can be reduced as well. When ε = 1, the behavior policy becomes completely random. The performance of TB(λ) and Naive Q(λ) is nearly the same when . When we can also adjust a suitable σ to ensure the convergence of TBQ(σ).

In this experiment, we observe that when ε ∈ (0, 1] and λ ∈ [0, 1], we can choose a suitable σ in order to learn from the full returns while avoiding cutting traces too often. In practice, when ε is close to 0, σ can be set to 1 to make full use of the traces. When ε is close to 1, σ can be set to a small number near 0 to improve the efficiency of trace utilization.

5.2. 10×10 Maze Environment

The Maze environment is a 2-dimensional navigation task (we use this version of the gym-maze environment: https://github.com/MattChanTK/gym-maze). The agent’s goal is to find the shortest path from the start to the goal. For each state, the agent has 4 actions: go up, go down, turn left or turn right. If the path is blocked, the agent will stay at the current location. The reward is 1 when the agent reaches the goal, while at any intermediate state the agent gets a reward of -0.0001. Each episode is terminated if the agent reaches the goal or the step count exceeds 2,000. To ensure adequate exploration and also speed up the training process, we adopt an ε-greedy policy as the behavior policy and linearly decay the parameter ε from 1 to 0.1 by 0.02. In this experiment, we use the on-line backward version of TBQ(σ). The learning rate is tuned to 0.05. We use 6 different σ values for TBQ(σ): {0, 0.2, 0.4, 0.6, 0.8, 1}, and measure the average total steps of each episode. In addition, the results are averaged across 10 independent runs with fixed random seeds.
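The linear exploration schedule can be written as a one-line function. The sketch below assumes the decrement of 0.02 is applied once per episode, which the text does not state explicitly; `linear_epsilon` is an illustrative name.

```python
def linear_epsilon(episode, start=1.0, end=0.1, delta=0.02):
    """Epsilon after `episode` decrements of size delta, floored at `end`."""
    # Assumption: one decrement per episode.
    return max(end, start - delta * episode)


schedule = [linear_epsilon(k) for k in (0, 10, 45, 1000)]
```

Under this assumption ε reaches its floor of 0.1 after 45 decrements and stays there.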

Figure 2. Averaged total steps of the Maze environment. TBQ(σ) gradually accelerates the learning process when σ varies from 0 to 0.8. However, Naive Q(λ) diverges and cannot find the shortest path.

The result is illustrated in Figure 2. Since the shortest path of the maze is deterministic, TBQ(σ) gradually accelerates the learning process when σ varies from 0 to 0.8. However, Naive Q(λ) diverges and cannot find the shortest path. The convergence speed of TBQ(σ) is fastest at σ = 0.8. The result shows that, in practice, we can accelerate the learning process by adjusting a suitable parameter σ based on the TBQ(σ) algorithm.

5.3. TBQ(σ) with Function Approximator

We also evaluate the TBQ(σ) algorithm using neural networks as function approximators. With the help of deep Q-networks (DQN) Mnih et al. (2015), the offline version with a function approximator can be easily implemented. We here adopt the online forward view for updating the parameters of the neural network. Unlike traditional DQN, we replay 4 consecutive sequences of samples with a length of 8 for each update. We evaluate TBQ(σ) on the CartPole problem Barto et al. (1983), and adopt the OpenAI Gym (http://gym.openai.com) Brockman et al. (2016) as the evaluation platform. In this setting, a pole is attached by an un-actuated joint to a cart, which can move along the track. The agent’s goal is to prevent the pole from falling over with two actions controlling the cart: move left or right. Since the observation space is continuous, we adopt a two-layer neural network with 64 nodes in each layer to approximate the Q-value for the state-action pairs. We use an ε-greedy policy as the behavior policy and exponentially decay the parameter ε from 1 to 0.1 by 0.995 to ensure adequate exploration. In addition, the target network parameters are updated using soft replacement Lillicrap et al. (2016) according to the evaluation network parameters.
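The exponential exploration decay and the soft target replacement can be sketched in plain Python. The soft update follows the usual form from Lillicrap et al. (2016). The mixing rate `tau` and its value are hypothetical, since no value is given in this text, and applying the decay once per episode is also an assumption.

```python
def exponential_epsilon(episode, start=1.0, end=0.1, rate=0.995):
    """Epsilon multiplied by `rate` once per episode (assumed), floored at `end`."""
    return max(end, start * rate ** episode)


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.1)  # tau = 0.1 is a placeholder value
```

A small `tau` makes the target network trail the evaluation network slowly, which is the usual motivation for soft replacement over periodic hard copies.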

Figure 3. TBQ(σ) with function approximator in the CartPole environment. The exploration parameter ε of the ε-greedy policy decays from 1 to 0.1. To efficiently utilize the traces under dynamic ε, σ is linearly increased from 0.1 to 1 by step size 0.01. The results show that TBQ(σ) outperforms both TB(λ) and Naive Q(λ).
Parameter               Value
Discount factor         0.99
Initial exploration     1
Final exploration       0.1
Optimizer               Adam Kingma and Ba (2014)
Initial learning rate   0.001
Replay memory size      20000
Replay start episode    100
Table 1. Learning parameters for the neural network

In this setting, at the beginning of the learning process the distance between the target policy and the behavior policy reaches its maximum. When ε fades to 0.1, the two policies become close. Therefore, to ensure convergence we adopt a dynamic σ that linearly increases from 0.1 to 1 by stepsize 0.01. Other main learning parameters are listed in Table 1. The results are averaged across 5 independent runs with fixed random seeds. The result is shown in Figure 3. We also smooth the results with a right-centred moving average of 50 successive episodes. With a suitable dynamic σ, TBQ(σ) outperforms TB(λ) and Naive Q(λ) in the CartPole problem. The result indicates that in practice, we can improve the learning by adjusting a suitable parameter σ using the TBQ(σ) algorithm.
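The dynamic σ schedule and the smoothing step can be sketched as follows. Two assumptions are made: σ is incremented once per episode, and "right-centred" means a trailing window that averages each episode with the episodes before it. Both function names are illustrative.

```python
def linear_sigma(episode, start=0.1, end=1.0, delta=0.01):
    """Sigma after `episode` increments of size delta, capped at `end`."""
    # Assumption: one increment per episode.
    return min(end, start + delta * episode)


def trailing_mean(values, window=50):
    """Average each point with up to `window - 1` preceding points."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


example = trailing_mean([1.0, 2.0, 3.0, 4.0], window=2)
```

A trailing window only uses past episodes, so the smoothed curve never looks ahead of the data it summarises.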

6. Discussion and Conclusion

In this paper, we propose a new off-policy learning method called TBQ(σ) to define the degree of utilizing the off-policy traces. TBQ(σ) unifies TB(λ) and Naive Q(λ). Theoretical analysis shows the contraction property of TBQ(σ) in both policy evaluation and control. In addition, its convergence is proved for the control setting. We also provide two versions of the TBQ(σ) control algorithm: an online forward version and an online backward version.

Compared to TB(λ), the proposed algorithm improves the efficiency of trace utilization when the target policy is greedy. Compared to Naive Q(λ), our algorithm has a relatively loose convergence requirement. Since the coefficient c_s in our algorithm is less than 1, the variance of our algorithm is bounded Munos et al. (2016). Although we are not able to give further theoretical analysis of how the bootstrapping parameter λ and the degree of cutting traces σ jointly affect convergence, we empirically show that existing off-policy learning algorithms with eligibility traces can be improved and accelerated by adjusting a suitable trace-cutting degree parameter σ. The theoretical relationship between the bootstrapping parameter λ and σ is left for future work.

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is partly supported by National Key Research and Development Plan under Grant No. 2016YFB1001203, Zhejiang Provincial Natural Science Foundation of China (LR15F020001).


  • Barto et al. (1983) Andrew G Barto, Richard S Sutton, and Charles W Anderson. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics 5 (1983), 834–846.
  • Bertsekas and Tsitsiklis (1996) Dimitri P. Bertsekas and John N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. Openai gym. arXiv preprint arXiv:1606.01540 (2016).
  • De Asis et al. (2017) Kristopher De Asis, J Fernando Hernandez-Garcia, G Zacharias Holland, and Richard S Sutton. 2017. Multi-step reinforcement learning: A unifying algorithm. AAAI Conference on Artificial Intelligence.
  • Doina Precup (2000) Doina Precup, Richard S. Sutton, and Satinder Singh. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000. Morgan Kaufmann, 759–766.
  • Harutyunyan et al. (2016) Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. 2016. Q(λ) with Off-Policy Corrections. In International Conference on Algorithmic Learning Theory. Springer, 305–320.
  • Hasselt (2011) H Hasselt. 2011. Insights in reinforcement learning: formal analysis and empirical evaluation of temporal-difference learning algorithms. Ph.D. Dissertation. Universiteit Utrecht.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Leng et al. (2009) Jinsong Leng, Colin Fyfe, and Lakhmi C Jain. 2009. Experimental analysis on Sarsa(λ) and Q(λ) with different eligibility traces strategies. Journal of Intelligent & Fuzzy Systems 20, 1, 2 (2009), 73–82.
  • Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. International Conference on Learning Representations  (ICLR) (2016).
  • Meng et al. (2018) Wenjia Meng, Qian Zheng, Long Yang, Pengfei Li, and Gang Pan. 2018. Qualitative Measurements of Policy Discrepancy for Return-based Deep Q-Network. arXiv preprint arXiv:1806.06953 (2018).
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
  • Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems. 1054–1062.
  • Peng and Williams (1994) Jing Peng and Ronald J Williams. 1994. Incremental multi-step Q-learning. In Machine Learning Proceedings 1994. Elsevier, 226–232.
  • Precup (2000) Doina Precup. 2000. Temporal abstraction in reinforcement learning. Ph.D. Dissertation. University of Massachusetts Amherst.
  • Singh and Dayan (1998) Satinder Singh and Peter Dayan. 1998. Analytical mean squared error curves for temporal difference learning. Machine Learning 32, 1 (1998), 5–40.
  • Sutton et al. (2014) Rich Sutton, Ashique Rupam Mahmood, Doina Precup, and Hado Hasselt. 2014. A new Q (lambda) with interim forward view and Monte Carlo equivalence. In International Conference on Machine Learning. 568–576.
  • Sutton (1988) Richard S Sutton. 1988. Learning to predict by the methods of temporal differences. Machine learning 3, 1 (1988), 9–44.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. MIT press.
  • Sutton and Barto (2011) Richard S Sutton and Andrew G Barto. 2011. Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
  • Van Seijen et al. (2009) Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. 2009. A theoretical and empirical analysis of Expected Sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL’09. IEEE Symposium on. IEEE, 177–184.
  • Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
  • Watkins (1989) Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. King’s College, Cambridge.
  • Yang et al. (2018) Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, and Gang Pan. 2018. A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning. International Joint Conference on Artificial Intelligence (IJCAI) (2018).