# Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving, represented as a value function. However, the behavior used to select actions and generate the behavior data might be different from the one used to define the predictions, and thus the samples are generated off-policy. The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems but remained an open algorithmic challenge for decades. The issue lies with the temporal difference (TD) learning update at the heart of most prediction algorithms: combining bootstrapping, off-policy sampling and function approximation may cause the value estimate to diverge. A breakthrough came with the development of a new objective function that admitted stochastic gradient descent variants of TD. Since then, many sound online off-policy prediction algorithms have been developed, but there has been limited empirical work investigating the relative merits of all the variants. This paper aims to fill these empirical gaps and provide clarity on the key ideas behind each method. We summarize the large body of literature on off-policy learning, focusing on (1) methods that use computation linear in the number of features and are convergent under off-policy sampling, and (2) other methods which have proven useful with non-fixed, nonlinear function approximation. We provide an empirical study of off-policy prediction methods in two challenging microworlds. We report each method's parameter sensitivity, empirical convergence rate, and final performance, providing new insights that should enable practitioners to successfully extend these new methods to large-scale applications. [Abridged abstract]


## 1 A short history of off-policy temporal difference learning

The story of off-policy learning begins with one of the best-known algorithms of reinforcement learning, called Q-learning, and the classic exploration-exploitation tradeoff. Off-policy learning poses an elegant solution to the exploration-exploitation tradeoff: the agent makes use of an independent exploration policy to select actions while learning the value function for the optimal policy. The exploration policy does not maximize reward, but instead selects actions in order to generate data that improves the optimal policy through learning. Ultimately, the full potential of Q-learning, and of this ability to learn about one policy from data generated by a totally different exploration policy, proved limited. Baird's famous counter-example (1995) provided a clear illustration of how, under function approximation, the weights learned by Q-learning can become unstable. (Footnote 1: The action-value star MDP can be found in the errata of Baird's paper (1995).) Baird's counter-example highlights that divergence can occur when updating off-policy with function approximation and with bootstrapping (as in temporal difference (TD) learning), even when learning the value function of a fixed target policy.

The instability of TD methods is caused by how we correct the updates to the value function to account for the potential mismatch between the target and exploration policies. Off-policy training involves estimating the expected future rewards (the value function) that would be observed while selecting actions according to the target policy with training data (states, actions, and rewards) generated while selecting actions according to an exploration policy. One approach to account for the differences between the data produced by these two policies is based on using importance sampling corrections: scaling the update to the value function based on the agreement between the target and exploration policy at the current state. If the target and exploration policy would select the same action in a state, then they completely agree. Alternatively, if they never take the same action in a state they completely disagree. More generally, there can be degrees of agreement. We call this approach posterior corrections because the corrections account for the mismatch between policies ignoring the history of interaction up to the current time step—it does not matter what the exploration policy has done in the past.

Another approach, called prior corrections, uses the history of agreement between the exploration and target policy in the update. The likelihood that the trajectory could have occurred under the target policy is used to scale the update. The most extreme version of prior corrections uses the trajectory of experience from the beginning of time, corresponding to what has sometimes been referred to as the alternative life framework. Prior and posterior corrections can be combined to achieve stable Off-policy TD updates (Precup et al., 2000), though finite variance of the updates cannot be guaranteed (Precup et al., 2001). The perceived variance of these updates, as well as a preference for the excursions framework discussed below, led to a different direction years later for obtaining sound off-policy algorithms (Sutton et al., 2009).

Learning about many different policies in parallel has long been a primary motivation for stable off-policy learning, and this usage suggested that perhaps prior corrections are not essential. Several approaches require learning many value functions or policies in parallel, including approaches based on option models (Sutton, Precup & Singh, 1999), predictive representations of state (Littman, Sutton & Singh, 2002; Tanner and Sutton, 2005; Sutton et al., 2011), and auxiliary tasks (Jaderberg et al., 2016). In a parallel learning setting, it is natural to estimate the future reward achieved by following each target policy until termination from the states encountered during training—the value of taking excursions from the behavior policy. When value functions or policies estimated off-policy will be used, they will be used starting from states visited by the behavior policy. In such a setting, therefore, it is not necessarily desirable to obtain alternative life solutions.

The first major breakthrough came with the formalization of this excursion model as an objective function, which then enabled development of an online stochastic gradient descent algorithm. The resultant family of Gradient-TD methods use posterior corrections via importance sampling, and are guaranteed to be stable under function approximation (Sutton et al., 2009). This new excursion objective has the same fixed point as TD, and thus Gradient-TD methods converge to the same solution in the cases for which TD converges. Prior attempts to create an objective function for off-policy learning, namely the mean-squared Bellman error due to Baird (1995), resulted in algorithms that converge to different and sometimes less desirable fixed points (see Sutton & Barto, 2018 for an in-depth discussion of these issues). The Gradient-TD methods have extensions for incorporating eligibility traces (Maei & Sutton, 2010), non-linear function approximation such as with a neural network (Maei, 2011), and learning optimal policies (Maei & Sutton, 2010). Although guaranteed stable, the major critiques of these methods are (1) the additional complexity due to a second set of learned parameters, and (2) the variance due to importance sampling corrections.

The second major family of off-policy methods revisits the idea of using prior corrections. The idea is to incorporate prior corrections, starting only from the beginning of the excursion. In this way, the values of states that are more often visited under the target policy are emphasized, but the high variance of full prior corrections (back to the beginning of the episode) is avoided. An incremental algorithm, called Emphatic TD(λ), was developed to estimate these emphasis weightings (Sutton, Mahmood & White, 2016), with a later extension to further reduce the variance of the emphasis weights (Hallak et al., 2015). These Emphatic-TD methods are guaranteed stable under both on-policy and off-policy sampling with linear function approximation (Sutton, Mahmood & White, 2016; Yu, 2015; Hallak et al., 2015).

Since the introduction of these methods, several refinements have been introduced, largely towards improving sample efficiency. These include (1) Hybrid-TD methods that behave like TD when sampling is on-policy, (2) Saddlepoint methods for facilitating application of improved stochastic approximation algorithms and (3) variance reduction methods for posterior corrections, using different eligibility trace parameters. The Hybrid TD methods can be derived with a simple substitution in the gradient of the excursion objective. The resultant algorithms perform conventional TD updates when data is generated on-policy (Hackman 2012; White & White, 2016), and are stable under off-policy sampling. Initial empirical studies suggested that TD achieves better sample efficiency than Gradient-TD methods when the data is sampled on-policy, though later evaluations found little difference (White & White, 2016).

Another potential improvement on Gradient-TD can be derived by reformulating the excursion objective into a saddlepoint problem, resulting in several new methods (Liu et al., 2015; Liu et al., 2016; Du et al., 2017; Touati et al., 2018). This saddlepoint formulation enables use of optimization strategies for saddlepoint problems, including finite sample analysis (Touati et al., 2018) and accelerations (Liu et al., 2015; Du et al., 2017). Though most are applicable to online updating, some acceleration strategies are restricted to offline batch updating (Du et al., 2017). As with the hybrid methods, comparative studies to date remain inconclusive about the advantages of these methods over their vanilla Gradient-TD counterparts (Mahadevan et al., 2014; White & White, 2016).

Finally, several algorithms have been proposed to mitigate variance from importance sampling ratios in the posterior corrections. High magnitude importance sampling corrections introduce variance and slow learning, dramatically reducing the benefits of off-policy learning. In parallel learning frameworks with many target policies, the likelihood of large importance sampling corrections increases as the number of target policies increases. In practice one might use small stepsizes, or avoid eligibility traces, to mitigate the variance. The Retrace algorithm addresses this issue by truncating the importance sampling ratio and applying a bias correction, thus avoiding large updates when the exploration and the target policy differ significantly. This approach can diverge with function approximation (Touati et al., 2018). Nevertheless, Retrace has been used in several deep-learning systems with non-linear function approximation (Munos et al., 2016; Wang et al., 2016). The Tree Backup algorithm (Precup, 2000) mitigates variance without importance sampling corrections by only using the probability of the selected action under the target policy. Both Retrace and Tree Backup can be viewed as adapting the eligibility trace to reduce variance. The related ABQ algorithm achieves stable off-policy updates without importance sampling corrections by varying the amount of bootstrapping in an action-dependent manner (Mahmood, Yu & Sutton, 2017). Empirical studies suggest Retrace-based deep-learning systems can outperform systems based on Tree Backup and Q-learning. However, more targeted experiments are needed to understand the benefits of these adaptive bootstrapping methods over Gradient and Emphatic-TD methods.

### 1.1 Outlining the empirical study

Our theoretical understanding of off-policy temporal difference learning has evolved significantly, but our practical experience with these methods remains limited. Several variants of Gradient and Emphatic-TD have asymptotic performance guarantees (Mahadevan et al., 2014; Yu 2015, 2016, and 2017), and most of the methods discussed above (besides Tree Backup, V-trace) achieve the slightly weaker standard of convergence in expectation. In practice, stable off-policy methods are not commonly used. Q-learning and other potentially divergent Off-policy TD methods have been at the forefront of recent progress in scaling up reinforcement learning (Mnih et al., 2015; Lillicrap et al., 2015; Munos et al., 2016; Wang et al., 2016; Gruslys et al., 2018; Espeholt et al., 2018). To date, there have been no successful demonstrations of Gradient-TD methods in Atari, or any other large-scale problems. This is largely because these methods are not well understood empirically and many basic questions remain open. (1) How do methods from the two major families—Gradient and Emphatic-TD—compare in terms of asymptotic performance and speed of learning? (2) Does Emphatic-TD’s prior correction result in better asymptotic error, and does its single weight vector result in better sensitivity to its tunable parameters? (3) Does the hypothesis that Hybrid methods can learn faster than Gradient-TD hold across domains? (4) Do posterior correction methods exhibit significant variance compared with V-trace, Tree Backup, and ABQ? (5) What is the best way to incorporate posterior importance sampling corrections?

In this paper, we provide a comprehensive survey and empirical comparison of modern linear off-policy, policy evaluation methods. Prior studies have provided useful insights into some of these methods (Dann et al., 2014; Geist & Scherrer, 2014; White & White, 2016). Here we take the next step towards practical and stable online off-policy prediction.

In this paper, we restrict attention to policy evaluation methods and linear function approximation. There are several reasons for this choice. This setting is simpler and yet still includes key open problems. Focusing on policy evaluation allows us to lay aside a host of issues including maintaining sufficient exploration and chattering near optimal policies (see Bertsekas 2012, Chapter 6, for an overview). Another reason for focusing on policy evaluation is that many methods for policy optimization involve evaluation as an intermediate, repeated step; solving policy evaluation better can be expected to lead to better optimization methods. Although many of the recent large-scale demonstrations of reinforcement learning make use of non-linear function approximation via artificial neural networks, the linear case requires further treatment. Several recently proposed off-policy methods can diverge, even with linear function approximation. The majority of the methods we consider here have not been extended to the non-linear case, and the extensions are not trivial. Most importantly, conducting empirical comparisons of neural network learning systems is challenging due to extreme parameter sensitivity and sensitivity to initial conditions. The development of sound methodologies for empirical comparisons of neural network learning systems is still very much in its infancy, and beyond the scope of this paper (Henderson et al., 2017).

## 2 Problem Definition and Background

We consider the problem of learning the value function for a given policy under the Markov Decision Process (MDP) formalism. The agent interacts with the environment over a sequence of discrete time steps $t$. On each time step the agent observes a partial summary of the state $S_t \in \mathcal{S}$ and selects an action $A_t \in \mathcal{A}$. In response, the environment transitions to a new state $S_{t+1}$, according to transition function $P(S_{t+1} \mid S_t, A_t)$, and emits a scalar reward $R_{t+1} \in \mathbb{R}$. The agent selects actions according to a stochastic, stationary target policy $\pi(a \mid s)$.

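We study the problem of policy evaluation: the computation or estimation of the expected discounted sum of future rewards for policy $\pi$ from every state. The return at time $t$, denoted $G_t$, is defined as the discounted sum of future rewards. The discount factor can be variable, dependent on state: $\gamma_{t+1} \doteq \gamma(S_{t+1})$, with $\gamma: \mathcal{S} \to [0, 1]$. The return is defined as

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1}\gamma_{t+2} R_{t+3} + \gamma_{t+1}\gamma_{t+2}\gamma_{t+3} R_{t+4} + \dots \\
    &= \sum_{k=0}^{\infty} \left( \prod_{i=1}^{k} \gamma_{t+i} \right) R_{t+k+1}.
\end{aligned}
$$

When $\gamma_t$ is constant, we get the familiar return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where we overload $\gamma$ here to indicate a scalar, constant discount. Otherwise, variable $\gamma_t$ can discount per state, including encoding termination when it is set to zero. The value function $v_\pi: \mathcal{S} \to \mathbb{R}$ maps each state to the expected return under policy $\pi$ starting from that state

$$
v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s \right], \quad \text{for all } s \in \mathcal{S}, \tag{1}
$$

where the expectation operator $\mathbb{E}_\pi$ reflects that the expectation over future states, actions, and rewards uses the distribution over actions given by $\pi$, and the transition dynamics of the MDP.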
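The return with state-dependent discounting can be computed backwards from the end of a finite reward sequence, using the recursion $G_t = R_{t+1} + \gamma_{t+1} G_{t+1}$ implied by the definition above. The following is a minimal sketch; the function name `discounted_return` and the list-based inputs are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, discounts):
    """Compute G_t from a finite sequence of rewards and per-step discounts.

    rewards[k]   plays the role of R_{t+k+1}
    discounts[k] plays the role of gamma_{t+k+1}
    Rewards after the end of the sequence are assumed to be zero.
    """
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma_{t+1} * G_{t+1}.
    for r, gamma in zip(reversed(rewards), reversed(discounts)):
        g = r + gamma * g
    return g


# A discount of zero encodes termination: the reward of 100 that follows
# the zero discount does not contribute to the return.
g_terminated = discounted_return([1.0, 1.0, 100.0], [0.9, 0.0, 0.9])
```

Here `g_terminated` is `1 + 0.9 * 1`, because the second discount of zero cuts off everything after the second reward.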

In this paper, we are interested in problems where the value of each state cannot be stored in a table; instead the agent must approximate the value with a parameterized function. The approximate value function can have arbitrary form, as long as it is everywhere differentiable with respect to the parameters. An important special case is when the approximate value function is linear in the parameters and in features of the state. In particular, the current state $S_t$ is converted into a feature vector $\mathbf{x}_t \in \mathbb{R}^d$ by some fixed mapping $\mathbf{x}: \mathcal{S} \to \mathbb{R}^d$. The value of the state can then be approximated with an inner product:

$$
\hat{v}(s_t, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}_t \approx v_\pi(s_t), \quad \text{for all } s_t \in \mathcal{S},
$$

where $\mathbf{w} \in \mathbb{R}^d$ is a vector of weights/parameters which are modified by the learning process to better approximate $v_\pi$. Henceforth, we refer to $\mathbf{w}$ exclusively as the weights, or weight vector, and reserve the word parameter for variables like the discount-rate and stepsize parameters. Typically the number of components in $\mathbf{w}$ is much less than the number of possible states ($d \ll |\mathcal{S}|$), and thus $\hat{v}$ will generalize values across many states in $\mathcal{S}$.

We first describe how to learn this value function for the on-policy setting, where the behavior policy equals the target policy. Temporal difference learning (Sutton, 1988) is perhaps the best known and most successful approach for estimating $v_\pi$ directly from samples generated while interacting with the environment. Instead of waiting until the end of a trajectory to update the value of each state, the TD(λ) algorithm adjusts its current estimate of the weights toward the difference between the discounted estimate of the value in the next state and the estimated value of the current state plus the reward along the way:

$$
\delta_t \doteq \delta(S_t, A_t, S_{t+1}) \doteq R_{t+1} + \gamma \mathbf{w}_t^\top \mathbf{x}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t. \tag{2}
$$

We use the value function's own estimate of future reward as a placeholder for the future rewards defining $G_t$ that are not available on time step $t+1$. In addition, the TD(λ) algorithm also maintains an eligibility trace vector $\mathbf{z}_t \in \mathbb{R}^d$ that stores a fading trace of recent feature activations. The components of $\mathbf{w}_t$ are updated on each step proportional to the magnitude of the trace vector. This simple scheme allows update information to propagate more quickly in domains where the rewards are often zero, such as a maze with a reward of one upon entering the terminal state and zero otherwise.

The update equations for TD(λ) are straightforward:

$$
\begin{aligned}
\mathbf{w}_{t+1} &\leftarrow \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t \\
\mathbf{z}_t &\leftarrow \gamma \lambda \mathbf{z}_{t-1} + \mathbf{x}_t,
\end{aligned}
$$

where $\alpha$ is the scalar stepsize parameter that controls the speed of learning, and $\lambda \in [0, 1]$ controls the length of the eligibility trace. If $\lambda$ is one, then the above algorithm performs an incremental version of Monte-Carlo policy evaluation. On the other hand, when $\lambda$ is zero the TD(λ) algorithm updates the value of each state using only the reward and the estimated value of the next state, often referred to as full one-step bootstrapping. In practice, intermediate values of $\lambda$ between zero and one often perform best. The TD(λ) algorithm has been shown to converge with probability one to the best linear approximation of the value function under quite general conditions.
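The TD(λ) update above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper names `td_lambda_step` and `run_episode`, the two-state chain, the all-zero feature vector for the terminal state, and the stepsize of 0.5 are all assumptions made for the example.

```python
def td_lambda_step(w, z, x, x_next, reward, alpha, gamma, lam):
    """One on-policy TD(lambda) update with linear features (lists of floats)."""
    # Accumulating trace: z_t = gamma * lambda * z_{t-1} + x_t
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    # TD error: delta_t = R_{t+1} + gamma * w^T x_{t+1} - w^T x_t
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v
    # Weight update: w_{t+1} = w_t + alpha * delta_t * z_t
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta


def run_episode(lam, alpha=0.5, gamma=1.0):
    """Run one episode x -> y -> T with reward 1 on entering T (hypothetical chain)."""
    features = {"x": [1.0, 0.0], "y": [0.0, 1.0], "T": [0.0, 0.0]}
    transitions = [("x", 0.0, "y"), ("y", 1.0, "T")]
    w, z = [0.0, 0.0], [0.0, 0.0]
    for s, r, s_next in transitions:
        w, z, _ = td_lambda_step(w, z, features[s], features[s_next],
                                 r, alpha, gamma, lam)
    return w


w_monte_carlo_like = run_episode(lam=1.0)   # trace carries credit back to x
w_one_step = run_episode(lam=0.0)           # only y is updated in the first episode
```

With λ = 1 the final reward updates both states in the first episode, whereas with λ = 0 only the state immediately preceding the reward is updated, which is the propagation effect the trace is meant to provide.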

These updates need to be modified for the off-policy case, where the agent selects actions according to a behavior policy $b(a \mid s)$ that is different from the target policy. The value function for target policy $\pi$ is updated using experience generated from a behavior policy that is off, away, or distant from the target policy. For example, consider the most well-known off-policy algorithm, Q-learning. The target policy might be the one that maximizes future discounted reward, while the behavior is nearly identical to the target policy, but instead selects an exploratory action with some small probability. More generally, the target and behavior policies need not be so closely coupled. The target policy might be the shortest path to one or more goal states in a gridworld, and the behavior policy might select actions in each state uniformly at random. The main requirement linking these two policies is that the behavior policy covers the actions selected by the target policy in each state visited by $b$, that is: $b(a \mid s) > 0$ for all states and actions in which $\pi(a \mid s) > 0$.
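The coverage requirement can be checked directly when both policies are available as tables. The sketch below is illustrative only: the function name `covers` and the nested-dictionary policy representation are assumptions, and for simplicity it checks every listed state rather than only the states actually visited by the behavior policy.

```python
def covers(pi, b):
    """Return True if b(a|s) > 0 wherever pi(a|s) > 0.

    Both policies are dicts of the form {state: {action: probability}}.
    """
    return all(
        b[s].get(a, 0.0) > 0.0
        for s in pi
        for a, p in pi[s].items()
        if p > 0.0
    )


target = {"s0": {"left": 0.0, "right": 1.0}}
uniform_behavior = {"s0": {"left": 0.5, "right": 0.5}}
bad_behavior = {"s0": {"left": 1.0, "right": 0.0}}

ok = covers(target, uniform_behavior)      # uniform behavior covers any target
not_ok = covers(target, bad_behavior)      # "right" is never tried by the behavior
```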

An important difference between these two settings is in the stability and convergence of the algorithms. One of the most distinctive aspects of off-policy learning and function approximation is that it has been shown that Q-learning and TD(λ), appropriately modified for off-policy updates, and even Dynamic Programming can diverge (Sutton & Barto, 2018). In the next two sections, we will discuss different ways to adapt TD-style algorithms with linear function approximation to the off-policy setting. We will highlight convergence issues and issues with solution quality, and discuss different ways recent algorithms proposed to address these issues.

## 3 Off-policy Corrections

The key problem in off-policy learning is to estimate the value function for the target policy, conditioned on samples produced by actions selected according to the behavior policy. This is an instance of the problem of estimating an expected value under some target distribution from samples generated by some other behavior distribution. In statistics, we address this problem with importance sampling, and indeed most methods of off-policy reinforcement learning use such corrections.

We can either account for the differences between which actions the target policy would choose in each state, or account for which states are more likely to be visited under the target policy. More precisely, there are two distributions that we could consider correcting: the distribution over actions, given the state, and the distribution over states. When observing a transition generated by taking the action according to $b$, we can consider correcting the update for that transition so that in expectation it is as if actions were taken according to $\pi$. However, these updates would still be different than if we evaluated $\pi$ on-policy, because the frequency of visiting each state under $b$ will be different than under $\pi$. All methods correct for the distribution over actions (posterior corrections), given the state, but several methods also correct for the distribution over states (prior corrections) in slightly different ways.

In this section, we first provide an intuitive explanation of the differences between methods that use only posterior correction and those that additionally incorporate prior corrections. We then discuss the optimization objective used by Off-policy TD methods, and highlight how the use of prior corrections corresponds to different weightings in this objective. This generic objective will then allow us to easily describe the differences between the algorithms in Section 4.

### 3.1 Posterior Corrections

The most common approach to developing sound Off-policy TD algorithms makes use of posterior corrections based on importance sampling. One of the simplest examples of this approach is Off-policy TD(λ). The procedure is easy to implement and requires constant computation per time step, given knowledge of both the target and behavior policies. On the transition from $S_t$ to $S_{t+1}$ via action $A_t$, we compute the ratio between $\pi$ and $b$:

$$
\rho_t \doteq \rho(A_t \mid S_t) \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}. \tag{3}
$$

These importance sampling corrections are then simply incorporated into the eligibility trace update on each time step, where they multiply the trace:

$$
\begin{aligned}
\mathbf{w}_{t+1} &\leftarrow \mathbf{w}_t + \alpha \delta_t \mathbf{z}^\rho_t \\
\mathbf{z}^\rho_t &\leftarrow \rho_t \left( \gamma \lambda \mathbf{z}^\rho_{t-1} + \mathbf{x}_t \right),
\end{aligned} \tag{4}
$$

where $\delta_t$ is defined in Equation 2. This way of correcting the sample updates ensures that the approximate value function estimates the expected value of the return as if the actions were selected according to $\pi$. Posterior correction methods use the target policy probabilities for the selected action to correct the update to the value of state $S_t$ using only the data from time step $t$ onward. Values of $\rho$ from time steps prior to $t$ have no impact on the correction. Combining importance sampling with eligibility trace updates, as in Off-policy TD(λ), is the most common realization of posterior corrections.
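A single step of Off-policy TD(λ), following Equations 2 to 4, can be sketched as below. This is a hedged illustration, not the authors' code: the function name `off_policy_td_lambda_step`, the list-based vectors, and the numeric values are assumptions chosen for the example.

```python
def off_policy_td_lambda_step(w, z, x, x_next, reward, pi_prob, b_prob,
                              alpha, gamma, lam):
    """One Off-policy TD(lambda) step with linear features (lists of floats).

    pi_prob and b_prob are pi(A_t|S_t) and b(A_t|S_t) for the action taken;
    b_prob must be positive (coverage).
    """
    rho = pi_prob / b_prob                                   # Equation 3
    # Corrected trace: z_t = rho_t * (gamma * lambda * z_{t-1} + x_t)
    z = [rho * (gamma * lam * zi + xi) for zi, xi in zip(z, x)]
    # TD error (Equation 2)
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v
    # Weight update: w_{t+1} = w_t + alpha * delta_t * z_t
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta


# The target always takes this action and the behavior takes it half the time,
# so rho = 2 and the update is doubled.
w_agree, z_agree, _ = off_policy_td_lambda_step(
    [0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0],
    reward=1.0, pi_prob=1.0, b_prob=0.5, alpha=0.1, gamma=0.9, lam=0.5)

# The target never takes this action, so rho = 0: the trace is wiped and the
# weights do not move.
w_disagree, z_disagree, _ = off_policy_td_lambda_step(
    [0.0, 0.0], [0.3, 0.0], [1.0, 0.0], [0.0, 0.0],
    reward=1.0, pi_prob=0.0, b_prob=0.5, alpha=0.1, gamma=0.9, lam=0.5)
```

Because the ratio multiplies the whole trace, an action the target policy would never take discards all accumulated credit, not just the credit for the current state.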

To help understand the implications of posterior corrections, consider the MDP depicted in Figure 1. Each episode starts in the leftmost state denoted 'x' and terminates on transition into the terminal state denoted with 'T', and each state is represented with a unique tabular state encoding: x, y. In each state there are two possible actions and the behavior policy chooses each action in each state with 0.5 probability. The target policy chooses action $a_1$ in all states. A posterior correction method like Off-policy TD(λ) will always update the value of a state if action $a_1$ is taken. For example, if the agent takes action $a_1$ in some state $s$, Off-policy TD(λ) will update the value of state $s$, no matter the history of interaction before entering state $s$.

Although the importance sampling corrections multiply together in the eligibility trace update, Off-policy TD(λ) does not use importance sampling corrections computed from prior time steps to update the value of the current state. This is easy to see with an example. For simplicity we assume $\gamma_t$ is a constant $\gamma$. Let's examine the updates and trace contents for a trajectory where $b$'s action choices perfectly agree with $\pi$:

 x→y→T.

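After the transition from $x \rightarrow y$, Off-policy TD(λ) will update the value estimate corresponding to $x$:

$$
\begin{bmatrix} \hat{v}_1(x) \\ \hat{v}_1(y) \end{bmatrix}
\leftarrow
\begin{bmatrix} 0 \\ 0 \end{bmatrix} + \alpha \delta_1 \mathbf{z}^\rho_1
= \alpha \delta_1
\begin{bmatrix} \frac{\pi(a_1 \mid x)}{b(a_1 \mid x)} \gamma\lambda \\ 0 \end{bmatrix},
$$

where $\hat{v}_1(x)$ denotes the estimated value of state $x$ on time step 1 (after the first transition), and as usual the value estimates and the eligibility trace are initialized to zero. After the second transition, $y \rightarrow T$, the importance sampling corrections multiply together in the trace, and the value estimates corresponding to both $x$ and $y$ are updated:

$$
\begin{bmatrix} \hat{v}_2(x) \\ \hat{v}_2(y) \end{bmatrix}
\leftarrow
\begin{bmatrix} \hat{v}_1(x) \\ \hat{v}_1(y) \end{bmatrix}
+ \alpha \delta_2
\begin{bmatrix}
\frac{\pi(a_1 \mid y)}{b(a_1 \mid y)} \frac{\pi(a_1 \mid x)}{b(a_1 \mid x)} \gamma^2\lambda^2 \\
\frac{\pi(a_1 \mid y)}{b(a_1 \mid y)} \gamma\lambda
\end{bmatrix}.
$$

The estimated value of state $y$ is only updated with importance sampling corrections computed from state transitions that occur after the visit to $y$: using $\frac{\pi(a_1 \mid y)}{b(a_1 \mid y)}$, but not $\frac{\pi(a_1 \mid x)}{b(a_1 \mid x)}$.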

Finally, consider another trajectory that deviates from the target policy’s choice on the second step of the trajectory:

 x→y→y→T.

On the first transition the value of $x$ is updated as expected, and no update occurs as a result of the second transition. On the third transition the estimated value of state $x$ is not updated, which is easy to see from inspecting the eligibility trace on each time step:

$$
\mathbf{z}^\rho_1 = \begin{bmatrix} \frac{\pi(a_1 \mid x)}{b(a_1 \mid x)} \gamma\lambda \\ 0 \end{bmatrix}; \quad
\mathbf{z}^\rho_2 = \mathbf{0}; \quad
\mathbf{z}^\rho_3 = \begin{bmatrix} 0 \\ \frac{\pi(a_1 \mid y)}{b(a_1 \mid y)} \gamma\lambda \end{bmatrix}.
$$

The eligibility trace is set to zero on time step two, because the target policy never chooses action $a_2$ in state $y$ and thus $\rho_2 = 0$. The value of state $y$ is never updated using importance sampling corrections computed on time steps prior to the agent's arrival in $y$.
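The pattern of zero and nonzero trace components in this example can be reproduced with a short script. The sketch applies the trace recursion of Equation 4 as written and checks only which components are nonzero; the function name `posterior_traces`, the one-hot features, and the values of γ and λ are assumptions made for illustration.

```python
GAMMA, LAM = 0.9, 0.5                              # illustrative values only
FEATURES = {"x": [1.0, 0.0], "y": [0.0, 1.0]}      # assumed one-hot encoding
PI = {"a1": 1.0, "a2": 0.0}                        # target: always a1
B = {"a1": 0.5, "a2": 0.5}                         # behavior: uniform


def posterior_traces(trajectory):
    """Return the trace after each (state, action) pair, per Equation 4."""
    z = [0.0, 0.0]
    traces = []
    for state, action in trajectory:
        rho = PI[action] / B[action]
        z = [rho * (GAMMA * LAM * zi + xi)
             for zi, xi in zip(z, FEATURES[state])]
        traces.append(z)
    return traces


# Behavior agrees with the target on every step: x -> y -> T.
agree = posterior_traces([("x", "a1"), ("y", "a1")])
# Behavior deviates on the second step: x -> y -> y -> T.
deviate = posterior_traces([("x", "a1"), ("y", "a2"), ("y", "a1")])
```

In `deviate`, the second trace is all zeros and the third has a nonzero entry only for `y`, so `x` receives no update on the final transition while `y` still does.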

Many modern off-policy prediction methods use some form of posterior corrections including the Gradient-TD methods, Tree Backup(λ), V-trace(λ), and Emphatic TD(λ). In fact, all off-policy prediction methods with stability guarantees make use of posterior corrections via importance sampling. Only correcting the action distribution, however, does not necessarily provide stable updates, and Off-policy TD(λ) is not guaranteed to converge (Baird, 1995). To obtain stable Off-policy TD(λ) updates, we need to consider corrections to the state distribution as well, as we discuss next.

### 3.2 Prior Corrections

We can also consider correcting for the differences between the target and behavior policy by using the agreement between the two over a trajectory of experience. Prior correction methods keep track of a product of per-step corrections over the trajectory, such as $\prod_{k=1}^{t} \rho_k$, and correct the update to the value of $S_t$ using the current value of the product. Therefore, the value of $S_t$ is only updated if the product is not zero, meaning that the behavior policy never selected an action for which $\pi(A_k \mid S_k)$ was zero; that is, the behavior never completely deviated from the target policy.

To appreciate the consequences of incorporating these prior corrections into the TD update, consider a state-value variant of Precup et al.'s (2001) Off-policy TD(λ) algorithm:

$$
\begin{aligned}
\mathbf{w}_{t+1} &\leftarrow \mathbf{w}_t + \alpha \delta_t \mathbf{z}^\rho_t \\
\mathbf{z}^\rho_t &\leftarrow \rho_t \left( \gamma \lambda \mathbf{z}^\rho_{t-1} + \prod_{k=1}^{t-1} \rho_k \, \mathbf{x}_t \right).
\end{aligned} \tag{5}
$$

We will refer to the above algorithm as Alternative-life TD(λ). The product in Equation 5 includes all the $\rho_k$ observed during the current episode. Note that experience from prior episodes does not impact the computation of the eligibility trace, as the trace is always reinitialized at the start of the episode.

Now consider the updates performed by Alternative-life TD(λ) using different trajectories from our simple MDP (Figure 1). If the agent ever selects action $a_2$, then none of the following transitions will result in further updates to the value function. For example, the trajectory $x \rightarrow y \rightarrow y \rightarrow T$ will update $\hat{v}(x)$ corresponding to the first transition, but $\hat{v}(y)$ would never be updated due to the product in Equation 5. In contrast, the Off-policy TD(λ) algorithm described in Equation 4 would update $\hat{v}(x)$ on the first transition, and also update $\hat{v}(y)$ on the last transition of the trajectory.
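The effect of the running product can be seen by computing the Alternative-life TD(λ) trace of Equation 5 on the deviating trajectory. As before, this is only a sketch under assumptions: the function name `alternative_life_traces`, the one-hot features, and the values of γ and λ are illustrative choices.

```python
GAMMA, LAM = 0.9, 0.5                              # illustrative values only
FEATURES = {"x": [1.0, 0.0], "y": [0.0, 1.0]}      # assumed one-hot encoding
PI = {"a1": 1.0, "a2": 0.0}                        # target: always a1
B = {"a1": 0.5, "a2": 0.5}                         # behavior: uniform


def alternative_life_traces(trajectory):
    """Return the trace after each (state, action) pair, per Equation 5."""
    z = [0.0, 0.0]
    prior = 1.0          # product of rho_k over earlier steps of this episode
    traces = []
    for state, action in trajectory:
        rho = PI[action] / B[action]
        z = [rho * (GAMMA * LAM * zi + prior * xi)
             for zi, xi in zip(z, FEATURES[state])]
        traces.append(z)
        prior *= rho     # fold the current ratio into the running product
    return traces


# x -> y -> y -> T, with the behavior deviating (a2) on the second step.
traces = alternative_life_traces([("x", "a1"), ("y", "a2"), ("y", "a1")])
y_components = [z[1] for z in traces]
```

Every entry of `y_components` is zero: once the product hits zero it stays there for the rest of the episode, so the value of `y` is never updated, unlike under the posterior-only trace of Equation 4.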

The Alternative-life TD(λ) algorithm has been shown to converge under linear function approximation, but in practice exhibits unacceptable variance (Precup et al., 2001). The Emphatic TD(λ) algorithm, on the other hand, provides an alternative form for the prior corrections that has lower variance but still guarantees convergence. To more clearly explain why, next we will discuss how different prior corrections account for different weightings in optimizing the mean-squared Projected Bellman Error (MSPBE).

### 3.3 Objective functions for posterior and prior corrections

In this section, we describe how different prior corrections, or no prior corrections, correspond to optimizing similar objectives, but with different weightings over the state. This section introduces the notation required to explain all the algorithms, and clarifies convergence properties of algorithms, including which algorithms converge and to which fixed point.

We begin by considering a simplified setting, with $\lambda = 0$, and a simplified variant of the MSPBE, called the NEU (norm of the expected TD update; Sutton, 2009)

$$
\text{NEU}(\mathbf{w}) = \left\| \sum_{s \in \mathcal{S}} d(s) \, \mathbb{E}_\pi\left[ \delta(S, A, S') \mathbf{x}(S) \mid S = s \right] \right\|_2^2, \tag{6}
$$

where $d: \mathcal{S} \to [0, \infty)$ is a positive weighting on the states, and we explicitly write $\delta(S, A, S')$ to emphasize that randomness in the TD-error is due to the underlying randomness in the transition $(S, A, S')$. Equation 6 does not commit to a particular sampling strategy. If the data is sampled on-policy, then $d = d_\pi$, where $d_\pi$ is the stationary distribution for $\pi$, which represents the state visitation frequency under behavior $\pi$ in the MDP. If the data is sampled off-policy, then the objective is instead weighted by the state visitation frequency under $b$, i.e., $d = d_b$. As discussed for ETD(λ) in Section 4.5, other weightings are also possible; for now, we focus on $d_\pi$ or $d_b$.
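For a small MDP whose dynamics under the target policy are known, the NEU can be evaluated exactly. The sketch below is an illustration under assumptions: the function name `neu`, the two-state chain, its rewards, the one-hot features, and the uniform weighting are all hypothetical choices, and the expectation over actions and next states is assumed to be pre-computed into `expected_reward` and `p_pi`.

```python
def neu(w, d, features, expected_reward, p_pi, gamma):
    """NEU(w) = || sum_s d(s) * E_pi[delta | s] * x(s) ||^2 for linear values.

    expected_reward[s] is the expected one-step reward from s under pi, and
    p_pi[s] maps next states to their probability under pi. A terminal
    successor is represented by leaving it out (its value is zero).
    """
    value = {s: sum(wi * xi for wi, xi in zip(w, features[s])) for s in features}
    total = [0.0] * len(w)
    for s in features:
        expected_next = sum(p * value[s2] for s2, p in p_pi[s].items())
        expected_delta = expected_reward[s] + gamma * expected_next - value[s]
        for i, xi in enumerate(features[s]):
            total[i] += d[s] * expected_delta * xi
    return sum(t * t for t in total)


features = {"x": [1.0, 0.0], "y": [0.0, 1.0]}
expected_reward = {"x": 0.0, "y": 1.0}
p_pi = {"x": {"y": 1.0}, "y": {}}          # y leads to termination
d = {"x": 0.5, "y": 0.5}

neu_at_zero = neu([0.0, 0.0], d, features, expected_reward, p_pi, gamma=0.9)
neu_at_true = neu([0.9, 1.0], d, features, expected_reward, p_pi, gamma=0.9)
```

With tabular features the true value function (0.9 for `x`, 1.0 for `y`) drives every expected TD error to zero, so `neu_at_true` vanishes, while the all-zero weights leave a positive NEU.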

We first consider how to sample the NEU for a given state. The behavior policy $b$ selects actions in each state $s$, so the update needs to be corrected for the action selection probabilities of $\pi$ in state $s$. Importance sampling is one way to correct these action probabilities from a given state:

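$$\begin{aligned}
\mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = s\big]
&= \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} P(s'|s,a)\, \delta(s,a,s')\, x(s) \\
&= \sum_{a \in \mathcal{A}} \frac{b(a|s)}{b(a|s)}\, \pi(a|s) \sum_{s' \in \mathcal{S}} P(s'|s,a)\, \delta(s,a,s')\, x(s) \\
&= \sum_{a \in \mathcal{A}} b(a|s) \sum_{s' \in \mathcal{S}} P(s'|s,a)\, \frac{\pi(a|s)}{b(a|s)}\, \delta(s,a,s')\, x(s) \\
&= \mathbb{E}_b\big[\rho(A_t|S_t)\, \delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = s\big].
\end{aligned} \tag{7}$$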

Therefore, the update $\rho_t \delta_t x_t$ provides an unbiased sample of the desired expected update $\mathbb{E}_\pi[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = s]$. All off-policy methods use these posterior corrections.
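The identity in Equation 7 can be checked numerically. The sketch below is illustrative only: the single-state setup, the randomly drawn policies and transition probabilities, and all variable names are assumptions made for the example, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_next = 3, 4

# Hypothetical single state s with random policies and dynamics.
pi = rng.dirichlet(np.ones(n_actions))               # target policy pi(a|s)
b = rng.dirichlet(np.ones(n_actions))                # behavior policy b(a|s), all entries > 0
P = rng.dirichlet(np.ones(n_next), size=n_actions)   # P(s'|s,a), one row per action
delta = rng.normal(size=(n_actions, n_next))         # stand-in TD-errors delta(s,a,s')
x = rng.normal(size=5)                               # feature vector x(s)

# First line of Equation 7: expectation under the target policy.
expected_under_pi = (pi[:, None] * P * delta).sum() * x

# Last line of Equation 7: expectation under the behavior policy,
# with each TD-error reweighted by rho = pi / b.
rho = pi / b
expected_under_b = (b[:, None] * P * (rho[:, None] * delta)).sum() * x

assert np.allclose(expected_under_pi, expected_under_b)
```

The check only covers the expected update from one state; it says nothing about the variance of the sampled update, which grows with the size of the ratios.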

We can also adjust the state probabilities from $d_b$ to $d_\pi$, using prior corrections. Alternative-life TD($\lambda$) uses such prior corrections to ask: what would the value be if the data had been generated according to $\pi$ instead of $b$? In such a scenario, the state visitation would be according to $d_\pi$, and so we need to correct both the action probabilities in the updates and the distribution from which we update. Prior corrections adjust the likelihood of reaching a state. Consider the expectation using prior corrections, when starting in state $s_0$ and taking two steps following $b$:

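$$\begin{aligned}
&\mathbb{E}_b\Big[\rho_0 \rho_1\, \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = S_2\big] \,\Big|\, S_0 = s_0\Big] \\
&\quad= \mathbb{E}_b\Big[\rho_0 \sum_{a_1 \in \mathcal{A}} b(a_1|S_1) \sum_{s_2 \in \mathcal{S}} P(s_2|S_1, a_1)\, \rho(a_1|S_1)\, \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = s_2\big] \,\Big|\, S_0 = s_0\Big] \\
&\quad= \mathbb{E}_b\Big[\rho_0 \sum_{a_1 \in \mathcal{A}} \pi(a_1|S_1) \sum_{s_2 \in \mathcal{S}} P(s_2|S_1, a_1)\, \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = s_2\big] \,\Big|\, S_0 = s_0\Big] \\
&\quad= \mathbb{E}_b\Big[\rho_0\, \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_{t-1} = S_1\big] \,\Big|\, S_0 = s_0\Big] \\
&\quad= \sum_{a_0 \in \mathcal{A}} \pi(a_0|s_0) \sum_{s_1 \in \mathcal{S}} P(s_1|s_0, a_0)\, \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_{t-1} = s_1\big] \\
&\quad= \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_0 = s_0\big].
\end{aligned}$$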

More generally, we get

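$$\mathbb{E}_b\Big[\rho_1 \cdots \rho_{t-1}\, \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_t = s\big] \,\Big|\, S_0 = s_0\Big] = \mathbb{E}_\pi\big[\delta(S_t, A_t, S_{t+1})\, x(S_t) \mid S_0 = s_0\big].$$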

These corrections adjust the probabilities of the sequence from the beginning of the episode to make it as if policy $\pi$ had taken the actions to get to state $S_t$, from which we do the TD($\lambda$) update.

A natural question is which objective should be preferred: the alternative-life ($d_\pi$) or the excursions objective ($d_b$). As with all choices for objectives, there is not an obvious answer. The alternative-life objective is difficult to optimize, because prior corrections can become very large or zero (causing data to be discarded) and have high variance. On the other hand, the fixed-point solution to the excursion objective can be arbitrarily poor compared with the best value function in the function approximation class if there is a significant mismatch between the behavior and target policy (Kolter, 2011). Better solution accuracy can be achieved using an excursion's weighting that includes $d_b$, but additionally reweights to make the state distribution closer to $d_\pi$, as is done with Emphatic TD($\lambda$). We postpone the discussion of this alternative weighting and its corresponding fixed point until after we have properly described Emphatic TD($\lambda$), with the rest of the algorithms in the next section.

The above discussion focused on a simplified variant of the MSPBE with $\lambda = 0$, but the intuition is the same for the MSPBE and $\lambda > 0$. To simplify notation we introduce a conditional expectation operator:

$$\mathbb{E}_d[Y] = \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}_\pi[Y \mid S = s].$$

We can now define

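$$\begin{aligned}
C &\doteq \mathbb{E}_d\big[x(S)\, x(S)^\top\big] \\
A &\doteq -\mathbb{E}_d\big[z(S)\, (\gamma(S')\, x(S') - x(S))^\top\big] \\
b &\doteq \mathbb{E}_d\big[R(S, A, S')\, z(S)\big]
\end{aligned}$$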

where the eligibility trace is defined recursively as . We can write the TD() fixed point residual as:

$$\mathbb{E}_d\big[\delta(S, A, S')\, z(S)\big] = -Aw + b \tag{8}$$

so called because $-Aw + b = 0$ at the fixed point solution for on-policy TD($\lambda$). The MSPBE can be defined simply, given the definition above:

$$\text{MSPBE}(w) \doteq (-Aw + b)^\top C^{-1} (-Aw + b). \tag{9}$$

The only difference compared with the NEU is the weighted norm, weighted by $C^{-1}$, instead of simply $I$. The extension to $\lambda > 0$ requires that posterior corrections also correct future actions from the state $S_t$, resulting in a product of importance sampling ratios in the eligibility trace, as described in the previous section. The conclusions about the choice of state probabilities in defining the objective, however, remain consistent. In the next section, we discuss how different off-policy methods optimize the different variants of the MSPBE.
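As a concrete illustration of Equations 8 and 9, the following sketch evaluates the MSPBE for $\lambda = 0$, where the trace reduces to the feature vector. It assumes a small hypothetical chain with deterministic transitions under the target policy; the function name `mspbe_lambda0` and the six-state ring are inventions for this example.

```python
import numpy as np

def mspbe_lambda0(w, X, X_next, r, gamma, d):
    """MSPBE for lambda = 0, where the trace z(S) reduces to x(S).

    X[i] and X_next[i] are the features of state i and of its (assumed
    deterministic) successor under the target policy; r[i] is the reward on
    that transition and d[i] is the weighting on state i.
    """
    D = np.diag(d)
    C = X.T @ D @ X                       # E_d[x x^T]
    A = X.T @ D @ (X - gamma * X_next)    # chosen so that b - A w = E_d[delta x]
    b = X.T @ D @ r                       # E_d[R x]
    residual = b - A @ w                  # TD fixed point residual (Equation 8)
    return residual @ np.linalg.solve(C, residual)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X_next = X[np.roll(np.arange(6), -1)]     # hypothetical ring: state i moves to i + 1
r = rng.normal(size=6)
d = np.full(6, 1.0 / 6.0)                 # uniform weighting, stationary for the ring
gamma = 0.9

A = X.T @ np.diag(d) @ (X - gamma * X_next)
b = X.T @ np.diag(d) @ r
w_td = np.linalg.solve(A, b)              # TD fixed point: -A w + b = 0
```

Evaluating `mspbe_lambda0(w_td, X, X_next, r, gamma, d)` should give a value that is zero up to floating-point error, while any other weight vector gives a strictly positive value.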

## 4 Algorithms

In this section, we describe the methods used in the empirical study that follows. In particular, we discuss the optimization objective, and provide detailed update equations highlighting how prior or posterior corrections are used in each method. We begin with the Gradient-TD family of methods that minimize the excursion variant of the MSPBE. We then discuss modifications on GTD($\lambda$), namely the Hybrid methods and the Saddlepoint methods. Then we discuss the second family of off-policy methods, the Emphatic methods. We conclude with a discussion of several methods that reduce variance of posterior corrections, using action-dependent bootstrapping. The algorithms, categorized according to weightings, are summarized in Table 1.

### 4.1 Gradient Temporal Difference Learning

Gradient-TD methods were the first to achieve stability with function approximation using gradient descent (Sutton et al., 2009). This breakthrough was achieved by creating an objective function, the MSPBE, and a strategy to sample the gradient of the MSPBE. The negative of the gradient of the MSPBE, with weighting $d_b$, can be written:

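$$\begin{aligned}
\nabla \text{MSPBE}(w) = {} & \mathbb{E}_{d_b}\big[\delta(S, A, S')\, z(S)\big] \\
& - \mathbb{E}_{d_b}\big[\gamma(S')\, x(S')\, x(S)^\top\big]\, \mathbb{E}_{d_b}\big[x(S)\, x(S)^\top\big]^{-1}\, \mathbb{E}_{d_b}\big[\delta(S, A, S')\, z(S)\big].
\end{aligned} \tag{10}$$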

Sampling this gradient is not straightforward due to the product of expectations. To resolve this issue, a second weight vector, $h$, can be used to estimate $\mathbb{E}_{d_b}[x(S)\, x(S)^\top]^{-1}\, \mathbb{E}_{d_b}[\delta(S, A, S')\, z(S)]$ and avoid the need for two independent samples. The resultant method, called GTD($\lambda$), can be thought of as approximate stochastic gradient descent on the MSPBE and is specified by the following update equations:

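$$\begin{aligned}
h_{t+1} &\leftarrow h_t + \alpha_h \big[\delta_t z_t^\rho - (h_t^\top x_t)\, x_t\big] \\
w_{t+1} &\leftarrow w_t + \alpha\, \delta_t z_t^\rho - \underbrace{\alpha\, \gamma_{t+1} (1 - \lambda) (h_t^\top z_t^\rho)\, x_{t+1}}_{\text{correction term}}
\end{aligned} \tag{11}$$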

The GTD($\lambda$) algorithm has several important details that merit further discussion. The most notable characteristic is the second weight vector $h$ that forms a quasi-stationary estimate of the last two terms in the gradient of the MSPBE. The corresponding two-timescale analysis highlights that the learning rate parameter $\alpha_h$ should be larger than $\alpha$, so that the primary weights $w$ change more slowly and $h$ can obtain such a quasi-stationary estimate (Sutton et al., 2009). In practice, the best values of $\alpha$ and $\alpha_h$ are problem dependent, and the practitioner must tune them independently to achieve good performance (White, 2015; White & White, 2016). Another important detail is that the first part of the update to $w$ corresponds to Off-policy TD($\lambda$). When $\lambda = 1$, the second term (the correction term) is removed, making GTD(1) = TD(1). Otherwise, for smaller $\lambda$, the correction term plays a bigger role.
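A minimal sketch of one GTD($\lambda$) step under linear function approximation may help make the two coupled updates concrete. The function name and argument names are hypothetical. The trace recursion used here, with the importance sampling ratio multiplying the whole trace, is an assumption based on the earlier description of Off-policy TD($\lambda$), and the secondary update is written with the current feature vector, matching the first half-step of Proximal GTD2 in Equation 15.

```python
import numpy as np

def gtd_lambda_step(w, h, z, x, x_next, reward, rho, gamma, gamma_next,
                    lam, alpha, alpha_h):
    """One GTD(lambda) update with linear features (a sketch of Equation 11)."""
    # Assumed trace form: z_t = rho_t * (gamma_t * lambda * z_{t-1} + x_t).
    z = rho * (gamma * lam * z + x)
    # TD-error computed with the current primary weights.
    delta = reward + gamma_next * (w @ x_next) - w @ x
    # Primary weights: Off-policy TD(lambda) update minus the correction term.
    w_new = w + alpha * delta * z - alpha * gamma_next * (1.0 - lam) * (h @ z) * x_next
    # Secondary weights: track the quasi-stationary estimate.
    h_new = h + alpha_h * (delta * z - (h @ x) * x)
    return w_new, h_new, z
```

With `lam=1.0` the correction term vanishes and the primary update coincides with Off-policy TD(1), regardless of the value of `h`.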

The GTD() algorithm has been shown to be stable with linear function approximation. The GTD() with , also known as TDC, has been shown to converge in expectation with i.i.d sampling of states (Sutton et al., 2009). The convergence of Gradient-TD methods with was later shown in the Markov noise case with constant stepsize and stepsizes that approach zero in the limit (Yu, 2018).

The GTD2($\lambda$) algorithm is related to GTD($\lambda$), and can be derived starting from the gradient of the excursion MSPBE in Equation 10. The gradient of the MSPBE given in Equation 10 is an algebraic rearrangement of:

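$$\nabla \text{MSPBE}(w) = \mathbb{E}_{d_b}\big[(x(S) - \gamma(S')\, x(S'))\, z(S)^\top\big]\, \mathbb{E}_{d_b}\big[x(S)\, x(S)^\top\big]^{-1}\, \mathbb{E}_{d_b}\big[\delta(S, A, S')\, z(S)\big].$$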

As before, the last two terms can again be replaced by a secondary weight vector $h$. The resultant expression

$$\mathbb{E}_{d_b}\big[(x(S) - \gamma(S')\, x(S'))\, z(S)^\top\big]\, h,$$

can be sampled, resulting in an algorithm that is similar to GTD($\lambda$), but differs in its update to the primary weights:

$$w_{t+1} \leftarrow w_t + \alpha (h_t^\top x_t)\, x_t - \alpha\, \gamma_{t+1} (1 - \lambda) (h_t^\top z_t^\rho)\, x_{t+1}. \tag{12}$$

This update does not make use of the TD-error $\delta_t$, except through the secondary weights $h$. The GTD2($\lambda$) algorithm performs stochastic gradient descent on the MSPBE, unlike GTD($\lambda$), which uses an approximate gradient, as we discuss further in Section 4.3 when describing the Saddlepoint methods.
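The difference from GTD($\lambda$) is confined to the primary weights, which the following sketch of Equation 12 isolates. The function name is hypothetical, and the trace `z` and secondary weights `h` are assumed to be maintained elsewhere, exactly as in GTD($\lambda$).

```python
import numpy as np

def gtd2_lambda_w_update(w, h, z, x, x_next, gamma_next, lam, alpha):
    """Primary-weight update of GTD2(lambda), a sketch of Equation 12.

    The TD-error does not appear: the update depends on the data only
    through the features, the trace z, and the secondary weights h.
    """
    return (w
            + alpha * (h @ x) * x
            - alpha * gamma_next * (1.0 - lam) * (h @ z) * x_next)
```

One consequence visible in the code is that the primary weights do not move at all while `h` is zero, so early learning depends on how quickly the secondary weights pick up signal.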

### 4.2 Hybrid TD methods

The Hybrid TD methods were created to achieve the data efficiency of TD($\lambda$) when data is sampled on-policy, and the stability of Gradient-TD methods when the data is sampled off-policy. Early empirical experience with TD(0) and GTD(0) in on-policy problems suggested that TD(0) might be more sample efficient (Sutton et al., 2009). Later studies highlighted the need for additional empirical comparisons to fully characterize the relative strengths of GTD($\lambda$) compared with TD($\lambda$) (Dann et al., 2014; White & White, 2016).

Hybrid TD methods were first proposed by Maei (2011) and Hackman (2012) and were further developed to make use of eligibility traces by White and White (2016). The derivation of the method starts with the gradient of the excursion MSPBE. Recall from Equation (9) that the MSPBE can be written . The matrix is simply the weighting in the squared error for . In fact, because we know every solution to the MSPBE satisfies , the choice of asymptotically is not relevant, as long as it is positive definite. The gradient of the MSPBE, can therefore be modified to , for any positive definite , and should still converge to the same solution(s).

In order to achieve a hybrid learning rule, this substitution must result in an update that reduces to the TD() update when . This can be achieved by setting , which is the matrix for the behavior. Because this is estimated with on-policy samples—since we are following —we know is positive semi-definite (Sutton, 1989), and positive definite under certain assumptions on the features. Further, when , we have that , giving update . The TD() update is a stochastic sample of expected update , and so when HTD() uses a stochastic sample of when , it is in fact using the same update as TD().

The HTD($\lambda$) algorithm is:

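$$\begin{aligned}
h_{t+1} &\leftarrow h_t + \alpha_h \big[\delta_t z_t^\rho - (x_t - \gamma_{t+1} x_{t+1}) (h_t^\top z_t)\big] \\
w_{t+1} &\leftarrow w_t + \alpha \big[\delta_t z_t^\rho - (x_t - \gamma_{t+1} x_{t+1}) (z_t^\rho - z_t)^\top h_t\big]
\end{aligned} \tag{13}$$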

HTD($\lambda$) has two eligibility trace vectors, with $z_t$ being a conventional accumulating eligibility trace for the behavior policy. If $\pi = b$, then all the $\rho_t$ are 1 and $z_t^\rho = z_t$, which causes the last term in the $w$ update to be zero, and the overall update reduces to the TD($\lambda$) algorithm. The last term in the $w$ update applies a correction to the usual Off-policy TD($\lambda$).
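The reduction to TD($\lambda$) in the on-policy case is easy to see in code. The sketch below follows Equation 13 with two traces; the function and argument names are hypothetical, and the recursion for the off-policy trace (ratio multiplying the whole trace) is an assumption carried over from the earlier description of Off-policy TD($\lambda$).

```python
import numpy as np

def htd_lambda_step(w, h, z_rho, z_b, x, x_next, reward, rho, gamma,
                    gamma_next, lam, alpha, alpha_h):
    """One HTD(lambda) update with linear features (a sketch of Equation 13)."""
    # Assumed off-policy trace: z^rho_t = rho_t * (gamma_t * lambda * z^rho_{t-1} + x_t).
    z_rho = rho * (gamma * lam * z_rho + x)
    # Conventional accumulating trace for the behavior policy (no ratios).
    z_b = gamma * lam * z_b + x
    delta = reward + gamma_next * (w @ x_next) - w @ x
    diff = x - gamma_next * x_next
    w_new = w + alpha * (delta * z_rho - diff * ((z_rho - z_b) @ h))
    h_new = h + alpha_h * (delta * z_rho - diff * (h @ z_b))
    return w_new, h_new, z_rho, z_b
```

When `rho` is 1 on every step and both traces start equal, `z_rho` and `z_b` stay identical, the correction term is exactly zero, and the primary update is the ordinary TD($\lambda$) update.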

Like GTD(), the HTD() algorithm is a posterior correction method that should converge to the minimum of the excursion variant of the MSPBE. No formal stochastic approximation results have been published, though the expected update is clearly convergent because

is positive semi-definite. This omission is likely due to the mixed empirical results achieved with Hybrid TD methods on Markov chains and random MDPs (Hackman, 2012; White & White, 2016).

### 4.3 Saddlepoint methods

Optimization of the MSPBE can be reformulated as a saddle point problem, yielding another family of stable Off-policy TD methods based on gradient descent. These include the original Proximal-GTD methods (Liu et al., 2015; Liu et al., 2016), stochastic variance reduction methods for policy evaluation (Du et al., 2017), and gradient formulations of Retrace and Tree Backup (Touati et al., 2018). The MSPBE can be rewritten using convex conjugates:

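$$\tfrac{1}{2}\, \text{MSPBE}(w) = \max_h\; (b - Aw)^\top h - \tfrac{1}{2} \|h\|_C^2 \tag{14}$$

where the weighted norm is $\|h\|_C^2 = h^\top C h$. The maximum is attained at $h = C^{-1}(b - Aw)$, which recovers the quadratic form in Equation 9 up to the factor of $\tfrac{1}{2}$.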

The utility of this saddlepoint formulation is that it removes the product of expectations, with the explicit addition of an auxiliary variable. This avoids the double sampling problem, since for a given , it is straightforward to sample (see Equation (8)) with sample . It is similarly straightforward to sample the gradient of this objective for a given . Now this instead requires that this auxiliary variable be learned. The resulting algorithm is identical to GTD2(0) when using stochastic gradient descent for this saddle point problem. This result is somewhat surprising, because GTD2(0) was derived from the gradient of the MSPBE using a quasi-stationary estimate of a proportion of the gradient.

The saddle point formulation, because it is a clear convex-concave optimization problem, allows for many algorithmic variants. For example, a stochastic gradient descent algorithm for this convex-concave problem can incorporate accelerations, such as mirror-prox, as used by Liu et al. (2015), or variance reduction approaches, as used by Du et al. (2017). This contrasts with the original derivation for GTD2($\lambda$), which used a quasi-stationary estimate and was not obviously a standard gradient descent technique. One such accelerated algorithm is Proximal GTD2($\lambda$), described by the following update equations:

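$$\begin{aligned}
h_{t+\frac{1}{2}} &\leftarrow h_t + \alpha_h \big[\delta_t z_t^\rho - (h_t^\top x_t)\, x_t\big] && (15) \\
w_{t+\frac{1}{2}} &\leftarrow w_t + \alpha (h_t^\top x_t)\, x_t - \alpha\, \gamma_{t+1} (1 - \lambda_{t+1}) (h_t^\top z_t^\rho)\, x_{t+1} && (16) \\
\delta_{t+\frac{1}{2}} &\doteq R_{t+1} + \gamma_{t+1}\, w_{t+\frac{1}{2}}^\top x_{t+1} - w_{t+\frac{1}{2}}^\top x_t && (17) \\
h_{t+1} &\leftarrow h_t + \alpha_h \big[\delta_{t+\frac{1}{2}} z_t^\rho - (h_{t+\frac{1}{2}}^\top x_t)\, x_t\big] && (18) \\
w_{t+1} &\leftarrow w_t + \alpha (h_{t+\frac{1}{2}}^\top x_t)\, x_t - \alpha\, \gamma_{t+1} (1 - \lambda_{t+1}) (h_{t+\frac{1}{2}}^\top z_t^\rho)\, x_{t+1} && (19)
\end{aligned}$$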

The double update to $w$ and $h$, denoted by subscripts $t+\tfrac{1}{2}$ and $t+1$, is produced by applying the Stochastic Mirror-Prox acceleration (Juditsky et al., 2011) to the gradient descent update derived from Equation 14. We will refer to this algorithm by the shorthand name PGTD2 in the figures.
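The structure of Equations 15 to 19 is a look-ahead step followed by a second step that restarts from the original weights but evaluates the update at the look-ahead point. A compact sketch, assuming linear features and a trace `z` that has already been updated for the current step (the function and argument names are hypothetical):

```python
import numpy as np

def pgtd2_step(w, h, z, x, x_next, reward, gamma_next, lam_next, alpha, alpha_h):
    """One Proximal GTD2(lambda) update (a sketch of Equations 15 to 19)."""

    def update_from(w_eval, h_eval):
        # Evaluate the GTD2-style update at (w_eval, h_eval) but apply it to
        # the weights (w, h) from the start of the step, as mirror-prox does.
        delta = reward + gamma_next * (w_eval @ x_next) - w_eval @ x
        h_out = h + alpha_h * (delta * z - (h_eval @ x) * x)
        w_out = (w + alpha * (h_eval @ x) * x
                 - alpha * gamma_next * (1.0 - lam_next) * (h_eval @ z) * x_next)
        return w_out, h_out

    w_half, h_half = update_from(w, h)      # Equations 15 and 16
    return update_from(w_half, h_half)      # Equations 17, 18 and 19
```

Each step costs roughly twice as much as a GTD2($\lambda$) step, but remains linear in the number of features.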

The saddle point formulation cannot be applied to derive an accelerated version of GTD($\lambda$). Recall that GTD($\lambda$) was obtained by reordering expectations in the gradient of the MSPBE, and then using quasi-stationary estimates of different expected values. This alternative formulation cannot obviously be written as a saddle point problem, though it has nonetheless been shown to be convergent. Nevertheless, a heuristic approximation of accelerated Proximal GTD($\lambda$) has been proposed (Liu et al., 2015), and its update equations are similar to those of Proximal GTD2($\lambda$), differing only in the updates to the weight vector $w$:

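$$\begin{aligned}
w_{t+\frac{1}{2}} &\leftarrow w_t + \alpha\, \delta_t z_t^\rho - \alpha\, \gamma_{t+1} (1 - \lambda_{t+1}) (h_t^\top z_t^\rho)\, x_{t+1} && (20) \\
w_{t+1} &\leftarrow w_t + \alpha\, \delta_{t+\frac{1}{2}} z_t^\rho - \alpha\, \gamma_{t+1} (1 - \lambda_{t+1}) (h_{t+\frac{1}{2}}^\top z_t^\rho)\, x_{t+1} && (21)
\end{aligned}$$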

We will refer to this algorithm by the shorthand name PGTD in the figures.

Both Proximal GTD() and Proximal GTD2() minimize the excursion variant of MSPBE, as they assume . The idea of the saddlepoint formulation, however, is more general and alternatives weightings could be considered, such as (shown in Table 1). The expectations in the MSPBE would simply change, and prior corrections would need to be incorporated to get an unbiased sample of weighted by .

The practical utility of these methods for online estimation is still not well understood. Several of the accelerations mentioned above, such as the use of stochastic variance reduction strategies (Du et al., 2017), assume a batch learning setting. The online algorithms, as mentioned, all use variants of GTD2($\lambda$), which seems to perform more poorly than GTD($\lambda$) in practice (Touati et al., 2018). This saddle point formulation, however, does enable continued advances in online convex optimization to be ported to reinforcement learning. Additionally, this formulation allows analysis tools from optimization to be applied to the analysis of TD learning methods. For example, Touati et al. (2018) provided the first finite sample analysis for GTD2($\lambda$), which is not possible with the original GTD2($\lambda$) derivation based on the quasi-stationary secondary weights.

### 4.4 Off-policy learning with action-dependent bootstrapping

A common concern with using importance sampling ratios is the possibility for high variance, due to large ratios. (Footnote 2: We would like to note that, to the best of our knowledge, variance issues due to importance sampling ratios have not been concretely demonstrated in the literature. This concern, therefore, is based on intuition and should be considered a hypothesis rather than a known phenomenon.) Several methods have been introduced that control this variance, either by explicitly or implicitly avoiding the product of importance sampling ratios in the traces. The Tree Backup($\lambda$) algorithm, which we call TB($\lambda$), was the first off-policy method that did not explicitly use importance sampling ratios (Precup et al., 2000). This method decays traces more, incurring more bias; newer algorithms such as V-trace($\lambda$) and ABQ($\zeta$) attempt to reduce variance but without decaying traces as much, and improve performance in practice. In this section, we describe the state-value prediction variants of TB($\lambda$), V-trace($\lambda$), and ABQ($\zeta$) that we investigate in our empirical study.

These three methods can all be seen as Off-policy TD($\lambda$) with $\lambda$ generalized from a constant to a function of state and action. This unification was highlighted by Mahmood et al. (2017) when they introduced ABQ, and it makes explanation of the algorithms straightforward: each method simply uses a different action-dependent trace function $\lambda(s, a)$. All three methods were introduced for learning action-values; we present the natural state-value variants below.

We begin by providing the generic Off-policy TD algorithm with action-dependent traces. The key idea is to set $\lambda_t$ such that $\rho_{t-1} \lambda_t$ is well-behaved. The Off-policy TD($\lambda$) algorithm for this generalized trace function can be written as follows. (Footnote 3: This update explicitly uses $\rho_t$ in the update to $w$. This contrasts with the earlier Off-policy TD updates in Equation (3.1), which have $\rho_t$ in the trace. These two forms are actually equivalent, in that the update to $w$ is exactly the same. We show this equivalence in Appendix C. We use this other form here to more clearly highlight the relationship between $\rho$ and $\lambda$.)

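$$\begin{aligned}
w_{t+1} &= w_t + \alpha\, \rho_t\, \delta_t\, z_t \\
z_t &= \gamma_t\, \rho_{t-1}\, \lambda_t\, z_{t-1} + x_t.
\end{aligned} \tag{22}$$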

Now we can specify different algorithms using this generic variant of Off-policy TD($\lambda$), by specifying different implementations of the $\lambda$ function. Like Off-policy TD($\lambda$), these algorithms all perform only posterior corrections.
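A minimal sketch of the generic update in Equation 22 is given below, assuming linear features. The function name and argument names are hypothetical; `lam` stands for the value of the action-dependent trace function at the previous state-action pair, and `rho_prev` for the previous importance sampling ratio.

```python
import numpy as np

def off_policy_td_step(w, z, x, x_next, reward, rho, rho_prev, lam,
                       gamma, gamma_next, alpha):
    """One step of Off-policy TD with an action-dependent trace (Equation 22)."""
    # The previous ratio and the trace parameter appear only as a product,
    # which is what the action-dependent methods exploit.
    z = gamma * rho_prev * lam * z + x
    delta = reward + gamma_next * (w @ x_next) - w @ x
    # The current ratio multiplies the update itself rather than the trace.
    w = w + alpha * rho * delta * z
    return w, z
```

Each of the methods that follow can be obtained by choosing how `lam` is computed from the policies before calling this step.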

TB($\lambda$) is Off-policy TD($\lambda$) with $\lambda_t = b_{t-1} \lambda$, for some tuneable constant $\lambda$. Replacing $\lambda_t$ with $b_{t-1} \lambda$ in the eligibility trace update in Equation 22 simplifies as follows:

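$$\begin{aligned}
z_t &= \gamma_t\, \frac{\pi_{t-1}}{b_{t-1}}\, b_{t-1}\, \lambda\, z_{t-1} + x_t \\
&= \gamma_t\, \pi_{t-1}\, \lambda\, z_{t-1} + x_t,
\end{aligned} \tag{23}$$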

and gives the state-value variant of TB($\lambda$).

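A simplified variant of the V-trace($\lambda$) algorithm (Espeholt et al., 2018) can be derived with a similar substitution: $\lambda_t = \min\!\big(\tfrac{\bar{c}}{\pi_{t-1}}, \tfrac{1}{b_{t-1}}\big)\, \lambda\, b_{t-1}$, where $\bar{c}$ and $\lambda$ are both tuneable constants. The eligibility trace update becomes: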

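$$\begin{aligned}
z_t &= \gamma_t \min\!\Big(\frac{\bar{c}}{\pi_{t-1}}, \frac{1}{b_{t-1}}\Big)\, \lambda\, b_{t-1}\, \frac{\pi_{t-1}}{b_{t-1}}\, z_{t-1} + x_t \\
&= \gamma_t \min\!\Big(\frac{\bar{c}}{\pi_{t-1}}, \frac{1}{b_{t-1}}\Big)\, \lambda\, \pi_{t-1}\, z_{t-1} + x_t \\
&= \gamma_t \min\!\Big(\frac{\bar{c}\, \pi_{t-1}}{\pi_{t-1}}, \frac{\pi_{t-1}}{b_{t-1}}\Big)\, \lambda\, z_{t-1} + x_t \\
&= \gamma_t \min(\bar{c}, \rho_{t-1})\, \lambda\, z_{t-1} + x_t.
\end{aligned} \tag{24}$$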

The parameter $\bar{c}$ is used to cap importance sampling ratios in the trace. Note that it is not possible to recover the full V-trace($\lambda$) algorithm in this way. The more general V-trace($\lambda$) algorithm uses an additional parameter, $\bar{\rho}$, that caps the $\rho_t$ in the update to $w$: $\min(\bar{\rho}, \rho_t)$. When $\bar{\rho}$ is set to the largest possible importance sampling ratio, it does not affect $\rho_t$ in the update to $w$ and so we obtain the equivalence above. For smaller $\bar{\rho}$, however, V-trace($\lambda$) is no longer simply an instance of Off-policy TD($\lambda$). In the experiments that follow, we investigate this simplified variant of V-trace($\lambda$) that does not cap $\rho_t$ and sets $\bar{c} = 1$, as done in the original Retrace algorithm.
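The two substitutions can be written as small trace functions and checked against the simplified forms in Equations 23 and 24. This is a sketch with hypothetical function names; it assumes the probabilities of the action actually taken are strictly positive under both policies.

```python
import math

def tb_trace_parameter(b_prev, lam):
    """Tree Backup: the trace parameter is scaled by the behavior probability."""
    return b_prev * lam

def vtrace_trace_parameter(pi_prev, b_prev, lam, c_bar):
    """Simplified V-trace: chosen so that rho * lambda_t = min(c_bar, rho) * lam."""
    return min(c_bar / pi_prev, 1.0 / b_prev) * lam * b_prev

# Hypothetical action probabilities for the previous step.
pi_prev, b_prev, lam, c_bar = 0.9, 0.1, 0.9, 1.0
rho_prev = pi_prev / b_prev   # a large ratio of 9

tb_coefficient = rho_prev * tb_trace_parameter(b_prev, lam)
vtrace_coefficient = rho_prev * vtrace_trace_parameter(pi_prev, b_prev, lam, c_bar)

assert math.isclose(tb_coefficient, pi_prev * lam)
assert math.isclose(vtrace_coefficient, min(c_bar, rho_prev) * lam)
```

In this example the raw ratio is 9, yet the coefficient multiplying the previous trace never exceeds $\lambda$ for either method, which is the sense in which the product of ratios in the trace is avoided.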

ABTD($\zeta$) for $\zeta \in [0, 1]$ uses $\lambda_t = \nu_{t-1} b_{t-1}$, with the following eligibility trace update:

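$$\begin{aligned}
z_t &= \gamma_t\, \frac{\pi_{t-1}}{b_{t-1}}\, \nu_{t-1}\, b_{t-1}\, z_{t-1} + x_t \\
&= \gamma_t\, \nu_{t-1}\, \pi_{t-1}\, z_{t-1} + x_t,
\end{aligned} \tag{25}$$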

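with the following scalar parameters to define $\nu_t$:

$$\begin{aligned}
\nu_t &\doteq \nu(\psi(\zeta), s_t, a_t) \doteq \min\!\Big(\psi(\zeta), \frac{1}{\max(b(a_t|s_t), \pi(a_t|s_t))}\Big) \\
\psi(\zeta) &\doteq 2 \zeta \psi_0 + \max(0, 2\zeta - 1)(\psi_{\max} - 2\psi_0) \\
\psi_0 &\doteq \frac{1}{\max_{s,a} \max(b(a|s), \pi(a|s))} \\
\psi_{\max} &\doteq \frac{1}{\min_{s,a} \max(b(a|s), \pi(a|s))}.
\end{aligned}$$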
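The definitions above translate directly into a few lines of code. The sketch below assumes tabular policies stored as arrays indexed by state and action, with at least one of the two policies assigning positive probability to every state-action pair; the function name `abtd_nu` is hypothetical.

```python
import numpy as np

def abtd_nu(zeta, pi, b, s, a):
    """Compute nu(psi(zeta), s, a) from tabular policies pi[s, a] and b[s, a]."""
    larger = np.maximum(pi, b)            # max(b(a|s), pi(a|s)) for every pair
    psi_0 = 1.0 / larger.max()
    psi_max = 1.0 / larger.min()
    psi = 2.0 * zeta * psi_0 + max(0.0, 2.0 * zeta - 1.0) * (psi_max - 2.0 * psi_0)
    return min(psi, 1.0 / larger[s, a])

# Hypothetical two-state, two-action policies.
pi = np.array([[0.9, 0.1], [0.5, 0.5]])
b = np.array([[0.5, 0.5], [0.2, 0.8]])
```

With these definitions, `zeta = 0` gives a zero trace coefficient (no bootstrapping across steps), `zeta = 0.5` gives the constant $\psi_0$ for every pair, and `zeta = 1` gives $1 / \max(b(a|s), \pi(a|s))$ for the pair in question.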

The convergence properties of all three methods are similar to Off-policy TD($\lambda$). They are not guaranteed to converge under off-policy sampling with weighting $d_b$ and function approximation. With the addition of gradient corrections similar to GTD($\lambda$), these algorithms are convergent. For explicit theoretical results, see Mahmood et al. (2017) for ABQ with gradient correction and Touati et al. (2018) for convergent versions of Retrace and Tree Backup.

### 4.5 Emphatic-TD learning

Emphatic Temporal Difference learning, ETD(), provides an alternative strategy for obtaining stability under off-policy sampling without computing gradients of the MSPBE. The key idea is to incorporate some prior corrections so that the weighting results in a positive definite matrix . Given such an