# Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.


## 1 Introduction

In Reinforcement Learning (RL; Sutton and Barto 1998), policy evaluation refers to the problem of evaluating the value function, a mapping from states to their long-term discounted return under a given policy, using sampled observations of the system dynamics and reward. Policy evaluation is important both for assessing the quality of a policy and as a sub-procedure for policy optimization.

For systems with large or continuous state spaces, an exact computation of the value function is often impossible. Instead, an approximate value function is sought using various function-approximation techniques (a.k.a. approximate dynamic programming; Bertsekas 2012). In this approach, the parameters of the value-function approximation are tuned using machine-learning inspired methods, often based on temporal differences (TD; Sutton and Barto 1998).

The source generating the sampled data divides policy evaluation into two cases. In the on-policy case, the samples are generated by the target policy, i.e., the policy under evaluation; in the off-policy setting, a different behavior policy generates the data. In the on-policy setting, TD methods are well understood, with classic convergence guarantees and approximation-error bounds, based on a contraction property of the projected Bellman operator underlying TD (Bertsekas and Tsitsiklis, 1996). These bounds guarantee that the asymptotic error, or bias, of the algorithm is contained. For the off-policy case, however, standard TD methods no longer maintain this contraction property, the error bounds do not hold, and these methods might even diverge (Baird, 1995).

The standard error bounds may be shown to hold for an importance-sampling TD method (IS-TD), as proposed by Precup, Sutton, and Dasgupta (2001). However, this method is known to suffer from a high variance of its importance-sampling estimator, limiting its practicality.

Lately, Sutton, Mahmood, and White (2015) proposed the emphatic TD (ETD) algorithm: a modification of the TD idea, which converges off-policy (Yu, 2015), and has a reduced variance compared to IS-TD. This variance reduction is achieved by incorporating a certain decay factor over the importance-sampling ratio. However, to the best of our knowledge, there are no results that bound the bias of ETD. Thus, while ETD is assured to converge, it is not known how good its limit actually is.

In this paper, we propose the ETD(λ, β) framework, a modification of the ETD(λ) algorithm, where the decay rate of the importance-sampling ratio, β, is a free parameter, and λ is the same bootstrapping parameter employed in TD(λ) and ETD(λ). By varying the decay rate, one can smoothly transition between the IS-TD algorithm, through ETD, to the standard TD algorithm.

We investigate the bias of ETD(λ, β) by studying the conditions under which its underlying projected Bellman operator is a contraction. We show that the original ETD possesses a contraction property, and present the first error bounds for ETD and ETD(λ, β). In addition, our error bound reveals that the decay rate parameter balances between the bias and variance of the learning procedure. In particular, we show that selecting a decay equal to the discount factor, as in the original ETD, may be suboptimal in terms of the mean-squared error.

The main contributions of this work are therefore a unification of several off-policy TD algorithms under the ETD(λ, β) framework, and a new error analysis that reveals the bias-variance trade-off between them.

#### Related Work:

In recent years, several different off-policy policy-evaluation algorithms have been studied, such as importance-sampling based least-squares TD (Yu, 2012), and gradient-based TD (Sutton et al., 2009; Liu et al., 2015). These algorithms are guaranteed to converge, however, their asymptotic error can be bounded only when the target and behavior policies are similar (Bertsekas and Yu, 2009), or when their induced transition matrices satisfy a certain matrix-inequality suggested by Kolter (2011), which limits the discrepancy between the target and behavior policies. When these conditions are not satisfied, the error may be arbitrarily large (Kolter, 2011). In contrast, the approximation-error bounds in this paper hold for general target and behavior policies.

## 2 Preliminaries

We consider an MDP $M = (S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability matrix, $R$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.

Given a target policy $\pi$ mapping states to a distribution over actions, our goal is to evaluate the value function:

$$V^\pi(s) \doteq \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s\right].$$

Linear temporal difference methods (Sutton and Barto, 1998) approximate the value function by

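$$V^\pi(s) \approx \theta^\top \varphi(s),$$

where $\varphi(s)$ are state features and $\theta$ are weights, and use sampling to find a suitable $\theta$. Let $\mu$ denote a behavior policy that generates the samples according to $a_t \sim \mu(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. We denote by $\rho_t$ the ratio $\pi(a_t \mid s_t) / \mu(a_t \mid s_t)$, and we assume, similarly to Sutton, Mahmood, and White (2015), that $\mu$ and $\pi$ are such that $\rho_t$ is well-defined for all $t$ (namely, if $\mu(a \mid s) = 0$ then $\pi(a \mid s) = 0$ for all $s \in S$).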

Let $T$ denote the Bellman operator for policy $\pi$, given by

$$T(V) \doteq R + \gamma P V,$$

where $R$ and $P$ are the reward vector and transition matrix induced by policy $\pi$, and let $\Phi$ denote a matrix whose columns are the feature vectors for all states. Let $d_\mu$ and $d_\pi$ denote the stationary distributions over states induced by the policies $\mu$ and $\pi$, respectively. For some $d \in \mathbb{R}^{|S|}$ satisfying $d > 0$ element-wise, we denote by $\Pi_d$ a projection onto the subspace spanned by the features $\varphi(s)$ with respect to the $d$-weighted Euclidean norm.

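For $\lambda = 0$, the ETD(0, β) algorithm (Sutton, Mahmood, and White, 2015) seeks to find a good approximation of the value function by iteratively updating the weight vector $\theta$:

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha F_t \rho_t \left(R_{t+1} + \gamma \theta_t^\top \varphi_{t+1} - \theta_t^\top \varphi_t\right) \varphi_t, \\
F_t &= \beta \rho_{t-1} F_{t-1} + 1, \qquad F_0 = 1,
\end{aligned}
\tag{1}
$$

where $F_t$ is a decaying trace of the importance-sampling ratios, and $\beta \in (0, 1)$ controls the decay rate.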
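As a minimal sketch of update (1), the function below performs a single ETD(0, β) step for linear features. The function name, argument order, and the convention of passing `F_prev = 0.0` at the first step (so that $F_0 = 1$) are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def etd0_beta_step(theta, F_prev, rho_prev, rho, phi, phi_next, reward,
                   alpha, gamma, beta):
    """One ETD(0, beta) update following Eq. (1).

    F_prev and rho_prev stand for F_{t-1} and rho_{t-1}. At t = 0, pass
    F_prev = 0.0 so that F_0 = 1 regardless of rho_prev (a convention
    chosen for this sketch).
    """
    # Trace of importance-sampling ratios: F_t = beta * rho_{t-1} * F_{t-1} + 1
    F = beta * rho_prev * F_prev + 1.0
    # TD error: R_{t+1} + gamma * theta^T phi_{t+1} - theta^T phi_t
    td_error = reward + gamma * theta @ phi_next - theta @ phi
    # Weighted semi-gradient step
    theta_new = theta + alpha * F * rho * td_error * phi
    return theta_new, F
```

Setting `beta = gamma` in this sketch corresponds to the decay rate of the original ETD, while other values of `beta` only change how quickly the trace `F` forgets past importance-sampling ratios.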

###### Remark 1.

The algorithm of Sutton, Mahmood, and White (2015) selects the decay rate equal to the discount factor, i.e., $\beta = \gamma$. Here, we provide more freedom in choosing the decay rate. As our analysis reveals, the decay rate controls a bias-variance trade-off of ETD, therefore this freedom is important. Moreover, we note that for $\beta = 0$, we obtain the standard TD in an off-policy setting (Yu, 2012), and when $\beta = 1$ we obtain the full importance-sampling TD algorithm (Precup, Sutton, and Dasgupta, 2001).

###### Remark 2.

The ETD algorithm of Sutton, Mahmood, and White (2015) also includes a state-dependent emphasis weight $i(s)$, and a state-dependent discount factor $\gamma(s)$. Here, we analyze the case of a uniform weight $i(s) = 1$ and constant discount factor $\gamma$ for all states. While our analysis can be extended to their more general setting, the insights from the analysis remain the same, and for the purpose of clarity we chose to focus on this simpler setting.

An important term in our analysis is the emphatic weight vector $f$, defined by

$$f^\top = d_\mu^\top (I - \beta P)^{-1}. \tag{2}$$

It can be shown (Sutton, Mahmood, and White, 2015; Yu, 2015) that ETD(0, β) converges to $\theta^*$, a solution of the following projected fixed-point equation:

$$V = \Pi_f T V, \qquad V \in \mathbb{R}^{|S|}. \tag{3}$$

For the fixed-point equation (3), a contraction property of $\Pi_f T$ is important for guaranteeing both a unique solution and a bias bound (Bertsekas and Tsitsiklis, 1996).

It is well known that $T$ is a $\gamma$-contraction with respect to the $d_\pi$-weighted Euclidean norm (Bertsekas and Tsitsiklis, 1996), and by definition $\Pi_f$ is a non-expansion in the $f$-norm. However, it is not immediate that the composed operator $\Pi_f T$ is a contraction in any norm. Indeed, for the TD(0) algorithm (Sutton and Barto 1998; corresponding to the $\beta = 0$ case in our setting), a similar representation as a projected Bellman operator holds, but it may be shown that in the off-policy setting the algorithm might diverge (Baird, 1995). In the next section, we study the contraction properties of $\Pi_f T$, and provide corresponding bias bounds.

## 3 Bias of ETD(0, β)

In this section we study the bias of the ETD(0, β) algorithm. Let us first introduce the following measure of discrepancy between the target and behavior policies:

$$\kappa \doteq \min_s \frac{d_\mu(s)}{f(s)}.$$

###### Lemma 1.

The measure $\kappa$ obtains values ranging from $\kappa = 0$ (when there is a state visited by the target policy, but not the behavior policy), to $\kappa = 1 - \beta$ (when the two policies are identical).
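The emphatic weights of Eq. (2) and the discrepancy measure $\kappa$ can be computed directly for a small chain. The sketch below is illustrative only: the helper names and the 3-state transition matrix are assumptions, and the on-policy case is used because Lemma 1 predicts $\kappa = 1 - \beta$ there.

```python
import numpy as np

def emphatic_weights(d_mu, P_pi, beta):
    """Solve f^T = d_mu^T (I - beta P_pi)^{-1}, i.e. (I - beta P_pi^T) f = d_mu."""
    n = len(d_mu)
    return np.linalg.solve(np.eye(n) - beta * P_pi.T, d_mu)

def kappa(d_mu, f):
    """Discrepancy measure kappa = min_s d_mu(s) / f(s)."""
    return float(np.min(d_mu / f))

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    # Solve d^T (P - I) = 0 together with sum(d) = 1 in the least-squares sense.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Hypothetical 3-state chain, used on-policy (behavior = target).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
beta = 0.8
d = stationary_distribution(P)
f = emphatic_weights(d, P, beta)
print(kappa(d, f), 1 - beta)
```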

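The technical proof is given in the supplementary material. The following theorem shows that for ETD(0, β) with a suitable $\beta$, the projected Bellman operator is indeed a contraction.

###### Theorem 1.

For $\beta > \gamma^2 (1 - \kappa)$, the projected Bellman operator $\Pi_f T$ is a $\sqrt{\frac{\gamma^2}{\beta}(1-\kappa)}$-contraction with respect to the Euclidean $f$-weighted norm, namely, $\forall v_1, v_2 \in \mathbb{R}^{|S|}$:

$$\left\| \Pi_f T v_1 - \Pi_f T v_2 \right\|_f \le \sqrt{\frac{\gamma^2}{\beta}(1-\kappa)} \, \left\| v_1 - v_2 \right\|_f.$$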
###### Proof.

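Let $F = \mathrm{diag}(f)$. We have

$$
\begin{aligned}
\|v\|_f^2 - \beta \|Pv\|_f^2 &= v^\top F v - \beta v^\top P^\top F P v \\
&\overset{(a)}{\ge} v^\top F v - \beta v^\top \mathrm{diag}(f^\top P) v \\
&= v^\top \left[ F - \beta\, \mathrm{diag}(f^\top P) \right] v \\
&= v^\top \left[ \mathrm{diag}\!\left(f^\top (I - \beta P)\right) \right] v \\
&\overset{(b)}{=} v^\top \mathrm{diag}(d_\mu) v = \|v\|_{d_\mu}^2,
\end{aligned}
$$

where (a) follows from Jensen's inequality:

$$
\begin{aligned}
v^\top P^\top F P v &= \sum_s f(s) \Big( \sum_{s'} P(s' \mid s) v(s') \Big)^2 \le \sum_s f(s) \sum_{s'} P(s' \mid s) v^2(s') \\
&= \sum_{s'} v^2(s') \sum_s f(s) P(s' \mid s) = v^\top \mathrm{diag}(f^\top P) v,
\end{aligned}
$$

and (b) is by the definition of $f$ in (2).

Notice that for every $v$:

$$\|v\|_{d_\mu}^2 = \sum_s d_\mu(s) v^2(s) \ge \sum_s \kappa f(s) v^2(s) = \kappa \|v\|_f^2.$$

Therefore:

$$\|v\|_f^2 \ge \beta \|Pv\|_f^2 + \|v\|_{d_\mu}^2 \ge \beta \|Pv\|_f^2 + \kappa \|v\|_f^2 \;\Rightarrow\; \beta \|Pv\|_f^2 \le (1-\kappa) \|v\|_f^2,$$

and:

$$\|Tv_1 - Tv_2\|_f^2 = \|\gamma P (v_1 - v_2)\|_f^2 = \gamma^2 \|P(v_1 - v_2)\|_f^2 \le \frac{\gamma^2}{\beta}(1-\kappa) \|v_1 - v_2\|_f^2.$$

Hence, $T$ is a $\sqrt{\frac{\gamma^2}{\beta}(1-\kappa)}$-contraction. Since $\Pi_f$ is a non-expansion in the $f$-weighted norm (Bertsekas and Tsitsiklis, 1996), $\Pi_f T$ is a $\sqrt{\frac{\gamma^2}{\beta}(1-\kappa)}$-contraction as well. ∎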
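The key inequality of the proof, $\gamma^2 \|Pv\|_f^2 \le \frac{\gamma^2}{\beta}(1-\kappa)\|v\|_f^2$, can be checked numerically. The following sketch uses a randomly generated transition matrix and an arbitrary positive state distribution in place of $d_\mu$; both are illustrative assumptions, and the check is a sanity test rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, gamma = 5, 0.9, 0.9

# Hypothetical target-policy transition matrix and behavior state distribution.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
d_mu = rng.random(n)
d_mu /= d_mu.sum()

f = np.linalg.solve(np.eye(n) - beta * P.T, d_mu)   # Eq. (2)
kappa = np.min(d_mu / f)
modulus_sq = gamma**2 * (1 - kappa) / beta           # squared modulus of Theorem 1

def sq_norm(v, w):
    """Squared w-weighted Euclidean norm of v."""
    return float(np.sum(w * v * v))

# Largest observed ratio ||gamma P v||_f^2 / ||v||_f^2 over random directions.
worst_ratio = 0.0
for _ in range(1000):
    v = rng.standard_normal(n)
    ratio = sq_norm(gamma * (P @ v), f) / sq_norm(v, f)
    worst_ratio = max(worst_ratio, ratio)

print(worst_ratio <= modulus_sq)
```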

Recall that for the original ETD algorithm (Sutton, Mahmood, and White, 2015), we have that $\beta = \gamma$, and the contraction modulus is $\sqrt{\gamma(1-\kappa)} < 1$, thus the contraction of $\Pi_f T$ always holds.

Also note that in the on-policy case, the behavior and target policies are equal, and according to Lemma 1 we have $\kappa = 1 - \beta$. In this case, the contraction modulus in Theorem 1 is $\gamma$, similar to the result for on-policy TD (Bertsekas and Tsitsiklis, 1996).
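The on-policy claim can be worked out in a few lines. The derivation below only uses the stationarity of $d_\pi$ under $P$ (i.e., $d_\pi^\top P = d_\pi^\top$) and the definitions of $f$ and $\kappa$ given above:

```latex
\begin{align*}
f^\top &= d_\pi^\top (I - \beta P)^{-1}
        = d_\pi^\top \sum_{k=0}^{\infty} \beta^k P^k
        = \sum_{k=0}^{\infty} \beta^k d_\pi^\top
        = \frac{1}{1-\beta}\, d_\pi^\top, \\
\kappa &= \min_s \frac{d_\pi(s)}{f(s)} = 1 - \beta, \\
\sqrt{\frac{\gamma^2}{\beta}\,(1-\kappa)} &= \sqrt{\frac{\gamma^2}{\beta}\,\beta} = \gamma .
\end{align*}
```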

We remark that Kolter (2011) also used a measure of discrepancy between the behavior and the target policy to bound the TD error. However, Kolter (2011) considered the standard TD algorithm, for which a contraction could be guaranteed only for a class of behavior policies that satisfy a certain matrix inequality criterion. Our results show that for ETD(0, β) with a suitable $\beta$, a contraction is guaranteed for general behavior policies. We now show in an example that our contraction modulus bounds are tight.

###### Example 1.

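Consider an MDP with two states: Left and Right. In each state there are two identical actions leading to either Left or Right deterministically. The behavior policy will choose Right with probability $\varepsilon$, and the target policy will choose Left with probability $\varepsilon$, hence $\kappa \to 0$ as $\varepsilon \to 0$. Calculating the quantities of interest:

$$
P = \begin{pmatrix} \varepsilon & 1-\varepsilon \\ \varepsilon & 1-\varepsilon \end{pmatrix}, \qquad
d_\mu = (1-\varepsilon, \; \varepsilon)^\top, \qquad
f = \frac{1}{1-\beta} \left( 1 + 2\varepsilon\beta - \varepsilon - \beta, \;\; -2\varepsilon\beta + \varepsilon + \beta \right)^\top.
$$

So for $v = (0, 1)^\top$:

$$\|v\|_f^2 = \frac{\varepsilon + \beta - 2\varepsilon\beta}{1-\beta}, \qquad \|Pv\|_f^2 = \frac{(1-\varepsilon)^2}{1-\beta},$$

and for small $\varepsilon$ we obtain that $\frac{\|\gamma P v\|_f^2}{\|v\|_f^2} \approx \frac{\gamma^2}{\beta}$.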
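The quantities in Example 1 are easy to reproduce numerically. The sketch below recomputes $f$ from Eq. (2), compares it with the closed form above, and evaluates the ratio $\|\gamma P v\|_f^2 / \|v\|_f^2$ for a small $\varepsilon$; the specific values of `eps`, `beta`, and `gamma` are arbitrary choices for illustration.

```python
import numpy as np

def example1_ratio(eps, beta, gamma):
    """Return (f, ratio) for Example 1 with v = (0, 1)^T."""
    P = np.array([[eps, 1 - eps],
                  [eps, 1 - eps]])          # target policy: Left w.p. eps
    d_mu = np.array([1 - eps, eps])          # behavior policy: Right w.p. eps
    f = np.linalg.solve(np.eye(2) - beta * P.T, d_mu)
    v = np.array([0.0, 1.0])
    num = np.sum(f * (gamma * (P @ v)) ** 2)  # ||gamma P v||_f^2
    den = np.sum(f * v ** 2)                  # ||v||_f^2
    return f, num / den

eps, beta, gamma = 1e-4, 0.9, 0.9
f, ratio = example1_ratio(eps, beta, gamma)
f_closed = np.array([1 + 2 * eps * beta - eps - beta,
                     eps + beta - 2 * eps * beta]) / (1 - beta)
print(np.allclose(f, f_closed), ratio, gamma**2 / beta)
```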

An immediate consequence of Theorem 1 is the following error bound, based on Lemma 6.9 of Bertsekas and Tsitsiklis (1996):

###### Corollary 1.

We have

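$$
\begin{aligned}
\left\| \Phi^\top \theta^* - V^\pi \right\|_f &\le \frac{1}{\sqrt{1 - \frac{\gamma^2}{\beta}(1-\kappa)}} \left\| \Pi_f V^\pi - V^\pi \right\|_f, \\
\left\| \Phi^\top \theta^* - V^\pi \right\|_{d_\mu} &\le \frac{1}{\sqrt{\gamma \left(1 - \frac{\gamma^2}{\beta}(1-\kappa)\right)}} \left\| \Pi_f V^\pi - V^\pi \right\|_f.
\end{aligned}
$$

Up to the weights in the norm, the error $\left\| \Pi_f V^\pi - V^\pi \right\|_f$ is the best approximation we can hope for, within the capability of the linear approximation architecture. Corollary 1 guarantees that we are not too far away from it.

Notice that the error $\left\| \Phi^\top \theta^* - V^\pi \right\|_{d_\mu}$ uses a measure $d_\mu$ which is independent of the target policy; this could be useful in further analysis of a policy iteration algorithm, which iteratively improves the target policy using samples from a single behavior policy. Such an analysis may proceed similarly to that in Munos (2003) for the on-policy case.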
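To make the first bound of Corollary 1 concrete, the sketch below solves the projected fixed-point equation (3) in closed form on a small random problem and compares the resulting error with the bound. The random MDP, the feature matrix, and the linear-system form $\Phi F (I - \gamma P) \Phi^\top \theta = \Phi F R$ (obtained by writing out the $f$-weighted projection) are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_feat, gamma, beta = 6, 2, 0.9, 0.9

# Hypothetical problem instance.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.standard_normal(n_states)
d_mu = rng.random(n_states)
d_mu /= d_mu.sum()
Phi = rng.standard_normal((n_feat, n_states))   # columns are feature vectors

I = np.eye(n_states)
f = np.linalg.solve(I - beta * P.T, d_mu)       # Eq. (2)
F = np.diag(f)
kappa = np.min(d_mu / f)

V_pi = np.linalg.solve(I - gamma * P, R)        # true value function
theta_star = np.linalg.solve(Phi @ F @ (I - gamma * P) @ Phi.T, Phi @ F @ R)
Pi_f = Phi.T @ np.linalg.solve(Phi @ F @ Phi.T, Phi @ F)   # f-weighted projection

def f_norm(v):
    """f-weighted Euclidean norm."""
    return float(np.sqrt(np.sum(f * v * v)))

lhs = f_norm(Phi.T @ theta_star - V_pi)
rhs = f_norm(Pi_f @ V_pi - V_pi) / np.sqrt(1 - gamma**2 * (1 - kappa) / beta)
print(lhs, rhs)
```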

### 3.1 Numerical Illustration

We illustrate the importance of the ETD(0, β) bias bound in a numerical example. Consider the 2-state MDP example of Kolter (2011), with transition matrix (where is an all matrix), discount factor , and value function (with ). The features are , with . Clearly, in this example we have . The behavior policy is chosen such that .

In Figure 1 we plot the mean-squared error , where is either the fixed point of the standard TD equation , or the ETD(0, β) fixed point of (3), with . We also show the optimal error achievable with these features. Note that, as observed by Kolter (2011), for certain behavior policies the bias of standard TD is infinite. This means that algorithms that converge to this fixed point, such as the GTD algorithm (Sutton et al., 2009), are hopeless in such cases. The ETD algorithm, on the other hand, has a bounded bias for all behavior policies.

## 4 The Bias-Variance Trade-Off of ETD(0, β)

From the results in Corollary 1, it is clear that increasing the decay rate $\beta$ decreases the bias bound. Indeed, for the case $\beta = 1$ we obtain the importance-sampling TD algorithm (Precup, Sutton, and Dasgupta, 2001), which is known to have a bias bound similar to on-policy TD. However, as recognized by Precup, Sutton, and Dasgupta (2001) and Sutton, Mahmood, and White (2015), the importance-sampling trace $F_t$ suffers from a high variance, which increases with $\beta$. The quantity $F_t$ is important as it appears as a multiplicative factor in the definition of the ETD learning rule, so its amplitude directly impacts the stability of the algorithm. In fact, the asymptotic variance of $F_t$ may be infinite, as we show in the following example:

###### Example 2.

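Consider the same MDP given in Example 1, only now the behavior policy chooses Left or Right with probability $0.5$, and the target policy always chooses Right. For ETD(0, β), we have that when $S_t = \text{Left}$ then $F_t = 1$ (since $\rho_{t-1} = 0$). When $S_t = \text{Right}$, $F_t$ may take several values depending on how many steps ago, $\tau(t)$, the last transition from Left to Right occurred. We can write this value as $F_{\tau(t)}$ where:

$$F_\tau \doteq \sum_{i=0}^{\tau} (2\beta)^i = \frac{(2\beta)^{\tau+1} - 1}{2\beta - 1},$$

if $\beta \ne \frac{1}{2}$. Let us assume that $\beta > \frac{1}{2}$, since interesting cases happen when $\beta$ is close to 1.

Let us compute the average of $F_t$ over time. Following the stationary distribution of the behavior policy, $S_t = \text{Left}$ with probability $\frac{1}{2}$. Now, conditioned on $S_t = \text{Right}$ (which happens with probability $\frac{1}{2}$), we have $\tau(t) = i$ with probability $2^{-i-1}$. Thus the average (over time) value of $F_t$ is

$$\mathbb{E} F_t = \frac{1}{2} \sum_{i=0}^{\infty} 2^{-i-1} F_i = \sum_i \frac{\beta^{i+1} - 2^{-i-1}}{2(2\beta - 1)} = \frac{1}{2(1-\beta)}.$$

Thus $F_t$ amplifies the TD update by a factor of $\frac{1}{2(1-\beta)}$ on average. Unfortunately, the actual value of the random variable $F_t$ does not concentrate around its expectation, and does not even have a finite variance. Indeed, the average (over time) of $F_t^2$ is

$$\mathbb{E} F_t^2 = \frac{1}{4} \sum_{i=0}^{\infty} 2^{-i} (F_i)^2 = \sum_i \frac{2^{-i} \left( (2\beta)^{i+1} - 1 \right)^2}{4 (2\beta - 1)^2} = \infty,$$

as soon as $\beta \ge \frac{1}{\sqrt{2}}$.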
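The divergence of the second-moment series can be seen by evaluating its partial sums. The sketch below sums the series $\sum_i 2^{-i} \left((2\beta)^{i+1} - 1\right)^2 / \left(4 (2\beta - 1)^2\right)$ from Example 2 for one value of $\beta$ below $1/\sqrt{2}$ and one above it; the function name and the chosen truncation lengths are illustrative.

```python
import math

def second_moment_partial_sum(beta, n_terms):
    """Partial sum of sum_i 2^{-i} ((2 beta)^{i+1} - 1)^2 / (4 (2 beta - 1)^2)."""
    assert beta != 0.5, "the closed form of F_i requires beta != 1/2"
    total = 0.0
    for i in range(n_terms):
        F_i = ((2 * beta) ** (i + 1) - 1) / (2 * beta - 1)
        total += 2.0 ** (-i) * F_i ** 2 / 4.0
    return total

for beta in (0.6, 0.8):
    print(beta, [second_moment_partial_sum(beta, n) for n in (50, 100, 200)])
print("threshold:", 1 / math.sqrt(2))
```

The general term behaves like $(2\beta^2)^i$, so the partial sums stabilize when $2\beta^2 < 1$ and keep growing geometrically otherwise.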

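So although ETD(0, β) converges almost surely (as shown by Yu 2015), the variance of the estimate may be infinite, which suggests a prohibitively slow convergence rate.

In the following proposition we characterize the dependence of the variance of $F_t$ on $\beta$.

###### Proposition 1.

Define the mismatch matrix $\tilde{P}_{\mu,\pi}$ such that $[\tilde{P}_{\mu,\pi}]_{\bar{s} s} = \sum_a \frac{\pi^2(a \mid \bar{s})}{\mu(a \mid \bar{s})} P(s \mid \bar{s}, a)$, and write $\alpha(\mu, \pi)$ for the largest magnitude of its eigenvalues. Then for any $\beta < 1/\sqrt{\alpha(\mu, \pi)}$ the average variance of $F_t$ (conditioned on any state) is finite, and

$$\mathbb{E}_\mu\!\left[ \mathrm{Var}[F_t \mid S_t = s] \right] \le \frac{\beta^2}{1-\beta} \left( 2 + \frac{(1+\beta) \left\| \tilde{P}_{\mu,\pi} \right\|_\infty}{1 - \beta^2 \left\| \tilde{P}_{\mu,\pi} \right\|_\infty} \right),$$

where $\|\cdot\|_\infty$ is the $\ell_\infty$-induced norm, which is the maximum absolute row sum of the matrix.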

###### Proof.

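(Partial) Following the same derivation that Sutton, Mahmood, and White (2015) used to prove that $f(s) = d_\mu(s) \lim_{t\to\infty} \mathbb{E}[F_t \mid S_t = s]$, we have

$$
\begin{aligned}
q(s) &\doteq d_\mu(s) \lim_{t\to\infty} \mathbb{E}[F_t^2 \mid S_t = s] \\
&= d_\mu(s) \lim_{t\to\infty} \mathbb{E}\!\left[ (1 + \rho_{t-1} \beta F_{t-1})^2 \mid S_t = s \right] \\
&= d_\mu(s) \lim_{t\to\infty} \mathbb{E}\!\left[ 1 + 2 \rho_{t-1} \beta F_{t-1} + \rho_{t-1}^2 \beta^2 F_{t-1}^2 \mid S_t = s \right].
\end{aligned}
$$

For the first summand, we get $d_\mu(s)$. For the second summand, we get:

$$2\beta\, d_\mu(s) \lim_{t\to\infty} \mathbb{E}[\rho_{t-1} F_{t-1} \mid S_t = s] = 2\beta \sum_{\bar{s}} [P_\pi]_{\bar{s} s} f(\bar{s}).$$

The third summand equals

$$\beta^2 d_\mu(s) \lim_{t\to\infty} \mathbb{E}[\rho_{t-1}^2 F_{t-1}^2 \mid S_t = s] = \beta^2 \sum_{\bar{s}} [\tilde{P}_{\mu,\pi}]_{\bar{s} s}\, q(\bar{s}).$$

Hence $q = d_\mu + 2\beta P_\pi^\top f + \beta^2 \tilde{P}_{\mu,\pi}^\top q$. Thus for any $\beta < 1/\sqrt{\alpha(\mu, \pi)}$, all eigenvalues of the matrix $\beta^2 \tilde{P}_{\mu,\pi}^\top$ have magnitude smaller than $1$, and the vector $q = (I - \beta^2 \tilde{P}_{\mu,\pi}^\top)^{-1} (d_\mu + 2\beta P_\pi^\top f)$ has finite components. The rest of the proof is very technical and is given in Lemma 2 in the supplementary material.

Proposition 1 and Corollary 1 show that the decay rate $\beta$ acts as an implicit trade-off parameter between the bias and variance in ETD. For large $\beta$, we have a low bias but suffer from a high variance (possibly infinite if $\beta \ge 1/\sqrt{\alpha(\mu, \pi)}$), and vice versa for small $\beta$. Notice that in the on-policy case $\alpha(\mu, \pi) = 1$, thus for any $\beta < 1$ the variance is finite.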
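The finite-variance threshold can be illustrated on the MDP of Example 2. The sketch below assumes the mismatch matrix has entries $\sum_a \frac{\pi^2(a \mid \bar{s})}{\mu(a \mid \bar{s})} P(s \mid \bar{s}, a)$ (the form that arises from the third summand in the proof); the function name and array layout are hypothetical. Under that assumption, the threshold $1/\sqrt{\alpha(\mu,\pi)}$ should coincide with the value of $\beta$ at which the second moment diverges in Example 2.

```python
import numpy as np

def mismatch_matrix(pi, mu, P_sa):
    """Entries [s, s'] = sum_a pi(a|s)^2 / mu(a|s) * P(s'|s, a).

    pi, mu: arrays of shape (S, A); P_sa: array of shape (S, A, S).
    Assumes mu(a|s) > 0 wherever pi(a|s) > 0.
    """
    weights = np.divide(pi ** 2, mu, out=np.zeros_like(pi), where=mu > 0)
    return np.einsum("sa,sat->st", weights, P_sa)

# Example 2: states (Left, Right), actions (Left, Right), deterministic moves.
P_sa = np.zeros((2, 2, 2))
P_sa[:, 0, 0] = 1.0   # action Left always leads to state Left
P_sa[:, 1, 1] = 1.0   # action Right always leads to state Right
mu = np.full((2, 2), 0.5)                  # behavior: uniform over actions
pi = np.array([[0.0, 1.0], [0.0, 1.0]])    # target: always Right

P_tilde = mismatch_matrix(pi, mu, P_sa)
spectral_radius = np.max(np.abs(np.linalg.eigvals(P_tilde)))
beta_threshold = 1 / np.sqrt(spectral_radius)
print(P_tilde, spectral_radius, beta_threshold)
```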

Originally, ETD(0, β) was introduced with $\beta = \gamma$, and from our perspective, it may be seen as a specific choice for the bias-variance trade-off. However, there is no intrinsic reason to choose $\beta = \gamma$, and other choices may be preferred in practice, depending on the nature of the problem. In the following numerical example, we investigate the bias-variance dependence on $\beta$, and show that the optimal $\beta$ in terms of mean-squared error may be quite different from $\gamma$.

### 4.1 Numerical Illustration

We revisit the 2-state MDP described in Section 3.1, with , and . For these parameter settings, the error of standard TD is ( was chosen to be close to a point of infinite bias for these parameters).

In Figure 2 we plot the mean-squared error , where was obtained by running ETD(0, β) with a step size for iterations, and averaging the results over different runs.

First of all, note that for all $\beta$, the error is smaller by two orders of magnitude than that of standard TD. Thus, algorithms that converge to the standard TD fixed point, such as GTD (Sutton et al., 2009), are significantly outperformed by ETD(0, β) in this case. Second, note the dependence of the error on $\beta$, demonstrating the bias-variance trade-off discussed above. Finally, note that the minimal error is obtained for , and is considerably smaller than that of the original ETD with $\beta = \gamma$.

## 5 Contraction Property for ETD(λ, β)

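We now extend our results to incorporate eligibility traces, in the style of the ETD(λ) algorithm (Sutton, Mahmood, and White, 2015), and show similar contraction properties and error bounds.

The ETD(λ, β) algorithm iteratively updates the weight vector $\theta$ according to

$$
\begin{aligned}
\theta_{t+1} &:= \theta_t + \alpha \left( R_{t+1} + \gamma \theta_t^\top \varphi_{t+1} - \theta_t^\top \varphi_t \right) e_t, \\
e_t &= \rho_t \left( \gamma \lambda e_{t-1} + M_t \varphi_t \right), \qquad e_{-1} = 0, \\
M_t &= \lambda + (1 - \lambda) F_t, \\
F_t &= \beta \rho_{t-1} F_{t-1} + 1, \qquad F_0 = 1,
\end{aligned}
$$

where $e_t$ is the eligibility trace (Sutton, Mahmood, and White, 2015). In this case, we define the emphatic weight vector $m$ by

$$m^\top = d_\mu^\top (I - P^{\lambda,\beta})^{-1}, \tag{4}$$

where $P^{a,b}$, for some $a, b \in \mathbb{R}$, denotes the following matrix:

$$P^{a,b} = I - (I - b a P)^{-1} (I - b P).$$

The Bellman operator for general $\lambda$ and $\gamma$ is given by:

$$T^{(\lambda)}(V) = (I - \gamma \lambda P)^{-1} R + P^{\lambda,\gamma} V, \qquad V \in \mathbb{R}^{|S|}.$$

For $\lambda = 0$ we have $P^{\lambda,\beta} = \beta P$, $m = f$, and so we recover the definitions of ETD(0, β).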
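The update rule with eligibility traces can be sketched as a single step function. As with any such sketch, the function name, argument order, and the convention of passing `e_prev = 0` and `F_prev = 0.0` at the first step are assumptions made for illustration.

```python
import numpy as np

def etd_lambda_beta_step(theta, e_prev, F_prev, rho_prev, rho,
                         phi, phi_next, reward, alpha, gamma, lam, beta):
    """One ETD(lambda, beta) update.

    At t = 0, pass e_prev = zeros and F_prev = 0.0 so that e_{-1} = 0 and F_0 = 1.
    """
    F = beta * rho_prev * F_prev + 1.0            # trace F_t
    M = lam + (1.0 - lam) * F                     # emphasis M_t
    e = rho * (gamma * lam * e_prev + M * phi)    # eligibility trace e_t
    td_error = reward + gamma * theta @ phi_next - theta @ phi
    theta_new = theta + alpha * td_error * e
    return theta_new, e, F
```

With `lam = 0` the emphasis reduces to `F` and the trace to `rho * F * phi`, so the step coincides with the ETD(0, β) update of Eq. (1).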

Recall that our goal is to estimate the value function $V^\pi$. Thus, we would like to know how well the ETD(λ, β) solution approximates $V^\pi$. Mahmood et al. (2015) show that, under suitable step-size conditions, ETD converges to some $\theta^*$ that is a solution of the projected fixed-point equation:

$$\theta^\top \Phi = \Pi_m T^{(\lambda)} (\theta^\top \Phi).$$

In their analysis, however, Mahmood et al. (2015) did not show how well the solution $\Phi^\top \theta^*$ approximates $V^\pi$. Next, we establish that the projected Bellman operator $\Pi_m T^{(\lambda)}$ is a contraction. This result will then allow us to bound the error $\left\| \Phi^\top \theta^* - V^\pi \right\|_m$.

###### Theorem 2.

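$\Pi_m T^{(\lambda)}$ is an $\omega$-contraction with respect to the Euclidean $m$-weighted norm, where:

$$
\begin{aligned}
\beta \ge \gamma: &\quad \omega = \sqrt{\frac{\gamma^2 (1 + \lambda\beta)^2 (1 - \lambda)}{\beta (1 + \gamma\lambda)^2 (1 - \lambda\beta)}}, \\
\beta \le \gamma: &\quad \omega = \sqrt{\frac{\gamma^2 (1 - \beta\lambda)(1 - \lambda)}{\beta (1 - \gamma\lambda)^2}}.
\end{aligned}
\tag{5}
$$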
###### Proof.

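(sketch) The proof is almost identical to the proof of Theorem 1, only now we cannot apply Jensen's inequality directly, since the rows of $P^{\lambda,\beta}$ do not sum to $1$. However:

$$P^{\lambda,\beta} \mathbf{1} = \left( I - (I - \beta\lambda P)^{-1} (I - \beta P) \right) \mathbf{1} = \zeta \mathbf{1},$$

where $\zeta = \frac{\beta (1 - \lambda)}{1 - \lambda\beta}$. Notice that each entry of $P^{\lambda,\beta}$ is non-negative. Therefore Jensen's inequality holds for $P^{\lambda,\beta} / \zeta$. Let $M = \mathrm{diag}(m)$; we have

$$
\begin{aligned}
\|v\|_m^2 - \frac{1}{\zeta} \left\| P^{\lambda,\beta} v \right\|_m^2 &= v^\top M v - \zeta\, v^\top \left( \frac{P^{\lambda,\beta}}{\zeta} \right)^{\!\top} M \, \frac{P^{\lambda,\beta}}{\zeta}\, v \\
&\overset{(a)}{\ge} v^\top M v - \zeta\, v^\top \mathrm{diag}\!\left( m^\top \frac{P^{\lambda,\beta}}{\zeta} \right) v \\
&= v^\top \left[ M - \mathrm{diag}(m^\top P^{\lambda,\beta}) \right] v \\
&= v^\top \left[ \mathrm{diag}\!\left( m^\top (I - P^{\lambda,\beta}) \right) \right] v \\
&\overset{(b)}{=} v^\top \mathrm{diag}(d_\mu) v = \|v\|_{d_\mu}^2,
\end{aligned}
$$

where (a) follows from Jensen's inequality and (b) from Equation (4). Therefore:

$$\|v\|_m^2 \ge \frac{1}{\zeta} \left\| P^{\lambda,\beta} v \right\|_m^2 + \|v\|_{d_\mu}^2 \ge \frac{1}{\zeta} \left\| P^{\lambda,\beta} v \right\|_m^2,$$

and:

$$
\begin{aligned}
\left\| T^{(\lambda)} v_1 - T^{(\lambda)} v_2 \right\|_m^2 &= \left\| P^{\lambda,\gamma} (v_1 - v_2) \right\|_m^2 \\
\text{(Case A: } \beta \ge \gamma\text{)} \quad &\le \left\| \frac{\gamma (1 + \beta\lambda)}{\beta (1 + \gamma\lambda)} P^{\lambda,\beta} (v_1 - v_2) \right\|_m^2 \le \frac{\gamma^2 (1 + \lambda\beta)^2 (1 - \lambda)}{\beta (1 + \gamma\lambda)^2 (1 - \lambda\beta)} \|v_1 - v_2\|_m^2, \\
\text{(Case B: } \beta \le \gamma\text{)} \quad &\le \left\| \frac{\gamma (1 - \beta\lambda)}{\beta (1 - \gamma\lambda)} P^{\lambda,\beta} (v_1 - v_2) \right\|_m^2 \le \frac{\gamma^2 (1 - \beta\lambda)(1 - \lambda)}{\beta (1 - \gamma\lambda)^2} \|v_1 - v_2\|_m^2.
\end{aligned}
$$

The inequalities depending on the two cases originate from the fact that the two matrices $P^{\lambda,\beta}$ and $P^{\lambda,\gamma}$ are polynomials of the same matrix $P$, and from a manipulation of the corresponding eigenvalue decomposition of $v$. The details are given in Lemma 3 of the supplementary material.

Now, for a proper choice of $\beta$, the operator $T^{(\lambda)}$ is a contraction, and since $\Pi_m$ is a non-expansion in the $m$-weighted norm, $\Pi_m T^{(\lambda)}$ is a contraction as well. ∎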
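The two structural facts used in this proof sketch, that the rows of $P^{\lambda,\beta}$ sum to $\zeta$ and that $\|P^{\lambda,\beta} v\|_m^2 \le \zeta \|v\|_m^2$, can be checked numerically. The random transition matrix, the arbitrary positive distribution standing in for $d_\mu$, and the helper name `P_ab` are assumptions of this illustration.

```python
import numpy as np

def P_ab(P, a, b):
    """P^{a,b} = I - (I - b*a*P)^{-1} (I - b*P)."""
    I = np.eye(P.shape[0])
    return I - np.linalg.solve(I - b * a * P, I - b * P)

rng = np.random.default_rng(1)
n, lam, beta = 4, 0.6, 0.8
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)          # hypothetical target transition matrix
d_mu = rng.random(n)
d_mu /= d_mu.sum()                          # hypothetical behavior distribution

P_lb = P_ab(P, lam, beta)
zeta = beta * (1 - lam) / (1 - lam * beta)
m = np.linalg.solve((np.eye(n) - P_lb).T, d_mu)   # Eq. (4)

v = rng.standard_normal(n)
lhs = np.sum(m * (P_lb @ v) ** 2)           # ||P^{lam,beta} v||_m^2
rhs = zeta * np.sum(m * v ** 2)             # zeta * ||v||_m^2
print(np.allclose(P_lb.sum(axis=1), zeta),
      np.allclose(P_ab(P, 0.0, beta), beta * P),
      lhs <= rhs)
```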

In Figure 3 we illustrate the dependence of the contraction modulus bound on $\lambda$ and $\beta$. In particular, for $\lambda \to 1$, the contraction modulus diminishes to 0. Thus, for large enough $\lambda$, a contraction can always be guaranteed (this can also be shown mathematically from the contraction results of Theorem 2). We remark that a similar result for standard TD(λ) was established by Yu (2012). However, as is well known (Bertsekas, 2012), increasing $\lambda$ also increases the variance of the algorithm, and we therefore obtain a bias-variance trade-off in $\lambda$ as well as $\beta$. Finally, note that for $\beta = \gamma$, the contraction modulus equals $\sqrt{\frac{\gamma (1 - \lambda)}{1 - \gamma\lambda}}$, and that for $\lambda = 0$ the result is the same as in Theorem 1.
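The bound of Eq. (5) is straightforward to tabulate. The short sketch below evaluates it on a small grid; the function name and the grid values are arbitrary choices for illustration.

```python
import math

def contraction_modulus(lam, beta, gamma):
    """Contraction modulus bound omega from Eq. (5) of Theorem 2."""
    if beta >= gamma:
        sq = (gamma**2 * (1 + lam * beta)**2 * (1 - lam)
              / (beta * (1 + gamma * lam)**2 * (1 - lam * beta)))
    else:
        sq = (gamma**2 * (1 - beta * lam) * (1 - lam)
              / (beta * (1 - gamma * lam)**2))
    return math.sqrt(sq)

gamma = 0.9
for lam in (0.0, 0.5, 0.9, 0.99):
    print(lam, [round(contraction_modulus(lam, b, gamma), 3)
                for b in (0.7, 0.9, 0.99)])
```

Values below 1 indicate that a contraction is guaranteed by the bound for that pair $(\lambda, \beta)$; values of 1 or more mean only that the bound is uninformative there.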

## 6 Conclusion

In this work we unified several off-policy TD algorithms under the ETD(λ, β) framework, which flexibly manages the bias and variance of the algorithm by controlling the decay rate of the importance-sampling ratio. From this perspective, we showed that several different methods proposed in the literature are special instances of this bias-variance selection.

Our main contribution is an error analysis of ETD(λ, β) that quantifies the bias-variance trade-off. In particular, we showed that the recently proposed ETD algorithm of Sutton, Mahmood, and White (2015) has bounded bias for general behavior and target policies, and that by controlling the decay rate in the ETD(λ, β) algorithm, improved performance may be obtained by reducing the variance of the algorithm while still maintaining a reasonable bias.

Possible future extensions of our work include finite-time bounds for off-policy ETD(λ, β), an error propagation analysis of off-policy policy improvement, and solving the bias-variance trade-off adaptively from data.

## References

• Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In ICML.
• Bertsekas, D., and Tsitsiklis, J. 1996. Neuro-Dynamic Programming. Athena Scientific.
• Bertsekas, D., and Yu, H. 2009. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics 227(1):27–50.
• Bertsekas, D. 2012. Dynamic Programming and Optimal Control, Vol II. Athena Scientific, 4th edition.
• Kolter, J. Z. 2011. The fixed points of off-policy TD. In NIPS.
• Liu, B.; Liu, J.; Ghavamzadeh, M.; Mahadevan, S.; and Petrik, M. 2015. Finite-sample analysis of proximal gradient TD algorithms. In UAI.
• Mahmood, A. R.; Yu, H.; White, M.; and Sutton, R. S. 2015. Emphatic temporal-difference learning. arXiv:1507.01569.
• Munos, R. 2003. Error bounds for approximate policy iteration. In ICML.
• Precup, D.; Sutton, R. S.; and Dasgupta, S. 2001. Off-policy temporal-difference learning with function approximation. In ICML.
• Sutton, R. S., and Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge Univ Press.
• Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML.
• Sutton, R. S.; Mahmood, A. R.; and White, M. 2015. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv:1503.04269.
• Yu, H. 2012. Least squares temporal difference methods: An analysis under general conditions. SIAM Journal on Control and Optimization 50(6):3310–3343.
• Yu, H. 2015. On convergence of emphatic temporal-difference learning. In COLT.

## Appendix A Proof of Lemma 1

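Notice that $\kappa$ obtains non-negative values since $d_\mu, f \ge 0$. Now, if there is a state $s$ visited by the target policy, but not the behavior policy, this means that $d_\mu(s) = 0$, and that there is some $\bar{s}$ with $d_\mu(\bar{s}) > 0$ from which $s$ is reached under the target policy, and by definition $f(s) > 0$, so we can get $\kappa = 0$.

Next, we prove the upper bound on $\kappa$. Notice that $\sum_s d_\mu(s) = 1$, and that $\sum_s f(s) = \frac{1}{1-\beta}$. Hence, if $d_\mu \ne (1-\beta) f$, then there must exist some $s$ such that $d_\mu(s) < (1-\beta) f(s)$, so $\kappa < 1 - \beta$. Now, when $\pi = \mu$, by definition $f = \frac{1}{1-\beta} d_\mu$ and we obtain this upper bound.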

## Appendix B Technical Part of Proposition 1

###### Lemma 2.

The following is true:

###### Proof.

Notice that:

$$f^\top = d_\mu^\top (I - \beta P_\pi)^{-1} \overset{(cw)}{\ge} d_\mu^\top + \beta\, d_\mu^\top P_\pi,$$

so:

Where (a) comes from the inequality on , (b) also removes the negative summand , and swaps sum with norm (all coordinates are non-negative), (c) and (d) are from the sub-multiplicative property of induced norms (the norm originates from the transpose). ∎

## Appendix C Norm Inequality between $P^{\lambda,\beta}_\pi$ and $P^{\lambda,\gamma}_\pi$

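If $\beta \ge \gamma$:

$$\left\| P^{\lambda,\gamma}_\pi v \right\|_m \le \frac{\gamma (1 + \beta\lambda)}{\beta (1 + \gamma\lambda)} \left\| P^{\lambda,\beta}_\pi v \right\|_m, \tag{6}$$

and if $\beta \le \gamma$:

$$\left\| P^{\lambda,\gamma}_\pi v \right\|_m \le \frac{\gamma (1 - \beta\lambda)}{\beta (1 - \gamma\lambda)} \left\| P^{\lambda,\beta}_\pi v \right\|_m. \tag{7}$$
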
###### Proof.

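Mark the orthonormal eigenvectors and corresponding eigenvalues of $P_\pi$ by $u_j$ and $t_j$ respectively ($t_j$ may be a complex number; this decomposition exists over $\mathbb{C}$ almost surely). Notice that since $P^{\lambda,\beta}_\pi$ and $P^{\lambda,\gamma}_\pi$ are polynomials of $P_\pi$ they have the same eigenvectors, with the eigenvalues $l^\beta_j$ and $l^\gamma_j$ correspondingly. Hence, we can write the first norm as follows:

$$\left\| P^{\lambda,\gamma}_\pi v \right\|_m^2 = \sum_j \left| \langle v, u_j \rangle \right|^2 \left| l^\gamma_j \right|^2 \left\| u_j \right\|_m^2. \tag{8}$$

And similarly for $P^{\lambda,\beta}_\pi$:

$$\left\| P^{\lambda,\beta}_\pi v \right\|_m^2 = \sum_j \left| \langle v, u_j \rangle \right|^2 \left| l^\beta_j \right|^2 \left\| u_j \right\|_m^2. \tag{9}$$

So if we can find a constant $\alpha$ such that:

$$\forall j: \; \left| l^\gamma_j \right|^2 \le \alpha^2 \left| l^\beta_j \right|^2, \tag{10}$$

then we obtain $\left\| P^{\lambda,\gamma}_\pi v \right\|_m \le \alpha \left\| P^{\lambda,\beta}_\pi v \right\|_m$. The expression we want to maximize is:

$$\frac{\left| l^\gamma_j \right|^2}{\left| l^\beta_j \right|^2} = \frac{\gamma^2 (1 - \beta\lambda t_j)(1 - \beta\lambda t_j^*)}{\beta^2 (1 - \gamma\lambda t_j)(1 - \gamma\lambda t_j^*)} = \frac{\gamma^2 \left( 1 - \beta\lambda t_j - \beta\lambda t_j^* + \beta^2 \lambda^2 |t_j|^2 \right)}{\beta^2 \left( 1 - \gamma\lambda t_j - \gamma\lambda t_j^* + \gamma^2 \lambda^2 |t_j|^2 \right)}. \tag{11}$$

Taking the derivative with respect to $t_j$ shows that there are no extremum points inside the ball $|t_j| < 1$ (we know the eigenvalues are inside this ball since they belong to a stochastic matrix), which means we can look at the boundary of this ball, $|t_j| = 1$, to find the maximum value. Since we now get dependence only on $\mathrm{Re}(t_j)$, the maximum must be at $t_j = \pm 1$:

$$\max_{t_j : |t_j| \le 1} \frac{\left| l^\gamma_j \right|^2}{\left| l^\beta_j \right|^2} = \frac{\gamma^2 (1 \pm \beta\lambda)^2}{\beta^2 (1 \pm \gamma\lambda)^2}, \tag{12}$$

where when $\beta \ge \gamma$ the plus is larger and vice versa. ∎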