# On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms, which we call GTD and which minimize the mean squared projected Bellman error using stochastic gradient-descent; (ii) their "robustified" biased variants; (iii) their mirror-descent versions which combine the mirror-descent idea with TD learning; and (iv) a single-time-scale version of GTD that solves minimax problems formulated for approximate policy evaluation. We derive convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard type of diminishing stepsize with a square-summable condition. For the first two types of stepsizes, we apply the weak convergence method from stochastic approximation theory to characterize the asymptotic behavior of the algorithms, and for the standard type of stepsize, we analyze the algorithmic behavior with respect to a stronger mode of convergence, almost sure convergence. Our convergence results are for the aforementioned TD algorithms with three general ways of setting their λ-parameters: (i) state-dependent λ; (ii) a recently proposed scheme of using history-dependent λ to keep the eligibility traces of the algorithms bounded while allowing for relatively large values of λ; and (iii) a composite scheme of setting the λ-parameters that combines the preceding two schemes and allows a broader class of generalized Bellman operators to be used for approximate policy evaluation with TD methods.


## 1 Introduction

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes (MDPs) with finite spaces and discounted reward criteria. Off-policy TD learning extends on-policy model-free TD learning [27, 34] (see also the books [3, 30]) to cases where stationary policies of interest are evaluated using data collected without executing the policies. It is more flexible than on-policy learning and can be useful not only as a computational tool to solve MDPs, but also as an aid to building experience-based knowledge representations for autonomous agents in AI applications [29]. The specific class of algorithms that we consider in this technical report is the class of gradient-based off-policy TD algorithms with linear function approximation. Our purpose is to analyze several such algorithms proposed in the literature, and present a collection of convergence results for a broad range of choices of stepsizes and other important algorithmic parameters.

The algorithms that we will analyze include the following:

• Two two-time-scale gradient-based TD algorithms proposed and studied by Sutton et al. [31, 33] and Maei [10]. These two algorithms use stochastic gradient-descent to minimize the mean squared projected Bellman error, a convex quadratic objective function, for approximate policy evaluation, thereby overcoming the divergence issue in off-policy TD learning. They have been called GTD2, TDC, as well as GTD(λ) in the early works just mentioned. Here we shall refer to them as GTDa and GTDb, respectively, and refer to both algorithms as GTD algorithms.

• A single-time-scale version of GTDa that solves minimax problems formulated for approximate policy evaluation. This algorithm was also considered in the early works on GTD just mentioned. However, the fact that it solves a minimax problem equivalent to the projected-Bellman-error minimization was pointed out only later by Liu et al. [9] (see also Mahadevan et al. [13]). The latter viewpoint facilitates convergence analysis of the algorithm by placing it in a more general class of stochastic approximation algorithms for solving minimax problems.

• The mirror-descent versions of GTD and TD. Combining the mirror-descent idea of Nemirovsky and Yudin [18] with TD learning was proposed by Mahadevan and Liu [12] (see also [13]).

• “Robustified” biased variants of the preceding algorithms. These algorithms use a “robustification” procedure to mitigate the high-variance issue in off-policy learning, at the price of introducing biases. They are similar to the biased variant algorithms considered by the author [37] for the emphatic TD (ETD) algorithm proposed by Sutton et al. [32]. In the present context, as we will show, for the two-time-scale GTD algorithms, the variant algorithms can be viewed as approximate gradient algorithms, and for the single-time-scale GTDa, its variant tries to solve minimax problems that approximate the ones GTDa tries to solve.

We will analyze primarily constrained algorithms, which confine their iterates to bounded sets. Only for the single-time-scale GTDa algorithm will we also analyze the unconstrained version, under certain conditions.

We will present convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard type of diminishing stepsize with a square-summable condition. For the first two types of stepsizes, we apply the weak convergence method from stochastic approximation theory [8] to characterize the asymptotic behavior of the algorithms. For the third, standard type of stepsize, we analyze the algorithmic behavior with respect to a stronger mode of convergence, almost sure convergence, by using general results on stochastic approximation [4, 8].

Our convergence results are for the aforementioned TD algorithms with three general ways of setting the λ-parameters in TD learning:

• state-dependent λ [28, 30];

• a case of history-dependent λ as proposed recently by Yu et al. [40], which can keep the eligibility traces in the off-policy TD algorithms bounded while allowing for relatively large values of λ;

• a composite scheme of setting the λ-parameters [35, 40], which combines the preceding two schemes and allows a broader class of generalized Bellman operators to be used for approximate policy evaluation with TD methods.

To our knowledge, for off-policy gradient-based TD algorithms with linear function approximation, there are few prior convergence results that address the case of general nonzero λ-parameters. Although such algorithms with constant or state-dependent λ have been proposed and investigated (see e.g., [10, 11, 13]), the analyses given earlier [9, 10, 13, 31, 33] have only proved convergence for the case where λ = 0 and the data consist of i.i.d. state transitions. But the assumption of i.i.d. data is unrealistic for reinforcement learning even in the case λ = 0. Moreover, these analyses cannot be extended to the case of positive λ, where the algorithms need to use non-i.i.d. off-policy data in order to gather information about the multistep Bellman operator with respect to which the mean squared projected Bellman error is defined. To our knowledge, the only prior convergence result that applies to off-policy data is given by Karmakar and Bhatnagar [6]. It is a convergence result for the two-time-scale TDC algorithm (GTDb as we call it) with the standard type of diminishing stepsize, obtained as an application of the theoretical results developed in [6] for two-time-scale differential inclusions. The result is for λ = 0, although its arguments can be applied in the case where λ is small enough that the eligibility traces produced in the algorithm are bounded.

Our results on gradient-based TD algorithms differ from that of [6] not only in the range of algorithms and parameter settings they cover, but also in the proof approaches by which they are derived. Specifically, we combine the ordinary-differential-equation (ODE) based proof methods in stochastic approximation theory [8] with special properties of the eligibility traces and the ergodicity of the joint state and eligibility trace process under the various settings of the λ-parameters mentioned above. Those properties were derived in the author's earlier works [35, 37] for state-dependent λ and in the recent work [40] for the special case of history-dependent λ mentioned above. The properties of the joint state and eligibility trace process are not considered in [6], since it treats the case where the eligibility traces are simply functions of states, but these properties are important for convergence analysis of those TD algorithms that use general nonzero λ-parameters. The ODE-based line of analysis we use is less general than the differential-inclusion-based method studied in [6], however. As future work, it can be worthwhile to use the latter approach to handle even more flexible ways of choosing history-dependent λ than the one we consider in this work.

Another difference between our work and [6] is that we analyze primarily constrained algorithms, as mentioned earlier. The result given in [6] is for the unconstrained two-time-scale TDC under the assumption that the iterates are almost surely bounded. With constraints, we do not need such assumptions; instead, we simply require that the constraint sets be large enough so that the algorithms can estimate the gradients correctly. The presence of constraint sets helps us avoid some theoretical difficulties in convergence analysis. However, extra work is also needed to ensure that the constraint sets do not prevent the algorithms from achieving the goals they are designed for. We take care of such issues in our analyses, especially for the mirror-descent GTD/TD algorithms and for the single- and two-time-scale GTDa algorithms, which are not as straightforward as the two-time-scale GTDb algorithm.

This technical report is organized as follows. Section 2 covers the preliminaries: We first describe the off-policy policy evaluation problem and the two two-time-scale GTD algorithms. We then explain the role of the λ-parameters in TD learning, and discuss the properties of the eligibility traces and of the joint state and eligibility trace process, in order to set the stage for the convergence analyses. In Sections 3-4, we present convergence results for slowly diminishing stepsize and constant stepsize, which are derived with the weak convergence method. Section 3 is for the two two-time-scale GTD algorithms and their biased variants. Section 4 is for the mirror-descent GTD/TD algorithms, as well as a single-time-scale GTDa algorithm and its biased variant, both of which solve minimax problems for policy evaluation. In this section we also use the minimax problem formulation to strengthen the results of Section 3 for the two-time-scale GTDa algorithm and its biased variant. In Section 5, we consider standard stepsize conditions and present almost sure convergence results for both the two-time-scale and single-time-scale algorithms, including a result for the unconstrained single-time-scale GTDa for certain choices of the λ-parameters. We then conclude in Section 6 with a brief discussion of these results and open questions. For quick access to the convergence results in Sections 3-5, the convergence theorems are listed at the beginning of each of those sections.

## 2 Preliminaries

In this section we first introduce the off-policy policy evaluation problem, and describe two basic forms of the gradient-based TD algorithms, which were proposed and studied in [10, 11, 31, 33]. We then explain generalized Bellman operators and how they relate to the λ-parameters of the TD algorithms and to the objectives of these algorithms. We also specify two ways of choosing these λ-parameters for the algorithms that we will analyze in the paper. These background materials are given in Section 2.1. In Section 2.2, the second half of this section, we present materials to set the stage for analyzing the gradient-based TD algorithms as stochastic approximation algorithms in the rest of the paper. These materials concern the state-trace process—the random process that underlies and drives the TD algorithms—and its properties that are important for convergence analysis (including, among others, ergodicity and uniform integrability properties).

### 2.1 Problem Setup and Two Basic Forms of Algorithms

The gradient-based TD algorithms we consider belong to the class of model-free, temporal-difference-based learning algorithms for evaluating a stationary policy in a Markov decision process (MDP). We shall consider MDPs with finite spaces. For the purpose of this paper, however, we do not need the full MDP framework. (For references on MDPs, see the excellent textbook [22].)

It is adequate to consider two Markov chains on a finite state space $S$. (The states in these Markov chains need not correspond to the states of the MDP; they can correspond to state-action pairs. Depending on whether one evaluates the value function for states or state-action pairs in the MDP, the Markov chains here correspond to slightly different processes in the MDP; for the details of these correspondences, see [35, Examples 2.1, 2.2]. But the analysis is the same, so for notational simplicity, we have chosen not to introduce action variables in the paper.) The first Markov chain has transition matrix $P$; the second has a different transition matrix. The mechanism that induces the first chain will be denoted by $\pi$ and referred to as the target policy; the mechanism that induces the second chain will be referred to as the behavior policy. The second Markov chain we can observe; however, what we want is to evaluate the system performance with respect to (w.r.t.) the first Markov chain, which we do not observe—the “off-policy” learning case.

The performance of the target policy is defined w.r.t. a discounted total reward criterion as follows. A one-stage reward function $r_\pi$ specifies the expected reward $r_\pi(s)$ at each state $s$. Each state $s$ is also associated with a state-dependent discount factor $\gamma(s) \in [0,1]$. The expected discounted total reward for each initial state $s$ is defined by

$$v_\pi(s) := \mathbb{E}^\pi_s\Big[r_\pi(S_0) + \sum_{n=1}^\infty \gamma(S_1)\gamma(S_2)\cdots\gamma(S_n)\cdot r_\pi(S_n)\Big]. \tag{2.1}$$

Here the notation $\mathbb{E}^\pi_s$ indicates that the expectation is taken w.r.t. the Markov chain $\{S_n\}$ induced by $\pi$ and with the initial state $S_0 = s$. The function $v_\pi$ in (2.1) is called the value function of $\pi$, and it is well-defined under Condition 2.1(i) given below, which we shall assume throughout the paper.

Denote by $\Gamma$ the diagonal matrix with the discount factors $\gamma(s)$, $s \in S$, as its diagonal entries.

###### Condition 2.1 (Conditions on the target and behavior policies).
• The target policy $\pi$ is such that the inverse $(I - P\Gamma)^{-1}$ exists, and

• The behavior policy is such that every transition that has positive probability under the target policy also has positive probability under the behavior policy, and moreover, the Markov chain it induces is irreducible.

The second part of this condition is for the behavior policy that generates the observed Markov chain. It will be needed when we describe the off-policy learning algorithms.

By standard MDP theory (see e.g., [22]), under Condition 2.1(i), $v_\pi$ satisfies uniquely the linear equation (expressed in matrix/vector notation, with $v_\pi, r_\pi$ viewed as $|S|$-dimensional vectors):

$$v_\pi = r_\pi + P\Gamma v_\pi \qquad \big(\text{i.e., } v_\pi = (I - P\Gamma)^{-1} r_\pi\big). \tag{2.2}$$

It is known as the Bellman equation (or dynamic programming equation) for $\pi$. Besides this equation, $v_\pi$ also satisfies a broad family of generalized Bellman equations, which have $v_\pi$ as their unique solution and, like (2.2), express $v_\pi$ as the sum of two reward terms, with the first term representing the expected rewards received prior to a certain (randomized stopping) time and the second term those received afterwards. (We shall discuss these equations further in Section 2.1.3.)
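As a quick numerical illustration of (2.2), the value function can be obtained by solving the linear system directly. The transition matrix, rewards, and discount factors below are hypothetical, chosen only to make the example concrete:

```python
import numpy as np

# Hypothetical 3-state model (illustration only).
P = np.array([[0.0, 0.5, 0.5],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])    # target-policy transition matrix P
r_pi = np.array([1.0, 0.0, 2.0])  # one-stage rewards r_pi(s)
Gamma = np.diag([0.9, 0.8, 0.9])  # state-dependent discount factors gamma(s)

# Bellman equation (2.2): v_pi = r_pi + P Gamma v_pi, i.e. v_pi = (I - P Gamma)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(3) - P @ Gamma, r_pi)

# Sanity check: v_pi is the fixed point of the Bellman equation.
assert np.allclose(v_pi, r_pi + P @ Gamma @ v_pi)
```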

TD algorithms compute $v_\pi$ by solving such a Bellman equation for $\pi$. Which Bellman equation to solve is determined by certain parameters, which we call the λ-parameters, used by the algorithms in their iterative computation of eligibility traces (iterates that carry information about the past states). We shall give more details about this correspondence between λ and the generalized Bellman equations in Section 2.1.3, after we describe two basic forms of the GTD algorithms. For now, we will focus on the overall structure of the computation problem tackled by the gradient-based TD algorithms. Think of $v = T^{(\lambda)} v$ as one generalized Bellman equation that an algorithm chooses to solve. The operator $T^{(\lambda)}$ is an affine operator on $\mathbb{R}^{|S|}$ and, similar to (2.2), can be expressed as

$$T^{(\lambda)} v = r^{(\lambda)}_\pi + P^{(\lambda)} v, \qquad \forall\, v \in \mathbb{R}^{|S|}, \tag{2.3}$$

for a vector $r^{(\lambda)}_\pi$ and a substochastic matrix $P^{(\lambda)}$. We shall refer to $T^{(\lambda)}$ as a generalized Bellman operator for $\pi$. The TD algorithms we consider try to find an approximate solution to the linear equation

$$v = T^{(\lambda)} v,$$

by solving an optimization problem on a lower dimensional space using linear function approximation. Let us describe first the approximation architecture and then the formulation of the optimization problem.

Let $\phi: S \to \mathbb{R}^d$ be a given function that maps each state $s$ to a $d$-dimensional feature vector $\phi(s)$ (it will be taken for granted that $\phi$ is non-trivial; i.e., $\phi(s) \neq 0$ for at least one state $s$). Write $\phi = (\phi_1, \ldots, \phi_d)$, where each component $\phi_i$ is a function on $S$, and denote the subspace spanned by these component functions by $L_\phi$. To approximate $v_\pi$, the TD algorithms look for some function $v \in L_\phi$ that satisfies the generalized Bellman equation approximately: $v \approx T^{(\lambda)} v$. The functions in the approximation subspace $L_\phi$ are parameterized as $v_\theta = \Phi\theta$, for parameters $\theta \in \mathbb{R}^d$. (We treat $\theta$ and $\phi(s)$ as column vectors; the symbol $\top$ stands for transpose.) We do not require the functions $\phi_1, \ldots, \phi_d$ to be linearly independent; for this reason, another subspace will be useful later. This is the subspace in $\mathbb{R}^d$ spanned by the feature vectors $\{\phi(s) : s \in S\}$; below we shall write it as $\mathrm{span}\{\phi(S)\}$ for short. In matrix notation, $L_\phi$ is the column space of the matrix $\Phi$ that has the feature vectors $\phi(s)^\top$ as its rows; i.e.,

$$\Phi = \begin{bmatrix} \vdots \\ \phi(s)^\top \\ \vdots \end{bmatrix} \qquad \text{or} \qquad \Phi^\top = \begin{bmatrix} \cdots & \phi(s) & \cdots \end{bmatrix}.$$

Any approximate value function $v$ in the approximation subspace can be written as $v = \Phi\theta$ for some $\theta \in \mathbb{R}^d$—note that $\theta$ is uniquely determined by $v$ if $\theta$ is restricted to lie in $\mathrm{span}\{\phi(S)\}$.

Let us now describe the optimization problem that the gradient-based TD algorithms try to solve in order to find an approximation of $v_\pi$ in the approximation subspace.

#### 2.1.1 The objective function

To find a function $v$ in the approximation subspace with $v \approx T^{(\lambda)} v$, the two original gradient-based TD algorithms, GTDa and GTDb, to be described shortly, try to minimize an objective function of the form

$$J(\theta) = \tfrac{1}{2}\big\|\Pi_\xi\big(T^{(\lambda)} v_\theta - v_\theta\big)\big\|^2_\xi, \qquad \text{where } v_\theta = \Phi\theta,\ \theta \in \mathbb{R}^d. \tag{2.4}$$

Here $\Pi_\xi$ denotes projection onto the approximation subspace w.r.t. a weighted Euclidean norm $\|\cdot\|_\xi$, given by $\|v\|^2_\xi = \sum_{s \in S} \xi_s\, v(s)^2$ for a positive $|S|$-dimensional vector $\xi$ with components $\xi_s$. The objective $J$ measures the magnitude of the “Bellman error” $T^{(\lambda)} v_\theta - v_\theta$ on the approximation subspace. If the projected Bellman equation $v = \Pi_\xi T^{(\lambda)} v$ has a unique solution, the relation between this solution and $v_\pi$ can be characterized using the oblique projection viewpoint and related approximation error bounds (see the early works [26, 38] and a summary in more general terms given in the recent work [40, Appendix B]). In this paper, since our focus is on convergence properties of the algorithms and since the minimization problem always has an optimal solution, we do not require the projected Bellman equation to have a unique solution or any solution at all. (When it has no solution or multiple solutions, the quality of the approximate value function obtained from the minimization of $J$ could be a concern, though.)
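The objective (2.4) is straightforward to evaluate when the model quantities are available. The sketch below uses stand-in random matrices in the roles of $P^{(\lambda)}$ and $r^{(\lambda)}_\pi$ (all names and numbers are ours, for illustration only); the pseudo-inverse keeps the projection valid even when the columns of $\Phi$ are linearly dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, d = 5, 2
Phi = rng.normal(size=(nS, d))                     # feature matrix (rows phi(s)^T)
xi = np.full(nS, 1.0 / nS)                         # weights xi (a probability vector)
Xi = np.diag(xi)
P_lam = 0.9 * rng.dirichlet(np.ones(nS), size=nS)  # stand-in substochastic P^(lambda)
r_lam = rng.normal(size=nS)                        # stand-in r_pi^(lambda)

# Projection onto the column space of Phi w.r.t. the xi-weighted norm.
Proj = Phi @ np.linalg.pinv(Phi.T @ Xi @ Phi) @ Phi.T @ Xi

def J(theta):
    """Objective (2.4): half the squared xi-weighted norm of the projected Bellman error."""
    v = Phi @ theta
    pe = Proj @ (r_lam + P_lam @ v - v)   # Pi_xi (T^(lambda) v_theta - v_theta)
    return 0.5 * pe @ Xi @ pe

theta0 = rng.normal(size=d)
assert J(theta0) >= 0.0
assert np.allclose(Proj @ Proj, Proj)     # a projection is idempotent
```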

In this paper, we shall take $\xi$ to be the invariant probability distribution of the Markov chain induced by the behavior policy; such a distribution exists and is unique under Condition 2.1(ii). This choice of $\xi$ is mostly for notational simplicity: our analyses extend to cases where $\xi$ does not coincide with the invariant distribution of the behavior policy, but the algorithms in those cases have additional weighting terms and are notationally more cumbersome.

Later we will also discuss regularized objective functions of the form $J(\theta) + p(\theta)$, where $p$ is some smooth convex function that serves as a regularizer. It will be seen that little extra effort is needed in the convergence analysis to handle this additional term. So, for notational simplicity, we will take $J$ to be the objective function in the first half of the paper, and discuss the regularized objective function after we have presented the main convergence proof arguments. One can also consider a mixed objective function that combines the projected Bellman errors for multiple Bellman operators—for instance, the sum of two functions defined like $J$ but for two different λ's. The convergence analysis of the gradient-based TD algorithms for such mixed objectives is essentially the same as that for $J$, so we will focus on the latter for simplicity.

Let us now work out two expressions for $\nabla J(\theta)$, which are used respectively by the two GTD algorithms, before we describe these algorithms. Let $\langle \cdot, \cdot \rangle_\xi$ denote the $\xi$-weighted inner product on the Euclidean space $\mathbb{R}^{|S|}$; i.e., $\langle v, v' \rangle_\xi = \sum_{s \in S} \xi_s\, v(s)\, v'(s)$. (The notation $\langle \cdot, \cdot \rangle$ will be used for the usual inner product in Euclidean spaces.) For $i = 1, \ldots, d$, let $\Phi_i$ denote the $i$-th column of $\Phi$, and $\theta_i$ the $i$-th component of $\theta$. Since $v_\theta = \Phi\theta = \sum_{i=1}^d \theta_i \Phi_i$, the partial derivative of $J$ w.r.t. each $\theta_i$ is

$$\nabla_{\theta_i} J(\theta) = \big\langle \Pi_\xi\big(T^{(\lambda)} v_\theta - v_\theta\big),\, \Pi_\xi\big(P^{(\lambda)} - I\big)\Phi_i \big\rangle_\xi.$$

(Recall that $P^{(\lambda)}$ is the substochastic matrix in the affine operator $T^{(\lambda)}$; cf. (2.3).) Observe two facts. First, for any $v, v' \in \mathbb{R}^{|S|}$,

$$\langle \Pi_\xi v, \Pi_\xi v' \rangle_\xi = \langle v, \Pi_\xi v' \rangle_\xi = \langle \Pi_\xi v, v' \rangle_\xi.$$

Second, for any $v$ in the approximation subspace, there is a unique $x \in \mathrm{span}\{\phi(S)\}$ with $\Phi x = v$, and therefore, given $\theta$, there is a unique solution $x_\theta$ to the linear equation (in $x$),

$$\Phi x = \Pi_\xi\big(T^{(\lambda)} v_\theta - v_\theta\big), \qquad x \in \mathrm{span}\{\phi(S)\}, \tag{2.5}$$

which is also the unique solution to the equivalent linear equation (to see that (2.6) is equivalent to (2.5), note that for any $x$, $\Phi x = \Pi_\xi(T^{(\lambda)} v_\theta - v_\theta)$ if and only if, w.r.t. $\langle \cdot, \cdot \rangle_\xi$, the vector $\Phi x - (T^{(\lambda)} v_\theta - v_\theta)$ is perpendicular to the approximation subspace, which is true if and only if $\langle \Phi_i, \Phi x - (T^{(\lambda)} v_\theta - v_\theta) \rangle_\xi = 0$ for all $i$, since the approximation subspace is the column space of $\Phi$; the latter system of linear equations, written in matrix form, is the same as the first equation in (2.6)):

$$\Phi^\top \Xi \Phi x = \Phi^\top \Xi\big(T^{(\lambda)} v_\theta - v_\theta\big), \qquad x \in \mathrm{span}\{\phi(S)\}, \tag{2.6}$$

where $\Xi$ denotes the diagonal matrix with the components of $\xi$ on its diagonal.

Using the above facts, we can write

$$\nabla_{\theta_i} J(\theta) = \big\langle \Phi x_\theta,\, (P^{(\lambda)} - I)\Phi_i \big\rangle_\xi = x_\theta^\top \cdot \Phi^\top \Xi (P^{(\lambda)} - I)\Phi_i, \tag{2.7}$$

which gives the expression of the gradient as

$$\nabla J(\theta) = \big(\Phi^\top \Xi (P^{(\lambda)} - I)\Phi\big)^\top x_\theta. \tag{2.8}$$

Alternatively, we can write

$$\nabla_{\theta_i} J(\theta) = \big\langle \Phi x_\theta,\, (P^{(\lambda)} - I)\Phi_i \big\rangle_\xi = -\big\langle T^{(\lambda)} v_\theta - v_\theta,\, \Phi_i \big\rangle_\xi + \big\langle \Phi x_\theta,\, P^{(\lambda)} \Phi_i \big\rangle_\xi \tag{2.9}$$

(where in the second equality we used $\langle \Phi x_\theta, \Phi_i \rangle_\xi = \langle \Pi_\xi(T^{(\lambda)} v_\theta - v_\theta), \Phi_i \rangle_\xi = \langle T^{(\lambda)} v_\theta - v_\theta, \Phi_i \rangle_\xi$). This gives another expression of the gradient:

$$\nabla J(\theta) = -\Phi^\top \Xi\big(T^{(\lambda)} v_\theta - v_\theta\big) + \big(\Phi^\top \Xi P^{(\lambda)} \Phi\big)^\top x_\theta. \tag{2.10}$$

In principle one can derive other gradient expressions and formulate corresponding gradient-based algorithms; we shall, however, focus on the expressions (2.8) and (2.10) only.
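The two gradient expressions can be checked numerically on a small random instance (a sketch with stand-in quantities of our own: a random substochastic matrix plays the role of $P^{(\lambda)}$). Both expressions agree with each other and with a finite-difference gradient of $J$:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, d = 6, 3
Phi = rng.normal(size=(nS, d))                     # feature matrix
xi = rng.dirichlet(np.ones(nS)); Xi = np.diag(xi)
P_lam = 0.9 * rng.dirichlet(np.ones(nS), size=nS)  # stand-in substochastic P^(lambda)
r_lam = rng.normal(size=nS)                        # stand-in r_pi^(lambda)
I = np.eye(nS)

def x_of(theta):
    # Unique solution of (2.6) in span{phi(S)} (the pseudo-inverse returns that solution).
    err = r_lam + P_lam @ (Phi @ theta) - Phi @ theta
    return np.linalg.pinv(Phi.T @ Xi @ Phi) @ (Phi.T @ Xi @ err)

def J(theta):
    pe = Phi @ x_of(theta)                 # Pi_xi (T^(lambda) v_theta - v_theta), by (2.5)
    return 0.5 * pe @ Xi @ pe

theta = rng.normal(size=d)
err = r_lam + P_lam @ (Phi @ theta) - Phi @ theta
x_theta = x_of(theta)
grad_a = (Phi.T @ Xi @ (P_lam - I) @ Phi).T @ x_theta                # expression (2.8)
grad_b = -Phi.T @ Xi @ err + (Phi.T @ Xi @ P_lam @ Phi).T @ x_theta  # expression (2.10)
num = np.array([(J(theta + 1e-6 * np.eye(d)[i]) - J(theta - 1e-6 * np.eye(d)[i])) / 2e-6
                for i in range(d)])        # central-difference gradient
assert np.allclose(grad_a, grad_b)
assert np.allclose(grad_a, num, atol=1e-4)
```

The equality of the two expressions is exactly the content of equation (2.6): they differ by the term $\Phi^\top\Xi\Phi x_\theta - \Phi^\top\Xi(T^{(\lambda)}v_\theta - v_\theta)$, which vanishes at $x_\theta$.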

#### 2.1.2 GTDa and GTDb

We now describe the two basic forms of the GTD algorithms. As mentioned earlier, they can only observe the Markov chain $\{S_n\}$ induced by the behavior policy, instead of the target policy $\pi$. Upon each state transition $(S_n, S_{n+1})$, they receive a random reward $R_{n+1}$ that is a function of the transition, $r(S_n, S_{n+1})$, plus a zero-mean finite-variance noise term whose distribution is determined by the state transition. The reward function $r$ relates to the target policy's one-stage reward $r_\pi$ as $r_\pi(s) = \sum_{s' \in S} P(s, s')\, r(s, s')$ for all $s \in S$. The rewards and the states are all that the algorithms can observe.

Define $\rho(s, s')$ to be the ratio of the probabilities of the transition $(s, s')$ under the target and behavior policies, for all transitions possible under the behavior policy. These are the importance sampling ratios that can be used to compensate for the differences in the dynamics of the two Markov chains. We assume that the algorithms know these ratios (this is the case for standard value function or state-action value function estimation, as well as for the simulation context where both policies are known). To simplify notation, for $n \geq 0$, we write

$$\rho_n = \rho(S_n, S_{n+1}), \qquad \gamma_n = \gamma(S_n).$$

For any given approximate value function $v$ on $S$, we write $\delta_n(v)$ for the (scalar) temporal-difference term calculated based on the observed random transition $(S_n, S_{n+1})$ and reward $R_{n+1}$:

$$\delta_n(v) = \rho_n\big(R_{n+1} + \gamma_{n+1}\, v(S_{n+1}) - v(S_n)\big). \tag{2.11}$$

The conditional expectation of $\delta_n(v)$ given the history $S_0, \ldots, S_n$ measures the difference between the two sides of the Bellman equation (2.2) for the state $S_n$ when $v_\pi$ in (2.2) is replaced by $v$.

Using the states $\{S_n\}$, a sequence of eligibility trace vectors $e_n \in \mathbb{R}^d$ is calculated iteratively by both GTD algorithms according to this formula: given an initial $e_0$, for $n \geq 1$,

$$e_n = \lambda_n \gamma_n \rho_{n-1}\, e_{n-1} + \phi(S_n). \tag{2.12}$$

Here $\lambda_n \in [0, 1]$, $n \geq 1$, are the λ-parameters we referred to earlier. They are important parameters in TD learning. Not only do they affect the behavior of the algorithms, but they also determine the Bellman operator $T^{(\lambda)}$ appearing in the objective function $J$. In the next subsection we shall describe the choices of these parameters that we consider in this paper, and explain what the associated Bellman operators $T^{(\lambda)}$ are.

The eligibility traces (or traces, for short) are combined with temporal-difference terms by the algorithms to generate a sequence of iterates $(\theta_n, x_n)$, starting from some initial $(\theta_0, x_0)$. In particular, let $v_{\theta_n} = \Phi\theta_n$ as before. The first algorithm, GTDa, calculates the sequence iteratively according to

$$\theta_{n+1} = \theta_n + \alpha_n\, \rho_n\big(\phi(S_n) - \gamma_{n+1}\phi(S_{n+1})\big)\cdot e_n^\top x_n, \tag{2.13}$$
$$x_{n+1} = x_n + \beta_n\big(e_n\, \delta_n(v_{\theta_n}) - \phi(S_n)\phi(S_n)^\top x_n\big). \tag{2.14}$$

The second algorithm, GTDb, has the same formula for $x_{n+1}$, but calculates $\theta_{n+1}$ according to

$$\theta_{n+1} = \theta_n + \alpha_n\big(e_n\, \delta_n(v_{\theta_n}) - \rho_n(1 - \lambda_{n+1})\gamma_{n+1}\phi(S_{n+1})\cdot e_n^\top x_n\big). \tag{2.15}$$

In the above, $\alpha_n$ and $\beta_n$ are stepsizes, with $\alpha_n$ much smaller than $\beta_n$. (We will consider a broad range of stepsizes, and we defer the precise stepsize conditions to later sections where we analyze the algorithms.) Although it can be hard, for readers unfamiliar with TD algorithms, to see how the preceding formulae relate to the gradient $\nabla J$, the GTD algorithms do correspond to applying gradient-descent to minimize $J$ with the two gradient expressions (2.8) and (2.10), respectively. The idea of the two algorithms, roughly speaking, is to let the $\theta$-iterates evolve on a slow time-scale and the $x$-iterates on a fast time-scale. As $\theta_n$ varies slowly, the fast-evolving $x$-iterates aim to track the solution $x_{\theta_n}$ of (2.6) for the “current” $\theta$-iterate. The information about the gradient carried by this solution is then used to perform stochastic gradient-descent in the $\theta$-space to minimize $J$.
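To make the update formulae concrete, here is a minimal simulation sketch of GTDa, i.e., of (2.12)-(2.14), on a small hypothetical model with constant λ, constant discount factor, constant stepsizes, and no constraint sets (the paper analyzes primarily constrained versions); all numbers and variable names are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, d = 4, 2
P  = rng.dirichlet(np.ones(nS), size=nS)                   # target-policy transition matrix
Pb = 0.5 * rng.dirichlet(np.ones(nS), size=nS) + 0.5 / nS  # behavior chain: all entries positive
Phi = rng.normal(size=(nS, d))                             # feature matrix
r = rng.normal(size=(nS, nS))                              # expected rewards r(s, s')
gamma, lam = 0.9, 0.2                                      # constant discount and lambda
alpha, beta = 0.001, 0.01                                  # slow (theta) / fast (x) stepsizes

theta, x = np.zeros(d), np.zeros(d)
s = 0
e = Phi[s].copy()                                          # initialize trace e_0 = phi(S_0)
for n in range(20000):
    s_next = int(rng.choice(nS, p=Pb[s]))                  # transition under the behavior chain
    rho = P[s, s_next] / Pb[s, s_next]                     # importance-sampling ratio rho_n
    delta = rho * (r[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta)  # (2.11)
    theta = theta + alpha * rho * (Phi[s] - gamma * Phi[s_next]) * (e @ x)       # (2.13)
    x = x + beta * (e * delta - Phi[s] * (Phi[s] @ x))                           # (2.14)
    e = lam * gamma * rho * e + Phi[s_next]                # trace update (2.12)
    s = s_next
```

Here λ is kept small so the traces remain bounded; the history-dependent scheme of Section 2.1.3 is designed precisely to relax this restriction.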

To gain more intuition and insights about the algorithms, we suggest the reader consult the original derivations given in, e.g., [10, Chap. 7]. (GTDa and GTDb here are called GTD2 and TDC, respectively, in [10], for the case λ = 0. For the case of nonzero λ, GTDb here is called GTD(λ) for value function estimation and GQ(λ) for state-action value function estimation in [10]. More precisely, the GQ(λ) algorithm does not coincide exactly with GTDb for estimating state-action values; it differs from the latter in a term with a conditional mean of zero, which does not make any difference in convergence analysis, however.) We need to also point out, however, that these derivations have issues. For example, they involve various expectations of trace-dependent quantities that are taken for granted to be independent of the time $n$. But before we know the properties of the state-trace process $\{(S_n, e_n)\}$, it is not clear w.r.t. which probability distribution one can define such expectations so that they are independent of $n$. Indeed, even with constant $\lambda_n = \lambda$ for all $n$ and a stationary state process $\{S_n\}$, it does not immediately follow just from these that $\{e_n\}$ has to have a stationary distribution, let alone a unique one.

We shall discuss the properties of the state-trace process $\{(S_n, e_n)\}$ in Section 2.2. As can be seen from (2.12), this process depends on the λ-parameters in the algorithms. So let us first explain the relation between the λ-parameters and the generalized Bellman operators $T^{(\lambda)}$, since this will at least let us complete the definition of the objective function $J$. We will then focus the discussion on the state-trace process and its many properties that will be needed—actually, for some choices of the λ's that we consider, $T^{(\lambda)}$ itself also depends on the properties of this process. As to the connection between the above algorithms and the gradient $\nabla J$, it will be seen first in Prop. 2.1, Section 2.2.2, after we explain $T^{(\lambda)}$ and the ergodicity property of the state-trace process.

#### 2.1.3 Choices of λ-parameters and associated Bellman operators

We will consider in this paper three ways of setting the λ-parameters $\lambda_n$ in (2.12) for the trace iterates $e_n$. We discuss the first two in this subsection (the third one builds upon them and will be discussed in Section 3.4). These two choices are state-dependent λ and a case of history-dependent λ with special properties:

• State-dependent λ [28, 30], where $\lambda_n = \lambda(S_n)$ for a given function $\lambda: S \to [0, 1]$.

• History-dependent λ as introduced in [40], where we choose $\lambda_n$ based on the previous trace $e_{n-1}$ directly, in order to keep $\{e_n\}$ bounded. In particular, we introduce additional memory states $y_n$ to summarize the history of past states up to time $n$. We let $y_n$ evolve in a Markovian way and choose $\lambda_n$ based on the current $y_n$ and the previous trace $e_{n-1}$ as follows:

$$y_n = f(y_{n-1}, S_n), \qquad \lambda_n = \lambda(y_n, e_{n-1}), \tag{2.16}$$

where $f$ and $\lambda(\cdot)$ are some given functions, whose properties will be given shortly.

Although state-dependent λ is a special case of history-dependent λ (e.g., take $y_n = S_n$ in (ii)), generality is not our purpose here. The primary purpose of choosing λ based on (ii), as explained in [40], is to exploit the flexibility of history-dependent λ to bound the traces easily, while allowing for a large range of λ values. The latter is important because the choice of λ affects the generalized Bellman operator $T^{(\lambda)}$ appearing in the objective function $J$, and in turn, this choice of $T^{(\lambda)}$ affects the approximation error. Bounding the traces is also important, as it facilitates convergence of the algorithms. Thus, instead of the most general history-dependent λ, we shall focus on the special case (ii) under additional conditions studied in [40]. These conditions concern the memory states and the function $\lambda(\cdot)$, and they will be needed in the next subsection to ensure certain desired properties of the state-trace process:

###### Condition 2.2 (Evolution of memory states in (2.16)).

The memory states $y_n$ take values in a finite space $\mathcal{Y}$, and under the behavior policy, the Markov chain $\{(S_n, y_n)\}$ on $S \times \mathcal{Y}$ has a single recurrent class.

###### Condition 2.3 (Condition for λ(⋅) in (2.16)).

The function $\lambda(\cdot)$ in (2.16) satisfies the following. For some norm $\|\cdot\|$ on $\mathbb{R}^d$ and for each memory state $y$:

• For any $e, e' \in \mathbb{R}^d$, $\|\lambda(y, e)\, e - \lambda(y, e')\, e'\| \leq \|e - e'\|$.

• For some constant $C_y$, $\|\gamma(s')\, \rho(s, s')\, \lambda(y, e)\, e\| \leq C_y$ for all $e$ and all possible state transitions $(s, s')$ that can lead to the memory state $y$.

Several existing off-policy algorithms—Tree-Backup [21], Retrace [17] and ABQ [14]—choose state- or state-transition-dependent λ accordingly to keep the trace iterates bounded (in fact, these works motivated the history-dependent λ described above). Such choices of λ satisfy the above conditions, since one can simply let $y_n$ be a state or state transition and let $\lambda(\cdot)$ be a function of $y_n$ only. One disadvantage of these choices, however, is that they are too conservative and often result in small values of λ. A few examples of memory states and functions $\lambda(\cdot)$ that satisfy the above conditions are given in [40, Section 2.2].
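The trace-boundedness idea behind these conditions can be sketched as follows. The particular choice of $\lambda(\cdot)$ below is a hypothetical one of ours (not one of the paper's examples): λ is scaled down just enough that the decayed part of the trace update (2.12) stays below a bound C, which keeps the trace bounded no matter how large the importance sampling ratios get:

```python
import numpy as np

def lam_bounding(gamma_next, rho_prev, e_prev, C=5.0):
    # Hypothetical lambda(y, e): the largest lambda in [0, 1] for which the decayed
    # part of the trace update, lambda * gamma * rho * e_prev, has 1-norm at most C.
    decay = gamma_next * rho_prev * np.linalg.norm(e_prev, 1)
    return 1.0 if decay <= C else C / decay

rng = np.random.default_rng(4)
C, gamma = 5.0, 0.95
e = np.zeros(3)
for n in range(1000):
    rho = rng.exponential(2.0)              # importance-sampling ratio, possibly large
    phi = rng.normal(size=3)                # feature vector phi(S_n)
    lam = lam_bounding(gamma, rho, e, C)
    e = lam * gamma * rho * e + phi         # trace update (2.12)
    # the trace never exceeds C plus the norm of the newly added feature vector
    assert np.linalg.norm(e, 1) <= C + np.abs(phi).sum() + 1e-9
```

By contrast, a constant λ near 1 would let the products of the factors $\lambda\gamma\rho$ blow the trace up along unlucky trajectories.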

In the rest of this subsection, let us explain, at a level of detail adequate for the purpose of this paper, what the generalized Bellman operators $T^{(\lambda)}$ for the target policy associated with the preceding two ways of setting λ are. As mentioned above, corresponding to different choices of λ are different Bellman operators $T^{(\lambda)}$ in the objective function $J$. With different $T^{(\lambda)}$, the solutions of the minimization problem are also different and can have different approximation biases (see e.g., [40, Appendix B]). For the purpose of this paper, however, the details of $T^{(\lambda)}$ do not matter, because our focus is on the stochastic approximation aspects of the algorithms, and what we care about is whether the average dynamics of the algorithms can be characterized by mean ODEs that are related to the minimization of $J$ for the associated $T^{(\lambda)}$. The two cases of choosing λ mentioned above share many properties in common. Once their common properties are made clear, as will be done here and in the next subsection, the two cases can be treated together in most of our convergence analysis of the algorithms. For this reason, regarding the Bellman operators $T^{(\lambda)}$, we shall recount only the facts that we will need in this paper (for a detailed study, see the paper [40]).

As mentioned earlier, different choices of λ induce different Bellman operators for the target policy. They are members of a broad family of generalized Bellman operators associated with randomized stopping times [40, Section 3.1], which are all contractive operators that have the target value function v_π as their unique fixed point (see [40, Theorem 3.1 and Appendix A]). Such an operator takes the general form of

$(T_\tau v)(s) = \mathbb{E}_\pi\bigl[\, R_\tau + \gamma_1^\tau\, v(S_\tau) \,\big|\, S_0 = s \,\bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|},$  (2.17)

where $\mathbb{E}_\pi$ denotes expectation over the randomized stopping time τ and the states generated according to the target policy, R_τ is the total discounted reward received prior to the time τ of stopping, S_τ is the state at time τ, and γ_1^τ is a shorthand for the product of discount factors γ(S_1)γ(S_2)⋯γ(S_τ). These generalized Bellman equations and operators are a consequence of the strong Markov property of Markov chains [19, Theorem 3.3]. We refer the reader to [40, Section 3.1] for a fuller account of the framework and the mathematical notions and derivations involved.

For state-dependent λ, T^{(λ)} corresponds to a randomized stopping time τ for the Markov chain {S_n} under the target policy, where τ is such that

$\tau \ge 1, \qquad \mathbb{P}\bigl(\tau = n \,\big|\, \tau > n-1,\, S_0, \ldots, S_n\bigr) = 1 - \lambda(S_n) \quad \text{for } n \ge 1.$

(I.e., the probability of stopping at time n, given that the system has not stopped yet, is 1 − λ(S_n).) The associated operator T^{(λ)} can be expressed in several equivalent ways. Besides the general form (2.17) above, we can write T^{(λ)} as follows (this follows from (2.17) by taking conditional expectation over τ):

$(T^{(\lambda)} v)(s) = \mathbb{E}_\pi\Bigl[\, \sum_{n=0}^{\infty} \lambda_1^n \gamma_1^n\, r_\pi(S_n) + \sum_{n=1}^{\infty} \lambda_1^{n-1} (1 - \lambda_n)\, \gamma_1^n\, v(S_n) \,\Big|\, S_0 = s \Bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|},$  (2.18)

where we used the shorthand notation λ_1^n = λ_1 λ_2 ⋯ λ_n with λ_n = λ(S_n). We can also write T^{(λ)} explicitly in terms of λ(·) and the model parameters as follows (this follows from (2.18) by a direct calculation):

$T^{(\lambda)} v = (I - P\Gamma\Lambda)^{-1} r_\pi + (I - P\Gamma\Lambda)^{-1} P\Gamma (I - \Lambda)\, v,$  (2.19)

where Λ denotes the diagonal matrix with diagonal entries λ(s), s ∈ S. (Thus, the substochastic matrix P^{(λ)} in (2.3) has the explicit expression P^{(λ)} = (I − PΓΛ)^{-1} P\Gamma(I − Λ).)
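To make the matrix form concrete, here is a small numerical sketch with hypothetical, randomly generated model parameters (the names `P`, `Gamma`, `Lam`, `r_pi` are stand-ins for the transition matrix under the target policy, the state-dependent discount and λ matrices, and the expected rewards). It implements (2.19), checks that the value function solving v = r_π + PΓv is its fixed point, and verifies term-by-term consistency with a truncated version of the series form (2.18):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # a small hypothetical state space

# Stand-in model parameters.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transition matrix
Gamma = np.diag(rng.uniform(0.5, 0.9, n))    # state-dependent discount factors
Lam = np.diag(rng.uniform(0.0, 1.0, n))      # state-dependent lambdas
r_pi = rng.random(n)                         # expected one-stage rewards
I = np.eye(n)

inv = np.linalg.inv(I - P @ Gamma @ Lam)

def T_lam(v):
    # Generalized Bellman operator in the matrix form (2.19).
    return inv @ (r_pi + P @ Gamma @ (I - Lam) @ v)

# v_pi solves v = r_pi + P Gamma v; it must be the unique fixed point of T_lam.
v_pi = np.linalg.solve(I - P @ Gamma, r_pi)
assert np.allclose(T_lam(v_pi), v_pi)

# Consistency with the series form (2.18): taking expectations term by term
# turns the two sums into powers of A = P Gamma Lam.
v = rng.random(n)
A = P @ Gamma @ Lam
series, A_n = np.zeros(n), I.copy()
for _ in range(300):
    series += A_n @ r_pi + A_n @ (P @ Gamma @ (I - Lam) @ v)
    A_n = A_n @ A
assert np.allclose(series, T_lam(v), atol=1e-8)
```

The second check is just the geometric-series identity behind (2.19): summing A^n over n ≥ 0 gives (I − PΓΛ)^{-1}, which converges here because the diagonal entries of ΓΛ are strictly below 1.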

In the case of history-dependent λ, it is shown in [40, Section 3.2] under Conditions 2.1-2.3 that T^{(λ)} is also a generalized Bellman operator (for the target policy) corresponding to a certain randomized stopping time τ. But this random time now depends on the behavior policy in a much more complex way than in the case of state-dependent λ. Among others, it depends on the dynamics of the traces under the behavior policy. As such, we generally cannot write T^{(λ)} explicitly in terms of the model parameters and the function λ. We express T^{(λ)} in other ways, in order to relate it to the algorithms that employ history-dependent λ. In particular, an expression of T^{(λ)} similar to (2.18) will be useful in our subsequent analysis:

$(T^{(\lambda)} v)(s) = \mathbb{E}_\pi^\zeta\Bigl[\, \sum_{n=0}^{\infty} \lambda_1^n \gamma_1^n\, r_\pi(S_n) + \sum_{n=1}^{\infty} \lambda_1^{n-1} (1 - \lambda_n)\, \gamma_1^n\, v(S_n) \,\Big|\, S_0 = s \Bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|}.$  (2.20)

In the above, $\mathbb{E}_\pi^\zeta$ denotes expectation with respect to the probability measure of the following process:

• The states S_0, S_1, … are generated under the target policy π.

• For n ≥ 1, the memory state y_n, the parameter λ_n, and the trace e_n are calculated according to (2.16) and (2.12), respectively. (The randomized stopping time τ is generated according to the following rule: τ ≥ 1 and, for n ≥ 1, the probability of stopping at time n given that the system has not stopped yet is 1 − λ_n, which is similar to the case of state-dependent λ. The random time τ does not appear in the expression (2.20) of T^{(λ)}, because (2.20) is an equivalent form of (2.17) after taking conditional expectation over τ.)

• The initial state, memory state and trace (S_0, y_0, e_0) are distributed according to ζ, the unique invariant probability measure of the state-trace process under the behavior policy.

The existence and uniqueness of the invariant probability measure ζ just mentioned is ensured under Condition 2.1 on the two policies and Conditions 2.2-2.3 on the memory states and the function λ. Further details will be explained in the next subsection (see [40, Section 3.2] for the derivations of the preceding results).
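To illustrate how a history-dependent λ can keep the traces bounded, here is a simplified sketch. It is not the paper's exact scheme from (2.16): it assumes the trace recursion (2.12) has the standard off-policy form e_t = φ(S_t) + λ_t γ_t ρ_{t−1} e_{t−1}, and it uses a hypothetical λ-rule that caps the norm of the decayed previous trace at a constant C (the memory state is, in effect, the scalar being capped):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, d, C = 6, 3, 10.0
Phi = rng.random((n_states, d))              # feature vectors phi(s), one row per state
gamma = rng.uniform(0.8, 1.0, n_states)      # state-dependent discount factors
rho = rng.uniform(0.0, 3.0, n_states)        # stand-in importance-sampling ratios

e = np.zeros(d)
max_norm = 0.0
s = 0
for _ in range(10000):
    s_next = rng.integers(n_states)          # hypothetical behavior-policy transition
    a = gamma[s_next] * rho[s] * np.linalg.norm(e)
    lam = 1.0 if a <= C else C / a           # cap the decayed trace's norm at C
    e = Phi[s_next] + lam * gamma[s_next] * rho[s] * e
    max_norm = max(max_norm, np.linalg.norm(e))
    s = s_next

# By construction ||e_t|| <= max_s ||phi(s)|| + C at every step.
assert max_norm <= np.max(np.linalg.norm(Phi, axis=1)) + C + 1e-9
```

With λ ≡ 1 the same simulation would produce trace norms that blow up whenever the products of γρ exceed 1 along a stretch of transitions; the capping rule trades a smaller effective λ on those stretches for a hard bound on the iterates.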

Another expression that will be useful later is an expression of the “Bellman error” T^{(λ)}v − v in terms of temporal-difference terms (this expression follows from (2.20) by rearranging terms):

$(T^{(\lambda)} v - v)(s) = \mathbb{E}_\pi^\zeta\Bigl[\, \sum_{n=0}^{\infty} \lambda_1^n \gamma_1^n \bigl( r_\pi(S_n) + \gamma_{n+1}\, v(S_{n+1}) - v(S_n) \bigr) \,\Big|\, S_0 = s \Bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|}.$  (2.21)

The same expression (ignoring the superscript ζ of $\mathbb{E}_\pi^\zeta$) also holds for the case of state-dependent λ.
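For state-dependent λ, (2.21) can be checked directly in matrix form: taking expectations term by term turns the right-hand side into (I − PΓΛ)^{-1}(r_π + PΓv − v), which coincides with T^{(λ)}v − v computed from (2.19). A small numerical sketch with hypothetical random model parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # transition matrix under the target policy
Gamma = np.diag(rng.uniform(0.5, 0.9, n))    # state-dependent discounts
Lam = np.diag(rng.uniform(0.0, 1.0, n))      # state-dependent lambdas
r_pi = rng.random(n)
I = np.eye(n)
v = rng.random(n)

inv = np.linalg.inv(I - P @ Gamma @ Lam)
T_v = inv @ (r_pi + P @ Gamma @ (I - Lam) @ v)   # T^(lambda) v via (2.19)
td_form = inv @ (r_pi + P @ Gamma @ v - v)       # matrix form of the TD-sum (2.21)
assert np.allclose(T_v - v, td_form)
```

The identity holds because (I − PΓΛ)v subtracted inside the resolvent leaves exactly the expected one-step TD error r_π + PΓv − v.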

### 2.2 Properties of State-Trace Process

The purpose of this subsection is to set the stage for analyzing the asymptotic behavior of the gradient-based TD algorithms using stochastic approximation theory. We collect here important properties of the state-trace process that our subsequent analysis will rely on. Specifically, in Section 2.2.2, we first discuss ergodicity properties of the state-trace process. We then consider several functions on the state-trace space that appear in the TD algorithms, and we derive their expectations w.r.t. the stationary state-trace process, which can be related to expressions of the gradient of the objective function. In Section 2.2.3, we include more properties of the trace iterates, which will be needed later in analyzing the average dynamics of the algorithms and proving their convergence.

Most of the results in this subsection were proved earlier by the author [35, 37, 40]. There are some small differences in the setup of the problem considered in those earlier works; these differences are nonessential and do not affect the conclusions obtained. For clarity, however, we will give additional details to bridge the gap, when this can be done quickly without repeating long proofs.

#### 2.2.1 Some notations and definitions

Let us first introduce some notation and definitions that we will need below and throughout the paper. In most of our analysis, we will treat the two cases of λ together. For brevity, let us collect the conditions given earlier for each case of λ in a single assumption.

###### Assumption 2.1.

Condition 2.1 holds. In the case of history-dependent λ given in (2.16), Conditions 2.2-2.3 also hold.

By the state-trace process, we mean {(S_n, e_n)} for the case of state-dependent λ, and {(S_n, y_n, e_n)} (including the memory states y_n) for the case of history-dependent λ, generated under the behavior policy. The state-trace process is a Markov chain with the weak Feller property—this means that, with x_n = (S_n, e_n) or x_n = (S_n, y_n, e_n) (depending on the case of λ), $\mathbb{E}[f(x_{n+1}) \mid x_n = x]$ is a continuous function of x for any bounded continuous function f [16, Prop. 6.1.1]. (Using the definitions of the traces and memory states, one can verify that this is the case.) Weak Feller Markov chains have nice ergodicity properties [15], which helped us in obtaining some of the ergodicity properties of the state-trace process that will be discussed shortly.

Let 1(·) denote the indicator function. For each initial condition x of (S_0, e_0) or (S_0, y_0, e_0), define random probability measures μ_{x,n}, n ≥ 0, on the state-trace space by

$\mu_{x,n}(D) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{1}\bigl( (S_i, e_i) \in D \bigr) \quad \text{or} \quad \mu_{x,n}(D) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{1}\bigl( (S_i, y_i, e_i) \in D \bigr)$

for all Borel subsets D of the state-trace space. (We take the topology on the state-trace space to be the product topology, with the discrete topology on the space of states/memory states and with the usual topology on the trace space ℝ^d. The Borel sigma-algebra on the state-trace space is generated by this topology.) We refer to them as the occupation probability measures of the state-trace process. Their convergence to the unique invariant probability measure ζ of the state-trace process is crucial for our convergence analysis of the TD algorithms. Here the sense of convergence for these probability measures is weak convergence, defined as follows: if P_n, n ≥ 0, and P are probability measures on the state-trace space and $\mathbb{E}_{P_n}[f] \to \mathbb{E}_P[f]$ as n → ∞ for all bounded continuous functions f, then the sequence {P_n} is said to converge weakly to P.
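To illustrate occupation probability measures in the simplest setting (states only, ignoring the trace component, on a hypothetical 3-state chain), the empirical state frequencies are the occupation measures evaluated on singleton sets, and they converge to the invariant distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
# A hypothetical irreducible 3-state transition matrix.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# Invariant distribution: normalized left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmax(np.real(w))])
xi /= xi.sum()

# Occupation frequencies mu_{x,n}({s}) along one simulated trajectory.
N = 20000
s = 0
counts = np.zeros(3)
for _ in range(N):
    counts[s] += 1
    s = rng.choice(3, p=P[s])
freq = counts / N

assert np.max(np.abs(freq - xi)) < 0.05  # close to the invariant distribution
```

Since the state space here is finite and discrete, weak convergence of the occupation measures reduces to convergence of these frequencies; the trace component of the actual state-trace process is what makes weak (rather than, say, total-variation) convergence the appropriate notion.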

Let z_n = (S_n, e_n, S_{n+1}) in the case of state-dependent λ, and z_n = (S_n, y_n, e_n, S_{n+1}) in the case of history-dependent λ. Denote the space of z_n in each case by the same notation Z. (This is to prepare for handling temporal-difference terms, which involve state transitions.) Among (vector-valued) functions on Z, a set of them will be important in our analysis: these are functions that are Lipschitz continuous in the trace variable, uniformly w.r.t. the other components of z. We have z = (s, e, s′) or z = (s, y, e, s′) (depending on the case of λ), and since the state space and the memory state space are finite, such a function is just one that is Lipschitz continuous in e for each fixed value of the other components. So in what follows, when referring to such a function f, we will simply say f is Lipschitz continuous in the trace variable e.

Regarding other notation and terminology, the abbreviation “a.s.” stands for “almost surely,” and “P_x-a.s.” for “almost surely w.r.t. the probability measure P_x,” where the subscript x indicates that the process under consideration starts from the initial condition x. We shall use ‖·‖_∞ to denote the sup-norm and ‖·‖ the standard Euclidean norm. For a sequence of random variables {X_n}, we say X_n converges in mean to a random variable X if $\mathbb{E}[\|X_n - X\|] \to 0$ as n → ∞.

#### 2.2.2 Ergodicity properties

The ergodicity results given in Theorem 2.1 below are proved essentially in [35, Theorems 3.1, 3.3] for the case of state-dependent λ, and in [40, Theorem 3.2] for the case of history-dependent λ we consider. (The paper [35] analyzed the state-trace process in the case of constant λ. Its proof arguments and conclusions, however, extend to the case of state-dependent λ under Condition 2.1. In fact, this extension was incorporated in the convergence analysis of the more complex ETD algorithm [36], and that is why we do not repeat here the proof arguments for this extension. The paper [40] analyzed the state-trace process in the case of history-dependent λ under Assumption 2.1.) The first part of the theorem concerns the existence and uniqueness of an invariant probability measure of the state-trace process, and the convergence of the occupation probability measures. Without these ergodicity properties, the behavior of the TD algorithms would be quite different, indeed much more complex, and would have to be analyzed using more advanced stochastic approximation theory for differential inclusions (which is beyond the scope of the present paper).

The second part of the theorem will be used, among other things, to characterize the average dynamics of the algorithms. We need it in the case of state-dependent λ. In that case, the functions appearing in the algorithms can be unbounded, but they have the Lipschitz continuity property required in the theorem. Note that for bounded continuous functions, by the weak convergence of occupation probability measures given in Theorem 2.1(i), the conclusions in part (ii) automatically hold. (However, even in the case of history-dependent λ, we will still need the functions to be Lipschitz continuous in the trace variable, in order to show that they satisfy a certain “averaging condition” that is stronger than the convergence-in-mean ensured by Theorem 2.1 and is needed in the subsequent convergence analysis. See Prop. 2.3(i) and Remark 2.1 in Section 2.2.3.)

###### Theorem 2.1 (Ergodicity of the state-trace process).

Under Assumption 2.1, the following hold:

• The state-trace process is a weak Feller Markov chain and has a unique invariant probability measure ζ. For each initial condition x of the process, the occupation probability measures {μ_{x,n}} converge weakly to ζ, P_x-a.s.

• Let $\mathbb{E}_\zeta$ denote expectation w.r.t. the stationary state-trace process with initial distribution ζ. Then $\mathbb{E}_\zeta[\|f\|] < \infty$ for any vector-valued function f that is Lipschitz continuous in the trace variable. Furthermore, for such a function f, given each initial condition x of the process, as n → ∞, the empirical averages of f along the trajectory converge to $\mathbb{E}_\zeta[f]$, both in mean and almost surely.

Next, w.r.t. the stationary state-trace process with initial distribution ζ, we derive expressions of the expectation $\mathbb{E}_\zeta[\,\cdot\,]$ for several functions involved in the GTD algorithms. These expressions are related to the expressions (2.7)-(2.10) of the gradient and will appear in the mean ODEs associated with the algorithms. To state the results concisely, let

$\bar\delta(s, s', v) = \rho(s)\bigl( r(s, s') + \gamma(s')\, v(s') - v(s) \bigr), \qquad \bar\delta_0(v) = \bar\delta(S_0, S_1, v).$

(Recall that r(s,s′) is the mean reward associated with the transition (s,s′); cf. Section 2.1.2. The above are temporal-difference terms without noise in the rewards.) Recall also that P^{(λ)} is the substochastic matrix in the generalized Bellman operator T^{(λ)} (cf. (2.3)), φ_i is the i-th component of the function φ, and Φ_i the i-th column of the matrix Φ.

###### Proposition 2.1.

Under Assumption 2.1, we have

$\mathbb{E}_\zeta\bigl[ \phi(S_0)\phi(S_0)^\top \bigr] = \Phi^\top \Xi \Phi,$  (2.22)
$\mathbb{E}_\zeta\bigl[ e_0\, \bar\delta_0(v) \bigr] = \Phi^\top \Xi \bigl( T^{(\lambda)} v - v \bigr), \qquad \forall\, v \in \mathbb{R}^{|\mathcal{S}|},$  (2.23)
$\mathbb{E}_\zeta\bigl[ e_0 \cdot \rho_0 \bigl( \phi_i(S_0) - \gamma_1 \phi_i(S_1) \bigr) \bigr] = \Phi^\top \Xi \bigl( I - P^{(\lambda)} \bigr) \Phi_i, \qquad 1 \le i \le d,$  (2.24)
$\mathbb{E}_\zeta\bigl[ e_0 \cdot \rho_0 (1 - \lambda_1)\, \gamma_1\, \phi_i(S_1) \bigr] = \Phi^\top \Xi\, P^{(\lambda)} \Phi_i, \qquad 1 \le i \le d.$  (2.25)

To prove this proposition, it is convenient to extend the stationary state-trace process, whose time is indexed by n ≥ 0, to a double-ended stationary state-trace process {(S_n, e_n)} or {(S_n, y_n, e_n)} with −∞ < n < ∞. Let P_ζ denote the probability measure of the latter process. We shall keep using $\mathbb{E}_\zeta$ to denote expectation with respect to P_ζ. The following lemma gives an expression of the trace e_0 in this stationary process. It will facilitate our calculation of $\mathbb{E}_\zeta[\,\cdot\,]$ for the various functions in the proposition.

Regarding notation in the lemma and in what follows, for m ≤ n, let $\lambda_m^n = \prod_{i=m}^n \lambda_i$, $\gamma_m^n = \prod_{i=m}^n \gamma_i$, and $\rho_m^n = \prod_{i=m}^n \rho_i$, and in addition, adopt the convention that these products equal 1 if m > n. Let $\mathbb{1}$ denote the |S|-dimensional vector of all 1’s.

###### Lemma 2.1 (An expression for stationary traces).

Let Assumption 2.1 hold. Then, P_ζ-almost surely, the infinite series below is well-defined and finite, and

$e_0 = \phi(S_0) + \sum_{n=1}^{\infty} \lambda_{1-n}^{0}\, \gamma_{1-n}^{0}\, \rho_{-n}^{-1}\, \phi(S_{-n}).$  (2.26)
###### Proof.

For the case of history-dependent λ we consider, this is proved in [40, Lemma 3.1]. We give the proof for the case of state-dependent λ. The beginning part of the proof is the same as that of [40, Lemma 3.1] and similar to that of [35, Lemma 4.2] for the case of constant λ. Under Condition 2.1(i), $\sum_{n=1}^{\infty} (P\Gamma)^n < \infty$ (componentwise), and therefore,

$\mathbb{E}_\zeta\Bigl[ \sum_{n=1}^{\infty} \gamma_{1-n}^{0}\, \rho_{-n}^{-1} \Bigr] = \sum_{n=1}^{\infty} \mathbb{E}_\zeta\bigl[ \gamma_{1-n}^{0}\, \rho_{-n}^{-1} \bigr] = \sum_{n=1}^{\infty} \xi^\top (P\Gamma)^n \mathbb{1} < \infty,$

where the first equality follows from the monotone convergence theorem, and the second equality follows from a direct calculation together with the fact that the marginal of ζ on S coincides with ξ, the unique invariant probability measure of the Markov chain {S_n} under Condition 2.1(ii). Since $\lambda_{1-n}^0 \le 1$ for all n, the above relation implies that

$\mathbb{E}_\zeta\Bigl[ \sum_{n=1}^{\infty} \lambda_{1-n}^{0}\, \gamma_{1-n}^{0}\, \rho_{-n}^{-1}\, \|\phi(S_{-n})\| \Bigr] \le \max_{s \in \mathcal{S}} \|\phi(s)\| \cdot \mathbb{E}_\zeta\Bigl[ \sum_{n=1}^{\infty} \gamma_{1-n}^{0}\, \rho_{-n}^{-1} \Bigr] < \infty.$  (2.27)

It then follows from a theorem on integration [25, Theorem 1.38, pp. 28-29] that, P_ζ-almost surely, the infinite series in (2.26) converges to a finite limit.
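The unfolding behind (2.26) can be sanity-checked numerically. The sketch below assumes the trace recursion (2.12) has the form e_t = φ(S_t) + λ_t γ_t ρ_{t−1} e_{t−1} and compares m steps of the recursion, run forward from time −m, against the corresponding truncated sum in (2.26); the sequences are arbitrary hypothetical numbers standing in for a sampled trajectory:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 50, 3
# Array index k = 0..m corresponds to time t = k - m (so k = m is time 0).
lam = rng.uniform(0.0, 1.0, m + 1)   # lambda_t
gam = rng.uniform(0.5, 1.0, m + 1)   # gamma_t
rho = rng.uniform(0.0, 2.0, m + 1)   # rho_t
phi = rng.random((m + 1, d))         # phi(S_t)

# Run the assumed recursion forward from e_{-m} = phi(S_{-m}).
e = phi[0].copy()
for k in range(1, m + 1):
    e = phi[k] + lam[k] * gam[k] * rho[k - 1] * e

# Truncated closed form of (2.26):
# e_0 = phi(S_0) + sum_{n=1}^{m} lambda_{1-n}^0 gamma_{1-n}^0 rho_{-n}^{-1} phi(S_{-n}).
e0 = phi[m].copy()
for n in range(1, m + 1):
    coef = np.prod(lam[m - n + 1:]) * np.prod(gam[m - n + 1:]) * np.prod(rho[m - n:m])
    e0 += coef * phi[m - n]

assert np.allclose(e, e0)
```

The agreement is exact (up to floating point) for any finite m; the lemma's content is that, in the stationary double-ended process, the remainder term carried by the recursion vanishes as m → ∞, which is what (2.27)-(2.28) establish.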

We now prove (2.26). Since, under Condition 2.1(i), $(P\Gamma)^m$ converges to the zero matrix as m → ∞, it follows from an argument very similar to the above that

$\mathbb{E}_\zeta\Bigl[ \bigl\| \sum_{n=m}^{\infty} \lambda_{1-n}^{0}\, \gamma_{1-n}^{0}\, \rho_{-n}^{-1}\, \phi(S_{-n}) \bigr\| \Bigr] \to 0, \quad \text{as } m \to \infty.$  (2.28)

Unfolding the iteration (2.12) for the traces backwards in time, we have that for all m ≥ 1,

 e0=ϕ(S0)+∑m−1n=1λ01−nγ01−nρ−1−nϕ(S−n)+λ01−mγ0