1 Introduction
In reinforcement learning, a common strategy to learn an optimal policy is to iteratively estimate the value function for the current decision making policy—called
policy evaluation—and then update the policy using the estimated values. The overall efficiency of this policy iteration scheme is directly influenced by the efficiency of the policy evaluation step. Temporal difference learning methods perform policy evaluation: they estimate the value function directly from the sequence of states, actions, and rewards produced by an agent interacting with an unknown environment. The family of temporal difference methods spans a spectrum from computationally frugal, linear, stochastic approximation methods to data-efficient but quadratic least-squares TD methods. Stochastic approximation methods, such as temporal difference (TD) learning [Sutton1988] and gradient TD methods [Maei2011]
perform approximate gradient descent on the mean squared projected Bellman error (MSPBE). These methods require computation and memory that are linear in the number of features per time step. These linear TD-based algorithms are well suited to problems with high-dimensional feature vectors—compared to available resources—and domains where agent interaction occurs at a high rate
[Szepesvari2010]. When the amount of data is limited or difficult to acquire, the feature vectors are small, or data efficiency is of primary concern, quadratic least-squares TD (LSTD) methods may be preferred. These methods directly compute the value function that minimizes the MSPBE, and thus LSTD computes the same value function to which linear TD methods converge. Of course, there are many domains for which neither lightweight linear TD methods nor data-efficient least-squares methods are a good match. Significant effort has focused on reducing the computation and storage costs of least-squares TD methods in order to span the gap between TD and LSTD. The iLSTD method [Geramifard and Bowling2006] achieves subquadratic computation per time step, but still requires memory that is quadratic in the size of the features. The tLSTD method [Gehring et al.2016]
uses an incremental singular value decomposition (SVD) to achieve both subquadratic computation and storage. The basic idea is that in many domains the update matrix in LSTD can be replaced with a low-rank approximation. In practice, tLSTD achieves runtimes much closer to TD than iLSTD does, while achieving better data efficiency. A related idea is to use random projections to reduce the computation and storage of LSTD
[Ghavamzadeh et al.2010]. In all these approaches, a scalar parameter (descent dimensions, rank, or number of projections) controls the balance between computational cost and quality of solution. In this paper we explore a new approach, called Accelerated gradient TD (ATD), that performs quasi-second-order gradient descent on the MSPBE. Our aim is to develop a family of algorithms that can interpolate between linear TD methods and LSTD without incurring bias. ATD, when combined with a low-rank approximation, converges in expectation to the TD fixed point, with convergence rate dependent on the choice of rank. Unlike previous subquadratic methods, consistency is guaranteed even when the rank is chosen to be one. We demonstrate the performance of ATD against many linear and subquadratic methods in three domains, indicating that ATD (1) can match the data efficiency of LSTD with significantly less computation and storage, (2) is unbiased, unlike many of the alternative subquadratic methods, (3) significantly reduces sensitivity to the step-size parameter compared to linear TD methods, and (4) is significantly less sensitive to the choice of rank parameter than tLSTD, enabling a smaller rank to be chosen and so providing a more efficient incremental algorithm. Overall, the results suggest that ATD may be the first practical subquadratic-complexity TD method suitable for fully incremental policy evaluation.
2 Background and Problem Formulation
In this paper we focus on the problem of policy evaluation, or that of learning a value function given a fixed policy. We model the interaction between an agent and its environment as a Markov decision process $(\mathcal{S}, \mathcal{A}, P)$, where $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the set of actions, and $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, \infty)$ encodes the one-step state transition dynamics. On each discrete time step $t$, the agent selects an action $A_t$ according to its behavior policy $\mu$, with $A_t \sim \mu(S_t, \cdot)$, and the environment responds by transitioning into a new state $S_{t+1}$ according to $P$ and emits a scalar reward $R_{t+1}$. The objective under policy evaluation is to estimate the value function, $v : \mathcal{S} \to \mathbb{R}$, as the expected return from each state under some target policy $\pi$: $v(s) := \mathbb{E}_\pi[G_t \mid S_t = s]$,
where $\mathbb{E}_\pi$ denotes the expectation, defined over the future states encountered while selecting actions according to $\pi$. The return, denoted by $G_t \in \mathbb{R}$, is the discounted sum of future rewards given that actions are selected according to $\pi$:
(1)  $G_t := R_{t+1} + \gamma_{t+1} G_{t+1}$
where $\gamma_{t+1} \in [0, 1]$ is a scalar that depends on $S_t, A_t, S_{t+1}$ and discounts the contribution of future rewards exponentially with time. The generalization to transition-based discounting enables the unification of episodic and continuing tasks [White2016], and so we adopt it here. In the standard continuing case, $\gamma_t = \gamma_c$ for some constant $\gamma_c \in [0, 1)$; in the standard episodic setting, $\gamma_t = 1$ until the end of an episode, at which point $\gamma_{t+1} = 0$, ending the infinite sum in the return. In the most common on-policy evaluation setting, $\pi = \mu$; otherwise $\pi \neq \mu$ and the policy evaluation problem is said to be off-policy.
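As a small worked sketch (ours, not from the paper), the recursion $G_t = R_{t+1} + \gamma_{t+1} G_{t+1}$ can be evaluated backwards over a recorded trajectory, with a zero discount terminating an episode:

```python
def discounted_return(rewards, gammas):
    """Compute G_0 via the recursion G_t = R_{t+1} + gamma_{t+1} G_{t+1}.

    gammas[i] is the transition-based discount applied to the return
    that follows rewards[i]; a value of 0 ends the episode, unifying
    episodic and continuing tasks as described in the text."""
    G = 0.0
    for r, g in zip(reversed(rewards), reversed(gammas)):
        G = r + g * G
    return G
```

For example, with constant discount 0.5 and a terminating transition, `discounted_return([1.0, 1.0, 1.0], [0.5, 0.5, 0.0])` evaluates the three-step episodic return.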
In domains where the number of states is too large or the state is continuous, it is not feasible to learn the value of each state separately, and we must generalize values between states using function approximation. In the case of linear function approximation, the state is represented by fixed-length feature vectors $x : \mathcal{S} \to \mathbb{R}^d$, where $x_t := x(S_t)$, and the approximation to the value function is formed as a linear combination of a learned weight vector $w \in \mathbb{R}^d$ and $x_t$: $\hat{v}(S_t) := x_t^\top w$. The goal of policy evaluation is to learn $w$ from samples generated while following $\mu$.
The objective we pursue towards this goal is to minimize the mean-squared projected Bellman error (MSPBE):
(2)  $\text{MSPBE}(w) := \| b - A w \|^2_{C^{-1}} = (b - A w)^\top C^{-1} (b - A w)$
where $m : \mathcal{S} \to [0, \infty)$ is a weighting function, $C := \mathbb{E}_\mu[x_t x_t^\top]$, $A := \mathbb{E}_\mu[e_t (x_t - \gamma_{t+1} x_{t+1})^\top]$, and $b := \mathbb{E}_\mu[R_{t+1} e_t]$,
with $b - A w = \mathbb{E}_\mu[\delta_t(w)\, e_t]$ for TD-error $\delta_t(w) := R_{t+1} + \gamma_{t+1} x_{t+1}^\top w - x_t^\top w$. The vector $e_t \in \mathbb{R}^d$ is called the eligibility trace,
$e_t := \rho_t (\gamma_t \lambda e_{t-1} + m_t x_t)$,
where $\lambda \in [0, 1]$ is called the trace-decay parameter and $d_\mu$ is the stationary distribution induced by following $\mu$. The importance sampling ratio $\rho_t := \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)}$ reweights samples generated by $\mu$ to give an expectation over $\pi$.
This reweighting enables $v_\pi$ to be learned from samples generated by $\mu$ (under off-policy sampling).
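As a concrete sketch (ours, not the paper's pseudocode; names follow the definitions above), one step of linear off-policy TD($\lambda$) combines the trace update, the TD-error, and the weight update:

```python
import numpy as np

def td_lambda_step(w, e, x, x_next, r, gamma, lam, rho, alpha):
    """One linear TD(lambda) step with importance sampling.

    A minimal sketch assuming a single discount `gamma` for both the
    trace and the TD-error (transition-based discounts would pass
    per-step values). `rho` is the importance sampling ratio."""
    e = rho * (gamma * lam * e + x)                 # eligibility trace
    delta = r + gamma * (x_next @ w) - x @ w        # TD-error
    w = w + alpha * delta * e                       # weight update
    return w, e
```

With $\rho_t = 1$ this reduces to the on-policy TD($\lambda$) update discussed next.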
The most well-studied weighting occurs when $m_t = 1$ (i.e., the weighting is $d_\mu$). In the on-policy setting, with $\pi = \mu$, $\rho_t = 1$ for all $t$, and the $w$ that minimizes the MSPBE is the same as the $w$ found by the on-policy temporal difference learning algorithm called TD($\lambda$). More recently, a new emphatic weighting was introduced with the emphatic TD (ETD) algorithm. This weighting includes long-term information about $\pi$ (see [Sutton et al.2016, Pg. 16]).
Importantly, the matrix $A$ induced by the emphatic weighting is positive semi-definite [Yu2015, Sutton et al.2016], which we will later use to ensure convergence of our algorithm under both on- and off-policy sampling. The $A$ used by TD($\lambda$) is not necessarily positive semi-definite, and so TD($\lambda$) can diverge when $\pi \neq \mu$ (off-policy).
Two common strategies to obtain the minimum of this objective are stochastic temporal difference techniques, such as TD($\lambda$) [Sutton1988], or directly approximating the linear system and solving for the weights, such as in LSTD($\lambda$) [Boyan1999]. The first class constitutes linear-complexity methods, in both computation and storage, including the family of gradient TD methods [Maei2011], true online TD methods [van Seijen and Sutton2014, van Hasselt et al.2014], and several others (see [Dann et al.2014, White and White2016] for a more complete summary). On the other extreme, with quadratic computation and storage, one can approximate $A$ and $b$ incrementally and solve the system $A w = b$. Given a batch of $t$ samples, one can estimate
$A_t := \frac{1}{t} \sum_{i=1}^{t} e_i (x_i - \gamma_{i+1} x_{i+1})^\top$ and $b_t := \frac{1}{t} \sum_{i=1}^{t} R_{i+1} e_i$, and then compute the solution $w$ such that $A_t w = b_t$. Least-squares TD methods are typically implemented incrementally using the Sherman-Morrison formula, requiring $O(d^2)$ storage and computation per step.
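The Sherman-Morrison implementation can be sketched as follows (a hypothetical minimal version, assuming the standard rank-one form of the LSTD update; the initialization $A^{-1}_0 = I/\epsilon$ plays the role of the regularizer discussed later):

```python
import numpy as np

def lstd_update(A_inv, b, e, x, x_next, r, gamma_next):
    """One incremental LSTD step: maintain A_t^{-1} directly via the
    Sherman-Morrison identity for the rank-one update A + e d^T,
    giving O(d^2) time per step instead of O(d^3) re-inversion."""
    d = x - gamma_next * x_next          # feature-difference vector
    Ae = A_inv @ e                       # A^{-1} u   with u = e
    dA = d @ A_inv                       # v^T A^{-1} with v = d
    A_inv = A_inv - np.outer(Ae, dA) / (1.0 + d @ Ae)
    b = b + r * e
    return A_inv, b                      # value weights: w = A_inv @ b
```

At any point the LSTD solution is recovered as `w = A_inv @ b`.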
Our goal is to develop algorithms that interpolate between these two extremes, which we discuss in the next section.
3 Algorithm derivation
To derive the new algorithm, we first take the gradient of the MSPBE (in (2)) to get
(3)  $-\tfrac{1}{2} \nabla_w \text{MSPBE}(w) = A^\top C^{-1} (b - A w)$
Consider a second-order update, obtained by computing the Hessian of the MSPBE, $H := A^\top C^{-1} A$. For invertible $H$, the second-order update is
$w_{t+1} = w_t + \alpha_t H^{-1} A^\top C^{-1} (b - A w_t)$.
In fact, for our quadratic loss, the optimal descent direction is $A^{-1}(b - A w_t)$ with $\alpha_t = 1$, in the sense that $w_t + A^{-1}(b - A w_t) = A^{-1} b$. Computing the Hessian and updating
requires quadratic computation, and in practice quasi-Newton approaches are used that approximate the Hessian. Additionally, there have been recent insights that using approximate Hessians for stochastic gradient descent can in fact speed convergence
[Schraudolph et al.2007, Bordes et al.2009, Mokhtari and Ribeiro2014]. These methods maintain an approximation to the Hessian and sample the gradient. This Hessian approximation provides curvature information that can significantly speed convergence, as well as reduce parameter sensitivity to the step-size. Our objective is to improve on the sample efficiency of linear TD methods, while avoiding both quadratic computation and asymptotic bias. First, we need an approximation to $A$ that provides useful curvature information, but that is also subquadratic in storage and computation. Second, we need to ensure that the approximation, $\hat{A}$, does not lead to a biased solution.
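For the quadratic loss above, the optimality of the descent direction $A^{-1}(b - A w_t)$ with $\alpha_t = 1$ is one line of algebra (our reconstruction of the elided display, writing $w^{\ast}$ for the TD fixed point with $A w^{\ast} = b$):

```latex
w_t + A^{-1}(b - A w_t) \;=\; w_t + A^{-1} b - w_t \;=\; A^{-1} b \;=\; w^{\ast}
```

so a single step with this direction reaches the fixed point exactly.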
We propose to achieve this by approximating only $A$ and sampling $b - A w_t$ using the unbiased sample $\delta_t e_t$. The proposed accelerated temporal difference learning update—which we call ATD($\lambda$)—is
$w_{t+1} = w_t + (\alpha_t \hat{A}_t^\dagger + \eta I)\, \delta_t e_t$,
with expected update
(4)  $w_{t+1} = w_t + (\alpha_t \hat{A}^\dagger + \eta I)(b - A w_t)$
with regularization $\eta > 0$. If $\hat{A}$ is a poor approximation of $A$, or discards key information—as we will do with a low-rank approximation—then updating using only $\hat{A}$ will result in a biased solution, as is the case for tLSTD [Gehring et al.2016, Theorem 1]. Instead, sampling $b - A w_t$ with $\delta_t e_t$, as we show in Theorem 1, yields an unbiased solution, even with a poor approximation $\hat{A}$. The regularization $\eta I$ is key to ensuring this consistency, by providing a full-rank preconditioner $\alpha_t \hat{A}^\dagger + \eta I$.
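A sketch of the resulting update, assuming $\hat{A}$ is maintained as a truncated SVD $U \Sigma V^\top$ (the exact step-size placement follows our reading of the update above; with rank zero it reduces to a TD update with step-size $\eta$):

```python
import numpy as np

def atd_step(w, U, S, Vt, delta, e, alpha, eta):
    """One ATD-style update: precondition the sampled TD update
    delta*e with the pseudo-inverse of a rank-k approximation
    A_hat = U @ diag(S) @ Vt, plus an eta*I regularizer.

    Illustrative sketch, not the paper's pseudocode."""
    g = delta * e                              # unbiased sample of b - A w
    if S.size > 0:
        # A_hat^+ g = Vt^T diag(1/S) U^T g: only O(dk) work
        precond = Vt.T @ ((U.T @ g) / S)
    else:
        precond = np.zeros_like(w)             # k = 0: pure TD update
    return w + alpha * precond + eta * g
```

The parenthesization in the preconditioned term is what keeps the cost at a sequence of matrix-vector products rather than forming any $d \times d$ matrix.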
Given the general form of ATD($\lambda$), the next question is how to approximate $A$. Two natural choices are a diagonal approximation and a low-rank approximation. Storing and using a diagonal approximation would require only linear time and space. For a low-rank approximation of rank $k$, represented with a truncated singular value decomposition $\hat{A} = U \Sigma V^\top$, the storage requirement is $O(dk)$ and the required matrix-vector multiplications are only $O(dk)$, because for any vector $v$, $\hat{A}^\dagger v = V \Sigma^{-1} (U^\top v)$ is a sequence of matrix-vector multiplications. Exploratory experiments revealed that the low-rank approximation significantly outperformed the diagonal approximation. In general, however, many other approximations to $A$ could be used, which is an important direction for ATD.
We opt for an incremental SVD, which previously proved effective for incremental estimation in reinforcement learning [Gehring et al.2016]. The total computational complexity of the algorithm remains subquadratic in $d$, both for the fully incremental update to $\hat{A}$ and for mini-batch updates. Notice that when $k = 0$, the algorithm reduces exactly to TD($\lambda$), where $\eta$ is the step-size. On the other extreme, when $k = d$, ATD is equivalent to an iterative form of LSTD($\lambda$). See the appendix for further discussion and implementation details.
4 Convergence of ATD($\lambda$)
As with previous convergence results for temporal difference learning algorithms, the first key step is to prove that the expected update converges to the TD fixed point. Unlike previous proofs of convergence in expectation, we do not require the true $A$ to be full rank. This generalization is important because, as shown previously, $A$ is often low-rank even if the features are linearly independent [Bertsekas2007, Gehring et al.2016]. Further, ATD should be more effective when $A$ is low-rank, and so requiring a full-rank $A$ would limit the typical use-cases for ATD.
To convey the main idea, we first prove convergence of ATD under weightings that give a positive semi-definite $A$; a more general proof for other weightings is in the appendix.
Assumption 1.
$A$ is diagonalizable: that is, there exists an invertible matrix $Q$
with normalized columns (eigenvectors) and a diagonal matrix
$\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_j \ge 0$, such that $A = Q \Lambda Q^{-1}$. Assume the ordering $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d$.
Assumption 2.
and .
Finally, we introduce an assumption that is used only to characterize the convergence rate. This condition has been used previously [Hansen1990, Gehring et al.2016] to enforce a level of smoothness on the system.
Assumption 3.
The linear system defined by $A$ and $b$ satisfies the discrete Picard condition: for some $p \ge 1$, the coefficients of $b$ in the basis of eigenvectors of $A$ are bounded by $\lambda_j^p$ for all $j$.
Theorem 1.
Under Assumptions 1 and 2, for any $k \in \{0, \ldots, d\}$, let $\hat{A}$ be the rank-$k$ approximation of $A$, $\hat{A} = Q \hat{\Lambda} Q^{-1}$, with $\hat{\lambda}_j = \lambda_j$ for $j \le k$ and zero otherwise. For appropriately bounded $\alpha$ and $\eta$ (made precise in the proof), the expected updating rule in (4) converges to the TD fixed point.
Further, if Assumption 3 is satisfied, the convergence rate is
Proof: We use a general result about stationary iterative methods, which is applicable to the case where $A$ is not full rank. Theorem 1.1 of [Shi et al.2011] states that given a singular and consistent linear system $A w = b$, where $b$ is in the range of $A$, the stationary iteration with preconditioner $B$, for $t = 0, 1, \ldots$,
(5)  $w_{t+1} = w_t + B (b - A w_t) = (I - BA) w_t + B b$
converges to the solution if and only if the following three conditions are satisfied.

Condition I: the eigenvalues of $I - BA$ are equal to 1 or have absolute value strictly less than 1.

Condition II: $\text{rank}(BA) = \text{rank}((BA)^2)$.

Condition III: $\text{nullspace}(BA) = \text{nullspace}(A)$.
We verify these conditions to prove the result. First, because we use the projected Bellman error, $b$ is in the range of $A$ and the system is consistent: there exists a $w^\ast$ such that $A w^\ast = b$.
To rewrite our updating rule (4) so that it is expressible in terms of (5), let $B := \alpha \hat{A}^\dagger + \eta I$, giving
(6)  $BA = Q (\alpha \hat{\Lambda}^\dagger + \eta I) \Lambda Q^{-1} = Q (\alpha I_k + \eta \Lambda) Q^{-1}$
where $I_k$ is a diagonal matrix with the first $k$ diagonal entries set to 1, and the rest zero.
Proof for condition I. Using (6), $I - BA$ is diagonalized by $Q$, with diagonal entries $1 - \alpha - \eta \lambda_j$ for $j \le k$ and $1 - \eta \lambda_j$ otherwise.
To bound the maximum absolute value of these diagonal entries,
we consider each eigenvalue $\lambda_j$ of $A$, and address two cases.
Because $A$ is positive semi-definite for the assumed weighting [Sutton et al.2016],
$\lambda_j \ge 0$ for all $j$.
Case 1: $j \le k$. The diagonal entry is $1 - \alpha - \eta \lambda_j$, which has absolute value less than 1 for the assumed $\alpha$ and $\eta$.
Case 2: $j > k$. The diagonal entry is $1 - \eta \lambda_j$, which equals 1 if $\lambda_j = 0$ and otherwise has absolute value less than 1,
which is true for $\eta < 2 / \lambda_{k+1}$,
for any such $j$.
Proof for condition II. Squaring $BA$ does not change the number of positive eigenvalues, so the rank is unchanged: $\text{rank}(BA) = \text{rank}((BA)^2)$.
Proof for condition III. To show that the nullspaces of $BA$ and $A$ are equal,
it is sufficient to prove that $BAw = 0$ if and only if $Aw = 0$.
The preconditioner $B = \alpha \hat{A}^\dagger + \eta I$
is invertible, because its eigenvalues, $\alpha / \lambda_j + \eta$ for $j \le k$ and $\eta$ for $j > k$, are all positive.
For any $w$ with $Aw = 0$, we get $BAw = B\,0 = 0$, and
so $\text{nullspace}(A) \subseteq \text{nullspace}(BA)$.
For any $w$ with $BAw = 0$, invertibility of $B$ gives $Aw = 0$, and so $\text{nullspace}(BA) \subseteq \text{nullspace}(A)$.
Convergence rate.
Assume . On each step, we update with
. This can be verified inductively, where
For , because ,
and because ,
where because has normalized columns.
For , we have that the magnitude of the values in are
For , we get .
Under the discrete Picard condition, and so the denominator cancels, giving the desired result.
This theorem gives insight into the utility of ATD for speeding convergence, as well as the effect of the rank $k$. Consider TD(0), which has positive definite $A$ in on-policy learning [Sutton1988, Theorem 2]. The above theorem guarantees that ATD converges to the TD fixed point for any $k$. For $k = 0$, the expected ATD update is exactly the expected TD update. Now we can compare the convergence rates of TD and ATD, using the above convergence rate.
Take, for instance, the setting $\alpha = 1$ for ATD, which is common for second-order methods. In early learning, the convergence rate for TD is dominated by the largest eigenvalues of $A$, which are largest relative to the rest for small $t$. ATD, on the other hand, for a larger $k$, can pick a smaller $\eta$ and so obtain a much smaller rate constant. As $k$ gets smaller, $\lambda_{k+1}$ becomes larger, slowing convergence. For low-rank domains, however, $k$ could be quite small and the preconditioner could still improve the convergence rate in early learning—potentially significantly outperforming TD.
ATD is a quasi-second-order method, meaning that sensitivity to parameters should be reduced, and thus it should be simpler to set the parameters. The convergence rate provides the intuition that, for reasonably chosen $k$, the regularizer $\eta$ should be small—smaller than a typical step-size for TD. Additionally, because ATD performs a stochastic update, not the expected update, we make use of typical conventions from stochastic gradient descent to set our parameters. We use a decaying step-size schedule, as in previous stochastic second-order methods [Schraudolph et al.2007], and set $\eta$ to a small fixed value. Our choice for $\eta$ represents a small final step-size, matching the convergence-rate intuition.
On the bias of subquadratic methods.
The ATD($\lambda$) update was derived to ensure convergence to the minimum of the MSPBE, in either the on-policy or off-policy setting. Our algorithm summarizes past information, in $\hat{A}$, to improve the convergence rate, without requiring quadratic computation and storage. Prior work aspired to the same goal; however, the resulting algorithms are biased. The iLSTD algorithm can be shown to converge for a specific class of feature-selection mechanisms [Geramifard et al.2007, Theorem 2]; this class, however, does not include the greedy mechanism that the iLSTD algorithm actually uses to select a descent direction. The random-projections variant of LSTD [Ghavamzadeh et al.2010] can significantly reduce the computational complexity compared with conventional LSTD, but the reduction comes at the cost of an increase in the approximation error [Ghavamzadeh et al.2010]. Fast LSTD [Prashanth et al.2013] performs randomized TD updates on a batch of data; this algorithm could be run incrementally by using mini-batches. Though it has a nice theoretical characterization, this algorithm is restricted to $\lambda = 0$. Finally, the most closely related algorithm is tLSTD, which also uses a low-rank approximation to $A$. However, the approximation $\hat{A}$ is used very differently in ATD than in tLSTD. The tLSTD algorithm uses a similar approximation to ATD's, but uses it to compute a closed-form solution, and thus is biased [Gehring et al.2016, Theorem 1]. In fact, the bias grows as $k$ decreases, proportionally to the magnitude of the truncated singular values of $A$. In ATD, the choice of $k$ is decoupled from the fixed point, and so $k$ can be set to balance learning speed and computation with no fear of asymptotic bias.
5 Empirical Results
All the following experiments investigate the on-policy setting, and thus we make use of the standard version of ATD for simplicity. Future work will explore off-policy domains with the emphatic update. The results presented in this section were generated over 756 thousand individual experiments run on three different domains. Due to space constraints, detailed descriptions of each domain, error calculation, and all other parameter settings are given in the appendix. We included a wide variety of baselines in our experiments; additional related baselines excluded from our study are also discussed in the appendix.
Our first batch of experiments was conducted on Boyan's chain—a domain known to elicit the strong advantages of LSTD($\lambda$) over TD($\lambda$). In Boyan's chain, the agent's objective is to estimate the value function based on a low-dimensional, dense representation of the underlying state (perfect representation of the value function is possible). The aim of this experiment was to investigate the performance of ATD in a domain where the preconditioner matrix is full rank; no rank truncation is applied. We compared five linear-complexity methods (TD(0), TD($\lambda$), true online TD($\lambda$), ETD($\lambda$), true online ETD($\lambda$)) against LSTD($\lambda$) and ATD, reporting the percentage error relative to the true value function over the first 1000 steps, averaged over 200 independent runs. We swept a large range of step-size parameters, trace-decay rates, and regularization parameters, and tested both fixed and decaying step-size schedules. Figure 1 summarizes the results.
Both LSTD($\lambda$) and ATD achieve lower error than all the linear baselines—even though each linear method was tuned using 864 combinations of step-sizes and $\lambda$. In terms of sensitivity, the choice of step-size for TD(0) and ETD has a large effect on performance (indicated by sharp valleys), whereas true online TD($\lambda$) is the least sensitive to the learning rate. LSTD($\lambda$) using the Sherman-Morrison update (used in many prior empirical studies) is sensitive to the regularization parameter; the parameter-free nature of LSTD may be slightly overstated in the literature.¹

¹We are not the first to observe this. Sutton and Barto [2016] note that this parameter plays a role similar to the step-size for LSTD.
Our second batch of experiments investigated the characteristics of ATD in a classic benchmark domain with a sparse, high-dimensional feature representation where perfect approximation of the value function is not possible—Mountain Car with tile coding. The policy to be evaluated stochastically takes the action in the direction of the sign of the velocity, with performance measured by computing a truncated Monte Carlo estimate of the return from states sampled from the stationary distribution (detailed in the appendix). We used a fine-grained tile coding of the 2D state, resulting in a 1024-dimensional feature representation with exactly 10 units active on every time step. We tested TD(0), true online TD($\lambda$), true online ETD($\lambda$), and several subquadratic methods, including iLSTD, tLSTD, random-projection LSTD, and fast LSTD [Prashanth et al.2013]. As before, a wide range of parameter values was swept. Performance was averaged over 100 independent runs. A fixed step-size schedule was used for the linear TD baselines, because that achieved the best performance. The results are summarized in Figure 2.
LSTD and ATD exhibit faster initial learning compared to all the other methods. This is particularly impressive since the rank used by ATD is less than 5% of the feature dimension. Both fast LSTD and projected LSTD perform considerably worse than the linear TD methods, while iLSTD exhibits high parameter sensitivity. tLSTD has no tunable parameter besides the rank, but performs poorly due to the high stochasticity in the policy—additional experiments with randomness in action selection of 0% and 10% yielded better performance for tLSTD, but never equal to ATD. The true online linear methods perform very well compared to ATD, but this required sweeping hundreds of combinations of the step-size and $\lambda$, whereas ATD exhibited little sensitivity to its regularization parameter (see Figure 2, RHS); ATD achieved excellent performance with the same parameter setting as we used in Boyan's chain.²

²For the remaining experiments in the paper, we excluded the TD methods without true online traces, because they perform worse than their true online counterparts in all our experiments. This result matches the results in the literature [van Seijen et al.2016].
We ran an additional experiment in Mountain Car to more clearly exhibit the benefit of ATD over existing methods. We used the same setting as above, except that 100 additional features were added to the feature vector, with 50 of them randomly set to one and the rest zero. This noisy feature vector is meant to emulate a situation such as a robot with a sensor that has become unreliable, generating noisy data, while the remaining sensors are still useful for the task at hand. The results are summarized in Figure 4. Naturally, all methods are adversely affected by this change; however, ATD's low-rank approximation enables the agent to ignore the unreliable feature information and learn efficiently. tLSTD, as suggested by our previous experiments, does not seem to cope well with the increase in stochasticity.
Our final experiment compares the performance of several subquadratic-complexity policy evaluation methods in an industrial energy allocation simulator with a much larger feature dimension (see Figure 4). As before, we report percentage error computed from Monte Carlo rollouts, averaging performance over 50 independent runs and selecting and testing parameters from an extensive set (detailed in the appendix). The policy was optimized ahead of time and fixed, and the feature vectors were produced via tile coding, resulting in an 8192-dimensional feature vector with 800 units active on each step. Although the feature dimension here is still relatively small, a quadratic method like LSTD would nonetheless require over 67 million operations per time step, and thus methods that can exploit low-rank approximations are of particular interest. The results indicate that both ATD and tLSTD achieve the fastest learning, as expected. The intrinsic rank in this domain appears to be small compared to the feature dimension—which is exploited by both ATD and tLSTD—while the performance of tLSTD indicates that the domain exhibits little stochasticity. The appendix contains additional results for this domain—in the small-rank setting, ATD significantly outperforms tLSTD.
6 Conclusion and future work
In this paper, we introduced a new family of TD learning algorithms that take a fundamentally different approach from previous incremental TD algorithms. The key idea is to use a preconditioner on the temporal difference update, similar to a quasi-Newton stochastic gradient descent update. We proved that the expected update is consistent, and empirically demonstrated improved learning speed and parameter insensitivity, even with significant approximations in the preconditioner.
This paper only begins to scratch the surface of potential preconditioners for ATD. There remain many avenues to explore the utility of other preconditioners, such as diagonal approximations, eigenvalue estimates, other matrix factorizations, and approximations to $A$ that are amenable to inversion. The family of ATD algorithms provides a promising avenue for more effectively using results from stochastic gradient descent to improve sample complexity, with feasible computational complexity.
References
[Bertsekas2007] Bertsekas, D. 2007. Dynamic Programming and Optimal Control. Athena Scientific Press.
[Bordes et al.2009] Bordes, A.; Bottou, L.; and Gallinari, P. 2009. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research.
[Boyan1999] Boyan, J. A. 1999. Least-squares temporal difference learning. International Conference on Machine Learning.

[Dabney and Thomas2014] Dabney, W., and Thomas, P. S. 2014. Natural Temporal Difference Learning. In AAAI Conference on Artificial Intelligence.
[Dann et al.2014] Dann, C.; Neumann, G.; and Peters, J. 2014. Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research.
 [Gehring et al.2016] Gehring, C.; Pan, Y.; and White, M. 2016. Incremental Truncated LSTD. In International Joint Conference on Artificial Intelligence.
 [Geramifard and Bowling2006] Geramifard, A., and Bowling, M. 2006. Incremental leastsquares temporal difference learning. In AAAI Conference on Artificial Intelligence.
 [Geramifard et al.2007] Geramifard, A.; Bowling, M.; and Zinkevich, M. 2007. iLSTD: Eligibility traces and convergence analysis. In Advances in Neural Information Processing Systems.
 [Ghavamzadeh et al.2010] Ghavamzadeh, M.; Lazaric, A.; Maillard, O. A.; and Munos, R. 2010. LSTD with random projections. In Advances in Neural Information Processing Systems.
 [Givchi and Palhang2014] Givchi, A., and Palhang, M. 2014. Quasi newton temporal difference learning. In Asian Conference on Machine Learning.
 [Hansen1990] Hansen, P. C. 1990. The discrete Picard condition for discrete ill-posed problems. BIT Numerical Mathematics.
 [Maei2011] Maei, H. 2011. Gradient TemporalDifference Learning Algorithms. Ph.D. Dissertation, University of Alberta.
 [Mahadevan et al.2014] Mahadevan, S.; Liu, B.; Thomas, P. S.; Dabney, W.; Giguere, S.; Jacek, N.; Gemp, I.; and Liu, J. 2014. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. CoRR abs/1405.6757.
 [Meyer et al.2014] Meyer, D.; Degenne, R.; Omrane, A.; and Shen, H. 2014. Accelerated gradient temporal difference learning algorithms. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
 [Mokhtari and Ribeiro2014] Mokhtari, A., and Ribeiro, A. 2014. RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing.
 [Prashanth et al.2013] Prashanth, L. A.; Korda, N.; and Munos, R. 2013. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML PKDD.
 [Salas and Powell2013] Salas, D. F., and Powell, W. B. 2013. Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems. Dept Oper Res Financial Eng.
 [Schraudolph et al.2007] Schraudolph, N.; Yu, J.; and Günter, S. 2007. A stochastic quasiNewton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics.
 [Shi et al.2011] Shi, X.; Wei, Y.; and Zhang, W. 2011. Convergence of general nonstationary iterative methods for solving singular linear equations. SIAM Journal on Matrix Analysis and Applications.
 [Sutton and Barto1998] Sutton, R., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT press.
 [Sutton and Barto2016] Sutton, R., and Barto, A. G. 2016. Reinforcement Learning: An Introduction 2nd Edition. MIT press.
 [Sutton et al.2016] Sutton, R. S.; Mahmood, A. R.; and White, M. 2016. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.
 [Sutton1988] Sutton, R. 1988. Learning to predict by the methods of temporal differences. Machine Learning.
 [Szepesvari2010] Szepesvari, C. 2010. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
 [van Hasselt et al.2014] van Hasselt, H.; Mahmood, A. R.; and Sutton, R. 2014. Off-policy TD(λ) with a true online equivalence. In Conference on Uncertainty in Artificial Intelligence.
 [van Seijen and Sutton2014] van Seijen, H., and Sutton, R. 2014. True online TD(λ). In International Conference on Machine Learning.
 [van Seijen et al.2016] van Seijen, H.; Mahmood, A. R.; Pilarski, P. M.; Machado, M. C.; and Sutton, R. S. 2016. True Online Temporal-Difference Learning. Journal of Machine Learning Research.
 [Wang and Bertsekas2013] Wang, M., and Bertsekas, D. P. 2013. On the convergence of simulationbased iterative methods for solving singular linear systems. Stochastic Systems.
 [White and White2016] White, A. M., and White, M. 2016. Investigating practical, linear temporal difference learning. In International Conference on Autonomous Agents and Multiagent Systems.
 [White2016] White, M. 2016. Unifying task specification in reinforcement learning. arXiv.org.
 [Yu2015] Yu, H. 2015. On convergence of emphatic temporaldifference learning. In Annual Conference on Learning Theory.
Appendix A Convergence proof
For the more general setting, where the weighting can also be $d_\mu$ (so that $A$ need not be positive semi-definite), we redefine the rank-$k$ approximation. We say the rank-$k$ approximation to $A$ is composed of a chosen set $\mathcal{K}$ of $k$ eigenvalues if $\hat{A} = Q \hat{\Lambda} Q^{-1}$ for diagonal $\hat{\Lambda}$ with $\hat{\lambda}_j = \lambda_j$ for $j \in \mathcal{K}$, and zero otherwise.
Theorem 2.
Under Assumptions 1 and 2, let $\hat{A}$ be the rank-$k$ approximation composed of the eigenvalues indexed by $\mathcal{K}$. If $A$ is positive semi-definite, or $\mathcal{K}$ contains all the negative eigenvalues of $A$, then the expected updating rule in (4) converges to the fixed point.
Proof: We use a general result about stationary iterative methods [Shi et al.2011], which is applicable to the case where $A$ is not full rank. Theorem 1.1 of [Shi et al.2011] states that given a singular and consistent linear system $A w = b$, where $b$ is in the range of $A$, the stationary iteration with preconditioner $B$, for $t = 0, 1, \ldots$,
(5)  $w_{t+1} = w_t + B (b - A w_t) = (I - BA) w_t + B b$
converges to the solution if and only if the following three conditions are satisfied.

Condition I: the eigenvalues of $I - BA$ are equal to 1 or have absolute value strictly less than 1.

Condition II: $\text{rank}(BA) = \text{rank}((BA)^2)$.

Condition III: $\text{nullspace}(BA) = \text{nullspace}(A)$.
We verify these conditions to prove the result. First, because we are using the projected Bellman error, we know that $b$ is in the range of $A$ and the system is consistent: there exists a $w^\ast$ such that $A w^\ast = b$.
To rewrite our updating rule (4) so that it is expressible in terms of (5), let $B := \alpha \hat{A}^\dagger + \eta I$, giving
(6)  $BA = Q (\alpha \hat{\Lambda}^\dagger + \eta I) \Lambda Q^{-1} = Q (\alpha I_{\mathcal{K}} + \eta \Lambda) Q^{-1}$
where $I_{\mathcal{K}}$ is a diagonal matrix with the indices in $\mathcal{K}$ set to 1, and the rest zero.
Proof for condition I. Using (6), $I - BA$ is diagonalized by $Q$.
To bound the maximum absolute value of its diagonal entries,
we consider each eigenvalue $\lambda_j$ of $A$, and address three cases.
Case 1: , :
Case 2: , : if .
Case 3: . For this case, , by assumption, as contains the indices for all negative eigenvalues of . So if .
All three cases are satisfied by the assumed $\alpha$ and $\eta$.
Therefore, the eigenvalues of $I - BA$ are equal to 1 or have absolute value strictly less than 1, and so the first condition holds.
Proof for condition II. Squaring $BA$ does not change the number of positive eigenvalues, so the rank is unchanged: $\text{rank}(BA) = \text{rank}((BA)^2)$.
Proof for condition III. To show that the nullspaces of $BA$ and $A$ are equal, it is sufficient to prove that $BAw = 0$ if and only if $Aw = 0$. The preconditioner $B$ is invertible because all of its eigenvalues are nonzero: this is clearly true for the eigenvalues equal to $\eta > 0$, and also true for the remaining eigenvalues under the assumed conditions on $\alpha$ and $\eta$. For any $w$ with $Aw = 0$, we get $BAw = 0$, and so $\text{nullspace}(A) \subseteq \text{nullspace}(BA)$. For any $w$ with $BAw = 0$, invertibility of $B$ gives $Aw = 0$, and so $\text{nullspace}(BA) \subseteq \text{nullspace}(A)$, completing the proof.
With an exact preconditioner, the update is a gradient descent update on the MSPBE, and so will converge even under off-policy sampling. With our approximation, the gradient is only approximate, and theoretical results about (stochastic) gradient descent no longer obviously apply. For this reason, we use the iterative update analysis above to understand convergence properties. Iterative updates for the full expected update, with preconditioners, have been studied in reinforcement learning (cf. [Wang and Bertsekas2013]); however, that work typically analyzed different preconditioners, since it had no requirement of reducing computation below quadratic. For example, they consider a regularized preconditioner that is not compatible with an incremental singular value decomposition, and, to the best of our knowledge, current iterative eigenvalue decompositions require symmetric matrices.
The theorem is agnostic to which components of $A$ are approximated by the rank-$k$ matrix $\hat{A}$. In general, a natural choice, particularly in on-policy learning or more generally with a positive definite $A$, is to select the largest-magnitude eigenvalues of $A$, which contain the most significant information about the system and so are likely to give the most useful curvature information. However, the eigenvalues could also be chosen to obtain convergence for off-policy learning with the weighting $d_\mu$, where $A$ is not necessarily positive semi-definite. This theorem indicates that if the rank-$k$ approximation contains the negative eigenvalues of $A$, even if it does not contain the remaining information in $A$, then we obtain convergence under off-policy sampling. We can of course use the emphatic weighting more easily for off-policy learning, but if the weighting $d_\mu$ is desired instead, then carefully selecting the eigenvalues for ATD enables that choice.
Appendix B Algorithmic details
In this section, we outline the implemented ATD($\lambda$) algorithm. The key choices are how to update the approximation $\hat{A}$ to $A$, and how to update the eligibility trace to obtain different variants of TD. We include both the conventional and emphatic trace updates, in Algorithms 2 and 3 respectively. The low-rank update to $\hat{A}$ uses an incremental singular value decomposition (SVD). This update is the same one used for tLSTD, and so we refer the reader to [Gehring et al.2016, Algorithm 3]. The general idea is to incorporate the rank-one update into the current SVD of $\hat{A}$. In addition, to maintain a normalized $\hat{A}$, we multiply the previous estimate by $\frac{t}{t+1}$ and the new rank-one update by $\frac{1}{t+1}$:
Multiplying $\hat{A}$ by a constant corresponds to multiplying its singular values by that constant. We also find that multiplying each component of the rank-one update by $\frac{1}{\sqrt{t+1}}$ is more effective than multiplying only one of them by $\frac{1}{t+1}$.
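A sketch of this normalization bookkeeping (illustrative names, not the paper's pseudocode; the actual merge into the SVD follows [Gehring et al.2016, Algorithm 3]):

```python
import numpy as np

def normalized_rank_one(t, S, e, d):
    """Keep the SVD estimate of A_hat normalized by sample count:
    scale the existing singular values by t/(t+1), and split the
    remaining 1/(t+1) factor evenly (1/sqrt(t+1) each) across both
    vectors of the rank-one update e d^T, as the text suggests."""
    S_scaled = S * (t / (t + 1.0))
    c = 1.0 / np.sqrt(t + 1.0)
    return S_scaled, c * e, c * d
```

The symmetric split keeps both factors on a comparable scale, while their outer product still contributes exactly $\frac{1}{t+1} e d^\top$.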