Background
Let $M = \langle S, A, P, r, \gamma \rangle$ be a Markov Decision Process (MDP), where $S$ is a countable set of states, $A$ is a finite set of actions, $P$ maps each state–action pair $(s, a)$ to the probability of transitioning to $s' \in S$ in a single timestep, $r$ is an $|S|$-dimensional vector mapping each state $s \in S$ to a scalar reward, and $\gamma \in [0, 1)$ is the discount factor. We assume we are given a function $\phi$ that maps each state to a $k$-dimensional feature vector, and we denote by $\Phi$ a $k \times |S|$-dimensional matrix with one column $\phi(s)$ for each state $s \in S$. Let $\pi$ be a stochastic policy and denote by $\pi(a \mid s)$ the probability that the policy executes action $a$ from state $s$. Given a policy $\pi$, we can define the value function
(1) $V^{\pi}(s) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_{t}) \,\middle|\, s_{0} = s, \pi \right]$
(2) $\phantom{V^{\pi}(s)} = r(s) + \gamma \sum_{s' \in S} P^{\pi}(s' \mid s) V^{\pi}(s'),$
where $P^{\pi}(s' \mid s) = \sum_{a \in A} \pi(a \mid s) P(s' \mid s, a)$. Note that $P^{\pi}$ is an $|S| \times |S|$ matrix where the $s$-th row is the probability distribution over next states, given that the agent is in state $s$. Given that $V^{\pi} \approx \Phi^{\top} \theta$ for some parameter vector $\theta \in \mathbb{R}^{k}$, we have that
(3) $\Phi^{\top} \theta \approx r + \gamma P^{\pi} \Phi^{\top} \theta,$
where $\Phi^{\top}$ is an $|S| \times k$ matrix, which implies that $A = \Phi (I - \gamma P^{\pi}) \Phi^{\top}$ is a $k \times k$ matrix and $b = \Phi r$ is a $k$-dimensional vector. Given $n$ trajectories with length $l$ (for clarity, we assume all trajectories have the same fixed length; the algorithms presented can easily be extended to handle variable-length trajectories), this suggests the LSTD($\lambda$) algorithm [Bradtke and Barto, 1996, Boyan, 2002], which estimates
(4) $\hat{A}_{\lambda} = \sum_{i=1}^{n} \sum_{t=1}^{l} z_{t}^{i} \left( \phi(s_{t}^{i}) - \gamma \phi(s_{t+1}^{i}) \right)^{\top}$
(5) $\hat{b}_{\lambda} = \sum_{i=1}^{n} \sum_{t=1}^{l} z_{t}^{i} r_{t}^{i},$
where $z_{t}^{i} = \sum_{j=1}^{t} (\gamma \lambda)^{t-j} \phi(s_{j}^{i})$ is the eligibility trace, $s_{t}^{i}$ is the $t$-th state of the $i$-th trajectory, and $r_{t}^{i}$ is the corresponding reward. After estimating $\hat{A}_{\lambda}$ and $\hat{b}_{\lambda}$, LSTD solves for the parameters
(6) $\hat{\theta}_{\lambda} = \hat{A}_{\lambda}^{-1} \hat{b}_{\lambda}.$
We will drop the subscript $\lambda$ when it is clear from context. The computational complexity of LSTD($\lambda$) is $O(k^{3} + nlk^{2})$, where the $k^{3}$ term is due to solving for the inverse of $\hat{A}$ and the $nlk^{2}$ term is the cost associated with building the $\hat{A}$ matrix. We can further reduce the total computational complexity to $O(nlk^{2})$ by using Recursive LSTD($\lambda$) [Xu et al., 2002], which we will refer to as RLSTD($\lambda$). Instead of computing $\hat{A}$ and solving for its inverse, RLSTD($\lambda$) recursively updates an estimate of $\hat{A}^{-1}$ using the Sherman–Morrison formula [Sherman and Morrison, 1949].
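To make the estimators (4)–(6) concrete, here is a minimal batch LSTD($\lambda$) sketch in Python. The trajectory format (lists of (state, reward, next_state) tuples) and the names `phi` and `lstd_lambda` are illustrative assumptions rather than the paper's implementation; in practice $\hat{A}$ may need regularization before inversion.

```python
import numpy as np

def lstd_lambda(trajectories, phi, k, gamma, lam):
    """Batch LSTD(lambda): accumulate A and b over all transitions,
    then solve A theta = b. `trajectories` is a list of lists of
    (state, reward, next_state) tuples; `phi` maps a state to a
    length-k feature vector."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for traj in trajectories:
        z = np.zeros(k)  # eligibility trace, reset at the start of each trajectory
        for (s, r, s_next) in traj:
            z = gamma * lam * z + phi(s)
            A += np.outer(z, phi(s) - gamma * phi(s_next))
            b += z * r
    return np.linalg.solve(A, b)
```

For example, on a single self-looping state with reward 1 and $\gamma = 0.5$, the fixed point is $V = 2$, which the estimator recovers for any $\lambda$.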
Definition 1.
If $X$ is an invertible $k \times k$ matrix and $u, v \in \mathbb{R}^{k}$ are column vectors such that $1 + v^{\top} X^{-1} u \neq 0$, then the Sherman–Morrison formula is given by

(7) $\left( X + u v^{\top} \right)^{-1} = X^{-1} - \dfrac{X^{-1} u v^{\top} X^{-1}}{1 + v^{\top} X^{-1} u}.$
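The following snippet illustrates Definition 1 numerically, checking the rank-one inverse update of (7) against a direct matrix inversion. It is a standalone sketch with arbitrary test matrices, not part of the algorithm itself.

```python
import numpy as np

def sherman_morrison(X_inv, u, v):
    """Return (X + u v^T)^{-1} given X^{-1}, via the Sherman-Morrison
    formula. Assumes 1 + v^T X^{-1} u != 0."""
    Xu = X_inv @ u                       # X^{-1} u
    vX = v @ X_inv                       # v^T X^{-1}
    return X_inv - np.outer(Xu, vX) / (1.0 + v @ Xu)
```

The update costs $O(k^{2})$, versus $O(k^{3})$ for recomputing the inverse from scratch.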
RLSTD($\lambda$) updates its estimate of $\hat{A}^{-1}$ according to the following rule:

(8) $\hat{A}_{t}^{-1} = \mathrm{SM}\!\left( \hat{A}_{t-1}^{-1},\; z_{t},\; \phi(s_{t}) - \gamma \phi(s_{t+1}) \right),$

where $\hat{A}_{0}^{-1} = \varepsilon I$ for some $\varepsilon > 0$, $I$ is the $k \times k$ identity matrix, and SM is the Sherman–Morrison formula given by (7) with $X^{-1} = \hat{A}_{t-1}^{-1}$, $u = z_{t}$, and $v = \phi(s_{t}) - \gamma \phi(s_{t+1})$.
In the remainder of this paper, we focus on LSTD rather than RLSTD for (a) clarity, (b) because RLSTD has an additional initial variance parameter, and (c) because LSTD gives exact least squares solutions (while RLSTD’s solution is approximate). Note, however, that similar approaches and analysis can be applied to RLSTD.
Adapting the Timescale of LSTD
The parameter $\lambda$ effectively controls the timescale at which updates are performed. This induces a bias–variance tradeoff, because longer timescales ($\lambda$ close to 1) tend to result in high variance, while shorter timescales ($\lambda$ close to 0) introduce bias. In this paper, the solution to this tradeoff is the value of $\lambda$ that produces the parameters $\theta$ that minimize the Mean Squared Value Error (MSVE)

(9) $\mathrm{MSVE}(\theta) = \mathbb{E}_{s \sim d}\!\left[ \left( V^{\pi}(s) - \theta^{\top} \phi(s) \right)^{2} \right],$

where $d$ is a distribution over states.
If $\Lambda \subset [0, 1]$ is a finite set of candidate values, a natural choice is to perform Leave-One-Out (LOO) Cross-Validation (CV) to select the $\lambda \in \Lambda$ that minimizes the MSVE. Unlike in the typical supervised learning setting, however, individual sampled transitions are correlated, so the LOO-CV errors are potentially biased. Since trajectories are independent, we instead propose Leave-One-Trajectory-Out (LOTO) CV.
Let $m = |\Lambda|$. A naïve implementation would perform LOTO-CV for each parameter value in $\Lambda$, which means running LSTD $n$ times for each parameter value. Thus the total time to run LOTO-CV for all parameter values is $O(mn(k^{3} + nlk^{2}))$. The naïve implementation is slowed down significantly by the need to solve LSTD $n$ times for each parameter value. We first decrease the computational cost associated with LOTO-CV for a single parameter value. Then we consider methods that reduce the cost associated with solving LSTD for $m$ different values of $\lambda$, rather than solving for each value separately.
Efficient LeaveOneTrajectoryOut CV
Fix a single value $\lambda \in \Lambda$. We denote by

(10) $\hat{A}^{-i} = \hat{A} - \sum_{t=1}^{l} z_{t}^{i} \left( \phi(s_{t}^{i}) - \gamma \phi(s_{t+1}^{i}) \right)^{\top}$
(11) $\hat{b}^{-i} = \hat{b} - \sum_{t=1}^{l} z_{t}^{i} r_{t}^{i}$
(12) $\hat{\theta}^{-i} = \left( \hat{A}^{-i} \right)^{-1} \hat{b}^{-i},$

where $\hat{\theta}^{-i}$ is the LSTD($\lambda$) solution computed without the $i$-th trajectory.
The LOTO-CV error for the $i$-th trajectory is defined by

(13) $e_{i}(\lambda) = \frac{1}{l} \sum_{t=1}^{l} \left( R_{t}^{i} - \left( \hat{\theta}^{-i} \right)^{\top} \phi(s_{t}^{i}) \right)^{2},$

where $R_{t}^{i} = \sum_{j=t}^{l} \gamma^{j-t} r_{j}^{i}$ is the Monte-Carlo return, so that (13) is an empirical estimate of the Mean Squared Value Error (MSVE). Notice that the LOTO-CV error depends on $\lambda$ only through the computed parameters $\hat{\theta}^{-i}$. This is an important property because it allows us to compare this error for different choices of $\lambda$. Once the parameters $\hat{\theta}^{-i}$ are known, the LOTO-CV error for the $i$-th trajectory can be computed in $O(lk)$ time.
Since $\hat{\theta}^{-i} = (\hat{A}^{-i})^{-1} \hat{b}^{-i}$, it is sufficient to derive $(\hat{A}^{-i})^{-1}$ and $\hat{b}^{-i}$. Notice that $\hat{A}^{-i}$ and $\hat{b}^{-i}$ differ from $\hat{A}$ and $\hat{b}$ only in the terms contributed by the $i$-th trajectory. After deriving $\hat{A}$ and $\hat{b}$ via (4) and (5), respectively, we can derive $\hat{b}^{-i}$ straightforwardly in $O(lk)$ time. However, deriving $(\hat{A}^{-i})^{-1}$ must be done more carefully. We first derive $\hat{A}^{-1}$ and then update this matrix recursively using the Sherman–Morrison formula to remove each transition sample from the $i$-th trajectory.
We update $\hat{A}^{-1}$ recursively for all $l$ transition samples from the $i$-th trajectory via

$\hat{A}_{t}^{-i} = \hat{A}_{t-1}^{-i} + u_{t} v_{t}^{\top},$

where $\hat{A}_{0}^{-i} = \hat{A}$, $u_{t} = -z_{t}^{i}$, and $v_{t} = \phi(s_{t}^{i}) - \gamma \phi(s_{t+1}^{i})$. Now applying the Sherman–Morrison formula, we can obtain

(14) $\left( \hat{A}_{t}^{-i} \right)^{-1} = \mathrm{SM}\!\left( \left( \hat{A}_{t-1}^{-i} \right)^{-1},\; u_{t},\; v_{t} \right),$

which gives $(\hat{A}^{-i})^{-1} = (\hat{A}_{l}^{-i})^{-1}$. Since the Sherman–Morrison formula can be applied in $O(k^{2})$ time, erasing the effect of all $l$ samples for the $i$-th trajectory can be done in $O(lk^{2})$ time.
Using this approach, the cost of LSTD($\lambda$) + LOTO-CV is $O(k^{3} + nlk^{2})$, which is on the same order as running LSTD($\lambda$) alone. So computing the additional LOTO-CV errors is practically free.
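The per-trajectory downdate can be sketched as follows: each transition of the held-out trajectory is removed from $\hat{A}^{-1}$ by one Sherman–Morrison application and from $\hat{b}$ by a vector subtraction. The trajectory format and helper names are assumptions for illustration, matching the earlier LSTD sketch.

```python
import numpy as np

def remove_trajectory(A_inv, b, traj, phi, gamma, lam, k):
    """Downdate A^{-1} and b to erase one trajectory's transitions,
    using the Sherman-Morrison formula once per transition:
    O(l k^2) total instead of a fresh O(k^3) inverse."""
    z = np.zeros(k)
    for (s, r, s_next) in traj:
        z = gamma * lam * z + phi(s)           # rebuild the trajectory's traces
        d = phi(s) - gamma * phi(s_next)
        u = -z                                  # subtracting z d^T is a rank-one update
        Au = A_inv @ u
        A_inv = A_inv - np.outer(Au, d @ A_inv) / (1.0 + d @ Au)
        b = b - z * r
    return A_inv, b
```

After the loop, `A_inv @ b` gives the leave-one-trajectory-out solution $\hat{\theta}^{-i}$ without re-solving LSTD from scratch.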
Solving LSTD for $m$ Timescale Parameter Values
Let $\Phi_{i}$ be a $k \times l$ matrix whose columns are the state observation vectors $\phi(s_{t}^{i})$ at timesteps $t = 1, \dots, l$ of the $i$-th trajectory (with the last state removed), and let $\Phi_{i}'$ be a $k \times l$ matrix whose columns are the next-state observation vectors (with the first state observation removed). We define $\Phi = [\Phi_{1}, \dots, \Phi_{n}]$ and $\Phi' = [\Phi_{1}', \dots, \Phi_{n}']$, which are both $k \times nl$ matrices. Then

(15) $\hat{A}_{\lambda} = Z_{\lambda} \left( \Phi - \gamma \Phi' \right)^{\top},$

where $Z_{\lambda}$ is the $k \times nl$ matrix whose columns are the eligibility traces $z_{t}^{i}$ and $\hat{b}_{\lambda} = Z_{\lambda} r$ with $r$ the $nl$-dimensional vector of observed rewards.

By applying the Sherman–Morrison formula recursively with $u = z_{t}^{i}$ and $v = \phi(s_{t}^{i}) - \gamma \phi(s_{t+1}^{i})$, we can obtain $\hat{A}_{\lambda}^{-1}$ in $O(nlk^{2})$ time and then obtain $\hat{\theta}_{\lambda}$ in $O(k^{2})$ time. Thus, the total running time for LSTD with $m$ timescale parameter values is $O(mnlk^{2})$.
Proposed Algorithm: ALLSTD
We combine the approaches from the previous two subsections to define Adaptive Least-Squares Temporal Difference learning (ALLSTD). The pseudocode is given in Algorithm 3. ALLSTD takes as input a set of $n$ trajectories and a finite set $\Lambda \subset [0, 1]$ of candidate $\lambda$ values.
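As a reference point, here is a naïve Python sketch of the ALLSTD selection logic: leave-one-trajectory-out CV scored against Monte-Carlo returns, followed by a refit on all data with the winning $\lambda$. It deliberately omits the Sherman–Morrison speedups described above, so it runs in the naïve time bound; all names and the trajectory format are illustrative assumptions.

```python
import numpy as np

def lstd_lambda(trajs, phi, k, gamma, lam):
    # Batch LSTD(lambda): accumulate A, b over transitions, solve A theta = b.
    A, b = np.zeros((k, k)), np.zeros(k)
    for traj in trajs:
        z = np.zeros(k)
        for (s, r, s_next) in traj:
            z = gamma * lam * z + phi(s)
            A += np.outer(z, phi(s) - gamma * phi(s_next))
            b += z * r
    return np.linalg.solve(A, b)

def allstd(trajs, phi, k, gamma, lambdas):
    """Pick lambda by leave-one-trajectory-out CV against Monte-Carlo
    returns, then refit LSTD(lambda) on all trajectories."""
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        err = 0.0
        for i, held_out in enumerate(trajs):
            rest = trajs[:i] + trajs[i + 1:]
            theta = lstd_lambda(rest, phi, k, gamma, lam)
            G = 0.0
            for (s, r, _) in reversed(held_out):
                G = r + gamma * G  # Monte-Carlo return from state s
                err += (G - theta @ phi(s)) ** 2
        if err < best_err:
            best_lam, best_err = lam, err
    return lstd_lambda(trajs, phi, k, gamma, best_lam), best_lam
```

The efficient version replaces the inner call to `lstd_lambda` with the Sherman–Morrison downdates, which is what makes the CV errors essentially free to compute.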
Theorem 1.
(Agnostic Consistency) Let $D_{n} = \{\tau_{i}\}_{i=1}^{n}$ be a dataset of $n$ trajectories generated by following the policy $\pi$ in an MDP with initial state distribution $d_{0}$. If $1 \in \Lambda$, then as $n \to \infty$,

(16) $\mathrm{MSVE}\!\left( f(D_{n}, \Lambda) \right) \longrightarrow \min_{\theta \in \mathbb{R}^{k}} \mathrm{MSVE}(\theta) \quad \text{almost surely},$

where $f$ is the proposed algorithm ALLSTD, which maps a dataset and a set of candidate $\lambda$ values to a vector in $\mathbb{R}^{k}$.
Experiments
We compared the MSVE and computation time of ALLSTD against a naïve implementation of LOTO-CV (which we refer to as NaïveALLSTD) in three domains: random walk, 2048, and mountain car. As a baseline, we compared these algorithms to LSTD and RLSTD with the best and worst fixed choices of $\lambda$ in hindsight, which we denote LSTD(best), LSTD(worst), RLSTD(best), and RLSTD(worst). In all of our experiments, we generated 80 independent trials. The discount factor $\gamma$ was held fixed within each domain: one value was used for the random walk and 2048 domains, and a different value for the mountain car domain.
Domain: Random Walk
The random walk domain [Sutton and Barto, 1998] is a chain of five states organized from left to right. The agent always starts in the middle state and can move left or right from each state, except for the leftmost and rightmost states, which are absorbing. The agent receives a reward of 0 unless it enters the rightmost state, where it receives a reward and the episode terminates.
Policy: The policy used to generate the trajectories was the uniform random policy over the two actions: left and right. Features: Each state was encoded using a 1-hot representation in a 5-dimensional feature vector. Thus, the value function was exactly representable.
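For concreteness, a sketch of a trajectory generator for this domain under the uniform random policy (the terminal reward value of +1 is an assumption for illustration; the text does not specify the exact value):

```python
import numpy as np

def random_walk_trajectory(rng, n_states=5):
    """Sample one trajectory from the 5-state random walk under the
    uniform random policy. States 0 and n_states-1 are absorbing; a
    reward (assumed +1 here) is received on entering the rightmost
    state, and 0 otherwise."""
    s = n_states // 2  # always start in the middle state
    traj = []
    while True:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states - 1 else 0.0
        traj.append((s, r, s_next))
        if s_next == 0 or s_next == n_states - 1:
            return traj
        s = s_next

def one_hot(s, n_states=5):
    """1-hot feature encoding, so the value function is exactly representable."""
    v = np.zeros(n_states)
    v[s] = 1.0
    return v
```

Trajectories in this format plug directly into the LSTD($\lambda$) sketches above.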
Figure 1 shows the root MSVE as a function of $\lambda$ and the number of trajectories. Notice that smaller values of $\lambda$ result in lower error, but the difference between values of $\lambda$ decreases as the number of trajectories grows.
Figure 1 compares root MSVE in the random walk domain. Notice that ALLSTD and NaïveALLSTD achieve roughly the same error level as LSTD(best) and RLSTD(best). While this domain has a narrow gap between the performance of LSTD(best) and LSTD(worst), ALLSTD and NaïveALLSTD achieve performance comparable to LSTD(best).
Figure 1 compares the average execution time of each algorithm in seconds (on a log scale). The LSTD and RLSTD curves simply show the time required to compute LSTD and RLSTD for $m$ different $\lambda$ values, respectively; they are shown for reference and do not actually make a decision about which $\lambda$ value to use. ALLSTD is significantly faster than NaïveALLSTD and takes roughly the same computational time as solving LSTD for $m$ different $\lambda$ values.
Domain: 2048
2048 is a game played on a $4 \times 4$ board of tiles. Tiles are either empty or assigned a positive number. Tiles with larger numbers can be acquired by merging two tiles with the same number. The immediate reward is the sum of the merged tile numbers.
Policy: The policy used to generate trajectories was a uniform random policy over the four actions: up, down, left, and right. Features: Each state was represented by a 16-dimensional vector, where each element was the value of the corresponding tile, with 0 used for empty tiles. This linear space was not rich enough to capture the true value function.
Figure 2 shows the root MSVE as a function of $\lambda$ and the number of trajectories. Similar to the random walk domain, smaller values of $\lambda$ result in lower error.
Figure 2 compares the root MSVE in 2048. Again ALLSTD and NaïveALLSTD achieve roughly the same error level as LSTD(best) and RLSTD(best) and perform significantly better than LSTD(worst) and RLSTD(worst) for a small number of trajectories.
Figure 2 compares the average execution time of each algorithm in seconds (on a log scale). Again, ALLSTD is significantly faster than NaïveALLSTD.
Domain: Mountain Car
The mountain car domain [Sutton and Barto, 1998] requires moving a car back and forth to build up enough speed to drive to the top of a hill. There are three actions: forward, neutral, and reverse.
Policy: The policy used to generate the data sampled one of the three actions with uniform probability 25% of the time. The remaining 75% of the time, the forward action was selected if condition

(17)

held, and the reverse action was selected otherwise, where the condition depends on the location and velocity of the car. Features: The feature space was a 2-dimensional vector where the first element was the location of the car and the second element was the velocity of the car. Thus, the linear space was not rich enough to capture the true value function.
Figure 3 shows the root MSVE as a function of $\lambda$ and the number of trajectories. Unlike in the previous two domains, $\lambda = 1$ achieves the smallest error even with a small number of trajectories. This difference is likely due to the poor feature representation used, which favors the Monte-Carlo return [Bertsekas and Tsitsiklis, 1996, Tagorti and Scherrer, 2015].
Figure 3 compares the root MSVE in the mountain car domain. Because of the poor feature representation, the difference between LSTD(best) and LSTD(worst) is large. ALLSTD and NaïveALLSTD again achieve roughly the same performance as LSTD(best) and RLSTD(best).
Figure 3 shows that average execution time for ALLSTD is significantly shorter than for NaïveALLSTD.
Related Work
We have introduced ALLSTD to efficiently select $\lambda$ to minimize root MSVE in a data-driven way. The most similar approach to ALLSTD is found in the work of Downey and Sanner [2010], which introduces a Bayesian model averaging approach to update $\lambda$. However, this approach is not directly comparable to ALLSTD because it is not clear how it can be extended to domains where function approximation is required to estimate the value function.
Konidaris et al. [2011] and Thomas et al. [2015] introduce the $\gamma$-return and the $\Omega$-return, respectively, which offer alternative weightings of the $n$-step returns. However, these approaches were designed to estimate the value of a single point rather than a value function. Furthermore, they assume that the bias introduced by $n$-step returns is zero. Thomas and Brunskill [2016] introduce the MAGIC algorithm, which attempts to account for the bias of the $n$-step returns, but this algorithm is still only designed to estimate the value of a single point. ALLSTD is designed to estimate a value function in a data-driven way to minimize root MSVE.
White and White [2016] introduce the $\lambda$-greedy algorithm for adapting $\lambda$ per-state based on estimates of the bias and variance. However, an approximation of the bias and variance is needed for each state to apply $\lambda$-greedy. Approximating these values accurately is equivalent to solving our original policy evaluation problem, and the approach suggested in the work of White and White [2016] introduces several additional parameters. ALLSTD, on the other hand, is a parameter-free algorithm. Furthermore, none of these previous approaches suggests using LOTO-CV to tune $\lambda$, or shows how LOTO-CV can be efficiently implemented within the LSTD family of algorithms.
Discussion
While we have focused on on-policy evaluation, the bias–variance tradeoff controlled by $\lambda$ is even more extreme in off-policy evaluation problems. Thus, an interesting area of future work would be applying ALLSTD to off-policy evaluation [White and White, 2016, Thomas and Brunskill, 2016]. It may be possible to identify good values of $\lambda$ without evaluating all trajectories; a bandit-like algorithm could be applied to determine how many trajectories to use to evaluate different values of $\lambda$. It is also interesting to note that our efficient cross-validation trick could be used to tune other parameters, such as a parameter controlling regularization.
In this paper, we have focused on selecting a single global $\lambda$ value; however, it may be possible to further reduce estimation error by learning $\lambda$ values that are specialized to different regions of the state space [Downey and Sanner, 2010, White and White, 2016]. Adapting $\lambda$ to different regions of the state space is challenging because it increases the search space, but identifying good region-specific values of $\lambda$ could improve prediction accuracy in regions of the state space with high variance or little data.
References
 [Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
 [Boyan, 2002] Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246.
 [Bradtke and Barto, 1996] Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57.
 [Downey and Sanner, 2010] Downey, C. and Sanner, S. (2010). Temporal difference bayesian model averaging: A bayesian perspective on adapting lambda. In Proceedings of the 27th International Conference on Machine Learning.
 [Konidaris et al., 2011] Konidaris, G., Niekum, S., and Thomas, P. (2011). TD$_{\gamma}$: Re-evaluating complex backups in temporal difference learning. In Advances in Neural Information Processing Systems 24, pages 2402–2410.
 [Lagoudakis and Parr, 2003] Lagoudakis, M. G. and Parr, R. (2003). Leastsquares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149.
 [Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
 [Shani and Gunawardana, 2011] Shani, G. and Gunawardana, A. (2011). Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–297. Springer.
 [Sherman and Morrison, 1949] Sherman, J. and Morrison, W. J. (1949). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics, 20(Jan):317.
 [Sugiyama et al., 2007] Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005.
 [Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
 [Tagorti and Scherrer, 2015] Tagorti, M. and Scherrer, B. (2015). On the rate of convergence and error bounds for LSTD($\lambda$). In Proceedings of the 32nd International Conference on Machine Learning, pages 1521–1529.
 [Thomas and Brunskill, 2016] Thomas, P. and Brunskill, E. (2016). Dataefficient offpolicy policy evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 2139–2148.
 [Thomas et al., 2015] Thomas, P., Niekum, S., Theocharous, G., and Konidaris, G. (2015). Policy evaluation using the $\Omega$-return. In Advances in Neural Information Processing Systems 28.
 [White and White, 2016] White, M. and White, A. (2016). A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 557–565.
 [Xu et al., 2002] Xu, X., He, H.-g., and Hu, D. (2002). Efficient reinforcement learning using recursive least-squares methods. Journal of Artificial Intelligence Research, 16(1):259–292.
Appendix A Agnostic Consistency of ALLSTD
Theorem 2.
(Agnostic Consistency) Let $D_{n} = \{\tau_{i}\}_{i=1}^{n}$ be a dataset of $n$ trajectories generated by following the policy $\pi$ in an MDP with initial state distribution $d_{0}$. If $1 \in \Lambda$, then as $n \to \infty$,

(18) $\mathrm{MSVE}\!\left( f(D_{n}, \Lambda) \right) \longrightarrow \min_{\theta \in \mathbb{R}^{k}} \mathrm{MSVE}(\theta) \quad \text{almost surely},$

where $f$ is the proposed algorithm ALLSTD, which maps a dataset and a set of candidate $\lambda$ values to a vector in $\mathbb{R}^{k}$.
Theorem 2 says that, in the limit, ALLSTD converges to the best hypothesis.
Proof.
ALLSTD essentially executes LOTO-CV + LSTD($\lambda$) for $m$ parameter values and compares the resulting scores. Then it returns the LSTD($\lambda$) solution for the $\lambda$ value with the lowest score. So it is sufficient to show that

(i) the scores converge to the expected MSVE for each $\lambda \in \Lambda$, and

(ii) LSTD(1) converges to $\theta^{*} \in \arg\min_{\theta} \mathrm{MSVE}(\theta)$.

The first follows by an application of the law of large numbers: since each trajectory is independent, the empirical MSVE for each $\lambda \in \Lambda$ converges almost surely to the expected MSVE. The second follows from the fact that LSTD(1) is equivalent to linear regression against the Monte-Carlo returns [Boyan, 2002]. Notice that the distribution $d$ is simply an average over the state distributions encountered at each timestep. ∎