Policy evaluation, often carried out through value function approximation, is a central component of many reinforcement learning (RL) algorithms. Its task is to estimate the value function of a fixed policy in a discounted Markov Decision Process (MDP) environment. The value function of a state specifies the accumulated reward an agent would receive in the future by following the fixed policy from that state. Value functions have been widely investigated in RL applications, and they provide insightful and important information for the agent to obtain an optimal policy, such as identifying important board configurations in Go, estimating failure probabilities of large telecommunication networks, predicting taxi-out times at large airports, and so on.
Although value functions can be approximated in different ways, the simplest form, linear approximation, is still widely adopted and studied due to its good generalization ability, relatively efficient computation and solid theoretical guarantees [27, 7, 13, 16]. Temporal Difference (TD) learning is a common approach to policy evaluation with linear function approximation. Typical TD algorithms can be divided into two categories: gradient-based methods (e.g., GTD(λ)) and least-squares (LS) based methods (e.g., LSTD(λ)). A good survey of these algorithms can be found in [17, 6, 12, 7, 13].
With the development of information technologies, high-dimensional data has become common in RL applications [26, 30, 23], which brings serious challenges in designing scalable and computationally efficient algorithms for the linear value function approximation problem. To address this practical issue, several approaches have been developed for efficient and effective value function approximation. One line of work adopted ℓ1 or ℓ2 regularization techniques to control the complexity of the large function space and designed several regularized RL algorithms. Another line of work studied this problem by using low-rank approximation via an incremental singular value decomposition and proposed t-LSTD(λ), while ATD(λ) was derived by combining the low-rank approximation and quasi-Newton gradient descent ideas.
Recently, sketching (projection) methods have been investigated to reduce the dimensionality and make it feasible to employ Least-Squares Temporal Difference (LSTD) algorithms. Specifically, an approach named LSTD-RP, based on random projections, was proposed and shown to benefit from the random projection strategy. However, eligibility traces, which have been proven to be important parameters for controlling the quality of approximation during policy evaluation, were not taken into consideration there. Sketching methods including random projections, count sketch, combined sketch and Hadamard sketch have also been investigated empirically for value function approximation, but without any finite sample analysis. Finite sample analysis is nevertheless important for these algorithms, since it clearly demonstrates the effects of the number of samples, the dimensionality of the function space and the other related parameters.
In this paper, we focus on exploring the utility of random projections and eligibility traces in LSTD algorithms, to tackle the challenges of computational efficiency and approximation quality in high-dimensional feature spaces. We also provide a finite sample analysis to evaluate the performance of the resulting algorithm. To the best of our knowledge, this is the first work that performs a formal finite sample analysis of LSTD with random projections and eligibility traces. Our contributions can be summarized in the following two aspects:
Algorithm: By introducing random projections and eligibility traces, we propose a refined algorithm named LSTD with Random Projections and Eligibility Traces (LSTD(λ)-RP for short), where λ is the trace parameter of the λ-return. The LSTD(λ)-RP algorithm consists of two steps: first, generate a low-dimensional linear feature space from the original high-dimensional feature space through random projections; then, apply LSTD(λ) in the generated low-dimensional feature space.
Theoretical Analysis: We theoretically evaluate the performance of LSTD(λ)-RP and provide finite sample performance bounds, including the estimation error bound, the approximation error bound and the total error bound. The analyses of the prior works LSTD-RP and LSTD(λ) cannot be applied directly to our setting, since (i) the analysis of LSTD-RP is based on a model of regression with Markov design, which no longer holds once eligibility traces are incorporated; and (ii) due to the use of random projections, the analysis of LSTD(λ) cannot be used directly either, especially the approximation error analysis. To tackle these challenges, we first prove that the linear independence of the features is preserved by random projections, which is important for our analysis. Second, we decompose the total error into two parts: the estimation error and the approximation error. We then carry out the analysis on a fixed random projection space, and bridge the error bounds between the fixed projection space and an arbitrary random projection space by leveraging the norm and inner-product preservation properties of random projections, the relationship between the smallest eigenvalues of the Gram matrices in the original and randomly projected spaces, and the Chernoff-Hoeffding inequality for stationary β-mixing sequences. Moreover, our theoretical results show that:
Compared to LSTD-RP, the trace parameter λ exhibits a trade-off between the estimation error and the approximation error of LSTD(λ)-RP. We can tune λ to an optimal value that balances these two errors and yields the smallest total error bound. Furthermore, for a fixed sample size, the optimal dimension of the randomly projected space in LSTD(λ)-RP is much smaller than that of LSTD-RP.
Compared to LSTD(λ), in addition to the computational gains resulting from random projections, the estimation error of LSTD(λ)-RP is much smaller, at the price of a controlled increase of the approximation error. LSTD(λ)-RP may thus perform better than LSTD(λ) whenever the additional term in the approximation error is smaller than the gain achieved in the estimation error.
These results demonstrate that LSTD(λ)-RP benefits from the eligibility traces and random projections strategies in both computational efficiency and approximation quality, and can be superior to the LSTD-RP and LSTD(λ) algorithms.
In this section, we first introduce some notation and preliminaries, and then briefly review the LSTD(λ) and LSTD-RP algorithms.
We now introduce the notation used in the rest of the paper. Let |·| denote the size of a set and ‖·‖ denote the ℓ2 norm of a vector. Let S be a measurable space. Denote by P(S) the set of probability measures over S, and by B(S; Vmax) the set of measurable functions defined on S and bounded by Vmax. For a measure μ ∈ P(S), the μ-weighted norm ‖f‖_μ of a measurable function f is defined by ‖f‖²_μ = Σ_{s∈S} μ(s) f(s)². The corresponding operator norm of a matrix M is defined as ‖M‖_μ = sup_{x≠0} ‖Mx‖_μ / ‖x‖_μ.
2.1 Value Functions
Reinforcement learning (RL) is an approach for finding optimal policies in sequential decision-making problems, in which the RL agent interacts with a stochastic environment formalized as a discounted Markov Decision Process (MDP). An MDP is described by a tuple (S, A, P, r, γ), where the state space S is finite (for simplicity; the results in this paper can be generalized to more general state spaces), the action space A is finite, P(s′ | s, a) is the probability of transitioning to state s′ from state s when taking action a, r(s, a) is the reward function, uniformly bounded by Rmax, and γ ∈ (0, 1) is the discount factor. A deterministic policy π : S → A is a mapping from the state space to the action space, i.e., an action selection rule (without loss of generality, we only consider deterministic policies; the extension to the stochastic policy setting is straightforward). Given the policy π, the MDP reduces to a Markov chain with transition probability P^π(s′ | s) = P(s′ | s, π(s)) and reward r^π(s) = r(s, π(s)).
In this paper, we are interested in policy evaluation, which can be used to find optimal policies or select actions. It involves computing the state-value function of a given policy, which assigns to each state a measure of long-term performance under that policy. Mathematically, given a policy π, the value function of any state s ∈ S is defined as

V^π(s) = E[ Σ_{t≥0} γ^t r^π(s_t) | s_0 = s, π ],

where the expectation is over the random trajectories generated by following policy π. Let V^π denote the vector constructed by stacking the values of V^π(s) on top of each other. Then V^π is the unique fixed point of the Bellman operator T^π:

T^π V = R^π + γ P^π V,   (1)

where R^π is the expected reward vector under policy π. Equation (1) is called the Bellman equation and is the basis of temporal difference learning approaches. In the remainder of this paper, we omit the policy superscripts for ease of reference where unambiguous, since we are interested in on-policy learning in this work.
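For a finite chain the Bellman operator is affine, so its fixed point can be computed in closed form as V = (I − γP)⁻¹R. A minimal numeric illustration (the two-state transition matrix, rewards and discount are made up for the example):

```python
import numpy as np

# Two-state Markov chain under a fixed policy: transition matrix P,
# expected reward vector R, discount factor gamma (illustrative values).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

# The Bellman equation V = R + gamma * P @ V is linear, so its unique
# fixed point is V = (I - gamma * P)^{-1} R.
V = np.linalg.solve(np.eye(2) - gamma * P, R)

# Verify the fixed-point property: applying the Bellman operator to V
# returns V itself.
assert np.allclose(R + gamma * P @ V, V)
```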
When the size of the state space is very large or even infinite, one may approximate the state-value function by a linear function, which is widely used in RL [27, 7]. We define a linear function space F spanned by the basis functions φ_j : S → R, j = 1, …, D, i.e., F = { f_θ | f_θ(·) = φ(·)^T θ, θ ∈ R^D }, where φ(·) = (φ_1(·), …, φ_D(·))^T is the feature vector. We assume ‖φ(s)‖ ≤ L for all s ∈ S for some finite positive constant L. Furthermore, we generate a d-dimensional (d < D) random space G from F through random projections: let A be a d × D random matrix whose elements are drawn independently and identically distributed (i.i.d.) from the Gaussian distribution N(0, 1/d) (sub-Gaussian distributions would also work; without loss of generality, we only consider the Gaussian distribution for simplicity). For any s ∈ S, denote the randomly projected feature vector by ψ(s) = Aφ(s), where ψ(·) = (ψ_1(·), …, ψ_d(·))^T. Thus G = { g_ω | g_ω(·) = ψ(·)^T ω, ω ∈ R^d }. Define Φ of dimension |S| × D and Ψ = ΦA^T of dimension |S| × d to be the original and randomly projected feature matrices respectively.
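As a concrete sketch of this construction (all sizes are illustrative, and the features themselves are synthetic), the projected feature matrix can be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n_states = 1000, 20, 50   # illustrative sizes, with d << D

# Original high-dimensional features: one row phi(s) per state.
Phi = rng.standard_normal((n_states, D))

# Gaussian random projection A with i.i.d. N(0, 1/d) entries, so that
# E[||A x||^2] = ||x||^2 for any fixed vector x.
A = rng.standard_normal((d, D)) / np.sqrt(d)

# Projected features psi(s) = A phi(s), i.e. Psi = Phi A^T.
Psi = Phi @ A.T
assert Psi.shape == (n_states, d)
```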
Least-Squares Temporal Difference (LSTD) learning is a traditional and important approach to policy evaluation in RL, which was first introduced for the λ = 0 case and later extended to include eligibility traces [3, 4]; the resulting algorithm is referred to as LSTD(λ).
The essence of LSTD(λ) is to estimate the fixed point of the projected multi-step Bellman equation, that is,

V = Π T^λ V,

where μ denotes the steady-state probability distribution of the Markov chain induced by policy π, Δ_μ denotes the diagonal matrix with diagonal elements μ(s), Π is the orthogonal projection operator onto the linear function space F with respect to the μ-weighted norm, and T^λ is the multi-step Bellman operator parameterized by λ ∈ [0, 1], defined as follows:

T^λ = (1 − λ) Σ_{i≥0} λ^i T^{i+1}.

When λ = 0, we have T^0 = T, and LSTD(λ) becomes LSTD.
Given one sample trajectory s_1, s_2, …, s_{n+1} generated by the Markov chain under policy π, the LSTD(λ) algorithm returns the approximation V̂ = Φθ̂ with θ̂ = Â^{-1} b̂, where

Â = Σ_{t=1}^{n} z_t (φ(s_t) − γ φ(s_{t+1}))^T,   b̂ = Σ_{t=1}^{n} z_t r_t,   z_t = Σ_{i=1}^{t} (γλ)^{t−i} φ(s_i),

where z_t is called the eligibility trace and λ ∈ [0, 1] is the trace parameter of the λ-return.
Compared to gradient-based temporal difference (TD) learning algorithms, LSTD(λ) enjoys better data sample efficiency and parameter insensitivity, but it is less computationally efficient: it requires O(D³) computation to solve the linear system directly, or still O(D²) per time step when the Sherman-Morrison formula is used for incremental updates. This expensive computational cost makes LSTD(λ) impractical for the high-dimensional feature space scenarios in RL. Recently, the Least-Squares TD with Random Projections algorithm (LSTD-RP for short) was proposed to deal with the high-dimensional data setting.
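The O(D²)-per-step incremental update mentioned above rests on the Sherman-Morrison rank-one inverse update; a minimal sketch (the base matrix and update vector are arbitrary):

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(D^2) time instead of the
    O(D^3) cost of a fresh matrix inversion (Sherman-Morrison formula)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Check against direct inversion on a small example; using v = u keeps
# A + u u^T symmetric positive definite, hence safely invertible.
rng = np.random.default_rng(2)
D = 5
A = 2.0 * np.eye(D)             # well-conditioned base matrix
u = rng.standard_normal(D)
updated = sherman_morrison_update(np.linalg.inv(A), u, u)
assert np.allclose(updated, np.linalg.inv(A + np.outer(u, u)))
```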
The basic idea of LSTD-RP is to learn the value function of a given policy in a low-dimensional linear space G generated through random projections from a high-dimensional space F. The theoretical results of LSTD-RP show that its total computational cost is dramatically less than the cost of running LSTD directly in the high-dimensional space. In addition to these practical computational gains, LSTD-RP provides an efficient and effective approximation for value functions, since it reduces the estimation error at the price of a controlled increase of the approximation error.
However, LSTD-RP does not take eligibility traces into consideration, and these are important parameters in RL. First, the use of traces can significantly speed up learning by controlling the trade-off between bias and variance [1, 25]. Second, the trace parameter λ is also known to control the quality of approximation. In the remainder of this paper, we present a generalization of LSTD-RP that handles eligibility traces, namely LSTD(λ)-RP (see Section 3), and we give its theoretical guarantees in Section 4.
In this section, we first consider the Bellman equation with random projections (see Equation (4)) and explore the existence and uniqueness of its solution, which is the quantity our newly proposed algorithm estimates. Then we present the LSTD with Random Projections and Eligibility Traces algorithm (LSTD(λ)-RP for short), as shown in Algorithm 1, and discuss its computational cost.
3.1 Bellman Equation with Random Projections
Assumption 1. The feature matrix Φ has full column rank; that is, the original high-dimensional feature vectors are linearly independent.
The following lemma shows that this linear independence property is preserved by random projections. Due to space restrictions, we leave the detailed proof to Appendix B.
Lemma 1. Let Assumption 1 hold. Then the randomly projected low-dimensional feature vectors are linearly independent almost everywhere (a.e.), and accordingly Ψ^T Δ_μ Ψ is invertible a.e. (Note that here the randomness is w.r.t. the random projection rather than the random sample.) In the rest of the paper, without loss of generality, we assume that the projected feature vectors are linearly independent and that Ψ^T Δ_μ Ψ is invertible.
Let Π_G denote the orthogonal projection onto the randomly projected low-dimensional feature space G with respect to the μ-weighted norm. According to Lemma 1, the projection Π_G has the following closed form:

Π_G = Ψ (Ψ^T Δ_μ Ψ)^{-1} Ψ^T Δ_μ.

Then the projected multi-step Bellman equation with random projections becomes

V = Π_G T^λ V.   (4)

Note that when λ = 0, Equation (4) reduces to the projected Bellman equation with random projections underlying LSTD-RP.
According to the Banach fixed point theorem, in order to guarantee the existence and uniqueness of the fixed point of the Bellman equation with random projections (Equation (4)), we only need to establish the contraction property of the operator Π_G T^λ. By simple derivations, this contraction property holds, as shown in the following Lemma 2; we leave the detailed proof to Appendix C.
Lemma 2. Let Assumption 1 hold. Then the projection operator Π_G is non-expansive w.r.t. the μ-weighted quadratic norm, and the operator Π_G T^λ is a γ(1 − λ)/(1 − λγ)-contraction.
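For intuition, the contraction modulus follows from a short calculation: since T^{i+1} is a γ^{i+1}-contraction in ‖·‖_μ and T^λ = (1 − λ) Σ_{i≥0} λ^i T^{i+1}, we have

```latex
\left\| T^{\lambda} V_{1}-T^{\lambda} V_{2}\right\|_{\mu}
\le (1-\lambda) \sum_{i=0}^{\infty} \lambda^{i} \gamma^{i+1}\left\|V_{1}-V_{2}\right\|_{\mu}
= \frac{\gamma(1-\lambda)}{1-\lambda \gamma}\left\|V_{1}-V_{2}\right\|_{\mu}.
```

Since Π_G is non-expansive in ‖·‖_μ, the composition Π_G T^λ inherits the same modulus.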
Denote the unique solution of the Bellman equation with random projections (Equation (4)) by V̄. In this work, we focus exclusively on the linear function approximation problem; therefore, there exists θ̄ ∈ R^d such that

V̄ = Ψ θ̄.   (6)

Furthermore, by Lemma 1, we can prove that the matrix defining θ̄ is invertible. Thus, θ̄ is well defined.
3.2 LSTD(λ)-RP Algorithm
We now present our proposed algorithm, LSTD(λ)-RP, in Algorithm 1; it aims to estimate the solution of the Bellman equation with random projections (see Equation (6)) using one sample trajectory generated by the Markov chain. We then discuss its computational advantage over LSTD(λ) and LSTD-RP.
The LSTD(λ)-RP algorithm is a generalization of LSTD-RP that uses eligibility traces to handle the λ > 0 case. Line 8 updates the eligibility traces, and lines 9-12 incrementally update Ã and b̃ as described in Equation (8); these updates differ from those of the LSTD-RP algorithm because of the eligibility traces. If the parameter λ is set to zero, LSTD(λ)-RP reduces to the original LSTD-RP algorithm. Moreover, if the random projection matrix A is the identity matrix, LSTD(λ)-RP becomes LSTD(λ).
From Algorithm 1, the LSTD(λ)-RP algorithm returns V̂ = Ψθ̃ with θ̃ = Ã^{-1} b̃ (we will see in Theorem 3 that Ã^{-1} exists with high probability for a sufficiently large sample size), where

Ã = Σ_{t=1}^{n} z̃_t (ψ(s_t) − γ ψ(s_{t+1}))^T,   b̃ = Σ_{t=1}^{n} z̃_t r_t,   z̃_t = Σ_{i=1}^{t} (γλ)^{t−i} ψ(s_i).

Here z̃_t is referred to as the randomly projected eligibility trace.
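The two steps of LSTD(λ)-RP can be sketched as follows (a simplified illustration rather than the paper's exact Algorithm 1; the dimensions, trajectory and synthetic data are arbitrary and serve only to exercise the code path):

```python
import numpy as np

def lstd_lambda_rp(phi_traj, rewards, d, gamma, lam, seed=0):
    """LSTD(lambda)-RP sketch: project D-dim features to d dims with a
    Gaussian random matrix, then run LSTD(lambda) in the projected space."""
    rng = np.random.default_rng(seed)
    n, D = phi_traj.shape
    A = rng.standard_normal((d, D)) / np.sqrt(d)   # random projection
    psi = phi_traj @ A.T                           # projected features
    A_tilde = np.zeros((d, d)); b_tilde = np.zeros(d); z = np.zeros(d)
    for t in range(n - 1):
        z = gamma * lam * z + psi[t]               # projected eligibility trace
        A_tilde += np.outer(z, psi[t] - gamma * psi[t + 1])
        b_tilde += z * rewards[t]
    # Assumes A_tilde is invertible, which holds with high probability
    # for a sufficiently long trajectory.
    return np.linalg.solve(A_tilde, b_tilde), A

# Toy usage with synthetic features and rewards.
rng = np.random.default_rng(3)
phi_traj = rng.standard_normal((500, 100))   # n = 500 steps, D = 100 features
rewards = rng.standard_normal(500)
theta, A = lstd_lambda_rp(phi_traj, rewards, d=10, gamma=0.9, lam=0.5)
assert theta.shape == (10,)
```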
The difference between the LSTD(λ)-RP algorithm and the prior LSTD-RP algorithm lies in the incorporation of eligibility traces. From Algorithm 1, the additional computational cost of maintaining the eligibility traces is O(d) per time step. Following the analysis of the computational complexity of LSTD-RP, the total computational complexity of LSTD(λ)-RP is O(n(d² + dD)), since projecting the features costs O(dD) and updating Ã costs O(d²) per step. This is much less than the computational cost of the LSTD(λ) algorithm, which is O(nD²).
To evaluate the performance of the LSTD(λ)-RP algorithm, we consider the gap between the value function V̂ learned by LSTD(λ)-RP and the true value function V, i.e., ‖V − V̂‖_μ. We refer to this gap as the total error of the LSTD(λ)-RP algorithm. By the triangle inequality, the total error can be decomposed into two parts: the estimation error ‖V̄ − V̂‖_μ and the approximation error ‖V − V̄‖_μ. We show how to derive meaningful upper bounds for these three errors in the following section.
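Writing V for the true value function, V̄ for the fixed point of the projected equation and V̂ for the sample-based estimate, the decomposition reads:

```latex
\|V-\hat{V}\|_{\mu} \;\le\; \underbrace{\|V-\bar{V}\|_{\mu}}_{\text{approximation error}} \;+\; \underbrace{\|\bar{V}-\hat{V}\|_{\mu}}_{\text{estimation error}}.
```

The approximation error depends only on the function space and projection, while the estimation error vanishes as the sample size grows.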
4 Theoretical Analysis
In this section, we conduct the theoretical analysis of LSTD(λ)-RP. First, we examine the sample size needed to ensure the uniqueness of the sample-based LSTD(λ)-RP solution; that is, we explore sufficient conditions guaranteeing the invertibility of Ã with high probability, which is used in the analysis of the estimation error bound. Second, we carry out the finite sample analysis of LSTD(λ)-RP, deriving meaningful upper bounds for the estimation error, the approximation error and the total error.
To perform such finite sample analysis, we also make a common assumption that the Markov chain has certain β-mixing properties, as stated in Assumption 2 [19, 29]. Under this assumption, we can make full use of concentration inequalities for β-mixing sequences during the finite sample analysis.
Assumption 2. The Markov chain (s_t)_{t≥1} is a stationary exponential β-mixing sequence; that is, there exist constants β₀ > 0, β₁ > 0 and κ > 0 such that its β-mixing coefficients satisfy β(m) ≤ β₀ exp(−β₁ m^κ).
4.1 Uniqueness of the Sample-Based Solution
In this subsection, we explore how large the number of observations needs to be to guarantee the invertibility of Ã with high probability, as shown in Theorem 3, which implies the uniqueness of the sample-based LSTD(λ)-RP solution. Due to space limitations, we leave the detailed proof to Appendix D.
From Theorem 3, we can draw the following conclusions:
The number of observations needed to guarantee the uniqueness of the sample-based LSTD(λ)-RP solution scales with the projected dimension d, and is therefore much smaller than that of LSTD(λ), which scales with the original dimension D (Theorem 1 of the LSTD(λ) analysis).
Setting λ = 0 in our analysis, our result still differs somewhat from that of LSTD-RP (Lemma 3 of the LSTD-RP analysis), since we consider the invertibility of the matrix Ã, while they consider the empirical Gram matrix.
4.2 Estimation Error Bound
In this subsection, we upper bound the estimation error ‖V̄ − V̂‖_μ of LSTD(λ)-RP, as shown in Theorem 4. For its proof, we first bound the estimation error on a fixed randomly projected space. Then, by utilizing the properties of random projections, the relationship between the smallest eigenvalues of the Gram matrices in F and G, and the properties of conditional expectations, we bridge the error bounds between the fixed space and an arbitrary random projection space. Due to space limitations, we leave the detailed proof to Appendix E.
Theorem 4. Let Assumptions 1 and 2 hold. For any δ ∈ (0, 1), when the sample size n and the projected dimension d are sufficiently large, with probability at least 1 − δ (the randomness is w.r.t. both the random sample and the random projection), the estimation error ‖V̄ − V̂‖_μ is upper bounded as follows, where ν is the smallest eigenvalue of the Gram matrix of the original feature space and the remaining quantities are defined as in Theorem 3.
4.3 Approximation Error Bound
We now upper bound the approximation error ‖V − V̄‖_μ of LSTD(λ)-RP, as shown in Theorem 5. For its proof, we first analyze the approximation error on a fixed randomly projected space. Then we bridge the approximation error bound between the fixed projection space and an arbitrary random projection space by leveraging the definition of the projection, the inner-product preservation property of random projections and the Chernoff-Hoeffding inequality for stationary β-mixing sequences. Due to space limitations, we leave the detailed proof to Appendix F.
Remark 1: From Theorem 5, setting λ = 0, the right-hand side of Equation (11) coincides with the corresponding bound for LSTD-RP (Theorem 2 of the LSTD-RP analysis) up to the coefficients. Furthermore, because eligibility traces control the quality of approximation, we can tune λ to make the approximation error of LSTD(λ)-RP smaller than that of LSTD-RP, since the corresponding coefficient in Equation (11) decreases as λ increases, whereas LSTD-RP corresponds to its largest value at λ = 0.
Remark 2: The coefficient in the approximation error bound can be further improved.
4.4 Total Error Bound
Corollary 6. Let Assumptions 1 and 2 hold. For any δ ∈ (0, 1), when the sample size n and the projected dimension d are sufficiently large, with probability at least 1 − δ (the randomness is w.r.t. the random sample and the random projection), the total error ‖V − V̂‖_μ can be upper bounded as follows:
By setting λ = 0, the total error bound of LSTD(λ)-RP is consistent with that of LSTD-RP except for some differences in the coefficients. These differences stem from the fact that the analysis of LSTD-RP is based on a model of regression with Markov design.
Although for λ = 0 our results are consistent with those of LSTD-RP up to coefficients, they also have some advantages over LSTD-RP and LSTD(λ), which we now discuss. From Theorem 4, Theorem 5 and Corollary 6, we can obtain the following:
Compared to LSTD(λ), the estimation error of LSTD(λ)-RP scales with the projected dimension d, which is much smaller than the corresponding dependence on D in the bound for LSTD(λ) (Theorem 1 of the LSTD(λ) analysis), since random projections make the complexity of the projected space G smaller than that of the original high-dimensional space F. Furthermore, the approximation error of LSTD(λ)-RP increases by at most an additional term that decreases with d. This shows that, in addition to the computational gains, the estimation error of LSTD(λ)-RP is much smaller at the cost of an increase of the approximation error, which fortunately can be controlled. Therefore, LSTD(λ)-RP may perform better than LSTD(λ) whenever the additional term in the approximation error is smaller than the gain achieved in the estimation error.
Compared to LSTD-RP, the trace parameter λ captures a trade-off between the estimation error and the approximation error of LSTD(λ)-RP, since eligibility traces control the trade-off between bias and variance during learning. As λ increases, the estimation error increases while the approximation error decreases. Thus, we can select an optimal λ that balances these two errors and obtain the smallest total error, which is smaller than the total error of LSTD-RP (the λ = 0 case) due to the effect of eligibility traces.
These conclusions demonstrate that random projections and eligibility traces can improve the approximation quality and computational efficiency. Therefore, LSTD(λ)-RP provides an efficient and effective approximation for value functions and can be superior to LSTD-RP and LSTD(λ).
Remark 4: Our analysis can be straightforwardly generalized to the emphatic LSTD algorithm (ELSTD) with random projections and eligibility traces.
5 Conclusion and Future Work
In this paper, we propose a new algorithm, LSTD(λ)-RP, which leverages random projection techniques and takes eligibility traces into consideration to tackle the challenges of computational efficiency and approximation quality in the high-dimensional feature space scenario. We also provide a theoretical analysis of LSTD(λ)-RP.
For future work, there remain many important and interesting directions: (1) the convergence analysis of off-policy learning with random projections is worth studying; (2) the comparison of LSTD(λ)-RP with ℓ1- and ℓ2-regularized approaches calls for further investigation; and (3) the roles of the trace parameter and the projected dimension in the error bounds deserve further discussion.
This work is partially supported by the National Key Research and Development Program of China (No. 2017YFC0803704 and No. 2016QY03D0501), the National Natural Science Foundation of China (Grant No. 61772525, Grant No. 61772524, Grant No. 61702517 and Grant No. 61402480) and the Beijing Natural Science Foundation (Grant No. 4182067 and Grant No. 4172063).
Appendix A Preparations
We now present some useful facts (Facts 7-9), which are important for the subsequent theoretical analysis. Specifically, Facts 7 and 8 state the norm and inner-product preservation properties of random projections respectively, and Fact 9 states the relationship between the smallest eigenvalues of the Gram matrices in the spaces F and G.
Fact 7. Let A be a d × D matrix of i.i.d. elements drawn from N(0, 1/d). Then for any vector x ∈ R^D, the random (w.r.t. the choice of the matrix A) variable ‖Ax‖² concentrates around its expectation ‖x‖². Mathematically, for any ε > 0, the probability that ‖Ax‖² deviates from ‖x‖² by more than ε‖x‖² is exponentially small in d.
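Fact 7 can be checked empirically; a small simulation (all sizes are arbitrary) illustrates the concentration:

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, trials = 500, 50, 200
x = rng.standard_normal(D)

# Empirically check that ||A x||^2 concentrates around ||x||^2 when the
# entries of A are i.i.d. N(0, 1/d).
ratios = []
for _ in range(trials):
    A = rng.standard_normal((d, D)) / np.sqrt(d)
    ratios.append(np.sum((A @ x) ** 2) / np.sum(x ** 2))
ratios = np.array(ratios)

# The mean ratio is close to 1 (unbiasedness); the spread around 1
# shrinks as d grows (concentration).
assert abs(ratios.mean() - 1.0) < 0.1
```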
Fact 8. Let x and y be vectors in R^D, and let A be a d × D matrix of i.i.d. elements drawn from N(0, 1/d). For any ε > 0 and δ ∈ (0, 1), for d sufficiently large, with probability at least 1 − δ, the projected inner product ⟨Ax, Ay⟩ deviates from ⟨x, y⟩ by at most ε‖x‖‖y‖.
Fact 9. Let F and G be the spaces defined in Section 2, with feature matrices Φ and Ψ of dimensions |S| × D and |S| × d respectively, where d < D. Let G_F = Φ^T Δ_μ Φ and G_G = Ψ^T Δ_μ Ψ be the Gram matrices for the spaces F and G, and let ν_F and ν_G be their corresponding smallest eigenvalues. Then, with high probability (w.r.t. the random projection), ν_G can be lower bounded in terms of ν_F.
The following fact gives a measure of the difference between the distribution of blocks in two cases: one where the blocks are independent and one where they are dependent. The distribution within each block is assumed to be the same in both cases.
Fact 10 (Corollary 2.7 in the cited work). Let h be a bounded measurable function on a product probability space. Let P be a probability measure on the product space with the given marginal measures, and let P̃ be the corresponding product of the marginals. Then the difference between the expectations of h under P and under P̃ is controlled by the β-mixing coefficients of P.
In addition, we present the key Fact 11 for our analysis, which shows that the concentration inequality holds for the infinitely-long-trace β-mixing process.
Fact 11 (Lemma 2 in the cited work). Let Assumptions 1 and 2 hold, and define the empirical matrix built from the eligibility traces as in the algorithm. Then for any δ ∈ (0, 1), with probability at least 1 − δ, its deviation from its stationary expectation is bounded.
(Theorem 1 in the cited work.) The LSTD(λ) approximation error satisfies the following bound.
Appendix B Proof of Lemma 1
Under Assumption 1, since Ψ = ΦA^T, to prove that the projected feature vectors are linearly independent a.e., that is,
we only need to show that
the following holds. Decompose the random projection matrix A into two blocks, A = [A₁, A₂], where A₁ is of size d × d and A₂ of size d × (D − d). Since each element of A is a continuous random variable, by mathematical induction we can show that the determinant of the matrix A₁ is nonzero a.e., which implies that the projected feature vectors are linearly independent a.e. Therefore, we have
Furthermore, we have
Therefore, is invertible (a.e.). ∎
Appendix C Proof of Lemma 2
Using the Pythagorean theorem, for any measurable function f, we have
Hence, the projection operator Π_G is non-expansive w.r.t. the μ-weighted quadratic norm.
Furthermore, it is known that the multi-step Bellman operator T^λ is a γ(1 − λ)/(1 − λγ)-contraction, i.e.,
Therefore, Π_G T^λ is also a γ(1 − λ)/(1 − λγ)-contraction. ∎
Appendix D Proof of Theorem 3
For simplicity, we introduce the shorthand below, and let ρ(·) denote the spectral radius of a matrix.
Under Assumption 1, we know that Ψ^T Δ_μ Ψ is invertible by Lemma 1. Consequently, Ã is invertible if and only if the associated normalized matrix is invertible. From the relationship between the spectral radius of a matrix and its norm, if the operator norm of the difference between this matrix and the identity is less than 1, then its spectral radius is less than 1, which implies that the matrix is invertible.
From the definition and properties of the matrix norm, we have
Therefore, in order to derive sufficient conditions under which Ã is invertible, we only need to find sufficient conditions such that