1 Problem description
We consider a discrete-time, infinite-horizon discounted Markov Decision Process (MDP) with states $s \in \mathcal{S}$, admissible actions $a \in \mathcal{A}$, discount factor $\gamma \in [0, 1)$, reward function $r(s, a)$, and transition probability density $p(s' \mid s, a)$. Solving the MDP means finding a policy $\pi$ that maximizes the accumulated discounted reward, with the state and action distribution induced by applying that policy:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \qquad (1)$$

Policy evaluation is a key component of policy iteration and policy gradient approaches to (1). Its objective is to find the Q-function $Q^{\pi}$ associated with a stationary deterministic policy $\pi$; throughout this work, we consider the Q-function instead of the state value function in order to cover the model-free setting. $Q^{\pi}$ is the fixed point of the Bellman operator, or equivalently the solution of the Bellman equation:

$$Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[Q^{\pi}\big(s', \pi(s')\big)\right] \qquad (2)$$
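As a concrete illustration of (2), the following minimal sketch evaluates a deterministic policy exactly on a small finite MDP, where the Bellman equation reduces to a linear system. The function name `evaluate_policy_q`, the array layout (`P[s, a, s']`, `R[s, a]`), and the toy numbers are assumptions made for illustration only; they are not part of the method studied in this paper.

```python
import numpy as np


def evaluate_policy_q(P, R, policy, gamma):
    """Solve Q = R + gamma * P_pi Q exactly for a finite MDP.

    P: (S, A, S) array of transition probabilities P[s, a, s'].
    R: (S, A) array of rewards r(s, a).
    policy: (S,) integer array, the deterministic action taken in each state.
    """
    n_states, n_actions = R.shape
    n_pairs = n_states * n_actions
    # P_pi maps a pair (s, a) to the next pair (s', policy[s']).
    P_pi = np.zeros((n_pairs, n_pairs))
    for s_next in range(n_states):
        col = s_next * n_actions + policy[s_next]
        P_pi[:, col] = P[:, :, s_next].reshape(-1)
    # The Bellman equation is linear in Q: (I - gamma * P_pi) q = r.
    q = np.linalg.solve(np.eye(n_pairs) - gamma * P_pi, R.reshape(-1))
    return q.reshape(n_states, n_actions)


# Toy two-state, two-action MDP (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
policy = np.array([1, 0])
Q = evaluate_policy_q(P, R, policy, gamma=0.9)
```

This exact solve scales cubically with the number of state-action pairs, which is why it stops being practical for large or continuous state spaces.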
When the state space is large or infinite, solving (2) exactly becomes intractable. Least-Squares Temporal Difference (LSTD) is a widely used simulation-based algorithm for approximately solving the projected Bellman equation with a linear representation of the value function [1]. In order to apply LSTD to policy iteration, Lagoudakis and Parr [2] proposed a Q-function extension and showed that the resulting policy iteration algorithm can be successfully applied to control problems.
In this work, we study LSTD-based approximate policy evaluation methods that benefit from manifold regularized learning, with the intuition that the Q-function is smooth on the manifold where states lie and not necessarily in the ambient Euclidean space. Such manifold structure naturally arises in many robotics tasks due to constraints in the state space. For example, contact, e.g., between a foot and the ground or a hand and an object [3], restricts feasible states to lie along a manifold or a union of manifolds. Other examples include cases where state variables belong to the Special Euclidean group SE(3) to encode 3D poses, and obstacle avoidance settings where geodesic paths between state vectors are better motivated than geometrically infeasible straight-line paths in the ambient space.
2 Background
2.1 Manifold regularized learning
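Manifold regularization has previously been studied in semi-supervised learning [4]. This data-dependent regularization exploits the geometry of the input space, thus achieving better generalization error. Given a labeled dataset $\{(x_i, y_i)\}_{i=1}^{n}$, the Laplacian Regularized Least-Squares method (LapRLS) [4] finds a function $f$ in a reproducing kernel Hilbert space $\mathcal{H}_K$ that fits the data with proper complexity for generalization:

$$f^{*} = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\,\|Y - f(X)\|_2^2 + \lambda_K \|f\|_{\mathcal{H}_K}^2 + \lambda_M \|f\|_{\mathcal{M}}^2 \qquad (3)$$

where the data matrices are $X = [x_1, \dots, x_n]^\top$ and $Y = [y_1, \dots, y_n]^\top$ with $x_i \in \mathcal{X}$ and $y_i \in \mathbb{R}$, $\|\cdot\|_{\mathcal{H}_K}$ is the norm in $\mathcal{H}_K$, $\|\cdot\|_{\mathcal{M}}$ is a penalty that reflects the intrinsic structure of $P_X$ (see §2.2 in [4] for choices of $\|\cdot\|_{\mathcal{M}}$), and the scalars $\lambda_K$, $\lambda_M$ are regularization parameters. Here $P_X$ denotes the marginal probability distribution of inputs, and $f(X) = [f(x_1), \dots, f(x_n)]^\top$. A natural choice is $\|f\|_{\mathcal{M}}^2 = \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$, where $\mathcal{M}$ is the support of $P_X$, which is assumed to be a compact manifold [4], and $\nabla_{\mathcal{M}}$ is the gradient along $\mathcal{M}$. When $P_X$ is unknown, as in most learning settings, $\|f\|_{\mathcal{M}}^2$ can be estimated empirically, and the optimization problem (3) becomes

$$f^{*} = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\,\|Y - f(X)\|_2^2 + \lambda_K \|f\|_{\mathcal{H}_K}^2 + \frac{\lambda_M}{n^2}\, f(X)^\top L\, f(X) \qquad (4)$$

where the matrix $L \in \mathbb{S}_+^{n}$ is a graph Laplacian ($\mathbb{S}_+^{n}$ denotes the set of $n \times n$ symmetric positive semidefinite matrices); different ways of constructing a graph Laplacian from data can be found in [5]. Note that problem (4) is still convex, since the graph Laplacian $L$ is positive semidefinite, and the multiplicity of its eigenvalue $0$ is the number of connected components in the graph. In fact, a broad family of graph regularizers can be used in this context [6, 7]. This includes the iterated graph Laplacian, which is theoretically better than the standard Laplacian [7], and the diffusion kernel on graphs.

Based on the Representer Theorem [8], the solution of (4) has the form $f^{*}(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$, where $\alpha \in \mathbb{R}^{n}$ and $k$ is the kernel associated with $\mathcal{H}_K$. After substituting this form of $f^{*}$ into (4) and solving the resulting least-squares problem, we can derive

$$\alpha^{*} = \left(K + n \lambda_K I + \frac{\lambda_M}{n}\, L K\right)^{-1} Y \qquad (5)$$

where the matrix $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$ is the Gram matrix, and the matrix $I$ is an identity matrix of matching size.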
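A minimal NumPy sketch of a closed-form LapRLS solve may help make (5) concrete. It assumes a Gaussian kernel, a symmetrized k-nearest-neighbor graph with 0/1 weights, and a normalization in which the data-fit term is scaled by $1/n$ and the graph term by $1/n^2$. All function names and default hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_gram(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def knn_laplacian(X, k=5):
    """Combinatorial Laplacian L = D - W of a symmetrized k-NN graph (0/1 weights)."""
    n = X.shape[0]
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(sq, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)  # symmetrize the adjacency matrix
    return np.diag(W.sum(axis=1)) - W


def lap_rls_fit(X, y, lam_k, lam_m, bandwidth=1.0, k=5):
    """Solve (K + n*lam_k*I + (lam_m/n) * L K) alpha = y for the kernel weights."""
    n = X.shape[0]
    K = rbf_gram(X, X, bandwidth)
    L = knn_laplacian(X, k)
    A = K + n * lam_k * np.eye(n) + (lam_m / n) * L @ K
    return np.linalg.solve(A, y)


def lap_rls_predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x) at the rows of X_new."""
    return rbf_gram(X_new, X_train, bandwidth) @ alpha
```

Setting `lam_m=0` recovers ordinary kernel ridge regression, which is a convenient sanity check when tuning the two regularization parameters.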
2.2 Kernelized LSTD with regularization
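Farahmand et al. [9] introduce an $\ell_2$ regularization extension to kernelized LSTD [10], termed Regularized LSTD (REG-LSTD), featuring better control of the complexity of the function approximator through regularization, and mitigating the burden of selecting basis functions through kernel machinery. REG-LSTD is formulated by adding regularization terms to both steps of a nested minimization problem that is equivalent to the original LSTD [11]:

$$h_Q^{*} = \arg\min_{h \in \mathcal{H}_K} \; \mathbb{E}\left[\big(h(s, a) - r(s, a) - \gamma\, Q(s', \pi(s'))\big)^2\right] + \lambda_h \|h\|_{\mathcal{H}_K}^2 \qquad (6)$$

$$Q^{*} = \arg\min_{Q \in \mathcal{H}_K} \; \mathbb{E}\left[\big(Q(s, a) - h_Q^{*}(s, a)\big)^2\right] + \lambda_Q \|Q\|_{\mathcal{H}_K}^2 \qquad (7)$$

where, informally speaking, $h_Q^{*}$ in (6) is the projection of the right-hand side of (2) onto $\mathcal{H}_K$, and $Q$ is enforced to be close to that projection in (7). The kernel $k$ can be selected as a tensor product kernel $k(z, z') = k_{\mathcal{S}}(s, s')\, k_{\mathcal{A}}(a, a')$, where $z = (s, a)$ (note that the product of two kernels is a kernel [12]). Furthermore, $k_{\mathcal{A}}$ can be the Kronecker delta function if the admissible action set is finite.

As in LSTD, the expectations in (6) and (7) are then approximated with finite samples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{n}$, leading to

$$\hat{h}_Q = \arg\min_{h \in \mathcal{H}_K} \; \frac{1}{n}\,\|h(Z) - R - \gamma\, Q(Z')\|_2^2 + \lambda_h \|h\|_{\mathcal{H}_K}^2 \qquad (8)$$

$$\hat{Q} = \arg\min_{Q \in \mathcal{H}_K} \; \frac{1}{n}\,\|Q(Z) - \hat{h}_Q(Z)\|_2^2 + \lambda_Q \|Q\|_{\mathcal{H}_K}^2 \qquad (9)$$

where $h(Z)$ is defined as $f(X)$ in (3) with $Z = [z_1, \dots, z_n]^\top$ and $z_i = (s_i, a_i)$, $Z'$ similarly consists of $z'_i = (s'_i, \pi(s'_i))$ with actions generated by the policy to evaluate, and the reward vector is $R = [r_1, \dots, r_n]^\top$. Invoking the Representer Theorem [8], Farahmand et al. [9] showed that

$$\hat{h}_Q(\cdot) = \sum_{i=1}^{n} \beta_i\, k(z_i, \cdot), \qquad \hat{Q}(\cdot) = \sum_{i=1}^{2n} \alpha_i\, k(\tilde{z}_i, \cdot) \qquad (10)$$

where the weight vectors are $\beta \in \mathbb{R}^{n}$ and $\alpha \in \mathbb{R}^{2n}$, and the matrix $\tilde{Z}$ is $Z$ and $Z'$ stacked. Substituting (10) into (8) and (9), we obtain the formula for $\alpha$:

$$\alpha = \left(F^\top F K_Q + n \lambda_Q I\right)^{-1} F^\top E R \qquad (11)$$

where $F = C_1 - \gamma E C_2$, $E = K_h (K_h + n \lambda_h I)^{-1}$, $C_1 = [I_{n} \;\; 0_{n}]$, $C_2 = [0_{n} \;\; I_{n}]$, $K_h = k(Z, Z)$, and $K_Q = k(\tilde{Z}, \tilde{Z})$.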
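The REG-LSTD weight computation can be sketched in a few lines of NumPy. The sketch below follows one reading of the closed-form solution of Farahmand et al. [9], namely $\alpha = (F^\top F K_Q + n\lambda_Q I)^{-1} F^\top E R$ with $F = C_1 - \gamma E C_2$ and $E = K_h (K_h + n\lambda_h I)^{-1}$. That reading, the Gaussian state kernel, the encoding of a discrete action in the last column of each sample, and all function names are assumptions made for illustration.

```python
import numpy as np


def product_kernel(Z1, Z2, bandwidth=1.0):
    """Tensor product kernel: Gaussian on states times Kronecker delta on actions.

    Each row of Z is (state features..., action), with the action in the last column.
    """
    S1, A1 = Z1[:, :-1], Z1[:, -1]
    S2, A2 = Z2[:, :-1], Z2[:, -1]
    sq = np.sum(S1**2, 1)[:, None] + np.sum(S2**2, 1)[None, :] - 2.0 * S1 @ S2.T
    k_state = np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))
    k_action = (A1[:, None] == A2[None, :]).astype(float)
    return k_state * k_action


def reg_lstd_weights(Z, Z_next, r, gamma, lam_h, lam_q, kernel=product_kernel):
    """Return (alpha, Z_all) so that Q(z) = kernel(z, Z_all) @ alpha."""
    n = Z.shape[0]
    Z_all = np.vstack([Z, Z_next])           # stacked sample points, 2n rows
    K_h = kernel(Z, Z)                        # n x n Gram matrix for h
    K_q = kernel(Z_all, Z_all)                # 2n x 2n Gram matrix for Q
    C1 = np.hstack([np.eye(n), np.zeros((n, n))])   # selects Q(Z)
    C2 = np.hstack([np.zeros((n, n)), np.eye(n)])   # selects Q(Z')
    # E = K_h (K_h + n*lam_h*I)^{-1}; the two factors commute, so solve() suffices.
    E = np.linalg.solve(K_h + n * lam_h * np.eye(n), K_h)
    F = C1 - gamma * E @ C2
    A = F.T @ F @ K_q + n * lam_q * np.eye(2 * n)
    alpha = np.linalg.solve(A, F.T @ E @ r)
    return alpha, Z_all


def q_values(Z_query, alpha, Z_all, kernel=product_kernel):
    """Evaluate the fitted Q-function at the rows of Z_query."""
    return kernel(Z_query, Z_all) @ alpha
```

In a policy iteration loop, `Z_next` would be rebuilt at every iteration from the sampled next states and the actions chosen by the current policy, while `Z` and `r` stay fixed.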
3 Our approach
Our approach combines the benefits of both manifold regularized learning (§2.1), which incorporates geometric structure, and kernelized LSTD with regularization (§2.2), which offers better control over function complexity and eases feature selection.
More concretely, besides the regularization term in (9) that controls the complexity of $Q$ in the ambient space, we augment the objective function with a manifold regularization term that enforces $Q$ to be smooth on the manifold that supports the data distribution. In particular, if the manifold regularization term is chosen as in §2.1, a large penalty is imposed if $Q$ varies too fast along the manifold. With the empirically estimated manifold regularization term, optimization problem (9) becomes (cf. (4))
(12) 
which admits the optimal weight vector (cf., (11)):
(13) 
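For orientation, one plausible instantiation of the manifold-regularized objective and its closed-form weights is written out below. It rests on assumptions that are not confirmed by the text: the graph Laplacian $L$ is built over the $2n$ stacked points $\tilde{Z}$ (the samples $Z$ followed by $Z'$), the graph term is scaled by $1/n^2$ as in LapRLS, and the REG-LSTD quantities are $E = K_h (K_h + n\lambda_h I)^{-1}$, $F = C_1 - \gamma E C_2$, and $K_Q = k(\tilde{Z}, \tilde{Z})$. Under a different graph construction or normalization, the extra term changes accordingly.

```latex
% Assumed notation: \tilde{Z} stacks Z and Z', K_Q = k(\tilde{Z}, \tilde{Z}),
% L is a graph Laplacian over the 2n stacked points, R is the reward vector.
\begin{align*}
\hat{Q} &= \arg\min_{Q \in \mathcal{H}_K} \;
  \frac{1}{n}\,\bigl\|Q(Z) - \hat{h}_Q(Z)\bigr\|_2^2
  + \lambda_Q \|Q\|_{\mathcal{H}_K}^2
  + \frac{\lambda_M}{n^2}\, Q(\tilde{Z})^\top L\, Q(\tilde{Z}), \\
% Substitute Q(\tilde{Z}) = K_Q \alpha, so that Q(Z) - \hat{h}_Q(Z) = F K_Q \alpha - E R,
% then set the gradient with respect to \alpha to zero:
\alpha &= \Bigl(F^\top F K_Q + n \lambda_Q I
  + \frac{\lambda_M}{n}\, L K_Q\Bigr)^{-1} F^\top E\, R .
\end{align*}
```

Setting $\lambda_M = 0$ in this form recovers the REG-LSTD weights, which matches the role of the manifold term as an add-on to the ambient regularizer.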
4 Experimental results
We present experimental results on two standard RL benchmarks: two-room navigation and cart-pole balancing. REG-LSTD with manifold regularization (MR-LSTD) (§3) is compared against REG-LSTD without manifold regularization (§2.2) and LSTD with three commonly used basis function construction mechanisms: polynomial [13, 2], radial basis functions (RBF) [2, 14], and Laplacian eigenmap [15, 16], in terms of the quality of the policy produced by Least-Squares Policy Iteration (LSPI) [2]. The kernel used in the experiments is with hyperparameter . We use the combinatorial Laplacian with adjacency matrix computed from neighborhood with weights [5].
4.1 Two-room navigation
The two-room navigation problem is a classic benchmark for RL algorithms that cope with either discrete [16, 17, 15] or continuous state spaces [14, 18]. In the vanilla two-room problem, the state space is a discrete grid world, and the admissible actions are stepping in one of the four cardinal directions, i.e., up, right, down, and left. The dynamics are stochastic: each action succeeds with probability when movement is not blocked by an obstacle or border, and otherwise leaves the agent in the same location. The goal is to navigate to the diagonally opposite corner in the other room, with a single-cell doorway connecting the two rooms. Reward is at the goal location otherwise , and the discount factor is set to . Data are collected beforehand by uniformly sampling states and actions, and are reused throughout the LSPI iterations. As seen in Table 1, the Laplacian eigenmap, which exploits intrinsic state geometry, outperforms the parametric basis functions (polynomial and RBF). The MR-LSTD method we propose achieves the best performance.
Table 1

| # of samples | polynomial | RBF | eigenmap | REG-LSTD | MR-LSTD |
|---|---|---|---|---|---|
| 250 | 7.00 | 16.31 | 6.44 | 6.69 | 4.40 |
| 500 | 5.46 | 11.7 | 3.9 | 2.43 | 0.56 |
| 1,000 | 4.55 | 2.4 | 1.56 | 0.24 | 0.02 |
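The two-room dynamics and the uniform data collection described in this subsection can be sketched as a small simulator. The grid size, success probability, and reward values below are placeholders chosen for illustration (the class `TwoRoomGrid` and its parameters are hypothetical), so the sketch does not reproduce the exact experimental setup.

```python
import numpy as np

# Row/column offsets for the four cardinal actions: up, right, down, left.
ACTIONS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


class TwoRoomGrid:
    """Grid world split by a vertical wall with a single-cell doorway."""

    def __init__(self, height=10, width=21, success_prob=0.9, seed=0):
        self.height, self.width = height, width
        self.success_prob = success_prob      # assumed value, not from the paper
        self.rng = np.random.default_rng(seed)
        self.wall_col = width // 2
        self.door = (height // 2, self.wall_col)
        self.goal = (height - 1, width - 1)   # far corner of the second room

    def is_free(self, cell):
        row, col = cell
        if not (0 <= row < self.height and 0 <= col < self.width):
            return False
        return col != self.wall_col or cell == self.door

    def step(self, state, action):
        """Move with probability success_prob if the target cell is free."""
        next_state = state
        if self.rng.random() < self.success_prob:
            d_row, d_col = ACTIONS[action]
            candidate = (state[0] + d_row, state[1] + d_col)
            if self.is_free(candidate):
                next_state = candidate
        # Placeholder reward: 1 on reaching the goal, 0 otherwise.
        reward = 1.0 if next_state == self.goal else 0.0
        return next_state, reward

    def sample_batch(self, n_samples):
        """Uniformly sample states and actions; return (s, a, r, s') tuples."""
        free_cells = [(r, c) for r in range(self.height)
                      for c in range(self.width) if self.is_free((r, c))]
        batch = []
        for _ in range(n_samples):
            state = free_cells[self.rng.integers(len(free_cells))]
            action = int(self.rng.integers(4))
            next_state, reward = self.step(state, action)
            batch.append((state, action, reward, next_state))
        return batch
```

A batch from `sample_batch` is collected once and can then be reused across policy iteration steps, since only the next-state actions depend on the policy being evaluated.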
4.2 Cart-pole balancing
The cart-pole balancing task is to balance a pole upright by applying force to the cart to which it is attached [19]; the cart-pole environment in the OpenAI Gym package is used in our implementation. The agent constantly receives reward until the trial ends with reward when it’s away from the upright posture. In contrast to the two-room navigation task, the state space is continuous, consisting of the angle and angular velocity of the pole. The admissible actions are finite: pushing the cart to the left or to the right. The discount factor . Data are collected from random episodes, i.e., starting from a perturbed upright posture and applying uniformly sampled actions. Results are reported in Table 2, which shows that REG-LSTD achieves significantly better performance than the parametric basis functions, and that performance improves further with manifold regularization.
Table 2

| # of samples | polynomial | RBF | REG-LSTD | MR-LSTD |
|---|---|---|---|---|
| 250 | 121.10 | 116.59 | 186.82 | 195.78 |
| 500 | 150.50 | 112.22 | 181.44 | 195.91 |
| 1,000 | 154.38 | 130.74 | 188.82 | 198.12 |
5 Related work
The closest work to this paper was recently introduced by Li et al. [20], who utilized manifold regularization by learning a state representation through unsupervised learning and then adopting the learned representation in policy iteration. In contrast to that work, we naturally blend manifold regularization with policy evaluation, with a possibly provable performance guarantee (left for future work). There is also work on constructing basis functions directly from geometry, e.g., Laplacian methods [15, 16] and geodesic Gaussian kernels [14]. Furthermore, different regularization mechanisms for LSTD have been proposed, including $\ell_1$ regularization for feature selection [21], and nested $\ell_2$ and $\ell_1$ penalization to avoid overfitting [22].
6 Conclusion and future work
We propose manifold regularization for a kernelized LSTD approach in order to exploit the intrinsic geometry of the state space for better sample efficiency and Q-function approximation, and demonstrate superior performance on two standard RL benchmarks. Future work directions include 1) accelerating the kernel machinery with structured random matrices [23, 24], and using graph sketching for graph regularizer construction, to scale up to large datasets and rich observations, e.g., images; 2) providing theoretical justification, and combining manifold regularization with deep neural nets [25, 26] and with other policy evaluation algorithms, e.g., [27, 10], and policy iteration algorithms; 3) learning with a data-dependent kernel that captures the geometry (equivalent to the manifold regularized solution [28]), which makes it easier to derive new algorithms; and 4) extending to continuous action spaces by constructing kernels such that policy improvement (optimizing over actions) is tractable [12].
References
[1] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. In Recent Advances in Reinforcement Learning, pages 33–57. Springer, 1996.
[2] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
[3] M. Posa, S. Kuindersma, and R. Tedrake. Optimization and stabilization of trajectories for constrained dynamical systems. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 1366–1373. IEEE, 2016.
[4] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
[5] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[6] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In COLT, volume 2777, pages 144–158. Springer, 2003.
[7] X. Zhou and M. Belkin. Semi-supervised learning by higher order regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 892–900, 2011.
[8] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In Computational Learning Theory, pages 416–426. Springer, 2001.
[9] A. M. Farahmand, M. Ghavamzadeh, S. Mannor, and C. Szepesvári. Regularized policy iteration. In Advances in Neural Information Processing Systems, pages 441–448, 2009.
[10] X. Xu, D. Hu, and X. Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973–992, 2007.
[11] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
[12] M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2(Dec):299–312, 2001.
[13] P. J. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.
[14] M. Sugiyama, H. Hachiya, C. Towell, and S. Vijayakumar. Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25(3):287–304, 2008.
[15] S. Mahadevan. Representation policy iteration. arXiv preprint arXiv:1207.1408, 2012.
[16] M. Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In IJCAI, pages 2574–2579, 2007.
[17] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752–759. ACM, 2008.
[18] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1017–1024. ACM, 2009.
[19] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
[20] H. Li, D. Liu, and D. Wang. Manifold regularized reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2017.
[21] J. Z. Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528. ACM, 2009.
[22] M. Hoffman, A. Lazaric, M. Ghavamzadeh, and R. Munos. Regularized least squares temporal difference learning with nested $\ell_2$ and $\ell_1$ penalization. Recent Advances in Reinforcement Learning, 7188:102–114, 2012.
[23] F. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
[24] K. Choromanski, M. Rowland, and A. Weller. The unreasonable effectiveness of random orthogonal embeddings. arXiv preprint arXiv:1703.00864, 2017.
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[26] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 387–395, 2014.
[27] B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual kernel embeddings. arXiv preprint arXiv:1607.04579, 2016.
[28] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 824–831. ACM, 2005.