I Introduction
Decisionmaking of an autonomous mobile robot moving in unstructured environments typically requires the robot to account for uncertain action (motion) outcomes, and at the same time, maximize the longterm return. The Markov Decision Process (MDP) is an extremely useful framework for formulating such decisiontheoretic planning problems [7]. Since the robot is moving in a continuous space, directly employing the standard form of MDP needs a discretized representation of the robot state and action. For example, in practice the discretized robot states are associated with spatial tessellation [37], and a gridmap like representation has been widely used for robot planning problems where each grid is regarded as a discrete state; similarly, actions are simplified as transitions to traversable grids which are usually the very few number of adjacent grids in vicinity.
However, the discretization can be problematic. Specifically, if the discretization is low in resolution (i.e., large but few number of grids), the decision policy becomes a very rough approximation of the simplified (discretized) version of the original problem; on the other hand, if the discretization is high in resolution, the result might be approximated well, but this will induce prohibitive computational cost and prevent the realtime decisionmaking. Finally, the characteristics of state space might be complex and it is inappropriate to conduct latticelike tessellation which is likely to result in suboptimal solutions. See Fig. 1 for an illustration.
Another critical issue lies in MDP’s transition model which describes the probabilistic transitions from a state to others. However, obtaining an accurate stochastic transition model for robot motion transition is unrealistic even without considering spatiotemporal variability. This is another important factor that significantly limits the applicability of MDP in many realworld problems. Reinforcement learning
[19]does not rely on a transition model specification, but requires cumbersome training trials to learn the value function and the policy, which can be viewed as another strong assumption in many robotic missions. Thus, it is desirable that the demanding assumption of a known transition model can be relaxed. Fortunately, the characteristics of transition probabilistic distribution (e.g., mean, variance, or quantiles) for most robotic planning and control systems can be obtained from historical data or offline tests
[37]. If we only assume such “partial knowledge”–mean and variance–of the transition model, we must redesign the modeling and solving mechanisms which will be presented in this work.To address the above problems, we propose a kernel Taylorbased approximation approach. Our contributions can be summarized as follows:

First, to relax the requirement of fully known transition functions, we apply the secondorder Taylor expansion to the value function [8, 9]
. The Bellmantype policy evaluation equation and Bellman optimality equation are then approximated by a partial differential equation (PDE) which only relies on the first and second moments of transition probability distribution.

Second, to improve the generalizability of the value function, we use kernel functions which can represent a large number of function families for better value approximation. This approximation can conveniently characterize the underlying value functions with a finite set of discrete supporting states.

Finally, we develop an efficient policy iteration algorithm by integrating the kernel value function representation and the Taylorbased approximation to Bellman optimality equation. The policy evaluation step can be represented as a linear system of equations with characterizing values at the finite supporting states, and the only information needed is the first and second moments of the transition function. This alleviates the need for heavily searching in continuous/large state space and the need for carefully modeling/engineering the transition probability.
Ii Related Work
Our work primarily focuses on value function approximation in large/continuous state space using only minimum prior knowledge of the transition function. A major challenge for solving the continuousstate MDP involves the search in a largescale (usually infinite) state space. Popular methods in robotics that avoid intractability of computing the value function over the continuous space are by tessellating the continuous state space into grids [30, 29, 12, 1, 3]. However, this naive approach does not scale well and may give inferior performance when the problem size increases, known as the curse of dimensionaliy [4]. A more advanced discretization technique that alleviates this problem is by adaptive discretization [15, 23, 38, 27, 22].
Alternative methods tackle this challenge by representing and also approximating the value function by a set of basis functions or some parametric functions [5, 35, 28]. The parameters can be optimized by minimizing the Bellman residual [2]. However, these methods are not applicable in complicated problems because defining features to approximate the value function linearly is nontrivial. This weakness may be resolved through kernel methods [17]. Because the weights in linear combination of basis functions can be presented by their product (through socalled duel form of least squares [33]), the weights can be written in terms of kernel functions and value functions at supporting states. Once value functions at supporting states are obtained, the approximation to value function at any state is also determined. The approach to approximating the value function by kernel functions is referred as the direct kernelbased method
in this paper. In addition, it is generally hard to find a suitable nonlinear function (such as neural networks) to approximate the value function
[13]. There is a vast literature on kernelized value function approximations in reinforcement learning [10, 20, 36, 39], but few studies in robotic planning problems leveraged this approach. A recent application of work [10] on marine robots can be found in [24].Unfortunately, all these schemes rely on either fully known transitions in MDP or the selection of basis functions, which are difficult to obtain in practice. Therefore, the challenge becomes how to design a principled methodology without explicitly relying on basis functions and without full knowledge of transitions in MDP, which will be addressed in this work.
Iii Preliminary Material
Iiia Markov Decision Processes
We formulate the robot decisiontheoretic planning problem as an infinite horizon discounted Markov Decision Process (MDP) with continuous states and finite actions. An infinite horizon discounted MDP is defined by a 5tuple , where is the dimensional continuous state space and is a finite set of actions. can be thought of as the robot workspace in our study. A robot transits from a state to the next by taking an action in a stochastic environment and obtains a reward
. Such transition is governed by a conditional probability distribution
which is also termed as transition model (or transition function); the reward , a mapping from a pair of state and action to a scalar value, specifies the shortterm objective that the robot receives by taking action at state . The final element in is a discount factor which will be used in the expression of value function.We consider the class of deterministic policies , which defines a mapping from a state to an action. The expected discounted cumulative reward for any policy starting at any state is expressed as
(1) 
We can rewrite the above equation recursively as follows
(2) 
where is called the Bellman operator, and . The function in Eq. (2) is usually called the state value function of the policy . Solving an MDP is to find the optimal policy with the optimal value function which satisfies the Bellman optimality equation
(3) 
IiiB Approximate Policy Iteration via Value Function Representation
To solve an MDP, value iteration and policy iteration are the most prevalent approaches. It has been shown that the value iteration and policy iteration can achieve similar stateoftheart performances in terms of solution quality and running time [6, 35]. Our work will be built upon policy iteration and here we provide a summary of the important value function approximation process used in the policy iteration [31, 14].
Policy iteration requires initialization of the policy (can be random), based on which a system of linear equations can be established where each equation is exactly the value function (Eq. (2)). When the states in the MDP are finite, the solution to this linear system yields incumbent values for all states [32]. This step is called policy evaluation. The second step is to improve the current policy by greedily improving local actions based on the incumbent values obtained. This step is called policy improvement. Through iterating these two steps, we can find the optimal policy and a unique solution to the value function that satisfies Eq. (1) for every state.
If, however, the states are continuous or the number of states is infinite, it is difficult to evaluate the value function at every state. One must resort to approximate solutions. Suppose that the value function can be represented by a weighted linear combination of known functions where only weights are to be determined, then a natural way to go is leveraging the Bellmantype equation, i.e., Eq. (2), to compute the weights. Specifically, given an arbitrary policy, the representation of value function can be evaluated at a finite number of states, leading to a linear system of equations whose solutions can be viewed as weights [21]. This obtained representation of value function can be used to improve the current policy. The remaining procedure is then similar to the standard policy iteration method. The final obtained value function representation serves as an approximated optimal value function for the whole continuous state space, and the corresponding policy can be obtained accordingly.
Formally, let the value function approximation under policy be
(4) 
where . The set is the basis functions in literature [31]. A finite number of supporting states , can be selected, which are minimized via the squared Bellman error over , defined by The solution for may have a closed form in terms of the basis functions, transition probabilities, and rewards [21]. By policy iteration, the final solution for can be obtained.
Note that may be generalized to any parametric nonlinear functions such as neural networks, and that the selection of supporting states needs to take account of the characteristics of the underlying value functions (in our robotics decisiontheoretic planning scenarios, it relates to the landscape geometry of the terrain).
Iv Kernel TaylorBased Approximate Policy Iteration
Our objective is to design a principled kernelbased policy iteration approach by leveraging kernel methods to solve the continuousstate MDP. In contrast to most decisiontheoretic planning frameworks which assume fully known MDP transition probabilities [7, 32], we propose a method that eliminates such a strong premise which oftentimes is extremely difficult to engineer in practice. To overcome this challenge, first we apply the secondorder Taylor expansion of the kernelized value function (Section IVA). The Bellman optimality equation is then approximated by a partial differential equation which only relies on the first and second moments of transition probabilities (Section IVB). Combining the kernel representation of value function, this approach efficiently tackles the continuous or largescale state space search with minimum prerequisite knowledge of state transition model (Sections IVC and IVD). Finally, the experiments show that our proposed approach is very powerful and flexible, and reveal great advantages over several baseline approaches (Section V).
Iva Taylored Approximate Policy Evaluation Equation
To design an efficient approach for solving MDP based decisiontheoretic planning problems, we essentially have two elements to deal with: the value function and the Bellman optimality equation. If we directly apply kernel methods to approximate the value function (referred to as the direct kernelbased method), we can avoid explicitly specifying basis functions as mentioned in Section IIIB. But it still requires fully known MDP transition probabilities, and it needs the exact Bellman optimality equations to develop the policy iteration method.
In contrast to the direct kernelbased approach, we consider an approximation to the Bellmantype equation by using only first and second moments of transition functions. This will allow us to obtain a nice property that a complete and accurate transition model is not necessary; instead, only the important statistics such as mean and variance (or covariance) will be sufficient. To better describe the basic idea, we keep our discussions on a surfacelike terrain and use that surface as the decisiontheoretic planning workspace, i.e., , though our approaches apply to the state space of any dimensions.
Formally, suppose that the value function for any given policy has continuous first and second order derivatives. We subtract both handsides by from Eq. (2) and then take Taylor expansions of value function around up to second order [8]:
(5) 
where and
are the first moment (i.e., mean, a 2dimensional vector) and the second moment (i.e., covariance, a 2by2 matrix) of transition functions, respectively, with the following form
(6a)  
(6b) 
for ; the operator and the notation in the last equation indicate an inner product. To be clear, we present the expression for the following operator
Since Eq. (5) approximates calculation of Eq. (2) in the policy evaluation stage, the solution to Eq. (5) thus provides the value function approximation under current policy . Eq. (5) also implies that we only need to use the mean and convariance instead of the original transition model to approximate the value function.
IvB Approximate Bellman Optimality Equation via PDE
We need to analyze the necessary boundary conditions to Eq. (5) which is a partial differential equation (PDE), and develop an approximation methodology to the Bellman optimality equation which is the foundation for efficient MDP solution. To achieve these, first, the directional derivative of the value function with respect to the unit vector normal at the boundary states must be zero. (Note, the value function should not have values on obstacles or outside the state space .) Second, in order to ensure a unique solution, we constrain the value function at the goal state to a fixed value.
Let us denote the boundary of entire continuous planning region/workspace by and the goal state by . Suppose the value function at is . Section IVA implies that the Bellman optimality equation Eq. (3) can be approximated by the following PDE:
(7)  
with boundary conditions
(8a)  
(8b) 
where denotes the unit vector normal to pointing outward. The condition (8a) is a type of homogeneous Neumann condition, and condition (8b) can be thought of as a Dirichlet condition in literature [11]. This elegantly approximates the classic Bellman optimality equation by a convenient PDE representation. In the next section, we will leverage the kernelized representation of the value function to avoid difficulties of directly solving PDE. The kernel method will help transform the problem to a linear system of equations with unknown values at the finite supporting states.
IvC Kernel TaylorBased Approximate Policy Evaluation
With aforementioned formulations, another critical research question is whether the value function can be represented by some special functions that are able to approximate large function families in a convenient way. We tackle this question by using a kernel method to represent the value function. Thanks to Eq. (7) which allows us to extend with kernelized policy evaluation for Taylored value function approximation.
Specifically, let be a generic kernel function [17]. For a set of selected finite supporting states , let be the Gram matrix with , and . Given a policy , assume the value functions at are . Then, for any state , the kernelized value function has the following form
(9) 
where is a regularization factor. When
, it links to the kernel ordinary least squares estimation of
in Eq. (4); when , it refers to the ridgetype regularized kernel least squares estimation [33]. Furthermore, Eq. (9) implies that as long as the values are available, the value function for any state can be immediately obtained. Now our objective is to get through Eq. (5) and boundary conditions Eq. (8).It is worth mentioning that our approach of utilizing kernel methods is to approximate the function. This usage should be distinguished from that in the machine learning literature where kernel methods are used to learn patterns from data.
Plugging the kernelized value function representation into Eq. (5), we end up with the following linear system:
(10) 
where
is an identity matrix,
is a vector with element , and is a matrix whose elements are:(11) 
Note that indicates the derivatives with respect to , i.e., . In Appendix, we provide a concrete example using Gaussian kernels which lead to closedform expressions and are widely utilized in practice.
IvD Kernel TaylorBased Approximate Policy Iteration
With the above new model, our next step is to design an implementable algorithm that can solve the continuousstate MDP efficiently. We extend the classic policy iteration mechanism which iterates between the policy evaluation step and the policy improvement step until convergence to find the optimal policy as well as its corresponding optimal value function.
Because our kernelized value function representation depends on the finite supporting states instead of the whole state space, we only need to improve the policy on . Therefore, the policy improvement step in the th iteration is to produce a new policy according to
(12) 
where , and are the current policy and the updated policy, respectively. Note that and depend on through the transition function in Eq. (6). Compared with the approximated Bellman optimality equation (Eq. (7)), Eq. (12) drops the term . This is because does not explicitly depend on action . The value function of the updated policy satisfies [6]. If the equality holds, the iteration converges.
The final kernel Taylorbased policy iteration algorithm is pseudocoded in Alg. 1. It first initializes the actions at the finite supporting states and then iterates between policy evaluation and policy improvement. Since the supporting states as well as the kernel parameters do not change, the regularized kernel matrix and its inverse are computed only once at the beginning of the algorithm. This greatly reduces the computational burden caused by matrix inversion. Furthermore, due to the finiteness of the supporting states, the entire algorithm views the policy as a table and only updates the actions at the supporting states using Eq. (12). The algorithm stops and returns the supporting state values when the actions are stabilized. We can then use these statevalues to get the final kernel value function that approximates the optimal solution. The corresponding policy for every continuous state can then be easily obtained from this kernel value function [34].
Intuitively, this proposed framework is flexible and powerful due to the following reasons: rather than tackling the difficulties in solving the PDE which approximates the original Bellmantype equation, we use the kernel representation to convert the problem to a system of linear equations with characterizing values at the finite discrete supporting states. From this viewpoint, our proposed method nicely balances the tradeoff between searching in finite states and that in infinite states. In other words, our approach leverages the kernel methods and Bellman optimal conditions under the practical assumptions.
V Experiments
To validate our method, we consider two mobile robot decisiontheoretic planning tasks. The first one is a goaloriented planning problem in a simple environment with obstacleoccupied and obstaclefree spaces. This will help us to evaluate basic algorithmic properties. In the second task, we demonstrate that our method can be applied to a more realistic as well as more challenging navigation scenario on Mars surface [25], where the robot needs to take the elevation of the terrain surface into account (i.e., “obstacles” are implicit). In both tasks, we assume that the estimates of the first two moments of the transition probability are obtained from the past field experiments. To be concrete, in both experiments we use the Gaussian kernel given in Appendix A.
The performance matrix obtained by the hyperparameter search using (a)
; (b) ; (c) ; and (d) evenlyspaced supporting states. Rows and columns represent different Gaussian kernel lengthscale and regularization parameters, respectively. The numbers in the heatmap represent the average return of the final policy obtained using the corresponding hyperparameter combination. The colorbar is shown on the right side of each table.Va Plane Navigation
VA1 Setup
Our first experiment is a 2D plane navigation problem, where the obstacles and a goal area are represented in a environment, as shown in Fig. 2. The state space for this task is a 2dimensional euclidean space, i.e., and . The action space is a finite set of points . Each point is an action generated on a circle centered at the current state with a radius . In this experiment, we set the number of actions and the action radius as . An action point can be viewed as the “carrotdangling” waypoint for the robot to follow, which serves as the input to the lowlevel motion controller. For the reward function, we set the reward of arriving at the goal and obstacle states to be and , respectively. Since the reward now depends on the next state, we use Montecarlo sampling to estimate the expectation of . The discount factor for the reward is set to . We set the obstacle areas and the goal as absorbing states, i.e., the robot can not transit to any other states if they are in these states. To satisfy the boundary condition mentioned in Section IVB, we allow the robot to receive rewards at the goal state, but it cannot receive any reward if its current state is within an obstacle. Thus, the goal state value is .
VA2 Performance measure
Since the ultimate goal of planning is to find the optimal policy, our performance measure is based on the quality of the policy. A policy is better if it achieves a higher expected cumulative reward starting from every state. Because it is impossible to evaluate over the infinite number of states, we numerically evaluate the quality of a policy using the average return criterion [18]. In detail, we first uniformly sample states to ensure a thorough performance evaluation. Then, for each sampled state, we execute the policy to generate multiple trajectories, where each trajectory ends when it arrives at a terminal state (goal or obstacle) or reaches an allowable maximal number of steps. This procedure gives us an expected performance of the policy at any state by averaging the discounted sum of rewards over all the trajectories starting from it. Now, we can calculate the average return criterion by averaging over the performance of sampled states. A higher value of the average return implies that, on average, the policy gives better performance over the entire state space.
VA3 Results
To evaluate the effect of supporting states, we place evenlyspaced supporting states (in a lattice pattern) with different spacing resolution. Besides the number of states, the kernel lengthscale (see Appendix A) and the regularization parameter are the other two hyperparameters governing the performance of our algorithm. We present the gridbased hyperparameter search results using four different configurations of supporting states shown in Fig. 3. The lengthscale and regularization parameters are searched over the same values, . By entrywise comparison among the four matrices in Fig. 3, we can observe that increasing the number of states leads to improving performance in general. However, we can find that the best performed policy is given by the supporting states configuration (Fig. LABEL:sub@fig:taylorpi100) which is not the scenario with the best spacing resolution. This indicates that a larger number of states can also result in a deteriorating solution, and the performance of the algorithm is a matter of how the supporting states are placed (distributed), instead of the number (resolution) of state discretization. Furthermore, we can gain some insights on how to select the hyperparameters based on the number of supporting states. Lowperforming entries (highlighted with red) occur more often on the left side of the performance matrix when the number of supporting states increases. It implies that with more supporting states, the algorithm requires a stronger regularization (i.e., greater described in Section IVC). On the other hand, highperforming policies (indicated by yellow) appear more on the bottom of the performance matrix when a greater number of supporting states present, which means that a smaller length scale is generally required given a larger quantity of supporting states.
We further compare our kernelized value function representation against other three variants of the value function approximations. Specifically, the first one is the direct kernelbased approximation method using a Gaussian kernel. This method is similar to the one in [20], but with a fully known transition function. The second one uses neural networks (NNs) as the value function approximator. We setup the NN configuration similar to a recent work [16]. In detail, we use a shallow twolayer network with 100 hidden units in each layer. Its parameters are optimized through minimizing the squared Bellman error via gradient descent. Since there is no closedform solution to compute the expected next state value when using NN as a function approximator, we use MonteCarlo sampling to estimate the expected value at the next states . The third method is the gridbased approximation method. It first transforms the continuous MDP to a discrete version, where each state is regarded as a grid. Then, it uses the vanilla policy iteration to solve the discretized MDP.
The comparison among above methods aims at investigating two important questions:

How does the kernelized value function representation compare to other representations (NN and gridbased) in terms of the final policy performance?

Compared to our method, the direct kernelbased method not only requires the fully known transition function, but also restricts the transition to be a Gaussian distribution. Can our method with only mean and variance obtain similar performance as the direct kernelbased method?
To answer these questions, we choose the transition function as a Gaussian distribution. Its mean is the selected next waypoint. We set the standard deviation of the transition function to be
on both axes during the experiment. The transition probability models the accuracy of the lowlevel motion controller: more accurate controller leads to smaller uncertainty. To perform fair comparisons, we use the same supporting states and apply hyperparameter search to all the methods.The results for four methods are shown in Fig. 4 in terms of the average return. The first question is answered by the fact that the kernelbased methods (kernel Taylorbased and direct kernelbased PI) consistently outperforms the other two methods. Moreover, our method has the performance as good as the direct kernelbased method which however requires the prerequisite full distribution information of the transition. This indicates that our method can be applied to broader applications that do not have full knowledge of transition functions. In contrast to the gridbased PI, the kernelbased algorithms and NN can achieve moderate performance even with a small number of supporting states. It indicates that the continuous representation of the value function is crucial when supporting states are sparse. However, increasing the number of states does not improve the performance of the NN.
In Fig. 6, we compare the computational time and the number of iterations to convergence. The computational time of our method is less than the gridbased method as revealed in Fig. LABEL:sub@fig:computationtime. We notice that there is a negligible computational time difference between our method and the direct method. As a parametric method, NN has the least computational time, and only linearly increases, but it does not converge as indicated by Fig. LABEL:sub@fig:convergenceiter.
The function values and the final policies are visualized in Fig. 2. All the methods except for the NN obtain reasonable approximations to the optimal value function. Compared to our method, the values generated by the gridbased method are discrete “color blocks”, thus the obtained policy is nonsmooth. The direct kernelbased method obtains a slightly more “aggressive” (dangerous) policy in contrast to our method.
VB Martian Terrain Navigation
In this experiment, we consider the autonomous navigation task on the surface of Mars with a rover. We obtain the Mars terrain data from High Resolution Imaging Science Experiment (HiRISE) [26]. Since there is no explicitly presented “obstacle”, the robot only receives the reward when it reaches the goal. If the rover attempts to move on a steep slope, it may be damaged and trapped within the same state with probability proportional to the slope angle. Otherwise, its next state is distributed around the desired waypoint specified by the current action. This indicates that the underlying transition function should be the mixture of these two situations. It is reasonable to assume that the means of the two cases are given by the current state and the next waypoint, respectively. We can similarly have an estimate of the variances. The mean and the variance of the transition function can then be computed using the law of total expectation and total variance, respectively.
Due to the complex and unstructured terrestrial features, evenlyspaced supporting state points may fail to best characterize the underlying value function. Also, to keep the computational time at a reasonable amount while maintaining a good performance, we leverage the importance sampling technique to sample the supporting states that concentrate around the dangerous regions where there are steep slopes. This is obtained by first drawing a large number of states uniformly covering the whole workspace. For each sampled state, we then assign a weight proportional to its slope angle. Finally, we resample supporting states based on the weights. To guarantee the goal state to have a value, we always place one supporting state at the center of the goal area.
Fig. LABEL:sub@fig:evensupportstate and Fig. LABEL:sub@fig:weightedsupportstate compare the two methods for supporting state selections. The supporting states given by the importance samplingbased method are dense around the slopes. These supporting states better characterize the potentially highcost and dangerous areas than the evenlyspaced selection scheme. We selected four starting locations from where the rover needs to plan paths to arrive at a goal location. For each starting location, we conducted multiple trials following the produced optimal policies. The trajectories generated with the importance sampling states in Fig. LABEL:sub@fig:weighted3dpolicy attempt to approach the goal (green star) with minimum distances, and at the same time, avoid highelevation terrains. In contrast, the trajectories obtained using the evenlyspacing states in Fig. LABEL:sub@fig:even3dpolicy approach the goal in a more aggressive manner which can be risky in terms of safety. It indicates that a good selection of supporting states can better capture the state value function and thus produce finer solutions. This superior performance can also be reflected in Fig. LABEL:sub@fig:barchart. The policy obtained by the uniformly sampled states shows similar performance to the one generated by the evenlyspacing states, both of which yield smaller average return than the importancesampled case. A topdown view of the policy is shown in Fig. LABEL:sub@fig:terrainpolicymapweighted2d where the background colormap denotes the elevation of terrain.
(a) The topdown view of the Mars terrain surface as well as the policy generated by our method with the importance sampling selection. The colormap indicates the height (in meters) of the terrain. (b) The comparison of average return among three supporting state selection methods using the same number of states. Red, green, and blue bars indicate the performance of importance sampling selection, evenlyspaced selection, and uniform distribution sampling selection, respectively.
Vi Conclusion
This paper presents an efficient policy iteration algorithm to solve the continuousstate Markov Decision Process by integrating the kernel value function representation and the Taylorbased approximation to Bellman optimality equation. Our algorithm alleviates the need for heavily searching in continuous state space and the need for precisely modeling the state transition functions. We have thoroughly evaluated the proposed method through simulations in both simplified and realistic planning scenarios. The experiments comparing with other baseline approaches show that our proposed framework is powerful and flexible, and the performance statistics reveal superior efficiency and accuracy of our algorithms.
a Gaussian Kernel for kernel TaylorBased Approximate Policy Evaluation
Because Gaussian kernels are widely used in the studies of kernel methods, this section presents the necessary derivations to aid the application of Gaussian kernels to our proposed kernel TaylorBased approximate methods.
Gaussian kernel functions on states and have the form , where is a constant and is a covariance matrix. Note that is referred to as the lengthscale parameter in our paper. Due to limited space, we only provide formula below for the first and second derivatives of the Gaussian kernel functions. These formula are necessary when Gaussian kernels are employed (for example, Eq.(11)). In fact, we have
(13) 
and
(14) 
where denotes the trace of the matrix.
References
 [1] (2013) Windenergy based path planning for unmanned aerial vehicles using markov decision processes. In 2013 IEEE International Conference on Robotics and Automation (ICRA), pp. 784–789. Cited by: §II.
 [2] (2008) Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning 71 (1), pp. 89–129. Cited by: §II.
 [3] (2013) Optimal path planning of a targetfollowing fixedwing uav using sequential decision processes. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2955–2962. Cited by: §II.
 [4] (2015) Adaptive control processes: a guided tour. Vol. 2045, Princeton university press. Cited by: §II.
 [5] (1996) Neurodynamic programming. Vol. 5, Athena Scientific Belmont, MA. Cited by: §II.
 [6] (1995) Dynamic programming and optimal control. Vol. 1, Athena scientific Belmont, MA. Cited by: §IIIB, §IVD.

[7]
(1999)
Decisiontheoretic planning: structural assumptions and computational leverage.
Journal of Artificial Intelligence Research
11, pp. 1–94. Cited by: §I, §IV.  [8] (2018) On the taylor expansion of value functions. arXiv preprint arXiv:1804.05011. Cited by: 1st item, §IVA.
 [9] (2017) HamiltonJacobiBellman Equation. In Optimal and Learning Control for Autonomous Robots, Cited by: 1st item.
 [10] (2003) Bayes meets bellman: the gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pp. 154–161. Cited by: §II.
 [11] (2010) Partial Differential Equations: Second Edition (Graduate Series in Mathematics). American Mathematical Society. Cited by: §IVB.
 [12] (2015) Sense and collision avoidance of unmanned aerial vehicles using markov decision process and flatness approach. In 2015 IEEE International Conference on Information and Automation, pp. 714–719. Cited by: §II.
 [13] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §II.
 [14] (1999) Approximate solutions to markov decision processes. Technical report CarnegieMellon University School of Computer Science. Cited by: §IIIB.

[15]
(2015)
Efficient highdimensional stochastic optimal motion control using tensortrain decomposition.
. In Robotics: Science and Systems, Cited by: §II.  [16] (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952. Cited by: §VA3.
 [17] (2008) Kernel methods in machine learning. The annals of statistics, pp. 1171–1220. Cited by: §II, §IVC.
 [18] (2017) Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133. Cited by: §VA2.
 [19] (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274. Cited by: §I.
 [20] (2004) Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 751–758. Cited by: §II, §VA3.
 [21] (2003) Leastsquares policy iteration. Journal of machine learning research 4 (Dec), pp. 1107–1149. Cited by: §IIIB, §IIIB.
 [22] (2006) Planning algorithms. Cambridge university press. Cited by: §II.
 [23] (2018) A solution to timevarying markov decision processes. IEEE Robotics and Automation Letters 3 (3), pp. 1631–1638. Cited by: §II.
 [24] (2018) Sparse gaussian process temporal difference learning for marine robot navigation. arXiv preprint arXiv:1810.01217. Cited by: §II.
 [25] (2003) Mars rover autonomous navigation. Autonomous Robots 14 (23), pp. 199–208. Cited by: §V.
 [26] (2007) Mars reconnaissance orbiter’s high resolution imaging science experiment (hirise). Journal of Geophysical Research: Planets 112 (E5). Cited by: §VB.
 [27] (2002) Variable resolution discretization in optimal control. Machine learning 49 (23), pp. 291–323. Cited by: §II.
 [28] (2008) Finitetime bounds for fitted value iteration. Journal of Machine Learning Research 9 (May), pp. 815–857. Cited by: §II.
 [29] (2016) Anytime pathplanning: timevarying wind field+ moving obstacles. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2575–2582. Cited by: §II.
 [30] (2013) Riskaware path planning for autonomous underwater vehicles using predictive ocean models. Journal of Field Robotics 30 (5), pp. 741–762. Cited by: §II.
 [31] (2016) Perspectives of approximate dynamic programming. Annals of Operations Research 241 (12), pp. 319–356. Cited by: §IIIB, §IIIB.
 [32] (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §IIIB, §IV.
 [33] (2004) Kernel methods for pattern analysis. Cambridge university press. Cited by: §II, §IVC.
 [34] (2004) Handbook of learning and approximate dynamic programming. Vol. 2, John Wiley & Sons. Cited by: §IVD.
 [35] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §II, §IIIB.
 [36] (2009) Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1017–1024. Cited by: §II.
 [37] (2000) Probabilistic robotics. Vol. 1, MIT press Cambridge. Cited by: §I, §I.
 [38] (201906) Reachable space characterization of markov decision processes with time variability. In Proceedings of Robotics: Science and Systems, FreiburgimBreisgau, Germany. External Links: Document Cited by: §II.
 [39] (2007) Kernelbased least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks 18 (4), pp. 973–992. Cited by: §II.