State-Continuity Approximation of Markov Decision Processes via Finite Element Analysis for Autonomous System Planning

03/03/2019 · Junhong Xu et al. · Indiana University

Motion planning under uncertainty for an autonomous system can be formulated as a Markov Decision Process. In this paper, we propose a solution to this decision-theoretic planning problem using a continuous approximation of the underlying discrete value function and leveraging finite element methods. This approach allows us to obtain an accurate and continuous form of the value function even with a small number of states from a very low-resolution state space. We achieve this by taking advantage of a second-order Taylor expansion to approximate the value function, where the value function is shown to satisfy a partial differential equation with boundary conditions that can be naturally solved using a finite element method. We have validated our approach via extensive simulations, and the evaluations reveal that our solution provides continuous value functions, leading to better paths in terms of smoothness, travel distance, and time cost, even with a smaller state space.


I Introduction

Many autonomous vehicles that operate in flow fields, e.g., aerial and marine vehicles, can be easily disturbed by environmental disturbances. For example, autonomous marine vehicles might experience ocean currents as in Fig. 1. The uncertain motion behavior in an externally disturbed environment can be modeled using the decision-theoretic planning framework whose substrate is the Markov Decision Process (MDP) [1]. However, the uncertainties in many scenarios are mixed due to environmental and system dynamics. To obtain a highly accurate solution, it is desirable to increase the number of states and to impose complex probability structures for modeling the MDP. Despite many successful achievements [2, 3, 4], existing work typically does not synthetically integrate environmental uncertainty into a computationally efficient planning model that possesses structural insights.

In this work, we treat the state space of the MDP as continuous, but tackle it from a very different perspective than the majority of continuous MDP frameworks. The essence of this work is that the state space is mapped from a discrete version to a continuous one, where the "mapping" mechanism is achieved by modeling the discrete-state value function through a partial differential equation (PDE) which is then solved by a finite element method.

In greater detail, we model the MDP as an infinite-horizon problem, approximate the value function of the MDP by its Taylor expansion [5, 6], and show that it satisfies a diffusion-type PDE. We then apply a Finite Element Method (FEM) based on (a subset of) the MDP states to construct a solution that naturally extends over the entire spatial space. Combining these procedures, we propose an approximate policy iteration algorithm to obtain the final policy and a continuous value function. Our framework in principle allows us to compute the policy on any subset of the entire planning domain (region). Because our proposed methods have deep mathematical roots in the diffusion approximation and the theory of FEM, the final results have theoretical guarantees. Finally, we validate our method in the scenario of navigating a marine vehicle in the ocean. Our simulation results show that the proposed approach produces results superior to the MDP solution on the discrete states while requiring less computational effort.

Fig. 1: Ocean currents can cause significant disturbances for autonomous marine vehicles. The swirl-patterned currents near South Africa are Agulhas Rings, which are usually strong. (Source: NASA.)

II Related Work

Path planning in the presence of state uncertainty can be viewed as a decision-theoretic planning problem which is usually modeled as an MDP [7]. Most existing methods that solve MDPs in the literature use tabular methods, i.e., converting the continuous state space into a set of discrete indices [1]. To model wind field disturbances, an MDP has been used for minimum-energy planning [8]. Similar work [9, 10, 11] addresses minimum-time and minimum-energy planning under uncertain ocean current estimates as an MDP problem.

In addition to formulating motion uncertainty via MDPs, environmental uncertainty should also be integrated into planning. Environmental model uncertainty has been incorporated into stochastic planning algorithms such as Rapidly-exploring Random Trees (RRT) [12]. Prior knowledge of the motion model can also be leveraged in a linear quadratic Gaussian controller to account for other sources of uncertainty [13].

To extend discrete-time actions to be continuous, Semi-Markov Decision Processes (SMDPs) [14, 15] have been designed based on the temporal abstraction concept so that actions can be performed in a more flexible manner. However, without relying on function approximation, the basic form of the state space is still assumed to be discrete and tabular. Although the framework of continuous stochastic control can provide continuous value functions and continuous actions, it requires Brownian-motion-driven dynamics [16]. Nevertheless, the Hamilton-Jacobi-Bellman equation for infinite-horizon stochastic control shares a similar diffusion-type equation with our approach.

Research on continuous representations of value functions has also been conducted in the literature. Fitted value iteration (FVI) is one of the popular sampling-based methods for approximating continuous value functions [17, 18]. It requires batches of samples to approximate the true value function via regression methods. However, the behaviour of the algorithm is not well understood, and without a carefully chosen function approximator, the algorithm may diverge or converge very slowly [19, 20]. In contrast to these sampling methods, we approximate the value function using a second-order Taylor expansion and solve the resulting partial differential equation via a finite element method [21].

III Problem Description and Formulation

This section describes the ocean environment model with disturbances and the vehicle motion under uncertainty, followed by decision-theoretic planning via Markov Decision Processes.

III-A Ocean Environment Model

We model the ocean environment as a 2-D surface (xy-plane). A vector field is used to represent the environmental disturbance caused by ocean currents at each location of the 2-D ocean surface. It is given by $\mathbf{v}_c(\mathbf{x}, t) = [u_c(\mathbf{x}, t), v_c(\mathbf{x}, t)]^{\top}$, where $\mathbf{x} = (x, y)$ is the location, $t$ is the time, and $u_c$ and $v_c$ are estimates of the ocean current speeds along the $x$ and $y$ axes given a location vector $\mathbf{x}$ and a time $t$. Note that $u_c$ and $v_c$ can take various forms. For example, regression models can be fitted if historical data of ocean currents is available; ocean flow forecasts can be used [22]; or we can leverage an ocean circulation model studied by oceanographers [23].

The resulting $u_c$ and $v_c$ are not accurate enough to reflect the true dynamics of the ocean disturbance, since they are associated with estimation uncertainties [24]. For effective and accurate planning, these uncertainties need to be considered by accounting for environmental noise. For example, we may assume that the noise follows a Gaussian distribution:

$\tilde{\mathbf{v}}_c(\mathbf{x}, t) = \mathbf{v}_c(\mathbf{x}, t) + \boldsymbol{\epsilon}(\mathbf{x}, t)$   (1)

where $\tilde{\mathbf{v}}_c(\mathbf{x}, t)$ denotes the velocity vector after introducing uncertainty and $\boldsymbol{\epsilon}(\mathbf{x}, t)$ is the uncertainty vector at location $\mathbf{x}$ and time $t$. To reduce the modeling and computational complexity, we adopt existing approximation methods [9, 10, 11] and assume the uncertainties along the two dimensions $x$ and $y$ are independent:

$\epsilon_x(\mathbf{x}, t) \sim \mathcal{N}(0, \sigma_x^{2}), \qquad \epsilon_y(\mathbf{x}, t) \sim \mathcal{N}(0, \sigma_y^{2})$   (2)

where $\mathcal{N}$ denotes the Gaussian distribution, and $\sigma_x^{2}$ and $\sigma_y^{2}$ are the noise variances for each component, respectively. These two components are independent with respect to both space and time.
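For concreteness, the following minimal Python sketch perturbs an estimated current vector according to the independence assumption in Eq. (2); the function and parameter names (u_c, v_c, sigma_x, sigma_y) are illustrative rather than taken from the paper.

```python
import numpy as np

def disturbed_current(u_c, v_c, sigma_x, sigma_y, rng=None):
    """Perturb an estimated ocean-current vector (u_c, v_c) with the
    independent zero-mean Gaussian noise of Eq. (2). Units are m/s.
    Symbol names are illustrative, not taken from the paper."""
    rng = rng or np.random.default_rng()
    return u_c + rng.normal(0.0, sigma_x), v_c + rng.normal(0.0, sigma_y)

# Example: noisy samples of a 0.3 m/s eastward current estimate.
samples = [disturbed_current(0.3, 0.0, sigma_x=0.05, sigma_y=0.05) for _ in range(5)]
```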

III-B Vehicle Motion Model

We formulate the motion model of an autonomous underwater vehicle (AUV) that operates in the 2-D planar environment. The net velocity of the vehicle with respect to the world frame (the actual motion of the vehicle) is given by $\mathbf{v}_{net} = [\nu\cos\theta + \tilde{u}_c,\ \nu\sin\theta + \tilde{v}_c]^{\top}$, where $\nu$ is the speed of the vehicle relative to the ocean currents and $\theta$ is the heading angle of the vehicle relative to the world frame. We assume that the AUV has a maximum speed constrained by its capacity. For example, a marine glider can move at speeds comparable to those of typical ocean currents. The AUV motion is described by

$\dot{x} = \nu\cos\theta + \tilde{u}_c, \qquad \dot{y} = \nu\sin\theta + \tilde{v}_c, \qquad \dot{\nu} = a, \qquad \dot{\theta} = \omega$   (3)

where $\dot{x}$ and $\dot{y}$ are the $x$ and $y$ components of the net velocity vector, respectively, and $a$ and $\omega$ are the control inputs for acceleration (fuel) and steering angular speed, respectively.
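As an illustration of how the kinematics of Eq. (3) can be integrated forward in time, the sketch below performs a single forward-Euler step under an ocean-current disturbance; the state layout and names are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def step_auv(state, accel, omega, current, dt):
    """One forward-Euler step of the planar AUV kinematics of Eq. (3).
    state = (x, y, nu, theta): position, speed relative to the water, and
    heading in the world frame; current = (u_c, v_c) is the (noisy) ocean
    current at the vehicle's location. Names are illustrative."""
    x, y, nu, theta = state
    u_c, v_c = current
    x += (nu * np.cos(theta) + u_c) * dt   # net x-velocity of the vehicle
    y += (nu * np.sin(theta) + v_c) * dt   # net y-velocity of the vehicle
    nu += accel * dt                       # acceleration (fuel) input
    theta += omega * dt                    # steering angular-speed input
    return (x, y, nu, theta)
```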

III-C Infinite Horizon Markov Decision Process

We consider minimum-time planning under the influence of uncertain ocean disturbances. This problem is usually formulated as an infinite-horizon Markov Decision Process (MDP).

We represent the infinite-horizon, discrete-time MDP as a 4-tuple $(\mathcal{S}, \mathcal{A}, P, R)$. The discrete spatial state space $\mathcal{S}$ and the action space $\mathcal{A}$ are finite. A state $s \in \mathcal{S}$ denotes a spatial point on the plane in our problem setting. We denote the entire autonomous vehicle planning region (or domain) by $\mathcal{D}$; thus, we have $\mathcal{S} \subset \mathcal{D}$. The action depends on the state; that is, for each state $s \in \mathcal{S}$, we have a feasible action set $\mathcal{A}(s) \subseteq \mathcal{A}$. The entire set of feasible state-action tuples is $\Gamma = \{(s, a) : s \in \mathcal{S},\ a \in \mathcal{A}(s)\}$. There is a probability transition law $P(\cdot \mid s, a)$ on $\mathcal{S}$ for all $(s, a) \in \Gamma$; thus, $P(s' \mid s, a)$ specifies the probability of transitioning to state $s'$ given the current state $s$ and the taken action $a$, constrained by the system dynamics. The final element, $R(s, a)$, is a real-valued reward function that depends on the state and action.

We consider the class of deterministic Markov policies [15], denoted by $\pi$, i.e., the mapping depends on the current state and the current time, and selects a single feasible action $\pi(s) \in \mathcal{A}(s)$ for a given state. For a given initial state $s$ and starting time $t$, the expected discounted total reward is represented as:

$V^{\pi}(s) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k}\, R\big(s_{t+k}, \pi(s_{t+k})\big) \,\Big|\, s_t = s\Big]$   (4)

where $\gamma \in (0, 1)$ is the discount factor that discounts the reward at a geometrically decaying rate. The aim is to find a policy that maximizes the total reward from the starting time at the initial state $s$, i.e.,

$\pi^{*} = \arg\max_{\pi} V^{\pi}(s)$   (5)

Accordingly, the optimal value is denoted by $V^{*}(s) = V^{\pi^{*}}(s)$.

In practice, the continuous 2-D ocean surface is discretized into a number of grid cells, where the center of a cell represents a state. The actions are defined by the vehicle moving at its maximum speed towards a desired heading direction in the fixed world frame, where each element of $\mathcal{A}$ is a pair of desired speed and heading angle. Because these actions only describe the desired heading angle and moving speed, they cannot be used directly to control the vehicle; low-level controllers that integrate the vehicle dynamics are used, with the control inputs $a$ and $\omega$ from the preceding section.

The transition probability is modeled by a bivariate Gaussian distribution

$\tilde{P}(s' \mid s, a) = \exp\!\Big(-\tfrac{1}{2}\,(s' - \mu_{s,a})^{\top}\,\Sigma^{-1}\,(s' - \mu_{s,a})\Big), \qquad \mu_{s,a} = s + \mathbf{v}_{net}(s, a)\, t_a, \qquad \Sigma = \mathrm{diag}(\sigma_x^{2}, \sigma_y^{2})$   (6)

where $\tilde{P}$ represents the unnormalized transition probability, $\mathbf{v}_{net}(s, a)$ is the net velocity at state $s$ under action $a$, and $t_a$ denotes the time of executing the action. We have assumed that within a short time window the ocean disturbances are constant and only depend on the state $s$. In addition, the noise scaling parameters $\sigma_x$ and $\sigma_y$ are the same at all locations. Transition probabilities are only assigned to the states that are "connected" with the current state (two states are connected only if a robot at one state is able to transition to the other state directly without passing through any other state). The transition probability is then normalized as $P(s' \mid s, a) = \tilde{P}(s' \mid s, a) / \sum_{s''} \tilde{P}(s'' \mid s, a)$.
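The following sketch illustrates one plausible way to realize this transition model: it computes Gaussian weights to the connected neighbor states and normalizes them, in the spirit of Eq. (6). It is not the authors' implementation, and the argument names are assumptions.

```python
import numpy as np

def transition_probabilities(s, neighbors, v_net, t_a, sigma_x, sigma_y):
    """Gaussian transition model in the spirit of Eq. (6): unnormalized
    weights to the states 'connected' to s, then normalized.
    s and the neighbors are 2-D numpy position vectors; v_net is the net
    velocity under the chosen action. Illustrative sketch only."""
    mean = s + v_net * t_a                                   # expected next position
    cov_inv = np.diag([1.0 / sigma_x**2, 1.0 / sigma_y**2])  # independent x/y noise
    w = np.array([np.exp(-0.5 * (n - mean) @ cov_inv @ (n - mean))
                  for n in neighbors])
    return w / w.sum()                                       # normalized P(s' | s, a)
```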

IV Methodology

The solution to the infinite-horizon MDP has values on the discrete spatial states (i.e., points) only. Although the discrete MDP reduces the computational effort of evaluating an infinite number of states over the entire space, it is still a large-scale instance for certain real applications. Moreover, it is desirable to have a solution on the entire continuous spatial space. In what follows, we employ methods from the literature (e.g., [5, 6]) to approximate the value function of the discrete MDP by its Taylor expansion, and show that it satisfies a diffusion-type PDE. Given that the parameters of this diffusion-type PDE have values on discrete states only, we then apply a Finite Element Method (FEM) to construct a solution that naturally extends over the entire spatial space. We also show that under certain conditions the FEM allows us to obtain a solution from fewer discrete states. Finally, we propose a policy iteration algorithm based on the diffusion-type PDE and FEM to obtain the final policy and a continuous value function. Because our proposed methods have deep mathematical roots in the diffusion approximation and the theory of FEM, the final results have theoretical guarantees.

Iv-a Diffusion-Type Approximate Value Function

Under certain conditions [15, 25], it is well known that the optimal solution satisfies the following recursive relationship

$V^{*}(s) = \max_{a \in \mathcal{A}(s)} \Big\{ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{*}(s') \Big\}$   (7)

This is the Bellman equation, which serves as a basis to solve the problem defined by Eq. (5) and Eq. (4). Popular algorithms for obtaining the solution include policy and value iteration as well as linear-programming-based methods [15].

Following the procedures in [5], we subtract $V(s)$ from both sides and then take a Taylor expansion of the value function around $s$ up to second order (assuming such an expansion exists):

$0 = \max_{a \in \mathcal{A}(s)} \Big\{ R(s, a) - (1 - \gamma)\, V(s) + \gamma \big[ \langle \mathbf{b}(s, a), \nabla V(s) \rangle + \tfrac{1}{2}\, \mathrm{tr}\big( \Sigma(s, a)\, \nabla^{2} V(s) \big) \big] \Big\}$   (8)

where $\mathbf{b}(s, a) = \sum_{s'} P(s' \mid s, a)(s' - s)$ and $\Sigma(s, a) = \sum_{s'} P(s' \mid s, a)(s' - s)(s' - s)^{\top}$ are the first and second moments of the state transition; the operator $\nabla$ denotes the gradient, and the notation $\langle \cdot, \cdot \rangle$ in the last equation indicates an inner product. We note that similar approximations can also be found in the stochastic control literature [16]. We omit the rigorous derivations here, and refer readers to the aforementioned literature.

Under the optimal policy, Eq. (8) is actually a diffusion-type partial differential equation. The above procedure indicates that the solution to Eq. (8) approximates the value function of the original problem at the discrete states in $\mathcal{S}$. In order to obtain the solution to this diffusion-type PDE, we need to impose reasonable boundary conditions. Since the path planning must be done within a region, the value function should not have values outside this region. This means that the directional derivative of the value function with respect to the unit vector normal to the boundary of the planning region must be zero. In addition, we constrain the value function at the goal state to be zero to ensure that there is a unique solution.

The above analysis inspires us to perform a policy iteration algorithm to obtain the policy [5]:

  • In the policy evaluation stage, we solve a diffusion-type PDE with the imposed boundary conditions to obtain $V^{\pi}$;

  • In the policy improvement stage, we search for a policy that maximizes the right-hand side of Eq. (8), with $V^{\pi}$ obtained from the previous policy evaluation stage.

IV-B Partial Differential Equation Representation

Now let us look at the PDE involved in the policy evaluation stage. Let us further denote the entire continuous spatial planning region (called the domain) by $\mathcal{D}$, its boundary by $\partial\mathcal{D}$, and the goal state by $s_g$. We assume that the discrete spatial state space satisfies $\mathcal{S} \subset \mathcal{D}$. We also use $\mathbf{n}$ to denote the unit vector normal to $\partial\mathcal{D}$ pointing outward. Under the policy $\pi$, we aim to solve the following diffusion-type PDE:

$\langle \mathbf{b}^{\pi}(s), \nabla V^{\pi}(s) \rangle + \tfrac{1}{2}\, \mathrm{tr}\big( \Sigma^{\pi}(s)\, \nabla^{2} V^{\pi}(s) \big) - \tfrac{1 - \gamma}{\gamma}\, V^{\pi}(s) + \tfrac{1}{\gamma}\, R^{\pi}(s) = 0, \qquad s \in \mathcal{D}$   (9)

with boundary conditions

$\partial V^{\pi}(s) / \partial \mathbf{n} = 0, \quad s \in \partial\mathcal{D}$   (10a)
$V^{\pi}(s_g) = 0$   (10b)

where $\mathbf{b}^{\pi}$, $\Sigma^{\pi}$, and $R^{\pi}(s) = R(s, \pi(s))$ indicate that $\mathbf{b}$, $\Sigma$, and $R$ are obtained under the policy $\pi$. The condition (10a) is a homogeneous Neumann condition, and (10b) can be thought of as a Dirichlet condition in the literature [26]. In practice, we assume that a solution to the above diffusion-type PDE (9) with these boundary conditions exists. The relevant conditions for a well-posed PDE can be found in [26].

We emphasize that Eq. (9) is obtained on the discrete spatial states in $\mathcal{S}$, i.e., the parameters $\mathbf{b}^{\pi}$, $\Sigma^{\pi}$, and $R^{\pi}$ take values on $\mathcal{S}$ only. This is because the Taylor expansion (8) is only performed on the discrete values in $\mathcal{S}$. We therefore need to construct a continuous solution with the desired theoretical guarantees on the entire domain $\mathcal{D}$. In the next section, we introduce a finite element method that achieves this objective in our setting.

We make a final comment in this section. The problem here relates to numerical methods for differential equations in an opposite sense: in numerical methods, we have a well-defined differential equation from the beginning, then discretize the domain to obtain the parameter values at each discrete point, and finally compute the solution. In our problem, we obtain the parameters at each discrete point first, assume a solution exists, and finally construct a continuous solution on the entire space.

IV-C A Finite Element Method

Fig. 5: Examples of triangle meshes on $\mathcal{S}$ and on a subset of $\mathcal{S}$. (a) Grid of states; (b) meshes; (c) meshes on a subset of states.

The finite element method is a general technique for constructing numerical solutions to differential equations with boundary conditions. The method consists of dividing the domain into a finite number of simple subdomains, the finite element meshes, and then using the variational form (also called the weak form) of the original differential equation to construct the solution through "finite element functions" over the meshes [27]. There is an extremely rich collection of approaches and rigorous theories about finite element methods, which we cannot elaborate in this section. Our purpose here is merely to leverage a standard method to obtain the numerical solution of the preceding PDE and to demonstrate later that the solution can be obtained with high precision even from a subset of $\mathcal{S}$.

We sketch the main ideas involved in finite element methods (Galerkin methods in particular) in a heuristic way, and refer the readers to the literature [28, 27] for a full account of the theory. The variational or weak form of the problem amounts to multiplying both sides of Eq. (9) by a function $w$ with suitable properties (called a test function); using integration by parts and the boundary conditions (10), we obtain

$\int_{\mathcal{D}} \Big[ \tfrac{1}{2}\, \nabla w^{\top} \Sigma^{\pi} \nabla V^{\pi} - \langle \mathbf{b}^{\pi}, \nabla V^{\pi} \rangle\, w + \tfrac{1 - \gamma}{\gamma}\, V^{\pi} w \Big]\, d\mathbf{x} = \tfrac{1}{\gamma} \int_{\mathcal{D}} R^{\pi}\, w\, d\mathbf{x}$   (11)

Here the test function $w$ is defined on $\mathcal{D}$ and vanishes at the goal state $s_g$. It can be shown that the solution to this variational form is also the solution to the original form.

In a standard finite element approach, the next step is to partition the domain into suitable elements. Typically, triangular meshes are applied because they are general enough to handle all kinds of domain geometry. In our problem setting, there is an additional requirement: the partition must be based on the discrete state points in $\mathcal{S}$. For example, Fig. 5(a) shows a grid of states, where a dark dot corresponds to a state point, i.e., the set of these dots forms $\mathcal{S}$ for the infinite-horizon MDP. Fig. 5(b) shows the triangle meshes used for the finite element method, where each vertex of a triangle corresponds to a state point.

Finally, we represent the test function and the solution by a finite linear combination of so-called basis functions, i.e., $V^{\pi}(\mathbf{x}) \approx \sum_{i} c_i\, \phi_i(\mathbf{x})$. This study uses Lagrange interpolation polynomials, each of which is constructed based on the vertices of a triangle element. Because the coefficients of the test function should be arbitrary, this, along with the boundary conditions (10), leads to a coupled system of linear algebraic equations of the form $K\mathbf{c} = \mathbf{f}$. Each entry of the matrix $K$ corresponds to the basis-function integrals on the left-hand side of Eq. (11). When we choose degree-one Lagrange interpolation polynomials, $K$ is a square matrix whose size equals the number of mesh vertices, and $\mathbf{f}$ corresponds to the basis-function integrals on the right-hand side of Eq. (11). Solving this linear system gives us the estimates of the coefficients $c_i$, i.e., the approximate solution $V^{\pi}$.
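To make the assembly step concrete, the sketch below builds a degree-one Lagrange system $K\mathbf{c} = \mathbf{f}$ on a triangle mesh for a simplified pure-diffusion weak form and imposes the goal condition (10b); it is a stand-in for, not a reproduction of, the full drift-diffusion operator in Eq. (11).

```python
import numpy as np

def assemble_p1_system(nodes, triangles, source):
    """Assemble the P1 (degree-one Lagrange) system K c = f on a triangle
    mesh for a simplified pure-diffusion weak form; a stand-in for the
    paper's full drift-diffusion bilinear form in Eq. (11).
    nodes: (N, 2) vertex coordinates; triangles: index triples;
    source: (N,) nodal values of the right-hand-side term."""
    n_nodes = len(nodes)
    K = np.zeros((n_nodes, n_nodes))
    f = np.zeros(n_nodes)
    grads_ref = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])  # reference gradients
    for tri in triangles:
        p = nodes[tri]                                   # 3x2 vertex coordinates
        J = np.column_stack([p[1] - p[0], p[2] - p[0]])  # affine-map Jacobian
        area = 0.5 * abs(np.linalg.det(J))
        grads = grads_ref @ np.linalg.inv(J)             # physical basis gradients (3x2)
        K[np.ix_(tri, tri)] += area * grads @ grads.T    # element stiffness integrals
        f[tri] += area / 3.0 * source[tri]               # lumped load-vector integrals
    return K, f

def solve_with_goal_condition(K, f, goal_idx):
    """Impose the Dirichlet-type condition (10b), V(goal) = 0, and solve."""
    K, f = K.copy(), f.copy()
    K[goal_idx, :] = 0.0
    K[goal_idx, goal_idx] = 1.0
    f[goal_idx] = 0.0
    return np.linalg.solve(K, f)
```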

We make two notes here. First, we may use a subset of $\mathcal{S}$ to perform the finite element method, because the finite element method approximates the solution with high precision. Fig. 5(c) provides an example of using a subset of states: the red dots are the selected states and the red triangle elements are used in the finite element method. We demonstrate this in the numerical examples in a later section. Second, different types of applications and equations may require different mesh designs. Due to limited space, we will discuss how to choose meshes in future work.

IV-D Approximate Policy Iteration Algorithm

We summarize our approximate policy iteration algorithm in Algorithm 1.

In the policy evaluation step, we may perform the finite element method on $\mathcal{S}$ or on a subset $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ if the computational cost is a concern, as discussed in the preceding section. It is worth noting that the obtained value function is continuous on the entire planning region, i.e., the domain $\mathcal{D}$. Therefore, it is possible to obtain the policy on any subset of $\mathcal{D}$, which implies that, in principle, we can obtain a continuous policy on the entire domain.


Algorithm 1 Policy Iteration with State-Continuity Approximation and Finite Element Methods
1: Input: transition function $P$, reward function $R$, discount factor $\gamma$, spatial states $\mathcal{S}$, and goal state $s_g$.
2: Output: policy $\pi$ and continuous value function $V^{\pi}$.
3: Initialize the policy $\pi_0$, and set $k = 0$.
4: repeat
5:      Step 1. Finite element based policy evaluation.
6:      Calculate $\mathbf{b}^{\pi_k}$ and $\Sigma^{\pi_k}$ on a subset $\tilde{\mathcal{S}} \subseteq \mathcal{S}$;
7:      Obtain the continuous value function $V^{\pi_k}$ by solving the variational form (11) with the finite element method described in Section IV-C;
8:      Step 2. Policy improvement.
9:      Update the policy $\pi_{k+1}$ on $\mathcal{S}$ (or any subset of the domain including $\mathcal{S}$) by maximizing the right-hand side of Eq. (8) with $V^{\pi_k}$;
10:     Set $k \leftarrow k + 1$.
11: until $\pi_k$ does not change for all $s \in \mathcal{S}$
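The sketch below mirrors the structure of Algorithm 1; the two callables stand in for the finite-element policy evaluation and the policy improvement step, and this interface is an assumption made for illustration rather than the authors' code.

```python
def approximate_policy_iteration(initial_policy, evaluate_fem, improve, max_iters=100):
    """Structural sketch of Algorithm 1.

    evaluate_fem(policy) -> V      : policy evaluation via the FEM solve of the
                                     weak form (11); V is a callable, continuous
                                     value function V(x, y).
    improve(V) -> {state: action}  : policy improvement by maximizing the
                                     right-hand side of Eq. (8) at each state."""
    policy = dict(initial_policy)
    V = None
    for _ in range(max_iters):
        V = evaluate_fem(policy)     # Step 1: finite-element-based policy evaluation
        new_policy = improve(V)      # Step 2: policy improvement
        if new_policy == policy:     # stop when the policy no longer changes
            break
        policy = new_policy
    return policy, V
```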

V Experiments

Fig. 16: Trajectory comparisons of trials under gyre disturbances. Columns, left to right: classic policy iteration, approximate policy iteration, and the goal-oriented planner. The top row shows paths under weak gyre disturbances, the middle row shows paths under stronger and more uncertain disturbances, and the bottom row shows paths with random obstacles (e.g., oil platforms or islets) in the environment.

In this section, we present experimental results for planning with an objective of minimizing time cost to a designated goal position under uncertain ocean disturbances.

We compare the proposed approach with two other baseline methods: the classic MDP policy iteration, and a goal-oriented planner which maintains the maximum vehicle speed and always keeps the vehicle heading angle towards the goal direction. The latter method has been widely used in navigating underwater vehicles due to its “effective but lightweight" properties [29]. The evaluation is based on two criteria: (i) the time cost of the trajectory, and (ii) the approximation error of the value function.

We first consider a wind-driven gyre model, which is commonly used for analyzing flow patterns in large-scale recirculation regions of the ocean [30]. Its velocity components are determined by a parameter $A$ that denotes the strength of the current and a parameter $s$ that determines the size of the gyres. The 2-D ocean surface is discretized into grid cells whose centers form the states. We set the maximum vehicle speed to a fixed value in all experiments and, to match the large spatial dimension, the action execution time (resolution) of the MDP policies is set accordingly.
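For reference, a commonly used form of the wind-driven gyre velocity field (of the kind cited in [30]) is sketched below; the paper's exact expression is not reproduced here, so this specific form should be read as an illustrative assumption.

```python
import numpy as np

def gyre_velocity(x, y, A, s):
    """A common wind-driven gyre velocity field: A sets the current strength,
    s the gyre size; x and y may be scalars or numpy arrays. This particular
    expression is an assumption for illustration, not taken from the paper."""
    u = -np.pi * A * np.sin(np.pi * x / s) * np.cos(np.pi * y / s)
    v = np.pi * A * np.cos(np.pi * x / s) * np.sin(np.pi * y / s)
    return u, v
```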

Because we are interested in minimizing the travel time, we impose a cost for each action execution. The reward function is designed as follows

(12)

V-A Trajectories and Time Costs

Fig. 19: The time costs and trajectory lengths of the three methods under different disturbance strengths. The results are averaged over 10 trials.

We present the performance evaluations in terms of trajectories and time costs. The entire state space is used for evaluating all methods; a detailed comparison using only a subset of the state space with approximate policy iteration is given in Sect. V-B. We set the gyre size parameter so that it generates four gyre areas. To simulate the uncertainty of estimations during each experimental trial, we sample Gaussian noise from Eq. (2) and add it to the computed velocity components of the gyre model.

We examine the trajectories of the different methods under weak and strong disturbances in Fig. 16. We run each method 10 times; the colored curves represent the accumulated trajectories of multiple trials. For the weak disturbance, the current strength and uncertainty parameters are set so that the resulting maximum ocean current velocity is smaller than the vehicle's maximum velocity. In this case, the vehicle is capable of moving forward against the ocean currents. In contrast, the strong ocean current yields a maximum velocity that can be larger than the vehicle's maximum velocity. Comparing the first row and the bottom two rows of Fig. 16, we can observe that under weak ocean disturbances the three methods produce trajectories that "converge" well and look similar, because the weak disturbance results in smaller vehicle motion uncertainty. The goal-oriented planner yields a better path (Fig. 16(c)) in terms of travel distance (path length) and time cost under the weak disturbance. On the other hand, the MDP policies are able to exploit the fact that moving in the same direction as the ocean current gives the vehicle a faster net speed and a higher transition probability along the current direction. This phenomenon can be observed in Fig. 16(d)-(i). Specifically, with strong disturbances, instead of taking a more direct path towards the goal position, the MDP policies take a longer path at a faster speed by following the direction of the ocean currents. In contrast, the goal-oriented planner is not able to reach the goal under the strong disturbance. Planning in the presence of obstacles is shown in the last row of Fig. 16. It is worth noting that the MDP policies produce three circular trajectories around the top-left gyre. We interpret these occasional circular behaviors as the noisy motion driving the vehicle north towards the obstacle, upon which the MDP policies follow the direction of the circular ocean current to avoid possible collisions.

We further quantitatively evaluated the averaged time costs and trajectory lengths of the three methods under different disturbance strengths with a fixed uncertainty parameter, as illustrated in Fig. 19. The proposed approximate policy iteration has overall performance comparable to the classic policy iteration, while having the advantage of outputting a continuous value function. The goal-oriented planner yields a slightly better result when no disturbance is present. However, as soon as a noticeable disturbance is present, its performance degrades dramatically. In addition, it halted before arriving at the goal because it exhausted the given time budget.

Fig. 26: Value functions calculated by approximate policy iteration (right) and the exact solution (left). The environment parameters and settings are the same as those in Fig. 16.
TABLE I: Time costs and trajectory lengths averaged over 10 trials with different ocean current strengths, comparing classic policy iteration (PI, 400 states) with approximate policy iteration using 400 states and 200 states; each method is reported by its time cost and trajectory length. The best performing statistics are highlighted in bold.

V-B Performance with Different Resolution Parameters

We also investigated the performance of the approximate policy iteration when only a subset of the state space is used to evaluate the value function. The environment parameters are the same as those used in the previous section.

Table I provides statistics comparing the classic policy iteration and two different resolution parameters, where the resolution parameter indicates the fraction of the MDP state space used in the approximate policy evaluation. The results reveal that, in most cases, approximating the value function using only half of the state space performs best.

The continuous value function computed by the finite element method has the advantage of saving computational time. With an MDP of $|\mathcal{S}|$ states, each policy iteration requires solving a system of linear equations whose time complexity is roughly cubic in the number of states. In contrast, the finite element method with a resolution parameter of one half operates on a linear system of half the size, reducing this cost by roughly a factor of eight.

V-C Value Function Approximation Error

Fig. 27: MSE between the approximated value function and the value function computed by the classic policy iteration. Even with an increasing state space, the gap between the approximate and exact solutions remains small.

We also analyze the approximation error of the value function. From Fig. 26, it can be seen that the approximate value function is almost identical to the value function computed by the classic policy iteration. In contrast to the value function computed by classic policy iteration, which is discrete in nature, the value function computed by the approximation method is continuous and can be evaluated at any designated location. We can see that the last row of Fig. 26, which corresponds to the environment with random obstacles, has the largest discrepancy between the exact and approximate value functions. The reason is that adding multiple obstacles causes the true value function to be non-smooth, as shown in Fig. 26(e). The non-smoothness is caused by the definition of the reward function in Eq. (12), and the resulting PDE solution therefore gives a less accurate approximation. To overcome this problem, the boundary values for obstacle areas should be carefully considered. We leave this task for future work.

Fig. 27 shows the mean squared error (MSE) between the approximate value function and the value function computed by the exact policy iteration. The magnitude of the exact solution is around two to three orders of magnitude larger than the error. As illustrated by the blue line, even when using a coarser resolution we still achieve a relatively small error.

V-D Evaluation with Ocean Data

Fig. 31: Trajectories from the three methods using ocean data. (a) Classic policy iteration; (b) approximate policy iteration; (c) goal-oriented planner.

In addition to the gyre model, Regional Ocean Model System (ROMS) [31] data is also used to evaluate our approach. The dataset provides ocean current data in the Southern California Bight region, which allows us to use Gaussian Process Regression (GPR) to interpolate the ocean currents at every location on the 2-D ocean surface. We use one time snapshot to predict the speed of the ocean currents. The environment is discretized into grid cells of fixed dimension, and a subset of the states is used when approximating the value function. Fig. 31 shows the trajectories of the three methods. Note that the maximum speed of the ocean currents in this dataset is similar to the vehicle's maximum linear speed. We can see that the trajectories of the approximate policy iteration are smoother (shorter distances and time costs) than those from the classic policy iteration approach.
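A minimal sketch of such a GPR interpolation for one current component, using scikit-learn and an assumed RBF-plus-noise kernel, is shown below; the kernel and hyperparameters are illustrative choices, not those used in the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_current_component(locations, observed):
    """Interpolate one ocean-current component from scattered ROMS samples
    with Gaussian Process Regression. locations: (N, 2) array of x-y
    positions; observed: (N,) current component at those positions."""
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return gpr.fit(locations, observed)

# Usage: mean and standard deviation of the interpolated current at query points.
# mean, std = fit_current_component(X, u).predict(X_query, return_std=True)
```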

VI Conclusions

In this paper, we propose a solution to the autonomous vehicle planning problem using a continuous approximation of the value function for the infinite-horizon MDP, with finite element methods as the key tool. Our method yields an accurate and continuous form of the value function even with a subset of states from a very low-resolution state space. We achieve this by leveraging a diffusion-type approximation of the value function, where the value function satisfies a partial differential equation that can be naturally solved using finite element methods. Extensive simulations and evaluations have demonstrated the advantages of our method in providing continuous value functions and better paths in terms of path smoothness, travel distance, and time costs, even with a smaller state space.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.    MIT press, 2018.
  • [2] J. Van Den Berg, S. Patil, and R. Alterovitz, “Motion planning under uncertainty using iterative local optimization in belief space,” The International Journal of Robotics Research, vol. 31, no. 11, pp. 1263–1278, 2012.
  • [3] Y. Luo, H. Bai, D. Hsu, and W. S. Lee, “Importance sampling for online planning under uncertainty,” The International Journal of Robotics Research, p. 0278364918780322, 2016.
  • [4] R. He, E. Brunskill, and N. Roy, “Puma: Planning under uncertainty with macro-actions.” in AAAI, 2010.
  • [5] A. Braverman, I. Gurvich, and J. Huang, “On the Taylor expansion of value functions,” arXiv preprint arXiv:1804.05011, 2018.
  • [6] J. Buchli, F. Farshidian, A. Winkler, T. Sandy, and M. Giftthaler, “Optimal and learning control for autonomous robots,” arXiv preprint arXiv:1708.09342, 2017.
  • [7] C. Boutilier, T. Dean, and S. Hanks, “Decision-theoretic planning: Structural assumptions and computational leverage,” Journal of Artificial Intelligence Research, vol. 11, pp. 1–94, 1999.
  • [8] W. H. Al-Sabban, L. F. Gonzalez, and R. N. Smith, “Wind-energy based path planning for unmanned aerial vehicles using Markov decision processes,” in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 784–789.
  • [9] D. Kularatne, H. Hajieghrary, and M. A. Hsieh, “Optimal path planning in time-varying flows with forecasting uncertainties,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).    IEEE, 2018, pp. 1–8.
  • [10] V. T. Huynh, M. Dunbabin, and R. N. Smith, “Predictive motion planning for AUVs subject to strong time-varying currents and forecasting uncertainties,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1144–1151.
  • [11] L. Liu and G. S. Sukhatme, “A solution to time-varying Markov decision processes,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1631–1638, 2018.
  • [12] G. Kewlani, G. Ishigami, and K. Iagnemma, “Stochastic mobility-based path planning in uncertain environments,” in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.    IEEE, 2009, pp. 1183–1189.
  • [13] J. Van Den Berg, P. Abbeel, and K. Goldberg, “LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 895–913, 2011.
  • [14] R. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, pp. 181–211, 1999.
  • [15] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming.    John Wiley & Sons, 2014.
  • [16] R. F. Stengel, Optimal control and estimation.    Courier Corporation, 1994.
  • [17] C. Szepesvári and R. Munos, “Finite time bounds for sampling based fitted value iteration,” in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 880–887.
  • [18] A. Antos, C. Szepesvári, and R. Munos, “Fitted Q-iteration in continuous action-space MDPs,” in Advances in Neural Information Processing Systems, 2008, pp. 9–16.
  • [19] D. J. Lizotte, “Convergent fitted value iteration with linear function approximation,” in Advances in Neural Information Processing Systems, 2011, pp. 2537–2545.
  • [20] L. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in Machine Learning Proceedings 1995.    Elsevier, 1995, pp. 30–37.
  • [21] I. Babuška, “Error-bounds for finite element method,” Numerische Mathematik, vol. 16, no. 4, pp. 322–333, 1971.
  • [22] A. F. Shchepetkin and J. C. McWilliams, “The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model,” Ocean Modelling, vol. 9, no. 4, pp. 347–404, 2005.
  • [23] A. F. Blumberg and G. L. Mellor, “A description of a three-dimensional coastal ocean circulation model,” Three-dimensional coastal ocean models, vol. 4, pp. 1–16, 1987.
  • [24] R. N. Smith, J. Kelly, K. Nazarzadeh, and G. S. Sukhatme, “An investigation on the accuracy of regional ocean models through field trials,” in 2013 IEEE International Conference on Robotics and Automation.    IEEE, 2013, pp. 3436–3442.
  • [25] D. P. Bertsekas, Dynamic programming and optimal control, vol. 1, no. 2.
  • [26] L. C. Evans, Partial Differential Equations: Second Edition (Graduate Series in Mathematics).    American Mathematical Society, 2010.
  • [27] J. T. Oden and J. N. Reddy, An introduction to the mathematical theory of finite elements.    Courier Corporation, 2012.
  • [28] T. J. Hughes, The finite element method: linear static and dynamic finite element analysis.    Courier Corporation, 2012.
  • [29] K.-C. Ma, L. Liu, H. K. Heidarsson, and G. S. Sukhatme, “Data-driven learning and planning for environmental sampling,” Journal of Field Robotics, vol. 35, no. 5, pp. 643–661, 2018.
  • [30] D. Kularatne, S. Bhattacharya, and M. A. Hsieh, “Time and energy optimal path planning in general flows.” in Robotics: Science and Systems, 2016.
  • [31] A. F. Shchepetkin and J. C. McWilliams, “The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model,” Ocean Modelling, vol. 9, no. 4, pp. 347–404, 2005.