Optimal Path Planning of Autonomous Marine Vehicles in Stochastic Dynamic Ocean Flows using a GPU-Accelerated Algorithm

by Rohit Chowdhury, et al.

Autonomous marine vehicles play an essential role in many ocean science and engineering applications. Planning time and energy optimal paths for these vehicles to navigate in stochastic dynamic ocean environments is essential to reduce operational costs. In some missions, they must also harvest solar, wind, or wave energy (modeled as a stochastic scalar field) and move in optimal paths that minimize net energy consumption. Markov Decision Processes (MDPs) provide a natural framework for sequential decision-making for robotic agents in such environments. However, building a realistic model and solving the modeled MDP becomes computationally expensive in large-scale real-time applications, warranting the need for parallel algorithms and efficient implementation. In the present work, we introduce an efficient end-to-end GPU-accelerated algorithm that (i) builds the MDP model (computing transition probabilities and expected one-step rewards); and (ii) solves the MDP to compute an optimal policy. We develop methodical and algorithmic solutions to overcome the limited global memory of GPUs by (i) using a dynamic reduced-order representation of the ocean flows, (ii) leveraging the sparse nature of the state transition probability matrix, (iii) introducing a neighbouring sub-grid concept and (iv) proving that it is sufficient to use only the stochastic scalar field's mean to compute the expected one-step rewards for missions involving energy harvesting from the environment; thereby saving memory and reducing the computational effort. We demonstrate the algorithm on a simulated stochastic dynamic environment and highlight that it builds the MDP model and computes the optimal policy 600-1000x faster than conventional CPU implementations, making it suitable for real-time use.




I Introduction

The advancement of ocean science and engineering relies heavily on accessing challenging ocean regions using autonomous or unmanned underwater vehicles. These agents perform complex and long-duration missions [6], including scientific data collection, defense and security, ocean conservation, and ship-hull damage inspection. A pressing need for such tasks is to operate the autonomous agents optimally such that operational costs are minimized and utility is maximized. An optimal plan (or policy) is one that minimizes a cost function, typically the expected mission time or energy consumption, while ensuring the safety of the agent. Additionally, the net energy consumption must be minimized for certain vehicles that continuously recharge their batteries by collecting solar, wind or wave energy during a mission. A key challenge in planning optimal paths of these agents is that they get advected by the strong, dynamic and uncertain ocean currents. As such, accurate modeling of the environment and an efficient decision-making framework under uncertainty for the autonomous agent are crucial for successful planning. The planner must also be implemented using fast and scalable algorithms so that it may be utilized in real time.

In the present paper, we consider an optimal high-level path planning problem for an autonomous marine agent that has to travel from a given start point to arrive at a target by moving through a stochastic and dynamic ocean environment. We are interested in three different single-objective missions, viz., minimizing the expected mission time, expected energy consumption without environmental harvesting and expected net energy consumption with environmental harvesting. The environment consists of (i) a stochastic, dynamic velocity field representing the ocean currents that advect the autonomous agent, (ii) a stochastic, dynamic scalar field representing a quantity that must be collected by the agent (e.g., solar radiation or wind energy), and (iii) dynamic obstacles that must be avoided. Here, we rely on a reliable prediction model for the environment with quantified uncertainty, which is a reasonably good assumption for underwater applications [31, 58, 46, 56]. Further, we focus on developing a computationally efficient algorithm to compute an optimal high-level policy in an offline setting (open-loop control). Our purpose is to use this algorithm in the future to develop an on-board routing scheme where the optimal policy is periodically updated through in-mission environment observations (closed-loop control).

In the above setting, the path planning problem is posed as a finite horizon, undiscounted, total cost Markov Decision Process (MDP) over a discretized spatio-temporal state space (Sec. II). Here, the uncertain velocity field directly affects the agent’s motion, the uncertain scalar field affects the reward structure that incentivizes the agent’s behavior, and the agent must avoid any obstacle in the mission. In the MDP framework, the objective is to compute an optimal policy that maximizes the expected sum of rewards. A dynamic programming algorithm may be used to provide an exact solution to the optimization problem. However, for large problems sizes, solving an MDP through dynamic programming becomes computationally expensive [8]. For a finite horizon MDP, the computational cost of the value iteration algorithm (a dynamic programming algorithm) is , where , , are the time horizon, action space size and state space size, respectively. Moreover, the computational cost of building the transition probability from ensembles of the stochastic environment forecast is , where is a constant and is the number of discrete environmental flow ensembles. This cost limits the application of such exact methods for real-time use with large scale problems. However, if fast, scalable algorithms to build MDPs and solve them are available, it opens up possibilities of their use in high-level path planning and also for online re-planning when combined with a data assimilation scheme. To overcome the limitation posed by computational time requirement, we propose an end-to-end Graphical Processing Unit (GPU)-accelerated parallel algorithm to build an MDP and compute an optimal policy in a fraction of time required for serial algorithms on a CPU. The GPU’s parallel processing capability is ideal for this task as several computations in building the MDP transition probabilities and the dynamic programming solution algorithm are embarrassingly parallel.

The novel contributions of this paper are: (1) Development and implementation of a computationally efficient end-to-end GPU-based algorithm to solve the optimal path planning problem using an MDP framework. We highlight the algorithmic challenges that arise in the GPU implementation for building the MDP model and how these challenges are overcome with new optimized Compute Unified Device Architecture (CUDA) kernels; (2) Introduction of an additional stochastic scalar field that the agent tries to collect as it moves around during the mission. This field affects the reward structure of the autonomous agent, thereby introducing an additional layer of uncertainty. In doing so, we have developed an interdisciplinary solution that combines the rigorous MDP framework for planning under uncertainty with a stochastic environment modelling technique and high-performance computing.

The organization of the present paper is as follows. First, we briefly discuss the previous work in the field. Next, we describe the theory of Markov Decision Processes (Sec. II) and stochastic environmental modeling (Sec. II-C). The new GPU-accelerated algorithm and implementation details are in Sec. III. In Sec. IV, we showcase the new algorithm by applying it to plan paths for different single-objective missions in a stochastic dynamic flow situation, clearly illustrating the computational efficiency of our algorithm. Finally, we conclude in Sec. V with directions for future research.

I-A Previous Progress in Optimal Path Planning

Most methods for optimal path planning of autonomous marine vehicles in stochastic dynamic environments are Monte Carlo versions of traditional deterministic planners (e.g., [41, 61]) and response surface methods (e.g., [26]). Unfortunately, they suffer from the curse of dimensionality and are computationally inefficient in serial implementation. [32, 31] provide a detailed review of deterministic planners such as dynamic Dijkstra [36, 19], A* [22], potential field functions [5], Delayed D* [17], Fast Marching Methods [40], nonlinear programming [2], and mixed integer linear programming [64]. [51, 50] use waypoint-selection algorithms, and [66, 67] use evolutionary algorithms such as particle swarm optimization for path planning, utilizing regular predictions from the Regional Ocean Modeling System (ROMS). Recently, we have developed a suite of level-set-based planners for deterministic environments [34, 53, 30] and stochastic environments [54, 55]. However, the stochastic versions give a distribution of paths, and risk-optimal selection is a computationally expensive process. The use of Markov Decision Processes (MDPs) for planning in stochastic environments [1, 62, 29] has been proposed, as they provide a rigorous mathematical framework for decision-making under uncertainty. However, they become prohibitively expensive for large problems and use naive environment modelling techniques. For example, [29] adds independent Gaussian noise to the flow velocity components at each state on top of a mean forecast taken from ROMS, which makes the uncertainty spatially and temporally uncorrelated. Partially Observable Markov Decision Processes (POMDPs) have also been used for a variety of simultaneous localization and planning applications of ground or aerial vehicles where the focus is local closed-loop control [14, 3]. Building on the MDP framework, reinforcement learning (RL) algorithms [27, 68, 18, 65, 63, 48, 10] have also been used for path planning, and GPU-accelerated RL libraries have been developed recently [4, 35]. However, RL algorithms provide approximate solutions to MDPs, which can be validated against exact dynamic programming solutions when the environment model is available and the exact solution is computationally feasible. Other machine learning schemes such as Gaussian process regression [13, 37] have also been proposed.

The main drawback of the MDP-based planners is the impractical computational time requirement of sequential implementations for large-scale problems. For example, [29] reports that their MDP solution takes about 13 hours for a medium-scale autonomous underwater vehicle planning problem on a modern CPU. To accelerate the computation of optimal policies for large-scale problems, ideas from High-Performance Computing (HPC) have been proposed to solve the MDPs. For general problem statements, attempts have been made to parallelize the value iteration (VI) algorithm. [25] provides a proof of concept where the value iteration algorithm was mapped to a GPU platform using CUDA. Value iteration was also implemented using OpenCL for path finding problems in [9]. [23] uses OpenMP and MPI to implement a very large-scale MDP to generate traffic control tables. [39] uses CUDA to implement value iteration on a GPU, optimized for inventory control problems. [42] fills state transition matrices and solves the MDP using a GPU in the context of crowd simulations, which lack a stochastic dynamic environment forecast. [43] implemented value iteration using CUDA, optimized for problems with sparse transition matrices. GPUs have also been used for other planning models such as semi-infinite nonlinear programming [12]. In the present paper, we develop an end-to-end GPU-based optimal path planning system that uses (i) realistic forecasts of the ocean environment, (ii) a newly developed GPU-implemented MDP model builder, and (iii) an adaptation of the GPU-implemented value iteration algorithm [43] for solving the MDP. We also provide our software package for community use [11].

II Markov Decision Process Framework

Markov Decision Processes (MDPs) provide a natural framework for sequential decision-making to compute an optimal control law (policy) in stochastic environments. An MDP is defined as a tuple (S, A, P, R, γ), where S is the set of states, A is the set of actions, P denotes the state transition probabilities, R denotes the one-step rewards, and γ is the discount factor. A policy π is defined as a map from the set of states S to a probability distribution over the set of actions A. The state transition probabilities and one-step rewards are P(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a) and R(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s'], where S_t, A_t and R_t are random variables that denote the state, the action and the reward awarded at time t, and s, s' ∈ S, a ∈ A. We define the value function V^π(s), which quantifies the utility of following a policy π from state s, as the expected sum of one-step rewards, written in a recursive form in terms of the value functions of possible successor states as

V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V^π(s') ].    (1)

An optimal policy π* is one that does at least as well as every other policy π, i.e., V^{π*}(s) ≥ V^π(s) for all s ∈ S. The optimal value function V* corresponding to an optimal policy can be expressed as a recursive relation (using the Bellman optimality conditions)

V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ],    (2)

that can be solved by dynamic programming algorithms. We use a finite horizon, undiscounted, total cost MDP for our path planning problem. Table I and Table II list all the notation used in the present paper.

Symbol Description
Set of states for MDP.
Set of actions for MDP.
State transition probabilities for MDP.
One-step rewards for MDP.
Discount factor for MDP.
A policy for the MDP.
An optimal policy for the MDP.
Value function for the policy.
Optimal value function.
n-th update of the optimal value function in the value iteration algorithm.
2D spatial coordinate representing the agent’s position.
Starting position of the agent.
Target position of the agent.
Position of the agent at a given timestep.
Temporal coordinate.
Sample of the stochastic velocity field.
Sample of the stochastic scalar field.
Stochastic dynamic velocity field.
Mean velocity field.
DO mode of the velocity field.
DO coefficient of the velocity field.
Stochastic dynamic scalar field.
Mean of the stochastic dynamic scalar field.
Mask containing dynamic obstacle data.
Speed of the agent with respect to currents.
Maximum speed of the agent with respect to currents.
Agent’s kinematic model.
Number of possible headings for the agent.
Number of possible speeds for the agent.
Number of realizations of the stochastic velocity field.
Number of realizations of the stochastic scalar field.
Number of dominant modes of the stochastic velocity field.
Dimension of the spatio-temporal grid along the x-axis.
Dimension of the spatio-temporal grid along the y-axis.
Dimension of the spatio-temporal grid along the t-axis.
Number of cells in the spatial grid at a given time.
Number of cells in a neighbouring sub-grid.
Number of cells in the spatio-temporal grid.
Random variable denoting the state of the agent at time t.
Random variable denoting the action taken by the agent at time t.
An element of the set of states.
An element of the set of actions.
Probability that the agent transitions to a state s' after performing action a in state s.
One-step reward given to the agent for reaching state s' upon performing action a in state s.
Expected one-step reward given to the agent upon performing action a in state s.
A constant positive reward given to the agent for reaching the terminal state.
A constant negative reward given to the agent for going outside the spatio-temporal domain or landing in an obstacle state.
Energy function of the agent.
Constant coefficient for scalar field terms.
Constant coefficient in the energy function.
TABLE I: Mathematical Notation and Symbols Used in the Present Paper

II-A MDP Formulation of Path Planning

In a spatio-temporal domain, let x denote the spatial coordinate (two components in 2-D and three in 3-D space) and t the temporal coordinate (Fig. 1A). Let the domain have a stochastic, dynamic flow field, each realization of which is a sample of the random velocity field with an associated probability distribution function. Let the domain also have a stochastic, dynamic scalar field (such as a solar, wind or wave energy field), each realization of which is likewise a sample of the random scalar field. An autonomous agent with a given instantaneous speed relative to the flow starts from the given start point at the initial time and must travel to the target in a manner that minimizes the expected cost.

Fig. 1: Schematic of the path planning problem: (A) In the continuous state-space: the autonomous agent must travel from the start point to the target in a domain under the influence of a stochastic dynamic flow field and a stochastic dynamic scalar field (e.g., solar radiation). The agent takes an action according to an optimal deterministic policy and is simultaneously advected by the instantaneous flow velocity; the effective instantaneous path taken by the agent is along the vector sum of its own velocity and the flow velocity. Meanwhile, it may also collect energy as it maneuvers through the scalar energy field. We seek the optimal policy that minimizes a given cost function based on the mission objective.
(B) In the discrete state-space: the discretized spatio-temporal grids corresponding to two successive time-steps are shown in the left panel. The middle and right panels show the spatial grid at the current and next time-steps. At its current state (middle panel), the agent experiences a velocity, takes an action, and ends up in a successor state at the next time-step (right panel). In doing so, the agent may also collect energy based on the radiation at the two states. The agent receives a one-step reward upon making the transition.

Similar to the setup in [10] and [29], we discretize the domain into a spatio-temporal grid world that constitutes the state space (Fig. 1B). A state represents a square region in the spatio-temporal grid, indexed by a spatio-temporal coordinate; the square is centered at the cell's midpoint with sides of length equal to the grid spacing, and the grid is temporally discretized at fixed intervals. Each discrete state has an associated random velocity. At each state, the autonomous agent can take an action (a heading and a speed) from the action space. We consider a fixed number of discrete headings between 0 and 2π, and a fixed number of discrete speeds between zero and the maximum possible relative speed of the agent; hence, the action space consists of the product of these two counts of discrete actions. Our problem is to find the optimal deterministic policy that the agent must follow to minimize the expected total cost of the mission.
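To make the discretization concrete, here is a minimal sketch of such an action space; the counts and maximum speed below are illustrative placeholders, not values from the paper.

```python
import math

# Illustrative discrete action space: n_h headings in [0, 2*pi) and n_s speeds
# in (0, F_max]; the action space then contains n_h * n_s (speed, heading) pairs.
n_h, n_s, F_max = 8, 2, 1.0   # placeholder values
actions = [(F_max * (k + 1) / n_s, 2 * math.pi * h / n_h)
           for k in range(n_s) for h in range(n_h)]
print(len(actions))   # 16 discrete actions
```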

The state transition probability P(s' | s, a) denotes the probability that the agent lands in state s', given that it performs action a in state s. This probability is induced from the uncertainty in the velocity field through the vehicle kinematics: the new position is obtained by advancing the old position by the vector sum of the realized flow velocity and the agent's own velocity over one timestep, and the temporal coordinate is advanced by one timestep. The successor state is then computed by mapping the new spatial and temporal coordinates to the appropriate discrete state in the domain. When the uncertainty in the environment is not in a simple closed form (e.g., Gaussian), we often have to use a Monte Carlo based approach [62], i.e., enumerating multiple samples of the uncertain field and counting how many times each successor state s' is visited under the influence of action a performed in a given state s.
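As a sketch of this counting procedure for a single (state, action) pair, the snippet below uses a hypothetical Gaussian current and illustrative grid parameters (not the paper's environment model) to estimate transition probabilities by successor-cell counting:

```python
import random
from collections import Counter

random.seed(0)

DX, DT = 1.0, 1.0                 # cell size and timestep (illustrative)
N_R = 10_000                      # realizations of the uncertain velocity field

def successor_cell(x, y, vx, vy, ax, ay):
    # Kinematics: new position = old position + (flow + agent velocity) * dt,
    # then snap to the containing grid cell.
    nx = x + (vx + ax) * DT
    ny = y + (vy + ay) * DT
    return (int(nx // DX), int(ny // DX))

counts = Counter()
for _ in range(N_R):
    vx = random.gauss(0.5, 0.3)   # hypothetical uncertain current, x-component
    vy = random.gauss(0.0, 0.3)   # hypothetical uncertain current, y-component
    counts[successor_cell(5.0, 5.0, vx, vy, 1.0, 0.0)] += 1

# Normalizing the visit counts yields the estimated P(s' | s, a).
P = {cell: c / N_R for cell, c in counts.items()}
print(max(P, key=P.get))          # most likely successor cell
```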

The expected one-step reward that an agent receives upon reaching a state s' after taking an action a in state s depends on the mission objective. For the three mission objectives we consider, they are defined as follows. (i) Minimize expected travel time: the agent is penalized a constant value for each time-step, leading to the one-step reward

R(s, a, s') = −Δt.    (3)

(ii) Minimize expected energy consumption: the agent’s energy consumption is modeled as a function of its speed, leading to the one-step reward

R(s, a, s') = −E(F) Δt,    (4)

where F is the agent’s speed and E(·) is the energy function. Typically, a quadratic relationship of the form E(F) = kF² is utilized, where k is a constant [53]. (iii) Minimize expected net energy consumption: the agent is equipped with a solar panel (or any other energy-harvesting mechanism) that collects energy from a spatio-temporal stochastic dynamic energy field w while moving around in the domain. Hence, the expected one-step reward is defined as

R̄(s, a) = E[ −E(F) Δt + (c Δt / 2) ( w(s; ·) + w(S_{t+1}; ·) ) ],    (5)

where c is a constant. The first term inside the expectation is the energy consumed for taking the action a. We assume that the energy harvested by the agent is directly proportional to the magnitude of the solar radiation field; the second term is a first-order integral approximation of the energy collected by the agent from w as it transitions from state s to state s'. The sum of the two terms is the agent’s one-step net-energy consumption. The expectation in Eq. (5) is taken over the joint distribution of the scalar field and the successor state. Moreover, irrespective of the mission objective, the agent is given a constant positive reward if it reaches the target terminal cell, to provide it an incentive to reach the target location; in the absence of this incentive, the agent may never reach the target location. Similarly, it is given a relatively large negative reward if it lands in a state marked as an obstacle, goes out of the defined spatial domain, or requires time beyond the finite planning horizon. Note that here we emphasize a problem setup that is 2D in space; however, our framework and implementation apply to 3D in space as well. Also, we use a kinematic framework for planning, without considering the dynamics of the flow around the vehicle [34, 1, 62, 29, 10].

II-B Dynamic Programming Algorithm

Since our planning problem is a finite horizon, total cost, undiscounted MDP, there exists an optimal deterministic policy [7] that can be computed using dynamic programming. The key idea of dynamic programming is to use the value functions to organise and structure the search for good policies. Value iteration can be viewed as turning the Bellman optimality Eq. (2) into the update rule

V_{n+1}(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + V_n(s') ].    (6)

The sequence {V_n} can be shown to converge to the optimal value function V* from any arbitrary initial value V_0, under the same conditions that guarantee the existence of V* [8]. Once the optimal value function is obtained, an optimal policy can directly be computed as

π*(s) = argmax_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + V*(s') ].    (7)

We refer the reader to [8] for the complete algorithm.
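As an illustration of backward-induction value iteration for a finite-horizon, undiscounted, total-cost problem, here is a minimal sketch on a hypothetical one-dimensional deterministic grid (the actual planner operates on the stochastic spatio-temporal grid described above):

```python
# Tiny finite-horizon value iteration: states 0..4, target at state 4, actions
# move left/stay/right, reward -1 per step and +10 on reaching the target.
# All names and sizes here are illustrative.
N_STATES, TARGET, HORIZON = 5, 4, 6
ACTIONS = [-1, 0, 1]

def step(s, a):
    return min(max(s + a, 0), N_STATES - 1)   # deterministic transition, clipped

def reward(s, a, s_next):
    return 10.0 if s_next == TARGET and s != TARGET else -1.0

V = [0.0] * N_STATES                          # terminal values V_T(s) = 0
policy = [None] * N_STATES
for _ in range(HORIZON):                      # Bellman backup: V <- max_a [R + V]
    V_new, pol = [], []
    for s in range(N_STATES):
        best = max((reward(s, a, step(s, a)) + V[step(s, a)], a) for a in ACTIONS)
        V_new.append(best[0]); pol.append(best[1])
    V, policy = V_new, pol

print(policy[:4])   # every non-target state moves right: [1, 1, 1, 1]
```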

II-C Stochastic Dynamic Environment Modeling

A crucial input for computing the state transition probabilities of the MDP path planning problem is the stochastic dynamic environmental flow field in its decomposed form: a mean field (the so-called DO mean), plus a sum over dominant modes (the DO modes) weighted by stochastic coefficients (the DO coefficients). In the present paper, we employ the computationally efficient Dynamically Orthogonal (DO) field equations [45], in which the DO mean, modes and coefficients are obtained by integrating explicit evolution equations [54, 10]. Alternatively, any modeling system that provides the flow field’s distribution in terms of the above decomposition is sufficient to compute the transition probability matrix and complete the planning. For example, a deterministic ocean modeling system (e.g., [21, 46, 58]) may be employed to forecast an ensemble, from which the dynamically orthogonal reduced-order representation can be obtained by performing a singular value decomposition [33]. Further, any other uncertainty quantification and propagation method, such as the Polynomial Chaos Expansion [20], can also be used. Notably, the DO equations provide a significant computational advantage (two to four orders of magnitude speedup) over the ensemble approach. Similarly, the stochastic dynamic scalar field representing the energy field to be collected by the vehicle must be obtained using environmental models, such as atmospheric [49], or wind or wave [60] models.
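A minimal sketch of obtaining such a mean/modes/coefficients representation from an ensemble via singular value decomposition is given below; a synthetic low-rank ensemble stands in for an actual ocean forecast, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_r = 500, 200                       # spatial cells, realizations (illustrative)

# Synthetic ensemble: a mean field plus 3 coherent modes with random coefficients.
mean_true = rng.normal(size=n_cells)
modes_true = rng.normal(size=(n_cells, 3))
coeffs_true = rng.normal(size=(3, n_r))
ensemble = mean_true[:, None] + modes_true @ coeffs_true   # shape (n_cells, n_r)

mean = ensemble.mean(axis=1)                  # DO-like mean field
U, s, Vt = np.linalg.svd(ensemble - mean[:, None], full_matrices=False)
m = 3                                         # number of dominant modes retained
modes = U[:, :m]                              # spatial modes
coeffs = s[:m, None] * Vt[:m, :]              # per-realization coefficients
recon = mean[:, None] + modes @ coeffs        # reduced-order reconstruction
print(np.allclose(recon, ensemble))           # True: the rank-3 ensemble is captured exactly
```

In practice the retained rank is chosen so the leading singular values capture a specified fraction of the variability, and the reconstruction is approximate rather than exact.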

III GPU-Accelerated Planning

There are two phases in the algorithm to compute an optimal policy of the MDP: (1) building the model, i.e., computing the expected one-step rewards and the state transition probabilities; (2) solving the MDP, i.e., obtaining an optimal policy using methods such as dynamic or linear programming. Each of these phases requires practically infeasible computational time in serial implementation. The cost of the first phase using a naive Monte-Carlo approach is O(|S| · |A| · (n_r + n_{r,w})), as we must compute the one-step transition data for each state, for each action, and for each realization of the uncertain velocity and scalar fields; here, n_r and n_{r,w} are the numbers of realizations of the stochastic velocity and scalar fields, respectively. The cost of the second phase using value iteration is O(T · |A| · |S|²). Therefore, when we have a large state, action, or realization space, the compute time may become impractical. To solve large-scale problems, without having to resort to approximate methods such as Reinforcement Learning [59], we propose a GPU-based algorithm to achieve significant computational speedups. Since our algorithm provides a dynamic programming solution, it can also be used to validate solutions obtained through reinforcement learning methods. Next, we discuss each phase separately; we identify the properties that make each suitable for parallelization, describe how they can be implemented on a GPU, and discuss how to overcome the physical device limitations of a GPU through algorithmic intervention.

III-A MDP Model Building on GPUs

To compute the transition probabilities and the expected one-step rewards for the MDP, we must calculate one-step transitions for each state, each action, and each realization, i.e., evaluate the successor state and the one-step reward for every (state, action, realization) tuple. A key observation in the process is that these computations are independent of each other across states, actions and realizations: the successor state and one-step reward for any tuple can be computed independently of any other tuple.

Hence, the key idea is that one can spawn thousands of independent threads that perform these computations for every tuple independently and in parallel. Additionally, since there is no need for any data communication between the spawned threads, the application is “embarrassingly parallel”. Such applications are ideal for implementation on a GPU. In our work, we use NVIDIA GPUs, and the algorithm is written using CUDA.

Ideally, with a sufficient number of cores, all of the computation can happen simultaneously in parallel. Practically, however, we would like our algorithm to run efficiently on modest GPUs installed on the autonomous system itself and not consume excessive energy (to operate the GPU). Our ultimate goal is to update the MDP model through in-mission measurements of the surrounding environment and to update the optimal policy regularly with a modest on-board GPU. Towards this goal, in the present paper, we focus only on the end-to-end GPU algorithm for solving the offline path-planning problem. Critically, such modest GPUs may have resource limitations such as limited memory. For instance, if we decided to store the velocity field of the entire spatio-temporal grid of size 100 × 100 × 100 (say) with 1000 (say) realizations as single-precision floating points in GPU memory, it would require 8 GB of global memory! Clearly, this is not acceptable. Hence, we propose the following three solutions to deal with such limitations.

First, we use a reduced-order velocity field that is stored in the GPU’s global memory. The realizations of the stochastic, dynamic velocity field are decomposed into a mean field along with corresponding modes and coefficients. For instance, assume that the first m (say 10) singular values of the velocity’s covariance matrix capture a specified threshold of the variability (say 99%); then, for the aforementioned spatio-temporal grid and realizations, each component of the velocity field is broken down into a mean-field matrix, m mode matrices and a coefficient matrix. This requires only 96 MB of memory compared to the 8 GB required otherwise, which is roughly a two-orders-of-magnitude reduction.
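The arithmetic behind this comparison can be reproduced as follows; the 100 × 100 × 100 grid size is an assumption chosen so that, together with 1000 realizations, two velocity components and 4-byte floats, it reproduces the quoted 8 GB and 96 MB figures.

```python
# Memory footprint comparison, assuming (illustratively) a 100 x 100 x 100
# spatio-temporal grid, 1000 realizations, 2 velocity components, 10 retained
# modes, and 4-byte single-precision floats.
nx = ny = nt = 100
n_r, n_comp, m, bytes_f32 = 1000, 2, 10, 4

full = nx * ny * nt * n_r * n_comp * bytes_f32          # every realization stored
mean = nx * ny * nt * n_comp * bytes_f32                # one mean field
modes = m * nx * ny * nt * n_comp * bytes_f32           # m spatial modes
coeffs = n_comp * n_r * m * nt * bytes_f32              # per-component coefficients
reduced = mean + modes + coeffs

print(full // 10**9, "GB vs", reduced // 10**6, "MB")   # 8 GB vs 96 MB
```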

Second, we use just the mean scalar field for computing the expected one-step rewards when the problem involves the agent manoeuvring to collect energy from the stochastic scalar field. To do so, we show that enumerated samples of the field are not required to compute the expected one-step rewards, and that the mean field alone is sufficient. The use of only the mean scalar field helps save memory and reduces the computational expense by a factor of n_{r,w}, the number of scalar-field realizations. The proof is as follows. The optimal value function can be expressed in terms of an expectation for all states s as

V*(s) = max_a E[ −E(F) Δt + (c Δt / 2) w(s; ·) + (c Δt / 2) w(S_{t+1}; ·) + V*(S_{t+1}) ],    (12)

where w denotes the stochastic scalar field, c the constant coefficient for the scalar field terms, and S_{t+1} the (random) successor state. Let us focus on the second and third energy terms on the RHS of Eq. (12). In the second term, the only random variable inside the expectation is the field value w(s; ·). Note that s is not a random variable in this equation, since we are calculating the expected one-step reward for performing a known action a in a given state s. Therefore, the expectation in the second term is over the distribution of the scalar field alone, since the field is independent of the action a; the expectation simply results in the mean value of the field at state s,

E[ (c Δt / 2) w(s; ·) ] = (c Δt / 2) w̄(s).    (13)

However, in the third term, both the successor state S_{t+1} and the field value are random variables. Hence this expectation is over their joint distribution and can be expressed as an iterated conditional expectation,

E[ w(S_{t+1}; ·) ] = E_{S_{t+1}}[ E[ w(S_{t+1}; ·) | S_{t+1} ] ].    (14)

The outer expectation is over the distribution of S_{t+1}, which is given by the state transition probabilities. The inner expectation is over the distribution of the field conditioned on the successor state, which simplifies to the unconditional field distribution since the field is independent of s and a when evaluated at a fixed successor state s',

E[ w(S_{t+1}; ·) | S_{t+1} = s' ] = E[ w(s'; ·) ],    (15)

and the inner expectation then simply results in the mean value of the field at s',

E[ w(s'; ·) ] = w̄(s').    (16)

From Eq. (13) and Eq. (16), we can see that the second and third terms on the RHS of Eq. (12) depend only on w̄(s) and w̄(s'), the means of the stochastic scalar field at s and s', respectively. This holds true for any s and a. Hence, while we may have a model characterising the uncertainty in the scalar field and it is possible to draw samples from it, it suffices to store only the mean scalar field. Finally, substituting Eq. (13) and Eq. (16) into Eq. (12), we get

V*(s) = max_a { −E(F) Δt + (c Δt / 2) w̄(s) + Σ_{s'} P(s' | s, a) [ (c Δt / 2) w̄(s') + V*(s') ] }.    (17)
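This property can also be checked numerically. The sketch below uses a hypothetical three-state transition distribution and a noisy scalar field, and confirms that a Monte Carlo estimate over field samples matches the closed form computed from the mean field alone:

```python
import random

random.seed(7)

# Hypothetical setup: 3 successor states, fixed transition probabilities, and a
# scalar field whose value at each state is its mean plus zero-mean noise.
P = [0.2, 0.5, 0.3]                  # P(s' | s, a)
w_mean = [1.0, 2.0, 3.0]             # mean scalar field at the successor states
c, dt = 0.5, 1.0                     # collection coefficient and timestep

def sample_w(i):
    return w_mean[i] + random.gauss(0.0, 0.4)   # zero-mean field uncertainty

# Monte Carlo estimate of E[ c * w(S') * dt ] over both the field and S'.
n = 200_000
acc = 0.0
for _ in range(n):
    s_next = random.choices(range(3), weights=P)[0]
    acc += c * sample_w(s_next) * dt
mc = acc / n

# Closed form using only the mean field: c * dt * sum_i P_i * w_mean_i.
exact = c * dt * sum(p * wm for p, wm in zip(P, w_mean))
print(abs(mc - exact) < 0.01)        # the two agree: only the mean matters
```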
Third, we use the concept of a secondary grid, which we refer to as the neighbouring sub-grid, whose size is much smaller than the spatial grid at a given time instance. For a given pair, the neighbouring sub-grid is used to store the possible values of for different realizations in a memory-efficient manner. The algorithm and the details of how the neighbouring sub-grid achieves memory efficiency are discussed next.
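A minimal sketch of the idea, with illustrative grid and window sizes: successor counts for a given (state, action) pair are stored in a small window centred on the expected arrival cell, rather than in an array sized by the full spatial grid.

```python
# Illustrative neighbouring sub-grid: all reachable successor cells for one
# (state, action) pair lie inside a SUB_N x SUB_N window around the expected
# arrival cell, so per-(s, a) counts fit in a tiny array.
GRID_N = 100            # full spatial grid is GRID_N x GRID_N cells
SUB_N = 5               # window size, chosen (in the real planner) from the
                        # maximum possible one-step displacement

def to_subgrid_index(cell, center):
    """Map a successor cell to a 1-D index in the window centred on `center`."""
    di = cell[0] - center[0] + SUB_N // 2
    dj = cell[1] - center[1] + SUB_N // 2
    assert 0 <= di < SUB_N and 0 <= dj < SUB_N, "successor left the sub-grid"
    return di * SUB_N + dj

center = (50, 50)                       # expected arrival cell for this (s, a)
counts = [0] * (SUB_N * SUB_N)          # instead of [0] * (GRID_N * GRID_N)
for successor in [(49, 50), (50, 51), (51, 51), (50, 50)]:
    counts[to_subgrid_index(successor, center)] += 1
print(sum(counts))    # 4 transitions recorded in a 25-entry array, not 10,000
```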

III-B CUDA Algorithm for Building the MDP Model

Symbol Description
An array that keeps a count of the number of times a successor state is reached upon taking an action, for each initial state, across realizations.
An array that stores the sum of one-step rewards across realizations for each initial state, for a given action at a given time step.
A COO-format state transition matrix for transitions from the states at time t to the states at the successive time t + 1, for the given action.
Reward vector that stores the expected one-step rewards for performing the action in each state at time t.
Number of non-zero elements in the transition matrix.
An array that stores the number of possible successor states for each state at time t.
A COO-format state transition matrix for all states for the given action.
Reward vector that stores the expected one-step rewards for all states for performing the given action.
A COO-format state transition matrix made by appending the per-action matrices for all actions.
Reward vector containing expected one-step rewards made by appending the per-action vectors for all actions.
Thread index assigned to each thread in a kernel.
A one-dimensional index to denote a discrete cell in a spatial grid at any given time.
A one-dimensional index to denote a discrete cell in the spatio-temporal grid world.
A one-dimensional index to denote the realization of the stochastic velocity field.
Mean scalar field at a state.
One-step reward.
An index to access the count array.
An index to access the reward-sum array.
TABLE II: Mathematical Notation and Symbols Used in Algorithms in the Present Paper

Algorithm 1 is the newly developed CUDA algorithm for building the MDP transition probability matrix and the expected one-step reward.

1:  MEMCPY to GPU;
2:  Compute dimension for neighbouring sub-grid;
3:  Allocate memory for of size ;
4:  Allocate memory for of size ;
5:  for (do
6:     for   do
7:        , ;
8:        ;
9:        ;
10:        ;
11:        ;
12:        ;
13:        , ;
14:     end for
15:  end for
16:  for   do
17:     ;
18:  end for
Algorithm 1 Building the MDP Model

Algorithm 1 takes the following inputs:

  1. Environment data: Environment data comprises all environment information such as the reduced-order stochastic velocity field in the form of a mean velocity field, , the dominant modes, , and corresponding coefficients, ; the mean of the stochastic scalar field, ; and the dynamic obstacle information in the form of a mask, .

  2. : A parameter that defines the problem type and objective function. The objective could be minimizing expected travel time, expected energy consumption or expected net energy consumption.

  3. : An array that stores other problem parameters such as , , , which are used to define the action space, and and , which are used to define the reward structure.

  4. : An array that stores grid parameters such as length scales, time scales, and sizes of the spatio-temporal grid, , and .

The algorithm outputs and , where is the appended transition matrix and is the appended expected-rewards vector.

The algorithm works as follows. First, the input data, which consists of environmental information, the grid, and problem parameters, is transferred from the CPU to the GPU using a MEMCPY operation so that it can be readily accessed by threads during kernel launches. Next, we define the size of the neighbouring sub-grid. Conceptually, the neighbouring sub-grid is a smaller, rectangular 2D grid defined for each state at a given time-step in the domain. The key idea behind the grid is that an agent starting at a state cannot spatially transition to any state that lies outside the neighbouring sub-grid of state , since it can only transition to physically nearby states. Fig. 2(C) shows neighbouring sub-grids for two different initial states. The size of the sub-grid is kept constant across all states and depends on the component-wise maximum velocities across all states, with an added buffer. The cells inside the neighbouring sub-grid map to elements in the array as shown in Fig. 2(C) and Fig. 2(D). The details of this mapping and how the sub-grid aids efficient computation are discussed in Sec. III-C1.
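To make the sub-grid idea concrete, here is a minimal sketch of how its size could follow from the maximum reachable distance in one time step, and how a successor cell maps to a flat sub-grid index. All function and parameter names are our own illustrative choices, not the paper's code:

```python
import math

# The sub-grid must contain every cell reachable in one time step, so its
# half-width comes from the maximum net speed times dt, plus a safety buffer.
def subgrid_halfwidth(max_flow_speed, vehicle_speed, dt, cell_size, buffer_cells=1):
    reach = (max_flow_speed + vehicle_speed) * dt        # farthest one-step travel
    return int(math.ceil(reach / cell_size)) + buffer_cells

# Map a successor cell (x2, y2) to a flat index inside the sub-grid
# centred on the initial cell (x1, y1); width = 2*half + 1 cells per side.
def subgrid_index(x1, y1, x2, y2, half):
    dx, dy = x2 - x1 + half, y2 - y1 + half              # shift into [0, 2*half]
    width = 2 * half + 1
    assert 0 <= dx < width and 0 <= dy < width, "successor outside sub-grid"
    return dy * width + dx

# Toy numbers: flow up to 2 units/step, vehicle 0.8 units/step, unit cells.
half = subgrid_halfwidth(max_flow_speed=2.0, vehicle_speed=0.8, dt=1.0, cell_size=1.0)
# half = ceil(2.8) + 1 = 4, i.e., a 9x9 sub-grid
```

Because the sub-grid width is fixed across all states, a (state, sub-grid index) pair addresses a compact flat array rather than a full spatial-grid-sized row per state.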

After that, we allocate memory for the and arrays in the GPU. The size of array is , where is the number of cells (or states) at a given time step. The array keeps a count of the number of times a given successor state is reached upon taking a given action for each initial state , across realizations. On the other hand, the size of the array is . It stores the sum of one-step rewards across realizations for each initial state , for a given action at a given time step. These arrays are filled in by successive kernel calls described in Sec. III-C.

Next in the algorithm, we have a series of CUDA kernels responsible for computing the MDP model. These kernels are nested within two for-loops: the outer loop ranges over time-steps, and the inner loop ranges over all actions in the action space. Hypothetically, if the GPU had unlimited memory and compute cores, all the computation could be done in parallel without iterating over time-steps and actions. However, modest GPUs have limited memory and compute cores, making it difficult, if not impossible, to store all the data and perform all computations in parallel. Since all computations cannot happen simultaneously anyway, splitting the computation through loops does not compromise the maximum achievable parallelism. Effectively, we use the GPU to its full potential while splitting computations across time-steps and actions. As such, each kernel performs computations over all states at a given time-step for a given action. In what follows, we discuss each kernel's working and implementation.

Fig. 2: Methodology for building the MDP model using a GPU: (A) A 3D grid showing various realizations of the stochastic velocity field at a given time-step. (B) A 3D grid of CUDA threads, mimicking the structure of the 3D grid in (A). (C) Example showing participation of thread blocks during the execution of the kernel for two different spatial coordinates at a given time-step for a given action, i.e., for two different pairs. The figure also shows two neighbouring sub-grids highlighted in red and orange. (D) The information in neighbouring sub-grids is written to array. The kernel computes the number of non-zero elements, followed by the kernel, which converts the information to a sparse COO-format transition probability matrix. On the other hand, the kernel computes the expected reward vector from the array.

Iii-C CUDA Kernels

Iii-C1 K_transition

1:  ;
2:  ;
3:  if (then
4:     ;
5:     );
6:     if ( is not terminal) then
7:        if ( is not obstacle) then
8:           );
9:           );
10:           );
11:           if (obstacle between and then
12:              ;
13:           end if
14:        end if
15:     else
16:        ;
17:     end if
18:     ;
19:     );
20:     );
21:     );
22:  end if
Algorithm 2 K_TRANSITION

This is the most important kernel in the algorithm and consumes the majority of the total compute time. The kernel's purpose is to compute one-step transitions for each state at a given time-step and action, across all realizations of the velocity field. The computed data is then written to the appropriate indices of the and arrays. Algorithm 2 shows the pseudo-code of the kernel, which is executed in parallel by all the threads. A visual representation of the stochastic velocity field at a given time-step is shown in Fig. 2(A) in the form of a 3D grid. Each plane in the grid represents the spatial domain marked with grey arrows, highlighting the velocity field corresponding to a realization at a given time-step. The field's realizations are stacked along the -direction. The kernel is launched using a 3D grid of thread blocks (Fig. 2(B)) that mimics the structure of the 3D grid of velocity field realizations in Fig. 2(A). The and directions of the kernel's grid correspond to the and directions of the spatial domain, and the -direction corresponds to the realizations of the velocity field. All thread blocks (cuboidal units of the kernel's grid) that share the same - coordinate are stacked along the -direction and map to the corresponding - coordinate on the spatial domain. For example, one such stack outlined in red corresponds to the spatial cell in the left corner. Also, each thread block consists of a single queue of threads oriented along the -direction, such that each thread in the block maps to a realization of the velocity field. Fig. 2(C) shows two stacks of blocks that correspond to a given state, with each block consisting of multiple threads along the -direction, corresponding to the velocity field realizations (Fig. 2(A)).
As a result of this mimicking structure, each thread uniquely maps to a particular state and realization at the given time-step, and maintains one-dimensional indices corresponding to (i) the discrete cell in the spatio-temporal grid world denoted by , (ii) the discrete cell in the spatial grid at time , denoted by , and (iii) realization of the velocity field, denoted by (Line 2 of Algorithm 2). Therefore, each thread can independently compute the velocity field’s and scalar field’s numerical values for its corresponding state and realization from the mean, modes, coefficients and mean scalar field, already present in the GPU’s global memory (Lines 2 and 2). Similarly, each thread can also independently access the obstacles data from the mask matrix . Once a thread has computed the environment information for its state and realization, it independently computes , i.e., the successor state’s spatial coordinate and temporal coordinate , through the function in Line 2. Next, since each thread knows its values, it reads the value of the mean field at and then computes its corresponding one-step reward using the function (Lines 2, 2), wherein the reward structure is defined according to Eq. 3-5. Computing the one-step reward also involves checking for the presence of any obstacles between and and whether is a terminal state by evaluating the if-conditions in lines 6, 7, and 11.

Based on our definition of the neighbouring sub-grid (Sec. III-B), the successor state must be a part of the sub-grid itself. Fig. 2(C) shows an example of neighbouring sub-grids for two different initial states, highlighted in red and orange colours. Each cell in the neighbouring sub-grid maps to a contiguous set of elements in the array, as shown in Fig. 2(C) and Fig. 2(D). The array is of size since it maps from sized neighbouring sub-grids for each of the possible initial states. Each thread computes the sub-grid index corresponding to and the corresponding index (Line 2) to access the array. Thereafter, it performs an operation (Line 2), incrementing the count at the index of the array. Fig. 2(C) provides a visual representation of the above process through a representative stack of three thread blocks (six threads in total) highlighted in orange. It shows how threads with the same sub-grid index participate in an operation named to access and write to the appropriate element in the array. In this way, the array encodes the number of times an agent starting at an initial state and performing a given action transitions to a successor state for all possible initial and successor states at times and , respectively.
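A serial Python stand-in for what the atomicAdd-based counting accumulates may help clarify the result (the names and the dictionary layout are our illustrative assumptions; the actual kernel uses a flat array indexed by sub-grid position and increments it atomically from many threads at once):

```python
from collections import defaultdict

# Every (state, realization) "thread" computes its successor under the given
# action and increments the count for that (state, successor) pair.
def count_transitions(successors_per_realization):
    counts = defaultdict(int)           # plays the role of the GPU counts array
    for successor_of in successors_per_realization:
        for s, s_next in successor_of.items():
            counts[(s, s_next)] += 1    # atomicAdd(&counts[idx], 1) on the GPU
    return counts

# Toy example: 3 realizations; initial state 7 reaches state 12 twice
# and state 13 once under the chosen action.
realizations = [{7: 12}, {7: 12}, {7: 13}]
counts = count_transitions(realizations)
# counts[(7, 12)] == 2 and counts[(7, 13)] == 1
```

Dividing these counts by the number of realizations later yields the empirical transition probabilities.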

Crucially, we note that the neighbouring sub-grid concept limits the size of the array by exploiting the fact that transitions can practically happen only to nearby cells. The naive alternative would be to allow parallel access to the array by making it of size . The advantage of the latter is that there is no need to compute an additional index for the sub-grid; however, most elements of the alternative array would never be accessed and would hold just the initial value (zero), a wasteful use of memory. Further, the neighbouring sub-grid concept also saves unnecessary computations, as will be discussed in Sec. III-C4.

Lastly, the array must be filled with the sum of one-step rewards across realization. Therefore, all threads associated with a given spatial cell (but different realizations) participate in another operation (Line 2), adding their one-step rewards to the corresponding index of the array. A visual representation of this computation is shown in Fig. 2(C), where threads along the stack of blocks access and write to appropriate elements of the array through the operation named .

Iii-C2 K_compute_mean

As per Eq. (12), we must compute the expected one-step rewards. In our approach, we estimate this expectation by taking a sample mean of over all realizations. The array already contains the sum of all over all realizations. Hence, the expectation can simply be computed by dividing each element in by . The kernel performs this division through a 1D grid of thread-blocks, where each thread-block is also one-dimensional. Each thread is assigned uniquely to an element in and simply divides its corresponding element by , thus producing , which constitutes the expected one-step rewards for performing the action in each state at time . Fig. 2(C) and Fig. 2(D) show the kernel using the array and modifying it into the resulting array.

Iii-C3 K_count

We must now compute the transition probability matrix for transitions from state at time to the states at the successive time , for a given action . As is sparse, we store it in the sparse matrix COO (coordinate) format that needs only memory (in contrast to for a full matrix), where is the number of non-zero elements in the sparse matrix. However, we must first compute to allocate the appropriate amount of memory for . We must also separately store the number of non-zero elements in each neighbouring sub-grid of each state at time in an array called the array, which is used for accessing the appropriate index of in the next kernel (Sec. III-C4). The kernel works as follows. It is launched using a 1D grid of thread-blocks, where each thread-block contains threads. Hence, the total number of threads is , one for each element of the array. For example, threads from a given thread-block would be responsible for accessing their respective red highlighted elements in the array, as shown in Fig. 2(D). Each thread checks if its corresponding element of the array holds a non-zero value. If so, it participates in an atomicAdd() operation, incrementing the count maintained at the appropriate index of the array. Then is directly computed using the Thrust [24] library's parallel implementation of the function.
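The kernel's result can be sketched serially as follows (a minimal illustration with assumed names; on the GPU the per-state tallies are built with atomicAdd and the grand total with a Thrust parallel reduction):

```python
# One "thread" per element of the counts array checks for a non-zero entry
# and increments the per-state tally; the grand total sizes the COO matrix.
def count_nonzeros(counts, n_states, subgrid_size):
    nnz_per_state = [0] * n_states
    for s in range(n_states):
        for k in range(subgrid_size):
            if counts[s * subgrid_size + k] != 0:
                nnz_per_state[s] += 1          # atomicAdd on the GPU
    return nnz_per_state, sum(nnz_per_state)   # the sum mimics a parallel reduce

# Toy example: 2 states, sub-grids of 4 slots each.
counts = [0, 3, 0, 1,   # state 0: two non-zero successor slots
          2, 0, 0, 0]   # state 1: one non-zero successor slot
per_state, total = count_nonzeros(counts, n_states=2, subgrid_size=4)
# per_state == [2, 1], total == 3
```

The per-state tallies later serve as write offsets into the COO arrays, while the total fixes the amount of memory to allocate.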

Iii-C4 K_reduce_to_coo

The and arrays together encode all the information required to populate in the COO format, which is a matrix of size . The first row of stores all the row indices ( values), the second row stores the column indices ( values), and the third row stores the non-zero element values (the transition probabilities ). Fig. 2(D) shows the transition probability matrix in the COO format, with three -sized rows for , and , respectively.

The kernel's purpose is to reduce the encoded state transition information in the array to the transition matrix . The kernel is launched with threads organised into a 1D grid of thread-blocks, where each thread-block is also one-dimensional. Hence, each thread is responsible for the computation involving its own initial state . Each thread first computes its one-dimensional state index and another index to access the appropriate element of . Then it writes the computed value to successive elements starting from the index . Here, is the number of possible successor states for the state-action pair , obtained by accessing the array. Then, to fill the second and third rows of , each thread traverses the neighbouring sub-grid of its corresponding initial state in the array to check for non-zero values. When a thread encounters a non-zero value, it decodes the one-dimensional index of the successor state and writes the value to the appropriate index of 's second row. It also maintains a local count of the number of times each occurs. Finally, the transition probability is computed by dividing the count of each by the total number of realizations and written at the same index of 's third row.
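A serial sketch of this reduction may clarify the output layout (names, the flat counts layout, and the successor-decoding function are our illustrative assumptions): each state scans its sub-grid slots and emits (row index, column index, probability) triplets, dividing each hit count by the number of realizations.

```python
def reduce_to_coo(counts, n_states, subgrid_size, decode_successor, n_realizations):
    rows, cols, probs = [], [], []
    for s in range(n_states):                    # one GPU thread per state
        for k in range(subgrid_size):            # traverse the sub-grid slots
            c = counts[s * subgrid_size + k]
            if c != 0:
                rows.append(s)
                cols.append(decode_successor(s, k))   # sub-grid slot -> state id
                probs.append(c / n_realizations)      # empirical probability
    return rows, cols, probs

# Toy example: 2 states, sub-grids of 3 slots, 4 realizations; the decoder
# lambda below is a stand-in for the real sub-grid-to-grid index mapping.
counts = [3, 1, 0,
          0, 0, 4]
rows, cols, probs = reduce_to_coo(counts, 2, 3, lambda s, k: 10 * s + k, 4)
# rows == [0, 0, 1], cols == [0, 1, 12], probs == [0.75, 0.25, 1.0]
```

Each state's probabilities sum to one by construction, since every realization contributes exactly one successor.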

We briefly mentioned in Sec. III-C1 that the neighbouring sub-grid concept saves unnecessary computation in addition to saving memory. The reason is that in our approach, a thread has to traverse only elements, whereas in the naive alternative with no neighbouring sub-grid, a thread would have to traverse elements, most of which would be zero. Since the size of the neighbouring sub-grid is much smaller than that of the spatial grid (), the cost of searching is significantly cheaper in our approach than in the naive alternative.

In Algorithm 1, the above sequence of kernels is executed within two for-loops. After the last kernel is executed, we have both and stored in the GPU’s global memory. At the end of the inner action loop, and are moved to the CPU memory through a MEMCPY operation to maintain sufficient memory in the GPU to store similar data structures that will be populated in the next iteration. In addition, s and s of all actions are appended across time-steps in the CPU asynchronously with respect to the GPU. Hence, at the end of the two-level nested for-loops, we have the complete transition matrix and expected one-step rewards for all the states in the spatio-temporal grid for each action . Finally, we append all and across actions along rows and obtain and . This final assembly is also done in the CPU and is essentially a pre-processing step for providing a suitable input data structure to the Sparse Value Iteration solver.

Iii-D Sparse Value Iteration on GPUs

The optimal value function and policy for the above MDP model (, ) are computed by a dynamic programming algorithm called Sparse Value Iteration [43] that exploits the properties of sparse transition matrices. We modify [43]'s GPU implementation (code available at [44]) to integrate it seamlessly with our algorithm for phase-1. The modifications include the use of different data structures (such as the sparse COO format for storing matrices), changes to the functions responsible for reading input data, and several renamed variables.
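For readers unfamiliar with the solver, here is a minimal serial sketch of value iteration over a COO transition model (an illustration only: the names are our own, a discount factor is included for convergence of the toy example, and the GPU version of [43] performs the sparse products and the max over actions in parallel):

```python
def sparse_value_iteration(coo_per_action, rewards_per_action, n_states,
                           gamma=1.0, tol=1e-9, max_iter=1000):
    V = [0.0] * n_states
    for _ in range(max_iter):
        Q = {}  # (state, action) -> expected return
        for a, (rows, cols, probs) in coo_per_action.items():
            acc = [0.0] * n_states
            for s, s2, p in zip(rows, cols, probs):
                acc[s] += p * V[s2]                       # sparse P.V product
            for s in range(n_states):
                Q[(s, a)] = rewards_per_action[a][s] + gamma * acc[s]
        V_new = [max(Q[(s, a)] for a in coo_per_action) for s in range(n_states)]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

# Two-state toy MDP: action 0 stays put (reward 0); action 1 jumps to an
# absorbing state 1 (reward 1 once, then 0 forever).
coo = {0: ([0, 1], [0, 1], [1.0, 1.0]), 1: ([0, 1], [1, 1], [1.0, 1.0])}
R = {0: [0.0, 0.0], 1: [1.0, 0.0]}
V = sparse_value_iteration(coo, R, n_states=2, gamma=0.9)
# V[0] converges to 1.0 (take action 1 once); V[1] stays 0.0
```

The key point exploited by the sparse solver is that the expensive step, the P·V product, touches only the non-zero COO entries rather than the full dense matrix.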

Overall, our novel model-building algorithm and the sparse value iteration algorithm together form an end-to-end GPU software package that takes the stochastic environment's data as input and provides an optimal policy for the MDP path planning problem as output, orders of magnitude faster than reported conventional CPU methods. In the following section, we demonstrate applications using our end-to-end software.

Iv Applications

We demonstrate three different single-objective missions (Sec. II) in a stochastic double-gyre ocean flow simulation. This flow field is obtained by solving the dynamically orthogonal quasi-geostrophic equations [45] and has been extensively used for method development and testing in several path planning [55] and uncertainty quantification [16] applications. The equations and a description of the physical process associated with the flow field are in [54]. Briefly, the flow field corresponds to that in a square basin of size 1000 km × 1000 km, where a wind from west to east sets up a double-gyre flow and the uncertainty arises from its initial conditions. The discrete spatial domain is a square grid of size , with non-dimensional units. We also use 200 discrete time intervals with non-dimensional units, thereby making our spatio-temporal grid of size . The non-dimensionalization is done such that one non-dimensional unit of space is 5 km and one unit of time is 0.072 days. The environment is common across all three applications. Fig. 3 shows the mean (row A), the first four DO modes (row B), and the joint PDF of the first four DO coefficients (row C) at four discrete times (in four columns), highlighting the dynamic and non-Gaussian nature of the stochastic flow field. Fig. 4 shows the complete environment for a particular realization of the background velocity field (Fig. 4A) and the scalar energy field (Fig. 4B). The stochastic scalar field is assumed to be that of solar radiation incident upon the ocean surface, mimicking a cloud cover moving westwards. This field plays a role only for the net-energy consumption minimization goal. The dynamic obstacles are assumed to be restricted regions in the domain that change deterministically with time. For the present applications, the dynamic obstacle is square-shaped with a side of 20 units in the grid. The obstacles enter the domain from the west and move eastward with time.
It is also assumed that these restricted regions do not interact with or affect the flow in any way. In case the movement of obstacles is uncertain and/or the scales of the flow are such that the obstacles affect the flow, these assumptions can be relaxed and environmental flow fields solved with the obstacles as done in [54]. We consider that the agent can take one of 32 actions ( and ). The number of velocity field realizations used here is . The size of the neighbouring sub-grid is set . In dimensional units, the maximum magnitude of the flow field is around 2 m/s and the speed of the agent is 80 cm/s. The planned paths are valid for any AUV that flies at the specified speed. Appropriate operational parameters can be chosen to fix this AUV speed [47, 15, 31]. As such, the flow is relatively strong, and a typical autonomous marine agent modeled here gets strongly advected by the currents, making it difficult for the agent to move against the flow in most parts of the domain. For the test cases demonstrated here, the agent starts at the position (represented by a circular marker), and the target is the position (represented by a star-shaped marker).

Fig. 3: Double-gyre stochastic dynamic velocity field. (A) The mean velocity field shown as streamlines overlaid on a background colored with the magnitude. (B) Four dominant modes shown as streamlines overlaid on a background colored with the magnitude. (C) Pair plots showing the pairwise joint PDF of the DO coefficients corresponding to the first four modes. All fields and coefficients are shown at four non-dimensional time units. Individual realizations of the velocity field can be reconstructed from this prediction using the DO relation , where is a sample from the joint PDF of the coefficients.
Fig. 4: Evolution of the environment with time. (A) Four time snapshots of a particular realization of the stochastic double gyre flow shown as streamlines overlaid on a colorplot of the magnitude. (B) Four time snapshots of the scalar radiation field shown as a colorplot overlaid with the streamlines of the velocity realization. The dynamic obstacle is shown in grey filled patches.

Iv-a Time-Optimal Planning with Dynamic Obstacles

First, we consider a mission with the objective of minimizing the agent's expected travel time from the start to the target in the environment described above. We use the one-step rewards defined in Eq. 3 for time-optimal planning. Our novel end-to-end GPU-implemented algorithm is used to compute the time-optimal policy. This policy is then applied to the ensemble of flow fields, and the computed trajectories are shown in Fig. 5. The four discrete time instances shown illustrate how the different trajectories (each corresponding to a different background flow field) evolve with time. To minimize the travel time, the agent takes the high-speed action throughout the mission while intelligently utilizing the double-gyre flow. The distribution of trajectories splits into two bands, successfully avoiding the dynamic obstacles, as seen at t=48 and 60 (Fig. 5). The computed expected travel time is 61.25 non-dimensional units (4.41 days). The total computational time required for building the MDP model and solving it via the Value Iteration algorithm is of the order of a few minutes. In contrast, the run time of the MDP-based CPU implementation reported in [29], for a smaller spatio-temporal grid without a realistic uncertain environmental flow forecast, was around 13 hours. Further, the overall paths are similar to the time-optimal trajectories computed from the level set equations in [54]. However, the key difference is that the MDP approach yields an optimal policy, and not simply the distribution of optimal waypoint or heading sequences obtained by solving the stochastic level set equations.
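The reported dimensional travel time follows directly from the non-dimensionalization stated earlier (one non-dimensional time unit is 0.072 days), as a quick check confirms:

```python
# Convert the expected travel time from non-dimensional units to days,
# using the time scale given in the problem setup (0.072 days per unit).
expected_time_units = 61.25
days_per_unit = 0.072
days = expected_time_units * days_per_unit
assert abs(days - 4.41) < 1e-6   # matches the reported 4.41 days
```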

Fig. 5: Paths by following the Time-Optimal Policy: Distribution of paths obtained by following a time-optimal policy in different realizations of the stochastic environment. The background is the mean field's magnitude with streamlines overlaid to illustrate the nature of the flow field encountered. The trajectories are colored by arrival time.

Iv-B Minimal Energy Expenditure with Time Constraints

Second, we consider a mission with the objective of minimizing the agent's expected energy consumption for travelling from the start to the target in the environment described above. An additional constraint is that it must complete its mission within 200 time-steps, which is imposed by limiting the state space to 200 temporal units. Further, the energy consumption model for the agent is assumed to depend only on its speed (Sec. IV-B), and the one-step rewards defined in Eq. 4 are used here. Since we are only optimizing for energy consumption and not energy collection, the scalar field's presence does not affect the optimal policy here. The evolution of the trajectories (at four discrete instances) obtained by following the optimal policy computed by our end-to-end GPU-implemented algorithm is shown in Fig. 6. In contrast with the behaviour observed in Sec. IV-A, the agent here typically takes the slow-speed action throughout the mission to reduce its energy consumption. Due to its low-speed actions, a relatively high background flow, and the presence of two dynamic obstacles, the distribution of trajectories splits into three bands. Two of these bands go around the dynamic obstacle, avoiding it from either side, similar to what we observed in Fig. 5. However, the third band (coloured yellow) has to swerve around, making a loop before moving towards the target, since otherwise it would possibly bump into the dynamic obstacle. The agent's expected energy consumption is 108.7 non-dimensional units. Our framework is flexible enough to allow different speeds for the agent based on its capabilities. To do so, we simply have to change the number of discrete speeds considered.

Fig. 6: Paths by following the Energy-Optimal Policy: Same as Fig. 5 but for an agent following the energy-optimal policy. Trajectories are colored with the energy consumption.

Iv-C Net-Energy-Optimal Planning with Dynamic Intake

Third, we consider a mission with the objective of minimizing the agent's expected net energy consumption for travelling from the start to the target in the environment described above. The net energy consumption is the difference between the energy consumed by the agent's motors for manoeuvring and the energy collected from solar radiation through its solar panels. In this case, the scalar radiation field plays a significant role in the optimization problem, and the one-step rewards defined in Eq. 5 are used.

Fig. 7 shows the trajectories obtained by following the optimal policy for net energy consumption in different realizations of the flow field. The computed optimal policy induces a tendency in the agent to manoeuvre in the yellow sunny region and collect solar energy for as long as possible while ensuring that it completes the mission within the given time constraint (Fig. 7). The vehicle starts by moving to the extreme left of the domain to collect solar energy in the yellow sunny region for all realizations of the background environment. The distribution of trajectories initially splits into two bands, as seen at t=48 (Fig. 7). One band (purple colors) continues to stay in the left extreme of the domain. The other band (yellow colors) moves back towards the right and then loops around to collect as much energy as possible in the sunny region. At around , the diminishing sunny region and the relatively strong background flow make it infeasible for the vehicle to keep collecting solar energy on the left side of the domain. After that, the vehicle starts moving towards the target location. On the way, the vehicle in the realizations corresponding to the purple band encounters the dynamic obstacle. In response, the band further splits into two, moving past it above and below. Finally, all trajectories move towards the target location and converge there by . The average net energy consumption is -190 non-dimensional units, indicating that more energy is harvested than expended for the particular non-dimensional constants used here.

Fig. 7: Paths by following the Net-Energy-Optimal Policy: Same as Fig. 5 but for an agent following the net-energy-optimal policy. Trajectories are colored with the net energy consumption.

Crucially, all these computations were performed on a modest GPU in a fraction of the time required by conventional MDP model builders and solvers using a CPU. A detailed computational run-time study is presented next.

Iv-D Run-Time Analysis with Different GPUs

The above applications were run for various problem sizes on three different GPUs, namely an RTX 2080, an RTX 8000, and a V100. The execution times were compared with the sequential version of the algorithm run on an Intel Core i7-8700 3.20 GHz CPU with 32 GB RAM. The GPU runs were individually optimized for launching kernels at the optimal number of threads per block, found to be 64, 64 and 32 for the RTX 2080, RTX 8000 and V100, respectively. Our model-building algorithm's execution times and speedups were noted for a varying number of states, number of actions, and number of realizations of the stochastic velocity f