I Introduction
The advancement of ocean science and engineering relies heavily on accessing challenging ocean regions using autonomous or unmanned underwater vehicles. These agents perform complex and long-duration missions [6], including scientific data collection, defense and security, ocean conservation, and ship-hull damage inspection. A pressing need for such tasks is to operate the autonomous agents optimally such that operational costs are minimized and utility is maximized. An optimal plan (or policy) is one that minimizes a cost function, typically the expected mission time or energy consumption, while ensuring the safety of the agent. Additionally, the net energy consumption must be minimized for certain vehicles that continuously recharge their batteries by collecting solar, wind or wave energy during a mission. A key challenge in planning optimal paths for these agents is that they are advected by the strong, dynamic and uncertain ocean currents. As such, accurate modeling of the environment and an efficient decision-making framework under uncertainty are crucial for successful planning. The planner must also be implemented using fast and scalable algorithms so that it may be utilized in real time.
In the present paper, we consider an optimal high-level path planning problem for an autonomous marine agent that has to travel from a given start point to a target by moving through a stochastic and dynamic ocean environment. We are interested in three different single-objective missions, viz., minimizing the expected mission time, the expected energy consumption without environmental harvesting, and the expected net energy consumption with environmental harvesting. The environment consists of (i) a stochastic, dynamic velocity field representing the ocean currents that advect the autonomous agent, (ii) a stochastic, dynamic scalar field representing a quantity that must be collected by the agent (e.g., solar radiation or wind energy), and (iii) dynamic obstacles that must be avoided. Here, we rely on a reliable prediction model for the environment with quantified uncertainty, which is a reasonably good assumption for underwater applications [31, 58, 46, 56]. Further, we focus on developing a computationally efficient algorithm to compute an optimal high-level policy in an offline setting (open-loop control). Our purpose is to use this algorithm in the future to develop an onboard routing scheme where the optimal policy is periodically updated through in-mission environment observations (closed-loop control).
In the above setting, the path planning problem is posed as a finite horizon, undiscounted, total cost Markov Decision Process (MDP) over a discretized spatiotemporal state space (Sec. II). Here, the uncertain velocity field directly affects the agent's motion, the uncertain scalar field affects the reward structure that incentivizes the agent's behavior, and the agent must avoid any obstacle in the mission. In the MDP framework, the objective is to compute an optimal policy that maximizes the expected sum of rewards. A dynamic programming algorithm may be used to provide an exact solution to the optimization problem. However, for large problem sizes, solving an MDP through dynamic programming becomes computationally expensive [8]. For a finite horizon MDP, the computational cost of the value iteration algorithm (a dynamic programming algorithm) is $O(T\,|A|\,|S|^2)$, where $T$, $|A|$, $|S|$ are the time horizon, action space size and state space size, respectively. Moreover, the computational cost of building the transition probabilities from ensembles of the stochastic environment forecast is $O(c\,|S|\,|A|\,n_r)$, where $c$ is a constant and $n_r$ is the number of discrete environmental flow ensembles. This cost limits the application of such exact methods for real-time use in large-scale problems. However, if fast, scalable algorithms to build MDPs and solve them are available, it opens up possibilities of their use in high-level path planning and also for online replanning when combined with a data assimilation scheme. To overcome the limitation posed by the computational time requirement, we propose an end-to-end Graphics Processing Unit (GPU)-accelerated parallel algorithm to build an MDP and compute an optimal policy in a fraction of the time required by serial algorithms on a CPU. The GPU's parallel processing capability is ideal for this task, as several computations in building the MDP transition probabilities and in the dynamic programming solution algorithm are embarrassingly parallel.
The novel contributions of this paper are: (1) Development and implementation of a computationally efficient end-to-end GPU-based algorithm to solve the optimal path planning problem using an MDP framework. We highlight the algorithmic challenges that arise in the GPU implementation for building the MDP model and how these challenges are overcome with new optimized Compute Unified Device Architecture (CUDA) kernels; (2) Introduction of an additional stochastic scalar field that the agent tries to collect as it moves around during the mission. This field affects the reward structure of the autonomous agent, thereby introducing an additional layer of uncertainty. In doing so, we have developed an interdisciplinary solution that combines the rigorous MDP framework for planning under uncertainty with a stochastic environment modeling technique and high performance computing.
The organization of the present paper is as follows. First, we briefly discuss the previous work in the field. Next, we describe the theory of Markov Decision Processes (Sec. II) and stochastic environmental modeling (Sec. II-C). The new GPU-accelerated algorithm and implementation details are in Sec. III. In Sec. IV we showcase the new algorithm by applying it to plan paths for different single-objective missions in a stochastic dynamic flow situation, clearly illustrating the computational efficiency of our algorithm. Finally, we conclude in Sec. V with directions for future research.
I-A Previous Progress in Optimal Path Planning
Most methods for optimal path planning of autonomous marine vehicles in stochastic dynamic environments are Monte Carlo versions of traditional deterministic planners (e.g., [41, 61]) and response surface methods (e.g., [26]). Unfortunately, they suffer from the curse of dimensionality and are computationally inefficient in serial implementation.
[32, 31] provide a detailed review of deterministic planners such as dynamic Dijkstra [36, 19], A* [22], potential field functions [5], Delayed D* [17], Fast Marching Methods [40], nonlinear programming [2], and mixed-integer linear programming [64]. [51, 50] use waypoint-selection algorithms and [66, 67] use evolutionary algorithms such as particle swarm optimization for path planning, utilizing regular predictions from the Regional Ocean Modeling System (ROMS). Recently, we have developed a suite of level-set-based planners for deterministic environments [34, 53, 30] and stochastic environments [54, 55]. However, the stochastic versions give a distribution of paths, and risk-optimal path selection is a computationally expensive process. The use of Markov Decision Processes (MDPs) for planning in stochastic environments [1, 62, 29] has been proposed, as they provide a rigorous mathematical framework for decision making under uncertainty. However, they become prohibitively expensive for large problems and use naive environment modeling techniques. For example, [29] adds independent Gaussian noise to the flow velocity components at each state on top of a mean forecast taken from ROMS, which causes the uncertainty to be spatially and temporally uncorrelated. Partially Observable Markov Decision Processes (POMDPs) have also been used for a variety of simultaneous localization and planning applications of ground or aerial vehicles, where the focus is local closed-loop control [14, 3]. Building on top of the MDP framework, reinforcement learning (RL) algorithms
[27, 68, 18, 65, 63, 48, 10] have also been used for path planning. GPU-accelerated RL libraries have also been developed recently [4, 35]. However, RL algorithms provide approximate solutions to MDPs, which can be validated against exact dynamic programming solutions when the environment model is available and the exact solution is computationally feasible. Other machine learning schemes, such as Gaussian process regression [13, 37], have also been proposed. The main drawback of MDP-based planners is the impractical computational time required by sequential implementations to solve large-scale problems. For example, [29] reports that their MDP solution takes about 13 hours for a medium-scale autonomous underwater vehicle planning problem on a modern CPU. To accelerate the computation of optimal policies for large-scale problems, ideas from High-Performance Computing (HPC) have been proposed for solving MDPs. For general problem statements, attempts have been made to parallelize the value iteration (VI) algorithm. [25] provides a proof of concept where the value iteration algorithm was mapped to a GPU platform using CUDA. Value iteration was also implemented using OpenCL for path-finding problems in [9]. [23] uses OpenMP and MPI to implement a very large-scale MDP to generate traffic control tables. [39] uses CUDA to implement value iteration on a GPU, optimized for inventory control problems. [42] fills state transition matrices and solves the MDP using a GPU in the context of crowd simulations, which lack a stochastic dynamic environment forecast. [43] implemented value iteration using CUDA, optimized for problems with sparse transition matrices. GPUs have also been used for other planning models, such as semi-infinite nonlinear programming [12]. In the present paper, we develop an end-to-end GPU-based optimal path planning system that uses (i) realistic forecasts of the ocean environment, (ii) a newly developed GPU-implemented MDP model builder, and (iii) an adaptation of the GPU-implemented value iteration algorithm of [43] for solving the MDP. We also provide our software package for community use [11].
II Markov Decision Process Framework
Markov Decision Processes (MDPs) provide a natural framework for sequential decision-making to compute an optimal control law (policy) in stochastic environments. An MDP is defined as a tuple $(S, A, P, R, \gamma)$, where $S$ is the set of states, $A$ is the set of actions, $P$ denotes the state transition probabilities, $R$ denotes the one-step rewards, and $\gamma$ is the discount factor. A policy $\pi$ is defined as a map from the set of states $S$ to a probability distribution over the set of actions $A$. The state transition probabilities and one-step rewards are $P(s'|s,a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$ and $R(s,a,s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$, where $S_t$, $A_t$ and $R_t$ are random variables that denote the state, action and reward at time $t$, with $s, s' \in S$ and $a \in A$. We define the value function $V^\pi(s)$, which quantifies the utility of following a policy $\pi$ from state $s$, as the expected sum of one-step rewards; it can be written in a recursive form in terms of the value functions of possible successor states as

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^\pi(s')\big]. \qquad (1)$$
An optimal policy $\pi^*$ is one that does at least as well as every other policy $\pi$, i.e., $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all $s \in S$. The optimal value function $V^*$ corresponding to an optimal policy can be expressed as a recursive relation (using the Bellman optimality conditions)

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^*(s')\big] \qquad (2)$$
that can be solved by dynamic programming algorithms. We use a finite horizon, undiscounted ($\gamma = 1$), total cost MDP for our path planning problem. Tables I and II list all the notation used in the present paper.
Symbol : Description
$S$ : Set of states of the MDP.
$A$ : Set of actions of the MDP.
$P$ : State transition probabilities of the MDP.
$R$ : One-step rewards of the MDP.
$\gamma$ : Discount factor of the MDP.
$\pi$ : A policy for the MDP.
$\pi^*$ : An optimal policy for the MDP.
$V^\pi$ : Value function for the policy $\pi$.
$V^*$ : Optimal value function.
$V_k$ : $k$-th update of the optimal value function in the value iteration algorithm.
$\mathbf{x}$ : 2D spatial coordinate representing the agent's position.
$\mathbf{x}_s$ : Starting position of the agent.
$\mathbf{x}_f$ : Target position of the agent.
$\mathbf{x}_t$ : Position of the agent at timestep $t$.
$t$ : Temporal coordinate.
$v(\mathbf{x}, t)$ : Sample of the stochastic velocity field.
$g(\mathbf{x}, t)$ : Sample of the stochastic scalar field.
$\mathbf{v}(\mathbf{x}, t; \omega)$ : Stochastic dynamic velocity field.
$\bar{\mathbf{v}}(\mathbf{x}, t)$ : Mean velocity field.
$\tilde{\mathbf{v}}_i(\mathbf{x}, t)$ : $i$-th DO mode of the velocity field.
$\mu_i(t; \omega)$ : $i$-th DO coefficient of the velocity field.
$g(\mathbf{x}, t; \omega)$ : Stochastic dynamic scalar field.
$\bar{g}(\mathbf{x}, t)$ : Mean of the stochastic dynamic scalar field.
$O(\mathbf{x}, t)$ : Mask containing the dynamic obstacle data.
$F$ : Speed of the agent with respect to the currents.
$F_{\max}$ : Maximum speed of the agent with respect to the currents.
$f(\cdot)$ : Agent's kinematic model.
$n_h$ : Number of possible headings for the agent.
$n_F$ : Number of possible speeds for the agent.
$n_r$ : Number of realizations of the stochastic velocity field.
$n_{r,g}$ : Number of realizations of the stochastic scalar field.
$m$ : Number of dominant modes of the stochastic velocity field.
$N_x$ : Dimension of the spatiotemporal grid along the $x$ axis.
$N_y$ : Dimension of the spatiotemporal grid along the $y$ axis.
$N_t$ : Dimension of the spatiotemporal grid along the $t$ axis.
$N_s$ : Number of cells in the spatial grid at a given time.
$N_{sub}$ : Number of cells in a neighbouring subgrid.
$N_{st}$ : Number of cells in the spatiotemporal grid.
$S_t$ : Random variable denoting the state of the agent at time $t$.
$A_t$ : Random variable denoting the action taken by the agent at time $t$.
$s$ : An element of the set $S$.
$a$ : An element of the set $A$.
$P(s'|s,a)$ : Probability that the agent transitions to a state $s'$ after performing action $a$ in state $s$.
$R(s,a,s')$ : One-step reward given to the agent for reaching state $s'$ upon performing action $a$ in state $s$.
$\bar{R}(s,a)$ : Expected one-step reward given to the agent upon performing action $a$ in state $s$.
$R_{target}$ : A constant positive reward given to the agent for reaching the terminal state.
$R_{obstacle}$ : A constant negative reward given to the agent for going outside the spatiotemporal domain or landing in an obstacle.
$E(\cdot)$ : Energy function of the agent.
$\alpha$ : Constant coefficient for the scalar field terms.
$c$ : Constant coefficient in the energy function.
II-A MDP Formulation of Path Planning
In a spatiotemporal domain, let $\mathbf{x} \in \mathbb{R}^d$ (with $d = 2$ or $3$ for 2D or 3D space) denote the spatial coordinate and $t$ the temporal coordinate (Fig. 1A). Let the domain have a stochastic, dynamic flow $\mathbf{v}(\mathbf{x}, t; \omega)$, where $v(\mathbf{x}, t)$ is a sample of the random velocity field with an associated probability distribution function. Let the domain also have a stochastic dynamic scalar field $g(\mathbf{x}, t; \omega)$ (like a solar, wind or wave energy field), where $g(\mathbf{x}, t)$ is a sample of the random scalar field. Let $F$ be the instantaneous speed of an autonomous agent that starts from $\mathbf{x}_s$ at time $t = 0$ and travels to $\mathbf{x}_f$ in a manner that minimizes the expected cost.
Similar to the setup in [10] and [29], we discretize the domain into a spatiotemporal grid world that constitutes the state space (Fig. 1(B)). A state represents a square region in the spatiotemporal grid, indexed by a spatiotemporal coordinate $(\mathbf{x}, t)$. The square is centered at $\mathbf{x}$ with sides of length $\Delta x$. The grid is temporally discretized at intervals of $\Delta t$. The random velocity at a discrete state $s$ is $\mathbf{v}(s; \omega)$. At each state $s$, the autonomous agent can take an action $a$ (a heading and speed) from the action space $A$. We consider $n_h$ discrete headings between $0$ and $2\pi$, and $n_F$ discrete speeds between $0$ and $F_{\max}$, where $F_{\max}$ is the maximum possible relative speed of the agent. Hence the action space consists of $n_h \times n_F$ discrete actions. Our problem is to find the optimal deterministic policy $\pi^*$ that the agent must follow to minimize the expected total cost of the mission.
The state transition probability $P(s'|s,a)$ denotes the probability that the agent lands in the state $s'$, given that it performs an action $a$ in state $s$. This probability is induced from the uncertainty in the velocity field, through the vehicle kinematics. The spatial motion is given by $\mathbf{x}' = \mathbf{x} + \big(\mathbf{v}(\mathbf{x}, t) + F\,\hat{\mathbf{h}}(a)\big)\,\Delta t$, where $\hat{\mathbf{h}}(a)$ is the unit vector along the heading of action $a$, and the temporal update is $t' = t + \Delta t$. The successor state $s'$ is the discrete state containing $(\mathbf{x}', t')$, obtained through a function that maps the spatial and temporal coordinates to the appropriate discrete state in the domain. When the uncertainty in the environment is not of a simple closed form (e.g., Gaussian), we often have to use a Monte-Carlo based approach [62], i.e., enumerate multiple samples of the uncertain field and count how many times a successor state $s'$ is visited under the influence of action $a$ performed in a given state $s$.
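For concreteness, the sketch below (host-side C++, with an assumed helper nextState that integrates the kinematics above for one flow realization and returns the discrete successor state) illustrates this Monte-Carlo counting for a single $(s, a)$ pair; it is an illustration of the idea, not the paper's exact implementation.

```cpp
#include <unordered_map>
#include <functional>

// Monte-Carlo estimate of P(s'|s,a): 'nextState' integrates the kinematics
// x' = x + (v(x,t;r) + F h(a)) dt for realization r and returns the discrete
// successor state (a hypothetical helper supplied by the caller).
std::unordered_map<int, double> transitionProbabilities(
        int s, int a, int n_r,
        const std::function<int(int, int, int)>& nextState) {
    std::unordered_map<int, int> counts;
    for (int r = 0; r < n_r; ++r)
        ++counts[nextState(s, a, r)];   // tally visits to each successor state
    std::unordered_map<int, double> prob;
    for (const auto& kv : counts)       // normalize counts into probabilities
        prob[kv.first] = static_cast<double>(kv.second) / n_r;
    return prob;
}
```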
The expected one-step reward that an agent receives upon reaching a state $s'$ after taking an action $a$ in state $s$ depends on the mission objective. For the three different mission objectives that we consider, the rewards are defined as follows. (i) Minimize expected travel time: The agent is penalized a constant value for each timestep, leading to the one-step reward
$$R(s, a, s') = -\Delta t. \qquad (3)$$
(ii) Minimize expected energy consumption: The agent's energy consumption is modeled as a function of its speed, leading to the one-step reward
$$R(s, a, s') = -E(F)\,\Delta t, \qquad (4)$$
where $F$ is the agent's speed and $E(\cdot)$ is the energy function. Typically, a quadratic relationship of the form $E(F) = c\,F^2$ is utilized, where $c$ is a constant [53]; and (iii) Minimize expected net energy consumption: The agent is equipped with a solar panel (or any other energy harvesting mechanism) that collects energy from a spatiotemporal stochastic dynamic energy field $g(\mathbf{x}, t; \omega)$ while moving around in the domain. Hence, the expected one-step reward is defined as
$$\bar{R}(s, a) = \mathbb{E}\Big[{-E(F)\,\Delta t} + \tfrac{\alpha \Delta t}{2}\big(g(s) + g(s')\big)\Big], \qquad (5)$$
where $\alpha$ is a constant. The first term inside the expectation is the energy consumed for taking the action $a$. We assume that the energy harvested by the agent is directly proportional to the magnitude of the solar radiation field. The second term is a first-order integral approximation of the energy collected by the agent from $g$ as it transitions from the state $s$ to $s'$. The sum of the two terms is the agent's one-step net-energy consumption. The expectation in Eq. (5) is taken over the joint distribution of $g(s)$, $g(s')$ and $s'$. Moreover, irrespective of the mission objective, the agent is given a constant positive reward if it reaches the target terminal cell, to provide it an incentive to reach the target location; in the absence of this incentive, the agent may never reach the target location. Similarly, it is given a relatively large negative reward if it lands in a state marked as an obstacle, goes outside the defined spatial domain, or requires time beyond the finite planning horizon. Note that here we emphasize a 2D-in-space problem setup; however, our framework and implementation apply to 3D in space as well. Also, we use a kinematic framework for planning, without considering the dynamics of the flow around the vehicle [34, 1, 62, 29, 10].

II-B Dynamic Programming Algorithm
Since our planning problem is a finite horizon, total cost, undiscounted MDP, there exists an optimal deterministic policy [7] that can be computed using dynamic programming. The key idea of dynamic programming is to use the value functions to organize and structure the search for good policies. Value iteration can be viewed as turning the Bellman optimality equation (Eq. (2)) into the update rule
$$V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a)\,\big[R(s,a,s') + \gamma V_k(s')\big]. \qquad (6)$$
The sequence $\{V_k\}$ can be shown to converge to the optimal $V^*$ from any arbitrary initial value $V_0$, under the same conditions that guarantee the existence of $V^*$ [8]. Once the optimal value function is obtained, an optimal policy can directly be computed as
$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^*(s')\big]. \qquad (7)$$
We refer the reader to [8] for the complete algorithm.
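For illustration, a minimal dense implementation of this backward recursion for our finite-horizon, undiscounted setting is sketched below. The container layout and names are assumptions for exposition; the actual solver used in this paper (Sec. III-D) operates on sparse transition matrices on the GPU.

```cpp
#include <vector>

// Minimal dense finite-horizon value iteration (gamma = 1), for illustration.
// P[t][s][a][s'] and R[t][s][a][s'] are assumed precomputed; V must be sized
// (T+1) x S, with V[T] holding the terminal values (e.g., zero).
using Tensor4 = std::vector<std::vector<std::vector<std::vector<double>>>>;

void valueIteration(const Tensor4& P, const Tensor4& R,
                    std::vector<std::vector<double>>& V,
                    std::vector<std::vector<int>>& policy) {
    const int T = static_cast<int>(P.size());
    const int S = static_cast<int>(P[0].size());
    const int A = static_cast<int>(P[0][0].size());
    for (int t = T - 1; t >= 0; --t) {        // backward sweep over the horizon
        for (int s = 0; s < S; ++s) {
            double best = -1e300; int bestA = 0;
            for (int a = 0; a < A; ++a) {     // Bellman backup, Eq. (6)
                double q = 0.0;
                for (int sp = 0; sp < S; ++sp)
                    q += P[t][s][a][sp] * (R[t][s][a][sp] + V[t + 1][sp]);
                if (q > best) { best = q; bestA = a; }
            }
            V[t][s] = best;
            policy[t][s] = bestA;             // greedy action, Eq. (7)
        }
    }
}
```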
II-C Stochastic Dynamic Environment Modeling
A crucial input for computing the state transition probabilities of the MDP path planning problem is the stochastic dynamic environmental flow field in its decomposed form

$$\mathbf{v}(\mathbf{x}, t; \omega) = \bar{\mathbf{v}}(\mathbf{x}, t) + \mu_i(t; \omega)\,\tilde{\mathbf{v}}_i(\mathbf{x}, t),$$

where $\bar{\mathbf{v}}$ is the so-called DO mean, $\tilde{\mathbf{v}}_i$ are the DO modes, and $\mu_i$ are the DO coefficients. Here, Einstein's notation (repeated indices indicate summation) is used. In the present paper, we employ the computationally efficient Dynamically Orthogonal (DO) field equations [45], in which the DO mean, modes and coefficients are obtained by integrating explicit evolution equations [54, 10]. Alternatively, any modeling system that provides the flow field's distribution in terms of the above decomposition is sufficient to compute the transition probability matrix and complete the planning. For example, a deterministic ocean modeling system (e.g., [21, 46, 58]) may be employed to forecast an ensemble from which the dynamically orthogonal reduced-order representation can be obtained by performing a singular value decomposition [33]. Further, any other uncertainty quantification and propagation method, such as the Polynomial Chaos Expansion [20], can also be used. Notably, the DO equations provide a significant computational advantage (two to four orders of magnitude speedup) over the ensemble approach. Similarly, the stochastic dynamic scalar field $g(\mathbf{x}, t; \omega)$, representing the energy field to be collected by the vehicle, must be obtained using environmental models such as atmospheric [49], or wind or wave [60] models.

III GPU-Accelerated Planning
There are two phases in the algorithm that computes an optimal policy of the MDP: (1) building the model, i.e., computing the expected one-step rewards $\bar{R}(s,a)$ and the state transition probabilities $P(s'|s,a)$; and (2) solving the MDP, i.e., obtaining an optimal policy using methods such as dynamic or linear programming. Each of these phases requires practically infeasible computational time in a serial implementation. The cost of the first phase using a naive Monte-Carlo approach is $O(|S|\,|A|\,n_r\,n_{r,g})$, as we must compute the one-step transition data for each state, for each action, and for each realization of the uncertain velocity and scalar fields. Here, $n_{r,g}$ is the number of realizations of the stochastic scalar field $g$. The cost of the second phase using value iteration is $O(T\,|A|\,|S|^2)$. Therefore, when we have a large state, action, or realization space, the compute time may become impractical. To solve large-scale problems without having to resort to approximate methods such as Reinforcement Learning [59], we propose a GPU-based algorithm that achieves significant computational speedups. Since our algorithm provides a dynamic programming solution, it can also be used to validate solutions obtained through reinforcement learning methods. Next, we discuss each phase separately, identify the properties that make it suitable for parallelization, and describe how it can be implemented on a GPU. We also discuss how to overcome the physical device limitations of a GPU through algorithmic intervention.
III-A MDP Model Building on GPUs
To compute the transition probabilities and the expected one-step rewards of the MDP, we must calculate one-step transitions for each state, each action, and each realization by evaluating the vehicle kinematics and the corresponding one-step reward. A key observation in the process is that these computations are independent of each other across states, actions and realizations, i.e., the transition and reward for any $(s, a, r)$ tuple can be computed independently of any other tuple.
Hence, the key idea is that one can spawn thousands of independent threads that perform all of these computations for every tuple independently and in parallel. Additionally, since there is no need for any data communication between these threads, the application is "embarrassingly parallel". Such applications are ideal for implementation on a GPU. In our work, we use NVIDIA GPUs, and the algorithm is written using CUDA.
Ideally, with a sufficient number of cores, all the computation could potentially happen simultaneously in parallel. Practically, however, we would like our algorithm to run efficiently on modest GPUs installed on the autonomous system itself and not consume excessive energy (to operate the GPU). Our ultimate goal is to update the MDP model through in-mission measurements of the surrounding environment and to update the optimal policy regularly with a modest onboard GPU. Towards this goal, in the present paper, we focus only on the end-to-end GPU algorithm for solving the offline path-planning problem. Critically, such modest GPUs may have resource limitations such as limited memory. For instance, if we decided to store each component of the velocity field for the entire spatiotemporal grid of size $100 \times 100 \times 200$ (say) with 1000 (say) realizations as 4-byte floating point numbers in GPU memory, it would require 8 GB of global memory! Clearly, this is not acceptable. Hence, we propose the following three solutions to deal with such limitations.
First, we use a reduced-order velocity field that is stored in the GPU's global memory. The realizations of the stochastic, dynamic velocity field are decomposed into a mean field along with corresponding modes and coefficients. For instance, let us assume that the first $m$ (say 10) singular values of the velocity's covariance matrix capture a specified threshold of the variability (say 99%); then, for the aforementioned spatiotemporal grid and realizations, each component of the velocity field would be broken down into a mean-field matrix of size $100 \times 100 \times 200$, $m = 10$ mode matrices of size $100 \times 100 \times 200$ each, and a coefficient matrix of size $10 \times 1000 \times 200$ ($m \times n_r \times N_t$). This requires only 96 MB of memory compared to the 8 GB required otherwise, which is roughly a two orders of magnitude reduction. Second, we use just the mean scalar field for computing the expected one-step rewards when the problem involves the agent manoeuvring to collect energy from the stochastic scalar field. To do so, we show that enumerated samples of the field are not required to compute the expected one-step rewards, and the mean field alone is sufficient. The use of only the mean scalar field saves memory, and it saves computational expense by a factor of $n_{r,g}$. The proof is as follows. The optimal value function can be expressed in terms of an expectation, for all $s \in S$, as
$$V^*(s) = \max_{a \in A} \mathbb{E}\big[R_{t+1} + V^*(S_{t+1}) \mid S_t = s, A_t = a\big] \qquad (8)$$
$$= \max_{a \in A} \Big\{ \mathbb{E}\big[R_{t+1} \mid s, a\big] + \mathbb{E}\big[V^*(S_{t+1}) \mid s, a\big] \Big\} \qquad (9)$$
$$= \max_{a \in A} \Big\{ \mathbb{E}\big[R_{t+1} \mid s, a\big] + \sum_{s' \in S} P(s'|s,a)\,V^*(s') \Big\} \qquad (10)$$
$$= \max_{a \in A} \Big\{ \mathbb{E}\Big[{-E(F)\Delta t} + \tfrac{\alpha \Delta t}{2}\big(g(s) + g(S_{t+1})\big)\Big] + \sum_{s' \in S} P(s'|s,a)\,V^*(s') \Big\} \qquad (11)$$
$$= \max_{a \in A} \Big\{ {-E(F)\Delta t} + \tfrac{\alpha \Delta t}{2}\,\mathbb{E}\big[g(s)\big] + \tfrac{\alpha \Delta t}{2}\,\mathbb{E}\big[g(S_{t+1})\big] + \sum_{s' \in S} P(s'|s,a)\,V^*(s') \Big\} \qquad (12)$$
Let us focus on the second and third energy terms on the RHS of Eq. (12). In the second term, the only random variable inside the expectation is $g(s)$. Note that $s$ is not a random variable in this equation, since we are calculating the expected one-step reward for performing an action $a$ in a given state $s$, which is known. Therefore, the expectation in the second term is over the probability distribution $P(g(s) \mid s, a)$, or simply $P(g(s) \mid s)$, since $g(s)$ is independent of the action $a$. The expectation simply results in the mean value of $g$ at state $s$ (Eq. 13). However, in the third term, both $g$ and $S_{t+1}$ are random variables. Hence this expectation is over the joint distribution $P(g(S_{t+1}), S_{t+1} \mid s, a)$ and can be expressed as an iterated conditional expectation (Eq. 14). The outer expectation is over the distribution $P(S_{t+1} \mid s, a)$, i.e., the state transition probabilities. The inner expectation is over the distribution $P(g(S_{t+1}) \mid S_{t+1}, s, a)$, which simplifies to $P(g(s') \mid s')$ since $g(s')$ is independent of $s$ and $a$ when evaluated at $s'$ (Eq. 15). The inner expectation then simply results in the mean value of $g$ at $s'$ (Eq. (16)).
$$\mathbb{E}\big[g(s)\big] = \bar{g}(s) \qquad (13)$$
$$\mathbb{E}\big[g(S_{t+1})\big] = \mathbb{E}_{S_{t+1} \sim P(\cdot|s,a)}\Big[\,\mathbb{E}\big[g(S_{t+1}) \mid S_{t+1}\big]\Big] \qquad (14)$$
$$= \sum_{s' \in S} P(s'|s,a)\;\mathbb{E}\big[g(s') \mid s'\big] \qquad (15)$$
$$= \sum_{s' \in S} P(s'|s,a)\;\bar{g}(s') \qquad (16)$$
From Eq. (13) and Eq. (16), we see that the second and third terms on the RHS of Eq. (12) depend only on $\bar{g}(s)$ and $\bar{g}(s')$, which are the means of the stochastic scalar field at $s$ and $s'$, respectively. This holds true for any $s$ and $a$. Hence, while we may have a model characterizing the uncertainty in the scalar field and it is possible to draw samples from it, it suffices to store only the mean scalar field $\bar{g}$. Finally, substituting Eq. (13) and Eq. (16) in Eq. (12), we get
$$V^*(s) = \max_{a \in A} \Big\{ {-E(F)\Delta t} + \tfrac{\alpha \Delta t}{2}\,\bar{g}(s) + \sum_{s' \in S} P(s'|s,a)\Big[\tfrac{\alpha \Delta t}{2}\,\bar{g}(s') + V^*(s')\Big] \Big\} \qquad (17)$$
$$= \max_{a \in A} \sum_{s' \in S} P(s'|s,a)\,\big[\bar{R}(s,a,s') + V^*(s')\big], \quad \text{with } \bar{R}(s,a,s') = {-E(F)\Delta t} + \tfrac{\alpha \Delta t}{2}\big(\bar{g}(s) + \bar{g}(s')\big). \qquad (18)$$
Third, we use the concept of a secondary grid, which we refer to as the neighbouring subgrid, whose size is much smaller than that of the spatial grid at a given time instance. For a given $(s, a)$ pair, the neighbouring subgrid is used to store the possible values of $s'$ for different realizations in a memory-efficient manner. The algorithm and the details of how the neighbouring subgrid achieves memory efficiency are discussed next; a minimal sketch of the index mapping follows below.
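The sketch below illustrates one possible form of such a subgrid index mapping; the function name and the centred-window convention are assumptions for illustration.

```cuda
// Sketch: map a successor cell (xp, yp) to the local slot of the neighbouring
// subgrid centred on the initial cell (x0, y0). subW and subH are the subgrid
// dimensions (N_sub = subW * subH); names and conventions are assumptions.
__host__ __device__ int subgridIndex(int x0, int y0, int xp, int yp,
                                     int subW, int subH) {
    int dx = xp - x0 + subW / 2;      // shift so the subgrid origin is (0, 0)
    int dy = yp - y0 + subH / 2;
    if (dx < 0 || dx >= subW || dy < 0 || dy >= subH)
        return -1;                    // outside the subgrid: invalid transition
    return dy * subW + dx;            // row-major local index in [0, N_sub)
}
// The global slot for the pair (s, s') in the count array is then
// s * N_sub + subgridIndex(...), which gives the compact layout of Fig. 2(D).
```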
III-B CUDA Algorithm for Building the MDP Model
Symbol : Description
count : An array that keeps a count of the number of times a successor state $s'$ is reached upon taking an action $a$ for each initial state $s$, across realizations.
sumR : An array that stores the sum of one-step rewards across realizations for each initial state $s$, for a given action $a$ at a given timestep.
$P_{coo}^{t,a}$ : A COO-format state transition matrix for transitions from the states at time $t$ to the states at the successive time $t + \Delta t$, for the given action $a$.
$R^{t,a}$ : Reward vector that stores the expected one-step rewards for performing the action $a$ in each state at time $t$.
$nnz$ : Number of nonzero elements in $P_{coo}^{t,a}$.
numSucc : An array that stores the number of possible successor states for each state at time $t$, for the given action.
$P_{coo}^{a}$ : A COO-format state transition matrix for all states for the given action $a$.
$R^{a}$ : Reward vector that stores the expected one-step rewards of all states for the given action $a$.
$P_{coo}$ : A COO-format state transition matrix made by appending the $P_{coo}^{a}$ for all actions $a$.
$R_{vec}$ : Reward vector containing the expected one-step rewards, made by appending the $R^{a}$ for all actions $a$.
tid : Thread index assigned to each thread in a kernel.
cell : A one-dimensional index denoting a discrete cell in the spatial grid at any given time.
state : A one-dimensional index denoting a discrete cell in the spatiotemporal grid world.
r : A one-dimensional index denoting the realization of the stochastic velocity field.
$\bar{g}(s)$ : Mean scalar field at state $s$.
reward : One-step reward.
countIdx : An index used to access the count array.
sumIdx : An index used to access the sumR array.
Algorithm 1 is the newly developed CUDA algorithm for building the MDP transition probability matrix and the expected one-step rewards.
Algorithm 1 takes the following inputs:

- Environment data: all environment information, namely the reduced-order stochastic velocity field in the form of the mean velocity field $\bar{\mathbf{v}}$, the dominant modes $\tilde{\mathbf{v}}_i$, and the corresponding coefficients $\mu_i$; the mean of the stochastic scalar field $\bar{g}$; and the dynamic obstacle information in the form of a mask $O$.

- Problem definition: a parameter that defines the problem type and objective function. The objective can be minimizing the expected travel time, the expected energy consumption, or the expected net energy consumption.

- Problem parameters: an array that stores other problem parameters such as $n_h$, $n_F$ and $F_{\max}$, which are used to define the action space, and $\alpha$ and $c$, which are used to define the reward structure.

- Grid parameters: an array that stores grid parameters such as length scales, time scales, and the sizes of the spatiotemporal grid, $N_x$, $N_y$ and $N_t$.

It gives $P_{coo}$ and $R_{vec}$ as the output, where $P_{coo}$ is the appended transition matrix and $R_{vec}$ is the appended expected rewards vector.
The algorithm works as follows. First, the input data, which consist of the environmental information, the grid parameters, and the problem parameters, are transferred from the CPU to the GPU using a MEMCPY operation so that they can be readily accessed by threads during kernel launches. Next, we define the size of the neighbouring subgrid. Conceptually, the neighbouring subgrid is a smaller, rectangular 2D grid defined for each state at a given timestep in the domain. The key idea behind this grid is that an agent starting at a state $s$ cannot spatially transition to any state that lies outside the neighbouring subgrid of $s$, since it can only transition to physically nearby states. Fig. 2(C) shows neighbouring subgrids for two different initial states. The size of the subgrid ($N_{sub}$) is kept constant across all states and depends on the component-wise maximum velocities across all states, with an added buffer. The cells inside the neighbouring subgrid map to elements of the count array, as shown in Fig. 2(C) and Fig. 2(D). The details of this mapping and how the subgrid aids efficient computation are discussed in Sec. III-C1.
After that, we allocate memory for the count and sumR arrays on the GPU. The size of the count array is $N_s \times N_{sub}$, where $N_s$ is the number of cells (or states) at a given timestep. The count array keeps a count of the number of times a given successor state $s'$ is reached upon taking a given action $a$ from each initial state $s$, across realizations. On the other hand, the size of the sumR array is $N_s$. It stores the sum of one-step rewards across realizations for each initial state $s$, for a given action $a$ at a given timestep. These arrays are filled by the successive kernel calls described in Sec. III-C.
Next in the algorithm, we have a series of CUDA kernels responsible for computing the MDP model. These kernels are nested within two for-loops: the outer loop ranges across timesteps, and the inner loop ranges across all actions in the action space. Hypothetically, if the GPU had unlimited memory and compute cores, all the computation could potentially be done in parallel without iterating over timesteps and actions. However, modest GPUs have limited memory and compute cores, making it difficult, if not impossible, to store all the data and perform all computations in parallel at once. Since all computations cannot happen in parallel anyway, splitting the computation through loops does not compromise the maximum achievable parallelism; effectively, we still use the GPU to its full potential while splitting computations across timesteps and actions. As such, each kernel performs computations over the states at a given timestep for a given action, as in the host-side sketch below. In what follows, we discuss each kernel's working and implementation.
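The following host-side skeleton (with simplified, assumed kernel signatures and names) illustrates how the nested loops orchestrate the kernels of Sec. III-C; it is a sketch of the control flow, not the exact implementation.

```cuda
#include <cuda_runtime.h>

// Simplified, assumed kernel signatures; fuller sketches appear in Sec. III-C.
__global__ void K_transition(int t, int a, int* count, float* sumR);
__global__ void K_compute_mean(float* sumR, int Ns, int nr);
__global__ void K_count(const int* count, int* numSucc, int Ns, int nSub);

// Host-side skeleton of Algorithm 1's two nested loops (names illustrative).
void buildModel(int Nt, int nA, int Nx, int Ny, int nSub, int nr,
                int* d_count, float* d_sumR, int* d_numSucc) {
    int Ns = Nx * Ny;                       // states per timestep
    dim3 grid(Nx, Ny, (nr + 63) / 64);      // 3D grid mimicking Fig. 2(B)
    dim3 block(1, 1, 64);                   // queue of threads along z
    for (int t = 0; t < Nt - 1; ++t) {      // outer loop over timesteps
        for (int a = 0; a < nA; ++a) {      // inner loop over actions
            cudaMemset(d_count, 0, sizeof(int) * Ns * nSub);  // reset per (t, a)
            cudaMemset(d_sumR, 0, sizeof(float) * Ns);
            cudaMemset(d_numSucc, 0, sizeof(int) * Ns);
            K_transition<<<grid, block>>>(t, a, d_count, d_sumR);
            K_compute_mean<<<(Ns + 63) / 64, 64>>>(d_sumR, Ns, nr);
            K_count<<<(Ns * nSub + 63) / 64, 64>>>(d_count, d_numSucc, Ns, nSub);
            // ...compute nnz (thrust::reduce over d_numSucc), allocate the COO
            // matrix, launch K_reduce_to_coo, then cudaMemcpy the per-(t, a)
            // transition matrix and reward vector to the host for appending.
        }
    }
}
```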
III-C CUDA Kernels
III-C1 K_transition
This is the most important kernel in the algorithm and consumes the majority of the total compute time. The kernel's purpose is to compute one-step transitions for each state at a given timestep and action, across all realizations of the velocity field. The computed data are then written to the appropriate indices of the count and sumR arrays. Algorithm 2 shows the pseudocode of the kernel, which is executed in parallel by all the threads. A visual representation of the stochastic velocity field at a given timestep is shown in Fig. 2(A) in the form of a 3D grid. Each plane in the grid represents the spatial domain marked with grey arrows, highlighting the velocity field corresponding to one realization at the given timestep. The field's realizations are stacked along the third ($z$) direction. The kernel is launched using a 3D grid of thread blocks (Fig. 2(B)) that mimics the structure of the 3D grid of velocity field realizations in Fig. 2(A). The $x$ and $y$ directions of the kernel's grid correspond to the $x$ and $y$ directions of the spatial domain, and the $z$ direction corresponds to the realizations of the velocity field. All thread blocks (cuboidal units of the kernel's grid) that share the same $(x, y)$ coordinates are stacked along the $z$ direction and map to the corresponding $(x, y)$ cell of the spatial domain. For example, one such stack outlined in red corresponds to the spatial cell in the left corner. Also, each thread block consists of a single queue of threads oriented along the $z$ direction, such that each thread in the block maps to one realization of the velocity field. Fig. 2(C) shows two stacks of blocks that correspond to a given state, with each block consisting of multiple threads along the $z$ direction, corresponding to the velocity field realizations (Fig. 2(A)). As a result of this mimicking structure, each thread uniquely maps to a particular state and realization at the given timestep, and maintains one-dimensional indices corresponding to (i) the discrete cell in the spatiotemporal grid world, denoted by state, (ii) the discrete cell in the spatial grid at time $t$, denoted by cell, and (iii) the realization of the velocity field, denoted by r (Algorithm 2). Therefore, each thread can independently compute the velocity field's and scalar field's numerical values for its corresponding state and realization from the mean, modes, coefficients and mean scalar field already present in the GPU's global memory. Similarly, each thread can also independently access the obstacle data from the mask matrix $O$. Once a thread has computed the environment information for its state and realization, it independently computes the successor state's spatial coordinate $\mathbf{x}'$ and temporal coordinate $t'$ through the kinematics. Next, since each thread knows its successor state, it reads the value of the mean scalar field $\bar{g}$ at $s'$ and then computes its corresponding one-step reward, wherein the reward structure is defined according to Eqs. (3)-(5). Computing the one-step reward also involves checking for the presence of any obstacles between $s$ and $s'$ and whether $s'$ is a terminal state, by evaluating the if-conditions in Lines 6, 7, and 11 of Algorithm 2.
Based on our definition of the neighbouring subgrid (Sec. III-B), the successor state $s'$ must be a part of the subgrid itself. Fig. 2(C) shows an example of neighbouring subgrids for two different initial states, highlighted in red and orange colours. Each cell in the neighbouring subgrid maps to a contiguous set of elements in the count array, as shown in Fig. 2(C) and Fig. 2(D). The count array is of size $N_s \times N_{sub}$, since it maps from $N_{sub}$-sized neighbouring subgrids for each of the $N_s$ possible initial states. Each thread computes the subgrid index corresponding to its successor state $s'$ and the corresponding index into the count array. Thereafter, it performs an atomicAdd() operation, incrementing the count at that index of the count array. Fig. 2(C) provides a visual representation of the above process through a representative stack of three thread blocks (six threads in total) highlighted in orange. It shows how threads with the same subgrid index participate in an atomicAdd() operation to access and write to the appropriate element of the count array. In this way, the count array encodes the number of times an agent starting at an initial state $s$ and performing a given action $a$ transitions to a successor state $s'$, for all possible initial and successor states at times $t$ and $t + \Delta t$, respectively.
Crucially, we note that the neighbouring subgrid concept limits the size of the count array by exploiting the fact that transitions can practically happen only to nearby cells. The naive alternative would be to allow parallel access to a count array of size $N_s \times N_s$. The advantage of the latter is that there is no need to compute an additional index for the subgrid; however, most elements of this alternative array would never be accessed and would hold just the initial value (zero), which is a wasteful utilization of memory. Further, the neighbouring subgrid concept also saves unnecessary computations, as will be discussed in Sec. III-C4.
Lastly, the sumR array must be filled with the sum of one-step rewards across realizations. Therefore, all threads associated with a given spatial cell (but different realizations) participate in another atomicAdd() operation, adding their one-step rewards to the corresponding index of the sumR array. A visual representation of this computation is shown in Fig. 2(C), where the threads along a stack of blocks access and write to the appropriate elements of the sumR array through this atomicAdd() operation.
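The following sketch illustrates the structure of such a kernel. The array layouts and the assumed device helpers advect(), toCell(), oneStepReward() and subgridSlot() (cf. the subgridIndex sketch in Sec. III-A) are illustrative assumptions; Algorithm 2 defines the actual steps.

```cuda
// Assumed device helpers (illustrative only): one-step kinematics,
// discretization of the successor position, reward evaluation per
// Eqs. (3)-(5), and the subgrid slot mapping of the count array.
__device__ float2 advect(int i, int j, float2 v, int a);
__device__ int toCell(float2 xNext, int nx, int ny);           // -1 if outside
__device__ float oneStepReward(int cell, int cellNext, int a, const float* gMean);
__device__ int subgridSlot(int cell, int cellNext, int nx, int nSub);

// Hedged sketch of K_transition (Algorithm 2): each thread handles one
// (spatial cell, realization) pair for the given timestep t and action a.
__global__ void K_transition(int t, int a,
                             const float2* vMean,   // mean field at time t
                             const float2* vModes,  // m DO modes at time t
                             const float* coeffs,   // DO coefficients (r, k) at t
                             const float* gMean,    // mean scalar field at time t
                             int* count, float* sumR,
                             int nx, int ny, int nr, int m, int nSub) {
    int i = blockIdx.x, j = blockIdx.y;              // spatial cell (i, j)
    int r = blockIdx.z * blockDim.z + threadIdx.z;   // realization index
    if (r >= nr) return;
    int cell = j * nx + i;                           // 1D spatial index

    // Reconstruct this realization's velocity from the DO decomposition,
    // v = vMean + sum_k coeffs[r, k] * vModes[k]  (Sec. II-C).
    float2 v = vMean[cell];
    for (int k = 0; k < m; ++k) {
        float c = coeffs[r * m + k];
        v.x += c * vModes[k * nx * ny + cell].x;
        v.y += c * vModes[k * nx * ny + cell].y;
    }

    // One-step kinematics under action a, then discretize the successor state
    // and evaluate the one-step reward using the mean scalar field.
    float2 xNext = advect(i, j, v, a);
    int cellNext = toCell(xNext, nx, ny);
    float reward = oneStepReward(cell, cellNext, a, gMean);

    // Tally the transition in its subgrid slot and accumulate the reward.
    int sub = subgridSlot(cell, cellNext, nx, nSub);
    if (sub >= 0) atomicAdd(&count[cell * nSub + sub], 1);
    atomicAdd(&sumR[cell], reward);
}
```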
III-C2 K_compute_mean
As per Eq. (12), we must compute the expected one-step rewards. In our approach, we estimate this expectation by taking a sample mean of the one-step rewards over all realizations. The sumR array already contains the sum of the one-step rewards over all realizations. Hence, the expectation can simply be computed by dividing each element of sumR by $n_r$. The K_compute_mean kernel performs this division through a 1D grid of threadblocks, where each threadblock is also one-dimensional. Each thread is assigned uniquely to an element of sumR and simply divides its corresponding element by $n_r$, thus resulting in $R^{t,a}$, which constitutes the expected one-step rewards for performing the action $a$ in each state at time $t$. Fig. 2(C) and Fig. 2(D) show the kernel using the sumR array and modifying it into the resulting $R^{t,a}$ vector.

III-C3 K_count
We must now compute the transition probability matrix $P_{coo}^{t,a}$ for transitions from the states at time $t$ to the states at the successive time $t + \Delta t$, for the given action $a$. As $P_{coo}^{t,a}$ is sparse, we store it in the sparse COO (coordinate) format, which needs only $3 \times nnz$ units of memory (in contrast to $N_s^2$ for a full matrix), where $nnz$ is the number of nonzero elements in the sparse matrix. However, we must first compute $nnz$ to allocate the appropriate amount of memory for $P_{coo}^{t,a}$. We must also separately store the number of nonzero elements in each neighbouring subgrid of each state at time $t$ in an array called numSucc, which is used for accessing the appropriate index of $P_{coo}^{t,a}$ in the next kernel (Sec. III-C4). The K_count kernel works as follows. It is launched using a 1D grid of one-dimensional threadblocks, with a total of $N_s \times N_{sub}$ threads, one for each element of the count array. For example, the threads of a given threadblock would be responsible for accessing their respective red-highlighted elements of the count array, as shown in Fig. 2(D). Each thread checks whether its corresponding element of the count array holds a nonzero value. If so, it participates in an atomicAdd() operation, incrementing the count maintained at the appropriate index of the numSucc array. Then $nnz$ is directly computed using the Thrust [24] library's parallel implementation of the reduce function.
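A minimal sketch of this counting step and the subsequent reduction is given below; the names and signatures are illustrative assumptions.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Hedged sketch of K_count: one thread per element of the count array; a
// nonzero element marks one distinct successor state for its initial state.
__global__ void K_count(const int* count, int* numSucc, int Ns, int nSub) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= Ns * nSub || count[idx] == 0) return;
    atomicAdd(&numSucc[idx / nSub], 1);   // initial state index = idx / nSub
}

// Host side: nnz (total nonzeros) sizes the COO allocation for this (t, a).
int computeNnz(int* d_numSucc, int Ns) {
    thrust::device_ptr<int> p(d_numSucc);
    return thrust::reduce(p, p + Ns);     // parallel sum on the GPU
}
```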
III-C4 K_reduce_to_coo
The count and numSucc arrays together encode all the information required to populate $P_{coo}^{t,a}$ in the COO format, which is a matrix of size $3 \times nnz$. The first row of $P_{coo}^{t,a}$ stores the row indices (the initial states $s$), the second row stores the column indices (the successor states $s'$), and the third row stores the nonzero element values (the transition probabilities $P(s'|s,a)$). Fig. 2(D) shows the transition probability matrix in the COO format, with three $nnz$-sized rows for the row indices, the column indices, and the probabilities, respectively.
The K_reduce_to_coo kernel's purpose is to reduce the encoded state transition information in the count array to the transition matrix $P_{coo}^{t,a}$. The kernel is launched with $N_s$ threads organized into a 1D grid of one-dimensional threadblocks. Hence, each thread is responsible for the computation involving its own initial state $s$. Each thread first computes its one-dimensional state index and another index to access the appropriate element of $P_{coo}^{t,a}$. It then writes its state index to successive elements of the first row, where the number of such elements is the number of possible successor states for the state-action pair, obtained by accessing the numSucc array. Then, to fill the second and third rows of $P_{coo}^{t,a}$, each thread traverses through the neighbouring subgrid of its corresponding initial state in the count array to check for nonzero values. When a thread encounters a nonzero value, it decodes the one-dimensional index of the successor state $s'$ and writes this value to the appropriate index of the second row. It also maintains a local count of the number of times each $s'$ occurs. Finally, the transition probability is computed by dividing the count of each $s'$ by the total number of realizations and is written at the same index of the third row.
We briefly mentioned in Sec. III-C1 that the neighbouring subgrid concept also saves unnecessary computation, apart from saving memory. The reason is that in our approach a thread has to traverse through only $N_{sub}$ elements, whereas in the naive alternative with no neighbouring subgrid, a thread would have to traverse through $N_s$ elements, most of which would be zero. Since the size of the neighbouring subgrid is much smaller than that of the spatial grid ($N_{sub} \ll N_s$), the cost of searching is significantly cheaper in our approach compared to the naive alternative.
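The sketch below shows one possible form of this reduction, assuming an exclusive prefix sum (offset) over the numSucc array and hypothetical index helpers; the paper's kernel may differ in its details.

```cuda
// Assumed index helpers (illustrative): the 1D spatiotemporal state index and
// the decoding of a subgrid slot back into a successor state at time t + dt.
__device__ int stateIndex(int s, int t);
__device__ int successorIndex(int s, int sub, int t);

// Hedged sketch of K_reduce_to_coo: one thread per initial state s compresses
// that state's subgrid counts into the COO triplets (s, s', P(s'|s,a)).
// 'offset' is an exclusive prefix sum of numSucc (e.g., thrust::exclusive_scan).
__global__ void K_reduce_to_coo(int t, const int* count, const int* offset,
                                int* cooRow, int* cooCol, float* cooVal,
                                int Ns, int nSub, int nr) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= Ns) return;
    int w = offset[s];                       // first free COO slot for state s
    for (int k = 0; k < nSub; ++k) {         // only N_sub cells to scan
        int c = count[s * nSub + k];
        if (c == 0) continue;
        cooRow[w] = stateIndex(s, t);        // row: initial state
        cooCol[w] = successorIndex(s, k, t); // column: successor state
        cooVal[w] = (float)c / nr;           // value: transition probability
        ++w;
    }
}
```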
In Algorithm 1, the above sequence of kernels is executed within the two for-loops. After the last kernel is executed, we have both $P_{coo}^{t,a}$ and $R^{t,a}$ stored in the GPU's global memory. At the end of the inner action loop, they are moved to the CPU memory through a MEMCPY operation to maintain sufficient memory in the GPU for the similar data structures that will be populated in the next iteration. In addition, the transition matrices and reward vectors of all actions are appended across timesteps on the CPU, asynchronously with respect to the GPU. Hence, at the end of the two-level nested for-loops, we have the complete transition matrix $P_{coo}^{a}$ and the expected one-step rewards $R^{a}$ for all the states in the spatiotemporal grid, for each action $a$. Finally, we append all $P_{coo}^{a}$ and $R^{a}$ across actions along the rows and obtain $P_{coo}$ and $R_{vec}$. This final assembly is also done on the CPU and is essentially a preprocessing step that provides a suitable input data structure for the sparse value iteration solver.
III-D Sparse Value Iteration on GPUs
The optimal value function and policy for the above MDP model ($P_{coo}$, $R_{vec}$) are computed by a dynamic programming algorithm called Sparse Value Iteration [43], which exploits the properties of sparse transition matrices. We modify the GPU implementation of [43] (code available at [44]) to integrate it seamlessly with our algorithm for phase 1. The modifications include the use of different data structures, such as the sparse COO format for storing matrices, and changes to the functions responsible for reading input data and to multiple variable names.
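As an illustration of the core operation, the following hedged sketch shows one sparse Bellman backup over the COO triplets, assuming the Q-value array is pre-initialized with the expected one-step rewards; the full solver of [43] adds the maximization over actions and the sweep over the horizon.

```cuda
// Hedged sketch of one sparse Bellman backup in COO form, in the spirit of
// [43]: Q is pre-initialized to the expected one-step rewards R(s, a), and
// each thread accumulates P(s'|s,a) * V_k(s') for one nonzero element. A
// subsequent max-reduction over actions (not shown) yields V_{k+1} and the
// greedy policy of Eqs. (6)-(7).
__global__ void sparseBackup(const int* cooRow,   // (state, action) pair index
                             const int* cooCol,   // successor state index
                             const float* cooVal, // transition probability
                             const float* V,      // current value function V_k
                             float* Q,            // Q-values, one per (s, a)
                             int nnz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nnz) return;
    atomicAdd(&Q[cooRow[i]], cooVal[i] * V[cooCol[i]]);
}
```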
Overall, our novel model-building algorithm and the sparse value iteration algorithm together make up an end-to-end GPU software package that takes the stochastic environment's data as input and provides an optimal policy for the MDP path planning problem as output, orders of magnitude faster than reported for conventional CPU methods. In the following section, we demonstrate applications using our end-to-end software.
IV Applications
We demonstrate three different single-objective missions (Sec. II) in a stochastic double-gyre ocean flow simulation. This flow field is obtained by solving the dynamically orthogonal quasi-geostrophic equations [45] and has been extensively used for method development and testing in several path planning [55] and uncertainty quantification [16] applications. The equations and a description of the physical processes associated with the flow field are in [54]. Briefly, the flow field corresponds to that in a square basin of size 1000 km x 1000 km, where a wind from west to east sets up a double-gyre flow, and the uncertainty arises from its initial conditions. The discrete spatial domain is a square grid with uniform non-dimensional spacing, and we use 200 discrete time intervals in non-dimensional units, thereby making the grid spatiotemporal. The non-dimensionalization is done such that one non-dimensional unit of space is 5 km and one unit of time is 0.072 days. The environment is common across all three applications. Fig. 3 shows the mean (row A), the first four DO modes (row B), and the joint PDF of the first four DO coefficients (row C) at four discrete times (in four columns), highlighting the dynamic and non-Gaussian nature of the stochastic flow field. Fig. 4 shows the complete environment for a particular realization of the background velocity field (Fig. 4A) and the scalar energy field (Fig. 4B). The stochastic scalar field is assumed to be that of solar radiation incident upon the ocean surface, mimicking a cloud cover moving westwards. This field plays a role only for the net-energy consumption minimization goal. The dynamic obstacles are assumed to be restricted regions in the domain that change deterministically with time. For the present applications, the dynamic obstacle is square-shaped with a side of 20 grid units. The obstacles enter the domain from the west and move eastward with time. It is also assumed that these restricted regions do not interact with or affect the flow in any way. In case the movement of obstacles is uncertain and/or the scales of the flow are such that the obstacles affect the flow, these assumptions can be relaxed and the environmental flow fields solved with the obstacles, as done in [54]. We consider that the agent can take one of 32 discrete actions (combinations of the $n_h$ headings and $n_F$ speeds), and the policy is computed using $n_r$ realizations of the velocity field. The size of the neighbouring subgrid, $N_{sub}$, is kept constant across states as described in Sec. III-B. In dimensional units, the maximum magnitude of the flow field is around 2 m/s and the speed of the agent is 80 cm/s. The planned paths are valid for any AUV that operates at the specified speed; appropriate operational parameters can be chosen to fix this AUV speed [47, 15, 31]. As such, the flow is relatively strong, and a typical autonomous marine agent modeled here gets strongly advected by the currents, making it difficult for the agent to move against the flow in most parts of the domain. For the test cases demonstrated here, the agent starts at the position represented by a circular marker, and the target is the position represented by a star-shaped marker (Figs. 5-7).
IV-A Time-Optimal Planning with Dynamic Obstacles
First, we consider a mission with the objective of minimizing the agent's expected travel time from the start to the target in the environment described above. We use the one-step reward defined in Eq. (3) for time-optimal planning. Our novel end-to-end GPU-implemented algorithm is used to compute the time-optimal policy. This policy is then applied to the ensemble of flow fields, and the computed trajectories are shown in Fig. 5. The four discrete time instances shown illustrate how the different trajectories (each corresponding to a different background flow field) evolve with time. To minimize the travel time, the agent takes the high-speed action throughout the mission while intelligently utilizing the double-gyre flow. The distribution of trajectories splits into two bands, successfully avoiding the dynamic obstacles, as seen at t = 48 and 60 (Fig. 5). The computed expected travel time is 61.25 non-dimensional units (4.41 days). The total computational time required for building the MDP model and solving it via the value iteration algorithm is of the order of a few minutes. In contrast, the MDP-based CPU implementation reported in [29], for a smaller spatiotemporal grid and without a realistic uncertain environmental flow forecast, required around 13 hours. Further, the overall paths are similar to the time-optimal trajectories computed from the level set equations in [54]. However, the key difference is that using the MDP approach we get an optimal policy, and not simply the distribution of optimal waypoint or heading sequences obtained by solving the stochastic level set equations.
IV-B Minimal Energy Expenditure with Time Constraints
Second, we consider a mission with the objective of minimizing the agent's expected energy consumption for travelling from the start to the target in the environment described above. An additional constraint is that the agent must complete its mission within 200 timesteps, which is imposed by limiting the state space to 200 temporal units. Further, the energy consumption model for the agent is assumed to depend only on its speed (Sec. II-A), and the one-step reward defined in Eq. (4) is used here. Since we are only optimizing for energy consumption and not energy collection, the scalar field's presence does not affect the optimal policy here. The evolution of the trajectories (at four discrete instances) obtained by following the optimal policy computed by our end-to-end GPU-implemented algorithm is shown in Fig. 6. In contrast with the behaviour observed in Sec. IV-A, the agent here typically takes the slow-speed action throughout the mission to reduce its energy consumption. Due to its low-speed actions, a relatively strong background flow, and the presence of two dynamic obstacles, the distribution of trajectories splits into three bands. Two of these bands go around the dynamic obstacle, avoiding it from either side, similar to what we observed in Fig. 5. However, the third band (coloured yellow) has to swerve around, making a loop before moving towards the target, since otherwise it would possibly bump into the dynamic obstacle. The agent's expected energy consumption is 108.7 non-dimensional units. Our framework is flexible enough to allow different speeds for the agent based on its capabilities; to do so, we simply change the number of discrete speeds considered.
IV-C Net-Energy-Optimal Planning with Dynamic Intake
Third, we consider a mission with the objective of minimizing the agent's expected net energy consumption for travelling from the start to the target in the environment described above. The net energy consumption is the difference between the energy consumed by the agent's motors for manoeuvring and the energy collected from solar radiation through its solar panels. In this case, the scalar radiation field plays a significant role in the optimization problem, and the one-step reward defined in Eq. (5) is used.
Fig. 7 shows the trajectories obtained by following the optimal policy for net energy consumption in different realizations of the flow field. The computed optimal policy induces a tendency in the agent to manoeuvre in the yellow sunny region and collect solar energy for as long as possible, while ensuring that it completes the mission within the given time constraint (Fig. 7). The vehicle starts by moving to the extreme left of the domain to collect solar energy in the yellow sunny region for all realizations of the background environment. The distribution of trajectories initially splits into two bands, as seen at t = 48 (Fig. 7). One band (purple colors) continues to stay in the extreme left of the domain. The other band (yellow colors) moves back towards the right and then loops around to collect as much energy as possible in the sunny region. Later in the mission, the diminishing sunny region and the relatively strong background flow make it infeasible for the vehicle to keep collecting solar energy on the left side of the domain. After that, the vehicle starts moving towards the target location. On the way, the vehicle in the realizations corresponding to the purple band encounters the dynamic obstacle. In response, this band further splits into two, moving past above and below it. Finally, all trajectories move towards the target location and converge there by the end of the mission. The average net energy consumption is -190 non-dimensional units, indicating that more energy is harvested than expended for the particular non-dimensional constants used here.
Crucially, all these computations were performed on a modest GPU in a fraction of the time required by conventional MDP model builders and solvers using a CPU. A detailed computational runtime study is presented next.
IV-D Run-Time Analysis with Different GPUs
The above applications were run for various problem sizes on three different GPUs, namely an RTX 2080, an RTX 8000 and a V100. The execution times were compared with a sequential version of the algorithm run on an Intel Core i7-8700 3.20 GHz CPU with 32 GB RAM. The GPU runs were individually optimized by launching kernels with the optimal number of threads per block, found to be 64, 64 and 32 for the RTX 2080, RTX 8000 and V100, respectively. Our model-building algorithm's execution times and speedups were recorded for varying numbers of states, actions, and realizations of the stochastic velocity field.