This work focuses on improving the efficiency of sampling-based motion planning for vehicular systems that exhibit significant dynamics using machine learning. Planning motions for such systems can be challenging when there is no access to a local planner and the basic primitive to explore the state space is forward propagation of controls. In this context, tree sampling-based kinodynamic planners have been developed[lavalle2001randomized], some of which achieve asymptotic optimality (AO) by propagating randomly sampled controls during each iteration [li2016asymptotically, hauser2016asymptotically, LB-DIRT, kleinbort2020refined]. While random controls are desirable for analysis purposes, in practice they result in low quality (i.e., high cost) trajectories.
Learning candidate controls for edge propagation from previous planning experience is a promising avenue for improving the practical efficiency of sampling-based kinodynamic planners. Prior work on the integration of machine learning for planning has primarily focused on geometric motion planning problems that don’t deal with dynamics [Qureshi2021MotionPN, jurgenson2019harnessing]
, or requires access to a two-point boundary value problem (BVP) solver to construct a supervised learning dataset[Li2018NeuralNA]. The latter may not be easily available for some vehicular models, especially for higher-fidelity physics-based models.
Recent work has shown that deep reinforcement learning (DRL) can be used to learn a local controller (or a policy) that steers the robot between two collision-free states. The learned policy can be then used as an edge connection strategy for a sampling-based motion planner, such as a Probabilistic Roadmap (PRM) [Faust2018PRMRLLR] or a Rapidly-exploring Random Tree (RRT) [chiang2019rl]. This approach, however, requires a large amount of training data and manual specification of reward parameters.
A representation of the planning environment, either in the form of binary occupancy grids, or cost maps that represent the cost of navigating across the environment, has been used as input to neural network models, which can then bias the planner towards shortest paths[ichter2018learnedsampling, Kumar2019LEGOLE, wang2020neural, ichter2020criticalroadmaps]. Although it is possible to use these learned sampling strategies as input to a learned local planner, some of these samples may correspond to dynamically infeasible states or inevitable collision states (ICS). Recent work has proposed learning a sampling strategy for local goals that can be reached while obeying kinodynamic constraints and avoiding collisions [johnson2020mpnet], but assumes a steering function that can move the robot from one state to another.
This work proposes a learning process for a controller and integrates it with a framework for sampling-based kinodynamic planning to improve performance. The controller is trained offline using DRL in an empty environment to generate high quality controls towards desirable local goals. During planning, a node expansion procedure generates local goals biasing towards the global goal that are passed to the learned controller, which outputs the best control. The benefit of this two-step approach is that the learning process has to only deal with the system’s dynamics and needs to be trained once per system. In this way it does not require a separate training process for each new environment that a robot is being deployed. The local goal selection and the planning framework are responsible for dealing with different problem instances and selecting a direction of progress towards the global goal. For the first expansion out of a node of the sampling-based tree, an informed approach is used where the local goal is selected so as to bias expansion towards the ultimate planning goal. Consecutive expansions follow a more exploratory strategy and select randomly local goals in the vicinity of the tree node.
The evaluation uses a first-order and second-order differential drive system as well as a physically simulated Segway robot across different environments. The testing environments are different from the obstacle-free environment used for training the controller. It shows that the proposed framework results in very low cost solutions that are achieved faster relative to traditional random propagation approaches.
Ii Problem Setup and Notation
Consider a system with state space - divided into collision-free () and obstacle () subsets - and control space , governed by the dynamics (where ). The process
can be an analytical ordinary differential equation (ODE) or modeled via a physics engine like Bullet[coumans2019]. A plan is defined by a sequence of piecewise-constant controls executed in order. A plan of length induces a trajectory , where . For a start state and a goal set , a feasible motion planning problem admits a solution trajectory of the form . Each solution trajectory has a cost according to a function , which the motion planner must minimize. In this setup, is defined as the set of all states that reach a specific state with some predefined tolerance.
Iii Sampling-Based Motion Planning Framework
We focus on sampling-based motion planners that only have access to a forward propagation model of the system’s dynamics. This class of algorithms explore by incrementally constructing a tree by sampling . Algorithm 1 outlines the high-level operation of a sampling-based motion planner (Tree-SBMP) that builds a tree of states reachable from .
Each iteration of Tree-SBMP starts with the selection of an existing tree node to expand that corresponds to a robot state (Line 3). A candidate control is considered for the selected node, and an end state is obtained by forward propagation of the system dynamics (Lines 4-5). The resulting edge is added to the tree if it is not in collision (Lines 6-7). By varying how these operations are implemented, different algorithms can be obtained. After iterations (or a preset time limit), if the tree has states in , the best found solution trajectory according to can be returned.
This work presents a version of EXPAND-NODE (Fig. 2) that selects a local goal in the environment and uses a learned controller to predict a control that makes progress to the local goal. The proposed node expansion framework is integrated into the DIRT algorithm [LB-DIRT], which also achieves AO properties. DIRT is selected due to its informed implementation of SELECT-NODE that biases selection towards high quality nodes close to the unexplored parts of the state space.
DIRT also evaluates multiple candidate controls as part of EXPAND-NODE. This Blossom expansion
procedure ranks the resulting trajectories and processes them in order so as to prioritize expansions that bring the system closer to the goal according to a heuristic cost-to-go.
Iv Proposed Method
The proposed pipeline has two stages. Offline: For each system, a controller is trained using DRL to reach a goal state in an obstacle-free environment. For every planning environment, a representation in the form of an obstacle map or of a cost map is assumed available. Online:
Upon planner initialization, a vector field is constructed for the given planning problem to bias the planner. At every iteration, a local goal state is generated as input to the learned controller. The controller then outputs a candidate control to be propagated towards the local goal state. The local goal is generated either in an exploitative manner using the goal-biasing vector field or in an exploratory manner by using random sampling.
Iv-a Learning the controller
Given the current state of the system , the aim is to train a controller to reach the goal set in the absence of obstacles. Let be the distribution of trajectories induced by the controller , and the initial state sampled from an initial state distribution . The controller can be optimized using a DRL algorithm that maximizes the expected return
where is a discount factor and is a bounded reward function defined as if and otherwise. Alternative DRL algorithms can be used to maximize . This work employs Hindsight Experience Replay (HER) [Andrychowicz2017HindsightER] to facilitate learning from sparse rewards.
As the focus is mainly on vehicular systems, is defined relative to the robot’s current state. For the state variables of the system that are transformation-invariant (e.g., current position and steering angle), the initial state of the system is fixed at the origin. The values of the initial state variables that are not transformation-invariant (e.g., velocities of the wheels) are sampled uniformly within bounds. This re-parameterization reduces data requirements while learning the controller.
Comparison alternative: A Supervised Learned Controller (SLC), is also evaluated. It learns the inverse of a discretized set of forward dynamics. SLC acts as a locally optimal surrogate for the system’s dynamics. The control-duration space is discretized to form a discrete set of plans and all plans in are forward propagated from a fixed initial state. The resulting end states form a reachability region . A dataset is obtained by using a nearest neighbor data structure to store the best plan that reaches a neighborhood in for every initial state. A neural network is then trained to minimize the mean squared error (MSE) between the plan in the dataset and the plan output by SLC given , which is defined as the difference in states between an initial state and a local goal state .
Iv-B Generating informed local goal states among obstacles
A two-dim. workspace can be divided into two disjoint subsets: the collision-free subset and the obstacle subset containing all obstacles. The boundary of the environment is also considered as an obstacle. Define points as projections of states from to . Then, let be the projection in of each tree state/node selected for propagation. Then, the objective is to generate a local goal point so as to perform the propagation with the learned controller. Given the global goal , is generated by a local goal function so that: i) where is a cost-to-go function given ; and ii) maximizes clearance from obstacles.
The local goal function is implemented using a medial axis . Given an obstacle surface (e.g., an edge of a polygonal obstacle in 2D) there exists a subset of all free points, such that is the closest obstacle surface. The medial axis is the intersection of all such subsets: . To compute the medial axis, the environment is discretized. The computation is performed once for every planning environment (Fig 4).
For a given , 3 vectors are computed (Fig 5 left). First, a repulsive vector that points away from the closest obstacle towards the nearest point on . An attractive vector that points to such that: (a) has line of sight (with clearance) to , and (b) is minimum, i.e., makes the most progress to the goal among all visible points from . Both vectors are used to compute , where . The magnitude of is set so it reaches when applied at . is the final output of . The normalized version of is shown in Fig 5 right. It is computed once offline given a discretization of the environment.
Iv-C Generating informed local goal states given a costmap
To use the learned controller in the traversability cost environment, the environment is discretized on a grid in . For every the traversability cost is defined, where is traversable (Fig. 6, left). The higher the cost, the longer it takes for the robot to traverse. Equation (1) computes the path cost of a trajectory of length as:
Offline, once per planning problem, a modified version of the any-angle path planning algorithm [thetastar] is computed over the grid to create a vector field (Fig. 6, right). Instead of checking for an obstacle-free line of sight when smoothing the solution path, the modified version checks for a line of decreasing cost of a minimum length. For a point , the vector points to the next node along the lowest cost path to the goal according to . Online, the end point of is returned as the output of
The systems considered in our evaluation are an analytically simulated first-order () and second-order differential drive vehicles () and a Bullet-simulated Segway robot ().
V-a Controller training evaluation
The following algorithms were tested to learn : Deep Q-Networks (DQN) [mnih2015human], Twin-Delayed Deep Deterministic Policy Gradient (TD3) [fujimoto2018addressing] and Soft Actor-Critic (SAC) [Haarnoja2018SoftAO]. For DQN, the discretized action space consists of controls. For each dimension of the control space, the control can take either the minimum, mid or maximum value. The training pipeline was implemented on a Google Colab CPU instance. The DRL algorithm implementations were derived from Gym [openaigym] and Stable Baselines 3 [stable-baselines3].
Each controller is trained for 200 thousand forward propagation steps, and their performance is evaluated using the average time to reach the goal (trajectory cost). Fig. 7 shows the results for the analytical systems. The costs are averaged over 30 trials, where each trial corresponds to 100 randomly sampled goals. DQN does not converge to a controller that finds low cost goal-reaching trajectories due to the restrictive nature of the control space. Although SAC and TD3 exhibit nearly similar performance after parameter tuning, the best-performing SAC controller is selected for the proposed expansion framework since its performance is closest to the baseline. Subsequently, SAC+HER is also used to train the controller for the Segway system.
V-B Planner integration evaluation
The integrated planning and learning solution is evaluated with three different expansion functions using a Blossom expansion of five controls each. (a) Random uses a blossom expansion of random controls in . (b) SLC uses the supervised learning process with a local goal provided by as input for the first control; for subsequent controls, a random local goal in is provided as input to the supervised learned controller. (c) RLC uses SAC with local goal provided by as input for the first control; for subsequent controls, a random local goal in is provided as input to SAC. For the Segway system, since each forward propagation is expensive, a Blossom number of 1 is used.
Across the experiments, the following metrics are measured for every planner: (1) Average normalized cost of solutions found over time/number of planner iterations, and (2) Ratio of experiments for which solution was found over time/number of planner iterations. To account for the difficulty of different planning problems, path costs are normalized by dividing by the best path cost found for a problem across any planner. For the non-cost map problems, the path cost is defined to be the duration of the solution plan. For the cost map problems, the path cost is computed according to Equation 1 with .
For the first and second-order systems, the planner performance is evaluated on three maps from the city benchmarks [sturtevant2012benchmarks]. Planner performance with the first-order system is also evaluated on three cost maps from [jaillet2008transition]. For each map, ten different problem instances are generated by sampling collision-free start and goal states that have a minimum separation of pixels. For the Segway robot, three environments are defined and planner performance is evaluated across all. Example solutions found by RLC are given in Fig 1. For the city environments, the search trees are visualized.
In the city environments, RLC finds the lowest cost solutions very quickly for both first (Fig 7(a)) and second-order (Fig 7(b)) systems. For the first-order system, RLC finds solutions in fewer iterations than Random or SLC. For the second-order system, SLC finds solutions in fewer iterations than RLC, but RLC finds better quality solutions. In the cost map environments (Fig 7(c)), RLC finds the lowest cost solutions quickly. For the same time budget, RLC and Random find solutions to more problems compared to SLC, while RLC does so in fewer iterations than Random. For the Segway (Fig 7(d)), RLC finds lower cost solutions than both Random and SLC, and also finds more solutions in fewer iterations / lesser time. Across experiments, both RLC and SLC achieve a high success rate in fewer iterations than Random. The solutions obtained by RLC, however, are of significantly lower cost than SLC and Random.
V-C Ablation Studies
These experiments evaluate various choices for the planning process in the case of the second-order vehicle: (a) should the first propagation be informed given medial axis information, (b) should consecutive propagations use random controls (Random), or use a random local goal and supervised learning (SLC) or use a random local goal and SAC (Rand Goal), and (c) what is a good blossom number. The planner using Medial Axis Goal finds lower cost solutions in fewer iterations than the one using Random Goal (Fig. 8(a)). The planner using SLC finds solutions in fewer iterations than using Rand Goal but RandGoal finds higher quality solutions overall (Fig. 8(b)). RandGoal also remains competitive in terms of the number of solutions found compared to Random. A smaller Blossom number finds solutions faster (Fig 8(c)). The quality of solutions found, however, improves considerably when a higher Blossom number is used. The best-performing choices across these experiments are used for RLC evaluated in Section V-B.
This paper proposes a learning-based node expansion method that improves the path quality and computational efficiency of sampling-based kinodynamic planners for vehicular systems. It consists of a learned controller that can be trained with small data requirements for different robotic systems, and a selection process of local goal states that is computed once for every planning problem.
The inference time of the learned controller can further be improved by storing it upon training to an integrated circuit, such as an FPGA. Furthermore, the local goal state selection can also be learned in conjunction with the controller. Similarly, the proposed method can also be integrated with other learning primitives for planning, such as a learned selection procedure. Next steps include simulations with more complex systems and experiments on real vehicular systems.