Motion planning is among the core research problems in robotics and artificial intelligence. It aims to find a collision-free, low-cost path connecting a start and goal configuration for an agent . An ideal motion planning algorithm for solving real-world problems should offer following key features: i) completeness and optimality guarantees — implying that a solution will be found if one exists and that the solution will be globally optimal, ii) computational efficiency — finding a solution in either real-time or in sub-second times or better while being memory efficient, and iii) insensitivity to environment complexity — the algorithm is effective and efficient in finding solutions regardless of the constraints of the environment. Decades of research have produced many significant milestones for motion planning include resolution-complete planners such artificial potential fields , sample-based motion planners such as Rapidly-Exploring Random Trees (RRT) , heuristically biased solvers such as RRT*  and lazy search . Many of these developments are highlighted in the related work section (Section II). However, each planner and their variants have tradeoffs amongst the ideal features of motion planners. Thus, no single motion planner has emerged above all others to solve a broad range of problems.
A recent research wave has led to the cross-fertilization of motion planning and machine learning to solve planning problems. Motion planning and machine learning for control are both well established and active research areas where huge progress has been made. In our initial work in motion planning with neural networks , we highlighted that merging both fields holds a great potential to build motion planning methods with all key features of an ideal planner ranging from theoretical guarantees to computational efficiency. In this paper, we formally present Motion Planning Networks, or MPNet, its features corresponding to an ideal planner and its merits in solving complex robotic motion planning problems.
MPNet is a deep neural network-based bidirectional iterative planning algorithm that comprises two modules, an encoder network and planning network. The encoder network takes point-cloud data of the environment, such as the raw output of a depth camera or LIDAR, and embeds them into a latent space. The planning network takes the environment encoding, robot’s current and goal state, and outputs a next state of the robot that would lead it closer to the goal region. MPNet can very effectively generate steps from start to goal that are likely to be part of the optimal solution with minimal-to-no branching required. Being a neural network approach, we also propose three learning strategies to train MPNet: i) offline batch learning which assumes the availability of all training data, ii) continual learning with episodic memory which assumes that the expert demonstrations come in streams and the global training data distribution is unknown, and iii) active continual learning that incorporates MPNet into the learning process and asks for expert demonstrations only when needed. The following are the major contributions of MPNet:
MPNet can learn from streaming data which is crucial for real-world scenarios in which the expert demonstrations usually come in streams, such as in semi self-driving cars. However, as MPNet uses deep neural networks, it can suffer from catastrophic forgetting when given the data in streams. To retain MPNet prior knowledge, we use a continual learning approach based on episodic memory and constraint optimization.
The active continual learning approach asks for demonstrations only when needed, hence improving the overall training data efficiency. This strategy is in response to practical and data-efficient learning where planning problems come in streams, and MPNet attempts to plan a motion for them. In case MPNet fails to find a path for a given problem, only then an Oracle Planner is called to provide an expert demonstration for learning.
MPNet plans paths with a constant complexity irrespective of obstacle geometry and exhibits a mean computation time of less than 1 second in all presented experiments.
MPNet generates informed samples, thanks to its stochastic planning network, for sampling-based motion planners such as RRT* without incurring any additional computational load. The MPNet informed sampling based RRT* exhibits mean computation time of less than a second while ensuring asymptotic optimality and completeness guarantees.
A hybrid planning approach combines MPNet with classical planners to provide worst-case guarantees of our approach. MPNet plans motion through divide-and-conquer since it first outputs a set of critical states and recursively finds paths between them. Therefore, it is straightforward to outsource a segment of a planning problem to a classical planner, if needed, while retaining the computational benefits of MPNet.
MPNet generalizes to similar but unseen environments that were not in the training examples.
We organize the remaining paper as follows. Section II provides a thorough literature review of classical and learning-based planning methods. Section III describes notations required to outline MPNet, and our propositions highlighting the key features of our method. Section IV presents approaches to train the neural models whereas Section V outlines our novel neural-network-based iterative planning algorithms. Section VI presents results followed by Section VII that provides discussion and theoretical analysis concerning worst-cases guarantees for MPNet. Section VIII concludes the paper with consideration of future avenues of development. Finally, an Appendix is dedicated to implementation details to ensure the smooth reproducibility of our results.
Ii Related Work
The quest for solving the motion planning problem originated with the development of complete algorithms which suffer from computational inefficiency, and thus led to more efficient methods with resolution- and probabilistic- completeness . The algorithms with full completeness guarantees   find a path solution, if one exists, in a finite-time. However, these methods are computationally intractable for most practical applications as they require the complete environment information such as obstacle geometry which is not usually available in real-world scenarios . The algorithms with resolution-completeness   also find a path, if one exists, in a finite-time, but require tedious fine-tuning of resolution parameters for every given planning problem. To overcome the limitations of complete and resolution-complete algorithms, the probabilistically complete methods, also known as Sampling-based Motion Planners (SMPs), were introduced 
. These methods rely on sampling techniques to generate rapidly-exploring trees or roadmaps in robot’s obstacle-free state-space. The feasible path is determined by querying the generated graph through shortest path-finding methods such as Dijkstra’s method. These methods are probabilistically complete, since the probability of finding a path solution, if one exists, approaches one as the number of samples in the graph approaches infinity.
The prominent and widely used SMPs include single-query rapidly-exploring random trees (RRT)  and multi-query probabilistic roadmaps (PRM) . In practice, single-query methods are preferred since most multi-query problems can be solved as a series of single-query problems  . Besides, PRM based methods require pre-computation of a roadmap which is usually expensive to determine in online planning problems . Therefore, RRT and its variants have now emerged as a promising tools for motion planning that finds a path irrespective of obstacles geometry. Although the RRT algorithm rapidly finds a path solution, if one exists, they fail to find the shortest path solution . An optimal variant of RRT called RRT* asymptotically guarantees to find the shortest path, if one exists . However, RRT* becomes computationally inefficient as the dimensionality of the planning problem increases. Furthermore, studies show that to determine a -near optimal path in dimensions, nearly samples are required. Thus, RRT* is no better than grid search methods in higher dimensional spaces . Several methods have been proposed to mitigate limitations in current asymptotically optimal SMPs through different heuristics such as biased sampling    , lazy edge evaluation , and bidirectional tree generations  .
Biased sampling heuristics adaptively sample the robot state-space to overcome limitations caused by random uniform exploration in underlying SMP methods. For instance, P-RRT*   incorporates artificial potential fields  into RRT* to generate goal directed trees for rapid convergence to an optimal path solution. In similar vein, Gammell et al. proposed the Informed-RRT*  and BIT* (Batch Informed Trees) 
. Informed-RRT* defines an ellipsoidal region using RRT*’s initial path solution to adaptively sample the configuration space for optimal path planning. Despite improvements in computation time, Informed-RRT* suffers in situations where finding an initial path is itself challenging. On the other hand, BIT* is an incremental graph search method that instantiates a dynamically-changing ellipsoidal region for batch sampling to compute paths. Despite some improvements in computation speed, these biased sampling heuristics still suffer from the curse of dimensionality.
Lazy edge evaluation methods, on the other hand, have shown to exhibit significant improvements in computation speeds by evaluating edges only along the potential path solutions. However, these methods are critically dependent on the underlying edge selector and tend to exhibit limited performance in cluttered environments . Bidirectional path generation improves the algorithm performance in narrow passages but still inherits the limitations of baseline SMPs .
. RL considers Markov Decision Processes (MDPs) where an agent interacts with the environment by observing a state and taking an action which leads to a next state and reward. The reward signal encapsulates the underlying task and provides feedback to the agent on how well it is performing. Therefore, the objective of an agent is to learn a policy to maximize the overall expected reward function. In the past, RL was only employed for simple problems with low-dimensions . Recent advancements have led to solving harder, higher dimensional problems by merging the traditional RL with expressive function approximators such neural networks, now known as Deep RL (DRL)  . DRL has solved various challenging robotic tasks using both model-based    and model-free    approaches. Despite considerable progress, solving practical problems which have weak rewards and long-horizons remain challenging .
There also exist approaches that apply various learning strategies such as imitation learning to mitigate the limitations of motion planning methods. Zucker et al. proposed to adapt the sampling for the SMPs using REINFORCE algorithm  in discretized workspaces. Berenson et al.  use a learned heuristic to store new paths, if needed, or to recall and repair the existing paths. Coleman et al.  store experiences in a graph rather than individual trajectories. Ye and Alterovitz  use human demonstrations in conjunction with SMPs for path planning. While improved performance was noted compared to traditional planning algorithms, these methods lack generalizability and require tedious hand-engineering for every new environment. Therefore, modern techniques use efficient function approximators such as neural networks to either embed a motion planner or to learning auxiliary functions for SMPs such as sampling heuristic to speed up planning in complex cluttered environments.
Neural Motion Planning has been a recent addition to the motion planning literature. Bency et al. 
introduced recurrent neural networks to embed a planner based on its demonstrations. While useful for learning to navigate static environments, their method does not use environment information and therefore, is not meant to generalize to other environments. Ichter et al.
proposed conditional variational autoencoders that contextualize on environment information to generate samples through decoder network for the SMPs such as FMT*. The SMPs use the generated samples to create a graph for finding a feasible or optimal path for the robot to follow. In a similar vein, Zhang et. al  learns a rejection sampling policy that rejects or accepts the given uniform samples before making them the part of the SMP graph. The rejection sampling policy is learned using past experiences from the similar environments. Note that a policy for rejection sampling implicitly learns the sampling distribution whereas Icheter et. al  explicitly generates the guided samples for the SMPs. Bhardwaj et al  proposed a method called SAIL that learns a deterministic policy which guides the graph expansion of underlying graph-based planner towards the promising areas that potentially contains the path solution. SAIL learns the guiding policy using the oracle Q-values (encapsulates the cost-to-go function), argmin regression, and full environment information. Like these adaptive sampling methods, MPNet can also generate informed samples for SMPs, but in addition, our approach is also outputs feasible trajectories with worst-case theoretical guarantees.
There has also been attempts towards building learning-based motion planners. For instance, Value Iteration Networks (VIN) 
approximates a planner by emulating value iteration using recurrent convolutional neural networks and max-pooling. However, VIN is only applicable for discrete planning tasks. Universal Planning Networks (UPN) extends VIN to continuous control problems using gradient descent over generated actions to find a motion plan connecting the given start and goal observations. However, these methods do not generalize to novel environments or tasks and thus require frequent retraining. The most relevant approach to our neural planner (MPNet) is L2RRT  that plans motion in learned latent spaces using RRT method. L2RRT learns state-space encoding model, agent’s dynamics model, and collision checking model. However, it is unclear that existence of a path solution in configuration space will always imply the existence of a path in the learned latent space and vice versa.
for a given loss function. Fig (b) shows the online execution of MPNet’s neural models. Enet takes the environment information as a point cloud and embed them into latent space . Pnet takes the encoding , robot current and goal states, and incrementally produces a collision-free trajectory connecting the given start and goal pair.
Iii MPNet: A Neural Motion Planner
MPNet is a learning-based neural motion planner that uses expert demonstrations to learn motion planning. MPNet comprises of two neural networks: an encoder network (Enet) and a planning network (Pnet). Enet takes the obstacle information as a raw point-cloud and encodes them into a latent space. Pnet takes the encoding of the environment, the robot’s current state and goal state to output samples for either a path or tree generation. In remaining section, we describe the notations necessary to outline MPNet and its key features.
Let robot configuration space (C-space) be denoted as comprising of obstacle space and obstacle-free space , where is the C-space dimensionality. Let robot’s surrounding environment, also known as workspace, be denoted as , where is a workspace dimension. Like C-space, the workspace also comprise of obstacle, , and obstacle-free,
, regions. The workspaces could be up to 3-dimensions whereas the C-space can have higher dimensions depending on the robot’s degree-of-freedom (DOF). Let robot initial and goal configuration space beand , respectively. Let be an ordered list of length . We assume corresponds to the -th state in , where . For instance corresponds to state . Furthermore, we consider corresponds to the last element of , i.e., . A motion planning problem can be concisely denoted as where the aim of a planning algorithm is to find a feasible path solution, if one exists, that connects to while completely avoiding obstacles in . Therefore, a feasible path can be represented as an ordered list such that , , and a path constituted by connecting consecutive states in lies entirely in . In practice, C-space representation of obstacles is unknown and rather a collision-checker is available that takes workspace information, , and robot configuration, , and determines if they are in collision or not.
In MPNet, we go a step further toward a practical scenario, where MPNet plans motion using raw point-cloud data of obstacles . However, like other planning algorithms, we do assume an availability of a collision-checker that verifies the feasibility of MPNet generated paths based on . Precisely, Enet, with parameters , takes the raw point-cloud data and compresses them into a latent space ,
Pnet, with parameters , takes the environment encoding , robot’s current or initial configuration , and goal configuration to produce a trajectory through incremental generation of states (Fig. 2 (b)),
Therefore, we propose that MPNet performs the feasible path planning,
Proposition 1 (Feasible Path Planning) Given a planning problem , and a collision-checker, MPNet finds a path solution , if one exists, such that , , and .
Another important problem in motion planning is to find an optimal path solution for a given cost function. Let a cost function be defined as . An optimal planner provides guarantees, either weak or strong, that if given enough running time, it would eventually converge to an optimal path, if one exists. The optimal path is a solution that has the lowest possible cost w.r.t . Optimal SMPs such as RRT* exhibits asymptotic optimality, i.e., a probability of finding an optimal path, if one exists, approaches one as the number of random samples in the tree approaches infinity. Our MPNet can generate adaptive samples for SMPs, such as RRT*. We propose that MPNet queried for informed sampling combined with an asymptotic optimal SMP eventually finds a lowest-cost path solution w.r.t , if one exists,
Proposition 2 (Optimal Path Planning) Given a planning problem , a collision-checker, and a cost function , MPNet can adaptively sample the configuration space for an asymptotic optimal SMP such that the probability of finding an optimal path solution, if one exists, w.r.t approaches one as the number of samples in the graph approaches infinity.
A fundamental aim of learning-based planning methods is to adapt to new planning problems while retaining the past learning experiences. Let a mean model success rate on a dataset after learning from an expert example at time and be denoted as and , respectively. A metric described as a backward transfer can be written as . A positive backward transfer indicates that after learning from a new experience, the learning-based planner performed better on the previous tasks represented as . A negative backward transfer indicates that on learning from new experience the model on previous tasks deteriorated. In proposition 3, we highlight that MPNet can learn from a streaming data and asks for demonstration only when needed through active continual learning while minimizing negative backward transfers,
Proposition 3 (Active Continual Learning) Let and be the mean success rates of MPNet over the planning-problems after observing a task at time step and , respectively. Let a backward transfer . Given a new planning problem , at time , MPNet attempts to find a feasible path. If MPNet fails, it asks for an expert demonstration and learns from it in a way that minimizes negative and maximizes positive backward transfer .
Iv MPNet: Training
In this section, we present three training methodologies for MPNet neural models, Enet and Pnet: i) offline batch learning, ii) continual learning, and iii) active continual learning.
Offline batch learning method assumes the availability of complete data to train MPNet offline before running it online to plan motions for unknown/new planning problems. Continual learning enables MPNet to learn from streaming data without forgetting past experiences. Active continual learning incorporates MPNet into the continual learning process where MPNet actively asks for an expert demonstration when needed for the given problem. Further details on training approaches are presented as follow.
Iv-a Offline batch learning
The offline batch learning requires all the training data to be available in advance for MPNet  . As mentioned earlier, MPNet comprises of two modules, Enet and Pnet. In this training approach, both modules can either be trained together in an end-to-end fashion using planning network loss function or separately with their individual loss functions.
To train Enet and Pnet together, we back-propagate the gradient of Pnet’s loss function in Equation 3
through both modules. For the standalone training of Enet, we use an encoder-decoder architecture whenever there is an ample amount of environmental point-cloud data available for unsupervised learning. There exist several techniques for encoder-decoder training such as variational auto-encoders and their variants 
or the class of contractive autoencoders (CAE). For our purpose, we observed that the contractive autoencoders learn robust feature spaces desired for planning and control, and give better performance than other available encoding techniques. The CAE uses the usual reconstruction loss and a regularization over encoder parameters,
where , are the encoder and decoder parameters, respectively, and denotes regularization coefficient. The variable represents reconstructed point-cloud. The training dataset of obstacles’ point-cloud is denoted as which contains point cloud from different workspaces.
The planning module (Pnet) is a stochastic feed-forward deep neural network with parameters . Our demonstration trajectory is a list of waypoints, , connecting the start and goal configurations such that the fully connected path lies in obstacle-free space. To train Pnet either end-to-end with Enet or separately for the given expert trajectory we consider one-step look ahead prediction strategy. Therefore, MPNet takes the obstacles’ point-cloud embedding , robot’s current state and goal state as an input to output the next waypoint towards the goal-region. The training objective is a mean-squared-error (MSE) loss between the predicted and target waypoints , i.e.,
where is the trajectory length, is the total number of training trajectories, and .
Iv-B Continual Learning
In continual learning settings, both modules of MPNet (Enet and Pnet) are trained in an end-to-end fashion, since both neural models need to adapt to the incoming data (Fig. 2
(a)). We consider a supervised learning problem. Therefore, the data comes with targets, i.e.,
where is the input to MPNet comprising of the robot’s current state , the goal state , and obstacles point cloud . The target is the next state in the expert trajectory given by an oracle planner. Generally, continual learning using neural networks suffers from the issue of catastrophic forgetting since taking a gradient step on a new datum could erase the previous learning. The problem of catastrophic forgetting is also described in terms of knowledge transfer in the backward direction. The positive backward transfer leads to better performance and the negative backward transfer leads to poor performance on previous tasks. To overcome negative backward transfer leading to catastrophic forgetting, we employ the Gradient Episodic Memory (GEM) method for lifelong learning . GEM uses the episodic memory that has a finite set of continuum data seen in the past to ensure that the model update doesn’t lead to negative backward transfer while allowing only the positive backward transfer. For MPNet, we adapt GEM for the regression problem using the following optimization objective function.
where is a squared-error loss, is the MPNet model at time step (see Fig. 2). Furthermore, note that, if the angle between proposed gradient at time , and the gradient over is positive, i.e., , there is no need to maintain the old function parameters because the above equation can be formulated as:
where denotes empirical mean.
In most cases, the proposed gradient violates the constraint , i.e., the proposed gradient update will cause increase in the loss over previous data. To avoid such violations, David and Razanto  proposed to project the gradient to the nearest gradient that keeps , i.e,
The projection of proposed gradient to can be solved efficiently using Quadratic Programming (QP) based on the duality principle, for details refer to .
Various data parsing methods are available to update the episodic memory . These sample selection strategies for episodic memory play a vital role in the performance of continual/life-long learning methods such as GEM . There exist several selection metrics such as surprise, reward, coverage maximization, and global distribution matching . In our continual learning framework, we use a global distribution matching method to select samples for the episodic memory. For details and comparison of different sample selection approaches, please refer to Section VII-B. The global distribution matching, also known as reservoir sampling, use random sampling techniques to populate the episodic memory. The aim is to approximately capture the global distribution of the training dataset since it is unknown in advance. There are several ways to implement reservoir sampling. The simplest approach accepts the new sample at -th step with probability to replace a randomly selected old sample from the memory. In other words, it rejects the new sample at -the step with probability to keep the old samples in the memory.
In addition to episodic memory , we also maintain a replay/rehearsal memory . The replay buffer lets MPNet rehearse by learning again on the old samples (Line 14-19). We found that rehearsals further mitigate the problem of catastrophic forgetting and leads to better performance, as also reported by Rolnick et al.  in reinforcement learning setting. Note that replay or rehearsal on the past data is done with the interval of replay period .
Finally, Algorithm 1 presents the continual learning framework where an expert provides a single demonstration at each time step, and MPNet model parameters are updated according to the projected gradient .
Iv-C Active Continual Learning
Active continual learning (ACL) is our novel data-efficient learning strategy which is practical in the sense that the planning problem comes in streams and our method asks for expert demonstrations only when needed. It builds on the framework of continual learning presented in the previous section. ACL introduces a two-level of sample selection strategy. First, ACL gathers the training data by actively asking for the demonstrations on problems where MPNet failed to find a path. Second, it employs a sample selection strategy that further prunes the expert demonstrations to fill episodic memory so that it approximates the global distribution from streaming data. The two-level of data selection in ACL improves the training samples efficiency while learning the generalized neural models for the MPNet.
Algorithm 2 presents the outline of ACL method. At every time step , the environment generates a planning problem ( comprising of the robot’s initial and goal configurations and the obstacles’ information (Line 7). Before asking MPNet to plan a motion for a given problem, we let it learn from the expert demonstrations for up to iterations (Line 9-10). If MPNet is not called or failed to determine a solution, an expert-planner is executed to determine a path solution for a given planning problem (Line 13). The expert trajectory is stored into a replay buffer and an episodic memory based on their sample selection strategies (Line 14-15). MPNet is trained (Line 16-19) on the given demonstration using the constraint optimization mentioned in Equations 5-7. Finally, similar to continual learning, we also perform rehearsals on the old samples (Line 20-25).
V MPNet: Online Path Planning
MPNet can generate both end-to-end collision-free paths and informed samples for the SMPs. We denote our path planner and sample generator as MPNetPath and MPNetSMP, respectively. MPNet uses two trained modules, Enet and Pnet.
Enet takes the obstacles’ point-cloud and encodes them into a latent-space embedding . Enet is either trained end-to-end with Pnet or separately with encoder-decoder structure. Pnet, is a stochastic model as during execution it uses Dropout in almost every layer. The layers with Dropout 
get their neurons dropped with a probability
. Therefore, every time MPNet calls Pnet, due to Dropout, it gets a sliced/thinner model of the original planning network which leads to a stochastic behavior. The stochasticity helps in recursive, divide-and-conquer based path planning, and also enables MPNet to generate samples for SMPs. Further benefits of making Pnet a stochastic model are discussed in Section VII-A. The input to Pnet is a concatenated vector of obstacle-space embedding, robot current configuration , and goal configuration . The output is the robot configuration for time step that would bring the robot closer to the goal configuration. We iteratively execute Pnet (Fig. 2(b)), i.e., the new state becomes the current state in the next time step and therefore, the path grows incrementally.
V-a Path Planning with MPNet
Path planning with MPNet uses Enet and Pnet in conjunction with our iterative, recursive, and bi-directional planning algorithm. Our planning strategy has the following salient features.
Forward-Backward Bi-directional Planning: Our algorithm is bidirectional as we run our neural models to plan forward, from start to goal, as well as to plan backward, from goal to start, until both paths meet each other. To connect forward and backward paths we use a greedy RRT-Connect  like heuristic.
Recursive Divide-and-Conquer Planning: Our planner is recursive and solves the given path planning problem through divide-and-conquer. MPNet begins with global planning (Fig. 3 (a)) that results in critical states that are vital to generating a feasible trajectory. If any of the consecutive critical nodes in the global path are not connectable (Beacon states), MPNet takes them as a new start and goal, and recursively plans a motion between them (Figs. 3 (b-c)). Hence, MPNet decomposes the given problem into sub-problems and recursively executes itself on those sub-problems to eventually find a path solution.
In the remainder of this section, we describe the essential components of MPNetPath algorithm followed by the overall algorithm execution and outline.
V-A1 Bidirectional Neural Planner (BNP)
In this section, we formally outline our forward-backward, bidirectional neural planning method, outlined in Algorithm 3. BNP takes the obstacles’ point cloud embedding , the robot’s initial and target configurations as an input. The bidirectional path is generated as two paths, from start to goal and from goal to start , incrementally march towards each other. The paths and are initialized with the robot’s start configuration and goal configuration , respectively (Line 1). We expand paths in an alternating fashion, i.e., if at any iteration , a path is extended, then in the next iteration, a path will be extended, and this is achieved by swapping the roles of and at the end of every iteration (Line 10). Furthermore, after each expansion step, the planner attempts to connect both paths through a straight-line, if possible. We use (described later) to perform straight-line connection which makes our connection heuristic greedy. Therefore, like RRT-Connect , our method makes the best effort to connect both paths after every path expansion. In case both paths and are connectable, BNP returns a concatenated path that comprises of states from and (Line 8-9).
The bidirectional neural planner outputs a coarse path of critical points (Fig. 3 (a)). If all consecutive nodes in are connectable, i.e., the trajectories connecting them are in obstacle-free space then there is no need to perform any further planning. However, if there are any beacon nodes in , a replanning is performed to connect them (Fig. 3 (b-c)). The replanning procedure is presented in Algorithm 4. The trajectory consists of states given by BNP. The algorithm uses function (described later) and iterates over every connective state pairs and in a given path to check if there exists a collision-free straight trajectory between them. If a collision-free trajectory does not exist between any of the given consecutive states, a new path is determined between them through replanning. To replan, the beacon nodes are presented to the replanner as a new start and goal pair together with the obstacles information. We propose two replanning methods:
(i) Neural replanning (NP): NP takes the beacon states and makes a limited number of attempts () to find a path solution between them using BNP. In case of NP, the in Algorithm 4 is set to 0.
(ii) Hybrid replanning (HP): HP combines NP and oracle planner. HP takes beacon states and tries to find a solution using NP. If NP fails after a fixed number of trials , an oracle planner is executed to find a solution, if one exists, between the given states (Line 9). HP is performed if in Algorithm 4.
V-A3 Lazy States Contraction (LSC)
This process is often known as smoothing or shortcutting . The term contraction was coined in graph theory literature . We implement LSC as a recursive function that when executed on a given path , leaves no redundant state by directly connecting, if possible, the non-consecutive states, i.e., and , where , and removing the lousy/lazy states (Fig. 3 (d)). This method improves the computational efficiency as the algorithm will have to process fewer nodes during planning.
V-A4 Steering ()
The function, as it name suggests, walks through a straight line connecting the two given states to validate if their connecting trajectory lies entirely in the collision-free space or not. To evaluate the connecting trajectory, the function discretize the straight line into small steps and verifies if each discrete node is collision-free or not. The discretization is done as between nodes and using a small step-size. In our algorithm, we use different step-sizes for the global planning, replanning, and for final feasibility checks (for details refer to Appendix).
This function uses the to check whether the given path lies entirely in collision-free space or not.
MPNetPath Execution Summary: Algorithm 5 outlines our MPNetPath framework. Enet encodes the raw point-cloud into latent embedding (Line 1). BNP takes the given planning problem and outputs a coarse path (Line 2). The LSC function takes to remove the redundant nodes and leaves only critical states crucial for finding a feasible path solution (Line 3). If a path constituted by connecting the consecutive nodes in does not belong to the collision-free region, a neural replanning (NP), followed by LSC, is performed for a fixed number of iterations (Line 8-12). The neural-replanning recursively gets deeper and finer in connecting the beacon states whereas LSC function keeps on pruning out the redundant states after every replanning step. In most case, the neural replanner is able to compute a path solution. However, for some hard cases, where neural replanner fails to find a path between beacon states in , we employ hybrid planning (HP) using an oracle planner (Line 13-14). Note that, in case of hybrid replanning, the oracle planner is executed only for a small segment of the overall problem which helps MPNetPath retains its computational benefits while being able to determine a path solution if one exists.
V-B Informed Sampling with MPNet
As mentioned earlier, our planning network (Pnet) is a stochastic model as it has Dropout in almost every layer during the execution. We exploit the stochasticity of Pnet to generate multiple informed samples that would be used as a sample generator for a classical SMP. Because samples are informed in such a way that has high probability to be part of a connectable and near-optimal path, it allows the underlying SMP to find solutions efficiently and quickly.
Algorithm 6 outlines the procedure to integrate our MPNetSMP with any SMP. MPNet’s BNP performs one-step-ahead prediction, i.e., given obstacles representation , robot current state , and goal state , it predicts a next state closer to the goal than previous state . We execute MPNetSMP to incrementally generate samples from start to goal in a loop for a fixed number of iterations before switching to uniform random sampling (Line 5-8). The underlying SMP takes the informed samples (Line 9) and builds a tree starting from an initial configuration. Due to informed sampling, the tree, in the beginning, is biased towards a subspace that potentially contains a path solution and after certain iterations, the tree begins to expand uniformly. Hence, our MPNetSMP perform both exploitation and exploration. Once a path solution is found, it is either returned as a solution or further optimized for a given cost function.
We also propose that our MPNetSMP can be adapted to generate samples for bidirectional SMPs . MPNetSMP generates informed samples incrementally between the given start and goal, and just like the bidirectional heuristic of MPNetPath, it can be queried to generate bidirectional samples. A simple modification for bidirectional sampling in Algorithm 6
would be to initialize two random variableand with start and goal states, respectively, in place of Line 2. Then use instead of and swap the roles of and at the end of every iteration. The proposed modification would allow informed bidirectional generation of trees that are crucial for solving the narrow passage planning problems.
|Simple 2D||Complex 2D||Complex 3D||Rigid-body|
Mean computation times with standard deviations are presented for MPNet (all variations), Informed-RRT* and BIT* on two test datasets, i.e., seen and unseen (shown inside brackets), in four different environments. In all cases, MPNet path planners, MPnetPath:NP (with neural replanning) and MPnetPath:HP (with hybrid replanning), and sampler MPNetSMP with underlying RRT*, trained with continual learning (C), active continual learning (AC) and offline batch learning (B), performed significantly better than classical planners such as Informed-RRT* and BIT* by an order of magnitude.
|Simple 2D||Complex 2D||Complex 3D||Rigid-body|
implementations for classical planners. MPNet models were trained with the PyTorch Python API, exported using TorchScript222https://pytorch.org/tutorials/advanced/cpp_export.html so that they could be loaded in OMPL for execution. Furthermore, for MPNet, we present the results of MPNetPath and MPNetSMP trained with offline batch learning (B), continual learning (C), and active continual learning (AC). The MPNetPath:NP and MPNetPath:HP uses neural and hybrid replanning, respectively. The hybrid replanning exploits neural replanner for a fixed number of iterations before calling an oracle-based planner for replanning, if needed. The system used for training and testing has 3.40GHz 8 Intel Core i7 processor with 32 GB RAM and GeForce GTX 1080 GPU.
Vi-a Training & Testing Environments
We consider problems requiring the planning of 2D/3D point-mass robot, rigid-body, and a 7DOF Baxter robot manipulator. In case of 2D/3D point-mass and rigid-body planning, MPNet is trained over one hundred workspaces with each containing four thousand trajectories, and it is evaluated on two test datasets, i.e., seen- and unseen-. The seen- comprises of one hundred workspaces seen by MPNet during the training but two hundred unseen start and goal pairs in each environment. The unseen- consists of ten new workspaces, not seen by MPNet during training, with each containing two thousand start and goal pairs. For the 7DOF Baxter manipulator, we created a training dataset containing ten cluttered environments with each having nine hundred trajectories, and our test dataset contained the same ten workspaces as in training but one hundred new start and goal pairs for each scenario. For all planners, we evaluate multiple trials on seen- and unseen-, and their mean performance metrics are reported.
Vi-B MPNet Comparison with its Expert Demonstrator (RRT*)
We compare MPNet against its expert demonstrator RRT*. Fig. 4 shows the paths generated by MPNetPath (red), and its expert demonstrator RRT* (blue). The average computations times of MPNetPath and RRT* are denoted as and , respectively. The average Euclidean path cost of MPNetPath and RRT* solutions are denoted as and , respectively. It can be seen that MPNet finds paths of similar lengths as its expert demonstrator RRT* while retaining consistently low computational time. Furthermore, the computation times of RRT* are not only higher than MPNetPath computation time but also grows exponentially with the dimensionality of the planning problems. Overall, we observed that MPNetPath is at least faster than RRT* and finds paths that are within a 10% range of RRT*’s paths cost.
Fig. 5 presents informed trees generated by MPNetSMP with an underlying RRT* algorithm. We report average path computation times and Euclidean costs denoted as and , respectively. The generated trees by MPNetSMP are in a subspace of given configuration space that most of the time contains path solutions. It should be noted that MPNetSMP paths are almost optimal and are observed to be better than MPNetPath and its expert demonstrator RRT* solutions. Moreover, our sampler not only finds near-optimal/optimal paths but also exhibit a consistent computation time of less than a second in all environments which is much lower than the computation time of RRT* algorithm. Thus, MPNetSMP has effectively made the underlying SMP (in this case, RRT*) a more efficient motion planner.
Vi-C MPNet Comparison with Advanced Motion Planners
Table I presents the comparison results over the four different scenarios, i.e., simple 2D (Fig. 4 (a-b)), complex 2D (Fig. 4 (c-d)), complex 3D (Fig. 4 (e-f)) and rigid-body (Fig. 4 (g-h)). Each scenario comprises of two test datasets, seen- and unseen-. We let Informed-RRT* and BIT* run until they find a solution of Euclidean cost within range of the cost of MPNetPath solution. We report mean computation times with standard deviation over five trials. In all planning problems, MPNetPath:NP (with neural replanning), MPNetPath:HP (with hybrid replanning), and MPNetSMP exhibit a mean computation time of less than a second. The state-of-the-art classical planners, Informed-RRT* and BIT*, not only exhibit higher computation times than all versions of MPNet but, just-like RRT*, their computation times increase with the planning problem dimensionality. In simplest case (Fig. 4 (a-b)), MPNetPath:NP stand out to be atleast and faster than BIT* and Informed-RRT*, respectively. On the other hand, MPNetSMP provides about and computation speed improvements compared to BIT* and Informed-RRT*, respectively, in simple 2D environments. Furthermore, it can be observed that the speed gap between classical planner and MPNet (MPNetPath and MPNetSMP) keeps increasing with planning problem dimensions.
Fig. 6 and Fig. 7 present the mean computation time of all MPNet variants, Informed-RRT* (IRRT*), and BIT* on seen- and unseen- test datasets, respectively. Note that MPNetPath trained with offline batch learning (B), continual learning (C), and active continual learning (AC) show similar computation times as can be seen by their plots hugging each other in all planning problems. Furthermore, in Figs. 6-7, it can be seen that MPNetPath and MPNetSMP computation times remain consistently less than one second in all planning problems irrespective of their dimensions. The computation times of IRRT* and BIT* are not only high but also lack consistency and exhibits high variations over different planning problems. Although the computation speed of MPNetSMP is slightly lower than that of MPNetPath, it performs better than all other methods in terms of path optimality for a given cost function.
Vi-D MPNet Data Efficiency & Performance Evaluation
shows the number of training paths consumed by all MPNet versions and Table II presents their mean success rates on the test datasets in four scenarios- simple 2D, complex 2D, complex 3D, and rigid-body. The success rate represents the percentage of planning problems solved by a planner in a given test dataset. The models trained with offline batch learning provides better performance in term of a success rate compared to continual/active-continual learning methods. However, in our experiments, the continual/active-continual learning frameworks required about ten training epochs compared to one hundred training epochs of the offline batch learning method. Furthermore, we observed that continual/active-continual learning and offline batch learning show similar success rates if allowed to run for the same number of training epochs. The active continual learning is ultimately preferred as it requires less training data than traditional continual and offline batch learning while exhibiting considerable performance.
Vi-E MPNet on 7DOF Baxter
Since BIT* performs better than Informed-RRT* in high dimension planning problems, we consider only BIT* as our classical planner in our further comparative analysis. We evaluate MPNet against BIT* for the motion planning of 7DOF Baxter robot in ten different environments where each environment comprised of one hundred new planning problems. Fig. 9 presents four out of ten environment settings. Each planning problem scenario contained an L-shape table of half the Baxter height with randomly placed five different real-world objects such as a book, bottle, or mug as shown in the figure. The planner’s objective is to find a path from a given robot configuration to target configuration while avoiding any collisions with the environment, or self-collisions. Table III summarizes the results over the entire test dataset. MPNet on average find paths in less than a second with about 80% success rate. BIT* also finds the initial path in less than a second with similar success rate as MPNet. However, the path costs of BIT* initial solution are significantly higher than the path cost of MPNet solutions. Furthermore, we let BIT* run to improve its initial path solutions so that the cost of the paths are not greater than 140% of MPNet path costs. The results show that BIT* mostly fails to find solutions as optimal as MPNet solutions but demonstrate about 56% success rate in finding solutions that are usually 40% larger in lengths/costs than MPNet path costs. Furthermore, the computation time of BIT* also increases significantly, requiring an average of about 9 seconds to plan such paths. Fig. 11 shows the mean computation time and path cost comparison of BIT* and MPNet over ten individual environments. It can be seen that MPNet computes paths in less than a second while giving incredibly optimized solutions and is much faster compared to BIT*.
Fig. 10 shows a multi-goal planning problem for 7 DOF Baxter robot. The task is to reach and pick up the target object while avoiding any collision with the environment and transfer it to another target location, shown as yellow block. Note that in previous experiments we let BIT* run until it finds paths within of MPNet path costs. Now, we let BIT* run until it finds a path within cost of the cost of MPNet path solution. Our result shows that MPNetPath:NP plans the entire trajectory in less than a second whereas BIT* took about 3.01 minutes to find a path of similar cost as our solution. Note that MPNet is never trained on trajectories that connect two configurations on the table. However, thanks to MPNet ability to generalize that it can solve completely unseen problems in no time compared to classical planners that have to perform an exhaustive search before proposing any path solution.
|Time||Path cost||Success rate|
|BIT* (Initial Path)|
|BIT* ( MPNet cost)|
This section presents a discussion on various attributes of MPNet, sample selection methods for continual learning, and a brief analysis of its completeness, optimality, and computational guarantees.
Vii-a Stochastic Neural Planner
In MPNet, we apply Dropout  to almost every layer of the planning network, and it remains active during path planning. The Dropout randomly selects the neurons from its preceding layer with a probability and masks them so that they do not affect the model decisions. Therefore, at any planning step, Pnet would have a probabilistically different model structure that results in stochastic behavior.
Yarin and Zubin  demonstrate the application of Dropout to model uncertainty in the neural networks. We show that the Dropout can also be used in learning-based motion planners to approximate the subspace of a given configuration space that potentially contains several path solutions to a given planning problem. For instance, it is evident from the trees generated by MPNetSMP (Fig. 5) that our model approximates the subspace containing path solutions including the optimal path.
The perturbations in Pnet’s output, thanks to Dropout, also play a crucial role during our neural replanning (Algorithm 4). In neural replanning, MPNet takes its own global path and uses its stochastic neural planner (Algorithm 5) to replan a motion between any beacon states in that global path. MPNet is able to recover from any failures or anomalies that hindered finding a feasible solution in the first place.
Vii-B Sample Selection Strategies
In continual learning, we maintain an episodic memory of a subset of past training data to mitigate catastrophic forgetting when training MPNet on new demonstrations. Therefore, it is crucial to populate the episodic memory with examples that are important for learning a generalizable model from streaming data. We compare four MPNet models trained on a simple 2D environment with four different sample selection strategies for the episodic memory. The four selection metrics include surprise, reward, coverage maximization, and global distribution matching. The surprise and reward metrics give higher and lower priority to examples with large prediction errors, respectively. The coverage maximization maintains a memory of -nearest neighbors. The global distribution matching, as described earlier, uses random selection techniques to approximate the global distribution from streaming data. We evaluate four MPNet models on two test datasets, seen- and unseen-, from simple 2D environment, and their mean success rates are reported in Fig. 12. It can be seen that global distribution matching exhibits slightly better performance than reward and coverage maximization metrics. The surprise metric gives poor performance because satisfying the constraint in Equations 7 becomes nearly impossible when the episodic memory is populated with examples that have large prediction losses. Since global distribution matching provides the best performance overall, we have used it as our sample selection strategy for the episodic memory in the continual learning settings.
In this section, we discuss the completeness guarantees (Proposition 1) of MPNet (MPNetPath and MPNetSMP) which imply that for a given planning problem, MPNet will eventually find a feasible path, if one exists. We show that the worst-case completeness guarantees of our approach rely on the underlying SMP for both path planning (MPNetPath) and informed neural sampling (MPNetSMP). In our experiments, we use RRT* as our oracle SMP. RRT* exhibits probabilistic completeness.
The probabilistic completeness is described in Definition 1 based on the following notations. Let be a connected tree in obstacle-free space, comprising of vertices, generated by an algorithm . We also assume always originates from the robot’s initial configuration . An algorithm is probabilistically complete if it satisfies the following definition.
Definition 1 (Probabilistic Completeness) Given a planning problem for which there exists a solution, an algorithm AL with a tree that originates at is considered probabilistically complete iff .
RRT* algorithm exhibits probabilistic completeness as it builds a connected tree originating from initial robot configuration and randomly exploring the entire space until it finds a goal configuration. Therefore, as the number of samples in the RRT* approach to infinity the probability of finding a path solution, if one exists, gets to one, i.e.,
In remainder of the section, we present a brief analysis showing that MPNetPath and MPNetSMP also exhibit probabilistic completeness like their underlying RRT*.
Vii-C1 Probabilistic completeness of MPNetPath
To justify MPNetPath worst-case guarantees, we introduce the following assumptions that are crucial to validate a planning problem.
Assumption 1 (Feasibility of States to Connect) The given start and goal pairs , for which an oracle-planner is called to find a path, lie entirely in obstacle-free space, i.e., and .
Assumption 2 (Existence of a Path Solution) Under Assumption 1, for a given problem , there exists at least one feasible path that connects and , and .
Based on Definition 1 and Assumptions 1-2, we propose in Theorem 1, with a sketch of proof, that MPNetPath also exhibits probabilistic completeness, just like its underlying oracle planner RRT*.
Theorem 1 (Probabilistic Completeness of MPNetPath) If Assumptions 1-2 hold, for a given planning problem , and an oracle RRT* planner, MPNetPath will find a path solution, if one exists, with a probability of one as the underlying RRT* is allowed to run till infinity if needed.
Sketch of Proof: Assumption 1 puts a condition that the start and goal states in the given planning problem lie in the obstacle-free space. Assumption 2 requires that there exist at least one collision-free trajectory for the given planning problem. During planning, MPNetPath first finds a coarse solution that might have beacon (non-connectable consecutive) states (see Fig. 3). Our algorithm connects those beacon states through replanning. Assumption 1 holds for the beacon states as each generated state of MPNetPath is evaluated by an oracle collision-checker before making it a part of . In replanning, we perform neural replanning for a fixed number of trials to further refine , and if that fails to conclude a solution, we connect any remaining beacon states of the refined by RRT*. Hence, if Assumption 1-2 hold for beacon states, MPNetPath inherits the convergence guarantees of the underlying planner which in our case is RRT*. Therefore, MPNetPath with an underlying RRT* oracle planner exhibits probabilistic completeness.
Vii-C2 Probabilistic completeness of MPNetSMP
MPNetSMP generates samples for SMPs such as RRT*. Our method performs exploitation and exploration to build an informed tree. It begins with exploitation by sampling a subspace that potentially contains a solution for a limited time and switches to exploration via uniform sampling to cover the entire configuration space. Therefore, like RRT*, the tree of MPNetSMP is fully connected, originates from the initial robot configuration, and eventually expands to explore the entire space. Hence, the probability that MPNetSMP tree will find a goal configuration, if possible, approaches one as the samples in the tree reaches a large number, i.e.,
MPNet learns through imitating the expert demonstrations which could come from either an oracle motion planner or a human demonstrator. In our experiments, we use RRT* for generating expert trajectories, hybrid replanning in MPNetPath, and as a baseline SMP for our MPNetSMP framework. We show that MPNetPath imitates the optimality of its expert demonstrator. In Fig. 5, it can be seen that MPNet path solutions are of similar Euclidean costs as its expert demonstrator RRT* path solutions. Therefore, the quality of the MPNet paths relies on the quality of its training data.
In the case of MPNetSMP, there exists optimality guarantees depending on the underlying SMP. Since we use RRT* as a baseline SMP for MPNetSMP, the proposed method exhibits asymptotic optimality. Asymptotically optimal SMP, like RRT*, eventually converge to a minimum cost path solution, if one exists, as the number of samples in the tree approaches to infinity (for proof, refer to ). RRT* gets weak-optimality guarantees from its rewiring heuristic. The rewiring incrementally updates the tree connections such that over the time, each node in the tree would be connected to a branch that ensures lowest cost path to the root state (initial robot configuration) of the tree. Therefore, due to rewiring, RRT* asymptotically guarantees the shortest path between the given start and goal states.
In our sampling heuristic, MPNetSMP generates informed samples for a fixed number of iterations and performs uniform random sampling afterward. Therefore, RRT* tree formed by the samples of MPNetSMP will explore the entire C-space and will be rewired incrementally to ensure asymptotic optimality. Since MPNetSMP only performs intelligent sampling and does not modify the internal machinery of RRT*, the optimality proofs will exactly be the same as provided in .
Vii-E Computational Complexity
In this section, we present the computational complexity of our method. Both MPnetPath and MPNetSMP take the output of Enet and Pnet, which is only a forward-pass through a neural network. The forward pass through a neural network is known to have a constant complexity since it is a matrix multiplication of a fixed size input vector with network weights to generate a fixed size output vector. Since MPNetSMP produces samples by merely forward passing a planning problem through Enet and Pnet, it does not add any complexity to the underlying SMP. Therefore, the computational complexity of MPNetSMP with underlying RRT* is the same as RRT* complexity, i.e., , where is the number of samples .
On the other hand, MPNetPath has a crucial replanning component that uses an oracle planner in the cases where neural replanning fails to determine an end-to-end collision-free path. The oracle planner in the presented experiments is RRT*. Therefore, in the worst-case, MPNetPath operates with complexity of . However, we show empirically that MPNetPath succeeds, without any oracle planner, for up to 90% of the cases in a challenging test dataset. For the remaining 10% cases, oracle planner is executed only for a small segment of the overall path (see Algorithm 5). Hence, even for hard planning problems where MPNetPath fails, our method reduces a complicated problem into a simpler problem, thanks to its divide-and-conquer approach, for the underlying oracle planner. Thus, the oracle planner operates at a complexity which is practically much lower than its worst-case theoretical complexity, as is also evident from the results of MPNetPath with hybrid replanning (HB) in Table I.
Viii Conclusions and Future work
We present a learning-based approach to motion planning using deep neural networks. For a given planning problem, our method is capable of i) finding collision-free near-optimal paths; ii) generating samples for sampling-based motion planners in a subspace of a given configuration space that most likely contains solutions including the optimal path. We also present the active continual learning strategy to train our models with a significant improvement in training data-efficiency compared to naive training approaches. Our experimentation shows that our neural motion planner consistently finds collision-free paths in less than a second for the problems where other planners may take up to several minutes.
In our future works, one of our primary objectives is to tackle the environment encoding problem for motion planning. Environment encoding is one of the critical challenges in real-world robotics problems. Current perception approaches consider encoding of individual objects rather than the entire scene that retains the inter-object relational geometry which is crucial for motion planning. Henceforth, we aim to address the problem of motion planning oriented environments’ encoding from raw point-cloud data. Another future objective is to learn all fundamental modules required for motion planning that not only include the motion planner but also the collision checker and cost function. Recently, L2RRT  is proposed that learns the motion planner and collision checker in latent space. However, providing worst-case theoretical guarantees is notoriously hard for planning in latent spaces. Therefore, a modular approach is necessary that learns decision-making directly in robot configuration space so that it can be combined with any classical motion planner to provide theoretical guarantees. Furthermore, learning a cost function, using inverse reinforcement learning , is also crucial for kinodynamic and constraint motion planning problems where defining a cost function satisfying all constraints is challenging.
-  S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” The international journal of robotics research, vol. 30, no. 7, pp. 846–894, 2011.
-  J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Informed rrt*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic,” in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on. IEEE, 2014, pp. 2997–3004.
-  J. D. Gammell, S. Srinivasa, and T. D. Barfoot, “Batch informed trees (bit*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 3067–3074.
-  S. M. LaValle, Planning algorithms. Cambridge university press, 2006.
-  J.-C. Latombe, Robot motion planning. Springer Science & Business Media, 2012, vol. 124.
-  O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots,” in Autonomous robot vehicles. Springer, 1986, pp. 396–404.
-  N. Haghtalab, S. Mackenzie, A. D. Procaccia, O. Salzman, and S. S. Srinivasa, “The provable virtue of laziness in motion planning,” in Twenty-Eighth International Conference on Automated Planning and Scheduling, 2018.
-  A. H. Qureshi, A. Simeonov, M. J. Bency, and M. C. Yip, “Motion planning networks,” arXiv preprint arXiv:1806.05767, 2019.
-  A. H. Qureshi and M. C. Yip, “Deeply informed neural sampling for robot motion planning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 6582–6588.
-  J. T. Schwartz and M. Sharir, “On the “piano movers” problem. ii. general techniques for computing topological properties of real algebraic manifolds,” Advances in applied Mathematics, vol. 4, no. 3, pp. 298–351, 1983.
-  T. Lozano-Pérez and M. A. Wesley, “An algorithm for planning collision-free paths among polyhedral obstacles,” Communications of the ACM, vol. 22, no. 10, pp. 560–570, 1979.
-  J. Canny, The complexity of robot motion planning. MIT press, 1988.
-  R. A. Brooks and T. Lozano-Perez, “A subdivision algorithm in configuration space for findpath with rotation,” IEEE Transactions on Systems, Man, and Cybernetics, no. 2, pp. 224–233, 1985.
-  S. M. LaValle, “Rapidly-exploring random trees: A new tool for path planning,” 1998.
-  L. E. Kavraki and J.-C. Latombe, “Probabilistic roadmaps for robot path planning,” 1998.
-  K. Hauser, “Lazy collision checking in asymptotically-optimal motion planning,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 2951–2957.
-  A. H. Qureshi and Y. Ayaz, “Potential functions based sampling heuristic for optimal path planning,” Autonomous Robots, vol. 40, no. 6, pp. 1079–1093, 2016.
-  A. H. Qureshi, S. Mumtaz, K. F. Iqbal, B. Ali, Y. Ayaz, F. Ahmed, M. S. Muhammad, O. Hasan, W. Y. Kim, and M. Ra, “Adaptive potential guided directional-rrt,” in 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2013, pp. 1887–1892.
-  A. H. Qureshi and Y. Ayaz, “Intelligent bidirectional rapidly-exploring random trees for optimal motion planning in complex cluttered environments,” Robotics and Autonomous Systems, vol. 68, pp. 1–11, 2015.
-  Z. Tahir, A. H. Qureshi, Y. Ayaz, and R. Nawaz, “Potentially guided bidirectionalized rrt* for fast optimal path planning in cluttered environments,” Robotics and Autonomous Systems, vol. 108, pp. 13–27, 2018.
-  B. Hou and S. Srinivasa, “Deep conditional generative models for heuristic search on graphs with expensive-to-evaluate edges.”
-  R. S. Sutton, A. G. Barto et al., Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends® in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
-  J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,”The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
-  A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine, “Collective robot reinforcement learning with distributed asynchronous guided policy search,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 79–86.
-  Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, “Path integral guided policy search,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3381–3388.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396.
-  I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, “Data-efficient deep reinforcement learning for dexterous manipulation,” arXiv preprint arXiv:1704.03073, 2017.
-  A. H. Qureshi, J. J. Johnson, Y. Qin, B. Boots, and M. C. Yip, “Composing ensembles of policies with deep reinforcement learning,” arXiv preprint arXiv:1905.10681, 2019.
-  M. Zucker, J. Kuffner, and J. A. Bagnell, “Adaptive workspace biasing for sampling-based planners,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on. IEEE, 2008, pp. 3757–3762.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
-  D. Berenson, P. Abbeel, and K. Goldberg, “A robot path planning framework that learns from experience,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 3671–3678.
-  D. Coleman, I. A. Şucan, M. Moll, K. Okada, and N. Correll, “Experience-based planning with sparse roadmap spanners,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 900–905.
-  G. Ye and R. Alterovitz, “guided motion planning,” in Robotics research. Springer, 2017, pp. 291–307.
-  M. J. Bency, A. H. Qureshi, and M. C. Yip, “Neural path planning: Fixed time, near-optimal path generation via oracle imitation,” arXiv preprint arXiv:1904.11102, 2019.
-  B. Ichter, J. Harrison, and M. Pavone, “Learning sampling distributions for robot motion planning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7087–7094.
-  L. Janson and M. Pavone, “Fast marching trees: A fast marching sampling-based method for optimal motion planning in many dimensions,” in Robotics Research. Springer, 2016, pp. 667–684.
-  C. Zhang, J. Huh, and D. D. Lee, “Learning implicit sampling distributions for motion planning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3654–3661.
-  M. Bhardwaj, S. Choudhury, and S. Scherer, “Learning heuristic search via imitation,” arXiv preprint arXiv:1707.03034, 2017.
-  A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value iteration networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2154–2162.
-  A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn, “Universal planning networks,” arXiv preprint arXiv:1804.00645, 2018.
-  B. Ichter and M. Pavone, “Robot motion planning in learned latent spaces,” IEEE Robotics and Automation Letters, 2019.
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
-  I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning Representations, vol. 3, 2017.
S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” inProceedings of the 28th International Conference on International Conference on Machine Learning. Omnipress, 2011, pp. 833–840.
-  D. Lopez-Paz et al., “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6467–6476.
-  D. Isele and A. Cosgun, “Selective experience replay for lifelong learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne, “Experience replay for continual learning,” arXiv preprint arXiv:1811.11682, 2018.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  J. J. Kuffner Jr and S. M. LaValle, “Rrt-connect: An efficient approach to single-query path planning,” in ICRA, vol. 2, 2000.
-  K. Hauser and V. Ng-Thow-Hing, “Fast smoothing of manipulator trajectories using optimal bounded-acceleration shortcuts,” in 2010 IEEE international conference on robotics and automation. IEEE, 2010, pp. 2493–2498.
-  S. Skiena, “Implementing discrete mathematics: combinatorics and graph theory with mathematica1991.”
-  I. A. Şucan, M. Moll, and L. E. Kavraki, “The Open Motion Planning Library,” IEEE Robotics & Automation Magazine, vol. 19, no. 4, pp. 72–82, December 2012, http://ompl.kavrakilab.org.
-  Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059.
-  A. H. Qureshi, B. Boots, and M. C. Yip, “Adversarial imitation via variational inverse reinforcement learning,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=HJlmHoR5tQ
-  L. Trottier, P. Giguère, and B. Chaib-draa, “Parametric exponential linear unit for deep convolutional neural networks,” arXiv preprint arXiv:1605.09332, 2016.
J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,”Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
-  R. B. Rusu and S. Cousins, “Point cloud library (pcl),” in 2011 IEEE international conference on robotics and automation, 2011, pp. 1–4.